In-Depth
Q&A: Tips and Traps of Large Data Warehouse Design
Think of a data warehouse as a program, not a project
Nicholas Galemmo is an information architect with Nestlé USA, and famously prolific. He is the co-author (along with Jonathan Geiger and Claudia Imhoff) of "Mastering Data Warehouse Design: Relational and Dimensional Techniques," which Wiley will publish in August. Mr. Galemmo has also led educational tracks at industry events, such as the Enterprise Data Forum, and is a contributor to at least one data warehousing discussion list.
In a series of recent e-mail messages, Mr. Galemmo offered design considerations: tips, tricks, traps, and strategies for large data warehouses.
Q. What defines a large data warehouse these days, and how rapidly are the sizes of enterprise data warehouses increasing?
A. Hard to say. I guess 500 GB+ would be “large.” But I find raw database size to be a very misleading statistic. For example, we have an Oracle-based [enterprise data warehouse] as well as an implementation of SAP [Business Warehouse (BW)]. BW consumes considerably more disk yet contains far less information.
Warehouse architecture, use of multidimensional databases (such as [Hyperion's] Essbase), and other choices can lead to vastly different storage requirements for the same information. But, to answer your question, I would guess a 40 to 50 percent increase per year would not be out of line for a fairly new [data warehouse] implementation. This growth would flatten over time, probably 4-5 years into the life of the warehouse.
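To put that growth rate in perspective, here is a quick back-of-the-envelope sketch; the 500 GB starting size is hypothetical, and 45 percent is simply the midpoint of the range Mr. Galemmo cites:

```python
# Back-of-the-envelope projection, assuming a hypothetical 500 GB
# warehouse growing 45 percent per year (midpoint of the range above).
size_gb = 500.0
for year in range(1, 6):
    size_gb *= 1.45
    print(f"Year {year}: ~{size_gb:,.0f} GB")
# Prints roughly 725, 1,051, 1,524, 2,210, and 3,205 GB: better than
# a sixfold increase over five years before growth begins to flatten.
```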
Q. Are there architectural or design considerations that large data warehouses have in common? If so, what are they? Also, how have data warehouse design philosophies changed over the years?
A. I don't think design philosophies have changed much in the last [seven to eight] years. Generally speaking, [Ralph] Kimball's and [Bill] Inmon's approaches are the most commonly used today. There are a few other methodologies, such as Dan Linstedt's Data Vault, but they have yet to receive wide acceptance.
In any case, as [a] database gets larger, the challenge is at the physical database level, which is the same problem whether using a dimensional or ER approach to your design. The primary tool here is table partitioning, usually based on the age of the data (such as transaction date), so that backups are more manageable and performance is enhanced by allowing parallelism in some queries. There are also other deployment options, such as distributing across multiple servers to reduce the size of any one database.
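As an illustration of age-based table partitioning, here is a minimal sketch using Oracle-style range-partitioning DDL issued through a generic DB-API connection; the table, columns, and partition boundaries are all hypothetical:

```python
# Minimal sketch: age-based range partitioning (Oracle-style DDL).
# Table name, columns, and partition boundaries are hypothetical.
PARTITIONED_FACT_DDL = """
CREATE TABLE fact_sales (
    txn_date    DATE         NOT NULL,
    product_key NUMBER       NOT NULL,
    store_key   NUMBER       NOT NULL,
    amount      NUMBER(12,2)
)
PARTITION BY RANGE (txn_date) (
    PARTITION p2001 VALUES LESS THAN (DATE '2002-01-01'),
    PARTITION p2002 VALUES LESS THAN (DATE '2003-01-01'),
    PARTITION p2003 VALUES LESS THAN (DATE '2004-01-01')
)
"""

def create_partitioned_fact(conn):
    # Each year lives in its own partition: backups can target only
    # the partitions that changed, and queries that filter on txn_date
    # touch only the relevant partitions, which the database can scan
    # in parallel.
    conn.cursor().execute(PARTITIONED_FACT_DDL)
```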
Q. Is it possible to design a data warehouse architecture that can accommodate very rapid growth, such that efficiency and performance are maintained even as a data warehouse doubles or triples in size?
A. I have found a properly implemented star schema provides the most consistent performance regardless of size, provided you use a database system that supports star schemas well. This means the database must support partitioning, bitmap indexes, and star joins. By the latter I mean the database should resolve dimensional selections first, then apply the dimensional keys en masse against the bitmap indexes on the fact table. This is very fast, and execution times are roughly proportional to the size of the result set.
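To make the star-join behavior concrete, here is a sketch of the kind of query it serves; the schema is hypothetical, and whether the optimizer actually chooses a star-join plan depends on the database and its configuration:

```python
# Sketch of a typical star-schema query (hypothetical schema).
# With a star-join plan, the database resolves the dimension filters
# first (fiscal_year, brand), collects the qualifying surrogate keys,
# and applies them en masse to the bitmap indexes on fact_sales, so
# execution cost tracks the result set rather than the fact table size.
STAR_QUERY = """
SELECT d.fiscal_month,
       p.category,
       SUM(f.amount) AS total_sales
FROM   fact_sales  f
JOIN   dim_date    d ON f.date_key    = d.date_key
JOIN   dim_product p ON f.product_key = p.product_key
WHERE  d.fiscal_year = 2002
AND    p.brand       = 'ACME'
GROUP BY d.fiscal_month, p.category
"""
```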
Q. What about existing data warehouses? How difficult is it to go in and re-design many existing data warehouses to increase performance or efficiency? The short answer, I'm guessing, is that this must be undertaken on a case-by-case basis.
A. Of course, your short answer is the most accurate. But even if it means a complete makeover into a new database and architecture, the biggest issues are more administrative than technical. It depends more on the existing query/OLAP tools and how much the user interface will change with the new design. If you have a tool that shields the end user from the underlying data structures (such as Business Objects), you can conceivably change databases completely and no one would know the difference (other than better service levels). In those cases, rearchitecting the data warehouse is far less of an undertaking.
Q. What tips do you have for data warehouse designers charged with creating large data warehouses? What are some issues that they should pay a lot of attention to?
A. At the physical level, table partitioning, RAID, a lot of memory, and a lot of disk channel bandwidth are most important. At the logical design level, designing tables for efficient loading is critical. For fact tables, I design them so the only action we take is to insert new rows. No updates-in-place (they take too long). We generate deltas in a staging area and bulk load them into the fact table.
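Here is a minimal sketch of that insert-only pattern, assuming a hypothetical staging table and a DB-API connection; corrections arrive as delta rows rather than as updates to existing fact rows:

```python
# Insert-only fact loading (sketch; table names are hypothetical).
# A correction is staged as a new delta row (for example, a row whose
# amount is the difference between the new and old values), so the
# fact table only ever receives inserts, never updates-in-place.
def load_deltas(conn, delta_rows):
    cur = conn.cursor()
    # delta_rows: iterable of (txn_date, product_key, store_key, amount)
    cur.executemany(
        "INSERT INTO stage_sales_delta "
        "(txn_date, product_key, store_key, amount) "
        "VALUES (?, ?, ?, ?)",
        delta_rows,
    )
    # One bulk pass from staging into the fact table, then clear staging.
    cur.execute("INSERT INTO fact_sales SELECT * FROM stage_sales_delta")
    cur.execute("DELETE FROM stage_sales_delta")
    conn.commit()
```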
If you need to provide 24/7 availability, use a database that allows you to swap partitions. The basic technique is to create a duplicate of the partition you need to load, bulk load into the duplicate, do any necessary index rebuilding, and then swap the updated duplicate with the current partition. The swap does not interfere with current queries, and any new queries use the updated partition.
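In Oracle, for example, the swap is the metadata-only EXCHANGE PARTITION operation; here is a sketch of the sequence, with hypothetical table and partition names:

```python
# Partition-swap load for 24/7 availability (Oracle-flavored sketch;
# table and partition names are hypothetical).
SWAP_STEPS = [
    # 1. Create an empty duplicate with the structure of the target
    #    partition (WHERE 1 = 0 copies structure, not rows).
    "CREATE TABLE sales_swap AS "
    "SELECT * FROM fact_sales PARTITION (p2003) WHERE 1 = 0",
    # 2. Bulk load the duplicate (direct-path insert).
    "INSERT /*+ APPEND */ INTO sales_swap SELECT * FROM stage_sales",
    # 3. Rebuild any indexes on sales_swap here, then...
    # 4. ...swap it with the live partition. EXCHANGE PARTITION is a
    #    metadata operation: running queries keep the old data, and
    #    new queries see the freshly loaded partition.
    "ALTER TABLE fact_sales EXCHANGE PARTITION p2003 WITH TABLE sales_swap",
]

def swap_load(conn):
    cur = conn.cursor()
    for stmt in SWAP_STEPS:
        cur.execute(stmt)
```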
Q. What about traps? What unforeseen obstacles are they likely to encounter as they move from data warehouse design to implementation? What traps open up as data warehouses grow in size?
A. The biggest trap is not thinking ahead: not understanding that a data warehouse is a program, not a project. The real work doesn't begin until after the first phase has been implemented. It is not until the user community begins using the data warehouse that they start to appreciate what it can do for them. It is at that point that the flood of requests for additional information starts coming in. More importantly, the nature of those requests becomes more sophisticated and focused, and often involves information that is much more difficult to obtain.
The other issue, since we are talking about disk: at the beginning of the project, make your best estimate of the processing power and disk space you are going to need. Then triple it.
Q. What are some of the emerging trends in large data warehouse design? To what extent should designers be wary of emerging trends when architecting their data warehouses?
A. I hate the term "emerging trends"; it makes me think of lemmings. There are a lot of different ways you can field a data warehouse. It can be centralized, distributed, federated, and so on. The fact that company X was successful doing it one way doesn't mean you will be successful doing it the same way. It is a very personal decision based on the way the company functions.
I also don't buy into a lot of marketing noise, such as "If your data warehouse is over 1 TB, you must use database X." The choice just isn't that simple. Not all RDBMSes are created equal, and the proper choice has more to do with the particular problems you are trying to address and the overall deployment architecture.
Q. A question about a specific trend: I've heard that organizations are increasingly using flat file database hubs (as opposed to RDBMSes) in some data warehousing architectures. What are the benefits, if any, associated with this approach, and what are the potential pitfalls?
A. There is no reason why flat files cannot be used to stage data for distribution. After all, a relational set is, in concept, nothing more than a flat file. On the plus side, flat files can be very fast. The downside is that, administratively, it is more work: it requires writing more code, you do not have the transaction safeguards an RDBMS provides, you lack metadata, changing file formats is a bother, and you need a good file management strategy. These are basically the same administrative problems IT has had since the beginning of time. A good ETL tool will mitigate some of these issues.
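As a small illustration of that trade-off, here is a sketch of flat-file staging in Python (the file layout and field names are hypothetical); the write itself is fast and simple, but the format, field order, and validation are all conventions the code must enforce on its own:

```python
import csv

# Flat-file staging (sketch; layout and field names are hypothetical).
# There is no schema, no transaction log, and no metadata beyond what
# we choose to write into the file ourselves.
FIELDS = ["txn_date", "product_key", "store_key", "amount"]

def write_stage_file(path, rows):
    # Pipe-delimited with a header row, so downstream loaders can at
    # least verify the field order before a bulk load.
    with open(path, "w", newline="") as f:
        writer = csv.writer(f, delimiter="|")
        writer.writerow(FIELDS)
        writer.writerows(rows)
```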
Another downside is that if the flat files are the only source of detailed data, you have no workable way to provide drill-through capability from an aggregate data mart. While some database systems do have a means to read flat files, a sequential scan through a huge file doesn't exactly make you a hero in the end user's eyes.
Nicholas Galemmo is an information architect at Nestlé USA and has 27 years' experience as a practitioner and consultant involved in all aspects of application systems design and development. He has been actively involved in large-scale data warehousing and systems integration projects for the past 11 years. Mr. Galemmo's first book is "Mastering Data Warehouse Design: Relational and Dimensional Techniques," with co-authors Claudia Imhoff and Jonathan G. Geiger (Wiley Publishing). Mr. Galemmo can be reached at [email protected].