The Schema Wars

There's a debate going on in the decision support world on the topic of the most appropriate data model to use for a relational decision support database. Two camps are promoting their respective points of view: the third normal form (3NF) camp and the star schema camp.

First, a little bit of background. Relational databases, which are the industry norm for storing data from transaction processing applications, are based on a mathematical model that describes tables, columns, and their relationships. Normalization is the process of defining the relationships between tables and columns. There are several levels of normalization, but it is generally accepted that third normal form is the structure that is most appropriate for transaction-processing databases.

In a 3NF model, every nonkey column is dependent on one, and only one, key column; it is dependent on the entire key, not a subset of a key; and there are no repeating groups of columns within a table. Nonkey columns within a table can be, and often are, keys in other tables. These are called foreign keys. An order table will have order-related information with "order_number" as the primary key. A "customer_id" column is a foreign key that points to a primary key in the customer table.

However, as they say, the devil is in the details. It turns out that when most modelers implement their logical normalized model as a physical model, the result is to create hundreds, or even thousands, of tables. It quickly becomes apparent to many organizations that end users can be completely baffled when they are presented with unfettered access to a fully normalized database. They have no idea where to start or how to search, browse and navigate the database to locate the data that is meaningful to them.

In addition, a typical decision support query can stress the system in a variety of ways. Decision support queries often need to look at data from multiple tables, which forces a multitable join. Users want to aggregate data, which requires the database to perform sums, counts and other calculations. Often the database engine must scan very large amounts of data. Finally, the data needs to be sorted before it's presented to the user. These activities -- joins, aggregation, scanning, and sorting -- are all very resource intensive. Some databases perform them better than others.

The star schema, which provides a multidimensional view of the data based on familiar relational technology, was designed to alleviate these complexities. A star schema has a central table -- the fact table -- and dimensional tables. Each dimension table has a single-part primary key. The fact table has a multipart primary key, which is a concatenation of the dimension table keys. The fact table contains atomic, detail-level facts, usually numerical measures, while the dimension tables contain descriptive data.

Since the star schema provides a hierarchical, multidimensional view of the data, it is easier for end users to navigate. In addition, the dimensional model takes care of aggregates and reduces the number of joins. But like OLAP engines, the star schema requires the developer to explicitly define and build the dimensions, which means that he or she has to have a good idea about what questions the users will want to ask. One of the strengths of the 3NF model is that any query can be launched against the database, as long as the data is there. Therefore, the star schema is good for routine categories of questions, while 3NF makes more sense for ad hoc, unanticipated questions.

Given this disparity of presentation, which approach should a database designer adopt? I think the thing to do is support both. Your power users and more sophisticated business analysts need the ability to ask any question of the database. These people, who are usually highly trained and experienced, are not likely to be hindered by the number of tables they need to understand. On the other hand, you need to provide an easy-to-navigate interface that enables less-sophisticated users to easily browse the database. For these users you can, and should, create star schema views. This approach lets you create different views for different categories of end users, based on their analytical needs, without limiting overall database flexibility for power users. --Robert Craig is director of the Data Warehousing and Business Intelligence Division at Hurwitz Group Inc. (Framingham, Mass.). Contact him at rcraig@hurwitz.com or via the Web at www.hurwitz.com.