From Star to Snowflake to ERD: Comparing Data Warehouse Design Approaches
As data warehousing gains acceptance across a variety of industries, a number of method or "best practice" questions have emerged. One recurring question is "What is the best approach for modeling data in a Decision Support System (DSS) environment?" This article explores the considerations and issues associated with three alternative database schema representations: a traditional relational model and two dimensional models, star and snowflake.
Our first step in exploring the schema types was to look for guidance from industry experts. We quickly discovered a spectrum of opinions. Ralph Kimball advocates the star schema as the best approach, asserting that if any dimensions are "snowflaked," performance will suffer. In his more recent book, Kimball remains an advocate of the star schema in general, but does identify conditions under which snowflaking is permissible.
Sam Anahory and Dennis Murray advocate a hybrid approach based on practical considerations, including the access frequency and the RDBMS optimization capabilities.
MicroStrategy developed its Online Analytical Processing (OLAP) product suite to work optimally against the snowflake design, and asserts, in contrast to Kimball, that the star schema will result in poorer performance.
As the MicroStrategy DSS Architect User Guide puts it: "[The] not-as-fast schema is the Atomic Star … one look-up table per dimensions with atomic base tables only."
Rob Armstrong of NCR Corporation advocates the use of the relational model in data warehouse development. He makes a case for using the relational model as the foundation for the data warehouse and limiting the use of dimensional models to data marts in which response times of known queries are of utmost importance.
The traditional relational model, also referred to as the entity relationship (ER) model, has been used successfully for many years to design databases supporting online transaction processing (OLTP) systems. These systems typically involve very large numbers of transactions, each of which accesses a small number of database rows, usually via unique or largely unique indices. The relational model, with judicious denormalization, can result in a physical database design that meets this type of access requirement very well. Decision support systems, however, involve radically different access requirements: fewer transactions (queries) that usually access large percentages of the database tables, often joining multiple tables and sorting and aggregating the results as well.
Early users of the relational model with very large decision support databases had performance problems. They found that DBMS optimizers often had problems generating efficient access paths when the number of tables joined exceeded a certain threshold, and that queries joining more tables than this threshold number tended to perform poorly. This problem led to development of a second type of model that "flattened out" the relational model structures by creating consolidated reference tables from several pre-joined existing tables. Other optimizations were also employed, such as using artificial keys for relating the large tables containing the data of interest to the smaller tables that associated them to user-identifiable search criteria. This particular technique reduced the size of the large tables and influenced the DBMS optimizer to access the large table last, only after it had the most specific set of rows identified for the join.
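The artificial-key technique described above can be sketched as follows. This is a minimal illustration using Python's built-in sqlite3 module; the table and column names (store_ref, sales, store_key) are hypothetical, not taken from any particular system.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# The wide, user-identifiable attributes live in a small reference table...
con.execute("""CREATE TABLE store_ref (store_key INTEGER PRIMARY KEY,
               store_name TEXT, city TEXT, region TEXT)""")
con.execute("INSERT INTO store_ref VALUES (1, 'Main St', 'Boston', 'East')")

# ...while the large table of interest carries only the compact
# artificial key, keeping its rows narrow.
con.execute("CREATE TABLE sales (store_key INTEGER, amount REAL)")
con.execute("INSERT INTO sales VALUES (1, 19.95)")

# The user-identifiable search criteria qualify the small table first;
# the large table is joined last, once the qualifying keys are known.
total = con.execute("""
    SELECT SUM(s.amount) FROM sales s
    JOIN store_ref r ON r.store_key = s.store_key
    WHERE r.region = 'East'
""").fetchone()[0]
print(total)  # 19.95
```

The narrow fact rows and the small search table together steer the optimizer toward touching the large table only after the qualifying keys have been identified.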
The dimensional modeling approaches evolved from efforts like these, to derive structures that were both optimal for decision support query performance and also highly usable for the aggregations and analysis typically performed in these applications.
What Are Dimensional Models?
Dimensional models were developed to support analytical tasks, and are unique in that they have two types of tables. First, all business events to be analyzed are gathered into fact tables. Facts are typically numeric and inserted frequently, and fact tables have large numbers of rows. Examples of facts include product sales and bank transactions. Anything that is to be analyzed must be captured in a fact table.
The second table type is the dimension table. Dimension tables are reference-like tables that define how fact information is analyzed; they are typically small and relatively static. In the product sales example, the facts are analyzed by time and by store, so the descriptive time and store data are held in dimension tables.
As described in the introduction, there are two types of dimensional models. In both, facts are collected in a fact table named Sales. In the snowflake schema, dimension information is modeled with a normalized structure, much as it would be in a traditional relational model; for example, the Time dimension comprises three tables: Day, Month and Year.
The star dimensional model has the same fact table as the snowflake schema; each dimension, however, is captured in a single table, resulting in some redundant data. The Time dimension becomes one table carrying all of the day, month and year information.
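The contrast between the two Time dimension designs can be sketched in DDL. This is a minimal sqlite3 sketch; the specific column names are hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Snowflake: the Time dimension is normalized into three related tables.
con.executescript("""
CREATE TABLE year  (year_id  INTEGER PRIMARY KEY, year_num INTEGER);
CREATE TABLE month (month_id INTEGER PRIMARY KEY, month_name TEXT,
                    year_id  INTEGER REFERENCES year(year_id));
CREATE TABLE day   (day_id   INTEGER PRIMARY KEY, day_date TEXT,
                    month_id INTEGER REFERENCES month(month_id));
""")

# Star: the same dimension collapsed into one table. The month and year
# values repeat on every row for a given month -- the redundancy noted above.
con.execute("""CREATE TABLE time_dim (day_id INTEGER PRIMARY KEY,
               day_date TEXT, month_name TEXT, year_num INTEGER)""")

# Either design joins to the fact table through the day-level key.
con.execute("CREATE TABLE sales (day_id INTEGER, store_id INTEGER, amount REAL)")
```

The fact table is identical under both designs; only the dimension side changes.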
To compare design approaches, we formed a team to execute a case study. The team included experienced practitioners and researchers; business and technical experts who had experience with decision support systems; and technical experts capable of implementing databases and OLAP tools. One key consideration was to remain objective – we wanted representation of many analysts and a set of data that was complex enough to uncover real world issues.
The business case for this project was a global retailer with a centralized headquarters function. This case provided both sufficient database volumes and design complexity to provide value to real world projects.
Comparisons and findings were examined across the entire data warehousing lifecycle because the design approach not only affects the modeling process, but also impacts the construction and use of the data warehouse. Therefore, the compilation of results across the entire lifecycle is important for recommending an approach.
Modeling Process. Business relationships within a dimension are not clear in the star schema, because all of the entities for a dimension are collapsed into a single table. While this approach reduces data-model complexity, it introduces ambiguity: the model no longer shows relationships among entities within the dimension. In the star schema representation of the Product dimension, it is not clear that the hierarchy is Department, Class, Style, Product, nor that a Class is in fact identified by the compound key of Department and Class. With a star schema, this knowledge must be materialized in the loading and querying processes. The relational and snowflake schemas do model the business relationships within the dimensions. We recommend that the logical model be a snowflake schema.
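The point about hidden hierarchy knowledge can be made concrete with DDL. In the snowflake form below (a minimal sqlite3 sketch; all names hypothetical), both the Department, Class, Style, Product hierarchy and the compound key of Department and Class are visible in the schema itself; in the flattened star table they are not.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Snowflake: the hierarchy and its keys are explicit in the DDL.
# Note the compound key on class: a class is unique only within a department.
con.executescript("""
CREATE TABLE department (dept_id INTEGER PRIMARY KEY, dept_name TEXT);
CREATE TABLE class      (dept_id INTEGER REFERENCES department(dept_id),
                         class_id INTEGER, class_name TEXT,
                         PRIMARY KEY (dept_id, class_id));
CREATE TABLE style      (style_id INTEGER PRIMARY KEY, style_name TEXT,
                         dept_id INTEGER, class_id INTEGER,
                         FOREIGN KEY (dept_id, class_id)
                             REFERENCES class(dept_id, class_id));
CREATE TABLE product    (product_id INTEGER PRIMARY KEY, product_name TEXT,
                         style_id INTEGER REFERENCES style(style_id));
""")

# Star: all four levels flattened into one row per product. None of the
# relationships above survive in the schema; load and query code must
# carry that knowledge instead.
con.execute("""CREATE TABLE product_dim (product_id INTEGER PRIMARY KEY,
               product_name TEXT, style_id INTEGER, style_name TEXT,
               class_id INTEGER, class_name TEXT,
               dept_id INTEGER, dept_name TEXT)""")
```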
The relational model works well for depicting and understanding business rules. Here, the focus is on the entities and the rules in the business, rather than any business process. Often relational models of the source systems of the data warehouse are available and help generate a clear understanding of the business rules. In contrast, the dimensional models are more likely to be tuned to specific business processes. If possible, include a relational model in the analysis process; the team will benefit with additional knowledge about the data and business.
The need for business expertise is more pronounced in dimensional modeling, which requires a greater understanding of the key business processes that need to be supported. The specific facts needed for analysis must be determined, putting more demands on the analysts and business users in the modeling process.
In addition, dimensional modeling is less mature than relational techniques. The team was less experienced with dimensional models and had more issues to resolve.
Generic Entities. Using generic entities in a data warehouse schema may be problematic. Generic tables are occasionally used in relational designs to house entity sub-types. The example in our model was a store management hierarchy. In it, three logical entities (Region, District and Complex) were defined as a single generic entity with a recursive relationship to handle the one-to-many relationships between the levels of the hierarchy. The primary motivation for this construct is to hedge against entity volatility and relationship changes brought on by events such as business acquisitions and rapid growth in size, product lines or services.
Using generic entities often lessens the impact of such changes by accommodating them through adjustments to value sets stored within the generic entity (as opposed to requiring database structural changes). When used in an OLTP environment, this approach can reduce the number of physical tables to manage and take advantage of some common code. Usability tests, however, indicated that this type of structure is difficult to understand and does not suit end user access well. The OLAP tool metadata setup was also problematic for this type of entity. Reverting from generic entities to one discrete table per level avoids the creation of DBMS views and, in some cases, may yield performance advantages.
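A generic store-management entity of the kind described above might look like the following. This is a minimal sqlite3 sketch (table and column names hypothetical) showing why the structure is awkward for end users: every level comes back as a generic row, and walking the hierarchy requires a recursive query.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Generic entity: one table holds all three levels, with a recursive
# foreign key to the parent row and a type column naming the level.
con.execute("""CREATE TABLE org_unit (
    unit_id   INTEGER PRIMARY KEY,
    unit_type TEXT CHECK (unit_type IN ('REGION', 'DISTRICT', 'COMPLEX')),
    unit_name TEXT,
    parent_id INTEGER REFERENCES org_unit(unit_id))""")
con.executemany("INSERT INTO org_unit VALUES (?, ?, ?, ?)",
                [(1, 'REGION',   'East',      None),
                 (2, 'DISTRICT', 'Northeast', 1),
                 (3, 'COMPLEX',  'Boston',    2)])

# Traversing the hierarchy takes a recursive query -- hard for end users
# to write and hard for OLAP-tool metadata to describe. With discrete
# Region, District and Complex tables, this would be simple joins.
rows = con.execute("""
WITH RECURSIVE chain(unit_id, unit_type, unit_name) AS (
    SELECT unit_id, unit_type, unit_name
    FROM org_unit WHERE parent_id IS NULL
    UNION ALL
    SELECT o.unit_id, o.unit_type, o.unit_name
    FROM org_unit o JOIN chain c ON o.parent_id = c.unit_id)
SELECT unit_type, unit_name FROM chain
""").fetchall()
print(rows)
```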
Fact or Dimension. The distinction between fact and dimension can be blurry. In our case study, our most time-consuming decision was whether Promotions should be a fact or a dimension, because promotion information is used both as a fact in its own analyses and as a dimension for drilling into daily sales. We initially designated Promotions as a fact. As we moved further into physical design and implementation, we realized that, in the end, Promotions was used in both ways; as a result, we replicated the necessary data to a dimension table. Don't agonize over proper dimensional modeling rules because, in some cases, they haven't been established yet. Choose an approach and move into design and prototyping to validate and refine it.
Improper Star Joins. When joining a fact to a dimension table, be sure that the granularity of the fact table is never higher than the granularity of the dimension table being joined with it. Otherwise, a star schema can lead to erroneous join results.
As an example, consider a model with an atomic fact table that contains daily product sales, and a summary fact table that contains daily aggregations of sales up to a higher level called Style. A Product-level dimension table in the star schema has an embedded hierarchy that includes Style. Therefore, it is possible to use the Style ID column from this table in a join with the Style-level fact table, but doing so would result in double counting. Each fact table row would be joined with every Product dimension row with the same style.
To avoid this problem, you can create additional dimension tables that correspond to each level of granularity in the fact tables. In our model, we created a Style dimension table, in addition to the Product dimension table, to avoid the duplicate row problem. An alternative to creating additional tables is to create a DBMS view on the dimension table using a SELECT DISTINCT construct, provided that it doesn’t adversely affect performance. Establishing separate levels of dimension tables for each level of fact-table aggregation is one complication of adopting the star schema.
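The double-counting problem and the SELECT DISTINCT remedy can both be demonstrated directly. This is a minimal sqlite3 sketch with hypothetical data: two products share one style, and a Style-level summary fact is joined against the Product-level dimension.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Star Product dimension: one row per product, with the Style level
# embedded in the table. Products 1 and 2 share style 10.
con.execute("""CREATE TABLE product_dim (product_id INTEGER PRIMARY KEY,
               style_id INTEGER, style_name TEXT)""")
con.executemany("INSERT INTO product_dim VALUES (?, ?, ?)",
                [(1, 10, 'Oxford'), (2, 10, 'Oxford'), (3, 20, 'Loafer')])

# Summary fact table: sales already aggregated to Style granularity.
con.execute("CREATE TABLE style_sales (style_id INTEGER, amount REAL)")
con.executemany("INSERT INTO style_sales VALUES (?, ?)",
                [(10, 100.0), (20, 50.0)])

# Improper star join: the fact row for style 10 matches BOTH product
# rows carrying that style, so its amount is counted twice.
bad = con.execute("""
    SELECT SUM(f.amount) FROM style_sales f
    JOIN product_dim d ON d.style_id = f.style_id
""").fetchone()[0]
print(bad)   # 250.0, not the true total of 150.0

# Remedy: a SELECT DISTINCT view acting as a Style-level dimension.
con.execute("""CREATE VIEW style_dim AS
               SELECT DISTINCT style_id, style_name FROM product_dim""")
good = con.execute("""
    SELECT SUM(f.amount) FROM style_sales f
    JOIN style_dim d ON d.style_id = f.style_id
""").fetchone()[0]
print(good)  # 150.0
```

A separate physical Style dimension table would produce the same correct result as the view, without the risk of the DISTINCT hurting performance on large dimensions.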
Relational Schema Performance. Queries against the relational schema had the longest elapsed times. When the tool-generated SQL was "explained," we observed that the access path chosen by the DBMS for the relational schema was the least efficient of the three.
Impact of OLAP Tools. End user query complexity differences were masked by the OLAP tool. The tool used in this study essentially hid the differences in query complexity among the three schema types. The tool metadata presents dimension hierarchies to the end user; thus, all schemas – even the star schema and the relational schema – appeared the same to the end user. The differences in the structures were more of a factor for the project team member responsible for defining the metadata.
An OLAP tool may be biased toward either the snowflake or star schema. For example, the tool we used can require considerably more temporary table storage space with a star than with a snowflake or relational structure, in cases where duplicate rows are generated. If many queries of this nature run simultaneously, this could become a significant operational issue. In a snowflake schema, temporary space isn’t as significant because a unique dimension table row exists for every description. While tools may support many schema types, the underlying processing may be biased. You need to understand your OLAP tool prior to physical database design and implementation.
The OLAP tool also influences additional database design and implementation decisions. Column naming strategies, compound key restrictions and partitioning strategies were all affected by the OLAP tool. Therefore, an understanding of the technology is essential for data warehouse design success.
Database Influence. The DBMS influences which schema will result in better performance. We conducted a set of tests using native SQL to determine whether the DBMS itself yielded better performance with one of the schemas. For these tests, the star and snowflake schemas were compared using two leading relational DBMSs. Tests were run using identical sets of queries, joining the fact table with varying numbers of dimensions. One DBMS, which supported the creation of bitmap indexes on the large dimension tables, generally performed better with the star schema, although in one test, it favored the snowflake structure by a very large margin. The other DBMS, which is known for having a highly advanced optimizer, generally performed twice as fast with the snowflake structure.
Many project directors spend a great deal of time analyzing and debating the merits of the star versus snowflake versus relational design approaches. Each approach has merits, and the decision about which to take depends on the goals of the warehouse, the technologies being used and the implementation phase.
The logical model should be a snowflake. This model enables the analyst to understand the business relationships within the dimensions. The question of whether to use a snowflake, star or hybrid should be deferred to the physical design phase, which is dependent on the DBMS, OLAP tool and application.
Ideally, analyze with a relational and dimensional model, but implement a dimensional model. During analysis, creating both a dimensional and relational model will ensure a better representation of the business.
For implementation, we found that for typical decision support applications, a dimensional model provides a better solution than a relational model. Dimensional modeling techniques result in structures specifically geared for decision support queries. Traditional relational modeling techniques are less likely to result in an optimal structure for this type of processing.
About the Authors: Tricia Spencer is a Principal with American Management Systems (AMS) Inc. and Director of the Data Warehousing Laboratory at the AMS Center for Advanced Technologies (AMSCAT). She can be reached at email@example.com.
Tom Loukas is a Senior Principal with AMS and Director of the Database Management Laboratory at AMSCAT. He can be reached at firstname.lastname@example.org.