The Forgotten Data Store

Staging areas, which store data temporarily as it's captured from source systems, are critical to successful data warehouses—and to good business intelligence.

If you read the trade magazines, you’ve seen an endless stream of information and opinion about data warehouses, data marts and operational data stores. But what about staging areas—the forgotten data store? A staging area provides a place to store data as it’s captured from source systems, starting the process of reorganizing and recasting business data for business intelligence use. To create and maintain a successful data-warehousing environment, it’s critical that you understand the design and implementation of staging areas.

The idea of staging areas came before either data warehousing or information technology. They’re analogous to builders using a place to store materials at a construction site where both delivery trucks and workers can access them. Staging areas promote the smooth introduction of resources for easy access and use. In a business intelligence environment, staging areas are the gateway from operational systems to the data warehouse.

Staging areas are typically associated with the initial capture of business data from operational systems. In federated data mart architectures, staging areas serve both as the target of initial data capture processes and as the primary repository for time-variant, atomic data used to build data marts. In enterprise data warehouse architectures with dependent data marts, staging areas support data capture from operational systems in preparation for mobilization into the enterprise data warehouse. A second staging area can also be used to help move data from the enterprise data warehouse to dependent data marts.

Three Kinds of Staging Areas
Before we examine the characteristics of staging areas, let’s define the three types: temporary, persistent and transient.

Temporary staging areas are simple structures that hold captured data while it's being loaded into a database or moved from one physical platform to another. For example, you might create a temporary staging area as the result of a data acquisition process, for transfer across a network to a separate data warehouse server. Once this data is transferred to the data-warehousing platform, it resides in a temporary staging area, ready for loading into the warehouse. After you've used the data, you can discard the temporary staging area, then re-create it during the next load.
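To make that life cycle concrete, here's a minimal sketch in Python. The file name, table layout and order data are illustrative assumptions, not details from any particular warehouse:

```python
# Sketch of a temporary staging area: capture, load, discard.
# STAGING_FILE, the orders table and the sample rows are all hypothetical.
import csv
import os
import sqlite3

STAGING_FILE = "stage_orders.csv"  # hypothetical landing file on the warehouse server

def capture_to_staging(source_rows):
    """Write data captured from a source system to the temporary staging file."""
    with open(STAGING_FILE, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["order_id", "amount"])
        writer.writerows(source_rows)

def load_and_discard(warehouse):
    """Load staged rows into the warehouse, then discard the staging area."""
    with open(STAGING_FILE, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        rows = [(r[0], float(r[1])) for r in reader]
    warehouse.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    warehouse.commit()
    os.remove(STAGING_FILE)  # temporary staging area is dropped after use

warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE orders (order_id TEXT, amount REAL)")
capture_to_staging([("A1", 19.95), ("A2", 5.00)])
load_and_discard(warehouse)
```

The point of the sketch is the last step: nothing in the staging file survives the load, so it can be re-created from scratch on the next cycle.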

Unlike temporary staging areas, persistent staging areas are permanent data stores. Staging areas in federated data mart architectures are persistent because they're the sole source of atomic business data (the lowest level of detail) short of going back to the operational systems. In dependent data mart architectures, staging areas are usually persistent between the operational systems and the enterprise data warehouse.

Finally, transient staging areas are components of persistent staging areas, but they aren’t retained between loads. Data that enters a staging area and passes on to the data warehouse entirely and at an atomic level is no longer needed in the persistent staging area. It’s like building materials that have now been used. In some cases, the task of managing transient elements in persistent staging areas requires more effort than just "persisting" all the elements, even the elements that passed on to the data warehouse. The larger the data volumes, the more likely it is that a transient staging area will pay off in saved disk storage and computer resources.
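The pruning of transient elements can be sketched as follows; the staging structure, element names and the set of mobilized elements are hypothetical:

```python
# Sketch of managing transient elements within a persistent staging area.
# The staging rows and element names are illustrative assumptions.

# Persistent staging area: captured product rows keyed by product_id.
staging = {
    "P1": {"pallet_count": 40, "unit_cost": 2.10},
    "P2": {"pallet_count": 24, "unit_cost": 3.75},
}

# Elements that passed on to the data warehouse entirely, at the atomic level.
mobilized_elements = {"unit_cost"}

def prune_transient(staging, mobilized):
    """Drop staged values that now live in the warehouse in full.

    Like building materials that have been used, they no longer need to
    occupy the staging area; only unmobilized elements are retained.
    """
    for row in staging.values():
        for element in mobilized:
            row.pop(element, None)

prune_transient(staging, mobilized_elements)
# staging now retains only pallet_count, the element with no warehouse home
```

Whether this pruning pays for itself depends, as noted above, on data volume: at small scale, simply persisting everything is often less work.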

The Builder Analogy

To understand the value of persistent staging areas, consider the construction site example. As construction begins, materials for the building are unloaded from trucks into the staging area and used by construction workers as the building proceeds. One truck might unload lumber and steel needed this week, along with drywall, pipes and carpet for the weeks to come. The staging area on the construction site is persistent because it accumulates the materials that workers need today, as well as storing materials that won't be needed until later. If the staging area were "deleted" every week, materials would be permanently lost, and at great cost, whether we're talking about construction or data warehousing.

—J.G. & K.D.

Staging Areas Defined
Operational systems, data warehouses and data marts each have distinctly different characteristics based on their purpose in the enterprise information architecture; so do staging areas. Within a persistent staging area, subject orientation is a priority, but optimizing for data intake into the data warehousing architecture is the primary mission. Light data integration may occur as business data is loaded into the staging area, so the data model must reflect this integration.

Unlike other data stores in a data-warehouse architecture, some elements of the persistent staging area are populated only during the load process. These elements make up the transient staging areas. You’ll need to understand a technique called triage to determine whether elements of a staging area should be persistent or transient.

When a data-warehousing project introduces new data into the business intelligence environment, the data warehousing team identifies the new operational sources where the data is captured, or "sourced." Project success depends on your team's ability to stick to the scope document and deal only with the data it requires. To avoid "scope creep," don't move data into the data warehouse or data marts unless it directly supports the objectives of the project.

This raises a question: What happens a year later when a new business requirement calls for new information closely related to data already in the warehouse for which no history exists? For example, say product information was captured from the master product table in the operational systems years ago and moved into the data warehouse and data marts. However, the number of products placed on a full pallet for shipping—the pallet count—was out of scope in the earlier project and wasn’t captured. As a result, there’s no history available on changing product pallet counts over the past year unless your team goes through the expensive process of reading backup tapes and archives. Even worse, history may not be available at all. If you use triage properly, it can solve this problem.

With triage, the first time a data-warehousing project needs any data from a particular file or table in an operational system, all other data in that source that might have future business value is captured as well, even if it's outside the scope of the current project. You need only do a cursory review of the source, add the elements to the persistent staging area and capture them. Since these additional "triaged" data elements are outside the scope of the current project, their values are simply stored in the persistent staging area for potential future use and not addressed again in the current project. If you want to avoid scope creep while capturing history on potentially valuable data, a persistent staging area is an absolute requirement.
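A minimal sketch of triage, assuming a hypothetical product master in which pallet_count is the out-of-scope column:

```python
# Sketch of triage during data capture. The column names and the
# in-scope list are illustrative assumptions, not from any real source.

# All columns found in the operational product master during a cursory review.
source_row = {
    "product_id": "P1",
    "description": "Widget",
    "unit_cost": 2.10,
    "pallet_count": 40,   # out of scope today, but plausibly valuable later
}

in_scope = {"product_id", "description", "unit_cost"}

persistent_staging = []  # accumulates history of every captured element
warehouse = []           # receives only what the current project requires

def capture_with_triage(row):
    # Triage: stage everything that might have future business value...
    persistent_staging.append(dict(row))
    # ...but mobilize only the in-scope elements, avoiding scope creep.
    warehouse.append({k: v for k, v in row.items() if k in in_scope})

capture_with_triage(source_row)
```

When pallet counts are needed a year later, their history is already sitting in the staging rows, even though the warehouse never carried them.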

There is a common pitfall associated with triage. Don’t consider a triaged data element again within a project once it’s in the persistent staging area. If you don’t need it for the current project, don’t waste time thinking about it, and don’t mobilize it to the data warehouse or data marts. Too often, a data element appears to be so simple that it seems like a good idea to go ahead and move it to the data warehouse. However, the impact of moving each "simple" element can add up to several hours when combined with the incremental efforts of understanding the source, performing data mappings, developing the data acquisition processes, changing the data model and managing the associated metadata. In the worst case, weeks can be spent on a single out-of-scope element!

If we go back to our product and pallet-count example, by using triage you can capture the pallet count in the persistent staging area in the early data-warehousing project. A year later, when the pallet count is needed in a new project, history is available in the persistent staging area that goes back to the first time you accessed the operational product master.

Atomic Data
A final staging-area consideration is the treatment of atomic data, the most detailed, lowest-level business data. Atomic data is often the basis for the aggregations, summarizations and derivations the business requires. Except in rare cases where storage costs are prohibitive, atomic data captured from operational systems should be retained in the data-warehousing environment, which eliminates the need to store it in a persistent staging area. If atomic data is mobilized from the staging area to the data warehouse, these elements of the staging area can be transient. If atomic data isn't housed in the data warehouse, retain it in a persistent staging area so that it's available for future use. When you're working with large volumes of atomic data, eliminating redundant storage in the data warehouse and the staging area can have a significant positive impact on storage and performance without sacrificing data availability. For small volumes, however, the savings may be negligible, so take a practical approach.