When Should You Clean Warehoused Data?

Scrub your data at the right time

Scrubbing your data too soon or too late wastes effort. Plan carefully so your quality checks happen at the right time.

In the August 2001 column ("The Forgotten Data Store"), we discussed the different types and characteristics of staging areas. A staging area, as defined by TDWI, "is any data store designed primarily to receive data into a warehousing environment." The staging area represents the first of three logical layers in a robust data warehousing architecture; the other two are the data warehouse and the data marts. The data warehouse is optimized for information distribution, and the data marts are optimized for information access. As data moves from the various sources through the staging area and data warehouse to the data marts, where is information quality validated and improved?
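As a rough illustration of that flow, the three layers can be pictured as one pass of a pipeline. The Python sketch below is schematic only; the function names and the trivial logic inside them are placeholders for the ETL steps discussed in the rest of this column, not a reference implementation.

    def to_staging(source_rows):
        # Staging: receive data into the warehousing environment largely as-is.
        return list(source_rows)

    def to_warehouse(staged_rows):
        # Warehouse: integrate and cleanse for distribution (the "sweet spot" discussed below).
        return [dict(row, cleansed=True) for row in staged_rows]

    def to_mart(warehouse_rows):
        # Mart: filter, summarize and reorganize for information access.
        return [row for row in warehouse_rows if row.get("subject") == "sales"]

    sales_mart = to_mart(to_warehouse(to_staging([{"subject": "sales", "amount": 100}])))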

Earlier is Better
There are five basic places to enhance data quality in a data warehousing environment.

Few things are as unnerving to an experienced data warehouse project manager as hearing that there will be no data quality issues with operational sources. This is usually an indication that the business and IT teams don’t realize the magnitude of existing data quality shortcomings. It’s actually more comforting to hear that there are significant data quality issues—because then it won’t be a surprise when the inevitable problems are detected and must be addressed.

Business intelligence data quality issues are usually addressed in the data warehousing environment rather than in the operational systems. Some operational systems are no longer actively maintained or well understood, and others are too costly to change simply to correct information quality. As a practical matter, then, data quality issues must be addressed in the data warehousing environment as well as in the operational environment.

Too Early is Bad
Triage is the practice of moving all business data from an operational file or table into the staging area if any part of that file or table is needed. The primary justification for a separate, persistent staging area is to support triage without moving triaged data into the data warehouse. Because data cleansing is one of the more labor-intensive parts of the data warehousing process, there’s no reason to cleanse triaged data that the business doesn’t yet need.

Because many persistent staging areas are laced with triaged data, the application of data quality rules there should be minimized; effort spent cleansing a triaged element adds no immediate business value. It’s too early to apply the bulk of the data quality rules until data passes from the staging area into the data warehouse and data marts.

Too Late is Worse
If it’s bad to apply data quality rules too early in the data warehousing process, it’s much worse to apply them too late. Data that’s included in more than one data mart, including data that may appear in additional data marts in the future, should be completely cleansed in the data warehouse in preparation for distribution, not as the data is mobilized from the data warehouse to the data marts. If data quality rules are replicated in multiple extract, transformation and load (ETL) processes from data warehouse to data marts, it’s inevitable that they’ll diverge over time. This divergence eliminates one of the primary values of an architected business intelligence environment: a single version of the truth across the enterprise. The ETL process from data warehouse to data mart should be focused primarily on filtering, summarizing and reorganizing data for optimal information access.
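To make that division of labor concrete, here is a minimal Python sketch of a warehouse-to-mart step that only filters and summarizes rows already cleansed upstream. The field names and the single mart shown are assumptions for illustration, not part of any particular design.

    # Rows as they would arrive from the data warehouse, already cleansed.
    warehouse_rows = [
        {"region": "EAST", "product": "P1", "revenue": 100.0},
        {"region": "WEST", "product": "P1", "revenue": 250.0},
        {"region": "EAST", "product": "P2", "revenue": 75.0},
    ]

    # Mart load: summarize revenue by region. No quality rules live here,
    # so there is nothing to diverge between this mart and any other.
    sales_mart = {}
    for row in warehouse_rows:
        sales_mart[row["region"]] = sales_mart.get(row["region"], 0.0) + row["revenue"]

    print(sales_mart)   # {'EAST': 175.0, 'WEST': 250.0}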

Cleansing data before staging is too early; cleansing data after the data warehouse is too late. The vast majority of data cleansing occurs between the staging area and the data warehouse, where received data is prepared to be distributed to data marts.

The data warehouse can’t be used to routinely correct data in operational systems. Data quality rules that exist exclusively for decision support should be applied in the data warehousing environment, unless it’s simply too late to capture the needed data at that point in the process or the rule can be applied in the operational environment in a nondisruptive manner. Similarly, data cleansing rules that serve purely operational purposes should be applied only in the operational systems, never in the data warehousing environment.

Natural Key Integration
There are some important data integrity rules that can and should be applied while mobilizing data from sources to the staging area. These primarily fall into two categories: conversions and natural key integration.

When source data formats are incompatible with the data warehousing environment technology, conversions are necessary. For example, if source data is EBCDIC and the data warehousing environment is ASCII, conversion should take place before the data is moved into the staging area.
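As a simple illustration, Python’s standard codecs can handle this kind of conversion. The code page (cp037) and the sample bytes below are assumptions; real sources may use a different EBCDIC variant.

    # Convert an EBCDIC field to ASCII before it lands in the staging area.
    ebcdic_field = bytes([0xC3, 0xE4, 0xE2, 0xE3])          # "CUST" in EBCDIC code page 037
    ascii_field = ebcdic_field.decode("cp037").encode("ascii")
    print(ascii_field)                                       # b'CUST'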

Natural key integration is the process of joining multiple source files and tables together in the staging area when they share a common key. Natural keys are the keys found in the operational systems of the business that uniquely identify entities such as customers, products, employees and accounts. This integration and normalization based on natural keys is the beginning of bringing data from function-oriented operational systems into the subject-oriented data warehouse.
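A minimal sketch of that kind of integration, assuming two hypothetical source extracts keyed by the same customer number (the system and field names are invented for illustration):

    # Two source extracts that share a natural key (customer number).
    billing = {"C1001": {"balance": 250.00}, "C1002": {"balance": 0.00}}
    crm     = {"C1001": {"segment": "retail"}, "C1003": {"segment": "wholesale"}}

    # Join them in the staging area on the shared natural key.
    staged = {}
    for customer_number in billing.keys() | crm.keys():
        staged[customer_number] = {**billing.get(customer_number, {}),
                                   **crm.get(customer_number, {})}

    print(staged["C1001"])   # {'balance': 250.0, 'segment': 'retail'}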

The Sweet Spot
The data cleansing sweet spot lies in the ETL process between the staging area and the data warehouse. Because the vast majority of data quality rules are applied here, it’s relatively simple to centralize the definition, management and use of these business rules in a single location. Ideally, these rules are represented in the form of metadata driving the ETL process, defining rules for source selection, filtering, conversion, translation, derivation, summarization, and cleansing for the enterprise.
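One way to picture this, as a sketch only: the rules can be kept as plain data and applied by a single routine in the staging-to-warehouse ETL. The rule set, field names and translation table below are hypothetical; in practice such rules would live with the rest of the ETL metadata rather than in code.

    # Centralized quality rules, defined once as metadata.
    rules = [
        {"field": "state",  "action": lambda v: v.strip().upper()},
        {"field": "status", "action": lambda v: {"A": "ACTIVE", "I": "INACTIVE"}.get(v, "UNKNOWN")},
    ]

    def cleanse(row, rules):
        # Apply every rule whose field is present in the row.
        for rule in rules:
            if rule["field"] in row:
                row[rule["field"]] = rule["action"](row[rule["field"]])
        return row

    print(cleanse({"state": " va ", "status": "A"}, rules))
    # {'state': 'VA', 'status': 'ACTIVE'}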

This process also includes creating surrogate keys to manage data duplicated in multiple operational systems, handling the reuse of natural keys, and generally establishing a unique identifier for each entity within the enterprise where no such identifier exists in the operational systems.
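A bare-bones sketch of surrogate key assignment follows. The cross-reference table and system names are assumptions, and the matching step that would let the same customer in two different systems share one surrogate key is not shown.

    from itertools import count

    _next_key = count(1)
    key_xref = {}   # (source system, natural key) -> warehouse surrogate key

    def surrogate_for(source_system, natural_key):
        # Assign a new surrogate key the first time a natural key is seen;
        # reuse the same key on every later load of that entity.
        pair = (source_system, natural_key)
        if pair not in key_xref:
            key_xref[pair] = next(_next_key)
        return key_xref[pair]

    print(surrogate_for("billing", "C1001"))   # 1
    print(surrogate_for("crm", "CRM-77"))      # 2
    print(surrogate_for("billing", "C1001"))   # 1 again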

About the Author

Jeff Gentry is President of Technology to Value, LLC and Chief Strategy Officer of eScribendi, LLC. He can be reached at jgentry@tech2value.com. Karolyn Duncan, who co-authored this column, is a consultant with Information Strategies Inc.
