Analysis: Data Quality is Job Number 1

BI tools are only as good as the quality of the data they work with. Analyst Michael Schiff is still surprised at how many BI professionals still ignore this fact.

Much attention has been focused recently on integrating data from multiple sources to populate data warehouses or data marts for analysis purposes or as part of a migration effort for new enterprise applications. For example, a recent press release from a well-established business intelligence vendor highlights the ability of its BI platform to access multiple data sources where they reside without first having to move the data into a data warehouse or a data mart.

Rather than dwelling on the tradeoffs (perhaps a topic for future analysis) among centralized data warehouses, federated databases, enterprise information integration (EII), or even the old concept of virtual data warehouses, I would like to point out a common property they share: the quality of the information obtained from any of them is directly dependent on the quality of the data they access or contain. In other words, GIGO (garbage in, garbage out) always has applied (and always will apply) to both analytical and operational systems.

We all recognize this, yet I continue to be amazed as to how often we ignore the fact. In most cases, it's not deliberate; rather it results from taking at face value preliminary assumptions about data quality rather than using techniques such as data profiling to validate the assumptions.

When integrating data from multiple sources, the data quality at each source may be accurate and consistent but the sources may be inconsistent among themselves. Consider a simple example: inventory quantities from one data source are recorded in different units-of-measure from the others (for example dozens versus pieces). A company I once worked at was about to place an order for a large amount of optical cable until they discovered, at the last minute, that the quantity it though it had on-hand consisted of reels, rather than feet; in this case, different inventory systems had implemented different units of measure.

A more complex example involves different code sets so that the vendor number or customer number in one database is inconsistent with the codes used to represent a vendor or customer in another. Worse yet is using the same customer number to represent different customers in two different systems without users being aware that the numbers in each system were assigned independently. This is an example of the classic issue of "we understand the data in our department (or division); it is the data we receive from the other departments (or divisions) that doesn't make any sense." In fact, consistency across departments and divisions is one of the drivers for master data management.

One of the advantages of consolidating data from multiple sources into a data warehouse is that inconsistencies among sources quickly become apparent (hopefully before the data warehouse goes "live") and remedial steps can be taken to ensure proper data transformation (or reformatting of the source data itself) takes place to correct the problem and prevent it from occurring in future loads. For example, it is certainly possible to define a set of corporate code sets and value lists for the data warehouse and then convert data from each source to these corporate values when loading the data warehouse. Although this may be obvious to BI professionals, it was one of the selling points for data warehousing when it was first popularized in the late 1980s and early 1990s. Unfortunately, this lesson is now sometimes forgotten, especially when organizations try to directly access data from multiple sources rather than first cleansing it and then loading it into a data warehouse.

There are several reasons for this, not the least of which is the spread of business intelligence to business users for use with a particular operational system. In these operational BI implementations, the users are typically very familiar with the data contained in "their own" operational system and, as long as they are using it as their sole data source, they usually avoid consistency problems. However, when they try to incorporate data from multiple operational systems into their analyses, the inconsistencies cause problems.

Both BI practitioners and IT professionals must recognize and, together with their business users, take steps (such as those associated with a data governance program) to avoid these problems. Although BI tools are now easier to use than ever before and make it possible for business users to graphically link and consolidate data from multiple heterogeneous sources, these tools do not ensure that the underlying data sources are consistent. The same potential for data consistency problems also exists with composite applications and mashups. As BI becomes more pervasive within our organizations, we may be doing our business users a disservice if the underlying data that they can now analyze is not accurate and consistent.

About the Author

Michael A. Schiff is a principal consultant for MAS Strategies.