In-Depth
Business Intelligence: Quality Assurance
I believe that building warehouse databases is relatively easy; the hard part is obtaining good data and putting it into the warehouse. Data quality and information quality are among the most difficult issues of data warehousing. The difficulties begin with the simple question, "What is data quality?" and become more complex when you ask, "What is the difference between data quality and information quality?" Once quality is understood, the really tough question is, "How do you achieve data and information quality?"
What Is Data Quality?
J. M. Juran’s book (Juran’s Quality Handbook, McGraw-Hill, 1999) focuses on quality as absence of defects. Data defects are conditions in the data that make it difficult or impossible to obtain needed information, or that result in delivery of incorrect or unreliable information. Data quality is the degree to which data is free of defects that limit its utility as an information resource. Data defects are of two types: Integrity and Correctness.
Integrity defects occur when a data structure is incorrect or unreliable. Data integrity means that the data structure has all properties necessary to provide a reliable and trustworthy view of the business. Desirable data integrity qualities include:
1. Identity Integrity. Every occurrence of a real world object, and every row of a warehouse table, is uniquely identifiable.
2. Referential Integrity. Every reference points to an object that exists, so navigating a relationship never reaches a "dead end."
3. Cardinal Integrity. The number of participants in any relationship complies with business rules.
4. Value Set Integrity. No data element contains a meaningless value.
5. Data Dependency Integrity. Dependencies among values and dependencies among relationships comply with business rules.
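The five integrity qualities above can each be expressed as a simple test against warehouse rows. The following is a minimal sketch only: the customers and orders tables, their column names, the value set, the one-order-per-customer limit and the shipped-order rule are all hypothetical, chosen to show one defect of each kind.

```python
from collections import Counter

# Hypothetical sample rows; table and column names are illustrative only.
customers = [
    {"customer_id": 1, "status": "active"},
    {"customer_id": 2, "status": "closed"},
    {"customer_id": 2, "status": "active"},   # identity defect: duplicate key
    {"customer_id": 3, "status": "??"},       # value set defect: meaningless value
]
orders = [
    {"order_id": 10, "customer_id": 1, "status": "shipped", "ship_date": "2001-03-02"},
    {"order_id": 11, "customer_id": 9, "status": "open", "ship_date": None},     # dead-end reference
    {"order_id": 12, "customer_id": 1, "status": "shipped", "ship_date": None},  # dependency defect
]

VALID_CUSTOMER_STATUS = {"active", "closed"}  # assumed value set
MAX_ORDERS_PER_CUSTOMER = 1                   # assumed business rule


def identity_defects(rows, key):
    """Key values that identify more than one row."""
    counts = Counter(row[key] for row in rows)
    return sorted(k for k, n in counts.items() if n > 1)


def referential_defects(children, parents, key):
    """Child rows whose key matches no parent row."""
    parent_keys = {p[key] for p in parents}
    return [c for c in children if c[key] not in parent_keys]


def cardinality_defects(children, key, limit):
    """Parent keys with more child rows than the rule allows."""
    counts = Counter(c[key] for c in children)
    return sorted(k for k, n in counts.items() if n > limit)


def value_set_defects(rows, column, allowed):
    """Rows whose column value is outside the allowed set."""
    return [r for r in rows if r[column] not in allowed]


def dependency_defects(order_rows):
    """Assumed rule: a shipped order must carry a ship date."""
    return [o for o in order_rows
            if o["status"] == "shipped" and o["ship_date"] is None]
```

In a real warehouse these tests would more likely be run as queries or database constraints; the point here is only that each integrity quality maps to a rule that can be checked mechanically.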
Correctness defects occur when the data content is incorrect or unreliable. Data correctness means that the data content has all properties necessary to provide a reliable and trustworthy view of the business. Larry English (Improving Data Warehouse and Business Information Quality, John Wiley & Sons, 1999) describes many of the desirable data correctness qualities including:
1. Completeness. Needed data is present to provide a full picture of the business.
2. Validity. Data values and combinations of values have business meaning in a specific context, and at a particular point in time.
3. Accuracy. Data represents a true and factual view of the real world objects that it describes.
4. Precision. Data is sufficiently detailed and granular to meet business needs.
5. Consistency. Redundant data sources do not produce conflicting facts.
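Some of the correctness qualities above lend themselves to automated checks. The sketch below covers completeness, validity and consistency, assuming two hypothetical source extracts (billing and sales) keyed by customer number, with an invented list of required fields and valid state codes. Accuracy and precision are left out on purpose: accuracy requires comparison with the real world, and precision depends on the level of detail the business needs.

```python
# Hypothetical extracts of the same customers from two source systems.
billing = {
    101: {"name": "Acme", "state": "WA"},
    102: {"name": "", "state": "OR"},        # incomplete: blank name
    103: {"name": "Globex", "state": "ZZ"},  # invalid: unknown state code
}
sales = {
    101: {"name": "Acme", "state": "OR"},    # conflicts with billing
    103: {"name": "Globex", "state": "ZZ"},
}

REQUIRED_FIELDS = ("name", "state")   # assumed completeness rule
VALID_STATES = {"WA", "OR", "CA"}     # assumed valid values


def incomplete(records, required):
    """Keys of records with a missing or blank required field."""
    return sorted(k for k, r in records.items()
                  if any(r.get(f) in (None, "") for f in required))


def invalid(records, field, allowed):
    """Keys of records whose field value has no business meaning."""
    return sorted(k for k, r in records.items() if r.get(field) not in allowed)


def inconsistent(a, b, field):
    """Keys present in both sources whose field values disagree."""
    return sorted(k for k in a.keys() & b.keys() if a[k][field] != b[k][field])
```

Note that record 103 passes the consistency check even though its state code is invalid: both sources agree on the same bad value, which is why the qualities need to be tested separately.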
What Is Information Quality?
Information quality is the degree to which information is free of defects that limit its utility as a business intelligence resource. Data warehousing is a process of turning data into information. Information defects occur when that process is defective. They result from using defective data to produce information, from turning good data into bad information, and from using good information to reach wrong conclusions. Information defects are of three types:
• Materials defects occur when using the wrong data, or when using data of poor quality to produce information. When data of poor quality is used, data defects are propagated to become information defects. When data is misunderstood and used inappropriately, high-quality data becomes low quality information.
• Presentation defects occur when information is delivered in a form that is unreliable, inconsistent or subject to misunderstanding. Presentation quality means that information is delivered in a useful and understandable form.
• Application defects occur when good information results in wrong conclusions. Application quality means that information is fully understood and appropriately used.
Achieving Quality
Understanding and detecting the various kinds of defects is the first step to data and information quality. Data quality is a procedural issue. Quality improvements are achieved through data cleansing, with attention to defect prevention and removal as data is placed into the warehouse. Deciding when, where and how to audit, filter and correct warehouse data is itself a complex topic.
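As one illustration of auditing, filtering and correcting, a cleansing step might be sketched as follows. This is an assumption-laden example, not a prescribed method: the row layout (id, state, amount), the correction rule (trim and upper-case the state code) and the filter rule (reject rows with no amount) are all hypothetical.

```python
def cleanse(rows):
    """Return (rows fit to load, audit log of actions taken)."""
    loaded, audit = [], []
    for original in rows:
        row = dict(original)  # work on a copy; leave the source untouched

        # Correct: normalize the state code when it can be repaired safely.
        state = (row.get("state") or "").strip().upper()
        if state != row.get("state"):
            audit.append((row["id"], "corrected", "state"))
            row["state"] = state

        # Filter: reject rows that cannot be repaired.
        if row.get("amount") is None:
            audit.append((row["id"], "rejected", "amount"))
            continue

        loaded.append(row)
    return loaded, audit


incoming = [
    {"id": 1, "state": " wa ", "amount": 10.0},
    {"id": 2, "state": "OR", "amount": None},
    {"id": 3, "state": "CA", "amount": 5.0},
]
loaded, audit = cleanse(incoming)
```

Here row 1 is corrected and loaded, row 2 is rejected, and row 3 is loaded unchanged. The audit log records every action, so defects can be traced back to their source rather than silently disappearing.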
Information quality improvements are focused more on people than procedure. Information quality strategies focus on preventing and removing defects when information is delivered from the warehouse and used to make business decisions. Tools and tactics include metadata, education, training and support.
About the Author: David Wells is an Enterprise Systems Manager at the University of Washington, the founder and Principal Consultant of Infocentric, and a fellow of TDWI. He can be reached at dwells@infocentric.org.