In-Depth
Business Metadata: The End Users’ Roadmap Through the Data Warehouse
As metadata becomes widely recognized for its role in data warehousing, more metadata-enabled warehousing tools are available than ever before. And metadata can serve either as a blueprint to the inner technical workings of the warehouse, or as a roadmap to assist in navigating the warehouse and locating useful information. The data warehouse industry is struggling to produce a metadata standard to enable sharing of metadata across the enterprise and among different vendor tools. Additionally, the demands of the business user are changing the nature of metadata. Its previous role of describing and cataloging the multitude of relational database tables in the warehouse is expanding to include providing information to users about the content, meaning, accuracy and quality of the data they are using to make business decisions.
All Metadata Is Not Created Equal
There are two types of metadata crucial to the warehouse and targeted to two different audiences. The first type of metadata applied to the warehouse is technical metadata, used by IT staff such as systems analysts, warehouse managers and database administrators. Technical metadata provides a detailed technical blueprint or "wiring diagram" of the data warehouse that can be used to assist IT in expanding and maintaining the warehouse. Technical metadata traces the flow of data, providing information such as what sources data is extracted from, when the data was extracted, which target it was loaded into and what technical and business rule transformations were applied to that data as it moved from source to target. By tracking, auditing or analyzing how any given target data model was populated, IT can use technical metadata to verify the accuracy of warehoused information.
The second type of metadata is business metadata. It plays a critical role in warehousing because it connects the business user with the relevant data in the enterprise. Business metadata supports the business user’s perspective of the warehouse, by using common business terms and providing information about the warehoused data in terms of context, understandability, searchability and usability, rather than in terms of infrastructure and database technology.
It educates users about the origins of information and ensures that they apply the information correctly by describing the rules that govern data use and by identifying valid ways in which data can be leveraged and combined with other warehoused data. Without business metadata, knowledge workers would have to rely on ad hoc inquiries to IT staff to provide the data definitions and explain the logic behind database tables in order to conduct their analyses.
A mistake many organizations make is assuming that technical metadata can help business end users navigate the data warehouse. Business users need a less technical and more user-friendly way to access and analyze metadata. They need to understand and deal with the data in a different way than technical users. When business users can’t understand data in terms of context, usability, consistency, currency, accuracy and quality, the likely result is they will lose faith in the data and become reluctant to use the warehouse, thereby minimizing the value of the organization’s investment in data warehousing.
There are many business cases and situations where business metadata can provide the means for answering the end user’s data needs and enabling appropriate use of warehouse data. Following are a few of the most common and useful scenarios.
What Does This Data Mean?
The first scenario we will examine illustrates how business metadata can help the end user make sense of and better understand data in the warehouse. It does so by providing business definitions and valid values for the data, and identifying where the data originated. For example, a user may receive a report and see a column called "product status code." Using business metadata, the end user can look up the report in the warehouse, scan through various columns of the report, and view the business definition of the data element "product status code." If the user wants to do an analysis of inventory, they can decide whether the report’s column is or isn’t relevant, based on the business definition provided by the metadata (e.g., "Indicates lifecycle stages of this product, as determined by product marketing"). If the user still isn’t sure whether "product status code" is relevant to the analysis, they can again use business metadata to find the relevant values or codes related to "product status code." Using the same example, further research may turn up five valid values for "product status code" as shown in Table 1 (see page 59.)
With the help of business metadata, the user may realize that the product status code "current product" does provide relevant data for inventory analysis and that the warehouse report is useful after all.
In Table 1, "business metadata" can be interchanged with "reference data" – the often neglected piece of the metadata layer within the data warehouse architecture. No matter where data is in the warehouse environment, it is crucial that consistent, accurate data is delivered to business decision makers. Reference data defines key data elements and keeps business rules consistent from departmental data marts to global warehouses. Reference data represents discrete codes and dimension hierarchies within the data warehouse model, a component that can make up as much as 60 percent of the entire model.
Enterprise-level decisions are especially dependent upon consistent definitions of key data elements. The recent proliferation of independent data marts has underscored the importance of ensuring data consistency. Organizations that have not considered key reference data while implementing data marts are likely to find themselves with large amounts of disparate data that can’t be shared or combined. Some organizations that have built these "stovepipe" decision support systems are finding that they are incapable of making accurate decisions across the enterprise because they have no way to consistently define the data. Considering key reference data while implementing independent data marts can prevent this problem.
Sometimes though, even when the business user receives a warehouse report, understands the business definitions and has determined that the data is relevant, there may be a reluctance to use or accept it because the user does not know where the data came from and how it got into the data warehouse or data mart.
Business metadata can play a role in inspiring user confidence in the completeness or provenance of the data. Whereas technical metadata answers the question by providing the database and table the data resides in and by showing the algorithmic detail of the transformation, business metadata puts the answer in textual, non-technical terms that the business user can understand. It shows the source systems, files and summary of transformation logic, and perhaps most importantly, provides the user with a human connection to the data via identification of the "data steward." The data steward is the person responsible for the semantics and content of a particular type of data.
Searching the Data Warehouse?
The next scenario demonstrates how business metadata provides the roadmap to destinations that users would probably never arrive at on their own. The metadata can help the end user drill down through categories of tables in the warehouse databases to uncover meaningful nuggets of data that answer the information need and provide insight into the analysis. For example, suppose an insurance portfolio analyst needs to research the portion of claims at P&C Insurance Co. related to brake failure on minivans.
The logical organization and categorization of data in the warehouse by metadata can help the analyst to quickly sift through volumes of data to identify the right data to use in this inquiry. In this example, the analyst might first be presented with a list of the major "topics" of data familiar to business users at the company – policy, claims, marketing, and financials and select "claims." Within the claims area, business metadata may reveal subcategories for malpractice, boat, auto and homeowner claims. Since the agent is dealing with minivan brake failure, the search can be further narrowed to focus on data related to "auto." Within the auto folder, the agent sees accident, theft and comprehensive damage listings. The "accident" listing is the one most relevant to this query, and so the agent is able to identify quickly all of the data in the warehouse that is related to auto accident claims.
Business metadata can help the end user go even further than just locating the right database. For example, it enables the insurance agent to browse table descriptions to find ones that include accident cause. Thus, the agent may discover a table called "Auto Accident Event Table" and look up the definition to learn that the table contains auto accident events, at the individual event level, updated weekly.
Further investigation of the metadata might reveal that the table contains accident events extracted from the claims databases after claim status is set to "closed." Thus, the agent knows that the information about individual accidents and their causes can be found, and proceeds to drill down further to see if time period, scope, granularity, etc., are appropriate for the search. At this point, the agent could scan through a metadata inventory of existing reports related to automobile accidents. Using this metadata, the agent could look at the inventory of existing reports against that data and select the one that most closely matches the current need.
From the list of stored reports, the agent may wish to narrow search criteria further and could use a decision support tool to modify the search to include only accidents involving single vehicles where damage was greater than $5,000 or to only show single car accidents by type and region with the number of events and average dollar amount of damages.
How Good Is This Data?
The last business metadata scenario we’re going to explore involves the difficult issue of data quality. Data quality refers to a number of important factors including data consistency, accuracy and confidence that the end user relies on when making important business decisions using warehouse data. It is not an absolute but a relative value. Data that is "good enough" for one user may be "bad" for another. For most business users to feel comfortable using data from the warehouse, they will want to know two things: 1) How good is this data, including any known deficiencies or problems with this data? and, 2) Is this data suitable for my purposes?
Let’s return to the example of the insurance agent investigating brake failure in single car accidents. The agent may be excited to have found a table with five years of accident event statistics, but before being confident in the conclusions drawn from this table, the agent needs to determine whether the data is "sufficient" enough to perform an accurate analysis. Here again, business metadata can provide the roadmap to knowledge because it allows the agent to check the table definition to see if there are restrictions on the data (e.g., does not include third-party accident investigator report data, or only includes incidents that resulted in both injury and property damage) which might impact the conclusion.
Based on these restrictions, the agent can decide whether the data is skewed or invalid for a particular purpose. For example, if the analyst discovered that the data being considered does not include data from third-party accident investigators, and up to 25 percent of P&C Insurance Co.’s claim are investigated by third parties, this data may not be suitable for the analyst’s purposes.
Another frequent concern of business users is data freshness. Business metadata can be used to answer questions about currency or relevancy, such as: How current is the data? When was the data last updated? What is the data’s periodicity? For instance, if the metadata reveals this accident data is grouped by month, and the most recent month entered is September 1998, this would alert the analyst that certain types of analysis would not be complete using this data. An analysis comparing the number of winter minivan brake failures between 1988, 1997 and 1998, for example, would be meaningless, since the database does not yet contain full data for the winter of 1998. In this case, the metadata shows that the "freshness" of this data will not support the desired analysis, and allows our analyst to avoid producing skewed and invalid results. By the same token, a report that shows brake failures by week would be meaningless if the metadata revealed the periodicity of the data to be monthly.
Quality metrics is another way that metadata can help the business user to be confident in analyses. Business metadata provides metrics that an end user can understand and apply to ensure sound statistical calculations. Some examples of metrics include:
• Standard deviation
• Number of distinct values
• Frequency of value occurrence
• Time period
For example, there may be only three "distinct values" for the insurance agent’s "cause of accident" column: obstacle in road, driver error and mechanical failure. "Mechanical failure" is probably too general a value for the agent to give much weight to a high "frequency of value occurrence" even if "mechanical failure" had a frequency rate of 44 percent of all claims filed. With the help of metrics, the agent knows that additional information is needed before the conclusion can be drawn that brake failure happens more often than not. Thus, business metadata provides quality metrics and identifies other factors, such as number of exceptions during transformation and loading, or user feedback that can be used to create a composite data confidence and quality score that business users utilize to evaluate the quality of warehouse data.
Finding Business Metadata?
With business metadata being kept in different formats in different systems with different user interfaces, and with technical metadata spread across different components of the warehouse – from the scheduler to the data extraction/ cleansing tools to the loading and OLAP tools – it is crucial for all warehouse users to have one place to go to view metadata for any data residing in the enterprise. Therefore, the repository component of the data warehouse architecture is perhaps the most critical.
Data consistency can be achieved on several levels – from a simple key reference table for a departmental data mart to a full repository implementation of all logical and physical metadata throughout the enterprise. A repository will provide information about the location and nature of operational data, which is critical in the construction of a data warehouse. The repository acts as a guide to all the warehoused data, storing the migration environment definitions, mappings of sources to targets, translation requirements, business rules and selection criteria to build the warehouse. In other words, the repository helps develop the blueprint for designing the warehouse and provides the roadmap to help users navigate the warehouse.
The repository should be completely self-defining, self-referencing and extensible, based on a common entity/relationship diagram. By using a model that reflects industry standards and is based on a unified modeling language (UML), users are given the ability to share metadata across multiple tools from multiple vendors.
Navigating the Warehouse
While business metadata uses many of the same terms that technical metadata uses for data definition and location (e.g., source, transformation, table, column, data element, valid values, field, report and data steward), it uses them in a context more appropriate to the end user’s business requirements and understanding. The result is increased use of the warehouse by all levels of users, not just technical users, and decreased frustration and fear of accessing and accepting the data housed within. Thus, organizations that can distinguish between technical and business metadata and incorporate both types when building and maintaining the enterprise data warehouse will achieve maximum return on investment and create a win-win situation for technical and non-technical users of the warehouse.
About the Author:
David Gleason is the Vice President of Enterprise Business Solutions for PLATINUM technology’s Global Consulting Organization. He is a contributor to the book, Data Warehouse: Practical Advice from the Experts, Prentice-Hall, 1997.