Drill Down: Metadata Mayhem
Imagine this. When you finally decided to address the Year 2000 issue in your organization a couple of years ago, you didn’t have to assemble a task force, invest in new software tools or hire outside consultants to analyze and correct the problem. Instead, all you had to do was query a database that indicated every database and application program that used a two-digit year field. It would have been pretty sweet.
According to published reports, the director of knowledge management for the Canadian postal system estimates that a database like that would have saved Canada Post $500,000 on its Year 2000 effort. The experience has led the postal service to invest in developing a centralized metadata repository to help it understand the data spread throughout its organization.
Metadata is a key new term for data warehousing applications. Though the term has some post-modern panache, fitting nicely with terms like meta-theory, the concept at its core is quite simple. Metadata is data about data. Conceptually, metadata is similar to the foundation of packet-switched networks, in which data is dropped into packets that are labeled and sent to their destinations. The information that forms the label, such as the address, the type of information enclosed and the number of packets in a transmission, can be thought of as metadata.
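To make the packet analogy concrete, here is a minimal sketch in Python. It is an illustration only, not a real network protocol: the names PacketLabel, Packet, packetize and reassemble are hypothetical. The label carries the address, the content type and the packet count, and the receiver can put the message back together using nothing but the labels.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class PacketLabel:
    """Metadata: describes the payload without being part of it."""
    destination: str    # address the packet is sent to
    content_type: str   # type of information enclosed
    sequence: int       # position of this packet in the transmission (1-based)
    total_packets: int  # number of packets in the transmission


@dataclass(frozen=True)
class Packet:
    label: PacketLabel  # the metadata
    payload: bytes      # the data itself


def packetize(message: bytes, destination: str,
              content_type: str, size: int) -> List[Packet]:
    """Split a message into labeled packets of at most `size` bytes."""
    chunks = [message[i:i + size] for i in range(0, len(message), size)] or [b""]
    return [
        Packet(PacketLabel(destination, content_type, n, len(chunks)), chunk)
        for n, chunk in enumerate(chunks, start=1)
    ]


def reassemble(packets: Iterable[Packet]) -> bytes:
    """Rebuild the message in order, relying only on the labels."""
    ordered = sorted(packets, key=lambda p: p.label.sequence)
    return b"".join(p.payload for p in ordered)
```

Calling reassemble on the packets in any order returns the original message, because the ordering information lives in the labels rather than in the payloads.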
The basic force driving the need for metadata is that more people want more access to more data stored in more databases. For example, when NASA launched its Mission to Planet Earth and the Earth Observing System in the mid-1990s, it wanted to store the terabytes of data streaming in from instruments on satellites, airplanes and ground stations in eight data centers strategically located around the globe. It then wanted to open those data centers to scientists wherever they worked. The volume of data gathered was unprecedented, and the key to the system was to develop a complex metadata superstructure that would allow scientists to identify data sets, data formats and data structures.
Along the same lines, metadata is essential to successful data warehousing. In data warehousing applications, data flows in from operational systems throughout the enterprise. The data is then transformed and stored in ways that should make it useful across the organization. And then the data is used and transformed again. Within that environment, metadata (information about the information and about the processes to which the information is subjected) is needed to improve data quality and compliance with standards, and to reduce data redundancy. It is also the key to developing a deeper understanding and overview of corporate data assets.
Though metadata is relatively simple in theory, in practice capturing, propagating, integrating and synchronizing it is complex. To make matters worse, the technology to implement metadata systems is still in its infancy. And to exacerbate the problem, guess what? No standards for handling metadata in data warehousing applications have yet emerged.
In practice, gathering metadata is complex for two reasons. First, throughout their life cycle, data warehousing applications are driven by several different communities within the enterprise. Database designers, database administrators and end users need to know different characteristics of the data to use it efficiently. Metadata for data warehouses must address the needs of all three communities. Analysts and consultants describe these needs in different ways but, in essence, the metadata must describe data creation and administration, database administration and data movement administration.
Second, not only are the needs of the communities different, the tool sets each community uses are different as well. Each tool set creates its own metadata, such as data dictionaries and data catalogues, but those metadata are not intended to be shared. For example, the column and table definitions created in a production database may be appropriate for use in a data mart, but often the information has to be reentered into the data modeling tool used to create the mart. The same problem exists for data extraction and transformation tools as well as business intelligence applications. Metadata from one application could be appropriate to another, but it cannot be easily imported or appropriately tracked and updated.
The most interesting approach to addressing the metadata issue has been the emergence of "repositories" that offer centralized management of metadata. A metadata repository is a database designed to store metadata collected from each tool used in data warehouse development and application, and to make that information available in an appropriate format to the other tools. There are two approaches to creating repositories. In the first, the repository supports the definition of separate models for each participating tool. In the second, each tool’s metadata are translated into a predefined information model, which, in turn, can be shared by the tools.
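The second approach can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not any vendor's repository: the shared model (ColumnMeta), the two tools ("modeler" and "etl"), their record layouts and the sample entries are all invented for the example. Each tool registers a translator into the common model, after which a single query spans every tool's metadata, including the two-digit year hunt from the opening scenario.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class ColumnMeta:
    """One entry in the shared (hypothetical) information model."""
    source_tool: str
    database: str
    table: str
    column: str
    data_type: str
    length: int


class MetadataRepository:
    """Stores metadata from many tools after translation to one model."""

    def __init__(self) -> None:
        self._translators: Dict[str, Callable[[dict], ColumnMeta]] = {}
        self._entries: List[ColumnMeta] = []

    def register_tool(self, tool: str,
                      translator: Callable[[dict], ColumnMeta]) -> None:
        self._translators[tool] = translator

    def load(self, tool: str, records: Iterable[dict]) -> None:
        translate = self._translators[tool]
        self._entries.extend(translate(r) for r in records)

    def find(self, predicate: Callable[[ColumnMeta], bool]) -> List[ColumnMeta]:
        return [e for e in self._entries if predicate(e)]


def from_modeler(rec: dict) -> ColumnMeta:
    # Assumed layout for a data modeling tool: one key per attribute.
    return ColumnMeta("modeler", rec["db"], rec["tbl"], rec["col"],
                      rec["type"], rec["len"])


def from_etl(rec: dict) -> ColumnMeta:
    # Assumed layout for an extraction tool: "db.table.column" plus "TYPE(n)".
    database, table, column = rec["target"].split(".")
    data_type, _, rest = rec["format"].partition("(")
    length = int(rest.rstrip(")")) if rest else 0
    return ColumnMeta("etl", database, table, column, data_type, length)


def is_two_digit_year(entry: ColumnMeta) -> bool:
    # Crude heuristic for illustration: a 2-character field named like a year.
    name = entry.column.lower()
    return entry.length == 2 and ("yr" in name or "year" in name)


def example_repository() -> MetadataRepository:
    """Build a small repository from made-up sample metadata."""
    repo = MetadataRepository()
    repo.register_tool("modeler", from_modeler)
    repo.register_tool("etl", from_etl)
    repo.load("modeler", [
        {"db": "billing", "tbl": "invoice", "col": "due_yr", "type": "CHAR", "len": 2},
        {"db": "billing", "tbl": "invoice", "col": "amount", "type": "DECIMAL", "len": 9},
    ])
    repo.load("etl", [
        {"target": "sales_mart.orders.ship_year", "format": "CHAR(2)"},
        {"target": "sales_mart.orders.order_date", "format": "DATE"},
    ])
    return repo
```

With the sample data, example_repository().find(is_two_digit_year) returns the due_yr and ship_year entries, even though the two tools recorded them in different formats. In the first approach, by contrast, the repository would keep each tool's records in their native shape and the query logic would have to understand every layout.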
In the past few years, many vendors and standards organizations have tried to create a standard information model for metadata, but none has been widely accepted. According to some observers, the most promising is Microsoft’s Repository Open Information Model (OIM) effort. OIM is an extensible object model and has won the support of many vendors, including Ardent Software. Microsoft reportedly will include a metadata repository in its SQL Server 7.0 software, which is currently in beta testing.
But Microsoft is far from alone in this area. This summer, Oracle announced Repository 7.0, which will be based on the repository now included in Designer/2000. Repository 7.0 will begin beta testing in the first half of 1999. Other vendors with central repositories include Platinum Technology Inc., Computer Associates and One Meaning Inc.
According to several consultants, the stakes for developing effective metadata management systems could not be higher. As one put it recently, metadata is the key to understanding and controlling an enterprise’s business knowledge.