Referential Integrity for the Data Warehouse Environment
Referential integrity is the practice of ensuring that relationships between rows of data exist and are used as they are defined. As a case of referential integrity, one row of data must exist in order for another row of data to exist. As another case of referential integrity, if one row of data has a column whose value equals "A," then another row of data must have a value whose domain is numeric values from 1 to 10, and so forth. The database designer is free to define the relationship between any two or more rows of data in any manner desirable to the database designer. Referential integrity is the component of the architecture that ensures that the data exists in the database as defined.
The notion of referential integrity as a standard part of the database infrastructure for the data stored inside a database has been around for a long time. There is then the idea that relationships between different occurrences of data inside the database need to be implemented and managed with integrity. By creating and enforcing referential integrity, the user of the data can "trust" the data. In addition, the programmer can rely on the predefined relationships to be intact in order to accomplish sophisticated programming tasks.
But with the data warehouse, the implementation of referential integrity requires re-interpretation from the operational environment to data warehouse environment. Referential integrity as originally specified for operational DBMS simply is not applicable to the DSS data warehouse environment.
Referential Integrity - Not New
Why is it that current implementations of referential integrity under most DBMS do not work for data warehouses? The first thing any self respecting data warehouse administrator does upon implementing a data warehouse is to turn off conventional referential integrity features.
Referential integrity as it is implemented under a conventional DBMS does not apply to data warehousing for three primary reasons:
- The volumes of data that are collected inside the warehouse are so great - as measured in hundreds of gigabytes and terabytes - that the current implementation of referential integrity under the DBMS is impractical to implement. The overhead of operating referential integrity on the volumes of data found in the data warehouse preclude the normal use of referential integrity features. Simply stated, overhead x volume = a number too large to be swallowed.
- Updates - the changing of records once stored in the database - are not done in the data warehouse environment. Instead historical snapshots are created every time a change occurs in the data warehouse. For this reason it is said that update is not normally done in the data warehouse environment. Conventional implementations of referential integrity assume that update will be done in the database.
- There is then a very poor fit between the warehouse environment and a facility that is optimized for operational processing, and data warehouse data is time variant. This means that on occasion there will be a relationship between two units of data at one moment in time and no relationship (or a different relationship) at another moment in time. Current implementations of referential integrity do not take into account the time variance of a relationship. Instead, current implementations of referential integrity assume that there is one and only one relationship between any two or more units of data.
There are then three valid reasons why the data warehouse administrator finds the implementation of referential integrity as currently found in popular DBMS not to be applicable to the data warehouse, decision support systems (DSS) environment. But just because referential as currently implemented under DBMS technology is not usable does not mean that referential integrity is not needed in the warehouse environment. There is still a need to identify and manage data relationships in the data warehouse environment, even if the relationships are fundamentally different than the relationships found in the operational environment. What is needed for the world of data warehousing is a different approach to referential integrity altogether.
Bounded Referential Integrity
The approach to referential integrity that applies to the data warehouse environment can be called "bounded" referential integrity. Bounded referential integrity is very similar to classical referential integrity except for two features. Bounded referential integrity: applies to a subset of a table or tables found within the data warehouse - not to an entire table or tables, and applies after the data has been loaded into the warehouse - not as the data is being loaded.
With these two exceptions, bounded referential integrity is the same as classical referential integrity. As an example of a bounded referential integrity relationship, consider two tables A and B. Table A contains part data from 1990 to 1998. Table B contains supplier data from 1991 to 1998. The data warehouse administrator determines that from 1992 to 1994, there was a relationship between part and supplier that mandates that every part must have a one and only one supplier. But in 1995, parts were created whose key began with "JA ..." did not have a supplier. However, starting in 1996, all parts could have from one to "n" suppliers. There are four bounded relationships here:
- From 1992 to 1994, each part must have one and only one supplier.
- From 1995, each part must have one and only one supplier except for parts whose key begins with "JA ..."
- In 1995, parts whose key begins with "JA ..." can have no supplier.
- From 1996 to the present, all parts can have from zero to "n" suppliers.
The first bounded relationship is from 1992 to 1994. The second bounded relationship is for 1995 for all parts those whose key begins with "JA ..." The third bounded relationship is for 1995 for only parts whose key begins with "JA ..." The fourth bounded relationship is for all parts starting at 1996 and going on into the future.
Bounded referential integrity operates on the basis of a foreign key/key relationship between two or more rows. The data warehouse administrator defines the key and its one or more foreign key correspondents. Existence criteria is established. Furthermore, the warehouse administrator determines the moments in time when the relationship is valid. The dates for the validity may be open or closed. If the dates are closed, there is a starting date and a stopping date. If the date range is open there is either a starting date and no stopping date, or there is a stopping date and no start date. In any case the relationship is bounded - either fully or partially - by dates.
But the relationship can be bounded by other parameters as well. For example, the relationship can be bounded by part types - only parts with "R," "W," or "J" as the first digit in the key participate in the relationship, only employees who are management participate in the relationship, only insurance policies where coverage is greater than $1,000,000 participate in the relationship, and so forth.
The net result is that only a small, select number of occurrences of data participate in the bounded referential integrity relationship.
By binding the relationship, the data warehouse administrator saves huge amounts of resources and maps the relationships to the business requirements. In addition, the data warehouse administrator is able to accommodate the differences in relationships that occur over time.
The bounded referential integrity relationship is not normally tested at load time. The testing of the relationship requires too many separate rows of data to be accessed. Consequently, testing of the relationship to see if the relationship conforms to the specifications set up by the data warehouse administrator is done by an independent audit program. The independent audit program operates when there are spare machine cycles (overnight, weekends, etc.). The independent audit verifies that all the foreign key/key relationships are in place.
If the independent audit program discovers an anomaly in the data relationships, then the audit program reports this condition back to the data warehouse administrator. Once reported to the data warehouse administrator, the administrator can decide what remedy to be taken. This rather passive approach to the enforcement of data warehouse referential integrity applies to the warehouse environment because there is no active update that occurs in the environment (snapshots of data are created rather than a record being altered). Should active update occur in a data warehouse, then a much more proactive approach toward insurance of the relationship would be required.
One assumption that the implementation of bounded referential integrity relies upon is that an efficient access can be made to the warehouse data. If data warehouse data monitoring is already being done, it is an easy enough matter to be able to attach the bounded referential integrity processing to the data content tracker. Bounded referential integrity merely takes the form of another type of monitoring - that of monitoring multiple rows of data simultaneously - if data warehouse data monitoring is already being done. But if there is no efficient way to access the data warehouse, then in order to do bounded referential integrity, the data warehouse administrator must find an efficient and non-disruptive way in order to access the data warehouse.
ABOUT THE AUTHOR:Bill Inmon is founder and CTO of Pine Cone Systems (Englewood, Colo.) and is responsible for the high-level design of its products, as well as for the architecture of planned and future products. He is universally recognized as the "father of the data warehouse," has over 26 years of database technology management experience and data warehouse design expertise, and has published 36 books and more than 350 articles in major computer journals.
To Malmborg Sidebar