In-Depth
Five Tips for Managing Database Version Tables
Follow these tips and you'll build the best versioned database tables.
By Tom Johnston, President, Asserted Versioning, LLC
To keep track of changes to objects, enterprises often modify the database tables that represent these objects by adding a timestamp to their primary keys (PKs). With these modified PKs, multiple versions of the same object may exist in the same table at the same time, each version reflecting what the object was like or is like during a specific period of time. Typically, this time period starts when the version is physically inserted, and continues until the next version for the same object is inserted. That next version is created by making a copy of the current version, applying the update to that copy, and inserting the copy as a new version. (Different types of version tables are described and illustrated in Chapter 4 of our book.) Deletes are handled in various ways, but are almost always logical deletes, not physical ones.
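The update pattern just described can be sketched concretely. The following Python snippet (using sqlite3; the table and column names are adapted from the article's example, with underscores in place of hyphens and ISO dates so string comparison orders correctly, and are illustrative assumptions rather than a real schema) copies the current version, applies the change, and inserts the copy as a new version:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer_version (
        cust_nbr TEXT NOT NULL,               -- object identifier
        ver_ts   TEXT NOT NULL,               -- version timestamp
        cust_nm  TEXT NOT NULL,
        deleted  INTEGER NOT NULL DEFAULT 0,  -- logical-delete flag
        PRIMARY KEY (cust_nbr, ver_ts)
    )""")
conn.execute("INSERT INTO customer_version (cust_nbr, ver_ts, cust_nm) "
             "VALUES ('R5-C385', '2010-05-12', 'Smith')")

def versioned_update(conn, cust_nbr, ver_ts, new_name):
    """Copy the current (latest) version, apply the change, insert the copy."""
    current = conn.execute(
        "SELECT cust_nm, deleted FROM customer_version "
        "WHERE cust_nbr = ? ORDER BY ver_ts DESC LIMIT 1",
        (cust_nbr,)).fetchone()
    if current is None or current[1]:
        raise KeyError(f"no current version for {cust_nbr}")
    # The copy carries forward the object's other columns; here only the
    # name changes, so the insert is simple.
    conn.execute("INSERT INTO customer_version (cust_nbr, ver_ts, cust_nm) "
                 "VALUES (?, ?, ?)", (cust_nbr, ver_ts, new_name))

versioned_update(conn, "R5-C385", "2010-09-15", "Jones")
```

A logical delete, in this sketch, would insert a final version with the deleted flag set rather than physically removing any rows.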
Figure 1 shows a versioned customer table. Primary key column headings are marked with an asterisk (*). Timestamp values are shortened to dates to simplify the example. The time period of the first version for customer R5-C385 extends from 5/12/10 up to 9/15/10. The second version is the current one; its time period began on 9/15/10, when the customer’s name changed.
cust-nbr* | ver-ts* | cust-nm
----------+---------+--------
R5-C385   | 5/12/10 | Smith
R5-C385   | 9/15/10 | Jones

Figure 1. A version table. (* = primary key column)
We can do a lot with this simple version table, but we can also do a lot better. Because DBMS vendors are beginning to support advanced forms of version tables, it is important for data modelers and other developers to understand the sometimes subtle differences among the various kinds of version tables. We present five of those differences as five tips for designing and building good version tables.
Tip #1: Use a surrogate identifier for versioned objects.
Logically, the PK of a version table consists of two parts. One is a unique identifier of the object being versioned, which we will call an object identifier. The other is a unique identifier of a time period in the life history of the object.
In the PK of a version table, the object identifier should always be a single surrogate-valued column. This is the foundation of an enterprise approach to version tables, since it means that the same code that manages one such table can manage them all. Otherwise, the PKs of version tables would differ in the number and datatype of their columns, and there would have to be different code to manage these syntactically different PKs – and their foreign key (FK) references as well.
A second reason for using surrogate-valued PKs is that they are stable. Non-surrogate data, by definition, carries business meaning. Values in non-surrogate columns can change -- for example, when a customer moves from one geographical region to another. If the region code is a PK column (or part of one, as it is in these examples), this change affects the PK of that one row as well as all FKs that reference it.
An even more costly change is one in which the datatype of a PK column changes. An example is enlarging a region code PK column from two characters to three to accommodate a finer-grained set of regions. In this case, all the rows in the table are affected. The entire table must be unloaded, and its data transformed and then reloaded. In addition, the same costly process must be carried out for all tables with FKs that reference the altered table. Furthermore, if any of those tables include that FK as part of their own PK, the ripple effect continues outwards to tables that reference those tables. The cost of unstable PKs can be as high as you can imagine.
Of course, when a surrogate is used as a unique identifier, the business key that the surrogate replaces should also be included in the table, but as a non-PK column or columns. Without this business key, match logic will not work unless all update and delete transactions include the surrogate value, and that is often not possible.
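The match logic this paragraph describes can be sketched as follows (Python/sqlite3; the names customer_version, cust_id, and cust_nbr are illustrative assumptions): an incoming transaction carries only the business key, and the surrogate object identifier is resolved from the non-PK column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer_version "
             "(cust_id INTEGER, ver_ts TEXT, cust_nbr TEXT, cust_nm TEXT)")
conn.execute("INSERT INTO customer_version VALUES "
             "(2375, '2010-05-12', 'R5-C385', 'Smith')")
conn.execute("INSERT INTO customer_version VALUES "
             "(2375, '2010-09-15', 'R5-C385', 'Jones')")

def match_surrogate(conn, cust_nbr):
    """Resolve a business key to its surrogate object identifier, if any."""
    row = conn.execute(
        "SELECT DISTINCT cust_id FROM customer_version WHERE cust_nbr = ?",
        (cust_nbr,)).fetchone()
    return row[0] if row else None
```

Without the retained cust_nbr column, a transaction that did not carry the surrogate value could not be matched to its object at all.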
In Figure 1, the object identifier is not surrogate-valued. It consists of a region code (R5, for region 5), a hyphen, and a customer-within-region number (for the 385th customer within region 5 to be added to the table). Figure 2 shows how this table is modified in accordance with tip #1.
cust-id* | ver-ts* | cust-nbr | cust-nm
---------+---------+----------+--------
2375     | 5/12/10 | R5-C385  | Smith
2375     | 9/15/10 | R5-C385  | Jones

Figure 2. A surrogate-keyed version table.
The object identifier is now the surrogate value 2375. It is no longer the customer number (cust-nbr), which is now a non-PK business identifier, and which is available for matching transactions to data already in the table.
Tip #2: Use two timestamps to represent a time period.
The second part of the PK of a version table is a unique identifier of the time period which the version represents. Developers often use a single timestamp to represent this time period, because business requirements permit them to assume that the time period either (i) has an undetermined end, if the version is the most current version of that object and has not been logically deleted, or (ii) ends when the time period of the next later version of the same object begins.
This is a bad idea. A single timestamp in a PK does not represent a time period; the period can only be inferred from the timestamp of the next version of the same object, on the assumption that each non-initial version begins immediately after its predecessor ends. Even if that assumption was true when the requirements for a version table were first stated, it may not remain true.
In Figure 2, the end of the first version’s time period is inferred from the start of the next version’s time period. In Figure 3, there is no inference required. The first version’s time period starts on the s-ver-ts of 5/12/10, and ends on the last timestamp value before the e-ver-ts of 9/15/10 (this being the standard closed-open convention for representing time periods with pairs of timestamps). The current version’s time period starts on 9/15/10. Because we do not know when (or even if) another version will come along for this customer, the end timestamp of this version’s time period is unknown. Using another standard convention for managing temporal data, we represent an unknown end timestamp for a time period by using the highest timestamp value the DBMS recognizes which, in this case, we assume is 12/31/9999.
cust-id* | s-ver-ts* | e-ver-ts*  | cust-nbr | cust-nm
---------+-----------+------------+----------+--------
2375     | 5/12/10   | 9/15/10    | R5-C385  | Smith
2375     | 9/15/10   | 12/31/9999 | R5-C385  | Jones

Figure 3. A surrogate-keyed version table with a time period.
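The closed-open convention can be stated precisely: a period [start, end) contains a timestamp t iff start <= t < end. A minimal Python sketch, with the 12/31/9999 sentinel standing in for an unknown end:

```python
from datetime import date

# Sentinel for an unknown end timestamp, per the convention in the text.
END_OF_TIME = date(9999, 12, 31)

def contains(start, end, t):
    """True iff t falls within the closed-open period [start, end)."""
    return start <= t < end

v1 = (date(2010, 5, 12), date(2010, 9, 15))  # first version's period
v2 = (date(2010, 9, 15), END_OF_TIME)        # current version's period

assert contains(*v1, date(2010, 9, 14))      # inside the first period
assert not contains(*v1, date(2010, 9, 15))  # the boundary belongs to v2
assert contains(*v2, date(2010, 9, 15))
```

The closed-open convention makes adjacent periods fit together without gaps or double-counting: the end timestamp of one version is exactly the start timestamp of the next.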
Tip #3: Prevent overlaps in time periods.
If the time periods of two versions of the same object overlap, the result is a temporal form of an entity integrity violation. Entity integrity prevents two rows representing the same object from existing in the same table at the same time. The reason for preventing this is clear: if two such rows did exist, the data they contain could differ, and the two rows would then be contradictory descriptions of the same object during the same temporal extent of its life history.
Temporal entity integrity does the same job for version tables that entity integrity does for conventional tables. Without temporal entity integrity, two rows in a version table could both represent the same customer during the same or overlapping time periods. Those two rows, unless they contained column-for-column identical data, would contradict one another.
Figure 4 shows a time period overlap. The second version’s start timestamp is earlier than the end timestamp of the first version. The first version says that this customer’s name was “Smith” in early September 2010; the second version says that it was “Jones” during that same period. This is a violation of temporal entity integrity; in Figure 4, the second version is the invalid one.
cust-id* | s-ver-ts* | e-ver-ts*  | cust-nbr | cust-nm
---------+-----------+------------+----------+--------
2375     | 5/12/10   | 9/15/10    | R5-C385  | Smith
2375     | 9/01/10   | 12/31/9999 | R5-C385  | Jones

Figure 4. A time period overlap. (The second row is invalid and must be rejected.)
Newer releases of the major DBMSs are beginning to support PERIOD datatypes. Although there is as yet no SQL standard for PERIOD datatypes, one of the things they must do is prevent overlaps among otherwise identical PKs. Lacking a PERIOD datatype, developers must write their own code to enforce temporal entity integrity. The major part of that work is to manage a pair of timestamps as the end points of a continuous period of time.
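The core of such homegrown enforcement code is the standard interval-overlap test: two closed-open periods [s1, e1) and [s2, e2) overlap iff s1 < e2 and s2 < e1. A Python sketch, using the rows of Figure 4:

```python
from datetime import date

END_OF_TIME = date(9999, 12, 31)

def overlaps(s1, e1, s2, e2):
    """True iff closed-open periods [s1, e1) and [s2, e2) share any time."""
    return s1 < e2 and s2 < e1

# The two versions of customer 2375 from Figure 4:
first  = (date(2010, 5, 12), date(2010, 9, 15))
second = (date(2010, 9, 1), END_OF_TIME)  # starts before the first one ends

# The overlap means the second insert violates temporal entity integrity
# and should be rejected.
assert overlaps(*first, *second)
```

A before-insert check would apply this test between the incoming version's period and the period of every existing version of the same object, rejecting the transaction on any overlap.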
Tip #4: Don’t use foreign keys.
The referential integrity issue, with temporal tables, is that the parent object must be represented in its table at all times that the child object is represented in its table. However, there is no requirement that the parent object, during this time, must be represented by a single row.
If a version of a child object, for example, exists for the first six months of 2009, and if its parent object is also represented in a version table, then during that time there may have been any number of changes to that parent object. If there were ten changes during that time, the result would be ten contiguous versions of that object in the parent table.
The temporal referential integrity relationship from each child version is not to any specific parent version, and is, consequently, not a child row to parent row relationship. It is a relationship between the representation of a child object during a period of time, and the representation of its parent object during that same period of time.
Figure 5 shows a parent Salesperson Version table and a child Customer Version table. In the child table, slsp-id is a temporal foreign key: its value is the object identifier of the salesperson to whom the customer is assigned. It is not a foreign key in the conventional sense, because it does not point to one specific row in the parent table. The minimal-row temporal referential integrity parent for customer Smith is the first two versions of the salesperson shown in the example, because only the combination of the time periods of those two versions temporally includes, without any gaps, the time period of the version for customer Smith. For the same reason, the minimal-row temporal referential integrity parent for customer Jones is the second and third salesperson versions.
These are minimal row parents because each consists of the minimum number of rows that together satisfy the temporal referential integrity constraint for its customer version. All three salesperson versions, however, represent a continuous period of time in the life history of this salesperson and, as such, those three versions constitute an important group of rows. In our book, we call this group of rows an episode, and we use episodes as the parent objects in temporal referential integrity relationships even though episodes are not necessarily minimal row parents.
slsp-id* | s-ver-ts* | e-ver-ts*  | slsp-cat | slsp-nm
---------+-----------+------------+----------+---------
7150     | 6/01/09   | 7/22/10    | D3       | Anderson
7150     | 7/22/10   | 10/03/10   | D3       | Fletcher
7150     | 10/03/10  | 12/31/9999 | F5       | Fletcher

cust-id* | s-ver-ts* | e-ver-ts*  | slsp-id | cust-nbr | cust-nm
---------+-----------+------------+---------+----------+--------
2375     | 5/12/10   | 9/15/10    | 7150    | R5-C385  | Smith
2375     | 9/15/10   | 12/31/9999 | 7150    | R5-C385  | Jones

Figure 5. A temporal referential integrity relationship. (slsp-id in the child table is the temporal foreign key.)
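The coverage test that temporal referential integrity requires can be sketched as follows (Python; the data is taken from Figure 5, and the function is an illustration, not a production implementation): the child's closed-open period must be covered, without gaps, by the union of the parent object's version periods.

```python
from datetime import date

END_OF_TIME = date(9999, 12, 31)

def covers(parent_periods, child_start, child_end):
    """True iff the union of the parent's closed-open periods includes,
    without gaps, the closed-open period [child_start, child_end)."""
    t = child_start
    for s, e in sorted(parent_periods):
        if s <= t < e:   # this parent version covers t; advance past it
            t = e
        if t >= child_end:
            return True
    return t >= child_end

# Salesperson 7150's three contiguous versions (an episode, in the
# book's terminology):
salesperson_7150 = [
    (date(2009, 6, 1), date(2010, 7, 22)),
    (date(2010, 7, 22), date(2010, 10, 3)),
    (date(2010, 10, 3), END_OF_TIME),
]

# Customer Smith's version (5/12/10 - 9/15/10) is covered by the first
# two salesperson versions, the minimal-row parent described in the text.
assert covers(salesperson_7150, date(2010, 5, 12), date(2010, 9, 15))
```

Note that the check never names a specific parent row: it asks only whether the parent object is represented at every moment of the child's period, which is exactly the object-to-object (rather than row-to-row) relationship the text describes.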
Tip #5: To track corrections, add a second time period.
If we discover a data error in a version table, we may think we can just apply an update to correct it, but that won’t work. First, it leaves the erroneous version uncorrected. Second, it says that the change begins when the new version is created. Both conditions are wrong.
What we need to do, instead, is to update the specific version discovered to be in error. To do this, we cannot create a new row for that version because the new row would have the same PK as the row containing the error and would therefore violate primary key uniqueness and be rejected by the DBMS. Our only other means of making the correction, then, is to overwrite the erroneous data on that version. Doing so, however, creates another problem. By overwriting the original data, we remove the original error from the database, thus losing all information about that error, including the information that it ever occurred.
Bitemporal Data
This illustrates an important but seldom-recognized point: some changes to data reflect changes to the objects which that data represents, and other changes to data are made to correct mistakes. Each kind of change, when applied to temporal data, needs its own time period.
The kind of time period we have been discussing so far is a version time period, one that identifies a period of time in the life history of the object represented by the version. The second kind of time period we need is one which identifies a period of time during which we believed, and may still believe, that the data contained in a version was or is correct. This second time period will have to be added to the PK of the version table so we can retain both the original mistake and the correction to that mistake, a correction applied to exactly the right time period.
Doing so turns the version table into what computer scientists call a bitemporal table. Figure 6 shows a bitemporal table in which an error has been corrected. When the version for salesperson 7150 for 10/03/10 to 12/31/9999 was originally created, that salesperson was assigned a category code of F5; the code should have been F4. The error was corrected on 10/16/10. This correction is not the same thing as a conventional update, because it consists of different data for the same object during the same time period as the original 10/03/10 - 12/31/9999 version. Both the third and fourth rows are the same version for salesperson 7150. The second row of this pair is a later assertion about that version, one that corrects a mistake made in the earlier assertion.
slsp-id* | s-ver-ts* | e-ver-ts*  | s-asr-ts* | e-asr-ts*  | ... | slsp-cat | slsp-nm
---------+-----------+------------+-----------+------------+-----+----------+---------
7150     | 6/01/09   | 7/22/10    | 6/01/09   | 12/31/9999 | ... | D3       | Anderson
7150     | 7/22/10   | 10/03/10   | 7/22/10   | 12/31/9999 | ... | D3       | Fletcher
7150     | 10/03/10  | 12/31/9999 | 10/03/10  | 10/16/10   | ... | F5       | Fletcher
7150     | 10/03/10  | 12/31/9999 | 10/16/10  | 12/31/9999 | ... | F4       | Fletcher

Figure 6. A bitemporal table with a row-aligned correction.
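A row-aligned correction of this kind can be sketched in Python/sqlite3 (names adapted from Figure 6 with underscores and ISO dates; this is an illustration of the mechanism, not the AVF's actual implementation): the erroneous row's assertion period is closed at the correction timestamp, and a new assertion of the same version, carrying the corrected data, is inserted in its place.

```python
import sqlite3

END_OF_TIME = "9999-12-31"

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE slsp_bitemporal (
        slsp_id INTEGER, s_ver_ts TEXT, e_ver_ts TEXT,
        s_asr_ts TEXT, e_asr_ts TEXT, slsp_cat TEXT, slsp_nm TEXT,
        PRIMARY KEY (slsp_id, s_ver_ts, s_asr_ts))""")
# The erroneous version from Figure 6, as originally asserted:
conn.execute("INSERT INTO slsp_bitemporal VALUES "
             "(7150, '2010-10-03', ?, '2010-10-03', ?, 'F5', 'Fletcher')",
             (END_OF_TIME, END_OF_TIME))

def correct(conn, slsp_id, s_ver_ts, correction_ts, new_cat):
    # Stop asserting the erroneous row as of the correction timestamp...
    conn.execute("""UPDATE slsp_bitemporal SET e_asr_ts = ?
                    WHERE slsp_id = ? AND s_ver_ts = ? AND e_asr_ts = ?""",
                 (correction_ts, slsp_id, s_ver_ts, END_OF_TIME))
    # ...and insert a new assertion of the same version, with corrected
    # data, beginning on the timestamp the old assertion ended.
    conn.execute("""INSERT INTO slsp_bitemporal
                    SELECT slsp_id, s_ver_ts, e_ver_ts, ?, ?, ?, slsp_nm
                    FROM slsp_bitemporal
                    WHERE slsp_id = ? AND s_ver_ts = ? AND e_asr_ts = ?""",
                 (correction_ts, END_OF_TIME, new_cat,
                  slsp_id, s_ver_ts, correction_ts))

correct(conn, 7150, '2010-10-03', '2010-10-16', 'F4')
```

Because the assertion timestamp is part of the PK, the corrected row does not collide with the original, and the original mistake remains queryable.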
We may think of bitemporal data management as a way of managing versioned data that allows us to both correct mistakes in that data and preserve a record of the mistakes that have been corrected. The example just described is simple because the mistake is temporally aligned with a single version. However, suppose the mistake was actually that salesperson 7150 was in category F5 from 8/22/10 up to 10/05/10, and in category F4 from 10/05/10 until further notice (i.e., until 12/31/9999). How would a bitemporal table both preserve the original data and include the proper adjustments to that data?
Without going into the details of what a correcting transaction would look like in this case, or the details of how the correcting transaction would be translated into a series of SQL inserts and updates, the results of correcting this more complex kind of mistake are shown in Figure 7.
slsp-id* | s-ver-ts* | e-ver-ts*  | s-asr-ts* | e-asr-ts*  | ... | slsp-cat | slsp-nm
---------+-----------+------------+-----------+------------+-----+----------+---------
7150     | 6/01/09   | 7/22/10    | 6/01/09   | 12/31/9999 | ... | D3       | Anderson
7150     | 7/22/10   | 10/03/10   | 7/22/10   | 10/16/10   | ... | D3       | Fletcher
7150     | 10/03/10  | 12/31/9999 | 10/03/10  | 10/16/10   | ... | F5       | Fletcher
7150     | 7/22/10   | 8/22/10    | 10/16/10  | 12/31/9999 | ... | D3       | Fletcher
7150     | 8/22/10   | 10/05/10   | 10/16/10  | 12/31/9999 | ... | F5       | Fletcher
7150     | 10/05/10  | 12/31/9999 | 10/16/10  | 12/31/9999 | ... | F4       | Fletcher

Figure 7. A bitemporal table with a non-row-aligned correction.
In this example, the fourth row in Figure 7 is a version which covers the unaffected portion of the original row 2 version. The fifth row is a version that covers the rest of the original row 2’s time period and also an initial part of row 3’s time period. The sixth (and final) row is a version that covers the remainder of row 3’s time period.
Rows 1, 2, and 3 cover the continuous period of time from 6/01/09 to 12/31/9999. Rows 1, 4, 5, and 6 cover exactly the same continuous period of time. The first set of rows contains the data for that salesperson during that period of time which we thought was correct from when it was first entered (on 6/01/09) until we discovered and corrected the mistakes (on 10/16/10).
The second set of rows contains different data for that salesperson during that same period of time. On 10/16/10, we realized that the originally-entered data was incorrect, and so we stopped asserting it by ending its assertion time period on that timestamp. In its place, we added three new rows for the same salesperson, during the same period of valid time. These three rows were asserted to be the truth about that salesperson during that same period of valid time, starting on the same timestamp that the original two rows ceased being asserted, that timestamp being 10/16/10.
Bitemporal Tables and Views
When we move from version tables to bitemporal tables, the structures used to manage temporal tables become significantly more complex. But a view of a bitemporal table can hide the complexities of its two time periods from users who are only interested in the versions.
For example, consider a view over the table shown in Figure 6 which selects all columns except the two assertion timestamps, and which contains the predicate
WHERE s-asr-ts <= CURRENT-DATE AND CURRENT-DATE < e-asr-ts
Run anytime between 6/01/09 and 10/15/10, this view will present the first three rows as rows in a version table. Run on or after 10/16/10, it will present the first, second, and fourth rows.
The first, second, and third rows show the three versions of salesperson 7150 as they were originally entered. This provides an as-was view of the data. With an as-was view, we can rerun a report as of any past point in time, and exactly reproduce the results obtained when the report was originally run at that time.
The first, second, and fourth rows show the same three versions, but the third version shown is the corrected version, not the version as originally entered. This provides an as-is view of the data. With an as-is view, we can run a report about any past or present period of (version) time but with what we currently believe to be the correct data about how things were during those periods of time.
As a view over the table shown in Figure 7, and run anytime between 6/01/09 and 10/15/10, this view will present the first three rows as rows in a version table. Run on or after 10/16/10, it will present the first, fourth, fifth, and sixth rows.
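Assuming sqlite3 and a parameterized as-of date standing in for CURRENT-DATE, the view's behavior over the Figure 6 rows can be sketched as:

```python
import sqlite3

# The four rows of Figure 6, as (s_ver_ts, e_ver_ts, s_asr_ts, e_asr_ts,
# slsp_cat), with ISO dates so string comparison orders correctly:
rows = [
    ('2009-06-01', '2010-07-22', '2009-06-01', '9999-12-31', 'D3'),
    ('2010-07-22', '2010-10-03', '2010-07-22', '9999-12-31', 'D3'),
    ('2010-10-03', '9999-12-31', '2010-10-03', '2010-10-16', 'F5'),
    ('2010-10-03', '9999-12-31', '2010-10-16', '9999-12-31', 'F4'),
]
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE slsp (s_ver_ts TEXT, e_ver_ts TEXT, "
             "s_asr_ts TEXT, e_asr_ts TEXT, slsp_cat TEXT)")
conn.executemany("INSERT INTO slsp VALUES (?, ?, ?, ?, ?)", rows)

def versions_as_of(conn, as_of):
    """Hide assertion time: show only the rows asserted on the given date,
    presenting the bitemporal table as a plain version table."""
    return conn.execute(
        "SELECT s_ver_ts, e_ver_ts, slsp_cat FROM slsp "
        "WHERE s_asr_ts <= ? AND ? < e_asr_ts ORDER BY s_ver_ts",
        (as_of, as_of)).fetchall()

before = versions_as_of(conn, '2010-10-10')  # as-was: original F5 row
after = versions_as_of(conn, '2010-10-20')   # as-is: corrected F4 row
```

Run before the 10/16/10 correction, the view returns the first three rows (the as-was picture, with F5); run after it, the view returns the first, second, and fourth rows (the as-is picture, with F4).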
Conclusion
The reader should note that what we call versions, the computer science community and DBMS vendors call valid-time data. What we call assertions, they call transaction-time data. However, the functionality surrounding our Asserted Versioning assertions includes all the functionality of transaction-time data, and important additional functionality as well. This additional functionality is described in our book.
DBMS vendors are finally beginning to support data whose temporal features are expressed at the level of rows in tables. Oracle 11g already provides limited support for both valid-time and transaction-time data. This year, IBM is adding support for valid-time and transaction-time data in DB2 10, and Teradata is doing the same in Teradata 13.10. Asserted Versioning, LLC now supports the initial release of its middleware product, the Asserted Versioning Framework (AVF), and this software supports valid-time and transaction-time data as well as an extension of transaction-time data to include future time periods. This software, including the ways in which it extends the standard computer science and vendor community support for bitemporal data, is described in our book.
Although vendor-supplied bitemporal data management software will eventually render homegrown temporal data solutions obsolete, it is still important for the data modelers, DBAs, and programmers who will use these solutions to understand what is going on “under the hood”. With bitemporal tables like the ones shown in Figures 6 and 7, a substantial amount of code is required to manage the temporal aspects of the data they contain. Those willing to write and maintain such code could provide all the bitemporal functionality that vendors will provide; in general, though, this is not a cost-effective solution. However, understanding what such code would have to do will help the IT professional make the best use of bitemporal functionality as it becomes commercially available.
- - -
Tom Johnston has a doctorate in Philosophy, with a concentration in logic, semantics, and ontology. He has worked with business IT for over three decades and, in the latter half of his career, as a consultant for over a dozen major corporations. He is the author of nearly 100 articles in IT journals and is the co-author of Managing Time in Relational Databases (Morgan-Kaufmann, 2010). Information on the Asserted Versioning Framework, the bitemporal data management software offered by Tom’s company, is available here. The author will present a session on bitemporal data at the Enterprise Data World Conference to be held in Chicago, April 3-7, 2011.
Tom offers seminars on the management of temporal data at client sites that utilize client data issues to illustrate important temporal concepts. You can contact the author at tjohnston@acm.org.