Integrating Canonical Message Models and Enterprise Data Models (Part 1 of 3)
The enterprise data model (EDM) has failed. We explore a new way of using EDMs -- one in which an EDM can more directly affect the management of data than as merely a paper reference model.
By Dr. Tom Johnston, Chief Scientist, Asserted Versioning, LLC
The term “canonical” originates in the Greek kanon. In Xenophanes and Euripides, it referred to “a rule used by carpenters and masons” (see Note 1). Later uses include “Canon Law” (the legal code of the Catholic Church), music canons (compositions with a particularly regular structure), “canonical form” (used in logic), and the general use of the term “canonical” to mean “standard” or “authoritative.”
The term “canonical message model” (CMM) appears in discussions of service-oriented architecture (SOA), where it refers to the use of a message-specific common data model to mediate data exchanges between one database and another. A message from a source database is translated into the representation defined by this data model. On the receiving end, the message is translated from this representation into the representation used by the target database. Being the standard representation into and out of which the data in messages is translated, the data model common to these messages is the canonical model for the data they exchange, and the message formats it defines are canonical formats.
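The two translations described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the billing and CRM record layouts, their field names, and the CanonicalCustomer format are all hypothetical, invented here to show the shape of the exchange rather than any particular SOA product's interface.

```python
from dataclasses import dataclass


# Hypothetical canonical format for a customer message (illustrative only).
@dataclass(frozen=True)
class CanonicalCustomer:
    customer_id: str
    full_name: str
    country_code: str  # assumed here to be an upper-case two-letter code


def billing_to_canonical(row: dict) -> CanonicalCustomer:
    """Source side: translate a (hypothetical) billing row into the canonical format."""
    return CanonicalCustomer(
        customer_id=str(row["cust_no"]),
        full_name=f'{row["first_nm"]} {row["last_nm"]}',
        country_code=row["ctry"].upper(),
    )


def canonical_to_crm(msg: CanonicalCustomer) -> dict:
    """Target side: translate the canonical format into a (hypothetical) CRM layout."""
    return {
        "id": msg.customer_id,
        "name": msg.full_name,
        "country": msg.country_code,
    }


billing_row = {"cust_no": 1042, "first_nm": "Ada", "last_nm": "Lovelace", "ctry": "gb"}
crm_row = canonical_to_crm(billing_to_canonical(billing_row))
print(crm_row)
```

Neither function knows anything about the other database. Each knows only its own format and the canonical one, which is what makes the shared model the standard representation for the exchange.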
The term “enterprise data model” (EDM) is widely used by data architects and data modelers to refer to a normalized logical data model of the entire enterprise. Usually the EDM is not fully attributed. Often it is a “key-only” model of the data. In most cases, the EDM will include only those entities whose significance extends outside the confines of any one department or division within the enterprise.
In an article published last year, Malcolm Chisholm said that “the major use case for an enterprise data model today is not for instantiating databases, but to facilitate messaging” (see Note 2). Whether or not it is the major use case, I agree that it is an important one. In this series of articles, I want to discuss this role of EDMs as the canonical data model for CMMs.
I will also explain why using an EDM to facilitate messaging does not provide a different way of achieving the same benefits that would be achieved by using the EDM to instantiate databases. The principal intended benefit of using an EDM to instantiate databases is semantic interoperability among the set of thus-instantiated databases. In this age of the Semantic Web, that interoperability is critical; but the use of an EDM as a canonical message model does not move us much closer to achieving it (see Note 3).
The Failed Mission of the Enterprise Data Model
For a long time, data architects and data modelers certainly did think that “the major use case” for EDMs was “for instantiating databases.” Each database in the enterprise should be, in this view, a fully consistent extension of the EDM. Exactly what this means is not completely clear, but we can think of it as meaning that if all the databases in the enterprise were logically combined, and all consequent redundancies eliminated, the result would be a fully attributed and internally consistent version of the EDM.
In practice, however, very few enterprises can say that most of their databases implement data models that are consistent extensions of their EDMs. Although all enterprises have projects to create and evolve working databases, few if any have ever had a project whose sole or principal purpose was to revise those databases so that their data models would be consistent extensions of their EDMs. Because nearly all the important databases in an enterprise have existed longer than the EDM for that enterprise, the most important databases are not consistent extensions of that EDM, and show no inclination ever to become so.
The main reason for this sad state of affairs is cost. Working databases have an extensive codebase to maintain their data and an extensive set of queries to assemble information from that data. Changing the schemas for those databases would have a ripple effect out to that codebase and those queries. It might be possible to protect many of those queries by means of views, but it is unlikely that views can or should protect all the code that maintains the data. If we were to rewrite those databases to be EDM-consistent, we would incur a significant cost, and there would be no user-visible results from the rewrite -- or at least none that business management would accept as justifying that cost.
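The point about views can be illustrated with a small sketch using Python's built-in sqlite3 module. The table and column names (cust, customer, cust_no, nm) are hypothetical, and the behavior shown is specific to SQLite, where views are read-only unless INSTEAD OF triggers are defined; other database systems support some updatable views, so the picture there may be less stark.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# New, hypothetically EDM-consistent table that replaces an older table named cust.
conn.execute("CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, full_name TEXT)")
conn.execute("INSERT INTO customer VALUES (1, 'Ada Lovelace')")

# A view that presents the new table under the old table and column names.
conn.execute(
    "CREATE VIEW cust AS "
    "SELECT customer_id AS cust_no, full_name AS nm FROM customer"
)

# An existing read query, written against the old schema, still works.
legacy_rows = conn.execute("SELECT cust_no, nm FROM cust").fetchall()
print(legacy_rows)

# Existing maintenance code, written against the old schema, does not:
# SQLite rejects writes to a view that has no INSTEAD OF trigger.
try:
    conn.execute("INSERT INTO cust (cust_no, nm) VALUES (2, 'Grace Hopper')")
    write_succeeded = True
except sqlite3.Error as exc:
    write_succeeded = False
    print("legacy write failed:", exc)
```

The read path survives the schema change untouched, while the write path has to be reworked, which is one concrete form of the ripple effect described above.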
This is why rewrites of those databases to make them consistent with their EDMs don’t happen. In the real world, an EDM is just a paper reference model that may or may not be consulted by a project’s data modeler during development of a model designed to meet the objectives for that specific project. If IT policy mandates oversight reviews to guarantee consistency with the EDM, that oversight is often resented because it generates project costs and consumes project time to achieve an objective that the project customer doesn’t see and doesn’t care about.
Consequently, across the multiple databases in an enterprise, there are significant differences in how data about the “same things” is structured and represented. As a result, when data is exchanged across these databases, and when queries assemble results from across these databases, those data messages and queries must handle differences in physical representations of the same information, and resolve semantic differences between those representations. For the most part, I will not continue to distinguish between these two very different tasks, since that distinction is beyond the scope of these articles. Instead, I will generally refer to “mapping” between different “formats” for the same information.
The Enterprise Data Model: A New Mission?
As I noted, within the SOA community, a new way of using EDMs has been suggested, one in which an EDM can more directly affect the management of data than as merely a paper reference model. On this approach, the EDM is a data model whose physical implementation lies, not in the schemas for databases, but in the code that does the mapping whenever data is exchanged between them.
Data has always been exchanged between databases and has almost always required mapping. The mapping has usually been embedded in application code and queries, where it is inextricably intertwined with all the other work done by that code and those queries (for example, the work of expressing business logic as data is created and transformed, the work of assembling data into information results, the work of managing the user interface, and so on).
In SOA-governed collections of databases, on the other hand, data exchanges are isolated in their own architectural layer, called the SOA messaging layer. The work done in this layer is not merely data transport. Rather, it includes data mapping which, besides straightforward translations from one format to another, often includes the far more formidable task of resolving semantic differences between source and target. In SOA environments, the code that does this mapping is segregated from the code and the queries that do the other work, the work of acquiring, managing, and presenting information to its business users.
In this way, it is suggested, an EDM can have a real impact on software and database design and development. It can be more than just a paper reference model. It can be a model that defines the standard data formats and the standard semantics for the data being exchanged between systems. It can be a physically implemented model, but one that is implemented in code instead of in database schemas. It can be a canonical model for messages exchanged between a data source and a data target.
The idea here is that since inertia (in the form of cost) prevents the extensive transformation of specific databases into instantiations of an EDM, why not use the EDM, instead, as a virtual standard representation of data? It would be as if all point-to-point data exchanges were carried out as spoke-to-hub and then hub-to-spoke exchanges, in which the hub is a virtual enterprise database, a virtual instantiation of the EDM. In this way, the EDM could be brought out of the back office of enterprise data management, into the front office of software and database design and development. In this way, the EDM could begin to do real work.
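One practical consequence of the hub arrangement can be shown with simple arithmetic. The sketch below counts the one-way mappings that would have to be written, under an assumption the article does not itself make: that each of n databases exchanges data with every other database, in both directions. The function names are purely illustrative.

```python
def point_to_point_mappings(n: int) -> int:
    """One-way mappings if each of n databases maps directly to every other one."""
    return n * (n - 1)


def hub_and_spoke_mappings(n: int) -> int:
    """One-way mappings if each database maps only to and from the canonical hub."""
    return 2 * n


for n in (3, 5, 10, 20):
    print(n, point_to_point_mappings(n), hub_and_spoke_mappings(n))
```

Under that assumption, the point-to-point count grows with the square of the number of databases, while the hub count grows linearly. Real enterprises rarely connect every database to every other, so the actual difference depends on how dense the exchanges are.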
A Look Ahead
In Part 2, I will show how an EDM is used to “facilitate messaging,” and will explain the benefits of using it that way. In Part 3, I will explain why this way of using an EDM does not achieve the same results as using the EDM to “instantiate databases,” and thus that both of what Chisholm calls “use cases” for an EDM are still important. Specifically, I will show that if an EDM is used only to facilitate messaging, then the problem of resolving semantic inconsistencies among databases will remain unsolved, and the objective of semantic interoperability across those databases will remain unachieved.
NOTES:
1. Liddell and Scott, An Intermediate Greek-English Lexicon. It is interesting to note that the Liddell of this lexicon is the father of the Alice immortalized in Lewis Carroll’s Alice’s Adventures in Wonderland and Through the Looking-Glass.
2. Malcolm Chisholm, The Canonical Data Model, March 8, 2010. http://erwin.com/expert_blogs/detail/canonical_data_model/
3. These articles arose out of discussions that took place last June in the CA Erwin Modeling discussion group on LinkedIn. Other participants in those discussions were Adam Anderson, Bob Muma, Ed Johnson, Frances Brickhill, John Bogard, Roger Jackson, Sergey Sviridyuk, Soun-Young Kwon, Todd Owens, Vinay Kumar, and William Moore. I would like to thank those participants for a stimulating and informative exchange of views. However, there are substantial differences between these articles and those discussions, and the reader should not assume that those participants endorse or would even agree with the views expressed herein.
- - -
Tom Johnston has a doctorate in Philosophy, with a concentration in logic, semantics, and ontology. He has worked with business IT for over three decades and, in the latter half of his career, as a consultant for over a dozen major corporations. He is the author of nearly 100 articles in IT journals and is the co-author of Managing Time in Relational Databases (Morgan Kaufmann, 2010). Information on the Asserted Versioning Framework, the bitemporal data management software offered by Tom’s company, is available here. Tom offers seminars on the management of temporal data at client sites, using the client’s own data issues to illustrate important temporal concepts. You can contact the author at tjohnston@acm.org.