Old Thinking Does a Disservice to New Data Hubs

Don't underestimate or misunderstand some key characteristics and features of data hubs. We debunk two of the most common misconceptions.

by Larry Dubov

Master data management (MDM) deals with master data, those data that are generally the most highly shared and the most critical to meeting an enterprise’s goals. Master data are essential sets of core data to an enterprise, which means they have to be accurate. If master data are inconsistent, they could potentially expose an enterprise to significant risk.

Over the past decade, data hubs have been a popular and evolving architectural construct for MDM and other enterprise data management solutions. Yet in my travels, I’m amazed that so many IT professionals still aren’t clear about what data hubs are and what they can do. The term is still often used as a synonym for the more traditional operational data store of the 1980s and 1990s. This is maddening because it obscures the modern design options for enterprise data management (EDM) and master data management solutions that data hubs enable.

There are some key characteristics and features of data hubs that often are underestimated or misunderstood by enterprise architects and systems integrators. Here are two of the most common misconceptions:

Misconception #1: Data must be cleansed and standardized before it is loaded into the data hub

For many professionals brought up on the concepts of operational data stores, data warehouses, and ETL (extract, transform, and load), such a cleansing requirement is an indisputable truth. Data must first be cleansed before the inbound processes load it into the hub. With this principle in mind, a data hub is just another data repository or database that stores cleansed data content, oftentimes used to build data warehousing dimensions.

The reality for data hubs includes a much more active approach to data than just storage of a golden record. The data hub makes the best decisions on entity and relationship resolution by arbitrating the content of data created or modified in the source systems. Expressed differently, a data hub operates as a “service” responsible for creation and maintenance of master entities and relationships.

The data hub as the enterprise master data service (MDS) applies the power of advanced algorithms and human input to resolve entities and relationships in real time. In addition, data governance policies and data stewardship rules and procedures define and mandate the behavior of MDS, including the release of reference codes and code translation semantics for enterprise use.
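To make the contrast with the cleanse-before-load approach concrete, here is a minimal sketch of a hub that accepts raw source records and does the standardization and matching itself. The class and field names (SourceRecord, MasterDataService, submit) are hypothetical, and the exact-match rule on a normalized email address is only a stand-in for the far more advanced matching algorithms and stewardship input a real hub would apply.

```python
from dataclasses import dataclass, field


@dataclass
class SourceRecord:
    source: str      # hypothetical source-system name, e.g. "crm"
    source_id: str   # the record's key in that source system
    name: str
    email: str       # arrives uncleansed, exactly as the source stored it


def match_key(record: SourceRecord) -> str:
    """Standardize inside the hub: trim and lowercase the email address."""
    return record.email.strip().lower()


@dataclass
class MasterDataService:
    # Maps a match key to the source records resolved to one master entity.
    entities: dict = field(default_factory=dict)

    def submit(self, record: SourceRecord) -> str:
        """Accept a raw record and return the master entity it resolves to."""
        key = match_key(record)
        self.entities.setdefault(key, []).append(record)
        return key


hub = MasterDataService()
crm_entity = hub.submit(
    SourceRecord("crm", "C-17", "Ann Lee", " Ann.Lee@Example.com "))
billing_entity = hub.submit(
    SourceRecord("billing", "B-902", "LEE, ANN", "ann.lee@example.com"))
```

Both submissions resolve to the same master entity even though neither source cleansed its data first. The source records are kept as submitted, which is what allows the hub to arbitrate among them later.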

The data hub as MDS provides an ideal way to manage data within a service-oriented architecture (SOA) environment. In a hub-and-spoke model, the MDS serves as the integration point between all systems that produce or consume master data. The MDS is the hub, and all systems communicate directly with it using SOA principles.

Participating systems are “autonomous” in SOA parlance, meaning that they can stay independent of one another and do not have to know the details of how other systems manage master data. This allows disparate system-specific schemas and internal business rules to be hidden, which greatly reduces tight coupling and the overall brittleness of the ecosystem. It also helps to reduce the overall workload that participating systems must bear to manage master data.
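One way to picture this decoupling is code translation through the hub. In the sketch below, each system knows only its own local codes and the hub holds the mapping to enterprise reference codes, so no system needs to learn another system's scheme. The table contents, system names, and the translate function are assumptions made for illustration.

```python
# Hypothetical reference-code tables held by the hub: each system's local
# country code maps to one enterprise reference code.
TO_ENTERPRISE = {
    "crm": {"US": "USA", "UK": "GBR"},
    "billing": {"840": "USA", "826": "GBR"},
}


def translate(code: str, from_system: str, to_system: str) -> str:
    """Translate a local code between two systems via the enterprise code."""
    enterprise_code = TO_ENTERPRISE[from_system][code]
    for local_code, mapped in TO_ENTERPRISE[to_system].items():
        if mapped == enterprise_code:
            return local_code
    raise KeyError(f"{to_system} has no local code for {enterprise_code}")


billing_code = translate("US", "crm", "billing")
crm_code = translate("826", "billing", "crm")
```

Adding a new participating system means adding one mapping to the hub rather than one mapping per existing system.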

Misconception #2: The golden record must be persisted in the data hub

The notion of a data hub as a data repository presumes that the so-called golden record must be persisted in the data hub. The notion of the data hub as a service does not make this presumption. Indeed, as long as the master data service can deliver the golden record to the enterprise, the data hub may or may not retain it. The service view leaves the decision open: a data hub can persist the golden record or assemble it dynamically on request.
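A minimal sketch of dynamic assembly follows, assuming the hub has already linked the source records that belong to one master entity. The survivorship rule used here (the most recently updated non-empty value wins for each attribute) and all field names are illustrative assumptions, not a description of any particular product.

```python
from datetime import date

# Hypothetical source-system records already linked to one master entity.
linked_records = [
    {"source": "crm", "updated": date(2010, 3, 1),
     "name": "Ann Lee", "phone": None},
    {"source": "billing", "updated": date(2010, 6, 15),
     "name": "Ann M. Lee", "phone": "555-0100"},
]


def assemble_golden_record(records):
    """Build the golden record on request; nothing is stored in the hub."""
    golden = {}
    # Oldest first, so values from newer records overwrite older ones.
    for record in sorted(records, key=lambda r: r["updated"]):
        for attribute, value in record.items():
            if attribute in ("source", "updated") or value is None:
                continue
            golden[attribute] = value
    return golden


golden = assemble_golden_record(linked_records)
# golden == {"name": "Ann M. Lee", "phone": "555-0100"}
```

Because the result is computed from the linked source records each time, a change to the survivorship rule takes effect immediately, with no stored golden records to rebuild.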

One of the arguments for a persistently stored golden record is that golden-record retrieval will be slower if the record is assembled dynamically on request. The reality is that existing data hub solutions have demonstrated that a dynamic golden record can be assembled with practically no performance impact.

One of the advantages of dynamically assembled records is that the data hub can maintain multiple views of a golden record aligned with line-of-business (LOB) and functional requirements, data visibility requirements, tolerance to false positives and negatives, and latency requirements. Mature enterprises increasingly require multiple views of the golden record, and dynamic record assembly is better suited to this need.
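The idea of multiple views can be sketched as follows, assuming each view is defined by the attributes it may see and the order in which it trusts the source systems. The view names, attributes, and precedence lists are hypothetical.

```python
# Hypothetical linked source records for one master entity.
linked_records = [
    {"source": "crm", "name": "Ann Lee",
     "phone": "555-0100", "credit_limit": None},
    {"source": "billing", "name": "Ann M. Lee",
     "phone": None, "credit_limit": 5000},
]

# Hypothetical view definitions: visible attributes and source precedence.
VIEWS = {
    "marketing": {"attributes": ["name", "phone"],
                  "precedence": ["crm", "billing"]},
    "finance": {"attributes": ["name", "credit_limit"],
                "precedence": ["billing", "crm"]},
}


def assemble_view(records, view_name):
    """Assemble the golden record as seen by one line of business."""
    view = VIEWS[view_name]
    by_source = {record["source"]: record for record in records}
    result = {}
    for attribute in view["attributes"]:
        # Take the first non-empty value in this view's order of trust.
        for source in view["precedence"]:
            value = by_source.get(source, {}).get(attribute)
            if value is not None:
                result[attribute] = value
                break
    return result


marketing_view = assemble_view(linked_records, "marketing")
finance_view = assemble_view(linked_records, "finance")
```

The same linked records yield a different name and a different set of attributes for each view. A single persisted golden record would have to pick one answer for everyone.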

Another argument often raised in favor of a persistently stored golden record comes from the need to support the history of the golden record. Indeed, history support for master data is critical. There are two major usage patterns for the history of master data. The first pattern is driven by audit requirements. The enterprise needs to understand who made a change, when it was made, and possibly why. These audit needs must be supported by the data hub at the attribute level. MDM solutions that assemble the golden record dynamically address this need by keeping the history of changes to the content of the source system records.
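As an illustration of attribute-level audit history kept against source records, the sketch below logs one entry per attribute that actually changes. The AttributeChange structure, the apply_update function, and the sample values are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class AttributeChange:
    """One audit entry: who changed which attribute, when, and why."""
    source: str
    source_id: str
    attribute: str
    old_value: object
    new_value: object
    changed_by: str
    changed_at: datetime
    reason: str


def apply_update(record, updates, changed_by, changed_at, reason, audit_log):
    """Apply updates to a source record, logging each attribute that changes."""
    for attribute, new_value in updates.items():
        old_value = record.get(attribute)
        if old_value == new_value:
            continue  # unchanged attributes produce no audit entry
        audit_log.append(AttributeChange(
            record["source"], record["source_id"], attribute,
            old_value, new_value, changed_by, changed_at, reason,
        ))
        record[attribute] = new_value


audit_log = []
crm_record = {"source": "crm", "source_id": "C-17",
              "name": "Ann Lee", "phone": "555-0100"}
apply_update(
    crm_record,
    {"name": "Ann Lee", "phone": "555-0199"},
    changed_by="jsmith",
    changed_at=datetime(2010, 6, 15, 9, 30),
    reason="customer call",
    audit_log=audit_log,
)
```

Only the phone change is logged here, because the submitted name matches the stored one. With a log like this, the audit questions can be answered without persisting the golden record itself.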

The second usage pattern for history support results from the need to support database queries on data referring to a certain point in time or certain time range (e.g., what was the inventory on a certain date or sales over the second quarter). A classic example of this type of history support is the management of slowly changing dimensions in data warehousing. To support this usage pattern, the golden version of the master record must be persisted. It is just a question of where. Many enterprises decide this question in favor of data warehousing dimensions while avoiding the persistently stored golden record in the data hub.
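For readers less familiar with slowly changing dimensions, here is a minimal sketch of a point-in-time lookup against a Type 2 dimension, in which every version of the master record is persisted with a validity range. The column names and sample rows are hypothetical.

```python
from datetime import date

# Hypothetical Type 2 dimension rows. valid_to is exclusive, and None
# marks the current version of the master record.
customer_dim = [
    {"customer_key": 1, "master_id": "M-1", "city": "Austin",
     "valid_from": date(2009, 1, 1), "valid_to": date(2010, 4, 1)},
    {"customer_key": 2, "master_id": "M-1", "city": "Boston",
     "valid_from": date(2010, 4, 1), "valid_to": None},
]


def version_as_of(rows, master_id, as_of):
    """Return the dimension row that was in effect on the given date."""
    for row in rows:
        if row["master_id"] != master_id:
            continue
        started = row["valid_from"] <= as_of
        not_ended = row["valid_to"] is None or as_of < row["valid_to"]
        if started and not_ended:
            return row
    return None  # no version existed on that date


before_move = version_as_of(customer_dim, "M-1", date(2010, 3, 31))
after_move = version_as_of(customer_dim, "M-1", date(2010, 4, 1))
```

Facts recorded before the change join to the Austin version and later facts join to the Boston version, which is why this pattern requires persisted versions somewhere, whether in the warehouse or in the hub.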

The Final Word

Modern data hubs function as active components of service-oriented architecture and as master data services rather than as passive repositories of cleansed data. Keeping this in mind should help enterprise architects and systems integrators build sound master data management solutions.

Dr. Larry Dubov is senior director of business management consulting at Initiate Systems, an IBM company (www.initiate.com). He is an expert in master data management and customer data integration and a co-author of Master Data Management and Customer Data Integration for a Global Enterprise (McGraw-Hill, 2007). He can be reached at ldubov@us.ibm.com.