In-Depth

Your Organization and Technology Are Ready for Success. What About Your Data?

Data on the Move – in Pursuit of a Competitive Advantage

There’s been an evolving consensus about how to gain a competitive advantage in the 21st century. The first solution was business process re-engineering. This required restructuring hierarchies into flatter organizations, as well as streamlining processes and aligning them with strategic objectives.

More recently, to support the new organization and its processes, companies have begun to re-engineer their computing infrastructure - investing in leading-edge business systems, based on new Internet networking and software technology and standards such as HTML, XML, Java, CORBA and DCOM. These solutions include enterprise resource planning (ERP) systems for integrating data from multiple business functions, as well as enterprise application integration (EAI) initiatives for moving data from one application to another and into business intelligence functions and information portals. The purpose behind these implementations is to integrate and leverage information across the enterprise. And the goal is to improve operations and decision making at the back end and marketing and service at the real-time front end - for increased revenues and a competitive advantage in the new, global, wired economy.

By combining organizational and technical solutions, companies hope to create a streamlined, Web-enabled organization running on integrated information and powered by super information engines along the fast track to 21st century success. But will it work? Consider this: Is the fuel that will drive performance crude or high-octane?

In these various initiatives for re-engineering people, processes and technology, a key element of a total solution – data - may have been overlooked because of its humble status. Integrated, enterprisewide information is based on data – much of it from diverse legacy systems with varied data formats and from real-time applications like e-commerce, where input often can’t be controlled. Even if the data model has been transformed for migration from old into new systems, what is the quality of the data at the character level? And why is this important?

Hidden Roadblocks – in the Data

Remember the Y2K bug? The problem was about two characters – two zeros – in the date field. But those two ambiguous characters could have a huge business and social impact, throwing off calculations and operations for all sorts of companies and even governments. So, problems with data at the character level – data quality problems – are not trivial.

Y2K is just the tip of the iceberg. In the new millennium, the same set of problems – character-level problems – will continue. The new systems – with their business intelligence functions – may prolong and intensify the effect of data quality problems. Legacy applications contained all sorts of decision-relevant information represented by characters bucketed and strung together in certain fields. In implementing new systems, companies converted legacy data into relational- and ERP-type structures. But unless they also investigated this data at the character level and aligned it in separate, accessible fields as part of the conversion, the data quality problems persist in the new systems – hidden, but not dormant.

For example, consider the situation in which key information in legacy systems may be in the wrong field or in a free-form field - like the comment "deceased" input into a free-form name field. When this data is loaded into a data warehouse or ERP system without investigating and re-engineering the data to put "deceased" into a separate field, these problems persist. When business users in search of trend information drill down to certain fields in the data warehouse with OLAP tools, the analysis results will be skewed if the deceased person is counted. Or the deceased person will live an afterlife in the ERP system, skewing reports and maybe even receiving direct mail, bills, etc. – in other words, costing the company time and money.
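
As a rough sketch of the idea, the following Python fragment shows how a load routine might pull a status comment such as "deceased" out of a free-form name field and into its own field. The keyword list and field layout are invented for illustration, not drawn from any particular system or product.

```python
import re

# Status keywords that sometimes get typed into free-form name fields.
# This list and the output field layout are illustrative assumptions.
STATUS_KEYWORDS = ["DECEASED", "ESTATE OF", "DO NOT MAIL"]

def extract_status(raw_name):
    """Split embedded status comments out of a free-form name value."""
    name = raw_name
    found = []
    for keyword in STATUS_KEYWORDS:
        pattern = re.compile(re.escape(keyword), re.IGNORECASE)
        if pattern.search(name):
            found.append(keyword)
            name = pattern.sub("", name)
    name = re.sub(r"\(\s*\)", "", name)   # drop empty parentheses left behind
    return {"name": " ".join(name.split()), "status": found}

print(extract_status("John Q. Public (DECEASED)"))
# {'name': 'John Q. Public', 'status': ['DECEASED']}
```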

There are other potential sources of data character issues, too. E-commerce applications bring a steady stream of data subject to users’ input errors and choices into the enterprise. With enterprise integration, this inaccurate or inconsistent data, whether new or legacy, will ripple through the integrated applications, for a potential negative impact enterprisewide.

As a result, problems will emerge in the new dynamic operational environment, just when companies are expecting accurate, complete business intelligence capable of supporting smoother operations and better decisions. Unsuspecting companies will be exposed to operational roadblocks and decisions based on faulty data – and run the risk of exposing their problem data to suppliers and consumers over the Internet.

Data Quality Problems and Their Impact on Business

What kind of problems are these? Seemingly minute character problems hidden in a wide range of fields that have the potential to impact operations and revenues. Examples include: variants of customer names – resulting in multiple account records for the same customer; multiple numbers for the same part – creating major inventory problems; different provider numbers for the same doctor – preventing a smooth merger of two HMOs; or critical information, such as relationship identifiers, buried in generated keys for customers, dealers, suppliers and so forth.

For example, think about the characters in a name field: first name, middle initial and last name. Then take the name Denis E. Mario. What if, when entering this name, some characters got concatenated, strung across each other without the proper break point, so the middle initial wasn’t separately parsed, but joined to the first name? "Denis E. Mario" would become "Denise Mario."
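
A minimal sketch, with invented parsing rules, shows why the break point matters: a simple whitespace parse recovers first name, middle initial and last name, but once the characters run together there is no break point left to parse on.

```python
def parse_name(raw):
    """Naive first / middle-initial / last parse of a personal-name field."""
    parts = raw.split()
    if len(parts) == 3 and len(parts[1].rstrip(".")) == 1:
        return {"first": parts[0], "middle": parts[1].rstrip("."), "last": parts[2]}
    if len(parts) == 2:
        return {"first": parts[0], "middle": "", "last": parts[1]}
    return {"first": raw, "middle": "", "last": ""}

print(parse_name("Denis E. Mario"))
# {'first': 'Denis', 'middle': 'E', 'last': 'Mario'}

print(parse_name("DenisE. Mario"))
# Concatenated input loses its break point: the "E" rides along with the
# first name, and downstream the customer reads as "Denise Mario".
# {'first': 'DenisE.', 'middle': '', 'last': 'Mario'}
```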

This character problem, multiplied, can have broad adverse effects – for example, for insurance companies and retailers. Because of the difference in life expectancies between males and females, mistaking Denis for Denise – the character error of the misplaced "e" – will affect the calculation of the premium. Even more fundamentally, the error (counting Denis as a female) can ultimately skew the statistics that go into calculating the actuarial tables that determine premiums, for an even broader impact on revenue.

The same problem persists for a retailer that wants to do a segmented direct mail piece – sending one catalog to women, and a different one to men. If Denis gets the catalog intended for women, the retailer wastes the cost of mailing and, more important, loses the opportunity of a sale – all because the middle initial character isn’t in its proper place.

The problem escalates beyond the mistake in gender. What if a financial institution has entries for both Denis and Denise in the system – counting one person twice? This particular example is symptomatic of a major data quality problem across large enterprises in all industries: duplicate entries for "entities" such as customers, patients, providers, parts, materials, products and suppliers – because variants of names, plus the lack of another common identifier, obscure their identity.

Duplicates pose a serious problem for business performance. They prevent the integration and consolidation of all information about an entity – simply because all the information about an entity with variant names, or without a single universal identifier, can’t be matched. Mailing solutions for direct marketing aren’t the answer. These have achieved match accuracy rates of 60 to 70 percent, for significant direct mail cost savings. But this level of accuracy isn’t sufficient for enterprise intelligence efforts.

Duplicates compromise the performance of enterprise intelligence tools and the decisions and programs based on them. Take the case of a large manufacturer with many sites trying to improve supplier management and inventory control. This company needs information on its total book of business with each supplier and an accurate count of each part or material in inventory across all sites. If it can’t consolidate all transactions from one supplier, or misses some parts stored in one site because of variant names or numbers, the information it will act on simply isn’t accurate – its accuracy rate is zero. And its efforts at supplier management and inventory control will be compromised.

The detrimental impact of duplicates is especially clear in mergers and acquisitions. Each company in the proposed transaction may have duplicates in its own database, and this problem is compounded because of additional duplicates – customers who conduct business with both companies – in the merged company. Initially, the acquiring company will base its proposed offer for the targeted company on the number of new customers it will gain. If that number is inflated because of duplicates, the acquiring company will overestimate the value of its target – and pay too high a price.

Then, when the merged company makes projections and plans for future business – based on an inaccurate customer count – these plans may fail. Consider customer relationship management initiatives in the newly merged company. What if the new company perceives Denis and Denise (a character error) to be two customers when, in fact, Denis has two accounts? What if the accounts added together qualify Denis as a premier customer, but the accounts considered individually do not? Denis would be ignored by special programs for premier customers – and might take his business elsewhere. In fact, any company that wants to improve its customer relationships – a utility, telecommunications company, retailer, insurance company, bank, pharmaceutical company or healthcare organization – needs accurate, comprehensive information about each customer or patient.

There’s more. Beyond these internal problems, as companies use extranets or the Internet in conjunction with e-commerce and e-business applications, they will make their information visible to, and accessible by, the public. If a system has duplicates, there will be a breakdown in its ability to accurately fulfill queries, delaying or subverting transactions. In addition, by exposing their data contamination to external parties, companies may endanger their reputation and future business relationships. Simply put, the public may readily assume that because a company’s information is unreliable, its operational performance will also be unreliable.

Though the examples here are highly particular – a matter of erroneous or mismatched characters in data fields – their adverse impact is potentially dramatic. And they may do all the more damage for being hidden from corporate executives and users. However, these data quality problems lurking in the new infrastructure can be discovered, diagnosed and fixed by current technology, ideally when information is loaded into the new systems. If the systems are already running, the time to act is now, before system performance is compromised and users lose confidence in the accuracy of their information.

Data Re-engineering for a New Level of Data Quality and Integrity

What needs to be addressed – and re-engineered from the ground up – is data at the character level. This is the building block of the information required to integrate operations and impact the bottom line. To achieve the level of data quality and integrity necessary for smooth operations and sound decisions requires a sophisticated re-engineering effort. This can be accomplished in two ways. You can write custom code, which requires years of effort and still may "miss" some data anomalies, especially in cases where records number in the tens of thousands or more. Or, you can implement "best of breed" software with comprehensive, powerful and flexible technology that goes beyond name and address scrubbing and can perform all the steps in the re-engineering process outlined below.

Either way, re-engineering should start with data investigation: breaking the data in each field apart into its individual characters and analyzing them to see what is present. This process will disclose hidden, often valuable data that is in the wrong field or "extra" (for example, indicators of important business relationships, such as "DBA" or "Trustee of"). Investigation also helps an organization formulate optimal rules for standardizing the data, based on what users find in the existing data. As a result, it’s an important prerequisite for conditioning data into a standard format across all characters and data sets.
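
One common investigation technique is pattern profiling: reduce each value to a character-class signature and tally how often each signature occurs, so the anomalies stand out. The sketch below assumes a hypothetical legacy part-number field and invented sample values.

```python
from collections import Counter

def pattern_of(value):
    """Reduce a field value to a character-class pattern (A=letter, 9=digit)."""
    out = []
    for ch in value:
        if ch.isalpha():
            out.append("A")
        elif ch.isdigit():
            out.append("9")
        else:
            out.append(ch)
    return "".join(out)

# Sample values one might pull from a legacy part-number field (invented data).
values = ["AB-1204", "AB-1375", "ab1375", "AB-99 (obsolete)", "SEE NOTES"]

frequencies = Counter(pattern_of(v) for v in values)
for pattern, count in frequencies.most_common():
    print(f"{count:3d}  {pattern}")
# The rare patterns at the bottom of the report flag free-form comments,
# missing separators and other anomalies worth investigating.
```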

Conditioning, which comes next, manipulates records into standard content (for example, with consistent abbreviations) and a standard format. The standard format will isolate each data value in a discrete field, making all data accessible to users. For example, if we start with the address "120 Main Road" in a free-form line, conditioning will place each element into a discrete field: "120" in the street number field (which is numeric), "Main" in the street name field and "Road" in the suffix type field. If the business rules specify standard abbreviations for the suffix type, the conditioning process will also transform "Road" into the standard abbreviation – for example, "Rd."
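
A minimal sketch of that conditioning step, assuming a tiny invented abbreviation table and the free-form address above, might look like this.

```python
# Standard suffix abbreviations; in practice these come from the business
# rules chosen during investigation. This small table is an assumption.
SUFFIX_ABBREVIATIONS = {"ROAD": "Rd", "STREET": "St", "AVENUE": "Ave"}

def condition_address(free_form):
    """Split a free-form address line into discrete, standardized fields."""
    tokens = free_form.split()
    number = tokens[0] if tokens and tokens[0].isdigit() else ""
    rest = tokens[1:] if number else tokens
    suffix = ""
    if rest and rest[-1].upper().rstrip(".") in SUFFIX_ABBREVIATIONS:
        suffix = SUFFIX_ABBREVIATIONS[rest[-1].upper().rstrip(".")]
        rest = rest[:-1]
    return {"street_number": number, "street_name": " ".join(rest), "suffix": suffix}

print(condition_address("120 Main Road"))
# {'street_number': '120', 'street_name': 'Main', 'suffix': 'Rd'}
```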

As you can clearly see from this example, standardization greatly facilitates – and, in fact, is a prerequisite for – the next step: matching records. What’s needed at this stage are matching capabilities sensitive to "gray areas," or non-exact matches (to "catch" record variants). The objective here is to identify each distinct entity about which there are records in the systems, removing duplicates and then assigning a unique identifier key to that entity so that all information about it can be integrated and consolidated across systems. This is necessary to reveal all buried relationships – for example, the same product ordered under different numbers at different sites, or two people who constitute a household and information about all their accounts – and to ensure data integrity.
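
As a rough illustration of that objective, the sketch below groups name variants with a generic string-similarity test (standing in for real matching logic) and assigns each resulting cluster a unique entity key. The similarity threshold, sample records and key format are assumptions.

```python
import itertools
from difflib import SequenceMatcher

def similar(a, b, threshold=0.85):
    """Loose, non-exact comparison to catch name variants (illustrative only)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

# Toy records; real matching would compare several conditioned fields at once.
records = ["Denis Mario", "Denise Mario", "Pat Smith"]

# Union-find, so chains of pairwise matches collapse into one entity cluster.
parent = list(range(len(records)))

def find(i):
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

for i, j in itertools.combinations(range(len(records)), 2):
    if similar(records[i], records[j]):
        parent[find(i)] = find(j)

entity_ids = {}
for i, record in enumerate(records):
    root = find(i)
    entity_ids.setdefault(root, "ENT-%04d" % (len(entity_ids) + 1))
    print(entity_ids[root], record)
# "Denis Mario" and "Denise Mario" share one ENT- key; "Pat Smith" gets its own.
```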

There are two ways to automate the decision as to whether two records match. One, the deterministic, decision-table approach, performs a pattern- or rule-based lookup in a table. The other method, probabilistic linkage, uses rigorous mathematical measurement of the available data in each record to evaluate the match. Both are satisfactory when data is fairly simple and matching requirements are lax. However, the probabilistic approach is required when data is inconsistent, incomplete, conflicting or missing, or when business requirements are demanding – because errors can result in liability or financial consequences, and users need trustworthy data.

In the decision-table approach, each field being compared is evaluated and given a score or letter grade that tells how well it matched. All the grades are lined up to form a pattern, maintaining visibility for each field. The pattern is then matched to a static table that tells the system whether that particular configuration of field scores should or should not be matched.
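
A minimal sketch of a decision table, with invented grading rules and an invented table, shows the mechanics: grade each field, line the grades up into a pattern, then look the pattern up.

```python
def grade(a, b):
    """Grade one field comparison: E = exact, C = close, N = no match."""
    if a == b:
        return "E"
    if a and b and (a[0] == b[0] or a in b or b in a):
        return "C"
    return "N"

# The table encodes the business rules; this one is purely illustrative.
DECISION_TABLE = {
    "EEE": "match", "EEC": "match", "ECE": "match",
    "CEE": "match", "CCE": "review", "ENN": "no match",
}

def decide(rec_a, rec_b):
    pattern = "".join(grade(rec_a[f], rec_b[f]) for f in ("last", "first", "zip"))
    return DECISION_TABLE.get(pattern, "no match")

a = {"last": "Mario", "first": "Denis",  "zip": "02109"}
b = {"last": "Mario", "first": "Denise", "zip": "02109"}
print(decide(a, b))   # first names grade as close -> pattern "ECE" -> match
```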

Probabilistic linking also evaluates each field, but the score numerically represents that field’s information content or amount of information (its emphasis, significance or usefulness in making a matching decision). Then, the individual field scores are summed, to produce a final score precisely measuring the information content of the fields being compared for a match. That final score, or match weight, can be converted into an odds ratio – for an accurate gauge of the probability of a match.

This measurement of "information content" is based on mathematically defined information theory and adjusts value and field scoring based on characteristics of the data. For example, the measurement process will give a higher weight to a match between a pair of Social Security numbers than it will give to a match between gender indicators like M and F. It will also give a higher weight to matching rare values, like the first name Horatio, than it will give to matching common values, like the first name John. In addition, it can give more gradations of partial credit than tables can give to instances of non-exact matching values like Social Security numbers with a few differing digits or name variations (Robert versus Bob). As a result of this ability to measure finer distinctions than a deterministic table approach, probabilistic linkage finds more non-exact matches and makes fewer erroneous matches (it doesn’t overmatch) than tables. It can be audited and validated.
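
The sketch below is in the spirit of classical probabilistic record linkage: each field contributes a weight of log2(m/u) when it agrees, where m is the chance the field agrees on a true match and u the chance it agrees by coincidence, and the field weights sum to an overall match weight that converts to odds. The m and u values shown are invented for illustration.

```python
import math

# Illustrative m/u probabilities (invented): m = chance the field agrees when
# the records truly refer to the same entity; u = chance it agrees by accident.
FIELD_PARAMS = {
    "ssn":    {"m": 0.95, "u": 0.0001},  # rare chance agreement -> big reward
    "gender": {"m": 0.98, "u": 0.5},     # agrees by chance half the time
    "first":  {"m": 0.90, "u": 0.01},
}

def field_weight(field, agrees):
    p = FIELD_PARAMS[field]
    if agrees:
        return math.log2(p["m"] / p["u"])            # agreement weight
    return math.log2((1 - p["m"]) / (1 - p["u"]))    # disagreement penalty

def match_weight(rec_a, rec_b):
    return sum(field_weight(f, rec_a[f] == rec_b[f]) for f in FIELD_PARAMS)

a = {"ssn": "123-45-6789", "gender": "M", "first": "Denis"}
b = {"ssn": "123-45-6789", "gender": "F", "first": "Denis"}
total = match_weight(a, b)
print(f"weight = {total:.1f}, odds of a match ~ 2^{total:.1f} : 1")
# An SSN agreement contributes far more weight than a gender agreement would,
# and rarer values can be given still smaller u's for even higher weights.
```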

Finding all matching records is important to the final re-engineering step, survivorship. Here, desired and "best of breed" data from all records pertaining to an entity are propagated into the master record for that entity (according to the chosen standard format).
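
A minimal sketch of survivorship, under assumed rules (prefer a non-blank value, then the most recently updated record), shows how best-of-breed values from matched records flow into a single master record. The supplier records and field names are invented.

```python
from datetime import date

# Two matched records for the same supplier entity (invented data).
cluster = [
    {"supplier_name": "Acme Mfg",                "duns": "",          "updated": date(1998, 3, 1)},
    {"supplier_name": "Acme Manufacturing Inc.", "duns": "123456789", "updated": date(1999, 7, 15)},
]

def survive(records, fields):
    """Build the master record from the best surviving value for each field."""
    master = {}
    for field in fields:
        candidates = [r for r in records if r[field]]                  # prefer non-blank
        candidates.sort(key=lambda r: r["updated"], reverse=True)      # then newest
        master[field] = candidates[0][field] if candidates else ""
    return master

print(survive(cluster, ("supplier_name", "duns")))
# {'supplier_name': 'Acme Manufacturing Inc.', 'duns': '123456789'}
```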

What this entire process, including probabilistic linkage, produces is a new level of data quality and integrity – from the character level up. Then, once this re-engineered, high-quality data is integrated and leveraged throughout the enterprise in the new electronic infrastructure, users in the re-engineered organization can make better operational and strategic decisions – with confidence. In short, data re-engineering is a necessary part of the total solution to create the streamlined corporation using enterprise intelligence to get a competitive advantage. And the good news is that the technology that automates this thorough data re-engineering can do the job faster – as well as more easily, thoroughly, and cost-effectively – than an army of analysts and software programmers working through thousands, even millions, of legacy records.

About the Author:

Mark E. Atkins is President and Chief Executive Officer at Vality Technology Incorporated (Boston).
