Clean-Cut Warehousing

HP Comes Clean On Their Reseller Database

Data warehouses can be like the Trojan Horse. If their enormous and sophisticated datastores are filled with inaccurate or duplicate data, you can be caught unaware withmisleading analysis and inaccurate reporting. Even a $40 billion company like Hewlett-Packard is no exception.

With 100,000 records a day streaming in via electronic transmission from HP's Wide AreaNetwork (WAN) of resellers around the world, cleaning data in HP's Oracle database(running on top of HP-UX 9.x and 10.x) was imperative. These customer records arrive inany number of nonstandard forms: with company names spelled in different ways; addressesoutdated or incorrect; individual names misspelled; titles missing; and run-of-the-milltypographical errors. As a result, the database was polluted with duplicates andinconsistencies and there wasn't any efficient way of ferreting these inaccuracies out.

"The driving force for that initiative," says Lenka Rozporka, the HP ITengineer leading the clean-up job, "was to find a way to consolidate and clean upcustomer data coming from our resellers. We wanted to find out which divisions were buyingwhat and who exactly was buying HP products. We also needed accurate information about acompany's address and contact names." But with the raw data, there was no way toaccomplish this. "Various sources had AT&T entered in about 50 different [ways],meaning they would end up in 50 different brackets for the amount the company spends. Thismade it quite difficult to total the whole amount AT&T spends."

Name Calling

Rozporka reviewed data cleansing software by i.d.Centric data quality tools fromFirstlogic, Inc. (La Crosse,Wis.). "I was familiar with Firstlogic's Postalsoftmailing software because we were already using that for address correction andmerge/purging in other departments." After an initial presentation, proof-of-conceptand verification of results, Rozporka installed i.d.Centric for business name and titlecorrection, standardization, enhancement, matching, and consolidation. The initialimplementation took less than three weeks. Firstlogic also brought in specialists who tookthe process beyond the normal training procedure, working with HP to fine-tune thesolution.

Using i.d.Centric, several tables were implemented to perform data cleansing that wouldcatch all name and address format variations (upper case, lower case, special characters,abbreviations) and data parsing to identify business names related to one company but notall located at the same address. i.d.Centric's matching and consolidation technologyhelped bring together the cleansed business names to eliminate duplicate customer records.After four months, Rozporka is convinced that there's been "an absolute gain fromusing the data cleansing tools from i.d.Centric."

HP's channel marketing organization will now be able to generate accurate reports basedon regrouping of data that allows them to pinpoint where sales are coming from--bycustomer, reseller, region, and other marketing sectors. From these reports, HP'smarketing organization can accomplish long-term strategic planning and make betterbusiness decisions.

Lean, Mean And Clean

The success of this data quality implementation has inspired HP to look at cleaningwarehouse data as they migrate it to newer machines. "When we standardize ourwarehouse data," says Rozporka, "we expect to find at least 10 to 15 versions ofeach business name. We hope to catch all the variations with the data cleansingtools."

Given HP's vast technological resources, did they at any point consider writing theirown data cleansing software program? "First, it's a lengthy process, requiringexperience in writing software. It's difficult to hire people who can write exactly whatyou want, especially since we don't have the experience in how to design the software andhow to program. Second, it's not really cost effective to spend hundreds of thousands ofdollars to write software that you might not be able to market somewhere else."

Because HP's solution did something "most companies don't use i.d.Centric'ssolution for," the biggest hurdle was finding out exactly what the software iscapable of, recalls Rozporka. "We needed to customize the product. There are variousplaces where you can use some selection logic in i.d.Centric's technology. It was crucialfor us to find out the product's limitations and advantages in the beginning, before weactually used it."

According to Rozporka, everyone should be busy cleaning up their data warehouses."Because data warehousing has evolved so quickly and there is need for massiveamounts of information, it seems that everybody can store the information." ButRozporka adds rhetorically "Is it accurate and clean? Are all the totals totaledcorrectly because the data is clean, or are we losing visibility of some data because thebusiness names are not standardized? It has become increasingly valuable to companies toobtain clean, solid customer data," Rozporka concluded. "That's why you seecompanies like Firstlogic bending over backward, retooling their technology to adjust andexceed expectations."

Sarah Juon is a freelance writer based in Rhinelander, Wisconsin.


Data quality is getting more recognition in the market, because many companies realize too late, the dirtiness of their data. Often entire projects will fail because it wasn't incorporated into the solution. With "dirty data," the ability to create a complete customer view is impossible -- which affects other things in the data warehouse like data mining and data modeling. It's important to remember the need for data quality stretches further than just data warehouses-- data quality is essential in any project involving customer-centric information!

1. Parsing Parsing locates, identifies, and isolates individual data elements in unstructured data. Components may include a customer's first name, last name, title, company name, street address, city, state or ZIP Code. Parsing makes it easier to correct, standardize and match data because it allows you to compare individual components, rather than long strings of data.

2. Correction Correction can be defined as "intelligently verifying, scrubbing and appending data based on a set of reliable secondary data sources." Correction is the building block that sets i.d.Centric apart from many other companies in the data quality market space. Many of our competitors can only parse and identify the customer record components, whereas i.d.Centric's tools can verify that data components are correct, such as a street or city spelling or ZIP code.

3. Enhancement Enhancement is the addition of new data and appending of missing information. This information includes data like age, household income, presence of children, marital status and number and types of credit cards owned. Missing information that may be appended includes: apartment numbers, missing address elements, telephone numbers, latitude/longitude information, county codes, MSA and political districts.

4. Standardization Standardization is the fourth building block in the data quality process. This step allows you to convert information into a preferred and consistent format. Standardization allows fixes such as inconsistent abbreviations (IBM, Int'l Bsns Mach, International Business Machines), inconsistent name data (Robert, Bob), unusual titles (Graduate, Realtor Institute) and misspellings and variant spellings (Kwik, Quick, Quik). This building block is important because it helps facilitate better matching.

5. Matching Matching is the most important step in the customer data quality process. Matching allows identification of similar data within and across data sources. This is the heart of data quality--using cleansed information and match standards, eliminating duplicate representations and consolidating all information about each individual customer.

6. Consolidation Once locating the matching records in the data, relationships between customers can be identified and a consolidated view can be built. This is consolidation, the final critical component of successful one-to-one marketing and data quality building blocks. Consolidation allows you to gain a clearer understanding of customers.

It's imperative to realize that these blocks are most successful when used together.