Shining Up Your Data

Data quality is important from the perspective of both transaction processing and decision-support databases. However, many IT shops don't pay serious attention to the issue of data quality until they decide to implement a decision-support database. Traditionally, transaction processing applications are held responsible for their own data quality issues, and it’s up to either the application programmer or the DBA to manage the quality of data that end users enter into the system.

Application programmers usually deal with data quality by implementing edit tests at data entry fields. For example, a series of tests for a name will be first, to find out if the name of the individual or entity in question (customer, partner, supplier, patient, whatever) is known to the system. If it is, or there is a match with a number of potential candidates, the programmer presents the user with a pick list, with perhaps a default choice. A match may be found using a simple table of database lookup, a Soundex search, or some other, more sophisticated, mechanism.

If, on the other hand, the system doesn't find a potential match, the program assumes it is a new entry and performs additional edit checks on the data. For example, it may test a name using a pattern match and reject names that are illogical, such as personal names that begin with a number or that have less than two words. However, if the field is a company name the software will often let the user enter whatever he or she wishes, since the potential universe of combinations is almost infinite.

Similarly, the software can test an address to see if it contains a street number, an alphanumeric street name, an alpha city and a known state, and if a zip code is a 5-digit numeric field within the range 00000-99999. Of course, the edit tests become more complex if the entry is for an entity outside the U.S. Other data fields, such as order date, can be tested for reasonableness. You can order a product now, but you cannot order it in the future, although it will probably be delivered and paid for in the future.

One big issue you need to watch out for is abbreviations. Often a data entry field is fixed length, and users will go to extraordinary lengths to fit as much data as possible into the field. However, the result is that, unless you have some well-understood corporate guidelines for abbreviations that users willingly adhere to, they are going to be very creative in how they define abbreviations.

So, what should you do? First, you should assess where your most severe data quality problems are. Typically, they are in personal and corporate name and address fields. Where is your data quality being compromised? Often it's at the point of entry, but it may also be in data you obtain from an external source. Or, the data may be in a legacy system, and you can't go back and ask the users to re-enter it.

Tackle the problem in two areas. One is at the point of data entry, to ensure that new data being entered into the system is clean. The other exists when you're migrating data between systems, which could be the case if you're implementing an ERP system, or a decision-support database. In this case, you can use a data quality tool to fix the data before loading it into the target system.

Create a set of standardized data validation and verification subroutines that can be called from data entry fields in various applications. Use these subroutines to test for such things as standardized name and address formats, product codes and SKUs, dates, and data reasonableness.

You may need to use verification technology. There are a number of vendors, for instance, who sell tools along with databases of known addresses, zip codes, and other demographic data that can be used to validate an address. It might be worthwhile to investigate them. Some companies in this market niche include Vality Technology Inc. (Boston, www.vality.com), Prism Solutions Inc. (Sunnyvale, Calif., www.prismsolutions.com), Trillium Software (Billerica, Mass., www.trilliumsoft.com) and Firstlogic Inc. (La Crosse, Wis., www.idcentric.com ). --Robert Craig is director, Data Warehousing and Business Intelligence Division, at Hurwitz Group Inc. (Framingham, Mass.). Contact him at rcraig@hurwitz.com or via the Web at www.hurwitz.com.