Special Feature: Reincarnating Data
Cats may have nine lives, but today’s data can live forever. Thanks to data cleansing, propagation and replication tools, data residing on an NT system is no longer a static element that is entered into the system once and left to sit unchanged.
Savvy IT managers are increasing the value of their data by reincarnating it to support sophisticated analysis and decision support operations. But these applications require complex processes -- propagation, replication and cleansing -- before data can be stored in secondary sites, such as data warehouses.
What happens during these processes? What must you know to ensure that they run optimally? Some leading-edge NT sites are finding that understanding the use of data propagation, replication and cleansing tools is enhancing the value of their data and the competitiveness of their organizations.
At McLeod USA Publishing Co., a subsidiary of McLeod USA (Cedar Rapids, Iowa), data replication between NT and AS/400 systems enhanced the productivity of the sales force, increased the number of sales closes per day, and provided in-depth, detailed reporting facilities to the sales department and executive management. McLeod, a provider of integrated telecommunications services to business and residential customers in Iowa, Minnesota, Wisconsin, North Dakota and South Dakota, implemented InfoPump from Platinum Technology (Oakbrook Terrace, Ill., www.platinum.com) to replicate data in near real time and incorporate it into sales reports for 125 end users.
Because the AS/400 data was distributed in printed reports, the data was often outdated before sales representatives received it, says Jeff Gleason, McLeod unit manager for Oracle systems. The sales department was more comfortable using the NT system, so the company decided to replicate the data and make it available online. InfoPump facilitates data changes and movement between the two platforms, extracting data from the AS/400, converting it into a SQL statement, and storing the SQL statement in the Oracle database.
How does a site like McLeod know what to replicate and how to begin? The process is complicated, but not impossible.
What You Have
Before you can begin reincarnating your data, you must first understand what you already have. Any data that is destined for analytical use must be assessed to determine the completeness of its information, the way it is structured and the consistency of formatting. Because many NT sites use data that is both internally generated and externally purchased, it is common to have data in various formats, states of completeness and degrees of detail.
The first task, then, is to review the data. Many IT managers embark on data warehousing or data mining projects convinced that their data is clean and complete. But the opportunities for problems are vast. In an address field, is the state shown as a two-character abbreviation or as a numeric code between 1 and 50? Do all the date fields have valid entries, or are some empty? Do fields contain out-of-range values, typographical errors or incorrect data types?
Abbreviations can cause confusion: Is the entry "N" for north, number or no? Do abbreviations include periods? Do customer names use middle initials? What is the placement and usage of suffixes such as Jr., Sr., or numeric representations such as III?
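Before any formal tooling is chosen, even a small profiling script can surface these problems. The Python sketch below counts suspect state and date values in a customer extract; the file name, field names and the abbreviated list of valid states are assumptions made for the example, not details from any site mentioned here.

```python
# A minimal profiling sketch for the kinds of checks described above.
# Field names (state, order_date) and the CSV layout are hypothetical.
import csv
from datetime import datetime

VALID_STATES = {"IA", "MN", "WI", "ND", "SD"}  # extend to all 50 states as needed

def profile(path):
    issues = {"bad_state": 0, "empty_date": 0, "bad_date": 0}
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            if row.get("state", "").strip().upper() not in VALID_STATES:
                issues["bad_state"] += 1          # numeric code, typo or blank
            raw_date = row.get("order_date", "").strip()
            if not raw_date:
                issues["empty_date"] += 1
            else:
                try:
                    datetime.strptime(raw_date, "%Y-%m-%d")
                except ValueError:
                    issues["bad_date"] += 1       # wrong format or invalid date
    return issues

print(profile("customers.csv"))
```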
What You Need
Once you understand what you have, you must consider what you need. If you want to use a data mart to analyze distributor patterns by region of the country, you need data that clearly and completely includes state information. If you plan to analyze product consumption by buyer income, however, complete and correct state affiliations may not be worth the effort to obtain.
Once you have identified what you need to support the analysis operations, you can begin pulling it together through replication and propagation. This can include only a few passes over the data, or multiple passes that handle conversions, translations and aggregations.
For example, if your source data is from a mainframe system or an AS/400, you may require one pass to convert EBCDIC data to ASCII. Another pass of the data may convert numeric state codes into alphabetic representations and place incomplete state fields in an exception file. Point-of-sale transactions that have been stored as discrete records may be aggregated into transactions summarized by time, product or other categories, depending on the ultimate data warehouse application.
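A single cleansing pass of this kind is easy to picture in code. The Python sketch below assumes the extract has already been translated from EBCDIC to ASCII; the numeric-to-alphabetic state mapping, file names and field names are invented for illustration.

```python
# A sketch of one cleansing pass: convert numeric state codes to alphabetic
# representations and route incomplete records to an exception file.
import csv

STATE_CODES = {"15": "IA", "23": "MN", "49": "WI", "34": "ND", "41": "SD"}

with open("extract.csv", newline="") as src, \
     open("clean.csv", "w", newline="") as good, \
     open("exceptions.csv", "w", newline="") as bad:
    reader = csv.DictReader(src)
    ok = csv.DictWriter(good, fieldnames=reader.fieldnames)
    err = csv.DictWriter(bad, fieldnames=reader.fieldnames)
    ok.writeheader()
    err.writeheader()
    for row in reader:
        state = STATE_CODES.get(row.get("state_code", "").strip())
        if state:
            row["state_code"] = state
            ok.writerow(row)     # cleansed record continues downstream
        else:
            err.writerow(row)    # incomplete state lands in the exception file
```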
Much of this effort can be handled by sorting tools from companies such as SyncSort Inc. (Woodcliff Lake, N.J., www.syncsort.com). These tools can process the tasks in bulk before piping cleansed data into the new data warehouse. Pre-sorting the data can reduce processing time by 60 percent to 80 percent. The piping feature, a well-known Unix capability, takes the output of the sort and feeds it directly into the data warehouse.
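The pre-sort-and-pipe approach might look something like the sketch below, which chains the Unix sort command into a load step. The "bulk_loader" command and its flags are placeholders for whatever load utility the target warehouse actually provides.

```python
# Chain an external sort into a (hypothetical) bulk loader via a pipe,
# so sorted, cleansed records stream straight into the warehouse.
import subprocess

sort_proc = subprocess.Popen(
    ["sort", "-t", ",", "-k", "1,1", "clean.csv"],   # sort on the first field
    stdout=subprocess.PIPE,
)
load_proc = subprocess.Popen(
    ["bulk_loader", "--table", "sales_fact"],        # placeholder loader CLI
    stdin=sort_proc.stdout,
)
sort_proc.stdout.close()   # let sort see a broken pipe if the loader exits early
load_proc.wait()
```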
There are multiple NT propagation tools, but most require an understanding of replication technologies and the intricacies of the systems involved. For example, McLeod attaches triggers to each AS/400 file to be replicated; the triggers fire on predetermined events and place data into staging files, which are ordered chronologically by transaction ID. When the triggers send the records into the staging files, the records carry a transaction-type value that indicates whether an insert, update or delete operation was performed against the primary file on the AS/400. InfoPump recognizes the insert/update/delete value and creates the appropriate statement. For example, if the original operation on the file was an insert, InfoPump creates an insert statement, converting the columns from the file into the values that were stored. InfoPump then stores that record in another file on the Oracle database, and a parameterized procedure executes each statement. "We got incredible performance improvements from doing that," Gleason says.
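The staging-file pattern Gleason describes can be modeled in a few lines. The Python sketch below turns a staging record, tagged with its transaction type, into the corresponding parameterized SQL statement; the table name, column names and tag values are hypothetical, not taken from McLeod's implementation.

```python
# A simplified model of the trigger/staging-file pattern: each staging record
# carries a transaction type, and the replicator emits the matching statement.
def to_sql(record):
    table = "customers"
    op = record["txn_type"]          # 'I', 'U' or 'D', set by the AS/400 trigger
    if op == "I":
        cols = ", ".join(k for k in record if k != "txn_type")
        vals = ", ".join(f":{k}" for k in record if k != "txn_type")
        return f"INSERT INTO {table} ({cols}) VALUES ({vals})"
    if op == "U":
        sets = ", ".join(f"{k} = :{k}" for k in record
                         if k not in ("txn_type", "cust_id"))
        return f"UPDATE {table} SET {sets} WHERE cust_id = :cust_id"
    if op == "D":
        return f"DELETE FROM {table} WHERE cust_id = :cust_id"
    raise ValueError(f"unknown transaction type: {op}")

staging_row = {"txn_type": "I", "cust_id": 42, "name": "ACME CO", "state": "IA"}
print(to_sql(staging_row))   # a parameterized statement, executed later in order
```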
EDQM
Vendors say that many IT managers attempt to skip the assessment and cleansing processes. While NT sites with shorter life histories may have somewhat cleaner data than ancient legacy systems, nearly every organization comes face to face with the issue of data quality.
The industry has even coined a new acronym for this issue: Enterprise Data Quality Management (EDQM). Comparable to Total Quality Management in the manufacturing industry, EDQM "ensures the accuracy, timeliness, relevance and consistency of data throughout an organization…to ensure that decisions are made on consistent and accurate information," states a white paper published by Trillium Software (Billerica, Mass., www.trilliumsoft.com).
The primary tenet of EDQM is that sites can prevent data quality problems better than they can cleanse data after the fact. EDQM, therefore, encourages designing data quality processes in small, definable projects with an eye toward incorporating the philosophy of enterprisewide quality.
EDQM brings data cleansing to the system level, as a utility available enterprisewide. This can be significant for organizations that use corporate data throughout multiple data marts. If each department typically cleanses the data for its own needs, efforts are duplicated and expenses are increased. EDQM initiatives cleanse data when it is entered into the system, making clean data available to all data marts without additional effort.
New tools, such as Trillium’s Software System, provide context-sensitive cleansing of the data at the point of entry, or in batch mode for legacy data. Earlier tools placed cleansing code between the data extraction and data loading phases of a warehouse implementation; EDQM tools instead act as an enterprisewide system utility, using conditional logic, spell-checking and corporate business rules to validate data upon entry and reject inappropriate entries.
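In spirit, point-of-entry validation reduces to running business rules at capture time and rejecting records that fail them. The toy Python validator below illustrates the idea; the rules themselves are examples, not Trillium functionality.

```python
# A toy point-of-entry validator in the EDQM spirit: business rules run when a
# record is captured, so rejected entries never reach any downstream data mart.
RULES = [
    ("state must be a two-letter code",
     lambda r: len(r.get("state", "")) == 2 and r["state"].isalpha()),
    ("order_total must be positive",
     lambda r: r.get("order_total", 0) > 0),
    ("customer name is required",
     lambda r: bool(r.get("name", "").strip())),
]

def validate_on_entry(record):
    failures = [msg for msg, check in RULES if not check(record)]
    if failures:
        raise ValueError("rejected: " + "; ".join(failures))
    return record   # clean record is now available to every data mart

validate_on_entry({"name": "ACME CO", "state": "IA", "order_total": 125.0})
```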
Microsoft Cleans House
For complex data warehouse applications, such functionally rich products are an advantage. The sales organization at Microsoft Corp. uses the Trillium product to support an enterprisewide revenue tracking application.
The database -- which tracks Microsoft sales data from billing systems, key distributors and resellers -- provides executives with a revenue picture down to the customer level, according to Jon Elliott, senior sales manager at Microsoft. From this database, the company rebates its trading partners, compensates its sales force, and forecasts inventory builds.
Residing on a SQL Server 7.0/NT architecture, the database is growing ferociously: Each month the company adds 4 million transactions and 100,000 new customers. Microsoft needed a unified data cleansing strategy to reduce the number of data errors and match customer information so that the data warehouse contained the most useful information.
Until recently, data cleansing activities were performed by 20 employees who were responsible for the monotonous work of matching customers, "de-duplicating" entries and isolating unmatched customer data. In August, the company implemented the Trillium product in an effort to automate the process.
The product standardizes entries and parses the data to produce a suggested list of matches with new customer data. "What used to take one day to match now takes 25 minutes," Elliott says. Matches of new data to existing customers have increased 147 percent from 10,000 per month to 25,000, with a near-term goal of 30,000.
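Standardize-then-match logic of this general kind can be approximated with very little code, as in the Python sketch below, which uses the standard library's difflib for similarity scoring. It is only a rough stand-in for the parsing and matching rules a commercial product applies.

```python
# Standardize customer names, then suggest likely matches against the
# existing customer list. Names and thresholds are illustrative only.
import difflib

def standardize(name):
    name = name.upper().replace(".", "").replace(",", "")
    for full, short in (("INCORPORATED", "INC"), ("CORPORATION", "CORP"),
                        ("COMPANY", "CO")):
        name = name.replace(full, short)
    return " ".join(name.split())

existing = {standardize(n): n for n in ["Acme Company", "Contoso Corporation"]}

def suggest_matches(new_name, cutoff=0.85):
    key = standardize(new_name)
    hits = difflib.get_close_matches(key, existing.keys(), n=3, cutoff=cutoff)
    return [existing[h] for h in hits]

print(suggest_matches("ACME CO."))   # -> ['Acme Company']
```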
Elliott was able to reassign eight of the original 20 cleansing employees to higher-level (and more satisfying) analysis tasks only one month after installing the Trillium product. As the product tuning continues, he expects to trim another four heads from the cleansing staff. "Obviously the impact is significant," Elliott says.
Although vendors say that the return on investment for data propagation, replication and cleansing technologies is difficult to predict, it is clear that many organizations have benefited from more timely and accurate data. The credit card division of a Canadian bank cleaned 9 million records within four weeks, saving $40,000 in a single implementation.
Yet the process is not one that can be executed once and then forgotten: Getting the most from your data is an ongoing effort. The bank credit card division buys new data almost weekly, then cleans and matches it against its own marketing data. But when production data contributes to your bottom line through multiple lives, the effort -- and the expense -- may prove to be well worth it.
-- Cheryl D. Krivda is a technical journalist specializing in IT topics. She can be reached at [email protected].