Special Report: Data Movement for Data Warehousing
The process of data movement happens when data from various sources within the organization is moved from its origination point -- a separate server or workstation -- to the data warehouse target. One of the most important considerations in any data movement scheme is maintaining accurate data. Older data movement schemes were fraught with errors, omissions, and failures, any of which make the data unreliable. Reliability is critical, since erratic data can lead to poor or incorrect decisions. Thus, the vendors who supply data movement and replication software have designed their current systems to maintain data accuracy while facilitating the data’s movement.
Moving data from one system to another is not a recent phenomenon. Large organizations have long moved data for storage and access. Traditionally, IS personnel have developed in-house code to move data from one legacy server to another. But inconsistencies arise when database changes, maintenance, or growing data volumes force IS personnel to neglect mission-critical tasks.
The relatively recent shift to heterogeneous computing environments has meant that IS personnel must spend a great deal of time maintaining or rewriting code to move the data. Rather than commit their own limited IS resources to this job, companies have turned to data movement vendors such as DataMirror, LakeView Technologies, Vision Solutions, IBM, and Carleton to supply the software to reliably move, transform, and standardize data for use in data warehouses. These vendors provide ready-made solutions complete with upgrades to match improvements in database software.
These vendors have developed their software to do more than simply copy and transfer data; it also standardizes, cleanses, and validates the data. Historically, file transfer between AS/400s was done manually. Manual transfer required that the AS/400 systems run the same database software, and it was time consuming. To ease the time constraints of manual file transfer, many IS departments wrote scripts to automatically move the entire database. However, these scripts were fairly simple and mostly just moved files. Further, maintenance quickly became a drain on IS resources; copying the entire database, even when only minor changes had occurred, consumed large amounts of processing time. More complex programs were needed for more complex operations, such as moving only portions of files or preventing sensitive data, such as personnel information, from being moved to a data warehouse accessible to everyone in the organization.
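This kind of selective extraction can be sketched in a few lines. The record layout and field names below ("dept", "salary", "ssn") are hypothetical, chosen only to illustrate moving a portion of a file while stripping sensitive fields:

```python
# Sketch: selective record transfer that excludes sensitive fields.
# Field names and record layout are illustrative assumptions.

SENSITIVE_FIELDS = {"salary", "ssn"}

def extract_for_warehouse(records, department=None):
    """Copy only the requested subset of records, dropping sensitive fields."""
    out = []
    for rec in records:
        if department is not None and rec.get("dept") != department:
            continue  # move only a portion of the file
        out.append({k: v for k, v in rec.items() if k not in SENSITIVE_FIELDS})
    return out
```

A simple filter like this is what early scripts lacked: they moved whole files wholesale, so any field in the source ended up visible in the warehouse.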
An alternative to file transfer is replication: copying the database to another machine, most commonly for backup. The replication process leaves the original database at its source and updates just the target platform. Database replication historically required compatible hardware and database software.
File transfer or replication that requires the entire organization to use the same equipment is unreasonable at a time when mergers, acquisitions, and growth result in several different platforms, operating systems, and application software within one organization. The need to move data across heterogeneous platforms, operating systems, and database applications has resulted in the development of transformation software. Now it is possible to gather data on an NT server, a UNIX box, an AS/400, and even a mainframe, and then consolidate the data into a single data warehouse on an AS/400.
AS/400-centered products like DataMirror’s High Availability Suite 3.0, Vision Solutions’ Symbiator, LakeView Technology’s (Praxis) Omni-Enterprise, and IBM’s Propagator all provide transformation solutions to move data, standardize it, and then make the data accessible to all of the source locations.
Data sharing across heterogeneous platforms and databases is accomplished using bi-directional replication. The most common use for bi-directional replication is in data synchronization applications. Centrally administered inventory systems, for example, continually reflect product-level changes in real time from all branch sites.
Cleansing and Validation
Data must be cleansed and validated to remove erroneous and duplicated records. For example, an accounting database could show a customer as William J. Smith, while the sales database has him listed as Bill Smith, and marketing has William Smith -- all for the same person at the same address. Without cleansing and validation software, this one customer would show up as three customers in the data warehouse. Cleansing and validation can take place on the AS/400 or, as with Carleton Corporation’s PASSPORT, on an NT server, with the results then transferred to the AS/400.
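A minimal sketch of this kind of duplicate matching, in Python. The nickname table and record fields below are illustrative assumptions, not any vendor’s actual matching rules:

```python
# Sketch: collapse duplicate customer records by normalized name + address.
# The nickname map and middle-initial rule are illustrative assumptions.

NICKNAMES = {"bill": "william", "bob": "robert", "jim": "james"}

def normalize(name):
    """Lowercase, expand common nicknames, and drop middle initials."""
    parts = name.lower().replace(".", "").split()
    parts = [NICKNAMES.get(p, p) for p in parts]
    parts = [p for p in parts if len(p) > 1 or len(parts) == 1]
    return " ".join(parts)

def dedupe(records):
    """Keep one record per (normalized name, address) pair."""
    seen = {}
    for rec in records:
        key = (normalize(rec["name"]), rec["address"].lower())
        seen.setdefault(key, rec)  # first occurrence wins
    return list(seen.values())
```

With these rules, "William J. Smith", "Bill Smith", and "William Smith" at the same address all normalize to the same key and collapse into a single customer record.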
But cleansing and validation alone are not enough to make the data usable. Gartner Group analyst Janelle Hill says, "It is not just the replication of the data; it also has to be in a format that makes sense to the user." To put it in a more usable format, the database must be standardized; for example, a name stored in a single field [Smith, John A.] must be split into multiple fields [Smith] [John] [A]. Also, the data movement software must merge information from varied source formats, such as SQL Server, Oracle, Sybase, Access, and Progress.
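The field-splitting step above can be sketched as follows; the "Last, First Middle" input format is assumed from the example, and real standardization software handles far more layouts:

```python
def split_name(field):
    """Split a single-field name like 'Smith, John A.' into
    (last, first, middle). Assumes a 'Last, First Middle' layout."""
    last, _, rest = field.partition(",")
    given = rest.strip().rstrip(".").split()
    first = given[0] if given else ""
    middle = given[1] if len(given) > 1 else ""
    return last.strip(), first, middle
```

So `split_name("Smith, John A.")` yields the three separate fields `("Smith", "John", "A")`, ready to load into distinct warehouse columns.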
As businesses shift data storage to a more customer-centric (versus product-centric) approach, data movement software is expanding across all industries. Retail, insurance, manufacturing, financial, and telecom are the most significant adopters. Factors pushing for greater data movement are Business Intelligence (for decision support), Enterprise-Level Applications (ERP), and the move away from legacy systems as a result of mergers, acquisitions, and growth. Another factor is growth in the data itself--while databases continue to expand, sometimes exponentially, organizations must maintain data continuity or struggle to reconcile data from one part of the organization to another.
Another capability contributing to the growth of data movement software is incremental movement--updating the data warehouse with only the changed data, without reloading the entire database. According to Randy Corke, LakeView Technologies’ Director of Marketing, up to 75 percent of the reload time can be recovered, freeing up valuable processing time. For example, if reloading the entire database into the data warehouse took eight hours, the same update, applied as just the changes, could take as little as two hours -- a six-hour savings in processing time.
The AS/400 market is just beginning to use data movement software for data warehousing. According to Randy Corke, there is currently less than 10 percent penetration in the market. However, newly available as well as future technologies will contribute to its growth.
The next step in data warehousing is to expand it into an enterprise-wide knowledge management system. A standardized data warehouse available to all segments of the organization removes the need for independent, ad hoc data warehouses for each segment. This provides consistency throughout the company while giving each segment access to data previously unavailable.
Van Symons, IBM Business Intelligence Executive, sees the future of data movement technology and knowledge management tied to three areas: first, integration of the data warehouse with business intelligence; second, advancements in web technology and database software; and finally, the ability to access nontraditional data sources, such as spreadsheets and documents.
Data movement software consisting simply of file transfer and replication is insufficient for today’s business needs. Transformation software still requires cleansing, validation and standardization to make the data useful to all parts of the organization. Whether the data warehouse is used for consolidating heterogeneous data or for high availability, the data must be accurate and presented in a useful format using understandable field names.
Data movement software makes data more usable and understandable. Whether updating the data warehouse daily, weekly, or monthly, data movement software enhances the information stored, reduces storage time, and makes the data readily available to the entire organization. Better decisions are made because decision makers have consistent, accurate information. Because data movement software is automated, it frees up IS resources while standardizing and consolidating data. The result is "a single version of the truth" that facilitates cohesive, and therefore better, decisions.
There are different ways to replicate data--continuous mirroring, periodic net-change, or full copy refresh. Continuous mirroring, also known as real-time replication, updates target systems as changes (additions, updates, deletions) occur at the source. Periodic net-change stores captured database changes until a predetermined replication time (nightly or hourly); then only the records that have changed since the last update are sent to the target databases. Finally, a full copy refresh replicates an entire database; this is generally done after a power outage or upon initial synchronization.
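The net-change and full-refresh modes can be contrasted in a short sketch. The in-memory dictionaries and change journal below are illustrative stand-ins, not any product’s actual mechanism:

```python
# Sketch: periodic net-change replication vs. full copy refresh.
# Dictionaries stand in for databases; the journal stands in for
# captured changes awaiting the scheduled replication time.

source, target, journal = {}, {}, []

def apply_change(key, value):
    """Record a change at the source and journal it for later replication.
    A value of None represents a deletion."""
    if value is None:
        source.pop(key, None)
    else:
        source[key] = value
    journal.append((key, value))

def replicate_net_change():
    """At the scheduled time, ship only the journaled changes to the target."""
    for key, value in journal:
        if value is None:
            target.pop(key, None)
        else:
            target[key] = value
    journal.clear()

def full_copy_refresh():
    """Fallback: replace the entire target, e.g. for initial synchronization."""
    target.clear()
    target.update(source)
```

The processing-time savings quoted earlier come from exactly this distinction: `replicate_net_change` touches only the journaled rows, while `full_copy_refresh` rewrites everything.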
Good Data, Good Decisions
Data consolidated in a data warehouse facilitates decision making. People can make better decisions when they have reliable information derived from computer-based data. However, if the data varies from department to department, decision making is, at best, difficult. Byron Goodes, a senior consultant with Aurora, Ontario-based Astech Solutions says, "You need to have one version of the truth--one single database."