Focus Topic - Getting from Here to There Is the Easy Part

Business managers often get absolutely giddy when they describe the possibilities of the organization's upcoming data warehouse. Waving their arms here, there, and everywhere, they explain how data from the order entry system here and the customer service system there and a bunch of other applications someplace else will all be consolidated in the data warehouse. Then, with a click of a mouse, they will have fast access to all the latest company information. Nobody wants to squelch this enthusiasm, but getting the information from here, there, and everywhere is just the beginning.

Populating the data warehouse involves a number of distinct processes: data extraction, bulk data transfer, data transformation, and data cleansing. The actual transfer is the easiest part. Identifying the right information, agreeing on how it should be normalized for presentation, transforming it to fit a common data model, and cleansing it to ensure the consistency and integrity of query results all require much more work.
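
To make the division of labor concrete, here is a minimal sketch of the pipeline in Python. The steps are deliberately trivial placeholders and the field names are invented; it shows only the order in which the processes are applied, not what real tools do inside each step.

    # Schematic warehouse-load pipeline: extract, transfer, transform, cleanse.
    # All names and fields are hypothetical placeholders.
    def extract(source_rows):          # pull candidate rows from production
        return list(source_rows)

    def transfer(rows):                # move them in bulk: "the easy part"
        return rows

    def transform(rows):               # fit the rows to a common data model
        return [{**r, "loaded": True} for r in rows]

    def cleanse(rows):                 # enforce consistency and integrity
        return [r for r in rows if r.get("id") is not None]

    warehouse = cleanse(transform(transfer(extract([{"id": 1}, {"id": None}]))))
    print(warehouse)                   # only the consistent row survives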

The work doesn't stop once the warehouse is fully loaded -- the data must be regularly updated. Often, the data warehouse feeds smaller, more specialized data stores, called data marts, which must also be kept fully stocked with the latest data. Very quickly, IS finds itself managing the movement and transformation of large amounts of data all around the enterprise, between production systems and the data warehouse and between the data warehouse and various data marts.

IS old-timers, those who remember the days when moving data in bulk between systems simply required writing a few lines of file transfer code, will assume the job can be handled as a quick batch file transfer operation. Although the AS/400 still supports the FILE COPY command in RPG, "things have become much more complicated than a simple file transfer," advises John Otterstatter, vice president/development, Showcase Corp., a data transfer tool provider based in Rochester, Minn.

For AS/400 shops, the issue today goes far beyond just getting a large chunk of data from one AS/400 to another. "There are issues with cleansing the data, bandwidth, snapshots versus deltas, and transformation," Otterstatter continues. The information, or at least some of it, may not be coming from an AS/400; or it may be coming from an AS/400 but ending up on something else.
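
The snapshot-versus-delta distinction Otterstatter mentions can be illustrated with a small Python sketch. SQLite stands in for the production database, and the table, columns, and timestamp scheme are all invented; real AS/400 replication tools typically read changes from the system journal rather than from a timestamp column.

    # Snapshot vs. delta extraction against a stand-in production table.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, changed_at TEXT)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                     [(1, 100.0, "1999-01-01"), (2, 250.0, "1999-01-02")])

    def snapshot(conn):
        # Full snapshot: copy every row, regardless of when it changed.
        return conn.execute("SELECT id, amount FROM orders").fetchall()

    def delta(conn, last_run):
        # Delta: copy only rows changed since the last replication run.
        return conn.execute(
            "SELECT id, amount FROM orders WHERE changed_at > ?", (last_run,)
        ).fetchall()

    print(snapshot(conn))             # all rows, every time
    print(delta(conn, "1999-01-01"))  # just the rows changed since last run

A snapshot is simple but retransmits everything; a delta keeps bandwidth down but requires a reliable way of knowing what changed.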

Choosing the Best Tool

There are many reasons why AS/400 shops may have to move data around in bulk, and many tools are available to help do the job. The tool choice depends on why the organization needs to transfer the data, the data source and target platforms involved, and the requirements of the information itself.

Organizations, for example, may need to move data for purposes of high availability and disaster recovery. In these cases, the data moves in bulk, as complete files, between like machines, such as from one AS/400 to another. Here, FILE COPY may be sufficient, although if there is a need to continuously mirror changes to an AS/400 production system or to ensure that information isn't lost in the event of a sudden failure, additional functionality is required.

When the file transfer is for high availability, simple FILE COPY may not be enough. "The real issue is performance. You want to put a minimal load on the production CPU. You want a system that works in background and is transparent," says Therese Sullivan, director/marketing communications for Lakeview Technology Inc. (Oakbrook, Ill.), developer of MIMIX. A high availability system also deals with objects, so a complete system is replicated, not just the data. On the other hand, high availability replication systems don't bother with data cleansing, transformation, or cross-platform capabilities.

The organization may need to move data from one machine to another simply for efficient network access. A global organization, for example, may want to keep a copy of central sales or inventory data on a European server to avoid international telecommunications charges (although the Internet may soon make this need obsolete). While FILE COPY, again, might be sufficient, the organization may want to be more selective about the records it sends to the European server, or it may want to send only the data that has changed since the last replication, which requires a more sophisticated type of transfer.

Chemtronics, Kennesaw, Ga., a manufacturer of compressed air and other products, replicates data from its production systems running on an AS/400 to a Windows NT-based SQL Server database for basic reporting purposes. The company uses RapidDATA to replicate half a million sales transactions.

The reason for this approach, notes Chemtronics MIS manager Thomas Larson, is to free the company's aging AS/400 model 9406 from having to handle anything but its core production applications. Any cleansing, transformation, and aggregation takes place primarily on the SQL Server system. "We're mainly bringing the fields across as they are," Larson explains, although the company does use RapidDATA to perform some minor data manipulation in the process.

While Chemtronics uses data transfer mainly for reporting, the biggest driver of demand for file transfer tools is increasingly full-blown data warehousing. Data warehousing, however, puts a host of new demands on the file transfer process. "Data warehousing consists of pools of historic data that have been summarized and normalized for the purposes of analysis. You don't just keep a copy of your production data," Otterstatter notes.

For data warehousing, bulk file transfer is not likely to be the best choice. Instead, the organization will want to do more selective extraction and replication combined with some amount of transformation of the raw data into something that can be used for analysis.

OshKosh B'Gosh Inc. (Oshkosh, Wis.), a clothing manufacturer, uses the AS/400 as both the source and target database for its series of data marts. To handle the initial loading and ongoing updates of the data marts, OshKosh turned to DataMirror Corp. (Markham, Ontario). "We update every night, and we do a lot of transformation and cleansing with DataMirror in the process," explains Jon Dell'Antonia, VP of MIS at OshKosh.

The kind of transformation OshKosh performs mainly revolves around converting production codes into something that users can understand, or creating new data that results from calculations performed on other data. To cleanse the data, the company fills in optional fields that the production system didn't require. "We're really just straightening up the fields that our users will need," Dell'Antonia explains.
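
As an illustration of the kinds of transformations Dell'Antonia describes, the sketch below translates a cryptic production code into a user-readable label, derives a new field by calculation, and fills in an optional field the production system left empty. The code table, field names, and default value are hypothetical, not OshKosh's actual scheme.

    # Hypothetical transform step for a data mart load.
    CODE_TABLE = {"A1": "Infant wear", "B2": "Toddler wear"}  # production codes

    def transform(row):
        out = dict(row)
        # Convert a production code into something users can understand.
        out["category"] = CODE_TABLE.get(row["cat_code"], "Unknown")
        # Create new data by calculating from existing fields.
        out["extended_price"] = row["qty"] * row["unit_price"]
        # Cleanse: fill an optional field the production system didn't require.
        if not row.get("region"):
            out["region"] = "UNASSIGNED"
        return out

    print(transform({"cat_code": "A1", "qty": 3, "unit_price": 9.50, "region": ""}))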

"There are a lot of ways to move data with an AS/400, from file copy to quarter-million-dollar products," says Van Symons, business intelligence executive at IBM's AS/400 server group in Dallas. But before organizations pick a tool, they must resolve some infrastructure questions, Symons believes.

Storage Centers

The infrastructure issue revolves around the relationship between the data warehouse, data marts, and the production systems. For example, organizations often start building specialized data marts because, due to their more limited scope, they are quicker and easier to build than a full-blown data warehouse. However, by the time they get to the third or fourth data mart, they realize how helpful a data warehouse is, notes Symons.

With individual data marts, each must be loaded and its data refreshed separately. Three, four, or more data marts going into the same production systems for updates on a regular basis can become quite a burden. If, however, all the data were updated once to a data warehouse, which then updated individual data marts with subsets of the data, the process would be far less demanding on the production AS/400.
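
Here is a sketch of the hub-and-spoke refresh Symons describes, with invented mart definitions: production is read once per cycle into the warehouse, and each data mart then takes its subset from the warehouse copy rather than querying production again.

    # Warehouse as the single point of extraction; marts draw subsets from it.
    warehouse = []

    def refresh_warehouse(production_rows):
        # The only step that touches the production system.
        warehouse.clear()
        warehouse.extend(production_rows)

    # Each mart is defined by a predicate over warehouse rows (hypothetical).
    MARTS = {
        "sales_mart":     lambda r: r["subject"] == "sales",
        "inventory_mart": lambda r: r["subject"] == "inventory",
    }

    def refresh_marts():
        return {name: [r for r in warehouse if keep(r)]
                for name, keep in MARTS.items()}

    refresh_warehouse([{"subject": "sales", "amt": 10},
                       {"subject": "inventory", "sku": "X1"}])
    print(refresh_marts())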

The data warehouse can also be used as the backup for detailed data. In this case, the data marts contain only summarized data. Should users want to drill down for more detail, they are steered into the data warehouse instead of into the production database. Again, detailed data only needs to be replicated and updated in one place, and the production systems needn't support ad hoc drill-down queries.

To begin sorting out the choices, Symons looks at the specific processes and tasks involved. In the data warehousing environment, where most interest is focused today, organizations have two main data transfer tasks: physically extracting and moving the data, and cleansing, transforming, and aggregating the data into useful information. "This may be one process, but it consists of two separate things to do."

Some tools simply move data. Most tools, such as IBM's Data Propagator and Data Joiner or DataMirror, do both, but they perform the latter, cleansing and transformation, to different degrees. Some will handle only basic cleansing, such as reconciling different names and abbreviations for the same thing. Others will handle complex summaries and aggregations across different platforms or automatically create and fill new fields with calculated results.
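
The name-reconciliation kind of cleansing can be sketched in a few lines; the alias table and normalization rule below are invented, and a real tool would use much richer matching.

    # Basic cleansing: map variant spellings and abbreviations of the same
    # entity onto one canonical name. The alias table is hypothetical.
    ALIASES = {
        "ibm": "IBM Corp.",
        "intl business machines": "IBM Corp.",
        "international business machines": "IBM Corp.",
    }

    def canonicalize(name):
        # Crude normalization: lowercase, drop periods, collapse whitespace.
        key = " ".join(name.lower().replace(".", "").split())
        return ALIASES.get(key, name)  # pass through anything unrecognized

    for raw in ("I.B.M.", "International Business Machines", "Acme Ltd."):
        print(raw, "->", canonicalize(raw))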

FILE COPY, Symons notes, is the most common tool, and it comes free as part of the AS/400 operating system. Data replication tools, however, are much more precise. They will extract and transfer only specific fields, individual records, or only the fields that have changed. Replication tools also often provide recovery and restart capabilities should the transfer be interrupted, Symons explains.
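
The recovery-and-restart capability can be sketched the same way; the checkpoint file and batch size here are invented, and commercial tools typically checkpoint against the database journal rather than a local file.

    # Restartable bulk transfer: persist a checkpoint after each batch so an
    # interrupted run resumes where it left off instead of starting over.
    import json, os

    CKPT = "transfer.ckpt"  # hypothetical checkpoint file

    def load_checkpoint():
        return json.load(open(CKPT))["pos"] if os.path.exists(CKPT) else 0

    def save_checkpoint(pos):
        with open(CKPT, "w") as f:
            json.dump({"pos": pos}, f)

    def transfer(rows, send, batch=100):
        pos = load_checkpoint()          # resume from the last committed batch
        while pos < len(rows):
            send(rows[pos:pos + batch])  # may raise if the link drops
            pos += batch
            save_checkpoint(pos)         # commit progress after each batch
        if os.path.exists(CKPT):
            os.remove(CKPT)              # clean finish: discard the checkpoint

    transfer(list(range(250)), send=lambda chunk: None)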

Other key differentiators are the speed of the transfer and platform support. IBM's Data Propagator, for example, supports all DB2 platforms while Information Builders' transfer tools support a wide range of non-IBM platforms as well.

The latest feature is bi-directional replication. In this case, the data transfer tool can reverse direction, moving data from the previous target database to the previous source database. "When you have a data warehouse, you often create new data of value in the production environment and you want to move it back to other systems," notes Otterstatter.

Industry observers, however, are divided over the value of bi-directional replication in the typical data warehouse setting. The biggest potential for bi-directional replication in data warehousing is between multiple data marts, where each data mart is both a target and source database for the other data marts or an enterprise data warehouse.

In the end, the hard part isn't moving the data but determining what data to transfer and how to transform it. This requires close collaboration between the business users of the data and the data warehouse builders. Observes Dell'Antonia, "You'll run into lots of opinions on what to do. This part can be very divisive if you're not careful, and you just have to work through it." Once this is done, many tools are available to move the data where it needs to go.
