Requirements for ETML Tools

As I’ve said in previous columns, one of the most difficult, time-consuming and expensive tasks associated with the data warehouse life cycle is the extract, transform, move, load (ETML) process. This process is critical to the success of any decision support software (DSS) database, whether it’s an enterprisewide data warehouse, a departmental data mart or a multidimensional OLAP cube.

In this column I’ll discuss some issues specific to DSS databases that you’ll need to consider when selecting an ETML tool. Next time, I’ll probe a little deeper into the various tools that are on the market.

Extract -- The first thing that needs to be done is to extract the data from the source systems. I use the plural here because it’s relatively rare for any DSS database of consequence to be populated from a single source. The extraction product needs to be able to retrieve data from hardware environments, operating systems, file systems and databases. Many of the legacy databases aren’t relational, so the software must be able to understand COBOL files, IMS or VSAM files, and Pick, Adabas, Mumps and other less common database formats.

Some tools depend on the system administrator to run a batch job that leaves data in a flat file for the tool to pick up. Obviously, it’s preferable for the tool itself to extract the data extraction in a single pass. Another important feature is change data capture. As your database changes, you want to extract only the data that changed since the last refresh.

Transform -- The transformation process prepares the data for the target database. This can be as simple as changing a gender field from "1" to "M," or as complex as writing a complete program. In some instances a multistep transformation process is required, with the output of one stage becoming the input for the next stage.

The transformation product should have a rich set of built-in transformation functions, especially time-related functions. It should provide you with a programmable scripting environment, preferably using a standard, nonproprietary language. Another, related feature is the ability to call previously written code modules. You might also consider the need for specialized products for address verification or detailed cleansing. Just think about the many ways your own name might appear in a database record. The tool you select should be able either to perform these functions or to interoperate with one that does.

An important architectural consideration is the transformation processing location. Some tools apply the transformation logic on the host platform, while others move it to an intermediate staging area for further processing. The important thing to consider is how to maximize the overall throughput of the process.

The transformation logic should be readily available to end users who need to know where the data came from, and what was done to it to prepare it for the warehouse. This implies either a metadata navigation tool or a metadata repository with public APIs that are supported by the desktop tools you’ll be deploying.

Move -- Data must be moved, preferably as fast as possible and with as little disruption as possible, to the overall computing environment. Bulk data movement is preferable to dribbling data across the wire, unless you’re updating the DSS database with real-time data feeds. It’s crucial to consider encryption if you’re moving data across an insecure or hostile network. For example, if the vendor supports only FTP, your corporate trade secrets may be transmitted in the clear. Also, if the data resides on a mainframe, look for support for high-speed, mainframe-class connectivity.

Load-- Finally, you’ve got the data onto the target machine and you’re ready to load it. Again, for high performance, look for bulk loader support. The faster you can stuff the data into the data warehouse, the better. You should factor in time for sorting, which you may want to do prior to data loading, along with index building and data aggregation. Look for a platform that won’t require you to completely rebuild your indexes every time you update the database. It would also be good to precompute aggregates before they are loaded into the database.

This is all complicated stuff, and there are some issues I haven’t covered, but these are the ones I believe are the most important. Next time, we'll talk more about specific products.

Robert Craig is director, Data Warehousing and Business Intelligence Division, at Hurwitz Group Inc. (Framingham, Mass.). Contact him at rcraig@hurwitz.com or via the Web at www.hurwitz.com.