In-Depth

AnalytiX Pre ETL Mapping

A decade ago, using spreadsheets to manage ETL source mappings might've made sense. Today, an upstart BI vendor claims, it isn't just inelegant -- it's anachronistic.

Upstart data integration (DI) vendor AnalytiX Data Services Inc. aims to shine a light on a not-so-well-known aspect of the ETL process: pre-ETL source mapping.

It isn't quite a dirty little secret, says AnalytiX CTO Michael Boggs. "It's one of the integration problems that most of the key vendors have pretty much left untouched," he says. "When you build a data warehouse, you have to create these complex pre-ETL source mappings before your ETL developers start work. The way [most organizations] do this is with spreadsheets. Every organization has an exorbitant amount of spreadsheets that they use to document this stuff. Once they've handed [them] off to the ETL developers, they're never maintained."

There are several problems with this approach, says Boggs. First, it's fundamentally ungovernable. This is particularly true in large DW projects, where spreadsheets tend to proliferate, such that a spreadmart hell of competing .XLS source mappings isn't unusual.

Second, using spreadsheets is inherently -- and definitively -- non-authoritative: source mappings change as you design and test your ETL jobs. A spreadsheet that once functioned as the single or authoritative catalog of all source mappings might not get updated -- or (just as likely) might get updated with incorrect or incomplete information -- as the ETL design process evolves.

Third, Boggs continues, there's little documentation. This poses problems in both the near term -- e.g., during quality assurance (QA) testing -- and in the longer term, too: it's that much harder to maintain and evolve a data warehouse if you haven't adequately documented your pre-ETL design and planning, he claims.

"Your QA teams always want to go back to the original target specifications [during final testing], but then you find that even if you have the original target specification, it's changed so much [as a result of design], that it doesn't bear any resemblance to what you have in production," Boggs maintains.

Boggs and AnalytiX say that they aim to complement the big ETL tools, which haven't kept pace with technological change. A decade ago, using spreadsheets to keep track of one's ETL source mappings might've made sense, Boggs concedes; nowadays, it isn't just inelegant, it's anachronistic.

It's an anachronism to which IBM Corp., Informatica Corp., and several other players only recently started to pay attention, says Boggs, although he concedes that IBM introduced a source mapping tool in the InfoSphere DataStage version 8.0 timeframe.

"Some of the ETL vendors are doing some level of pre-ETL mapping. We were the first to actually do something about it, and we do it much better. They just started entering into that space. Most of them don't have a version 1.0 [general availability] release; most of them are in beta releases. We have about 200-plus customers, a lot coming from DataStage or PowerCenter."

IBM InfoSphere DataStage and Informatica PowerCenter, of course, are probably the two best-known ETL tools on the market. AnalytiX's flagship product, Mapping Manager, now in version 3.0, can auto-generate ETL source mappings for both platforms, along with SQL Server Integration Services (SSIS) from Microsoft Corp. It can also "reverse-engineer" -- using a combination of Excel spreadsheets and the (typically) XML files generated by the tools themselves -- mappings for existing projects, Boggs indicates.

"The tool is self-generating and self-documenting. We have quick, efficient metadata scanners. We automate the mapping, so instead of having analysts out there manually typing column names and business rules, we offer a drag-and-drop, Web-based environment where they can pick and choose [from among] sources and columns."

AnalytiX doesn't necessarily have an adversarial relationship with the ETL heavies, either: 15 months ago, for example, it took home a Grand Prize at the Informatica World 2010 conference, where it was selected as the best complimentary application out of 140 competing entries. It has partnerships with several computing giants, including Hewlett-Packard Co. (HP), which resells Mapping Manager.

"We're agnostic. We focus on solving the data mapping problem. A lot of our customers ... [are] using [both] PowerCenter and DataStage," Boggs continues. "It doesn't matter if your teams are using Informatica, SSIS, or DATAStage: we can automate those jobs [and] export them out automatically."

Although pre-ETL source mapping is a problem to which ETL vendors are starting to pay attention, it's likewise a problem that several DI players claim -- at least in large part -- to have licked.

DW specialist Kalido, for example, delivers a "Unified Load Controller" that can run queries directly against database sources and then "land" the data into its Kalido Information Engine, which automatically generates the appropriate mappings. Its message: ETL can, to a degree, be "avoided."

At the recent TDWI World Conference in Las Vegas, in fact, Kalido's booth featured a poster with the universal "No" symbol superimposed over the letters "ETL."

Kalido attacks the problem by trying to eliminate -- or at least minimize -- the role of ETL, but WhereScape Inc. markets a planning and design tool that deals explicitly with the pre-ETL design process. WhereScape Data Driven Design (or "3D") can automate source and target mapping, along with other pre-ETL activities, such as scoping, sizing and appraisal. Like Analytix Mapping Manager, it also generates documentation.

Must Read Articles