
SyncSort Doubles Down on High-Performance ETL

Performance is a major component of SyncSort's ETL 2.0 push.

A few months ago, data integration (DI) stalwart SyncSort Inc. published the results of a survey which found that a surprising number of shops are still getting by with hand-coded DI tools, even though free DI offerings abound.

They do so even though hand-coded DI is far from ideal, especially when it comes to governance, compliance, and (of course) manageability.

The staying power of hand-coded DI shouldn't surprise anyone, says Jorge Lopez, senior manager of data integration with SyncSort. After all, Lopez argues, DI vendors haven't yet given users a compelling reason to ditch their existing hand-coded assets.

This is the raison d'être behind SyncSort's "ETL 2.0" vision, Lopez says.

"We know that data integration tools as we know them today are failing, but why are they failing? We believe it's because they tend to focus on the wrong things. The most pressing need that organizations have today, at least based on our experience, is to deal with increasing data volumes, with the [increasing] variety of the sources, with the growing complexity of the data, and with shrinking batch windows -- all of those things done in a cost-effective fashion."

As far as Lopez and SyncSort are concerned, the failure of DI as we know it has a lot to do with performance. DI vendors like to claim that they've licked ETL performance issues, he says, when in fact they've looked past ETL performance and focused instead on fleshing out their portfolios of complementary -- and lucrative -- DI services. At the same time, Lopez maintains, many DI offerings are more properly ELT- than ETL-based tools, regardless of what they're called. In other words, these tools -- even tools that have a branded ETL component or engine -- increasingly fob the work of transformation off on the database itself.

By contrast, SyncSort uses its DMExpress ETL engine to perform the requisite transformations, says Lopez. In fact, he explains, doing transformations inside the ETL engine is one of the core tenets of ETL 2.0.

It's putting the "T" -- or the "T" after the "E" and before the "L" -- back in ETL.

"A lot of organizations are realizing the insanity of ELT because it's costing millions of dollars a year in database capacity and also IT staff responsibility. DI tools were originally designed to facilitate the work, and what's happening is that they just end up becoming very expensive schedulers that push all of the transformations down to the database itself," he argues. "So one of the key tenets [of ETL 2.0] was to bring the 'T' back to 'ETL' and do the transformations on your ETL engine. This is not [a] trivial [proposition]: you need a very fast ETL engine, an engine that's efficient, and [which] can also deal with Big Data."

If SyncSort's take on ETL 2.0 is at least somewhat self-serving, its take on the ELT versus ETL debate is at least somewhat tendentious.

Competitors concede that there's some truth to SyncSort's messaging, depending on how one frames the issue. For one thing, notes Itamar Ankorion, vice president of business development and corporate strategy with change data capture (CDC) specialist Attunity Inc., to the extent that the ELT scenario described by Lopez rings true, it's also true that organizations are increasingly tapping specialty analytic databases -- typically, massively parallel processing (MPP) data stores -- to handle all of these fobbed-off transformations.

"There's going to be always an argument between the ETL vendors and the data warehouse appliance vendors about this. There are arguments to be made here that basically [involve] economies of scale. If you look at the size of -- at the power of -- these appliances, handling transformations is nothing for them, and using the resources to do the [transformative] work does not impact the availability of these systems," he argues. "Think about it this way: if you have a 96-core [system hosting your] data warehouse, you might use four cores for transformations. From this perspective, it's not a big issue -- but if you need to buy another big server for ETL, that's a bigger issue."

Another core tenet of ETL 2.0 is improved collaboration and communication between business and IT. The idea, Lopez explains, is to bring business and IT together early in the design process -- and to keep them connected throughout the development process. To achieve this, SyncSort plans to deliver collaborative enhancements in forthcoming revisions of DMExpress, starting next year, says Lopez.

"Data integration still remains isolated from the business users. The communication between [business and IT] is minimal, so what ends up happening is you have the ETL processes and then you have a wall which is the data warehouse and then you have another wall which is the BI tool," he explains. "If you look at the way [IT and the business] collaborate, it's 'collaboration' in name only. For example, many times they'll define the mappings through e-mail, so everything gets lost. The end result is you end up with frustration on the part of the business analyst and the business user."

Compared with SyncSort's stance in the ETL-versus-ELT debate, this is a much less tendentious notion, in part because so many of SyncSort's competitors (or nominal competitors) are talking about the same things.

It's a trend that isn't confined to IBM Corp., Informatica Corp., Oracle Corp., SAS Institute Inc. subsidiary DataFlux, or other DI Powers That Be, either.

Consider DI upstart WhereScape Inc., which recently announced a new data warehouse scoping tool -- WhereScape 3D -- that also promises to boost collaboration between business stakeholders and IT.

Remember those e-mail conversations (along with other not-so-structured exchanges between the line-of-business and IT) of which Lopez spoke? WhereScape 3D gives customers the ability to capture them (e.g., as notes or annotations, or simply as attached collateral) and present them in context.

WhereScape positions 3D as a planning or scoping tool, but CEO Michael Whitehead also enthused about its collaborative potential during a meeting at TDWI's Summer World Conference in San Diego. "It's meant to get you to that shared understanding of what's possible sooner," said Whitehead.

Lopez, for his part, says that DI collaboration is "just a tough nut to crack. In the product today, for example, we don't yet have that collaborative piece. We are looking at what are the key things that we need to include, and this is something that data integration [as a whole] just hasn't been good about addressing."

Collaboration aside, SyncSort doesn't aim to displace DI platforms from Informatica, IBM, or other competitors, Lopez says. It does, however, expect to complement them.

Last year, SyncSort began positioning DMExpress as a "seamless accelerator" for third-party DI tools. "Many times, [customers] start by using their ETL tool to push the transformations down to the database, but after a while they realize that [this makes the ETL tool little] more than a scheduler, so some of them even ... hand-code SQL or PL/SQL for performance. This only gets them so far," Lopez points out.

"A lot of these organizations have already spent millions of dollars procuring, developing, and implementing their existing DI tools, so when they face performance bottlenecks, it's almost a showstopper when you have to tell them that they have to rip and replace everything they have," he continues. "Data integration acceleration is a concept that allows you to readmit the data, say, from Informatica or DataStage. We import the data -- we are able to import and export the metadata, too -- we do the transformations in our engine, and ... we hand everything back to the tool or load it into your database."
