In-Depth
Talend Enhances Data Integration with MapReduce Support
MapReduce is usually thought of as an enabling technology for Big Data analytics. Talend says it can help supercharge data integration, too.
Open source data integration (DI) specialist Talend recently announced support for Apache Hadoop, the increasingly ubiquitous open source implementation of Google Inc.'s MapReduce algorithm.
Talend isn't the first business intelligence (BI) vendor to embrace MapReduce. Its peers in the analytic database arena -- namely, Aster Data Systems Inc. and the former Greenplum Software Inc. (which was acquired by EMC Corp.) -- first announced MapReduce support nearly two years ago. (Greenplum also supported Hadoop before it announced its own -- i.e., native -- implementation of the MapReduce algorithm.)
More recently, other BI vendors (including Teradata Corp.) have followed suit.
Nor is Talend the first data integration vendor to announce support for Hadoop. In June, IBM Corp. announced its own distribution of Apache Hadoop (available via its AlphaWorks developer portal) and likewise unveiled a new Hadoop-powered BigData portfolio, dubbed InfoSphere BigInsights.
At the same time, counters Talend vice president of marketing Yves de Montcheuil, support for Hadoop in Talend's Integration Suite was available as of early July; at this point, IBM is offering a "Technology Preview" of BigSheets, a software layer that enables business users (not just programmers) to interact with Hadoop, as well as a portfolio of Hadoop-related BigInsights services -- in addition to its AlphaWorks implementation of Apache Hadoop.
De Montcheuil says Talend has worked with Cloudera Inc. to develop and refine its support for Hadoop. Cloudera markets its own software distribution of Hadoop (Cloudera Distribution for Hadoop, or CDH) and likewise offers a cloud-based implementation of CDH. "We are the only ones today who have immediately available support for Hadoop for data integration," he points out. "We've registered very strong interest from our community for Hadoop, we've worked very closely with Cloudera … on the technology and the integration to make sure that we get our story right. Being early in the market is going to pay off for us."
Analytic database players such as Aster Data like to position MapReduce as a silver bullet for Big Data analytics. One very compelling MapReduce use case involves data integration, however: Teradata senior marketing manager Dan Graham has described MapReduce-powered DI as "ETL on steroids."
De Montcheuil says that "one very simple way of doing data integration for Hadoop is to see Hadoop … as a place where you store your source or your target data."
Simple connectivity into Hadoop isn't anything to brag about. "Some people just have connectors so that they can get their data into Hadoop or out of Hadoop, but we can go further. Just connecting to Hadoop to put data inside it is useful, of course, but if that's all you're doing, you aren't really taking advantage of the strength [of the MapReduce algorithm]," he notes. "Where we are going … with this Hadoop support is to actually leverage the Hadoop architecture to perform the data integration processes for the transformation of data."
Talend's approach in this regard is more the stuff of ELT -- i.e., extract, load, and transform -- than ETL, de Montcheuil explains.
"What we're doing is we're generating native Hadoop code … and we'll be doing the processing of the data inside the hive of the [Hadoop] database," he says. "All of the processing [and] mappings, preparing the data for reprocessing or analysis -- that's driven by Talend. The actual processing of the data? Hadoop does that for us," de Montcheuil continues. "Hadoop takes care of distributing the processing across all of the various nodes. We don't have to deal with the breakdown of the jobs; Hadoop actually does that for us, too."
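To make the division of labor de Montcheuil describes more concrete, the following is a minimal plain-Java sketch of the map, shuffle, and reduce phases applied to a small transformation job. It is not Talend-generated code and does not use the Hadoop APIs; the class name, the record layout ("customer, amount"), and the sequential loop that stands in for distributed tasks are all assumptions made for illustration. In an actual Hadoop cluster, each input split would be handed to a separate map task, and the framework, not the job author, would handle grouping and distribution.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only: simulates MapReduce phases in one JVM.
class MapReduceSketch {

    // Map phase: clean one raw "customer, amount" record into a key/value pair.
    static Map.Entry<String, Long> mapRecord(String line) {
        String[] fields = line.split(",");
        String customer = fields[0].trim().toUpperCase();
        long amount = Long.parseLong(fields[1].trim());
        return new AbstractMap.SimpleEntry<>(customer, amount);
    }

    // Reduce phase: combine all values that share one key.
    static long reduce(List<Long> amounts) {
        long total = 0;
        for (long amount : amounts) {
            total += amount;
        }
        return total;
    }

    // Runs the job over several input splits. The loops below stand in for
    // work that Hadoop would spread across nodes.
    static Map<String, Long> run(List<List<String>> splits) {
        // Shuffle: group mapped values by key.
        Map<String, List<Long>> shuffled = new TreeMap<>();
        for (List<String> split : splits) {
            for (String line : split) {
                Map.Entry<String, Long> kv = mapRecord(line);
                shuffled.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                        .add(kv.getValue());
            }
        }
        // Reduce each group to a single output row.
        Map<String, Long> result = new TreeMap<>();
        for (Map.Entry<String, List<Long>> group : shuffled.entrySet()) {
            result.put(group.getKey(), reduce(group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<String>> splits = Arrays.asList(
                Arrays.asList("acme, 10", "globex, 5"),
                Arrays.asList("Acme, 7"));
        System.out.println(run(splits)); // prints {ACME=17, GLOBEX=5}
    }
}
```

The point of the sketch is that the job author supplies only the per-record mapping and the per-key reduction; how the splits are scheduled is left to the framework.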
Talend's support for Hadoop is consistent with its DI model: the Talend Open Studio generates Java code that facilitates connectivity, performs any needed transformations, and completes the loading of the data.
"We generate … native Hadoop Java code with all of the Hadoop APIs. Outside of Hadoop, it's just native Java code with a fair amount of SQL inside it, especially if you want to run the ELT approach," de Montcheuil explains.
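The "Java code with a fair amount of SQL inside it" that de Montcheuil mentions can be pictured with a small sketch. In the ELT style, the Java side mostly assembles a SQL statement and leaves the transformation itself to the engine that already holds the data. The example below is an assumption-laden illustration, not output from Talend Open Studio or Talend Integration Suite: the class, method, table, and column names are hypothetical, and the generated statement is generic SQL rather than the dialect of any particular engine.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical illustration only: builds an ELT-style INSERT ... SELECT.
class EltSqlSketch {

    // columnExpressions maps each target column to the SQL expression
    // that computes it from the source table.
    static String buildInsertSelect(String sourceTable,
                                    String targetTable,
                                    Map<String, String> columnExpressions) {
        StringJoiner targetColumns = new StringJoiner(", ");
        StringJoiner selectExpressions = new StringJoiner(", ");
        for (Map.Entry<String, String> mapping : columnExpressions.entrySet()) {
            targetColumns.add(mapping.getKey());
            selectExpressions.add(mapping.getValue());
        }
        // The transformation runs wherever this statement is executed,
        // so no rows pass through the Java process.
        return "INSERT INTO " + targetTable + " (" + targetColumns + ")"
                + " SELECT " + selectExpressions
                + " FROM " + sourceTable;
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the columns in insertion order.
        Map<String, String> mappings = new LinkedHashMap<>();
        mappings.put("name", "UPPER(name)");
        mappings.put("total_cents", "amount * 100");
        System.out.println(
                buildInsertSelect("staging_customers", "dw_customers", mappings));
    }
}
```

A real generator would also have to validate or quote identifiers; the sketch concatenates them directly and so should only be fed trusted names.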
Like other open source offerings, Talend's DI software is available in both subscription (Talend Integration Suite) and free (Talend Open Studio) editions.
Only Talend Integration Suite can generate native Hadoop code, however.