Clearing Up Big Data/Hadoop Confusion

Data management vendors are anxious to put their own spins on big data.

For many, "Hadoop" has become synonymous with the term "big data" -- just as MapReduce and Hadoop likewise tend to be used interchangeably.

Data management (DM) vendors are anxious to correct these impressions (or misperceptions, to hear some describe it). Call it the big data spin zone.

For example, Jim Hare, program director for big data product marketing with IBM, politely (but pointedly) protests that Hadoop and big data are not the same thing.

"A lot of people associate big data with Hadoop. We're really trying to correct that impression. [Hadoop is] one element and one capability that's required to address one part of the big data problem," says Hare, who argues that "you really need multiple capabilities."

Not surprisingly, IBM -- with its catalog of DM, data integration (DI), and enterprise application integration (EAI) tools, plus its own Hadoop distribution -- claims to supply a good many, if not all, of these capabilities. Hare, for example, points to Big Blue's DI and governance tools, which he says can complement and augment the raw data produced by Hadoop and MapReduce.

"It's great to be able to pull all of this data in, but can you trust it?" he says, invoking the so-called "3Vs" of big data: volume, variety, and velocity. IBM, Hare says, focuses on a "neglected" V: veracity.

"Using our governance products, we provide the ability to make sure you can trust the information you're getting" from Hadoop or other big data sources. Using IBM's advanced analytic technologies -- because IBM, with its portfolio of Cognos- and SPSS-based assets, does analytics and data mining, too -- enterprises can bring their analytics to big data.

"Our strategy with big data is trying to move the analytical process closer to the data," he explains. "Sitting on top of our [data warehousing] platform is sort of where our traditional analytics sit."

Hadoop Isn't MapReduce

Dave Inbar, senior director for big data products at Pervasive, seems most bothered by the conflation of Hadoop and MapReduce. It isn't simply that Hadoop isn't MapReduce, laments Inbar; it's that Hadoop is much bigger -- and, in Inbar's telling, much more elegant -- than MapReduce.

Inbar, in fact, describes Hadoop as "a beautiful platform for all kinds of computation," not least because it addresses three problem areas (namely, data distribution, compute distribution, and coarse-grained parallelism) that have long bedeviled information science. By contrast, he's scathing in his treatment of MapReduce, which he describes as "a chain and shackle ... because it forces you to define your compute solutions in particular ways, including shuffling a lot of data around in intermediate systems."
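
To make that constraint concrete, below is a minimal sketch of the canonical word-count job written against the standard Hadoop MapReduce Java API: every computation has to be expressed as a map phase that emits key/value pairs, which the framework shuffles and groups before handing them to a reduce phase. (The class names follow the stock Hadoop tutorial example; the input and output paths are supplied as placeholders on the command line.)

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map phase: emit (word, 1) for every token in the input split.
      public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);  // every pair is written out and shuffled by the framework
          }
        }
      }

      // Reduce phase: sum the counts that the shuffle grouped by word.
      public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          context.write(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }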

Pervasive, like IBM, has a dog in this fight. It markets a high-octane DI engine -- Pervasive DataRush -- which, by virtue of its claimed ability both to scale up on SMP systems and to scale out across Hadoop, it positions as a platform par excellence for big data DI. In this regard, says Inbar, DataRush can be deployed either outside Hadoop (e.g., pulling information into DataRush from HDFS) or across Hadoop (as a MapReduce replacement).
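
The "outside Hadoop" deployment Inbar describes boils down to reading data directly out of HDFS without submitting a MapReduce job at all. Here is a minimal sketch using the standard Hadoop FileSystem client; the namenode URI and file path are placeholders, and the hand-off to DataRush (a proprietary engine) is represented only by a comment.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsPull {
      public static void main(String[] args) throws Exception {
        // Placeholder namenode address; substitute your own cluster's value.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        // Stream a file out of HDFS into an external engine -- no MapReduce job involved.
        try (BufferedReader reader = new BufferedReader(
            new InputStreamReader(fs.open(new Path("/data/events/part-00000"))))) {
          String line;
          while ((line = reader.readLine()) != null) {
            // Hand each record to the downstream (non-Hadoop) processing pipeline here.
            System.out.println(line);
          }
        }
        fs.close();
      }
    }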

Thanks to the Apache YARN project (geek-speak for Yet Another Resource Negotiator), the next major refresh of Hadoop will be able to treat DataRush, or similar implementations, as first among equals, so to speak, alongside (or in place of) MapReduce.

"DataRush just becomes another compute-paradigm engine that's managed and visible to everything else in Hadoop, but it gives you [DataRush's] pipelining and parallelism benefits inside the Hadoop infrastructure," he said. "The guys working on the next version of Apache Hadoop, will release many improvements ... [one result of which is to] decouple the MapReduce coding paradigm from the Hadoop data management and distributed compute infrastructure."

Yves de Montcheuil, vice president of marketing with open source software (OSS) DI specialist Talend Inc., takes the opposite view. Talend, de Montcheuil told BI This Week, is "all-in" as far as Hadoop and MapReduce are concerned.

"[W]e will run all of the integration, all of the cleansing jobs inside Hadoop by generating native code. This can be [a] MapReduce job, or it can be Pig-script, HQL, or HBase SQL." Talend partners with Hadoop specialists Hortonworks and Cloudera Inc., and -- in its Talend Open Studio for Big Data -- taps MapReduce as the processing engine for most ETL workloads. According to de Montcheuil, Hadoop and MapReduce jibe neatly with Talend's OSS commitment.

Jim Walker, director of product marketing with Hadoop specialist Hortonworks -- and himself a veteran of Talend, where de Montcheuil was his boss -- says he doesn't mind the conflation of Hadoop and big data. For the most part -- as it's used by most adopters today -- Hadoop is big data, Walker argues. If you're using Hadoop, he says, chances are you're using it to support "big data" use cases.

As for MapReduce, Walker continues, certainly it has its problems -- but, for a wide variety of users and use cases, it's good enough. "Is MapReduce good enough? Absolutely! Is it a mess? I wouldn't say it's a mess by any means," Walker maintains. "When you have Yahoo with [MapReduce running across] 50,000 nodes, it works." Walker and Hortonworks, too, have a stake in the game. Hortonworks, for example, likes to contrast the self-styled "purity" of its Hadoop distribution with the distributions of its more established competitors, such as Cloudera and MapR Technologies Inc.

"We contribute everything back [to the OSS community]," says Walker, who concedes that Cloudera claims to hew to a similar commitment. "A lot of the Web properties [i.e., companies] started with MapReduce and modified it and never came back to the open source tree, so we can do high-availability [in Hadoop] on version 1. We fixed it, we did that in our release. We delivered full-stack high-availability, and then we gave that back [to the community]."
