EMC Makes Big Data Move

EMC positions the Greenplum HD Data Computing Appliance as a key component in its new Hadoop-centric analytics push.

It was inevitable. With its purchase of the former Greenplum Software Inc. last June, EMC Corp. became steward of a seminal analytic database player that was almost first out of the gate with an in-database MapReduce capability.

Nearly a year after it announced its intent to acquire Greenplum, EMC has unveiled its most ambitious Big Data product to date -- the Greenplum HD Data Computing Appliance. EMC positioned the new offering as a key component in its new Hadoop-centric analytics push.

The storage giant is serious about Big Data: it showcased its new Hadoop analytics announcement on the first day of its EMCWorld 2011 user meeting in Las Vegas, where the Greenplum HD Data Computing Appliance shared the spotlight with bread-and-butter technologies such as EMC's Atmos cloud storage platform and the company's newest Isilon NAS product.

Have Hadoop, Will Scale

EMC describes the new Greenplum HD Data Computing Appliance as a mashup of the vanilla Greenplum analytic database with the open source Hadoop framework. If that's the case, EMC likely didn't break much of a sweat on the Greenplum side of the equation, at least with respect to supporting Hadoop.

After all, long before Greenplum co-announced its own in-database MapReduce capability -- on the same day, and at virtually the same time as rival Aster Data Systems Inc. -- it supported Hadoop as an external provider.

"What we are is a massively parallel engine that can do all of this interesting stuff for SQL data, but we also provide a mechanism that people can use to get at [our] parallel engine in different languages -- and one of these [external providers] is Hadoop. We've begun supporting MapReduce written directly against data inside our engine or even data outside our engine," said CTO Luke Lonergan, in a March 2008 interview.

Hadoop and MapReduce aren't quite the same thing, however.

Hadoop comprises a full-fledged framework for supporting MapReduce jobs in a highly distributed environment, so it includes several optional or required features in addition to the bread-and-butter Hadoop MapReduce implementation. These include HDFS, the native Hadoop Distributed File System; libraries that support non-HDFS file systems, such as Amazon's S3 or remote directories on FTP servers; and a scheduling facility. From the perspective of Greenplum-the-high-end-data-warehousing-vendor, then, several key Hadoop components (e.g., HDFS) were likely seen as superfluous.
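
To make the distinction concrete, here is a minimal sketch of the MapReduce programming model on its own, with no file system, scheduler, or cluster involved. It is an illustration only, not Greenplum's or Hadoop's implementation; the function names (map_phase, shuffle, reduce_phase, run_job) and the word-count task are assumptions chosen for the example.

```python
from collections import defaultdict


def map_phase(record):
    """Emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key, the step a framework performs between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all values collected for one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce in a single process over in-memory records."""
    pairs = (pair for record in records for pair in map_phase(record))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(run_job(["big data", "big appliance"]))
```

Everything in the sketch runs in one process against a list in memory. The extra pieces a framework such as Hadoop supplies (distributed storage, job scheduling, moving data between machines) sit around this core, which is why a database vendor that already stores and parallelizes the data could plausibly implement the model without them.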

That's the way some industry watchers saw it.

"I think there are two separate elements to why [Greenplum] did their own [MapReduce] implementation," said industry veteran Mark Madsen, a principal with consultancy Third Nature Inc., at the time of the announcement.

"One is simply the integration with the database. They didn't need the things in [Hadoop] like file system access since the data is in the database, there are features that are not required, etc. The other is that the existing implementation isn't as fast as it could be. Greenplum has engineers ... [who] know how to do parallelism well -- probably better than most of the Hadoop contributors. By improving the code and some of the internal algorithms, they can get better performance."

Into the Enterprise

Although the full-fledged Hadoop stack might not have made much sense to the Greenplum of old, it's another story from the perspective of Greenplum's current steward, EMC.

After all, Big Data was seen as a Big Reason for EMC's acquisition of Greenplum. That was in June of 2010. Over the last 12 months, a bevy of vendors -- including IBM Corp., Informatica Corp., Talend, Teradata Inc., and others -- have embraced Hadoop, which has emerged as a de facto open source standard for accessing, managing, and querying Big Data.

The challenge, from the perspective of a vendor such as EMC, is to make Hadoop safe for the enterprise. As industry veteran Madsen has suggested, Hadoop's fit-and-polish shortcomings probably convinced Greenplum to develop its own in-database MapReduce engine. As IBM, Teradata, and other vendors have embraced Hadoop, they've also talked up the need to optimize or extend it for enterprise or specialty data warehousing use cases.

That's precisely what EMC says it's going to do, too.

"What we're doing with Hadoop is we're taking the core of what Apache has with the Hadoop distribution and we're going to invest a lot of effort to make Hadoop enterprise-grade, enterprise-capable, [to] solve some of the core issues that are associated with making it reliable [and] scalable," said Lonergan, in a promotional Webcast on EMC's site. The crux of the issue, Lonergan argued, is to make Hadoop "a more stable, reliable platform for businesses to adopt."