Up Close: Greenplum's Smart Moves
Greenplum cozied up to Hadoop earlier this year -- but the MapReduce capability it announced last week is its own. What happened?
Greenplum Inc. isn't exactly a newcomer to MapReduce.
Earlier this year, it started working with Apache Hadoop, an open source implementation of MapReduce that's sponsored by Yahoo!, among other contributors.
In a March interview, CTO and co-founder Luke Lonergan talked up Greenplum's work with Hadoop, which at the time consisted of API-level query support for the external Hadoop framework. Lonergan also offered a succinct recap of the genesis of the Hadoop project itself. "[B]asically, Google keeps its algorithms close. They don't disclose them, but they do write about them. They've been writing about MapReduce for a long time, in fact. So eventually what happened is that one of their competitors cloned the idea into this project called Hadoop, which is an [open source] implementation of the MapReduce idea," he explained.
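For readers new to the "MapReduce idea" Lonergan refers to, the programming model can be sketched in a few lines of plain Python. This is an illustrative, single-process toy and not Hadoop's or Greenplum's API; the function names (map_phase, shuffle, reduce_phase, run_mapreduce) and the word-count task are assumptions chosen for the example.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit one (word, 1) pair for each word in an input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key, as a framework does between the phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def run_mapreduce(records):
    """Run map, shuffle, and reduce in sequence over a list of text records."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical input: two short text records.
counts = run_mapreduce(["Hadoop clones MapReduce", "Greenplum supports MapReduce"])
```

In a real framework the map and reduce calls would run in parallel across many nodes, and the shuffle step would move data between them; the sketch only shows the division of labor that makes that parallelism possible.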
Five months ago, Greenplum wasn't talking up in-database MapReduce. Instead, Lonergan positioned Greenplum as a consumer of external MapReduce queries, which -- in the scheme that his company's engineers were then working with -- would be fed to the database engine via Hadoop.
"What we are is a massively parallel engine that can do all of this interesting stuff for SQL data, but we also provide a mechanism that people can use to get at [our] parallel engine in different languages -- and one of these [external providers] is Hadoop. We've begun supporting MapReduce written directly against data inside our engine or even data outside our engine."
Lonergan did hint at what was to come, however -- for both MapReduce and (potentially) other programming models: "We're looking at kind of creating our own APIs that would allow developers and programmers to get at the parallel engine that we built to write custom applications or programs."
Fast forward five months. Surprisingly enough, Greenplum's in-database MapReduce implementation isn't based on Hadoop. Instead, officials say, Greenplum built its own. There's a reason for that, says CEO and co-founder Scott Yara, who explains that the full-blown Hadoop package -- which includes a resilient file system, among other amenities -- duplicates a lot of the functionality that already exists inside Greenplum.
"Hadoop actually is a full stack, so in addition to the MapReduce function, it has … an underlying file system. We already had a lot of that in the Greenplum database, so to support MapReduce [with our own implementation] was a relatively straightforward thing," Yara comments.
Industry watchers say this was probably a wise move.
"The problem with Hadoop is that it's an early project and the implementation isn't that efficient," comments Mark Madsen, a veteran data warehouse architect and a member of TDWI's extended research collaborative.
Sources close to the issue say that Greenplum encountered Hadoop-related bottlenecks and coding inefficiencies -- both of which reduced performance -- and that working with Hadoop also required that Greenplum make unspecified "modifications" to its data warehouse stack.
Madsen, for his part, seems persuaded by Greenplum's official explanation.
"I think there are two separate elements to why they did their own implementation. One is simply the integration with the database. They didn't need the things in there like file system access since the data is in the database, [so] there are features [in Hadoop] that are not required," he comments.
"The other is that the existing implementation isn't as fast as it could be. Greenplum has engineers who know how to do parallelism well, probably better than most of the Hadoop contributors. By improving the code and some of the internal algorithms they can get better performance."