The Mainstreaming of MapReduce

Is MapReduce ready for the mainstream, or is it best used in niche environments?

Last September, a pair of analytic database specialists -- Aster Data Systems Inc. and Greenplum Software Inc. -- trumpeted the availability of native MapReduce implementations on their DBMS platforms that, they promised, would significantly accelerate performance for customers in several vertical markets. Recently, several other analytic database firms -- including Netezza Inc. and Teradata Corp. -- have followed suit, promising to introduce native MapReduce implementations of their own.

These are heady times for MapReduce advocates.

MapReduce promises to (drastically) simplify parallelizing certain kinds of queries or problems across a cluster -- particularly those that involve extremely large datasets, even ones that are petabytes in size. MapReduce advocates say that figure suggests the scale of the problems adopters plan to tackle.
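For readers new to the model, the following is a minimal single-process sketch of the map, shuffle, and reduce steps, using a word count as the example problem. It is an illustration only, not any vendor's implementation: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the sample records are hypothetical, and a real framework would spread the map and reduce calls across the nodes of a cluster.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Emit (key, value) pairs for one input record; here, (word, 1)."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all values seen for one key into a single result."""
    return key, sum(values)


def run_job(records):
    # Each map_phase call is independent of the others, which is what
    # lets a real framework run them in parallel on different machines.
    mapped = chain.from_iterable(map_phase(r) for r in records)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


# Hypothetical sample input.
counts = run_job(["big data big cluster", "data warehouse"])
```

Because no map call depends on another, and no reduce call depends on another key's values, the programmer writes only the two small functions and leaves the distribution of work to the framework.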

The beauty of MapReduce -- in addition to its parallel processing -- is that it permits programmers to run queries against a database using any of several popular languages, including C++, C#, Java, and Python.

In this respect, MapReduce is similar to a facility like Microsoft Corp.'s Common Language Runtime, or CLR.

One obvious difference is that MapReduce is both highly parallelizable and intended for use primarily with very large datasets. CLR and similar facilities (Oracle, for example, supports CLR via its Database Extensions for .NET) are chiefly intended to let developers program against a DBMS in the language with which they're most comfortable. Their scope, then, is considerably more mundane.

MapReduce was popularized by Google Inc., which uses it to power its search technology. Not surprisingly, when Aster Data and Greenplum announced support for MapReduce (on the same day, no less), both companies sought to invite comparison with Google's MapReduce-powered search expertise.

The $64,000 question about MapReduce concerns the kinds of very large dataset problems for which it's best suited.

Skeptics -- principally, analytic database competitors that don't currently offer MapReduce implementations of their own -- like to question its applicability in the enterprise, at least for general-purpose data warehousing (DW) tasks.

"MapReduce is on the road map in a distant future for us. We still to date see almost no demand in the marketplace for it. It's become a marketing thing more than a desired customer feature thing," says David Ehrlich, CEO of analytic database specialist ParAccel Inc. At a time when competitors such as Netezza and Teradata have announced plans to support MapReduce, Ehrlich says he doesn't see the urgency.

"We are working with one customer right now where they were looking at a MapReduce approach to an implementation, and when the relational guys and the MapReduce guys finally got into the details of what they were trying to achieve, [they determined that] MapReduce was going to slow the performance down significantly."

MapReduce does have its uses, Ehrlich concedes -- and ParAccel probably will accommodate it at some point -- but he demurs as to when.

"MapReduce is a great approach to distributed computing for a lot of things, but if you have a workload that is especially friendly or honed or appropriate for a [typical] relational database environment, our belief is that you let the relational database environment do it. We haven't seen a lot of demand and we've seen very few environments around MapReduce where we think it would be a good answer for the customer."

There's still a clear sense in which the MapReduce use cases touted by proponents tend to skew toward specific conditions or environments.

Take, for example, Aster Data, which has introduced a software development kit (SDK) to support its MapReduce implementation.

Aster officials are predictably enthusiastic when it comes to MapReduce and its potential applicability, but the application examples they tout involve typical Big Data implementations -- e.g., Aster's MapReduce SDK introduces canned support for sequential path analysis and provides sample data sets for retail customers. The latter vertical is one of the textbook Big Data use cases for which MapReduce is frequently touted.

"[W]hat I've found is that there are so many different types of applications that you can actually leverage [MapReduce] for. It's not only pieces where you see engineers or super-techie people latching on to it, but also business people," said Shawn Kung, director of product management for Aster Data, in an interview earlier this year.

"The thing that's surprising and a testament to our field sales is that we've been in many engagements where we brought forward the SQL MapReduce and … the techies, they get it, but maybe the most surprising is the fact that once our field teams have actually translated that into business value, the business sponsors … when they see that they can get a 10,000 to 30,000 [times] improvement in the way that they do certain kinds of analytics, and that's going to reduce cycle time dramatically, they become champions."

Kung was more vague, however, when asked to describe the kinds of enterprise applications for which business sponsors, in particular, have latched on to MapReduce. He instead positioned the technology as a proposition that programmers and data management (DM) pros are still getting their heads around.

"As we develop a community of SQL MapReduce users, there's going to be more knowledge-sharing. Think of in-database MapReduce as sort of [like] the early days of Java," he explained. "You didn't see people suddenly knowing everything about Java. It took time, and now there's a rich ecosystem around that. In many ways, I see [MapReduce] as sort of [like Java in its] early days, but in the coming months and years I see in-database MapReduce really proliferating, sort of the way Java did with the Internet."

Kung's prognostication might sound too optimistic, but at least one industry watcher thinks there's something to it.

"All of these years we've had parallel data handling. Now we need parallel processing. The parallel data side has been mature and strong. It's like a bride waiting for her groom. Some day, this MapReduce will grow up to be an incredibly powerful processing system. That day, Teradata will be challenged in scalability, and we'll just love it. It'll be wonderful to have a workload that's equal to us," says Dan Graham, a senior marketing manager with Teradata.

Graham's company recently announced its own MapReduce strategy via the open source Hadoop project, so Graham and Teradata aren't unbiased. At the same time, Graham notes the emergence of a future crop of MapReduce-based applications that -- eventually -- will transform data warehousing.

"MapReduce as it sits today is embryonic. It's clumsy. It doesn't have tools. It has a lot of excitement and momentum. [It has] a lot of installs, [but] no two installs are the same," he says, predicting that "in the next five to seven years, these things will grow up." When that happens, Graham says, DW practitioners and programmers will have to reach a separate peace.

"The most important thing right now is the perception that [MapReduce] is a replacement or displacement of a data warehouse. It's what a lot of these fellows chant on the Web. Teradata is preaching a coexistence strategy, simply because we're not going to fight the Java aficionado for his pride and his work," he continues. "At some point, data warehouse pros and their programmer counterparts have to realize that each tool has its place. If you couple them, you can get tremendous competitive advantage. There's a lot of workloads that can be done in MapReduce today that would be better [done] in a data warehouse."

Conversely, Graham concedes, there are workloads -- such as ETL processing on extremely large data sets -- for which MapReduce seems tailor-made.

"The most common early use of it will be as an ETL system on steroids. If you think about [having] a parallel system out there gathering the data, transforming the data, and handing it [off] to Teradata to be loaded, this is great! We're finally finding our equal who can feed us data. We have mainframes that can't keep up with us," he explains.

The drawback, he points out, is that MapReduce-based ETL means a return to hand coding. "If you're a dot.com with 1,000 servers with Hadoop Web data on it, you don't have a choice. Go to your ELT vendors and ask them how to do that and they will probably step up."
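To give a sense of what hand-coded, MapReduce-style ETL involves, here is a small sketch under stated assumptions. The log format, the sample lines, and the function names (`extract_and_transform`, `aggregate`, `to_load_file`) are all hypothetical and chosen for illustration; they do not come from Teradata, Hadoop, or any ETL product. The map step parses and filters raw Web log lines, the reduce step totals hits per day and page, and the result is written as a delimited file of the kind a warehouse loader might accept.

```python
import csv
import io
from collections import defaultdict

# Hypothetical raw Web log lines: "<day> <page> <HTTP status>".
RAW_LOG = [
    "2009-06-01 /home 200",
    "2009-06-01 /home 200",
    "2009-06-01 /cart 500",
    "malformed line",
    "2009-06-02 /home 200",
]


def extract_and_transform(line):
    """Map step: parse one line, drop bad or failed requests, emit a pair."""
    parts = line.split()
    if len(parts) != 3 or not parts[2].isdigit():
        return []  # unparseable record
    day, page, status = parts
    if int(status) >= 400:
        return []  # keep only successful requests
    return [((day, page), 1)]


def aggregate(pairs):
    """Reduce step: total the hits for each (day, page) key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return totals


def to_load_file(totals):
    """Format the aggregates as CSV text for a bulk loader."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["day", "page", "hits"])
    for (day, page), hits in sorted(totals.items()):
        writer.writerow([day, page, hits])
    return buf.getvalue()


pairs = [p for line in RAW_LOG for p in extract_and_transform(line)]
load_file = to_load_file(aggregate(pairs))
```

Every parsing rule and filter here is written by hand, which is the trade-off Graham describes: the work parallelizes well, but nothing in it comes from a packaged ETL tool.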

Graham anticipates that MapReduce-powered ETL could account for up to 80 percent of Teradata's "snuggle-don't-struggle" coexistence strategy.