Hadoop: Java-Based Framework Can Run Data-Intensive Apps on Clusters

First Apache Hadoop summit is standing-room-only event

Yahoo hosted the first-ever Apache Hadoop Summit this week in Santa Clara, Calif. The day-long event presented a program of speakers from the Hadoop developer and user communities, including representatives from Yahoo, IBM, Microsoft, Facebook, Google and the University of California at Berkeley, among others.

The event drew around 500 attendees, though organizers were unsure of the exact number. They were caught off guard by the turnout and had to change venues to accommodate a standing-room-only crowd.

"We organized the summit because we've been investing a lot in Hadoop ourselves, and we knew there was a large community of Hadoop users out there that mostly haven't met each other," said Yahoo Technical Evangelist Jeremy Zawodny. "I guess it was larger than we thought."

Hadoop is an open source, Java-based distributed computing framework designed to run implementations of MapReduce on large clusters of commodity hardware. MapReduce, introduced by Google, is a programming model for processing and generating large data sets. It supports parallel computation over those data sets on clusters of unreliable machines.
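For readers unfamiliar with the model, the following is a minimal, in-memory sketch of the map, shuffle and reduce phases applied to a word count. It is plain Java and does not use the Hadoop API; the class name `MapReduceSketch` and its methods are hypothetical, chosen only for illustration. A real Hadoop job would distribute these phases across many machines rather than run them in one process.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative, single-process sketch of the MapReduce model.
// Assumption: this is NOT the Hadoop API; all names here are hypothetical.
final class MapReduceSketch {

    // Map phase: turn one input line into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(Map.entry(word, 1));
            }
        }
        return out;
    }

    // Shuffle phase: group every emitted value under its key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reduce phase: collapse all values for one key into a single count.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Driver: run map over every line, shuffle the pairs, then reduce each group.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String line : lines) {
            emitted.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        shuffle(emitted).forEach((key, values) -> result.put(key, reduce(key, values)));
        return result;
    }

    public static void main(String[] args) {
        // prints {dog=1, fox=2, lazy=1, quick=1, the=3}
        System.out.println(wordCount(List.of("the quick fox", "the lazy dog", "the fox")));
    }
}
```

Because each call to `map` depends only on its own line and each call to `reduce` only on its own key, both phases can in principle run in parallel on separate machines, which is the property a framework such as Hadoop exploits.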

Yahoo hired Hadoop's creator, Doug Cutting, early last year to work full-time on the framework. Cutting created the Lucene open source information retrieval library and, with Mike Cafarella, the Nutch open source search engine based on it. Both projects are now managed through the Apache Software Foundation.

"The momentum around Hadoop is growing every day," Cutting said. "It's really exciting to watch."

Cutting called Yahoo's resource commitment to the Hadoop framework "considerable," but offered no details. Yahoo has made a very public commitment to Hadoop. In February, it launched what company representatives claimed was the world's largest Hadoop production application. Called the Yahoo Webmap, the application runs on a 10,000-plus-core Linux cluster and produces data used in every Yahoo Web search query, according to company literature.

The initial intended use of Hadoop within Yahoo was to support Web search, Cutting said, by building the Web search index and maintaining that massive collection of data. But although it is making the Yahoo search engine more easily scalable and reliable, he said, the majority of in-house users are actually employing Hadoop for data exploration.

"It turns out that there are all these other people within the company who want to be able to access and analyze these massive data sets -- access logs, event logs, Web and geographic data -- and use them to improve the Web search software itself," Cutting said. "So they're using Hadoop for analysis to improve the software, as opposed to actually implementing the Web search. That's where we're seeing the big payoff."

That's where he expects other companies to jump on the Hadoop bandwagon.

"The data exploration is more generalizable to lots of businesses, and that's why we're seeing all this interest," he added. "Companies are amassing more and more data, and they need to explore it. The tools that are out there for doing ad hoc exploration and analysis of new data sets aren't as convenient."

Along with several Yahoo representatives, the roster of summit presenters included the IBM Almaden Research Center's Kevin Beyer, who described how to use JAQL, a query language for JSON (JavaScript Object Notation) data, in Hadoop apps.

Microsoft's Michael Isard was also on hand to talk about DryadLINQ, which combines Microsoft's Dryad distributed execution engine with .NET Language Integrated Query (LINQ). DryadLINQ is similar to JAQL and to Pig, Yahoo's open source infrastructure for ad hoc analysis of very large data sets. But DryadLINQ doesn't actually run on Hadoop.

"Microsoft is doing a very similar set of technologies," Cutting explained, "but all within Microsoft. They're not using an open source model, and it's not even a commercial product at this point. I think they're here because they want to talk with people on a technical level, and because this is important technology, but not in terms of actually cooperating with people by sharing code and building on one another's efforts."

Cutting named the framework "Hadoop" after his son's yellow stuffed elephant. The yellow pachyderm is the official mascot/logo of the project.

About the Author

John K. Waters is the editor in chief of a number of sites, with a focus on high-end development, AI and future tech. He's been writing about cutting-edge technologies and the culture of Silicon Valley for more than two decades, and he's written more than a dozen books. He also co-scripted the documentary film Silicon Valley: A 100 Year Renaissance, which aired on PBS.
