Q&A: Harnessing Computing Power for Data Analysis with MapReduce
MapReduce can help spread computational chores among many computers. But what if the technology could be unleashed on all data, not just files?
Data analysis runs faster the more computing power you have, and one way to get more power is to share computing chores among multiple systems. That’s what MapReduce can help IT accomplish quickly and easily. Traditional use focuses on file-system-based data processing of unstructured files. What if MapReduce could be used not just with files but with all data?
To learn more, we spoke with Shawn Kung, senior director of product management at Aster Data Systems.
Enterprise Strategies: What do MapReduce and Hadoop actually do? What benefits do they promise?
Shawn Kung: MapReduce is a framework for using a large number of computers (nodes) to solve a problem; these nodes make up a cluster. Processing can occur on data stored either in a file system (unstructured) or within a database (structured). The Hadoop project is a free, open source Java MapReduce implementation that operates on unstructured data but doesn’t work on structured database data. Amazon Elastic MapReduce is a hosted version of Hadoop made available on EC2.
Why is MapReduce significant?
MapReduce allows ordinary developers to create a large variety of parallel programs without having to worry about programming for intra-cluster communication, task monitoring, or task-failure handling. Before MapReduce, creating programs that handle these issues could consume literally months of programming time. With MapReduce, developers can simply write application logic, not debug cluster communication code. MapReduce programs have been created for everything from text tokenization, indexing, and search to data mining and machine learning algorithms.
How do they work?
Regardless of which "version" of the MapReduce implementation we're talking about, the premise is the same. The functional workings of MapReduce are as follows:
- In the "Map" step, the master node takes the input, chops it up into smaller sub-problems, and distributes those pieces to worker nodes, which process their small problem and provide the answer to the master.
- In the "Reduce" step, the master node takes the answers to the sub-problems and combines them to get the output -- the answer it was originally trying to find.
MapReduce allows for distributed processing of the map and reduction operations, letting processing scale out on multiple machines for faster parallel processing.
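To make those two steps concrete, here is a minimal, single-process Python sketch of the classic word-count job. The function names and the in-memory grouping step are illustrative stand-ins for what a framework such as Hadoop distributes across worker nodes over the network.

```python
from collections import defaultdict

def map_fn(line):
    # Map: emit an intermediate (key, value) pair for every word.
    for word in line.split():
        yield (word.lower(), 1)

def reduce_fn(word, counts):
    # Reduce: combine all intermediate values for one key into the answer.
    return (word, sum(counts))

def mapreduce(lines):
    # "Shuffle": group intermediate pairs by key. In a real cluster the
    # framework does this over the network between map and reduce nodes.
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    # Each group can be reduced independently -- that is the parallelism.
    return dict(reduce_fn(key, values) for key, values in groups.items())

if __name__ == "__main__":
    docs = ["the quick brown fox", "the lazy dog", "the fox"]
    print(mapreduce(docs))
    # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```

In a real cluster, each call to map_fn and reduce_fn may run on a different machine; the programmer writes only these two functions, and the framework handles distribution, monitoring, and failure recovery.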
What are the differences between these technologies and versions?
There are basically two camps for MapReduce. The traditional camp (Google, Hadoop, Amazon's implementation of Hadoop on EC2) focuses on file-system-based data processing of unstructured files.
The other camp (including Aster, the company I work for) believes MapReduce should not be limited to files but should extend to all data. This "in-database" MapReduce group operates on unstructured data (files) as well as structured data (database schemas). Figuring out how to execute powerful MapReduce jobs on structured and unstructured data (and everything in between) is a genuinely hard technology problem, and it is the one this second camp is trying to solve.
How are big players such as Google and Yahoo using them?
Google used MapReduce to completely regenerate its index of the World Wide Web, replacing the old ad hoc programs that updated the index and ran the various analyses. Today, more than 3,000 computing jobs per day run through MapReduce at Google.
The Yahoo! Search Webmap is a Hadoop application that runs on a Linux cluster with more than 10,000 cores and produces data that is now used in every Yahoo! Web search query.
Google's MapReduce implementation is completely proprietary and not for sale. It has been highly customized and tuned for use only within Google and is a key reason for Google's market dominance.
What place do they have within smaller enterprises?
Even smaller enterprises often have large data volumes, particularly as the Web (and its data) is integrated into customer-facing processes such as marketing/advertising and e-commerce. Even with small data volumes, heavy analytic processing is often required for rich analytic functions such as data mining, behavioral pattern analysis, and forecasting. The trick for smaller enterprises is to figure out how to harness and apply the power of MapReduce to these vast volumes of data without having to hire a team of Google PhDs to implement and leverage it.
Why aren't more IT shops using the technology?
MapReduce is a new concept that, until recently, only companies "rich" in developer resources -- those with armies of engineers, such as Google and Yahoo -- could afford. Most implementations of MapReduce lack native integration with industry-standard tools for reporting, analysis, and data integration. This has left traditional companies, which lack the resources to cobble together open-source projects, out in the cold.
IT organizations rely on standards-based technology for process-oriented corporate governance. That's why an "in-database" MapReduce is so powerful. It enables seamless integration with standards (such as ANSI-standard SQL) and with hundreds of third-party ecosystem products for data modeling, SQL development, database administration consoles, system administration consoles, system monitoring, ETL, analytics, and BI.
In addition, IT organizations need technology that people already know. SQL and databases have been around since the 1970s, so there are millions of DBAs and SQL developers -- an army of IT professionals already trained on them.
In stark contrast, far fewer people know traditional MapReduce (since it requires a deep understanding of newer technologies such as distributed computing). An "in-database" approach enables seamless integration of esoteric MapReduce into the comfortable confines of ANSI-standard SQL. As a result, IT organizations don't need to re-train or re-hire -- they can keep their existing DBA and SQL developer staff and still get all the performance benefits and analytical insights of MapReduce.
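As a rough, hypothetical illustration of that "comfortable confines of SQL" idea (this is not Aster's actual interface), the following Python sketch registers a custom function inside an embedded SQLite database and calls it from ordinary SQL. An in-database MapReduce system does the analogous thing, but parallelized across a cluster of nodes.

```python
import sqlite3

def domain_of(email):
    # A custom "map-style" function pushed into the database.
    return email.split("@")[-1].lower()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clicks (user_email TEXT, url TEXT)")
conn.executemany("INSERT INTO clicks VALUES (?, ?)", [
    ("alice@example.com", "/home"),
    ("bob@example.org", "/cart"),
    ("carol@example.com", "/home"),
])

# Register the Python function so plain SQL can call it -- the data
# never leaves the database for a separate processing tier.
conn.create_function("domain_of", 1, domain_of)

for row in conn.execute(
    "SELECT domain_of(user_email) AS domain, COUNT(*) "
    "FROM clicks GROUP BY domain ORDER BY 2 DESC"
):
    print(row)  # ('example.com', 2), then ('example.org', 1)
```

The query itself is plain SQL that any DBA could write; only the function body is new, which is why the approach does not demand re-training an entire staff.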
What are the benefits to integrating SQL and MapReduce? How can IT “sell” this technology to upper management?
Tightly integrating SQL and MapReduce on a single DBMS platform provides several benefits:
- Data is more accessible: it makes both BI analysts and data miners/statisticians happy. They can do reporting and analysis as well as rich analytics, and drill down seamlessly to detailed data, all in one place.
- Data doesn't have to be "off-loaded" from the data warehouse to a separate data mart for analytics/BI: you can store all the data you need for analysis closer to the source, in large volumes, and provide a rich set of functions (via MapReduce) plus standard SQL for fast analysis, without waiting for data to be migrated to a third application tier. This saves tremendous time and the cost of maintaining multiple tiers and platforms for analysis.
- Data is more manageable: you don't need the complexity of two systems (e.g., you eliminate the need for more hardware, getting more people trained, defining access control for two systems, fail over, back-up, etc.).
- Data analysis is faster: the characteristics of a file system are different from those of an RDBMS. A file system such as Hadoop's behaves like a batch system and is unsuitable for looking up small pieces of data; an RDBMS can return data in less than a second, while Hadoop may have response times of several minutes.
- Data is more structured: an RDBMS provides a schema, data-consistency features, constraints, and user access controls (security). Relational databases let you easily define relationships across data and make it faster to access.
What best practices can you recommend for using this technology?
Make sure the MapReduce technology you select is not focused only on files (like Hadoop). A file-only approach is great for small departmental projects or research/advanced development, but it doesn't make sense in IT production environments at Fortune 1000 companies.
Enterprise IT organizations should only look at MapReduce technologies that work on both files and databases. Here is a checklist of key criteria for Fortune 1000 IT executives to consider when tendering an RFP/RFI for MapReduce technology:
- Seamlessly integrated MapReduce (MR) and ANSI SQL
- Maximum fault isolation across MR and SQL queries
- Predictable service levels across MR and SQL queries
- Easily run ad-hoc and canned MR queries
- BI tools support MR queries for business users
- Ecosystem tools integration (ODBC/JDBC)
- ACID transactional integrity (avoid corruption)
- Minimized data shuffling (network efficient schemas)
- Security and data protection (backup and disaster recovery)
- Simplicity -- appliance-like maintenance and administration
What is the purpose of Aster nCluster and how does it relate to MapReduce?
Aster nCluster is a high-performance analytic database designed from the ground up to support large-scale data warehousing using both standard SQL and MapReduce. It brings the power and expressive flexibility of MapReduce to the traditional world of SQL databases, enabling companies to do more analysis, faster, on larger data sets than ever before.
What makes Aster nCluster unique?
Aster provides the first-ever In-Database MapReduce, which brings the functional programming paradigm popularized with Google MapReduce to the world of relational databases and structured data.
Traditionally, massively parallel databases were well able to parallelize ordinary SQL but had limitations when parallelizing more general programs, whether written as user-defined functions or in a database programming language such as PL/SQL. In many cases, those capabilities simply ran on a single node of an MPP database. Now, analysts and developers can take advantage of the power of MapReduce from within ordinary SQL by creating SQL/MR functions in Java, Python, R, and more. They can perform analyses never before possible with ordinary SQL and get more value from their data.
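To sketch the programming model (the names and signatures below are hypothetical illustrations, not nCluster's real SQL/MR API), a partition-style function in Python might look like this: the engine groups rows by a key and streams each group through user code, which can emit any number of output rows.

```python
from collections import defaultdict
from typing import Iterable, Iterator, Tuple

Row = Tuple[str, float]  # (user_id, purchase_amount)

def big_spender(partition: Iterable[Row]) -> Iterator[Tuple[str, float]]:
    # User code: emit one row per user whose total spend exceeds 100.
    rows = list(partition)
    total = sum(amount for _, amount in rows)
    if total > 100:
        yield (rows[0][0], total)

def run_partitioned(rows, key_fn, fn):
    # Stand-in for the engine: partition rows by key, then apply fn to
    # each partition. In an MPP database, partitions run on separate nodes.
    parts = defaultdict(list)
    for r in rows:
        parts[key_fn(r)].append(r)
    for part in parts.values():
        yield from fn(part)

data = [("alice", 80.0), ("alice", 50.0), ("bob", 20.0)]
print(list(run_partitioned(data, lambda r: r[0], big_spender)))
# [('alice', 130.0)]
```

Because each partition is processed independently, the engine can run the user's function on every node in parallel, which is what lets such functions scale the way plain SQL aggregates do.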