A Big Year Ahead for Big Data and Hadoop

During 2011, Hadoop became more reliable with better performance in new commercial distributions. In 2012, Hadoop will become easier to use and manage.

By Jack Norris, Vice President of Marketing, MapR Technologies

Big Data has become a Big Deal for a growing number of organizations, and the industry experienced three major trends last year.

2011 Trend #1: Hadoop became available in more commercial distributions

Hadoop traces its roots to an extraordinarily difficult Big Data challenge: cataloging the content on the World Wide Web and making it searchable. If you’re unfamiliar with Hadoop, it is a cost-effective and scalable system for storing and analyzing large and/or disparate datasets on clusters of commodity servers. The original open source distribution, still available from Apache, was joined in 2011 by commercial versions (also sometimes free) from EMC, MapR Technologies, and Oracle.

Although the original Apache distribution continues to be used for a variety of specific use cases and test environments, market research firm ESG recently found that nearly half of new users plan to deploy commercial versions of Hadoop owing to their enhanced feature sets. The industry has seen this pattern before as technologies mature and their use expands. Technologies that fail to get more capable and dependable never really gain traction in mainstream markets beyond their original small community of users.

2011 Trend #2: Hadoop became more reliable with new high-availability features

The Apache Hadoop architecture has single points of failure in two critical functions: NameNode and JobTracker. Hadoop was designed with a centralized NameNode architecture in which all Hadoop Distributed File System metadata (e.g., namespace and block locations) is stored in memory on a single node. This architecture undermines reliability and caps scalability at whatever fits in the memory of the primary NameNode.
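To see why keeping all metadata in one node's memory limits cluster size, a rough back-of-the-envelope estimate may help. The sketch below assumes about 150 bytes of NameNode heap per namespace object (a file or a block), which is a commonly quoted rule of thumb and not a figure from this article. The 64 GiB heap and the one-block-per-file default are likewise illustrative assumptions.

```python
# Rough estimate of how many files a single in-memory NameNode can track.
# ASSUMPTION: ~150 bytes of heap per namespace object (file or block).
# This is a rule of thumb, not a measured value.
BYTES_PER_OBJECT = 150


def max_files(heap_bytes: int, blocks_per_file: int = 1) -> int:
    """Return the approximate number of files that fit in the given heap.

    Each file costs one object for the file itself plus one per block.
    """
    objects_per_file = 1 + blocks_per_file
    return heap_bytes // (BYTES_PER_OBJECT * objects_per_file)


if __name__ == "__main__":
    heap = 64 * 1024**3  # hypothetical 64 GiB NameNode heap
    print(f"~{max_files(heap):,} single-block files")
```

Whatever the exact per-object cost, the ceiling scales with one machine's memory and not with the size of the cluster, which is the limitation described above.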

A far better solution is Distributed NameNode, which, as the name implies, distributes the file metadata across ordinary DataNodes throughout the cluster, thereby dramatically improving both reliability and scalability. The metadata automatically persists to disk (as with the node’s data) and is also replicated to two or more other nodes to tolerate multiple simultaneous failures.

Similar resiliency is also now available for the JobTracker, the second critical function in Hadoop environments. In the event of a failure, a secondary JobTracker automatically picks up and continues the tasks with no interruption or data loss.

2011 Trend #3: Elimination of the network as a performance bottleneck

The IDC Digital Universe study in 2011 confirmed that data volumes are growing faster than the pace set by Moore’s law. Organizations are, therefore, increasingly seeing a major bottleneck emerge in the transfer of data over the network to the compute platform. There are point solutions that help, such as faster drives and processing data in DRAM, but with Big Data, the best approach is to employ a different compute paradigm such as Hadoop’s MapReduce, which runs the computation on the nodes where the data is stored instead of moving the data to the computation. MapReduce was purpose-built for processing enormous amounts of data, and several commercial enhancements have dramatically improved its performance.
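For readers new to the paradigm, the following is a minimal single-process sketch of the MapReduce pattern, using the classic word-count example. It is plain Python written for illustration, not Hadoop's API, and all function names are hypothetical. In an actual cluster the map step runs in parallel on the nodes that already hold the input blocks, so mainly the much smaller intermediate key/value pairs cross the network.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all values by key between the map and reduce steps."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence over the input lines."""
    mapped = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big Deal"]))
```

Because each map call looks at only one line and each reduce call at only one key, both steps can be spread across many machines without coordination beyond the shuffle.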

With commercial versions now affording greater reliability, scalability, and other improvements, the range of Hadoop use cases is poised to expand. Here are the three major trends worth tracking in 2012.

2012 Prediction #1: Hadoop will get easier to manage at scale

The requisite (and sometimes intricate) care and feeding of a Hadoop environment often presents challenges for the typical organization, which lacks the experienced technical staff needed to keep Hadoop operational. These challenges are now being addressed by enhancements in some commercial distributions.

One is the advent of Distributed NameNode in 2011, which will now make it easier to manage the entire Hadoop cluster. Because the NameNode function is distributed in a resilient fashion, there is no longer a need to create a Checkpoint Node (previously called the Secondary NameNode) and/or a Backup Node. Managing these additional nodes is complicated, and neither provides for automatic hitless failover of the primary.

Another enhancement that simplifies management is support for data Volumes. Volumes make cluster data easier to both access and manage by grouping related files and directories into a single tree structure that can be more readily organized, administered, and secured. Volumes also make it easier to tailor advanced data protection features such as Snapshots and Mirroring. Snapshots can be taken periodically to create drag-and-drop recovery points. Mirroring extends data protection to satisfy recovery time objectives. Local mirroring provides high performance for frequently accessed data, while remote mirroring provides business continuity across data centers as well as integration between on-premises systems and private clouds.

2012 Prediction #2: Hadoop will get easier to use

One of the most significant ease-of-use enhancements involves supplementing the Hadoop Distributed File System (HDFS) with direct access to the ubiquitous Network File System (NFS). HDFS is a write-once file system with a number of limitations, including batch-oriented data management and movement, the requirement to close files before new updates can be read, and a lack of random read/write file access by multiple users. All of these limitations create a serious impediment to Hadoop adoption.

The next generation of storage services for Hadoop overcomes these limitations and affords some additional benefits. Lockless storage with random read/write access enables simultaneous access to data in real time. Existing applications and workflows can use standard NFS to access the Hadoop cluster to manipulate data, and optionally take advantage of the MapReduce framework for parallel processing. Files in the cluster can be modified directly using ordinary text editors, command-line tools, and UNIX applications and utilities (such as rsync, grep, sed, tar, sort, and tail), or other development environments. And transparent file compression helps keep storage requirements to a minimum -- a big plus with Big Data.
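As an illustration of what NFS access means for an existing application, the sketch below uses nothing but ordinary file I/O to append to a file and then overwrite a few bytes in place. The function names and the file name are hypothetical, and a temporary directory stands in for the cluster's NFS mount point, which would be site-specific. No Hadoop library is involved, which is the point.

```python
import os
import tempfile
from typing import List


def append_record(path: str, record: str) -> None:
    """Append one line to a file using ordinary file I/O."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(record + "\n")


def overwrite_at(path: str, offset: int, data: bytes) -> None:
    """Overwrite bytes in place at a given offset (a random write)."""
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(data)


def read_lines(path: str) -> List[str]:
    """Read the file back as a list of lines without trailing newlines."""
    with open(path, encoding="utf-8") as f:
        return f.read().splitlines()


if __name__ == "__main__":
    # ASSUMPTION: on a real cluster this would be a directory under the
    # NFS mount; a temporary local directory stands in for it here.
    with tempfile.TemporaryDirectory() as mount:
        path = os.path.join(mount, "events.log")
        append_record(path, "status=NEW id=1")
        append_record(path, "status=NEW id=2")
        overwrite_at(path, 7, b"OLD")  # replace "NEW" in the first record
        print(read_lines(path))
```

The in-place overwrite is the kind of operation a write-once file system does not permit, while on a read/write NFS mount it is an ordinary seek and write.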

Making Hadoop compatible with the popular and familiar Network File System makes Hadoop suitable for almost any organization. This change is not unlike what occurred with the Web browser in the early 1990s. Previously, using the Internet required a working knowledge of UNIX commands. With its intuitive, point-and-click ease, the browser opened the Net to the masses. NFS support now places Hadoop easily within reach of the skill set of even non-programmers.

2012 Prediction #3: Hadoop will be used for more applications

Making Hadoop easier to use and manage will clear the way for its use in many additional applications. Four reasons in particular explain why organizations will increasingly rely on Hadoop to unlock the value of their rapidly expanding data.

First, the sheer size of today’s datasets is driving companies to find a more effective and economical way to analyze data, and Hadoop was purpose-built to handle massive amounts of data using commodity hardware.

Second is the growing diversity of data. Hadoop was designed to handle a wide variety of data, particularly unstructured data, thus eliminating the need to engage in complex data transformation exercises or to otherwise pre-process any data sources.

Third, organizations are often not quite sure what the data will reveal! Hadoop does not require schemas to be defined or data to be aggregated in advance, both of which risk losing important detail.
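The idea of deferring the schema until the data is read can be sketched in a few lines. The example below is plain Python and not a Hadoop API. The `key=value` record format, the field names, and the function names are all assumptions made for illustration. Raw lines are stored untouched, and a list of wanted fields is applied only when a question is asked.

```python
from typing import Dict, Iterable, List, Optional


def parse_raw(line: str) -> Dict[str, str]:
    """Parse one raw 'key=value key=value' line into a dict at read time."""
    fields: Dict[str, str] = {}
    for token in line.split():
        key, sep, value = token.partition("=")
        if sep:  # skip tokens that are not key=value pairs
            fields[key] = value
    return fields


def project(
    lines: Iterable[str], wanted: List[str]
) -> List[Dict[str, Optional[str]]]:
    """Apply a 'schema' (a list of field names) only when querying."""
    rows = []
    for line in lines:
        record = parse_raw(line)
        # Fields missing from a record come back as None, not an error.
        rows.append({name: record.get(name) for name in wanted})
    return rows


if __name__ == "__main__":
    raw = [
        "user=ann action=login device=phone",
        "user=bob action=buy amount=30",
    ]
    print(project(raw, ["user", "amount"]))
    print(project(raw, ["user", "device"]))  # a later, different question
```

Because the raw lines are never reshaped on ingest, a field nobody anticipated (here, `device`) is still available when a new question comes along.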

Finally, there is the need for accurate and meaningful analysis, especially with disparate data. Here, too, Hadoop helps: running simple algorithms on large datasets beats running complex models on small datasets every time.

Jack Norris is the vice president of marketing for MapR Technologies, provider of an advanced distribution for Apache Hadoop. Jack’s experience includes defining new markets for small companies, leading marketing and business development for an early-stage cloud storage software provider, and increasing sales of new products for large public companies. Jack has also held senior executive roles with Brio Technology, EMC, and Rainfinity. You can contact the author at jnorris@maptech.com.