Hadoop, The Cloud, and Big Data
Hadoop is this year's big buzzword, but what does the technology enable and what are its benefits? What mistakes does IT make when beginning a Hadoop project, and what best practices can help you avoid those problems and succeed? For advice, we asked Jack Norris at MapR Technologies; the company produces MapR, its namesake advanced distribution for Apache Hadoop.
Enterprise Strategies: Tell us about Hadoop and why it is important today.
Hadoop is one of the most important enterprise software technologies in decades. It is a platform that allows enterprises to store and analyze large and growing volumes of unstructured data more effectively and economically than ever before. With Hadoop, organizations can process and analyze a diverse set of unstructured data including clickstreams, log files, sensor data, genomic information, and images. Hadoop represents a paradigm shift: instead of moving the data across the network to the computation, it moves the computation to the nodes where the data resides and sends only the results over the network.
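The model behind this paradigm can be sketched in a few lines. The following is a minimal, single-process illustration of the map/shuffle/reduce pattern Hadoop uses (a word count, the conventional teaching example); the function names are my own, not part of any Hadoop API:

```python
from collections import defaultdict

def map_phase(record):
    """Map: emit (word, 1) pairs for each word in one input record."""
    for word in record.split():
        yield (word.lower(), 1)

def reduce_phase(word, counts):
    """Reduce: sum the counts emitted for a single word."""
    return (word, sum(counts))

def word_count(records):
    """Drive the map -> shuffle -> reduce pipeline over a list of records.
    On a real Hadoop cluster, map tasks run on the nodes holding each data
    block, and only the small intermediate results cross the network."""
    shuffled = defaultdict(list)
    for record in records:               # map phase (node-local in Hadoop)
        for key, value in map_phase(record):
            shuffled[key].append(value)  # shuffle: group values by key
    return dict(reduce_phase(k, v) for k, v in shuffled.items())
```

Because each map task reads only its local block, adding nodes adds both storage and processing capacity at once, which is what makes the scale-out economics work.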
What’s the top misconception IT has about Hadoop?
The biggest misconception is to look at Hadoop as a closed ecosystem. As with any other enterprise platform, Hadoop needs to integrate with existing information sources and systems. Hadoop also needs to meet an organization's service-level agreements (SLAs), availability, and data protection standards. If an organization fails to select a Hadoop distribution with these integration and protection features, it is limited to a narrow set of supported use cases.
How does this misconception put IT or an entire enterprise at risk?
A failure to understand the differences in Hadoop distributions can result in lost data, downtime, and lost productivity.
IT often struggles to manage and process ever-growing amounts of data. How can Hadoop help?
The issue is that these data sources are typically unstructured -- social media posts, sensor readings, and the like -- and are growing in volumes that outstrip the ability to process them with existing tools and processes.
Hadoop removes these obstacles by providing a radically different framework that allows for easy scale-out of systems and for processing power to be distributed. Data from a wide variety of sources can be easily loaded and analyzed with Hadoop. There's no need to go through a lengthy process to transform the data up front, and a broad set of analytic techniques can be applied.
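This "load first, structure later" approach is often called schema-on-read: the raw record is stored as-is, and each analysis imposes whatever structure it needs at query time. A minimal sketch, using an invented clickstream log format (the field layout here is purely illustrative, not any standard):

```python
import json
from datetime import datetime

# A hypothetical raw clickstream line stored untransformed in the cluster;
# the pipe-delimited format is invented for illustration.
RAW_LINE = '2013-04-02T10:15:30|198.51.100.7|/products/widget|{"ref":"email"}'

def parse_click(line):
    """Schema-on-read: impose structure at analysis time, not at load time.
    A different analysis could re-parse the same raw line another way."""
    ts, ip, path, extra = line.split("|", 3)
    return {
        "timestamp": datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S"),
        "ip": ip,
        "path": path,
        "referrer": json.loads(extra).get("ref"),
    }
```

If a later question requires a field the first analysis ignored, the raw line is still there to be re-parsed; nothing was discarded by an up-front transformation.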
Why do you believe that Hadoop in the cloud should be part of any big data strategy?
At the current rate of data growth and the ability to extract significant value from it, organizations cannot continue to expand their data centers to keep pace. A cloud option provides the flexibility and scalability that organizations require -- or will soon require.
This is not to say that everything will move to the cloud, but a cloud option -- either for additional processing or as a disaster recovery component -- will soon be the reality. Next generation, enterprise-grade Hadoop distributions have recognized the need for a cloud component and include features such as mirroring across the WAN. Amazon and Google, for example, are now offering Hadoop in the cloud as service options.
We know that scalability is vital in big data environments. How can the cloud help companies of all sizes scale? Are there particular enterprise characteristics where Hadoop will be the most successful when it comes to cloud scalability?
It’s not just that data is growing quickly. Companies that have deployed Hadoop can now appreciate the value of this analysis and have expanded the data to be analyzed. Instead of extracting a portion and discarding the rest, organizations are retaining more data. A virtuous circle is established: more retained data yields more valuable analysis, which in turn justifies storing and analyzing still more data in Hadoop.
Google published a paper that revealed that simple algorithms on big data produce much better results than complex models on smaller data sets. With Hadoop, scalability is linear: if you want to analyze more data, simply add more nodes. Some distributions now address one of the major limitations to scalability in Hadoop, which is a limit on the number of files that can be managed within a cluster. Previously, the largest Hadoop clusters had fewer than 200 million files, and great care had to be taken to limit file growth and to compact files. With next-generation Hadoop distributions, customers can have billions of files in their clusters.
What are the biggest mistakes IT makes in implementing Hadoop?
The good news is that Hadoop is fairly straightforward to deploy and provides a welcome change for IT personnel who are accustomed to data warehouse deployments. With traditional analytic platforms, the success of the project rests on the ability to define the proper data structures up front. With Hadoop, there is no need to define these structures up front. Hadoop is extremely flexible and allows users to change the type of analytics and the granularity of the analysis at any point.
What best practices can you recommend to avoid these problems?
The flexibility of Hadoop helps IT avoid a lot of problems. In general, the best practice is to start a deployment around a particular use case. Get experience and success with that and then expand from there.
What makes MapR unique from other Hadoop distributions?
MapR applied engineering innovation throughout the Hadoop stack so that more businesses could use the power of big data analytics. MapR allows users to move data in and out of the cluster over NFS and enables applications to access that data directly. MapR provides a fully dynamic read/write storage infrastructure to make it easier to get data into a cluster, including support for real-time streaming.
MapR makes Hadoop dependable for mission-critical use, making possible broader business use of big data analytics for competitive advantage. MapR also brings such features as Snapshots and Mirroring to Hadoop. There are many more unique features to help organizations address big data challenges and make analysis with Hadoop easy, dependable, and fast.