In-Depth
Busting 10 Myths about Hadoop
Hadoop is still misunderstood by many BI professionals.
- By Philip Russom, Ph.D.
- 03/20/2012
Although Hadoop and related technologies such as MapReduce have been with us for over five years, most BI professionals and their business counterparts still harbor misconceptions about them. I hope the following list of 10 facts will clarify what Hadoop is and does relative to BI, and in which business and technology situations Hadoop-based BI, data warehousing, and analytics can be useful.
Fact #1. Hadoop consists of multiple products.
We talk about Hadoop as if it’s one monolithic thing, whereas it’s actually a family of open-source products and technologies overseen by the Apache Software Foundation (ASF). (Some Hadoop products are also available via vendor distributions; more on that later.)
The Apache Hadoop library includes (in BI priority order): the Hadoop Distributed File System (HDFS), MapReduce, Hive, HBase, Pig, ZooKeeper, Flume, Sqoop, Oozie, Hue, and so on. You can combine these in various ways, but HDFS and MapReduce (perhaps with HBase and Hive) constitute a useful technology stack for applications in BI, data warehousing (DW), and analytics.
Fact #2. Hadoop is open source but available from vendors, too.
Apache Hadoop’s open-source software library is available from ASF at http://www.apache.org. For users desiring a more enterprise-ready package, a few vendors now offer Hadoop distributions that include additional administrative tools and technical support.
Fact #3. Hadoop is an ecosystem, not a single product.
In addition to products from Apache, the extended Hadoop ecosystem includes a growing list of vendor products that integrate with or expand Hadoop technologies. One minute on your favorite search engine will reveal these.
Fact #4. HDFS is a file system, not a database management system (DBMS).
HDFS is a distributed file system, so it lacks capabilities we’d associate with a DBMS, such as indexing, random access to data, and support for SQL. That’s okay, because HDFS does things DBMSs cannot do.
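To make the file-system-versus-DBMS distinction concrete, here is a minimal Python sketch. It does not use any Hadoop API; the sample data and the function names (scan_for_customer, build_index, lookup_with_index) are hypothetical, and an in-memory buffer stands in for a file. It only contrasts a sequential scan, which is what a plain file offers, with the kind of index lookup a DBMS adds on top.

```python
import io

# Hypothetical sample data standing in for the contents of a stored file.
raw = "1001,Alice,250\n1002,Bob,75\n1003,Carol,410\n"


def scan_for_customer(file_obj, customer_id):
    """File-system style: read line by line until the key turns up."""
    lines_read = 0
    for line in file_obj:
        lines_read += 1
        fields = line.rstrip("\n").split(",")
        if fields[0] == customer_id:
            return fields, lines_read
    return None, lines_read


def build_index(file_obj):
    """DBMS style: record where each key's row starts, once, up front."""
    index = {}
    while True:
        offset = file_obj.tell()
        line = file_obj.readline()
        if not line:
            break
        index[line.split(",")[0]] = offset
    return index


def lookup_with_index(file_obj, index, customer_id):
    """Jump straight to the row instead of reading everything before it."""
    offset = index.get(customer_id)
    if offset is None:
        return None
    file_obj.seek(offset)
    return file_obj.readline().rstrip("\n").split(",")


row, lines_read = scan_for_customer(io.StringIO(raw), "1003")
index = build_index(io.StringIO(raw))
indexed_row = lookup_with_index(io.StringIO(raw), index, "1003")
```

Both approaches return the same row, but the scan has to read every preceding line first. On a file of billions of rows that difference is the reason a file system alone is not a substitute for a DBMS, and also the reason batch-oriented, read-everything workloads suit it well.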
Fact #5. Hive resembles SQL but is not standard SQL.
Many of us are handcuffed to SQL because we know it well and our tools demand it. People who know SQL can quickly learn to hand-code Hive, but that doesn’t solve compatibility issues with SQL-based tools. TDWI feels that over time, Hadoop products will support standard SQL, so this issue will soon be moot.
Fact #6. Hadoop and MapReduce are related but don’t require each other.
Google engineers developed MapReduce before HDFS existed, and some variations of MapReduce work with a variety of storage technologies, including HDFS, other file systems, and some DBMSs.
Fact #7. MapReduce provides control for analytics, not analytics per se.
MapReduce is a general-purpose execution engine that handles the complexities of network communication, parallel programming, and fault tolerance for any kind of application that you can hand-code, not just analytics.
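The division of labor can be sketched in a few lines of Python. This is a single-process illustration of the programming model only, not Hadoop's implementation: it has no networking, parallelism, or fault tolerance, and the names run_mapreduce, word_count_mapper, and word_count_reducer are hypothetical. The point is that the engine knows nothing about the application, and the hand-coded mapper and reducer supply all of the logic.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Toy engine: apply the mapper, group values by key, apply the reducer."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # the "shuffle" step: group by key
    return {key: reducer(key, values) for key, values in sorted(groups.items())}


# The application logic is hand-coded and lives outside the engine.
def word_count_mapper(line):
    for word in line.lower().split():
        yield word, 1


def word_count_reducer(word, counts):
    return sum(counts)


# Input is any iterable of records; nothing here depends on where it is stored.
lines = ["Hadoop stores files", "MapReduce processes files"]
counts = run_mapreduce(lines, word_count_mapper, word_count_reducer)
```

Swapping in a different mapper and reducer turns the same engine into a log filter, a join, or a data-cleansing job, which is why MapReduce is better described as control for analytics than as analytics itself.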
Fact #8. Hadoop is about data diversity, not just data volume.
Theoretically, HDFS can manage the storage and access of any data type as long as you can put the data in a file and copy that file into HDFS. As outrageously simplistic as that sounds, it’s largely true, and it’s exactly what brings many users to Apache HDFS.
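A small Python sketch shows what this means in practice. The file names, contents, and the read_records function are hypothetical, and a dictionary stands in for the file system. Each file is stored as-is, and structure is applied only when the file is read.

```python
import csv
import io
import json
import re

# Hypothetical files of three different shapes, stored without any shared schema.
files = {
    "orders.csv": "order_id,amount\n1,19.99\n2,5.00\n",
    "clicks.json": '{"user": "u1", "page": "/home"}\n{"user": "u2", "page": "/cart"}\n',
    "web.log": '10.0.0.1 - - "GET /home" 200\n10.0.0.2 - - "GET /cart" 404\n',
}

LOG_PATTERN = re.compile(r'(\S+) - - "(\w+) (\S+)" (\d{3})')


def read_records(name, text):
    """Impose structure at read time, based on what kind of file this is."""
    if name.endswith(".csv"):
        return list(csv.DictReader(io.StringIO(text)))
    if name.endswith(".json"):
        return [json.loads(line) for line in text.splitlines() if line]
    # Fall back to treating the file as a simplified web log.
    matches = (LOG_PATTERN.match(line) for line in text.splitlines())
    return [
        dict(zip(("ip", "method", "path", "status"), m.groups()))
        for m in matches
        if m
    ]


records = {name: read_records(name, text) for name, text in files.items()}
```

The storage layer never needed to know about orders, clicks, or log lines. That is the opposite of a relational DW, where the schema must exist before the first row is loaded.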
Fact #9. Hadoop complements a DW; it’s rarely a replacement.
Most organizations have designed their DW for structured, relational data, which makes it difficult to wring BI value from unstructured and semistructured data. Hadoop promises to complement DWs by handling the multi-structured data types most DWs can’t.
Fact #10. Hadoop enables many types of analytics, not just Web analytics.
Hadoop gets a lot of press about how Internet companies use it for analyzing Web logs and other Web data. But other use cases exist. For example, consider the big data coming from sensors, such as robotics in manufacturing, RFID in retail, or grid monitoring in utilities. Older analytic applications that need large data samples (such as customer base segmentation, fraud detection, and risk analysis) can benefit from the additional big data managed by Hadoop. Likewise, Hadoop’s additional data can make 360-degree views more complete and granular.
For more information, read the TDWI Checklist Report on Hadoop at http://bit.ly/Hadoop4BI.