In-Depth

How Cloudera Became a Leader in BI/Hadoop

With a flurry of recent BI-oriented partnerships, it's no surprise Cloudera is attracting so much interest.

Although Cloudera has never explicitly marketed itself as a BI-centric vendor, the company's Hadoop-based business model has effectively brought BI to it. The partnership Cloudera signed late last year with Informatica Corp. was just the latest example.

The Informatica accord was Cloudera's second partnership last year with a leading data integration (DI) player. Back in August, Cloudera cemented a deal with open source software (OSS) DI specialist Talend. It also has partnerships with Teradata Corp., the former Netezza Inc., the former Greenplum Software Corp., Aster Data Systems Inc., Vertica Inc., and Pentaho.

One thing's for sure: Cloudera is attracting attention.

On the DI front, both Talend and Informatica are tapping Cloudera Enterprise, which includes both Cloudera's Hadoop-based services framework and a dollop of proprietary "special sauce." By itself, Cloudera Distribution for Hadoop is an OSS offering: it's a bundle of 11 distinct open source products.

Before it introduced Cloudera Enterprise, the company chiefly focused on selling support for Hadoop -- bundling all of the Hadoop packages together, offering Cloudera Distribution for Hadoop as a single package, and selling maintenance and support (offering bug and security fixes, releasing update patches, and helping enterprises troubleshoot Hadoop performance or implementation issues).

Cloudera positions Enterprise as a kind of on-ramp into the world of Hadoop. It consists of three components, two of which are homegrown: an authorization and authentication facility, a resource management tool, and integration and configuration monitoring.

"What will happen is that an organization will download Hadoop [and] they'll start to use it internally -- maybe it's a development group or an architectural group -- and they'll start to play with it and then the [IT] group will be asked to solve a particular technology problem that they have," explains John Kreisa, Cloudera's vice president of marketing.

"They'll connect that to a business user [who will] start to use the software in an increasingly sophisticated way. [IT will find that users are] loading more data into [Hadoop], and [as they load more data] they'll start to find more business use-cases. At some point, the cluster will grow, they'll start to add even more components, and the managing of all of that will become more complex. Cloudera Enterprise is designed to meet those needs."

Cloudera is also collaborating to develop high-speed, bi-directional connectivity between its own Hadoop-based platform and the offerings of Informatica, Talend, Teradata, and other partners. The high-speed connectivity component is another way the company differentiates its Cloudera Distribution for Hadoop from those offered by Yahoo and other sources, Kreisa maintains.

That being said, he concedes, there's at least one OSS offering -- Sqoop, or SQL-to-Hadoop -- that shops can tap to shuttle data between Hadoop and relational data sources. (Sqoop is part of Cloudera's Distribution for Hadoop.)

"We use our expertise in Hadoop combined with the expertise of the vendor in their own technology to build this connectivity into their architecture," Kreisa explains. "Each of the partnerships that we signed includes the building of a connector," he continues, acknowledging that many connectivity offerings are still works in progress.

"Some are already in beta, some are out there today, and some are just being developed. The connectivity between Oracle and Hadoop [which Cloudera developed in tandem with partner Quest Software] is available today. Others -- such as the partnership with Teradata -- are in development; we're testing at key customer sites. There's a timeline for rolling these things out, but we're working [with our partners] to make it happen."

Cloudera likes to bill itself as the most prominent name in Apache Hadoop. It could plausibly do so up until early last year, when IBM Corp. announced its own Hadoop-oriented push. Like Cloudera, IBM proposed to deliver its own managed, value-added, and (just as important) vendor-supported version of Hadoop via its InfoSphere BigInsights platform.

IBM is not currently a Cloudera partner. It explicitly competes against Cloudera in the Hadoop arena. Kreisa doesn't see BigInsights as direct or symmetrical competition, but he doesn't dismiss Big Blue, either. "We don't right now partner with IBM. They've been pushing BigSheets and BigInsights, and they have a very serious strategy around Hadoop," he says. "For them, it's around selling some of the analytics on top [of Hadoop] along with the related proprietary tools that they have. They have an IBM distribution [of Hadoop], but really that is just the core Apache Hadoop offering pushed out with their tools running on it," he notes.

Industry veteran Merv Adrian saw plenty of upside in Cloudera's partnership with Informatica. To Adrian, the Informatica deal -- which was completed in November of 2010 -- capped a terrific year for Cloudera. More than anything else, however, Adrian sees the deal as a feather -- as perhaps the feather -- in Cloudera's cap.

"Connecting to the dominant player in data integration and data quality expands the opportunity for Cloudera dramatically; it enables the de facto commercial Hadoop leader to find new ways to empower the 'silent majority' of data," wrote Adrian on his IT Market Strategy blog.

The big, dirty secret of information management concerns the disposition of this "silent majority" of data. For a long time, Adrian contends, this information has been neglected -- chiefly because it's outside the RDBMS.

"The majority of data is outside; not just outside enterprise data warehouses, but outside RDBMS instances entirely," he wrote. "Why? Because it doesn't need all the management features database management software provides -- it doesn't get updated regularly, for example. In fact, it may not be used very often at all, though it does need to be persisted for a variety of reasons."

It's important for Cloudera to sustain and grow its partnerships if it's to retain top billing in the Hadoop market, Adrian maintained.

"[Cloudera is] going to be challenged by some big players in 2011, notably IBM, whose recent focus on Hadoop has been remarkably nimble. So these deals matter. A lot. The Data Management function is being refactored before our eyes; both these vendors will play in its future."
