
Cirro Touts Big Data-Ready Analytic Platform

Don't know how to get started with your big data project? BI start-up Cirro claims to have just what you're looking for.

Business intelligence (BI) start-up Cirro bills itself as big data-ready. What does that take? Cirro's prescription mixes data federation -- delivered via its Cirro Data Hub -- with an end-user analysis plug-in for Microsoft Excel, called Cirro Analyst.

If you're like the overwhelming majority of shops, you're just starting -- or don't know where to start -- with big data. Cirro claims to have just what you're looking for.

CEO Mark Theissen, a veteran of the former Brio Software Inc. and the former DATAllegro Corp. -- among others -- describes the combination of Cirro Analyst and Cirro Data Hub as a great solution for big data analytics. In this case, data exploration -- or the simple act of equipping analysts to interact with, explore, and mash up big data sets -- is as good a starting place as any, Theissen maintains. "Our value proposition is around the idea of accessing any data, on any platform, at any time. We want to enable self-driven exploration," he explains.

Cirro launches this month. It's currently in beta with several customers, including a prominent entertainment giant. Its pitch is that of a BI analytic platform that requires Hadoop. In this regard, Theissen contrasts Cirro's architecture with those of its more established competitors, most of which, he claims, are trying to reconcile -- that is, embrace and extend -- existing platforms with Hadoop and big data.

If Cirro differs architecturally from competitive offerings, it's nonetheless similar in one important respect: its means of access to big data. Cirro, like a growing number of BI analytic technologies, leverages the open source Hadoop framework. Theissen, for example, lists the bread-and-butter Hadoop Distributed File System (HDFS), the Hadoop implementation of MapReduce, and -- in most cases -- Apache Hive as required components. Cirro accesses HDFS either through the Hive Query Language (HQL) or through direct MapReduce calls, which Theissen says are registered in its Cirro Function Library.

Cirro the platform sits atop the Hadoop framework. Its Cirro Data Hub comprises a de facto data virtualization layer that "connects" the traditional (relational), quasi-traditional (semi-structured), and Hadoop worlds (anything goes, structure-wise) in virtual space.

Exploring Big Data Analytics

Theissen sees data exploration as the most immediate application for big data analytics.

"Among the reasons people struggle with big data use cases is that they don't have a great way to explore their data," he argues. He suggests that Cirro's platform gives shops a chance to dip their toes into big data without getting bogged down in its still-coalescing details.

One wrinkle, of course, is that the problem Theissen describes -- enabling power users or analysts to interact with and explore data -- has always been a problem. Users, after all, have long been dissatisfied with the "average, everyday data" status quo.

True that, Theissen concedes. In the big data-scape, he argues, data exploration is an even bigger problem, chiefly because -- in the absence of packaged or codified offerings -- it's the de facto starting point for most big data efforts. Think of Cirro's model as ad hoc to the nth degree: give power users a means to mash up all-of-that-data, step back, and See What Happens.

"It's been very problematic for users to get their hands on the data without having to get a lot of help from IT," he says. "You'll have a situation where [IT will] say, 'We don't want a copy of our customer data sitting over where our unstructured data is,' or 'We want to be able to do joins, but we don't want to move our unstructured data where the customer data is, either.'"

"Federation" is a dirty word in some quarters, but Theissen says Cirro's federated architecture is the ideal topology for big data. "You might use Teradata for an enterprise data warehouse, or you might be using SQL Server for data marts. Maybe you have machine data in Splunk or social data in Hadoop," he comments. "It's going to be a distributed world, and these are all analytic data sources. The idea of putting everything in one place just isn't holding water, [and] people are going to need to be able to access all of these different analytic repositories."

Theissen says the big data mash-up is one of Cirro's specialties. He describes a scenario in which data from a live Twitter feed (consisting of JSON messages) is mashed with customer data. The Cirro Data Hub effectively orchestrates the mash-up, pulling data from the Hadoop Distributed File System and from a relational data store and performing the join on another platform. "We also enable big data joins, so if you want to join data that's in HDFS with customer data that's in Oracle, you can do that with a single SQL statement, and under the hood, Cirro will take care of that," he says, explaining that Cirro entrusts the Oracle -- or Teradata, or SQL Server, or DB2 -- query optimizer to tweak SQL queries.
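
To make the scenario concrete, here is a minimal sketch of the general idea: JSON messages are flattened into rows, staged next to the customer table, and joined with one SQL statement so that the database's own optimizer plans the join. It is not Cirro's implementation. SQLite stands in for Oracle, a Python list stands in for files in HDFS, and every table, column, and function name is a hypothetical chosen for illustration.

```python
import json
import sqlite3

# Hypothetical JSON messages; in the scenario these would come from HDFS.
TWEETS = [
    '{"customer_id": 1, "text": "love the new release"}',
    '{"customer_id": 1, "text": "support was quick"}',
    '{"customer_id": 2, "text": "waiting on my order"}',
]


def make_customer_db():
    """Build an in-memory SQLite database standing in for the relational store."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?)",
        [(1, "Acme"), (2, "Globex"), (3, "Initech")],
    )
    return conn


def stage_and_join(conn, json_lines):
    """Flatten JSON messages into a temp table, then join with a single SQL statement."""
    conn.execute("CREATE TEMP TABLE tweets (customer_id INTEGER, text TEXT)")
    rows = []
    for line in json_lines:
        msg = json.loads(line)
        rows.append((msg["customer_id"], msg["text"]))
    conn.executemany("INSERT INTO tweets VALUES (?, ?)", rows)
    # The join itself is one statement; the engine decides how to execute it.
    return conn.execute(
        """
        SELECT c.name, COUNT(*) AS tweet_count
        FROM customers AS c
        JOIN tweets AS t ON t.customer_id = c.id
        GROUP BY c.name
        ORDER BY c.name
        """
    ).fetchall()


if __name__ == "__main__":
    print(stage_and_join(make_customer_db(), TWEETS))
```

Staging the smaller data set beside the larger one is only one possible design; which side moves depends on data volumes, as the discussion that follows makes clear.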

Mark Madsen, a veteran data warehouse architect and a principal with information management consultancy Third Nature Inc., says the scenario Theissen describes isn't at all far-fetched. On the other hand, says Madsen, it doesn't have to be especially complicated, either.

"It's not hard to join SQL and MapReduce job queries. You simply split, send, retrieve all the data, and join at the server," he points out. Things get more complex as data volumes get bigger, Madsen concedes. In that case, he says, a technology solution could make a difference, particularly if it addresses the data movement -- or workload orchestration -- issues that are the inevitable byproducts of highly distributed processing.
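
The "split, send, retrieve, join" approach Madsen describes can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not any vendor's code: SQLite stands in for the SQL source, a plain Python function stands in for a MapReduce job, and all names are hypothetical. Both sides return everything they have, and the join happens in the middle tier.

```python
import sqlite3


def fetch_customers(conn):
    """SQL half of the split: pull every customer row, with no filtering."""
    rows = conn.execute(
        "SELECT id, name, region FROM customers ORDER BY id"
    ).fetchall()
    return [{"id": r[0], "name": r[1], "region": r[2]} for r in rows]


def count_mentions(records):
    """Stand-in for a MapReduce job: count records per customer id."""
    counts = {}
    for rec in records:
        counts[rec["customer_id"]] = counts.get(rec["customer_id"], 0) + 1
    return counts


def naive_federated_join(conn, records):
    """Retrieve all data from both sources, then join at the server."""
    customers = fetch_customers(conn)
    mentions = count_mentions(records)
    return [
        {**cust, "mentions": mentions[cust["id"]]}
        for cust in customers
        if cust["id"] in mentions
    ]


def demo():
    """Run the join on a tiny, made-up data set."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, region TEXT)"
    )
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?, ?)",
        [(1, "Acme", "West"), (2, "Globex", "East"), (3, "Initech", "West")],
    )
    records = [{"customer_id": 1}, {"customer_id": 3}, {"customer_id": 1}]
    return naive_federated_join(conn, records)


if __name__ == "__main__":
    print(demo())
```

The simplicity is the point, and so is the weakness: every row from both sources crosses the network before a single one is discarded.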

"Then you have to look at predicate ordering, work out which place to go first so you only move the minimum amount of data to the server, or you get even more clever and do the most restrictive predicate first, then send the results to the other server as part of that query, and get the (hopefully) much smaller set of results back." This is the kind of thing that makes data federation -- to say nothing of data virtualization -- such a tricky proposition. It's likewise an area that federation pioneer Composite Software Inc. "has focused a lot of effort on for years," Madsen points out.
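
The "most restrictive predicate first" idea can be illustrated with a small comparison. The sketch below is an assumption-laden toy, not a description of how Cirro or Composite actually plans queries: SQLite stands in for the relational source, a filterable Python list stands in for the remote big data source, and "rows moved" is simply a count of rows each plan would ship. All names are hypothetical.

```python
import sqlite3


def build_customers():
    """100 customers; every tenth one is in the 'West' region."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT)")
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?)",
        [(i, "West" if i % 10 == 0 else "East") for i in range(1, 101)],
    )
    return conn


def scan_events(events, wanted_ids=None):
    """Stand-in for the remote source; returns the rows it would ship back."""
    if wanted_ids is None:
        return list(events)
    return [e for e in events if e["customer_id"] in wanted_ids]


def join_naive(conn, events):
    """Pull everything from both sides, then filter and join at the server."""
    customers = conn.execute("SELECT id, region FROM customers").fetchall()
    shipped = scan_events(events)
    west = {cid for cid, region in customers if region == "West"}
    joined = [e for e in shipped if e["customer_id"] in west]
    return joined, len(customers) + len(shipped)


def join_restrictive_first(conn, events):
    """Apply the selective predicate first, then send only matching ids onward."""
    rows = conn.execute(
        "SELECT id FROM customers WHERE region = ?", ("West",)
    ).fetchall()
    west = {r[0] for r in rows}
    shipped = scan_events(events, wanted_ids=west)
    return shipped, len(west) + len(shipped)


if __name__ == "__main__":
    events = [{"customer_id": i, "clicks": i * 2} for i in range(1, 101)]
    conn = build_customers()
    print(join_naive(conn, events)[1], join_restrictive_first(conn, events)[1])
```

Both plans return the same joined rows; they differ only in how much data has to travel to produce them, which is the cost a federation layer is trying to minimize.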

Like some of its analytic rivals -- QlikView from QlikTech Inc. comes most immediately to mind -- Cirro doesn't offer much in the way of a metadata management facility. "[Our] product strategy is not to be a metadata management solution for a data ecosystem," Theissen says. "Cirro does maintain its own metadata for all unstructured data sources, query performance, and published views."

On the other hand, he continues, "Cirro's metadata will integrate with other metadata management tools or BI/analytic tools." Its metadata conforms to what Theissen describes as "standard metadata for databases," with categories such as SYSCOLUMNS and SYSTABLES.
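
For readers unfamiliar with catalog-style metadata, the following sketch shows what tables named SYSTABLES and SYSCOLUMNS typically look like and how another tool might query them. Only the two table names come from Theissen's description; the columns, the sample rows, and the use of SQLite are assumptions made for illustration and do not reflect Cirro's actual schema.

```python
import sqlite3


def build_catalog():
    """Create a toy catalog with hypothetical SYSTABLES and SYSCOLUMNS layouts."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE SYSTABLES (
            table_id INTEGER PRIMARY KEY,
            table_name TEXT,
            source TEXT
        );
        CREATE TABLE SYSCOLUMNS (
            table_id INTEGER,
            column_name TEXT,
            data_type TEXT,
            ordinal INTEGER
        );
        """
    )
    conn.executemany(
        "INSERT INTO SYSTABLES VALUES (?, ?, ?)",
        [(1, "customers", "relational"), (2, "tweets", "hdfs")],
    )
    conn.executemany(
        "INSERT INTO SYSCOLUMNS VALUES (?, ?, ?, ?)",
        [
            (1, "id", "INTEGER", 1),
            (1, "name", "TEXT", 2),
            (2, "customer_id", "INTEGER", 1),
            (2, "text", "TEXT", 2),
        ],
    )
    return conn


def describe(conn, table_name):
    """Return (column_name, data_type) pairs for a table, in column order."""
    return conn.execute(
        """
        SELECT c.column_name, c.data_type
        FROM SYSCOLUMNS AS c
        JOIN SYSTABLES AS t ON t.table_id = c.table_id
        WHERE t.table_name = ?
        ORDER BY c.ordinal
        """,
        (table_name,),
    ).fetchall()


if __name__ == "__main__":
    print(describe(build_catalog(), "tweets"))
```

Because the catalog is itself just a pair of queryable tables, an external metadata or BI tool can read it with ordinary SQL, which is presumably what makes this style of metadata easy to integrate.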
