Next-Generation Analytics Solution: Real-Time Data Analysis
How on-the-fly data analysis can fundamentally change the basic cost structure of analyzing data
- By Linda Briggs
- 03/04/2009
Traditional analytics solutions are under pressure as data volumes mushroom faster than hardware can keep pace. Common responses include scaling out by adding servers and database engines, employing custom hardware, or using column-oriented databases.
A 2006 startup, Truviso, is taking a different approach, offering what it prefers to call continuous (rather than real-time) analysis of incoming data, with zero latency even for massive volumes of data. In this interview, Truviso co-founder and CTO Mike Franklin discusses the challenges facing next-generation analytics.
“With data growth rates significantly higher than the rate at which hardware is getting faster, Moore’s law cannot keep up with the flood of data,” Franklin says. In our conversation, he explains how on-the-fly data analysis can fundamentally change the basic cost structure of analyzing data.
Franklin has a doctorate in computer science and is a professor of computer science at the University of California, Berkeley, where his research focuses on the architecture and performance of distributed data management and information systems.
TDWI: What are the challenges with next-generation analytics? What has happened in the last few years that changes the picture for data analytics?
Mike Franklin: There are three forces that are combining to put pressure on traditional analytics solutions. First, and most important, is the unprecedented increase in the amount of data that organizations need to make sense of. In a recent article, Richard Winter, a specialist in large database technology, concluded that virtually every organization is facing rapid data volume growth, with typical data volumes increasing at a rate of one-and-a-half to two-and-a-half times a year. In network-centric areas such as social networks, content delivery, security, and others, data can be increasing at a rate as high as ten times a year.
With data growth rates significantly higher than the rate at which hardware is getting faster, Moore’s law cannot keep up with the flood of data. As a result, trying to address the data analytics scalability problem by throwing hardware at it becomes a losing proposition due to spiraling costs in servers, people, power, cooling, and space.
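To put rough numbers on that gap: assume data volumes grow 2.5 times a year (the high end of the range above) while hardware doubles roughly every two years. A quick illustrative calculation, with both rates being assumptions rather than measurements, shows how fast the shortfall compounds:

```python
# Back-of-the-envelope numbers behind the "Moore's law can't keep up" point.
# Both growth rates below are illustrative assumptions, not measurements.

data_growth_per_year = 2.5         # data volume growing up to ~2.5x per year
hw_growth_per_year = 2 ** (1 / 2)  # ~2x every two years, i.e. ~1.41x per year

# The shortfall between data and hardware compounds every year.
gap_per_year = data_growth_per_year / hw_growth_per_year
print(f"{gap_per_year:.2f}x per year")          # ~1.77x
print(f"{gap_per_year ** 5:.1f}x after 5 years")  # ~17x
```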
In conjunction with data volume growth, organizations are under increasing pressure to make decisions more quickly. To make matters even more challenging, increasingly sophisticated analyses must be applied to more detailed and complex information. The combination of these three forces has created a “perfect storm” in which the traditional approaches to scalability and performance in data analytics have become increasingly untenable.
Given these challenges, what solutions have companies developed?
Most database and data warehouse vendors are addressing these issues by scaling out through parallelism. The idea is to try to keep up with growing data volumes by adding more servers running traditional database engines. The roots of this approach go back to research on “shared-nothing” architectures dating from the 1980s. Recently, there has been interest in using more general-purpose parallel programming systems, largely from the open source community, as a platform for scaling out analytics. Examples of such systems are Apache Hadoop, a Java software framework, and MapReduce, a programming model introduced by Google.
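To make the MapReduce model concrete, here is a toy, single-process sketch in Python. It imitates the map, shuffle, and reduce phases in miniature; it is not Hadoop's actual API, and a real job would run these phases in parallel across a cluster:

```python
from collections import defaultdict
from itertools import chain

def map_phase(record):
    # Emit (key, value) pairs: here, one (word, 1) per word.
    for word in record.split():
        yield (word, 1)

def shuffle(pairs):
    # Group all values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Combine the grouped values: here, a simple count per word.
    return (key, sum(values))

records = ["a b a", "b c"]
pairs = chain.from_iterable(map_phase(r) for r in records)
result = dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())
print(result)  # {'a': 2, 'b': 2, 'c': 1}
```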
In a different approach, a number of newer companies are re-thinking some of the fundamental assumptions of how analytics engines work. One approach that has garnered significant interest lately is the column-oriented database. Other companies are applying custom hardware to key parts of the query processing pipeline to make things go faster.
Truviso has a unique approach to the problem, called Continuous Analytics. We apply continuous, stream-oriented query processing to information as it flows into the system in order to fundamentally change the basic cost structure of analyzing data.
What are some of the pros and cons of the various solutions you’ve mentioned?
The approaches based on parallelism have the advantage that the techniques for parallelizing database systems are fairly well understood. However, they suffer, as I noted earlier, from the inability of Moore’s law to keep up with the demands being placed on modern analytics systems.
The Hadoop-like approaches also have compatibility and ease-of-use issues associated with supporting general programming and scripting languages rather than SQL. Column-stores can reduce the disk I/O required for certain types of queries, but they are still fundamentally limited by the disk-centric, batch-oriented, store-first/query-later approach of traditional database technology. Although specialized hardware can speed up certain query processing operations, we learned in the early days of custom “database machines” that such solutions have a cost and evolution disadvantage relative to conventional, off-the-shelf hardware.
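To see why the column layout reduces I/O for such queries, consider this toy Python sketch. The data and the query are invented for illustration and imply no particular engine's storage format:

```python
# A row store must read whole rows; a column store reads only needed columns.
rows = [
    {"user": "u1", "region": "EU", "amount": 10.0},
    {"user": "u2", "region": "US", "amount": 25.0},
]

# Column-oriented layout of the same data.
columns = {
    "user":   ["u1", "u2"],
    "region": ["EU", "US"],
    "amount": [10.0, 25.0],
}

# SELECT SUM(amount): the row store touches every field of every row...
total_row_store = sum(r["amount"] for r in rows)

# ...while the column store scans just the one column the query needs.
total_col_store = sum(columns["amount"])

assert total_row_store == total_col_store == 35.0
```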
Continuous analytics has a number of advantages. Queries can be processed without first having to write data to disk and then read it back later, redundant work can be squeezed out of a large set of analytics queries, and analytics can run throughout the entire day rather than during a restricted batch window. As a result, analytics can be performed with dramatically lower hardware requirements.
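A minimal Python sketch of this incremental style of processing, with an invented event shape: two running aggregates share a single pass over the stream, and no event is written to disk and re-read:

```python
from collections import defaultdict

# Two running aggregates maintained in one pass over the stream, so the
# work of reading each event is shared rather than repeated per query.
counts_by_page = defaultdict(int)
revenue_by_page = defaultdict(float)

def on_event(event):
    # Each arriving event updates every registered result incrementally;
    # nothing is stored first and queried later.
    counts_by_page[event["page"]] += 1
    revenue_by_page[event["page"]] += event["amount"]

for e in [{"page": "/buy", "amount": 10.0},
          {"page": "/home", "amount": 0.0},
          {"page": "/buy", "amount": 5.0}]:
    on_event(e)

print(dict(counts_by_page))   # {'/buy': 2, '/home': 1}
print(dict(revenue_by_page))  # {'/buy': 15.0, '/home': 0.0}
```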
Are there disadvantages to continuous analytics?
The main disadvantage of stream-based continuous analytics is the incorrect perception that stream processing is solely for real-time applications, rather than for the more general, and much larger, set of problems facing data analytics today.
What are the particular challenges of data streaming?
If you think about it, all data, be it clicks, ad impressions, transactions, or whatever, begins as streaming data. Data only stops streaming when you store it somewhere. A key realization in the development of the continuous analytics system is that it is crucial to separate the processing of data from the use of the results of that processing.
At Truviso, we process the data in real time, since that is by far the most efficient way to do so. We don’t, however, require users to consume the analytics in real time. Rather, query results are continuously loaded into standard database tables so that users can obtain analytics when and how the business processes require them, whether on-demand, through scheduled reports, or even through instantaneous alerts and notifications.
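The following Python sketch illustrates that separation, using an in-memory SQLite table as a stand-in for the tables that hold continuous query results; the schema and event shape are hypothetical:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE clicks_per_page (page TEXT PRIMARY KEY, n INTEGER)")

def on_event(event):
    # Processing side: results are updated the moment each event arrives.
    # (UPSERT syntax requires SQLite 3.24+.)
    db.execute(
        "INSERT INTO clicks_per_page VALUES (?, 1) "
        "ON CONFLICT(page) DO UPDATE SET n = n + 1",
        (event["page"],),
    )

for e in [{"page": "/home"}, {"page": "/buy"}, {"page": "/home"}]:
    on_event(e)

# Consumption side: ordinary SQL over an ordinary table, whenever needed.
print(db.execute("SELECT * FROM clicks_per_page ORDER BY page").fetchall())
# [('/buy', 1), ('/home', 2)]
```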
From a technical point of view, the main challenge in building a continuous analytics solution is to do so in a way that doesn’t break the database and data analytics programming model that companies have invested in. This means supporting the full SQL language rather than a subset or a SQL-like language, supporting the standard interfaces (such as JDBC and ODBC), and integrating seamlessly with persistent storage.
How does Truviso address the issues we’ve discussed?
In a nutshell, Truviso has figured out how to seamlessly integrate a high-performance stream processing engine inside of (not on top of or next to) a full-function SQL relational database system. In Truviso, streams are simply tables that arrive on a continuous basis. Therefore, queries can be written that execute over streams, over tables, or over any combination of the two. In this way, Truviso can be slotted into existing data analytics architectures without requiring a “rip and replace,” while providing orders-of-magnitude improvements in both scalability and latency. It thereby enables the increasingly demanding analytics workloads of today’s data-intensive businesses.
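A hypothetical Python sketch of the idea of a stream as a continuously arriving table (this is not Truviso's syntax or internals): each arriving event is joined against an ordinary stored table the moment it shows up:

```python
users = {"u1": "gold", "u2": "silver"}  # an ordinary, persistent table

def continuous_join(event_stream, table):
    # A stream/table join evaluated incrementally, one event at a time,
    # rather than after the stream has been stored and batch-queried.
    for event in event_stream:
        tier = table.get(event["user"], "unknown")
        yield {**event, "tier": tier}

stream = iter([{"user": "u1", "amount": 10}, {"user": "u2", "amount": 5}])
for enriched in continuous_join(stream, users):
    print(enriched)
# {'user': 'u1', 'amount': 10, 'tier': 'gold'}
# {'user': 'u2', 'amount': 5, 'tier': 'silver'}
```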
As I said earlier, a main challenge in building a continuous analytics solution is to support the database and data analytics programming model that companies already have by supporting the full SQL language, the standard interfaces, and seamless integration with persistent storage. Doing all this while still providing the fundamental benefits of continuous processing is at the heart of the innovations that Truviso offers.