Q&A: IT's Evolving Big Data Landscape
As the buzz over big data grows, what's changed in the last year? How has machine-generated data changed the way IT tackles big data?
Last year around this time, we spoke with Don DeLoach, CEO of Infobright, about his thoughts on the many changes happening in the information management landscape, particularly those related to the rise of "big data."
The big data buzz grew even bigger this year, and Don's company was prepared. Infobright develops an analytic database that's purpose-built for crunching the ever-growing volumes of machine-generated data streaming in from the Web, call-detail records, sensor output, and logs of all kinds. Machine-generated data is the fastest-growing category of big data, and Don knows first-hand how businesses are still struggling to find ways to manage and exploit all this digital information.
We sat down with Don to talk about how things have evolved in the last 12 months and get his thoughts about the many diverse approaches that companies are taking to tackle big data.
Enterprise Strategies: We talked extensively about the coming big data storm just about one year ago and there's certainly been a lot of news about this in the IT world since then. Which of your predictions have played out as you expected and what's surprised you the most?
Don DeLoach: Well, I think it's no surprise that the volumes of data that companies are dealing with are continuing to just explode. There's really no end in sight. Just a few months after we spoke last year, IDC came out with its annual report on the digital universe and included a number of eye-popping statistics and predictions, including estimates that the number of servers will grow by a factor of 10 over the next decade, the amount of information managed by data centers will grow by a factor of 50, and the number of files the data center will have to deal with will grow by a factor of at least 75.
IBM claims that 90 percent of the data in the world today has been created in just the last two years. Cisco's recently announced forecast for global mobile data growth predicts an 18-fold increase in traffic over the next five years, reaching 10.8 exabytes per month by 2016. This kind of growth is impacting all types of businesses and industries, from financial services and telecommunications to retail, utilities, oil and gas, and more.
That said, the huge amount of attention that the challenges and solutions surrounding big data are now receiving -- especially even beyond the technology press -- has been a somewhat pleasant surprise. The Wall Street Journal, The New York Times, Forbes and Bloomberg Businessweek have all reported on big data in the last year. National Public Radio recently devoted a series of stories to the issue, and of course, the topic has just exploded in the technology trade magazines and blogs.
Topics such as data centers, servers, and storage capacity have not traditionally sounded all that scintillating, but when you start talking about some of the things businesses can do with all this information -- well, let's just say that people's ears have started to perk up. For those of us who have been working hard in the background for a number of years now, trying to figure out ways for businesses to capture, analyze, and actually capitalize on these vast reams of data without getting overwhelmed, it's rewarding to see that we've struck a nerve.
What is also surprising is the extent to which traditional solution providers are adapting their offerings to incorporate machine-generated data. For example, domain-specific solution providers ranging from mobile convergence and network optimization providers such as Mavenir and JDSU, respectively, to security management companies such as SonicWALL (now Dell) are increasingly upgrading their products and services to allow for much greater exploitation of this type of data. When that happens, everyone wins.
Earlier this year, Splunk, a firm specializing in log data analysis, held an initial public offering, issuing shares at $17 each; at press time they were above $31 a share. What does the Splunk IPO mean to other IT vendors specializing in solutions aimed at big data? Wall Street seems to have an appetite for tech stocks lately, but couldn't there be a big data "bubble"?
I think Splunk's very impressive IPO speaks well for the data management sector in general -- a rising tide lifts all boats, so to speak. I'd also say that it's premature to talk about a bubble, especially as the world of big data is still evolving. Right now, both business and IT stakeholders are hungry for innovation in this area and, as we've discussed, there's just so much growth in the diversity and volume of data. There's room for a lot of players.
Splunk is a great example of this, in fact. They've developed an excellent, vertically integrated, turnkey solution aimed at a very specific use case: IT log management. However, when it comes to big data, the challenges are as varied as the solutions cropping up to address them, and for many businesses, there's not going to be a one-size-fits-all answer. Just as many hardware systems consist of specialized components working together synergistically, certain software solutions might be deployed in tandem, depending on the need.
Will there eventually be some consolidation among vendors with various data management, ETL, storage, and analysis solutions? Sure. That's natural as a market matures. At this point, however, vendors are just putting their heads down and working hard on new ways to solve some of the big data challenges that businesses are facing today and will be facing tomorrow. It's an exciting time.
Both your company and Splunk are particularly focused on data that's machine-generated. What unique challenges are posed by machine data, and why should businesses be concerned about mining it? What gems does it hold?
Machine-generated data is one of the fastest-growing categories of big data, encompassing everything from Web, network, and security logs to call detail records, smart meter feeds from utility infrastructure, sensor data, gaming data, and data generated by social networking, to name just a few. In fact, the amount of data generated by machines is far greater than the amount of data created by individuals. Increasingly, businesses are just getting buried by it.
They can't just throw up their hands because if it's managed and mined properly, machine-generated data is also hugely valuable. For example, log data can help retailers and financial service companies spot fraud and security breaches. Sensor data can help utility providers pinpoint problems on the grid and take action to avoid blackouts and other potential crises. Telecommunications companies need this data to stay on top of service issues and track usage patterns. The list goes on.
The challenge with machine-generated data is that it's relentless. New records are added quickly and the volume is enormous, so finding those important nuggets of information demands an approach that's able to cut through all the "clutter" quickly and with precision. This is where traditional data solutions such as standard, row-based relational databases run into trouble. Designed to handle single-record, structured data, they just weren't built to do the kind of dynamic, ad hoc analysis that is so important in extremely fast-paced business and operational environments.
As data size and diversity increase, considerable manual configuration (i.e., DBA overhead) is required to get reasonable query performance. Then there's the hardware expense involved in adding more servers and storage capacity. This is why machine-generated data is so tricky -- companies that want to exploit it need to think outside of the box a bit.
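To make that contrast concrete, here is a minimal sketch in Python, with made-up log records, of why a column-oriented layout suits this kind of ad hoc aggregation. It illustrates the general idea, not any particular engine's implementation:

```python
# Hypothetical log records: (timestamp, source_ip, status, bytes_sent)
rows = [
    (1001, "10.0.0.1", 200, 512),
    (1002, "10.0.0.2", 500, 128),
    (1003, "10.0.0.1", 200, 2048),
]

# Row-oriented scan: every full tuple is read just to reach one field.
total_row_store = sum(r[3] for r in rows)

# Column-oriented layout: each field is stored contiguously, so the same
# aggregate reads only the "bytes_sent" column and skips everything else.
columns = {
    "timestamp":  [r[0] for r in rows],
    "source_ip":  [r[1] for r in rows],
    "status":     [r[2] for r in rows],
    "bytes_sent": [r[3] for r in rows],
}
total_column_store = sum(columns["bytes_sent"])

assert total_row_store == total_column_store == 2688
```

In a real columnar engine the columns are also compressed and chunked with metadata that lets whole blocks be skipped, but the access pattern is the point: the query touches one column, not every record.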
Are there specific industries where machine-generated data is an especially important concern?
Definitely. Telecommunications vendors and companies operating in the online arena (such as ad tech and analytics providers) have access to unprecedented amounts of log data, session data, clickstream data, and call-detail records that contain information about device usage, service quality, online behavior, and so on. This data can be used to more effectively target advertising (even based on very granular details such as location and social networking behavior), reduce customer churn, and improve service quality.
Industries with very complex operational environments are another good example -- such as oil and gas or utilities -- as these enterprises need to keep track of tons of sensor input and equipment feeds to operate efficiently, maximize asset performance, and spot issues that could potentially turn into a crisis. The financial services industry is another example, because there are such big volumes and tight time constraints involved. Basically, any type of company that needs to capture multiple data feeds, from diverse sources, and run investigative analytics on that data very quickly should be thinking about how best to achieve this objective.
Let's talk analytics. What do you think are the most important considerations for companies when it comes to extracting intelligence from their data, and considering how tight IT budgets are these days, does it even make sense financially to invest in solutions that are still being proven? Can you give some examples of how companies are either misfiring or "getting it right" with big data?
Obviously, scalability is a top consideration. As data volumes continue to grow, enterprises need data solutions that can help them handle both their current and future analytic requirements. At some point, traditional, hardware-based infrastructure is just going to run out of headroom in terms of storage and processing capabilities. Flexibility and speed are also really important, as we've already discussed. Much of the machine-generated data being captured has a fairly short expiration date. For example, a mobile carrier may want to optimize location-based offers using incoming GPS data, or a utility company may need to analyze smart meter feeds in real time to track usage behavior and stay ahead of power demand.
If it takes too long to analyze this type of data, or if users have to work within the confines of pre-defined queries and canned reports, it's simply not going to be very useful. Enterprises need to be able to quickly and easily load, dynamically query, analyze, and report on machine-generated information without wasting time and resources indexing or partitioning data or doing other sorts of manual configuration to extract the needed intelligence.
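As a hedged illustration of that load-then-query flow: Infobright is accessed through a MySQL-compatible interface, so a standard MySQL client can bulk-load a raw feed and immediately run investigative queries against it. The host, credentials, table, and file names below are hypothetical.

```python
import pymysql

conn = pymysql.connect(host="analytics-db.example.com", user="analyst",
                       password="secret", database="logs",
                       local_infile=True)
with conn.cursor() as cur:
    # Bulk-load raw machine data; note there is no CREATE INDEX or
    # manual partitioning step before querying.
    cur.execute("LOAD DATA LOCAL INFILE '/data/web_access.csv' "
                "INTO TABLE access_log FIELDS TERMINATED BY ','")
    # An ad hoc, investigative query issued directly against the fresh load.
    cur.execute("SELECT status, COUNT(*) FROM access_log "
                "WHERE ts >= '2012-06-01' GROUP BY status")
    for status, hits in cur.fetchall():
        print(status, hits)
conn.commit()
conn.close()
```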
This brings us to the last critical consideration: cost. In this time of still-constrained budgets, big data analysis needs to be affordable, as well as easy to use and implement, to justify the investment. This demands solutions that are optimized to deliver quick analysis of large volumes of data, with minimal hardware, administrative effort, or customization needed to set up or change query and reporting parameters.
Companies that keep doing things the way they've always been done will find themselves spending more on servers, storage, and DBAs to keep up, which is not going to be sustainable. You'll see enterprises successfully benefiting from alternative approaches -- whether columnar databases specifically designed for high-performance analytics, distributed processing frameworks such as Hadoop, cloud solutions, or a combination of techniques. And the good news is that many of these options are open source or available at a fraction of the cost of traditional tools, so companies can test-drive new solutions without risk, get projects up and running quickly, and eliminate the need for custom configuration, expensive licensing agreements, and equipment. "Getting it right" is a lot easier now that there are more choices.
How does Infobright's solution fit into the big data landscape?
Infobright's analytic database is specifically designed for users that need to capture and mine massive volumes of machine-generated data. (As we've already discussed, this is the fastest-growing category of big data.) The key thing we are after is to empower companies to do this without having to spend a ton of money or hire an army of database administrators, because at the end of the day, the ends aren't going to justify the means if you eat up all your time and budget to get there.
Our technology (which is available in a free-to-download open source community edition or as an enterprise edition) combines a columnar database with a unique architecture that eliminates the complexity inherent in running investigative analytics using a traditional relational database.
There are many advantages to column orientation, including more efficient data analysis, because each column stores a single data type (as opposed to rows, which typically contain several), and compression that can be optimized for each particular data type. This eliminates the costly hardware infrastructure and time-consuming manual configuration (such as indexing and building partitions) required to create, tune, and maintain an analytic platform. Infobright also provides strong data compression (from 10:1 to over 40:1), which drastically reduces I/O (improving analytic query performance) and requires significantly less storage than traditional solutions. All these capabilities ultimately reduce administrative effort by about 90 percent, offering a fast, simple, and affordable path to high-volume, high-performance analytics on machine-generated data.
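A toy demonstration of the compression point, using Python's zlib rather than Infobright's actual codecs (the ratios it prints are illustrative only, not the 10:1 to 40:1 figures above): values of a single type compress far better stored together than interleaved with other fields row by row.

```python
import zlib

n = 10_000
statuses = [200] * 9_500 + [404] * 400 + [500] * 100    # low-cardinality column
ips = [f"10.0.{i % 256}.{i % 100}" for i in range(n)]   # higher-entropy column

# Row layout interleaves the two types; column layout keeps each together.
row_layout = "".join(f"{ip},{s};" for ip, s in zip(ips, statuses)).encode()
col_layout = (",".join(map(str, statuses)) + "|" + ",".join(ips)).encode()

print("row layout: %.1f:1" % (len(row_layout) / len(zlib.compress(row_layout))))
print("column layout: %.1f:1" % (len(col_layout) / len(zlib.compress(col_layout))))
```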
There's a lot of buzz around Hadoop and NoSQL. Do you see those technologies as competition or complementary offerings to what Infobright is doing with machine-generated data?
I think they're complementary. We see Infobright working alongside Hadoop, MongoDB, Citrusleaf, and certainly SQL Server, Sybase, Oracle, MySQL, and others. Our aim is to focus on what we do best -- storing and analyzing machine-generated data -- and to co-exist easily and effectively with other technologies. So whether it's an end user deploying us to solve an analytics problem their existing infrastructure couldn't handle, or a solution vendor embedding our database within their own offering to enhance its analytic capability, Infobright is designed to work across different environments and use cases.
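As a hedged sketch of what that co-existence can look like in practice, the pipeline below stages event documents out of an operational MongoDB store, flattens them to CSV, and bulk-loads them into a MySQL-protocol analytic database for ad hoc querying. All hosts, credentials, collection names, and table names are hypothetical.

```python
import csv
import pymysql
from pymongo import MongoClient

# 1. Pull raw machine-generated events from the operational store.
events = MongoClient("mongodb://ops.example.com")["telemetry"]["events"]
with open("/tmp/events.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for doc in events.find({}, {"_id": 0, "ts": 1, "device": 1, "metric": 1}):
        writer.writerow([doc.get("ts"), doc.get("device"), doc.get("metric")])

# 2. Bulk-load the flattened feed into the analytic side for fast queries.
conn = pymysql.connect(host="analytics-db.example.com", user="etl",
                       password="secret", database="telemetry",
                       local_infile=True)
with conn.cursor() as cur:
    cur.execute("LOAD DATA LOCAL INFILE '/tmp/events.csv' "
                "INTO TABLE device_events FIELDS TERMINATED BY ','")
conn.commit()
conn.close()
```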