In-Depth
Big Data Meets Virtualization
Big Data has surpassed cloud computing as the Next Big Thing in IT. We explore how big data is connected to virtualization and private clouds.
By Mike Wronski, VP of Product Management, Reflex Systems
Cloud is no longer the biggest buzz around the IT water cooler, at least not directly. The tech term that is all the rage now is Big Data. What is Big Data, and what does it have to do with virtualization and private clouds? The best place to start is with some definitions.
Big Data refers to enormous quantities of unstructured or loosely structured information. Essentially, it all revolves around the benefits of performing analysis on these large data sets and the technologies needed for the task. In the social media world, for example, many marketing companies are analyzing the more than 200 million messages Twitter generates each day to glean insight into consumer behavior.
Cloud computing has varied definitions, but it usually boils down to a focus on the service delivery model. The “cloud model” is all about leveraging an economy of scale in IT infrastructure, typically built on virtualization, via a multi-tenant, self-service interface for the consumer. Cloud’s predecessors, utility and grid computing, shared similar goals.
How does Big Data relate to virtualization and private clouds? IT has always had the lofty goal of the lights-out, fully automated data center. Surprisingly enough, the cloud model fits this goal quite well. As the industry uses virtualization technologies to get closer to achieving “cloud,” we gain access to data sets that were difficult or impossible to gather reliably in the past. These technologies share a commonality: they all provide well-defined application programming interfaces (APIs) for access to their functions and to large amounts of performance and operational data. Previously, access to this data varied widely in quantity and quality from vendor to vendor, making integration and data harvesting complex.
Ease of access to data, combined with the ease of provisioning and reconfiguring everything via software, produces a massive collection of disassociated data that ranges from application, hypervisor, and storage performance metrics to guest operating system configuration, network flows, security events, and configuration changes. That definitely sounds like the definition of Big Data we started with.
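As a minimal sketch of what that API-driven harvesting can look like, consider polling a management endpoint for per-VM counters. The URL, field names, and token below are hypothetical placeholders, not any vendor's actual API; real hypervisor APIs differ in authentication and schema, but the pattern is the same.

```python
import requests  # third-party HTTP client

# Hypothetical management endpoint exposing per-VM metrics as JSON.
METRICS_URL = "https://mgmt.example.com/api/vms/metrics"

def harvest_vm_metrics(session_token):
    """Pull one snapshot of loosely structured per-VM performance data."""
    resp = requests.get(
        METRICS_URL,
        headers={"Authorization": f"Bearer {session_token}"},
        timeout=10,
    )
    resp.raise_for_status()
    # Each record mixes performance, configuration, and event data --
    # exactly the kind of disassociated payload described above.
    return resp.json()

if __name__ == "__main__":
    for record in harvest_vm_metrics("example-token"):
        print(record.get("vm_name"), record.get("cpu_usage_pct"), record.get("events"))
```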
The next part of the equation is storage and analysis. The current thinking is that existing storage methods (e.g., RRDtool for time-series data) and traditional relational databases (e.g., MySQL, Oracle) are insufficient for the job, and we are seeing the rise of special-purpose tools focused on Big Data. Because the data is unstructured and varied in format, storing it in normalized form isn’t practical.
On the database side, this class of tools falls under the NoSQL moniker. NoSQL systems rely on a key-value mechanism, so they have no dependency on a fixed schema or data model, which addresses the unstructured aspect of the data. NoSQL typically allows for replication and distribution of the data as well, providing both high availability and the option for distributed storage and query. Examples of NoSQL technologies are Cassandra, CouchDB, and MongoDB.
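As a rough illustration of the schema-less approach, here is a sketch using MongoDB via the pymongo driver (it assumes a local MongoDB instance; the database, collection, and field names are made up). Documents with completely different shapes can live in one collection and still be queried by attribute.

```python
from datetime import datetime
from pymongo import MongoClient  # MongoDB driver; assumes a local mongod is running

client = MongoClient("mongodb://localhost:27017")
metrics = client["virt_big_data"]["metrics"]

# A hypervisor performance sample and a configuration-change event share
# one collection -- no fixed schema is required.
metrics.insert_one({
    "type": "hypervisor_perf",
    "host": "esx-01",
    "cpu_ready_ms": 1240,
    "ts": datetime.utcnow(),
})
metrics.insert_one({
    "type": "config_change",
    "vm": "web-42",
    "change": {"vcpus": {"old": 2, "new": 4}},
    "ts": datetime.utcnow(),
})

# Query by attribute without caring about the rest of each document's structure.
for doc in metrics.find({"type": "hypervisor_perf"}):
    print(doc["host"], doc["cpu_ready_ms"])
```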
On the analysis side, data is typically queried and then passed through a set of computational or event-correlation algorithms to produce a result in the form of trending, sizing recommendations, capacity exhaustion alerts, or similar information. The problem with performing this kind of analysis on large data sets is that it is resource-intensive (CPU, memory) and time-consuming.
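As a simple sketch of one such computation, the snippet below fits a linear trend to historical datastore usage and projects when capacity runs out. The samples and capacity figure are made up for illustration; it assumes the data has already been pulled out of whatever store holds it.

```python
import numpy as np

# Hypothetical daily samples of datastore usage, in GB (made-up numbers).
days = np.arange(30)
used_gb = 400 + 6.5 * days + np.random.normal(0, 5, size=30)
capacity_gb = 1000

# Fit a linear growth trend, then project forward to estimate exhaustion.
slope, intercept = np.polyfit(days, used_gb, 1)
days_to_full = (capacity_gb - intercept) / slope

print(f"Growth: {slope:.1f} GB/day; projected full around day {days_to_full:.0f}")
```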
One approach to reducing the cost of calculation is to reduce the data by rolling it up into averages or pre-computing some portions of it. This produces quicker results, but rolling data up into averages can compromise its fidelity. For example, if performance spikes occur during only a few hours of a day and that day's performance data is rolled up into a daily average, the evidence of those spikes is lost.
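A tiny worked example of that fidelity loss, using made-up hourly CPU samples: the daily average looks healthy even though two hours hit saturation.

```python
# 24 hourly CPU-utilization samples (%) for one VM; two hours spike to ~100%.
hourly_cpu = [20] * 10 + [98, 100] + [22] * 12

daily_avg = sum(hourly_cpu) / len(hourly_cpu)
daily_max = max(hourly_cpu)

print(f"daily average: {daily_avg:.1f}%")   # ~28% -- the spike is invisible
print(f"daily maximum: {daily_max}%")       # 100% -- visible only in the raw samples
```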
Another problem with stored historical data is that it is typically processed in batches. Batch processing is great for basic reporting, but for many types of analysis, periodic results are unacceptable. Business-critical applications need the ability to start calculations with historical data, then take streaming inputs and provide real-time outputs. For this we look to a technology that grew out of automated trading in the financial industry, where traders build complex buy/sell decision formulas on top of many real-time feeds, including stock trades, company financial reports, and weather information (call it Big Financial Data).
The generic name for systems that take multiple real-time data feeds and perform detailed correlation and analysis on them is Complex Event Processing (CEP). CEP implementations typically provide two types of processing: computational and detection. Computational processing focuses on sliding-window calculations over inbound data. Detection focuses on recognizing combinations of events, or patterns, in an input stream.
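A minimal sketch of both CEP styles over a metric stream, assuming events arrive as simple dictionaries (production CEP engines do this at far larger scale and with richer pattern languages; the event shapes and thresholds here are illustrative):

```python
from collections import deque

WINDOW = 5  # sliding-window size, in samples

def process_stream(events):
    """Computational: rolling average of latency over a sliding window.
    Detection: flag the pattern 'configuration change followed by a latency spike'."""
    window = deque(maxlen=WINDOW)
    saw_config_change = False

    for ev in events:
        if ev["kind"] == "config_change":
            saw_config_change = True
        elif ev["kind"] == "latency_ms":
            prior_avg = sum(window) / len(window) if window else ev["value"]
            if saw_config_change and ev["value"] > 3 * prior_avg:
                print(f"ALERT: {ev['value']} ms latency spike after a config change")
            window.append(ev["value"])
            print(f"rolling avg latency: {sum(window) / len(window):.1f} ms")

# Hypothetical input stream mixing metric samples and events.
stream = [
    {"kind": "latency_ms", "value": 10},
    {"kind": "latency_ms", "value": 12},
    {"kind": "config_change", "vm": "db-01"},
    {"kind": "latency_ms", "value": 95},
]
process_stream(stream)
```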
If we take these Big Data concepts and create a set of tools targeted specifically at virtualization, the tasks become even easier to handle. We are going to see the creation of technology building blocks targeted at the management of large virtualization environments. A starting point for integrating the NoSQL and CEP technologies would be a domain-specific language (DSL) for virtualization. Domain-specific languages are dedicated to a specific problem and thus allow the solution to be expressed more clearly: in this case, the interaction with and analysis of the Big Data generated by the next generation of virtualized data centers. Examples of other DSLs are SQL for interaction with relational databases and HTML for the creation of Web content.
Being domain-specific, this “virtual query language” comprehends the relationships among the tiers of virtualization (hypervisor, storage, networking, and applications) and the data those tiers generate. It also acts as the abstraction layer for interacting with the varied and distributed data sources via a virtualization-specific vocabulary and taxonomy.
A DSL for virtualization should:
- Provide easy query and search of the virtual environment, with results returned in the context of virtual components
- Act as the declarative language for policy and “eventing,” feeding a rules engine
- Be the language used to define the scope and parameters for real-time processing done by CEP
- Allow for programmatic interaction for easy integration with other business intelligence tools
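No such language exists yet, but a toy sketch in Python hints at the kind of virtualization-aware query a DSL like this could express. The inventory records, field names, and query helper below are entirely hypothetical and stand in for what the DSL would do against real, distributed data sources.

```python
# Hypothetical inventory records, standing in for data the DSL would query.
inventory = [
    {"vm": "web-01", "host": "esx-01", "cpu_ready_ms": 1800, "datastore": "fast-ssd"},
    {"vm": "web-02", "host": "esx-02", "cpu_ready_ms": 120,  "datastore": "fast-ssd"},
    {"vm": "db-01",  "host": "esx-01", "cpu_ready_ms": 2400, "datastore": "bulk-sata"},
]

def query(records, **predicates):
    """Toy stand-in for a virtualization query language: each keyword argument
    means 'field >= value' for numbers and 'field == value' for strings."""
    def matches(rec):
        return all(
            rec[field] >= value if isinstance(value, (int, float)) else rec[field] == value
            for field, value in predicates.items()
        )
    return [rec for rec in records if matches(rec)]

# Roughly: "show VMs on host esx-01 with CPU-ready time of at least 1,000 ms"
for rec in query(inventory, host="esx-01", cpu_ready_ms=1000):
    print(rec["vm"], rec["cpu_ready_ms"])
```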
Virtualization creates a Big Data situation, and in 2012 we will start to see new technologies focused on solving this rapidly growing challenge. Existing data center management software was not designed to handle the data volume or rates of change that come with server virtualization. Because of that, vendors that simply bolt on virtualization support will fall short on scalability and application performance.
Solutions built with Big Data in mind will have the advantage: they will be capable of providing capacity planning, resource trend analysis, and more complex correlations between configuration changes and performance metrics as the shift to large-scale virtualization continues.
Mike Wronski brings more than 15 years of industry experience to his role as VP of product management for Reflex Systems. Prior to joining Reflex, Mike was a senior data center architect at GE Healthcare, where he designed IT security into medical devices and data centers hosting medical records and images. Mike's broad IT experience ranges from large carrier data networking to virtualization. He holds CISSP and Certified Ethical Hacker certifications, an MBA, and a Bachelor of Science degree in Computer Engineering from Florida International University. You can contact the author at mwronski@reflexsystems.com.