Q&A: VM Evolution and Systems Management

System complexity is burdening systems managers. We look at the landscape.

As IT environments become more complex, systems management is getting more resource-intensive. What does IT need, what do they have, how do new technologies such as virtualization introduce even more complexity, and what should you be measuring in your shop?

For insight and answers, we turned to Daniel Heimlich, vice president of Netuitive.

Enterprise Strategies: Why is IT so dissatisfied with their current systems management tools?

Daniel Heimlich: The current tools have simply been outpaced in sophistication by the complexity of the environments they are meant to monitor.

Current systems management tools are based on technology from 10-15 years ago. They do a good job of availability monitoring -- that is, they can tell if a system is "up" or "down" -- but it is much more difficult for these systems to detect "slow" or degraded performance. That's because these tools rely on rules, scripts, and thresholds determined by administrators that send alarms when metrics breach a pre-determined policy. The problem is that the dynamics of today's data centers have rendered this manual approach obsolete.

In today's real-time, always-on application environments, organizations run a dizzying variety of applications -- each with different behaviors and workloads by time of day, day of the week, and season. If you throw in the added dynamics of virtual IT environments, it is easy to see why the job of performance monitoring has grown so complex that a manual approach can no longer do the job. In fact, industry surveys show that 70 percent of IT managers have little or no confidence in their existing systems management tools.

How does virtualization change systems management? What new challenges does it introduce to the environment?

Virtualization adds exponential complexity to systems management. These environments are highly interdependent. As a result, the performance of each virtual machine depends on that of other virtual machines, its host server, and the shared resource pool. In addition, these environments are fluid and can change dynamically. Virtual machines can be instantly migrated from one host to another based on available resources.

Take a look at how dynamics add complexity. Reallocation of computing resources is often touted as a key benefit of virtualization. For example, suppose you have four virtual servers running that you decide to reduce to three. Unless manual changes are made in all the monitoring tools, the higher workloads in the remaining three servers will generate continuous performance alerts. This is problematic if you plan to shift workloads around often to maximize server utilization.
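To make the consolidation example concrete, here is a minimal Python sketch. The threshold value and workload figures are hypothetical, chosen only for illustration, and the function names are assumptions rather than part of any particular monitoring product.

```python
# Hypothetical figures for illustration; none come from a real monitoring tool.
STATIC_CPU_ALERT_PCT = 75.0   # fixed threshold an administrator set by hand
TOTAL_WORKLOAD = 240.0        # e.g. four servers each running at 60 percent


def per_server_utilization(total_workload, server_count):
    """Average utilization when the workload is spread evenly across servers."""
    return total_workload / server_count


def breaches_threshold(utilization_pct, threshold_pct=STATIC_CPU_ALERT_PCT):
    """Mimic a static rule: alert whenever the metric exceeds the fixed limit."""
    return utilization_pct > threshold_pct


before = per_server_utilization(TOTAL_WORKLOAD, 4)  # four virtual servers
after = per_server_utilization(TOTAL_WORKLOAD, 3)   # consolidated to three

print(before, breaches_threshold(before))
print(after, breaches_threshold(after))
```

With these assumed numbers, each server moves from 60 to 80 percent utilization. Nothing is wrong, yet the unchanged rule now fires on all three remaining servers until someone edits the threshold by hand.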

Now consider an example of complex relationships between physical and virtual resources. In larger deployments of virtualization technology, clusters of multi-processor servers act as pure computing power (CPU and memory) for virtual servers. The virtual machines themselves may reside on a storage area network (SAN). You could run into a situation where all the OS metrics for a VM appear "normal" but the application on the VM is still sluggish. The problem may not be in the VM at all, but in the network I/O to and from the SAN, or with the SAN itself. Without more sophisticated monitoring tools that analyze and correlate the performance of virtual and physical infrastructure, this type of problem is extremely hard to detect and diagnose.
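The scenario above can be sketched as a layered check that looks beyond the VM's own metrics. This is only an illustration: the metric names, sample values, and limits are hypothetical, and fixed limits are used purely to keep the sketch short.

```python
def triage(vm, host, storage, limits):
    """Return (layer, metric) pairs whose values exceed their limits.

    Layers are checked in order: the VM, its host, then shared storage.
    """
    layers = [("vm", vm), ("host", host), ("storage", storage)]
    suspects = []
    for layer_name, metrics in layers:
        for metric, value in metrics.items():
            if value > limits[metric]:
                suspects.append((layer_name, metric))
    return suspects


# Hypothetical readings: the VM and host look healthy, the SAN does not.
limits = {"cpu_pct": 85, "mem_pct": 90, "san_latency_ms": 20, "net_io_wait_ms": 10}
vm = {"cpu_pct": 35, "mem_pct": 50}
host = {"cpu_pct": 40, "mem_pct": 60}
storage = {"san_latency_ms": 45, "net_io_wait_ms": 3}

suspects = triage(vm, host, storage, limits)
print(suspects)
```

A tool that watched only the VM's OS metrics would report nothing here. Examining all three layers together points at SAN latency as the likely source of the sluggish application.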

What should be measured in a non-virtualized environment and how is that different from what should be measured in a virtualized environment (and what should be measured -- hosts, virtual servers, or both)?

There are dozens of useful metrics in non-virtualized environments, but the most important performance metrics are generally server memory, CPU, and disk utilization. Data packet traffic (I/O) through the network is also useful.

In virtual environments that are relatively static -- meaning you simply create a fixed number of VMs per physical server and leave things alone -- monitoring a virtual server is exactly the same as monitoring a physical one. The added difference is that you want to track and correlate the metrics for both the VMs and the hosts to help you isolate the root cause of problems -- is it the VM or the host?

In more dynamic virtual environments -- where you may be moving VMs and reallocating workloads and resources such as memory and CPU cycles -- the key relevant metrics are also the same, but performance modeling and analysis change dramatically. Instead of dedicated resources for each server, the VMs usually share a "resource pool" of physical resources. This allows a VM to get more resources if necessary -- so alerting based on fixed limits does not work in this environment. A problem arises only when a VM is using more than its "fair share" of resources. The relevant measures are CPU, memory, and disk utilization as a percentage of the total available in the pool.

Think of it like a mobile phone "family talk plan." You may not care if, from month to month, one of your kids uses 150 minutes out of a 500-minute plan, but if two of your kids each end up using 300 minutes in the same month, you have a serious problem.
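The "fair share" idea and the talk-plan analogy can be sketched in a few lines of Python. The family members, their usage figures, and the 1.5x tolerance are hypothetical assumptions; only the 500-minute pool and the 150- and 300-minute figures come from the example above.

```python
def over_fair_share(usage, pool_capacity, tolerance=1.5):
    """Names of consumers whose share of the pool exceeds tolerance x an equal split."""
    fair_share = 1.0 / len(usage)  # equal split of the pool
    return sorted(
        name for name, used in usage.items()
        if used / pool_capacity > fair_share * tolerance
    )


def pool_exhausted(usage, pool_capacity):
    """True when combined consumption exceeds what the pool can supply."""
    return sum(usage.values()) > pool_capacity


PLAN_MINUTES = 500

# One heavy user: 150 of 500 minutes is a 30 percent share, and the pool is fine.
quiet_month = {"kid_a": 150, "kid_b": 50, "parent_a": 100, "parent_b": 100}

# Two users at 300 minutes each: the pool is overdrawn.
busy_month = {"kid_a": 300, "kid_b": 300, "parent_a": 50, "parent_b": 50}

print(over_fair_share(quiet_month, PLAN_MINUTES), pool_exhausted(quiet_month, PLAN_MINUTES))
print(over_fair_share(busy_month, PLAN_MINUTES), pool_exhausted(busy_month, PLAN_MINUTES))
```

The same calculation applies to VMs drawing CPU, memory, or disk from a shared resource pool: what matters is each consumer's share of the pool and whether the pool as a whole is running out, not whether any one VM crosses a fixed limit.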

Two other sets of metrics become more important for large-scale virtualization deployments. These are network I/O metrics and metrics related to enterprise storage (SANs). Because the hard drive in the physical server is replaced by a virtual drive on the SAN, and because the network takes the place of the local I/O bus, the health of this infrastructure is crucial to the health of the overall virtual environment.

If you have a mixed environment, how do you combine the measurements to know how well IT is doing overall?

Where possible, you have to "normalize" performance metrics into some indication of system "health" to give users a view into how well their IT infrastructure is performing. Even when performance metrics can be rolled up to a single "health" score, it is important to understand how the metrics are correlated to one another and how each metric contributes to the "health" of the service in question. This helps tremendously with root cause analysis when problems occur.
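One way to picture this normalization is the sketch below. It is an assumption-laden illustration, not a description of how any particular product computes health: the metric names, the healthy and critical ranges, and the equal weights are all hypothetical.

```python
def normalize(value, healthy, critical):
    """Map a raw metric onto 0.0 (healthy) .. 1.0 (critical), clamped at both ends."""
    stress = (value - healthy) / (critical - healthy)
    return max(0.0, min(1.0, stress))


def health_report(metrics, ranges, weights):
    """Return (health score 0-100, per-metric contribution to lost health)."""
    total_weight = sum(weights[name] for name in metrics)
    contributions = {
        name: weights[name] * normalize(value, *ranges[name]) / total_weight
        for name, value in metrics.items()
    }
    health = 100.0 * (1.0 - sum(contributions.values()))
    return health, contributions


# Hypothetical readings and (healthy, critical) ranges for one service.
metrics = {"cpu_pct": 90, "mem_pct": 60, "san_latency_ms": 5}
ranges = {"cpu_pct": (50, 100), "mem_pct": (50, 100), "san_latency_ms": (5, 25)}
weights = {"cpu_pct": 1.0, "mem_pct": 1.0, "san_latency_ms": 1.0}

health, contributions = health_report(metrics, ranges, weights)
worst = max(contributions, key=contributions.get)
print(round(health, 1), worst)
```

Keeping the per-metric contributions alongside the rolled-up score is what preserves the root cause trail: the single number says the service is degraded, and the breakdown says which metric is responsible.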

One foundation of ITIL is that IT should move from being reactive to proactive. Some IT shops have changed their thinking into truly being a service provider to their own organization. How does performance management help in that transition?

For a long time, IT organizations have focused on providing the best service by trying to resolve problems as quickly as possible after they are discovered, but by then, end users have already been impacted. That's considered being reactive.

New technologies such as Netuitive allow organizations to do true performance management for the first time -- not just "up / down" monitoring, but proactive monitoring for signs of degradation in service, often long before users are impacted. This changes the mindset from "how fast can I resolve the trouble ticket" to "how can I eliminate the cause of the trouble ticket." This combination of technology and mindset is what IT organizations need to finally deliver proactive service to their business organizations.

What does a third-party systems management tool do that built-in operating system tools don't?

Essentially it's the difference between presenting data and leveraging information or intelligence. Built-in operating system tools simply collect and display metric data -- sometimes with plotlines, sometimes just in log files. These tools generally don't interpret the data to tell you when there may be a problem. Within built-in tools, fault detection and isolation have traditionally involved manually intensive rules, scripts, and thresholds -- which work for environments that don't change much.

Third-party systems management tools leverage this collected data and have unique mechanisms for identifying system faults and breakdowns in performance. These third-party tools are more flexible, allowing them to leverage more sophisticated new technologies and powerful algorithms to automate fault detection and isolation, and even forecast future potential problems. This automation frees system administrators to be more proactive in managing the performance of critical business systems.

What is behavior learning technology, and what other types of improvements should IT watch for over the next five years that will impact systems and performance management? How will these bridge the virtual-physical gap in managing mixed environments?

Behavior learning technology (or self-learning technology) is a category of technology that involves automated "learning" of normal system behavior. This can lead to more accurate alerting, problem forecasting, and automated fault isolation. This replaces the current labor-intensive manual guesswork and fixed models used to detect problematic behavior.
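As a rough illustration of the idea, the sketch below learns a per-hour baseline from history and flags values that stray too far from it. This is a deliberately simple stand-in (mean and standard deviation per hour of day), not the method used by Netuitive or any other product; the sample data, the three-sigma band, and the function names are all assumptions.

```python
from collections import defaultdict
from statistics import mean, pstdev


def learn_baseline(samples):
    """Learn {hour: (mean, standard deviation)} from (hour_of_day, value) history."""
    by_hour = defaultdict(list)
    for hour, value in samples:
        by_hour[hour].append(value)
    return {hour: (mean(vals), pstdev(vals)) for hour, vals in by_hour.items()}


def is_anomalous(baseline, hour, value, k=3.0, min_band=1.0):
    """Flag a value more than k standard deviations from that hour's learned mean."""
    if hour not in baseline:
        return False  # no history for this hour; stay quiet rather than guess
    mu, sigma = baseline[hour]
    band = max(k * sigma, min_band)  # min_band avoids a zero-width band
    return abs(value - mu) > band


# Hypothetical CPU history: quiet at 3 a.m., busy at 2 p.m.
history = [(3, v) for v in (10, 12, 11, 9, 13)] + [(14, v) for v in (70, 75, 72, 68, 74)]
baseline = learn_baseline(history)

print(is_anomalous(baseline, 14, 72))  # typical afternoon load
print(is_anomalous(baseline, 3, 72))   # same reading in the middle of the night
```

The same 72 percent reading is normal at 2 p.m. and abnormal at 3 a.m. A single fixed threshold cannot express that distinction, which is why learned baselines can alert more accurately without hand-tuned rules.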

This new technology is gaining more recognition from leading analyst firms like Gartner and Forrester -- especially now that enterprises are deploying it in production. The analysts also see it as a "must have" to achieve the current vision of resource-efficient, business-responsive IT organizations.

The five-year vision for many top enterprises is an infrastructure that maximizes IT resource utilization through a flexible "utility" or "cloud" computing model. At the same time, organizations are looking to be more responsive to the lines-of-business in the enterprise.

Virtualization technology enables computing flexibility -- but depends on a new generation of automated management tools to enable IT operations staff to deliver on the promised performance of this new infrastructure.