Q&A: Dynamic Capacity Planning in a Virtual World

We explore the challenges of capacity planning and how IT can best balance performance and risk in virtualized environments.

Years ago, predicting workloads was simple. As Andrew Hillier, chief technology officer and founder of CiRBA, explains it, applications and hardware grew old together. You simply needed to plot the trends to predict when you’d need to add more capacity.

Times have changed. In this conversation with Andrew, we examine the challenges of capacity planning and how IT can best balance performance and risk in virtualized environments.

Enterprise Strategies: What are the problems that make predicting workloads in virtualized environments difficult?

Andrew Hillier: Workloads are potentially much more mobile in virtual environments, and while this is the source of many benefits, it also makes the management of capacity more complex.

The first issue most organizations hit is that you cannot track certain metrics as percentages because you no longer know what they are a percentage of. As VMs move between physical machines, this traditional measure must be thrown out the window, potentially invalidating many management tools and processes.
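As a rough illustration of the percentage problem, the Python sketch below converts a percent-of-host CPU reading into an absolute figure and back. The host names, the capacities, and the use of aggregate MHz as the unit are assumptions made for the example, not anything stated in the interview.

```python
# Hypothetical hosts and capacities (cores x MHz per core); illustrative only.
HOST_CAPACITY_MHZ = {
    "host_a": 8 * 2000,   # 16,000 MHz in aggregate
    "host_b": 16 * 3000,  # 48,000 MHz in aggregate
}

def absolute_demand_mhz(percent_busy: float, host: str) -> float:
    """Convert a percent-of-host CPU reading into absolute MHz consumed."""
    return percent_busy * HOST_CAPACITY_MHZ[host] / 100.0

def percent_on(demand_mhz: float, host: str) -> float:
    """Express an absolute demand as a percentage of the given host."""
    return demand_mhz * 100.0 / HOST_CAPACITY_MHZ[host]

# A VM that reads "30% busy" on host_a...
demand = absolute_demand_mhz(30.0, "host_a")
# ...is the same amount of work, but a different percentage, on host_b.
after_move = percent_on(demand, "host_b")
print(demand, after_move)
```

The same workload reads 30 percent on one host and 10 percent on the other, which is why a percentage alone stops being meaningful once the VM can move.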

This is just the start. In virtual environments the entire capacity management paradigm shifts toward something that more closely resembles a rapid-fire dating service, where supply and demand can be matched dynamically to provide high efficiency. This is powerful when used properly, but it can introduce risk if it is used to create a purely reactive management model, where workloads shift based on recent utilization levels. Such a model can begin to resemble a load-balancing pool, and although it allows the management of resources in aggregate, it should not be confused with capacity planning, which is a more forward-looking discipline. In practice, both must be used to properly manage virtual environments.

What do companies do today to manage and plan for greater workloads, and how successful are they?

Organizations typically use a combination of traditional utilization tracking tools and high-level, business-oriented estimates of expected activity levels. These estimates are typically created on a quarterly basis, and although they are useful for trending and for planning upgrades and refreshes in traditional environments, they leave a significant gap when managing more dynamic environments, where everything happens much, much faster.

Also, technical and business-level considerations and constraints rarely factor into this process in physical environments, as the lack of mobility makes it very difficult to go "offside" as a result of routine capacity management. The same is not true of virtual environments, where workload mobility can result in unsupported configurations, undesirable combinations of workloads, and even situations in which regulatory compliance is jeopardized.

What are "workload personalities," and why do they matter for capacity planning?

We use the term workload personalities to describe the general nature of a workload. Observing the patterns of CPU utilization, disk and network I/O, memory utilization, and other key metrics starts to paint a picture of the overall characteristics of the activity, which generally fall into one of several distinct personality archetypes. For example, a database performing OLTP-style transactions will look quite different from one acting as a data warehouse, with one having more bi-directional I/O activity than the other. Similarly, app servers, queue managers, and servers performing raw number crunching will all make use of resources in different ways.

This information is useful for two reasons. First, when managing capacity and optimizing performance in virtual environments, you must combine workloads in such a way that their resource utilization dovetails, making the best use of all available resources. For example, placing several number-crunching applications on the same physical system may exhaust CPU resources without putting a dent in the I/O capacity. Combining workloads so that complementary personalities share resources makes higher efficiency possible.
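The dovetailing idea can be sketched in a few lines of Python. This is a deliberately simplified model, assuming each workload is summarized by a single peak demand per resource, expressed as a fraction of one host's capacity. The workload names and numbers are hypothetical.

```python
from typing import Dict, List

# resource name -> peak demand as a fraction of one host's capacity (assumed model)
Workload = Dict[str, float]

def combined_load(workloads: List[Workload]) -> Dict[str, float]:
    """Sum the demand for each resource across all workloads on a host."""
    total: Dict[str, float] = {}
    for workload in workloads:
        for resource, demand in workload.items():
            total[resource] = total.get(resource, 0.0) + demand
    return total

def fits(workloads: List[Workload], limit: float = 1.0) -> bool:
    """True if no single resource on the host is oversubscribed."""
    return all(demand <= limit for demand in combined_load(workloads).values())

# Hypothetical personalities: one CPU-heavy, one I/O-heavy.
number_cruncher = {"cpu": 0.60, "io": 0.05}
data_warehouse = {"cpu": 0.15, "io": 0.70}

same_personality = fits([number_cruncher, number_cruncher])  # CPU sums to 1.2
complementary = fits([number_cruncher, data_warehouse])      # CPU 0.75, I/O 0.75
print(same_personality, complementary)
```

Two number crunchers oversubscribe the CPU while leaving I/O almost idle, whereas the mixed pair fits with both resources well used.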

The second use of this information is the normalization of workloads between platforms. Because different servers have different levels of performance, a workload running at a certain level of utilization will not necessarily look the same if moved elsewhere, and it is necessary to use benchmarks to normalize the activity between platforms to get an accurate answer. Certain platforms also favor certain types of personalities, making the benchmarking strategy dependent on the personality of the workload. For example, a number-crunching application may run very well on an x86-based blade server, whereas an OLTP workload may favor a mainframe system.
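A minimal sketch of personality-aware normalization follows. It assumes that each platform has a benchmark score per personality (higher meaning faster) and that utilization scales inversely with that score. The platform names and scores are invented for illustration and do not come from any real benchmark.

```python
# Hypothetical benchmark scores per platform and workload personality.
BENCHMARK = {
    "old_server": {"compute": 100.0, "oltp": 100.0},
    "x86_blade":  {"compute": 400.0, "oltp": 200.0},
}

def projected_utilization(util_pct: float, source: str, target: str,
                          personality: str) -> float:
    """Estimate utilization on the target platform, scaling by the ratio of
    benchmark scores chosen for this workload's personality."""
    ratio = BENCHMARK[source][personality] / BENCHMARK[target][personality]
    return util_pct * ratio

# The same 60% reading projects differently depending on personality.
compute_on_blade = projected_utilization(60.0, "old_server", "x86_blade", "compute")
oltp_on_blade = projected_utilization(60.0, "old_server", "x86_blade", "oltp")
print(compute_on_blade, oltp_on_blade)
```

With these made-up scores, the number-crunching workload drops to 15 percent on the blade while the OLTP workload only drops to 30 percent, which shows why a single benchmark for every personality can mislead.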

Can any of the techniques from planning and managing non-virtualized environments be used in a virtualized environment, or does IT need to adopt a completely new way of thinking?

The challenge is to keep the old ways of thinking where they still make sense and combine them with new approaches that are made possible (and sometimes necessary) by virtualization. For example, maintaining a forward-looking view of capacity management is important, but tempering this with the reality of mobility is the only way to make it work. Managing capacity in aggregate is also an important new concept, as it allows the "whitespace" of a pool of resources to be managed, thus providing economies of scale across multiple resources. Again, this must be tempered with some traditional thinking, as these "pools" are actually made up of individual servers, which must factor into the equation as well.

A large server is like a swimming pool, whereas a virtual cluster made of smaller servers is more like a pool made out of buckets. This fragmentation of resources matters when determining the optimal workload placements, and such environments can’t simply be managed as if they were one large server.
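The "pool made out of buckets" point can be shown with a short Python sketch. The free-memory figures and the 6 GB request are hypothetical.

```python
from typing import List

def fits_in_aggregate(free_per_host: List[float], demand: float) -> bool:
    """Treat the cluster as one big server: compare demand to total free capacity."""
    return sum(free_per_host) >= demand

def fits_on_some_host(free_per_host: List[float], demand: float) -> bool:
    """Respect fragmentation: a VM must fit entirely on a single host."""
    return any(free >= demand for free in free_per_host)

# Three hosts with 4 GB free each: 12 GB of "whitespace" in aggregate.
free_gb = [4.0, 4.0, 4.0]
new_vm_gb = 6.0

aggregate_ok = fits_in_aggregate(free_gb, new_vm_gb)
single_host_ok = fits_on_some_host(free_gb, new_vm_gb)
print(aggregate_ok, single_host_ok)
```

The aggregate view says there is room for the VM, but no individual host can actually take it, so a placement decision needs both checks.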

How does IT know when it’s struck the right balance with its proactive management?

Being proactive is all about anticipating demands before they occur and being ready for them ahead of time. In reactive models, you must shift workloads around to react to application demands, and it may take some time before an environment realizes that action is necessary. This may cause performance issues, and also complicates service delivery and management processes, as it is difficult to determine whether things are moving because of normal activity or because of unusual activity. In ITIL terms, this means that it is difficult to unravel incident management, change management and capacity management and can potentially undo years of progress in that area.

In proactive management, on the other hand, there is advance knowledge of what is expected to happen based on historical patterns, and virtual environments can be configured ahead of time in anticipation of that demand. In such cases, intraday motioning will only be required if unusual or unplanned activity occurs. This helps identify unusual activity that might otherwise go undetected, reduces operational volatility, and increases control.

Ultimately, the right balance between reactive and proactive is struck when workloads are serviced at the proper level of performance without the need for workload motioning/migration in the middle of an active business cycle, and if motioning does occur, it is in response to unusual demands that could not have been foreseen in the historical data.
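One way to picture placement based on historical patterns is to add workload profiles period by period instead of adding their peaks. The Python sketch below assumes each workload has a historical utilization profile sampled over the same four periods of a business day. The profiles and the capacity figure are made up for the example.

```python
from typing import List

def peak_of_combined(profiles: List[List[float]]) -> float:
    """Peak of the period-by-period sum of all profiles on a host."""
    return max(sum(period) for period in zip(*profiles))

def sum_of_peaks(profiles: List[List[float]]) -> float:
    """Worst case if every workload peaked at the same moment."""
    return sum(max(profile) for profile in profiles)

# Hypothetical historical CPU utilization (% of host) per period of the day.
overnight_batch = [70.0, 10.0, 10.0, 60.0]  # busy at night
online_app = [10.0, 60.0, 70.0, 20.0]       # busy during business hours

host_capacity = 100.0
combined_peak = peak_of_combined([overnight_batch, online_app])
naive_peak = sum_of_peaks([overnight_batch, online_app])
fits_without_motioning = combined_peak <= host_capacity
print(combined_peak, naive_peak, fits_without_motioning)
```

Judged by their individual peaks, these two workloads look too big to share a host (140 percent), but their historical patterns show a combined peak of 80 percent, so they could be placed together ahead of time with no intraday migration expected.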

What are the biggest mistakes IT makes when trying to balance workloads?

One of the biggest problems when balancing workloads is "misprovisioning," either through overprovisioning or underprovisioning. Overly conservative management practices produce low operational risk but are not as efficient as they could be and tend to cause overprovisioning. Aggressive VM-to-host ratios can be efficient from a resource perspective but tend to carry more risk, and if this risk is not appropriate for the environment, then underprovisioning results.

This risk tolerance is one of the key factors in avoiding problems when matching supply and demand in virtual environments, and properly understanding how much risk can be assumed in the servicing of application workloads largely determines how tightly they can be squeezed together on a host system. For example, if an application is not business-critical and it is acceptable for there to be a slight risk of workloads contending for resources, then VM-to-host ratios can be much higher than if there is no risk tolerance. The right balance is thus struck when workloads are placed on servers in a manner consistent with the acceptable risk for that environment.
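As a hedged illustration of how risk tolerance can drive VM-to-host ratios, the Python sketch below sizes each VM by a chosen percentile of its historical demand. The percentile approach, the sample data, and the assumption that all VMs on the host behave identically are simplifications for the example, not a description of any particular product.

```python
import math
from typing import List

def nearest_rank_percentile(samples: List[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100.0))
    return ordered[rank - 1]

def vms_per_host(samples: List[float], host_capacity: float, pct: float) -> int:
    """How many identical VMs fit if each is sized at the given demand percentile."""
    per_vm = nearest_rank_percentile(samples, pct)
    return int(host_capacity // per_vm)

# Hypothetical CPU demand history for one VM (% of host), with one rare spike.
history = [10, 12, 11, 13, 12, 10, 11, 12, 14, 40]

no_risk = vms_per_host(history, 100, 100)    # size to the absolute peak
some_risk = vms_per_host(history, 100, 90)   # accept contention ~10% of the time
print(no_risk, some_risk)
```

With no tolerance for contention, only 2 of these VMs fit on the host. Accepting a small risk by sizing to the 90th percentile raises that to 7.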

What best practices can you suggest to help IT avoid the "gotchas" or traps of planning and managing workloads in a virtualized data center?

Again, do not confuse load balancing with capacity planning. Ensure that forward-looking capacity planning capabilities are employed in virtual environments. View virtualization as an exercise in risk management and not just a simple sizing exercise. This means understanding the nature of the workloads and the tolerance for risk for each application and business service.

Finally, be sure to fully understand and model the technical and business constraints that govern the operation of IT environments. Workload mobility is a powerful thing, but the ability to reorganize IT environments on the fly can also lead to trouble if not planned and governed properly.

What products or services does CiRBA offer to address these management challenges?

CiRBA provides Placement Intelligence software that analyzes highly detailed technical, business and workload constraints to determine the optimal workload placements for IT environments, whether physical or virtual. This is used both to plan the move to virtual environments and to provide dynamic capacity management capabilities for virtual infrastructure. By incorporating risk models, advanced benchmarking, and powerful what-if capabilities, CiRBA helps establish and maintain the most cost-effective data center possible.