In-Depth
The Path to Large-Scale Virtualization Success
A closer look at the challenges of virtualization technologies in a large-scale initiative.
by Andrew Hillier
A growing number of organizations have embraced virtualization technologies and are actively moving toward large-scale implementations. Small-scale virtualization projects are almost always met with success, giving teams the confidence to take small deployments, or even tests run in non-production environments, to the enterprise level. However, these two environments differ significantly, and organizations need to be aware of the challenges that come with large-scale initiatives.
Virtualizing a small set of servers is a relatively informal process. The individuals implementing the solution are usually familiar with many, if not all, aspects of the servers in question. When the effort extends to the enterprise level, the sprawling nature of most data centers and the lack of data about target environments require a more methodical approach. Large-scale changes to IT environments necessitate careful evaluation of business interests, geographic considerations, availability requirements, and other key factors to mitigate risk and ensure optimum reliability of affected business services. Organizations must consider numerous factors, including:
- Asset identification: Identifying the location, functional roles, and applications residing on affected servers
- Technical constraints: Evaluating requirements for network connectivity, storage, controllers, peripherals, power, non-standard hardware configuration details, etc.
- Business constraints: Considering availability targets, maintenance windows, application owners, compliance restrictions, disaster recovery relationships, and more
- Workload patterns: Understanding application loads, resource distributions, overall utilization versus capacity, metrics for chargeback, etc.
Moreover, large-scale initiatives require a methodical way to identify and track virtualization candidates and the performance impact of their implementation. Having this knowledge empowers organizations to set forth the best possible virtualization roadmap.
Getting Started
The first step is to identify the existence and composition of systems within the scope of the initiative. This includes sophisticated operations, such as Layer 2 ARP cache traversal to locate all systems on the network, and router-to-server MAC address reconciliation to avoid redundancy caused by teamed interfaces. At this stage, the organization should also track parent/child relationships to eliminate redundancies caused by virtualized servers already deployed, which are often prone to double counting when the physical hosts and VMs are tracked separately.
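To make the reconciliation step concrete, the following Python sketch shows one way MAC addresses harvested from router or switch tables might be matched against server-reported interfaces so that teamed NICs and already-deployed VMs are not double counted. The data structures and field names (router_entries, server_reports, parent_host) are illustrative assumptions, not the format of any particular discovery tool.

```python
from collections import defaultdict

def reconcile_inventory(router_entries, server_reports):
    """Collapse discovered MAC/IP pairs into unique physical hosts.

    router_entries: list of dicts like {"mac": ..., "ip": ...} harvested
                    from router/switch tables (hypothetical format).
    server_reports: list of dicts like {"host": ..., "macs": [...],
                    "parent_host": ... or None} reported per server.
    """
    # Map every MAC a server claims (including teamed NICs) to that host.
    mac_to_host = {}
    for report in server_reports:
        for mac in report["macs"]:
            mac_to_host[mac.lower()] = report["host"]

    # Group network-discovered entries by owning host, so a host with
    # two teamed interfaces appears once rather than twice.
    discovered = defaultdict(set)
    unclaimed = []
    for entry in router_entries:
        host = mac_to_host.get(entry["mac"].lower())
        if host:
            discovered[host].add(entry["ip"])
        else:
            unclaimed.append(entry)  # candidates for manual follow-up

    # Drop VMs whose parent host is already in the inventory, to avoid
    # counting a guest and its physical host as two separate servers.
    hosts = {r["host"]: r for r in server_reports}
    physical = {h for h, r in hosts.items()
                if not r.get("parent_host") or r["parent_host"] not in hosts}

    return {h: sorted(discovered.get(h, set())) for h in physical}, unclaimed
```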
The next step is to analyze technical constraints. The type of virtualization solution being used will dramatically affect the technical limitations on the virtualization initiative. For example, when analyzing for VMware ESX, relatively few constraints are placed on the operating system configurations, since each VM will have its own operating system image in the final configuration. Alternatively, applications being placed into Solaris 10 Zones will “punch through” to see a common operating system image, so analysis in this situation should factor in operating system compatibility.
Early in the analysis process, organizations must perform a variance analysis on the source hardware. This helps qualify an environment by uncovering hardware or configurations of interest, such as SAN controllers, token ring cards, IVRs, proprietary boards, direct-connect printers, or other items that are not part of the standard build. Variance analysis reveals these hardware configuration “outliers,” and failing to take them into account can jeopardize the initiative. For example, failure to account for installed fax boards could cause the interruption of critical business services.
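A minimal sketch of such an outlier check is shown below, assuming each server's collected component list is available as a simple inventory record. The STANDARD_BUILD baseline and the flagged component names are illustrative assumptions, not a real standard.

```python
# Illustrative baseline; a real standard build would come from the
# organization's own configuration management records.
STANDARD_BUILD = {"gigabit_nic", "raid_controller", "san_controller"}

FLAGGED_COMPONENTS = {"token_ring_card", "fax_board", "ivr_card",
                      "proprietary_board", "direct_connect_printer"}

def variance_report(servers):
    """Return servers whose hardware deviates from the standard build."""
    outliers = {}
    for name, components in servers.items():
        extras = set(components) - STANDARD_BUILD
        risky = extras & FLAGGED_COMPONENTS
        if extras:
            outliers[name] = {"non_standard": sorted(extras),
                              "needs_review": sorted(risky)}
    return outliers

# Example: a server with a fax board is flagged before migration planning.
print(variance_report({"srv042": ["gigabit_nic", "fax_board"]}))
```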
The next stage is conducting rules-based configuration analysis whereby “regions of compatibility” are mapped out across the entire IT environment. These regions represent areas of commonality that are strong candidates for VM pools and clusters.
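As a rough illustration of rules-based grouping, the sketch below derives compatibility regions by bucketing servers on a set of rule attributes. The attribute names (os_family, os_version, network_zone) are hypothetical; real rule sets would reflect the chosen virtualization platform.

```python
from itertools import groupby

def compatibility_regions(servers,
                          keys=("os_family", "os_version", "network_zone")):
    """Group servers into regions of compatibility by shared attributes.

    servers: list of dicts describing each server (attribute names are
    illustrative). keys: the rule attributes that must match for two
    servers to be candidates for the same VM pool or cluster.
    """
    def signature(server):
        # Missing attributes collapse into an "unknown" bucket for review.
        return tuple(str(server.get(k, "unknown")) for k in keys)

    ordered = sorted(servers, key=signature)
    return {sig: [s["name"] for s in group]
            for sig, group in groupby(ordered, key=signature)}
```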
One of the most important, and most often overlooked, areas of virtualization analysis involves analyzing business constraints. Most virtualization planning tools provided by VM vendors do not go beyond high-level configuration and workload analysis, yet businesses cannot afford to stop there. Failure to analyze these constraints can lead to significant operational problems in virtualized architectures. Unlike technical issues, which can usually be resolved by financial means, many business constraints cannot be easily resolved. For example, if the organization cannot move an application between locations or combine certain applications on the same networks, no amount of money or effort can overcome this issue.
To that end, organizations should analyze constraints related to ownership, location, maintenance windows, and availability targets to reveal which sets of systems are suitable to be combined in a single virtualized system or cluster.
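The sketch below shows one way such business rules might be encoded as a pairwise colocation check, assuming attributes such as location, maintenance windows, availability tier, and disaster-recovery partner have been captured for each system. The rules and field names are illustrative; actual policies would come from application owners and compliance teams.

```python
def can_colocate(a, b):
    """Decide whether two systems may share a virtualized host or cluster.

    a, b: dicts with illustrative business attributes: 'name',
    'location', 'maintenance_windows' (list of window labels),
    'availability_tier', and an optional 'dr_partner'.
    """
    # Applications that cannot move between locations must stay in place.
    if a["location"] != b["location"]:
        return False
    # A system and its disaster-recovery partner must not share hardware.
    if a.get("dr_partner") == b["name"] or b.get("dr_partner") == a["name"]:
        return False
    # Co-hosted systems need at least one common maintenance window.
    if not set(a["maintenance_windows"]) & set(b["maintenance_windows"]):
        return False
    # Mixing availability tiers complicates service-level management.
    if a["availability_tier"] != b["availability_tier"]:
        return False
    return True
```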
Workload Scrutiny
Workload patterns are probably the most obvious aspect of operations that must be scrutinized when virtualizing systems. Some of the most important aspects of workload analysis, such as complementary pattern detection and time-shift what-if analysis, are often overlooked when determining whether workloads can be combined. This can lead to problems such as unnecessarily limiting the possible efficiency gains, or failing to leave enough headroom to cushion peak demands on the infrastructure. It is also important to measure aggregate utilization and detailed system-level statistics from a historical perspective, as failing to do so makes it difficult to determine what reduction targets are realistic.
Measuring aggregate utilization versus capacity provides insight into pre-virtualization utilization levels and patterns, and dictates both the maximum capacity reduction that is possible as well as the estimated utilization target for the virtualization initiative.
Aggregate utilization is measured by normalizing the workload curves of all physical servers against their overall “power” (typically using benchmarks) and summing them to obtain a weighted average. The per-hour time-of-day curves are also normalized and summed to give a view of the aggregate workload pattern over time, which shows the distribution of resource demand in the target environment.
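In simplified terms, this normalization might be computed as in the following sketch, where each server's hourly CPU utilization is weighted by a benchmark-derived “power” rating and expressed as a percentage of total capacity. The power ratings and 24-hour sample format are assumptions for illustration.

```python
def aggregate_utilization(servers):
    """Weighted-average utilization across an environment.

    servers: list of dicts, each with a benchmark-derived 'power'
    rating and 24 hourly CPU utilization samples ('cpu_pct', 0-100).
    Both fields are illustrative.
    """
    total_power = sum(s["power"] for s in servers)
    hourly_aggregate = []
    for hour in range(24):
        # Convert each server's percentage into consumed "power units",
        # then express the sum as a percentage of total capacity.
        consumed = sum(s["power"] * s["cpu_pct"][hour] / 100.0 for s in servers)
        hourly_aggregate.append(100.0 * consumed / total_power)
    overall = sum(hourly_aggregate) / len(hourly_aggregate)
    return overall, hourly_aggregate
```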
The core element of detailed workload scrutiny is a “what-if” analysis. Organizations should assess the various combinations of workload patterns to determine the optimal stacking function. This analysis involves the normalization of workloads against the relative powers of the source and target servers, and the stacking of specific sets of workloads onto target systems (either new or existing) to determine which combinations will fit. To get the most accurate assessment, combinations must be scored based on both peak and sustained activity, and patterns must be analyzed with and without time-shifting. This provides a comprehensive view of the best- and worst-case outcomes and provides an automated means of determining the optimal virtualization transfer function.
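As a simplified illustration of the scoring step (not any vendor's actual algorithm), the sketch below normalizes source workloads to a candidate target's capacity, stacks the hourly curves with an optional time shift, and reports both peak and sustained utilization. The function names, the top-four-hour definition of “sustained,” and the acceptance thresholds are all assumptions.

```python
def stack_workloads(workloads, target_power, shift_hours=0):
    """Stack normalized hourly workloads onto a candidate target server.

    workloads: list of (source_power, hourly_cpu_pct[24]) tuples.
    target_power: benchmark rating of the target (same illustrative units).
    shift_hours: optional time shift applied to every other workload,
                 to test whether staggering batch windows helps.
    Returns (peak_pct, sustained_pct) of projected target utilization.
    """
    combined = [0.0] * 24
    for idx, (power, cpu_pct) in enumerate(workloads):
        shift = shift_hours if idx % 2 else 0
        for hour in range(24):
            sample = cpu_pct[(hour - shift) % 24]
            combined[hour] += sample * power / target_power
    peak = max(combined)
    sustained = sum(sorted(combined, reverse=True)[:4]) / 4  # top-4-hour average
    return peak, sustained

def fits(peak, sustained, peak_limit=85.0, sustained_limit=70.0):
    """Simple acceptance rule; real limits depend on target service levels."""
    return peak <= peak_limit and sustained <= sustained_limit
```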
The ability to model overhead is also intrinsic to this process. Many VM technologies introduce overhead related to I/O and other operational issues. Compensating for this overhead is important to ensure sufficient capacity is available in reserve to sustain target service levels. Similarly, any efficiencies gained in the process (such as elimination of multiple backup devices) should be accounted for to fully optimize the resulting environment. This analysis should produce an overall virtualization scorecard that identifies the combinations of systems that produce the highest reduction in server count while maintaining optimum performance and complying with critical business constraints.
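To make the overhead adjustment concrete, a projected utilization figure might be inflated by an assumed hypervisor overhead factor and scored against a reduced ceiling that preserves headroom, as in this sketch. The 10 percent overhead and reserve values are purely illustrative and would be measured per platform.

```python
def apply_overhead(projected_pct, overhead_factor=0.10, reserve_pct=10.0):
    """Adjust a projected utilization figure for virtualization overhead.

    overhead_factor: assumed fractional CPU/I-O overhead of the hypervisor.
    reserve_pct: headroom held back to protect target service levels.
    Returns the adjusted utilization and the ceiling to score it against.
    """
    adjusted = projected_pct * (1.0 + overhead_factor)
    ceiling = 100.0 - reserve_pct
    return adjusted, ceiling
```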
Ultimately, enterprise-wide virtualization requires a systematic approach to gathering intelligence, identifying goals, and diligently analyzing constraints (many of which are business related). This rigorous approach will enable organizations to maximize the reliability of affected business services and gain a clear path to the lowest possible total cost of ownership.