Avoiding Downtime in Virtualized Environments
Virtualization can create single points of failure. We show you how to protect yourself.
By Bob Williamson
Virtualization technology is gaining significant traction in numerous market segments. According to IDC’s Worldwide Quarterly Server Virtualization Tracker, worldwide virtualization license shipments in the second quarter of 2008 reflected a 53 percent year-over-year increase. The many benefits of virtualization -- including reductions in hardware and software costs, improved disaster recovery functionality, and lower energy usage -- bring with them significant new business continuity issues that must be considered.
When companies replace multiple physical servers with virtual machines (VMs) that run on a single physical server, the hypervisor and the physical server on which it runs become a single point of failure (SPOF). Worse yet, the SPOF is not limited to only one server; it extends to all of the VMs hosted on that server. This classic “all your eggs in one basket” problem can result in a catastrophic failure should the physical server fail.
Natalie Lambert, a Forrester Research security analyst, agrees that the hypervisor is the primary vulnerability in virtualized environments. She points out that attackers could access thousands of desktops from a single compromised hypervisor.
One of the most compelling reasons to incorporate virtualization into a data center is server portability. A VM can be moved from one server to another simply by restoring its VM file to an alternate server, typically without regard to the make and model of the target hardware. Such portability is extremely attractive when it comes to disaster recovery planning. VM restoration is much less complicated than traditional bare-metal recovery from tape backups, and it can be considerably less expensive because the recovery servers need not match the production hardware.
With traditional backup and restore technology, full backups are taken on a scheduled basis (for example, weekly, daily, or hourly), and then incremental or differential backups are taken periodically (between the regularly scheduled full backups). In the event of a disaster, recovery involves restoring the most recent full backup and any incremental or differential backups available (of course, this assumes the backups are available in the disaster recovery location, which is not always the case).
Relying on this process, recovery can take hours, and the recovery point is only as good as the last available backup. In the event of a disaster, even with the advantages gained by portable VMs, users are likely to lose a significant amount of data and may need to invest a significant amount of time in the disaster recovery process. These results are unlikely to meet the company’s recovery time objectives (RTO) and recovery point objectives (RPO).
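To make the recovery point concrete, here is a minimal Python sketch of the restore process described above. It picks the most recent full backup plus the incrementals taken after it, then reports the gap between the last restorable backup and the failure. The backup catalog, the timestamps, and the function names are illustrative assumptions rather than part of any backup product, and differential backups are left out for brevity.

```python
from datetime import datetime, timedelta

def restore_chain(backups, failure_time):
    """Return the backups to restore, in order, for a failure at failure_time.

    backups is a list of (kind, completed_at) tuples, where kind is
    "full" or "incremental". This catalog format is hypothetical.
    """
    usable = sorted((b for b in backups if b[1] <= failure_time), key=lambda b: b[1])
    # Locate the most recent full backup completed before the failure.
    last_full = max((i for i, b in enumerate(usable) if b[0] == "full"), default=None)
    if last_full is None:
        return []
    # Restore that full backup, then every incremental taken after it.
    later = [b for b in usable[last_full + 1:] if b[0] == "incremental"]
    return [usable[last_full]] + later

def data_loss_window(chain, failure_time):
    """Time between the last restorable backup and the failure."""
    if not chain:
        return None
    return failure_time - chain[-1][1]

# Illustrative schedule: weekly full backups with daily incrementals.
sunday = datetime(2008, 9, 7, 1, 0)
backups = [
    ("full", sunday - timedelta(days=7)),
    ("incremental", sunday - timedelta(days=6)),
    ("full", sunday),
    ("incremental", sunday + timedelta(days=1)),
    ("incremental", sunday + timedelta(days=2)),
]
failure = sunday + timedelta(days=2, hours=15)

chain = restore_chain(backups, failure)
print(len(chain), "backups to restore; data lost:", data_loss_window(chain, failure))
```

In this made-up schedule, a failure on Tuesday afternoon means three restore operations and 15 hours of lost work, which illustrates why a backup-only strategy struggles against tight RTO and RPO targets.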
IDC estimates that in 2007 server downtime cost organizations approximately $140 billion in lost revenue and reduced worker productivity. The key to avoiding these losses is continuous data replication technology.
There are many providers of this technology; all enable ongoing reproduction of live VM data from a primary server to an alternate server located in the same data center, at an alternate location, or in both places. Preferably, an alternate location is used because unplanned downtime can be caused by incidents such as power outages and weather emergencies that would likely affect both the primary and the secondary servers if they were located within the same facility. The replicated, or secondary, VM can be brought into service with minimal or no data loss. Because its data is already current, it can be brought online directly, eliminating the need to invest resources in restoring data from backup media.
For organizations that wish to implement fully automated disaster recovery, high-availability clustering should also be deployed. This technology ensures that if the host server fails, the VMs hosted on that server are restarted on an alternate server. This requires that all of the VMs reside on a shared Fibre Channel or iSCSI array.
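The restart behavior can be sketched in a few lines of Python. This is a simplified illustration of the idea, not how any clustering product works: the host names, the per-host VM capacity limit, and the `fail_over` function are all hypothetical, and real clusters weigh memory, CPU, and storage access rather than a simple VM count.

```python
def fail_over(placement, capacity, failed_host):
    """Reassign VMs from failed_host to surviving hosts with spare capacity.

    placement maps host name -> list of VM names.
    capacity maps host name -> maximum number of VMs (a hypothetical limit).
    Returns (new_placement, unplaced_vms); the inputs are not modified.
    """
    new_placement = {h: list(vms) for h, vms in placement.items() if h != failed_host}
    unplaced = []
    for vm in placement.get(failed_host, []):
        # Only hosts with at least one free slot can take the VM.
        candidates = [h for h in new_placement if len(new_placement[h]) < capacity[h]]
        if not candidates:
            unplaced.append(vm)
            continue
        # Prefer the surviving host with the most free slots.
        target = max(candidates, key=lambda h: capacity[h] - len(new_placement[h]))
        new_placement[target].append(vm)
    return new_placement, unplaced

# Illustrative two-node cluster.
placement = {"host-a": ["web", "db", "mail"], "host-b": ["files"]}
capacity = {"host-a": 4, "host-b": 3}

new_placement, unplaced = fail_over(placement, capacity, "host-a")
print(new_placement)
print("could not restart:", unplaced)
```

In this example the surviving host has room for only two of the three displaced VMs, so one stays down. Automated failover helps only if the alternate servers are sized to absorb the failed host's workloads.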
What happens if the shared storage array fails? What if there is a regional power outage? A shared storage array is itself a single point of failure, and a cluster confined to one site cannot survive an event that takes out the whole site. Geographically dispersed clusters utilizing real-time data replication are better options that shield organizations from regional power outages, hurricanes, and other phenomena. Specifically, putting at least 500 miles between a primary and backup server helps minimize risk.
While unplanned downtime gets more attention than planned downtime, the latter is still a significant concern for many organizations, especially those that specialize in security, financial services, online shopping, and other uptime-critical fields. To attain the famed “five nines” -- 99.999 percent, or all but five minutes and 15 seconds per year -- of uptime, companies need to effectively manage planned downtime.
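The downtime budget behind the "five nines" figure is simple arithmetic, worked through in the short Python sketch below. The function name and the 365-day year are assumptions made for illustration.

```python
def allowed_downtime_seconds(availability_percent, days_per_year=365):
    """Seconds of downtime per year permitted at the given availability."""
    seconds_per_year = days_per_year * 24 * 60 * 60
    return seconds_per_year * (100 - availability_percent) / 100

budget = allowed_downtime_seconds(99.999)
minutes, seconds = divmod(round(budget), 60)
print(f"99.999% uptime allows about {minutes} min {seconds} s of downtime per year")
# prints: 99.999% uptime allows about 5 min 15 s of downtime per year
```

Every outage, planned or unplanned, draws on the same budget, so a single maintenance window of a few minutes can use up the entire annual allowance.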
Scheduled downtime accounts for the majority of data center outages, so effective management is critical. Planned downtime of a host server affects all of the workloads that are virtualized on that system. In some circumstances, this means that dozens of virtual machines must be brought offline while maintenance (OS patching, hardware replacements, and so on) is performed on the host server. Generally, this amount of downtime is unacceptable, and because virtualization consolidates many workloads onto one host, the number of workloads affected by planned maintenance grows with every VM added to that host.
In a planned downtime situation, an administrator does not necessarily need to restart a VM on a secondary server, as would happen after an unplanned failure. Rather, in most cases, the administrator can move the running VM to a standby node, with at most a brief outage while the move completes. In some cases the move is mandatory rather than optional. For example, when the hardware needing service is the SAN that holds the organization’s VM files, the live VM should be moved to a secondary server that does not depend on that SAN before maintenance begins.
It’s bad enough to lose several hours of productivity from a traditional server crash. Multiply that disaster by a factor of two, three, or more, depending on how many VMs are hosted on a server, and you see the huge business impact. Virtualization can provide companies with huge benefits, but only if done right.
Because of the increased vulnerabilities inherent in virtual servers, organizations must enact a clear disaster-recovery and high-availability plan for their virtualized environment. They need to make sure that they are not putting all of their proverbial eggs in one basket by exposing a SPOF. They must also strongly consider factors such as geography, the likelihood of weather-related downtime, and whether to enable fully automated disaster recovery. Many organizations fail to consider that human resources can be scarce in a disaster, so it is usually best to automate as much of the process as possible and to locate the recovery site as far from the primary server location as is practical.
The best time to plan for a disaster is long before it actually happens. As companies continue to dive deeper into virtualization, they must simultaneously enact sound business continuity practices or they will be at serious risk for significant downtime and data loss.
Bob Williamson has over 10 years of experience delivering application and data protection solutions. He is currently the senior vice president of product management at SteelEye Technology. You can reach the author at firstname.lastname@example.org.