In-Depth
Three Top Tips for Successful Business Continuity Planning
Building a business continuity plan is an ongoing job, but mature technologies exist that meet the range of key requirements.
by John Ferraro
Business continuity planning is a critical task for enterprises of all sizes. Statistics collected over the past 20 years consistently show that an extremely high percentage of companies go out of business within one year of sustaining a multi-day outage. Business continuity planning is also a complex process in today’s rapidly changing IT environments, presenting formidable challenges for IT managers. In this article, we’ll focus on the top three tips you need to fast-track your way to a successful business continuity plan.
Tip #1: Understand Your RPO/RTO Requirements
A recovery point objective (RPO) is the maximum amount of data loss an organization can tolerate in the event of a problem (outage, virus, natural or man-made disaster, etc.). A recovery time objective (RTO) is the maximum time allowed for the business to come back online and function normally when recovering from a problem. Together, these two metrics define the service-level agreements (SLAs) a company must maintain for any given application environment, based on end-user recovery requirements.
As RPO/RTO requirements become more stringent, the cost of the data protection infrastructure required to meet them increases. To accurately determine business continuity requirements and ensure you’re spending in the right areas, you must have a thorough knowledge of the SLAs for each of your application environments.
Applications can be divided into different tiers depending on their RPO/RTO requirements. A three-tiered model is a common approach: Tier 1 comprises mission-critical applications requiring RTOs of less than one hour; Tier 2 includes applications with RTOs of less than four hours; and Tier 3 applications can tolerate RPOs of up to 24 hours. Work with department heads to determine which tier is most appropriate for each application. A chargeback system can help encourage appropriate tiering by making each department aware of the exact cost of SLA non-compliance. Depending on the type of customers you have, fees or penalties may accrue when contractually defined SLAs are not met.
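To make the tiering concrete, here is a minimal sketch in Python that classifies an application by its RTO/RPO targets using the example thresholds above; the function name and the sample applications are hypothetical, and your own tier boundaries may well differ.

```python
# A minimal sketch of the three-tier recovery model described above.
# Thresholds mirror the example figures in the text; adjust them to your own SLAs.

def recovery_tier(rto_hours: float, rpo_hours: float) -> int:
    """Return the recovery tier (1 = most critical) for an application."""
    if rto_hours < 1:
        return 1   # mission-critical: must recover in under an hour
    if rto_hours < 4:
        return 2   # important: must recover within four hours
    if rpo_hours <= 24:
        return 3   # can tolerate up to a day of data loss
    raise ValueError("RPO/RTO targets fall outside the defined tiers")

# Example: a real-time order-entry system with a 30-minute RTO lands in Tier 1,
# while home directories backed up nightly fall into Tier 3.
print(recovery_tier(rto_hours=0.5, rpo_hours=0.25))   # -> 1
print(recovery_tier(rto_hours=24,  rpo_hours=24))     # -> 3
```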
When determining SLAs, understand how your business is impacted when data becomes unavailable. Online applications that drive revenue in real time, or applications that support critical services (such as health care and emergency response), should be classified as Tier 1. E-mail systems may fall into Tier 1 or Tier 2, depending on how critical e-mail communication is to the daily flow of business. Home directories may be Tier 3 applications, depending on the nature of your business.
Generally, very few applications are defined as Tier 1 or Tier 2, and it’s not uncommon for 80 percent or more of a company’s application environments to fall into the Tier 3 category.
Tip #2: Implement the Appropriate Data Protection Infrastructure
Once you understand where your applications fall in this recovery tiering, you can implement the infrastructure necessary to meet your recovery requirements. First, it’s likely that you have a heterogeneous environment with different applications, operating systems, and servers and storage from different vendors. Ensure your data protection solution (i.e., your backup software plus the supporting hardware infrastructure) supports this heterogeneity, so you minimize complexity and retain maximum flexibility when purchasing new hardware or repurposing existing hardware.
Server virtualization technology is deployed in at least some capacity in over 80 percent of enterprises of all sizes, and it will only become more common in production environments. Be sure your data protection infrastructure can accommodate both physical and virtual server environments and can take advantage of optimizations available only on server virtualization platforms.
Second, identify the scale of storage your business needs to support over time and ensure your infrastructure can be built out reliably to accommodate it. With data growing at 50 percent to 60 percent or more per year across the board, even small enterprises will be managing 100TB or more within the next several years. Storage area networks (SANs) provide more flexibility than direct-attached storage, are particularly well suited to virtual server environments, and can offer ease-of-use and cost advantages through centralized management.
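As a back-of-the-envelope illustration of that growth rate, the short calculation below projects how quickly a hypothetical 20TB environment crosses the 100TB mark at 50 and 60 percent annual growth; the 20TB starting point is an assumption chosen purely for illustration.

```python
# Back-of-the-envelope projection of capacity growth at 50-60 percent per year.
# The 20 TB starting point is purely illustrative.

start_tb = 20
for rate in (0.5, 0.6):
    capacity = start_tb
    years = 0
    while capacity < 100:
        capacity *= 1 + rate
        years += 1
    print(f"At {rate:.0%} annual growth, 20 TB exceeds 100 TB in {years} years "
          f"(~{capacity:.0f} TB)")

# At 50% growth: 20 -> 30 -> 45 -> 67.5 -> 101.25 TB after 4 years.
# At 60% growth: 20 -> 32 -> 51.2 -> 81.9 -> 131 TB after 4 years.
```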
In addition, SANs facilitate the deployment of storage management technologies in the SAN fabric that are critical to addressing data protection problems, providing a cost-effective way to leverage sophisticated storage functionality (thin provisioning from a centralized storage pool, off-host backups through proxy servers, WAN optimization technologies such as compression, security capabilities such as encryption, etc.) while minimizing the overhead imposed on production servers. The storage consolidation a SAN provides also allows efficient use of replication to create and maintain copies of critical data sets at remote sites for disaster recovery (DR) purposes.
Third, evaluate the technologies available to meet RPO and RTO requirements across your defined tiers. Disk-based backup offers many performance and reliability advantages over tape, and it gives you access to other technologies that are important in meeting stringent recovery requirements, such as snapshot backups, continuous data protection (CDP), and replication. Snapshot backups minimize the production impact of backups and let you back up far more frequently than the conventional, tape-based once-a-day approach, providing multiple recovery points per day.
CDP transparently maintains up-to-date copies of data in real time, enabling near-instant recovery with little or no data loss. As such, CDP is the data protection technology commonly recommended for Tier 1 applications with the most stringent RPO/RTO requirements.
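One way to see why backup frequency matters is to compare the worst-case data loss each approach allows. The sketch below uses illustrative intervals (nightly tape, four-hour and hourly snapshots, and CDP); the figures are examples, not recommendations.

```python
# A rough illustration of how backup frequency bounds worst-case data loss (RPO).
# The intervals below are examples only.

backup_methods = {
    "nightly tape backup":         24.0,  # hours between recovery points
    "snapshots every 4 hours":      4.0,
    "snapshots every hour":         1.0,
    "continuous data protection":   0.0,  # every write is captured as it happens
}

for method, interval_hours in backup_methods.items():
    # In the worst case a failure occurs just before the next recovery point,
    # so up to one full interval of data is lost.
    print(f"{method:28s} worst-case data loss ~ {interval_hours:g} hours")
```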
Serial ATA (SATA) disk technologies offer the performance and reliability required for secondary storage applications such as backup and DR at aggressive price points, narrowing the cost difference between disk- and tape-based data protection infrastructures.
Tip #3: Document and Test Your Recovery Plan
After you’ve invested time in designing a good recovery process, document it in writing. If only one administrator at your company knows the recovery process in its entirety, and that process is not documented, you run major risks: what if that person leaves the company or is out on the day a recovery is required? Having created a business continuity plan, make sure you can reliably execute it and get the expected recovery results. Document your recovery processes in a “run book,” and keep copies of it in at least two different locations.
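As a rough illustration of what a run book entry might capture, the following sketch lays one out as structured data; every field name and value is a hypothetical example rather than a prescribed format.

```python
# A rough sketch of the information a run book entry might capture.
# All field names and values are hypothetical examples.

runbook_entry = {
    "application":    "order-entry database",
    "tier":           1,
    "rto_hours":      1,
    "rpo_minutes":    15,
    "recovery_steps": [
        "Promote the replica at the DR site to primary",
        "Repoint application servers to the promoted database",
        "Verify that transaction log replay completed",
        "Notify the application owner and run smoke tests",
    ],
    "owners":      ["storage-admin@example.com", "dba-oncall@example.com"],
    "last_tested": "YYYY-MM-DD",
}
```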
Let’s distinguish between local and remote disaster recovery. Most enterprise backup software products do a good job of tracking local data protection operations (e.g., backup jobs, file-level recoveries) and notifying you of associated problems. Most enterprises are recovering at least some data locally almost every day in response to user requests for deleted or corrupt files, effectively testing those processes on a regular basis. If problems arise in your local recovery processes, they are generally discovered and resolved quickly.
When you’re dealing with multi-site DR solutions, however, it’s a different story. Compared to recovering data locally, remote DR is a more complicated process and therefore riskier. In the natural course of operation, DR configurations tend to drift out of sync with production, a process called “configuration drift”: patches are applied, new data volumes are added, and configuration parameters are changed. Unless these changes are made at both the replication source and the target location, your recovery process may not perform as you expect.
Testing is the ideal way to ensure DR solutions perform as expected, but historically DR testing has been disruptive to production operations and is therefore done infrequently. Most enterprises with a DR plan rarely test it, and many never test it after initial deployment. This is a disaster waiting to happen.
Testing your DR plan may be the difference between experiencing a disaster recovery (where your recovery processes perform as expected) and a “disaster” recovery (where you run into unexpected problems that prevent you from meeting your RPO and RTO requirements). A comprehensive discussion of this issue is beyond the scope of this article, but here are two key points to remember.
- Leveraging automation is a smart way to reduce the risk associated with DR scenarios, make DR testing faster and easier, and provide a framework for improving your DR processes over time. Many of the recovery steps identified in your run book can be automated through scripting or other software tools (see the sketch after this list). If it is at all possible, do it.
- If you’re using server virtualization technology, you may be able to leverage the snapshot and replication functionality within your data protection solution to cost-effectively and non-disruptively perform DR testing. Evaluate the tools your server virtualization vendor offers to support this.
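As a minimal sketch of what scripted DR verification might look like, the example below walks a list of run book steps and reports failures; every command, host name, and step is a hypothetical placeholder, and real automation would invoke your backup, replication, or virtualization vendor's own tools.

```python
# A minimal sketch of automating run book recovery steps for a DR test.
# Every command and host name below is a hypothetical placeholder.

import subprocess
from typing import List

def run_step(description: str, command: List[str]) -> bool:
    """Run one recovery step and report whether it succeeded."""
    print(f"-> {description}")
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode != 0:
        print(f"   FAILED: {result.stderr.strip()}")
        return False
    return True

dr_test_steps = [
    ("Verify the replica storage is reachable at the DR site",
     ["ping", "-c", "1", "dr-storage.example.com"]),
    ("Check that the most recent replication job completed",
     ["echo", "query replication status via your vendor's CLI here"]),
    ("Bring up the recovered application in an isolated test network",
     ["echo", "start the recovered VM or service here"]),
]

if __name__ == "__main__":
    if all(run_step(desc, cmd) for desc, cmd in dr_test_steps):
        print("DR test passed: record the result in the run book.")
    else:
        print("DR test failed: investigate before a real disaster forces the issue.")
```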
If testing is automated and can be done non-disruptively, you’re likely to do it more often, and you will enjoy more reliable recoveries and fewer surprises. How often is often enough? We recommend DR testing at least every six months.
Summary
Crafting and maintaining a business continuity plan is an ongoing job, but mature technologies exist to meet the full range of RPO and RTO requirements. The three tips discussed in this article will fast-track your way to successful business continuity for your environment.
John Ferraro is the president and CEO of InMage Systems. You can contact the author at jferraro@inmage.com.