Q&A: Business Continuity and Disaster Recovery
We examine the connection between business continuity and disaster recovery, explore what business impact analysis is and who's involved, and look at the tools available for this analysis.
Disaster recovery and business continuity are terms that are often used interchangeably. Our guest, Rich Schiesser, explains the difference, explores what business impact analysis is as well as who is involved and what major steps are part of the analysis. He helps clear the air about the tools available for performing this analysis, and the part risk management plays in business continuity. Rich is the author of IT Systems Management, now in its second edition (2010: Prentice Hall).
Enterprise Systems: What is business continuity?
Rich Schiesser: Business continuity is a program of plans and activities that ensures critical business processes can be resumed within agreed-upon time frames if a sustained outage occurs. The agreed-upon time frames are referred to as recovery time objectives (RTOs), and the agreed amounts of associated data to be restored are called the recovery point objectives (RPOs). The RTOs and the RPOs are determined from a business impact analysis.
How does business continuity differ from disaster recovery?
Disaster recovery (DR) had its origins in the 1970s and referred to the recovery of a company’s IT infrastructure in general and its IT data center in particular. Business continuity (BC) had its origins in the 1990s and emphasized the continuity of all critical business operations across the entire enterprise, not just IT. Where DR tends to be reactive, BC is more proactive. DR focuses on technical recovery, whereas BC focuses on business recovery.
DR involves mostly technicians, whereas BC involves mostly business users. Finally, DR is usually part of IT with no specific career path or certifications as part of it. BC can be a part of risk management or an entity on its own and has widely accepted career paths and certifications.
What is a business impact analysis (BIA)? Who prepares it and what does it cover?
A business impact analysis is an enterprise-wide activity in which the effect of prolonged outages to business processes is determined. The purpose of a BIA is to identify and prioritize the most critical business processes in terms of the amount of time a process can be idled before significant business impact is felt. For some processes the amount of allowable time down might be only minutes or a few hours, whereas for others it might be days.
These estimated times are called recovery time objectives and are closely related to the point to which data must be restored (recovery point objectives) to support the recovered business process. The results of the BIA, including the RTOs and RPOs, are combined with risk management to determine appropriate recovery strategies.
A BIA is usually prepared by a group of business continuity planners from within an organization or by outside consultants. A full BIA covers all major departments of an enterprise, including core competencies, finance, administration, and IT.
What role does IT play in preparing the BIA? What is the business users' role?
IT plays a major role in preparing a BIA. Most business processes today depend on various IT services to operate. This means that if a major disaster disrupts a business process, the IT services that support the business process must first be restored before the business process can be recovered. If a process needs to be recovered in 4 hours, the IT services supporting it, and the associated data, might need to be recovered in 3 hours.
IT’s main role in a BIA is to determine the feasibility and costs of recovering IT services in time to meet the RTOs and RPOs of the business processes. Sometimes the RTOs need to be extended because the cost of the IT recovery can be prohibitive. Another role of IT in a BIA is to identify the IT dependencies that a particular IT service might have. These dependencies could influence the feasibility and cost of recovery.
The main role of the business users in a BIA is to identify their critical business processes and dependencies and to estimate how long a process can be down before significant impact occurs. Impacts can be financial or legal (among other categories) and need to be quantified by the users.
What tools are available to help an enterprise create the BIA? If created in-house, what are the steps in preparing it? Who's on the team? What expertise is needed?
Three of the major disaster recovery service providers are IBM, HP, and SunGard. Each of these provides software tools that help an enterprise create a BIA. Last year, SunGard acquired Strohl Software Systems, which had developed one of the premier tools of this type, BIA Pro, which has Web interfaces among other features. A few other vendors also supply BIA tools that give users a variety of alternatives based on function, cost, and ease of use.
An in-house created BIA consists of five major steps:
Step 1: Acquire executive support to ensure appropriate priority and resources are dedicated to the effort. Included in this step is a clear agreement as to the objectives and scope of the effort. This step needs to occur regardless of whether the BIA is created in-house or by outside consultants.
Step 2: Develop a questionnaire and interview form for planners to use in gathering data about processes from users.
Step 3: Schedule and conduct the interviews with users to determine RTOs, RPOs, and dependencies.
Step 4: Analyze the results and prioritize all processes across the enterprise.
Step 5: Compile the final report and present recommendations and costs of recovery strategies.
Business continuity planners, business user sponsors, and IT recovery specialists are usually on the BIA team. Excellent analytical and communication skills and knowledge of business and technical recoveries are the types of expertise needed for this effort.
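As a rough illustration of Step 4, the sketch below ranks interview results by RTO so that the processes with the least tolerable downtime come first. The process names, the numbers, and the ProcessRecord structure are hypothetical, and a real BIA would weigh financial and legal impact as well as time.

```python
from dataclasses import dataclass


@dataclass
class ProcessRecord:
    """One interview result from the BIA (hypothetical structure)."""
    name: str
    rto_hours: float  # longest tolerable downtime
    rpo_hours: float  # longest tolerable window of lost data


def prioritize(records):
    """Order processes by RTO, breaking ties with the tighter RPO."""
    return sorted(records, key=lambda r: (r.rto_hours, r.rpo_hours))


# Illustrative data only; not taken from any real BIA.
records = [
    ProcessRecord("Payroll", rto_hours=72, rpo_hours=24),
    ProcessRecord("Order entry", rto_hours=4, rpo_hours=1),
    ProcessRecord("Customer support", rto_hours=8, rpo_hours=4),
]

ranked_names = [r.name for r in prioritize(records)]
print(ranked_names)
```

Sorting on RTO alone is a simplification; it only shows how the interview data might feed a first-pass ranking that the planners then review.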
What is the impact of the BIA on business continuity?
The BIA has significant impact on business continuity. A properly conducted BIA determines the viability and costs of recovering within reasonable time frames for most types of calamities. The BIA helps prioritize business processes for recovery and identify the dependent processes and IT services needed for restoration.
What is meant by risk management?
Risk management involves three major steps: identification, analysis, and recommendation.
Step 1: Identify the threats (causes of major outages) and vulnerabilities (probabilities of the causes occurring) an organization has to the stability of its operations. This is sometimes called a risk assessment.
Step 2: Analyze the levels of threats and vulnerabilities, and propose countermeasures (and their costs) to these exposures. This is often referred to as risk analysis.
Step 3: Weigh the costs and benefits of implementing these countermeasures and recommend and implement appropriate responses.
For each risk, one of three actions is typically taken: the risk is either eliminated, ignored, or mitigated.
The combination of risk assessment, risk analysis, and proposing and implementing recommendations is collectively referred to as risk management.
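The cost-benefit weighing in Step 3 is often done by comparing expected yearly losses with and without a countermeasure. The sketch below assumes that approach; the formula (likelihood times impact), the function names, and all figures are illustrative assumptions rather than something prescribed in this interview.

```python
def expected_annual_loss(annual_probability, impact_cost):
    """Expected yearly loss from one threat: likelihood times impact."""
    return annual_probability * impact_cost


def net_benefit(prob_before, prob_after, impact_cost, countermeasure_cost):
    """Yearly loss avoided by the countermeasure, minus its yearly cost."""
    avoided = (expected_annual_loss(prob_before, impact_cost)
               - expected_annual_loss(prob_after, impact_cost))
    return avoided - countermeasure_cost


# Hypothetical threat: a data center power failure.
benefit = net_benefit(
    prob_before=0.10,            # assumed 10% chance per year without a generator
    prob_after=0.01,             # assumed 1% chance per year with one
    impact_cost=2_000_000,       # assumed cost of a sustained outage
    countermeasure_cost=50_000,  # assumed yearly cost of the generator
)
print(f"Net yearly benefit: {benefit:,.0f}")
```

A positive result suggests the countermeasure is worth recommending; a negative one suggests the exposure might be cheaper to accept or to mitigate some other way.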
What role does risk management play in business continuity?
Risk management, in collaboration with the BIA, helps to determine appropriate recovery strategies for business continuity. Understanding the threats and vulnerabilities an organization has for normal business operations can help minimize these exposures by implementing cost-effective countermeasures.
How are recovery strategies generated?
Recovery strategies are generated by compiling the results of the BIA, the risk assessments, and the risk analysis. This compilation should identify the appropriate recovery strategies needed to meet the agreed-upon RTOs and RPOs. For example, if a business process has an agreed-upon RTO of four hours, the recovery strategy must be such that all dependent processes and IT services are recovered in less than four hours to ensure the primary business process is operational within the four-hour RTO.
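The four-hour example can be checked mechanically once dependencies and estimated recovery times are recorded. The following sketch assumes that an item can start recovering only after everything it depends on is back, that independent dependencies recover in parallel, and that there are no circular dependencies. The service names and durations are hypothetical.

```python
from functools import lru_cache

# Hypothetical estimates: hours needed to recover each item on its own.
recovery_hours = {
    "order entry": 0.5,  # the business process itself
    "order database": 1.5,
    "network": 1.0,
    "storage": 1.0,
}

# What each item needs before its own recovery can begin.
depends_on = {
    "order entry": ["order database", "network"],
    "order database": ["storage"],
    "network": [],
    "storage": [],
}


@lru_cache(maxsize=None)
def time_to_recover(item):
    """Hours from the start of recovery until `item` is operational."""
    waiting = max((time_to_recover(d) for d in depends_on[item]), default=0.0)
    return waiting + recovery_hours[item]


rto_hours = 4.0
total = time_to_recover("order entry")
print(f"Estimated recovery: {total} h, meets RTO: {total <= rto_hours}")
```

Here the longest chain (storage, then the database, then the process) sets the total, which is why an overlooked dependency can quietly push a recovery past its RTO.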
What types of testing are performed, and how often should they be done?
There are three types of testing performed in support of business continuity: verification, simulation, and operational.
A verification test updates the factual contents of a business continuity plan. These contents include current participants, their contact information, call trees, hardware model numbers, software versions and releases, and other types of data that are likely to change over relatively short periods of time. A verification test should be done once every three to six months depending on the dynamics of the environment.
A simulation test, sometimes called a table-top exercise, consists of assembling the business continuity planners, recovery team members, appropriate business users, and other participants in a single room to act out the response to a simulated disaster. The purpose is to validate the accuracy, sequence, and dependencies of the recovery steps. Simulation tests should be performed once every 6 to 12 months.
In an operational test, critical business processes and the IT services that support them are stopped as if a major calamity had rendered them inoperable. IT services and business processes are restored at a designated recovery site. The purpose is to confirm the viability of restoring all critical processes, and to compare the actual recovery times and recovery points to the RTOs and RPOs.
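The comparison at the end of an operational test can be as simple as lining up measured values against the objectives. The sketch below is one way that might look; the function name, the data layout, and the measurements are hypothetical.

```python
def evaluate_test(results, objectives):
    """Return the processes whose measured recovery missed an objective.

    Both arguments map a process name to a pair of hours:
    (recovery time, data loss) in `results`, (RTO, RPO) in `objectives`.
    """
    misses = {}
    for name, (actual_time, actual_loss) in results.items():
        rto, rpo = objectives[name]
        problems = []
        if actual_time > rto:
            problems.append(f"recovery took {actual_time} h, RTO is {rto} h")
        if actual_loss > rpo:
            problems.append(f"lost {actual_loss} h of data, RPO is {rpo} h")
        if problems:
            misses[name] = problems
    return misses


# Hypothetical objectives and measurements from one operational test.
objectives = {"Order entry": (4, 1), "Payroll": (72, 24)}
results = {"Order entry": (5.5, 0.5), "Payroll": (30, 12)}

misses = evaluate_test(results, objectives)
print(misses)
```

In this made-up run, order entry met its RPO but overran its RTO, so it is the only process reported.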
What are the biggest mistakes enterprises make in their business continuity plans?
The three biggest mistakes involve participants, dependencies, and testing. Organizations sometimes involve only technical participants in developing technical recovery plans instead of including business users to address the recovery of business processes. Both groups need to participate collaboratively as a team to ensure the business continuity plan covers both the business and technical aspects of recovery.
Another frequent mistake companies make is to omit the dependencies that many business processes and IT services require to make them operational. If a particular IT service needs to be recovered within four hours and it depends on two other services to function, then those two services need to be recovered within the same four hours as well.
The last mistake is failing to test the plans. So much effort is often spent on developing the plans that there is little time or few resources left over to actually plan and conduct testing. Verification, simulation, and operational testing should be conducted approximately every three, six, and 12 months, respectively. Seldom are these done.
What best practices can you recommend to avoid these mistakes?
The best practice to avoid the mistake of improper participation is to ensure the effort to develop business continuity plans has executive support from both the business community and IT. This support is critical to ensuring both groups collaborate as a team to develop the most comprehensive recovery plan possible.
The best practice for identifying dependencies is to thoroughly review every recovery step with several pairs of eyes to ensure all input and output dependencies are identified. The best practice for testing is to establish a schedule by which verification, simulation, and operational tests are conducted approximately every 3, 6, and 12 months, respectively.
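A testing schedule of this kind is easy to track in a few lines of code. The sketch below is a minimal example that works at month granularity; the helper names and the dates of the last tests are hypothetical.

```python
from datetime import date

# Intervals in months, following the approximate cadence described above.
INTERVAL_MONTHS = {"verification": 3, "simulation": 6, "operational": 12}


def add_months(start, months):
    """Return the first day of the month `months` after `start`."""
    index = start.year * 12 + (start.month - 1) + months
    return date(index // 12, index % 12 + 1, 1)


def next_due(last_run):
    """Map each test type to the month in which it is next due."""
    return {kind: add_months(last_run[kind], INTERVAL_MONTHS[kind])
            for kind in INTERVAL_MONTHS}


# Hypothetical dates on which each test was last performed.
last_run = {
    "verification": date(2024, 11, 15),
    "simulation": date(2024, 9, 3),
    "operational": date(2024, 2, 20),
}

due = next_due(last_run)
for kind, when in due.items():
    print(f"{kind}: due {when:%B %Y}")
```

Publishing the resulting dates alongside the plan makes it harder for testing to be dropped once the plan itself is finished.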