Imagining Disaster Recovery

A little imagination can be a great thing when it comes to planning for a disaster. The disasters we might encounter as IT managers are no different.

by Kelly J. Lipp

We’re all aware of large-scale disasters: earthquakes in Southern California, a fire in a tunnel in New York City, and hurricanes in Florida and Louisiana. These catastrophic incidents taxed many corporate disaster recovery plans and in many cases resulted in the loss of a business. Although these events are hugely destructive, they do not occur often. Let’s call these Big Disasters.

It’s the Little Disasters that disrupt our business. The e-mail server’s database becomes corrupted. The order processing application fails. Network access to our building fails. A hacker inserts a bug into our infrastructure taking out our critical servers. A disgruntled employee tampers with data. Issues such as these are more likely to occur than a Big Disaster. How many of us have imagined what we would do if one of these failures happened in our data center?

All of us have given some thought to the matter, but probably not enough. We’ve given a great deal of thought to how we might recover, but we haven’t given nearly enough thought to how we will exist while we’re recovering. This is where our imaginations can have a tremendous impact.

The fact that e-mail is down is not the cause of business failure. It’s the lack of good communication between our customers and employees that causes the failure. E-mail is our key tool. Without it we are clearly at a disadvantage. We can’t imagine life without e-mail, but if we had imagined it, we might have minimized the disruption when it was unavailable!

Each disaster results in two streams of activity:

  1. The Recovery Process
  2. The “Exist Without” Process

The recovery process is usually straightforward for IT professionals: restore from a snapshot, restore from a backup, call the vendor, etc., depending on the type of outage. Even if we’ve never executed a recovery like this before, the recipe is fairly standard. We can even invent it on the fly in many cases as we have enough experience to imagine all sorts of recovery options.

The “Exist Without” process is much more complicated. We often don’t even start this process because we think the recovery process is going to happen fast, so why bother? That’s called hubris and it’s the beginning of our problems. If we had a good plan for “existing without,” we could probably weather any storm. Clearly it’s here where we need to spend our time.

Applying Imagination to Disaster Recovery

To adequately prepare for disasters, we must consider the types of disasters we are likely to experience. These steps start us on the path:

  1. Imagine the most likely events that will cause disruption within your data center
  2. Determine the business impact of these events
  3. Rate the business impact from high to low
  4. Develop a comprehensive plan to recover from each event, starting with the high impact events
  5. Develop the “Exist Without” process

Step 1: Imagine Likely Disaster Scenarios

Likely scenarios vary widely from business to business, but there are many similarities. Generally, losing access to some critical application is the most disruptive failure we can imagine. During research for this paper, I conferred with colleagues about likely scenarios. We developed five rather quickly, and we had firsthand experience with four of them. As you work through this step, your best source of scenarios and events will probably be past experience. Your own experiences and those of your staff will be a valuable source of imagination.

Think big, think small, but think thoroughly. The more scenarios you predict and imagine, the more likely you are to have resources at hand if and when they happen.

Step 2: Determine the Business Impact

The impact on the business is mostly a function of time. If the e-mail server is only down for five minutes, the impact is perhaps inconsequential. However, if that outage persists for hours or even days, the impact is much greater. “Hope for the best and plan for the worst” is the key adage here.

A comprehensive disaster recovery plan for an e-mail outage takes into consideration all the things that can happen during the recovery and has adequate contingency plans in place for when those recovery plans go awry.

Assess the business impact of each scenario, factoring in time as necessary. Focus on high-impact events.

Step 3: Rate the Business Impact

Some of the disaster scenarios you can imagine won’t have much impact on your business or they don’t happen often enough to worry about. Others, however, will have serious consequences. These are the low-hanging fruit for you. Get the most likely or the most critical scenarios identified and you’ve made it a long way. Imagination has already helped you. In many cases simply imagining what might happen during an event will help you out in the long run. If you’ve thought of something once you’re more likely to think of it again.
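One way to make the rating concrete is to score each imagined scenario and sort the list. The sketch below is only an illustration: the 1-to-5 scales, the likelihood-times-impact rule, and the sample numbers are all assumptions made for the example, not values from any real assessment.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    name: str
    likelihood: int  # assumed scale: 1 (rare) to 5 (frequent)
    impact: int      # assumed scale: 1 (minor) to 5 (severe)

    @property
    def score(self) -> int:
        # Assumed scoring rule: likelihood multiplied by impact.
        return self.likelihood * self.impact


def rank_scenarios(scenarios):
    """Return the scenarios ordered from highest to lowest score."""
    return sorted(scenarios, key=lambda s: s.score, reverse=True)


# Illustrative numbers only.
scenarios = [
    Scenario("Regional earthquake", likelihood=1, impact=5),
    Scenario("E-mail database corrupted", likelihood=4, impact=4),
    Scenario("Building network access fails", likelihood=2, impact=4),
    Scenario("Order processing application fails", likelihood=3, impact=5),
]

ranked = rank_scenarios(scenarios)
for s in ranked:
    print(f"{s.score:>2}  {s.name}")
```

With these sample numbers the Little Disasters land at the top of the list and the Big Disaster at the bottom, which is the ordering in which the plans would be written.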

Step 4: Develop a Recovery Plan

For each scenario, develop a comprehensive recovery plan that details who, what, when, where, and how the event will be resolved. For traditional data center problems such as the loss of an application or the corruption of data, this will involve some restoration scheme that is reasonably well known to most of your staff.

For disasters that affect your infrastructure, it may be necessary to recover the application at another site. Do you have an alternate site? Plan for contingencies: what do we do if Plan A doesn’t work? Do we have an alternative? Should we be working multiple recovery options in parallel, knowing one will complete more quickly than another? Imagine a number of recovery scenarios. One will be best, but having alternatives may pay dividends.

In general, our plan answers the following:

  • Who: Who are the key players necessary to recover? What if they aren’t available? How do we contact them? What are the alternatives?
  • What: What are we recovering? What are we going to do if our first recovery option fails? What resources are we going to need to implement the recovery? Do we have an alternate site, alternative hardware, or alternative communications?
  • When: When can we realistically expect the recovery to be complete? This is crucial for the “Exist Without” process. Having a good estimate of how long the event will last is the key to a good “exist without” plan.
  • Where: Where is the recovery going to happen? In the event of an application failure, the where is probably our data center. In the event of a communications outage, it might be an alternate site.
  • How: How many resources do we need? What other dependencies are there?
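The questions above can be captured in a simple record, which makes missing answers easy to spot. This is a minimal sketch: the field names and the gap rules (at least two people, at least two recovery options) are hypothetical choices for illustration, not a standard.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RecoveryPlan:
    scenario: str
    key_people: List[str] = field(default_factory=list)        # Who
    recovery_options: List[str] = field(default_factory=list)  # What
    estimated_hours: Optional[float] = None                    # When
    sites: List[str] = field(default_factory=list)             # Where
    dependencies: List[str] = field(default_factory=list)      # How


def find_gaps(plan: RecoveryPlan) -> List[str]:
    """List the questions this plan does not yet answer (assumed rules)."""
    gaps = []
    if len(plan.key_people) < 2:
        gaps.append("Who: no backup for the key person")
    if len(plan.recovery_options) < 2:
        gaps.append("What: no fallback recovery option")
    if plan.estimated_hours is None:
        gaps.append("When: no time estimate")
    if not plan.sites:
        gaps.append("Where: no recovery site named")
    return gaps


plan = RecoveryPlan(
    scenario="E-mail server failure",
    key_people=["mail administrator"],
    recovery_options=["restore from snapshot", "restore from backup"],
    estimated_hours=8,
    sites=["primary data center"],
)

gaps = find_gaps(plan)
for gap in gaps:
    print(gap)
```

Here the only gap reported is the missing backup person, which corresponds to the “What if they aren’t available?” question.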

Developing a comprehensive recovery plan is tricky business. The more thorough your plan, the greater your likelihood of a successful outcome. The good news is this is a standard data center procedure. We should be good at this part of the problem, provided we imagine all the things that can happen, especially issues that force us to fall back on alternative recovery options. Having more than one recovery option is crucial. It isn’t the problems we know about that will cause our failure; it’s the ones we don’t know about. The more problems we imagine and address, the less likely we are to fall prey to them.

Step 5: Develop the “Exist Without” Plan

Here’s where you earn your money. Plan this step well and you can minimize or perhaps eliminate the impact of an outage of almost any sort! Let’s use the loss of e-mail as our scenario.

If e-mail is the key communications method for your organization, you can expect big problems fairly soon. What are you going to do? Do you have an alternative communications plan? Do you have an alternative e-mail server? Do you have a way to communicate with your customers to let them know of the alternative communications method? What’s Plan B? (What do you mean what’s Plan B? We didn’t even have a Plan A!)

Ideally, you’ve thought through how you will communicate if e-mail is unavailable. Your plan might have time triggers to help you determine which alternative you will employ. Based on information from the recovery team, you can begin to implement your alternatives. For instance, if the recovery will take eight hours to restore e-mail service, you might decide to communicate via telephone and a Web site announcement to your customers explaining the problem and suggesting alternative ways to contact you. You might begin using an alternative mail server, either something you have within your data center or perhaps falling back to a third-party e-mail provider. Perhaps there is another division within your organization that can provide resources.
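Time triggers like these can be written down as a small table that maps the recovery team’s estimate to the alternatives to put in motion. The thresholds and actions below are hypothetical and chosen only to show the idea; each organization would set its own.

```python
# Hypothetical triggers: (minimum estimated outage in hours, action).
TRIGGERS = [
    (1, "Notify staff by telephone"),
    (4, "Post a Web site announcement with alternative contact methods"),
    (8, "Switch to an alternative mail server"),
]


def actions_for(estimated_hours):
    """Return every action whose threshold is at or below the estimate."""
    return [action for hours, action in TRIGGERS if hours <= estimated_hours]


# The recovery team estimates eight hours to restore e-mail service.
for action in actions_for(8):
    print(action)
```

An estimate of 30 minutes triggers nothing, an estimate of five hours triggers the first two actions, and an estimate of eight hours triggers all three. As the recovery team revises its estimate, the same lookup tells you when to escalate.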

It is another who, what, when, where, and how problem.

  • Who: Who needs to be informed of the outage? Who informs them? Who acts as the information clearing house during the outage?
  • What: What are the alternatives? How are we going to do business without this application, communication method, computer room, or building?
  • When: When do we act? When do we switch to an alternative so we can “Exist Without”?
  • Where: Where can our customers continue to do business with us? Where is our business going to be located? Where is our staff?
  • How: How does our business look during the event?

Time is the key. If you imagine life without e-mail for various periods of time and develop contingencies for each of these periods, you will know what to do when you’re faced with an outage. If your recovery plan is flawless, you might not need to do anything. However, it’s when something goes wrong that having imagined a number of alternatives will prove invaluable.

Summary

On January 27, 1967, a tragic fire swept through the Apollo 1 command module at Cape Kennedy in Florida. All three astronauts on board perished. Recriminations echoed through the halls of government. In testimony to the U.S. Senate, astronaut Frank Borman noted, “It was a failure of imagination. We are limited. We can't foresee all contingencies. We don't know what is around the bend. No one thought there would be a fatal fire on the ground, or that a non-explosive hatch might prevent the astronauts from escaping such a disaster. No one imagined it could happen. We are all to blame for failing to imagine what could happen.”

After the Apollo 1 investigation was complete, NASA returned to the business of getting to the moon. They had weathered the largest disaster and had come through the experience stronger. They were imagining more and thinking about solutions to problems before they happened. Apollo 13 gave them another chance, and their imagination paid off. Instead of losing three more astronauts, they rallied and implemented imagined solutions to a serious problem. You might even say it set the tone for the future of space travel.

We can learn a lot from NASA’s lesson. Imagine the disasters that will affect us. Imagine all the alternatives we have for solving the problem and existing while we’re solving it. Develop good plans based on our imagination and know we will navigate our way through the problem.

We can’t imagine life without our technology. We must try to imagine it if we have any chance of developing a realistic disaster recovery plan.


Kelly J. Lipp is the CTO of STORServer, Inc. You can contact the author at kelly.lipp@storserver.com