7 Steps for Application Recovery

Seven actions every CIO should take in the event of a major service outage.

By Sherman Wood

The world's smartest engineers work hard every day to ensure that the cloud services we all use don't fail. Corporate IT departments have service management groups set up specifically to keep networks and applications humming. Still, even the most durable applications and websites fall down sometimes. End users have scant tolerance for application failure, particularly when the failing application is a frequently used, high-profile service.

Do you have a disaster recovery plan for your critical applications? Without one, a company risks revenue loss due to business disruption, customer defections, and damaged reputation. Based on our company's experience working with Fortune 1000 companies to improve application performance, we've outlined the first-response actions every CIO should take in the event of a major service outage.

Step 1: Assess the impact on customers and operations

Because the business-criticality of an application dictates the level of response, determining which transactions are under assault is your first task. If the problem lies within a customer-facing application or transactions that generate or directly impact revenues, naturally it's going to need top-level attention. IT must know if there is a direct cost to the business for every second of downtime before creating a response strategy.

Determining how many and which customers are affected (be they internal or external) is not always easy. Packaged applications such as SAP provide dashboards to show the priority of system events. Hopefully, a company has monitoring systems that can also give a close estimate of the number of affected users when a particular segment of the network or site goes down, but often manual effort is needed. An IT employee may need to call the help desk to learn how many support calls came in related to the issue. Problems can take hours and even days to unravel. Above all, IT should have processes and/or tools that show the business relevance of a problem quickly so that it can respond appropriately. An e-mail server slowdown should trigger a process different from actions to restore your website’s customer sign-on page.
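As a rough illustration of the triage described above, the following Python sketch assigns a response tier to an incident and estimates the direct cost of downtime. Everything in it is an assumption made for illustration: the field names, the revenue-per-minute figure, and the 100-user threshold are hypothetical and would have to come from your own business owners and monitoring data.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    application: str
    customer_facing: bool
    revenue_per_minute: float  # hypothetical estimate supplied by the business
    affected_users: int        # from monitoring, or a manual help-desk count
    minutes_down: float


def downtime_cost(incident: Incident) -> float:
    """Direct revenue lost so far, under the per-minute estimate."""
    return incident.revenue_per_minute * incident.minutes_down


def response_tier(incident: Incident, user_threshold: int = 100) -> str:
    """Pick a response process based on business relevance.

    The tier names and the threshold are illustrative assumptions.
    """
    if incident.customer_facing or incident.revenue_per_minute > 0:
        return "top"
    if incident.affected_users >= user_threshold:
        return "elevated"
    return "standard"


signon = Incident("customer sign-on page", True, 500.0, 2000, 12)
email = Incident("e-mail server", False, 0.0, 40, 12)

for incident in (signon, email):
    print(incident.application, response_tier(incident), downtime_cost(incident))
```

The point of the sketch is only that the sign-on page and the e-mail server land in different response processes, and that the cost estimate is available before a response strategy is chosen.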

Step 2: Notify customers and other affected users immediately

There's nothing like that uncomfortable void when your power goes out and the utility company’s phone lines are busy. Silence is deadly to business when things go awry. Even if you have to write the e-mails yourself, send them out as soon as you know the nature of the problem and when it might be fixed. Ideally, your event management or monitoring system will do this automatically, but a live person needs to ensure that the information customers and users receive is spot on. Crying wolf, with incorrect information, is as bad as waiting too long to update users. In the first e-mail, provide your best guess of how long it will take to fix the problem. Be conservative with your time frames.

Be clear about when the next communication will go out, and remember that sooner is better than later. To protect the reputation of a company’s well-known brand, you'll need to craft a PR response, too. There are plenty of sites that badmouth companies when they err, so avoid at all costs being called out for ignoring the issue.
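The elements of a first notification described in this step (the nature of the problem, a conservative time frame, and the time of the next update) can be sketched in Python as follows. The function name, the 1.5x padding factor, and the 30-minute update interval are assumptions chosen for illustration, not recommendations from any particular monitoring product. A person should still review the message before it goes out.

```python
from datetime import datetime, timedelta


def build_status_update(service, problem, estimate_minutes, now,
                        padding=1.5, update_interval_minutes=30):
    """Draft a first outage notification.

    The engineering estimate is padded so the published time frame
    is conservative. Padding and interval values are illustrative.
    """
    expected_fix = now + timedelta(minutes=estimate_minutes * padding)
    next_update = now + timedelta(minutes=update_interval_minutes)
    return "\n".join([
        f"Service affected: {service}",
        f"What we know: {problem}",
        f"Expected resolution by: {expected_fix.strftime('%H:%M')}",
        f"Next update at: {next_update.strftime('%H:%M')}",
    ])


message = build_status_update(
    "customer sign-on page",
    "Logins are failing for some users.",
    estimate_minutes=60,
    now=datetime(2024, 1, 1, 9, 0),
)
print(message)
```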

Step 3: Put together a Tiger team

In the initial stages of your investigation, you may not know the exact source of the problem, but monitoring systems should at least indicate the general area, such as the CRM system. At that point, you'll need to bring together the server, database, storage, and application people who support the CRM. Next, your "Tiger team" can begin to work through the failure and identify precisely where the breakdown occurred.

Step 4: Avoid finger-pointing

When any major system fails, there's always going to be "blame-storming." The people managing each component of the environment will survey their monitoring data and say, “No, it's not us.” This is where a holistic application performance management system can help by showing system-wide monitoring data in one screen, with cross-application and cross-transaction views across the entire environment. A dashboard like this should leave no question as to the source of the problem. The Tiger team needs to stay objective and conduct a step-by-step elimination of potential causes until the true culprit comes to light. The more a company can deploy automation for this process, the better for everyone, and the faster the resolution.
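One way to picture an automated, blame-free elimination is the Python sketch below. It compares each component's observed response time against its normal baseline, eliminates components that are within tolerance, and ranks whatever remains. The component names, the millisecond figures, and the 1.5x tolerance are all hypothetical, and a real application performance management system would use far richer data than a single latency number.

```python
def rank_suspects(baseline_ms, observed_ms, tolerance=1.5):
    """Eliminate components behaving normally; rank the rest.

    A component is eliminated when its observed latency is within
    `tolerance` times its baseline. Components with no observed data
    cannot be eliminated and are ranked first. Baselines are assumed
    to be positive.
    """
    suspects = []
    for component, baseline in baseline_ms.items():
        observed = observed_ms.get(component)
        if observed is None:
            suspects.append((component, float("inf")))
            continue
        ratio = observed / baseline
        if ratio > tolerance:
            suspects.append((component, ratio))
    # Largest deviation from normal first.
    return sorted(suspects, key=lambda item: item[1], reverse=True)


baseline = {"web server": 20, "application": 50, "database": 30, "storage": 10}
observed = {"web server": 22, "application": 60, "database": 240, "storage": 11}

print(rank_suspects(baseline, observed))
```

Because every team's component is judged against the same rule, the result is harder to argue with than each group reading its own monitoring data in isolation.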

Step 5: Align IT with the business on a path to recovery

Whether to solve a problem as fast as possible (to mollify a department head or premium customer) or to solve it so there’s minimal damage is a tricky decision. If the data is sensitive and critical to product development, it might be best to spend more time recovering the full data set. These decisions will need senior-level guidance from the CIO, CTO, or even the CEO, and backing from the application’s business owner.

Step 6: Put your plan into action

For common fixes, such as rebuilding a server or recovering a database to a certain point in time, IT employees can practice the task so they’re prepared to act quickly and confidently when disaster strikes. Unfortunately, many major system repairs are ad hoc, and the IT department will have no prior experience performing the job. Large companies usually try to staff their IT departments with top application experts who are prepared to handle most failures, yet there may be cases when an outside consultant’s help is required, regardless of the cost.

Step 7: Perform a post-mortem analysis

Hindsight is always 20/20. Take the time to analyze what happened when high-impact events occur. First, you want to prevent the event from happening again; granular insight into the root cause gives IT managers actionable guidance. Invariably, someone will need to tune or reconfigure systems or even add new technologies, such as stronger security tools. It’s also imperative to analyze response times. Did IT resolve the issue fast enough, were they efficient, and were the communications good enough to alleviate user concerns? Major outages are painful, but they provide valuable lessons; the post-mortem is the crucial last step of the application disaster recovery plan.

Understanding how to respond to a disaster efficiently, especially when critical processes and applications are at stake, should be a top priority for CIOs today. Preventing problems from happening in the first place is harder, but that is the end goal. It takes a combination of planning with business prioritization, the right team of system experts and problem solvers, and comprehensive monitoring technologies to make application management a whole lot simpler.

Sherman Wood is vice president of products at Precise in Redwood Shores, California. You can contact the author at sherman.wood@precise.com.