Another Mean Season Underway

Testing is where the rubber meets the road … and fewer than 50 percent of companies ever do it

Last week, I was in Chicago speaking at a Storage Decisions conference that I hadn't attended in about five years. It was like a reunion, with many of the same faces that I remembered from long ago, and conversations that seemed to pick up just where we left them half a decade ago.

One topic I was asked to cover was disaster recovery plan testing -- in particular, whether traditional testing techniques are still up to the task of validating strategies and keeping pace with rapid business and technology change. Survey after survey shows increased attention to business continuity; awareness of the need for disaster recovery/business continuity planning has never been higher.

Of course, testing is where the rubber meets the road. Testing is gap analysis in action, helping to spot inadequacies or weak points in strategies that usually present themselves as a consequence of business process or infrastructure change. Testing is also a rehearsal for logistical activities and for personnel who will be called upon to perform recovery tasks and to keep their heads while all about them are losing theirs.

I have noted before that testing isn't being done. In addition to the old 50/50 rule of thumb (50 percent of companies have plans, and, of those, fewer than 50 percent ever test them), I found some other stats that were even more compelling. Last October, Forrester Research and Disaster Recovery Journal surveyed 250 of what I assume to be well-versed DR planners and found that 68 percent tested their plans only once per year, and 18 percent even less frequently. In the same research, 43 percent admitted to updating their plans once per year or less often. The most startling statistic came from a May 2008 AT&T survey of 100 Chicagoland companies with revenues of $10 million or more: awareness of the need for planning was up 73 percent from the previous year's survey, but only 43 percent of those surveyed had fully tested their plans within the previous 12 months (an improvement over the 37 percent that did so in 2007), and almost one-fifth admitted they had never tested their business-continuity plans -- nearly double the 10 percent who said so in 2007.

Over and over, I hear that a combination of inclement weather patterns, economic uncertainty, and political flux, not to mention regulatory compliance, is driving interest in DR, but that does not translate directly into comprehensive planning action, which includes testing. I was curious to see if the complexity and cost of testing might not be part of the problem.

Based on non-scientific polling (e.g., conversations over lunch at the conference, chats with folks in airports and on planes, etc.), here are a few of my findings.

First, testing is widely perceived as an additional expense that was not covered in the original scope of the disaster recovery planning project. It was difficult enough to convince management to budget the cost of building a plan; it is much harder to get them to underwrite the ongoing costs of testing and maintaining the DR capability. Management buy-in was often intended simply to mollify auditors and to earn the checkmark for building the plan, which is viewed as just so much more insurance. In such instances, the failure to test is not due to a deficit of testing methodology; it reflects the planner's failure to sell past management complacency.

Testing is an expense, and in the current economic climate, expenses are constantly being re-assessed. Without a solid business value case, reiterated over and over, chances are that management's willingness to fund more DR work will wane over time. To make a more compelling case for funding, planners need to re-argue the case for ongoing plan testing from the standpoint of risk reduction, process improvement, and cost savings.

I am also told that technology is making testing more difficult. Every vendor, from the application software and operating system vendors to the hardware folks, wants in on the game of data replication. A week or so ago, VMware released a continuity offering designed to aid failover of virtual server infrastructure. Oracle and Microsoft now sport their own backup processes in their databases and e-mail. EMC has released new versions of its data replication software, which now integrates Avamar de-duplication technology. Add to that traditional backup software and proprietary intra-array mirroring. Bottom line: the number and complexity of data protection processes in play has never been greater, and keeping tabs on what is being replicated where -- and figuring out ways to test it -- has become a hassle unto itself.

Aggregators (such as Continuity Software) and wrappers (such as CA XOsoft, Neverfail, and DoubleTake, among others), previously covered in this column, may provide partial solutions. However, the products that enable the monitoring of replication processes (aggregators) or the geo-clustering of entire hosting and storage platforms for scenario-based replication and failover (wrappers) are still works in progress, and they vary widely in their support for external hardware and software processes. My hope, and that of many with whom I have talked, is that these products will make testing and change management much simpler by providing the means to continuously validate infrastructure recovery, so that test time can focus on other logistical requirements such as user recovery and incident command and control.

This leads to the third testing impediment: testing methodology itself. Testing, for anyone who has ever done it, is a multi-headed beast. There are many ways to test, but few that are considered as effective as actually re-hosting systems at the shadow site and bringing them up to an operational state parallel with the primary facility. For those who test their plans fully every year, such a testing approach is commonly applied to all mission-critical business processes, with the recovery of each critical process being tested from end to end over the course of several test events over 12 months.

Design is everything when mapping out the annual test regime. You need to account for all of the resources required for testing, attend to schedules, and budget people's time. You also need to structure tests so they are non-linear: a single test event should not include several test items that depend on each other's outcomes in order to be completed. A chain of interdependent tasks can cause the entire test event to fail because one task upstream could not be accomplished -- wasting precious testing time.
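To make the non-linearity rule concrete, the scheduling idea can be sketched in code: given a map of recovery-test tasks and their dependencies, group the tasks into "waves" in which no task depends on another task in the same wave, and then draw each test event from a single wave. The task names and dependency map below are entirely hypothetical, and this is only an illustrative sketch of the grouping logic, not a real test-planning tool.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical recovery-test tasks; each maps to the set of tasks
# whose outcomes it depends on.
deps = {
    "restore_db":       set(),
    "restore_app":      set(),
    "verify_app_login": {"restore_db", "restore_app"},
    "run_batch_cycle":  {"verify_app_login"},
}

# Group tasks into waves: everything in a wave is mutually independent,
# so a single test event built from one wave contains no task whose
# completion hinges on another task in the same event.
ts = TopologicalSorter(deps)
ts.prepare()
waves = []
while ts.is_active():
    ready = sorted(ts.get_ready())  # all tasks whose dependencies are met
    waves.append(ready)
    for task in ready:
        ts.done(task)

for i, wave in enumerate(waves, 1):
    print(f"Test event {i}: {wave}")
```

Running the sketch yields three waves: the two restores (independent, testable in one event), then the login verification, then the batch cycle -- each a candidate for its own test event.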

A valid question to ask is whether non-linear testing really communicates the order and interdependency of recovery processes to participants. If one of the key values of testing is rehearsal, does non-linearity obviate this value?

Again, using burgeoning aggregator and wrapper technology to augment testing may render this concern moot. If we are constantly reassured that infrastructure and applications will fail over successfully, and that data is being replicated correctly, we might be able to focus our "live" tests on non-IT elements (business process and people logistics), enabling those tests to be more linear in nature.

A corollary concern to the above is something I call the Heisenberg Factor: the concern that scrupulous pre-planning for tests may itself interfere with outcomes, causing us to miss subtle variables that could impact real-world recovery. Traditional testing must ensure that all assets required for the test are in place and ready for use on test day, to prevent a test failure that wastes everyone's time.

Let me be clear about this: There is no such thing as a failed test, except for one that is improperly planned. If a process or procedure that is being tested fails, we learn something valuable. However, if the test item cannot be tested because someone forgot to bring the backup tapes, then it is an administrative failure.

The Heisenberg Factor may be a valid concern in assessing testing efficacy, but I find the discussion somewhat specious. Continuity plans are never scripts for recovery. They are at best a guide that will need to flex to accommodate whatever challenges a crisis event throws your way. Clearly, having good data replication monitoring and management, as well as a solid and continuously monitored strategy for infrastructure failover -- which can be accomplished today at minimal cost using an IP network and some wrapper software -- will mitigate most of the concerns about over-planning for tests.

As this column goes to press, we are about a week away from the official start of hurricane season in Florida. We have been blessed by several years of reasonably calm weather since the cataclysms of Katrina and Rita, but this should not lull anyone into complacency. Last month's wildfires, tornadoes, earthquakes, and cyclones in the U.S. and abroad, as well as the litany of non-weather related events that happen every day, should continue to underscore the importance of preparedness.

So, get prepared. Test your plans, and consider some software to help monitor and manage change. Your comments are welcome.