Disaster Recovery: Best Practices
Last December, the Board of Governors of the Federal Reserve System, in conjunction with the Securities and Exchange Commission (SEC) and the Office of the Comptroller of the Currency, released a white paper that outlined several “sound practices” designed to strengthen the resiliency of U.S. financial markets in the event of a disaster.
Among other recommendations, the inter-agency white paper suggested that large financial institutions adopt business continuity and disaster recovery plans that support intra-day recovery—with zero data loss, to boot—and which separate primary and secondary recovery sites by at least 200 miles.
The reaction from purveyors of disaster recovery services was one of mixed bemusement and frustration. “The recommendations they made—it’s just impossible,” says John Sensenich, director of product management with BCP and disaster recovery services vendor Sungard. “Technically what they were suggesting was not possible. The network implications alone—you can’t put together a reasonable high-bandwidth capability when you start to stretch it out over hundreds and hundreds of miles. It’s impossible.”
Sensenich and other industry watchers acknowledge, however, that the inter-agency recommendations were proffered as part of an evolving dialog between government agencies, key players in the financial services industry, and hardware, software, and services vendors. “The 200 or 300 mile[s of separation between primary and secondary sites] comment was really just that—a request for feedback. It was designed for financial services firms to weigh in on that 200 or 300 mile suggestion,” says Steve Higgins, director of business continuity marketing for EMC Corp.
If it was feedback they were seeking, the three federal agencies got their wish. Sungard, EMC and other large purveyors of disaster recovery or business continuity planning (BCP) services cried foul, pointing out—a la Sungard’s Sensenich—that many of the white paper’s recommendations were impossible to implement using current technology. “When you read through the comments [from vendors], most of them say, ‘Look guys, you are talking about something that the laws of physics just won’t allow us to do. We need to be more practical and realistic about this,’” comments Higgins.
Adds Thom Carroll, global director of business continuity with IT outsourcing and services provider Computer Sciences Corp. (CSC): “The long and short of it is that the financial services people pushed back dramatically on a lot of this stuff, saying, ‘Yeah, it’s a nice idea to have 200 and 300 miles between sites, but it’s not possible.’ So what you’re seeing is everybody narrowing this down to some [recommendations] that they can agree on.”
Even as these players clash over just what constitutes disaster preparedness in the post-September 11 world, industry watchers say that most of the same best practices will carry over. “Honestly, the best practices that have been developed over the years continue to hold true,” confirms Sungard’s Sensenich. “[There are] applications that are used which update specific files when they’re modified on a disk, there’s asynchronous updates, or trickling information back and forth to the recovery disk from the production environment. What that means is that when specific modifications are made to a specific database, they can be changed immediately at the database at the backup site.”
CSC’s Carroll agrees. “The individual platform recovery strategies haven’t changed from pre-9/11. What has changed is the attitude around business continuity planning, the need to protect the enterprise,” he says. “So you’ve got to take the time and the energy to understand your business flow, to understand just where your critical processes are, and that’s more focused on the business continuity side.”
One suggestion, says Sungard’s Sensenich, is to consider moving your most critical applications to a managed hosting provider that offers disaster recovery services. “Not all apps are equally important, so you would identify those that are most important, most critical to your business, and look at a managed hosting solution,” he suggests. “For the others, those applications that you can manage your business without, you can host them internally.”
Still another popular practice, says Carroll, is consolidation. The reality is that most organizations have too many different systems with too many different operating systems and a surfeit of applications. Cutting down the number of individual systems—and standardizing on as few operating environments as possible—is a big help when preparing and actually practicing a disaster recovery plan. “We try to encourage our customers to consolidate to a standard suite of machines and products, standard operating environment, [and] very specific set of products and services that allow for mirroring and monitoring, and that improves the quality of the product,” he explains.
Another lesson that many companies have learned from September 11 is not to put all of their eggs in one basket, so to speak. As a result, many organizations are attempting to geographically distribute human resources, information systems, and business processes to minimize the potential for catastrophic loss. “They’re doing topology risk analysis, critically looking at their mission-critical processes and making sure that they don’t have too many running at one location,” says Kevin Coyne, director of business operations with Sun Microsystems Inc.’s services unit.
For many applications and technologies, organizations are constrained in their ability to distribute information systems by the physical limits described above. Parallel Sysplex mainframe systems, for example, can’t be clustered beyond a certain point without sacrificing performance and—potentially—reliability and availability. “If you’ve got a parallel computing architecture, you can’t have a node of that too far away because the propagation rate of bits will put that node too far away to make the computer work fast enough to do a parallel processing job,” observes Tony Adams, a principal support analyst with research firm Gartner Inc.’s worldwide infrastructure support service.
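The distance limit Adams describes can be put in rough numbers. The sketch below is a back-of-the-envelope estimate, not a vendor figure: it assumes signals travel through optical fiber at about 200,000 km per second (roughly two-thirds the speed of light in a vacuum) and ignores switching, routing, and protocol overhead, so real links will be slower. The function name and constants are illustrative.

```python
# Assumption: light in optical fiber covers roughly 200,000 km per second.
FIBER_SPEED_KM_PER_S = 200_000.0
KM_PER_MILE = 1.609344


def round_trip_ms(distance_km: float) -> float:
    """Lower bound on round-trip propagation delay, in milliseconds."""
    return 2 * distance_km / FIBER_SPEED_KM_PER_S * 1000.0


# The separations floated in the inter-agency white paper discussion.
for miles in (200, 300):
    km = miles * KM_PER_MILE
    print(f"{miles} miles ({km:.0f} km): at least {round_trip_ms(km):.2f} ms per round trip")
```

Under these assumptions, every operation that must wait for a remote node to answer pays at least one such round trip, which is why tightly coupled clusters stop scaling with distance well before the network runs out of bandwidth.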
While it’s possible to stretch these distances out using asynchronous and other replication schemes, industry watchers point out that such approaches are not appropriate for all industries or applications. “The nature of asynchronous replication is that you’re going to lose some information, and in the financial services community, the idea of losing information can have a real impact on your condition to report back to the [Federal Reserve], or close out a trade,” says EMC’s Higgins. “Synchronous ensures that no data is lost, and that’s really critical in this community … They know what trades have happened … It’s very important in the financial services community that they don’t lose information.”
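The trade-off Higgins describes can be illustrated with a toy model. The sketch below is not any vendor’s replication product; the class and method names are hypothetical, and the “backup” is just an in-memory list. It shows only the core difference: a synchronous write is acknowledged after the backup holds the record, while an asynchronous write is acknowledged first and shipped later.

```python
from collections import deque


class ReplicatedLog:
    """Toy primary/backup pair; names and behavior are illustrative only."""

    def __init__(self, synchronous: bool):
        self.synchronous = synchronous
        self.primary = []
        self.backup = []
        self.in_flight = deque()  # acknowledged at the primary, not yet at the backup

    def write(self, record: str) -> None:
        self.primary.append(record)
        if self.synchronous:
            # Acknowledge only once the backup also holds the record.
            self.backup.append(record)
        else:
            # Acknowledge immediately; the record reaches the backup later.
            self.in_flight.append(record)

    def drain(self, count: int) -> None:
        """Deliver up to `count` queued records to the backup."""
        for _ in range(min(count, len(self.in_flight))):
            self.backup.append(self.in_flight.popleft())

    def lost_if_primary_fails(self) -> list:
        """Records that exist only at the primary right now."""
        return list(self.in_flight)


trades = ["trade-1", "trade-2", "trade-3"]
sync_pair = ReplicatedLog(synchronous=True)
async_pair = ReplicatedLog(synchronous=False)
for trade in trades:
    sync_pair.write(trade)
    async_pair.write(trade)

async_pair.drain(1)  # only one record reaches the backup before the failure

print(sync_pair.lost_if_primary_fails())   # []
print(async_pair.lost_if_primary_fails())  # ['trade-2', 'trade-3']
```

The model leaves out the cost side of the bargain: in a real deployment the synchronous path makes every write wait on the link to the backup site, which is what ties zero data loss to limited distance.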
After September 11, business continuity and disaster recovery planning vendors are anxious to tout products and services that allow organizations to geographically disperse resources without losing data, sacrificing performance, or jeopardizing availability. Sun, for example, has developed a solution that lets customers separate systems by up to 200 kilometers, although Coyne acknowledges that at the extreme end of that limit “you still have challenges with latency.” EMC, for its part, touts a synchronous multi-hop topology—in which data is replicated among different systems—across distances of up to 30 miles.
Even the venerable mainframe—the platform that gave rise to the concepts of business continuity planning and disaster recovery planning in the first place—has been improved to suit the post-9/11 emphasis on geographical distribution. Last month, IBM Corp. announced a revamped Parallel Sysplex implementation that supports larger (up to 100 km) clusters of zSeries mainframe systems. Big Blue had previously supported a maximum distance of 40 km for Parallel Sysplex clusters. “That’s appropriate for sites that are pursuing business continuance strategies, and that will allow for continuous availability in the event of a disaster,” says Pete McCaffrey, program director of zSeries marketing with IBM. “There’s a real need for that, for more distance between [primary and secondary] sites, with some of our customers, and this raises the bar.”