In-Depth
Flirting with Disaster: How to Ensure Business Continuity
With the growth of distributed systems, companies aren't adequately planning for business continuity or disaster recovery across the enterprise. Is your firm guilty?
Picture this: One of your company’s major facilities is flooded, along with dozens of servers, routers and network switches. Hundreds of users, offline for several days, are screaming to get applications and e-mail back up and running. Parts shipments from major suppliers are frozen in the pipeline, and customer orders are backing up.
Although you initially predicted the main servers at that facility, all running Windows NT or 2000, would be up and running in a matter of hours, everything is going wrong. The replacement equipment takes two days to arrive and turns out to be a different brand of server hardware. Some data backups are unrecoverable, at least for the near term, and the DLT tape that half of your company stores its data on is incompatible with your 8mm drives. Some departments have earlier versions of some applications and later versions of others. And if that isn’t enough, a day’s worth of transactions are lost somewhere in the pipeline, uncaptured on the backup tapes you were able to recover.
If your mainframe went down in this scenario, you might well have things back up and running at a "hot site" in 24 hours or less. In simpler times, large mainframe sites maintained a one-for-one replacement strategy for recovering systems rapidly in the event of a disruption or disaster. Hardware, software and network connections could be replicated byte-by-byte at a hot site maintained by a disaster recovery service.
Ah, how the structure of IT has changed. Although the mainframe part of the organization can be replicated from a hot site, distributed and e-business platforms throughout the organization may be far more difficult to recover systematically. This fact is driving a greater emphasis on business impact analysis and planning, through which users and IT managers identify and prioritize the specific applications that must be recovered first in the event of a system failure.
In today’s enterprise, these applications often reside out on the network, on Unix, Windows NT/2000 and Linux systems.
Convergence Happens
Because of this shift, the disciplines of disaster recovery, business continuity, high availability and site hosting are converging, according to Todd Gordon, senior vice president and general manager for IBM Business Continuity Services. With the proliferation of e-business and distributed systems, the recovery window has grown shorter and shorter. "The cost of storage is coming down, storage area networks are making storage more readily available and companies are paying more attention to the total continuity picture," says Gordon. He also notes that the accepted recovery time for restoring a system a few years ago was 36 to 48 hours. Now the average is 12 hours or less. And with the rise of e-business, the recovery window may narrow to minutes.
The disaster recovery industry has shifted its focus accordingly, with vendors rebranding themselves as business continuity services in order to reach companies that run distributed, multi-platform networks. Vendors like IBM Business Continuity Services, which formerly specialized in mainframe recovery hot sites, now offer a range of options and services. Such services include mobile systems that can be trucked onsite, "workspace" centers that are wired for LAN/WAN access to PC and Unix servers, and specific application expertise such as SAP. Other leading vendors have also changed their service offerings along these lines, including Comdisco Inc., SunGard Recovery Services Inc. and HP Business Recovery Services from Hewlett-Packard Co.
Table 1: The High Cost of Failure
Even an hour of downtime can be extremely costly, depending on the application. Here are average hourly costs of a networked system failure for various application types.
Brokerage operations: $6,500,000
Credit card/sales authorization: $2,600,000
Pay-per-view: $150,000
TV home shopping: $113,000
Catalog sales: $90,000
Airline reservations: $89,500
Telephone ticket sales: $69,000
Package shipping: $28,000
ATM fees: $14,500
Source: Strategic Research Corp., 1999
Distributed Challenges
Distributed networks of systems––particularly Windows NT/2000 servers and, increasingly, Intel-based Linux servers––pose a vexing challenge to IT managers. A 1999 survey by the GartnerGroup found that while 85 percent of the largest companies have effective enterprisewide disaster recovery programs in place, fewer than half of these plans include networks and only a quarter cover PC LANs. Industry experts agree that companies aren’t adequately planning for business continuity or disaster recovery on their distributed and e-business networks. Along with fire, flood, earthquake and sabotage, these systems are particularly vulnerable to corrupted registry files, virus and hacker attacks, human error, failed disks, demand spikes and power outages.
The cost of such downtime is high for many industries. For example, a disabled airline reservation system or retail catalog sales operation stands to lose up to $90,000 an hour, according to calculations by Strategic Research Corp., shown in Table 1. Typical distributed network sites have a downtime cost of between $20,000 and $80,000 per hour.
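The arithmetic behind these figures is straightforward: expected loss is the hourly downtime cost multiplied by the hours of exposure. A minimal sketch in Python, using the Strategic Research figures from Table 1 and a hypothetical outage length:

```python
# Hypothetical illustration: expected outage loss, using the hourly
# downtime costs reported in Table 1 (Strategic Research Corp., 1999).
HOURLY_DOWNTIME_COST = {
    "brokerage operations": 6_500_000,
    "credit card/sales authorization": 2_600_000,
    "airline reservations": 89_500,
    "catalog sales": 90_000,
}

def outage_cost(application: str, hours_down: float) -> float:
    """Estimated loss: hourly downtime cost times hours offline."""
    return HOURLY_DOWNTIME_COST[application] * hours_down

# An 18-hour Windows NT/2000 restoration (the optimistic low end of
# Table 2) against a catalog sales operation:
print(f"${outage_cost('catalog sales', 18):,.0f}")  # $1,620,000
```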
Aside from technical and natural maladies, perhaps the greatest threat to business continuity comes from "lack of awareness in the executive suite, or even false optimism about the business’s state of availability," according to IBM’s Gordon. "As technologies have moved faster and faster, their state of availability has been reduced. There are many more connections and interactions that a disaster recovery plan hasn’t been designed to keep up [with]."
The two greatest areas of vulnerability are distributed and e-business networks, industry experts agree. Until recently PC servers weren’t seen as essential mission-critical platforms. Smaller companies that tend to rely more on PC server-based networks often aren’t focused on business continuity. Larger companies with comprehensive mainframe recovery programs are just beginning to get their arms around the distributed challenge. But more and more mission-critical applications––from ERP systems to e-mail servers––now run on Windows NT/2000 systems. Thus the cost of overlooking recovery for these systems is growing, as Table 2 shows.
At Hoffman-LaRoche, a major pharmaceuticals manufacturer, ERP functions are run on Unix servers, while human resources and payroll are spread across both Unix and Windows NT servers. "We can’t be down more than a day," says Bill Harloff, director of technical services and operations at Hoffman-LaRoche.
Many corporate applications, however, don’t run on a single platform like Windows. Rather they extend across mainframe, Unix and Windows systems. "It’s much easier if you have a single platform," says Donna Scott, vice president of software infrastructure for the GartnerGroup. "Let’s say you run everything on Windows, even though it’s distributed. Running SAP on six Windows boxes is complex, but not as complex [as it could be] since your database is all on one system."
Table 2: Putting a Plan in Place
As more and more mission-critical applications run on Windows NT/2000, it becomes critical to consider system recovery times on Windows. This table shows an "optimistic" Windows NT/2000 server recovery timeframe when a continuity plan is in place.
Problem recognition/decision: 2-4 hrs.
Hardware acquisition: 2-4 hrs.
OS/NOS installation: 4 hrs.
Application installation: 4 hrs.
Data restoration (if readily accessible): 2-6 hrs.
Troubleshoot/fix: 4-6 hrs.
Total Restoration Time: 18-28 hrs.
Source: IBM Business Continuity Services
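Read as a sum of step ranges, the table's bottom line falls out directly. A small sketch, with the step durations taken from the IBM figures above:

```python
# Step durations (hours) from the IBM Business Continuity Services
# table above, expressed as (low, high) ranges.
RECOVERY_STEPS = {
    "problem recognition/decision": (2, 4),
    "hardware acquisition": (2, 4),
    "OS/NOS installation": (4, 4),
    "application installation": (4, 4),
    "data restoration": (2, 6),
    "troubleshoot/fix": (4, 6),
}

low = sum(lo for lo, hi in RECOVERY_STEPS.values())
high = sum(hi for lo, hi in RECOVERY_STEPS.values())
print(f"Total restoration time: {low}-{high} hrs.")  # 18-28 hrs.
```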
Document and Plan
Recovering distributed server applications requires advance preparation––data can no longer be informally backed up on disks by technicians. Along with data, most networked systems require specific registries, IP addresses and application versions. That’s why extensive documentation of all system changes is crucial. Documentation includes configuration files and change control logs to track all applications, upgrades, patches and changes in network cards. Without documentation, it could take days, if not weeks, to get a server back up and running, according to Randy Middlebrooks, LAN recovery engineer for IBM’s Business Continuity Services.
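What such documentation looks like in practice will vary; the sketch below is one hypothetical approach, an append-only change-control log with illustrative field names, not any vendor's schema:

```python
import json
import time

# Hypothetical append-only change log for one server. Field names and
# values are illustrative only.
LOG_PATH = "change_log.jsonl"

def record_change(server, category, description, version=None):
    """Append one change-control entry: what changed, where, and when."""
    entry = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "server": server,          # e.g., hostname and static IP
        "category": category,      # "patch", "upgrade", "network card", ...
        "description": description,
        "version": version,
    }
    with open(LOG_PATH, "a") as log:
        log.write(json.dumps(entry) + "\n")

record_change("nt-erp-01 / 10.1.4.22", "patch",
              "Applied NT 4.0 Service Pack 6a", version="SP6a")
record_change("nt-erp-01 / 10.1.4.22", "network card",
              "Replaced NIC; new MAC registered with DHCP reservation")
```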
Industry experts also advise that recovery servers come from the same hardware vendor as primary servers. Without the same hardware, moving a system to another box requires reinstallation or restoration of identical versions of operating systems, applications, registries and data. "It’s a nightmare recovering servers on dissimilar hardware," says John Kisslinger, director of disaster recovery for AXA Financial Corp. "Every configuration is different."
Kisslinger notes that keeping up with changes to Windows NT configurations is far more difficult than upkeep on the mainframe. "We’re constantly changing the hot site," he says. "For the mainframe, we review our DASD and MIPS requirements once a year. But in the NT and server world, things change all the time." AXA addresses these changing requirements through rigorous and ongoing testing of the latest configurations at the company’s hot sites.
At Hoffman-LaRoche, IT administrators are working around this challenge by "tying the disaster recovery group closely to change management," according to Harloff. "When configuration changes are made to hardware and software, that changes the configuration of the recovery center. This is extremely important, especially since something is always being changed on some 20 different servers."
Tools and Techniques
Industry experts outline the most feasible approach to disaster recovery and business continuity as a combination of technologies: Data is backed up on tape for regular delivery off-site, while live data is replicated to electronic vaults. Tape backup provides general protection and archival storage against recent data losses; replication or mirroring puts the information back immediately.
Cost is a key consideration, and it rises steeply with the speed at which you need to recover your systems: the faster and broader the recovery, the costlier it will be. Because contracting for a hot site is expensive, it's only suitable for the most mission-critical, high-end applications, says Donna Scott, vice president of software infrastructure for the GartnerGroup. Likewise, clustering and mirroring two or more servers connected by a fiber-optic line ensures a hot-standby site that will be up and running in minutes, but it too should be reserved for the most critical applications in the business.
A more cost-effective option is to establish data replication to similar systems in other parts of the organization. For example, rather than contract for or maintain an entirely separate site for your IBM iSeries-based Domino server, your Chicago offices may already run a similar configuration for their day-to-day operations, which could be pressed into service as a backup system.
The more traditional method of disaster preparation consists of regular physical removal of backup tapes to alternate facilities. For e-business operations, however, even day-old data is too old, warns Todd Gordon, senior vice president and general manager for IBM Business Continuity Services. The best approach for most large companies is a combination of techniques, including nightly electronic vaulting, with remote journaling and replication during prime business hours. Disaster recovery vendors also provide mobile solutions better suited to business unit-level PC and Unix environments, as well as workspace centers with workstations and telephone lines.
At AXA Financial, most data is delivered to backup sites physically on tape. However, the company's Alliance Capital subsidiary requires a two-hour recovery window, so a mirroring solution replicates its data to a third-party hot site. "If we had a disaster, we would just break the connection, put the connection in the hot site and access the data that way," says John Kisslinger, AXA's director of disaster recovery.
Staples Inc. has adopted this type of integrated approach as well. Backup tapes are physically delivered to an off-site storage facility, and Staples has also implemented remote mirrored backup of data between its various systems, which support warehouse management, inventory replenishment, data marts and e-mail.
––J.M.
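The two-tier division of labor the sidebar describes can be sketched in a few lines: bulk data goes to a nightly vault copy, while a short list of critical files is replicated the moment it changes. Paths and file names here are hypothetical:

```python
import os
import shutil

# Hypothetical two-tier protection: nightly vaulting for everything,
# continuous replication for the handful of files that cannot be a
# day old. All paths are illustrative.
VAULT = "/mnt/offsite_vault"     # electronic vault, written nightly
MIRROR = "/mnt/hot_site_mirror"  # replica kept current in business hours
CRITICAL = {"orders.db", "transactions.log"}

def nightly_vault(source_dir):
    """Bulk copy of all data: the tape/vault tier of the plan."""
    for name in os.listdir(source_dir):
        shutil.copy2(os.path.join(source_dir, name), VAULT)

def replicate_if_critical(path):
    """Called on every write to a file: the mirroring tier."""
    if os.path.basename(path) in CRITICAL:
        shutil.copy2(path, MIRROR)
```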
Business Impact Analysis
The key part of the business continuity and disaster recovery planning process is communicating with user groups throughout the company to develop a business impact analysis. Users and IT departments at Hoffman-LaRoche have identified 10 mission-critical applications, running on 20 or so servers, that must be kept up and restored immediately. Mainframe continuity is no longer a direct concern since the company began outsourcing its mainframe operations in July 2000. In addition, procedures are in place to bring up the company’s diverse network of 120 HP-UX servers, 500 Windows NT servers and a number of Sun Solaris servers. Mission-critical applications running on the HP-UX and Windows NT servers include SAP, PeopleSoft, and FDA and DEA reporting interfaces; these are replicated at a hot site maintained by IBM Business Continuity Services. In total, according to Harloff, these systems support nearly 20,000 users across the continent. "In tests, we’ve been able to restore SAP within four to five hours," notes Harloff.
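The output of a business impact analysis is, in effect, a ranked recovery list. A hedged sketch of that prioritization, with illustrative applications and recovery time objectives (RTOs) rather than Hoffman-LaRoche's actual figures:

```python
# Hypothetical BIA output: each application with the longest outage the
# business units said they could tolerate (RTO, in hours).
applications = [
    {"name": "SAP ERP", "platform": "HP-UX", "rto_hours": 5},
    {"name": "PeopleSoft HR", "platform": "Windows NT", "rto_hours": 24},
    {"name": "FDA reporting interface", "platform": "HP-UX", "rto_hours": 8},
    {"name": "departmental file shares", "platform": "Windows NT", "rto_hours": 72},
]

# The recovery order falls out of the analysis: tightest RTO first.
for app in sorted(applications, key=lambda a: a["rto_hours"]):
    print(f"{app['rto_hours']:>3} hrs  {app['name']} ({app['platform']})")
```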
Cross-platform continuity and recovery is, however, proving to be a vexing issue for many companies. Many available tools don’t handle cross-platform backup and recovery, according to Gartner’s Scott. "Data replication solutions are specific to a database, file system, operating system or disk subsystem," she says, which requires multiple solutions to protect critical data. Recovering multiple Windows NT servers may involve numerous sites with their own flavors of backup, including differing tape media.
AXA Financial has been addressing business continuity issues through a series of regular meetings with all of its business units. To determine which applications need to be recovered fastest at AXA Financial, Kisslinger’s department conducted a business impact analysis across the entire organization. "We interviewed all of the business units to determine how long they could afford to be down without serious impacts to the business," he says. The company maintains a mainframe-based data center in New Jersey, but also supports Windows NT SNA Gateway servers and a Sun E10000 server running Solaris.
However, recovery timeframes vary on an application-by-application basis, rather than a platform basis, Kisslinger continues. "Currently, our mainframe recovery time objective is 48 hours. We also have a two-hour recovery timeframe for applications at our Manhattan headquarters. There’s a backup PC just waiting to be used." At its Charlotte, N.C. offices, the company contracts for a hot site for its imaging system, which runs on HP-UX-based servers.
At Staples Inc., the office superstore’s IT staff met with business units to develop business impact analyses around a number of its platforms. The Staples enterprise includes Windows NT servers along with HP, IBM AS/400 and Unix systems. "The biggest challenge is identifying what’s really critical and what’s not critical," says Cliff Leavitt, IS administrator. "Everybody wants to think that his or her application is critical." Leavitt says users get the final say. "This is not an IS-driven project," he says. "It’s driven from the outside in, and customers tell us what they want." Through these meetings with users, Leavitt says Staples sorted applications into those requiring high availability, those with a 48-hour window and those that could wait until after the disaster.
Timing is Everything
As more companies rely on e-business functions across all platforms, load balancing and continuous availability emerge as key business recovery strategies as well. A simple glitch in a Web server’s performance could quickly disable the online portion of the business. "The era of experimentation on the Web is over," says Gordon. "Now most companies have a revenue or cost-saving objective to their e-business sites."
E-business sites need to be developed from the ground up with quick recovery in mind, Gartner’s Scott says. This is accomplished through ongoing, continuous replication across two or more physical sites, which avoids a single point of failure, she explains. Some companies "operate on both sites actively, meaning that client traffic is routed, or load balanced, to any of the physical sites for complete processing." However, she notes, such an arrangement requires comprehensive, up-front planning of data replication and synchronization. In e-business environments, moreover, transactions in the pipeline at the moment of failure can be lost. Managers need to plan for methods to capture transaction data that may not have been saved on a previous backup. For example, a customer may have placed an order at 10:00 a.m. that was lost when your system was disabled at 10:10 a.m.; your backup tapes, however, may have last been updated the night before. A mirrored site for transaction data would close that gap.
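One way to close that 10:10 a.m. gap, sketched below under the assumption of a remote journal (simulated here with a local file), is to write each transaction to the mirror before acknowledging it, so nothing the customer sees as confirmed is ever newer than the replica:

```python
import json
import time

# Hypothetical remote journal: in practice this write would go over the
# wire to a mirrored site, not to a local file.
JOURNAL = "remote_transaction_journal.jsonl"

def accept_order(order_id, payload):
    """Journal first, acknowledge second: an order is confirmed to the
    customer only after the mirror holds a durable copy of it."""
    record = {"id": order_id, "ts": time.time(), "order": payload}
    with open(JOURNAL, "a") as mirror:
        mirror.write(json.dumps(record) + "\n")
        mirror.flush()  # a real system would also fsync or await an ack
    return {"status": "confirmed", "id": order_id}

print(accept_order("A-1042", {"sku": "TONER-3", "qty": 2}))
```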
The complexity of quickly restoring an e-business operation depends on whether the site is internal or faces the public, continuity experts concur. Networks accessed by internal employees or partners, such as intranets, extranets or VPNs, pose less of a challenge, since users can simply be directed to a secondary URL. VPN or intranet arrangements can also be brought up more seamlessly because branch offices typically have separate Internet access. For a business relying on public e-commerce, the best approach is a multi-homed site maintained through two ISPs; for sites using a single ISP, it is an IP address change to a secondary facility.
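The single-ISP fallback amounts to a health check plus a redirect. A minimal sketch of that logic, with placeholder URLs standing in for the primary site and its secondary facility:

```python
import urllib.request
import urllib.error

# Placeholder addresses for a primary site and its secondary facility.
PRIMARY = "http://www.example.com/"
SECONDARY = "http://backup.example.com/"

def reachable(url, timeout=5):
    """Crude health check: does the site answer an HTTP request?"""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 500
    except urllib.error.URLError:
        return False

# Direct traffic to whichever site answers. With a multi-homed site the
# routing layer makes this decision; with a single ISP it is the manual
# DNS/IP change described above.
active = PRIMARY if reachable(PRIMARY) else SECONDARY
print("Serving from:", active)
```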
In addition, the company needs to look at the business continuity plans of any supply chain or other online partners, says Gartner’s Scott. "Developers need to determine what steps to take if a partner site is down, and how it would disable the site." Just as important, Scott adds, is the appearance of availability. "If you buy something online, does it matter if something broke behind the scenes and they were fixing it, as long as you could complete the order?" Front-end, public-facing systems can be kept functioning at minimal cost while a back-end system is down, often a far cheaper alternative to replicating the entire system.