A Faster Disaster: Recovering in Internet Time
Your mainframe runs fine, but your intranet goes down, and your customer service department can’t get to account information on the mainframe for the next day and a half. Information is taken offline, critical orders are delayed, and inquiries back up. Customers are getting frustrated, and your company’s call center manager and half her staff are ready to walk out.
Disasters just aren’t what they used to be. I remember my first tour of a hardened IT disaster recovery facility. It was the ultimate big-iron house – a converted World War II-era tank factory built to withstand a direct hit from enemy bombers. Secure rooms were humming with S/390 and Hitachi mainframes. Right away, you felt safe from the varied calamities that could literally flatten an ordinary IT operation. For the site’s customers, the drill, which was simulated once or twice a year, was pretty straightforward – 1) hurricane hits; 2) backup tapes come in; 3) systems are brought back up and running within two to three days.
What used to be called disaster recovery now gets branded with a more upbeat and politically correct moniker – "business continuity." With that comes an implied mandate to bring back up not only the mainframe, but everything else – distributed systems, Web and e-commerce servers, networks and call centers.
As you extend applications out to the Web, your list of potential disasters grows to include system overloads, outages and denial-of-service attacks, which can be just as costly as fires or floods. If you’re in 24x7 mode like many other shops, your recovery window may be reduced to hours, not days. We now have more to fear from virtual disasters than physical disasters.
Industry experts warn that there is a disturbing lack of awareness about business continuity solutions in this e-business era. Even among the Fortune 1000, fewer than half of business continuity plans include networks, and only a quarter cover PC LANs, according to GartnerGroup.
The status of Web-to-host deployments in business continuity plans is murky, at best. While the back end is covered by traditional mainframe recovery methods, this may not be enough for the front end of the operation. Delays in bringing a system back online may be costly. Typical distributed network sites have a downtime cost of between $20,000 and $80,000 per hour, according to Strategic Research. The cost for sites supporting high-volume transactions could run into the hundreds of thousands, or even millions, of dollars.
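To put those hourly figures in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes, purely for illustration, that cost accrues linearly with time, and it applies the Strategic Research range to a day-and-a-half outage like the one in the opening scenario. The function name and the linear model are this sketch's own assumptions, not anything taken from the research cited above.

```python
def downtime_cost(hours, low_per_hour=20_000, high_per_hour=80_000):
    """Return a (low, high) dollar estimate for an outage lasting `hours`.

    Assumes cost grows linearly with outage length, which is a
    simplification; real losses may climb faster as customers defect.
    The default rates are the per-hour range quoted in the article.
    """
    if hours < 0:
        raise ValueError("hours must be non-negative")
    return hours * low_per_hour, hours * high_per_hour


# A day and a half of downtime is 36 hours.
low, high = downtime_cost(36)
print(f"36-hour outage: ${low:,} to ${high:,}")
# prints: 36-hour outage: $720,000 to $2,880,000
```

Even at the low end of the range, an outage of that length would run well into six figures under these assumptions.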
The Price of Downtime
There is a threshold for determining the point at which a business starts hurting as a result of a system going down. Mike Errity, Segment Executive for IBM Business Continuity and Recovery Services, observes that a site providing pure information to end users is "more akin to a legacy environment, allowing for 48 to 72 hours for recovery." However, he continues, "the moment we see customers doing transactions, either executing or booking orders through a supply chain system, that’s when we start working to develop less than a 24-hour recovery time objective."
Different types of systems have different priorities. While ERP has evolved into a critical e-business function, most ERP managers are still comfortable with a 24-hour recovery window. However, for straight-through e-commerce, many companies require "near-instantaneous failover," says Errity. Of course, the shorter the recovery window, the more the cost of services and equipment goes up. Such solutions require a greater degree of dedicated equipment and data mirroring, he says.
For those sites that have a base of internal users, a popular approach is to mirror data and applications to another site, and have a backup URL that users can transfer to in the event of an outage. End users need to be aware of the alternate URL ahead of time, Errity points out. "Re-advertising a new URL address would be the most intrusive way to accomplish a recovery exercise," he says. Of course, it would be impossible to provide an alternate URL for more public, customer-facing Web interfaces. Errity recommends maintaining two ISPs, where one could almost seamlessly pick up the demand if the other becomes unavailable.
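For internal applications, the backup-URL idea can be sketched in a few lines of Python. This is only an illustration of the fallback logic, not a description of any vendor's product: the host names are hypothetical, the list of addresses is assumed to be known to the client ahead of time, and the health check is passed in as a function so the selection logic can be tried without a live network.

```python
import urllib.error
import urllib.request


def first_available(urls, is_up):
    """Return the first URL in `urls` that `is_up` reports as reachable.

    Returns None when every site, primary and mirrors alike, is down.
    """
    for url in urls:
        if is_up(url):
            return url
    return None


def http_probe(url, timeout=3.0):
    """One possible health check: True if the URL answers with a 2xx/3xx."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, OSError, ValueError):
        return False


# Hypothetical addresses: a primary site and its mirror.
SITES = [
    "http://accounts.example.com/",
    "http://accounts-backup.example.com/",
]

# Simulate an outage of the primary instead of touching the network.
down = {"http://accounts.example.com/"}
print(first_available(SITES, is_up=lambda url: url not in down))
# prints: http://accounts-backup.example.com/
```

In practice `http_probe` could be passed as `is_up`. The sketch only works because users, or their client software, already hold the alternate address, which is exactly the precondition Errity describes.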
Adding Web servers, Web-to-host middleware and networks to your business continuity planning calls for a new set of skills and approaches. You may be relying heavily on a LAN or WAN running on Windows NT servers. It may be difficult to secure a recovery system from the same vendor as your primary production system. Synchronizing and prioritizing recovery also adds complexity, says Dana Scott, Research Director at GartnerGroup. "In an organization that’s got hundreds of UNIX and NT servers, prioritizing the recovery is a huge effort. … You need to set up a priority scheme to recover those in the sequence they need to be recovered."
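One way to picture such a priority scheme is as an ordering problem: each system has a recovery time objective, and some systems cannot come back until others are running. The Python sketch below is a minimal illustration under those assumptions. The system names, the hour figures and the dependency map are all hypothetical, and the approach (a dependency-aware sort that restores the tightest deadline first among systems that are ready) is just one simple option, not a method attributed to GartnerGroup.

```python
import heapq


def recovery_order(rto_hours, depends_on):
    """Return system names in the order they should be recovered.

    rto_hours:  dict mapping system name -> recovery time objective (hours)
    depends_on: dict mapping system name -> set of systems that must be
                recovered before it
    Among systems whose prerequisites are all up, the one with the
    shortest objective is restored first.
    """
    waiting = {name: set(depends_on.get(name, ())) for name in rto_hours}
    dependents = {name: [] for name in rto_hours}
    for name, prereqs in waiting.items():
        for prereq in prereqs:
            dependents[prereq].append(name)

    # Min-heap of (objective, name) for systems with nothing left to wait on.
    ready = [(rto_hours[n], n) for n, prereqs in waiting.items() if not prereqs]
    heapq.heapify(ready)

    order = []
    while ready:
        _, name = heapq.heappop(ready)
        order.append(name)
        for child in dependents[name]:
            waiting[child].discard(name)
            if not waiting[child]:
                heapq.heappush(ready, (rto_hours[child], child))

    if len(order) != len(rto_hours):
        raise ValueError("circular dependency between systems")
    return order


# Hypothetical inventory for illustration only.
rto = {
    "network-core": 2,
    "mainframe": 24,
    "web-to-host-gateway": 4,
    "order-entry-web": 1,
    "intranet-portal": 48,
}
needs = {
    "web-to-host-gateway": {"mainframe", "network-core"},
    "order-entry-web": {"web-to-host-gateway", "network-core"},
    "intranet-portal": {"network-core"},
}
print(recovery_order(rto, needs))
# prints: ['network-core', 'mainframe', 'web-to-host-gateway',
#          'order-entry-web', 'intranet-portal']
```

Note that the sketch ranks each system by its own objective only. A fuller scheme would pass the tightest deadline down to everything that system depends on, so that a mainframe feeding a one-hour order-entry site is not left waiting behind less urgent work.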
About the Author: Joseph McKendrick is an independent consultant and author, specializing in surveys, technology research and white papers. He can be reached at joemck@aol.com.