Focus On: Planning for the Worst

Mention disaster recovery or contingency planning, and most IS managers envision traditional smoke-and-rubble events such as fires, floods and hurricanes. The long-standing solution? Clean up the mess and bring on the backup tapes.

But the increasing criticality of AS/400 applications and data to business operations is changing the face of disaster recovery as we push into the next millennium. New technical issues, increased business expectations, and expanded service offerings are all forcing AS/400 managers to revisit their existing contingency plans. For midrange sites that never crafted a disaster recovery strategy, the pressure to get the job done has never been greater.

"The need for continuity and the need for continuous operations have increased because companies are more dependent on information systems than ever before," says Tony Martinez, general manager for IBM Global Business Recovery Services (Sterling Forest, NY).

Added to these pressures is the expanding spectrum of disasters. A hacker who enters a system without authorization or a virus that disables a critical application creates a crisis every bit as disruptive as a flood or earthquake.

Martinez points out that during a disaster the source of the problem matters little; only the solution does. "It could be a tongue-studded teenager putting out a macro virus that - as with one of our customers - took down 11,000 devices across six locations in four states in three hours. The CIO that I saw at two in the morning didn't say 'help me get rid of this computer virus.' He said 'help me out of this disaster,'" he explains.


Part of the responsibility for these changes comes from the AS/400 itself. The growth and maturity of the AS/400 product line means that the platform can scale to create the largest of networks or serve as a true enterprise server. The power of these consolidated machines means that organizations are using them to run mission-critical, enterprise-wide applications.

Technical sophistication can also increase system vulnerability. The AS/400's communications capabilities have allowed many sites to link multiple machines into distributed networks. Geographically distributed systems can simplify an organization's disaster recovery strategy by providing a backup system to a downed site. But logically distributed systems in the same geographic area dramatically increase the vulnerability of the AS/400, because all systems are subject to the same disaster, says Tom Huntington, VP of technical services for Help/Systems (Minnetonka, MN).

Even state-of-the-art applications such as electronic business demand extensive contingency planning. Companies that make ordering or other operations available to partners and customers online cannot afford to have a critical system go down, because system downtime means lost sales and lost business opportunities. A well-designed contingency plan can keep a minor disruption from becoming a major catastrophe.

Allison Manufacturing (Albemarle, N.C.) had prepared for the worst. When a flash flood last summer submerged its headquarters in nine feet of water, washing out its screen printing operation and its Model 620 AS/400, the company activated its disaster recovery plan.

An alternate IS site was established in a nearby building, and IBM provided a new, configured AS/400 by the following afternoon. It took another day to address TCP/IP communications issues and the unceasing requests of insurance adjusters and salvage brokers, but Allison Manufacturing was up and running again after the weekend, losing only three business days.

The AS/400's technical sophistication makes it an indispensable business resource, explains Dan Strout, director of distributed systems for Comdisco (Rosemont, Ill.). "The impact of an outage is much more widespread, and you have to protect against that," he says.

Great Expectations

Technical changes represent only a piece of the changing disaster recovery market, however. IS managers and business executives are driving the need for contingency planning with their expectations for continuous system availability.

In the AS/400 environment, downtime has become less acceptable, even for backup operations. Further, more managers expect a recovery to take hours, rather than the previously acceptable 48 to 72 hours, according to Larry Henderson, senior VP of operations for SunGard Recovery Services (Philadelphia, PA).

The bad news in all this is that a shorter recovery window increases the cost of a contingency planning strategy. The good news is that AS/400 managers are more effectively explaining the balance between cost and service level to senior management.

"IS managers are doing a better job now communicating to their senior management that technology provides a solution to meet any service level requirement that the business has, but that there's a price tag attached," says Kathleen A. Kobal, contingency planning specialist for Amelia Systems, Inc. (DePere, Wis.).

"Businesses committed to due diligence can conduct an impact assessment to identify the maximum time a process can be disabled," she says. "They benefit by moving their pre-disaster expense from hot standby or hot site contracts into a post-disaster expense, which is usually an insured loss."

The final decision on what type of service to contract rests with senior managers, not the IS staff. "The only way an IS manager can reduce those requirements is if the business reduces the service level requirements. That message is finally hitting home. That's a major change from a business perspective," Kobal says.

At Kraft Foods (Rye Brook, NY), a network of approximately 60 AS/400s is protected on the basis of each machine's criticality, says Chuck Roberts, manager of information technology. Some 45 of the machines, which support the manufacturing process, are under hot site contract because the manufacturing recovery window is 24 hours. Twelve other machines are covered by contracts that promise replacement machines within seven days because their outage can be tolerated for longer periods, he says.

IS and senior managers recently decided to update the corporate disaster recovery plan. Kraft hired the professional services arm of SunGard to handle routine tasks such as restoring the operating system. In the event of a disaster, internal IS members will concentrate on the more sensitive areas of applications and the end-user interface.

A Menu of Services

Disaster recovery and contingency planning vendors are responding to the technical and business changes in the AS/400 market with an expanded range of services and products.

High availability solutions, marketed by software vendors such as Lakeview and Vision Solutions (see sidebar) as well as recovery vendors such as SunGard, Comdisco and IBM, are increasingly popular. These can include RAID disks, mirrored systems, and vaulting, a service that instantly and transparently switches over critical applications running on a company's servers to servers at the vendor site.

SunGard offers a technology called "hot and spinning DASD," which keeps the OS/400 operating system up and running at its hot site in case a company's system goes down. Henderson estimates that using the service can save a company 4 to 24 hours in the recovery process.

For companies engaged in e-business, IBM offers a logical networking disaster recovery service that automatically reroutes traffic over the Internet to a backup system, either at a second customer location or an IBM site. The disruption is minimal and the transfer is transparent to users, Martinez says.

Even Year 2000 issues, which can hardly be considered unplanned outages, are included in some vendors' service offerings. Although an outage caused by a Y2K problem is not one that could be resolved at a hot site ("If it doesn't work at their site, it won't work at ours," says Martinez), vendors with a core competency in testing are allowing customers to test Y2K applications in their labs.

Getting Started

With so many offerings, how should an IS manager revise or create an effective contingency plan? First, experts say, determine your needs. Which applications, data and systems will be most critical in case of an outage? "There is only one thing you can't buy from anyone else and that's your business data," says Huntington of Help/Systems.

After identifying the key resources, you must consider how long you can afford to be without them. Can you afford to lose transactions during an outage? Do you have a manual system that could operate for a short time in case of disaster?

Second, assess the cost of the protection you'd like to have. How much will senior management pay to protect these resources? How long can you wait to get back up and running? "The business needs to understand that their disaster recovery service level drives the technology solution," says Kobal of Amelia Systems.

Third, talk to vendors. Most vendors offer Web sites, toll-free numbers and partners as starting points.
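For readers who want to see the first two steps worked through, here is a minimal Python sketch that matches each system to the slowest (and presumably least expensive) service level that still restores it within the outage the business can tolerate. The 24-hour hot site window and the seven-day replacement contract echo the Kraft example above; the one-hour figure for a mirrored system, the assumption that slower tiers cost less, and the system names are hypothetical and used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    recovery_hours: float  # assumed time needed to restore service


# Ordered from slowest (assumed cheapest) to fastest (assumed most expensive).
# The one-hour figure for a mirrored system is an illustrative assumption.
TIERS = [
    Tier("replacement machine within seven days", 7 * 24),
    Tier("hot site", 24),
    Tier("mirrored high availability system", 1),
]


def choose_tier(max_outage_hours):
    """Return the slowest tier that still recovers within the tolerable outage."""
    for tier in TIERS:
        if tier.recovery_hours <= max_outage_hours:
            return tier
    # No tier is fast enough, so fall back to the fastest one available.
    return TIERS[-1]


def plan(systems):
    """Map (name, tolerable outage in hours) pairs to tiers, most urgent first."""
    ordered = sorted(systems, key=lambda item: item[1])
    return [(name, choose_tier(hours).name) for name, hours in ordered]


if __name__ == "__main__":
    # Hypothetical systems and tolerable outages, in hours.
    systems = [("payroll", 240), ("order entry", 4), ("manufacturing", 24)]
    for name, tier in plan(systems):
        print(f"{name}: {tier}")
```

Tightening a tolerable outage in the list moves that system to a faster tier, which is the trade-off between cost and service level that senior management has to settle.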

At Allison Manufacturing, where flood waters washed away an AS/400, the extensive contingency planning paid off. "You learn an awful lot by having to go through something like this," admits Glenn Wood, VP of MIS. The challenge of reconciling two versions of TCP/IP convinced the staff to stay even more current on communications issues. The company also now has a plan for an alternate site in nearby Charlotte should the next disaster be more geographically widespread.

But by having a plan in place and executing it with patience and confidence, the team succeeded in recovering quickly. "You need everything in your hard copy plan," Wood says. "There's no time for planning once the disaster hits."