New Challenges for Disaster Recovery

When the 1994 Northridge earthquake shook the buildings at California State University, the campus's IBM mainframe and Digital VAX system went down. The VAX was never revived, and the campus IT department eventually had to jury-rig a remote admissions and student records system that ran on a mainframe at a Cal State campus in Fresno.

The university has not forgotten the lessons learned from the quake. The IT department lives under the motto "never again," says Marc Montemorra, senior networking and systems analyst at Cal State’s Northridge campus. "When the earthquake hit, it became apparent that the California State University system needed more disaster recovery preparedness."

Most of the campus's data had been centralized on the mainframe and midrange systems. Now, campus applications are distributed between the mainframe and a burgeoning array of Windows NT servers. As part of its disaster management effort, the IT center recently acquired a 2-terabyte tape library from Tandberg Data (www.tandberg.com), supported by ARCserve from Computer Associates Int’l Inc. The configuration is expected to capture and consolidate data on a campuswide basis to better prepare the university for disaster recovery, Montemorra explains. Also, a patchwork of drives from Quantum Corp. (www.quantum.com) and other vendors is deployed to support the backup of campus data.

The university is also planning to support data mirroring between campus sites over a wide-area synchronous optical network. The network is part of a vBNS (Very High Performance Backbone Network Service) connecting California universities. "That pipe can be turned on with a moment's notice," Montemorra says.

Cal State learned a difficult lesson that much of the IT world has yet to learn: Most organizations are ill-prepared for disasters. More companies are staking their futures on technology, with initiatives such as e-commerce, ERP, and messaging, but most of them don't have a viable plan to bring their systems back up in the event of disaster.

Although disaster generally evokes a dramatic image of fire, flood, earthquake or sabotage, it can occur simply as corrupted registry files, failed disks, or an extended power outage. A survey by GartnerGroup (www.gartner.com) found that 15 percent of the largest companies don’t have an effective enterprisewide disaster recovery program in place.

It gets worse. Of the 85 percent of large companies with an enterprisewide disaster recovery plan in place, fewer than half of these plans include networks, and only one-fourth cover PC LANs. The percentages may grow as reliance on e-commerce and ERP grows, but industry experts agree that, as a whole, there is a disturbing lack of awareness about disaster recovery solutions.

Quantum Corp. (www.quantum.com) is spearheading a $4 million advertising campaign to drum up awareness about disaster recovery. Twenty other companies in the storage and recovery business have joined Quantum's "Prove It" initiative (www.DLTtape.com/ProveIt), which aims at providing information and support on disaster recovery issues to midsized businesses.

"There was a time when disaster recovery was relegated to one corner of the computer room," says Donna Scott, vice president of software infrastructure at GartnerGroup. Gone are the days of traditional disaster recovery, which was designed to get a central-site host back up and running, she says. Today's business resumption plans require end-to-end recovery strategies that include distributed systems and data. In addition, many IT departments are under pressure to maintain 24x7 availability. In Windows NT environments, for example, applications such as Microsoft Exchange Server and IIS are considered mission-critical, and loss of e-mail and messaging services -- even for a few hours -- could be costly to a company's operations.

In the financial and retail worlds, the cost of downtime can be high. For example, a system failure at a brokerage operation could cost more than $6 million per hour, according to calculations by Strategic Research Corp. (www.sresearch.com). An airline reservation system or a retail catalog sales operation could lose up to $90,000 per hour of downtime. Many of these types of businesses use mainframes or Unix systems, but Strategic Research notes that Windows NT is coming on strong as a mission-critical application platform. Typical distributed network sites have a downtime cost of about $20,000 to $80,000 per hour, according to Strategic Research.

Ten years ago, the accepted recovery time for restoring a system was 24 to 36 hours, says Gregg Dockery, PC LAN specialist at SunGard Recovery Services Inc. (www.recovery.sungard.com). "Now, it's two or three hours," he says.
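To see what the shift in recovery time means in dollars, the Strategic Research range for a typical distributed network site can be multiplied by the old and new recovery windows Dockery cites. The following is a rough sketch only: it assumes losses accrue at a flat hourly rate for the whole outage, which is a simplification not stated by either source, and the names DISTRIBUTED_SITE_HOURLY and outage_cost are invented for illustration.

```python
# Hourly downtime cost range (USD) for a typical distributed network site,
# as quoted from Strategic Research in the article.
DISTRIBUTED_SITE_HOURLY = (20_000, 80_000)


def outage_cost(hourly_range, hours_range):
    """Return a (low, high) loss estimate for one outage.

    Assumption (ours, not the sources'): cost accrues linearly with time.
    """
    low_rate, high_rate = hourly_range
    low_hours, high_hours = hours_range
    return low_rate * low_hours, high_rate * high_hours


# Recovery window accepted ten years ago: 24 to 36 hours.
print(outage_cost(DISTRIBUTED_SITE_HOURLY, (24, 36)))  # (480000, 2880000)

# Recovery window expected now: two or three hours.
print(outage_cost(DISTRIBUTED_SITE_HOURLY, (2, 3)))    # (40000, 240000)
```

Under that flat-rate assumption, an outage that once ran into the hundreds of thousands or millions of dollars is held to tens or low hundreds of thousands when recovery takes hours instead of a day or more.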

Mirror, Mirror, on the WAN

In a world growing accustomed to conducting near-instant transactions, data mirroring and shadowing techniques are becoming a necessity. GartnerGroup estimates that the use of data replication will grow by more than 50 percent annually over the next three years. Within large organizations, use of these techniques will grow from less than 5 percent today to 25 percent by 2001, and 60 percent by 2004. The growth is in part fueled by the increased availability of fiber optic networks in major metropolitan areas.

The traditional method of disaster preparation, employed by IT managers for decades, consists of a nightly, physical removal of storage tapes to alternate facilities. But in this era of instant fulfillment, the faster approach is to electronically replicate and send information over a wide-area network to another location via data mirroring and shadowing. Since this can be accomplished in close to real time, data can be back online far sooner than with traditional disaster recovery programs.

For the best results, data should be replicated to a dedicated off-site facility or to a geographically separated office with a similar systems configuration. "One of the more cost-effective ways of backing up is to find a reciprocating type agreement with another similar configuration within your organization," suggests Randy Settergren, manager for product marketing at Storage Technology Corp. (www.storagetek.com).

Such a strategy is being used by the William Morris Agency, a Los Angeles-based talent agency. William Morris recently deployed a network that backs up three sites running Windows NT applications and data. Each site maintains copies of all files and data from the other two, says Michael Naud, network administrator at William Morris. "Those different directories are replicated across the frame-relay network WAN to all our different sites," Naud says. The company is employing SureSync from Software Pursuits (www.softwarepursuits.com) to replicate data at the three sites.

Mirrored data is stored on drives attached to the built-in RAID controllers of the firm's Compaq ProLiant file servers, Naud says. "In the past, all of our Unix servers were centralized in the data center," Naud explains. "When we implemented NT, we distributed our file servers to different sites. If anything happened, chances are greater that one of them might still be usable during the day, or in a pinch, everybody could be moved to another file server."
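The arrangement Naud describes, in which every site holds a copy of every other site's directories, can be illustrated with a small one-way mirroring routine. This is a minimal sketch and not a description of how SureSync or any commercial replicator works. The directory layout (each site root holding one subdirectory per site name), the function names mirror_tree and replicate_all, and the "newer or different size" change test are all assumptions made for the example.

```python
from __future__ import annotations

import shutil
from pathlib import Path


def mirror_tree(source: Path, replica: Path) -> list[str]:
    """Copy files that are new or changed in source into replica.

    One-way only: files deleted from source are left in the replica.
    Returns the relative paths that were copied.
    """
    copied = []
    for src_file in sorted(source.rglob("*")):
        if not src_file.is_file():
            continue
        rel = src_file.relative_to(source)
        dst_file = replica / rel
        src_stat = src_file.stat()
        if dst_file.exists():
            dst_stat = dst_file.stat()
            # Treat a file as unchanged if size matches and the replica
            # is at least as recent as the source.
            if (dst_stat.st_size == src_stat.st_size
                    and dst_stat.st_mtime >= src_stat.st_mtime):
                continue
        dst_file.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src_file, dst_file)  # copy2 preserves modification time
        copied.append(rel.as_posix())
    return copied


def replicate_all(site_roots: dict[str, Path]) -> None:
    """Give every site a copy of every other site's own directory.

    Hypothetical layout: site_roots[name] / name holds that site's data,
    and site_roots[other] / name holds the replica kept at another site.
    """
    for name, root in site_roots.items():
        for other_name, other_root in site_roots.items():
            if other_name != name:
                mirror_tree(root / name, other_root / name)
```

A real deployment over a frame-relay WAN would also need to handle deletions, open files, conflicts and interrupted transfers, which this sketch ignores.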

Some larger companies are contracting with a hotsite provider that guides ongoing testing and furnishes the necessary equipment in the event of an emergency. "While it's cheaper to do a reciprocating backup, there's the issue of locking out resources," says Settergren. "The other site may not be able to spare its equipment. With a hotsite, equipment is more available, and hence is going to cost more."

Staples Inc., the Westborough, Mass.-based office supplies chain, has been working with IBM Business Recovery Services (www.brs.ibm.com) to develop a disaster recovery plan that covers a variety of systems. The mix includes Windows NT running on Hewlett-Packard and IBM Netfinity servers, AS/400s, and Unix servers. "We've identified all of those different platforms as critical and as part of our business continuity plans," says Cliff Leavitt, IS administrator at Staples.

Backup tapes are physically delivered to an off-site storage facility, says Leavitt. Staples has also implemented EMC Enterprise Storage from EMC Corp. (www.emc.com) to provide remotely mirrored backup copies of data between various systems, which support warehouse management, inventory replenishment, data marts and e-mail.

Industry experts say the most feasible approach to disaster recovery and business continuity is a combination of technologies: Data should be backed up on tape for regular delivery off-site, and live data should be replicated to electronic vaults. "Tape back-up technologies provide general protection and archival storage of information for recent losses of data," says Chris Midgley, chief technology officer at Network Integrity (www.netint.com). "Replicators or mirroring allow you to have the information immediately."

The most robust, and expensive, approach is a cluster of two or more servers connected by a fiber optic line. Such a configuration enables companies to maintain a standby site that duplicates the applications and data from the primary site.

NT Challenges

Recovering Windows NT server applications -- still an afterthought in many IT shops -- is a complicated process that requires advance preparation. "In the old days, you could back up something to a disk, and hide it in your desk drawer," says SunGard's Dockery. "These days, saving data isn't enough. If you come into a recovery situation, you're going to be wondering about your domain name server, IP address scheme, and things of that nature."

Extensive documentation of all system changes is crucial for eventual restoration, explains Randy Middlebrooks, LAN recovery engineer at IBM's Business Recovery Services. "If you do not do a thorough job in documenting the configuration file, or the definitions of those applications, you could be looking at days, if not weeks, to get a server back up and running."

Dockery advises meticulous change-control logs to track all applications, upgrades, patches and changes in network cards. Other experts advise that recovery servers come from the same hardware vendor as the primary server. "That's a challenge if your disaster recovery service has HP servers and you have Compaq servers," says GartnerGroup's Scott. "You need to have very similar configurations, including that same brand of hardware."
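A change-control log of the kind Dockery recommends does not need elaborate tooling. The sketch below keeps one JSON record per line in a plain file and can list everything that has changed on a given server. The field names, the category labels and the helper functions are illustrative assumptions, not a format prescribed by SunGard or anyone else quoted here.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ChangeRecord:
    server: str        # name of the server that was changed
    changed_on: str    # ISO date, e.g. "1999-03-02"
    category: str      # e.g. "application", "upgrade", "patch", "network_card"
    description: str   # what was done, in enough detail to repeat it


def append_change(log_path, record):
    """Append one change to the log, one JSON object per line."""
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(asdict(record)) + "\n")


def history_for(log_path, server):
    """Return every recorded change for one server, in the order logged."""
    with open(log_path, encoding="utf-8") as log:
        records = [ChangeRecord(**json.loads(line)) for line in log if line.strip()]
    return [r for r in records if r.server == server]
```

Because the log is append-only plain text, a copy can travel off-site with the backup tapes and be read at a recovery site without the original server.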

Without a hardware match, moving a system to another box can require reinstallation of the operating system, reinstallation of all applications that edit the registry, restoration of the subset of the registry that contains user account information, and then restoration of data, Middlebrooks says.

End-users report frustration with the complexity of NT restorations. "With Windows NT, you have to make sure all of your programs and the registry are backed up," says Naud of the William Morris Agency. "It's just a much more complicated beast on so many different levels than Unix. And there aren't a whole lot of tools."

Many data replication solutions are specific to a database, file system, operating system or disk subsystem, and require multiple solutions to protect critical data. Recovering multiple Windows NT servers may involve numerous sites with their own flavors of backup. "Fifteen NT servers might have 15 different ways and media types and mechanisms to back them up," says Settergren. "Imagine the problems of going to a hotsite to restore a mishmash of floppy drives, 4 mm tapes, 8 mm tapes and so forth. It would be impossible to recover at the enterprise level."

New Technologies

New storage and network technologies -- including Fibre Channel and Storage Area Networks -- are creating more seamless remote vaulting capabilities for backup and restore operations across all platforms.

"For the first time, you can locate your storage devices at remote backup sites, connected by high-speed lines," Settergren says. "A SAN will allow a more common point of backup and a common interface on how you manage those backups." A SAN configuration automates the backup process to a greater degree, as well.

Fibre Channel technology helps remove distance as an issue in remote backup and restore configurations. Fibre Channel will support long-distance data transfers up to six miles, while SCSI-attached devices are limited to 25 meters. The increased distance can help alleviate floor space problems for larger backup systems.

But many users are unsure of when to adopt Fibre Channel or SAN technology for backup and recovery functions. Naud is holding off on Fibre Channel since it "requires a hub, and there are no direct network-attached devices yet." Presently, the William Morris Agency has "a huge investment in our RAID storage," he adds. "It doesn't make any sense to pull out everything and start from scratch."

Cal State's Montemorra has been looking into SAN and Fibre Channel, but finds their costs to be prohibitive for a public-sector university budget. "We'd like to use those technologies for on-campus arrangements, where we could put disk farms in various buildings," he notes. "That way, we won't have everything in one basket."

At Staples, these technologies are being considered as a way to manage high availability and vaulting, Leavitt states. But it will be some time before the company decides on implementation. "If it's any value to us, we'll grab it. But not because it's the flavor of the month."