Masters Of Disaster

When The Computing Lights Go Out They Make Sure Your Business Goes On Shining

At a point in computing history when IT is the engine of change and the competitive driver for global businesses, it's difficult to recall a time when system disruptions, outages, or even natural disasters weren't really that big a deal. Only naive users panicked when the system crashed. Back room downtime was a dirty little secret -- a skeleton in the glass house closet. For those who remember when downtime was planned, the casual motto seemed to be: "What? Me Worry?"

A month ago Hurricane Floyd ripped up parts of the East Coast. And those who take notes will recall the 5.0 earthquake that shook San Francisco in August during HP World. But I don't want to get too melodramatic. Still, the days of pulling backups out of your drawer to recover data are over. Without effective disaster recovery plans, companies of all sizes are at risk of losing not only data, but also sales and customers. If your company is a public one, especially of the e-commerce kind, then investor confidence and stock value are also at stake.

If you need more evidence, check out the DLTtape ProveIt Index for Disaster Readiness ( /proveit), which found that:

  • 72 percent of companies are in the danger zone or inadequately prepared for disaster recovery.
  • 82 percent of companies do not have a tested disaster recovery plan.
  • 75 percent of companies do not understand the financial impact of a computer outage.

Those in the 24-hour world of financials, manufacturing and health industries have always had the need to recover very quickly from outages. Now, Web-based e-businesses of all kinds find themselves exposed to the E-Bay syndrome. "The systems and infrastructures required are markedly different than what was being put in place three or four years ago," according to Les Wilson, HP E-Services Systems and Solutions Manager for Business Critical Computing. "And customer and consumer expectations have evolved to a point now where they will not tolerate downtime, especially from an Internet site or a dot-com company."

Additional studies conducted by the Contingency Planning Association Research and Strategic Research Corp. estimate that unprepared companies stand to lose from $18,000 to $6.4 million if an IT-disabling disaster strikes. In other words, improving High-Availability (HA) has become a necessity.

Take note of the average cost per hour of downtime in the following industries:

  • $6.5 million brokerage operation
  • $2.6 million credit card processing
  • $150,000 pay per view
  • $93,000 TV home shopping
  • $28,000 package shipping
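
To put those hourly figures in perspective, here is a minimal Python sketch that scales them to an outage of a given length. The dollar amounts are the ones listed above; the function name and the assumption that losses grow linearly with outage time are ours, not the studies'.

```python
# Average hourly downtime cost figures quoted in the article (US dollars).
HOURLY_DOWNTIME_COST = {
    "brokerage operation": 6_500_000,
    "credit card processing": 2_600_000,
    "pay per view": 150_000,
    "TV home shopping": 93_000,
    "package shipping": 28_000,
}


def outage_cost(industry: str, hours: float) -> float:
    """Estimate the cost of an outage.

    Assumes cost scales linearly with duration, which is a simplification
    made for this sketch rather than a claim from the cited figures.
    """
    if hours < 0:
        raise ValueError("hours must be non-negative")
    return HOURLY_DOWNTIME_COST[industry] * hours


if __name__ == "__main__":
    # Example: what a four-hour outage would cost in each industry.
    for industry in HOURLY_DOWNTIME_COST:
        print(f"{industry}: ${outage_cost(industry, 4):,.0f} for a 4-hour outage")
```

By this rough measure, even the least exposed industry on the list loses six figures in a four-hour outage.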

With stakes that high, if you're down, you're dead. Failure is not an option. But the savviest of IT managers don't stop at disaster recovery. They don't think of high-availability as a luxury. They talk about business continuity. They know that downtime is not only bad news, it makes news. These masters of disaster -- if you will -- know that just about any IT disruption can hurt their users, their customers and their suppliers.

That's why this month, the editorial staff at HP Professional interviewed three "masters of disaster" who discuss their business continuity plans and philosophies.

What does it take to be a master of disaster? Smart instincts, an emphasis on full backups, a penchant for redundancy and just plain common sense were some of the common themes among our trio: Rory Hammond, Manager of Information Systems at Menlo Logistics, Jose Rivera, Technical Services Operations Manager of Aearo Company, and Scott Womer, Systems Engineer of Atmos Energy. They have dealt with their share of power outages, construction debris and just plain human error (which accounts for about 80% of disruptions).

They have also been fortunate. None in our group have experienced a true natural disaster (defined as an act of God). But that's why they are the masters. They aren't waiting for the next flood, bombing or earthquake. Like minutemen they have their systems and configurations available to continue -- in a minute. When the chips are down, you call their name.

Editor's Note: HP Professional thanks Gayle Mestel of CCS Public Relations (Carlsbad, Calif.) for her assistance. We also thank Rory Hammond, Jose Rivera, and Scott Womer for their willingness to speak with us.

Name: Jose Rivera, Technical Services, Operations Manager
Company: Aearo Company (Indianapolis, Ind.)
Business: Manufacturer of safety equipment for industrial workers, health care professionals and consumers
IT Assets: SAP Enterprise Resource Planning System with an Informix 7.2 RDBMS; 4 HP 9000 G70 and T500 servers running HP-UX 10.2
Recovery Window: 36 hours
Recovery Drills: Yes. 2x/year
BC Philosophy: "Disaster Recovery is a requirement."

Aearo Company (Indianapolis, Ind.) manufactures safety equipment for industrial workers and health care professionals. "We make specialized foam pads used for energy absorbing -- that is, for dampening noises. We also make safety equipment like safety glasses, helmets, goggles, respirators," says Jose Rivera, Aearo's Technical Services, Operations Manager. Like most companies, Aearo has multiple sites. In addition to the Southbridge, Mass. headquarters, there are sites in Delaware, Indiana, Massachusetts, Oklahoma and the United Kingdom.

"Our entire corporation is run on HP 9000 systems and an HP 3000 [which is not on active support]. They cover every aspect of running our company. We have a system that lets us see how the company is doing as a whole. But the drawback of a centrally located system is that it requires some sort of disaster recovery or business continuity plan."

Rivera's goal is to have Aearo's systems back online and accessible to users within 36 hours. Aearo's nine IT sites are linked together by a SAP Enterprise Resource Planning system over a frame relay WAN that "we strive to keep online 24 hours a day, seven days a week," he explains. And that's backed by an Informix RDBMS [version 7.24]. That central system runs on an HP 9000 T500 connected to an EMC 3700 storage system with four HP 9000 Model G70s doing duty as application servers.

Until recently, Rivera wasn't particularly happy with SunGard (Philadelphia, Pa.) when it came to some of his disaster recovery scenarios. "We didn't have the appropriate hardware configuration, the software patch levels for the OS were not where we would have liked them to be. That hindered us from hitting our 36-hour goal."

Rivera had attempted to recover his systems three times, but was never successful. Recalling a power outage from two years ago, he describes a mishap that blew out a transformer. "The facility had its own power plant, but a couple of lines had been shorted by some debris. We were dead in the water. And we would have struggled if we went to our previous provider." The system came back online, but four hours were lost. So, he looked for another vendor.

One of those vendors was HP. "We went to their facility [in Valley Forge, Pa.] and tested. And the first time, we were able to bring our systems back within our 36-hour deadline." That's less stress for the 10 individuals on Rivera's team, which is largely responsible for Aearo's bread-and-butter SAP systems, PBXs and help desk calls. About six others are responsible for the overall frame relay WAN.

Rivera stresses that the HP facilities had everything that was needed. And more. "They had a T500 and a couple of I-class systems, which were actually better than the ones we had at home [the G70s]." He also notes that "we have the High Availability Observer workstation connected through our Local Area Network providing 24-hour monitoring to HP so they [have] access to our machine at any time. It polls our machine and checks its health, and they then notify us of any issues."

Rivera provides a case in point: "We had a processor failure and we were able to get the system repaired and back online in under four hours." However, he points out that "it's not the same level of support on our application servers because if one of the app servers fail, we have four of those. So we would be able to reroute all of [our] users to the remaining three." There are a few Compaq ProLiant servers running Windows NT, but they run no critical applications, according to Rivera. Applications like e-mail are not under support contract. "We just handle it in house."

Five years ago, Aearo, like most other companies, had different systems running at different locations. "An IBM in Europe. An HP 3000 in Southbridge. An AS/400 in Indianapolis. They didn't talk to each other." But since moving to the central SAP system, disaster recovery has been a higher priority. Now with the Valley Forge hot site, Rivera is comfortable that "we can continue running our day-to-day operations in case a facility is taken off line."

Still recalling earlier days, Rivera says, "At the time, the reporting of all that information was rather difficult. Today we don't have that problem. We have a true data center with a power generator and UPS system that we didn't have at that time."

With fully redundant systems, internal backup and recovery procedures and their own generator in Southbridge, Aearo's systems stay online indefinitely, "as long as we can keep fuel in the [in-house] generator." Rivera explains that a telecommunications outage is the kind of event that would prompt a move to the Valley Forge, Pa., site.

Long term goals include establishing a remote connection to the Valley Forge hot site. For now, Rivera is prepared to send out the tapes, so they could get the system online.

Recovery practices are scheduled every six months. However, he notes that HP provides the hardware. "We restore our systems on to it. There's not a hot system at the site."

Name: Rory Hammond, Manager of Information Systems
Company: Menlo Logistics (Redwood City, Calif.)
Business: Provides global logistical operation services, including order fulfillment, transportation, storage and distribution, from raw material to consumer; provides logistics control for companies with warehouses
IT Assets: Informix-based warehouse management system; mixture of HP 9000 D- and K-class systems running HP-UX 10.2; moving to 11.0 in the future
Recovery Window: None
Recovery Drills: Yes. 2x/year
BC Philosophy: "Have a failover or backup plan, or you'll be out of business."

Atmos Energy Corp. (Dallas, Texas) distributes natural gas and propane to more than one million customers in 13 southwestern states. Scott Womer, Systems Engineer, runs disaster recovery tests twice a year. "For us it's very mandatory. It's more than just IT because we are a natural gas company. We have all those pipes out in the ground. We have to track leaks. We're liable for any kind of explosion. Monitors at every junction; sniffers along the pipe. Business Continuity Planning or BCP as we call it is very critical. Everything must be recovered in three days."

At Atmos, the call center is front and center. A centralized Customer Information System (CIS) handles all the calls for service orders, billing and meter reading. The CIS is a semi-off-the-shelf application backed by an Oracle RDBMS, developed in part with SCT (Malvern, Pa.) using SCT's Banner CIS software. There are 50 to 150 SCT developers on staff to help customize the app. According to Womer, "We have a huge source code repository and version control."

On the hardware side, Womer says, "everything is ServiceGuard," referring to HP's MC/ServiceGuard product. MC/ServiceGuard lets you organize your applications into packages and designate what happens in the event of a hardware failure on a package's original system or network: control of specific packages is transferred to another system, or communications are transferred to the idle LAN. Atmos' systems consist of a mix of HP 9000 K- and D-class servers.
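
The package idea is easier to see in a toy model. The Python sketch below is only an illustration of package-based failover in a two-node cluster; it is not MC/ServiceGuard's configuration syntax or API, and every class, function and node name in it is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One server in a two-node cluster (hypothetical model)."""
    name: str
    healthy: bool = True


@dataclass
class Package:
    """An application grouped with the node it prefers and its standby."""
    name: str
    primary: Node
    standby: Node
    running_on: Node = field(init=False)

    def __post_init__(self) -> None:
        # A package starts out on its primary node.
        self.running_on = self.primary


def fail_over(packages: list, failed: Node) -> list:
    """Mark a node as failed and move its packages to their partner node.

    Returns the names of the packages that were moved.
    """
    failed.healthy = False
    moved = []
    for pkg in packages:
        if pkg.running_on is failed:
            target = pkg.standby if pkg.primary is failed else pkg.primary
            if not target.healthy:
                raise RuntimeError(f"no healthy node left for {pkg.name}")
            pkg.running_on = target
            moved.append(pkg.name)
    return moved


if __name__ == "__main__":
    node_a, node_b = Node("node-a"), Node("node-b")
    database = Package("database", primary=node_a, standby=node_b)
    web = Package("web", primary=node_b, standby=node_a)
    print(fail_over([database, web], node_a))
    print(database.running_on.name, web.running_on.name)
```

The point of the model is that each package carries its own failover target, so losing one node moves only the packages that were running there.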

The Oracle servers run on a pair of K580s. The Oracle Web application servers are running on a pair of K460s. There are two ancillary systems for drivers out in the field. "They all dial-up via cellular to get their service orders and emergency service requests, that runs on a D330 while our Internet firewall runs on a D370 and our OpenView server runs on a D370," describes Womer. With the exception of the Oracle Financials [a pair of K580s and pair of K460s running HP-UX 11.0] everything is running HP-UX 10.20. Every machine has a partner in a two-node cluster.

The above configuration, located in Dallas, contains the crown jewels of Atmos' central system. "The central site is the only one that has the ServiceGuard," says Womer. A frame relay WAN establishes a connection to all of Atmos' 86 sites. For small sites with fewer than five users, Womer uses a VPN solution coming in over the firewall. Womer also notes that his backup and recovery software is OmniBack. "But no MUM, just a central cell. We do that for all the HP boxes." Womer's OmniBack backup and recovery procedures also take care of applications running on Compaq ProLiant Windows NT servers, a single IBM RS/6000 and a "couple of Sun boxes and a dozen Linux servers."

For both the Call Center and the corporate site where all the data resides, disaster recovery services are contracted through Comdisco. "Every critical system we have to have back in 30 days is under contract for recovery. That includes actual offsite offices and space for 200 employees and workstations." Preparation includes a semi-annual practice. "We just scheduled the next one for December," says Womer. In case of a natural disaster, he's ready to send users to Grand Prairie, Texas (30 minutes away) and system admins with backup tapes to New Jersey "where the HP systems live."

Thankfully, natural disasters have not proved to be a problem for Womer. But "fat finger" instances that blow away the database on occasion are another matter. "Whoops, it's gone," he says chuckling. "We just do a restore." The cavalier attitude doesn't come without justification: "We do full backups every day." To some IT managers, that might seem like overkill. But not for Womer: "I don't believe in incremental backups unless you have so much data that your hardware can't take it."

However, incremental backups are not entirely out of the question: "I have two boxes [a pair of Compaq ProLiants running Windows NT] that I do incrementals on if I can get a backup completed within our backup window, which is four hours. With 40G to 50G worth of user data on them, it takes 24 to 48 hours for those guys to back up. We do incrementals throughout the week, with a full [backup] on Saturday." Six single DLT tape drives plus two full-blown robotic libraries are located in Dallas. "We rotate tapes out off-site everyday with four-week, eight-week and 52-week rotations for daily, weekly, monthly tapes." The remote sites each have one DLT tape drive.
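
The arithmetic behind that exception can be sketched in a few lines of Python. The 40G to 50G data size, the 24 to 48 hour duration and the four-hour window come from Womer's description; pairing 40G with 24 hours and 50G with 48 hours to bound the transfer rate is our assumption, as are the function names.

```python
def throughput_gb_per_hour(data_gb: float, hours: float) -> float:
    """Effective backup rate implied by moving data_gb in the given hours."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    return data_gb / hours


def choose_backup(data_gb: float, rate_gb_per_hour: float, window_hours: float) -> str:
    """Return "full" if a full backup fits the window, else "incremental"."""
    hours_needed = data_gb / rate_gb_per_hour
    return "full" if hours_needed <= window_hours else "incremental"


if __name__ == "__main__":
    window = 4.0  # backup window in hours, as quoted

    # Assumed pairings of the quoted ranges, used only to bound the rate.
    best_rate = throughput_gb_per_hour(40, 24)
    worst_rate = throughput_gb_per_hour(50, 48)

    print(f"implied rate: {worst_rate:.2f} to {best_rate:.2f} GB/hour")
    print(f"largest full backup that fits the window: {best_rate * window:.1f} GB")
    print(f"strategy for 50 GB: {choose_backup(50, best_rate, window)}")
```

At the rates those numbers imply, only a few gigabytes fit in a four-hour window, which is why these two boxes get weeknight incrementals and a full backup on Saturday.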

The data center is on a Liebert UPS -- good enough for twenty minutes, says Womer -- "which is enough time just to run in there and start shutting things down."

So far, the most unusual circumstance for Womer has been toner dust from large high-speed line printers. "The accumulation of toner dust shorted out some memory chips. Since then, we added extra baffles and mufflers and put a partition between the printers and the servers."

Menlo Logistics (Redwood City, Calif.) is a contract logistics company where Rory Hammond, Manager of Information Systems, tries to design IT systems that are as close to total redundancy as possible. "We have these warehouses with huge amounts of inventory that cost millions of dollars for our customers. We have an HP warehouse. We have an IBM warehouse. And we have a [warehouse for a] very large chip maker." Other clients include AT&T, Dow Chemical, NCR, Nike and Sears.

According to Hammond, "Our customers will ship goods and orders to us, and we decide how to best get those goods out of the warehouse and to their destination. We also provide shipping and destination services. So, all the warehouses need to be 24x7."

With 16 warehouses throughout the world and three to four warehouses being added each year, Hammond has his work cut out for him. From a business continuity standpoint, Hammond favors HP-UX mirroring. "We can do hardware mirroring, but we normally use disk mirroring because I can get more bang for the buck. So if we lose a disk, we can fail over to disk. We also have dual paths or dual SCSI cards to the devices. If the CPUs are on separate cards, you can actually lose a CPU, which we have on one occasion, and the box won't die. It will limp along on the other CPU. HP does that well in the K-[class] boxes. That's good because you can schedule an outage."

Menlo's business success depends on getting the client's inventory where it belongs in the shortest amount of time. So, Menlo depends on IT in the form of its Real-Time Warehouse Management System (RWMS). Customized for each warehouse, the RWMS is based on an Informix 7.2 RDBMS integrated with radio frequency (RF) terminals, bar code and label printers and EDI. "We strive for 100 percent accuracy with people out in the terminals tagging and verifying."

Warehouses, full of inventory, however, are also full of just about everything else -- including dust, dirt, oil, smoke, grease and other sundry mixtures not conducive to "sterile" glass house computing. Hammond explains: "I had a server in this room during the construction of a warehouse that didn't have a roof on it. The construction workers had taken out the metal ducts and there was something like four foot openings in the wall with the wind blowing through. The back door was boarded and there were sandbags along the bottom to keep water out.

"I was installing the Informix application software. And I was having trouble with the tape drive getting dirty -- having to run the tape cleaner more than what I normally would expect. When I opened the door and looked at the far wall -- which was about 300 feet away -- I could not see the wall because of the cement dust. They were cutting cement for the construction of this warehouse. They were using jackhammers to finish the remodeling and I could see the server bouncing from the vibrations.

"After I left, there was a torrential downpour. Water came in through the back door. The workers moved everything off the floor except for the server. But that [K-class] box ran for a year and a half. Later I had a thermal check, the CE came in and discovered the fan had burned out. He found all kinds of cement dust in there. He replaced the fan and that box is still working after three years."

Then there are the power outages. "If there are no lights, the guys just can't work," states Hammond. "But the systems are on UPS and they usually stay up." Hammond describes a one-hour outage in Richmond, Va. where there were no orders being transferred, "but the data was safe and box was up."

The example illustrates Hammond's belief that cold components are problems waiting to happen. "When components go down and come back up, you're going to have problems with the life span of the components. I keep them up hot and running. The UPSs are for that -- they're designed to run up to 15 or 20 minutes." Hammond has a few American Power Conversion UPSs, but the majority are integrated in the D- and K-class boxes. "We buy them and put them in the [server] rack."

Because of the volume of transactions and the number of people getting in, Hammond avoids running the WMS centrally over a network. "They would probably not get a satisfactory response," he says. "Because they are inexpensive, we have slow frame connections which are 56K and 128K lines -- about $600 to $1,200 a month -- whereas for a T1 line from Portland to Virginia [the cost] is huge. So, we don't generally hook up T1s or use large WANs to run warehouses over the network." So for Hammond, it's better to replace the hardware on site.

For Hammond, disaster recovery services are like insurance policies. In the event of a total failure, he is faced with recovering the warehouse and the associated systems. "If we lose the whole warehouse in an earthquake or something, it just falls in and dies. The strategy is that we have off-site tape backups. What we're really doing is recovering the data to see what we lost. While they are rebuilding the warehouse and restocking it, we have plenty of time to rebuild and recreate the system."

All of the warehouses have LAN connections, according to Hammond, that are used for support and for employees to send and receive e-mail. However, he stresses that "Our backup is not a backup LAN, but actually dials in direct to the server. We have LAN redundancy. Some have ISDN backup. Some don't. We can be down for a day, and probably still get product out because the servers are centralized." However, he notes that "we ran warehouses for about a year on telephone lines where we dial-upped and dropped the data. One of my assumptions," states Hammond, "was that the server was going to be onsite and the network could be down and we could run standalone for a few days."

If the data center is lost, Menlo's disaster recovery services are contracted through Comdisco. "If you lose something, they tell you where the data center is available and where to move your stuff. We take our data and restore it remotely." One recovery practice was done in April; the most recent, this past August. Hammond notes that the HP 9000 production boxes for Menlo are relatively new, so "we haven't practiced with those as of yet. We also have an AS/400 running another application that has failover capabilities."

Hammond takes a slightly different approach for Menlo's glass house [located in Portland, Ore.] where preventing unscheduled outages of any kind is absolutely essential. "These are shared systems that are critical among many customers. So, we have redundancy and failover as the first line of defense. We have a production box and a failover box. If we lose the backbone, we would reboot off of our backup box."

So far, however, Hammond has avoided HP MC/ServiceGuard and switchover kinds of solutions. "Sometimes it's easier to fix the problem than fail over. You still have to take the system down and fail back over when you get the problem fixed. Those are judgement calls." However, as Menlo reaches what Hammond refers to as critical mass, "we might be going to ServiceGuard or load balancing -- if one box goes down they'll transfer the workload."

Personally, he would "like to see the production box in one geographical area and our development box in another." For now his goal is much more prosaic: "My goal is not to have any unscheduled outages. On the hardware side, we've been very successful at that. We have boxes that have been running 465 days since the last reboot. That's a good record." The only reason it went down was to do the Y2K upgrade. "We try to avoid unplanned outages." Indeed.
