Wild West Tales of DR: Resiliency and a Bit o’ Luck Help Avert Disasters
The "Wild West" days of disaster recovery planning were a simpler time for networks. The hardware was tough, businesses had greater tolerances to outages and IT managers rode the range confident the homestead was safe. Today, businesses rely on the uninterrupted performance of its IT structure. More than trust in the resilience of hardware is required to safeguard an organization's information assets.
Fifteen years ago, it was not unusual for an IBM customer engineer (CE) to reassure the corporate data processing manager that disaster recovery planning requirements would be completely handled by an up-to-date maintenance agreement. The CE would say that, if a mainframe or minicomputer was powered down in advance of a fire, flood or hurricane, chances were very good that it could be powered up again once the disaster subsided.
The claim was usually substantiated by tales of disaster in which the Big Iron on the raised floor became wet as the result of fire hoses, sprinkler systems or water intruding from other sources. The data center manager simply disassembled the gear, carried the circuit boards out to the parking lot and dried them with a hair dryer or fan. Once dry, the components were reinstalled, the system was restarted and data processing continued as usual.
Many fondly recall the "Wild West" days of disaster recovery planning. Hardware was more robust. Networks were simpler. And the business’ tolerance for outages was greater. Every IT manager was a Marlboro Man – a rugged individualist, content with a solid horse and a stretch of open plain.
Todd Gordon, General Manager of IBM Business Continuity and Recovery Services, smiles at the nostalgia and its many ironies. "If only it were that simple," he muses, considering the many intricacies involved in business continuity planning.
Gordon, who heads IBM’s business continuity consulting and disaster recovery services organization, observes that the business’ dependency on the uninterrupted performance of its IT infrastructure has grown substantially. "It has come down to a question of how you will run the business while the system is down. How long can you sustain an outage?" More than trust in the resilience of hardware is required to safeguard an organization’s information assets.
He notes that an increasing number of organizations entrust mission-critical IT processes to PC-based servers. "These are engineered for a two-and-a-half year life cycle. They don’t have the resiliency engineered into them that you find in mainframes or minis. It wouldn’t be cost effective."
Jerry Revell would agree with him.
A Flood at Franklin Equipment Company
Revell, Director of Information Systems for tree-farming equipment manufacturer Franklin Equipment Company, recalls his encounter with disaster last September. As a by-product of Hurricane Floyd, the nearby Blackwater River overflowed its banks, washing out roads, engulfing the town of Franklin and dumping nearly five inches of water and sewage into Revell’s data center.
Fortunately, the two AS/400s and three IBM Netfinity servers were powered down before the deluge occurred, Revell recalls – not as part of a disaster avoidance procedure – but because of the power outage that impacted the area from Wednesday through Sunday evening.
"We prepared on Wednesday for Floyd to go through our area. We took last minute backups. Then, we lost power Wednesday night, and were closed down on Thursday, when the flood occurred."
Franklin Equipment was "fortunate" to be spared the devastation that wiped out 185 businesses in the City of Franklin, according to Revell, who attributes a large part of his successful recovery to the resiliency of the AS/400 hardware installed in his data center.
"Like most midrange shops, we didn’t have a disaster recovery plan. We take backups and store them offsite and we have shared recovery arrangements with other AS/400 shops in the area. But, we didn’t have a formal disaster recovery plan then, and we don’t now. We did everything wrong that you could possibly do. We didn’t power down the equipment after taking backups Wednesday night. And we used unstable generator power for our AS/400s following the cleanup. But, we were fortunate. We were spared some of the things that could have happened."
If it hadn’t been for the ruggedness of the AS/400, Revell notes, the outcome might have been very different. On Thursday, Revell traveled back to the site to inspect damage from the flood. His 25-minute drive to work took two-and-a-half hours due to washed-out roads. Upon arrival, he discovered the flood in his data center.
"Our AS/400s, which have a clearance of about [between] a quarter inch to about two inches, were at least two to three inches submerged" in a sludge of water and sewage that saturated his shop for two days, Revell recalls. The flood receded to about one-and-a-half inches by Friday, which Revell says was largely spent "recovering from shock." On Saturday, his staff worked with brooms, mops and portable lights to scrub the facility, so that Federal Emergency Management Agency approvals could be obtained for employees to return to work.
While the facility was being cleaned, the AS/400s had the opportunity to dry out. On Sunday, Revell plugged them into a generator and flipped the switch. "They came up and stayed up, and we ran payroll on Monday."
Not so fortunate, he notes, were the rack-mounted Netfinity servers. While they had not come into direct contact with the floodwater, ambient moisture caused one of the PC-based servers to fail. "We brought in IBM to help with the recovery of the server. It took about a week to bring it back online," Revell recalls.
Mosaix Versus the Tornado
David Schlabs, President and CEO of Mosaix Inc., can sympathize with the experience of Revell and Franklin Equipment on several levels. Like Revell, his firm experienced both the direct effects and the regional impact of a natural disaster. In Schlabs’ case, however, the antagonist was an F2 tornado (113-157 m.p.h. winds) that ripped through the Fort Worth, Texas area on March 28, 2000.
Schlabs recalls the event with singular clarity, though he was away on business in Oklahoma at the time. Unlike hurricanes, which are generally preceded by lengthy advance warning, tornadoes are often sudden events, Schlabs observes. Fortunately, prior to the tornado strike, "there were warnings of hail in the area, so everyone went home earlier in the day."
At 6:03 p.m., the twister crossed the Trinity River and wove a destructive path through a Montgomery Ward facility, the Cash America Building, the Bank One Tower and the central downtown commercial district, where Mosaix’ office was located. At that point, it was too late to develop the formal disaster recovery plan that Schlabs says he had always intended to put in place.
"Formal disaster recovery planning was always something we planned on doing, but it took a backseat to our other work. We knew we could depend on a quick replacement box from IBM. We take nightly backups and store them off-site, so, in a worst-case scenario, we are only a day or so out of synch. And we have an insurance policy for cleanup. We hoped that would be sufficient."
It took until 9:30 that evening for Schlabs’ marketing director to reach the office and report his findings to Schlabs: "The electricity was off, there was water everywhere, it was a real mess. Our Vice President of Development saw that the AS/400 used for software development was [in contact with the pooled water]. We hadn’t turned off the power after taking backups earlier in the day. Fortunately, the tornado created a power outage that shut the equipment off for us. Our VP took some IBM manuals and propped the AS/400 up on them and out of the water. I always wondered what all of those manuals were good for."
It required an additional day to get back into the facility, Schlabs says. "The emergency managers were reluctant to let people back into the downtown area. There was a lot of damage to skyscrapers and a risk of falling glass." When Mosaix’ staff did return, they set about cleaning the facility and salvaging what they could.
Schlabs says that his insurance company brought in an internationally known salvage company to clean modems, PCs, printers and other devices. "But we didn’t send them the AS/400. Once it was dry, we started it up, mounted a new RAID 5 storage subsystem we were preparing to install before the tornado hit, and everything came up okay. IBM sent out a customer engineer. He performed some diagnostics, and everything checked out."
"We were operational in a couple of days in a new office. It took another week for the phone company to get our phone system up and connected," Schlabs says.
All in all, he is philosophical about the event: "You can never have enough safeguards in place, but I don’t think it would have changed the outcome. We were fortunate that, with all the stuff blowing around, it didn’t hit anything important. I had to throw a lot away. There was a pile of rubble in my office four feet high. Important documents, customer support requests, and so forth, had to be thrown away. That is not the kind of damage that is cleared up in a week, regardless of what kind of plan you have."
He also questions the efficacy of salvaging PCs and their peripherals: "Was the cleaning process beneficial? Ultimately, that’s a call the insurance company must make. They spent something like $1,000 to salvage a PC worth about $300. We seem to have things going wrong with the salvaged equipment almost every day. But not with our AS/400."
Schlabs notes that this was his second encounter with a twister, the first having hit his home in 1979 – "also on a Tuesday, also at 6:00 p.m." He muses that, if the interval between encounters holds at 20 years, "I will be expired before the next one." With that comforting rationalization in mind, he is planning to relocate back to his original offices in September.
Blumenthal Leverages Mirroring
Ed Griffin, Information Technology Manager for New Orleans-based textile manufacturer Blumenthal Mills Inc., can sympathize with Schlabs’ views of formal contingency planning, but insists that his company has a definite requirement to try to mitigate disaster. At the same time, he acknowledges that no plan can safeguard a company from all risks.
A case in point for Griffin came at summer’s end in 1998. From August through September, he was forced to contend with the impact of not one but two hurricanes – a test that challenged his recovery strategy nearly to the breaking point. In the end, it was advance planning, and some luck, that saved the day.
Griffin describes his basic disaster recovery strategy as one of "role reversal." Blumenthal’s two facilities – a manufacturing plant in Marion, S.C. and corporate headquarters in New Orleans – are both AS/400 shops, interconnected via a Frame Relay network. Griffin says that this infrastructure enables him to set up a "realtime mirror" between the two sites. Data is replicated at each location, so each can serve as a backup to the other. He regards the strategy as key to supporting his company’s "emphasis on customer service and just-in-time manufacturing and delivery."
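Griffin’s "role reversal" strategy is easiest to picture as two peers that apply each other’s updates. The sketch below is a simplified, hypothetical model of that idea (the Site class, its in-memory "database" and the method names are illustrative stand-ins, not Blumenthal’s actual AS/400 replication software): each site commits a change locally, forwards it to its mirror partner over the wide-area link, and either site can serve the other’s workload when its peer goes offline.

```python
# Illustrative sketch only -- a toy model of cross-site mirroring with
# role reversal, not Blumenthal's actual replication setup.

class Site:
    def __init__(self, name):
        self.name = name
        self.db = {}          # stand-in for the site's local datasets
        self.peer = None      # the mirror partner at the other location
        self.online = True

    def apply(self, key, value, replicate=True):
        """Commit an update locally, then forward it to the mirror site."""
        self.db[key] = value
        if replicate and self.peer and self.peer.online:
            self.peer.apply(key, value, replicate=False)

    def processing_site(self):
        """Role reversal: if this site is down, its peer serves its workload."""
        return self if self.online else self.peer


marion = Site("Marion, S.C.")
new_orleans = Site("New Orleans")
marion.peer, new_orleans.peer = new_orleans, marion

marion.apply("order-1001", "ship 40 rolls of fabric")   # mirrored to New Orleans

marion.online = False                                   # storm forces a shutdown
active = marion.processing_site()                       # New Orleans takes over
print(active.name, active.db["order-1001"])             # the data is still there
```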
So, on August 24, 1998, as Hurricane Bonnie – a Category 3 hurricane packing 120-plus mile-per-hour winds – approached South Carolina, Griffin was confident that his firm could take whatever Bonnie doled out. He confirmed the operation of his mirror to New Orleans and reviewed his disaster recovery plan. On August 26, his staff switched systems over to New Orleans, powered down the AS/400 in Marion and went home to ride out the storm.
When the threat had subsided, Griffin’s staff returned to the data center and restarted the AS/400. "The problem was that our uninterruptible power supply, which had been powered down, was recharging its batteries. So, the AS/400 was taking power directly from the line. There were a series of power surges and drops [related to the impact of the storm on local utilities]. They killed a fan on the UPS and stopped power to the AS/400 in mid-IPL [initial program load]."
The result was a system crash that cost Blumenthal three disks and one parity stack in its RAID 5 Serial Storage Architecture (SSA) storage subsystem, totaling approximately 194 GB of information. "IBM customer engineers from Florence, S.C. showed up that night with replacement drives. It’s a blur now, but the replacements arrived before we could do anything with them," Griffin recalls.
Unfortunately, according to Griffin, the local data was permanently lost; RAID 5 can rebuild the contents of only a single failed drive per array, and the simultaneous loss of three drives was more than the parity scheme could recover from. New Orleans, however, continued to host the Marion plant’s processing using the mirrored datasets. The problem came down to how best to re-establish the mirror and get the Marion systems back online.
"Given the amount of data to restore, we couldn’t use standard mirroring techniques across the Frame Relay network. Data had to be copied to tape in New Orleans and shipped to Marion, where we would use a local Magstar tape drive to reload our systems. It took until September 19 to complete the restoral and to get the mirror back in place."
Some of the delay was the result of a second hurricane, Hurricane Earl, which moved up the Gulf of Mexico to threaten New Orleans on September 13. "With all of our processing now being handled at our New Orleans headquarters, we kept praying that the pumps would keep up with the storm surge from Earl. The water came up to within one-and-a-half feet of the door of our New Orleans data center. Fortunately, it didn’t flood us there," says Griffin.
Griffin attributes his success in the face of the disasters to advance planning and testing – combined with a ration of good luck. "We experienced a total of a couple of hours of downtime – maybe. All shipments were made on time and no business was lost. Advance planning and mirroring accounted for the bulk of our results."
Says Griffin, "A preparedness effort is part of our emphasis on just-in-time delivery and customer service. A [continuity] plan was underway when I joined the company in 1992. We have tuned it up considerably since then. We test generally twice a year now – during Christmas and the July 4th holidays, when most of the employees are vacationing."
For Blumenthal Mills, he adds, a reputation for hardware resiliency and a maintenance agreement are not enough for preparedness. He says, "You can’t have blind faith in anything."
Services Refined
IBM’s Gordon believes that the growing attention paid to advance planning for business continuity in many organizations has little to do with their faith in vendors or hardware. It has everything to do with how business itself has changed.
"With the advent of e-business," Gordon says, "a plethora of new technologies are being introduced with new design points and new recovery requirements. Adopting e-business processes moves a company out of the traditional application processing model and into a supply chain model that is far more vulnerable to network reliability, security and other factors. Providing high availability hardware configurations and component redundancy is only a part of business continuity planning."
The Wild West days of disaster recovery are an anachronism in the 21st century. The Marlboro Man may be in an oxygen tent, but the sun never sets in the 24x7x365 universe of the Internet.
About the Author: Jon William Toigo is an independent consultant with nearly 20 years of practical IT experience and the author of Disaster Recovery Planning, from Prentice Hall PTR. He maintains a Web site dedicated to disaster recovery planning at www.drplanning.org and can be reached via e-mail at jtoigo@intnet.net.