Riding Shotgun with the Masters of Disaster
Many companies have not taken the steps required to ensure that their IT capabilities are available when they really need them. In many cases, companies rely solely on tape backup of mission-critical systems, reassured that the information at least exists somewhere if it is needed. Often the data recovery process is not tested, or is tested only partially. In fact, a recent survey found that 60 percent of companies with data recovery plans have never tested them and aren’t sure they would work.
Several years ago, Verizon Data Services (formerly GTE Data Services) reacted to these trends by rethinking its availability strategy. While traditional recovery processes focus almost exclusively on systems and platforms, Verizon has developed a strategy that involves a combination of a high-performance infrastructure, continuous data accessibility and protection from data loss. This strategy goes beyond the traditional approach to view disaster recovery as a refinement of normal operating procedures, an element of standard performance objectives, and a fundamental part of the company’s culture.
At Verizon, disaster recovery planning is designed to avoid customer losses by maintaining the required levels of data processing support, regardless of unanticipated events. Because disaster recovery planning is finely integrated into the normal operating culture of the organization, it helps promote improved performance, enhances employee skills and provides more reliable methods and operating procedures.
More than 4,000 Mainframe MIPS
Verizon maintains data centers in Tampa, Fla., Fort Wayne, Ind., and Sacramento, Calif. Together they work as a virtual data center, providing a fully integrated processing environment. For example, the Tampa data center has more than 4,000 MIPS of mainframe computing capacity, 700 T1 and T3 circuits linking dozens of sites, more than 120 midrange computers and 28 terabytes of data stored on 706 DASD and 57 StorageTek tape silos.
There is a full-time disaster recovery staff at each of the data centers. Remote site personnel provide additional support in the event of a disaster at one of the data centers. The recovery teams include management response teams, such as a security team, an administrative support team, a disaster recovery coordinator and an emergency management team. The hot site restoration team includes a user liaison team, AS/400 support team, software recovery team, off-site storage team, computer operations team, network telecommunications team and DASD resource management team. The applications recovery team includes personnel from both the customer and Verizon, and the site restoration team includes a damage assessment unit.
Tampa Data Center Is Crucial
Disaster recovery plans at the Tampa site are especially important. The Tampa data center is Verizon’s largest, and currently runs 170 mainframe online applications. This data center also houses 316 minicomputers, midrange systems and client/server platforms with 1,645 MIPS, including Tandem/Compaq, HP, Digital VAX, IBM, NCR, Prime, Sequent and Sperry. Sun SPARC workstations are used to monitor all software, hardware and network conditions. These workstations provide systems personnel with the ability to establish up to 128 concurrent sessions. Tape operations include 604,037 volumes on-site, 60,424 volumes off-site, 1,179 manual tape mounts per day and 32,285 robotic tape mounts per day. Input/output includes three IBM 3900 laser printers and four high-speed impact printers. Telecommunications equipment includes 24 front-end processors and 75,200 directly connected network devices.
Built to Beat Weather
Verizon’s Tampa data center has been constructed to withstand virtually any unexpected event. The building itself can withstand Category 3 hurricanes (on the five-category Saffir-Simpson scale) with winds of more than 130 miles per hour, and provides 100,000 square feet of raised-floor processing space, including an 8,000-square-foot minicomputer center.
Two separate power and data feeds prevent interruptions in power. The data center also contains a water pump capacity of 18,000 gallons per minute and 2,200 gallons of CPU water chilling capacity. The building automation system consists of more than 1,000 points of analog and digital I/O, and includes alarm monitoring with dual-out annunciation for chiller plant alarms, UPS alarms, raised floor water detection, computer room temperature and humidity levels, generator operational alarms, power distribution equipment alarms and advanced smoke detection system alarms.
The Tampa data center disaster recovery plan begins with disaster prevention. A satellite monitoring system warns of hazardous weather conditions. Multiple banks of uninterruptible power system backup units can maintain the production environment for 30 minutes if power is cut off.
The company has two isolated redundant UPS systems with four primary modules and one maintenance module each, and 3,840 kW total capacity. Each UPS module has 200 battery cells operating at 450 VDC and 533 kW, whose power quality is monitored by Square D circuit monitors. If a primary module fails, critical load transfers to the maintenance module, generators automatically start in parallel to the utility and the monitoring system alerts facilities staff.
Verizon also maintains four 1,100 kW and two 1,250 kW Cummins generators with 6,900 kW of total capacity. Two underground 20,000-gallon fuel tanks make it possible for the generators to run nonstop for approximately seven days.
The generators are online and synchronized in 35 seconds or less after a power failure is detected. If an outage outlasts the on-site fuel supply, the data center is only 20 miles from the nearest port facility, and the company has arranged for priority fuel deliveries. When storms are moving in, the company switches to diesel power in advance to avoid interruptions from power outages or surges.
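The power figures quoted above can be cross-checked with a little arithmetic. The following is only a sketch in Python: the constants come from the article, but the function names are hypothetical, and the fuel burn rate is merely the average implied by "approximately seven days" on two 20,000-gallon tanks, not a measured figure.

```python
# Figures quoted in the article for the Tampa data center.
GENERATORS_KW = [1_100] * 4 + [1_250] * 2   # four 1,100 kW and two 1,250 kW units
FUEL_TANKS_GALLONS = [20_000, 20_000]       # two underground tanks
RUNTIME_DAYS = 7                            # "approximately seven days"
UPS_BRIDGE_SECONDS = 30 * 60                # UPS can carry production for 30 minutes
GENERATOR_START_SECONDS = 35                # generators online in 35 seconds or less


def total_generator_kw(units_kw):
    """Sum the rated output of all generator units."""
    return sum(units_kw)


def implied_burn_gallons_per_hour(tanks_gallons, runtime_days):
    """Average fuel burn implied by the stated runtime (an estimate only)."""
    return sum(tanks_gallons) / (runtime_days * 24)


def ups_margin_seconds(bridge_seconds, start_seconds):
    """Time the UPS could still carry the load after the generators are online."""
    return bridge_seconds - start_seconds


if __name__ == "__main__":
    print("Generator capacity (kW):", total_generator_kw(GENERATORS_KW))
    print("Implied burn (gal/h):",
          round(implied_burn_gallons_per_hour(FUEL_TANKS_GALLONS, RUNTIME_DAYS), 1))
    print("UPS margin (s):",
          ups_margin_seconds(UPS_BRIDGE_SECONDS, GENERATOR_START_SECONDS))
```

The generator total agrees with the 6,900 kW stated in the article, and the 30-minute UPS window leaves a wide margin over the 35-second generator start.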
Backup, of course, is a critical part of ensuring high availability. Verizon performs system backups every weekend and incremental tape backups of data every hour. High-volume databases are backed up more often, sometimes minute by minute for the most heavily used systems. If the Federal Aviation Administration closes the airport because of an approaching storm, Verizon stays online until the last minute to synchronize all systems. Meanwhile, processing is gradually transferred to other data centers to reduce transaction volume. The ultimate plan, which has never had to be implemented, involves transporting the tapes to the hot site in Chicago.
Five-Step Recovery Plan
Verizon has a five-step disaster recovery methodology:
Step 1: In an emergency, the disaster recovery coordinator and disaster recovery manager at the data center assess the damages and determine the ability of the data center to continue processing. If the data center’s physical facilities are inaccessible, a temporary command center is established.
Verizon’s hot site is a fully equipped facility containing all the computer equipment, network equipment and connectivity needed to support rapid recovery of any of Verizon’s data centers. The hot site can be used for up to eight weeks in disaster mode. The hot site’s communications network is maintained as an inactive node on Verizon’s commercial network and can be activated in minutes to support hot site operations on a national level.
Step 2: After a disaster occurs, the operating environment is restored at the hot site, which provides the framework to restore critical business operations.
Step 3: Mission-critical business applications are restored.
Step 4: Normal business functions are restored. Five days after having moved production processing to the hot site, the disaster recovery manager and site restoration manager estimate when production processing can be restored at the home site.
Step 5: The activities required to move from the hot site back to the home site are performed.
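The five steps above form a strictly ordered sequence, which can be sketched as a small state progression. This is an illustration only, written in Python: the enum and function names are hypothetical labels for the steps described in the article, not anything Verizon is known to use.

```python
from enum import IntEnum


class RecoveryStep(IntEnum):
    """Hypothetical labels for the five steps described above."""
    ASSESS_DAMAGE = 1                   # coordinator and manager assess the data center
    RESTORE_OPERATING_ENVIRONMENT = 2   # operating environment restored at the hot site
    RESTORE_CRITICAL_APPLICATIONS = 3   # mission-critical applications restored
    RESTORE_NORMAL_FUNCTIONS = 4        # normal business functions restored
    RETURN_TO_HOME_SITE = 5             # move from the hot site back to the home site


def next_step(current):
    """Return the step that follows `current`, or None after the final step."""
    if current is RecoveryStep.RETURN_TO_HOME_SITE:
        return None
    return RecoveryStep(current + 1)


def run_plan():
    """Walk the plan from the first step to the last, returning the order visited."""
    order = []
    step = RecoveryStep.ASSESS_DAMAGE
    while step is not None:
        order.append(step)
        step = next_step(step)
    return order


if __name__ == "__main__":
    for step in run_plan():
        print(f"Step {step.value}: {step.name}")
```

Modeling the steps as an ordered enumeration makes it hard to skip a stage by accident, which matches the sequential nature of the methodology.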
If the company’s equipment is damaged or destroyed in a disaster, the damage assessment team evaluates the extent of the damaged equipment and makes appropriate recommendations. The damage assessment team includes Verizon personnel and on-site equipment vendors.
For equipment replacement or repair procedures, the damage assessment team contacts the appropriate vendors for repair and the Verizon repair group for assistance in purchasing new equipment.
Regular testing of the disaster recovery plan is an important part of the program. Verizon regularly tests the viability of its backup process by performing a complete system restore at a remote location owned by its disaster recovery partner.
Documenting the Plan
The disaster recovery coordinator ensures that the procedures to be performed during each stage of recovery are fully documented. Disaster recovery documentation includes a list of personnel responsible for the applicable recovery effort, a list of high-level action items in chronological order and a set of detailed, step-by-step instructions for each task.
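The three documentation elements named above (responsible personnel, chronological action items and step-by-step task instructions) map naturally onto a small data structure. The sketch below, in Python, is one possible shape under that assumption; every class, field and method name is hypothetical and is not drawn from Verizon's actual plan.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    instructions: list[str]          # detailed, step-by-step instructions


@dataclass
class ActionItem:
    sequence: int                    # position in chronological order
    summary: str                     # high-level description
    tasks: list[Task] = field(default_factory=list)


@dataclass
class RecoveryStageDoc:
    stage: str
    personnel: list[str]             # people responsible for this recovery effort
    action_items: list[ActionItem]

    def ordered_action_items(self):
        """Return the action items sorted into chronological order."""
        return sorted(self.action_items, key=lambda item: item.sequence)

    def missing_sections(self):
        """List the required parts that are still empty."""
        missing = []
        if not self.personnel:
            missing.append("personnel")
        if not self.action_items:
            missing.append("action_items")
        for item in self.action_items:
            for task in item.tasks:
                if not task.instructions:
                    missing.append(f"instructions for {task.name}")
        return missing
```

A coordinator could run a check such as `missing_sections()` over each stage's document to confirm that nothing required has been left blank before a test.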
The disaster recovery coordinator ensures that documentation outlining the alternate methods and procedures to be followed by the user community in a disaster situation is developed and included in the disaster recovery plan. This documentation includes a prioritized list of critical production applications. A customized disaster recovery plan specific to each customer’s needs is prepared and updated throughout the life cycle of the customer’s contract. Each contract becomes a part of Verizon’s data center disaster recovery plan.
Keeping the Call Center Up
Verizon’s efforts to increase the availability of its customer service representatives also improved its data availability record. The first step was to identify all of the systems supporting the representatives, which were also potential contributors to downtime.
An important action item was providing redundancy for every critical aspect of the call center’s operations. This included the workstations, network connectivity, telecommunications lines, etc. The call center locations are spread around the country, so one of the first steps was to strategically position spare parts at each of these centers. A subject matter expert was stationed at each call center to serve as a first line of support. Critical systems were identified and plans were made to immediately switch to a different process in case of an outage. The result was not only a reduction in the number of outages but also a reduction in the length of outages.
The program accomplished its ambitious goal of providing a tenfold improvement in the availability of service representatives through technology and improved operating methods. It reduced online systems downtime by 40,000 hours, resulting in savings of between $500 million and $1 billion, and reduced outage hours per customer service representative to below the two-hour mark.
About the Author: Jerry Fireman is President and founder of Structured Information (Birmingham, Mich.), and has written over 6,000 articles for over 1,500 trade journals in 24 countries.