Round 'em Up: Geographically Dispersed Parallel Sysplex

How would a shutdown of your S/390 system affect your business? Do you put off system maintenance and upgrades to avoid system downtime? What about a site disaster? Are your business-critical applications and data protected from a site disaster? Many companies have inadequate business continuance plans developed on the premise that back office and manual processes will keep the business running until computer systems are available. Characteristics of these recovery models allow critical applications to recover within 24 to 48 hours, with data loss potentially exceeding 24 hours, and full business recovery taking days or weeks. As companies transform their business to compete in the e-marketplace, business continuity strategies and availability requirements should be re-evaluated to ensure that they are based on today’s business requirements.

With IBM’s Parallel Sysplex clustering technology enabling resource sharing and dynamic workload balancing, enterprises are now able to dynamically manage workloads across multiple sites to achieve high levels of availability. The Geographically Dispersed Parallel Sysplex (GDPS) complements a multisite Parallel Sysplex by providing a single, automated solution to dynamically manage storage subsystem mirroring, processors and network resources, allowing a business to attain "near continuous availability" without data loss. GDPS is designed to minimize and potentially eliminate the impact of any failure, including disasters, or of a planned site outage. It provides the ability to perform a controlled site switch for both planned and unplanned site outages with no data loss, maintaining full data integrity across multiple volumes and storage subsystems. It also allows a normal Database Management System (DBMS) restart -- not DBMS recovery -- at the opposite site. GDPS is application independent and, therefore, covers the customer’s complete application environment.

GDPS has been successfully installed at several customer locations. These customers have experienced significant reductions in the recovery time window, with no data loss, when running Disaster Recovery (DR) drills. For example, with GDPS, a simulated disaster at a customer site caused no data loss and the recovery window was reduced from 12 hours to 22 minutes. Additionally, a user-defined planned site switch from one of the sites to the second site took 42 minutes.

GDPS In Action

GDPS consists of a base or Parallel Sysplex cluster spread across two sites (site 1 and site 2) separated by up to 40 kilometers (km) -- approximately 25 miles -- with one or more OS/390 systems at each site. The multisite Parallel Sysplex cluster must be configured with redundant hardware (for example, a Coupling Facility (CF) and a Sysplex Timer in each site) and the cross-site connections should be redundant. All critical data resident on disk storage subsystem(s) in site 1 (the primary copy of data) is mirrored to site 2 (the secondary copy of data) using the open, synchronous Peer to Peer Remote Copy (PPRC).

GDPS consists of production systems, standby systems and controlling systems. The production systems execute the mission-critical workload. The standby systems normally run expendable work, which will be displaced to provide processing resources when a production system or a site is unavailable. Sufficient processing resources (processor capacity, main and expanded storage, and channel paths) must be available and able to be brought online quickly to restart a system’s or site’s critical workload, typically by terminating one or more systems executing expendable (non-critical) work and acquiring their processing resources. Significant cost savings are provided by the S/390 9672 Capacity BackUp (CBU) feature, which provides the ability to increase capacity temporarily when capacity is lost elsewhere in the enterprise. The CBU function adds Central Processors (CPs) to a shared pool of processors and is activated only in an emergency. GDPS-CBU management automates the process of dynamically adding reserved Central Processors, thereby minimizing manual customer intervention and the potential for errors. The outage time for critical workloads can be reduced from hours to minutes. The controlling system coordinates GDPS processing. By convention, all GDPS functions are initiated and coordinated by one controlling system.

All GDPS systems are running GDPS automation based upon Tivoli NetView for OS/390 and System Automation for OS/390. Each system will monitor the Parallel Sysplex cluster, coupling facilities and disk storage subsystems and maintain GDPS status. GDPS automation can coexist with an enterprise’s current automation product.

The Freeze Function

Data consistency across all primary and secondary disk volumes spread across any number of storage subsystems is essential in maintaining data integrity and the ability to do a normal database restart in the event of a disaster. The main focus of GDPS automation is to make sure that, whatever happens in site 1, the secondary copy of the data in site 2 is data consistent (the primary copy of data in site 1 will be data consistent for any site 2 failure). Data consistent means that, from an application’s perspective, the secondary disks contain all updates up to a specific point in time, with nothing missing and no updates beyond that point in time.
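The definition of data consistent can be expressed as a simple check. The following Python sketch is illustrative only: it assumes each update carries a sequence number (1, 2, 3, ...) in the order applications issued it, and the function name is hypothetical, not part of GDPS or PPRC. Under that assumption, a secondary copy is data consistent when the updates it holds form an unbroken run from 1 up to some point.

```python
def is_data_consistent(applied_updates):
    """Return True if the applied updates are exactly 1..k for some k.

    applied_updates: sequence numbers of the updates present on the
    secondary disks (hypothetical numbering, assigned in write order).
    """
    applied = set(applied_updates)
    # Consistent means nothing missing up to the cut-off point and
    # nothing beyond it, i.e. the set is exactly {1, ..., k}.
    return applied == set(range(1, len(applied) + 1))


print(is_data_consistent([1, 2, 3]))  # True: everything up to update 3
print(is_data_consistent([1, 2, 4]))  # False: update 3 is missing
```

A copy holding updates 1, 2 and 4 fails the check because update 4 may depend on update 3, which never arrived.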

The fact that the secondary data image is data consistent means that applications can be restarted in the secondary location without having to go through a lengthy and time-consuming data recovery process. Since applications only need to be restarted, an installation can be up and running in less than an hour, even when the primary site has been rendered totally unusable. In contrast, data recovery involves restoring image copies and logs to disk and executing forward recovery utilities to apply updates to the image copies -- a process measured in hours or days.

GDPS uses a combination of storage subsystem, Parallel Sysplex technology and environmental triggers to capture, at the first indication of a potential disaster, a data consistent secondary site copy of the data, using the new, recently patented PPRC freeze function. This function will freeze the image of the secondary data at the very first sign of a disaster, even before any database managers will be aware of I/O errors. This prevents the logical contamination of the secondary copy of data that would occur if any storage subsystem mirroring were to continue after a failure that prevents some but not all secondary volumes from being updated. This optimizes the secondary copy of data to perform normal restarts (instead of performing database manager recovery actions). This is the essential design element of GDPS in minimizing the time to recover the critical workload in the event of a disaster at the primary site.
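The effect of the freeze can be illustrated with a toy model. This Python sketch is a simplification under stated assumptions and does not represent how PPRC or the storage subsystems are implemented: updates are numbered in issue order, each goes to one of two volumes (the labels LOG and DB are illustrative), and mirroring for one volume stops working from a given update onward. All function and parameter names are hypothetical.

```python
def mirror_to_secondary(writes, failed_volume, fail_at, freeze):
    """Return the sequence numbers that reach the simulated secondary.

    writes: list of (sequence_number, volume) in issue order.
    failed_volume: volume whose mirroring fails from fail_at onward.
    freeze: if True, suspend mirroring of ALL volumes at the first failure.
    """
    secondary = []
    frozen = False
    for seq, volume in writes:
        if frozen:
            continue  # all mirroring is suspended after the freeze
        if volume == failed_volume and seq >= fail_at:
            if freeze:
                frozen = True  # first sign of trouble: freeze every volume
            continue  # this update never reaches the secondary
        secondary.append(seq)
    return secondary


def is_prefix(seqs):
    """True if seqs is exactly 1..k, i.e. a data consistent image."""
    return seqs == list(range(1, len(seqs) + 1))


writes = [(1, "LOG"), (2, "DB"), (3, "LOG"), (4, "DB")]
without_freeze = mirror_to_secondary(writes, "LOG", 3, freeze=False)
with_freeze = mirror_to_secondary(writes, "LOG", 3, freeze=True)
print(without_freeze, is_prefix(without_freeze))  # [1, 2, 4] False
print(with_freeze, is_prefix(with_freeze))        # [1, 2] True
```

Without the freeze, update 4 reaches the secondary although update 3 did not, which is the logical contamination described above. With the freeze, the secondary stops at update 2 and remains restartable.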

GDPS Functions

GDPS provides the following functions: PPRC configuration management, planned reconfiguration support and unplanned reconfiguration support.

PPRC Configuration Management. PPRC configuration management simplifies the storage administrator’s remote copy management functions by managing the remote copy configuration, rather than individual remote copy pairs. This includes the initialization and monitoring of the PPRC volume pairs based upon policy and performing routine operations on installed storage subsystems.

Planned Reconfigurations. GDPS planned reconfiguration support automates procedures performed by an operations center to simplify operations. These include standard actions to: (a) quiesce a system’s workload and remove the system from the Parallel Sysplex cluster (e.g., stop the system prior to a change window); (b) IPL a system (e.g., start the system after a change window); and (c) quiesce a system’s workload, remove the system from the Parallel Sysplex cluster, and re-IPL the system (e.g., recycle a system to pick up software maintenance). The standard actions can be initiated against a single system or a group of systems. Additionally, user-defined actions are supported (e.g., a planned site switch in which the workload is switched from processors in site 1 to processors in site 2).
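One way to picture the standard actions is as named step lists applied to one or more systems. The Python sketch below is a hypothetical illustration only: the action labels stop, start and recycle are taken from the examples in the text, while the table and function names are assumptions and not GDPS syntax.

```python
# Hypothetical table of the three standard actions and their steps.
STANDARD_ACTIONS = {
    "stop": ["quiesce workload", "remove from Parallel Sysplex cluster"],
    "start": ["IPL"],
    "recycle": [
        "quiesce workload",
        "remove from Parallel Sysplex cluster",
        "re-IPL",
    ],
}


def plan_standard_action(action, systems):
    """Expand one standard action into (system, step) pairs.

    systems may hold a single system name or a group of names.
    """
    steps = STANDARD_ACTIONS[action]
    return [(system, step) for system in systems for step in steps]


for system, step in plan_standard_action("recycle", ["SYSA", "SYSB"]):
    print(system, "->", step)
```

Keeping the steps in a table means an operator chooses an action and a target, and the same sequence runs every time.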

Unplanned Reconfigurations. GDPS unplanned reconfiguration support automates procedures to handle site failures, and will also minimize the impact of, and potentially mask, an OS/390, software subsystem, processor, coupling facility or storage subsystem failure. Parallel Sysplex cluster functions along with automation are used to detect OS/390 system, processor or site failures and to initiate recovery processing to help minimize the duration of the recovery window. If an OS/390 system fails, the failed system will automatically be removed from the Parallel Sysplex cluster, re-IPLed in place if possible, and the workload restarted. If a processor fails, the failed system(s) will be removed from the Parallel Sysplex cluster, re-IPLed on another processor and the workload restarted.
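The two recovery paths just described differ only in where the failed system is re-IPLed. The Python sketch below restates them as a lookup, purely for illustration: the dictionary keys and the function name are hypothetical and do not correspond to any GDPS interface.

```python
# Hypothetical mapping of failure type to the recovery steps in the text.
RECOVERY_STEPS = {
    "system": [
        "remove failed system from Parallel Sysplex cluster",
        "re-IPL in place if possible",
        "restart workload",
    ],
    "processor": [
        "remove failed system(s) from Parallel Sysplex cluster",
        "re-IPL on another processor",
        "restart workload",
    ],
}


def recovery_steps(failure_type):
    """Return a copy of the ordered recovery steps for a failure type."""
    return list(RECOVERY_STEPS[failure_type])


print(recovery_steps("system"))
print(recovery_steps("processor"))
```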

If there is a site failure, GDPS provides the ability to perform a controlled site switch with no data loss, maintaining full data integrity across multiple volumes and storage subsystems.

Unplanned Site Reconfiguration for a Multiple Site Workload

GDPS supports two configuration options:

Single Site Workload -- the single site workload configuration is intended for those enterprises that have production in site 1 and expendable work (e.g., system test platform, application development, etc.) in site 2.

Multiple Site Workload -- the multiple site workload configuration is intended for those enterprises that have both production and expendable work in site 1 and site 2. This configuration has the advantage of utilizing the resources available at the second site to provide workload balancing across sites for production work. GDPS provides the operational simplification to manage resources in a multiple site environment.

The production systems, SYSA, SYSB, SYSC in site 1, and SYSD, SYSE, SYSF in site 2 execute the mission-critical workload. SYST, SYSU and SYSV are standby systems that provide processing resources when a production system is unavailable or a site is unavailable. The 9672-X37 in site 2 has the CBU feature and can be expanded to a 9672-XZ7 during an emergency. The controlling system, 1K, coordinates GDPS processing. The primary copy of data (P) in site 1 is mirrored to the secondary copy of data (S) in site 2. The online transaction processing and data sharing-related CF structures reside in CF1 in site 1 and other structures reside in the ICF in site 2.

When site 1 experiences a failure or disaster, GDPS will freeze the secondary copy of the data at the first indication of a problem to maintain data consistency. When GDPS detects that the last system in site 1 is no longer functional, it will initiate a site takeover. The secondary storage control units will be reconfigured to simplex mode; the CF and the Couple Data sets will be reconfigured; and automatic CBU activation will expand the X37 to an XZ7, thus providing the processing resources needed to execute the mission-critical workload. SYSA, SYSB, SYSC, SYSD, SYSE and SYSF will be re-IPLed, the CF structures will be rebuilt in the ICF, and finally, the mission-critical workload will be restarted.
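The site 1 takeover sequence can be summarized as an ordered plan that is produced only once no site 1 system remains functional. The Python sketch below is a hypothetical illustration of that ordering using the system names from the example configuration. The function and its parameters are assumptions, and real GDPS automation is driven by NetView and System Automation, not by code of this form.

```python
def site1_takeover_plan(site1_systems, functional_systems, production_systems):
    """Return the ordered takeover steps, or [] if site 1 is still alive.

    site1_systems: names of the systems located in site 1.
    functional_systems: names of systems currently functional.
    production_systems: systems to re-IPL as part of the takeover.
    """
    # Takeover starts only when the last site 1 system is gone.
    if any(name in functional_systems for name in site1_systems):
        return []
    plan = [
        "reconfigure secondary storage control units to simplex mode",
        "reconfigure the CF and the Couple Data sets",
        "activate CBU to expand the X37 to an XZ7",
    ]
    plan += ["re-IPL " + name for name in production_systems]
    plan += ["rebuild CF structures in the ICF",
             "restart mission-critical workload"]
    return plan


site1 = ["SYSA", "SYSB", "SYSC"]
production = ["SYSA", "SYSB", "SYSC", "SYSD", "SYSE", "SYSF"]
for step in site1_takeover_plan(site1, {"SYSD", "SYSE", "SYSF"}, production):
    print(step)
```

The freeze of the secondary data happens earlier, at the first indication of a problem, so it is deliberately not part of this plan.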

When site 2 experiences a failure, remote copy processing suspends and one of the systems in site 1 assumes the controlling system role, while site 1 continues to execute the mission-critical workload. If cloned data sharing applications are active across all production systems, site 1 users will continue to execute with no impact and former site 2 users can log on again to site 1. Automatic CBU activation will expand the R46 to an RX6, thus providing the processing resources needed to restart any mission-critical site 2 workload in site 1. SYSD, SYSE and SYSF will be re-IPLed in site 1, the CF structures will be rebuilt in CF1, and finally, the mission-critical workload running on these systems will be restarted.

Summary

GDPS provides all the resource sharing, workload balancing and continuous availability benefits of a Parallel Sysplex. It significantly enhances the capability of an enterprise to recover from disasters and other failures, and to manage planned exception conditions. GDPS helps a business to achieve its own continuous availability and disaster recovery goals.

About the Author: Noshir Dhondy is an Advisory Engineer in the S/390 Parallel Sysplex Product Development organization in Poughkeepsie, N.Y. He can be reached via e-mail at dhondy@us.ibm.com.