Tips and Techniques for Running a Successful OS/390 Disaster Recovery Test

Disaster Recovery (DR) testing is costly, both in out-of-pocket expenses and in strain on the systems staff. Companies that outsource to a third-party DR site spend thousands of dollars per month to archive their data, with additional per-hour charges if they run over their limited testing window. The systems staff work overtime to plan the DR procedures and neglect other projects. These tests need to run successfully -- on time and on budget.

The first step of any DR is rebuilding the basic OS/390 environment. The Integrated Catalog Facility (ICF) catalogs are essential assets for the system and applications. Recovery of ICF catalogs and synchronization with subsystem and application data are critical to the overall success of DR. This task becomes more complex as businesses strive to continually reduce data loss and outages. Proper planning, especially for ICF catalog recovery, will keep the OS/390 recovery from becoming a bottleneck that prevents recovery of all the applications that run on the S/390 system.

DR Planning: Decisions about Data

Why should there be a DR plan for business data? Why should there be a specific concern for the ICF catalogs? What impact would be felt if business critical data was lost? If a disaster occurred, how much time and money would it cost to replace this data? The data could be anything from customer information to accounts payable and even more importantly accounts receivable. What is the effect on businesses if data is unavailable? A disaster might be an earthquake, a fire or a program that corrupted the data. To protect against data loss, steps must be taken to allow the reconstruction of the data.

First, your business must determine what data is critical for business continuance. System data sets, production databases and applications data sets may be deemed critical while test system data sets may not.

After identifying the critical data, determine how much data loss you can tolerate. This, in large part, will determine the method and frequency of backups and data preservation. If the data is not volatile, weekly backups may be sufficient. For volatile data, daily backups or multiple backups per day may be required.
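The link between tolerable data loss and backup frequency can be worked out with simple arithmetic. The sketch below assumes evenly spaced backups and treats the gap between two backups as the worst-case loss; the function name and the weekly planning window are illustrative choices, not part of any product.

```python
import math


def backups_per_week(tolerable_loss_hours: float) -> int:
    """Smallest number of evenly spaced backups per week that keeps the
    worst-case loss (the gap between two backups) within the tolerance.

    Assumption: loss is measured only as time since the last backup.
    """
    if tolerable_loss_hours <= 0:
        raise ValueError("tolerable loss must be positive")
    hours_per_week = 7 * 24
    # At least one backup per week, even for very tolerant data.
    return max(1, math.ceil(hours_per_week / tolerable_loss_hours))
```

For example, data that can tolerate a week of loss needs one backup per week, while data that can tolerate only six hours of loss needs 28 backups per week, or four per day.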

Methods of data restoration include manual reentry, copying data from backups, and reapplication of transactions. Data backups may exist on removable media (tape) or may have been electronically transmitted to another site.

If backups of the data are created onsite, a question arises as to where to keep them. Should they be kept onsite? If so, they may be kept in the tape library or a safe. Should they be sent offsite? A local or remote storage site may be available. Should multiple copies be created?

Now that this critical data is backed up and stored safely away, can it be retrieved, restored, and used? What if the data that is lost is an ICF catalog? If a backup of the catalog was taken, how old is the backup, and how far out of synch with the rest of the data will it be when restored?

Role of the ICF Catalog

An ICF catalog is a Yellow Pages for mainframe application files; it lists where data resides. You use the Yellow Pages to find a business and its location. If the Yellow Pages were the only directory and it was lost, there would be no way to know the location of a business or if it even existed without searching all the city streets. If an outdated Yellow Pages were used, a business may have moved or may not exist anymore. The information in the old Yellow Pages would not contain information about businesses that came into existence after it was published.

When an ICF catalog is restored, it reflects the condition of the data at the time of the backup. The data that currently exists may or may not reside in the same place. Some of the data pointed to by the catalog may not exist. Data created after the backup will not exist in the catalog. Therefore, the catalog needs to be restored to the same point in time as the application data.

Options for Restoring Catalogs

Several utilities can be used to back up and restore a catalog. Most of these utilities restore the catalog with the same attributes it had when it was backed up. It is important to define aliases from the master catalog to the newly restored user catalog. Some catalog backup and restore utilities define these aliases automatically. Access to the catalog should not be allowed while it is being restored.

After the catalog has been restored, how can it be brought forward to the proper point in time? Again, several methods and products can accomplish this. Look for a method that identifies the proper catalog backup (the closest prior to the desired point in time), reports all the required SMF backup files, and constructs the appropriate JCL.
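The selection logic described above can be sketched in a few lines. This is a minimal illustration, assuming each catalog backup and each SMF backup file is tracked with timestamps; the class names, field names, and data set names are hypothetical, and the sketch stops short of generating JCL because that depends on the recovery utility in use.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class CatalogBackup:
    dataset: str          # hypothetical backup data set name
    taken_at: datetime    # when the catalog backup was taken


@dataclass(frozen=True)
class SmfFile:
    dataset: str            # hypothetical SMF backup data set name
    first_record: datetime  # timestamp of the earliest record in the file
    last_record: datetime   # timestamp of the latest record in the file


def pick_backup(backups, target):
    """Return the catalog backup taken closest to, but not after, the target time."""
    candidates = [b for b in backups if b.taken_at <= target]
    if not candidates:
        raise ValueError("no catalog backup precedes the target time")
    return max(candidates, key=lambda b: b.taken_at)


def required_smf(smf_files, backup, target):
    """Return, in time order, the SMF files that overlap the window
    from the chosen backup to the target time."""
    needed = [
        f for f in smf_files
        if f.last_record >= backup.taken_at and f.first_record <= target
    ]
    return sorted(needed, key=lambda f: f.first_record)
```

Given nightly catalog backups and a target time in the evening of the second day, `pick_backup` returns the second night's backup, and `required_smf` returns only the SMF files whose records fall between that backup and the target.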

DR Process Pitfalls

To restore an OS/390 system, several groups must work together:

  • The systems programmer staff is charged with backing up the ICF catalogs, the tape catalog, the system data sets, and other selected data sets.
  • The DB2 database administrator is charged with backing up the DB2 databases.
  • The IMS database administrator is charged with backing up the IMS databases.
  • An application administrator is charged with backing up VSAM data sets accessed by CICS for a business critical application.

These groups should work together, but in the real world people often do not coordinate their efforts.

First, the systems programmers use full volume backup to back up the system data sets needed to install a system for Initial Program Load (IPL). They back up Catalog 1, Catalog 2, Catalog 3 and the tape catalog. The systems programmers also capture and save the SMF data.

Next the DB2 administrator backs up the DB2 databases and their appropriate log files and the IMS administrator backs up the IMS databases and their appropriate log files. Then the applications administrator backs up the CICS VSAM files.

The following diagram shows the order of events when the backups occur.

  • C1 is the catalog containing the data sets for which the systems programmers are responsible.
  • C2 is the catalog containing the DB2 data sets.
  • C3 is the catalog containing the IMS and CICS VSAM data sets.
  • FV1-FVn are full volume backups

The systems programmers install and IPL the system, restore the volumes that have full volume backups, and then import catalogs C1, C2 and C3. Next they restore the tape catalog. Tape activity performed since the backup is not reflected in the tape catalog.

At this stage of the recovery, the ICF catalogs have not been restored to the same point in time as the application data, and that creates a problem. The DB2 administrator tries to restore the DB2 databases, but the catalog C2 backup and the tape catalog backup were both taken before the DB2 administrator ran the DB2 backups. Therefore, neither catalog contains the DB2 backup information. The same is true for restoring the IMS databases and the CICS VSAM files. Because the catalogs are not in synch, the DB2 administrator, the IMS administrator and the application administrator have to produce special JCL to restore their files. To add to the confusion, catalog C1 is not in synch with the system data sets restored by the systems programmers using full volume restore.
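The mismatch comes down to a timestamp comparison: any backup data set created after the catalog backup is unknown to the restored catalog. The sketch below illustrates this under the assumption that creation times for the backup data sets are available; the function name and the data set names in the usage note are hypothetical.

```python
from datetime import datetime


def uncataloged_backups(catalog_backup_time, backup_datasets):
    """Return the names of backup data sets that a catalog restored from
    the given backup cannot know about.

    backup_datasets maps a data set name to its creation time. Anything
    created after the catalog backup is missing from the restored catalog
    and would need special JCL to locate.
    """
    return sorted(
        name
        for name, created in backup_datasets.items()
        if created > catalog_backup_time
    )
```

If the catalog backup ran at 01:00 and a DB2 image copy named `DB2.IMAGECOPY.A` was written at 02:00, the function reports `DB2.IMAGECOPY.A` as uncataloged, while a full volume dump written at 00:30 is not reported.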

Tips and Techniques

To resolve the catalog synchronization problem, bring the ICF catalogs forward using SMF data. With a manual method, the systems programmers create the JCL themselves; the challenges are creating that JCL, keeping track of the SMF files, and keeping track of the ICF catalog backups. Together, these resources make up the ICF catalog forward recovery set and must be sent offsite as a set for disaster recovery. With ICFRU (the Integrated Catalog Forward Recovery Utility) or an automated method, the ICF catalogs can be brought forward to a particular point in time, limited by the SMF data available. The tape catalog, however, would still be out of synch and has to be handled separately.

At the disaster recovery site, the systems programmers run the full volume restores. Next they use the pre-produced JCL to bring forward catalogs C1, C2 and C3 to the designated point in time. Then they import the updated exported copy. The tape catalog should be restored and brought forward to match catalogs C1, C2 and C3. Now the ICF catalogs and the tape catalog are current to the particular point in time desired.

The DB2 administrator can now restore the DB2 databases using current catalogs. The IMS administrator can now restore the IMS databases using current catalogs. The application administrator can now restore the CICS VSAM data sets using current catalogs.

A DR procedure must recover ICF catalogs and synchronize them with subsystem and application data for successful restoration of the OS/390 system and its applications. Careful planning and preparation are a must for a successful DR test. Automate these steps where possible. Relying on manual processes to recover all necessary files can lead to incomplete file structures and unsuccessful recoveries. Automation tools that help with planning, recovery coordination, JCL generation and analysis reduce the think time needed to react and the total time needed to recover.

Mike Koteras and Jim Whisenant are Software Developers for BMC Software. They can be reached via e-mail at mike_koteras@bmc.com and james_whisenant@bmc.com.
