High Noon -- Backup and Recovery: What Works, What Doesn't and Why

What does and doesn't work in today's business environment? Take a look at earlier methods of data backup and recovery, see how far the discipline has come and learn what is still effective. Those who find themselves somewhere between today's ideal environment and yesterday's outmoded practices will learn why they need to bring their data backup and recovery methods up to date.

Data backup and recovery methods have come a long way over the last 20 years. The good news is that many of the limitations of earlier methods have been overcome, and we are many steps closer to understanding the best approaches for ensuring effective data backup and recovery. The bad news is that many organizations still work in settings that fall well short of that ideal; for many, outmoded methods remain in use.

Simplicity is best. In today's complicated technology world, however, it is often difficult to find a simple way of doing anything. When it comes to data backup and recovery, simplicity is a must: recovery plans must include simple-to-follow procedures that can be executed by non-technical staff if technical staff is unavailable. This requires that the tool set of today's IT recovery plan be all-inclusive, meaning that data must be grouped together and that data units must be easily transported, stored, recovered and made ready for processing at an alternate site.

Application-Centric Backup & Recovery

Until Aggregate Backup and Recovery Support (ABARS) was introduced as a no-cost feature in DFHSM 2.4, backup and recovery was volume-centric. ABARS has many features that eliminate the problematic recovery issues of volume-centric backup, such as support for tape data and migrated data and the resolution of data synchronization issues caused by multi-volume data sets. ABARS also supports recovery to empty catalogs. User catalog allocation information is copied to an ABARS backup and then recovered at the alternate site. The recovered catalog is empty; it simply contains the alias pointers and is fully connected to the master catalog. As data is recovered, the catalog is repopulated. This catalog support eliminates the problems caused by volume-centric backup methods, such as dangling catalog pointers, failed catalog recovery and missing catalog records. ABARS is the only backup and recovery tool that supports all data regardless of where it resides: on DASD, on tape or in DFSMShsm migration. It is also the only tool that provides support for the environment's user catalogs and for DFSMShsm's migration control data set (MCDS) and offline control data set (OCDS).

Data in migration presents challenging issues for backup and recovery tools other than ABARS. Because of varying migration practices, it is often necessary to include migrated data sets in ABARS backups; these files are required in order to continue processing at an alternate site. As an example, if a weekly production batch cycle produces an output generation data set that is input to a monthly cycle, the backup of the application must include the output files created each week throughout the month, as sketched in the JCL fragments below. Suppose you are currently processing the third week of the month, so three output files have been created. If a disaster strikes, the recovered backup includes all three generation data sets. Processing moves forward into the next weekly cycle and then the monthly cycle, which requires all four generation data sets. The previous three generation data sets have not been accessed since they were created and were most likely in migration when you backed them up. This is why migrated data is required for application recovery.
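
To make the generation data group (GDG) pattern concrete, the JCL fragments below sketch how the weekly and monthly cycles might reference the generations. This is only an illustration: the data set name PAY.WEEKLY.DETAIL, the DD names and the space values are hypothetical, and DCB details are omitted.

//* Weekly cycle: each run catalogs a new generation of the detail file
//WKLYOUT  DD DSN=PAY.WEEKLY.DETAIL(+1),DISP=(NEW,CATLG,DELETE),
//            UNIT=SYSDA,SPACE=(CYL,(50,10),RLSE)
//*
//* Monthly cycle: reads every weekly generation created during the month
//WEEK1    DD DSN=PAY.WEEKLY.DETAIL(-3),DISP=OLD
//WEEK2    DD DSN=PAY.WEEKLY.DETAIL(-2),DISP=OLD
//WEEK3    DD DSN=PAY.WEEKLY.DETAIL(-1),DISP=OLD
//WEEK4    DD DSN=PAY.WEEKLY.DETAIL(0),DISP=OLD

By the third week, the existing generations have not been touched since they were written and have likely been migrated, which is exactly why an application backup that excludes migrated data cannot support the monthly cycle at a recovery site.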

Another issue with migrated data is that all of it is clustered together on ML2 tape; there is no separation of test, production and TSO data. Without ABARS to back up the migrated data as part of an application, the ML2 tapes must be duplicated and the alternate set sent to the vault. This process presents two issues. First, you must bring all data in migration to the recovery site, regardless of its criticality. Second, you must back up DFSMShsm's MCDS and OCDS in synchronization with the data. The latter is not as simple as it may sound, since data is constantly moving in and out of migration.

ABARS solves both of these issues. Since ABARS can back up data in migration, data can be separated into different applications: production batch applications, third-party software, TSO data, test data, etc. These separate applications can then be backed up using ABARS, and once at the recovery site, data can be recovered according to the Business Impact Analysis (BIA) requirements. When data is backed up using ABARS, the catalog record and the migration control record are included in the backup. Therefore, the metadata environment -- DFSMShsm's MCDS and OCDS -- can be created and initialized empty at the recovery site, and user catalogs are allocated empty and fully connected. When the data is recovered, data in migration returns to migration: the catalog record and the migration control record are placed in the user catalog and the MCDS, respectively, and the newly created ML2 tape VOLSER is recorded in the OCDS.
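
As a rough illustration of the mechanics, assume a payroll application has been defined to DFSMS as an aggregate group named PAYROLL (the group name and the PAY.* data set masks are invented for this sketch). The aggregate group itself is defined through ISMF and the SMS source control data set; the selection data set it points to might contain entries such as:

INCLUDE(PAY.PROD.** PAY.WEEKLY.DETAIL.**)
ACCOMPANY(PAY.ARCHIVE.TAPE.**)
ALLOCATE(PAY.WORK.**)

INCLUDE names the data sets, on DASD or in migration, to be copied into the aggregate; ACCOMPANY lists tape data sets that travel to the recovery site with the aggregate rather than being copied; ALLOCATE lists data sets to be allocated empty at the recovery site. The backup could then be driven from a batch TSO step along these lines:

//ABACKUP  EXEC PGM=IKJEFT01
//SYSTSPRT DD SYSOUT=*
//SYSTSIN  DD *
  HSENDCMD WAIT ABACKUP PAYROLL EXECUTE
/*

At the alternate site, an ARECOVER of the same aggregate rebuilds the data along with the catalog and DFSMShsm control records described above. Treat this as a sketch of the general shape, not a complete implementation.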

A Brief History of ABARS

Volume-centric disaster recovery is generally the responsibility of the data center's Technical Support/Storage Administration staff. It is only natural that the storage administrator, whose responsibility it is to manage the storage environment, be tasked with maintaining the dump jobs for disaster recovery. The application teams' involvement was limited to backing up tape data sets for off-site vaulting and maintaining the recovery script; the majority of the data recovery effort and responsibility sat with the Storage Administration and Technical Support team. ABARS continued to support this paradigm by requiring that the aggregate definition be added to the DFSMS subsystem's source control data set and then be activated in the DFSMS subsystem.

Storage administrators found themselves stuck between a rock and a hard place. They recognized the need to implement ABARS but did not have the management support, training and resources to redesign their recovery plans around it. Furthermore, they faced an even bigger challenge: the task of identifying the critical data to be backed up.

The Benefits of Automated Software Tools

With much determination, many storage administrators did get the support and application resources needed to manually create lists of critical data sets and implement ABARS. Application teams shared the responsibility for business resumption by creating and maintaining the lists. However, disaster recovery testing continued to be plagued by recovery issues; this time the big-ticket item was missing data sets. It was often not understood that every data set referenced in the application's production JCL with a disposition of OLD, SHR or MOD is a critical data set and must be included in the ABARS aggregate to avoid a "JCL ERROR -- Data Set Not Found" condition. Attempting to manually identify all critical data for application backup and recovery is an enormous and nearly impossible task for most IT organizations, even in the best of circumstances. Therefore, using ABARS without automation tools to identify critical data sets can be considered an incomplete and often inaccurate implementation.
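
As an illustration, consider the hypothetical production step below (the program name and data set names are invented). Every data set the step references with a disposition of SHR, OLD or MOD must already exist when the job runs, so each of those belongs in the aggregate; only the data sets the step creates are not restore prerequisites.

//PAYCALC  EXEC PGM=PAYCALC
//* Existing data sets referenced as input or update: recovery prerequisites
//MASTER   DD DSN=PAY.MASTER.FILE,DISP=SHR
//TRANS    DD DSN=PAY.WEEKLY.DETAIL(0),DISP=OLD
//CHKPOINT DD DSN=PAY.RESTART.CHKPT,DISP=MOD
//* Created by this step: not needed in the backup
//NEWRPT   DD DSN=PAY.RUN.REPORT,DISP=(NEW,CATLG,DELETE),
//            UNIT=SYSDA,SPACE=(TRK,(15,5),RLSE)

Miss any of the first three at the recovery site and the job fails with exactly the "data set not found" JCL error described above.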

Automation tools permit a more efficient utilization of resources and increase the accuracy of critical data set selection. They eliminate the need for repeated human intervention in the identification process, reducing the potential for error.

Automated tools also ensure that changes to applications are picked up automatically, each and every day: for example, when an application programmer adds a few more steps, changes an input data set name or removes some obsolete programs.

DASD Remote Mirroring

Relatively new in the data recovery arena is DASD Remote Mirroring. These technologies provide the ability to interconnect physically separated storage subsystems with a communications link. I/O processes are duplicated at the remote storage devices either by storage subsystem controller microcode or by processor-based services, depending on the specific implementation and vendor used. By precisely duplicating every I/O and every byte that was transferred to the local storage subsystem, DASD Remote Mirroring offers the promise of simpler and more effective recovery from a variety of disasters.

As with any new solution, however, cost dictates whether the technology will revolutionize the data backup and recovery world. These costs include more than just redundant DASD capacity; tape data and bandwidth costs must also be considered.

Another problem with DASD Remote Mirroring is that it cannot fully replace traditional data backups. With DASD Remote Mirroring, an exact duplicate of the data is created on the redundant DASD device; there is no "before" and "after" copy. Applications must continue to create backup copies of critical files in order to "back out" and "rerun" should the data be corrupted or an incorrect program change be applied. Online database backups are also needed for "on-site" data recovery.

Combining ABARS and DASD Remote Mirroring

A combination of ABARS and DASD Remote Mirroring can provide a cost-effective solution for business continuity. ABARS can continue to provide the "before" backups required for on-site recovery and can effectively provide data recoverability for less critical applications, test data and other data identified by an organization as non-critical. An automation tool such as ASAP can be used to identify exactly which highly critical data should be placed on remotely mirrored DASD devices. The resulting hybrid implementation uses the new technologies, with their promise of high availability, for the most critical data, such as online databases and critical application data, while continuing to use traditional technologies, such as ABARS, for all other data, including tape data and data in migration.

Other Methods

For decades, full volume dump and restore was the primary tool used for disaster recovery. Gone are the days when, every weekend, the entire data center was quiesced in order to back up every DASD volume in the storage environment. To support catalog synchronization, volumes containing catalogs were backed up last to ensure that all catalog records were captured. Since full volume dump doesn't support tape-resident data, tape data sets were copied to alternate tapes (within each application's batch cycle) and sent to the vault daily. This backup method, although mostly manual, was simple, and synchronization was rarely an issue.
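
For reference, a weekend full volume dump of that era typically amounted to little more than one DFSMSdss (or equivalent) job step per volume, along the lines of the sketch below. The volume serial, unit type and output data set name are hypothetical.

//FULLDUMP EXEC PGM=ADRDSSU
//SYSPRINT DD SYSOUT=*
//DASD     DD UNIT=3390,VOL=SER=PROD01,DISP=OLD
//TAPE     DD DSN=DR.VOLDUMP.PROD01,UNIT=TAPE,
//            DISP=(NEW,CATLG,DELETE)
//SYSIN    DD *
  DUMP FULL INDDNAME(DASD) OUTDDNAME(TAPE) -
       ALLDATA(*) ALLEXCP
/*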

By the late 1980s, DASD storage requirements began to increase at unprecedented rates, and it became impossible to complete full volume dumps within the time allowed. Applications added to the strain by requiring longer online availability and additional batch processing. Incremental backup using DFSMSdss, FDR and other third-party products became the tool of choice, decreasing backup time by backing up only the data sets that were new or changed. However, with incremental backup it is still necessary to take a full volume dump at some point.

Incremental backup is a volume-centric technology: it is implemented to back up all new or changed data on a particular volume. Tape and migrated data sets are not supported and must be duplicated using other utilities and sent off-site. DFSMSdss supports the selection of specific data sets, or data set name masks, to back up across a group of volumes. This allowed for a pseudo application-centric implementation, in that all data sets belonging to a specific application could be included in the backup as long as they resided on DASD. As in an ABARS implementation, DFSMSdss requires the application teams to identify a list of critical data sets for each application. Masked data set names were commonly used to include the many data sets matching a partial name, PAY.** for example, but data set name masking results in backing up more data than is required for recovery. Other issues plaguing DFSMSdss application-centric backup are that the command syntax is difficult to code, special keywords are required for some types of data and, of course, there is no support for tape-resident data or data in migration.
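
A sketch of the pseudo application-centric approach described above: a DFSMSdss logical dump that selects data sets by a name mask (the PAY.** mask and the output data set name are hypothetical). Everything on DASD matching the mask is dumped whether or not recovery actually needs it, which is the over-backup problem noted above, and anything on tape or in migration is missed entirely.

//APPDUMP  EXEC PGM=ADRDSSU
//SYSPRINT DD SYSOUT=*
//TAPE     DD DSN=DR.PAY.DSSBKUP,UNIT=TAPE,
//            DISP=(NEW,CATLG,DELETE)
//SYSIN    DD *
  DUMP DATASET(INCLUDE(PAY.**)) -
       OUTDDNAME(TAPE) -
       TOL(ENQF)
/*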

About the Author: Colleen Gordon is the Technical Account Manager for Mainstar Software. She co-wrote the IBM Redbook DFSMShsm ABARS and Mainstar Solutions. She can be reached at colleen@mainstar.com, or visit Mainstar Software at www.mainstar.com.


Over the past few years, vendors of disk systems have been touting various forms of Remote Copy or Remote Mirroring as a Disaster Recovery (DR) solution. Remote copying or mirroring can be part of a DR solution, but left on its own it can make the recovery process ineffective. Just copying data can result in copying the disaster: as data is corrupted at the primary site, it is also corrupted at the secondary site. Although mainframe data centers have been the primary focus for DR solutions, the need is just as apparent in NT, open systems and other operating environments.

In a DR situation, you need to restart critical applications at a remote site with full data integrity, without relying on primary site personnel. The restart must come from a known point in time where data has guaranteed integrity. Most business applications can tolerate a loss of some transactions, but not corrupted data. Any effective DR solution has to be cognizant of this fact. But, to understand how to prepare for a disaster, you have to understand the nature of a probable disaster.

Vaporization or Meltdown?

It is highly unlikely that a disaster would vaporize a data center in a nanosecond. A more likely scenario is a flood, earthquake or hurricane, in which the disaster unfolds over a period of several minutes. Disk drives, tape drives, servers and network controllers fail randomly. Logs and their databases are usually on different disk subsystems, so they will not fail in sync. Databases typically rely on dependent writes to ensure data is written in the correct sequence, and databases with deferred writes may need to write across multiple disk systems; some of these deferred writes may still be in server memory.

Error recovery procedures within the operating system, the application and various hardware units may be invoked automatically. While some units will fail and some may come back, others may not. Multiple disk drive failures may occur. RAID disk products are designed to mask a failure of one disk drive within a domain, but not multiple drive failures.

It is during these several minutes while the disaster is happening, called a "rolling disaster," that data gets corrupted. Unless you can freeze the copying of data immediately before the disaster begins, a Remote Copy function may well replicate the disaster. And if the data is corrupted at the remote site, how do you begin to restart the critical applications?

At several seminars held by the Evaluator Group, we have painted the following picture for mainframe data center technicians. We ask them to imagine that a person walks into their data center and, over several minutes, randomly turns off every machine. We then ask them how long it would take to restart their critical applications, and what they would do.

Universally, the first thing they want to do is get a copy of the tape with the last mirror image of their databases. Recovery time is usually in days. When they have completed their recovery picture, we make this observation: None of them had confidence in their data as a result of this "rolling disaster." If they did not have confidence in this data, what confidence would they have in a remote copy of this data? And if they had no confidence in this remote data and would not use it in an emergency, why do they have a remote site and why implement remote mirroring?

One user reported that a customer engineer had inadvertently switched off one physical volume -- not usually a problem. But, in these days of mapping volumes or virtual volumes across multiple physical drives, error recovery can be tricky. In this instance, a simple unexpected powering off of one physical volume containing part of an IMS database resulted in corruption of the whole database.

The majority of data centers still rely on copying data to tape and trucking the tapes to a remote site. This is not without its own problems, and it becomes a real challenge when sites have multiple terabytes to restore. For a true DR solution involving mirroring, you need to ensure that all writes are time-stamped to guarantee data is written in the correct sequence; disk subsystems do not share a common clock, so the server must provide this feature. Rather than just copying data from one storage system to another, you need to be able to transfer the application with a rapid restart capability. Vendors need to provide a recovery mechanism that captures a restartable point before the disaster corrupts the data and ensures a quick application restart at the remote site.

Geographically Dispersed Parallel Sysplex

Probably the best solution today is IBM's Geographically Dispersed Parallel Sysplex (GDPS). GDPS is architected to transfer the application and the transactions to the remote site, rather than just the data. Considering there are thousands of reads and writes for every transaction, this seems a wise choice.

GDPS uses peer-to-peer remote copy (PPRC), a synchronous copy solution that is also offered by Amdahl, Hitachi Data Systems and StorageTek. EMC offers a version of its Symmetrix Remote Data Facility (SRDF) that operates with GDPS: an EMC-supplied STARTIO exit routine traps PPRC commands and issues the appropriate SRDF equivalent. With SRDF, GDPS customers can also use the EMC TimeFinder point-in-time copy feature.

The disadvantages of GDPS are the limited distance it supports and the cost and complexity of implementing the product.

One user we spoke to recognized this exposure with remote copy products. He told his management that the proposed remote copy solution would not work in a real-world disaster; therefore, rather than spending the money on a remote site and remote copy software, they should simply pretend they had a remote site. In the event of a disaster, they would be in the same situation as if they had a remote site, but without the associated cost. Cynical as that user may be, he was not confident that the vendor could deliver a DR solution that would actually work. Given the cost and the criticality of DR, you should apply due diligence before deciding which offering you want to bet your company on. Unfortunately, that involves more than getting a few vendors in and listening to their promises.

You need to get the vendors to supply the necessary equipment so you can simulate your own disaster. For a remote mirroring solution, the vendor needs to supply a primary and a secondary storage system. Create a script to run against a copy of your production database; then, while the test suite is running, simulate a rolling disaster by randomly switching disk volumes off, and sometimes on again, over a few minutes. Finally, test the integrity of the data on the secondary system. You may also become a cynic!

Dick Bannister is a Senior Partner at the Evaluator Group Inc., an industry analyst organization focused on storage products. He can be reached via e-mail at dick@evaluatorgroup.com.
