Subsystem Recovery for ERP Systems Made Simple

ERP systems are typically business-critical applications that must remain online. System and database administrators must juggle the pressing requirements for fail-safe and responsive recoveries with non-stop database availability. Additionally, finding and keeping well-trained ERP systems staff is becoming increasingly difficult. Although subsystem recoveries for ERP systems are difficult, there are several strategies and tips to design a fast and effective ERP recovery strategy.

Corporations today face two related challenges: demand for 24x7 availability and the need to avoid downtime. The exposure is greatest for companies running Enterprise Resource Planning (ERP) systems on the mainframe. These companies select the highly available, scalable and reliable S/390 platform for ERP databases because their systems house enormous amounts of data. Because ERP applications such as SAP R/3 and PeopleSoft integrate functions across an organization, ERP outages have a broader impact than interruptions in less sophisticated applications. Recovery is both more important and more difficult for ERP systems on S/390.

The Problems

Technical Challenges of Point-in-Time Recovery. Large ERP systems can easily exceed 7,000 tablespaces and 19,000 indexes resulting in extremely difficult problems for backup and recovery processing. When considering the vast amounts of stored data and the corporate investment made to acquire and manage the data, the importance of being able to effectively recover the data becomes immediately apparent.

The Problems with Recovery. Recovery takes time. In reviewing the complexity of recovery, it must be remembered that applications consist of large integrated databases. Application modules or objects and their associated data are both physically and logically connected and need to be recovered at the same time. SAP and PeopleSoft do not use DB2 referential integrity, so the relationships between objects are not recorded in DB2. This makes it hard, and sometimes impossible, to detect that the recovery of object "A" also necessitates the recovery of object "B". Thus, to maintain data integrity, a recovery to a point-in-time must include the entire subsystem. Such a recovery is very difficult when a recovery starting point (quiesce point) cannot be identified and when the window in which batch processing can take place is very small.

Traditional point-in-time recovery requires a point of consistency. In order to obtain a point of data consistency in an ERP database, a QUIESCE command can be used. However, this would require an outage of the database(s), which is undesirable. Also, many point-in-time recoveries must be to points in time other than quiesce points. Therefore, employing a DB2 conditional restart is the better method for obtaining a point of recovery. It requires the selection of a log point to be used in the conditional restart control record and submitting the appropriate JCL for the recovery of the DB2 subsystem and all objects.

The Problem with Backup. Because point-in-time recovery of ERP systems affects every object in the system, every object requires a backup. Indexes can be rebuilt from table space data, but even limiting backups to the table spaces means that 5,000 to 7,000 or more objects must be backed up. Hours of processing can be involved to obtain a single backup of the entire system, and due to the dynamic nature of the system, knowing precisely what to back up is difficult. In addition, splitting the backup process into multiple concurrent tasks is cumbersome and time-consuming.

The Problem with Recovery. Recovery of ERP systems presents the same problems as backup. Table space recovery starts from the most recent image copy, with the log applied up to the desired point in time. Options available for index recoveries include rebuild from table space data or recovery from image copies. Rebuilding indexes may result in extremely high recovery times due to the sort process and the multiple passes of data. In addition, recovering from image copies results in high recovery times due to the overhead of data set allocation and open.

The Problem with Think Time. Research by a major analyst group indicates that most enterprises using SAP, for example, have target recoveries of less than 24 hours. Up to one third of that recovery window (8 hours) may be consumed by the decision process, which involves identifying the scope of recovery, the recovery resources (image copies, incremental image copies, log records) needed for recovery, and the batch processing that must be performed to restore the data. The recovery ‘think time’ can easily shorten the actual window in which the technical recovery takes place.

Hardware Alternatives. Recently, IBM announced a facility to enable hardware copies to be taken for DB2 subsystem restart-based recovery. It provides a restart-based recovery to a particular point-in-time without requiring the update outage of a QUIESCE. However, these types of backups cannot be used by the RECOVER utility for other types of recovery, such as recovery to current for media failure or recovery to other points in time.

Fast recoveries using standard backup and recovery techniques should use the latest tape technology, maximizing parallelism by using multiple tape drives and channels. Another option to decrease recovery time is to increase the frequency of copies, which will reduce the volume of transaction logs that need to be applied. Shadowing and mirroring techniques across multiple geographic locations can significantly reduce backup time by 30 minutes to several hours. However, these can be very costly.

The Solutions: Five Key Steps

ERP system administrators can employ new rapid recovery techniques to significantly reduce recovery-related downtime.

Begin with Automation. Because backing up and recovering an application that has more than a few hundred objects can be difficult and time consuming, a solution that fully automates the backup and recovery process for ERP systems is recommended.

A technique to take all DB2 application data in the ERP system and split it quickly and automatically into groups based on size can be utilized. The groups are then used to generate backup or recovery JCL for the entire system. Revalidation can be performed on a regular basis to ensure that any objects dropped or created since the groups were established are detected and handled appropriately.
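The grouping and revalidation steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method used by any particular product: the object names and sizes are hypothetical, and the balancing rule (largest object first, always into the currently lightest group) is one common heuristic among several.

```python
import heapq

def balance_groups(sizes, group_count):
    """Split objects into group_count groups of roughly equal total size.

    sizes maps a (hypothetical) object name to its size in any unit.
    Greedy heuristic: take objects largest first and place each one
    in the group whose running total is currently the smallest.
    """
    heap = [(0, i) for i in range(group_count)]  # (running total, group index)
    heapq.heapify(heap)
    groups = [[] for _ in range(group_count)]
    for name, size in sorted(sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        total, idx = heapq.heappop(heap)
        groups[idx].append(name)
        heapq.heappush(heap, (total + size, idx))
    return groups

def revalidate(groups, current_names):
    """Return (created, dropped) object names since the groups were built."""
    grouped = {name for group in groups for name in group}
    current = set(current_names)
    return sorted(current - grouped), sorted(grouped - current)

if __name__ == "__main__":
    sizes = {"TS1": 50, "TS2": 30, "TS3": 20, "TS4": 20, "TS5": 10}
    groups = balance_groups(sizes, 2)
    print(groups)
    print(revalidate(groups, ["TS1", "TS2", "TS3", "TS5", "TS6"]))
```

Each resulting group would then drive one generated backup or recovery job, so that concurrent jobs finish at about the same time.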

Additionally, options that allow for backing up only the tablespaces that have changed since the last backup are critical. Depending upon the volatility of the system, this can drastically reduce the need for backups on some tablespaces, and subsequently reduce the time required for backup. Groups of objects can be copied and recovered together.

For example, assume an SAP environment has 27,000 objects. Compare that to the 28,800 seconds in an eight-hour workday. A database administrator considering each object in the system for one second would need to devote nearly an entire workday to reviewing the data. Realistically, a DBA might need 30 seconds to assign each object to a group. That would equate to roughly 28 workdays. On the other hand, automated recovery processes can divide the subsystem into balanced groups in just minutes. JCL can be generated to provide a full backup of the entire subsystem that could be run at regular intervals, for instance once per week. The ability to identify table spaces that have not been changed can be utilized to generate JCL to run copies on only the changed spaces on a more frequent basis. Again, unchanged objects would not need to be copied. As this example demonstrates, automation can result in astonishing time savings.
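The arithmetic behind the manual-effort figures is easy to check. The short Python calculation below uses only the numbers from the example (27,000 objects, an eight-hour workday, 1 or 30 seconds per object); the function name is just an illustrative choice.

```python
OBJECTS = 27_000
SECONDS_PER_WORKDAY = 8 * 60 * 60  # 28,800 seconds in an eight-hour day

def workdays(seconds_per_object):
    """Workdays needed to handle every object at the given pace."""
    return OBJECTS * seconds_per_object / SECONDS_PER_WORKDAY

if __name__ == "__main__":
    print(workdays(1))   # 0.9375, just under one workday
    print(workdays(30))  # 28.125, about 28 workdays
```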

Examine Recovery Avoidance. By performing analysis on the log to identify objects that have not changed between the current time and the recovery point, unnecessary recovery of unchanged objects can be avoided. When recovery JCL is generated only for the objects that have been updated between the selected point-in-time and the current time for local point-in-time recovery, recovery time can be reduced by 80 to 90 percent.
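The selection step amounts to a simple filter. The sketch below assumes that log analysis has already produced, for each object, the log point of its most recent update; the dictionary layout, the object names and the integer log points are hypothetical stand-ins, not a real DB2 log format.

```python
def objects_needing_recovery(last_update_point, recovery_point):
    """Return the objects updated after the chosen recovery point.

    last_update_point maps an object name to the log point of its most
    recent update (hypothetical values). Objects whose last update is at
    or before the recovery point are unchanged and can be skipped.
    """
    return sorted(
        name
        for name, point in last_update_point.items()
        if point > recovery_point
    )

if __name__ == "__main__":
    last_update = {"TSA": 900, "TSB": 1500, "TSC": 400, "TSD": 1000, "TSE": 2100}
    to_recover = objects_needing_recovery(last_update, recovery_point=1000)
    skipped = len(last_update) - len(to_recover)
    print(to_recover, f"{skipped} of {len(last_update)} objects skipped")
```

Recovery JCL would then be generated only for the returned objects; the savings depend entirely on how many objects actually changed after the recovery point.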

Identify Alternatives to Conditional Restart. Software also can be used to locate additional recovery points with log analysis to provide an alternative to the conditional restart recovery. For example, filtering can be used to locate quiet points and generate UNDO DDL for the catalog and directory. The UNDO DDL would be executed to roll back the catalog and directory to the desired point. If a recovery point is located, recovery JCL could be generated that would exclude unchanged spaces. Forward recoveries for any problem spaces could be built. This technique reduces the scope of the recovery and thereby diminishes the time needed to complete the recovery process.
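Locating quiet points can be pictured as a sweep over the units of work found in the log. The sketch below assumes a quiet point is any stretch of the log in which no unit of work is in flight, and that each unit of work is known by its begin and end log points; both the definition and the integer log points are simplifying assumptions for illustration.

```python
def quiet_points(units_of_work):
    """Return (start, end) log-point ranges with no unit of work in flight.

    units_of_work is a list of (begin, end) log points (hypothetical
    values). The list is swept in begin order while tracking the latest
    end seen so far; a gap opens whenever the next unit of work begins
    after everything before it has ended.
    """
    gaps = []
    latest_end = None
    for begin, end in sorted(units_of_work):
        if latest_end is not None and begin > latest_end:
            gaps.append((latest_end, begin))
        latest_end = end if latest_end is None else max(latest_end, end)
    return gaps

if __name__ == "__main__":
    work = [(10, 40), (20, 35), (60, 80), (75, 90), (120, 130)]
    print(quiet_points(work))  # [(40, 60), (90, 120)]
```

Any log point inside a returned range is a candidate recovery point that does not require a QUIESCE to have been taken in advance.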

Employ New Backout Features. The DB2 RECOVER utility always processes the log forward when performing a point-in-time recovery. This means that, in almost every case, either image copies or a pack backup must provide the basis for the log processing. Today, indexes are generally rebuilt from the tablespace data. Index rebuild may be avoided by recovering the indexes as well, but then the indexes also will require copies. The point-in-time recovery affects every page of every space involved in some way.

To avoid this, use software with backout capability. Backout features provide the ability to do a point-in-time recovery using undamaged spaces and backing out the changes from the log. This technique accomplishes point-in-time recoveries without using image copies or pack restores. No log is required prior to the point-in-time of the recovery. The tablespaces are merged with the sorted log records to avoid reading or writing pages unnecessary to the process. By multitasking the log read, recovering the table space data, and capturing the key data for index rebuild in a single pass, there can be a 50 percent to 70 percent reduction in time for ERP system recovery.
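The idea of backing out changes, rather than restoring a copy and rolling forward, can be shown with a toy model. The sketch below is not how any product or DB2 itself stores data or log records: the in-memory dictionary, the LogRecord layout and the integer log points are all hypothetical. It only shows that undamaged current data plus the log written after the target point is enough to reach that point.

```python
from typing import NamedTuple

class LogRecord(NamedTuple):
    log_point: int
    key: str
    before: object  # value before the change; None means the row did not exist
    after: object   # value after the change

def back_out(data, log, target_point):
    """Return the data as it stood at target_point.

    Only log records written after target_point are read. They are
    applied newest first, each one restoring its before-image.
    """
    restored = dict(data)
    later = [rec for rec in log if rec.log_point > target_point]
    for rec in sorted(later, key=lambda r: r.log_point, reverse=True):
        if rec.before is None:
            restored.pop(rec.key, None)  # the change was an insert: remove it
        else:
            restored[rec.key] = rec.before
    return restored

if __name__ == "__main__":
    log = [
        LogRecord(10, "A", None, 1),
        LogRecord(20, "A", 1, 2),
        LogRecord(30, "B", None, 9),
        LogRecord(40, "A", 2, 3),
    ]
    current = {"A": 3, "B": 9}
    print(back_out(current, log, target_point=20))  # {'A': 2}
```

Note that nothing before log point 20 is consulted in the usage example, which mirrors the statement that no log is required prior to the point-in-time of the recovery.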

Use Copy Alternatives. Three different copy alternatives are available: software cache copies, hardware snapshot cache copies and software snapshot copies. Software snapshot copies require intelligent storage devices and the appropriate software to utilize those devices. Copies utilizing the intelligent storage devices can result in time savings of up to 90 percent.

Robin Starnes is a Product Line Manager for BMC Software in the DB2 Backup and Recovery team. She has experience in a variety of system environments including OS/390, UNIX, and NT with IMS, IDMS, DB2, Oracle, and SQL Server database management systems.