In-Depth
A Business Perspective: Continuous Availability of VSAM Application Data
Due to improved price/performance of mainframe hardware and the impracticality of moving huge legacy data files to smaller platforms, mainframes are now a popular choice for client/server applications. Most mainframe data is still stored in non-relational file structures, such as VSAM. In fact, it has been estimated that 70 to 80 percent of the data stored on mainframe systems still resides in legacy VSAM files.
Today’s computer users have little or no understanding of the traditional requirements of data processing services, and little patience if those requirements make a service unavailable at the moment they need it. Access to mainframe systems and servers is required (or demanded) around the clock.
Availability must be measured from a user’s point of view. The OS/390 platform in a parallel sysplex environment is ranked at 99.995 percent availability; at approximately that level, a Xephon/Horison report predicts that a parallel sysplex will lose about 50 minutes of processor time per year.
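As a quick cross-check (the arithmetic here is mine, not the report’s), an availability figure maps directly to downtime per year:

\[ \text{downtime per year} = (1 - A) \times 525{,}600 \ \text{minutes} \]

so A = 99.99 percent corresponds to roughly 53 minutes per year, and A = 99.995 percent to roughly 26 minutes.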
The average hourly cost of a system failure and its resulting outage was estimated (in 1998) at $6.45 million in the brokerage industry, $2.6 million in credit card sales authorization, and $89.5 thousand in the airline reservation industry.
These costs explain why some industries will pay vast amounts of money to add an extra "9" to the availability level of their systems.
Why Continuous Availability?
Some examples of systems demanding very high levels of availability and continuous operation are:
• Medical services systems assisting in patient monitoring and diagnosis
• Online financial services networks
• Continuous-operation manufacturing installations
• Information services
• International business services spanning multiple time zones
• Internet applications
In practice, it is extremely difficult for most installations to achieve continuous availability: nearly all installations schedule some planned unavailability, and unplanned unavailability is, by its nature, unexpected and difficult to eliminate. IS departments strive to minimize the frequency and length of both.
The frequency of scheduled unavailability varies from once a day to once every few weeks; its length is seldom more than two hours.
In all cases, availability is determined by the end user of the service. This means that end users will perceive that their system is unavailable if:
• The response time exceeds the maximum normal response time they are accustomed to and interrupts their workflow.
• The response from the system is a request for an activity outside the normal work activities, such as a request to perform a sign-on operation again.
Unavailability, therefore, is any event that interrupts the end users’ normal pattern of work.
IS departments should consider several factors when creating a strategy for avoiding single points of unavailability. It is imperative to consider data sharing (to prevent a single application or subsystem from becoming a single point of unavailability), using backup-while-open to copy your data sets while they are in use, and avoiding the need to reorganize your data sets.
The Business Perspective
In order to have a continuously available system, an IS manager must understand a few obstacles that can derail availability objectives, such as:
• Sharing your VSAM data between the online system(s) and batch jobs (data sharing).
• Taking backups without impacting either the online applications or batch jobs.
The sharing of data between online and batch isn’t a luxury; it’s a business requirement.
Data Backups and Data Sharing
Some challenges for data sharing and data backups are:
• Not stopping the online system, such as CICS, while the batch jobs or the backups run.
• Not degrading the performance of the online system while you take backups or allow batch tasks to update the online files.
There are solutions to overcome the above challenges, but at what cost?
The cost may be time: implementing and testing new procedures, or reviewing new application development standards and testing them. The cost can also be more visible, for example, in new releases of software, new hardware, or in duplication of existing system components, even duplication of the entire system. Table 2 shows some typical costs that might be included.
Data recovery usually consists of restoring a copy of the data and applying any necessary updates. Almost all installations perform this type of backup/recovery today for their prime site.
When the data is to be restored at a secondary data center, extra considerations arise from variations in the installed hardware and from the possibility that related data may not be backed up and taken offsite at the same time. For disaster recovery, the speed and flexibility of recovery are the key elements. According to the Disaster Recovery Library, these must be weighed against the impact that taking the backups has on normal operations.
Device-independent backups are copies of data sets that can be restored to a different device type. This facilitates effective disaster recovery when the DASD in the two sites is different.
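For illustration, here is a sketch of a logical data set dump with DFSMSdss (the data set and backup names are invented):

//* Logical dump: device-independent, restorable to unlike DASD.
//* All data set names here are invented for illustration.
//DUMPDS   EXEC PGM=ADRDSSU
//SYSPRINT DD  SYSOUT=*
//TAPE     DD  DSN=BACKUP.PAYROLL.DUMP,DISP=(NEW,CATLG),UNIT=TAPE
//SYSIN    DD  *
  DUMP DATASET(INCLUDE(PROD.PAYROLL.**)) -
       OUTDDNAME(TAPE)
/*

Because the dump is logical, a matching RESTORE DATASET at the secondary site can direct the data to whatever DASD geometry is installed there.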
Transaction consistency is important for database managers. A database consists of several data sets that are updated by a single transaction. These data sets must be backed up as a set, along with the control data that describes the logical relationship between them. For disaster recovery, a missing backup of even one file at the secondary data center can make the whole database unusable.
Point-in-time copies back up the data as it exists at a specific instant. Traditionally, this has meant stopping CICS, so that no updates can be performed, copying the data and then restarting CICS. Two basic types of point-in-time copies are available.
Logical copies are those that back up the content of the data, but not necessarily the physical format. These provide the most flexibility during recovery, as the data can be recovered to a different device type. On the other hand, a logical copy usually takes longer to create.
Physical copies have historically been the simplest and fastest. With physical copies, data is copied in a device-dependent manner and usually at the volume level.
This may be adequate for backup within one site, but it is not suitable for a second site with unlike devices. Additionally, it does not guarantee consistency within a data set that spans several volumes.
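By contrast with the logical dump sketched earlier, a physical full-volume dump (again a sketch; the volume serial and data set names are invented) copies tracks exactly as they sit on one device type:

//* Physical full-volume dump: fast, but tied to the source device
//* type and unaware of data sets that span other volumes.
//DUMPVOL  EXEC PGM=ADRDSSU
//SYSPRINT DD  SYSOUT=*
//DASD     DD  UNIT=3390,VOL=SER=PROD01,DISP=OLD
//TAPE     DD  DSN=BACKUP.PROD01.DUMP,DISP=(NEW,CATLG),UNIT=TAPE
//SYSIN    DD  *
  DUMP FULL INDDNAME(DASD) OUTDDNAME(TAPE)
/*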
Online Copies
One objective when performing data backup is to do so with minimal or no outage at the prime site. To achieve this, there are solutions and services that back up data while the databases are in use, employing software or hardware techniques to ensure that the copy is logically consistent or device independent.
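On DFSMS systems, one way to take such an online copy, sketched here with invented names, is to mark the data set as eligible for backup-while-open and then dump it while CICS still has it open; the CONCURRENT keyword additionally asks DFSMSdss for a concurrent copy session where the hardware supports one:

//* Step 1: flag the file as backup-while-open eligible for CICS.
//* Data set names are invented for illustration.
//BWOSET   EXEC PGM=IDCAMS
//SYSPRINT DD  SYSOUT=*
//SYSIN    DD  *
  ALTER PROD.PAYROLL.KSDS BWO(TYPECICS)
/*
//* Step 2: dump the file while open; CONCURRENT requests a
//* concurrent copy session if the hardware provides one.
//BACKUP   EXEC PGM=ADRDSSU
//SYSPRINT DD  SYSOUT=*
//TAPE     DD  DSN=BACKUP.PAYROLL.BWO,DISP=(NEW,CATLG),UNIT=TAPE
//SYSIN    DD  *
  DUMP DATASET(INCLUDE(PROD.PAYROLL.KSDS)) -
       OUTDDNAME(TAPE) CONCURRENT
/*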
The need for data recovery is the same no matter what kind of computing environment you are in. For users of systems such as CICS, VSAM data loss probably means a major alert to recover large amounts of mission-critical data. Because recovering CICS VSAM data is complex, few qualified products exist to accomplish this important task. To be successful, users need a single, superior product that makes VSAM data recovery quick and easy, so they can focus on the problem of recovering the data and not on the program that recovers it.
VSAM Sharing
VSAM data sets often have to be shared among several different applications in an MVS system image or among applications on several different MVS system images. For example, transactions running in different CICS regions may have to access the same VSAM data set at the same time, or a CICS transaction may have to access a VSAM data set at the same time that a batch job is using the data set. The requirements for sharing can vary. Sometimes applications only have to read the data set. Sometimes an application has to update the data set while other applications are reading it. The most complex case is when all applications have to update the data set, and all require complete data integrity.
VSAM in a non-RLS environment provides only limited support for the sharing of data sets. It does not provide the functions that are required to enable multiple users to update a shared data set with complete integrity.
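These limits are visible in the classic SHAREOPTIONS attribute, sketched here with an invented data set name:

//* Cross-region option 2 allows one writer plus any number of
//* readers, with read integrity left to the applications; options
//* 3 and 4 allow more sharing but leave all integrity to the users.
//ALTERSO  EXEC PGM=IDCAMS
//SYSPRINT DD  SYSOUT=*
//SYSIN    DD  *
  ALTER PROD.ACCOUNTS.KSDS SHAREOPTIONS(2 3)
/*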
CICS users have been able to share VSAM data sets with integrity by using function shipping. With function shipping, one CICS region accesses the VSAM data set on behalf of other CICS regions. Requests to access the data set are shipped from the region where the transaction is running to the region that has access to the file. Function shipping provides a solution for the CICS user, but it has limitations: The processor cost of function shipping can be high, and it does not address the problem of sharing data sets between CICS regions and batch jobs. Also, the file-owning region is a single point of failure.
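In resource-definition terms, function shipping simply means the file is defined as remote in the requesting region. A sketch using the DFHCSDUP batch utility (the file, group, and region names, and the data set names in the JCL, are all invented):

//* DFHCSDUP updates the CSD in batch; the library and CSD data set
//* names below are installation-specific and invented here.
//CSDUPD   EXEC PGM=DFHCSDUP
//STEPLIB  DD  DSN=CICSTS.SDFHLOAD,DISP=SHR
//DFHCSD   DD  DSN=CICSTS.DFHCSD,DISP=SHR
//SYSPRINT DD  SYSOUT=*
//SYSIN    DD  *
  DEFINE FILE(PAYFILE) GROUP(PAYGRP)
         REMOTESYSTEM(FOR1) REMOTENAME(PAYFILE)
/*

Every request against PAYFILE in this region is then shipped to region FOR1, which performs the actual VSAM I/O on its behalf.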
Database management systems such as IBM DATABASE 2 (DB2) and the IBM Information Management System (IMS) Database Manager resolve the problem of sharing data with integrity among multiple users. Now, with DFSMS/MVS 1.3 and CICS TS, VSAM RLS provides many of the functions that database management systems provide, including support for the parallel sysplex environment.
RLS is a new VSAM function provided by DFSMS/MVS 1.3 and exploited by CICS TS. VSAM data sets are opened in RLS mode, which allows them to be shared, with full update capability, among many applications running in many CICS regions. As part of VSAM RLS, DFSMS/MVS 1.3 supports a new data sharing subsystem, SMSVSAM, which runs in its own address space. SMSVSAM provides the VSAM RLS support required by CICS application-owning regions (AORs) and batch jobs within each MVS system image in a parallel sysplex environment. The SMSVSAM subsystem, which is initialized automatically during an MVS initial program load, uses the coupling facility for its cache structures and lock structures. It also supports a common buffer pool for each MVS system image.
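To give a flavor of what this looks like at the data set level, here is a hedged sketch of defining a recoverable data set for RLS (all names, the space figures, and the storage class are invented; the storage class must map to coupling facility cache structures in your SMS configuration):

//* LOG(ALL) marks the data set recoverable; LOGSTREAMID names the
//* forward-recovery log stream. All names here are invented.
//DEFINE   EXEC PGM=IDCAMS
//SYSPRINT DD  SYSOUT=*
//SYSIN    DD  *
  DEFINE CLUSTER (NAME(PROD.ACCOUNTS.KSDS) -
         INDEXED -
         KEYS(16 0) -
         RECORDSIZE(250 250) -
         CYLINDERS(10 1) -
         LOG(ALL) -
         LOGSTREAMID(PROD.ACCOUNTS.FWDLOG) -
         STORAGECLASS(RLSSC))
/*

In the CICS file definition, RLSACCESS(YES) then directs the region to open the file through the SMSVSAM address space rather than through its own local VSAM buffers.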
VSAM RLS exploits the parallel sysplex environment. RLS can be used on a single MVS system image or across many MVS system images. CICS AORs and batch jobs located on different MVS system images can share access to VSAM data sets. Access is through the SMSVSAM subsystem.
Applications running in CICS AORs can share access to VSAM data sets opened in RLS mode with full update capability. Batch jobs have only read access to recoverable VSAM data sets opened in RLS mode. Batch jobs can update nonrecoverable VSAM data sets opened in RLS mode.
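At the JCL level, a batch job requests RLS access through the RLS keyword on the DD statement (a sketch; the program and data set names are invented):

//* RLS=CR gives consistent read (a shared record lock per read);
//* RLS=NRI reads with no read integrity. Names are invented.
//REPORT   EXEC PGM=ACCTRPT
//ACCTS    DD  DSN=PROD.ACCOUNTS.KSDS,DISP=SHR,RLS=CR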
VSAM Record-Level Sharing
VSAM RLS improves price/performance, improves the availability of systems using VSAM data sets, enhances the integrity of those systems, extends the processing capacity of your system, and provides flexible ways to configure your environment and balance workloads, all while requiring minimal changes to your existing applications.
Users increasingly require access to data held in VSAM data sets every hour of the day, every day of the year. In a non-RLS system, all CICS application access to a specific VSAM data set is typically through a single file-owning region (FOR). The FOR thus becomes a single point of failure. RLS enables sharing of VSAM data among many applications running in many CICS regions, possibly on several MVS system images, all within a single sysplex. RLS improves the availability of VSAM data sets during both planned and unplanned outages. In a VSAM RLS system, CICS application access to VSAM is through the SMSVSAM address space, and there is one SMSVSAM address space in each MVS system image. Thus, if one of the MVS systems or subsystems is not available, applications can access their data from another MVS system or CICS region.
VSAM RLS also improves data integrity and availability through the use of common locking and buffering.
You can improve availability substantially by careful planning and effective system management. Together these can take you to and beyond the 99 percent level (still one and a half hours of unavailability in a week).
About the Author: Harry L. Kirkpatrick is a lead quality assurance representative for BMC Software Inc.