Database Recovery When Data Sharing
DB2 for MVS/ESA Ver.4 has introduced many enhancements in the areas of client/serverprocessing, availability and performance. One of the major features introduced in Ver.4 isData Sharing. In a data sharing environment, where multiple DB2 subsystems read and updatetables in shared mode, database recovery becomes more complex behind the scenes. Thisarticle describes the major components involved in database recovery, changes made in DB2to support database recovery, and, finally, some performance issues related to recovery.
The major advantages of DB2 data sharing are improved availability of DB2, extension ofprocessing capacity and flexibility in growth. In order to understand database recoveryscenario, we will first glance at different components in data sharing environment, beaware of some basic definitions and understand how data sharing works when more than oneDB2 subsystems try to read and update a shared table at the same time.
Sysplex Environment. A sysplex is a set of MVS systems communicating with eachother through different hardware components and software services to process customerworkloads. A coupling facility (CF) enables the communication of subsystems within asysplex. It enables parallel processing and improved data sharing. The MVS componentcross-system Extended Services (XES) supports applications and subsystems to takeadvantage of CF services. The multiple instances of an application or subsystems, runningon different systems in a sysplex communicate with each other and share their statusthrough cross-system Coupling Facility (XCF) services. A sysplex timer synchronizestime-of-day clocks for all systems in sysplex. Figure 1 describes a sysplex structure withtwo CPCs communicating with each other.
CF Structures. A Coupling Facility has three areas of storage calledstructures. Lock structures are used to implement global locking for DB2 data sharing.These locks are recorded in lock structure so that all DB2 subsystems are aware of theselocks. Cache structures store changed pages and sometimes they also optionally storeunchanged pages. List structures (Shared Communication Area, or SCA) are used tointerchange information between sharing DB2 subsystems. SCA contains information aboutdatabases in exception conditions and other information.
DB2 data sharing uses these three structures in CF to manage data integrity and providedata consistency between all DB2 subsystems.
DB2 Data Sharing Group. A data sharing group can have one or more DB2subsystems as its members. Data sharing group is established either by installing a new V4DB2 subsystem with data sharing functions as the first member and defining the group, orenabling an existing V4 DB2 subsystem as first member of data sharing.
The DB2 Catalog and Directory for this first member then becomes catalog and directoryof the group. When a new member is added to the group (by installing new DB2 subsystem asa new member of the group), it shares the existing group catalog and directory.
DB2 Data Sharing Member. Each DB2 subsystem in a data sharing group is amember. All members within a data sharing group share the following DB2 resources witheach other: DB2 catalog and directory; Databases, tablespaces, tables, indexes, etc.;Aliases, views and synonyms; Plans and packages; Authorizations; Image copies; Results ofutility execution; Recovery information; Group buffer pools; Lock structures; SCA;Application programming defaults in DSNHDECP load module.
Also, each member maintains its own: Subsystem identifier; IRLM; Set of address spaces;Bootstrap data sets (BSDS); Active log data sets; Archive log data sets; Work filedatabase; Virtual buffer pools (local BPs); Hiperpools (optional); Environmentaldescriptor manager (EDM) pool; Sort pool and rid pool; DSNZPARM definition.
As a result of this, in a data sharing environment:
- DB2 objects (tablespace, table, etc.) created on one member can be used by all members.
- A plan bound on one member may be executed on another member.
- A grant given to execute a plan on any member authorizes execution of plan to all members.
- A utility for a tablespace can be executed on any member.
- Recovery of a tablespace can be executed on any one member and the tablespace gets recovered for all members.
Local Data Access for One. Any tablespace, which is required to be accessed byonly one member and does not require to be shared by other members, should be assigned avirtual buffer pool which is uniquely defined only for that member and that buffer poolshould not be defined for any other member. Also, a group buffer pool (GBP) correspondingto this virtual buffer pool should not be defined.
For example, if member DSM1 defines a tablespace TS1, for its local access, and assignsvirtual buffer pool BP6 to TS1, then no other member should have virtual buffer pool BP6and group buffer pool GBP6 should also not be defined.
Read-Only Shared Data for All. All members can share data in read mode. Also,when one member updates data, then during that period, no other member has read access tothat data. This requires that all members must define same virtual buffer pool which isassigned to the tablespace containing read only shared data and group buffer pool shouldnot be defined corresponding to this buffer pool.
For example, if tablespace TS1 is required to be shared in read mode by all members andvirtual buffer pool BP7 is assigned to this tablespace, then all members must have bufferpool BP7 defined and group buffer pool GBP7 should not be defined.
Read Write Shared Access. Group buffer pools are cache structures in CF, andthey are used to cache data which is of interest to more than one member in the datasharing group. For all those tablespaces, which are required to be accessed in read writemode by more than one member, the virtual buffer pools assigned to these tablespaces mustbe defined in all members and corresponding group buffer pools should also be defined.
For example, if a member creates a tablespace TS1 and assigns buffer pool BP1 to it,then for all members, tablespace TS1 is assigned to buffer pool BP1 as there is only onedefinition of TS1 in the shared catalog for the group. For this to happen, other membersshould also have buffer pool BP1 defined. Group buffer pool GBP1 is also required in thiscase.
A data sharing group must have group buffer pool GBP0 defined as all members havebuffer pool BP0 defined to access shared catalog and directory.
Group Buffer Pool. In a data sharing environment, for every virtual bufferpool defined on a member and assigned to shared tablespaces, one group buffer pool isdefined for the group. For example, if each member has local virtual buffer pools BP1,BP2, BP3 and BP4 defined for all tablespaces to be shared with in the group, then groupbuffer pools GBP1, GBP2, GBP3 and GBP4 must be defined. Group buffer pools are used to:
- Store updated data.
- Register the pages stored in local buffer pools of members in Page Directory.
- Optionally, cache unchanged pages reducing I/Os to DASD. This may improve performance in an OLTP environment, where small tables access will have a high hit ratio.
GBP stores updated pages copied from local buffer pools, in order to improvetransaction response time at commit time. Also, when another member needs to access thesepages, they get quickly refreshed from group buffer pools avoiding DASD I/Os. Thisprovides maximum benefit for accessing pages in small tables, space map pages, index leafpages and large tables with hot pages.
Whenever a member reads a new page into local buffer pool, GBP registers the interestin that page by the member in its page directory. This supports the mechanism by whichconsistency of same data pages in local buffer pools across all members is ensured.
Inter-DB2 Read-Write Interest. When two or more members in a data sharinggroup access the same page set (or partition) with at least one member having the writeinterest, then the page set (or partition) is said to be group buffer pool-dependent(GBP-dependent) and data sharing group is known to have inter-DB2 read-write interest inthe page set (or partition). DB2 dynamically detects this interest at page set (orpartition) level with page set physical locks (P-locks).
GBP Dependence. For each page set which is GBP dependent:
- Every page, read in local buffer pool of a member is registered in corresponding GBPs page directory.
- At commit, any page which is not yet written to GBP, is forced out into it.
- When a changed page is copied from updating members local buffer pool to GBP, all copies of that page existing in local buffer pools of other members are invalidated.
- All pages in the local buffer pool are tested for validity before they are read.
- Updated pages can be obtained from GBP into local buffer pool instead of reading them from DASD.
- Changed pages in GBP are written to DASD by cast out process.
- Global commit log sequence number (GCLSN) is used for lock avoidance.
Force at Commit. Typically, in a non-data sharing environment, changed pagesin buffer pool are written to DASD using asynchronous deferred-write. In a data sharingenvironment the changed pages for GBP dependent page are written to GBP. At the time ofcommit, all changed pages in local buffer pool, which have, so far, not been written toGBP, must be written to GBP and till that time these pages are protected by locks.
Buffer Pool Coherency. DB2 in a data sharing environment, ensures dataconsistency across all local buffer pools and GBPs. If same page exists in local bufferpools of more than one member, and a member updates a page, then all copies of this pagein other members are invalidated in their local buffer pools by a cross-
system invalidation process. The changed page is moved to GBP and any other memberrequiring this page now, refreshes copy of this page from GBP into its local buffer pool.
Group Buffer Pool Cast Out. Writing the changed pages from GBP to DASD iscalled as cast out. In this process, a page is read from GBP and copied into privatebuffer pool of a DB2 subsystem (the owner of cast out process for that page set) and fromprivate buffer, it is written to DASD. The owner of the cast out process is the first DB2subsystem to express write interest in the page set.
Changed pages in GBP are written to DASD when:
- Number of changed pages for a class out queue exceed class threshold value.
- Total number of changed pages for a GBP exceed GBP threshold.
- GBP checkpoint occurs.
- There is no more a inter-DB2 read-write interest in page set.
Group Buffer Pool Checkpoint. GBP checkpoint is the process of periodicwriting of changed pages in GBP to data set. In the case of group buffer pool failure, allchanged pages belonging to a GBP-dependent page set are recovered from DB2 logs writtensince last GBP checkpoint. A high checkpoint frequency means less number of log records toapply and therefore faster recovery of GBP. At GBP checkpoint, DB2 registers log recordsequence number (LRSN) in SCA and BSDS of member performing checkpoint. A member which isthe first DB2 subsystem that connects to GBP triggers GBP checkpoint.
Database Recovery Scenario. In a data sharing environment, each membermaintains its own active and archive sets of logs and bootstrap data sets. When a memberupdates pages in a page set, DB2 logs corresponding to these changes are written to activeand archive logs which belong to that member only. When two or more members update a pageset, then the DB2 log records reflecting updates for that tablespace are distributedacross multiple active and archive log data sets. A tablespace recovery in this case wouldneed logs from all those members which have updated the tablespace. Since tablespacerecovery in a data sharing environment, can be initiated from any member in the group, itis mandatory that each DB2 member has access to bootstrap data sets and log data sets ofevery other member. The group uses Shared Communications Area (SCA) to coordinaterecovery. It is also used to coordinate STARTUP process. SCA contains member names, BSDSdata set names, as well as status information about objects and members in the group.
For this recovery scenario, DB2 uses another group wide identifier LRSN (log recordsequence number) to merge log records read from all members and apply them to data. Alsothe bootstrap data set of each member maintains information about the bootstrap data setof all other members.
Log Record Sequence Number. As all members in a data sharing environment canupdate a tablespace, a recovery or restart process would need to access multiple log datasets and merge these log records in a time sequence in which the updates were made. Eachmember maintains its own log and each DB2 subsystem may be added to the group at differentpoint in time and each member can have different amount of update activities for atablespace in a given period. Therefore, log RBAs used in one DB2 subsystem have norelationship to log RBAs used in another DB2 subsystem. Hence, in a data sharingenvironment, log RBAs can no longer be used to arrange log records from different membersin time sequence. In data sharing environment, the log record sequence number (LRSN) isused to merge log records in time sequence.
The LRSN is derived from STORE CLOCK (STCK) instruction and is a 6-byte value equal toor greater than time stamp value truncated. LRSN must be in synch for the whole group andthis is achieved by sysplex timer.
LRSN in Log Records and Data Pages. In data sharing environment, the logrecords maintain LRSN in their log record header (LRH) as shown in Figure 2 (on page 42).LRSNs are maintained in place of RBA values in data pages. In Figure 3 (on page 42), aDSN1PRNT output (with FORMAT option) for a tablespace describes a header page with fieldsPGLOGRBA, HPGRBRBA, HPGCLRSN and HPGLEVEL storing LRSN values in place of log RBAs.Similarly, data pages also now maintains LRSN values in PGLOGRBA in page header.
Multiple members can update a tablespace and for each members update, UR (unit ofrecovery) log records are written to their corresponding log datasets. During tablespacerecovery, these UR records are read from log data sets and merged in LRSN order.
Changes in DB2 Utilities. LOAD, REORG, COPY, QUIESCE, RECOVER and MERGECOPYutilities now register LRSN values in START_RBA (and PIT_RBA) columns in SYSCOPY table.They also register the name of the member running the utility in SYSCOPY. SYSLGRNXcontains starting and ending LRSN values along with RBA values.
REPORT utility includes related LRSN information in recovery report. It also reportsvolume serial numbers of archive log data sets for all members.
QUIESCE utility returns LRSN value for quiesce point and WRITE(YES) option triggerscast out of all changed pages to DASD.
DSN1LOGP has the option to print log records with LRSN values.
Reading DB2 Logs with RBA and LRSN Values. The log RBA for a log record is anoffset from the beginning of data set to the log record. Therefore, for reading a logrecord on DASD by its log RBA, DB2 can find out exactly where the required record existsin log data set. For reading a log record by LRSN, it is not possible to locate a logrecord directly as the log update activity is not uniform with respect to time; therefore,one part of log data set can have more log records written in a given interval of timethan the other part of data set. However, all log records in a data set are written inincreasing LRSN value and for reading a log record on DASD, DB2 does a binary search oflog blocks to locate a record containing a particular LRSN value.
Extracting Log Records. For all our subsequent discussions, consider a datasharing group with group attach name DSM0 having two members DSM1 and DSM2 as two DB2subsystems in data sharing environment. In Figure 4 (on page 43), we have the followingsequence of updates for tablespace TS1 in database DB1:
- Member DSM1 updates TS1 at time T1 and T4 creating a SYSLGRNX record 1.
- Member DSM2 updates TS1 at time T2, T3 and T5 creating a SYSLGRNX record 2.
- Member DSM1 updates TS1 again at time T6, T7, T8 and T10 creating a SYSLGRNX record 3.
- Member DSM2 updates TS1 again at time T9 and T11 creating a SYSLGRNX record 4.
Now, to recover TS1 to time T9, the last full image copy taken at time T0 (this imagecopy may be taken from a member different than the member on which recovery is being done)is used and all 4 SYSLGRNX records are used to determine log ranges to scan log data setsfor each member. Log records for updates U1 to U9 are read and applied for recovery inthat order.
SYSLGRNX record contains RBAs, as well as LRSNs (STARTRBA, STRTLRSN and ENDRBA,ENDLRSN) both. For recovery, the log manager will always try to read the log records byRBAs (it is easier to locate them by their offset) and use STARTRBA instead of STRTLRSN.In case the last image copy is taken with SHRLEVEL CHANGE and image copys LRSN(value in START_RBA column of SYSCOPY) falls in between STRTLRSN and ENDLRSN values ofSYSLGRNX record, the log manager will read the first log record on data set by LRSN value.Once this record is located and read, all other log records will be read using log RBAs.
Applying Log Records. Once a log record is read, DB2 compares the LRSN in logrecord (in non-data sharing environment it compares log RBA) with LRSN value in PGLOGRBA,in page header (see Figure 3 on page 42). If the LRSN in log record is higher than LRSN inthe page header, the log is applied to the data page. For point-in-time recovery in datasharing environment, LRSN must be used in place of log RBA.
Changes in DB2 Catalog and Directory. SYSCOPY, SYSLGRNX and SYSUTILX have nowadditional fields as described in Figure 5 (on page 43).
Bootstrap data set maintains following additional information:
- LRSN ranges for each active and archive log (along with RBA ranges).
- Member record containing group name and member name to which the BSDS belongs.
- Names of all other members in the group.
- Member ID (it is assigned when a member joins a group) for all members.
RBA When converted to V4. This is set when migrating from V3 to V4. This RBA is notrelated to data sharing but it is used to manage certain utility operations. Forincremental image copies, logic for tracking modified pages is changed in V4 to improveperformance.
MAX RBA for TORBA it is RBA of originating member when data sharing was enabled. InRECOVER utility, TORBA value, if used, cannot exceed MAX RBA value.
MIN RBA for TORBA if data sharing is disabled, recovery to a previous point-in-timeusing TORBA cannot go before this point.
STCK to LRSN delta mostly this value is zero. This is non-zero when DB2 is not able touse exact store clock value to determine LRSN.
Example of a BSDS with Print Log map utility is shown (see Figure 6 on page 44) withtwo members DSM1 and DSM2 in a data sharing environment. Message DATA SHARING MODEIS ON indicates that data sharing is enabled. Member ID 1 always corresponds tooriginating member. DB2 keeps track of originating member in a group in order to performrecovery prior to a point when data sharing was enabled.
Changes in RECOVER Utility. Recovery for current point-in-time is the same asin non-data sharing. Image copy is restored and logs from multiple members are read andapplied in LRSN sequence. Recovery for TOCOPY is again the same as in non-data sharing.
Recovery for TORBA is used to recover to a point by log RBA in a non-data sharingenvironment. For a data sharing environment, it is used to recover to a point beforemigrating to a data sharing Group.
Recovery for TOLOGPOINT, in a data sharing environment, is used to recover to a pointin a log after enabling data sharing. It is done using LRSN. In a non-data sharingenvironment, it is used to recover to point-in-time by log RBA.
As we can see in Figure 7, TORBA and TOLOGPOINT using log RBA value can only be givenbetween MIN RBA and MAX RBA as reported by Print Log Map utility or REPORT utility. Torecover to a point before migration to V4, TOLOGPOINT option can not be used.
REPORT Utility. In data sharing environment, this utility provides followinginformation for the group:
- SYSCOPY information for CURRENT option (entry made by last member, for ALL option entries by all members)
- SYSLGRNX information by all members
- Log data set information by all members
- MAXIMUM RBA
- MINIMUM RBA
- MIGRATING RBA
An example of REPORT output for tablespace TS1 updated in data sharing environment bytwo DB2 subsystems DSM1 and DSM2, can be seen
in Figure 8 (on page 46) with MINIMUM RBA=000000000000, MAXIMUMRBA=000000646C9A, MIGRATING RBA=000000646C9A. SYSCOPY entries arereported in time sequence across the group. SYSLGRNX entries with START LRSN and STOP LRSNvalues are reported member-wise. For member DSM1, member ID is 01 and for DSM2, Member IDis 02.
RECOVERING a Tablespace. RECOVER utility can be run on any member of datasharing group and it can restore from an image copy data set copied by one of the membersin the group. The additional functions supported in data sharing environment supportaccess to bootstrap data sets and log data sets for all other members to the memberrunning RECOVER utility which requires that all DB2 members in the group have connectivityto all bootstrap data sets and log data sets.
Consider the sample REPORT output in Figure 8. If we run recover for this tablespaceTS1, DB2 will first restore from image copy data set PREF1.UTIL1.TS1.D021202 copied withSHRLEVEL CHANGE by member DSM1. The START_RBA for this SYSCOPY row has LRSN value=B000F6184A2E. Subsequently, it will read SYSLGRNX entries to find out logranges required to apply log records.
For member DSM1, this LRSN value falls in between START LRSN (B000F60BE0EE)and STOP LRSN (B000F75B90D5) of first SYSLGRNX entry, and DB2 will do a binarysearch to position at right RBA record and use RBAs up to STOP LRSN(B000F75B90D5) on log data set. For second SYSLGRNX entry for DSM1, it willdirectly position to START RBA (0001EC19028E). For member DSM2, this LRSNvalue falls before START LRSN (B000F6937EE7), and DB2 will position directlyat START RBA (0001A5A35B70) on log data set (see Figure 9).
Pages in Logical Page List (LPL). When a transaction is committed, all updatedpages in local buffer pool which have not been written to GBP, are written synchronouslyto GBP and if GBP is not available at that time or write to GBP fails then these pages areplaces on logical page list associated with each page set. This is done to ensure dataintegrity with limited unavailability of pages (placed on LPL). Any application trying toread these pages gets -904 sql code. A DISPLAY DATABASE LPL (or RESTRICT) command willshow all page sets with LPL mode.
Group Recovery Pending (GRECP) or LPL Recovery. Tablespaces with grouprecovery pending or logical page list status can be recovered by using START DATABASEcommand. For this recovery, the starting point for tablespace is LRSN of last GBPcheckpoint, and subsequently all changed pages are recovered from DASD and DB2 logs.
For faster recovery, the START command for all tablespaces can be distributed acrossdifferent members in parallel.
Recovering Tablespaces Outside of DB2. If the backup copies have been madeoutside of DB2s control such as through DSN1COPY, same method can be used to restoreobjects to a prior point-in-time. However, if you want to apply logs subsequently, you canuse LOGONLY option.
Recovery Performance. A single tablespace recovery may require at least onetape unit for each DB2 member for reading and merging archive log records. If you runrecovery for multiple tablespaces in parallel, you may need even more tape units all atthe same time. As compared to DB2 V3, the pre-mount function is not available in datasharing environment. Therefore, if enough tape units are not available for recovery, DB2can deadlock, reading logs from tapes may degrade performance.
One solution to this problem is to minimize reading archive logs. To achieve this weneed either to create more active log data sets or to increase size of size of active logdata sets. There is a maximum limit of 31 for the number of active log data sets for amember and most of the time active log data sets size is decided so that they can bearchived to a single tape. Also, a group may have up to 32 members and keeping more logson DASD may result into DASD constraint. If DASD is not a constraint, the other solutioncan be to keep one set of archive logs on DASD and another set on tapes.
A better option is to archive logs to DASD with DFSMSHsm. Using this, we can managedisk space As well as availability of archive logs as and when required for recovery.
The other logging considerations are to increase image copies more frequently. Thiswill reduce reading archive logs during recovery. Based on the total active logs size wecan determine the optimum frequency of taking image copies. Batch jobs should commitfrequently so that a unit of recovery is always confined to active logs only.
About the Author:
Sharad Pande is a Lead Associate for PricewaterhouseCoopers SolutionsThruTechnology practice. Based in New Jersey, Sharad has 10 years of experience withdatabase technology and is a Database Administrator.