In-Depth
Disaster Recovery Using Real-Time Disk Data Copy
Worldwide commerce and industry have become increasingly dependent on their IT systems to conduct business - so much so that even a temporary loss of critical applications can cause serious economic harm to a company, and an extended outage can threaten a company's existence. Regulatory and competitive pressures, coupled with the potential financial impact of unavailable systems, have motivated IT executives to address the issue of real-time protection in the event of a disaster.
In these days of ultra-high availability requirements, where 24x7 operations are the norm, how does an operation protect its critical applications against a disaster or an extended outage? New technology, allowing critical data to be "shadowed" to a remote location, can provide this protection; it also can allow a company to resume processing within hours, or minutes, of a catastrophic disaster. This new technology, which blends innovations in disk subsystems and OS/390 software, can play a significant role in supporting system-level failover.
The concept of "dual copy" or "mirroring" is not new. However, the requirement that complex data structures survive a catastrophic failure by being maintained at remote locations thousands of kilometers away has led to an emerging technology that has now come to market.
In this article, after a brief background on the new technology, I will describe the design challenges inherent in developing robust remote copy solutions. Although of interest to the architect, designer and developer, these design challenges are also important for you to understand so that you can select the appropriate remote copy solution for your needs. Following this discussion of the design issues, I will present, at a high level, the solutions that emerged to address them. Lastly, I will offer a guide (also intended as a summary) to evaluating each of these solutions against your own needs.
Background
In 1990, the first hardware dual copy solution was introduced. The technique is called "mirroring," as it allows synchronous, identical copies of disk volumes to be maintained on the same disk subsystem. The function allowed customers to continue processing in the event of a single-disk failure.
Very shortly afterwards, customers began requesting a similar function, but one in which the additional (secondary) disk copy could be located on another disk subsystem, some distance away from the primary. The reason for this requirement was to protect critical applications from a total failure of the disk subsystem or of the entire primary site.
Why did it take four years - until 1994 - for solutions to begin reaching customers? And why has it taken another two to three years for that technology to evolve into products robust enough to survive almost any disaster? The answers are in the technology challenges inherent in providing a workable solution.
Technical Challenges
Developing these products involved a number of significant technical challenges. They include, but were certainly not limited to, the following:
* Performance
* Data Integrity and Sequence Integrity
* Understanding "Rolling Disasters"
* Providing a Restartable Image
Before examining the technical issues, I should explain that there are essentially two forms of remote copy solutions: those that are synchronous, and those that are asynchronous or non-synchronous. To summarize the two approaches:
* Synchronous - An application is informed that an update is completed only when both the local and remote subsystems have received the new or changed data. This maximizes data currency at the remote site since it is in lockstep with the primary site.
* Asynchronous or non-synchronous - An application is informed that an update is completed when the update has been received by the local subsystem; the remote subsystem is updated at a later time. This minimizes any change in application performance due to implementing the remote copy function.
Both forms of remote copy have benefits and drawbacks, as you will see in the discussion below.
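To make the distinction concrete, here is a minimal sketch (in Python, purely illustrative and not any vendor's actual protocol) of when the application is told "write complete" under each approach:

```python
import queue
import threading
import time

class SynchronousCopy:
    """Update is acknowledged only after BOTH subsystems hold the data."""
    def __init__(self, primary, secondary, link_delay_s=0.005):
        self.primary, self.secondary = primary, secondary
        self.link_delay_s = link_delay_s   # assumed one-way propagation delay

    def write(self, key, value):
        self.primary[key] = value          # 1. update the primary disk subsystem
        time.sleep(self.link_delay_s)      # 2. send the update to the secondary site
        self.secondary[key] = value
        time.sleep(self.link_delay_s)      # 3. wait for the acknowledgment to return
        return "complete"                  # 4. only now is the application told

class AsynchronousCopy:
    """Update is acknowledged as soon as the local subsystem holds the data."""
    def __init__(self, primary, secondary):
        self.primary, self.secondary = primary, secondary
        self.pending = queue.Queue()       # updates still in flight to the secondary
        threading.Thread(target=self._drain, daemon=True).start()

    def write(self, key, value):
        self.primary[key] = value          # 1. update the primary disk subsystem
        self.pending.put((key, value))     # 2. queue the update for later shipment
        return "complete"                  # the application continues immediately

    def _drain(self):
        while True:                        # the secondary catches up later
            key, value = self.pending.get()
            self.secondary[key] = value

primary, remote = {}, {}
SynchronousCopy(primary, remote).write("acct-123", "balance=500")
```

In the synchronous sketch the application pays the round-trip delay on every write; in the asynchronous sketch it does not, but the secondary copy may lag behind the primary at any instant.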
The Performance Challenge
The issue of performance generally involves the length of time it takes for a disk update to complete. In the simplex case (where no disk shadowing is occurring), an update operation normally completes within a few milliseconds. If remote copying is enabled, however, and if care is not taken with the design, the response time for updates can more than double, since the update must be performed in two locations and must also travel between the two sites. In non-synchronous implementations, where updates do not have to wait for any acknowledgment from the secondary, this increase in response time is negligible. As a result, much of the development effort on synchronous implementations went into mitigating or masking this performance penalty. Most synchronous solutions today, through a combination of improved technology and clever transmission techniques, keep this impact to a minimum. Two issues should be kept in mind:
1. For truly synchronous solutions, there is a performance impact. This is the cost of doing synchronous remote copy.
2. This penalty can limit the acceptable distance (the impact grows with distance because of signal propagation delays), as well as the conduit (the impact grows if traversing a telecommunications link), as the rough calculation below illustrates.
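To see why distance matters, consider signal propagation alone. The sketch below uses a rough rule of thumb (light travels through fiber at roughly 200 km per millisecond; that figure, and the assumption of a single round trip per write, are simplifications for illustration):

```python
def added_round_trip_ms(distance_km, round_trips=1, km_per_ms=200.0):
    """Extra response time a synchronous write pays for signal propagation."""
    one_way_ms = distance_km / km_per_ms
    return 2 * one_way_ms * round_trips

for km in (10, 50, 100, 500):
    print(f"{km:>4} km -> ~{added_round_trip_ms(km):.2f} ms added per write")
```

Against a simplex response time of a few milliseconds, a few hundred kilometers of separation (plus any additional protocol round trips over a telecommunications link) quickly becomes a noticeable fraction of every update.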
The Data Integrity Challenge
This was a significant design challenge. Obviously, any solution that could not maintain data integrity would be useless. Just as performance was the major design issue in synchronous designs, data integrity - specifically, maintaining the sequence of updates - was the major issue in non-synchronous designs. To be specific, the difference between loss of data and loss of data integrity is analogous to the difference between losing an entry or two in a phone book (loss of data) and having all of the addresses and phone numbers in the phone book randomly assigned to the names (loss of data integrity). In the latter case, all of the data is present - but useless.
To understand the data integrity issue, the importance of the sequence integrity of updates should be clearly understood. In any operating system, file access method or application, the sequence of the updates is key to maintaining integrity of the data. Simply put, if one were to do random updates to the complex data structures in today’s systems and database applications, the associated data would be useless. For example: the sequence for updating indexes (pointers to data) and data components allows these data structures to survive a sudden failure. Furthermore, each update must be assumed to be logically dependent on a previous update.
In the above index/data example, an index pointer cannot be updated until the data is successfully written. This means that the index update must wait until the data updates are successfully completed. This serialization of updates, where one update is logically dependent on a previous update, is called dependent writes.
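Here is a minimal, hypothetical sketch of the idea (the volume names and structure are illustrative only): each write is issued only after the write it depends on has completed, so a sudden failure can never leave a pointer or log record referring to data that was never written.

```python
def commit_record(data_volume, index_volume, log_volume, key, record):
    # Dependent writes: each step starts only if the previous one succeeded.
    data_volume[key] = record                  # 1. the data itself is written first
    index_volume[key] = ("pointer", key)       # 2. the index update depends on step 1
    log_volume.append(("committed", key))      # 3. the log entry depends on steps 1 and 2
```

If a remote copy solution replays these writes at the secondary in a different order, the log could claim a commit whose data or index never arrived - exactly the phone-book corruption described above.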
In synchronous solutions, where the secondary image is in lockstep with the primary, maintaining the sequence of replicated updates is not an issue. A logically dependent update can never start until the previous update - upon which it might depend - has completed successfully at both locations. In non-synchronous solutions, however, sequence is very much an issue.
In a situation where a primary update is allowed to complete prior to its propagation to the secondary, a subsequent update is allowed to start. In this scenario, the arrival times of the updates at the secondary can easily be out of order (update #2 arrives prior to update #1), temporarily leaving the secondary image corrupted until update #1 arrives. Although this "exposure" may last only a few milliseconds for any particular update, the fact that thousands of updates may be occurring every second makes the corruption exposure essentially continuous. That is, any random failure (like the disaster one is trying to protect against) will find corrupted data at the secondary location, where potentially numerous records needed to preserve integrity have not yet been received. Consider this scenario: a database log, recording completed transactions, contains an entry indicating that a large funds transfer has completed; the database itself, however, does not contain the corresponding data changes.
The requirement this imposes on non-synchronous remote copy solutions should be quite obvious. The solution must be able to recover a consistent image of data, with the exact sequence of updates as they occurred at the primary location, to avoid a corrupted secondary image. This sequencing also must be maintained across multiple volumes, across multiple disk subsystems and across many host images.
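One common way to meet that requirement (a conceptual sketch only, not any product's actual design) is to tag every update with a sequence number or timestamp at the primary, and have the secondary apply updates only up to the latest point for which nothing earlier is still missing:

```python
def apply_consistent(secondary, received_updates):
    """received_updates: (seq, volume, key, value) tuples that have arrived so far,
    possibly out of order and with gaps where updates are still in flight.
    Assumes a single global sequence starting at 1 (an illustrative simplification)."""
    seqs = sorted(seq for seq, _, _, _ in received_updates)
    # Find the highest sequence number with no gaps before it: everything up to
    # this point is known to be complete and in the original order.
    safe_seq = 0
    for expected, seq in enumerate(seqs, start=1):
        if seq != expected:
            break
        safe_seq = seq
    # Apply only the updates at or before the consistent point, in order.
    for seq, volume, key, value in sorted(received_updates):
        if seq <= safe_seq:
            secondary.setdefault(volume, {})[key] = value
    return safe_seq
```

Updates beyond the consistent point are held back, so a disaster at any instant leaves the secondary looking exactly as the primary did at some earlier moment - behind, but never corrupted.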
Rolling Disasters
A rolling disaster is an event resulting in multiple failures of disks, processors, tape subsystems, networks and applications - all logically related - and it can last for seconds or even minutes. In the simplex case (no remote copy present), a rolling disaster can result in lengthy recovery actions, most likely involving data recovery from tape backups. This "window," from the time a disaster begins until final "meltdown," is the rolling disaster.
Addressing the challenge of rolling disasters was probably the most elusive task; it involved both the technology of the disk subsystems (hardware) and a deep understanding of the complex data structures employed by various database products, such as DB2, IMS and VSAM.
There was also the issue of not knowing how the various applications, database products and system components in OS/390 would react in a rolling disaster. As an example, the remote copy technology certainly allows an exact duplicate of all data to be maintained at the secondary site, so that following a catastrophic failure the copied data is in fact a true image. However, the image of the data at the primary site is most likely unusable, so replicating all actions occurring at the primary site probably will not achieve the desired result! Why?
Consider that for many database management systems (DBMSs), write errors, which are expected in a disaster, can leave a database in an unknown state. This is certainly true for DBMSs that "cluster" their updates to the database to optimize performance. These write failures, especially multiple write failures, can corrupt the database at the primary location to the point that the only recourse a customer might have is to recover that database from the previous image copy (tape). If this corruption is allowed to propagate to the secondary location as well, any attempt to start the critical applications at the secondary location can result in a lengthy recovery from image copy tapes.
Put another way, in the simplex scenario (no remote copy), following a rolling disaster, the only hope of successfully recovering critical data might be to restore the data from backup tapes and then perform a forward recovery to the point of failure - a process that can take many hours, perhaps days. If the remote copy solution simply emulates this process, recovering the image at the secondary (the one you are depending on) would take equally long. What has been observed is that the updates occurring within this rolling disaster window are unpredictable and chaotic, due to the extensive error recovery actions being performed by the system and applications.
Another consideration that makes rolling disasters problematic is that, in normal operation, write errors to disk are extremely rare. With RAID disk subsystems and the redundancy built into them, a write failure to a disk is quite unexpected. During a catastrophic disaster, however, write errors to disk are not only very likely to happen; there may be hundreds occurring in a short period of time. This distinction separates remote copy solutions from other mirroring solutions.
Rolling disasters will put any remote copy solution to the test! Remember, if you successfully ran a remote copy solution for one year, and it did not survive a five-second rolling disaster, you have achieved 99.99998 percent availability. That number may sound impressive, but it is clearly not the objective of a remote copy solution! That’s like saying a car’s seat belt worked only when you weren’t having an accident.
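For the curious, the arithmetic behind that figure is simply the five-second window measured against a year of operation:

```python
seconds_per_year = 365 * 24 * 60 * 60        # 31,536,000 seconds
availability = 1 - 5 / seconds_per_year      # five seconds not survived
print(f"{availability:.5%}")                 # -> 99.99998%
```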
Providing a Restartable Image
Consider the simplex case of a sudden outage (one that causes immediate failure in every component of your site at the exact same moment). In such an improbable case, the image of your data is sequence-consistent, and no corruption will have been allowed to occur. When your system comes back online and your applications restart, they essentially begin where they left off. Incomplete transactions complete or are backed out, some recovery actions are performed, and you are back processing your critical applications within a few minutes. This process is known as an emergency restart, since the application can recover from the image as it existed on disk. Most system failures, such as power outages, result in only having to perform an emergency restart of your applications. In contrast, a restore from image copy tapes is called a full recovery; it is measured in hours or days and is not at all desirable. In fact, if the end result of a real-time disk shadowing solution were to force a tape restore, you might ask the obvious question of why you are investing in such a solution. Not having a restartable image at the secondary location would certainly defeat the purpose of the solution.
Now let’s consider the rolling disaster discussed above. We know that immediately prior to a rolling disaster the image of the data is emergency-restartable. We also know that the image of the data following the disaster has a high likelihood of being unusable and, as explained, may require a full recovery. Somewhere within the rolling disaster, something happens to the data to contaminate it. Therefore, another requirement of an effective remote copy solution is to "freeze" the image of the data at the secondary site at the moment the disaster occurs, preserving that restartable image. Following the disaster, a decision must be made as to how many, if any, of the updates occurring during the disaster should be applied prior to recovering that secondary copy.
Design Approaches
There are numerous remote copy solutions from multiple vendors. Why? Because no single approach applies equally well to all application environments.
To simplify the understanding of these design approaches, let's divide the designs along two dimensions:
1. Any solution is either synchronous or non-synchronous.
2. Any solution is either hardware-based or operating system-integrated.
Synchronous or Non-Synchronous
Synchronous implementations imply that the application's update is not considered complete until it has been successfully secured at the secondary site. This is a four-step process: the update is written to the primary disk subsystem (1) and sent to the secondary site (2), which returns an acknowledgment to the primary site (3), allowing the original update to complete (4). In the non-synchronous approaches, the application update (1) is allowed to complete (2) before the update is applied to the secondary site (3 and 4).
By contrasting these two approaches, it’s clear that the synchronous approach incurs an I/O response time penalty that grows with distance. Transaction response times may increase, batch run times may lengthen, and overall application and system throughput may be affected. This is the cost of synchronous. The non-synchronous approach can virtually eliminate this penalty.
Another key difference is that while synchronous remote copy attempts to keep the two copies identical, the non-synchronous design causes the secondary image to "lag" some amount of time behind the primary image. A non-synchronous solution thus mitigates the performance issue, but it does so by allowing some updates to be lost in flight. So where the cost of synchronous is performance, the cost of non-synchronous is potentially updates lost in flight.
Hardware-Based or Operating System-Integrated
The second contrast, shown in the diagram below, compares the hardware-based solution to the operating system-integrated solution. The hardware-based solution flows the data directly from the primary DASD subsystem to the secondary, whereas the integrated solution either flows the updates through a component in a host processor or has sufficient host awareness to manage the activity among many disk volumes, disk subsystems and, in fact, many host images. A cursory evaluation generally concludes that the hardware-based approach (synchronous or non-synchronous) is simpler, but further study shows that this approach incurs some operational challenges.
One of the challenges encountered with the pure hardware-based solution was that error activity on one disk subsystem was not "seen" by the other subsystems. Given the rolling disaster discussions above, a failure of one subsystem might immediately be followed by an update on another subsystem. If the objective at the time of the write failure was to immediately freeze the secondary copy before additional updates could be transmitted, how would that be accomplished if the other subsystem were not aware of the first failure? Since the subsystems do not communicate, one can quickly see the rationale for the supporting host software. In other words, if the objective is to create a point-in-time, I/O-consistent image at the secondary site, and if the data spans multiple subsystems, the shadowing activity must immediately cease on all participating subsystems. This freeze needs to be initiated by a component that has immediate access to all subsystems: namely the host.
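Here is a minimal sketch of that host-initiated freeze (the class and method names are hypothetical, not a real product interface): a host component that can reach every participating subsystem suspends shadowing on all of them the moment any one of them reports a write failure, so no later update on a different subsystem can reach the secondary and break its point-in-time consistency.

```python
class FreezeCoordinator:
    """Host-resident component with access to every participating subsystem."""
    def __init__(self, subsystems):
        self.subsystems = subsystems   # all disk subsystems holding shadowed data
        self.frozen = False

    def on_write_failure(self, failing_subsystem, error):
        if self.frozen:
            return                     # freeze only once per incident
        self.frozen = True
        # Suspend propagation to the secondary on EVERY subsystem, not just
        # the one that failed, to preserve an I/O-consistent secondary image.
        for subsystem in self.subsystems:
            subsystem.suspend_shadowing()
        print(f"Freeze issued after failure on {failing_subsystem}: {error}")
```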
How to Select the Proper Product
Now that we have covered all the requirements of a remote copy solution, how do you, the customer, decide which of the many solutions is right for you? You now know you can choose a synchronous solution (with the associated impact on response times) or a non-synchronous solution (which may mitigate the performance penalty but allow some updates to be lost in flight).
We now also know that any solution must be able to:
1. Replicate updates in the exact sequence in which they occurred at the primary.
2. Be able to "freeze" the image at the secondary location at the time the disaster begins.
3. Survive a "rolling disaster."
The tradeoff of performance versus data currency (i.e., synchronous vs. non-synchronous) is fairly straightforward and will depend on the application in question.
In analyzing the other three points above (sequence integrity, freeze and rolling disasters), you must look at the level of integration with the operating system. Some non-synchronous solutions are sufficiently integrated into the operating system to provide all three. A host component, in support of the remote copy solution, can provide the three required functions because the host can "manage" the activity among multiple DASD subsystems.
In cases where the copy function is embedded in the hardware (i.e., a hardware-based solution), some implementations will require a component on the host to monitor the failure activity within the DASD subsystems. This process, generally referred to as a "geoplex," provides sufficient automation to address the issues stated above: ensuring that recovered data is sequence-consistent, that the image can be frozen at the point of disaster and that the solution can survive a rolling disaster.
Remember, your data is your most important asset: hardware is replaceable; critical data is not.
ABOUT THE AUTHOR:
Claus Mikkelsen is a Senior Architect, Large Systems Storage, for IBM Storage Systems Division (San Jose, Calif.), and was one of the principal architects for IBM’s Extended Remote Copy (XRC) and Peer-to-Peer Remote Copy (PPRC), as well as the architect for Concurrent Copy (a transparent, non-disruptive copy facility). He has earned more than 19 patents in these and related areas, and has over 20 years of experience in DASD storage and data management, both hardware and software.