In-Depth
E-Business and DR: Alternatives to Data Mirroring
Over the past few years, vendors of disk systems have been touting various forms of Remote Copy or Remote Mirroring as a Disaster Recovery (DR) solution. Remote copying or mirroring can be part of a DR solution, but used on its own it can make the recovery process ineffective. Simply copying data can mean copying the disaster: as the data gets corrupted at the primary site, it also gets corrupted at the secondary site. Although mainframe data centers have been the primary focus for DR solutions, the need is also apparent in NT, open systems and other operating environments.
In a DR situation, you need to restart critical applications at a remote site with full data integrity, without relying on primary site personnel. The restart must come from a known point in time where data has guaranteed integrity. Most business applications can tolerate a loss of some transactions, but not corrupted data. Any effective DR solution has to be cognizant of this fact. But, to understand how to prepare for a disaster, you have to understand the nature of a probable disaster.
Vaporization or Meltdown?
It is highly unlikely that a disaster would vaporize a data center in a nanosecond. A more likely scenario would be a flood, earthquake or hurricane. In these cases, a disaster occurs over a period of several minutes. Disk drives, tape drives, servers and network controllers will fail randomly. Logs and their databases are usually on different disk subsystems, so they will not fail in sync. Databases typically use dependent writes, in which one write is issued only after an earlier one completes, to ensure data is written in the correct sequence. Databases with deferred writes may need to write across multiple disk systems, and some of these deferred writes may still be in server memory.
Error recovery procedures within the operating system, the application and various hardware units may be invoked automatically. While some units will fail and some may come back, others may not. Multiple disk drive failures may occur. RAID disk products are designed to mask a failure of one disk drive within a domain, but not multiple drive failures.
It is during these several minutes, while the disaster unfolds as a so-called "rolling disaster," that data gets corrupted. Unless you can freeze the copying of data immediately before the disaster begins, a Remote Copy function may well replicate the disaster. If the data is corrupted at the remote site, how do you begin to restart the critical applications?
At several seminars held by the Evaluator Group, we have painted the following picture to mainframe data center technicians. We ask them to imagine that a person walks into their data center and, over several minutes, randomly turns off every machine. We then ask them how long it would take to restart their critical applications, and what they would do.
Universally, the first thing they want to do is get a copy of the tape with the last mirror image of their databases. Recovery time is usually in days. When they have completed their recovery picture, we make this observation: None of them had confidence in their data as a result of this "rolling disaster." If they did not have confidence in this data, what confidence would they have in a remote copy of this data? And if they had no confidence in this remote data and would not use it in an emergency, why do they have a remote site and why implement remote mirroring?
One user reported that a customer engineer had inadvertently switched off one physical volume -- not usually a problem. But, in these days of mapping volumes or virtual volumes across multiple physical drives, error recovery can be tricky. In this instance, a simple unexpected powering off of one physical volume containing part of an IMS database resulted in corruption of the whole database.
The majority of data centers still rely on copying data to tape and trucking the tapes to a remote site. This is not without its own problems, and becomes a real challenge when sites have multiple terabytes to restore. For a true DR solution involving mirroring, you need to ensure that all writes are time-stamped to guarantee data is written in the correct sequence. Disk subsystems do not share common clocks, so the server must provide this feature. Rather than just copying data from one storage system to another, one needs to be able to transfer the application with a rapid restart capability. Vendors need to provide a recovery mechanism that ensures quick application restart at a remote site before a disaster corrupts the data.
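The time-stamping requirement can be illustrated with a small sketch. It assumes, hypothetically, that the server stamps every write before it reaches a disk subsystem, that each subsystem forwards its writes to the remote site in timestamp order, and that the remote site applies only writes no newer than the oldest "latest timestamp" it has seen from any subsystem. The names Write and consistent_cut are invented for illustration and do not describe any vendor's product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    timestamp: int   # assigned by the server, not by the disk subsystem
    subsystem: str   # which disk subsystem the write was sent to
    block: str
    value: str


def consistent_cut(received, subsystems):
    """Return the writes that are safe to apply at the remote site.

    Assumption: each subsystem delivers its own writes in timestamp order,
    so once we have seen timestamp T from a subsystem, nothing older than T
    is still in flight from it.
    """
    latest = {name: -1 for name in subsystems}
    for w in received:
        latest[w.subsystem] = max(latest[w.subsystem], w.timestamp)
    # Anything newer than the slowest subsystem's latest write might depend
    # on a write we have not received yet, so hold it back.
    cutoff = min(latest.values())
    safe = [w for w in received if w.timestamp <= cutoff]
    return sorted(safe, key=lambda w: w.timestamp)


received = [
    Write(1, "log", "L1", "begin T1"),
    Write(2, "db", "D1", "T1 data"),
    Write(3, "log", "L2", "commit T1"),
    Write(4, "log", "L3", "begin T2"),
]
safe = consistent_cut(received, ["log", "db"])
print([w.timestamp for w in safe])  # prints [1, 2]
```

In this example the remote site holds back writes 3 and 4 because the database subsystem has only reported up to timestamp 2. The approach is deliberately conservative: it loses the most recent transactions but never applies a write whose predecessors may be missing, which matches the point that lost transactions are tolerable and corrupted data is not.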
Geographically Dispersed Parallel Sysplex
Probably the best solution today is IBM's Geographically Dispersed Parallel Sysplex (GDPS). GDPS is architected to transfer the application and the transactions to the remote site, rather than just the data. Considering there are thousands of reads and writes for every transaction, this seems a wise choice.
GDPS uses peer-to-peer remote copy (PPRC), which is a synchronous copy solution that is also offered by Amdahl, Hitachi Data Systems and StorageTek. EMC offers a version of its Symmetrix Remote Data Facility (SRDF) that operates with GDPS. An EMC-supplied STARTIO exit routine traps PPRC commands and issues the appropriate SRDF equivalent. With SRDF, GDPS customers can also use the EMC TimeFinder point-in-time copy feature.
The disadvantages of GDPS are its limited distance and the cost and complexity of implementing the product.
One user we spoke to has recognized the exposure with remote copy products. He told his management that the proposed remote copy solution would not work in a real world disaster. Therefore, rather than spending the money on a remote site and remote copy software, they should pretend they had a remote site. In the event of a disaster, they would be in the same situation as if they had a remote site, but without the associated cost. Cynical as that user may be, he was not confident that the vendor could deliver a DR solution that would actually work. Given the cost and the criticality of DR, you should apply due diligence before deciding which offering you want to bet your company on. Unfortunately, that involves more than getting a few vendors in and listening to their promises.
You need to get the vendors to supply the equipment necessary, so you can simulate your own disaster. For a remote mirroring solution the vendors need to supply a primary and secondary storage system. You should create a script to run against a copy of your production database. Then, when the test suite is running, simulate a rolling disaster by randomly switching disk volumes off and sometimes on again over a few minutes. Then, test the integrity of the data on the secondary system. You may also become a cynic!
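Before running such a test on real equipment, it can help to reason about what the test should reveal. The following is a toy model only, not a substitute for the hardware test: it assumes two mirrored volumes (a log and a database), a dependent-write rule that each transaction's log record is written before its data, and mirror copies that are randomly switched off and sometimes on again. All function names, probabilities and the freeze_on_failure option are hypothetical.

```python
import random


def run_rolling_disaster(num_txns=200, fail_prob=0.05, recover_prob=0.3,
                         freeze_on_failure=False, seed=0):
    """Simulate mirroring two volumes while they are randomly switched off/on.

    Returns the set of transaction ids that reached each secondary volume.
    """
    rng = random.Random(seed)
    up = {"log": True, "db": True}           # is the mirror copy reachable?
    secondary = {"log": set(), "db": set()}
    for txn in range(num_txns):
        frozen = False
        for vol in up:
            if up[vol] and rng.random() < fail_prob:
                up[vol] = False              # volume switched off
                frozen = freeze_on_failure   # optionally stop ALL copying
            elif not up[vol] and rng.random() < recover_prob:
                up[vol] = True               # volume switched on again
        if frozen:
            break                            # secondary stays at a known point
        for vol in ("log", "db"):            # dependent writes: log, then data
            if up[vol]:
                secondary[vol].add(txn)
    return secondary


def integrity_violations(secondary):
    """Transactions whose data reached the secondary without their log record."""
    return sorted(secondary["db"] - secondary["log"])


naive = run_rolling_disaster(freeze_on_failure=False, seed=1)
frozen = run_rolling_disaster(freeze_on_failure=True, seed=1)
print("naive mirror violations: ", len(integrity_violations(naive)))
print("frozen mirror violations:", len(integrity_violations(frozen)))
```

In this model, freezing all copying at the first failure always leaves the secondary at a consistent, if slightly older, point in time, whereas uncoordinated mirroring can leave data on the secondary with no matching log record. A real test would replace the integrity check with whatever verification your database provides.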
Dick Bannister is a Senior Partner at the Evaluator Group Inc., an industry analyst organization focused upon storage products. He can be reached via e-mail at dick@evaluatorgroup.com