In-Depth
Unifying Local and Remote Recovery
Three technologies that can help you create an integrated recovery plan.
by Eric Burgener
Disaster recovery is as important as ever, but a depressed economy may have postponed your plans to implement a solution. Since DR is effectively an insurance plan for a set of events that, if you’re lucky, will occur infrequently, it’s tempting to put off implementation unless there are specific business or regulatory mandates driving it.
The idea that DR is a discrete problem that must be solved separately from all other IT issues may be part of the problem. When disaster strikes, getting your business re-started at an alternate location requires recovering critical data and applications at a remote site. However, recovering data and applications also must be done more locally, often on a daily basis, to keep business services running.
Why are “local” recovery and “remote” recovery thought of as two separate problems? The answer has to do with technology. When backing up to tape was the data protection best practice, manual involvement was needed for local and remote recovery. To facilitate data recovery, daily backups to tape were stored locally; weekly copies were created and usually shipped to a remote location. Recovering applications was a manual process that took a long time and was greatly dependent on the skill set of the administrator(s) doing the work. Backup software, tape storage closets, trucks, and manual intervention are difficult to combine into a single, seamless process. This labor-intensive recovery approach often resulted in long recovery times and lost data, particularly at remote sites (since the data available for recovery purposes there was much older).
Transforming Recovery
Vendors initially offered disk as a recovery medium to help improve backup. Although disk did offer improvements in areas such as backup window, recovery point objective (RPO, or data loss on recovery), recovery time objective (RTO, or recovery speed), and recovery reliability, the basic paradigm upon which all local and remote recovery was built (point-in-time backups) did not change.
The use of disk introduces a more valuable option: access to newer technologies that can literally transform data protection and application recovery, both locally and remotely. These technologies include continuous data protection, heterogeneous asynchronous replication, and automated application recovery. They effectively allow local and remote data and application recovery to become part of the same recovery continuum, lowering the impact of data and application protection and recovery operations while making them faster and significantly more reliable. These technologies also simplify administration by cutting out several steps.
By integrating those technologies into a single platform, companies can leverage a single, integrated recovery solution that can handle both local and remote recovery requirements for data and applications. The solution can be justified for its ability to recover from disasters as well as for its everyday ability to help you recover files and/or entire systems, depending on your requirements. This is a “DR” solution that provides value each and every day.
I already have a backup solution, you might say. If you are not having any problems with backup windows, RPO, RTO, or recovery reliability, you may not be interested in an integrated recovery platform. However, if you’re still doing point-in-time backups (regardless of whether you’re using disk or tape), chances are that high data growth rates, high overhead, poor recovery capabilities, and/or high administrative costs (due to backup scheduling, tape handling, etc.) are causing problems. In that case, you may want to consider improving your “backup” capabilities with an integrated recovery solution.
Evaluating the Transformational Technologies
Let’s look at what each of these technologies brings to the table:
Continuous Data Protection (CDP)
This technology completely dispenses with point-in-time data protection regimens, collecting application changes in real time and sending them to a disk-based log. The size of the log is determined by a retention window defined by the administrator. As long as data is in the log, it can be used to retroactively create a disk-based image of any selected recovery point. Think of this as TiVo for your data. You can roll backward or forward to any point within the log and generate the required recovery points.
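As a rough illustration of the log-and-rewind idea, here is a minimal Python sketch. It assumes a toy in-memory journal; the `CdpLog` class, its method names, and the block labels are hypothetical and do not reflect any vendor's product. Writes that age out of the retention window are folded into a baseline image, and any point still inside the window can be rebuilt on demand.

```python
from bisect import bisect_right
from dataclasses import dataclass, field


@dataclass
class CdpLog:
    """Toy change journal: a baseline image plus time-ordered block writes."""
    retention_seconds: float
    baseline: dict = field(default_factory=dict)  # block -> data at window start
    entries: list = field(default_factory=list)   # (timestamp, block, data)

    def record(self, timestamp, block, data):
        """Append one write as it happens (timestamps must be non-decreasing)."""
        self.entries.append((timestamp, block, data))
        self._age_out(now=timestamp)

    def _age_out(self, now):
        """Fold writes older than the retention window into the baseline."""
        cutoff = now - self.retention_seconds
        while self.entries and self.entries[0][0] < cutoff:
            _, block, data = self.entries.pop(0)
            self.baseline[block] = data

    def image_at(self, timestamp):
        """Rebuild the volume image as of any timestamp still inside the log."""
        image = dict(self.baseline)
        times = [t for t, _, _ in self.entries]
        for _, block, data in self.entries[:bisect_right(times, timestamp)]:
            image[block] = data
        return image


log = CdpLog(retention_seconds=100)
log.record(10, "block-7", "v1")
log.record(20, "block-7", "v2")
log.record(30, "block-9", "x")

before_overwrite = log.image_at(15)  # state just before the second write
latest = log.image_at(30)            # state after all three writes
```

The recovery point is chosen after the fact, by timestamp, rather than being limited to the moments when a backup job happened to run.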
CDP eliminates the need to run backups against production servers, and instantaneous resource requirements for data protection become negligible on both the servers being protected and the network. It doesn’t just decrease the backup window; it eliminates it, and you’ll get rid of backup agents on the servers you are protecting.
Because of the way CDP captures data, it also offers unbeatable recovery granularity. Data is recoverable as soon as it is created, not just once it’s backed up. CDP gives you a very granular continuum of possible recovery points, allowing you to pick the one that best meets the failure scenario. Contrast this with point-in-time models that can only offer recovery points where backups were taken. Wouldn’t you rather select your recovery points after you know how and why you have to recover instead of before? The ability to recover granularly is what allows CDP to support recoveries with near-zero data loss.
Because granular recovery capabilities are only important for near-term recoveries, most enterprises using CDP configure it with a one- to three-day window. The amount of capacity required is a relatively simple calculation: the size of the initial data set plus the changes over time. A two-day retention window set up against a 100GB database with a 5 percent daily change rate would require 110GB (the 100GB data set plus two days of changes at 5GB per day). That 110GB would be all you need to protect the most recent two days of data on an ongoing basis, since data is discarded after it “ages” out of the CDP log.
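The sizing rule above can be written as a one-line function. This is a simple sketch of the arithmetic in the text; the function name and parameters are illustrative, and it assumes a constant daily change rate with no compression or deduplication.

```python
def cdp_capacity_gb(dataset_gb, daily_change_rate, retention_days):
    """Initial data set plus the changes accumulated over the retention window."""
    return dataset_gb + dataset_gb * daily_change_rate * retention_days


# The example from the text: 100GB database, 5 percent daily change, two days.
two_day = cdp_capacity_gb(100, 0.05, 2)

# Stretching the same database to a three-day window.
three_day = cdp_capacity_gb(100, 0.05, 3)
```

For the two-day case this works out to roughly 110GB, matching the figure above; each additional day of retention adds about another 5GB.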
If you need to keep data longer to meet compliance requirements, it can always be backed up to tape from a disk-based image created by the CDP product. It’s common for CDP customers to meet near-term recovery requirements from the CDP log, keeping at least one tape-based backup server so they can create a tape copy of the backup every week or every month to meet compliance requirements. This approach completely insulates the creation and use of any disk-based images from the production environment.
Some CDP products can use application snapshot APIs such as Windows Volume Shadow Copy Service (VSS), Oracle RMAN, and others to mark specific recovery points in the data stream that are particularly interesting. Those could be application-consistent recovery points that can enable rapid, reliable recovery, or they could be business process points such as pre-patch and post-patch points, database checkpoints prior to the start of a large batch job, or monthly or quarterly financial closes, to name a few.
Marking these points in this way makes it simple and easy to refer to them if and when they are needed to feed any of a variety of “off-host” operations that use production data: test, development, reporting, analysis, data migration, etc. CDP also allows these data sets to be created on demand without impacting production environments in any way. This further increases the return on investment of the “DR” solution, because these tasks provide value every day.
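To make the idea of marked recovery points concrete, here is a small Python sketch. It assumes markers are simply labels attached to timestamps in the change stream; the `RecoveryBookmarks` class and the label names are hypothetical, and a real product would obtain its consistency markers from the application snapshot APIs rather than from a flag set by hand.

```python
from dataclasses import dataclass, field


@dataclass
class RecoveryBookmarks:
    """Named markers laid over a continuous change stream (illustrative only)."""
    marks: list = field(default_factory=list)  # (timestamp, label, app_consistent)

    def mark(self, timestamp, label, app_consistent=False):
        """Record a marker and keep the list ordered by time."""
        self.marks.append((timestamp, label, app_consistent))
        self.marks.sort()

    def find(self, label):
        """Return the timestamp of the most recent marker with this label."""
        hits = [t for t, name, _ in self.marks if name == label]
        return hits[-1] if hits else None

    def latest_consistent(self, before):
        """Newest application-consistent marker at or before a given time."""
        hits = [t for t, _, ok in self.marks if ok and t <= before]
        return hits[-1] if hits else None


bookmarks = RecoveryBookmarks()
bookmarks.mark(100, "pre-patch", app_consistent=True)
bookmarks.mark(160, "post-patch", app_consistent=True)
bookmarks.mark(200, "batch-start")

rollback_point = bookmarks.find("pre-patch")
safe_point = bookmarks.latest_consistent(before=210)
```

An off-host job such as a test refresh or a report run could then ask for an image by label instead of by raw timestamp.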
Heterogeneous Asynchronous Replication
Local data protection is about making local copies. Disaster recovery requires remote copies that are far enough away from the primary site that a disaster that takes the site offline will not impact the recovery site. Asynchronous replication allows you to reliably make and maintain a near-real-time copy of production data at a remote site any distance away without impacting the performance of production applications. It is similar to local disk mirroring, but it has been modified to work with a remote target that is connected to the primary server across a network, and the application does not wait for the remote copy to be written.
Compare this to the older tape method where once a week you make an extra copy on tape and ship it to a remote site. By the time the tape arrives at the remote site, the data is probably at least a week old. If you have to use it to recover, you will lose at least days of data (if not a week or more). Asynchronous replication will get the data to a remote site within seconds or minutes of its creation, limiting data loss on recovery.
Newer asynchronous replication technologies support flexible replication topologies like 1-to-N or N-to-1. A 1-to-N topology can be used to send current copies of production data to one or more targets. This is a critical piece of functionality in the creation of an integrated recovery platform. To create a single solution that covers both local and remote data recovery, you need a current copy of the data in at least two locations (one local and one remote). Asynchronous replication gives you a way to automate that process, making it faster and more reliable without requiring any manual involvement.
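The 1-to-N pattern can be sketched in a few lines of Python. This is a toy model under stated assumptions: in-memory dictionaries stand in for storage, one queue and one background thread stand in for each network link, and the `AsyncReplicator` class and target names are invented for illustration. The point it shows is that a write returns as soon as it is committed locally, while each target catches up on its own schedule.

```python
import queue
import threading


class AsyncReplicator:
    """Toy 1-to-N replicator: writes are acknowledged locally, shipped later."""

    def __init__(self, target_names):
        self.primary = {}
        self.targets = {name: {} for name in target_names}
        self._queues = {name: queue.Queue() for name in target_names}
        for name in target_names:
            threading.Thread(target=self._ship, args=(name,), daemon=True).start()

    def write(self, key, value):
        """Commit locally and return at once; never wait on a remote site."""
        self.primary[key] = value
        for q in self._queues.values():
            q.put((key, value))

    def _ship(self, name):
        """Background sender: apply queued writes to one target, in order."""
        q = self._queues[name]
        while True:
            key, value = q.get()
            self.targets[name][key] = value
            q.task_done()

    def drain(self):
        """Block until every target has caught up (for demos and tests only)."""
        for q in self._queues.values():
            q.join()


replicator = AsyncReplicator(["local-copy", "remote-copy"])
replicator.write("orders/1001", "created")
replicator.write("orders/1001", "shipped")
replicator.drain()
```

Because each target has its own ordered queue, a slow remote link delays only that target and does not hold up the production write path or the local copy.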
Heterogeneous support further increases the value of this technology. In the past, when the replication source and target had to have the same exact model of storage, these configurations could become quite expensive. Heterogeneous support allows you to use storage you already have, preserving existing investments, or providing the freedom to purchase any storage you want going forward.
Automated Application Recovery
Recovering an application environment manually requires a common set of steps across most applications. To start, confirm you have a reliable recovery point (in terms of data integrity); many administrators prefer having an application-consistent recovery point because it results in faster, more reliable recovery than crash-consistent points. Application-consistent recovery points can be created using application snapshot APIs like Windows Volume Shadow Copy Service (VSS) for Windows applications, Recovery Manager (RMAN) for Oracle, backint for SAP, etc.
Next, you’ll want to bring the recovery server up, mount the selected recovery point, and start the application. Then the network-attached clients that were using the application will need to be redirected to the new physical location where the application service now resides, a process that generally requires updating Active Directory (AD) or Domain Name System (DNS) records.
Application failover/failback products will automate this process, making application service recovery rapid, reliable, and predictable. The ability to move application services around this way is handy not only in recovery scenarios but also for such tasks as DR testing and minimizing the impact of maintenance operations (such as software updates).
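The manual sequence described above (verify the recovery point, start the recovery server, mount, start the application, redirect clients) can be sketched as an ordered runbook. This Python sketch is only a model: the `site` dictionary and every step function are hypothetical stand-ins for real servers, storage, and directory updates, and no actual failover product's interface is implied.

```python
def run_failover(steps):
    """Run recovery steps in order and stop at the first one that fails."""
    completed = []
    for name, action in steps:
        if not action():
            return {"ok": False, "failed_step": name, "completed": completed}
        completed.append(name)
    return {"ok": True, "failed_step": None, "completed": completed}


# Hypothetical environment state standing in for real servers and directories.
site = {"recovery_point_ok": True, "server_up": False, "mounted": False,
        "app_running": False, "dns_target": "primary-site"}


def verify_recovery_point():
    return site["recovery_point_ok"]


def start_recovery_server():
    site["server_up"] = True
    return True


def mount_recovery_point():
    site["mounted"] = site["server_up"]  # can only mount on a running server
    return site["mounted"]


def start_application():
    site["app_running"] = site["mounted"]  # needs the data to be mounted
    return site["app_running"]


def redirect_clients():
    site["dns_target"] = "recovery-site"  # stands in for an AD or DNS update
    return True


FAILOVER_STEPS = [
    ("verify recovery point", verify_recovery_point),
    ("start recovery server", start_recovery_server),
    ("mount recovery point", mount_recovery_point),
    ("start application", start_application),
    ("redirect clients", redirect_clients),
]

result = run_failover(FAILOVER_STEPS)
```

Encoding the order and the stop-on-failure rule once is what makes the outcome predictable: clients are never redirected to a site where the application failed to start.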
Bringing It All Together
How are these technologies combined to provide an integrated recovery solution? CDP is used to capture the data from the production servers, asynchronous replication is used to send that data to target location(s), and automated application recovery can be used to easily move application services around to different locations.
CDP and replication handle the “data” part of the equation, providing the same set of recovery capabilities at local and remote sites. Their combined use eliminates backup windows, minimizes data loss on recovery, results in recovery times literally within minutes even at remote sites, and exhibits excellent recovery reliability (since it’s based on disk). Application recovery technology can be used at either local or remote sites to handle maintenance, high availability, or disaster recovery needs.
These types of solutions are becoming available from several vendors and have much to offer for improving comprehensive recovery capabilities while minimizing or eliminating tape infrastructure and reducing manual involvement in data protection operations. They let you combine data and application recovery into a single solution that services both local (backup) and remote (DR) requirements.
Eric Burgener is senior vice president of product management for InMage. Prior to joining InMage, he served as a storage industry analyst with The Taneja Group and has held executive-level positions at Mendocino Software, Topio (acquired by NetApp), Veritas Software, and Dell.