In-Depth
Reliance on Technology: Driving the Change to Advanced Recovery
In most companies, technology is shifting at an ever faster pace from a business-supporting to a business-enabling role. Business functions have become so automated that the technology itself has become the method of delivering service. In many industries, the inventory process is heavily technology-dependent.
Integration of technology is also raising its criticality. For example, with enterprise resource planning and data warehousing implementations, applications and data across the enterprise are being integrated into single databases. These applications can provide a tremendous boost to productivity. At the same time, all the information within these databases becomes critical because it is all interrelated, having a significant impact on how it needs to be recovered.
In other instances, such as with e-commerce and EDI, all data entry is occurring online. The only record of transactions is the electronic data itself. There is no way to manually recreate the transaction. If the system goes down, those transactions that have occurred since the last backup are lost and no new transactions can occur while the system is down.
In each of these situations, emerging technologies, increasingly complex deployments and the growing reliance on technology are causing recovery windows to shrink, and quickly. In fact, according to GartnerGroup projections, organizations that could tolerate a two-day recovery time objective just last year will see that horizon diminish to one day or less by the end of this year. Yet in many organizations, the criticality of these systems is not reflected in their recovery programs. Those attempting to apply traditional recovery methods in these evolving environments will find the survival of their business threatened.
Several key terms are used when discussing an organization’s objectives and its ability to recover. The first is recovery time: the amount of time that elapses from the point when the disruption occurs until the specified business operation is resumed and current business transactions can be applied. The recovery time objective (RTO) is determined by executive or business unit management and is the acceptable recovery time, captured in hours, minutes or transactions. The second is recovery point, which reflects the age of the data available to the operation upon completed recovery. This is measured in terms of time prior to the disaster event.
The recovery point objective (RPO) measures the amount of potential data loss in number of hours. As an example, an organization may determine that no information loss is acceptable for a mission-critical system, requiring an RPO of zero hours. In this situation, the data needs to be restored to the point in time that the disaster took place. Another organization may accept an RPO of 36 hours, meaning that at the completion of the recovery processes, the system could be restored with a backup from the previous day or evening, retrieved from a secure offsite location.
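The relationship between these terms can be illustrated with a small calculation. The following is a minimal sketch, not a prescribed method: the function names, the timestamps and the 24-hour RTO are all hypothetical, chosen only to show how an achieved recovery time and recovery point would be compared against their objectives.

```python
from datetime import datetime, timedelta

def recovery_metrics(last_backup, disruption, resumed):
    """Return (recovery_time, recovery_point) as timedeltas.

    recovery_time  = disruption until the business operation is resumed
    recovery_point = age of the restored data at the moment of disruption
    """
    recovery_time = resumed - disruption
    recovery_point = disruption - last_backup
    return recovery_time, recovery_point

def meets_objectives(recovery_time, recovery_point, rto, rpo):
    """Return (rto_met, rpo_met) for the given objectives."""
    return recovery_time <= rto, recovery_point <= rpo

# Hypothetical timeline, for illustration only.
last_backup = datetime(2000, 3, 1, 22, 0)   # last offsite backup taken
disruption = datetime(2000, 3, 2, 14, 0)    # disruption occurs
resumed = datetime(2000, 3, 3, 20, 0)       # operation resumes at alternate site

rt, rp = recovery_metrics(last_backup, disruption, resumed)
rto_ok, rpo_ok = meets_objectives(
    rt, rp,
    rto=timedelta(hours=24),   # assumed objective
    rpo=timedelta(hours=36),   # the 36-hour example from the text
)
print("recovery time:", rt, "meets RTO:", rto_ok)
print("recovery point:", rp, "meets RPO:", rpo_ok)
```

In this made-up timeline the recovery takes 30 hours and the restored data is 16 hours old, so the 36-hour RPO is met while the assumed 24-hour RTO is not.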
Identifying a Business at Risk
Two key steps in determining to what degree a company is facing unacceptable risk are understanding how long it would truly take to recover from a disruption and what impact that disruption would have on the business. Once the impact is understood, management and business units can make informed decisions regarding availability and recovery.
When testing a recovery program, most companies start several steps into an actual recovery plan and often focus only on a discrete portion of the recovery. As a result, the time it takes to recover during testing could be far less than the time it would take to recover from an actual outage. In fact, the discrepancy between a test and recovery is usually 10 hours or more.
To gain an accurate assessment of how long an actual recovery would take, companies need to realistically take into account all the activities that must take place from the time of disaster until their business is up and running at an alternate site. This includes the time to make a declaration decision, transport personnel to the recovery location, and ship, organize and stage backup tapes. It also includes the time for the system restore process, system start-up, application data restore, application recovery and network recovery. Another factor to consider is the time it takes to re-enter application data recorded remotely prior to and during the recovery.
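A simple tally makes the gap between a tested recovery and an actual one visible. The sketch below lists the activities named above; every duration is a hypothetical placeholder, as is the assumption that a test covers only the steps from system restore through network recovery. Real figures would come from an organization's own plan.

```python
# Activities from declaration to resumed operations.
# All durations (in hours) are hypothetical placeholders.
ACTIVITIES = [
    ("declaration decision", 2.0),
    ("transport personnel to recovery location", 4.0),
    ("ship, organize and stage backup tapes", 6.0),
    ("system restore", 5.0),
    ("system start-up", 1.0),
    ("application data restore", 6.0),
    ("application recovery", 3.0),
    ("network recovery", 2.0),
    ("re-enter application data", 4.0),
]

def timeline_hours(activities, first=0, last=None):
    """Sum the durations of activities[first:last]."""
    return sum(hours for _, hours in activities[first:last])

full = timeline_hours(ACTIVITIES)            # everything an actual outage requires
tested = timeline_hours(ACTIVITIES, 3, 8)    # assumed scope of a typical test
rto_hours = 24.0                             # assumed objective

print(f"full timeline:   {full} h")
print(f"tested portion:  {tested} h")
print(f"untested gap:    {full - tested} h")
print(f"meets RTO of {rto_hours} h: {full <= rto_hours}")
```

With these placeholder numbers the tested portion fits comfortably inside the objective, while the full timeline does not, which is the kind of exposure a partial test can hide.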
Quantifying the Business Impact
Accurately estimating the recovery timeline and comparing it to the existing RTO can provide an early warning sign of unexpected exposure. But in evaluating and selecting advanced recovery solutions, companies also need to quantify the impact of exposure. Conducting or updating an existing Business Impact Analysis (BIA) to incorporate the importance of new applications and systems on which the company now relies is an important step in setting priorities.
Business managers participating in the BIA should be made aware of the realities of an interruption to help them accurately make their assessments. Among these realities:
• Systems or applications will be unavailable for the duration of the restoration and recovery process.
• When systems are restored, the data will only be as current as the last offsite backup, requiring all online transactions since that backup be manually recreated and reentered.
• Once the applications are available, the data that moves between the applications will require synchronization.
• Test or development platforms and data are not usually recovered, protected or available.
With these issues in mind, business managers should be asked to detail how unplanned interruptions of various lengths (e.g., one hour, eight hours, 24 hours) would limit their ability to:
• Achieve revenue and income objectives.
• Meet customer requirements and expectations.
• Capture all data (including transactions) for recreation in the recovery location.
• Operate for an extended period of time.
• Comply with contractual obligations and regulatory requirements.
• Maintain the operational performance of business processes consistent with the expectations of executive management.
• Sustain the public image of the organization.
Estimating qualitative impacts, such as sustaining public image, can be difficult. However, they should not be discounted, as their impacts can be potentially devastating. For example, an organization that markets its unique ability to perform a specific function at a customer’s premises, like quote services or claim settlements, could significantly jeopardize its market position if the systems that enable it to provide onsite service were inoperable. The loss of that market position, and the damage that would ensue if the company were unable to deliver on its promises, could have a long-term impact that needs to be quantified and factored into the assessment. Conversely, a company that demonstrates a recovery capability that limits its risk or improves availability can use that information as a marketing tool and a point of differentiation between itself and its competition.
In addition to quantifying the impacts of a disruption, the BIA is also useful in helping management to establish or update the recovery time and recovery point objectives to adequately reflect the criticality of different applications and systems.
Identifying Options
Advanced recovery solutions were initially introduced in the late 1980s to meet the needs of major money center banks. Pressured by regulatory requirements, they needed to put into place methods to protect and ensure the recoverability of their mainframe-based data. In the past few years, however, numerous techniques for every platform have emerged, driven by demands from companies in virtually every major industry. As technology progresses and user demand continues to increase, more solutions continue to be introduced and become more robust. Among the techniques in use today are:
Electronic Vault. Electronic vaulting is the process of transmitting data in bulk at the tape volume, disk volume or file levels. Each of these levels is achieved with different technology, but provides for a remote copy of discrete data at a specific point in time. The scalability in the implementation allows companies to customize a solution to achieve the appropriate impact on the recovery time and recovery point. Remote Electronic Vaulting technologies include automated tape libraries from either IBM or StorageTek, bulk data processing software like Sterling Commerce CONNECT:Direct and virtual tape technology.
Standby Operating System. Maintaining a remote copy of the operating system on disk, directly attachable to the recovery processor, ensures systems can be started immediately at time of test or disaster at the recovery site. A standby operating system consists of dedicated disk and a way of updating it on a regular basis. The update can be via a tape shipment or electronic means. One product that allows you to send only the updated data within a file is Serena SyncTrac. Using standby operating system techniques in combination with other advanced recovery techniques to improve data or application availability can further enhance an organization’s recovery solution.
Remote Transaction Journaling. Organizations concerned about improving their RPO in addition to RTO should consider remote journaling. This solution includes intercepting the writes to a local database log or journal and transmitting them offsite in real-time mode, providing for a recovery point within seconds of the failure. Remote journaling solutions are specific to the operating system and the database that requires protection. In the mainframe database environment, one such product example is E-Net Corporation Remote Recovery Data Facility (RRDF).
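The idea behind remote journaling can be sketched in a few lines. This is an in-memory illustration only and does not represent RRDF or any other product: the class and function names are hypothetical, and the "remote" journal is a plain list standing in for a log maintained across a network link.

```python
class RemoteJournal:
    """Stands in for the offsite journal (hypothetical; no real network)."""
    def __init__(self):
        self.records = []

    def receive(self, record):
        self.records.append(record)

class JournaledDatabase:
    """A toy database whose log writes are also forwarded offsite."""
    def __init__(self, remote):
        self.data = {}
        self.local_log = []
        self.remote = remote

    def write(self, key, value):
        record = (len(self.local_log) + 1, key, value)  # (sequence, key, value)
        self.local_log.append(record)   # normal local log write
        self.remote.receive(record)     # intercepted and sent offsite as it happens
        self.data[key] = value

def replay(records, base=None):
    """Rebuild database state by applying journal records in sequence order."""
    state = dict(base or {})
    for _, key, value in sorted(records):
        state[key] = value
    return state

remote = RemoteJournal()
db = JournaledDatabase(remote)
db.write("order-1", 100)
db.write("order-2", 250)
db.write("order-1", 120)

# Suppose the primary site is lost here; only the remote journal survives.
recovered = replay(remote.records)
print(recovered)
```

Because every log write is forwarded as it occurs, the state rebuilt from the remote journal matches the primary up to the last transmitted record, which is what keeps the recovery point close to the moment of failure.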
Database Shadowing. Database shadowing is the combination of a point-in-time copy of a database on disk, remote journaling and the regular, scheduled application of the log/journal updates to the database. Database shadowing is a flexible option for managing to an application-specific RTO, allowing application updates to be shadowed as often as required to meet the RTO. Applications requiring shorter RTOs will need more frequent intervals of updates to the database. Regardless of the recovery time being managed to, the recovery point is within seconds of the failure. Technologies available to provide this protection are highly specialized to the environment they support. Some examples are: DB2 protection using E-Net Corporation RRDF Log Apply feature, Oracle protection using Oracle Automated Standby Database and AS/400 file support with MIMIX/Object.
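The trade-off between update frequency and recovery time can be shown with a toy model. The sketch below is an assumption-laden illustration, not a description of any of the products named: the `ShadowDatabase` class, the `simulate` function and the write workload are all hypothetical. It counts how many journal records remain unapplied at the moment of failure for two different apply intervals.

```python
class ShadowDatabase:
    """Point-in-time copy plus a journal that is applied on a schedule."""
    def __init__(self, point_in_time_copy):
        self.state = dict(point_in_time_copy)
        self.journal = []    # records received through remote journaling
        self.applied = 0     # number of journal records already applied

    def receive(self, key, value):
        self.journal.append((key, value))

    def apply_pending(self):
        """Apply unapplied journal records; return how many were applied."""
        pending = self.journal[self.applied:]
        for key, value in pending:
            self.state[key] = value
        self.applied = len(self.journal)
        return len(pending)

def simulate(writes, apply_every):
    """Run the workload, applying the journal after every `apply_every` writes.

    Returns the recovered state and the backlog that had to be applied
    at recovery time (a rough proxy for recovery effort).
    """
    shadow = ShadowDatabase({})
    for n, (key, value) in enumerate(writes, start=1):
        shadow.receive(key, value)
        if n % apply_every == 0:
            shadow.apply_pending()
    backlog = shadow.apply_pending()   # failure occurs; catch up now
    return shadow.state, backlog

# Hypothetical workload: ten writes spread over three accounts.
writes = [(f"acct-{i % 3}", i) for i in range(1, 11)]

frequent_state, frequent_backlog = simulate(writes, apply_every=3)
rare_state, rare_backlog = simulate(writes, apply_every=7)

print("same recovered data:", frequent_state == rare_state)
print("backlog when applied every 3 writes:", frequent_backlog)
print("backlog when applied every 7 writes:", rare_backlog)
```

Both runs recover the same data, since the journal holds every write in either case. What changes is the amount of catch-up work left at failure, which is why shorter RTOs call for more frequent application of the journal.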
Remote Mirroring. One of the most talked about solutions today, remote mirroring, allows for a duplicate copy of an organization’s disk data to be maintained at a remote location. This solution allows for a drastic improvement in recovery point and recovery time for the protected data.
There are two methods of remote mirroring: host-based software and controller-based software solutions. Host-based solutions available today are IBM eXtended Remote Copy (XRC) and Hitachi eXtended Remote Copy (HXRC); controller-based systems include EMC Symmetrix Remote Data Facility (SRDF), Hitachi Remote Copy (HRC), IBM Peer-to-Peer Remote Copy (PPRC) and Sun StorEdge A7000. A major advantage of controller-based mirroring is the ability to support enterprise storage recovery with a single solution. Because IT personnel are managing a single product in this scenario, it is likely to require fewer resources, and these savings should be calculated when weighing the costs. An advantage of host-based software is its understanding of local production timestamps and its ability to ensure a timestamp-consistent image at the remote site, where required.
System Replication and System Fail-Over. System replication provides a continuous operating environment by duplicating systems, data and networks at a remote location. When the ability to perform failover is added, the result is the most comprehensive solution for addressing RTO and RPO and truly achieving a high availability environment. These solutions are most robust in the open systems world, and supporting technology includes NSI Double-Take, HP MetroCluster, IBM High Availability Geographic Cluster and Lakeview MIMIX/Switch.
Hot Network Node. Rapid recovery of systems and data is only effective if they are accessible to those that need them. Establishing network communications at time of disaster can be complex and time-consuming; pre-staging of the configuration eliminates error and reduces recovery time. Organizations typically locate a hot network production node in the same location as the recovery capability. The hot network node is continually monitored and in use, thereby minimizing the potential for failure.
Each of these techniques has its own benefits, and which is most appropriate depends on each organization’s key objectives. Companies often combine several solutions to create a program customized to their specific requirements. Many companies are also combining advanced recovery techniques with traditional hot-site programs to arrive at a solution that fits their recovery objectives and can be cost-justified. For example, a company’s most critical data is mirrored to the recovery location, while important data is mirrored to a campus location. In the event of a recovery, the critical data is immediately available to a recovery processor set, and the important data is protected and available as soon as it can be copied to the recovery location. This hybrid solution significantly reduces the RTO and RPO compared to a traditional recovery while providing a cost advantage over using completely dedicated equipment for recovery.
About the Author: Carol Elstien is Director of Advanced Technology Implementation for Comdisco, Inc. (Rosemont, Ill.).