In-Depth
The End of Disaster Recovery Planning
Disaster recovery planning must become an integral part of application development so the right middleware and coding choices are made at the outset of system design.
For years, disaster recovery planning (DRP) has been treated as an afterthought—as a set of capabilities to be bolted on to an IT platform only after it has been designed and deployed. Today, the efficacy of such an approach is increasingly questionable. Changes in the nature of enterprise computing have made traditional DRP an anachronism.
As a concept, “bolt-on DRP” derives from the days of mainframe computing. In the mainframe data center, where systems, peripherals, and terminal networks were homogeneous and tightly coupled by vendor-specific standards, DRP was regarded as a “secretary-friendly” task. Planning required little more effort than signing a “hot site” agreement with either the mainframe vendor or a third party. Testing to validate the workability of the arrangement was scheduled into the operations calendar.
The inherent strategy for mainframe recovery consisted of replacing the existing hosting environment with identical technology. While better-heeled companies built redundant data centers to protect against prolonged outages of mission-critical systems, most firms with the wisdom to prepare for the possibility of a disaster (and that has consistently been fewer than half of companies worldwide) bought a “share” in a platform maintained by someone else.
Over time, even this “share” became a virtual thing. In the early 1990s, first Comdisco, then IBM, began selling platform replacement on a “MIPS, not sites” basis: you didn’t need to know the physical location of the facility where your systems would be restored; the hot site service vendor would fit your applications into some available mainframe space somewhere. (This opened the door, by the way, for a couple of charlatans who sold non-existent facilities to gullible customers.)
However, in the late 1990s, with the approach of Y2K, there was a sudden acceleration in the re-hosting of applications on distributed servers. Fearful of the impact of date format changes on the operation of older, mainframe-based COBOL applications, many companies decided to migrate mission-critical applications to UNIX and Windows servers interconnected by TCP/IP networks. Multi-tier client-server systems became commonplace in many shops.
The net effect of the boom in client-server computing was to increase the number of target systems that needed to be replaced in an outage. Target proliferation was made worse by the multiple client-server middleware products used to stitch together application components loaded on different hosts. Even in “shrink-wrapped” enterprise resource planning (ERP) products, different middleware products were often used to interconnect different functional modules, a reflection of the consolidation of vendors in the ERP space at the time.
The important point about middleware is that some products require “hard coding” to link machines together. That is to say, for software components on server A to communicate with components on server B, the two systems must carry consistent “addresses” in the form of machine identifiers or some other permanent mechanism. Recovering such systems requires that they be replicated down to the MAC ID level in a recovery environment, which is no small task given the constant infrastructure changes that occur in most production data centers.
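To make the problem concrete, here is a minimal sketch in plain Python, rather than any particular middleware product, of what hard-coded coupling looks like; the host address, port, and service name are hypothetical. Anything baked into the client this way must be reproduced exactly at the recovery site.

```python
import socket

# Hypothetical illustration of "hard-coded" middleware coupling: the client
# component on server A reaches its peer on server B through a fixed address
# compiled into (or configured permanently for) the application. Restoring the
# application elsewhere means reproducing this address exactly, down to whatever
# permanent identifier (IP, machine ID, MAC-derived name) the middleware keys on.
ORDER_SERVICE_HOST = "10.1.17.42"   # fixed address of server B in production
ORDER_SERVICE_PORT = 7001           # fixed listener port baked into the client

def send_request(payload: bytes) -> bytes:
    """Connect to the hard-coded peer and exchange a single message."""
    with socket.create_connection((ORDER_SERVICE_HOST, ORDER_SERVICE_PORT), timeout=5) as sock:
        sock.sendall(payload)
        return sock.recv(4096)

if __name__ == "__main__":
    print(send_request(b"GET_ORDER 12345"))
```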
The alternative was to use certain types of messaging middleware that, upon initiation, “discovered” the locations of client-server software components and built a directory of addresses dynamically. This strategy enabled applications to be restored at an alternate facility in a disaster recovery situation without the need for one-for-one platform replacement. However, little attention was paid to this advantage in the hectic environment of Y2K preparations, and in many firms applications were rolled out using hard-coded middleware.
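By contrast, here is a minimal sketch of the discovery approach, again illustrative Python rather than any vendor's messaging product: components publish whatever address they happen to bind at startup into a directory, and peers resolve one another by logical name at call time, so nothing ties the application to a particular machine. The directory lives in a single process here purely for brevity; real middleware distributes it across the network.

```python
import socket
import threading

# Illustrative only: a tiny "directory" mapping logical component names to
# whatever address each service binds at startup. Real messaging middleware
# builds such a directory across machines (via broadcast discovery or a naming
# service), but the principle is the same: the client is never tied to a fixed host.
directory: dict[str, tuple[str, int]] = {}

def start_order_service() -> None:
    """Bind to any free local port and register under a logical name."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))                        # port chosen at startup, not hard-coded
    server.listen(1)
    directory["order-service"] = server.getsockname()    # publish the address just chosen

    def serve() -> None:
        conn, _ = server.accept()
        with conn:
            conn.recv(4096)                               # read the request
            conn.sendall(b"ORDER 12345: shipped")         # send a canned reply

    threading.Thread(target=serve, daemon=True).start()

def call(logical_name: str, payload: bytes) -> bytes:
    """Resolve the peer by name at call time, then connect to wherever it lives now."""
    host, port = directory[logical_name]
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(payload)
        return sock.recv(4096)

if __name__ == "__main__":
    start_order_service()
    print(call("order-service", b"GET_ORDER 12345"))
```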
With the rise of the Internet, many firms “Web-enabled” their client-server applications so that they would be accessible via a browser to both local and remote end users. To accomplish this, additional middleware, Web servers, and application servers were brought into play. Older client-server configurations, already resembling a Jenga game, were in many cases “lifted off their blocks” so that new Web technology could be inserted underneath. Many applications have been left teetering precariously in “n-tier” client-server configurations as a result.
Now, with the advent of networked storage and Fibre Channel SANs, still more tiers are being added to the compute platform. SANs are intended to untether storage from its one-to-one relationship with servers in order to permit nondisruptive scaling of storage capacity in support of space-hungry databases. However, this technology too is half-baked and still in its infancy.
The end result is that strategizing for application recovery has become a much more difficult and costly endeavor, and with the growing deployment of unstable FC SANs, data is increasingly placed at risk.
If this picture of the current situation in IT is a bit darker than one might read in a vendor brochure, it might be a reflection of the tendency of most people to ignore vulnerability until a disaster actually happens. Truth be told, for the problems to be rectified, the practice of disaster recovery planning must mature.
The consideration of disaster recovery capabilities can no longer be deferred until after the systems are rolled out. DRP needs to become an integral part of application development so that the right middleware and coding choices—those that enable cost-effective system replacement—are made at the outset of system design.
Moreover, DRP must become a much more important criterion in IT acquisition decision-making. Commodity, standards-compliant hardware needs to be preferred over “stovepipe” gear to ensure ready replacement, and technologies that are short on standards—such as SANs—need to be selected with enormous due diligence and caution. In the storage world, standards have been written with sufficient “wiggle room” that a vendor can manufacture a SAN switch that is in full compliance with the standards yet will not interoperate with the equally compliant switches of its competitors.
This new role for DRP—as a proactive component of system and network design—constitutes nothing less than a sea change from the traditional approach. It will require a level of expertise about a number of technologies that has usually been outside the domain of traditional planners. It will also require a new set of policies and procedures to implement and integrate disaster recovery requirements into design criteria and platform selection.
Disaster recovery planning is dead. Long live disaster recovery planning.
About the Author
Jon William Toigo is chairman of The Data Management Institute, CEO of the data management consulting and research firm Toigo Partners International, and a contributing editor to Enterprise Systems, where he writes the Storage Strategies column. Mr. Toigo is the author of 14 books, including Disaster Recovery Planning, 3rd Edition, and The Holy Grail of Network Storage Management, both from Prentice Hall.