Recovery or Tolerance? When It Comes to Dealing with Disasters, the Choice Is Yours
Disaster Recovery and Disaster Tolerance are a necessity, rather than an option, for an increasing number of businesses. The globalization of business is driving the requirement for increased levels of IT resource availability. We have witnessed a dramatic increase in business awareness regarding the importance of avoiding information system failures. We are seeing the beginning of Web enablement within institutions and organizations that have traditionally been based on paper and mail. Web enablement is accelerating the rate of change in the ways that organizations work, propelling them into this new online, interactive, information-based reality which is full of challenges and opportunities.
In the Web-enabled world, any organization without a highly available information infrastructure runs the risk of falling behind the competition and potentially being driven out of business. E-commerce, telecommuting, the emergence of the virtual corporation and competition on a global scale are forcing companies to provide their services on a global 24x7 basis. This, in turn, is driving the ever-increasing requirement for computer systems that provide enhanced or continuous availability. This reality imposes the requirement for disaster or fault-tolerant information systems without which most businesses will not be able to effectively compete, let alone survive.
Customer-facing applications are those in which a customer interacts with a business directly, whether online, over the phone or at an ATM. Businesses are implementing these applications to increase efficiency, enhance productivity and lower costs (both for the customer and for the business). Through customer-facing applications, customers interact more directly with the corporate information resource than at any other time in business history. Any service outage experienced by users of this type of application will have a dramatic impact in terms of lost revenue, lost transactions and customer defection to competitors.
Risk Assessment
The role of risk assessment and analysis, in both disaster recovery and fault tolerant information systems, is to minimize the risk associated with a disaster, fault or system failure. Both approaches deliver the most benefit when the disaster or fault is avoided altogether. When a business cannot continue its work due to a partial or total information system failure, the impact will include one or more of the following: an inconvenience, a temporary loss of productivity and revenue, a severe impact on the business’s financial health and/or a threat to public and personnel safety.
When the impact is more than an inconvenience, the event is considered a disaster, and businesses should have a plan to either recover from it (Disaster Recovery) or avoid it in the first place (Disaster Tolerance). The choice must be based on a realistic assessment of the potential severity of the disaster, weighed against the estimated cost of the avoidance or recovery measures.
Fault Tolerant Systems
Regularly backing up data and applications to tape, disk or a remote site is one proven way to provide an avenue for recovery from a catastrophic loss or disaster. However, the preferred and most effective solution is to avoid the risk altogether and ensure that the data and computer will always be available.
Fault Tolerant Information Systems allow applications to continue processing without impacting the user, application, network or operating system in the event of a failure, regardless of the nature of that system failure. All fault tolerant information systems use redundant components running simultaneously to check for errors and provide continuous processing in the event of a component failure. However, to truly meet the requirements of mission-critical applications, such as data servers, network servers and Web servers, fault tolerant information systems must satisfy the following requirements:
1. The system must identify any single error or failure.
2. The system must be able to isolate the failure and operate without the failed component.
3. The failed component must be repairable while the system continues to perform its intended function.
4. The system must be able to be restored to its original level of redundancy and operational configuration.
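The four requirements above can be illustrated with a small sketch. It is not how any particular fault tolerant product works; the class `RedundantService` and its method names are hypothetical, and it only detects failures that surface as exceptions, not silently wrong results.

```python
from typing import Callable, Dict, List, Set


class RedundantService:
    """Toy model of redundant components serving the same function."""

    def __init__(self, replicas: Dict[str, Callable[[int], int]]) -> None:
        self._active: Dict[str, Callable[[int], int]] = dict(replicas)
        self.isolated: Set[str] = set()

    def active_names(self) -> List[str]:
        return list(self._active)

    def call(self, value: int) -> int:
        results = []
        for name, replica in list(self._active.items()):
            try:
                results.append(replica(value))
            except Exception:
                # Requirement 1: the failure is identified.
                # Requirement 2: the failed component is isolated and
                # the remaining replicas keep serving requests.
                self.isolated.add(name)
                del self._active[name]
        if not results:
            raise RuntimeError("all redundant components have failed")
        return results[0]

    def restore(self, name: str, replica: Callable[[int], int]) -> None:
        # Requirements 3 and 4: a repaired component rejoins while the
        # service is running, restoring the original level of redundancy.
        self.isolated.discard(name)
        self._active[name] = replica


if __name__ == "__main__":
    def healthy(x: int) -> int:
        return x * 2

    def broken(x: int) -> int:
        raise IOError("simulated component failure")

    service = RedundantService({"a": broken, "b": healthy})
    print(service.call(21), service.isolated)
    service.restore("a", healthy)
    print(sorted(service.active_names()))
```

The user of `call` never sees the failure of replica "a"; it is recorded in `isolated` until `restore` brings a repaired replacement back into service.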
Redundancy Is Fundamental
Basic computer theory tells us that system reliability can be improved by appropriately employing multiple components (redundancy) to perform the same function. Redundancy can be applied, and therefore should be considered, in terms of both time and space. The downside to this redundant approach is that it takes either extra time or an extra (redundant) device. No matter how you look at it, reliability requires redundancy, and redundancy expends either time or resources, neither of which is free. Furthermore, redundancy is only the starting point. It provides the basis on which one can build a reliable or continuously available information system. In order to provide the most complete protection, the following steps must be taken.
Minimize Points of Failure. Minimizing single points of failure provides the basis for ensuring fault tolerant information systems. In order to minimize single points of failure in any system, redundancy must be applied, as appropriate, in all aspects of the computing infrastructure. We’ve heard the war stories of system managers who were careful to run dual power cords to their computer systems but ran both through the same wire channel, creating the opportunity for an unaware service person to accidentally dislodge both cords while servicing the system. Some options for avoiding single points of failure include using alternative power sources and RAID disk subsystems to protect the system from being brought down by power supply or disk drive failures. The ultimate application of this principle is to duplicate a complete physical facility at a different geographical location to provide a disaster recovery site.
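The effect of redundancy, and of a shared point of failure such as a common wire channel, can be made concrete with a little arithmetic. The sketch below assumes that components fail independently and uses made-up availability figures; the function names are illustrative, not taken from any product.

```python
def parallel_availability(component_availability: float, copies: int) -> float:
    """Availability of a function served by `copies` independent,
    redundant components, where one working copy is enough."""
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if copies < 1:
        raise ValueError("at least one component is required")
    return 1.0 - (1.0 - component_availability) ** copies


def with_shared_dependency(shared_availability: float,
                           component_availability: float,
                           copies: int) -> float:
    """Same redundant set, but every copy depends on one shared element
    (a single point of failure), so the two availabilities multiply."""
    return shared_availability * parallel_availability(
        component_availability, copies)


if __name__ == "__main__":
    # Hypothetical figures: each power feed is up 99% of the time.
    print(f"one feed:             {parallel_availability(0.99, 1):.4f}")
    print(f"two independent feeds: {parallel_availability(0.99, 2):.4f}")
    # Both feeds routed through one channel that is intact 99.5% of the time.
    print(f"two feeds, one channel: {with_shared_dependency(0.995, 0.99, 2):.4f}")
```

Under these assumptions the shared channel caps the result: however many cords are added, the pair can never be more available than the channel they both pass through.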
Provide the Right Server Availability. The trade-off between availability and cost should be analyzed during the planning and implementation phases of an information system. For example, it costs more to run a system from multiple power sources or to double up on the amount of disk used for data storage. A primary consideration is the cost of a highly available system as compared to a conventional system. While some highly available information systems can cost as much as twice what a standalone system costs, and some fault tolerant systems as much as six times, the cost of these systems is small in comparison to the opportunity cost associated with a service outage. In general, the direct and indirect cost of system downtime, along with the nature of the application and the end user’s needs, should determine the amount of investment to be made in system availability.
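One way to frame this trade-off is to convert an availability level into expected hours of downtime per year and price those hours. The sketch below is a simplified model; the availability levels, hourly cost and extra system cost are invented numbers, and real downtime costs also include indirect effects such as lost customers.

```python
HOURS_PER_YEAR = 24 * 365


def annual_downtime_hours(availability: float) -> float:
    """Expected hours of downtime per year at a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


def annual_downtime_cost(availability: float, cost_per_hour: float) -> float:
    """Expected yearly cost of downtime at a given hourly outage cost."""
    return annual_downtime_hours(availability) * cost_per_hour


def worth_upgrading(base_availability: float,
                    new_availability: float,
                    cost_per_hour: float,
                    extra_annual_cost: float) -> bool:
    """True when the downtime cost avoided exceeds the added system cost."""
    savings = (annual_downtime_cost(base_availability, cost_per_hour)
               - annual_downtime_cost(new_availability, cost_per_hour))
    return savings > extra_annual_cost


if __name__ == "__main__":
    # Hypothetical: moving from 99% to 99.9% availability costs an
    # extra 50,000 per year.
    for hourly_cost in (100, 10_000):
        decision = worth_upgrading(0.99, 0.999, hourly_cost, 50_000)
        print(f"outage cost {hourly_cost}/hour -> upgrade justified: {decision}")
```

The same upgrade is hard to justify for an application whose outages are cheap and easy to justify for one whose outages are expensive, which is why the cost of downtime should drive the investment.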
Employ Capacity Planning. Capacity planning is an often overlooked factor that is important to all parts of the system. It is used to analyze the performance of various system components to ensure that the necessary performance is delivered to the users. A number of factors need to be addressed during this process, such as network loading, peak and average bandwidths required, disk size, memory size and the speed and number of CPUs required. Care must be taken to address the interactions of all system hardware and application software under the expected system load. This is particularly important when considering high availability failover configurations, where the interrelationship of all applications and middleware must be fully understood. Otherwise, the system could fail over the specific user application but not bring along all the necessary supporting middleware, such as the database.
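Two of the checks described here lend themselves to a quick sketch: whether a surviving server can carry a failed partner's peak load, and whether a failover group includes everything the application depends on. The `Server` fields, the 80 percent headroom figure and the component names are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Set


@dataclass
class Server:
    name: str
    cpu_capacity: float     # arbitrary but consistent units
    memory_gb: float
    peak_cpu_load: float
    peak_memory_gb: float


def can_absorb(survivor: Server, failed: Server, headroom: float = 0.8) -> bool:
    """True if the survivor can run both peak workloads while staying
    under `headroom` (a fraction) of its CPU and memory capacity."""
    cpu_needed = survivor.peak_cpu_load + failed.peak_cpu_load
    memory_needed = survivor.peak_memory_gb + failed.peak_memory_gb
    return (cpu_needed <= survivor.cpu_capacity * headroom
            and memory_needed <= survivor.memory_gb * headroom)


def missing_dependencies(required: Set[str], failover_group: Set[str]) -> Set[str]:
    """Components the application needs that would NOT move with it."""
    return required - failover_group


if __name__ == "__main__":
    primary = Server("primary", 8, 32, 3, 10)
    standby = Server("standby", 8, 32, 3, 12)
    print("standby can absorb primary:", can_absorb(standby, primary))
    print("left behind on failover:",
          missing_dependencies({"orders-app", "database"}, {"orders-app"}))
```

A non-empty result from `missing_dependencies` corresponds to the situation in the text where the application fails over but its database does not.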
Eliminate Serial Paths. Serial paths consist of multiple steps, where the failure of any single step can cause a complete system failure. Serial paths exist in operational procedures, as well as in the physical system implementation. Application software is often the most critical of serial elements because an application software bug cannot be fixed while the application is in operation; the application can be restarted or the system rebooted, but the bug cannot be repaired in place. A well-written application can minimize the opportunity to lose data by employing techniques such as checkpoint and restart. Checkpoint and restart stores intermediate compute results when passing data from one process to another, so that work can resume from the last stored result rather than failing the entire serial path.
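A minimal sketch of checkpoint and restart follows. It assumes a simple job that sums a list of numbers and keeps its intermediate state in a JSON file; the file layout, function names and the `fail_at` parameter (used only to simulate a crash) are illustrative.

```python
import json
import os
import tempfile
from typing import List, Optional


def save_checkpoint(path: str, state: dict) -> None:
    """Write the state to a temporary file, then atomically replace the
    checkpoint so a crash never leaves a half-written file behind."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, temp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
        handle.flush()
        os.fsync(handle.fileno())
    os.replace(temp_path, path)


def load_checkpoint(path: str) -> dict:
    """Return the last saved state, or a fresh state if none exists."""
    if not os.path.exists(path):
        return {"next_index": 0, "total": 0}
    with open(path) as handle:
        return json.load(handle)


def process(items: List[int], path: str, fail_at: Optional[int] = None) -> int:
    """Sum the items, checkpointing after each one. On restart, work
    resumes at the first item that was not yet recorded."""
    state = load_checkpoint(path)
    for index in range(state["next_index"], len(items)):
        if fail_at is not None and index == fail_at:
            raise RuntimeError("simulated crash")
        state["total"] += items[index]
        state["next_index"] = index + 1
        save_checkpoint(path, state)
    return state["total"]


if __name__ == "__main__":
    checkpoint = os.path.join(tempfile.mkdtemp(), "job.json")
    try:
        process([1, 2, 3, 4], checkpoint, fail_at=2)
    except RuntimeError:
        print("crashed; state saved:", load_checkpoint(checkpoint))
    print("total after restart:", process([1, 2, 3, 4], checkpoint))
```

The restart does not repair whatever caused the crash, but the work completed before the failure is not lost or repeated.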
Select and Manage Software. Selecting and managing the software used for critical applications are important steps that must not be overlooked. First, utilities and applications must be stable, as determined by careful selection and testing. Many IT organizations test new versions of critical applications in an offline, simulated environment or in a non-critical part of the organization for several months before full deployment to minimize the probability of crashing a critical application. The software industry has promoted software upgrades as the pathway to computing heaven. However, the rate of release and complexity of upgrades often exceed the ability of IT managers to fully qualify one upgrade before the next upgrade is released.
The pressure to upgrade should be resisted; the installation of an unstable application can be more devastating than a physical server meltdown. Likewise, even though management may be pushing to consolidate distributed applications onto fewer servers, consolidation of critical applications should be avoided because it can jeopardize their availability.
Consider Physical Issues. The physical aspects of the computing environment must be considered when establishing a reliable and safe information system environment. The primary components of the computers and network must be addressed initially. Then, consider the physical environment of space, temperature, humidity and power sources. Most of the time these factors only get attention when building a new facility and are overlooked when making small system changes, installing new systems or upgrading current systems.
Another key, and yet often ignored, element in the management of the physical environment is the actual physical security and access. It is a basic element of protection of the business’s information assets. Allowing casual access to critical information systems can result in inadvertent or intentional system outages.
Maintain Processes and Procedures. The processes and procedures used in managing the information infrastructure should provide maximum system availability with minimum interruption in service in the event of a failure. This includes access control, backup policies, virus screening/recovery, staffing, training and disaster recovery. These processes and procedures should be documented and updated regularly. They should also be exercised and revised at least once each year.
Exception procedures are elements of last resort and must be complemented by proper day-to-day operational processes that ensure the proper allocation of system resources via application and operating system tuning. In too many cases processes and procedures are ignored until a crisis. Then it may be too late to avoid a system outage. Finally, remember that even a well-documented process has little value if the operators and system managers have not been trained and updated on a regular basis.
Maintain Architecture and Design Control. The overall architecture of the system, including the major functions of each subsystem and component, must behave as an integrated whole to accomplish the business goals of the enterprise. The design of any system requires the application of trade-offs and design decisions to implement the architecture. The architecture and design decisions should be documented and managed on an ongoing basis to maintain the system’s architectural and design integrity and also to provide a means for transferring knowledge to new personnel.
The New Fault Tolerance
Commercially available fault tolerant computers have been around since the 1980s. Historically, they have been characterized as expensive to buy, proprietary in nature and complex to manage. Today, fault tolerant systems are not necessarily proprietary, but they still tend to be the most expensive option. For example, UNIX-based fault tolerant systems are more open and somewhat easier to manage, but they can cost four times as much as a standalone solution. Recently, with the advent of commodity PC servers, the Windows NT operating system and new hardware and software technologies for high-availability clustering and fault tolerance, new solutions have become available.
The key to deploying a disaster or fault tolerant information system is to assess all the risks and then take the most appropriate actions. In the case of making your computer applications and data fault tolerant, we recommend IT managers consider the following:
• Begin with redundancy in hardware and software.
• Minimize all single points of failure.
• Choose the right server availability for the job.
• Employ thorough capacity planning.
• Eliminate hardware and software serial paths.
• Carefully select and manage software.
• Consider all the physical issues.
• Apply good processes and procedures.
• Maintain consistent architecture and design control.
The fundamental guideline is not to be distracted by the cost of implementing the proper solution, but rather to look at all the cost factors, including the cost of lost business and customer goodwill. Together, these factors provide a basis for IT managers to determine the most appropriate allocation of resources for the highest level of availability consistent with the mission of the enterprise and the cost of downtime.
About the Author: Bob Glorioso is Director, President and CEO of Marathon Technologies (Boxborough, Mass.), which specializes in fault and disaster tolerance solutions. He can be reached at (800) 884-6425, or [email protected].