Availability in Enterprise Computing

In today’s Internet age, highly available computing resources are more important than ever. Enterprise computing has moved beyond the data center to join employees and employers on companywide intranets, in addition to vendors and customers via the Internet. More people than ever depend upon reliable access to computing resources for product information, sales channels, online transactions, supply chains, and internal tools and resources.

Businesses face a challenge in deploying applications to a wider user base while providing high levels of service and availability. It’s not enough to merely ensure that a system remains up. To be judged available by the end user, a system must also provide performance and response times that meet expectations.

This article discusses issues that affect availability and explores how system features, including dynamic reconfiguration, alternate pathing, dynamic partitioned domains and dynamic resource allocation, can provide availability for enterprise computing environments. For even higher levels of availability that cannot be provided by a single system, we explain how clustering technology can help.

Availability: More Than Just Reliable Hardware

Businesses are under pressure to improve application availability by reducing, or eliminating, system downtime. The causes of system downtime can be traced to three sources: errors associated with people, process or product.

People. People errors are possible with any non-automated task, and are caused by poor training, lack of expertise or insufficient experience.

Process. Process errors result from poorly defined or documented processes for activities such as backup, routine maintenance and change management.

Product. Product errors include hardware faults, operating system errors, application failures, power outages or environmental disasters.

Product issues are often an obvious place to start when addressing availability, but measurements have shown that these account for a surprisingly small number of service outages. To improve availability, organizations may need to first address the people and process issues within their organization.

The increasing complexity of application environments and dependence on multiple suppliers and products aggravate people problems. For example, modern Enterprise Resource Planning (ERP) environments depend on third-party database management systems (DBMS), numerous computing and networking hardware vendors, as well as system integrators to develop, deploy and manage applications.

To avert problems in these complex, dynamic environments, people need proper training and access to powerful, easy-to-use tools that simplify management and reduce the incidence of outages caused by human error.

Well-defined processes for administration, upgrades, management and recovery are also essential. Computing environments are dynamic, and the installation of new applications, frequent hardware and software upgrades, regular system backups and changes in configuration are common. IT managers who must carefully track and coordinate these changes need well-understood processes to eliminate problems that result in errors and downtime. Clearly documented processes also help eliminate finger-pointing at the time of failure and can help focus on the task at hand – restoring availability to users as soon as possible.

Given today’s complex computing environments, defining new processes requires the collaboration of IT personnel, hardware and software vendors, system integrators, and service and support organizations. Comprehensive and integrated support, training and consulting services are an important part of developing effective methodologies, sharing best practices and proactively managing change.

Products Are the Foundation of Reliable Computing

Once people and process issues have been addressed, IT managers can turn their attention to products, the foundation of reliable computing.

Continuous application availability is vital for user productivity and customer satisfaction, and no single hardware or software product can guarantee it. Instead, a complete, end-to-end solution is needed to ensure a high availability experience for end users. All pieces in the computing resource stack – hardware systems (servers and storage), network infrastructure, operating environments and applications – must be reliable and provide high levels of availability.

Today, hardware is increasingly reliable, thanks to highly integrated designs and other product features that make problems easy to diagnose and service with minimal or even no disruption to operations. But reliability alone is not enough. Hardware vendors must also design in availability and serviceability to shrink the downtime caused by failures and scheduled maintenance. Features, such as redundant and hot-swappable components, dynamic reconfiguration, alternate pathing, dynamic partitioned domains and dynamic resource allocation, can all help provide the system availability required by enterprise computing environments.

Servers featuring redundant, hot-swappable hardware components are common today. Redundant components, such as power supplies or cooling systems, improve availability by enabling a system to continue operation after a hardware failure. For example, if a server with redundant power supplies loses a single supply, the aggregate power provided by the remaining supplies is sufficient to allow operation to continue without interruption.
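The power-supply example comes down to a simple capacity check, sketched below in Python. The wattage figures and the function name survives_supply_failures are illustrative assumptions, not values from any particular server.

```python
def survives_supply_failures(supply_watts, load_watts, failed=1):
    """Return True if the load is still covered after `failed` supplies are lost.

    Worst case is assumed: the largest supplies are the ones that fail.
    """
    keep = max(0, len(supply_watts) - failed)
    remaining = sorted(supply_watts)[:keep]  # drop the largest `failed` supplies
    return sum(remaining) >= load_watts


# Hypothetical server drawing 1500 W from three 800 W supplies.
three_supplies_ok = survives_supply_failures([800, 800, 800], 1500)
two_supplies_ok = survives_supply_failures([800, 800], 1500)
```

With three supplies the remaining two still cover the load. With only two, the loss of one leaves the server short, so the configuration is not truly redundant.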

Hot-swappable components provide the ability to add or remove hardware components in a functioning system without bringing it down. This allows administrators to dynamically reconfigure a system to handle larger workloads or to replace a failed component, such as a power supply, without shutting down the system or requiring a time-consuming reboot. Together, hot-swappable and redundant components improve serviceability and allow systems to continue providing uninterrupted service after a component fails and while it is repaired or replaced.

Uninterrupted Availability

Dynamic reconfiguration extends the utility of hot-swappable components by providing the capability to reconfigure an operating environment while it continues to service users. Unlike hot-swappable components, which provide the capability to physically attach or detach boards from a live system, dynamic reconfiguration provides the ability to logically attach or detach components of a running system. This ability to logically manipulate components over and above physical manipulation is important, as it permits reconfiguration without any interruption to applications.

Dynamic reconfiguration preserves application and data integrity when removing and adding components, ensuring uninterrupted application execution with no loss of data. For example, assume you need to remove a faulty system board or want to perform a live upgrade of a system board. Dynamic reconfiguration software allows an administrator to issue a command to logically detach the board from the operating system, instructing it to begin migrating process execution, network and I/O connections and memory contents to other system boards. All pageable memory is flushed to disk, free pages are locked to prevent further use and kernel memory is remapped to other system boards. After the board is logically detached, it can safely be removed (hot-swapped) and either replaced or upgraded with additional resources such as memory or processors. Then, it can be re-inserted (hot-swapped) and logically attached to the running system, making it once again available for use.
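The detach-and-reattach sequence can be modeled as a toy sketch in Python. The Board class and the logical_detach and logical_attach functions are hypothetical names chosen for illustration; a real operating system performs these steps inside the kernel and handles many more resource types.

```python
from dataclasses import dataclass, field


@dataclass
class Board:
    name: str
    attached: bool = True
    processes: list = field(default_factory=list)
    kernel_pages: int = 0


def logical_detach(board, all_boards):
    """Move work off `board`, then mark it as logically detached."""
    targets = [b for b in all_boards if b.attached and b is not board]
    if not targets:
        raise RuntimeError("no attached board left to take over the work")
    # Migrate process execution to the remaining boards, round-robin.
    for i, proc in enumerate(board.processes):
        targets[i % len(targets)].processes.append(proc)
    board.processes = []
    # Remap kernel memory to another board (simplified: all to the first target).
    targets[0].kernel_pages += board.kernel_pages
    board.kernel_pages = 0
    board.attached = False  # now safe to hot-swap the physical board


def logical_attach(board):
    """Make a re-inserted board available to the running system again."""
    board.attached = True
```

After logical_detach returns, the board holds no processes or kernel memory, which is the condition under which it can be physically removed without interrupting applications.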

As predictive failure analysis grows more sophisticated, it becomes possible to dynamically reconfigure new components into a system to take the place of components that are predicted to fail. Dynamic reconfiguration thus further improves availability and is a major contributor to minimizing planned downtime due to upgrades or maintenance.

Alternate pathing, another technology designed to enhance system availability, allows applications to continue without interruption even after a failure in a primary data path. With alternate pathing, I/O operations in a live system are automatically and transparently redirected to a predetermined I/O channel if the primary path fails or must be removed from the configuration. When used in conjunction with dynamic reconfiguration, it provides the ability to seamlessly move all I/O from a system board before its removal for upgrade or repair.

To implement alternate pathing, a system must have available redundant I/O paths between controllers and physical I/O devices such as disks or networks. An administrator begins by defining the primary and alternate paths for each device. If a failure subsequently occurs on the primary path, the alternate pathing software automatically fails over, or redirects, all I/O operations from the failed path to the alternate path. This redirection occurs while the system is operational and requires no action from the user or administrator. Once the alternate path is activated, the failed component can be swapped out or upgraded without interrupting the system or affecting application availability.
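A minimal Python sketch of this failover behavior follows. The MultipathDevice class, its methods and the controller names are assumptions made for illustration and do not represent the interface of any actual alternate pathing product.

```python
class MultipathDevice:
    """A device reachable over an administrator-defined primary and alternate path."""

    def __init__(self, name, primary, alternate):
        self.name = name
        self.primary = primary
        self.alternate = alternate
        self.active = primary  # I/O starts on the primary path
        self.failed = set()

    def mark_failed(self, path):
        """Record a path failure and redirect I/O if the active path was hit."""
        self.failed.add(path)
        if path == self.active:
            standby = self.alternate if path == self.primary else self.primary
            if standby in self.failed:
                raise RuntimeError(f"{self.name}: no healthy path left")
            self.active = standby  # automatic failover, no administrator action

    def submit_io(self, request):
        """Applications call this the same way before and after a failover."""
        return f"{request} via {self.active}"
```

The caller of submit_io never names a path, which is what makes the redirection transparent to applications.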

Dynamic partitioned domains, a technology that traditionally has been implemented only in fault-tolerant or mainframe systems, are now available in some high-end UNIX servers. Partitioning allows administrators to divide a single multiprocessor system into multiple smaller systems called partitions or domains. Each domain is a fully functional, self-contained server running its own instance of the operating system and including one or more system boards, its own CPU, memory, I/O and network resources, and boot disk. Domains are engineered to be logically isolated from each other within the system, providing a highly secure and reliable platform for running multiple environments simultaneously. By creating "systems within a system," domains increase availability as most failures in one domain cannot affect applications running in other domains.

Dynamic partitioned domains can be used in combination with dynamic resource allocation to provide greater flexibility and availability to computing environments. Dynamic resource allocation, the ability to dynamically control resources, such as memory, CPU or I/O, enables better resource utilization and allows systems to respond flexibly to peak conditions. Careful monitoring and the use of control mechanisms can also prevent any one application from monopolizing a resource and affecting the response time or availability of other applications. Such control can provide increased levels of availability in the face of rapidly changing demands on the computing environment, and ensure graceful performance during peak loads.
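One possible control mechanism is a per-application cap, sketched here in Python. The allocate_cpu function, the 50 percent default cap and the workload names are illustrative assumptions rather than a description of any specific resource manager.

```python
def allocate_cpu(demands, total_cpus, max_share=0.5):
    """Grant CPUs by demand, but cap each application at `max_share` of the system.

    If the capped requests still exceed the system, scale them down proportionally.
    """
    cap = total_cpus * max_share
    capped = {app: min(demand, cap) for app, demand in demands.items()}
    requested = sum(capped.values())
    if requested <= total_cpus:
        return capped
    scale = total_cpus / requested
    return {app: cpus * scale for app, cpus in capped.items()}


# Hypothetical 16-CPU system: a greedy batch job cannot starve the web tier.
grants = allocate_cpu({"batch": 20, "web": 4}, total_cpus=16)
```

Here the batch job asks for more CPUs than the system has but is held to 8, leaving the web tier its full request of 4.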

Although hardware is increasingly available, protecting against hardware failures is only one-half of the equation. Software – both system and application software – is the principal cause of service interruptions today.

Fortunately, system software is becoming more robust. Extensive internal and field testing of mature operating systems eliminates many outages caused by system errors, and modern operating systems can eliminate the need to reboot in many circumstances, even after a hardware or software reconfiguration. Combining a robust operating system with reliable hardware ensures a solid platform for deploying mission-critical software applications.

Although systems are becoming more stable, application software is becoming more complex and is constantly evolving as new solutions are deployed to meet rapidly changing market conditions. As a result, the application software layer is the source of an increasing number of failures. Centralized management tools that simplify administration, reduce operations errors and assist in identifying or diagnosing system errors can greatly reduce the unavailability caused by software-related problems. Test environments, either at the customer or the vendor site, are also useful for testing the entire computing resource stack, from hardware and operating systems up through middleware and multiple application levels.

Clustering for Higher Availability

Clustering offers an innovative and comprehensive approach to delivering levels of application availability beyond those discussed so far. Clustered systems have no single point of failure in hardware, software or the network. As a result, clusters provide higher levels of availability than is possible with any single system, delivering enhanced levels of service without requiring high-priced, inflexible and proprietary technology.
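A rough calculation shows why redundancy across nodes raises availability. The Python sketch below assumes that node failures are independent and that service continues as long as any one node is up. Both are simplifying assumptions, and the 99 percent figure is purely illustrative.

```python
def cluster_availability(node_availability, nodes):
    """Probability that at least one of `nodes` independent nodes is up."""
    return 1 - (1 - node_availability) ** nodes


single = cluster_availability(0.99, 1)  # one node on its own
paired = cluster_availability(0.99, 2)  # two-node cluster
```

Under these assumptions, two nodes that are each down 1 percent of the time are both down only about 0.01 percent of the time. Real clusters fall short of this ideal because failures are not fully independent and failover itself takes time.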

Clusters start with a reliable hardware foundation that incorporates the systems, mass storage and communication components necessary to ensure data and application availability. Multiple systems each contain redundant critical components and support high-availability features such as dynamic reconfiguration and alternate pathing. Redundant disk storage, along with redundant I/O connections, protects against hardware errors and ensures highly available access to data.

Some clusters also feature network management software that provides monitoring and network adapter failover. (This software differentiates between a slow network and a failed network connection, reducing the possibility of disruptive false error recovery.) To implement network adapter failover in clusters, multiple network adapters are grouped into sets. If one adapter in the set fails, software detects the failure and another adapter in its group automatically takes over the servicing of the network requests.

This transition is transparent to users, who experience only a small delay while the error detection and failover mechanisms are in progress. Network adapter failover reduces the need for a complete failover of applications from one server to another in the event of simple network-related failures, reducing application disruption and increasing cluster availability.
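Adapter failover of this kind can be sketched in Python as follows. The AdapterGroup class, the probe mechanism and the threshold of three consecutive missed probes are assumptions chosen for illustration. A slow network still answers probes, so it resets the counter instead of triggering a failover.

```python
class AdapterGroup:
    """A set of network adapters in which one is active and the rest stand by."""

    def __init__(self, adapters, max_missed=3):
        self.adapters = list(adapters)
        self.active = self.adapters[0]
        self.max_missed = max_missed
        self.missed = {a: 0 for a in self.adapters}

    def record_probe(self, adapter, replied):
        """Feed in one health-probe result for `adapter`."""
        if replied:
            self.missed[adapter] = 0  # slow but alive: not a failure
            return
        self.missed[adapter] += 1
        if adapter == self.active and self.missed[adapter] >= self.max_missed:
            self._fail_over()

    def _fail_over(self):
        """Hand network service to the first adapter still considered healthy."""
        for candidate in self.adapters:
            if self.missed[candidate] < self.max_missed:
                self.active = candidate
                return
        raise RuntimeError("no healthy adapter left in the group")
```

Requiring several consecutive misses is one simple way to avoid the disruptive false error recovery mentioned above. The cost is the short detection delay that users experience during a real failure.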

To guard against software errors, some clusters also provide automatic application failover capabilities. During configuration, administrators specify primary and backup servers for applications. Should an operating system fault occur on one node (system), the cluster software seamlessly shifts applications from the failed node to one or more of the remaining nodes in the cluster. Clusters also protect against individual application errors by automatically restarting the application on the same server (for a faster recovery) or migrating the application to a backup server. Automatic migration of network addresses and storage resources minimizes the impact of failures and further increases usability and availability.
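The recovery choices described above can be expressed as a small decision function. This Python sketch uses hypothetical names (choose_recovery, the failure labels and the node names) and ignores details such as migrating network addresses and storage.

```python
def choose_recovery(failure, primary, backup, node_up):
    """Return an (action, node) pair for a failed application.

    failure: "application" for an application error, "node" for an OS or node fault.
    node_up: mapping of node name to True/False health status.
    """
    if failure == "application" and node_up.get(primary, False):
        # Fastest recovery: restart in place on the same server.
        return ("restart", primary)
    if node_up.get(backup, False):
        # The primary is unusable: shift the application to its backup server.
        return ("migrate", backup)
    return ("unavailable", None)


status = {"node1": True, "node2": True}
in_place = choose_recovery("application", "node1", "node2", status)
```

Trying a local restart first keeps recovery time short for simple application errors, while a node fault goes straight to the backup server specified at configuration time.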

Clusters are also uniquely suited to address concerns for scalability, providing a high-end growth path for enterprise computing environments. System resources, such as CPU and memory, can be added to individual cluster nodes, or entire additional nodes can be configured in. This allows performance to grow along with changing requirements, ensuring users get the performance and response time they need. Such scalability is especially important as companies make more and more resources available through intranets and the Internet, a place where user demands can increase dramatically and unpredictably.

It is easy to see how cluster technology can improve availability, but many system administrators would argue that cluster management is more complex than administering a single system. In addition, MIS managers ask about people issues like training and support. The news here is good. Advanced cluster technologies can help to simplify management and administration by presenting one resource to manage rather than multiple discrete systems. Integrated tools for reconfiguration and load balancing, plus ease of recovery through powerful, intuitive user-oriented tools, help reduce complexity and human error, ensuring higher levels of availability.

About the Authors: Ravi Chhabria is the Product Line Manager for Systems Software and Architecture at Sun Microsystems, Inc. He can be reached at ravi.chhabria@sun.com.

Neville Nandkeshwar is the Product Line Manager for Sun Cluster. He can be reached at neville.nandkeshwar@sun.com.