No Single Point of Failure: Clustering for the AS/400

A close-up look at the technology behind achieving high availability in an AS/400 computing environment

This is the first of two articles in which MIDRANGE Systems presents the ideas and technologies behind clustering. In part one, we examine the ways in which multiple computers have historically been combined, then jump ahead to the present day and look at fast memory bus architectures that facilitate communication and synchronization between mostly independent machines. We not only explore the current issues, challenges and solutions concerning AS/400 clustering from a technology perspective, but also attempt to predict IBM's next move in this area.

In the past, computers tended to be monolithic. One large computer served an entire corporation--management, personnel, finance, operations and even the business itself. If it failed, the entire corporation might shut down until the system was repaired. As demands grew, the cost of increasing the power or capability of the system grew as well. Computers from some companies could be field upgraded--that is, components could be added and swapped to yield more power from an existing system. For the most part, however, upgrading a system usually meant rolling out one entire system and rolling in its replacement.

Some solutions centered on duplicating resources within the system. The best example of this is I/O processor (IOP) and adapter technology. In the days of mainframes, if the I/O subsystem presented a bottleneck, it was relatively inexpensive (compared to the cost of the entire system) to add an IOP and redistribute the I/O devices across the new IOP set. The equivalent in the CPU dimension was to add processors to the system. Instead of a single CPU, systems could have multiple CPUs and could therefore present more aggregate CPU power. These are known as Symmetric Multi-Processor (SMP) systems.

The problem with SMP is that the addition of CPUs to the available SMP set fails to solve several critical problems. For example, adding a second CPU does not double the available CPU power because some of that power is necessarily lost in synchronization between the CPUs. In general, all of the SMP CPUs must also share the device and memory buses and therefore are typically located in the same system and run the same copy of the operating system. Thus, if the operating system fails, or if any of the components common to both processors fails, the entire SMP set fails.

Rapid Improvements
Technology improvements have led to several dimensions of improvement in computers. First, the power available from a single computing entity--in terms of storage and raw computing power--has increased dramatically. At the same time, the smaller computer building blocks have decreased in price by several orders of magnitude. In short, both the rate of improvement and the rate of price decrease have accelerated.

The reliability of computers and computer-related devices (disks, scanners, etc.) has also improved considerably. It's not unreasonable to buy a computer for less than $5,000 and expect that computer to run without any significant failure for three years.

Along with this increase in reliability, however, there has been a dramatic increase in the expectations of business managers (and even home computer users). It's no longer acceptable for a business computer to be available 90 percent of the time. Some businesses may survive with a reliability/availability of 95 percent, but only until a competitor recognizes the weak spot!
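To put those percentages in perspective, the arithmetic can be sketched as follows. This is a minimal illustration that assumes a system expected to run around the clock for a 365-day year; the function name is ours, not part of any product.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def downtime_hours_per_year(availability_percent: float) -> float:
    """Hours per year a system is unavailable at the given availability."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


for pct in (90, 95, 99, 99.9):
    hours = downtime_hours_per_year(pct)
    print(f"{pct}% available: about {hours:.1f} hours of downtime per year")
```

At 90 percent availability the system is down for roughly 876 hours a year, which is more than five weeks; at 95 percent it is still about 438 hours.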

The Cluster Cavalry
Cluster technology presents a different kind of multi-processor solution. Rather than tightly coupling the CPUs (sharing memory and device buses, etc.), clustering connects computers more loosely. In most cluster implementations, a highly specialized communication path exists between a cooperating set of computers. This path is very reliable and presents transmission guarantees in terms of device response time. That is, if a message is queued for transmission but fails to reach any destination, the hardware detects the failure and reports it to the software managing the cluster. Many cluster implementations allow the use of the extended SCSI protocol for intra-cluster communications.

By loosening the proximity of the coupling between the component systems in a cluster, we present an opportunity to solve many complex problems while introducing only a small number of minor challenges. For example, we can come close to perfect availability by housing one system on the East Coast and one on the West Coast of the country. We achieve completely independent power supplies and minimize our exposure to the same natural disaster. The problem we introduce is how to duplicate the data between the two sites with sufficient speed and accuracy such that the second system is a useful backup for the first.

Below, we'll focus on three benefits afforded by clustering--scalability, availability and partitioning--so that we can better describe the underlying synchronization technologies available.

The technical aspects of scalability fall into two major categories: the ability of the cluster to support additional hardware and make the new hardware resources available for use; and the ability for the newly expanded hardware and software resources to be added together and applied to the problem at hand.

The ability of a computing system to grow with the demands placed on it is one dimension of scalability. A simple example to illustrate this is found in the need to address an increasing demand for disk space. There are essentially two options here: replace the existing disk with a larger one every time space gets short, or add a disk to the existing set as required. If each disk needs a new controller or an upgraded controller and the controllers are nearly as expensive as the disks, then these solutions don't scale well. On the other hand, if an arbitrary number of disks--of ever-increasing size--can be added to the end of a daisy chain of disks, we would say that the solution scales well, as long as performance levels are maintained as the disk population grows.

With respect to applications, a cluster solution scales well if we can incrementally add cluster nodes (computers) and in so doing, be able to serve the user community better. That improvement might be in the number of users we can service simultaneously, or it might be in the speed with which we can serve any given user.

Availability (Fault Resilience)
As we'll see later, the overhead of synchronization between cluster members can sometimes be so high that it's impractical to coordinate data access in a way that lets new machine resources simply be added to the existing set in a cluster.

For this reason, the major focus of clusters has, in recent years, become that of preserving the availability of the entire computing resource, rather than providing new dimensions of scalability. Uses of a cluster to enhance availability can range from maintaining a duplicate system in the same room to protect against individual component failure (disk, power supply, etc.), to locating a duplicate system in another city, another state or even another country so that the service provided by the cluster is not threatened by related geographical, meteorological or political conditions.

If a machine in the cluster fails, at least one other machine in the cluster notices, and the tasks that were the responsibility of the failed node are redistributed between the surviving members. The manner in which the tasks are redistributed is dependent on how the cluster administrator originally configured the cluster.
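The idea of administrator-configured redistribution can be sketched in a few lines. The task names, node names and the `FAILOVER_ORDER` table below are hypothetical, chosen only for illustration; they do not reflect the configuration format of any particular cluster product.

```python
# Hypothetical configuration: each task lists its nodes in preferred order.
FAILOVER_ORDER = {
    "orders_db": ["node_a", "node_b", "node_c"],
    "web_front": ["node_b", "node_c", "node_a"],
}


def place_tasks(alive: set[str]) -> dict[str, str]:
    """Assign each task to the first node in its list that is still alive."""
    placement = {}
    for task, order in FAILOVER_ORDER.items():
        for node in order:
            if node in alive:
                placement[task] = node
                break
    return placement


print(place_tasks({"node_a", "node_b", "node_c"}))  # normal operation
print(place_tasks({"node_b", "node_c"}))            # after node_a fails
```

When node_a disappears from the set of live nodes, its task moves to the next node named in the administrator's list, and nothing else changes.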

It's worth examining the statistics of device failure in order to understand why the duplication of devices is almost always adequate protection against those failures. The failure rates of devices such as disks and fans are expressed as MTBF--Mean Time Between Failures. This is the average number of operating hours between failures, taken across a large population of devices; it is not a guaranteed lifetime. It does not imply that the device cannot fail before that time is up, nor does it imply that the device must fail after that much time in service. The significant consideration is that for sufficiently large MTBF ratings, it's very unlikely that two similar devices, even of a similar age, will fail within the same 24-hour period. For this reason, having a single level of redundancy provides adequate time for repair or replacement as well as adequate protection against loss.
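A rough calculation shows why a second failure inside the repair window is so unlikely. The sketch below assumes a constant failure rate (an exponential failure model) and independent devices, and the 500,000-hour MTBF is a hypothetical figure rather than a rating for any real device.

```python
import math


def prob_failure_within(hours: float, mtbf_hours: float) -> float:
    """P(device fails within `hours`), assuming a constant failure rate."""
    return 1 - math.exp(-hours / mtbf_hours)


MTBF = 500_000  # hypothetical rating, in operating hours

# Chance that the surviving device also fails during a 24-hour repair window.
p_second = prob_failure_within(24, MTBF)
print(f"Second failure within 24 hours: {p_second:.6%}")

# Under the same model, the chance that a device fails before reaching its MTBF.
print(f"Failure before the MTBF figure: {prob_failure_within(MTBF, MTBF):.1%}")
```

Under these assumptions the chance of losing the duplicate during a one-day repair window is about five in 100,000. The second figure is a reminder that an MTBF rating is an average, not a promise about any single device.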

Application Partitioning
The ease with which applications can be made compatible with the cluster environment is very important. If application vendors must rewrite entire applications in order to adhere to a set of very strict design principles, the cluster architecture is unlikely to be commercially practical.

Application partitioning schemes can be very simple or extremely convoluted. Suppose we have a database, and the demand and load is such that we need to expand the resources available to support that database system. In order to enhance the database's availability, we decide to implement the new solution on a cluster. We might decide to partition the database down the mid-point of the alphabet for client last names such that incoming requests are directed to the appropriate database server by reading the client-last-name field. If there are two machines and the division is made down the A-M and N-Z mid-point (or a more judicious choice based on the actual data) then, all things being equal, approximately half the requests will be serviced by each machine.

If one machine fails, the surviving machine will be reconfigured to take on responsibility for both halves of the access to the database. The result will be that the delivery of data will be slower and that one surviving node will be potentially quite busy. But the entire database will survive and business will continue.
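The alphabet split and its failover behavior can be sketched as a small routing function. The node names and the `PARTITIONS` table are assumptions made for this example; a real deployment would choose its split point from the actual data.

```python
# Hypothetical node names; the ranges mirror the A-M / N-Z example above.
PARTITIONS = {"node_a": ("A", "M"), "node_b": ("N", "Z")}


def route(last_name: str, alive: set[str]) -> str:
    """Return the node that should serve a request for this client."""
    initial = last_name[:1].upper()
    for node, (low, high) in PARTITIONS.items():
        if low <= initial <= high:
            owner = node
            break
    else:
        raise ValueError(f"no partition covers {last_name!r}")
    if owner in alive:
        return owner
    # The owner has failed: a surviving node takes over its half as well.
    survivors = sorted(alive)
    if not survivors:
        raise RuntimeError("no surviving node")
    return survivors[0]


both = {"node_a", "node_b"}
print(route("Adams", both), route("Walls", both))  # one request per half
print(route("Walls", {"node_a"}))                  # node_b has failed
```

With both nodes up, each serves its own half of the alphabet. With node_b down, node_a answers for every client, more slowly but without losing any part of the database.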

A Thousand Points of Failure?
The guiding principle of cluster design is that there should be no single point of failure. It's unavoidable for any individual component to have multiple failure points (e.g., power supply and disk heads). However, every hardware entity must be duplicated and every data path redundant. Preferably, each of the redundant components will either be in continuous use, or usage will switch to the secondary path or subset if the first fails.

Redundancy comes at a cost, however, whether it means duplicated computer systems, battery-assured power supplies, disk spindles or RAID storage, or duplicated network paths and hubs. Moreover, some of these hardware aspects may be more involved than they appear to be at first glance. For example, if a network path is duplicated, it's essential that the secondary path is tested on a regular and frequent basis so that its suitability as a backup path is assured at all times.

If the secondary path ever fails, the failure must be handled with the same urgency and significance as a failure in the primary path. Otherwise, if the primary path later fails, the secondary path might not be available, and the cluster will have been exposed to a single point of failure. In cases where a fast, shared memory bus is employed as the communication channel between systems, the shared memory itself must have multiple power supplies.

There are some aspects to cluster design and implementation that involve more than hardware duplication. For example, suppose we have a cluster of two nodes and those two nodes each support a primary application and provide backup/failover support for the other node's primary application. What happens if the two nodes lose the ability to communicate with each other? Each node will detect the failure to communicate and will attempt to acquire the resources of what it perceives to be the missing node. If both nodes fool themselves into believing that they exclusively own the resources of the cluster, the results can be disastrous if each node's clients try to make competing or colliding claims and demands on those resources.

These issues of software reliability and the ACID (Atomic, Consistent, Isolated, and Durable) nature of the cluster's internal database must be resolved by software synchronization techniques. In general, there are two approaches to cluster synchronization: the share-nothing and the distributed lock manager approaches.

Synchronize Your Approach
The problem of synchronization within a cluster is complex and encompasses several of the thornier issues that have plagued computer advancement for some time. At one end of the spectrum is the problem of a distributed database (software complexity). At the other end is the problem of eliminating duplicate messages where there is automated failure recovery by re-transmission using an alternate path (hardware assistance).

The Distributed Lock Manager Approach
In the early 1980s, Digital Equipment Corporation introduced the VAXcluster, which later became the VMScluster when the Alpha series was introduced as alternate hardware for the OpenVMS operating system. The OpenVMS cluster architecture was based on a Distributed Lock Manager, or DLM.

The DLM manages a distributed database. The database contains representations of locks that are held on resources. A resource can be any arbitrary entity, ranging from a collection of bytes on a disk, all the way to the abstract notion of the execution of a section of code.

When code is written for a DLM-based cluster, the programmer must first request ownership of the lock representing the resource. The lock will be in one of four states:
1. Non-existent
2. Existing and free
3. Existing and held by someone else in a compatible state
4. Existing and held in an incompatible state

In cases 1, 2 and 3, the lock database is updated to reflect the new state of the (possibly new) lock and access is granted. In the fourth case, the programmer can choose one of three actions:
1. Declare the entire effort a failure and back out, reporting an error to the user
2. Wait silently for the lock to become available
3. Retry a given number of times and react appropriately after a timeout or other condition.
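The four lock states and the retry option can be sketched with a toy lock table. This is a single-process illustration, not the OpenVMS interface: the `LockTable` class, the simplified "shared" and "exclusive" modes and the rule that only shared requests are compatible with each other are all assumptions made for the example.

```python
import time


class LockTable:
    """Toy single-process lock table; a real DLM spreads this across nodes."""

    def __init__(self):
        self.locks = {}  # resource name -> {"mode": str, "holders": set}

    def request(self, resource: str, mode: str, owner: str) -> bool:
        lock = self.locks.get(resource)
        if lock is None:                                   # case 1: non-existent
            self.locks[resource] = {"mode": mode, "holders": {owner}}
            return True
        if not lock["holders"]:                            # case 2: existing and free
            lock["mode"] = mode
            lock["holders"].add(owner)
            return True
        if lock["mode"] == "shared" and mode == "shared":  # case 3: compatible
            lock["holders"].add(owner)
            return True
        return False                                       # case 4: incompatible

    def release(self, resource: str, owner: str) -> None:
        self.locks[resource]["holders"].discard(owner)


def acquire_with_retry(table, resource, mode, owner, attempts=3, delay=0.01):
    """Action 3 above: retry a fixed number of times, then give up."""
    for _ in range(attempts):
        if table.request(resource, mode, owner):
            return True
        time.sleep(delay)
    return False
```

A caller that receives `False` from `acquire_with_retry` is back at action 1 and must report the failure; waiting silently (action 2) would amount to the same loop with no attempt limit.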

In the case of the OpenVMS DLM, there's also the possibility of writing asynchronous notification and alerting code. When a lock is created for a resource, the programmer creating the lock identifies a notification routine. If another process takes an interest in the lock that's incompatible with its current state, that asynchronous routine is executed. This means that the owning software component can react accordingly to another process trying to gain access to the resource.
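The notification idea reduces to a callback registered when the lock is created. The sketch below is a simplified, synchronous stand-in: the `NotifyingLock` class and its method names are hypothetical, and in a real DLM the routine would be delivered asynchronously to the owning process.

```python
class NotifyingLock:
    """Toy exclusive lock that alerts its owner to incompatible requests."""

    def __init__(self, owner: str, on_blocked):
        self.owner = owner
        self.on_blocked = on_blocked  # routine named when the lock is created

    def request_exclusive(self, requester: str) -> bool:
        if requester == self.owner:
            return True
        # Incompatible interest from another process: notify the owner.
        self.on_blocked(requester)
        return False


waiting = []
lock = NotifyingLock("payroll_job", waiting.append)
lock.request_exclusive("report_job")
print(waiting)  # the owner now knows who wants the resource
```

Here the owner merely records who is waiting; a fuller implementation might finish its work quickly and release the lock in response.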

Consider the implications of complexity for this design. Bear in mind that a lock can be in any of the four states listed above and that a large system may handle many hundreds of thousands of resources and locks. In a large cluster, there will be many powerful nodes, and each will serve users who might need access to the same or related resources.

As soon as multiple nodes are introduced, the complexity of the problem grows substantially. Looking for the existence of a lock requires polling multiple nodes. Upgrading the state of a lock may require notifying multiple nodes. Adding a new sharer to an existing lock can be costly in terms of notifying all current sharers. Firing those asynchronous notification routines may involve the exchange of large quantities of data within the cluster. Lock activity and its related traffic can become very heavy.

The fundamental design basis for the DLM cluster is that an application will do the following:

1. Acquire the lock for any shared resource
2. Manipulate the resource
3. Release the lock acquired

If there's a section of code that manipulates the resource without having acquired the lock, the scheme will be destroyed and the resource probably corrupted. If a section of code takes an error path such that the lock is never released, other processes that need the resource will wait indefinitely for it, causing the application, if not the entire system, to fail.
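The acquire, manipulate, release discipline, including the error path, can be sketched as follows. A local `threading.Lock` stands in for a cluster-wide lock here, and the account balance is a hypothetical shared resource; the point is only the shape of the code.

```python
import threading

balance_lock = threading.Lock()  # stands in for a cluster-wide lock
balance = {"value": 100}         # stands in for the shared resource


def withdraw(amount: int) -> None:
    balance_lock.acquire()                   # 1. acquire the lock
    try:
        if amount > balance["value"]:
            raise ValueError("insufficient funds")  # error path
        balance["value"] -= amount           # 2. manipulate the resource
    finally:
        balance_lock.release()               # 3. release, even on the error path
```

Because the release sits in a `finally` clause, a failed withdrawal cannot leave the lock held, which is exactly the failure the paragraph above warns about.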

The "Share-Nothing" Approach
In the DLM-based cluster, a resource is juggled between processors on the whim and coincidence of users and their needs. In a share-nothing cluster, by contrast, a particular cluster member owns a resource until some momentous event causes the resource to move to a different node. That momentous event might be the failure of some critical component of the system--a component on which the resource is directly or indirectly dependent.

The programming model presented by share-nothing cluster management is much simpler than that of the DLM-based cluster model. With few exceptions, programs can be written without consideration as to whether they are running in a clustered environment or a non-clustered environment--because the responsibility for where the software runs resides at a decision layer underneath the application. That is, the system administrator decides on which node the application will run. The availability and correctness of hardware components may contribute information and so influence that decision.

Besides the obvious difference in design and usage, the critical difference between these two cluster models is really the frequency with which resources can move over the intra-cluster boundaries. In the DLM case, resources move potentially very quickly between nodes in the cluster. In the case of the share-nothing model, resources move very rarely, and only when there is good reason (hardware or software failure) for the transition.

Both the IBM AS/400 and the Windows NT cluster product, Microsoft Cluster Server (MSCS), are based on the share-nothing cluster architecture.

The AS/400 Cluster
The first thing to note about the AS/400 cluster offering is that it's a very young product. Announced in May 1999, it's only now seeing its first major release since the original offering. The second note of interest about the AS/400 cluster offering is that IBM has deliberately not developed all the software necessary to support the cluster. Rather, it has devised and implemented the "plumbing" that supports the management and transfer of data, but has left the details to a select group of business partners.

In part two of this series, in an upcoming issue of MIDRANGE Systems, we'll discuss those business partners, their offerings and how these products relate to the overall effectiveness of the AS/400 cluster picture.

As noted earlier, IBM's focus for AS/400 clusters has been toward improving system availability, accomplished by employing the "share-nothing" approach to cluster technology. Because of its architecture, the single-level store and the integrated DDM data communication protocol, the AS/400 readily lends itself to the share-nothing cluster.

In general, the AS/400 cluster product relies on applications written as transactional, so they can use the integrated database transaction semantics and memory mapping of disk space. When an AS/400 cluster is deployed, the administrator chooses between synchronous and asynchronous data transmission to secondary and failover cluster members. A synchronous member will accept the data changes across the link, but the original transaction at the primary node will not complete until the data has been successfully transmitted and acknowledged at the remote node. In the asynchronous data transmission case, the primary node does not wait for updates to be transmitted and acknowledged, so the primary node is likely to run significantly faster. In the event of a failure, however, the state of the application must be reconciled with the state of the data.
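The trade-off between the two transmission modes can be sketched with a toy primary and replica. The class and method names below are hypothetical and the "link" is just a method call; the sketch does not model the AS/400's actual replication mechanism.

```python
from collections import deque


class Replica:
    def __init__(self):
        self.data = {}

    def apply(self, key, value) -> bool:
        self.data[key] = value
        return True  # acknowledgement sent back to the primary


class Primary:
    def __init__(self, replica: Replica, synchronous: bool):
        self.data = {}
        self.replica = replica
        self.synchronous = synchronous
        self.pending = deque()  # updates not yet sent (asynchronous mode only)

    def commit(self, key, value) -> None:
        if self.synchronous:
            # The transaction does not complete until the replica acknowledges.
            if not self.replica.apply(key, value):
                raise RuntimeError("replica did not acknowledge")
            self.data[key] = value
        else:
            # Complete locally right away; ship the update later.
            self.data[key] = value
            self.pending.append((key, value))

    def flush(self) -> None:
        """Send queued updates to the replica (asynchronous mode)."""
        while self.pending:
            self.replica.apply(*self.pending.popleft())
```

In asynchronous mode, whatever is still sitting in `pending` when the primary fails is the gap between application state and replicated data that has to be reconciled afterward.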

The consequence of this is that the AS/400's failover recovery time can be up to a few hours if the application data is in a sufficiently confused state. Other examples of share-nothing clusters, such as Windows NT and Windows 2000, have a failover recovery time ranging from a few seconds up to a minute.

This has been a brief introduction to the vast subject area of clustering and a look at some of the advantages and disadvantages of DLM and share-nothing approaches to cluster database correctness and management. The AS/400's architecture lends itself readily to the share-nothing cluster model, and IBM has leveraged existing technologies in its new cluster offering. Going forward, IBM makes no secret of the fact that ISVs must prepare their AS/400 software for the AS/400 cluster age.

There's no universal formula that will ensure that an application that functions correctly on a single AS/400 will also function in a cluster of AS/400s. Furthermore, IBM does not provide fundamental capabilities, such as disk shadowing (mirroring) for the AS/400 cluster. While it's probably acceptable for small-volume applications to use the third-party data mirroring devices, it's far too inefficient to mirror all the contents of a volume using those techniques. An operating system level disk-shadowing device is badly needed in this product space.

A final word of advice--while the prospect of gaining higher availability in an AS/400 environment may appear relatively simple, it's not. If your enterprise is considering using the clustering capabilities in V4R4, you should:

1. Start with a clear understanding of which enterprise-essential applications are AS/400 cluster-compliant and which are not.
2. Understand how AS/400 clusters are held together and how they're managed--and in great detail. Understand, for example, how to define and control failover ordering and how to control failback (what happens when the original node returns to service after a failure).
3. Understand which, if any, of the business partner products you may require and gain a clear understanding of the software and hardware costs.

Keith Walls is the senior consulting architect for Network Trust International and specializes in operating system performance and technology.