AS/400 Clusters, Part II: Data Replication and The Holy Grail

This article is part two of a two-part series on clustering for the AS/400. Previously, we looked at the meaning and implications of joining computers together to improve availability, and therefore, reliability of the resulting computing system. In this article, we'll look more closely at how improved availability is achieved for the AS/400.
The goals of high availability are worth reviewing as we examine the specifics of the AS/400 cluster technology. Fundamentally, clustering involves improving the reliability of existing systems. You cannot build a highly available system from two systems in which each constituent system is inherently unreliable. Each constituent system must be sufficiently reliable (i.e., its mean time between failures must be sufficiently long) that you can cluster systems together and reasonably expect that the resulting compound system will be truly highly available.

How High is Up?
It's an interesting mathematical exercise to determine what the minimum acceptable reliability of each system must be in order to build a highly available compound system from the components. For example, if we define "highly available" for a mythical enterprise to be 99.9999 percent available, we're saying that we will tolerate a 0.0001 percent failure rate, or one failure of one minute's duration every two years.

If our constituent systems each have an availability of 99.99 percent, how many component systems do we need to build the target 99.9999 percent available cluster? The way to find the answer is to multiply the implied failure rates together, expressed as fractions ((100 - availability rating) / 100) and assuming that the systems fail independently of each other. So, if we have two 99.99 percent available systems, the compound failure rate is 0.0001 x 0.0001 = 0.00000001, which gives 99.999999 percent availability, comfortably better than the target.

IBM claims that a single AS/400 system can be configured with an availability of 99.94 percent. By the same arithmetic, a two-node cluster of AS/400 systems would have a compound failure rate of 0.0006 x 0.0006 = 0.00000036, which means a compound availability of 99.999964 percent, or an expected outage of about 11 seconds per year.
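
The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration rather than a sizing tool: the function names are invented for this example, node failures are assumed to be independent, and failover is assumed to be instantaneous. Note that failure rates are converted from percentages to fractions before they are multiplied.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; leap years ignored


def compound_availability(availability_pct, nodes):
    """Availability (percent) of a cluster that is up if any one node is up.

    Assumes node failures are independent, which real clusters only
    approximate.
    """
    failure_fraction = 1.0 - availability_pct / 100.0
    return 100.0 * (1.0 - failure_fraction ** nodes)


def downtime_minutes_per_year(availability_pct):
    """Expected outage per year, in minutes, for a given availability."""
    return (1.0 - availability_pct / 100.0) * MINUTES_PER_YEAR


def nodes_needed(node_pct, target_pct):
    """Smallest node count whose compound availability meets the target."""
    if node_pct <= 0.0:
        raise ValueError("a node that is never available cannot help")
    nodes = 1
    while compound_availability(node_pct, nodes) < target_pct:
        nodes += 1
    return nodes
```

Under these assumptions, nodes_needed(99.99, 99.9999) returns 2, and downtime_minutes_per_year(99.9999) comes to roughly half a minute per year, which matches the "one minute every two years" figure quoted earlier.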

Just exactly how much system failure an organization can withstand has more to do with the business supported by the system than it does about the system itself. The cost of a system outage for a stock exchange might be horrendous, while the cost of doing business on paper for a few hours in a country store might be little more than a nuisance.

The Web site of the Liebert Corporation, a supplier of air conditioning and power protection systems, presents a downtime cost calculator well worth investigating if you have any doubts or questions about the cost of system outages for your business. One of the most interesting observations to make from this calculation is that most of the hardware costs incurred by doubling the availability of a system will be defrayed (with interest) at the first significant failure that would otherwise bring the business to a halt. Typically, this is as true for a modestly-sized business as it is for a large company. The numbers may surprise you.

Depending on the technology deployed in the system, the real cost involved in migrating from a single system to a clustered system is likely to be in software rather than in hardware.

High Availability Clustering
First, we must assume, of course, that the cluster architecture and implementation satisfy all the requirements that constitute the cluster itself--no single point of failure and a reliable, ACIDic mechanism to assure the integrity of the cluster state.

Second, we must assume that we have a means by which data managed and manipulated by the protected and availability-enhanced applications will still be available when one of the constituent systems fails. This is normally achieved by either shadowing (mirroring) or by replication.

In the previous article ("No Single Point of Failure: AS/400 Clustering," October 18, page 30) we discussed most of these topics. To complete the picture, we need to briefly revisit cluster state integrity, and then look at application data protection in some detail.

Cluster State Data Integrity
The significance of the integrity of the cluster state database cannot be emphasized enough. This is the repository of the ever-changing information about the state of the cluster and all its devices, resources and services. At any given time, all the cluster members must agree on the contents of this database. For example, it would be unacceptable for two cluster members to believe that they were each the sole supporter of an application of which there could be only a single copy.

This constitutes one of the more difficult aspects of cluster architecture. Fortunately for the AS/400 market, the integrity of the cluster state is assured by the object-store architecture, OptiConnect and the extension to the implementation of journaling.

Shadowing
Shadowing, also known as mirroring and RAID 1, is the technique of copying the contents of a volume (disk or collection of disks) to a completely redundant disk set. For most implementations, shadowing is accomplished at the operating system level, often at the device driver level. Shadowing is expensive on write functions because the data must be written twice. However, much of what is lost in the write function is gained during read because a read can be satisfied by whichever disk's head is closest to the data required. Naturally, there are several variations to this scant design view. For example, caching can eliminate a significant proportion of disk I/O.
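
The write and read behavior described above can be illustrated with a toy model. This is only a sketch: the MirroredVolume class is hypothetical, the "disks" are in-memory lists, and head position is reduced to a single block number. It is not how any particular operating system or controller implements shadowing.

```python
class MirroredVolume:
    """Toy RAID 1 volume: every write goes to both members, a read needs one."""

    def __init__(self, blocks):
        # Two in-memory "disks" stand in for the redundant disk sets.
        self.members = [[b""] * blocks, [b""] * blocks]
        self.head = [0, 0]       # current head position of each member
        self.writes_issued = 0   # physical writes, to show the doubled cost

    def write(self, block, data):
        # The data must be written twice, once per member.
        for i, disk in enumerate(self.members):
            disk[block] = data
            self.head[i] = block
            self.writes_issued += 1

    def read(self, block):
        # Serve the read from the member whose head is nearest the block.
        i = min(range(len(self.members)),
                key=lambda m: abs(self.head[m] - block))
        self.head[i] = block
        return self.members[i][block]
```

Two logical writes cost four physical writes here, while successive reads to widely separated blocks let each head stay near its own region of the disk.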

IBM is the natural candidate for providing volume shadowing for the AS/400 but it does not. While there are indications that IBM will provide volume shadowing for the AS/400 in the future, for the moment the AS/400 must rely on data replication and/or hardware RAID to duplicate data.

Data Replication
In many instances, it's not necessary to duplicate an entire volume (disk) of data. Furthermore, it's not always necessary that the data be duplicated synchronously. For the most part, volume shadowing will replicate data synchronously between entire spindles. Shadowing tends to assume a very fast and highly available interconnect, so much so that a failure of the interconnect is assumed to carry serious consequences for the entire system.

By contrast, data replication can relax these requirements and will often more readily fit the requirements for a particular application.

Data replication differs from volume shadowing in that data replication involves an "agent" at some level of the data management. Exactly where this "agent" is installed will have a significant impact on the design of the resulting system.

Perhaps the worst choice as the location of the replication agent is in the application itself. Each application developer has a different view as to how data replication should be managed and how it should be implemented. Some would integrate the replicator into the data storage mechanism, while others would make replication an optional extra and deliver copious bells and whistles, complete with knobs and dials.

Slightly better, but still far from ideal, is to embed the replication agent in the database engine. The AS/400 makes a clear distinction between objects and data. To manage a replicating database competently, an administrator would need to understand exactly what was replicated and by what mechanism. By the time an administrator had two or three separate databases, and several applications for each, including some that utilized flat files, details would be forgotten and errors would be made.

A Superior Approach
A superior solution is to design the replication agent into the layer between the lowest level of the application and the highest level of the operating system. In the case of the AS/400, the single-level object store presents an almost ideal opportunity and level for data replication. The advantages of using the object store level are many.

First, there is the advantage that disk data on the AS/400 is necessarily mapped through memory. The implication is that if one AS/400 cluster node maps disk data, the DDM protocol can be used to ship the mapping (and perhaps the data itself) to second and subsequent cluster nodes.

Second, the single-level storage system of the AS/400 can be exploited by extending the journal-based architecture to allow for data duplication across a cluster interconnect. This means that a single data transmission block can be used to cause the local data to be stored, then forwarded to a remote system, mostly unchanged, to effect the parallel change there.

While synchronous data transmission is somewhat more difficult, and perhaps unnecessary, this store-and-forward and journaling approach to data replication is both practical and reliable. The actual transmission protocols can be made as intricate as needed to ensure that the journal entry is transmitted and the primary node can continue. Should the first node fail, the secondary node is known to have received the data update order, together with the data. It's then a matter of time before the secondary store will be up to date.
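
The store-and-forward idea can be sketched as follows. This is a simplified model under stated assumptions: JournalEntry, Node and replicated_write are hypothetical names, not OS/400 journal APIs; the "interconnect" is a direct method call; and the journal is an in-memory list rather than durable storage.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class JournalEntry:
    seq: int      # position in the primary's journal
    key: str
    value: str


@dataclass
class Node:
    store: dict = field(default_factory=dict)
    journal: list = field(default_factory=list)
    applied_seq: int = 0

    def receive(self, entry):
        """Record the entry in the journal before it is applied."""
        self.journal.append(entry)

    def apply_pending(self):
        """Replay, in sequence, journal entries not yet in the store."""
        for entry in sorted(self.journal, key=lambda e: e.seq):
            if entry.seq == self.applied_seq + 1:
                self.store[entry.key] = entry.value
                self.applied_seq = entry.seq


def replicated_write(primary, secondary, key, value):
    """Journal and apply locally, then forward the same entry unchanged."""
    entry = JournalEntry(len(primary.journal) + 1, key, value)
    primary.receive(entry)
    primary.apply_pending()
    secondary.receive(entry)   # stands in for the cluster interconnect
    return entry
```

Once replicated_write returns, the secondary holds the journal entry even though its store may lag. Calling apply_pending on the secondary at any later time, including after the primary has failed, brings its store up to date.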

The Cost of Synchronicity
The critical question with regard to whether asynchronous data replication is more appropriate than synchronous data replication comes down to whether or not transaction speed is more important than failure recovery delay. If data is replicated synchronously, the primary system, the one generating the data, will not consider a transaction complete until the secondary systems have received the transaction, absorbed it, applied it to their private data store, and acknowledged that set of facts back to the primary application node.

The round-trip time for a transaction could therefore be significantly longer than for the standalone system case, particularly if the physical interconnect between the systems is slow compared with the speed of disks. However, at any given point in time, it's known that the two databases are precisely in step with each other from the viewpoint of the data content. If a secondary system fails to acknowledge a transaction, then it's known that the transaction was not recorded there.
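
A minimal sketch of this synchronous behavior follows, assuming a single secondary and modeling the interconnect as a direct method call. The class and method names are hypothetical and are not part of any AS/400 API. A real implementation with several secondaries would also need a way to undo an update that only some of them acknowledged.

```python
class Secondary:
    def __init__(self, reachable=True):
        self.store = {}
        self.reachable = reachable

    def apply(self, key, value):
        """Apply the update and acknowledge it; False models a missing ack."""
        if not self.reachable:
            return False
        self.store[key] = value
        return True


class SyncPrimary:
    def __init__(self, secondary):
        self.store = {}
        self.secondary = secondary

    def commit(self, key, value):
        """Complete the transaction only after the secondary acknowledges."""
        if not self.secondary.apply(key, value):
            return False   # not complete; the caller must retry or fail over
        self.store[key] = value
        return True
```

Every successful commit leaves both stores identical, at the price of one interconnect round trip per transaction. A failed acknowledgment leaves both stores unchanged, so the primary knows that the transaction was not recorded on the secondary.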

This synchronous data lock stepping is not always necessary, however, and will often lead to unnecessary denial of service. Suppose that the connection between two systems is temporarily interrupted. This is more likely the greater the distance between the two systems and the more public the data transmission medium. In the case of synchronous data replication across that line, both systems will need to assume that the line has dropped and take appropriate action. The appropriate action might be to initiate the failover activity--bringing the secondary database online and redirecting incoming requests to the local database.

Suppose, then, that the connection is restored within a few minutes. Now we have the unenviable task of catching two systems up with each other. We must take transactions applied to each side and apply them to the other. If the transactions collide, we must intervene, probably manually, and get the two views of the data to agree.

If instead, we had used asynchronous communication and accepted that the data line had failed and would be back in service in a brief time, we could have queued the write and change transactions while we continued normal service. The only issue would then be to transmit and apply the changes that accumulated while the link was broken. Provided those changes are applied in the correct sequence, there will be no loss of data integrity.
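
The queue-and-catch-up behavior can be sketched like this. It is an illustration only: AsyncReplicator is a hypothetical class, both data stores live in one process, and the link state is a simple flag that the caller flips by hand.

```python
from collections import deque


class AsyncReplicator:
    """Primary-side queue that keeps serving writes while the link is down."""

    def __init__(self):
        self.primary = {}
        self.secondary = {}
        self.pending = deque()   # changes not yet applied remotely, oldest first
        self.link_up = True
        self.next_seq = 1

    def write(self, key, value):
        self.primary[key] = value   # the local transaction completes now
        self.pending.append((self.next_seq, key, value))
        self.next_seq += 1
        self.drain()

    def drain(self):
        """Ship queued changes in their original order while the link is up."""
        while self.link_up and self.pending:
            _seq, key, value = self.pending.popleft()
            self.secondary[key] = value
```

While link_up is False, writes continue to succeed locally and accumulate in the queue. Setting link_up back to True and calling drain applies them to the secondary in the order in which they were made, which is the condition for preserving data integrity.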

In the event that the link is down for an extended period, the secondary system will have to declare itself out of date and take appropriate action. In some cases, that might mean that access to the database should be declined. In other cases, it might mean that the database can be read but not updated.
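
One way to express such a policy is shown below. The threshold and the flag are hypothetical, site-chosen settings invented for this sketch. What counts as "an extended period" depends entirely on the business the system supports.

```python
from enum import Enum


class Access(Enum):
    CURRENT = "current"        # secondary is treated as up to date
    READ_ONLY = "read-only"    # data may be read but not updated
    DECLINED = "declined"      # access to the database is refused


def secondary_access(seconds_since_update, stale_after, allow_stale_reads):
    """Decide what a secondary may offer after losing touch with the primary.

    stale_after and allow_stale_reads are hypothetical, site-chosen settings.
    """
    if seconds_since_update <= stale_after:
        return Access.CURRENT
    return Access.READ_ONLY if allow_stale_reads else Access.DECLINED
```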

Replicate, But How?
The abundance of choices (or more correctly, the abundance of acceptable choices) left IBM with an important decision with regard to data replication. There was never any doubt that data replication was necessary. However, there are any number of ways it can be done correctly and IBM decided to present a complete set of API functions to allow third-party companies to implement data replication.

The major disadvantage of carving off the data replication task set is that every third-party software company's view of the "right" method will be different. For this reason, it makes the greatest sense for a single entity to make the decisions and provide the lower level "plumbing" in such a way that the third parties can each subscribe to the same methodology. This keeps the architecture the same for all cases and it means that the management at the operating system interface level is identical for each third-party product.

Thus, IBM provides the plumbing and the APIs to gain access to that plumbing. The business partners provide the management and direct programming interfaces for data replication.

There are three major business partners in the quest for AS/400 high availability. They are:
  • Vision Solutions Inc. (Irvine, Calif.)
  • DataMirror Corporation (Markham, Ontario)
  • Lakeview Technology (Oak Brook, Ill.)

Not the End of the Story!
The High Availability Story is a long one. We've considered the architecture of the cluster itself and the architecture of the cluster's state database. Now we've added two more complications in data replication and business partners offering replication products. In part one of this series, we discussed the need to plan a highly-available installation from the perspective of hardware and software capability. Now, unfortunately, we must add a few more complicating factors.

First, any application that is to be run on an AS/400 cluster must be cluster compliant. With Windows NT, Windows 2000 and OpenVMS clusters, applications that run on the native operating system platform are necessarily cluster compliant. This is not the case with AS/400 applications.

Second, while disk data can be replicated, other types of data devices cannot be simply replicated in the same way. Network data can be duplicated at very high cost. More feasible is to use a secondary wire and then make sure that the protocol being used is one that will automatically retransmit packets if either end fails.

Power supply management is never as simple as it seems at first glance. Overloaded power supplies are a very common cause of problems when failure does strike.

Finally, the attitude and approach of management and technical teams will be the real determining factor in the success of the design and implementation of a highly-available system.

The AS/400 lends itself readily to clustering by virtue of the design of OS/400. The three natural types of storage--the object, the file system and data--make the task of data replication (and therefore, data integrity) a relatively easy achievement for the administrator. While this symmetry makes choices easy at the implementation point, there are still several factors that complicate the quest for the holy grail of high availability:

Decide how highly-available your system needs to be. Decide whether it would be better to enhance the availability and reliability of your existing system rather than going to the expense of complete redundancy. Examples of availability enhancements might be:
  • (a) a RAID controller to protect against disk hardware failure,
  • (b) a reliable uninterruptible power supply,
  • (c) a larger or faster backup tape drive, or
  • (d) paying for a higher level hardware service contract that will guarantee a faster repair time.
If you need a truly highly-available system, consider how much data must be replicated and how frequently. Take care to do a complete analysis of this because, while excessive data protection might feel as though it adds a level of safety, it will also add several levels of management and paranoia. If you need it, it's worth pursuing and maintaining. If you don't need it, it will become very tiresome very quickly.

Watch the costs! Remember that clustering a system to improve reliability will almost certainly mean that you'll need to duplicate everything. When you need to buy a new disk, it will now cost double because you must buy two where you only needed one previously.

Almost nothing in the high availability realm is either cheap or easy. Everything must be planned carefully. Procedures need to be written for off-peak operators and even users so that the correct actions will be taken should a failure occur. Recovery procedures will change as the system grows and changes. Software upgrades will never again be a simple case of installing distribution media and wishing. Everything must be planned precisely if the business is to remain highly-available.