Clustering: The Quest for Availability

The search for high-availability software remains crucial: many systems today are riddled with redundant features, yet still fail to provide the availability organizations require for their mission-critical applications.

When commercial general-purpose computers were invented, they were used only for a few specific purposes, such as compiling the U.S. census. Back then, it was remarkable to get eight straight hours of computing without a failure. The machines were expensive and attended to by an assemblage of trained professionals who required constant access to the physical hardware. Preventive maintenance was the mantra, rigidly followed as dictated by the manufacturer.

As hardware matured, the components became more reliable. When a computer was used in a military or life-dependent situation, such as a space mission, redundancy was built in. These were custom systems and very, very expensive, but significantly more reliable.

Computer designs matured further, adding such features as error-correcting memory and checksums on stored data.

Software-caused crashes were another issue, often brought about by programming errors. Then, programming techniques changed from merely sitting down and coding to writing design documents and using peer reviews of the code.

One of the earliest attempts at high-availability software was the concept of checkpoint/restart. A complete snapshot of the operating state of the system was taken and written to disk, thus providing a point during a long processing operation to fall back to, should a failure occur.
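The checkpoint/restart idea can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the job (summing a list), the function names, the JSON file format and the `fail_at` parameter that simulates a crash. It is not how any historical system implemented the technique.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the state to a temporary file, then swap it into place in one step."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)  # readers never see a half-written checkpoint


def load_checkpoint(path):
    """Return the last saved state, or a fresh state if no checkpoint exists."""
    if not os.path.exists(path):
        return {"next_index": 0, "total": 0}
    with open(path) as handle:
        return json.load(handle)


def sum_with_checkpoints(values, path, every=100, fail_at=None):
    """Sum values, saving a checkpoint every `every` items.

    `fail_at` (hypothetical) simulates a crash at that index: the run is
    abandoned and work done since the last checkpoint is lost.
    """
    state = load_checkpoint(path)
    for index in range(state["next_index"], len(values)):
        if index == fail_at:
            return None  # simulated failure
        state["total"] += values[index]
        state["next_index"] = index + 1
        if state["next_index"] % every == 0:
            save_checkpoint(path, state)
    return state["total"]
```

Calling `sum_with_checkpoints` again with the same path after a simulated failure resumes from the last saved index rather than from the beginning, which is the fall-back point the technique provides.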

Today, many systems are replete with redundant features, including cooling fans, power supplies, power cords, processors, I/O controllers, paths and communications. Yet, these highly redundant systems and their operating systems still fail to provide the availability required by many of today's organizations for their mission-critical applications.

Increasing Up-Time

Increasing and sustaining high up-time can be crucial for a company. A recent survey pegged the cost of one hour of down time for a catalog sales company at $90,000. Even more dramatically, for brokerage houses, the losses rise to $6,500,000 per hour.

From the end user's perspective, the "system" starts right at the desktop. For the system to be up, everything from the desktop to the application server to the print server must be operational. This view includes the desktop system, the local area network, all servers (and everything associated with them), the printer, the connection (LAN or WAN) to the servers, the disk storage and the applications. A failure in any link of that chain will make the user perceive the "system" as down.

A primary method of increasing up-time is to mask failures with backup and redundancy. That's where redundant arrays of inexpensive disks (RAIDs) come in, for example, by hiding the effect of a single disk failure. The same thinking applies to servers: using a redundant server as a backup.
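The effect of a chain of components versus a redundant backup can be put in numbers with a short Python sketch. It assumes that components fail independently of one another, which real systems only approximate, and the function names and the 99 percent figure are illustrative rather than taken from any survey.

```python
from math import prod


def serial_availability(availabilities):
    """Availability of a chain in which every component must be up."""
    return prod(availabilities)


def redundant_availability(availability, copies):
    """Availability of `copies` identical components when any one suffices."""
    return 1 - (1 - availability) ** copies


# Hypothetical figures: five components in the user's chain, each up 99% of the time.
chain = serial_availability([0.99] * 5)
# The same component backed by one redundant copy.
pair = redundant_availability(0.99, 2)
```

Under these assumptions the five-component chain is available less often than any single component in it, while the redundant pair is available more often than either copy alone.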

Early Clusters

Although Digital Equipment Corporation is largely credited with coining the term cluster, as applied to grouping multiple independent systems together, Tandem had really jump-started the idea in the late 1970s. Tandem released one of the first commercially available "non-stop" systems, running the Guardian operating system, targeting the OLTP (online transaction processing) market.

Tandem's multiple-processor architecture, besides having mirrored disks and other redundancies, allowed one processor to "shadow" another processor. When one processor failed, its shadow could take over, having access to the same data, thereby avoiding user disruption. While not exactly what we think of today as clustering, it was certainly the seed. The system could be expanded by adding processors in pairs and disks in pairs. Unfortunately, the application was required to be intimately aware of the shadow-processor arrangement, to send it data and operating state information.

In the early 1980s, Digital introduced its first VAXcluster (Virtual Address eXtension cluster) system. One of the original intentions of clustering was to group minicomputers so that they could be more powerful than a mainframe: a virtual mainframe, so to speak.

All systems in a cluster had access to the same resources (disks, printers, etc.). Disks were accessible via a separate storage controller. Thus, all disk storage was available to all cluster processors. Disks could be mirrored for redundancy. And the disks could be dual-ported, allowing for inclusion of a second storage controller.

It did not matter which physical system a user was logged on to, making the cluster appear as a single system and facilitating load balancing. When a node in the cluster went down, users who were logged on to that node had to log on to the cluster again, re-establishing a connection to another node in the cluster. The user's saved work was then accessible from that newly accessed node.

In contrast, Tandem's Guardian system fail-over was relatively rapid and usually transparent to the user.

Clustering Spreads

Late in the 1980s, UNIX systems, touted as "open," gained in popularity and market penetration. Many of us recall how that market penetration led to headlines claiming that the dinosaur mainframe was dead, and most major manufacturers posted financial losses at some point in the not-so-distant past. However, with the continued need for highly reliable and scalable processing systems, mainframe sales have surged in recent years, as the roles have been defined and a balance struck between distributed and centralized processing environments.

Today, many major vendors have a clustered version of UNIX available. Together with a strong multiprocessing capability, clustering's fail-over capability offered mainframe-class legitimacy to a fledgling operating system.

Microsoft entered the enterprise-class operating system arena with its Windows NT Server architecture. It then tackled clustering with its Microsoft Cluster Server (MSCS, codenamed Wolfpack) enhancement for NT. With it, Microsoft addresses availability (in the first release) and scalability (with the Windows 2000 release). MSCS is discussed in detail later.

Other vendors also provide clustering solutions for Windows NT, including Oracle Parallel Server and Vinca Co-StandbyServer; both are covered later.

Cluster Shopping List

Here is a list of questions to consider when shopping for a clustering solution.

  • What is the maximum number of nodes? Can nodes be added dynamically?
  • How are repairs handled - hot swap out, for example?
  • How is expansion handled - hot additions, for example?
  • Will the desired application operate in a clustered environment?
  • How are applications installed - on one node and then automatic availability across the cluster, or on every node?
  • Can applications be installed/upgraded while portions of the cluster are operating? How?
  • What is the largest node size? The smallest?
  • Must all nodes be homogeneous (same numbers and types of processors, for example)? Same question for the operating system. And what about upgrading versions of the operating system?
  • What is the minimum/maximum disk storage size allowed? Must it be homogeneous? What about RAIDs and mirroring? Same questions for main memory.
  • How is storage allocated and shared? Equal access or owned by a node? Can ownership be changed (during operation)?
  • How is backup handled? Can tape drives be shared?
  • Are there limits on communications ports or speeds for LANs and WANs?
  • How is work distributed? Is it automatic? Is manual intervention ever required? When desired, is manual intervention permitted? Is tuning handled automatically?
  • What is the impact of losing a node? How much time does a fail-over require? What is the impact on users? Is fail-over automatic? Is operator intervention required and/or wanted? How clear are the system requests?
  • How is recovery handled? (e.g., procedures; what if a node hangs and cannot be accessed.)
  • How is disaster recovery handled? (e.g., splitting a cluster across the city, state or country.)
  • What is the operator's view of the cluster - individual consoles or everything viewable and controllable from one console? Are GUIs available? Wizards?
  • Are capacity/performance thresholds supported? When crossed, what kind of notification is given? Are actions automatically taken?
  • What services are required to implement this solution? Before considering the standard installation and implementation services, look at assessment services for the LAN, WAN, existing servers and application suitability to clustering. Don't forget 24x7 maintenance and administrator training.

Which One?

UnixWare ReliantHA is for two- or four-node clustering applications running on UNIX systems.

Vinca Co-StandbyServer is a low-cost two-node cluster solution for small and remote offices.

Oracle Parallel Server is an industrial-strength database management cluster solution for two or four active-active nodes for mission-critical applications requiring high availability. OPS does not provide clustering for other applications.

Microsoft Cluster Server is two-node clustering for mission-critical applications running on Windows NT Server only, and comes with out-of-the-box support for applications such as Microsoft Internet Information Server, Message Queue Server, Transaction Server, and Oracle Fail Safe. A Microsoft "cluster-aware" application such as SQL Server Enterprise Edition is even more finely tuned to allow continued, uninterrupted application availability. By 2001, a base of around 500,000 clustered NT nodes is expected, against a base of 6.5 million Windows NT Server installations.

The Unisys Cellular MultiProcessing (CMP) system, through its partitioning and memory-sharing capabilities, supports multiple cluster combinations within a single cabinet.

- C.Y.

What Is Clustering?

The CMP Technology Network Web encyclopedia defines clustering as using "multiple computer systems that are linked together in order to handle variable workloads or to provide continued operation in the event one fails." Just as SMP (symmetric multiprocessing) systems duplicate processors, power supplies, etc., for fault tolerance, clustering takes SMPs one step further by duplicating the entire system.

A system in a cluster is often referred to as a node. And clustered servers are sometimes referred to as "loosely coupled" servers. Each system in a cluster can be a multiprocessor system itself. For example, a four-node cluster composed of four-processor servers has a total of 16 processors.

Multiple clustered SMP computers are usually tied together and communicate with each other through a dedicated clustering communication or I/O port. Sometimes a specialized card is used or recommended, such as Tandem's ServerNet PCI Adapter. These high-speed networking interconnects are slower than the internal bus of an SMP system.

Awareness of clustering can be forced into the application, depending on the form of clustering selected. Alternatively, even if an application can run in a cluster without being cluster-aware, it can often be more desirable to modify the application for recoverability reasons: speed of recovery, user visibility of recovery and thoroughness of recovery.

Systems participating in a cluster must all know the status (the health) of the other nodes in the cluster. They also need to know what applications are running in each node. And every node must know what actions to take upon failure of another node.

In almost all implementations of clustering for Microsoft Windows NT Server, no special hardware is required. However, to ensure quality and proper support, hardware vendors do certify and recommend specific configurations.

Types of Clusters

International Data Corporation (IDC) segregates clustering into four categories.

The first category, high availability, uses cluster management software to improve system reaction to a failure and, thereby, system availability to the end user. As an aside, be aware that application availability and system availability are not the same thing. System availability means that the system is up and running, but says nothing about whether the application is available to the user.

Administrative clusters are a logical grouping of servers, which use clustering technology to enhance application and system management. Specific applications may be dedicated to specific nodes, but clustering makes it easier to perform load balancing and to allocate resources (processors, storage, printers, networks, backup, etc.) across all applications.

Application clusters use high-speed cluster connections to enhance application-to-application interoperability. This requires application interfaces and middleware that are cluster-aware and optimized to operate across multiple servers. Optimizations can include work sharing and failure management. Unfortunately, there are not many of these applications available.

Workload clusters are designed to increase performance of an application by spreading the workload across all nodes in the cluster. This requires unique application software designed to split the workloads and perform parallel computing tasks. This type of software is often seen in scientific and data warehousing applications.

Don't confuse clustering with the terms fault tolerance and disaster recovery. Fault tolerance refers to a system's or application's ability to recover from unexpected failures in the system or application, continuing to provide its services after automatically recovering. Examples include a failed disk drive or other electrical component.

Disaster recovery, conversely, refers to providing service after an unexpected outage happens, usually caused by some major catastrophe that occurred outside the control of a system or application. Examples include a power failure, earthquake, flood or malicious attack. Properly laid out data center administrative procedures should incorporate aspects of both of these areas.

Cluster Architectures

There are two fundamental approaches to clustering, as defined by how they share resources, usually specifically referring to disk storage: shared-nothing and shared-data clusters (see Figure 1). All clustering offerings use one approach or the other. Neither is right nor wrong, but there are tradeoffs, including resiliency, scalability, administration and application behavior.

Shared-data clusters have a single point of failure, namely the shared disk subsystem. This can be mitigated by using a shared storage device that supports RAID. And by sharing data, some clustering implementations make it easier to incrementally scale.

In shared-nothing clusters, all devices and resources are owned and managed by a single system at a time. This means that if the device is to be shared, the other nodes in the cluster, rather than having direct access to it, must go through the server that owns the resource. Also, the owned resource may be on a shared bus. That way, in case of failure, the system sharing the bus can take over the disks of the failed system.
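The ownership rule of a shared-nothing cluster can be modeled in a few lines of Python. This is a toy sketch: the class, its method names and the node and disk names are all hypothetical, and a real product would move actual I/O requests over the interconnect rather than update an in-memory dictionary.

```python
class SharedNothingCluster:
    """Toy model: each resource is owned by exactly one node at a time."""

    def __init__(self, ownership):
        # ownership maps resource name -> owning node name (illustrative names)
        self.owner = dict(ownership)
        self.data = {resource: {} for resource in ownership}

    def write(self, from_node, resource, key, value):
        """Perform a write through the owner.

        Returns (owner, forwarded); forwarded is True when the requesting
        node had to go through another server to reach the resource.
        """
        owner = self.owner[resource]
        self.data[resource][key] = value
        return owner, from_node != owner

    def fail_node(self, failed, successor):
        """Hand every resource owned by the failed node to the successor."""
        for resource, owner in list(self.owner.items()):
            if owner == failed:
                self.owner[resource] = successor


cluster = SharedNothingCluster({"disk1": "A", "disk2": "B"})
cluster.write("B", "disk1", "order-17", "pending")  # forwarded through node A
cluster.fail_node("A", "B")                         # node B now owns both disks
```

The design point the sketch tries to capture is that ownership, not physical connectivity, decides who performs the I/O; the takeover step only works if the successor can physically reach the disk, for example over a shared bus.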

Shared-nothing clusters can require duplication of data, which can be expensive, both in storage and bandwidth requirements, especially for large volumes of data. This is because data duplication requires that the duplicate copies be synchronized.

The shared-nothing approach can eliminate the single point of failure in the disk subsystem and can offer higher throughput, since user access to the data can occur in parallel by being split across multiple instances of the same information.

Another factor in a cluster architecture is whether all the nodes are active or whether one or more servers are on standby. Active in this case means that the server can perform useful work. Standby implies that the server is doing nothing other than standing by, waiting for a failure to occur. When there are two nodes and both are active, this is called active-active. When one node is active and the other is a standby node, this is called active-passive.

UnixWare ReliantHA

Because of all the press Microsoft generates, even in the area of clustering, it is easy to overlook the other operating systems. UnixWare ReliantHA is a software-based control framework enabling high application resource availability for two to four clustered nodes. All the nodes in a UnixWare ReliantHA cluster can be used.

Each node runs its own copy of UnixWare ReliantHA. The nodes communicate over a redundant private network controlled by UnixWare ReliantHA. The private network is used to guarantee that the messages are delivered as quickly as possible.

UnixWare ReliantHA provides proactive fault detection and continuous monitoring of each system and managed resource for rapid detection of node failures. When a system or resource in the cluster becomes unavailable, UnixWare ReliantHA can automatically perform a shutdown and switchover, redistributing selected parts of the application load to an available system. All shared files and applications become available minutes after a fault occurs.

Customizable recovery scripts enable the restarting of the resource without operator intervention. Once the failed system is replaced or repaired, UnixWare ReliantHA recovery capabilities return the cluster to its original state.

UnixWare ReliantHA includes graphical installation, configuration, and management, and provides for single-point cluster control, and remote and local management capabilities. It supports standard hardware RAID solutions.

UnixWare ReliantHA is targeted for transaction processing, database, Internet/intranet, replicated sites and NFS (network file system) services. It comes with predefined scripts for NFS, databases and generic applications, and it is transparent to all UnixWare 2.x applications.

Vinca Co-StandbyServer

Billed as high-availability clustering for small and remote offices, Vinca Co-StandbyServer was originally produced for Novell NetWare, allowing for automatic fail-over to a standby server operating a DOS partition and, in later releases, to a NetWare volume. Follow-on releases now support mirroring remote servers and a many-to-one solution that connects a standby machine to multiple servers, for a redundant collection of servers.

The latest releases support intraNetWare, OS/2 Warp and Microsoft Windows NT Server environments. Novell offers Vinca solutions directly under its own name, for example, Novell StandbyServer for NetWare/intraNetWare.

Co-StandbyServer for Windows NT allows either of two servers to fail over to the other. In contrast, Vinca's earlier product, StandbyServer for Windows NT, is an active-passive fail-over capability, where a primary server can fail over to a standby without manual intervention.

Each clustered server requires its own unshared disk for booting purposes. Data on the other disks can be mirrored between the clustered servers.

Mirroring operates at the block/device driver level on individual I/O requests, keeping data synchronized between the clustered nodes. Selecting block-level I/O avoids open-file concerns. The mirroring traffic uses a separate dedicated inter-server link to avoid impacting the client's LAN bandwidth.

For StandbyServer, all data is mirrored to the standby server. For Co-StandbyServer, where each server (call them A and B) is running its own mission-critical application, the data for the application running on A is mirrored to B and vice versa. Through this means, if either server fails, the other server assumes the workload of both servers, albeit at a reduced performance level.

When a fail-over occurs, the data and server identity of the failed server are transferred to the surviving server, with the remaining server also retaining its original identity. Disks, TCP/IP addresses, and printers are all activated on the surviving node (see Figure 2).
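The identity transfer described here can be pictured with a small Python model. It is only a sketch of the idea that the survivor answers under two identities at once; the class names, fields and addresses are invented for illustration and do not reflect Vinca's actual data structures or interfaces.

```python
from dataclasses import dataclass, field


@dataclass
class Identity:
    """What clients see as 'a server': a name plus the resources behind it."""
    name: str
    ip_addresses: list
    disks: list
    printers: list = field(default_factory=list)


@dataclass
class Server:
    own: Identity
    hosted: list = field(init=False)  # identities currently active on this machine

    def __post_init__(self):
        self.hosted = [self.own]

    def take_over(self, failed):
        """Activate the failed peer's identities while keeping our own."""
        self.hosted.extend(failed.hosted)
        failed.hosted = []

    def active_ips(self):
        return [ip for identity in self.hosted for ip in identity.ip_addresses]


# Hypothetical two-node cluster; server A fails and B assumes both identities.
a = Server(Identity("A", ["10.0.0.1"], ["diskA"], ["printer1"]))
b = Server(Identity("B", ["10.0.0.2"], ["diskB"]))
b.take_over(a)
```

After the takeover, `b.active_ips()` lists both addresses, so clients that were using server A's address reach the surviving machine without reconfiguration.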

Applications running on a failed server can be restarted on the surviving server. Through this means, an NT server supplies data, as well as application availability for client systems.

The Co-StandbyServer automatically installs on the servers in the cluster. It fails over applications, which requires using a scripting language described in a separate manual. The company is providing scripts for common applications such as Lotus Notes, Microsoft SQL Server, and Internet Information Server.

Clustered servers do not need to have the same hardware and software configurations. And primary and standby clustered nodes can be connected to all disk types, including shared RAID devices, instead of duplicating disk space.

Supported applications, such as Microsoft SQL Server and Microsoft Exchange Server, appear in the management console as resources that can be clustered. NT Registry entries for all clustered applications are duplicated and dynamically staged and updated between clustered servers to automate fail-over. Other applications fail over using command files and scripts.

A remote administrative management console facility allows clustered servers to be managed from any workstation or server on the network.

With Co-StandbyServer, a server can be removed from service while users access their applications on the other clustered server, thus permitting maintenance.

An entire fail-over can take only seconds. Depending on the server's functions, users may not even know a fail-over has occurred.

Oracle Parallel Server

Oracle Parallel Server (OPS) provides for managed fail-over of applications in two- or four-node Microsoft Windows NT configurations, and even larger clustering configurations are planned. OPS supports active-active clusters with almost no failure recovery time; the workload can be shared across all active nodes.

OPS is the only Windows NT Server clustering technology today that provides both the availability attributes of managed fail-over and the scalability attributes of workload sharing and load balancing.

OPS operates with shared access to all disk data. Each node can concurrently access and update the single database. OPS handles data locking conflicts between nodes attempting to access the same data.

Its distributed lock manager (DLM) is the software that coordinates the shared access to the database files. It ensures all changes are saved in a consistent manner, and it is responsible for recovery of shared resources in case of failure.
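The two jobs attributed to the lock manager, arbitrating conflicting access and recovering resources after a failure, can be illustrated with a deliberately simplified Python lock table. This is not Oracle's DLM: the class, its methods and the node and resource names are hypothetical, and a production lock manager would support multiple lock modes, queuing and distribution across nodes.

```python
class LockManager:
    """Toy lock table: at most one exclusive holder per resource."""

    def __init__(self):
        self.holders = {}  # resource -> node currently holding the lock

    def acquire(self, node, resource):
        """Grant the lock if it is free or already held by `node`."""
        holder = self.holders.setdefault(resource, node)
        return holder == node

    def release(self, node, resource):
        """Release the lock, but only if `node` is the one holding it."""
        if self.holders.get(resource) == node:
            del self.holders[resource]

    def recover_node(self, failed):
        """Free every lock a failed node held; return the resources freed."""
        freed = sorted(r for r, n in self.holders.items() if n == failed)
        for resource in freed:
            del self.holders[resource]
        return freed


dlm = LockManager()
dlm.acquire("node1", "block42")   # granted
dlm.acquire("node2", "block42")   # refused while node1 holds the lock
dlm.recover_node("node1")         # node1 fails; its locks are released
```

In this model the recovery step is what keeps a failed node from blocking the survivors indefinitely.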

Client applications access the Oracle database through standard SQL calls. Load sharing is accomplished by allocating client sessions to different servers in the cluster.

When a node fails, clients connected to the other nodes are unaffected. For those connected to the failed node, requests are rerouted to a still-available node. For SQL requests, the fail-over is nearly instantaneous, since everything necessary to continue processing (i.e., all the OPS and Oracle software) on the other node is already operational.
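Session allocation and rerouting can be sketched as follows. The least-loaded placement policy, the class name and the node and client names are assumptions chosen for illustration; the article does not say how OPS actually chooses a node for a session.

```python
class SessionRouter:
    """Toy router: place each client session on the least-loaded live node."""

    def __init__(self, nodes):
        self.sessions = {node: [] for node in nodes}

    def _least_loaded(self):
        # Ties go to the node listed first.
        return min(self.sessions, key=lambda node: len(self.sessions[node]))

    def connect(self, client):
        node = self._least_loaded()
        self.sessions[node].append(client)
        return node

    def fail(self, failed):
        """Remove a failed node and reroute only its clients.

        Returns a mapping of each displaced client to its new node;
        sessions on surviving nodes are left untouched.
        """
        orphans = self.sessions.pop(failed)
        return {client: self.connect(client) for client in orphans}


router = SessionRouter(["n1", "n2"])
for client in ("c1", "c2", "c3"):
    router.connect(client)
moved = router.fail("n1")  # only the clients that were on n1 are moved
```

The sketch assumes at least one node survives; with no live nodes left, `connect` would raise an error.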

For users connected to the failed system executing an Oracle application, the failed application is restarted on another operational node, and processing continues. Fail-over time is minimized, since the database is already up and running on the other system; only the application needs to be restarted.

When the repairs of the failed node are complete and it rejoins the cluster, it does so automatically. Workload is again routed to the new server without affecting sessions on other nodes.

OPS has a management extension that allows administrators to manage the databases with GUIs. The OPS management extension generally complements other Microsoft Windows NT Server system management tools.

Microsoft Cluster Server

The clustering services of Microsoft Cluster Server are separate, isolated components that are added to the Microsoft Windows NT Server operating system, reducing the possibility of introducing problems in the existing code.

MSCS is a two-node availability solution, much like Vinca. Future versions will support additional nodes and have scalability features. MSCS requires that each server have its own private local system boot disk.

MSCS provides system-managed fail-over and bi-directional fail-over, supporting active-active configurations, similar to Vinca's Co-StandbyServer. The node management software sends periodic messages, called heartbeats, to its counterparts.
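A heartbeat check boils down to remembering when each peer was last heard from. The Python sketch below shows that bookkeeping; the class name, the three-second timeout and the use of explicitly passed timestamps are illustrative assumptions, not MSCS parameters or interfaces.

```python
class HeartbeatMonitor:
    """Track the last heartbeat from each peer and flag the silent ones."""

    def __init__(self, peers, timeout):
        self.timeout = timeout                      # seconds of silence tolerated
        self.last_seen = {peer: 0.0 for peer in peers}

    def record(self, peer, now):
        """Note that a heartbeat from `peer` arrived at time `now`."""
        self.last_seen[peer] = now

    def suspected_down(self, now):
        """Return peers whose last heartbeat is older than the timeout."""
        return sorted(
            peer for peer, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


monitor = HeartbeatMonitor(["A", "B"], timeout=3.0)
monitor.record("A", 1.0)
monitor.record("B", 1.0)
monitor.record("A", 4.0)       # A keeps reporting; B has gone quiet
```

Note that silence only makes a peer a suspect. A slow network produces the same symptom as a dead node, which is why a cluster needs a further tie-breaking step before acting on the suspicion.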

The cluster interconnect runs over ordinary TCP/IP. This is satisfactory because the interconnect load is relatively light compared with other clustering solutions, where higher-performance cluster interconnects are needed.

MSCS is transparent for most applications. Services provided by the clustered nodes are made available to end users as virtual servers. Users are not aware of which physical system is actually providing the service. Load balancing is presently done manually, via a click-and-drag mechanism to move whole cluster groups to a less-loaded server.

MSCS automatically detects hardware and software failures, with automatic reattachment of clients. As with all clustering solutions, special software on client systems is not required. Therefore, what the user experiences on a fail-over depends on what the client is doing. Fail-over is usually transparent, since MSCS automatically restarts everything on the surviving node at the same TCP/IP address.

For stateless connections, such as with a browser, the user would be unaware of a fail-over if the failure occurred between server requests. If the client were active during the failure, the client would receive the standard error notification from the application, such as "Abort, Retry, or Cancel." For stateful connections, such as SAP, where the application's state is remembered from communication to communication, a new logon is typically required following a server failure. The reconnect occurs in exactly the same manner as was done for the original connection. This reconnection also satisfies security requirements.

Depending on the application, automatic and secure user reconnection is possible following a failure. As part of the recovery from a failure, MSCS restores the state of the application's Registry keys, but other state information must be managed and restored by the application.

MSCS has remote management capability to view the cluster as a single system. Administrator-initiated fail-over is also provided for maintenance.

MSCS requires visibility to a shared disk subsystem that is available to all nodes. In contrast with Vinca, MSCS does not duplicate data. MSCS is based on a shared-nothing clustering model. This model dictates that although more than one node in the cluster has physical access to a disk (or device or resource), the disk is owned and managed by only one system at a time.

Microsoft Cluster Server requires a quorum disk. This disk is used by MSCS to determine whether a server is up or down. Only one server can own the quorum disk at a time. Servers in the cluster can negotiate for the ownership. This avoids a situation where both servers are active, yet each thinks the other server is down. This situation could occur, for example, during a network response-time problem.

Upon fail-over, the active server "steals" access to the quorum disk and blocks the failed system from accessing it or any drives that are in the shared disk subsystem. It takes a minimum of 10.5 seconds to fail over the quorum disk.
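The tie-breaking role of the quorum disk can be shown with a toy arbiter in Python. The sketch models only the rule that a single owner exists at any time; the class and function names are hypothetical, and nothing here represents the disk-level mechanism MSCS actually uses to enforce ownership.

```python
class QuorumDisk:
    """Toy arbiter: at most one node owns the quorum resource at a time."""

    def __init__(self):
        self.owner = None

    def try_acquire(self, node):
        """Claim ownership if the disk is free; report whether `node` owns it."""
        if self.owner is None:
            self.owner = node
        return self.owner == node

    def release(self, node):
        if self.owner == node:
            self.owner = None


def survivors_after_partition(quorum, nodes):
    """Each node that lost contact with its peer races for the quorum disk.

    Only the winner stays active, so two nodes that each believe the other
    is down cannot both keep running.
    """
    return [node for node in nodes if quorum.try_acquire(node)]


quorum = QuorumDisk()
active = survivors_after_partition(quorum, ["A", "B"])  # only one node wins
```

Whichever node reaches the disk first wins the race; the loser stands down even though it is healthy, which is the price paid for never having two active owners of the same data.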

The remaining active node takes over the failed node's TCP/IP addresses, initiates the applications that were running on the failed system, and provides access to their data on the shared subsystem. The applications provide their own data recovery and then allow reconnection of clients.

An application that has its own resource DLL (dynamic link library) written for it is called a "cluster-aware" application. Applications that do not have that function are unaware of clustering and do not know that MSCS is running. Cluster-aware applications, for example, can report status to MSCS and can respond to requests to be brought online and to be taken offline gracefully. MSCS can monitor and recover both types of applications.

Microsoft has tools for application developers to cluster-enable their applications. However, if an application is "well-behaved," it can run without being cluster-enabled.

Well-behaved means the application keeps everything it needs to restart on a disk accessible from another clustered system, and its clients can satisfactorily handle service pauses of up to a minute. Most commercial applications already satisfy these two characteristics.

Microsoft applications, such as Microsoft Internet Information Server (IIS), SQL Server and Exchange, are all MSCS-ready.

Oracle has also made it easy to implement its database in an MSCS environment with Oracle Fail Safe. In the event of a system failure, Oracle Fail Safe automatically restarts the Oracle database on the surviving node and rolls back any transactions that were uncommitted at the time of failure.

To guarantee proper support of MSCS, Microsoft requires that specific hardware configurations be certified for compatibility. Microsoft provides a published hardware compatibility list (HCL) for all tested hardware. Certification can be a lengthy process for hardware vendors, taking up to four months to complete. Check out the Microsoft HCL at

Clustering Advantages

Reliable servers are in place today, largely due to redundancy of hardware: UPSs, RAIDs, power supplies, etc. However, even reliable servers fail. Aside from pure hardware failures, there are software bugs, administrative and operator errors, maintenance (scheduled and unplanned), disasters, and city- or region-wide catastrophes, all factoring into server failures.

Administrators also face a network dichotomy. Distributed networks were not originally designed to support mission-critical applications, yet increasing numbers of those critical applications are moving to and must be accessible through the network. This downtime sensitivity creates the business reason to improve the network infrastructure.

Clustering provides the obvious advantage of increasing the computing power over single systems, accomplished through load balancing with applicable thresholds. An example use of this might be for rapidly expanding Internet Web sites.

Perhaps even more importantly, clustering can provide a fail-over resiliency that a single system often cannot provide; when one system fails, the other can take over the processing load of the two in a degraded mode.

According to an SRC Clustering Practices Survey on why clustering solutions are implemented, availability was the main reason (79 percent). The same survey states that average application availability before clustering is 90.1 percent (36 days of downtime per year). With clustering, application availability improves to 98.6 percent (five days of downtime per year).
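The downtime figures follow directly from the availability percentages, as a one-line Python calculation shows. The function name is made up for illustration, and a 365-day year is assumed.

```python
def downtime_days_per_year(availability_percent, days_in_year=365):
    """Convert an availability percentage into expected days of downtime."""
    return (100.0 - availability_percent) / 100.0 * days_in_year


before = downtime_days_per_year(90.1)  # about 36.1 days
after = downtime_days_per_year(98.6)   # about 5.1 days
```

Rounded to whole days, these match the survey's 36 and five days, a reduction of roughly 31 days of downtime per year.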

All other reasons to implement clustering each drew a response rate of 4 to 6 percent. These include sharing resources, data protection, disaster recovery, performance improvement, load balancing, and security and application support.

Not mentioned in the report, surprisingly, was support for planned hardware maintenance, allowing an administrator to take a functioning node temporarily out of the cluster to perform scheduled or unscheduled maintenance.

As to what applications are being clustered, around half of all clustered applications are database applications. Around one quarter are file servers, and one-tenth are e-mail.

These latter applications demonstrate a growing acceptance and use of clustering solutions as being no longer just for the elite mission-critical applications. In addition, the survey reports that 95 percent of the sites using clustering are satisfied.

Clustering Disadvantages

With all of clustering's advantages, it would appear to be a panacea. However, one of the challenges of clustering is administering multiple interconnected, though separate, systems. Distributed networks do not lend themselves well to easy management.

Clusters are complex and do require careful setup. Management tools are easing the burden, but clusters are still not easy. With clustering, as the number of nodes increases, the number of communication links grows even faster, compounding the complexity.
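How quickly the link count grows can be checked with a short Python function. It assumes a full mesh, in which every node has a direct link to every other node; real clusters may use a different topology, and the function name is illustrative. Under that assumption the number of links grows with the square of the node count.

```python
def full_mesh_links(nodes):
    """Number of point-to-point links when every node connects to every other."""
    return nodes * (nodes - 1) // 2


# Links needed for two-, four- and eight-node clusters under the full-mesh assumption.
links = {n: full_mesh_links(n) for n in (2, 4, 8)}
```

Doubling a cluster from four to eight nodes raises the link count from six to 28, which gives a sense of why administration effort rises faster than node count.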

Furthermore, not every application can run under clustering; some never can, and some cannot without help.

About the Author:

Charlie Young is Director of U.S. Network Enable Solution Programs in the Global Customer Services (GCS) organization.
