Putting Some Luster On NT Clusters -- Enterprise Systems

07/01/1999

Clustering products for Windows NT platforms are available in several varieties, offered by Microsoft as well as other vendors. Microsoft's products are centered on Windows NT Server Enterprise Edition, which offers SMP capabilities for up to 32 processors. The Enterprise Edition includes Microsoft Cluster Server (MSCS) and Windows NT Load Balancing Service (WLBS).

Introduced in 1997, MSCS has been available as a system service in NT Enterprise Edition. MSCS allows the creation of two-node, shared-nothing clusters. The cluster appears as a single server to network client machines. Any applications or data files installed on the cluster are referred to as resources. These resources run on only one server at a time but can be configured for failover (a process by which the operation of the resource is moved from one server to another in the event of a failure).

If At First You Don't Succeed

Resources can be organized into groups, and entire groups can fail over as well. When a specific resource is requested, MSCS routes the request to the server operating the resource.

An MSCS cluster consists of servers, common storage and networking. The servers must be single- or multiple-processor systems based on Intel Pentium (minimum 90MHz) or Compaq Alpha processors (Intel and Alpha processors cannot be mixed in the same cluster).

Each server requires a minimum of 64MB of RAM. Each server must be connected to a shared, external SCSI bus that is separate from the bus containing the system disk. The SCSI adapters must be PCI. All resources configured for failover must reside on disks on this common bus. To provide maximum protection from hardware failure, storage on the common bus should be hardware RAID-based to eliminate the disks as a point of failure. Two PCI network cards should be used in each server: one to connect to the organization's network, the other to create a private network between the two cluster systems.

So how does MSCS provide high availability? The servers in the cluster constantly check available resources by sending messages called "heartbeats." The heartbeats check all resources on both servers. If an application resource has failed but the server is still functioning, MSCS tries to restart the application. If the application will not restart or the server has failed, MSCS moves the application's resources and restarts it on the second server. Because both servers are constantly checking each other, discovery of failed resources and initialization of the failover usually takes place within 10 seconds. The actual time needed to fail over depends on the application. As an example, according to Microsoft, failover of SQL Server usually occurs in less than one minute.
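The recovery sequence just described (restart in place first, move to the partner only if that fails) can be sketched in a few lines of Python. This is an illustration of the decision logic only, not MSCS code: the Node class, the handle_failure function and the restart callback are hypothetical names invented for the example, and a single restart attempt is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    """One server in a two-node cluster (hypothetical model)."""
    name: str
    alive: bool = True
    resources: List[str] = field(default_factory=list)


def handle_failure(resource: str, owner: Node, partner: Node,
                   restart: Callable[[str, Node], bool]) -> str:
    """Return the name of the node that ends up running the resource.

    Assumed policy, mirroring the description in the text:
    1. If the owning server is still up, try to restart the application there.
    2. If the restart fails, or the server itself is down, move the resource
       to the partner node.
    """
    if owner.alive and restart(resource, owner):
        return owner.name
    if not partner.alive:
        raise RuntimeError("no surviving node to host " + resource)
    # Failover: ownership moves to the partner node.
    if resource in owner.resources:
        owner.resources.remove(resource)
    partner.resources.append(resource)
    return partner.name
```

Passing the restart step in as a callback keeps the sketch independent of how any real application is started or stopped.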

Failover occurs automatically. Because the cluster appears as a single server to clients, client computers do not have to be configured in any special way to use the cluster or handle a failover. Whether or not the client actually notices the failover depends on the nature of the application. For instance, delivering Web pages is stateless. That is, the connection is not maintained and each communication is discrete and independent. So if the failover occurs between requests to the server, the client may not notice. If the failover occurs during the request, the client may receive a "Server Unavailable" message. Once the failover is complete, the client will be able to get pages again. Hitting the refresh button on the browser is all that's necessary to actually get the requested page. Of course, the client may wonder what happened.

If the application maintains a state, a new logon may be required after a failover. This is dependent on the application. For instance, an application such as SQL Server may cache the user ID and password and be able to re-establish the connection without the user doing anything at all. If the application is not cluster aware, it probably doesn't have this capability.

A CLUSTER OF OTHER CLUSTERING PRODUCTS

There are non-Microsoft clustering and cluster management products for Windows NT, "some providing far more nodes and far more capability than Microsoft," says Harvey Hindin, senior research analyst for D.H. Brown Associates. "They just got tired of waiting for Windows 2000," he chuckles. Here's a sample of some of them.

Marathon (Boxborough, Mass.; www.marathontechnologies.com): HP and Marathon Technologies Corporation have just announced a deal that will offer Marathon's Endurance Assured Availability non-stop array with HP NetServers. The Endurance system is designed to provide 99.999% uptime, and Marathon has offered a $250,000 warranty against data loss with some of its products. "This is not really clustering," says Hindin, "but multiple servers." HP is expected to announce several configurations bundled with service and support options.

IBM: "IBM is about to announce enhancements so that Cluster Server will work with up to eight nodes," says Hindin. The announcement will be centered on IBM Netfinity hardware. The product is expected to include features such as hot-swap PCI hardware. IBM currently offers two-node Netfinity packages bundled with support and performance guarantees. The Netfinity Availability Extensions will sit on top of MSCS and offer cluster management software featuring administration capabilities and tracing and logging services.

Vinca Corporation (Orem, Utah; www.vinca.com): Co-Standby Server for NT (see HP Professional's Product Watch, May 1999) provides a shared-nothing cluster of two nodes. The systems in the cluster communicate via a "bi-directional mirroring process" which keeps the data on both servers current and in sync. If a system fails, the other system takes over. With mirroring solutions such as Vinca, some data loss may occur if a failover occurs before the mirroring process is complete. It does provide a more secure backup than MSCS by keeping a second copy of the data available. Vinca also has versions of Co-Standby Server for Novell NetWare and IBM Warp Server.

NuView, Inc. (Houston, Texas; www.nuview.com): ClusterX 2.0 is due for release this month. The company claims it's the "first and only cluster management solution to integrate the management of both Microsoft MSCS- and WLBS-based technologies." ClusterX provides a single console to manage both current Microsoft cluster technologies. The single interface allows not only control and configuration, but performance reporting and audit logs to monitor cluster activity.

-- R.M.

Something To Failback On

Once a failure is corrected, no manual intervention is required for failback to occur. Failback is the process of returning resources to their default server. For instance, assume Server A has experienced a hardware failure and Server B has assumed all operation of resources in the cluster. Once Server A is repaired, it's rebooted and rejoins the cluster. It communicates with Server B and initiates failback, bringing back the resources it usually hosts, resuming operations.
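The Server A / Server B example can be expressed as a small bookkeeping exercise. The sketch below assumes each resource has a recorded default server and that failback simply hands those resources back when that server rejoins. The fail_back function and the dictionary layout are hypothetical, chosen for illustration rather than taken from MSCS.

```python
from typing import Dict


def fail_back(owners: Dict[str, str], preferred: Dict[str, str],
              rejoined: str) -> Dict[str, str]:
    """Return a new ownership map after the server `rejoined` comes back.

    owners:    resource -> server currently running it
    preferred: resource -> default server for that resource
    Only resources whose default server is the rejoined node are moved.
    """
    return {
        res: rejoined if preferred.get(res) == rejoined else current
        for res, current in owners.items()
    }


# During the outage Server B runs everything; Server A then rejoins.
preferred = {"sql": "A", "files": "B"}
during_outage = {"sql": "B", "files": "B"}
after_repair = fail_back(during_outage, preferred, "A")
```

After the call, the database resource is back on Server A while the file resource, whose default server is B, stays put.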

The second clustering component on Windows NT Enterprise Server is WLBS. It has support for any IP-based service such as Web or FTP servers. It supports the clustering of up to 32 servers using the shared-nothing model. As in MSCS, the cluster appears as a single server to its clients. WLBS is different from MSCS in that many servers in the cluster can offer the same resource. When a request comes in under MSCS clustering, the request is routed to the server that controls the resource because only one MSCS server can offer the resource at a time. When a request comes into a WLBS cluster, the request is routed out to a server based on the traffic in the cluster. Servers may be designated to handle a certain percentage of the requests or they may divide the workload evenly. This is load balancing.

By providing the identical service from multiple systems in the cluster, WLBS provides true availability and scalability for many applications. When one system in the cluster goes down, traffic is automatically directed to the servers still operating. When more users need to be serviced, add a system to the cluster.

WLBS runs as a device driver under NT. It uses an algorithm to determine how to divide the workload. Each node in the cluster runs the algorithm independently so the load balancing is not dependent on a single system and not subject to a single point of failure. The workload is distributed statistically rather than dynamically. In other words, the workload is divided among servers based on parameters set by the system administrator, not by dynamically adjusting the load based on how busy each server in the cluster actually is. Workload can be distributed evenly or more powerful servers can be given a higher percentage of the work.
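To see how every node can reach the same answer without a central dispatcher, consider the following Python sketch. It is not the actual WLBS algorithm, which the article does not describe. It assumes a simple scheme in which each host hashes the client address and accepts the request only if the result falls in its own administrator-assigned share. The function names, host names and weights are all hypothetical.

```python
import hashlib
from typing import Dict


def owner_of(client_ip: str, weights: Dict[str, int]) -> str:
    """Map a client address to one host using fixed, administrator-set weights.

    weights maps host name -> share of traffic; shares need not sum to 100.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one host needs a positive weight")
    # A deterministic hash, so every host computes the same slot.
    digest = hashlib.sha256(client_ip.encode("ascii")).digest()
    slot = int.from_bytes(digest[:8], "big") % total
    upper = 0
    for host in sorted(weights):  # identical ordering on every host
        upper += weights[host]
        if slot < upper:
            return host
    raise AssertionError("unreachable: slot is always below total")


def should_accept(me: str, client_ip: str, weights: Dict[str, int]) -> bool:
    """Each host runs this on its own; exactly one host returns True."""
    return owner_of(client_ip, weights) == me
```

Because the weights are fixed parameters rather than live measurements, a busy host keeps its share until an administrator changes it. When a host leaves the cluster, the survivors only need to agree on the new weights table to cover its share.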

WLBS communicates within the cluster in a fashion similar to MSCS. In WLBS, the process is called convergence. A heartbeat is broadcast once a second by each server to all the servers in the cluster. If a server does not respond to five consecutive heartbeats, convergence begins. During the convergence process, the heartbeats are sent out twice a second. Each server communicates with all the other servers until they agree on the status of the cluster. This is necessary because WLBS runs on each and every server in the cluster and they must each have consistent information. This convergence process takes place when the cluster first starts or when systems enter or leave the cluster. No services are interrupted during convergence. The convergence messages and heartbeats are approximately 1.5KB and consume little bandwidth. In fact, no special or dedicated network interface is required. The convergence process takes place on the same interface as all other network traffic.
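The trigger for convergence, five missed one-second heartbeats, is easy to model. The sketch below is a simplified illustration rather than the WLBS driver's logic: the HeartbeatMonitor class is a hypothetical helper, timestamps are passed in explicitly instead of read from a clock, and a peer is treated as silent once more than five heartbeat intervals have passed since it was last heard.

```python
from typing import Dict, List

HEARTBEAT_INTERVAL = 1.0   # seconds between heartbeats (from the article)
MISSED_LIMIT = 5           # consecutive misses before convergence starts


class HeartbeatMonitor:
    """Tracks when each peer was last heard from (hypothetical helper)."""

    def __init__(self, peers: List[str], now: float) -> None:
        self.last_seen: Dict[str, float] = {p: now for p in peers}

    def heard_from(self, peer: str, now: float) -> None:
        """Record a heartbeat received from `peer` at time `now`."""
        self.last_seen[peer] = now

    def silent_peers(self, now: float) -> List[str]:
        """Peers silent for more than MISSED_LIMIT heartbeat intervals."""
        cutoff = MISSED_LIMIT * HEARTBEAT_INTERVAL
        return sorted(p for p, t in self.last_seen.items() if now - t > cutoff)

    def needs_convergence(self, now: float) -> bool:
        """True when at least one peer has gone silent."""
        return bool(self.silent_peers(now))
```

Every host would keep its own monitor, which fits the article's point that no single system is in charge of the cluster's view of itself.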

WLBS is ideal for Web servers and similar Internet applications. As a Web site becomes popular, it may experience rapid growth. If the site is run with a WLBS cluster, it's easy to add new capacity by adding new systems. Because they are standard NT systems, not even requiring a second network interface card, they can be added very cost effectively.

COMING TO TERMS WITH CLUSTERING

A cluster is any set of independent, whole computers that work together as a single resource and appear as a single computer to end-users. In general, clusters are used to address two specific computing problems: availability and scalability.

Let Me Take You Higher

Availability refers to the amount of time a system is available for clients. A database is a great thing when it's running, but useless if it's not. By keeping the database available, orders can be taken, customers can be serviced, etc. In the past, only a few systems were considered critical or were only considered critical during normal business hours. Now, as new systems and applications become central to organizations, it's more important than ever for them to be highly available.

Availability is usually defined as a percentage of uptime. A particular vendor may guarantee uptime at 99.5% or 99.9%. This may not seem like a large difference, but based on continuous operation of 24 hours a day, 365 days per year, the difference is quite large. Because the cost of downtime is so high, even a single hour of downtime may be unacceptable.

Availability	Downtime Per Year
99%	87 hours 36 minutes
99.5%	43 hours 48 minutes
99.9%	8 hours 46 minutes
99.95%	4 hours 23 minutes
99.99%	53 minutes
99.999%	5 minutes
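The figures in the table follow directly from the uptime percentage. As a quick check, the short Python sketch below recomputes them, assuming continuous operation over a 365-day year and rounding to the nearest minute. The function name downtime_per_year is simply a label chosen for this example.

```python
HOURS_PER_YEAR = 24 * 365  # continuous operation, non-leap year


def downtime_per_year(availability_pct: float) -> tuple:
    """Return (hours, minutes) of downtime per year, rounded to the minute."""
    total_minutes = round((100.0 - availability_pct) / 100.0 * HOURS_PER_YEAR * 60)
    return divmod(total_minutes, 60)


for pct in (99.0, 99.5, 99.9, 99.95, 99.99, 99.999):
    hours, minutes = downtime_per_year(pct)
    print(f"{pct}%: {hours} h {minutes} min")
```

Moving from 99.5% to 99.9% removes roughly 35 hours of downtime a year, which is why the apparently small difference between the two guarantees matters.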

Scaling New Heights

Scalability refers to the ability to provide more computing services transparently. When the hits on your Web site reach an all time high and performance begins to degrade, what do you do? You can upgrade or replace your Web server with more powerful hardware, which requires bringing the server down. Depending on the nature of your site, bringing a server down could mean lots of lost revenue.

Clusters address scalability by allowing the addition of capacity without interrupting the delivery of service. This approach offers a large benefit: You don't have to forecast demand accurately. If your usage prediction is low, you can add another system to a cluster. As demand increases, you can incrementally add smaller systems to meet the demand. Without a cluster, you must either be very accurate in your demand forecast or make up-front commitments to larger, more expensive servers with headroom.

-- R.M.

Application Equilibrium

Component Load Balancing (CLB) is a new Microsoft clustering technology that will be included in Windows 2000 (a.k.a. Windows NT 5.0) and will provide application clustering through Component Object Model (COM) components. CLB is scheduled for inclusion in the Advanced Server and Data Center versions of Windows 2000. Advanced Server will support two-node MSCS clusters and Data Center will support four-node MSCS clusters.

CLB enables applications built using COM components to be distributed across a group of servers. Microsoft is positioning CLB as middleware that will provide high-availability features to systems such as Internet Information Server with Active Server Pages applications. CLB will function like WLBS by providing load balancing. However, it goes one step further and is able to dynamically load balance by checking a server's performance through such measures as object response time and CPU load. The CLB model features a CLB routing server that handles requests from clients.

The routing server determines which node in the cluster is best able to fulfill the request and communicates this information to the client. The client then communicates directly with the specific server, leaving the routing server free to handle more requests and perform load balancing. In the event one of the servers in the cluster fails, the CLB routing server starts the COM component on another server in the cluster.
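How a routing server might decide which node is "best able" can be illustrated with a small Python sketch. Microsoft has not published the scoring it will use, so everything here is an assumption: the NodeStats record, the pick_node function and the cost formula (response time scaled up by CPU load) are invented for illustration only.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class NodeStats:
    """Most recent measurements for one application server (hypothetical)."""
    name: str
    response_ms: float   # recent object response time
    cpu_load: float      # 0.0 (idle) to 1.0 (saturated)
    healthy: bool = True


def pick_node(nodes: Iterable[NodeStats]) -> str:
    """Return the name of the healthy node with the lowest cost.

    The cost formula is an invented example, not Microsoft's.
    """
    candidates = [n for n in nodes if n.healthy]
    if not candidates:
        raise RuntimeError("no healthy node available")
    # Ties on cost are broken by name so the choice is deterministic.
    best = min(candidates,
               key=lambda n: (n.response_ms * (1.0 + n.cpu_load), n.name))
    return best.name
```

Unlike the fixed percentages used by WLBS, the choice here changes from request to request as the measurements change, which is the sense in which CLB load balancing is dynamic.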

WINDOWS NT LOAD BALANCING SERVICE -- ON BALANCE, GOOD ENOUGH

The Windows NT Load Balancing Service (WLBS) is part of Windows NT 4.0 Enterprise Edition. It's not included on the distribution disk, but is available from Microsoft at www.microsoft.com/ntserver.

Installation of Enterprise Edition is annoying, but not difficult. The package comes with four CDs. Two contain Enterprise Edition, one contains Service Pack (SP) 4 and the last contains the Windows NT Option Pack. Microsoft continues to offer SPs instead of integrated version releases of the operating system.

This forces several steps during installation. I had to install NT. When the system rebooted, I was told to continue with the installation by installing SP 3. After installing SP 3 and rebooting, I had to install any Enterprise Edition components, such as MSCS. After the third installation and reboot, I had an operating system. To get the latest version of Internet Information Server, I had to install the Option Pack, which also required that I install Internet Explorer 4.01. All told, I had four (or was it five?) reboots. I didn't even bother with SP 4.

Installation Is Simple

After downloading WLBS, installation is simple. The download file is slightly less than 2MB. It installs from the Control Panel, Network option as a network adapter in about two minutes. I installed WLBS on a cluster of four machines, two Intel and two Alpha. Each machine had a single 10/100 Ethernet network interface card that carried both the regular network traffic and the WLBS convergence traffic.

Configuration of WLBS is done on a single Properties screen (see page 27) with three sections: Cluster Parameters, Host Parameters and Port Rules. In Cluster Parameters, the primary IP address of the cluster is set. This is a virtual address used by clients to access the cluster. All the systems in the cluster must be set to the same primary IP address. In Cluster Parameters, you can also enable remote control of WLBS so other machines on the network can remotely manage the cluster.

In Host Parameters, the system's unique IP address is assigned. A checkbox makes the system an active member of the cluster or removes it from the cluster. You can also assign a host priority. Each system in the WLBS cluster must have a unique host priority. By default, cluster traffic is handled by the host with the highest priority. The priority also controls which systems are first to pick up the traffic when a system goes down.

Port Rules Rule

Port Rules allow you to configure how specific TCP/IP services are to be handled by the cluster. This is done by configuring the ports (80 for Web services, 21 for FTP, etc.) individually. There are several options for directing the traffic. Multiple hosts can handle an individual port, either sharing the load equally or with more powerful servers assigned a higher percentage. The port traffic can also be directed to a single host in the cluster. For instance, if FTP traffic is low, it can be directed to a single host, leaving the others free to serve Web pages. You can also disable the port in the cluster.
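The three choices for a port (shared among multiple hosts, handled by a single host, or disabled) can be pictured as a small rule table. The Python sketch below models that idea only. It is not the WLBS configuration format, and the PortRule class, the host names, the weights and the choice of port 23 as the disabled example are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class PortRule:
    """One rule covering an inclusive port range (hypothetical model)."""
    start: int
    end: int
    mode: str                                   # "multiple", "single" or "disabled"
    weights: Dict[str, int] = field(default_factory=dict)  # used by "multiple"
    single_host: Optional[str] = None                       # used by "single"


def rule_for(port: int, rules: List[PortRule]) -> Optional[PortRule]:
    """Return the first rule whose range covers the port, or None."""
    for rule in rules:
        if rule.start <= port <= rule.end:
            return rule
    return None


RULES = [
    # Web traffic shared by three hosts, the strongest taking half.
    PortRule(80, 80, "multiple", weights={"web1": 50, "web2": 25, "web3": 25}),
    # Low-volume FTP traffic sent to one host.
    PortRule(21, 21, "single", single_host="web3"),
    # A port switched off across the whole cluster.
    PortRule(23, 23, "disabled"),
]
```

A lookup such as rule_for(80, RULES) returns the shared Web rule, while a port with no matching rule returns None and would fall back to whatever the default handling is.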

After the initial configuration, WLBS is low maintenance. I configured the cluster as an FTP and Web server. As I brought the last system in, the convergence process took eight seconds. The different platforms (Intel and Alpha) worked together very well. As I disconnected one machine, the failure was detected and the systems reconverged in 11 seconds. I tested the failover using a script that demanded refreshed Web pages via a browser. As the system "failed" (was unplugged) there was a noticeable pause, which I didn't think was unusual, given the sometimes choppy performance of the Web.

I was generally satisfied with WLBS, but the system is noticeably lacking in management tools. There's no place to see which systems are participating or what the traffic is like. You can determine the participation by checking the Event Viewer, where a message listing the hosts in the cluster is logged at each convergence. That's hardly an optimum solution. I couldn't find any Performance Monitor counters or objects relating to WLBS. The lack of management tools is a glaring omission. It may be a function of the relative newness of the product and hopefully will be corrected in subsequent releases. But for quick, no-frills clustering, WLBS does the job.

-- R.M.

Cluster Muster

It may be difficult to immediately see how these various clustering technologies can be used together. "Cluster Server [MSCS] provides high availability and data integrity," says Kevin Briody, Microsoft's product manager for clustering and load balancing. Briody says that Cluster Server was primarily designed to secure data. The focus of WLBS is scalability. "It's a function of how they were designed. Load Balancing is an NT device driver. It has no way to know a database is running as a separate instance on another machine."

Briody describes a "two-tier approach" to clustering. "Web servers will be clusters for e-commerce using WLBS. Database servers will be clustered with MSCS." In the first tier, WLBS provides high scalability and availability for Web servers sending pages to users. As actual transactions occur, MSCS can ensure database integrity by providing a server failover to a common disk storage system.

Briody believes the future is actually in a three-tier system. "NT 4 just has two technologies. Windows 2000 will provide a third tier." CLB will provide clustering for application services such as Microsoft Transaction Server. The three-tier approach will provide organizations with the ability to provide systems that are not subject to single points of failure and can grow to match the Internet's explosive growth.

Clusters are not necessarily the perfect solution to all your problems. Clusters cannot protect against things such as power failures. As systems become more and more critical, even the short failover times with current clustering technology can mean serious data loss. But Microsoft and others are investing heavily in cluster technology that they hope will drive NT further into the enterprise.

-- Ryan Maley is a Microsoft Certified Systems Engineer and author of HP Professional's On The Server Side column.