Feature: Widespread High Availability Use Remains on Standby

The term "high availability" may sound reassuring to IT managers charged with keeping Web servers, e-mail and data warehouses humming along. Achieving high availability with Windows NT systems, however, has proven to be far more elusive than with mainframe or Unix systems.

High availability is typically defined as 99.5 percent uptime, which means systems are still down about 44 hours per year. Continuous availability is defined as 99.9 percent uptime. The best a Windows NT site can hope for at this time is "about 99.7 percent availability, even in a clustered environment," according to Joe Barkan, research director with GartnerGroup (Stamford, Conn.).
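The downtime figures behind these percentages are simple arithmetic. The following is a minimal sketch, assuming a 365-day year of 8,760 hours; the function name is illustrative and does not come from any vendor tool.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years are ignored


def annual_downtime_hours(availability_percent: float) -> float:
    """Return the hours per year a system is down at a given uptime percentage."""
    return (100.0 - availability_percent) / 100.0 * HOURS_PER_YEAR


if __name__ == "__main__":
    # Uptime levels quoted in the article.
    levels = [
        ("high availability", 99.5),
        ("best case for NT, per GartnerGroup", 99.7),
        ("continuous availability", 99.9),
    ]
    for label, percent in levels:
        hours = annual_downtime_hours(percent)
        print(f"{label}: {percent}% uptime is {hours:.1f} hours down per year")
```

At 99.5 percent the result is 43.8 hours, which matches the "about 44 hours" cited above. The 99.7 percent ceiling quoted for Windows NT still works out to more than a full day of downtime per year.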

On the server, most outages result from disk problems or power failures, according to research from Intel Corp. Other sources of unplanned downtime include software integration problems that result when pieces of a system interact, says Jon Affeld, worldwide product manager for the NetServer high availability program at Hewlett-Packard Co.

Other culprits that can bring NT systems down are system mismanagement and environmental errors. "Even if you set up the system correctly, there are a lot of chances you could load the wrong software, wrong version of software, or didn't test it thoroughly once you've made a change to it," Affeld says.

From Shared Disk to Shared Nothing

Typically, high availability is achieved through the clustering of two or more processors, so that if one fails, another can pick up the slack. These processors need to be connected by a network and be able to access data on each other's disks, with clustering software running on both machines.

The earliest technique for high availability clustering was shared disk, in which every server accessed every disk. But this required expensive cabling, switches, and Distributed Lock Manager software. A more flexible alternative, mirrored disks, entails each server having its own disks, and running software that mirrors every write from one server to a copy of the data on another. Complete data synchronization, however, is not always assured. The latest approach, shared-nothing technology, enables each server to have its own disk resources. Disk ownership is transferred between servers in the event of server failure.
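The shared-nothing idea, in which each disk has exactly one owning server and ownership moves on failure, can be illustrated with a small model. This is a sketch only, not how MSCS or any product named here is implemented; the class, method, server and disk names are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SharedNothingCluster:
    """Toy model: every disk is owned by exactly one server at a time."""

    # Maps each disk name to the server that currently owns it.
    owner_of: dict[str, str] = field(default_factory=dict)

    def add_disk(self, disk: str, server: str) -> None:
        self.owner_of[disk] = server

    def fail_over(self, failed: str, survivor: str) -> list[str]:
        """Transfer every disk owned by the failed server to the survivor."""
        moved = sorted(d for d, s in self.owner_of.items() if s == failed)
        for disk in moved:
            self.owner_of[disk] = survivor
        return moved


if __name__ == "__main__":
    cluster = SharedNothingCluster()
    cluster.add_disk("disk-1", "server-a")
    cluster.add_disk("disk-2", "server-b")
    # server-a fails, so server-b takes ownership of its disk.
    print(cluster.fail_over(failed="server-a", survivor="server-b"))
    print(cluster.owner_of)
```

Because no disk is ever accessed by two servers at once, this model needs no lock manager, which is the cost the shared-disk approach carries.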

Clustering software permits continuous monitoring, fault detection, rapid recovery, and automated failover between two or more servers. Along with Microsoft Cluster Server (MSCS), other Windows NT high availability clustering software packages on the market include NonStop Software from Compaq's Tandem unit, Octopus from FullTime Software (San Mateo, Calif., www.fulltimesoftware.com), Double-Take from NSI Software (Hoboken, N.J., www.nsisw.com), Oracle Parallel Server, and Co-Standby Server from Vinca Corp. (Orem, Utah, www.vinca.com).
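Fault detection in clustering software of this kind is commonly built on a heartbeat: each node reports in periodically, and a node that stays silent past a timeout is presumed failed. The sketch below shows that general technique under stated assumptions; it does not reflect the internals of any product listed above, and the class name, timeout value and node names are hypothetical. Timestamps are passed in explicitly so the example is deterministic.

```python
from __future__ import annotations


class HeartbeatMonitor:
    """Flags nodes whose most recent heartbeat is older than a timeout."""

    def __init__(self, timeout_seconds: float) -> None:
        self.timeout_seconds = timeout_seconds
        # Maps each node name to the time of its last heartbeat.
        self.last_seen: dict[str, float] = {}

    def record_heartbeat(self, node: str, now: float) -> None:
        self.last_seen[node] = now

    def failed_nodes(self, now: float) -> list[str]:
        """Return nodes that have been silent for longer than the timeout."""
        return sorted(
            node
            for node, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )


if __name__ == "__main__":
    monitor = HeartbeatMonitor(timeout_seconds=5.0)  # assumed timeout
    monitor.record_heartbeat("node-a", now=0.0)
    monitor.record_heartbeat("node-b", now=0.0)
    monitor.record_heartbeat("node-a", now=4.0)  # node-b has gone quiet
    print(monitor.failed_nodes(now=7.0))
```

In a real cluster, a node reported by `failed_nodes` would trigger the failover step, with the surviving node taking over the failed node's resources. The timeout is a trade-off: a short one speeds recovery but risks false alarms on a busy network.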

Microsoft Cluster Server, bundled into Windows NT Server 4.0 Enterprise Edition, supports SQL Server and mirrors applications and data running on NT Servers. Still known to many by its code name, Wolfpack, MSCS supports two-node active failover.

National Semiconductor recently migrated 350 users to Windows NT 4.0 from a Novell NetWare environment to consolidate system storage and to centralize backups. The company implemented Microsoft Cluster Server, supported by the SuperFlex 5500 RAID storage solution from Artecon Inc. (Carlsbad, Calif., www.artecon.com), to provide high availability for manufacturing tools used by engineers. The system uses active-active controllers, employing MSCS as a load-balancing as well as failover tool, says Dee Seneviratne, team leader for desktop LAN services for National Semiconductor.

The department maintains four Compaq 3000 servers with dual processors. "The two data servers are set up as a cluster, with a constant heartbeat between the two servers," she says. While initial configuration was "a bit tricky," she notes that testing shows that failover occurs fairly rapidly.

High availability was recently introduced to the LAN environment of First Union Capital Markets, the investment banking arm of Charlotte, N.C.-based First Union Corp. First Union implemented the Microsoft Cluster Server to ensure the flow of data on bonds, securities, commercial loans, and other monetary instruments. In a recent update of its network, the company implemented Microsoft NT Server Enterprise Edition to replace Novell NetWare.

"Fault tolerance for the network was at the top of our list," says Sushi Vyas, assistant vice president in charge of First Union's Windows NT Server infrastructure. Vyas implemented Microsoft Cluster Server in active-active mode, where "both nodes can be performing functions, such as Exchange messaging or file and print services, yet still be backing each other up," he says. First Union implemented about 80 Compaq dual Pentium-Pro processor-based servers that serve 8,000 employees. While the clusters in production are using shared SCSI connections, First Union plans to migrate the system to Fibre Channel to maintain two physically separate data centers in downtown Charlotte.

Yellow-Tape Syndrome

A drawback of the current release of MSCS is that it requires dual-ported SCSI peripherals, which means the standby server must be located in the same room as the primary. Thus, MSCS is not a solution for "yellow-tape" recovery issues -- when employees cannot get into their building because of an emergency, such as fire or flood.

"Having both servers in the same room won't even protect you against power failures, and that's one of the most common reasons servers fail," notes Don Beeler, CEO of NSI. "What would happen if there's a fire or a power outage in your building, and your company just spent all this money on a high availability cluster solution, yet no one can get on the system anyway?"

Most MSCS implementations seen by Barkan and other industry experts consist of file-and-print and Exchange applications. "There's very little OLTP being supported by high availability products -- the risk is too high," Barkan relates.

Some users complain that the failover process with MSCS is too long for some applications, potentially extending to more than 30 minutes. Because of this slowness and the complexity of installing and configuring MSCS, "Microsoft doesn't want to push Cluster Server as a solution," according to one industry source. "They don't want a lot of support calls or other problems they may create for themselves."

Service and support for implementing clustering is limited, agrees Barkan. "Clustering is not a trivial task. And the technology is too complicated to be sold by resellers." Implementing Cluster Server may be daunting "if you have very little staff or partners to help implement it," Barkan relates. "Microsoft Cluster Server is appropriate for some, but not for everyone."

Microsoft plans to make the next release of MSCS more user-friendly, with wizard-driven installation and plug-and-play features. Other new features, which may ship with Windows 2000 Advanced Server and Datacenter Server, include storing cluster-related objects in Active Directory, Active Directory failover in clusters, and support for COM in clustering. Cluster management will also be integrated with Microsoft Management Console.

Microsoft confirms that a four-node failover edition of MSCS has been developed, but it needs further testing and will not be included with the first shipments of Windows 2000. In addition, the company reports that its own internal surveys find that two-server clusters satisfy most of its customers' high availability needs at this time. Microsoft may even be watering down its formerly ambitious plans for Cluster Server, Barkan says.

The other benefit to clustering -- scalability as workload is shared between multiple nodes -- is being addressed by SMP. "The current release of MSCS provides high availability based on a shared-disk subsystem model for a two-node cluster," Barkan says. "Later enhancements are planned to provide the underpinnings for shared-nothing, message-passing database implementations. Microsoft has not indicated an expected date for these enhancements."

As part of its efforts to bring high availability clustering capabilities to Web-centric environments, Microsoft recently acquired Valence Research, a Beaverton, Ore.-based developer of TCP/IP load-balancing and fault tolerance software for Windows. Valence Research's Convoy Cluster Software, which supports Exchange Server and IIS and provides availability to Internet server farms, will be integrated into Windows 2000 Advanced Server and Datacenter Server.

Cautious Approaches

A cautious approach was adopted by Derek Glickstein, owner of Wise Choice Software (New York), which provides manufacturing solutions to the jewelry industry. Glickstein considered Microsoft Cluster Server as a high availability solution, but due to its newness, "we just weren't too confident with it." Wise Choice settled on Vinca's Co-Standby Server, which provides two-node failover functionality similar to MSCS. Failover scripts enable individual applications to run during a failover, including SQL Server and Exchange Server.

Installation and implementation of Vinca’s product on custom-made servers running Windows NT 4.0 Enterprise Edition took about four hours, Glickstein reports. "After we got it installed, we flicked the switch on the first server and it worked," he says. "Our clients are running billions of dollars worth of business with our software. It's unacceptable for us to be down. We want to support a client in their time of need -- which happens every minute," Glickstein says.

A cautious approach to MSCS was also adopted by Financial Pacific Insurance (Rocklin, Calif.), which needed to improve the availability of its network of Dell PowerEdge 6100 quad-processor servers, running SQL Server 6.5 on NT 4.0. The company supports an online transaction processing system for capturing insurance quotes, administering decision support systems, generating management reports, and tracking bond transactions.

The company did review MSCS, but didn't feel "100 percent confident" in the technology, especially since it was in beta at the time the company was evaluating its options, says Ben McLarin, development manager for Financial Pacific. "Microsoft is eventually going to get it correct, and eventually I will look at MSCS again," he says.

The company implemented Vinca’s Co-Standby Server for two-node failover to keep the systems online at all times, says McLarin. One server is "purely used for standby," or active-passive mode, he explains.

Based on tests and actual experiences, the second server and database can be up and running within 30 seconds, he adds. "It's truly amazing for SQL Server to come back up and respond that fast. Vinca prestages all the services, all the databases, and everything else, so SQL Server comes up relatively quick."

High availability clustering is becoming a part of most major database vendors' offerings; most either already have or will have parallel versions of their databases available. For example, Oracle Parallel Server will support up to four NT nodes. But Oracle Parallel Server is not a widely adopted system, Barkan notes. "There are probably less than 20 production Oracle Parallel Server implementations across the world today," he says. Oracle also offers a product called Oracle Fail Safe for Windows NT, which is built on MSCS.

It's notable that "database developers aren't waiting around for Microsoft," Barkan says. "They're doing their own thing without Microsoft. Even SQL Server is doing their own thing without the operating systems division. The SQL Server folks are under more pressure than the operating systems folks to provide clustering," he says.

A major high availability solution that brings over technology from the mainframe and Unix worlds is Tandem NonStop Software, which supports up to 16 Windows NT nodes. The NonStop Software product line includes NonStop Database and NonStop Transaction Processing, which run on both Windows NT and Tandem's Himalaya architecture. The software is designed for the high availability failover capabilities that are part of Wolfpack. NonStop Software also features scalability to multiple nodes of processors that can be interconnected with Tandem's clustering architecture.

Major hardware vendors, including HP, IBM and Unisys, are providing bundled premium high availability solutions to the NT market. HP recently announced its Mission Critical Server Suite for Windows NT, which is modeled after its Unix approach and promises 99.95 percent uptime, with the fees for the service refunded to the customer if that target is missed. The HP suite includes a configuration of the HP NetServer LXr 8000 system, clustering tools, testing, and access to HP's disaster recovery services.

New Microsoft Cluster Server Features

(Planned to ship with Windows 2000 Advanced Server)

  • Cluster-related objects are stored in Active Directory
  • Active Directory failover in clusters
  • Plug-and-play support for NICs and disk elements in clusters
  • Support for COM in clustering
  • Resource APIs are COM-enabled
  • Client network failure detection and recovery
  • Improved setup with Wizard
  • Cluster management is integrated with Microsoft Management Console -- administration tools become MMC snap-ins
  • Failover support for system services that do not fail over today, such as DHCP, DNS, DFS, and Index Server
  • Real-time performance monitoring
  • Advice on how to redistribute loads and the ability to redistribute loads at a set time (not dynamically)
  • Proactive failover on hardware management alerts.

Source: GartnerGroup