Getting Down With NT’s Upside -- Enterprise Systems

12/17/1999

All good IT managers know one thing: possessing the greatest technology in the world doesn't mean a thing if the systems aren't running. All great IT managers do something about it. That brings us to using Windows NT for mission-critical applications. While Microsoft has made some solid progress in this area with Cluster Services for NT (see "Putting Some Luster on NT Clusters" in the July 1999 issue of HP Professional), it still has a way to go before reaching the elusive "five nines."

Failure Is Not An Option

5nines:5minutes (HP's term for it), or 99.999% uptime (which translates to 5.256 minutes of unscheduled downtime per year), is a very demanding standard. A system must not have any single point of failure. If a failure does occur, the system must be able to continue processing without data loss. The system must be repairable while it operates. Once repaired, the system must transparently return to its original state.
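The 5.256-minute figure follows directly from the number of minutes in a year. A minimal sketch of that arithmetic, assuming a 365-day year (the function name is illustrative, not anything from HP or Marathon):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for a given uptime percentage."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return MINUTES_PER_YEAR * unavailable_fraction


# Compare "two nines" through "five nines".
for uptime in (99.0, 99.9, 99.99, 99.999):
    minutes = downtime_minutes_per_year(uptime)
    print(f"{uptime}% uptime allows {minutes:.3f} minutes of downtime per year")
```

Each additional nine cuts the allowance by a factor of ten, which is why the jump from 99.99% (roughly 53 minutes a year) to 99.999% is so hard to make.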

While these four concepts are easy to repeat, they're not so easy to implement. According to a report released by the Harvard Research Group (hrgresearch.com), only one system meets these criteria while running Windows NT: the Endurance 4000 from Marathon Technologies.

The Marathon Endurance product is designed primarily with off-the-shelf components: standard Windows NT software and industry-standard Intel-based hardware. Endurance is an array of systems that appears as a single system to end users. There are actually four separate systems in the array. Two are designated as Compute Elements. These systems are bare-bones machines with only a CPU and memory. The other two systems are I/O Processors, which contain network interface cards and storage devices. The four systems are connected with a proprietary high-speed interconnect designed by Marathon.

The Compute Elements run all the software: both the operating system and the applications execute on both Compute Elements simultaneously. They are designed to run in "lockstep"; that is, each instruction is run on each element at the same time. This ensures that if one Compute Element fails, the other continues at the next instruction and users do not know anything has gone wrong. This lockstep requirement means that the Compute Elements' hardware must be identical.
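To make the lockstep idea concrete, here is a toy simulation. It is only a sketch of the concept, not Marathon's implementation: the `ComputeElement` class, the `run_lockstep` function, and the "add a number" instruction are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class ComputeElement:
    """Hypothetical stand-in for one Compute Element (a CPU plus memory)."""
    name: str
    accumulator: int = 0
    alive: bool = True

    def execute(self, operand: int) -> int:
        # One toy "instruction": add the operand to the accumulator.
        self.accumulator += operand
        return self.accumulator


def run_lockstep(elements, program, fail_at=None):
    """Feed each instruction to every live element and check that they agree.

    fail_at is an optional (step, element_name) pair that simulates a
    Compute Element failing just before the given step.
    """
    result = None
    for step, operand in enumerate(program):
        if fail_at is not None and fail_at[0] == step:
            for element in elements:
                if element.name == fail_at[1]:
                    element.alive = False
        live = [element for element in elements if element.alive]
        if not live:
            raise RuntimeError("no Compute Element left to run the program")
        results = {element.execute(operand) for element in live}
        # Identical elements running identical instructions must agree.
        if len(results) != 1:
            raise RuntimeError(f"lockstep divergence at step {step}")
        result = results.pop()
    return result


program = [5, 7, 11, 13]
healthy = run_lockstep([ComputeElement("CE1"), ComputeElement("CE2")], program)
degraded = run_lockstep(
    [ComputeElement("CE1"), ComputeElement("CE2")], program, fail_at=(2, "CE1")
)
print(healthy, degraded)
```

Because both elements hold the same state at every step, losing one in the middle of the program does not change the final answer: the survivor simply carries on with the next instruction.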

If One Should Happen To Fall

The I/O Processors handle all I/O operations, including disk access and network communication. They carry out any I/O instructions that the Compute Elements generate. Each I/O Processor contains a storage system, and all data is mirrored between the I/O Processors. If one I/O Processor fails, the other takes over. A RAID array on each I/O Processor can be used to provide hot-swap capability and even more redundancy.
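The mirroring behavior can be sketched in a few lines. This is a conceptual model only: the class names are hypothetical, and the resynchronization step in `rejoin` is an assumption about how a repaired unit might catch up, not a description of Marathon's actual mechanism.

```python
class IOProcessor:
    """Hypothetical stand-in for one I/O Processor's storage system."""

    def __init__(self, name):
        self.name = name
        self.online = True
        self.blocks = {}  # block number -> data


class MirroredStore:
    """Writes go to every online I/O Processor; reads come from any online one."""

    def __init__(self, first, second):
        self.members = [first, second]

    def write(self, block_no, data):
        live = [member for member in self.members if member.online]
        if not live:
            raise IOError("no I/O Processor available")
        for member in live:
            member.blocks[block_no] = data

    def read(self, block_no):
        for member in self.members:
            if member.online:
                return member.blocks[block_no]
        raise IOError("no I/O Processor available")

    def rejoin(self, member):
        # Assumption: a repaired unit copies the surviving mirror before going live.
        peer = next(
            (m for m in self.members if m is not member and m.online), None
        )
        if peer is None:
            raise IOError("no live peer to resynchronize from")
        member.blocks = dict(peer.blocks)
        member.online = True


iop1, iop2 = IOProcessor("IOP1"), IOProcessor("IOP2")
store = MirroredStore(iop1, iop2)
store.write(0, b"order-1001")
iop1.online = False               # simulate a failed I/O Processor
store.write(1, b"order-1002")     # processing continues on the survivor
print(store.read(0), store.read(1))
store.rejoin(iop1)                # repaired unit catches up and rejoins
print(iop1.blocks == iop2.blocks)
```

The point of the sketch is that callers of `read` and `write` never need to know which I/O Processor served them, which is what lets a failure pass unnoticed.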

Unlike the Compute Elements, the disk systems do not need to be identical. The I/O Processors also handle network traffic. Each one has a network interface, and the registry on each is set so that both cards use the same MAC address. If one of the interfaces fails, the other systems on the network still see the same Ethernet address and can continue without interruption.

One other important feature of an Endurance array is tolerance of site failure. In Marathon jargon, a Compute Element-I/O Processor pair is referred to as a "tuple." In other words, an Endurance array consists of two tuples. The two tuples can be physically separated by up to 1.5 kilometers and connected via fiber optic cable. This allows them to be put in separate buildings on separate power grids, providing not only component failure tolerance, but site failure (power, air conditioning, etc.) tolerance as well.

The Net Effect

In a demonstration at HP World, this highly redundant system seemed to work very well. Failures were simulated simply by turning off various parts of the array. For instance, with a streaming video playing on the system, turning off an I/O Processor had no effect that I could detect. Similarly, turning off a Compute Element caused no effect.

Just as importantly, turning the systems back on caused no interruption, and they rejoined the array with no intervention. It's easy to see how any component could be repaired and returned to service quickly.

HP has bundled the Marathon Endurance with its NetServer systems and offers it as the HP NetServer Assured Availability System. The system uses rack-mounted NetServer LPr and LH systems in a variety of prepackaged configurations. The systems can be ordered in a single rack for a single site or in two racks to provide split-site redundancy. The prepackaged solutions feature all the necessary hardware and software components to run your own high-availability NT system.

Unfortunately, high availability is not cheap. Four systems are required. Because two of the systems (the Compute Elements) can be stripped down, the cost is approximately equivalent to three systems. Additionally, four Windows NT licenses are required, one for each system. And don't forget your applications: two licenses are required, one for each Compute Element. On top of these expenses, there is also a 10% performance penalty (according to Marathon calculations) for the various error checking and verification functions.

Making It All Up

While all this may lead to a severe case of sticker shock, the important measure is uptime. How much does not being able to process airline tickets, run manufacturing equipment or fulfill stock trades cost a company?

In the case of eBay, a 22-hour outage cost $5 million in lost revenue and a 20% drop in stock price. Will your losses be as dramatic? That's up to you.
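For a rough sense of scale, the outage figures quoted above can be turned into an hourly cost and compared with a five-nines downtime budget. This is a back-of-the-envelope sketch that assumes lost revenue accrues evenly over the outage and ignores the stock-price drop entirely; the function names are illustrative.

```python
OUTAGE_HOURS = 22               # length of the outage cited in the article
LOST_REVENUE_USD = 5_000_000    # lost revenue cited in the article
FIVE_NINES_MINUTES = 5.256      # yearly downtime allowed at 99.999% uptime


def cost_per_hour(lost_revenue: float, outage_hours: float) -> float:
    """Average revenue lost per hour, assuming losses accrue evenly."""
    return lost_revenue / outage_hours


def downtime_cost(minutes: float, hourly_cost: float) -> float:
    """Revenue lost during a given number of minutes of downtime."""
    return minutes / 60 * hourly_cost


hourly = cost_per_hour(LOST_REVENUE_USD, OUTAGE_HOURS)
five_nines_loss = downtime_cost(FIVE_NINES_MINUTES, hourly)

print(f"Average loss per hour of outage: ${hourly:,.0f}")
print(f"Loss for a full year's five-nines downtime: ${five_nines_loss:,.0f}")
```

Plugging in your own revenue figures gives a first estimate of how much redundant hardware and licensing a given reduction in downtime can justify.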