Hardware High Availability Programs in Action
Here’s a challenge for you: Use the phrases "Windows NT" and "99.9 percent uptime" in the same sentence. If you’re having as difficult a time as I did relating the two terms, you, too, might be skeptical when it comes to vendor programs offering 99.x percent uptime commitments on their enterprise class NT systems.
System performance agreements, old news in mainframe and midrange environments, are new to the NT world. But why is there this seemingly sudden rush by vendors to offer these "nines" programs? Some consultants see this phenomenon simply as more evidence of NT’s growing acceptance as an enterprise solution.
Ed Canfield, senior manager at Ernst and Young’s Center for Technology Enablement (www.ey.com) believes there are several forces at work. "First, NT is gaining more acceptance as a mission critical platform as enterprises experiment with migrating large applications off their mainframe and midrange platforms. Second, we see a trend toward centralized computing as many formerly ‘departmental’ NT servers are being reined into the data center. Thirdly, many NT systems are being used as gateways and front ends for larger systems. Uptime for these systems is just as critical as it is for the legacy platform sitting behind it."
Peter Sequera, NT mission critical services product manager at Hewlett-Packard Co., agrees that the trend is happening, if not on the reasons why. "In many cases, NT’s potential cost savings are pushing companies into implementing it faster than they’d like," he says.
What follows is a look at the uptime programs offered by some of the industry's heaviest hitters: Compaq Computer Corp., Data General Corp., HP, IBM Corp. and Unisys Corp.
The basic component of most uptime programs is a high level of hardware and operating system support. Included in this configuration is a robust, redundant NT system -- typically an NT cluster. These systems are monitored and controlled through an enhanced set of system management tools, and they are backed by a highly responsive repair service and "red phone" operating system support access to vendor trained engineering groups and Microsoft.
Getting a reliable initial configuration is the key first step to achieving exceptional uptime. "In a sense, Windows NT has gotten a bum rap regarding the issue of reliability," contends Shawn McPherron, director of database and high availability marketing at Unisys. "NT systems can be extremely reliable if set up correctly from the start."
These vendors take great pains in selecting, testing and qualifying components for their high-availability systems to reduce the chance of downtime caused by hardware, driver or NT operating system faults. Unisys, for example, tests and refines driver code, particularly components related to clustering, as a chief means of eliminating downtime issues.
HP's Sequera says his company’s test lab uncovered clustering problems caused by seemingly innocent factors, such as device cable length. He adds, "Developing a high-availability cluster can be a challenge. A reliable high-availability solution requires extensive planning, testing and qualification -- tasks that can be beyond the customer’s ability, schedule or budget."
Data General's philosophy is similar. "These are n-dimensional systems where there can be no single points of failure," says Dave Flawn, vice president for NT marketing at Data General. "Systems must be clustered and must talk to RAID disk subsystems. We configure ‘dual everything’ -- dual SCSI adapters, dual host bus adapters, dual write caches and dual fans, just to list some common components."
Basic statistics comes into play, too. To achieve 99.9 percent uptime, individual components each must exceed the reliability of the entire system by a significant margin. "If you do the math and multiply 99.9 percent reliability for each system component, you wind up with total system reliability of less than 99.9 percent," HP's Sequera notes.
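The arithmetic behind Sequera's remark, and behind Data General's "dual everything" approach, can be sketched in a few lines of Python. This is an idealized model, not any vendor's method: it assumes components fail independently, that a redundant pair fails over instantly, and that the system has 10 components in its chain. The component count and function names are assumptions chosen for illustration.

```python
# Illustrative model only: assumes components fail independently.

def series_availability(component_availability: float, count: int) -> float:
    """All `count` components must be up for the system to be up."""
    return component_availability ** count


def redundant_availability(unit_availability: float, copies: int = 2) -> float:
    """Only one of `copies` identical units must be up."""
    return 1.0 - (1.0 - unit_availability) ** copies


COMPONENTS = 10  # hypothetical number of components in the chain

# Ten single components, each 99.9 percent available.
single_chain = series_availability(0.999, COMPONENTS)

# The same chain with every component duplicated ("dual everything").
dual_chain = series_availability(redundant_availability(0.999), COMPONENTS)

print(f"Single components: {single_chain * 100:.3f} percent")
print(f"Dual components:   {dual_chain * 100:.3f} percent")
```

In this model the single-component chain falls below the 99.9 percent target while the duplicated chain clears it. Real clusters do worse than the idealized figure because failover takes time and some failures affect both units at once.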
Once a machine is built using statistically validated components, monitoring becomes vital. For example, IBM uses its Remote Connect program to detect unhealthy components before they fail. The system automatically dispatches a technician to the site, sometimes before data center administrators even know of the problem.
Configuration changes are another issue that can degrade the stability of a system. HP is planning to address this subject with the launch of its High Availability Observatory, a tool that can track configuration changes and send system configuration snapshots to HP uptime control centers. Unisys uses its Uptime management tools to take comparative snapshots of a system’s configuration. These types of tools can help prevent system failures that could be caused by something as simple as an incompatible DLL.
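The general idea of comparing configuration snapshots can be illustrated with a short Python sketch. It does not represent how HP's or Unisys' tools work; the snapshot format (a mapping of file name to version string) and the file names are hypothetical.

```python
from typing import Dict, List

# Hypothetical snapshot format: component name -> version string.
Snapshot = Dict[str, str]


def diff_snapshots(before: Snapshot, after: Snapshot) -> List[str]:
    """List components that were added, removed or changed between snapshots."""
    changes = []
    for name in sorted(set(before) | set(after)):
        old, new = before.get(name), after.get(name)
        if old == new:
            continue  # unchanged component
        if old is None:
            changes.append(f"added {name} {new}")
        elif new is None:
            changes.append(f"removed {name} {old}")
        else:
            changes.append(f"changed {name} {old} -> {new}")
    return changes


# Made-up file names and versions, for illustration only.
baseline = {"example.dll": "4.0.1", "driver.sys": "1.2"}
current = {"example.dll": "4.1.0", "driver.sys": "1.2", "newtool.dll": "1.0"}

changes = diff_snapshots(baseline, current)
for change in changes:
    print(change)
```

A report like this gives an administrator a short list of suspects when a previously stable server starts misbehaving after a change.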
A Different Model
It used to be that the vendor who provided your hardware also provided your operating system and utilities. Windows NT forever changed that model, requiring vendors to develop a relationship with Microsoft to gain the level of operating system support needed to ensure viability of their uptime programs.
The offerings described here have staffs of certified Microsoft engineers who specialize in troubleshooting operating system problems -- circumventing a need to contact Microsoft for help. While some vendors, such as Unisys, don’t roll NT support into their formal uptime guarantee program, all vendors agree that high uptime expectations cannot be met without such support.
All five vendors interviewed claim to have a full staff of Microsoft Certified Systems Engineers (MCSEs) and "insider" status regarding Microsoft’s NT support database. Unisys, for one, also convinced Microsoft to sign a high-availability support agreement guaranteeing an agreed upon support level for Unisys engineers.
IBM’s Mobile Service Terminal (MoST) offering provides an online connection to IBM's Center for Microsoft Technology, located in Redmond, Wash. The 200 IBM engineers in the center can dial into the server that might be experiencing problems, and then provide a response within two hours. Compaq’s Frontline Partnership with Microsoft gives the vendor access to NT code and the ability to engage in joint development efforts with Microsoft to further develop NT as well as tools and utilities.
So you sign on the dotted line and instantly, you're ready to enjoy heart pumping, non-stop, trouble-free computing, right? Wrong. In fact you’re not even close.
So far you’ve taken care of only 20 percent of the availability problem -- at best. According to Donna Scott, an analyst at Gartner Group (www.gartner.com), 40 percent of availability issues result from operator error and another 40 percent are caused by application problems. This 80 percent is riddled with issues over which the vendor uptime programs have little control. These issues include backup policies, configuration control, change management and, in extreme cases, an operator playing Quake on the database server.
"The 20 percent portion of the uptime issue should be easy for vendors to uphold," Scott remarks. "Let’s face it, 99.99 percent uptime, particularly in a clustered environment, should be expected. Anything less than that should prompt customers to consider another systems vendor."
Scott says of far greater importance to uptime is learning about things like best practices from vendors. "[These are] things that go without question in a mainframe environment but are often pushed to the background in the NT world," Scott observes.
"Just as you can’t walk into your doctor’s office and expect him to work miracles after years of poor health practices, customers can’t expect uptime programs to work unless they commit themselves to good server health practices," adds Tom Iannotti, vice president of worldwide sales and marketing at Compaq Services.
While many data centers already employ strict change management, configuration control and operational procedures, this isn't true for all sites. Even some large data centers need to be re-educated as NT makes its way up the food chain.
"This is where a company like ours can really make a contribution," contends Ernst & Young’s Canfield. "Many customers are not aware of what high availability involves in the overall sense. Sometimes they need to be made aware that it’s not just redundant hardware. On the other hand, vendors also need to understand the customer’s true needs for system uptime. We can partner with both customer and vendor to ensure that a true high-availability solution gets implemented."
Like any other insurance policy, uptime programs come with a price tag. The costs of enhanced, redundant clusters along with a la carte uptime consulting services have to be evaluated carefully. Many administrators like to think their systems need 24x7 availability, but the reality is that not all systems do. A million-dollar-a-minute trading system probably qualifies for an uptime security blanket. Other important applications, such as e-mail, may not warrant the extra expense.
Another consideration is the liability vendors will incur for non-performance. Typical penalties amount to credits of support premiums for some period of months. However, if your business is worth thousands of dollars per second and your 99.x percent cluster has been down for a day, support credits probably won’t be adequate compensation for the financial harm you incur.
On the other hand, offering any other form of compensation for non-performance may only be achievable through a true partnership with a customer. Some vendors we talked to have, on a very limited basis, entered into shared risk-reward agreements with selected customers.
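To put the gap between support credits and business losses in numbers, here is a rough Python calculation. The loss rate of $1,000 per second is a hypothetical figure standing in for "thousands of dollars per second," and the function names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60
HOURS_PER_YEAR = 365 * 24


def outage_loss(loss_per_second: float, outage_seconds: float) -> float:
    """Business loss from an outage at a constant loss rate."""
    return loss_per_second * outage_seconds


def downtime_budget_hours(availability: float) -> float:
    """Hours of downtime per year that a given availability level allows."""
    return (1.0 - availability) * HOURS_PER_YEAR


# Hypothetical business losing $1,000 per second, down for one full day.
loss = outage_loss(1_000, SECONDS_PER_DAY)

# Yearly downtime allowed under a 99.9 percent commitment.
budget = downtime_budget_hours(0.999)

print(f"One-day outage loss: ${loss:,.0f}")
print(f"Downtime allowed at 99.9 percent: {budget:.2f} hours per year")
```

A single day of downtime uses up nearly three years of the downtime allowed at 99.9 percent, and at this loss rate the damage is far larger than any credit against support premiums is likely to be.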
While most of today’s uptime programs primarily involve hardware and operating systems, future programs will go as far as the application level. Data General has taken a step in this direction with its uptime program for Microsoft’s SQL Server. This program guarantees SQL Server’s performance from the operating system base up through the SQL Server code level. No guarantees, however, are made regarding the behavior of user developed code. Data General’s Flawn expects similar programs for applications such as Microsoft Exchange in the future.
Michael Liebow, director of strategy for IBM Netfinity, says plans are also falling into place at his company to offer application availability guarantees. With degrees of this program available now, and a guarantee up to 99.99 percent available in the first quarter of 2000, IBM’s High Availability Services will -- on a customized basis -- provide 99.99 percent guaranteed availability from end to end, including application, middleware and network components.
Still, Gartner Group’s Scott remains skeptical. "While admirable, I don’t see the end-to-end guarantees happening unless customers agree to bring on a vendor as an outsourcing partner. Otherwise, there would be so many caveats as to render the guarantee useless," she explains.
System Uptime and Windows 2000
The introduction of Windows 2000 Server will impact uptime programs. Many of the uptime features and practices employed by vendors today will be adapted to fit Windows 2000 Server. For example, proactive detection of memory problems and enhanced qualification and testing of device drivers will become part of W2K.
Shortening reboot and recovery time will also be a prime focus. Integrated clustering, support for redundant NICs and built-in UPS support will also improve system uptime. Microsoft Corp. claims that the new features employed in Windows 2000 can reduce downtime by more than 20 percent.
Should users simply wait for Windows 2000's enhanced features rather than attempt to provide uptime programs for systems running NT 4.0? "There’s going to be a huge adoption curve with Windows 2000 Server," notes Linh Stroud, manager at Ernst and Young’s Center for Technology Enablement (www.ey.com). "Although the advanced uptime features of W2K will be certainly welcome, we don’t think vendors should hold their breath until the bugs are eliminated."
Uptime Guarantee Factors to Consider
- Do your financial homework first. What levels of availability do your systems require? Do all systems need the same uptime level?
- Watch your language. An "uptime program" may not be the same thing as an "uptime guarantee." One does not necessarily imply the other.
- Know what components are covered under the program or guarantee. Some programs cover hardware only, some hardware and operating system, and at least one program includes SQL Server uptime.
- Purchase a complete system from a vendor with a good track record. Rely on your vendor to assemble, test and qualify your high-availability solution.
- What tools are offered to provide proactive event monitoring and alerting? Can repair service be called automatically?
- Are you willing to commit to using a set of best practices? Are you willing to share the development of such practices with your vendor?
- Is your vendor committed to expanding their guarantees in the future, perhaps including application level support?