Achieving Application Availability

The need for system availability extends across all components of a Windows NT network. But nowhere is the availability of a system more obvious to users or more indicative of the competence of IT managers than at the application level.

When applications perform below expected levels or fail altogether, the spotlight glares on NT managers. And as NT sites include more mission-critical applications in their portfolios, the heat rises -- especially for applications with unpredictable or high-volume traffic, such as Internet-based programs.

As a result, NT managers are looking more to hardware and software tools to enhance application availability. These tools, ranging from clustering and new storage approaches to management and load balancing products, can be combined to wring the best application performance from an NT system.

"There are a number of proprietary solutions available that will allow you to get close to 100 percent application availability," says John Parkinson, chief technologist for the consulting firm Ernst & Young. But NT managers must take the proper precautions and learn to view system availability as a discipline requiring time and dedication.

At Tyson Foods Inc. (www.tyson.com), an $8 billion producer of chicken products, user demand is driving the manager of systems administration to try a variety of application availability strategies. The company, which uses 160 NT servers primarily as file, printer and e-mail servers, runs 24x7 to support employees on 4,500 networked PCs.

"We use technology for every aspect of our business" and applications must be available, says Jim Bennett, manager of systems administration at Tyson. To ensure that they are, Tyson uses a performance management tool from Datametrics Systems Corp. (www.datametrics.com), RAID on its storage devices, redundancy on power supplies and clustering.

Hardware Solutions

Hardware solutions marketed for application availability focus on keeping the system up and running, which allows users to access applications. But many of these technologies -- including RAID, solid state storage, clustering and parallel serving -- have twisted traditional reliability features into new products that are appropriate for providing application availability.

For example, RAID technology is a traditional high-availability technology making its way down to the NT server level. Mylex Corp. (www.mylex.com), a RAID controller vendor, has experienced unexpectedly high sales of its eXtremeRAID 1100 PCI controller for midrange and high-end servers.

The product uses a 233 MHz Intel RISC processor, a 64-bit PCI bus and proprietary firmware to boost performance from the typical 2,500 I/Os per second to more than 8,000 I/Os per second. By decoupling the disk drive I/O management tasks from the NT operating system, the product helps eliminate bottlenecks between disk drives and the server, explains Eric Herzog, vice president of marketing at Mylex.

Another former mainframe storage technology to enter the NT market is solid state storage. Although still significantly more expensive than RAID, solid state storage can be cost-effective when used in systems where a small percentage of files comprise more than half the I/O transactions, explains Gene Bowles, president and CEO of Solid Data Systems Inc. (www.soliddata.com).

Typical storage devices process 300 to 400 I/Os per second. Solid state storage devices increase processing to 12,000 I/Os per second, Bowles says. Customers have placed I/O-intensive files -- such as transaction logs, temporary spaces and index files -- on solid state storage devices to reduce the occurrence of bottlenecks, making servers more resilient and applications more available, he adds.
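One rough way to check whether a workload fits the profile Bowles describes is to rank files by I/O count and see how few of them account for more than half the traffic. The Python sketch below assumes a hypothetical per-file I/O tally; the function name, file names and counts are invented for illustration and do not come from any vendor's tool.

```python
def hot_files(io_counts, share=0.5):
    """Return the busiest files whose combined I/O count is the smallest
    set exceeding `share` of all I/O (hypothetical helper)."""
    total = sum(io_counts.values())
    if total == 0:
        return []
    # Busiest files first.
    ranked = sorted(io_counts.items(), key=lambda kv: kv[1], reverse=True)
    picked, running = [], 0
    for name, count in ranked:
        if running > share * total:
            break  # the picked files already carry more than `share`
        picked.append(name)
        running += count
    return picked


# Invented sample tally: I/O operations observed per file.
sample = {
    "txn.log": 5000,
    "tempspace.dat": 3000,
    "orders.idx": 1500,
    "report.doc": 300,
    "memo.doc": 200,
}
candidates = hot_files(sample)
```

In this made-up tally, two of the five files carry 80 percent of the I/O, which is the kind of skew that would make them candidates for the faster device.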

Clustering in Parallel

With Microsoft Corp.’s current and planned releases of Cluster Server technology, clustering may be the highest profile hardware availability solution. In addition to the widely publicized choices among vendors’ clustering solutions, however, NT managers should also consider the interconnect between CPUs.

Some interconnects are designed to provide low latency so the interlock between failover components occurs quickly. "The quicker you can communicate the state between systems, the quicker you can recover, and the higher application availability will be," points out Dave Follett, founder and CEO of clustering vendor Giganet Inc. (www.giganet.com).

At Partners’ Healthcare, a Boston-based integrated health care provider, clustering and failover solutions are currently being tested. Partners’ Healthcare runs one of the largest client/server implementations in the world, supporting 30,000 clinicians, including 1,000 primary care physicians.

Physicians require applications to be available immediately, explains Steve Flammini, director of application development. At Partners’ Healthcare, the user login and application start-up time for client machines has been adjusted so doctors can begin using applications within eight seconds.

To protect applications, Partners’ Healthcare uses shadow machines to journal all transactions that occur on the InterSystems (www.licensetospeed.com) relational database. The shadow systems are backed up during overnight hours so the production machine never goes down.

For some managers, clustering -- where one server is dedicated to silently backing up another -- is akin to underutilization. By contrast, parallel server technology allows multiple servers to each handle portions of the processing capability, ensuring that all remain active and able to take over processing should one server fail. Failover times are often shorter during system failure, since all CPUs are already running.

The disadvantage of sophisticated parallel server environments is the complexity they bring. To simplify management, vendors such as Oracle have introduced single-view functionality that allows network managers to observe the parallel processing servers as a single-node environment, says Merrill Holt, director of product management for Oracle Corp.’s Parallel Server.

Other vendors, such as IBM Corp., offer more hands-on support to address problems of complexity. Customers who purchase a bounded Netfinity configuration, warranty and services receive IBM’s assurance review, remote monitoring of error logs and server conditions, and proactive monitoring and notification of impending problems. An IBM engineer is sent to the customer’s site when a failure occurs.

Currently, parallel serving is popular in large system environments, says Don Roy, product marketing manager for IBM’s Netfinity. But the market for parallel servers is growing in the NT world.

Software Solutions

The complexity of application availability technology is also a concern for software-based solutions. Management products from vendors such as Mission Critical Software Inc. (www.missioncritical.com), NetIQ Corp. (www.netiq.com) and Heroix Corp. (www.heroix.com) are designed to identify common availability problems, assign a resource to address them, and, in some cases, invoke automated procedures to correct them.

Many of these products ship with extensive built-in knowledge about the NT platform. "NT customers expect things to run out of the box," observes Kent Erickson, director of product management at Mission Critical Software.

For Apollo Group (www.apollogroup.com), a provider of higher education programs for working adults, simplicity and rich functionality were critical in the selection of a management tool. The company’s rapid growth caused its NT network to mushroom. Despite using 31 of 49 NT servers for Microsoft Exchange, the company had an unacceptable availability rate of slightly more than 80 percent.

"We needed to catch the trouble conditions ahead of time instead of waiting for problems to develop," explains Scott Kitchens, senior NT administrator at Apollo. After reviewing several products, the company settled on EcoTools from Compuware Corp. (www.compuware.com). "You didn’t have to be a programmer to set up all the rules," Kitchens says. "And when we had a problem, the vendor flew out an engineer, who solved it within a few hours."

Another type of software tool offers integrity control to NT implementations, in some cases guaranteeing availability of applications on the desktop. Most IT managers are frustrated with the large number of calls to the helpdesk, a volume that reflects users’ expectations for application availability, explains Jeff Mulligan, vice president of marketing for Swan International (www.vision64.com), a desktop systems management vendor. The integrity control tools provide electronic software distribution, graphical remote control, hardware and software inventory, automated backup and restore, and process control. By using integrity control products, many NT sites can reduce overall support costs while increasing both application availability and user productivity.

One application availability approach that is beginning to make its way into NT sites is the use of service level agreements (SLAs) and service level management (SLM) software. Ernst & Young’s Parkinson says that he does not see many sites using SLA/SLM, but the numbers are growing nonetheless.

"At the very least, people want to report on what’s going on against service level profiles," he says. This can be difficult using NT 4.0, but Windows 2000 should make it more straightforward to use snap-in SLM products.

Interestingly, cost seems to play only a minor role in product selection for application availability issues. Some industry watchers attribute this to the relatively low overall cost of NT systems, especially when compared with Unix.

But others say the cost of unavailable applications is the real motivator. "Price sensitivity is reflective of what a segment of downtime means" to buyers, IBM’s Roy explains. For example, in mission-critical applications such as ERP -- where a minute of downtime can cost a company $13,000 in lost business -- "if you can improve availability, ROI is immediate," he says.
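Roy's arithmetic can be sketched in a few lines of Python. The $13,000-per-minute figure is the one quoted above; the two availability levels compared (99.0 and 99.9 percent), the assumption of round-the-clock operation and the function names are illustrative choices, not figures from IBM.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming 24x7 operation


def annual_downtime_minutes(availability_pct):
    """Minutes per year a system is down at a given availability percentage."""
    return MINUTES_PER_YEAR * (100 - availability_pct) / 100


def annual_downtime_cost(availability_pct, cost_per_minute=13_000):
    """Yearly cost of downtime, assuming every down minute costs the same."""
    return annual_downtime_minutes(availability_pct) * cost_per_minute


# Illustrative comparison: what does moving from 99.0% to 99.9% save?
before = annual_downtime_cost(99.0)
after = annual_downtime_cost(99.9)
savings = before - after
```

Under these assumptions, 99.0 percent availability means about 5,256 minutes of downtime a year (roughly $68 million), while 99.9 percent cuts that to about 526 minutes (roughly $6.8 million). Real losses would depend on when outages occur and how much business is actually affected.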

Good Management Cures Many Ills

Although application availability tools are the salve of choice for many burned NT managers, there are proactive management techniques that can keep applications healthy and the flames of crises at bay.

"There’s been a steady progression of capability and realization that, with a little care, NT can provide very high levels of application availability," explains John Parkinson, chief technologist for consultant Ernst & Young. "But you must have people who understand the technology."

Although many companies buy preinstalled, preconfigured NT servers at commodity prices, Parkinson explains that these systems may not be configured to provide optimum application availability. Rather than relying on vendors, NT managers may find it more cost-effective to buy or develop configuration expertise.

Parkinson says the biggest problem in most enterprise NT sites is under-configured servers. Although the NT operating system comes with a resource kit that provides a capacity model to determine the machine and memory size needed to handle a given workload, most installers never "go through the basic arithmetic of figuring out how big the box needs to be," he explains.

"NT’s performance and reliability is critically dependent, in particular, on how well the SWAP file is managed," Parkinson adds. "If you don’t have enough memory or you have too many users, sooner or later you’ll have a SWAP file crash."

Even when resilience is maximized through an appropriate system configuration, NT managers can do more to ensure reliability. Many planners neglect environmental considerations. Simple changes such as power conditioning, temperature management and air filtration can go a long way toward creating an environment conducive to long-term system operation.

"If you take care of all of that, then the mean time between failure for hardware is in the tens of thousands of hours," Parkinson says. "As long as you keep that kind of health maintenance going on the servers, our experience is that NT is incredibly reliable. But that kind of reliability doesn’t happen by itself."
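To put Parkinson's mean-time-between-failure figure in availability terms, the standard steady-state relation is availability = MTBF / (MTBF + MTTR), where MTTR is the mean time to repair. The Python sketch below uses assumed inputs: 20,000 hours as one reading of "tens of thousands of hours" and a four-hour repair time that is purely illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760


def steady_state_availability(mtbf_hours, mttr_hours):
    """Fraction of time a system is up: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def expected_downtime_hours_per_year(mtbf_hours, mttr_hours):
    """Average hours of downtime per year implied by MTBF and MTTR."""
    unavailability = 1 - steady_state_availability(mtbf_hours, mttr_hours)
    return unavailability * HOURS_PER_YEAR


# Assumed inputs for illustration only.
availability = steady_state_availability(20_000, 4)
downtime = expected_downtime_hours_per_year(20_000, 4)
```

With these assumed numbers, hardware alone would account for under two hours of downtime a year. The figure covers hardware failures only; software faults, misconfiguration and planned maintenance would add to it.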
