Managing Service Level Agreements with Automated Tools
Test Track: A review by ENT and Client/Server Labs inspects four service-level-agreement packages for Windows NT
Cries of "Why is the network so slow?" and "When can I get my data?" have not gone unheard. In fact, these pleas have given rise to a number of management approaches for network resources. One of the more recent developments has been the introduction of service level agreements (SLAs) -- familiar in such areas as telephony and other utilities -- into computer networking environments.
In much the same way that a business telephone customer might require a phone provider to guarantee certain levels of up-time and quality of service, businesses that have critical computing needs are now requiring network and application providers -- both internal and external -- to assure specific levels of service.
ENT and Client/Server Labs examined four software packages designed wholly or partly to handle service level agreements. Several companies, such as 3Com Corp., Cisco Systems Inc. and Bay Networks Inc., provide service level agreements for hardware devices, but we chose to look at packages that monitored end-to-end application flow. All four packages are specifically designed for Windows NT, rather than serving as a piece within a larger framework.
Within these requirements, we ran VitalAnalysis and VitalHelp from INSoft Inc., EcoSystem from Compuware Corp., Empirical Suite from Empirical Software Inc. and Jyra from Jyra Research Inc. through the test track.
The SLA management tools available today are surprisingly usable and mature, even though the idea of SLA is fairly recent in data processing environments. Presumably because SLAs are a well-established concept in other areas, these tool vendors seem to have started with solid concepts in mind.
We were pleased that we encountered no conflict between any of the products, even when we installed competing products onto the same hardware platforms at the same time. Because it is unlikely that administrators would install more than one of these packages on the same network, we did not exhaustively test combinations of various packages.
All four vendors seem to agree that service level management is primarily an issue of how applications behave on a network, as well as how users perceive that behavior. This is a marked departure from traditional network management, in which the focus has long been on devices, architectures and where bottlenecks occur.
Service level management, on the other hand, focuses on monitoring the amount of traffic traveling through the network. Network traffic can be measured in several ways. One is to monitor the types of traffic and keep track of the time it takes that traffic to reach its destination and return. Another is to monitor the most frequently visited servers in the system. A third is to send out your own traffic to measure how well it travels the important pathways. And finally, you can watch the individual entry points -- the desktops -- where traffic enters the network on its way to the major resources.
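The first approach -- timing round trips for observed traffic and breaking the results down by application -- can be sketched in a few lines. This is an illustrative example only, not code from any of the reviewed products; the sample records and the `round_trip_by_app` function are invented for demonstration.

```python
from statistics import mean

# Hypothetical per-packet observations: (application, round_trip_ms)
samples = [
    ("smtp", 42.0), ("http", 120.5), ("smtp", 38.5),
    ("http", 95.0), ("dns", 12.0), ("http", 210.0),
]

def round_trip_by_app(records):
    """Group observed round-trip times by application and average them."""
    by_app = {}
    for app, rtt in records:
        by_app.setdefault(app, []).append(rtt)
    return {app: mean(rtts) for app, rtts in by_app.items()}

print(round_trip_by_app(samples))
```

A real monitor would capture these observations from a network interface rather than a hard-coded list, but the grouping and summarizing step is essentially the same.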
These approaches represent the different strategies of one or more of the four vendors in our comparison. Each has its advantages and disadvantages. Which one is right for a given environment depends on the concerns of the users to whom the service level is being promised.
Compuware Corp.
Farmington Hills, Mich.
Compuware Corp.'s EcoSystem package is a suite of three products: EcoScope, EcoTools and EcoSnap. Only EcoScope and EcoTools offer SLA management assistance. EcoSnap is a failure coverage and application recovery tool designed to notify administrators of application failure. Since it is built for application recovery, rather than service level enforcement, we did not review this facet of EcoSystem.
EcoScope monitors traffic passing along the various network segments. EcoScope software is loaded on as many Windows 95/98 or Windows NT workstations as necessary for a particular network topology, with at least one station on each logical segment of the network. Every EcoScope station runs one or more Super Monitors, each of which watches traffic on a given network interface. For WAN traffic, a separate adapter card is available to connect to appropriate routers.
Another station in the network runs a facet of EcoScope called Single View, which gathers and summarizes the data from the Super Monitor stations. At first glance, it appears that the Single View station is reporting traditional network management data, especially because it generates a fairly classic diagram of the network's logical layout.
A little observation, however, quickly reveals that, unlike conventional traffic monitors, EcoScope is paying close attention to the content of the traffic, rather than simply size and speed. EcoScope automatically recognizes a large number of common applications, such as Domino, Exchange, various databases and traditional network utilities. Administrators can define as many additional applications as desired, though defining an application is not necessary for EcoScope to see and capture data on all the network traffic. Also, EcoScope was the only product that recognized IPX/SPX traffic.
We configured EcoScope to send alarms based on defined thresholds. The alarms are only sent via Simple Network Management Protocol (SNMP) traps, which somewhat limits their immediate use to environments already doing SNMP-based monitoring.
The second major EcoSystem component, EcoTools, runs on a Windows NT server and monitors the behavior of services on other Windows NT servers running the EcoTools Agent software.
The key element in the strategy is the use of system performance counters already maintained by the Windows NT operating system. These counters were familiar from the standard Windows NT "perfmon" utility. They cover a variety of system activity, from CPU loads, to network traffic, to application-specific counters for such things as Web servers and e-mail servers.
We configured EcoTools to perform a set of tasks to monitor a collection of one or more counters, and then to take action when those counters pass a threshold level defined by the administrator. The actions include sending SNMP traps to other systems -- such as Tivoli TME or Unicenter -- sending an e-mail message, paging an administrator, or running an application program. The latter action may be used to do such things as attempt to restart a failed service. We found the configurations easy to understand, and we particularly appreciated the wizards for setting up tasks.
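The task model described above -- watch a set of counters, compare each against an administrator-defined threshold, and dispatch an action when one is exceeded -- can be sketched as follows. The counter names, thresholds and action names here are invented for illustration and are not Compuware's actual API.

```python
def evaluate_task(counters, thresholds, actions):
    """Fire the configured action for each counter exceeding its threshold.

    Counters with no specific action fall back to an SNMP trap, mirroring
    the default alarm mechanism described in the article.
    """
    fired = []
    for name, value in counters.items():
        limit = thresholds.get(name)
        if limit is not None and value > limit:
            fired.append(actions.get(name, "send_snmp_trap"))
    return fired

# Hypothetical sampled counter values and administrator-defined limits
counters = {"cpu_load_pct": 97, "disk_queue": 1, "smtp_queue": 250}
thresholds = {"cpu_load_pct": 90, "disk_queue": 4, "smtp_queue": 100}
actions = {"cpu_load_pct": "page_admin", "smtp_queue": "restart_service"}

print(evaluate_task(counters, thresholds, actions))
```

In EcoTools the equivalent configuration is built through wizards rather than code, but the underlying counter-threshold-action triple is the same idea.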
Overall, we found that the EcoSystem products were relatively easy to install and configure. The variety of built-in reports in both EcoScope and EcoTools was fairly impressive, and the tools for configuring user-defined reports were easy to grasp. Our only concern was EcoScope's reliance on SNMP traps as the basic alarm mechanism.
In sharp contrast to the other packages, VitalAnalysis and VitalHelp from INSoft -- which recently acquired the company VitalSigns Software -- take a completely user-oriented view of the service level issue. The operation of the system revolves around the idea that the PCs where users work should be the focal point for determining performance levels.
INSoft’s products were the simplest to install. We liked the fact that the documentation is well laid out and concisely written. We also appreciated that the products are centered on the idea of doing a small test installation before deployment. Excellent visual reports highlight useful information, but the product is marred by a lack of tools, such as paging and e-mail messages for problem notification.
The VitalAnalysis program runs on a central Windows NT server or workstation, while another system runs VitalHelp. Though both packages may coexist on a single system, INSoft recommends against this configuration for performance reasons. Once the servers are configured, a piece of client software called VitalAgent is loaded onto as many individual user machines as desired. The agent software conceivably could be deployed onto every desktop in an organization, onto a small number of critical stations, onto a representative sample of various network areas or onto any other combination of stations.
VitalAgent, which appears to be based on VitalSigns' earlier NetMedic product, monitors the network activity of the individual user at that workstation, breaking down traffic by types of applications, such as FTP, DNS, e-mail and database applications. That information is communicated on a scheduled basis to the VitalAnalysis server, which summarizes the data into a series of graphs and reports. The gathering and reporting schedules were easily configured to our tastes. All communication between VitalAgent and the VitalAnalysis server is through HTTP protocols.
We also defined the various thresholds at which performance levels become a matter of concern. An attractive outgrowth of this process is the product's ability to produce a "heat chart." A different color is assigned to each threshold, and then performance is graphed in a matrix that displays which applications or areas need attention. Another appreciated feature of VitalAnalysis is the five threshold levels, from excellent to critical. This gives more granularity of classification than in any of the competing products.
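The heat chart idea is easy to see in miniature: map each measured value to one of the five levels, then lay the levels out in a matrix of applications by groups. The thresholds, level labels and sample data below are invented for illustration; VitalAnalysis's actual levels run from excellent to critical, as described above.

```python
# Hypothetical response-time cutoffs (ms) for the five service levels
LEVELS = [
    (100, "excellent"), (250, "good"), (500, "fair"),
    (1000, "poor"), (float("inf"), "critical"),
]

def classify(response_ms):
    """Map a response time to one of the five service levels."""
    for limit, label in LEVELS:
        if response_ms <= limit:
            return label

# Matrix cells: (user group, application) -> measured response time
matrix = {("sales", "email"): 80, ("sales", "db"): 700,
          ("support", "email"): 1500}
heat = {cell: classify(ms) for cell, ms in matrix.items()}
print(heat)
```

Coloring each cell by its level label yields the at-a-glance view of which applications or areas need attention.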
The second major component, VitalHelp, though integrated with VitalAnalysis and VitalAgent, is more properly categorized as a helpdesk application than a service level agreement manager. It does, however, aid a service provider in resolving problems once they occur. It is important to bear in mind the user-centric focus of the Vital tools. VitalHelp can utilize information gathered by VitalAnalysis to help locate problems. It can also work with the VitalAgent software on a desktop to aid the helpdesk worker in troubleshooting a particular problem.
The combination of VitalAgent with VitalAnalysis provides an actively staffed and managed helpdesk or administrative group with a good tool set for measuring service level agreements.
Empirical Software Inc.
Empirical Suite has two unique features that distinguish it from its competitors: Empirical Planner and a concentration on database applications.
Empirical Planner is not a software product at all, but an exhaustively thorough guidebook into the process of creating and managing service level agreements. Rather than dealing with the issues of monitor stations and reporting tools, Empirical Planner walks the reader through the steps of deciding what services to agree on, how to decide the important metrics and how to structure the agreement.
Another major differentiation is Empirical's focus on monitoring and managing database applications. The actual application components, Empirical Director and Empirical Controller, concentrate almost exclusively on database applications and show a remarkable depth of penetration into Oracle and SQL Server systems. This dedication makes them an outstanding choice for managing mission-critical applications. It also makes Empirical virtually impenetrable to the general run of administrators and probably a poor choice for more broad-based sorts of SLA management.
Despite the focus on databases in the other management applications, Empirical Planner remains an impressive work, which deserves a place on the shelf of anyone who might implement an SLA program, regardless of whether the agreement is focused on a database. The book covers everything from the highest level of deciding who has responsibilities for which elements of a program, all the way down to the level of customer response cards for service call follow-ups. A set of templates is supplied on a CD for the various documents outlined in the book. SLA management in its broadest sense is about the managerial process, and Empirical Planner covers exactly that.
As far as software tools go, Empirical Director measures the service levels of one or more applications. We had to define each application to the system, though canned definitions are supplied for such things as Oracle Financials. The definition of an application depends upon setting up an appropriate ODBC source. Once that is accomplished, the number of parameters selectable for monitoring is mind-boggling. Approaching this tool from the perspective of a network administrator rather than a database developer provided our testers with a rapid education in the limits of their knowledge.
Once defined, applications are associated into Application Groups for reporting purposes. An application group might represent the applications critical to a department in the organization or co-located in a particular region. Empirical Director then monitors the defined applications on a scheduled basis and generates reports about the applications’ health.
Working with Empirical Director, Empirical Controller helps manage and maintain the health of a database application. It takes the data gathered over time by Empirical Director and uses it to develop trending information to either suggest or actually take corrective action. Primarily focused on Oracle installations, with a more limited knowledge base for SQL Server and Sybase, Empirical Controller is essentially an "expert" system, with some interesting tools for the database developer.
One tool allows the user to enter Structured Query Language (SQL) commands to be tested against a given application. Empirical Controller analyzes the commands in light of the data source used by the application and attempts to determine the most efficient way to process the command. As set by administrators, Empirical Controller recommends changes to the command it analyzed, or suggests changes to the way the database is structured in terms of disk allocations and indexing.
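A toy, rule-based version of this kind of SQL analysis might look like the sketch below. It is far simpler than Empirical Controller's cost-based analysis against a real data source; the two rules here are invented purely to show the flavor of the recommendations.

```python
def advise(sql):
    """Return naive tuning suggestions for a SQL statement (toy rules)."""
    tips = []
    s = sql.upper()
    if "SELECT *" in s:
        tips.append("select only needed columns")
    if s.startswith("SELECT") and " WHERE " not in s:
        tips.append("unrestricted scan: consider a WHERE clause or index")
    return tips

print(advise("SELECT * FROM orders"))
```

A production tool instead consults the database's own statistics and schema to estimate the cost of each access path, which is why Empirical Controller needs such deep knowledge of Oracle and SQL Server internals.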
It is this same sort of corrective analysis that gives Empirical Controller its value in an ongoing environment. The product can use the data gathered by Empirical Director to generate a prediction of where future problems are likely to occur, based on trends observed. It is a tremendous value for database administrators to know with some reliability that six months in the future a larger disk drive, more memory or a rearrangement of indices will be needed to meet foreseeable demand.
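The simplest form of such trend-based prediction is a straight-line fit over historical samples, extrapolated to the point where a resource runs out. The sketch below is a minimal illustration of the idea, not Empirical's actual model; the function name and the monthly disk-usage figures are invented.

```python
def months_until_full(usage_gb, capacity_gb):
    """Fit a least-squares line to monthly usage samples and extrapolate
    to the month when usage reaches capacity. Returns None if usage is
    flat or shrinking."""
    n = len(usage_gb)
    xs = range(n)
    mx, my = sum(xs) / n, sum(usage_gb) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, usage_gb)) / \
            sum((x - mx) ** 2 for x in xs)
    if slope <= 0:
        return None
    return (capacity_gb - usage_gb[-1]) / slope

# Four months of samples growing 2 GB/month against an 18 GB drive
print(months_until_full([4, 6, 8, 10], 18))
```

Real trending engines weigh seasonality and confidence intervals as well, but even this naive extrapolation shows why historical data collection is the prerequisite for prediction.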
The Empirical tools bring so much power to bear on the specific problems of database applications that it is almost misleading to present them as general SLA management tools. For environments in which it is important that a particular database application is available and the speed and reliability of the application is vital to the business, none of the other products we examined comes close to Empirical's level.
Jyra Research Inc.
San Jose, Calif.
Of the four packages examined, Jyra was the most intriguing and the most vexing. Jyra Research Inc. supplied us with a demonstration package that, though containing a complete implementation of the Jyra product, lacked any printed documentation. The CD documentation was minimal at best, and installing the balky software required significant experience.
Unlike the other products we tested, Jyra is completely focused on active monitoring. One or more systems are used to generate a small amount of artificial traffic on the network and to measure the response times of the results.
Two software components, the Service Level Manager (SLM) and the Mid Level Manager (MLM), are used in the system. Both portions are configured and controlled exclusively through Web browsers using Java applets. Because of the requirements for a high level of Java support, control functions are only available through the HotJava browser or through Internet Explorer 4.01 or later.
The SLM is used to execute tests defined by the administrator, while the MLM manages the process and gathers the data for reporting. In larger and more diversified networks, the systems may be arranged in a hierarchy, with SLMs reporting in groups to a number of MLM systems and the MLMs in turn reporting to a master MLM at the center of the structure.
All of the network elements to be examined must be defined by the administrator, including the various network points such as Web servers and routers. The definition process is straightforward and well thought out. Tests can be conducted with granularity down to the level of a minute, and data can be aggregated for analysis in a similar fashion. After only a few repetitions we could perform the process almost without thinking about it.
Reports, likewise, are easy to configure. Data selection aside, the system prepares only two reports: a bar chart and a line chart of the selected data for a selected time period. The bar chart is interesting in the way it presents minimum, maximum and average data very clearly. The uniformity makes the reading and interpretation of results easy.
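Rolling minute-level test results up into the min/max/average figures that Jyra's bar chart presents is a straightforward aggregation. The sketch below is illustrative only; the sample response times are invented.

```python
def aggregate(samples_ms):
    """Collapse per-minute response-time samples into the summary
    statistics shown on a min/max/average bar chart."""
    return {
        "min": min(samples_ms),
        "max": max(samples_ms),
        "avg": sum(samples_ms) / len(samples_ms),
    }

# One reporting interval's worth of per-minute test results
minute_samples = [110, 95, 240, 130]
print(aggregate(minute_samples))
```

Aggregating at each level of the SLM/MLM hierarchy in this fashion is also what lets the central systems hold only summary data while detail stays on the individual SLMs.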
One difference between Jyra and the others tested is that Jyra does not transport the collected data to the central systems; it merely displays the data at that level. The detail data repository and the corresponding reports remain on the individual SLM systems. Only aggregated data is stored on the MLM systems. The benefit is distributing the storage load for analytic data, but it also introduces the risk that portions of the detail data can be temporarily unavailable or lost because of outages at lower-level systems.
Jyra's focus on active testing may be a handicap for organizations that want to monitor real-world network activity with clearly identified resources. On the other hand, it may be ideal for broad-based service providers, such as ISPs or managers of more generic corporate networks, who are less concerned with specific users than with overall responses for widely used resources.
The Jyra tool’s active testing, however, will detect at least one class of problem that no form of passive monitoring will see: resources that become unavailable while not in use. Active testing may provide the only alert to an administrator that the POP3 e-mail server has gone down late Sunday night, before some senior suit discovers Monday morning that e-mail does not work.
Through the Test Track
To exercise the various products, we set up a small network consisting of two IBM NetFinity 5500 systems with single 400-MHz Pentium Pro processors running Windows NT 4.0 Server, a Dell 2200 running NetWare 5.0 server, and an HP Kayak workstation running Windows NT 4.0 for management purposes. The Windows NT servers hosted Oracle and SQL Server databases, as well as a Lotus Domino server.
The SLA management pieces were installed on the Windows NT servers or the Windows NT management workstation as appropriate. We also generated system activity by running miscellaneous applications against the servers from an array of 48 Dell Optiplex Pentium systems running NT 4.0 Workstation.