Why Traditional SLAs Are No Longer Adequate

SLAs have traditionally relied on operating system utilities to determine process availability -- but their accuracy isn't assured. Measuring the availability of business processes may hold the key to accurate assessment of application performance.

There are plenty of tools that monitor the basic heartbeat of the systems underlying your enterprise applications. These monitoring tools often rely on operating system utilities to determine server availability. While they are useful, they can be out of sync with the actual service levels the application is delivering. If you’ve ever fielded calls from end users saying the app is not responding while you’re seeing all green lights on your monitoring system, you know what we’re talking about.

What’s Wrong with Traditional SLAs?

IT organizations in Fortune 1000 companies are under increased pressure to deliver on stringent application service levels with limited staff and resources. Moreover, enterprise applications now have complex, multi-tiered architectures that operate on heterogeneous operating systems and have an interdependent set of firewalls, servers, processes, Web and J2EE application servers, databases, and many other subcomponents.

Companies have traditionally adopted a variety of approaches to application service measurement. Some use a commercial system-monitoring solution, some build their own custom scripts, and some combine the two. Typically these approaches use operating system utilities such as ping, vmstat, df, top, and perfmon to determine process availability. If the network, operating system, and running processes that constitute an application appear normal, these approaches commonly assume that the application is performing normally.
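To make the limitation concrete, here is a minimal sketch of the kind of check these utilities amount to: scanning the process table, much as a custom script wrapping ps or top would. The function name is an assumption for illustration. Note what a "green" result actually proves, and what it doesn't.

```python
import subprocess

def process_running(name: str) -> bool:
    """Traditional-style check: does the process appear in the ps table?

    A True result only means a process with this name exists. It may
    still be deadlocked, out of database connections, or unable to
    complete a single business transaction -- this check cannot tell.
    """
    out = subprocess.run(
        ["ps", "-e", "-o", "comm="],
        capture_output=True, text=True, check=True,
    ).stdout
    return any(line.strip() == name for line in out.splitlines())
```

A monitoring script built on checks like this reports availability at the operating system level only, which is precisely the gap the rest of this article addresses.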

The trouble with these indirect approaches is that they assume that if all the components appear operational, the application is performing correctly -- properly processing orders or updating payroll, for example. What users and application stakeholders really care about is the system’s ability to complete a business process within a predictable amount of time. The real bottom line for application availability is whether the application can complete a business process.

Until recently, indirect low-level uptime/heartbeat monitoring was the only type of measure readily available for service-level agreements. But as applications have become larger, more complex, and more distributed, this approach has become problematic. There are now cases in which an application technically fulfills its low-level uptime measurements but is not completing its business process satisfactorily. For example, communication between tiers may have been disrupted, there may be a network problem, or the release of database connections may have failed.

Evolving to Business-Process-Based Service Levels

Today, reliable technology exists to record entire business processes and replay them at desired intervals to simulate business processes performed by an end user. These recorded business process scripts can be deployed to different locations and run continuously to monitor performance and determine availability and response time of the business process. These can form the basis of a service level measurement based on business-process success.
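The replay idea can be sketched in a few lines. The shape below is a simplification and an assumption on my part -- commercial tools record browser-level scripts rather than Python callables -- but the essential contract is the same: each step of the business process is timed individually, and the process as a whole either completes or it does not.

```python
import time
from typing import Callable, Dict, List, Tuple

# A recorded step: a human-readable name plus an action returning success.
Step = Tuple[str, Callable[[], bool]]

def replay_business_process(
    steps: List[Step], think_time: float = 0.0
) -> Tuple[bool, float, Dict[str, float]]:
    """Replay a recorded business process as a sequence of timed steps.

    Returns (completed, total_seconds, seconds_per_step). An optional
    think_time pause between steps simulates end-user "think time".
    """
    per_step: Dict[str, float] = {}
    start = time.perf_counter()
    for name, action in steps:
        step_start = time.perf_counter()
        ok = action()
        per_step[name] = time.perf_counter() - step_start
        if not ok:
            # The business process failed at this step, even if every
            # server underneath still looks healthy.
            return False, time.perf_counter() - start, per_step
        time.sleep(think_time)
    return True, time.perf_counter() - start, per_step
```

Deployed to "robot" machines at different locations and run continuously, a loop like this yields availability and response-time data for the business process itself.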

Business-process-based service levels are measured by the availability of business processes, rather than traditional device or server uptime service level measures. Business process performance monitoring should be Web-based and perform the following:

  • Show total response time for a complete business process, with a breakdown of the discrete user steps (e.g., logon, search for product, purchase product)

  • Estimate the expected response time for critical or fatal performance levels

  • Replay the business process with and without delays between user actions (simulating user “think time”)

  • Store recorded scripts centrally and deploy them to external “robot” machines for automatic replay

  • Correlate business-process performance to each of the servers and components making up the n-tier application

  • Provide a central database for application service reports
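The first two requirements above -- a total with a per-step breakdown, graded against expected response-time levels -- can be sketched as follows. The threshold names and the returned fields are illustrative assumptions, not any product's API.

```python
from typing import Dict

def evaluate_process(
    step_times: Dict[str, float], warn_at: float, fatal_at: float
) -> Dict[str, str]:
    """Grade a business process run against service-level thresholds.

    step_times maps each discrete user step (logon, search, purchase)
    to its measured response time in seconds; the warning and fatal
    thresholds apply to the total. Illustrative only.
    """
    total = sum(step_times.values())
    if total >= fatal_at:
        status = "fatal"
    elif total >= warn_at:
        status = "warning"
    else:
        status = "normal"
    # The per-step breakdown tells the operator where to look first.
    slowest = max(step_times, key=step_times.get)
    return {"status": status, "total": f"{total:.2f}s", "slowest_step": slowest}
```

A report built from results like these reads at the business level ("product purchase took 4.1s, warning") rather than the device level ("CPU at 85%").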

Once we understand the concept of business-process-based service levels, the next step is to explicitly link business process performance to the specific IT tiers and components that make up the application infrastructure, specifically the performance information and events affecting Web, application, and database servers.

The application technology stack is the starting point for mapping business processes to application infrastructure. All applications can be broken down into component layers:

  • Business Process: A synthetic transaction that simulates a logical end-user business transaction. It is the "acid test" because if it is performed successfully, then all underlying parts must be performing normally.

  • Application Integrations: Third-party component integrations such as ERP to ERP, ERP to CRM, ERP to data warehouse, Web solutions, and other third-party applications.

  • Custom Applications: Application managers, application logs, performance flags, and metrics.

  • Application Servers: Web servers, J2EE application servers, and database servers, including SQL metrics and diagnostic tools.

  • Operating Systems: Servers, operating systems, and resources (CPU, memory, disk). System log monitoring captures critical and fatal errors such as a bad disk, corrupted memory blocks, and SCSI bus failure.

  • Network and Services: Server availability and communication pathways (local and wide area) for all of the above layers, including "networked services" offered over the network (such as LDAP, NFS, SMTP, NSLOOKUP), database connections, and other ports/listeners with request/response interaction.
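At the Network and Services layer, the table distinguishes bare reachability from request/response interaction. A sketch of the latter, under the assumption that the service announces itself with a banner (as SMTP does with its "220" greeting):

```python
import socket

def probe_service(
    host: str, port: int, expect: bytes = b"", timeout: float = 2.0
) -> bool:
    """Check a networked service at the request/response level.

    With expect empty this degrades to a bare TCP connect check; with
    an expected banner prefix (e.g. b"220" for SMTP) it also verifies
    that the service on the other end is actually talking its protocol.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            if expect:
                return conn.recv(128).startswith(expect)
            return True
    except OSError:
        return False
```

Checks like this belong at the bottom of the stack; they are necessary for diagnosis but, as argued above, not sufficient as a service-level measure on their own.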

The key point is that a business process relies on every layer of the application’s technology stack and must be correlated to these layers for a complete understanding of performance. The business process layer is the most accurate reflection of the availability and performance of the entire application because it naturally exercises every layer of the stack.

Mapping Business Processes to Application Infrastructure

A Web-based business-process-performance monitoring solution can correlate real or simulated transactions to show performance at a “business” level. Let’s look at an example of how this could work. Consider three fictional business processes for an e-PetStore: product lookup for dogs, for birds, and for fish. At the highest level, we really just want to see that these business processes are performing normally. The following illustrates how we could visualize this level of monitoring, using Quest Software’s Foglight application monitoring solution (full disclosure: I am a product manager for Quest Software).

[Figure: Quest Software’s Foglight application monitoring dashboard]

We can see a yellow status indicator under the “Application” label on the monitoring screen. This indicates that one of the underlying application tiers is experiencing a warning condition suggesting a potential problem—perhaps high CPU, low cache hit, or database deadlock—even though the recorded transactions are currently running within the expected timeframe (and so are green).

When we see a problem with a business process, we need to be able to drill down immediately to the application tiers or business process details to narrow down the offending tier or resource. It’s also important to be able to drill down to the individual component level, to detailed Web, application, or database server metrics, to fully isolate the cause of performance or availability problems, as well as produce historical reports similar to traditional service-level-availability reports.

Providing a True Picture of “Whole Application” Availability and Performance

IT organizations should consider using business-process-based performance as a primary measurement when establishing application service levels for complex distributed business applications. With business-process-based service levels, IT can provide the business with a picture of the true availability of their critical applications, something far more valuable to enterprises today. Business processes should be explicitly linked to their specific IT resources, including Web, application, and database servers, promoting more efficient use of existing resources and better coordination among IT teams tasked with assuring maximum availability.