Edge-To-Edge Service Management: A Modern-Day Gordian Knot
About 13 years ago, Viguers’ paper "IMS/VS Version 2 Release 2 Fast Path Benchmark" introduced the ONEKAY benchmark. Its purpose was to achieve the ONEKAY rate (one thousand transactions per second) while providing extremely high service levels (fast response times) for the average end-to-end residency time of a transaction. The benchmark’s transactions were constructed to simulate revenue-generating transactions, such as debit and credit updates. At the time, one thousand transactions per second was considered theoretical in nature, with no apparent means of actually achieving the rate on commercial systems.
How times have changed. Today, a very popular online auction house initiates, on average, seven new auctions each second, with 2.1 million unique users visiting the site each day (an average rate of about 25 unique visitors a second). End-to-end responsiveness is critical for this auction house: users can simply click over to another auction site if service levels degrade and the browser response is perceived as too long.
Today, high transaction rates are achieved in a totally unforeseen way. At the time, the ONEKAY rate was achieved on monolithic systems. Today’s systems scale out through replication of resources. This introduces a new complexity regarding the transaction’s sojourn in the "system". Indeed, today, many individual components make up the "system" involved in processing a single transaction.
Originally, the end user connected to a monolithic system through "front-end" communication processors. Today, many systems are involved. Some route and switch information from one system to another. Some purely handle Web services. Still others process the application or business logic. And at the "back end" sits the original "front-end" processor, knitting the legacy system into the overall complex. It is interesting to note that most Web-based features and functionality are achieved through replication of network devices, Web servers and application servers. In many cases, the original monolithic system is still involved, but as the "back end" of the processing cycle. End-user perception of service quality is formed primarily by the responsiveness of the Web page that is delivered, which in turn is at the mercy of every one of those individual components.
So, the question arises – can transactions be measured and managed from beginning to end (edge-to-edge management) in light of today’s complexities?
Holy Grail or Gordian Knot?
Let’s now close our eyes and imagine the "holy grail" of edge-to-edge service level management for a transaction. It would include measuring the time the transaction arrives at and departs from the system, keeping track of CPU, disk I/O, memory and network consumption throughout the life of the transaction, and then providing a method to accurately report these statistics. If all of this were available, an IT manager would be able to understand that introducing a new feature into the mix of available transactions would have a specific impact on, let’s say, an over-used disk drive.
Now, let’s open our eyes. Even for the ONEKAY benchmark, transaction-level performance information was not directly measured, because the CPU cycles consumed by the measurement would exceed the resources required to process the revenue-generating transactions themselves. Now consider that today’s transactions meander from one system to another, using network services as a transport to and from various processing points in the enterprise. A single auction bid could easily involve hundreds of small CPU and disk requests (on multiple systems), along with traversals through networking devices that transport the transaction request from one system to another. The processing necessary to handle this transaction bookkeeping would be a greater challenge than the implementation of the revenue-generating transactions themselves. Direct measurements of transaction activity, therefore, are not merely neglected in today’s environment; they are simply not practical.
According to an old Greek legend, Gordius joined a cart to its yoke with an impossibly complicated knot, the "Gordian Knot". It was foretold that whoever could separate the cart from the yoke would rule the known world. Alexander the Great sliced through the knot with his sword and so separated the cart from the yoke: a simple solution applied to an impossibly complicated problem. A more modern example comes from quantum physics at the turn of the twentieth century. Charting the orbits of individual electrons turned out to be considerably harder than the analogous astronomy exercise, so a "statistical patch" was developed to explain atomic behavior without having to measure the electrons directly.
In the same way, statistical inference can be used to gain insights into the cause and effect of transaction service levels, insights that could never have been obtained through direct measurement. Correlation is at the heart of this statistical technique.
Cause, Effect and Correlation
The cause and effect model for end-to-end analysis of transactions can be visualized as an "Application Sandwich" with three layers. At the bottom of the sandwich, transactions arrive requesting service; these transactions ultimately generate revenue for the enterprise. The "filling" of the sandwich consists of the various resources, including CPUs, disk drives, memory and network services. The arrival of transactions causes activity, and thus the resources and services are used. At the top of the sandwich is the response time, or service level, for the transaction. The service level is affected by the amount of resource activity: if a resource is over-utilized, response times will increase non-linearly as the transaction waits in various queues for service.
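To make that non-linear growth concrete, here is a minimal sketch in Python of the textbook single-queue (M/M/1) approximation, R = S / (1 - utilization). The formula and the 10 ms service time are illustrative assumptions introduced here, not measurements from the article; the point is only that residence time explodes as a resource approaches saturation.

```python
# Illustration only (not from the article): how residence time at a single
# queued resource grows non-linearly with utilization, using the textbook
# M/M/1 approximation  R = S / (1 - rho),
# where S is the service time per visit and rho is the resource utilization.

def mm1_residence_time(service_time_s: float, utilization: float) -> float:
    """Expected residence time (service + queueing) at one M/M/1 resource."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return service_time_s / (1.0 - utilization)

if __name__ == "__main__":
    service_time = 0.010  # assume a 10 ms service time per visit (hypothetical)
    for rho in (0.50, 0.80, 0.90, 0.95, 0.99):
        delay_ms = mm1_residence_time(service_time, rho) * 1000
        print(f"utilization {rho:4.0%} -> residence time {delay_ms:7.1f} ms")
```

Going from 50% to 99% utilization multiplies the delay by a factor of fifty, which is why a single over-used resource can dominate an otherwise healthy end-to-end path.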
In summary, transaction arrivals cause resources to be used, and high resource utilization causes transaction end-to-end delays to increase. Understanding and managing this cause-effect relationship is necessary to manage service levels. The difficulty becomes apparent when the "middle" of the sandwich is examined: it encompasses a large array of measurements taken over thousands of resources. Given the sheer number of resources, the number of samples per resource and the time constraints for finding meaning in the data, manual inspection of graphs or statistical tables is not feasible. The raw numbers must therefore be condensed into a smaller number of significant values; indeed, this is the motivation for statistics in general. More specifically, correlation provides a statistical method for quickly identifying the resources that are critical to the cause-effect relationship running from arrivals to service levels. Implemented effectively, it can quickly identify which resources are over-used at the exact times that service levels become unacceptable. Similarly, it can identify key resource activity as a function of the arrival rates of transactions into the system. All of this is accomplished without directly measuring resource activity on a transaction-by-transaction basis.
Technically, the correlation coefficient is a measure of how two different stochastic processes (referred to here as variables) simultaneously vary from their mean values when measured on the same time scale. By accumulating statistics on a fine enough time scale (such as 30-second intervals), any two variables can be compared along the time line and a "correlation coefficient" can be estimated. Perfect positive correlation has the value of one; perfect negative correlation has the value of negative one.
An example of negative correlation would be disk activity correlated with CPU activity for a simple workload that uses either disk or CPU: when one is busy, the other is idle. Statistics that are not related in any way, in the sense of being statistically independent, will show a correlation coefficient near zero; strictly speaking, the converse (inferring independence from a near-zero coefficient) holds only for normally distributed variables, since correlation captures only linear relationships.
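As a concrete sketch (and not any vendor’s implementation), the coefficient for two series of 30-second samples can be estimated directly from its definition, r = cov(x, y) / (stddev(x) · stddev(y)). The workload numbers below are invented purely to show one strongly positive and one strongly negative pair.

```python
# Minimal sketch (not any vendor's implementation): estimating the Pearson
# correlation coefficient for two series of 30-second samples, i.e.
#   r = cov(x, y) / (stddev(x) * stddev(y)).
import math
import random

def pearson(x, y):
    """Sample correlation coefficient of two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

if __name__ == "__main__":
    random.seed(1)
    # 960 samples = eight hours of 30-second intervals (invented numbers).
    arrivals  = [100 + 50 * random.random() for _ in range(960)]        # tx/sec
    cpu_busy  = [0.4 * a + random.gauss(0, 2) for a in arrivals]        # driven by arrivals
    disk_busy = [80 - c + random.gauss(0, 2) for c in cpu_busy]         # busy when CPU is idle
    print("arrivals vs CPU :", round(pearson(arrivals, cpu_busy), 2))   # strongly positive
    print("CPU vs disk     :", round(pearson(cpu_busy, disk_busy), 2))  # strongly negative
```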
Putting It All Together
Once two highly correlated sets of statistics are identified, the cause/effect relationship is obvious. But an equally important piece of the emerging correlation technology is the efficient measurement and management of the variables themselves. When looking for correlation between resource activity and transaction arrival rates, tens of thousands of different variables may need to be considered. A single variable, such as response time, is chosen as the reference variable. That variable is then compared to each of the other variables that have been collected, and a correlation coefficient is calculated for each pair. Once the coefficients are calculated, the variables with values close to one or negative one are considered key to the cause/effect relationship.
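A sketch of that ranking step might look like the following. The NumPy dependency, the variable names and the single planted "disk07_busy_pct" counter are all hypothetical, chosen only to show one reference variable being scored against thousands of others and sorted by the absolute value of the coefficient.

```python
# Sketch of the ranking step (assumed workflow, not a product API): score
# thousands of resource counters against one reference variable and keep
# the pairs whose coefficients are closest to +1 or -1.
import numpy as np

def rank_by_correlation(reference, resources, top_n=10):
    """Return (name, r) pairs ordered by |correlation| with the reference series."""
    scored = []
    for name, series in resources.items():
        r = np.corrcoef(reference, series)[0, 1]
        if not np.isnan(r):
            scored.append((name, float(r)))
    return sorted(scored, key=lambda kv: abs(kv[1]), reverse=True)[:top_n]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    samples = 960                                  # eight hours of 30-second samples
    response_time = rng.random(samples)
    # 10,000 hypothetical resource counters, one of which drives response time.
    resources = {f"resource_{i:05d}": rng.random(samples) for i in range(10_000)}
    resources["disk07_busy_pct"] = 0.9 * response_time + 0.1 * rng.random(samples)
    for name, r in rank_by_correlation(response_time, resources, top_n=3):
        print(f"{name:20s} r = {r:+.2f}")
```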
To put numbers on this: if response time is to be compared to 10,000 different resource variables, 10,000 correlation coefficients must be calculated. Add to this the fact that the 30-second measurements may span an eight-hour period (or longer), and that estimating each correlation coefficient requires a pairwise comparison at every data point. Given these considerations, correlation technology is achieved not just by applying statistics, but also by establishing a high-performance method for the measurement, storage and retrieval of the individual variables.
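The back-of-the-envelope arithmetic below, using the article’s own example figures, shows the scale involved: a single correlation pass over one eight-hour window already touches nearly ten million data points.

```python
# Back-of-the-envelope scale of one correlation pass, using the article's
# own example figures (30-second samples, an eight-hour window, 10,000
# resource variables).
interval_seconds = 30
window_hours = 8
resource_variables = 10_000

samples_per_variable = window_hours * 3600 // interval_seconds    # 960 samples
data_points_scanned = samples_per_variable * resource_variables   # 9,600,000

print(f"{samples_per_variable} samples per variable over {window_hours} hours")
print(f"{data_points_scanned:,} data points scanned to produce "
      f"{resource_variables:,} correlation coefficients")
```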
Lightweight agents must collect and store statistics at fine granularity (30-second intervals) without impacting the measured system’s resources. Once the statistics are collected, accessing millions of them in seconds requires a data model that is architected "from the ground up" with time-based statistics retrieval as a fundamental requirement. "Off the shelf" relational databases are simply not adequate for the task. What e-business demands is a well-implemented, proven technology that can produce near-realtime correlation results for millions of data points in seconds on a standard PC workstation.
Managing Service Levels Using Statistical Correlation
When service levels suddenly degrade, the discovery of the underlying cause must be automatic and performed in realtime. A system is never static; new service issues are always arising. There is no reasonable "hit list" of cause/effect relationships for a given enterprise, because such a list can become obsolete within hours given changes in hardware, features and customer workload habits. And direct measurement of resource activity as a function of transaction activity is not possible, for the reasons described above.
In light of this "Gordian knot", a simple statistical methodology is emerging to manage edge-to-edge service levels by quickly identifying the key resources involved in critical service flows. It can be applied to packet rates within routers as well as to CPU or disk activity. An enterprise can monitor Web page response time in realtime and, using these measurements, find the correlated resource activities causing the delays, even across multiple shared resources. Beneath that, the key resource activity that is driven by the underlying transaction arrivals can also be identified.
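A hedged sketch of that realtime loop is shown below. The two-second SLA target, the one-hour window and the function name are assumptions made for illustration; the structure is simply "watch the service level, and when it breaches the target, rank the most recent resource samples by correlation against it".

```python
# Illustrative realtime loop (the SLA target, window length and names are
# hypothetical): when the recent response-time average breaches the target,
# rank the most recent resource samples by correlation against it.
import numpy as np

SLA_SECONDS = 2.0      # assumed page-response target
WINDOW = 120           # one hour of 30-second samples

def diagnose_if_degraded(response_time, resources, top_n=5):
    """Return the most suspicious resources if the recent average breaches the SLA."""
    recent = response_time[-WINDOW:]
    if recent.mean() <= SLA_SECONDS:
        return []                                 # service levels are acceptable
    suspects = []
    for name, series in resources.items():
        r = np.corrcoef(recent, series[-WINDOW:])[0, 1]
        if not np.isnan(r):
            suspects.append((name, float(r)))
    suspects.sort(key=lambda kv: abs(kv[1]), reverse=True)
    return suspects[:top_n]
```

The same ranking can then be repeated one level down, with the top-ranked resource’s activity as the reference and the transaction arrival counts as the candidates, to expose which arrivals are driving that resource.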
In this way, correlation is emerging as a key technology for managing end-to-end delays for the revenue-generating transactions of the enterprise. Once the key issues are identified, the same time-based statistics can be used both to look for historical patterns and to identify trends that indicate future growth in activity.
Michael A. Salsburg is the Vice President of Software Engineering for FORTEL Inc.