Planning Ahead: Capacity Planning for Effective Decision Support

Unlike detailed configuration planning for a specific network design, mainframe component or client/server architecture, capacity planning takes a broader view that focuses on decision support information for managing service, costs and risk. Like other system management disciplines, capacity planning uses principles and methodologies that are broadly applicable to many types of resource planning needs. It includes most of the data and mathematical tools used in performance management, but it also requires sophisticated forecasting and predictive modeling capabilities. The objective of capacity planning can be summed up as "not too much, not too little, not too soon and not too late." This kind of multiple-constraint problem is inherently mathematical in nature, as is the behavior of computing systems and networks.

The evolution of quantitative techniques for capacity planning has created a foundation for managing a new notion of work: the fulfillment of commitments made by some people in response to requests made by others. Commitments are not items of information, and machines cannot make them. Business process re-engineering (BPR) means examining the recurrent patterns of requests and promises in the network of conversations of an enterprise with the objective of improving quality outcomes, cycle times and throughputs of human processes supported by information processes. As businesses mature their processes to manage sprawling networks of machines and people effectively, many of the quantitative methods they use will be direct descendants of the methods now used in the capacity planning field.

Forecasting and Predicting

Many of the mathematical tools and techniques used in capacity planning are of value to managers, performance analysts, systems analysts, programmers and system administrators, and can be used to answer such questions as the following:

Managers -- When will we need more CPU power? How fast are we using up disk space? Capacity Planning tool: Time series forecasting, linear regression.

Performance Analysts -- How many batch initiators do we need to get the maximum throughput? Capacity Planning tool: Nonlinear optimization techniques.

Systems Analysts -- How much capacity do we need to process X claims per day? Capacity Planning tool: Causal reconstruction using business element modeling.

Programmer -- If my program uses two seconds for processing 100 claims, how long will it take for 1,000? Capacity Planning tool: Analytic queuing modeling techniques.

UNIX System Administrator -- Can we improve Sybase query response time by reducing I/O wait? Capacity Planning tool: Predictive queuing modeling of I/O configuration scenarios.

Network Analyst -- If we add 400 more users to LAN X, will packet collisions be too high? Capacity Planning tool: Network queuing modeling.
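To make the managers' question ("How fast are we using up disk space?") concrete, here is a minimal Python sketch of linear-regression trending. The weekly usage figures and the 1,000 GB limit are hypothetical, and the function names are illustrative rather than taken from any capacity planning product.

```python
# Sketch: straight-line trend of disk usage, projected to a capacity limit.
# All sample data below is hypothetical.


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b


def periods_until_full(a, b, capacity):
    """Solve a + b*x = capacity for x; only meaningful while usage is growing."""
    if b <= 0:
        raise ValueError("usage is not growing; the trend never reaches capacity")
    return (capacity - a) / b


weeks = [0, 1, 2, 3, 4, 5]
used_gb = [400, 412, 421, 435, 444, 458]  # hypothetical weekly samples
a, b = fit_line(weeks, used_gb)
print(f"growth ~{b:.1f} GB/week; 1,000 GB reached near week "
      f"{periods_until_full(a, b, 1000):.0f}")
```

A straight line is often adequate for steadily accumulating quantities such as disk space, but not for response time, which grows nonlinearly as utilization rises.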

It is clear that these examples are based on very different requirements, but they have one feature in common: They represent a need to predict some aspect of the future. Forecasting is an essential dimension of capacity planning. Not only is forecasting essential to capacity planning, it is central to the growth and success of any business.

The other dimension of predicting some aspect of the future (or a possible future) is predictive modeling. In the case of computers and networks, a causal reconstruction of system dynamics can be achieved effectively using queuing theory and queuing network models (QNM). Queuing network models are used to predict, at a detailed level, how a specific hardware configuration will behave under a defined load.
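As a sketch of how a queuing network model turns a configuration and a load into predictions, the following Python implements exact Mean Value Analysis (MVA), one standard solution technique for single-class closed networks. The service demands and think time in the usage lines are hypothetical; a real model would be calibrated against measured data.

```python
# Sketch: exact Mean Value Analysis (MVA) for a single-class closed queuing
# network with N interactive users and think time Z. Demands are hypothetical.


def mva(demands, n_users, think_time=0.0):
    """Return (throughput, response_time, queue_lengths) for n_users >= 1.

    demands[k] is the total service demand (seconds per transaction) at
    queuing center k, i.e. visits multiplied by service time per visit.
    """
    queue = [0.0] * len(demands)  # mean queue lengths with 0 users
    throughput = response = 0.0
    for n in range(1, n_users + 1):
        # An arriving transaction sees the queue left by n - 1 users.
        residence = [d * (1.0 + q) for d, q in zip(demands, queue)]
        response = sum(residence)
        throughput = n / (response + think_time)     # response time law
        queue = [throughput * r for r in residence]  # Little's law per center
    return throughput, response, queue


# Hypothetical configuration: CPU 0.05 s, two disks 0.08 s and 0.03 s, 5 s think time.
demands = [0.05, 0.08, 0.03]
for users in (1, 25, 50, 100):
    x, r, _ = mva(demands, users, think_time=5.0)
    print(f"{users:3d} users: {x:6.2f} tx/s, response {r:6.3f} s")
```

As users are added, throughput levels off near 1/0.08 = 12.5 transactions per second as the busiest center saturates, while response time keeps climbing.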

The Art of Capacity Planning

Recent studies tend to confirm what managers have always suspected: The most important factor affecting forecast accuracy is not the method, but the person. Despite mathematical models and digital computers, human judgment remains the most critical element in any forecast. The forecaster still must decide what data and assumptions will go into the model. An understanding of the data is the first requirement of a good forecast. The second requirement is an understanding of the tools for analyzing those data.

When presented to the jury of executive opinion, unaided human judgment is good, but often not good enough in a highly competitive market. As a general rule, a forecast based on a sound mathematical model, skillfully deployed, is preferable to one made by a group of people using their best judgment (a "defactualized" or judgmental model). Preferable in what way? The expected error of an analytic model is known, or at least computable (thus turning uncertainty, which is unmanageable, into risk, which is manageable), whereas mere opinion is analytically intractable. Note that the emphasis here is on expected error: a sound model does not eliminate forecast error, but it does quantify it. Confidence in a forecast model can be based on probabilistic calculations (statistical models) or calibration error rates (causal reconstruction models).
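To illustrate what "computable expected error" means for a statistical model, this sketch fits a straight line and returns a prediction interval around the forecast. It assumes the usual least-squares conditions (independent, normally distributed residuals) and uses a normal quantile as a large-sample approximation; with only a handful of points, the Student t quantile with n - 2 degrees of freedom gives a wider, more honest interval. The CPU-hour figures are hypothetical.

```python
# Sketch: least-squares forecast with an approximate prediction interval.
import math
from statistics import NormalDist


def forecast_with_interval(xs, ys, x0, confidence=0.95):
    """Fit y = a + b*x and return (forecast, lower, upper) at x0.

    Uses the textbook standard error of a new observation,
    s * sqrt(1 + 1/n + (x0 - x_bar)**2 / Sxx), with a normal quantile as a
    large-sample stand-in for Student's t with n - 2 degrees of freedom.
    """
    n = len(xs)
    if n < 3:
        raise ValueError("need at least three points to estimate the error")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    a = y_bar - b * x_bar
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(sse / (n - 2))  # residual standard error
    se = s * math.sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / sxx)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    y0 = a + b * x0
    return y0, y0 - z * se, y0 + z * se


# Hypothetical monthly CPU-hour totals; forecast month 12.
months = [1, 2, 3, 4, 5, 6]
cpu_hours = [310, 325, 331, 352, 360, 371]
print(forecast_with_interval(months, cpu_hours, 12))
```

The interval widens as x0 moves away from the observed data, which is the statistical form of the warning that long-range extrapolation deserves less confidence.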

Unfortunately, many networked computer systems are too complex for exhaustive mathematical representation, either because measurements are incomplete (or unavailable) or because of parametric uncertainty (causal relationships cannot be reliably computed). In this situation (i.e., administering complex systems in a context of uncertainty), planners seek a consensus of experts as a way of reducing the variance in judgmental modeling. In other words, in the face of such uncertainty, guessing artfully and cooperatively is better than not guessing at all. Whenever a business must bid on systems outsourcing opportunities, it combines artful guessing, consensus building, analytic modeling and statistical forecasting of IT resource requirements to support a competitive bid.

If the IT organization is strategically important to the larger enterprise, then the welfare of each is codeterminate with that of the other. Forecasting activities should be actionable (i.e., a basis for action and not merely a budgeting scenario). The art of planning capacity in this context must serve several ambitions: explanation, description, prediction, causal reconstruction and prescription.

What Is Capacity?

At first glance, most people think they know what capacity planning is and how to do it. It must be simple, right? With client/server hardware so cheap, capacity planning may seem unimportant; you can always upgrade later. A simple estimate of the system's capacity should be sufficient, right? Why give this subject any more thought? Here’s why: the rapid rate of technical change in the IT industry and the complexity of networked systems. With rapid deployment of mission-critical systems, there’s no time to add more capacity later.

Once systems are in place, they become an integral part of the business. Taking a system down for upgrades becomes increasingly expensive in both time and resources. In addition, the added complexity of the environment typically requires more care because of the interdependencies among application components.

So, let’s look more closely at just what we mean by "capacity." There is no standard definition of capacity, even within the context of capacity planning. Rather, the definition of "capacity" depends on the context of interest and the system components being studied. For example, capacity may be stated in terms of "how fast" something (CPU, disk, etc.) is, such as requests processed per unit of time. Alternatively, it may be stated in terms of "how much" work can be done while maintaining a given level of service.
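The two definitions can give very different numbers for the same device. This sketch assumes a single server behaving as an M/M/1 queue, where mean response time is R = S / (1 - U) with U = arrival rate × S; the 20 ms service time and 100 ms response-time objective are hypothetical.

```python
# Sketch: two views of "capacity" for one server, assuming M/M/1 behavior.


def max_throughput(service_time):
    """'How fast': saturation throughput of a single server, in requests/s."""
    return 1.0 / service_time


def max_throughput_for_target(service_time, target_response):
    """'How much at a given service level': the highest arrival rate at which
    M/M/1 mean response time R = S / (1 - lam * S) stays within the target."""
    if target_response <= service_time:
        return 0.0  # not even an idle server can meet the target
    # Solve S / (1 - lam * S) = target_response for lam.
    return (1.0 - service_time / target_response) / service_time


# Hypothetical disk: 20 ms per request, 100 ms response-time objective.
print(max_throughput(0.020), max_throughput_for_target(0.020, 0.100))
```

Under these assumptions the device "can do" 50 requests per second, but only 40 per second while meeting the 100 ms objective; which number counts as its capacity depends on which question is being asked.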

The sheer number of performance metrics available from UNIX or MVS is overwhelming, and capturing even finer detail adds significant overhead, both for the measurement process itself and for managing the recorded data. Planners must strike a balance between the level of detail needed and the effort expended to obtain it. In capacity management, this balance matters most in the trade-off between the very detailed, specific information that may be required to resolve a particular performance problem and the more general averages and trends that characterize capacity planning.

The level of detail required for successful planning may well differ from that required to solve a performance problem. Some confusion can arise because both activities use largely the same data. While analytic models can be used to resolve detailed, application-specific performance problems, their greater value lies in assessing long-term impact, which is practically impossible to do by any alternative approach.

Capacity planning, then, is primarily a business activity that happens to have technical content. It is the means by which service levels can be protected in the future. For practical purposes, we can define capacity planning as the process of determining the computing resources needed to meet anticipated future requirements based upon the business direction and anticipated business growth.

Effective planning ensures that business operations are backed by the right quality of computer support for users. This process involves predicting the workloads that will result from new and changed business directions, and using the appropriate analytic models to determine the future performance of IT systems.

Capacity planning is driven purely by financial considerations. Effective capacity planning can significantly reduce the overall cost of ownership of a system. Although formal capacity planning takes time, internal and external staff resources, and software and hardware tools, the potential losses incurred without capacity planning are staggering. Lost productivity of end users in critical business functions, overpaying for systems equipment or services and the costs of upgrading systems already in production more than justify the cost of capacity planning.

Examples of Capacity Analysis

Getting in Trouble with Statistical Forecasting (Linear Regression). Suppose we wish to forecast the online response time of an application. We know, for example, the response times for 1, 2, 3, ..., 20 users and want to have some idea of the response time for 45 users.

It is tempting to use linear regression (straight-line trending). However, computer systems tend to exhibit nonlinear (often exponential) behavior. To see how far off a linear projection can be, suppose that we have actual response-time measurements, T0 and T1, taken at loads up to L1, and let Tmax represent the maximum desirable response time. A linear regression trend line would then provide a linear projection of response time out to our target load L2. If we then plotted the response times projected by a queuing network model (QNM) on the same graph, we would find that our linear forecast was in fact too optimistic: we would reach Tmax at a load well below L2.

Thus, although statistical forecasting can define a trend line using linear extrapolation techniques, predicting response time (as well as CPU utilization) requires a model that reflects the behavior of a queuing network.
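The following sketch reproduces the effect with a hypothetical system. Response times "measured" at 1 to 20 users are generated from an M/M/1 model (0.2 s service time, each user submitting 0.1 requests per second; both are assumptions), a straight line is fitted to them, and both the line and the model are then extrapolated to 45 users.

```python
# Sketch: straight-line extrapolation versus a queuing model (M/M/1 assumed).
# The "measurements" are generated from the model itself, purely for illustration.


def response_time(users, service_time=0.2, rate_per_user=0.1):
    """Mean M/M/1 response time when each user submits rate_per_user req/s."""
    utilization = users * rate_per_user * service_time
    if utilization >= 1.0:
        return float("inf")  # saturated: the queue grows without bound
    return service_time / (1.0 - utilization)


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    return y_bar - b * x_bar, b


measured_users = list(range(1, 21))
measured = [response_time(n) for n in measured_users]
a, b = fit_line(measured_users, measured)

for n in (20, 30, 40, 45):
    print(f"{n} users: trend {a + b * n:.2f} s, queuing model {response_time(n):.2f} s")
```

Under these assumptions the model reaches 1 s at 40 users and 2 s at 45, while the fitted line predicts just under 0.5 s at 45 users and gives no hint of trouble.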

Running Multiple Jobs at the Same Time. Recently, the question arose as to how many batch initiators should be active on our MVS mainframe in order to get the most work done. If too few initiators are available, jobs wait to run. If too many are active, system overhead increases to the point where it interferes with program execution and causes ALL jobs to run slower. So, what is the optimum number?

To answer this question, we look at CPU time. There are two flavors, if you will, of CPU time on MVS. One flavor is SRB (Service Request Block) time, which can generally be attributed to activities of the operating system. The other flavor is TCB (Task Control Block) time, which can generally be attributed to the application code (i.e., "useful" work). As more and more jobs run, more TCB time and SRB time is used. But at some point, the total amount of TCB time reaches a maximum and then declines as a function of the number of active jobs, while SRB time continues to increase due to contention for shared resources by multiple workloads.

If we fit a linear regression equation to this set of data, we can, in effect, predict the TCB time as a function of the number of active initiators. A linear model has the form y = a + bx.

Where y is the average TCB time per active initiator and x is the number of active initiators.

Our objective, however, is to find the total TCB time. To estimate this, we multiply the number of active initiators by the average TCB time per initiator: $\mathrm{TotalTCB} = y \cdot x$. Substituting $a + bx$ for $y$, we get the quadratic form $\mathrm{TotalTCB} = ax + bx^2$.

When this is plotted on the same graph as our original data, we see how the total TCB time reaches a peak and then declines as MVS becomes over-initiated.

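Thus, we can see graphically the optimum number of initiators, beyond which total TCB time begins to decline. This is the point on the parabola at which the slope (the first derivative) is zero: setting $a + 2bx = 0$ gives $x^{*} = -a/(2b)$, which is a maximum because $b < 0$ when per-initiator TCB time declines as initiators are added.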
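A minimal sketch of the initiator calculation, using made-up per-initiator TCB times chosen only for illustration: it fits y = a + bx by least squares and returns the peak of the total-TCB parabola ax + bx².

```python
# Sketch: optimum number of batch initiators from a linear fit of
# per-initiator TCB time. The sample data is made up for illustration.


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    return y_bar - b * x_bar, b


def optimum_initiators(a, b):
    """Peak of TotalTCB(x) = a*x + b*x**2, where a + 2*b*x = 0."""
    if b >= 0:
        raise ValueError("per-initiator TCB time is not declining; no interior peak")
    return -a / (2.0 * b)


initiators = [1, 2, 3, 4, 5, 6, 7, 8]
tcb_per_initiator = [93, 83, 75, 69, 61, 51, 43, 37]  # hypothetical TCB seconds

a, b = fit_line(initiators, tcb_per_initiator)
x_star = optimum_initiators(a, b)
print(f"y = {a:.1f} + ({b:.2f})x; total TCB peaks at about {x_star:.2f} initiators")
```

In practice the result would be rounded to a whole number of initiators (here 6, which yields slightly more total TCB time than 7 on this fitted curve) and the neighboring values checked against measured throughput.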

Modeling Results

Management wanted to know which of three UNIX Web servers would support a 150 percent increase in workload while processing 50,000 transactions per hour. The server was modeled, and the workload on the model was increased for configurations with 2, 4 and 8 CPUs. The results of analytic modeling show how throughput becomes saturated at different levels of workload intensity and thus identify the appropriate sizing to accommodate the expected demand.
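The modeling tool behind these results is not described here. As a rough, hedged illustration of how saturation differs for 2, 4 and 8 CPUs, this sketch treats the CPUs as an M/M/c queue and uses the Erlang C formula; the 0.2 CPU-seconds per transaction and the list of arrival rates are hypothetical.

```python
# Sketch: M/M/c (Erlang C) approximation of a multi-CPU server.
# Service time and arrival rates are hypothetical.
import math


def erlang_c(servers, offered_load):
    """Probability that an arrival must wait in an M/M/c queue.

    offered_load = arrival_rate * service_time (Erlangs), must be < servers.
    """
    c, a = servers, offered_load
    tail = (a ** c / math.factorial(c)) * (c / (c - a))
    return tail / (sum(a ** k / math.factorial(k) for k in range(c)) + tail)


def mean_response(servers, arrival_rate, service_time):
    """Mean time in system (waiting + service) for an M/M/c queue."""
    a = arrival_rate * service_time
    if a >= servers:
        return math.inf  # demand exceeds capacity: throughput has saturated
    return erlang_c(servers, a) * service_time / (servers - a) + service_time


cpu_seconds_per_tx = 0.2  # hypothetical
for per_hour in (25_000, 50_000, 100_000, 125_000):
    rate = per_hour / 3600.0
    cells = [f"{c} CPUs: {mean_response(c, rate, cpu_seconds_per_tx):.3f} s"
             for c in (2, 4, 8)]
    print(f"{per_hour:>7,} tx/h -> " + "; ".join(cells))
```

The pattern, not the specific numbers, is the point: each configuration's response time stays near the bare service time until the offered load approaches its CPU count, then climbs steeply toward saturation.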

About the Author: Tim Browning is a capacity planner for Consultec Inc. (Atlanta), and the author of the book Capacity Planning for Computer Systems (1995, Academic Press Professional). He can be reached via e-mail at tim.browning@consultec-inc.com.