Capacity Planning in Distributed Environments, Part 2
Editor's note: In the conclusion of this two-part article, third-party capacity planning offerings, the cost-effectiveness of distributed computing and controlling "service levels" of projected workloads are explored.
Organizations sometimes find themselves in a bind - additional work has to be processed because of a higher volume of business, but their in-house IT resources are at or near their processing capacity. At times like this, organizations can do one of two things: They can scrutinize the system carefully to identify and tune resources that could be used better, or they can go out and buy more resources. Tuning, while important to delivering good service to customers, very often yields only small increments in capacity - for example, perhaps a 1 percent or 2 percent buy-back in CPU utilization on a single processor. While capacity planning folklore is full of stories in which changing one or two system parameters yielded incredible improvements, the prudent capacity planner should not count on finding those magic tuning parameters - especially if the system and its critical applications are constantly being reviewed. Buying more resources obviously can yield tremendous capacity increases - but for more money!
If you consider history once again, the mainframe was, for the most part, a single system to manage. And it was a stable environment for production workloads. Mainframe vendors provided a rather thin set of tools for systems management, and this gave rise to an entire cottage industry: companies that made systems management and capacity management tools. Tools that emerged from third-party companies included software monitors (e.g., Candle's Omegamon and Landmark's Monitor), performance databases/archives (e.g., Merrill's MXG and Computer Associates' MICS) and queuing-based modeling packages (e.g., BGS' BEST/1 and Metron's Athene). In addition, many of these mainframe tools rely heavily on raw measurement data being processed by SAS (SAS Institute). Purchasing licenses for these tools would give the capacity planner an arsenal of functionality that could be used to predict future IT resource needs at specific service levels. But these tools were, and are, not cheap! Typically, a software license for any one of these tools could range between $20,000 and $120,000 (MXG being the exception).
Let's point out a few characteristics of this set of tools. First, the predictive tools never had perfect accuracy. Response time predictions were considered acceptable if they were within 15-25 percent of reality. Second, no single tool stood out as being able to predict workload growth based on past history or anticipated business volumes. Third, and most important, these tools were not designed to understand distributed environments. Software monitors could not capture and report on end-to-end response time. Modeling products often did not take the network into account. The message here might be to stay away from fancy tools that have a single-system focus if you're planning for the capacity of a true distributed environment.
So what does the distributed system's capacity planner do? Measurement is the key to using these tools properly and economically. The measurement architecture should be able to populate a performance database with distributed workload data. The capacity planner must look for tools that have a more global perspective. For example, we would want a software monitor to be able to display the status of every IT resource in the distributed environment. And, if some resource is performing poorly, then we want to be able to isolate that device and drill down on it to get more detailed information. In the performance database area, we need to have archives that contain summarized data across systems. We need cross-system data to look at trends. HP's OpenView PerfView/MeasureWare Agent (MWA) and GlancePlus are examples of distributed systems tools that monitor, archive and allow drill-down of specific IT resources connected on a network. And in the modeling area, we likely have to use simulation-based modeling tools, such as NetArchitect, QASE or SES/Workbench, to understand the network along with the client and the server and each one's impact on end-to-end response time. Such a model would have to have a greater understanding of the application as well, to identify poorly designed software components. The objective is to be able to provide planning information about the entire distributed environment, and how applications run in that environment. If the measurement architecture doesn't populate the tools with distributed workload data, then even with the best predictive tools we still have to guess at the behavioral data that goes into the model.
Thus, Capacity Planning tools (monitors, collections of performance data, and models) can and should be used to find when to add the right pieces of capacity - call it Just-In-Time-Capacity. If resources are added at the right time, there will be no interruption in the quality of service delivered, and there will be no waste with respect to paying for resources before they are really needed.
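One simple way to illustrate the Just-In-Time-Capacity idea is to fit a straight-line trend to archived utilization data and estimate when it will reach a planning threshold. The sketch below is only an illustration under stated assumptions: the function name, the 80 percent threshold and the sample figures are hypothetical, the samples are assumed to be equally spaced (say, monthly averages), and real workloads rarely grow in a perfectly linear way.

```python
def periods_until_threshold(utilization, threshold=0.80):
    """Estimate how many periods after the last sample a linear trend
    reaches `threshold`.

    `utilization` is a list of equally spaced samples (fractions, 0..1).
    Returns None when the fitted trend is flat or falling.
    The 0.80 default is an illustrative planning threshold, not a rule.
    """
    n = len(utilization)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")

    # Ordinary least-squares fit of utilization against period index.
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(utilization) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, utilization))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    if slope <= 0:
        return None  # no growth in the trend, so no crossing to plan for

    crossing = (threshold - intercept) / slope  # period index of crossing
    return max(0.0, crossing - (n - 1))         # periods beyond last sample


# Hypothetical monthly CPU utilization for one server.
history = [0.50, 0.55, 0.60, 0.65]
print(periods_until_threshold(history))
```

Subtracting the vendor's delivery and installation lead time from the result gives a rough date by which the order would have to be placed.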
Figure 1: History of Client/Server Processing
Capacity Planning efforts today must focus on the multi-system aspect of distributed systems. Figure 1 above is meant to illustrate how mainframe based systems evolved to client/server architectures. The diagonal line depicts the network; functions below the network line are functions that are performed on a client; functions above the line are performed on a server. The evolution shows how distributed systems have emerged by placing more emphasis on the client as it became more intelligent. The term Remote Management refers to managing the data and processing resources that exist at a remote server. The term "distributed" is used to indicate that a key function (e.g., "presentation" or "processing" or "data management") is performed on several machines. Thus Distributed Presentation refers to systems where the client has enough intelligence to offload some presentation functions; Distributed Processing implies that the client can offload presentation functionality and some of the processing chores; and Distributed Management indicates that the actual management of data occurs on both clients and servers.
Today, we have the intelligent client that can perform some significant functions on its own; we have the network, whose bandwidth can define the system-wide performance; and we have the server (or servers) that often contain the data being sought for analysis and presentation on the client. This is a multi-tiered architecture, and is fundamentally different from old mainframe systems, in which (1) there were no intelligent clients, only dumb terminals, (2) the network only connected machines of the same type, all using the same communication protocol, and (3) a single server held all of the necessary data.
Today's systems have a great challenge to conquer: they often are asked to pass information along in a three-tiered computing architecture: data often resides on a large mainframe; midsize machines often house smaller, but key, databases; and even smaller "client/server" machines often characterize a LAN. Capacity planners must face this reality of multiple systems connected via one or more networks, where the systems are of different sizes, running different operating systems and containing different database management systems. Again, the key to capacity planning must be to deliver useful information in a cost-effective way. Thus, the capacity planner must examine how applications will be using these different tiers. Common questions would include (a) whether to get many small servers or a few large ones, and (b) where specific processing functionality should reside: on the client or on a server.
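To make the "many small servers vs. few large servers" question concrete, here is a rough sketch using the textbook single-server queuing formula (M/M/1), in which mean response time is 1 / (service rate - arrival rate). Everything about the setup is an assumption made for illustration: random (Poisson) arrivals, exponential service times, work split evenly across the small servers, and a large server that is exactly k times as fast as a small one. It ignores cost, availability and the network, all of which matter in a real decision.

```python
def mm1_response_time(arrival_rate, service_rate):
    """Mean response time of an M/M/1 queue (rates in transactions/second)."""
    if arrival_rate >= service_rate:
        raise ValueError("server is saturated: arrival rate >= service rate")
    return 1.0 / (service_rate - arrival_rate)


def compare_configurations(total_arrival_rate, small_service_rate, k):
    """Return (one_large, many_small) mean response times in seconds.

    one_large:  a single server k times as fast as a small one
    many_small: k small servers, each receiving 1/k of the workload
    """
    one_large = mm1_response_time(total_arrival_rate, k * small_service_rate)
    many_small = mm1_response_time(total_arrival_rate / k, small_service_rate)
    return one_large, many_small


# Hypothetical workload: 8 transactions/second; a small server completes 4/second.
print(compare_configurations(8.0, 4.0, 4))  # → (0.125, 0.5)
```

Under these assumptions the single large server responds k times faster at the same total capacity, which is one reason the choice deserves a model rather than a rule of thumb.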
And let's not lose sight of networks! The backbone network of a multi-system configuration could easily be the single resource most responsible for poor application service. Network bandwidth and speed are sensitive to the choice of protocols. If an organization is considering using a Wide-Area Network, this often will require some study - perhaps involving a network simulation! The available/useful bandwidth of the public network, too, is extremely sensitive - and is less controllable! Router capacities must be carefully considered as well. A common performance/validation technique is to place a sniffer at some point on the network to collect network traffic statistics from which response times can be calculated.
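As a sketch of how response times might be calculated from sniffer data, assume the capture has already been reduced to records of the form (timestamp in seconds, conversation id, kind), where kind is either "request" or "response". That record layout is hypothetical; real capture tools export many different formats. Note, too, that the result is the response time as seen at the capture point, so any delay between the client and the sniffer is not included.

```python
def response_times(packets):
    """Pair each request with the next response in the same conversation.

    `packets` is an iterable of (timestamp_seconds, conversation_id, kind)
    tuples, a hypothetical reduced form of a sniffer trace.
    Returns {conversation_id: elapsed_seconds} for completed exchanges.
    """
    pending = {}   # conversation id -> timestamp of first unanswered request
    elapsed = {}

    # Process records in time order; traces are not always sorted.
    for timestamp, conversation, kind in sorted(packets, key=lambda p: p[0]):
        if kind == "request":
            pending.setdefault(conversation, timestamp)
        elif kind == "response" and conversation in pending:
            elapsed[conversation] = timestamp - pending.pop(conversation)
    return elapsed


# Hypothetical trace with two interleaved conversations.
trace = [
    (10.0, "a", "request"),
    (10.5, "b", "request"),
    (10.25, "a", "response"),
    (11.5, "b", "response"),
]
print(response_times(trace))  # → {'a': 0.25, 'b': 1.0}
```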
Presumably, a great advantage of distributed systems is being able to buy needed capacity in small increments. Small systems have exhibited this characteristic; this allows a more accurate sizing of necessary equipment to business needs. All of this translates into less excess capacity and therefore lower overall cost - or so one might think.
As we've said before, scalability is key for capacity planning for distributed systems. Economically, scalability is a primary contributor to reduced cost. The theory is you buy enough capacity to do your processing now, and if additional capacity is required in the future, it is acquired at a reduced unit cost because of the constant improvement in price/performance ratios.
Is there a fallacy in this thinking? Consider what happens during the entire life cycle of equipment - especially client/server equipment. You buy what you need today - and incur both acquisition and installation costs. Over time, you also incur operational costs (licensing fees, support personnel and maintenance). But are those all of the costs? What happens when additional capacity is needed (i.e., one server needs more capacity)? Yes, you acquire a bigger server, but what do you do with that old server? Do you throw it out? Most companies would roll over the server to a new place - that is, the old server is likely to replace an even smaller machine somewhere else in the organization. And as this may cause a cascading effect, consider, too, that there are costs that have to be incurred when installing each old machine in a new place (e.g., installing new software, testing, support personnel costs, etc.).
In the mainframe environment, these rollover costs were seldom encountered. Processors were generally exchanged or added. But in the distributed environment, a processor swap can cause multiple rollovers. If the rollover costs become significant, it may become uneconomical at some point to do the rollover! Leilani Allen, speaking at a meeting of the Financial Management for Data Processing Association, said, "By the year 2000, it will cost more to keep old technology than to upgrade." (Allen, Leilani E., Ph.D., I/T Investment in the Year 2000, FMDPA Annual Conference, May 1994)
While this may be surprising, the situation does point out that we should understand the actual magnitude of rollover cost in building a financial model over the life span of a distributed system. James Cook (Cook, James R., Should You Drive a Used Car on the Information Superhighway?, Proceedings of the Computer Measurement Group, 1994, Orlando, Fla.) proposes such a financial model; namely, that the life cycle cost equals the initial acquisition cost, plus the operational cost, plus N times the installation cost, where N is the total number of swaps in the rollover series. A first processor swap may add nearly 50 percent of the original acquisition cost, a second swap 100 percent, and a third swap 150 percent! Note, too, that at some point, rollover costs will consume any savings that may be gained from cheaper MIPS being available in the future. Thus, spending a little more initially on capacity may actually avoid a processor swap (and its costs) in the future.
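The life cycle formula described above is easy to put into a few lines of code. The sketch below follows the formula as stated (acquisition plus operational plus N times installation); the dollar figures are hypothetical, with the installation cost set to half the acquisition cost simply to mirror the "nearly 50 percent" per swap mentioned above.

```python
def life_cycle_cost(acquisition, operational, installation, swaps):
    """Life cycle cost = acquisition + operational + swaps * installation,
    where `swaps` is the number of swaps in the rollover series."""
    if swaps < 0:
        raise ValueError("swaps must be zero or more")
    return acquisition + operational + swaps * installation


# Hypothetical figures, chosen only for illustration.
ACQUISITION = 100_000
OPERATIONAL = 40_000
INSTALLATION = 50_000   # assumed: roughly 50 percent of acquisition per swap

for n in range(4):
    total = life_cycle_cost(ACQUISITION, OPERATIONAL, INSTALLATION, n)
    added = 100 * n * INSTALLATION / ACQUISITION
    print(f"{n} swaps: ${total:,} (rollover adds {added:.0f}% of acquisition)")
```

Comparing these totals with the price of a somewhat larger initial configuration shows how quickly rollover can erase the savings expected from cheaper future capacity.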
The bottom line: The focus of financial management strategies for IT has long been on acquisition. But the realities of the life cycle of equipment in distributed systems dictate that ongoing operational costs (that address rollover) demand more attention. Old, conventional wisdom just doesn't apply to distributed systems.
The capacity planner should attempt to base a capacity plan on projected workloads that each receive an acceptable degree of service - service levels. Workloads should, ideally, be specified in business terms - sometimes referred to as natural forecast units or business transactions. One advantage of basing the capacity plan on business transactions is that a clear correlation exists between the capacity plan and the business plan, and should the business plan change, the effect on the capacity plan will be immediately obvious. Remember, the focus of the capacity planner should always be on business needs.
But before service levels can be managed, they have to be specified and agreed on. Application designers, capacity planners and performance analysts need to agree on the answers to questions like:
- What is the relative importance of each application to the business?
- What is the required availability of each application?
- What is the required performance (i.e., response time) of each application?
- What are the current workload volumes for each application?
- Are the workload volumes or relative importance expected to change over time? If so, how?
- Are new applications planned? What are their relative importance and workload volumes?
- What financial constraints exist on the acquisition of computer resources?
- Can IT costs be charged back to the business units that use the applications?
Service level measures (specifically availability, response times and workload volumes) should be reported by business unit and application. When these are compared with the service level requirements, we can see whether goals are being met and, based on priority, determine the impact on the business. Re-planning capacity requirements may be necessary if the stated service level requirements are adversely affecting the business.
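A report of this kind can be sketched as a comparison of measured values against agreed goals, keyed by business unit and application. The data structure, the names and the figures below are all hypothetical; the point is only to show the shape of the comparison.

```python
from dataclasses import dataclass


@dataclass
class ServiceLevel:
    availability: float    # fraction of scheduled time available, 0..1
    response_time: float   # seconds


def missed_goals(measured, required):
    """Compare measured service levels with requirements.

    Both arguments map (business_unit, application) -> ServiceLevel.
    Returns a list of ((business_unit, application), reason) tuples.
    """
    misses = []
    for key in sorted(required):
        goal = required[key]
        actual = measured.get(key)
        if actual is None:
            misses.append((key, "no measurements"))
            continue
        if actual.availability < goal.availability:
            misses.append((key, "availability"))
        if actual.response_time > goal.response_time:
            misses.append((key, "response time"))
    return misses


# Hypothetical requirements and measurements.
required = {
    ("claims", "entry"): ServiceLevel(availability=0.995, response_time=2.0),
    ("finance", "ledger"): ServiceLevel(availability=0.990, response_time=5.0),
}
measured = {
    ("claims", "entry"): ServiceLevel(availability=0.999, response_time=2.6),
    ("finance", "ledger"): ServiceLevel(availability=0.992, response_time=3.1),
}
print(missed_goals(measured, required))
```

Sorting the misses by the relative importance of each application would give the priority view described above.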
But there are several obstacles to hurdle before managing service levels across distributed environments. The nature of client/server applications is such that the application architecture distributes some portion of its processing to the client. This introduces additional points of potential failure or bottleneck, and also complicates the definition of a transaction. In the network, heterogeneous systems coexist using different communication protocols and application-to-application session protocols (e.g., SNA APPC, OSF's DCE, etc.). Thus, managing end-to-end service levels becomes very complex because of the different computing and communications devices in a path. And measurement, monitoring and applications management tools are relatively immature compared with their mainframe counterparts, while the technology problems in building similar tools are different for distributed systems. There is hope - base measurement data, like utilization and traffic, is available from most intelligent networked devices. Standards have been proposed for system performance data across platforms. Agents have been and still are being developed for database instrumentation.
The problem remains one of intelligently correlating the various component measurements into an end-to-end picture of the service provided to application users. Obtaining performance data at the individual transaction level will require code instrumentation within the application. Application-specific measurements would be collected and fed to an enterprise manager, which would allow the end-to-end correlation of performance.
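The kind of code instrumentation described here amounts to marking where a business transaction starts and stops and recording the elapsed time and outcome. The sketch below shows the idea in a generic way; it is not the API of any particular product, and the names (`transaction`, `measurements`) are hypothetical. The list stands in for whatever mechanism would forward the data to an enterprise manager.

```python
import time
from contextlib import contextmanager

# Stand-in for a collector that would forward data to an enterprise manager.
measurements = []


@contextmanager
def transaction(name, clock=time.perf_counter):
    """Time the enclosed block and record (name, status, elapsed_seconds).

    `clock` can be replaced for testing; by default it is a
    high-resolution timer.
    """
    start = clock()
    status = "failed"          # assume failure until the block completes
    try:
        yield
        status = "ok"
    finally:
        measurements.append((name, status, clock() - start))


def enter_order():
    """Hypothetical application function wrapped as one business transaction."""
    with transaction("order_entry"):
        time.sleep(0.01)       # placeholder for the real work
```

After `enter_order()` runs, `measurements` holds one record for the "order_entry" transaction; a block that raises an exception is still recorded, with status "failed".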
HP's MeasureWare Agent Transaction Tracker is an example of a tool that can be used to manage IT service levels. User-defined applications enable you to group processes together by user or application name for better tracking of an individual's or application's impact on your resources. MeasureWare Agent is consistent across multiple platforms - HP-UX, Sun SPARC, IBM AIX, Bull and NCR WorldMark - using a common interface and standard metrics.
To create application specific measurements, the joint efforts of HP and Tivoli have produced ARM - Application Response Measurement. ARM is an API/software developer's kit that allows application developers to add appropriate performance instrumentation to their code easily. Applications deployed using C/C++, Microsoft Visual Basic, MicroFocus COBOL and Delphi are currently supported under all of the popular PC platforms.
Summarizing, we note that as distributed applications are deployed, and as management responsibilities for these distributed applications shift to the central IT organization, service level management becomes increasingly important, complex and difficult. Many of the management tools required to effectively manage service levels across the enterprise are only now coming into their own. The first step toward implementing enterprise service level management is to approach systems management from the end-to-end application workload perspective, rather than viewing the environment as a collection of physical components.
Capacity Planners have some very critical questions to consider, including:
- What type of servers should be deployed to support specific applications?
- How many servers are needed? How much bandwidth does the network need to provide acceptable service levels?
- How big should the servers be to handle the application volume?
- Can we define service levels for different distributed applications?
- How can we tell if the servers are optimized for the network operating system and communications software that is in place?
- Are applications properly configured for a mixed-platform environment?
- Are appropriate measurement sources available?
- What are the true costs of incremental additions to capacity?
- Is effective service level management possible for distributed systems?
Capacity management for distributed systems faces many challenges. The one single driving challenge is to constantly ask whether Capacity Planning is helping IT deliver the best service possible to its customers, now and in the future. IT managers must always address building scalable architectures that have market-driven compatibility. Evaluating a "checkoff" item capacity planning philosophy for your environment may prove to be a real cost saver - especially when taking into account the cost of maintaining a full-time capacity planning staff. And we shouldn't forget to include the true costs associated with adding incremental capacity.
Instrumentation across platforms is key - especially for new applications. Applications should be designed from the outset with the goal of being able to provide application-specific performance measurements so that we understand the quality of service delivered. Without application-level instrumentation, service-level management across the enterprise may not be possible.
Data reduction/summarization/reporting software will be required to manage the volume of customer data across platforms, and must address the network, as well as the client and the server. Modeling tools must become graphical to allow network topologies to be easily defined, and must then address modeling heterogeneous combinations of hardware and system software platforms. Modeling should be able to provide predictions of IT usage and service from a global perspective, as well as a detailed focused perspective.
It is our hope and desire to see these challenges addressed in the near future with the development and application of new techniques.
About the Author:
Dr. Bernie Domanski is a Professor of Computer Science at the Staten Island campus of the City University of New York (CUNY) with nearly 25 years of experience in data processing. He is the author of over 50 papers, has lectured internationally, and is CIO of the Computer Measurement Group (CMG).