In the Saddle with Availability Management: A Framework for Business Continuity
Availability management is a proactive approach to systems management that identifies and responds to system events before they cause larger problems or even a system failure. Availability is maximized by continually monitoring system components, identifying events likely to occur and taking preventive action.
As the Internet economy continues to raise expectations and requirements for computing services, information technology is being transformed from an enabling technology to a strategic component of the business process. As a result, IT resources have become essential to business goals, driving companies to operate and manage by service-level agreements.
At the same time, the growing gap between the demand for and the supply of skilled IT staff, the addition of more distributed servers, applications and new technologies, and standards and compatibility issues all create a formidable challenge for the administrative staff who are held responsible for system operations. Even at a basic level, monitoring and repairing the most common system problems has become too time-consuming, complex and costly for most IT departments to manage.
To tackle these issues, administrators are increasingly turning to systems management solutions. Traditional system management tools produce voluminous reports and logs, tracking a great deal of information about system operations. More often than not, these tools produce an overwhelming quantity of raw historical data and alarm notifications, while providing limited assistance in actual problem prevention. While maintaining optimal availability is the system administrator’s primary mission, most administrators do not get the kind of assistance they need from management tools to effectively "prevent" availability lapses.
Management by System-Specific Intelligent Agents
Availability management, a new approach to systems management, involves proactively identifying and responding to system events that typically precede larger system problems, thereby preventing many types of common system failures, ranging from performance degradation to brief downtime to catastrophic failure. To maximize availability, this management solution continually monitors system components, such as CPU, memory, network connectivity and disk capacity, along with their interactions. It identifies events that are likely to occur and takes appropriate automatic actions to prevent a failure. Analysts say that roughly 80 percent of routine breakdowns can be averted by applying technical and business expertise and user-defined priorities, and by anticipating certain conditions affecting the system.
Critical to this model is a set of system-specific intelligent agents that monitor the system and act when they detect leading indicators of imminent failure. These agents need to reside at every node so that, even in cases of network disruptions, each node can individually attempt to diagnose and correct problems. A flexible, designated control server keeps track of all the nodes in the environment, while the individual intelligent agents carry out user-defined, local and preventive actions as necessary.
Intelligent agents incorporate knowledge about typical system and application behavior and apply that knowledge in case undesirable conditions are detected. The actions of the agents are determined by user-defined rules that are customized for every environment, configuration, application and business process.
Monitoring Disk Capacity
In the case of disk capacity, some applications fail when there is insufficient disk space to write data. By anticipating such a failure and preventing the disk from reaching its capacity limit, the agent can avert possible system downtime. The agent determines partition capacity by measuring the percentage of the partition that is full and its absolute size, or by measuring a directory or file in absolute size or as a fraction of the total partition. A typical sequence involving an agent automatically responding to an event or condition, such as the disk approaching its capacity, would proceed as follows:
1. An event occurs. In this case the disk space in use has exceeded a predetermined threshold.
2. An agent detects the event. The monitoring agent detects that too much disk space is in use, as defined by the rules configured for that agent. As soon as the event is detected, an event message is created that contains the event class and all of the defined event parameters.
3. The agent logs the event. The disk capacity agent’s monitor writes the event to the event log for user viewing and reporting.
4. The agent publishes the event message. The monitor of the disk capacity agent sends the event message to all other agents (for instance, specialized application agents) capable of handling disk-related events. These agents are known because they have previously subscribed to that event.
5. Subscribing agents receive the event message as an HTTP post. Upon receiving the event message, each subscribing agent determines from the URL of the HTTP post which of its actors should be set in motion and receive the message.
6. An agent’s actor receives the message and responds. When the appropriate actor receives the disk-full event message, that actor determines which rule it will use to process that event. Rules simply represent correspondences between events and actions, so the rule determines what action should be taken at this point.
7. The agent’s actor takes the preventive action dictated by the rule. In this disk-full scenario, the actor might delete one or more unwanted temporary files to reduce the disk space in use. Rules set previously would determine which types of files would be deleted first.
8. The agent’s actor then logs the response. The user can view all log messages regarding the event detected and the actions taken.
At this point, the disk space problem should be resolved, and the system and its applications can go on functioning normally.
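The disk capacity sequence can be sketched in a few dozen lines of Python. This is only an illustration under stated assumptions, not the design of any particular product: the names Event, Agent, DiskMonitor and purge_temp_files, and the event class "disk.capacity", are hypothetical; delivery is a plain in-process method call rather than the HTTP post described in step 5; only the percent-full measurement is shown; and the preventive action merely reports what it would delete.

```python
import shutil
from collections import namedtuple
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Event:
    event_class: str  # hypothetical class name, e.g. "disk.capacity"
    params: dict      # the event parameters defined for that class


@dataclass
class Agent:
    """A subscribing agent whose rules map event classes to actions."""
    name: str
    rules: Dict[str, Callable[[Event], str]] = field(default_factory=dict)
    log: List[str] = field(default_factory=list)

    def receive(self, event: Event) -> None:
        action = self.rules.get(event.event_class)  # step 6: pick the rule
        if action is None:
            return
        outcome = action(event)                     # step 7: preventive action
        self.log.append(f"{event.event_class}: {outcome}")  # step 8: log it


class DiskMonitor:
    """Checks one partition against a percent-full threshold."""

    def __init__(self, path: str, threshold_pct: float, usage=shutil.disk_usage):
        self.path = path
        self.threshold_pct = threshold_pct
        self.usage = usage  # injectable so the sketch can run on fake data
        self.subscribers: List[Agent] = []
        self.event_log: List[Event] = []

    def subscribe(self, agent: Agent) -> None:
        self.subscribers.append(agent)

    def check(self) -> Optional[Event]:
        u = self.usage(self.path)
        percent_full = 100.0 * u.used / u.total
        if percent_full <= self.threshold_pct:      # steps 1-2: threshold test
            return None
        event = Event("disk.capacity",
                      {"path": self.path, "percent_full": percent_full,
                       "free_bytes": u.free})
        self.event_log.append(event)                # step 3: log the event
        for agent in self.subscribers:              # steps 4-5: publish
            agent.receive(event)
        return event


def purge_temp_files(event: Event) -> str:
    # Placeholder action: a real actor would delete files chosen by its rules.
    return f"would delete temporary files on {event.params['path']}"


# Usage on a fake partition that is 95 percent full; no real disk is touched.
FakeUsage = namedtuple("FakeUsage", "total used free")
monitor = DiskMonitor("/data", threshold_pct=90.0,
                      usage=lambda path: FakeUsage(1000, 950, 50))
cleaner = Agent("cleaner", rules={"disk.capacity": purge_temp_files})
monitor.subscribe(cleaner)
fired = monitor.check()
```

Passing the usage function in as a parameter keeps the monitor testable; by default it calls shutil.disk_usage on the real partition.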
Memory Capacity and CPU Utilization
Devising agents to monitor CPU and memory usage is more intricate. To add useful intelligence to the monitoring process, and to avoid triggering unnecessary preventive actions, the agents must "understand" what kinds of fluctuations in capacity utilization normally occur. Abnormally high memory use can be caused by a memory leak or by a more general problem. It can also be the result of a normal pattern in the operation of an application.
Application start-up, for example, requires a predictably heavy use of memory until the application "quiesces," that is, settles in after start-up and reaches a state of equilibrium. Every application has a unique memory use "signature." An intelligent monitoring agent should be aware of the signature for every application so that it will recognize predictable spikes in usage as normal application behavior.
Similarly, each application has unique, predictable memory use patterns that signal impending failure. An intelligent agent incorporates pattern recognition in its interpretation of events. Memory can then be measured for specific processes as the average allocation over a given period of time and as a count of page faults detected over a given period of time, disregarding expected peaks of usage. The agent’s response could be to restart the service, which temporarily relieves a memory leak, or to shut it down so that other critical services can continue functioning optimally while the system administrator investigates the cause of the excess memory use.
Tracking CPU utilization is similar to monitoring memory use, as CPU use also has predictable patterns for each application that need to be considered. To monitor utilization properly, the agent can measure global CPU usage as an average over a specified time interval and track CPU usage for an individual process, also measured as an average over a specified time interval. Measurements are not used in calculating the average until after the normal, anticipated usage equilibrium is reached.
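One way to picture this kind of measurement is a sliding-window average that ignores start-up samples. The sketch below is an assumption-laden simplification: the class name UtilizationMonitor is hypothetical, and a fixed count of warm-up samples stands in for a real per-application usage signature. The same structure would apply to memory allocation or to CPU percentages.

```python
from collections import deque
from typing import Deque, Optional


class UtilizationMonitor:
    """Averages samples over a fixed window, skipping the start-up period."""

    def __init__(self, warmup_samples: int, window: int, threshold: float):
        self.warmup_samples = warmup_samples  # samples to discard at start-up
        self.threshold = threshold            # average that counts as abnormal
        self.seen = 0
        self.samples: Deque[float] = deque(maxlen=window)

    def add(self, value: float) -> None:
        self.seen += 1
        if self.seen <= self.warmup_samples:
            return  # expected start-up spike: not part of the average
        self.samples.append(value)  # oldest sample drops out when full

    def average(self) -> Optional[float]:
        if not self.samples:
            return None
        return sum(self.samples) / len(self.samples)

    def exceeded(self) -> bool:
        # Only judge once a full window of post-start-up samples exists.
        full = len(self.samples) == self.samples.maxlen
        return full and self.average() > self.threshold


# Usage: three heavy start-up samples, then steady state, then sustained load.
mon = UtilizationMonitor(warmup_samples=3, window=3, threshold=80.0)
for pct in [95, 97, 90, 40, 42, 44]:
    mon.add(pct)
steady_average = mon.average()
steady_alarm = mon.exceeded()
for pct in [90, 95, 99]:
    mon.add(pct)
loaded_alarm = mon.exceeded()
```

In this run the three start-up readings never enter the window, so the steady-state average reflects only the readings of 40, 42 and 44, and the alarm is raised only after the window fills with sustained high readings.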
Network Connectivity
Discerning connectivity among nodes on a network is commonly accomplished by attempting to communicate with (pinging) a predetermined site, usually the server that acts as the point of control. Monitoring done by a simple agent can indicate, via a message to the administrator on the control server, whether communication with any of the nodes has failed or network performance is unacceptable. However, with no connectivity to a failed node, the control server cannot identify the cause of the problem or do anything about it beyond notifying the administrator of the error and logging the event.
If an intelligent agent resides at every node, however, each node can also monitor its own connectivity with the network. Then, if the network response is unacceptable, the local agent can attempt to repair the problem on its own by reconfiguring or restarting the system, or even rebooting, if necessary. The presence of an agent at the local level that can initiate actions predetermined by the system administrator can significantly reduce mean time to repair (MTTR). In the event that connectivity with the control server is lost, any other node can take on the control server’s monitoring and reporting duties. This avoids a single point of failure and keeps the availability management tools available to the remaining nodes.
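The local repair behavior can be illustrated as an escalation loop: probe the network, and if the probe fails, try each administrator-defined action in order until connectivity returns. This is a hedged sketch only; the function name restore_connectivity and the action names are hypothetical, and the probe and repair actions are stand-ins passed in by the caller rather than real network or system calls.

```python
from typing import Callable, List, Tuple


def restore_connectivity(
    probe: Callable[[], bool],
    repairs: List[Tuple[str, Callable[[], None]]],
) -> List[str]:
    """Try each repair in order until the probe succeeds; return a step log."""
    log: List[str] = []
    if probe():
        log.append("connectivity ok")
        return log
    for name, action in repairs:
        action()                       # predetermined by the administrator
        log.append(f"tried {name}")
        if probe():
            log.append("connectivity restored")
            return log
    log.append("escalate to administrator")
    return log


# Usage with a simulated node whose link comes back after a restart.
state = {"up": False}


def fake_restart() -> None:
    state["up"] = True


steps = restore_connectivity(
    probe=lambda: state["up"],
    repairs=[
        ("reconfigure", lambda: None),  # simulated: has no effect here
        ("restart", fake_restart),      # simulated: brings the link back
        ("reboot", lambda: None),       # never reached in this run
    ],
)
```

Ordering the repairs from least to most disruptive means a reboot is attempted only when the milder actions have not restored the connection.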
Application Availability
The availability of IT applications goes beyond system-level availability as measured by mean time between failures (MTBF) and MTTR. Application availability is achieved through sophisticated agents that understand the behavior of specific applications and their dependencies on the multiple resources supporting those applications. The agents recognize indicators of impending failure and initiate an appropriate response to prevent downtime and maximize availability.
To make the intelligent management process more effective, these agents can then incorporate the system administrator’s responses to system events into their knowledge bases, learn from them and begin recommending preventive actions based on this history. The availability management tool tracks how events happen and how the system administrator responds, notes the result, analyzes and correlates this data into patterns and, based on those patterns, recommends a preventive action.
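As a rough illustration of learning from history, the sketch below records each administrator response with its outcome and recommends the action that has most often resolved a given class of event. It is a deliberately simple stand-in, a frequency count rather than real pattern correlation, and the class name ResponseHistory and the event and action names are hypothetical.

```python
from collections import Counter, defaultdict
from typing import DefaultDict, Optional


class ResponseHistory:
    """Counts which administrator actions resolved which event classes."""

    def __init__(self) -> None:
        self._successes: DefaultDict[str, Counter] = defaultdict(Counter)

    def record(self, event_class: str, action: str, resolved: bool) -> None:
        # Only responses that actually resolved the event count as evidence.
        if resolved:
            self._successes[event_class][action] += 1

    def recommend(self, event_class: str) -> Optional[str]:
        counts = self._successes.get(event_class)
        if not counts:
            return None  # no successful history for this event class yet
        return counts.most_common(1)[0][0]


# Usage with made-up history for a hypothetical "memory.leak" event class.
history = ResponseHistory()
history.record("memory.leak", "restart service", resolved=True)
history.record("memory.leak", "restart service", resolved=True)
history.record("memory.leak", "reboot node", resolved=True)
history.record("memory.leak", "ignore", resolved=False)
suggestion = history.recommend("memory.leak")
unknown = history.recommend("disk.capacity")
```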
About the Author: Sam Mandelbaum is Director of Product at Availant (Cambridge, Mass.).