In-Depth
Application Monitoring in the Clouds
Business-driven processes for measuring and guaranteeing service levels in the cloud will help CIOs successfully transition to this new infrastructure.
by Zohar Gilad
The key benefits of virtualization and private clouds for enterprises include provisioning agility and reduced infrastructure cost. It’s no surprise, then, that so many CIOs are looking at deploying virtualization technologies to optimize their hardware and software assets, from the desktop to the data center. Some are even venturing into the public cloud with small pilot projects.
Yet a major barrier for CIOs in adopting the cloud for mission-critical applications is the widespread concern around performance, reliability, and security. CIOs are uneasy about how they, or a service provider, will effectively manage this new, more amorphous environment. Who is accountable when problems arise, and how quickly will they be resolved? Will corporate data face greater risk in the cloud when applications no longer have dedicated resources?
If performance in the cloud suffers, the fingers start pointing and the "blame-storming" begins. Without accurate methods to pinpoint the source of application and user problems in a virtual environment, inevitably, employees and executives will blame the cloud and those who championed it. There goes the potential of a new, game-changing technology.
For all that the cloud promises, it can neither solve nor prevent most application problems any better than the physical infrastructure can. Application failures have vexed IT departments for years (if not decades): sluggish response times, user access problems, and downtime consume a disproportionate amount of IT staff and help desk hours. Software-related problems are inherently difficult to troubleshoot and resolve because of the complexity of enterprise IT environments. Which application, module, or piece of code is causing the problem? Is it the network, a database, or a server failure? Is it user-specific or a system-wide problem?
Even though the cloud promises to pull CPU cycles out of thin air, there are many reasons an application may suffer that have nothing to do with hardware resources. The application's design could be flawed, or the way it handles user tasks or accesses Web services may differ from what the organization originally anticipated.
Furthermore, virtualization and cloud technologies create unique challenges for applications, which need to be adjusted for dynamic provisioning: how applications start and connect to servers will change as the virtual environment allocates and shifts resources based on policies and hardware availability. For instance, some applications rely on hard-coded server names, which simply don't work in the cloud.
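To make the point concrete, here is a minimal sketch, assuming a Python application and illustrative configuration names (APP_DB_HOST, APP_DB_PORT), of resolving a server endpoint from configuration that the provisioning system can rewrite, rather than hard-coding it:

```python
import os
import socket

# Anti-pattern: a server name baked into the application.
# DB_HOST = "dbserver01.corp.example.com"   # breaks when the workload moves

# Cloud-friendly pattern: resolve the endpoint from configuration that the
# provisioning system can rewrite. The names APP_DB_HOST/APP_DB_PORT are
# illustrative, not from any specific product.
DB_HOST = os.environ.get("APP_DB_HOST", "localhost")
DB_PORT = int(os.environ.get("APP_DB_PORT", "5432"))

def db_endpoint():
    """Resolve the database endpoint at connection time, so DNS or
    service-discovery changes take effect without redeploying the app."""
    return socket.getfqdn(DB_HOST), DB_PORT

if __name__ == "__main__":
    host, port = db_endpoint()
    print(f"Connecting to database at {host}:{port}")
```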
Then there is the potential impact of resource sharing on application performance. While virtualization lowers costs, it increases the potential for application conflicts by an order of magnitude. For example, two applications running in two virtual machines on the same physical server can degrade service for each other's users.
A critical step for a CIO evaluating the transition of mission-critical applications to the cloud is to understand risks in quality of service (QoS) and how to manage them. The following scenarios are three of the top concerns for applications in the virtual world:
- Degradation of service during the phases of transition from physical to virtual infrastructure (and thereafter)
- The potential negative impact on QoS of inter-application shared resource contention
- Degradation of QoS at peak application load times
There's no way to effectively manage these risks, however, without automated monitoring. Automated application performance management will help IT departments proactively monitor and troubleshoot, as well as perform historical analysis to prevent problems from occurring again. Here are a few examples of how this automation works:
Co-provisioning: When a new virtual machine is provisioned, the application monitoring software should be deployed alongside it automatically and follow that VM wherever it goes, so that monitoring carries over seamlessly from the physical world to the virtual one.
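As a rough illustration of how such co-provisioning could be wired up, the sketch below shows a hypothetical hook that a provisioning system might call whenever a VM is created. The agent service name, registration URL, and payload fields are assumptions for illustration, not any particular product's API:

```python
import json
import subprocess
import urllib.request

# Hypothetical endpoint of an internal monitoring back end (illustrative URL).
MONITORING_API = "https://monitoring.internal.example.com/api/vms"

def on_vm_provisioned(vm_id: str, vm_ip: str, app_name: str) -> None:
    """Hook invoked by the provisioning system after a VM is created."""
    # 1. Start the monitoring agent on the new VM (here via SSH and systemd;
    #    baking the agent into the VM image or using cloud-init are alternatives).
    subprocess.run(
        ["ssh", f"ops@{vm_ip}", "sudo systemctl enable --now monitoring-agent"],
        check=True,
    )

    # 2. Register the VM with the monitoring back end so its metrics are tagged
    #    with the application they belong to from the very first sample.
    payload = json.dumps({"vm_id": vm_id, "ip": vm_ip, "application": app_name}).encode()
    req = urllib.request.Request(
        MONITORING_API, data=payload, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(req)
```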
Historical analysis: One of the problems with virtual systems is that they come and go. A virtual machine here today may be decommissioned tomorrow, and all of the data within it also disappears. However, monitoring automation should retain that performance data so that you can go back and understand what happened during a transaction window, helping you determine the source of the problem. This is akin to stopping a video and replaying a scene. Without postmortem analysis, application monitoring is unable to effectively pinpoint the cause and suggest remedies.
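One way to picture the retention step is a collector that writes each VM's performance samples to a store that outlives the VM. The sketch below uses a SQLite file on shared storage purely for illustration; a real monitoring product would provide its own repository, and the path and column names here are assumptions:

```python
import sqlite3
import time

DB_PATH = "/mnt/shared/perf_history.db"  # durable storage outside any one VM

def record_sample(vm_id: str, app: str, metric: str, value: float) -> None:
    """Append one performance sample, keyed by VM and timestamp."""
    con = sqlite3.connect(DB_PATH)
    con.execute("""CREATE TABLE IF NOT EXISTS samples
                   (ts REAL, vm_id TEXT, app TEXT, metric TEXT, value REAL)""")
    con.execute("INSERT INTO samples VALUES (?, ?, ?, ?, ?)",
                (time.time(), vm_id, app, metric, value))
    con.commit()
    con.close()

def replay_window(app: str, start_ts: float, end_ts: float):
    """Return every retained sample for an application inside a transaction
    window, including samples from VMs that have since been decommissioned."""
    con = sqlite3.connect(DB_PATH)
    rows = con.execute(
        "SELECT ts, vm_id, metric, value FROM samples "
        "WHERE app = ? AND ts BETWEEN ? AND ? ORDER BY ts",
        (app, start_ts, end_ts)).fetchall()
    con.close()
    return rows
```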
Linking a virtual event with application performance: Monitoring automation should allow you to determine the impact of the virtual environment on your applications. For instance, if an application migrates to a new virtual machine, you should be able to find out when that happened and if it caused any changes to the application or its performance. The same applies to decommissioning a VM: did the applications perform differently as a result?
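One simple way to express that correlation is to compare response times in a window before and after a recorded virtual event. The sketch below assumes illustrative data shapes, timestamped events and (timestamp, response-time) samples, rather than any specific product's format:

```python
from statistics import mean

def impact_of_event(event_ts, samples, window=600):
    """Compare mean response time in the window before and after an event
    (window in seconds, samples as (timestamp, response_ms) pairs)."""
    before = [ms for ts, ms in samples if event_ts - window <= ts < event_ts]
    after = [ms for ts, ms in samples if event_ts <= ts < event_ts + window]
    if not before or not after:
        return None
    return mean(after) - mean(before)   # positive means slower after the event

# Example: did a live migration slow the order-entry application?
events = [(1700000700, "VM web-02 migrated to host esx-07")]
samples = [(1700000100, 180), (1700000400, 175), (1700000900, 290), (1700001100, 310)]
for ts, what in events:
    delta = impact_of_event(ts, samples)
    if delta is None:
        print(f"{what}: not enough samples to judge")
    else:
        print(f"{what}: mean response time changed by {delta:+.0f} ms")
```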
Problem isolation: In the cloud, servers are typically arranged in a cluster architecture, so it's often more difficult to determine which server or VM may be to blame when troubleshooting. Yet, just as in the physical environment, monitoring tools need to distinctly isolate where, when, and how an application problem occurred to facilitate a quick fix.
When analyzing issues, IT needs to understand the impact on the user, the particular server or network being accessed, the transaction, the application (including the impact on users when two applications compete for the same virtual resource), Web services, and databases. In the cloud, IT managers also need to see the impact of the VM and the application instance that the transaction used.
Of course, no enterprise can have 100 percent automation to handle every potential scenario of application breakdown. When it comes to performance monitoring and problem resolution, at what point should automation stop and humans intervene? There are simple cases in which a monitoring system can observe, alert, and take remedial action on its own, such as restarting an application server with a known memory leak when memory usage reaches a certain threshold.
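That simple case might look something like the sketch below, in which a watcher restarts a hypothetical appserver service once its resident memory crosses a threshold. The service name, threshold, and the Linux-specific way of reading memory are all assumptions for illustration:

```python
import subprocess
import time

SERVICE = "appserver"       # hypothetical systemd unit with a known memory leak
THRESHOLD_MB = 6_000        # restart once resident memory exceeds ~6 GB

def resident_memory_mb(pid: int) -> float:
    """Read resident set size from /proc (Linux only; assumes 4 KB pages)."""
    with open(f"/proc/{pid}/statm") as f:
        resident_pages = int(f.read().split()[1])
    return resident_pages * 4096 / (1024 * 1024)

def watch(pid: int, interval: int = 60) -> None:
    """Poll the process and take the pre-approved remedial action."""
    while True:
        if resident_memory_mb(pid) > THRESHOLD_MB:
            print(f"{SERVICE}: memory above {THRESHOLD_MB} MB, restarting")
            subprocess.run(["sudo", "systemctl", "restart", SERVICE], check=True)
            return
        time.sleep(interval)
```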
Not all decisions are so clear cut. Consider a bank running a large reporting job that is choking performance in the online transaction processing system that branch tellers use to handle customer business. If the report must be filed that day for SEC compliance purposes, it should take priority over customer transactions and continue to run despite the temporary impact on tellers. In other cases, when a report is not urgent, the customer-facing processing can take precedence.
These are not the kind of decisions that a machine is likely to make. No matter what business rules are in place dictating how to resolve an application or server problem, IT should be able to override those rules under predefined circumstances. When it comes to business-critical decisions, people are still an important part of the equation.
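A rule of that kind, with an explicit human-override path, could be expressed as simply as the sketch below; the job fields, decision labels, and defaults are illustrative only:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class BatchJob:
    name: str
    compliance_deadline_today: bool          # e.g., an SEC filing due today
    operator_override: Optional[str] = None  # "throttle" or "protect", set by a person

def resolve_contention(job: BatchJob) -> str:
    """Decide whether the batch job or the customer-facing OLTP workload wins."""
    if job.operator_override:             # a person's call, under predefined rules, wins
        return job.operator_override
    if job.compliance_deadline_today:     # regulatory deadline: let the report run
        return "protect"
    return "throttle"                     # otherwise, the tellers come first

print(resolve_contention(BatchJob("sec_daily_report", compliance_deadline_today=True)))
# -> protect
```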
Product and service offerings are getting more sophisticated every day, helping organizations transition to private and public cloud environments with less risk. No matter what guarantees a provider offers, however, responsibility for performance issues in the cloud will still rest with the CIO. Having capable people, strategies, and automated systems working together can help IT effectively monitor the hardware and network, as well as applications and user behavior, in real time.
Business-driven processes for measuring and guaranteeing service levels in the cloud will help CIOs successfully transition to this new infrastructure. In an ideal world, effectively monitoring this virtual environment will help resolve problems quickly and prevent them, so users won't have to know or care how or where IT provisions the systems.
This futuristic scenario of “everything is working well” is a problem that most CIOs would like to have, even if it means the IT organization will change as a result. In the meantime, however, CIOs will need to consider and plan for quality of service as they move their mission-critical applications to the cloud.
Zohar Gilad is executive vice president at Precise Software, based in Redwood Shores, California. You can contact the author at zohar.gilad@precise.com.