Q&A: Benefits and Best Practices of Application Performance Management
What is application performance management, how can it reduce the number of availability incidents, and what best practices can help ensure your APM initiative is a success?
In APM Best Practices: Realizing Application Performance Management, Michael J. Sydor tackles the many components of APM projects, from scoping the project to staffing, from solving installation/deployment problems to evaluating your success.
Wisely, Sydor doesn't make any assumptions. How do you know if APM will work in your organization? Sydor drills down to show you how to evaluate whether your organization has the capability to adopt APM.
To learn more about the promise and problems of APM, we discussed some of the key concepts in the book with the author.
Enterprise Strategies: What is application performance management (APM)?
Michael J. Sydor: It is the processes and techniques for collecting, reviewing, and collaborating around performance information to guide technical and business decisions about the capabilities of an application across its life cycle. APManagement is often confused with APMonitoring -- the various tools that may be employed. Anyone can sell you a hammer, but where do you learn how to build a shed, a house, or a condominium complex? This is what my book introduces for enterprises dependent on software systems -- how to do performance management "right."
What benefits does APM promise?
The unassailable benefit is greater visibility into the origin and nature of performance problems, but it is how you employ this information that lets you realize the generally accepted benefits such as "reducing the time to triage" or "improving software quality." The benefits that appeal to stakeholders will change as the organization matures in its use of performance information.
Given today's array of tools, are these benefits actually achievable?
It depends on the array. There are five categories of tools that contribute to APM: logging, synthetic transactions, real transactions, instrumentation, and packet analysis. Each of these tools has a range of applicability and benefits. I don't worry as much about the variety of available tools as about whether they can contribute information that is useful across the application life cycle. A tool used incorrectly is more damaging than having no visibility at all because it gives a false sense of security.
The keys to realizing the benefits of APM are to understand the applicability and limitations of the various tools as well as appropriate processes to employ the information. With this perspective, you will achieve all of the APM benefits. Use the tools poorly or improperly and you really will have a difficult time getting full value.
To get many IT projects approved, proposals need to include cost justifications in the form of return-on-investment calculations. You point out that with APM projects, project leaders shouldn't confuse justification with ROI. What's the difference and why can't we use ROI for APM projects?
The difficulty is in having an accurate measure of the "as-is" state in order to forecast the "future-state." Normally, this starts with incident analysis because reducing the number of incidents is a desirable and expected goal and the costs of the incidents are usually well-known. Depending on your overall maturity, you may only be monitoring availability incidents, resulting in "up" or "down" status. It's black and white and easy to calculate your overall availability and the time to resolve an incident.
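As a quick illustration of just how black and white that arithmetic is, here is a minimal sketch in Python; the outage records and 30-day period are invented for the example:

    # Hypothetical outage records: (start_hour, end_hour) within a
    # 30-day (720-hour) month. Availability monitoring only sees
    # up/down, so this is all the math there is.
    outages = [(10.0, 11.5), (300.0, 300.5), (555.0, 557.0)]

    period_hours = 30 * 24
    downtime = sum(end - start for start, end in outages)

    availability = (period_hours - downtime) / period_hours
    mttr = downtime / len(outages)  # mean time to resolve an incident

    print(f"availability: {availability:.3%}")  # 99.444%
    print(f"MTTR: {mttr:.2f} hours")            # 1.33 hours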
Performance degradations and end-user experience are much more subjective. These are what APM gives you visibility into. Many organizations don't even consider performance degradation to be a severity incident. Thus, although APM will have some small impact (a reduction) on the overall number of availability incidents, the introduction of performance incidents will often dramatically increase the overall number of incidents. Nothing is broken -- you simply never before had visibility into the nature of performance problems that did not result in an outage.
The end result is that you will be spending more on incident resolution in the short-term because you will have uncovered a previously unaddressed quantity of performance incidents. These are simply not going to be addressed via a reboot of the affected system.
How can APM reduce the number of availability incidents?
Monitoring in production alone will not magically reduce the number of incidents. You have to introduce the technology earlier, either during user-acceptance testing or stress testing, in order to create the opportunity to catch performance issues prior to production deployment. This is where APM visibility has its greatest impact, provided you have a consistent QA practice to leverage. You simply get to detect problems before you are operational. You also get to validate the performance monitoring configuration (dashboards, reports, alert thresholds) that the operations team depends on. You have to show them what normal and abnormal performance looks like to help them transition from availability management to performance management.
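To make "showing what normal looks like" concrete, here is a minimal sketch of deriving alert thresholds from response times captured during stress testing; the sample data, percentile choice, and multipliers are illustrative assumptions, not prescriptions from the book:

    import statistics

    # Hypothetical response times (ms) captured during a stress test.
    baseline_ms = [120, 135, 128, 142, 155, 130, 610, 125, 138, 150]

    def derive_thresholds(samples, warn_factor=1.5, alert_factor=2.0):
        """Derive warning/alert thresholds from a stress-test baseline."""
        median = statistics.median(samples)
        p90 = statistics.quantiles(samples, n=10)[8]  # 90th percentile
        return {
            "warn_ms": warn_factor * median,  # slower than normal
            "alert_ms": alert_factor * p90,   # clearly abnormal
        }

    # These numbers become the alert thresholds that operations
    # validates before the application goes live.
    print(derive_thresholds(baseline_ms))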
You write that to determine how to create an efficient APM initiative, you must assess your current environment; that is, you need to perform four assessments: application, skills, visibility, and performance. Despite performance being in the very name of the initiative, you point out that application assessments are the most important. Why?
The application assessment, or app survey, collects a variety of information necessary to make decisions about the type and scale of monitoring technology that will be employed. This affects the overall sizing of the APM environment and the schedule of when specific capabilities will be available. Every application is different and APM has a spectrum of techniques that can be applied. The simple question is, "How much do we need?"
It has been stunning to see how little a monitoring team will actually know about the very applications they are responsible for. This gap is systemic -- the SNMP-based monitoring that APM augments never needed any of these considerations. SNMP is focused on platform metrics (about a dozen) while APM generates many thousands of metrics, depending on the type of application. You need to know some attributes of the application to decide on a monitoring strategy and to forecast the impact on the APM environment. This is really basic stuff, but it's a consideration that the monitoring team never had to appreciate prior to APM.
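As a back-of-the-envelope illustration of why those attributes matter -- all figures below are assumptions for the sketch, not numbers from the book -- you can forecast the metric load one application will place on the APM environment:

    # Illustrative forecast of metric load from app-survey attributes.
    def forecast_metric_load(agent_count, metrics_per_agent, interval_sec=15):
        """Estimate live metrics and samples stored per day for one app."""
        total_metrics = agent_count * metrics_per_agent
        samples_per_day = total_metrics * ((24 * 3600) // interval_sec)
        return total_metrics, samples_per_day

    # SNMP-style platform monitoring: about a dozen metrics per host.
    print(forecast_metric_load(agent_count=4, metrics_per_agent=12))

    # Instrumentation-based APM: thousands of metrics per agent.
    print(forecast_metric_load(agent_count=4, metrics_per_agent=3000))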
What are the biggest mistakes IT makes in their APM initiative?
There are two: not sharing the responsibility for APM among two or more people, and not monitoring the APM environment for performance and capacity. A challenge with APM is that, for a single application, APM is not a full-time job. You will have about 5-20 hours of work, over a six-month period, to get an application through test and into production. It is difficult to keep your APM skills fresh when they are being exercised so infrequently. It is even more difficult when that individual walks out the door. It's a symptom of a small APM initiative (1-5 applications) -- which is the way many of these projects start: the initial team members do not last through operational deployment and the initiative flounders. The solution is simple -- train two people!
Also, these small initiatives represent about one to three percent of the APM solution capacity. No one is ever thinking about exceeding capacity. When an initiative is successful in accumulating additional applications over the next two to four years, the environment can become saturated and may suddenly collapse. No one was watching capacity. This is also something that I attribute to the SNMP mindset -- capacity, storage, large numbers of workstations, etc. -- none of this was ever a concern for SNMP. "It's a monitoring tool and APM looks just like it. Why is there a difference?" These are the gaps I most often address to put an initiative back on track.
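A simple projection -- again with assumed numbers, here a pilot using 3 percent of capacity and each year's onboarding adding roughly 30 percent of load -- shows how quickly a successful initiative can saturate an environment sized only for the pilot:

    # Hypothetical capacity projection for an APM environment.
    def years_to_saturation(utilization, apps_per_year, load_per_app):
        """Count years until onboarding growth exceeds 100% capacity."""
        years = 0
        while utilization < 1.0 and years < 10:
            utilization += apps_per_year * load_per_app
            years += 1
        return years, round(utilization, 2)

    # Pilot uses 3% of capacity; 10 new apps/year at ~3% load each.
    print(years_to_saturation(0.03, apps_per_year=10, load_per_app=0.03))
    # -> (4, 1.23): saturated in about four years, matching the
    #    two-to-four-year window where growth outruns the pilot sizing.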
What two or three best practices can you recommend that make the most difference in preventing these mistakes?
The most important is a runbook documenting exactly how a transaction was defined, or an agent deployed -- for your environment. You cannot simply point your teams to the docs and have them "figure it out." You got it right once -- do the world a favor and document it! The common APM activities take 10-20 minutes to accomplish -- following a cookbook. It will take a lot longer when you have to re-invent the wheel -- and often without even the benefit of vendor training.
The second best practice is even easier: ensure that any complaints about the APM environment are logged as incidents. This ensures that someone will notice when performance degrades or capacity is exceeded. You need to treat the APM environment the same as you would a mission-critical application because when you are successful with your initial deployment, you will have the opportunity for significant and rapid growth. With growth comes responsibility. The same incident tracking that you would perform for any other significant application is fine.
In terms of the project's structure, you mention that "ownership" of the project should be shared by IT and business users. What are the benefits of such joint ownership? Doesn't such a structure introduce other problems (I'm thinking of politics, for example)?
Politics are part of the dance between IT and the Business. New initiatives such as APM are a magnet for politics. The easiest way to defuse politics is to share the glory and the responsibility for the success of the initiative.
When I assess an organization's progress with APM, skills and processes are the language but what I'm really measuring is the degree of collaboration -- who has access to the performance data. If the tools and processes only benefit a narrow audience, everybody loses, and this is exactly the case when an application team makes a technology selection in isolation just to avoid the politics. Their instincts are right -- they need APM visibility to bring a project under control -- but without additional support (budget, staff, experience), they are putting the initiative at risk.
Of course, everyone struggles with budgets and staffing. IT is always under pressure to reduce costs, so no one wants to admit that there is an unexplored cost between selecting a tool and practicing APM (that is, being able to actually use APM successfully). The cost is not huge but it is definitely not zero. It is just that you do not know what it will actually take. Showing you how to conduct this assessment and understand the skills, processes, and the overall life cycle of APM -- this is the gap that my book addresses.
For me, it all comes down to "showing value." If you know what successful APM looks like, then you can drive the conversation to get the right participation and achieve that outcome. You just cannot expect to do it in isolation. You will need to partner to get the right level of support, and these partners expect that you understand what is needed and how to get it done. Your partners are the other stakeholders in the application life cycle. When they see the value, you get their support and participation. That's what makes APM most effective and achievable.