Q&A: Best Practices for SLAs
Are SLAs still useful, and can metrics really measure what's important?
Are service-level agreements (SLAs) still beneficial, or have they outlived their usefulness? In increasingly complex environments -- complicated by virtualization and cloud computing -- can SLAs even define the right metrics? To learn more, we turned to Alex Bewley, the CTO of uptime software, a systems management software vendor that focuses on virtual and physical server monitoring, capacity planning, and service and application monitoring using a single interface.
Enterprise Strategies: Are service level agreements vital to every organization? What value do they bring?
Alex Bewley: SLAs are not necessarily vital to an organization. Many people run their IT departments and businesses without them. However, not tracking SLAs is similar to not completing performance reviews with staff. With nothing to strive for and no benchmarks, how will you know if your staff is performing at the necessary level? How will you know if they are succeeding or merely doing the "bare amount of work required to get it done"?
SLAs work in the same manner -- they highlight what is important to the parties involved: the business units that generate revenue and the IT department that supports the business units in delivering value. Simply put, SLAs help create focus on things that are important, such as "we need $10,000 revenue per hour from e-commerce application A." Although there may be many infrastructure components underneath, there is really only one key metric that matters at the end of the day.
What are the problems that typically arise with SLAs?
The general problem around an SLA is quantifying a meaningful metric to focus on. Dollars per hour, orders per minute, or claims per day are relevant metrics, followed by user experience metrics. SLAs filled with technical metrics do nothing to enhance the IT/business relationship because the business can't link the SLA to a desired result.
Given how long SLAs have been used with IT and business users, why do these problems still arise?
Although SLAs have been around a while, things can still go off the rails due to human error in tracking. However, the advances in tooling are making it easier to track and manage SLAs. Additionally, people have historically considered SLAs as part of big contracts that go with outsourcing (and these can be very complicated). However, today's systems management tools make creating infrastructure-based SLAs much easier. It's important to collect data from many infrastructure components, blend it with application-level metrics and end-user performance, and understand higher-level business metrics (such as $/hour).
New technologies are making it difficult to create meaningful SLAs and measure against them. For example, in cloud computing environments, performance may slow due to problems at the cloud provider, something over which IT has no control. In a growing self-service world, users may try to run poorly formed reports that try to boil the ocean, slowing performance. How can IT and business users build SLAs that handle such conditions and still hold accountable outside providers?
Ultimately, it will be important to understand all aspects of an application, and how it is divided across environments such as the cloud. End-user experience management will become a greater factor in SLAs. It will be increasingly important to understand where deficiencies in the application exist (network issues, availability of a provider's data center) to pressure providers to provision additional resources (e.g., having East Coast and West Coast data centers or international data centers). Also, there's a reason for using outside providers: usually the business has decided that it's cheaper than running an application in-house. However, the lack of true control, occasional downtime, and user latency need to be factored in.
What are the key components of an SLA?
The key components are the actual business metrics that are important, a component of end-user experience (e.g., all screens must take less than 3 seconds to draw), identification of the business hours the application is running (not all apps are 24x7), and how much downtime (and with what frequency) is acceptable. People commonly underestimate how difficult it is to achieve high availability for applications, and that creating and maintaining super-high-availability infrastructure costs money. Quite simply, the higher the availability needed, the higher the cost to produce it.
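Editor's illustration: the link between an availability target, the hours an application is actually in service, and the downtime that remains acceptable is simple arithmetic. The sketch below is a minimal Python example; the function name and the 52-week year are assumptions made for illustration and do not come from any particular tool.

```python
def allowed_downtime_minutes(availability_pct, hours_per_day, days_per_week, weeks=52):
    """Return the downtime budget, in minutes, for the given service window.

    availability_pct is the target expressed as a percentage (e.g. 99.9).
    The 52-week default is an assumption used to approximate one year.
    """
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    service_minutes = hours_per_day * days_per_week * weeks * 60
    return service_minutes * (100 - availability_pct) / 100


# A 24x7 application at 99.9% may be down roughly 524 minutes per year.
print(round(allowed_downtime_minutes(99.9, 24, 7), 2))

# The same application at 99.999% has only about 5 minutes per year.
print(round(allowed_downtime_minutes(99.999, 24, 7), 2))

# A 9-to-5, Monday-to-Friday application at 99.9% has about 125 minutes.
print(round(allowed_downtime_minutes(99.9, 8, 5), 2))
```

Comparing the first two results shows why each additional "nine" is costly: the budget shrinks by a factor of 100, leaving almost no room for maintenance inside service hours.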
What's the first thing IT or business users should do before creating an SLA? (Please provide examples of both good ideas and bad ideas from what you've seen and have dealt with in the past.)
Here, the key component is really figuring out what the key business metric is, and not what the specific infrastructure metric is. Infrastructure metrics will be instrumental feeders into the overall SLA quality, but they are not useful in defining the key business metric.
Bad examples of SLAs include:
- "Application X needs to be up 99.999% of the time." This doesn't consider that your users are only in the office during the week between 9am and 5pm. Outside of these hours, 99.999% uptime isn't necessary. In fact, there must be time for maintenance and tweaking during the "allowed" downtime.
- Isolated infrastructure metrics, such as total bandwidth through a switch or a port check on an e-mail server, are not a good basis for e-mail SLAs. Users only care that e-mail "works," so it's better to test the entire process of e-mail delivery and receipt (both from inside and outside your network), and not just the availability of particular infrastructure components.
These metrics are important indicators that something may go wrong (and it's important to factor them in), but they should not form the key basis of an SLA.
What three SLA tips or best practices can you offer to avoid the typical pitfalls of most SLAs?
1. Baseline everything and then back-test your SLA against the collected data. This removes the IT risk (and job risk) of signing up for an SLA doomed to fail. Your SLA tooling should be able to do this for you.
2. Use the above to determine what an acceptable level of service should be for the price your business units are willing to spend. Don't sign up for levels that you can't possibly deliver out of the gate.
3. Don't overanalyze your environment. Instead, opt for incremental improvement. Collaborate between business and IT, and do monthly reviews to tweak and tune what is important.
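Editor's illustration: back-testing a proposed SLA against baseline data, as described in the first tip, can be sketched in a few lines of Python. The sample values, the 3-second threshold, and the 95% compliance level are all hypothetical numbers chosen for this example, and the function is not taken from any product.

```python
def backtest_sla(samples, threshold_seconds, required_fraction):
    """Check a proposed response-time SLA against historical samples.

    Returns (achieved_fraction, would_have_passed).
    """
    if not samples:
        raise ValueError("need at least one baseline sample")
    met = sum(1 for s in samples if s <= threshold_seconds)
    achieved = met / len(samples)
    return achieved, achieved >= required_fraction


# Hypothetical baseline: page draw times in seconds collected before signing.
baseline = [1.2, 2.8, 3.5, 2.1, 4.0, 1.9, 2.5, 2.9, 3.1, 2.0]

achieved, passed = backtest_sla(baseline, threshold_seconds=3.0, required_fraction=0.95)
print(f"{achieved:.0%} of samples met the target; SLA would have passed: {passed}")
```

In this made-up data set only 7 of 10 samples meet the target, so committing to 95% compliance would mean signing up for an SLA that the baseline says would already have failed.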
Are SLAs still valuable or are they an outmoded monitoring mechanism? If they are still valuable, where do you think they're headed?
SLAs are definitely valuable, and even more so now that applications are becoming highly distributed across many different kinds of environments (physical, virtual, and cloud). Maintaining and reporting on SLAs is going to become increasingly difficult, and it's incumbent on the systems management vendors to keep up to speed with the new technologies and platforms that are required to deliver applications.
Creating an SLA is just one part of the equation. The other is measurement. What are some of the ways an organization can solve problems before they occur to ensure service levels are met?
The first step to enlightenment is illumination. Tracking things such as end-user response time and availability of underlying infrastructure, and using proactive notifications of potential outages, can decrease the risk of missing an SLA. It is mandatory to have a systems management infrastructure-monitoring tool that can both report on SLAs and alert IT of events that might lead to a missed SLA. This type of proactive SLA risk alerting can provide IT with enough lead time to fix the issue before the SLA is negatively affected.
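Editor's illustration: one simple way to raise an SLA risk alert before the SLA is actually missed is to project downtime consumed so far across the whole reporting period and compare it with the budget. The Python sketch below assumes a linear projection and an 80% warning ratio; both are illustrative choices, not a description of how any specific monitoring product works.

```python
def sla_at_risk(downtime_used_min, budget_min, period_elapsed_fraction, warn_ratio=0.8):
    """Return True if projected downtime reaches warn_ratio of the budget.

    period_elapsed_fraction is how much of the reporting period has passed
    (0 < fraction <= 1). The projection is linear, which is an assumption.
    """
    if not 0 < period_elapsed_fraction <= 1:
        raise ValueError("period_elapsed_fraction must be in (0, 1]")
    projected = downtime_used_min / period_elapsed_fraction
    return projected >= warn_ratio * budget_min


# Hypothetical month with a 43.2-minute budget (99.9% of a 30-day month).
# 20 minutes used after a quarter of the month projects to 80 minutes.
print(sla_at_risk(20, 43.2, 0.25))  # True: alert IT now

# 5 minutes used at the halfway point projects to 10 minutes.
print(sla_at_risk(5, 43.2, 0.5))    # False: no alert
```

The warning ratio is what buys lead time: the alert fires while budget still remains, rather than after the SLA has been breached.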
What role does uptime software play in this discussion?
Our company created up.time, a complete IT systems management software solution that simplifies performance and availability management across physical, virtual, and cloud environments.
Up.time provides easy SLA monitoring and reporting, including the ability to baseline and back-test your SLA (this removes the IT and job risk of signing up for an SLA doomed to fail). It provides proactive SLA risk alerts that give IT enough lead time to fix issues before the SLA is negatively affected, and lets users set up SLAs quickly, then adjust them over time for incremental improvements.