Q&A: Preventing Downtime
It's not enough to respond to downtime; what's needed is a new mindset that focuses on preventing downtime.
With IT already stretched to the max, the last thing you need is systems going down, preventing customers from interacting with your company and bringing employee productivity to a halt. IT can no longer be satisfied with uptime guarantees, no matter how many "9s" are in that guarantee. What's needed is a new mindset, one that focuses on preventing downtime, not just managing it. To learn more about this shift, we spoke with Dave Laurello, president and CEO of Stratus Technologies, a company whose products combine round-the-clock monitoring with uptime assurance.
Enterprise Strategies: What do you mean by IT downtime? What is the standard associated with IT downtime?
Dave Laurello: The term downtime refers to periods when a system is unavailable; downtime can be planned or unplanned. Uptime, also called availability, is the opposite and can be defined as the amount of time a system is fully operational during a defined period. For example, if a system is fully operational 100 hours per week on a 24/7 operating basis, its availability is 100/168, or roughly 60 percent.
Availability is typically expressed as 9s. "Three nines" availability means that a system is fully operational 99.9 percent of the time; that's an average of nearly nine hours of downtime per year, or 10 minutes a week.
I would argue that any amount of downtime is too much, but availability is often grouped in three categories: conventional, high availability, and continuous availability. Conventional availability can be defined as 99 percent availability, or an average of 87 hours and 40 minutes of downtime a year. High availability falls in the 99.90 to 99.95 percent range, meaning between 4 and 8 hours of downtime a year. Continuous availability requires a minimum of 99.999 percent availability, which is just over 5 minutes of downtime per year. Stratus servers have consistently operated above 99.9997 percent uptime; that's about 90 seconds a year.
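To make those figures concrete, here is a rough, illustrative sketch of the underlying arithmetic (not a Stratus tool), assuming a 365.25-day year; all of the numbers it prints are approximations of the figures quoted above.

```python
# Rough illustration of the availability arithmetic described above.
# Assumes a 365.25-day year; all figures are approximate.

HOURS_PER_YEAR = 365.25 * 24  # 8,766 hours

def availability_pct(operational_hours: float, total_hours: float) -> float:
    """Availability as a percentage of a defined period."""
    return 100.0 * operational_hours / total_hours

def downtime_per_year_hours(availability: float) -> float:
    """Expected hours of downtime per year at a given availability percentage."""
    return HOURS_PER_YEAR * (1.0 - availability / 100.0)

print(round(availability_pct(100, 168), 1))            # ~59.5 (100 operational hours/week)
print(round(downtime_per_year_hours(99.0), 1))         # ~87.7 hours: conventional
print(round(downtime_per_year_hours(99.9), 1))         # ~8.8 hours: "three nines"
print(round(downtime_per_year_hours(99.999) * 60, 1))  # ~5.3 minutes: continuous availability
print(round(downtime_per_year_hours(99.9997) * 3600))  # ~95 seconds
```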
Why is it so important for businesses to prepare for potential downtime? What are some of the things IT does as part of their preparation?
It used to be that only a handful of applications were considered business critical. Today, the opposite is true. In our experience, most enterprises (especially small and midsize businesses) really don't know how downtime will impact operations or their bottom line.
The Aberdeen Group pegs the cost of downtime for the average company at $150,000 per hour. Even if it's $20,000 or $30,000 an hour, that's still a major blow to a business. Although hard costs are obviously important, the damage to reputation and customer satisfaction are also at the top of the list of downtime concerns for many enterprises.
Businesses need to determine the real hard and soft costs of suspending business operations for a few minutes, an hour, or a day. Only then can IT make informed decisions about cost-effective uptime solutions. In end-user surveys conducted with Information Technology Intelligence Corporation (ITIC), Stratus has found that a majority of companies have no idea what their cost of downtime is. For those that do make calculations, many omit several factors that contribute to downtime cost. Some of the alarming findings include:
- 81 percent of businesses don't factor lost goods and materials into their cost of IT downtime.
- 45 percent of businesses don't consider lost sales revenue as a factor in their IT downtime cost.
- 29 percent of businesses don't consider customer dissatisfaction a factor contributing to their cost of IT downtime.
- 38 percent of businesses don't consider damage to their company's reputation as a contributing factor to the financial impact of downtime.
After calculating the cost of potential downtime, IT should think long and hard about which solution to implement. It's important to note that although most vendors offer solutions that address downtime after it has occurred, there are proactive solutions available that can prevent outages from happening in the first place.
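As a purely illustrative sketch, a downtime cost estimate along these lines might look like the following. Every figure is a hypothetical placeholder, not survey data; the point is simply that the factors companies commonly omit, per the survey above, belong in the total.

```python
# Purely illustrative downtime cost estimate. Every figure below is a
# hypothetical placeholder; each business must supply its own numbers.

def downtime_cost(hours_down: float,
                  lost_sales_per_hour: float,
                  idle_labor_per_hour: float,
                  goods_and_materials_lost: float,
                  recovery_cost: float,
                  reputation_and_churn: float) -> float:
    """Estimated total cost of an outage, combining hard and soft costs."""
    per_hour_losses = (lost_sales_per_hour + idle_labor_per_hour) * hours_down
    return per_hour_losses + goods_and_materials_lost + recovery_cost + reputation_and_churn

# Example: a four-hour outage with made-up numbers.
print(downtime_cost(hours_down=4,
                    lost_sales_per_hour=20_000,       # omitted by 45% of businesses surveyed
                    idle_labor_per_hour=5_000,
                    goods_and_materials_lost=10_000,  # omitted by 81%
                    recovery_cost=15_000,
                    reputation_and_churn=25_000))     # omitted by 29-38%
```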
What type of negative effects could occur to a business experiencing significant downtime? We all know the common effects of downtime, such as lost worker productivity and lost sales. What are some of the less obvious effects?
Some companies may have performance clauses; this is particularly true for just-in-time manufacturing, financial services, and other time-sensitive operations. Regulatory compliance violations also can come with a cost. The impact of downtime can spill over into the valuation of a public company if the incident or incidents materially affect revenue and profitability. Downtime cost can go far deeper than most people realize.
It may not be immediately obvious, but a company's reputation and customer satisfaction can be enormous cost factors. It's very, very hard to win new customers, but it's a cinch to lose them. One bad experience and you've annoyed a customer. One or two more and you've probably lost that customer for life, along with the many potential customers they shared their dissatisfaction with. Unhappy users can and will vent their displeasure through Twitter and Facebook, reaching hundreds, perhaps thousands, of potential customers.
What is the most important thing businesses should consider when weighing the importance of preventing downtime?
There is preventing downtime and there is recovery from downtime. Most vendors offer only products designed to get businesses up and running again after the failure has occurred. That means there will be downtime, application restarts, possibly data loss, and an inability to figure out what actually caused the outage in the first place. Solutions from Stratus (the company I work for) are designed from the ground up to keep businesses running even when there is a component or system failure. Beyond the products themselves, a key part of doing this successfully is proactive remote monitoring and availability management.
Another thing is that virtualization technology can really help to improve uptime. The paradox with virtualization, though, is that it actually makes hardware reliability more important. Consider that with virtualization you have more applications relying on the health of a single server. You cannot migrate applications off a dead server. Consider, too, that products such as VMware vCenter Server management software may be a single point of failure; if the server it's on fails, the virtual environment cannot be managed or controlled. This is also a matter of prevention versus recovery. Know what you are being sold.
Is it necessary to calculate downtime? How many organizations do you estimate actually do so?
Our own research over several years shows that a majority of companies do not calculate their cost of downtime. In fact, our latest survey with ITIC found that 52 percent of businesses do not know the potential financial impact of IT downtime. Referring back to your second question, what's even more surprising is that of those companies who do calculate downtime, many aren't doing it accurately.
Instances of potential downtime happen all the time. It really doesn't make a difference what business you're in, the products or services you sell, or the time of year. The reason many crashes happen -- whether it's computing services in the cloud, in the corporate data center, or some combination -- is that failure is assumed to be unavoidable.
When you assume failure is unavoidable, the natural propensity is to gear up with inexpensive computer and network components, then spend a lot of money fixing problems, such as buying solutions to recover from the inevitable failures. The mindset is to build cheap, expect downtime, and condition customers to accept that fact. The consulting firm Saugatuck Technology described this as architecting the system to satisfy the "adequate" expectations of users (Research Alert, IT Cloud Services: Designed to Fail?, Nov. 3, 2011).
Some people will argue that everything fails sooner or later. Yes, things do, indeed, break, but it needn't affect the applications and the people using them. Approaching the problem from the direction of failure inevitability is self-serving. It's a failed cost-driven model. It relegates customer service and satisfaction to secondary importance. The fact is that this approach usually ends up costing more money over the long run, not less.
Tell us about a time when you were personally affected by IT downtime.
Here's a really simple example that happened to me. Like every good son, I called to wish my mom a happy Mother's Day. When she didn't thank me for the flowers that I had ordered for her, I asked how she liked the arrangement. To my surprise, I learned the florist never delivered them. I called the florist to see what had happened, and was told that their computer system had been down for two days, and none of the orders placed for Mother's Day had gone out.
Mother's Day is the florist industry's Christmas season. Like many businesses, from the smallest to the largest, this florist probably leaned on the familiar excuses for not anticipating the havoc an outage can cause: my business can ride through an occasional outage; investing in uptime assurance is too expensive; what I have is good enough. What they're really saying, whether they realize it or not, is that their customers are not their first priority. This florist has no idea what this misadventure has cost in lost business. I can confidently say I will not use them again for Mother's Day or any other occasion, and I'm sure I'm not the only person who feels this way.
Is there a difference between uptime in the cloud and uptime in a virtualized environment?
Uptime is uptime, whether it's for your data center or private cloud (where you have control) or it's covered in the service-level agreement from a cloud service provider. Uptime assurance is governed by using smart technology, smart management practices, and proactive monitoring and systems management. Many say that you will get better uptime from public clouds than you will from your own data center. In many cases, that's probably true, but not because cloud providers are doing such a tremendous job. I believe that for the foreseeable future, putting critical applications into public clouds is a recipe for disaster. The industry simply isn't mature enough to deal with mission-critical computing.
Massive consolidation is underway and will continue to destabilize the service provider landscape. There are no standards to speak of. Cloud service providers are still focused on the cost-driven model for their infrastructure. They are building for volume more so than quality. Their mindset is failure recovery and skimpy SLAs.
Where do you see the market for uptime going in 2012? In 2013 and beyond?
The market can only grow. Many businesses and employees simply cannot function without IT systems being available 24/7. Mobile apps are a huge driver for uptime assurance. The spate of major outages among cloud service providers in 2011 only helps to spotlight the broad and intense pain IT failure can inflict. Awareness of downtime incidents used to be restricted to the company where it occurred. Now, with widespread cloud adoption it's in the headlines and on social media.
For the next three to five years, uptime will be a major factor for determining where critical applications reside. Companies and organizations will be evaluating their applications to determine which can safely and securely be turned over to cloud computing and which absolutely need to be retained within the internal IT infrastructure. There are many legacy workloads and technologies working just fine where they are, including applications that are essential to business success. On the other hand, new applications will be increasingly written and optimized for cloud computing.
Businesses relying on the cloud will be forced to examine the technology they have in place to ensure that they can deliver uptime to their customers. With everything relying on servers, from smartphones to emergency rooms, 2012 will be the year in which quality of service in the cloud and server uptime assurance will come into sharper focus.
What products or services does Stratus Technologies offer that are relevant to our discussion?
Stratus products integrate round-the-clock monitoring services with advanced, resilient technologies for comprehensive uptime assurance. Stratus' approach to availability is to detect, isolate, and correct system problems before they cause system downtime or corrupt valuable data. Our product line includes ftServer systems for organizations with mission-critical applications that require continuous protection against downtime and data loss; ftServer hardware and software handle errors transparently, shielding the operating system, middleware, and application software. Additionally, Stratus Avance high-availability software delivers superior and reliable uptime without the cost and complexity of clusters.