Q&A: Managing Cloud Downtime
One of the key benefits of moving to the cloud is the economic advantage the technology offers, at least until downtime seriously affects your customers’ ability to interact with your enterprise. We asked Joe Graves, chief information officer at Stratus Technologies (a high-availability solutions provider), what your enterprise can do to ensure that such service disruptions don’t cripple your business.
Enterprise Strategies: How would you characterize the outages from Amazon Web Services (AWS) that have been in the headlines recently and earlier this year?
Joe Graves: Amazon’s outages have been huge. As a large and high-profile cloud provider, its recent string of outages has affected companies such as Netflix, Pinterest, Instagram, Reddit, and Quora, among others. All of these brands have one critical thing in common: their websites are their product and revenue centers, which means every second of downtime compromises clients’ businesses and costs money. Any downtime, never mind extended downtime, can be debilitating, and it’s surprising that more of Amazon’s customers haven’t followed dating website Whatsyourprice.com in discontinuing their relationship with AWS.
What, if anything, could Amazon have done to prevent the outage?
Cloud providers perpetuate the myth that server downtime is inevitable. This is like saying that it’s inevitable you’re going to get a cavity; but if you brush and floss your teeth properly, that’s much less likely to happen. Too many companies are content with putting a plan in place that simply reacts to downtime. Implementing technology that proactively monitors for threats of downtime and intercedes to prevent it from occurring makes more sense. Our customers, for instance, tell us over and over again that the first time they hear about a potential problem is when we call them to let them know our proactive monitoring technology has picked up on something. We solve the problem before it has a chance to manifest.
What, if anything, should Amazon's customers such as Netflix and Instagram have done to prepare for a possible outage?
It’s important to know what technology and systems your cloud provider has in place to prevent downtime and provide users with a worry-free experience, especially when the data center is also your profit center. In terms of Amazon’s outage in June, AWS officials have publicly said that misconfigurations by customers actually added to the problems. In fact, according to Newvem, a software startup that tracks AWS customer usage, many customers were not taking advantage of multiple availability zones through Elastic Load Balancers (ELB), which could have minimized the amount of downtime experienced by customers. ELBs have the ability to automatically reroute traffic based on availability and need.
Customers using the Amazon EC2 service are always registered for ELB, which automatically distributes incoming application traffic across multiple Amazon EC2 instances for an added cost. Basically, ELBs can detect unhealthy instances and automatically reroute traffic to the healthy location until the problem is fixed. Newvem also found that up to 20 percent of heavy users aren't properly configuring their ELBs. That said, it is important to note the cost, complexity, and overhead of deploying and testing a robust ELB-based solution. When creating an ELB, you are responsible for configuring ports and protocols for your balancer, and you pay monthly for the amount of resources used, which varies based on the operating system and type used. Although the benefits are certainly clear, these factors must be a part of the conversation.
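To make the rerouting idea concrete, here is a minimal Python sketch of a balancer that skips instances failing a health check. It is an illustration only, not Amazon's implementation or the ELB API; the class names, instance names, and zone labels are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    name: str             # hypothetical instance label
    zone: str             # hypothetical availability zone label
    healthy: bool = True  # result of the most recent health check


class SimpleBalancer:
    """Toy round-robin balancer that skips unhealthy instances."""

    def __init__(self, instances):
        self.instances = list(instances)
        self._cursor = 0

    def route(self):
        # Try each instance at most once, starting at the round-robin cursor.
        count = len(self.instances)
        for offset in range(count):
            index = (self._cursor + offset) % count
            candidate = self.instances[index]
            if candidate.healthy:
                self._cursor = (index + 1) % count
                return candidate
        raise RuntimeError("no healthy instances in any zone")


# Two instances in two zones; simulate an outage in the first zone.
pool = [Instance("web-1", "zone-a"), Instance("web-2", "zone-b")]
balancer = SimpleBalancer(pool)
pool[0].healthy = False
served = [balancer.route().name for _ in range(3)]  # every request goes to web-2
```

If both instances sat in the same failed zone, `route()` would have nowhere to send traffic, which is roughly the situation of customers who had not spread their instances across availability zones.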
How will this outage affect companies beyond the downtime they have experienced?
Beyond the thousands of dollars companies likely lost during the hours they weren’t up and running, there are some other lasting effects they will face. Sure, we can say that some customers may leave for competing services, and customers certainly bashed these companies on social media channels, but there are some other factors to consider.
In July of 2011, some Netflix users experienced an outage of the company’s streaming services. Coming just after the announced price hikes, customers were understandably agitated and voiced their concerns just about everywhere, from Twitter to news articles. In response, Netflix offered a 3 percent credit on the company’s next billing cycle to make up for the lost time. Although that only amounted to about 23 cents per user, it was about a $4.6 million hit for the company.
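As a quick sanity check on those numbers, the short Python sketch below works backward from the three figures quoted above. The implied monthly fee and the implied number of credited accounts are derived values, not figures reported by Netflix, and the calculation assumes every credited account received the same 23-cent credit.

```python
# Figures quoted in the interview.
credit_rate = 0.03         # 3 percent credit
credit_per_user = 0.23     # about 23 cents per user
total_credit = 4_600_000   # about $4.6 million overall

# Derived values (assumption: every credited account got the same credit).
implied_monthly_fee = credit_per_user / credit_rate   # roughly $7.67 per month
implied_accounts = total_credit / credit_per_user     # roughly 20 million accounts

print(f"Implied monthly fee: ${implied_monthly_fee:.2f}")
print(f"Implied credited accounts: {implied_accounts:,.0f}")
```

The point is that even a credit worth pennies per subscriber becomes a multimillion-dollar cost once it is multiplied across a large customer base.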
More recently, Microsoft offered its customers a 33 percent credit after an outage hit the Windows Azure platform on February 29th. Additionally, eBay, a company whose revenue rests solely on its site, experienced a complete outage of its search engine in April, making it impossible for shoppers to search the site. As part of its policy, in the event of a title search outage lasting an hour or more, eBay automatically credits all associated fees for affected listings. Obviously, the costs here can add up to a large hit.
You also shouldn’t underestimate the potential long-term impact on customer satisfaction and reputation, if not for the cloud provider, then for the clients of the cloud provider. Our studies done with ITIC show that these two factors rank as the #2 and #4 concerns.
In your opinion, was WhatsYourPrice.com justified in terminating their relationship with Amazon? Doesn't that seem a bit extreme?
This is a prime example of how downtime can tarnish a reputation. WhatsYourPrice.com’s CEO Brandon Wade actually said, “Amazon’s failure has negatively affected our website’s reputation as a reliable online dating destination.” Wade went on to explain just how crucial uptime is for their business model, emphasizing how dating sites require constant accessibility. Recognizing that not all companies are the same, I can definitely understand the site’s motivation for leaving. Because the site launched less than a year and a half ago, it’s critically important that they don’t anger customers now. They aren’t eHarmony or Match.com. They aren’t there yet in terms of clout in their industry. One bad experience truly could lead their customers elsewhere. By ending their relationship with Amazon and moving on to FiberHub, they have invested in a technology they feel will better suit their company and have publicly told their customers they are there for them 100 percent.
Did the public cloud suffer a blow to its reputation because of Amazon's outages?
It’s not so much a blow to the reputation of the cloud as it is a hard lesson learned by Amazon that other companies can learn from. People aren’t discouraged from building homes because of natural disasters; they are bound to happen. What homeowners do is learn to build a firm foundation so that they stand the best chance in the wake of a storm. The same goes for the cloud. There are situations for businesses to consider when deploying a cloud platform. Accounting for cloud outages is one of those. By accepting the probability of downtime, a company can take the appropriate actions to ensure it doesn’t fall victim to an outage.
What multi-cloud strategies are data centers considering? Is a multi-cloud strategy a viable option to protect against downtime?
Although we have yet to find the perfect cloud, the recent outages across the industry, including those with Amazon, seem to be prompting businesses to look into divvying up their workloads among multiple clouds in order to mitigate their risk. One way companies are doing this is by keeping multiple copies across multiple locations in different clouds. That’s as simple as having your data stored in one vendor’s cloud, with a copy in another vendor’s cloud. This may seem expensive at first glance, but companies need to weigh which they would prefer: more investment in the technology they choose or more risk in the amount of downtime they may experience.
Another option is using multiple availability zones, each of which is a physically separated set of infrastructure that includes separate firewalls, switches, load balancers, servers, and storage. As I already mentioned, Amazon offers different availability zones, and it looks as though the companies that invested in those properly were not affected the way those that didn’t, such as Netflix, were. For example, Okta recently said this in their corporate blog: “The recent AWS outage hit one availability zone, US-East-1, within Amazon’s Virginia region, but because of the software and operational investments Okta has made across our five-availability-zone footprint in that region (and in two availability zones in another region) our customers weren’t affected.”
What are the biggest considerations that companies should make when contemplating a move to the public cloud?
Taking a step back from thinking about downtime, there are some other factors companies need to seriously look into when venturing into the public cloud. First and foremost, when you move to the cloud, you lose visibility and have only limited control, because the vendor is responsible for managing the infrastructure. Aside from that, companies need to examine which of their applications they want going into the public cloud. It’s important to figure out if a multi-cloud strategy would be more suitable, as some information could be held in a private cloud. The public cloud also comes with certain security concerns, as it is a prime target for hackers.
Why is it important for companies to calculate the amount of downtime they're faced with? What kind of estimate is reasonable? How is that calculated? Does it vary (by company size, industry)?
Companies that don’t believe calculating their cost of downtime is important should consider this: in February 2012, Aberdeen conducted an in-depth analysis of a number of factors surrounding datacenter downtime and found that compared to figures reported in June of 2012, the average cost of an hour of downtime has increased by 38 percent. For companies defined as those falling in the “industry average,” the hourly cost of downtime is estimated to be $181,770. Even for those classified as “best-in-class,” downtime still costs $101,600 an hour.
By not calculating your cost of downtime, you’re potentially throwing away hundreds of thousands of dollars. When calculating your cost of downtime it’s important to include a variety of factors. I’ve bulleted the most important here:
- Lost productivity/reduced production
- Goods and materials lost
- Financial impact of customer dissatisfaction and negative effects on your reputation
- IT recovery costs, meaning out-of-pocket expenses needed by the IT staff to restore the system
- Employee recovery cost, meaning the time it takes to get back up to speed once applications are back up and running
- Overall time lost
- Potential litigation
Although these factors are for the most part uniform across industries, there is one exception where we need to consider another factor – public safety institutions. For 911 dispatchers, police call centers, and government agencies, lives are literally at risk. If a call center goes down, vital information may not get to the responders and precious time could be wasted before services are dispatched.
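For readers who want to turn the bulleted factors into a number, here is a minimal Python sketch. The hourly figures are the Aberdeen estimates quoted above; the function names and the per-incident dollar amounts in the example are placeholders invented purely for illustration, so substitute your own estimates.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def downtime_hours_per_year(availability_pct):
    """Expected hours of downtime per year at a given availability percentage."""
    return (1 - availability_pct / 100) * HOURS_PER_YEAR


def annual_downtime_cost(availability_pct, hourly_cost):
    """Yearly cost of downtime: expected hours down times cost per hour."""
    return downtime_hours_per_year(availability_pct) * hourly_cost


def incident_cost(factors):
    """Sum per-incident estimates, one entry per factor in the list above."""
    return sum(factors.values())


# Hourly costs quoted from the Aberdeen analysis.
INDUSTRY_AVERAGE_HOURLY = 181_770
BEST_IN_CLASS_HOURLY = 101_600

# 99.9 percent availability is about 8.76 hours of downtime a year.
yearly = annual_downtime_cost(99.9, INDUSTRY_AVERAGE_HOURLY)

# Placeholder per-incident figures, NOT real data.
example_incident = {
    "lost_productivity": 40_000,
    "goods_and_materials": 5_000,
    "customer_dissatisfaction": 25_000,
    "it_recovery": 10_000,
    "employee_recovery": 8_000,
    "litigation": 0,
}
total = incident_cost(example_incident)
```

At the industry-average hourly figure, 99.9 percent availability works out to roughly $1.6 million a year in downtime, which is the kind of estimate worth weighing against the cost of prevention.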