How Recovery-Oriented Databases Help Retailers Handle Black Friday

How to keep IT systems running during the post-Thanksgiving shopping rush

By Tasso Argyros

Black Friday is the Friday after Thanksgiving, when holiday shopping kicks into full gear; early deals, long lines, and traffic fill much of the landscape. Interestingly, the name "Black Friday" started as a comparison to the chaotic and stressful experience of "Black Tuesday," which marked the start of the 1929 stock-market crash. Today, many IT organizations can relate to that chaos and stress as their systems try to support the crush of online shoppers who hit their sites looking for the best deals (there is even a term, "Cyber Monday," for one of the busiest online shopping days).

What happens if the rush of shoppers takes your system down? The headline of a November 2007 ComScore study (see http://www.comscore.com/press/release.asp?press=1914) says it all: "Black Friday Sees $531 Million in Online Retail Spending, Up 22 Percent versus Last Year." The press release pointed out that last year's Cyber Monday spending was expected to exceed $700 million, which would make it the "heaviest online spending day on record." Needless to say, losing a system during the holiday shopping rush could be catastrophic to your business.

The good news is that computer systems are becoming smarter at reducing both planned and unplanned downtime, and this is not just for elite companies with massive IT staffs and budgets. For example, consider the case of an online service provider that works with online retailers and media companies to help customers find the products, services, and content most relevant to them based on their online shopping behavior and interests. This company relies on a sophisticated analytic engine to power its recommendations on a 24x7 basis for high-end consumer clients, such as Sony, BusinessWeek, drugstore.com, and others. Its recommendation engine has roots in "recovery-oriented computing," enabling the company to meet or exceed customers' expectations, especially during a time as busy as Black Friday.

Recovery-oriented computing is based on a simple concept: rather than trying to completely eliminate downtime by throwing expensive hardware at a problem, you should enable systems to recover quickly from failures and software bugs on their own -- instantaneously -- without human intervention. Better still, all of this recovery should happen while running on inexpensive, off-the-shelf commodity hardware.

Building "Always-on" Systems with Recovery-Oriented Computing

Many large online retailers require systems that can handle huge volumes of traffic and transactions on their sites around the clock. In the early 2000s, the Recovery-Oriented Computing (ROC) project was started at Stanford University and the University of California, Berkeley to investigate novel techniques for building highly-dependable Internet services. As noted on the Stanford/Berkeley ROC Web site (see http://roc.cs.berkeley.edu/roc_overview.html): "In a significant divergence from traditional fault-tolerance approaches, ROC emphasizes recovery from failures rather than failure-avoidance. This philosophy is motivated by the observation that even the most robust systems still occasionally encounter failures due to human operator error, transient or permanent hardware failure, and software anomalies." In other words, "stuff happens" and systems go down. The challenge is for retailers to handle the "stuff" while keeping customers happy and costs low.

Traditional systems can become unavailable in one of two ways:

  • Unplanned downtime: the system is completely unavailable due to human errors, software bugs, hardware failures, system shutdown, etc.
  • Planned downtime: the system is fully unavailable due to planned activities such as capacity expansion or upgrades. The system may also be partially unavailable as some or all users see significant performance degradation due to routine operations such as data loading, exports, backups, high user concurrency rates, upgrades, etc.

Avoiding Unplanned System Downtime with Smarter Software

To understand how recovery-oriented computing impacts unplanned downtime, consider the definition of "availability." Availability of a system is generally an expression of the system's readiness to deliver service. It can be expressed in terms of the system's mean time to failure (MTTF) and its mean time to recovery (MTTR), as the fraction of total time during which the system is up:

Availability = MTTF / (MTTF + MTTR)

On small-scale systems, building several layers of redundancy helps increase MTTF. Consequently, some organizations expect that mere redundancy and fail-over can address all of their availability concerns.

However, large-scale systems have tens to hundreds of CPUs, hundreds of disks, many administrators, several applications, and so on; the failure probabilities of each of these parts add up to drastically reduce the MTTF of the overall system. In the face of such diminishing returns on MTTF improvements, the focus must be on reducing MTTR.
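To make the "failure probabilities add up" point concrete, here is a minimal Python sketch. It assumes that components fail independently at constant rates and that the system is down as soon as any one component fails; the function name and the 500,000-hour disk MTTF are hypothetical figures chosen for illustration, not measurements from any real system.

```python
def system_mttf(component_mttfs_hours):
    """MTTF of a system that fails when any one component fails.

    Assumes independent components with constant failure rates, so the
    individual failure rates (1 / MTTF) simply add up.
    """
    total_failure_rate = sum(1.0 / mttf for mttf in component_mttfs_hours)
    return 1.0 / total_failure_rate


# Hypothetical figures: every disk has an MTTF of 500,000 hours.
one_disk = system_mttf([500_000])
hundred_disks = system_mttf([500_000] * 100)

print(f"1 disk:    {one_disk:,.0f} hours between failures")
print(f"100 disks: {hundred_disks:,.0f} hours between failures")
```

Under these assumptions, a hundred such disks fail about a hundred times as often as one: the expected time between failures drops from 500,000 hours to roughly 5,000 hours, or about seven months.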

This is the focus of recovery-oriented computing. A system that instantly recovers from every fault is 100 percent available. Reducing unavailability by an order of magnitude through a ten-fold increase in MTTF is considerably more difficult and expensive than reducing MTTR by a factor of ten, especially in a very large-scale system.
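The trade-off can be checked with a few lines of Python that apply the availability formula above. The MTTF and MTTR values are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
HOURS_PER_YEAR = 24 * 365


def availability(mttf_hours, mttr_hours):
    """Availability = MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


def downtime_hours_per_year(avail):
    """Expected hours of unavailability in one year."""
    return (1.0 - avail) * HOURS_PER_YEAR


# Hypothetical baseline: one failure every 1,000 hours, one hour to recover.
baseline = availability(1_000, 1.0)
ten_x_mttf = availability(10_000, 1.0)   # fail ten times less often
tenth_mttr = availability(1_000, 0.1)    # recover ten times faster

for label, a in [("baseline", baseline),
                 ("10x MTTF", ten_x_mttf),
                 ("MTTR / 10", tenth_mttr)]:
    print(f"{label:10s} availability={a:.6f} "
          f"downtime={downtime_hours_per_year(a):.2f} h/year")
```

Both improvements produce the same availability, because 10,000 / 10,001 equals 1,000 / 1,000.1. The difference lies in what it costs to get there, which is the argument for working on recovery time.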

There are two ways to reduce MTTR: you can preserve the current approach to recovery but engineer the system to perform that recovery faster, or you can reduce the scope of recovery, making recovery both faster and less disruptive.

Commercial software packages are incorporating these principles to avoid unplanned downtime -- the underlying infrastructure can detect both transient and permanent failure of both software and hardware. A server failure is handled by transparently failing-over tasks to data replicas within the cluster. Because of this transparent fail-over capability, tasks complete successfully even when a server fails. This ensures that business processes continue even as administrative staff repairs the failures.
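The fail-over idea can be illustrated with a short Python sketch. This is not the implementation of any particular product; the names ServerDown, run_with_failover, and count_rows, as well as the two-node cluster, are hypothetical and exist only to show a task completing even though the first server that was tried is down.

```python
class ServerDown(Exception):
    """Raised by a replica that cannot serve the request."""


def run_with_failover(task, replicas):
    """Try the task on each replica in turn and return the first success."""
    errors = []
    for replica in replicas:
        try:
            return task(replica)
        except ServerDown as exc:
            # Note the failure and move on to the next replica.
            errors.append(exc)
    raise RuntimeError(f"all {len(replicas)} replicas failed: {errors}")


# Hypothetical cluster: node-a is down, node-b holds a replica of the same data.
healthy = {"node-a": False, "node-b": True}


def count_rows(node):
    """Stand-in for a real query; fails if the chosen node is down."""
    if not healthy[node]:
        raise ServerDown(node)
    return f"result from {node}"


result = run_with_failover(count_rows, ["node-a", "node-b"])
print(result)
```

The caller sees only the result. The failure of node-a is absorbed inside run_with_failover, which is the sense in which the fail-over is transparent; a real system would also need to detect failures, keep replicas in sync, and report the dead node to administrators.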

Avoiding Planned System Downtime with Live Administration

A less glamorous and often overlooked aspect of measuring uptime and availability in systems is planned downtime. Often, companies look to late nights or long weekends as an opportune time to take a system down for upgrades, scaling, or general maintenance. For customers, however, nothing is more frustrating than being stopped by a message stating, "Sorry, our systems are down for routine maintenance. Please try again later." In the 21st century, there is no good time for downtime.

The good news is that advances in system architecture and software have solved these problems by avoiding planned downtime. Administrators can scale a system, remove servers from the cluster for upgrades and repair, load/export data, and more -- all while the system is online.

Take scaling, for example. In November 2007 an online retailing service provider saw a large spike in demand for recommendations powered by their system. Their existing capacity wasn't enough to meet service-level agreements for performance. Most retailers in this situation would have had to wait until after the holiday rush to scale their system capacity or lose the revenue associated with having a system down during prime-time shopping season. However, this company's infrastructure enabled live system administration and they were able to expand capacity by 60 percent in a matter of a few hours. Administrators only needed to input the MAC address and power on the new bare-metal servers they were adding. The system automatically retrieved the software, formatted the drives, configured the network, and rebalanced the existing data. All this was done in the background while the system continued to be available to retailers and publishers during this capacity addition. You can bet the system administrators slept well that night.

Enjoy the Holiday Shopping Rush

All eyes will be on Black Friday this year to see if consumers are rattled by the global credit crisis. Customers may be cautious or decide to spend like never before. The online retailers who are prepared for any ensuing rush will reap the rewards. Those who aren't may find themselves in a different kind of crisis at the end of the shopping season. Thanks to advances in recovery-oriented computing, many systems administrators can relax and find a little more time in their schedules this season.

- - -

Tasso Argyros is the chief technology officer, vice president of engineering, and co-founder of Aster Data Systems. You can reach the author at tasso.argyros@asterdata.com.