Fighting Fires on the Web

Make no mistake: Moving applications onto the Web only makes disaster recovery planning more critical.

On March 21, a San Antonio, Texas-based Web hosting company, Rackspace Managed Hosting, announced a "California Blackout Promotion," raising the eyebrows of e-business managers. The two-year-old upstart seized on the electrical utility outages plaguing California to pose the simple question: "Do you fear losing your Web site's power? In a blackout, you'll lose your site!"

Leveraging fear, uncertainty and doubt (FUD)—the time-honored recipe for selling disaster-related services—Rackspace offered to relocate California-hosted Web sites to its primary data center in San Antonio without charge, provided customers acted before April 15. It noted that this generous offer would save companies "between $1,000 and $60,000" in relocation costs.

According to the company, the marketing effort was a success. One new client is London-based Technojobs. The company's managing director, Stelios Fesentzou, explains, "When our servers were down for as many as 12 hours at a time and our provider wouldn't tell us what was happening, we figured that the power outages were the problem. That's when we decided to cut our losses and move our servers from California to Texas."

Technojobs is just one of a growing list of companies learning an important, though rarely discussed, lesson of e-business: Moving applications from internal hosting onto the Web doesn't eliminate the need for disaster recovery planning—it merely complicates it. If you want your Web-based initiatives to produce revenue streams, you'll need to subject them to the same stringent continuity planning and testing practices that companies have applied to internally hosted applications over the past three decades.

In fact, Web-based application hosting introduces a set of challenges for your disaster recovery planners that goes beyond the hurdles of application deployment. For many companies, the move to the Web merely shifts disaster recovery planning out of the frying pan and into the fire.

Problems with the "Co-Lo" Model
With more and more companies trying to capitalize on e-business, vendors are rising to the occasion with a variety of "New Age" Web-hosting solutions. The list of options is dynamic—service models seem to change on a weekly basis.

Early experiments with "co-location services" have generally produced lackluster revenues for data center owners. Co-location means a service provider rents space within an Internet-connected data center to a business customer who, in turn, uses it to install and operate its own equipment and applications. Joseph Fuccillo, senior vice president with Xand Corp. in Hawthorne, N.Y., terms the model a "failure" and attributes its demise to a "commoditization of data center floor space in major metropolitan areas."

According to Fuccillo, "There's over a million square feet of data center space within a 30-mile radius of New York City, and more is being built every day. Eventually it will probably all be used, but co-lo providers are realizing that they will never recoup their costs by renting the space to customers in four-square-foot increments—the footprint of a server rack. The availability of so much space is driving down prices. As a result, providers are looking for new services to become profitable."

Fuccillo says that many early co-lo players, including Exodus Communications Inc., AboveNet Communications Inc., AT&T and Qwest Communications Intl. Inc., have seized on managed Web hosting as a value-added service model, a strategy that Xand adopted early on. The difference is $1,500 per month for a rack at a co-lo vendor versus $45,000 per month for the same rack space plus management services that include system administration, backup and recovery, and a number of other labor services.

"Businesses are finding that they don't have the people or the expertise to do internal hosting or co-lo," Fuccillo says. "They want operational support: managed hosting services."

The problem with the still-evolving managed hosting model, according to Fuccillo and others, is scalability. Organizations like Digex Inc., which was acquired by WorldCom Inc. to provide first co-lo, then managed hosting services, have seen significant growth in new customers, but quarterly losses in revenue. Clearly, something is missing.

Scaling managed hosting solutions means finding the right balance between the dedication of resources—servers and infrastructure components, such as networks and storage devices—on a customer-by-customer basis and the sharing of infrastructure resources among several customers. The more resources that can be shared, the greater the economies of scale that the service provider can realize. Shared infrastructure is less expensive to deploy and manage. However, shared infrastructure also tends to heighten customer perceptions of security and business continuity risks. Companies don't want their hosted applications held hostage to another company's downtime. Few, if any, vendors have discovered the right combination, says Fuccillo, who admits that Xand has changed its own model four times in the past 18 months.

Yahoo's Big Outage Mystery

Service outages at Web hosting providers are often the result of a failure to scale appropriately. The recent experience of portal service provider Yahoo! Inc. is a case in point.

In early May, a number of Yahoo! servers were taken offline by what was publicly described by its Web hosting service provider, Exodus Communications Inc., as a utility service failure that cut off power to all servers located at its Sunnyvale, Calif.-based Internet data center (IDC). According to statements by Exodus, an explosion in a power company transmission vault cut commercial service to the IDC. Generators that should have kicked in to manage the load failed to come online for mysterious reasons that are still under investigation.

But there was no mystery, at least according to Dave Gulbransen, an engineering vice president for @Lightspeed LLC, a hosting infrastructure provider located in Denver. "I was talking to Exodus technical folks—one engineer to another—and they said the reason for the outage was simple: over-subscription. One of the backup generators at the data center was already online, helping support the load from existing customer systems [operating in production mode]. A second generator, held in reserve for backup, had some trouble when they started it up," Gulbransen says.

"They were getting ready to deploy additional generators for backup purposes when commercial power was cut," Gulbransen notes. "It was just bad timing. Trouble happened, and they didn't have enough backup power."


When it comes to utility infrastructure, the Web hosting industry as a whole is over-subscribed by as much as 500 percent. That's according to Gulbransen, whose full title at @Lightspeed is vice president of design engineering and construction. "For most vendors, it comes down to what you can risk-manage effectively," he says. Gulbransen adds that businesses concerned about the integrity of hosting service infrastructure should consider including "SPA" criteria (space, power and access) in their provider evaluations.

Space refers to the amount of floor space available for racking servers and networking hardware—as well as the heating, ventilation and air conditioning (HVAC) used to provide the right hardware environment. Gulbransen notes that customers are becoming more realistic in their vendor expectations. A few years ago, he says, everyone wanted their racks to be deployed inside closed rooms, which created significant issues for vendors in terms of power and HVAC.

"It was as though they wanted to hide how they connected a router to a switch!" Gulbransen says. "Now there's general agreement that the secret sauce is inside the box, not in the way that the hardware is connected together. Some service providers have cages, others satisfy concerns [about physical security] by using solutions that take five pictures per second of anyone who opens the door on the front of the rack."

According to Gulbransen, @Lightspeed's Denver facility is engineered to provide 105 watts of electrical power per square foot—scalable to 300 watts per square foot. He describes this capability as "unheard of in the industry," which routinely delivers power in accordance with "old telco standards of 35 to 70 watts per square foot." Backing up the approach is 24 megawatts of commercial power entering the facility and 13 two-megawatt Caterpillar generators providing a backup power reserve. To create that capability, the company had to work with the city to redefine its building and zoning codes. @Lightspeed selected an older industrial area of Denver to construct its data center. The location was chosen in part for its ability to safely and cost-effectively store 4,000 gallons of diesel fuel per backup generator.
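The capacity figures quoted above lend themselves to a quick sanity check. The sketch below is only an illustration, not @Lightspeed's own engineering method: the function names are hypothetical, and it assumes, simplistically, that backup generators must be able to carry the entire commercial feed and that all incoming power is available to the equipment floor (in practice HVAC and other overhead take a share).

```python
def backup_coverage(commercial_mw, generator_count, generator_mw, generators_failed=0):
    """Ratio of usable backup capacity to the commercial feed.

    A value of 1.0 or more means the remaining generators could, on paper,
    carry the full commercial load. Assumes every generator delivers its
    rated output, which real facilities derate.
    """
    usable_mw = (generator_count - generators_failed) * generator_mw
    return usable_mw / commercial_mw


def max_floor_area_sqft(power_mw, watts_per_sqft):
    """Upper bound on floor area that a power budget supports at a given density.

    Ignores HVAC, lighting and distribution losses, so the real figure is lower.
    """
    return power_mw * 1_000_000 / watts_per_sqft


# Figures taken from the article: 24 MW commercial, 13 generators of 2 MW each.
all_running = backup_coverage(24, 13, 2)
one_failed = backup_coverage(24, 13, 2, generators_failed=1)

# Floor area the 24 MW feed could serve at the two quoted densities.
area_at_105 = max_floor_area_sqft(24, 105)
area_at_300 = max_floor_area_sqft(24, 300)
```

Under these assumptions, the 13 generators slightly exceed the commercial feed, and the facility still matches it with one generator out of service. That is the kind of margin the over-subscribed site in the Yahoo! outage lacked.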

"By locating in Denver, we avoided the [real estate] costs and regulatory issues of building a comparable facility in California," Gulbransen says. Network access was a third, and equally important, factor in site selection, he adds.

As metro areas go, Denver isn't a "Tier 1" city—that is, one that telecommunications companies regard as having the most robust or diverse telecommunications infrastructure. Gulbransen calls Denver "about a Tier 1.5."

Many hosting centers tout their network accessibility advantage in terms of their proximity to an "E-POP" or "S-POP." This is a point-of-presence supplied by a telecommunications provider that gives the center access to either an Ethernet-based (E-POP) or SONET-based (S-POP) fiber optic ring network that interconnects the vendor to the high-speed core carrier network. Such ring networks exist mostly in Tier 1 cities, including New York, Chicago, Los Angeles and San Francisco. Build-out of these high-speed "pipes" to serve smaller cities and outlying areas has slowed considerably as earnings within the telecommunications industry have declined.

That is not to say that high-speed access can't be provided outside of a Tier 1 city, however. "What is important is the number of router 'hops' the data traversing the network needs to make before it gets into the core carrier network," Gulbransen observes. "We've provisioned our facility with dual-entrance, network access facilities that get us to the core carrier router in the same number of hops as you would expect from a facility located in a Tier 1 area. So the vendors located in Tier 1 areas have no advantage over our offering."

When considering Web hosting providers from a disaster recovery perspective as well as a performance standpoint, Gulbransen recommends that planners start with the SPA criteria. "That covers you to just outside the server box itself," he says.

Getting What You Pay For
@Lightspeed is an example of an infrastructure provider that has seen the handwriting on the wall regarding the need for new, managed services lines of business. Since its inception, the company has enjoyed "solid funding by hands-off investors" to build a facility characterized by redundancy and resiliency, according to Gulbransen. Moving forward, the company plans to add storage services, which Gulbransen regards as a natural evolution of its infrastructure focus: "Bits come in, bits go out, but they also need to have a place to rest. Storage Service Provisioning is our next big move."

Gulbransen says that @Lightspeed is being circumspect about developing a relationship with one or more managed hosting service providers that would use its facilities, preferring to partner with a best-of-breed provider with whom the company can reach a "meeting of the minds with respect to a service level agreement."

Tim Hazard, director of Web Hosting Solutions Development for Electronic Data Systems Corp. (EDS), believes that his company is such a provider. He acknowledges that disaster recovery is an important component of the hosting service selection process and notes that EDS Web Hosting Solutions often include a hot site or tape vaulting agreement with Comdisco Continuity Services.

EDS' Plano, Texas, facility, according to Hazard, "sits on two electrical grids and has enough diesel power and battery backup [to last] for six months." The organization also has its own internal disaster recovery plan, which is drilled on both a routine and ad hoc basis.

"Last week, we were told that a plane had crashed into our hosting facility. It was an impromptu test of our plan, which is routinely audited by KPMG," Hazard notes.

Says Hazard, "Companies that bet the business on their Web site and demand high availability and uptime often approach us with a requirement for redundant, geographically dispersed, load-balanced Web sites. Interest in this strategy for disaster recovery has been growing [since the power outages in California began]. Usually, however, nobody can afford such a strategy. As an alternative, most customers can accept the hardened facilities, the infrastructure redundancies and the internal disaster recovery preparations we have made in our hosting facilities as sufficient protection."

This perspective is echoed by Mark Browning, director of marketing for hosting facility provider eDeltaCom. "An effective infrastructure provider needs to deliver redundancy in: security provisions, HVAC, network connectivity, and a myriad other things, including power, which is currently getting a lot of attention. You get what you pay for. The infrastructure supplied at a hosting center costing $300 to $1,000 per month is not going to provide the same value as one costing $1,000 to $10,000 or more per month. The investment in infrastructure just isn't there."

Browning says that many customers ask about site redundancy and mirroring, although few pursue the option when they learn the price tag. Customers settle for the provider's in-house disaster recovery capability, or else they can avail themselves of hot site or storage mirroring services, or both, provided as a separate agreement through eDeltaCom partner Sungard Recovery Services.

Sanity Check: Do You Really Need Five Nines?

Zerowait Corp.'s high availability engineering services have been leveraged by both private companies and service providers to support "must not fail" e-business applications. Mike Linette, president of Zerowait, is quick to point out that, while a full-blown, load-balanced site mirror may be cost-prohibitive, the fact is that many sites do not require a one-for-one replacement of all hosting hardware components. "Most sites are not taking transactions in volume, so mirroring databases in real-time—an expensive proposition—is often not required in a site replication solution. If this isn't a requirement, you can replicate most sites fairly inexpensively in most cases. Load balancing can be done simply and efficiently using a simple ping or check sum that checks to see whether the primary site is available. If it isn't, traffic goes to the backup server. It doesn't cost very much to shove a server into another hosting company's rack."
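The availability check Linette describes can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Zerowait's product or method: the function names are hypothetical, the host names are placeholders, and a TCP connection attempt stands in for the "simple ping," since a true ICMP ping normally requires elevated privileges.

```python
import socket


def tcp_probe(host, port=80, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def choose_site(primary, backup, probe=tcp_probe):
    """Route to the primary site if it answers the probe, otherwise to the backup."""
    return primary if probe(primary) else backup


# Placeholder host names for illustration only; no network call happens
# until choose_site() is invoked with the default probe, for example:
#   target = choose_site("www.example.com", "backup.example.net")
```

The probe is passed in as a parameter so that the routing decision can be exercised without touching the network. A production setup would also need to decide how often to probe and how many failures to tolerate before switching, which this sketch leaves out.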

For more complex sites, Linette maintains, site mirroring can become complicated and expensive. "High availability engineering costs need to be considered against operations risks. It is analogous to how produce goes to market. Seattle apples that need to get to New York City can take a variety of different paths. A three percent variance in the amount of transit time—the difference between one path or another—doesn't matter much. Perishables like lettuce may be much more time-to-market sensitive. High availability provisions may need to be made for lettuce."

Linette says the same practical logic needs to be applied to hosted Web sites. He observes that the difference between "three nines" (99.9 percent uptime, or 8.76 hours per year of downtime) and "five nines" (99.999 percent uptime, or five minutes of downtime per year) is not huge in terms of downtime. In terms of cost, however, the difference can be enormous. "Planners need to be realistic. If folks can't get into a site for a few minutes per month, is it worth the expense of site duplication with high-availability load balancing? It isn't that it can't be done ... it really comes down to whether it is needed at all."
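The downtime figures behind the "nines" follow from simple arithmetic. The sketch below reproduces them, assuming a 365-day year; the function name is chosen here for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def annual_downtime_hours(availability_percent):
    """Hours of downtime per year implied by an availability percentage."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


three_nines_hours = annual_downtime_hours(99.9)            # about 8.76 hours
five_nines_minutes = annual_downtime_hours(99.999) * 60    # a little over 5 minutes
```

Spread evenly over twelve months, three nines works out to roughly 44 minutes of downtime per month, which is the scale Linette asks planners to weigh against the cost of full site duplication.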


What About Web-Hosting Platforms?
With all the attention on service provider infrastructure redundancy and resiliency, especially in light of the California power crisis, it's easy to overlook the vulnerability represented by the e-business hosting platform itself. As companies move beyond the "online brochure and shopping cart model" and begin to offer interactive applications across the Net, the platforms used to host these more complex applications are increasingly characterized by multi-tier client/server architectures.

A platform for an enterprise-class application enabled for use via the Web typically includes a Web server used to communicate with browser-equipped (and sometimes wireless) clients, an application server used to integrate Web technology with non-Web-ready applications and middleware, a directory server used to manage application provisioning, a security server used to handle application access and secure communications, one or more application hosts that contain the actual application software itself, possibly one or more database servers providing the data for the application software, and a storage infrastructure that is used to store all of the data from the other servers.

Commonly referred to as n-tier client/server platforms, Web hosting platforms are notoriously prone to failure and can be recovered only through a one-for-one component replacement strategy. Although companies like IBM Corp., BEA Systems Inc. and Oracle Corp. are making strides in adding fault tolerance to their application servers, and although many applications today are being designed specifically for Web-based delivery, one-for-one or "n+1" redundancy remains an extremely costly approach to disaster recovery.

Using the right middleware to connect the various application elements together can help reduce expense. Some middleware can be set up on-the-fly to discover and catalog application components dynamically. From a disaster recovery perspective, this is preferable to middleware that refers to communicating servers by machine IDs or other "hard-coded" identifiers. The former strategy facilitates application restoration on hardware other than that used in the original deployment; the latter ties service restoration time to how quickly new platforms can be configured with old machine addresses.
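The contrast between the two middleware strategies can be illustrated with a toy registry. This is a sketch under stated assumptions, not the behavior of any particular middleware product: the class, the role names and the addresses are all hypothetical.

```python
class ServiceRegistry:
    """Toy catalog that maps a logical role to wherever that component now runs."""

    def __init__(self):
        self._endpoints = {}

    def register(self, role, address):
        """Record (or update) the current address for a logical role."""
        self._endpoints[role] = address

    def lookup(self, role):
        """Return the current address for a role; raises KeyError if unknown."""
        return self._endpoints[role]


# Hard-coded style: the caller is tied to one machine's address.
HARD_CODED_DATABASE = "10.0.0.12:5432"  # hypothetical original server

# Discovery style: the caller asks for a role, not a machine.
registry = ServiceRegistry()
registry.register("database", "10.0.0.12:5432")
before_disaster = registry.lookup("database")

# After a disaster, the database is restored on different hardware and simply
# re-registers. Callers that use lookup() need no reconfiguration.
registry.register("database", "192.168.7.40:5432")
after_disaster = registry.lookup("database")
```

In the hard-coded case, recovery time depends on rebuilding a machine that answers at the old address. In the discovery case, it depends only on the restored component registering itself.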

Bottom Line
WWW stands as much for Wild, Wild West as for World Wide Web. As businesses deconstruct internal processes and push applications onto Web hosting platforms for shared use by customers and suppliers, disaster recovery planning will need to migrate with the application.

Your disaster recovery planners will need to become more directly involved with the evaluation and selection of service providers, the design of hosting platforms and the construction of the software itself to ensure acceptable levels of business continuity. Those are new roles in most cases for contingency planners—appropriate for a New Age application hosting solution.