Q&A: Long-Distance Disaster Recovery

While the distance limitations of the technology remain unchanged, vendors have been demonstrating that extended-distance clusters can work

In business continuity planning, the more distance you can put between your primary and secondary sites, the better. With this in mind, we spoke last month with Eric Shepcaro, vice-president of application networking with AT&T, and Joe Weinman, AT&T’s director of application networking. We wanted to get a feel for the issues involved with putting more distance between primary and secondary sites, and what we discovered surprised us: When the speed of light is what’s holding you back, you know you’ve reached an impasse.

Last year, several government agencies published a list of disaster recovery “sound practices” that included, among other items, a recommendation for a minimum of 200 miles between sites. There was a lot of moaning and gnashing of teeth at the time, and most everyone agreed that [such a distance] was impossible. Have things changed at all since then? Are any of your customers implementing extreme wide-area disaster recovery topologies?

Eric: Customers are interested in a variety of distances for a variety of reasons, some having to do with the distance between disaster recovery sites, some having to do with distances for data replication and mirroring, and so on. The issue has always been that there is a fundamental trade-off between the costs of bandwidth and the degree of synchronization required between various copies of the data. The issue isn’t so much the [transport] technology: Fibre Channel running natively over optical fibre has been tested to distances, I think, of over 6,000 kilometers, for example, but that’s a theoretical limit. It’s not something that anyone (not Sun, not HP, not anyone else) is capable of doing now.

Joe: You’ve got to look at the varying technologies. First of all, it is a cost/performance trade-off issue; second, it relates to the cost of the technologies that are involved. [Sun Microsystems’] SunCluster, for example, has been certified at 200 km, but the underlying capabilities of assorted AT&T network services are that we can easily span several thousand kilometers. So it really depends on whether the customer wants synchronous mirroring, asynchronous mirroring of a snapshot, or what they’re willing to pay.

In terms of putting some meaningful space between disaster recovery sites, what’s about the furthest that customers can space sites from one another?

Joe: For DWDM solutions, we have gone with distances that are actual fibre route distances of over 100 km, although those kinds of solutions require additional components like amplifiers or regenerators to reach those distances. It really depends on whether the customer wants synchronous mirroring, asynchronous mirroring of a snapshot, or what they’re willing to pay.

Eric: The distance limitations of the technology over the last three years have actually not changed substantially, so it’s not as if somebody came up with a breakthrough that was able to extend distances, because basically the protocol hasn’t changed. What has changed is that as … vendors have been certifying some of the components of that technology at extended distances.

HP recently did a trial for Oracle 9i Real Application Clusters, and they announced a capability for Serviceguard clusters; those solutions have now been certified at 200 km for mirroring. What’s been going on is not that the distance limitations [of the technology] have changed. What’s happened is that as client interest has grown in extended distances, many of the vendors (EMC, HDS, Veritas, Sun, etcetera, etcetera) have basically been demonstrating that these extended-distance clusters can work.

What are some of the technological limitations associated with extending the distances between sites?

Joe: Well, there’s a fundamental physical limitation in terms of the propagation delays due to the speed of light [in fibre]. There are a variety of technologies for copying data from one location to another location, and there are a variety of layers [at which it can be done]. There’s the storage layer, host layer, access layer, [and] network layer, but sticking for the moment to classic disk-based mirroring, the trade-off you have is this: if you want to make sure that every block of data written to the disk at the primary site matches an exact copy of it at another site at any given microsecond, then each write has to be committed at both locations, completely in sync.

What’s then required is that the first disk must ask the other disk to write that same block, wait for the second disk to say that "I did what you asked me," and then the input/output transaction concludes. If that’s the degree of accuracy you’re looking for, the question becomes how quickly you can perform that cycle over and over again. And the farther apart the two locations are, the natural impact is that you can only go through that cycle a limited number of times per second.
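To put rough numbers on that cycle, here is a back-of-the-envelope sketch in Python. It is not from the interview: it assumes light travels through fibre at about 200,000 km per second (roughly two-thirds of its speed in a vacuum), that each write waits for one full round trip, and that disk, switch, and protocol overheads are zero. The function names are hypothetical, and real figures will be lower than this ceiling.

```python
# Assumption: signal speed in optical fibre is roughly 2/3 of c in a vacuum.
SPEED_IN_FIBRE_KM_PER_S = 200_000.0


def round_trip_delay_s(distance_km: float) -> float:
    """Propagation delay for a request to reach the far site and the
    acknowledgement to come back, ignoring all equipment overhead."""
    return 2.0 * distance_km / SPEED_IN_FIBRE_KM_PER_S


def max_sync_cycles_per_s(distance_km: float) -> float:
    """Upper bound on strictly serialized write/acknowledge cycles per second
    when each write must wait for the previous acknowledgement."""
    return 1.0 / round_trip_delay_s(distance_km)


if __name__ == "__main__":
    # Distances mentioned in the interview: 100 km, 200 km, and 6,000 km.
    for distance_km in (100, 200, 6000):
        delay_ms = round_trip_delay_s(distance_km) * 1000.0
        cycles = max_sync_cycles_per_s(distance_km)
        print(f"{distance_km:>5} km: {delay_ms:6.2f} ms round trip, "
              f"at most {cycles:8.1f} serialized cycles per second")
```

Under these assumptions, a 200 km fibre route costs about 2 milliseconds per round trip, which caps a single serialized stream of synchronous writes at roughly 500 cycles per second before any other overhead is counted.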

So distance, then, is largely a function of the requirements of the application. But for most disaster recovery scenarios, customers typically want to cluster transaction-intensive applications for the purposes not just of data mirroring, but availability, right?

Eric: Yes, that’s the proverbial rub. The question for them becomes, does the application have a requirement for this [synchronous mirroring]? If it does, and that is absolutely mandatory, then that puts a limit on the frequency at which you can write those transactions. But if you relax that criterion slightly and say both copies of the data don’t need to be the same, that we can get by with them being slightly different, then you can do an asynchronous kind of mirroring where the second site is told to write the data, and acknowledgements come back, but you’re not waiting for the first one to come back before you proceed.
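The difference between the two modes can be illustrated with a simplified timing model in Python. This is a sketch under stated assumptions, not a description of any vendor's product: the function names, the 0.1 ms local write time, and the 2 ms round trip are all hypothetical, and the model ignores bandwidth limits and queueing.

```python
def sync_total_time_s(n_writes: int, local_write_s: float, round_trip_s: float) -> float:
    """Synchronous mirroring: every write waits for the remote
    acknowledgement before the next write may start."""
    return n_writes * (local_write_s + round_trip_s)


def async_total_time_s(n_writes: int, local_write_s: float, round_trip_s: float) -> float:
    """Asynchronous mirroring: writes are issued back to back; only the
    acknowledgement of the final write adds a round trip at the end."""
    return n_writes * local_write_s + round_trip_s


def writes_in_flight(local_write_s: float, round_trip_s: float) -> float:
    """Approximate number of writes issued but not yet acknowledged in the
    asynchronous case, i.e. how far the copies may differ at any instant."""
    return round_trip_s / local_write_s


if __name__ == "__main__":
    n = 1000                 # hypothetical number of block writes
    local_write_s = 0.0001   # assumed 0.1 ms per local write
    round_trip_s = 0.002     # assumed 2 ms round trip between sites

    print(f"synchronous:  {sync_total_time_s(n, local_write_s, round_trip_s):.3f} s")
    print(f"asynchronous: {async_total_time_s(n, local_write_s, round_trip_s):.3f} s")
    print(f"unacknowledged writes in flight: "
          f"{writes_in_flight(local_write_s, round_trip_s):.0f}")
```

The model shows the trade-off Eric describes: the asynchronous case finishes far sooner, but at any moment a number of writes have not yet been confirmed at the second site, so the two copies can be slightly different.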

Let’s switch gears a bit and talk about your own services. You recently were a player in a major new disaster recovery implementation at the Chicago Tribune, along with Nortel, Sun, and EMC. Have you worked with these partners in the past, and will you consider working with them in the future?

Joe: With Nortel, we’ve had an extensive partnership with them in terms of building out our ultra available services for several years, and, of course, Nortel has also been a strategic partner of AT&T for our core network infrastructures, so that is an ongoing and evolving relationship to include new areas. We’re also working with Sun in new areas, particularly utility computing. As for EMC, we’re becoming a much bigger partner with them.

We’re actually building joint go-to-market offers with all of them, and you’ll be hearing more and more of them where we actually have productized offerings. It could be around utility computing, it could be around business continuity and recovery services. We definitely see that as an opportunity for collaboration and going to market together.

About the Author

Stephen Swoyer is a Nashville, TN-based freelance journalist who writes about technology.