In-Depth
A Season of Disasters: The Signs Are There
Forget the tape versus multi-hop-mirroring battle. Our storage analyst, Jon Toigo, has some suggestions.
Last week I discussed the seasonal interest in disaster recovery planning, which seems to coincide with the prospect of summer in most parts of the country for the obvious reason: weather. This week my concerns were reinforced when NASA released some scary pictures that showed significantly higher ocean temperatures in “hurricane alley”—a region of water running from the western coast of Africa across the Atlantic to the Windward Islands and Cuba. Since water temperature provides the fuel of hurricanes, the subtext of the message from NASA is: get ready for a bumpy ride as the severe weather season officially begins.
Businesses along the Atlantic coast and around the Gulf of Mexico are at obvious risk of weather-related disasters during the summer—if not from storms themselves, then from the collateral effects: power outages and infrastructure failures, to name just two. A utility room full of plywood to nail up over windows isn't the answer.
Recently, I chatted with someone from a company in Jacksonville, FL who, despite his concern over his firm's lack of preparedness, reported that management was not interested in investing one dime in disaster recovery (DR) planning or data protection. He noted that a branch office in another state had burned to the ground last year in an accidental fire. Because a lot of "tape and baling wire" got operations back up within a short time, without any formal DR provisions, management was lulled into complacency about the need to do any planning at all. The event “taught” them that, somehow, things would always work out.
The difference, he observed, was that the branch office did not have critical, irreplaceable applications or data. It was using what could be termed a commodity-off-the-shelf (COTS) infrastructure and standard shrink-wrapped applications from Microsoft. Important data? Yes, but not mission-critical.
By contrast, the company’s Jacksonville data center is loaded with mission-critical material. Sitting only a few feet above sea level, it could sustain substantial damage even from a small hurricane: a category 1 storm, with a surge potential of only 4 to 5 feet, hitting the northern coast of Florida.
“They just point out that we have a rooftop generator if power goes out,” he said. “But that leaves a couple of questions unanswered. For one, will it be safe to turn it on if the building has water on the floor? Second, how will we get fuel to the generator if the streets are flooded?”
At a minimum, the gentleman wants to put some sort of data replication strategy in place that would leverage his company’s real estate holdings (several branch offices in a broad geography, some of which might prove to be excellent backups for the home office).
The company’s larger processors (he didn’t say if these were mainframes) are already backed up at a hot site operated by IBM in Sterling Forest, NY. However, a lot of important work is being done on Windows platforms: e-mail, SQL Server, Web stuff, and just about all user computing environments. There is no strategy in place for these applications or their data—some of which are just as critical in his view as the mainframe apps.
I hate to hear this kind of story, especially when there are so many really effective approaches—mostly inexpensive—for addressing the situation. When I sit down with the IS managers for the National Football League franchises in a couple of weeks, we will be walking through the list, which looks like this:
Data replication can be done asynchronously or synchronously. The most common async method involves tape backup. Routine backup to tape or other media will provide a copy of critical data that can be relocated and re-hosted on the same or different equipment designated for recovery. At less than 44 cents per gig for media, complemented by fairly inexpensive software, tape does the job of data protection in the majority of shops today. The upside of tape is the portability it affords to data: you don’t need to reload the data onto the same brand of storage hardware used to make the copy. That’s a lifesaver if your recovery site doesn’t have the same equipment profile as the production facility.
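To put the media figure in perspective, here is a minimal Python sketch of the arithmetic. The 44-cents-per-gigabyte rate comes from the paragraph above; the dataset sizes are hypothetical examples, not figures from the column.

```python
# Back-of-the-envelope tape media economics, using the roughly
# 44-cents-per-gigabyte figure cited above. The dataset sizes below
# are hypothetical examples.

MEDIA_COST_PER_GB = 0.44  # dollars per gigabyte of tape media

def media_cost(dataset_gb: float) -> float:
    """Estimated media cost, in dollars, for one full copy of the data."""
    return dataset_gb * MEDIA_COST_PER_GB

if __name__ == "__main__":
    for size_gb in (500, 2_000, 10_000):
        print(f"{size_gb:>6} GB of data: about ${media_cost(size_gb):,.0f} in media")
```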
The much-touted downside of tape is that it is resource intensive. People must be depended upon to take the backups with them as a matter of routine; people must pick the tapes for removal to secure offsite storage; the offsite storage company may employ transport folks who misplace their critical cargo from time to time; and the offsite storage industry itself refuses to articulate best practices for tape handling in its own shops. All of this contributes to a growing unease about the efficacy of the approach. Combine these human-factor issues with scare stories about tapes not restoring properly when they are needed, and you'll find a lot of people looking for alternatives.
However, tape is very resilient and human factors can be taken out of the equation. Look at Arsenal Digital and other service providers who can take backup volumes via networks, reducing the number of hands that touch the data from point of creation to point of restore. E-vaulting has never been simpler or less expensive than it is today.
Problems with Multi-Hop
Some folks, however, are not content with async methodologies. Their data is so critical that they want it back faster than the nominal three-hours-per-terabyte restore rate that tape affords. It used to be that the only solution for these folks was multi-hop mirroring.
Multi-hop mirroring provided a way for a name-brand storage vendor to sell you three arrays when you asked for one. Two were configured to operate in parallel, with special software working in concert with the array controller to replicate writes made to array #1 on array #2 in near real time. A third array was then deployed at a remote location, and a copy process (using the proprietary software and controller of the name-brand vendor) was used to replicate data over distance. A delta was introduced, courtesy of Einstein, between the data on #2 and #3, but what’s the loss of a little data among friends? If a disaster took out the facility where #1 and #2 were located, #3 could be brought into production and work could continue with minimal hassle.
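For readers who think in code, here is a minimal Python sketch of the arrangement just described. It is a conceptual model, not any vendor's implementation; the class and the lag parameter are invented for illustration, and the "delta" is simply the number of writes the remote copy trails behind.

```python
# Conceptual model of multi-hop mirroring: writes land synchronously on
# arrays #1 and #2, while a background "hop" ships them to remote array #3
# with some lag. That lag is the delta between local and remote copies.

from collections import deque

class MultiHopMirror:
    def __init__(self, link_lag_writes: int = 2):
        self.array1: list[bytes] = []           # primary array
        self.array2: list[bytes] = []           # local synchronous mirror
        self.array3: list[bytes] = []           # remote asynchronous copy
        self.in_flight: deque[bytes] = deque()  # writes still on the WAN link
        self.link_lag_writes = link_lag_writes

    def write(self, block: bytes) -> None:
        # Synchronous leg: acknowledged only after both local arrays have it.
        self.array1.append(block)
        self.array2.append(block)
        # Asynchronous leg: queued for the remote site, drained with a lag.
        self.in_flight.append(block)
        while len(self.in_flight) > self.link_lag_writes:
            self.array3.append(self.in_flight.popleft())

    def delta(self) -> int:
        """How many writes the remote copy trails behind the primary."""
        return len(self.array1) - len(self.array3)

if __name__ == "__main__":
    mirror = MultiHopMirror(link_lag_writes=2)
    for i in range(5):
        mirror.write(f"block-{i}".encode())
    print("remote copy trails the primary by", mirror.delta(), "writes")
```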
The multi-hop approach was touted as an iron-clad guarantee against interruption of mission-critical processes by its vendors. It was expensive to be sure—a 3x hardware/software cost model plus network charges—but, hey, wasn’t your data worth it? To many companies with deep pockets and a must-not-fail mentality, the idea resonated.
Problems with this approach, other than its price tag, came down to a lack of visibility into the mirroring process (you couldn’t actually tell whether replication was occurring), vulnerability to virus and worm software (which could corrupt all of the replicated data, since you replicated all bits, not just the good ones), and the vendor lock-in imposed by the strategy (replication software only worked on Brand X manufacturer’s controllers, so you could only use their equipment). Bottom line: portability was sacrificed for high availability.
Another problem with multi-hop was its violation of the common-sense dictum that not all data will be needed at the recovery site. Recovery operations typically present a workload that is between 60 and 80 percent smaller than normal operations; only mission-critical applications and their data need to be restored rapidly following a disaster event. This has given rise to a focus on minimum equipment configuration requirements: the smallest acceptable hardware/software platform at the recovery facility that can support the much-reduced workload. Minimum equipment and network configurations have been a godsend to planners who could never convince management to simply build a redundant data center offering 1-for-1 platform replacement.
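As a rough illustration of that sizing exercise, here is a small Python sketch. The application names, server counts, and storage figures are invented; the point is simply that carrying only the mission-critical workloads yields a recovery footprint in the 60-to-80-percent-smaller range described above.

```python
# Hypothetical sizing exercise for a minimum equipment configuration.
# All applications and numbers below are invented for the example.

production = [
    # (application, mission_critical, servers, storage_tb)
    ("order entry",     True,  6,  2.0),
    ("e-mail",          True,  2,  1.0),
    ("data warehouse",  False, 8,  6.0),
    ("test/dev",        False, 10, 3.0),
    ("intranet portal", False, 2,  0.5),
]

recovery = [app for app in production if app[1]]  # mission-critical only

prod_servers = sum(app[2] for app in production)
rec_servers = sum(app[2] for app in recovery)
reduction = 100 * (1 - rec_servers / prod_servers)

print(f"Recovery site: {rec_servers} of {prod_servers} servers "
      f"({reduction:.0f} percent smaller than production)")
```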
Bottom line: these options—tape versus multi-hop mirroring—have marked the extremes on a spectrum of data protection solutions for nearly two decades. In between, technologies offering incremental improvements, either to take some of the pain out of backup or to reduce the data deltas associated with mirroring, have come to the fore.
Data Mirroring Options
That is changing as we speak. Data mirroring technology is now available from several vendors in ways that do not involve the array controller. The only question is how aware of your data you want the replication to be: the bit level, the file level, or the transaction level.
Take, for example, SOFTEK. The company is the inheritor of TDMF, a data migration utility that has been widely used in mainframe environments for many years. Over the past few years, developers at the company have expanded the functionality of TDMF to create a replication utility that works with open systems software and data. With SOFTEK’s Replicator, you can set up a data copy routine that continuously copies data over distance between two or more locations and storage targets. Since the product is platform-agnostic, you can re-host your data on any hardware that makes sense.
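To make the idea concrete, here is a minimal sketch of a continuous, platform-agnostic copy loop. This is not SOFTEK Replicator's code or API; the paths, polling interval, and function names are all invented for illustration.

```python
# A minimal, platform-agnostic copy loop in the spirit of a continuous
# replication utility. Paths and interval are placeholders.

import shutil
import time
from pathlib import Path

def replicate_once(source: Path, target: Path) -> None:
    """Copy any new or newer files from the source tree to the target tree."""
    for src_file in source.rglob("*"):
        if not src_file.is_file():
            continue
        dst_file = target / src_file.relative_to(source)
        if not dst_file.exists() or src_file.stat().st_mtime > dst_file.stat().st_mtime:
            dst_file.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src_file, dst_file)

def replicate_forever(source: str, target: str, interval_s: float = 5.0) -> None:
    """Keep the target in step with the source, a few seconds at a time."""
    while True:
        replicate_once(Path(source), Path(target))
        time.sleep(interval_s)

if __name__ == "__main__":
    # The target could be a mount backed by any brand of storage hardware.
    replicate_forever("/data/production", "/mnt/recovery_site")
```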
Revivio has also been working along these lines, capturing bit-level data changes and replicating them over distance between inexpensive SATA arrays. I was just in Lexington, MA, at Revivio’s headquarters and saw the latest improvements to their already stellar continuous data protection (CDP) solution: they are working toward designing CDP and disaster recovery directly into the architecture of I/O. Pretty exciting.
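In spirit, CDP amounts to journaling every write with a timestamp so a volume can be rolled back to any point in time. The sketch below illustrates that idea only; it is not Revivio's implementation, and the class and method names are invented.

```python
# Sketch of the continuous data protection idea: journal every block write
# with a timestamp so the volume can be reconstructed as of any moment.

import time

class CDPJournal:
    def __init__(self) -> None:
        self.entries: list[tuple[float, int, bytes]] = []  # (timestamp, block_no, data)

    def record_write(self, block_no: int, data: bytes) -> None:
        """Called on every write; nothing is ever overwritten in the journal."""
        self.entries.append((time.time(), block_no, data))

    def volume_at(self, point_in_time: float) -> dict[int, bytes]:
        """Rebuild the block map as it looked at the given moment."""
        volume: dict[int, bytes] = {}
        for timestamp, block_no, data in self.entries:
            if timestamp > point_in_time:
                break  # entries are appended in time order
            volume[block_no] = data
        return volume

if __name__ == "__main__":
    journal = CDPJournal()
    journal.record_write(0, b"good data")
    checkpoint = time.time()
    time.sleep(0.05)                                  # ensure the next write is later
    journal.record_write(0, b"corrupted by a worm")
    print(journal.volume_at(checkpoint))              # {0: b'good data'}
```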
Replicating bit changes works well, especially in the case of databases and other volatile data. But what if you want to replicate at the file level or the transaction level? Products from vendors such as XOsoft and Neverfail have begun to move the functionality of data replication further up the software stack.
XOsoft offers some robust tools for creating policy-based schemes for data replication at the file or transaction level over distance. Its WANSync product has just been extended with WANSyncHA (HA stands for high availability), which has modules for replicating transactions across a set of Windows-based applications. In my opinion, the company is also on the brink of providing a meta-language, if you will, for developing customizable scripts that will let you build some fairly complex replication policies.
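To show what a policy-based scheme has to capture, here is a hypothetical policy expressed as plain Python data. It is illustrative only and is not XOsoft's policy syntax; the host names, filters, and lag targets are invented.

```python
# Illustrative only. A policy-based replication scheme has to express:
# which data, at what level (file vs. transaction), to which target,
# and how far behind the replica may fall.

replication_policies = [
    {
        "name": "exchange-mail",
        "level": "transaction",           # replicate application transactions
        "source": "mail01.hq.example",
        "target": "mail-dr.branch.example",
        "max_lag_seconds": 30,            # how stale the replica may be
    },
    {
        "name": "user-file-shares",
        "level": "file",                  # replicate whole files as they change
        "source": r"\\filer01\shares",
        "target": r"\\dr-filer\shares",
        "include": ["*.doc", "*.xls", "*.mdb"],
        "max_lag_seconds": 900,
    },
]

for policy in replication_policies:
    print(f"{policy['name']}: {policy['level']}-level replication, "
          f"lag budget {policy['max_lag_seconds']} seconds")
```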
Meanwhile, Neverfail will let you back up not only application-layer transactions but also server images, with patches, so you can commission gear quickly in the recovery environment to host the applications that use the data. One very interesting component of their solution is their failover and failback approach as it applies to equipment in two locations that shares the same machine name or IP address. The Neverfail approach is to hide the backup system from the network while non-disaster replication is going on. That way, there are no addressing conflicts during normal protected operations, and failover to the remote site requires only a simple change to DNS.
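Here is a minimal sketch of that failover model. The "DNS" below is a stand-in dictionary rather than a real DNS update API, and the host name and addresses are invented; the point is that clients follow the name, so redirecting them is a single record change.

```python
# Sketch of the hide-the-standby failover model: the replica stays off the
# network during normal operations, so there is no address conflict, and
# failover amounts to updating one name-to-address mapping.

dns_records = {"app.example.com": "10.0.1.10"}  # clients resolve this name

PRODUCTION_IP = "10.0.1.10"   # visible on the network day to day
STANDBY_IP = "192.168.50.10"  # replica kept hidden from the network

def fail_over(hostname: str) -> None:
    """Point clients at the standby; no per-client reconfiguration needed."""
    dns_records[hostname] = STANDBY_IP

def fail_back(hostname: str) -> None:
    """Return clients to the repaired production system."""
    dns_records[hostname] = PRODUCTION_IP

fail_over("app.example.com")
assert dns_records["app.example.com"] == STANDBY_IP
```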
I’ve only scratched the surface of the functionality presented by these different options for data replication and protection. If you want to touch them yourself, pop down to the Disaster Recovery and Data Protection Summit, scheduled for May 31 and June 1 in Tampa, FL. The event, which is sponsored by ESJ.com and the Data Management Institute, is completely free of charge to businesses. Registration is online at http://summit.datainstitute.org.
See you at the show, and your comments are welcomed at jtoigo@toigopartners.com.