In-Depth
A Nickel's Worth of Disaster
Are you willing to spend a nickel to avoid a data recovery disaster?
Surveys of CIOs and IT managers suggest that business continuity/disaster recovery planning has moved to the front burner in the front office. Depending on the analyst, BC/DRP is either the first or second priority (just behind cost containment). It makes sense.
Periods of economic uncertainty always seem to direct attention back to preserving the current assets. Add some regulatory mandates in a variety of industries, and a few high-definition disaster events playing 24/7 on CNN, and you have the mindshare behind continuity that may have been lacking in other times.
When I visit enterprise clients to discuss their continuity objectives and strategies, three issues always seem to come up. First, of course, is cost. How much do you want to spend to save yourself from a nickel's worth of disaster?
You read it correctly. I said "nickel" as in five U.S. cents with Jefferson on the front and Monticello on the back. A nickel may be all that stands between you and a full-on data disaster. Let me explain.
Over the past several months, my testing labs have seen rapid, successive failures in hard disks -- 10 so far -- all the same type of drive, all from the same manufacturer, and all probably sequentially numbered from the same manufacturing batch. About a week ago, I found out why. It appears that the manufacturer in question employed someone in procurement who didn't know much about hard disks but did know how to save a nickel per disk drive. He opted for a cheap vibration sensor for the electronics pack on the back of his 500 GB SATA drives. He probably saved the company a fortune, but he also cost them my continued patronage.
The person who recounted this story, formerly a marketing manager for the brand-name disk drive maker, explained that the cheap vibration sensor failed on every drive in which it was installed. Each failure was signaled to the drive's S.M.A.R.T. technology, which responded by re-testing the drive. The cycle repeated over and over: drives spun up in test mode, producing even more vibration, flooding the servers connected to the arrays with S.M.A.R.T. error traffic and dragging down their performance, until the drive finally failed outright.
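For what it's worth, this is exactly the sort of cascade that routine S.M.A.R.T. monitoring can flag before it becomes a data loss event. Here is a minimal polling sketch, purely illustrative: it assumes smartmontools is installed, the script runs with root privileges, and the drives expose the shock/vibration attribute (G-Sense_Error_Rate, ID 191), which not all do. The device list and polling interval are placeholders.

```python
#!/usr/bin/env python3
"""Minimal S.M.A.R.T. polling sketch (assumes smartmontools and root access).

Flags drives whose shock/vibration-related attribute (G-Sense_Error_Rate,
ID 191, not exposed by every drive) keeps climbing between polls -- the kind
of early warning that might have caught the cascade described above.
"""
import subprocess
import time

DEVICES = ["/dev/sda", "/dev/sdb"]   # hypothetical device list
POLL_SECONDS = 300                   # arbitrary polling interval

def g_sense_raw(device):
    """Return the raw G-Sense_Error_Rate value, or None if the drive lacks it."""
    out = subprocess.run(["smartctl", "-A", device],
                         capture_output=True, text=True).stdout
    for line in out.splitlines():
        if "G-Sense_Error_Rate" in line:
            return int(line.split()[-1])   # raw value is typically the last column
    return None

last_seen = {}
while True:
    for dev in DEVICES:
        raw = g_sense_raw(dev)
        if raw is None:
            continue
        # Warn only when the counter has risen since the previous poll.
        if raw > last_seen.get(dev, raw):
            print(f"WARNING: {dev} vibration errors rising "
                  f"({last_seen[dev]} -> {raw}); consider proactive replacement")
        last_seen[dev] = raw
    time.sleep(POLL_SECONDS)
```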
This happened in every array or server where the 500 GB drives were installed. Where the drives sat in RAID arrays (even RAID 5 or 6), the multiple, near-simultaneous drive failures made the data irretrievable from a rebuild perspective. Good thing I had tape backup.
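To spell out the arithmetic: RAID 5 tolerates the loss of one drive per group and RAID 6 tolerates two, so a batch of drives sharing the same defective five-cent part can blow past that tolerance before a rebuild ever completes. A trivial sketch, with the failure counts invented for illustration:

```python
# Illustrative only: parity tolerance vs. correlated (batch) drive failures.
RAID_TOLERANCE = {"RAID 5": 1, "RAID 6": 2}

def rebuildable(raid_level, failed_drives):
    """A RAID group survives only while failures stay within parity tolerance."""
    return failed_drives <= RAID_TOLERANCE[raid_level]

# Independent failure: one drive drops, the group rebuilds.
print(rebuildable("RAID 6", 1))   # True
# Correlated failures: the shared vibration sensor takes out three drives
# in the same group before a rebuild can finish.
print(rebuildable("RAID 6", 3))   # False -- restore from tape or lose the data
```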
This leads to the inevitable second question I hear from everyone I visit: how do I get rid of tape? Apparently, analysts and disk array vendors are persuading a lot of users that tape is some tired technology from the past. This is underscored by public statements from users that they are having a terrible time restoring data from tape when they need it.
A colleague recently forwarded me an e-mail from one of his clients stating that she had gone to her offsite storage vendor to retrieve 40 tapes that had been there for the better part of a year, only to discover that none of the tapes was readable by her system. Upon investigation, I discovered that she had never performed a read/verify operation after writing the data to tape in the first place. Read/verify, as the name implies, is your way to confirm that data has been written accurately to the tape media.
Failure to perform this step -- and I find that many shops skip it because of time constraints -- is the road to ruination. In addition, she acknowledged that the offsite company did not keep her tapes in an environment with appropriate heat and humidity controls. That, too, degrades your chances of a clean restore.
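If your backup software makes the verify pass optional, the principle is simple enough to sketch: hash what you intend to write, write it, rewind, read it back, and compare. The Python below is a bare-bones illustration only -- it assumes a non-rewinding SCSI tape device at /dev/nst0, the mt utility, and made-up source directories -- and it is no substitute for the verify feature in a real backup product.

```python
#!/usr/bin/env python3
"""Bare-bones read/verify sketch: hash the source files, write them to tape,
rewind, read them back, and compare digests. Assumes a tape at /dev/nst0 and
the mt utility; a real backup product does all of this far more robustly."""
import hashlib
import subprocess
import tarfile
from pathlib import Path

TAPE = "/dev/nst0"                            # non-rewinding tape device (assumption)
SOURCES = ["/data/payroll", "/data/orders"]   # hypothetical source directories

def sha256_file(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# 1. Record what we intend to write.
expected = {str(p): sha256_file(p)
            for src in SOURCES for p in Path(src).rglob("*") if p.is_file()}

# 2. Write the backup as a streamed tar archive.
with tarfile.open(TAPE, "w|") as tar:
    for src in SOURCES:
        tar.add(src)

# 3. Rewind, then read back and verify every member against its digest.
subprocess.run(["mt", "-f", TAPE, "rewind"], check=True)
mismatches = []
with tarfile.open(TAPE, "r|") as tar:
    for member in tar:
        if not member.isfile():
            continue
        digest = hashlib.sha256(tar.extractfile(member).read()).hexdigest()
        key = "/" + member.name.lstrip("/")   # tarfile strips leading slashes
        if key in expected and digest != expected[key]:
            mismatches.append(member.name)

print("verify FAILED:" if mismatches else "verify OK", mismatches)
```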
If read/verify takes too much time to fit in the operations window assigned to backup, I would recommend offloading the task. Read/verify does not need to be performed by backup software at all. Crossroads Systems in Austin, TX offers a router appliance to which you can hand off the read/verify step in your backup process, for those whose tape library subsystems don't provide a read/verify offload directly. Vendors such as Spectra Logic are increasingly proficient at adding real value to their tape libraries with functionality that enables companies to do more with less. (Their next-generation products will be a subject of this column later this year.)
Now for question three. With so many vendors adding so much value to their storage products, including continuous data protection, de-duplication, compression, and proprietary mirroring, how is an administrator supposed to keep track of multiple data-protection processes?
It does seem as though many vendors are trying to perpetuate the margins on boxes of spinning rust by leveraging the current fascination with data protection and disaster recovery. Deploying gear from different vendors can leave you "herding cats" -- wrangling the multiple disparate processes implemented in hardware (and sometimes in software, too) for replicating data in order to protect it.
When you perform a disaster recovery business impact analysis these days, you typically find at least 10 concurrent replication processes: multiple mirrors between array pairs or trios (each mirrored set from the same vendor), application-centric backup or replication schemes, formalized backups to tape (increasingly with the assistance of an interstitial layer of disk referred to as a Virtual Tape Library, even if it is simply a disk cache), and a few departmental-level saves to USB key drives or portable disk units. How do you wrangle all of these into some semblance of order?
The answer is a painful one. You can't.
Continuity Software and CA, however, both made announcements this past week that suggest that our capabilities to see, control, and manage multiple disparate data protection processes are improving. Continuity Software provides a dashboard that is easy to implement and maps protection processes back to the applications you are protecting. According to marketing director Avi Stone, the latest version of their RecoverGuard product enables you to see, among other things, instances of overprotection "in which data is being replicated five or more times," and also instances in which data growth has compromised your stated recovery time objectives given the strategy you are using. I like this idea, because it provides a real-time status monitor of your data protection capability.
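Both checks are, at bottom, simple bookkeeping, which is why a dashboard can run them continuously. Here is a rough sketch of the idea; RecoverGuard's actual logic is its own, and the data set names, copy counts, sizes, and restore rates below are invented for illustration.

```python
# Rough sketch of two dashboard checks: overprotection (too many copies) and
# data growth outrunning the recovery time objective. All figures are invented.
datasets = [
    {"name": "ERP DB",   "copies": 6, "size_gb": 4000, "restore_gb_per_hr": 400, "rto_hr": 8},
    {"name": "e-mail",   "copies": 2, "size_gb": 1200, "restore_gb_per_hr": 300, "rto_hr": 4},
    {"name": "HR files", "copies": 1, "size_gb": 150,  "restore_gb_per_hr": 100, "rto_hr": 2},
]

OVERPROTECTION_THRESHOLD = 5   # "replicated five or more times"

for d in datasets:
    if d["copies"] >= OVERPROTECTION_THRESHOLD:
        print(f'{d["name"]}: overprotected ({d["copies"]} copies)')
    # Estimated restore time = data volume / achievable restore throughput.
    restore_hours = d["size_gb"] / d["restore_gb_per_hr"]
    if restore_hours > d["rto_hr"]:
        print(f'{d["name"]}: estimated restore of {restore_hours:.1f} h exceeds '
              f'{d["rto_hr"]} h RTO -- growth has compromised the objective')
```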
To help you get going, Continuity Software has announced a $15K assessment in which they collect, analyze, and report on problems. According to Stone, "We always find critical problems." Part of the solution he recommends is to license RecoverGuard and to deploy it on all of your critical application servers to keep abreast of the situation. Price: $2K per server.
Who will monitor the dashboard and respond to the issues it presents? Stone says that too often the IT folk are too busy and the business continuity people are not technical enough to make sense of the technical detail. For that reason, Continuity Software is announcing a remote monitoring service for $3.2K per server (inclusive of the $2K per server license), delivered over the Internet.
CA, for its part, is also moving along the monitoring path. Its latest bundle of CA Recovery Management includes new releases of CA ARCserve Backup, CA XOsoft High Availability (formerly CA XOsoft WANSyncHA), and CA XOsoft Replication (formerly CA XOsoft WANSync), and it is designed to do more than report: the product provides mechanisms to control and manage data replication processes as well.
The CA XOsoft products have been improving steadily since CA purchased XOsoft. The product already features its own infrastructure wrapper technology that enables you to create a holistic failover scenario for your key servers, networks, and storage gear across two different locations. A nice touch is the ability to fail over between unlike infrastructure -- different gear, different physical or virtual configurations, and so forth.
CA has gone a step further and provided full reportage not only of XOsoft's own replication process, but also of its ARCserve backup processes wherever tape backup is being used. Moreover, CA claims that, using CA XOsoft's robust scripting language, you can also add visibility into vendors' proprietary replication processes, though control of those processes remains limited by how much access vendors are willing to provide to their application programming interfaces. This is a pain that CA's storage resource management product developers know all too well: the need to go to each product vendor and beg for (or pay for) access to their APIs.
The simplified pricing model CA announced in October (http://www.ca.com/us/press/release.aspx?cid=157286) remains the same for all of the new releases. Some new features, such as Central Management, have been added to the ARCserve product suites. The MSRP is now $995 for the File Server Suite and $1,495 for the e-Mail, Database, and Application Suites. Each suite contains all of the options and agents that were previously purchased separately (prices do not include maintenance). For CA XOsoft High Availability and CA XOsoft Replication, pricing starts at $2,000 per server, with a minimum of two servers required to replicate data from one to the other.
CA has reported that several early deployments have been made by continuity service providers, including Toronto-based Geminare, which has been offering a disaster recovery service based on CA wares for about a year.
So, we return to the initial question: how much money can you justify spending to protect yourself against a nickel's worth of disaster? We will look at some guiding principles next week. Your comments are welcome: jtoigo@toigopartners.com.