
Copy-On-Write: The Next Evolution in Data Protection

Mirror, mirror on the…drive. Is continuous data protection the next holy grail? We look at three approaches.

In last week’s column, I discussed the number one New Year’s resolution: protecting data. I offered high-level guidance for building a solid data-protection strategy focused on data copy. I concluded that a real solution might entail both disk-to-disk and disk-to-tape technology, combined with a solid program of archiving to keep data-copy processes practicable.

Continuous Data Protection was also mentioned. It seems to be getting a lot of play in vendor-speak these days, referring to a category of hardware and software products that seek to automate the replication of data in near real-time. Is this the holy grail that the industry wants us all to pursue? It merits some attention.

CDP usually comes down to mirroring: creating the same data on two target devices at more or less the same time. Inside the array, this is familiarly known to us as RAID. The five levels of RAID (some say six, though RAID 0 provides no redundancy and hence isn’t technically RAID at all) have been described in great detail in the original 1988 Berkeley RAID paper and in many books, magazine articles, and Web sites since then.

More recently, vendors have been making up their own RAID levels to give their newer bit-replication mousetraps the “feel” of standards. We are now hearing that “RAID 6” is the panacea for really big-capacity disks: it will shorten rebuild times should a disk fail. That may or may not be true; you should tread cautiously and make sure that you understand exactly what data protection you’re actually buying with these mechanisms.

But the CDP crowd really isn’t talking about RAID. Some are trying to do copy on write inside the array. The two major strategies are point-in-time (PIT) mirror splitting and write journaling.

Point-in-Time Mirror Splitting

PIT mirror splitting involves using some of the disk in your array to replicate data writes made to primary disk. At routine intervals, you pause or stop the applications that are writing data to disk, flush your memory cache(s) so that the last bits are processed and written to both primary and mirror disk, time stamp the mirror disk as a crash-coherent copy of the original, and allocate new disk to serve as the mirror. Restart your apps and you are now back to mirroring on a new PIT mirror split.
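A toy, file-based sketch of that cycle appears below, in Python, where "primary.dat" stands in for the primary volume and "mirror.dat" for the mirror. It is only an illustration of the sequence of steps; real splits happen inside the array controller, and quiescing applications and flushing caches are assumed to have been done already.

```python
import shutil, time

def pit_split(primary="primary.dat", mirror="mirror.dat"):
    # 1. Quiesce writers and flush caches first (outside the scope of this toy sketch).
    # 2. Stamp the current mirror as a crash-coherent, point-in-time copy.
    stamp = time.strftime("%Y%m%d-%H%M%S")
    shutil.move(mirror, f"{mirror}.{stamp}")
    # 3. Allocate a fresh mirror, resynchronize it from the primary, and resume the app.
    shutil.copyfile(primary, mirror)
```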

Many array manufacturers offer such products, despite the fact that it is an enormous waste of capacity on your most expensive disk arrays. Why use your most expensive disk to make copies when you could do it off-board, on a SATA array for example? Newer CDP products are exploring just this option.

In addition to disk utilization efficiency, another question is why we should make an exact mirror at all. Why not make one big copy of disk as it currently stands, then just keep a journal of byte-level changes to the data, each with its own time stamp (what Revivio calls “time-addressable storage”)? The amount of space consumed by such a journal is far smaller than the space consumed by multiple point-in-time images of the entire volume, the reasoning goes, so you make more economical use of disk resources. The strategy also eliminates the need to pause applications and flush memory caches. That would appeal to folks with high-performance environments that can’t tolerate a lot of downtime.
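The idea is easy to picture in a toy Python sketch: keep one baseline copy of the volume, then record each subsequent write as a timestamped (time, offset, bytes) record rather than keeping whole extra images. The names and record layout below are my own illustration, not Revivio's actual format.

```python
import time

journal = []   # grows only by the bytes actually changed, plus a little metadata

def journaled_write(volume: bytearray, offset: int, data: bytes) -> None:
    """Apply a write to the live volume and log it as a timestamped change record."""
    volume[offset:offset + len(data)] = data
    journal.append((time.time(), offset, bytes(data)))
```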

Write Journaling

Journaling addresses the Achilles’ heel of PIT mirroring: data corruption. In truth, database errors are rarely discovered until long after they occur, often 12 to 24 hours later. By that time, your primary data, plus all of your PIT mirrors, may already be corrupted by the same event. With time-addressable storage or a similar journaling scheme, you can restore your data to the point in time just prior to the corruption event; with PIT mirroring, you may have to lose many hours of transactions to get back to the last PIT split image, assuming all of your splits aren’t corrupt to begin with.
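Continuing the toy journal sketch from above, point-in-time recovery amounts to replaying only the records stamped at or before the chosen instant, say the moment just before the corrupting transaction arrived. Again, this illustrates the principle, not any product's recovery engine.

```python
def restore_to(baseline: bytes, journal, t: float) -> bytearray:
    """Rebuild the volume as it stood at time t from a baseline plus a time-ordered journal."""
    image = bytearray(baseline)
    for timestamp, offset, data in journal:
        if timestamp > t:
            break                     # stop at the cutoff: later writes never happened
        image[offset:offset + len(data)] = data
    return image
```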

Journaling may also pave the way for the cost-effective externalization of mirroring, since the amount of data to be transferred to external disk is very small. Today, you usually need two more arrays to mirror a disk array externally. You set up one mirror array next to your primary array and replicate locally and “synchronously” (near real time) the data that is being written to the primary. That insulates you against a failure of the primary array, but it doesn’t protect you against a smoke-and-rubble disaster that might claim both the primary and the local mirror.

That’s why the array vendor wants to sell you yet another array and some “asynchronous” replication software that will enable you to place mirror #2 far away from the primary and mirror #1. Typically, your #2 mirror is connected via a WAN and data writes are made to it as a secondary process, after writes have been made from the primary to mirror #1. This is done to keep from “holding the channel high” (delaying application processing) while waiting for acknowledgement of writes to the far-away array. The async mirror process is typically controlled by a separate manager so that it does not impact production operations.
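In code, that two-hop arrangement looks roughly like the Python sketch below: the local hop completes before the write returns to the application, while the remote hop is fed from a queue so WAN latency never stalls production. Everything here (dictionaries as stand-in targets, the write_block helper) is illustrative rather than any vendor's replication software.

```python
import queue, threading

primary, local_mirror, remote_mirror = {}, {}, {}
wan_queue = queue.Queue()

def write_block(target, offset, data):
    target[offset] = data                    # stand-in for a real block write

def application_write(offset, data):
    write_block(primary, offset, data)
    write_block(local_mirror, offset, data)  # synchronous hop: completed before we return
    wan_queue.put((offset, data))            # asynchronous hop: drained in the background

def wan_drainer():
    while True:
        offset, data = wan_queue.get()       # anything still queued here is not yet off-site
        write_block(remote_mirror, offset, data)

threading.Thread(target=wan_drainer, daemon=True).start()
```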

When you think about it, this multi-hop solution involves a lot of expense, which is exactly why it is the preferred solution from the vendors who offer it. You need to buy three $450K-plus high-end arrays, invest a significant amount of money in PIT mirroring, synchronous mirroring, and asynchronous mirroring software, and pay through the nose for a WAN connection.

Moreover, there is little granularity of control over what data is replicated (you generally replicate at the volume, not the file, level), and you have little or no visibility into the actual mirroring process. One vendor suggests testing your mirror using a surrogate: attach a couple of PCs at each end of the WAN connection used for mirroring. Cut power on the local unit and see whether the remote unit picks up the load! It leaves me scratching my head in wonder: does such a strategy really test the inter-array mirroring that is taking place or just the WAN connection?

Copy-on-Write

What everyone wants is a simpler approach that isn’t tied to one vendor’s hardware. Call it copy-on-write.

Copy-on-write involves writing data to two disks (targets) at the same time. There are many ways to implement it. You can buy split-write host bus adapters that simply target two disks, or split-write controllers that write to two different targets inside an array. Those products are as old as the hills and predate FC fabrics and IP SANs. Things get interesting when you add in the new technology.

In the Fibre Channel world, you can accomplish copy-on-write as a function of switch configuration. Simply assign two nodes to the same write path. You don’t have much control over latency with such a scheme, but it is an option with most switches.

Another approach is to use virtualization. Most (but not all) virtualization engines used with FC fabrics today offer the capability to direct writes to different targets as a function of how you set up your LUN aggregation, your “virtual volume.” Some are getting better at controlling the response to the application issuing the write command so that the app can continue processing while the second target is still being written. Some type of cached write management and monitoring functionality is required to make this spoofing work.
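A rough Python sketch of that “spoofing” behavior follows: acknowledge the application once the first target is written and complete the second copy in the background. The function names and dictionary targets are assumptions for illustration, not any virtualization engine's API.

```python
from concurrent.futures import ThreadPoolExecutor

_pool = ThreadPoolExecutor(max_workers=4)

def write_block(target, offset, data):
    target[offset] = data                    # stand-in for a real block write

def virtualized_write(primary, secondary, offset, data):
    write_block(primary, offset, data)                             # first copy lands before we acknowledge
    pending = _pool.submit(write_block, secondary, offset, data)   # second copy still in flight
    return pending   # the engine must track and monitor this until it completes
```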

In fact, the only virtualization engine that doesn’t support copy-on-write is EMC’s Invista. Given Hopkinton’s other products in this space, notably TimeFinder for PIT mirroring and SRDF for synchronous and asynchronous array mirroring, we can only speculate that they believed copy-on-write was unnecessary.

Moving on, the best way to do copy-on-write over distance will be to leverage the capabilities of the Internet Protocol itself. IP provides a facility called multicasting, which could be leveraged by any IP-based storage protocol to write to multiple IP-connected storage targets at once. Zetera Technologies is a big pioneer in this space, but you can expect to see additional products as IP SANs become dominant.
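For the curious, the underlying multicast primitive is simple to demonstrate in Python: a single send reaches every listener that has joined the group address. The group, port, and payload framing below are arbitrary choices for illustration; real IP SAN products layer their own storage protocols on top of this.

```python
import socket, struct

GROUP, PORT = "239.1.1.1", 5007        # arbitrary administratively-scoped multicast group

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, struct.pack("b", 1))

def multicast_write(offset: int, data: bytes) -> None:
    """Send one block write to every storage target that has joined the group."""
    sock.sendto(offset.to_bytes(8, "big") + data, (GROUP, PORT))
```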

Whichever strategy you use, whether IP multicast or some other hardware or software approach, writing to distant remote targets will always entail delays. In the disaster recovery world, these are referred to as “data deltas”: differences between local and remote data states imposed by distance-induced latency. Until somebody finds a workable application for quantum singularities in networking, Einstein’s speed-of-light restrictions on data transfer will apply.
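A back-of-the-envelope figure shows why the delta never goes to zero: any write issued while an acknowledgement is still crossing the WAN is, by definition, not yet on the remote copy. The numbers below are made-up assumptions, not measurements.

```python
write_rate_mb_s = 50    # assumed sustained application write rate, in MB/s
round_trip_s = 0.05     # assumed WAN round-trip time (a few thousand fiber miles)

delta_mb = write_rate_mb_s * round_trip_s
print(f"At least {delta_mb:.1f} MB of writes are unacknowledged at any instant.")
```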

Dealing with deltas is the bugaboo of remote data replication schemes. Your ability to deal with data deltas will have a lot to do with the tolerance of your applications to partial data and their inherent error-recovery capabilities. Recommendation: do regular backups and move them off site, even if you are pursuing a copy-on-write or remote mirroring strategy, just to be on the safe side.

Your comments are welcome. [email protected]
