
A Sanity Check on De-Duplication (Part 2 of 3)

We discuss the types of de-duplication and what meaningful criteria should be used to evaluate competing approaches and products

De-duplication, a process simply and generically explained by the analyst community as describing data with fewer bits, is on everyone's mind. At the Storage Decisions conference a few weeks ago in Toronto, no less than a third of the sponsoring vendors were selling de-dupe solutions. Recent analyst polls place the technology at number one or number two on the consumer's wish list. Buyers hope it will save money by squeezing more data onto the same spindles (thereby deferring spending on new storage gear), provide a cost-effective way to store all the data that must be retained for regulatory and legal reasons, and even reverse global warming trends by making IT operations greener.

A sanity check is required, however, before jumping headlong onto the newest de-dupe bandwagon. As one fellow from TD Waterhouse Bank said to me during a conference break, "We have serious concerns about whether de-duplicating data will expose us to regulatory or legal problems, especially in cases where the law states that a full and unaltered copy of original data must [be] retained for auditors." He was at the show mainly to chat with the vendors about this concern.

For many other consumers, the questions are more fundamental. What exactly is de-duplication? What differentiates one product from another? What criteria should be used to evaluate the various options and to select the right one?

To get some answers, I directed a questionnaire to the vendor community. The survey questions were framed exactly as users have posed them to me, which is to say that they often embedded the confusion and concerns users feel about the technologies the vendors were offering. I hoped the results would provide an educational opportunity rather than a blatant marketing exercise, and I was delighted that most vendors responded in that vein. These included Network Appliance, Exagrid, Data Storage Group, COPAN Systems, Symantec, Permabit, and IBM. Notably absent from those responding were EMC and Data Domain, two major proponents of de-duplication technology. Repeated invitations to Data Domain resulted in a promised response that never arrived; "back channel" comments from an EMC blogger, who said he was not speaking for the company, suggested that Hopkinton probably perceived "no upside" in participating in the survey. I wasn't sure what this meant.

This column provides responses to the first half of the survey, summarizing the vendors' guidance on what de-duplication is and how it should be used. Spokespeople for the companies ranged from business development managers and product managers to product evangelists and technical experts.

My first question cited analyst survey data suggesting that de-dupe is the number one storage-related technology that companies are seeking today -- well ahead of even server or storage virtualization. I asked about the appeal of the technology.

NetApp responded that "storage administrators are reluctant (or prohibited) from deleting data or sending the data to permanent tape archival. But as everyone knows, data keeps growing."

This situation created a "quandary," the spokesperson said. "You can't just keep buying more and more storage; instead you need to figure out the best way to compress the data you are required to store on disk."

Of all the storage space-reduction options, he argued, de-duplication provides the highest degree of data compression for the lowest amount of computing resources and is usually very simple to implement. This is the reason for the broad interest and adoption of de-duplication.

Exagrid's spokesperson expanded on the answer, stating that de-duplication delivers two key values. First, the technology allows users to store a lot of data in a small disk footprint.

"This works great [when applied to] backup," he argued, as each backup data set typically replicates 98 percent of the data found in the previous backup set. De-duplication is used to eliminate the redundant data.

The additional value realized from de-duplicating backup data, he argued, was that it enabled only change data to be backed up across a WAN, making WAN-based data transfers more efficient and economical.

Data Storage Group's respondent took a slightly different tack: "The one primary goal of data de-duplication is to have a longer period of data history on disk and readily available. This is more appealing to organizations because it enables efficient data recovery and discovery. It also improves data reliability because the system can be continuously validating backup images. With tape-based data retention, a media problem will go un-detected until the production data has been lost and needs to be recovered."

Data de-duplication can also improve the process of creating offsite copies, he said. Instead of copying all of the source data, the system can focus on the de-duplicated data. This reduces the total amount of replicated data and network impact. "Instead of managing several replication plans, one for each production volume, the focus can be on the unique bits of data."
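
A minimal sketch may help make this concrete: under the assumption of a simple content-addressed chunk store, the source site sends chunk fingerprints first, the target replies with the ones it lacks, and only those chunks cross the WAN. The chunk size, hash, and class names below are illustrative, not any vendor's actual protocol.

```python
import hashlib

CHUNK_SIZE = 8 * 1024  # illustrative fixed chunk size

def chunks(data, size=CHUNK_SIZE):
    """Split a byte string into fixed-size chunks."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def fingerprint(chunk):
    """Identify a chunk by its SHA-256 digest."""
    return hashlib.sha256(chunk).hexdigest()

class TargetSite:
    """Stand-in for the remote replica's content-addressed store."""

    def __init__(self):
        self.store = {}  # fingerprint -> chunk

    def missing(self, fingerprints):
        """Report which fingerprints the target has never seen."""
        return [fp for fp in fingerprints if fp not in self.store]

    def receive(self, chunk):
        self.store[fingerprint(chunk)] = chunk

def replicate(data, target):
    """Ship only the chunks the target is missing; return how many were sent."""
    fps = [fingerprint(c) for c in chunks(data)]
    needed = set(target.missing(fps))
    sent = 0
    for chunk in chunks(data):
        fp = fingerprint(chunk)
        if fp in needed:
            target.receive(chunk)
            needed.discard(fp)  # send each unique chunk only once
            sent += 1
    return sent
```

On the second and subsequent runs, missing() returns only the fingerprints of new or changed chunks, which is the "unique bits of data" the respondent describes.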

COPAN Systems' spokesperson agreed with the above assessment, adding that the network bandwidth savings accrued from de-duplicating data replicated across a WAN were greater than those achieved with WAN acceleration. "WAN accelerators don't help when you are sending the same data over and over again," COPAN argued.

Symantec referred to the form of de-duplication that delivers this bandwidth-reduction benefit during backup as "client-side" de-duplication. Client-side de-duplication, argued the respondent, is "very effective for remote office data and applications," where it can eliminate the need for a remote backup application and/or remote backup storage. It can also be an effective means of protecting virtual server environments, explained Symantec's spokesperson, "because of how it reduces the I/O requirements (90 percent less data, less bandwidth) and consequently reduces the backup load on a virtual host."

Symantec's NetBackup supports both client-side de-duplication and target-side de-duplication, the latter being the approach used by most other survey respondents, according to the spokesperson, who offered this definition: "In target-side de-duplication, the de-duplication engine lives on the storage device. We often place NetBackup PureDisk in this category for convenience sake, but what we really offer is proxy-side de-duplication. By proxy, I mean that we use our NetBackup server (specifically the media server component) as a proxy to perform the de-duplication process. With this approach, a customer can increase throughput on both backups and restores with additional media servers."

Permabit proffered analyst estimates of data growth rates -- 60 to 80 percent annually -- to explain the de-duplication phenomenon. Their spokesperson stated that "while some of this data is perhaps unnecessary, the bulk of it does need to be kept around for future reference. Digital documents keep growing in both size and volume, and either regulations require that they be kept around or businesses see value in later data mining." The spokesperson placed the cost of storing non-de-duplicated data on primary storage at $43 per GB and stated that such an "outrageous price" provided "the number-one driver for de-duplication."

"Customers don't want de-duplication, per se; what they want is cheaper storage," Permabit argued. "De-duplication is a great way of helping deliver that, but it's only one way in which Permabit drives down costs."

IBM rounded out the responses to this question by referencing the value proposition of its recently acquired Diligent products. According to the company's spokesperson, the focus of de-duplication is on backup workloads, which provide the best opportunity for realizing de-duplication benefits. The two main benefits are (1) keeping more backup data available online for fast recovery, and (2) enabling mirroring of the backup data to another remote location for added protection. "With inline processing, only the de-duplicated data is sent to the back-end disk, and this greatly reduces the amount of data sent over the wire to the remote location."

When asked what the gating factors should be in evaluating and selecting the right de-duplication technology, the vendors gave responses that began to define differences in approaches and products.

Network Appliance explained that product selection required consumers to address the question of what they were trying to accomplish. Inline de-duplication, the spokesperson explained, eliminates redundant data before it is written to disk, so redundant data never needs to be stored at all. However, he argued, the drawback of this approach is that the inline product must decide whether "to store or throw away" data in real time, precluding "any data validation" to ascertain whether the data being thrown away is unique or redundant.

NetApp argued that "Inline de-duplication is also limited in scalability, since fingerprint compares are done 'on the fly' [and] the preferred method is to store all fingerprints in memory to prevent disk look-ups. When the number of fingerprints exceeds the storage system's memory capacity, inline de-duplication ingestion speeds will become substantially degraded."
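
To make the inline write path concrete, here is a minimal sketch of fingerprint-indexed inline de-duplication, assuming an 8KB chunk size, SHA-256 fingerprints, and an in-memory index; it is illustrative only and not NetApp's or any other vendor's implementation.

```python
import hashlib

CHUNK_SIZE = 8 * 1024  # illustrative fixed chunk size

class InlineDedupeStore:
    """Toy inline de-duplication: decide store-or-discard before data hits disk."""

    def __init__(self):
        self.index = {}   # fingerprint -> block id, held in memory for speed
        self.blocks = []  # stand-in for the back-end disk pool

    def ingest(self, stream):
        """Consume a data stream and return the block references describing it."""
        refs = []
        while True:
            chunk = stream.read(CHUNK_SIZE)
            if not chunk:
                break
            fp = hashlib.sha256(chunk).hexdigest()
            block_id = self.index.get(fp)
            if block_id is None:
                # Unseen fingerprint: write the chunk and remember it.
                block_id = len(self.blocks)
                self.blocks.append(chunk)
                self.index[fp] = block_id
            # A matching fingerprint means the chunk is discarded in real time;
            # no later byte-level check is made, the trade-off NetApp describes.
            refs.append(block_id)
        return refs
```

If the in-memory index outgrows RAM, each fingerprint lookup turns into a disk seek, which is the ingestion slowdown described in the quote above.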

The alternative, "post-processing de-duplication," is the method NetApp uses; it requires data to be stored first, then de-duplicated. The key benefit of this technique, said the spokesperson, is that it "allows the de-duplication process to run at a more leisurely pace." Moreover, he argued, because the data is stored and then examined, a higher level of validation can be done. Additionally, fewer system resources are required since fingerprints can be stored on disk during the de-duplication process.
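
By contrast, a post-process pass can afford a full byte-for-byte comparison before coalescing duplicates. The sketch below is purely illustrative of that extra validation step, not of NetApp's implementation.

```python
import hashlib

def post_process_dedupe(blocks):
    """Collapse duplicate blocks that have already been written to storage.

    `blocks` is a list of byte strings standing in for on-disk blocks.
    Returns (unique_blocks, refs), where refs[i] is the index of the
    surviving copy of original block i.
    """
    seen = {}           # fingerprint -> index into unique_blocks
    unique_blocks = []
    refs = []
    for chunk in blocks:
        fp = hashlib.sha256(chunk).hexdigest()
        if fp in seen and unique_blocks[seen[fp]] == chunk:
            # Fingerprint match *and* a full byte compare: the validation
            # that becomes practical once the data is already at rest.
            refs.append(seen[fp])
        else:
            seen[fp] = len(unique_blocks)
            unique_blocks.append(chunk)
            refs.append(seen[fp])
    return unique_blocks, refs
```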

"The bottom line," said NetApp's spokesperson, "[is that] if your main goal is to never write duplicate data to the storage system, and you can accept 'false fingerprint compares,' inline de-duplication might be your best choice. If your main objective is to decrease storage consumption over time while insuring that unique data is never accidentally deleted, post-processing de-duplication would be the choice."

NetApp revisited the "false fingerprint" issue later in the survey to suggest that inline de-duplication might compromise the acceptability of de-duplicated data from a regulatory or legal compliance perspective. Symantec's respondent took exception to this characterization, referring to it as "FUD" (fear, uncertainty, and doubt) designed to cast aspersions on inline de-duplication while placing post-process de-dupe in a more favorable light.

Exagrid took the high road, providing a more generic comparison of three de-duplication methods. One method, said the company's respondent, was to break backup jobs or files into roughly 8KB blocks, compare the blocks, and store only the unique ones. He associated this technique with Data Domain and Quantum. The second method was "byte-level delta" de-duplication, preferred by Exagrid and Sepaton, in which each backup job is compared to the previous backup jobs and only the bytes that change are stored. The third approach, used by IBM Diligent and others, is to perform "near-dupe block-level" analysis that operates inline as data moves to its target storage, taking large data segments into memory, comparing them for byte-level delta change, and then passing through only the changed data.
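
For illustration only, the sketch below shows the byte-level delta style in miniature, using Python's difflib to compare a backup image against its predecessor and store only the changed byte ranges. Production delta engines are far more efficient; nothing here reflects Exagrid's or Sepaton's actual code.

```python
import difflib

def byte_level_delta(previous, current):
    """Return edits that rebuild `current` from `previous`.

    Each edit is either ("copy", start, end), reusing bytes already stored
    for the prior backup, or ("store", data), holding only the new bytes.
    """
    matcher = difflib.SequenceMatcher(None, previous, current, autojunk=False)
    edits = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            edits.append(("copy", i1, i2))           # nothing new is written
        else:
            edits.append(("store", current[j1:j2]))  # only the changed bytes
    return edits

def rebuild(previous, edits):
    """Reassemble the current backup from the prior image plus the delta."""
    out = bytearray()
    for edit in edits:
        if edit[0] == "copy":
            out.extend(previous[edit[1]:edit[2]])
        else:
            out.extend(edit[1])
    return bytes(out)
```

When successive backups really do overlap by 98 percent, as Exagrid's respondent noted earlier, nearly every edit is a "copy" and very little new data is stored.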

To this list, Data Storage Group added Single Instance Storage, a technique that reduces redundancy at the file level. The spokesperson noted that it was important for consumers not to focus too narrowly on data de-duplication while ignoring important concerns such as data recovery timeframes, validation functions, retention requirements, and hardware dependencies. Sage advice, in our view, and echoed by COPAN Systems.
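
The single-instance technique just mentioned is simple enough to sketch directly: hash each whole file, keep one stored copy per unique hash, and point every path holding identical content at that copy. The class below is a hypothetical illustration, not Data Storage Group's product.

```python
import hashlib

class SingleInstanceStore:
    """Toy file-level single-instance store: one copy per unique file body."""

    def __init__(self):
        self.copies = {}   # content hash -> file bytes (the single instance)
        self.catalog = {}  # path -> content hash

    def add(self, path, data):
        """Record a file; identical contents collapse to one stored copy."""
        digest = hashlib.sha256(data).hexdigest()
        self.copies.setdefault(digest, data)
        self.catalog[path] = digest

    def read(self, path):
        """Return the file body by following the path's catalog entry."""
        return self.copies[self.catalog[path]]
```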

COPAN observed that some de-duplication solutions can impose a two-step data restore process (backup data must be "re-inflated" then restored from the containers created by backup software) -- adding delay to recovery efforts following a data disaster. Moreover, COPAN cited the practical issues of solution architecture -- how all the parts of the de-duplication solution communicate with each other. The spokesperson noted, "The more data that can be seen by a single [de-duplication] system (or cluster of systems), the more efficient the solution will be at reducing the storage requirements, complexity of management, and overall total cost of ownership. Also beware of the extra costs associated with having more units, appliances, IP and FC switch ports, etc."

He also emphasized that format awareness is important to product selection. Not all de-dupe solutions are aware of all file formats, which can impact the efficiency of the solution. He added, "Some solutions only look at blocks of data without the ability to understand the whole file. The most efficient solutions have the ability to understand the file as well as break [it] into blocks to achieve maximum de-dupe efficiency."

Symantec added that, in addition to the de-duplication process itself and its efficiency, other important considerations include how well the de-duplication solution integrates with backup applications and the high-availability features of the de-duplication product itself. Does the solution have high-availability failover to spare nodes built in to protect against server (node or controller) failure? What happens when one controller or node goes down in a distributed storage system?

These questions played into a differentiator of the Symantec offering, NetBackup PureDisk, which provides "integrated high-availability using Veritas Cluster Server." He added that the product had built-in disaster recovery options, "including optimized replication, reverse replication, and of course the ability to recover a complete system from tape."

Permabit offered that, in addition to criteria such as scalability, cost should be an important criterion. To alleviate the pain of growing primary storage costs, an enterprise archive has to deliver a major change in storage economics. Permabit's respondent explained: "Primary storage averages $43/GB; Permabit is $5/GB before any savings due to de-duplication. With even 5X de-duplication that realized cost is $1/GB, and is competitive with tape offerings. De-duplication is not the feature; low cost of acquisition is the feature."
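
The arithmetic behind that claim is straightforward, and the trivial calculation below simply restates the vendor's own figures; the function name is ours.

```python
def effective_cost_per_gb(raw_cost_per_gb, dedupe_ratio):
    """Cost per logical GB once de-duplication folds N GB into one stored GB."""
    return raw_cost_per_gb / dedupe_ratio

print(effective_cost_per_gb(5.0, 5))    # Permabit's quoted $5/GB at 5X de-dupe -> 1.0
print(effective_cost_per_gb(43.0, 1))   # primary storage with no de-dupe -> 43.0
```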

In the final analysis, many of the responses to these preliminary survey questions introduced more complexity than clarity into the discussion of the types of de-duplication and of meaningful criteria for evaluating competing approaches and products. The gap seemed to widen later in the survey, as issues surrounding the utility of de-duplication and its impact on corporate compliance with data preservation requirements were discussed. We'll have a look in the next installment of this series. For now, your comments are welcome: [email protected].
