Vendors Respond to De-Duplication Concerns (Part 3 of 3)
Vendors offer their opinions on topics ranging from staying out of legal hot water to where to expand the use of de-duplication technologies.
In our previous column (see http://esj.com/Storage/article.aspx?EditorialsID=3200), we reviewed the responses by several vendors of de-duplication products to a questionnaire that posed common consumer questions about the often-confusing technology. Specifically, we summarized responses from Network Appliance, Exagrid, Data Storage Group, COPAN Systems, Symantec, Permabit, and IBM regarding types of de-duplication and criteria for selecting the right de-duplication solution for your company.
Creating categories to describe different de-duplication methods proved challenging. References were made by vendors to "inline" versus "post-processing" approaches, to "client-side" versus "target-side" de-duplication, and to the focus of de-duplication processes -- files, byte streams, and blocks -- in an effort to create product categories. Generally speaking, these differentiators only complicated the analysis rather than adding clarity to the discussion.
Vendor-suggested criteria for selecting products ranged from comparing resource requirements, ingestion rates, and potential compression ratios to raising the specter of doubt over the impact of certain de-duplication approaches on data integrity. One vendor emphasized cost savings for storing de-duplicated data as a key criterion, citing highly questionable analyst calculations of data-growth rates and the cost per GB of primary storage to underscore his case.
This column picks up the discussion with questions about the broader impact of de-duplication on information governance, business continuity, and storage operations. An issue on the minds of many consumers concerns the potential problems raised by de-duplication in the area of regulatory requirements around data immutability and non-repudiation.
Staying out of Trouble
Does the process of de-duplicating data change it in such a way that you might find yourself in trouble with regulators, auditors, or lawyers? Most respondents suggested that it doesn't.
Exagrid's spokesperson said that he had not heard of any issues. Because all data can be restored at any point in time using a hash table, he doubted that any problems would arise in a court of law.
Data Storage Group agreed with this assessment, noting "As long as the immutability of the content can be assured, de-duplication should be acceptable and can be used in conjunction with digital signatures (one or more) and/or write once media formats for non-repudiation requirements." They added that "the management practices for compliance and non-repudiation requirements do not change with the application of de-dupe," also suggesting that source-based de-duplication might even add an additional level of data integrity by providing content verification when de-duplicated data is written to the target storage media.
COPAN Systems' respondent echoed this view, stating that he understood immutability to concern the changing of (or tampering with) data. "COPAN Systems' de-duplication does not change the data, rather it stores it differently. The traditional backup applications have always done a similar scenario by combining files (i.e., tarring them up). Since that, too, is only storing differently (and more efficiently) and the data is easily returned to its original form, then de-duplication is not very different."
Permabit's respondent drew a comparison to other forms of compression such as LZW. "Plain old LZW compression gives you a different output bitstream than what went in, with redundant parts removed. Conventional file systems break up files into blocks and scatter those blocks across one or more disks, requiring complicated algorithms to retrieve and return the data. De-dupe is no different. Non-repudiation requirements are satisfied by the reliability and immutability of the system as a whole, de-duplicating or not."
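The point about lossless compression can be illustrated with a short sketch. It assumes Python's standard zlib module, which implements DEFLATE rather than LZW, and the sample data is invented for illustration. The stored bytes differ from what went in, yet decompression returns the original bit for bit.

```python
import hashlib
import zlib

# Hypothetical sample data with obvious redundancy.
original = b"quarterly report, unchanged since last backup\n" * 1000

# The compressed form is a different, much shorter byte stream.
stored = zlib.compress(original)

# Decompression restores the input exactly.
restored = zlib.decompress(stored)

# Compare both the raw bytes and a cryptographic digest of them.
same_bytes = restored == original
same_digest = (
    hashlib.sha256(restored).digest() == hashlib.sha256(original).digest()
)
print(len(original), len(stored), same_bytes, same_digest)
```

The digest comparison is the kind of check an auditor might ask for: it shows that what comes back matches what was written, regardless of how the bytes were arranged in between.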
Symantec also provided a vote in favor of this view: "Data de-duplication does not change the underlying content of the data; it merely breaks the data up into pieces to store it more efficiently." This was very similar to Network Appliance's view: "NetApp de-duplication does not alter one byte of data from its original form, it's just stored differently on disk. I use this analogy: if a disk volume is unfragmented, isn't it still the same data, just stored in a different place?"
However, the Sunnyvale, CA storage vendor went further to cast doubts about inline de-duplication: "[I]f a 'false fingerprint compare' [occurs] … with inline de-duplication … now the data has been changed. Because of this, inline de-duplication may not be acceptable in regulatory environments."
IBM's spokesperson, speaking of its recently acquired inline de-duplication technology from Diligent Technologies, gave a contrary view. Prefacing his remarks by noting that he is not a lawyer, he said that regulatory and legal standards usually require storage to return a "bit-perfect" copy of the data that was written.
"There are laws against changing the format [of a file]," he said. "For example, an original document was in Microsoft Word format, but is converted and saved instead as an Adobe PDF file. In many conversions, it would be difficult to recreate the bit-perfect copy. Certainly, it would be difficult to recreate the bit-perfect MS Word format from a PDF file. Laws in France and Germany specifically require that the original bit-perfect format be kept."
He argued that IBM Diligent is able to return a bit-perfect copy of what was written, same as if it were written to regular disk or tape storage, "because all data is diff-compared byte-for-byte with existing data."
By contrast, he said, other de-duplication solutions, such as those based on hash codes, can have hash collisions that can, in turn, result in presenting a completely different set of data on retrieval. (In addition to the Diligent-based de-duplication product, IBM also offers the N series A-SIS de-duplication platform, which uses hash codes.)
"If the data you are trying to store happens to have the same hash code calculation as completely different data already stored on a solution, then it might just discard the new data as 'duplicate.' The chance for collisions might be rare, but could be enough to put doubt in the minds of a jury. For this reason, IBM N series A-SIS, that does perform hash code calculations, will do a full byte-for-byte comparison of data to ensure that data is indeed a duplicate of an existing block stored."
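The difference between trusting a hash match and confirming it byte for byte can be sketched in a few lines. This is a toy illustration under stated assumptions, not any vendor's implementation: the DedupStore class, its method names, and the deliberately weak hash used to force a collision are all hypothetical.

```python
import hashlib


class DedupStore:
    """Toy block store: hash to find candidates, compare bytes to confirm."""

    def __init__(self, hash_fn=None):
        # hash_fn is injectable so a weak hash can be used to force collisions.
        self._hash_fn = hash_fn or (lambda b: hashlib.sha256(b).digest())
        self._buckets = {}  # fingerprint -> list of distinct blocks

    def put(self, block: bytes):
        """Store a block and return a (fingerprint, index) reference."""
        fp = self._hash_fn(block)
        bucket = self._buckets.setdefault(fp, [])
        for i, existing in enumerate(bucket):
            if existing == block:  # byte-for-byte comparison
                return (fp, i)  # true duplicate: reuse the stored copy
        bucket.append(block)  # new data, or a hash collision: keep it
        return (fp, len(bucket) - 1)

    def get(self, ref):
        """Return the exact bytes that were stored under this reference."""
        fp, i = ref
        return self._buckets[fp][i]

    def stored_blocks(self):
        """Count the distinct blocks physically held."""
        return sum(len(bucket) for bucket in self._buckets.values())


# Deliberately weak hash (first byte only) to simulate a collision.
store = DedupStore(hash_fn=lambda b: b[:1])
r1 = store.put(b"alpha block")
r2 = store.put(b"alpha block")    # true duplicate of r1
r3 = store.put(b"another block")  # same fingerprint, different data
```

Here r1 and r2 point at the same stored copy, while r3 gets its own slot even though its fingerprint collides. A store that skipped the byte comparison would have discarded the third block as a "duplicate" and returned the wrong data on retrieval, which is the scenario the IBM spokesperson describes.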
None of the vendors responding to the questionnaire indicated a willingness to place itself between customers and litigators should questions arise regarding the regulatory compliance or legality of data de-duplicated with a product. The use of de-duplication is, therefore, a question that should be posed to corporate attorneys before implementing any solution. Based on the stated intention of many vendors surveyed to press the value case for de-duplication into primary storage, the issue must be examined carefully and quickly.
Other De-Dupe Opportunities
We asked in the survey whether de-duplication vendors thought that there might be other applications for de-dupe besides reducing the bits used to describe data stored on virtual tape libraries (e.g., backup). The response was a nearly unanimous "Yes."
NetApp responded, "Approximately 30 percent of all NetApp de-duplication users are de-duping primary storage applications, and the area where we are seeing the greatest growth is our de-duplication. VMware, Exchange, SQL, Oracle, and SharePoint are the primary apps we predict will see the greatest adoption of NetApp de-duplication in 2008."
IBM agreed with Network Appliance: "De-duplication can be applied to primary data, as in the case of the IBM System Storage N series A-SIS. MS Exchange and SharePoint could be good use cases that represent the possible savings for squeezing out duplicates."
Exagrid noted that backups were the low-hanging fruit of de-duplication technology, but they also saw opportunities in the efficient storage of archival data and primary data itself. Symantec PureDisk and EMC Avamar were already demonstrating the value of de-duplicating data prior to moving it across a WAN.
COPAN agreed that de-duplication provides the greatest value "anywhere that there is a high occurrence of repetitive data." VTLs were a logical starting point because of the data repetition in consecutive backups, but this might extend to other environments in general storage repositories where like records exist.
Data Storage Group offered that de-duplication could play a role in optimizing capacity in primary storage, facilitating e-discovery, and data protection -- all using the same de-duplicated backend data store, all in a unified product. They stopped short of claiming that primary storage itself would be de-duplicated.
Permabit's respondent agreed, calling VTL a "blindingly obvious use case for the technology" because de-duping backups written to virtual tape repositories could turn "a 30TB box into a 'one petabyte' appliance."
He added, "De-dupe is very important in archives as well, strictly from the perspective of cost savings, but it's also much harder to de-dupe archives, because you don't … save the same data over and over. Building de-duplication for archives is a much harder problem, because you have to work harder to find opportunities for de-dupe, and you must be able to scale to enormous amounts of disk."
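The "30TB box into a 'one petabyte' appliance" figure implies a specific reduction ratio, which a quick calculation makes explicit. The sketch below assumes nothing beyond the two numbers in the quote; the function name is hypothetical, and both the decimal and binary readings of a petabyte are shown because the quote does not say which is meant.

```python
def implied_dedup_ratio(logical_tb: float, physical_tb: float) -> float:
    """Logical capacity presented per unit of physical capacity."""
    return logical_tb / physical_tb


PHYSICAL_TB = 30  # raw capacity from the quote

ratio_decimal = implied_dedup_ratio(1000, PHYSICAL_TB)  # 1 PB = 1000 TB
ratio_binary = implied_dedup_ratio(1024, PHYSICAL_TB)   # 1 PB = 1024 TB

print(f"{ratio_decimal:.1f}:1 to {ratio_binary:.1f}:1")
```

Either way the claim rests on a sustained reduction of roughly 33:1 to 34:1. That is plausible only for highly repetitive data such as consecutive full backups, which is consistent with the respondent's remark that archives are a harder case.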
Other Benefits
Other points of general agreement among respondents included the characterization of de-duplication as a "green technology" -- as a consequence of the reduction of disk capacity requirements resulting from the application of the technology. The only dissenting opinions came from Network Appliance's respondent, who noted that capacity freed up by de-duplication tends to get filled up again with more data, and from IBM, Symantec, and Exagrid, who added that tape was probably still greener than disk -- whether de-duplicated or not.
COPAN Systems countered that de-duplication plus a Massive Array of Idle Disks (MAID) with drive idling (their primary product) might just change the "tape is greener" equation to favor de-duplicated disk.
Underlying the early value proposition of de-duplication was a claim that the technology would eventually eliminate tape altogether. Although some of the enthusiasm behind this message has waned (vendors such as Sepaton, whose name is "no tapes" spelled backward, are now talking about living within the tape ecosystem), our survey showed that it was as strong as ever among the vendors who responded.
Most look forward to a time in the near future in which de-duplicated data will be written to tape without being first re-inflated, claiming that customers seem to be warming to the idea. Some believe that tape's days are numbered and that disk-to-disk replication of de-duplicated backups between production environments and disaster recovery sites will eventually become the norm.
Our take, humming the old Disney song: "A dream is a wish your heart makes." Your perspectives are welcome: jtoigo@toigopartners.com.