In-Depth
Building an Archive: End of a Six Part Series
Is the obvious so difficult to understand? Obviously, yes.
Sorrento, Italy was the site of last week’s QStar Partner Conference, attended by roughly 80 resellers, integrators, and vendors from around the world. A topic in many of the presentations and less-formal discussions was a campaign introduced last year by EMC called "Shatter the Platter." The Hopkinton, MA-based disk array manufacturer has been targeting optical storage devices (often the preferred modality for long-term archiving) and seeking to replace them with its content-addressable storage (CAS) product, Centera.
Many attendees complained that EMC was gaining substantial penetration into their customer base with Centera, despite the increasingly well-known and well-publicized operational deficiencies of the product. Past columns have covered some of the complaints leveled by consumers, including "header hash collisions," in which two or more files are assigned the same "unique" MD5 hash, making it impossible to retrieve any of the files sharing that hash, and extraordinarily long rebuild times when a disk drive in the array fails.
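To see why a hash collision is fatal in a content-addressed store, consider a minimal sketch in Python. (This is our illustration of the general technique, not a description of Centera’s internals.) When the digest of the content is the object’s only address, two different files that happen to hash identically contend for a single slot:

```python
import hashlib

class ContentAddressedStore:
    """Toy content-addressed store: the MD5 digest of an object's
    content is its one and only address."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        address = hashlib.md5(data).hexdigest()
        if address in self._objects and self._objects[address] != data:
            # A second, different file has produced the same "unique"
            # address. Storing it silently overwrites the first file;
            # refusing it loses the second. Either way, data is lost.
            raise ValueError(f"hash collision at address {address}")
        self._objects[address] = data
        return address

    def get(self, address: str) -> bytes:
        return self._objects[address]
```

Practical MD5 collisions were demonstrated by researchers in 2004, so the failure mode is not merely theoretical.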
Despite its shortcomings, the product is reported to be selling well, based in part on the sheer weight of the EMC brand and on the commonly made misrepresentation that the product is "compliance certified." (To our knowledge, no governmental entity "certifies" any product for compliance with laws or regulations on data retention.)
"Shatter the Platter," attendees agreed, was a shot across their collective bow. They complained that OSTA, the main standards development and trade organization for the optical industry, had done very little to respond to the onslaught of misinformation from the disk array industry. Rather, meetings of the group often devolved into arguments over Blu-Ray and HD DVD optical disc formats—important only to game system manufacturers and motion picture production houses. With the optical industry in disarray, EMC was eating their collective lunch.
No one was saying that hard disks cannot be used for archive. Truth be told, most of host QStar’s archive system and appliance offerings use hard disks as a cache for data that will eventually be burned to optical media or written to magnetic tape, two media offering greater longevity and integrity than spinning rust. Moreover, QStar used the event to announce its own first-generation CAS: SntryCAS. Unlike EMC’s product, SntryCAS does not lock the customer into an overpriced disk storage array. Rather, it applies content addressing to data on its way to any storage target connected on the back end: disk, tape, or optical.
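The architectural difference is easy to sketch. What follows is a hypothetical illustration of that approach, not SntryCAS code; the Backend interface and all names are our own invention. The point is simply that content addressing can be a function applied in the data path, indifferent to what sits behind it:

```python
import hashlib
from typing import Protocol

class Backend(Protocol):
    """Any storage target that can write and read named blobs:
    a disk array, a tape library, an optical jukebox."""
    def write(self, name: str, data: bytes) -> None: ...
    def read(self, name: str) -> bytes: ...

class CASGateway:
    """Content addressing applied on the way to storage, with no
    dependence on any particular array behind it."""

    def __init__(self, backend: Backend):
        self.backend = backend  # disk, tape, or optical; the gateway is agnostic

    def store(self, data: bytes) -> str:
        address = hashlib.sha256(data).hexdigest()  # address derived from content
        self.backend.write(address, data)
        return address

    def retrieve(self, address: str) -> bytes:
        data = self.backend.read(address)
        # The address doubles as an end-to-end integrity check on reads.
        if hashlib.sha256(data).hexdigest() != address:
            raise IOError(f"object {address} failed integrity verification")
        return data
```

(We use SHA-256 here rather than MD5 to make the collision exposure discussed above vanishingly small; we do not know which hash SntryCAS actually employs.)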
SntryCAS also suggests the future of archive generally. Moving forward, QStar technologists prefer to store data as objects, reducing dependency on a file system that may become obsolete over the period that the data remains archived. Considerable work (the most impressive work we have seen to date, in fact) is being done at the company to produce next-generation technologies based on object storage architecture. To fill the line cards of resellers and integrators selling into customer requirements today, the company maintains a robust and well-differentiated suite of products for archive, virtual tape, and other data movement, migration, and replication functions. The company also used the event to reiterate the announcement it made at the March CeBIT Conference in Hannover, Germany of a specialized e-mail archiving appliance, SntryML, aimed at smaller organizations, and to add another specialty kit, SntryPACS, targeted at smaller healthcare service providers.
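A minimal sketch of the object idea, under our own assumptions about layout (no vendor implements it exactly this way): each archived object carries its own descriptive metadata and fixity information, so retrieval decades from now depends on nothing as perishable as a file system or a directory path:

```python
import hashlib
import json
import time

def make_archive_object(payload: bytes, metadata: dict) -> bytes:
    """Bundle data with self-describing metadata: a 4-byte header
    length, a JSON header, then the payload itself."""
    header = dict(metadata)
    header["length"] = len(payload)
    header["sha256"] = hashlib.sha256(payload).hexdigest()
    header["archived_at"] = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
    header_bytes = json.dumps(header).encode("utf-8")
    return len(header_bytes).to_bytes(4, "big") + header_bytes + payload

def read_archive_object(blob: bytes):
    """Recover metadata and payload from the blob alone, with no
    help from any surrounding file system."""
    hlen = int.from_bytes(blob[:4], "big")
    header = json.loads(blob[4:4 + hlen])
    payload = blob[4 + hlen:4 + hlen + header["length"]]
    assert hashlib.sha256(payload).hexdigest() == header["sha256"]
    return header, payload
```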
Archive Conclusions
Spending a week in the warm Mediterranean sun wading through the minutiae of archive media, architectures, and processes might not be everyone’s idea of a good time, but it helped pull together the threads of discussion that we have covered in the past five parts of this series. For now, these are the conclusions we can reach.
First, archive is a key function of any well-disciplined and managed data storage infrastructure. Archive is not backup, nor is it hierarchical storage management (HSM), though it has some elements in common with both. Archive is a more granular technique for selecting and moving specific data to long-term storage in accordance with business policy and in response to business requirements.
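A toy illustration of the distinction, against an invented policy: where backup copies everything and HSM demotes data on access patterns alone, an archiver selects specific records against explicit business criteria and moves them off primary storage:

```python
import os
import shutil
import time

# Hypothetical business policy: which files qualify for the archive.
POLICY = {
    "min_age_days": 180,                      # untouched for six months
    "extensions": {".pdf", ".tif", ".msg"},   # record types the business must retain
}

def select_for_archive(root: str):
    """Walk primary storage and yield only the paths that business
    policy says belong in the archive."""
    cutoff = time.time() - POLICY["min_age_days"] * 86400
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if (os.path.splitext(name)[1].lower() in POLICY["extensions"]
                    and os.path.getmtime(path) < cutoff):
                yield path

def archive(root: str, vault: str):
    for path in select_for_archive(root):
        dest = os.path.join(vault, os.path.relpath(path, root))
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        shutil.move(path, dest)  # move, not copy: reclaim primary storage
```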
The key word here is business. Archive isn’t about back-office operational efficiency (though it can contribute to it). It is about solving business problems such as compliance, historical auditability, data non-repudiation, and intellectual property management and protection. Archive, unlike either backup or HSM, has a full business value case to offer in terms of cost savings, risk reduction, and business process improvement.
Second, archives aren’t musty cabinets where files go to sleep. They must be carefully created using ingestion methods that are themselves operationally efficient, transparent to users, and highly visible to archive managers. How data is written into an archive must be clearly visualized by designers: the format of the data itself, the methods used to validate and secure it, the media on which the data is written, and the tagging methods used to identify and retrieve the data later. The conceptualization of the archive experience must also include some provision for how data expiry and deletion will be accomplished. Depending on the nature of the archive, designers must also determine how to cope with the limitations of media, which usually entails periodic migration of data to fresh media (frequently for hard disks, less so for tape and optical) and to new formats to address changes in data "containers." Container formats, such as Adobe’s Acrobat PDF, change over time, and the data they contain must be migrated to a newer container format whenever significant changes occur. The format, used by some national archives, has changed 37 times since its introduction in the early 1990s.
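A compressed sketch of ingestion along these lines, using a vault layout we invented for illustration: fixity, searchable tags, and the expiry date are all fixed at write time, not discovered later:

```python
import hashlib
import json
import os
import time
import uuid

def ingest(path: str, vault: str, tags: dict, retain_days: int) -> str:
    """Write a file into the archive with everything that retrieval
    and expiry will later need: a fixity hash to validate future
    reads, tags so the data can be found again, and a deletion date
    established at ingestion time."""
    with open(path, "rb") as f:
        data = f.read()
    object_id = str(uuid.uuid4())
    record = {
        "id": object_id,
        "source_path": path,
        "sha256": hashlib.sha256(data).hexdigest(),
        "tags": tags,
        "ingested": time.strftime("%Y-%m-%d"),
        "expires": time.strftime(
            "%Y-%m-%d", time.localtime(time.time() + retain_days * 86400)),
    }
    with open(os.path.join(vault, object_id + ".bin"), "wb") as f:
        f.write(data)
    with open(os.path.join(vault, object_id + ".json"), "w") as f:
        json.dump(record, f, indent=2)
    return object_id
```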
Finally, archives impose a significant management burden on organizations that must be addressed intelligently and proactively. If primary storage is a "junk drawer," archive is a mechanism for putting it straight: taking less frequently accessed but nonetheless important information off expensive storage devices and moving it into less expensive and more durable storage infrastructure. However, archives must themselves be managed to keep them from becoming simply more "junk drawers."
If disk drives are used as archive media, the propensity of SATA drives to fail (which appears far to exceed vendor-proffered annual failure rates, as documented recently in separate empirical research from Carnegie Mellon University and Google) must be addressed through proactive management and a program of frequent data migration.
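Continuing the toy vault layout from the ingestion sketch above, a periodic scrub-and-migrate pass might look like the following; the schedule and the one-year media threshold are invented for illustration:

```python
import hashlib
import json
import os
import shutil
import time

def scrub_and_migrate(vault: str, fresh_vault: str, max_media_age_days: int = 365):
    """Re-read every archived object, verify it against its recorded
    hash, and copy objects sitting on aging media to fresh media.
    Catching a failing drive here is far cheaper than discovering it
    at retrieval time, years later."""
    cutoff = time.time() - max_media_age_days * 86400
    for name in os.listdir(vault):
        if not name.endswith(".json"):
            continue
        with open(os.path.join(vault, name)) as f:
            record = json.load(f)
        blob_path = os.path.join(vault, record["id"] + ".bin")
        with open(blob_path, "rb") as f:
            data = f.read()
        if hashlib.sha256(data).hexdigest() != record["sha256"]:
            print(f"INTEGRITY FAILURE: {record['id']}")  # restore from a duplicate copy
            continue
        if os.path.getmtime(blob_path) < cutoff:
            shutil.copy2(blob_path, fresh_vault)                  # migrate the data
            shutil.copy2(os.path.join(vault, name), fresh_vault)  # and its metadata
```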
If tape is the archive modality, all of the best practices for tape preservation and protective duplication must be observed. The same holds true for optical discs, which, despite vendor claims of 10 to 50 years of durability, might well lose their integrity much sooner if improperly stored or exposed to daylight (UV radiation in particular).
This is archive in a nutshell: a circular process beginning with the definition of archiving objectives and the archive experience itself, followed by the selection of appropriate archive objects, their ingestion into an archive platform, and the management of both data and platform over time. There is no one-size-fits-most solution, despite the claims of many vendors, and there are many value-added components that are not, strictly speaking, core components of an archive but that may contribute worthwhile extra functionality. Unfortunately, sifting through vendor hype to separate the need-to-have components from the nice-to-have may be a challenge.
While building a business-class archive isn’t rocket science, it isn’t child’s play either. As with most things in storage, archives are standards-free, and the interoperability of parts from different vendors is by no means guaranteed. To further our understanding of archive and of configurations that work, the last thing we discussed in Sorrento was the creation of an informational Web site to provide objective and actionable information. It will be nested first under the banner of the Data Management Institute at http://archive.datainstitute.org, but will eventually move to its own URL.
Maybe a clear-headed and vendor-neutral information outlet can help stop campaigns such as "Shatter the Platter" that contribute nothing to our understanding of data management, but promulgate what is increasingly viewed as bad technology—"data roach motels" in which the data checks in but can’t check out.
Your opinion is welcome: [email protected].