Puzzling Out an Archive Strategy, Part 1

Archiving isn’t rocket science, but it can be a challenge for even rocket scientists to do well.

Just for a second, close your eyes and picture an archive. Chances are that what comes to mind are scenes from Raiders of the Lost Ark or perhaps the more recent National Treasure: musty old rooms filled top to bottom with ancient tomes and bric-a-brac—all covered in dust and spider webs. In such antiquated settings, you can imagine that, even with the best of cataloging schemes, things get misplaced—or worse, their contents and meaning are forgotten altogether over the course of time.

While the archivists—the clerks of history—might know generally where information is, even their recollection may wane over time. With the natural tendency of all things toward entropy, at some point the whole of any given century might go missing.

Perhaps the original videotapes of the Apollo 11 moon landing have met just that fate. NASA has been unable to locate them for over a year now. It seems that the originals were shipped to the National Archives in 1970; in 1984 they were crated for return to Goddard Space Flight Center in Greenbelt, Maryland. Somewhere along the Beltway they disappeared, creating more fuel for the conspiracy theorists who insist that the event never happened at all.

Though disconcerting, that loss is perhaps dwarfed in significance by the loss of telemetry tapes from the Mercury space program. As documented previously in this column, the magnetic tapes storing the data crumbled to dust a few years ago during an effort to replay them. It proved an already-known point: magnetic media lose their integrity far sooner than, say, paper or stone tablets. The same holds true for magnetic disk and optical media.

Despite these limitations, archiving on magnetic and optical media is currently experiencing a renaissance of sorts. Driven primarily by concern over regulatory compliance as it pertains to data retention, searchable archives have become the rage.

Analysts are underscoring the point. According to IDC, archive and replication software is leading the pack of storage-related wares as a result of compliance worries. Gartner, too, holds the growth of archiving to be an inevitable truth, a natural response to data growth and new laws that require companies to put their storage junk drawers in order.

Truth be told, there have always been good reasons to archive. For one, archiving alleviates the primary storage burden and reduces, in theory, the rate of acceleration in IT spending on expensive dual-ported disk array storage.

The Economies of Storage

This rationale, however, did not guide IT managers to embrace archiving in their open systems environment prior to the rise of the regulatory mandate. It had to do, in part, with the economics of open systems storage itself.

With Fibre Channel drives at $89 per GB ($189 in an FC fabric), capacious SATA drives at $44 per GB ($144 in an FC “SAN”), optical at a couple of dollars per GB, and tape at 40 cents per GB, the ROI for going to a multi-tier storage infrastructure made little sense.
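As a rough illustration of the arithmetic behind these figures, here is a minimal sketch that prices a block of data on each tier. The per-GB numbers are the ones quoted above, except the optical price, which is an assumed $2.00 standing in for "a couple of dollars." The function names are hypothetical, and the sketch counts raw media cost only; it ignores migration software, interconnects, and labor, all of which weigh on any real ROI case.

```python
# Per-GB prices as quoted in the column. The optical figure is an
# assumption standing in for "a couple of dollars per GB".
PRICE_PER_GB = {
    "fc_disk": 89.00,
    "fc_disk_in_fabric": 189.00,
    "sata_disk": 44.00,
    "sata_disk_in_san": 144.00,
    "optical": 2.00,  # assumed value
    "tape": 0.40,
}


def tier_cost(capacity_gb, tier):
    """Raw media cost, in dollars, of holding capacity_gb on one tier."""
    return capacity_gb * PRICE_PER_GB[tier]


def migration_savings(capacity_gb, from_tier, to_tier):
    """Difference in raw media cost between two tiers (positive = cheaper)."""
    return tier_cost(capacity_gb, from_tier) - tier_cost(capacity_gb, to_tier)


if __name__ == "__main__":
    capacity_gb = 1000  # hypothetical data set of 1,000 GB
    for tier in PRICE_PER_GB:
        print(f"{tier:>20}: ${tier_cost(capacity_gb, tier):,.2f}")
    saved = migration_savings(capacity_gb, "fc_disk_in_fabric", "sata_disk_in_san")
    print(f"FC fabric -> SATA in SAN: ${saved:,.2f} difference in media cost")
```

Swapping in current prices, or adding a per-GB figure for management overhead, is a one-line change to the table.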

An IBM official remarked to me in 2002 that creating archival storage arrays at a significantly lower price point than primary arrays made little sense. In his words, "I find it difficult to see how you could make a business case for such a strategy based on the fractional cost savings—measured in peta-cents [sic]—that would accrue to migrating data across less expensive arrays."

The IBMer was doubtlessly contrasting the economic realities of open systems storage with earlier mainframe storage. In the mainframe world, tiered storage translated into very expensive system memory, significantly expensive direct access storage devices (the refrigerator size of a DASD platform required that entire buildings be constructed to house additional boxes of disk if you exceeded your capacity), and tape. The cost differentials between these storage media were significant, and IBM had an additional incentive to create archiving technologies like DFSMS (for management) and DFHSM (for hierarchical storage migration) in order to sell more storage to customers.

IBM’s storage management software (DFSMS) enabled individual administrators to manage more storage capacity, so you wouldn’t need to hire another body when you deployed more DASD or tape, while hierarchical storage management (DFHSM) automated the migration of data sets that had gone to sleep off your DASD and out onto tape.

This metaphor of tiered storage with archive never translated into open systems. Prior to the advent of Fibre Channel fabrics and Gigabit Ethernet, the interconnects between storage devices were of such low bandwidth that simply moving data from one set of disks to another was a lumbering procedure. At the end of the day, the prices of arrays differentiated for data capture (tier one) and retention (tiers two and three) were not very compelling in and of themselves.

Even in the 1990s (before the regulatory boom), you could make a compelling business case for archiving if you looked at its broader value: the “milieu benefits” that would accrue to segregating older data, unlikely to be re-referenced, from the data you were touching every day. Such segregation would make backups work better. Archiving meant culling out material that didn’t need to be included in daily, weekly, and monthly data protection processes, enabling them to become more efficient. Even this infrastructure management optimization value, however, was too often ignored: the usual response to burgeoning data was simply to buy more disk and whine about the growing inefficacy of tape backup.

Increasing Interest in Archives

SOX and HIPAA and SEC rules and Federal Court Evidentiary Rules, just to name a few, have changed the lackadaisical interest shown by companies toward archive. Today, archive is on the lips of businesspeople ranging from the CEO and board of directors to the CIO and IT manager. Driving this new interest is not a wholesome cost-benefit analysis, but fear of regulatory non-compliance. For purveyors of archive wares, any interest is good, but for those puzzling over the need for an archive solution, fear makes a poor driver for good strategy.

Archive isn’t rocket science, but as the NASA stories suggest, it can be a challenge for even rocket scientists to do well. Here are some questions we will develop in the next installments of this series.

First, how do we determine what data to archive? This seems simple enough, but it is actually more complex than one might think. Going forward, companies must undertake the enormous task of developing a high-granularity approach to understanding data from the standpoint of business processes. Just archiving data based on access frequency (traditional HSM) will not be enough; we need to know the content and the context of the data, which is a non-trivial technical challenge.
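To make the limitation concrete, here is a minimal sketch of the access-frequency approach that traditional HSM takes: flag any file whose last access time is older than a threshold. It is an illustration, not any vendor's implementation; the function name and the `max_idle_days` parameter are hypothetical. Note what is missing: nothing here looks at what a file contains or which business process it belongs to.

```python
import os
import time

SECONDS_PER_DAY = 86400


def find_archive_candidates(root, max_idle_days, now=None):
    """Return paths under root not accessed within max_idle_days.

    Selection uses only the file system's last-access timestamp,
    so content and business context play no part in the decision.
    """
    now = time.time() if now is None else now
    cutoff = now - max_idle_days * SECONDS_PER_DAY
    candidates = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # st_atime is the last-access time reported by the OS.
            if os.stat(path).st_atime < cutoff:
                candidates.append(path)
    return sorted(candidates)


if __name__ == "__main__":
    # Hypothetical usage: list files in the current directory tree
    # that have been idle for more than a year.
    for path in find_archive_candidates(".", max_idle_days=365):
        print(path)
```

One caveat: many file systems are mounted with access-time updates reduced or disabled, which makes the timestamp itself an unreliable signal and is one more reason access frequency alone is a weak basis for archive policy.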

Second, we need to understand how to vet an archive solution from the dizzying morass of marketecture around products available today and in the future. How do we know which solution is right for us when meaningful performance data is absent and scaling requirements are untested? Rumors have begun to surface about products from HDS and EMC, among others, that are showing profoundly slow ingestion rates. Vendors respond that the numbers are without meaning, insisting that many variables impact the speeds and feeds of their products. A future column will drill down into this issue.

Third, we need to understand how we manage archives over time, both to reduce the likelihood of data loss and corruption and to prevent archives themselves from becoming as unmanageable as our primary storage “junk drawers.” As a practical matter, the labor pool of qualified IT professionals is shrinking. That will require archive management tools that enable vast quantities of stored data to be managed, rotated, migrated, and verified by a very small contingent of administrators.

Finally, we need to tackle the issue of retrieval: searching and retrieving archival data when required. There are many technologies in the market today, but none is without significant foibles. A key issue is how to search data that one regulation requires us to retain in an encrypted archive for a long time, so that discovery data can be produced within the short timeframes mandated by other regulations.

Your input, as a vendor or consumer of archiving platforms, is welcome:
