In-Depth

Archive This—Part 2 in a Series

Real archiving is more challenging to set up and manage over time than simply backing up data.

In part one of this series on data archiving, I examined the history of archives and the lack of incentives that existed prior to today's regulatory compliance. I’ll now take a look at the issue of determining what to archive. If your answer is “everything,” then you may not like what follows.

In recent years, as regulatory data retention has become a “front office” issue, many companies have decided to archive everything. This is partly the result of confusion over the meaning of archive and how it differs (1) from backup and (2) from hierarchical storage management (HSM).

Backup and archive are two very different things. Backups are copies of data taken periodically to provide a backstop against the inevitabilities of real life—disasters happen. When they do, application and database hosting platforms (the hardware) can usually be replaced fairly rapidly using commodity (off-the-shelf) hardware. The same holds true for LANs and user-computing platforms. With some logistical planning, and perhaps a re-supply agreement from your favorite equipment vendor, you can quickly restore the components (or a reasonable facsimile) you need to support a skeleton crew operating your most critical open systems applications.

Without the data, however, none of this recovery activity is purposeful. You need a copy of the data to put yourself back in business, and tape backup is the war horse we all know and use—even if we don’t like it.

Backups are made at routine intervals, if done properly. Full backups of everything are made at least weekly, with data changes (snapshots, incrementals, etc.) captured to tape more frequently. The idea is to take the copies of your data on tape and cycle them through an off-site storage facility (or a recovery site if your company has a designated facility where your recovery will be performed).

Note the word “cycle.” A managed tape backup strategy usually cycles data-laden tapes back and forth from the off-site facility to the primary site on a regular basis. This enables the re-use of the media until its usability limits have been reached. Old media with some runway still left is “scratched” then reused—the way that many of us used AOL diskettes back in the day. Older tapes that have been reused a certain number of times are taken out of service altogether.
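To make the cadence and cycling concrete, here is a minimal sketch in Python. The Sunday full-backup day, the two off-site pools, and the fifty-mount retirement limit are illustrative assumptions of mine, and real backup software manages this bookkeeping for you.

    from datetime import date

    # Minimal sketch of the cadence and cycling described above: weekly
    # fulls, incrementals in between, tape sets alternating between two
    # off-site pools, and tapes retired after a mount-count limit. The
    # weekday, pool names, and mount limit are illustrative assumptions.

    FULL_BACKUP_WEEKDAY = 6       # Sunday (date.weekday() counts Monday as 0)
    MAX_MOUNTS_BEFORE_RETIRE = 50

    def backup_type_for(day: date) -> str:
        """Return the kind of backup to run on a given calendar day."""
        return "full" if day.weekday() == FULL_BACKUP_WEEKDAY else "incremental"

    def offsite_pool_for(day: date) -> str:
        """Alternate full-backup tape sets between two off-site pools by week."""
        return "offsite-A" if day.isocalendar()[1] % 2 == 0 else "offsite-B"

    def ready_to_retire(mount_count: int) -> bool:
        """Take a tape out of service once its reuse limit is reached."""
        return mount_count >= MAX_MOUNTS_BEFORE_RETIRE

    if __name__ == "__main__":
        today = date.today()
        print(backup_type_for(today), offsite_pool_for(today))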

This process is similar to archive only in the sense that archival media must also be cycled. However, archives are generally retained on media for a much longer period, with migration being determined by the service life of the medium. If a hard disk holds archival data, you generally need to migrate it from one set of disks to another every three to five years in accordance with disk failure rates or mean time to failure stats. Vendors have these documented, but as recent reports from Google and Carnegie Mellon suggest, your mileage may vary greatly. So, the frequency of migration in an archive eventually aligns with your confidence in your gear.
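For a sense of how simple the bookkeeping can be, here is a minimal sketch in Python. The three-year service-life figure is a conservative assumption drawn from the low end of the range above, not a vendor specification.

    from datetime import date, timedelta

    # Minimal sketch: flag archival media that are due for migration based
    # on an assumed service life. Three years is a conservative assumption
    # from the low end of the three-to-five-year range; substitute your own
    # confidence in your gear.
    ASSUMED_SERVICE_LIFE_YEARS = 3

    def migration_due(placed_in_service: date, today: date) -> bool:
        """True when a media set has been in service past the assumed life."""
        service_life = timedelta(days=365 * ASSUMED_SERVICE_LIFE_YEARS)
        return today - placed_in_service >= service_life

    # Example: a disk shelf loaded with archival data in early 2004
    print(migration_due(date(2004, 2, 1), date(2007, 6, 1)))  # True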

Archive data, unlike backup, doesn’t change. The point of archiving is to take data and store it safely for later re-reference—or until a legal mandate for retention expires—depending on the data itself.

Transforming Archival Data

In some cases, you may need to transform archival data, or rather the way that data is packaged, to keep pace with changes in the format requirements of software that enables the data to be read. One national archivist from Australia told me that the country had wanted to retain data of historical value in an “open” container format, so they chose Adobe Acrobat. Adobe accommodated them with an escrow of Acrobat format source code to facilitate the strategy. The problem, however, was that Adobe changed Acrobat several times in the first couple of years, requiring the archivists to read data back out of the archive, reformat it in accordance with the new container format (i.e., transform it), then write it back to the archive again. Chances are that anyone with long-term data archiving requirements will encounter the same difficulties.

Other than periodic re-hosting and transformations, archives generally have a much less frequent cycle of change than do backups. The two are quite dissimilar in terms of their objectives and their operational nature.

The second misunderstanding, confusing HSM with archive, is actually a by-product of industry obfuscation. In much of the literature presented by vendors seeking to sell “tiered storage” in open systems environments, HSM is represented as archive. Technically speaking, this is as inaccurate as equating backup with archive.

HSM is the application of automated data movement technology to the problem of capacity management. The typical HSM strategy keys either to a “capacity watermark” (when the amount of data stored on a volume reaches a certain percentage of the capacity of that volume, move some of the data somewhere else) or to an access frequency count (when a file or data object has not been touched in X amount of time, as indicated by the date last accessed or date last modified attribute in the metadata associated with the file, then move the file to a designated target). Some HSM products offer a combination of both watermark and access frequency parameters to guide data movement.
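To illustrate how those two triggers combine, here is a minimal sketch in Python. The 85 percent watermark and 90-day idle threshold are illustrative values of mine, and real HSM products evaluate these policies inside the file system rather than by walking directories.

    import os
    import shutil
    import time

    # Minimal sketch of the two HSM triggers described above: a capacity
    # watermark on the volume plus an access-age threshold per file.
    # The 85 percent watermark and 90-day idle window are illustrative.
    WATERMARK = 0.85
    MAX_IDLE_DAYS = 90

    def volume_over_watermark(path: str) -> bool:
        """True when used space on the volume crosses the watermark."""
        usage = shutil.disk_usage(path)
        return (usage.used / usage.total) >= WATERMARK

    def stale_files(path: str, max_idle_days: int = MAX_IDLE_DAYS):
        """Yield files whose last-access time exceeds the idle threshold."""
        cutoff = time.time() - max_idle_days * 86400
        for root, _dirs, names in os.walk(path):
            for name in names:
                full = os.path.join(root, name)
                if os.stat(full).st_atime < cutoff:
                    yield full

    def migration_candidates(path: str) -> list:
        """Combine both triggers: migrate stale files only when the volume is full enough."""
        return list(stale_files(path)) if volume_over_watermark(path) else []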

Some HSM products also work within specific business applications, such as e-mail systems or databases. As with file system HSM, these products migrate older data to alternative hosting platforms or media while leaving stubs in their place to point the application to the new location of the data should anyone ever need to access it again. In addition to freeing up capacity, the strategy may also help to keep applications operating at peak efficiency.
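A rough sketch of the stub idea, again in Python and purely for illustration (real products do this transparently inside the file system or the application, and the JSON stub format and ".stub" suffix are conventions of my own):

    import json
    import shutil
    from pathlib import Path

    # Minimal sketch of stub-based migration: move a file to secondary
    # storage and leave a small pointer ("stub") in its place. The JSON
    # stub format and ".stub" suffix are illustrative conventions only.

    def migrate_with_stub(source: Path, secondary_root: Path) -> Path:
        """Move a file to secondary storage and write a stub pointing to it."""
        secondary_root.mkdir(parents=True, exist_ok=True)
        target = secondary_root / source.name
        shutil.move(str(source), str(target))
        stub = source.with_suffix(source.suffix + ".stub")
        stub.write_text(json.dumps({"moved_to": str(target)}))
        return stub

    def resolve_stub(stub: Path) -> Path:
        """Follow a stub back to the migrated file's new location."""
        return Path(json.loads(stub.read_text())["moved_to"])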

HSM and Archiving Are Not the Same

There is nothing wrong with HSM, and the strategy, properly implemented, can deliver good value to an organization. However, to say that HSM is synonymous with archive is like equating a calculator with a computer. HSM is not very nuanced. In the best of circumstances, it does what it does well and consistently. Usually, the only complaint that arises in an HSM system is linked to a poor understanding of data usage characteristics.

For example, HSM migration rules might be established to move all records in a database or e-mails in an e-mail system or files in a repository that have not been touched in 30 days. The process completes and space is freed up on the primary volume, which is immediately put to use. Then, on day 31, a month-end report that no one knew about needs to be run and the migrated data needs to be returned to active storage to perform the task. However, now the space is already occupied by new data. The result can be nightmarish.

HSM works well in situations where the migrated data is linked in some way to the original data layout. Rockville, MD-based FileTek offers a good example of how this can work. When a company uses FileTek’s StorHouse to extract older data from a database into an “archival” repository, links persist between the original database and the migrated data set (in fact, a data warehouse), and SQL queries can be run against all of the data. No re-staging or reloading of the migrated data is required.

If HSM is a kind of archiving, however, it is a primordial cousin. Real archiving considers not just capacity allocation efficiency, but also capacity utilization efficiency—with the goal of ensuring that the correct data is migrated from active storage and placed in the archive.

One Big Junk Drawer?

One reason that HSM had a difficult time selling itself into the distributed world (in addition to the low bandwidth of networks until the introduction of Gigabit Ethernet) was that storage devices were not sufficiently costly to motivate consumers to migrate data to cheaper media for cost containment. To date, it remains a challenge to sell HSM merely on the basis of infrastructure cost containment.

What has changed is that regulations and laws now require certain types of data to be retained in an accessible, “discoverable” (searchable), and secure (sometimes defined as “encrypted,” in other cases as “non-repudiable”) state. Knowing what data needs to be handled in this way requires a substantially more granular analysis than does the setup and implementation of a simple watermark/access frequency-based HSM strategy.

Not all data belongs in an archive. If everything is directed there, the archive becomes a mere collection of auxiliary junk drawers built around the primary storage junk drawer. Archives need to be made of sterner stuff.

Selecting the data that belongs in an archival repository keys to many variables and criteria. Regulatory requirements, intellectual property considerations, evidentiary rules, the creator of the data, and myriad other object descriptors need to be assessed—many of which are not captured in simple file metadata. Proper archiving does not “anonymize” data the way simpler HSM does. The fact that an object has reached a nominal access frequency threshold doesn’t make it a candidate for the archive, any more than the fact that it happens to reside on a repository “owned” by Human Resources does.
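As a thumbnail of what rule-driven selection might look like, here is a minimal sketch in Python. The record fields and rules are hypothetical stand-ins for the kinds of descriptors listed above, which in practice live in business systems rather than in file metadata.

    from dataclasses import dataclass

    # Minimal sketch of archive selection keyed to business criteria rather
    # than access frequency alone. The fields and rules are hypothetical
    # stand-ins for descriptors that rarely live in simple file metadata.

    @dataclass
    class DataObject:
        path: str
        owner_department: str
        record_type: str              # e.g., "personnel_record", "401k_plan", "scratch"
        subject_to_regulation: bool = False
        under_legal_hold: bool = False
        retention_years: int = 0

    def belongs_in_archive(obj: DataObject) -> bool:
        """Archive only data that satisfies an explicit business criterion."""
        if obj.under_legal_hold or obj.subject_to_regulation:
            return True
        if obj.record_type in {"personnel_record", "401k_plan"}:
            return True
        return obj.retention_years > 0

    # A scratch file with no retention criterion fails every test
    print(belongs_in_archive(DataObject("shared/scratch/notes.tmp", "Marketing", "scratch")))  # False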

Imagine that you want to retain certain information from HR—401(k) plan details, personnel records, and so on. Retaining that information may well be an objective of an archive, but it does not mean that ALL data stored on the HR platform is suitable for archiving. You discover, for example, that someone in HR thinks his next wife’s last name is JPEG, so he has been downloading every picture he can find of her off the Internet. Clearly, this file repository is not equal to personnel records in terms of its corporate importance and should be excluded from a real archive process.

A granular understanding of data is required to set up effective archiving rules. You must know what is important from a corporate, legal, historical, and operational perspective. The challenge is to arrive at such an understanding without having to undertake a research project equal to the Human Genome Project. Truth be told, usually no one knows the contextual importance of a file—from the user who creates the file without understanding complex legal ramifications or historical importance to the IT person who sees all files as a bunch of 1s and 0s.

Getting to granularity in archiving will require a bit of research and consultation with many stakeholders in the company. There are no shortcuts, despite what certain vendors claim about the capabilities of their products to auto-classify data. Recently, one such vendor, Scentric, offered a free-to-download-and-try “light” version of its Destiny product, which is touted as a universal classification tool. Folks who have tried it report that the tool placed many files in classification categories where they didn’t belong, generating many “false positives,” as the expression goes in the literature these days. While Scentric concedes that the trial program lacks the bells and whistles of the full monty, and that even the full-blown Destiny product needs to be “trained” and refined to meet specific business needs, I have my doubts that anyone has managed to take the pain out of data classification.

Setting up an effective archive is much more difficult than setting up a less granular HSM process. Both have their place, but real archiving is significantly more challenging to set up and manage over time. In the next column, we will look at some of the criteria that you can use to vet an archiving solution for your company.

Have an archiving story to tell? Feel free to share it here: [email protected].
