
What Does "Archive" Really Mean? -- Part 1 of 2

If one technology strategy deserves to be called truly strategic, it is data archiving. We clear up several misconceptions about the technology.

If I had to pick one technology strategy to classify as truly strategic in the New Millennium, it would not be x86 virtualization, or cloud computing, or unified network management. It would be data archiving.

I would argue with high confidence (and lots of empirical statistics) that no other technology strategy has as much potential to deliver real business value -- in terms of cost-savings, risk reduction and improved productivity -- as archive does. No other strategy can offer as much technical ROI, whether measured by application performance improvement or machine/network/storage resource utilization efficiency, as can a well-defined program of active data archiving.

That is one reason why I watched with great interest the recent formation of a new industry initiative called the Active Archive Alliance, which was launched by a collection of vendors including Spectra Logic, FileTek, Compellent, and QStar Technologies. I have covered many of these companies in this column before and I am reasonably well versed in what each brings to the challenge of data management.

The other thing that really piqued my interest about this initiative was simply that it was a true collaboration, something rarely seen in these days of cutthroat competition among vendors for what few purchase orders manage to trickle out of recessionary IT budgets.

We have seen a number of "collaborations" of late -- usually taking the form of proprietary hardware/software stacks being pushed forward by the likes of HP (with Microsoft and QLogic), Cisco Systems (with VMware and NetApp), and Oracle (they buy their collaborators). These collaborations generate slicker marketing than the Active Archive Alliance does, and they pay big dollars to publicists and analysts to paint rosy pictures of infrastructure optimization, end-to-end management, and the decisive remedy for interoperability hassles in distributed computing made possible by their wares. Their message, in effect, is that perfection can be yours for the small price of sacrificing choice in favor of an exclusionary stovepipe computing model. Sounds like a 1980s era mainframe to me -- ironic given that most of the firms participating in these collaborative ventures actually built their companies to be open alternatives to mainframe computing.

That is what makes the Active Archive Alliance a real collaboration in my mind. At the launch, you have a tape technology vendor, a disk array vendor, a database archiving software vendor, and a general purpose archive software vendor working together toward a common goal and without much regard, as nearly as I can see, to the inherent potential conflicts between founders. Should disk or tape be used as the archive repository? As FileTek seeks to extend its reach into file archiving, won't that place it into direct competition with QStar? These questions seem to be less important to the founders than the bigger concern about a lack of industry support for open archiving generally.

Clearing Up the Confusion

When last quoted on this topic, the Storage Networking Industry Association (SNIA) was both promoting archiving (having launched an Information Lifecycle Management forum) and dissing the idea with survey findings in 2007 claiming that archival tools just weren't there. The Optical Storage Technology Association was making more noise debating the relative merits of HD DVD and Blu-ray disc technology than it was addressing the potential application of optical media to long-term archives. A tape vendor collaboration orchestrated by Quantum was gobbled up by SNIA because the latter didn't want the competition for sponsorship dollars. EMC was running an expensive "shatter the platter" campaign to drive folks from optical storage to their burgeoning content addressable storage (CAS) platform, Centera.

No one was speaking for archive technology specifically. In the vacuum, old concepts became etched in stone. How often did we hear the truism that archiving can't work in distributed computing environments because (1) the requisite network bandwidth isn't there to transport data between disk and tape/optical storage in an efficient way, (2) storage targets aren't shared, and (3) users control the data and don't participate in classification schemes that could be used to write archive policies? Though these truisms are no longer necessarily true, they remain widely held beliefs.

The lack of coherent messaging about archiving has seen the term "archive" confused with everything from "tiered storage architecture" to "hierarchical storage management" to "information lifecycle management" (not the architecture, but rather the marketing concept proffered by EMC in the early Aughties). Clearly, this cloud of confusion and misperception needs to be addressed by someone, and the Active Archive Alliance seems poised to make it part of its mission. To help them get started, we offer the following clarifications about "archive" technology.

Tiered Storage is Not "Archive"

Tiered storage architecture traces back to the earliest days of mainframe computing: data was first staged to memory, a precious and expensive commodity, and was quickly migrated to disk-based subsystems (Direct Access Storage Devices, or DASD). Given the small capacities offered by refrigerator-sized DASD, data was migrated off disk and onto tape as rapidly as possible. Each storage device or set of devices constituted a distinguishable tier of storage, discriminated in part by the I/O performance of the device.

In distributed systems, storage tiers have aligned with the price/performance/capacity of different storage products. Vendors offer "enterprise" arrays that feature high-speed/low-capacity/high-cost disk costing from $80 to $180 per GB. These products are referred to as Tier 1 storage in the industry vernacular and they usually feature on-array "value add" software functionality such as RAID, mirror splitting, array-to-array replication, and other functions that enable the vendor to sell a finished array (essentially a box of commodity disk) at a significant mark-up over the cost of the disks and chassis alone.

Tier 2 storage comprises arrays featuring lower-performance/high-capacity/lower-cost disk. Touting a price point of between $40 and $140 per GB, these arrays are designed to provide mass storage capacity for less-frequently accessed data.

Tier 3 storage usually refers to tape or optical disc technologies, where media costs hover between $0.44 and $1.50 per GB. Access speeds are significantly slower than either Tier 1 or Tier 2 devices, so they are primarily used to host historical archive or backup data.
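To put those per-gigabyte figures in perspective, here is a rough back-of-the-envelope comparison in Python using the midpoints of the price ranges quoted above. The 50 TB capacity and the midpoint math are illustrative assumptions for this sketch, not vendor pricing.

```python
# Back-of-the-envelope cost comparison using the midpoints of the
# per-gigabyte price ranges quoted above. Figures are illustrative
# only; real pricing varies by vendor, configuration, and volume.

TIER_COST_PER_GB = {
    "Tier 1 (enterprise disk)": (80 + 180) / 2,    # $130/GB midpoint
    "Tier 2 (capacity disk)":   (40 + 140) / 2,    # $90/GB midpoint
    "Tier 3 (tape/optical)":    (0.44 + 1.50) / 2, # ~$0.97/GB midpoint
}

capacity_gb = 50 * 1024  # assume 50 TB of retained data

for tier, cost_per_gb in TIER_COST_PER_GB.items():
    print(f"{tier}: ${capacity_gb * cost_per_gb:,.0f}")
```

Even at these rough numbers, the gap between keeping seldom-accessed data on Tier 1 versus Tier 3 runs to two orders of magnitude, which is the economic argument behind tiering in the first place.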

Recently, the tiered storage metaphor has been revisited by leading hardware vendors, both to provide a backdrop for discussing "hybrid disk systems" (which blend Tiers 1 and 2, or Tiers 2 and 3, in the same box) and to introduce "Tier 0" technologies, as solid state disk (SSD) has come to be termed. Hybrid disk systems combine different drive technologies in the same chassis and implement a typically simplistic form of hierarchical storage management (HSM) software in the array controller, which serves both to increase the cost of the drives in the array (and the profit to the vendor) and to establish a rudimentary mechanism for moving data, over set time intervals, between the different disk types in the array. (We discuss HSM later in detail.)

Solid state disk (SSD) is increasingly offered with "enterprise" arrays, especially as low-cost Flash memory is leveraged to reduce the price of SSD from thousands of dollars per GB to hundreds of dollars per GB. As with mainframe memory in the earlier example, SSDs are so costly that data generally needs to be pushed out of these components and onto Tier 1 or Tier 2 as quickly as possible, which again requires a simple HSM algorithm.

Bottom line: Tiered storage is not a new concept, nor is it in any way the equivalent of "archive" or intelligent data management. It is simply a storage architecture model -- a way of interconnecting various storage devices across which data management processes can operate. Unfortunately, it is misrepresented by some vendors as an alternative to managing data and archiving it according to its business context.

Hierarchical Storage Management is Not "Archive"

Hierarchical Storage Management (HSM) is, as we suggested, a software-based technology intended primarily to support the efficient allocation of tiered storage capacity. The focus of HSM traditionally has been on capacity allocation efficiency, not capacity utilization efficiency, which is the goal of intelligent data management and archive. The difference is important.

Capacity allocation efficiency refers to the balancing of storage assets so that one set of disks does not become fully populated with data, introducing performance degradation, while another set of disks is only sparsely used to host data. This concept has been extended to refer to the placement of data on spindles that provide necessary performance or services (such as replication) required for that data, and alternatively as a methodology for migrating older data to lower tiers to free up space on more-expensive and high-performance Tier 1 storage.

In most cases, HSM delivers capacity allocation efficiency without reference to the business context of the data itself. The preponderance of HSM algorithms involve three factors: how full the disk is, how old the data is, and when the data was last accessed or modified.

HSM that uses the first criterion, allocated capacity measurement, is sometimes called "watermark-based HSM." The operational premise is simple: when data amasses to a particular level or watermark, this triggers the migration of certain data to lower-tier media. Often, the methodology for selecting which data to move is FIFO (first in, first out). Data that has occupied the media the longest gets moved.
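To make the watermark/FIFO behavior concrete, here is a minimal sketch in Python. It is not any vendor's implementation; the 85 percent watermark, the file records, and the migrate() stub are assumptions made purely for illustration.

```python
# Minimal watermark/FIFO HSM sketch -- illustrative only, not any
# vendor's implementation. The watermark value, the FileRecord shape,
# and the migrate() stub are assumptions for demonstration.

from collections import namedtuple

FileRecord = namedtuple("FileRecord", ["path", "size_gb", "written_at"])

WATERMARK = 0.85  # assumed threshold: act once the tier is 85% full

def migrate(record):
    """Stand-in for the data mover that copies a file to a lower tier."""
    print(f"migrating {record.path} ({record.size_gb} GB) to lower tier")

def watermark_fifo_pass(files, tier_capacity_gb):
    """When allocated capacity crosses the watermark, move the oldest-written
    files (first in, first out) until usage falls back below the watermark."""
    used_gb = sum(f.size_gb for f in files)
    if used_gb / tier_capacity_gb < WATERMARK:
        return files  # below the watermark: nothing to do

    # FIFO: the data that has occupied the media longest goes first.
    remaining = sorted(files, key=lambda f: f.written_at)
    while remaining and used_gb / tier_capacity_gb >= WATERMARK:
        oldest = remaining.pop(0)
        migrate(oldest)
        used_gb -= oldest.size_gb
    return remaining
```

Note that nothing in this logic asks what the data is or who needs it; the only inputs are capacity consumed and order of arrival.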

HSM using the second criterion, file timestamps, may operate in concert with watermark/FIFO HSM systems. The trigger for data movement in a pure timestamp HSM process, however, is the age of the file itself. At a given time, for example, all files older than 30 days are automatically migrated to lower tiers.

HSM keyed to the third criterion, date last accessed/date last modified, leverages the file metadata that stores last-accessed and last-modified dates to determine which files to move. For example, any file whose last-accessed/last-modified date in metadata is older than 60 days is automatically moved to lower tiers of storage.
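A sketch of the second and third criteria follows, again purely illustrative: the 30-day and 60-day cutoffs mirror the examples above, file age is approximated here by modification time, and the selected files would be handed to a data mover like the migrate() stub shown earlier.

```python
# Illustrative sketch of timestamp-driven HSM selection. The cutoffs
# mirror the 30-day and 60-day examples above; what to do with the
# selected files (the actual migration) is left to a separate mover.

import os
import time

DAY = 86400  # seconds per day

def select_by_age(paths, max_age_days=30, now=None):
    """Second criterion: select files older than the cutoff, regardless of
    how full the tier is. File age is approximated by modification time."""
    now = now or time.time()
    return [p for p in paths
            if now - os.stat(p).st_mtime > max_age_days * DAY]

def select_by_last_access(paths, max_idle_days=60, now=None):
    """Third criterion: select files not accessed or modified within the
    cutoff, using the atime/mtime metadata the file system keeps."""
    now = now or time.time()
    return [p for p in paths
            if now - max(os.stat(p).st_atime, os.stat(p).st_mtime)
               > max_idle_days * DAY]
```

As with the watermark policy, the only decision inputs are dates in file system metadata; the business value of the file never enters the calculation.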

All three varieties of HSM functionality exist in the market today, either as standalone software or as value-add software built onto enterprise array controllers featuring on-array tiering. These are deceptively marketed as intelligent data management solutions, which they are not. Their focus is less on the data itself than on capacity allocation in the array or across the storage infrastructure. Little if any attention is paid to the contents of the data files being moved or their business context, which must ultimately determine how data is hosted and what services are provided to the data during its useful life. Capacity utilization efficiency -- placing the right data on the right device at the right time, and exposing it to the right set of services based on business value criteria -- is simply not the goal of HSM.

In part two of this discussion, we will identify two more "archive" areas that need clarification, and will offer a list of ten criteria for making a smart archive purchase.
