In-Depth
Building an Archive Strategy, Part V: The Devil in the Details
What every storage admin must ask: what must we save and where shall we save it?
In the last segment of this series, we heard from several vendors about the factors they thought important to consider when selecting a company's archive solution. Predictably, most emphasized archive platform architecture or features that happened to exist only in their own wares. We also looked at the "ingestion rates" presented by archiving products: how quickly they could move selected data into the archival repository.
That column happened to dovetail with some laboratory testing we were finishing involving KOMpliance, a data archive appliance from Ottawa, Ontario, Canada-based KOM Networks. KOMpliance may well be the best archive engine that you have never heard of—at least, that was CTO Kamel Shaath’s claim when we first met him a couple of years ago. Our testing showed that it wasn’t an empty boast; Shaath and company have developed an archiving appliance with a compelling business value case.
We found the KOMpliance platform to be a great performer, delivering excellent throughput under workloads composed of both smaller and larger files and with varying numbers of "threads," or concurrent user accesses to the appliance. To baseline the ingestion rate of the KOMpliance, we performed a simple "black box" comparison: file write rates with the KOMpliance product in the data path, performing its indexing and hashing voodoo on data heading to the KOMpliance-managed destination array, versus direct write rates to the same array without KOMpliance in the path at all. We found that the appliance imposed roughly a 12 percent hit on write performance, which is minimal considering what KOMpliance was doing in the background.
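For readers who want to picture that kind of black-box comparison, here is a rough sketch in Python. The mount points, file sizes, and counts are placeholders for illustration, not our actual lab configuration or anything specific to KOMpliance.

```python
import os
import time

# Hypothetical mount points: DIRECT writes straight to the destination array,
# MANAGED writes through the archive appliance. Both are illustrative paths.
DIRECT = r"\\array\direct_share"
MANAGED = r"\\appliance\managed_share"

FILE_SIZE = 64 * 1024 * 1024   # 64 MB per test file (assumed)
FILE_COUNT = 20                # number of files per run (assumed)

def write_test_files(target_dir: str) -> float:
    """Write FILE_COUNT files of FILE_SIZE bytes and return throughput in MB/s."""
    payload = os.urandom(FILE_SIZE)
    start = time.perf_counter()
    for i in range(FILE_COUNT):
        path = os.path.join(target_dir, f"testfile_{i:03d}.bin")
        with open(path, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())   # force data to disk so we time real I/O
    elapsed = time.perf_counter() - start
    return (FILE_SIZE * FILE_COUNT) / (1024 * 1024) / elapsed

direct_rate = write_test_files(DIRECT)
managed_rate = write_test_files(MANAGED)
overhead = (direct_rate - managed_rate) / direct_rate * 100
print(f"Direct: {direct_rate:.1f} MB/s, through appliance: {managed_rate:.1f} MB/s")
print(f"Write penalty: {overhead:.1f}%")
```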
Our analysis of the KOM Networks appliance and ongoing conversations with Shaath also yielded some additional practical insights. For one, Shaath was clear that he wasn’t setting out to reinvent things other companies have already done. His KOMpliance platform (which comes in a variety of configurations and price points) uses commodity hardware running Microsoft Windows Storage Server R2. His reasoning is straightforward: people already use and understand these components, so why reinvent them?
Another interesting architectural decision Shaath made is to eschew "archive pre-processing" altogether. KOMpliance assumes you have already decided which data you want to move into an archive; it helps you create a policy for placing that data into a KOMpliance-managed storage volume and for managing it once there. Says Shaath, "Most data that goes into the archive are files managed by the file system. We leave it to the consumer to decide what files to select and include in the archive."
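What such a selection policy might look like is easy to sketch. The Python below selects files by age and type and copies them to an archive-managed volume; the paths, age threshold, and file extensions are assumptions for illustration and have nothing to do with KOMpliance’s own policy mechanism.

```python
import shutil
import time
from pathlib import Path

# Illustrative policy values -- not KOMpliance settings.
SOURCE = Path("/data/projects")          # live file share
ARCHIVE_VOLUME = Path("/mnt/archive")    # archive-managed volume
MAX_AGE_DAYS = 365                       # archive files untouched for a year
EXTENSIONS = {".pdf", ".docx", ".xlsx", ".pst"}

cutoff = time.time() - MAX_AGE_DAYS * 86400

def select_candidates(root: Path):
    """Yield files that match the age and type policy."""
    for path in root.rglob("*"):
        if (path.is_file()
                and path.suffix.lower() in EXTENSIONS
                and path.stat().st_mtime < cutoff):
            yield path

for candidate in select_candidates(SOURCE):
    destination = ARCHIVE_VOLUME / candidate.relative_to(SOURCE)
    destination.parent.mkdir(parents=True, exist_ok=True)
    # Copy first; delete originals only after the archive copy is verified.
    shutil.copy2(candidate, destination)
```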
Shaath has caught heat for this position from competitors in the industry: "They want to know, ‘Where is your e-mail archiving software?’ I tell them I don’t need any, and they tell me, ‘Then you are not an archive tool.’" Shaath says Microsoft already builds an archive function into Exchange and Outlook mail that produces archive.pst files he can move into his KOMpliance solution on the fly. If someone wants to use a more sophisticated program to extract e-mail and leave stubs in their place, he can just as readily accept that input to KOMpliance.
Shaath thinks that being standards-based, transparent to users, and compatible with the broadest range of hardware and software is what has delivered such an impressive list of clients to KOM Networks over the past 25 years. He wants to deliver the heavy lifting of archiving, the reliable and secure retention and protection of data over a lengthy period of time—and when the time comes, the secure deletion of that data, too.
He emphasizes the AES-256 encryption that his product can apply on demand to any files being stored to the repository that KOMpliance manages, as well as the SHA-256 hashing that can be applied to ensure that data doesn’t change over time. He notes that his company can also be credited with developing and patenting its own write once, read many (WORM) technology, which is integral to KOMworx, the core software engine of the KOMpliance product. Users can apply any and all of these features, selectively or universally, to the data they archive and manage with KOMpliance. There is nothing else to buy.
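The fixity side of this is simple to illustrate. The sketch below hashes files at ingest and re-checks them later to detect any change; it is a conceptual example only, not a look inside KOMworx, and the manifest file location is an assumption.

```python
import hashlib
import json
from pathlib import Path

MANIFEST = Path("/mnt/archive/.fixity_manifest.json")   # assumed location

def sha256_of(path: Path) -> str:
    """Stream the file through SHA-256 so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()

def record_at_ingest(files):
    """Store a hash for every file as it enters the archive."""
    manifest = {str(p): sha256_of(p) for p in files}
    MANIFEST.write_text(json.dumps(manifest, indent=2))

def verify_later() -> list:
    """Re-hash archived files and report anything that has changed."""
    manifest = json.loads(MANIFEST.read_text())
    return [path for path, expected in manifest.items()
            if sha256_of(Path(path)) != expected]

# Usage: changed = verify_later(); an empty list means fixity still holds.
```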
With everyone else in the archive space talking about advanced features such as compression, de-duplication, content addressing, specialty file systems, search and retrieve, and other "value add" capabilities, Shaath and company have preferred to focus on simplicity. Let the customer decide which value-add features he needs, says Shaath, but deliver a solid compliance archiving solution to the market that is affordable to the largest number of companies. To hear him speak, the purveyors of special archive solutions aimed narrowly at one data type (e-mail, database, fixed content, video) are leading the consumer down the wrong path, into stovepipes or silos of archiving, with each silo requiring its own cadre of skilled personnel to administer it.
The file is the universal container because all data is ultimately handled by the file system, so KOM Networks manages files. Shaath says this minimal end-user involvement, "near complete transparency," is the real secret sauce of the KOM Networks solution.
Shaath’s perspective is shared by Patrick Dowling, BridgeHead Software’s senior vice president of marketing. Dowling emphasizes that proper upfront analysis of data and definition of workload are key to keeping archiving a seldom-noticed but always-vigilant data management service.
"The devil is always in the details, isn’t it?" Dowling remarks, "When you introduce automated archiving, you need to understand what’s going to happen when you flick the switch. Simplistic programs that just begin to madly find and write data off to the archive can be dangerous."
Citing a few of the strengths of the BridgeHead Software archiving solution, Dowling says that a best-of-class archiving program should feature built-in analysis, simulation, and reporting, "so that you can fully understand your data environment and what the effect of running rules might be."
He also underscores the importance of testing the archive storage environment before deleting original data that has been copied into it: "That way, you can ensure that the archive storage environment is functioning correctly and you can test the accessibility of data in the archive prior to actually being dependent on picture-perfect repository performance."
Ingestion rates (which we discussed last week—see http://esj.com/storage/article.aspx?EditorialsID=2586) are probably given more weight than they should be, in Dowling’s view. He says that the most important test of ingestion is the initial migration of data to the archive, which should take the form of a copy, rather than a move, in order "to make data migration transparent, possibly decrease disruptive results to users or applications, given the uncertainty of random data access. If immense amounts of data are to be moved, it is important for the software not to simply move files piecemeal. Archive jobs should be structured according to policy to aggregate writing large numbers of files into a single data transfer operation. In this way, large-scale ingression can be done at media speeds, much as one sees with backup."
This concept becomes even more important, in Dowling’s view, when non-random-access media such as tape are the archive target and streaming is critical to proper operation. He says that archiving products like BridgeHead’s HT Filestore are specially designed for such environments, featuring advanced job scheduling, multiple queues, and queue management to "give users the ability to control the often hectic process of trying to move terabytes of data off to alternative storage systems."
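Dowling’s aggregation point can be pictured with a short sketch: rather than copying files one at a time, bundle them into large sequential jobs (here, plain tar archives written to a staging directory) so the target medium can be kept streaming. The paths and batch size below are assumptions for illustration; none of this reflects how HT Filestore is actually implemented.

```python
import tarfile
from pathlib import Path

SOURCE = Path("/data/to_archive")           # files already selected by policy (assumed)
TARGET = Path("/mnt/tape_staging")          # staging area for the tape/VTL device (assumed)
BATCH_BYTES = 8 * 1024 * 1024 * 1024        # aggregate roughly 8 GB per job (assumed)

def batched(files, limit):
    """Group files into batches of roughly `limit` bytes each."""
    batch, size = [], 0
    for f in files:
        batch.append(f)
        size += f.stat().st_size
        if size >= limit:
            yield batch
            batch, size = [], 0
    if batch:
        yield batch

files = sorted(p for p in SOURCE.rglob("*") if p.is_file())
for n, batch in enumerate(batched(files, BATCH_BYTES)):
    # One large sequential write per batch keeps the drive streaming
    # instead of stopping and starting for every small file.
    with tarfile.open(TARGET / f"archive_job_{n:04d}.tar", "w") as tar:
        for f in batch:
            tar.add(f, arcname=str(f.relative_to(SOURCE)))
```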
So long as proper testing and initial migration are done as he describes, Dowling says archiving should stay off the radar of the user population (even the bean counters in finance): "Once the archive is set up and populated, archiving should be a relatively low utilizer of resources—certainly compared to daily backup."
End User Disruption Minimal
By "user," Dowling is referring to administrators. The rule of effective archiving is to minimize end-user involvement. Transparency of archive operations is highly prized, as is the seamless integration of archive products with the business applications and the information output they manage. Fixed content archive companies such as ClearView (formerly Zenysys) and e-mail archive vendors such as Mimosa Systems emphasize tight integrations with Microsoft SharePoint services (in the case of ClearView) and Microsoft Exchange (in the case of Mimosa) as key selling points. The dominant "philosophy of operation" is that users should be able to continue to interact with business applications and client operating systems as they always have, even if the e-mail or document they seek to access actually exists only as a stub in the application that points to an archival repository.
Ideally, suggests Jim Wheeler, director of marketing and business development with QStar Technologies Inc., an effective archive system will not require any intervention by the user at all. Unfortunately, he notes, this is not the modus operandi of all archiving systems in the market.
According to Wheeler, "The user is told to archive, so they do, but they don’t know what to save, so they save everything. In the beginning, that’s fine because disks are cheap. But then they need to access something in particular and they can’t remember where they put it. They end up searching the whole data set, which is now measured in terabytes, if not larger. Searching can take a very long time, and may not even produce the correct file because they put in the wrong search criteria."
In the final analysis, the "disruptiveness" of the archive solution from the user’s perspective may be the real make-or-break criterion of archive program success. If archiving interferes with normal work, whether in the form of nagging requests for users to add classification criteria to files before saving them or significant delays in retrieving archived files, it is likely to become a whipping boy. Similarly, if archive processes slow down business applications with CPU-intensive processing tasks, they will likely incur the ire of the masses.
Next week, we will conclude this series on archiving by distilling five best practices for building archives. Until then, your comments are welcome: [email protected].