Bolted-on Archive Solutions: Ingestion Indigestion—Part IV of a Series

As with everything else in IT, performance is a key consideration when selecting an archive solution. Finding a reliable measure, however, is problematic.

In the last installment of this series on data archiving, we looked at the (in)adequacy of built-in archive features of e-mail, database, and most business application software systems by surveying vendors seeking to sell archive functionality as a third-party software add-on. Most vendors claimed to offer “solutions” that either enhanced business application archive processes or that replaced them altogether.

Opinions varied about the key problems with “native” (a.k.a. business application) archiving functionality, but all vendors were in agreement that more is needed to provide centralized policy-based data management, as well as to facilitate the kind of granular data discovery required by regulators. To a one, they said that if adequate archiving wasn’t built in by the operating system or business application developer, it must be bolted on—preferably using their software, appliance, or storage platform.

Their assertions had merit, of course, but they also opened an additional set of questions about how IT decision-makers should go about vetting third-party archive solutions to determine the right one for them. Beginning at the beginning, one topic that needs to be addressed is how you define your archive requirements. More to the point, how do you discriminate among the myriad “solutions” currently on the market and find the one that best meets your requirements—now and in the future?

One of the selection criteria of immediate concern to many companies interested in data management via archive is how archive products perform. Vendors extol their “ingestion rates” (how quickly files can be placed into an archive) in their marketing pitches, but these claims are rarely subjected to any objective validation. The current experience of users in the trenches suggests that ingestion is one metric where “actual mileage may vary”—widely.
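Because vendor ingestion claims rarely survive contact with production workloads, a quick in-house benchmark is worth running before any purchase. The sketch below is a minimal illustration in Python, assuming a hypothetical `ingest` callable standing in for whatever interface the candidate archive actually exposes (HTTP PUT, SDK call, watched-folder drop, and so on); it reports the same objects-per-minute figure the vendors quote:

```python
import time
from pathlib import Path

def measure_ingestion_rate(files, ingest):
    """Time how long `ingest` takes over a batch of files and report
    sustained objects per minute.

    `ingest` is a placeholder for the archive product's own API call;
    swap in the real one when testing a candidate platform.
    """
    start = time.perf_counter()
    for f in files:
        ingest(f)
    elapsed = time.perf_counter() - start
    return len(files) / elapsed * 60  # objects per minute

if __name__ == "__main__":
    # Stand-in "archive" for demonstration: just read each file's bytes.
    def fake_ingest(path):
        Path(path).read_bytes()

    sample = [p for p in Path(".").iterdir() if p.is_file()]
    if sample:
        print(f"{measure_ingestion_rate(sample, fake_ingest):.0f} objects/minute")
```

Run against a corpus that resembles your real data (file sizes and counts matter enormously, as the vendors below point out), not the vendor's hand-picked test set.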

Finding the Right Factors

One European CTO told me not long ago that his vendor’s archive platform was delivering a paltry ingestion rate of six objects per minute, well below the vendor’s advertised rate of 1,000 per minute. By comparison, an inexpensive archive appliance platform from KOMnetworks in Ottawa, Ontario, Canada—not deemed “enterprise class” and therefore excluded from consideration by the aforementioned CTO—routinely delivers rates of 600 objects per minute in our lab tests. Similar or better rates are reported by KOMnetworks’ happy customers.

Andres Rodriguez, whose clustered-archive company, Archivas, was acquired last year by Hitachi Data Systems, agrees that some sort of uniform testing is needed. Says Rodriguez, “It’s about time for an organization like SNIA [the Storage Networking Industry Association] to define a standard test to benchmark object-storage systems … Hitachi would be only too happy to demonstrate this capability in an SNIA-sponsored event.”

Until such uniform testing occurs, and with in-the-trenches performance of archive solutions all over the map, just understanding what factors are germane to product selection can be a daunting challenge.

Last week, I asked a number of vendors what they thought were important vetting criteria. Filtering out the “marketecture,” a few guidelines emerged.

Rodriguez suggests that platform “architecture” is one of the most important considerations. This is not surprising, since his HCAP product uses clustered servers to expedite data ingestion rates. His comments on archive vetting criteria, therefore, must be filtered through this architectural preference.

Says Rodriguez, “There are many variables that can affect the observed performance in a cluster: type of gateway, number of nodes used for ingestion, file size, and network load are some of the most critical. HCAP is capable of ingesting over 1000 objects/second in a fully loaded cluster using HTTP on all 80 nodes. This is a capability that is non-existent in EMC Centera, Network Appliance’s SnapLock, or IBM’s DR550. HCAP can sustain ingestion rates of over 1000 small (1KB) objects [per second] while at the same time guaranteeing data integrity all the way to the client application.”

By contrast, Eric Lundgren, vice president of product management for CA, provides a different view that reflects his company’s emphasis on file system management. Says Lundgren, “The real issue here is how files are moved. CA believes the number one issue that prevents a solution from scaling to the ingestion rates that an enterprise expects is the ability for the architecture to be distributed where the collection happens, while maintaining a centralized policy.”

He says that the CA File System Manager is designed for enterprise organizations with disparate operating and file systems distributed across the enterprise. The product provides centralized policy management and control across far-flung branch-office and end-user computing environments. He adds, “Along with centralized policy, we have distributed components of the architecture that allow for scaling that meets the needs of the largest organizations”—an interesting juxtaposition to the scaling arguments offered by hardware-centric archive solution vendors like HDS, EMC, IBM, Network Appliance, and others.

Jim Wheeler, director of marketing and business development with QStar Technologies Inc., agrees with Lundgren that a true archive is stored in a file system and that the file system is, in fact, the first bottleneck to ingestion. “The file system has to deal with both the data structure and the metadata [data about the data] structure. Even if [you are writing] your archive to tape, with a file system you’re going to take a significant write-transfer hit based on the file system used.”

Wheeler says that this problem may be exacerbated by extra metadata being added by vendors: “[W]ith many if not all the disk-based systems like EMC, Hitachi, [and] NetApp, you have the hashing process and the UFID assignment. [UFID is used to calculate a storage address in some CAS systems.] This adds metadata to the file system structure and creates yet another processor bottleneck. The processor can only hash so fast. If the system has to hash many thousands of small files each day, as an e-mail archive would, then the ingestion rate is going to be very slow.”
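Wheeler’s point about hashing throughput is easy to demonstrate: a content-addressed store must digest every byte of every object before it can assign an address, so a flood of small files turns the hash function itself into the rate limiter. The following is a rough, single-core illustration in Python (the 1 KB object size mirrors the small-message e-mail case he describes; real CAS systems may use different digests and parallelism):

```python
import hashlib
import time

def hash_throughput(num_objects=10_000, size=1024):
    """Hash `num_objects` synthetic objects of `size` bytes each and
    report objects hashed per second, a rough ceiling on single-core
    ingestion for a content-addressed store."""
    # A small pool of pre-built payloads so we measure hashing, not allocation.
    payloads = [bytes([i % 256]) * size for i in range(64)]
    start = time.perf_counter()
    for i in range(num_objects):
        # Every object must be fully digested before it can be addressed.
        hashlib.sha256(payloads[i % 64]).hexdigest()
    elapsed = time.perf_counter() - start
    return num_objects / elapsed

if __name__ == "__main__":
    print(f"{hash_throughput():,.0f} objects/second (single core)")
```

The absolute number depends on the CPU, but the shape of the result holds: per-object fixed costs dominate when objects are small, which is exactly the e-mail archive scenario Wheeler flags.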

Beyond content addressing, Wheeler notes that many organizations want to add even more data to a file or record that helps to classify it for speedy retrieval later, or for inclusion into appropriate data protection schemes. “If there is a data classification process, then you’re adding more metadata to the file system structure and creating an additional bottleneck … I think you see where this is going,” Wheeler adds, suggesting that all metadata modifications add to the archive workload.

Feature Bloat and Performance

Another factor he points to as a possible selection criterion is the “self-healing” aspect of the archiving target—the disk, optical, or tape systems that hold the data. Processing power is required, he notes, to check the system and to make sure it contains all the files it is supposed to have, as well as a specified number of duplicates for safekeeping. This functionality usually also requires additional metadata structures, says Wheeler.
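The “self-healing” check Wheeler describes amounts to a continuous audit: walk the metadata catalog, re-hash every stored copy, and confirm the expected replica count survives. A minimal sketch in Python, with a hypothetical manifest format (real products keep this catalog in their own internal structures), shows why the audit costs both CPU cycles and extra metadata:

```python
import hashlib
from pathlib import Path

def audit_archive(manifest, min_replicas=2):
    """Verify every object in `manifest` (a dict mapping object ID to
    {"sha256": <hex digest>, "replicas": [paths]}) still has at least
    `min_replicas` intact copies. Returns the IDs needing repair.

    Each intact-copy check re-reads and re-hashes the full object,
    which is where the processor cost Wheeler describes comes from.
    """
    damaged = []
    for obj_id, record in manifest.items():
        good = 0
        for replica in record["replicas"]:
            p = Path(replica)
            if p.exists() and hashlib.sha256(p.read_bytes()).hexdigest() == record["sha256"]:
                good += 1
        if good < min_replicas:
            damaged.append(obj_id)
    return damaged
```

A production system would repair the damaged objects by re-copying from a surviving replica; the point here is simply that every audit pass re-hashes the archive, competing with ingestion for the same processor.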

This perspective is echoed by Goran Garevski of storage software house HERMES SoftLab in Ljubljana, Slovenia. As a behind-the-scenes developer of archive and storage management software vended by prominent name-brand storage companies, Garevski observes that many archive software vendors are seeking to add functions such as data de-duplication and various flavors of indexing services to their wares as product discriminators. “These are CPU hungry animals,” he cautions.

It comes down to trade-offs, in Wheeler’s view, between “the amount of overhead to improve searching and indexing” and overall ingestion rate efficiency. “The key,” he says, “is in knowing what you have and how to get at it in a timely manner.”

On this point, he is in agreement with both Sias Oosthuizen, vice president of EMEA Solutions for FileTek, and Hu Yoshida, CTO of Hitachi Data Systems. Oosthuizen dismisses the importance of object ingestion rates in platform selection, stating, “It isn’t an issue so long as you can archive faster than data is produced. It is access that is important.”

FileTek’s database archiving technology enables older data to be extracted into a “flat” archival “data warehouse” that can be queried separately or together with the original database from which it was extracted. This kind of access to the archived data is often highly sought after by FileTek customers.

Yoshida, by contrast, notes that his company’s product, HCAP, handles the processor load described by Wheeler quite efficiently “at the storage layer.” Doing this not only enables sought-after services to be provided with minimal impact on ingestion rates, but also frees up business applications to do what they were designed for.

Says Yoshida, “Our approach also enables the index/search, and other object storage services to be offloaded to the storage layer. This will enable the database or e-mail system to concentrate on what they are responsible for, database, e-mail, etc., and scale to much higher volumes without the need to spend cycles on the storage services.”

Performance is just one aspect of bolted-on archiving that concerns IT. Next week we’ll look at another: analysis of the data being archived. In the meantime, your comments are welcome at
