
Built-in Archiving: Use at Your Own Risk -- Part III of a Series

Can non-native archive management software improve the performance of the e-mail server by automatically archiving messages without user intervention?

Archiving isn’t easy. For proof, you need look no further than the White House.

By current accounts, as many as five million e-mails dating from 2003 to 2005 appear to have fallen through the cracks of an unnamed archiving system put in place by the Bush White House. That system was installed ostensibly to fix another flawed archiving solution, the Automated Records Management System (ARMS), which was, in its day, blamed for losing a quarter of a million e-mails of the Clinton Administration.

A post-Nixon era law, the Presidential Records Act of 1978 (PRA), established a requirement for the Office of the Presidency to retain the missives of governmental head honchos for future reference—whether by historians or litigators. Complying with the archiving law is, apparently, a lot more difficult than it sounds.

Various press accounts and public statements suggest that the current missing e-mail problem might have to do with the archive process itself, which doesn’t seem to use any particular third-party archive product. According to one government watchdog group, Citizens for Responsibility and Ethics in Washington (CREW), "The…e-mail retention process in use by the White House consists of extracting e-mail messages from the e-mail system and saving them in large, undifferentiated files on a file server." This approach may seem familiar to many as a typical archiving component of popular mail programs, such as Microsoft Exchange.

White House spokespersons have also placed potential blame at the feet of poorly trained users, or rather on a “failure to adequately train end users in the proper use of the current archive process.” That explanation, however, cannot be reconciled with the CREW report, given that automated mail-system archiving is rarely a task that entails any direct end-user involvement.

Predictably, the issue has become so mired in politics that we may never know the real reasons for the archive snafu. Interestingly, however, the matter does underscore a dilemma that many companies face today: defining a reliable method for retaining electronic data, including e-mail, databases, the output of electronic content management (ECM) systems, and good old-fashioned end-user files in accordance with regulations, laws, and rules of evidence that are being issued with increasing frequency by government agencies.

Many applications, from e-mail to databases, come with their own archive capabilities built in. One question that might reasonably be asked is why these capabilities are insufficient to the task of squirreling away data in a way that passes muster with auditors and lawyers. Clearly they must fall short, or there would not be so many vendors lining up to sell specialized software for archiving files, e-mails, database transactions, and other electronic content.

Why Built-in Isn’t Good Enough

We asked several vendors to help us understand the inadequacies of built-in archive processes that supposedly compel consumers to buy third-party bolt-on products. They offered several interesting responses that might help inform the IT decision-making process.

Sias Oosthuizen, vice president of EMEA Solutions for Rockville, MD-based FileTek, stated that, in the best of all possible worlds, archive “should be an extension of the operational system allowing for transparent access to archived data.” That way, applications and operating systems would be able to re-reference the archived data readily without needing to reprocess it or drag it back into the application or its primary storage.
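To make the idea concrete, here is a minimal sketch of the kind of read-through behavior Oosthuizen describes: the application reads through one interface, and older data is served from the archive tier without a manual restore. The class, method names, and record IDs below are hypothetical illustrations, not any vendor's actual API.

```python
# Minimal sketch of "transparent" archive access: the application reads through
# a single interface, and the store fetches from primary or archive as needed.
# All names here are hypothetical illustrations, not a real product's API.

class TransparentStore:
    def __init__(self, primary, archive):
        self.primary = primary    # fast, online storage (dict-like)
        self.archive = archive    # slower, near-line storage (dict-like)

    def read(self, record_id):
        """Return a record without the caller knowing where it lives."""
        if record_id in self.primary:
            return self.primary[record_id]
        # Fall through to the archive tier; no manual restore step required.
        return self.archive.get(record_id)

# Usage: the application code is identical whether the record was archived or not.
store = TransparentStore(primary={"inv-2024-001": "current invoice"},
                         archive={"inv-2003-417": "seven-year-old invoice"})
print(store.read("inv-2003-417"))   # served from the archive tier transparently
```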

Unfortunately, he said, that isn’t usually the case. He notes that, in the enterprise applications world, “SAP BI (Business Information) is one of the few applications I have seen that explicitly makes provision for archiving older data and allows transparent access to that archive data through their NLS (near-line storage) interface.” Others treat integral archiving the way they do backups, as an IT operations requirement fulfilled outside the application itself.

A by-product of this philosophy, which in Oosthuizen’s view is held by many business application software developers, is an “everything-or-nothing” dichotomy that closely ties an archive’s value to its destination media—disk or tape. With disk, everything is available to the application within milliseconds, says Oosthuizen; with tape, nothing is available without considerable effort and wait time.

He says that MAID (Massive Array of Idle Disks), a concept introduced a few years ago and productized by COPAN Systems among others, shows promise for rectifying the discrepancy between tape- and disk-based archives. With MAID, archives can be written to inexpensive SATA disks that spin down when the data is not referenced. When a request is made for application data archived to the MAID infrastructure, the disk spins up and the data can be accessed readily.

While his comments may seem a bit off track from the main question, they actually make a lot of sense. Many application software vendors equate archiving with anonymous packages of data that are created periodically by the application (usually based on the timestamp associated with the data itself), then spun off as data sets to be backed up to tape and never re-referenced. Oosthuizen seems to be correctly associating this process with a philosophy about archiving held by the application designers: that it is just another word for backup.
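The pattern he is criticizing can be sketched in a few lines. The following is a deliberately simplistic illustration of "archive-as-backup," in which old records are selected by timestamp alone, compressed into an anonymous blob, and forgotten by the application; the field names and retention period are assumptions made for the example.

```python
# Deliberately simplistic sketch of the "archive-as-backup" pattern described
# above: records older than a cutoff are compressed into an anonymous blob and
# removed from the application, with no index left behind to find them again.
# Purely illustrative; this does not reproduce any real product's behavior.
import gzip, json, time

RETENTION_DAYS = 365

def spin_off_old_records(records, out_path="archive-spinoff.gz"):
    cutoff = time.time() - RETENTION_DAYS * 86400
    old = [r for r in records if r["timestamp"] < cutoff]    # selected by timestamp only
    with gzip.open(out_path, "wt") as f:
        json.dump(old, f)                                     # opaque, undifferentiated data set
    return [r for r in records if r["timestamp"] >= cutoff]  # the application keeps only the rest
```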

For companies that need to re-reference archival data in the future (to create historical reports, for example, or to perform other types of investigation or analysis), archive-as-backup is probably insufficient. This view is echoed by Hu Yoshida, CTO for Hitachi Data Systems, who notes that native archiving capabilities designed by software houses often leave him with a sense of déjà vu about IBM’s mainframe-style Hierarchical Storage Management methodology.

Says Yoshida, “I also see a distinction between HSM and TSM or tiered storage manager. HSM is a hierarchical view of storage which in IBM’s case (DFHSM) requires data sets to be migrated back to the top of the hierarchy before it can be accessed by the application. It can not be accessed on the lower levels of the hierarchy because the data sets may have been compressed and moved to other physical volumes which are not known to the application.”
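The recall step Yoshida contrasts with transparent tiering can be illustrated roughly as follows. This is a sketch only, with hypothetical tier objects and a stand-in decompression function; it does not reproduce DFHSM's actual behavior.

```python
# Sketch of HSM-style recall: a data set migrated to a lower tier cannot be
# read in place; it must first be recalled to the top of the hierarchy before
# the application can touch it. Tier objects and functions are illustrative.

def read_record(name, primary_tier, lower_tier):
    if name not in primary_tier:
        if name not in lower_tier:
            raise KeyError(name)
        # Recall step: decompress and copy the data set back to primary storage.
        primary_tier[name] = decompress(lower_tier.pop(name))
    return primary_tier[name]

def decompress(blob):
    # Stand-in for the decompression and relocation work done during recall.
    return blob
```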

Going Part of the Way

Goran Garevski, whose firm HERMES SoftLabs in Ljubljana, Slovenia, has been responsible for developing many prominent storage management products vended by industry name brands, agrees with Yoshida, saying that native archiving capabilities in most business apps generally go only part of the way toward meeting the fundamental requirements of archive. He notes that the technique of compressing application output into proprietary formats for permanent storage on tape creates difficulty “especially when one wants to do ‘project-based’ archiving—which can include data from different applications (e-mails, records, files).”

Says Garevski, business application vendors would “make life easier for the data management vendors (and indirectly for end users)” if they would use standardized “or at least publicly documented” data formats and enable “good (read: free, well-documented and supported) APIs” that could be leveraged for effective data classification, data movement, and transparent data access. With each business application developer using its own proprietary format, relying on the native capabilities of applications to handle the archiving task is tantamount to turning archives into a Tower of Babel.
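A sketch of the kind of publicly documented archive interface Garevski is asking for might look like the following: one neutral contract that any data management product could call for classification, movement, and cross-application, project-based access. The interface below is a hypothetical illustration, not an existing standard.

```python
# Hypothetical, application-neutral archive API of the sort Garevski describes.
from abc import ABC, abstractmethod
from typing import Iterable

class ArchiveAPI(ABC):
    @abstractmethod
    def classify(self, item_id: str) -> dict:
        """Return retention class, project tags, and format metadata for an item."""

    @abstractmethod
    def export(self, item_id: str) -> bytes:
        """Return the item in a documented, application-neutral format."""

    @abstractmethod
    def list_items(self, project: str) -> Iterable[str]:
        """Enumerate items by project, so e-mails, records, and files from
        different applications can be archived and retrieved together."""
```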

Patrick Dowling, senior vice president of corporate marketing with BridgeHead Software, says that, even if standards were adopted industry wide for archival data formats, native application capabilities would probably still need to be supplemented. According to Dowling, there are two distinct sides to archiving. First is the front-end process of selecting data to be copied or moved to the archive. Second is the “backend” management of the secondary storage.

“[Some] applications are perfectly competent at understanding the characteristics and priorities of the information they manage and are capable of gathering up and writing out data. PACS systems in the health care space and digital asset management systems in the pre-press printing market are perfect examples.”

Unfortunately, notes Dowling, many applications—“Microsoft Exchange being a prime example—do not have built-in functionality to do this correctly.” That is why, he notes, there are a number of e-mail archiving products that exist “to supplement Exchange operation with the ability to intelligently extract the data to a larger-scale more functional data management environment.”

Even when applications are competent to select the correct data for archiving, Dowling says, “In our experience, very few understand secondary storage devices and secondary storage management requirements. They are, therefore, usually content to simply write a single copy out to some place on the network called the ‘repository’ and be done with it.”

This opens the door for archiving or repository management products such as BridgeHead’s HT DR and HT FileStore, Dowling says, “which make it their business to ensure that the place that data gets written to, provides data accessibility, security, compliance, and availability in multiple copies to ensure recovery from failure.”
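The back-end responsibility Dowling describes can be sketched as follows: rather than dropping a single copy on "the repository," the repository manager writes and verifies multiple copies on independent targets before declaring success. This illustrates the idea only; it is not BridgeHead's actual implementation, and the function and parameter names are assumptions.

```python
# Sketch of back-end repository management: write multiple verified copies so
# the archive survives the failure of any single target. Illustrative only.
import hashlib, pathlib

def store_with_copies(data: bytes, name: str, targets: list[str], min_copies: int = 2) -> int:
    digest = hashlib.sha256(data).hexdigest()
    good = 0
    for target in targets:
        path = pathlib.Path(target) / name
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)
        # Verify each copy before counting it toward the availability requirement.
        if hashlib.sha256(path.read_bytes()).hexdigest() == digest:
            good += 1
    if good < min_copies:
        raise RuntimeError(f"only {good} verified copies written; {min_copies} required")
    return good
```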

According to the marketing SVP, BridgeHead Software’s HT FileStore is used as a “backend” repository manager for a number of e-mail (Quest, C2C, Waterford, ZipLip), database (Grid-Tools, BrighTech), and image (AMICAS, wave data) archiving products, “whose competency is in the front-end data management.”

Also missing from the native solutions provided by the business software houses, in the view of many archive software vendors, is a meaningful or efficient “discovery” engine that can search through archived data and locate specific transactions or documents. Eric Lundgren, CA’s vice president of product management, makes the point that business requirements (such as regulatory compliance and discovery) exceed the capabilities found in native archiving. He goes further to say that this functionality gap “represents the biggest risks organizations are facing today.”

“Third-party e-mail archiving,” says Lundgren, “allows the organization to meet discovery requirements efficiently, cost effectively, and can be executed by the legal team versus IT staff. The legal team can therefore operate under a self-service model. A native archive will only be able to archive messages that are still in the mailbox. An essential difference of CA File System Manager versus native archive is the ability to meet stringent legal e-discovery requirements with workflow including centralized indexing, recursive searching, legal hold, tagging, and export to legal ready format.”
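The workflow steps Lundgren lists can be illustrated with a toy sketch: search a central index, place the matching messages on legal hold, tag them to a matter, and export them for counsel. This is an illustration of the workflow only, not CA's software; the class and message IDs are invented for the example.

```python
# Toy sketch of an e-discovery workflow: centralized search, legal hold,
# tagging, and export. Illustrative only; not any vendor's implementation.

class DiscoveryArchive:
    def __init__(self):
        self.messages = {}          # msg_id -> {"body": str, "tags": set, "hold": bool}

    def ingest(self, msg_id, body):
        self.messages[msg_id] = {"body": body, "tags": set(), "hold": False}

    def search(self, *terms):
        return [mid for mid, m in self.messages.items()
                if all(t.lower() in m["body"].lower() for t in terms)]

    def legal_hold(self, msg_ids, tag):
        # Held messages are tagged and exempted from normal retention deletion.
        for mid in msg_ids:
            self.messages[mid]["hold"] = True
            self.messages[mid]["tags"].add(tag)

    def export(self, msg_ids):
        return [{"id": mid, "body": self.messages[mid]["body"]} for mid in msg_ids]

# Usage: legal staff can run the search themselves, without involving IT.
archive = DiscoveryArchive()
archive.ingest("m1", "Q3 forecast discussed with auditors")
hits = archive.search("auditors")
archive.legal_hold(hits, tag="matter-2008-014")
package = archive.export(hits)
```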

Lundgren says that centralized policy management is an associated benefit of a third-party archive, allowing the organization to eliminate the need for user archives that represent a risk for the user and the organization. “The risk for the organization is that the e-mail in a user’s local archive has no enforceable retention and is a discovery risk; many organizations will enforce a default retention to control the lifespan of e-mail data. An organization may be asked to obtain and search everyone’s e-mail (including local archives) for the word ‘x, y, and z’ as part of a litigation.”
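The "enforceable retention" Lundgren mentions amounts to a simple rule: messages past the default retention window are purged automatically unless they are on legal hold. The sketch below assumes a three-year default and invented field names, purely for illustration.

```python
# Sketch of enforceable default retention: purge expired messages unless held.
import time

RETENTION_SECONDS = 3 * 365 * 86400   # e.g., a three-year default policy

def apply_retention(messages, now=None):
    """messages: list of dicts with 'received' (epoch seconds) and 'hold' (bool)."""
    now = now or time.time()
    return [m for m in messages
            if m["hold"] or (now - m["received"]) < RETENTION_SECONDS]
```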

Centralized policy management, enabled by non-native archive management software, can also improve the performance of the e-mail server by automatically archiving messages without user intervention, Lundgren notes. The collateral benefits of a well-managed data infrastructure for both storage capacity management and data protection can also be a good argument for bolting on archiving rather than using built-in functionality.
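The kind of policy-driven, unattended job described here might look like the following sketch: messages older than a policy threshold are moved from the mail store to the archive and replaced with a small stub, shrinking the server's working set. The mailbox and archive objects, the 90-day threshold, and the message fields are assumptions for the example, not any product's behavior.

```python
# Sketch of an automated, policy-driven mailbox archiving pass. Illustrative only.
ARCHIVE_AFTER_DAYS = 90

def archive_mailbox(mailbox, archive, now):
    """mailbox/archive: lists of message dicts; 'received' is a datetime."""
    moved = 0
    for msg in list(mailbox):
        if "stub_for" in msg:
            continue                                   # already archived and stubbed
        if (now - msg["received"]).days > ARCHIVE_AFTER_DAYS:
            archive.append(msg)                        # full message kept in the archive
            mailbox.remove(msg)
            mailbox.append({"stub_for": msg["id"], "received": msg["received"]})
            moved += 1
    return moved
```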

The argument that more than native business application functionality is needed to build a proper archive capability is pretty persuasive. However, going to third-party software for a solution has its own set of challenges, which we will look at in the next installment in this series. Until then, your comments are welcome: [email protected].
