Fixed-content Storage: The Struggle Over Standards

Content-addressable storage may be the key to meeting extended data storage management requirements.

Content-addressable storage (CAS) has pumped a bit of life into the otherwise flatlined storage spending curve at the high end of the market. With regulators requiring data to be stored for long periods, especially for health-care and financial services organizations, IT managers are tasked with finding a way to manage data retention and deletion requirements with greater discipline.

CAS seems to be the key, and early CAS appliance offerings from EMC, Nexsan, and others are selling like hotcakes if you believe what vendors are saying.

CAS, at its core, is an indexing scheme. It is a way to encode data before it is written to disk so that it doesn’t get lost over the many occasions when it is migrated from worn-out arrays to new ones, fresh out of the crate.
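To make the indexing idea concrete, here is a minimal sketch in Python. It assumes the address is a SHA-256 digest of the content, which is one common way to build a content address but not necessarily what any particular vendor ships. The `ContentStore` class and `migrate` function are hypothetical names invented for illustration, and each store is just an in-memory dictionary standing in for a disk array.

```python
import hashlib


class ContentStore:
    """Toy in-memory store standing in for one disk array (hypothetical)."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        # Assumption: the address is the SHA-256 digest of the content itself.
        address = hashlib.sha256(data).hexdigest()
        self._objects[address] = data
        return address

    def get(self, address: str) -> bytes:
        data = self._objects[address]
        # Re-hash on read so silent corruption is detected, not returned.
        if hashlib.sha256(data).hexdigest() != address:
            raise ValueError("content does not match its address")
        return data

    def addresses(self):
        return list(self._objects)


def migrate(old: ContentStore, new: ContentStore) -> None:
    """Copy every object to a new array; addresses must come out unchanged."""
    for address in old.addresses():
        if new.put(old.get(address)) != address:
            raise ValueError("address changed during migration")


old_array = ContentStore()
record_id = old_array.put(b"patient chart, 2006-03-14")

new_array = ContentStore()
migrate(old_array, new_array)

# The application keeps using the same identifier after the hardware swap.
print(new_array.get(record_id))
```

Because the identifier depends only on the content, not on a volume, path, or array serial number, the application's reference to the record survives each hardware refresh.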

Such migrations might happen six or seven times over the life of medical records, which HIPAA and some state regulations require health-care providers to store for 20 years or more. This is part of what attracted Baptist Memorial Healthcare Corporation, one of the early adopters of CAS, to EMC Centera several years ago.

Additionally, the SEC and corporate governance regulations put seven- to ten-year retention requirements on certain data prepared by publicly traded companies. These rules and mandates require firms to be able to retrieve data rapidly, within 72 hours of its being requested by an auditor or subpoenaed by the Fed. Without an indexing scheme, fulfilling both the retention and recovery requirements might become impossible.

However, the CAS story is also a confusing one. As with other industry jargon, such as continuous data protection (CDP), information lifecycle management (ILM), or wide area file services (WAFS), to cite just a few examples, an otherwise simple-to-understand idea has become cluttered with many interpretations. Some vendors, including HDS, seem to add value to CAS by adding a hefty dose of archiving technology. Others want to build out the front end of the CAS play with special “data ingestion engines” (think hierarchical storage management policy engines) or to enhance search and retrieval by collecting additional metadata (data about the data) to store in a secondary search database (think Google or Yahoo search engines).

There is nothing wrong with adding functionality to content indexing, but in the absence of standards, everyone is building their own proprietary stovepipe. This is raising the eyebrows of many IT planners, and obscuring the vision of those making strategic choices. Some key issues include fear of going down the wrong path and arriving at a dead end when and if the world rallies around a particular product and approach. Nobody wants to be stuck with obsolete technology.

Problems from Proprietary Approaches

A second issue is whether the diversity of proprietary approaches will inhibit business in the long run. For example, if your company acquires the assets of another company and their CAS system doesn’t speak the same language as your CAS system, how will the situation be remedied? One thing is certain: the vendors aren’t going to volunteer their services to help you port data off their systems.

In off-line conversations with industry insiders who offer products that compete with EMC Centera, the story is the same: all are afraid to one degree or another of what EMC’s response would be if they used their own software tools to pull data off Centera in order to migrate it into their system. EMC has an archive/backup API for Centera that could be leveraged for this purpose. However, the company doesn’t let just anyone use it. Doing so without its permission, or creating some other way to extract data from Centera, might just bring down the wrath of the storage giant in the form of lawsuits over “reverse engineering.” No one in the industry needs that kind of grief.

That’s one reason I reviewed with great interest a PowerPoint slide deck prepared by EMC senior technologist and the SNIA’s Fixed Content Aware Storage Technical Working Group co-chair, David Black, which is being presented as you read this column at the Storage Networking World Europe conference in Frankfurt, Germany (which also happens to be where I am today). The deck was e-mailed to me on the QT by a reader of this column, but I can talk about it since it will be “in the wild” by the time these pixels hit your screen.

Black will talk about the challenges posed by data growth, the inefficacy of “data silos,” and the frequency of business process change in order to set the stage for why content needs standards. I agree with him up to this point, since everyone agrees that managing content is good for business, good for compliance, and good for America.

He is expected to talk about three sets of standards in development: JCR for Application Interfaces, iECM for Content Management Interoperability, and XAM for Fixed Content Storage. These things will guarantee interoperability and all of them will specify how metadata is to be handled.

Metadata is very important to Mr. Black because he dedicates so many slides to it. I like the metaphor of canned food he uses. Without their labels, we can’t tell what’s in any can. With the labels, standardized as they are, we know not only the food type each can contains but also its nutritional information and other useful facts.

Content management systems depend on metadata, he asserts, naming only three (Documentum, FileNet, and Mobius) of the hundreds of document management systems on the market, which has been trendily renamed “electronic content management” (ECM) by some analysts.

He then launches into a fairly lengthy technical summary of the status of the evolving standards. I noted with interest that JCR (Java Content Repository) is a Java Community Process effort: an API specification for use by Java-centric ECM products to collect data and metadata from applications. I wonder what that other behemoth in the industry—Microsoft—thinks of this. They haven’t exactly embraced Java, and about 99 percent of business applications run on Microsoft these days.

Outside the server is a second pre-standard standard for interfacing the data/metadata collected via the JCR API to a common content repository service. Called iECM, Black says it offers a language- and protocol-independent reference model providing mappings to specific languages and protocols, beginning with a standard mapping to Web services. It invokes the content management service, which is provided by the underlying ECM product. iECM is being developed by AIIM, which has lately become the superstar of the ECM world by virtue of its long pedigree in the world of document management systems and other types of records administration.

At last, we come to XAM, which is under development at the SNIA. On the surface, XAM is a great idea. It is supposed to be a vendor-independent API and file system interface (FSI) that is “language independent” (though, again contrary to Microsoft interests, I would think, it is mapped only to the Java and C languages). The purpose of XAM is to provide access to fixed content storage (that is, data that doesn’t change very often or at all), independent of the location of the storage system or data. Black says that XAM is motivated by the migration to new systems and technology (which I take to mean Centera) and may prove usable for applications besides content management.

XAM functions to take the data from disparate applications and file systems and to create explicit groupings of content and metadata, with consistent naming rules, and to reduce them to a flat object so that the entire thing will scale readily. When all of this comes together, applications will automatically write their fixed content (and maybe their variable content as well) into standards-based, policy-managed, and easily searched CAS repositories. Sounds pretty good, right?

XAM Problem: The Metadata Database

Not so fast. There are some CAS players who take umbrage, even at this early stage, with the work that SNIA is doing on XAM. Implicit in the SNIA model, according to one hardware-agnostic CAS vendor, Caringo’s president Jonathan Ring, is a separate metadata database. While it is true that the listed ECM product vendors maintain their metadata in a centralized database, Ring does not believe that this is necessarily how CAS should work.

“Some of this pitch [Black’s slide deck] is clearly focused on creating a justification for a database built into CAS. The argument is metadata only lives in the database. There is no reason that metadata cannot be stored with the data, and in the database and in various other application contexts, if desired. His arguments seem self serving.”

Caringo isn’t just straining at gnats. Chief technology officer Paul Carpentier is the undisputed father of CAS, having architected a sizeable portion of the software that currently runs on EMC Centera while CTO of the Belgium-based company FilePool, prior to its acquisition by EMC a few years ago. He and cohorts Ring and Mark Goros decided to open Austin, TX-based Caringo last year to provide a hardware-agnostic CAS software solution that would run on any hardware platform. At the core of their approach is the storage of metadata with the data objects themselves, which helps ensure the security and integrity of the CAS indexing system.
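What might “metadata stored with the data” look like in practice? The sketch below is one possible reading, not Caringo’s actual object format, which this column has not seen. It assumes a length-prefixed JSON header in front of the content and a SHA-256 address over the whole bundle; `pack`, `unpack`, and `address_of` are hypothetical helpers.

```python
import hashlib
import json


def pack(content: bytes, metadata: dict) -> bytes:
    """Bundle metadata and content into one self-describing object."""
    header = json.dumps(metadata, sort_keys=True).encode("utf-8")
    # Assumed layout: 4-byte big-endian header length, JSON header, content.
    return len(header).to_bytes(4, "big") + header + content


def unpack(blob: bytes):
    """Recover (metadata, content) from a packed object."""
    header_len = int.from_bytes(blob[:4], "big")
    metadata = json.loads(blob[4:4 + header_len].decode("utf-8"))
    return metadata, blob[4 + header_len:]


def address_of(blob: bytes) -> str:
    # The address covers the metadata too, so neither part can drift alone.
    return hashlib.sha256(blob).hexdigest()


original = pack(b"MRI scan bytes", {"patient": "A-1001", "retain_years": 20})
tampered = pack(b"MRI scan bytes", {"patient": "A-1001", "retain_years": 1})

metadata, content = unpack(original)
print(metadata["retain_years"], content)
print(address_of(original) == address_of(tampered))
```

In this arrangement, changing the retention period in the label produces a different address, so a quiet edit to the metadata is as detectable as a change to the content. No separate database has to be kept in sync for the object to remain self-describing.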

Of XAM in its current form, Ring and Carpentier make the same observation: the specification doesn’t amount to much right now. The fact that its backers are pushing their “vision” today has more to do with marketecture than architecture.

Ring described the situation succinctly: “From what I’ve seen of the XAM group, their execution seems to me behind the times, rather ambitious, and has some serious flaws that will impact security and performance. They are looking to invent yet another protocol when all those required to do the job are at their finger tips.”

He has some good points. If XAM builds in search capabilities, that creates a security issue that compromises the basic nature of what we believe a CAS system represents. “Once data is stored in CAS,” says Ring, “no one, not even some software engine, should be able to search the data without having its UUID. This function clearly belongs one layer up from the CAS storage system and should allow a full spectrum of search engines to leverage a CAS store.”
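Ring’s layering argument can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than anyone’s product: `CasStore` and `SearchIndex` are hypothetical classes, the identifier is a random UUID from the standard library, and the “search engine” is a simple keyword map.

```python
import uuid


class CasStore:
    """Storage layer: the only way in or out is by UUID (hypothetical)."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        object_id = str(uuid.uuid4())
        self._objects[object_id] = data
        return object_id

    def get(self, object_id: str) -> bytes:
        return self._objects[object_id]

    # Deliberately no list(), scan(), or search() method on this layer.


class SearchIndex:
    """One layer up: a replaceable index that maps keywords to UUIDs."""

    def __init__(self, store: CasStore):
        self._store = store
        self._by_keyword = {}

    def ingest(self, data: bytes, keywords) -> str:
        object_id = self._store.put(data)
        for keyword in keywords:
            self._by_keyword.setdefault(keyword.lower(), set()).add(object_id)
        return object_id

    def find(self, keyword: str):
        ids = sorted(self._by_keyword.get(keyword.lower(), ()))
        return [self._store.get(object_id) for object_id in ids]


store = CasStore()
index = SearchIndex(store)
index.ingest(b"Q3 audit workpapers", ["audit", "2006"])

print(index.find("audit"))
print(index.find("payroll"))
```

The store itself never exposes a way to enumerate or query its contents, so whoever holds only the storage layer holds nothing searchable. The index can be swapped for a different engine without touching the stored objects.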

Another good question: why has the SNIA not chosen to base the application-to-CAS interface on a network-based protocol such as HTTP 1.1? Ring says that at the last XAM meeting he attended, the consumer delegates in attendance wanted it this way, but “the vendors wanted to add layers of complexity to integrate. When it comes to on-the-wire open protocols, what better than HTTP?”

“When I last looked,” Ring said, “the XAM team was considering building in some kind of database engine and all the type checking that goes with it. The naiveté around this topic was somewhat surprising from a scope-of-work perspective: they seemed in over their heads and only making matters worse with injected complexity. They need to think in building blocks, not monolithic structures. Also, reinventing SQL does not seem to me a wise course of action. Of course, specialized expensive hardware could solve the performance problem. … Hmmm, I wonder who wants this.”

Ring’s comments make enormous sense. Architecturally, functions such as databases and search engines should not be built in. They lock the customer in to a specific XAM-defined version. Ring believes that this is contrary to industry wants and contrary to current trends. “Providing an open system that takes advantage of the latest in database and search technology is clearly the preferred route and creates choice and leverage for the customer. This ‘built in’ architecture would hinder innovation as CAS evolves.”

Ring is also spot on when he notes that the complexity of the current XAM specification will drive a looping process of prototypes, refinements, bug fixes, more refinements, and more, all of which will favor the large CAS vendors over the smaller ones. “Add to that the effort required to create a solid validation test suite and you get a long window before something usable pops out.”

Caringo’s own wares are based on HTTP 1.1 and SQL. This decision was made because it delivers a highly portable, standards-based, on-the-wire protocol for storing fixed content for a very long time. They have an excellent standards-based core technology around which many layers of functionality (search, protocol translation, etc.) can be wrapped.
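For a sense of how little machinery an HTTP-based store needs, here is a loopback sketch using only the Python standard library. The endpoints are invented for illustration and are not Caringo’s interface: it assumes a POST to `/` stores the body and returns a SHA-256 address, and a GET on `/<address>` returns the object. `CasHandler` and the in-memory `OBJECTS` dictionary are hypothetical.

```python
import hashlib
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

OBJECTS = {}  # in-memory stand-in for the disk behind the server


class CasHandler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"

    def _reply(self, status: int, body: bytes) -> None:
        self.send_response(status)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_POST(self):
        # Store the request body and answer with its content address.
        length = int(self.headers.get("Content-Length", 0))
        data = self.rfile.read(length)
        address = hashlib.sha256(data).hexdigest()
        OBJECTS[address] = data
        self._reply(201, address.encode("ascii"))

    def do_GET(self):
        data = OBJECTS.get(self.path.lstrip("/"))
        if data is None:
            self._reply(404, b"")
        else:
            self._reply(200, data)

    def log_message(self, *args):
        pass  # keep the demo quiet


# Port 0 asks the OS for any free port on the loopback interface.
server = HTTPServer(("127.0.0.1", 0), CasHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f"http://127.0.0.1:{server.server_port}"

# Bypass any proxy settings so the request stays on the local machine.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))

request = urllib.request.Request(base + "/", data=b"quarterly filing", method="POST")
with opener.open(request) as reply:
    address = reply.read().decode("ascii")

with opener.open(f"{base}/{address}") as reply:
    fetched = reply.read()

server.shutdown()
server.server_close()
print(fetched)
```

Any language with an HTTP client can talk to a store like this, which is the portability argument in a nutshell. Search, protocol translation, and the rest would sit in front of it as separate services.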

Do We Need Another Protocol?

The argument could be made that we really don’t need another protocol like XAM. There are plenty of standards-based protocols already available to do the job. Is it simply ego that drives engineers at brand-name storage companies to overcomplicate the world? Is there a financial incentive among the big vendors to hold back innovation? If not, why isn’t the SNIA Fixed Content Working Group building its stack with open tools and protocols already in hand?

This is more than a case of technology head-butting. Ultimately, it will go to the core issue behind CAS itself: money. How can we spend less money to get the job done that is being demanded by the regulators? How do we solve the problem of long-term, secure, and searchable fixed-content storage without selling our souls to a three-letter acronym?

Your thoughts are invited.