The Search for Universal Data Classification

You can’t optimize capacity utilization until you understand the data you’re storing.

One of the biggest challenges in wresting control of storage costs is developing an understanding of the data parked on all the expensive disk arrays we’ve been buying. Until we do this, there's not much we can do about optimizing capacity utilization, or about keeping ourselves from buying so many spindles so quickly.

Until we understand the data, we can’t discern its value and properly provision it with the kind of storage services (security, disaster recovery, etc.) that it needs: no more and no less. We are, therefore, usually flying blind and subsequently buying whatever our vendors tell us that we need this week, a story that is likely to change next week because they are just as interested in “top line growth” as we are in our own.

Ideas currently in vogue—such as “byte de-duplication” or “byte factoring” to squeeze more information into fewer bits, or archiving by data type to clear stale data out of our databases, e-mail repositories and document management systems—are all well and good. They are, however, only tactical measures. Ultimately, such strategies do not improve utilization efficiency.

Diligent Technologies is the latest vendor to join the de-duplication technology camp. At Storage Networking World in San Diego, CA, a couple of weeks ago, I chatted with CEO Doron Kempel about his new product, which is based on a unique factoring technology he calls HyperFactor. The key difference between HyperFactor and competitive approaches from Rocksoft (now owned by ADIC), Avamar Technologies, and Data Domain is that the factoring is done in memory for enhanced speeds and feeds at the time the data is ingested into the archive.

In case that flew by too quickly, we are talking about compressing data as it flows into an archive. Competitors use techniques that involve reading the data to be archived for a common pattern of zeros and ones. When this pattern is discovered, it is replaced with a stub. Repeating this over and over again creates a “factored” set of data that is much smaller than the original—up to 40:1 by some vendor claims—which is then written to the archive repository disks.
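For readers who want to see the stub idea in miniature, here is a rough sketch in Python. It assumes fixed-size chunks and SHA-256 digests as the stubs; those choices, the tiny chunk size, and the function names are mine for illustration and do not describe any vendor's actual scheme.

```python
import hashlib

# Unrealistically small chunk size, chosen only so the example is easy to follow.
CHUNK_SIZE = 8


def factor(data: bytes, store: dict) -> list:
    """Split data into fixed-size chunks and return one stub (digest) per chunk.

    Each distinct chunk is written to `store` once; a repeat costs only a stub.
    """
    stubs = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)  # keep the first copy, skip duplicates
        stubs.append(digest)
    return stubs


def restore(stubs: list, store: dict) -> bytes:
    """Rebuild the original bytes by looking up each stub in the chunk store."""
    return b"".join(store[s] for s in stubs)


store = {}
original = b"ABCDEFGH" * 5 + b"12345678"  # five repeated chunks plus one unique chunk
stubs = factor(original, store)
unique_chunks = len(store)
round_trip = restore(stubs, store)
```

Six chunks go in, but only two distinct chunks land in the store. The ratio you get in practice depends entirely on how repetitive the data is, which is why vendor ratio claims deserve a test against your own data.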

Diligent, according to Kempel, doesn’t like those factoring schemes, so the company wrote one itself that looks for similarities between data that is being “ingested” into the archive scheme and data already written there. When it finds similarities, it records only the different bits of the data being ingested into its ProtecTIER platform. This requires a boatload of memory, which Kempel says can be cobbled together cheaply by using clusterable servers. It is an interesting notion, one that he said would be validated in conversations with happy beta-test customers when they are willing to talk publicly about their tests (in the next few weeks). We will do the interviews at that time.
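The general idea of recording only the differences against similar, already-stored data can be sketched with Python's standard difflib module. To be clear, this is not HyperFactor, whose internals were not described to me. It is a generic delta illustration, and it assumes the similar stored item (here called base) has already been found, which is the hard part. The function names are hypothetical.

```python
from difflib import SequenceMatcher


def make_delta(base: bytes, new: bytes) -> list:
    """Describe `new` as copies from `base` plus literal bytes that differ."""
    ops = []
    matcher = SequenceMatcher(None, base, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))      # reuse bytes already stored
        elif j2 > j1:
            ops.append(("data", new[j1:j2]))  # store only the bytes that differ
    return ops


def apply_delta(base: bytes, ops: list) -> bytes:
    """Rebuild the new item from the stored base and the recorded delta."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += base[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)


base = b"quarterly report 2005: revenue up"
new = b"quarterly report 2006: revenue up"
delta = make_delta(base, new)
literal_bytes = sum(len(op[1]) for op in delta if op[0] == "data")
rebuilt = apply_delta(base, delta)
```

Here the delta carries far fewer literal bytes than the new item contains, because nearly everything is expressed as a copy from the base.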

Squeezing data might just save you some space in your archive, but it doesn’t tell you what the data is. Kempel said as much when he noted that his approach was just “really fast, but not content aware.” From where I’m sitting, everyone in Diligent’s space is trying to perfect the digital equivalent of a trash compactor. Yes, the refuse occupies less space, but it doesn’t provide any incentive for its user to look for ways to cut back on trash generation.

Archivas, another company I enjoyed talking to at SNW, is the latest player in the data archiving game. Like factoring, archiving is another data management panacea that is spreading like wildfire through the industry right now. When data gets stale, migrate it into an archival silo. That gets older, less-frequently-used data off your most expensive disk. That’s the value proposition in a nutshell.

Archivas is trying to do the value case one better. Its rather complicated architecture provides a means to store metadata derived from the archived files themselves in a searchable repository. Their secret sauce is the search language they use and their high-speed metadata indexing techniques. You can ingest data into the archive via a clustered server engine at about 1000 objects per second, according to the company’s CEO and evangelist, Andres Rodriguez.

Rodriguez says that his indexing technique is fast and efficient and collects a lot of useful information that can be used to search the repository and act on the data stored there. He insists that, despite the limitations of metadata included with many file types, there is enough good info there to identify, for example, all of the documents that might be subpoenaed by the SEC. With all due respect to Rodriguez, that is something I would try before I buy.

What we are seeing, when you stand back from the fray, is the first step in blurring the distinctions between several concepts: compression, archiving, and index-based search and retrieval (a quasi-form of content addressing). These solutions add up to some improved grooming of the storage repository, but are not the holy grail of data management. As nearly as I can tell, none of the strategies I have seen to date actually classify data itself based on the information it contains and the factors related to its value or maintenance requirements over time.

Moving in that direction is Scentric, Inc. A critical remark I made to a press agent who sent me an advance release about this week’s announcement of Scentric’s Destiny product prompted a quick response from Larry Cormier, senior vice president of marketing. He reminded me that we had met in San Diego and complained that, although I had seemed to be one of the few who “get it” when it comes to the goals that the company is pursuing, I was now dissing their product.

I objected to the reference in the press release to a “universal data classification solution” offered by the company. Specifically, I was put off by the use of the word “universal.” Actually, I rather liked their product, I explained, but could do without the hype.

What Scentric Destiny does, in a nutshell, is discover the data you have on your spindles and then organize it into a listing that you can use to apply a classification scheme. That is very useful. Referring to this functionality as a universal data classification solution, however, is confusing. It smacks of the same hyperbole used by the Storage Networking Industry Association, which is setting out to define a universal taxonomy for data classification.

In SNIA’s case, I worry that the undertaking will become a thinly-veiled scheme by hardware vendors to divvy up the data pie among their hardware platforms: EMC gets data class ABC, IBM gets data class DEF, HDS gets data class GHI, and so forth. That’s the kind of thing that the cynics in the storage industry, myself included, think of when you say “universal” in the same breath as “SNIA.”

Cormier was quick to point out that this was not his claim with respect to Scentric Destiny. In its first iteration, due this week, the product will offer discovery and a set of baseline descriptions and actions to enable users to begin assigning values and handling instructions to the data itself. Users can add value by including more descriptions and actions that suit them, building over time a schema that meets their needs.

With Destiny, we are taking a list of files, metadata, and management policies and procedures and writing this to an external catalog. The catalog itself is replicated to prevent loss. The actions that need to be taken on the listed data over time (migration and deletion) are performed according to the cataloged instructions.
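To make the catalog idea concrete, here is a minimal sketch of what such a listing and its handling instructions might look like. The field names, class labels, file paths, and age thresholds are all hypothetical. Nothing here reflects Destiny's actual catalog format, which I have not seen documented.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class CatalogEntry:
    path: str            # file the entry describes
    data_class: str      # label assigned during classification (hypothetical)
    last_modified: date  # metadata gathered at discovery time
    action: str          # "migrate" or "delete"
    after_days: int      # age at which the action becomes due


def due_actions(catalog: list, today: date) -> list:
    """Return (action, path) pairs for entries old enough to act on."""
    return [
        (entry.action, entry.path)
        for entry in catalog
        if today - entry.last_modified >= timedelta(days=entry.after_days)
    ]


catalog = [
    CatalogEntry("/finance/q1.xls", "financial-record", date(2005, 1, 15), "migrate", 90),
    CatalogEntry("/tmp/scratch.log", "transient", date(2006, 3, 1), "delete", 30),
    CatalogEntry("/hr/policy.doc", "reference", date(2006, 4, 1), "migrate", 365),
]
due = due_actions(catalog, date(2006, 4, 20))
```

In this example the first two entries come due and the third does not. The mechanics are simple. The value lies in whoever assigns the class and the handling instructions in the first place.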

Cormier believes that it may be possible to create additional volumes of standardized descriptions and actions over time: templates drawn from experience within specific industries. He might well be right. However, creating a one-size-fits-all classification scheme seems about as possible as developing a unified theory of physics—which is to say, not very possible.

The user interface on Destiny is intuitive and top notch. Data can be grouped by their creator, by user department, or by a number of other schemes to simplify classification efforts. Of course, this also has the potential of dumbing down the classification scheme that is ultimately developed. Just as with IBM’s venerable ILM scheme (DFHSM), you get out of the interactive classification process exactly what you put into it in terms of operational granularity.

Rather than a universal data classification scheme, Destiny appears to be a good tool that might just develop into a great tool. Still, it leaves it up to those doing the classification (in many cases, users themselves) to evaluate the contents of a file and to apply the right class. Given user reluctance to cooperate with any meaningful data classification scheme up to now, the efficacy of this approach remains questionable.

Think you have a universal approach to data management? E-mail me at