In-Depth
Universal Data Classification, Part 2
While there may be no complete solution to the content-and-context-managed storage problem, two vendors have interesting products that are worth a look.
Following my column last week on data classification, I was inundated with responses from vendors who, while they agreed with my premise, suggested that I learn more about their wares. Companies such as Index Engines and Abrevity asked for quick phone briefings to acquaint me with their products, which occupy what the analysts are calling the “Information Classification Management” (ICM) market.
ICM is not to be confused with ECM (electronic content management), an updated label for “document management systems” that we saw proliferate in the banking community in the 1980s. Apparently, ECM is the domain of vendors who want to replace all of our venerable Microsoft productivity applications with pre-defined and uniform data entry screens that feed their output (still called “documents”) into a well-managed and indexed content repository.
ICM is different. Mainly, the ICM purveyors are trying to discover data wherever it exists, gather its metadata, add some context (in the form of a parsing algorithm), and package this data-about-the-data into a metadata repository. In some cases, the original file itself goes to some sort of index silo overseen by a content-addressable storage system or an object-oriented database.
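To make the discovery step concrete, here is a minimal sketch in Python of what such a crawl might gather. It is my own illustration, not any vendor’s code; the record fields are invented, and a real product would feed them into an indexed repository rather than a list in memory.

```python
import os
import time

def discover(root):
    """Walk a file tree and gather basic metadata for each file.
    The returned list stands in for the metadata repository an
    ICM product would maintain (fields here are illustrative)."""
    repository = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            repository.append({
                "path": path,
                "size_bytes": st.st_size,
                "modified": time.ctime(st.st_mtime),
                "extension": os.path.splitext(name)[1].lower(),
            })
    return repository
```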
In short, ICM is all about managing data by class. In most cases, the management extends across the useful life of the data and across the storage infrastructure in its entirety, from capture disk (where apps first write their data) to retention disk (where data lives after its access and update frequencies drop off) to archival tape (where data goes to die, if you listen to the disk-centric crowd).
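The tiering decision itself boils down to a policy rule. The sketch below is a toy; the thresholds are invented for illustration and belong to no particular product.

```python
from datetime import datetime, timedelta

def pick_tier(last_accessed, now=None):
    """Toy lifecycle policy: route data to a cheaper tier as its
    access frequency drops off (thresholds are invented)."""
    now = now or datetime.now()
    age = now - last_accessed
    if age < timedelta(days=30):
        return "capture disk"
    if age < timedelta(days=365):
        return "retention disk"
    return "archival tape"

print(pick_tier(datetime.now() - timedelta(days=400)))  # archival tape
```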
The practical issue confronting vendors is how to overlay this technology on existing infrastructure without breaking the customer’s piggy bank. Currently, analysts estimate that the cost to do ICM is about $500 per terabyte stored, according to Bill Reed, vice president of sales and marketing for Abrevity. He notes that there are some other niggling pragmatic barriers that we’ll consider in a moment.
First things first. ICM vendors, which include Index Engines, Kazeon, Scentric (see the previous column), Reed’s Abrevity, Microsoft (when and if its vision for WinFS is realized), and EMC (through its content-addressable storage repository, Centera, in conjunction with Google search tools), are distinguished by their focus on search and retrieval. Each one claims to do its competitors one better, and a lot of patents are being filed to create discriminators that elude those of us who don’t really appreciate the fine distinctions among the algorithms each vendor is sporting.
Index Engines, for example, boasts not only a text search feature that finds and indexes all of the verbiage in files (“unstructured data”) and e-mail (“semi-structured data”), but also adds a unique “word proximity” algorithm that spokespersons claim can refine the identification of the “right” data for a given query or class. Purportedly, this engine also enables data to be reconstructed from the index if the file or e-mail is ever accidentally deleted or corrupted.
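Index Engines hasn’t published the algorithm, so take the following only as a sketch of the general idea behind proximity matching, not a description of its engine: a document qualifies when two query terms fall within a few words of each other.

```python
import re

def within_proximity(text, term_a, term_b, window=10):
    """Return True if term_a and term_b occur within `window` words
    of each other -- a toy version of word-proximity matching."""
    words = re.findall(r"\w+", text.lower())
    pos_a = [i for i, w in enumerate(words) if w == term_a]
    pos_b = [i for i, w in enumerate(words) if w == term_b]
    return any(abs(a - b) <= window for a in pos_a for b in pos_b)

# Flag documents where "merger" appears near "confidential"
print(within_proximity("This confidential memo covers the merger.",
                       "merger", "confidential"))  # True
```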
Index Engines isn’t trying to do storage hardware, thankfully. Instead, it offers a software developer’s kit that can be leveraged by the hardware people (or by other software gurus) to create other types of managed storage solutions. Courtesy of these relationships, its technology now plays across disk platforms and on tape systems.
Reed’s Abrevity technology can be summarized quickly via a step-by-step description of its operation, which he provided via e-mail (a rough sketch of a few of these steps follows the list):
- We scan what you have (CIFS or NFS) or we are handed metadata from a NAS log, etc.
- We parse the metadata (slice and dice) per user-selected criteria (between dots, dashes, slashes, etc.). No other solution is user-selectable.
- We let you discover what you have. Unlike search, you click on an attribute (file name word, file type, etc.) and we list all files with those attributes. For file name or folder name word, for example, you can start typing T and see all words starting with T (or if selected, containing T). Type TOI and see all those words and so on.
- We let you do a rich Boolean query (very easy) to search for all files with XYZ in the name, older than Jan 2006, Office docs, etc.
- Once found, you can output to external reports (Excel, XML, Text, etc.) or internal (customizable visual reporting engine). You can also drag and drop them to tags (for classification) or automate this process. You can also drag them to actions (delete, copy, migrate, etc.). An action could be to crack those files and extract metadata.
- If you selected extraction, we crack those files and search for and extract any of the following metadata: key words or phrases, including proximity word searches; document summaries or themes (this is how you can tell if it's a personal file or business-related or what it is); document tones (is this document positive or negative toward my company or specific persons?); target values, such as people, places, companies, Social Security or credit card numbers, bio-instrument values, and so on.
- Once extracted, we place attributes in proper categories (John Apple is under People, not Fruit).
- From there you can click on those attributes and discover what files contain them, then tag, then manage, etc.
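As promised above, here is a rough sketch of steps two, four, and six of that pipeline. The function names, delimiter set, and record layout are my inventions for illustration; Abrevity’s actual implementation is, of course, its own.

```python
import re
from datetime import datetime

def parse_name(filename, delimiters=r"[.\-_/ ]"):
    """Step two: slice a file name into tokens on user-selected
    delimiters (dots, dashes, slashes, and so on)."""
    return [t for t in re.split(delimiters, filename) if t]

def matches(record, name_token, older_than, extensions):
    """Step four: a simple Boolean query over parsed metadata --
    name token AND age AND file type."""
    tokens = [t.lower() for t in parse_name(record["name"])]
    return (name_token.lower() in tokens
            and record["modified"] < older_than
            and record["ext"] in extensions)

def extract_targets(text):
    """Step six, narrowly: pull Social Security-shaped values out
    of cracked file content with a regular expression."""
    return re.findall(r"\b\d{3}-\d{2}-\d{4}\b", text)

record = {"name": "XYZ-budget-2005.xls",
          "modified": datetime(2005, 11, 3), "ext": ".xls"}
print(matches(record, "XYZ", datetime(2006, 1, 1), {".xls", ".doc"}))  # True
print(extract_targets("Employee SSN: 123-45-6789"))  # ['123-45-6789']
```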
Reed is very proud of the alacrity with which these operations are performed. He attributes the speed to a parallelized processing engine that slices jobs among multiple servers when one server is overwhelmed. His parallel database, in his view, is far more efficient and powerful than the conventional relational databases used by other vendors. From his description, Abrevity appears to have taken a page out of the grid computing efforts of the National Science Foundation and the High Performance Computing Center at the University of New Mexico, creating a parallel database engine that can add and subtract processing resources as needed by the task at hand.
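Reed didn’t describe the engine’s internals, but the general pattern of slicing one job across a pool of workers looks something like the sketch below; the `crack` function is a placeholder for whatever per-file extraction the real product performs.

```python
from multiprocessing import Pool

def crack(path):
    """Placeholder for the per-file work (metadata extraction)."""
    return path, len(path)  # stand-in result

def classify_in_parallel(paths, workers=4):
    """Fan a classification job out across a pool of worker
    processes, in the spirit of the parallel engine Reed describes."""
    with Pool(processes=workers) as pool:
        return dict(pool.map(crack, paths))

if __name__ == "__main__":
    print(classify_in_parallel(["/data/a.doc", "/data/b.xls"]))
```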
Like Index Engines, Abrevity is hardware agnostic. It is also, Reed is quick to concede, not a complete solution to the problem of content-and-context-managed storage. “There's no perfect system out there (no one has five 9's data classification), but to our knowledge, this is the easiest and one of the most accurate systems available.” He quickly adds that prices for Abrevity-classified data start at only $499 a TB, breaking the $500 price barrier by a dollar. Competitors, according to him, start their pricing at $10,000 per TB and escalate from there.
Abrevity is different from the other vendors I have spoken with in that it seems preoccupied with pricing its technology to be affordable to small-to-medium businesses as well as to the Fortune 500. However, it is clear from my conversations with Abrevity and its competitors that a fully-automated data classification system remains a holy grail.
In addition, while Abrevity endeavors to remain platform-agnostic, it is clear that the pressure is on all of these companies to join their products at the hip to leading OEMs such as EMC, IBM, Network Appliance, and others. To succeed in the short term, they are being pressured to sell their souls to the company store, a not-so-good thing that will drive up the cost of the overall solution, which, in turn, runs the risk of placing it beyond the reach of smaller companies. Kazeon today has close ties to both Network Appliance and EMC. Some of the other vendors are seeking to brand their own hardware to go with their software to compete for the consumer seeking the one-stop-shop/one-throat-to-choke solution.
What needs to happen, in my opinion, is for several of these smart guys to come together and create a software stack that runs on any hardware. Placing Abrevity, and maybe software-only content-addressable storage vendor Caringo, in the same room would make for a very interesting discussion. Placing Microsoft in the room with them might actually result in a comprehensive solution that could leverage Redmond’s WinFS componentry.
We’ll keep an eye on developments and keep you informed about moves in this space. For now, your insights and observations are invited: [email protected]
About the Author
Jon William Toigo is chairman of The Data Management Institute, the CEO of data management consulting and research firm Toigo Partners International, as well as a contributing editor to Enterprise Systems and its Storage Strategies columnist. Mr. Toigo is the author of 14 books, including Disaster Recovery Planning, 3rd Edition, and The Holy Grail of Network Storage Management, both from Prentice Hall.