Discovery Tools Fundamental to Data Management

Effective data management is an idea that has been ignored for far too long.

It may have taken a recession, or the emergence of a more aggressive regulatory oversight climate, but data management strategy is gaining traction with more business IT planners. Big brands and start-ups alike are beginning to offer new products to meet the demand.

In future columns I will discuss CA's information governance wares, as well as those of Novell and SGI. This installment focuses on a Boxborough, MA start-up with big ideas: Digital Reef Inc.

Digital Reef came out of stealth mode earlier this month with a Web site and a blog by president and CEO Steve Akers. Akers has a strong networking background, having held senior technical and managerial positions with Shiva Technologies, Spring Tide Networks, and Lucent Technologies before co-founding Digital Reef. That experience gave him a first-hand view of both the benefits and the challenges of sharing data across a network. Now he has turned his attention to a problem that distributed data has created: the inability to find the files you're looking for.

Search and discovery tools are nothing new, of course. Engines have proliferated in recent years that find and sort data by metadata, file name, file content, juxtaposition of words within files, and so forth. Akers' contribution is some deep mathematics: an algorithm that captures document context and enables similarity searching, so that a universe of near duplicates, versions, and related files can be quickly culled from the junk drawer of documents.
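
Digital Reef hasn't published its math, so take what follows as a conceptual sketch only. The general idea behind similarity search is to reduce each document to a numeric fingerprint, then measure how close two fingerprints sit in that space. Everything here (the bag-of-words vectors, the cosine measure, the file names) is my illustration, not the company's algorithm:

# Conceptual sketch only: Digital Reef has not disclosed its algorithm.
# Each document becomes a term-frequency "fingerprint"; cosine similarity
# then measures how close two fingerprints are (1.0 = same direction).
import math
from collections import Counter

def fingerprint(text):
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

corpus = {
    "contract_v1.txt": "master services agreement between acme and widgetco",
    "contract_v2.txt": "amended master services agreement between acme and widgetco",
    "recipe.txt": "two cups flour one cup sugar bake at 350 degrees",
}

query = fingerprint("master services agreement acme widgetco")
for name, text in corpus.items():
    print(name, round(cosine_similarity(query, fingerprint(text)), 2))
# The two contract versions score high; the recipe scores near zero,
# which is exactly the near-duplicate and version culling described above.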

File discovery is an initial step in data classification and management. According to Brian Giuffrida, vice president of marketing and business development, Digital Reef's customers are leveraging its "unstructured data management platform" to reclaim storage space (by identifying duplicates and "thinning the herd") and to identify files requiring special handling (that is, files "whose loss or disclosure could yield significant soft and hard costs to the business").
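
Exact duplicate identification is, at least in principle, the easy half of that storage-reclamation story. Here is a minimal sketch, assuming nothing about Digital Reef's implementation (the directory path is a placeholder): hash each file's contents and group the files whose hashes collide.

# Minimal sketch, not Digital Reef's implementation: files with identical
# content hashes are exact duplicates, and all but one copy can be "thinned."
import hashlib
from collections import defaultdict
from pathlib import Path

def find_duplicates(root):
    groups = defaultdict(list)
    for path in Path(root).rglob("*"):       # walk the repository
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            groups[digest].append(path)
    return {d: p for d, p in groups.items() if len(p) > 1}

for digest, paths in find_duplicates("/shared/files").items():  # placeholder path
    keep, extras = paths[0], paths[1:]
    reclaimable = sum(p.stat().st_size for p in extras)
    print(f"keep {keep}; reclaim {reclaimable} bytes from {len(extras)} copies")

Near duplicates and versions are the harder problem, and that is where the similarity math described earlier earns its keep.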

This isn't Google Desktop or Microsoft's latest search engine. Digital Reef's software platform provides bells and whistles aimed squarely at enterprise requirements. Its three-tier architecture begins with an access tier that controls the security parameters on searches while providing a robust interface supporting word, date, Boolean, and similarity or "search profile" scanning of distributed file repositories.
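
The company hasn't published its interface, so the sketch below is strictly my invention: the class names, fields, and defaults are all assumptions. It simply shows how word, date, Boolean, and similarity criteria, plus the access tier's security constraint, might compose into a single request.

# Hypothetical request structure; Digital Reef's actual API is not public.
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

@dataclass
class SearchProfile:
    """A 'find documents like this one' criterion."""
    exemplar_doc_id: str
    min_similarity: float = 0.8

@dataclass
class Query:
    words: list = field(default_factory=list)        # word search
    after: Optional[date] = None                     # date constraint
    boolean: Optional[str] = None                    # e.g. "merger AND NOT draft"
    profile: Optional[SearchProfile] = None          # similarity search
    allowed_roles: set = field(default_factory=set)  # access-tier security

q = Query(
    words=["acquisition"],
    after=date(2008, 1, 1),
    boolean="termsheet OR agreement",
    profile=SearchProfile("contract_v1.txt", min_similarity=0.75),
    allowed_roles={"legal", "compliance"},
)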

Tier 2, the services tier, processes job requests and indexes search results, complete with a special checkpointing process that differentiates this engine from others you may have used. It works in conjunction with tier 3, the analytics tier, to describe each document with a unique mathematical identifier that can be referenced in "spatial" as well as descriptive terms, facilitating subsequent scans and searches based on document similarity.
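
That checkpointing detail deserves a moment. Digital Reef hasn't described how its process works, but the value of the idea is easy to show with a sketch (every detail here is assumed): record progress durably as each document is indexed, so a large job that gets interrupted resumes where it left off instead of starting over.

# Sketch of checkpointed indexing; all details are assumptions, not
# Digital Reef's design. Progress is saved after every document so an
# interrupted job resumes rather than restarts.
import json
import os

CHECKPOINT = "index_checkpoint.json"

def load_checkpoint():
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return set(json.load(f))
    return set()

def index_documents(doc_ids):
    done = load_checkpoint()
    for doc_id in doc_ids:
        if doc_id in done:
            continue                      # indexed on a previous run
        # ... compute the document's mathematical identifier here ...
        done.add(doc_id)
        with open(CHECKPOINT, "w") as f:  # durable progress marker
            json.dump(sorted(done), f)

index_documents(["contract_v1.txt", "contract_v2.txt", "recipe.txt"])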

From what I've seen, this is about the most intelligent search facility you can buy and a real candidate for preparing your "unstructured" (aka user file) data for inclusion in a more global scheme of data archiving and hygiene. It can also be used to identify files that are subject to specific regulatory requirements or other business rules, thereby facilitating information governance and lifecycle management schemes.
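
How might that regulatory identification work in practice? As a back-of-the-envelope illustration only (the rule names and patterns below are mine, not anyone's compliance checklist), classifying a discovered file can be as simple as matching its text against patterns that signal regulated content.

# Illustrative only: invented rule names and patterns, not a compliance
# checklist. The point is mapping discovered documents to the business
# rules (retention, protection, deletion) that govern them.
import re

RULES = {
    "possible-PII":        re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),      # SSN-like
    "possible-cardholder": re.compile(r"\b\d{16}\b"),                 # PAN-like
    "litigation-hold":     re.compile(r"\b(subpoena|deposition)\b", re.I),
}

def classify(text):
    """Return the tags of every rule whose pattern appears in the text."""
    return [tag for tag, pattern in RULES.items() if pattern.search(text)]

print(classify("Employee SSN 123-45-6789 on file"))  # ['possible-PII']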

The process of discovery is handled by a "grid" of commodity Linux servers networked together into an IP-connected "cluster." To get additional performance in the face of an ever-growing number of files, simply add more Linux servers to the grid. Digital Reef is architected to load balance automatically across multiple server heads. In operation, index data is written in "parallel shards" across as many indexing heads as you deploy, according to Giuffrida, and virtually every file container -- from Microsoft Office output to Adobe PDF formatted files -- can be read and indexed.
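
Giuffrida didn't detail the sharding scheme, so here is a rough sketch of the concept only; the head names and the hash-modulo placement are my assumptions. The point is that index writes fan out deterministically across however many indexing heads you have deployed, so throughput grows with the grid.

# Sketch of "parallel shards" across indexing heads. Head names and
# hash-modulo placement are assumptions; a production system would
# likely use consistent hashing to avoid reshuffling when heads change.
import hashlib
from concurrent.futures import ThreadPoolExecutor

HEADS = ["head-1", "head-2", "head-3"]  # hypothetical grid nodes

def shard_for(doc_id):
    """Deterministically assign a document to one indexing head."""
    h = int(hashlib.sha1(doc_id.encode()).hexdigest(), 16)
    return HEADS[h % len(HEADS)]

def write_index_entry(doc_id):
    head = shard_for(doc_id)
    # ... ship the index entry to `head` over the IP network here ...
    return f"{doc_id} -> {head}"

docs = [f"file-{i}.docx" for i in range(10)]
with ThreadPoolExecutor(max_workers=len(HEADS)) as pool:
    for placement in pool.map(write_index_entry, docs):
        print(placement)

Adding a fourth server is just another entry in HEADS; new writes spread themselves across it automatically.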

An obvious fit for the technology is the realm of storage-as-a-service clouds, where Digital Reef's security capabilities could be the answer to some nagging issues. The tiered architecture provides a convenient place to set and manage rules that determine who can access data and what they can do within a session. Role-based security is practicable, and the entire product is Web services enabled.
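
Role-based rules at the access tier are easy to picture. In this sketch, the repository and role names are invented for illustration; a search is permitted only when the requester holds a role cleared for the target repository.

# Sketch of role-based access rules at the access tier; repository and
# role names are invented for illustration.
ACCESS_RULES = {
    "hr_records":    {"hr", "compliance"},
    "legal_holds":   {"legal", "compliance"},
    "general_share": {"hr", "legal", "compliance", "engineering"},
}

def may_search(user_roles, repository):
    """Allow a search only if the user holds a role cleared for the repository."""
    return bool(user_roles & ACCESS_RULES.get(repository, set()))

print(may_search({"engineering"}, "legal_holds"))  # False
print(may_search({"compliance"}, "legal_holds"))   # True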

Interestingly, the Digital Reef Web site sings the praises of information governance. The product is positioned in large part as a tool for herding data cats so you can see more readily which assets are subject to which regulations. From where I'm sitting, it has a potentially much broader role to play in culling the data junk drawer and creating domains of the data universe that can be exposed to business rules for data protection, preservation, archiving, retention and deletion, and security. It is the beginning of what SGI, another vendor that will shortly introduce its own technology in this space, calls a data management operating system.

From my viewpoint, it is an idea whose time hasn't just arrived, but one that has been ignored for much too long. In its press announcement, the company cites analysts who say that "Unstructured and semi-structured data in documents, spreadsheets, e-mails, and presentations account for up to 85 percent of enterprise data." Given the expected growth of all data, between 37 and 100 percent per year depending on which EMC-funded study you believe, the problem of managing unstructured data will grow commensurately for most companies. That requires a scalable solution for discovery that works faster and better than most search-and-index tools today.

Moreover, and this is key: Akers notes in his first blog post that the business value or context of data tends to morph over its useful life. That requires flexible discovery tools that can capture the changing context of, and spatial relationships between, data over time. If Digital Reef can back up its claims, it may be just what the regulators ordered.

Your comments are welcomed: [email protected].

About the Author

Jon William Toigo is chairman of The Data Management Institute, the CEO of data management consulting and research firm Toigo Partners International, as well as a contributing editor to Enterprise Systems and its Storage Strategies columnist. Mr. Toigo is the author of 14 books, including Disaster Recovery Planning, 3rd Edition, and The Holy Grail of Network Storage Management, both from Prentice Hall.
