Text Mining

Since the mid-1990s there has been a tremendous amount of focus on analysis of structured data. We have seen an explosion of interest in products that cover capabilities including reporting, data warehousing, OLAP analysis, charting and visualization, and data mining.

This focus on structured data has been driven by the desire to extract, consolidate, and analyze information from operational systems. These operational systems include customized applications and packaged applications from vendors such as SAP, Baan, PeopleSoft, and others.

This operational data is typically locked up in data jail, inaccessible to the business users who need it to make informed decisions. Because of the desire to use this data more effectively, data warehousing and related technologies are slated to grow into a $14 billion market by 2002, according to a report from Merrill Lynch & Co.

But a lot of people are beginning to wonder about that other data that’s sitting out there on millions of C drives and network servers. The volume of non-relational, non-operational data and content that’s been created and massaged into letters, spreadsheets, presentations, white papers, proposals, engineering change orders, and a host of other types of business files, is orders of magnitude larger than all the relational databases in the world put together. Even this volume of information is rapidly being dwarfed by the huge volume of Web site utilization data being collected on Web servers around the world.

If you recognize the value of making this data more accessible to your organization, you should be thinking about looking into building a text mining environment. Text mining can be used in a wide range of endeavors, such as customer relationship management, knowledge management, help desk best practices, fraud detection, or competitive intelligence.

There are a few points you should think about if you are exploring the creation of a text mining application.

First, you need to define a mechanism for extracting the information from its sources and consolidating it in a centralized location. Building a text repository can help solve several problems. It provides a means for capturing the documents in a central location managed by IT professionals, rather than leaving them to the vagaries of end users who don’t back up their files. It also provides a location where end users can go to locate information. Most good text repository solutions provide a mechanism for indexing and searching for text strings.
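The indexing-and-search capability at the heart of a repository can be sketched as a simple inverted index. The document IDs, sample text, and function names below are hypothetical, and commercial repository products add stemming, phrase queries, and scalability far beyond this:

```python
from collections import defaultdict

def build_index(docs):
    """Map each word to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word.strip(".,")].add(doc_id)
    return index

def search(index, term):
    """Return the IDs of documents containing the search term."""
    return index.get(term.lower(), set())

docs = {
    "memo-001": "Engineering change order for the Q3 proposal",
    "memo-002": "White paper on data warehousing",
}
index = build_index(docs)
print(search(index, "proposal"))  # IDs of documents matching the term
```

Building the index once and reusing it for every query is what makes centralized search fast: each lookup is a dictionary access rather than a scan of every file on every C drive.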

You should also consider how the text repository will work with your corporate security infrastructure. The repository’s security should be consistent with your company’s security environment. Users should only be able to access documents they are authorized to see, based on their roles and responsibilities within the organization.
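A role-based access check of this kind can be sketched in a few lines. The role names, document names, and mapping below are invented for illustration; a real repository would delegate this decision to the corporate directory or security service rather than keep its own table:

```python
# Hypothetical role-to-document mapping; in practice this would come
# from the company's existing security infrastructure.
ACCESS = {
    "finance": {"budget.xls", "forecast.doc"},
    "engineering": {"change-order.doc", "spec.doc"},
}

def can_read(role, document):
    """A user may read a document only if their role grants access to it."""
    return document in ACCESS.get(role, set())
```

The key point is that the repository enforces the same answer the rest of the security environment would give, so centralizing the documents does not widen who can see them.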

Once you have the documents in a repository, you can consider how you’re going to enable users to locate vital information within the repository.

One approach is text categorization. There are tools designed to assign documents to one or more pre-defined categories, a structure often called a taxonomy. This approach is useful in environments where relatively few business rules need to be applied, or where there is not a great deal of complexity in the variety of documents. Since these tools rely to some degree on natural language processing, they tend to bog down if the data is too varied.

Another approach is clustering. These tools group documents into related clusters without relying on predefined categories, and often provide a visual map of the collection with links between related documents. While these tools make it easy to browse through a collection, they can produce clusters that mix unrelated documents, and a particular topic may pop up in multiple clusters.
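The clustering idea can be sketched with a greedy word-overlap grouping. The similarity measure (Jaccard overlap), the threshold, and the single-pass strategy are simplifying assumptions; commercial clustering tools use far more sophisticated linguistics and link analysis:

```python
def jaccard(a, b):
    """Word-overlap similarity between two documents (0.0 to 1.0)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def cluster(docs, threshold=0.3):
    """Greedy single-link clustering: a document joins the first
    cluster that already holds a sufficiently similar document,
    otherwise it starts a cluster of its own."""
    clusters = []
    for doc in docs:
        for group in clusters:
            if any(jaccard(doc, other) >= threshold for other in group):
                group.append(doc)
                break
        else:
            clusters.append([doc])
    return clusters
```

Because no categories are defined in advance, the groupings fall out of the documents themselves, which is both the appeal of clustering and the source of the messiness described above.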

While there are other techniques that are beyond the scope of this column, it’s important to understand that text mining is a very young, imperfect science. The ability of computers to parse and analyze unstructured data is limited. While there is much ongoing research in the fields of neural nets, semantic analysis, and natural language processing, much work remains before we can simply enter a phrase and have the system find exactly what we’re looking for. The good news is that there have been enough success stories that, for some organizations, an investment in a pilot project targeted at a specific application may be called for.

--Robert Craig is vice president of strategic marketing at Viador Inc. (Burlington, Mass.), and a former director at the Hurwitz Group Inc. Contact him at robert.craig@viador.com.