In-Depth

Analysis: Flexible Text Mining for Science and Industry

Keeping up on trends, warnings, threats, and insights within unstructured text is a daunting task. Scientific researchers have the toughest job of all, and some have turned to flexible text analytics.

Most people in the food industry missed the first warning. Scientists had published a discovery in 2000 about a carcinogen known as acrylamide, which can develop in starch-rich foods such as potatoes as they are fried. By the time that warning finally hit the public media in 2002, millions of people had been frightened, perhaps unnecessarily.

That's the bad news. The good news is that text analytics is becoming easier to use. One sign of that is last month's announcement by a Netherlands-based risk management firm, TNO, that it had adopted a new, flexible "semantic knowledge discovery platform": I2E, from UK-based Linguamatics. TNO hopes that I2E will help clients in the food industry keep their own potatoes out of the fryer. With a small leap of imagination, you can see how detecting early signals from outside could help other sensitive industries.

Keeping up is daunting. "If you have 20 million articles to read, where do you start?" asks William Hayes, director of library and literature informatics at pharmaceutical research company Biogen Idec. He's used I2E for about six years.

"The research industry works under a tougher knowledge model than terrorist intelligence gathering," said Hayes. "Our ability to tap that ocean of literature is like dropping a line into the ocean for fish." Yet that ocean of literature is the source of most new ideas and industry information.

In general, a scientist can read 150 to 200 full text journal articles a year, explained Hayes. A curator can review about 100 abstracts a day "for a few days before you start going nuts." Text mining is the only way to keep up with the ocean of literature produced each year.

Standard text mining is worthwhile for what Hayes calls "high-value questions." It's a heavy-duty tool, and the task is usually outsourced. It requires tightly focused search patterns that are usually strung together and run in batches of 60 to 100 at a time. If you want to adjust the search, that's more work. The queries are hard to reorient.

Linguamatics' agile natural language processing is much easier. Users can fine-tune on the fly, said Hayes. He can train new users in two to six hours. "If you can remember bits of grammar and have some concept of what you're researching, it's a piece of cake."

Queries parse unstructured data within the specified source to identify grammatical parts, such as a noun or noun phrase. Queries also identify relationships between parts. The tool can then extract bits of information -- instead of simply identifying a document containing given keywords, as Google does.

Also unlike a search engine, the tool offers linguistic wild cards, which let users specify the function a term plays within a phrase without pinning down its exact meaning. For example, a query can ask for any safety concern as long as it's about food toxicity.
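To make the difference between extraction and keyword search concrete, here is a minimal sketch in Python. It is not I2E's query language or API, which this article does not describe. The term lists, the pattern, and the sample sentences are all hypothetical stand-ins, and a plain regular expression takes the place of real grammatical parsing. The idea it illustrates is the wild card: the query matches any member of a "concern" class linked to any member of a "substance" class, and returns the matched pieces rather than just the document.

```python
import re

# Hypothetical term classes standing in for a real vocabulary or ontology.
SUBSTANCES = ["acrylamide", "aflatoxin", "melamine"]
CONCERNS = ["safety concern", "health risk", "toxicity warning"]


def class_pattern(terms):
    """Build a capturing regex group that matches any member of a term class."""
    ordered = sorted(terms, key=len, reverse=True)  # prefer longer terms first
    return "(" + "|".join(re.escape(t) for t in ordered) + ")"


# "<any concern> ... about/over/regarding ... <any substance>", kept within
# one sentence by refusing to cross a period. An assumed toy pattern only.
PATTERN = re.compile(
    class_pattern(CONCERNS)
    + r"\b[^.]*?\b(?:about|over|regarding)\b[^.]*?\b"
    + class_pattern(SUBSTANCES)
    + r"\b",
    re.IGNORECASE,
)


def extract(text):
    """Return one row per matched relationship, not one hit per document."""
    return [
        {"substance": m.group(2).lower(), "concern": m.group(1).lower()}
        for m in PATTERN.finditer(text)
    ]


sample = (
    "Researchers raised a safety concern about acrylamide in fried potatoes. "
    "Acrylamide was also mentioned in a cooking column. "
    "Regulators flagged a health risk regarding melamine in milk powder."
)
rows = extract(sample)
for row in rows:
    print(row["substance"], "|", row["concern"])
```

A keyword search for "acrylamide" would return all three sentences' worth of documents, including the cooking column. The sketch returns only the two sentences in which a concern is actually tied to a substance, already split into columns.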

Results show up, sorted and parsed, in a table or network graph. One screenshot of results shows a neat table: pharmacologic substances in the first column; type of concern, such as "safety concern," in the second; tissue type in the third; dosage in the fourth; and so on.

Besides pulling out the "fish" you asked for, the tool can perform several other useful tricks, such as discovering networks of influence. For example, before you form a partnership, it might help to know who your prospective partner's other partners are.
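The network idea can be sketched the same way. Assuming a text-mining pass has already produced pairs of organizations that are reported as partners, a few lines of Python can answer the "who else" question. The company names and pairs below are invented placeholders, and the function names are hypothetical; nothing here reflects how I2E itself builds or displays its network graphs.

```python
from collections import defaultdict

# Hypothetical (organization, partner) pairs, as if extracted from text.
pairs = [
    ("Company A", "Company B"),
    ("Company B", "Company C"),
    ("Company B", "Company D"),
    ("Company C", "Company D"),
]


def build_network(pairs):
    """Build an undirected partnership graph as a dict of neighbor sets."""
    graph = defaultdict(set)
    for a, b in pairs:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def other_partners(graph, prospect, me):
    """List the prospect's partners, excluding the organization asking."""
    return sorted(graph.get(prospect, set()) - {me})


network = build_network(pairs)
partners_of_b = other_partners(network, "Company B", "Company A")
print(partners_of_b)
```

Here Company A, weighing a deal with Company B, would learn that B already works with Company C and Company D.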

"If information is available, we have a better chance of finding it than anyone in the world," said Hayes, "because we can use that driftnet against the sea of literature." Response from research scientists he serves has been strong, and they ask harder and harder questions. "We're answering questions that other libraries won't consider. We do it in days or weeks where for them it would take years. That's our success."

No business intelligence tool could claim any better results than that.

About the Author

Ted Cuzzillo, CBIP, is a freelance writer based in the San Francisco area. He can be reached at [email protected].
