Q&A: Text Mining and Analytics Draw Interest
Text mining can turn text into gold
Text mining and analytics are drawing increasing interest as the sheer volume of unstructured text in companies grows. Text mining technology goes beyond simply searching documents by looking at their linguistic structure. It can be used to pull relevant, structured facts and relationships from large volumes of text.
Linguamatics, a privately owned, fast-growing company based in Cambridge, UK, with U.S. regional headquarters in Newton, MA, is well known in the pharmaceutical industry -- where huge volumes of unstructured data and complex terminologies have long been a challenge -- for its ability to provide intelligent answers from text in real time. As text mining expands, we asked the firm’s CTO, David Milward, for insights into how the technology can impact all sorts of businesses.
Milward is a co-founder of Linguamatics and has more than 20 years of experience in product development, consultancy, and research in natural language processing. His areas of expertise include information extraction, spoken dialogue, parsing, syntax, and semantics.
BI This Week: What exactly do you mean by the terms text mining and text analytics?
David Milward: The objective of text mining is to uncover high-quality knowledge from unstructured text, then use that knowledge to drive decision making. The approach goes beyond document search by using the linguistic structure of the text to get to its meaning. Unstructured text is converted to structured representations, which summarize the query results and give new insights. Typically, the structured summaries retain their links to the original text, which allows users to drill down to examine evidence when needed. Also, the structured output can then be fed into further analytics or a database.
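As a rough illustration of the idea described above (not Linguamatics' actual method), the following Python sketch turns sentences into structured records that keep a link back to the original text. The term lists, the "is associated with" pattern, and the names `Fact` and `extract_facts` are all hypothetical, chosen only for this example; a real system would rely on fuller linguistic analysis than a single regular expression.

```python
import re
from dataclasses import dataclass

# Hypothetical vocabularies, used only for illustration.
BIOMARKERS = ["MarkerA", "MarkerB"]
DISEASES = ["disease Y", "disease Z"]

# Assumed surface pattern: "<biomarker> is associated with <disease>".
PATTERN = re.compile(
    r"(?P<subject>{})\s+is associated with\s+(?P<obj>{})".format(
        "|".join(map(re.escape, BIOMARKERS)),
        "|".join(map(re.escape, DISEASES)),
    )
)


@dataclass
class Fact:
    subject: str
    relation: str
    obj: str
    doc_id: str  # which document the evidence came from
    start: int   # character offset where the evidence begins
    end: int     # character offset where the evidence ends


def extract_facts(doc_id, text):
    """Return one structured Fact per pattern match, with evidence offsets."""
    return [
        Fact(m.group("subject"), "associated_with", m.group("obj"),
             doc_id, m.start(), m.end())
        for m in PATTERN.finditer(text)
    ]
```

Because each record stores the document identifier and character offsets, a user can drill down from a summary row to the exact sentence fragment that supports it, and the records themselves can be loaded into a database or passed to further analytics.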
How is text mining typically used?
Text mining allows users to ask questions directly of high volumes of data, such as “What is market sentiment about product x?”, or, in the pharmaceutical domain, “What biomarkers are associated with disease y?” It can also provide new kinds of insight, such as finding indirect relationships (for example, a relationship between A and B and one between B and C could indicate a potential relationship between A and C), and identifying weak signals and patterns that can be detected only through large-scale analysis.
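The indirect-relationship idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: relationships are treated as undirected pairs that have already been extracted from text, and the function name `indirect_links` is hypothetical. It proposes A and C as a candidate pair whenever both are linked to some B but not directly to each other.

```python
from collections import defaultdict
from itertools import combinations


def indirect_links(pairs):
    """Map each candidate pair (a, c) to the set of terms b that bridge them.

    `pairs` is an iterable of (x, y) relationships already found in text.
    A pair (a, c) is a candidate only if no direct a-c relationship exists.
    """
    pairs = list(pairs)
    neighbors = defaultdict(set)
    for a, b in pairs:
        neighbors[a].add(b)
        neighbors[b].add(a)

    direct = {frozenset(p) for p in pairs}
    candidates = defaultdict(set)
    for bridge, linked in neighbors.items():
        # Every two terms sharing this bridge are potentially related.
        for a, c in combinations(sorted(linked), 2):
            if frozenset((a, c)) not in direct:
                candidates[(a, c)].add(bridge)
    return dict(candidates)
```

For instance, `indirect_links([("A", "B"), ("B", "C")])` yields a single candidate, A and C, bridged by B. In practice such candidates are hypotheses to be checked against the evidence, not established facts.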
Why has dealing with unstructured data proven to be difficult so far?
Extracting meaning from large volumes of text has been difficult because of both the sheer amount of data and the unstructured nature of the text. However, advances in text mining technology and in computing speed mean that comprehensive linguistic analysis can now be done at a highly useful speed and scale.
A further barrier in the past was that only programmers could adapt text mining engines for specific uses. As a result, queries tended to be hard-coded and written by IT experts, which left information consumers unable to ask and answer new questions as they arose.
Why is there so much interest in it recently?
The problem of handling vast amounts of text information is not going away. Organizations realize that there is huge value stored in various text sources, information that can be extremely useful both for research and for business decisions. Companies need ways to unlock that value more effectively than their competitors.
Text mining offers an efficient method of filtering this data, allowing organizations to exploit information not only from internal documents but also from the huge body of external information. That can include published literature, blogs, news feeds, and social media sites where consumers routinely leave text comments that are rich with product information.
Another reason for greater interest is the large number of successful case studies of automated text mining. These validate the technology across different industries, which encourages adoption by the mainstream majority.
How does text mining tie in to business intelligence?
Because text mining converts unstructured text into structured output, its results can feed into further analytics or be combined with the results of other data analyses. This enables delivery of comprehensive, high-quality text mining results as part of systematic and reproducible workflows.
We’re already seeing text mining being adopted for a range of business applications. In the pharmaceutical industry, we’re seeing uses such as competitive intelligence, influence networks, sentiment, and patent analysis.
How is text mining being used in business? You mentioned some solid case studies -- what sectors are adopting it and why?
In our experience, text mining is growing across a wide range of business applications. One particular area that has grown rapidly in the last couple of years is sentiment analysis, which aims to analyze people’s attitude to a particular product, company, or service. That can include automatic processing of customer comments. The insights gained can play a key role in informing marketing strategy and improving customer satisfaction.
Another example is in research-intensive industries such as the pharmaceutical industry. Here the initial focus was exploiting published scientific papers to drive R&D decision-making. Having proved its value there, text mining is now being deployed much more widely, in areas including drug safety, clinical trials, and marketing applications.
We’re also experiencing increased interest in embedding text mining into other processes, such as workflow platforms and collaboration tools like Microsoft SharePoint.
There's great interest in analyzing textual data collected from social networking sites; can text analytics help? How does it tie into social networking and data collection/analysis issues?
The huge growth of social media is prompting organizations to investigate how they can best use this data to help their businesses. Again, effective filtering is critical to extract useful information from what would otherwise be very noisy data. Flexible natural language processing-based query strategies combined with easy adaptation of vocabularies are key to dealing with this type of text.
For example, micro-blogs include shortened forms of words, so adopting a data-driven approach may be necessary to find the actual vocabulary used, and then to use the derived vocabulary to optimize the querying.
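One very simple way to picture such a data-driven approach is sketched below in Python. It is an assumption-laden illustration, not a description of any product: it counts the tokens actually used in a set of posts and proposes a token as a shortened form of a known term when it is frequent enough, starts with the same letter, and its letters appear in order within the term. The function names, the thresholds, and the matching heuristic are all hypothetical.

```python
import re
from collections import Counter


def is_subsequence(short, full):
    """True if the characters of `short` appear in `full` in the same order."""
    remaining = iter(full)
    return all(ch in remaining for ch in short)


def derive_variants(posts, known_terms, min_count=2):
    """Propose shortened forms seen in `posts` for each term in `known_terms`."""
    tokens = Counter(re.findall(r"[a-z0-9']+", " ".join(posts).lower()))
    known = set(known_terms)
    variants = {term: [] for term in known_terms}
    for token, count in tokens.items():
        # Skip known terms, rare tokens, and tokens too short to be reliable.
        if token in known or count < min_count or len(token) < 3:
            continue
        for term in known_terms:
            if (token[0] == term[0] and len(token) < len(term)
                    and is_subsequence(token, term)):
                variants[term].append(token)
    return variants
```

The derived variants could then be added to the query vocabulary so that a search for a term also matches its shortened forms. A heuristic this crude will produce false matches, so the proposed variants would normally be reviewed before use.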
Where do you see the text mining and analytics market heading? How might this technology be used in another two to three years?
Customers are still searching for a single interface to access corporate knowledge. We’re continuing to see some data silos, but in the next several years, we anticipate that there will be better access to unstructured data across different data sources. For example, major publishers are starting to provide APIs to access their content through text mining tools.
We also expect that much more text analytics will be done “behind the scenes,” feeding the results into other processes. Users can then access the information as they would structured data, largely unaware that text mining technologies were used first to analyze and structure the text for them. This is something we’re already seeing with some BI vendors.
The text mining market itself is experiencing considerable growth, even in difficult economic times. That should continue as the amount of unstructured data continues to explode. The role for hosted solutions is also likely to grow, as demand increases from smaller organizations.
What does Linguamatics bring to the market?
Linguamatics’ edge is proven, agile text mining technology. Validated within the pharma/biotech sector, the I2E semantic knowledge-discovery platform enables companies to solve high-value knowledge-discovery problems, such as looking for potential safety issues with a particular treatment or finding repurposing opportunities for a compound nearing the end of its patent life.
I2E uses a combination of natural language processing and other methods such as regular expressions to identify meaning. The technology can be applied to a wide range of problems, and adapted to new areas by plugging in appropriate terminologies. I2E can also help create these if they’re not readily available.