In-Depth
Issues and Techniques in Text Analytics Implementation, Part 2 of 2
How to streamline the information extraction process
by Victoria Loewengart
Last week, in the first part of this two-part series, we discussed the potential pitfalls of configuring the information extraction process, including document formatting and improving precision and recall. This week we continue with the fine-tuning of precision and recall. We will also discuss some post-processing and integration issues.
Dictionary Tagging
Sometimes it is necessary to tag only the entities that match specific data from a repository of known entities and nothing else. This approach, known as dictionary tagging, yields extraction with high precision but not necessarily high recall.
Tagging entities that are already known and recorded somewhere helps analysts identify documents rich in specific information. When entering data into a database from a document, it helps to know which entities already exist in the database to avoid entering them multiple times. Dictionary tagging can also be used to establish new relationships among the known entities.
One approach to dictionary tagging is to use lexicons created by exporting data from existing repositories, such as Excel spreadsheets or database tables.
Exporting names of people and organizations from a database presents unique problems. People’s names are usually stored in a specific format or order, such as last name, first name, middle initial. However, these names may not appear in a text document in this way. Care must be taken to use entity extraction rules in conjunction with the dictionary list; otherwise, many erroneous “hits” and “misses” may occur.
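To illustrate the name-order problem, here is a minimal sketch in Python. It assumes the database exports names as "Last, First M." and expands each entry into a few forms that may appear in running text before matching. The function names (name_variants, build_lexicon, tag) are hypothetical and do not come from any particular extraction product; a real system would combine such a list with the engine's own entity extraction rules.

```python
import re

def name_variants(db_name):
    """Expand a 'Last, First M.' database entry into forms likely to appear in text."""
    last, _, rest = db_name.partition(",")
    last = last.strip()
    parts = rest.split()
    if not parts:
        return {last}
    first = parts[0]
    middle = parts[1] if len(parts) > 1 else ""
    variants = {f"{first} {last}", f"{last}, {first}"}
    if middle:
        variants.add(f"{first} {middle} {last}")
        variants.add(f"{last}, {first} {middle}")
    return variants

def build_lexicon(db_names):
    """Map each lower-cased variant back to the original database entry."""
    lexicon = {}
    for db_name in db_names:
        for variant in name_variants(db_name):
            lexicon[variant.lower()] = db_name
    return lexicon

def tag(text, lexicon):
    """Return sorted (start, end, matched_text, db_name) tuples.

    Longer variants are tried first so that 'John Q. Smith' is not
    also reported as a partial, overlapping match.
    """
    hits = []
    taken = []  # character spans already claimed by a longer variant
    for variant in sorted(lexicon, key=len, reverse=True):
        pattern = r"(?<!\w)" + re.escape(variant) + r"(?!\w)"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            if any(m.start() < end and start < m.end() for start, end in taken):
                continue
            taken.append((m.start(), m.end()))
            hits.append((m.start(), m.end(), m.group(0), lexicon[variant]))
    return sorted(hits)

if __name__ == "__main__":
    lexicon = build_lexicon(["Smith, John Q."])
    for hit in tag("Yesterday John Q. Smith met John Smith.", lexicon):
        print(hit)
```

Matching the raw export alone ("Smith, John Q.") would miss both mentions in the sample sentence. The sketch covers only a handful of orderings; nicknames, titles, and initials without periods would still need extraction rules.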
Thesauri
A properly built thesaurus (or synonym list) can increase the precision and recall of the extracted information. Most information extraction engines can use well-tuned internal thesauri that encapsulate common words.
For a specialized system, however, the analysts may want to use a specialized thesaurus, such as a thesaurus of country names and acronyms. To use an external thesaurus effectively, take care not to define the same synonym for multiple head terms. Beware of using short abbreviations in a thesaurus. Anything of four characters or fewer is a candidate for overlapping. For example, CA may be a chemical (Calcium) or a state (California).
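Both checks can be run over an external thesaurus before it is loaded. The following is a sketch only, assuming the thesaurus is available as a mapping from head term to a list of synonyms and that matching is case-insensitive; the function name check_thesaurus and the sample entries are hypothetical.

```python
from collections import defaultdict

def check_thesaurus(thesaurus, max_short_length=4):
    """Report synonyms shared by several head terms and risky short synonyms.

    thesaurus: dict mapping head term -> list of synonyms.
    Returns (conflicts, short_terms), where conflicts maps a case-folded
    synonym to the sorted list of head terms that claim it.
    """
    heads_by_synonym = defaultdict(set)
    for head, synonyms in thesaurus.items():
        for synonym in synonyms:
            # Case-folding assumes the engine matches case-insensitively.
            heads_by_synonym[synonym.casefold()].add(head)
    conflicts = {
        synonym: sorted(heads)
        for synonym, heads in heads_by_synonym.items()
        if len(heads) > 1
    }
    short_terms = sorted(
        synonym for synonym in heads_by_synonym
        if len(synonym) <= max_short_length
    )
    return conflicts, short_terms

if __name__ == "__main__":
    sample = {
        "California": ["CA", "Calif."],
        "Calcium": ["Ca"],
        "United States": ["USA", "U.S.", "United States of America"],
    }
    conflicts, short_terms = check_thesaurus(sample)
    print("Shared synonyms:", conflicts)
    print("Short synonyms to review:", short_terms)
```

A flagged entry is not necessarily wrong. The report is simply a review list for the analyst who maintains the thesaurus.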
Post-Processing Issues
The final step in the information extraction workflow is production of an end product that meets the users’ requirements, such as a report, a diagram, or a set of data imported into a database. The end users must have control in customizing their final product.
Even with all of the pitfalls considered and the best efforts made to remedy them, end users (analysts) always need to add, remove, or correct entities and facts. A text analytic software suite should provide a manual tagging capability (for example, the ability to visually “un-tag” or “un-highlight” entities and facts that are not wanted in the final product).
The system engineer must plan for the manual tagging capabilities PRIOR to implementation of the entire information extraction workflow, because it will affect the choice of information extraction software and supporting tools. Additional software tools may be needed to import the results of the information extraction process into an editor so users can modify results visually by highlighting/un-highlighting entities and facts, diagramming, or other means.
Also, serious consideration must be given to integrating the text analytics software with other information management tools. Information extraction is most effective when used in conjunction with other applications. For example, the results of tagging could be visualized as a link diagram among the entities. Sometimes it is desirable to import tagged entities into a database.
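As an illustration of the link diagram idea, the sketch below derives weighted links from tagged entities. It assumes that two entities are linked whenever they are tagged in the same document, which is only one possible definition of a link, and that the tagging results are already available as a list of entity names per document. The function name cooccurrence_edges is hypothetical.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(entities_by_doc):
    """Count, for each pair of entities, the documents in which both are tagged.

    entities_by_doc: dict mapping document id -> list of tagged entity names.
    Returns a Counter keyed by (entity_a, entity_b) with entity_a < entity_b.
    """
    edges = Counter()
    for entities in entities_by_doc.values():
        # De-duplicate within a document and sort so each pair has one key.
        for a, b in combinations(sorted(set(entities)), 2):
            edges[(a, b)] += 1
    return edges

if __name__ == "__main__":
    tagged = {
        "doc-1": ["John Smith", "Acme Corp", "Ohio"],
        "doc-2": ["Acme Corp", "John Smith"],
    }
    for (a, b), weight in sorted(cooccurrence_edges(tagged).items()):
        print(f"{a} -- {b} (documents: {weight})")
```

The resulting pairs and weights can be handed to whatever diagramming tool the end users prefer.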
Most text analytics tools provide visualization applications and database utilities; however, end users often prefer their own tools and familiar third-party applications. The text analytics software must therefore produce XML output with tagged entities, and the XML schema must be well documented so that the output can be transformed for use by other applications.
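A short sketch shows why a documented schema matters. The XML layout below (a document element with an id attribute and entity elements with a type attribute) is an assumption made for illustration, not the output format of any specific product. With the schema known, a few lines of Python turn the tagged output into rows that a spreadsheet or database loader can accept.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical tagged output; real element and attribute names
# depend on the vendor's documented schema.
SAMPLE = """<document id="doc-1">
  <sentence>
    <entity type="PERSON">John Smith</entity> joined
    <entity type="ORGANIZATION">Acme Corp</entity> in
    <entity type="LOCATION">Ohio</entity>.
  </sentence>
</document>"""

def entities_to_rows(xml_text):
    """Return (doc_id, entity_type, entity_text) tuples in document order."""
    root = ET.fromstring(xml_text)
    doc_id = root.get("id", "")
    return [
        (doc_id, element.get("type", ""), "".join(element.itertext()).strip())
        for element in root.iter("entity")
    ]

def rows_to_csv(rows):
    """Serialize the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["doc_id", "entity_type", "text"])
    writer.writerows(rows)
    return buffer.getvalue()

if __name__ == "__main__":
    print(rows_to_csv(entities_to_rows(SAMPLE)))
```

If the schema were undocumented, the element and attribute names used here would have to be guessed from sample output, which is fragile.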
Conclusion
In setting up an information extraction process, it is difficult to get it right the first time. As with any complex system, there are always unforeseen pitfalls and problems. With careful planning it is possible to avoid some of them. The information extraction process must evolve over time with adjustments and tuning until it meets end users’ productivity needs.
- - -
Victoria Loewengart is the principal research scientist at Battelle Memorial Institute where she researches and implements new technologies and methods to enhance analyst/system effectiveness. You can reach her at [email protected]