Managing and mining corporate information is a huge challenge...but there's light at the end of the tunnel.
There's a goldmine of underused information resources in most organizations. Buzzword after buzzword has promised to help exploit it … and largely failed.
Document management gave way to groupware and document repositories. With the Web came content management, intranets that gave Web-based content value as an internal resource, and later portals. Knowledge management, nebulous but promising, went in and out of vogue as a philosophy, if not a tool, to bring people and documents together in a sort of intellectual nirvana.
More recently, technologists seem to treat corporate data as an Internet-like informational Wild West, employing similar search and directory tools to find relevant needles in the digital haystack and put them close at hand for employees and customers.
They may be on the right track. Vendors of industrial-strength search engines, the critical component in such a solution, speak of "crawling" corporate repositories, and of manual classification—a task literally performed by library-science types hired by the MIS department to make sense of content categories, like so many Yahoo! staffers toiling to maintain the Web's best-known directory.
It's been the answer for Export.gov, a U.S. Department of Commerce Web site that recombines information from government Web sites, Lotus Notes databases and many other sources to present accurate information on export laws and procedures. And it helped Stanford University save time and money combing through mountains of medical and scientific information.
These enterprise search and categorization engines are the crux of the new knowledge systems, with established names like Autonomy Corp., Inktomi Corp. and Verity Inc., and newcomers such as Stratify Inc., first organizing information, then helping people find it. Typically, corporations use these tools alongside—or built into—corporate portals powered by software from the likes of Plumtree Software Inc. and IBM Corp. And though the portal and search vendors seem to have cozier relationships than the Bush Administration and Enron Corp., the two play distinct roles: One provides the portal itself, the other the infrastructure behind it.
Of course, there's one crucial difference from the Web: Corporate information doesn't reside solely in billions of HTML pages exposing themselves promiscuously, but in assorted repositories and file formats. So integration becomes the other key piece of this knowledge-management puzzle. Helpfully, the portal and search vendors have taken on the problem themselves, providing out-of-the box interfaces to Oracle and other corporate mainstays, and exposing their own application programming interfaces (APIs) built on popular Java and Windows-based development tools. Verity claims to support 250 file formats.
Many corporations' informational worlds are also inhabited, however, by important island nations known as document repositories. Many have Lotus Notes repositories scattered about, or not long ago made an investment in dedicated servers running a Documentum Inc. product. So portal and search vendors are designing links into these, too.
Drilldown to Architecture
The ingredients for efficiently managing knowledge repositories follow a seemingly simple recipe:
- Combine structured and unstructured data sources into a searchable "virtual" data store.
- Perk things up with content semi-awareness, lifting the repository into the realm of knowledge management.
- Buy an enterprise-class search engine, and pair it with a portal to present the results. If you don't have a portal lying around, use the one that probably comes with your search engine.
- Make sure the engine you buy has either prefab hooks into your biggest data stores, or industry standard ones like ODBC and XML. If it doesn't, ask the vendor to write one for you, or consider a different vendor who will. (Alternatively, do the integration in-house or with a consultant.)
- Finally, if you want real knowledge management and not simply a bigger repository you'll just regret later, make sure the portal or search engine can categorize people—not just databases and document repositories—as informational resources, drawing further inferences from their daily interactions.
For the last part, you might need to substitute a product that specializes, at least in part, in sifting through e-mail and instant messages—Tacit Knowledge Systems Inc.'s KnowledgeMail, for example, or Lotus Development Corp.'s Discovery Server.
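The first and third steps of the recipe can be sketched in miniature: a toy "virtual" data store that folds unstructured documents and structured database rows into a single searchable inverted index. Every name here—the functions, the sample sources—is invented for illustration and bears no relation to any vendor's actual API.

```python
# Illustrative sketch only: merge unstructured documents and structured
# rows into one inverted index, so a single query spans both worlds.

def tokenize(text):
    # Crude tokenizer: lowercase words, punctuation stripped.
    return [w.strip('.,').lower() for w in text.split() if w.strip('.,')]

def build_index(unstructured_docs, structured_rows):
    """Map each term to the set of source IDs containing it."""
    index = {}
    # Unstructured sources: whole documents, keyed by file name.
    for doc_id, text in unstructured_docs.items():
        for term in tokenize(text):
            index.setdefault(term, set()).add(doc_id)
    # Structured sources: flatten each row's fields into text.
    for row_id, fields in structured_rows.items():
        for term in tokenize(" ".join(str(v) for v in fields.values())):
            index.setdefault(term, set()).add(row_id)
    return index

docs = {"memo.doc": "Export rules for medical devices"}
rows = {"crm:42": {"customer": "Acme", "note": "asked about export permits"}}
index = build_index(docs, rows)
print(sorted(index["export"]))  # one query surfaces both sources
```

A single lookup on "export" now returns both the Word memo and the CRM row—the "virtual data store" of step 1 in its smallest possible form.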
While it sounds easy to the uninitiated, steps 1 through 5 can be a recipe for disaster without careful planning. With unstructured information residing in everything from Microsoft Word documents, Excel spreadsheets, and Exchange servers to intranet and e-commerce pages, you need a common structure for the central location where all the information will be managed.
That structure is almost always a metadata index—a hierarchical list of terms and relationships that describe all the data—and the place it resides is a dedicated server running a standard relational database such as Oracle 9i or Microsoft SQL Server 2000. The major search vendors all use metadata indexes to organize searches (even if they don't call them that), just as Web crawlers like AltaVista do.
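In miniature, such a metadata index might look like the sketch below: a term hierarchy plus a document-to-category map, held in ordinary relational tables. The table and column names are invented for illustration; real products use far richer schemas.

```python
# Hedged sketch of a metadata index in a relational database:
# a category hierarchy (each term points at its parent) and a
# table filing documents under categories.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE category (id INTEGER PRIMARY KEY, term TEXT, parent_id INTEGER);
    CREATE TABLE doc_category (doc TEXT, category_id INTEGER);
""")
db.executemany("INSERT INTO category VALUES (?, ?, ?)", [
    (1, "exports", None),         # top-level node
    (2, "medical devices", 1),    # child of "exports"
])
db.execute("INSERT INTO doc_category VALUES ('memo.doc', 2)")

# "Searching is essentially navigation": descend the hierarchy from
# "exports", then pull the documents filed under its children.
docs = db.execute("""
    SELECT doc FROM doc_category
    JOIN category ON category.id = doc_category.category_id
    WHERE category.parent_id = 1
""").fetchall()
print(docs)  # [('memo.doc',)]
```

The point of the hierarchy is that a search becomes a walk down the tree rather than a scan of the raw content—which is exactly why the quality of the hierarchy matters so much.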
The enterprise search vendors compete on the clever algorithms and computer science theories they apply to building and constantly updating the most useful possible hierarchy, then classifying existing and newly created documents, text files, and data streams so they fit optimally inside it.
Such seemingly arcane distinctions as the proper use of Bayesian inference theory or neural networks are important because the indexing and classification functions they support are what determines how complete the information repository is and how accurately it reflects your business. It's the thing that brings a database-like schema to unstructured data. "Now searching is essentially navigation," says Pandu Nayak, Stratify's chief architect.
Competitor Autonomy, for example, prides itself on the results achieved from combining Bayesian inference and Shannon's information theory, among several techniques. "Effectively, it lets you read through a piece of unstructured information and figure out which concepts within it are strong, and distinguish it from other documents," says Stouffer Egan, Autonomy's U.S. general manager.
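Autonomy doesn't publish its algorithms, but the flavor of the idea can be sketched: weight each term by how much more frequent it is in a given document than in language at large, an information-theoretic measure of which concepts are "strong." This toy scoring function is an assumption-laden stand-in, not the vendor's actual method.

```python
# Illustrative only: rank a document's "strong" concepts by comparing
# in-document frequency against a background corpus. This is a generic
# log-likelihood weighting, not Autonomy's proprietary algorithm.
import math

def concept_strengths(doc_terms, background_freq, total_background):
    """Score each term as log(P(term | doc) / P(term)): terms far more
    common here than in general carry the most information about what
    the document is about."""
    scores = {}
    n = len(doc_terms)
    for term in set(doc_terms):
        p_doc = doc_terms.count(term) / n
        p_bg = background_freq.get(term, 1) / total_background
        scores[term] = math.log(p_doc / p_bg)
    return scores

# Hypothetical background counts out of 100,000 corpus words.
background = {"the": 50000, "export": 40, "controls": 60}
doc = ["the", "export", "controls", "export"]
scores = concept_strengths(doc, background, 100000)
top = max(scores, key=scores.get)
print(top)  # the rare-but-repeated term wins; common words score low
```

Note how "the," despite being the document's joint-most-frequent word, scores lowest: it is no more common here than anywhere else, so it distinguishes nothing.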
Besides bringing structure to information that lacks it, index-based classification plays a unifying role. "Classification is really an important tool that allows you to bring structured and unstructured data together," says Rajat Mukherjee, Verity's principal architect. He says further mapping between structured and unstructured data comes from index tables that can correspond to columns in relational databases.
The metadata approach also has performance advantages. By storing a compressed representation of structured information on a dedicated server running Verity's K2 Enterprise, IT departments can off-load the additional overhead that would be generated by directing queries straight through to Oracle and other databases, Mukherjee claims.
Some Assembly Required
The success of these automated techniques also determines how much manual intervention is required to massage the data, and almost all the vendors say the best results involve some combination, though they disagree on the amounts. Autonomy, for example, brags about how automatic its engine is, while competitors like Stratify and Verity call it a less successful, black-box solution.
The tradeoff comes down to this: Manual processes bring the human element into hierarchy building and meta-tagging of documents (the techniques favored by many content-management products), and that's both their strength and weakness. Individual people can understand the meaning of the words in the document, but in groups, they'll disagree or have different interpretations, producing an unreliable schema for others to use. Automation adds regularity to the process. Still, "the hierarchy produced by such an automated method might not be the absolute best hierarchy from a human perspective," says Nayak.
The whole process can be summarized as organize-classify-present, with the presentation step taking place not simply in a portal screen, but in new applications that run in it, Nayak says. Call-center logs saved in Siebel can be analyzed by the categorization engine, then fed back over an API to the customer-service department's portal, providing an ever-growing knowledge base to draw on to answer future inquiries.
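The organize-classify-present loop Nayak describes can be sketched as follows. The keyword rules here are a deliberately crude stand-in for a real categorization engine, and the function names are invented; the shape of the loop, not the classifier, is the point.

```python
# Hedged sketch of organize-classify-present: log entries are classified
# into a growing knowledge base, which a portal application then queries.
# RULES is a toy substitute for a real categorization engine.

RULES = {
    "billing": {"invoice", "charge", "refund"},
    "shipping": {"delivery", "tracking", "shipment"},
}

def classify(entry):
    """Assign an entry to the first category whose keywords it mentions."""
    words = set(entry.lower().split())
    for category, keywords in RULES.items():
        if words & keywords:
            return category
    return "uncategorized"

def ingest(log_entries, knowledge_base):
    """Organize-classify: file each call-center log entry under a category."""
    for entry in log_entries:
        knowledge_base.setdefault(classify(entry), []).append(entry)
    return knowledge_base

def present(knowledge_base, category):
    """Present: what the portal app would fetch over the API."""
    return knowledge_base.get(category, [])

kb = ingest(["Customer disputing a charge on last invoice",
             "Package tracking number not recognized"], {})
print(present(kb, "billing"))
```

Each pass through `ingest` grows the knowledge base, which is the "ever-growing" quality the article describes: tomorrow's service reps query what today's calls taught the system.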
That Other Big Thing: Integration
Dynamically tying most of a corporation's content, old and new, to a souped-up portal is at least as big a challenge as nailing the classification tasks. It's what Egan calls "the block-and-tackle stuff—the not so sexy stuff. It takes years to get very good at getting all the content into play." The effort is necessary, he adds, for a company to fully realize the promise of its portal, which may have been oversold as a knowledge-management cure-all. "All it did was give you a window into all the different repositories," he says. "But you know what? It's not that different from the desktop at the end of the day."
It takes a while to "extract" this information from the vendors, but there's little doubt that integration can be tedious and labor-intensive. Even prepackaged integration wizards like Plumtree's gadgets require a level of comfort with software terminology and table structures that many end users lack, though that doesn't mean it's always a job for in-house developers. Plumtree's Excel gadget, for example, is accessible right from each user's personal portal. Click through a few screens and you're asked to pick field ranges (the hardest part) from the spreadsheet, which sits somewhere on a networked server an administrator has granted you access to. If you've done it right, data from those fields appears in a small table on your home page.
The process is similar for IT development staff. According to Gordon Keller, director of Export.gov, the handful of Autonomy spiders his group used to comb 19 federal sites each took only a few hours to crunch through their jobs. Configuring and testing the spiders was significantly more time-consuming, but not especially difficult. "We've been going about it piece by piece," Keller says. He figures Export.gov staff devoted 120 hours working with Autonomy's Notes spider to comb roughly 25 Notes repositories scattered across federal agencies, examining fields and checking the results in an iterative process. An initial HTTP "fetch" (the quaint term used by UK-based Autonomy) went quickly. "It'll go out to that Web site and get everything under a certain directory," Keller says. Fetches of Oracle, DB2 and SQL Server databases were planned, and Keller expects to use Autonomy's Portal-in-Box later in 2002 to rework Export.gov's portal.
Non-traditional data sources, such as voice-mail systems, video servers, and instant messaging (IM) platforms, are increasingly becoming valuable repositories of corporate know-how, and vendors say customers are asking how they can tie them into their portals. IM presents unique problems, since it's increasingly carrying communications that not only have value for the corporate knowledge store, but can have legal implications similar to e-mail.
Instant messaging also carries a lot of junk, presenting another categorization problem to the indexers, and works like a temporary one-to-one channel for chat that typically isn't saved, except perhaps on users' desktops. Very few vendors, notably Groove Networks and Jabber Inc., let you archive instant messaging sessions, according to Larry Hawes, senior analyst at market researcher Delphi Group. Also offering IM capture is Lotus, through its K-station portal. But the enterprise search vendors have been slow to offer direct integration, though several say they expect to once demand is sufficient. Egan says Autonomy can probably process IM using its ability to intercept text streams and HTTP communications, but the major IM vendors such as AOL-Time Warner make it harder by not providing a central archiving mechanism.
Even e-mail remains a frontier of sorts for those wishing to capture and reuse corporate communications, as Delphi Group found in a September 2001 report of a survey of more than 300 implementers of content- and knowledge-management systems. "E-mail is very much unmanaged," Hawes says. "It's left to the end users."
A few vendors including Tacit are tackling the e-mail problem head-on. Lotus, again, has extensive hooks not only into Notes but into the Exchange servers (not to mention Office suite) of archrival Microsoft.
Still, users of such systems often find that they're strong in the collaboration aspect of knowledge management, but weak at unifying personal communications and disparate documents into a single virtual repository, Mukherjee says. "Lots of Notes users use Verity to bring repositories together," he says.
The additional portal or search-engine hardware needed to build and maintain the metadata index, provide access to it and reference the data it refers to is fairly minimal, the vendors claim. Typically, you'll need at least one server for the engine and its index (and its portal, if you're using it) and perhaps another dedicated server if you've chosen a different vendor's portal. You add more of each (plus, perhaps, server-farm technology like clustering, and faster network connections) depending on the demand, similar to an intranet or Web site buildout.
Performance tweaking could be your biggest worry, if the experience of one Verity user is indicative. For John Sack, director of Stanford University's HighWire Press, an ASP-style publisher of 6 terabytes' worth of scientific and medical journals stored in 300 content databases, Verity's K2 Enterprise has been a clear win just for saving him $200,000 in server hardware. But re-indexing takes five days of prep work to reconfigure physical and logical mappings of the data "to re-assemble it and present it in a different way to Verity," Sack says. "It's better to spend a few thousand dollars on Verity consulting than tens of thousands on staff time to try, try again."