Search Engine Performance Is Slipping
There are between 400 million and 800 million pages available on the publicly searchable World Wide Web. But based on the findings of a study conducted by the NEC Research Institute (www.neci.nj.nec.com), much of the content remains in a vast and uncharted ocean.
According to Steve Lawrence, a research scientist at NEC and co-author of the report, Accessibility of Information on the Web, it's a problem that will get worse before it gets better.
"The situation now [in terms of information availability] is much better than before the Web, but what is getting worse is the amount that you can physically search," Lawrence explains. "If search engines were comprehensive and really up to date, then that's going to help scientists be really up to date with research."
According to findings by Lawrence and co-author C. Lee Giles, the Web's best overall search engine, Northern Light (www.northernlight.com), indexes about 16 percent of publicly searchable Web pages. Northern Light is followed closely by Snap (www.snap.com) at 15.5 percent and AltaVista (www.altavista.com) at 15.3 percent. Even when the engines' coverage was combined, the amount of untapped information was huge: the 11 search engines tested provided a composite coverage of 42 percent of the publicly searchable Web.
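To put those percentages in terms of page counts, here is a minimal Python sketch. It assumes the 400 million and 800 million figures quoted at the top of this article as the low and high estimates of Web size; the function name and the choice to round to whole pages are illustrative and not taken from the NEC study.

```python
# Rough arithmetic only: converts the coverage percentages reported in the
# article into approximate page counts. The Web-size bounds are the article's
# own 400 million to 800 million range; everything else is illustrative.

WEB_SIZE_LOW = 400_000_000
WEB_SIZE_HIGH = 800_000_000

# Coverage figures as reported in the article (percent of the searchable Web).
COVERAGE_PERCENT = {
    "Northern Light": 16.0,
    "Snap": 15.5,
    "AltaVista": 15.3,
    "All 11 engines combined": 42.0,
}


def pages_indexed(total_pages: int, coverage_percent: float) -> int:
    """Approximate number of pages covered at a given coverage percentage."""
    return round(total_pages * coverage_percent / 100)


if __name__ == "__main__":
    for name, pct in COVERAGE_PERCENT.items():
        low = pages_indexed(WEB_SIZE_LOW, pct)
        high = pages_indexed(WEB_SIZE_HIGH, pct)
        print(f"{name}: roughly {low:,} to {high:,} pages")

    # Pages that no tested engine reaches, at the high estimate of Web size.
    missed = WEB_SIZE_HIGH - pages_indexed(WEB_SIZE_HIGH, 42.0)
    print(f"Not indexed by any tested engine (high estimate): {missed:,}")
```

Under these assumptions, even the combined coverage of all 11 engines leaves more than half of the estimated pages unreached.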
The results, say Lawrence and Giles, represent a significant drop-off in search engine performance from earlier studies. In a 1998 study published in Science magazine, Lawrence and Giles estimated that the top six search engines at the time -- AltaVista, Excite (www.excite.com), HotBot (www.hotbot.com), InfoSeek (www.infoseek.com), Lycos (www.lycos.com) and Northern Light -- provided 60 percent composite coverage. The pair found that the top search engine alone provided about 33 percent coverage.
Lawrence explains that the size and complexity of the Web today make it more difficult for search engine providers to adequately index the searchable Web. "There are a number of problems, and one is the technological issue of how much [search engines] can index with a reasonable amount of money [spent] on resources," he observes.
More importantly, Lawrence notes that search engine providers can often satisfy user search requests through a standard pool of cached query results. These pools eliminate the need for resource-consuming, comprehensive Web searches.
Advanced content descriptors such as HTML meta tags could help improve coverage, but Lawrence and Giles found these tags are used on only 34.2 percent of Web servers. Plus, many search engines don't take advantage of the content descriptions provided by meta tags. Web pages that are linked to and from other pages have the best chance of being catalogued by major search engines.
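For readers unfamiliar with meta tags, the following Python sketch shows the kind of content description they carry and how an indexer might read it. The sample page, the class name and the function name are hypothetical; the code uses only the standard library's html.parser module and does not represent how any of the search engines named here actually process pages.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collects name/content pairs from <meta> tags in an HTML page."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before calling this.
        if tag != "meta":
            return
        attr_map = dict(attrs)
        name = attr_map.get("name")
        content = attr_map.get("content")
        if name and content is not None:
            self.meta[name.lower()] = content


def extract_meta(html: str) -> dict:
    """Return a dict mapping meta tag names to their content strings."""
    parser = MetaTagParser()
    parser.feed(html)
    parser.close()
    return parser.meta


# Hypothetical page, invented for illustration.
SAMPLE_PAGE = """<html><head>
<title>Example page</title>
<meta name="description" content="A hypothetical page about tide tables.">
<meta name="keywords" content="tides, tables, coast">
</head><body>Hello</body></html>"""

if __name__ == "__main__":
    for name, content in extract_meta(SAMPLE_PAGE).items():
        print(f"{name}: {content}")
```

A page without such tags gives the parser nothing to collect, which is the situation the study reports for most servers.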
Lawrence concludes that the ever-advancing pace of information systems development may help in the future. "If computer resources continue to improve with the same rapid rate thus far, over a long period of time it's going to become easier to index a greater portion of the Web, and then the economic issues won't be such a decisive factor."