Drill Down
Digital Libraries in Data Warehousing
Raj Reddy has a vision of the library of the future. The Dean of the School of Computer Science and the Herbert A. Simon Professor of Computer Science and Robotics at Carnegie Mellon University, Reddy sees a time when digital libraries will give everybody, everywhere in the world, access to all recorded knowledge, at any time and in all languages. He realizes his vision of a digital library for the 21st century is a radical one. But it is possible, he argues.
Reddy calculates that since the invention of the printing press, no more than one billion books have been published. If the average book has 500 pages with 2,000 characters per page, it can be stored easily in one megabyte of disk space. Therefore, all the books that have ever been printed, Reddy notes, can be stored in one billion megabytes, or one petabyte. At $20 a gigabyte of storage, a petabyte would cost only $20 million. Storing all the books in the world, Reddy concludes, does not pose an insurmountable technological problem, at least in terms of storage.
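Reddy's back-of-the-envelope figures are easy to check. The short sketch below simply restates his arithmetic in code; the numbers are his assumptions, not measured values.

    # Check of Reddy's storage estimate, using his assumed figures
    books = 1_000_000_000                     # upper bound on books ever published
    bytes_per_book = 500 * 2_000              # 500 pages x 2,000 characters, about 1 MB
    total_bytes = books * bytes_per_book      # about 1e15 bytes, or one petabyte

    cost_per_gigabyte = 20.0                  # dollars, Reddy's assumed price
    total_cost = (total_bytes / 1_000_000_000) * cost_per_gigabyte
    print(f"{total_bytes:.1e} bytes, roughly ${total_cost:,.0f}")   # 1.0e+15 bytes, roughly $20,000,000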
Reddy recently led a study tour to Japan, organized by the International Technology Research Institute (ITRI) at Loyola College and sponsored by the National Science Foundation and the Defense Advanced Research Projects Agency (DARPA), to assess Japanese hardware and software systems for what is called digital information organization, with a concentration on digital libraries. Digital information organization is a term that encompasses the methods used to render large amounts of information into digital form so that it can be stored, retrieved and manipulated by computers. In other words, digital information organization refers to the processes and techniques needed to develop large-scale data warehouses starting from information found on physical media.
The drive to develop digital libraries can be seen as a particularly interesting initiative in the development of data warehouses for several reasons. First, a digital library is a storehouse of largely unstructured objects that are useful only if they can be successfully searched. That is because the objects consist of text, images and perhaps three-dimensional objects that can be difficult to render digitally, catalogue and retrieve in a timely fashion using current technology.
Second, digital libraries are premised on the notion of integrating information from a wide range of independent sources. Not every library that owns a collection of Shakespeare, for example, is expected to digitize it. The works of Shakespeare might be digitized once and mirrored on several different sites. The same version, however, would be accessible to all digital library users.
Third, national digital library projects in both Japan and the U.S. are very ambitious. They are seen as vehicles for fundamentally restructuring education and access to knowledge throughout society.
Viewing a digital library as a paradigmatic effort at large-scale data warehousing helps bring the significant technological obstacles not related to storage into clearer focus. Those problems can be grouped into several categories, including scalability, formats and standards, metadata, search mechanisms and information reliability.
Scalability is one of the most serious barriers encountered in large-scale data warehousing projects. Suppose, as Reddy has, that approximately one billion people will have access to the Internet within the next several years. If only one percent of those people wanted access to a specific item, and a server could deliver the requested information in 100 milliseconds, it would still take about 12 days for everyone who asked to receive it.
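That estimate follows from simple serial arithmetic; a minimal sketch, using the same assumed figures:

    # Serial-delivery arithmetic behind the 12-day figure
    requesters = 1_000_000_000 * 0.01          # one percent of a billion users
    seconds_per_request = 0.1                  # 100 milliseconds per delivery
    total_days = requesters * seconds_per_request / (60 * 60 * 24)
    print(f"{total_days:.1f} days")            # about 11.6 days, i.e., roughly 12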
Bandwidth scalability is only part of the problem, however. Think about keyword searching. A simple search of the 50 million or so Web pages now available can return 1,000 or more hits. When the amount of information available reaches 50 billion pages, the same keyword search could return one million hits.
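The hit count here is assumed to grow in direct proportion to the number of pages indexed, which is what makes the projection so stark:

    # Proportional growth of keyword hits with the size of the Web
    hits_today = 1_000
    pages_today, pages_future = 50_000_000, 50_000_000_000
    hits_future = hits_today * pages_future // pages_today
    print(f"{hits_future:,} hits")             # 1,000,000 hits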
And the problems multiply. Data standards and uniform formats are essential for interoperability. But as data warehouses are stitched together from multiple sources, it becomes hard to guarantee that a uniform set of standards is applied. Even using such common standards as ASCII for text, JPEG for images or HTML for markup does not solve the problem; different compression techniques or cataloguing conventions can still keep nominally compatible data from being integrated smoothly.
The whole concept of cataloguing raises the issue of metadata and the related notion of granularity. As ITRI study mission participant Beth Davis-Brown, the digital project conversion coordinator at the Law Library of the Library of Congress, noted, in traditional libraries metadata (that is, the information found in most catalogues) consists largely of information about the physical "container" along with content descriptions created by librarians. Moreover, different items are catalogued at different levels of granularity: there may be metadata about each book in a library but not about each letter in a collection of private papers.
Metadata for a digital library item would include information about the location of the object, its content, and access management, that is, who is allowed to use the information. In early test-beds, developing the metadata has proven to be one of the most expensive operations.
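As a rough illustration only, the record below sketches what such metadata might contain; the field names and values are hypothetical, not drawn from any actual digital library standard.

    # Hypothetical metadata record for a single digitized item;
    # field names are illustrative, not from any real standard
    record = {
        "location": "https://example.org/collections/shakespeare/hamlet",
        "content": {
            "title": "Hamlet",
            "creator": "William Shakespeare",
            "type": "text",
            "language": "en",
        },
        "access_management": {
            "public": True,                    # who is allowed to use the information
            "allowed_uses": ["read", "search"],
        },
    }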
In many ways, current digital library initiatives can be seen as "proof of concept" efforts for large-scale data warehousing projects using text and unstructured data. Presentations about the ITRI study mission to Japan can be found at http://itri.loyola.edu/digilibs. Information about U.S. digital library projects is available at http://lcweb.loc.gov/loc/ndlf/digital.html.

About the Author:
Dr. Eliot King is an Assistant Professor of Communications and Director of the New Media Center at Loyola College in Maryland. He can be reached at (410) 356-3943, or by e-mail at eking@loyolanet.campus.mci.net.