Profiles in IT: One Percent of the World's Data

We're trying to preserve vast amounts of digital data reliably for 100 years—but the storage media themselves haven't been proven to last that long.

The Earth Observing System (EOS) gathers data from satellites about conditions on the planet's surface and in its oceans and atmosphere. Climatologists use the data to develop models that will improve weather forecasts. Other scientists use it to gauge the condition of fisheries. Information on atmospheric temperatures and pollution levels helps still others track global warming. Each satellite in the system sends back a terabyte of raw information a day, and that figure doubles once the information is processed.
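To get a feel for the scale, the per-satellite figures above can be turned into a yearly total. This is only a back-of-the-envelope sketch: it assumes "doubles" means 2 terabytes of processed data per satellite per day and a 365-day year, and the function name and the satellite counts in the loop are illustrative, not from the article.

```python
RAW_TB_PER_DAY = 1.0       # article: one terabyte of raw data per satellite per day
PROCESSING_FACTOR = 2.0    # article: the figure doubles once the data is processed


def yearly_volume_tb(satellites: int, days: int = 365) -> float:
    """Processed data volume, in terabytes, for a number of satellites over `days` days."""
    return satellites * RAW_TB_PER_DAY * PROCESSING_FACTOR * days


if __name__ == "__main__":
    # Satellite counts are hypothetical; the article does not say how many are in orbit.
    for count in (1, 3):
        print(f"{count} satellite(s): {yearly_volume_tb(count):,.0f} TB per year")
```

Under these assumptions a single satellite accounts for 730 terabytes a year.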

For Milton Halem, assistant director for information science and CIO of NASA's Goddard Space Flight Center (GSFC) in Greenbelt, Md., all that data raining down creates two problems. First, there's the issue of storing and moving huge amounts of information—a common problem, but on an uncommon scale and with some restrictions imposed by the nature of GSFC's work. Second, there are archiving issues. Modeling the weather, for example, requires many observations, and for widely spaced phenomena like El Niño, just 20 observations take roughly a century. How do you preserve digital data reliably for 100 years when the storage media themselves haven't been proven to last that long?

Halem could arguably be called an IT pioneer. He earned a Ph.D. in applied mathematics at New York University in 1968, and cut his teeth on the third Univac built. In the late sixties and early seventies, he worked at the Goddard Institute for Space Studies at Columbia. In 1977, GSFC decided to install a new supercomputer and, nervous about student unrest, moved it to Goddard's campus outside Washington, D.C.

GSFC already has the largest active storage capacity in the world, but even that will soon fall short of the agency's needs as additional EOS satellites, carrying increasingly accurate instruments that deliver correspondingly greater amounts of data, are launched. Conventional ways of dealing with the explosion of data generally aren't an option. Lossy compression techniques might destroy valuable information, and administrative measures such as periodic purging would defeat long-term projects. Even efforts to clean up user files aren't likely to help much, according to Halem, since they're just a drop in the bucket compared with the amount of new data.

Framed in the abstract, GSFC's ability to collect data roughly follows Moore's Law, and doubles every 18 months. Increases in storage density, Halem says, follow more or less the same timetable. The speed at which tapes can be read, however, has only tripled in the last decade. A collateral problem, incidentally, is that newer, faster controllers aren't always backward compatible with older data. Halem reckons that unless something happens to increase data transfer rates, within 10 years the amount of time it takes to back up GSFC's data will exceed the life of the media onto which it's being transferred.
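The mismatch Halem describes can be put in rough numbers. The sketch below assumes both trends continue as steady exponentials (data doubling every 18 months, tape read speed tripling every 10 years); the function names are illustrative, and it says nothing about actual media lifetimes, which the article does not quantify.

```python
DATA_DOUBLING_MONTHS = 18        # article: data collection doubles every 18 months
READ_SPEEDUP_PER_DECADE = 3.0    # article: tape read speed tripled in the last decade


def data_growth(years: float) -> float:
    """Factor by which the data volume grows over `years`."""
    return 2.0 ** (years * 12.0 / DATA_DOUBLING_MONTHS)


def read_speed_growth(years: float) -> float:
    """Factor by which tape read speed grows, assuming the past trend simply continues."""
    return READ_SPEEDUP_PER_DECADE ** (years / 10.0)


def read_time_growth(years: float) -> float:
    """Factor by which the time needed to read the whole archive grows."""
    return data_growth(years) / read_speed_growth(years)


if __name__ == "__main__":
    print(f"Data volume after 10 years: x{data_growth(10):.1f}")
    print(f"Read speed after 10 years:  x{read_speed_growth(10):.1f}")
    print(f"Time to read everything:    x{read_time_growth(10):.1f}")
```

On these assumptions the data grows about a hundredfold in a decade while read speed only triples, so reading or copying the full archive would take roughly 34 times as long as it does today.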

Halem is looking at storage area networks (SANs) to help solve the data volume issue. Goddard's campus covers several square miles and includes five or six buildings. The buildings are already linked by fiber channels, and GSFC's IT organization is taking advantage of that to install a pilot SAN. The test is using off-the-shelf equipment from the major storage vendors. "We will have all of our systems, our disk storage and some of our tape storage systems, capable of interfacing to devices that can share storage through fiber channels," he says.

The U.S. Geological Survey (USGS) is also involved in the SAN project. USGS maintains a data center in South Dakota, and some of its work parallels that of GSFC. "Once we get our part of it done, we're going to work with USGS to begin to see if we can do a backup storage system between their site and our site here, as well as a third site in West Virginia," Halem says.

Apart from helping to solve its storage problems, the SAN could help speed the preparatory work Halem's group must do before a new satellite is launched. Generally, GSFC prepares its systems to handle the new flow of data about six months before the scheduled launch date. If the launch is delayed, the IT organization could be saddled with lower-capacity storage systems than if it had waited—Moore's Law again.

"If we could share this storage in the development of our systems, it would enable us not to commit so early," Halem says. "SANs give us that capability. We could do the development work, share some of the extra capacity of the system and not make the commitment up front to acquire the necessary storage."
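The cost of committing early can be estimated the same way. As a rough sketch, assume the capacity a fixed budget buys doubles every 18 months, which is the timetable the article gives for storage density; whether prices track density that closely is an assumption, and the function name is illustrative.

```python
DOUBLING_MONTHS = 18  # assumption: capacity per dollar follows the 18-month doubling


def capacity_gain_from_waiting(months_waited: float) -> float:
    """How much more capacity the same budget buys after waiting `months_waited` months."""
    return 2.0 ** (months_waited / DOUBLING_MONTHS)


if __name__ == "__main__":
    for months in (6, 12, 18):
        gain = capacity_gain_from_waiting(months)
        print(f"Waiting {months:2d} months: x{gain:.2f} capacity for the same money")
```

By this estimate, buying storage six months before launch means giving up about a quarter more capacity than the same money would buy at launch, and the gap widens with every month of delay.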

To solve the archiving problem, Halem considered optical storage, but rejected it, at least for the time being. "Optical storage has considerably longer shelf life—although there are still a lot of questions about that—but the access time is even slower, and it requires much more advanced technology—lasers and things like that. Over decades, those kinds of technology don't migrate as well," he says.

Halem is looking at a three-stage storage model, with low-cost disk storage (something like RAID arrays) sitting between his high-speed disks and his tape systems. The disks will be backed up onto tape cartridges at a remote location over fiber lines. Currently, Halem notes, GSFC uses 50Mb cartridges deployed in a dozen 5,000-tape silos.

New controllers will let him roughly quadruple storage density on the same cartridges, and bring total capacity up to around 5 petabytes (a petabyte is 1,024 terabytes). "I saw an estimate recently that the total storage capacity worldwide, everything, is just a little over 500 petabytes," Halem says. "If we go to 5 petabytes, we'll have 1 percent of the world's storage in just our building."
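Halem's closing arithmetic is easy to check. The short sketch below uses only the two figures he quotes and the binary convention of 1,024 terabytes per petabyte; the constant and function names are illustrative.

```python
TB_PER_PB = 1024           # binary convention: 1 petabyte = 1,024 terabytes
PLANNED_CAPACITY_PB = 5    # article: GSFC's planned total capacity
WORLD_CAPACITY_PB = 500    # article: the worldwide estimate Halem cites


def petabytes_to_terabytes(petabytes: float) -> float:
    """Convert petabytes to terabytes using the binary convention."""
    return petabytes * TB_PER_PB


def share_percent(part: float, whole: float) -> float:
    """Return `part` as a percentage of `whole`."""
    return 100.0 * part / whole


if __name__ == "__main__":
    print(f"Planned capacity: {petabytes_to_terabytes(PLANNED_CAPACITY_PB):,.0f} TB")
    print(f"Share of world storage: {share_percent(PLANNED_CAPACITY_PB, WORLD_CAPACITY_PB):.0f}%")
```

Five petabytes works out to 5,120 terabytes, and 5 out of 500 is the 1 percent in the headline. Since Halem puts the worldwide total at "a little over" 500 petabytes, the true share would be slightly under that.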

About the Author

Bob Mueller is a writer and magazine publishing consultant based in the Chicago area, covering technology and management subjects.