In-Depth

Profiles in IT: One Percent of the World's Data
We're trying to preserve vast amounts of digital data reliably for 100 years—but the storage media themselves haven't been proven to last that long.

The Earth Observing System (EOS) gathers data from satellites about conditions on the planet's surface and in its oceans and atmosphere. The data is being used by climatologists to develop models that will improve weather forecasts. Other scientists use the data to gauge the condition of fisheries. Information on atmospheric temperatures and pollution levels helps still others track global warming. Each satellite in the system sends back a terabyte of raw information a day, and that figure doubles once the information is processed.
For Milton Halem, assistant director for information science and CIO of NASA's Goddard Space Flight Center (GSFC) in Greenbelt, Md., all that data raining down creates two problems. First, there's the issue of storing and moving huge amounts of information—a common problem, but on an uncommon scale and with some restrictions imposed by the nature of GSFC's work. Second, there are archiving issues. Modeling the weather, for example, requires many observations, and for widely spaced phenomena like El Niño, just 20 observations take roughly a century. How do you preserve digital data reliably for 100 years when the storage media themselves haven't been proven to last that long?
Halem could arguably be called an IT pioneer. He earned a Ph.D. in applied mathematics at New York University in 1968, and cut his teeth on the third Univac built. In the late sixties and early seventies, he worked at the Goddard Institute for Space Studies at Columbia. In 1977, GSFC decided to install a new supercomputer and, nervous about student unrest, moved it to Goddard's campus outside Washington, D.C.
GSFC already has the largest active storage capacity in the world, but even that will soon fall short of the agency's needs as additional EOS satellites—carrying increasingly accurate instruments that deliver correspondingly greater amounts of data—are launched. Conventional ways of dealing with the explosion of data generally aren't an option. Lossy compression techniques might destroy valuable information, and administrative measures such as periodic purging would defeat long-term projects. Even efforts to clean up user files aren't likely to help much, according to Halem, since they're just a drop in the bucket compared with the amount of new data.
Framed in the abstract, GSFC's ability to collect data roughly follows Moore's Law and doubles every 18 months. Increases in storage density, Halem says, follow more or less the same timetable. The speed at which tapes can be read, however, has only tripled in the last decade. A collateral problem is that newer, faster controllers aren't always backward compatible with older data. Halem reckons that unless something happens to increase data transfer rates, within 10 years the amount of time it takes to back up GSFC's data will exceed the life of the media onto which it's being transferred.
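To see why those growth rates collide, it helps to run the numbers. The sketch below is a rough back-of-the-envelope projection, not a GSFC calculation; the starting archive size, tape bandwidth and media lifetime are illustrative assumptions, and only the two growth rates (data doubling every 18 months, tape speed tripling per decade) come from Halem's account.

```python
# Back-of-the-envelope sketch of the mismatch Halem describes. The starting
# values below are illustrative assumptions, not GSFC figures; only the two
# growth rates come from the article.

SECONDS_PER_YEAR = 365 * 24 * 3600

archive_pb = 1.0        # assumed archive size today, in petabytes
tape_mb_s = 15.0        # assumed aggregate tape bandwidth, in MB/s
media_life_years = 10   # assumed usable life of a tape generation

for year in range(0, 16):
    volume_pb = archive_pb * 2 ** (year / 1.5)   # data doubles every 18 months
    bandwidth = tape_mb_s * 3 ** (year / 10)     # tape speed triples per decade
    copy_years = volume_pb * 1e9 / bandwidth / SECONDS_PER_YEAR
    note = "  <-- a full copy now outlives the media" if copy_years > media_life_years else ""
    print(f"year {year:2d}: {volume_pb:8.1f} PB, full copy takes {copy_years:6.1f} years{note}")
```

Under these assumptions the crossover arrives within about five years; with a larger starting archive it comes sooner, which is the point Halem is making.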
Halem is looking at storage area networks (SANs) to help solve the data volume issue. Goddard's campus covers several square miles and includes five or six buildings. The buildings are already linked by Fibre Channel, and GSFC's IT organization is taking advantage of that to install a pilot SAN. The test is using off-the-shelf equipment from the major storage vendors. "We will have all of our systems, our disk storage and some of our tape storage systems, capable of interfacing to devices that can share storage through fiber channels," he says.
The U.S. Geological Survey (USGS) is also involved in the SAN project. USGS maintains a data center in South Dakota, and some of its work parallels that of GSFC. "Once we get our part of it done, we're going to work with USGS to begin to see if we can do a backup storage system between their site and our site here, as well as a third site in West Virginia," Halem says.
Apart from helping to solve its storage problems, the SAN could help speed the preparatory work Halem's group must do before a new satellite is launched. Generally, GSFC prepares its systems to handle the new flow of data about six months before the scheduled launch date. If the launch is delayed, the IT organization could be saddled with lower-capacity storage systems than if it had waited—Moore's Law again.
"If we could share this storage in the development of our systems, it would enable us not to commit so early," Halem says. "SANs give us that capability. We could do the development work, share some of the extra capacity of the system and not make the commitment up front to acquire the necessary storage."
To solve the archiving problem, Halem considered optical storage, but rejected it, at least for the time being. "Optical storage has considerably longer shelf life—although there are still a lot of questions about that—but the access time is even slower, and it requires much more advanced technology—lasers and things like that. Over decades, those kinds of technology don't migrate as well," he says.
Halem is looking at a three-stage storage model, with low-cost disk storage—something like RAID arrays—sitting between his high-speed disks and his tape systems. The disks will be backed up onto tape cartridges at a remote location over fiber lines. Currently, Halem notes, GSFC uses 50MB cartridges deployed in a dozen 5,000-tape silos.
New controllers will let him roughly quadruple storage density on the same cartridges and bring total capacity up to around 5 petabytes (a petabyte is 1,024 terabytes). "I saw an estimate recently that the total storage capacity worldwide—everything—is just a little over 500 petabytes," Halem says. "If we go to 5 petabytes, we'll have 1 percent of the world's storage in just our building."
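The arithmetic behind that claim is easy to check. The snippet below works backward from the figures in the article; the implied per-cartridge capacity it prints is a derived illustration, not a number Halem quotes.

```python
# Rough arithmetic behind the "1 percent" figure. The per-cartridge capacity
# printed here is derived from the stated totals, not a quoted GSFC number.

silos = 12
tapes_per_silo = 5_000
tapes = silos * tapes_per_silo                   # 60,000 cartridges in all

target_pb = 5                                    # capacity cited after the controller upgrade
gb_per_tape = target_pb * 1_024 * 1_024 / tapes  # implied capacity per cartridge, in GB

world_pb = 500                                   # Halem's estimate of worldwide storage
share = target_pb / world_pb

print(f"{tapes:,} cartridges -> roughly {gb_per_tape:.0f} GB each to hold {target_pb} PB")
print(f"{target_pb} PB is {share:.0%} of an estimated {world_pb} PB worldwide")
```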

About the Author
Bob Mueller is a writer and magazine publishing consultant based in the Chicago area, covering technology and management subjects.