NOAA Calls for Clear Skies
The National Oceanic and Atmospheric Administration sought to accelerate sluggish supercomputer performance. With Linux, its weather research now runs at scorching speeds.
Say goodnight, Sun. See ya later, IBM. Maybe next time, HP. One of the fastest weather research supercomputers in the world runs Red Hat Linux 6.2.
The National Oceanic and Atmospheric Administration (NOAA) houses the Jet supercomputer in Boulder, Colo. It's a cluster of 280 interconnected Compaq 833 MHz Alpha workstations. But don't try this in your living room: The system takes up about 600 square feet of space and costs three million dollars per year to buy and run.
With as many as 40 processors running mathematical codes at the same time, it can crank out a four- to 10-day weather forecast in six hours, says Leslie Hart, a computer specialist at the NOAA Forecast Systems Laboratory (FSL) in Boulder. The Linux supercomputer can spit out a short-term forecast—covering 24 to 36 hours—in as little as one hour using a high-resolution model known as the Weather Research and Forecast Model.
The system can run one-third of a trillion arithmetic computations per second, which is 20 times faster than what Forecast Systems Laboratory had previously achieved. By 2002, it will be able to perform more than four teraflops, or four trillion computations per second.
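To put those figures in perspective, the short calculation below works out what they imply. Only the three input numbers come from the figures quoted above; the "implied" values are derived here for illustration and are not stated by NOAA.

```python
# Figures quoted in the article.
current_flops = 1e12 / 3        # "one-third of a trillion" computations per second
speedup_over_previous = 20      # "20 times faster" than FSL's earlier capability
target_flops = 4e12             # "more than four teraflops" planned for 2002

# Implied values (derived here, not stated in the article).
previous_flops = current_flops / speedup_over_previous
planned_growth = target_flops / current_flops

print(f"Implied previous rate: {previous_flops / 1e9:.1f} gigaflops")
print(f"Planned growth factor: {planned_growth:.0f}x")
```

In other words, the earlier FSL capability works out to roughly 17 billion computations per second, and the 2002 target is about a twelvefold increase over the current system.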
Why a Supercomputer?
The Forecast Systems Laboratory in Boulder acts as a technology transfer laboratory, as opposed to laboratories that have a purely operational or research function within NOAA, Hart says. They develop, procure and test technology that commercial vendors could then offer to a wider audience. NOAA is part of the Commerce Department.
NOAA researchers use the Jet supercomputer to improve on weather forecast models and to develop new ones. "Ocean and meteorological models require a tremendous amount of computing power," says A.E. "Sandy" MacDonald, director of the FSL. The researchers get their weather prediction data from satellites and other data collection systems, which gather weather data to the east, north, south and west of each given location. They use physics algorithms in their computational programs on the Jet supercomputer to make forecasts, with an array of meteorological variables considered.
NOAA is also using the Jet supercomputer to run its North American Atmospheric Observing System, a program to design a next-generation upper-air observing system. Forecasters, after all, can make better predictions if they have better knowledge of current conditions. "Now, we can use the computer to tell us what the best design for the observing system of the future will be," MacDonald says.
The work that Hart and other researchers do is part of a real-life drama. When William M. Daley, then-Commerce Department secretary, announced the supercomputer acquisition, he predicted that it could save lives and property because it would help researchers improve forecasts of severe weather such as thunderstorms, tornadoes and winter storms.
Making the Linux Choice
In September 1999, NOAA officials chose High Performance Technologies Inc. (HPTi) to provide the Compaq-based Linux supercomputer under a five-year, $15 million contract. At the time, the procurement was the largest known competitive federal government acquisition to go Linux's way.
Founded in 1992, HPTi is a small business that used a "best of breed" approach in marrying hardware vendor Compaq with Linux software, as opposed to running Compaq software exclusively on Compaq hardware.
Although Linux is very popular among scientific users, Hart says that he and his colleagues had no particular preference for an operating system. They run, after all, Unix-based operating systems by Compaq (the former Digital Equipment Corp. operating system), HP, IBM, Silicon Graphics and Sun Microsystems on their campus. They also run two clusters of Intel Pentium III workstations with Linux.
"We ran a fully competitive procurement," in keeping with federal regulations governing large system acquisitions, Hart explains. During a time of flat budgets, the FSL wasn't "looking for performance for the dollar. We didn't require a particular operating system or hardware platform. We simply required that it ran our codes well. We weren't on a crusade to implement Linux." Hart serves as the lead technical official on the supercomputer program.
Following two weeks of work laying cables and getting the Compaq workstations connected, HPTi officials had the system running by December 1999, Hart says. "It requires a fair amount of in-house expertise to do work like this, even with the systems integration and installation help provided by HPTi. You need some fairly high-level people."
"Connecting it together was not all that complicated, though time-consuming, and Compaq handles that task very aptly," adds David Rhoades, high performance systems integration director at HPTi. "It was making it all work together as a single system that was the challenge." The supercomputer features 70 terabytes of tapes in a storage unit and a fast disk subsystem with 200 megabytes per second bandwidth.
During the system's first three months in operation, it was up 99.9 percent of the time, according to Greg Lindahl, HPTi's technical lead for the NOAA contract.
Like other Unix operating systems, Linux is very stable, and the rigorous tests it gets from supportive hackers around the world help make it more robust. Uptime is a key issue since the supercomputer, like most supercomputers, is in use all the time.
How did HPTi do it? "We modified the open source version of the Portable Batch System that provided users with a simple interface and also provides automatic failover for failed nodes," Rhoades explains. "This was probably the biggest software effort for us. Looking ahead, we are putting together a coordinated effort to mature the system software on several fronts, including the (input/output)."
In addition to installing the supercomputer and providing maintenance support, HPTi officials are providing two system upgrades over five years, and they are delivering a 100 terabyte mass storage system with robot control, Hart says. HPTi officials have already swapped 64-bit processors, replacing 667 MHz Alpha processors on the Compaq XP1000 workstations with 833 MHz ones. They've already doubled the storage available for the Jet supercomputer.
Compaq, which inherited the Alpha processors through its acquisition of Digital Equipment Corp. more than two years ago, is providing NOAA with three years of on-site maintenance. Lindahl says that HPTi conducted benchmark tests on processors by Intel and Advanced Micro Devices Inc., as well as Alpha and other processors. Although they could have purchased more workstations running Intel or AMD chips for the same amount that they paid for the 280 Compaq Alpha workstations, the total performance would have been lower for FSL, he says.
If a computational node in the cluster fails, a fault tolerance daemon in the Jet supercomputer removes the failed node from the system and restarts the parallel job that used that node, Lindahl says. He calls the Compaq Alpha workstations "fairly simple," so it's not hard to find the nodes that aren't working properly. At any given time, there are usually 10 or more jobs running on the supercomputer, and the average one uses 16 processors, he says.
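The failover behavior Lindahl describes can be illustrated with a toy model. The sketch below is not HPTi's daemon or the Portable Batch System; the Job and Cluster classes and all of their names are hypothetical, and it only mimics the two steps described above: take the failed node out of service, then restart the affected job on healthy nodes.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    name: str
    nodes_needed: int
    nodes: set = field(default_factory=set)
    restarts: int = 0


class Cluster:
    """Toy cluster with node-failure handling (hypothetical, for illustration)."""

    def __init__(self, node_count):
        self.free = set(range(node_count))   # healthy, idle nodes
        self.failed = set()                  # nodes taken out of service
        self.running = {}                    # job name -> Job

    def start(self, job):
        """Assign idle nodes to a job; return False if too few are free."""
        if len(self.free) < job.nodes_needed:
            return False
        job.nodes = {self.free.pop() for _ in range(job.nodes_needed)}
        self.running[job.name] = job
        return True

    def handle_failure(self, node):
        """Remove a failed node and restart whichever job was using it."""
        self.failed.add(node)
        self.free.discard(node)
        for job in list(self.running.values()):
            if node in job.nodes:
                # Return the job's surviving nodes to the idle pool.
                self.free |= job.nodes - {node}
                del self.running[job.name]
                job.restarts += 1
                self.start(job)  # restart on healthy nodes only


# Example: a 16-processor job on a 280-node cluster loses one node.
cluster = Cluster(node_count=280)
job = Job("forecast", nodes_needed=16)
cluster.start(job)
bad_node = next(iter(job.nodes))
cluster.handle_failure(bad_node)
print(job.restarts, bad_node in job.nodes)
```

After the simulated failure the job is running again on 16 nodes, none of which is the failed one. A production scheduler would also have to deal with checkpoints, queued jobs and the case where too few healthy nodes remain, none of which this sketch attempts.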
Government agencies with scientific missions have shown a particular interest in clustering architectures for supercomputers. Rather than buying a supercomputer from a single vendor, they build a system using commercially available hardware and software products from vendors. Officials feel that clustering architectures not only save money but also deliver superior performance.
The Energy Department's Sandia National Laboratories in New Mexico, for example, has more than 800 processors running in its Cplant cluster, and they plan to add another 1,400, according to Lindahl. NASA officials, through their Beowulf clusters, built their own supercomputers using Linux. Defense Department officials, who spend about $200 million annually through their High Performance Computing Modernization Program, have also used clustering architectures.
The success of FSL's work with Linux should have an effect on other large projects in parallel computing in the public and private sector.
Under the NOAA contract, HPTi subcontractor Patuxent Technology Partners provides storage area network services, while the University of Virginia lends a hand with advanced cluster technologies.
Keeping Users Connected
About two-thirds of the Jet supercomputer's 100 users work in Boulder, and they work on an OC/3 asynchronous transfer mode network that runs at 155 Mbps, Hart says. The supercomputer has a system area network, Myricom's Myrinet, that connects the 280 computers at 2.4 Gbps. The supercomputer has a 2.4 terabyte hard disk, and each workstation has two 9 GB hard disks.
"The interconnect between computers is a really crucial ingredient," says MacDonald. "It's like having phone lines between the computers, and it needs to be extremely fast."
The remaining 30 or so Jet users are scattered throughout the U.S., and they use Internet Service Providers (ISPs) with T-1 or better connectivity to the supercomputer through a national OC/3 network, Hart explains.
Despite what would seem to be a high level of connectivity, Hart says that the Jet system is plagued by disk bandwidth problems, which is a drawback to using clusters versus traditional "Big Iron" supercomputers. "[Input/output] is a problem we're dealing with in a cluster of this size," Hart says. By early March, NOAA officials had installed software to take advantage of fiber channel connected disks, to improve bandwidth.
"We have several options and methods of attack for solving this challenge," Rhoades says. NOAA officials have accepted CentraVision file system software by Advanced Digital Information Corp. for insertion into the Jet supercomputer, he says. "Over the next couple of months [the software] will mature significantly and have a huge impact on improving I/O," he says. HPTi is using RAID controllers and disks with the CentraVision file system software.
NOAA FSL officials are also working with HPTi to develop a forward file system that will combine with the CentraVision file system and/or the Global File System to provide users with input/output similar to that on Cray Research's T3E system, Rhoades says. NOAA uses Fortran compilers to build its software codes.
Mostly Sun, Some Clouds
"We've had some rough edges, but we're happy overall," Hart says. A mathematician by background, he called his work there "pretty interesting."
A power supply problem with the Compaq XP1000 Alpha workstations may have happened after a batch of them got sprayed with insecticide, Rhoades says. He called the problem "fluky," and described the workstations' performance as "rock-solid."
NOAA lost no work despite the hardware failures because of the supercomputer's fault-tolerant software, according to Lindahl.
The biggest problem that HPTi officials had with their products was getting Fore Systems Ethernet and Asynchronous Transfer Mode switches to talk to each other, Lindahl says. "We had some wrong guesses about when certain technology was going to be delivered," such as the CentraVision file system for Linux, which arrived late. "We had to work with Myrinet to shake out some bugs in their new interconnect hardware and software."
The Alpha processors are "an excellent balance of floating-point performance and memory bandwidth—the two most critical factors for supercomputing applications," Rhoades says. HPTi developed a set of administrative tools so that the FSL could administer the Jet supercomputer as if it were a single computer.
The administrative tools enable FSL officials to test new software on a portion of the supercomputer, instead of taking the whole system down, Lindahl says. This feature makes the Jet supercomputer more reliable than a traditional supercomputer, where a single processor or RAM chip failure can take the entire system off-line.
"Commodity Alpha workstations coupled with the Myrinet interconnect provide a very balanced high-performance solution for [the] weather codes that FSL develops," and for other applications, he says. Myricom Inc., Myrinet's vendor, makes packet-communication and switching technology. Myrinet gives one-way data rates as fast as 1.8 gigabits per second between Unix hosts, according to Myricom.
On their own, FSL scientists developed the Scalable Modeling System (SMS), which acts as a software layer between the weather prediction model's source code and the Message Passing Interface, the industry standard for inter-processor communication. SMS is highly portable, since it works on at least eight different supercomputer systems, and FSL scientists have applied SMS on six weather prediction models.
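The idea of a layer that sits between model code and message passing can be sketched in a few lines. SMS's actual interface is not described here, so everything below is a generic, hypothetical illustration: the function names are invented, the "processors" are simulated inside one process, and the step that would be an MPI send and receive in a real system is plain list indexing.

```python
def decompose(grid, parts):
    """Split a 1-D grid into contiguous chunks, one per simulated processor."""
    size, extra = divmod(len(grid), parts)
    chunks, start = [], 0
    for rank in range(parts):
        stop = start + size + (1 if rank < extra else 0)
        chunks.append(grid[start:stop])
        start = stop
    return chunks


def exchange_halos(chunks):
    """Give each chunk a copy of its neighbours' edge values.

    In a real system this step would be message passing (for example
    MPI send/receive); here it is list indexing in a single process.
    """
    padded = []
    for rank, chunk in enumerate(chunks):
        left = chunks[rank - 1][-1] if rank > 0 else chunk[0]
        right = chunks[rank + 1][0] if rank < len(chunks) - 1 else chunk[-1]
        padded.append([left] + chunk + [right])
    return padded


def smooth(padded):
    """The 'model' code: a three-point average that knows nothing of ranks."""
    return [(padded[i - 1] + padded[i] + padded[i + 1]) / 3
            for i in range(1, len(padded) - 1)]


def parallel_step(grid, parts):
    """Layer that hides decomposition and communication from the model."""
    chunks = decompose(grid, parts)
    results = [smooth(p) for p in exchange_halos(chunks)]
    return [value for chunk in results for value in chunk]


grid = [0.0, 0.0, 9.0, 0.0, 0.0, 0.0, 9.0, 0.0]
serial = parallel_step(grid, parts=1)
split = parallel_step(grid, parts=4)
print(split == serial)
```

The point of the design is that smooth, the stand-in for the weather model, is identical whether the grid is handled by one processor or four. Only the surrounding layer changes, which is the kind of separation that keeps the impact on model source code small.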
SMS is also easy to use, delivers high performance and has very little impact on source code, according to FSL officials.
Working with a freeware operating system and a talented and networked group of voluntary software developers has its advantages, Hart says. But Linux patches usually come out for Intel processor computers first, and then Alpha-based systems later, he says.
"We do some work with people outside," in the Linux user community, he says.
"Red Hat (Linux) has worked well," HPTi's Rhoades says. "We have been successful in developing tools using their RPM utilities in automating systems -administration."
"The Linux community is great, and the operating system is incredibly robust and mature at its core," Rhoades says. "Everyone contributing together to a common cause is clearly the reason. A few of the drivers we need for the FSL system are a bit less ‘mainstream,' but all the developers have been super to work with."
HPTi and NOAA implement software bug fixes immediately, and "more structured upgrades are tested on the development partition prior to going into production," Rhoades says. "Currently, we are in the process of bringing the machine to the point that it is ready for the Myrinet 2000 product that will be used in the impending upgrade."
What are NOAA's future plans for the Jet supercomputer? By the summer of 2002, it will host more than 1,000 processors, Hart says. With an additional 600 square feet of space available, the system's size shouldn't be a problem, but NOAA officials have to make sure they keep the air conditioning system working to prevent overheating.