Drill Down

Will Data Warehousing Drive Supercomputing into the Corporate Mainstream?

Waste, corruption and fraud. Hardly anybody is in favor of them, particularly in government programs that use our tax dollars. The problem is that criminals are clever, and waste, corruption and fraud are hard to detect.

But detection may be getting easier. The Texas Medicaid program, the health insurance program for the poor, recently implemented a system that will significantly enhance its efforts to recover illegally or improperly spent funds from its $8 billion program. At the system's heart lies a classic data mining application.

Using neural network technology, Texas officials analyzed hundreds of known fraud cases along approximately 100 parameters. They then created a visual representation of the analysis and applied the resulting model to thousands of files in the program's database. Files that matched the model were flagged and investigated. More than 70 percent proved to be fraud. By the year 2000, according to Aurora LeBurn, Associate Commissioner of Texas' Office of Investigations and Enforcement, the technology will help identify 2,000 cases of suspected fraud and lead to the recovery of approximately $14 million a year. That is a nice payoff.
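For readers who want to see the shape of that workflow, here is a minimal sketch in Python. It is not the Texas system or any vendor's software. A single logistic unit, the simplest building block of a neural network, stands in for the real network; the data is synthetic; five parameters stand in for the roughly 100 used in Texas; and every name (`make_case`, `train`, `score`, `flag`) is hypothetical. The point is only the sequence: learn from known cases, score the unreviewed files, and flag the ones that match.

```python
import math
import random


def sigmoid(z):
    # Clamp the input so math.exp cannot overflow on extreme values.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def score(model, x):
    """Return a fraud score between 0 and 1 for one file's parameters."""
    weights, bias = model
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train(cases, labels, epochs=200, lr=0.1):
    """Fit one logistic unit with stochastic gradient descent."""
    weights = [0.0] * len(cases[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(cases, labels):
            err = score((weights, bias), x) - y
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias


def flag(model, files, threshold=0.5):
    """Return the indexes of files whose score meets the threshold."""
    return [i for i, x in enumerate(files) if score(model, x) >= threshold]


def make_case(rng, fraud, n_params=5):
    # ASSUMPTION: synthetic data. Fraud cases cluster around 2.0 and
    # legitimate cases around 0.0 on every parameter.
    center = 2.0 if fraud else 0.0
    return [rng.gauss(center, 1.0) for _ in range(n_params)]


rng = random.Random(0)

# Step 1: learn from cases whose outcome is already known.
known = [(make_case(rng, y == 1), y) for y in [1] * 100 + [0] * 100]
rng.shuffle(known)
cases = [x for x, _ in known]
labels = [y for _, y in known]
model = train(cases, labels)

# Step 2: apply the model to files nobody has reviewed yet.
unlabeled = [make_case(rng, i % 10 == 0) for i in range(1000)]
flagged = flag(model, unlabeled)

# Step 3: hand the flagged files to investigators.
print(len(flagged), "of", len(unlabeled), "files flagged for investigation")
```

The threshold is the policy lever in a sketch like this: raising it sends fewer files to investigators but makes each flag more likely to hold up, which is the trade-off behind a figure like "more than 70 percent proved to be fraud."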

Texas' Medicaid fraud detection system was put together by Electronic Data Systems (Plano, Texas) using applications from HNC Software (San Diego), a leading developer of predictive software, and Intelligent Technologies Corporation (Austin, Texas), a developer of adaptive data mining tools for fraud detection built on pattern recognition technology. It runs on a Silicon Graphics Origin2000 supercomputer using ccNUMA architecture. The architecture's interconnect technology provides the high performance, scalability and tremendous memory bandwidth that are crucial for data-intensive applications like this one. Based on the experience in Texas and similar projects, data mining may be the application that drives supercomputing into the corporate mainstream.

For most of the past decade, supercomputing has been undergoing a drawn-out transition from vector to parallel processing. During this transition, vector supercomputing came to be looked at as "old." At the same time, parallel processing-based supercomputing has been pigeonholed as a not-quite-ready-for-prime-time technology, of interest primarily to computer scientists and engineers. Even today, hardware developments in supercomputing continue to outstrip the software and applications development needed to create a production environment, and various performance claims fall short in real-world environments.

Nevertheless, data warehousing and data mining applications are enabling supercomputing technology to penetrate corporate America. According to IBM officials, about 2,500 of the company's approximately 3,500 SP-based supercomputer installations are in corporations, and many are running data mining and data warehousing applications.

Irving Wladawsky-Berger, General Manager of IBM's Internet Division and a member of President Clinton's Advisory Committee on High Performance Computing and Communications, Information Technology, and the Next Generation Internet, told me in a recent interview at Supercomputing 98 that the reasons are simple. First, prediction based on patterns generated by the analysis of large amounts of data is something that powerful computers do very well. Second, the number of companies with terabytes of stored data is growing rapidly. Third, the rate of data acquisition is exploding. Finally, maximizing the value of the information extracted from the data requires moving beyond parallel processing to truly parallel computing systems, in which all the system components, including file sharing, data storage and I/O, operate in parallel.

For example, Wladawsky-Berger notes, medical organizations have accumulated vast databases of information. As that information is analyzed, patterns will emerge that could help doctors diagnose diseases. "Pattern matching is what computers are good at," he said.

Moreover, as compute power has increased, initiatives to make more information available online have been launched. The folks at the high performance computing division at Compaq point to the Alexandria Digital Library project, a nationwide collaborative project involving the National Science Foundation, NASA and the Department of Defense's Advanced Research Projects Agency that will open access via the World Wide Web to the nation's spatial data repositories. As project principal Larry Carver, the director of the University of California, Santa Barbara's Map and Imagery Laboratory, observes, U.S. agencies spend billions of dollars annually collecting spatial data. The data repositories include multi-terabyte topographic maps and satellite imagery archives that exceed hundreds of terabytes in size. But researchers at only a few organizations can access this data.

The Alexandria project is building a test bed for digital libraries that runs on an AlphaServer 4100, a 64-bit dual-processor SMP computer from Compaq, with a 350-megabyte Digital StorageWorks RAID 5 configuration. Database software is provided by Oracle, Informix, and Sybase, and knowledge management tools come from Excalibur Technology (Vienna, Va.). It is the first to conform to the Federal Geographic Data Committee's spatial metadata specifications.

But analyzing and accessing existing data is only part of the story. Data is also being acquired at an accelerating rate. Wladawsky-Berger likes to point to the Sabre online reservation system as an example of information flooding in simultaneously from hundreds of different points on the network. The inflow must be managed as effectively as the analysis and the access. Consequently, Wladawsky-Berger argues, truly parallel computing systems offer the most promising solution.

Acknowledging that many people have been scared off by grandiose promises in the past, Wladawsky-Berger likes to call these new parallel systems, which implement complex pattern recognition technology for prediction, "deep computing." It is a buzzword that avoids both the term artificial intelligence, which was once the catchphrase for pattern construction and recognition, and supercomputing.

Others in the industry are not enamored with the term "deep computing." Supercomputing, argues William White, Cray Systems product manager at Silicon Graphics, speaks to the most challenging problems in computing, problems such as forecasting the weather. Nevertheless, he agrees, increasingly those challenges for supercomputing will come from the area of strategic business analysis.

About the Author:

Dr. Elliot King is an Assistant Professor of Communications and Director of the New Media Center at Loyola College in Maryland. He can be reached at (410) 356-3943, or by e-mail at eking@loyolanet.campus.mci.net.