Microsoft Extending SQL Server to Data Mining

Microsoft Corp. has its sights set on data mining for SQL Server 2000, much as it set them on OLAP for SQL Server 7.0.

The data mining approach parallels Redmond’s previous OLAP efforts, from the code-name, to the API standardization meetings, to the company’s publicly stated aims in bringing the technology into its database engine.

Data mining vendors are paying close attention to Microsoft’s moves, but industry observers don’t expect the integration to generate the kind of broad interest in data mining among IT and business executives that Microsoft’s OLAP integration generated for OLAP.

OLAP is a complicated technology, but it is relatively easy to understand how to use. By storing data in multidimensional databases, OLAP allows users to view data intuitively, such as first looking at sales figures by region, then drilling into a state or individual salesperson. As an analytical tool, OLAP is driven by the user’s business experience.
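The drill-down described above can be sketched in a few lines. This is only an illustration, not how any OLAP server is implemented: the sales rows, the `LEVELS` hierarchy, and the `rollup` helper are all hypothetical, and a real multidimensional store would typically precompute such aggregates rather than scan rows on demand.

```python
from collections import defaultdict

# Hypothetical sales facts, invented for illustration:
# (region, state, salesperson, amount)
SALES = [
    ("West", "CA", "Ames", 120),
    ("West", "CA", "Bose", 80),
    ("West", "WA", "Cruz", 50),
    ("East", "NY", "Diaz", 200),
    ("East", "MA", "Eng", 70),
]
LEVELS = ("region", "state", "salesperson")


def rollup(facts, level, **filters):
    """Sum amounts grouped by `level`, keeping only rows that match `filters`."""
    idx = LEVELS.index(level)
    totals = defaultdict(int)
    for row in facts:
        if all(row[LEVELS.index(k)] == v for k, v in filters.items()):
            totals[row[idx]] += row[3]
    return dict(totals)


# Start at the top of the hierarchy, then drill into one region and one state.
by_region = rollup(SALES, "region")
west_states = rollup(SALES, "state", region="West")
ca_people = rollup(SALES, "salesperson", region="West", state="CA")
```

Each call narrows the filter by one level, which is the user-driven navigation the paragraph describes: the analyst decides where to look next.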

In data mining, the technology drives the analysis. What data mining tools do is far more sophisticated than what OLAP tools do. A data mining tool uses mathematical models to sniff out correlations in the data that a user would never think to look for. An example of the kind of counter-intuitive relationship a data mining tool might find is the discovery that people who come into a store to buy diapers often buy beer, too.
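One simple way to surface a relationship like diapers and beer is to score every pair of items by lift: how much more often the two appear together than they would if purchases were independent. The sketch below is a minimal illustration of that idea, not the algorithm any vendor named here uses. The `pair_lift` helper and the baskets are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pair_lift(baskets):
    """Return {(a, b): lift} for every item pair that co-occurs at least once.

    lift = P(a and b) / (P(a) * P(b)); values above 1 mean the items appear
    together more often than independence would predict.
    """
    n = len(baskets)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))  # de-duplicate and fix pair ordering
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))
    return {
        (a, b): (count / n) / ((item_counts[a] / n) * (item_counts[b] / n))
        for (a, b), count in pair_counts.items()
    }


# Hypothetical shopping baskets, invented for illustration.
baskets = [
    ["diapers", "beer", "chips"],
    ["diapers", "beer"],
    ["milk", "bread"],
    ["diapers", "beer", "milk"],
    ["bread", "chips"],
    ["milk", "bread", "chips"],
]

lifts = pair_lift(baskets)
ranked = sorted(lifts.items(), key=lambda kv: kv[1], reverse=True)
```

In this toy data the beer and diapers pair ranks first, and nobody had to ask about it: the tool scores every pair and reports the strongest ones.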

SQL Server 7.0, released in early 1999, introduced SQL Server 7.0 OLAP Services, a free bundling of OLAP server technology with every license of the base RDBMS. Huge industry anticipation preceded the launch of the product, still commonly known by its code name Plato. Some analysts say Plato helped cut the OLAP server market down to three major suppliers: Microsoft, Hyperion Solutions Corp. (www.hyperion.com) and Oracle Corp. (www.oracle.com).

When SQL Server 2000 comes out in mid-2000, it will integrate data mining algorithms and APIs in the core database engine. Microsoft calls the initiative Aurum, which is Latin for gold.

SQL Server product manager Barry Goffe says the Aurum teams within the SQL Server group and within Microsoft Research are addressing three problems they perceive with existing data mining technology -- tools are too expensive, tools aren’t integrated with databases, and most data mining algorithms don’t scale well.

"We want to build a great platform that [ISVs’] products can run on top of," Goffe says. "We’re not out there to put the data mining ISVs out of business."

Microsoft is taking vendor comment on an OLE DB for Data Mining specification, much as it did more than a year ago with OLE DB for OLAP.

Microsoft is trying to integrate data mining into its data engine in three ways. Preprocessing functionality would be done in the RDBMS, meaning an ISV’s data mining tools could rely on sophisticated RDBMS methods for cleaning, transforming, and preparing data rather than needing to develop, buy, or include those functions in their tools.

Aurum will also integrate a limited set of data mining algorithms within its data engine for the building of data models, the process within data mining where an analyst iteratively works to build a statistical model that can generate meaningful results. Coupled with this effort, Microsoft is working on data mining APIs to allow third-party tool vendors to access Microsoft’s algorithms and swap in other algorithms.

Finally, Microsoft is making SQL Server 2000 capable of deploying the models. Deployment can consist of an analyst running a sophisticated model against fresh data to find trends, or deployment can be as simple as putting a button in a vertical application -- such as running a customer’s data through a model that predicts whether the customer is a good credit risk or not.

Data mining gained the attention of venture capitalists about two or three years ago. The technology always seemed on the verge of widespread adoption, but it hasn’t happened.

"It was sold as a technology that was magic and was going to make all your woes go away. It turned out to be harder to use," says Herb Edelstein, a data mining analyst at Two Crows Corp. (www.twocrows.com).

One of the major roadblocks to adoption was that statistical backgrounds are rare in IS departments, and training in statistical modeling is a prerequisite for making the tools work, Edelstein says.

There are dozens of data mining companies that sell tools, but the market is shrinking. Some vendors are merging; others are refocusing from horizontal tools to vertical applications.

"It used to be that data mining was going to be the fourth leg in the business intelligence suite [along with OLAP, query and reporting, and enterprise reporting]," says Wayne Eckerson, an analyst at Patricia Seybold Group (www.psgroup.com). "But now it’s being used in two ways. At the real high end of the market, very savvy companies that are doing database marketing use data mining. We’re also seeing lighter weight tools being embedded into applications and pretuned to support applications like CRM applications and e-business-type applications."

Microsoft, meanwhile, is only one of five major database vendors working to pull data mining into the RDBMS data engine. Oracle, the leading RDBMS vendor in market share on Windows NT and Unix platforms, recently bought Darwin, a 4-year-old tool from Thinking Machines Corp. (www.think.com). IBM Corp. (www.ibm.com) sells Intelligent Miner and is working data mining into its cross-platform DB2 engine. NCR Corp. (www.ncr.com) is doing the same with its TeraMiner tool.

Furthest along in its efforts is Compaq Computer Corp. (www.compaq.com). Compaq’s Tandem division set out to build the kind of data mining environment Microsoft is now talking about, with an emphasis on preprocessing efficiency within its NonStop SQL/MX DBMS, which runs on Windows NT. At least two data mining products have been customized to work with NonStop SQL/MX: Clementine, from SPSS (www.spss.com) and Integral Solutions (www.isl.co.uk), and KnowledgeStudio, from Angoss Software (www.angoss.com).

Edelstein believes pulling data mining into the database is an excellent idea for preprocessing and model deployment, but a dubious one for model building.

In the Compaq section of his Data Mining ’99 Technology Report, Edelstein wrote, "Working on a database is important because as much as 80 percent of the time and effort in data mining projects is spent in the variety of data preparation and exploration steps that precede each iteration of model building."

Many tricks and traps lurk within model building. In Edelstein’s view, implementation is key to the success of any of the database vendors’ efforts. "Let’s suppose they decide to put building a neural net as a low-level function in the DBMS. When I build a neural net, I have to do hundreds if not thousands of passes over the data. Suppose this data is scattered over six or 10 tables, each with tens if not hundreds of megabytes. That’s not a real-time operation. If I have to do that every time I pass through the data, forget it," Edelstein says.

That is one of the obvious problems, fairly easily solved through the use of a materialized view -- a feature being included in SQL Server 2000 -- Edelstein says. But many more complicated hurdles must be jumped, he says.
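The materialized-view fix Edelstein mentions can be sketched as follows. This is an illustration under stated assumptions, not SQL Server syntax: it uses Python's built-in `sqlite3` module, and because SQLite has no materialized views, a table built with `CREATE TABLE ... AS SELECT` stands in for one. The table names, columns, and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, income REAL);
CREATE TABLE orders (customer_id INTEGER, amount REAL);
INSERT INTO customers VALUES (1, 40.0), (2, 90.0);
INSERT INTO orders VALUES (1, 10.0), (1, 5.0), (2, 30.0);

-- Stand-in for a materialized view: the join and aggregation run once
-- and the result is stored as a flat table.
CREATE TABLE training_rows AS
  SELECT c.id AS customer_id, c.income AS income, SUM(o.amount) AS total_spent
  FROM customers c JOIN orders o ON o.customer_id = c.id
  GROUP BY c.id, c.income;
""")


def one_pass(connection):
    """One training pass: scan the stored rows without re-joining the sources."""
    return connection.execute(
        "SELECT income, total_spent FROM training_rows ORDER BY customer_id"
    ).fetchall()


# A model builder might make hundreds of passes; each one reads the flat table.
passes = [one_pass(conn) for _ in range(3)]
```

The cost of the join is paid once, when `training_rows` is built, rather than on every pass. The trade-off is that the stored rows must be refreshed when the source tables change, which a true materialized view would handle inside the database.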

Doug Dow, vice president at SPSS, looks at Microsoft’s data mining plans as an opportunity to sell more data mining tools.

"Overall, we feel there is potential for performance gain when these database vendors pull data mining into their engines," Dow says. With Microsoft in particular, "people are going to accept the idea of data mining more. It’s gained the stamp of a major industry player."

Model building within the Microsoft environment will require dedicated tools, and SPSS hopes to sell some of those. Existing data mining tools can leverage the hooks into the database to offload the transformations, data cleaning, and record deduplication to the data engine. SPSS’s Clementine already uses Compaq’s NonStop SQL/MX that way.

Existing tools also can give users a run-time choice in model deployment about whether to deploy the model in the data mining tool or in the database, Dow says.

Tom Camps, vice president for marketing at Cognos Corp. (www.cognos.com), expects Microsoft’s Aurum to speed the trend toward embedding data mining functions in vertical applications.

"It opens up the possibilities around data mining for so many more applications," Camps says. "As any technology moves into the mainstream, people are less interested in integrating technology and more interested in getting value from it. A company may want better fraud detection. They may not want better data mining algorithms."

While the changes Microsoft makes now could affect the applications you use in the future, there’s a good chance you’ll never hear about them. Camps says the conversations vendors engage in with IT managers about OLAP technology won’t occur with data mining because of its complexity.

"The underlying technology will be data mining," Camps predicts. "But we’ll never talk about it."
