Supporting the Data Mining Process with Next Generation Data Mining Systems
Data mining is an emerging technology for the automatic extraction of patterns, associations, changes, anomalies and significant structures from data. Data mining is emerging as one of the key technologies enabling businesses to select, filter, screen, correlate and fuse data automatically.
Most of the value of data mining comes from using data mining technology to improve predictive modeling. For example, data mining can be used to generate predictive models automatically, which predict how much profit prospects and customers will provide and how much risk they entail from fraud, bankruptcy, charge-off and related problems.
The Data Mining Process
Broadly speaking, there are two cultures in data mining: the knowledge discovery (KD) culture and the predictive modeling (PM) culture. In the KD culture, the output is a set of rules. In the PM culture, the output is a set of predictive models. In both cultures, the input is a learning set. The goal of both cultures is to automate as much of the data mining process as possible. In practice, the process is not completely automatic but semi-automated.
For example, assume that cell phone users are divided into three groups: low, medium and high likelihood of switching carriers. A data mining system might extract a rule, such as, "Cell phone users who receive more than two incoming phone calls per day have a low likelihood of changing." To continue the example, a predictive model might then assign two scores to each cell phone user: one, between zero and one, indicating the likelihood that the cell phone user will switch carriers and one indicating the estimated profit from the user over the next year. This information may then be used to focus attention on high profit users who are likely to switch carriers.
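The cell phone example can be sketched in a few lines of Python. This is only an illustration: the `User` record, the thresholds `min_churn` and `min_profit`, and the sample figures are assumptions made for the sketch, not values from any real system.

```python
from dataclasses import dataclass

@dataclass
class User:
    user_id: str
    incoming_calls_per_day: float
    churn_score: float      # likelihood of switching carriers, between 0 and 1
    expected_profit: float  # estimated profit from the user over the next year

def churn_group_by_rule(user):
    """KD-style output: the rule quoted in the text. Other users stay unclassified."""
    if user.incoming_calls_per_day > 2:
        return "low"
    return "unknown"

def users_to_focus_on(users, min_churn=0.5, min_profit=500.0):
    """PM-style output in use: high-profit users likely to switch, most profitable first."""
    selected = [
        u for u in users
        if u.churn_score >= min_churn and u.expected_profit >= min_profit
    ]
    return sorted(selected, key=lambda u: u.expected_profit, reverse=True)

# Hypothetical users; the two scores would normally come from a predictive model.
users = [
    User("a", 5.0, 0.10, 900.0),
    User("b", 0.5, 0.80, 1200.0),
    User("c", 1.0, 0.70, 150.0),
    User("d", 0.2, 0.65, 700.0),
]
focus = users_to_focus_on(users)
print([u.user_id for u in focus])
```

The rule labels a group, while the two scores allow users to be ranked, which is what makes it possible to focus on a short list.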
It is helpful to view the data mining and predictive modeling process as consisting of four broad phases, as illustrated in Table 1. In the first phase, data from a variety of sources is cleaned, normalized, transformed and merged to create a data warehouse suitable for data mining. In the second phase, data mining algorithms are applied to data extracted from the data warehouse to produce predictive models. Before this can be done, the data is usually enriched by adding attributes to the data warehouse that are statistically derived from the data. A large company typically has a variety of predictive models and scores about customers and their transactions, some produced internally and some purchased from third parties. In the third phase, these different predictive models are analyzed and, if necessary, combined to produce a single aggregate model. In the fourth phase, the appropriate predictive models are used for real-time and batch scoring of operational and transactional data.
Data Warehousing. A data warehouse is a system for managing data for decision support. Data is gathered, cleaned and transformed from operational systems and third-party sources to create the data warehouse.
Data Mining. In the data mining phase, data is extracted from a data warehouse and used to produce a predictive model or rule set. This phase can be automated.
Predictive Modeling. In the predictive modeling phase, one or more predictive models are selected or combined in order to produce an optimal model. The models may be from data mining systems, may be produced by statisticians or modelers, or may be purchased from third parties.
Predictive Scoring. In this phase, the predictive models are used to score operational and transactional data.
Table 1: The data mining and predictive modeling process consists of four major phases
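As a rough sketch, the four phases can be written as four small Python functions chained together. Everything here is an assumption made for illustration: the record fields `spend` and `bought`, the one-threshold "algorithm" standing in for a real data mining algorithm, and the constant third-party score.

```python
from statistics import mean

def build_warehouse(sources):
    """Phase 1: merge the sources and drop records with missing values."""
    merged = [row for source in sources for row in source]
    return [r for r in merged if r.get("spend") is not None and r.get("bought") is not None]

def mine(warehouse):
    """Phase 2: learn a one-threshold model (a trivial stand-in for a real algorithm)."""
    threshold = mean(r["spend"] for r in warehouse)
    above = [r["bought"] for r in warehouse if r["spend"] >= threshold]
    below = [r["bought"] for r in warehouse if r["spend"] < threshold]
    rate_above = mean(above) if above else 0.0
    rate_below = mean(below) if below else 0.0
    return lambda row: rate_above if row["spend"] >= threshold else rate_below

def combine(models):
    """Phase 3: aggregate several models by averaging their scores."""
    return lambda row: mean(model(row) for model in models)

def score(model, rows):
    """Phase 4: batch-score operational data."""
    return [model(row) for row in rows]

# Hypothetical internal and third-party sources.
internal = [{"spend": 10.0, "bought": 0}, {"spend": 200.0, "bought": 1},
            {"spend": None, "bought": 1}]
third_party = [{"spend": 20.0, "bought": 0}, {"spend": 180.0, "bought": 1}]

warehouse = build_warehouse([internal, third_party])
mined_model = mine(warehouse)
purchased_model = lambda row: 0.5  # hypothetical score bought from a third party
aggregate = combine([mined_model, purchased_model])
scores = score(aggregate, [{"spend": 150.0}, {"spend": 5.0}])
print(scores)
```

Averaging is only one way to combine models; the text leaves the combination method open.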
Two Simple Examples
As a first example, suppose our goal is to predict, for each customer in a specified group, the likelihood that the customer will buy a particular product. Assume that we have a variety of data about each customer in that group, some of which may be quite complex, such as time series data about what they have bought in the past and when they bought it. We will assign a likelihood "y," which ranges from 0 to 1. The higher the number, the more likely the customer is to buy the product. In practice, we would do this for many products and return the two or three products most likely to be purchased next.
To prepare the data for the data warehouse, we would generally begin by flattening the data about the customer (by throwing away structure, such as the time series information) to create a vector "x" for each customer. These are the data attributes. For example, the amount customers spent in the past quarter may be a typical data attribute.
As part of the data mining phase, we can then compute additional attributes for each customer called derived attributes. For example, we may compute whether the rate at which the customer is purchasing products has increased or decreased over the past year. The combination of the data attributes and the derived attributes creates for each customer what is usually called a feature vector "x."
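The following Python sketch shows one way a feature vector might be assembled from a purchase history. The representation of purchases as (day number, amount) pairs, the 90-day quarter and the half-year comparison are all assumptions chosen for the illustration.

```python
def feature_vector(purchases, today):
    """Build [data attribute, derived attribute] from (day_number, amount) pairs."""
    # Data attribute: amount spent in the past quarter (taken here as 90 days).
    spend_last_quarter = sum(
        amount for day, amount in purchases if today - 90 < day <= today
    )
    # Derived attribute: change in purchase count between the two halves of the
    # past year. Positive means the purchase rate has increased.
    recent_half = sum(1 for day, _ in purchases if today - 182 < day <= today)
    earlier_half = sum(1 for day, _ in purchases if today - 365 < day <= today - 182)
    rate_trend = recent_half - earlier_half
    return [spend_last_quarter, float(rate_trend)]

# Hypothetical purchase history for one customer.
purchases = [(10, 20.0), (100, 35.0), (250, 15.0), (300, 40.0), (350, 25.0)]
x = feature_vector(purchases, today=365)
print(x)
```

Note that the time series itself is discarded; only the summary values survive in "x."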
In the simplest case, data mining from the PM point of view may be thought of as automatically computing a predictive model from the data, which will predict the likelihood "y" that a customer will buy a specified product, given the feature vector "x."
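To make the mapping from "x" to "y" concrete, here is a minimal sketch that fits a logistic regression by gradient descent. Logistic regression is used only as a familiar stand-in; the text does not name a particular algorithm, and the learning set below is invented for the illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, steps=2000, learning_rate=0.1):
    """Fit weights w and bias b by batch gradient descent on the log loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(steps):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            error = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += error * xj
            grad_b += error
        w = [wj - learning_rate * g / n for wj, g in zip(w, grad_w)]
        b -= learning_rate * grad_b / n
    return w, b

def likelihood(model, x):
    """Return y, the predicted likelihood of purchase, for feature vector x."""
    w, b = model
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Hypothetical learning set: [spend in hundreds, purchase-rate trend] -> bought (1) or not (0).
X = [[0.2, -1.0], [0.5, 0.0], [0.1, -2.0], [1.5, 1.0], [2.0, 2.0], [1.8, 1.0]]
y = [0, 0, 0, 1, 1, 1]

model = fit_logistic(X, y)
y_likely = likelihood(model, [1.7, 1.0])
y_unlikely = likelihood(model, [0.1, -1.0])
print(y_likely, y_unlikely)
```

Running the same procedure once per product and sorting customers' scores would give the "two or three most likely products" described earlier.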
It is common to buy third-party data and scores about customers. This data may be added to the feature vector in the data warehousing phase and the scores may be combined in the predictive modeling phase with the scores produced from data mining. With predictive scoring, customer service representatives can answer incoming calls, and not only identify the customer's prior purchasing patterns, but even suggest new products and services the customer may want to consider buying.
As another example, data mining can be used to assign a likelihood that a credit card transaction is fraudulent. In this case, derived attributes might include the number of credit card transactions during the past three hours. It is helpful to use multiple predictive models instead of a single predictive model for this type of problem. For example, a separate predictive model could be used for individuals who travel frequently or who have a gold or platinum card.
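A small Python sketch of the fraud example follows. The two scoring functions, their divisors and the card fields `frequent_traveler` and `tier` are hypothetical; a real system would use models produced by data mining rather than hand-written formulas.

```python
from datetime import datetime, timedelta

def transactions_in_past_three_hours(timestamps, now):
    """Derived attribute: number of transactions in the three hours before 'now'."""
    window_start = now - timedelta(hours=3)
    return sum(1 for t in timestamps if window_start <= t <= now)

def frequent_traveler_model(n_recent):
    # Hypothetical model: bursts of activity are less suspicious for this segment.
    return min(1.0, n_recent / 20.0)

def default_model(n_recent):
    # Hypothetical model for all other cardholders.
    return min(1.0, n_recent / 8.0)

def fraud_likelihood(card, timestamps, now):
    """Pick the model for the cardholder's segment, then score the transaction."""
    n_recent = transactions_in_past_three_hours(timestamps, now)
    special = card["frequent_traveler"] or card["tier"] in {"gold", "platinum"}
    model = frequent_traveler_model if special else default_model
    return model(n_recent)

now = datetime(2024, 1, 1, 12, 0)
timestamps = [now - timedelta(minutes=m) for m in (5, 30, 90, 170, 200, 400)]

standard = fraud_likelihood({"frequent_traveler": False, "tier": "standard"}, timestamps, now)
gold = fraud_likelihood({"frequent_traveler": False, "tier": "gold"}, timestamps, now)
print(standard, gold)
```

The same activity produces different scores for different segments, which is the point of keeping separate models.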
Four Generations of Data Mining Systems
Roughly speaking, second generation data mining systems can mine data from databases and data warehouses, while third generation data mining systems can mine data from intranets and extranets. Fourth generation data mining systems can mine data from mobile, embedded and ubiquitous computing devices. (See Table 2 for a summary.)
First Generation Systems. The first generation of data mining systems supports a single algorithm or a small collection of algorithms that are designed to mine vector-valued data. Recently, several of these systems have been commercialized.
Second Generation Systems. Today, research is focused on improving first-generation systems and developing second-generation data mining systems. A second-generation system is characterized by supporting high performance interfaces to databases and data warehouses, and by providing increased scalability and increased functionality. For example, second generation systems can mine larger data sets, more complex data sets and data sets in higher dimensions than those of first generation systems. They provide increased flexibility by supporting a data mining schema and a data mining query language (DMQL).
Third Generation Systems. Third-generation data mining systems are characterized by being able to mine the distributed and highly heterogeneous data found on intranets and extranets, and being able to integrate efficiently with operational systems. One of the key capabilities that makes this possible is first-class support for multiple predictive models and for the meta-data required to work with multiple predictive models built on heterogeneous systems.
Third generation data mining and predictive modeling systems are different from search engines. Search engines simply provide a means of locating data on the net, whereas data mining and predictive modeling systems provide a means for discovering patterns, associations, changes and anomalies in networked data.
Fourth Generation Systems. Fourth-generation data mining systems are characterized by being able to mine data generated by embedded, mobile and ubiquitous computing devices. For example, a sales rep using a mobile computing device can enter information at a client's office. A fourth generation data mining system could then provide an appropriate cross-selling suggestion.
First Generation
  Distinguishing Characteristic: data mining viewed as stand-alone application
  Data Mining Algorithms: supports one or more algorithms
  Integration: stand-alone systems

Second Generation
  Distinguishing Characteristic: integrated with databases and data warehouses
  Data Mining Algorithms: multiple algorithms; can mine data which does not fit into the memory of a single machine
  Integration: data management systems, including databases and data warehouses
  Distributed Computing Model: homogeneous, local area cluster
  Data Model: some systems support objects, text and continuous media data

Third Generation
  Distinguishing Characteristic: integrates predictive modeling
  Integration: data management & predictive modeling systems
  Distributed Computing Model: network computing over intranets and extranets
  Data Model: support for semi-structured and web data

Fourth Generation
  Distinguishing Characteristic: incorporates mobile & ubiquitous data
  Integration: data management, predictive modeling & mobile systems
  Distributed Computing Model: mobile and ubiquitous computing model
  Data Model: ubiquitous data model
Table 2: Four generations of data mining systems
Roughly speaking, second-generation data mining systems provide an efficient interface between data warehouses and data mining systems, while third-generation data mining systems provide, in addition, an efficient interface between the data mining systems and predictive modeling systems.
Work on understanding the appropriate interface between data management systems and data mining systems is sometimes viewed as the problem of identifying appropriate data mining primitives. Data mining primitives would be executed in the data warehouse or database to improve the performance of data mining systems.
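One simple candidate for such a primitive is counting class labels per attribute value inside the database, so that only a small table of counts, rather than every row, is passed to the data mining system. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the table `customers` and its columns are hypothetical, and whether this particular operation belongs among the primitives is an assumption of the example.

```python
import sqlite3

# Hypothetical warehouse table: one row per customer.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (plan TEXT, switched INTEGER)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [("basic", 1), ("basic", 1), ("basic", 0),
     ("premium", 0), ("premium", 0), ("premium", 1)],
)

def class_counts(connection):
    """Run the counting step in the database and return {(plan, switched): count}."""
    rows = connection.execute(
        "SELECT plan, switched, COUNT(*) FROM customers "
        "GROUP BY plan, switched ORDER BY plan, switched"
    ).fetchall()
    return {(plan, switched): n for plan, switched, n in rows}

counts = class_counts(conn)
print(counts)
```

Counts of this kind are the raw material for several common algorithms, such as decision trees and naive Bayes classifiers, so computing them where the data lives avoids moving the full data set.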
There has been some work recently on understanding the appropriate interface between data mining systems and predictive modeling systems. An XML markup language called the Predictive Model Markup Language (PMML) has been proposed as a suitable interface.
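The idea of exchanging models as XML can be sketched as follows. The element names are borrowed from PMML's regression model, but the document below is deliberately stripped down: it omits the header, data dictionary and mining schema that a valid PMML document requires, so it should be read as an illustration of the interface, not as schema-valid PMML. The function names and the sample coefficients are hypothetical.

```python
import xml.etree.ElementTree as ET

def model_to_xml(intercept, coefficients):
    """Data mining side: serialize a linear model as a simplified, PMML-like document."""
    root = ET.Element("PMML")
    model = ET.SubElement(root, "RegressionModel", functionName="regression")
    table = ET.SubElement(model, "RegressionTable", intercept=repr(intercept))
    for name, coefficient in coefficients.items():
        ET.SubElement(table, "NumericPredictor", name=name, coefficient=repr(coefficient))
    return ET.tostring(root, encoding="unicode")

def scorer_from_xml(text):
    """Operational side: rebuild a scoring function from the document alone."""
    table = ET.fromstring(text).find("./RegressionModel/RegressionTable")
    intercept = float(table.get("intercept"))
    coefficients = {
        p.get("name"): float(p.get("coefficient"))
        for p in table.findall("NumericPredictor")
    }
    return lambda row: intercept + sum(c * row[name] for name, c in coefficients.items())

document = model_to_xml(0.5, {"spend": 0.25, "visits": 2.0})
score_row = scorer_from_xml(document)
value = score_row({"spend": 4.0, "visits": 3.0})
print(document)
print(value)
```

The operational system needs only the XML text, not the software that produced the model, which is what makes a shared format useful as an interface.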
Implementing data mining solutions for data that is relatively static and small enough to fit into the memory of a single machine is straightforward today. There are a variety of first generation data mining systems available, and success will depend more upon the quality of the team and of the data than upon the system selected.
If the data is large enough or changes rapidly enough to demand a database or data warehouse to adequately manage it, then a second generation data mining system is necessary. Unfortunately, today's data warehouses were designed for OLAP applications, not data mining applications. This means that true second generation data mining systems must use their own specialized data management systems as a stop gap measure until database and data warehouse vendors can provide adequate support for the appropriate data mining primitives. You should check that second generation systems produce PMML or a similar open format to facilitate incorporation of the results into operational systems.
If you are using multiple predictive models today or if predictive models need to be updated frequently, then you should seriously consider emerging third generation data mining systems, which directly support these capabilities as well as integrate with databases and data warehouses. An important benefit of a third generation data mining and predictive modeling system is that predictive models produced by the data mining system can be ingested automatically by operational systems incorporating the appropriate predictive modeling module. This can dramatically reduce the total cost and the total time required to complete a project.
The professionals in many companies today are increasingly mobile, yet require information that is more and more timely. Fourth generation data mining systems can play an important role here. The best way to integrate data mining with mobile computing is an area of research today.
First generation data mining systems are still immature, while second and third generation systems are just emerging. I don't know of any fourth generation systems. On the other hand, for many problems, the return on investment for a successful data mining project is so tremendous that the pain and effort required to work with the emerging technology is quickly forgotten.
The Future of Data Mining
Using software to build predictive models is not new. What is new is the need to automate more and more of this process which begins with the collection and warehousing of the appropriate data and ends with the incorporation of the appropriate predictive model into operational systems.
What data mining tries to do is to automate key parts of this process, not necessarily because it is the best thing to do, but simply because there is too much data these days and not enough folks to model it. To say it slightly differently, while models built individually by great statisticians are always better, there is simply too much data and not enough great statisticians. It is in this sense that the emergence of data mining is inevitable.
Data mining is a key enabling technology for a wide variety of business, medical and engineering applications. Second, third and fourth generation data mining and predictive modeling systems will probably merge with data warehouses to provide integrated systems for managing business processes. Also, as second, third and fourth generation data mining technologies continue to develop and mature, expect to see them incorporated as embedded technologies in a variety of different applications.
The data mining process is a complex, semi-automated process for knowledge discovery and predictive modeling of data. It consists of four main phases: data warehousing, data mining, predictive modeling and predictive scoring. Concretely, data mining can be thought of as extracting a learning set from a data warehouse and applying a data mining algorithm to produce a predictive model. Second generation data mining systems provide an efficient interface between data warehouses and data mining systems, while third generation data mining systems provide an efficient interface between data mining systems and predictive modeling systems, and extend data mining to intranets and extranets. Fourth generation systems will extend this support to mobile computing devices.
ABOUT THE AUTHOR:
Robert Grossman is the President of Magnify, Inc. (Chicago, Ill.) and the Director of the National Center for Data Mining at the University of Illinois at Chicago. He has been a leader in the development of high-performance and wide area data mining systems for over 10 years.