An Intro to Data Mining Part 2: Analyzing the Tools and Techniques
Data mining is one of the hottest topics in information technology. The first part of this two-part series (April 2000, Enterprise Systems Journal, page 32) focused on what data mining is (and isn’t), why it is important, and how it can be used to provide increased understanding of critical data relationships in rapidly expanding corporate data warehouses. This second part looks at the tools and techniques used in data mining, and the issues surrounding implementations.
Data mining applications can be described in terms of a three-level architecture: applications, approaches, and algorithms and models. These three layers are discussed in the following subsections; the characteristics of the data repository are addressed under "Implementation Issues."
The application level maps the domain-specific attributes into the application space. An appropriate approach is selected, consistent with the application’s goals. Finally, one or more models or algorithms are selected to implement the chosen approach.
Data mining applications can be grouped, by class, into sets of problems that have similar characteristics across different application domains. Table 1 lists sample applications by class for each of the traditionally IS-intense industries. While the fundamental problem characteristics are similar across the application domains, the parameterization of the applications is distinct from industry to industry, and from application to application. Hence, the same approaches and underlying models used to develop a fraud detection capability for a banking enterprise can be used to develop medical insurance fraud applications. The difference lies in how the approaches and models are parameterized (i.e., which of the domain-specific attributes in the data repository are used in the analysis, and how they’re used).
Table 1: Some Representative Applications of Data Mining Technology by Industry

- Profile-based ("targeted") marketing
- Credit risk analysis; customer retention; targeted marketing; portfolio analysis/risk assessment
- Fraud detection; customer retention; targeted marketing
- Customer retention; targeted marketing
- Healthcare & Insurance: Fraud detection; best practices; customized care
Each data mining application class is supported by a set of algorithmic approaches used to extract the relevant relationships in the data. These approaches differ in the classes of problems they are able to solve.
Association. Association approaches address a class of problems typified by market basket analysis. Classic market basket analysis treats the purchase of a number of items as a single transaction. The desire is to find sets of items that are frequently purchased together, in order to understand and exploit natural buying patterns. This information can be used to adjust inventories, modify floor or shelf layouts, or introduce targeted promotional activities to increase sales or move specific products.
These approaches had their origins in the retail industry, where they were used to analyze the purchase of goods. The techniques are more general and can be applied equally well to the purchase of services, to develop targeted marketing campaigns or determine common (or uncommon) practices. In the financial sector, these approaches can be used to analyze customers’ account portfolios and identify sets of financial services that people often purchase together. This might be used, for example, to create a service "bundle" as part of a promotional sales campaign.
Association-based approaches often express the resultant item affinities in terms of confidence-rated rules, such as "80 percent of all transactions in which beer was purchased also included potato chips." Confidence thresholds can typically be set to eliminate all but the most common trends. The results of the association analysis (i.e., the attributes involved in the rules themselves) may trigger additional inquiries.
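The rule-generation step described above can be sketched in a few lines. The transactions below are invented for illustration, and the pairwise counting is a deliberately simplified stand-in for full frequent-itemset algorithms such as Apriori, which handle itemsets of any size.

```python
from itertools import combinations
from collections import Counter

# Hypothetical transactions: each is the set of items in one market basket.
transactions = [
    {"beer", "chips", "salsa"},
    {"beer", "chips"},
    {"beer", "diapers"},
    {"chips", "salsa"},
    {"beer", "chips", "diapers"},
]

# Count individual items and co-occurring pairs across all baskets.
item_counts = Counter()
pair_counts = Counter()
for basket in transactions:
    for item in basket:
        item_counts[item] += 1
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Emit confidence-rated rules "lhs -> rhs" above a confidence threshold,
# e.g. "75 percent of transactions containing beer also included chips."
CONFIDENCE_THRESHOLD = 0.6
rules = {}
for (a, b), joint in pair_counts.items():
    for lhs, rhs in ((a, b), (b, a)):
        confidence = joint / item_counts[lhs]
        if confidence >= CONFIDENCE_THRESHOLD:
            rules[(lhs, rhs)] = confidence
```

Raising the threshold eliminates all but the most common trends, exactly as described above.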
Sequence-based Analysis. Traditional market basket analysis deals with a collection of items as part of a point-in-time transaction. A variant of this problem occurs when there is additional information to tie together a sequence of purchases (e.g., an account number, a debit or credit card, or a frequent buyer/flyer number) in a time series. In this situation, it may not only be the coexistence of items within a transaction that is important, but also the order in which those items appear across ordered transactions and the amount of time between transactions.
In healthcare, such methods can be used to identify routine and exceptional courses of treatment by identifying the common and uncommon succession of multiple procedures over time.
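A minimal sketch of this kind of sequence analysis, assuming hypothetical time-stamped events tied together by an account number; real tools also weigh the elapsed time between transactions.

```python
from collections import Counter, defaultdict

# Hypothetical time-stamped events: (account_id, timestamp, procedure).
events = [
    ("acct1", 1, "x-ray"), ("acct1", 2, "cast"), ("acct1", 5, "physio"),
    ("acct2", 1, "x-ray"), ("acct2", 3, "cast"),
    ("acct3", 2, "x-ray"), ("acct3", 4, "physio"),
]

# Group events into per-account sequences, ordered by time.
sequences = defaultdict(list)
for account, when, item in sorted(events, key=lambda e: (e[0], e[1])):
    sequences[account].append(item)

# Count how often one procedure is immediately followed by another,
# revealing common (and uncommon) successions of procedures over time.
followed_by = Counter()
for seq in sequences.values():
    for a, b in zip(seq, seq[1:]):
        followed_by[(a, b)] += 1
```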
Clustering. Clustering approaches address segmentation problems. These approaches support the assignment of records with a large number of attributes into a relatively small set of groups or "segments." This assignment process is performed automatically by clustering algorithms that identify the distinguishing characteristics of the dataset, then partition the n-dimensional space defined by the dataset attributes along natural cleaving boundaries. There is no need for the user to identify, a priori, the desired groupings or the attributes that should be used to segment the dataset.
Clustering is often one of the first steps undertaken in data mining analyses. It identifies groups of closely related records that can be used as a starting point for exploring further relationships of interest. This technique is used to support the development of population segmentation models, such as demographic-based customer segmentation. Additional analyses can be used to determine the characteristics of these segments with respect to some desired outcome.
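As a sketch of automatic segmentation, the following implements a bare-bones k-means clustering over invented two-attribute records; production tools use considerably more sophisticated algorithms and operate over many more dimensions.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Partition 2-D points into k segments by nearest-centroid assignment."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k random records
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: (p[0] - centroids[i][0]) ** 2
                              + (p[1] - centroids[i][1]) ** 2,
            )
            clusters[nearest].append(p)
        # Move each centroid to the mean of its assigned points.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return centroids, clusters

# Two obvious customer segments (e.g., rescaled age vs. balance attributes).
points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1),
          (9.0, 9.0), (9.2, 8.8), (8.9, 9.1)]
centroids, clusters = kmeans(points, k=2)
```

The resulting segments can then serve as the starting point for further analyses, as noted above.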
Classification. Classification is the most commonly applied data mining technique. It employs a set of classified examples (a training set) to develop a model that can be used as a classifier over the population of records at large. Fraud detection and credit risk applications are examples of the types of problems well suited to this type of analysis. The use of classification algorithms begins with a training set of preclassified example transactions. For a fraud detection application, this would include complete records of activities, classified as either valid or fraudulent on a case-by-case basis. The classifier training algorithm uses these preclassified examples to determine the set of parameters required for proper discrimination, encoding these parameters into a model called a classifier. The classifier is then tested to determine the quality of the model.
Once an effective classifier has been developed, it is then used in a predictive mode to classify new records into these same predefined classes. Classifiers can be developed using one of a number of algorithms that fall into two categories: decision trees and neural networks. The selection of a particular approach has implications for a number of factors relating to the training and use of the classifier, including the number of examples required for training, susceptibility to "noisy" data, and the explanation capability of the system.
For decision tree-based classifiers, the parameters that define the classification model are the set of attributes that comprise the input vectors, and hence are in the same vocabulary as the domain attributes. With syntactic transformations, they can be put into sets of "if-then" rules that allow for "explanation" of the classifier’s reasoning in human terms.
For neural network-based classifiers, model parameters are a set of weighting functions on the network layer interconnections – a purely mathematical formulation that is opaque when it comes to explaining the basis for classification. In some domains, the explanation capability can be critical to the acceptance of the classifier as a useful tool; in other domains, the classifier’s success rate and effect on the business is the only metric of success.
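The train-then-classify cycle can be illustrated with a deliberately minimal decision-tree classifier: a one-level "stump" trained on an invented, preclassified fraud dataset. Real decision-tree algorithms recurse to build multi-level trees; this sketch keeps only the core idea of deriving discrimination parameters from preclassified examples.

```python
def train_stump(examples):
    """Train a one-level decision tree on (features, label) pairs.

    Picks the (attribute, threshold) split that minimizes misclassifications
    over the preclassified training set, encoding those parameters into a
    classifier function.
    """
    best = None
    n_attrs = len(examples[0][0])
    for attr in range(n_attrs):
        for threshold in sorted({x[attr] for x, _ in examples}):
            for left_label in (0, 1):
                errors = sum(
                    1 for x, y in examples
                    if (left_label if x[attr] <= threshold
                        else 1 - left_label) != y
                )
                if best is None or errors < best[0]:
                    best = (errors, attr, threshold, left_label)
    _, attr, threshold, left_label = best
    return lambda x: left_label if x[attr] <= threshold else 1 - left_label

# Hypothetical training set: (transaction amount, hour of day) -> 1 = fraud.
training = [
    ((120.0, 14), 0), ((80.0, 10), 0), ((60.0, 16), 0),
    ((950.0, 3), 1), ((990.0, 2), 1), ((870.0, 4), 1),
]
classifier = train_stump(training)
```

Once trained, the classifier is used in a predictive mode on new records, e.g. `classifier((900.0, 1))`. Because the learned split is expressed directly over a domain attribute, it translates into an "if-then" rule of the kind described above.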
Estimation. A variation on the classification problem involves the generation of metrics, often referred to as scores, along various dimensions in the data. Rather than employing a binary classifier to determine whether a loan applicant is a "good" or "bad" risk, this approach would generate a creditworthiness "score" based on a pre-scored training set. The method differs from algorithmic scoring approaches in that scoring functions are derived from attributes that occur in the data but may not directly support computation according to an arithmetic formula.
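As an illustration of deriving a scoring function from pre-scored examples, the sketch below fits a least-squares line to an invented, single-attribute training set; real scoring models combine many attributes and use more elaborate estimators.

```python
# Hypothetical pre-scored training set: (debt ratio) -> analyst-assigned score.
prescored = [(0.1, 90.0), (0.2, 80.0), (0.4, 60.0), (0.5, 50.0)]

# Fit score = a * x + b by ordinary least squares over the training set:
# the scoring function is derived from the data rather than fixed a priori.
n = len(prescored)
mean_x = sum(x for x, _ in prescored) / n
mean_y = sum(y for _, y in prescored) / n
a = sum((x - mean_x) * (y - mean_y) for x, y in prescored) \
    / sum((x - mean_x) ** 2 for x, _ in prescored)
b = mean_y - a * mean_x

def score(x):
    """Score a new applicant along the learned dimension."""
    return a * x + b
```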
Other Techniques. The techniques previously described are those predominantly used in today’s data mining tools. Other approaches include case-based reasoning, fuzzy logic, genetic algorithms, and fractal-based transforms.
Algorithms and Models
Within each of these general approaches, there are a number of specific algorithms or models that may be applied. Each has its own strengths and weaknesses regarding the problem characteristics best addressed, discrimination capabilities, performance and training requirements. For example, different classification solutions are available for 2-class and N-class problems, and a variety of algorithms (both statistical and neural network-based) may be employed. The algorithms are often tunable using a variety of parameters aimed at providing the right balance of fidelity and performance.
Implementation Issues. Having introduced a number of the approaches and algorithms employed in data mining applications, we now turn to the many considerations associated with implementing them.
The development of large data warehouses is driving the need for automated data mining approaches. The data warehouse is the typical repository against which today’s data mining applications are developed; these warehouses are often implemented using large parallel relational database engines on open systems platforms (large symmetric multiprocessors [SMPs], SMP clusters, and massively parallel processors [MPPs]). Today’s data warehouses range from a few gigabytes to a few terabytes and are spread across many relational tables, with the largest tables in excess of 100 GB. These systems are typically configured with sufficient concurrent I/O capability to ensure very strong relational scan performance.
Today’s data mining tools have evolved from pattern recognition and artificial intelligence research efforts of government- and industry-funded research laboratories. These tools have a heavy algorithmic component, and are often rather "bare" with respect to user interfaces, execution control and model parameterization. They typically ingest and generate UNIX flat files (both control and data files), and are implemented using a single-threaded computational model.
This state of affairs presents a number of challenges to users, which can be summed up as a "tools gap." The gap stems from a number of factors, and bridging it requires significant pre- and post-processing to get the most out of a data mining application. Preprocessing, or conditioning, activities include selection of appropriate data subsets for performance and consistency reasons, and sometimes complex data transformations to bridge the representational gap. Post-processing often involves subselection of voluminous results and the application of visualization techniques to provide added understanding. These activities are critical to effectively addressing key implementation issues on multiple levels.
Users, data mining tools, and SQL-based relational databases each "speak their own language" when it comes to describing the fundamental aspects of data mining applications.
Some of the more common implementation issues include results interpretation, data selection and representation, and system implementation and scalability considerations, as described below.
Inability to "explain" results in human terms. Many of the tools employed in data mining analysis use complex algorithms that operate in an N-dimensional space that is very different from the original data’s attribute space. The ability of these systems to "explain" their results in human terms is minimal. Even with approaches that are capable of analyzing the underlying attributes, the volume and format of the results may be unusable without additional post-processing.
Data Selection and Representation
Ensuring appropriate attribute "coverage." As with all analytical techniques, meaningful results are obtained only if one has sufficient data to answer the question posed. The attraction of a data warehouse as a source for data mining applications is that it includes information from a wide variety of sources. Despite the theoretical ability of data mining tools to automatically ferret out relevant attributes, the task of determining which source attributes to analyze remains a practical implementation issue. This may require iterative approaches with different data subsets.
Ensuring adequate training sets. For algorithms requiring training sets (classification problems), it is critical these records adequately cover the population at large, lest they introduce a misleading bias into the development of the classifier. This may lead to iterative approaches as users strive to find reasonably sized training sets that ensure adequate population coverage.
Data representation and transformation decisions. Most of the source data for today’s data mining applications reside in large, parallel relational database systems. The information is typically normalized, so the attributes used in a data mining application may span multiple tables. Normalization is a process in database design in which attributes, such as customer address, account balance, and other account information, are separated into multiple tables for a variety of implementation considerations. These considerations include efficiency of access to relevant attributes, minimal replication of data, and appropriate concurrency control during update. Almost universally, the data mining tools tend to look at long vectors of attributes that have been "put back together again," or denormalized.
The data mining engines operate over a set of attribute "vectors" presented through a UNIX flat file. Conditioning code must be used to provide the denormalized representation the tools need. This may require multiple passes over the input data if complex aggregations are required, or if the database engine does not support an OUTER JOIN operation. (Traditional join operations return result rows only if there are rows matching the join predicate in all contributing tables. An outer join generates a result row if any of the contributing tables have an entry, and null values fill the remaining columns.)
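The conditioning step can be sketched in Python. The two in-memory "tables" below stand in for normalized warehouse tables, and the `None` fill emulates the outer-join behavior described above.

```python
# Hypothetical normalized "tables" extracted from the warehouse.
customers = {"c1": {"age": 34}, "c2": {"age": 51}, "c3": {"age": 45}}
balances = {"c1": {"balance": 1200.0}, "c2": {"balance": 300.0}}  # c3 missing

# Denormalize into the flat attribute vectors the mining tools expect.
# Customers with no matching balance row get a null fill, as an outer
# join would produce, rather than being dropped from the result.
vectors = []
for cust_id in sorted(customers):
    age = customers[cust_id]["age"]
    balance = balances.get(cust_id, {}).get("balance", None)  # outer-join fill
    vectors.append((cust_id, age, balance))
```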
Many of the tools are constrained by the types of data elements with which they can work. Users may have to make continuous variables discrete, or remap categorical variables in order to satisfy the input constraints of a particular data mining tool. Because most data mining tools operate only on time invariant attributes, analysts often have to transform time series data into a parameterized representation that can be inserted into the analysis "vector."
It is impossible to know a priori which parameterizations retain the critical information, so the utility of a particular parameterization must be examined in context. Additionally, the encoding of time series information often requires transformations not easily described in SQL. Generation of the analysis vectors may be more complex as well.
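Both transformations can be sketched as follows, with invented cut points and one simple, arbitrary choice of time-series parameters; as noted above, whether a given parameterization retains the critical information must be examined in context.

```python
# Conditioning sketch: discretize a continuous attribute and collapse a
# time series into fixed summary parameters for the analysis "vector."

def discretize(value, cutpoints):
    """Map a continuous value to the index of its bin."""
    for i, cut in enumerate(cutpoints):
        if value <= cut:
            return i
    return len(cutpoints)

def summarize_series(series):
    """One possible parameterization: (mean, min, max, last step's change)."""
    mean = sum(series) / len(series)
    trend = series[-1] - series[-2] if len(series) > 1 else 0.0
    return (mean, min(series), max(series), trend)

# Hypothetical attributes: annual income and a monthly balance history.
income_bin = discretize(42_000, cutpoints=[20_000, 50_000, 100_000])
monthly_balance = [800.0, 650.0, 700.0, 400.0]
balance_params = summarize_series(monthly_balance)
```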
Susceptibility to "dirty" data. Data mining tools have no higher level model of the data on which they operate and no application-oriented (semantic) structure. As such, they simply take everything they are given as factual. Users must ensure that the data being analyzed is "clean." This may require significant analysis of the attribute values being supplied to the discovery tools.
Exploiting parallel computing resources. Many of today’s data warehouses and data marts are implemented on top of parallel relational database systems, storing data across numerous disks and accessing it through multiple CPUs. These engines perform well when doing selection, projection and aggregation operations within the database. However, current database architectures route result sets generated by the database engine through a single query coordinator process, which can be a significant bottleneck in making efficient use of parallel database resources. Even if one is able to generate multiple partitioned parallel output streams from the database (which can be done through "parallel client" implementations), the data mining applications themselves are typically single-threaded implementations operating from a UNIX flat file, and are unable to process partitioned datasets. Even when large datasets can be extracted from the database, processing them is a computationally intense operation. Although most data mining tools are intended to operate against data from a parallel database system, many have not yet been parallelized themselves.
This performance issue is most frequently mitigated by aggressively sampling the input dataset. The sampling required for performance reasons must be considered for potential impact on the quality of the results and an appropriate balance found. Once again, this may introduce iterative investigations into the overall analysis effort.
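A minimal sketch of this kind of input sampling; the fraction and seed below are arbitrary, and in practice the sample size must be balanced against result quality as described above.

```python
import random

def sample_rows(rows, fraction, seed=0):
    """Draw a fixed-fraction random sample (without replacement) from the
    extracted dataset before handing it to the mining tool."""
    rng = random.Random(seed)  # fixed seed keeps iterative runs repeatable
    k = max(1, int(len(rows) * fraction))
    return rng.sample(rows, k)

rows = list(range(100_000))        # stand-in for extracted records
subset = sample_rows(rows, 0.01)   # 1% sample of the input dataset
```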
The 2 GB file limit. Although this is becoming less important with the advance of 64-bit operating systems, many UNIX implementations still limit the size of a single file to 2 GB. For flat file-based data mining tools, this effectively limits the size of the datasets that can be operated upon, making sampling a necessity for analysis, training and testing.
Changing business environments and evolving information technology landscapes are driving the need for more capable data analysis tools. Data mining promises increased understanding of large data repositories via automated, computer-based discovery techniques. However, effective implementation and use of the tools still requires significant expertise in extracting, manipulating and analyzing data from large data warehouses. Nevertheless, these tools are currently producing significant strategic and tactical advantages for businesses in highly competitive environments. As tools continue to mature, advances in server-side connectivity, the development of business-based analysis models, user interface improvements, and a larger population of analysts familiar with these tools will bring data mining into the mainstream of corporate information technology.
About the Author: Mitch Haskett is with the Business Intelligence/Data Warehouse Practice for Keane Inc.