Data Mining: The Third Stage of Data Warehouse Evolution

The evolution, definition, process and techniques of data mining are discussed.

Today's competitive marketplace challenges even the most successful companies to protect and retain their customer base, manage supplier partnerships and control costs, while at the same time increasing their revenue. In a world of accelerating change, competitive advantage will be defined by your ability to leverage information to initiate effective business decisions before your competition does.

The detailed data generated by your company's operations is a valuable and powerful asset. Most companies generate vast amounts of data from their On-Line Transaction Processing (OLTP) systems, Point of Service (POS) systems, financial ATMs, and now, the Internet. The challenge faced by many of these "data rich" enterprises is to make this information resource available at the time and place, and in the form, needed to support their decision-making process. Initially, data warehouses were implemented to support historical reporting, then evolved to provide online analytical processing (OLAP) to understand historical trends and relationships. Today, data mining is being added in order to discover hidden relationships and predict future patterns and trends.

Data Warehouse Evolution

In response to the need to make mission-critical business information resources available to support decision making, many industry-leading companies began using data warehouses to collect and analyze large volumes of historical data. Initially, these data warehouses were built to consolidate and report historical data on products, customers and revenues using pre-defined SQL-generated reports that showed what had occurred (e.g., monthly sales reporting). In the second stage of data warehouse evolution, warehouse users quickly moved from historical reporting to more sophisticated data analysis using "what-if" ad-hoc queries in order to gain insights into why a specific condition or event occurred.

Today, many data warehouse users are turning to advanced analytical and predictive modeling tools and techniques to break through the limitations of traditional data analysis and answer the question, "What will happen in the future?" Data mining, or "knowledge discovery" as it is sometimes called, shifts a portion of the analysis burden from the business analyst to the computer, enabling the discovery of subtle relationships previously unrecognizable through manual query tools. In the hands of a knowledgeable business person, this information can lead to significant new insights.

What Is Data Mining?

Data Mining is a multi-step process of discovering meaningful new correlations, patterns and trends by sifting through the large amounts of detailed data stored in data warehouses using pattern recognition technologies, as well as statistical and mathematical modeling techniques. Effective data mining depends on a comprehensive and robust enterprise data warehouse, not on a data mart containing a summary or aggregation of departmental or subject-specific data. At times, even the most astute observer cannot predict which attributes or factors (e.g., events, behaviors, static data, transactions) may or may not contribute to the business process under study. Data mining can often uncover data relationships that were not expected, revealing previously undervalued data or attributes to be significant contributing factors. In data mining there is never too much detailed data. Data mining offers an opportunity to find business facts hidden in data.

The Value of Data Mining

Data mining is used by businesses in virtually every major industry. In the retail industry, data mining is used to identify which products sell together in order to improve shelf allocation and better locate store displays. The telecommunications industry uses data mining to identify which customers may be interested in new calling plans or new services. The health care industry uses data mining to identify the most successful treatment pattern combinations for particular diseases. Insurance companies and banks are using data mining to detect fraud or unusual purchasing patterns that may indicate stolen credit cards, and to identify which factors indicate good candidates for mortgages and other types of loans.

Data mining can help answer high-value business questions that were previously difficult or not possible to address through traditional query and reporting tools. Examples of data mining applications include:

Market Segmentation. How can you identify and group certain segments of customers? This analysis involves clustering, affinity and behavior dynamics.

Propensity-to-Buy Analysis. What combination of products or offers will appeal to which customers and when? This analysis involves product scoring and purchase sequence.

Customer Attrition. Which customers are likely to be attracted by a competitor's offer? This includes at-risk-to-defect and churn analysis.

Customer Profitability. Which customers and what products are most likely to be profitable throughout their full life cycle? This would include current and lifetime customer and product value scoring.

Data Mining Techniques

Historically, data mining solutions have required considerable expertise and have been expensive to implement. As such, only a relatively small number of businesses have been able to exploit the power of data mining applications. The recent emergence of a new class of end-user oriented data mining tools has dramatically broadened the acceptance of data mining.

A number of tools available today apply multiple techniques to sift through data; however, it is still up to good business and/or statistical analysts to choose the best techniques and to finally create the prediction models or clarify the base relationships. Current techniques in use in data mining are:

Traditional Statistics. Includes a variety of traditional statistical methods, such as time series forecasting. Also included are the more general methods for model fitting and validation that conduct automated searches for complicated relationships.

Clustering. Used to identify distinguishing characteristics between sets of records and then place them into groups or segments. This process is used for customer segmentation analysis.
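As a rough illustration of the clustering idea, the sketch below groups a handful of customer records with a very small k-means routine. The article does not name a specific clustering algorithm, so k-means is an assumption here, as are the function name `kmeans`, the choice of the first k points as starting centroids, and the customer figures (orders per year, average order value), which are invented for the example.

```python
def kmeans(points, k, iterations=20):
    """Tiny k-means sketch: returns (centroids, labels).

    Assumption: the first k points serve as the starting centroids,
    which keeps the example deterministic.
    """
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid (squared Euclidean distance).
        labels = [
            min(
                range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])),
            )
            for p in points
        ]
        # Move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, labels


# Hypothetical customers: (orders per year, average order value).
customers = [(2, 20), (3, 25), (2, 22), (12, 90), (11, 95), (13, 85)]
centroids, labels = kmeans(customers, k=2)
print(labels)     # segment index assigned to each customer
print(centroids)  # the "typical" customer of each segment
```

With this toy data the occasional, low-value buyers and the frequent, high-value buyers fall into separate segments. Real segmentation work would normally scale the variables first and try several values of k.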

Association. Identifies rules that enable the correlation of attributes or factors in the data. This method has been found to be effective in market basket analysis, which identifies items that tend to be purchased together in order to improve shelf allocation and better locate store displays. One of the most often told anecdotes in the retail and data warehousing industries comes from Wal-Mart. It seems that the item most often purchased with diapers turned out to be beer! Placing diapers and beer close to one another in stores increased sales of both!
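A minimal sketch of how association rules for item pairs might be scored, using the two usual measures: support (the share of baskets containing both items) and confidence (the share of baskets with the first item that also contain the second). The baskets are invented for illustration and are not real retail data; the function name `pair_rules` and the `min_support` threshold are likewise assumptions.

```python
from collections import Counter
from itertools import combinations

# Hypothetical baskets, invented for illustration only.
baskets = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"diapers", "wipes"},
    {"beer", "chips"},
    {"diapers", "beer", "wipes"},
]


def pair_rules(baskets, min_support=0.4):
    """Return (antecedent, consequent, support, confidence) for frequent item pairs."""
    n = len(baskets)
    item_counts = Counter(item for basket in baskets for item in basket)
    pair_counts = Counter(
        pair for basket in baskets for pair in combinations(sorted(basket), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue  # pair is too rare to be worth reporting
        # Each frequent pair yields two candidate rules, one per direction.
        rules.append((a, b, support, count / item_counts[a]))
        rules.append((b, a, support, count / item_counts[b]))
    # Highest-confidence rules first; ties broken alphabetically.
    return sorted(rules, key=lambda r: (-r[3], r[0], r[1]))


for antecedent, consequent, support, confidence in pair_rules(baskets):
    print(f"{antecedent} -> {consequent}: support={support:.2f}, confidence={confidence:.2f}")
```

Counting every pair is only practical for small item catalogs; production tools typically prune the search, but the scoring idea is the same.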

Sequential Association. Patterns emerge over time, and this method is used to look for links that relate these sequential patterns. Life-triggers that precede specific purchases and precursor purchases are often found using this methodology. Examples of life-triggers that often precede changes in life insurance coverage are a change in marital status, the purchase of a home and the birth of a child.

Decision Trees. This is an induction method (machine learning) that works by developing multiple choice type questions that can be yes/no or have probabilities and/or values. The answers radiate out from an initial field that then splits (at a node) at each new decision point.

Rule Induction. The method develops rules that classify data and are often derived from decision trees, or other algorithms. An example for evaluating the credit worthiness of an individual could be: "If age > 50 and married and homeowner, then risk = good."
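The example rule above can be written directly as code and then checked against records whose outcome is already known. The sketch below is only an illustration: the `Applicant` fields, the `repaid` outcome flag and the sample records are hypothetical, and it scores a single hand-written rule rather than inducing rules from data.

```python
from dataclasses import dataclass


@dataclass
class Applicant:
    age: int
    married: bool
    homeowner: bool
    repaid: bool  # known outcome; hypothetical field used only to score the rule


def rule_fires(a: Applicant) -> bool:
    """The example rule: age > 50 and married and homeowner => risk = good."""
    return a.age > 50 and a.married and a.homeowner


def coverage_and_accuracy(applicants):
    """Coverage: share of records the rule applies to.
    Accuracy: share of those covered records that actually repaid."""
    covered = [a for a in applicants if rule_fires(a)]
    coverage = len(covered) / len(applicants)
    accuracy = sum(a.repaid for a in covered) / len(covered) if covered else 0.0
    return coverage, accuracy


# Hypothetical applicants, invented for illustration only.
applicants = [
    Applicant(62, True, True, True),
    Applicant(55, True, True, True),
    Applicant(58, True, True, False),
    Applicant(45, True, True, True),
    Applicant(70, False, True, False),
    Applicant(33, False, False, True),
]
coverage, accuracy = coverage_and_accuracy(applicants)
print(f"coverage={coverage:.2f}, accuracy={accuracy:.2f}")
```

A rule that covers few records or is often wrong on the records it covers is a candidate for revision, which is where the analyst's judgment comes in.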

Neural Networks. Neural networks are an approach to computing that involves developing mathematical structures that have the ability to learn, or to adapt their behavior based on previous results. A trained neural network can be thought of as an "expert" in the category of information it has been given to analyze.

The Data Mining Process

Data Mining must be seen as an ongoing iterative process rather than a set of tools. Key steps in the data mining process are:

Selection of Data to be Analyzed. The first step in Data Mining is to identify and collect the data to be mined. Remember that the objective is to make as much data available for analysis as possible. Some experts advocate a sampling strategy that uses a reliable, statistically representative sample of the full detail data.

Sample the data by extracting a portion of a data set large enough to contain the significant information, yet small enough to manipulate quickly and easily. Done correctly, sampling will yield a representative view of a larger file. Using a representative sample reduces processing time and cost; however, some unexpected data relationships or contributing factors may be missed.
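One simple way to draw such a sample is uniform random sampling without replacement, with part of the sample reserved for checking the model later. The sketch below assumes the records fit in a Python list; the function name `sample_and_holdout`, the 30 percent holdout share and the fixed seed are illustrative choices, not recommendations from the article.

```python
import random


def sample_and_holdout(records, sample_size, holdout_fraction=0.3, seed=42):
    """Draw a simple random sample, then reserve part of it for later model checks."""
    rng = random.Random(seed)  # fixed seed so the draw can be reproduced
    sample = rng.sample(records, sample_size)  # sampling without replacement
    holdout_size = round(sample_size * holdout_fraction)
    build = sample[: sample_size - holdout_size]
    holdout = sample[sample_size - holdout_size:]
    return build, holdout


# Hypothetical detail records, represented here by their IDs.
records = list(range(10_000))
build, holdout = sample_and_holdout(records, sample_size=1_000)
print(len(build), len(holdout))
```

Uniform sampling can under-represent rare but important groups, which is one way the unexpected relationships mentioned above get missed; stratified sampling is a common alternative when such groups are known in advance.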

Review and Analyze the Data. After selecting the data, the next step is to explore the data visually or numerically for inherent trends or groupings. In this phase of the process, look for data points that are significantly outside the basic range of the results, or redundant measures that are candidates for elimination in the reduction process. The idea is to try to begin reducing the number of variables as soon as possible.

If visual exploration does not reveal clear trends, you can explore the data through statistical techniques like clustering. For example, in a direct mail campaign, clustering might reveal groups of customers with distinct ordering patterns. Knowing these patterns can create opportunities for personalizing mailings or promotions.

Change the Data to Fit Findings. Modify the data by selecting and, if necessary, transforming the variables to focus the model selection process. Based on the discoveries in the visual exploration phase, you may need to adjust the data to include information such as the grouping of customers and significant subgroups, or you may need to introduce new variables.

You may also need to modify data when the "mined" data change. Because Data Mining is a dynamic, iterative process, you can update Data Mining methods or models when new information is available.

Build Predictive Models. Once the data are prepared, it is time to construct models that explain patterns in the data. Allowing the software tools to search automatically for a combination of data that reliably captures these relationships will improve your ability to predict desired outcomes.

Assess the Model Performance. In this step, the model is evaluated to determine the usefulness and reliability of the findings from the Data Mining process. A common means of assessing a model is to apply the model to a portion of data that was set aside during the sampling stage. If the model is valid, it should work for this reserved sample as well as for the sample used to construct the model.

Similarly, the model can be tested against known data. For example, if you know which customers in a file had high retention rates and your model predicts retention, you can check to see whether the model selects these customers accurately. In addition, practical applications of the model, such as partial mailings in a direct mail campaign, help prove its validity.
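The holdout check described in this step amounts to scoring the same model on the build sample and on the reserved sample and comparing the two results. The sketch below assumes a model is simply a function from a record to a predicted outcome; the retention model, the three-year tenure threshold and the customer records are all hypothetical.

```python
def accuracy(model, records):
    """Fraction of (features, actual) records where the prediction matches the known outcome."""
    hits = sum(1 for features, actual in records if model(features) == actual)
    return hits / len(records)


def retention_model(features):
    """Hypothetical model: predict that customers with 3+ years of tenure are retained."""
    return features["tenure_years"] >= 3


# Hypothetical labeled records: (features, was the customer actually retained?).
build_sample = [
    ({"tenure_years": 5}, True),
    ({"tenure_years": 1}, False),
    ({"tenure_years": 4}, True),
    ({"tenure_years": 2}, False),
]
holdout_sample = [
    ({"tenure_years": 6}, True),
    ({"tenure_years": 1}, True),
    ({"tenure_years": 3}, True),
    ({"tenure_years": 2}, False),
]

build_accuracy = accuracy(retention_model, build_sample)
holdout_accuracy = accuracy(retention_model, holdout_sample)
print(f"build={build_accuracy:.2f}, holdout={holdout_accuracy:.2f}")
```

A model that scores much better on the build sample than on the holdout has likely fitted quirks of the build data rather than a relationship that generalizes.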

Modify and Tune the Model. As we mentioned earlier, Data Mining is an ongoing process. The lessons learned each time the model is run must be used to adjust and tune the model to reflect changing conditions in the data and the marketplace.

Combined Strengths

Data warehouses and data mining are a natural and powerful combination. The enterprise data warehouse provides a single centralized source of clean, quality data for building and implementing data mining models.

According to Bill Inmon, the "Father of Data Warehousing" and President of Pine Cone Systems, "The data warehouse sets the stage for effective data mining." With a data warehouse as the foundation, up to 75 percent of the data mining effort (data sourcing, preparation and cleansing) is already underway, saving time and money while increasing reliability and delivering faster results.

Data mining is a useful set of philosophies, tools, and applications that can help your business become more competitive. Although data mining is a powerful new data warehouse analysis technique, it does not replace the need for historical reporting and ad-hoc query analysis. Nor does it replace human capabilities; it augments them. Your data warehouse's value is maximized when all data analysis types, both historical and predictive, are leveraged as part of a comprehensive strategy for data warehousing.


About the Author:

Jack W. Wood is Director, Data Warehouse Sales and Marketing for Formula Consultants Incorporated (Anaheim, Calif.). He has over 13 years of experience in the Unisys marketplace.