An Introduction to Data Mining, Part 1: Understanding the Critical Data Relationship in the Corporate Data Warehouse
Data mining is one of the hottest topics in information technology. This two-part series will provide an introduction to data mining. The first part will focus on what it is (and isn't), why it is important and how it can be used to provide an increased understanding of critical data relationships in rapidly expanding corporate data warehouses. Part two will look at the tools and techniques used in data mining, and the issues surrounding implementations.
The ability to access accurate and timely information is of significant value to any enterprise. It can help organizations better understand their customers, more effectively target markets and proactively address business problems before they cut into profits. Valuable data, though, is often locked in complex operational systems and processes, buried within multiple sources, or rendered otherwise unusable.
There are probably as many different definitions of the term "data mining" as there are analytical software tool vendors in the market today. Vendors and industry analysts have adopted the term somewhat indiscriminately, drawn by the allure of the analogical concept that promises to turn mountains of data into gold nuggets of strategically valuable business information. The result is a kind of blanket definition that includes all tools employed to help users analyze and understand their data. In this article, we will explore a more narrow definition of the term: Data mining is a set of techniques used in an automated approach to exhaustively explore and surface complex relationships in very large datasets.
For this article, the discussion of datasets will be restricted to those that are largely tabular in nature, having been implemented most likely in relational database management technology. However, these techniques can be applied to other data representations, including spatial data domains, text-based domains and multimedia (image) domains.
Limitations of Verification-Based Analysis
A significant distinction between data mining and other analytical tools is the fundamental approach employed in exploring the data interrelationships. Many of the analytical tools available on the market support a verification-based approach, in which the user hypothesizes about specific data interrelationships, then uses the tools to verify (or refute) those hypotheses. This approach is inherently serial, relying on the intuition and insight of the analyst to pose the original question, and to iteratively refine the analysis, based on the results of potentially complex queries against a data repository. The effectiveness of this approach is limited by a number of factors, including:
• The ability of the human analyst to pose appropriate questions and the system to quickly return results. The verification-based approach is inherently iterative. It relies on the quick turnaround of inquiries to allow the analyst to continue an avenue of investigation. Long delays in generating results can disable that analysis process, causing key relationships to be missed.
• The ability of the analyst to manage the complexity of the attribute space. Humans deal effectively in two- and three-dimensional relationships. Complex data repositories include much higher dimensions of data relationships, which humans must "navigate" one slice at a time. This makes thorough analysis a painfully slow process, and risks missing key relationships about which the user may not have thought to ask.
• The ability of the analyst to think "out-of-the-box." Many of the more interesting data relationships often go undetected because no one asks the right question.

There are numerous analytical tools on the market that have been optimized to address some of these issues. Query reporting tools address ease of use, allowing users to develop and maintain SQL queries through point-and-click interfaces.
In comparison to query tools, statistical analysis packages provide the ability to effectively explore relationships among a few variables, and to determine statistical significance against a population at large. The multidimensional and relational OLAP tools pre-compute and display hierarchies of aggregations along various dimensions, in order to respond quickly to users’ inquiries. New visualization tools allow users to explore higher dimensionality relationships through the combination of spatial and non-spatial attributes (location, size, color, etc.).
And finally, GUI-based analytical environments support the maintenance and execution of complex sets of analysis routines in a point-and-click environment. Fundamentally, these tools augment, enhance or accelerate the verification-based approach.
Data mining, on the other hand, employs a computer-driven, discovery-based approach, in which pattern-matching algorithms are employed to determine the key relationships in the data. Data mining algorithms are capable of examining numerous multidimensional data relationships concurrently, highlighting the ones that are dominant or exceptional. We will revisit these discovery-based approaches in just a moment, exploring their key characteristics, strengths and weaknesses. But first, let us understand why there is such tremendous interest in this new approach.
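To make the contrast concrete, here is a minimal, hypothetical sketch of a discovery-based pass. Rather than verifying a single analyst-posed hypothesis, it exhaustively counts every pair of items that co-occurs across a set of transactions and surfaces the dominant pairs. The function name, support threshold and toy baskets are all illustrative assumptions, not part of any particular mining product:

```python
from itertools import combinations
from collections import Counter

def mine_pairs(transactions, min_support=0.3):
    """Exhaustively count every pair of items that co-occurs in the
    transactions, and surface the pairs whose support (the fraction of
    transactions containing both items) meets a minimum threshold."""
    counts = Counter()
    for items in transactions:
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    n = len(transactions)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}

# Toy point-of-sale data: each row is one shopper's basket.
baskets = [
    ["bread", "milk", "eggs"],
    ["bread", "milk"],
    ["milk", "diapers"],
    ["bread", "milk", "diapers"],
]
print(mine_pairs(baskets, min_support=0.5))
```

Note that no question was asked up front: every pairwise relationship is examined, and only the dominant ones are reported back to the analyst.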
The Need for Data Mining
Many of the techniques employed in today's data mining tools have been around for a number of years, with their origins in the artificial intelligence research of the 1980s and early 1990s. The machine learning techniques of induction and clustering, as well as neural network-based recognition technology, were developed extensively through government- and corporate-funded research and development projects. Yet, it is just now that these tools are being applied to large-scale database systems. The confluence of a number of key trends is responsible for this heightened interest.
Widespread Deployment of High-Volume Transactional Systems. Over the past 15 to 20 years, we have witnessed the proliferation of computer technology in all aspects of our society. Many of these computers have been used to capture detailed transaction information across a variety of corporate enterprises.
For example, in the retail industry, point-of-sale systems capture all of the information associated with shoppers' transactions to the tune of a terabyte a day, according to some estimates. In the telecommunications industry, computers have been inserted directly into the communications switches, making a multitude of information available on every single telephone call. And in banking, the number of electronic transactions has soared with the use of ATMs, credit cards, debit cards and bank-at-home services. These transactional systems have been designed to capture detailed information about every aspect of our businesses. Five years ago, open systems database vendors were struggling to provide systems that could deliver a couple of hundred transactions per minute.
Now, we are routinely seeing TPC-C numbers (tpmCs) for large multiprocessor servers in excess of 50,000, with some clustered SMPs as high as 100,000. Initial transaction-per-second (TPS) benchmarks were in the hundreds, with a great rush to reach 1,000 TPS in the late 1980s. The TPS benchmark (which later became TPC-A), however, was based on a simple debit-credit transaction model and is not directly comparable with today's TPC-C numbers. TPC-C is a benchmark that attempts to develop a transactional model representative of real-world OLTP applications.
This has been accompanied by an equally impressive reduction in the price per tpmC, now well under $200. Cost per tpmC ($/tpmC) is a metric that takes into account the five-year cost of ownership of a transactional computing system. It includes hardware, software and maintenance line items, reflective of the vendors' actual pricing. Recent developments in "low end" 4- and 8-way Pentium-based SMPs, and the commoditization of clustering technology, promise to make this high transaction rate technology even more affordable and much easier to integrate into a business, leading to even greater proliferation of transaction-based information.
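As a quick illustration of the metric itself, $/tpmC is simply the five-year cost of ownership divided by throughput. The figures below are hypothetical, chosen only to land in the "well under $200" range described above; they are not vendor results:

```python
def cost_per_tpmc(hardware, software, maintenance_per_year, tpmc, years=5):
    """Price/performance in the spirit of the TPC-C metric: total
    five-year cost of ownership divided by throughput in tpmC."""
    total_cost = hardware + software + maintenance_per_year * years
    return total_cost / tpmc

# Hypothetical system: $4M hardware, $2.5M software,
# $600K/year maintenance, 50,000 tpmC.
print(cost_per_tpmc(4_000_000, 2_500_000, 600_000, 50_000))  # 190.0
```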
Doing More with Less. The proliferation of this computer and associated information technology has been accompanied by (and, many might say, has caused) fundamental changes in the way we do business. Known by a number of names, including downsizing, rightsizing and streamlining, it has to do with "working smarter" – leveraging information to improve communications, developing more efficient processes, reducing costs and improving effectiveness.
Information as a Key Corporate Asset. These trends have led businesses to look at information as a key corporate asset, to be developed and managed in a manner consistent with its strategic value. Companies used to gather detailed transactional data on a daily basis and provide summary "rollups" on daily, weekly and monthly bases for purposes of reporting, analysis and forecasting. As storage and processing capacity grew, the operational systems began to record more and more detailed information.
As distributed information technology matured, information could be combined from multiple operational systems into one or more cross-system "business views." This allowed organizations to use information from across multiple departmental systems to provide better analytical capabilities. For example, retailers could begin to explore trends across geographic boundaries, or to explore geographical differences in sales. Financial institutions could look at a customer’s entire portfolio to ascertain their financial health and determine likely risk factors to address at the time of loan origination.
The result has been the proliferation of data warehouses – large cross-functional information repositories aimed at providing a new breed of knowledge workers with the information they need to monitor and steer all aspects of the business. These data warehouses integrate detailed transactional information from multiple, disparate operational systems to support time-based decision-making on an enterprisewide basis. Additionally, they often include data from external sources, such as customer demographics and householding information. These sources are used to augment the warehouse with additional information that may provide increased insights into many aspects of the business.
Widespread Availability of Scalable Information Analysis Platforms. Finally, in recent years we have seen widespread adoption of scalable, open systems-based information technology platforms. This includes database management systems, client-server-based analytical tools, and, most recently, information exchange and publishing tools based on Intranet services. These components provide the ability to develop computationally intense analytical techniques and to make these available to large numbers of users through a simple, ubiquitous user interface. This enables more and more knowledge workers to search out and gain access to large amounts of data through widely available desktop tools. The combination of these factors is driving the adoption of data mining technologies, by putting tremendous pressure at various points along the information value chain.
Raw data is acquired in many forms. It is augmented with semantic structure that describes the interrelationship of various data attributes. This information then undergoes an iterative modeling and analysis loop so that knowledge can be derived and codified.
On the "source" side, the amount of raw data being generated and stored in corporate data warehouses is growing rapidly, providing the "data rich/information poor" environment that makes automated, discovery-based techniques attractive. And on the "sink" side, more sophisticated analytical techniques are capable of uncovering new trends or patterns in the data before these relationships are well understood. The required capabilities go far beyond those delivered by conventional decision support systems. The combination of the source-side "push" and sink-side "pull" places increasing demands on the types of analyses and modeling done to convert raw data into applicable knowledge. Data mining is one of the key technologies able to accelerate the modeling and analysis efforts and provide knowledge workers with the tools necessary to navigate this complex analytical space.
Proliferation of E-commerce Has Fueled Enterprisewide Customer Intimacy Initiatives. The commercial push onto the Internet has created new integration requirements for enterprises, spanning sales, marketing, customer service and support, and other customer touch points. The focus of this integration is to understand as much as possible about your customers: who they are, what they buy, what influences them and how valuable they are to your organization.
While the ultimate goal is to have all relevant and valuable customer information available to analyze customer interaction trends, an important question is raised: Do trends and patterns exist outside the standard demographic segments of age, race, gender and affluence? Data mining algorithms are used to uncover the hidden trends and patterns in the millions of transactions captured by Internet site engines, turning the mass of data into a new understanding of customer segmentation based on observed customer behavior rather than standard demographic categories.
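As a minimal sketch of behavior-based segmentation, a simple clustering algorithm such as k-means can group customers by what they actually do (say, purchase frequency and average spend) rather than by demographic labels. The implementation and toy data below are illustrative assumptions, not a description of any particular mining product:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means clustering over (frequency, avg_spend) pairs:
    repeatedly assign each point to its nearest center, then move each
    center to the mean of its assigned points."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: (p[0] - centers[j][0]) ** 2
                                                + (p[1] - centers[j][1]) ** 2)
            clusters[nearest].append(p)
        new_centers = []
        for i, c in enumerate(clusters):
            if c:
                new_centers.append((sum(p[0] for p in c) / len(c),
                                    sum(p[1] for p in c) / len(c)))
            else:  # keep an empty cluster's old center
                new_centers.append(centers[i])
        centers = new_centers
    return centers, clusters

# Toy data with two behavioral segments: frequent low spenders
# versus infrequent big spenders. Each tuple is (visits, avg_spend).
customers = [(9, 12), (10, 15), (8, 10), (1, 300), (2, 280), (1, 320)]
centers, clusters = kmeans(customers, k=2)
```

The segments that emerge are defined by observed behavior alone; no demographic field ever enters the computation.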
In combination and through intelligent analysis, separate sources of data have the power to become knowledge. Virtually all companies have access to valuable data, but few manage to effectively exploit this data to produce knowledge. Decision support systems backed by data warehouses offer companies the technology to combine and analyze data in new and meaningful ways.
Decision support systems and data warehousing are not new ideas. Many companies have implemented data warehouses, but few effectively exploit these warehouses to their true potential. In addition, many of the earlier data warehouses were monolithic, clumsy and expensive to maintain and the data was often inaccurate or simply unusable.
Fortunately, recent advances in technology have overcome many of these past objections and shortcomings. Fully functional, responsive and high-quality data warehouses are now attainable. Inexpensive, high-capacity storage enables the construction of large yet efficient warehouses. Methods and tools to improve data quality are now available. Analytical access tools have matured and now offer more robust features and capabilities.
As information technology continues to evolve at a rapid pace, so does the need for advanced data analysis tools. Data mining is fast becoming the best way to understand large data repositories by using automated, computer-based techniques. Data mining technology encompasses the tools that help users analyze their data by using algorithms to determine the key relationships in rapidly expanding corporate data warehouses. These tools are already providing real strategic advantages to companies in competitive environments. As these tools continue to be refined and more user interface improvements occur, data mining will gain momentum in the IT marketplace.
The next issue will provide a deeper look at the tools and techniques used in data mining. It will introduce a number of approaches and algorithms that can be employed in data mining applications, and examine the more common implementation issues, including results interpretation, data selection and representation, and systems implementation and scalability considerations.
Editor’s note: Next month, part two will look at the tools and techniques used in data mining, and the issues surrounding implementations.
About the Author: Mitch Haskett is with the Business Intelligence/Data Warehouse Practice for Keane Inc.