How to Mine Your Own Business: Mining for Gold in a World of Changing Currency
Data mining technology takes information about how the elements in a warehouse are related and uses technology grounded in statistics and neural networks to look for patterns between values that could be significant. Automating the ability to pick up on these potentially revealing patterns can prove valuable, but it places serious requirements on managers' skills and on warehouse architecture design.
The conviction, "There’s gold in them thar hills!" is at the heart of both data warehousing and data mining.
The difference between the two is in how this "gold" is obtained. Data warehouses are valuable when managers know what variables should be tracked. While this information can be extremely valuable in tuning the way a business is run, too often the results and their interpretation are colored by the way the questions are asked.
The goal of data mining is to look for some unexpected correlation between the variables stored in a warehouse – some aberration which, upon further examination, could lead to a new insight.
The Problems of Data Mining
Data mining technology has several important prerequisites, namely:
• The data in the warehouse is clean and consistent
• The individuals deploying the data mining products have correctly specified the relationships between variables
• Management defines criteria for what it will spend to achieve "explanatory adequacy"
• The warehouse architecture supports evolution to facilitate the acquisition of any additional data
Clean and consistent data. If the implementation team cannot ensure that the data represented in a warehouse is consistent, mining tools may turn up interesting correlations between data elements that are totally spurious. The whole value of the data mining application is compromised.
The complexity of the task of populating a warehouse with clean and consistent data is a function of the warehouse's scope. If the goal is to capture a subset of data from an operational database and transform the data values so they can be used by business users in a data mart, ensuring clean and consistent data is fairly straightforward: the implementation team merely needs to ensure that the source database is consistent. However, even when dealing with a single database, this task is not simple.
Frequently, it is not possible to rebuild historical data in a database used by a mission-critical system because there is not sufficient downtime. As a result, particularly in legacy environments, a DBA will often make a schema change in a way that never reaches the data manager. Sometimes these changes are effected by COBOL REDEFINES, sometimes by a particular data value that signals the existence of another file or record. Because these changes are rarely documented in one place – rather, they are distributed across file descriptions, DBMS data definition files and application code – extensive data sampling may be required to uncover them.
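The kind of data sampling this calls for can be sketched in a few lines. The sketch below profiles the character patterns found in one fixed-width field; the record layout, field positions and the overlaid region code are all hypothetical, and a real legacy file would need far more rules.

```python
from collections import Counter

def profile_field(records, start, end):
    """Count the distinct value patterns seen in one fixed-width field.

    A field that should be numeric but shows alphabetic patterns in some
    records may be a sign of an undocumented REDEFINES-style overlay.
    """
    def pattern(s):
        # Reduce each character to a class: digit, letter, space, other.
        return "".join(
            "9" if c.isdigit() else
            "A" if c.isalpha() else
            " " if c.isspace() else "?"
            for c in s
        )
    return Counter(pattern(rec[start:end]) for rec in records)

# Hypothetical sample: the last four bytes hold an amount, except when a
# legacy overlay stores a two-letter region code there instead.
sample = ["CUST00011234", "CUST00025678", "CUST0003TX  "]
counts = profile_field(sample, 8, 12)
# Two records share the numeric pattern; one record exposes the overlay.
```

Running such a profile across every field of a sampled extract is one way to surface the undocumented schema variants before they corrupt the warehouse load.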
However, when a company is building a warehouse that either consolidates equivalent data from multiple sources (as in the case of pulling data from multiple purchasing systems) or merges related data from different types of applications, the task of populating the warehouse is even more complex. Not only must one deal with inconsistencies within a particular source database, but one must recognize when the same customer or vendor is being referenced across multiple source databases, create new keys, etc.
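A minimal sketch of the cross-source key consolidation described above, assuming customers can be matched on a normalized name. Real warehouses need much stronger matching rules; all system names, keys and customer names here are hypothetical.

```python
def assign_warehouse_keys(sources):
    """Map each (source_system, native_key) pair to one surrogate key.

    `sources` is a dict of {system_name: {native_key: customer_name}}.
    Records whose normalized names match are treated as the same customer.
    """
    next_key = 1
    name_to_key = {}   # normalized name -> surrogate key
    key_map = {}       # (system, native_key) -> surrogate key
    for system, rows in sources.items():
        for native_key, name in rows.items():
            norm = " ".join(name.upper().split())
            if norm not in name_to_key:
                name_to_key[norm] = next_key
                next_key += 1
            key_map[(system, native_key)] = name_to_key[norm]
    return key_map

purchasing = {"P-17": "Acme Corp"}
billing = {"B-02": "ACME  CORP", "B-09": "Globex Inc"}
keys = assign_warehouse_keys({"purchasing": purchasing, "billing": billing})
# The two Acme records now share one surrogate key; Globex gets its own.
```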
Despite these complexities, if a company is not committed to making the investment required to ensure clean, consistent data, it might as well give up any thought of using data mining technology.
Accurate identification of dependencies. To reduce computational complexity, data mining products allow the user to specify the relationships between the elements in a warehouse. In many respects, these are not unlike the kind of relationships captured in ERA diagrams used in designing databases. However, when defining a database, one is defining such things as the legal relationship between entities and the legal data types or range of values for each attribute.
The DBMS guarantees that these specifications will be enforced. In contrast, defining parameters for use with data mining technology is more a matter of heuristics, that is, a set of guidelines for what is expected. Data mining products use this information to "notice" relationships that violate these heuristics.
One value may be dependent on another – for example, line of credit may be a function of both credit rating and income, and a higher line of credit implies that payments will be timely. A data mining tool might flag a potentially important correlation if it finds a population of customers with a pattern of high lines of credit and a large number of late payments. What this correlation means depends on how it is interpreted, and, depending upon the company processes used to derive the values in the warehouse, there could be very different interpretations. If line of credit is computed and enforced automatically by a computer program, then an error in line of credit is unlikely. If, on the other hand, employees make this determination and enter the line of credit manually, management might want to see whether these customers are tied to some geographic location serviced by a particular outlet of the company, and so on.
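The line-of-credit heuristic can be expressed as a simple filter over customer records. The thresholds and records below are purely illustrative:

```python
def flag_heuristic_violations(customers, credit_threshold=50000, late_threshold=3):
    """Return customers violating the stated heuristic: a high line of
    credit should imply timely payments. Thresholds are illustrative."""
    return [
        c for c in customers
        if c["credit_line"] >= credit_threshold
        and c["late_payments"] >= late_threshold
    ]

customers = [
    {"id": 1, "credit_line": 80000, "late_payments": 5, "region": "SW"},
    {"id": 2, "credit_line": 90000, "late_payments": 0, "region": "NE"},
    {"id": 3, "credit_line": 10000, "late_payments": 6, "region": "SW"},
]
flagged = flag_heuristic_violations(customers)
# Only customer 1 combines a high credit line with many late payments.
```

Grouping the flagged population by a field such as region is exactly the kind of follow-up question the interpretation step calls for.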
In short, establishing the confidence with which management can be certain it has found a real nugget from data mining is no easy matter. It is a function of: the accuracy with which these heuristics have been specified; management's knowledge of the business processes and systems in place; and the ability to add data to the warehouse when a correlation is too important to ignore and a number of competing interpretations are possible.
Determining explanatory adequacy. Scientific inquiry is a good analogy for the dilemma of companies trying to benefit from data mining. In trying to understand a phenomenon, it is as important – and perhaps easier – to discount the variables that don’t contribute as to understand the ones that do. As Lou Agosta writes about data mining in The Essential Guide to Data Warehousing, "Refutation is absolute; whereas confirmation is always partial and tentative." Just as the early stages of scientific investigation typically lead to more questions that lead to more experiments, data mining technology is a tool for helping management refine the questions it should ask. Sometimes sufficient evidence can be found from these queries to point to a probable conclusion.
The dilemma comes when all obvious questions have been posed to the warehouse, and no definitive interpretation has been found for the correlation uncovered by the data mining. There are two possibilities at this point: Conclude that the correlation is irrelevant or generate a series of hypotheses, or theories, about what could account for it.
If the operational systems in the organization contain data that could provide corroboration for one or more of the competing hypotheses, management and IT must assess how quickly information could be incorporated into the warehouse and the relative importance of incurring the cost of rebuilding the warehouse.
If the methodology and architecture used to implement the warehouse do not support a quick iteration cycle, then rather than rebuilding the warehouse around historical data, it might be easier to redesign it to capture the additional information required going forward, so that evidence to verify the hypothesis can be gathered in the future. The feasibility of either approach is also a function of how quickly the warehouse and the programs that populate it can be modified, and of how important it is to find a timely answer to the business question raised by the anomaly uncovered through data mining.
An evolutionary architecture. As Agosta also points out, ideally one would like the results from data mining before designing what a data warehouse should contain, but this is not possible as such applications cannot be run meaningfully against raw operational data. For this reason, companies that want to avail themselves of data mining should ensure that the products and methodology they use in their warehouse implementation support a quick modification cycle. The key to this is keeping a metadata audit trail of everything discovered in building the warehouse – the source fields utilized and how they map to the fields in the warehouse; the look-up tables and business rules used to transform data values and how these are affected by schema changes encountered, etc.
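One way to picture such a metadata audit trail is as a set of mapping records that can be queried for lineage. The warehouse fields, source fields, rules and lookup tables below are all hypothetical:

```python
# Each record documents one warehouse field: where it came from, the
# business rule that transformed it, and any lookup tables involved.
mappings = [
    {
        "target": "warehouse.customer.credit_line",
        "sources": ["billing.cust_master.CR_LIMIT"],
        "rule": "divide by 100 (source stores cents)",
        "lookup_tables": [],
    },
    {
        "target": "warehouse.customer.region",
        "sources": ["billing.cust_master.BRANCH_CD"],
        "rule": "translate branch code via region lookup",
        "lookup_tables": ["branch_to_region"],
    },
]

def lineage(target_field):
    """Answer 'where did this warehouse field come from?' from the trail."""
    return [m for m in mappings if m["target"] == target_field]
```

With this trail in place, the impact of a schema change in a source system can be assessed by searching the mappings for the affected source field, rather than by rereading application code.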
"Wired" for Mining
Data mining is appealing because it suggests how one can combine the computational muscle of the computer and techniques drawn from statistical analysis and optimization technologies with the creativity of the human mind to create a supersleuth. However, the drive toward e-commerce and what Giga calls the "zero latency enterprise" adds another wrinkle to the technical and conceptual complexities discussed above. With the goal of using the Web to support personalized marketing and just-in-time manufacturing, companies are using middleware products to implement near-realtime warehouses. Depending on the business – for example, those involved with stock transactions – these near-realtime super-applications are mission-critical, but are they data warehouses?
Historically, data warehouses were intended to assist management in introspection. The concept of introspection suggests some element of elapsed time; the notion of time outside the turmoil of day-to-day business that allows us to devise a better plan. Even without this time conflict, the question remains whether even the best combination of data warehousing and data mining can anticipate a paradigm shift where many of our essential assumptions are no longer relevant. The need to balance our ever-increasing need for speed, efficiency and innovation with wisdom and an understanding of our limitations is surely our greatest challenge.
About the Author: Katherine Hammer is President and CEO of Evolutionary Technologies International (Austin, Texas).
Eleven Steps to Success in Data Warehousing
By Phillip Blackwood
Although data warehousing has long been an option for big companies, the reduction in warehousing technology costs now makes it practical for smaller companies as well. Turnkey integrated analytical solutions are reducing the cost, time and risk involved in implementation, and corporate portals are making it possible to grant access to hundreds, or thousands, of employees.
1. Recognize that the job is harder than expected. Experts report that 30 percent to 50 percent of the information in a typical database is missing or incorrect. In an operational system that focuses on swiftly and accurately processing current transactions, this may not be noticeable or may even be acceptable. However, it’s totally unacceptable in a data warehousing system designed to sort through millions of records in order to identify trends or select customers for a new product. Another challenge is that although database schemas change over the life cycle of a project, historical databases are rarely rebuilt.
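A first-cut audit of missing and incorrect values might look like the following sketch; the fields and validation rules are illustrative:

```python
def audit_quality(records, validators):
    """Report, per field, how many records are missing or fail validation.

    `validators` maps field name -> predicate returning True for valid values.
    """
    report = {field: {"missing": 0, "invalid": 0} for field in validators}
    for rec in records:
        for field, is_valid in validators.items():
            value = rec.get(field)
            if value in (None, ""):
                report[field]["missing"] += 1
            elif not is_valid(value):
                report[field]["invalid"] += 1
    return report

rows = [
    {"zip": "78701", "state": "TX"},
    {"zip": "", "state": "Texas"},   # missing zip, spelled-out state
    {"zip": "7870", "state": "CA"},  # malformed zip
]
report = audit_quality(rows, {
    "zip": lambda v: len(v) == 5 and v.isdigit(),
    "state": lambda v: len(v) == 2 and v.isalpha(),
})
```

Running such a report early makes the 30-to-50-percent figure concrete for your own data, before the warehouse design is locked in.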
2. Understand the data in existing systems. It is important to perform a detailed analysis of the status of all databases that will contribute to the data warehouse. Data interrelationships need to be determined between various systems, and must be maintained as the data is moved into the warehouse. Since the data warehouse implementation often involves making changes to database schema, understanding data relationships among heterogeneous systems is required to predict the impact of any such change, and avoid inconsistencies in other areas of the enterprise.
3. Recognize equivalent entities. One of the most important aspects of preparing for a data warehousing project is identifying equivalent entities across heterogeneous systems. For example, two different divisions may be servicing the same customer, yet have the name entered differently in their records (e.g., "AIG" and "American International Group"). A data transformation product capable of fuzzy matching can be used to identify and correct this and similar problems.
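A rough sketch of what fuzzy matching involves, using only the Python standard library; commercial data transformation products use far richer techniques than string similarity plus an acronym test:

```python
from difflib import SequenceMatcher

def likely_same_entity(a, b, threshold=0.85):
    """Guess whether two names refer to the same entity.

    Two names match if their strings are similar enough, or if one is
    the acronym of the other (as with "AIG" and
    "American International Group"). Threshold is illustrative.
    """
    a_n, b_n = a.upper().strip(), b.upper().strip()
    if SequenceMatcher(None, a_n, b_n).ratio() >= threshold:
        return True
    short, long_ = sorted((a_n, b_n), key=len)
    acronym = "".join(word[0] for word in long_.split())
    return short.replace(".", "") == acronym

likely_same_entity("AIG", "American International Group")        # → True
likely_same_entity("I.B.M.", "International Business Machines")  # → True
likely_same_entity("Acme Corp", "Globex Inc")                    # → False
```

In practice, matched pairs would feed the key-consolidation step so that both divisions' records end up under one warehouse customer key.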
4. Support data quality with metadata. Metadata is data about data; for example, tags that indicate the subject of a World Wide Web document. One major challenge is trying to synchronize the metadata between different vendor products, different functions and different metadata stores. It is important to create and capture metadata for interfaces, business processes and database requirements as soon as possible. Several vendors offer products that have the potential to integrate metadata from disparate sources and begin to establish a central repository that can be used to provide the information needed by both administrators and users.
5. Select the right tools. Data transformation tools extract data, clean it and load it into the data warehouse, while capturing the history of that process. This transformation process may include the creation and population of new fields from the operational data, summarizing data to an appropriate level for analysis, performing error checks to validate data integrity, etc. Look for a tool that can map data from source to target with a point-and-click interface. The ability to track and manage the relationships of interrelated data entities is also useful. Finally, try to find a tool that can capture and store metadata during the conversion process.
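The extract-clean-load cycle with history capture can be sketched as follows; the field names, validation rule and derived size_band field are all hypothetical:

```python
import datetime

def transform_and_load(source_rows, target, history):
    """Clean and load rows while recording what happened - the kind of
    history a transformation tool captures during the conversion."""
    run = {"started": datetime.datetime.now().isoformat(),
           "loaded": 0, "rejected": 0, "errors": []}
    for row in source_rows:
        # Error check: validate data integrity before loading.
        if not str(row.get("amount", "")).lstrip("-").isdigit():
            run["rejected"] += 1
            run["errors"].append(f"bad amount in row {row.get('id')}")
            continue
        # Transformation: create and populate a new field from the
        # operational data.
        amount = int(row["amount"])
        target.append({"id": row["id"],
                       "amount": amount,
                       "size_band": "large" if amount > 1000 else "small"})
        run["loaded"] += 1
    history.append(run)
    return run

warehouse, audit = [], []
transform_and_load([{"id": 1, "amount": "2500"},
                    {"id": 2, "amount": "n/a"}], warehouse, audit)
# One row is loaded with its derived field; the bad row lands in the
# audit trail instead of the warehouse.
```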
6. Leverage external resources. External information sources can increase the value of internal information. Rather than simply comparing yearly sales, use of external data might make it possible to compare sales growth against the increase in the overall market. The integration challenge is even greater when external data sources are involved. In some cases, external data will differ so drastically from existing schema that data transformation algorithms will be required to make use of the external resource.
7. Use information distribution methods. The biggest technological improvements in the data warehousing field have come in the area of information delivery. Today, users can subscribe to regular, personalized reports and have them delivered over e-mail, while the report data itself is housed securely and economically on the server.
8. Focus on high payback marketing applications. Most of the hot applications in data warehousing involve marketing, because of the potential for an immediate payback in terms of increased revenues. For example, catalog manufacturers are using data warehousing to match specific customer characteristics and customer purchases to target their consumers’ needs.
9. Emphasize early wins to build organization support. The availability of a wide range of off-the-shelf solutions has helped to reduce cost and lead-time requirements for data warehousing applications. Off-the-shelf solutions won’t meet every project objective, but they can provide short-term point solutions that serve as a training and demonstration platform.
10. Don’t underestimate hardware requirements. The hardware requirements for data warehousing database servers are high, due to the large number of CPU cycles required to slice and dice data. Database size also plays a part in server performance requirements. Be sure to select a scalable platform; the typical data warehouse implementation starts out at the departmental level and grows to an enterprisewide solution. Purchasing servers that can be expanded with additional processors is one possible approach. A more ambitious idea is to combine loosely coupled systems that enable the database to be spread out over multiple servers, appearing as a single entity.
11. Consider outsourcing development and maintenance. Most data warehousing applications fit the criteria for a good outsourcing project – a large project that has been defined to the point that it does not require daily interaction between business and development teams. There have been many cases where an outsourcing team is able to make dramatic improvements in a new or existing data warehouse. Typically, these improvements do not stem from an increased level of skill on the part of the outsourcing team, but rather flow from the nature of outsourcing. The outsourcing team brings fresh ideas and perspective to their assignment and can use methods and solutions developed on previous projects.
– Phillip Blackwood is the Vice President of Data Warehouse and Business Intelligence at Syntel Inc. (Troy, Mich.; www.syntelinc.com).