Mine Your Own Business

Data mining technology takes information about how the elements in a warehouse are related and uses technology grounded in statistics and neural networks to look for patterns between values that could be significant. Automating the ability to pick up on these potentially revealing patterns can prove valuable, but it places serious requirements on the skills of managers and warehouse architecture design.

"There’s gold in them thar hills!" This conviction is at the heart of both data warehousing and data mining, the conviction that somewhere in the data representing business transactions there are insights that can be mined to transform a business.

The difference between the two is in how this "gold" is obtained. Data warehouses are valuable when managers know what variables should be tracked; for example, differences in buying patterns tied to geographies, age groups, etc. While this information can be extremely valuable in tuning the way a business is run, too often the results, and consequently their interpretation, are colored by the way the questions are asked.

The goal of data mining is to look for some correlation between the variables stored in a warehouse that is not expected, some aberration, which upon further examination could lead to some new insight. Data mining technology takes information about how the elements in a warehouse are related and uses technology grounded in statistics and neural networks to look for patterns between values that could be significant.

A classic example of the kind of insight that data mining can provide is the discovery, made while examining purchasing patterns in retail stores, of a higher than expected correlation between the purchase of disposable diapers and beer. Upon interpretation and more data sampling, it made sense that a parent who had run out of disposable diapers and had to make a special trip to a convenience store might also include an impulse purchase geared at rewarding him or her for the inconvenience of having to go there.
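One common way to quantify a "higher than expected" co-occurrence like the one above is lift: the observed rate at which two items are bought together, divided by the rate expected if the purchases were independent. The sketch below is only an illustration of that idea, not the method used in the retail study; the function name and the sample baskets are made up.

```python
from typing import Iterable, Set


def lift(baskets: Iterable[Set[str]], item_a: str, item_b: str) -> float:
    """Return P(a and b) / (P(a) * P(b)); values above 1 suggest a positive association."""
    basket_list = list(baskets)
    n = len(basket_list)
    if n == 0:
        raise ValueError("no baskets supplied")
    count_a = sum(1 for b in basket_list if item_a in b)
    count_b = sum(1 for b in basket_list if item_b in b)
    count_ab = sum(1 for b in basket_list if item_a in b and item_b in b)
    if count_a == 0 or count_b == 0:
        raise ValueError("each item must appear in at least one basket")
    return (count_ab / n) / ((count_a / n) * (count_b / n))


# Hypothetical baskets, invented purely for illustration.
SAMPLE_BASKETS = [
    {"diapers", "beer"},
    {"diapers", "beer", "chips"},
    {"milk", "bread"},
    {"diapers", "wipes"},
    {"beer"},
    {"milk", "eggs"},
    {"bread", "eggs"},
    {"diapers", "beer", "milk"},
]

if __name__ == "__main__":
    print(lift(SAMPLE_BASKETS, "diapers", "beer"))
```

A lift well above 1 is only a prompt for further examination; as the article notes, it takes interpretation and more data sampling before such a pattern means anything.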

While this kind of information can lead to better plans for how to stock shelves, in other situations this type of statistical correlation may point to something as serious as fraud. Automating the ability to pick up on these potentially revealing patterns clearly can prove valuable, but it places some pretty serious requirements on the skills of an organization’s managers and on how it designs its warehouse architecture.

The Problems of Data Mining

Data mining technology has several important prerequisites, namely:

* The data in the warehouse is clean and consistent.

* The individuals deploying the data mining products have correctly specified the relationships between variables.

* Management defines criteria for what it will spend to achieve "explanatory adequacy."

* The warehouse architecture supports evolution to facilitate the acquisition of any additional data.

Clean and consistent data. If the warehouse implementation team cannot ensure that the data represented in a warehouse is consistent, data mining tools may turn up interesting correlations between data elements that may be totally spurious. As a result, the whole value of the data mining application is compromised.

The complexity of the task of populating a warehouse with clean and consistent data is a function of how badly the warehouse is needed. If the goal is to capture a subset of data from an operational database and transform the data values so they can be used by business users in a data mart, the task of ensuring clean and consistent data is fairly straightforward. The implementation team merely needs to ensure that the source database is consistent. However, even when dealing with a single database this task is not simple.

Frequently it is not possible to rebuild historical data in a database used by a mission-critical system because there is not sufficient downtime. As a result, particularly in legacy environments, a DBA will make a schema change in such a way as not to alert the data manager. Sometimes these changes are effected by COBOL REDEFINES, sometimes by a particular data value that signals the existence of another file or record. Because these changes are rarely documented in one place (rather, they are distributed across file descriptions, DBMS data definition files, and application code), extensive data sampling may be required, depending upon the amount of historical data being loaded into the data mart.
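As a rough illustration of what such data sampling might look for, the sketch below records when each value of a field first appears and reports values that show up only after a cutoff date. A code that appears late can hint at an undocumented change in how the field is used. The function names, field names, and sample records are hypothetical, and values are assumed to be mutually comparable so they can be sorted.

```python
from datetime import date
from typing import Dict, Hashable, Iterable, List


def first_seen(records: Iterable[dict], field: str, date_field: str) -> Dict[Hashable, date]:
    """Map each distinct value of `field` to the earliest date on which it occurs."""
    earliest: Dict[Hashable, date] = {}
    for rec in records:
        value = rec[field]
        seen = rec[date_field]
        if value not in earliest or seen < earliest[value]:
            earliest[value] = seen
    return earliest


def late_arrivals(records: Iterable[dict], field: str, date_field: str, cutoff: date) -> List:
    """Return the values of `field` that never occur on or before `cutoff`, sorted."""
    return sorted(v for v, d in first_seen(records, field, date_field).items() if d > cutoff)


# Hypothetical sample of historical records, for illustration only.
SAMPLE_HISTORY = [
    {"status": "A", "entered": date(1994, 2, 1)},
    {"status": "B", "entered": date(1994, 5, 9)},
    {"status": "A", "entered": date(1997, 3, 3)},
    {"status": "9", "entered": date(1997, 8, 15)},  # possible signal value
    {"status": "9", "entered": date(1998, 1, 20)},
]

if __name__ == "__main__":
    print(late_arrivals(SAMPLE_HISTORY, "status", "entered", date(1996, 1, 1)))
```

A late-arriving value is only a lead; confirming what it means still requires checking the file descriptions, data definitions, and application code.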

However, when a company is building a warehouse that either consolidates equivalent data from multiple sources (as in the case of pulling data from multiple purchasing systems) or merges related data from different types of applications, the task of populating the warehouse is even more complex. Not only must one deal with inconsistencies within a particular source database, but one must also recognize when the same customer or vendor is being referenced across multiple source databases, create new keys, and so on.
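The cross-source matching described above can be sketched in a few lines. This is a deliberately crude illustration that assumes a vendor can be recognized by a normalized name plus postal code; real consolidation usually needs far more careful matching rules. All names, fields, and records here are hypothetical.

```python
from typing import Dict, List, Tuple


def normalize(name: str, postal_code: str) -> Tuple[str, str]:
    """Build a crude match key: lowercase letters and digits of the name, plus the postal code."""
    clean_name = "".join(ch for ch in name.lower() if ch.isalnum())
    clean_postal = postal_code.replace(" ", "").upper()
    return clean_name, clean_postal


def assign_warehouse_keys(records: List[dict]) -> Dict[Tuple[str, str], int]:
    """Map each (source, source_id) pair to a new warehouse key shared by matching records."""
    key_by_match: Dict[Tuple[str, str], int] = {}
    result: Dict[Tuple[str, str], int] = {}
    for rec in records:
        match_key = normalize(rec["name"], rec["postal_code"])
        if match_key not in key_by_match:
            # First time this vendor is seen: issue the next surrogate key.
            key_by_match[match_key] = len(key_by_match) + 1
        result[(rec["source"], rec["source_id"])] = key_by_match[match_key]
    return result


# Hypothetical vendor records from two purchasing systems.
SAMPLE_RECORDS = [
    {"source": "purchasing_a", "source_id": "C-17", "name": "Acme Corp.", "postal_code": "78701"},
    {"source": "purchasing_b", "source_id": "0042", "name": "ACME CORP", "postal_code": "78701"},
    {"source": "purchasing_b", "source_id": "0043", "name": "Apex Ltd", "postal_code": "10001"},
]

if __name__ == "__main__":
    for source_ref, warehouse_key in assign_warehouse_keys(SAMPLE_RECORDS).items():
        print(source_ref, "->", warehouse_key)
```

Keeping the mapping from source identifiers to warehouse keys, rather than discarding it after the load, makes it possible to trace any warehouse row back to the systems it came from.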

Despite these complexities, if a company is not committed to making the investment required to ensure clean, consistent data, it might as well give up any thought of using data mining technology.

Accurate identification of dependencies. To reduce computational complexity, data mining products allow the user to specify the relationships between the elements in a warehouse. In many respects, these are not unlike the kind of relationships captured in ERA diagrams used in designing databases. However, when defining a database one is defining such things as the legal relationship between entities (e.g., a student may have more than one report card) and the legal data types or range of values for each attribute.

The DBMS guarantees that these specifications will be enforced. In contrast, defining parameters for use with data mining technology is more a matter of heuristics, that is, a set of guidelines for what is expected. Data mining products use this information to "notice" relationships that violate these heuristics.

For example, one value may be dependent on another: line of credit may be a function of both credit rating and income. A higher line of credit implies that payments will be timely. A data mining tool might flag a potentially important correlation if it finds a population of customers with a pattern of high lines of credit and a large number of late payments. What this correlation means depends on the company processes used to derive the values in the warehouse, and there could be very different interpretations. If line of credit is computed automatically and enforced by some computer program, then it is unlikely that there is an error in line of credit. If, on the other hand, employees are the ones who make this determination and enter the line of credit, management might want to see if these customers are tied to some geographic location serviced by a particular outlet of the company and so on.
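The line-of-credit example can be made concrete with a small sketch. It assumes the heuristic is expressed as two thresholds and that each customer record carries the outlet that services it; the thresholds, field names, and sample data are all hypothetical, and a real data mining product would derive such flags statistically rather than from fixed cutoffs.

```python
from collections import Counter
from typing import Dict, List


def flag_heuristic_violations(customers: List[dict],
                              credit_threshold: float,
                              late_threshold: int) -> List[dict]:
    """Return customers with a high line of credit AND many late payments."""
    return [
        c for c in customers
        if c["credit_line"] >= credit_threshold and c["late_payments"] >= late_threshold
    ]


def violations_by_outlet(flagged: List[dict]) -> Dict[str, int]:
    """Count flagged customers per outlet to see whether they cluster in one place."""
    return dict(Counter(c["outlet"] for c in flagged))


# Hypothetical customer records, for illustration only.
SAMPLE_CUSTOMERS = [
    {"id": 1, "credit_line": 50000, "late_payments": 6, "outlet": "north"},
    {"id": 2, "credit_line": 48000, "late_payments": 5, "outlet": "north"},
    {"id": 3, "credit_line": 52000, "late_payments": 0, "outlet": "south"},
    {"id": 4, "credit_line": 5000, "late_payments": 7, "outlet": "south"},
    {"id": 5, "credit_line": 60000, "late_payments": 4, "outlet": "north"},
]

if __name__ == "__main__":
    flagged = flag_heuristic_violations(SAMPLE_CUSTOMERS, 40000, 3)
    print(violations_by_outlet(flagged))
```

If the flagged customers cluster at a single outlet, that supports the manual-entry interpretation; an even spread would point management toward a different explanation.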

In short, being certain that a real nugget has been found through data mining is no easy matter. Management’s confidence is a function of the accuracy and skill with which these heuristics have been specified, management’s knowledge of the business processes and systems in place, and the ability to add data to the warehouse when a correlation is too important to ignore and a number of competing interpretations are possible with the data currently available.

Determining explanatory adequacy. Scientific inquiry is a good analogy for the dilemma of companies trying to benefit from data mining. In trying to understand a phenomenon, it is as important, and perhaps easier, to discount the variables that don’t contribute as to understand the ones that do. As Lou Agosta says about data mining in The Essential Guide to Data Warehousing, "Refutation is absolute; whereas confirmation is always partial and tentative." Just as the early stages of scientific investigation typically lead to more questions that lead to more experiments, data mining technology is a tool for helping management refine the questions it should ask. Sometimes sufficient evidence can be found from these subsequent queries to point to a probable conclusion.

The dilemma comes when all the obvious questions have been posed to the warehouse and no definitive interpretation has been found for the correlation uncovered by data mining. There are two possibilities at this point: conclude that the correlation is irrelevant, or generate a series of hypotheses, or theories, about what could account for it.

If the operational systems in use in the organization contain data that could provide corroboration for one or more of the competing hypotheses, management and IT must assess how quickly such information could be incorporated into the warehouse and the relative importance of incurring the cost of rebuilding the warehouse.

If the methodology and architecture used to implement the warehouse do not support a quick iteration cycle, it might be easier, rather than rebuilding the warehouse, simply to redesign it and capture the additional information going forward, to see whether evidence can be gathered in the future to verify the hypothesis. The feasibility of either approach is also a function of how quickly the warehouse and the programs that populate it can be modified, and of how important it is to find a timely response to the business question raised by the anomaly uncovered through data mining.

An evolutionary architecture. As Agosta also points out, ideally one would like the results from data mining before designing what a data warehouse should contain, but this is not possible as such applications cannot be run meaningfully against raw operational data. For this reason, companies that want to avail themselves of data mining should ensure that the products and methodology they use in their warehouse implementation support a quick modification cycle. The key to this is keeping a metadata audit trail of everything discovered in building the warehouse: the source fields utilized, how they map to the fields in the warehouse, the look-up tables and business rules used to transform data values, how these are affected across time by schema changes encountered, and so on. In fact, given the volatility of the current business environment, with changing requirements brought on by deregulation, M&A, and the drive toward e-commerce, it is probably important that a company have the same concerns in building any warehouse, even if it is not considering the use of data mining.
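One way to picture such a metadata audit trail is as a list of dated mapping records, each tying a source field to a warehouse field along with the rule used to transform it. The sketch below is an assumption about how such a trail could be represented, not a description of any particular product; the class, the field names (including the COBOL-style source names), and the dates are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class FieldMapping:
    source_system: str
    source_field: str
    warehouse_field: str
    transform_rule: str              # look-up table or business rule applied
    valid_from: date
    valid_to: Optional[date] = None  # None means the mapping is still in effect


def mapping_in_effect(trail: List[FieldMapping],
                      warehouse_field: str,
                      on_date: date) -> Optional[FieldMapping]:
    """Return the first mapping that fed `warehouse_field` on `on_date`, or None."""
    for m in trail:
        if m.warehouse_field != warehouse_field:
            continue
        if m.valid_from <= on_date and (m.valid_to is None or on_date <= m.valid_to):
            return m
    return None


# Hypothetical trail: the source field changed after a schema change in mid-1997.
SAMPLE_TRAIL = [
    FieldMapping("orders_legacy", "REGION-CD", "customer_region",
                 "look up in region_codes table", date(1995, 1, 1), date(1997, 6, 30)),
    FieldMapping("orders_legacy", "TERR-CODE", "customer_region",
                 "first two characters map to region", date(1997, 7, 1)),
]

if __name__ == "__main__":
    found = mapping_in_effect(SAMPLE_TRAIL, "customer_region", date(1996, 3, 1))
    print(found.source_field if found else "no mapping recorded")
```

With the trail kept in this form, the question of where a warehouse value came from, and under which rule, can be answered for any point in the warehouse’s history, which is what makes a quick modification cycle feasible.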

How Being "Wired" Affects the Potential Value

Data mining is appealing because it suggests how one can combine the computational muscle of the computer and techniques drawn from statistical analysis and optimization technologies with the creativity of the human mind to create a supersleuth. However, the drive toward e-commerce and what Giga calls the "zero latency enterprise" adds another wrinkle to the technical and conceptual complexities discussed above. With the goal of using the Web to support personalized marketing and just-in-time manufacturing, companies are using middleware products to implement near-real-time warehouses. Depending on the business (for example, those involved with stock transactions), these near-real-time super-applications are mission-critical, but are they data warehouses?

Historically (there is some irony in the use of this word), data warehouses were intended to assist management in introspection. The concept of introspection suggests some element of elapsed time (a Sabbath?), the notion of time outside the turmoil of day-to-day business that allows us to devise a better plan. Even without this time conflict, the question remains whether even the best combination of data warehousing and data mining can anticipate a paradigm shift where many of our essential assumptions are no longer relevant. The need to balance our ever-increasing need for speed, efficiency, and innovation with wisdom and an understanding of our limitations, whether spiritual, technical, or procedural, is surely our challenge for the next millennium.

Conclusion

We find ourselves in the midst of a new "wired" world, with tools more powerful than any of the prophets of the Information Age could have imagined. The technology supporting data mining is promising, but expectations about the ability to benefit from this technology must be balanced against a number of factors, the most important being the organization’s strategy for balancing the need for speed, the technical complexity of evolving and maintaining its IT environment, and its decisions about trade-offs among the things that matter to survival.

About the Author: Katherine Hammer is President and CEO of Evolutionary Technologies International (Austin, Texas).
