Data Mining: Sometimes Coincidences Are Just Coincidences

Data mining is just one component of business analytics. You must ask yourself: do the results make sense?

As more organizations recognize that data mining is an area of business intelligence that can yield a significant competitive advantage, it is important to remember that sometimes a coincidence is just that. For example, do hemline lengths and Super Bowl wins really predict the direction of the stock market or the economy? Any perceived relationship may simply be due to coincidence rather than causation. After all, a result reported as significant at the 90 percent confidence level can still arise purely by chance: when no real relationship exists, roughly one such test in ten will appear significant anyway.
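To make the "one in ten" point concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a description of any particular data mining tool: it generates pairs of completely unrelated random series, tests each pair for correlation at the 90 percent level, and counts how often the test reports a "significant" relationship. The function names, the sample size of 50, and the use of the Fisher z-approximation with a critical value of 1.645 are all choices made for this example.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def false_positive_rate(trials=2000, n=50, z_crit=1.645, seed=0):
    """Fraction of unrelated series pairs flagged as correlated.

    Assumption for this sketch: significance is judged with the Fisher
    z-approximation, where atanh(r) * sqrt(n - 3) is roughly standard
    normal when the two series are truly independent. A two-sided test
    at the 90 percent level then uses a critical value of about 1.645.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Two series drawn independently: any correlation is coincidence.
        xs = [rng.gauss(0, 1) for _ in range(n)]
        ys = [rng.gauss(0, 1) for _ in range(n)]
        z = math.atanh(pearson_r(xs, ys)) * math.sqrt(n - 3)
        if abs(z) > z_crit:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    rate = false_positive_rate()
    print(f"Unrelated pairs flagged as significant: {rate:.1%}")
```

Run as a script, this should report a rate in the neighborhood of 10 percent, even though none of the simulated series has anything to do with any other. A data mining run that screens hundreds of candidate variables can therefore be expected to turn up a number of such coincidences.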

Furthermore, although there may be no direct relationship between a supposed cause and a perceived effect, there could be a strong relationship between the supposed cause (the independent variable) and a third variable, and an equally strong relationship between that third variable and the predicted result or effect (the dependent variable). For example, when data mining a database containing city demographics and incidents of crime, you are likely to find that the more houses of worship there are in a city, the more crimes have been committed.

Does this mean that religious people are robbing collection plates? Of course not! The simple explanation is that population size has a positive correlation with the number of houses of worship and that the larger the city population, the more crimes are committed. Taking the data mining result at face value, without considering what it really means, can lead to incorrect conclusions. At the very least, the predicted variable should have been crime rates (e.g., crimes per 10,000 inhabitants) rather than the absolute number of crimes.
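The houses-of-worship example can be sketched with simulated data. Everything in the following Python snippet is hypothetical: the city populations, the ratio of one house of worship per roughly 2,000 residents, and the average of 300 crimes per 10,000 inhabitants are invented for illustration and are not drawn from any real database. The only thing the sketch builds in is that both counts scale with population, while the crime rate itself is generated independently of the number of houses of worship.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def simulate_cities(n_cities=500, seed=1):
    """Return (houses_of_worship, crimes, crimes_per_10k) per city.

    All figures are made up for this sketch. Population is the hidden
    third variable that drives both raw counts.
    """
    rng = random.Random(seed)
    cities = []
    for _ in range(n_cities):
        # Populations spread from 10,000 to 2,000,000 (hypothetical range).
        population = math.exp(rng.uniform(math.log(1e4), math.log(2e6)))
        # Houses of worship grow with population, plus some local variation.
        houses = population / 2000 * rng.uniform(0.7, 1.3)
        # Crime rate per 10,000 inhabitants, drawn independently of houses.
        crime_rate = rng.gauss(300, 40)
        crimes = crime_rate * population / 10_000
        cities.append((houses, crimes, crime_rate))
    return cities


def correlations(cities):
    """Correlate houses of worship with raw crime counts and with crime rates."""
    houses = [c[0] for c in cities]
    crimes = [c[1] for c in cities]
    rates = [c[2] for c in cities]
    return pearson_r(houses, crimes), pearson_r(houses, rates)


if __name__ == "__main__":
    r_counts, r_rates = correlations(simulate_cities())
    print(f"houses of worship vs. number of crimes: r = {r_counts:.2f}")
    print(f"houses of worship vs. crimes per 10,000: r = {r_rates:.2f}")
```

Under these assumptions, the correlation with the absolute number of crimes should come out strongly positive, while the correlation with the crime rate should sit close to zero. Switching the predicted variable from counts to rates is what removes the influence of city size.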

Knowing Your Domain

On the other hand, I have heard of a data mining analysis that showed that an insurance company had its highest sales in the cities in which its offices were in older buildings. At first these results were going to be tossed out as a statistical anomaly. However, one of the company’s executives mentioned that once it opened an office in a city, it rarely changed its location and that the age of the building in which it had an office directly correlated with how long the company had sold insurance in the city. All other things being equal, the longer it had sold insurance in a city, the higher its sales volume was likely to be.

The executive who pointed this out would qualify as a “domain expert”: someone who understands the topic (and the data) under study. A data mining best practice is to have a domain expert review the results to see if they make sense.

Organizations need to appreciate the competitive advantage that data mining and predictive analytics can offer while recognizing that if they are not using them, their competitors might very well be. They also need to remember that data mining is only one component of the overall business analytics spectrum. Query and OLAP analysis complement data mining and can be used to investigate data mining results to determine if they make sense.

These other business intelligence technologies should be used in concert with data mining to achieve the best results. Most organizations have made significant investments in their data warehouses; they are doing themselves and their constituents a disservice if they don’t utilize all the tools in their arsenal to analyze their collective data wealth.

About the Author

Michael A. Schiff is a principal consultant for MAS Strategies.