BI Experts: Why Predictive Analytics Will Continue to Grow
What's behind the increasing popularity of data mining, and what is its relationship to predictive analytics?
Data mining is one of the components of the business intelligence spectrum and is often included under the umbrella of advanced analytics. When discussing data mining, I consider predictive analytics to be applied data mining. I believe that data mining is a set of technologies and algorithms and that predictive analytics is the application of these technologies.
Over the past several years, a number of factors have made data mining more feasible and helped move it from departmental pilot projects to enterprise use in many organizations.
Technology factors include:
Parallel processing and faster CPUs. Commodity-based massively parallel processing capabilities combined with faster and multi-core processors have served to significantly reduce the time it takes to perform complex data mining tasks.
In-memory analytics. Declining memory prices and the wide adoption of 64-bit addressing have now made it technically and economically feasible to load massive amounts of detailed data directly into memory. Data can be mined orders of magnitude faster in memory than if it needed to be continually swapped between memory and magnetic and/or solid state storage devices.
In-database data mining. Past data mining efforts might have involved extracting data from a database and then loading it into the proprietary file system of a data mining tool. Partnerships between data mining and database vendors such as SAS and Teradata provide the ability to perform data mining directly within the source database.
Big Data. Thanks to technologies such as MapReduce, Hadoop, R, and natural language processing and text analytics, organizations can now collect, analyze, and mine massive amounts of structured and unstructured data.
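The in-database idea in the list above can be illustrated with a small sketch. It uses Python's built-in sqlite3 module and a made-up purchases table (the table, its columns, and the function names are assumptions for illustration only), and a simple per-customer total stands in for a real mining algorithm. It is not how any particular vendor integration works; it only contrasts pulling every detail row out of the database with letting the database engine do the work and return the summary.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory database with a hypothetical purchases table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE purchases (customer_id INTEGER, amount REAL)")
    rows = [(1, 20.0), (1, 35.0), (2, 5.0), (2, 7.5), (3, 120.0)]
    conn.executemany("INSERT INTO purchases VALUES (?, ?)", rows)
    return conn


def spend_extracted(conn):
    """Extract-then-analyze: every detail row leaves the database."""
    totals = {}
    query = "SELECT customer_id, amount FROM purchases"
    for customer_id, amount in conn.execute(query):
        totals[customer_id] = totals.get(customer_id, 0.0) + amount
    return totals


def spend_in_database(conn):
    """In-database: the engine aggregates and only summary rows come back."""
    query = "SELECT customer_id, SUM(amount) FROM purchases GROUP BY customer_id"
    return dict(conn.execute(query))


if __name__ == "__main__":
    conn = build_demo_db()
    print(spend_extracted(conn))
    print(spend_in_database(conn))
```

Both functions return the same totals. The difference is how much data has to move: the first transfers every row to the analysis tool, while the second transfers one row per customer.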
Business drivers behind data mining include:
Sales, marketing, and call center analyses. As businesses strive to increase revenue, data mining has been deployed to identify up-selling and cross-selling opportunities. This has intensified with the growth of social media and the resulting need for rapid sentiment analysis. Organizations are deploying predictive analytics to drive personalized recommendations on their Web sites and select advertisements that accompany search engine results; search engine inquiries are being mined as well. This type of usage will continue to grow as organizations such as Google can now consolidate individual user data from Gmail, Google searches, YouTube videos, and Google+ posts.
Fraud. Insurance companies have long used data mining techniques to identify potentially fraudulent claims. The IRS mines tax returns to refine its (non-published) Discriminant Information Function (DIF) system for identifying suspicious tax returns. The SEC can mine stock market trades and personal associations to identify insider trading.
Homeland Security. Although profiling may not be politically correct, it is used in "bet your borders" applications to identify potential terrorist activities.
Other data mining drivers. Chief among these are health-care applications that link seemingly isolated outbreaks of related symptoms, determine the effectiveness of possible cures, or even identify the best treatment based on the patient's genome. Predictive analytics can also be used to spot potential Medicare and Medicaid fraud.
Although early data mining techniques may have involved analyzing summary data and/or subsets of detailed data, the technology factors mentioned above, taken in combination, now allow us to analyze vast amounts of detailed data. The development of new data mining algorithms or refinements to existing ones will still be driven by university, government, and corporate research labs and Ph.D. theses. Although data mining was once the almost exclusive domain of highly skilled practitioners searching for the "best-fit" algorithm, this is no longer the case. It is now possible for business analysts to use commercial, off-the-shelf software that runs a variety of data mining algorithms to find those that are most appropriate or to run sophisticated models developed by experts.
Just as many query tools gained acceptance for their ability to shield business users from having to write SQL code, data mining tools can now shield business users from having to fully understand the theory behind the underlying data mining algorithms. This has helped make predictive analytics more pervasive both in terms of the number of organizations deploying it and the number of people within these organizations able to use it.
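As a rough illustration of what "running a variety of algorithms to find those that are most appropriate" can mean in practice, the sketch below fits three very simple predictors to synthetic data and keeps the one with the lowest error on held-out records. The data, the three candidate models, and all function names are assumptions made up for this example; commercial tools use far richer algorithms and validation schemes.

```python
import random


def make_data(n=200, seed=1):
    """Synthetic records: y is roughly a straight line in x plus noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 10) for _ in range(n)]
    return [(x, 3.0 * x + 2.0 + rng.gauss(0, 1)) for x in xs]


def fit_mean(train):
    """Baseline: always predict the average outcome."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y


def fit_line(train):
    """Ordinary least-squares straight line."""
    n = len(train)
    mx = sum(x for x, _ in train) / n
    my = sum(y for _, y in train) / n
    sxx = sum((x - mx) ** 2 for x, _ in train)
    sxy = sum((x - mx) * (y - my) for x, y in train)
    slope = sxy / sxx
    intercept = my - slope * mx
    return lambda x: slope * x + intercept


def fit_nearest(train):
    """1-nearest-neighbor: copy the outcome of the closest training record."""
    return lambda x: min(train, key=lambda p: abs(p[0] - x))[1]


def holdout_mse(model, test):
    """Mean squared error on records the model has not seen."""
    return sum((model(x) - y) ** 2 for x, y in test) / len(test)


def pick_best(data, split=0.7):
    """Fit every candidate on the training part, score on the rest."""
    cut = int(len(data) * split)
    train, test = data[:cut], data[cut:]
    candidates = {"mean": fit_mean, "line": fit_line, "nearest": fit_nearest}
    scores = {name: holdout_mse(fit(train), test)
              for name, fit in candidates.items()}
    return min(scores, key=scores.get), scores


if __name__ == "__main__":
    best, scores = pick_best(make_data())
    print(best, scores)
```

Because the synthetic data is generated from a line, the straight-line model would be expected to score well here. The point is that the analyst only compares scores and does not need to derive any of the algorithms.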
If your organization is not already benefiting from the power of data mining and predictive analytics, it should consider adopting the technologies soon. Data mining can provide an organization with a competitive edge, which is why I believe that its usage is significantly underreported by commercial companies. After all, does Macy's tell Gimbels what technology it is using for competitive advantage? (A more contemporary analogy might be: "Does SAS tell IBM/SPSS about its technologies?") Recognize that your competitors might be using data mining today even if they don't publicize it.
That said, don't blindly accept predictive analytics results, especially if they conflict with experience and intuition. Realize that sometimes correlations are simply coincidences, and remember that a test run at the 95 percent confidence level will still flag a nonexistent relationship as statistically significant about 5 percent of the time. Mine enough variables and some spurious "findings" are all but guaranteed. Does a groundhog's shadow really predict winter's length? A best data mining practice is to have predictive analytics results reviewed by a domain expert as a sanity check.
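The caution about statistical significance can be checked with a small simulation. The sketch below is an assumption-laden toy, not a recommended testing procedure: it repeatedly draws two groups from the same distribution (so there is no real difference to find), runs a two-sided z-test with known unit variance, and counts how often the test nonetheless reports a significant difference at the 5 percent level. The function name and parameters are made up for illustration.

```python
import math
import random


def false_positive_rate(trials=2000, n=50, z_crit=1.96, seed=0):
    """Share of no-effect comparisons that still look 'significant'."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Both groups come from the same N(0, 1) population.
        a = [rng.gauss(0, 1) for _ in range(n)]
        b = [rng.gauss(0, 1) for _ in range(n)]
        diff = sum(a) / n - sum(b) / n
        # Standard error of the difference in means, variance known to be 1.
        se = math.sqrt(2.0 / n)
        if abs(diff / se) > z_crit:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    print(false_positive_rate())
```

The returned share should land close to 0.05. In other words, an analyst who screens a few hundred unrelated variables this way should expect a handful of "significant" results from chance alone, which is where the domain expert's sanity check earns its keep.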