PMML: Data Mining for the Masses?
PMML recasts the data warehouse as a turnkey platform for real-time data mining.
It’s been eight years in the making, but the Predictive Model Markup Language (PMML) is on the verge of going mainstream.
PMML is an XML-based markup language used to describe statistical and data-mining models. Its principal selling point is that it gives PMML-compliant applications an easy way to share data models with other PMML-aware tools. It’s stewarded by the Data Mining Group, an industry consortium that comprises a Who’s Who of data mining and relational database vendors, including familiar names such as IBM Corp., Microsoft Corp., Oracle Corp., SAP AG, SAS Institute Inc., SPSS Inc., and NCR Corp. subsidiary Teradata.
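To make the idea concrete, here is a minimal sketch in Python of what a PMML document can look like and how a consuming tool might read it. The model is a made-up linear regression: the field names (income, tenure, score), the intercept, and the coefficients are all assumptions for illustration. The fragment is also simplified (it omits the XML namespace, for instance) and has not been validated against the Data Mining Group's schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified PMML-style regression model (illustrative values only).
PMML_DOC = """<PMML version="3.0">
  <Header description="Illustrative credit-score model"/>
  <DataDictionary numberOfFields="3">
    <DataField name="income" optype="continuous" dataType="double"/>
    <DataField name="tenure" optype="continuous" dataType="double"/>
    <DataField name="score" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression" modelName="demo">
    <MiningSchema>
      <MiningField name="income"/>
      <MiningField name="tenure"/>
      <MiningField name="score" usageType="predicted"/>
    </MiningSchema>
    <RegressionTable intercept="300.0">
      <NumericPredictor name="income" coefficient="0.002"/>
      <NumericPredictor name="tenure" coefficient="5.0"/>
    </RegressionTable>
  </RegressionModel>
</PMML>"""


def describe_model(pmml_text):
    """Return the declared version, field names, intercept, and coefficients."""
    root = ET.fromstring(pmml_text)
    # Every field the model uses is declared once in the data dictionary.
    fields = [f.get("name") for f in root.iter("DataField")]
    table = root.find("./RegressionModel/RegressionTable")
    coefficients = {
        p.get("name"): float(p.get("coefficient"))
        for p in table.findall("NumericPredictor")
    }
    return {
        "version": root.get("version"),
        "fields": fields,
        "intercept": float(table.get("intercept")),
        "coefficients": coefficients,
    }


summary = describe_model(PMML_DOC)
print(summary)
```

Because the whole model is plain XML, any tool that understands the tag set can recover the same fields and coefficients without access to the package that built the model.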
One benefit, proponents say, is that PMML effectively brings data mining down from the mountain—i.e., the rarefied realm of the SAS or SPSS guru—and democratizes it. “Data mining can be looked at as two major processes. The development of the model still has to be done by the SAS or SPSS experts, but PMML helps with the deployment, taking that data model and executing it,” says Arlene Zaima, advanced analytic program manager with Teradata.
As a result, Zaima says, users who aren’t familiar with the intricacies of SAS or SPSS can work effectively with pre-built PMML data models. “The deployment [by users] is done much more frequently—daily or even up to the minute—and that’s what PMML addresses: the execution for the model.”
Like another long-gestating standard, the XML Query language (XQuery), PMML has been in the works for some time. But unlike XQuery, PMML has evolved steadily, starting with the PMML 1.1 release five years ago. Today, PMML is at version 3.0, and companies such as Teradata, IBM, Oracle, and Microsoft offer varying degrees of support for the technology.
PMML in Practice
Dan Friedman, a principal with software marketing consultancy DHF Consulting, says he’s worked with several software vendors that have incorporated support for PMML into their products. The reasons, he says, are many, but a common driver is a need to address the very different requirements of data-model development and data-model execution—the same division highlighted by Teradata’s Zaima.
“You can think of predictive statistics as coming in two pieces—runtime and design time,” he says. “Design time is done offline, where people tend to use established statistical packages like SPSS, SAS, etc. This can take weeks or months to accomplish and is done by highly specialized analysts.”
The advantage of PMML, says Friedman, is that it can help accelerate the runtime execution of data models. “Runtime is how you integrate this model into an operational system, like a CRM or financial system. Typically, you deploy the model to the runtime and use it to provide a score that is then acted upon with business rules or some other business logic. This scoring is done in real time and takes [less than] one second,” he explains.
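The runtime pattern Friedman describes, in which a deployed model produces a score and business logic acts on it, can be sketched in a few lines of Python. Everything specific here is an assumption made for illustration: the simplified PMML-style regression fragment, the field names, the coefficients, and the 500-point approval threshold. It is not drawn from any vendor's product.

```python
import xml.etree.ElementTree as ET

# Hypothetical model handed over by the analysts (illustrative values only).
PMML_DOC = """<PMML version="3.0">
  <RegressionModel functionName="regression">
    <RegressionTable intercept="300.0">
      <NumericPredictor name="income" coefficient="0.002"/>
      <NumericPredictor name="tenure" coefficient="5.0"/>
    </RegressionTable>
  </RegressionModel>
</PMML>"""


def load_scorer(pmml_text):
    """Parse the model once at deployment and return a fast scoring function."""
    table = ET.fromstring(pmml_text).find("./RegressionModel/RegressionTable")
    intercept = float(table.get("intercept"))
    weights = {
        p.get("name"): float(p.get("coefficient"))
        for p in table.findall("NumericPredictor")
    }

    def score(record):
        # Linear regression: intercept plus the weighted sum of the inputs.
        return intercept + sum(w * record[name] for name, w in weights.items())

    return score


def decide(score_value, threshold=500.0):
    """Business rule owned by the domain experts, not the modelers."""
    return "approve" if score_value >= threshold else "refer"


score = load_scorer(PMML_DOC)
applicant = {"income": 60000.0, "tenure": 20.0}
value = score(applicant)
decision = decide(value)
print(value, decision)
```

Parsing happens once, at deployment. Each later call to score is plain arithmetic, which is why scoring of this kind can fit comfortably within a real-time budget.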
Friedman, however, points to still another division, this one between stats geeks and business-domain experts. PMML can help bridge this chasm, too. “The thing is, the skills needed to do the runtime piece are much different than the skills to do the modeling piece. In practice, the runtime is built by domain experts that really understand the business process. They are not cutting edge ‘machine learning experts’ or stats jockeys. The stats guys know the math, but not the business process,” he says. “So, the runtime people would like to (1) use other companies’ modeling tools, [and] (2) make sure they … can leverage the best that the modeling tools can provide. Since the runtime guys aren’t experts, they look to implement standards to make sure they’re covered for most of the models that people want to build today and tomorrow.”
In this respect, says Toby Dunn, a SAS expert with the department of education in a prominent Southwestern state, PMML might be the most pragmatic choice for solving many vexing business problems. Dunn should know: in a former career, he worked for a company that developed a series of data models for banks and credit card companies. “These models included credit scoring, revenue prediction, and call-center queuing. The models were developed using SAS and deployed in a proprietary Java program at the client’s site,” he explains.
One problem with this approach was that the company’s Java program needed to be able to run the data models developed by Dunn and his colleagues as well as the existing and future data models developed by clients.
“PMML solved this problem,” he says. “PMML was used for two reasons. First, it was a known and stable standard tag set, which anyone could go and look up on the Web. So regardless of who was building the model to be deployed, all they needed to do was provide that model in the specified version of PMML to the client and they could quickly and easily implement it into the system. Secondly, it would do the calculations required in order for the proprietary Java program to do its job and report back to the user.”
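A runtime that accepts models from many authors needs some gatekeeping along the lines Dunn describes: whoever built the model, it must arrive in the agreed version of PMML. The Python sketch below shows one way such a check could look. The expected version, the set of supported model types, and the function name are assumptions for illustration, not details of the Java program Dunn mentions.

```python
import xml.etree.ElementTree as ET

EXPECTED_VERSION = "3.0"  # assumption: the version agreed with clients
SUPPORTED_MODELS = {"RegressionModel", "TreeModel"}  # assumption


def local_name(tag):
    """Drop an XML namespace prefix such as '{uri}PMML' if one is present."""
    return tag.rsplit("}", 1)[-1]


def check_submission(pmml_text):
    """Return a list of problems; an empty list means the model is accepted."""
    try:
        root = ET.fromstring(pmml_text)
    except ET.ParseError as exc:
        return [f"not well-formed XML: {exc}"]

    problems = []
    if local_name(root.tag) != "PMML":
        problems.append("root element is not PMML")
    if root.get("version") != EXPECTED_VERSION:
        problems.append(
            f"expected PMML version {EXPECTED_VERSION}, got {root.get('version')}"
        )
    # Look only at direct children: model elements sit directly under the root.
    models = [local_name(c.tag) for c in root if local_name(c.tag) in SUPPORTED_MODELS]
    if not models:
        problems.append("no supported model element found")
    return problems


in_house = '<PMML version="3.0"><TreeModel/></PMML>'
client_old = '<PMML version="2.1"><RegressionModel/></PMML>'
print(check_submission(in_house))
print(check_submission(client_old))
```

The check never asks which tool or which team produced the file, which is the point of agreeing on a public tag set.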
PMML and the Primacy of the Data Warehouse
Teradata’s Zaima identifies still another division (and simmering source of confusion) in the data-mining space.
“There’s always … a division between the analytic modelers and the SQL jockeys” who manage the data warehouse, she observes, because of the rather inelegant way (from the perspective of analytic modelers, anyway) in which data mining is traditionally done.
“The data warehouse has always been an important part of analysis, because that’s where the data is kind of reconciled and pulled together,” she notes. In many cases, however, data mining isn’t done in the data warehouse; instead, data is extracted from the warehouse and loaded into an external repository.
This approach has obvious costs, though, in terms of both performance and timeliness. “These problems can be solved with SQL models. In fact, Teradata’s first approach was to develop SQL models for Teradata Warehouse Miner, but as you know, a lot of corporations have established standards around SAS or SPSS,” Zaima says. “They’ve invested so much in this expertise, and these analysts aren’t cheap. They want to be able to leverage their investment in software and resources, but they want to leverage the power of in-database [mining].”
PMML lets them do that, Zaima says. “It just eliminates that need of pulling the data out into another server, so it reduces the time to market and speeds time to deployment. Now [the analytic modelers] can just say, ‘Take my PMML model and just run it in the database.’ There doesn’t have to be any recoding [to SQL].”
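To see what that recoding step involves, consider the Python sketch below, which mechanically turns a simplified PMML-style regression model into a SQL scoring expression. It illustrates the kind of translation an analyst would otherwise do by hand. It is not a description of how Teradata or any other vendor executes PMML, and the table name, key column, field names, and coefficients are all hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical model (illustrative values only).
PMML_DOC = """<PMML version="3.0">
  <RegressionModel functionName="regression">
    <RegressionTable intercept="300.0">
      <NumericPredictor name="income" coefficient="0.002"/>
      <NumericPredictor name="tenure" coefficient="5.0"/>
    </RegressionTable>
  </RegressionModel>
</PMML>"""

# Accept only plain identifiers so model text cannot inject arbitrary SQL.
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def _ident(name):
    if not _IDENT.match(name):
        raise ValueError(f"unsafe identifier: {name!r}")
    return name


def regression_to_sql(pmml_text, table, key):
    """Build a SELECT that scores every row of `table` with the regression."""
    reg = ET.fromstring(pmml_text).find("./RegressionModel/RegressionTable")
    terms = [repr(float(reg.get("intercept")))]
    for p in reg.findall("NumericPredictor"):
        coefficient = float(p.get("coefficient"))
        terms.append(f"{coefficient!r} * {_ident(p.get('name'))}")
    expression = " + ".join(terms)
    return f"SELECT {_ident(key)}, {expression} AS score FROM {_ident(table)}"


sql = regression_to_sql(PMML_DOC, table="customers", key="customer_id")
print(sql)
```

A linear model translates in a handful of lines. Trees, neural networks, and preprocessing steps take far more effort to recode by hand, which is where a database that consumes the PMML file directly saves the most work.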
Stephen Swoyer is a Nashville, TN-based freelance journalist who writes about technology.