Q&A: R 101
R is a high-powered language for analyzing today’s data volumes.
- By James E. Powell
R, the open source language developed in 1993, has become a popular tool for analyzing today’s data volumes. Learn what R is and how it’s used, the benefits it offers, and where R is headed in this conversation with Norman Nie, the CEO of Revolution Analytics and the founder of SPSS.
BI This Week: What is the R language and what is its primary use?
Norman Nie: R is an open source language that was developed in 1993 at the University of Auckland in New Zealand by statisticians Ross Ihaka and Robert Gentleman. It has been widely embraced by the academic community and it has emerged as the de facto teaching tool for advanced statistics. Nearly every student pursuing a Master's or Ph.D. in statistics today is trained on R because it's the most powerful statistical computing language on the planet.
Outside academia, R has gained popularity, too. Students who have been trained on R are bringing it from the classroom into the corporate world. Today, R is used for a number of functions across many verticals -- from quantitative analysis in the financial services industry to advanced genomic research in biostatistics. Researchers have been attracted to R both for its flexibility and its extensibility -- because there is very little you can't do with R when it comes to data analysis.
Why is R well suited for predictive analytics?
Predictive analytics is statistical modeling by any other name; it's the way in which we understand, optimize and forecast the future of our world. Organizations today are generating increasingly large sets of data and have embraced this kind of advanced analytics as a means by which they can leverage their data to make better business decisions.
Because R is a modern language, it's continuously evolving and unparalleled in flexibility -- there is no statistical expression that it cannot compute, which is not true of R’s predecessors. The open source R community is two million strong and includes many of today's most brilliant statistical minds. To date, thousands of specialized packages have been developed on top of R with innovative new algorithms that meet the challenges posed by big data.
As the founder and CEO of SPSS, which is widely recognized as a statistics powerhouse, you've said you believe that R is a superior technology, and you are now competing against your own invention. Why is R superior in your opinion, and are you working to replace SPSS? Doesn't SPSS serve a different collection of users than R?
I started SPSS over 40 years ago and it did become a statistical analytics powerhouse, but at the time when I was developing its core technology, none of the concepts common to today’s modern computing technologies or programming languages existed. Because R is a far younger language, it is architected with these modern technologies in mind. To re-architect the legacy products to compete with R would be a remarkably difficult, time-consuming, and expensive task.
Furthermore, one of R’s great competitive advantages results from the fact that it has an enormous, intelligent, and vibrant community continuously working to extend and enhance the language. By way of example, today there are nearly 3,000 add-on “packages” available for R -- a number that continues to increase exponentially.
Finally, as you noted, today SPSS absolutely serves a different collection of users than R. A major reason for that is that I explicitly designed SPSS to be accessible to a less technical user. Currently, R’s great flexibility and power come with a steep learning curve for non-expert users. However, a key mission and development area is increasing the accessibility of R to an audience of less technical users and simultaneously improving the productivity of power users.
Most IT shops have difficulty finding talented programmers. How easy is it to learn R? Is it designed for experienced programmers or for novices (in the same way Basic was introduced as a beginner's language)?
R's greatest strength -- its flexible, extensible nature -- does come at the price of a steep learning curve. While data analysts and statisticians extol the nearly limitless capabilities of R, it's admittedly not the easiest programming language to learn. At the end of the day, R is a statistical language -- it was designed by statisticians for statisticians. Due to its emergence as the de facto statistical language in academic circles, though, we're seeing more of a concerted effort to teach R to the widest audience possible.
It's true that IT departments have traditionally had difficulty finding talented programmers, but as O'Reilly and others have noted [Editor’s note: see http://radar.oreilly.com/2010/06/what-is-data-science.html], data science is emerging as one of the fastest-growing job markets, even in today's down economy. Not only is the popularity of statistics as a major rising, but a number of high-profile institutions are also introducing dedicated Master's programs for predictive analytics and data science. (For a list, see http://www.kdnuggets.com/education/usa-canada.html.) As such, more dedicated effort is being placed on broadening R and easing the learning curve for R.
What are the biggest problems users or IT experience with R, and what best practices can you suggest to avoid these problems?
Aside from ease-of-use issues, the most common criticism of R historically has been scaling to larger data sets. Improving the performance and scalability of R is a critical process in making R widely accessible, and this has been the second major thrust behind Revolution’s development efforts. Recently, Revolution released a new product, ScaleR, which enables analysis on terabyte-class datasets at speeds radically faster than existing alternatives.
Where is R headed? For example, will it be used to interface with more databases or adopted as a scripting language in more commercial apps?
On a broad level, it's safe to say that you'll see R embedded in more and more commercial applications as enterprises grow more familiar with the language. In particular, in the short term, I think you will begin to see R become the dominant analytics tool in both life sciences and quantitative finance, the key here being R’s acceptance and use in regulated and production environments. Beyond that, you will see R gain pervasive use across an amazingly diverse group of industries due to its remarkable flexibility and compelling cost advantage.
R is an ever-evolving language, and it's difficult to pinpoint any single direction it's headed because there are so many use cases -- from NASA designing custom packages to track mean time to failure for space shuttle parts to credit card companies using R to predict credit scores.
What products or services does Revolution Analytics offer?
Revolution Analytics offers Revolution R, which builds on top of the base R packages and is tailored to enterprise needs, with the ability to scale to massive data sets and embed R into custom business applications. We offer the Revolution R Enterprise Suite on the individual and server levels, and it is free for academics. More information is available at http://www.revolutionanalytics.com/products/.