In-Depth
Does SQL Server PDW Have What It Takes for Very Large Data Warehouses?
When it comes to high-end data warehousing, Microsoft isn't promising something it can't deliver: it's been shipping a special release of its SQL Server 2008 R2 database since November of last year.
Could a massive SQL Server data warehouse be in your data center's future?
Not if the mainframe group has anything to say about it. But, then, mainframers don't always prevail in these kinds of discussions, do they?
When it comes to high-end data warehousing, Microsoft isn't promising something it can't deliver, either. After all, Redmond has been shipping a special release of its SQL Server 2008 R2 database (dubbed Parallel Data Warehouse, or PDW) since November of last year.
What's more, SQL Server PDW is based on best-of-breed technology that Microsoft acquired from former analytic database specialist DATAllegro Corp. -- two years before IBM ponied up $1.7 billion for DATAllegro's biggest competitor, Netezza Inc.
Moreover, with SQL Server PDW, Microsoft Corp. believes it has a legitimate contender. It's gunning for a high-end segment that's currently contested by Teradata Corp., a passel of legacy RDBMS players (chiefly IBM Corp. and Oracle Corp.), a handful of best-of-breed specialists, and a smattering of prominent players (including Hewlett-Packard Co. and EMC Corp.) that became high-end contenders chiefly by virtue of acquisition.
Microsoft's weapons? For starters, SQL Server PDW ships as a data warehouse appliance and boasts a massively parallel processing (MPP) engine. The market has basically settled on MPP as the platform for high-end data warehousing. As recently as six years ago, most shops thought nothing of using off-the-shelf RDBMS technologies -- such as vanilla Oracle or DB2 -- to support double- or triple-digit-terabyte configurations. Some of the largest data warehouses in the world ran on Oracle, DB2, or Teradata -- less commonly, on Sybase or Netezza -- and there are still plenty of DWs running on commodity databases.
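For readers who haven't worked with MPP, the basic idea is easy to sketch: data is hash-distributed across independent, shared-nothing nodes, each node works only on its own slice, and a control node merges the partial results. The toy Python below illustrates the concept only; the node count, function names, and single-process "nodes" are invented for illustration and bear no relation to PDW's actual internals.

```python
# Toy illustration of the shared-nothing MPP idea: rows are
# hash-distributed across independent "nodes," each node aggregates
# its own slice, and a control node merges the partial results.
from collections import defaultdict

NODE_COUNT = 4  # a real appliance would use separate physical servers

def distribute(rows, key):
    """Assign each row to a node by hashing its distribution key."""
    nodes = [[] for _ in range(NODE_COUNT)]
    for row in rows:
        nodes[hash(row[key]) % NODE_COUNT].append(row)
    return nodes

def node_aggregate(rows, key, measure):
    """Each node sums its local slice; no cross-node traffic needed."""
    partial = defaultdict(float)
    for row in rows:
        partial[row[key]] += row[measure]
    return partial

def mpp_sum(rows, key, measure):
    """The control node merges each node's partial aggregates."""
    merged = defaultdict(float)
    for node_rows in distribute(rows, key):
        for k, v in node_aggregate(node_rows, key, measure).items():
            merged[k] += v
    return dict(merged)

sales = [{"region": r, "amount": 100.0} for r in ("east", "west", "east")]
print(mpp_sum(sales, "region", "amount"))  # {'east': 200.0, 'west': 100.0}
```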
In demanding DW environments, however, specialty or MPP-powered data warehouses are increasingly the rule. Oracle's Exadata platform delivers MPP-like performance, experts say; IBM, for its part, purchased DW specialist Netezza Inc. to round out its own MPP story. Big Blue also offers highly optimized DW platforms via its Smart Analytics portfolio, including several mainframe-oriented Smart Analytics offerings: its Smart Analytics Optimizer for DB2 for z/OS, its Smart Analytics System 9600, and its Smart Analytics Cloud for System z.
IBM's Smart Analytics offerings aren't yet MPP-powered, but they're a far cry from vanilla DB2, and that's the point.
"We went through a period of about 10 years where relational database technology was almost a commodity," says Colin White, president and founder with business intelligence (BI) and DW consultancy BI Research. "It didn't matter which [database] you bought; for most people, it would do the job," he continues. "Now we're back to pushing the limits of what was possible in terms of processing. This is why many of these new technologies that have come out in products have been acquired. ... [F]or certain kinds of applications, [they] provide better performance than the commodity databases."
SQL Server PDW has other things going for it, too. It's inexpensive, for example -- selling for about $13,000 per TB. To be sure, that's roughly twice the going bargain-basement price-per-TB (the $6,000/TB achieved by German analytic database specialist EXASOL), but it's nonetheless cost-competitive with MPP offerings from Teradata, ParAccel Inc., and the former Vertica, which HP acquired two months ago to flesh out its MPP story.
HP's Vertica buy also gave it a columnar MPP database. It should be noted that HP already had an MPP architecture -- albeit one not designed explicitly for data warehousing -- inherited from the former Compaq Computer Corp., which purchased high-availability and fault-tolerant computing specialist Tandem Computers Inc. in mid-1997. HP re-engineered Tandem's NonStop architecture and positioned it -- in the form of Neoview -- as an MPP platform for DW.
Teradata, too, picked up columnar capabilities with its acquisition of Aster Data.
If MPP provides the computing power for high-end DW, columnar has emerged as one of the most popular -- to say nothing of powerful -- database architectures for it. For certain kinds of applications, columnar provides both accelerated performance and (much) better compressibility. In triple-digit-terabyte and petabyte-sized configurations, compressibility starts to matter.
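Why does columnar compress so much better? Because a column store lays each column's values out contiguously, low-cardinality columns (dates, state codes, status flags) collapse into long runs of repeated values. The Python sketch below makes the point with simple run-length encoding; the sample data and helper names are invented for illustration, and real column stores use far more sophisticated encodings.

```python
# A toy run-length encoder applied to row-store vs. column-store layouts.
# Column stores keep each column's values contiguous, so low-cardinality
# columns collapse into long, highly compressible runs.
from itertools import groupby

def run_length_encode(values):
    """Collapse consecutive duplicates into (value, count) pairs."""
    return [(v, len(list(g))) for v, g in groupby(values)]

rows = [("2011-03-01", "OH", 1), ("2011-03-01", "OH", 2),
        ("2011-03-01", "OH", 3), ("2011-03-02", "TX", 4)]

# Row layout interleaves columns, so runs are short and RLE barely helps.
row_layout = [field for row in rows for field in row]
# Column layout groups like values together, yielding much longer runs.
date_column = [r[0] for r in rows]

print(run_length_encode(row_layout))   # every run has length 1 here
print(run_length_encode(date_column))  # [('2011-03-01', 3), ('2011-03-02', 1)]
```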
Last year, Microsoft announced PowerPivot, a new column-store-on-a-desktop capability for Excel 2010, SQL Server 2008 R2, and SharePoint. The upshot: Redmond now does columnar, too.
The Trump Card
PDW has at least one other benefit: chances are, you already have a SQL Server installation -- and, as a result, SQL Server experience -- in-house. As Microsoft itself might put it, why wouldn't you standardize on a scalable and cost-competitive data warehouse appliance that (unlike other MPP offerings) permits you to leverage your in-house database management and application development skills?
It's similar to the strategy Microsoft used to grow SQL Server from its obscure beginnings (SQL Server v1.0 was designed for the OS/2 operating system; Windows NT didn't even exist yet) to a SQL Server 2008 R2 release that most data managers concede deserves a seat at the enterprise table. In the case of SQL Server's growth, Microsoft was counting on the success (and increasing pervasiveness) of its Windows NT operating system. With SQL Server PDW, Redmond is banking on SQL Server's enormous market presence, to say nothing of a SQL Server BI stack that -- with built-in data integration, data mining, predictive analytics, OLAP, and reporting capabilities -- offers shops many freebies.
"The underlying premise for bringing PDW to market is [that] we've seen a level of success in a high-end transactional database [with vanilla SQL Server] and when we looked at the price for massive systems on MPP, we basically said, we can bring that same price/performance value proposition [to high-end data warehousing]," says Herain Oberoi, director of product management for SQL Server with Microsoft.
The way some data warehousing vendors tell it, scaling to support hundreds of terabytes is old hat. The former Netezza, the former Vertica, the former Aster Data Systems (which Teradata nabbed recently), and Dataupia Inc. -- to name just a few -- routinely trumpet an ability to scale into the hundreds of terabytes. Teradata even touts a "Petabyte Power Players" club that includes auction powerhouse eBay Inc. Oberoi says Microsoft expects to compete alongside these players in a triple-digit-terabyte "sweet spot." For this reason, Redmond isn't currently talking up petabyte-scale capabilities -- but that could change.
"Right now, we're saying [PDW can scale to support] hundreds of terabytes. We haven't made any explicit statements around the petabyte range. I'm not quite sure what if [it's a question of] technical limitations [or if] we haven't deployed any customers in that range," he comments. "But if you're looking at hundreds of terabytes, this would be your sweet spot. The benefit of an MPP system is that [while] there is some overhead to adding new nodes, you do get very close to linear scale, so the system should technically be able to [scale into the petabyte range.]"
Is SQL Server PDW Really Coming to Your Data Center?
If SQL Server is going to scale to support Very Large Data Warehouse (VLDW) configurations of hundreds of terabytes or even petabytes, Microsoft is likely going to have to spend a bit more money. Just as the market has validated MPP as the platform and columnar as a killer architecture for high-end data warehousing, it likewise seems to be selecting in favor of advanced analytic technologies -- such as MapReduce or the open source Hadoop framework -- to support VLDW, at least for certain kinds of applications.
Like MPP, "[MapReduce is] a parallel computing system," says White, "but what [the] Map [process] does is it's like an ELT process. It gathers all of the data that you want to process ... then the reduce process performs analytics [on that data]."
MapReduce or Hadoop, White explains, is a brute-force approach to data-crunching. In the highly virtualized data centers of today (to say nothing of the data center of tomorrow, which will likely be almost completely virtualized), it's an approach that makes a lot of sense.
"[MapReduce] is massively parallel processing, but there is not a level of optimization for that processing like there is in the relational product. With Hadoop systems, there's no optimizing. It's a programmatic interface. Hadoop ... makes it very easy to build parallel programs. You don't have to understand how to write parallel programs. The system does that for you."
IBM was in the thick of (if not quite out in front of) this charge, announcing its own Hadoop strategy ("BigSheets") last year at this time. EMC got a built-in MapReduce capability via its acquisition of the former Greenplum; ditto for Teradata, which acquired MapReduce supporter Aster Data Systems.
If SQL Server PDW really is coming to your data center, Microsoft will probably have to offer some kind of MapReduce-like capability.
However, Redmond hasn't yet announced a MapReduce or Hadoop strategy for SQL Server. This isn't especially surprising. After all, Hadoop is both open source and Java-based. At the very least, Microsoft has had a complex history with the open source community; its stance on Java has been only slightly less complex.
In fact, the manner in which Redmond claimed to "support" Java provoked the former Sun Microsystems Inc. into suing Microsoft (for alleged IP infringement) and was also a factor in the United States Department of Justice's decision to prosecute Microsoft for allegedly violating the Sherman Antitrust Act.
MapReduce doesn't have to be implemented in Java, however.
For example, Redmond has long been experimenting with a research project it calls "Dryad" -- a runtime architecture designed to facilitate application parallelization. Although Oberoi and other Microsoft officials demur when pressed on the question of MapReduce and SQL Server, Dryad looms in the background. Even several years ago, Redmond was touting a use case involving SQL Server Integration Services (SSIS) and Dryad -- basically, using Dryad to power a massively parallel ETL process (a use case that, coincidentally, is frequently touted in connection with MapReduce).