Up Close: Speed, Parallel Processing, and MPP
Neither parallelism à la Oracle nor parallelism à la Microsoft is classically parallel, or massively parallel in the manner of Teradata, Netezza, and others.
Recently, both Oracle Corp. and Microsoft Corp. entered the high-end data warehousing (DW) market, touting a pair of offerings -- respectively, the Oracle Database Machine and Microsoft's Project Madison -- that emphasize performance, scalability, and sky's-the-limit expandability.
Central to both offerings is a massively parallel processing (MPP) implementation -- the special sauce made famous by high-end DW firm Teradata Corp., and (more recently) espoused by data warehouse appliance vendors Netezza Corp., DATAllegro Corp., Dataupia Inc., Kognitio, ParAccel Inc., and others. There are questions, however, about the nature of the MPP in both the Oracle Database Machine and the Postgres-cum-Ingres-cum-SQL Server DATAllegro technology. Neither is "classically" MPP -- at least not in the Teradata sense of the term. There's a sense, in fact, in which both approaches reflect the database-centric viewpoints of their progenitors.
Consider Oracle's MPP-like implementation, in which an Oracle database server parcels out SQL queries to, and receives results from, a shared-nothing cluster of Oracle Exadata Servers. (The latter are connected to the database server by means of an InfiniBand pipe.) Industry watcher Curt Monash, a principal with Monash Research, describes this as "node heterogeneity," which he contrasts with the approaches used by MPP stalwarts such as Teradata, Netezza, and Dataupia.
"Oracle is the first major vendor for whom it is important to remember that different parts of a query plan get parallelized across completely distinct sets of processors and processing units," Monash wrote on his DBMS2 blog. The jury's still out on how Oracle's approach will stack up relative to the MPP main, however. "[H]ow good is all this parallel technology? On the one hand, we know Oracle has been shipping it for a long time, and has it widely deployed. On the other, we also know that Oracle performance has been very problematic for large parallel queries. Surely most of those problems were due to the shared-disk bottleneck, but were they all [or mostly all]? I don't yet know."
Dataupia CTO John O'Brien, himself a data warehouse architect, concedes that this approach is both functional and (potentially) highly scalable, but nonetheless claims that it isn't entirely innovative. "Oracle is using a parallel approach that says 'Let's do a lot of filtering projection down at the disk level and let's put an Oracle database down there.' What Oracle is doing is leveraging a larger Oracle RAC instance to be their aggregator node, and that's how they bring it back into the shared architecture from their shared-nothing architecture."
It's an approach that O'Brien says he used himself back when he was building high-end data warehouses for Oracle shops. "I could've built that on my own three or four years ago. Three years ago, I was building 50 TB Oracle systems, so it isn't really a breakthrough in that sense."
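The division of labor O'Brien describes (filtering and projection down at the storage level, with a database tier acting as aggregator) can be pictured with a toy sketch. This is an illustration only, not Oracle's code: the function names, the sample rows, and the "rows shipped" counter are all assumptions made up for the example.

```python
# Toy model of storage-level filtering and projection with one aggregator.
# All names and data here are hypothetical; this is not Oracle's implementation.

def storage_cell_scan(rows, predicate, columns):
    """Runs on a storage node: drop non-matching rows, keep only needed columns."""
    return [{c: row[c] for c in columns} for row in rows if predicate(row)]

def aggregator_sum(cells, predicate, columns, measure):
    """Runs on the database tier: fan the scan out, then combine what comes back."""
    total = 0
    rows_shipped = 0  # how many rows had to cross the interconnect
    for cell in cells:
        reduced = storage_cell_scan(cell, predicate, columns)
        rows_shipped += len(reduced)
        total += sum(r[measure] for r in reduced)
    return total, rows_shipped

# Two storage cells, each holding its own slice of the table.
cells = [
    [{"region": "east", "amount": 10, "note": "a"},
     {"region": "west", "amount": 5, "note": "b"}],
    [{"region": "east", "amount": 7, "note": "c"},
     {"region": "north", "amount": 3, "note": "d"}],
]

total, rows_shipped = aggregator_sum(
    cells, lambda r: r["region"] == "east", ["amount"], "amount"
)
print(total, rows_shipped)
```

Of the four stored rows, only the two that match the filter travel to the aggregator, and each arrives with a single column. That reduction is the point of pushing work down to the disk level.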
Microsoft's MPP story is slightly trickier -- in part because the specifics of the underlying DATAllegro technology aren't all that well understood.
What seems clear, however, is that DATAllegro's MPP implementation differs from the approaches used by Teradata, Netezza, Dataupia, and others. That stems, in part, from the same design decision that made DATAllegro -- more than its competitors -- so attractive to Microsoft: the DATAllegro technology doesn't require significant customization to the underlying database. DATAllegro itself shifted from a Postgres to an Ingres foundation over the course of about 18 months.
"What they do is they take the SQL and send it out to all of [their] nodes for processing. [From there] they take all of the results [and] put [them] back into an aggregator, so if you did a group with a sort, you'd have to get the results from all of the nodes," says a data warehousing architect familiar with DATAllegro's technology.
The problem with this approach, this person says, is that the aggregator becomes the bottleneck. It's for this reason that most MPP players (e.g., Teradata, Netezza, and Dataupia) took a different route. "When you're dealing with large data sets, you can blow out your aggregator pretty quick. Your memory, your I/O -- you end up with a very low concurrency feature."
Adds this DW professional: "One of the things [Microsoft] really liked about the DATAllegro architecture was the fact that they could basically unplug the Ingres database and plug in SQL Server and get all of those aggregations, so you could see that Microsoft was interested in the fact that they weren't buying another massively parallel database that would've been pretty hard to integrate into SQL Server. They were buying aggregated modules that would be pretty easy to integrate into SQL Server [so they could] get some pretty easy parallelization."
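The scatter-and-gather flow this architect describes can be sketched in a few lines. The sketch below is a simplification under stated assumptions, not DATAllegro's code: the partitions, the function names, and the `rows_received` counter are hypothetical and exist only to show where the work piles up.

```python
# Toy scatter-gather: every node runs the same GROUP BY, one aggregator merges.
# Hypothetical names and data; not DATAllegro's implementation.
from collections import Counter

def node_group_count(partition):
    """Each node runs the same grouping query against its own slice of the data."""
    return Counter(row["customer"] for row in partition)

def aggregator_merge(partials):
    """The single aggregator needs every node's result before it can sort."""
    merged = Counter()
    rows_received = 0
    for partial in partials:
        rows_received += len(partial)  # one row per group, per node
        merged.update(partial)         # Counter.update adds the counts together
    ordered = sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))
    return ordered, rows_received

partitions = [
    [{"customer": "acme"}, {"customer": "zenith"}, {"customer": "acme"}],
    [{"customer": "zenith"}, {"customer": "birch"}],
    [{"customer": "acme"}, {"customer": "birch"}, {"customer": "birch"}],
]

partials = [node_group_count(p) for p in partitions]
ordered, rows_received = aggregator_merge(partials)
print(ordered, rows_received)
```

Even in this tiny case, the aggregator takes in six partial rows to produce three final groups. Its intake grows with the number of nodes times the number of groups each node returns, which is the memory and I/O pressure the architect is pointing to.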
Unlike Oracle, of course, Microsoft hasn't yet delivered its MPP entry. That raises the question of just how long it's going to take Microsoft to finally productize (or SQL Server-enable) DATAllegro's MPP implementation.
Back in July, for example, consultant, author, and data warehousing architect Mark Madsen, a principal with DW consultancy Third Nature, predicted that it would take Microsoft "three years, when the next rev of SQL Server comes out." DATAllegro CEO Stuart Frost, for his part, downplayed such pessimism. "It's not going to take years, as some people in the blogosphere are predicting," Frost said immediately after the acquisition. "Just from [the integration work] we've already done, we've actually found that it's going to be pretty straightforward. All of the hooks are there already [such as] the APIs. We don't have to change a line of code in SQL Server."
Earlier this month, Microsoft disclosed plans to ship Project Madison in the first half of 2010 -- as much as two years after it first acquired DATAllegro. Industry watchers such as Madsen point to Microsoft's delays with SQL Server 2000, SQL Server 2005, and SQL Server 2008 as reasons to doubt even that projection.
The challenge, he argues, is that even though DATAllegro touts a database-independent architecture, Microsoft will almost certainly have some difficulty "porting the shared-nothing bits from Ingres, Linux, and C/C++ to SQL Server, Windows, and C#. That's a lot of technology change to deal with, even if you don't have to change the database kernel."
Moving bits is just what Microsoft says it's doing -- although company officials reject the notion that the integration of DATAllegro's assets is shaping up to be an inordinately involved process.
"The main thing that we're doing here is we're moving DATAllegro bits on to Windows. Currently they have it running on Linux. The second piece of it is that they have Ingres as the database that's part of the solution, and that's being replaced by SQL Server," says Herain Oberoi, group product manager for SQL Server with Microsoft. "They [DATAllegro] specifically built an architecture where they didn't have any proprietary code inside of Ingres itself, [which] makes it relatively easy to swap Ingres out and put SQL Server in."
For this reason, Oberoi doesn't see any reason why the Project Madison timetable should slip. "Right now, both Kilimanjaro [a BI-centric version of SQL Server 2008] and Madison are scheduled to ship in the first half of 2010. There's no reason to think we won't hit that," he insists.
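The swap Oberoi describes depends on the parallel layer talking to each node's database through a narrow interface, with no proprietary code inside the engine itself. A minimal sketch of that idea follows. Everything in it is an assumption for illustration: the `LocalEngine` interface, the toy `ListEngine`, and the use of Python's built-in SQLite module as a stand-in for a real engine. None of it reflects the actual DATAllegro or SQL Server code.

```python
# Toy "database-independent" parallel layer: the coordinator only ever calls
# run(sql), so the per-node engine can be replaced without touching it.
# All class and function names are hypothetical.
import sqlite3
from typing import Protocol

QUERY = "SELECT SUM(amount) FROM sales"

class LocalEngine(Protocol):
    def run(self, sql: str) -> list[tuple]: ...

class ListEngine:
    """Stand-in for the old engine; it only understands the one query above."""
    def __init__(self, amounts):
        self.amounts = list(amounts)

    def run(self, sql: str) -> list[tuple]:
        if sql != QUERY:
            raise ValueError("unsupported query in this toy engine")
        return [(sum(self.amounts),)]

class SqliteEngine:
    """Stand-in for the replacement engine, backed by an in-memory SQLite table."""
    def __init__(self, amounts):
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute("CREATE TABLE sales (amount INTEGER)")
        self.conn.executemany(
            "INSERT INTO sales VALUES (?)", [(a,) for a in amounts]
        )

    def run(self, sql: str) -> list[tuple]:
        return self.conn.execute(sql).fetchall()

def parallel_sum(engines: list[LocalEngine]) -> int:
    """Coordinator: send the same SQL to every node and add up the partial sums."""
    return sum(engine.run(QUERY)[0][0] for engine in engines)

before = parallel_sum([ListEngine([1, 2]), ListEngine([3])])
after = parallel_sum([SqliteEngine([1, 2]), SqliteEngine([3])])
print(before, after)
```

The coordinator function is identical in both runs; only the engine behind each node changes. A real port would still have to move that coordinator across operating systems and languages, which is the part of the job Madsen flags.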
Microsoft also has a next-generation SQL Server release planned for a 2011 delivery. That's a lot on the SQL Server team's plate. "The plan is to ship it [Project Madison] as a separate thing [from Kilimanjaro]. We don't know what the packaging of that separate thing is going to look like, but as of now it's not going to be a part of Kilimanjaro," Oberoi concludes.
"The next major release [of SQL Server] will be in 24 to 36 months. That will be 2010 to 2011. Before we do that, though, because we have these new capabilities that we have to get out the door, we'll be able to ship these in the contexts of Kilimanjaro and Project Madison."
Stephen Swoyer is a Nashville, TN-based freelance journalist who writes about technology.