In-Depth
In Depth: Closing the Ad Hoc Query Performance Gap for Good
Poor ad hoc query performance was the motivating reason behind the rise of data warehousing in the late 1990s -- and it's still a problem today.
The way many business intelligence (BI) and data warehousing (DW) vendors are talking, you'd think we were back in 1998. Back then, poor ad hoc query performance was the motivating raison-d'etre behind the upsurge of enterprise DW.
According to a chorus of today's most prominent BI and DW players, it's a problem that still hasn't been licked.
At May's TDWI World Conference in Chicago, for example, BI and DW vendors had ad hoc queries on the brain: at times, it even seemed as if the intervening decade -- during which the industry as a whole pushed ahead with performance management (PM), ubiquitous BI, and other cutting-edge applications -- simply hadn't happened.
Opinions differ as to why ad hoc query performance remains problematic. Some cite sluggish relational databases, which they say are ill-equipped to support compute-intensive analytic workloads. Others point to rigid data models or monolithic data warehouse architectures, which they say limit the kinds of queries users can run. Vendors also say the blame lies with users themselves, arguing that insatiable business users - those who keep demanding more from IT -- are their own worst enemies. A few insist that nothing's wrong: the ad hoc query performance issue is solved and has been so for the better part of a decade.
Regardless of the cause of the problem, most vendors we spoke to believe that pointing fingers at business users isn't a good (much less a defensible) response.
"We hear from customers with existing enterprise data warehouses. They're fully experiencing the pain of users calling up the data management guy or the IT department saying 'We need to be able to do this!' or 'Why can't we run this query at this time?'" explains Kirsten Chapman, vice-president of marketing and alliances with upstart DW player illuminate Inc. "That's what's turned data warehousing into a nightmare: the user. But when you think about it, the user is being reasonable and IT is the one [that's] being unreasonable."
Scalable Analytic Platforms
illuminate is among many vendors pushing alternative DW solutions designed specifically to boost ad hoc query performance. Its solution -- a so-called "correlational" database management system (CDMBS) -- works in part by intelligently indexing (and heavily compressing) source data.
It's not the only vendor trying to solve poor query performance. The DW appliance segment, which -- from its notional birth (with Teradata Corp.) through its nominal christening (with Netezza Inc.) and iterative refinements (DATAllegro Corp., Dataupia Inc., and others) -- once focused solely on Big Data customers, now seems to be chasing the Bigger Faster Cheaper crowd. At TDWI in Chicago, for example, DW players were talking about poor ad hoc query performance, citing their own approaches (which rely on brute force, clever data processing tricks, or -- typically -- some combination of both) to mitigate query performance issues.
TDWI's conference also saw a new appliance player (Sybase Inc.) and served as a launching pad for several analytic-oriented appliance refinements (for example, a partnership between Dataupia and data visualization specialist Tableau Software).
Even Big Fast Data stalwart Teradata was talking about the issue, citing its new sub-$100,000 appliance deliverables as a compelling and affordable solution. It's a market that Teradata once priced itself out of, concedes Randy Lea, vice-president of product and services marketing.
It used to be that Teradata concentrated largely on the data warehousing high-end -- overlooking customers trying to address lingering capacity or performance issues, said Lea, during an interview at the conference.
Nowhere was this mid- and low-end pain more acutely felt than in ad hoc query and analysis, according to Lea. The irony, he says, is that even though ad hoc query performance is Teradata's strongest selling point, it couldn't match the price points targeted by Netezza, DATAllegro, Kognitio, Dataupia, and others -- prior to the introduction of its new appliance systems, that is.
With a sub-$100,000 appliance ($67,000 for the 550, according to Teradata marketing collateral), it can now do just that, Lea argues.
"This is where the appliance vendors quite frankly were leveraging our exit from the market, but we now have products -- the 550 and the 2500 -- that are designed specifically for this environment," Lea explained.
An Issue of Scale, Not Performance
Lea is one of several industry experts who rejects the notion of an ad hoc query performance gap, but many industry watchers we spoke to insist it is a problem because, in part, the ad hoc query issues of today aren't the same ad hoc query issues of a decade ago. Back then, Lea notes, the issue was supercharging performance for analysts, executives, and power users. Increasingly, however, organizations are putting analytic capabilities at the fingertips of rank-and-file end users.
It's a not-entirely-new BI paradigm. Regardless of what you call it -- operational or pervasive BI are two popular terms -- it's attended by a new and altogether more vexing problem of scale. DW systems that have been designed to support several dozen business power users can quickly run out of steam once they're exposed to several hundred (or more) rank-and-file users, very few of whom are concerned with the vagaries of query syntax, efficiency, or elegance.
Even as organizations open up their DW systems to more users, they're also bringing ever large volumes (and a greater variety) of data into the DW itself. Historical analyses that once involved months or, at most, a year, now consume three, five, or even ten years' worth of data. Moreover, users may need to mix and match data from the usual suspects (inventories, purchase reports, etc.) with non-traditional sources (such as pricing information from competitors). The combinations can get downright dizzying.
The issue, then, isn't that the hoc query problems of old haven't been licked, Lea and others argue -- it's that organizations are just now coming to grips with what pervasive BI actually entails. Hence the mad rush on the part of vendors, new and old, to get into the game. Not surprisingly, everyone touts a different solution -- and blames a different problem.
Teradata, of course, sees the issue through Active-Data-Warehousing-colored lenses. (Active Data Warehousing, for the record, is Teradata's signature DW philosophy; see http://esj.com/Business_Intelligence/article.aspx?EditorialsID=8968).It isn't that Active Data Warehousing is the only potential prescription for poor ad hoc query performance, Lea acknowledges; it's just that so few companies (let alone the vendors that support them) have been able to encapsulate their overarching data management (DM) practices in the context of a logical and scalable DW architecture. That's what Teradata's Active Data Warehousing vision does, and -- Lea insists -- it's what lets Teradata address the needs of both traditional power users and ad hoc query neophytes. "It provides the platform to also support operational intelligence, [which] is supporting the frontline users and the users in the frontline. Now the industry is calling that pervasive business intelligence, [but] we've actually been telling that story for three or four years. More important, we've been delivering on it [during that time], too," he argues. "We're able to offer [customers] active access, [for example]. That's the ability to access data with a [service level agreement] for quick, active responsiveness. That response might be 2 seconds at a Web site to put up a personalized offer, that access might be three hours while you're flying on a plane [without Internet connectivity]."
Brute Force
The appliance crowd takes a different tack, in many cases trumpeting the brute force approach of their all-in-one appliance systems.
It can make for an eyebrow-raising argument because it typically comes packaged with an acknowledgement of its own inelegance. Anti-appliance players, for example, like to dismiss DW appliance entries as simply throwing brute-force processing and storage power at Big Data problems.
The irony, of course, is that appliance players themselves haven't shied away from this view.
Take Netezza, for example. The Netezza Snippet Processing Units (SPU) which provide the data processing power for its Netezza Performance Servers (NPS) are "wickedly fast, but … narrow in [terms of] their capabilities," says Tim Young, vice-president of marketing with Netezza. SPUs aren't "like an Intel processor, where [they] can do graphics and all of these other technologies. They're designed with just this [purpose] in mind."
More to the point, King concedes, Netezza's Performance Servers (NPS) have typically focused on brute-force performance -- getting there (as Confederate Lieutenant General Nathan Bedford Forrest once famously put it) fastest with the mostest. "Up to this point, we've been selling speed and feeds to technologists. Someone has a data warehouse that's performing very badly. They're investing lots of money in performance tuning," Young explains. "Now they're saying, 'You know, we've come as far as we're going to by adopting this approach.' Nowadays, most people are looking for a business solution. They don't just want to throw technology at a general problem-- they want specific solutions designed to solve specific business problems."
The fastest-with-the-mostest approach still has merit, Young stresses -- customers, after all, want a baseline reassurance that their query pain points are going to be addressed, and the DW appliance's Big Fast Data value proposition gives them just that. Netezza is also evolving its products to address specific business pain points, too.
"At the moment, we're retooling our operations to start focusing more on specific industries. We'll be focusing on specific verticals, and not just in the high end, where we've [traditionally] played. We think there's a real opportunity in the sort of medium enterprise [segment]," he explains.
Netezza isn't alone. Its competitors are also taking up both their brute-force approach and, increasingly, their business-savviness, too (see http://esj.com/business_intelligence/article.aspx?EditorialsID=8960).
Relational Shortcomings?
There's a further wrinkle here -- today's DW appliances aren't brute-force-only propositions. A growing number boast columnar database underpinnings (e.g., the Sybase Analytic Appliance; appliances from ParAccel Inc., Vertica Inc., and InfoBright Inc.), which -- not surprisingly -- comprise another oft-touted answer to poor ad hoc query performance: the "analytic" database.
Ask any columnar database vendor and you'll likely hear some variation on a single seminal spin: most existing data warehouse systems are based on or derived from "legacy" relational database management systems (RDBMS), which -- columnar advocates argue -- simply aren't designed (or optimized) for analytic workloads. You can try to solve this problem by throwing more horsepower at it (as, columnar proponents allege, some non-columnar DW players do), or -- alternately -- you can focus on fine-tuning the database itself or finesse-ing the heck out of your data models. Either way, they claim, it's a kludge.
(Data models, by the way, are a favorite target of ad hoc query performance critics: they're either "too complex" -- i.e., they require "wizard-like" expertise to concoct in the first place -- or they're "too limiting," in that they constrain the kinds of queries which users can run. In either case, next-gen DW players and columnar database specialists like to tout their quasi-indifference to data models: it doesn't matter if the data model itself is kludgey, they argue; their brute-force hardware or "intelligent" columnar underpinnings can cut through it.)
Most columnar players (e.g., ParAccel, Vertica, and InfoBright, among others) use this argument, at one point or another, but they aren't the only ones. DW upstart illuminate -- which markets neither an appliance nor a columnar database -- also makes use of it. To wit: instead of tackling the ad hoc query problem by massaging data models (as, illuminate argues, relational experts advocate), or throwing more horsepower at it (the approach of many in the DW appliance segment), or compressing data and ratcheting up the horsepower (like the columnar DW appliance specialists), illuminate touts a third way: a "correlational" database that's fully indexed and which uses a data-driven schema based on correlations in the source data.Illuminate's argument, as outlined by chief architect and co-founder Joe Foley, is certainly enticing: forget what you know about data models, data warehouse architectures, and ongoing DW management: with illuminate's CDBMS, there's no re-architecting, no reloading, and no recalculating, Foley maintains.
"With any competing design, no matter how easy they try to make it, you still have to do some design work in advance, [such as] putting columns in sequence, and so on. But as soon as you've made any structured design choices, then you've limited what you can do," he argues.
"The result of any of those [competitive approaches] is that you cannot do ad hoc analysis. Any kind of pre-defined structure limits the kinds of questions you can ask, and SQL itself limits this," Foley continues. "But by having everything indexed, we can store all of the data at one level. Everyone accesses that same database set. Summary data is actually embedded in the same database."
The result, Foley maintains, is an analytic speed demon. For one thing, the CDBMS indexes everything, so it obviates the need for a query optimizer. More to the point, he argues, a CDBMS like illuminate can (in a sense) help free business users from the "hegemony" of SQL DBAs: because there's no query optimizer -- in fact, there's no need to fine-tune queries to prevent them from hijacking or overloading a database -- business users don't first have to funnel their queries through the data management group.
In other words, argues illuminate's Chapman, a CDBMS makes it easier for business users to bang away at the database -- i.e., to really and truly initiate ad hoc queries -- without "interference" or "assistance" from IT.
Relational Reassertion
Not everyone thinks the RDBMS is ill-suited for dual-use as both an OLTP and an analytic repository. Teradata's Lea, for example, notes that Teradata Warehouse is based on relational underpinnings -- with DW-friendly optimizations, of course. Properly tweaked, Lea and others argue, the RDBMS is a stellar analytic performer. Teradata, he points out, has built its business on it.
Perhaps no one takes more exception to this claim -- or canard, depending on your point of view -- than Willie Hardie, Oracle's vice-president of database marketing. The RDBMS -- or, more specifically, the Oracle RDBMS -- is an unmatched analytic workhorse, Hardie argues. Oracle is one of the biggest data warehousing players in the business, he points out, and the Oracle database powers some of the largest DWs in existence.
"The Oracle database is proven to be the fastest database out there for both transactional systems and data warehousing systems, across all scales, from small to extremely large systems. You ask any Oracle customer out there and they'll all give you the same answer: Oracle is the fastest database out there on the market right now," he claims.
If there is an ad hoc query problem, Hardie and others argue, it's precisely one of scale: customers are trying to do more than ever before with their DW systems.
That's one reason Oracle announced its Optimized Warehouse (OOW) program last October. Furthermore, if problems of scale are, indeed, plaguing enterprise DW efforts, the new HP BladeSystem for OOW -- which Oracle recently introduced in tandem with hardware partner Hewlett-Packard Co. -- should go a far piece toward addressing them (see http://www.tdwi.org/News/display.aspx?id=9024).
Optimized for what officials call the DW appliance sweet spot -- the 1-4 TB range -- HP's OOW-ready BladeSystem starts shipping this month. It boasts both brute-force density (in the form of HP's super-slim blades) and a COTS Oracle 11g database that's been tuned and optimizing for data warehousing.
More to the point, HP officials argue, it gives the people what they want: in this case, Oracle-on-an-appliance -- and pre-tuned Oracle on an appliance, at that. "Customers have been looking for data warehouse solutions in the context of an appliance-like solution, and the combination [of Oracle and HP] is something we've been mutually asked for a long time," says Rich Ghiossi, director of BI portfolio marketing with HP.
"From the user's perspective, the ability to have this system with Oracle on it, with HP, preconfigured, ready to start loading data within hours after [it's plugged in] -- that's huge."
An Evolution of Expectations
Not everyone agrees that there's an ad hoc query performance problem, per se. Some (such as Teradata's Lea, Oracle's Hardie, and others) see what's happening as a natural evolution of user expectations.
Whether it's an issue of scale, changing user behaviors (e.g., users wanting freer reign to ping away against the DW), or a full-fledged paradigm shift (occasioned by a shift to real-time or right-time data consumption), one thing's clear: the data warehousing times are a-changing.
"We believe that in the future the bulk of [IT spend] will be spent on analysis," says David Jacobson, senior director of field marketing with Sybase. Jacobson and Sybase just took the wraps off of their first appliance offering, the aptly-named Sybase Analytic Appliance. Powered by a multi-vendor special sauce (Sybase IQ, MicroStrategy BI, and IBM System p RISC/Unix hardware), the Sybase Analytic Appliance is one of the new breed of offerings that purports to address the shortcomings of its predecessors.
If there's a query performance problem -- and Jacobson thinks there is -- it's almost certainly because of scale. Jacobson should know: he was part of a team that built Sybase's first data warehouse 12 years ago.
"We were really excited when we could load a whole quarter's worth of data into the database. We would run a query against it, and at that time it was probably $150 to $200 million in sales for 90 days, and we thought that was great! What [customers are] dealing with now is something else entirely. They want to dump all of this data -- six years of data -- into IQ!
"We're producing a ton of data, and the [competitive] advantage is getting the most out of that data. It's just [a] whole different kind of [approach to] data warehousing."