The Case for Data Virtualization (Revisited)

At some point, advocates argue, data management diehards are going to have to come to terms with something very much like virtualization.

For a long time, Composite Software Inc. was out in front of the data virtualization (DV) wave. Several years ago, in fact, Composite repositioned its core offering -- the Composite Information Server -- as the foundation for its next-gen DV vision.

At that point, the market hadn't yet accepted data federation (DF), at least to the point that DF was seen as an essential component of any data integration (DI) stack.

DF is a key component of data virtualization, and -- prior to its decision to focus on and organize around DV -- Composite was seen as a champion of federation: several large vendors (including Informatica Corp. and the former Cognos Inc.) bundled and resold its data federation technology as a complement to (or enabling technology for, in the case of Cognos) their core offerings.

Nowadays, argues Bob Eve, Composite's vice president of marketing, data federation has more or less been commoditized. As Eve puts it, "Everybody has [federation] now."

That isn't quite true, but most big DI vendors do offer a federation component of some kind: IBM, along with Composite, was in the forefront of the DF push, while competitors SAP AG (which acquired Business Objects SA), Oracle Corp., and SAS Institute Inc. subsidiary DataFlux market federation offerings of varying sophistication.

"To the business intelligence application, we want to present the data as though it [were already] in the data warehouse, or consolidate [it] in some kind of data store," said Colin White, president and founder of BI Research, at TDWI's World Conference in Washington, D.C.

White endorsed DF as a means to accelerate the consumption and analysis of information in time-critical environments. "With data federation, you actually gather the data and consolidate it dynamically at the time you issue a SQL query, and the SQL query is presented to a virtual view of the data."
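To make White's description concrete, here is a minimal sketch of query-time federation, using Python's built-in sqlite3 module as a stand-in for a commercial federation server. The two attached in-memory databases (crm, orders_db), their tables, and the customer_totals view are hypothetical names invented for this illustration; they do not reflect how Composite or any other vendor implements federation.

```python
import sqlite3


def build_federated_connection():
    """Attach two separate databases and define one virtual view over both."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical stand-ins for two independent source systems.
    conn.execute("ATTACH DATABASE ':memory:' AS crm")
    conn.execute("ATTACH DATABASE ':memory:' AS orders_db")
    conn.execute("CREATE TABLE crm.customers (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE orders_db.orders (customer_id INTEGER, amount REAL)")
    conn.executemany(
        "INSERT INTO crm.customers VALUES (?, ?)",
        [(1, "Acme"), (2, "Globex")],
    )
    conn.executemany(
        "INSERT INTO orders_db.orders VALUES (?, ?)",
        [(1, 100.0), (1, 50.0), (2, 75.0)],
    )
    # The view stores no rows of its own. SQLite lets a TEMP view span
    # attached databases, so the join runs against the sources at query time.
    conn.execute(
        """
        CREATE TEMP VIEW customer_totals AS
        SELECT c.name AS name, SUM(o.amount) AS total
        FROM crm.customers AS c
        JOIN orders_db.orders AS o ON o.customer_id = c.id
        GROUP BY c.name
        """
    )
    return conn


def query_totals(conn):
    """Issue a SQL query against the virtual view, as a BI tool would."""
    return conn.execute(
        "SELECT name, total FROM customer_totals ORDER BY name"
    ).fetchall()


if __name__ == "__main__":
    conn = build_federated_connection()
    print(query_totals(conn))
    # A new row in a source shows up on the next query; nothing is reloaded.
    conn.execute("INSERT INTO orders_db.orders VALUES (2, 25.0)")
    print(query_totals(conn))
```

The consuming application sees only customer_totals, as though the data were already consolidated. A real federation layer would also have to cope with remote sources, differing SQL dialects, and query optimization, none of which this sketch attempts.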

Two of the biggest drawbacks associated with federation, White said, are data governance -- particularly with respect to performance -- and data quality. Data virtualization purports to address these and other concerns, chiefly by emphasizing the particular and the pragmatic over the universal and the draconian.

Instead of prescribing the federation of resources, DV proposes mixing and matching data integration technologies to construct a virtual view of an integrated enterprise. At a basic level, DV works with the integration technologies that you already have -- ETL, by and large -- and uses federation to knit everything together.

In this arrangement, federation technology creates a kind of over-arching superstructure -- a single view of disparate (and often radically different) data sources -- but actual access to data is handled by means of several different transport backbones, including sources such as trickle-fed operational data stores, batch-refreshed data warehouse systems, data inside operational repositories, or live event data streamed over an enterprise messaging bus.

This last connectivity scenario, advocates concede, is a still-maturing aspect of the DV vision.

Support for data-derived sources -- as distinct from application-derived traffic such as event streams -- is comparatively mature; this category includes legacy, unstructured, or semi-structured data-types. Such support can't be based on least-common-denominator connectivity, says Composite's Eve.

"When we access a source, we have to understand the [underlying optimizer]. We can do a JDBC connect or an ODBC connect, but if you really want to go into Sybase IQ, that takes extra work. Recently, we wrote four new optimizations just for Netezza. We're always kind of doing that," he explains. "We have to support deeper optimizations. The Netezzas themselves can do a lot. They're pushing down all kinds of capabilities into the database that used to be in the analytic technology, and we try to take advantage of all of that."

Data virtualization is still rapidly evolving. In addition to engineering improved support for streaming event data, DV vendors say they're evolving their products to better address practical, real-world issues, too: Eve, for example, says Composite is now able to support broader deployments. One customer, he says, "has a virtual layer of reference data [that] they're surfacing around the world. They have these three different systems [in each location]: a Developer, a [Production], and a Backup [system] ... and they're in these three geographically-dispersed locations, in the U.S., in the E.U., and in Asia, so they have to constantly sync them." Elsewhere, Composite continues to infuse its DV platform with complementary data management technologies, including ETL- and data quality-like capabilities.

There's a perception that DV should sell itself. Its benefits, after all, are seen as easily demonstrable, particularly to business users, who, advocates like Eve say, feel "frustrated" by IT's lack of responsiveness or flexibility. Instead of waiting weeks or months for IT to expose data in previously siloed data sources, virtualization promises to accelerate the integration process.

Resistance from IT is still a big barrier, however. "It's hard for business people to drive an initiative that far back into data integration techniques," Eve says. "What [Composite is] trying to do is to go to the IT people and say, 'Wouldn't it be great if you could be more responsive? Wouldn't it be great if you had lower total cost of ownership? Wouldn't you be better off with a flexible toolbox that you could bring to bear on any problem that you wish?'"

These are rhetorical questions, Eve argues -- to everyone but DM traditionalists. Like it or not, he maintains, DM diehards are eventually going to have to come to terms with something very much like data virtualization.

"I think it's the data warehouse appliances and these NoSQL stores that have just kind of eroded their core tenets, which is [that] the single warehouse is the place for everything," he concludes. "Once you get this many data warehouse appliances and data stores -- now that you have them, you're only going to get value by integrating them, and at that point the best way to integrate them is by virtualization. It's almost this aftermarket bad consequence of a good thing."

Not surprisingly, Philip Russom, research manager with TDWI Research, views the matter as somewhat less clear-cut. That said, Russom sees a number of undeniable benefits associated with data virtualization, including better resistance to change or disruption (DV is based on an abstraction layer, so DI targets are somewhat insulated from changes to DI sources); improved responsiveness (thanks in part to the ability to quickly provision new targets or sources); much greater reuse of objects and services; and, oddly enough, a business-oriented view of data.

"[W]hen this [virtualization] layer exposes data objects for business entities -- such as customer, order, and so on -- it provides a business-friendly view that abstracts yet handles the underlying complexity," wrote Russom, in his recent TDWI Checklist Report on Data Virtualization.

Russom argues that DV can help quickly improve the quality and consistency of data.

"It can ensure that inconsistencies and inaccuracies across heterogeneous enterprise data stores are identified upfront without the need for staging and pre-processing," he wrote. "In addition, [data virtualization] enables business and IT to collaborate in defining and enforcing data quality rules on-the-fly or in real time as data is federated."