Case Study: Rx for Data Integration Woes

Oracle, MySQL, and Excel data come together

When Indiana University (IU) last year launched an ambitious $100 million initiative to provide researchers with transparent access to information stored in a variety of different data sources, the Hoosiers of the Big Ten got a decisive assist from the data integration experts at Big Blue.

Although IU has a rich tradition in theoretical and high-performance computing, the data access problems that its proposed Centralized Life Sciences Data (CLSD) system sought to remedy bore a striking similarity to the information integration woes that bedevil many IT organizations today.

“Indiana University has set a clear goal that any researcher in the school of medicine should be able to transparently access all relevant external resources, and all resources in the school of medicine for which they have access rights,” explains Craig Stewart, IU’s director of research and academic computing. “The idea is to allow the researchers to sit down at their computers and ask a research question and get all of the appropriate data, [regardless of] whether it comes from their own lab, or from another lab here [on one of IU’s campuses], or from some other resource on the Web.”

The impetus for CLSD, says Stewart, was a desire to make it possible for researchers to run queries against, and retrieve information from, a variety of different relational database management systems (RDBMS), as well as flat file and other unstructured data sources. The CLSD system must provide researchers with transparent access to data stored at IU’s School of Medicine, as well as facilitate access to public data sources such as the National Center for Biotechnology Information and the Swiss Institute of Bioinformatics.

Without a system such as CLSD, Stewart concedes, researchers must run separate queries against multiple data sources, each with its own query language and its own idiosyncratic interface, and then attempt to integrate the results by themselves. It’s a time-consuming process, to be sure, and it constitutes a formidable barrier to research.

Sounds like a familiar scenario, doesn’t it? Fact is, IU’s dilemma is by no means unique to academic or research computing. It besets almost all large IT organizations to some degree. “Very few [IT organizations] have just one database, and, generally, most have both relational and unstructured data [sources],” comments Wayne Kernochan, an analyst with consultancy Aberdeen Group.

In many environments, Kernochan notes, workers must be proficient with several different query interfaces—including proprietary solutions that lack uniformity of any kind—and don’t always have a meaningful way to compare their results. “What they’re asking for is something that gives them access to all of their data from a single interface,” he concludes.

The problem is exacerbated, Kernochan and other analysts say, by a paucity of available solutions. Extraction, transformation, and loading (ETL) tools have for years done the dirty work of aggregating data from multiple relational data sources, but until recently, there were few tools capable of facilitating access to both relational (structured) and non-relational (unstructured) data sources.

Clearly, says Stewart, IU required a tool that could do both. “The majority of what we have is either [in] Oracle databases or it’s [in] flat files,” he comments. “There are, in addition, some MySQL databases, some data that’s kept in Excel, and some data in other formats. All of the flat file data is unstructured.”

Enter DiscoveryLink

At around the same time that IU was building CLSD, IBM Corp. announced DiscoveryLink, a middleware technology that uses software “wrappers” (essentially adapters that interface with a data source) to facilitate single-query access to multiple databases.

DiscoveryLink, which IBM designed specifically for the biomedical and life sciences industries, seemed like the answer to IU’s prayers: It eliminated the need to code multiple, point-to-point connections to data sources and provided a virtual view of multiple heterogeneous data sources. IBM even provided toolkit APIs that made it possible for IU to build DiscoveryLink wrappers to communicate with proprietary or unsupported data sources. Not surprisingly, says Stewart, IU started working with the DiscoveryLink product shortly after it became available.
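The wrapper idea described above can be illustrated with a minimal Python sketch. This is not DiscoveryLink's actual toolkit API, which the article does not show; the class names `SqlWrapper` and `CsvWrapper`, the function `federated_lookup`, and all of the sample data are hypothetical, and an in-memory SQLite database stands in for a relational source such as Oracle or MySQL. The point is only that each source hides behind the same small interface, so one lookup can span all of them.

```python
import csv
import io
import sqlite3
from typing import Iterable, Iterator, Protocol


class Wrapper(Protocol):
    """Hypothetical adapter interface: every source yields rows as dicts."""

    def rows(self) -> Iterable[dict]: ...


class SqlWrapper:
    """Adapter for a relational source (SQLite stands in for Oracle/MySQL)."""

    def __init__(self, conn: sqlite3.Connection, query: str) -> None:
        self.conn = conn
        self.query = query

    def rows(self) -> Iterator[dict]:
        cursor = self.conn.execute(self.query)
        names = [col[0] for col in cursor.description]
        for record in cursor:
            yield dict(zip(names, record))


class CsvWrapper:
    """Adapter for a delimited flat file, passed in here as text."""

    def __init__(self, text: str) -> None:
        self.text = text

    def rows(self) -> Iterator[dict]:
        yield from csv.DictReader(io.StringIO(self.text))


def federated_lookup(wrappers: Iterable[Wrapper], key: str, value: str) -> list:
    """Ask every source the same question and collect the matching rows."""
    return [
        row
        for wrapper in wrappers
        for row in wrapper.rows()
        if key in row and str(row[key]) == str(value)
    ]


# Made-up sample data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE genes (gene_id TEXT, symbol TEXT)")
conn.executemany(
    "INSERT INTO genes VALUES (?, ?)",
    [("G1", "GENE_A"), ("G2", "GENE_B")],
)
flat_file = "gene_id,expression\nG1,7.2\nG3,1.4\n"

sources = [
    SqlWrapper(conn, "SELECT gene_id, symbol FROM genes"),
    CsvWrapper(flat_file),
]
results = federated_lookup(sources, "gene_id", "G1")
print(results)
```

Supporting a new or proprietary source in this sketch means writing one more class with a `rows()` method, rather than another point-to-point connection in every application.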

When Big Blue announced a beta version of a product called DB2 Information Integrator, however, IU also climbed on board. After all, DB2 Information Integrator promised not only to expand the selection of available DiscoveryLink adapters but also to bring a bevy of new features to the table, including integrated in-memory text search capabilities and the ability to integrate information from the Web. “DiscoveryLink was what we started building with, but we felt that [DB2] Information Integrator had enough extras to justify testing it,” Stewart comments.

Jeff Jones, IBM’s director of strategy for data management, says that DB2 Information Integrator is available in two flavors—DB2 Information Integrator and DB2 Information Integrator for Content. “One is really meant to attract developers on the database side and the other on the content side, so when you write applications to either of these two interfaces, DB2 Information Integrator goes to work for you using the [programming] language that you chose,” he explains.

IU is still in the process of implementing Information Integrator, but Stewart says that he likes what he’s seen so far. “We think this is an excellent product, and, in our opinion, one of the benefits of the product is that it frees one from the vagaries of the database software market,” he notes. “You can access whatever you’re working with, whether it’s [a life sciences database such as] HMMER or just data that’s in Oracle, from one place.”

Stewart also sings the praises of the extensive array of data source wrappers that IBM has provided for DB2 Information Integrator and DiscoveryLink. “One can use smaller programs, called either parsers or wrappers, that let one access external data sources ranging from some of the commonly used public data sources to XML,” he continues. “What we found with Information Integrator is first a growing suite of modules that permit one to access the key data sources for biomedical research, and because there is a pool of these wrappers and parsers being developed, it’s very easy to implement Information Integrator.”

IBM’s Jones argues that DB2 Information Integrator’s federated architecture makes it an ideal platform for enterprise information integration (EII): “This does not require that you stage anything anywhere. This is a direct access story. The primary purpose that we see with this is enabling an application to go into any database that it has to to get data.”

Although IU is a large Oracle shop, Stewart says that he more or less agrees with this assessment. “From the standpoint of the databases that one is interacting with, I think the federated approach is the only viable approach for the future, particularly in biomedical data, where one might well have a data resource that one might be willing to expose partially, but would not be willing to let another entirely copy,” he says. “Because of the federated approach, you’re accessing data, but leaving the decisions about how much data is exposed at your remote sites up to the data owners.”
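Stewart's point about owners deciding how much data to expose can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of how DB2 Information Integrator enforces access: the `OwnedSource` class, its `exposed_fields` setting, and the sample record are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class OwnedSource:
    """Hypothetical remote source whose owner chooses the visible fields."""

    name: str
    records: list
    exposed_fields: frozenset

    def query(self, **criteria) -> list:
        # Callers may filter only on fields the owner has chosen to expose.
        for field_name in criteria:
            if field_name not in self.exposed_fields:
                raise PermissionError(
                    f"{self.name} does not expose '{field_name}'"
                )
        matches = [
            record
            for record in self.records
            if all(record.get(k) == v for k, v in criteria.items())
        ]
        # Return only the exposed fields, never a full copy of the record.
        return [
            {k: v for k, v in record.items() if k in self.exposed_fields}
            for record in matches
        ]


# Made-up sample data, for illustration only.
lab = OwnedSource(
    name="lab_samples",
    records=[
        {"sample_id": "S1", "tissue": "liver", "patient_name": "withheld"},
    ],
    exposed_fields=frozenset({"sample_id", "tissue"}),
)
visible = lab.query(tissue="liver")
print(visible)
```

The data stays where it lives; the federated layer sees only what the owner's settings allow, and widening or narrowing that view is a decision made at the source.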

About the Author

Stephen Swoyer is a Nashville, TN-based freelance journalist who writes about technology.