Taking the Art Out of Scalable, Available Data Warehouse Design

Why does designing a scalable, available, and resilient data warehouse sometimes seem like more art than science?

Perhaps, experts say, it's because many adopters treat it as more of an artistic than a scientific endeavor. To wit: buying off-the-shelf data integration (DI) and business intelligence (BI) tooling -- and, in a growing number of cases, pairing it with consulting expertise -- is only part of the equation.

In DW, as in every other IT discipline, testing -- early and often -- is of paramount importance. For a variety of reasons, however, DW testing windows are often compressed, making comprehensive testing all but impossible. When testing does occur, it's often performed after the fact -- once the data warehouse has already been exposed to end users. This is a bad idea on several levels, experts say, because it undermines confidence in the service at precisely the point when confidence is most important. There's no better engine for end-user adoption than functional, reliable, and (just as important) "A-ha!"-begetting interaction with a DW.

"When it's done properly, with BI tools like Cognos and Business Objects and MicroStrategy, it's fairly easy [and] fairly intuitive to use those tools and go after the data warehouse, but when the data warehouse itself is not designed properly, then the solution starts to fall apart," says Pankaj Agrawal, CEO and CTO of Compact Solutions LLC, a DW consultancy and tools vendor.

Prior to founding Compact Solutions, Agrawal was a consultant with the former Knightsbridge Solutions where he worked on Knightsbridge's formidable ETL team. The biggest challenge with any technology roll-out is driving user uptake, Agrawal says. The best way to ensure adoption is to deliver a functional and reliable service from the get-go.

"The key is to win the confidence of the end user. When the data is not delivered on time or is not properly tested, then the end-user experience is compromised. End users are the key to the success or failure of any business intelligence effort," he observes.

Wayne Eckerson, director of research with The Data Warehousing Institute (TDWI), agrees. DW adopters have traditionally given short shrift to proactive testing, Eckerson acknowledges. The good news, he observes, is that this is starting to change. "[The] time [it takes] to test and validate ETL routines and ensure higher quality … is critical for users to embrace the data warehouse. Too often, this part of the lifecycle [is] squeezed to a minimum and then [to the extent that it is addressed] it's hand-coded in a somewhat random manner," Eckerson comments, noting that an Executive Summit session at TDWI's World Conference held in Las Vegas in February addressed this very issue. "[It's] a much overlooked area, but I'm getting more inquiries on it lately."

Val Rayzman, CEO of DI start-up DVO Software Inc., knows a thing or two about ETL -- and broader DI -- testing and validation. A veteran of DI pure-play Informatica Corp., Rayzman's newest venture aims to take the art out of DW design and deployment. DVO markets DataValidator, a DI testing suite designed specifically for Informatica's PowerCenter ETL tool: it consists of a GUI-based front-end tool or console that uses PowerCenter as its back-end engine. Designed for either ETL architects or business analysts, DataValidator purports to both systematize and accelerate the DI testing process.

DI is sorely in need of both, says Rayzman, who points to a perhaps willful failure of imagination on the part of many ETL adopters. They've come to terms with the fact that a manageable and governable ETL tool is a superior alternative to scripting or programmatic SQL, but -- when it comes to testing -- they aren't willing to take their reasoning one step further.

"When I started at Informatica, we really had to work really hard to make a case of why you should not code SQL -- why you should use metadata-driven ETL tools. By and large, Informatica and IBM [purveyor of the former Ascential DataStage] don't have to make that case anymore. They don't have to convince people that packaged ETL is good. Think about it: you can do everything that Informatica does with SQL, but you agree that it's important to have a metadata-driven tool that has audit capabilities and all of the good things that come with a product," he contends. "The same is true with our product. If you concede the one case [i.e., for packaged ETL], why not do the same for a metadata-driven testing tool with audit capabilities?"

DataValidator isn't a comprehensive testing tool, Rayzman stresses: full-blown DW testing involves stressing constitutive hardware and connectivity components, along with throughput, scalability, and other performance metrics.

"There's a lot more to testing than what we do. It's sort of like Maslow's Hierarchy of Needs: at the very basic level, you need to make sure that things run -- that people can connect to sources, that it's reliable, that it can withstand a certain volume. Next, you want to make sure that your data is correct. That's what we do," he comments.

It's the most frequently neglected aspect of testing, Rayzman argues. Most organizations test and benchmark for performance and scalability up front -- frequently during a proof-of-concept. Validation typically happens at the end, or -- more frequently still -- after official deployment. That's too late, Rayzman maintains.
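In essence, the validation Rayzman describes amounts to reconciling what was extracted against what was loaded. A minimal sketch of the idea -- with purely hypothetical record layouts and field names, and far simpler than what a packaged tool like DataValidator automates -- might look like this:

```python
# Minimal sketch of source-to-target data validation. The record
# layout ("customer_id", "balance") is hypothetical, chosen only to
# illustrate the reconciliation idea behind ETL validation tools.

def validate_load(source_rows, target_rows, key="customer_id"):
    """Return a list of discrepancies between source and target rows."""
    problems = []

    # Basic reconciliation: did every row arrive?
    if len(source_rows) != len(target_rows):
        problems.append(
            f"row count mismatch: {len(source_rows)} vs {len(target_rows)}"
        )

    # Field-level check: index the target by key, then compare each field.
    target_by_key = {row[key]: row for row in target_rows}
    for src in source_rows:
        tgt = target_by_key.get(src[key])
        if tgt is None:
            problems.append(f"missing key in target: {src[key]}")
            continue
        for field, value in src.items():
            if tgt.get(field) != value:
                problems.append(
                    f"{src[key]}: field '{field}' differs "
                    f"({value!r} vs {tgt.get(field)!r})"
                )
    return problems


source = [{"customer_id": 1, "balance": 100.0},
          {"customer_id": 2, "balance": 250.5}]
target = [{"customer_id": 1, "balance": 100.0},
          {"customer_id": 2, "balance": 205.5}]  # transposed digits

for issue in validate_load(source, target):
    print(issue)
```

Run against the sample data, the check flags the transposed balance on the second record -- exactly the class of silent load error that, as Rayzman notes, tends to surface only after end users have already lost confidence in the numbers.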

"Anything that happens at the end by its very nature is more susceptible to being squeezed. Think of it as if you're developing a project plan: I'm going to spend two weeks investigating my data and 18 weeks developing [my data warehouse] and then five weeks testing, but what happens is that it takes you three weeks to investigate. Then development runs over budget and all of a sudden either your deadlines have arrived or you run of budget for consultants, so what gets cut? Testing."

Rayzman and DVO market DataValidator as a standalone offering -- sans any services or consulting expertise: "To be honest, the tool takes 45 minutes to install, and it's a really simple-to-use tool. It's so simple and obvious that it's hard for us to offer any kind of services. By the time we learn that Field X = Field Y, a [business] analyst could have discovered the same thing. The long and the short of it is that we so far are not offering any services. There's no need to."

Agrawal and Compact Solutions, on the other hand, don't resell their DW testing and validation software; instead, they sell testing and validation as part of a larger DW consulting engagement. "We've created a [large number] of testing utilities which our consultants use as they get engaged in those [projects]. We have [created] a [large number] of usability templates to aid integration testing and department testing. These are tools that we have developed and tested in our consulting [engagements], so they're field-tested," he comments.

The salient point, Agrawal concludes, is that data warehouse testing is far too important to be compressed into a limited timeframe. Instead, it should occur throughout the DW design and deployment lifecycle.

"A lot of people think that implementing a data warehouse and buying the software … that the software by itself will enable them to dump the data into the database and then automatically they have a data warehouse. That's wishful thinking," he cautions. "What we feel is missing a lot of the times is the realization that the data warehouse needs to be architected and engineered, and that requires effort, all the way from requirements analysis to architectural development to testing and validation. We do that as part of our consulting [engagement]."