Scalable Data Warehouses: Architects Must Improvise

Real-time requirements and demands for faster performance are driving changes in data warehousing. Inmon and Kimball provided a solid foundation, but large data warehouses may lead data architects into uncharted territory.

Names such as Bill Inmon and Ralph Kimball are recognizable to almost every data warehouse architect. Both men were involved with data warehousing at an early stage and remain at the forefront of the technology.

Indeed, although Inmon coined the term “data warehouse” more than a decade ago, and Kimball published his seminal data warehousing bible—"The Data Warehouse Toolkit: Practical Techniques for Building Dimensional Data Warehouses"—in the mid-1990s, most of the data warehouses built today are still based on their models. How’s that for foundational?

“I don't think design philosophies have changed much in the last [seven or eight] years. Generally speaking, both Kimball's and Inmon's approaches are the most commonly used today,” confirms Nicholas Galemmo, an information architect with Nestlé USA and co-author of "Mastering Data Warehouse Design: Relational and Dimensional Techniques."

Claire McFarlen, a data warehouse consultant with Clay New Media, agrees. “The two main camps, Kimball vs. Inmon, continue to duke it out, as they have for over ten years. I can't decide if it's a 'Holyfield vs. Tyson' type of match, or more like the 'Nature vs. Nurture' dialectic of the 18th-century sociologists like Hobbes and Rousseau,” he observes.

Blueprint for Data Warehousing

Inmon and Kimball champion very different data warehouse design philosophies. Kimball, for example, endorses an “architecture” in which an enterprise-wide data warehouse effectively grows up around a collection of independently developed data marts—such as for finance, sales, or marketing—each of which exploits conformed dimensions so that its data is intelligible to consumers of other data marts.
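The idea behind conformed dimensions can be sketched in a few lines. The following is a minimal illustration using Python's built-in sqlite3 module; the table and column names (`dim_date`, `fact_sales`, `fact_budget`) are hypothetical, not drawn from any vendor's schema. Because two separately built marts share the same date dimension and surrogate keys, a "drill-across" query that compares their facts is a simple pair of joins:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# One conformed date dimension, shared by every mart on the "bus".
cur.execute("CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, cal_date TEXT, fiscal_qtr TEXT)")
cur.executemany("INSERT INTO dim_date VALUES (?, ?, ?)",
                [(20240115, "2024-01-15", "Q1"), (20240420, "2024-04-20", "Q2")])

# The sales mart and the finance mart were built independently, but both
# reference dim_date by the same surrogate key, so their facts line up.
cur.execute("CREATE TABLE fact_sales (date_key INTEGER, revenue REAL)")
cur.execute("CREATE TABLE fact_budget (date_key INTEGER, budget REAL)")
cur.executemany("INSERT INTO fact_sales VALUES (?, ?)",
                [(20240115, 1200.0), (20240420, 900.0)])
cur.executemany("INSERT INTO fact_budget VALUES (?, ?)",
                [(20240115, 1000.0), (20240420, 1100.0)])

# Revenue vs. budget by fiscal quarter, across two marts, in one query.
rows = cur.execute("""
    SELECT d.fiscal_qtr, s.revenue, b.budget
    FROM dim_date d
    JOIN fact_sales  s ON s.date_key = d.date_key
    JOIN fact_budget b ON b.date_key = d.date_key
    ORDER BY d.fiscal_qtr
""").fetchall()
print(rows)  # -> [('Q1', 1200.0, 1000.0), ('Q2', 900.0, 1100.0)]
```

Without the shared dimension, each mart would define dates (or customers, or products) its own way, and this kind of cross-mart comparison would require reconciliation logic instead of a join.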

Inmon, for his part, is of the opposite opinion. He believes that independent data marts cannot add up to an effective enterprise-wide data warehouse. Instead, he argues, organizations should focus their efforts on designing a centralized, enterprise-wide data warehouse first, from which dependent data marts can be fed.

For small and medium-sized data warehouses, the Inmon and Kimball models may be all that most data warehouse architects ever need.

When it comes to large or very large data warehouses—in the hundreds of gigabytes to multiple terabyte range—however, most architects will probably find themselves in uncharted territory. “Textbook data models usually don't survive very long when dealing with large volumes of data,” says Paul Balas, a data warehouse architect and a principal with data warehousing consultancy Sof-Think Inc.

The upshot is that architects who are tasked with designing scalable data warehouses typically start with a foundation—Inmon and Kimball—and improvise the rest of the way. In particular, they exploit improvements in raw processing power, along with techniques such as row-packing and more efficient update management, to design for rapid growth, Balas says.

To a large extent, experts suggest, improvements in data warehouse scalability and performance are limited by the predictable inertia of major vendors. For example, several other data warehouse design methodologies—such as the Data Vault architecture championed by data warehouse architect Dan Linstedt—have come to the fore over the past several years. None of them has challenged the primacy of the Inmon or the Kimball models, however. “These methodologies will take quite a while to catch on as an industry trend,” speculates Balas. “Vendors are very comfortable with the concept of a normalized data warehouse and star schema methodologies. Until that critical mass occurs, these new methodologies will be limited in implementation.”

Real-time requirements transform data warehouses

The data warehouse is by no means a finished project. Experts say that increased demand for real-time or near-real-time information access is driving new design innovation in data warehousing.

“The two most common design considerations facing any large data warehouse are fast refresh and fast response time for executing user queries,” concurs Clay New Media’s McFarlen. “Interestingly, these are the two most common design issues facing any IT application that stores data.”

In years past, McFarlen points out, data warehouses typically refreshed their data on weekends or at the end of the month. Today, however, many businesses are demanding daily refreshes at a minimum. “Some data warehouses are moving to real-time refresh and this may become the trend,” he points out. “Response time becomes increasingly an issue as the volume of data in the data warehouse grows.”

ETL vendors have attempted to respond to this challenge, delivering next-generation ETL solutions that purport to provide real-time (or near-real-time) processing capabilities. Informatica Corp., for example, announced its PowerCenter 6.0 ETL suite last year, while ETL pure-play vendor Ascential Software Corp. last month unveiled (http://www.tdwi.org/research/display.asp?id=6751&t=y) an enterprise integration suite based on its ETL and data quality technologies. Even BI powerhouse SAS Institute Inc. recently delivered a new version of its ETL tool, Enterprise ETL Server, which boasts real-time capabilities.

According to Sof-Think’s Balas, real-time or near-real-time processing is an especially acute requirement in large data warehouse environments. “In the past, most ETL tools only supported batch processing. As volumes grow, it has become critical to spoon-feed data to the warehouse as it is created in near-real-time,” he argues. “This adds complexities to ETL logic, but improves the ability to load more data. It also improves the accuracy of business decisions.”
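The shift Balas describes—from periodic full batch loads to feeding the warehouse deltas as data is created—is commonly implemented with incremental, watermark-based extraction. The following is a simplified sketch in Python using in-memory SQLite databases to stand in for the source system and the warehouse; the `orders` schema and the `micro_batch_load` function are illustrative, not any vendor's API:

```python
import sqlite3

# Source system: an operational table that rows arrive in continuously.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT)")

# Warehouse side: the same shape, loaded incrementally.
wh = sqlite3.connect(":memory:")
wh.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT)")

watermark = ""  # timestamp of the newest row already loaded

def micro_batch_load():
    """Pull only rows newer than the watermark, instead of reloading everything."""
    global watermark
    rows = src.execute(
        "SELECT id, amount, updated_at FROM orders WHERE updated_at > ? ORDER BY updated_at",
        (watermark,)).fetchall()
    wh.executemany("INSERT OR REPLACE INTO orders VALUES (?, ?, ?)", rows)
    wh.commit()
    if rows:
        watermark = rows[-1][2]  # advance the high-water mark
    return len(rows)

# New data trickles in; each small, frequent load moves only the delta.
src.execute("INSERT INTO orders VALUES (1, 50.0, '2024-01-01T09:00')")
src.execute("INSERT INTO orders VALUES (2, 75.0, '2024-01-01T09:05')")
print(micro_batch_load())  # -> 2

src.execute("INSERT INTO orders VALUES (3, 20.0, '2024-01-01T09:10')")
print(micro_batch_load())  # -> 1  (only the new row crosses the wire)
```

Run frequently enough, loads like this approach near-real-time freshness, at the cost of the extra ETL bookkeeping (watermarks, late-arriving rows, idempotent upserts) that Balas alludes to.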

Expanding horizons

As raw processing power has increased, so, too, has overall data warehouse performance.

This has occasioned at least one vendor, NCR Corp. subsidiary Teradata, to put forth the heretical proposition that because of the raw processing power of its platform, there’s no longer any need to break a data warehouse out into dependent data marts. Teradata’s position isn’t likely to be taken up by many of its competitors, at least not in the near term: The company markets a range of data warehousing and analytic products that run on top of, and which are optimized for, its parent company’s hardware. This gives it an advantage over its competitors, few of which market comparably integrated solutions.

Nevertheless, suggests Sof-Think’s Balas, it’s an intriguing proposition that could be tested by other vendors as processing performance continues to increase. Balas says that Teradata’s radical proposition has the potential to revolutionize the data warehouse. “If you design a third normal form warehouse with some minor modifications in the physical design, you can run all your DSS/EIS reporting off the warehouse,” he points out. “In theory this is possible, but in practice difficult to achieve. If this concept becomes practical, it will have a huge impact on the total cost of ownership of a data warehouse. There is the potential for a huge market shift if they can gain momentum.”
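One way to read Balas's "minor modifications in the physical design" is a reporting layer defined directly over the normalized tables, so that DSS queries hit the warehouse itself rather than a separate mart. The following is a toy sketch in Python with sqlite3; the schema, the view name `v_sales_by_region`, and the use of a plain view (in production this would more likely be a materialized summary table or index) are all assumptions for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A small third-normal-form core: customers and orders live in separate,
# non-redundant tables.
conn.executescript("""
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT, region TEXT);
CREATE TABLE orders   (order_id INTEGER PRIMARY KEY,
                       customer_id INTEGER REFERENCES customer(customer_id),
                       amount REAL);
INSERT INTO customer VALUES (1, 'Acme', 'West'), (2, 'Globex', 'East');
INSERT INTO orders   VALUES (10, 1, 500.0), (11, 1, 250.0), (12, 2, 300.0);

-- The physical-design tweak: a denormalized reporting view over the 3NF
-- tables, so DSS/EIS queries run against the warehouse with no data mart.
CREATE VIEW v_sales_by_region AS
SELECT c.region, SUM(o.amount) AS total_sales
FROM orders o JOIN customer c ON c.customer_id = o.customer_id
GROUP BY c.region;
""")

report = conn.execute(
    "SELECT region, total_sales FROM v_sales_by_region ORDER BY region").fetchall()
print(report)  # -> [('East', 300.0), ('West', 750.0)]
```

The catch Balas points to is that on a plain view the join-and-aggregate work happens at query time; making that fast at hundreds of gigabytes is exactly where the raw processing power argument comes in.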

About the Author

Stephen Swoyer is a Nashville, TN-based freelance journalist who writes about technology.