In-Depth
Scalability Implications for Relational Databases
Many data warehousing vendors today claim they can handle multi-terabyte data warehouses. But is this infatuation with storage really the point of data warehousing? With so many vendors using this as their marketing pitch, it's a pretty sure bet that there is more to scalability, and that it's the part they'd prefer not to talk about.
To illustrate these points, let's talk about large sites as well as look at what's possible in data warehousing today. The issues presented in this article will impact sites of all sizes. Smaller organizations face the same competitive pressures as large ones, with considerably fewer resources to support their warehouses. As needs scale down, so does the ability to support technical heroics.
With traditional enterprise systems, the need for scalability was driven by the amount of data you could store and the number of transactions per second you could support. There was little relationship between these dimensions of the system; transactions tended to access only a few rows of data regardless of how much was stored.
With data warehousing, the equation is considerably more complex. Data warehouses are not simply for data storage; data is already stored everywhere. Warehouses are built for data access. In this context, scalability must mean more than the ability to reliably store data; it must also mean the ability to leverage that data as a strategic resource.
Data warehouse scalability can be thought of in four dimensions:
- Data Volume;
- Concurrency;
- Environmental Complexity; and
- Support.
Data Volume and Concurrency are interrelated, while Environmental Complexity is determined in part by how well you deal with the first two dimensions. Support scalability is directly determined by the other three.
Data Volume
On the surface, data volume is the simplest of the dimensions for a database to support, and industry benchmarks bear this out. Production sites reveal a different story, however. The largest data warehouse currently supports in excess of 16 terabytes in a single database, with more than 5 terabytes of non-redundant user data.
In the data warehouse, more "data volume" is achieved by increasing the amount of detail retained or by increasing the number of data elements collected. In either case, this is done because there is a demand for additional data to be used in conjunction with the existing data. Unlike in a traditional transaction system, the additional data will not exist in a relative vacuum. This increases both the volume of data individual queries must be able to process and the number of tables and data elements with which a query must deal.
Some vendors will support their claims of multi-terabyte warehouses by pointing to sites with considerable replicated data, generally found in the form of summary data marts. One such site, very heavily promoted by the supporting vendors, has 1.5 TB of storage to accommodate less than 100 GB of data. This is not an example of Data Volume scalability.
The primary technology for dealing with increased volumes of data is parallelism. Be cautious, though: nearly every parallel database, even one using massively parallel processing (MPP), does some things serially, or with significantly limited parallelism. A database that spends 75% of its time running a query in parallel mode can never run that query in a quarter of the time without speeding up the serialized portion, and that usually requires a new generation of servers. What seemed a trivial difference can quickly stall your project.
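This is the reasoning behind Amdahl's law (the article does not name it, but the arithmetic is the same). A minimal sketch in Python shows how the serialized portion caps the speedup, assuming a query that is 75% parallelizable:

```python
def max_speedup(parallel_fraction: float, processors: int) -> float:
    """Amdahl's law: overall speedup when only part of the work runs in parallel."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / processors)

# A query that is 75% parallelizable approaches, but never reaches, 4x.
for n in (4, 16, 64, 1024):
    print(f"{n:5d} processors -> {max_speedup(0.75, n):.2f}x speedup")
```

No matter how many processors are added, the speedup never reaches 4x, because the serial 25% always takes the same amount of time.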
The critical aspect of parallelism is the database optimizer. The optimizer must understand which processes can be effectively handled through parallelism, and it must automatically parallelize them. As data volumes grow, data skew becomes more prevalent, and index choices and join sequences become more critical to performance. To effectively support data warehousing, the optimizer must be specifically designed to understand the impacts of parallelism, and mature (read: old) enough to incorporate accurate costing algorithms drawn from production sites, not development theories.
If your database cannot effectively deal with the data volumes your users require, you will be forced to artificially divide the data across multiple platforms, using a distributed data approach. This will severely impact Environmental Complexity and Support scalability.
Concurrency
Concurrency, the number of users the system has and the number of concurrent queries they need to run, is somewhat more complex. Transaction systems are built around known numbers of users requiring access to known, generally very small, sets of data within tight response parameters. In data warehousing, we do not have the luxury of these guidelines. Our users may require access all day, every day, or they may submit just a few queries a month. They may require a few simple rows, or need to analyze hundreds of millions of rows from dozens of subject areas in a single request. And since not all users are created equal, these requests each come with their own set of priorities. The largest data warehouses currently support over 10,000 users on a single copy of the data, with 1,000 or more queries running at the same time.
To support the complex, dynamic, and often conflicting demands of these users, the data warehouse engine (the database) must be able to efficiently optimize very complex requests. If it cannot, a single large request can waste so many system resources that it makes the entire environment unusable. It must also be able to recognize queries that can share resources, such as physical I/Os, to further reduce the overall workload.
The database must make it easy to manage security access for these large user communities. And finally, the database must allow you to express business priorities, preferably by user, time of day, day of week, and current system workload, automatically deferring queries for later execution and prioritizing resource usage once a query is in the system.
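As an illustration only, here is a minimal sketch of what such workload rules might look like. The rule structure, group names, and thresholds are all invented for this example; real products express these policies through their own administrative interfaces:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Tuple

@dataclass
class Rule:
    peak_priority: int       # priority during business hours (1 = highest)
    off_peak_priority: int   # priority at night and on weekends
    defer_when_busy: bool    # queue the query instead of running it now

# Hypothetical user groups and policies.
RULES = {
    "executives": Rule(peak_priority=1, off_peak_priority=1, defer_when_busy=False),
    "analysts":   Rule(peak_priority=3, off_peak_priority=2, defer_when_busy=False),
    "batch":      Rule(peak_priority=9, off_peak_priority=5, defer_when_busy=True),
}

def schedule(user_group: str, system_load: float,
             now: Optional[datetime] = None) -> Tuple[str, int]:
    """Decide whether to run or defer an incoming query, and at what priority."""
    now = now or datetime.now()
    rule = RULES[user_group]
    peak = now.weekday() < 5 and 8 <= now.hour < 18   # weekday business hours
    if rule.defer_when_busy and system_load > 0.8:    # system is busy: defer
        return ("defer", rule.off_peak_priority)
    return ("run", rule.peak_priority if peak else rule.off_peak_priority)

print(schedule("batch", system_load=0.9))    # ('defer', 5)
print(schedule("analysts", system_load=0.9))
```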
Environmental Complexity
The next dimension of scalability is environmental complexity. The data warehouse environment grows more complex every time a new subject area, more history, a new query, a new user, or a new data mart is added, or the data warehouse itself is decentralized. This complexity must be hidden from most users, preferably by using views of the data. The largest data warehouses today support over 500 tables of detail data in a single centralized database.
To support this level of complexity, the database must be able to handle extremely complex constructs in a single query. Once again, this is the role of the optimizer.
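To make the idea of hiding complexity behind views concrete, here is a minimal sketch using SQLite from Python. The tables, columns, and view are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (cust_id INTEGER PRIMARY KEY, name TEXT, region TEXT);
    CREATE TABLE orders    (order_id INTEGER PRIMARY KEY, cust_id INTEGER,
                            order_date TEXT, amount REAL);

    -- Users query one simple object instead of writing the join themselves.
    CREATE VIEW customer_orders AS
    SELECT c.name, c.region, o.order_date, o.amount
    FROM   customers c
    JOIN   orders    o ON o.cust_id = c.cust_id;
""")
conn.execute("INSERT INTO customers VALUES (1, 'Acme', 'West')")
conn.execute("INSERT INTO orders    VALUES (10, 1, '1998-01-15', 1250.00)")
for row in conn.execute("SELECT * FROM customer_orders"):
    print(row)
```

In a real warehouse the view might join a dozen detail tables; the point is that the user sees one simple object while the optimizer handles the underlying construct.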
For example, one company wanted to track warranty and repair data for its entire product line over multiple years. After much tuning, they found their database could store that much data but could not support queries against it. To overcome this shortcoming, they decentralized the data warehouse onto a separate platform for each product and each year. Of course, this eliminated the ability to search for patterns by year or by a specific component, so much of the value was lost.
When combined with parallelism, the large number of objects in a warehouse presents a significant issue: data placement and partitioning. Let's say there is a simpler environment, only 200 tables with an average of three indexes each, and you want 1,000 partitions (on average) for each object to support parallelism. That means there are (200 tables + 600 indexes) x 1,000 partitions = 800,000 physical objects to place.
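The arithmetic is simple, but parameterizing it shows how quickly manual placement becomes untenable as any input grows (the 200-table figures are the article's own; the 500-table line is an extrapolation to the table counts cited earlier):

```python
def physical_objects(tables: int, indexes_per_table: int, partitions: int) -> int:
    """Objects a DBA would have to place by hand without automatic placement."""
    return (tables + tables * indexes_per_table) * partitions

print(physical_objects(200, 3, 1_000))   # 800,000 for the article's example
print(physical_objects(500, 3, 1_000))   # 2,000,000 at 500 detail tables
```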
Generally, data marts, as well as distribution of the warehouse over different databases, are created because the data warehouse platform chosen is not up to the task at hand. As a compromise, the database is given simplified tables to operate on.
Some consultants characterize data marts as a way to simplify the warehouse environment. This generally plays well with technical staff, since individual applications are simpler to develop and tune. From the larger perspective, however, creating and supporting a large number of data marts dramatically increases the Environmental Complexity of the data warehouse environment. For example, one company created a centralized warehouse, but determined that their chosen technology could not support more than a handful of concurrent queries, and could not support complex queries without significant ongoing performance tuning. To overcome this, they created data marts. Today, that company is trying to support an environment with nearly 100 data marts.
While data marts have a place in the overall data warehouse environment, over-reliance on them will lead to support issues, which lead us to the final dimension of scalability.
Support
Support scalability is the ability to realistically support the environment that has been created. All other aspects of scalability impact it. Generally speaking, the more compromises made in the design, the further it strays from a normalized representation of the business, the more subordinate structures (data marts) you need, and the more summary-level data you carry, the higher your overall support costs will be. Industry estimates now place I.T. support costs at 55 percent of the cost of system ownership, compared with 3 percent for hardware and software acquisition. Support is the logical place to look for dramatic savings in your data warehouse. The largest data warehouses in the world, the same warehouses described in this article, have no data marts, only a handful of summary structures, and require fewer than two database administrators and one system administrator to support the entire environment.
To support large data sets, the database must be able to automatically partition and place data on the disk devices, without the Database Administrator (DBA) specifying either; otherwise, the task is simply too enormous to do efficiently (remember the 800,000-object example above). To support new or ad-hoc queries, the database must tune all queries effectively without the DBA or programmer providing hints or otherwise rewriting the query. (Think of trying to keep up with the demand if only 5 percent of the 10,000 active users have changing requirements; that's still 500 users.)
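A minimal sketch of what automatic placement means in practice, assuming hash-based placement on a primary key (the function and partition count are invented for illustration; real databases implement this internally):

```python
import hashlib

NUM_PARTITIONS = 1_000   # partitions spread across the disk devices

def partition_for(primary_key: str) -> int:
    """Hash the key so rows spread evenly, with no DBA placing objects by hand."""
    digest = hashlib.md5(primary_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

for key in ("cust-00017", "cust-00018", "order-993412"):
    print(key, "-> partition", partition_for(key))
```

Because placement is a pure function of the data itself, adding the 800,001st object requires no more human effort than adding the first.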
Finally, the primary method of user access must be through the data warehouse, not through data marts. Think of the mart as a "dumbed-down" structure to overcome shortcomings in parallelism, the optimizer, or administrative aspects of the warehouse database. In this context, marts are collections of pre-joined, summarized, or otherwise simplified tables.
While data marts have a useful place in the data warehouse environment, they should be used only where there is a clear return on investment - a business reason for doing so - and not because the wrong technology was chosen for the warehouse.
Consider what happens in a typical mart-centric environment when a new query is requested (and remember the 100 data mart example):
1) The user requests a new report;
2) The user's request is allocated to the least busy programmer;
3) The request goes into the programmer's "to do" list;
4) The programmer examines the request to determine if one of the existing data marts has the data to execute the request. If one does:
   - The systems group determines if the current mart hardware can support another application;
   - If the platform cannot support the request, hardware re-allocation begins;
   - The programmer then proceeds to step 6);
5) If no data mart exists which can satisfy the request:
   - The programmer designs a new data mart for this request;
   - The systems group locates a server which can support the new mart;
   - The programmer designs and tests a procedure to create the data mart from the centralized data warehouse;
   - The programmer develops maintenance procedures for the new mart;
   - The DBA group performance tunes the new mart maintenance procedure;
6) The programmer designs, codes, and tests the new user request;
7) The DBA group performance tunes the new request;
8) The user examines the results returned from the programmer to see if they match the original request;
9) The user and programmer work together to resolve any outstanding issues.
This is a slow, cumbersome process. It dramatically reduces the likelihood of fast response times, limits the number of questions a user can ask, and carries significantly higher support costs. Ultimately, this type of environment becomes unsupportable, and that can be critical to your business.
In one company, the marketing department for a single business unit identified 10-15 new initiatives every year, many with new data requirements and all with new reporting requirements. The complexity of their mart-centric environment meant that more than half their requests went unanswered.
In a warehouse-centric environment, a centralized warehouse contains all necessary detail data for the enterprise, in a highly normalized model and on a platform that all corporate users can access whenever they need to.
Consider this same problem in a warehouse-centric approach to user access:
1) The user requests a new analysis using a graphical front-end tool;
2) The tool passes the request directly to the database;
3) The database automatically optimizes the new request, with no DBA intervention;
4) The database returns the answer to the user.

In this environment, many ad-hoc user requests are answered in minutes, most in under one hour, and all by the next morning. The user can ask more questions, test more hypotheses, and better understand their environment, and can respond quickly to developing trends and issues. There are no practical limits to support scalability.
Support issues are the final limit on both cost and business responsiveness. But they can be minimized only if all other dimensions of data warehouse scalability have been addressed.
The Real Bottom Line
Data warehousing has changed the face of competition in many industries. But to become a truly strategic resource, the warehouse must be able to accommodate more data than you think you have, more users than you know exist, more complexity than the best DBAs could tune for, and do all of this in a cost-effective manner.
Above all, the warehouse must be able to evolve with your business. Evolution does not mean adding more appendages. Evolution means quickly supporting new concepts, responding faster and faster to situations and leveraging data as a strategic resource. If your warehouse cannot adapt quickly, then you have no sustainable competitive advantage.
The most critical decision you make will be your database selection. If you select technology based on corporate standards or vendor promises, save your money and your sanity and stop now.
Choosing the proper database platform is a fairly simple task. Find at least a few references that are doing what you need to do using the technology you plan to use, and validate with these users that they are getting the value they expected. If these references cannot be validated, you've made the wrong choice.
Many data warehouses fail to deliver what the users expected. Yours can deliver. You have a choice. You can spend your time overcoming technology limits, or you can spend it changing the face of your business and perhaps your industry.
About the Author:
Ron Fryer is a Senior Architect with Teradata/NCR, and has been designing very large data warehouses for eight years. Mr. Fryer has written numerous papers on data warehousing and is a contributing author to the recently published book, Understanding Relational Data Base Management Systems, 2nd Edition (McGraw-Hill, 1997). He can be reached at (303) 692-7201 or via e-mail at [email protected].