Be Prepared: How to Deal with the Surprises That Are Going to Happen to Your Data Warehouse Project

Surprise! You thought that your little warehousing project was going to move along without a hitch. Well, guess again

After working on a number of different data warehousing projects over thepast several years, I have found that some of the same interesting and sometimes annoyingthings happen in each data warehouse project. I have worked on projects that range in sizefrom 100 GB to 1.5 TBs. Many of these items are not too startling, as they have beenhappening to data processing (DP) systems since the computer was invented. What makes themstand out from their occurrence in other DP systems, is the unique nature of a datawarehouse. Until a few years ago, it was inconceivable that a single database would hold10 TBs of data. Not only are these systems very large, but you can query the database andget an answer back in a few minutes. These systems now exist and are being used by anumber of different companies. Based on my experiences with a number of data warehousingprojects, here are the things that are probably going to occur to any data warehouseproject.

Dirty Data

The main goal for most data warehousing projects is to give the end users access totheir data in a way that they have never been able to see before. One of the first thingsthey or the people that are building the data warehouse (DW), are going to find is thatyour operational data has inconsistencies and missing parts and it is a lot worse than youthought. There is nothing too startling about this statement. Data Processing has facedthis problem since the very first system was put into production. The old, garbage ingarbage out, GIGO, still holds true, but the difference is with a data warehouse the userscan see a lot more data and far more easily than ever before.

Every problem that may have been hidden before will now be there for everyone to see.Many companies are under the delusion that their operational data is pretty good overall.They know there may be a few errors, but they think these are few and far between. Itoften comes as a shock as to the amount of inconsistent and just plain wrong data that isfound. The way to handle this problem is to carefully audit the input into the datawarehouse. There are several ways to do this.

One is to select a random sample of data and track it through from source topresentation from the data warehouse. This is a basic technique that is used with all DPsystems. Another, more specific to a data warehouse, is to write SQL scripts that checkthe consistency of the data that is loaded into the data warehouse. You can check thingslike matching against any master files you have extracted from or checking the summationsor items counts against any summary tables you have built. Be sure you keep the load filesthat were used to load the data warehouse. This will allow you to do the ultimate audit:The ability to track data items that the end user thinks is wrong, back to the load files.It is imperative that the data you load into the data warehouse be clean and consistent.

The other surprise that goes along with the dirty data, is that you will probably findthe errors in the operational systems that caused the dirty data. The bad or dirty datagot there, in the first place, because the operational systems allowed it to be created.This also includes data that is not necessarily wrong, but is incomplete or inconsistent.This can be a very important benefit from a data warehouse. Not many people think aboutthe corrections to the operational systems as a benefit of a data warehouse.

On several of the projects I have worked on, the scenario went like this: The end userreports that a certain data item is wrong. The item is tracked back through the datawarehouse system and it is found that the item is being loaded and displayed as it wasgiven by the feeding systems. The item is tracked back through the operational systemsuntil the place where the error occurs is found. In every single project I have worked on,a good number of operational errors were found and corrected. The best way to handle thisis to have a good relationship with the people that are supporting the operationalsystems. The people that support the operational systems are usually receptive to thishelp in finding their errors, but some will not be too happy that this new system iscausing them to do additional work.

Belief in the Data

The next problem you are probably going to see is related to the first one we talkedabout. Once the end users have access to this wide range of clean data, many of the mythsabout the data will be destroyed. They will not believe what they see. The sale of thisitem cannot be this low, these customers cannot be doing this, the use of these assetscannot be this bad and on and on it will go. If you have done what you were supposed tohave done in step one, then you will be able to prove that the data is correct.

Many of the old polices and procedures will have to be changed as these"discoveries" are made about the data. This can be one of the most powerfulaspects of a data warehouse. It is, of course, very important to use a data warehouse tohave detail data that is used in the decision making process, but to find the truth abouthow a company is doing business is of great value. The myths about customer perceptions,stock assortment or asset utilization that exists today can be replaced with an accuratepicture.

The other part of this problem is the fact that users will be comparing the data fromthe data warehouse to the old reports or old systems data. In all probability, the datawill not match between the different systems. Once again, auditing the input and output ofthe data warehouse is the only way for it to gain credibility. There can only be a singleversion of the truth, if a company is going to get the real value out of the data. In theend, all will benefit from the truth, but the process is never easy.

The next surprise you will find in using a data warehouse is that end usersophistication will grow at amazing speed and will drive the need to increase theprocessing power and storage capabilities of the system. This is true with all DPapplications to some extent, but with a data warehouse it is possible to see growth ratesin the 50 to 100 percent range per year! When you first turn on your data warehouse, youwill probably see your end users divide into three groups.

A small percent will become power users by embracing the technology, the largestsegment will become average users of the system and a small percent will use the systemvery little. Your power users will be executing new and more complex queries next year incomparison to what they are running now. Not only will your power users increase the CPUusage, but your average users will also become more sophisticated in the queries they use.

The training of the users is a very important part of the success of the datawarehouse. This training will also drive the end user sophistication and the types ofquestions they will ask. On the projects that I have worked, the training had to be doneon a regular and frequent bases. We had training sessions set up for new users, refreshersfor current users and ones for each major release of the end user access tools. There arenow some very good data mining products on the market. When some of your power users startto use these, then you will truly see growth in CPU usage! Be prepared for and put thefunding into your IT budget for the growth needs of your data warehouse.

In the normal course of everyday business, changes will occur in the hierarchies thatyou are using in your data warehouse. It makes no difference if it is a product, customerrelationship, ship to cargo, geographic or passenger to flight hierarchy, changes willoccur. This means any summary tables you have created using these hierarchies will have tobe resummed. It also means that without very careful maintenance, these hierarchies mayneed to be rebuilt.

What you are carrying with a data warehouse is a snap shot of data back into history.When these hierarchies change, you have the decision to make of either rebuilding yoursummary tables or figuring out a way to show what the hierarchy looked like back inhistory. Most companies rebuild their hierarchies and summary tables as needed. This is anugly truth about carrying a large amount of historical data that few people want to talkabout. The good part is the business value that is derived from the data more thanjustifies the pain of these maintenance activities. Plan on putting into production theprocesses that were used to build and sum these tables at the beginning of the project.

Another interesting phenomenon that may happen to your project is that the word willget out that you are doing something new and exciting for the end users. Even in thecompanies that are the most advanced in using information technology, there is anoverwhelming need for users to get at their data. In most companies, this need is aconstant and loud cry from the user community that keeps the CIO up at night.

So what may happen is you will start to get calls from various section managers wantingto come see what you are doing. This is great, as long as you control these presentationsand do not let them divert too much of your time away from the main objective. It is justnatural to want to show off your creation, but be careful about setting expectations forthese potential future customers. Your next data warehouse project will most likely comefrom one of these groups that wanted to see what you were doing. On one project that Iworked on, the demands for presentations became so great that we had to limit them toorganized group presentations only. We were spending too much time doing individualpresentations.

You are going to have to provide for the care and feeding of this thing you havecreated called the data warehouse. This of course true of all DP systems, but thedifference is because of the above items, a data warehouse is going take a lot moreresources than a standard system. The reason is not that a data warehouse is a means untoitself, the reason is the business value that the user can get out of the data.

Many companies are seeing very quick and large returns on their investment in a datawarehouse. If you have a system that is growing 50 percent a year and the end users areincreasing their CPU usage by 25 to 100 percent per year, you have a system that isgrowing explosively. Another major part of the data warehouse that people do not like totalk about is the amount of money that will be required to maintain the system evenwithout any growth. There are many companies claiming that they can build a large datawarehouse for you. What they fail to talk about, is the amount of technical support thatis required to keep it running. You had better have a long-range plan of the fundingrequirements for the care and feeding of your data warehouse.

There are a number of surprising and challenging things that are going to happen toyour data warehouse project. If you plan carefully and look at the long-term picture, youwill be able to handle each of these as they occur.

About the Author:

Andy Blevins is a Senior Software Designer/DBA working as an NCR consultant in Dallas.