In-Depth
Be Prepared: How to Deal with the Suprises That Are Going to Happen to Your Data Warehouse Project
After working on a number of different data warehousing projects over the last several years, I have found that some of the same interesting and sometimes surprising things happen in each data warehouse project. I have worked on projects that range in size from 100 GB to 1.5 TBs. Many of these items are not too startling, as they have been happening to data processing (DP) systems since the computer was invented. What makes them stand out from their occurrence in other DP systems, is the unique nature of a data warehouse. Until a few years ago, it was inconceivable that a single database would hold 10 TBs of data. Not only are these systems very large, but you can query the database and get an answer back in a few minutes. These systems now exist and are being used by a number of different companies. Based on my experiences with a number of data warehousing projects, here are the things that are probably going to occur to any data warehouse project.
Dirty Data
The main goal for most data warehousing projects is to give the end users access to their data in a way that they have never been able to see before. One of the first things they or the people that are building the data warehouse (DW), are going to find is that your operational data has inconsistencies and missing parts and it is a lot worse than you thought. There is nothing too startling about this statement. Data Processing has faced this problem since the very first system was put into production. The old, garbage in garbage out, GIGO, still holds true, but the difference is with a data warehouse the users can see a lot more data and far more easier than ever before.
Every problem that may have been hidden before will now be there for everyone to see. Many companies are under the delusion that their operational data is pretty good overall. They know there may be a few errors, but they think these are few and far between. It often comes as a shock as to the amount of inconsistent and just plain wrong data that is found. The way to handle this problem is to carefully audit the input into the data warehouse. There are several ways to do this.
One is to select a random sample of data and track it through from source to presentation from the data warehouse. This is a basic technique that is used with all DP systems. Another, more specific to a data warehouse, is to write SQL scripts that check the consistency of the data that is loaded into the data warehouse. You can check things like matching against any master files you have extracted from or checking the summations or items counts against any summary tables you have built. Be sure you keep the load files that were used to load the data warehouse. This will allow you to do the ultimate audit: The ability to track data items that the end user thinks is wrong, back to the load files. It is imperative that the data you load into the date warehouse be clean and consistent.
The other surprise that goes along with the dirty data, is that you will probably find the errors in the operational systems that caused the dirty data. The bad or dirty data got there, in the first place, because the operational systems allowed it to be created. This also includes data that is not necessarily wrong, but is incomplete or inconsistent. This can be a very important benefit from a data warehouse. Not many people think about the corrections to the operational systems as a benefit of a data warehouse.
On several of the projects I have worked on, the scenario went like this: The end user reports that a certain data item is wrong. The item is tracked back through the data warehouse system and it is found that the item is being loaded and displayed as it was given by the feeding systems. The item is tracked back through the operational systems until the place where the error is occurs is found. In every single project I have worked on, a good number of operational errors were found and corrected. The best way to handle this is to have a good relationship with the people that are supporting the operational systems. The people that support the operational systems are usually receptive to this help in finding their errors, but some will not be too happy that this new system is causing them to do additional work.
Belief in the Data
The next problem you are probably going to see is related to the first one we talked about. Once the end users have access to this wide range of clean data, many of the myths about the data will be destroyed. They will not believe what they see. The sale of this item can not be this low, these customers can not be doing this, the use of these assets can not be this bad and on and on it will go. If you have done what you were supposed to have done in step one, then you will be able to prove that the data is correct.
Many of the old polices and procedures will have to be changed as these "discoveries" are made about the data. This can be one of the most powerful aspects of a data warehouse. It is of course very important to use a data warehouse to have detail data that is used in the decision making process, but to find the "real" truth about how a company is doing business is of great value. The myths about customer perceptions, stock assortment or asset utilization that exists today can be replaced with an accurate picture.
The other part of this problem is the fact the users will be comparing the data from the data warehouse to the old reports or old systems data. In all probability, the data will not match between the different systems. Once again, auditing the input and output of the data warehouse is the only way for it to gain credibility. There can only be a single version of the truth, if a company is going to get the real value out of the data. In the end, all will benefit from the truth, but the process is never easy.
The next surprise you will find in using a date warehouse is that end user sophistication will grow at amazing speed and will drive the need to increase the processing power and storage capabilities of the system. This is true with all DP applications to some extent, but with a data warehouse it is possible to see growth rates in the 50 to 100 percent range per year! When you first turn on your data warehouse, you will probably see your end users divide into three groups.
A small percent will become power users by embracing the technology, the largest segment will become average users of the system and a small percent will use the system very little. Your power users will be executing new and more complex queries next year in comparison to what they are running now. Not only will your power users increase the CPU usage, but your average users will also become more sophisticated in the queries they use.
The training of the users is a very important part of the success of the data warehouse. This training will also drive the end user sophistication and the types of questions they will ask. On the projects that I have worked, the training had to be done on a regular and frequent bases. We had training sessions set up for new users, refreshers for current users and ones for each major release of the end user access tools. There are now some very good data mining products on the market. When some of your power users start to use these, then you will truly see growth in CPU usage! Be prepared for and put the funding into your IT budget for the growth needs of your data warehouse.
In the normal course of everyday business, changes will occur in the hierarchies that you are using in your data warehouse. It makes no difference if it is a product, customer relationship, ship to cargo, geographic or passenger to flight hierarchy, changes will occur. This means any summary tables you have created using these hierarchies will have to be resummed. It also means that without very careful maintenance, these hierarchies may need to be rebuilt.
What you are carrying with a data warehouse is a snapshot of data back into history. When these hierarchies change, you have the decision to make of either rebuilding your summary tables or figuring out a way to show what the hierarchy looked like back in history. Most companies rebuild their hierarchies and summary tables as needed. This is an ugly truth about carrying a large amount of historical data that few people want to talk about. The good part is the business value that is derived from the data more than justifies the pain of these maintenance activities. Plan on putting into production the processes that were used to build and sum these tables at the beginning of the project.
Another interesting thing that may happen to your project is that the word will get out that you are doing something new and exciting for the end users. Even in the companies that are the most advanced in using information technology, there is an overwhelming need for users to get at their data. In most companies, this need is a constant and loud cry from the user community that keeps the CIO up at night.
So what may happen is you will start to get calls from various section managers wanting to come see what you are doing. This is great, as long as you control these presentations and do not let them divert too much of your time away from the main objective. It is just natural to want to show off your creation, but be careful about setting expectations for these potential future customers. Your next data warehouse project will most likely come from one of these groups that wanted to see what you were doing. On one project that I worked on, the demands for presentations became so great that we had to limit them to organized group presentations only. We were spending too much time doing individual presentations.
You are going to have to provide for the care and feeding of this thing you have created called the data warehouse. This of course true of all DP systems, but the difference is because of the above items, a data warehouse is going to take a lot more resources than a standard system. The reason is not that a data warehouse is a means unto itself, the reason is the business value that the user can get out of the data.
Many companies are seeing very quick and large returns on their investment in a data warehouse. If you have a system that is growing 50 percent a year and the end users are increasing their CPU usage by 25 to 100 percent per year, you have a system that is growing explosively. Another major part of the data warehouse that people do not like to talk about is the amount of money that will be required to maintain the system even without any growth. There are many companies claiming that they can build a large data warehouse for you. What they fail to talk about, is the amount of technical support that is required to keep it running. You had better have a long-range plan of the funding requirements for the care and feeding of your date warehouse.
There are a number of surprising and challenging things that are going to happen to your data warehouse project. If you plan carefully and look at the long-term picture, you will be able to handle each of these as they occur.
About the Author:
Andy Blevins is a Senior Software Designer/DBA working as an NCR consultant in Dallas. He can be reached at (972) 650-2307 or via e-mail at [email protected]