In-Depth
A One-on-One Interview with Bill Inmon
Bill Inmon is the Chief Technical Officer of Pine Cone Systems. With 39 books published and translated into nine languages, Bill speaks at conferences and seminars worldwide.
ESJ: How has today's data warehouse changed from a few years ago, in both its use and design?
Bill Inmon: Today's data warehouse is significantly expanded from the warehouse of a decade ago. A decade ago, a data warehouse was a separate database, apart from the transaction database.
Today's data warehouse has blossomed into a full-blown architecture called the "corporate information factory." The corporate information factory has its own dynamics and its own component parts. There is the enterprise data warehouse, which contains the granular data and is the foundation on which everything else depends. From the enterprise data warehouse emanate the data marts, the exploration warehouse, alternative storage (near-line storage and secondary storage) and project warehouses. Then, there is the operational data store (ODS) and the integration and transformation layer. In short, today we have a full-fledged framework where yesterday we had a raw, detailed database full of integrated, historical data.
ESJ: How has business intelligence (BI) impacted the data warehouse?
BI: Business intelligence has enabled the end user to take advantage of the data warehouse. Data warehouse and BI go together, hand in hand. The data warehouse is the infrastructure and the BI component is the end user access and analysis component. They form a symbiotic relationship.
ESJ: Is outsourcing data warehousing development and maintenance a real option for IT managers?
BI: The appeal of outsourcing the data warehouse development is great. Data warehouse skills are very difficult to find and expensive to acquire when found. Therefore, outsourcing to a consulting company looks to be very appealing. But many of the real problems, the real failures in data warehousing, have occurred when the data warehouse has been outsourced.
The problem with using a large consulting firm to outsource the data warehouse development is that the consulting firm wants to treat data warehouse development like any other development project. The consulting firm wants a three-year contract where 10 to 20 consultants are brought in. Consulting firms want to treat data warehousing this way because this is the way all their other contracts and projects are done.
The data warehouse should not be treated this way. A data warehouse should be built in small, fast iterations. No single development effort in data warehousing should require more than three months. Therefore, from a cultural standpoint, data warehousing is a poor fit with the way that large consulting firms like to operate.
A related reason why large consulting firms do not do well with data warehousing is that they want to employ the development methodology that they are familiar with. Unfortunately, in most cases, the large consulting firm is steeped in the tradition of the "waterfall" structured approach to development. In the waterfall approach, all requirements are gathered before the next step of development ensues. Each step of development is completed before the next step begins. This waterfall style of development is exactly the opposite of what the data warehouse requires.
Proper data warehouse development requires a spiral development approach to the building of the warehouse, not a classical "waterfall" approach. In a spiral approach, a small subset of requirements is taken completely through the development process before the next set of requirements is tackled. Trying to get a large body of consultants to unlearn the knowledge and lore that they have carefully learned over the past 20 years is a delicate and difficult thing to do. For these reasons, when it comes to outsourcing, let the buyer beware.
ESJ: How can an IT manager gain an accurate understanding of what data the company has, and what actually needs warehousing?
BI: What should be in the warehouse and how it should be structured can be determined only as the end user works through multiple iterations of the warehouse. The end user cannot know what needs to be in the warehouse on the first and even the second iteration of development. The end user operates in what can be termed a "mode of discovery."
Only after end users see what the possibilities are can they articulate what the real needs are. Therefore, the developer needs to quickly develop the warehouse in small, fast iterations so that the real end user requirements can be discovered. Trying to run a classical, JAD session-based requirements-gathering effort, where all requirements are gathered before the first line of code is written, is a prescription for disaster when it comes to data warehousing.
ESJ: What advice can you give to an IT manager in a heterogeneous environment?
BI: It is normal for the legacy environment to exist in a heterogeneous state. The use of an ETL tool can greatly alleviate the work of integrating disparate data. As far as the different components of the warehouse existing in a heterogeneous state, that too is normal. The ODS is in one technology, the exploration warehouse is in another, and each data mart is in yet another technology. There is no reason why the different components of the corporate information factory should not exist in different technologies.
ESJ: How can IT managers leverage their metadata? And how much metadata is enough metadata?
BI: Metadata is a tough topic. There exist few tools in the marketplace that are of any help. And what few tools there are center around the notion of a centralized repository, which is a mainframe idea that has little relevance to a distributed architecture.
One of the main reasons why there are no distributed metadata tools is that Sand Hill Road, where the venture capital community lives, refuses to fund any significant distributed metadata companies. Sand Hill Road has the attitude that they don't want to fund a company where money has never been made. Think about it. No company or metadata product has ever been financially successful. Therefore, the venture capital community doesn't want anything to do with metadata.
This leaves managers with a real dilemma. This is not a good answer for a lot of reasons, but if you have to have metadata today, build your own solution since the metadata products of the world that are suitable to data warehousing will not likely be forthcoming.
ESJ: Is there a metadata standard?
BI: The only metadata standard I am aware of is that which is promoted by the Metadata Council.
ESJ: Can you explain the concept of "cubes" or "cubing data"?
BI: Cubes refer to multidimensional data warehouses. Cubes are best placed in data marts, not large-scale, industrial-strength data warehouses. Cubes are the basis of OLAP technology.
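As a rough illustration of what "cubing" means, the sketch below pre-aggregates one measure across every combination of dimensions, which is the basic idea behind an OLAP cube. The sample facts, the dimension names (`region`, `product`, `quarter`), the `ALL` marker and the `build_cube` helper are all hypothetical and are not taken from the interview or from any OLAP product.

```python
from collections import defaultdict
from itertools import product

ALL = "*"  # hypothetical marker meaning "rolled up across this dimension"

# Hypothetical detail rows, as a data mart might receive them.
SAMPLE_FACTS = [
    {"region": "East", "product": "Books", "quarter": "Q1", "sales": 20.0},
    {"region": "East", "product": "Music", "quarter": "Q1", "sales": 15.0},
    {"region": "West", "product": "Books", "quarter": "Q2", "sales": 30.0},
]


def build_cube(facts, dims, measure):
    """Sum `measure` for every combination of dimension values and roll-ups."""
    cube = defaultdict(float)
    for row in facts:
        # Each dimension contributes either its actual value or the ALL marker.
        choices = [(row[d], ALL) for d in dims]
        for key in product(*choices):
            cube[key] += row[measure]
    return dict(cube)


if __name__ == "__main__":
    cube = build_cube(SAMPLE_FACTS, ["region", "product", "quarter"], "sales")
    print(cube[("East", ALL, ALL)])   # total sales for the East region
    print(cube[(ALL, "Books", ALL)])  # total sales for Books
    print(cube[(ALL, ALL, ALL)])      # grand total
```

Because every roll-up is computed ahead of time, lookups are fast, but the number of stored cells grows quickly with each added dimension, which is consistent with the point that cubes fit departmental data marts better than a large, granular warehouse.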
ESJ: Can you help position or put in perspective the role of DB2 with the modern data warehouse?
BI: DB2 is a very serviceable platform for data warehousing, especially the SPII architecture.
ESJ: What is today's IS manager missing out on (if anything) when it comes to data mining?
BI: The natural extension of data warehousing is data mining. Once the data has been collected, integrated and cleansed, it is natural to want to use the warehouse to start to find different patterns of transactions. Data mining placed on a data warehouse is a diamond placed on a golden ring.
ESJ: What is the essential difference between a data mart and data warehouse?
BI: A data warehouse is fundamentally and architecturally different from a data mart. Comparing a data mart to a data warehouse is like comparing a tumbleweed to an elm tree. Both are plants, but each has an inherently different genetic structure.
A data warehouse contains a wealth of historical, granular, corporate and integrated data, and needs to have data structured in the most flexible manner possible.
A data mart contains a minimum of historical, aggregated and summarized data; is designed around a department's requirements; contains data optimized for departmental access; and is structured for optimal speed of access.
Because of these very stringent differences between data marts and data warehouses, the internal structure is quite different. Data warehouses are best normalized. Data marts are best denormalized into star joins and snowflake structures.
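The structural contrast can be sketched with Python's built-in `sqlite3` module. This is a minimal illustration under assumed names, not a recommended design: the `customer`, `product` and `sale` tables stand in for normalized, granular warehouse data, and `mart_sales_summary` stands in for a denormalized, summarized departmental table. It is a simple summary table rather than a full star schema, and every table, column and value here is hypothetical.

```python
import sqlite3


def build_demo():
    """Create granular warehouse tables and derive a summarized mart table."""
    con = sqlite3.connect(":memory:")
    # Warehouse side: normalized, granular, one row per sale (hypothetical schema).
    con.executescript("""
        CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, region TEXT);
        CREATE TABLE product  (product_id  INTEGER PRIMARY KEY, category TEXT);
        CREATE TABLE sale (
            sale_id     INTEGER PRIMARY KEY,
            customer_id INTEGER REFERENCES customer(customer_id),
            product_id  INTEGER REFERENCES product(product_id),
            sale_date   TEXT,
            amount      REAL
        );
    """)
    con.executemany("INSERT INTO customer VALUES (?, ?)",
                    [(1, "East"), (2, "West")])
    con.executemany("INSERT INTO product VALUES (?, ?)",
                    [(10, "Books"), (11, "Music")])
    con.executemany("INSERT INTO sale VALUES (?, ?, ?, ?, ?)", [
        (100, 1, 10, "2000-01-05", 20.0),
        (101, 1, 11, "2000-01-09", 15.0),
        (102, 2, 10, "2000-02-02", 30.0),
        (103, 2, 10, "2000-02-15", 10.0),
    ])
    # Mart side: joins resolved and rows aggregated once, so departmental
    # queries read a single flat table.
    con.execute("""
        CREATE TABLE mart_sales_summary AS
        SELECT c.region                  AS region,
               substr(s.sale_date, 1, 7) AS month,
               p.category                AS category,
               SUM(s.amount)             AS total_amount,
               COUNT(*)                  AS sale_count
        FROM sale s
        JOIN customer c ON c.customer_id = s.customer_id
        JOIN product  p ON p.product_id  = s.product_id
        GROUP BY c.region, substr(s.sale_date, 1, 7), p.category
    """)
    con.commit()
    return con


def mart_rows(con):
    """Return the mart's rows in a stable order."""
    return con.execute(
        "SELECT region, month, category, total_amount, sale_count "
        "FROM mart_sales_summary ORDER BY region, month, category"
    ).fetchall()


if __name__ == "__main__":
    for row in mart_rows(build_demo()):
        print(row)
```

The granular tables can answer questions nobody has asked yet, while the summary table answers one department's questions quickly and little else. That trade-off is the flexibility-versus-speed distinction described above.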
ESJ: What's the most common mistake an IT manager faces?
BI: The single, largest mistake IT management makes is in trying to treat a data warehouse as they have treated online transaction processing systems. Development is radically different. Usage is radically different. Capacity planning is radically different. Users' attitudes are radically different. But IT managers think that the existing IT organization can build and operate a warehouse just as they have done for years with OLTP systems. In fact, the IT function must go back to square one. And people don't like to do that.
ESJ: Is there a way to test the validity, accuracy and scope of a data warehouse?
BI: The validity, accuracy and scope of the warehouse are easily tested. Ask the end user to pay for the warehouse and you will find out more than you want to know.
ESJ: What are some long-term maintenance issues facing a data warehouse?
BI: Far and away the largest long-term maintenance issue of the warehouse is that of managing the volume of data found in the warehouse. IT organizations are used to managing systems measured in megabytes and gigabytes. Data warehouses are measured in hundreds of gigabytes, terabytes and even petabytes. There simply is nothing to compare warehouse maintenance to, based on a background of OLTP.
The volumes of data present some novel challenges to the IT manager. The first of those issues is budget. Many a manager simply assumes that data in a warehouse should be stored on high-performance storage. After all, that's the way we have always done it, and the hardware vendor seems to think this is a grand idea. But after about 500 gigabytes, see what happens. At some point in time there is the realization that "business as usual" will send your company to the poorhouse.
There comes a day when alternative forms of storage become the only rational way to succeed with a warehouse. Stated differently, those corporations building warehouses over a terabyte of data are throwing money away by the bushel by continuing to place their data on high-performance storage. And these are the very same people who complain about how much the data warehouse costs.
Simply stated, for the long-term vision of data warehousing, high-performance storage is absolutely not the appropriate technology.
ESJ: What's the biggest challenge IT managers face when implementing a data warehouse?
BI: The IT manager faces both a managerial and developmental challenge in building the data warehouse. If I had to give words of advice to the IT manager, first of all, build the warehouse iteratively, one step at a time. It is poisonous to try to use waterfall development and the "big bang" approach to the building of the data warehouse.
Second, make sure you understand the framework known as the corporate information factory. There are good reasons why there are different components and dynamics.
Third, make sure the end user is involved from the beginning of the development of the warehouse. Without the end user's input, the warehouse stands the chance of becoming a technological masterpiece, but a business failure.
Fourth, be prepared for handling a volume of data that you have never seen before. Data warehouses surpass everyone's expectations when it comes to the volumes that accumulate in the warehouse. Do not attempt to use the "waterfall" development approach to build the data warehouse. The warehouse requires a spiral development methodology.
Finally, check the references of your consultant. Do not accept your consulting firm's credentials as proof that your warehouse will be built properly. Instead, look at the individual resume of the people the consulting firm offers you, and examine - closely - their individual credentials.
For more information about data warehousing and the corporate information factory, visit Bill's Web site at www.billinmon.com.