EIS effort in the industry today is increasingly facing the challenge of providing a robust, yet flexible enterprise architecture for data mart delivery to end users within a short period of time, and with limited resources. This mart delivery architecture further needs to support a diverse set of end users ranging from power users having good knowledge of SQL to the true end users wanting to look at canned reports or perform OLAP queries. In addition, this architecture has to be flexible enough to support data delivery in ROLAP, DOLAP, EXCEL spread sheet or simply flat file formats. This article proposes an enterprise architecture for delivering dependent data marts that will satisfy varying end user DSS needs. One of the main features of the proposed architecture is its ability to allow the reuse of dimensions across the enterprise. The focus of this article is on the layer between a data warehouse and data marts. We choose to call this layer the "Enterprise Mart Cooker."
Assumptions
"Enterprise Mart Cooker" is part of a dependent data mart environment. It is assumed therefore that a data warehouse will feed this layer with data that has been extracted, cleaned and transformed. The concept of a data mart in our context transcends its traditional definition to include broad data formats ranging from relational to dimensional cubes to flat files, such as excel spread sheet. This architecture will accommodate the broad definition of a data mart and yet provide cost effective means to deliver them.
Description of System Components
1. Mart Cooker Input Staging Layer. This is an area where data warehouse extracts are initially stored. Extracted data is not in a dimensional form since data warehouse is mostly relational. This layer allows de-coupling of data warehouse and data mart environments that have different data organizations - relational vs. dimensional. Advantages of the input staging layer are threefold. First, this layer will allow very simple restart and recovery mechanism for mart cooker processes by buffering the extracted warehouse data; second source and target specification between data warehouse and input staging layer will be very simple due to the intermediate, non-dimensional data organization of the input staging layer which eliminates the relational-to-dimensional transformation process detail and third, since data is staged within RDBMS, database features, such as stored procedures, triggers, etc., can be effectively used to simplify programming effort and hence simplify mart building process.
2. Mart Cooker Process Layer. This layer consists of a set of processes that create facts and dimensions using data from the input staging layer and then store the newly created mart in output staging layer for delivery to the end user environment by the mart delivery layer. While creating a data mart, these processes can reuse pre-existing dimensions from the corporate dimension repository instead of recreating them, thus eliminating the redundancy and duplication of effort. New dimensions created during the mart building process, that have reuse potential, can be added by the cooker process to the corporate dimensional repository while the existing dimensions can be refreshed. Reuse will greatly reduce subsequent mart building time since less code has to be written to create a new one.
One of the key features of this layer is it's ability to allow the generation of data marts having structurally varying data formats. Furthermore, this architecture lends well to automating the mart creation process itself using various configuration techniques. For example, a user interface can be used to input configurable parameters, such as source and target mapping specification between the input and output staging layer, information as to whether a dimension should be created from scratch or can be reused from the corporate dimension repository and transformation rules necessary to create the dimensional model, followed by automatic code generation. We feel that automation aspect of this architecture has huge potential for return on investment by reducing the cost and level of staffing. Therefore, we recommend investing a good amount of time and effort in developing such a process.
3. Corporate Dimension Repository Layer. Corporate Dimension Repository is a store for commonly used business dimensions within a company. Mart cooker processes store newly created dimensions that can be reused into this repository and retrieve existing dimensions during mart creation. This is a vital component of the overall architecture since it promotes knowledge reuse and enforces consistent view and common vocabulary across the enterprise. For example, once a customer dimension and its hierarchy has been established over a period of time, this dimension can be reused by any organization within the company without having to reinvest time and resources. As a result every organization using the customer dimension will have the same view of their customers thus promoting common vocabulary. Also, reuse of corporate dimension will help reduce time and cost involved in building new data marts, which is critical to the success of an enterprise effort. Technical metadata (specific to vendor implementation) can coexist with business dimensions to provide reusability downstream.
4. Mart Cooker Output Staging Layer. Newly created marts (by the mart cooker processes) are stored in this area before delivery to the end user environments. This layer provides easy restart and recovery for the mart delivery layer. Furthermore, since the mart data is staged for a certain period of time, delivery to multiple destinations is easily accomplished. This layer helps in pulling computation away from end user environments, thus reducing end user hardware capacity needs and resource requirement. Additionally, by facilitating the creation and staging of marts in a common area, this layer eliminates the redundant effort required for creating same data mart by multiple end users. Lastly, this layer also provides end users the option of not needing a production supported environment since the mart can be easily resent within the specified mart retention time period.
5. Mart Delivery Layer. This layer supports and facilitates the transportation of data marts to end user environments. Since heterogeneous mart environments, such as relational, flat files or dimensional have to be supported by this layer, different mart loading techniques will have to be used. Middleware, such as publish and subscribe may be used for incremental or complete loading of data across the network. Delivery mechanism can be as simple as FTP of flat files or as complicated as use of a middleware that will invoke a loader in the target database.
Example of Building a Data Mart Using the Proposed Architecture
1. The user submits initial request for building a new data mart. This request includes statement of intended use (objective, etc.).
2. Mart team conducts interviews with the user group and gathers requirements.
3. Business analysts on the mart team develop a dimensional model.
4. Initial extraction request for data from the warehouse is prepared by the data mart team. This specification consists of high level item descriptions needed for creating the data mart. Mart team is assumed to have no knowledge of warehouse data structure. This specification is forwarded to the warehouse team.
5. The warehouse team investigates specification submitted by the mart team and creates source mapping specification. If a field is not found in the warehouse, the warehouse team initiates ETL process for this data and communicates the expected date of delivery to the mart team. This source specification document is then submitted back to the mart team.
6. Based on the detailed source specification provided by the warehouse team, the mart team designs the target specification for the input staging area. The layout and the loading frequency is added to the target specification document by the mart team. This specification is then handed back to the warehouse team.
7. Warehouse team develops ETL necessary to populate the input staging area.
8. Using the source specification document from the warehouse team, the business dimension model developed by the mart team and the corporate dimension repository data model, the mart team determines whether existing dimensions can be reused from the corporate dimension repository. Dimensional transformation needed to create mart from the input staging area is specified.
9. The mart team designs and develops mart cooker processes for building the new mart. As an example, the building of a cooker process can be as simple as entering the dimensional transformation specification via a user interface followed by automatic code generation.
10. Delivery specification is created by the mart team with help from the user group.
11. Using the delivery specification, delivery process is built to transport newly created data mart to its destination. Delivery mechanism of data mart can be as simple as an FTP of flat file (in case of EXCEL format) or as sophisticated as use of middleware that will transport data across network and invoke loader for the target database.
Advantages of This Architecture
1. This architecture is simple, scaleable, multi-tiered and easy to implement.
2. Allows creation of marts in any data format, such as relational, dimensional or flat files.
3. Allows reuse of dimensions and technical metadata across the enterprise, hence reduces cost and mart creation cycle time.
4. Promotes corporate-wide common business vocabulary by sharing common business dimensions.
5. Modular nature of the architecture allows simple interface specification between data warehouse, input staging layer, mart cooker processes, output staging layer and the mart delivery layer.
6. Marts can be maintained with or without the production support staff since it is easy to resend newly built mart from the output staging layer.
7. Architecture allows automation of data mart creation process.
8. Mart cooker layer pulls number crunching involved in mart creation away from the end users into an IT function so that the end user DSS server can remain simple, inexpensive and small.
9. The architecture does not tie any specific DSS vendor or database methodology to its design thus being truly open and enterprise.
10. Centralized location of information ensures data consistency across a corporation.
ABOUT THE AUTHORS:
Sanjay Tiwary and Asim Tewary are independent data warehouse & data mart architects. They help businesses build medium to large DSS systems using state-of-the-art technology. Sanjay Tiwary can be reached at stiwary@technologist.com and Asim Tewary at atewary@technologist.com.