E-Business Opps: Data Warehouse Extract, Transform and Load

Last month, we focused on Web server log file records. Now that we understand their format, we can explore the architecture of the extract, transform and load (ETL) process needed to get the data into a clickstream data warehouse.

The distributed nature of Web page content makes clickstream ETL more of an around-the-world safari rather than a short trip to the data stored inside the back-office data center.

The average Web page may contain content that is served-up by all the types of servers pictured in the figure. Let’s go through them one-by-one, starting with the internal systems:

Replicated Web Servers. These are your site’s Web servers.

Web Application Servers. Some vendors, like BEA, Blue Martini and Art Technology Group, produce application servers that create dynamically generated, personalized Web pages on the fly. They often contain specialized logic to create application-specific constructs, like shopping carts, product hierarchies and checkout pages. If these application servers can log these application-specific constructs, they can be combined with the site-hit-oriented Web server log files to produce a more complete picture of user activity.

Internal Cached Content Servers. Many sites use these to speed access to popular pages, and these specialized servers typically log at least at the Combined Log File format level.

SSL Web Servers. These servers log secure transactions, like user registrations and credit card authorizations.

External systems also log information:

Advertising Engines. If you contract with an advertising engine to run advertisements for your e-business, then every time a banner ad is viewed, that information is logged by the advertising engine’s servers and can be shipped back for analysis.

External Cached Content Engines. Many sites use external caching engines to push large content objects to users faster than they would be served by their own Web servers. These external caching engines, like Akamai, log their activities for future analysis.

External User Identity Engines. These engines, like Coremetrics, use a special cookie to identify users to a greater level than may be possible using normal mechanisms like a site-created persistent cookie. The identity information is then redirected to the requesting site, where the identity information can be logged.

A separate log file server parses, merges and calculates page dwell time, using the log information. This information is put into a staging database schema on the log file server for insertion into the clickstream data warehouse. By having a separate log file server, data extraction can be performed continuously without impacting the query performance of the clickstream data warehouse. Since the staging schema closely resembles the actual clickstream data warehouse schema, loading is straightforward and can be done in batches or even as a continuous trickle feed of new data, as long as queries are always properly constrained by date and time. Using a flexible architecture, like this one, can make clickstream data warehouse ETL construction almost a mechanical process. But, one has to make sure that similar log data sources use the same log record formats, or information will be lost.

Mark Sweiger is President and Principal Consultant for Clickstream Consulting. He can be reached at msweiger@ClickstreamConsulting.com.