Clickstream Data Warehouse: Bleeding Edge or Mainstream?

The ability to know what people want by observing how they think as you follow their clickstream data is a powerful way to support business decisions.

by Stephanie Best

In today's world, we no longer have to wait a long time for an answer to our questions. If we have a question about almost anything, we can almost always find the answer right away. If we need a product, we can almost always find the item right away. On the Internet, the answers or items are just a few keystrokes or clicks away. Search engines have changed the way we write term papers, the way we confirm the name of that actor that was on the "tip of our tongue," and even the way we follow up on a question from that televised game show we were watching. We open the browser, type in our relevant information, and off we go.

Where we find the answer, and whether we return to that Web site in the future, can depend on several factors. Is the Web site logically laid out so you can easily find what you are looking for? Do users frequently look for similar products and information when visiting a particular site? Similarly, do users often look past other links and related information?

This information, otherwise known as "what people want," is an extremely valuable commodity to businesses. Timely clickstream data from Web site visits gives an organization insight into its visitors' patterns. This data enables companies to be more effective by improving campaigns and offers, pricing models, Web site targeting, layout and workflow, and more.

As companies become more sophisticated in their Web analytics requirements, they often choose to replace or augment the capabilities of readily available, off-the-shelf Web analytics products. Building and maintaining an internal clickstream data warehouse (CDW) enables these companies to manage, segment, and report on the data in ways that the packaged products do not. Recent advances in data warehousing technologies (such as columnar databases and data warehousing appliances designed to deliver extreme performance with massive and complex data sets) have made CDWs a practical option for many companies.

For any CDW initiative to be successful, the ability to process the raw clickstream data files in a rapid, cost-effective and scalable manner is a critical component. The raw clickstream source data, which often exists as huge, complex text files, must be parsed, structured, cleansed, and loaded on a regular basis into the CDW before the CDW can provide value.
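
As a rough illustration of the "parse and structure" step, the sketch below turns one raw Web server log line into a structured record. It assumes the source files use the Apache/NCSA Combined Log Format, which is an assumption made for this example; the function name `parse_line` and the field names are hypothetical, and real clickstream files often carry many more fields.

```python
import re
from datetime import datetime

# Assumption: input lines follow the Apache/NCSA Combined Log Format.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ (?P<user>\S+) \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return a structured record (dict), or None if the line is malformed."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None  # cleansing step: the caller can count or log rejects
    record = match.groupdict()
    # Convert text fields into typed values before loading.
    record["ts"] = datetime.strptime(record["ts"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


sample = (
    '203.0.113.7 - alice [10/Oct/2023:13:55:36 +0000] '
    '"GET /products/42 HTTP/1.1" 200 2326 '
    '"http://example.com/" "Mozilla/5.0"'
)
record = parse_line(sample)
```

Because the record keeps server-side fields such as the status code, the same parsed data can also answer questions about failed requests that client-side tagging cannot see.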

Web Analytics Products

Off-the-shelf, software-as-a-service (SaaS) Web analytics products have been available for years. For many companies, these products provide sufficient levels of detail and flexibility in their reporting. These products also let the user jump right in and start using the service out of the box. There are also implementation time and cost benefits associated with off-the-shelf Web analytics products.

However, due to the page-centric nature of the results they provide, many organizations have determined they need to augment the capabilities of these products. Some of the limitations of off-the-shelf Web analytics products include:

  • Lack of user-centric segmentation. Although useful for tracking activity in a page-centric manner, for many companies the specific information and segmentation available with off-the-shelf Web analytics products does not satisfy their requirements for user-centric information. Although customers are able to track the number of visitors, page views, and conversions on their Web site, they are unable to segment the data by user to understand what a user does in a particular session or to track a user's activity across multiple sessions.
  • Lack of historical analysis. With off-the-shelf products, the pages to be tracked and the tracking criteria must be defined in advance; it's impossible to report on new criteria from previous (historical) Web site activity. The new criteria must first be defined, and only subsequent activity can be tracked.
  • Tracking limitations. A limitation with technologies such as page tagging (embedding an object into a Web page that enables usage tracking) is that not all user visits are tracked. For example, visits are not recorded for visitors who have deleted cookies or who don't have JavaScript enabled. Also, since most SaaS Web analytics products rely on code that is executed on the client, they cannot report on server responses, such as failed requests and response times.

The Clickstream Data Warehouse

A clickstream data warehouse is used to store all of the historical Web site activity in a structured format -- typically on the company's own servers -- so that sophisticated queries and reports can be run on the data with business intelligence software. With a large volume of clickstream data being generated on a daily basis, and the large number of fields in the data, the prospect of implementing a CDW can be daunting. The implementation can be a time-consuming and costly endeavor when compared with using an off-the-shelf Web analytics product.

Even so, the business advantages of augmenting or supplanting packaged SaaS Web analytics products with a CDW often provide sufficient justification for companies to undergo the initiative. Some of these business advantages include:

  • Flexibility. Because the company has all of the data, it can process, segment, and report on the data in whatever ways it chooses. For example, the ability to segment the data into unique user sessions and to combine multiple visits of a particular user over time provides rich insight into customer value.
  • Combining multiple touch points. Combining a customer's clickstream activity with data from other customer relationship management systems provides companies with a more complete view of the customer and allows for more precise customer scoring.
  • Historical analysis. With a CDW, queries do not need to be pre-defined. Days, months, or years after the activity, an organization can ask new questions of the data that it did not initially think to ask.
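
To make the session segmentation described under "Flexibility" concrete, here is a minimal sketch that groups raw click events into per-user sessions. The 30-minute inactivity timeout, the `(user_id, timestamp, url)` event layout, and the name `sessionize` are all assumptions chosen for illustration, not a method prescribed by any particular product.

```python
from datetime import datetime, timedelta
from itertools import groupby
from operator import itemgetter

# Assumption: a gap of more than 30 minutes starts a new session.
SESSION_GAP = timedelta(minutes=30)


def sessionize(events):
    """Group (user_id, timestamp, url) events into (user_id, clicks) sessions."""
    sessions = []
    # Sort by user, then by time, so each user's clicks are contiguous.
    ordered = sorted(events, key=itemgetter(0, 1))
    for user_id, user_events in groupby(ordered, key=itemgetter(0)):
        current = []
        for _, ts, url in user_events:
            # Close the running session when the inactivity gap is exceeded.
            if current and ts - current[-1][0] > SESSION_GAP:
                sessions.append((user_id, current))
                current = []
            current.append((ts, url))
        sessions.append((user_id, current))
    return sessions


events = [
    ("u1", datetime(2024, 1, 1, 9, 0), "/home"),
    ("u1", datetime(2024, 1, 1, 9, 10), "/products"),
    ("u1", datetime(2024, 1, 1, 11, 0), "/home"),
    ("u2", datetime(2024, 1, 1, 9, 5), "/home"),
]
sessions = sessionize(events)
```

Because the warehouse keeps the raw events, the timeout can be changed later and the sessions recomputed over historical data, which a page-centric report defined in advance does not allow.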

High-performance data warehousing technologies are often used because of the complexity and size of the clickstream data. However, high-performance data warehousing technology alone is not sufficient.

A critical component of any CDW implementation is extreme-performance data transformation technology that can parse, structure, and cleanse the raw clickstream source files to initially populate the CDW and to refresh it on an ongoing basis. Processing source files is a common cause of bottlenecks, as these files are typically very large and complex and require significant processing to extract the desired information. The frequency of the file refreshes is another source of bottlenecks, introducing delays and errors into the process. In recent years, data integration software has become available that offers the benefits of ease-of-use and extensibility while delivering extreme levels of performance and scalability for processing large, complex clickstream data files.
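
One common way to keep very large source files from becoming a memory bottleneck is to stream them and load in fixed-size batches. The sketch below shows the idea only; it assumes tab-separated input with four fields, uses SQLite as a stand-in for the warehouse, and applies a deliberately crude crawler filter. The names `clean_records` and `load_in_batches` are hypothetical, and this is not a substitute for the high-performance tools the article describes.

```python
import sqlite3
from itertools import islice


def clean_records(lines):
    """Lazily yield (user_id, ts, url) tuples, skipping malformed or bot rows."""
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 4:
            continue  # malformed row: wrong number of fields
        user_id, ts, url, agent = parts
        if "bot" in agent.lower():
            continue  # assumption: crude filter for crawler traffic
        yield user_id, ts, url


def load_in_batches(conn, records, batch_size=10_000):
    """Insert records in fixed-size batches so memory use stays bounded."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS clicks (user_id TEXT, ts TEXT, url TEXT)"
    )
    records = iter(records)
    total = 0
    while True:
        batch = list(islice(records, batch_size))
        if not batch:
            break
        conn.executemany("INSERT INTO clicks VALUES (?, ?, ?)", batch)
        conn.commit()
        total += len(batch)
    return total


# In practice `lines` would be an open file object, read one line at a time.
lines = [
    "u1\t2024-01-01T09:00:00\t/home\tMozilla/5.0\n",
    "this line is malformed\n",
    "u2\t2024-01-01T09:05:00\t/home\tGooglebot/2.1\n",
    "u3\t2024-01-01T09:07:00\t/cart\tMozilla/5.0\n",
]
conn = sqlite3.connect(":memory:")
loaded = load_in_batches(conn, clean_records(lines), batch_size=1)
```

Since both steps work on iterators, only one batch is held in memory at a time regardless of how large the source file is.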


Until recently, initiatives to manage clickstream data in-house were limited to companies on the bleeding edge. However, with the availability of newer high-performance technologies, this capability is migrating into the mainstream. The ability to know what people want by observing how they think as you follow their clickstream data is a powerful way to support business decisions. More organizations are "building" their virtual warehouses full of clickstream data. These companies are raising the bar on the insights and benefits that can be obtained from clickstream analysis. With the knowledge obtained, the companies are able to fulfill the ultimate goal of delivering precisely what the customer wants when they want it.

Stephanie Best is director, product marketing at Syncsort.
