E-Business Opps -- Cookies: The Perfect User ID Snack

In last month’s column, I presented a generalized meta-schema for an e-business clickstream data warehouse. While it is tempting to forge ahead and apply this model to a specific example, first we need to have a much better understanding of the data contained in such a schema. Since the focus of a clickstream data warehouse is the analysis of user activity, there must be a mechanism to identify each user. This is the job for one of the most reviled and misunderstood components of a typical e-business architecture, the cookie file.

Most people are at least vaguely aware of the existence of cookie files on their client systems. They are created and modified by all Web browsers when they parse a "Set-Cookie" header string in the response to a particular HTTP request (like a GET of a Web page). Although the exact format of cookies varies from browser to browser, they all have at least the following six fields:

Name. The name of the cookie variable, for example, "UserID." The name is a required field and has no default values.

Value. The string value assigned to the cookie variable. For example, the cookie variable named "UserID" could be set to a value of "334." If the value is empty, then the value of the client cookie is cleared.

Domain. This is the domain name that created the cookie, and it is the only domain that is permitted to receive or modify the cookie on subsequent accesses. Only the creating domain may read its own cookies; other domains have no access. The domain value must have at least two dots, like ".clickstreamconsulting.com," for example. Otherwise one could create a cookie for .com or .net, which is not permitted.

Path. The top level of the subtree within the domain for which the cookie is valid. The cookie is returned on any access to a page within that subtree. A path of "/" means the cookie is good for all pages in the Web site, while a more qualified path, like "/ClickstreamConsulting/articles," means the cookie applies only to pages in the /ClickstreamConsulting/articles subtree.

Expires. The expiration date of the cookie. The cookie persists on the client system until this date. If this value is not set, the cookie lasts only for the duration of the browser session, after which it is automatically deleted.

Secure. If TRUE, a secure connection to the domain is needed to pass the cookie. The default value is FALSE.
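
To make the six fields concrete, here is a minimal sketch of how a server-side script might assemble a Set-Cookie header using the http.cookies module from Python's standard library. The function name build_set_cookie and the expiration date are illustrative assumptions, not something prescribed by any particular Web server, and the attribute order in the output may differ from what other tools produce.

```python
from http.cookies import SimpleCookie


def build_set_cookie(name, value, domain, path="/", expires=None, secure=False):
    """Return the text of a Set-Cookie header built from the six fields.

    build_set_cookie is a hypothetical helper written for illustration only.
    """
    cookie = SimpleCookie()
    cookie[name] = value              # Name and Value
    morsel = cookie[name]
    morsel["domain"] = domain         # Domain: must contain at least two dots
    morsel["path"] = path             # Path: "/" covers the whole site
    if expires is not None:
        morsel["expires"] = expires   # Expires: omit for a session-only cookie
    if secure:
        morsel["secure"] = True       # Secure: send only over a secure connection
    return morsel.OutputString()


# Example values taken from the field descriptions; the date is arbitrary.
header = build_set_cookie(
    "UserID",
    "334",
    ".clickstreamconsulting.com",
    path="/",
    expires="Wed, 01 Jan 2031 00:00:00 GMT",
)
print("Set-Cookie: " + header)
```

Leaving expires unset in this sketch yields a session cookie, and passing secure=True adds the Secure flag to the header.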

The key to quickly determining a user’s identity is the cookie file. If a user accesses your Web site for the first time, there will be no cookie returned for your domain, because the cookie hasn’t been created yet. Assuming your Web server is configured to issue cookies, on first access by a user it will note that no cookie was passed, and it will then add a Set-Cookie header to the response that sends back the requested page, causing the cookie to be created on the client system. If the value of the cookie variable in the Set-Cookie header is unique, then all subsequent accesses by that user will be identified by the unique value of the returned cookie variable.
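
The first-visit logic just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name identify_user, the use of a random UUID as the unique value, and the one-year lifetime are all hypothetical choices, and a real Web server would typically handle this through its own configuration or modules.

```python
import uuid
from http.cookies import SimpleCookie

COOKIE_NAME = "UserID"                 # cookie variable name used in this column
ONE_YEAR_SECONDS = 365 * 24 * 60 * 60  # assumed lifetime, chosen for illustration


def identify_user(cookie_header):
    """Return (user_id, set_cookie_value_or_None) for one HTTP request.

    cookie_header is the raw text of the request's Cookie header, or None
    if the browser sent no cookie for this domain.
    """
    cookie = SimpleCookie()
    if cookie_header:
        cookie.load(cookie_header)

    # Returning visitor: the browser passed the cookie back, so reuse its value.
    if COOKIE_NAME in cookie:
        return cookie[COOKIE_NAME].value, None

    # First access: no cookie was passed, so mint a unique value and
    # prepare a Set-Cookie header to send with the response.
    user_id = uuid.uuid4().hex
    response_cookie = SimpleCookie()
    response_cookie[COOKIE_NAME] = user_id
    response_cookie[COOKIE_NAME]["path"] = "/"
    # An integer is interpreted as a number of seconds from now.
    response_cookie[COOKIE_NAME]["expires"] = ONE_YEAR_SECONDS
    return user_id, response_cookie[COOKIE_NAME].OutputString()
```

On the first request the second element of the result is the text to place in a Set-Cookie header. On later requests it is None, and the first element is the ID the browser returned.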

Knowing that a particular browser instance has a cookie variable "UserID" with a value of "334" is certainly helpful because it distinguishes the Web server activities of that user ID from those of other cookied user IDs. But this level of knowledge about user identity is not very specific, and we probably can do much better. One way to increase the level of user knowledge associated with a cookie user ID is to allow users to register themselves at your site, and then associate the registration information with the cookie. This common technique may increase your knowledge about the user to include things like a registration ID and an e-mail address. Using syndicated data available from a number of providers, e-mail addresses can often be decoded into more specific information like real name, address, phone number, user psycho/demographic data, etc.
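
As a rough sketch of the registration technique, the Python below keeps an in-memory table that links a cookie user ID to registration details and uses it to enrich individual page hits. Everything here is hypothetical: the Registration record, the function names, the sample registration ID, and the example.com address are made up for illustration, and a production site would keep this association in a database rather than a dictionary.

```python
from dataclasses import dataclass


@dataclass
class Registration:
    """Details a user supplied when registering (hypothetical fields)."""
    registration_id: str
    email: str


# Maps a cookie user ID (e.g. "334") to that user's registration record.
registrations_by_cookie = {}


def register(cookie_user_id, registration_id, email):
    """Associate registration information with an existing cookie user ID."""
    registrations_by_cookie[cookie_user_id] = Registration(registration_id, email)


def enrich(hit):
    """Return a copy of a clickstream hit with registration fields attached.

    Hits from unregistered cookies get None for the added fields.
    """
    reg = registrations_by_cookie.get(hit["user_id"])
    return {
        **hit,
        "registration_id": reg.registration_id if reg else None,
        "email": reg.email if reg else None,
    }


register("334", "R-1001", "user@example.com")
known = enrich({"user_id": "334", "url": "/articles"})
unknown = enrich({"user_id": "999", "url": "/"})
```

Here the hit from cookie "334" picks up the registration ID and e-mail address, while the hit from the unregistered cookie "999" stays anonymous.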

A clever new way to quickly determine user identity is available from Coremetrics. Coremetrics subscribers insert special JavaScript tags, which call a Coremetrics server, into the start of their Web pages. If any site that uses Coremetrics has been previously accessed by that user (and chances of this are high given its popularity), a cookie called ".data.coremetrics.com" will already exist on the client system. This cookie identifies the user to Coremetrics, and Coremetrics passes this identity back to the subscribing site in its response to the original call. Because the same Coremetrics cookie and, therefore, user identity, is used by multiple Web site subscribers, it is possible to identify user activities that cross site boundaries, like referring sites, advertising engines, affiliated sites, etc., not to mention the ability to share detailed identity information among Coremetrics and those sites. While few will admit to using Coremetrics, it is one of the more popular user identity services, and definitely worth a look.

Next month, we will continue our journey to understand the data inside a clickstream data warehouse by analyzing Web server log files, the primary clickstream data source. When coupled with cookie files, this data stream is the key to a clickstream data warehouse implementation. Don’t miss it.

Mark Sweiger is President and Principal for Clickstream Consulting, specializing in custom clickstream data warehouses and e-commerce education. He can be reached via e-mail at msweiger@ClickstreamConsulting.com.