
E-Business Opps -- Web Server Log Files: The Ultimate Clickstream Data Source

In many ways, Web server log file records are a dream data source. They are automatically emitted by every popular type of Web server, including Apache, Microsoft IIS, IBM WebSphere and BEA WebLogic. There is one record for each HTTP transaction, providing a very fine granularity of user activity detail. And, unless the site chooses to customize its logging, log records have a universal standard format, the NCSA Combined Log File Format, making them exceptionally easy to parse and manipulate.

The Combined Log File Format contains the following fields:

• h = The remote requestor ID. This will be in IP format (example: 165.32.124.3) if Domain Name System (DNS) reverse lookup is not enabled.

• l = The remote username via identd. This is blank (indicated by a "-"), unless the client is UNIX or Linux.

• u = The HTTP authenticated username. This is likely blank, unless your site requires user registration.

• t = The time the request was logged, enclosed in square brackets.

• r = The entire text of the HTTP request.

• s = The returned HTTP status code.

• b = The number of bytes in the returned page.

• {referer}i = The most recent referring URL, taken from the "referer" field of the HTTP request header.

• {user-agent}i = The requesting software type; usually a browser.

Here are a few examples of log records:

  • 204.243.130.5 - - [26/Feb/2001:15:34:52 -0600] "GET / HTTP/1.0" 200 8437 "http://metacrawler.com/crawler?general=dimensional+modeling" "Mozilla/4.5 [en] (Win98; I)"
  • 204.243.130.5 - - [26/Feb/2001:15:34:53 -0600] "GET /logo1.gif HTTP/1.0" 200 1900 "http://www.companyname.com/" "Mozilla/4.5 [en] (Win98; I)"
  • 204.243.130.5 - - [26/Feb/2001:15:35:26 -0600] "GET /articles.html HTTP/1.0" 200 7363 "http://www.companyname.com/" "Mozilla/4.5 [en] (Win98; I)"
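Because the format is fixed, a record like these can be split into its fields with a single regular expression. The following is only a minimal sketch: the names COMBINED_RE and parse_record and the dictionary keys are illustrative choices, not part of any standard, and the pattern assumes that quoted fields contain no embedded double quotes.

```python
import re

# One pattern per Combined Log File Format record. Field names are
# illustrative; quoted fields are assumed to contain no embedded quotes.
COMBINED_RE = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_record(line):
    """Return a dict of fields, or None if the line does not match."""
    match = COMBINED_RE.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # A "-" in the byte count means no body was sent; treat it as zero.
    record["bytes"] = 0 if record["bytes"] == "-" else int(record["bytes"])
    return record

sample = (
    '204.243.130.5 - - [26/Feb/2001:15:34:52 -0600] "GET / HTTP/1.0" 200 8437 '
    '"http://metacrawler.com/crawler?general=dimensional+modeling" '
    '"Mozilla/4.5 [en] (Win98; I)"'
)
record = parse_record(sample)
print(record["host"], record["status"], record["bytes"])
```

Lines that do not match return None, so a site that has customized its log format will fail visibly rather than yield misaligned fields.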

In the first log record, the first field, "204.243.130.5," is the IP address of the system issuing the HTTP requests. It is quite likely that this IP address belongs to an Internet proxy for the client, such as a server at the user’s ISP. As a matter of course, I run a DNS mapper against our site’s log file to decode IP addresses into real host names, but, in this case, the IP address did not correspond to a registered domain name and so it was not decoded. The next two fields in the log record are "blanks," indicated by the two hyphen placeholders. These blank fields indicate that the client was not authenticated by identd or by HTTP authentication. The next field, "26/Feb/2001:15:34:52 -0600," is the time when the user’s request was logged, expressed in the server’s local time followed by its offset from GMT (here, six hours behind).
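To compare records across servers in different time zones, the logged time can be normalized to GMT. This is a small sketch using Python's standard datetime module; the helper name parse_log_time is illustrative, and the "%b" month directive assumes an English locale.

```python
from datetime import datetime, timezone

def parse_log_time(text):
    """Parse a log timestamp such as '26/Feb/2001:15:34:52 -0600'.

    The result is timezone-aware, carrying the logged offset.
    Assumes an English locale for the month abbreviation.
    """
    return datetime.strptime(text, "%d/%b/%Y:%H:%M:%S %z")

local_time = parse_log_time("26/Feb/2001:15:34:52 -0600")
gmt_time = local_time.astimezone(timezone.utc)
print(gmt_time.isoformat())
```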

The next field, "GET / HTTP/1.0," is the actual text of the HTTP request. The next field, "200," is the returned HTTP status code, and 200 means that the operation was successful. The next field following the status code, "8437," is the actual number of bytes transferred to the client as the result of the HTTP transaction. If the number of bytes transferred is less than the number of bytes in the actual home page, then the user probably abandoned the GET by hitting STOP or BACK, or by clicking through to another link before the page was fully loaded. The next field, "http://metacrawler.com/crawler?general=dimensional+modeling," is the referring URL. In this case, the referrer is the search engine metacrawler.com, which found the site as the result of a search for sites that mention the term "dimensional modeling," as specified by the query string parameters after the "?" in the URL. The last field in the log record, "Mozilla/4.5 [en] (Win98; I)," indicates which type of browser and OS the user employed to get this page. In this case, the browser was the English-language version of Netscape Navigator 4.5 running on a Windows 98 system.
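The search phrase can be recovered from the referring URL by decoding its query string. The sketch below uses Python's standard urllib.parse module. The function name search_terms is illustrative, and the parameter name "general" is simply the one that appears in this example; other search engines use other parameter names.

```python
from urllib.parse import parse_qs, urlsplit

def search_terms(referrer, param="general"):
    """Return the decoded search phrase from a referring URL, or None.

    'param' is the query-string key holding the phrase; "general" is
    taken from the example record and differs between search engines.
    """
    query = urlsplit(referrer).query
    values = parse_qs(query).get(param, [])
    return values[0] if values else None

referrer = "http://metacrawler.com/crawler?general=dimensional+modeling"
print(urlsplit(referrer).hostname, "->", search_terms(referrer))
```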

Moving on to the second log record, we see that it records an HTTP GET of the image "logo1.gif," which is the file containing the company logo. This additional GET is embedded inside the HTML that generates the home page, which resulted in the creation of the first log record above. The existence of this second log record for the home page illustrates a key concept: Most page views generate multiple HTTP transactions, resulting in multiple log records per page view.
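One common way to get from raw hits to page views is to discard requests for embedded objects such as images. The sketch below does this by file extension. The extension list and the function name is_page_view are assumptions made for illustration; a real site would tune the list to its own content.

```python
import os
from urllib.parse import urlsplit

# Extensions treated as embedded objects rather than pages (an assumed,
# illustrative list; adjust it for the site being analyzed).
EMBEDDED_EXTENSIONS = {".gif", ".jpg", ".jpeg", ".png", ".css", ".js"}

def is_page_view(request):
    """Return True if an HTTP request line looks like a page view."""
    parts = request.split()
    if len(parts) < 2:
        return False
    method, target = parts[0], parts[1]
    extension = os.path.splitext(urlsplit(target).path)[1].lower()
    return method == "GET" and extension not in EMBEDDED_EXTENSIONS

requests = [
    "GET / HTTP/1.0",
    "GET /logo1.gif HTTP/1.0",
    "GET /articles.html HTTP/1.0",
]
page_views = [r for r in requests if is_page_view(r)]
print(len(requests), "hits,", len(page_views), "page views")
```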

The final log record in the example indicates that the user decided to visit the site’s "Articles and Information" link, whose content is in the file "articles.html."

While the fields contained in the standard Combined Log File Format are obviously useful as a clickstream data source, additional fields can be added to the log record by adding objects to the log record template used by your Web server’s logging module. Virtually all Web server software supports extensions to the format of the log file.
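When the template is extended, the parser has to follow it. One approach, sketched here under several assumptions, is to build the parsing pattern from the template itself. The template below uses Apache-style directives and appends the Cookie request header as an example extension; the cookie value in the sample line is made up, and the names SIMPLE, TOKEN and compile_template are illustrative. Only the directives shown are handled.

```python
import re

# Apache-style directive letter -> (field name, regex). A partial,
# illustrative mapping covering only the Combined Log File Format fields.
SIMPLE = {
    "h": ("host", r"\S+"),
    "l": ("ident", r"\S+"),
    "u": ("user", r"\S+"),
    "t": ("time", r"\[[^\]]+\]"),
    "r": ("request", r'[^"]*'),
    "s": ("status", r"\d{3}"),
    "b": ("bytes", r"\d+|-"),
}
# Matches %h, %>s, %{Referer}i and similar tokens.
TOKEN = re.compile(r"%(?:\{([^}]+)\})?>?([a-zA-Z])")

def compile_template(template):
    """Turn a log record template into a compiled regular expression."""
    pattern, pos = "", 0
    for m in TOKEN.finditer(template):
        pattern += re.escape(template[pos:m.start()])  # literal text
        header, letter = m.group(1), m.group(2)
        if header is not None:
            # %{Name}i: name the field after the header, e.g. user_agent.
            name, regex = re.sub(r"\W", "_", header).lower(), r'[^"]*'
        else:
            name, regex = SIMPLE[letter]
        pattern += f"(?P<{name}>{regex})"
        pos = m.end()
    return re.compile(pattern + re.escape(template[pos:]))

template = '%h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-agent}i" "%{Cookie}i"'
line = (
    '204.243.130.5 - - [26/Feb/2001:15:35:26 -0600] "GET /articles.html HTTP/1.0" '
    '200 7363 "http://www.companyname.com/" "Mozilla/4.5 [en] (Win98; I)" '
    '"session=abc123"'  # hypothetical cookie value
)
fields = compile_template(template).match(line).groupdict()
print(fields["cookie"])
```

Deriving the pattern from the template keeps the parser and the server configuration in step: adding a field to the template adds a field to the parsed record.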

Mark Sweiger is President and Principal Consultant for Clickstream Consulting. He can be reached at [email protected].
