HOME articles tutorials tool directory training books about |
A Comprehensive Strategy for Using Web Site Statistics - Page 2

User Agent Fields

The user agent field also suffers from imprecise semantics, differing implementations, and missing data. This can partly be attributed to browser vendors using the field for content negotiation. Because HTML rendering differs from browser to browser, servers can alter the HTML they send depending on which browser is on the other end. Consequently, the user agent field may contain the names of multiple browsers. Some proxies also append information to this field. In addition, the value of the user agent field can vary across requests made by the same user with the same Web browser.

Adding to the confusion, there is no standardized way to determine whether requests are made by autonomous agents (e.g., robots), semi-autonomous agents acting on behalf of users (e.g., software copying a set of pages for off-line reading), or humans following hyperlinks in real time. Clearly, it is important to distinguish these classes of requests when attempting to model surfing behavior.
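To make the "multiple browser names" problem concrete, here is a minimal Python sketch that lists every browser-like name in a user agent string. The function name `browser_names` and the two patterns are illustrative assumptions, not part of any log analysis package, and real user agent strings vary far more than this handles.

```python
import re

# "Name/version" product tokens, e.g. "Mozilla/4.0" (assumed pattern).
PRODUCT = re.compile(r"([A-Za-z][\w.-]*)/([\w.]+)")
# The parenthesised comment section of a user agent string.
COMMENT = re.compile(r"\(([^)]*)\)")


def browser_names(user_agent):
    """Return every browser-like name found in a user agent string."""
    names = [name for name, _version in PRODUCT.findall(user_agent)]
    for comment in COMMENT.findall(user_agent):
        for part in comment.split(";"):
            # "MSIE 6.0" style entries: one word followed by a version number.
            match = re.fullmatch(r"([A-Za-z]+) (\d[\w.]*)", part.strip())
            if match:
                names.append(match.group(1))
    return names


# A string of the kind sent by Internet Explorer 6 on Windows XP.
print(browser_names("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"))
```

The single request above names both Mozilla and MSIE, which is why a simple string comparison on this field is a weak basis for identifying the browser.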
Interpreting Cookies

Although cookies were initially implemented to facilitate shopping carts, a common use of cookies is to uniquely identify users within a web site. Cookies work in the following manner. When a person visits a cookie-enabled web site, the server replies with the content and a unique identifier called a cookie, which the browser stores on the user's machine. On subsequent requests to the same web site, the browser includes the value of the cookie with each request. Because the identifier is unique, all requests that carry the same cookie are known to come from the same browser. Since multiple people may use the same browser, a cookie may not actually represent a single user, but most web sites are willing to accept this limitation and treat each cookie as a single user.

Recently, browser vendors have given users controls to select a cookie policy that matches their privacy preferences. Users can choose how much they are told when visiting a server that issues cookies, and they can also prevent their browser from sending cookies at all. Consequently, unless a site requires people to use cookies to receive content, the cookie field may be null, which leaves the task of identifying user paths to the other recorded fields.

Given the limitations of the information recorded in Web access logs, it is not surprising that some sites require users to accept cookies and defeat caching in order to gain more accurate usage information. Still, numerous sites either do not use cookies or do not require users to accept a cookie to gain access to content. In these cases, determining unique users and their paths through a web site is typically done heuristically.

Even when cookies are used, several scenarios are possible when a previously encountered cookie is processed. If the request comes from the same host, regardless of the user agent, the request is treated as being issued by the same user.
This is because a unique cookie is issued to only one browser. If the user agent field remains the same but the host changes, it is still the same user and some form of IP/domain name change is occurring. This often happens with users behind firewalls and with ISPs that load-balance proxies. However, if the same cookie arrives with a different user agent, then an error has most likely occurred, as cookies are not shared across browsers.

If no cookies are present, then the site statistics software can resort to using IP addresses. If the request comes from a known host, then it could be from a new user or from the same user; otherwise the request is from a different user. Note that requests in these latter two cases could also come from crawling software that does not support cookies.

An interesting set of scenarios occurs when a new cookie is encountered. If the request is from a host that has already been processed, the previous value of the cookie was "null", and the user agent is the same, it is fair to conclude that the request is from a new user who received their first cookie from the server in response to the previous request. If the client is not using cookie obfuscation software, one would expect all following requests from this user to contain the same cookie. However, if the previous request from the same host and agent carried a different cookie, it could be the same user obfuscating cookie requests, or a new user from the same ISP using the same browser version and platform as the user from the previous request. Barring other supporting evidence, such as the referrer field or the site's topology, it is difficult to determine which scenario is correct. If the user agent differs from the previous request but accompanies a new cookie from the same host, it is fair to assume that a new user has entered the site. Of course, a new cookie from a new host, regardless of the agent, is a new user.
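The cookie scenarios above can be sketched as a small decision procedure. This is one possible reading of the rules, not a reference implementation: the `Request` and `CookieTracker` names, the verdict strings, and the choice to compare a new cookie only against the most recent request from the same host are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Request:
    host: str
    user_agent: str
    cookie: Optional[str]  # None models a "null" cookie field


@dataclass
class CookieTracker:
    # cookie value -> (host, user_agent) of the request that last carried it
    seen_cookies: dict = field(default_factory=dict)
    # host -> (cookie, user_agent) of the most recent request from that host
    last_by_host: dict = field(default_factory=dict)

    def classify(self, req):
        """Return a verdict for req, then record it for later requests."""
        verdict = self._decide(req)
        if req.cookie is not None:
            self.seen_cookies[req.cookie] = (req.host, req.user_agent)
        self.last_by_host[req.host] = (req.cookie, req.user_agent)
        return verdict

    def _decide(self, req):
        previous = self.last_by_host.get(req.host)
        if req.cookie is None:
            # No cookie: fall back to the host (IP address or domain name).
            if previous is None:
                return "new user"
            return "ambiguous: same or new user"
        if req.cookie in self.seen_cookies:
            # Previously encountered cookie.
            host, agent = self.seen_cookies[req.cookie]
            if host == req.host:
                return "same user"
            if agent == req.user_agent:
                return "same user (host changed)"
            return "probable error"
        # New cookie.
        if previous is None:
            return "new user"
        prev_cookie, prev_agent = previous
        if prev_agent != req.user_agent:
            return "new user"
        if prev_cookie is None:
            return "user from previous request (first cookie)"
        return "ambiguous: obfuscation or new user"


tracker = CookieTracker()
for req in [
    Request("host1", "AgentA", None),
    Request("host1", "AgentA", "c1"),
    Request("host2", "AgentA", "c1"),
    Request("host1", "AgentA", "c2"),
]:
    print(req.cookie, "->", tracker.classify(req))
```

The two "ambiguous" verdicts mark the cases where the log fields alone cannot settle the question and other evidence, such as the referrer field, would be needed.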
You can also learn something about visitors by studying their domain names. Though the log file may record IP addresses, your log analysis program can determine the associated domain or ISP for many of these IP numbers. This might tell you whether your most important client (or competitor) has been looking at your web pages.

The simplest assumption to make about users is that each IP address or domain name represents a unique user. Using this method, all the requests made by the same host are treated as though they came from a single user. When a new host is detected, a new user profile is created and the corresponding requests are associated with the new user.

Several methods that use additional information recorded in the access logs, or other heuristics, are also possible. One refinement is to use the user agent field. Using this method, new users are identified as above and also when requests coming from the same machine have different user agents. Another refinement is to place session timeouts on requests made from the same machine. The intuition is that if a certain amount of time has elapsed, then the old user has left the site and a new user has entered.
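The host-based method and its two refinements can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function name `assign_users`, the `(timestamp, host, user_agent)` tuple layout, and the 1800-second timeout in the usage example are hypothetical choices, since the text does not specify a timeout value or a log format.

```python
def assign_users(requests, use_agent=False, timeout=None):
    """Assign a user id to each request.

    requests: iterable of (timestamp_seconds, host, user_agent) tuples,
              assumed to be sorted by timestamp.
    use_agent: if True, the same host with a different user agent is a new user.
    timeout: if set, a gap longer than this many seconds starts a new user.
    """
    last_seen = {}  # key -> (user id, timestamp of that key's last request)
    next_id = 0
    ids = []
    for timestamp, host, agent in requests:
        key = (host, agent) if use_agent else host
        entry = last_seen.get(key)
        expired = (
            entry is not None
            and timeout is not None
            and timestamp - entry[1] > timeout
        )
        if entry is None or expired:
            user_id = next_id
            next_id += 1
        else:
            user_id = entry[0]
        last_seen[key] = (user_id, timestamp)
        ids.append(user_id)
    return ids


log = [
    (0, "10.0.0.1", "AgentA"),
    (60, "10.0.0.1", "AgentB"),
    (120, "10.0.0.2", "AgentA"),
    (4000, "10.0.0.1", "AgentA"),
]
print(assign_users(log))                                # [0, 0, 1, 0]
print(assign_users(log, use_agent=True))                # [0, 1, 2, 0]
print(assign_users(log, use_agent=True, timeout=1800))  # [0, 1, 2, 3]
```

Each refinement splits the same four requests into more users, which shows how sensitive a "unique visitors" figure is to the heuristic behind it.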
(c) copyright 2000-2007 Anventure. All Rights Reserved.