The Inference Algorithms

Because Internet protocols are stateless (that is, there is no sustained connection between client and server), the Internet server log files contain no definitive information regarding visits or users. Consequently, visit and user information must be inferred (statistically approximated) from the data in the log files and the logical structure of your site, which you specify with the Server Manager.

What is a Hit?

A hit is a line in a log file. Hits include:


You can count hits simply by counting lines in a log file. But since that count is unrelated to content or user behavior, it’s impossible to extrapolate meaningful information simply by comparing hit counts. For example, a page with four inline images counts as five hits when visited just once. A page with no images counts as a single hit for a single visit. Comparison between the two in no way reflects level of usage.
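The arithmetic in that example can be sketched in a few lines. The log lines and paths below are hypothetical and use a simplified format rather than a real server log layout; the only rule taken from the text is that each log line is one hit.

```python
def count_hits(log_lines):
    """Count hits: every non-blank log line is one hit."""
    return sum(1 for line in log_lines if line.strip())

# One visit to a page with four inline images (hypothetical paths).
gallery_visit = [
    "GET /gallery.html",
    "GET /images/a.gif",
    "GET /images/b.gif",
    "GET /images/c.gif",
    "GET /images/d.gif",
]

# One visit to a page with no images.
text_visit = ["GET /about.html"]

gallery_hits = count_hits(gallery_visit)
text_hits = count_hits(text_visit)
```

Both pages were visited exactly once, yet the hit counts differ, which is why hit counts alone say little about usage.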

What is a Request?

A request is any connection to an Internet site (a hit) that successfully retrieves content. You may be familiar with similar terminology that refers to page "views" or "impressions." Usage Import reads every hit in the log file, but copies only requests into the Usage Analyst database.

For a hit to qualify as a request, the HTTP response code in the web server log file must be 200 or 304. Ad clicks (HTTP response code 302) are also imported into the database if the requested file name matches the paths specified in site properties. (See "Site Properties: Advertising" in Chapter 5.)
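The qualification rule can be illustrated with a minimal sketch. The function name `is_request`, the `ad_paths` parameter, and the use of prefix matching for ad paths are assumptions made for illustration; the documentation states only that the file name must match the paths specified in site properties.

```python
REQUEST_STATUS_CODES = {200, 304}   # response codes that qualify as requests
AD_CLICK_STATUS_CODE = 302          # redirect code used for ad clicks

def is_request(status, path, ad_paths=()):
    """Return True if a hit would be imported as a request.

    ad_paths is a hypothetical stand-in for the advertising paths
    configured in site properties; prefix matching is an assumption.
    """
    if status in REQUEST_STATUS_CODES:
        return True
    if status == AD_CLICK_STATUS_CODE:
        return any(path.startswith(prefix) for prefix in ad_paths)
    return False
```

For example, `is_request(302, "/ads/click", ad_paths=("/ads/",))` qualifies, while a 302 for a path outside the configured ad paths does not.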

Note
Request counts are conservative because browser software and many Internet gateways intercept some requests before they reach the server, and these cached requests are never logged. Usage Import compensates by means of its inference algorithms. (See "Site Properties: Inferences" in Chapter 5.)

What is a Visit?

A visit is a series of consecutive requests from a user to an Internet site. To assign requests to visits, the Import module sorts all requests in the log file based upon properties, which differentiate visits from one another. These properties include:


After the sort, the visit algorithm uses two methods to discern individual visits:

  1. If your extended log files include referrer data, then new visits begin with referring links external to your Internet site.
  2. Regardless of whether you have referrer data, if a user doesn’t make a request for 30 minutes, the previous series of requests from that user is considered a completed visit. (You can adjust the timeout duration in the Inferences panel of Site properties. See Chapter 5 for more information.)
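The two methods above can be sketched as follows. This is an illustration under stated assumptions, not the Import module's actual code: requests are assumed to be already grouped per user and sorted by time, each request is reduced to a `(timestamp, referrer)` pair, and a gap of exactly 30 minutes is treated as a timeout. The names `split_visits` and `site_host` are hypothetical.

```python
from datetime import datetime, timedelta
from urllib.parse import urlparse

def split_visits(requests, site_host, timeout=timedelta(minutes=30)):
    """Split one user's time-sorted requests into visits.

    requests:  list of (timestamp, referrer) pairs; referrer may be None.
    site_host: host name of your own site (used to spot external referrers).
    """
    visits = []
    last_time = None
    for timestamp, referrer in requests:
        # Method 1: a referring link from outside the site starts a new visit.
        external = False
        if referrer:
            ref_host = urlparse(referrer).hostname
            external = ref_host is not None and ref_host != site_host
        # Method 2: a gap of at least `timeout` ends the previous visit.
        timed_out = last_time is not None and timestamp - last_time >= timeout
        if not visits or external or timed_out:
            visits.append([])
        visits[-1].append((timestamp, referrer))
        last_time = timestamp
    return visits
```

When the log has no referrer data, every referrer is `None` and only the timeout rule applies.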

What is a User?

A user is anyone who visits the site at least once. Site Server Express Analysis has three ways to recognize unique users. If your extended log files contain persistent cookie data, the software uses this data to recognize unique users. If no cookie data is available, the software uses a registered user name to recognize users. If no registration information is available, the software uses, as a last resort, users’ Internet host names.
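The order of preference can be sketched as a simple fallback chain. Applying the fallback to each log record individually is an assumption made for illustration, and the function name `user_key` and its parameters are hypothetical.

```python
def user_key(cookie=None, username=None, hostname=None):
    """Pick the identifier used to recognize a user, in order of preference.

    Returns a (kind, value) pair, or None if the record carries no
    usable identifier.
    """
    if cookie:
        return ("cookie", cookie)       # preferred: persistent cookie
    if username:
        return ("username", username)   # next: registered user name
    if hostname:
        return ("host", hostname)       # last resort: Internet host name
    return None
```

Tagging the key with its kind keeps a cookie value from colliding with an identical user name or host name.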

Cookies are the best way to uniquely identify users. The use of cookies before registered user names within the user algorithm makes it possible to tie together both the unregistered and registered portions of a visit to the same user. (Server extensions to implement cookie distribution are available at www.Interse.com.)

Many organizations use Internet gateways, which mask the real Internet host names, so user counts may be conservative for those users determined through their Internet host names.

What is an Organization?

An organization is a commercial, academic, nonprofit, government, or military entity that connects users to the Internet. If the address is an unresolved IP address (four dotted decimal numbers), then the Class C address of the IP (the first three dotted decimal numbers) is used to represent the organization. This approximation is based upon the fact that most organizations directly connected to the Internet have their own Class C address.
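As a minimal sketch of this approximation, the first three dotted decimal numbers can be extracted as shown here. The function name `class_c_prefix` is hypothetical, and the validation step is an addition for safety rather than something the documentation describes.

```python
def class_c_prefix(ip_address):
    """Return the first three dotted decimal numbers of an IPv4 address."""
    octets = ip_address.split(".")
    valid = len(octets) == 4 and all(
        part.isdigit() and 0 <= int(part) <= 255 for part in octets
    )
    if not valid:
        raise ValueError(f"not a dotted decimal IPv4 address: {ip_address!r}")
    return ".".join(octets[:3])
```

Under this rule, two unresolved addresses such as 192.0.2.57 and 192.0.2.200 would be counted as the same organization.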

If the Internet address is resolved to a full Internet host name, then this host name is split at the periods. The geographic descriptor and organization type descriptor fields of the host name are included as part of the organization domain. The Import module knows how to interpret the Internet host names of more than 200 top-level Internet domains. For example, if the host name is www.interse.ac.uk, the organization domain is interse.ac.uk; but if it is www.interse.com, the organization domain is interse.com.
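The two host names in the example can be reproduced with a rough sketch. The lookup sets below are small, hypothetical stand-ins for the Import module's table of more than 200 top-level domains, and the rule for deciding how many fields to keep is an assumption chosen to match the example, not the module's documented logic.

```python
# Hypothetical, deliberately incomplete lookup sets.
GENERIC_TLDS = {"com", "edu", "org", "net", "gov", "mil"}
ORG_TYPE_LABELS = {"ac", "co", "gov", "org", "edu", "net"}

def organization_domain(host_name):
    """Derive an organization domain from a resolved Internet host name."""
    labels = host_name.lower().rstrip(".").split(".")
    if len(labels) < 2:
        return ".".join(labels)
    top_level = labels[-1]
    # Geographic top-level domain preceded by an organization type field:
    # keep organization name + type descriptor + geographic descriptor.
    if (top_level not in GENERIC_TLDS
            and len(labels) >= 3
            and labels[-2] in ORG_TYPE_LABELS):
        keep = 3
    else:
        keep = 2
    return ".".join(labels[-keep:])
```

A real implementation would need a complete, maintained table of top-level domains and their second-level conventions.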

Using the Whois query, Usage Import groups together all domains registered to the same entity as one organization. Each domain not found by the Whois query is treated as its own organization.


© 1996-1997 Microsoft Corporation. All rights reserved.