Site Server Express - Usage Import

Usage Import reads your Internet server log files and reconstructs, within the database, the actual requests, visits, users, and organizations that have interacted with your sites. To relate the log files to the logical organization of your sites, Microsoft Site Server Express Analysis uses two principal components of Usage Import: the Log File Manager, which organizes, filters, and performs the actual importation of your logs for analysis; and the Server Manager, which sets up the site structure for which the logs are imported. (The Log File Manager and Server Manager are discussed fully below.)

The Inference Algorithms

Because Internet protocols are stateless (that is, there is no sustained connection between client and server), the Internet server log files contain no definitive information regarding visits or users. Consequently, visit and user information must be inferred (statistically approximated) from the data in the log files and the logical structure of your site, which you specify with the Server Manager.

What is a Hit?

A hit is a line in a log file. Hits include:

  • Requests for content

  • Errors

  • Blank lines in the log files

  • Internet communication overhead

You can count hits simply by counting lines in a log file. But since that count is unrelated to content or user behavior, it's impossible to extrapolate meaningful information simply by comparing hit counts. For example, a page with four inline images counts as five hits when visited just once. A page with no images counts as a single hit for a single visit. Comparison between the two in no way reflects level of usage.

What is a Request?

A request is any connection to an Internet site (a hit) that successfully retrieves content. You may be familiar with similar terminology that refers to page "views" or "impressions." Usage Import reads every hit in the log file, but only copies requests into the Usage Analyst database.

In order for a hit to qualify as a request, the HTTP response code in the web server log file must be 200 or 304. Ad clicks (HTTP response code 302) are also imported into the database if the file name matches one of the advertising paths specified in site properties. (See "Site Properties: Advertising" in Chapter 5.)
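The qualifying rule can be sketched as follows. This is an illustration only; the advertising path below is a made-up example of the kind configured under Site Properties: Advertising.

```python
# Sketch of the rule above: status 200 or 304 always qualifies as a
# request; 302 qualifies only when the file name matches a configured
# advertising path. AD_PATHS is a hypothetical example value.
AD_PATHS = ("/ads/",)

def is_request(status, filename):
    if status in (200, 304):
        return True
    if status == 302:
        return any(filename.startswith(path) for path in AD_PATHS)
    return False
```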

Note Request counts are conservative because browser software and many Internet gateways intercept some requests before they reach the server, and these cached requests are never logged. Usage Import compensates by means of its inference algorithms. (See "Site Properties: Inferences" in Chapter 5.)

What is a Visit?

A visit is a series of consecutive requests from a user to an Internet site. To assign requests to visits, the Import module sorts all requests in the log file based upon the properties that differentiate visits from one another. These properties include:

  • Internet address

  • User name

  • User agent

  • Cookie

After the sort, the visit algorithm uses two methods to discern individual visits:

  1. If your extended log files include referrer data, then new visits begin with referring links external to your Internet site.

  2. Regardless of whether you have referrer data, if a user doesn't make a request for 30 minutes, the previous series of requests from that user is considered a completed visit. (The timeout duration can be adjusted by the user in the Inferences panel of Site properties. See Chapter 5 for more information.)
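The two rules can be sketched as follows, assuming one user's requests are already sorted by time and that a referrer containing one of the site's own host names counts as internal:

```python
from datetime import datetime, timedelta

# Sketch of the two rules, for one user's time-ordered requests given
# as (timestamp, referrer) pairs. The 30-minute default timeout is
# adjustable in the real product (Inferences panel, Site properties).
TIMEOUT = timedelta(minutes=30)

def split_visits(requests, internal_hosts):
    visits, current = [], []
    for timestamp, referrer in requests:
        external = bool(referrer) and not any(
            host in referrer for host in internal_hosts)
        timed_out = bool(current) and timestamp - current[-1][0] > TIMEOUT
        if current and (external or timed_out):
            visits.append(current)  # rule 1 or rule 2 closes the visit
            current = []
        current.append((timestamp, referrer))
    if current:
        visits.append(current)
    return visits
```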

What is a User?

A user is anyone who visits the site at least once. Site Server Express Analysis has three ways to recognize unique users. If your extended log files contain persistent cookie data, the software uses this data to recognize unique users. If no cookie data is available, the software uses a registered user name to recognize users. If no registration information is available, the software uses, as a last resort, users' Internet host names.

Cookies are the best way to uniquely identify users. The use of cookies before registered user names within the user algorithm makes it possible to tie together both the unregistered and registered portions of a visit to the same user. (Server extensions to implement cookie distribution are available at https://msdn.microsoft.com/library/tools/fpage/install.htm .)

Many organizations use Internet gateways, which mask the real Internet host names, so user counts may be conservative for those users determined through their Internet host names.
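The three-tier fallback might be sketched like this:

```python
# Sketch of the three-tier user-recognition fallback: cookie first,
# then registered user name, then Internet host name. A "-" stands
# for the blank user-name field in common-format logs.
def identify_user(cookie, user_name, host_name):
    if cookie:
        return ("cookie", cookie)
    if user_name and user_name != "-":
        return ("user name", user_name)
    return ("host name", host_name)
```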

What is an Organization?

An organization is a commercial, academic, nonprofit, government, or military entity that connects users to the Internet. If the address is an unresolved IP address (four dotted decimal numbers), then the Class C address of the IP (the first three dotted decimal numbers) is used to represent the organization. This approximation is based upon the fact that most organizations directly connected to the Internet have their own Class C address.

If the Internet address is resolved to a full Internet host name, then this host name is parsed on the decimals. The geographic descriptor and organization type descriptor fields of the host name are included as part of the organization domain. The Import module knows how to interpret the Internet host names of more than 200 top-level Internet domains. For example, if the host name is www.interse.ac.uk, the organization domain is interse.ac.uk; but if it is www.interse.com, the organization domain is interse.com.
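A rough sketch of this inference follows; the set of two-level top-level domains below is a tiny assumed sample, whereas the Import module itself knows more than 200 top-level domains:

```python
import re

# Tiny assumed sample of top-level domains that take a second-level
# descriptor; the Import module knows more than 200 domains.
TWO_LEVEL_TLDS = {"uk", "au", "jp"}

def organization(address):
    if re.fullmatch(r"\d+\.\d+\.\d+\.\d+", address):
        # Unresolved IP: the Class C address (first three octets)
        # stands in for the organization.
        return ".".join(address.split(".")[:3])
    parts = address.split(".")
    # Keep three labels when the host name ends in a descriptor
    # domain (interse.ac.uk), otherwise two (interse.com).
    keep = 3 if parts[-1] in TWO_LEVEL_TLDS else 2
    return ".".join(parts[-keep:])
```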

Using the Whois query, Usage Import groups together all domains registered to the same entity as one organization. For domains not found by the Whois query, each entity is designated by an individual organization.

Internet Server Log Files

The Import module reads server log files one line at a time. Each line is then parsed (separated) into the data fields shown in the following table.

Internet address: The Internet address (Internet IP or resolved Internet host name) from which the request came, and to which the server sent the response.

Time stamp: The date and time the server responded to the request.

File name: The content file name, or URL, that was sent back to the Internet address.

User name: The user name used to log in to a site requiring registration.

Size: The size of the response in bytes. This number is used to calculate bandwidth usage.

User agent: The product name, product version, operating system, and security scheme of the web browser used for the request.

Referrer: The referring URL of the current request (the page containing the link the user clicked).

Cookie: A persistent identification code assigned to the user, which allows you to track the user across several visits.

HTTP response code: The response code (200, 304, or 302) associated with a request.

Site type: The type of Internet site (web, gopher, FTP). This field is found in log files that are shared by multiple servers.

Server IP: The IP address of the individual server. This field is necessary to distinguish servers for a multihomed log data source.

Supported Log File Formats

Internet server log files must be in one of the formats shown in the following table for the Import module to interpret the log file data correctly.

Site type

Supported log file formats

World Wide Web

Common log file format
EMWACS log file format
Usage Analyst extended log file format
Intersé Market Focus 1 database
Intersé Market Focus 2 database
MCI log file format
Microsoft IIS standard log file format
Microsoft IIS extended log file format
Microsoft IIS hyperextended log file format
Microsoft IIS ODBC log file format
NCSA combined log file format
NCSA combined w/ servername log file format
NCSA w/ servername log file format
Netscape Proxy extended logging format
Open Market extended log file format
O'Reilly Multihome common log file format 1.0
O'Reilly Windows log file format 1.0
SiteTrack log file format
Spry Web Server ASCII log file format
Spry Web Server ODBC log file format
Universal log file format
UUNET extended log file format
WebFacts audit log file format
WebStar log file format
Zeus common log file format

FTP

WU Archive FTP log file format
Microsoft Internet Server log file format

Gopher

Microsoft Internet Server log file format

Real time stream

Real Audio log file format

Common Log File Format

This log file format includes:

  • Internet address

  • User name

  • Time stamp

  • Filename

  • File size

Here's a sample line from the common log file:

www.interse.com - bob [08/Aug/1995:06:00:00 -0800] "GET /analyst/ 
HTTP/1.0" 200 1067

 

Note Any text after the file size on a log file line is ignored by the Import module.
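For illustration, the sample line above can be parsed with a pattern such as the following; the group names are our own, not part of the format:

```python
import re

# A sketch of parsing one common-log-format line. Anything after the
# file size would be ignored, as the note above says.
CLF = re.compile(
    r'(?P<address>\S+) \S+ (?P<user>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-)'
)

line = ('www.interse.com - bob [08/Aug/1995:06:00:00 -0800] '
        '"GET /analyst/ HTTP/1.0" 200 1067')
m = CLF.match(line)
```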

EMWACS Log File Format

This log file format is produced by the EMWACS web server. EMWACS is a public domain web server for Windows NT.

Here's a sample line from an EMWACS log file:

Mon Aug 07 08:54:39 1995 204.86.26.20 157.54.17.9 GET /analyst/ 
HTTP/1.0

 

Extended Log File Format

The extended log file format is simply the common log format with three additional fields:

  • Referrer URL

  • User agent (browser)

  • Cookie information

Note This format was designed to be compatible with many of the custom extended log file formats currently in use.

Not all fields are required to comply with the format. Any of the following would be a valid extended log file entry:

  • all common log format strings

  • all common log format and referrer strings

  • all common log format, referrer, and user agent strings

  • all common log format, referrer, user agent, and cookie strings

The syntax of the extended log file format is:
Common Log Format ["referrer/-" ["user agent" ["cookie"]]]

Quotes around the referrer string are optional, but recommended. If quotes start the referrer string, quotes must end it. If quotes aren't used and the referrer is blank, then a hyphen is accepted as a blank referrer string.

User agent strings must be surrounded by quotes. A blank user agent string is represented by "".

Cookie strings must be surrounded by quotes. A blank cookie string is represented by "".
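These quoting rules might be applied to the text that follows the file size with a sketch like this:

```python
import re

# Sketch of the extended-field rules above: referrer may be quoted or
# a bare hyphen; user agent and cookie must be quoted ("" when blank).
# Returns None when no extended fields are present.
EXT = re.compile(r'^\s*(?:"(?P<qref>[^"]*)"|(?P<ref>\S+))'
                 r'(?:\s+"(?P<agent>[^"]*)"(?:\s+"(?P<cookie>[^"]*)")?)?\s*$')

def parse_extended(tail):
    m = EXT.match(tail)
    if not m:
        return None
    ref = m.group("qref") if m.group("qref") is not None else m.group("ref")
    return (None if ref == "-" else ref, m.group("agent"), m.group("cookie"))
```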

Here's a sample line from the extended log with all three extended fields:

www.interse.com - bob [08/Aug/1995:06:00:00 -0800] "GET /analyst/ 
HTTP/1.0" 200 1067 "https://www.infoseek.com?qt=Interse" "Mozilla 
2.0b4 Windows 32-bit" "INTERSE=12345678910"

 

Note Logging modules are available for Microsoft IIS, Apache, and Netscape to add user agent, referrer, and cookie data to your log files within the extended log file format. For more information, see https://msdn.microsoft.com/library/tools/fpage/install.htm 

Microsoft IIS Standard Log File Format

This log file format includes usage information for web, gopher, and FTP servers that run under the Microsoft Internet Information Server.

Here's a sample line from a Microsoft IIS standard log file:

www.interse.com, -, 8/7/95, 8:54:39, W3SVC, WWW,157.54.17.9, 
490, 232, 4401, 200, 0,GET, /analyst/, -,

 

Microsoft IIS Extended Log File Format

This log file format includes usage information along with user agent and referrer data for web, gopher, and FTP servers that run under the Microsoft Internet Information Server.

Here's a sample line from a Microsoft IIS extended log file:

153.36.62.27, -, 3/1/96, 0:00:00, W3SVC, WWW, 198.105.232.4, 
3565, 245, 2357, 200, 0, GET, /MSOffice/Images/button5a.gif, 
Mozilla/1.22 (compatible; MSIE 2.0; Windows 95), 
https://www.microsoft.com/msoffice/, -,

 

NCSA Combined Log File Format

NCSA Version 1.5 includes built-in support for a combined log file format that includes user agent and referrer information. Note that the combined log file format of NCSA Version 1.5 is encompassed by the definition of the extended log file format. For the configuration options required to include this information in your log files, see https://hoohoo.ncsa.uiuc.edu/docs/setup/httpd/TransferLog.html .

Here's a sample line from an NCSA combined log file format:

tomato.interse.com - - [19/Sep/1995:15:19:07 -0500] 
"GET /images/icon.gif HTTP/1.0" 200 1656 
"https://aboutus/" "NCSA_Mosaic/2.7b1 (X11;IRIX 5.3 IP22) 
libwww/2.12 modified"

 

NCSA Log File Format with Server Name

NCSA Version 1.5 includes support for a log file format that includes the server name in the log file. If you are using VirtualHost support, this will be the name of the VirtualHost. For more information, see https://hoohoo.ncsa.uiuc.edu/docs/setup/httpd/TransferLog.html .

This is a sample line from an NCSA log file format with server name:

tomato.interse.com - - [06/Oct/1995:13:51:23 -0500] 
"GET /beta-1.5/howto/fixes.html" 200 3296 www.interse.com

 

NCSA Combined Log File Format with Server Name

NCSA Version 1.5 includes support for a combined log file format that includes the server name, user agent, and referrer information. For more information, see https://hoohoo.ncsa.uiuc.edu/docs/setup/httpd/TransferLog.html .

tomato.interse.com - - [19/Sep/1995:15:19:07 -0500] "GET 
/images/icon.gif HTTP/1.0" 200 1656 www.interse.com "https://aboutus/" 
"Mozilla/1.22 (compatible; MSIE 2.0; Windows 95)"

 

Netscape Proxy Extended Logging Format

When imported as a proxy log, referrer information is ignored.

This is a sample of a Netscape proxy extended logging format:

127.0.0.1 - - [14/Aug/1996:05:00:01 -0700] "GET https://www.nytimes.com/
HTTP/1.0" 403 - "-" "Netscape-Proxy/2.0 (Batch update)" GET
https://www.nytimes.com/ - "HTTP/1.0" - - - - 141 168 - - - - - -

 

Open Market Extended Log File Format

This log file format is produced by the Open Market web server software.

This is a sample line from the file:

log {start 824480600.659060} {method GET} {url 
/animation/maxx/images/maxxhome.gif} {referrer 
https://mtv.com/animation/maxx/} {agent {Mozilla/2.0 (Win95; I)}}
{bytes 65745} {status 200} {end 824480601.397589} {host 198.147.4.29}

 

O'Reilly Windows Log File Format Version 1.0

This log file format is produced by the O'Reilly Web Site Version 1.0 web server software. For more information, see https://website.ora.com/techcenter/devcorner/white/winlog.htm .

This is a sample of that format:

05/18/96 00:41:33 user.interse.com olive.interse.com GET 
/ourproducts/reports/contentsummary.html 
https://olive.interse.com/ourproducts/reports/executivesummary.html 
Mozilla/2.01 (Win95; I) 200 23511 9313

 

SiteTrack Log File Format

This is the format produced by Group Cortex's SiteTrack server software. For more information, see https://www.cortex.net .

This is a sample of the SiteTrack format:

phx-az16-24.ix.netcom.com - - [07/08/1996:00:04:56] 
GET /content/resources/cgi/netscape.html HTTP/1.0 302 
264 Mozilla/2.02 (Win95; I) 
https://www.stars.com/Vlib/Providers/CGI.html 0836798696514377 
0836798696514377 0836798696514377 - 3 -

 

Spry Web Server ASCII Log File Format

This log file format is produced by the Spry Web Server software.

Here's a sample of the format:

GET,/mill/rock.gif, 200 OK,40740,02/05/1996 16:02:56 GMT,
02/05/1996 16:02:58 GMT,149.174.73.127,149.174.73.127,
ads-demo2.inhouse.compuserve.com,Mozilla/1.1 (Windows; U; 32bit),
,https://ads-demo2:2000/mill/rocks.htm,

 

UUNET Extended Log File Format

This log file format is produced for those sites being hosted by UUNET.

This is a sample of the format:

152.163.192.72 304 0 826693224 0.013 "" "GET / HTTP/1.0"
"https://frostbite.umd.edu/%7Ecass/music.html"
"IWENG/1.2.003 via proxy gateway CERN-HTTPD/2.0 libwww/2.17"

 

WebFacts Audit Log File Format

This is the format produced by the ABC WebFacts audit software.

Here is a sample of the format:

2 www.datamation.com 960526 85202 Cust57.Max5.Toronto.ON.MS.UU.NET 
/PlugIn/Images/pluginN.gif https://www.datamation.com/ 
PlugIn/homepage/index_foot.htmlMozilla/2.0 (compatible; MSIE 2.0B; Windows 95;1024,768 304 0 0 0

 

WebSTAR Log File Format

This log file format is produced by Quarterdeck's WebSTAR web server for the Macintosh.

Here's a sample line from a WebSTAR server log file:

08/07/95 08:54:39 OK 157.54.17.9 :public:real.gif 2557 

Note Fields in this format are configurable. Usage Import supports the default fields shown above, plus an extended format that includes referrer, user agent, and username data. In this extended format, all extended fields are optional.

Here's a sample line from an extended WebSTAR server log file:

Default fields referrer user agent username 

Zeus Log File Format

This format is produced by the commercial Zeus web server. (See www.zeus.co.uk.)

This format closely resembles the common log file (CLF) format, except that the date format is d/mmm/yyyy rather than dd/mmm/yyyy (that is, the 3rd day of the month is represented as 3 rather than 03).

For example:

CLF: 03/Jul/1996 
Zeus: 3/Jul/1996 
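A hypothetical helper to normalize Zeus dates to the two-digit CLF form:

```python
import re

# Hypothetical helper: pad a Zeus single-digit day to the two-digit
# form used by the common log file format.
def zeus_to_clf_date(date):
    return re.sub(r"^(\d)/", r"0\1/", date)
```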

WU Archive FTP Log File Format

The WU Archive FTP server, the most common UNIX FTP software, records incoming and outgoing file transfers in a log file, which, by default, is named XFERLOG.

Here's a sample line from a WU Archive FTP log file:

Sat Dec 16 04:48:30 1995 1 www.interse.com 124 /README a _ o a 
support@www.interse.com ftp 0 *

 

FTP: Microsoft Internet Information Server (IIS) Log File Format

For more information on this format, please refer to the section on the Microsoft IIS standard log file format.

Here's a sample line from a Microsoft Internet Information Server log file (FTP):

www.interse.com, -, 8/7/95, 8:54:39, FTPSVC, WWW,157.54.17.9, 490, 232, 4401, 200, 0,GET, /analyst/, -, 

Gopher: Microsoft Internet Information Server Log File Format

For more information on this format, please refer to the section on the Microsoft IIS standard log file format.

Here's a sample line from a Microsoft Internet Information Server log file (Gopher):

www.interse.com, -, 8/7/95, 8:54:39, GopherSVC, WWW,157.54.17.9, 490, 232, 4401, 200, 0,GET, /analyst/, -, 

Real Audio Log File Format

This format is the same as the common log file format, but the user agent string is appended immediately following the file size.

This is a sample of the Real Audio log file format:

www.interse.com - bob [08/Aug/1995:06:00:00 -0800] "GET /analyst/ 
HTTP/1.0" 200 1067 https://www.infoseek.com

 

The Server Manager

Use the Server Manager to configure each of your servers and sites, and the log data they produce, within the database. Before any data can be imported into the database, the servers and sites that produced that data must be configured in the Server Manager.

The Server Manager is designed to let you import even the most complex server environment into the database and then aggregate and filter the data within the Analysis module.

Configurations in the Server Manager are hierarchical, with three distinct levels:

Log data source A log data source produces log files in specific formats defined by commercial server software. (The complete list of supported log file types is shown in a figure later in this chapter.) An individual log file does not constitute a log data source. A log data source is an application that produces individual log files over time.

Note In the future, data might not be recorded in log files, so "log data source" is used to refer generically to the producer of information to be analyzed by Site Server Express Analysis.

When a log file is imported into the database, that log file is associated with the log data source that produced it.

Server Every log data source contains data from at least one server.

Within a Microsoft, NCSA, or O'Reilly environment, multiple servers can be logged to one log file. In these cases, there can be multiple servers per log data source.

Within an Apache or Netscape server environment, each server produces its own log file, thus there is one log data source per server.

Note The term server, as used here, does not refer to a physical hardware server but to a program that responds to a request from a user. A site has one software server for each type of content—HTTP, FTP, Gopher, or Real Audio—that it publishes.

Site A site is a collection of content. Sites can be replicated across several servers, and different components of a site can be spread across several servers. Every server handles the content from at least one site.

The Server Manager provides a list of all information currently configured within the database. (See the following figure.) Each log data source, server, or site in the Server Manager is given a default name when it is configured. This name can be changed by selecting the icon in the Server Manager tree and renaming the object. This name is used throughout Analysis to refer to the data associated with the log data source, server, or site. Use the Server Manager to add sources, servers, and sites and to edit and remove existing ones.

 

Server information is organized in a graphical hierarchy, with log data sources at the root (represented by a scroll), specific servers down one branch (represented by a spider for Web servers, a file cabinet for FTP servers, a rodent face for Gopher servers, and an ear for Real Audio), and individual sites (represented by a sphere) at the tip.

The Server Manager can be opened from Usage Import.

The Log Data Source

Internet server software records client connections in a log file. For the Usage Import to parse your log files correctly, you must specify the correct log file format. If you specify an incorrect log file format, you cannot import data into the database or, worse yet, you'll import incorrect data.

Usage Import also supports importing directly from an ODBC log database created by Microsoft and Spry Internet servers. These database imports are treated as a log file format.

The type of information recorded in your log file greatly affects the accuracy and flexibility of your analysis. Specifically, you should try to record referrer, user agent, and cookie information in your log files.

Note If you use Microsoft, Apache, Netscape, or O'Reilly web server software, server extensions are available at https://msdn.microsoft.com/library/tools/fpage/install.htm to assist your analysis.

Log File Format

The log file format for an individual log data source is set when the source is added. Clicking a log data source or Server icon in the Server Manager with the right mouse button opens the Log Data Source Properties window, shown in the following figure.

 

The Log Data Source Properties window can also be opened by selecting the log data source and clicking the Properties button on the Usage Import window toolbar, as shown below.

The Server Manager allows you to add only servers and sites of a type supported by the log data source. (For example, a Real Audio site cannot be added to a log type that supports only HTTP.) The same applies to multihomed sites; the Server Manager will not allow you to add multiple servers for formats that do not support them.

Finally, keep in mind the following restrictions.

  • You can only have one log data source configured at a time.

  • You cannot import log files into the database until the Internet site that created the log file has been configured in the database.

  • Deleting an Internet site causes the unrecoverable deletion of all requests, users, visits, organizations, ad clicks, and views associated with that site.

  • Internet site properties can be edited at any time. However, this can cause inconsistencies between data imported before and after the changes are recorded. The best strategy is to be sure of your site's properties and do any editing before performing imports.

The Server

In order for Analysis to function accurately, each server must be set up properly, and the appropriate Server type must be identified.

Configuring a Server

Configuring a server is done with the Server Properties window, which you open by right-clicking a Log data source (choose New server from the pop-up menu) or the Server icon (choose New server or Edit, as appropriate). You can also open this window by selecting the log data source or Server icon and clicking the Properties button on the toolbar.

The options for the Server properties are shown in this table:

Server type (required): The type of Internet site (World Wide Web, FTP, Gopher, Real Audio). This property defines the list of log file formats and also provides an essential grouping for aggregation during analysis.

Directory index (required): The file that the server returns when a directory is requested. (Both requests for directories and requests for directory index files are treated as requests for the directory in the database.) Defaults depend on the server software: typically INDEX.HTML, or INDEX.HTML and HOME.HTML for servers with multiple directory index files.

IP address and IP port (optional): Use this setting to:
• distinguish servers that are multihomed
• set the default for the Exclude hosts site property
• distinguish internal and external referrers used in the visit algorithm

Local time zone (required): This value establishes the default GMT offset of a server. Set this option to the time zone where your content is hosted. Selecting Adjust Time Zone in Import Options adjusts time calculations in relation to this setting, which may be changed to reflect the time zone for your analysis.

Local domain (required): The domain name used on the local network of the hardware hosting your content. This setting is used to resolve any incompletely resolved host names in your log files.

Server Types

Site Server Express Analysis can analyze four types of server:

  • World wide web

  • FTP

  • Gopher

  • Real Audio

Most log file formats support only one server type, and only that entry will be available on the Server Properties window. However, if you are using a log file format that supports multiple server types, such as Microsoft, then you will be able to make the appropriate selection for your sites.

Note The dimension ServerType can be used to filter and aggregate this property in Analysis. (See Chapter 6, "The Report Writer.")

Directory Index Files

This property should include the file name that the server returns when a directory is requested (that is, when the request ends with a /). During import, Usage Import aggregates all requests for directory index files as requests for the directory. For example, if you specify a directory index of INDEX.HTML, then a request for /INDEX.HTML and a request for / will both be recorded as / in the database. If your server has multiple directory index files configured, then the syntax of this property should be INDEX.HTML and HOME.HTML, the default for Netscape servers.
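The aggregation might be sketched like this, using the Netscape default pair mentioned above as the configured index files:

```python
# Sketch of directory-index aggregation. INDEX_FILES uses the Netscape
# default pair from the text; real values come from the server property.
INDEX_FILES = ("INDEX.HTML", "HOME.HTML")

def collapse_index(path):
    directory, _, leaf = path.rpartition("/")
    if leaf.upper() in INDEX_FILES:
        return directory + "/"
    return path
```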

IP Address and IP Port

This property has three uses:

  • If you use a Microsoft or Spry server and the server is multihomed (that is, multiple servers are being recorded within one log data source), then the Usage Import uses the server IP address and port to assign the hits in the log file to a specific server in the Server Manager. If a hit in the log file does not match the IP of a server, that hit is discarded during import as not matching a configured server.

  • The property is used to help set the default Exclude hosts site property. For example, the IP address for www.Interse.com is 206.86.22.20. If that value is used for this property, one of the default excluded host entries will be the class C address of 206.86.22.*. This filters out internal traffic from employees at the site.

  • The property is used to assist in calculating accurate visits to your sites. The visit algorithm uses referrer data to help infer visits to your web site. To do that, it needs to differentiate internal and external referrers. If you know the IP address, you can designate referrers from that IP address as internal.
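The second and third uses can be sketched as follows; the helper names are our own:

```python
# Hypothetical helpers for the second and third uses above.
def class_c_wildcard(ip):
    # Default Exclude-hosts entry: 206.86.22.20 -> 206.86.22.*
    return ".".join(ip.split(".")[:3]) + ".*"

def is_internal_referrer(referrer, server_ip):
    # A referrer that carries the server's own IP is treated as internal.
    return bool(referrer) and server_ip in referrer
```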

Local Time Zone

This property should include the time zone of the hardware where your content is hosted. Choose Adjust Requests Timestamp During Import on the Import tab to determine the offset that is applied to every request from this log data source. For example, if your server is hosted in New York and you would like to analyze your data in California time, then you would set this property to GMT -05 Eastern and set the Import option to GMT -08 Pacific. When log files are imported into this log data source, Analysis subtracts three hours from all time stamps.
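The offset arithmetic in the example can be sketched as:

```python
from datetime import datetime, timedelta

# Sketch of the example above: server logs in GMT -05 (Eastern),
# analysis in GMT -08 (Pacific), so three hours are subtracted from
# each time stamp.
def adjust(timestamp, server_offset_hours, analysis_offset_hours):
    return timestamp + timedelta(hours=analysis_offset_hours - server_offset_hours)
```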

Local Domain

This property should include the domain name used on the local network of the hardware where your content is hosted.

  • If you use a hosting service, type the domain of your hosting provider. For example, if UUNet hosts your web server, type uu.net here.

  • If you host your site locally, then type your own organization's domain name.

The local domain name is used to fully resolve any partially resolved host names in your log files. For example, if a Microsoft employee at computer bubba.microsoft.com accessed www.microsoft.com, the log file would contain bubba. The local domain entry allows Usage Import to resolve this host name to bubba.microsoft.com. This will default to the local domain of the current computer.
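The resolution step might look like this simple sketch:

```python
# Sketch of local-domain resolution: a bare host name gains the local
# domain; IP addresses and fully qualified names pass through unchanged.
def resolve(host, local_domain):
    if "." in host:
        return host
    return host + "." + local_domain
```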

The Site

The site is the lowest level of configuration needed to carry out your analyses. You must make the structure of your sites explicit before their statistics can be calculated. In the Site Properties window, you provide the information that defines the sites whose activity you wish to analyze.

Configuring a Site

When you add or edit an Internet site, the Server Manager displays the Site Properties window. (See the following figure.) Add or delete a site by right-clicking the server icon from which it branches in the graphical tree. Edit or delete an existing site or add a new site for the same server by clicking a site icon. To edit, you can also highlight an icon and choose the Properties button.

Home Page URLs (Required)

This property specifies the URLs used to access this site. If your site has multiple URLs, then you should list all of them with the syntax https://www.yourcompany.com and https://yourcompany.com. The visit algorithm used during import compares the referring URLs of the hits in the log file to the host names of the home page URLs to determine if the referrer was external to the site.

Note The first URL listed is used to hyperlink to the site in your analysis reports.

Server File System Paths for this Site (Optional)

If the current site's content resides within a specific path on this server (for example, /thissite/* ) or within a collection of paths on this server (for example, /filepath1/* and /filepath2/*), type those paths here.

A blank entry in this option designates this site as the default site on the server. All files not assigned to another site on this server are assigned to the default site. If there are no sites on the server with a blank entry for this property, then any request that does not match one of the set file paths will be discarded during import as not belonging to a configured site.

Internal Hosts to Exclude from Import (Optional)

You can specify the host names or IP addresses whose requests you want to exclude from the database. For Internet sites, this entry is typically used to exclude requests from employees and testing software. However, you can specify any host here. Specify the complete Class C Internet address and domain name such as *.yourcompany.com and 206.86.22.* in case some IP addresses are not successfully resolved in the log file.

Note Requests associated with these host names will not be available for detailed usage analysis. The hits associated with these host names are, however, counted in the aggregate hits and bandwidth statistics.
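Matching against these wildcard entries can be sketched with Python's fnmatch; the entries below repeat the examples above:

```python
from fnmatch import fnmatch

# The entries repeat the examples from the text; any host matching
# either pattern would be excluded from detailed import.
EXCLUDED = ("*.yourcompany.com", "206.86.22.*")

def is_excluded(host):
    return any(fnmatch(host, pattern) for pattern in EXCLUDED)
```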

Inline Images to Exclude from Import (Optional)

Specify the file names you want to exclude from the database. This is typically used to prevent requests for inline images (that is, the decorative images on a page) from being imported into the database.

Note Bandwidth calculations are just as accurate and useful if inline images are excluded. The more you exclude, the faster the import process, the smaller the database, and the faster the analysis process. Excluding images will not exclude advertising views.

Site Properties: The Inferences Tab

Usage Import applies inferences to calculate requests, visits, users, and organizations from the hits in the log file. These inferences make assumptions which can be adjusted here.

Insert Missing Referrers into Clickstream (Optional)

This feature helps compensate for caching on the networks connecting your users and your servers.

Consider a situation where a user traces the following path:

Step   Page
1      Page A
2      Page B
3      Page A (cached)
4      Page C

In this situation, the server log file will record these hits:

Request   Referrer
Page A    (none)
Page B    Page A
Page C    Page A

Without referrer inferences, Usage Import will import one request each for Page A, Page B, and Page C, as above. This is the default.

If Insert Missing Referrers is selected, Usage Import reconstructs the following clickstream:

Request   Referrer
Page A    (none)
Page B    Page A
Page A    (none)
Page C    Page A

Caching typically flattens out the request profile of a site, as shown in the following figure. This feature helps determine the true request profile of your site.

Impact of inserting missing referrers on request statistics

Note The Insert Missing Referrers feature works only if your log files include referrer data.
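A minimal sketch of this referrer inference, using the example clickstream above (the function name is hypothetical, and the real algorithm also weighs visit boundaries):

```python
# Sketch of the missing-referrer inference: when a hit's referrer is not
# the previous page in the clickstream, an inferred (cached) request for
# the referrer page is inserted first.
def insert_missing_referrers(hits):
    """hits: list of (page, referrer) tuples in log order."""
    clickstream = []
    previous_page = None
    for page, referrer in hits:
        if referrer and referrer != previous_page:
            clickstream.append((referrer, None))  # inferred cached request
        clickstream.append((page, referrer))
        previous_page = page
    return clickstream

log = [("Page A", None), ("Page B", "Page A"), ("Page C", "Page A")]
for page, referrer in insert_missing_referrers(log):
    print(page, referrer)
# Reconstructs: Page A, Page B, Page A (inferred), Page C
```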

Visit Timeout (Required)

Usage Import infers visits based upon, among other things, a timeout: the length of time after which any visit is considered closed. Some arbitrary time limit is required to define a visit; otherwise, every visit would remain open indefinitely. You can choose an appropriate setting here.

The Internet advertising industry has agreed on a timeout of 30 minutes for standardized reporting, so 30 minutes is the default value. However, this value has a tremendous impact upon your analysis results, so any refinement of this number that you can provide based upon empirical experience will improve accuracy. For example, you might assume that visitors spend much less than 30 minutes at a navigation site, so you might set this value to 10 minutes. At the other end of the spectrum, visitors might spend much more than 30 minutes at a customer support site, so this value might be set as high as 120 minutes.
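The visit inference described here is essentially sessionization by timeout; a minimal sketch, assuming the hits for a single user are already sorted by time:

```python
# Sketch of visit inference by timeout: consecutive requests from the
# same user separated by more than the timeout start a new visit.
# The function name and 30-minute default mirror the text above.
from datetime import datetime, timedelta

def group_into_visits(timestamps, timeout=timedelta(minutes=30)):
    """timestamps: sorted datetimes for one user; returns a list of visits."""
    visits = []
    for ts in timestamps:
        if visits and ts - visits[-1][-1] <= timeout:
            visits[-1].append(ts)   # within the timeout: same visit
        else:
            visits.append([ts])     # timeout exceeded: new visit begins
    return visits

hits = [datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 9, 10),
        datetime(2024, 1, 1, 10, 0)]
print(len(group_into_visits(hits)))  # 2: the 50-minute gap splits the visits
```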

Multiple Users Use the Same User Name (Optional)

Usage Import infers users based upon, among other things, the user name recorded in the log file.

If your site has a section where many people log in under a single user name (for example, as guest or evaluator), the inference algorithms normally identify all of them as a single user. Selecting this option causes Usage Import to stop identifying users by user name and instead attempt to assign unique user IDs based on other information.

Note Distributing persistent cookies circumvents the problem entirely, because each user is permanently and uniquely identified without resorting to user name.

Site Properties: The Query Strings Tab

A file name requested from a web server has several components. Consider a request for:

/cgi-bin/getquote?symbols=interse+microsoft&display=table&alpha=beta#top:1000

Here "/cgi-bin/getquote" is the URL, "symbols=interse+microsoft&display=table&alpha=beta" is the query string, "#top" is a fragment, and ":1000" is a parameter.

During import, Usage Import removes all fragments and parameters from file names. It also separates query strings from file names and can optionally store them in the database with the individual requests to your sites, so you can analyze requests, visits, and users according to a particular query string. To take advantage of this feature, your query strings must be formatted as name=value pairs.

File System Paths Whose Query Strings Should be Stored (Optional)

You need to specify which query strings to retain by indicating their file paths. Typically, you are interested in the information from only a subset of all your file names with associated query strings. For example, if all of the CGI scripts you are interested in parsing are stored in /cgi/I_care_about_these/, then you would type /cgi/I_care_about_these/* in this box. Type multiple paths separated by and. The standard wildcard operators also apply. (For example, type /* to store all query strings.)
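The query-string separation described above can be sketched as follows; the function name is hypothetical, and the real parser's behavior may differ in edge cases:

```python
# Hypothetical sketch of decomposing a logged file name: the fragment
# (#...) and parameter (:...) are dropped, and the query string is split
# off and parsed into name=value pairs.
def split_file_name(raw):
    path, _, query = raw.partition("?")
    for sep in ("#", ":"):                  # strip fragment and parameter
        query = query.split(sep, 1)[0]
    pairs = dict(p.split("=", 1) for p in query.split("&") if "=" in p)
    return path, pairs

path, pairs = split_file_name(
    "/cgi-bin/getquote?symbols=interse+microsoft&display=table&alpha=beta#top:1000")
print(path)   # /cgi-bin/getquote
print(pairs)  # {'symbols': 'interse+microsoft', 'display': 'table', 'alpha': 'beta'}
```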

Site Properties: The File Names Tab

Remove Top-Level Directory From File System Paths (Optional)

Some hosting services add a customer name as a directory name before file names. If you select this option, then the first level directory is removed from each file name string before the string is recorded in the database.

File System Paths to Apply Regular Expressions To (Optional)

Regular expressions are a powerful search-and-replace facility familiar from UNIX and Perl. In certain situations, the file names recorded in your logs are not exactly as you want them recorded in the database. To change them, you can use the regular expression search-and-replace function.

In the path property, provide the file system paths that need a search-and-replace correction applied to them. For example, to change the path "/analyst/*" to "/ourproducts/*", type /analyst/* for this property. For multiple paths, use the syntax /analyst/* and /otherpath/*.

Regular Expressions to Apply to File Names (Optional)

Type the actual regular expression that will be applied, with the following syntax:

s/string1/string2/

where:

s = search-replace
string1 = string to search for
string2 = replacement string

In the example given above, to replace "/analyst/*" with "/ourproducts/*", type:

s/\/analyst\//\/ourproducts\//

To apply two regular expressions, separate them with and, for example:

s/\/analyst\//\/ourproducts\// and s/\/beta\//\/release\//

The backslash ( \ ) is an escape character for this command, which allows the next forward slash ( / ) to be interpreted as the directory-hierarchy divider rather than as the regular expression's own separator character. The backslash is required for including any of the regular expression's special characters literally in a search or replacement string. Without the backslash, the special characters have the following wildcard meanings in regular expressions:

Character   Meaning
^           Matches the beginning of a string
$           Matches the end of a string
.           Matches any character
[ ]         Character class, or the complement of a character class if the first character inside the brackets is a caret ( ^ )
*           Repeat previous, zero or more times
+           Repeat previous, one or more times
?           Repeat previous, zero or one time
\           Escape next character (treat the next special character as a literal to be included in the string rather than as a wildcard)
{ }         Tagged match (Note: Usage for tagged matches is extremely complex, and the function is recommended only for expert users of UNIX or Perl. It is not otherwise supported by Usage Analyst 2.0.)

The following examples illustrate how to use the wildcard characters:

Pattern               Matches
^stuff                strings that start with "stuff"
stuff$                strings that end with "stuff"
^...$                 any 3-character string
[AEIOU]               any uppercase vowel
[0-9]                 any digit
[A-Za-z][0-9]         any letter followed by any digit
[^0-9]                any character except a digit
[A-Z][0-9]*           any uppercase letter optionally followed by any number of digits
[A-Z][0-9]+           any uppercase letter followed by at least one digit
[A-Z][0-9]?           any uppercase letter optionally followed by one digit
[+-]?[0-9]+           any integer optionally preceded by a sign
[+-]?[0-9]+\.?[0-9]*  any real number

Note All regular expressions listed in this property are applied to all paths listed in the previous property.
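The combined effect of the two properties can be sketched as follows: each listed s/old/new/ expression is applied to every file name matching one of the configured paths. The parsing of the s/// syntax shown here is an assumption for illustration:

```python
# Sketch of the search-and-replace pass. The \/ escaping mirrors the
# syntax shown above; the PATHS and EXPRESSIONS lists are hypothetical
# examples of the two site properties.
import re
from fnmatch import fnmatch

PATHS = ["/analyst/*"]
EXPRESSIONS = [r"s/\/analyst\//\/ourproducts\//"]

def apply_expressions(filename):
    if not any(fnmatch(filename, p) for p in PATHS):
        return filename                      # path not listed: unchanged
    for expr in EXPRESSIONS:
        # split "s/pattern/replacement/" on unescaped slashes
        _, pattern, replacement, _ = re.split(r"(?<!\\)/", expr)
        filename = re.sub(pattern, replacement.replace("\\/", "/"), filename)
    return filename

print(apply_expressions("/analyst/report.htm"))  # /ourproducts/report.htm
```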

Usage Import Options

The Options window, available from the Tools menu, allows you to configure several settings for Usage Import. The window comprises eight tabbed groups of choices, as shown in the following figure. The options are stored in the Windows registry, and you can save them as default settings for all future imports; otherwise, any changes you make are in effect only while you are working, and when you quit and restart Usage Import, the default options are restored.

Configuring Usage Import Options

The Usage Import Option configuration settings are explained in the following table.

Configuration setting

Explanation

Drop database indexes

Analysis requires database indexes; however, import is much slower when the database has indexes. Therefore, by default, import drops all database indexes and analysis adds them back before beginning. Once you have accumulated a large amount of usage data in the database, so that each import represents only a small percentage of the data, turn this option off, because adding indexes to the large database takes longer than the incremental time saved during import.

Adjust request timestamps

If this option is turned on, all time stamps in your log files are adjusted to the selected time zone, from the time zone specified for that site in the Internet site manager. This is useful if you have sites in multiple time zones.

Exclude spiders

Checking this option avoids counting hits by Internet search engines, robots, and any other user agents specified on the Spider list tab. Note: This is the only way to exclude user agents.

Lookup HTML titles

When you enable this option, the Import module performs HTML title lookups on new HTML files added to the database during the log file import. You can perform the same operation manually from the Tools menu.

Resolve IP addresses

When you enable this option, the Import module tries to resolve every unresolved IP address it encounters in the log file.

Whois query for unknown domains

This option instructs Usage Import to perform a Whois query when the organization name is not known. You can perform the same operation manually from the Tools menu.

Configuring IP Resolution

IP resolution settings are listed in the following table. For a more complete explanation of IP resolution and how it is handled by the Import module, see "Resolve IP Addresses."

Configuration setting

Explanation

IP resolution cache period

This setting specifies how long an IP lookup remains valid before the operation is repeated. The Import module remembers each IP/host-name combination for the number of days specified. During the cache period, Usage Import automatically converts all references to that IP address to the resolved host name. After the cache period, Import retries resolution. Longer cache settings speed import and resolution but may miss intervening changes to IP/host-name combinations.

Timeout

Timeout establishes how long Usage Import will search before it enters an IP address as unresolved. Setting a higher timeout gives more complete resolution but slows completion of import. You can run IP resolution manually from the Tools menu.

Batch resolution size

Specifying the batch size for IP resolution allows you to optimize operation of Usage Import to your DNS server. If your server supports a larger number of simultaneous requests, you can increase the setting over the default of 300 for improved performance. Too large a number may crash your DNS server and cause Report Writer to report an artificially large number of unresolved addresses.

Configuring Log File Overlaps

Having time periods overlap in your log files introduces inaccuracies in your database. A number of scenarios can produce time overlaps in log file entries: running logs on separate servers, interrupting and resuming logging on a single server, accidentally re-importing an individual log file, or concatenating distinct log files. The Log File Overlaps window allows you to specify how to treat such redundancies. See the following table for panel settings.

Note Concatenation of log files makes tracking of overlap extremely difficult. It should not be done.

Configuration setting

Explanation

Overlap period

This setting allows you to specify the period to be considered an overlap by the import module. Shorter periods will reduce apparent overlap but may affect accuracy of later analysis for the period in question.

Action on overlap detection

Import all records: ignores the overlap entirely and includes all redundant records in the database (default).
Stop the import: halts the import for the log file in question only and continues any other imports underway.
Stop all imports: halts the current import of all log files.
Discard records and proceed: discards the overlapping records and continues the import.

The adjustable "grace period" in Usage Import makes allowance for variations in the methods of web servers in logging requests. Some servers log a request at the time it's received. Others record the time stamp at the time of the request but don't log it until the transaction is complete. Depending on your individual system, you may want more or less tolerance for such situations, which can produce apparent overlap. For example, if you have an FTP site where users routinely make very large file downloads that take hours to complete, increase the period for download overlaps or use the default "ignore overlaps" option.
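Overlap detection with a grace period can be sketched as an interval comparison: a new log file conflicts with a previous import only if the two time ranges overlap by more than the grace period. The names and the five-minute grace value below are assumptions:

```python
# Sketch of overlap detection between a new log file's time range and
# previously imported ranges, with an adjustable grace period.
from datetime import datetime, timedelta

def overlaps(new_range, imported_ranges, grace=timedelta(minutes=5)):
    """Each range is a (first_request, last_request) datetime pair."""
    new_start, new_end = new_range
    for start, end in imported_ranges:
        shared = min(end, new_end) - max(start, new_start)
        if shared > grace:       # overlap beyond the grace period
            return True
    return False

done = [(datetime(2024, 1, 1, 0, 0), datetime(2024, 1, 1, 12, 0))]
print(overlaps((datetime(2024, 1, 1, 11, 58), datetime(2024, 1, 2, 0, 0)), done))  # False: within grace
print(overlaps((datetime(2024, 1, 1, 8, 0), datetime(2024, 1, 2, 0, 0)), done))    # True
```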

If an import is stopped because of overlaps, the Import Statistics window reports the result, as shown in the following figure.

The Log File Manager in the Import module now makes it possible to correct mistakes by deleting imports selectively. (For more information, see "Deleting Log Files" in Chapter 5.)

Directory Options

The Directory Options tab allows you to specify a default directory for import files and log files.

The settings here affect:

  • The Browse button next to the log file path on the Log file manager. The default directory in the file browser is the directory specified for the default log file directory.

  • Typing a log file name in the entry box of the Log file manager without giving a full directory path. Import looks for the file in the directory set as the default.

  • Command line operation with no directory specified.

Configuring IP Servers

On the IP Server window, Usage Import asks you to specify the servers and domain required for the Internet-connected functions of the program (lookups, mail, and IP resolution). These settings are explained in the following table.

Configuration setting

Explanation

HTTP proxy server

If you specify a proxy server host name and port, the Import module uses this address for all HTML title lookups. If you are unsure of this information, check with your system administrator.
Note: If your proxy requires a user name and password, specify the proxy host name as:
username:password@hostname

SMTP server

Use this setting if you plan to distribute analysis reports via email using the MAIL.EXE utility.

Local domain of DNS server

Used to fully qualify host names returned from IP resolution. Defaults to the local domain of the computer; if your DNS server is maintained by an ISP, enter that domain here.

Identifying Spiders

If the Exclude Spiders box is checked on the Import Options window, the Spider List shown below allows you to identify engines for which log entries will be removed.

The entries in the box are common user-agent strings for spiders. If for any reason you want to exclude any other user agents, you can specify them here.

Intranet Organization Definition

For some large or complex site and log structures (for example, ISPs or large intranets with many subdomains), it may be useful to define organizations further down a domain tree than the default assignment of the organization to the Internet domain. The Intranet organization panel allows you to set the number of levels beyond the domain for Usage Import and Report Writer to recognize as distinct organizations.

If you make the Intranet setting one domain part beyond the organization, you will have three-part, two-dotted organization names. Two levels beyond, and Usage Import will define organizations with four-part names in the database.

Zero domain parts beyond Internet organization:
  company.com

One domain part beyond Internet organization:
  ca.company.com
  ny.company.com

Two domain parts beyond Internet organization:
  marketing.ca.company.com
  engineering.ca.company.com
  marketing.ny.company.com
  engineering.ny.company.com
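The truncation shown above can be sketched as keeping the two-part Internet domain plus the configured number of extra levels (the function name is hypothetical):

```python
# Sketch of organization-name truncation: keep the two-part Internet
# domain plus N extra levels, per the Intranet setting.
def organization(host, levels_beyond=0):
    """Truncate a host name to the Internet domain plus extra levels."""
    parts = host.lower().split(".")
    keep = 2 + levels_beyond          # two-part Internet domain + extras
    return ".".join(parts[-keep:])

print(organization("engineering.ca.company.com", 0))  # company.com
print(organization("engineering.ca.company.com", 1))  # ca.company.com
print(organization("engineering.ca.company.com", 2))  # engineering.ca.company.com
```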

Log File Rotation

Because log file rotation requires an arbitrary cutoff of data produced at your sites, there will inevitably be visits that are interrupted. For these visits, information will be divided between the end of one log file and the beginning of the next. Usage Import gives you a number of options for handling this situation.

Options in the At The End Of An Import list box are explained in the following table.

Configuration setting

Explanation

Commit open visits to database

If you routinely commit open visits to the database, there will be a small exaggeration of statistics at the opening and closing of the log file period, because those visits will be counted twice.

Discard open visits

Discarding open visits will under-report visits at the ends of the log files, because those visits will be dropped.

Store open visits for next import

Storing open visits for the next import is the most accurate alternative, because it reconstructs the actual visit as if there were one seamless log. There is a small cost in speed as the open visits must be called up from the cache at each new import.

Clear open visits cache

Clearing the cache of open visits produces a clean slate for the new import. This option is particularly useful if you ordinarily store open visits but occasionally want to discard them.
Note: The cache is maintained per site per database. Clearing the cache clears all of them.

The Log File Manager

The Log File Manager is used for managing all of the logs imported into your database. The window contains a scrolling list of information about logs already imported as well as controls for importing and deleting logs.

Information in the Log File Manager is organized in the following columns:

  • Action

  • Log data source

  • Description

  • Import size (MB)

  • First request

  • Last request

  • Number of requests

  • Start time

  • Who

Click any column heading to sort the data on that column. To resize the columns, drag the dividers in the header row.

Column Headings

The Action heading shows either Import or Delete, depending on the operation performed.

The log data source shows those sources (named in your configuration with the Server Manager) for which log files have been imported or deleted.

The Description shows the names of imported or deleted files and any filter string used. You can highlight, copy, and paste the filter string for reuse in another operation. This is particularly useful when you need to enter a complicated filter repeatedly.

New Import Controls

The Log Data Source drop-down list shows the names of sources configured with the Server Manager. Select the source for which you wish to import the log file.

In the Log Location drop-down list, select where the import will come from. The default is a file. You can type the name and path for the file or use the Browse button to select the file.

Performing an Import

To perform an import, use the Browse button to select the log file or specify its full path in the Log Location text box on the main Import module panel.

Start Import

When you're ready to import your log files, choose the Start Import button.

The Import module displays a message after finishing an import. The message includes the total time required to complete the operation.

The Import Statistics Window

Statistics for the completed import are displayed in the Import Statistics window.

 

The import statistics give you an indication of any problems during import or incorrect configurations of your sites or import options. (For a full explanation of each item, see the following table.)

Note The underlying data for the Import Statistics window display is stored in the Log File Manager and the database, but the presentation as seen in the window is not saved. If you want to preserve this record of the import operation, copy and paste from the window into a text editor.

Message

Description

Requests imported

Number of requests successfully imported.

Ad views imported

Number of hits to files specified in the ad view file system paths of the site properties. Filters set in Exclude Images do not apply to ad views.

Ad clicks imported

Number of redirect clicks (HTTP response code 302) to ad sites specified in the ad click file system path.

Hits couldn't be parsed

Number of entries in the log file that didn't match the expected format and so couldn't be imported. If no hits are imported and the number of entries that cannot be parsed is greater than 1,000, you are warned to make sure the log file format of your import matches that of your log data source.

Missing referrers requests were inserted

Number of requests missing in the log file data due to caching and reinserted by means of inference algorithms.

Hits were client errors

Number of hits that had a 400-level response code and weren't imported.

Hits were server errors

Number of hits that had a 500-level response code and weren't imported.

Hits were redirects

Number of hits that had a 300-level response code and weren't imported.

Hits weren't from the correct Internet/Intranet protocol

Some log file formats include requests from multiple site types within one log file. If a request is encountered whose site type doesn't match the site type of the current Internet site, it isn't imported.

Hits were to an unconfigured site

If the Site ID or site name in the log file doesn't match those configured in the Server Manager for which the log file is imported, hits to that site are discarded.

Hits were to an unconfigured server

If the Server ID or server name in the log file doesn't match those configured in the Server Manager for which the log file is imported, hits to that server are discarded.

Hits were to excluded inline images

If the current Internet site was configured with inline images excluded and the file name of a request matches the criterion, the request isn't imported. For example, if you specify *.GIF as an exclusion criterion, all .GIF counts are dropped. Note: Depending on the quantity of inline images on your web pages, this number may be very high.

Hits were from excluded hosts

If you specify hosts to exclude in the Site Properties Excludes window, hits from those hosts are tabulated in this count and dropped from the import to the database.

Hits were from excluded spiders

If you specify spiders to exclude in the Site Properties Excludes window, hits from spiders are tabulated in this count and dropped from the import to the database.

Hits represented uploads and not requests

For imports of FTP log files only. Usage Analyst 2.0 analyzes only the download requests from your FTP site. No other FTP transactions are imported.

Number of open visits

One of the following, depending on the import option setting for log file rotation:
• open visits were committed to database
• open visits were discarded
• open visits were loaded from cache before this import
• open visits were cached for next import

The information in the Statistics window is broken down by server, and within each server, by individual site. For an import to multiple servers and sites, you can scroll through the statistics by section for each respective site.

Note Where configuration options affect the import statistics, you set them either in the site properties of the Server Manager or in Usage Import options.

Deleting Log Files

In managing your database, you will encounter situations where you want to remove an imported log file. You may find that you've imported a log file with an improperly configured filter or inappropriate overlaps, or you may simply want to reduce the size of the database by removing older log files. In all cases, you can remove complete imports with the Log File Manager by highlighting the appropriate log files and then choosing Delete from the Edit menu or by highlighting the imports and choosing the Delete button.

Usage Import prompts you to confirm your deletions. Once removed, the log files can be reincorporated in the database only by performing another import. If you want to delete requests selectively (as opposed to deleting entire log files), you should choose Delete Requests from the Tools menu.

Usage Import's Tools and Buttons

In addition to functions available in the Log file manager and Server Manager, Usage Import has contextual features and tools available through its menus and toolbar.

The Tools Menu

The Tools menu provides access both to the Import Options window and to several tools for managing your data.

Lookup HTML Titles

The Lookup HTML Titles tool, accessible from the Tools menu, associates a title with every HTML file in the database. Usage Import can look up HTML titles during the Import process if you enable Lookup HTML Titles, or you can look them up at any time afterward using this tool.

The Boolean filter for HTML lookups gives you a great deal of flexibility in tailoring the process to your individual needs. The default filter is Title="", which matches any entries in the database that presently lack titles. Usage Import also keeps track of sites by sequential ID numbers, so you can look up titles by site without having to include names. (Where you want to specify the name for a site, server, or log source, you need to enclose it in double quotes.)

Usage Import determines the HTML title by requesting the HTML file from the Internet site and then parsing the <TITLE> text from the file. If you specify an HTTP proxy server on the configuration tab of the Import Options window, this proxy server is used to request the HTML file.
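That title lookup amounts to fetching the page and extracting the <TITLE> text; a minimal sketch (the encoding choice and URL handling are assumptions):

```python
# Sketch of an HTML title lookup: request the page and parse out the
# <TITLE> text. A real setup would route the request through the HTTP
# proxy server if one is configured.
import re
from urllib.request import urlopen

def parse_title(html):
    """Extract the <TITLE> text from an HTML document, if present."""
    match = re.search(r"<title[^>]*>(.*?)</title>", html, re.I | re.S)
    return match.group(1).strip() if match else None

def lookup_title(url):
    """Fetch the page and parse its title."""
    html = urlopen(url, timeout=10).read().decode("latin-1", "replace")
    return parse_title(html)

print(parse_title("<html><head><TITLE>Home Page</TITLE></head></html>"))  # Home Page
```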

To activate the Title lookup process, choose Lookup. The time required for completion depends on the number of HTML files to look up and the speed of your Internet connection.

Resolve IP Addresses

The Resolve IP addresses tool, accessible from the Tools menu, tries to resolve any unidentified IP addresses in the database. Usage Import attempts to resolve IP addresses during the import process if you enable the Resolve IP Addresses import option, or you may do it at any time afterward using this tool.

 

Every host directly connected to the Internet has a unique IP address. This IP address is represented by four groups of integers (0-255) separated by decimal points (for example, 10.10.10.10). DNS resolution turns IP addresses into host names (such as tomato.interse.com). Usage Import can interpret host names to determine organization type and country/region information. Host names also provide the key for finding organization names and addresses by means of a Whois lookup. You cannot determine any of this information directly from IP addresses.

Note After IP addresses have been resolved, run Whois to determine organization names.

Either your server or Usage Import should resolve IP addresses, but not both. If your server has IP resolution turned on, disable Usage Import's IP resolution. It is better to have the server perform IP resolution, because the server can resolve IP addresses that are dynamically assigned by dial-up Internet service providers while they are still active. If you try to resolve dynamically assigned IPs at any other time, they either can't be resolved (because the user is no longer dialed in) or they may be resolved to an incorrect host name (because another user has dialed in and been assigned that IP).

Usage Import caches all IP resolutions for the number of days you specify on the main Import Option window. If a cached lookup has aged beyond this length of time, that IP address is resolved again. Caching IP addresses improves performance, especially if your sites receive a large amount of repeat traffic. However, IP addresses change, and caching IPs for too long can result in inaccurate IP resolutions.

 

Note IP resolution is extremely slow. Each IP address can take up to a minute to resolve, and many will never resolve. (A typical success rate is 60 to 85 percent.) For this reason, you should have your server perform resolutions if possible, and you should experiment with cache duration to optimize performance.

Delete Requests

The more data you have in the database, the more valuable your analysis for identifying trends. However, at some point you may want to permanently delete old requests from your database. The Delete Requests tool, accessible from the Tools menu, permanently removes data from the database.

 

The Delete Requests window defaults to the Timestamp<Today filter, which deletes all requests earlier than the current date. The standard range of filter possibilities is also available, as in the following figure, which shows deleting all requests for a single site.

 

Highlighting an object in the Server Manager and then using the Delete Requests tool gives you a pre-configured filter based on your selection. (The example above was generated by highlighting the Intersé web site icon and then opening the Delete tool.)

When you're ready to delete, choose the Delete button. The program prompts you to confirm the action. If you confirm, the requests are permanently removed from the database. If you ever need to analyze those requests again, you must import the log files from that period again.

The Usage Import Toolbar

Many of the functions of Usage Import are available through buttons on the toolbar.

The Scheduler

The Scheduler button provides access to a graphical facility for automating most of the functions of Usage Import, including filtered imports and deletes and the various Internet lookups. The Scheduler also automates the functions of Report Writer, and its graphical interface is accessible from either component independently.

 

For a complete discussion of the Scheduler and its operation, see Chapter 9, "Automating Usage Import and the Report Writer."