DLXS Statistics System

Functional overview
Technical overview
Installation
Reprocessing web log files and retabulating statistics
Appendix A: Sample XML file for resource pattern definitions
Appendix B: Stats system architecture

Functional overview

The DLXS statistics system consists of two parts: (1) a tool to run on each web server to parse web log files, calculate hits, and insert those hits into the database, and (2) a web interface for retrieving reports such as HTML or MS Excel files. The web log parsing tool is based on XML configuration files called "resource pattern definitions". The pattern definition for a resource describes the string patterns in a URL that would indicate that a hit is in reference to that resource. Many DLXS collections can be described in the same configuration file. For example, all Text Class collections will have the same definition, so only one XML file will be created for all Text Class collections. See Appendix A for a sample XML configuration file.

The pattern definition file uses regular expressions to describe the resource in two ways: (1) the string that must be present in a hit URL to signify that the hit is in reference to this resource, and (2) the string or strings that must or must not be present to describe what type of hit occurred on this resource (such as a search, browse, etc.).

After parsing a web log file, the software compares each hit to each configuration file and tries to answer two questions:

Does this hit reference this resource?
If so, what type (and then subtype) of hit is this?

The XML configuration file also contains elements to describe a second and third level of tracking within a resource, such as titles (idno=*) and pages (seq=*) in DLXS collections. These levels are used in the creation of COUNTER reports.

To add tracking for a new resource, one would only have to add a new pattern definition XML file for the resource or include the resource name in an existing XML file (see Appendix A). Please check the XML configuration files for XML well-formedness and validity (a DTD file is included, /misc/s/stats/stats-config.dtd).

Reports are available through the web interface. The interface recognizes a user's IP address and determines to which collections the user has access. The interface then presents a number of reports in different formats:

COUNTER reports in Excel format available by email
Collection-level reports in Excel format or in HTML

Top

Technical overview

This section walks through the algorithm used by the stats_driver.pl script to process web log files and tabulate statistics in the database.

In general, the algorithm is as follows:

Build Stats::Resource objects to represent each of the XML pattern definition files.
Process log file one line at a time
Parse log file line
Determine which Stats::Resource object matches the log line, if any
Upon matching a hit to a Stats::Resource object, determine the 'type' and 'subtype' of the hit
Determine if the hit is a resource-, title-, or section-level hit
Increment the hit count in the database

1. Build Stats::Resource objects
The script uses a configuration variable to determine where to look for the resource pattern definition XML files. All *.xml files in that directory are considered. The script creates a Stats::Resource object for each of those XML files. This object parses the XML using the XML::Simple Perl module and stores the pattern definition data in member variables.

2. Process log file one line at a time
On the command line the script takes a string that should be the full path to a web log file. This log file is opened and read one line at a time. Each line is completely processed and the hit data updated in the database before moving on to the next line.

3. Parse log file line
The log file line is parsed and a Stats::Hit object is created to represent the data in that line.

The Stats::Hit constructor does most of the work of parsing the log file line. It assumes the log file is in Common Logfile Format (CLF) with the elements:

host rfc931user authuser [date] "method file protocol" status bytes "referer" "useragent"

The string is matched against the following regular expression:

/(\S+)\s(\S+)\s(\S+)\s(\[.+\])\s(".+|-")\s(\S+)\s(\S+)\s(".+|-")\s(".*")\s/

At this point double clicks are discarded. To determine if a hit is a double click the code must keep a structure that stores the time of all previous hits organized by host and URL. If the current hit is within 10 seconds of any previous hit from the same host to the same URL, then it is considered a double click and discarded. If a third hit occurs within 10 seconds of a hit determined to be a double-click, then that third hit is discarded as well.

4. Determine which Stats::Resource object matches the hit, if any
The hit is compared to each of the Stats::Resource objects to determine which resources it matches.

As an example, consider this line from a log file:

222.166.160.134 - - [31/Mar/2006:00:00:06 -0500] "GET /cgi/t/text/text-idx?c=moajrnl;idno=acw8433.1-03.054;node=acw8433.1-03.054:3 HTTP/1.1" 200 15123 "http://www.hti.umich.edu/cgi/b/bib/bibperm?q1=ACW8433-1329APPL-257" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)" www.hti.umich.edu

Also, consider the resource pattern definition XML file in Appendix A.

Say the Stats::Resource object that represents the XML in Appendix A (we'll call it Resource A) is the first Stats::Resource object considered. The script first has to answer the question "Does this hit refer to Resource A?". To determine this, the script compares the hit URL, "/cgi/t/text/text-idx?c=moajrnl;idno=acw8433.1-03.054;node=acw8433.1-03.054:3", to the value of the <identifier> element in the XML. Literally, it tries to do a regular expression match $hiturl =~ /$identifier/.

If that regular expression match fails, then the code moves on to the next Stats::Resource object.

If the match returns true, then it also tries to match one or more of the <resource_definition> structures. To do this it loops through each of the <or> elements doing a regular expression match on each of the <and> elements and a !~ match on each of the <not> elements. If the match on any of the <and> or <not> elements fails, then we jump out of the loop that was evaluating this <or> element and go to the next one. As soon as the hit URL matches all of the <and> and <not> elements in an <or>, the code stores the resource ID in a list. It then evaluates the next <resource_definition> element. Finally, the method returns the list of resource IDs that matched the hit URL. This means that a hit URL could refer to more than one resource within a resource definition file.

The code then moves on to the next Stats::Resource object and tries to match the hit to that resource definition. This means that a hit could also match more than one resource definition. This is probably an unlikely scenario, but the functionality exists in case you want to tabulate different types of statistics for the same resource.

If the hit does not match any of the Stats::Resource objects, then nothing more is done with that line and the script moves on to analyze the next line in the log file.

5. Determine the type and subtype.
If the hit does match a Stats::Resource object, then the next step is to determine the "type" and "subtype" of the hit. This determination is made by comparing the hit URL to the regular expressions in the <hit> elements in the XML.

The possible types and subtypes from the resource definition file are stored in a hash with their corresponding sets of regular expressions. The code loops through the <or> elements doing pattern matches on the <and> and <not> strings.

Once the code finds a matching (type, subtype), it returns that match. This means that a hit URL can match one and only one type/subtype combination.

6. Determine if the hit is a resource-, title-, or section-level hit
The final analysis on the hit is to determine if the hit is a resource-level, title-level, or section-level hit. If a <title> element exists in the XML and the hit URL matches the value of the <title> element, then we know that the hit is at least a title-level hit. If a <section> element exists in the XML and the hit URL matches the value of the <section> element, then we know that the hit is a section-level hit.

7. Increment the hit count in the database
Once the script has made this determination, it updates the database. Starting with the most granular type of hit: If we're dealing with a section-level hit, the "Total" column is incremented for the row with the resource, title, and section fields filled in. If we're dealing with a title-level hit, the "Total" column is incremented for the row with the resource and title fields filled in, but the section field is the empty string. If we're dealing with a resource-level hit, the "Total" column is incremented for the row with the resource field filled in, but the title and section fields are the empty string. So each hit is only counted once in the database, for the type that is the most specific.

This all works out because if you want to say "Show me all resource-level hits for resource X" the query would be

SELECT * FROM hit_totals WHERE resource = X AND title = '' AND section = ''

which basically gets all rows for that resource regardless of the title or section values. If you want to say "Show me all title-level hits for resource X" the query would be

SELECT * FROM hit_totals WHERE resource = X AND title != '' AND section = ''

which gets all rows for that resource that do contain a value in the title field.

After matching a hit to a resource, we still loop through the rest of the resources to see if the hit also matches another resource. This is probably unlikely, but currently the script has no reason to believe that this type of situation would not occur, so it checks anyway.

Top

Installation

Set up the database. Create the stats database on one server using the stats_db.sql file. The stats_driver.pl script on each of the web servers must be able to talk to the stats database.
Set up the log parsing software. Install the script stats_driver.pl on each web server. In addition, this script relies on:
- the configuration file
- the supporting Stats::* Perl modules
- the pattern definition XML files
Each of those must also be installed on each web server and the configuration file appropriately configured on each server. Set the full path to the configuration file in the Stats::Config module.

The following third-party Perl modules are also required:
- Config::Auto
- DBI
- Socket
- Time::Local
- XML::Simple
Create an entry in the crontab on each web server to run the stats_driver.pl script once per day. If you run that script more than once on a given log file, then the results in the database will be erroneously doubled, and you will have to delete all results in the database for that day and reprocess the log files.
Create pattern definition XML files. (See Appendix A) Also, be sure to set the full path to these files in the resource_def_path configuration variable. Check the configuration file for well-formed, valid XML using 'xmllint': xmllint --noout --valid *.xml The DTD file is stats-config.dtd.
Set up the COUNTER processing script. One of the servers must have the process_counter_queue.pl script installed. This script could be installed on one of the web servers or a different machine, but it only needs to run on one server. The only requirement is that the script be able to access the stats database from the remote host. This script reads COUNTER requests from the database, creates COUNTER reports, and emails the reports (as MS Excel files) to users.
This script relies on:
- the configuration file (be sure to configure appropriately and set the full path to the config file in Stats::Config)
- the supporting Stats::* Perl modules
The following third-party Perl modules are also required:
- Config::Auto
- DBI
- Mail::Mailer
- MIME::Base64
- Spreadsheet::WriteExcel (version 2.18)
- Socket
Create a crontab entry to run this script frequently, such as every 15 minutes. Sample crontab entry:
0,15,30,45 * * * * perl /l1/bin/s/stats/process_counter_queue.pl 2>&1 >>/l1/bin/s/stats/process_counter_queue.log
Set up the web interface. The stats web interface should be installed on one server. The interface consists of two CGI scripts, subscriber and manager. The idea is that the manager script will sit behind an authentication method to require users to authenticate before accessing the interface. A user who has authenticated at manager will be able to view stats for all institutions and for all collections found in the AUTHZD_COLL environment variable. A user at subscriber will be able to view stats only for their institution.
The stats web interface also relies on:
- the configuration file (be sure to configure appropriately and set the full path to the config file in Stats::Config)
- the supporting Stats::* Perl modules
- the XSLT file for displaying results in HTML
- the Javascript files in /web/s/stats
- a CSS file (if one is referenced in the configuration file)
The following third-party Perl modules are also required:
- Config::Auto
- CGI
- DBI
- XML::Simple
- XML::LibXSLT
- Digest::MD5
- Mail::Mailer
- Socket
- Spreadsheet::WriteExcel (version 2.18)
- XML::Writer
- XML::LibXML

Top

Reprocessing web log files and retabulating statistics

The reprocess_stats.pl script, which wraps the stats_driver.pl script, can be used to regenerate data in the database if necessary. The two likely cases are:

Reprocess past log files to tabulate stats for a collection newly added to a resource definition config file.
The SQL DELETE FROM hit_totals WHERE resource='foo' AND hitdate >= '2006-09-01' AND hitdate <= '2006-09-30'; will delete all stats for collection "foo" during September 2006. You can then regenerate those stats (assuming the web log files still exist) by running the script reprocess_stats.pl on all relevant web servers:

perl reprocess_stats.pl foo 20060902 20061001

(Note: Web log file conventions may differ, but in our case the log files contain data for the day prior to the date in the filename. So '20060902 - 20061001' are the dates in the log file names for dates 20060901 through 20060930.)

This will reprocess all web log files and tabulate stats for collection "foo" only between the given date ranges.
Reprocess past log files for all collections over a period of time.
The SQL DELETE FROM hit_totals WHERE hitdate >= '2006-09-01' AND hitdate <= '2006-09-30'; will delete all stats for all collections during September 2006. You can then regenerate those stats (assuming the log files still exist) by running the script reprocess_stats.pl on all relevant web servers like this:

perl reprocess_stats.pl all 20060902 20061001

Top

Appendix A: Sample XML file for resource pattern definitions

Important: Use 'xmllint' or other utility to validate the XML.

Sample resource pattern definition file

Top

Appendix B: Stats system architecture

Following is a description of where the various parts of the software are located:

Web server log processing script:

$dlxs_root/bin/s/stats/stats_driver.pl

COUNTER reports request processing:

$dlxs_root/bin/s/stats/process_counter_queue.pl

Web interface:

$dlxs_root/cgi/s/stats/subscriber
$dlxs_root/cgi/s/stats/manager

XSLT stylesheets for dynamic HTML reports:

$dlxs_root/web/s/stats/resource_level_report.xsl

CSS for web interface:

$dlxs_root/web/s/stats/webstats.css

All supporting Perl modules:

$dlxs_root/lib/Stats/*.pm

Other:

The location where COUNTER reports should be temporarily created is defined in the 'counter_reports_path' configuration variable.
The location of the resource pattern definition XML files is defined in the 'resource_def_path' configuration variable.
The configuration file template 'stats.conf.template' and the database script 'stats_db.sql' are in $dlxs_root/misc/s/stats/