The DLXS statistics system consists of two parts: (1) a tool to run on each web server to parse web log files, calculate hits, and insert those hits into the database, and (2) a web interface for retrieving reports as HTML or MS Excel files. The web log parsing tool is based on XML configuration files called "resource pattern definitions". The pattern definition for a resource describes the string patterns in a URL that indicate that a hit refers to that resource. Many DLXS collections can be described in the same configuration file. For example, all Text Class collections have the same definition, so only one XML file is needed for all Text Class collections. See Appendix A for a sample XML configuration file.
The pattern definition file uses regular expressions to describe the resource in two ways: (1) the string that must be present in a hit URL to signify that the hit is in reference to this resource, and (2) the string or strings that must or must not be present to describe what type of hit occurred on this resource (such as a search, browse, etc.).
After parsing a web log file, the software compares each hit to each configuration file and tries to answer two questions: does the hit refer to this resource, and if so, what type of hit is it (a search, a browse, etc.)?
To add tracking for a new resource, one would only have to add a new pattern definition XML file for the resource or include the resource name in an existing XML file (see Appendix A). Please check the XML configuration files for XML well-formedness and validity (a DTD file is included, /misc/s/stats/stats-config.dtd).
Reports are available through the web interface. The interface recognizes a user's IP address and determines to which collections the user has access. The interface then presents a number of reports in different formats:
This section walks through the algorithm used by the stats_driver.pl script to process web log files and tabulate statistics in the database.
In general, the algorithm is as follows:
1. Build Stats::Resource objects
The script uses a configuration variable to determine where to look for the resource pattern definition XML files. All *.xml files in that directory are considered. The script creates a Stats::Resource object for each of those XML files. This object parses the XML using the XML::Simple Perl module and stores the pattern definition data in member variables.
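As a rough illustration of this step, the sketch below collects every *.xml file in a directory and reads one value from each. It is written in Python rather than Perl and uses xml.etree in place of XML::Simple. The function name is hypothetical, and the only thing assumed about the file layout is that an <identifier> element appears somewhere below the root; Appendix A shows the real structure.

```python
from pathlib import Path
import xml.etree.ElementTree as ET

def load_resource_identifiers(config_dir):
    """Return {file name: identifier pattern} for every *.xml file in config_dir."""
    identifiers = {}
    for xml_path in sorted(Path(config_dir).glob("*.xml")):
        root = ET.parse(str(xml_path)).getroot()
        # Assumption: the identifier regex is the text of an <identifier>
        # element somewhere below the root; the exact nesting is not relied on.
        identifiers[xml_path.name] = root.findtext(".//identifier")
    return identifiers
```

Files without the .xml extension are skipped, which mirrors the rule that only *.xml files in the directory are considered.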
2. Process log file one line at a time
On the command line the script takes a string that should be the full path to a web log file. This log file is opened and read one line at a time. Each line is completely processed and the hit data updated in the database before moving on to the next line.
3. Parse log file line
The log file line is parsed and a Stats::Hit object is created to represent the data in that line.
The Stats::Hit constructor does most of the work of parsing the log file line. It assumes the log file is in Common Logfile Format (CLF) extended with the referer and user-agent fields (commonly called the combined log format), with the elements:
host rfc931user authuser [date] "method file protocol" status bytes "referer" "useragent"
The string is matched against the following regular expression:
/(\S+)\s(\S+)\s(\S+)\s(\[.+\])\s(".+|-")\s(\S+)\s(\S+)\s(".+|-")\s(".*")\s/
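To make the capture groups concrete, here is a minimal Python sketch that applies the same regular expression to one line and names the nine groups. The function name and dictionary keys are illustrative, not taken from Stats::Hit, and splitting the request string on spaces to get the URL is an assumption about how the "method file protocol" element is broken up.

```python
import re

# Same pattern as above, unanchored like the Perl match.
LOG_LINE = re.compile(
    r'(\S+)\s(\S+)\s(\S+)\s(\[.+\])\s(".+|-")\s(\S+)\s(\S+)\s(".+|-")\s(".*")\s'
)

def parse_line(line):
    """Return a dict of log fields, or None if the line does not match."""
    m = LOG_LINE.search(line)
    if m is None:
        return None
    host, ident, user, date, request, status, size, referer, agent = m.groups()
    # The request looks like "GET /path?query HTTP/1.1"; the URL is the middle part.
    parts = request.strip('"').split(" ")
    url = parts[1] if len(parts) >= 2 else ""
    return {
        "host": host,
        "date": date.strip("[]"),
        "url": url,
        "status": status,
        "bytes": size,
        "referer": referer.strip('"'),
        "useragent": agent.strip('"'),
    }
```

Note that the pattern ends in \s, so a line only matches if something (here, whitespace followed by the virtual host name) comes after the quoted user agent.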
At this point double clicks are discarded. To determine if a hit is a double click the code must keep a structure that stores the time of all previous hits organized by host and URL. If the current hit is within 10 seconds of any previous hit from the same host to the same URL, then it is considered a double click and discarded. If a third hit occurs within 10 seconds of a hit determined to be a double-click, then that third hit is discarded as well.
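The double-click rule can be sketched as follows. This is a Python illustration under stated assumptions, not the Perl code itself: the class and method names are hypothetical, timestamps are plain seconds, hits are assumed to arrive in chronological order, and "within 10 seconds" is read as a gap of 10 seconds or less. Because the stored time is updated even for discarded hits, a third hit shortly after a discarded one is discarded too.

```python
class DoubleClickFilter:
    """Flags hits that repeat the same (host, URL) within a short window."""

    def __init__(self, window_seconds=10):
        self.window = window_seconds
        # (host, url) -> time of the most recent hit, whether kept or discarded
        self.last_seen = {}

    def is_double_click(self, host, url, timestamp):
        key = (host, url)
        previous = self.last_seen.get(key)
        # Always remember this hit so a chain of rapid clicks keeps being discarded.
        self.last_seen[key] = timestamp
        return previous is not None and timestamp - previous <= self.window
```

With hits from one host to one URL at 0, 5, 12, and 30 seconds, the first and last are kept; the hit at 12 seconds is dropped because it follows the discarded hit at 5 seconds by less than the window.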
4. Determine which Stats::Resource object matches the hit, if any
The hit is compared to each of the Stats::Resource objects to determine which resources it matches.
As an example, consider this line from a log file:
222.166.160.134 - - [31/Mar/2006:00:00:06 -0500] "GET /cgi/t/text/text-idx?c=moajrnl;idno=acw8433.1-03.054;node=acw8433.1-03.054:3 HTTP/1.1" 200 15123 "http://www.hti.umich.edu/cgi/b/bib/bibperm?q1=ACW8433-1329APPL-257" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)" www.hti.umich.edu
Also, consider the resource pattern definition XML file in Appendix A.
Say the Stats::Resource object that represents the XML in Appendix A (we'll call it Resource A) is the first Stats::Resource object considered. The script first has to answer the question "Does this hit refer to Resource A?". To determine this, the script compares the hit URL, "/cgi/t/text/text-idx?c=moajrnl;idno=acw8433.1-03.054;node=acw8433.1-03.054:3", to the value of the <identifier> element in the XML. Literally, it tries to do a regular expression match $hiturl =~ /$identifier/.
If that regular expression match fails, then the code moves on to the next Stats::Resource object.
If the match returns true, then it also tries to match one or more of the <resource_definition> structures. To do this it loops through each of the <or> elements doing a regular expression match on each of the <and> elements and a !~ match on each of the <not> elements. If the match on any of the <and> or <not> elements fails, then we jump out of the loop that was evaluating this <or> element and go to the next one. As soon as the hit URL matches all of the <and> and <not> elements in an <or>, the code stores the resource ID in a list. It then evaluates the next <resource_definition> element. Finally, the method returns the list of resource IDs that matched the hit URL. This means that a hit URL could refer to more than one resource within a resource definition file.
The code then moves on to the next Stats::Resource object and tries to match the hit to that resource definition. This means that a hit could also match more than one resource definition. This is probably an unlikely scenario, but the functionality exists in case you want to tabulate different types of statistics for the same resource.
If the hit does not match any of the Stats::Resource objects, then nothing more is done with that line and the script moves on to analyze the next line in the log file.
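The matching logic in this step can be sketched in Python as below. The dictionary layout stands in for the parsed XML and is an assumption, as are the function names and the sample patterns; only the element names (identifier, resource_definition, or, and, not) come from the description above.

```python
import re

def or_group_matches(url, group):
    """True if the URL matches every <and> pattern and no <not> pattern."""
    return (all(re.search(p, url) for p in group.get("and", []))
            and not any(re.search(p, url) for p in group.get("not", [])))

def matching_resource_ids(url, resource):
    """Return the IDs of all resource definitions in one file that match the URL."""
    if not re.search(resource["identifier"], url):
        return []  # the hit does not refer to this definition file at all
    matched = []
    for definition in resource["resource_definitions"]:
        # One fully matching <or> group is enough for this definition.
        if any(or_group_matches(url, g) for g in definition["or"]):
            matched.append(definition["id"])
    return matched

# Hypothetical stand-in for one parsed pattern definition file.
text_class = {
    "identifier": r"/cgi/t/text/",
    "resource_definitions": [
        {"id": "moajrnl", "or": [{"and": [r"c=moajrnl"]}]},
        {"id": "moa", "or": [{"and": [r"c=moa(;|$)"], "not": [r"c=moajrnl"]}]},
    ],
}
```

Calling matching_resource_ids on the sample URL from the log line above with this stand-in returns only the "moajrnl" ID, and a URL outside /cgi/t/text/ returns an empty list.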
5. Determine the type and subtype.
If the hit does match a Stats::Resource object, then the next step is to determine the "type" and "subtype" of the hit. This determination is made by comparing the hit URL to the regular expressions in the <hit> elements in the XML.
The possible types and subtypes from the resource definition file are stored in a hash with their corresponding sets of regular expressions. The code loops through the <or> elements doing pattern matches on the <and> and <not> strings.
Once the code finds a matching (type, subtype), it returns that match and stops looking. This means that a hit URL is assigned only one type/subtype combination: the first one that matches.
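A minimal Python sketch of this first-match behavior follows. The type and subtype names and the URL patterns are hypothetical, and an ordered list is used in place of the hash so that "first match" is well defined in the example.

```python
import re

def group_matches(url, group):
    """True if the URL matches every <and> pattern and no <not> pattern."""
    return (all(re.search(p, url) for p in group.get("and", []))
            and not any(re.search(p, url) for p in group.get("not", [])))

def classify_hit(url, hit_patterns):
    """Return the first (type, subtype) whose <or> groups match, else None."""
    for hit_type, subtype, or_groups in hit_patterns:
        if any(group_matches(url, g) for g in or_groups):
            return hit_type, subtype
    return None

# Hypothetical <hit> definitions: (type, subtype, list of <or> groups).
HIT_PATTERNS = [
    ("search", "simple", [{"and": [r"type=simple"]}]),
    ("browse", "", [{"and": [r"page=browse"], "not": [r"type="]}]),
]
```

Because the function returns as soon as one entry matches, a URL that would satisfy several entries is counted only under the earliest one.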
6. Determine if the hit is a resource-, title-, or section-level hit
The final analysis on the hit is to determine if the hit is a resource-level, title-level, or section-level hit. If a <title> element exists in the XML and the hit URL matches the value of the <title> element, then we know that the hit is at least a title-level hit. If a <section> element exists in the XML and the hit URL matches the value of the <section> element, then we know that the hit is a section-level hit.
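The level check can be sketched in a few lines of Python. The function name is hypothetical, and the sketch treats the <title> and <section> values as regular expressions tested independently, which is one reading of the description above.

```python
import re

def hit_level(url, title_pattern=None, section_pattern=None):
    """Return 'resource', 'title', or 'section' for a hit URL."""
    level = "resource"
    if title_pattern and re.search(title_pattern, url):
        level = "title"      # at least a title-level hit
    if section_pattern and re.search(section_pattern, url):
        level = "section"    # the most specific level
    return level
```

For instance, with the illustrative patterns "idno=" for titles and "node=" for sections, the sample URL from the log line above would be classified as a section-level hit.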
7. Increment the hit count in the database
Once the script has made this determination, it updates the database. Starting with the most granular type of hit: If we're dealing with a section-level hit, the "Total" column is incremented for the row with the resource, title, and section fields filled in. If we're dealing with a title-level hit, the "Total" column is incremented for the row with the resource and title fields filled in, but the section field is the empty string. If we're dealing with a resource-level hit, the "Total" column is incremented for the row with the resource field filled in, but the title and section fields are the empty string. So each hit is only counted once in the database, for the type that is the most specific.
This all works out because if you want to say "Show me all resource-level hits for resource X" the query would be
SELECT * FROM hit_totals WHERE resource = X
which gets all rows for that resource regardless of the title or section values. If you want to say "Show me all title-level hits for resource X" the query would be
SELECT * FROM hit_totals WHERE resource = X AND title != ''
which gets all rows for that resource that contain a value in the title field, whether or not the section field is filled in.
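The counting scheme can be tried out with a small in-memory database. This Python sketch uses SQLite and a cut-down hit_totals table with only the resource, title, section, and total columns; the real table has more columns (such as hitdate), and the function names are hypothetical.

```python
import sqlite3

def make_db():
    """Create an in-memory database with a simplified hit_totals table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE hit_totals (resource TEXT, title TEXT, section TEXT, total INTEGER)"
    )
    return conn

def record_hit(conn, resource, title="", section=""):
    """Count one hit on the single most specific row for it."""
    cur = conn.execute(
        "UPDATE hit_totals SET total = total + 1 "
        "WHERE resource = ? AND title = ? AND section = ?",
        (resource, title, section),
    )
    if cur.rowcount == 0:  # no row yet for this combination
        conn.execute(
            "INSERT INTO hit_totals (resource, title, section, total) VALUES (?, ?, ?, 1)",
            (resource, title, section),
        )

def total_hits(conn, resource, titles_only=False):
    """Sum hits for a resource, optionally only rows with a title filled in."""
    sql = "SELECT COALESCE(SUM(total), 0) FROM hit_totals WHERE resource = ?"
    if titles_only:
        sql += " AND title != ''"
    return conn.execute(sql, (resource,)).fetchone()[0]
```

Since every hit lands on exactly one row, summing over all rows for a resource gives its overall total without double counting, and adding the title condition narrows the sum to title-level and section-level hits.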
After matching a hit to a resource, we still loop through the rest of the resources to see if the hit also matches another resource. This is probably unlikely, but currently the script has no reason to believe that this type of situation would not occur, so it checks anyway.
Each of those must also be installed on each web server and the configuration file appropriately configured on each server. Set the full path to the configuration file in the Stats::Config module.
The following third-party Perl modules are also required:
Create an entry in the crontab on each web server to run the stats_driver.pl script once per day. If you run that script more than once on a given log file, then the results in the database will be erroneously doubled, and you will have to delete all results in the database for that day and reprocess the log files.
This script relies on:
The following third-party Perl modules are also required:
Create a crontab entry to run this script frequently, such as every 15 minutes. Sample crontab entry:
0,15,30,45 * * * * perl /l1/bin/s/stats/process_counter_queue.pl >>/l1/bin/s/stats/process_counter_queue.log 2>&1
The stats web interface also relies on:
The following third-party Perl modules are also required:
The reprocess_stats.pl script, which wraps the stats_driver.pl script, can be used to regenerate data in the database if necessary. The two likely cases are:
The SQL DELETE FROM hit_totals WHERE resource='foo' AND hitdate >= '2006-09-01' AND hitdate <= '2006-09-30'; will delete all stats for collection "foo" during September 2006. You can then regenerate those stats (assuming the web log files still exist) by running the script reprocess_stats.pl on all relevant web servers:
perl reprocess_stats.pl foo 20060902 20061001
(Note: Web log file conventions may differ, but in our case the log files contain data for the day prior to the date in the filename. So '20060902 - 20061001' are the dates in the log file names for dates 20060901 through 20060930.)
This will reprocess all web log files within the given date range and tabulate stats for collection "foo" only.
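The one-day offset between the hit dates and the log file names is easy to get wrong, so here is a small Python helper that computes the two date arguments. It assumes the convention described in the note above (each log file holds the previous day's data); the function name is hypothetical.

```python
from datetime import date, timedelta

def log_file_date_args(first_day, last_day):
    """Map a range of hit dates to the matching log file name dates (YYYYMMDD)."""
    offset = timedelta(days=1)  # assumption: file dated D holds data for D - 1
    return (
        (first_day + offset).strftime("%Y%m%d"),
        (last_day + offset).strftime("%Y%m%d"),
    )
```

For hits from 1 September through 30 September 2006, this yields the arguments 20060902 and 20061001 used in the command above.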
The SQL DELETE FROM hit_totals WHERE hitdate >= '2006-09-01' AND hitdate <= '2006-09-30'; will delete all stats for all collections during September 2006. You can then regenerate those stats (assuming the log files still exist) by running the script reprocess_stats.pl on all relevant web servers like this:
perl reprocess_stats.pl all 20060902 20061001
Important: Use 'xmllint' or another utility to validate the XML.
Sample resource pattern definition file
Following is a description of where the various parts of the software are located:
Web server log processing script:
COUNTER reports request processing:
Web interface:
XSLT stylesheets for dynamic HTML reports:
CSS for web interface:
All supporting Perl modules:
Other: