Last updated | 2004-06-16 17:23:24 EDT |
Doc Title | broker: An OAI-compliant metadata server |
Author 1 | Blanco, Jose |
CVS Revision | $Revision: 1.8 $ |
broker20 is the CGI program that produces XML responses to OAI verbs as dictated by version 2.0 of the OAI protocol. Setting up broker20 will allow service providers to access and harvest metadata about your collections, for the purpose of aggregating and making this metadata, and consequently the collections, more broadly available to the public.
broker20 also produces responses to CGM verbs as dictated by the CGM Protocol, a protocol for distributed searching. This protocol was developed by the University of Michigan, Cornell University, and Göttingen University with support provided by the National Science Foundation. Working from the roots of the DIENST protocol developed at Cornell and the then-emergent OAI protocols, the project team focused on creating a new protocol--dubbed CGM, for "Cornell, Göttingen, Michigan"--that was consistent with OAI, borrowed from DIENST, and added mechanisms for full text searching.
Setting up broker20 involves understanding the six verbs behind the OAI protocol. To learn more about the OAI protocol, see http://www.openarchives.org/.
The Identify verb identifies the provider (i.e., you). Its response is built from the following parameters, which reside in $DLXSROOT/cgi/b/broker20/broker20.cfg and are configurable when the DLXS middleware is installed:
$gRepositoryID : for DLPS, the value is lib.umich.edu. Note that this must be a domain name.
$gRepositoryName : for DLPS, the value is The University of Michigan. University Library. Digital Library Production Service.
$BaseUrl : for DLPS, the value is http://www.hti.umich.edu/cgi/b/broker20/broker20
$AdminEmail : for DLPS, the value is dlps-broker@umich.edu
$EarliestDateStamp : for DLPS, the value is 2000-08-17. Enter the earliest date stamp for your institution.
$DeletedRecord : for DLPS, the value is NO. This flag indicates the manner in which the repository supports the notion of deleted records. Legitimate values are no, transient, or persistent. Because deleted records are not tracked when bib data is prepared, DLPS sets it to no.
$Granularity : for DLPS, the value is YYYY-MM-DD. This is the resolution of the datestamp for your repository. The legitimate values are YYYY-MM-DD and YYYY-MM-DDThh:mm:ssZ, with meanings as defined in ISO 8601. The default value is the granularity used in the preparation of bib data.
$SampleID : for DLPS, the value is oai:lib.umich.edu:YEATS-YC023, with YEATS-YC023 being a record id from the yeats collection.
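The two legitimate $Granularity values can be checked mechanically before a date such as $EarliestDateStamp goes into the configuration. The following is a minimal Python sketch, not part of broker20; the function name is_valid_datestamp and the mapping table are assumptions made for illustration.

```python
from datetime import datetime

# strptime patterns for the two granularities named above.
GRANULARITY_FORMATS = {
    "YYYY-MM-DD": "%Y-%m-%d",
    "YYYY-MM-DDThh:mm:ssZ": "%Y-%m-%dT%H:%M:%SZ",
}


def is_valid_datestamp(value, granularity):
    """Return True if value is a well-formed datestamp at the given granularity."""
    fmt = GRANULARITY_FORMATS[granularity]
    try:
        parsed = datetime.strptime(value, fmt)
    except ValueError:
        return False
    # Round-trip to reject loosely formatted input such as "2000-8-17",
    # which strptime accepts even though it is not zero-padded.
    return parsed.strftime(fmt) == value
```

For example, is_valid_datestamp("2000-08-17", "YYYY-MM-DD") is true, while a seconds-level value checked against the day-level granularity is rejected.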
ListSets will list the sets in your repository. broker20 views sets as bib collections in groups that have OAI access privileges. Therefore, if you create a group using collmgr (see documentation for collmgr for specific steps to do this), and you set the OAI parameter for that group to be Y or y, you will see each of the collections in the group listed as a set when this verb is issued to broker20.
According to the OAI protocol, setSpec values must be alphanumeric. Because broker20 builds each setSpec from the collid and groupid values, all groupid and collid values need to be alphanumeric as well. Here at DLPS we had collids ending in -bib, so broker20 removes the hyphens to make them OAI compliant; when a set ending in bib is requested through the OAI protocol, broker20 translates the ending back to '-bib' so the collection can be accessed internally. As a consequence, an OAI-enabled collid may end in bib only if the bib is preceded by a hyphen. For example, a collid of yeats-bib can be used and broker20 can handle it, but a collid like yeatsbib should not be used.
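The collid translation described above can be sketched in a few lines of Python. This is an illustration only, not broker20's actual code: the function names are hypothetical, and how broker20 combines groupid and collid into the full setSpec is not shown here because the text does not specify it.

```python
def collid_to_set_name(collid):
    """Outbound direction: drop hyphens so the name is purely alphanumeric."""
    return collid.replace("-", "")


def set_name_to_collid(set_name):
    """Inbound direction: a requested name ending in 'bib' is mapped back to '-bib'."""
    if set_name.endswith("bib"):
        return set_name[:-3] + "-bib"
    return set_name


# yeats-bib survives the round trip.
assert set_name_to_collid(collid_to_set_name("yeats-bib")) == "yeats-bib"

# A collid that is literally "yeatsbib" does not: it comes back as "yeats-bib",
# which is why such collids should not be used.
assert set_name_to_collid(collid_to_set_name("yeatsbib")) == "yeats-bib"
```

The second assertion shows the limitation: the inbound translation cannot tell a collid that ended in -bib from one that ended in bib.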
Set information is used as an optional input by ListIdentifiers and ListRecords.
ListMetadataFormats responds with a list of the formats supported by broker20. Currently, it responds with oai_dc (Dublin Core) when a valid identifier is passed in, and also with oai_dc whenever there are any records in the repository. DLXS's BibClass format is also supported as an output format.
ListIdentifiers lists the identifiers, i.e., the unique record locators, in the repository. If a set is not specified, it lists all the identifiers in groups that have been made OAI enabled. If a set is specified, it generates a list of identifiers for the requested set.
GetRecord will return a single record for the identifier requested, in the metadata format requested.
ListRecords works very much like GetRecord, but instead of returning one record, it returns a list of records based on the input parameters. This is the verb harvesters will use to harvest your collections.
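Each of the six verbs is issued as an HTTP request against the base URL with a verb argument plus any verb-specific arguments. The Python sketch below only builds such request URLs; it does not contact a server. The helper name oai_request_url is an assumption, the base URL and sample identifier are the DLPS values quoted earlier, and the argument names identifier and metadataPrefix come from the OAI 2.0 protocol.

```python
from urllib.parse import urlencode

# The DLPS $BaseUrl value quoted earlier; substitute your own.
BASE_URL = "http://www.hti.umich.edu/cgi/b/broker20/broker20"


def oai_request_url(verb, base_url=BASE_URL, **arguments):
    """Build the request URL for an OAI verb and its arguments."""
    # Put the verb first, then the remaining arguments in a stable order.
    query = [("verb", verb)] + sorted(arguments.items())
    return base_url + "?" + urlencode(query)


identify_url = oai_request_url("Identify")
record_url = oai_request_url(
    "GetRecord",
    identifier="oai:lib.umich.edu:YEATS-YC023",
    metadataPrefix="oai_dc",
)
```

Here identify_url ends in ?verb=Identify, and record_url carries the identifier with its colons percent-encoded, as urlencode does by default.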
One thing worth pointing out: broker20 contains a routine that converts BibClass data to Dublin Core for ListRecords and GetRecord. When the bib data is bad (a missing closing K tag, for example), the record is not output; instead, an entry is written to a log file named ErrorLogFor_broker20 in /l1/cgi/b/broker20. Each entry contains the time the error took place, the id of the record, and a copy of the problem record. You may want to create a cron script to clean this log file periodically and perhaps to notify you if there are entries.
Here at UM we have over 130,000 records and have never had a bad record (so far). If you run your bib data through an XML validator and it validates, you should never get an entry in this error log.
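A periodic check of the error log could look like the following Python sketch. It is an assumption about how such a cron script might be written, not something shipped with broker20: it treats any non-empty file as "has entries" because the exact entry layout is not specified here, and the function names are hypothetical.

```python
import os

# Location given above; adjust for your installation.
ERROR_LOG = "/l1/cgi/b/broker20/ErrorLogFor_broker20"


def error_log_has_entries(path=ERROR_LOG):
    """Return True if the log file exists and is not empty."""
    return os.path.isfile(path) and os.path.getsize(path) > 0


def clear_error_log(path=ERROR_LOG):
    """Truncate the log file, leaving an empty file in place."""
    with open(path, "w"):
        pass


if __name__ == "__main__":
    if error_log_has_entries():
        # Cron mails any output to the job owner on most systems.
        print("broker20 logged bad records; see " + ERROR_LOG)
```

A script like this only reports; you would call clear_error_log once the bad records have been reviewed and fixed.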
In order for broker20 to work, you need to create a group or groups made up of the collections that you wish harvesters to gather via broker20. You can do this through collmgr. Be sure to set the OAI parameter to Y or y for these groups. Most institutions will probably create only one group containing the collections they feel a harvester should have access to, but there are cases where you want one harvester to harvest one group and another harvester to harvest a different one. In these cases you could create multiple groups and notify each harvester of the group they may be interested in. When they run their harvester, they will run it against that group.
To put collections online, follow the procedures for getting BibClass collections online, since broker20 works against BibClass collections. Also, remember to add the collection(s) to the AUTH system so that broker20 has access to them.
All searches for data are done with XPAT.
OAI is Unicode compliant, and the broker20 CGI contains routines for converting the character entity references used by DLPS to XML numeric character references based on the Unicode equivalent of each character. If your institution has character entity references that are not included in the list that the release version of broker20 uses, you will need to add them to the broker20 code with the appropriate Unicode values. You will need to add the conversion in the routine ConvertStandardCharEnt.
There is another routine, ConvertCollectionChars, that converts Latin-1 characters (x0a1 to x0ff) and a few other characters from ISO-8859-* into their Unicode equivalents, also represented as XML numeric character references. You may need to add conversions in this routine if your records contain any characters from these ISO encodings not currently handled by broker20.
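The Latin-1 part of this conversion is straightforward because the Latin-1 code for each character in that range equals its Unicode code point. The Python sketch below illustrates the idea only; it is not the ConvertCollectionChars routine, the function name is hypothetical, and it assumes the record bytes have already been decoded as Latin-1.

```python
def latin1_to_numeric_references(text):
    """Replace characters in the range 0xA1-0xFF with XML numeric character references."""
    pieces = []
    for ch in text:
        code = ord(ch)
        if 0xA1 <= code <= 0xFF:
            pieces.append("&#x%X;" % code)
        else:
            pieces.append(ch)
    return "".join(pieces)


# Bytes as they might appear in a Latin-1 encoded record.
raw = b"G\xf6ttingen"
converted = latin1_to_numeric_references(raw.decode("latin-1"))
# converted is "G&#xF6;ttingen"
```

Characters from the other ISO-8859-* encodings do not share this property and would need an explicit mapping table.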
In instances where a character entity reference does not have an obvious Unicode equivalent, the character entity reference is left unchanged in the output. For example, if there is no obvious Unicode mapping for &abc;, broker20 will output &abc;. The user interface will simply display this string.
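The lookup-with-fallback behavior described above can be sketched as follows in Python. This is not the ConvertStandardCharEnt routine: the function name and the two-entry table are assumptions for illustration, and a real table would list whatever entity references your records use.

```python
import re

# Illustrative table only: entity name -> Unicode code point.
ENTITY_TO_CODEPOINT = {
    "eacute": 0xE9,
    "ouml": 0xF6,
}

ENTITY_PATTERN = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def convert_entities(text):
    """Convert known entity references to numeric references; leave the rest unchanged."""

    def replace(match):
        name = match.group(1)
        if name in ENTITY_TO_CODEPOINT:
            return "&#x%X;" % ENTITY_TO_CODEPOINT[name]
        # No known mapping: pass the reference through as-is.
        return match.group(0)

    return ENTITY_PATTERN.sub(replace, text)
```

With this table, convert_entities("G&ouml;ttingen &abc;") yields "G&#xF6;ttingen &abc;": the known reference is converted and the unknown one passes through untouched.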
When you complete your installation and testing of broker20 at your institution, you will want to register your broker20 with OAI at their website, http://www.openarchives.org/data/registerasprovider.html. This is the place to register to let harvesters know you are available for harvesting. Before registering there, however, you should try registering at the unofficial website, which is great for testing. This site will run your broker20 through a series of tests, and once it passes the tests you will be prompted to register. Select 'Test and Add an archive to this list' to test and add your broker20.
broker complies with the CGM protocol, i.e., broker will respond to the verbs described in the CGM Protocol documentation. This means that if you set up broker correctly, other institutions can search text collections at your institution, and you can search the collections of other institutions that have set up broker. For instance, if you have a TextClass collection of chemistry books and another institution has a different collection of chemistry books, the two collections could be searched from one institution's website with the tools provided with DLXS release 11 (broker, and the TextClass subclass CgmTC.pm). The following sections describe in more detail what needs to be done to get something like this to work.
In order for pageviewer to make a Dissemination request for pages from a remote repository, it needs to provide that repository with an access code when it issues a Disseminate request; therefore, you will need to create an access code for your pageviewer to use. If you are accessing a remote repository that uses DLXS, you can run the broker20_access url from your institution (for example, at UM this would be http://hti.umich.edu/cgi/b/broker20/broker20_access), and enter the IP address of the server where pageviewer resides, and an access code will be provided. If you are accessing an institution that does not use DLXS you will have to contact them to find out if they are using an access code and if so what it should be. You will then need to enter these access codes in textclass.cfg in the hash gCgmAccessCode.
Say you want to make some collections available to another institution for searching. All you need to do is edit broker20.cfg and add the collids of the collections you want to make available to the array @gSupportedSets. The one limitation is that each collection must be a level 1 collection; broker will not work with collections of higher levels.
What does the other institution have to do to get access to these collections? If they have DLXS release 11, they need to make entries in the collection manager for these collections and then make the collections part of one or more groups. Note that the CGM protocol does not presently support NOT and PROXIMITY searching, so DLXS release 11 has been configured so that if a "CGM" collection is part of a group, these operations are not presented to the user. This is in line with our philosophy of presenting a UI with functionality available to all the collections. These cgm collections are never listed in the all-collections group; they only show up in the group pages.
Here are the fields that should be filled in for cgm collections in collmgr. All other fields should be left blank.
The subclass of TextClass that these types of collections use is CgmTC.pm. This subclass uses a MySQL database, so the database needs to be set up first. Create a database named cgm and a database user with SELECT, INSERT, UPDATE, and DELETE privileges. You can then create the tables this subclass needs in the cgm database. There is a file in /l1/bin/b/broker20/ called CreateAndPopulateCGMTables that you can use to create them. You can use the following command to execute the file: mysql -u root -p < CreateAndPopulateCGMTables.txt
There are two perl files that connect to the database, and they are: CgmTC.pm and BBItemTCForCGM.pm. You will need to go into these two files and update the following variables:
One additional change is needed for your DLXS TextClass interface to function properly: you will have to create a subclass of your resident collection. DLXS has logic that removes the NOT and PROXIMITY search options if a cgm collection is part of a group. Most of this logic is in TextApp.pm, but in one situation the decision is made in the TextClass code, which is why the subclass is needed. The routine that needs to change is FilterResultsForBasicToc. Copy the one in UmhistmathTC.pm to the subclass of your resident collection, and remember to update your collmgr entry for subclassmodule to indicate this.
Finally, a cron job needs to be set up to clean up the tables in the cgm database. We have a Perl script, called every two hours, that does this for us. It lives in /l1/bin/b/broker20 and is called Purge_CGM_Database. The script removes all entries from these tables that are older than two hours.
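The purge logic amounts to deleting every row older than a two-hour cutoff. The Python sketch below shows that idea only and is not the Purge_CGM_Database script: it uses an in-memory SQLite database as a stand-in for MySQL so that it runs anywhere, and the table name cgm_results and the created_at column (epoch seconds) are hypothetical, since the real table layout is not described here.

```python
import sqlite3
import time

MAX_AGE_SECONDS = 2 * 60 * 60  # two hours


def purge_old_rows(conn, table, now=None):
    """Delete rows older than two hours from table; return how many were removed."""
    if not table.isidentifier():
        raise ValueError("unexpected table name: %r" % table)
    now = time.time() if now is None else now
    cutoff = now - MAX_AGE_SECONDS
    # The table name is validated above; the cutoff is passed as a bound parameter.
    cursor = conn.execute("DELETE FROM %s WHERE created_at < ?" % table, (cutoff,))
    conn.commit()
    return cursor.rowcount


# Stand-in database with one stale row and one fresh row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cgm_results (id INTEGER PRIMARY KEY, created_at REAL)")
now = time.time()
conn.execute("INSERT INTO cgm_results (created_at) VALUES (?)", (now - 3 * 3600,))
conn.execute("INSERT INTO cgm_results (created_at) VALUES (?)", (now - 1 * 3600,))
removed = purge_old_rows(conn, "cgm_results", now=now)
```

In this run removed is 1: only the three-hour-old row is deleted. Running such a job every two hours means no row outlives roughly four hours.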