As with our decisions about the new nameresolver2, we wanted to create a new OAI data provider that would move away from broker20 and Bibliographic Class. We were also creating the new MBooks environment, which uses metadata taken directly from our online library catalog (Mirlyn). It seemed counterintuitive to put Mirlyn metadata (essentially MARC21 data) into Bibliographic Class just to make it work in broker20, and our method for crosswalking TEI to Text Class to Bibliographic Class had always seemed substandard. We also had to handle a new rights environment for MBooks, in which some volumes are public domain and others are restricted, and there was no clean way to connect the rights database with broker20.
Consequently, we created UMProvider to hold and provide access to all our OAI metadata: the MBooks metadata as well as the DLPS/DLXS metadata from our Text Class and Image Class collections. We also decided to make it reusable: a single Perl module that connects to any relational database (e.g., MySQL) and has no requirements beyond common Perl modules (e.g., XML::LibXML, CGI, DBI).
We'll provide a brief overview of OAI before we show UMProvider.
Steps for getting started:
The UMProvider will be included in DLXS release 14 ($DLXSROOT/bin/o/oai/, $DLXSROOT/cgi/o/oai/) and is available right now on sourceforge (non-DLXS enabled): http://www.sourceforge.net/projects/umoaitoolkit/. The existing OAI Provider will continue to be distributed with DLXS. However, we encourage you to start using the UMProvider, as it is simpler to manage and conforms to the OAI specification correctly (something that broker20 never did completely).
The UM OAI Toolkit (umoaitoolkit) available from sourceforge contains the OAI-PMH harvesting scripts as well.
The first MySQL table is mandatory and stores all of the required data for the UMProvider. The second table can be used if you would like to organize your records into sets. Sets, in OAI-PMH, are used for organizing the data for selective harvesting of the content. Both tables are created as they appear below when you install DLXS release 14.
First table (oai):

+-----------+--------------+------+-----+-------------------+
| Field     | Type         | Null | Key | Default           |
+-----------+--------------+------+-----+-------------------+
| id        | varchar(150) | NO   | PRI |                   |
| timestamp | timestamp    | NO   | MUL | CURRENT_TIMESTAMP |
| oai_dc    | mediumblob   | YES  |     | NULL              |
| marc21    | mediumblob   | YES  |     | NULL              |
| mods      | mediumblob   | YES  |     | NULL              |
+-----------+--------------+------+-----+-------------------+

Second table (oaisets) optional:

+-----------+--------------+------+-----+---------+
| Field     | Type         | Null | Key | Default |
+-----------+--------------+------+-----+---------+
| id        | varchar(150) | NO   | PRI |         |
| oaiset    | varchar(32)  | NO   | PRI |         |
+-----------+--------------+------+-----+---------+
CREATE TABLE oai (
  id VARCHAR(150) PRIMARY KEY,
  timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  oai_dc MEDIUMBLOB,
  mods MEDIUMBLOB,
  marc21 MEDIUMBLOB,
  KEY `timestamp` (timestamp)
);

CREATE TABLE oaisets (
  id VARCHAR(150),
  oaiset VARCHAR(32),
  PRIMARY KEY (`id`, `oaiset`),
  KEY `oaiset` (oaiset)
);
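To illustrate how the two tables support selective harvesting by set, here is a minimal sketch. It is not the query UMProvider itself runs. It assumes SQLite (from the Python standard library) as a stand-in for MySQL, simplified column types, and made-up sample rows; the function name ids_in_set is hypothetical.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL so the sketch runs anywhere.
# Column types are simplified and the sample rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE oai (
  id        VARCHAR(150) PRIMARY KEY,
  timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  oai_dc    BLOB,
  mods      BLOB,
  marc21    BLOB
);
CREATE TABLE oaisets (
  id     VARCHAR(150),
  oaiset VARCHAR(32),
  PRIMARY KEY (id, oaiset)
);
""")

# Two records with oai_dc metadata; only rec-1 is placed in the "dlps" set.
conn.execute("INSERT INTO oai (id, oai_dc) VALUES (?, ?)", ("rec-1", b"<oai_dc:dc/>"))
conn.execute("INSERT INTO oai (id, oai_dc) VALUES (?, ?)", ("rec-2", b"<oai_dc:dc/>"))
conn.execute("INSERT INTO oaisets (id, oaiset) VALUES (?, ?)", ("rec-1", "dlps"))


def ids_in_set(conn, set_spec):
    """Return ids of records that have oai_dc metadata and belong to set_spec."""
    rows = conn.execute(
        "SELECT o.id FROM oai AS o JOIN oaisets AS s ON s.id = o.id "
        "WHERE s.oaiset = ? AND o.oai_dc IS NOT NULL ORDER BY o.id",
        (set_spec,),
    )
    return [row[0] for row in rows]


print(ids_in_set(conn, "dlps"))
```

A record with no row in oaisets (rec-2 here) is still served by an unrestricted request; it simply never appears in a set-restricted one.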
First, log onto pilsner with your workshop ID.
The only thing that needs to be changed for the CGI script ($DLXSROOT/cgi/o/oai/oai) is the information needed to connect to the database. Other than that, the sample script should work out of the box.
mysqlServer = dev.mysql.umdl.umich.edu
mysqlDbName = userX_ws
mysqlUser = dlxs
mysqlPasswd = middleware
The UMProvider configuration contains information about the repository for the Identify, ListSets and ListMetadataFormats OAI-PMH verbs. This data is not really dynamic so it is just stored in an XML configuration file.
change:
Test the configuration with a few OAI requests:
Again, "userX" should be replaced with your user ID for the workshop.
http://userX.ws.umdl.umich.edu/cgi/o/oai/oai?verb=Identify
http://userX.ws.umdl.umich.edu/cgi/o/oai/oai?verb=ListSets
http://userX.ws.umdl.umich.edu/cgi/o/oai/oai?verb=ListMetadataFormats

[ should be one DC record in the table by default ]
http://userX.ws.umdl.umich.edu/cgi/o/oai/oai?verb=ListRecords&metadataPrefix=oai_dc
http://userX.ws.umdl.umich.edu/cgi/o/oai/oai?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:userX.ws.umdl.umich.edu:MIU01-12345678
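If you would rather generate these test URLs from a script than type them, the following sketch shows one way to do so. The helper name oai_request_url is hypothetical, and BASE_URL uses the workshop host with the "userX" placeholder, which you would replace with your own ID.

```python
from urllib.parse import urlencode

# Assumption: workshop base URL; replace "userX" with your workshop user ID.
BASE_URL = "http://userX.ws.umdl.umich.edu/cgi/o/oai/oai"


def oai_request_url(verb, **params):
    """Build an OAI-PMH request URL for the given verb and arguments."""
    query = [("verb", verb)]
    query += [(key, value) for key, value in params.items() if value is not None]
    return BASE_URL + "?" + urlencode(query)


print(oai_request_url("Identify"))
print(oai_request_url("ListRecords", metadataPrefix="oai_dc"))
print(oai_request_url(
    "GetRecord",
    metadataPrefix="oai_dc",
    identifier="oai:userX.ws.umdl.umich.edu:MIU01-12345678",
))
```

Note that urlencode percent-encodes the colons in the identifier as %3A. That is an equivalent encoding of the same request, so the server should treat it the same as the literal form shown above.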
In this step we are going to load already formatted metadata (oai_dc first) using the loadOai.pl script. The data that is fed to this script for loading needs to be wrapped in a <records> element. Also, mirroring the OAI-PMH format, a <header> (containing the unique identifier) and a <metadata> element are required for each record.
Here is an example of that data:
<?xml version="1.0" encoding="UTF-8"?>
<records>
  <record>
    <header>
      <identifier>MIU01-000053324</identifier>
      <setSpec>mbooks:pd</setSpec>
    </header>
    <metadata>
      <oai_dc:dc>
        [ YOUR oai_dc DATA HERE ]
      </oai_dc:dc>
    </metadata>
  </record>
  [ MORE RECORDS HERE ]
</records>
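If your metadata comes from another system, you will need to produce this wrapper yourself. The sketch below shows one way to build it with Python's standard library. The helper names (make_dc, build_records) and the sample title are hypothetical, and the namespace declarations it emits are an assumption: the example above does not show whether loadOai.pl requires them.

```python
import xml.etree.ElementTree as ET

# Standard namespace URIs for oai_dc and Dublin Core elements.
OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("oai_dc", OAI_DC_NS)
ET.register_namespace("dc", DC_NS)


def make_dc(title):
    """Build a tiny oai_dc:dc element with a single dc:title (illustrative only)."""
    dc = ET.Element(f"{{{OAI_DC_NS}}}dc")
    ET.SubElement(dc, f"{{{DC_NS}}}title").text = title
    return dc


def build_records(items):
    """Wrap (identifier, set_spec, metadata_element) tuples in <records>."""
    root = ET.Element("records")
    for identifier, set_spec, metadata_el in items:
        record = ET.SubElement(root, "record")
        header = ET.SubElement(record, "header")
        ET.SubElement(header, "identifier").text = identifier
        if set_spec:  # the setSpec element is only written when a set is given
            ET.SubElement(header, "setSpec").text = set_spec
        ET.SubElement(record, "metadata").append(metadata_el)
    return root


root = build_records([("MIU01-000053324", "mbooks:pd", make_dc("Sample title"))])
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```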
http://userX.ws.umdl.umich.edu/cgi/o/oai/oai?verb=ListRecords&metadataPrefix=oai_dc

Here are more metadata format (marc21 and mods) examples:

# ./loadOai.pl -d $DLXSROOT/prep/s/sampleoai/marc21_samples/
# ./loadOai.pl -d $DLXSROOT/prep/s/sampleoai/mods_samples/

http://userX.ws.umdl.umich.edu/cgi/o/oai/oai?verb=ListRecords&metadataPrefix=marc21
http://userX.ws.umdl.umich.edu/cgi/o/oai/oai?verb=ListRecords&metadataPrefix=mods
loadOai.pl also allows you to force the records at the time of loading into a specified set.
# ./loadOai.pl -d $DLXSROOT/prep/s/sampleoai/oai_dc_samples/ -s dlps

http://userX.ws.umdl.umich.edu/cgi/o/oai/oai?verb=ListRecords&metadataPrefix=oai_dc&set=dlps
Tip: If your data is in broker20 already, you can use the OAI harvester to collect your data. Then, change the $recordXpath (see below) to load any OAI-PMH ListRecords response from a file.
## optional config -- xpath to find records
my $recordXpath = "/OAI-PMH/ListRecords/record";
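To see what a path like the $recordXpath value above selects, here is a rough Python sketch that pulls the records out of a harvested ListRecords response. It is not how loadOai.pl works internally. The sample response and the function name harvested_identifiers are made up for illustration. One caveat: OAI-PMH responses place their elements in the http://www.openarchives.org/OAI/2.0/ namespace, so this sketch maps a prefix to it explicitly.

```python
import xml.etree.ElementTree as ET

# OAI-PMH 2.0 response elements live in this namespace.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Assumption: a trimmed, invented ListRecords response used only for illustration.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:rec-1</identifier></header>
      <metadata/>
    </record>
    <record>
      <header><identifier>oai:example.org:rec-2</identifier></header>
      <metadata/>
    </record>
  </ListRecords>
</OAI-PMH>"""


def harvested_identifiers(xml_text):
    """Return the identifier of every record under OAI-PMH/ListRecords."""
    root = ET.fromstring(xml_text)  # root is the OAI-PMH element itself
    records = root.findall("oai:ListRecords/oai:record", OAI_NS)
    return [
        rec.findtext("oai:header/oai:identifier", namespaces=OAI_NS)
        for rec in records
    ]


print(harvested_identifiers(SAMPLE))
```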
started xpat: $DLXSROOT/idx/a/alajournals/alajournals.dd
executing: pr.region.HEADER region HEADER
executing: Stop

If you had trouble running the ExtractHeaders.pl script, there will be no output in the log.
mv *headers.xml $DLXSROOT/prep/o/oai/headers
./ConvertToDc.pl -c exampleColls.xml -d $DLXSROOT/prep/o/oai/headers
parsing dynamic collection alajournals
executing xsltproc -o $DLXSROOT/prep/o/oai/provider/alajournals-dc.xml --param collid "'alajournals'" --param lang "'eng'" --param type "'DLPS'" textClassToDc.xsl $DLXSROOT/prep/o/oai/headers/alajournals-headers.xml
Below is some example XSLT code from textClassToDc.xsl that maps the title from Text Class to the dc:title field.
<xsl:for-each select="FILEDESC/SOURCEDESC/BIBLFULL/TITLESTMT/TITLE">
  <xsl:if test="normalize-space(.)">
    <dc:title>
      <xsl:apply-templates select="."/>
    </dc:title>
    <xsl:call-template name="lineBreak"/>
  </xsl:if>
</xsl:for-each>
./LoadDB.pl -d $DLXSROOT/prep/o/oai/provider -c exampleColls.xml -p
http://userX.ws.umdl.umich.edu/cgi/o/oai/oai?verb=ListSets

to see the list of sets in the repository. You'll see that there are 8 sets: dlps, dlpstext, dlps:collid (3), and dlpstext:collid (3). This set structure is optional. We chose to organize our sets this way so that a harvester could request all dlps collections or only the images or only the texts.

http://userX.ws.umdl.umich.edu/cgi/o/oai/oai?verb=ListRecords&metadataPrefix=oai_dc&set=dlps:alajournals
http://userX.ws.umdl.umich.edu/cgi/o/oai/oai?verb=ListRecords&metadataPrefix=oai_dc&set=dlpstext:emerson
http://userX.ws.umdl.umich.edu/cgi/o/oai/oai?verb=ListRecords&metadataPrefix=oai_dc&set=dlpstext:conraditc
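The hierarchical set structure can be pictured with a short sketch. It assumes the three demo collections are alajournals, emerson, and conraditc (the collids that appear in the set URLs here) and that all three are text collections. The function name set_specs is hypothetical, and this is not the logic LoadDB.pl actually uses.

```python
def set_specs(collids, top="dlps", by_type="dlpstext"):
    """List the setSpec values for a group of text collections.

    Each collection is reachable through the top-level set, the
    type-level set, and one collection-level set under each of those.
    """
    specs = [top, by_type]
    specs += [f"{top}:{collid}" for collid in collids]
    specs += [f"{by_type}:{collid}" for collid in collids]
    return specs


# Assumption: the three demo collections, all treated as text collections.
demo_sets = set_specs(["alajournals", "emerson", "conraditc"])
for spec in demo_sets:
    print(spec)
```

With three collections this yields the 8 sets: two parent sets plus three collection-level sets under each parent.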
In order to obtain the article-level metadata for the serial collections, we use the whole Text Class file in addition to the header files from xpat. We also needed to account for exceptions in how the volume and article data is organized so that our oai_dc data was cleanly formatted.
Identifiers cannot have colons in OAI. Some of our serial collections use colons to indicate article identifiers (e.g. 0522508.0001.001:1). We had to replace these with dashes to be OAI-PMH compatible. There were also some instances where different identifier types were used (acc.no vs. dlps).
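The colon-to-dash substitution described above amounts to something like the following sketch. The function names are hypothetical, and the repository host in the usage example is the workshop placeholder rather than a real value.

```python
def oai_safe_local_id(local_id):
    """Replace colons in a local (collection-level) identifier with dashes."""
    return local_id.replace(":", "-")


def oai_identifier(repository, local_id):
    """Compose a full identifier of the form oai:<repository>:<local id>."""
    return f"oai:{repository}:{oai_safe_local_id(local_id)}"


print(oai_safe_local_id("0522508.0001.001:1"))
print(oai_identifier("userX.ws.umdl.umich.edu", "0522508.0001.001:1"))
```

One thing to keep in mind with this kind of rewrite is that it is not reversible on its own if a local identifier can already contain dashes, so any process that maps OAI identifiers back to source records needs to know which collections were rewritten.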
Some of the older, static collections are coded in SGML instead of XML. Since these collections are not modified often, we used the Bib Class files for the transformation instead of the Text Class files.
We have one collection with the Scholarly Publishing Office that has 150 sub-collections. Rather than list all of the sub-collections in the configuration XML file (exampleColls.xml in our demo), we list only the base collection with the collid llmc. The script will then process all of the sub-collections within the llmc directory, e.g. $DLXSROOT/obj/l/llmc/subcoll1, $DLXSROOT/obj/l/llmc/subcoll2, etc.
For some Image Class collections, the title, subject, and description are identical and the IDs are similar. To distinguish the records, we appended the view (e.g. front, back, side) to the title. The collection scltinteric is one such example.
The UMProvider: Process Flows and Examples presentation contains flow diagrams of the weekly automated processes for checking for updated records and new collections on slides 5 and 6.
View an example of the oai update report. The Perl script that generates the content of the report is at $DLXSROOT/bin/o/oai/provider/GenerateReport.pl. To change the email addresses to which the report is sent, you must edit $DLXSROOT/bin/o/oai/provider/text_ic_oai_cron.pl.
View an example of the new collection report. The Perl script containing the content and email addresses for this report is at $DLXSROOT/bin/o/oai/provider/GetNewCollections.pl.