Indexing TextClass and FindaidClass data will be covered in detail during the Text Class Data Preparation and FindaidClass Data Preparation sections.
A full list of XPAT commands can be found at: http://quod.lib.umich.edu/sgml/pat/pat50manual.html
To invoke collmgr: http://______.ws.umdl.umich.edu/cgi/c/collmgr/collmgr (replace ______ with your account user id).
XPAT indexes strings (semi-infinite strings) rather than words. Consider this text:
... called Kitchee-Gumeeng, also great lake. The words Mitchee and Kitchee both seem to mean the same thing, great, large; whether there is a shade of difference in applying ...
Searching for same thing great will retrieve the string beginning with "same" and followed by "thing great". However, searching for same great will not retrieve anything. XPAT searches for strings, anchored at index points, that match up to virtually the end of the document.
Index points are offsets into the text where XPAT looks for matches. Generally, index points are characters following spaces.
Searching for several words with XPAT is implicitly a phrase search. Searching for "same" AND/OR "great" requires the use of boolean operators (^ and +) and regions.
XPAT compresses multiple spaces in the text to a single space when indexing and searching.
XPAT also can perform case mapping and character mapping. This is specified in the data dictionary (.dd) file.
This section treats issues of character encoding as they apply to XPAT and mentions a few tools we've written that you may find useful. There is also an expanded treatment of this subject.
Some Unicode / XPAT facts:
There are many reasons to use Unicode.
We deliver a few locally developed tools you may find useful.
This section applies only to XPAT-based classes: TextClass, FindaidClass. ImageClass is MySQL-based. More on this when we cover data preparation for each class.
Here's a look at the resulting files:
ls -al /l1/workshop/pfarber/dlxs/idx/s/sampletc_utf8
-rw-rw-r-- 1 pfarber dlps    816 Jun 16 2005 div1head.rgn
-rw-rw-r-- 1 pfarber dlps    576 Jun 16 2005 div2head.rgn
-rw-rw-r-- 1 pfarber dlps    528 Jun 16 2005 id.rgn
-rw-rw-r-- 1 pfarber dlps    528 Jun 16 2005 mainauthor.rgn
-rw-rw-r-- 1 pfarber dlps    528 Jun 16 2005 maindate.rgn
-rw-rw-r-- 1 pfarber dlps    528 Jun 16 2005 mainheader.rgn
-rw-rw-r-- 1 pfarber dlps    528 Jun 16 2005 main.rgn
-rw-rw-r-- 1 pfarber dlps    528 Jun 16 2005 maintitle.rgn
-rw-rw-r-- 1 pfarber dlps   6704 Apr  4 2007 page.rgn
-rw-rw-r-- 1 pfarber dlps   6704 Apr  4 2007 page-t.rgn
-rw-rw-r-- 1 pfarber dlps 138040 Jun 16 2005 sampletc_utf80.rgn
-rw-rw-r-- 1 pfarber dlps  50907 Apr  5 2007 sampletc_utf8.dd
-rw-rw-r-- 1 pfarber dlps 968452 Apr  4 2007 sampletc_utf8.idx
-rw-rw-r-- 1 pfarber dlps      0 Jan 30 2004 sampletc_utf8.init
The data dictionary is an XML file in a collection subdirectory of the idx directory. It ties all the index files together and holds the specifications for index points and character mappings.
Here's a bit of the section of the data dictionary that specifies the index points, i.e. the points in your data where XPAT will look for matches to your query string:
<IndexPoints>
  <IndexPt> &printable.</IndexPt>
  <IndexPt>&printable.-</IndexPt>
  <IndexPt>-&printable.</IndexPt>
  <IndexPt> &Latin.</IndexPt>
  <IndexPt>&Latin.-</IndexPt>
  <IndexPt>-&Latin.</IndexPt>
  <IndexPt> &Greek.</IndexPt>
  <IndexPt>&Greek.-</IndexPt>
  <IndexPt>-&Greek.</IndexPt>
</IndexPoints>
Note the metacharacters like &printable., &Latin. or &Greek. that represent all characters from one of the blocks of Unicode Plane 0. Index point metacharacters are based on the Unicode block definitions and the Perl Unicode library (e.g. lib/5.8.3/unicore/lib/Latin.pl), modified as described in the XPAT data dictionary document.
Here's a bit of the section of the data dictionary where character mapping is specified. Refer to the Unicode character database. This is mainly used for upper to lower case mapping for alphabets that have case.
...
<Map><From>!</From><To> </To></Map>
<Map><From>"</From><To> </To></Map>
<Map><From>$</From><To> </To></Map>
<Map><From>%</From><To> </To></Map>
...
<Map><From>U+0391</From><To>U+03B1</To></Map>
<Map><From>U+0392</From><To>U+03B2</To></Map>
<Map><From>U+0393</From><To>U+03B3</To></Map>
<Map><From>U+0394</From><To>U+03B4</To></Map>
<Map><From>U+0395</From><To>U+03B5</To></Map>
...
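To make the effect of these Map entries concrete, here is a small Python sketch that reads a handful of them and applies them to a string. It is an illustration under assumptions, not how XPAT itself processes the data dictionary: the wrapper element Mappings and the function names are hypothetical, and only the U+XXXX notation and literal characters seen in the excerpt are handled.

```python
import xml.etree.ElementTree as ET

# A few entries copied from the excerpt; the <Mappings> wrapper is hypothetical.
MAP_XML = """<Mappings>
  <Map><From>!</From><To> </To></Map>
  <Map><From>%</From><To> </To></Map>
  <Map><From>U+0391</From><To>U+03B1</To></Map>
  <Map><From>U+0392</From><To>U+03B2</To></Map>
</Mappings>"""

def decode(value):
    """Turn 'U+0391' notation into the character itself; pass literals through."""
    if value.startswith("U+"):
        return chr(int(value[2:], 16))
    return value

def load_map(xml_text):
    """Build a {from_char: to_char} dictionary from the <Map> entries."""
    root = ET.fromstring(xml_text)
    return {decode(m.findtext("From")): decode(m.findtext("To"))
            for m in root.iter("Map")}

def apply_map(text, char_map):
    """Replace each mapped character; leave all others unchanged."""
    return "".join(char_map.get(ch, ch) for ch in text)

char_map = load_map(MAP_XML)
mapped = apply_map("\u0391\u0392!", char_map)  # Greek capitals Alpha, Beta, then "!"
```

With these four entries, the two Greek capitals come back as their lowercase forms and the exclamation mark comes back as a space, which is the same kind of folding that lets a query match regardless of case or punctuation.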
Finally, here's an example of a full data dictionary.
XPAT indexes XML regions (via xmlrgn), allowing searching of text within regions, regions including text or other regions, etc. Consider this diagram of the kind of regions XPAT can index.
|<------------------------- region FAMNAME --------------------------->|
|                                                                      |
<FAMNAME SOURCE="lcnaf" ENCODINGANALOG="100">Whittemore Family</FAMNAME>
|        |            |                     |
|      ->|            |<- region "A-SOURCE" |
|                                           |
|<------------ region "FAMNAME-T" --------->|
There are three kinds of regions:
To start an interactive session with XPAT, enter xpatu (for UTF-8 data indexing/searching) along with the name of the data dictionary (dd) file:
% xpatu $DLXSROOT/idx/s/sampletc_utf8/sampletc_utf8.dd
In XPAT, a point is a unique byte offset in the full text where XPAT has indexed a string. Enter a string, or a byte offset in square brackets, and a set of points is returned:
>> "prince"
  1: 134 matches
>> "prince "
  2: 123 matches
>> sample
  3: 10 matches
>> pr
  539939, ..was said that Prince Alexander of Battenberg had changed into a ..
  957348, ..e only child, Prince Alexander, who came in before we went to ta..
  1390470, ..TEM>Bismarck, Prince, and the Austro-German alliance ~ <REF>xxiv..
  552103, ..alliance that Prince Bismarck, in 1879, entered into the very cl..
  208247, .. sceptre d'un prince de religion orthodoxe.</P> <P> <..
  1016444, ..n the streets Prince Michael and Teresia, 20 to 30 dinars toward..
  943446, ..ian statue of Prince Michael, whose name and portrait are found ..
  483031, ..la volonté du prince Nicolas, ses résolutions personnelles au su..
  1411801, ..udolph, Crown Prince, Popularity of ~ <REF>69</REF> </ITEM..
  1141121, ..raged it. The Prince suspected nothing of what was taking place ..
>> "emile "
  4 : 9 matches
>> "Émile "
  5 : 9 matches
>> [290947]
  6 : one match
The first query finds all "semi-infinite strings" that begin with "prince"; the second finds those that are "prince" exactly (followed by a space, or by anything that has been mapped to a space). The "emile" queries demonstrate character mapping and case mapping. The last finds the string beginning at byte offset 290947.
A region in XPAT is a span of text comprising zero or more bytes. xmlrgn or multirgn (discussed in the TextClass Collection Implementation/Indexing Section) handles the creation of these regions.
To find how many of a particular region type exist, enter region plus the name of the region (double quotes are needed if the name contains non-alphanumeric characters).
>> region "DIV1"
  1: 38 matches
>> region "A-NODE"
  2: 46 matches
Also see the {ddinfo regionnames} command.
Also see the history command.
Any collection of points or regions can be grouped together in a set. Sets can be combined or split with XPAT's boolean operators. All sets created during a session have a unique numeric identifier. They can be given names (name = ), printed out (pr), saved, or exported (useful in the creation of "fabricated regions"). Here are just a few examples:
>> long
  1: 244 matches
>> help
  2: 54 matches
>> 1 + 2
  3: 298 matches
>> "alternate"
  4: 5 matches
>> pr 4
  1175485, ..most from the alternate advance and retreat of the Russian and T..
  1165090, ..in. Vineyards alternated with fields of barley, oats, and maize;..
  967310, ..men and women alternately; <EPB/> <PB REF="00000208.tif" S..
  1313659, ..a and Austria alternately. But, when able to repel aggression, s..
  1303571, .. each country alternately. It should be composed of three secti..
>> mysearch = "pair"
  5: mysearch = 3 matches
>> pr *mysearch
  1170568, ..and a half; a pair of buffaloes, 600 francs (£24).</P> <P>B..
  848085, ..s dress was a pair of large Turkish trousers of white wool, a sh..
  1085132, ..nd thick; two pairs of oxen drew it by means of a pole which was..
Also see the subset command.
Also see the {sortorder} setting.
Also see other operators and relations.
The pr command is the heart of viewing sets. In an interactive XPAT session, it lets you view the results you've searched for. Within the middleware, getting the data back from XPAT is the first step; next there is a small amount of manipulation of the XML that is returned from XPAT queries; finally conversion to HTML is done via XSLT stylesheets.
Note: The save command is, in a sense, the same as the pr command: pr displays to STDOUT, save outputs (appends) to a file whose name is given by {savefile}. The format of the output is the same.
Using some basic XPAT operators, we can build some very specific searches that take advantage of the XML markup. Here is an actual example from the TextClass implementation.
Consider this (edited) XML of a TEI header element and note the highlighted portion:
<HEADER>
  <FILEDESC>
    <TITLESTMT>
      <TITLE TYPE="245"> The Balkan Peninsula, / by Émile de Laveleye</TITLE>
      <AUTHOR> Laveleye, Emile de, 1822-1892. </AUTHOR>
    </TITLESTMT>
    <PUBLICATIONSTMT>
      <PUBLISHER>DLPS ...</PUBLISHER>
      <IDNO TYPE="dlps">abe5413.0001.001</IDNO>
    </PUBLICATIONSTMT>
    <SOURCEDESC>
      <BIBLFULL>
        <TITLESTMT>
          <TITLE TYPE="main"> The Balkan Peninsula, / by Émile de Laveleye</TITLE>
          <AUTHOR> Laveleye, Emile de, 1822-1892. </AUTHOR>
          <AUTHOR> Thorpe, Mary, Mrs., tr. </AUTHOR>
        </TITLESTMT>
        <PUBLICATIONSTMT>
          <PUBLISHER>G. P. Putnam's sons,</PUBLISHER>
        </PUBLICATIONSTMT>
      </BIBLFULL>
    </SOURCEDESC>
  </FILEDESC>
  <ENCODINGDESC> ... </ENCODINGDESC>
  <PROFILEDESC> ... </PROFILEDESC>
</HEADER>
The following query is actually the basis for the fabricated region called mainauthor in most of our text collections.
>> ((region AUTHOR within (region TITLESTMT within region FILEDESC)) not within (region SOURCEDESC))
  6: 2 matches
>> pr.region.6
  235, ..<AUTHOR> Yriarte, Charles, 1832-1898. </AUTHOR> ..
  513768, ..<AUTHOR> Laveleye, Emile de, 1822-1892. </AUTHOR>..
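The logic of the mainauthor query can be modelled in Python by treating each region as a (start, end) pair of byte offsets. This is a sketch of the containment semantics only, under the assumption that within, not within and incl are plain interval-containment tests; the function names and the offsets are hypothetical and stand in for one header like the one shown above.

```python
def contains(outer, inner):
    """True if the region `inner` lies entirely inside the region `outer`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]

def within(candidates, containers):
    """Candidates that lie inside at least one container region."""
    return [c for c in candidates if any(contains(o, c) for o in containers)]

def not_within(candidates, containers):
    """Candidates that lie inside none of the container regions."""
    return [c for c in candidates if not any(contains(o, c) for o in containers)]

def incl(containers, candidates):
    """Containers that enclose at least one candidate region."""
    return [o for o in containers if any(contains(o, c) for c in candidates)]

# Hypothetical offsets for one header: the second TITLESTMT sits inside SOURCEDESC.
FILEDESC = [(0, 1000)]
SOURCEDESC = [(400, 900)]
TITLESTMT = [(10, 200), (500, 700)]
AUTHOR = [(100, 150), (600, 640), (650, 690)]

# ((region AUTHOR within (region TITLESTMT within region FILEDESC))
#     not within (region SOURCEDESC))
main_author = not_within(within(AUTHOR, within(TITLESTMT, FILEDESC)), SOURCEDESC)
```

Only the AUTHOR in the first TITLESTMT survives: the two authors inside SOURCEDESC pass the first test but are removed by not within, which is the point of the query.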
Let's say we want a slice of the title(s) of the chapters in a given volume that contain hits for a user's search for the word prince. We construct the query in stages.
>> hitssearch = ("prince " + "prince<")
  1: hitssearch = one match
>> chapters = (region DIV1 incl (region "A-TYPE" incl "chapter")) incl *hitssearch
  2: chapters = 14 matches
>> excludedheadregions = (region LIST) + (region FIGURE) + (region DIV2)
  3: excludedheadregions = 25 matches
>> chapterheads = (region HEAD within *chapters) not within *excludedheadregions
  4: chapterheads = 14 matches
>> volume = region main incl (region HEADER incl (region IDNO incl "abe5413.0001.001"))
  5: volume = one match
>> volumechapterheads = *chapterheads within *volume
  6: volumechapterheads = 11 matches
>> volumechapterheadsslice = subset.1.5 *volumechapterheads
  7: volumechapterheadsslice = 5 matches
>> pr.region.HEAD *volumechapterheadsslice
  523428, ..<HEAD>INTRODUCTORY CHAPTER.<LB/>THE PRESENT POSITION OF BULGARIAN AFFAIRS.</HEAD>..
  557986, ..<HEAD>CHAPTER I.<LB/>VIENNA—MINISTERS AND FEDERALISM.</HEAD>..
  600631, ..<HEAD>CHAPTER II.<LB/>BISHOP STROSSMAYER.</HEAD>..
  707081, ..<HEAD>CHAPTER III.<LB/>HISTORY AND RURAL ECONOMY OF BOSNIA.</HEAD>..
  819018, ..<HEAD>CHAPTER IV.<LB/>BOSNIA—ITS SOURCES OF WEALTH, ITS INHABITANTS, AND RECENT PROGRESS.</HEAD>..
A fabricated region is a "virtual" region that has been indexed. You can use any valid XPAT query to create a result set. Then, with the {export} command, you can have XPAT create a binary index of the points in the result.
There are two basic reasons to do this:
Once the fabricated regions are created and indexed, they can be searched for and printed just like any other region.
We've actually already seen an example of a region that could be made into a fabricated region in the last section. Recall these two named regions:
>> excludedheadregions = (region LIST) + (region FIGURE) + (region DIV2)
  3: excludedheadregions = 25 matches
>> chapterheads = (region HEAD within *chapters) not within *excludedheadregions
  4: chapterheads = 14 matches
We could make the named query chapterheads into a fabricated region with the {export} and {exportfile} commands as follows:
{exportfile "/l1/idx/s/sample/chapterheads.rgn"}; export *chapterheads; ~sync "chapterheads";
Another example of an important fabricated region in TextClass and FindaidClass is maindate.
>> region maindate
  1: 2 matches
>> pr.region.maindate region maindate
  1181, ..<DATE>1876.</DATE>..
  514996, ..<DATE>1887.</DATE>..
For more examples and discussion of fabricated regions, see: Fabricated Regions.
The most likely queries you may need to debug are those involving fabregions because those will be queries you construct yourself as opposed to the hard-coded queries in the middleware. Nonetheless, this technique is useful when debugging any involved query.
The idea is simple. Start XPAT at the command line and submit the sub-queries of the full query until you find one that does not return the result you expect. To see the queries submitted to XPAT, append ;debug=search to the end of your URL and copy and paste the query strings into the XPAT command-line prompt, submitting named queries before submitting queries that refer to them. Here's an example from the sampletc_utf8 collection.
For more information about all XPAT commands, see the regular DLXS documentation about XPAT.
<Sync>string</Sync>
The default mode, in an interactive XPAT session, is "quietoff". This gives the results messages you have seen so far: numbered sets, byte offsets followed by snippets of SGML with ".." on either end, etc. Another mode, and the most useful for interacting with XPAT programmatically, is "quieton raw". Nothing seems to happen when one enters:
>> {quieton raw}
However, entering queries now produces results that are tagged in a way that is easily parsable from within a program. First enter an earlier point search:
firstsearch = ("Branivoj " + "Branivoj<")
<SSize>1</SSize>
pr
<PSet><Start>313615</Start><Raw><Size>64</Size>res du nom de Branivoj s'emparent du territoire qu'ils gouvernen</Raw></PSet>
Now enter an earlier region search:
((region AUTHOR within (region TITLESTMT within region FILEDESC)) not within (region SOURCEDESC))
<SSize>4</SSize>
pr.region.AUTHOR
<RSet><Start>143</Start><End>178</End><Raw><Size>36</Size> <AUTHOR>Holbach, Maude M. </AUTHOR></Raw><Start>298344</Start> <End>298391</End><Raw><Size>48</Size><AUTHOR>Yriarte, Charles, 1832-1898. </AUTHOR></Raw> <Start>792438</Start><End>792487</End><Raw><Size>50</Size> <AUTHOR>Laveleye, Emile de, 1822-1892. </AUTHOR></Raw><Start>1689410</Start> <End>1689486</End><Raw><Size>77</Size> <AUTHOR>Sebright, Georgina Mary Muir (Mackenzie), Lady, d. 1874- </AUTHOR></Raw></RSet>
Some of these tags are self-explanatory (e.g., SSize = set size). But some may need a bit of explanation.
XPAT's ability to return results with tags allows a program to parse the results into pieces. In the DLXS Middleware this is done by a group of DLXS Perl modules. These modules have methods to let the CGI program interact with XPAT (an XPAT process is forked off by the CGI program and queries can be made of it at any time). The main object the code uses is the xpat object. It has methods for making queries in different ways and for interacting with the forked off XPAT process.
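As a rough illustration of what parsing the tagged output involves, here is a short Python sketch that pulls the set size and the individual region results out of "quieton raw" output like the examples above. It is not the DLXS middleware, which does this in Perl; the function names are hypothetical, and the regular expression assumes the raw text never contains the string </Raw>. A more careful parser would use the Size value to read exactly that many bytes.

```python
import re

def set_size(output):
    """Return the number in <SSize>...</SSize>, or None if it is absent."""
    match = re.search(r"<SSize>(\d+)</SSize>", output)
    return int(match.group(1)) if match else None

RESULT = re.compile(
    r"<Start>(\d+)</Start>\s*<End>(\d+)</End>\s*"
    r"<Raw><Size>(\d+)</Size>(.*?)</Raw>",
    re.S,
)

def region_results(output):
    """Return (start, end, raw_text) for each result found in an <RSet>."""
    return [(int(start), int(end), raw)
            for start, end, _size, raw in RESULT.findall(output)]

# Input adapted from the pr.region.AUTHOR output shown above.
raw_output = ("<RSet><Start>143</Start><End>178</End><Raw><Size>36</Size>"
              "<AUTHOR>Holbach, Maude M. </AUTHOR></Raw></RSet>")

size = set_size("<SSize>4</SSize>")
results = region_results(raw_output)
```

Each tuple gives the byte offsets of the region and the markup XPAT returned for it, which is the information a program needs before it transforms the XML for display.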
Here is some code (from TextClass.pm) that illustrates how the middleware uses a method of the Perl-based XPAT object (created in an earlier part of the code).
...
my $query = qq{(region mainheader incl ( $idnorgn incl "$idno" ) );};
my ( $error, $result) = $xpat->GetSimpleResultsFromQuery( $query );
if ( $error )
{
    &DlpsUtils::errorBail( qq{Query error in FindXPATContainingIdno: $result} );
}

&DlpsUtils::StripAllRSetCruft( \$result );
$result =~ m,<SSize>(\d+)</SSize>,;
my $hit = $1;

if ( $hit > 0 )
{
    $returnXpat = $xpat;
    last;
}
...
While some code, such as this, makes a query via a simple method, most queries in the middleware are actually made by other means, through other objects and their methods. Once data has been prepared according to the DLXS Class DTDs, in terms of searching, the middleware can be thought of as an engine that simply "runs" the data.