Indexing will be covered in detail during the Text Class Data Preparation section.
A full list of XPat commands can be found at: http://www.hti.umich.edu/sgml/pat/pat50manual.html
Start XPat by entering xpat along with the name of the data dictionary (dd) file:
% xpat $DLXSROOT/idx/s/sampletc/sampletc.dd
>> "prince"
1: 375 matches
>> "prince "
2: 304 matches
>> [290947]
3: one match
The first query finds all semi-infinite strings (sistrings) that begin with
"prince"; the second, with its trailing space, finds only those in which
"prince" is followed by a space, which excludes longer words such as
"princess"; the third finds the sistring beginning at byte offset 290947.
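The difference between the first two queries can be sketched outside XPat. The following Python fragment is only an illustration under stated assumptions: it scans a made-up sample string instead of consulting XPat's index, the function name sistring_matches is hypothetical, and it ignores any character mapping XPat may apply. Offsets are character positions, which coincide with byte offsets for this ASCII sample.

```python
def sistring_matches(text: str, query: str) -> list[int]:
    """Return the offset of every sistring of `text` that begins with `query`.

    A sistring is the text from a given offset to the end of the document,
    so "begins with `query`" simply means `query` occurs at that offset.
    """
    offsets = []
    start = text.find(query)
    while start != -1:
        offsets.append(start)
        start = text.find(query, start + 1)
    return offsets


# Made-up sample text (not from the sampletc collection).
sample = "The prince met the princess; a prince indeed."

prefix_hits = sistring_matches(sample, "prince")   # also matches "princess"
word_hits = sistring_matches(sample, "prince ")    # trailing space required

print(prefix_hits)  # [4, 19, 31]
print(word_hits)    # [4, 31]
```

The query with the trailing space returns fewer hits because the occurrence inside "princess" no longer qualifies, which mirrors the drop from 375 to 304 matches in the session above.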
To find how many of a particular region type exist, enter region plus the name of the region (double quotes are needed if the name contains non-alphanumeric characters).
>> region "DIV1"
1: 120 matches
>> region "A-NODE"
2: 132 matches
Also see the {ddinfo regionnames} command.
Also see the history command.
>> long
1: 532 matches
>> help
2: 133 matches
>> 1 + 2
3: 665 matches
>> "subsequently"
4: 5 matches
>> pr 4
819525, ..eparture, and subsequently confirmed in their position by the So..
2764281, ..ra, and often subsequently during our stay, we walked on the mou..
2936185, .. Kara George, subsequently he returned, but unexpectedly, and at..
201591, .., whom we met subsequently, however, at Castelnuovo, seemed to r..
2104209, .. of Russia. Subsequently, however, they showed more discrimin..
>> mysearch = "lasting"
5: mysearch = 2 matches
>> pr *mysearch
1380924, ..tion, nothing lasting could be established. The Servians were de..
2465605, .. room. After lasting out five hundred years !</P><P>Perhaps a l..
Also see the subset command.
Also see the {sortorder} setting.
Also see other operators and relations.
Using some basic XPat operators, we can build very specific searches that take advantage of the SGML markup. Here is an actual example from the TextClass implementation: the following query is the basis for the fabricated region called mainauthor in most of our text collections. Note that this query depends on knowing the structure of the document's markup (in the case of TextClass documents, the regions here are essentially the same as in the TEIHEADER of the TEI.2 DTD).
>> ((region AUTHOR within (region TITLESTMT within region FILEDESC)) not within (region SOURCEDESC))
6: 4 matches
>> pr.region.6
143, ..<AUTHOR> Holbach, Maude M. </AUTHOR>..
298344, ..<AUTHOR> Yriarte, Charles, 1832-1898. </AUTHOR>..
792438, ..<AUTHOR> Laveleye, Emile de, 1822-1892. </AUTHOR>..
1689410, ..<AUTHOR> Sebright, Georgina Mary Muir (Mackenzie), Lady, d. 1874- </AUTHOR>..
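How the within and not within relations combine in this query can be sketched with regions modelled as (start, end) offset pairs. This is a minimal Python illustration, not XPat's implementation: the helper names are hypothetical, all offsets are invented, and it assumes that "within" means a region lies entirely inside some member of the other set.

```python
Region = tuple[int, int]  # (start, end) byte offsets, both inclusive


def contains(outer: Region, inner: Region) -> bool:
    """True if `inner` lies entirely inside `outer`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def within(inner: list[Region], outer: list[Region]) -> list[Region]:
    """Members of `inner` that fall inside at least one member of `outer`."""
    return [r for r in inner if any(contains(o, r) for o in outer)]


def not_within(inner: list[Region], outer: list[Region]) -> list[Region]:
    """Members of `inner` that fall inside no member of `outer`."""
    return [r for r in inner if not any(contains(o, r) for o in outer)]


# Invented offsets for one header: the FILEDESC holds a main TITLESTMT and a
# SOURCEDESC, and the SOURCEDESC carries its own TITLESTMT and AUTHOR.
filedesc = [(100, 500)]
titlestmt = [(120, 200), (320, 400)]
sourcedesc = [(300, 480)]
author = [(143, 178), (350, 380)]

main_authors = not_within(
    within(author, within(titlestmt, filedesc)),
    sourcedesc,
)
print(main_authors)  # [(143, 178)]
```

The final not within step is what discards the author recorded for the source edition, leaving only the author in the main title statement.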
Here we construct a query to return a PSet consisting of hits on a user-entered search term. We want to display a line containing the immediate context of the hit and also a title from an enclosing division.
The query for the user's search is simply:
>> firstsearch = ("Branivoj " + "Branivoj<")
7: firstsearch = one match
To get a division title for the hit we need to build up regions based on the hit:
>> slicesearch = subset.1.25 *firstsearch
8: slicesearch = one match
>> mainslicesearch = (region "DLPSTEXTCLASS" incl *slicesearch)
9: mainslicesearch = one match
>> mainheader = (region "HEADER" within *mainslicesearch)
10: mainheader = one match
Finally, to view the content of the region we have constructed, we enter:
>> pr.region."HEADER" (region *mainheader)
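The chain of sets built above can be mimicked in a few lines of Python. This is a sketch under assumptions, not a description of XPat internals: it assumes subset.1.25 keeps at most the first 25 members of a set, that incl selects regions containing at least one member of the other set, and that within selects regions lying inside one; the helper names and every offset are hypothetical.

```python
Region = tuple[int, int]  # (start, end) byte offsets, both inclusive


def first_n(members: list, count: int) -> list:
    """Assumed behaviour of subset.1.<count>: keep the first `count` members."""
    return members[:count]


def including(regions: list[Region], points: list[int]) -> list[Region]:
    """Regions that contain at least one of the given point offsets."""
    return [r for r in regions if any(r[0] <= p <= r[1] for p in points)]


def within(inner: list[Region], outer: list[Region]) -> list[Region]:
    """Regions of `inner` lying entirely inside some region of `outer`."""
    return [
        r for r in inner
        if any(o[0] <= r[0] and r[1] <= o[1] for o in outer)
    ]


# Hypothetical collection with two texts, each with its own HEADER.
texts = [(0, 299_999), (300_000, 700_000)]
headers = [(10, 2_000), (300_010, 302_500)]
firstsearch = [313_615]  # one hit, given as a point offset

slicesearch = first_n(firstsearch, 25)
mainslicesearch = including(texts, slicesearch)
mainheader = within(headers, mainslicesearch)

print(mainheader)  # [(300010, 302500)]
```

Working outward from the hit to its enclosing text, and then back inward to that text's header, is what ties the displayed title to the right document.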
See also viewing sets.
The default mode, in an interactive XPat session, is "quietoff". This gives the results messages you have seen so far: numbered sets, byte offsets followed by snippets of SGML with ".." on either end, etc. Another mode, and the most useful for interacting with XPat programmatically, is "quieton raw". Nothing seems to happen when one enters:
>> {quieton raw}
However, entering queries now produces results that are tagged in a way that is easily parsable from within a program. First enter an earlier point search:
firstsearch = ("Branivoj " + "Branivoj<")
<SSize>1</SSize>
pr
<PSet><Start>313615</Start><Raw><Size>64</Size>res du nom de Branivoj s'emparent du territoire qu'ils gouvernen</Raw></PSet>
Now enter an earlier region search:
((region AUTHOR within (region TITLESTMT within region FILEDESC)) not within (region SOURCEDESC))
<SSize>4</SSize>
pr.region.AUTHOR
<RSet><Start>143</Start><End>178</End><Raw><Size>36</Size> <AUTHOR>Holbach, Maude M. </AUTHOR></Raw>
<Start>298344</Start><End>298391</End><Raw><Size>48</Size><AUTHOR>Yriarte, Charles, 1832-1898. </AUTHOR></Raw>
<Start>792438</Start><End>792487</End><Raw><Size>50</Size> <AUTHOR>Laveleye, Emile de, 1822-1892. </AUTHOR></Raw>
<Start>1689410</Start><End>1689486</End><Raw><Size>77</Size> <AUTHOR>Sebright, Georgina Mary Muir (Mackenzie), Lady, d. 1874- </AUTHOR></Raw></RSet>
Some of these tags are self-explanatory (e.g., SSize = set size). But some may need a bit of explanation.
XPat's ability to return results with tags allows a program to parse the results into pieces. In the DLXS Middleware this is done by a group of DLXS Perl modules. These modules have methods to let the CGI program interact with XPat (an XPat process is forked off by the CGI program and queries can be made of it at any time). The main object the code uses is the xpat object. It has methods for making queries in different ways and for interacting with the forked off XPat process.
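To give a feel for the kind of parsing involved, here is a minimal Python sketch that splits tagged output like the samples above into records. It is not the DLXS Perl code: the function names are hypothetical, and it assumes that Start and End are inclusive byte offsets and that Size is the byte length of the Raw content, which agrees with the sample figures (178 - 143 + 1 = 36).

```python
import re
from typing import Optional

# Header of one record: <Start>, an optional <End> (present in an RSet,
# absent in a PSet), then <Raw><Size>n</Size> followed by n bytes of content.
RECORD_HEAD = re.compile(
    rb"\s*<Start>(\d+)</Start>\s*(?:<End>(\d+)</End>)?\s*<Raw><Size>(\d+)</Size>"
)


def set_size(data: bytes) -> Optional[int]:
    """Return the number in <SSize>...</SSize>, or None if the tag is absent."""
    found = re.search(rb"<SSize>(\d+)</SSize>", data)
    return int(found.group(1)) if found else None


def parse_result_set(data: bytes) -> list[dict]:
    """Parse a <PSet> or <RSet> block into a list of records."""
    opener = re.search(rb"<(PSet|RSet)>", data)
    if opener is None:
        return []
    records = []
    pos = opener.end()
    while (head := RECORD_HEAD.match(data, pos)) is not None:
        size = int(head.group(3))
        # Read exactly `size` bytes instead of searching for </Raw>,
        # because the raw content may itself contain markup.
        raw = data[head.end():head.end() + size]
        records.append({
            "start": int(head.group(1)),
            "end": int(head.group(2)) if head.group(2) is not None else None,
            "raw": raw,
        })
        pos = head.end() + size + len(b"</Raw>")
    return records


pset = (b"<PSet><Start>313615</Start><Raw><Size>64</Size>"
        b"res du nom de Branivoj s'emparent du territoire qu'ils gouvernen"
        b"</Raw></PSet>")
rset = (b"<RSet><Start>143</Start><End>178</End><Raw><Size>36</Size>"
        b" <AUTHOR>Holbach, Maude M. </AUTHOR></Raw></RSet>")

points = parse_result_set(pset)
regions = parse_result_set(rset)
print(set_size(b"<SSize>4</SSize>"), points[0]["start"], regions[0]["end"])
```

Reading the content by its declared Size, rather than scanning for the closing tag, keeps the parser independent of whatever SGML appears inside the result.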
Here is some code (from TextClass.pm) that illustrates how the middleware uses a method of the Perl-based XPat object (created in an earlier part of the code).
...
my $query = qq{(region mainheader incl ( $idnorgn incl "$idno" ) );};
my ( $error, $result ) = $xpat->GetSimpleResultsFromQuery( $query );
if ( $error )
{
    &DlpsUtils::errorBail( qq{Query error in FindXPatContainingIdno: $result} );
}
&DlpsUtils::StripAllRSetCruft( \$result );
$result =~ m,<SSize>(\d+)</SSize>,;
my $hit = $1;
if ( $hit > 0 )
{
    $returnXpat = $xpat;
    last;
}
...
While some code, such as this, makes a query via a method, most queries in the middleware are made by other means, through other objects and methods. Once data has been prepared according to the DLXS Class DTDs, the middleware can be thought of, as far as searching is concerned, as an engine that simply "runs" the data. When DLXS users do need to change code, it is usually because the data needs to be displayed differently ("filtering"). That is outside the scope of this section of the workshop.
The pr command is the heart of viewing sets. In an interactive XPat session, it lets you view the results you've searched for. Within the middleware, getting the data back from XPat is just one step; it is followed by "filtering" operations (Perl substitutions using regular expressions) that remove or change other tags in the content and change its appearance, e.g. by highlighting hits.
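As a rough idea of what such a filtering step does, the following Python sketch turns one raw snippet into a display string. The middleware itself does this in Perl with its own rules; the function name, the choice to turn paragraph tags into spaces, and the use of a b element for highlighting are all assumptions made for illustration.

```python
import html
import re


def filter_for_display(raw: str, term: str) -> str:
    """Strip SGML tags from `raw` and wrap each occurrence of `term` in <b>."""
    # Assumption: runs of paragraph tags become a single space so that
    # adjacent paragraphs do not run together.
    text = re.sub(r"(?:</?P>)+", " ", raw)
    # Drop any other tags.
    text = re.sub(r"<[^>]+>", "", text)
    # Escape what is left so it is safe to place in an HTML page.
    text = html.escape(text)
    # Highlight the search term, ignoring case.
    pattern = re.compile(re.escape(html.escape(term)), re.IGNORECASE)
    return pattern.sub(lambda m: f"<b>{m.group(0)}</b>", text)


snippet = " room. After lasting out five hundred years !</P><P>Perhaps a l"
shown = filter_for_display(snippet, "lasting")
print(shown)
```

Escaping happens after the tags are removed and before the highlight markup is added, so the only markup in the result is the markup the filter itself introduced.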
The format of the results that XPat returns with pr or save is determined by the current {quieton} setting. There is a big difference between the normal interactive mode, with a user sitting at the pat terminal, and the machine-readable modes.
Note: The save command is, in a sense, the same as the pr command: pr displays to STDOUT, while save outputs (appends) to a file whose name is given by {savefile}. The format of the output is the same.
<Sync>string</Sync>