Query Language Details
Some documentation can be found at: http://www.hti.umich.edu/sgml/pat/pat50manual.html
Further detail on the query language is also available.
We're going to start out in user mode, and shift into what's called {quieton}
and {quieton raw} once some basic concepts have been demonstrated.
In the first few sections, I'm going to nonchalantly introduce items that I
won't fully explain until the section on operators and relations.
Invoking xpat
We've come all this way to type, in the end, a command like:
% xpat /l1/idx/b/bosnia/bosnia.dd
Identifying Points
In XPat a point is some unique byte offset in the full text under
consideration, corresponding to those places where XPat was told to begin
sistrings. We get a set (see below for a more formal discussion of sets)
of points by performing a search for a string or a particular offset:
>> "mulberry"
>> "mulberry "
>> mulberry
>> [118312]
The first finds all sistrings that begin with mulberry,
the second finds those that are the word "mulberry"
exactly (note the trailing space), the third is the same as the first,
and the fourth finds the point at byte offset 118312.
Searching for something that doesn't exist in the database gets you
zero results, or a point set with zero members:
>> "syzygy"
4: no match
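To make the idea of points concrete, here is a minimal Python sketch of prefix searching over index points. It is an illustration only, not XPat's implementation: the functions index_points and search are hypothetical, the sketch assumes that sistrings begin at the start of every word, that matching ignores case, and that the text is plain ASCII so character offsets equal byte offsets.

```python
import re

def index_points(text):
    """Offsets where sistrings begin (assumed here: the start of every word)."""
    return [m.start() for m in re.finditer(r"\w+", text)]

def search(text, points, query):
    """Points whose sistring begins with `query`, ignoring case (an assumption)."""
    lowered = text.lower()
    q = query.lower()
    return [p for p in points if lowered.startswith(q, p)]

text = "a long road, a longer road, so long "
points = index_points(text)

print(search(text, points, "long"))    # prefix search: [2, 15, 31]
print(search(text, points, "long "))   # exact word, trailing space: [2, 31]
print(search(text, points, "syzygy"))  # no match: []
```

The trailing space in the second query is what excludes "longer", in the same way that "mulberry " is narrower than "mulberry".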
NOTE: XPat supports the history
command, through which you could get a list of all sets created in the
session, and the command that created them.
>> history
Identifying Regions
A region in XPat is a span of text comprising zero or more bytes. sgmlrgn50
(discussed in the Encoding Workshop site) handles the creation of these regions.
The {ddinfo regionnames} command
will list all the currently-defined regions, and from this list we can
find out about particular regions:
>> region DIV1
1: 113 matches
>> region "A-NODE"
2: 125 matches
That is, the region command followed by the name of some region (remembering
that regions with non-alphanumerics in their names must be double-quoted) evaluates
to the number of members in that region. There are 113 DIV1 regions and 125 A-NODE
regions above.
Looking for a region that is not defined gets you a result set of "-1"
and an error message.
See the discussion of the Error tag under {settings} below.
Identifying Sets (Numbered and Named Sets)
Any collection of zero or more points or regions can be grouped together
in a set. Sets can be combined or split with XPat's boolean operators.
Every set created during a session has a unique numeric identifier and can
also be given a name. Sets can be printed out, saved, exported and imported.
I gloss over some operators and commands here until the next section.
>> "long "
1: 352 matches
>> region "DIV1" incl "long "
2: 76 matches
>> "help "
3: 100 matches
>> 2 + 3
4: 176 matches
>> region "TEXT" incl 1
5: 4 matches
>> vsearch = "vardar "
6: vsearch = 10 matches
>> vsearchnext = 6
7: vsearchnext = 10 matches
>> pr *vsearchnext
1886962, ..l crosses the Vardar, and to this end a bridge is in process of ..
2058198, ..fall into the Vardar (Axius); and two—the Lab and the Sitn..
683365, .. région où le Vardar et la Morava prennent leur source, que pass..
1818056, ..s; and on the Vardar Gate and Arch of Constantine<NOTE>"The Egna..
2023056, .. of the river Vardar. Our host was a grumbling old man, who asto..
1902124, ..side into the Vardar plain. The plain in its purple distance mel.
>>
The vsearchnext = 6 line is interesting: 6 is the number of a set, but it
might also be a character you want to search for. So is XPat looking for a
numbered set in this session, or for the string 6? Here it takes 6 as the set
number, which is another reason to always put search terms in quotes. A command
like vsearchnext = 243, where 243 is larger than the set number of the
last-created set, will result in an error.
There are additionally two special commands to create subsets of sets:
-
subset.X.Y A
-
Make a new set that consists of Y members of A, starting at the Xth member
of A. Members of A start numbering at 1.
-
This command is used to get result content in slices.
-
sample.X A
-
Make a new set that consists of X members of A: if A has Y members, every
(Y/X)th member of A is placed in the new set.
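The two subset commands can be modelled in a few lines of Python. This is a sketch under assumptions, not XPat's code: the functions subset and sample are hypothetical stand-ins, a set is modelled as a list of offsets already in {sortorder occur} order, and exactly which member sample picks from each stride is a guess.

```python
def subset(members, x, y):
    """Model of subset.X.Y: up to Y members, starting at the Xth (1-based)."""
    return members[x - 1 : x - 1 + y]

def sample(members, x):
    """Model of sample.X: X members spread evenly through the set.

    Assumption: the first member of each stride of len(members)/X is taken.
    """
    if x <= 0 or not members:
        return []
    if x >= len(members):
        return list(members)
    step = len(members) / x
    return [members[int(i * step)] for i in range(x)]

offsets = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

print(subset(offsets, 3, 4))   # [30, 40, 50, 60]
print(subset(offsets, 9, 25))  # asking past the end is harmless: [90, 100]
print(sample(offsets, 5))      # [10, 30, 50, 70, 90]
```

Paging through results in slices then amounts to calling subset with X advanced by Y each time.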
How does XPat know which member of a set is the "first", "second", and so on?
This is controlled by the {sortorder} setting. TextClass uses
only {sortorder occur}, which is to say that results are returned in
the byte order in which they occur in the source text: the byte offset of a member
of a set is <= the byte offset of the next member, if any. TextClass as it
stands orders results for display to the user by occurrence order, and any ordering
other than that is accomplished outside XPat, in the InitIterator method. There
are other {sortorder} settings used by XPat, but we have so far not used
them in the Middleware. See {settings} below
for more on the different settings.
Operators and Relations
After tantalizing with various operators, I'll now actually define the
ones we use most in the form in which they typically occur. You may
wish to refer to an even more detailed discussion.
-
A ^ B
-
the "and" or "intersection" operator: A and B are two sets, or expressions
that evaluate to sets, and the resulting set includes those points or regions
in both A and B that have the exact same start offsets.
-
A + B + C + ...
-
the "or" or "union" operator: A, B, C... are sets, the resulting set is
a point set if at least one of the sets being combined is a point set,
consisting of the start offsets of all the points or regions in the original
sets. If all the sets being combined are region sets, then regions
that nest inside other listed regions (either entirely or at their start
byte offset) will be removed from the resultant set.
-
We saw an earlier example of the "+" operator.
-
A incl B
A not incl B
-
A is a region set, B is either point or region, the result is a region
set of all members of A that contain at least one member of B, containment
meaning that a given B has a start offset within the inclusive range of
a given A's start and end offsets.
-
A within B
A not within B
-
In many ways the complement to incl: A is a point or region set, B is a
region set, the resulting set is all members of A that are contained (by
the start offset rule as under incl) in any B. This also takes the not
operator to return all A's that are not within any B.
-
A near B
-
A and B are either points or regions, and # is either explicitly stated,
or taken from the {proximity} setting (see about {settings}
below). The result is all A's whose start offsets are within the specified
number of bytes of the start offset of any B. The not form returns all
A's whose start offsets are not within the specified number of bytes from
the start offset of any B. The nearest B might be earlier or later in the
source file.
-
A fby B
-
This works just like the near operator, save that an A must be followed within
the specified number of bytes by a B to be in the result set. This also
takes the not operator.
-
not
-
This reverses the sense of the expression it modifies, usable with incl,
within,
near,
and fby.
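The containment and proximity rules above can be restated as a small Python model. It is a sketch of the semantics as described in this section, not of XPat itself: points are modelled as integers, regions as (start, end) tuples, the function names and the negate flag (standing in for the not operator) are hypothetical, and treating fby as requiring B to start strictly after A is an assumption.

```python
def start_of(member):
    """Start offset of a point (int) or of a region ((start, end) tuple)."""
    return member[0] if isinstance(member, tuple) else member

def incl(a_regions, b, negate=False):
    """A incl B: regions of A containing the start offset of at least one B."""
    starts = [start_of(m) for m in b]
    return [r for r in a_regions
            if any(r[0] <= s <= r[1] for s in starts) != negate]

def within(a, b_regions, negate=False):
    """A within B: members of A whose start offset lies inside some B."""
    return [m for m in a
            if any(r[0] <= start_of(m) <= r[1] for r in b_regions) != negate]

def near(a, b, n, negate=False):
    """A near B: members of A starting within n bytes of some B, either side."""
    return [m for m in a
            if any(abs(start_of(x) - start_of(m)) <= n for x in b) != negate]

def fby(a, b, n, negate=False):
    """A fby B: members of A followed by some B within n bytes."""
    return [m for m in a
            if any(0 < start_of(x) - start_of(m) <= n for x in b) != negate]

divs = [(0, 99), (100, 199), (200, 299)]
hits = [120, 130, 250]

print(incl(divs, hits))                 # [(100, 199), (200, 299)]
print(incl(divs, hits, negate=True))    # [(0, 99)]
print(within(hits, [(100, 199)]))       # [120, 130]
print(near([120, 250], [130], 80))      # [120]
print(fby([120, 130], [130], 80))       # [120]
```

Note that both incl and within test only the start offset of the contained member, which is the "start offset rule" referred to above.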
Using the Operators to Make Sets of Interest
Now that we have our basic concepts and operators, let's get in there and
do some searching and document analysis: the process by which we
figure out what is there and how the DTD was applied to this SGML,
and what of it we can use. When developing an online system, this
is the most important step. Some important commands will be introduced
in this experimentation.
What a set of interest is is entirely up to the user, and the
notion of user ranges from developers to content specialists to the patrons
floating around out there. I'm going to walk through some increasingly
complicated possible sets of interest here.
We ignore in this section the matter of displaying retrieved results.
That comes in a later section.
-
find all the words that start with "diff", and find all the words that
are "different" exactly
>> "diff"
8: 354 matches
>> "different "
9: 134 matches
>>
find every "vardar" that is followed by "gate"
>> "vardar " fby "gate "
1: one match
>>
Now some actual examples from the TextClass implementation. This
query is actually the basis for the fabricated region called mainauthor
in
Bosnia and illustrates within:
>> ((region AUTHOR within (region TITLESTMT within region FILEDESC))
not within (region SOURCEDESC))
17: 4 matches
The motivation here depends on knowing that the AUTHOR element appears
within the TITLESTMT, and that the TITLESTMT element appears both directly
within the FILEDESC and, by way of BIBLFULL, indirectly within the
SOURCEDESC element:
<!ELEMENT fileDesc - - (titleStmt, ..., (sourceDesc)+)>
<!ELEMENT titleStmt - - ((title)+, (author | editor | respStmt)*)>
<!ELEMENT sourceDesc - - (p | bibl | biblFull)+>
<!ELEMENT biblFull - - (titleStmt, ..., (sourceDesc)*)>
Here is a simplified query using intersection (^) to fetch the
regions that are notes in the Bosnia
SGML. The full-blown form is the union of DIV2 and P tags in
addition to DIV1 tags.
>> (region "DIV1-T" incl "NODE=aas7611.0001.001:11")
1: one match
>> region DIV1 ^ 1
2: 7 matches
So we have the DIV1 regions which correspond exactly with DIV1 tags
containing the "id" attribute, which is how notes are marked up in Yeats.
Suppose we have constructed a query which has returned a PSet consisting
of hits on a term the user has entered to search on, and now we would like
to display the immediate context of the hit and also a title from an
enclosing division.
The query for the user's search is simply:
>> firstsearch = ("Branivoj " + "Branivoj<")
2: firstsearch = one match
To get a division title for the hit we need to build up regions based
on the hit:
>> slicesearch = subset.1.25 *firstsearch
3: slicesearch = one match
>> mainslicesearch = (region DLPSTEXTCLASS incl *slicesearch)
4: mainslicesearch = one match
>> mainheader = (region HEADER within *mainslicesearch)
5: mainheader = one match
Finally to view the content of the region we have constructed we do:
>> pr.region.mainheader (region mainheader)
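When these commands are issued by a program rather than typed, the sequence can be generated from the user's term. The following Python sketch simply formats the same five commands shown above. The function division_title_commands is hypothetical, the default slice size of 25 is taken from the example, and rejecting terms that contain a double quote is an assumption made because this document does not say how XPat escapes quotes.

```python
def division_title_commands(term, slice_size=25):
    """Build the XPat command sequence used above for one search term."""
    if '"' in term:
        # Assumption: quote escaping is not covered here, so refuse such terms.
        raise ValueError("term must not contain a double quote")
    return [
        f'firstsearch = ("{term} " + "{term}<")',
        f"slicesearch = subset.1.{slice_size} *firstsearch",
        "mainslicesearch = (region DLPSTEXTCLASS incl *slicesearch)",
        "mainheader = (region HEADER within *mainslicesearch)",
        "pr.region.mainheader (region mainheader)",
    ]

for command in division_title_commands("Branivoj"):
    print(command)
```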
The next section discusses the pr command, which is the heart
of viewing sets. Of course, we are not finished at this point. Getting
the data back from XPat is just one step. It is followed by filtering
operations (perl substitutions using regular expressions) to remove other
tags that may be mucking up our content and to change the appearance of
the content, e.g. highlighting hits.
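The filtering step can be pictured with a short sketch. The TextClass modules do this with perl substitutions; the version below is a Python illustration of the same idea, not the actual filter. The function clean_hit and the choice of ** as a highlight marker are hypothetical, and the simple tag pattern assumes every tag in the returned text is complete (a tag cut off by the print window would slip through).

```python
import re

def clean_hit(raw, term):
    """Strip SGML tags from a result, then mark each occurrence of the term."""
    text = re.sub(r"<[^>]*>", "", raw)
    return re.sub(re.escape(term),
                  lambda m: f"**{m.group(0)}**",
                  text,
                  flags=re.IGNORECASE)

print(clean_hit("on the <HI>Vardar</HI> Gate", "vardar"))
# on the **Vardar** Gate
```

Matching is case-insensitive here so that the highlighted text keeps the capitalization found in the source.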
Viewing the Sets We've Constructed
Now for what we've all been waiting for: we have some results or sets of interest,
and we want to look at them. The two commands for viewing results are pr
and save. In a sense, they are really the same command: pr
displays to STDOUT, save displays to {savefile}. Since
they behave the same way, I will use pr in my examples.
NOTE: save appends to the current {savefile}.
The kind of text for each result that XPat returns with pr and
save
is determined by the current {quieton} setting (which see, below
under {settings}). There is a big difference
between the normal user-sitting-at-the-pat-terminal interaction mode, and
the machine-readable modes.
-
pr (point-set)
-
This prints out up to {ordersize} (see {settings}
below) members of the point-set, starting with the first, according to
the current {sortorder} setting.
-
pr.X shift.-Y (point-set)
-
For the first results in (point-set), print a string X bytes wide that begins
Y bytes to the left of the matching point. X and Y override the {settings}
of {printlength} and {leftcontext} respectively, which
see below.
-
pr.region."region-name" (region-set of type "region-name")
-
prints the entire span of the members in (region-set). This is a bit of
a pain; to have to tell XPat the "format" of the region you would like
to see, when it should already know!
-
pr
pr %
pr.X shift.-Y
-
All these are variations on "print the last set created".
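The window that pr.X shift.-Y prints for a point can be worked out with simple arithmetic. The sketch below models it in Python: the function print_window is hypothetical, its defaults of 64 and 14 are the {printlength} and {leftcontext} defaults given under {settings}, and clamping at the start of the text is an assumption about what happens near the beginning of a file.

```python
def print_window(data, point, printlength=64, leftcontext=14):
    """Bytes printed for one point: `printlength` bytes beginning
    `leftcontext` bytes to the left of the point."""
    start = max(0, point - leftcontext)  # assumed clamp at start of text
    return data[start:start + printlength]

data = b"and on the Vardar Gate and Arch of Constantine"
point = data.index(b"Vardar")            # byte offset 11

window = print_window(data, point, printlength=20, leftcontext=7)
print(window)                            # b'on the Vardar Gate a'
print(window.index(b"Vardar") + 1)       # match starts at the 8th byte: 8
```

With 7 bytes of left context the match begins at the 8th byte of the window, just as 14 bytes of left context put it at the 15th.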
{settings}
These are settings that control certain behaviors of XPat during a search
session. There is only one setting that our programs use explicitly as
a set options command, the {quieton} command. The other settings
that are used by TextClass search strategies are made explicit through
the commands in which they are relevant, and aren't ever set with a set
options type command. I list here the {quieton} variants used
in TextClass, and then those {settings} explicitly used in commands
but not set per se. There are other {settings} that we don't use.
-
{quieton}
{quieton raw}
{quietoff}
-
{quieton} and {quieton raw} change the interaction mode
of XPat from whatever it was to one of the quieton modes. The {quieton}
modes have no user command prompt; multiple commands are separated with
a ;, and zero or more ;-separated commands are sent to XPat with a newline.
XPat returns information about sets and prints out results delimited by
special tags:
-
SSize
-
SSize tags surround a number, meaning that the set created by the search
corresponding to this SSize has number-many members.
-
RSet
-
When a region set is printed, all the members of the set printed are surrounded
by a pair of RSet tags. In {quieton} mode, each result
from the region set consists of two tags:
<Start>#</Start><End>#</End>
referring to the start and end byte offsets for that particular result.
It is the responsibility of the programmer to already know or be able to
handle how many results there are in this RSet (like, knowing
what search generated the set). In {quieton raw} mode we get more
information about each region result:
<Start>#</Start><End>#</End><Raw><Size>#</Size>blah
blah blah</Raw>
Start and End are byte offsets as before, but Size is the byte length of
the text delimited by the close Size tag and the close Raw tag.
-
PSet
-
When a point set is printed, all the members of the set printed are surrounded
by a pair of PSet tags. In {quieton} mode, each result
from the point set consists of one tag:
<Start>#</Start>
Where the surrounded number refers to a byte offset of the point. In {quieton
raw} mode, we get some more information:
<Start>#</Start><Raw><Size>#</Size>flug
flug fluggy!</Raw>
Start is a byte offset, Size is a byte size, and the text of the point
is delimited by the close Size and close Raw. The dimensions of the text
printed depends on the combinations of the {printlength} and {leftcontext}
settings, or their explicit definition with the pr command involved.
-
Error
-
If some kind of non-fatal error occurred during a search, XPat will, in
lieu of any of the preceding tags, send an error tag with some hopefully
helpful error message in it. The SSP CGI platform captures this, but doesn't
always do a great job of letting the programmer/user know, and the SSP
CGI platform always considers this fatal (i.e., the CGI script tries to bail
out with a message whenever it gets this from XPat).
{quietoff} is used to bring the interaction back into the normal
user interaction mode.
-
{printlength #}
-
This setting controls the default print window size for point sets, how
many total bytes are printed when a point set result is printed. See the
discussion of pr above. Default is 64.
-
{leftcontext #}
-
This setting controls how many characters before the matching text will
be printed when a point set is printed. If there are 100 characters of
{printlength},
and 14 of {leftcontext}, then the point where the matching text
starts will be the 15th character. See the discussion of pr above.
Default is 14.
-
{sortorder <order>}
-
This determines in what order a given set of results is sorted by XPat.
I always use {sortorder occur}, but there are other modes.
-
{savefile "file"}
-
Changes the default save file name.
-
{exportfile "file"}
-
Changes the default export file name.
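The {quieton raw} output for a point set, as described above, lends itself to parsing by Size rather than by searching for the closing Raw tag, since the result text may itself contain tags. Here is a minimal Python sketch under assumptions: the function parse_pset_raw is hypothetical, the tags are assumed to appear back to back exactly as in the pattern shown above, and the enclosing tags are assumed to be spelled <PSet> and </PSet>.

```python
import re

# One result header: <Start>#</Start><Raw><Size>#</Size>, followed by Size bytes.
HEADER = re.compile(rb"<Start>(\d+)</Start><Raw><Size>(\d+)</Size>")

def parse_pset_raw(payload):
    """Return (byte offset, text) pairs from a {quieton raw} point-set print."""
    results = []
    pos = 0
    while True:
        match = HEADER.search(payload, pos)
        if match is None:
            break
        offset, size = int(match.group(1)), int(match.group(2))
        text = payload[match.end():match.end() + size]
        results.append((offset, text))
        pos = match.end() + size  # skip the text, even if it contains tags
    return results

def raw_result(offset, text):
    """Helper for the demo: format one result the way the pattern above shows."""
    return b"<Start>%d</Start><Raw><Size>%d</Size>%s</Raw>" % (offset, len(text), text)

payload = (b"<PSet>"
           + raw_result(1886962, b"flug flug fluggy!")
           + raw_result(2058198, b"the <HI>Vardar</HI> plain")
           + b"</PSet>")

for offset, text in parse_pset_raw(payload):
    print(offset, text)
```

A region set could be handled the same way by also capturing the End tag between Start and Raw.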
Miscellaneous and Useful Commands
- {ddinfo regionnames}
- Lists all the currently-defined regions. A very useful command for document
analysis.
- history
- List of results sets from previously issued searches.
- ~sync "string"
- A fabulously useful command, basically an echo sort of command. We use
this in the TextClass perl modules to signal when XPat is done sending results.
In any of the {quieton} modes, this returns:
<Sync>string</Sync>
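A program driving XPat in a {quieton} mode can append a ~sync command to each batch of ;-separated commands and then read until the echoed tag appears. The Python sketch below shows that framing idea only. It is not the TextClass perl code: the function read_until_sync is hypothetical, the stream is any line-oriented text source (an in-memory one stands in for XPat's output here), and the token "done" is arbitrary.

```python
import io

def read_until_sync(stream, token):
    """Collect output up to, but not including, <Sync>token</Sync>."""
    marker = f"<Sync>{token}</Sync>"
    chunks = []
    for line in stream:
        if marker in line:
            chunks.append(line.split(marker, 1)[0])
            return "".join(chunks)
        chunks.append(line)
    raise EOFError("stream ended before the sync marker was seen")

# Stand-in for what XPat might send back after: "vardar "; ~sync "done"
fake_output = io.StringIO("<SSize>10</SSize>\n<Sync>done</Sync>\n")

print(repr(read_until_sync(fake_output, "done")))
# '<SSize>10</SSize>\n'
```

Using a fresh token per batch would let the caller detect output left over from an earlier command.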