| Last updated | 2002-03-26 15:10:20 EST |
| Doc Title | The XPAT Command Manual |
| Author 1 | Wilkin, John Price |
| CVS Revision | $Revision: 1.8 $ |
The following provides a summary of XPAT commands, settings, and concepts, and is based extensively on Open Text's PAT 5.0 documentation. Many of the commands included here are not implemented in DLXS middleware.
{CommandFile}
{CommandFile string}
changes the file name used by save.commands and exec.
The CommandFile setting determines which file the
save.commands command writes to and the exec command
reads from. It has a default value of xpat.cmd . If the
string begins with a numeral or contains blanks or non-alphanumeric characters,
it must be enclosed within double quote marks. The file name must
also conform to the file naming conventions of the host operating system. It can
be changed at any time during a XPAT session and remains in effect until changed
again or until the end of the session. The current value of
CommandFile is displayed by the command {Settings} .
>> {CommandFile "/usr/new/output_file"}
This changes the setting to the value /usr/new/output_file which any subsequent
save.commands command writes to and exec command
reads from.
exec , save.commands , Settings
comment
#
marks the start of a comment.
The comment, that is the # and the rest of the line following
the # , is ignored by XPAT. The comment can be placed on a line by
itself or following a XPAT query. It is useful for annotating queries stored in a
file to be processed in batch mode or to be read in by the exec
command. The queries may be created externally or generated during a XPAT session
and saved by save.commands for later use.
>> # find all the Shakespearean quotations
>> region Quote incl (region Author incl "shaks")
The line beginning with the # is ignored by XPAT.
>> first = region "<E>" .. "</L>" # find first language
XPAT creates a new region set with this command. The rest of the line,
beginning with the # , is ignored.
exec , save.commands , save.history
{DefaultRegion}
{DefaultRegion string}
determines which region set is the current default.
The DefaultRegion setting designates a special region set, known
as the default region. The default region can be referred to as
region without specifying the actual region name. The setting can
be changed at any time during a XPAT session and remains in effect until it is
changed again or until the end of the session.
If the string giving the setting value begins with a numeral or contains
blanks or non-alphanumeric characters it must be enclosed within double
quote marks.
Using the default region in a command, without having previously specified one, is illegal and results in the following message.
No information for region in the data dictionary
For convenience, a frequently used DefaultRegion setting can be
defined within an init file whose location is given
in the data dictionary file. The init file is read
and executed by a XPAT session when it is started (see the data dictionary
documentation for details).
>> region including "constitution"
The region set referred to in the example by region is the one designated by the
DefaultRegion setting.
>> {DefaultRegion HeadLine}
>> region including constitution
The first line changes the DefaultRegion setting to the value
HeadLine . The command that follows uses the region
set HeadLine even though it is not specified.
data dictionary documentation, including , pr ,
region save , Settings ,
within
difference
set1 - set2
removes members from a set.
The difference operator (- ) creates a new set
containing the members of set1 that are not members of
set2. Set1 and set2 can be either point sets or
region sets. The new set is of the same type as set1.
If either set1 or set2 is a region set, the first pointer delineating each region is used to determine if a member of set1 also occurs in set2. Thus, for set arithmetic (difference, union and intersection) in XPAT, set members of a region set are considered to be equal if they start at the same location in the text. The end point of a region is ignored in such operations.
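The start-offset rule can be modelled in a few lines of Python. This is only an illustrative sketch, not XPAT's implementation: representing a point as an integer offset and a region as a (start, end) tuple is an assumption made for the example, and the names start_of and difference are hypothetical.

```python
# Illustrative model only. Assumed representation: a point is an int
# offset, a region is a (start, end) tuple of offsets into the text.

def start_of(member):
    """Return the offset used for set arithmetic (a region's first pointer)."""
    return member[0] if isinstance(member, tuple) else member


def difference(set1, set2):
    """Members of set1 whose start offset matches no member of set2."""
    starts2 = {start_of(m) for m in set2}
    return [m for m in set1 if start_of(m) not in starts2]


regions = [(0, 40), (50, 90), (100, 140)]
points = [50, 120]

# (50, 90) is removed because it starts at offset 50. Offset 120 lies
# inside (100, 140) but does not coincide with its start, so that
# region is kept: the end point of a region plays no part.
remaining = difference(regions, points)
```

The result keeps the type of set1, which in this sketch simply means that the members returned are the original set1 members.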
>> "to"-"to "-"to<"
Note that these operators are parsed left to right and can be combined
without bracketing. This query creates a point set that contains all the
matches to the prefix to excluding those to the
string to followed by a blank or a left angle
bracket. Assuming an index in which all punctuation has been mapped to blanks,
the result contains words starting with to but not
the word to .
>> ("q" - "qu") within region HeadWord
This query creates a point set. The point set includes all words located in
a HeadWord region that begin with q but not with
qu .
>> region Story incl "music " - region Story incl "art "
This query creates a region set. The region set consists of all Story
regions that include the string music but not the
string art .
>> region Q - "<Q><D>"
Assume that the regions described by region Q
all begin with the string <Q> . The above
query creates a region set of the members of region
Q that do not have the string <D>
immediately following the <Q> .
intersection , union
done
done
terminates a XPAT session.
The done command ends the session and causes the XPAT process to
exit. A message may be generated telling how much computer time has been used
during the session.
quit , stop
double quote
"string"
allows the use of strings that include special characters.
Normally, XPAT interprets a sequence of characters as a string and searches
the database for matches to it. However, there are certain types of strings that
XPAT cannot recognize as search targets unless they are enclosed within
double quote marks. The special strings are: strings which begin
with a numeral, for example 2nd ; strings which
contain blanks or non-alphanumeric characters, for example end of the year or <Author>Scott ; and strings which are XPAT commands,
for example near and within . In each case, a string that is not enclosed in
double quote marks but should be will result in a syntax error or
unexpected result.
Note that if numbers are not enclosed in double quote marks,
they are interpreted as a reference to the number of a set previously calculated
in the XPAT session.
A pair of quotes representing an empty string ("" ) stands for
the set of all index points in the text being searched.
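A program that generates XPAT queries could apply these quoting rules with a small helper. The following Python sketch is an assumption-laden illustration: the function names needs_quotes and quote are hypothetical, the COMMAND_WORDS list is only a partial guess built from commands named in this manual, and strings that themselves contain a double quote are not handled.

```python
# Hypothetical, incomplete list: the manual gives "near" and "within" as
# examples of command words; the others are commands described elsewhere
# in this manual. A real list would come from the XPAT grammar.
COMMAND_WORDS = {"near", "within", "done", "fby", "including", "incl",
                 "not", "region", "first", "history"}


def needs_quotes(text):
    """Return True if text must be enclosed in double quote marks."""
    if not text:
        return True               # the empty string can only be written as ""
    if text[0].isdigit():
        return True               # begins with a numeral, e.g. 2nd
    if not text.isalnum():
        return True               # contains blanks or non-alphanumerics
    return text in COMMAND_WORDS  # would otherwise be read as a command


def quote(text):
    """Quote text only when the rules above require it."""
    return f'"{text}"' if needs_quotes(text) else text


# Without the quotes, 19 would be read as a reference to set number 19.
query = quote("19") + " within region Date"
```

Quoting unconditionally would also be valid; the helper only makes the rule explicit.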
>> "done "
>> done
The first command creates a point set containing matches to the word done . The second command ends the XPAT session.
>> 19 within region Date
>> "19" within region Date
The first query finds those members of the previously calculated set,
identified by the number 19, that are within region
Date . The second query finds the matches to the string 19 within region Date .
>> ""
This command produces a list of every point indexed in the text.
>> "_XPat_1" = "match this string "
>> "_XPat_OP1" = region "Region Set 5"
>> "_XPat_2" = *"_XPat_1" within *"_XPat_OP1"
The above sequence of commands might be produced by a program that accepts input from a user and generates commands that are sent to XPAT. Since the names contain non-alphanumeric data they must be bounded by quotation marks.
index point , region , set name ,
string search
exec
exec
reads a file into a XPAT session and executes the commands contained in the file.
The name of the file read by the exec command is determined by
the value of the CommandFile setting. By default, the value is
xpat.cmd but can be changed at any time during the XPAT
session.
The exec command can be used to enter queries to a XPAT session.
The queries, for example macro definitions, may be recorded in a file using an
editor or saved in a file from a previous XPAT session using
save.commands .
>> {CommandFile "/usr/xpat/srch023.q"}
>> exec
The first command sets the name of the file to be read by any
exec command to /usr/xpat/srch023.q .
The second command reads the file /usr/xpat/srch023.q and executes the commands contained in
the file.
save.commands
CommandFile
export
export set1
saves information about sets created in a XPAT session.
Export writes a detailed description of the members of
set1, created during a XPAT session, to a file. The description
includes the type (region or point) of the set and sufficient information to
recreate a copy of the set. The name of the file is determined by the value of
the ExportFile setting. By default the file name is xpat.exp but can be changed during a session by using the
command ExportFile . When export writes to the named
file it writes over anything that may currently exist in the file. Assuming a
default ExportFile setting of xpat.exp ,
the following message is given:
Exporting to xpat.exp.
The file may subsequently be read into a XPAT session by the
import command.
If the saved set is a frequently used region set, it can be made available as
a predefined region in future XPAT sessions by editing the data dictionary file
and adding the appropriate information. If the new region set, containing 150
regions, is named newregion and saved in the file
newregion_file , the following lines, added to the
data dictionary, would make it available to XPAT.
<Region>
  <Name>newregion</Name>
  <Desc>This new region set describes ....</Desc>
  <File>
    <SysName>newregion_file</SysName>
    <Offset>0</Offset>
  </File>
  <Count>300</Count>
  <Type>pairs</Type>
</Region>
>> "tax" near "increase"
>> export %
The first query creates a point set of the matches to the string tax when it is within the current Proximity
of the string increase . The second command writes
this point set to the file xpat.exp . The information
written to the file contains header information followed by details about each
element in the set.
>> {ExportFile "v.exp"}
>> verse = region "<V>" .. "</V>"
>> export *verse
The first line of the example changes the ExportFile setting
to v.exp . The second line creates a region set and
names it verse . The third command writes header
information and a description of each member of the region
verse to the file v.exp .
data dictionary documentation, import
ExportFile
{ExportFile}
{ExportFile string}
changes the file name used by export and import .
The ExportFile setting determines the file written by the
export command and read by the import command. It has
a default value of xpat.exp . If the string begins with
a numeral or contains blanks or non-alphanumeric characters, it must be enclosed
within double quote marks. The file name must also conform to the
file naming conventions of the host operating system. It can be changed at any
time during a XPAT session and remains in effect until it is changed again or
until the end of the session. The current value of the ExportFile
setting is displayed by the command {Settings} .
>> {ExportFile "/usr/new/export_file"}
This changes the value of the setting so that any subsequent
export or import command utilizes the file /usr/new/export_file .
export , import , Settings
fby
set1 fby set2
finds members of sets that occur close to each other in a specified order.
Fby (followed by) creates a set containing those members of
set1 that have one or more members of set2 within a
specified number of characters to their right . Set1 and
set2 may be either point sets or region sets. The new set is of the
same type as set1.
The distance between members of the two sets is calculated by counting the
number of characters in the text from the first character of a member of
set1 to the first character of a member of set2. The
measure used to determine closeness is the value of
the Proximity setting which has a default value of 80 characters.
This can be changed for all subsequent uses of fby by changing the
Proximity setting, or it can be changed for an individual use of
fby by using a modifier attached to the command. The form of the
modifier is a period followed by a number representing the maximum distance (in
characters).
If either set1 or set2 is a region set, the first of the two pointers delineating the region is used to determine the distance between the set members.
Multiple fby commands are not parsed left to right. A command of
the form
set1 fby set2 fby set3
is handled as if parenthesized as follows:
set1 fby (set2 fby set3)
The command not fby creates a set containing the members of
set1 that are not within the specified distance to the
left of any member of set2.
set1 not fby (set2 fby set3)
is the same as
set1 - (set1 fby (set2 fby set3))
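The distance test and the right-to-left grouping can be sketched in Python. This is an illustrative model rather than XPAT's implementation. It assumes points are integer offsets and regions are (start, end) tuples, that a member of set2 must start strictly to the right of the set1 member, and that a distance exactly equal to the Proximity value still counts as a match; the manual does not state either boundary case.

```python
import bisect


def start_of(member):
    """Offset used for distance: a region is measured from its first pointer."""
    return member[0] if isinstance(member, tuple) else member


def fby(set1, set2, proximity=80):
    """Members of set1 with a set2 member starting within proximity to the right."""
    starts2 = sorted(start_of(m) for m in set2)
    result = []
    for member in set1:
        start = start_of(member)
        # Index of the first set2 start strictly to the right (assumption).
        i = bisect.bisect_right(starts2, start)
        if i < len(starts2) and starts2[i] - start <= proximity:
            result.append(member)
    return result


def not_fby(set1, set2, proximity=80):
    """Equivalent to set1 - (set1 fby set2)."""
    followed = set(fby(set1, set2, proximity))
    return [m for m in set1 if m not in followed]


law = [10, 200]     # hypothetical offsets of matches to "law "
order = [50, 400]   # hypothetical offsets of matches to "order "

followed_by_order = fby(law, order)         # default Proximity of 80
tight = fby(law, order, proximity=30)       # models fby.30
chained = fby([0], fby([50], [100]))        # set1 fby (set2 fby set3)
```

Because the innermost fby is evaluated first, the chained form keeps a member of set1 only if it is followed by a member of set2 that is itself followed by a member of set3.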
>> "law " fby "order "
Assuming a Proximity of 80, this query creates a point set
containing the matches to law with one or more
matches to order within 80 characters to their
right, counting from the l in law to the o in order .
>> region Title fby.30 region Author
This query creates a region set containing the members of the set region Title that have one or more members in the set
region Author within 30 characters to the right.
The distance is measured as the number of characters from the first character
of a Title region to the first character of an Author region.
>> "law " not fby "order "
This query creates a point set containing the matches to law that do not have a match to order
within 80 characters to the right, calculating the distance as in the
first example.
>> "law " not fby.30 "order "
This query creates a point set containing the matches to law that do not have a match to order
within 30 characters to the right.
near
Proximity
first
first set1
finds a specific number of contiguous members from the start of a set.
First creates a set of a specified size consisting of
members from the beginning of set1. The members of the new set are in
the order they appear in set1. Set1 may be either a region
set or a point set. The new set is of the same type as set1.
The operation of the first command involves the set member
counter that keeps track of the selected members, the identification of the size
of the requested set, and the SortOrder setting that determines
which members are in the new set.
First selects members from the beginning of a set. The ordering
of a set, and hence which members occur at the beginning, is controlled by the
SortOrder setting. If the SortOrder setting is Alpha , the set is ordered alphabetically. If the
SortOrder setting is Occur or OccurHead , the set is ordered according to occurrence in
the text. If the SortOrder setting is AsIs , the set order is the current one which may be either
alphabetic or occurrence order.
Each set that is used with a first , next or
~nextemp command has a cursor (set member counter) associated with
it. The cursor indicates the location in set1 at which to start
selecting members for the set being created. Each first command
resets the cursor so members for the new set are chosen from the beginning of
set1. On completion of the first command the cursor is
updated to point at the beginning of the next set. Note, when the
SortOrder setting changes and the set ordering is changed, the
cursor is reset to the beginning of set1.
The size of the set created is determined by the value of
SampleSize which has a default value of 10. If the size of
set1 is less than SampleSize then the new set created is
the same size as set1. Changing the SampleSize setting
affects all subsequent uses of first during the current session.
For an individual use of the command, the size of the new set can be specified
by using a modifier attached to the first command. This modifier is
in the form of a period followed by a numeric value giving the desired set size.
The first command can be used by itself or with the
pr , save or export commands.
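The interaction between the cursor, the SampleSize setting and the size modifier can be modelled as follows. This Python sketch is illustrative only: the class name CursorSet and its attributes are hypothetical, the set is assumed to be already ordered according to SortOrder, and the behavior of next is deliberately left out because it is not described in this entry.

```python
class CursorSet:
    """Illustrative model of a set with an attached set member counter."""

    def __init__(self, members):
        self.members = list(members)  # assumed already in SortOrder order
        self.cursor = 0               # position where selection would resume

    def first(self, size=None, sample_size=10):
        """Select members from the beginning of the set.

        size models the modifier (first.N); when it is absent the
        SampleSize setting (default 10) determines the size.
        """
        wanted = sample_size if size is None else size
        chosen = self.members[:wanted]  # never more than the set contains
        self.cursor = len(chosen)       # cursor now follows the selection
        return chosen


matches = CursorSet(range(100, 125))  # 25 hypothetical members
sample = matches.first()              # first 10 members
everything = matches.first(size=40)   # only 25 exist, so 25 are returned
matches.first(size=0)                 # models first.0: resets the cursor
```

A size of zero selects nothing and leaves the cursor at the first member, which corresponds to the reset behavior of first.0 described in the examples.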
>> {SampleSize 40}
>> first 5
The first line changes the SampleSize setting to 40 and the
second line creates a set that contains the first 40 members of set number 5
created earlier in the XPAT session.
>> first.10 "the best of "
This line creates a set containing the first 10 members in the set of
matches to the phrase the best of .
>> first.0 3
This query resets the cursor to the first member of set number 3.
next , ~nextemp , sample , set
number , subset
SampleSize , SortOrder
~free
~free number
releases a XPAT set.
Following the ~free command, the set number is no
longer available for reference in a XPAT command. The set is no longer displayed
by the history command.
If the sets freed are at the end of the current history list, the set numbers
will be reused for the next sets created in the XPAT session. For example, if the
history list contains set numbers 1 to 8, and 6 through 8 are freed using the
~free command, the next set number assigned is 6. However, if set
number 2 is freed and the history list includes set numbers 1 to 8, the next set
is number 9.
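The numbering rule can be expressed as a small Python model. It is a sketch under assumptions, not a description of XPAT internals: the class name SetHistory is hypothetical, and the choice to let a number become reusable as soon as everything above it has been freed is one reading of the rule stated above.

```python
class SetHistory:
    """Illustrative model of set numbering in the history list."""

    def __init__(self):
        self.sets = {}    # set number -> query text
        self.highest = 0  # highest set number still considered in use

    def add(self, query):
        """Record a new set and return the number assigned to it."""
        self.highest += 1
        self.sets[self.highest] = query
        return self.highest

    def free(self, number):
        """Model of ~free: release one set."""
        del self.sets[number]
        # Numbers freed at the end of the list become reusable
        # (assumption: any trailing run of freed numbers is reclaimed).
        while self.highest > 0 and self.highest not in self.sets:
            self.highest -= 1

    def free_all(self):
        """Model of ~freeall: release every set; numbering restarts at 1."""
        self.sets.clear()
        self.highest = 0


history = SetHistory()
for i in range(8):
    history.add(f"query {i + 1}")  # sets 1 to 8
for number in (6, 7, 8):
    history.free(number)
reused = history.add("new query")  # trailing numbers were freed
```

Freeing a set in the middle of the list leaves a gap that is not refilled, so the next set still receives the number after the current highest one.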
>> ~free 4
This removes set number 4 from the history list. The set can no longer be accessed by number reference.
~freeall , history
~freeall
~freeall
releases all XPAT sets.
Following the ~freeall command, all the sets that existed in the
current session are no longer available for reference in a XPAT command. In
addition, those sets are no longer displayed by the history
command.
Following the ~freeall , the next set number assigned is 1.
>> ~freeall
This removes every set currently in the history list. Following the command, no previously created sets can be referenced, and the next set that is produced is assigned the number 1.
~free , history
history
history
displays the record of the current XPAT session.
Information about each set created during the XPAT session is recorded in a
history list. For each of the sets, history displays a set number,
the number of members in the set and the query that produced the set. Sets
created during the current session can be accessed by referring to the number of
the set in the history list. The results of pr , save ,
Settings , { } , and certain tilde (~ )
commands do not appear in this list since no sets are produced by these
commands.
As the entire history list may become quite long, it is useful to be able to
view only a part of the list. The History setting determines what
portion of the history list is displayed by the history command.
The History setting has a default value of 0. This indicates that
the entire history list is to be displayed. When set to an integer n
(any integer greater than zero) the final n elements in the history
list are displayed by any subsequent use of the history command
during the session.
The items listed can also be changed for an individual use of the
history command. Modifiers may be attached to the
history command to request that a certain number of items and that
a particular portion of the list be displayed.
The first modifier, in the form of a period followed by a number, indicates
where in the history list to begin the display. A positive integer p
requests that the display start at the pth item from the start of the
history list. A negative integer p requests that the display start at
the pth item from the end of the history list. The number of items
displayed is the value of the History setting.
The number of items displayed can also be changed for an individual use of
the history command by using a second modifier attached to an
already modified history command. This second modifier is also in
the form of a period followed by a number giving the number of items to be
displayed.
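The way the History setting and the two modifiers select a slice of the list can be sketched in Python. This is an illustrative reading of the rules above, not XPAT's code: the function name history_view and its parameters are hypothetical, a start position of zero is not modelled because the manual does not define it, and a negative start that reaches past the beginning of the list is assumed to clamp to the first entry.

```python
def history_view(entries, setting=0, start=None, count=None):
    """Return the history entries that would be displayed.

    entries: history lines in creation order
    setting: value of the History setting (0 means the whole list)
    start:   first modifier, as in history.3 or history.-2
    count:   second modifier, as in history.4.10
    """
    if count is None:
        count = setting  # the setting supplies the number of items
    if start is None:
        # Unmodified command: the final `count` items, or all of them.
        return entries[-count:] if count > 0 else list(entries)
    if start > 0:
        begin = start - 1                     # pth item from the start
    else:
        begin = max(len(entries) + start, 0)  # pth item from the end
    end = begin + count if count > 0 else len(entries)
    return entries[begin:end]


session = ['1: 11680, "univ"', '2: 209, "waterloo"', '3: 4, 1 near 2']

whole_list = history_view(session)                # history
from_third = history_view(session, start=3)       # history.3
last_two = history_view(session, start=-2)        # history.-2
first_two = history_view(session, start=1, count=2)
```

With the default setting of 0, a start modifier on its own displays everything from that position to the end of the list.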
The default maximum size of the history list is 300 items. If more than 300 sets are created, only the last 300 sets created during the XPAT session are retained in the list. This maximum size can be altered by a command line parameter when starting a XPAT session.
Note that a set can be removed from the history list by the
~free command.
>> "univ"
>> pr sample %
>> "waterloo"
>> 1 near 2
>> pr
>> history
Assuming the above are the only commands executed in the XPAT session to
this point, the result of the history command would be as
follows:
1: 11680, "univ"
2: 209, "waterloo"
3: 4, 1 near 2
>> {History 5}
>> history
The first command, in this example, sets the value of the
History setting to 5. The second command, and subsequent uses of
the history command in the session, will show information about
the final five sets in the history list.
>> history.3
This use of the history command gives information about the
commands in the history starting at the third element in the history list.
Using the XPAT session described in the first example, above, the result of
this would be:
3: 4, 1 near 2
>> history.-2
This use of the command gives information starting at the second element from the end of the history list. Again, using the first example, the result of this would be:
2: 209, "waterloo"
3: 4, 1 near 2
>> history.4.10
This use of history gives information from the history list
starting at the fourth entry on the list and continuing for ten entries.
>> history.-4.2
This use of history gives information about two
entries, starting at the fourth entry from the end of the history list.
~free , save.commands , save.history ,
set number
History
{History}
{History number}
changes the number of items from the history list displayed by the
history command.
The History setting determines the number of items displayed by
the history command. Note that the setting may be overridden and
the number of items displayed determined by a modifier for an individual use of
the history command. The default value of the setting is 0
indicating that all sets created in this session are to be shown by the
history command. The setting can be changed at any time during a
XPAT session and stays in effect until changed again or until the end of the
session. The current value of the History setting is displayed by
the command {Settings} .
>> {History 30}
This changes the setting to the value 30 so that any subsequent use of the
history command during the session displays 30 items.
history , Settings
{HistoryFile}
{HistoryFile string}
changes the file name used by save.history .
The HistoryFile setting determines the file written by the
save.history command. It has a default value of xpat.his . If the string begins with a numeral or contains
blanks or non-alphanumeric characters, it must be enclosed within double
quote marks. The file name must also conform to the file naming
conventions of the host operating system. It can be changed at any time during a
XPAT session and remains in effect until it is changed again or until the end of
the session. The current value of the HistoryFile setting is
displayed by the command {Settings} .
>> {HistoryFile "/usr/new/history_file"}
This changes the HistoryFile setting so that any subsequent
use of the save.history command during the session writes to the
file /usr/new/history_file .
save.history , Settings
import
import
reads information that has been saved in a file by the export
command.
Import reads data from a file and creates a new set which can be
used as if it had been created during the current XPAT session. XPAT determines
from the header information whether the saved set is a point set or a region
set, and the new set is of the same type. The file read is determined by the
ExportFile setting which has a default value of xpat.exp . The file name can be changed during a session by
resetting the ExportFile setting.
Assuming the default setting of ExportFile , if the imported set
is a region set, the following message is generated:
Importing regions from 'xpat.exp'
If it is a point set, the message generated is:
Importing point set from 'xpat.exp'.
>> import
>> % within region Quote
The first command reads from the file xpat.exp .
The second line uses the imported set as an operand to a within
command and finds the members of the set that occur within region Quote .
>> {ExportFile "v.exp"}
>> verse = import
>> *verse including ("blind " fby "ditch ")
The first command resets the ExportFile setting to v.exp . The second reads a set from the file v.exp and names it verse . The
third query finds the set of imported regions that include the string blind when it is followed by the string ditch (the assumption has been made that this set is a
region set).
export
ExportFile
including
set1 including set2
set1 incl set2
finds regions that contain members of a set.
Including or incl creates a set consisting of
members of set1 that include one or more members of set2.
Set1 must be a region set. Set2 may be either a point set
or a region set. The new set is a region set.
Set1 may be a predefined region set, a region set created during
the XPAT session using the region command, a region set resulting
from the use of the import command, or the result of a previous
query in the session.
If set2 is a point set, and if one or more of the points occur in a region from set1, then that set1 region is included in the new set.
If set2 is a region set, and the first of the pair of pointers (offsets into the text) describing a region of set2 is contained in a region of set1, that set1 region is included in the new set. The second pointer of the pair delineating set2 does not have to fall within the region of set1 in order that the set1 region be included in the new set.
The including command can also be used to find regions that
contain more than one member of set2, by attaching a modifier
specifying the minimum number of members of set2 to the
including command. This modifier is in the form of a period
followed by the value of the minimum number of members.
The command not including creates a set containing those members
of set1 that do not contain any of the members in set2.
set1 not including set2
is the same as
set1 - (set1 including set2)
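The containment test, the minimum-count modifier and the not including form can be sketched in Python. This is an illustrative model only. It assumes regions are (start, end) tuples of offsets, points are integers, and that an offset equal to either end of a region counts as inside it; the function names are hypothetical.

```python
from bisect import bisect_left, bisect_right


def start_of(member):
    """A region in set2 is tested by its first pointer only."""
    return member[0] if isinstance(member, tuple) else member


def including(regions, set2, minimum=1):
    """Regions containing at least `minimum` members of set2.

    minimum models the modifier, as in including.3.
    """
    starts2 = sorted(start_of(m) for m in set2)
    result = []
    for start, end in regions:
        # Number of set2 starts with start <= offset <= end (assumed inclusive).
        count = bisect_right(starts2, end) - bisect_left(starts2, start)
        if count >= minimum:
            result.append((start, end))
    return result


def not_including(regions, set2):
    """Equivalent to set1 - (set1 including set2)."""
    containing = set(including(regions, set2))
    return [r for r in regions if r not in containing]


stories = [(0, 99), (100, 199), (200, 299)]  # hypothetical Story regions
hits = [10, 20, 30, 150]                     # hypothetical match offsets
authors = [(90, 110)]                        # a region straddling two stories

with_hits = including(stories, hits)
with_three = including(stories, hits, minimum=3)
with_author = including(stories, authors)
```

The authors case shows the first-pointer rule: the region (90, 110) ends outside the first story, yet only that story is selected because the region starts inside it.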
Including and within are similar in that they both
restrict searches to specified regions in the text. They differ in the set that
is created. The including command creates a set of regions that
contain one or more members of another set, while within
creates a set of pointers or regions that are contained in members of a
region set.
>> region Story including ("Free trade" near "Canada")
This query finds the regions described by region
Story that contain one or more matches to the string Free trade when it occurs close to the string Canada .
>> region Story including.3 ("Free trade" near "Canada")
This query finds the regions described by region
Story that contain at least three matches to the string Free trade when it occurs close to the string Canada .
>> region Quote not including region Author
This query creates a set of Quote regions that do not contain the first pointer of the pair delineating an Author region.
>> dates = "1800" .. "1825"
>> region Date including *dates
The first query creates a point set containing all the numbers that are
alphabetically between 1800 and 1825 . The second creates the set of Date regions that
contain one or more of these numbers.
>> region Quotation including "Wright"
>> % including "Waterloo"
The first query creates a region set of quotations that contain the string
Wright . The second query finds the members in the
new region set that also contain the string Waterloo .
>> (*speech including "republican") including "democrat"
This query is similar to the previous one. It assumes that a region set
named speech has been defined and it finds the
members of this set that contain both the string republican and the string democrat .
>> (*definition incl ("men" + "women")) incl "education"
>> *definition including (("men" + "women") ^ "education")
The first query creates a set of definition regions that include the string
education as well as either men or women . Note that the
second query does not create the same set but actually creates a set of size
0, because the intersection (("men" + "women") ^ "education") is empty.
No member of the union set "men" + "women" is also a member of the set "education" (see the definition of the union
operator).
intersect , not , region , within
index point
XPAT views the entire text as one long string. In contrast to
traditional text indices, which deal with words, XPAT indexes strings. The
indexed strings extend from each index point to the end of the
text.
The XPAT index is made up of the starting points of each string. The index points make up the possible match points for a string search. Parameters set when the index is built determine which strings are in the index. The parameters specify patterns in the text that define the beginnings of strings to be indexed. For example, one pattern could specify that every character in the text is to be indexed, while another pattern could specify that each printable character following a blank is to be indexed.
When the index is created, two additional settings can alter how XPAT sees the
text. Character mappings cause XPAT to see certain characters as
equivalent to other characters. For example, all upper case letters may be
mapped to lower case letters so that XPAT does not distinguish between upper and
lower case when searching for a string. Also, some words may be designated as
stopwords. XPAT views the text as if these words are not there. XPAT
ignores strings in the text that start at an index point and match the given
stopword strings followed by a blank after the character mappings have been
applied. The character mappings also affect the strings chosen to be index
points. For example, if a > is mapped to a blank
and if the index points are defined as blanks followed by printable characters,
in the text ...<tag>wisdom... the w in the string wisdom is an
index point. Text with character mappings applied and stopwords removed is
referred to as converted text.
When searching for a given string, a match is found if the given string (after having the character mappings applied to it and the stopwords removed) is the same as the converted text that begins one of the indexed strings.
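The relationship between character mappings, index points and matching can be illustrated with a short Python sketch. It is a simplified model built on assumptions: the mapping is one character to one character so that offsets are preserved, index points are printable characters that follow a blank (plus the first character of the text), and stopword removal is left out entirely. The function names are hypothetical.

```python
import string


def convert(text, mapping):
    """Apply the character mappings to produce the converted text."""
    return "".join(mapping.get(ch, ch) for ch in text)


def index_points(text, mapping):
    """Offsets of printable characters that follow a blank in the converted text."""
    converted = convert(text, mapping)
    points = []
    for i, ch in enumerate(converted):
        follows_blank = i == 0 or converted[i - 1] == " "  # offset 0 is assumed indexed
        if ch != " " and ch.isprintable() and follows_blank:
            points.append(i)
    return points


def search(text, mapping, query):
    """Index points whose indexed string begins with the converted query."""
    converted = convert(text, mapping)
    target = convert(query, mapping)
    return [p for p in index_points(text, mapping)
            if converted.startswith(target, p)]


# Upper case mapped to lower case, and ">" mapped to a blank.
mapping = {upper: upper.lower() for upper in string.ascii_uppercase}
mapping[">"] = " "

text = "See <tag>Wisdom here"
points = index_points(text, mapping)
found = search(text, mapping, "WISDOM")
```

Because the query passes through the same mappings as the text, a search for WISDOM and a search for wisdom find the same index point in this model.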
data dictionary documentation, double quote ,
offsets , quiet mode , range ,
shift , string search
intersect
set1 ^ set2
finds members common to two sets.
The intersect operator (^ ) creates a new set
consisting of the members in set1 that are also in set2.
Set1 and set2 can be either point sets or region sets. The
new set is of the same type as set1.
If either of set1 or set2 is a region set, only the first of the pointers describing the region is used in the comparison to determine if a member should be included in the new set. Two members of a region set are considered to be equal if they start at the same location in the text.
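A short Python sketch of this comparison rule follows. It is illustrative only and assumes points are integer offsets and regions are (start, end) tuples; the names start_of and intersect are hypothetical.

```python
def start_of(member):
    """Only the first pointer of a region takes part in the comparison."""
    return member[0] if isinstance(member, tuple) else member


def intersect(set1, set2):
    """Members of set1 that start at the same offset as some member of set2."""
    starts2 = {start_of(m) for m in set2}
    return [m for m in set1 if start_of(m) in starts2]


# Two region sets drawn from the same regions share start offsets.
verses_with_eye = [(0, 50), (100, 150)]
verses_with_seed = [(100, 150), (200, 250)]
both = intersect(verses_with_eye, verses_with_seed)

# Matches to two different words never start at the same offset, so
# intersecting their point sets directly gives an empty result.
men = [5, 40]
education = [12]
nothing = intersect(men, education)
```

This is why intersection is normally applied to sets derived from the same regions or the same search string, rather than to the match points of unrelated strings.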
>> (region Verse incl "eye") ^ (region Verse incl "seed")
This query creates a region set. It includes verse regions that contain
both the string eye and the string seed .
>> ("research" near "medical") ^ ("research" near "biolog")
This query creates a point set. It includes the matches to research that appear close to both the string medical and the string biolog .
difference , region , union
{Label}
{Label string}
specifies an identifying string to be used as a label.
When XPAT is operating in quiet mode with labels requested, any
set displayed by a pr or save command shows the label
string preceding the numeric value of the text offset. This can be used to
identify which database the information is from. In a XPAT session, if a value
for Label has not been set by this command, the default value used
is the name of the data dictionary. The label string must begin with an
alphabetic character and contain no blanks or non-alphanumeric characters. The
setting can be changed at any time during a XPAT session and remains in effect
until it is changed again or until the end of the session.
>> {Label Database1}
>> {QuietOn Label}
pr ("Ontario" near ("B.C." + "British Columbia"))
The tagged output from the pr command shows the numeric offset
in the file preceded by the string Database1 , in
the form
<PSet><Start>Database1:12345</Start></PSet>
offsets , quiet mode
QuietOff , QuietOn
last set
%
refers to the previous result.
% is used as shorthand to refer to the set created most recently
in the XPAT session. The set is the final one in the current history list. Some
commands, such as pr and save , do not create sets that
are saved and recorded in the history list and thus cannot be accessed by using
the % . If there is no history, the last set is the
null set which contains all index points.
>> region Author including "Hemingway"
>> pr sample %
>> % within region Quote
The % in the second line of the example refers to the set
created by the including command in the first query. The
% in the third line also refers to the set created by the first
line and not to the result of the pr in the second line which
does not produce a set.
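The way % resolves against the history list can be pictured with a small Python model. It is a sketch under stated assumptions: the `Session` class and its method names are hypothetical, sets are represented only by their query text, and set numbering is assumed to start at 1.

```python
class Session:
    """Toy model of an XPAT history list; each set is represented by its query text."""

    def __init__(self):
        self.history = []

    def create_set(self, description):
        """A query that creates a set appends it to the history list."""
        self.history.append(description)
        return len(self.history)  # assumed set number, counting from 1

    def pr(self, description):
        """pr displays a set but records nothing in the history list."""
        return f"displayed: {description}"

    def last(self):
        """Resolve % to the most recently created set."""
        if not self.history:
            return "<all index points>"  # the manual's result when there is no history
        return self.history[-1]


s = Session()
s.create_set('region Author including "Hemingway"')
s.pr("sample " + s.last())           # does not add to the history
third = s.last() + " within region Quote"
```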
~free , ~freeall , history
{LeftContext}{LeftContext number}
specifies how many characters of context are displayed to the left of a set member.
By default, when a set is displayed with the pr command or
written to a file by the save command, the text has 14 characters
to the left of the match point. The setting can be changed at any time during a
XPAT session and remains in effect until it is changed again or until the end of
the current session. The current value of the LeftContext setting
is displayed by the command {Settings} .
>> {LeftContext 40}
This changes the setting to the value 40 so that any subsequent
pr or save command produces text with 40 characters
to the left of the match point.
pr , save , Settings
PrintLength
macro
The macro capability facilitates the use of
frequently used sequences of XPAT commands.
A macro can be defined in a XPAT session and be available only for the
duration of that session, or a macro can be created externally and read into any
XPAT session by an exec command or during initialization.
The definition of a macro (here called name )
begins with the following: name = macro
After this line the system prompt changes from >> to || for the duration
of the macro definition. The body of the macro may begin on the same line or on
a subsequent line. XPAT interprets anything immediately following the word
macro , that is not a blank or new line, as the beginning of the
macro definition. The body of the macro may contain arguments. The
nth argument to the macro is identified within the macro definition
by the string $n$ . Any sets that are created by the
macro may also be used in its definition. The string *n* refers to the nth set created within the
macro. The end of the macro definition is indicated by a @ . After the @ the system prompt
returns to the form >> .
References to other macros may be used within the definition of a macro. If the macro contains more than one XPAT query, they can be put on separate lines or on the same line with the queries separated by a semi-colon. The body of the macro is not checked for syntax errors when it is defined. Any errors are reported when the macro is used.
The macro is invoked by the following call:
name(arg1,arg2..)
If the number of arguments in the macro call is less than the number in the macro definition a syntax error is reported. If it is greater, the extra arguments are ignored.
Each argument consists of all the text occurring between argument delimiters:
parentheses and commas. That is, if a macro takes three arguments - (arg1,arg2,arg3) - arg1 consists
of the text between the opening parenthesis and the first comma, arg2 consists of the text between the first and second
comma, and arg3 consists of the text between the
second comma and closing parenthesis. If a macro takes only one argument - (arg1) - the parentheses are the argument delimiters. Note
that any spaces entered with an argument string will be included with the
parameter substitution which is unlikely to be the intent of the user. To avoid
unexpected results, enter only the exact text that you wish to be
substituted in the arguments of the macro call. Also note that macros may have
no arguments.
When the macro is invoked, the invocation is replaced by an exact copy of the
body of the macro with the arguments substituted for the formal parameters. This
means that the macro can be used within other XPAT queries. This may require that
the macro definition have the closing @ on the same
line as the final line of the body of the macro definition to avoid introducing
an unwanted new-line character.
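The argument splitting and textual substitution described above can be approximated in Python. This is a sketch, not XPAT's parser: the functions `split_args` and `expand` are hypothetical, arguments are assumed to contain no nested commas, and the number of arguments a macro expects is assumed to be the highest n that appears as $n$ in its body.

```python
import re


def split_args(call_text):
    """Split the text between the outer parentheses of a macro call on commas.

    Spaces are kept exactly as typed, which is the pitfall the manual warns about.
    """
    inner = call_text[call_text.index("(") + 1 : call_text.rindex(")")]
    return inner.split(",") if inner else []


def expand(body, call_text):
    """Replace each $n$ in the macro body with the nth argument of the call."""
    args = split_args(call_text)
    needed = max((int(n) for n in re.findall(r"\$(\d+)\$", body)), default=0)
    if len(args) < needed:
        raise SyntaxError("too few arguments in macro call")
    # Extra arguments are never referenced by the body, so they are ignored.
    return re.sub(r"\$(\d+)\$", lambda m: args[int(m.group(1)) - 1], body)


word_body = '( "$1$ " + "$1$<" + "$1$-" )'
expanded = expand(word_body, "word(pad)")
padded = expand(word_body, "word( pad)")  # the stray space is substituted too
```

Here `expanded` is the text `( "pad " + "pad<" + "pad-" )`, while `padded` carries the extra leading blank into every string, which is rarely what the user intends.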
If improperly used, macros that produce multiple sets and are used within other queries may cause more than one syntax error to be reported. Care must be taken with bracketing in order to ensure that the results reflect what was actually intended.
A macro can be redefined during a XPAT session. When the macro is redefined,
the previous definition of the macro is displayed following the first new line
entered after the word macro . The format of this previous
definition consists of the macro name followed by a colon, followed by the body
of the macro on subsequent lines.
For convenience, macros that are used frequently can be defined within an init file whose location is given in the data dictionary file. The init file is read and executed by a XPAT session when it is initially started. (See the data dictionary documentation for details.)
>> word = macro
|| ( "$1$ " + "$1$<" + "$1$-" ) @
>> word(pad) within *definitions
This macro is used with text that contains tags that start with a < and where the tags may follow text without blanks
appearing before the tag. The macro defines a word as a string of characters
followed by a blank, < , or - (in this definition an index that has all punctuation
mapped to a blank is assumed). Since the macro definition has the @ sign on the same line as the body of the definition,
the macro can be used within a more complicated query as shown. Note the
brackets included in the macro definition. The example assumes that there is a
region definition and finds all occurrences of
pad , as a word, inside one of these regions.
>> both = macro
|| region $1$
|| *1* including $2$
|| *2* including $3$
|| @
>> both(Line,"juliet","romeo")
With the macro defined here, the members of a predefined region set which
contain both of two given strings are found. In the macro call above, the
macro is applied to a database of Shakespearean texts in order to find the
members of the predefined region set named Line
containing references to both romeo and juliet . The definition of this macro returns more than
one set. It also has the @ on the line following
the body and thus could not be used within another query. The resulting output
from XPAT showing the three sets produced by the macro would appear as below:
16: 128794 matches
17: 214 matches
18: 25 matches
data dictionary documentation, thesaurus
naming sets
name = set1
assigns a name to a set.
A set which has been named can be referred to either by that name or its set number. Set1 can be either a point set or a region set.
A name that starts with a letter and contains only letters and
numbers does not need to be enclosed within double quote marks.
However, if the name contains special characters (blanks or
non-alphanumeric characters), or does not start with a letter, it must be
enclosed within double quote marks both in the assignment statement
and in subsequent use.
To use the name in a query it must be preceded by an asterisk
(* ). Without the asterisk (* ), XPAT interprets the name
as a string rather than as the name of a set.
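The quoting rule can be expressed as a short check. The Python below is a sketch only: the function name `set_reference` is hypothetical, and "letters" is assumed to mean ASCII letters.

```python
import re

# A name needs no quotes when it starts with a letter and contains only letters and digits.
PLAIN_NAME = re.compile(r"[A-Za-z][A-Za-z0-9]*\Z")


def set_reference(name):
    """Return the form used to refer to a named set inside a query."""
    quoted = name if PLAIN_NAME.match(name) else f'"{name}"'
    return "*" + quoted  # the asterisk marks the text as a set name, not a search string


plain = set_reference("UK")
special = set_reference("min_hiring")  # the underscore is non-alphanumeric
```

With these definitions `plain` is `*UK` and `special` is `*"min_hiring"`.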
>> UK = "U.K."+"Britain"+"Great Brit"+"United King"
>> region Headline including *UK
The first line assigns the name UK to a set of
matches to four alternate ways of referring to the United Kingdom. The second
line finds Headline regions that contain any of the matches.
>> "min_hiring" = region Minutes incl ("hiring" near "policy")
>> region Attendees within *"min_hiring"
The first line of the example assigns the name min_hiring to Minutes regions that include matches to
hiring appearing close to matches to policy . The second line finds the Attendees regions that
are within one of the resulting Minutes regions from the first query.
double quote , set name
near
set1 near set2
finds members of sets that are close to each other.
Near creates a set containing the members of set1
that are within a specified number of characters before or after one or more
members of set2. Set1 and set2 may be either
point sets or region sets. The new set is of the same type as set1.
The distance between members of the two sets is calculated by counting the
number of characters in the text between the first character of a member of
set1 and the first character of a member of set2. The
measure used to determine closeness is the value of
the Proximity setting which has a default value of 80 characters.
The value can be changed for all subsequent uses of near by
changing the Proximity setting, or it can be changed for an
individual use of near by using a modifier attached to the command.
The form of the modifier is a period followed by a number representing the
maximum distance (in characters).
If either set1 or set2 is a region set, the first of the two pointers describing the region is used in finding the distance between the members of the sets.
Multiple near commands are not parsed left to right. A command
of the form
set1 near set2 near set3
is handled as if parenthesized as follows:
set1 near (set2 near set3)
The command not near creates a set containing those members of
set1 that are not within the specified distance of any
member of set2.
set1 not near set2
is the same as
set1 - (set1 near set2)
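The distance test, the not near identity, and the grouping of chained near commands can be modeled in a few lines of Python. This is an illustrative sketch: the function names are hypothetical, points are integer offsets, regions are (start, end) tuples, and a distance exactly equal to the Proximity value is assumed to count as near (the manual does not state which way the boundary falls).

```python
def start_of(member):
    """First pointer of a region, or the point itself."""
    return member[0] if isinstance(member, tuple) else member


def near(set1, set2, proximity=80):
    """Members of set1 whose start is within `proximity` characters of a member of set2."""
    return [
        m for m in set1
        if any(abs(start_of(m) - start_of(n)) <= proximity for n in set2)
    ]


def not_near(set1, set2, proximity=80):
    """Equivalent to set1 - (set1 near set2)."""
    close = near(set1, set2, proximity)
    return [m for m in set1 if m not in close]


love = [100, 500, 1000]   # hypothetical match offsets
hate = [150, 1090]

# Chained near groups to the right: a near (b near c), not (a near b) near c.
a, b, c = [0], [70], [140]
right_grouped = near(a, near(b, c))
left_grouped = near(near(a, b), c)
```

In this data only the match at offset 100 is within 80 characters of a member of `hate`, and the two groupings of `a`, `b`, `c` give different results, which is why the parsing order matters.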
>> "love " near "hate "
Assuming a Proximity of 80, this query creates a point set
containing those matches to love that are within 80
characters of matches to hate , counting from the
l in love to the h in hate . The string hate can occur before or after love
in the text.
>> region Title near.30 region Author
This query creates a region set containing the members of region Title that are within 30 characters of one or more
members of region Author . In this case the distance
is measured as the number of characters between the first character of a Title
region and the first character of an Author region.
>> "love " not near "hate "
This query creates a point set containing those matches to love that do not occur within 80 characters of a match to
hate calculating the distance as in the first
example.
>> "love " not near.30 "hate "
This query creates a point set containing the matches to love that do not occur within 30 characters of a match to
hate .
fby , not
Proximity
nextnext set1
finds a specified number of contiguous members of a set following members
already identified by a first or next command.
Next creates a set of a specified size containing the members of
set1 that start at the current cursor position associated with this
set. The cursor position is determined by a previous first or
next command applied to set1. The members of the new set
are in the order they appear in set1. Set1 may be either a
region set or a point set. The new set is of the same type as set1.
The operation of the next command depends on the set order
established by the SortOrder setting. If the SortOrder
setting is Alpha , the set is ordered alphabetically;
if the SortOrder setting is Occur or
OccurHead , the set is ordered as the members occur in
the text; and if the SortOrder setting is AsIs , the set ordering is the current one and may thus be
either alphabetic or occurrence order.
Each set that is used with a first , next or
~nextemp command has a cursor (set member counter) associated with
it. The cursor indicates the location in set1 at which to begin
selection for the set being created. On completion of the next
command the cursor is updated to point at the beginning of the next set. Note,
when the SortOrder setting changes and the set ordering is changed,
the cursor is reset to the first element.
The size of the set created is determined by the value of
SampleSize which has a default value of 10. If the size of
set1 is less than SampleSize , then the new set created
is the same size as set1. Changing the SampleSize
affects all subsequent uses of next during the current session. For
an individual use of the command, the size of the new set can be specified by
using a modifier attached to the next command. The modifier is in
the form of a period followed by a numeric value giving the desired set size.
The next command can be used by itself or with the
pr, save or export commands. These are the only
commands with which next may be combined.
>> {SampleSize 40}
>> first .0 5
>> next 5
The first line of the example changes the SampleSize setting
to 40. The first command resets the cursor associated with set
number 5 to the first member of the set and creates a set of size 0 (thereby
leaving the cursor at the first member). The third line creates a set that
contains the first 40 members of set number 5.
>> next .10 5
If this command follows the previous example, a set of ten members is
created. The cursor associated with set number 5 indicates that 40 members
have been used to create the set in the previous next command and
so this new set starts at the 41st member of set number 5.
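The cursor behaviour in these two examples can be traced with a small Python model. It is a sketch, not XPAT's implementation: the class `CursorSet` is hypothetical, set members are stand-in integers 1 to 100, and first is assumed to reset the cursor and then consume members in the same way next does.

```python
class CursorSet:
    """A set with an associated cursor, as used by first and next."""

    def __init__(self, members, sample_size=10):
        self.members = list(members)
        self.sample_size = sample_size  # stands in for the SampleSize setting
        self.cursor = 0                 # zero-based index of the next unused member

    def first(self, size=None):
        """Reset the cursor to the first member, then take `size` members."""
        self.cursor = 0
        return self.next(size)

    def next(self, size=None):
        """Take members starting at the cursor and advance the cursor past them."""
        n = self.sample_size if size is None else size
        chunk = self.members[self.cursor:self.cursor + n]
        self.cursor += len(chunk)
        return chunk


set5 = CursorSet(range(1, 101), sample_size=40)  # {SampleSize 40}
empty = set5.first(0)       # first .0 5  -> size 0, cursor stays at the first member
first_forty = set5.next()   # next 5      -> members 1 to 40
next_ten = set5.next(10)    # next .10 5  -> starts at the 41st member
```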
first , ~nextemp , sample ,
subset
SampleSize , SortOrder
~nextemp~nextemp set1
finds a specified number of contiguous members of a set following members
already identified by a first or next command.
The command ~nextemp creates a set of a specified size
containing the members of set1 that start at the current cursor
position associated with the set. The cursor position is determined by the
previous first or next command applied to
set1. The members of the new set are in the order they appear in
set1. Set1 may be either a region set or a point set. The
new set is of the same type as set1.
The ~nextemp command is identical to the next
command except that the cursor is unchanged by the ~nextemp
command.
The operation of the ~nextemp command depends on the set order
established by the SortOrder setting. If the SortOrder
setting is Alpha , the set is ordered alphabetically;
if the SortOrder setting is Occur or
OccurHead , the set is ordered as the members occur in
the text; and if the SortOrder setting is AsIs , the set ordering is the current one and may thus be
either alphabetic or occurrence order.
Each set that is used with a first , next or
~nextemp command has a cursor (set member counter) associated with
it. The cursor indicates the location in set1 to start selecting
members for the set being created. After completion of the ~nextemp
command, the cursor is unchanged. This differs from the behaviour of the
next command, which advances the cursor past the last member
of set1 selected for the new set. Note that when the
SortOrder setting and the set ordering change, the cursor is reset
to the first element.
The size of the set created is determined by the value of the
SampleSize setting which has a default value of 10. If the size of
set1 is less than SampleSize , then the new set created
is the same size as set1. Changing the SampleSize
setting affects all subsequent uses of ~nextemp during the current
session. For an individual use of the command, the size of the new set can be
specified by using a modifier attached to the ~nextemp command.
This modifier is in the form of a period followed by a numeric value giving the
desired set size.
The ~nextemp command can be used by itself or with the
pr, save or export commands. These are the only
commands with which ~nextemp may be combined.
>> {SampleSize 40}
>> first .0 5
>> ~nextemp 5
The first line changes the SampleSize setting. The
first command initializes the cursor associated with set number 5
to the first member of the set and creates a result set of size 0 (thereby
leaving the cursor at the first member). The third line creates a set that
contains the first 40 members of set number 5.
>> ~nextemp .10 5
Assume this command follows the previous example. On completion of the
previous query, the cursor still points to the beginning of the set as the
~nextemp command does not change the cursor setting. The set
created by this query contains 10 elements from the beginning of set number 5.
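The difference between the two commands comes down to whether the cursor moves. The Python sketch below makes that explicit; the function `take` and its `advance` flag are hypothetical, and set number 5 is represented by the integers 1 to 100.

```python
def take(members, cursor, size, advance):
    """Return (new_set, new_cursor) for a selection starting at `cursor`.

    advance=True models next; advance=False models ~nextemp.
    """
    chunk = members[cursor:cursor + size]
    return chunk, (cursor + len(chunk) if advance else cursor)


set5 = list(range(1, 101))
cursor = 0                                               # state after: first .0 5

peek_forty, cursor = take(set5, cursor, 40, advance=False)   # ~nextemp 5 with SampleSize 40
peek_ten, cursor = take(set5, cursor, 10, advance=False)     # ~nextemp .10 5
taken_ten, cursor = take(set5, cursor, 10, advance=True)     # next .10 5
```

Both ~nextemp selections start at the first member because the cursor never moved; only the final next call leaves the cursor ten members further on.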
first , next , sample ,
subset
SampleSize , SortOrder
not
is used to modify four XPAT commands.
The forms in which not can appear are not fby ,
not including , not near , and not within .
These uses are described in the entries for fby ,
including , near , and within .
Not cannot be used to modify any other commands.
fby , including , near ,
within
offsets[number]
[label:number]
generates a point set containing a specified position in the text.
The number in the square brackets is a logical position in the text and need not be an index point. The number indicates the offset, measured in number of characters, from the beginning of the text database. The first character of the text has offset [1]. If the number used in square brackets exceeds the size of the text XPAT gives the message
Error: Input number too large.
Note that the new set is a point set with only one member.
The second form of the command, shown above, uses offsets that are produced
when XPAT is operating in quiet mode and using labels. In this form, in order to
produce correct results, the label string must be the current value of the
setting Label . When the label is different from the current
Label setting the resulting set has size 0.
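The two offset forms can be modeled as follows. This Python sketch is illustrative only: the function `offset_set` is hypothetical, a Python exception stands in for XPAT's error message, and the size check is assumed to happen before the label check (the manual does not say which comes first).

```python
def offset_set(spec, text_length, current_label):
    """Evaluate "[number]" or "[label:number]" into a point set (a list of offsets)."""
    inner = spec.strip()[1:-1].strip()
    label, sep, number = inner.rpartition(":")
    position = int(number)
    if position > text_length:
        raise ValueError("Error: Input number too large.")
    if sep and label.strip() != current_label:
        return []              # a label that is not the current Label gives a set of size 0
    return [position]          # otherwise a point set with exactly one member


plain = offset_set("[20000]", text_length=50000, current_label="news")
labelled = offset_set("[news:20000]", text_length=50000, current_label="news")
mismatch = offset_set("[sports:20000]", text_length=50000, current_label="news")
```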
>> region Quote including [20000]
This query finds the Quote region that includes the offset 20000.
>> {Label news}
>> region Quote including [news:20000]
This query uses an offset in the form produced by XPAT in quiet mode (having
requested labels with the offsets). Since the Label has been set
to the value news by the previous command, the
query finds the region set named Quote containing
the given offset. If the label, prefixed to the offset, is anything other than
news the query would produce a set of size 0.
quiet mode , sets
Label
prpr set1
displays contents of XPAT sets.
Pr displays each member of set1 with surrounding
context. A modifier can be attached to the pr command in order to
control the context exactly. Set1 can be any region set or point set.
If set1 is a region set, the first of the pair of points describing
each region in the text is displayed by the pr command. If no
set1 is given, the operand for the command is the most recent set
created in the session.
For each member in the given set, the output is in the form of an integer
giving the offset of the set member in the text file, followed by a comma, a
blank, two periods and then the characters surrounding the set member. The first
character in the database is considered to be offset 1. The order in which the
set is displayed depends on the current SortOrder setting.
With no modifier, pr prints a line of text for each element in
set1. The PrintLength and LeftContext
settings determine the content of the line printed. With the default settings,
the printed text is 64 characters in length of which 14 precede the match point.
The number of characters displayed to the left of the match point can be altered
by changing the LeftContext . The total number of characters printed
can be altered by changing the PrintLength setting.
The total number of characters to be displayed can be set, for a single
instance of the command, by using a numeric modifier attached to the
pr . The modifier is in the form of a period followed by a number
giving the total number of characters to be displayed. The left context that is
displayed is still determined by the value of the LeftContext
setting.
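How PrintLength and LeftContext together select the displayed text can be sketched in Python. The function `pr_line` is hypothetical, and it assumes that a window reaching before the start of the text is simply cut short; the manual does not describe that case.

```python
def pr_line(text, offset, print_length=64, left_context=14):
    """Format one line of pr output for a match at a 1-based text offset."""
    match_index = offset - 1                      # the manual counts offsets from 1
    window_start = match_index - left_context
    begin = max(0, window_start)                  # assumption: clamp at the start of the text
    end = max(begin, window_start + print_length)
    # Offset, comma, blank, two periods, then the surrounding characters.
    return f"{offset}, ..{text[begin:end]}"


sample_text = "x" * 100 + "MATCH" + "y" * 100
default_line = pr_line(sample_text, 101)                   # like: pr
short_line = pr_line(sample_text, 101, print_length=20)    # like: pr.20
```

With the defaults, the 64 displayed characters are 14 before the match point followed by 50 starting at the match point. The numeric modifier changes only the total length, so the left context stays at 14.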
The second form the modifier can have is a period followed by the string
region . When the modifier .region is used, the output
text starts at the match point and continues to the end of the default region in
which the match point occurs. If the match point is not within the default
region, no output is displayed for the match point.
The second form of the modifier can be refined to request that the text
displayed is a region other than the default region. An additional modifier
specifying a defined region set can be attached to the already modified
pr command (i.e. to the pr.region ). The additional
modifier can specify the region in one of three ways: a string giving the name
of a predefined region, the number of a region set created in the XPAT session,
or a string preceded by an asterisk (* ) referring to a named region
set defined in the XPAT session (see the examples below). As with the form
pr.region , described above, this use results in the displayed text
starting at the match point with no left context and continuing to the end of
the region. When the match point is not contained in the designated region set,
no output is displayed.
>> "Kipling"
>> pr
This command displays a line of context for each member of the previously
calculated set. Assuming the PrintLength and
LeftContext still have the default values, each line will contain
64 characters of which 14 will be before the match point.
>> {PrintLength 300}
>> pr "my dear Watson"
As with the previous example, this command prints a line of context for
each member in the point set matching the string my dear
Watson . In this case, the line printed for each member in the set is
300 characters long but still has 14 characters preceding the match point.
>> pr region including "detective"
This command will print a line for each member in the set of default
regions that contains the string detective . The
text displayed starts at the beginning of the default region .
>> pr.200 shift.-100 ("city" near "oxford")
This command prints a line of 200 characters for each member in the set of
matches to the string city when it appears near the
string oxford . Since the match points in this set
have been shifted 100 characters to the left the displayed text actually
begins 114 characters to the left of the string city (assuming the LeftContext is set to
14).
>> region incl (region EQ incl ("<D>1980" .. "<D>1986"))
>> pr.region
The first query finds the members of the default region (in
this example they might be dictionary entries) that contain EQ regions which
are in the period from 1980 to 1986. The second command prints these entries.
After the offset, comma, blank and two periods, the displayed text starts at
the match point which is at the beginning of the default region, and continues
to the end of the default region.
>> region Quote including ("univ" near "waterloo")
>> pr.region.Quote
The first query finds the Quote regions which contain the string univ occurring near the string waterloo . The second command displays these regions. The
output consists of an offset, comma, blank, two periods and the text starting
at the beginning of the Quote region and continuing to the end of the Quote
region.
>> pr.region.5 "law" fby "order"
This command displays data from the set of matches to the string law when followed by order.
The text that is printed starts at each match to the string law and continues to the end of the region in set number 5 that contains
the match point. If no region in set number 5 contains a match point, nothing is displayed for that match.
>> *verse including "faith, hope, charity"
>> pr.region.*verse
The first query finds the regions that contain the string faith, hope, charity occurring in the set that has been
created and named verse during the XPAT session. For
each of the members in this region set, the second command prints information
starting at the beginning of the region and continuing to the end of the
region described by *verse .
history , naming sets , quiet mode ,
region , save
DefaultRegion , LeftContext ,
PrintLength , SortOrder
{PrintLength}{PrintLength number}
specifies how many characters of text are displayed.
By default, when the members of a set are displayed with the pr
command or written to a file by the save command, each member
contains 64 characters of context, 14 to the left of the match point, the match
point itself, and 49 to the right. This setting may be overridden so that the
number of characters processed is determined by a modifier for an individual use
of the pr or save commands. The
PrintLength setting determines the total number of characters
processed and thus affects the number of characters shown to the right of the
match point. The number of characters to the left of the match point is
determined by the LeftContext setting.
The setting can be changed at any time during a XPAT session and remains in
effect until changed again or until the end of the session. The current value of
the PrintLength setting is displayed by the command
{Settings} .
>> {PrintLength 100}
>> pr ("Yukon" near ("B.C." + "British Columbia"))
This changes the setting to the value 100 so that any subsequent
pr or save command produces text 100 characters in
length. The set displayed has 14 characters to the left of the match point and
85 characters to the right, assuming a default value of 14 characters for left
context.
pr , save , Settings
LeftContext
{Proximity}{Proximity number}
specifies the measure of closeness for the near and
fby commands.
The Proximity default for the fby and
near commands is 80 characters. That is, a match point of a member
of set1 must be within 80 characters of a match point of a member of
set2 to be included in a new set created by the near and
fby commands.
The Proximity setting may be overridden for an individual use of
the fby and near commands by appending a modifier to
the command.
The Proximity setting can also be changed at any time during a
XPAT session and remains in effect until changed again or until the end of the
session (see example below). The current value of the Proximity
setting is displayed by the command {Settings} .
>> {Proximity 200}
>> "Canada" near ("U.S." + "United States" + "the States")
The first line of the example changes the Proximity setting to
the value 200 so that any subsequent near or fby commands use this
value. In the query, XPAT finds the occurrences of the string Canada that occur within 200 characters either to the
left or right of members of the set produced by the union of the sets matching
the strings U.S., United States and the States.
fby , near , Settings
~qnum~qnum
outputs a query number.
The ~qnum command operates in both standard and quiet mode. In
standard mode, the number of the next query is output. In quiet mode, the
information is tagged and the number of the next query is contained within
<Qnum> tags.
>> "testing"
>> ~qnum
If testing is the first query in the XPAT
session, the output from the ~qnum command is the set number 2.
In quiet mode, this appears as the string <Qnum>2</Qnum> .
quiet mode
quiet mode{QuietOn Raw Converted Label Persistent}
{QuietOff}
changes the mode of operation of XPAT. {QuietOn} causes XPAT to
operate in quiet mode. {QuietOff} causes XPAT to revert to standard
(non-quiet) mode.
Each of the four arguments to QuietOn is optional and may appear
in any order. When an argument is present in a QuietOn command, the
corresponding setting is turned on. Conversely, when an argument is not present
in a QuietOn command, the corresponding setting is turned off.
Settings are not carried forward from one QuietOn command to the
next but are reset with each QuietOn command.
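The reset behaviour of QuietOn arguments can be illustrated with a short Python sketch. The function `quiet_on` is hypothetical, and rejecting unknown or differently cased argument names is an assumption, since the manual does not say how XPAT treats them.

```python
QUIET_FLAGS = ("Raw", "Converted", "Label", "Persistent")


def quiet_on(command):
    """Parse a {QuietOn ...} command into a fresh table of settings."""
    words = command.strip().strip("{}").split()
    if not words or words[0] != "QuietOn":
        raise ValueError("not a QuietOn command")
    unknown = [w for w in words[1:] if w not in QUIET_FLAGS]
    if unknown:
        raise ValueError(f"unknown QuietOn argument(s): {unknown}")
    # Every setting is recomputed from this command alone; nothing carries forward.
    return {flag: flag in words[1:] for flag in QUIET_FLAGS}


first = quiet_on("{QuietOn Raw Label}")
second = quiet_on("{QuietOn Converted}")   # Raw and Label are turned off again
```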
All XPAT commands that create sets operate the same way in quiet mode and in standard mode. However, the output generated by XPAT is different in the two modes. No prompt or newline appears when XPAT is operating in quiet mode. In addition, the output from XPAT in quiet mode is in a tagged format.
In standard mode, when a command or query creates a new set, a set number and the number of matches is output. In quiet mode, the tagged output contains the number of matches within <SSize> tags but no set number. For example, if a set of 122 matches is created by a XPAT query the output is of the form:
<SSize>122</SSize>
In standard mode, information displayed about a set by a pr
command is affected if a modifier is attached to the command. The output from a
pr command is preceded by the offset in the text of the set member
being printed. If the set is a region set, the offset is the start of each
region in the set. In quiet mode, the output contains the numeric offset in a
tagged format. The settings of Raw, Converted,
Label and Persistent affect the information displayed by
the pr command. Each of the settings is discussed below.
{QuietOn}
With Persistent turned off, the values of the offsets that are output are the logical offsets into the file. Logical and persistent offsets are different and not interchangeable. Persistent offsets are designed for use with the update system. (See the documentation for the XPAT update system.)
If the pr command is not of the form pr.region , the
offset of each set member is contained within <Start> tags and
the entire output is contained within <PSet> (for Point Set)
tags. For example, if the set is of size 2, the output might look as follows
(without the line breaks).
<PSet><Start>1234</Start> <Start>5554</Start></PSet>
If the modifier to the pr command is .region , in
standard mode the text displayed is from the match point to the end of a
specified region. In quiet mode, the tagged output contains both the offset of
the match point and the offset of the end of the specified region. The offsets
of the ends of the region are contained within <End> tags and
the entire output is contained within <RSet> (for Region Set)
tags. For example, for the above set of size 2, output from a
pr.region might look like
<RSet><Start>1234</Start><End>1444</End> <Start>5554</Start><End>6000</End></RSet>
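A client reading this tagged output could extract the offsets with a pattern match. The Python below is a minimal sketch for the plain {QuietOn} form shown above, with Raw, Converted, and Label all off. The function `parse_offsets` is hypothetical and is not part of XPAT or of any DLXS library.

```python
import re


def parse_offsets(tagged):
    """Extract offsets from quiet-mode pr output that contains only offset tags."""
    if tagged.startswith("<PSet>"):
        # Point set: one <Start> per member.
        return [int(n) for n in re.findall(r"<Start>(\d+)</Start>", tagged)]
    if tagged.startswith("<RSet>"):
        # Region set: each member is a <Start>/<End> pair.
        pairs = re.findall(r"<Start>(\d+)</Start>\s*<End>(\d+)</End>", tagged)
        return [(int(start), int(end)) for start, end in pairs]
    raise ValueError("expected <PSet> or <RSet> output")


points = parse_offsets("<PSet><Start>1234</Start> <Start>5554</Start></PSet>")
regions = parse_offsets(
    "<RSet><Start>1234</Start><End>1444</End> "
    "<Start>5554</Start><End>6000</End></RSet>"
)
```

Output that includes labels or embedded text would need a more careful parser, because the text inside the tags is then no longer purely numeric.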
{QuietOn Label}
When Label is turned on, the form in which the offset is printed
changes. The numeric value of the offset into the text within the
<Start> or <End> tags is preceded by an
identifying label and a colon. This label string is the value of the
Label setting. If Label has not been set, the label
used in the output is the name of the data dictionary file up to the first
non-alphanumeric character. For example, if the data dictionary is news.dd and Label has not been set, the output
from a pr command would look like:
<PSet><Start>news:1234</Start> <Start>news:5554</Start></PSet>
{QuietOn Raw}
When Raw is turned on, in addition to the tagged offsets, the output contains text showing the match point and surrounding context. For each member of the set, this additional information is output within <Raw> tags following the tagged offset information for each member of the set.
The length of the string being output is given within <Size>
tags and is followed by the text. As in standard mode, if pr has no
modifier, the length of string output is determined by the
PrintLength setting and the context shown to the left of the match
point is determined by the LeftContext setting. If the modifier is
a numeric value, this value determines the length of the string and the left
context is still determined by the LeftContext setting. If the
modifier to the pr command is .region , the text starts
at the match point and continues to the end of the specified region.
For example, assuming a PrintLength setting of 25, and a
LeftContext setting of 5, the output from a pr command
applied to a set of 2 matches to the string sample
would be (without the line breaks shown here):
<PSet><Start>1234</Start>
<Raw><Size>25</Size> This sample is to be firs</Raw>
<Start>3456</Start>
<Raw><Size>25</Size> This sample is to be seco</Raw>
</PSet>
If the SortOrder setting is OccurHead , in addition to the above output, the descriptive
header is output in a tagged format. (See the entry for SortOrder
for a description of the header). This information is contained within
<Hdr> tags and includes the length of the descriptive string
within <Size> tags followed by the string of the header. If the
SortOrder setting was OccurHead in the
above example, the output would be (without the line breaks shown here):
<PSet><Start>1234</Start>
<Hdr><Size>10</Size>First </Hdr>
<Raw><Size>25</Size> This sample is to be firs</Raw>
<Start>3456</Start>
<Hdr><Size>10</Size>Second </Hdr>
<Raw><Size>25</Size> This sample is to be seco</Raw>
</PSet>
{QuietOn Converted}
When Converted is turned on, in addition to the tagged offsets,
text following the match point is output for each member of the set. This text
is displayed with the appropriate character mappings for the XPAT index and any
stopwords removed. For example, if upper case is mapped to lower case when
creating the index, the text is displayed in lower case. If the index has the
word to as a stopword, to
would not appear in the converted text.
For each member of the set, this additional information is output within
<Cvt> tags. The length of the output text string is enclosed
within <Size> tags and is followed by the text itself. For each
set member, the text string shown starts at the match point. This is in contrast
to the Raw text output which shows the match point with some left
context. If the pr has no modifier, the length of the string is
determined by the PrintLength setting. If the modifier is numeric,
this determines the string length. If the modifier is .region , the
length of the string is the value of the difference between the offsets of the
match point and the end of the region. Because the displayed text is converted
text, some output, such as the multiple blanks that result from character
mappings or stopwords, may be suppressed. As a result, text that occurs past the
end of the region may be displayed.
For example, using the above example of a set of size 2 and further assuming
that to and be are
stopwords, the output might be:
<PSet><Start>1234</Start> <Cvt><Size>25</Size> sample is first used for </Cvt> <Start>3456</Start> <Cvt><Size>25</Size> sample is second used for</Cvt> </PSet>
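The effect of character mappings and stopwords on converted text can be approximated as follows. This Python sketch is hypothetical: it assumes a mapping of upper case to lower case, whitespace-separated words, and a length limit applied after conversion. The real mappings and stopword list come from the XPAT index.

```python
def converted_text(text, stopwords, length):
    """Apply a lower-case mapping, drop stopwords, and collapse blanks.

    Assumptions: the index maps upper case to lower case, words are
    separated by whitespace, and the result is cut to `length` characters.
    """
    words = [w for w in text.lower().split() if w not in stopwords]
    return " ".join(words)[:length]


source = "Sample is to be first used for testing"
print(converted_text(source, {"to", "be"}, 25))
```

Because the length limit is applied to the converted string, removing stopwords lets later text move into the window, which is why converted output can run past the point where the raw output would stop.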
If the SortOrder setting is OccurHead , in addition to the above output, the descriptive
header is given in a tagged format. (See the entry for SortOrder
for a description of the header). The information giving the descriptive string
precedes the <Cvt> tag and does not have the character mappings
applied to it. The previous example would change to:
<PSet><Start>1234</Start> <Hdr><Size>10</Size>First </Hdr> <Cvt><Size>25</Size> sample is first used for </Cvt> <Start>3456</Start> <Hdr><Size>10</Size>Second </Hdr> <Cvt><Size>25</Size> sample is second used for</Cvt> </PSet>
{QuietOn Persistent}
When Persistent is turned on, the offsets that are output are the persistent positions within the text database. As noted earlier, in a database that has not been initialized for update, the persistent and logical offsets are identical.
{QuietOn Raw Converted Label}
Any combination of the QuietOn arguments may be used. Thus,
after the command {QuietOn Raw Converted Label} , the
following would result:
<PSet><Start>news:1234</Start> <Raw><Size>25</Size> This sample is to be firs</Raw> <Cvt><Size>25</Size> sample is first used for </Cvt> <Start>news:3456</Start> <Raw><Size>25</Size> This sample is to be seco</Raw> <Cvt><Size>25</Size> sample is second used for</Cvt> </PSet>
The save command results in identical behaviour to that of the
pr command except that the information is written to a designated
file rather than displayed on the standard output.
Syntax errors that occur during the XPAT session are reported in a tagged format. A set size of -1 is indicated and the error information is contained within <Error> tags. For example, if a command uses the default region before it is set, the error shown is (without the line breaks shown here):
<SSize>-1</SSize> <Error>No information for default region </Error>
Although the sets created by the signif command are the same in
quiet and standard mode, signif does not display the text string
associated with the set in quiet mode. If signif is modified with a
negative integer n requesting n sets, only information
about the last set created is shown.
History and {Settings} display no output in quiet
mode.
history , XPAT update system documentation, pr ,
save , Settings , signif
Label , LeftContext , PrintLength ,
SortOrder
{QuietOff}{QuietOff}
quiet mode
{QuietOn}{QuietOn Raw Converted Label Persistent}
quiet mode
quitquit
terminates a XPAT session.
The use of the quit command causes the session to end and the
XPAT process to exit. A message may be generated telling how much computer time
has been used during the XPAT session.
done , stop
rangestring1 .. string2
finds strings that begin with strings occurring within an alphabetic range.
The range operator creates a point set consisting of those
indexed points in the text that fall alphabetically between string1
and string2 inclusive. String1 and string2 are
patterns that may or may not actually occur in the text being searched. The
resulting set contains the matches to both string1 and
string2.
Both the operands to the range command must be strings. Using a
set number with the range command is illegal and results in a
syntax error.
>> "n" .. "z"
This query finds all indexed points in the text that occur in the
alphabetic ordering between n and z .
>> "a" .. "z"
Again, assuming the text has been indexed on words, this query creates a set of all the words and phrases in alphabetical order (that is, it produces a concordance of the text).
>> "1" .. "200"
This query finds all the strings that fall alphabetically between 1 and 200 . This gives all the
indexed strings that begin with 1 or 200 . For example, the strings 1929 , 20034 as well as strings
such as 2003/1 and 2000-15000 are in this range. The resulting set does not
contain the strings 3 or 4 .
>> region Date including ("1920" .. "1925")
This query finds Date regions that contain dates from 1920 to 1925
inclusive. The range 1920 .. 1925 also contains strings such as "1925000" as they also
fall within the range.
>> "<Date>1920" .. "<Date>1925"
If dates are marked with the tag <Date>
and begin with a 4-digit value for the year, this query reliably finds dates
between 1920 and 1925 inclusive, and only those dates.
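The membership test behind these examples can be sketched in Python. The function in_range is a hypothetical helper, and plain code-point comparison is assumed in place of the index's own collating order and character mappings.

```python
def in_range(s, string1, string2):
    """True if s falls in the range string1 .. string2 as described above.

    Assumption: plain code-point comparison stands in for the index's
    own collating order and character mappings. A string that begins
    with string2 counts as a match to string2 and is therefore included.
    """
    return s >= string1 and (s <= string2 or s.startswith(string2))


for candidate in ["1929", "20034", "2003/1", "2000-15000", "3", "4"]:
    print(candidate, in_range(candidate, "1", "200"))
```

The same test shows why "1920" .. "1925" also admits 1925000, and why prefixing both operands with the <Date> tag narrows the result to strings that begin with the tag followed by a year in the range.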
data dictionary documentation, index points
rankedbyset1 rankedby set2
ranks a region set by the number of contained members of another set.
Rankedby creates a set containing those members of
set1 that contain the greatest number of occurrences of members of
set2. Set1 must be a region set. Set2 may be
either a point set or a region set. The new set is a region set.
Set1 may be a predefined region set, a region set that has been
created within the current XPAT session using the region command, a
region set resulting from the use of the import command, or the
result of a previous query during the current session.
The size of the new set is by default the value of the
SampleSize . Another size may be requested with a numeric modifier
in the form of a period followed by the requested size.
The set that is created, when accessed by pr , save
and subset in SortOrder AsIs , is naturally ordered by rank. That is to say, the
first member will be that element of set1 that contains the most
occurrences of members of set2.
In detail, the rankedby command operates as follows. It first
splits all the members of set1 into groups. Each member of a group
includes the same number of members of set2 as the other members of
the group. In addition, within a group, the members are sorted into occurrence
order. After it has grouped the members of set1, the
rankedby command sorts the groups into decreasing order of number
of included members of set2.
For example, say that set1 has 6 members, as follows: 3 members
that each contain 2 members of set2, 2 members that each contain 4
members of set2, and 1 member that contains no members of
set2. After rankedby has grouped and sorted
set1, the groups are as follows. The first group consists of the 2
members of set1 that contain 4 members of set2. The second
group consists of the 3 members of set1 that contain 2 members of
set2, and the third group consists of the 1 member of set1
that contains no members of set2. Within each group, the members are
in occurrence order. If the user has requested the top 4 sets, the result set
would contain both members of the first group and the first two members of the
second group.
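The grouping and sorting described above can be sketched in Python. This is an illustration under stated assumptions, not the XPAT implementation: regions are (start, end) pairs with an exclusive end, set2 is taken to be a point set, and occurrence order is taken to mean order of start offset. The offsets are made up to reproduce the six-member example.

```python
def rankedby(set1, set2, size=10):
    """Rank regions of set1 by how many members of set2 they contain.

    set1: list of (start, end) regions; set2: list of point offsets.
    Assumptions: a point p is contained in a region when
    start <= p < end, and occurrence order means order of start offset.
    """
    def count(region):
        start, end = region
        return sum(1 for p in set2 if start <= p < end)

    # Decreasing count first, then occurrence order within each group.
    ranked = sorted(set1, key=lambda r: (-count(r), r[0]))
    return ranked[:size]


regions = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50), (50, 60)]
points = [1, 2, 11, 12, 13, 14, 31, 32, 41, 42, 43, 44, 51, 52]
print(rankedby(regions, points, size=4))
```

Here the regions starting at 10 and 40 each contain 4 points and come first, followed by the first two of the three regions that contain 2 points.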
>> region Story rankedby ("Free trade" near "Canada")
This query finds the regions described by region
Story that contain the greatest number of matches to Free trade when it occurs close to the string Canada . The number of members in the new set is the value
of the SampleSize setting.
>> region Quote rankedby.5 region Author
This query creates a set whose members are the 5 members of region Quote that contain the greatest number of members
of region Author .
including , region
SampleSize
regionregion
region string
region set1 .. set2
produce region sets in a text database. The first two forms of the
region command refer to region sets that have been defined
externally to a XPAT session and for which information is available in the data
dictionary. These region sets may have been defined using patregion or
any other program that generates information (in the form that XPAT understands)
about regions in the text. The third form of the command defines a region set
during a XPAT session. The results of any of these commands can be used as
operands to any of the XPAT commands that operate on region sets.
region
Region , used with no operand, refers to the particular
predefined region set that has been designated as the default region. The
default region is defined by the DefaultRegion setting and can be
reset for the remainder of a XPAT session by changing the setting. If no default
region has been defined, using region in this form causes an error.
The following message is generated:
No information for default region.
region string
The second form of the region command indicates one of the named
predefined region sets. The string is the name that has been given to
the region set in the data dictionary. For example, the region sets might be the
chapters of a book, the entries in a dictionary or the headlines in a newspaper
database. The information about certain regions in the text database is
generated by a program external to XPAT and is made available during a XPAT
session via the data dictionary. One program that generates the information is
patregion.
Note that the string giving the name of the region set can contain blanks or
special characters, if it is enclosed within double quote
marks.
region set1 .. set2
The third form of the region command defines a new region set.
The region set that is created by this command is only available for the
duration of the XPAT session. Information about this region set can be written to
a file using the export command and read into a future XPAT session
using the import command.
Set1 and set2 are used to define regions in the new
set. Set1 and set2 can be either point sets or region
sets. If either set1 or set2 is a region set, the
region command uses only the first of the pair of pointers
describing its members in defining the new region.
Each region in the new set is formed as follows. A member of set1
is the beginning of a region if it is followed by a member from set2
with no other member of set1 occurring between the two members. The
end point of the new region is defined by the member in set2 that
most closely follows the set1 member. The region contains the text
from the beginning of the member of set1 up to but not including the
member of set2. This produces the smallest non-overlapping region set
that can be formed by set1 and set2. The size of the
region set created is equal to or smaller than the size of set1. If
the members of set2 are matches to a pattern, the new region set
does not contain the occurrences of that pattern. For example, if
set2 is the set of matches to the string End of
Message , the new region set contains no occurrences of the string End of Message .
If set1 and set2 are identical, two extra regions may
be included in the newly created set. These are: a region from the beginning of
the text to the member of set1 that occurs earliest in the text; and
a region from the last element of set1 in the text to the end of the
text. If either of these regions is a substring of length zero, it is not
included. If the shift command is applied to set1 or
set2, the extra regions are not included in the new set.
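The construction rule for region set1 .. set2 can be sketched in Python. The function make_regions and its add_edges flag are hypothetical: the flag stands for the case where set1 and set2 are identical and no shift has been applied. The sketch also assumes that offsets within each set are distinct, that the end offset is exclusive, and that a set2 member must lie strictly after the set1 member it closes.

```python
from bisect import bisect_right


def make_regions(set1, set2, text_len, add_edges=False):
    """Build (start, end) regions from two lists of distinct offsets.

    Assumptions: end is exclusive, a set2 member must lie strictly
    after the set1 member it closes, and add_edges stands for the case
    where set1 and set2 are identical and no shift has been applied.
    """
    starts, ends = sorted(set1), sorted(set2)
    regions = []
    for i, start in enumerate(starts):
        j = bisect_right(ends, start)  # first set2 member after start
        if j == len(ends):
            continue  # nothing follows this set1 member
        end = ends[j]
        if i + 1 < len(starts) and starts[i + 1] < end:
            continue  # another set1 member intervenes
        regions.append((start, end))
    if add_edges and starts:
        if starts[0] > 0:
            regions.insert(0, (0, starts[0]))
        if starts[-1] < text_len:
            regions.append((starts[-1], text_len))
    return regions


froms = list(range(10, 110, 10))  # ten hypothetical match offsets
print(len(make_regions(froms, froms, 200, add_edges=True)))
print(len(make_regions(froms, froms, 200)))
```

With ten match offsets used as both operands, the sketch yields 9 regions between consecutive matches, or 11 when the two extra regions at the beginning and end of the text are included.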
Some programs, such as patregion, that produce predefined region
sets, define the end point of the region in a somewhat different manner. These
programs deal with patterns of text (rather than points in the text) and the end
point of the region that is defined is usually the last character in the pattern
that is used to define the regions. If desired, the region command
within a XPAT session can be used in conjunction with the shift
command to create a set of regions in which the ends of the regions are at the
end of a pattern. See the examples below.
XPAT does not support region sets whose members nest or overlap. As described
above, using region with operands that are patterns defining nested
or overlapping regions, creates a region set which is the smallest
non-overlapping set of regions. Patregion used on the same text creates
a possibly different region set (also non-overlapping) consisting of regions
from an opening pattern to the following end pattern.
>> region including ("Smith" near "Jones")
This query creates a region set, consisting of the members of the default
region which contain a match to the string Smith
when it occurs within a prescribed distance of the string Jones .
>> "Campbell" within region "Speaker Name"
In this example, we assume that one of the predefined regions has been
named Speaker Name . This query creates a point set
that contains matches to the string Campbell
occurring within members of region Speaker Name .
>> firstb = region "<A>" .. "</B>"
>> (region B within *firstb) including "requested string"
The text, in this example, contains regions that begin with <A> and end with </A> . Each of these A
regions contains smaller regions that begin with <B> and end with </B> . Assume, in certain instances, that it is
necessary to be able to find the first B region
within each A region. The use of region in the
first query creates a region set named firstb that
can be used to find these regions. The members of firstb are the pieces of text that begin with the string
<A> and extend to the closest string </B> . The second query finds the members of region B that are within firstb , and then finds the members of the latter that
include requested string .
>> quote = region "<Q>" .. (shift.4 "</Q>")
If some components in the text are tagged with <Q> and </Q> this
command creates a region set describing these components. Each region in the
set extends from the opening tag <Q> to the
end of the closing tag </Q> . By using the
shift operator, applied to the </Q> , the
members of the point set used to find the ends of the new regions all point to
the end of the string </Q> rather than to the
beginning of the tag.
>> "From:"
>> mess1 = region "From:" .. "From:"
>> mess2 = region "From:" .. (shift.0 "From:")
>> from = region *mess1 .. "Received:"
>> "Bill" within *from
This set of queries is being applied to a database of mail messages. Each
message has the string From: at the beginning. The
string Received: appears at the beginning of the
second line of the message indicating the time the message was received.
Assume that the first query, identifying the matches to the string From: , returns a set of size 10. Further assume that
there is text in the database preceding the first From: . The next query creates a region set of size 11 as
two additional regions are included in the resulting set: one containing the
text from the beginning of the text to the first occurrence of From: and the other containing the text from the last
occurrence of From: to the end of the text. The
third query creates a region set of size 9 as these two regions are not
included in the new set. The next query creates a region set describing the
sender of the message. The final query finds the matches to Bill in the regions describing the sender of the message.
Notice the use of an asterisk (* ) before the name of the new
region set when it is used as an operand to a XPAT command.
data dictionary documentation, export , import ,
including , index point , naming sets ,
pr , save , set name , shift ,
within
DefaultRegion
samplesample set1
finds representative members of a larger set.
Sample creates a set containing a specified number of members of
set1. Set1 may be either a region set or a point set. The
new set is of the same type as set1.
The size of the set created is determined by the value of the
SampleSize setting which has a default value of 10. If the size of
set1 is less than SampleSize , then the new set created
is the same size as set1. The size can be changed for all subsequent
uses of sample during the current session by changing the
SampleSize setting. For an individual use of the
sample command, the setting can be changed by using a modifier
attached to the command. The form of the modifier is a period followed by a
number giving the desired size of the sample set.
The members of the sample set are chosen as follows. If the size of
set1 is x and the sample size requested is y,
each x/yth member of set1 is in the sample set. For
example, if a sample of size 20 is requested from a set of size 2000, the 100th,
200th members etc. are chosen. The ordering of the set, and hence the members of
the sample set, is determined when the set is created. The
SortOrder setting does not determine which members are included in
the set created by the sample command as it does for the
subset , next , ~nextemp , and
first commands. However, this setting does affect how the
sample set is ordered when used with a pr command (or
save command).
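The selection rule can be sketched in Python. The function below is a hypothetical illustration; integer division is assumed when the set size is not an exact multiple of the sample size, a case the description above does not spell out.

```python
def sample(members, size=10):
    """Pick every (x // size)-th member of an ordered set of x members.

    Assumptions: members is already in the order fixed when the set was
    created, and integer division is used when x is not a multiple of
    size. A set no larger than size is returned whole.
    """
    x = len(members)
    if x <= size:
        return list(members)
    step = x // size
    return [members[k * step - 1] for k in range(1, size + 1)]


chosen = sample(list(range(1, 2001)), size=20)
print(chosen[:3])
```

For a set of 2000 members numbered from 1, a sample of size 20 takes the 100th, 200th, 300th members and so on.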
The sample command can be used by itself or with the
pr , save or export commands. The
sample command may only be used in conjunction with these commands.
>> sample "shaks"
Assuming a SampleSize setting of 10, this query creates a set
of 10 examples from the set of matches to the string shaks .
>> {SampleSize 30}
>> sample "shaks"
The first command changes the SampleSize setting to 30 and the
second creates a set of 30 examples from the set of matches to the string
shaks .
>> region Quote including "Doyle"
>> sample.20 %
The first query creates a region set containing Quote regions that include
the string Doyle . The second query creates a sample
set of 20 members from the results of the first query.
>> region Quote including (sample "Doyle")
This query is illegal and results in a syntax error.
first , next , ~nextemp ,
subset
SampleSize , SortOrder
{SampleSize}{SampleSize number}
specifies the size of the set produced by the sample ,
subset , and rankedby commands.
By default, sample and subset create a set of 10
members of a given set. This setting may be overridden and the size of the
result determined by a modifier for an individual use of these commands. The
SampleSize setting can be changed at any time during a XPAT session
and remains in effect until changed again or until the end of the session. The
current value of the SampleSize setting is displayed by the command
{Settings} .
>> {SampleSize 200}
>> pr sample 5
This changes the SampleSize to 200 and any subsequent
sample or subset command uses this value. In the
second query XPAT prints information about 200 members of set number 5 created
earlier in the session.
rankedby , sample , Settings ,
subset
savesave set1
writes the contents of a set to a file.
The save command is identical to the pr command
except that the output is written to a file. The name of the file where the
information is written is determined by the value of the setting
SaveFile . The default value of the setting is xpat.res . The file used by the save command can
be changed at any time during the XPAT session by changing the setting. The
information output by the save command is concatenated onto the end
of the save file if one of the same name already exists. Otherwise, a new file
is created and the information is written to the new file. Assuming the default
setting of SaveFile , the following message is printed on execution
of the save command:
Saving in xpat.res.
For each member in the given set, the output is in the form of an integer
giving the offset of the set member in the text file followed by a comma, a
blank, two periods and the characters surrounding the set member. The order in
which the set is output is determined by the current SortOrder
setting.
With no modifier, Save outputs a line of text for each element
in set1. The PrintLength and LeftContext
settings determine the content of the line saved. With the default settings, the saved
text is 64 characters in length of which 14 precede the match point. The number
of characters to the left of the match point can be altered by changing the
LeftContext setting. The total number of characters printed can be
altered by changing the PrintLength setting.
The total number of characters to be saved can be set for a single instance
of the command, by using a numeric modifier attached to the save
command. The modifier is in the form of a period followed by a number giving the
total number of characters to be saved. The left context that is saved is still
determined by the value of the LeftContext setting.
The second form the modifier can have is a period followed by the string
region . When the command save.region is used, the
output text starts at the match point and continues to the end of the default
region in which the match point occurs. If the match point is not within a
default region, no output is saved for the match point.
The second form of the modifier can be refined to request that the text
output be in a region other than the default region. An additional modifier,
specifying a defined set of regions, can be attached to the already modified
save command (i.e. to the save.region ). This
additional modifier can be in one of three forms: a string giving the name of a
predefined region, the number of a region set created in the XPAT session, or a
string preceded by an asterisk (* ) referring to a named region set
defined in the XPAT session. As with the form save.region , described
above, this use results in the output text starting at the match point with no
left context and continuing to the end of the region. When the match point is
not contained in the designated set, no output is saved.
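The behaviour of the .region modifier can be sketched in Python. The helper save_region_line is hypothetical, and regions are assumed to be (start, end) offset pairs with an exclusive end. The sample text and offsets are made up for illustration.

```python
def save_region_line(text, match, regions):
    """Format one output line for a save.region style command.

    regions: list of (start, end) offsets, end exclusive (an assumption
    about representation). Returns None when the match point is not
    inside any region, in which case nothing is saved.
    """
    for start, end in regions:
        if start <= match < end:
            # Offset, comma, blank, two periods, then text from the
            # match point to the end of the enclosing region.
            return f"{match}, ..{text[match:end]}"
    return None


text = "<Q>night and day</Q> later text"
quotes = [(0, 20)]  # the single <Q>...</Q> region, tags included
print(save_region_line(text, text.index("night"), quotes))
print(save_region_line(text, text.index("later"), quotes))
```

The first call produces a line that starts at the match point with no left context; the second returns None because the match point lies outside the region set.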
The similarly named commands save.commands and
save.history result in very different behaviour and are described
in separate entries.
>> "Helen Maday"
>> save
As a result of this command, XPAT writes a line of context for each member
in the most recently created set. The information is written to the file that
is named by the setting SaveFile . If the setting has not been
changed during the session, the file used is xpat.res . Note that the information is appended to the
save file if one of the same name already exists.
>> save "From: Tony Lopez "
A line of context for each member of the set that matches the string From: Tony Lopez is written to the save file.
>> {PrintLength 120}
>> save region including "planet"
A line of context is written for each member in the set of regions created
by the including query. The line that is written starts at the
beginning of each region in the new set. Since the PrintLength
has been set to 120, each line contains 120 characters and has 14 characters
to the left of the beginning of the displayed region.
>> save.200 shift.-100 ("procedure" near "policy")
In this case, a line of 200 characters is written to the save file for each
member in the set created by the query shift.-100
("procedure" near "policy") . The text that is written starts 114
characters to the left of the string procedure .
>> region including (region EQ including "<A>Doyle</A>")
>> save.region
The first query finds all the default regions that include one of the earliest quotes (defined by the region EQ ) that have Doyle as
the author. The second command saves information about each of these default
regions. The information that is written for
each of these regions contains the offset in the text file of the region, a
comma, a blank followed by two periods and the text of the default region. As
no set is given as an operand to the save command, it is
understood that the command applies to the previous set.
>> region Quote including ("stadium" near "Toronto")
>> save.region.Quote %
The first query finds all quotes that contain the string stadium within 80 characters of the string Toronto . The second command saves information in the save
file (xpat.res unless the SaveFile
setting has been reset) about each of these regions. As in the example above,
the output for each set member is in the form of an integer giving the text
offset, a comma, a blank followed by two periods and the text beginning at the
start of region Quote and continuing to the end of
the region.
>> save.region.5 "night" fby "day"
This command saves information about the set created by the query "night" fby "day" . The information written to the save
file (after the offset, comma, blank and two periods) starts at the text night and continues to the end of the region defined by
set number 5 that contains the match.
>> minutes = region "<Min>" .. "</Min>"
>> *minutes including "examination schedule"
>> save.region.*minutes
The first query in this example defines a set of regions that are named
minutes . The second query finds the regions in this
set that contain the string examination schedule .
The third command saves information about this set in the save file. For each
member in this set, the information saved contains the offset, comma, blank
and two periods followed by the text of the region in the set named minutes .
exec , export , import , pr ,
save.commands , save.history
DefaultRegion , LeftContext ,
PrintLength , SaveFile , SortOrder
save.commandssave.commands
writes information to a file about the queries in the XPAT session. These are saved in a form that allows them to be used in another XPAT session.
Save.commands saves, in a file, all the queries that have been
executed and have produced sets during the current session. These are the
queries that appear in the history list. Only the command is saved in the file,
not the set number or number of matches. The setting CommandFile ,
that determines the file where the information is written, has a default value
of xpat.cmd . The output file can be changed at any
time during the session by changing the CommandFile setting. If a
file of this name already exists, the information is concatenated onto the end
of the file. Otherwise a new file is created.
The saved information can be read into a XPAT session and executed using the
exec command.
>> "love" near "hate"
>> pr sample
>> region Q including %
>> save.region.Q %
>> {CommandFile "/usr/my_commandfile"}
>> save.commands
The second-to-last command sets a new name for the file to be used by
save.commands , namely /usr/my_commandfile .
The final command saves the information about the commands that has been
generated to this point in the XPAT session. In the portion of the session
shown here, only two commands generated sets and so the following is saved in
the file /usr/my_commandfile :
"love" near "hate"
region Q including %
exec , export , import ,
save , save.history
CommandFile
{SaveFile}{SaveFile string}
changes the file name used by save .
The SaveFile setting determines the file written by the
save command. It has a default value of xpat.res . If the string begins with a numeral, or contains
blanks or non-alphanumeric characters, it must be enclosed within double
quote marks. The file name must also conform to the file naming
convention of the host operating system. It can be changed at any time during a
XPAT session and remains in effect until it is changed again or until the end of
the session. The current value of the SaveFile setting is displayed
by the command {Settings} .
>> {SaveFile "output_file"}
This changes the setting to the value output_file so that any subsequent use of the
save command writes to this file. The name of the file is not an
absolute path name and is therefore located in the current working directory.
save , Settings
save.historysave.history
writes information to a file about the queries and results in the current XPAT session.
Save.history writes a record of a XPAT session. XPAT's history
list records information about all queries that produce sets. For these queries,
save.history saves the set number, the number of members in the
set, and the query that produced the set in a file. The setting
HistoryFile , that determines the file where the information is
written, has a default value of xpat.his . A different
output file can be chosen at any time during a session by changing the setting.
If a file of this name already exists, the information is concatenated onto the
end of the file. Otherwise a new file is created. Note that comments are saved
only if they are on the same line as the command itself.
>> "fish" near "fowl"
>> pr
>> region definition including %
>> save.history
This command saves the information from the history list in the file xpat.his . After the sequence of commands shown above, the
history list contains information about two sets which are saved into a file:
1: 142, "fish" near "fowl"
2: 17, region definition including %
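The difference between the records written by save.history and save.commands can be sketched in Python. The function names are hypothetical, and the line layout is inferred from the examples in this manual rather than from a format specification.

```python
def history_lines(history):
    """Lines in the style written by save.history: 'n: size, query'."""
    return [f"{num}: {size}, {query}" for num, size, query in history]


def command_lines(history):
    """Lines in the style written by save.commands: the query only."""
    return [query for _num, _size, query in history]


def append_lines(path, lines):
    """Append to the file, creating it first if it does not exist."""
    with open(path, "a", encoding="utf-8") as out:
        out.writelines(line + "\n" for line in lines)


history = [
    (1, 142, '"fish" near "fowl"'),
    (2, 17, "region definition including %"),
]
print("\n".join(history_lines(history)))
```

Opening the file in append mode mirrors the documented behaviour: information is concatenated onto an existing file of the same name, and a new file is created otherwise.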
exec , export , history ,
import , save , save.commands
HistoryFile
set name* name
refers to a named set.
Query result sets may be named and subsequently referred to either by set
number or by name. The name must be preceded by an asterisk (* ) to
reference the set; otherwise, XPAT interprets the name as a command or search
string.
>> univ = "university" near "MIT"
>> qu = region Quote including *univ
>> *qu including "Harvard"
The first query creates a set of matches to university occurring near MIT .
The second query uses the set *univ and creates a
new set *qu . The third query finds the Quote
regions that include the set of matches from the first query as well as the
string Harvard .
>> begin = region "<Title>" .. "</Summary>"
>> "Paris" within *begin
The first query defines a new region set and calls the new set begin . The second query creates a set containing the
matches to Paris that fall within one of these
regions.
region , including , naming sets ,
set number , within
set number
references a previously created set.
After the first query in a session, XPAT displays a line of the form:
1: 300 matches
The number 1 here names the set of results and can be used in subsequent
searches. The valid set numbers are those displayed by the history
command.
When an invalid set number is used XPAT generates a message. If, for example, set number 33 is referenced before it has been calculated or after it has been freed the message is:
Expression 33 is out of range
>> region Author including 5
In this query, the number 5 refers to the fifth result of the session. For example, set 5 might be the set of matches to all the variants of spelling for a particular author's name.
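How an operand might be resolved to a previously created set, by number or by name with a leading asterisk, can be sketched in Python. The SetStore class is a hypothetical toy, not part of XPAT. Only the wording of the out-of-range message is taken from this manual, and freeing of sets is not modelled.

```python
class SetStore:
    """Toy store for result sets, addressed by number or by *name."""

    def __init__(self):
        self.by_number = {}
        self.by_name = {}

    def add(self, members, name=None):
        """Record a new result set and return its set number."""
        number = len(self.by_number) + 1
        self.by_number[number] = members
        if name is not None:
            self.by_name[name] = number
        return number

    def resolve(self, operand):
        """Return the set an operand refers to, by *name or by number."""
        operand = operand.strip()
        if operand.startswith("*"):
            return self.by_number[self.by_name[operand[1:].strip()]]
        if operand.isdigit():
            number = int(operand)
            if number not in self.by_number:
                raise LookupError(f"Expression {number} is out of range")
            return self.by_number[number]
        raise ValueError("not a set reference; treat as a search string or command")


store = SetStore()
store.add([1234, 3456], name="univ")
print(store.resolve("*univ") == store.resolve("1"))
```

A name without the asterisk falls through to the final branch, which corresponds to XPAT interpreting it as a command or search string.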
history , ~free , ~qnum , set
name
History
sets
In the XPAT system, queries are combinations of the XPAT commands described in this document. In response to each query, XPAT creates a set which is either a point set or a region set.
These sets can be used as operands in subsequent queries. In contrast to the conventional approach of a single, nested compound query, XPAT allows complex queries to be expressed as a series of simple queries. This provides an opportunity to try alternative ways of combining previous result sets to arrive at a solution. XPAT provides a history list of all previous sets created in a session and a convenient notation to access them.
A member of a point set is a location in the text which is the start of a
string that continues to the end of the text. The XPAT system finds locations in
the text where strings, matching pattern(s) given in the query, begin. The
members of a point set are usually index points; however, in the
sets created by shift or offsets , (the notation
[n] ), the members refer to positions in the text that may or may
not be index points.
The members of a region set are substrings of the text, beginning and ending at specified points. Region sets that are the result of a query within a XPAT session are available only for the duration of the session. However, region s