Last updated: 2002-03-26 15:10:20 EST
Doc Title: The XPAT Command Manual
Author 1: Wilkin, John Price
CVS Revision: $Revision: 1.8 $
The following provides a summary of XPAT commands, settings, and concepts, and is based extensively on Open Text's PAT 5.0 documentation. Many of the commands included here are not implemented in DLXS middleware.
{CommandFile}
{CommandFile string}
changes the file name used by save.commands and exec.
The CommandFile setting determines which file the save.commands command writes to and the exec command reads from. It has a default value of xpat.cmd. If the string begins with a numeral or contains blanks or non-alphanumeric characters, it must be enclosed within double quote marks. The file name must also conform to the file naming conventions of the host operating system. The setting can be changed at any time during an XPAT session and remains in effect until changed again or until the end of the session. The current value of CommandFile is displayed by the command {Settings}.
>> {CommandFile "/usr/new/output_file"}
This changes the setting to the value /usr/new/output_file, which any subsequent save.commands command writes to and exec command reads from.
exec, save.commands, Settings
comment
#
marks the start of a comment.
The comment, that is, the # and the rest of the line following the #, is ignored by XPAT. The comment can be placed on a line by itself or following an XPAT query. It is useful for annotating queries stored in a file to be processed in batch mode or to be read in by the exec command. The queries may be created externally or generated during an XPAT session and saved by save.commands for later use.
>> # find all the Shakespearean quotations
>> region Quote incl (region Author incl "shaks")
The line beginning with the # is ignored by XPAT.
>> first = region "<E>" .. "</L>" # find first language
XPAT creates a new region set with this command. The rest of the line, beginning with the #, is ignored.
exec, save.commands, save.history
{DefaultRegion}
{DefaultRegion string}
determines which region set is the current default.
The DefaultRegion setting designates a special region set, known as the default region. The default region can be referred to as region without specifying the actual region name. The setting can be changed at any time during an XPAT session and remains in effect until it is changed again or until the end of the session.
If the string giving the setting value begins with a numeral or contains blanks or non-alphanumeric characters, it must be enclosed within double quote marks.
Using the default region in a command, without having previously specified one, is illegal and results in the following message.
No information for region in the data dictionary
For convenience, a frequently used DefaultRegion setting can be defined within an init file whose location is given in the data dictionary file. The init file is read and executed by an XPAT session when it is started (see the data dictionary documentation for details).
>> region including "constitution"
The region set referred to in the example by region is the one designated by the DefaultRegion setting.
>> {DefaultRegion HeadLine}
>> region including constitution
The first line changes the DefaultRegion setting to the value HeadLine. The command that follows uses the region set HeadLine even though it is not specified.
data dictionary documentation, including, pr, region, save, Settings, within
difference
set1 - set2
removes members from a set.
The difference operator (-) creates a new set containing the members of set1 that are not members of set2. Set1 and set2 can be either point sets or region sets. The new set is of the same type as set1.
If either set1 or set2 is a region set, the first pointer delineating each region is used to determine if a member of set1 also occurs in set2. Thus, for set arithmetic (difference, union and intersection) in XPAT, set members of a region set are considered to be equal if they start at the same location in the text. The end point of a region is ignored in such operations.
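The start-offset rule can be illustrated with a small Python sketch. It is not XPAT code: points are modelled as integers and regions as (start, end) tuples, and the offsets are hypothetical values chosen only to show that a region and a point compare equal when they begin at the same location.

```python
def start(member):
    """Return the start offset of a point (int) or a region ((start, end) tuple)."""
    return member[0] if isinstance(member, tuple) else member


def difference(set1, set2):
    """Members of set1 whose start offset does not occur as a start offset in set2."""
    starts2 = {start(m) for m in set2}
    return [m for m in set1 if start(m) not in starts2]


def intersection(set1, set2):
    """Members of set1 whose start offset also occurs as a start offset in set2."""
    starts2 = {start(m) for m in set2}
    return [m for m in set1 if start(m) in starts2]


# Hypothetical offsets, chosen only for illustration.
region_q = [(10, 40), (50, 90), (120, 150)]
points = [50, 200]

print(difference(region_q, points))    # [(10, 40), (120, 150)]
print(intersection(region_q, points))  # [(50, 90)]
```

In both functions the result keeps the members (and therefore the type) of set1, and the end point of a region never takes part in the comparison.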
>> "to" - "to " - "to<"
Note that these operators are parsed left to right and can be combined without bracketing. This query creates a point set that contains all the matches to the prefix to, excluding those to the string to followed by a blank or a left angle bracket. Assuming an index in which all punctuation has been mapped to blanks, the result contains words starting with to but not the word to itself.
>> ("q" - "qu") within region HeadWord
This query creates a point set. The point set includes all words located in a HeadWord region that begin with q but not with qu.
>> region Story incl "music " - region Story incl "art "
This query creates a region set. The region set consists of all Story regions that include the string music but not the string art.
>> region Q - "<Q><D>"
Assume that the regions described by region Q all begin with the string <Q>. The above query creates a region set of the members of region Q that do not have the string <D> immediately following the <Q>.
intersection, union
done
done
terminates an XPAT session.
The done command ends the session and causes the XPAT process to exit. A message may be generated telling how much computer time has been used during the session.
quit, stop
double quote
"string"
allows the use of strings that include special characters.
Normally, XPAT interprets a sequence of characters as a string and searches the database for matches to it. However, there are certain types of strings that XPAT cannot recognize as search targets unless they are enclosed within double quote marks. The special strings are: strings that begin with a numeral, for example 2nd; strings that contain blanks or non-alphanumeric characters, for example end of the year or <Author>Scott; and strings that are XPAT commands, for example near and within. In each case, a string that should be enclosed in double quote marks but is not will result in a syntax error or an unexpected result.
Note that if numbers are not enclosed in double quote marks, they are interpreted as a reference to the number of a set previously calculated in the XPAT session.
A pair of quotes representing an empty string ("") stands for the set of all index points in the text being searched.
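A program that generates XPAT queries could apply these quoting rules mechanically. The following Python sketch is one possible reading of them; the function names and the COMMAND_WORDS list are assumptions made for illustration (the list is deliberately partial), and no escaping of embedded quote marks is attempted.

```python
# Partial, illustrative list: only command words that this manual itself shows
# needing quotes. A real XPAT installation recognizes many more.
COMMAND_WORDS = {"near", "within", "done"}


def needs_quotes(s):
    """Return True if s must be written as "s" to be treated as a search string."""
    if s == "":
        return True               # the empty string can only be written as ""
    if s[0].isdigit():
        return True               # begins with a numeral
    if not s.isalnum():
        return True               # blanks or other non-alphanumeric characters
    return s in COMMAND_WORDS     # would otherwise be read as a command


def as_query_string(s):
    """Quote s only when the rules above require it."""
    return f'"{s}"' if needs_quotes(s) else s


print(as_query_string("2nd"))           # "2nd"
print(as_query_string("constitution"))  # constitution
```

Quoting every string unconditionally would also be safe; the check is only useful when the generated queries should stay readable.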
>> "done "
>> done
The first command creates a point set containing matches to the word done. The second command ends the XPAT session.
>> 19 within region Date
>> "19" within region Date
The first query finds those members of the previously calculated set, identified by the number 19, that are within region Date. The second query finds the matches to the string 19 within region Date.
>> ""
This command produces a list of every point indexed in the text.
>> "_XPat_1" = "match this string "
>> "_XPat_OP1" = region "Region Set 5"
>> "_XPat_2" = *"_XPat_1" within *"_XPat_OP1"
The above sequence of commands might be produced by a program that accepts input from a user and generates commands that are sent to XPAT. Since the names contain non-alphanumeric characters, they must be enclosed in quotation marks.
index point, region, set name, string search
exec
exec
reads a file into an XPAT session and executes the commands contained in the file.
The name of the file read by the exec command is determined by the value of the CommandFile setting. By default, the value is xpat.cmd, but it can be changed at any time during the XPAT session.
The exec command can be used to enter queries to an XPAT session. The queries, for example macro definitions, may be recorded in a file using an editor or saved in a file from a previous XPAT session using save.commands.
>> {CommandFile "/usr/xpat/srch023.q"}
>> exec
The first command sets the name of the file to be read by any exec command to /usr/xpat/srch023.q. The second command reads the file /usr/xpat/srch023.q and executes the commands contained in the file.
save.commands
CommandFile
export
export set1
saves information about sets created in an XPAT session.
Export writes a detailed description of the members of set1, created during an XPAT session, to a file. The description includes the type (region or point) of the set and sufficient information to recreate a copy of the set. The name of the file is determined by the value of the ExportFile setting. By default the file name is xpat.exp, but it can be changed during a session by using the command ExportFile. When export writes to the named file, it overwrites anything that currently exists in the file. Assuming a default ExportFile setting of xpat.exp, the following message is given:
Exporting to xpat.exp.
The file may subsequently be read into an XPAT session by the import command.
If the saved set is a frequently used region set, it can be made available as a predefined region in future XPAT sessions by editing the data dictionary file and adding the appropriate information. If the new region set, containing 150 regions, is named newregion and saved in the file newregion_file, the following lines, added to the data dictionary, would make it available to XPAT.
<Region>
  <Name>newregion</Name>
  <Desc>This new region set describes ....</Desc>
  <File>
    <SysName>newregion_file</SysName>
    <Offset>0</Offset>
  </File>
  <Count>300</Count>
  <Type>pairs</Type>
</Region>
>> "tax" near "increase"
>> export %
The first query creates a point set of the matches to the string tax when it is within the current Proximity of the string increase. The second command writes this point set to the file xpat.exp. The information written to the file contains header information followed by details about each element in the set.
>> {ExportFile "v.exp"}
>> verse = region "<V>" .. "</V>"
>> export *verse
The first line of the example changes the ExportFile setting to v.exp. The second line creates a region set and names it verse. The third command writes header information and a description of each member of the region set verse to the file v.exp.
data dictionary documentation, import
ExportFile
{ExportFile}
{ExportFile string}
changes the file name used by export and import.
The ExportFile setting determines the file written by the export command and read by the import command. It has a default value of xpat.exp. If the string begins with a numeral or contains blanks or non-alphanumeric characters, it must be enclosed within double quote marks. The file name must also conform to the file naming conventions of the host operating system. The setting can be changed at any time during an XPAT session and remains in effect until it is changed again or until the end of the session. The current value of the ExportFile setting is displayed by the command {Settings}.
>> {ExportFile "/usr/new/export_file"}
This changes the value of the setting so that any subsequent export or import command uses the file /usr/new/export_file.
export, import, Settings
fby
set1 fby set2
finds members of sets that occur close to each other in a specified order.
Fby (followed by) creates a set containing those members of set1 that have one or more members of set2 within a specified number of characters to their right. Set1 and set2 may be either point sets or region sets. The new set is of the same type as set1.
The distance between members of the two sets is calculated by counting the number of characters in the text from the first character of a member of set1 to the first character of a member of set2. The measure used to determine closeness is the value of the Proximity setting, which has a default value of 80 characters. This can be changed for all subsequent uses of fby by changing the Proximity setting, or it can be changed for an individual use of fby by using a modifier attached to the command. The form of the modifier is a period followed by a number representing the maximum distance (in characters).
If either set1 or set2 is a region set, the first of the two pointers delineating the region is used to determine the distance between the set members.
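The distance test can be sketched in Python as follows. This is an illustration under stated assumptions rather than XPAT's implementation: points are integers, regions are (start, end) tuples, a set2 member must begin strictly to the right of the set1 member, and a distance exactly equal to the proximity value counts as a match. The manual does not settle those last two boundary cases.

```python
from bisect import bisect_right


def start(member):
    """Return the start offset of a point (int) or a region ((start, end) tuple)."""
    return member[0] if isinstance(member, tuple) else member


def fby(set1, set2, proximity=80):
    """Members of set1 with some member of set2 starting within `proximity`
    characters to their right (start offset to start offset)."""
    starts2 = sorted(start(m) for m in set2)
    result = []
    for m in set1:
        s = start(m)
        i = bisect_right(starts2, s)  # nearest set2 start strictly after s
        if i < len(starts2) and starts2[i] - s <= proximity:
            result.append(m)
    return result


def not_fby(set1, set2, proximity=80):
    """Members of set1 that fby() would reject."""
    kept = fby(set1, set2, proximity)
    return [m for m in set1 if m not in kept]


# Hypothetical match offsets for "law " and "order ".
law = [100, 500, 900]
order = [130, 985]

print(fby(law, order))      # [100]
print(fby(law, order, 90))  # [100, 900]
print(not_fby(law, order))  # [500, 900]
```

Only the nearest following member of set2 needs to be examined for each member of set1, which is why a sorted list and a binary search are enough.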
Multiple fby commands are not parsed left to right. A command of the form
set1 fby set2 fby set3
is handled as if parenthesized as follows:
set1 fby (set2 fby set3)
The command not fby creates a set containing the members of set1 that are not within the specified distance to the left of any member of set2.
set1 not fby (set2 fby set3)
is the same as
set1 - (set1 fby (set2 fby set3))
>> "law " fby "order "
Assuming a Proximity of 80, this query creates a point set containing the matches to law with one or more matches to order within 80 characters to their right, counting from the l in law to the o in order.
>> region Title fby.30 region Author
This query creates a region set containing the members of the set region Title that have one or more members in the set region Author within 30 characters to the right. The distance is measured as the number of characters from the first character of a Title region to the first character of an Author region.
>> "law " not fby "order "
This query creates a point set containing the matches to law that do not have a match to order within 80 characters to the right, calculating the distance as in the first example.
>> "law " not fby.30 "order "
This query creates a point set containing the matches to law that do not have a match to order within 30 characters to the right.
near
Proximity
first
first set1
finds a specific number of contiguous members from the start of a set.
First creates a set of a specified size that consists of members from the beginning of set1. The members of the new set are in the order they appear in set1. Set1 may be either a region set or a point set. The new set is of the same type as set1.
The operation of the first command involves the set member counter that keeps track of the selected members, the identification of the size of the requested set, and the SortOrder setting that determines which members are in the new set.
First selects members from the beginning of a set. The ordering of a set, and hence which members occur at the beginning, is controlled by the SortOrder setting. If the SortOrder setting is Alpha, the set is ordered alphabetically. If the SortOrder setting is Occur or OccurHead, the set is ordered according to occurrence in the text. If the SortOrder setting is AsIs, the set order is the current one, which may be either alphabetic or occurrence order.
Each set that is used with a first, next or ~nextemp command has a cursor (set member counter) associated with it. The cursor indicates the location in set1 at which to start selecting members for the set being created. Each first command resets the cursor so members for the new set are chosen from the beginning of set1. On completion of the first command the cursor is updated to point at the position in set1 where the next set to be created would begin. Note that when the SortOrder setting changes and the set ordering is changed, the cursor is reset to the beginning of set1.
The size of the set created is determined by the value of SampleSize, which has a default value of 10. If the size of set1 is less than SampleSize, the new set created is the same size as set1. Changing the SampleSize setting affects all subsequent uses of first during the current session. For an individual use of the command, the size of the new set can be specified by using a modifier attached to the first command. This modifier is in the form of a period followed by a numeric value giving the desired set size.
The first command can be used by itself or with the pr, save or export commands.
>> {SampleSize 40}
>> first 5
The first line changes the SampleSize setting to 40 and the second line creates a set that contains the first 40 members of set number 5 created earlier in the XPAT session.
>> first.10 "the best of "
This line creates a set containing the first 10 members in the set of matches to the phrase the best of.
>> first.0 3
This query resets the cursor to the first member of set number 3.
next, ~nextemp, sample, set number, subset
SampleSize, SortOrder
~free
~free number
releases an XPAT set.
Following the ~free command, the set number is no longer available for reference in an XPAT command. The set is no longer displayed by the history command.
If the sets freed are at the end of the current history list, the set numbers will be reused for the next sets created in the XPAT session. For example, if the history list contains set numbers 1 to 8, and 6 through 8 are freed using the ~free command, the next set number assigned is 6. However, if set number 2 is freed and the history list includes set numbers 1 to 8, the next set is number 9.
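One rule that reproduces both cases is "one more than the highest set number still in the history list". The Python sketch below encodes that reading; it is an inference from the two examples, not a documented algorithm, and the function name is hypothetical.

```python
def next_set_number(live_sets):
    """Number given to the next set, assuming XPAT assigns one more than the
    highest set number still present in the history list."""
    return max(live_sets, default=0) + 1


history = set(range(1, 9))                   # sets 1 to 8 exist

print(next_set_number(history - {6, 7, 8}))  # 6
print(next_set_number(history - {2}))        # 9
```

Under the same reading, an empty history list yields set number 1.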
>> ~free 4
This removes set number 4 from the history list. The set can no longer be accessed by number reference.
~freeall, history
~freeall
~freeall
releases all XPAT sets.
Following the ~freeall command, all the sets that existed in the current session are no longer available for reference in an XPAT command. In addition, those sets are no longer displayed by the history command.
Following the ~freeall, the next set number assigned is 1.
>> ~freeall
This removes all current sets from the history list. Following the command, no previously created sets can be referenced, and the next set that is produced is assigned the number 1.
~free, history
history
history
displays the record of the current XPAT session.
Information about each set created during the XPAT session is recorded in a history list. For each of the sets, history displays a set number, the number of members in the set and the query that produced the set. Sets created during the current session can be accessed by referring to the number of the set in the history list. The results of pr, save, Settings, { }, and certain tilde (~) commands do not appear in this list since no sets are produced by these commands.
As the entire history list may become quite long, it is useful to be able to view only a part of the list. The History setting determines what portion of the history list is displayed by the history command.
The History setting has a default value of 0. This indicates that the entire history list is to be displayed. When set to an integer n (any integer greater than zero), the final n elements in the history list are displayed by any subsequent use of the history command during the session.
The items listed can also be changed for an individual use of the history command. Modifiers may be attached to the history command to request that a certain number of items and that a particular portion of the list be displayed.
The first modifier, in the form of a period followed by a number, indicates where in the history list to begin the display. A positive integer p requests that the display start at the pth item from the start of the history list. A negative integer p requests that the display start at the pth item from the end of the history list. The number of items displayed is the value of the History setting.
The number of items displayed can also be changed for an individual use of the history command by using a second modifier attached to an already modified history command. This second modifier is also in the form of a period followed by a number giving the number of items to be displayed.
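The interaction between the History setting and the two modifiers can be modelled with list slicing. The Python sketch below is one interpretation, with hypothetical names: `start` stands for the first modifier, `count` for the second, and a count of 0 is taken to mean "through the end of the list", mirroring the meaning of a History setting of 0.

```python
def history_view(entries, history_setting=0, start=None, count=None):
    """Return the history entries that a `history[.start[.count]]` command
    would list, under the assumptions stated above."""
    if start is None:
        # Plain `history`: the final n entries, or everything when the setting is 0.
        return list(entries[-history_setting:]) if history_setting > 0 else list(entries)
    if start == 0:
        raise ValueError("start must be a non-zero integer")
    # Positive: pth item from the start. Negative: pth item from the end.
    begin = start - 1 if start > 0 else max(len(entries) + start, 0)
    if count is None:
        count = history_setting
    return list(entries[begin:begin + count]) if count > 0 else list(entries[begin:])


sets = list(range(1, 9))                      # set numbers 1 to 8

print(history_view(sets, history_setting=5))  # [4, 5, 6, 7, 8]
print(history_view(sets, start=3, count=2))   # [3, 4]
print(history_view(sets, start=-4, count=2))  # [5, 6]
```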
The default maximum size of the history list is 300 items. If more than 300 sets are created, the last 300 sets created during the XPAT session are retained in the list. This maximum size can be altered by a command line parameter when starting an XPAT session.
Note that a set can be removed from the history list by the ~free command.
>> "univ"
>> pr sample %
>> "waterloo"
>> 1 near 2
>> pr
>> history
Assuming the above are the only commands executed in the XPAT session to this point, the result of the history command would be as follows:
1: 11680, "univ"
2: 209, "waterloo"
3: 4, 1 near 2
>> {History 5}
>> history
The first command in this example sets the value of the History setting to 5. The second command, and subsequent uses of the history command in the session, will show information about the final five sets in the history list.
>> history.3
This use of the history command gives information about the commands in the history list, starting at the third element. Using the XPAT session described in the first example above, the result would be:
3: 4, 1 near 2
>> history.-2
This use of the command gives information starting at the second element from the end of the history list. Again using the first example, the result would be:
2: 209, "waterloo"
3: 4, 1 near 2
>> history.4.10
This use of history gives information from the history list starting at the fourth entry on the list and continuing for ten entries.
>> history.-4.2
This use of history gives information about two entries, starting at the fourth entry from the end of the history list.
~free, save.commands, save.history, set number
History
{History}
{History number}
changes the number of items from the history list displayed by the history command.
The History setting determines the number of items displayed by the history command. Note that the setting may be overridden by a modifier for an individual use of the history command, in which case the modifier determines the number of items displayed. The default value of the setting is 0, indicating that all sets created in this session are to be shown by the history command. The setting can be changed at any time during an XPAT session and stays in effect until changed again or until the end of the session. The current value of the History setting is displayed by the command {Settings}.
>> {History 30}
This changes the setting to the value 30 so that any subsequent use of the history command during the session displays 30 items.
history, Settings
{HistoryFile}
{HistoryFile string}
changes the file name used by save.history.
The HistoryFile setting determines the file written by the save.history command. It has a default value of xpat.his. If the string begins with a numeral or contains blanks or non-alphanumeric characters, it must be enclosed within double quote marks. The file name must also conform to the file naming conventions of the host operating system. The setting can be changed at any time during an XPAT session and remains in effect until it is changed again or until the end of the session. The current value of the HistoryFile setting is displayed by the command {Settings}.
>> {HistoryFile "/usr/new/history_file"}
This changes the HistoryFile setting so that any subsequent use of the save.history command during the session writes to the file /usr/new/history_file.
save.history, Settings
import
import
reads information that has been saved in a file by the export command.
Import reads data from a file and creates a new set which can be used as if it had been created during the current XPAT session. XPAT determines from the header information whether the saved set is a point set or a region set, and the new set is of the same type. The file read is determined by the ExportFile setting, which has a default value of xpat.exp. The file name can be changed during a session by resetting the ExportFile setting.
Assuming the default setting of ExportFile, if the imported set is a region set, the following message is generated:
Importing regions from 'xpat.exp'
If it is a point set, the message generated is:
Importing point set from 'xpat.exp'.
>> import
>> % within region Quote
The first command reads from the file xpat.exp. The second line uses the imported set as an operand to a within command and finds the members of the set that occur within region Quote.
>> {ExportFile "v.exp"}
>> verse = import
>> *verse including ("blind " fby "ditch ")
The first command resets the ExportFile setting to v.exp. The second reads a set from the file v.exp and names it verse. The third query finds the set of imported regions that include the string blind when it is followed by the string ditch (the assumption has been made that this set is a region set).
export
ExportFile
including
set1 including set2
set1 incl set2
finds regions that contain members of a set.
Including or incl creates a set consisting of members of set1 that include one or more members of set2. Set1 must be a region set. Set2 may be either a point set or a region set. The new set is a region set.
Set1 may be a predefined region set, a region set created during the XPAT session using the region command, a region set resulting from the use of the import command, or the result of a previous query in the session.
If set2 is a point set, and if one or more of the points occur in a region from set1, then that set1 region is included in the new set.
If set2 is a region set, and the first of the pair of pointers (offsets into the text) describing a region of set2 is contained in a region of set1, that set1 region is included in the new set. The second pointer of the pair delineating set2 does not have to fall within the region of set1 in order that the set1 region be included in the new set.
The including command can also be used to find regions that contain more than one member of set2, by attaching a modifier specifying the minimum number of members of set2 to the including command. This modifier is in the form of a period followed by the value of the minimum number of members.
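The containment test and the minimum-count modifier can be sketched in Python. The model is an assumption made for illustration: regions are (start, end) tuples with an inclusive end offset, points are integers, and only the start offset of a set2 member is tested against each set1 region.

```python
def start(member):
    """Return the start offset of a point (int) or a region ((start, end) tuple)."""
    return member[0] if isinstance(member, tuple) else member


def including(regions, members, minimum=1):
    """Regions that contain the start offset of at least `minimum` members."""
    starts = [start(m) for m in members]
    result = []
    for begin, end in regions:
        hits = sum(1 for s in starts if begin <= s <= end)
        if hits >= minimum:
            result.append((begin, end))
    return result


# Hypothetical offsets for two Story regions and four match points.
stories = [(0, 100), (200, 300)]
matches = [10, 20, 30, 250]

print(including(stories, matches))             # [(0, 100), (200, 300)]
print(including(stories, matches, minimum=3))  # [(0, 100)]
print(including(stories, [(90, 150)]))         # [(0, 100)]
```

The last call shows a set2 region whose end lies outside the Story region: it still counts, because only its first pointer is examined.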
The command not including creates a set containing those members of set1 that do not contain any of the members in set2.
set1 not including set2
is the same as
set1 - (set1 including set2)
Including and within are similar in that they both restrict searches to specified regions in the text. They differ in the set that is created. The including command creates a set of regions that contain one or more members of another set, while within creates a set of pointers or regions that are contained in members of a region set.
>> region Story including ("Free trade" near "Canada")
This query finds the regions described by region Story that contain one or more matches to the string Free trade when it occurs close to the string Canada.
>> region Story including.3 ("Free trade" near "Canada")
This query finds the regions described by region Story that contain at least three matches to the string Free trade when it occurs close to the string Canada.
>> region Quote not including region Author
This query creates a set of Quote regions that do not contain the first pointer of the pair delineating an Author region.
>> dates = "1800" .. "1825"
>> region Date including *dates
The first query creates a point set containing all the numbers that are alphabetically between 1800 and 1825. The second creates the set of Date regions that contain one or more of these numbers.
>> region Quotation including "Wright"
>> % including "Waterloo"
The first query creates a region set of quotations that contain the string Wright. The second query finds the members in the new region set that also contain the string Waterloo.
>> (*speech including "republican") including "democrat"
This query is similar to the previous one. It assumes that a region set named speech has been defined and it finds the members of this set that contain both the string republican and the string democrat.
>> (*definition incl ("men" + "women")) incl "education"
>> *definition including (("men" + "women") ^ "education")
The first query creates a set of definition regions that include the string education as well as either men or women. Note that the second query does not create the same set but actually creates a set of size 0, because the intersection operation (("men" + "women") ^ "education") produces an empty result: there are no members of the union set "men" + "women" that are also members of the set "education" (see the definition of the union operator).
intersect, not, region, within
index point
XPAT views the entire text as one long string. In contrast to traditional text indices, which deal with words, XPAT indexes strings. The indexed strings extend from each index point to the end of the text.
The XPAT index is made up of the starting points of each string. The index points make up the possible match points for a string search. Parameters set when the index is built determine which strings are in the index. The parameters specify patterns in the text that define the beginnings of strings to be indexed. For example, one pattern could specify that every character in the text is to be indexed, while another pattern could specify that each printable character following a blank is to be indexed.
When the index is created, two additional settings can alter how XPAT sees the text. Character mappings cause XPAT to see certain characters as equivalent to other characters. For example, all upper case letters may be mapped to lower case letters so that XPAT does not distinguish between upper and lower case when searching for a string. Also, some words may be designated as stopwords. XPAT views the text as if these words are not there. XPAT ignores strings in the text that start at an index point and match the given stopword strings followed by a blank after the character mappings have been applied. The character mappings also affect the strings chosen to be index points. For example, if a > is mapped to a blank and if the index points are defined as blanks followed by printable characters, then in the text ...<tag>wisdom... the w in the string wisdom is an index point. Text with character mappings applied and stopwords removed is referred to as converted text.
When searching for a given string, a match is found if the given string (after having the character mappings applied to it and the stopwords removed) is the same as the converted text that begins one of the indexed strings.
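The following Python sketch illustrates character mappings, index points and matching on a toy text. Everything in it is an assumption chosen to mirror the example in the description: the mapping table (> to blank, upper case to lower case), the index-point rule (a printable, non-blank character that follows a blank) and the function names. Stopword removal is left out, and the first character of the text is not treated as an index point.

```python
import string

# Hypothetical character mappings: ">" becomes a blank, upper case becomes lower case.
MAPPING = {">": " ", **{c: c.lower() for c in string.ascii_uppercase}}


def convert(text, mapping=MAPPING):
    """Apply the character mappings (stopword removal is omitted in this sketch)."""
    return "".join(mapping.get(ch, ch) for ch in text)


def index_points(converted):
    """Offsets of printable, non-blank characters that follow a blank."""
    return [i for i in range(1, len(converted))
            if converted[i - 1] == " " and converted[i] != " " and converted[i].isprintable()]


def search(text, query):
    """Index points whose converted string begins with the converted query."""
    converted = convert(text)
    target = convert(query)
    return [p for p in index_points(converted) if converted.startswith(target, p)]


text = "a <tag>Wisdom is wise"

print(index_points(convert(text)))  # [2, 7, 14, 17]
print(search(text, "WIS"))          # [7, 17]
```

Because the query is converted with the same mappings as the text, the upper case query WIS matches both Wisdom and wise.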
data dictionary documentation, double quote, offsets, quiet mode, range, shift, string search
intersect
set1 ^ set2
finds members common to two sets.
The intersect operator (^) creates a new set consisting of the members in set1 that are also in set2. Set1 and set2 can be either point sets or region sets. The new set is of the same type as set1.
If either of set1 or set2 is a region set, only the first of the pointers describing the region is used in the comparison to determine if a member should be included in the new set. Two members of a region set are considered to be equal if they start at the same location in the text.
>> (region Verse incl "eye") ^ (region Verse incl "seed")
This query creates a region set. It includes Verse regions that contain both the string eye and the string seed.
>> ("research" near "medical") ^ ("research" near "biolog")
This query creates a point set. It includes the matches to research that appear close to both the string medical and the string biolog.
difference, region, union
{Label}
{Label string}
specifies an identifying string to be used as a label.
When XPAT is operating in quiet mode with labels requested, any set displayed by a pr or save command shows the label string preceding the numeric value of the text offset. This can be used to identify which database the information is from. In an XPAT session, if a value for Label has not been set by this command, the default value used is the name of the data dictionary. The label string must begin with an alphabetic character and contain no blanks or non-alphanumeric characters. The setting can be changed at any time during an XPAT session and remains in effect until it is changed again or until the end of the session.
>> {Label Database1}
>> {QuietOn Label}
>> pr ("Ontario" near ("B.C." + "British Columbia"))
The tagged output from the pr command shows the numeric offset in the file preceded by the string Database1, in the form
<PSet><Start>Database1:12345</Start></PSet>
offsets, quiet mode
QuietOff, QuietOn
last set
%
refers to the previous result.
% is used as shorthand to refer to the set created most recently in the
XPAT session. The set is the final one in the current history list. Some
commands, such as pr and save, do not create sets that are saved and
recorded in the history list, so their results cannot be accessed by using
%. If there is no history, the last set is the null set, which contains all
index points.
>> region Author including "Hemingway"
>> pr sample %
>> % within region Quote
The % in the second line of the example refers to the set created by the
including command in the first query. The % in the third line also refers to
the set created by the first line and not to the result of the pr in the
second line, which does not produce a set.
~free, ~freeall, history
{LeftContext}
{LeftContext number}
specifies how many characters of context are displayed to the left of a set member.
By default, when a set is displayed with the pr command or written to a file
by the save command, the text has 14 characters to the left of the match
point. The setting can be changed at any time during a XPAT session and
remains in effect until it is changed again or until the end of the current
session. The current value of the LeftContext setting is displayed by the
command {Settings}.
>> {LeftContext 40}
This changes the setting to the value 40 so that any subsequent pr or save
command produces text with 40 characters to the left of the match point.
pr, save, Settings
PrintLength
macro
The macro capability simplifies the reuse of frequently used sequences of
XPAT commands.
A macro can be defined in a XPAT session and be available only for the
duration of that session, or a macro can be created externally and read into
any XPAT session by an exec command or during initialization.
The definition of a macro (here called name) begins with the following:
name = macro
After this line the system prompt changes from >> to || for the duration of
the macro definition. The body of the macro may begin on the same line or on
a subsequent line. XPAT interprets anything immediately following the word
macro that is not a blank or new line as the beginning of the macro
definition. The body of the macro may contain arguments. The nth argument to
the macro is identified within the macro definition by the string $n$. Any
sets that are created by the macro may also be used in its definition. The
string *n* refers to the nth set created within the macro. The end of the
macro definition is indicated by a @. After the @ the system prompt returns
to the form >>.
References to other macros may be used within the definition of a macro. If the macro contains more than one XPAT query, they can be put on separate lines or on the same line with the queries separated by a semi-colon. The body of the macro is not checked for syntax errors when it is defined. Any errors are reported when the macro is used.
The macro is invoked by the following call:
name(arg1,arg2..)
If the number of arguments in the macro call is less than the number in the macro definition, a syntax error is reported. If it is greater, the extra arguments are ignored.
Each argument consists of all the text occurring between argument delimiters:
parentheses and commas. That is, if a macro takes three arguments,
(arg1,arg2,arg3), then arg1 consists of the text between the opening
parenthesis and the first comma, arg2 consists of the text between the first
and second commas, and arg3 consists of the text between the second comma and
the closing parenthesis. If a macro takes only one argument, (arg1), the
parentheses are the argument delimiters. Note that any spaces entered with an
argument string are included in the parameter substitution, which is unlikely
to be what the user intends. To avoid unexpected results, enter only the exact
text that you wish to be substituted in the arguments of the macro call. Also
note that macros may have no arguments.
When the macro is invoked, the invocation is replaced by an exact copy of the
body of the macro with the arguments substituted for the formal parameters.
This means that the macro can be used within other XPAT queries. This may
require that the macro definition have the closing @ on the same line as the
final line of the body of the macro definition to avoid introducing an
unwanted new-line character.
If improperly used, macros that produce multiple sets and are used within other queries may cause more than one syntax error to be reported. Care must be taken with bracketing in order to ensure that the results reflect what was actually intended.
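The textual substitution performed on a macro call can be illustrated with a short Python sketch. It is a simplified model under stated assumptions, not XPAT's parser: the helper names `expand_macro` and `split_call_args` are hypothetical, the number of arguments a definition requires is taken to be the highest `$n$` placeholder it contains, and `*n*` set references are not modelled.

```python
import re

# $1$, $2$, ... placeholders as described in this entry.
PLACEHOLDER = re.compile(r"\$([1-9]\d*)\$")


def expand_macro(body, args):
    """Substitute each $n$ placeholder in a macro body with the nth argument."""
    needed = max((int(n) for n in PLACEHOLDER.findall(body)), default=0)
    if len(args) < needed:
        raise SyntaxError(f"macro needs {needed} argument(s), got {len(args)}")
    # Extra arguments are never referenced by the body, so they are ignored.
    return PLACEHOLDER.sub(lambda m: args[int(m.group(1)) - 1], body)


def split_call_args(call):
    """Split 'name(a,b)' on parentheses and commas only; spaces are kept verbatim."""
    inner = call[call.index("(") + 1:call.rindex(")")]
    return inner.split(",") if inner else []


word_body = '( "$1$ " + "$1$<" + "$1$-" )'
print(expand_macro(word_body, split_call_args("word(pad)")))
print(expand_macro(word_body, split_call_args("word( pad)")))  # stray space is substituted too
```

The second call shows why the manual warns about spaces in arguments: the leading blank becomes part of every substituted string.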
A macro can be redefined during a XPAT session. When the macro is redefined,
the previous definition of the macro is displayed following the first new
line entered after the word macro. The format of this previous definition
consists of the macro name followed by a colon, followed by the body of the
macro on subsequent lines.
For convenience, macros that are used frequently can be defined within an init file whose location is given in the data dictionary file. The init file is read and executed by a XPAT session when it is initially started. (See the data dictionary documentation for details.)
>> word = macro
|| ( "$1$ " + "$1$<" + "$1$-" ) @
>> word(pad) within *definitions
This macro is used with text that contains tags that start with a < and
where the tags may follow text without blanks appearing before the tag. The
macro defines a word as a string of characters followed by a blank, <, or -
(in this definition an index that has all punctuation mapped to a blank is
assumed). Since the macro definition has the @ sign on the same line as the
body of the definition, the macro can be used within a more complicated query
as shown. Note the brackets included in the macro definition. The example
assumes that there is a region definition and finds all occurrences of pad,
as a word, inside one of these regions.
>> both = macro
|| region $1$
|| *1* including $2$
|| *2* including $3$
|| @
>> both(Line,"juliet","romeo")
With the macro defined here, the members of a predefined region set which
contain both of two given strings are found. In the macro call above, the
macro is applied to a database of Shakespearean texts in order to find the
members of the predefined region set named Line containing references to both
romeo and juliet. The definition of this macro returns more than one set. It
also has the @ on the line following the body and thus could not be used
within another query. The resulting output from XPAT showing the three sets
produced by the macro would appear as below:
16: 128794 matches
17: 214 matches
18: 25 matches
data dictionary documentation, thesaurus
naming sets
name = set1
assigns a name to a set.
A set which has been named can be referred to either by that name or its set number. Set1 can be either a point set or a region set.
A name that starts with a letter and contains only letters and numbers does
not need to be enclosed within double quote marks. However, if the name
contains special characters (blanks or non-alphanumeric characters), or does
not start with a letter, it must be enclosed within double quote marks both
in the assignment statement and in subsequent use.
To use the name in a query it must be preceded by an asterisk (*). Without
the asterisk (*), XPAT interprets the name as a string rather than as the
name of a set.
>> UK = "U.K."+"Britain"+"Great Brit"+"United King"
>> region Headline including *UK
The first line assigns the name UK to a set of matches to four alternate ways
of referring to the United Kingdom. The second line finds Headline regions
that contain any of the matches.
>> "min_hiring" = region Minutes incl ("hiring" near "policy")
>> region Attendees within *"min_hiring"
The first line of the example assigns the name min_hiring to Minutes regions
that include matches to hiring appearing close to matches to policy. The
second line finds the Attendees regions that are within one of the resulting
Minutes regions from the first query.
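The quoting rule for set names can be expressed as a small check. The sketch below is a hedged illustration only: the function name `set_reference` is hypothetical, and "letters" is assumed to mean ASCII letters, which the manual does not state.

```python
import re

# A name that starts with a letter and contains only letters and digits
# can be written without double quotes (ASCII letters are an assumption).
_PLAIN_NAME = re.compile(r"[A-Za-z][A-Za-z0-9]*")


def set_reference(name):
    """Return the text used to refer to a named set inside a query."""
    if _PLAIN_NAME.fullmatch(name):
        return f"*{name}"
    return f'*"{name}"'


print(set_reference("UK"))          # plain name, no quotes needed
print(set_reference("min_hiring"))  # underscore forces double quotes
```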
double quote
, set name
near
set1 near set2
finds members of sets that are close to each other.
Near creates a set containing the members of set1 that are within a specified
number of characters before or after one or more members of set2. Set1 and
set2 may be either point sets or region sets. The new set is of the same type
as set1.
The distance between members of the two sets is calculated by counting the
number of characters in the text between the first character of a member of
set1 and the first character of a member of set2. The measure used to
determine closeness is the value of the Proximity setting, which has a default
value of 80 characters. The value can be changed for all subsequent uses of
near by changing the Proximity setting, or it can be changed for an individual
use of near by using a modifier attached to the command. The form of the
modifier is a period followed by a number representing the maximum distance
(in characters).
If either set1 or set2 is a region set, the first of the two pointers describing the region is used in finding the distance between the members of the sets.
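The distance test behind near, and the difference that not near reduces to, can be modelled with the following Python sketch. It is an assumption-laden illustration rather than XPAT's algorithm: points are plain integer offsets, regions are `(start, end)` pairs, the function names are hypothetical, and a distance exactly equal to the Proximity value is treated as "within", which the manual does not specify.

```python
def start_of(member):
    """A point is an int offset; a region is a (start, end) pair."""
    return member[0] if isinstance(member, tuple) else member


def near(set1, set2, proximity=80):
    """Members of set1 whose start is within `proximity` characters of a set2 start."""
    starts2 = [start_of(m) for m in set2]
    return [m for m in set1
            if any(abs(start_of(m) - s) <= proximity for s in starts2)]


def not_near(set1, set2, proximity=80):
    """Equivalent to set1 - (set1 near set2)."""
    close = near(set1, set2, proximity)
    return [m for m in set1 if m not in close]


love = [100, 500, 900]   # offsets of matches to "love "
hate = [150, 960]        # offsets of matches to "hate "

print(near(love, hate))        # default Proximity of 80
print(near(love, hate, 30))    # like near.30
print(not_near(love, hate))    # like not near
```

The `abs()` call reflects that a member of set2 may occur either before or after the member of set1.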
Multiple near commands are not parsed left to right. A command of the form
set1 near set2 near set3
is handled as if parenthesized as follows:
set1 near (set2 near set3)
The command not near creates a set containing those members of set1 that are
not within the specified distance of any member of set2.
set1 not near set2
is the same as
set1 - (set1 near set2)
>> "love " near "hate "
Assuming a Proximity of 80, this query creates a point set containing those
matches to love that are within 80 characters of matches to hate, counting
from the l in love to the h in hate. The string hate can occur before or
after love in the text.
>> region Title near.30 region Author
This query creates a region set containing the members of region Title that
are within 30 characters of one or more members of region Author. In this
case the distance is measured as the number of characters between the first
character of a Title region and the first character of an Author region.
>> "love " not near "hate "
This query creates a point set containing those matches to love that do not
occur within 80 characters of a match to hate, calculating the distance as in
the first example.
>> "love " not near.30 "hate "
This query creates a point set containing the matches to love that do not
occur within 30 characters of a match to hate.
fby, not
Proximity
next
next set1
finds a specified number of contiguous members of a set following members
already identified by a first or next command.
Next creates a set of a specified size containing the members of set1 that
start at the current cursor position associated with this set. The cursor
position is determined by a previous first or next command applied to set1.
The members of the new set are in the order they appear in set1. Set1 may be
either a region set or a point set. The new set is of the same type as set1.
The operation of the next command depends on the set order established by
the SortOrder setting. If the SortOrder setting is Alpha, the set is ordered
alphabetically; if the SortOrder setting is Occur or OccurHead, the set is
ordered as the members occur in the text; and if the SortOrder setting is
AsIs, the set ordering is the current one and may thus be either alphabetic
or occurrence order.
Each set that is used with a first, next or ~nextemp command has a cursor
(set member counter) associated with it. The cursor indicates the location in
set1 at which to begin selection for the set being created. On completion of
the next command the cursor is updated to point at the beginning of the next
set. Note that when the SortOrder setting changes and the set ordering is
changed, the cursor is reset to the first element.
The size of the set created is determined by the value of SampleSize, which
has a default value of 10. If the size of set1 is less than SampleSize, then
the new set created is the same size as set1. Changing the SampleSize affects
all subsequent uses of next during the current session. For an individual use
of the command, the size of the new set can be specified by using a modifier
attached to the next command. The modifier is in the form of a period
followed by a numeric value giving the desired set size.
The next command can be used by itself or with the pr, save or export
commands. Note that next may only be used in conjunction with these commands.
>> {SampleSize 40}
>> first.0 5
>> next 5
The first line of the example changes the SampleSize setting to 40. The first
command resets the cursor associated with set number 5 to the first member of
the set and creates a set of size 0 (thereby leaving the cursor at the first
member). The third line creates a set that contains the first 40 members of
set number 5.
>> next.10 5
If this command follows the previous example, a set of ten members is
created. The cursor associated with set number 5 indicates that 40 members
have been used to create the set in the previous next command and so this new
set starts at the 41st member of set number 5.
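The cursor behaviour walked through in these two examples can be modelled in Python as follows. This is a sketch for illustration, not XPAT's implementation: the class name `CursorSet`, the zero-based index used for the cursor, and the list of integers standing in for set number 5 are all assumptions.

```python
class CursorSet:
    """A set with the member cursor used by first and next (hypothetical model)."""

    def __init__(self, members):
        self.members = list(members)
        self.cursor = 0  # index of the next member to hand out

    def first(self, size):
        """Reset the cursor to the first member, then take `size` members."""
        self.cursor = 0
        return self.next(size)

    def next(self, size):
        """Take up to `size` members from the cursor and advance past them."""
        chunk = self.members[self.cursor:self.cursor + size]
        self.cursor += len(chunk)
        return chunk


SAMPLE_SIZE = 40
set5 = CursorSet(range(1, 101))         # stand-in for set number 5: members 1..100

set5.first(0)                           # >> first.0 5
first_forty = set5.next(SAMPLE_SIZE)    # >> next 5
next_ten = set5.next(10)                # >> next.10 5

print(first_forty[0], first_forty[-1])  # first and last member of the 40-member set
print(next_ten[0], next_ten[-1])        # the second set starts at the 41st member
```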
first, ~nextemp, sample, subset
SampleSize, SortOrder
~nextemp
~nextemp set1
finds a specified number of contiguous members of a set following members
already identified by a first or next command.
The command ~nextemp creates a set of a specified size containing the members
of set1 that start at the current cursor position associated with the set.
The cursor position is determined by the previous first or next command
applied to set1. The members of the new set are in the order they appear in
set1. Set1 may be either a region set or a point set. The new set is of the
same type as set1.
The ~nextemp command is identical to the next command except that the cursor
is unchanged by the ~nextemp command.
The operation of the ~nextemp command depends on the set order established by
the SortOrder setting. If the SortOrder setting is Alpha, the set is ordered
alphabetically; if the SortOrder setting is Occur or OccurHead, the set is
ordered as the members occur in the text; and if the SortOrder setting is
AsIs, the set ordering is the current one and may thus be either alphabetic
or occurrence order.
Each set that is used with a first, next or ~nextemp command has a cursor
(set member counter) associated with it. The cursor indicates the location in
set1 to start selecting members for the set being created. After completion
of the ~nextemp command, the cursor is unchanged. This differs from the
behaviour of the next command, which updates the cursor to point at the last
member of set1 selected for the new set. Note that when the SortOrder setting
and the set ordering change, the cursor is reset to the first element.
The size of the set created is determined by the value of the SampleSize
setting, which has a default value of 10. If the size of set1 is less than
SampleSize, then the new set created is the same size as set1. Changing the
SampleSize setting affects all subsequent uses of ~nextemp during the current
session. For an individual use of the command, the size of the new set can be
specified by using a modifier attached to the ~nextemp command. This modifier
is in the form of a period followed by a numeric value giving the desired set
size.
The ~nextemp command can be used by itself or with the pr, save or export
commands. Note that ~nextemp may only be used in conjunction with these
commands.
>> {SampleSize 40}
>> first.0 5
>> ~nextemp 5
The first line changes the SampleSize setting. The first command initializes
the cursor associated with set number 5 to the first member of the set and
creates a result set of size 0 (thereby leaving the cursor at the first
member). The third line creates a set that contains the first 40 members of
set number 5.
>> ~nextemp.10 5
Assume this command follows the previous example. On completion of the
previous query, the cursor still points to the beginning of the set as the
~nextemp command does not change the cursor setting. The set created by this
query contains 10 elements from the beginning of set number 5.
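The difference from next can be seen in a minimal Python model in which the cursor is passed in and handed back unchanged. The function name `nextemp`, the zero-based cursor, and the list standing in for set number 5 are assumptions made for the sketch.

```python
def nextemp(members, cursor, size):
    """Return up to `size` members starting at `cursor`; the cursor is not moved."""
    return members[cursor:cursor + size], cursor


set5 = list(range(1, 101))   # stand-in for set number 5: members 1..100
cursor = 0                   # state after: >> first.0 5

forty, cursor = nextemp(set5, cursor, 40)   # >> ~nextemp 5
ten, cursor = nextemp(set5, cursor, 10)     # >> ~nextemp.10 5

# Both result sets start at the first member because the cursor never advanced.
print(forty[0], forty[-1], ten[0], ten[-1], cursor)
```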
first, next, sample, subset
SampleSize, SortOrder
not
is used to modify four XPAT commands.
The forms in which not can appear are not fby, not including, not near, and
not within. These uses are described in the entries for fby, including, near,
and within. Not cannot be used to modify any other commands.
fby, including, near, within
offsets
[number]
[label:number]
generate a point set containing a specified position in the text.
The number in the square brackets is a logical position in the text and need not be an index point. The number indicates the offset, measured in number of characters, from the beginning of the text database. The first character of the text has offset [1]. If the number used in square brackets exceeds the size of the text, XPAT gives the message
Error: Input number too large.
Note that the new set is a point set with only one member.
The second form of the command, shown above, uses offsets that are produced
when XPAT is operating in quiet mode and using labels. In this form, in order
to produce correct results, the label string must be the current value of the
setting Label. When the label is different from the current Label setting,
the resulting set has size 0.
>> region Quote including [20000]
This query finds the Quote region that includes the offset 20000.
>> {Label news}
>> region Quote including [news:20000]
This query uses an offset in the form produced by XPAT in quiet mode (having
requested labels with the offsets). Since the Label has been set to the value
news by the previous command, the query finds the region set named Quote
containing the given offset. If the label prefixed to the offset were
anything other than news, the query would produce a set of size 0.
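The two offset forms and the label check can be sketched in Python as below. This is an illustrative model only: the function name `offset_set`, the `text_size` parameter, and the order in which the size check and the label check are applied are assumptions, not documented XPAT behaviour.

```python
import re

# [number] or [label:number]; labels start with a letter and are alphanumeric.
_OFFSET = re.compile(r"\[(?:(?P<label>[A-Za-z][A-Za-z0-9]*):)?(?P<number>\d+)\]")


def offset_set(expr, text_size, current_label=None):
    """Return the one-member point set named by [number] or [label:number]."""
    m = _OFFSET.fullmatch(expr.strip())
    if m is None:
        raise ValueError(f"not an offset expression: {expr!r}")
    number = int(m.group("number"))
    if number > text_size:
        raise ValueError("Error: Input number too large.")
    label = m.group("label")
    if label is not None and label != current_label:
        return []        # label differs from the Label setting: a set of size 0
    return [number]      # a point set with exactly one member


print(offset_set("[20000]", text_size=50000))
print(offset_set("[news:20000]", text_size=50000, current_label="news"))
print(offset_set("[sports:20000]", text_size=50000, current_label="news"))
```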
quiet mode, sets
Label
pr
pr set1
displays contents of XPAT sets.
Pr displays each member of set1 with surrounding context. A modifier can be
attached to the pr command in order to control the context exactly. Set1 can
be any region set or point set. If set1 is a region set, the first of the
pair of points describing each region in the text is displayed by the pr
command. If no set1 is given, the operand for the command is the most recent
set created in the session.
For each member in the given set, the output is in the form of an integer
giving the offset of the set member in the text file, followed by a comma, a
blank, two periods and then the characters surrounding the set member. The
first character in the database is considered to be offset 1. The order in
which the set is displayed depends on the current SortOrder setting.
With no modifier, pr prints a line of text for each element in set1. The
PrintLength and LeftContext settings determine the content of the line
printed. With the default settings, the printed text is 64 characters in
length of which 14 precede the match point. The number of characters displayed
to the left of the match point can be altered by changing the LeftContext
setting. The total number of characters printed can be altered by changing
the PrintLength setting.
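The way PrintLength and LeftContext select the displayed window can be illustrated with a short Python sketch. It models only the unmodified form of pr on a plain string; the function name `pr_line`, the sample text, and the clamping at the start of the text are assumptions rather than documented behaviour.

```python
def pr_line(text, offset, print_length=64, left_context=14):
    """Format one pr line: offset, comma, blank, two periods, then context.

    `offset` is 1-based, as in the manual. The window starts `left_context`
    characters before the match point and is `print_length` characters long.
    """
    start = max(offset - 1 - left_context, 0)   # clamp at start of text (assumed)
    window = text[start:start + print_length]
    return f"{offset}, ..{window}"


text = "It was the best of times, it was the worst of times, it was the age of wisdom"
offset = text.index("worst") + 1    # 1-based match point for the string "worst"

print(pr_line(text, offset))                                   # default settings
print(pr_line(text, offset, print_length=20, left_context=5))  # narrower window
```

With the defaults, the match point itself is the 15th character of the window, leaving 49 characters to its right when the text is long enough.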
The total number of characters to be displayed can be set, for a single
instance of the command, by using a numeric modifier attached to the pr. The
modifier is in the form of a period followed by a number giving the total
number of characters to be displayed. The left context that is displayed is
still determined by the value of the LeftContext setting.
The second form the modifier can have is a period followed by the string
region. When the modifier .region is used, the output text starts at the
match point and continues to the end of the default region in which the match
point occurs. If the match point is not within the default region, no output
is displayed for the match point.
The second form of the modifier can be refined to request that the text
displayed is a region other than the default region. An additional modifier
specifying a defined region set can be attached to the already modified pr
command (i.e. to the pr.region). The additional modifier can specify the
region in one of three ways: a string giving the name of a predefined region,
the number of a region set created in the XPAT session, or a string preceded
by an asterisk (*) referring to a named region set defined in the XPAT
session (see the examples below). As with the form pr.region, described
above, this use results in the displayed text starting at the match point
with no left context and continuing to the end of the region. When the match
point is not contained in the designated region set, no output is displayed.
>> "Kipling"
>> pr
This command displays a line of context for each member of the previously
calculated set. Assuming the PrintLength and LeftContext still have the
default values, each line will contain 64 characters of which 14 will be
before the match point.
>> {PrintLength 300}
>> pr "my dear Watson"
As with the previous example, this command prints a line of context for each
member in the point set matching the string my dear Watson. In this case, the
line printed for each member in the set is 300 characters long but still has
14 characters preceding the match point.
>> pr region including "detective"
This command will print a line for each member in the set of default regions
that contains the string detective. The text displayed starts at the
beginning of the default region.
>> pr.200 shift.-100 ("city" near "oxford")
This command prints a line of 200 characters for each member in the set of
matches to the string city when it appears near the string oxford. Since the
match points in this set have been shifted 100 characters to the left, the
displayed text actually begins 114 characters to the left of the string city
(assuming the LeftContext is set to 14).
>> region incl (region EQ incl ("<D>1980" .. "<D>1986"))
>> pr.region
The first query finds the members of the default region (in this example they
might be dictionary entries) that contain EQ regions which are in the period
from 1980 to 1986. The second command prints these entries. After the offset,
comma, blank and two periods, the displayed text starts at the match point,
which is at the beginning of the default region, and continues to the end of
the default region.
>> region Quote including ("univ" near "waterloo")
>> pr.region.Quote
The first query finds the Quote regions which contain the string univ
occurring near the string waterloo. The second command displays these
regions. The output consists of an offset, comma, blank, two periods and the
text starting at the beginning of the Quote region and continuing to the end
of the Quote region.
>> pr.region.5 "law" fby "order"
This command displays data from the set of matches to the string law when
followed by order. The text that is printed starts at the matches to the
string law and continues to the end of the regions in set number 5 which
contain the match point.
>> *verse including "faith, hope, charity"
>> pr.region.*verse
The first query finds the regions that contain the string faith, hope,
charity occurring in the set that has been created and named verse during the
XPAT session. For each of the members in this region set, the second command
prints information starting at the beginning of the region and continuing to
the end of the region described by *verse.
history, naming sets, quiet mode, region, save
DefaultRegion, LeftContext, PrintLength, SortOrder
{PrintLength}
{PrintLength number}
specifies how many characters of text are displayed.
By default, when the members of a set are displayed with the pr command or
written to a file by the save command, each member contains 64 characters of
context: 14 to the left of the match point, the match point itself, and 49 to
the right. This setting may be overridden so that the number of characters
processed is determined by a modifier for an individual use of the pr or save
commands. The PrintLength setting determines the total number of characters
processed and thus affects the number of characters shown to the right of the
match point. The number of characters to the left of the match point is
determined by the LeftContext setting.
The setting can be changed at any time during a XPAT session and remains in
effect until changed again or until the end of the session. The current value
of the PrintLength setting is displayed by the command {Settings}.
>> {PrintLength 100}
>> pr ("Yukon" near ("B.C." + "British Columbia"))
This changes the setting to the value 100 so that any subsequent pr or save
command produces text 100 characters in length. The set displayed has 14
characters to the left of the match point and 85 characters to the right,
assuming a default value of 14 characters for left context.
pr, save, Settings
LeftContext
{Proximity}
{Proximity number}
specifies the measure of closeness for the near and fby commands.
The Proximity default for the fby and near commands is 80 characters. That
is, a match point of a member of set1 must be within 80 characters of a match
point of a member of set2 to be included in a new set created by the near and
fby commands.
The Proximity setting may be overridden for an individual use of the fby and
near commands by appending a modifier to the command.
The Proximity setting can also be changed at any time during a XPAT session
and remains in effect until changed again or until the end of the session
(see example below). The current value of the Proximity setting is displayed
by the command {Settings}.
>> {Proximity 200}
>> "Canada" near ("U.S." + "United States" + "the States")
The first line of the example changes the Proximity setting to the value 200
so that any subsequent near or fby commands use this value. In the query,
XPAT finds the occurrences of the string Canada that occur within 200
characters either to the left or right of members of the set produced by the
union of the sets matching the strings U.S., United States and the States.
fby, near, Settings
~qnum
~qnum
outputs a query number.
The ~qnum command operates in both standard and quiet mode. In standard mode,
the number of the next query is output. In quiet mode, the information is
tagged and the number of the next query is contained within <Qnum> tags.
>> "testing"
>> ~qnum
If testing is the first query in the XPAT session, the output from the ~qnum
command is the set number 2. In quiet mode, this appears as the string
<Qnum>2</Qnum>.
quiet mode
quiet mode
{QuietOn Raw Converted Label Persistent}
{QuietOff}
changes the mode of operation of XPAT. {QuietOn} causes XPAT to operate in
quiet mode. {QuietOff} causes XPAT to revert to standard (non-quiet) mode.
Each of the four arguments to QuietOn is optional and may appear in any
order. When an argument is present in a QuietOn command, the corresponding
setting is turned on. Conversely, when an argument is not present in a
QuietOn command, the corresponding setting is turned off. Settings are not
carried forward from one QuietOn command to the next but are reset with each
QuietOn command.
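The reset-on-every-command behaviour of the four QuietOn arguments can be modelled as below. This is a sketch under assumptions: the function name `parse_quiet_on` and the error handling for unknown words are hypothetical and are not taken from XPAT.

```python
QUIET_FLAGS = ("Raw", "Converted", "Label", "Persistent")


def parse_quiet_on(command):
    """Return the four quiet-mode flags set by a single {QuietOn ...} command."""
    words = command.strip().strip("{}").split()
    if not words or words[0] != "QuietOn":
        raise ValueError(f"not a QuietOn command: {command!r}")
    unknown = set(words[1:]) - set(QUIET_FLAGS)
    if unknown:
        raise ValueError(f"unknown argument(s): {sorted(unknown)}")
    # Every flag is recomputed from this command alone; nothing carries over.
    return {flag: flag in words[1:] for flag in QUIET_FLAGS}


print(parse_quiet_on("{QuietOn Raw Label}"))
print(parse_quiet_on("{QuietOn Converted}"))   # Raw and Label are off again
```

Since each call builds the full set of flags from scratch, issuing a second QuietOn command is the only state that matters, which matches the description above.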
All XPAT commands that create sets operate the same way in quiet mode and in standard mode. However, the output generated by XPAT is different in the two modes. No prompt or newline appears when XPAT is operating in quiet mode. In addition, the output from XPAT in quiet mode is in a tagged format.
In standard mode, when a command or query creates a new set, a set number and the number of matches are output. In quiet mode, the tagged output contains the number of matches within <SSize> tags but no set number. For example, if a set of 122 matches is created by a XPAT query, the output is of the form:
<SSize>122</SSize>
In standard mode, information displayed about a set by a pr command is
affected if a modifier is attached to the command. The output from a pr
command is preceded by the offset in the text of the set member being
printed. If the set is a region set, the offset is the start of each region
in the set. In quiet mode, the output contains the numeric offset in a tagged
format. The settings of Raw, Converted, Label and Persistent affect the
information displayed by the pr command. Each of the settings is discussed
below.
{QuietOn}
With Persistent turned off, the values of the offsets that are output are the logical offsets into the file. The logical and persistent offsets are different and non-interchangeable. Persistent offsets are designed for use with the update system. (See the documentation for the XPAT update system.)
If the pr command is not of the form pr.region, the offset of each set member
is contained within <Start> tags and the entire output is contained within
<PSet> (for Point Set) tags. For example, if the set is of size 2, the output
might look as follows (without the line breaks).
<PSet><Start>1234</Start> <Start>5554</Start></PSet>
If the modifier to the pr command is .region, in standard mode the text
displayed is from the match point to the end of a specified region. In quiet
mode, the tagged output contains both the offset of the match point and the
offset of the end of the specified region. The offsets of the ends of the
region are contained within <End> tags and the entire output is contained
within <RSet> (for Region Set) tags. For example, for the above set of size
2, output from a pr.region might look like
<RSet><Start>1234</Start><End>1444</End> <Start>5554</Start><End>6000</End></RSet>
{QuietOn Label}
When Label is turned on, the form in which the offset is printed changes. The
numeric value of the offset into the text within the <Start> or <End> tags is
preceded by an identifying label and a colon. This label string is the value
of the Label setting. If Label has not been set, the label used in the output
is the name of the data dictionary file up to the first non-alphanumeric
character. For example, if the data dictionary is news.dd and Label has not
been set, the output from a pr command would look like:
<PSet><Start>news:1234</Start> <Start>news:5554</Start></PSet>
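A client reading quiet-mode output could pull the offsets back out of the tagged text with a small parser such as the Python sketch below. It is illustrative only: the function name `parse_offsets` is hypothetical, and the sketch assumes output with Raw and Converted turned off, so that the only content is `<Start>` and `<End>` tags like those in the examples above.

```python
import re

# <Start> or <End>, an optional "label:" prefix, then the numeric offset.
_TAG = re.compile(r"<(Start|End)>(?:([^:<]+):)?(\d+)</\1>")


def parse_offsets(output):
    """Collect (label, start, end) triples from <PSet> or <RSet> output.

    `label` is None when labels were not requested; `end` is None for
    point-set output, which carries no <End> tags.
    """
    members = []
    for kind, label, number in _TAG.findall(output):
        if kind == "Start":
            members.append([label or None, int(number), None])
        elif members:
            members[-1][2] = int(number)   # attach <End> to the preceding <Start>
    return [tuple(m) for m in members]


print(parse_offsets("<PSet><Start>news:1234</Start> <Start>news:5554</Start></PSet>"))
print(parse_offsets("<RSet><Start>1234</Start><End>1444</End> "
                    "<Start>5554</Start><End>6000</End></RSet>"))
```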
{QuietOn Raw}
When Raw is turned on, in addition to the tagged offsets, the output contains text showing the match point and surrounding context. For each member of the set, this additional information is output within <Raw> tags following the tagged offset information for each member of the set.
The length of the string being output is given within <Size> tags and is
followed by the text. As in standard mode, if pr has no modifier, the length
of string output is determined by the PrintLength setting and the context
shown to the left of the match point is determined by the LeftContext
setting. If the modifier is a numeric value, this value determines the length
of the string and the left context is still determined by the LeftContext
setting. If the modifier to the pr command is .region, the text starts at the
match point and continues to the end of the specified region.
For example, assuming a PrintLength setting of 25, and a LeftContext setting
of 5, the output from a pr command applied to a set of 2 matches to the
string sample would be (without the line breaks shown here):
<PSet><Start>1234</Start><Raw><Size>25</Size> This sample is to be firs</Raw> <Start>3456</Start> <Raw><Size>25</Size> This sample is to be seco</Raw> </PSet>
If the SortOrder setting is OccurHead, in addition to the above output, the
descriptive header is output in a tagged format. (See the entry for SortOrder
for a description of the header.) This information is contained within <Hdr>
tags and includes the length of the descriptive string within <Size> tags
followed by the string of the header. If the SortOrder setting was OccurHead
in the above example, the output would be (without the line breaks shown
here):
<PSet><Start>1234</Start> <Hdr><Size>10</Size>First </Hdr> <Raw><Size>25</Size> This sample is to be firs</Raw> <Start>3456</Start> <Hdr><Size>10</Size>Second </Hdr> <Raw><Size>25</Size> This sample is to be seco</Raw> </PSet>
{QuietOn Converted}
When Converted is turned on, in addition to the tagged offsets,
text following the match point is output for each member of the set. This text
is displayed with the appropriate character mappings for the XPAT index and any
stopwords removed. For example, if upper case is mapped to lower case when
creating the index, the text is displayed in lower case. If the index has the
word to
as a stopword, to
would not appear in the converted text.
For each member of the set, this additional information is output within
<Cvt> tags. The length of the output text string is enclosed
within <Size> tags and is followed by the text itself. For each
set member, the text string shown starts at the match point. This is in contrast
to the Raw text output which shows the match point with some left
context. If the pr
has no modifier, the length of the string is
determined by the PrintLength
setting. If the modifier is numeric,
this determines the string length. If the modifier is .region, the length of the string is the difference between the offsets of the match point and the end of the region. As the displayed text is converted text, some text conversions may cause output, such as multiple blanks resulting from character mappings or stopwords, to be suppressed. This may result in text that occurs past the end of the region being displayed.
For example, using the above example of a set of size 2 and further assuming that to and be are stopwords, the output might be:
<PSet><Start>1234</Start> <Cvt><Size>25</Size> sample is first used for </Cvt> <Start>3456</Start> <Cvt><Size>25</Size> sample is second used for</Cvt> </PSet>
If the SortOrder
setting is OccurHead
, in addition to the above output, the descriptive
header is given in a tagged format. (See the entry for SortOrder
for a description of the header). The information giving the descriptive string
precedes the <Cvt> tag and does not have the character mappings
applied to it. The previous example would change to:
<PSet><Start>1234</Start> <Hdr><Size>10</Size>First ..<Cvt><Size>25</Size> this sample is first used</Cvt> <Start>3456</Start> <Hdr><Size>10</Size>Second ..<Cvt><Size>25</Size> this sample is second use</Cvt> </PSet>
{QuietOn Persistent}
When Persistent is turned on, the offsets that are output are the persistent positions within the text database. As noted earlier, in a database that has not been initialized for update, the persistent and logical offsets are identical.
{QuietOn Raw Converted Label}
Any combination of the QuietOn arguments may be used. Thus, after the command {QuietOn Raw Converted Label}, the following would result:
<PSet><Start>news:1234</Start> <Raw><Size>25</Size> This sample is to be firs</Raw> <Cvt><Size>25</Size> sample is first used for </Cvt> <Start>news:3456</Start> <Raw><Size>25</Size> This sample is to be seco</Raw> <Cvt><Size>25</Size> sample is second used for</Cvt> </PSet>
The save
command results in identical behaviour to that of the
pr
command except that the information is written to a designated
file rather than displayed on the standard output.
Syntax errors that occur during the XPAT session are reported in a tagged format. A set size of -1 is indicated and the error information is contained within <Error> tags. For example, if a command uses the default region before it is set, the error shown is (without the line breaks shown here):
<SSize>-1</SSize> <Error>No information for default region </Error>
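A program reading quiet-mode output can test for the error form described above before trying to interpret a result. The following Python sketch is an illustration, not part of XPAT. It assumes the reply is available as a single string, and the function name quiet_error is hypothetical.

```python
import re


def quiet_error(output):
    """Return the error text if the reply reports a set size of -1, else None."""
    size = re.search(r"<SSize>\s*(-?\d+)\s*</SSize>", output)
    if size is None or int(size.group(1)) != -1:
        return None
    err = re.search(r"<Error>(.*?)</Error>", output, re.DOTALL)
    # An empty string means a size of -1 was reported without an <Error> element.
    return err.group(1).strip() if err else ""


reply = "<SSize>-1</SSize> <Error>No information for default region </Error>"
print(quiet_error(reply))
```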
Although the sets created by the signif
command are the same in
quiet and standard mode, signif
does not display the text string
associated with the set in quiet mode. If signif
is modified with a
negative integer n requesting n sets, only information
about the last set created is shown.
History
and {Settings}
display no output in quiet
mode.
history
, XPAT update system documentation, pr
,
save
, Settings
, signif
Label
, LeftContext
, PrintLength
,
SortOrder
{QuietOff}
{QuietOff}
quiet mode
{QuietOn}
{QuietOn Raw Converted Label Persistent}
quiet mode
quit
quit
terminates a XPAT session.
The use of the quit
command causes the session to end and the
XPAT process to exit. A message may be generated telling how much computer time
has been used during the XPAT session.
done
, stop
range
string1 .. string2
finds strings that begin with strings occurring within an alphabetic range.
The range
operator creates a point set consisting of those
indexed points in the text that fall alphabetically between string1
and string2 inclusive. String1 and string2 are
patterns that may or may not actually occur in the text being searched. The
resulting set contains the matches to both string1 and
string2.
Both the operands to the range
command must be strings. Using a
set number with the range
command is illegal and results in a
syntax error.
>> "n" .. "z"
This query finds all indexed points in the text that occur in the alphabetic ordering between n and z.
>> "a" .. "z"
Assuming the text has been indexed on words, this query creates a set of all the words and phrases in alphabetical order (that is, it produces a concordance of the text).
>> "1" .. "200"
This query finds all the strings that fall alphabetically between 1 and 200. This gives all the indexed strings that begin with 1 or 200. For example, the strings 1929 and 20034, as well as strings such as 2003/1 and 2000-15000, are in this range. The resulting set does not contain the strings 3 or 4.
>> region Date including ("1920" .. "1925")
This query finds Date regions that contain dates from 1920 to 1925 inclusive. The range 1920 .. 1925 also contains strings such as "1925000" as they also fall within the range.
>> "<Date>1920" .. "<Date>1925"
If dates are marked with the tag <Date> and begin with a 4-digit value for the year, this query reliably finds dates between 1920 and 1925 inclusive, and only those dates.
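The alphabetic comparison behind these examples can be sketched in Python. This is an illustration of the rule as described in this entry, not XPAT's implementation: it assumes plain code-point ordering and ignores the character mappings of a real index. The function name in_range is hypothetical.

```python
def in_range(s, lo, hi):
    """True if s falls alphabetically between lo and hi, where strings
    that begin with hi count as matches to hi (the range is inclusive)."""
    return s >= lo and s[:len(hi)] <= hi


# The "1" .. "200" example: ordering is alphabetic, not numeric.
for s in ["1929", "20034", "2003/1", "2000-15000", "3", "4"]:
    print(s, in_range(s, "1", "200"))

# The Date examples: without the tag, 1925000 is also in range.
print(in_range("1925000", "1920", "1925"))               # True
print(in_range("<Date>1926", "<Date>1920", "<Date>1925"))  # False
```

Comparing only the first len(hi) characters against hi is what makes every string beginning with string2 part of the result.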
data dictionary documentation, index points
rankedby
set1 rankedby set2
ranks a region set by the number of contained members of another set.
Rankedby
creates a set containing those members of
set1 that contain the greatest number of occurrences of members of
set2. Set1 must be a region set. Set2 may be
either a point set or a region set. The new set is a region set.
Set1 may be a predefined region set, a region set that has been
created within the current XPAT session using the region
command, a
region set resulting from the use of the import
command, or the
result of a previous query during the current session.
The size of the new set is by default the value of the
SampleSize
. Another size may be requested with a numeric modifier
in the form of a period followed by the requested size.
The set that is created, when accessed by pr
, save
and subset
in SortOrder
AsIs
, is naturally ordered by rank. That is to say, the
first member will be that element of set1 that contains the most
occurrences of members of set2.
In detail, the rankedby command operates as follows. It first splits all the members of set1 into groups. Each member of a group includes the same number of members of set2 as the other members of the group. In addition, within a group, the members are sorted into occurrence order. After it has grouped the members of set1, the rankedby command sorts the groups into decreasing order of number of included members of set2.
For example, say that set1 has 6 members, as follows: 3 members that each contain 2 members of set2, 2 members that each contain 4 members of set2, and 1 member that contains no members of set2. After rankedby has grouped and sorted set1, the groups are as follows. The first group consists of the 2 members of set1 that contain 4 members of set2. The second group consists of the 3 members of set1 that contain 2 members of set2, and the third group consists of the 1 member of set1 that contains no members of set2. Within each group, the members are in occurrence order. If the user has requested the top 4 members, the result set would contain both members of the first group and the first two members of the second group.
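The grouping and sorting described above amounts to a stable sort by descending count. The following Python sketch illustrates this under simplifying assumptions: regions are (start, end) offset pairs, set2 is a point set given as offsets, and a point counts as contained when start <= point <= end. The function name and data layout are hypothetical and do not reflect XPAT internals.

```python
from bisect import bisect_left, bisect_right


def rankedby(regions, points, size):
    """Return the `size` regions containing the most points.
    Ties keep occurrence (text) order."""
    points = sorted(points)

    def count(region):
        start, end = region
        return bisect_right(points, end) - bisect_left(points, start)

    in_text_order = sorted(regions)
    # sorted() is stable, so regions with equal counts stay in occurrence order.
    ranked = sorted(in_text_order, key=lambda r: -count(r))
    return ranked[:size]


# Six regions: three contain 2 points, two contain 4, one contains none.
regions = [(0, 9), (10, 19), (20, 29), (30, 39), (40, 49), (50, 59)]
points = [1, 2, 11, 12, 13, 14, 21, 22, 41, 42, 43, 44, 51, 52]
print(rankedby(regions, points, 4))
# [(10, 19), (40, 49), (0, 9), (20, 29)]
```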
>> region Story rankedby ("Free trade" near "Canada")
This query finds the regions described by region Story that contain the greatest number of matches to Free trade when it occurs close to the string Canada. The number of members in the new set is the value of the SampleSize setting.
>> region Quote rankedby.5 region Author
This query creates a set whose members are the 5 members of region Quote
that contain the greatest number of members
of region Author
.
including
, region
SampleSize
region
region
region
string
region
set1 .. set2
produce region sets in a text database. The first two forms of the
region
command refer to region sets that have been defined
externally to a XPAT session and for which information is available in the data
dictionary. These region sets may have been defined using patregion or
any other program that generates information (in the form that XPAT understands)
about regions in the text. The third form of the command defines a region set
during a XPAT session. The results of any of these commands can be used as
operands to any of the XPAT commands that operate on region sets.
region
Region
, used with no operand, refers to the particular
predefined region set that has been designated as the default region. The
default region is defined by the DefaultRegion
setting and can be
reset for the remainder of a XPAT session by changing the setting. If no default
region has been defined, using region
in this form causes an error.
The following message is generated:
No information for default region.
region
string
The second form of the region
command indicates one of the named
predefined region sets. The string is the name that has been given to
the region set in the data dictionary. For example, the region sets might be the
chapters of a book, the entries in a dictionary or the headlines in a newspaper
database. The information about certain regions in the text database is
generated by a program external to XPAT and is made available during a XPAT
session via the data dictionary. One program that generates the information is
patregion.
Note that the string giving the name of the region set can contain blanks or
special characters, if it is enclosed within double quote
marks.
region
set1 .. set2
The third form of the region
command defines a new region set.
The region set that is created by this command is only available for the
duration of the XPAT session. Information about this region set can be written to
a file using the export
command and read into a future XPAT session
using the import
command.
Set1 and set2 are used to define regions in the new
set. Set1 and set2 can be either point sets or region
sets. If either set1 or set2 is a region set, the
region
command uses only the first of the pair of pointers
describing its members in defining the new region.
Each region in the new set is formed as follows. A member of set1
is the beginning of a region if it is followed by a member from set2
with no other member of set1 occurring between the two members. The
end point of the new region is defined by the member in set2 that
most closely follows the set1 member. The region contains the text
from the beginning of the member of set1 up to but not including the
member of set2. This produces the smallest non-overlapping region set
that can be formed by set1 and set2. The size of the
region set created is equal to or smaller than the size of set1. If
the members of set2
are matches to a pattern, the new region set
does not contain the occurrences of that pattern. For example, if
set2 is the set of matches to the string End of
Message
, the new region set contains no occurrences of the string End of Message
.
If set1 and set2 are identical, two extra regions may
be included in the newly created set. These are: a region from the beginning of
the text to the member of set1 that occurs earliest in the text; and
a region from the last element of set1 in the text to the end of the
text. If either of these regions is a substring of length zero, it is not
included. If the shift
command is applied to set1 or
set2, the extra regions are not included in the new set.
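The rule for pairing members of set1 with members of set2 can be sketched in Python. This is a minimal illustration, assuming both operands are given as lists of integer offsets. It does not model the two extra regions that may be added when set1 and set2 are identical. The function name make_regions is hypothetical.

```python
from bisect import bisect_right


def make_regions(starts, ends):
    """Pair each start with the closest following end, provided no other
    start lies strictly between them. Regions are (start, end) pairs where
    the end offset itself is not part of the region."""
    starts = sorted(set(starts))
    ends = sorted(set(ends))
    regions = []
    for i, s in enumerate(starts):
        j = bisect_right(ends, s)      # first end strictly after s
        if j == len(ends):
            continue                   # no end follows this start
        e = ends[j]
        later_start = starts[i + 1] if i + 1 < len(starts) else None
        if later_start is not None and later_start < e:
            continue                   # a closer start will claim this end
        regions.append((s, e))
    return regions


# Starts at 0, 50 and 100; ends at 30 and 130.
# The start at 50 is dropped because 100 lies between it and the end at 130.
print(make_regions([0, 50, 100], [30, 130]))   # [(0, 30), (100, 130)]
```

Because a start is dropped whenever a later start precedes the next end, the result can never have more members than set1 and its members never overlap.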
Some programs, such as patregion, that produce predefined region
sets, define the end point of the region in a somewhat different manner. These
programs deal with patterns of text (rather than points in the text) and the end
point of the region that is defined is usually the last character in the pattern
that is used to define the regions. If desired, the region
command
within a XPAT session can be used in conjunction with the shift
command to create a set of regions in which the ends of the regions are at the
end of a pattern. See the examples below.
XPAT does not support region sets whose members nest or overlap. As described
above, using region
with operands that are patterns defining nested
or overlapping regions, creates a region set which is the smallest
non-overlapping set of regions. Patregion used on the same text creates
a possibly different region set (also non-overlapping) consisting of regions
from an opening pattern to the following end pattern.
>> region
including ("Smith" near "Jones")
This query creates a region set, consisting of the members of the default
region which contain a match to the string Smith
when it occurs within a prescribed distance of the string Jones
.
>> "Campbell" within region "Speaker Name"
In this example, we assume that one of the predefined regions has been
named Speaker Name
. This query creates a point set
that contains matches to the string Campbell
occurring within members of region Speaker Name
.
>> firstb = region "<A>".."</B>"
>> (region B within *firstb) including "requested string"
The text, in this example, contains regions that begin with <A>
and end with </A>
. Each of these A
regions contains smaller regions that begin with <B>
and end with </B>
. Assume, in certain instances, that it is
necessary to be able to find the first B
region
within each A
region. The use of region in the
first query creates a region set named firstb
that
can be used to find these regions. The members of firstb
are the pieces of text that begin with the string
<A>
and extend to the closest string </B>
. The second query finds the members of region B
that are within firstb
, and then finds the members of the latter that
include requested string
.
>> quote = region
"<Q>" .. (shift.4 "</Q>")
If some components in the text are tagged with <Q>
and </Q>
this
command creates a region set describing these components. Each region in the
set extends from the opening tag <Q>
to the
end of the closing tag </Q>
. By using the
shift operator, applied to the </Q>
, the
members of the point set used to find the ends of the new regions all point to
the end of the string </Q>
rather than to the
beginning of the tag.
>> mess1 = region "From:" .. "From:"
>> mess2 = region "From:" .. (shift.0 "From:")
>> from = region *mess1 .. "Received:"
>> "Bill" within *from
This set of queries is being applied to a database of mail messages. Each
message has the string From:
at the beginning. The
string Received:
appears at the beginning of the
second line of the message indicating the time the message was received.
Assume that the first query, identifying the matches to the string From:
, returns a set of size 10. Further assume that
there is text in the database preceding the first From:
. The next query creates a region set of size 11 as
two additional regions are included in the resulting set: one containing the
text from the beginning of the text to the first occurrence of From:
and the other containing the text from the last
occurrence of From:
to the end of the text. The
third query creates a region set of size 9 as these two regions are not
included in the new set. The next query creates a region set describing the
sender of the message. The final query finds the matches to Bill
in the regions describing the sender of the message.
Notice the use of an asterisk (*
) before the name of the new
region set when it is used as an operand to a XPAT command.
data dictionary documentation, export
, import
,
including
, index point
, naming sets
,
pr
, save
, set name
, shift
,
within
DefaultRegion
sample
sample
set1
finds representative members of a larger set.
Sample
creates a set containing a specified number of members of
set1. Set1 may be either a region set or a point set. The
new set is of the same type as set1.
The size of the set created is determined by the value of the
SampleSize
setting which has a default value of 10. If the size of
set1 is less than SampleSize
, then the new set created
is the same size as set1. The size can be changed for all subsequent
uses of sample
during the current session by changing the
SampleSize
setting. For an individual use of the
sample
command, the setting can be changed by using a modifier
attached to the command. The form of the modifier is a period followed by a
number giving the desired size of the sample set.
The members of the sample set are chosen as follows. If the size of
set1 is x and the sample size requested is y,
each x/yth member of set1 is in the sample set. For
example, if a sample of size 20 is requested from a set of size 2000, the 100th,
200th members etc. are chosen. The ordering of the set, and hence the members of
the sample set, is determined when the set is created. The
SortOrder
setting does not determine which members are included in
the set created by the sample
command as it does for the
subset
, next
, ~nextemp
, and
first
commands. However, this setting does affect how the
sample
set is ordered when used with a pr
command (or
save
command).
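The selection rule described above can be sketched in Python. This is an illustration under stated assumptions: the set is given as a list already in its creation order, and x/y is rounded down when it is not a whole number (the text does not say how XPAT rounds). The function name sample is used here only for readability.

```python
def sample(members, sample_size=10):
    """Pick every (x // y)-th member, where x is the set size and y the
    requested sample size. A set no larger than y is returned whole."""
    x = len(members)
    if x <= sample_size:
        return list(members)
    step = x // sample_size            # assumption: integer division
    picked = [members[i] for i in range(step - 1, x, step)]
    return picked[:sample_size]


# A set of 2000 members numbered 1..2000, sampled down to 20:
result = sample(list(range(1, 2001)), 20)
print(result[:3], len(result))   # [100, 200, 300] 20
```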
The sample command can be used by itself or with the pr, save or export commands. The sample command may only be used in conjunction with these commands.
>> sample
"shaks"
Assuming a SampleSize
setting of 10, this query creates a set
of 10 examples from the set of matches to the string shaks
.
>> {SampleSize 30}
>> sample
"shaks"
The first command changes the SampleSize
setting to 30 and the
second creates a set of 30 examples from the set of matches to the string
shaks
.
>> region Quote including "Doyle"
>> sample.20 %
The first query creates a region set containing Quote regions that include the string Doyle. The second query creates a sample set of 20 members from the results of the first query.
>> region Quote including (sample "Doyle")
This query is illegal and results in a syntax error, because sample may only be used by itself or with the pr, save or export commands.
first
, next
, ~nextemp
,
subset
SampleSize
, SortOrder
{SampleSize}
{SampleSize
number}
specifies the size of the set produced by the sample
,
subset
, and rankedby
commands.
By default, sample
and subset
create a set of 10
members of a given set. This setting may be overridden and the size of the
result determined by a modifier for an individual use of these commands. The
SampleSize
setting can be changed at any time during a XPAT session
and remains in effect until changed again or until the end of the session. The
current value of the SampleSize
setting is displayed by the command
{Settings}
.
>> {SampleSize 200}
>> pr sample 5
This changes the SampleSize
to 200 and any subsequent
sample
or subset
command uses this value. In the
second query XPAT prints information about 200 members of set number 5 created
earlier in the session.
rankedby
, sample
, Settings
,
subset
save
save
set1
writes the contents of a set to a file.
The save
command is identical to the pr
command
except that the output is written to a file. The name of the file where the
information is written is determined by the value of the setting
SaveFile
. The default value of the setting is xpat.res
. The file used by the save
command can
be changed at any time during the XPAT session by changing the setting. The
information output by the save
command is concatenated onto the end
of the save file if one of the same name already exists. Otherwise, a new file
is created and the information is written to the new file. Assuming the default
setting of SaveFile
, the following message is printed on execution
of the save
command:
Saving in xpat.res.
For each member in the given set, the output is in the form of an integer
giving the offset of the set member in the text file followed by a comma, a
blank, two periods and the characters surrounding the set member. The order in
which the set is output is determined by the current SortOrder
setting.
With no modifier, Save outputs a line of text for each element in set1. The PrintLength and LeftContext settings determine the content of the line saved. With the default settings, the saved text is 64 characters in length, of which 14 precede the match point. The number of characters to the left of the match point can be altered by changing the LeftContext setting. The total number of characters printed can be altered by changing the PrintLength setting.
The total number of characters to be saved can be set for a single instance
of the command, by using a numeric modifier attached to the save
command. The modifier is in the form of a period followed by a number giving the
total number of characters to be saved. The left context that is saved is still
determined by the value of the LeftContext
setting.
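The relationship between the match point, the LeftContext setting and the total length can be sketched in Python. This is an illustration only. It assumes the text is an in-memory string, that the window is clamped at the start of the text, and that each line has the form described above (offset, comma, blank, two periods, text). The function names are hypothetical.

```python
def window_start(match_offset, left_context=14):
    """Offset at which the saved text begins (clamping at 0 is an assumption)."""
    return max(0, match_offset - left_context)


def saved_line(text, match_offset, print_length=64, left_context=14):
    """Build one output line: the match offset, then the context window."""
    start = window_start(match_offset, left_context)
    return f"{match_offset}, ..{text[start:start + print_length]}"


text = "0123456789" * 30
# 20 characters in total, 5 of them to the left of the match point at 100.
print(saved_line(text, 100, print_length=20, left_context=5))
# 100, ..56789012345678901234
```

A numeric modifier such as save.200 changes only print_length in this sketch, while left_context still comes from the setting.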
The second form the modifier can have is a period followed by the string
region
. When the command save.region
is used, the
output text starts at the match point and continues to the end of the default
region in which the match point occurs. If the match point is not within a
default region, no output is saved for the match point.
The second form of the modifier can be refined to request that the text
output be in a region other than the default region. An additional modifier,
specifying a defined set of regions, can be attached to the already modified
save
command (i.e. to the save.region
). This
additional modifier can be in one of three forms: a string giving the name of a
predefined region, the number of a region set created in the XPAT session, or a
string preceded by an asterisk (*
) referring to a named region set
defined in the XPAT session. As with the form save.region
, described
above, this use results in the output text starting at the match point with no
left context and continuing to the end of the region. When the match point is
not contained in the designated set, no output is saved.
The similarly named commands save.commands
and
save.history
result in very different behaviour and are described
in separate entries.
>> "Helen Maday"
>> save
As a result of this command, XPAT writes a line of context for each member
in the most recently created set. The information is written to the file that
is named by the setting SaveFile
. If the setting has not been
changed during the session, the file used is xpat.res
. Note that the information is appended to the
save file if one of the same name already exists.
>> save "From: Tony Lopez "
A line of context for each member of the set that matches the string From: Tony Lopez is written to the save file.
>> {PrintLength 120}
>> save
region including "planet"
A line of context is written for each member in the set of regions created by the including query. The line that is written starts at the beginning of each region in the new set. Since the PrintLength has been set to 120, each line contains 120 characters and has 14 characters to the left of the beginning of the displayed region.
>> save.200
shift.-100 ("procedure" near "policy")
In this case, a line of 200 characters is written to the save file for each
member in the set created by the query shift.-100
("procedure" near "policy")
. The text that is written starts 114
characters to the left of the string procedure
.
>> region including (region EQ including "<A>Doyle</A>")
>> save.region
The first query finds all the earliest quotes (defined by the region EQ
) that have Doyle
as
the author. The second command saves information about each of the default
regions that includes one of these quotes. The information that is written for
each of these regions contains the offset in the text file of the region, a
comma, a blank followed by two periods and the text of the default region. As
no set is given as an operand to the save
command, it is
understood that the command applies to the previous set.
>> region Quote including ("stadium" near "Toronto")
>> save.region.Quote %
The first query finds all quotes that contain the string stadium within 80 characters of the string Toronto (assuming the default Proximity setting of 80). The second command saves information in the save file (xpat.res unless the SaveFile setting has been reset) about each of these regions. As in the example above, the output for each set member is in the form of an integer giving the text offset, a comma, a blank followed by two periods and the text beginning at the start of region Quote and continuing to the end of the region.
>> save.region.5
"night" fby "day"
This command saves information about the set created by the query "night" fby "day"
. The information written to the save
file (after the offset, comma, blank and two periods) starts at the text night
and continues to the end of the region defined by
set number 5 that contains the match.
>> minutes = region "<Min>" .. "</Min>"
>> *minutes including "examination schedule"
>> save.region.*minutes
The first query in this example defines a set of regions that are named
minutes
. The second query finds the regions in this
set that contain the string examination schedule
.
The third command saves information about this set in the save file. For each
member in this set, the information saved contains the offset, comma, blank
and two periods followed by the text of the region in the set named minutes
.
exec
, export
, import
, pr
,
save.commands
, save.history
DefaultRegion
, LeftContext
,
PrintLength
, SaveFile
, SortOrder
save.commands
save.commands
writes information to a file about the queries in the XPAT session. These are saved in a form that allows them to be used in another XPAT session.
Save.commands
saves, in a file, all the queries that have been
executed and have produced sets during the current session. These are the
queries that appear in the history list. Only the command is saved in the file,
not the set number or number of matches. The setting CommandFile
,
that determines the file where the information is written, has a default value
of xpat.cmd
. The output file can be changed at any
time during the session by changing the CommandFile
setting. If a
file of this name already exists, the information is concatenated onto the end
of the file. Otherwise a new file is created.
The saved information can be read into a XPAT session and executed using the
exec
command.
>> "love" near "hate"
>> pr sample
>> region Q including %
>> save.region.Q %
>> {CommandFile "/usr/my_commandfile"}
>> save.commands
The second-to-last command sets a new name for the file to be used by save.commands: /usr/my_commandfile. The final command saves the information about the commands that has been generated to this point in the XPAT session. In the portion of the session shown here, only two commands generated sets and so the following is saved in the file /usr/my_commandfile:
"love" near "hate"
region Q including %
exec
, export
, import
,
save
, save.history
CommandFile
{SaveFile}
{SaveFile
string}
changes the file name used by save
.
The SaveFile
setting determines the file written by the
save
command. It has a default value of xpat.res
. If the string begins with a numeral, or contains
blanks or non-alphanumeric characters, it must be enclosed within double
quote
marks. The file name must also conform to the file naming
convention of the host operating system. It can be changed at any time during a
XPAT session and remains in effect until it is changed again or until the end of
the session. The current value of the SaveFile
setting is displayed
by the command {Settings}
.
>> {SaveFile "output_file"}
This changes the setting to the value output_file
so that any subsequent use of the
save
command writes to this file. The name of the file is not an
absolute path name and is therefore located in the current working directory.
save
, Settings
save.history
save.history
writes information to a file about the queries and results in the current XPAT session.
Save.history
writes a record of a XPAT session. XPAT's history
list records information about all queries that produce sets. For these queries,
save.history
saves the set number, the number of members in the
set, and the query that produced the set in a file. The setting
HistoryFile
, that determines the file where the information is
written, has a default value of xpat.his
. A different
output file can be chosen at any time during a session by changing the setting.
If a file of this name already exists, the information is concatenated onto the
end of the file. Otherwise a new file is created. Note that comments are saved
only if they are on the same line as the command itself.
>> "fish" near "fowl"
>> pr
>> region definition including %
>> save.history
This command saves the information from the history list in the file xpat.his
. After the sequence of commands shown above, the
history list contains information about two sets which are saved into a file:
1: 142, "fish" near "fowl"
2: 17, region definition including %
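A saved history file in this form can be read back by another program. The following Python sketch is an illustration, not part of XPAT. It assumes one entry per line in the form "set number: size, query", as in the example above, and the function name parse_history is hypothetical.

```python
import re

# set number, colon, number of members, comma, then the query text
HISTORY_LINE = re.compile(r"^\s*(\d+):\s*(\d+),\s*(.*)$")


def parse_history(lines):
    """Return (set_number, size, query) tuples, skipping lines that do not match."""
    entries = []
    for line in lines:
        m = HISTORY_LINE.match(line)
        if m:
            entries.append((int(m.group(1)), int(m.group(2)), m.group(3).rstrip()))
    return entries


saved = [
    '1: 142, "fish" near "fowl"',
    "2: 17, region definition including %",
]
for number, size, query in parse_history(saved):
    print(number, size, query)
```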
exec
, export
, history
,
import
, save
, save.commands
HistoryFile
set name
*name
refers to a named set.
Query result sets may be named and subsequently referred to either by set
number or by name. The name must be preceded by an asterisk (*
) to
reference the set; otherwise, XPAT interprets the name as a command or search
string.
>> univ = "university" near "MIT"
>> qu = region Quote including *univ
>> *qu including "Harvard"
The first query creates a set of matches to university
occurring near MIT
.
The second query uses the set *univ
and creates a
new set *qu
. The third query finds the Quote
regions that include the set of matches from the first query as well as the
string Harvard
.
>> begin = region "<Title>" .. "</Summary>"
>> "Paris" within *begin
The first query defines a new region set and calls the new set begin
. The second query creates a set containing the
matches to Paris
that fall within one of these
regions.
region
, including
, naming sets
,
set number
, within
set number
references a previously created set.
After the first query in a session, XPAT displays a line of the form:
1: 300 matches
The number 1 here names the set of results and can be used in subsequent
searches. The valid set numbers are those displayed by the history
command.
When an invalid set number is used XPAT generates a message. If, for example, set number 33 is referenced before it has been calculated or after it has been freed the message is:
Expression 33 is out of range
>> region Author including 5
In this query, the number 5 refers to the fifth result of the session. For example, set 5 might be the set of matches to all the variants of spelling for a particular author's name.
history
, ~free
, ~qnum
, set
name
History
sets
In the XPAT system, queries are combinations of the XPAT commands described in this document. In response to each query, XPAT creates a set which is either a point set or a region set.
These sets can be used as operands in subsequent queries. In contrast to the conventional approach of a single, nested compound query, XPAT allows complex queries to be expressed as a series of simple queries. This provides an opportunity to try alternative ways of combining previous result sets to arrive at a solution. XPAT provides a history list of all previous sets created in a session and a convenient notation to access them.
A member of a point set is a location in the text which is the start of a string that continues to the end of the text. The XPAT system finds locations in the text where strings matching the pattern(s) given in the query begin. The members of a point set are usually index points. However, in the sets created by shift or offsets (the notation [n]), the members refer to positions in the text that may or may not be index points.
The members of a region set are substrings of the text, beginning and ending at specified points. Region sets that are the result of a query within a XPAT session are available only for the duration of the session. However, region sets can also be defined externally and be made available to the XPAT session. Each member of a region set is described by two locations in the database, indicating the start and end of the region and these locations may or may not be index points.
Region sets may be used in XPAT to restrict searches to desired parts of the text. The within command finds the members of a set that are contained in a designated region set. The including command finds the members of the designated region set that contain one or more members of a given set.
The sets produced within XPAT can be refined using set arithmetic or proximity commands. The difference (-) and intersection (^) commands remove members of an existing set. The proximity (fby and near) commands reduce sets by finding the members of a given set that have specified text close by.
In addition to refining sets, it is possible to combine two sets to create a larger one by using the union (+) command.
XPAT queries, applied to a text database, may create large sets; analyzing a smaller, representative subset might aid in making decisions about how to proceed appropriately with a search. Several commands in XPAT provide this capability. Sample creates a representative subset while subset, next, and first each create contiguous smaller subsets of a larger set.
difference, fby, first, including, intersection, near, next, offsets, range, region, sample, shift, subset, union, within
{Settings}
{Settings}
shows the current values of a number of XPAT parameters.
>> {Settings}
The output might be:
{CharMappings " " "Aa" "Bb" "Cc" "Dd" "Ee" "Ff" "Gg" "Hh" "Ii" "Jj" "Kk" "Ll" "Mm" "Nn" "Oo" "Pp" "Qq" "Rr" "Ss" "Tt" "Uu" "Vv" "Ww" "Xx" "Yy" "Zz" "[ " "\\ " "] " "^ " "_ " "` " "{ " "| " "} " "~ "}
{StopWords}
{WordStarters " \P" "\P-" "-\P" "\P<" "\P&."}
{SortOrder AsIs}
{PrintLength 64}
{LeftContext 14}
{Proximity 80}
{SampleSize 10}
{SaveFile xpat.res}
{CommandFile xpat.cmd}
{ExportFile xpat.exp}
{HistoryFile xpat.his}
{History 0}
{QuietOff}
shift
shift set1
creates a point set whose members are a specified distance from a set of matches.
Shift creates a new set whose members are locations in the text which result from an equal shift being applied to all members of set1. Set1 may be either a point set or a region set. If it is a region set, the shift is applied to the first of the pair of pointers describing each member of the set. The set created by the shift command is always a point set.
The shift command creates a new set consisting of pointers that are (by default) 10 characters after the original set members. The set created is in occurrence order. This default shift distance (10) and direction can be changed by using a modifier attached to the shift command. The modifier is of the form of a period followed by a positive or negative integer. If the modifier is a negative integer n, each member of set1 is shifted to n characters before the original location. If it is a positive integer n, the shift is to n characters after that location.
The points in the new set need not be index points.
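The behaviour described above can be modelled in a few lines of Python. This is only an illustrative sketch, not XPAT code: the function name shift, the representation of a point as an integer offset and of a region as a (start, end) pair, and the sample text are all assumptions made for the example. No clamping is applied if a negative shift would move a point before the start of the text.

```python
from typing import List, Tuple, Union

Point = int
Region = Tuple[int, int]  # (start, end) character offsets into the text


def shift(members: List[Union[Point, Region]], distance: int = 10) -> List[Point]:
    """Hypothetical model of `shift.n set1`; not part of XPAT itself."""
    # A region contributes only the first of its pair of pointers.
    starts = [m[0] if isinstance(m, tuple) else m for m in members]
    # Every member moves by the same distance; the result is a point set
    # returned in occurrence (text position) order.
    return sorted(start + distance for start in starts)


text = "<Tag>alpha</Tag> <Tag>beta</Tag>"
tag_matches = [17, 0]                  # offsets of "<Tag>", not in text order
contents = shift(tag_matches, 5)       # comparable to `shift.5 %`
print([text[p:p + 4] for p in contents])
```

With a region set as operand, for example shift([(0, 16)], 5), only the start of each region is shifted, which mirrors the statement that the result is always a point set.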
>> shift "dog and cat"
This creates a new point set whose members are 10 characters after each match to the string dog and cat.
>> "<Tag>"
>> pr
>> shift.5 %
>> pr
The first query creates a point set of matches to the string <Tag>. The ordering of this set is alphabetical. The third line of the example creates a second point set whose members start 5 characters after the start of each string <Tag>. The members now point to the start of the contents of the tagged region. This set is in occurrence order. Thus, the order in which the members of the set are displayed by the second pr is not necessarily the same as that seen with the first pr command.
>> shift.5 region Tag
>> pr
Assuming that region Tag describes a set whose members begin with a tag <Tag>, the first query creates a point set whose members now point to the start of the contents of Tag regions. The set created is the same as the one in the above example.
>> pr shift.-20 "the best of times"
This displays more context before the match points without having to change the LeftContext setting. The pr command displays members of a set of matches to the string the best of times with 34 characters showing before the matches (the default 14 characters plus 20 additional resulting from the shift command). Because of the shift command, this set is displayed in occurrence order. Note that, as with other commands, since pr and shift are on the same line the set is not saved, does not appear on the history list and so is not accessible by a set number.
offsets, region
signif
signif set1
finds frequently occurring words or phrases in the text.
Signif finds the most frequent words or phrases following the text matching set1. Set1 can be either a point set or a region set for the first two forms of the command but has a restriction, as noted below, for the third form of the command. The set (or sets) created are point sets. If set1 is not given, signif uses the last set produced as its operand.
The signif command looks for words or phrases, in contrast to the other XPAT commands that operate on points and regions. For signif, a word is defined as a string of characters ending either in a blank or a character that has been mapped to a blank by the character mappings used in building the index being used in the current XPAT session.
The signif command examines the words or phrases that begin with a string. The string used is the longest string common to the text pointed to by each of the members of set1. If set1 is a point set resulting from a string search, signif starts with that string. If set1 is not given and the previous result was from a signif command, the string is the one associated with the set created. (Note that, in addition to the number of matches in the set, signif returns a string value.) If set1 is a region set, the first of the pairs of pointers describing the regions are used to find any common string beginning in these regions. For example, this might be the pattern used to define the regions. If set1 is a point set that is not the result of a string search, XPAT checks for a common string beginning the text pointed to by each member of set1. In some cases, this common string is the null (empty) string.
Signif has three different modes of operation. A modifier can be attached to the signif command. The syntax of the modifier is a period followed by an integer n.
The first mode of the command, signif with no modifier, finds a frequent word or phrase and then, by reapplying signif to the resulting set, may be used to extend this phrase. In this mode, the command finds the string and then finds all possible extensions, in the text, of this string up to the next blank. Signif creates the set, among the matches to these possible extensions, with the most members.
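The first mode can be approximated with the following Python sketch. It is a simplified model rather than the XPAT implementation: it assumes that index points are word starts, that words are separated by single blanks, and that no character mappings or stopwords apply. The function name signif and its return value (the winning text and its match count) are chosen for the illustration only.

```python
from collections import Counter
from typing import Tuple


def signif(text: str, prefix: str = "") -> Tuple[str, int]:
    """Most frequent extension of `prefix` up to the next blank (no modifier)."""
    counts: Counter = Counter()
    pos = text.find(prefix)
    while pos != -1 and pos < len(text):
        # Assumption: only word starts count as index points.
        if pos == 0 or text[pos - 1] == " ":
            blank = text.find(" ", pos + len(prefix))
            # Keep the closing blank so that the result can be extended again.
            end = len(text) if blank == -1 else blank + 1
            counts[text[pos:end]] += 1
        pos = text.find(prefix, pos + 1)
    if not counts:
        return "", 0
    return counts.most_common(1)[0]


sample_text = "to be or not to be that is to bed "
print(signif(sample_text, "to be"))
```

Because the winning extension keeps its closing blank, applying the function again to that result examines phrases one word longer, which corresponds to reapplying signif to the resulting set.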
The second mode of the command, signif with a positive integer n as modifier, finds the most commonly occurring phrase of length n words beginning with the string. The set created contains those members of set1 that are the matches to the most frequent phrase beginning with the given string that is at least n words in length.
The third mode of the command, signif with a negative integer -n as modifier, creates n sets which are the matches to the n most frequent phrases beginning with the string. This use of the signif command is restricted to sets matching a string or a set created by a range command. Using any other set as an operand is illegal and results in the error message:
Repetitive signif should be on strings or ranges only
Note that the set created by signif is identical to that created by signif.-1 but not necessarily to the one created by signif.1. (See the examples.)
The displayed output from the signif command gives the number of matches and the word or phrase found (preceded by text=). The text is shown with the character mappings applied and stop words removed (see data dictionary documentation). For example, if > has been mapped to a blank and the word the is not a stopword, one would see the following:
>> signif "<HL>"
2: 604 matches, text=<hl the
>> signif ""
>> signif
>> signif.3 ""
>> signif.-10 ""
The above queries, except the second one, operate on the entire text. In these cases, the string that the signif command starts with is the empty string. The first query finds the most frequent word or phrase that occurs in the text. The second query operates on the set created by the first query and extends this result by one word. The third query finds the most frequent phrase of at least three words that occurs in the text. The fourth query finds the ten most frequent words or phrases within the text.
>> signif "y"
This finds the most frequent word that starts with the letter y in the text database.
>> signif "to be"
Note that the string to be used by the signif command does not end in a blank. Since signif looks at all extensions of the string up to the next blank, the only phrases eligible to be the most frequent phrase starting with this string are two-word phrases. In many texts, the most likely set created with this command would be the point set matching the two-word phrase to be (ending in a blank).
>> signif "to be "
In this example, the string given ends in a blank so the possible extensions of this string that are examined by signif are all three-word phrases. The point set created as the answer to this command is the set of matches to the three-word phrase starting with the two words to be that occurs most frequently in the text.
>> signif.1 "to be"
This command creates a set of the most frequent phrase whose word length is one. The newly created set is the set of matches to the word to that are contained in the set of matches to the phrase to be. This means that the size of the set is probably smaller than the set of matches to to but that the text shown is the string to. Also note that this set is not equal to the set created in the preceding example or to the set created in the following example.
>> signif.-1 "to be"
This command finds the most frequent phrase that begins with the string to be. The answer to this is a two-word phrase that is identical to that found in the second example.
>> signif.-3 "to be"
This command finds three sets that are the matches to the three most frequent phrases beginning with the string to be. The first set is the same set created in the example before last. The next set is created by applying signif to two sets and comparing the resulting sets. Signif is first applied to a set that is created by taking the difference between the set represented by the original string and the new set just created. Signif is also applied to the set just created. The larger set from these two signif applications is the second answer. This same procedure is repeated on the original set and on the two new sets to obtain the third set.
>> signif.4 "to be"
>> signif
The first command creates the set of matches to the most frequent four-word phrase that begins with to be. The second signif is applied to the resulting four-word phrase. Since this result ends in a blank, the second signif searches for the most frequently occurring five-word phrase that begins with the four words located by the first command.
>> "aba" .. "abz"
>> signif
The first command creates a point set that matches all strings that are alphabetically between aba and abz. Signif applied to this set creates a set of matches to the most frequent word in the text that begins with ab.
>> auth = region "<A>" .. "</A>"
>> signif *auth
>> signif.2 *auth
>> signif.-4 *auth
The first command creates a region set. The second command creates a set representing the most frequent string at the beginning of these regions. Assuming that in the character mappings the > is mapped to a blank, the string that signif uses to find the extension consists at least of the word <A>. Thus, the set created is the set of matches to the string <A> followed by at least one other word. The third command gives the most common two-word phrase starting the region set named auth. The first word of this phrase will be <A>. The last command is illegal and results in the following error message:
Repetitive signif should be on strings or ranges only
>> signif.-4 "<A>"
This command creates four sets. These are the sets of matches to the four most frequent phrases starting with the string <A>. Notice that, if the > has been mapped to a blank, the phrases are at least two words in length.
>> sample.100 "<A>"
>> signif.-2 %
This use of signif is illegal since signif with a negative modifier may be applied only to a set matching a string or created by a range command. An error message is generated.
{SortOrder}
{SortOrder number}
determines the ordering of a set.
The behaviour of first, next, pr, save, subset and ~nextemp is affected by the ordering associated with their operands. The SortOrder setting indicates whether these sets are to be treated in alphabetical order or in the order that members of the set occur in the text (occurrence order). (A SortOrder setting of OccurHead (explanation below) also determines what is displayed by the pr and save commands.)
Every set in XPAT has an internal ordering which varies from set to set, as described below. The ordering is chosen by XPAT, for reasons of efficiency, and no assumptions can be made in this regard. It is often desirable, however, to present results in a certain order, and the SortOrder setting exists to control this. When a set is an operand of a pr or save command or a new set is created by a first, next, subset or ~nextemp command, the ordering of the set and hence the behaviour of the command is determined by the SortOrder setting. This may mean that the existing set must be reordered for processing with these commands. For some XPAT commands this results in a change in the internal ordering, and this change is reflected when subsequently operating with a SortOrder setting of AsIs (explanation below). The ordering of a set that is not an operand to one of the above commands is not affected when the SortOrder setting changes.
The permissible values for the SortOrder setting are AsIs, Alpha, Occur and OccurHead. The default value of SortOrder is AsIs. If the SortOrder setting is AsIs, the set is processed in the order in which it currently exists. If the SortOrder setting is Alpha, the set is processed in alphabetical order. If the SortOrder setting is Occur or OccurHead, the set is processed in occurrence order. For sets whose internal ordering is not alphabetical, for example region sets, displaying results with a SortOrder setting of Alpha requires re-sorting, which may add a computation delay that depends on the set size.
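The effect of the four values on a point set can be pictured with a small Python model. It is a sketch only: the function name order_points is hypothetical, points are plain character offsets, alphabetical order is taken to mean comparing the text that begins at each point, and character mappings are ignored.

```python
from typing import List


def order_points(text: str, points: List[int], sort_order: str = "AsIs") -> List[int]:
    """Hypothetical model of how SortOrder arranges a point set."""
    if sort_order == "AsIs":
        return list(points)                               # current internal order
    if sort_order == "Alpha":
        return sorted(points, key=lambda p: text[p:])     # by the text at each point
    if sort_order in ("Occur", "OccurHead"):
        return sorted(points)                             # by position in the text
    raise ValueError(f"unknown SortOrder value: {sort_order}")


text = "zebra apple mango"
points = [12, 0, 6]
print(order_points(text, points, "Alpha"))
print(order_points(text, points, "Occur"))
```

The Alpha branch has to compare text for every member, while the Occur branch only compares offsets, which gives a feel for why re-sorting a large set alphabetically can be slower.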
Setting SortOrder to OccurHead results in further changes to the behaviour of the pr and save commands. For pr and save, with a SortOrder setting of AsIs, Alpha or Occur, the position offset for each set member is displayed. With a SortOrder setting of OccurHead, the contents of a named region set are output in place of the position offset. Setting SortOrder to OccurHead requires reference to two regions within the brace brackets used to change the SortOrder setting. The first region referenced is the one whose contents are displayed in place of an offset when members of a set are displayed. The second region referenced must be one that contains both the match points of the members of a set and the first region referenced in the SortOrder setting (OccurHead).
The text displayed in place of the offset, as a result of SortOrder being set to OccurHead, is the first region found of the specified type within the containing region (also specified in the OccurHead setting). If the text to be displayed begins with an opening angle bracket, the text until the closing angle bracket is ignored and the next character is displayed. If the next character is another angle bracket, the preceding process is repeated iteratively. A maximum of 10 characters or up to the next < in the text is displayed. Both these region sets, named in the OccurHead setting, must be in the data dictionary. If they are not, for example if X is named in the setting but does not exist in the data dictionary, the following error message results:
No information for region X in the data dictionary
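The rule for choosing the text shown in place of the offset (skip leading tags, then show at most 10 characters or up to the next <) can be sketched in Python as follows. The function name occur_head_label and the handling of an unterminated tag are assumptions made for this illustration; they are not taken from XPAT.

```python
def occur_head_label(region_text: str, max_chars: int = 10) -> str:
    """Hypothetical model of the label shown under {SortOrder OccurHead ...}."""
    pos = 0
    # Skip any run of leading tags such as "<LF><B>".
    while pos < len(region_text) and region_text[pos] == "<":
        close = region_text.find(">", pos)
        if close == -1:
            return ""  # assumption: an unterminated tag leaves nothing to show
        pos = close + 1
    # Stop at the next "<" or after max_chars characters, whichever is first.
    end = region_text.find("<", pos)
    if end == -1:
        end = len(region_text)
    return region_text[pos:min(end, pos + max_chars)]


print(occur_head_label("<LF><B>Shakespeare, William</B></LF>"))
print(occur_head_label("<LF>Bacon</LF>"))
```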
The SortOrder setting can be changed at any time during a XPAT session and remains in effect until it is changed again or until the end of the session. The current value of the SortOrder setting is displayed by the command {Settings}.
>> {SortOrder Alpha}
>> pr %
The first command sets the SortOrder setting so that the displayed set, following the pr command, is in alphabetical order.
>> {SortOrder Occur}
>> sample "Moriarity"
>> pr %
>> {SortOrder AsIs}
>> pr %
The sample set created in this example has an alphabetical ordering. With the SortOrder setting of Occur, the first pr displays the set in occurrence order. However, after the SortOrder is reset to AsIs, the set that is printed after the next pr is displayed in alphabetic order. Note that the sample set is not affected by the SortOrder setting at the time of its creation and that the reordering for printing is temporary.
>> {SortOrder OccurHead LF E}
>> pr "shaks"
One effect of setting the SortOrder to OccurHead is that the set is displayed in occurrence order by the pr command. The beginning of each line, following the pr command, contains the starting characters of the first region, named LF, that occurs within the region named E containing a member of the point set matching the string shaks.
>> Ondaatje
>> pr subset %
>> {SortOrder Occur}
>> pr subset %
>> {SortOrder AsIs}
>> pr %
The subset displayed by the first pr command is shown in alphabetical order. With the SortOrder set to Occur, the next pr command displays the subset in occurrence order. With the SortOrder set to AsIs, the final pr displays the set of matches to the string Ondaatje in occurrence order since this point set was reordered permanently as a result of the subset command executed when the SortOrder was Occur.
first, next, ~nextemp, pr, save, Settings, subset
stop
stop
terminates a XPAT session. The use of this command causes the session to end and the XPAT process to exit. A message may be generated telling how much computer time has been used during the XPAT session.
done, quit
string search
A command consisting only of a string causes XPAT to search for occurrences of the string in the text database. A set is created whose members are matches to all index points in the text that begin with the given string. A match occurs when the given string (after having the character mappings applied to it and stopwords removed) is the same as the text that begins at an index point (also having had the character mappings applied and stopwords removed). Searching for phrases with a XPAT index is as fast as searching for a word or a prefix of a word. After a search, the number of matches to the pattern is displayed, but the results of the search are not shown unless requested by a pr command.
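The matching rule can be illustrated with a small Python model. This is a linear-scan sketch for explanation only and says nothing about how a XPAT index works internally. It assumes that index points are word starts, that the character mappings simply fold upper case to lower case, and that stopwords are not index points; the names normalize and string_search are hypothetical.

```python
import re
from typing import FrozenSet, List


def normalize(s: str, stopwords: FrozenSet[str]) -> str:
    """Apply the assumed character mapping and drop stopwords."""
    mapped = s.lower()  # stand-in for the index's character mappings
    kept = [word for word in mapped.split(" ") if word not in stopwords]
    return " ".join(kept)  # a trailing blank in the input is preserved


def string_search(text: str, query: str,
                  stopwords: FrozenSet[str] = frozenset()) -> List[int]:
    """Offsets of word starts whose normalized text begins with the query."""
    pattern = normalize(query, stopwords)
    matches = []
    for word in re.finditer(r"\S+", text):
        if word.group().lower() in stopwords:
            continue  # assumption: stopwords are not index points
        if normalize(text[word.start():], stopwords).startswith(pattern):
            matches.append(word.start())
    return matches


text = "In the inn, into the night, within reach"
print(string_search(text, "in"))    # prefix search
print(string_search(text, "in "))   # whole word only, because of the blank
```

Searching for in matches every word that starts with those two letters, while adding the blank restricts the matches to the word itself.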
>> in
If the index currently being used is based on words, the matches returned from this input string are the matches to all the phrases in the text that begin with the two characters in. That is, there will be matches to strings beginning with the word in as well as to strings beginning with inside, into etc. In order to match only strings beginning with the word in a blank must be added to the search string and the string enclosed within quotation marks.
If each character is indexed, the matches returned would also include strings that appear as part of words such as within and getting.
If the index has been made with character mappings that map upper case to lower case, the matches would also include matches to strings that include In.
>> "to be or not to be that is the question"
If the index used when searching for the above string was created with the stopwords to, be, or, not, that, is and the, this string search is equivalent to a search on the string question.
data dictionary documentation, double quote, index point, offsets, quiet mode, range, shift
subset
subset set1
finds a number of contiguous members of a set.
Subset creates a set of a specified size containing members starting at a designated location in set1. The members of the new set are in the order they appear in set1. Set1 may be a region set or a point set. The new set is of the same type as set1.
The operation of the subset command is affected by the size of the set requested and the SortOrder setting.
The ordering of a set, and hence which members are chosen to be in the set created by the subset command, is controlled by the SortOrder setting. If the SortOrder setting is Alpha, the set is ordered alphabetically; if it is Occur or OccurHead, the set is ordered as the members occur in the text; and if the SortOrder setting is AsIs, the set order is the current one and may thus be either alphabetic or occurrence order. The location within set1 to start selecting members for the new set is indicated by a numeric location in the ordered set. This numeric location is given to the subset command as a modifier attached to the command. The modifier is in the form of a period followed by an integer that can be either positive or negative. A positive integer gives the desired location relative to the beginning of the set and a negative integer gives it relative to the end of the set. Without any modifier the subset is taken from the start of set1.
The size of the set created is determined by the value of the setting SampleSize which has a default value of 10. If the size of set1 is less than SampleSize, then the new set created is the same size as set1. Changing the SampleSize setting affects all subsequent uses of subset during the current session. The size of the subset can be specified for an individual use of the command by using a second modifier attached to the already modified subset command. This modifier is also of the form of a period followed by a numeric value giving the desired set size.
The subset command can be used by itself or with the pr, save or export commands. It may be combined only with these commands.
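The way the two modifiers select members can be modelled with ordinary list slicing. The Python sketch below is illustrative only: the function name subset and its parameters are hypothetical, the input list is assumed to be already arranged according to the SortOrder setting, and the treatment of a negative location larger than the set is an assumption.

```python
from typing import List, Optional, TypeVar

T = TypeVar("T")


def subset(members: List[T], start: Optional[int] = None,
           size: Optional[int] = None, sample_size: int = 10) -> List[T]:
    """Hypothetical model of `subset.start.size set1`."""
    count = sample_size if size is None else size
    if start is None:
        begin = 0                               # no modifier: start of the set
    elif start > 0:
        begin = start - 1                       # 1-based location from the start
    elif start < 0:
        begin = max(len(members) + start, 0)    # location counted from the end
    else:
        raise ValueError("the location modifier must be non-zero")
    return members[begin:begin + count]


matches = list(range(1, 101))                   # stand-in for a set of 100 members
print(len(subset(matches, sample_size=40)))     # like {SampleSize 40} then subset %
print(len(subset(matches, -10, sample_size=40)))
print(subset(matches, 5, 30)[:3])
```

A negative location close to the end of the set yields fewer members than SampleSize, because the selection stops at the end of the set.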
>> {SampleSize 40}
>> subset %
The first command changes the SampleSize setting and the query in the second line returns a set that contains the first 40 members of the most recent result in the session.
>> subset.10 "Montreal "
This query creates a set of 40 members (assuming the SampleSize setting in the first example) starting at the tenth member in the set of matches to the string Montreal.
>> subset.-10 "Montreal "
This is similar to the previous query but the new set starts at the tenth member from the end of the set. Therefore, the resulting set size is only 10 even though the SampleSize setting is 40.
>> subset.5.30 %
This query creates a set of 30 members starting from the fifth member of the most recent result in the session.
>> {SortOrder Occur}
>> subset.-20.20 5
The query in the second line of the example creates a set containing the final 20 members in the set represented by set number 5. The SortOrder setting means that both set number 5 and the new set are in occurrence order.
first, next, ~nextemp, sample, SampleSize, SortOrder
~sync
~sync string
outputs a tagged identifier.
The command ~sync is available only when the XPAT session is operating in quiet mode. ~sync outputs a message tagged with Sync tags containing the given string. This command is mainly used when XPAT is integrated into a more complex system. The output from the ~sync command can then be used to identify a position in an input stream when information is being received from several different sources.
>> ~sync "festival"
The output from this command is the tagged string: <Sync>festival</Sync>.
~qnum
thesaurus
The thesaurus provides an efficient way to describe patterns that have some common quality. For example, if many searches of a database involve finding references in the text to money in different currencies, the thesaurus provides the capability to define a variable describing all the possible patterns to be used in these searches.
The thesaurus variable is defined in a file named in the data dictionary. Within the thesaurus file, each separate variable, called a word, is surrounded by <Entry> tags. Within the <Entry> tags are other tagged areas: the name of the variable is contained within <Word> tags, followed by the associated query contained within <Query> tags. The thesaurus capability is implemented using macros so the query may be a complex one creating more than one set. The same cautions, described for macros, apply to bracketing and syntax errors.
To invoke a thesaurus variable the name is preceded by the character <. For example, a thesaurus variable named money may be used within a XPAT session as follows:
<"money"
As with macros, thesaurus invocations are replaced by an exact copy of the definition. This means that a thesaurus variable can be used as an operand in other XPAT queries. Note, however, that it may be necessary to bracket the entire invocation in order to ensure correct results from the query. In practice, bracketing the definition itself is a good general method.
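The textual substitution described above can be sketched in Python. This is an illustrative model, not the XPAT implementation: the function names load_thesaurus and expand are hypothetical, only the quoted form of invocation (<"name") is recognised, and the sample entry for money is invented for the example.

```python
import re
from typing import Dict

ENTRY = re.compile(
    r"<Entry>\s*<Word>(.*?)</Word>\s*<Query>(.*?)</Query>\s*</Entry>", re.S
)
INVOCATION = re.compile(r'<"([^"]+)"')


def load_thesaurus(source: str) -> Dict[str, str]:
    """Collect the <Word>/<Query> pairs found in thesaurus file contents."""
    return {word.strip(): query.strip() for word, query in ENTRY.findall(source)}


def expand(query: str, thesaurus: Dict[str, str]) -> str:
    """Replace each <"name" invocation with an exact copy of its definition."""
    def substitute(match: "re.Match[str]") -> str:
        name = match.group(1)
        if name not in thesaurus:
            raise LookupError(f"The macro {name} is undefined")
        return thesaurus[name]
    return INVOCATION.sub(substitute, query)


# Hypothetical entry, invented for this illustration.
source = '<Entry> <Word>money</Word> <Query>("dollar" + "franc")</Query> </Entry>'
thesaurus = load_thesaurus(source)
print(expand('(<"money") within region Title', thesaurus))
```

Because the definition is copied verbatim, an unbracketed definition would splice its operators directly into the surrounding query, which is why bracketing the definition itself is a convenient habit.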
If an undefined thesaurus variable is used, for example <testing, the following error message is generated:
The macro testing is undefined
>> (<"policy") near (<"economy")
Assume that the following tagged data appears in the thesaurus file referenced in the data dictionary for the XPAT session.
<Entry>
<Word>economy</Word>
<Query>("economic " + "fiscal " + "monetary " + "economy")</Query>
</Entry>
<Entry>
<Word>policy</Word>
<Query>("policy " + "policies ")</Query>
</Entry>
The query shown finds any matches to either of the strings described by the thesaurus variable policy that occur near any of the strings that are part of the union described by the thesaurus variable economy.
>> *speaker including <"macbeth"
This assumes that the following tagged data appears in the thesaurus file.
<Entry>
<Word>macbeth</Word>
<Query>"macbeth" - (shift.5 "lady macbeth")</Query>
</Entry>
The query shown finds those members of the region set defined by the name speaker that contain Macbeth but not Lady Macbeth.
data dictionary documentation, macro
union
set1 + set2
combines two sets.
The union operator (+) creates a new set containing the members of both set1 and set2, with duplicates removed. Set1 and set2 can be either point sets or region sets. If either of set1 or set2 is a point set the new set is also a point set.
If both set1 and set2 are region sets and there is no overlap or nesting of any member from set1 and any member from set2, the union set is a region set. If overlap or nesting occurs, set1 and set2 are treated as point sets by using the first of each pair of pointers describing the regions in the sets. The new set created is the union of these point sets. The following message is generated when this occurs:
Warning: Addition of Region objects produced a region with overlaps -- simplified into a point set
Note that if both set1 and set2 are region sets and a member of set1 coincides exactly with a member of set2, this is not considered to be an instance of overlap or nesting; rather, these members are considered identical and only one will be a member of the output set.
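The rules for the type of the result can be summarised in a short Python model. It is a sketch under stated assumptions: points are integer offsets, regions are (start, end) pairs with inclusive end offsets, and the function names are hypothetical. The warning text is the one quoted above.

```python
from typing import List, Tuple, Union

Point = int
Region = Tuple[int, int]  # (start, end), both offsets inclusive (assumption)
Member = Union[Point, Region]


def is_region_set(members: List[Member]) -> bool:
    return bool(members) and all(isinstance(m, tuple) for m in members)


def start_of(member: Member) -> Point:
    return member[0] if isinstance(member, tuple) else member


def union(set1: List[Member], set2: List[Member]) -> Tuple[str, List[Member]]:
    """Hypothetical model of `set1 + set2`; returns (set type, members)."""
    if is_region_set(set1) and is_region_set(set2):
        # Identical regions are duplicates, not an overlap.
        clash = any(
            a != b and a[0] <= b[1] and b[0] <= a[1]
            for a in set1 for b in set2
        )
        if not clash:
            return "region", sorted(set(set1) | set(set2))
        print("Warning: Addition of Region objects produced a region "
              "with overlaps -- simplified into a point set")
    # Mixed operands or clashing regions: fall back to start points.
    points = {start_of(m) for m in set1} | {start_of(m) for m in set2}
    return "point", sorted(points)


print(union([(0, 4)], [(10, 14)]))
print(union([(0, 9)], [(5, 14)]))
```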
>> USA + "U.S.A" + "United States" + "America "
This query creates a new point set containing all matches to each of the individual sets in the query.
>> region Title + region Summary
Assuming that the members of region Title and of region Summary do not overlap or nest, this query creates a new region set containing all the members of both regions.
>> *your_region + region First
Assuming that your_region was created during the current XPAT session and contains a member that overlaps one or more members of region First, this query creates a point set. The members in the new set consist of the first of the two pointers that describe the members of your_region and of region First. XPAT prints a warning message before printing the number of matches in the new set.
difference, intersection
within
set1 within set2
finds members of a set within a given region.
Within creates a set containing those members of set1 that are located in one of the regions of the text described by set2. Set1 may be either a point set or a region set. Set2 must be a region set. The new set is of the same type as set1.
Set2 may be a predefined region set, a region set that has been created within the XPAT session using the region command, a region set resulting from the use of the import command, or the result of a previous query in the session.
If set1 is a point set, each member is examined to see if it falls within a region from set2 in order to determine inclusion in the new set. If set1 is a region set, the first of the pair of pointers (offsets into the text) describing each member is examined to see if it falls within a region of set2. The second pointer of the pair does not have to fall within a region of set2 for the region to be included in the new set. That is to say, if set1 and set2 are both region sets and they overlap, members of set1 are included in the result of within if they begin within a member of set2.
The command not within creates a set containing those members of set1 that are not in any of the regions described by set2.
set1 not within set2
is the same as
set1 - (set1 within set2)
Including and within are similar in that they both restrict searches to specified regions in the text. They differ in the set that is created. The including command creates a set of regions that contain one or more members of another set, while within creates a set of pointers or regions that are contained in members of a region set.
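The difference between the two commands, and the identity given for not within, can be illustrated with a Python model. It is a sketch only: points are integer offsets, regions are (start, end) pairs with inclusive ends, and the function names are hypothetical. For including, containment of a region member is judged by its start offset alone, which is an assumption made to keep the example short.

```python
from typing import List, Tuple, Union

Point = int
Region = Tuple[int, int]  # (start, end), both offsets inclusive (assumption)
Member = Union[Point, Region]


def start_of(member: Member) -> Point:
    return member[0] if isinstance(member, tuple) else member


def within(set1: List[Member], regions: List[Region]) -> List[Member]:
    """Members of set1 that begin inside some region; type of set1 is kept."""
    return [m for m in set1
            if any(s <= start_of(m) <= e for s, e in regions)]


def not_within(set1: List[Member], regions: List[Region]) -> List[Member]:
    """Model of set1 - (set1 within set2)."""
    inside = within(set1, regions)
    return [m for m in set1 if m not in inside]


def including(regions: List[Region], set1: List[Member]) -> List[Region]:
    """Regions that contain at least one member of set1."""
    return [r for r in regions if within(set1, [r])]


speakers = [(0, 9), (40, 49)]        # stand-in for region Speaker
cohen = [5, 20, 45]                  # stand-in for matches to "Cohen"
print(within(cohen, speakers))       # points are returned
print(including(speakers, cohen))    # regions are returned
print(not_within(cohen, speakers))
```

The same operands give a point set with within and a region set with including, which is the distinction drawn in the paragraph above.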
>> "Cohen" within region Speaker
In this example, the predefined region Speaker defines regions of the text that contain speakers' names. This query creates a set of matches to Cohen that fall within the regions described by region Speaker.
>> "Fontaine" not within region Speaker
This query finds all references to Fontaine that are not located within one of the regions describing a speaker.
>> first = region "<Etym>" .. "</Language>"
>> ("Spanish" within region Language) within *first
The first query defines regions of the text that start at the string <Etym> and end with the string </Language>. The second query finds all the matches to Spanish that are within a Language region and also within one of the newly defined regions.
import, including, not, region