Last updated	2002-03-26 15:10:20 EST
Doc Title	The XPAT Command Manual
Author 1	Wilkin, John Price
CVS Revision	$Revision: 1.8 $

XPAT Command Manual

The following provides a summary of XPAT commands, settings, and concepts, and is based extensively on Open Text's PAT 5.0 documentation. Many of the commands included here are not implemented in DLXS middleware.

List of Commands (TOC)

{CommandFile}
comment
{DefaultRegion}
difference
done
double quote
exec
export
{ExportFile}
fby
first
~free
~freeall
history
{History}
{HistoryFile}
import
including
index point
intersect
{Label}
last set
{LeftContext}
macro
naming sets
near
next
~nextemp
not
offsets
pr
{PrintLength}
{Proximity}
~qnum
quiet mode
{QuietOff}
{QuietOn}
quit
range
rankedby
region
sample
{SampleSize}
save
save.commands
{SaveFile}
save.history
set name
set number
sets
{Settings}
shift
signif
{SortOrder}
stop
string search
subset
~sync
thesaurus
union
within

Command and Settings Documentation

`{CommandFile}`

{CommandFile string}

changes the file name used by save.commands and exec .

The CommandFile setting determines which file the save.commands command writes to and the exec command reads from. It has a default value of xpat.cmd . If the string begins with a numeral or contains blanks or non-alphanumeric characters, it must be enclosed within double quote marks. The file name must also conform to the file naming conventions of the host operating system. It can be changed at any time during a XPAT session and remains in effect until changed again or until the end of the session. The current value of CommandFile is displayed by the command {Settings} .

Examples:

>>  {CommandFile "/usr/new/output_file"}

This changes the setting to the value /usr/new/output_file which any subsequent save.commands command writes to and exec command reads from.

`comment`

#

marks the start of a comment.

The comment, that is the # and the rest of the line following the # , is ignored by XPAT. The comment can be placed on a line by itself or following a XPAT query. It is useful for annotating queries stored in a file to be processed in batch mode or to be read in by the exec command. The queries may be created externally or generated during a XPAT session and saved by save.commands for later use.

Examples:

>>  # find all the Shakespearean quotations
>>  region Quote incl (region Author incl "shaks")

The line beginning with the # is ignored by XPAT.

>>  first = region "<E>" .. "</L>"  # find first language

XPAT creates a new region set with this command. The rest of the line, beginning with the #, is ignored.

`{DefaultRegion}`

{DefaultRegion string}

determines which region set is the current default.

The DefaultRegion setting designates a special region set, known as the default region. The default region can be referred to as region without specifying the actual region name. The setting can be changed at any time during a XPAT session and remains in effect until it is changed again or until the end of the session.

If the string giving the setting value begins with a numeral or contains blanks or non-alphanumeric characters it must be enclosed within double quote marks.

Using the default region in a command, without having previously specified one, is illegal and results in the following message.

No information for region in the data dictionary

For convenience, a frequently used DefaultRegion setting can be defined within an init file whose location is given in the data dictionary file. The init file is read and executed by a XPAT session when it is started (see the data dictionary documentation for details).

Examples:

>>  region including "constitution"

The region set referred to in the example by region is the one designated by the DefaultRegion setting.

>>  {DefaultRegion HeadLine} 
>>  region including constitution

The first line changes the DefaultRegion setting to the value HeadLine . The command that follows uses the region set HeadLine even though it is not specified.

`difference`

set1 - set2

removes members from a set.

The difference operator (-) creates a new set containing the members of set1 that are not members of set2. Set1 and set2 can be either point sets or region sets. The new set is of the same type as set1.

If either set1 or set2 is a region set, the first pointer delineating each region is used to determine if a member of set1 also occurs in set2. Thus, for set arithmetic (difference, union and intersection) in XPAT, set members of a region set are considered to be equal if they start at the same location in the text. The end point of a region is ignored in such operations.

Examples:

>>  "to" - "to " - "to<"

Note that these operators are parsed left to right and can be combined without bracketing. This query creates a point set that contains all the matches to the prefix to, excluding those to the string to followed by a blank or a left angle bracket. Assuming an index in which all punctuation has been mapped to blanks, the result contains words starting with to but not the word to itself.

>>  ("q" - "qu") within region HeadWord

This query creates a point set. The point set includes all words located in a HeadWord region that begin with q but not with qu.

>>  region Story incl "music " - region Story incl "art "

This query creates a region set. The region set consists of all Story regions that include the string music but not the string art.

>>  region Q - "<Q><D>"

Assume that the regions described by region Q all begin with the string <Q> . The above query creates a region set of the members of region Q that do not have the string <D> immediately following the <Q> .
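The comparison rule described above (region members are compared by their first pointer only) can be sketched in Python. This is an illustrative model, not XPAT code: representing a point as an integer offset and a region as a `(start, end)` tuple, and the helper names `start_of` and `difference`, are assumptions made for this sketch.

```python
# Hypothetical model: a point is an int offset, a region is a (start, end) tuple.

def start_of(member):
    """Offset used for comparison: the point itself, or a region's first pointer."""
    return member[0] if isinstance(member, tuple) else member


def difference(set1, set2):
    """Members of set1 whose start offset is not a start offset in set2.

    The result keeps the type of set1, and region end points are ignored.
    """
    starts2 = {start_of(m) for m in set2}
    return [m for m in set1 if start_of(m) not in starts2]


# Modelled on: region Q - "<Q><D>"
regions_q = [(0, 40), (100, 180), (300, 350)]  # regions that begin with <Q>
points_qd = [100]                              # matches to the string <Q><D>
print(difference(regions_q, points_qd))        # [(0, 40), (300, 350)]
```

Because only start offsets are compared, the region starting at offset 100 is removed even though set2 is a point set.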

`done`

done

terminates a XPAT session.

The done command ends the session and causes the XPAT process to exit. A message may be generated telling how much computer time has been used during the session.

`double quote`

"string"

allows the use of strings that include special characters.

Normally, XPAT interprets a sequence of characters as a string and searches the database for matches to it. However, there are certain types of strings that XPAT cannot recognize as search targets unless they are enclosed within double quote marks. The special strings are: strings which begin with a numeral, for example 2nd ; strings which contain blanks or non-alphanumeric characters, for example end of the year or <Author>Scott ; and strings which are XPAT commands, for example near and within . In each case, a string that is not enclosed in double quote marks but should be will result in a syntax error or unexpected result.

Note that if numbers are not enclosed in double quote marks, they are interpreted as a reference to the number of a set previously calculated in the XPAT session.

A pair of quotes representing an empty string ("") stands for the set of all index points in the text being searched.

Examples:

>>  "done "
>>  done

The first command creates a point set containing matches to the word done . The second command ends the XPAT session.

>>  19 within region Date 
>>  "19" within region Date

The first query finds those members of the previously calculated set, identified by the number 19, that are within region Date . The second query finds the matches to the string 19 within region Date .

>>  ""

This command produces a list of every point indexed in the text.

>>  "_XPat_1" = "match this string " 
>>  "_XPat_OP1" = region "Region Set 5" 
>>  "_XPat_2" = *"_XPat_1" within *"_XPat_OP1"

The above sequence of commands might be produced by a program that accepts input from a user and generates commands that are sent to XPAT. Since the names contain non-alphanumeric data they must be bounded by quotation marks.
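A program that generates XPAT commands, as in the last example, needs to decide when a string must be quoted. The following Python sketch applies the three rules stated above. The function name `needs_quotes` and the `RESERVED` word list are assumptions for illustration; the list is a small sample of command words from this manual, not a complete list of XPAT keywords.

```python
import re

# Assumption: a partial sample of command words mentioned in this manual.
RESERVED = {"done", "near", "within", "fby", "including", "incl", "not", "region"}


def needs_quotes(s):
    """Return True if s must be enclosed in double quote marks.

    Rules modelled: the string begins with a numeral, contains blanks or
    non-alphanumeric characters, or is itself a command word.
    """
    if not re.fullmatch(r"[A-Za-z][A-Za-z0-9]*", s):
        return True  # empty, starts with a numeral, or has other characters
    return s in RESERVED


def quote_if_needed(s):
    """Wrap s in double quote marks only when the rules above require it."""
    return f'"{s}"' if needs_quotes(s) else s


for text in ["2nd", "end of the year", "<Author>Scott", "near", "constitution"]:
    print(quote_if_needed(text))
```

Quoting every string unconditionally is also valid according to the examples above, so a generator may prefer that simpler policy.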

`exec`

exec

reads a file into a XPAT session and executes the commands contained in the file.

The name of the file read by the exec command is determined by the value of the CommandFile setting. By default, the value is xpat.cmd but can be changed at any time during the XPAT session.

The exec command can be used to enter queries to a XPAT session. The queries, for example macro definitions, may be recorded in a file using an editor or saved in a file from a previous XPAT session using save.commands .

Examples:

>>  {CommandFile "/usr/xpat/srch023.q"} 
>>  exec

The first command sets the name of the file to be read by any exec command to /usr/xpat/srch023.q . The second command reads the file /usr/xpat/srch023.q and executes the commands contained in the file.

Settings:

CommandFile

`export`

export set1

saves information about sets created in a XPAT session.

Export writes a detailed description of the members of set1, created during a XPAT session, to a file. The description includes the type (region or point) of the set and sufficient information to recreate a copy of the set. The name of the file is determined by the value of the ExportFile setting. By default the file name is xpat.exp, but it can be changed during a session by changing the ExportFile setting. When export writes to the named file, it overwrites anything that currently exists in the file. Assuming a default ExportFile setting of xpat.exp, the following message is given:

Exporting to xpat.exp.

The file may subsequently be read into a XPAT session by the import command.

If the saved set is a frequently used region set, it can be made available as a predefined region in future XPAT sessions by editing the data dictionary file and adding the appropriate information. If the new region set, containing 150 regions, is named newregion and saved in the file newregion_file , the following lines, added to the data dictionary, would make it available to XPAT.

<Region>
 <Name>newregion</Name>
 <Desc>This new region set describes ....</Desc>
 <File>
  <SysName>newregion_file</SysName>
  <Offset>0</Offset>
 </File>
 <Count>300</Count>
 <Type>pairs</Type>
</Region>

Examples:

>>  "tax" near "increase" 
>>  export %

The first query creates a point set of the matches to the string tax when it is within the current Proximity of the string increase . The second command writes this point set to the file xpat.exp . The information written to the file contains header information followed by details about each element in the set.

>>  {ExportFile "v.exp"} 
>>  verse = region "<V>" .. "</V>" 
>>  export *verse

The first line of the example changes the ExportFile setting to v.exp . The second line creates a region set and names it verse . The third command writes header information and a description of each member of the region verse to the file v.exp .

Settings:

ExportFile

`{ExportFile}`

{ExportFile string}

changes the file name used by export and import .

The ExportFile setting determines the file written by the export command and read by the import command. It has a default value of xpat.exp . If the string begins with a numeral or contains blanks or non-alphanumeric characters, it must be enclosed within double quote marks. The file name must also conform to the file naming conventions of the host operating system. It can be changed at any time during a XPAT session and remains in effect until it is changed again or until the end of the session. The current value of the ExportFile setting is displayed by the command {Settings} .

Examples:

>>  {ExportFile "/usr/new/export_file"}

This changes the value of the setting so that any subsequent export or import command utilizes the file /usr/new/export_file .

`fby`

set1 fby set2

finds members of sets that occur close to each other in a specified order.

Fby (followed by) creates a set containing those members of set1 that have one or more members of set2 within a specified number of characters to their right. Set1 and set2 may be either point sets or region sets. The new set is of the same type as set1.

The distance between members of the two sets is calculated by counting the number of characters in the text from the first character of a member of set1 to the first character of a member of set2. The measure used to determine closeness is the value of the Proximity setting which has a default value of 80 characters. This can be changed for all subsequent uses of fby by changing the Proximity setting, or it can be changed for an individual use of fby by using a modifier attached to the command. The form of the modifier is a period followed by a number representing the maximum distance (in characters).

If either set1 or set2 is a region set, the first of the two pointers delineating the region is used to determine the distance between the set members.

Multiple fby commands are not parsed left to right. A command of the form

       set1 fby set2 fby set3

is handled as if parenthesized as follows:

       set1 fby (set2 fby set3)

The command not fby creates a set containing the members of set1 that are not within the specified distance to the left of any member of set2.

       set1 not fby (set2 fby set3)

is the same as

       set1 - (set1 fby (set2 fby set3))

Examples:

>>  "law " fby "order "

Assuming a Proximity of 80, this query creates a point set containing the matches to law with one or more matches to order within 80 characters to their right, counting from the l in law to the o in order .

>>  region Title fby.30 region Author

This query creates a region set containing the members of the set region Title that have one or more members in the set region Author within 30 characters to the right. The distance is measured as the number of characters from the first character of a Title region to the first character of an Author region.

>>  "law " not fby "order "

This query creates a point set containing the matches to law that do not have a match to order within 80 characters to the right, calculating the distance as in the first example.

>>  "law " not fby.30 "order "

This query creates a point set containing the matches to law that do not have a match to order within 30 characters to the right.
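The distance test, the `not fby` identity, and the right-to-left grouping described above can be sketched in Python. This is a model under stated assumptions, not XPAT's implementation: points are integer offsets, regions are `(start, end)` tuples, and a member of set2 counts as "to the right" when its start offset is strictly greater than, and at most `proximity` characters beyond, the start of the set1 member. The exact boundary handling is an assumption.

```python
def start_of(member):
    """Offset used for distance: the point itself, or a region's first pointer."""
    return member[0] if isinstance(member, tuple) else member


def fby(set1, set2, proximity=80):
    """Members of set1 with a member of set2 starting within `proximity`
    characters to their right (boundary handling is an assumption)."""
    starts2 = [start_of(m) for m in set2]
    return [m for m in set1
            if any(0 < s - start_of(m) <= proximity for s in starts2)]


def not_fby(set1, set2, proximity=80):
    """set1 not fby set2, computed as set1 - (set1 fby set2)."""
    kept = fby(set1, set2, proximity)
    return [m for m in set1 if m not in kept]


law = [10, 200, 500]    # hypothetical offsets of matches to "law "
order = [30, 260, 900]  # hypothetical offsets of matches to "order "

print(fby(law, order))          # [10, 200]
print(fby(law, order, 30))      # [10]
print(not_fby(law, order))      # [500]
print(not_fby(law, order, 30))  # [200, 500]

# set1 fby set2 fby set3 is grouped as set1 fby (set2 fby set3):
print(fby([0], fby([50], [100])))  # [0]
```

The `proximity` argument plays the role of either the Proximity setting or a per-command modifier such as `fby.30`.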

Settings:

Proximity

`first`

first set1

finds a specific number of contiguous members from the start of a set.

First creates a set of a specified size which is comprised of members from the beginning of set1. The members of the new set are in the order they appear in set1. Set1 may be either a region set or a point set. The new set is of the same type as set1.

The operation of the first command involves the set member counter that keeps track of the selected members, the identification of the size of the requested set, and the SortOrder setting that determines which members are in the new set.

First selects members from the beginning of a set. The ordering of a set, and hence which members occur at the beginning, is controlled by the SortOrder setting. If the SortOrder setting is Alpha , the set is ordered alphabetically. If the SortOrder setting is Occur or OccurHead , the set is ordered according to occurrence in the text. If the SortOrder setting is AsIs , the set order is the current one which may be either alphabetic or occurrence order.

Each set that is used with a first , next or ~nextemp command has a cursor (set member counter) associated with it. The cursor indicates the location in set1 at which to start selecting members for the set being created. Each first command resets the cursor so members for the new set are chosen from the beginning of set1. On completion of the first command the cursor is updated to point at the beginning of the next set. Note, when the SortOrder setting changes and the set ordering is changed, the cursor is reset to the beginning of set1.

The size of the set created is determined by the value of SampleSize which has a default value of 10. If the size of set1 is less than SampleSize then the new set created is the same size as set1. Changing the SampleSize setting affects all subsequent uses of first during the current session. For an individual use of the command, the size of the new set can be specified by using a modifier attached to the first command. This modifier is in the form of a period followed by a numeric value giving the desired set size.

The first command can be used by itself or with the pr , save or export commands.

Examples:

>>  {SampleSize 40} 
>>  first 5

The first line changes the SampleSize setting to 40 and the second line creates a set that contains the first 40 members of set number 5 created earlier in the XPAT session.

>>  first .10 "the best of "

This line creates a set containing the first 10 members in the set of matches to the phrase the best of .

>>  first .0 3

This query resets the cursor to the first member of set number 3.
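The interaction between the cursor and the requested set size can be modelled with a small Python class. This is a sketch under assumptions: the class `SetCursor` and its method names are hypothetical, sort order is ignored, and `take_next` reflects only what this section implies about commands that continue from the cursor.

```python
class SetCursor:
    """Hypothetical model of a set together with its member counter (cursor)."""

    def __init__(self, members):
        self.members = list(members)
        self.cursor = 0

    def _take(self, size):
        # Select up to `size` members from the cursor, then advance the cursor.
        chunk = self.members[self.cursor:self.cursor + size]
        self.cursor += len(chunk)
        return chunk

    def first(self, size=10):
        """Reset the cursor, then select from the beginning of the set."""
        self.cursor = 0
        return self._take(size)

    def take_next(self, size=10):
        """Assumed behaviour of a follow-on command: continue from the cursor."""
        return self._take(size)


set5 = SetCursor(range(1, 26))  # a set with 25 members
print(set5.first(10))           # members 1 to 10
print(set5.take_next(10))       # members 11 to 20
print(set5.first(0))            # [] and the cursor is back at the start
print(set5.cursor)              # 0
```

The default `size=10` mirrors the default SampleSize; a set smaller than the requested size is simply returned whole.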

Settings:

SampleSize , SortOrder

`~free`

~free number

releases a XPAT set.

Following the ~free command, the set number is no longer available for reference in a XPAT command. The set is no longer displayed by the history command.

If the sets freed are at the end of the current history list, the set numbers will be reused for the next sets created in the XPAT session. For example, if the history list contains set numbers 1 to 8, and 6 through 8 are freed using the ~free command, the next set number assigned is 6. However, if set number 2 is freed and the history list includes set numbers 1 to 8, the next set is number 9.

Examples:

>>  ~free 4

This removes set number 4 from the history list. The set can no longer be accessed by number reference.
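The numbering behaviour in the two cases described above is consistent with a simple rule: the next set number is one more than the highest number still in the history list. The Python sketch below encodes that inferred rule; the function name `next_set_number` is hypothetical, and the rule is an inference from the examples in this section rather than a documented algorithm.

```python
def next_set_number(live_numbers):
    """One more than the highest set number still in the history list.

    Returns 1 when no sets remain. Inferred from the examples in this section.
    """
    return max(live_numbers, default=0) + 1


history = set(range(1, 9))                # set numbers 1 to 8

print(next_set_number(history - {6, 7, 8}))  # 6: numbers at the end are reused
print(next_set_number(history - {2}))        # 9: a gap in the middle is not reused
print(next_set_number(set()))                # 1: nothing left in the list
```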

`~freeall`

~freeall

releases all XPAT sets.

Following the ~freeall command, all the sets that existed in the current session are no longer available for reference in a XPAT command. In addition, those sets are no longer displayed by the history command.

Following the ~freeall , the next set number assigned is 1.

Examples:

>> ~freeall

This removes all current sets from the history list. Following the command, no previously created sets can be referenced, and the next set that is produced is assigned the number 1.

`history`

history

displays the record of the current XPAT session.

Information about each set created during the XPAT session is recorded in a history list. For each of the sets, history displays a set number, the number of members in the set and the query that produced the set. Sets created during the current session can be accessed by referring to the number of the set in the history list. The results of pr , save , Settings , { } , and certain tilde (~ ) commands do not appear in this list since no sets are produced by these commands.

As the entire history list may become quite long, it is useful to be able to view only a part of the list. The History setting determines what portion of the history list is displayed by the history command. The History setting has a default value of 0. This indicates that the entire history list is to be displayed. When set to an integer n (any integer greater than zero) the final n elements in the history list are displayed by any subsequent use of the history command during the session.

The items listed can also be changed for an individual use of the history command. Modifiers may be attached to the history command to request that a certain number of items and that a particular portion of the list be displayed.

The first modifier, in the form of a period followed by a number, indicates where in the history list to begin the display. A positive integer p requests that the display start at the pth item from the start of the history list. A negative integer p requests that the display start at the pth item from the end of the history list. The number of items displayed is the value of the History setting.

The number of items displayed can also be changed for an individual use of the history command by using a second modifier attached to an already modified history command. This second modifier is also in the form of a period followed by a number giving the number of items to be displayed.

The default maximum size of the history list is 300 items. If more than 300 sets are created, only the last 300 sets created during the XPAT session are retained in the list. This maximum size can be altered by a command line parameter when starting a XPAT session.

Note that a set can be removed from the history list by the ~free command.

Examples:

>>  "univ" 
>>  pr sample % 
>>  "waterloo" 
>>  1 near 2 
>>  pr 
>>  history

Assuming the above are the only commands executed in the XPAT session to this point, the result of the history command would be as follows:

  1:   11680,  "univ"
  2:     209,  "waterloo"
  3:       4,  1 near 2

>>  {History 5} 
>>  history

The first command, in this example, sets the value of the History setting to 5. The second command, and subsequent uses of the history command in the session, will show information about the final five sets in the history list.

>>  history.3

This use of the history command gives information about the sets in the history list, starting at the third element. Using the XPAT session described in the first example above, the result would be:

  3:       4,  1 near 2

>>  history.-2

This use of the command gives information starting at the second element from the end of the history list. Again, using the first example, the result would be:

  2:     209,  "waterloo"
  3:       4,  1 near 2

>>  history.4.10

This use of history gives information from the history list starting at the fourth entry on the list and continuing for ten entries.

>>  history.-4.2

This use of history gives information about two entries in the history list, starting at the fourth entry from the end.
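The way the History setting and the two modifiers select a slice of the history list can be sketched in Python. This is a model of the rules described in this section, not XPAT code; the function `history_view` and its parameter names are hypothetical, and rejecting a start modifier of zero is an assumption, since the manual does not define that case.

```python
def history_view(entries, start=None, count=None, history_setting=0):
    """Select the history entries to display.

    start: first modifier (positive counts from the start of the list,
           negative counts from the end); None for a plain `history`.
    count: second modifier; falls back to the History setting.
    A limit of 0 means "no limit", as with the default History setting.
    """
    limit = history_setting if count is None else count
    if start is None:
        # Plain history: the final `limit` entries, or the whole list.
        return entries[-limit:] if limit > 0 else list(entries)
    if start == 0:
        raise ValueError("start modifier must be non-zero (assumption)")
    begin = start - 1 if start > 0 else max(len(entries) + start, 0)
    rest = entries[begin:]
    return rest[:limit] if limit > 0 else rest


entries = ['1: 11680, "univ"', '2: 209, "waterloo"', '3: 4, 1 near 2']

print(history_view(entries))             # all three entries
print(history_view(entries, start=3))    # ['3: 4, 1 near 2']
print(history_view(entries, start=-2))   # the entries numbered 2 and 3
print(history_view(entries, start=1, count=2))  # the entries numbered 1 and 2
```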

Settings:

History

`{History}`

{History number}

changes the number of items from the history list displayed by the history command.

The History setting determines the number of items displayed by the history command. Note that the setting may be overridden and the number of items displayed determined by a modifier for an individual use of the history command. The default value of the setting is 0 indicating that all sets created in this session are to be shown by the history command. The setting can be changed at any time during a XPAT session and stays in effect until changed again or until the end of the session. The current value of the History setting is displayed by the command {Settings} .

Examples:

>>  {History 30}

This changes the setting to the value 30 so that any subsequent use of the history command during the session displays 30 items.

`{HistoryFile}`

{HistoryFile string}

changes the file name used by save.history .

The HistoryFile setting determines the file written by the save.history command. It has a default value of xpat.his . If the string begins with a numeral or contains blanks or non-alphanumeric characters, it must be enclosed within double quote marks. The file name must also conform to the file naming conventions of the host operating system. It can be changed at any time during a XPAT session and remains in effect until it is changed again or until the end of the session. The current value of the HistoryFile setting is displayed by the command {Settings} .

Examples:

>>  {HistoryFile "/usr/new/history_file"}

This changes the HistoryFile setting so that any subsequent use of the save.history command during the session writes to the file /usr/new/history_file .

`import`

import

reads information that has been saved in a file by the export command.

Import reads data from a file and creates a new set which can be used as if it had been created during the current XPAT session. XPAT determines from the header information whether the saved set is a point set or a region set, and the new set is of the same type. The file read is determined by the ExportFile setting which has a default value of xpat.exp . The file name can be changed during a session by resetting the ExportFile setting.

Assuming the default setting of ExportFile , if the imported set is a region set, the following message is generated:

Importing regions from 'xpat.exp'

If it is a point set, the message generated is:

Importing point set from 'xpat.exp'.

Examples:

>>  import 
>>  % within region Quote

The first command reads from the file xpat.exp . The second line uses the imported set as an operand to a within command and finds the members of the set that occur within region Quote .

>>  {ExportFile "v.exp"} 
>>  verse = import 
>>  *verse including ("blind " fby "ditch ")

The first command resets the ExportFile setting to v.exp . The second reads a set from the file v.exp and names it verse . The third query finds the set of imported regions that include the string blind when it is followed by the string ditch (the assumption has been made that this set is a region set).

Settings:

ExportFile

`including`

set1 including set2

set1 incl set2

finds regions that contain members of a set.

Including or incl creates a set comprised of members of set1 that include one or more members of set2. Set1 must be a region set. Set2 may be either a point set or a region set. The new set is a region set.

Set1 may be a predefined region set, a region set created during the XPAT session using the region command, a region set resulting from the use of the import command, or the result of a previous query in the session.

If set2 is a point set, and if one or more of the points occur in a region from set1, then that set1 region is included in the new set.

If set2 is a region set, and the first of the pair of pointers (offsets into the text) describing a region of set2 is contained in a region of set1, that set1 region is included in the new set. The second pointer of the pair delineating set2 does not have to fall within the region of set1 in order that the set1 region be included in the new set.

The including command can also be used to find regions that contain more than one member of set2, by attaching a modifier specifying the minimum number of members of set2 to the including command. This modifier is in the form of a period followed by the value of the minimum number of members.

The command not including creates a set containing those members of set1 that do not contain any of the members in set2.

 set1 not including set2

is the same as

 set1 - (set1 including set2)

Including and within are similar in that they both restrict searches to specified regions in the text. They differ in the set that is created. The including command creates a set of regions that contain one or more members of another set, while within creates a set of pointers or regions that are contained in members of a region set.

Examples:

>>  region Story including ("Free trade" near "Canada")

This query finds the regions described by region Story that contain one or more matches to the string Free trade when it occurs close to the string Canada.

>>  region Story including.3 ("Free trade" near "Canada")

This query finds the regions described by region Story that contain at least three matches to the string Free trade when it occurs close to the string Canada.

>>  region Quote not including region Author

This query creates a set of Quote regions that do not contain the first pointer of the pair delineating an Author region.

>>  dates = "1800" .. "1825" 
>>  region Date including *dates

The first query creates a point set containing all the numbers that are alphabetically between 1800 and 1825 . The second creates the set of Date regions that contain one or more of these numbers.

>>  region Quotation including "Wright"

>>  % including "Waterloo"

The first query creates a region set of quotations that contain the string Wright . The second query finds the members in the new region set that also contain the string Waterloo .

>>  (*speech including "republican") including "democrat"

This query is similar to the previous one. It assumes that a region set named speech has been defined and it finds the members of this set that contain both the string republican and the string democrat .

>>  (*definition incl ("men" + "women")) incl "education"

>>  *definition including (("men" + "women") ^ "education")

The first query creates a set of definition regions that include the string education as well as either men or women. Note that the second query does not create the same set but actually creates a set of size 0. The intersection (("men" + "women") ^ "education") is empty because no member of the union set "men" + "women" is also a member of the set "education" (see the definition of the union operator), so there is nothing left for the definition regions to include.
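The difference between chaining two including commands and including an intersection can be seen in a small Python model. This is a sketch under assumptions: points are integer offsets, regions are `(start, end)` tuples with inclusive bounds, union is modelled as a merged list of offsets, and the helper names are hypothetical.

```python
def start_of(member):
    """Offset used for comparison: the point itself, or a region's first pointer."""
    return member[0] if isinstance(member, tuple) else member


def including(regions, set2, minimum=1):
    """Regions containing at least `minimum` members of set2, judged by the
    start offset of each set2 member (inclusive bounds are an assumption)."""
    starts2 = [start_of(m) for m in set2]
    return [(lo, hi) for (lo, hi) in regions
            if sum(lo <= s <= hi for s in starts2) >= minimum]


def intersect(set1, set2):
    """Members of set1 whose start offset also starts a member of set2."""
    starts2 = {start_of(m) for m in set2}
    return [m for m in set1 if start_of(m) in starts2]


definition = [(0, 100), (200, 300)]  # hypothetical definition regions
men, women, education = [12], [240], [70]
men_or_women = sorted(men + women)   # stands in for "men" + "women"

# (*definition incl ("men" + "women")) incl "education"
print(including(including(definition, men_or_women), education))  # [(0, 100)]

# *definition including (("men" + "women") ^ "education")
print(including(definition, intersect(men_or_women, education)))  # []

# including.3 requires at least three members of set2 inside a region
print(including([(0, 100)], [5, 20, 60], minimum=3))              # [(0, 100)]
```

The second query is empty because the matches to different words start at different offsets, so their intersection has no members.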

`index point`

XPAT views the entire text as one long string. In contrast to traditional text indices, which deal with words, XPAT indexes strings. The indexed strings extend from each index point to the end of the text.

The XPAT index is made up of the starting points of each string. The index points make up the possible match points for a string search. Parameters set when the index is built determine which strings are in the index. The parameters specify patterns in the text that define the beginnings of strings to be indexed. For example, one pattern could specify that every character in the text is to be indexed, while another pattern could specify that each printable character following a blank is to be indexed.

When the index is created, two additional settings can alter how XPAT sees the text. Character mappings cause XPAT to see certain characters as equivalent to other characters. For example, all upper case letters may be mapped to lower case letters so that XPAT does not distinguish between upper and lower case when searching for a string. Also, some words may be designated as stopwords. XPAT views the text as if these words are not there. XPAT ignores strings in the text that start at an index point and match the given stopword strings followed by a blank after the character mappings have been applied. The character mappings also affect the strings chosen to be index points. For example, if a > is mapped to a blank and if the index points are defined as blanks followed by printable characters, in the text ...<tag>wisdom... the w in the string wisdom is an index point. Text with character mappings applied and stopwords removed is referred to as converted text.

When searching for a given string, a match is found if the given string (after having the character mappings applied to it and the stopwords removed) is the same as the converted text that begins one of the indexed strings.
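The relationship between character mappings, index points, and matching can be sketched in Python. This is a simplified model, not how XPAT builds its index: it assumes one-to-one character mappings (upper case to lower case, and > to a blank), defines index points as printable characters that follow a blank, and omits stopwords entirely. All names are hypothetical.

```python
import string

# Assumed mappings: upper case to lower case, and '>' to a blank.
MAPPING = {c: c.lower() for c in string.ascii_uppercase}
MAPPING[">"] = " "


def convert(text, mapping=MAPPING):
    """Apply the character mappings (stopword removal is not modelled)."""
    return "".join(mapping.get(ch, ch) for ch in text)


def index_points(text, mapping=MAPPING):
    """Offsets of printable, non-blank characters that follow a blank
    in the converted text."""
    conv = convert(text, mapping)
    return [i for i in range(1, len(conv))
            if conv[i - 1] == " " and conv[i] != " " and conv[i].isprintable()]


def search(text, query, mapping=MAPPING):
    """Index points whose converted string begins with the converted query."""
    conv = convert(text, mapping)
    target = convert(query, mapping)
    return [i for i in index_points(text, mapping) if conv.startswith(target, i)]


text = "<tag>wisdom is Good"
print(index_points(text))      # [5, 12, 15]
print(search(text, "Wisdom"))  # [5]
print(search(text, "is"))      # [12]
```

Offset 5 is the w in wisdom: it becomes an index point only because the > before it is mapped to a blank. The query "is" matches at offset 12 but not inside wisdom, because offset 6 is not an index point.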

`intersect`

set1 ^ set2

finds members common to two sets.

The intersect operator (^) creates a new set consisting of the members in set1 that are also in set2. Set1 and set2 can be either point sets or region sets. The new set is of the same type as set1.

If either of set1 or set2 is a region set, only the first of the pointers describing the region is used in the comparison to determine if a member should be included in the new set. Two members of a region set are considered to be equal if they start at the same location in the text.

Examples:

>>   (region Verse incl "eye") ^ (region Verse incl "seed")

This query creates a region set. It includes Verse regions that contain both the string eye and the string seed.

>>   ("research" near "medical") ^ ("research" near "biolog")

This query creates a point set. It includes the matches to research that appear close to both the string medical and the string biolog .

`{Label}`

{Label string}

specifies an identifying string to be used as a label.

When XPAT is operating in quiet mode with labels requested, any set displayed by a pr or save command shows the label string preceding the numeric value of the text offset. This can be used to identify which database the information is from. In a XPAT session, if a value for Label has not been set by this command, the default value used is the name of the data dictionary. The label string must begin with an alphabetic character and contain no blanks or non-alphanumeric characters. The setting can be changed at any time during a XPAT session and remains in effect until it is changed again or until the end of the session.

Examples:

>>  {Label Database1} 
>>  {QuietOn Label} 
pr ("Ontario" near ("B.C." + "British Columbia"))

The tagged output from the pr command shows the numeric offset in the file preceded by the string Database1 , in the form

<PSet><Start>Database1:12345</Start></PSet>

Settings:

QuietOff , QuietOn

`last set`

%

refers to the previous result.

% is used as shorthand to refer to the set created most recently in the XPAT session. The set is the final one in the current history list. Some commands, such as pr and save , do not create sets that are saved and recorded in the history list and thus cannot be accessed by using the % . If there is no history, the last set is the null set which contains all index points.

Examples:

>>  region Author including "Hemingway" 
>>  pr sample % 
>>  % within region Quote

The % in the second line of the example refers to the set created by the including command in the first query. The % in the third line also refers to the set created by the first line and not to the result of the pr in the second line which does not produce a set.

`{LeftContext}`

{LeftContext number}

specifies how many characters of context are displayed to the left of a set member.

By default, when a set is displayed with the pr command or written to a file by the save command, the text has 14 characters to the left of the match point. The setting can be changed at any time during a XPAT session and remains in effect until it is changed again or until the end of the current session. The current value of the LeftContext setting is displayed by the command {Settings} .

Examples:

>>  {LeftContext 40}

This changes the setting to the value 40 so that any subsequent pr or save command produces text with 40 characters to the left of the match point.

Settings:

PrintLength

`macro`

The macro capability facilitates the use of frequently used sequences of XPAT commands.

A macro can be defined in a XPAT session and be available only for the duration of that session, or a macro can be created externally and read into any XPAT session by an exec command or during initialization.

The definition of a macro (here called name ) begins with the following: name = macro

After this line the system prompt changes from >> to || for the duration of the macro definition. The body of the macro may begin on the same line or on a subsequent line. XPAT interprets anything immediately following the word macro , that is not a blank or new line, as the beginning of the macro definition. The body of the macro may contain arguments. The nth argument to the macro is identified within the macro definition by the string $n$ . Any sets that are created by the macro may also be used in its definition. The string *n* refers to the nth set created within the macro. The end of the macro definition is indicated by a @ . After the @ the system prompt returns to the form >> .

References to other macros may be used within the definition of a macro. If the macro contains more than one XPAT query, they can be put on separate lines or on the same line with the queries separated by a semi-colon. The body of the macro is not checked for syntax errors when it is defined. Any errors are reported when the macro is used.

The macro is invoked by the following call:

name(arg1,arg2..)

If the number of arguments in the macro call is less than the number in the macro definition, a syntax error is reported. If it is greater, the extra arguments are ignored.

Each argument consists of all the text occurring between argument delimiters: parentheses and commas. That is, if a macro takes three arguments - (arg1,arg2,arg3) - arg1 consists of the text between the opening parenthesis and the first comma, arg2 consists of the text between the first and second comma, and arg3 consists of the text between the second comma and closing parenthesis. If a macro takes only one argument - (arg1) - the parentheses are the argument delimiters. Note that any spaces entered with an argument string will be included with the parameter substitution which is unlikely to be the intent of the user. To avoid unexpected results, enter only the exact text that you wish to be substituted in the arguments of the macro call. Also note that macros may have no arguments.

When the macro is invoked, the invocation is replaced by an exact copy of the body of the macro with the arguments substituted for the formal parameters. This means that the macro can be used within other XPAT queries. This may require that the macro definition have the closing @ on the same line as the final line of the body of the macro definition to avoid introducing an unwanted new-line character.
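The substitution rules described above can be sketched in Python. This is an illustration only: the function name `expand` is hypothetical, arguments are split at every comma exactly as the text describes (nested parentheses are not handled), the *n* set references are not modelled, and a `ValueError` stands in for the syntax error XPAT reports.

```python
import re


def expand(body, call_args):
    """Substitute macro call arguments for $n$ markers in a macro body.

    call_args is the raw text between the parentheses of the call.
    """
    # Arguments are the text between delimiters; spaces are kept as typed.
    args = call_args.split(",") if call_args else []

    def substitute(match):
        n = int(match.group(1))
        if n > len(args):
            # Stand-in for the syntax error reported for too few arguments.
            raise ValueError("too few arguments in macro call")
        return args[n - 1]

    # Extra arguments are never referenced, so they are simply ignored.
    return re.sub(r"\$(\d+)\$", substitute, body)
```

Under this model, `expand('"$1$ "', "pad")` gives `"pad "` but `expand('"$1$ "', " pad")` gives `" pad "`, which shows why stray spaces in a macro call are usually unwanted.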

If improperly used, macros that produce multiple sets and are used within other queries may cause more than one syntax error to be reported. Care must be taken with bracketing in order to ensure that the results reflect what was actually intended.

A macro can be redefined during a XPAT session. When the macro is redefined, the previous definition of the macro is displayed following the first new line entered after the word macro . The format of this previous definition consists of the macro name followed by a colon, followed by the body of the macro on subsequent lines.

For convenience, macros that are used frequently can be defined within an init file whose location is given in the data dictionary file. The init file is read and executed by a XPAT session when it is initially started. (See the data dictionary documentation for details.)

Examples:

>>  word = macro 
 ||  ( "$1$ " + "$1$<" + "$1$-" ) @ 
>>  word(pad) within *definitions

This macro is intended for text in which tags start with a < and may follow text directly, with no blank before the tag. The macro defines a word as a string of characters followed by a blank, < , or - (the definition assumes an index in which all punctuation is mapped to a blank). Since the macro definition has the @ sign on the same line as the body of the definition, the macro can be used within a more complicated query as shown. Note the brackets included in the macro definition. The example assumes that a region set named definitions has been defined, and finds all occurrences of pad , as a word, inside one of these regions.

>>  both = macro 
 ||  region $1$ 
 ||  *1* including $2$ 
 ||  *2* including $3$ 
 ||  @ 
>>  both(Line ,"juliet" ,"romeo")

With the macro defined here, the members of a predefined region set which contain both of two given strings are found. In the macro call above, the macro is applied to a database of Shakespearean texts in order to find the members of the predefined region set named Line containing references to both romeo and juliet . The definition of this macro returns more than one set. It also has the @ on the line following the body and thus could not be used within another query. The resulting output from XPAT showing the three sets produced by the macro would appear as below:

  16: 128794 matches
  17: 214 matches
  18: 25 matches

`naming sets`

name = set1

assigns a name to a set.

A set which has been named can be referred to either by that name or its set number. Set1 can be either a point set or a region set.

A name that starts with a letter and contains only letters and numbers does not need to be enclosed within double quote marks. However, if the name contains special characters (blanks or non-alphanumeric characters), or does not start with a letter, it must be enclosed within double quote marks both in the assignment statement and in subsequent use.

To use the name in a query it must be preceded by an asterisk (* ). Without the asterisk (* ), XPAT interprets the name as a string rather than as the name of a set.

Examples:

>>  UK = "U.K."+"Britain"+"Great Brit"+"United King" 
>>  region Headline including *UK

The first line assigns the name UK to a set of matches to four alternate ways of referring to the United Kingdom. The second line finds Headline regions that contain any of the matches.

>>  "min_hiring" = region Minutes incl ("hiring" near "policy") 
>>  region Attendees within *"min_hiring"

The first line of the example assigns the name min_hiring to Minutes regions that include matches to hiring appearing close to matches to policy . The second line finds the Attendees regions that are within one of the resulting Minutes regions from the first query.

`near`

set1 near set2

finds members of sets that are close to each other.

Near creates a set containing the members of set1 that are within a specified number of characters before or after one or more members of set2. Set1 and set2 may be either point sets or region sets. The new set is of the same type as set1.

The distance between members of the two sets is calculated by counting the number of characters in the text between the first character of a member of set1 and the first character of a member of set2. The measure used to determine closeness is the value of the Proximity setting which has a default value of 80 characters. The value can be changed for all subsequent uses of near by changing the Proximity setting, or it can be changed for an individual use of near by using a modifier attached to the command. The form of the modifier is a period followed by a number representing the maximum distance (in characters).

If either set1 or set2 is a region set, the first of the two pointers describing the region is used in finding the distance between the members of the sets.

Multiple near commands are not parsed left to right. A command of the form

       set1 near set2 near set3

is handled as if parenthesized as follows:

       set1 near (set2 near set3)

The command not near creates a set containing those members of set1 that are not within the specified distance of any member of set2.

 set1 not near set2

is the same as

 set1 - (set1 near set2)
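The distance rule and the not near equivalence can be sketched in Python. The sketch makes several assumptions that the text does not settle: points are integers, regions are `(start, end)` tuples, the function names are hypothetical, and "within" is taken to include a distance exactly equal to the Proximity value.

```python
def start(member):
    """Start offset of a point (int) or of a region ((start, end) tuple)."""
    return member[0] if isinstance(member, tuple) else member


def near(set1, set2, proximity=80):
    """Members of set1 starting within `proximity` characters of set2."""
    starts2 = [start(member) for member in set2]
    return [
        member
        for member in set1
        # Distance is measured between first characters, in either direction.
        if any(abs(start(member) - other) <= proximity for other in starts2)
    ]


def not_near(set1, set2, proximity=80):
    """Modelled as set1 - (set1 near set2)."""
    close = near(set1, set2, proximity)
    return [member for member in set1 if member not in close]
```

Passing a different `proximity` value plays the role of the numeric modifier, so `near(set1, set2, 30)` corresponds to `set1 near.30 set2` in this model.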

Examples:

>>  "love " near "hate "

Assuming a Proximity of 80, this query creates a point set containing those matches to love that are within 80 characters of matches to hate , counting from the l in love to the h in hate . The string hate can occur before or after love in the text.

>>  region Title near.30 region Author

This query creates a region set containing the members of region Title that are within 30 characters of one or more members of region Author . In this case the distance is measured as the number of characters between the first character of a Title region and the first character of an Author region.

>>  "love " not near "hate "

This query creates a point set containing those matches to love that do not occur within 80 characters of a match to hate calculating the distance as in the first example.

>>  "love " not near.30 "hate "

This query creates a point set containing the matches to love that do not occur within 30 characters of a match to hate .

Settings:

Proximity

`next`

next set1

finds a specified number of contiguous members of a set following members already identified by a first or next command.

Next creates a set of a specified size containing the members of set1 that start at the current cursor position associated with this set. The cursor position is determined by a previous first or next command applied to set1. The members of the new set are in the order they appear in set1. Set1 may be either a region set or a point set. The new set is of the same type as set1.

The operation of the next command depends on the set order established by the SortOrder setting. If the SortOrder setting is Alpha , the set is ordered alphabetically; if the SortOrder setting is Occur or OccurHead , the set is ordered as the members occur in the text; and if the SortOrder setting is AsIs , the set ordering is the current one and may thus be either alphabetic or occurrence order.

Each set that is used with a first , next or ~nextemp command has a cursor (set member counter) associated with it. The cursor indicates the location in set1 at which to begin selection for the set being created. On completion of the next command the cursor is updated to point at the beginning of the next set. Note, when the SortOrder setting changes and the set ordering is changed, the cursor is reset to the first element.

The size of the set created is determined by the value of SampleSize which has a default value of 10. If the size of set1 is less than SampleSize , then the new set created is the same size as set1. Changing the SampleSize affects all subsequent uses of next during the current session. For an individual use of the command, the size of the new set can be specified by using a modifier attached to the next command. The modifier is in the form of a period followed by a numeric value giving the desired set size.

The next command can be used by itself or with the pr , save or export commands. Note that these are the only commands with which next may be combined.

Examples:

>>  {SampleSize 40} 
>>  first.0 5 
>>  next 5

The first line of the example changes the SampleSize setting to 40. On the second line, the first command resets the cursor associated with set number 5 to the first member of the set and creates a set of size 0 (thereby leaving the cursor at the first member). The third line creates a set that contains the first 40 members of set number 5.

>>  next.10 5

If this command follows the previous example, a set of ten members is created. The cursor associated with set number 5 indicates that 40 members have been used to create the set in the previous next command and so this new set starts at the 41st member of set number 5.

Settings:

SampleSize , SortOrder

`~nextemp`

~nextemp set1

finds a specified number of contiguous members of a set following members already identified by a first or next command.

The command ~nextemp creates a set of a specified size containing the members of set1 that start at the current cursor position associated with the set. The cursor position is determined by the previous first or next command applied to set1. The members of the new set are in the order they appear in set1. Set1 may be either a region set or a point set. The new set is of the same type as set1.

The ~nextemp command is identical to the next command except that the cursor is unchanged by the ~nextemp command.

The operation of the ~nextemp command depends on the set order established by the SortOrder setting. If the SortOrder setting is Alpha , the set is ordered alphabetically; if the SortOrder setting is Occur or OccurHead , the set is ordered as the members occur in the text; and if the SortOrder setting is AsIs , the set ordering is the current one and may thus be either alphabetic or occurrence order.

Each set that is used with a first , next or ~nextemp command has a cursor (set member counter) associated with it. The cursor indicates the location in set1 to start selecting members for the set being created. After completion of the ~nextemp command, the cursor is unchanged. This differs from the behaviour of the next command, which advances the cursor to the member following the last member of set1 selected for the new set. Note that when the SortOrder setting changes and the set ordering is changed, the cursor is reset to the first element.

The size of the set created is determined by the value of the SampleSize setting which has a default value of 10. If the size of set1 is less than SampleSize , then the new set created is the same size as set1. Changing the SampleSize setting affects all subsequent uses of ~nextemp during the current session. For an individual use of the command, the size of the new set can be specified by using a modifier attached to the ~nextemp command. This modifier is in the form of a period followed by a numeric value giving the desired set size.

The ~nextemp command can be used by itself or with the pr , save or export commands. Note that these are the only commands with which ~nextemp may be combined.

Examples:

>>  {SampleSize 40} 
>>  first.0 5 
>>  ~nextemp 5

The first line changes the SampleSize setting to 40. On the second line, the first command initializes the cursor associated with set number 5 to the first member of the set and creates a result set of size 0 (thereby leaving the cursor at the first member). The third line creates a set that contains the first 40 members of set number 5.

>>  ~nextemp.10 5

Assume this command follows the previous example. On completion of the previous query, the cursor still points to the beginning of the set as the ~nextemp command does not change the cursor setting. The set created by this query contains 10 elements from the beginning of set number 5.
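The difference between next and ~nextemp comes down to whether the cursor moves. The following Python sketch models that behaviour. The class `CursorSet` and its method names are hypothetical, and the behaviour of `first` (reset the cursor, then select like next) is an assumption inferred from the examples in this manual.

```python
class CursorSet:
    """A set with an associated cursor (set member counter)."""

    def __init__(self, members):
        self.members = list(members)
        self.cursor = 0  # index of the next member to be selected

    def first(self, size):
        # Assumption: first resets the cursor, then selects like next.
        self.cursor = 0
        return self.next(size)

    def next(self, size):
        chunk = self.members[self.cursor:self.cursor + size]
        # The cursor moves past the members just selected.
        self.cursor += len(chunk)
        return chunk

    def nextemp(self, size):
        # Same selection as next, but the cursor is left unchanged.
        return self.members[self.cursor:self.cursor + size]
```

With a 100-member set, `first(0)` followed by `next(40)` and `next(10)` yields members 1 to 40 and then 41 to 50, whereas `first(0)` followed by `nextemp(40)` and `nextemp(10)` yields members 1 to 40 and then 1 to 10.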

Settings:

SampleSize , SortOrder

`not`

is used to modify four XPAT commands. The forms in which not can appear are not fby , not including , not near , and not within . These uses are described in the entries for fby , including , near , and within . Not cannot be used to modify any other commands.

`offsets`

[number]

[label:number]

generates a point set containing a specified position in the text.

The number in the square brackets is a logical position in the text and need not be an index point. The number indicates the offset, measured in number of characters, from the beginning of the text database. The first character of the text has offset [1]. If the number used in square brackets exceeds the size of the text, XPAT gives the message

Error: Input number too large.

Note that the new set is a point set with only one member.

The second form of the command, shown above, uses offsets that are produced when XPAT is operating in quiet mode and using labels. In this form, in order to produce correct results, the label string must be the current value of the setting Label . When the label is different from the current Label setting the resulting set has size 0.

Examples:

>>  region Quote including [20000]

This query finds the Quote region that includes the offset 20000.

>>  {Label news} 
>>  region Quote including [news:20000]

This query uses an offset in the form produced by XPAT in quiet mode (having requested labels with the offsets). Since the Label has been set to the value news by the previous command, the query finds the Quote region containing the given offset. If the label prefixed to the offset were anything other than news , the query would produce a set of size 0.

Settings:

Label

`pr`

pr set1

displays contents of XPAT sets.

Pr displays each member of set1 with surrounding context. A modifier can be attached to the pr command in order to control the context exactly. Set1 can be any region set or point set. If set1 is a region set, the first of the pair of points describing each region in the text is displayed by the pr command. If no set1 is given, the operand for the command is the most recent set created in the session.

For each member in the given set, the output is in the form of an integer giving the offset of the set member in the text file, followed by a comma, a blank, two periods and then the characters surrounding the set member. The first character in the database is considered to be offset 1. The order in which the set is displayed depends on the current SortOrder setting.

With no modifier, pr prints a line of text for each element in set1. The PrintLength and LeftContext settings determine the content of the line printed. With the default settings, the printed text is 64 characters in length of which 14 precede the match point. The number of characters displayed to the left of the match point can be altered by changing the LeftContext . The total number of characters printed can be altered by changing the PrintLength setting.

The total number of characters to be displayed can be set, for a single instance of the command, by using a numeric modifier attached to the pr . The modifier is in the form of a period followed by a number giving the total number of characters to be displayed. The left context that is displayed is still determined by the value of the LeftContext setting.

The second form the modifier can have is a period followed by the string region . When the modifier .region is used, the output text starts at the match point and continues to the end of the default region in which the match point occurs. If the match point is not within the default region, no output is displayed for the match point.

The second form of the modifier can be refined to request that the text displayed is a region other than the default region. An additional modifier specifying a defined region set can be attached to the already modified pr command (i.e. to the pr.region ). The additional modifier can specify the region in one of three ways: a string giving the name of a predefined region, the number of a region set created in the XPAT session, or a string preceded by an asterisk (* ) referring to a named region set defined in the XPAT session (see the examples below). As with the form pr.region , described above, this use results in the displayed text starting at the match point with no left context and continuing to the end of the region. When the match point is not contained in the designated region set, no output is displayed.

Examples:

>>  "Kipling" 
>>  pr

This command displays a line of context for each member of the previously calculated set. Assuming the PrintLength and LeftContext still have the default values, each line will contain 64 characters of which 14 will be before the match point.

>>  {PrintLength 300} 
>>  pr "my dear Watson"

As with the previous example, this command prints a line of context for each member in the point set matching the string my dear Watson . In this case, the line printed for each member in the set is 300 characters long but still has 14 characters preceding the match point.

>>  pr region including "detective"

This command will print a line for each member in the set of default regions that contains the string detective . The text displayed starts at the beginning of the default region .

>>  pr.200 shift.-100 ("city" near "oxford")

This command prints a line of 200 characters for each member in the set of matches to the string city when it appears near the string oxford . Since the match points in this set have been shifted 100 characters to the left the displayed text actually begins 114 characters to the left of the string city (assuming the LeftContext is set to 14).

>>  region incl (region EQ incl ("<D>1980" .. "<D>1986")) 
>>  pr.region

The first query finds the members of the default region (in this example they might be dictionary entries) that contain EQ regions which are in the period from 1980 to 1986. The second command prints these entries. After the offset, comma, blank and two periods, the displayed text starts at the match point which is at the beginning of the default region, and continues to the end of the default region.

>>  region Quote including ("univ" near "waterloo") 
>>  pr.region.Quote

The first query finds the Quote regions which contain the string univ occurring near the string waterloo . The second command displays these regions. The output consists of an offset, comma, blank, two periods and the text starting at the beginning of the Quote region and continuing to the end of the Quote region.

>>  pr.region.5 "law" fby "order"

This command displays data from the set of matches to the string law when followed by order . The text that is printed starts at each match to the string law and continues to the end of the region in set number 5 that contains the match point.

>>  *verse including "faith, hope, charity" 
>>  pr.region.*verse

The first query finds the members of the region set named verse (created earlier in the XPAT session) that contain the string faith, hope, charity . For each member of the resulting region set, the second command prints text starting at the beginning of the region and continuing to the end of the region described by *verse .

Settings:

DefaultRegion , LeftContext , PrintLength , SortOrder

`{PrintLength}`

{PrintLength number}

specifies how many characters of text are displayed.

By default, when the members of a set are displayed with the pr command or written to a file by the save command, each member contains 64 characters of context, 14 to the left of the match point, the match point itself, and 49 to the right. This setting may be overridden so that the number of characters processed is determined by a modifier for an individual use of the pr or save commands. The PrintLength setting determines the total number of characters processed and thus affects the number of characters shown to the right of the match point. The number of characters to the left of the match point is determined by the LeftContext setting.

The setting can be changed at any time during a XPAT session and remains in effect until changed again or until the end of the session. The current value of the PrintLength setting is displayed by the command {Settings} .

Examples:

>>  {PrintLength 100} 
>>  pr ("Yukon" near ("B.C." + "British Columbia"))

This changes the setting to the value 100 so that any subsequent pr or save command produces text 100 characters in length. The set displayed has 14 characters to the left of the match point and 85 characters to the right, assuming a default value of 14 characters for left context.
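The arithmetic behind these figures (14 + 1 + 85 = 100, or 14 + 1 + 49 = 64 with the defaults) can be checked with a small Python sketch of the display window. The function name `context` is hypothetical, and the handling of a match point closer than LeftContext characters to the start of the text is an assumption, since this manual does not describe that case.

```python
def context(text, offset, print_length=64, left_context=14):
    """Return the window of text shown for a match at a 1-based offset."""
    # Offsets count from 1, so the match point is at index offset - 1.
    begin = offset - 1 - left_context
    # Assumption: near the start of the text the window is simply clipped.
    begin = max(begin, 0)
    return text[begin:begin + print_length]
```

For instance, `context("This sample is to be first used", 6, print_length=25, left_context=5)` returns the 25 characters `This sample is to be firs`, the same line that appears in the quiet mode example for a PrintLength of 25 and a LeftContext of 5.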

Settings:

LeftContext

`{Proximity}`

{Proximity number}

specifies the measure of closeness for the near and fby commands.

The Proximity default for the fby and near commands is 80 characters. That is, a match point of a member of set1 must be within 80 characters of a match point of a member of set2 to be included in a new set created by the near and fby commands.

The Proximity setting may be overridden for an individual use of the fby and near commands by appending a modifier to the command.

The Proximity setting can also be changed at any time during a XPAT session and remains in effect until changed again or until the end of the session (see example below). The current value of the Proximity setting is displayed by the command {Settings} .

Examples:

>>  {Proximity 200} 
>> "Canada" near ("U.S." + "United States" + "the States")

The first line of the example changes the Proximity setting to the value 200 so that any subsequent near and fby commands use this value. In the query, XPAT finds the occurrences of the string Canada that occur within 200 characters either to the left or right of members of the set produced by the union of the sets matching the strings U.S. , United States and the States .

`~qnum`

~qnum

outputs a query number.

The ~qnum command operates in both standard and quiet mode. In standard mode, the number of the next query is output. In quiet mode, the information is tagged and the number of the next query is contained within <Qnum> tags.

Examples:

>>  "testing" 
>>  ~qnum

If testing is the first query in the XPAT session, the output from the ~qnum command is the set number 2. In quiet mode, this appears as the string <Qnum>2</Qnum> .

`quiet mode`

{QuietOn Raw Converted Label Persistent}

{QuietOff}

changes the mode of operation of XPAT. {QuietOn} causes XPAT to operate in quiet mode. {QuietOff} causes XPAT to revert to standard (non-quiet) mode.

Each of the four arguments to QuietOn is optional and may appear in any order. When an argument is present in a QuietOn command, the corresponding setting is turned on. Conversely, when an argument is not present in a QuietOn command, the corresponding setting is turned off. Settings are not carried forward from one QuietOn command to the next but are reset with each QuietOn command.
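The reset behaviour of the QuietOn arguments can be sketched in Python as a function from one command to a complete set of flags. The function name `quiet_on` is hypothetical, and rejecting unknown arguments with a `ValueError` is an assumption; the manual does not say how XPAT reacts to them.

```python
FLAGS = ("Raw", "Converted", "Label", "Persistent")


def quiet_on(command):
    """Return the flag settings implied by a single QuietOn command."""
    words = command.strip("{} ").split()
    if not words or words[0] != "QuietOn":
        raise ValueError("not a QuietOn command")
    given = set(words[1:])
    unknown = given - set(FLAGS)
    if unknown:
        # Assumption: unknown arguments are treated as an error here.
        raise ValueError("unknown arguments: " + ", ".join(sorted(unknown)))
    # Every flag is recomputed from this command alone, so nothing is
    # carried forward from an earlier QuietOn command.
    return {flag: flag in given for flag in FLAGS}
```

In this model, `quiet_on("{QuietOn Label Raw}")` turns on Label and Raw only, and a later `quiet_on("{QuietOn Converted}")` leaves only Converted on.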

All XPAT commands that create sets operate the same way in quiet mode and in standard mode. However, the output generated by XPAT is different in the two modes. No prompt or newline appears when XPAT is operating in quiet mode. In addition, the output from XPAT in quiet mode is in a tagged format.

In standard mode, when a command or query creates a new set, a set number and the number of matches is output. In quiet mode, the tagged output contains the number of matches within <SSize> tags but no set number. For example, if a set of 122 matches is created by a XPAT query the output is of the form:

<SSize>122</SSize>

In standard mode, information displayed about a set by a pr command is affected if a modifier is attached to the command. The output from a pr command is preceded by the offset in the text of the set member being printed. If the set is a region set, the offset is the start of each region in the set. In quiet mode, the output contains the numeric offset in a tagged format. The settings of Raw, Converted, Label and Persistent affect the information displayed by the pr command. Each of the settings is discussed below.

{QuietOn}

With Persistent turned off, the values of the offsets that are output are the logical offsets into the file. The logical and persistent offsets are different and non-interchangeable. Persistent offsets are designed for use with the update system. (See the documentation for the XPAT update system.)

If the pr command is not of the form pr.region , the offset of each set member is contained within <Start> tags and the entire output is contained within <PSet> (for Point Set) tags. For example, if the set is of size 2, the output might look as follows (without the line breaks).

<PSet><Start>1234</Start>
<Start>5554</Start></PSet>

If the modifier to the pr command is .region , in standard mode the text displayed is from the match point to the end of a specified region. In quiet mode, the tagged output contains both the offset of the match point and the offset of the end of the specified region. The offsets of the ends of the region are contained within <End> tags and the entire output is contained within <RSet> (for Region Set) tags. For example, for the above set of size 2, output from a pr.region might look like

<RSet><Start>1234</Start><End>1444</End>
<Start>5554</Start><End>6000</End></RSet>

{QuietOn Label}

When Label is turned on, the form in which the offset is printed changes. The numeric value of the offset into the text within the <Start> or <End> tags is preceded by an identifying label and a colon. This label string is the value of the Label setting. If Label has not been set, the label used in the output is the name of the data dictionary file up to the first non-alphanumeric character. For example, if the data dictionary is news.dd and Label has not been set, the output from a pr command would look like:

<PSet><Start>news:1234</Start>
<Start>news:5554</Start></PSet>
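A program reading quiet mode output might extract the tagged offsets along the following lines. This Python sketch is an illustration, not part of XPAT or the DLXS middleware. The function name `parse_offsets` is hypothetical, and the label pattern is inferred from the rule that a label begins with an alphabetic character and contains only alphanumeric characters.

```python
import re

# <Start> or <End>, an optional "label:" prefix, then the numeric offset.
OFFSET_TAG = re.compile(r"<(Start|End)>(?:([A-Za-z][A-Za-z0-9]*):)?(\d+)</\1>")


def parse_offsets(output):
    """Return one dict per set member found in tagged pr output."""
    members = []
    for kind, label, number in OFFSET_TAG.findall(output):
        if kind == "Start":
            members.append(
                {"label": label or None, "start": int(number), "end": None}
            )
        elif members:
            # An <End> tag completes the most recent <Start>.
            members[-1]["end"] = int(number)
    return members
```

Applied to the labelled example above, the sketch yields two members with label `news` and start offsets 1234 and 5554. For pr.region output the `end` field is filled in as well.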

{QuietOn Raw}

When Raw is turned on, in addition to the tagged offsets, the output contains text showing the match point and surrounding context. For each member of the set, this additional information is output within <Raw> tags following the tagged offset information for each member of the set.

The length of the string being output is given within <Size> tags and is followed by the text. As in standard mode, if pr has no modifier, the length of string output is determined by the PrintLength setting and the context shown to the left of the match point is determined by the LeftContext setting. If the modifier is a numeric value, this value determines the length of the string and the left context is still determined by the LeftContext setting. If the modifier to the pr command is .region , the text starts at the match point and continues to the end of the specified region.

For example, assuming a PrintLength setting of 25, and a LeftContext setting of 5, the output from a pr command applied to a set of 2 matches to the string sample would be (without the line breaks shown here):

<PSet><Start>1234</Start><Raw><Size>25</Size>
This sample is to be firs</Raw>
<Start>3456</Start>
<Raw><Size>25</Size>
This sample is to be seco</Raw>
</PSet>
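Because the displayed text may itself contain markup, a reader of this output would probably rely on the <Size> value rather than search for the closing tag. The Python sketch below shows one way to do that. The function name `read_sized` is hypothetical, and it assumes the size counts characters of the string exactly as they appear in the output.

```python
def read_sized(output, tag, pos=0):
    """Read the next <tag><Size>n</Size>text</tag> element at or after pos.

    Returns (text, position after the element), or None if none is found.
    """
    opener = "<" + tag + "><Size>"
    begin = output.find(opener, pos)
    if begin < 0:
        return None
    size_start = begin + len(opener)
    size_end = output.index("</Size>", size_start)
    size = int(output[size_start:size_end])
    text_start = size_end + len("</Size>")
    # Take exactly `size` characters, whatever they contain.
    text = output[text_start:text_start + size]
    closer = "</" + tag + ">"
    if not output.startswith(closer, text_start + size):
        raise ValueError("declared size does not match the tagged text")
    return text, text_start + size + len(closer)
```

Calling `read_sized(output, "Raw")` repeatedly, each time passing the returned position as `pos`, walks through the <Raw> elements of a result in order.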

If the SortOrder setting is OccurHead , in addition to the above output, the descriptive header is output in a tagged format. (See the entry for SortOrder for a description of the header). This information is contained within <Hdr> tags and includes the length of the descriptive string within <Size> tags followed by the string of the header. If the SortOrder setting was OccurHead in the above example, the output would be (without the line breaks shown here):

<PSet><Start>1234</Start>
<Hdr><Size>10</Size>First     </Hdr>
<Raw><Size>25</Size>
This sample is to be firs</Raw>
<Start>3456</Start>
<Hdr><Size>10</Size>Second    </Hdr>
<Raw><Size>25</Size>
This sample is to be seco</Raw>
</PSet>

{QuietOn Converted}

When Converted is turned on, in addition to the tagged offsets, text following the match point is output for each member of the set. This text is displayed with the appropriate character mappings for the XPAT index and any stopwords removed. For example, if upper case is mapped to lower case when creating the index, the text is displayed in lower case. If the index has the word to as a stopword, to would not appear in the converted text.

For each member of the set, this additional information is output within <Cvt> tags. The length of the output text string is enclosed within <Size> tags and is followed by the text itself. For each set member, the text string shown starts at the match point. This is in contrast to the Raw text output, which shows the match point with some left context. If pr has no modifier, the length of the string is determined by the PrintLength setting. If the modifier is numeric, this determines the string length. If the modifier is .region , the length of the string is the difference between the offsets of the match point and the end of the region. Because the displayed text is converted text, some conversions may suppress output, such as the multiple blanks that result from character mappings or stopwords. This may result in text that occurs past the end of the region being displayed.

For example, using the above example of a set of size 2 and further assuming that to and be are stopwords, the output might be:

<PSet><Start>1234</Start>
<Cvt><Size>25</Size>
sample is first used for </Cvt>
<Start>3456</Start>
<Cvt><Size>25</Size>
sample is second used for</Cvt>
</PSet>
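The conversion can be approximated in Python as follows. This is a rough sketch under stated assumptions: the only character mapping modelled is upper case to lower case, runs of blanks are collapsed to a single blank, and the function name converted_text is hypothetical. A real index applies whatever mappings and stopwords it was built with.

```python
def converted_text(text, match, print_length, stopwords):
    """Approximate the text placed inside <Cvt> tags for one match point."""
    tail = text[match:].lower()                 # text starts at the match point
    words = [w for w in tail.split() if w not in stopwords]
    return " ".join(words)[:print_length]       # collapse blanks, then truncate


stopwords = {"to", "be"}
first = "This sample is to be first used for testing."
second = "This sample is to be second used for testing."

cvt_first = converted_text(first, first.index("sample"), 25, stopwords)
cvt_second = converted_text(second, second.index("sample"), 25, stopwords)
print(f"<Cvt><Size>{len(cvt_first)}</Size>{cvt_first}</Cvt>")
print(f"<Cvt><Size>{len(cvt_second)}</Size>{cvt_second}</Cvt>")
```

Because the stopwords are dropped before the string is cut to 25 characters, the converted text reaches further into the source text than the Raw text of the same length.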

If the SortOrder setting is OccurHead , in addition to the above output, the descriptive header is given in a tagged format. (See the entry for SortOrder for a description of the header). The information giving the descriptive string precedes the <Cvt> tag and does not have the character mappings applied to it. The previous example would change to:

<PSet><Start>1234</Start>
<Hdr><Size>10</Size>First     </Hdr>
<Cvt><Size>25</Size>
sample is first used for </Cvt>
<Start>3456</Start>
<Hdr><Size>10</Size>Second    </Hdr>
<Cvt><Size>25</Size>
sample is second used for</Cvt>
</PSet>

{QuietOn Persistent}

When Persistent is turned on, the offsets that are output are the persistent positions within the text database. As noted earlier, in a database that has not been initialized for update, the persistent and logical offsets are identical.

{QuietOn Raw Converted Label}

Any combination of the QuietOn arguments may be used. Thus, after the command {QuietOn Raw Converted Label} , the following would result:

<PSet><Start>news:1234</Start>
<Raw><Size>25</Size>
This sample is to be firs</Raw>
<Cvt><Size>25</Size>
sample is first used for </Cvt>
<Start>news:3456</Start>
<Raw><Size>25</Size>
This sample is to be seco</Raw>
<Cvt><Size>25</Size>
sample is second used for</Cvt>
</PSet>
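A program that reads quiet mode output can rely on the <Size> values instead of searching for closing tags, since the text itself may contain angle brackets. The following Python sketch shows one way to do this. It is not part of XPAT: parse_pset and the dictionary keys are hypothetical names, the input is a string modelled on the example above, and it assumes the output arrives without the line breaks shown in this manual.

```python
import re

# Matches the opening of a sized block such as <Raw><Size>25</Size>.
SIZED_OPEN = re.compile(r"<(Raw|Cvt|Hdr)><Size>(\d+)</Size>")


def parse_pset(output):
    """Parse one <PSet>...</PSet> string into a list of dicts, one per member."""
    members = []
    pos = output.index("<PSet>") + len("<PSet>")
    while not output.startswith("</PSet>", pos):
        if output.startswith("<Start>", pos):
            end = output.index("</Start>", pos)
            # With Label on, the offset is preceded by a label and a colon.
            label, _, offset = output[pos + len("<Start>"):end].rpartition(":")
            members.append({"label": label or None, "offset": int(offset)})
            pos = end + len("</Start>")
            continue
        opened = SIZED_OPEN.match(output, pos)
        if opened is None or not members:
            raise ValueError(f"unexpected content at position {pos}")
        tag, size = opened.group(1), int(opened.group(2))
        body_end = opened.end() + size          # read exactly <Size> characters
        closing = f"</{tag}>"
        if not output.startswith(closing, body_end):
            raise ValueError(f"<{tag}> body is not {size} characters long")
        members[-1][tag.lower()] = output[opened.end():body_end]
        pos = body_end + len(closing)
    return members


quiet_output = (
    "<PSet><Start>news:1234</Start>"
    "<Raw><Size>25</Size>This sample is to be firs</Raw>"
    "<Cvt><Size>25</Size>sample is first used for </Cvt>"
    "<Start>news:3456</Start>"
    "<Raw><Size>25</Size>This sample is to be seco</Raw>"
    "<Cvt><Size>25</Size>sample is second used for</Cvt>"
    "</PSet>"
)
for member in parse_pset(quiet_output):
    print(member)
```

Reading a fixed number of characters after each <Size> tag also serves as a consistency check: if the closing tag is not found where the size says it should be, the sketch reports an error rather than guessing.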

The save command results in identical behaviour to that of the pr command except that the information is written to a designated file rather than displayed on the standard output.

Syntax errors that occur during the XPAT session are reported in a tagged format. A set size of -1 is indicated and the error information is contained within <Error> tags. For example, if a command uses the default region before it is set, the error shown is (without the line breaks shown here):

<SSize>-1</SSize>
<Error>No information for default region
</Error>

Although the sets created by the signif command are the same in quiet and standard mode, signif does not display the text string associated with the set in quiet mode. If signif is modified with a negative integer n requesting n sets, only information about the last set created is shown.

History and {Settings} display no output in quiet mode.

Settings:

Label , LeftContext , PrintLength , SortOrder

`{QuietOff}`

{QuietOff}

See:

quiet mode

`{QuietOn}`

{QuietOn Raw Converted Label Persistent}

See:

quiet mode

`quit`

quit

terminates a XPAT session.

The use of the quit command causes the session to end and the XPAT process to exit. A message may be generated telling how much computer time has been used during the XPAT session.

`range`

string1 .. string2

finds strings that begin with strings occurring within an alphabetic range.

The range operator creates a point set consisting of those indexed points in the text that fall alphabetically between string1 and string2 inclusive. String1 and string2 are patterns that may or may not actually occur in the text being searched. The resulting set contains the matches to both string1 and string2.

Both the operands to the range command must be strings. Using a set number with the range command is illegal and results in a syntax error.

Examples:

>>  "n" .. "z"

This query finds all indexed points in the text that occur in the alphabetic ordering between n and z .

>>  "a" .. "z"

Again, assuming the text has been indexed on words, this query creates a set of all the words and phrases in alphabetical order (that is, it produces a concordance of the text).

>>  "1" .. "200"

This query finds all the strings that fall alphabetically between 1 and 200 . This gives all the indexed strings that begin with 1 or 200 . For example, the strings 1929 , 20034 as well as strings such as 2003/1 and 2000-15000 are in this range. The resulting set does not contain the strings 3 or 4 .

>>  region Date including ("1920" .. "1925")

This query finds Date regions that contain dates from 1920 to 1925 inclusive. The range 1920 .. 1925 also contains strings such as "1925000" as they also fall within the range.

>>  "<Date>1920" .. "<Date>1925"

If dates are marked with the tag <Date> and begin with a 4-digit value for the year, this query reliably finds dates between 1920 and 1925 inclusive, and only those dates.
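The membership test behind these examples can be modelled in Python. This is a sketch only: the function in_range is a hypothetical name, and it compares raw characters, whereas XPAT compares text after the character mappings of the index have been applied.

```python
def in_range(indexed, string1, string2):
    """True if the text at an index point falls in string1 .. string2."""
    # A string that begins with string2 counts as a match to string2,
    # so it is inside the range even though it sorts after string2 itself.
    return indexed >= string1 and (indexed <= string2 or indexed.startswith(string2))


for candidate in ["1929", "20034", "2003/1", "2000-15000", "3", "4"]:
    print(candidate, in_range(candidate, "1", "200"))

# The Date example: 1925000 is inside "1920" .. "1925", 1926 is not.
print(in_range("1925000", "1920", "1925"), in_range("1926", "1920", "1925"))
```

The prefix test on string2 is what admits strings such as 1925000 and is the reason the tagged form of the query is more reliable for dates.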

`rankedby`

set1 rankedby set2

ranks a region set by the number of contained members of another set.

Rankedby creates a set containing those members of set1 that contain the greatest number of occurrences of members of set2. Set1 must be a region set. Set2 may be either a point set or a region set. The new set is a region set.

Set1 may be a predefined region set, a region set that has been created within the current XPAT session using the region command, a region set resulting from the use of the import command, or the result of a previous query during the current session.

The size of the new set is by default the value of the SampleSize setting. Another size may be requested with a numeric modifier in the form of a period followed by the requested size.

The set that is created, when accessed by pr , save and subset in SortOrder AsIs , is naturally ordered by rank. That is to say, the first member will be that element of set1 that contains the most occurrences of members of set2.

In detail, the rankedby command operates as follows. It first splits all the members of set1 into groups. Each member of a group includes the same number of members of set2 as the other members of the group. In addition, within a group, the members are sorted into occurrence order. After it has grouped the members of set1, the rankedby command sorts the groups into decreasing order of number of included members of set2.

For example, say that set1 has 6 members, as follows: 3 members that each contain 2 members of set2, 2 members that each contain 4 members of set2, and 1 member that contains no members of set2. After rankedby has grouped and sorted set1, the groups are as follows. The first group consists of the 2 members of set1 that contain 4 members of set2. The second group consists of the 3 members of set1 that contain 2 members of set2, and the third group consists of the 1 member of set1 that contains no members of set2. Within each group, the members are in occurrence order. If the user has requested the top 4 sets, the result set would contain both members of the first group and the first two members of the second group.
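The grouping and sorting just described amounts to a stable sort on the number of contained members. The Python sketch below reproduces the 6-member example. It is an illustration, not the XPAT implementation: regions are modelled as (start, end) offset pairs with inclusive ends, set2 is modelled as a point set only, and all names and offsets are hypothetical.

```python
def rankedby(regions, points, size):
    """Return the `size` regions that contain the most points, ranked."""
    def contained(region):
        start, end = region
        return sum(start <= p <= end for p in points)

    in_occurrence_order = sorted(regions)
    # sorted() is stable, so regions with equal counts stay in occurrence order.
    ranked = sorted(in_occurrence_order, key=lambda r: -contained(r))
    return ranked[:size]


# Six regions in occurrence order, containing 2, 4, 0, 2, 4 and 2 points.
stories = [(0, 9), (10, 19), (20, 29), (30, 39), (40, 49), (50, 59)]
hits = [1, 2, 11, 12, 13, 14, 31, 32, 41, 42, 43, 44, 51, 52]

top4 = rankedby(stories, hits, 4)
print(top4)
```

The result holds both regions with 4 points followed by the first two regions with 2 points, which matches the outcome described for a request for the top 4 sets.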

Examples:

>>  region Story rankedby ("Free trade" near "Canada")

This query finds the regions described by region Story that contain the greatest number of matches to Free trade when it occurs close to Canada . The number of members in the new set is the value of the SampleSize setting.

>>  region Quote rankedby.5 region Author

This query creates a set whose members are the 5 members of region Quote that contain the greatest number of members of region Author .

Settings:

SampleSize

`region`

region

region string

region set1 .. set2

produces region sets in a text database. The first two forms of the region command refer to region sets that have been defined externally to a XPAT session and for which information is available in the data dictionary. These region sets may have been defined using patregion or any other program that generates information (in the form that XPAT understands) about regions in the text. The third form of the command defines a region set during a XPAT session. The results of any of these commands can be used as operands to any of the XPAT commands that operate on region sets.

region

Region , used with no operand, refers to the particular predefined region set that has been designated as the default region. The default region is defined by the DefaultRegion setting and can be reset for the remainder of a XPAT session by changing the setting. If no default region has been defined, using region in this form causes an error. The following message is generated:

No information for default region.

region string

The second form of the region command indicates one of the named predefined region sets. The string is the name that has been given to the region set in the data dictionary. For example, the region sets might be the chapters of a book, the entries in a dictionary or the headlines in a newspaper database. The information about certain regions in the text database is generated by a program external to XPAT and is made available during a XPAT session via the data dictionary. One program that generates the information is patregion.

Note that the string giving the name of the region set can contain blanks or special characters, if it is enclosed within double quote marks.

region set1 .. set2

The third form of the region command defines a new region set. The region set that is created by this command is only available for the duration of the XPAT session. Information about this region set can be written to a file using the export command and read into a future XPAT session using the import command.

Set1 and set2 are used to define regions in the new set. Set1 and set2 can be either point sets or region sets. If either set1 or set2 is a region set, the region command uses only the first of the pair of pointers describing its members in defining the new region.

Each region in the new set is formed as follows. A member of set1 is the beginning of a region if it is followed by a member from set2 with no other member of set1 occurring between the two members. The end point of the new region is defined by the member in set2 that most closely follows the set1 member. The region contains the text from the beginning of the member of set1 up to but not including the member of set2. This produces the smallest non-overlapping region set that can be formed by set1 and set2. The size of the region set created is equal to or smaller than the size of set1. If the members of set2 are matches to a pattern, the new region set does not contain the occurrences of that pattern. For example, if set2 is the set of matches to the string End of Message , the new region set contains no occurrences of the string End of Message .

If set1 and set2 are identical, two extra regions may be included in the newly created set. These are: a region from the beginning of the text to the member of set1 that occurs earliest in the text; and a region from the last element of set1 in the text to the end of the text. If either of these regions is a substring of length zero, it is not included. If the shift command is applied to set1 or set2, the extra regions are not included in the new set.
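The pairing rule and the two extra regions can be sketched in Python. This is a model under stated assumptions, not XPAT code: sets are lists of text offsets, regions are half-open (start, end) pairs so that the set2 member itself is excluded, "followed by" is taken to mean a strictly later offset, and the function names are hypothetical.

```python
import bisect


def make_regions(set1, set2):
    """Pair each set1 member with the closest following set2 member."""
    starts = sorted(set(set1))
    ends = sorted(set(set2))
    regions = []
    for i, start in enumerate(starts):
        j = bisect.bisect_right(ends, start)    # first set2 member after start
        if j == len(ends):
            continue                            # nothing follows this start
        end = ends[j]
        if i + 1 < len(starts) and starts[i + 1] < end:
            continue                            # another set1 member intervenes
        regions.append((start, end))
    return regions


def add_edge_regions(regions, points, text_length):
    """Add the two extra regions produced when set1 and set2 are identical."""
    if not points:
        return regions
    first, last = min(points), max(points)
    head = [(0, first)] if first > 0 else []    # zero-length regions are dropped
    tail = [(last, text_length)] if last < text_length else []
    return head + regions + tail


# Nested opening tags: only the innermost opening is paired with the close.
print(make_regions([10, 40, 100], [30, 120]))

# Ten occurrences of one pattern in a text of 1000 characters.
froms = [50 + 100 * k for k in range(10)]
between = make_regions(froms, froms)
print(len(between), len(add_edge_regions(between, froms, 1000)))
```

Ten occurrences give nine regions between consecutive occurrences, and eleven once the regions before the first and after the last occurrence are added.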

Some programs, such as patregion, that produce predefined region sets, define the end point of the region in a somewhat different manner. These programs deal with patterns of text (rather than points in the text) and the end point of the region that is defined is usually the last character in the pattern that is used to define the regions. If desired, the region command within a XPAT session can be used in conjunction with the shift command to create a set of regions in which the ends of the regions are at the end of a pattern. See the examples below.

XPAT does not support region sets whose members nest or overlap. As described above, using region with operands that are patterns defining nested or overlapping regions creates a region set which is the smallest non-overlapping set of regions. Patregion used on the same text creates a possibly different region set (also non-overlapping) consisting of regions from an opening pattern to the following end pattern.

Examples:

>>  region including ("Smith" near "Jones")

This query creates a region set, consisting of the members of the default region which contain a match to the string Smith when it occurs within a prescribed distance of the string Jones .

>> "Campbell" within region "Speaker Name"

In this example, we assume that one of the predefined regions has been named Speaker Name . This query creates a point set that contains matches to the string Campbell occurring within members of region Speaker Name .

>>  firstb = region "<A>".."</B>" 
>>  (region B within *firstb) including "requested string"

The text, in this example, contains regions that begin with <A> and end with </A> . Each of these A regions contains smaller regions that begin with <B> and end with </B> . Assume, in certain instances, that it is necessary to be able to find the first B region within each A region. The use of region in the first query creates a region set named firstb that can be used to find these regions. The members of firstb are the pieces of text that begin with the string <A> and extend to the closest string </B> . The second query finds the members of region B that are within firstb , and then finds the members of the latter that include requested string .

>>  quote = region "<Q>" .. (shift.4 "</Q>")

If some components in the text are tagged with <Q> and </Q> this command creates a region set describing these components. Each region in the set extends from the opening tag <Q> to the end of the closing tag </Q> . By using the shift operator, applied to the </Q> , the members of the point set used to find the ends of the new regions all point to the end of the string </Q> rather than to the beginning of the tag.

>>  "From:"

>>  mess1 = region "From:" .. "From:"

>>  mess2 = region "From:" .. (shift.0 "From:") 
>>  from = region *mess1 .. "Received:"

>>  "Bill" within *from

This set of queries is being applied to a database of mail messages. Each message has the string From: at the beginning. The string Received: appears at the beginning of the second line of the message indicating the time the message was received. Assume that the first query, identifying the matches to the string From: , returns a set of size 10. Further assume that there is text in the database preceding the first From: . The next query creates a region set of size 11 as two additional regions are included in the resulting set: one containing the text from the beginning of the text to the first occurrence of From: and the other containing the text from the last occurrence of From: to the end of the text. The third query creates a region set of size 9 as these two regions are not included in the new set. The next query creates a region set describing the sender of the message. The final query finds the matches to Bill in the regions describing the sender of the message. Notice the use of an asterisk (* ) before the name of the new region set when it is used as an operand to a XPAT command.

Settings:

DefaultRegion

`sample`

sample set1

finds representative members of a larger set.

Sample creates a set containing a specified number of members of set1. Set1 may be either a region set or a point set. The new set is of the same type as set1.

The size of the set created is determined by the value of the SampleSize setting which has a default value of 10. If the size of set1 is less than SampleSize , then the new set created is the same size as set1. The size can be changed for all subsequent uses of sample during the current session by changing the SampleSize setting. For an individual use of the sample command, the setting can be changed by using a modifier attached to the command. The form of the modifier is a period followed by a number giving the desired size of the sample set.

The members of the sample set are chosen as follows. If the size of set1 is x and the sample size requested is y, each x/yth member of set1 is in the sample set. For example, if a sample of size 20 is requested from a set of size 2000, the 100th, 200th members etc. are chosen. The ordering of the set, and hence the members of the sample set, is determined when the set is created. The SortOrder setting does not determine which members are included in the set created by the sample command as it does for the subset , next , ~nextemp , and first commands. However, this setting does affect how the sample set is ordered when used with a pr command (or save command).
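The selection rule can be written out in Python as follows. This is a sketch only: the function name sample_members is hypothetical, and the use of integer division when x/y is not a whole number is an assumption, since the manual does not say how that case is handled.

```python
def sample_members(members, size=10):
    """Pick every (x/y)th member of a set of x members for a sample of size y."""
    x = len(members)
    if x <= size:
        return list(members)        # the whole set is smaller than the sample
    step = x // size                # assumption: round x/y down
    return [members[k * step - 1] for k in range(1, size + 1)]


# A set of 2000 members numbered 1..2000, sampled at size 20.
chosen = sample_members(list(range(1, 2001)), 20)
print(chosen[:3], chosen[-1], len(chosen))
```

The members chosen are the 100th, 200th and so on up to the 2000th, in the order in which the set was created.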

The sample command can be used by itself or with the pr , save or export commands. The sample command may only be used in conjunction with these commands.

Examples:

>>  sample "shaks"

Assuming a SampleSize setting of 10, this query creates a set of 10 examples from the set of matches to the string shaks .

>>  {SampleSize 30} 
>>  sample "shaks"

The first command changes the SampleSize setting to 30 and the second creates a set of 30 examples from the set of matches to the string shaks .

>>  region Quote including "Doyle" 
>>  sample.20 %

The first query creates a region set containing Quote regions that include the string Doyle . The second query creates a sample set of 20 members from the results of the first query.

>>  region Quote including (sample "Doyle")

This query is illegal and results in a syntax error.

Settings:

SampleSize , SortOrder

`{SampleSize}`

{SampleSize number}

specifies the size of the set produced by the sample , subset , and rankedby commands.

By default, sample and subset create a set of 10 members of a given set. This setting may be overridden and the size of the result determined by a modifier for an individual use of these commands. The SampleSize setting can be changed at any time during a XPAT session and remains in effect until changed again or until the end of the session. The current value of the SampleSize setting is displayed by the command {Settings} .

Examples:

>>  {SampleSize 200} 
>>  pr sample 5

This changes the SampleSize to 200 and any subsequent sample or subset command uses this value. In the second query XPAT prints information about 200 members of set number 5 created earlier in the session.

`save`

save set1

writes the contents of a set to a file.

The save command is identical to the pr command except that the output is written to a file. The name of the file where the information is written is determined by the value of the setting SaveFile . The default value of the setting is xpat.res . The file used by the save command can be changed at any time during the XPAT session by changing the setting. The information output by the save command is concatenated onto the end of the save file if one of the same name already exists. Otherwise, a new file is created and the information is written to the new file. Assuming the default setting of SaveFile , the following message is printed on execution of the save command:

Saving in xpat.res.

For each member in the given set, the output is in the form of an integer giving the offset of the set member in the text file followed by a comma, a blank, two periods and the characters surrounding the set member. The order in which the set is output is determined by the current SortOrder setting.

With no modifier, save outputs a line of text for each element in set1. The PrintLength and LeftContext settings determine the content of the line saved. With the default settings, the saved text is 64 characters in length of which 14 precede the match point. The number of characters to the left of the match point can be altered by changing the LeftContext setting. The total number of characters printed can be altered by changing the PrintLength setting.

The total number of characters to be saved can be set for a single instance of the command, by using a numeric modifier attached to the save command. The modifier is in the form of a period followed by a number giving the total number of characters to be saved. The left context that is saved is still determined by the value of the LeftContext setting.

The second form the modifier can have is a period followed by the string region . When the command save.region is used, the output text starts at the match point and continues to the end of the default region in which the match point occurs. If the match point is not within a default region, no output is saved for the match point.

The second form of the modifier can be refined to request that the text output be in a region other than the default region. An additional modifier, specifying a defined set of regions, can be attached to the already modified save command (i.e. to the save.region ). This additional modifier can be in one of three forms: a string giving the name of a predefined region, the number of a region set created in the XPAT session, or a string preceded by an asterisk (* ) referring to a named region set defined in the XPAT session. As with the form save.region , described above, this use results in the output text starting at the match point with no left context and continuing to the end of the region. When the match point is not contained in the designated set, no output is saved.

The similarly named commands save.commands and save.history result in very different behaviour and are described in separate entries.

Examples:

>>  "Helen Maday" 
>>  save

As a result of this command, XPAT writes a line of context for each member in the most recently created set. The information is written to the file that is named by the setting SaveFile . If the setting has not been changed during the session, the file used is xpat.res . Note that the information is appended to the save file if one of the same name already exists.

>>  save "From: Tony Lopez "

A line of context for each member of the set that matches the string From: Tony Lopez is written to the save file.

>>  {PrintLength 120} 
>>  save region including "planet"

A line of context is written for each member in the set of regions created by the including query. The line that is written starts at the beginning of each region in the new set. Since the PrintLength has been set to 120, each line contains 120 characters and has 14 characters to the left of the beginning of the displayed region.

>>  save.200 shift.-100 ("procedure" near "policy")

In this case, a line of 200 characters is written to the save file for each member in the set created by the query shift.-100 ("procedure" near "policy") . The text that is written starts 114 characters to the left of the string procedure .

>>  region including (region EQ including "<A>Doyle</A>") 
>>  save.region

The first query finds all the earliest quotes (defined by the region EQ ) that have Doyle as the author. The second command saves information about each of the default regions that includes one of these quotes. The information that is written for each of these regions contains the offset in the text file of the region, a comma, a blank followed by two periods and the text of the default region. As no set is given as an operand to the save command, it is understood that the command applies to the previous set.

>>  region Quote including ("stadium" near "Toronto")

>>  save.region.Quote %

The first query finds all quotes that contain the string stadium within 80 characters of the string Toronto . The second command saves information in the save file (xpat.res unless the SaveFile setting has been reset) about each of these regions. As in the example above, the output for each set member is in the form of an integer giving the text offset, a comma, a blank followed by two periods and the text beginning at the start of region Quote and continuing to the end of the region.

>>  save.region.5 "night" fby "day"

This command saves information about the set created by the query "night" fby "day" . The information written to the save file (after the offset, comma, blank and two periods) starts at the text night and continues to the end of the region defined by set number 5 that contains the match.

>>  minutes = region "<Min>" .. "</Min>"

>>  *minutes including "examination schedule" 
>>  save.region.*minutes

The first query in this example defines a set of regions that are named minutes . The second query finds the regions in this set that contain the string examination schedule . The third command saves information about this set in the save file. For each member in this set, the information saved contains the offset, comma, blank and two periods followed by the text of the region in the set named minutes .

Settings:

DefaultRegion , LeftContext , PrintLength , SaveFile , SortOrder

`save.commands`

save.commands

writes information to a file about the queries in the XPAT session. These are saved in a form that allows them to be used in another XPAT session.

Save.commands saves, in a file, all the queries that have been executed and have produced sets during the current session. These are the queries that appear in the history list. Only the command is saved in the file, not the set number or number of matches. The setting CommandFile , that determines the file where the information is written, has a default value of xpat.cmd . The output file can be changed at any time during the session by changing the CommandFile setting. If a file of this name already exists, the information is concatenated onto the end of the file. Otherwise a new file is created.

The saved information can be read into a XPAT session and executed using the exec command.

Examples:

>>  "love" near "hate" 
>>  pr sample 
>>  region Q including % 
>>  save.region.Q % 
>>  {CommandFile "/usr/my_commandfile"} 
>>  save.commands

The second-to-last command sets a new name for the file to be used by save.commands : /usr/my_commandfile . The final command saves the information about the commands that have been generated to this point in the XPAT session. In the portion of the session shown here, only two commands generated sets and so the following is saved in the file /usr/my_commandfile .

"love" near "hate"
region Q including %

Settings:

CommandFile

`{SaveFile}`

{SaveFile string}

changes the file name used by save .

The SaveFile setting determines the file written by the save command. It has a default value of xpat.res . If the string begins with a numeral, or contains blanks or non-alphanumeric characters, it must be enclosed within double quote marks. The file name must also conform to the file naming convention of the host operating system. It can be changed at any time during a XPAT session and remains in effect until it is changed again or until the end of the session. The current value of the SaveFile setting is displayed by the command {Settings} .

Examples:

>>  {SaveFile "output_file"}

This changes the setting to the value output_file so that any subsequent use of the save command writes to this file. The name of the file is not an absolute path name and is therefore located in the current working directory.

`save.history`

save.history

writes information to a file about the queries and results in the current XPAT session.

Save.history writes a record of a XPAT session. XPAT's history list records information about all queries that produce sets. For these queries, save.history saves the set number, the number of members in the set, and the query that produced the set in a file. The setting HistoryFile , that determines the file where the information is written, has a default value of xpat.his . A different output file can be chosen at any time during a session by changing the setting. If a file of this name already exists, the information is concatenated onto the end of the file. Otherwise a new file is created. Note that comments are saved only if they are on the same line as the command itself.

Examples:

>>  "fish" near "fowl" 
>>  pr 
>>  region definition including % 
>>  save.history

This command saves the information from the history list in the file xpat.his . After the sequence of commands shown above, the history list contains information about two sets which are saved into a file:

1:     142,  "fish" near "fowl"
2:      17,  region definition including %
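A history file in this form can be read back by other programs. The Python sketch below splits one entry into its set number, set size and query. It assumes each entry occupies a single line in the layout shown above; the pattern and the function name parse_history_line are hypothetical and are not part of XPAT.

```python
import re

# set number, colon, number of members, comma, then the query text
HISTORY_LINE = re.compile(r"^\s*(\d+):\s*(\d+),\s*(.*)$")


def parse_history_line(line):
    """Return (set_number, set_size, query), or None if the line does not fit."""
    found = HISTORY_LINE.match(line)
    if found is None:
        return None
    return int(found.group(1)), int(found.group(2)), found.group(3).rstrip()


print(parse_history_line('1:     142,  "fish" near "fowl"'))
print(parse_history_line("2:      17,  region definition including %"))
```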

Settings:

HistoryFile

`set name`

* name

refers to a named set.

Query result sets may be named and subsequently referred to either by set number or by name. The name must be preceded by an asterisk (* ) to reference the set; otherwise, XPAT interprets the name as a command or search string.

Examples:

>>  univ = "university" near "MIT" 
>>  qu = region Quote including * univ

>>  * qu including "Harvard"

The first query creates a set of matches to university occurring near MIT . The second query uses the set *univ and creates a new set *qu . The third query finds the Quote regions that include the set of matches from the first query as well as the string Harvard .

>>  begin = region "<Title>" .. "</Summary>" 
>> "Paris" within * begin

The first query defines a new region set and calls the new set begin . The second query creates a set containing the matches to Paris that fall within one of these regions.

`set number`

number

references a previously created set.

After the first query in a session, XPAT displays a line of the form:

1: 300 matches

The number 1 here names the set of results and can be used in subsequent searches. The valid set numbers are those displayed by the history command.

When an invalid set number is used, XPAT generates a message. If, for example, set number 33 is referenced before it has been calculated or after it has been freed, the message is:

Expression 33 is out of range

Examples:

>>  region Author including 5

In this query, the number 5 refers to the fifth result of the session. For example, set 5 might be the set of matches to all the variants of spelling for a particular author's name.

Settings:

History

`sets`

In the XPAT system, queries are combinations of the XPAT commands described in this document. In response to each query, XPAT creates a set which is either a point set or a region set.

These sets can be used as operands in subsequent queries. In contrast to the conventional approach of a single, nested compound query, XPAT allows complex queries to be expressed as a series of simple queries. This provides an opportunity to try alternative ways of combining previous result sets to arrive at a solution. XPAT provides a history list of all previous sets created in a session and a convenient notation to access them.

A member of a point set is a location in the text which is the start of a string that continues to the end of the text. The XPAT system finds locations in the text where strings matching the pattern(s) given in the query begin. The members of a point set are usually index points ; however, in the sets created by shift or offsets (the notation [n] ), the members refer to positions in the text that may or may not be index points.

The members of a region set are substrings of the text, beginning and ending at specified points. Region sets that are the result of a query within a XPAT session are available only for the duration of the session. However, region sets can also be defined externally and be made available to the XPAT session. Each member of a region set is described by two locations in the database, indicating the start and end of the region; these locations may or may not be index points.

Region sets may be used in XPAT to restrict searches to desired parts of the text. The within command finds the members of a set that are contained in a designated region set. The including command finds the members of the designated region set that contain one or more members of a given set.

The sets produced within XPAT can be refined using set arithmetic or proximity commands. The difference (- ) and intersection (^ ) commands remove members of an existing set. The proximity (fby and near ) commands reduce sets by finding the members of a given set that have specified text close by.

In addition to refining sets, it is possible to combine two sets to create a larger one by using the union (+ ) command.

XPAT queries, applied to a text database, may create large sets; analyzing a smaller, representative subset might aid in making decisions about how to proceed appropriately with a search. Several commands in XPAT provide this capability. Sample creates a representative subset while subset , next , and first each create contiguous smaller subsets of a larger set.

`{Settings}`

{Settings}

shows the current values of a number of XPAT parameters.

Examples:

>>  {Settings}

The output might be:

{CharMappings " " "Aa" "Bb" "Cc" "Dd" "Ee" "Ff" "Gg" "Hh" "Ii" "Jj"
 "Kk" "Ll" "Mm" "Nn" "Oo" "Pp" "Qq" "Rr" "Ss" "Tt" "Uu" "Vv" "Ww" "Xx"
 "Yy" "Zz" "[ " "\\ " "] " "^ " "_ " "` " "{ " "| " "} " "~ "}
{StopWords}
{WordStarters  " \P" "\P-" "-\P" "\P<" "\P&."}
{SortOrder AsIs}
{PrintLength 64}
{LeftContext 14}
{Proximity 80}
{SampleSize 10}
{SaveFile xpat.res}
{CommandFile xpat.cmd}
{ExportFile xpat.exp}
{HistoryFile xpat.his}
{History 0}
{QuietOff}

`shift`

shift set1

creates a point set whose members are a specified distance from a set of matches.

Shift creates a new set whose members are locations in the text which result from an equal shift being applied to all members of set1. Set1 may be either a point set or a region set. If it is a region set, the resulting set is a point set containing the first of the pair of pointers describing each member of the set. The set created by the shift command is always a point set.

By default, the shift command creates a new set consisting of pointers that are 10 characters after the original set members. The set created is in occurrence order. The default shift distance (10) and direction can be changed by using a modifier attached to the shift command. The modifier takes the form of a period followed by a positive or negative integer. If the modifier is a negative integer -n, each member of set1 is shifted to n characters before the original location. If it is a positive integer n, the shift is to n characters after that location.

The points in the new set need not be index points.
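
The behaviour described above can be modelled outside XPAT. The following Python sketch is not XPAT code and makes several assumptions: a point set is represented as a list of integer text offsets, a region set as a list of (start, end) pairs, and the function name shift_set is hypothetical.

```python
def shift_set(members, distance=10):
    """Model of shift: displace every member by `distance` characters.

    `members` is a list of integer offsets (a point set) or a list of
    (start, end) pairs (a region set). For a region only the start pointer
    is used, so the result is always a point set. The result is returned
    in occurrence order, that is, sorted by position in the text.
    """
    starts = [m[0] if isinstance(m, tuple) else m for m in members]
    return sorted(s + distance for s in starts)


text = "<Tag>alpha</Tag> and <Tag>beta</Tag>"
# Stand-in for the string search "<Tag>": every offset where the tag begins.
matches = [i for i in range(len(text)) if text.startswith("<Tag>", i)]
# Counterpart of shift.5 applied to those matches.
content_starts = shift_set(matches, 5)
print([text[p:p + 4] for p in content_starts])
```

Because the string <Tag> is five characters long, the shifted points fall on the first character of each tagged region's contents.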

Examples:

>>  shift "dog and cat"

This creates a new point set whose members are 10 characters after each match to the string dog and cat .

>>  "<Tag>" 
>>  pr 
>>  shift.5 % 
>>  pr

The first query creates a point set of matches to the string <Tag> . The ordering of this set is alphabetical. The third line of the example creates a second point set whose members start 5 characters after a string <Tag> . The members now point to the start of the contents of the tagged region. This set is in occurrence order. Thus, the order in which the members of the set are displayed by the second pr is not necessarily the same as that seen with the first pr command.

>>  shift.5 region Tag 
>>  pr

Assuming that region Tag describes a set that begins with a tag <Tag> , the first query creates a point set whose members now point to the start of the contents of Tag regions. The set created is the same as the one in the above example.

>>  pr shift.-20 "the best of times"

This displays more context before the match points without having to change the LeftContext setting. The pr command displays members of a set of matches to the string the best of times with 34 characters showing before the matches (the default 14 characters plus 20 additional resulting from the shift command). Because of the shift command, this set is displayed in occurrence order. Note, as with other commands, since the pr and shift are on the same line the set is not saved, does not appear on the history list and so is not accessible by a set number.

`signif`

signif set1

finds frequently occurring words or phrases in the text.

Signif finds the most frequent words or phrases following the text matching set1. Set1 can be either a point set or a region set for the first two forms of the command but has a restriction, as noted below, for the third form of the command. The set (or sets) created are point sets. If set1 is not given, signif uses the last set produced as its operand.

The signif command looks for words or phrases, in contrast to the other XPAT commands that operate on points and regions. For signif , a word is defined as a string of characters ending either in a blank or a character that has been mapped to a blank by the character mappings used in building the index being used in the current XPAT session.

The signif command examines the words or phrases that begin with a string. The string used is the longest string common to the text pointed to by each of the members of set1. If set1 is a point set resulting from a string search, signif starts with that string. If set1 is not given and the previous result was from a signif command, the string is the one associated with the set created. (Note that, in addition to the number of matches in the set, signif returns a string value.) If set1 is a region set, the first of the pairs of pointers describing the regions are used to find any common string beginning in these regions. For example, this might be the pattern used to define the regions. If set1 is a point set that is not the result of a string search, XPAT checks for a common string beginning the text pointed to by each member of set1. In some cases, this common string is the null (empty) string.

Signif has three different modes of operation. A modifier can be attached to the signif command. The syntax of the modifier is a period followed by an integer n.

The first mode of the command, signif with no modifier, finds a frequent word or phrase and then, by reapplying signif to the resulting set, may be used to extend this phrase. In this mode, the command finds the string and then finds all possible extensions, in the text, of this string up to the next blank. Signif creates the set, among the matches to these possible extensions, with the most members.
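
The unmodified mode can be approximated in a few lines of Python. This is a sketch under stated assumptions, not the XPAT implementation: every character position is treated as a candidate match (a real index would consider only index points), no character mappings or stopwords are applied, and the function name signif is reused here purely for illustration.

```python
from collections import defaultdict


def signif(text, prefix):
    """Return (phrase, positions) for the most frequent extension of
    `prefix` up to and including the next blank."""
    groups = defaultdict(list)
    start = text.find(prefix)
    while start != -1:
        # Look for the next blank after the end of the prefix.
        end = text.find(" ", start + len(prefix))
        # Keep the terminating blank; at end of text take what is left.
        phrase = text[start:end + 1] if end != -1 else text[start:]
        groups[phrase].append(start)
        start = text.find(prefix, start + 1)
    if not groups:
        return "", []
    # The extension with the most members wins (first one found on ties).
    best = max(groups, key=lambda p: len(groups[p]))
    return best, groups[best]


text = "to be or not to be that is to bear to be "
phrase, points = signif(text, "to be")
print(repr(phrase), points)
```

In this sample text the extensions of to be are the phrase to be followed by a blank (three matches) and to bear (one match), so the former is chosen. Reapplying the function to the returned phrase, which ends in a blank, extends the result by one more word.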

The second mode of the command, signif with a positive integer n as modifier, finds the most commonly occurring phrase of length n words beginning with the string. The set created contains those members of set1 that are the matches to the most frequent phrase beginning with the given string that is at least n words in length.

The third mode of the command, signif with a negative integer -n as modifier, creates n sets which are the matches to the n most frequent phrases beginning with the string. This use of the signif command is restricted to sets matching a string or a set created by a range command. Using any other set as an operand is illegal and results in the error message:

Repetitive signif should be on strings or ranges only

Note that the set created by signif is identical to that created by signif.-1 but not necessarily to the one created by signif.1 . (See the examples.)

The displayed output from the signif command gives the number of matches and the word or phrase found (preceded by text=). The text is shown with the character mappings applied and stop words removed (see data dictionary documentation). For example, if > has been mapped to a blank and the word the is not a stopword, one would see the following:

  >> signif  "<HL>"
    2: 604 matches, text=<hl the

Examples:

>>  signif ""
>>  signif
>>  signif.3 ""
>>  signif.-10 ""

The above queries, except the second one, operate on the entire text. In these cases, the string that the signif command starts with is the empty string. The first query finds the most frequent word or phrase that occurs in the text. The second query operates on the set created by the first query and extends this result by one word. The third query finds the most frequent phrase of at least three words that occurs in the text. The fourth query finds the ten most frequent words or phrases within the text.

>>  signif "y"

This finds the most frequent word that starts with the letter y in the text database.

>>  signif "to be"

Note that the string to be used by the signif command does not end in a blank. Since signif looks at all extensions of the string up to the next blank, the only phrases eligible to be the most frequent phrase starting with this string are two-word phrases. In many texts, the most likely set created with this command would be the point set matching the two-word phrase to be (ending in a blank).

>>  signif "to be "

In this example, the string given ends in a blank so the possible extensions of this string that are examined by signif are all three-word phrases. The point set created as the answer to this command is the set of matches to the three-word phrase starting with the two words to be that occurs most frequently in the text.

>>  signif.1 "to be"

This command creates a set of the most frequent phrase whose word length is one. The newly created set is the set of matches to the word to that are contained in the set of matches to the phrase to be . This means that the size of the set is probably smaller than the set of matches to to but that the text shown is the string to . Also note that this set is not equal to the set created in the preceding example or to the set created in the following example.

>>  signif.-1 "to be"

This command finds the most frequent phrase that begins with the string to be . The answer to this is a two-word phrase that is identical to that found by the unmodified query signif "to be" above.

>>  signif.-3 "to be"

This command finds three sets that are the matches to the three most frequent phrases beginning with the string to be . The first set is the same set created in the example before last. The next set is created by applying signif to two sets and comparing the resulting sets. Signif is first applied to a set that is created by taking the difference between the set represented by the original string and the new set just created. Signif is also applied to the set just created. The larger set from these two signif applications is the second answer. This same procedure is repeated on the original set and on the two new sets to obtain the third set.

>>  signif.4 "to be" 
>>  signif

The first command creates the set of matches to the most frequent four-word phrase that begins with to be . The second signif is applied to the resulting four-word phrase. Since this result ends in a blank, the second signif searches for the most frequently occurring five-word phrase that begins with the four words located by the first command.

>>  "aba" .. "abz" 
>>  signif

The first command creates a point set that matches all strings that are alphabetically between aba and abz . Signif applied to this set creates a set of matches to the most frequent word in the text that begins with ab .

>>  auth = region "<A>" .. "</A>" 
>>  signif *auth 
>>  signif.2 *auth 
>>  signif.-4 *auth

The first command creates a region set. The second command creates a set representing the most frequent string at the beginning of these regions. Assuming that in the character mappings the > is mapped to a blank, the string that signif uses to find the extension consists at least of the word <A> . Thus, the set created is the set of matches to the string <A> followed by at least one other word. The third command gives the most common two-word phrase starting the region set named auth . The first word of this phrase will be <A> . The last command is illegal and results in the following error message:

Repetitive signif should be on strings or ranges only

>>  signif.-4 "<A>"

This command creates four sets. These are the sets of matches to the four most frequent phrases starting with the string <A> . Notice that, if the > has been mapped to a blank, the phrases are at least two words in length.

>>  sample.100 "<A>" 
>>  signif.-2 %

This use of signif is illegal since signif with a negative modifier may be applied only to a set matching a string or created by a range command. An error message is generated.

`{SortOrder}`

{SortOrder value}

determines the ordering of a set.

The behaviour of first , next , pr , save , subset and ~nextemp is affected by the ordering associated with their operands. The SortOrder setting indicates whether these sets are to be treated in alphabetical order or in the order that members of the set occur in the text (occurrence order). (A SortOrder setting of OccurHead (explanation below) also determines what is displayed by the pr and save commands.)

Every set in XPAT has an internal ordering which varies from set to set, as described below. The ordering is chosen by XPAT, for reasons of efficiency, and no assumptions can be made in this regard. It is often desirable, however, to present results in a certain order, and the SortOrder setting exists to control this. When a set is an operand of a pr or save command or a new set is created by a first , next , subset or ~nextemp command, the ordering of the set and hence the behaviour of the command is determined by the SortOrder setting. This may mean that the existing set must be reordered for processing with these commands. For some XPAT commands this results in a change in the internal ordering, and this change is reflected when subsequently operating with a SortOrder setting of AsIs (explanation below). The ordering of a set that is not an operand to one of the above commands is not affected when the SortOrder setting changes.

The permissible values for the SortOrder setting are AsIs , Alpha , Occur and OccurHead . The default value of SortOrder is AsIs . If the SortOrder setting is AsIs , the set is processed in the order in which it currently exists. If the SortOrder setting is Alpha , the set is processed in alphabetical order. If the SortOrder setting is Occur or OccurHead , the set is processed in occurrence order. For sets whose internal ordering is not alphabetical, for example region sets, displaying results with a SortOrder setting of Alpha will require resorting which may result in additional computation delay depending on the set size.

Setting the SortOrder setting to OccurHead results in further changes to the behaviour of the pr and save commands. For pr and save , with a SortOrder setting of AsIs , Alpha or Occur , the position offset for each set member is displayed. With a SortOrder setting of OccurHead , the contents of a named region set are output in place of the position offset. Setting SortOrder to OccurHead requires reference to two regions within the brace brackets used to change the SortOrder setting. The first region referenced is the one whose contents are displayed in place of an offset when members of a set are displayed. The second region referenced must be one that contains both the match points of the members of a set and the first region referenced in the SortOrder setting (OccurHead ).

The text displayed in place of the offset, as a result of SortOrder being set to OccurHead , is the first region found of the specified type within the containing region (also specified in the OccurHead setting). If the text to be displayed begins with an opening angle bracket, the text until the closing angle bracket is ignored and the next character is displayed. If the next character is another angle bracket, the preceding process is repeated iteratively. A maximum of 10 characters or up to the next < in the text is displayed. Both these region sets, named in the OccurHead setting, must be in the data dictionary. If they are not, for example if X is named in the setting but does not exist in the data dictionary, the following error message results:

No information for region X in the data dictionary
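
The rule for deriving the displayed text from a region's contents can be sketched in Python as follows. The function name occurhead_label is hypothetical, and the handling of an unterminated tag (returning an empty string) is an assumption, since the manual does not describe that case.

```python
def occurhead_label(region_text, max_chars=10):
    """Model of the text shown in place of the offset under OccurHead."""
    pos = 0
    # Skip leading tags such as "<LF>" or "<B><I>", one after another.
    while pos < len(region_text) and region_text[pos] == "<":
        close = region_text.find(">", pos)
        if close == -1:
            return ""  # assumption: an unterminated tag yields nothing
        pos = close + 1
    rest = region_text[pos:]
    # Stop at the next "<" if there is one ...
    cut = rest.find("<")
    if cut != -1:
        rest = rest[:cut]
    # ... and never show more than max_chars characters.
    return rest[:max_chars]


print(occurhead_label("<LF>Hamlet, Prince of Denmark</LF>"))
print(occurhead_label("<LF><B>Act I</B></LF>"))
```

The first call is cut off by the 10-character limit and the second by the next opening angle bracket.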

The SortOrder setting can be changed at any time during a XPAT session and remains in effect until it is changed again or until the end of the session. The current value of the SortOrder setting is displayed by the command {Settings} .

Examples:

>>  {SortOrder Alpha} 
>>  pr %

The first command sets the SortOrder setting so that the displayed set, following the pr command, is in alphabetical order.

>>  {SortOrder Occur} 
>>  sample "Moriarity" 
>>  pr % 
>>  {SortOrder AsIs} 
>>  pr %

The sample set created in the second line of the example has an alphabetical ordering. With the SortOrder setting of Occur , the first pr displays the set in occurrence order. However, after the SortOrder is reset to AsIs , the second pr displays the set in alphabetical order. Note that the sample set is not affected by the SortOrder setting at the time of its creation and that the reordering for printing is temporary.

>>  {SortOrder OccurHead LF E} 
>>  pr "shaks"

One effect of setting the SortOrder to OccurHead is that the set is displayed in occurrence order by the pr command. The beginning of each line, following the pr command, contains the starting characters of the first region, named LF , that occurs within the region named E containing a member of the point set matching the string shaks .

>>  Ondaatje
>>  pr subset % 
>>  {SortOrder Occur} 
>>  pr subset % 
>>  {SortOrder AsIs} 
>>  pr %

The subset displayed by the first pr command is shown in alphabetical order. With the SortOrder set to Occur , the next pr command displays the subset in occurrence order. With the SortOrder set to AsIs , the final pr displays the set of matches to the string Ondaatje in occurrence order since this point set was reordered permanently as a result of the subset command executed when the SortOrder was Occur .

`stop`

stop

terminates a XPAT session.

The use of this command causes the session to end and the XPAT process to exit. A message may be generated telling how much computer time has been used during the XPAT session.

`string search`

A command consisting only of a string causes XPAT to search for occurrences of the string in the text database. A set is created whose members are matches to all index points in the text that begin with the given string. A match occurs when the given string (after having the character mappings applied to it and stopwords removed) is the same as the text that begins at an index point (also having had the character mappings applied and stopwords removed). Searching for phrases with a XPAT index is as fast as searching for a word or a prefix of a word. After a search, the number of matches to the pattern is displayed, but the results of the search are not shown unless requested by a pr command.

Examples:

>>  in

If the index currently being used is based on words, the matches returned from this input string are the matches to all the phrases in the text that begin with the two characters in . That is, there will be matches to strings beginning with the word in as well as to strings beginning with inside , into , etc. In order to match only strings beginning with the word in , a blank must be added to the search string and the string enclosed within quotation marks.

If each character is indexed, the matches returned would also include strings that appear as part of words such as within and getting .

If the index has been made with character mappings that map upper case to lower case, the matches would also include matches to strings that include In .

>>  "to be or not to be that is the question"

If the index used when searching for the above string was created with the stopwords to , be , or , not , that , is and the , this string search is equivalent to a search on the string question .
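
The matching rules in the two examples above can be illustrated with a small Python model. It is a sketch only: the character mapping used here (fold case, map every non-alphanumeric character to a blank) and the choice of word starts as index points are assumptions standing in for whatever the data dictionary of a real index defines, and all function names are hypothetical.

```python
def map_chars(s):
    """Toy character mapping: fold case, map non-alphanumerics to blank."""
    return "".join(c.lower() if c.isalnum() else " " for c in s)


def word_index_points(text):
    """Offsets at which a word begins (a word-based index)."""
    mapped = map_chars(text)
    return [i for i, c in enumerate(mapped)
            if c != " " and (i == 0 or mapped[i - 1] == " ")]


def string_search(text, query):
    """Index points whose following text begins with the mapped query."""
    mapped, pattern = map_chars(text), map_chars(query)
    return [i for i in word_index_points(text)
            if mapped.startswith(pattern, i)]


def drop_stopwords(s, stopwords):
    """Mapped form of `s` with stopwords removed."""
    return " ".join(w for w in map_chars(s).split() if w not in stopwords)


text = "In the inn, go into it within"
print(string_search(text, "in"))    # prefix search
print(string_search(text, "in "))   # whole word only
stop = {"to", "be", "or", "not", "that", "is", "the"}
print(drop_stopwords("to be or not to be that is the question", stop))
```

With this word-based model the search for in matches In , inn and into but not within , and adding the trailing blank restricts the result to the word In alone.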

`subset`

subset set1

finds a number of contiguous members of a set.

Subset creates a set of a specified size containing members starting at a designated location in set1. The members of the new set are in the order they appear in set1. Set1 may be a region set or a point set. The new set is of the same type as set1.

The operation of the subset command is affected by the size of the set requested and the SortOrder setting.

The ordering of a set, and hence which members are chosen to be in the set created by the subset command, is controlled by the SortOrder setting. If the SortOrder setting is Alpha , the set is ordered alphabetically; if it is Occur or OccurHead , the set is ordered as the members occur in the text; and if the SortOrder setting is AsIs , the set order is the current one and may thus be either alphabetic or occurrence order. The location within set1 to start selecting members for the new set is indicated by a numeric location in the ordered set. This numeric location is given to the subset command as a modifier attached to the command. The modifier is in the form of a period followed by an integer that can be either positive or negative. A positive integer gives the desired location relative to the beginning of the set and a negative integer gives it relative to the end of the set. Without any modifier the subset is taken from the start of set1.

The size of the set created is determined by the value of the setting SampleSize which has a default value of 10. If the size of set1 is less than SampleSize , then the new set created is the same size as set1. Changing the SampleSize setting affects all subsequent uses of subset during the current session. The size of the subset can be specified for an individual use of the command by using a second modifier attached to the already modified subset command. This modifier is also of the form of a period followed by a numeric value giving the desired set size.
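
The interaction of the location modifier and the size can be modelled with list slicing. The following Python sketch assumes that set1 is already ordered according to the SortOrder setting and is represented as a list; the function name subset is reused for illustration, and rejecting a location of 0 is an assumption, since the manual does not say how XPAT treats it.

```python
def subset(members, start=1, size=10):
    """Model of subset.start.size on an already ordered set.

    `start` is a 1-based location: positive values count from the
    beginning of the set, negative values from the end. `size` plays the
    role of SampleSize (default 10) or of the second modifier.
    """
    if start == 0:
        raise ValueError("location is 1-based and may not be 0")
    begin = start - 1 if start > 0 else max(len(members) + start, 0)
    # Slicing stops at the end of the set, so the result may be shorter.
    return members[begin:begin + size]


matches = list(range(1, 101))            # a set of 100 members
print(len(subset(matches, 10, 40)))      # 40 members from the tenth
print(len(subset(matches, -10, 40)))     # only 10 members remain
```

The second call shows why a subset taken near the end of a set can be smaller than the requested size.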

The subset command can be used by itself or with the pr , save or export commands. It may be used in conjunction with these commands only.

Examples:

>>  {SampleSize 40} 
>>  subset %

The first command changes the SampleSize setting and the query in the second line returns a set that contains the first 40 members of the most recent result in the session.

>>  subset.10 "Montreal "

This query creates a set of 40 members (assuming the SampleSize setting in the first example) starting at the tenth member in the set of matches to the string Montreal .

>>  subset.-10 "Montreal "

This is similar to the previous query but the new set starts at the tenth member from the end of the set. Therefore, the resulting set size is only 10 even though the SampleSize setting is 40.

>>  subset.5.30 %

This query creates a set of 30 members starting from the fifth member of the most recent result in the session.

>>  {SortOrder Occur} 
>>  subset.-20.20 5

The query in the second line of the example creates a set containing the final 20 members in the set represented by set number 5. The SortOrder setting means that both set number 5 and the new set are in occurrence order.

Settings:

SampleSize , SortOrder

`~sync`

~sync string

outputs a tagged identifier.

The command ~sync is available only when the XPAT session is operating in quiet mode. ~sync outputs a message tagged with Sync tags containing the given string. This command is mainly used when XPAT is integrated into a more complex system. The output from the ~sync command can then be used to identify a position in an input stream when information is being received from several different sources.

Examples:

>>  ~sync "festival"

The output from this command is the tagged string: <Sync>festival</Sync> .

`thesaurus`

provides an efficient way to describe patterns that have some common quality. For example, if many searches of a database involve finding references in the text to money in different currencies, the thesaurus provides the capability to define a variable describing all the possible patterns to be used in these searches.

The thesaurus variable is defined in a file named in the data dictionary. Within the thesaurus file, each separate variable, called a word, is surrounded by <Entry> tags. Within the <Entry> tags are other tagged areas: the name of the variable is contained within <Word> tags, followed by the associated query contained within <Query> tags. The thesaurus capability is implemented using macros so the query may be a complex one creating more than one set. The same cautions, described for macros, apply to bracketing and syntax errors.

To invoke a thesaurus variable the name is preceded by the character <. For example, a thesaurus variable named money may be used within a XPAT session as follows:

<"money"

As with macros, thesaurus invocations are replaced by an exact copy of the definition. This means that a thesaurus variable can be used as an operand in other XPAT queries. Note, however, that it may be necessary to bracket the entire invocation in order to ensure correct results from the query. In practice, bracketing the definition itself is a good general method.

If an undefined thesaurus variable is used, for example <testing , the following error message is generated:

The macro testing is undefined

Examples:

>>  (<"policy" ) near (<"economy" )

Assume that the following tagged data appears in the thesaurus file referenced in the data dictionary for the XPAT session.

<Entry>
 <Word>economy</Word>
 <Query>("economic " + "fiscal " + "monetary " + "economy")</Query>
</Entry>
<Entry>
 <Word>policy</Word>
 <Query>("policy " + "policies ")</Query>
</Entry>

The query shown finds any matches to either of the strings described by the thesaurus variable policy that occur near any of the strings that are part of the union described by the thesaurus variable economy .
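
Since a thesaurus invocation is replaced by an exact copy of its definition, the substitution can be modelled as plain text replacement. The Python sketch below is illustrative only: the regular expressions, the function names load_thesaurus and expand, and the use of KeyError to carry the error message are assumptions, not part of XPAT.

```python
import re

ENTRY = re.compile(
    r"<Entry>\s*<Word>(.*?)</Word>\s*<Query>(.*?)</Query>\s*</Entry>", re.S)
INVOKE = re.compile(r'<"([^"]+)"')


def load_thesaurus(source):
    """Map each <Word> to the text of its <Query>."""
    return {word: query for word, query in ENTRY.findall(source)}


def expand(query, thesaurus):
    """Replace every <"name" invocation by a copy of its definition."""
    def replace(match):
        name = match.group(1)
        if name not in thesaurus:
            raise KeyError(f"The macro {name} is undefined")
        return thesaurus[name]
    return INVOKE.sub(replace, query)


source = """
<Entry>
 <Word>economy</Word>
 <Query>("economic " + "fiscal " + "monetary " + "economy")</Query>
</Entry>
<Entry>
 <Word>policy</Word>
 <Query>("policy " + "policies ")</Query>
</Entry>
"""
thesaurus = load_thesaurus(source)
print(expand('(<"policy" ) near (<"economy" )', thesaurus))
```

The expanded query shows why bracketing the definition itself is a good general method: each union stays grouped when it becomes an operand of near .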

>>  *speaker including <"macbeth"

This assumes that the following tagged data appears in the thesaurus file.

<Entry>
 <Word>macbeth</Word>
 <Query>"macbeth" - (shift.5 "lady macbeth")</Query>
</Entry>

The query shown finds those members of the region set defined by the name speaker that contain Macbeth but not Lady Macbeth .

`union`

set1 + set2

combines two sets.

The union operator (+ ) creates a new set containing the members of both set1 and set2, with duplicates removed. Set1 and set2 can be either point sets or region sets. If either of set1 or set2 is a point set the new set is also a point set.

If both set1 and set2 are region sets and there is no overlap or nesting of any member from set1 and any member from set2, the union set is a region set. If overlap or nesting occurs, set1 and set2 are treated as point sets by using the first of each pair of pointers describing the regions in the sets. The new set created is the union of these point sets. The following message is generated when this occurs:

Warning: Addition of Region objects produced a region
with overlaps -- simplified into a point set

Note that if both set1 and set2 are region sets and a member of set1 coincides exactly with a member of set2, this is not considered to be an instance of overlap or nesting; rather, these members are considered identical and only one will be a member of the output set.
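
The rules for the type of the resulting set can be summarised in a short Python model. It is a sketch under assumptions: points are integers, regions are (start, end) pairs with inclusive ends, a region combined with a point set contributes its start pointer, and the function names are hypothetical.

```python
def is_region_set(s):
    """True when every member is a (start, end) pair."""
    return all(isinstance(m, tuple) for m in s)


def starts(s):
    """First pointer of each member (the point itself for a point set)."""
    return [m[0] if isinstance(m, tuple) else m for m in s]


def union(set1, set2):
    """Model of set1 + set2 with duplicates removed."""
    if is_region_set(set1) and is_region_set(set2):
        # Identical regions are duplicates, not overlaps.
        clash = any(a != b and a[0] <= b[1] and b[0] <= a[1]
                    for a in set1 for b in set2)
        if not clash:
            return sorted(set(set1) | set(set2))
        print("Warning: Addition of Region objects produced a region\n"
              "with overlaps -- simplified into a point set")
    # Point-set result: fall back to the first pointer of each member.
    return sorted(set(starts(set1)) | set(starts(set2)))


print(union([(0, 4)], [(0, 4), (10, 14)]))   # still a region set
print(union([(0, 10)], [(5, 14)]))           # simplified into points
```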

Examples:

>>  USA + "U.S.A" + "United States" + "America "

This query creates a new point set containing all matches to each of the individual sets in the query.

>>  region Title + region Summary

Assuming that the members of region Title and of region Summary do not overlap or nest, this query creates a new region set containing all the members of both regions.

>>  *your_region + region First

Assuming that your_region was created during the current XPAT session and contains a member that overlaps one or more members of region First , this query creates a point set. The members in the new set consist of the first of the two pointers that describe the members of your_region and of region First . XPAT prints a warning message before printing the number of matches in the new set.

`within`

set1 within set2

finds members of a set within a given region.

Within creates a set containing those members of set1 that are located in one of the regions of the text described by set2. Set1 may be either a point set or a region set. Set2 must be a region set. The new set is of the same type as set1.

Set2 may be a predefined region set, a region set that has been created within the XPAT session using the region command, a region set resulting from the use of the import command, or the result of a previous query in the session.

If set1 is a point set, each member is examined to see if it falls within a region from set2 in order to determine inclusion in the new set. If set1 is a region set, the first of the pair of pointers (offsets into the text) describing each member is examined to see if it falls within a region of set2. The second pointer of the pair does not have to fall within a region of set2 for the region to be included in the new set. That is to say, if set1 and set2 are both region sets and they overlap, members of set1 are included in the result of within if they begin within a member of set2.

The command not within creates a set containing those members of set1 that are not in any of the regions described by set2.

 set1 not within set2

is the same as

 set1 - (set1 within set2)
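
The containment test and the equivalence just given can be modelled in Python. The sketch assumes that points are integers and regions are (start, end) pairs with inclusive ends; the function names are hypothetical.

```python
def start_of(member):
    """First pointer of a region, or the point itself."""
    return member[0] if isinstance(member, tuple) else member


def within(set1, regions):
    """Members of set1 whose start falls inside some region of `regions`."""
    return [m for m in set1
            if any(lo <= start_of(m) <= hi for lo, hi in regions)]


def not_within(set1, regions):
    """set1 - (set1 within set2)"""
    inside = within(set1, regions)
    return [m for m in set1 if m not in inside]


speaker = [(10, 20), (40, 50)]
print(within([5, 12, 45, 60], speaker))
print(not_within([5, 12, 45, 60], speaker))
# A region that begins inside a speaker region is kept even though it
# ends outside; one that begins outside is dropped.
print(within([(15, 30), (25, 35)], speaker))
```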

Examples:

>> "Cohen" within region Speaker

In this example, the predefined region Speaker defines regions of the text that contain speakers' names. This query creates a set of matches to Cohen that fall within the regions described by region Speaker .

>> "Fontaine" not within region Speaker

This query finds all references to Fontaine that are not located within one of the regions describing a speaker.

>>  first = region "<Etym>" .. "</Language>" 
>>  ("Spanish" within region Language) within *first

The first query defines regions of the text that start at the string <Etym> and end with the string </Language> . The second query finds all the matches to Spanish that are within a Language region and also within one of the newly defined regions.
