MULTIRGN - XPAT multiple region index builder (man page)

MULTIRGN

Section: User Commands (1)

NAME

multirgn - XPAT multiple region index builder

SYNOPSIS

multirgn [ -v | -v1 | -v2 ] [ -f ] [ -merge ] [ -o output_prefix ] [ -sw tag_position_file ] [ -meta meta_structure_file [ -displayfmt format_name ] ] -D data_dictionary -t tagname_file

DESCRIPTION

multirgn takes two main arguments: the data_dictionary and the name of a file identifying the tags that define the regions. In one pass over the text, it produces a region index file containing the region indices of all the regions specified in the tagname_file. The region index file has the format described in the regions(5) man page. By default, this region index file is named with the same prefix as the data_dictionary, and with the suffix, `.rgn'. An alternate name can be specified using the -o option, described below. multirgn also updates (or creates) the `Region' sections in the data_dictionary for the regions that it builds. If the .rgn file already exists at the time multirgn is invoked, multirgn will prompt the user to confirm that the existing file can be overwritten.

Most text databases contain several types of structural elements. The structural elements in a newspaper database may include, for example, Stories, Headlines, Bylines, Dates and Paragraphs. These structural elements are known in xpat as regions and, in many databases, are delimited by tags. The tags are usually surrounded by angle brackets (the characters `<' and `>') to distinguish them from the main body of the text. These tags commonly occur in start-end pairs, with the end tag being distinguished from the start tag by a slash (`/') character after the opening angle bracket.

For example, headings in a text might be tagged as:

<Heading>The text of the heading</Heading>

multirgn only recognizes tags using the above angle bracket syntax, and will only write the region pointers for the regions delimited by the tags specified in the tagname_file. The format of that file is described below.
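The tag recognition described above can be illustrated with a short Python sketch. It is not multirgn's implementation: the regular expression, the function name find_tags and the rule for what counts as a tag name are assumptions made for illustration, and matching is assumed to be case sensitive.

```python
import re

# Assumption: a tag name starts with a letter and continues with letters,
# digits, '.', '-' or '_'. A start tag may carry attributes after the name.
TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9._-]*)[^<>]*>")

def find_tags(text, tagnames):
    """Return (kind, name, offset of '<', offset of '>') for each listed tag.

    Tags whose names are not in tagnames are skipped, mirroring the way
    only the tags named in the tagname_file produce region pointers.
    """
    wanted = set(tagnames)
    events = []
    for m in TAG_RE.finditer(text):
        name = m.group(2)
        if name in wanted:
            kind = "end" if m.group(1) else "start"
            events.append((kind, name, m.start(), m.end() - 1))
    return events

text = "<Heading>The text of the heading</Heading> <B>x</B>"
for event in find_tags(text, ["Heading"]):
    print(event)
```

In this sketch the <B> pair is ignored because B is not in the list of tag names.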

OPTIONS

-v

-v1

Specify regular verbose mode. This option tells multirgn to produce descriptive output concerning its execution. If this option is not specified, multirgn runs silently, producing no output concerning its execution.

-v2

Specify second-level verbose mode. This option tells multirgn to produce extra messages about subtler problems in the data. In particular, multirgn prints ``Mangled Tag Error'' messages in -v2 mode. Note that in most cases ``Mangled Tag Error'' messages are not needed, so -v1 mode is adequate. Refer to the section on ``Mangled Tag Errors'', below, for further details.

-f

Specify full pathnames. This option tells multirgn to use full paths in the `SysName' fields within the `Region' sections of the Data Dictionary. The `SysName' fields specify the file that contains the actual region index pointers. The -f option is useful to produce a Data Dictionary that can be accessed from any place in the filesystem. By default, multirgn puts just the filename of the region (`.rgn') file in the `SysName' fields. In that configuration, the database can only be accessed from the database directory itself.

-o output_prefix

Specify the output file prefix. This option specifies the prefix of the region pointer file that multirgn produces. multirgn will add a `.rgn' suffix to the given prefix to produce the complete region index filename. If the output_prefix already has a `.rgn' extension, multirgn will not add another one. multirgn will put the complete filename in the `SysName' fields within the `Region' sections of the Data Dictionary, for the regions that multirgn builds.
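The suffix rule for -o can be written as a small Python sketch. The function name region_index_filename is hypothetical; only the rule itself (append `.rgn' unless the prefix already ends with it) comes from the description above.

```python
def region_index_filename(output_prefix):
    """Return the region index filename for a given -o output_prefix."""
    # A prefix that already carries the extension is used unchanged.
    if output_prefix.endswith(".rgn"):
        return output_prefix
    return output_prefix + ".rgn"

print(region_index_filename("news"))
print(region_index_filename("news.rgn"))
```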

-merge

Enable multirgn to merge the new regions into the existing regions.

-sw tag_position_file

Specify sort and write mode. This option tells multirgn to perform the sort and write stages only. This option allows the user to bypass multirgn's initial phase that identifies the starts and ends of regions. This is useful in those special situations where multirgn's region start / end identification system (which just recognizes start and end tags with the angle bracket syntax) is not powerful enough, and users wish to use their own custom region identification program instead. In this configuration, the custom program is run to identify the regions and to write their start and end offsets to a tag_position_file. Then, multirgn is run with the -sw option to create a final region index from that tag_position_file and the tagname_file.

The tag_position_file consists of a series of five-byte entries in the following format: the first byte is a one-byte integer whose value is the 0-based entry number of the tag in the tagname_file. The remaining four bytes are a four-byte integer whose value is a 0-based offset into the text. For each tag, successive offsets are interpreted alternately as the start and the end of a region.

The following example shows the tag_position_file that multirgn's region identification system produces (in a complete run, multirgn first creates a tag_position_file and then runs the sort-and-write phase automatically). Assume the tagname_file contains the following two lines:

     Hdr
     Story

Also, assume the text consists of the following line.

     <Story><Hdr>this</Hdr>is<Hdr>done</Hdr></Story>

multirgn's region identification system identifies the following regions on the text:

     <Story><Hdr>this</Hdr>is<Hdr>done</Hdr></Story>
     ^      ^             ^  ^             ^       ^
     |      +---- Hdr ----+  +---- Hdr ----+       |
     |                                             |
     +-------------------- Story ------------------+

The tag_position_file it builds contains the following six values:

  Entry  Byte 1    Bytes 2-5
  -----  ------    ---------
    1      1           0     (Interpreted as Start Story)
    2      0           7     (Interpreted as Start Hdr)
    3      0          21     (Interpreted as End Hdr)
    4      0          24     (Interpreted as Start Hdr)
    5      0          38     (Interpreted as End Hdr)
    6      1          46     (Interpreted as End Story)

Under the -sw option, multirgn uses the tagname_file to determine the region names corresponding to the `Byte 1' values, above. Therefore, `0' would correspond to `Hdr' and `1' would correspond to `Story'.
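A custom region identification program could write such a file along the lines of the following Python sketch. The byte order and signedness of the two integers are not stated here, so the sketch assumes big-endian, unsigned values; check this against a file produced by multirgn before relying on it. The function names are hypothetical. Following the order of the tagname_file in the example, Hdr is entry 0 and Story is entry 1.

```python
import struct

# Assumption: big-endian, unsigned fields (byte order is not documented here).
ENTRY = struct.Struct(">BI")  # 1-byte tag index + 4-byte text offset = 5 bytes

def pack_entries(entries):
    """Serialize (tag_index, offset) pairs into five-byte entries."""
    return b"".join(ENTRY.pack(tag, offset) for tag, offset in entries)

def unpack_entries(data):
    """Parse five-byte entries back into (tag_index, offset) pairs."""
    if len(data) % ENTRY.size:
        raise ValueError("length is not a multiple of 5 bytes")
    return [ENTRY.unpack_from(data, i) for i in range(0, len(data), ENTRY.size)]

def pair_regions(entries, tagnames):
    """Read each tag's offsets alternately as start, end, start, end, ..."""
    open_start = {}
    regions = []
    for tag, offset in entries:
        if tag in open_start:
            regions.append((tagnames[tag], open_start.pop(tag), offset))
        else:
            open_start[tag] = offset
    return regions

tagnames = ["Hdr", "Story"]
entries = [(1, 0), (0, 7), (0, 21), (0, 24), (0, 38), (1, 46)]
data = pack_entries(entries)
print(len(data))
print(pair_regions(unpack_entries(data), tagnames))
```

The six entries occupy 30 bytes, and pairing them yields two Hdr regions (7 to 21 and 24 to 38) and one Story region (0 to 46).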

-meta meta_structure_file

Construct only the meta regions and the Default Data region. This option only applies to MFS databases. The meta_structure_file is a special file generated by mfsmeta. This file contains information that allows multirgn to restrict the region building operation to specific portions of the text. Refer to the MFS Database Regions section, below, for details.

-displayfmt format_name

Construct regions within those files having the same DisplayFmt as specified by format_name. This option only applies to MFS databases and may only be used in conjunction with the `-meta' option. Refer to the MFS Database Regions section, below, for details.

TAGNAME FILE FORMATS

The tagname_file can be in one of two formats. In the plain-text format, the bodies of the tags which enclose the different regions are listed in the tagname_file, one tag body (and hence one region) per line, with no other markup. For example, a tagname_file containing the lines,

     Heading
     Section

instructs multirgn to search for regions defined by the tag pairs <Heading HREF="1"> .. </Heading> ... <PB><Section> .. </Section>. Each region has the same name as the body of its defining tag. However, this format provides no mechanism to index attributes or empty tags such as <PB>.

The second format is called the encoded-text format. Note that the encoding _is_ case sensitive. The following is an example of an encoded-text tagname_file.


  <region>
    <element>Heading</element>
    <element>Section</element>
  </region>
  <region>
    <tag>PB</tag>
  </region>
  <region>
    <att>HREF</att>
  </region>

In this example, the tags which define the regions are the same as in the plain-text format example: <Heading HREF="1">..</Heading> ... <PB><Section>..</Section>. However, this format also supports indexing of the HREF attribute (which will be identified with a region name of "A-HREF" in the data_dictionary file), and indexing of the empty <PB> tag (which will be identified with a region name of "PB-T" in the data_dictionary file). Elements defined within the <element> region can also be defined within the <tag> region, in order to index the tag alone in addition to the tag data region.

Note: XML singletons of the form <FOO/> cannot be indexed by multirgn. These singletons should be represented in the data as <FOO></FOO>.

Note: No command line switch is necessary to tell multirgn what the tagname_file format is; multirgn detects the format automatically and processes the file accordingly.
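The two tagname_file formats and the resulting region names can be modelled with the following Python sketch. How multirgn actually detects the format is not documented here; the sketch assumes that the presence of a <region> tag marks the encoded-text format. The function names are hypothetical. The "A-" prefix and "-T" suffix are taken from the region names quoted above.

```python
import re

def parse_tagname_file(content):
    """Return the element, tag and attribute names listed in a tagname_file."""
    # Assumption: a <region> tag signals the encoded-text format.
    if "<region>" not in content:
        names = [line.strip() for line in content.splitlines() if line.strip()]
        return {"element": names, "tag": [], "att": []}
    spec = {"element": [], "tag": [], "att": []}
    pattern = r"<(element|tag|att)>\s*([^<\s]+)\s*</\1>"
    for kind, name in re.findall(pattern, content):
        spec[kind].append(name)
    return spec

def region_names(spec):
    """Derive region names: plain for elements, NAME-T for tags, A-NAME for attributes."""
    return (spec["element"]
            + [name + "-T" for name in spec["tag"]]
            + ["A-" + name for name in spec["att"]])

encoded = """
  <region>
    <element>Heading</element>
    <element>Section</element>
  </region>
  <region>
    <tag>PB</tag>
  </region>
  <region>
    <att>HREF</att>
  </region>
"""
print(region_names(parse_tagname_file(encoded)))
print(region_names(parse_tagname_file("Heading\nSection\n")))
```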

MFS DATABASE REGIONS

In MFS databases, the MFS system creates a ``virtual text'' from the text of all the files in the database. The portion of this virtual text that corresponds to each file consists of three pieces: the Meta-Header section, the Data section and the Meta-Trailer section. This breakdown is illustrated in the following diagram:


  <OTDoc><OTMeta>....</OTMeta><OTData>........</OTData></OTDoc>
  |---------- Meta-Header -----------|| Data ||- Meta-Trailer -|
  ^                                   ^       ^                ^
  start                               start   start            end
  header                              data    trailer          pos
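The breakdown in the diagram can be reproduced with a short Python sketch. It locates the section boundaries by searching for the <OTData> and </OTData> tags, which is an assumption made for illustration only; multirgn itself takes these offsets from the meta_structure_file. The function name split_sections is hypothetical.

```python
def split_sections(doc):
    """Split one file's virtual text into Meta-Header, Data and Meta-Trailer."""
    # The Meta-Header runs up to and including the <OTData> start tag.
    data_start = doc.index("<OTData>") + len("<OTData>")
    # The Meta-Trailer begins at the last </OTData> end tag.
    trailer_start = doc.rindex("</OTData>")
    return doc[:data_start], doc[data_start:trailer_start], doc[trailer_start:]

doc = "<OTDoc><OTMeta>....</OTMeta><OTData>........</OTData></OTDoc>"
header, data, trailer = split_sections(doc)
print(repr(header))
print(repr(data))
print(repr(trailer))
```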

If multirgn is run over such a database, it will build a region index for each tag defined in the tagname_file. It will search for these tags in all three of the above sections in each file. While this behaviour is usually adequate, the region-building process can usually be made more accurate and efficient by building the regions in several passes, restricting the build operation to specific sections of the text in each pass. The -meta and -displayfmt options provide detailed control over this process.

When the -meta option is specified alone (i.e., without the -displayfmt option) multirgn only builds regions on the text in the Meta-Trailer and Meta-Header sections. Within those sections, it builds regions on the tags defined by the tagname_file. It skips the Data sections.

The argument to the -meta option is a meta_structure_file, which contains the offsets of the Meta-Header, Data and Meta-Trailer sections, and a few other pieces of information for each file. This file is built by the mfsmeta program. Refer to the mfsmeta(1) man page for further details on this file.

This type of restriction is useful because the MFS system can usually produce the text of the Meta-Header and Meta-Trailer sections much faster than the text in the Data sections. Also, the Data sections may not contain any regions in the first place, which means that scanning over that text should be avoided. Under this configuration, it builds regions on the tags defined by the tagname_file. As such, the tagname_file should be set up with the following meta-fields:


OTDoc
OTMeta
OTFile
OTDate
OTTime
OTDisplayFmt
OTFieldsSize
OTFields
OTData

The tagname_file should also be set up to build regions on any User Meta-Data fields in the OTFields sections.

In addition to the regions defined by the tagname_file, multirgn also builds a special OTDefaultData region. This region defines an appropriate unit of text to send to a viewer program, within the Data section of each file. For most word processor files, this consists of the entire Data section. However, for tagged text files that contain several ``Entries'' (e.g., newspaper Stories), an individual Entry might be more appropriate. Because MFS databases may contain both types of files, a special field in the Data Dictionary may be used to control how multirgn builds the OTDefaultData region.

Each FilterChain section in the Data Dictionary may contain an OTDefaultDataTag field. If a given FilterChain section does not contain an OTDefaultDataTag field, multirgn will make the entire Data section of the corresponding files the OTDefaultData region. If a FilterChain section does have an OTDefaultDataTag field defined, multirgn will scan the corresponding files' Data sections and will build the OTDefaultData regions on the tags defined by the OTDefaultDataTag field. In this way, the members of the OTDefaultData region are appropriate for each file.

The second configuration involves specifying the -displayfmt option in addition to the -meta option. With both options specified, multirgn restricts the region build operation to the Data sections of those files whose DisplayFmt matches the given format_name. Under such a configuration, the tagname_file should be set up with the tags that are present in those particular Data sections.
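The two configurations can be expressed as command lines. The Python sketch below only assembles argument lists from the options shown in the SYNOPSIS and does not run anything; the helper name multirgn_argv and all file and format names (news.dd, news.meta, meta.tags, story.tags, Story) are hypothetical. Whether a later pass also needs -merge or will prompt about an existing .rgn file is not modelled.

```python
def multirgn_argv(data_dictionary, tagname_file, meta_file=None,
                  display_fmt=None, verbose=False):
    """Build a multirgn argument list for one region-building pass."""
    # -displayfmt is only valid in conjunction with -meta.
    if display_fmt is not None and meta_file is None:
        raise ValueError("-displayfmt may only be used together with -meta")
    argv = ["multirgn"]
    if verbose:
        argv.append("-v")
    if meta_file is not None:
        argv += ["-meta", meta_file]
        if display_fmt is not None:
            argv += ["-displayfmt", display_fmt]
    argv += ["-D", data_dictionary, "-t", tagname_file]
    return argv

passes = [
    # First configuration: meta regions and OTDefaultData only.
    multirgn_argv("news.dd", "meta.tags", meta_file="news.meta"),
    # Second configuration: Data sections of files with one DisplayFmt.
    multirgn_argv("news.dd", "story.tags", meta_file="news.meta",
                  display_fmt="Story"),
]
for argv in passes:
    print(" ".join(argv))
```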

By using these two configurations, the first pass can be used to build the meta-field tags and then one or more subsequent passes can be used to build regions on the Data sections of specific file types.

NESTING PROBLEMS

multirgn is designed to operate on text in which the tags are properly paired and the tag pairs are properly nested. However, it handles overlapping tag pairs and recursive nestings automatically.

Overlapping Pairs

Overlapping pairs are invalid SGML, but multirgn will index them correctly. They are patterns of the form:

  <Body><Tag1>text text<Tag2>text text</Tag1>text text</Tag2></Body>


Recursive Nestings


Recursive nestings are patterns of the form

  <Body><Tag1>text text<Tag1>text text</Tag1>text text</Tag1></Body>

In this case, multirgn will index each occurrence of the tag, but does not provide any explicit indexing to help the XPAT search engine locate the inner nest (i.e. ``region Tag1 within region Tag1'').

multirgn will issue a warning message if the nested tags are not symmetric:

  Trying to pop unpushed tag Tag2, input offset 53 stack: /Body
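One way to pair tags that tolerates overlapping pairs and recursive nestings is to keep a stack of open tags and match each end tag against the most recent open tag of the same name. The Python sketch below illustrates that idea. It is an assumption about a workable approach, not a description of multirgn's internals, and its warning text is the sketch's own wording rather than multirgn's.

```python
import re

# Assumption: same simplified tag syntax as elsewhere; names are case sensitive.
TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9._-]*)[^<>]*>")

def pair_tags(text, tagnames):
    """Return (regions, warnings) for the listed tags found in text."""
    wanted = set(tagnames)
    stack = []      # open tags as (name, start_offset), innermost last
    regions = []    # (name, offset of start '<', offset of end '>')
    warnings = []
    for m in TAG_RE.finditer(text):
        is_end, name = m.group(1) == "/", m.group(2)
        if name not in wanted:
            continue
        if not is_end:
            stack.append((name, m.start()))
            continue
        # Search from the innermost open tag outwards for a matching name.
        for i in range(len(stack) - 1, -1, -1):
            if stack[i][0] == name:
                _, start = stack.pop(i)
                regions.append((name, start, m.end() - 1))
                break
        else:
            path = "".join("/" + n for n, _ in stack)
            warnings.append(
                f"unmatched end tag {name} at offset {m.start()}, open tags: {path}")
    return regions, warnings

overlap = "<Body><Tag1>text text<Tag2>text text</Tag1>text text</Tag2></Body>"
print(pair_tags(overlap, ["Body", "Tag1", "Tag2"]))
print(pair_tags("<Body>text</Tag2></Body>", ["Body", "Tag2"]))
```

With this approach the inner and outer Tag1 of a recursive nesting become two separate regions of the same name, and an end tag with no open partner produces a warning instead of a region.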

MANGLED TAG ERRORS

If the -v2 option is specified, multirgn will print a warning message whenever it encounters what it considers to be ``mangled tags''. multirgn considers opening angle brackets (`<' characters) that are not the start of tags to be mangled tags. Whenever it encounters such occurrences, it simply ignores them and keeps on processing (which is usually the correct behaviour).

The following segment of text contains an example of a `<' character in a location other than the start of a tag:

  ...described by the relation, x <= y, and having the...

If the -v2 option is specified multirgn would report the following error when it encountered the above segment of text (NNN is the offset of the `<' character):

  Mangled tag error at offset NNN

This message is basically a warning to alert the Database Administrator that the data contains `<' characters at places other than tags. For this reason, the -v2 option should only be specified if the data is not expected to contain `<' characters in places other than tags.
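Before indexing, the data can be checked for stray `<' characters with a sketch such as the following Python function. The exact rule multirgn uses to decide that a `<' does not start a tag is not documented here; the pattern below is an assumption, and the function name mangled_offsets is hypothetical.

```python
import re

# Assumption: a '<' starts a tag only if it is followed by an optional '/',
# a letter, and then a closing '>' with no other '<' or '>' in between.
TAG_AT = re.compile(r"</?[A-Za-z][^<>]*>")

def mangled_offsets(text):
    """Return the 0-based offsets of '<' characters that do not start a tag."""
    return [i for i, ch in enumerate(text)
            if ch == "<" and not TAG_AT.match(text, i)]

sample = "...described by the relation, x <= y, and having the..."
print(mangled_offsets(sample))
print(mangled_offsets("<Hdr>done</Hdr>"))
```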

MERGING INDEX FILES

Before multirgn starts building the regions, it scans the existing `Region' sections in the data_dictionary and verifies that the region index files that they reference exist. If any of those files are missing, multirgn deletes the corresponding `Region' sections from the data_dictionary. In this way, integrity is maintained between the data_dictionary and the region index file.

If the region index file that multirgn writes the index to already exists, multirgn will write the new index points onto the end of the index file. If region pointers for a particular region already exist in the region index file, multirgn will remove them from where they exist in the index file and will place the new pointers on the end of the index file.
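The effect of these rules on the order of regions in the index can be modelled in memory with a short Python sketch. It represents the index as a list of (region name, pointers) pairs, which is a simplification for illustration and not the on-disk format described in regions(5). The function name merge_region_index is hypothetical.

```python
def merge_region_index(existing, new):
    """Model the merge: rebuilt regions are dropped, new pointers go last.

    existing and new are lists of (region_name, pointers) in file order.
    """
    rebuilt = {name for name, _ in new}
    # Keep regions that are not being rebuilt, in their original order.
    kept = [(name, ptrs) for name, ptrs in existing if name not in rebuilt]
    # Newly built regions are placed at the end of the index.
    return kept + list(new)

existing = [("Story", [(0, 46)]), ("Hdr", [(7, 21)]), ("Date", [(50, 60)])]
new = [("Hdr", [(7, 21), (24, 38)])]
print(merge_region_index(existing, new))
```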

MULTIRGN AND XPATRGN

In many situations, it will be necessary to use both multirgn and xpatrgn to define all the regions in a production database. While multirgn is vastly preferable for those regions which nest nicely and are unambiguously tagged (usually the majority), there are often other regions which are not defined unambiguously by tags or which may overlap other regions. In these situations it is best to build all the tagged regions with one multirgn run and to follow that with several xpatrgn runs to build the remaining regions.

FILES

data_dictionary: the Data Dictionary to be updated.
tagname_file: the tagname_file containing the list of region tag bodies.
file.rgn: the resulting region index file.
tagint.nnnn: the temporary file.

DIFFERENCES BETWEEN MULTIRGN AND SGMLRGN

multirgn differs from sgmlrgn in several ways: (1) Because it does not validate the text, it produces output much faster. (2) multirgn can possibly create smaller index files if only a subset of the regions are defined. (3) multirgn does not fabricate new region names if regions of the same name are nested. (4) multirgn infers an empty element if it is defined as a <tag> and not an <element>. Because of this, multirgn will only generate tag regions (e.g. "TAG-T") and not element regions (e.g. "TAG") for empty elements, while sgmlrgn will generate both.
