multirgn - XPAT multiple region index builder
multirgn [ -v | -v1 | -v2 ] [ -f ] [ -merge ] [ -o output_prefix ] [ -sw tag_position_file ] [ -meta meta_structure_file [ -displayfmt format_name ] ] -D data_dictionary -t tagname_file
multirgn takes two main arguments: the data_dictionary and the name of a file identifying the tags that define the regions. In one pass over the text, it produces a region index file containing the region indices of all the regions specified in the tagname_file. The region index file has the format described in the regions(5) man page. By default, this region index file is named with the same prefix as the data_dictionary, and with the suffix, `.rgn'. An alternate name can be specified using the -o option, described below. multirgn also updates (or creates) the `Region' sections in the data_dictionary for the regions that it builds. If the .rgn file already exists at the time multirgn is invoked, multirgn will prompt the user to confirm that the existing file can be overwritten.
Most text databases contain several types of structural elements. The structural elements in a newspaper database may include, for example, Stories, Headlines, Bylines, Dates and Paragraphs. These structural elements are known in xpat as regions and, in many databases, are delimited by tags. The tags are usually surrounded by angle brackets (the characters `<' and `>') to distinguish them from the main body of the text. These tags commonly occur in start-end pairs, with the end tag being distinguished from the start tag by a slash (`/') character after the opening angle bracket.
For example, headings in a text might be tagged as:
<Heading>The text of the heading</Heading>
multirgn only recognizes tags using the above angle bracket syntax, and will only write the region pointers for the regions delimited by the tags specified in the tagname_file. The format of that file is described below.
For example, with a tagname_file containing the lines,
Hdr
Story
and the text,
<Story><Hdr>this</Hdr>is<Hdr>done</Hdr></Story>
the regions are delimited as follows:

<Story><Hdr>this</Hdr>is<Hdr>done</Hdr></Story>
^      ^             ^  ^             ^       ^
|      +---- Hdr ----+  +---- Hdr ----+       |
|                                             |
+-------------------- Story ------------------+

The entries are interpreted as follows:

Entry   Byte 1   Bytes 2-5
-----   ------   ---------
  1       0          0      (Interpreted as Start Story)
  2       1          7      (Interpreted as Start Hdr)
  3       1         21      (Interpreted as End Hdr)
  2       1         24      (Interpreted as Start Hdr)
  3       1         38      (Interpreted as End Hdr)
  4       0         46      (Interpreted as End Story)
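The offsets in the Story/Hdr example can be reproduced with a short stack-based scan. The following Python sketch is illustrative only and is not multirgn's implementation; the function name scan_regions and the tag-matching regular expression are assumptions. It treats a region as running from the `<' of the start tag to the `>' of the end tag, which is consistent with the offsets listed above.

```python
import re

# Assumption: a tag is '<', an optional '/', a name, optional attribute text, '>'.
TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^<>]*>")

def scan_regions(text, tag_names):
    """Return (name, start, end) triples for properly nested tag pairs."""
    wanted = set(tag_names)
    open_tags = []   # stack of (name, offset of the '<' of the start tag)
    regions = []
    for m in TAG_RE.finditer(text):
        is_end, name = m.group(1) == "/", m.group(2)
        if name not in wanted:
            continue             # tags not listed in the tagname_file are ignored
        if not is_end:
            open_tags.append((name, m.start()))
        elif open_tags and open_tags[-1][0] == name:
            _, start = open_tags.pop()
            # The region ends at the '>' that closes the end tag.
            regions.append((name, start, m.end() - 1))
    return sorted(regions, key=lambda r: r[1])

text = "<Story><Hdr>this</Hdr>is<Hdr>done</Hdr></Story>"
for name, start, end in scan_regions(text, ["Hdr", "Story"]):
    print(name, start, end)
```

Run on the example text, the sketch reports Story at 0-46 and Hdr at 7-21 and 24-38.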
The tagname_file can be in one of two formats. In the plain-text format, the bodies of the tags which enclose the different regions are listed in the tagname_file, one tag body (and hence one region) per line, with no other markup. For example, a tagname_file containing the lines,
Heading
Section
instructs multirgn to search for regions defined by the tag pairs <Heading HREF="1"> .. </Heading> ... <PB><Section> .. </Section>. Each region has the same name as the body of its defining tag. However, this format provides no mechanism to index attributes or empty tags such as <PB>.
The second format is called the encoded-text format. Note that the encoding _is_ case sensitive. The following is an example of an encoded-text tagname_file.
<region>
<element>Heading</element>
<element>Section</element>
</region>
<region>
<tag>PB</tag>
</region>
<region>
<att>HREF</att>
</region>
In this example, the tags that define the regions are the same as in the plain-text format example: <Heading HREF="1">..</Heading> ... <PB><Section>..</Section>. However, this format also supports indexing the HREF attribute (identified by the region name "A-HREF" in the data_dictionary file) and the empty tag <PB> (identified by the region name "PB-T" in the data_dictionary file). Elements defined within the <element> region can also be defined within the <tag> region, in order to index only the tag in addition to the tag data region.
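The naming convention described above can be summarised in a few lines of Python. This is a sketch only: the function region_names and its regular expression are assumptions made for illustration, and the real parser may accept or reject input that this one does not. It maps <element> entries to the tag body, <tag> entries to the body plus "-T", and <att> entries to "A-" plus the body.

```python
import re

# Assumption: each entry sits on its own as <kind>body</kind>, case sensitive.
ENTRY_RE = re.compile(r"<(element|tag|att)>([^<]+)</\1>")

def region_names(encoded_text):
    """Map each entry of an encoded-text tagname_file to a region name."""
    names = []
    for kind, body in ENTRY_RE.findall(encoded_text):
        body = body.strip()
        if kind == "element":
            names.append(body)           # region named after the tag body
        elif kind == "tag":
            names.append(body + "-T")    # e.g. PB -> PB-T
        else:
            names.append("A-" + body)    # e.g. HREF -> A-HREF
    return names

example = """
<region>
<element>Heading</element>
<element>Section</element>
</region>
<region>
<tag>PB</tag>
</region>
<region>
<att>HREF</att>
</region>
"""
print(region_names(example))
```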
In MFS databases, the MFS system creates a ``virtual text'' from the text of all the files in the database. The portion of this virtual text that corresponds to each file consists of three pieces: the Meta-Header section, the Data section and the Meta-Trailer section. This breakdown is illustrated in the following diagram:
<OTDoc><OTMeta>....</OTMeta><OTData>........</OTData></OTDoc>
|---------- Meta-Header -----------|| Data ||- Meta-Trailer -|
^                                   ^       ^                ^
start                               start   start          end
header                              data    trailer        pos
If multirgn is run over such a database, it will build a region index for each tag defined in the tagname_file. It will search for these tags in all three of the above sections in each file. While this behaviour is usually adequate, the region-building process can often be made more accurate and efficient by building the regions in several passes, restricting the build operation to specific sections of the text in each pass. The -meta and -displayfmt options provide detailed control over this process.
When the -meta option is specified alone (i.e., without the -displayfmt option) multirgn only builds regions on the text in the Meta-Trailer and Meta-Header sections. Within those sections, it builds regions on the tags defined by the tagname_file. It skips the Data sections.
The argument to the -meta option is a meta_structure_file, which contains the offsets of the Meta-Header, Data and Meta-Trailer sections, and a few other pieces of information for each file. This file is built by the mfsmeta program. Refer to the mfsmeta(1) man page for further details on this file.
This type of restriction is useful because the MFS system can usually produce the text of the Meta-Header and Meta-Trailer sections much faster than the text in the Data sections. Also, the Data sections may not contain any regions in the first place, in which case scanning over that text should be avoided. Under this configuration, multirgn builds regions on the tags defined by the tagname_file. As such, the tagname_file should be set up with the following meta-fields:
OTDoc OTMeta OTFile OTDate OTTime OTDisplayFmt OTFieldsSize OTFields OTData
The tagname_file should also be set up to build regions on any User Meta-Data fields in the OTFields sections.
In addition to the regions defined by the tagname_file, multirgn also builds a special OTDefaultData region. This region defines an appropriate unit of text to send to a viewer program, within the Data section of each file. For most word processor files, this consists of the entire Data section. However, for tagged text files that contain several ``Entries'' (e.g., newspaper Stories), an individual Entry might be more appropriate. Because MFS databases may contain both types of files, a special field in the Data Dictionary may be used to control how multirgn builds the OTDefaultData region.
Each FilterChain section in the Data Dictionary may contain an OTDefaultDataTag field. If a given FilterChain section does not contain an OTDefaultDataTag field, multirgn will make the entire Data section of the corresponding files the OTDefaultData region. If a FilterChain section does have an OTDefaultDataTag field defined, multirgn will scan the corresponding files' Data sections and will build the OTDefaultData regions on the tags defined by the OTDefaultDataTag field. In this way, the members of the OTDefaultData region are appropriate for each file.
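The choice between the two behaviours can be sketched as follows. This Python fragment is a simplified illustration, not multirgn's code: the function default_data_regions is hypothetical, it assumes the OTDefaultDataTag field names a single tag, and it assumes that tag does not nest inside itself. The parameter data_start stands for the offset of the Data section within the virtual text.

```python
import re

def default_data_regions(data_text, data_start, default_tag=None):
    """Return (start, end) offsets of the OTDefaultData regions for one file."""
    if default_tag is None:
        # No OTDefaultDataTag field: the whole Data section is one region.
        return [(data_start, data_start + len(data_text) - 1)]
    tag = re.escape(default_tag)
    # Assumption: non-nested <tag ...> ... </tag> pairs within the Data section.
    pair = re.compile(rf"<{tag}(?:\s[^<>]*)?>.*?</{tag}>", re.S)
    return [(data_start + m.start(), data_start + m.end() - 1)
            for m in pair.finditer(data_text)]

data = "<Story>a</Story><Story>b</Story>"
print(default_data_regions(data, 100))            # one region for the whole section
print(default_data_regions(data, 100, "Story"))   # one region per Story
```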
The second configuration involves specifying the -displayfmt option in addition to the -meta option. With both options specified, multirgn restricts the region build operation to the Data sections of those files whose DisplayFmt matches the given format_name. Under such a configuration, the tagname_file should be set up with the tags that are present in those particular Data sections.
By using these two configurations, the first pass can be used to build the meta-field tags and then one or more subsequent passes can be used to build regions on the Data sections of specific file types.
NESTING PROBLEMS
multirgn is designed to operate on text in which the tags are properly paired and the tag pairs are properly nested. However, it handles overlapping tag pairs and recursive nestings automatically.
Overlapping Pairs
Although they are invalid SGML, overlapping pairs are indexed correctly by multirgn. Overlapping pairs are patterns of the form:
<Body><Tag1>text text<Tag2>text text</Tag1>text text</Tag2></Body>
Recursive Nestings
Recursive nestings are patterns of the form:
<Body><Tag1>text text<Tag1>text text</Tag1>text text</Tag1></Body>
In this case, multirgn will index each occurrence of the tag, but does not provide any explicit indexing to help the XPAT search engine locate the inner nest (i.e., ``region Tag1 within region Tag1'').
multirgn will issue a warning message if the nested tags are not symmetric:
Trying to pop unpushed tag Tag2, input offset 53 stack: /Body
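One way to tolerate overlapping pairs and recursive nestings is to match each end tag against the most recently opened tag of the same name, wherever it sits among the open tags. The Python sketch below illustrates that idea under stated assumptions: it is not multirgn's algorithm, the function index_tags is hypothetical, and the warning text is only modelled on the message shown above (in particular, the reported offset is assumed to be that of the `<' of the unmatched end tag).

```python
import re

# Assumption: a tag is '<', an optional '/', a name, optional attribute text, '>'.
TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^<>]*>")

def index_tags(text):
    """Pair tags by name, tolerating overlapping and recursive nestings."""
    stack = []      # open tags as (name, start offset), outermost first
    regions = []    # completed (name, start, end) triples
    warnings = []
    for m in TAG_RE.finditer(text):
        name = m.group(2)
        if m.group(1) != "/":
            stack.append((name, m.start()))
            continue
        # Look for the most recently opened tag with this name.
        for i in range(len(stack) - 1, -1, -1):
            if stack[i][0] == name:
                regions.append((name, stack.pop(i)[1], m.end() - 1))
                break
        else:
            # No open tag of this name: the nesting is not symmetric.
            path = "".join("/" + n for n, _ in stack)
            warnings.append(f"Trying to pop unpushed tag {name}, "
                            f"input offset {m.start()} stack: {path}")
    return regions, warnings

overlap = "<Body><Tag1>text text<Tag2>text text</Tag1>text text</Tag2></Body>"
nested = "<Body><Tag1>text text<Tag1>text text</Tag1>text text</Tag1></Body>"
broken = "<Body>text</Tag2></Body>"
for sample in (overlap, nested, broken):
    print(index_tags(sample))
```

With this approach the inner and outer Tag1 regions of a recursive nesting are both recorded, but nothing marks one as contained in the other.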
If the -v2 option is specified, multirgn will print a warning message whenever it encounters what it considers to be ``mangled tags''. multirgn considers opening angle brackets (`<' characters) that are not the start of tags to be mangled tags. Whenever it encounters such occurrences, it simply ignores them and keeps on processing (which is usually the correct behaviour).
The following segment of text contains an example of a `<' character in a location other than the start of a tag:
...described by the relation, x <= y, and having the...
If the -v2 option is specified, multirgn would report the following error when it encountered the above segment of text (NNN is the offset of the `<' character):
Mangled tag error at offset NNN
This message is basically a warning to alert the Database Administrator that the data contains `<' characters at places other than tags. For this reason, the -v2 option should only be specified if the data is not expected to contain `<' characters in places other than tags.
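Before enabling -v2 on a large database, it can be useful to estimate how many such `<' characters the data contains. The Python sketch below is one possible check, not the test multirgn itself applies: the function mangled_offsets is hypothetical, and it assumes a tag is a `<', an optional `/', a letter, and then any characters up to the next `>'.

```python
import re

# Assumption about tag syntax; multirgn's own rule may differ.
TAG_START_RE = re.compile(r"</?[A-Za-z][^<>]*>")

def mangled_offsets(text):
    """Return offsets of '<' characters that do not begin a tag."""
    return [i for i, ch in enumerate(text)
            if ch == "<" and not TAG_START_RE.match(text, i)]

sample = "<P>described by the relation, x <= y, and having the</P>"
for offset in mangled_offsets(sample):
    print("Mangled tag error at offset", offset)
```

A `<' that happens to be followed by a letter and a later `>' (as in "a<b and c>d") would pass this check, so the count is a lower bound.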
Before multirgn starts building the regions, it scans the existing `Region' sections in the data_dictionary and verifies that the region index files that they reference exist. If any of those files are missing, multirgn deletes the corresponding `Region' sections from the data_dictionary. In this way, integrity is maintained between the data_dictionary and the region index file.
If the region index file that multirgn writes the index to already exists, multirgn will write the new index points onto the end of the index file. If region pointers for a particular region already exist in the region index file, multirgn will remove them from where they exist in the index file and will place the new pointers on the end of the index file.
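The update rule can be modelled in memory as follows. This Python sketch is only a model of the behaviour described above: the function update_index is hypothetical, and the index is represented as a list of (region name, pointers) pairs rather than in the on-disk format described in regions(5).

```python
def update_index(index, new_regions):
    """index and new_regions are lists of (region_name, pointers) pairs."""
    rebuilt = {name for name, _ in new_regions}
    # Remove existing pointers for rebuilt regions; keep the rest in place.
    kept = [(name, ptrs) for name, ptrs in index if name not in rebuilt]
    # New pointers always go on the end of the index.
    return kept + list(new_regions)

index = [("Story", [(0, 46)]), ("Hdr", [(7, 21), (24, 38)])]
index = update_index(index, [("Story", [(0, 46), (47, 93)])])
print([name for name, _ in index])   # ['Hdr', 'Story']
```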
In many situations, it will be necessary to use both multirgn and xpatrgn to define all the regions in a production database. While multirgn is vastly preferable for those regions which nest nicely and are unambiguously tagged (usually the majority), there are often other regions which are not defined unambiguously by tags or which may overlap other regions. In these situations it is best to build all the tagged regions with one multirgn run and to follow that with several xpatrgn runs to build the remaining regions.
multirgn differs from sgmlrgn in several ways: (1) Because it does not validate the text, it produces output much faster. (2) multirgn can possibly create smaller index files if only a subset of the regions are defined. (3) multirgn does not fabricate new region names if regions of the same name are nested. (4) multirgn infers an empty element if it is defined as a <tag> and not an <element>. Because of this, multirgn will only generate tag regions (e.g. "TAG-T") and not element regions (e.g. "TAG") for empty elements, while sgmlrgn will generate both.
xpat(1), xpatrgn(1), sgmlrgn(1), mfsmeta(1), regions(5), data_dictionary(5)