DATA_DICT (man page): data_dictionary - XPAT Data Dictionary file format for DLXS XPAT databases

DESCRIPTION

Each DLXS XPAT database has a Data Dictionary that describes the database's text, its indices, and its regions.

A Data Dictionary is made up of several sections. Each section is delimited by ``tags'' - short labels enclosed by angle brackets: `<' and `>'. For example, information about the database's text is preceded by a <Text> start tag and followed by a </Text> end tag. The slash (`/') character is used to distinguish between end tags and start tags. The entire Data Dictionary is enclosed by <DB> and </DB> tags.
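
The tag structure described above can be read with a simple pattern match. The following is a minimal sketch, not part of XPAT: the helper name field() is hypothetical. A regular expression is used instead of an XML parser because Data Dictionary entity references such as `&amp.' end with a period rather than a semicolon, so a strict XML parser may reject the file.

```python
import re

def field(text, tag):
    """Return the content of the first <tag>...</tag> pair in text, or None."""
    pattern = r"<{0}>(.*?)</{0}>".format(re.escape(tag))
    match = re.search(pattern, text, re.DOTALL)
    return match.group(1) if match else None

# A tiny Data Dictionary fragment used only for illustration.
sample = "<DB><Thesaurus>/usr/ot/default.the</Thesaurus><Text></Text></DB>"
print(field(sample, "Thesaurus"))
```

Because the match is non-greedy and takes the first occurrence, this helper is only suitable for tags that appear once in the text it is given, such as <Thesaurus> within <DB>.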

Each section and field in the Data Dictionary is described separately in the following paragraphs. The Data Dictionary contains a Thesaurus field, a Text section, an Indices section, and a Regions section.

Thesaurus FIELD

The Thesaurus field is enclosed by <Thesaurus> and </Thesaurus> tags. It contains the name of a file with thesaurus definitions. The format of this file is described in the `thesaurus' entry of the XPat Reference Manual and Tutorial. The filename can be specified using either a relative path or an absolute path.

Text SECTION

The Text section is enclosed by <Text> and </Text> tags. It contains information relating to the text itself. Specifically, it contains an MfsFiles section, which describes all the individual files that make up the database.

NOTE:
For backward-compatibility, the Text section may also contain a Files section instead of the MfsFiles section. Refer to the ``BACKWARD-COMPATIBILITY'' section at the end of this man page for descriptions of this Files section and all the fields it contains.

MfsFiles SECTION

The MfsFiles section is enclosed by <MfsFiles> and </MfsFiles> tags. The fields within the MfsFiles section are described in the mfs(5) man page. Refer to that man page for the details.

Indices SECTION

The Indices section is enclosed by <Indices> and </Indices> tags. It contains one or more Index sections.

Index SECTION

The Index section is enclosed by <Index> and </Index> tags. It contains information about a single, named Main Index. Specifically, it contains a Name field, a FastFind section (if a Fast-Find index has been built on this Main Index), a File section, an InitFile field, an IndexPoints section, a Mappings section, and an IntegrityCheck field.

Name FIELD

The Name field is enclosed by <Name> and </Name> tags. It names the index contained within the enclosing Index section. It is used when invoking xpat to specify which index is to be used in searching. The first Index section may have an empty Name field. All other Index sections must have non-empty Name fields.

FastFind SECTION

The FastFind section is enclosed by <FastFind> and </FastFind> tags. It contains a FastFindCompression section, a FastFindIndex section and a FastFindWordList section. These sections describe information for each of the three files that constitute the FastFind index. Note that these sections are present in the Data Dictionary only if a Fast-Find index has been built on the database (this is always the case for MFS databases).

FastFindCompression SECTION

The FastFindCompression section is enclosed by <FastFindCompression> and </FastFindCompression> tags. It contains one File section.

File SECTION

The File section is enclosed by <File> and </File> tags. It specifies the FastFind Compression file. It contains a SysName field, a ModDate field, and an Offset field.

SysName FIELD

The SysName field is enclosed by <SysName> and </SysName> tags. It contains the file's filename or path.

ModDate FIELD

The ModDate field is enclosed by <ModDate> and </ModDate> tags. It contains the last modification date of the file, encoded as a number. The database system maintains this number to ensure that the database hasn't been changed in an unauthorized manner.

Offset FIELD

The Offset field is enclosed by <Offset> and </Offset> tags. It contains the logical starting offset of the current information within the file. This field is usually set to 0, except in Region sections. Refer to the Region section, below, for details.
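
As an illustration of the three fields just described, the sketch below reads one File section into a small record. The names FileEntry and parse_file_section are hypothetical and are not part of any XPAT tool. The sample values are taken from the EXAMPLES section of this man page.

```python
import re
from dataclasses import dataclass

@dataclass
class FileEntry:
    sys_name: str   # content of <SysName>
    mod_date: int   # content of <ModDate>, kept as the raw number
    offset: int     # content of <Offset>

def parse_file_section(section):
    """Parse the text between <File> and </File> into a FileEntry."""
    def grab(tag):
        m = re.search(r"<{0}>(.*?)</{0}>".format(tag), section, re.DOTALL)
        if m is None:
            raise ValueError("missing <%s> field" % tag)
        return m.group(1).strip()
    return FileEntry(grab("SysName"), int(grab("ModDate")), int(grab("Offset")))

sample = """
<SysName>/usr/ot/manual/def.idx</SysName>
<ModDate>679335524</ModDate>
<Offset>0</Offset>
"""
entry = parse_file_section(sample)
print(entry)
```

The ModDate value is deliberately left as an integer, since this man page only says that it is a date encoded as a number.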

FastFindIndex SECTION

The FastFindIndex section is enclosed by <FastFindIndex> and </FastFindIndex> tags. It contains one File section that specifies the main Fast-Find Index file. The contents of the File section are described in the section on FastFindCompression, above. Refer to that section for details.

FastFindWordList SECTION

The FastFindWordList section is enclosed by <FastFindWordList> and </FastFindWordList> tags. It contains one File section that specifies the Fast-Find Word List file. The contents of the File section are described in the section on FastFindCompression, above. Refer to that section for details.

File SECTION

This section specifies the Main Index file. The contents of the File section are described in the section on FastFindCompression, above. Refer to that section for details.

InitFile FIELD

The InitFile field is enclosed by <InitFile> and </InitFile> tags. It contains the name of a file which is read by xpat during initialization. Any legal xpat command may be contained in the initialization file. Typical uses are setting the DefaultRegion, defining macros, or defining a match set or region set commonly used in an xpat session. Refer to the XPat Reference Manual and Tutorial for more information on the valid xpat commands.

IndexPoints SECTION

The IndexPoints section is enclosed by <IndexPoints> and </IndexPoints> tags. It contains one or more IndexPt sections.

IndexPt SECTION

Each IndexPt section is enclosed by <IndexPt> and </IndexPt> tags. These fields contain strings which indicate points in the text which should be indexed.

The simplest index point consists of two characters, such as <IndexPt>ab</IndexPt>. This example instructs xpatbld to create an index point each time an ``ab'' occurs in the text. For each such occurrence, an index point is generated for the ``b''.

Since listing each two-letter combination to index can be cumbersome, each IndexPt section can contain meta-characters. A meta-character stands for a number of characters. For instance, the meta-character `&uppercase.' represents the characters ``ABCDEFG...'' and so on. An index point containing <IndexPt> &uppercase.</IndexPt> (note the space immediately preceding the `&' character) is equivalent to specifying the following:

        <IndexPt> A</IndexPt>
        <IndexPt> B</IndexPt>
        <IndexPt> C</IndexPt>
        and so on...
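
The expansion shown above can be sketched in a few lines. This is an illustration only and does not reproduce how xpatbld is implemented: the function name expand_index_point is hypothetical, and the META table covers only a handful of the ASCII meta-characters.

```python
import re
import string
from itertools import product

# Partial table of meta-characters (ASCII ones only, for illustration).
META = {
    "uppercase": string.ascii_uppercase,
    "lowercase": string.ascii_lowercase,
    "alphabetic": string.ascii_uppercase + string.ascii_lowercase,
    "numeric": string.digits,
    "alphanumeric": string.digits + string.ascii_uppercase + string.ascii_lowercase,
}

# A position is either a meta-character such as "&uppercase." or one literal character.
TOKEN = re.compile(r"&(\w+)\.|(.)", re.DOTALL)

def expand_index_point(spec):
    """Expand the content of one IndexPt section into explicit two-character pairs."""
    choices = []
    for name, literal in TOKEN.findall(spec):
        if name:
            if name not in META:
                raise ValueError("meta-character not in this sketch: " + name)
            choices.append(META[name])
        else:
            choices.append(literal)
    if len(choices) != 2:
        raise ValueError("an index point needs exactly two positions")
    return [first + second for first, second in product(*choices)]

print(expand_index_point(" &uppercase.")[:3])
```

Here the content ` &uppercase.' yields 26 pairs, each a space followed by one uppercase letter.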

A meta-character may appear in place of either the first character, the second character, or both. The following meta-characters are defined:

&printable.
All ASCII printable characters:
!@#$%^&*()_+~|1234567890-=`\{}:"<>?[];',./
abcdefghijklmnopqrstuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ
&ISO_printable.
All ASCII printable characters as defined above, plus printable characters from the ISO character set. Note that not every character with the 8th bit set is an ISO printable character. In octal, the characters are:
\241 \242 \243 \244 \245 \246 \247
\250 \251 \252 \253 \254 \255 \256 \257
\260 \261 \262 \263 \264 \265 \266 \267
\270 \271 \272 \273 \274 \275 \276 \277
\300 \301 \302 \303 \304 \305 \306 \307
\310 \311 \312 \313 \314 \315 \316 \317
\320 \321 \322 \323 \324 \325 \326 \327
\330 \331 \332 \333 \334 \335 \336 \337
\340 \341 \342 \343 \344 \345 \346 \347
\350 \351 \352 \353 \354 \355 \356 \357
\360 \361 \362 \363 \364 \365 \366 \367
\370 \371 \372 \373 \374 \375 \376 \377
&alphabetic.
Alphabetic characters A-Z and a-z.
&ISO_alphabetic.
ASCII alphabetic characters as defined above, plus ISO alphabetic characters:
\300 \301 \302 \303 \304 \305 \306 \307
\310 \311 \312 \313 \314 \315 \316 \317
\321 \322 \323 \324 \325 \326
\331 \332 \333 \334 \335
\340 \341 \342 \343 \344 \345 \346 \347
\350 \351 \352 \353 \354 \355 \356 \357
\361 \362 \363 \364 \365 \366
\371 \372 \373 \374 \375 \377
&uppercase.
Uppercase alphabetic characters A-Z.
&ISO_uppercase.
ASCII uppercase characters as defined above, plus ISO uppercase characters:
\300 \301 \302 \303 \304 \305 \306 \307
\310 \311 \312 \313 \314 \315 \316 \317
\321 \322 \323 \324 \325 \326
\331 \332 \333 \334 \335
&lowercase.
Lowercase alphabetic characters a-z.
&ISO_lowercase.
ASCII lowercase characters as defined above, plus ISO lowercase characters:
\340 \341 \342 \343 \344 \345 \346 \347
\350 \351 \352 \353 \354 \355 \356 \357
\361 \362 \363 \364 \365 \366
\371 \372 \373 \374 \375 \377
&alphanumeric.
The alphabetic and numeric characters: 0-9, A-Z, and a-z.
&ISO_alphanumeric.
The ISO_alphabetic characters as defined above, plus 0-9.
&special.
Non-alphanumeric ASCII printable characters: !@#$%^&*()_+~|-=`\{}[]:";'<>?,./
&ISO_special.
The ASCII special characters defined above, plus ISO special characters:
\241 \242 \243 \244 \245 \246 \247
\250 \251 \252 \253 \254 \255 \256 \257
\260 \261 \262 \263 \264 \265 \266 \267
\270 \271 \272 \273 \274 \275 \276 \277
\320 \327
\330 \336 \337
\360 \367 \370 \376
&all.
Every 7-bit character, including `\0'.
&ISO_all.
Every 8-bit character, including `\0'.
&numeric.
The numeric digits: 0123456789.

The following meta-characters represent single characters which are special in the syntax of the Data Dictionary:

&amp.          &    
&backspace.    \b
&lt.           <    
&gt.           >    
&return.       \r
&newline.      \n
&tab.          \t

The following meta-characters are defined for Unicode support. Note that the code points are specified in ranges using the Unicode `U+' notation.

&printable.
All ASCII printable characters:
!@#$%^&*()_+~|1234567890-=`\{}:"<>?[];',./
abcdefghijklmnopqrstuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ
&special.
Non-alphanumeric ASCII printable characters:
!@#$%^&*()_+~|-=`\{}[]:";'<>?,./

The following meta-characters represent single characters which are special in the syntax of the Data Dictionary:

 
&amp.          &
&backspace.    \b
&lt.           <
&gt.           >
&return.       \r
&newline.      \n
&tab.          \t

The following scripts are based on UnicodeData.txt and Perl 5.8 unicore/lib files.

&Latin.
U+0041-U+005A U+0061-U+007A U+00AA-U+00AA U+00BA-U+00BA
U+00C0-U+00D6 U+00D8-U+00F6 U+00F8-U+0220 U+0222-U+0233
U+0250-U+02AD U+02B0-U+02B8 U+02E0-U+02E4 U+1E00-U+1E9B
U+1EA0-U+1EF9 U+2071-U+2071 U+207F-U+207F U+212A-U+212B
U+FB00-U+FB06 U+FF21-U+FF3A U+FF41-U+FF5A
&Armenian.
U+0531-U+0556 U+0559-U+0559 U+0561-U+0587 U+FB13-U+FB17
&Bengali.
U+0981-U+0983 U+0985-U+098C U+098F-U+0990 U+0993-U+09A8
U+09AA-U+09B0 U+09B2-U+09B2 U+09B6-U+09B9 U+09BC-U+09BC
U+09BE-U+09C4 U+09C7-U+09C8 U+09CB-U+09CD U+09D7-U+09D7
U+09DC-U+09DD U+09DF-U+09E3 U+09E6-U+09F1
&Bopomofo.
U+3105-U+312C U+31A0-U+31B7
&Buhid.
U+1740-U+1753
&Cherokee.
U+13A0-U+13F4
&Cyrillic.
U+0400-U+0481 U+048A-U+04CE U+04D0-U+04F5 U+04F8-U+04F9
U+0500-U+050F
&Devanagari.
U+0901-U+0903 U+0905-U+0939 U+093C-U+094D U+0950-U+0954
U+0958-U+0963 U+0966-U+096F
&Ethiopic.
U+1200-U+1206 U+1208-U+1246 U+1248-U+1248 U+124A-U+124D
U+1250-U+1256 U+1258-U+1258 U+125A-U+125D U+1260-U+1286
U+1288-U+1288 U+128A-U+128D U+1290-U+12AE U+12B0-U+12B0
U+12B2-U+12B5 U+12B8-U+12BE U+12C0-U+12C0 U+12C2-U+12C5
U+12C8-U+12CE U+12D0-U+12D6 U+12D8-U+12EE U+12F0-U+130E
U+1310-U+1310 U+1312-U+1315 U+1318-U+131E U+1320-U+1346
U+1348-U+135A U+1369-U+137C
&Georgian.
U+10A0-U+10C5 U+10D0-U+10F8
&Greek.
U+00B5-U+00B5 U+037A-U+037A U+0386-U+0386 U+0388-U+038A
U+038C-U+038C U+038E-U+03A1 U+03A3-U+03CE U+03D0-U+03F5
U+1F00-U+1F15 U+1F18-U+1F1D U+1F20-U+1F45 U+1F48-U+1F4D
U+1F50-U+1F57 U+1F59-U+1F59 U+1F5B-U+1F5B U+1F5D-U+1F5D
U+1F5F-U+1F7D U+1F80-U+1FB4 U+1FB6-U+1FBC U+1FBE-U+1FBE
U+1FC2-U+1FC4 U+1FC6-U+1FCC U+1FD0-U+1FD3 U+1FD6-U+1FDB
U+1FE0-U+1FEC U+1FF2-U+1FF4 U+1FF6-U+1FFC U+2126-U+2126
&Gujarati.
U+0A81-U+0A83 U+0A85-U+0A8B U+0A8D-U+0A8D U+0A8F-U+0A91
U+0A93-U+0AA8 U+0AAA-U+0AB0 U+0AB2-U+0AB3 U+0AB5-U+0AB9
U+0ABC-U+0AC5 U+0AC7-U+0AC9 U+0ACB-U+0ACD U+0AD0-U+0AD0
U+0AE0-U+0AE0 U+0AE6-U+0AEF
&Gurmukhi.
U+0A02-U+0A02 U+0A05-U+0A0A U+0A0F-U+0A10 U+0A13-U+0A28
U+0A2A-U+0A30 U+0A32-U+0A33 U+0A35-U+0A36 U+0A38-U+0A39
U+0A3C-U+0A3C U+0A3E-U+0A42 U+0A47-U+0A48 U+0A4B-U+0A4D
U+0A59-U+0A5C U+0A5E-U+0A5E U+0A66-U+0A74
&Hangul.
U+1100-U+1159 U+115F-U+11A2 U+11A8-U+11F9 U+3131-U+318E
U+AC00-U+D7A3 U+FFA0-U+FFBE U+FFC2-U+FFC7 U+FFCA-U+FFCF
U+FFD2-U+FFD7 U+FFDA-U+FFDC
&Han.
U+2E80-U+2E99 U+2E9B-U+2EF3 U+2F00-U+2FD5 U+3005-U+3005
U+3007-U+3007 U+3021-U+3029 U+3038-U+303B U+3400-U+4DB5
U+4E00-U+9FA5 U+F900-U+FA2D U+FA30-U+FA6A
&Hanunoo.
U+1720-U+1734
&Hebrew.
U+05D0-U+05EA U+05F0-U+05F2 U+FB1D-U+FB1D U+FB1F-U+FB28
U+FB2A-U+FB36 U+FB38-U+FB3C U+FB3E-U+FB3E U+FB40-U+FB41
U+FB43-U+FB44 U+FB46-U+FB4F
&Hiragana.
U+3041-U+3096 U+309D-U+309F
&Kannada.
U+0C82-U+0C83 U+0C85-U+0C8C U+0C8E-U+0C90 U+0C92-U+0CA8
U+0CAA-U+0CB3 U+0CB5-U+0CB9 U+0CBE-U+0CC4 U+0CC6-U+0CC8
U+0CCA-U+0CCD U+0CD5-U+0CD6 U+0CDE-U+0CDE U+0CE0-U+0CE1
U+0CE6-U+0CEF
&Katakana.
U+30A1-U+30FA U+30FD-U+30FF U+31F0-U+31FF U+FF66-U+FF6F
U+FF71-U+FF9D
&Khmer.
U+1780-U+17D3 U+17E0-U+17E9
&Lao.
U+0E81-U+0E82 U+0E84-U+0E84 U+0E87-U+0E88 U+0E8A-U+0E8A
U+0E8D-U+0E8D U+0E94-U+0E97 U+0E99-U+0E9F U+0EA1-U+0EA3
U+0EA5-U+0EA5 U+0EA7-U+0EA7 U+0EAA-U+0EAB U+0EAD-U+0EB9
U+0EBB-U+0EBD U+0EC0-U+0EC4 U+0EC6-U+0EC6 U+0EC8-U+0ECD
U+0ED0-U+0ED9 U+0EDC-U+0EDD
&Malayalam.
U+0D02-U+0D03 U+0D05-U+0D0C U+0D0E-U+0D10 U+0D12-U+0D28
U+0D2A-U+0D39 U+0D3E-U+0D43 U+0D46-U+0D48 U+0D4A-U+0D4D
U+0D57-U+0D57 U+0D60-U+0D61 U+0D66-U+0D6F
&Mongolian.
U+1810-U+1819 U+1820-U+1877 U+1880-U+18A9
&Myanmar.
U+1000-U+1021 U+1023-U+1027 U+1029-U+102A U+102C-U+1032
U+1036-U+1039 U+1040-U+1049 U+1050-U+1059
&Oriya.
U+0B01-U+0B03 U+0B05-U+0B0C U+0B0F-U+0B10 U+0B13-U+0B28
U+0B2A-U+0B30 U+0B32-U+0B33 U+0B36-U+0B39 U+0B3C-U+0B43
U+0B47-U+0B48 U+0B4B-U+0B4D U+0B56-U+0B57 U+0B5C-U+0B5D
U+0B5F-U+0B61 U+0B66-U+0B6F
&Runic.
U+16A0-U+16EA U+16EE-U+16F0
&Sinhala.
U+0D82-U+0D83 U+0D85-U+0D96 U+0D9A-U+0DB1 U+0DB3-U+0DBB
U+0DBD-U+0DBD U+0DC0-U+0DC6 U+0DCA-U+0DCA U+0DCF-U+0DD4
U+0DD6-U+0DD6 U+0DD8-U+0DDF U+0DF2-U+0DF3
&Syriac.
U+0710-U+072C U+0730-U+074A
&Tagalog.
U+1700-U+170C U+170E-U+1714
&Tagbanwa.
U+1760-U+176C U+176E-U+1770 U+1772-U+1773
&Tamil.
U+0B82-U+0B83 U+0B85-U+0B8A U+0B8E-U+0B90 U+0B92-U+0B95
U+0B99-U+0B9A U+0B9C-U+0B9C U+0B9E-U+0B9F U+0BA3-U+0BA4
U+0BA8-U+0BAA U+0BAE-U+0BB5 U+0BB7-U+0BB9 U+0BBE-U+0BC2
U+0BC6-U+0BC8 U+0BCA-U+0BCD U+0BD7-U+0BD7 U+0BE7-U+0BF2
&Telugu.
U+0C01-U+0C03 U+0C05-U+0C0C U+0C0E-U+0C10 U+0C12-U+0C28
U+0C2A-U+0C33 U+0C35-U+0C39 U+0C3E-U+0C44 U+0C46-U+0C48
U+0C4A-U+0C4D U+0C55-U+0C56 U+0C60-U+0C61 U+0C66-U+0C6F
&Thaana.
U+0780-U+07B1
&Thai.
U+0E01-U+0E3A U+0E40-U+0E4E U+0E50-U+0E59
&Tibetan.
U+0F00-U+0F00 U+0F18-U+0F19 U+0F20-U+0F33 U+0F35-U+0F35
U+0F37-U+0F37 U+0F39-U+0F39 U+0F40-U+0F47 U+0F49-U+0F6A
U+0F71-U+0F84 U+0F86-U+0F8B U+0F90-U+0F97 U+0F99-U+0FBC
U+0FC6-U+0FC6
&UnifiedIdeograph.
U+3400-U+4DB5 U+4E00-U+9FA5 U+FA0E-U+FA0F U+FA11-U+FA11
U+FA13-U+FA14 U+FA1F-U+FA1F U+FA21-U+FA21 U+FA23-U+FA24
U+FA27-U+FA29
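
The range listings above are easy to process mechanically. The following sketch parses one listing and tests whether a character falls inside it. The names parse_ranges and in_script are hypothetical, and the sketch says nothing about how xpatbldu itself stores these tables.

```python
import re

# One range in the notation used above, e.g. "U+0531-U+0556".
RANGE = re.compile(r"U\+([0-9A-Fa-f]{4,6})-U\+([0-9A-Fa-f]{4,6})")

def parse_ranges(listing):
    """Turn a listing of U+XXXX-U+YYYY ranges into (low, high) integer pairs."""
    return [(int(lo, 16), int(hi, 16)) for lo, hi in RANGE.findall(listing)]

def in_script(ch, ranges):
    """Return True if the code point of ch lies in any of the inclusive ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ranges)

# The &Armenian. listing from this man page.
ARMENIAN = parse_ranges("U+0531-U+0556 U+0559-U+0559 U+0561-U+0587 U+FB13-U+FB17")
print(in_script("\u0561", ARMENIAN))
```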

Mappings SECTION

The Mappings section is enclosed by <Mappings> and </Mappings> tags. It consists of two distinct parts. The first part is a list of Map sections, each of which maps a character, enclosed by <From> and </From> tags, to another character, enclosed by <To> and </To> tags. The most common use is to map uppercase letters into their lowercase equivalents, or punctuation into spaces.

It is also possible to map ranges of characters to their lower case equivalents (where this concept is applicable). The beginning character of the range enclosed in <First> and </First> is followed by the last character in the range enclosed in <Last> and </Last>. These two tag pairs are enclosed by <CharRange> and </CharRange>. The <CharRange> tag pair is enclosed by the <From> and <To> tag pairs as described for a single character above. For example:

       <From>
         <CharRange>
           <First>A</First>
           <Last>Z</Last>
         </CharRange> 
       </From>

       <To>
         <CharRange>
           <First>a</First>
           <Last>z</Last>
         </CharRange> 
       </To>
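
The range mapping in this example can be thought of as shorthand for 26 single-character Map sections. The sketch below makes that expansion explicit. It assumes that the two ranges are paired position by position and must have the same length; this man page does not state that rule, and the function name expand_char_range is hypothetical.

```python
def expand_char_range(from_first, from_last, to_first, to_last):
    """Expand a From/To pair of character ranges into a per-character mapping."""
    src = range(ord(from_first), ord(from_last) + 1)
    dst = range(ord(to_first), ord(to_last) + 1)
    # Assumption: ranges are matched by position, so they must be equally long.
    if len(src) != len(dst):
        raise ValueError("From and To ranges differ in length")
    return {chr(s): chr(d) for s, d in zip(src, dst)}

mapping = expand_char_range("A", "Z", "a", "z")
print(mapping["A"], mapping["Z"])
```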

Note: When xpat starts up, it first builds an initial map which maps all non-ASCII and all non-printable characters to NULL. xpat then reads the user-defined character mappings defined in the Mappings section and adds those specifications to the initial map. The user-defined mappings override the default mappings. One use of character mappings is to map selected non-printable characters to themselves. This effectively undoes the NULL mapping that xpat creates for those characters by default.
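
The two-step construction described in this note can be sketched as follows. This is an illustration under several assumptions that are not stated in this man page: characters are treated as 8-bit codes, ``printable ASCII'' is taken to be the range 0x20 to 0x7E, printable characters initially map to themselves, and NULL is code 0. The function name build_map is hypothetical.

```python
def build_map(user_mappings):
    """Build a 256-entry character map: defaults first, then user overrides."""
    table = {}
    for code in range(256):
        # Assumption: printable ASCII is 0x20..0x7E; everything else maps to NULL (0).
        printable = 0x20 <= code <= 0x7E
        table[code] = code if printable else 0
    # User-defined mappings override the defaults; an empty target means NULL.
    for src, dst in user_mappings.items():
        table[ord(src)] = ord(dst) if dst else 0
    return table

# Map A to a, map tab to itself (undoing its default NULL mapping), map ! to NULL.
table = build_map({"A": "a", "\t": "\t", "!": ""})
print(table[ord("A")], table[9], table[ord("!")])
```

Mapping the tab character to itself is the case mentioned above: it restores a non-printable character that the initial map would otherwise discard.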

Two escape mechanisms exist to specify non-printable characters in the From and the To fields. The first mechanism is octal specification. Each octal specification consists of a backslash followed by three octal digits (e.g., `\003' for `^C'). The second mechanism is entity reference specification. The following table illustrates the entity references that can be used. The characters in the right-hand column can be specified using the corresponding entity reference in the left-hand column:

&amp.       &
&backspace. \b
&lt.        <
&gt.        >
&return.    \r
&newline.   \n
&tab.       \t

Each of the From and To fields can contain at most one character, one octal code, or one entity reference. If a To field is empty, it means that the corresponding From character should be mapped to NULL.
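
The rules above for the content of a From or To field can be sketched as a small decoder. The function name decode_map_field is hypothetical. Treating an empty From field the same way as an empty To field is an assumption; this man page defines the empty case only for To.

```python
import re

# Entity references from the table above.
ENTITIES = {
    "&amp.": "&", "&backspace.": "\b", "&lt.": "<", "&gt.": ">",
    "&return.": "\r", "&newline.": "\n", "&tab.": "\t",
}

def decode_map_field(value):
    """Decode the content of a From or To field into a single character."""
    if value == "":
        return "\0"                      # empty field: NULL
    if value in ENTITIES:
        return ENTITIES[value]           # entity reference
    if re.fullmatch(r"\\[0-7]{3}", value):
        return chr(int(value[1:], 8))    # backslash plus three octal digits
    if len(value) == 1:
        return value                     # one literal character
    raise ValueError("not a single character, octal code, or entity: %r" % value)

print(repr(decode_map_field("\\003")), repr(decode_map_field("&tab.")))
```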

The second part of the Mappings section is a list of stopwords - words which are not indexed. The words themselves are enclosed by <Ignore> and </Ignore> tags. The whole list is enclosed by <StopWords> and </StopWords> tags. Note that stopwords are not supported by xpatbldu, the Unicode enabled version of xpatbld.
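
As an illustration of the stopword list layout, the following sketch collects the words from a Mappings section. The function name stop_words is hypothetical, and the sample text is a shortened version of the list in the EXAMPLES section.

```python
import re

def stop_words(mappings_section):
    """Return the words listed in <Ignore> tags inside <StopWords>."""
    block = re.search(r"<StopWords>(.*?)</StopWords>", mappings_section, re.DOTALL)
    if block is None:
        return []
    return re.findall(r"<Ignore>(.*?)</Ignore>", block.group(1), re.DOTALL)

sample = """
<Map><From>A</From><To>a</To></Map>
<StopWords>
  <Ignore>a</Ignore>
  <Ignore>an</Ignore>
  <Ignore>and</Ignore>
</StopWords>
"""
print(stop_words(sample))
```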

IntegrityCheck FIELD

The IntegrityCheck field is enclosed by <IntegrityCheck> and </IntegrityCheck> tags. This field contains a single number that encodes relevant information about the indexing parameters to ensure that the descriptive information in the Data Dictionary matches the information used to actually create the index. It is maintained by the programs that build and maintain indices (e.g., xpatbld and xpatmaint). The IntegrityCheck value is also checked by xpat on startup. If an integrity error is detected, xpat will print an error message to that effect and will not search the database.

Regions SECTION

The Regions section is enclosed by <Regions> and </Regions> tags. It usually contains one or more Region sections. However, it may be empty or omitted if no regions are defined.

Region SECTION

Each Region section is enclosed by <Region> and </Region> tags. It contains information defining a region of the database. Regions are used by xpat in the ``within'' and ``including'' commands (refer to the XPat Reference Manual and Tutorial for more information).

Each Region section has zero or more FastRegion sections, a Name field, a Desc field, a File section, a Count field, and a Type field.

FastRegion SECTION

Each FastRegion section is enclosed by <FastRegion> and </FastRegion> tags. Each FastRegion section contains information defining the FastRegion index between the enclosing region and a specific Main Index. Within a particular Region section, there can be at most one FastRegion section for each Index section in the Data Dictionary. The FastRegion sections are created by the xpatfr program when it builds the FastRegion indices.

Each FastRegion section contains a File section and an IndexName section.

File SECTION

The File section is enclosed by <File> and </File> tags. It specifies the actual file that contains the FastRegion index data for the enclosing FastRegion section. The contents of the File section are described in the FastFindCompression section above. Refer to that section for details.

IndexName SECTION

The IndexName section is enclosed by <IndexName> and </IndexName> tags. It specifies the name of the Main Index in this Data Dictionary for which this particular FastRegion index was built. The index name in this field has to be the same as the Name in one of the Index sections in this Data Dictionary. This field can be empty if the FastRegion was built on the default index (which does not have a name).
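
The consistency rule above can be checked mechanically. The sketch below assumes that the Name fields of all Index sections and the IndexName fields of the FastRegion sections have already been extracted into lists of strings; the function name unknown_index_names is hypothetical.

```python
def unknown_index_names(index_names, fast_region_index_names):
    """Return the IndexName values that match no Index section Name."""
    known = set(index_names)
    # An empty IndexName is accepted only when some Index has an empty Name.
    return [name for name in fast_region_index_names if name not in known]

# Hypothetical values: a default (unnamed) index and one named "word".
print(unknown_index_names(["", "word"], ["", "word", "phrase"]))
```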

Name FIELD

The Name field is enclosed by <Name> and </Name> tags. It contains the name by which that region is referenced in xpat.

Desc FIELD

The Desc field is enclosed by <Desc> and </Desc> tags. It contains a description of the region and may be empty or omitted.

File SECTION

The File section is enclosed by <File> and </File> tags. It indicates where to find the file containing the region's pointers into the text. The contents of the File section are described in the FastFindCompression section above. Refer to that section for details. Note that the Offset field within the File section may be non-zero. This is because the region building programs place the index data for several regions inside a single file. The Offset specifies where in that file the current region's segment begins.

Count FIELD

The Count field is enclosed by <Count> and </Count> tags. It gives the number of pointers for this region. Note that this number is twice the number of regions defined because each region in a region set consists of a start pointer and an end pointer.
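
The relationship between the Count field and the number of regions is simple arithmetic, sketched below. The function name region_count is hypothetical. The value 672 is the Count of the `cmd' region in the EXAMPLES section.

```python
def region_count(count_field):
    """Convert a Count field (number of pointers) into a number of regions."""
    pointers = int(count_field)
    # Each region contributes a start pointer and an end pointer.
    if pointers % 2 != 0:
        raise ValueError("Count should be even for the `pairs' type")
    return pointers // 2

print(region_count("672"))
```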

Type FIELD

The Type field is enclosed by <Type> and </Type> tags. The only type that is currently supported is the `pairs' type (where each region is explicitly defined by a start and an end pointer).

Grammar SECTION

This section is enclosed by <Grammar> and </Grammar> tags and is reserved for future XPAT use.

Display SECTION

This section is enclosed by <Display> and </Display> tags and is reserved for future XPAT use.

EXAMPLES

The following is the Data Dictionary for a complete database. Note that parts of some sections have been removed to reduce the size of the example.

<DB>
  <Thesaurus>/usr/ot/default.the</Thesaurus>
  <Text>
    <MfsFiles>
      <FileMap>mydb</FileMap>
      <FilterChain>
        <SearchView>meta</SearchView>
        <DisplayView>meta</DisplayView>
        <RawView>meta</RawView>
        <DisplayFmt>ASCII</DisplayFmt>
        <DefaultDataTag></DefaultDataTag>
        <FileGroup>
          <MfsDir>data</MfsDir>
          <MfsFile>*.txt</MfsFile>
          <MfsExpand>file</MfsExpand>
        </FileGroup>
      </FilterChain>
    </MfsFiles>
  </Text>
  <Indices>
    <Index>
      <Name>default</Name>
      <File>
        <SysName>/usr/ot/manual/def.idx</SysName>
        <ModDate>679335524</ModDate>
        <Offset>0</Offset>
      </File>
      <InitFile>/usr/ot/manual/init</InitFile>
      <IndexPoints>
        <IndexPt> &alphanumeric.</IndexPt>
      </IndexPoints>
      <Mappings>
        <Map><From></From><To></To></Map>
        <Map><From>&backspace.</From><To> </To></Map>
        <Map><From>&tab.</From><To> </To></Map>
        <Map><From>&newline.</From><To> </To></Map>
        <Map><From>&return.</From><To> </To></Map>
        <Map><From>!</From><To> </To></Map>
        <Map><From>"</From><To> </To></Map>
        <Map><From>#</From><To> </To></Map>
        <Map><From>$</From><To> </To></Map>
        <Map><From>%</From><To> </To></Map>
        <Map><From>&amp.</From><To> </To></Map>
            ...Note: Some text deleted.
        <Map><From>A</From><To>a</To></Map>
        <Map><From>B</From><To>b</To></Map>
        <Map><From>C</From><To>c</To></Map>
        <Map><From>D</From><To>d</To></Map>
        <Map><From>E</From><To>e</To></Map>
              ...   Note: Some text deleted.
        <Map><From>~</From><To> </To></Map>
        <StopWords>
          <Ignore>a</Ignore>
          <Ignore>an</Ignore>
          <Ignore>and</Ignore>
          <Ignore>are</Ignore>
                ...   Note: Some text deleted.
          <Ignore>with</Ignore>
        </StopWords>
      </Mappings>
      <LongestMatch>
        <Length>0</Length>
        <Resolution>0</Resolution>
      </LongestMatch>
      <IntegrityCheck>1846024038</IntegrityCheck>
    </Index>
    <Index>
      <Name>word</Name>
      <File>
        <SysName>/usr/ot/manual/word.idx</SysName>
        <ModDate>679335592</ModDate>
        <Offset>0</Offset>
      </File>
      <IndexPoints>
        <IndexPt> &printable.</IndexPt>
        <IndexPt>-&alphanumeric.</IndexPt>
        <IndexPt>&alphanumeric.-</IndexPt>
        <IndexPt>&printable.&lt.</IndexPt>
      </IndexPoints>
      <Mappings>
        <Map><From></From><To></To></Map>
        <Map><From>&backspace.</From><To> </To></Map>
        <Map><From>&tab.</From><To> </To></Map>
        <Map><From>&newline.</From><To> </To></Map>
        <Map><From>&return.</From><To> </To></Map>
        <Map><From>!</From><To> </To></Map>
              ...   Note: Some text deleted.
        <Map><From>~</From><To> </To></Map>
        <StopWords></StopWords>
      </Mappings>
      <LongestMatch>
        <Length>0</Length>
        <Resolution>0</Resolution>
      </LongestMatch>
      <IntegrityCheck>736122026</IntegrityCheck>
    </Index>
  </Indices>
  <Regions>
    <Region>
      <Name>cmd</Name>
      <Desc>Illustrations of xpat commands.</Desc>
      <File>
        <SysName>/usr/ot/manual/rgn.cmd</SysName>
        <ModDate>679335629</ModDate>
        <Offset>0</Offset>
      </File>
      <Count>672</Count>
      <Type>pairs</Type>
    </Region>
    <Region>
      ...   Note: Some text deleted.
    </Region>
    <Region>
      ...   Note: Some text deleted.
    </Region>
  </Regions>
</DB>

BACKWARD-COMPATIBILITY

The following paragraphs describe the contents of the Files section used in Release 4.x Data Dictionaries.

Files SECTION

The Files section is enclosed by <Files> and </Files> tags. It contains one File section that describes the text file (in Release 4.x databases, the text was in ASCII or tagged ASCII format, and was in a single file). The contents of the File section are described in the FastFindCompression section above. Refer to that section for details.

FILES

The following sections of the Data Dictionary reference specific files:

<Thesaurus>                  thesaurus file
<Text><Files>                database's text
<Indices><Index><File>       index over the database
<Indices><Index><InitFile>   initialization commands for xpat
<Regions><Region><File>      region files

SEE ALSO

xpat(1), xpatbld(1), xpatmaint(1), xpatrgn(1), multirgn(1), sgmlrgn(1), xpat_export(5), regions(5)
