MFS

Section: File Formats (5)
Updated: November 2000

NAME

mfs - XPAT Multi-File/Filter Support system

DESCRIPTION

The MFS (Multi-File/Filter Support) system is an extension to the I/O subsystem that is used by all XPAT index-building and search programs. It allows these programs to handle databases consisting of many separate files in multiple formats. Conceptually, the MFS system extracts the text from all the individual files, adds some structure to those extracted pieces of text, and groups them all together as a single `virtual text'. The XPAT programs then operate on that virtual text as if it were a single file. The MFS system creates this virtual text through the use of a FileMap, which is essentially a directory of all the files in the database, along with extra information about each file (such as the timestamps).

The FileMap must be built before any of the other index-building and search programs can be used. This is done with the mfsbld program, which reads the MFS specifications from the database's Data Dictionary. These specifications are described in the next section.

MFS SECTIONS IN THE DATA DICTIONARY

The difference between the Data Dictionary of a regular database and that of an MFS database lies in the contents of the `Text' section (which is enclosed by <Text> and </Text> tags). In a regular database, this section contains a `Files' section which specifies the database's text file. In an MFS database, the `Files' section is replaced by an `MfsFiles' section. This `MfsFiles' section specifies all the files that make up the MFS database, along with extra information for each one, such as the method used to extract its text.

The following sections describe the contents of the `MfsFiles' section in the Data Dictionary. You may want to refer to the example at the end of this man page while reading these descriptions. Refer to the Database Administration Guide for more details on the `Files' section used in regular databases.

Each `MfsFiles' section contains a `FileMap' field and one or more `FilterChain' sections.

-

The `FileMap' field specifies the prefix of the FileMap files. These files have the suffixes `.fmp', `.lmp' and `.xmp'. When mfsbld creates these files, it names them with the prefix specified in the `FileMap' field in the Data Dictionary.

-

The `FilterChain' section defines the files and data filtering parameters for one type of file in the database. Most MFS databases consist of files of several different file types; each file type is defined in its own `FilterChain' section. Each `FilterChain' section defines three types of parameters: the filter chains for the required database views, the Display View Format label, and the files in the database of the file type being defined.

A filter chain is a group of data filters that perform some transformation on the data. In doing so, each filter chain provides a view of the data. For example, a filter chain that extracts the text from a word processor file provides a view of the text of the file. In contrast, a filter chain that extracts the text along with the formatting information provides a formatted view of the data. Each `FilterChain' section in the Data Dictionary contains filter chain definitions for three database views. These filter chains are contained in the `SearchView', `DisplayView', and `RawView' fields. In addition to those fields, each `FilterChain' section contains a `DisplayFmt' field, an optional `DefaultDataTag' field, and one or more `FileGroup' sections.

-

The `SearchView' field specifies the filter chain for the Search View of the database. The Search View is used for indexing and search purposes. The filter chain is written as a list of filters in the style of Unix pipeline specifications. For example, the `SearchView' field

<SearchView>wfw(05,3)|meta</SearchView>

describes two filters. The first filter is the `wfw' filter with the parameter `(05,3)'. Because this filter is the first one in the chain, it reads the data files directly. The `wfw' filter is actually a ``super-filter'' supplied by XPAT, which can extract the text from a wide variety of data formats. The second filter is the `meta' filter. It receives its input from the `wfw' filter and sends its output to the indexing and search programs. The `meta' filter is a special filter, also supplied by XPAT, that surrounds the text coming from the previous filters in the chain with tags. These tags provide file-level structure to text that may be otherwise unstructured. It also inserts extra fields, called meta-fields, in front of that text. These meta-fields contain the FileMap information for the file, which is very useful to user interfaces.

The `meta' filter can generate the meta-fields very quickly (usually much faster than the other filters in the chain can generate the filtered text). For this reason, the meta-fields are usually used to build summary lists in the user interfaces. You should always terminate every Search View filter chain with the `meta' filter. Refer to the Database Administration Guide for further details on the `wfw' and `meta' filters. Refer to the section below on User Meta Data for more information on adding custom information to the meta-fields.
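
The pipeline notation described above can be illustrated with a short Python sketch. This is not XPAT's own parser: the grammar is inferred only from the chains shown in this man page, and the function names `parse_filter_chain` and `ends_with_meta` are hypothetical.

```python
import re

# Assumed grammar, inferred from the examples in this page only:
# a filter is a name, optionally followed by a parenthesized parameter.
FILTER_RE = re.compile(r'^\s*([A-Za-z_]\w*)\s*(?:\((.*)\))?\s*$')


def parse_filter_chain(chain):
    """Split a chain such as 'wfw(05,3)|meta' into (name, parameter) pairs.

    The split on '|' is naive: a '|' inside a quoted parameter would be
    mishandled. The parameter is None when the filter has no parentheses.
    """
    filters = []
    for part in chain.split('|'):
        match = FILTER_RE.match(part)
        if match is None:
            raise ValueError('unrecognized filter: %r' % part)
        filters.append((match.group(1), match.group(2)))
    return filters


def ends_with_meta(chain):
    """Check the recommendation that a chain ends with the 'meta' filter."""
    return parse_filter_chain(chain)[-1][0] == 'meta'
```

A check like `ends_with_meta` could be run over the `SearchView' and `DisplayView' fields before building the database, to catch chains that omit the final `meta' filter.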

-

The `DisplayView' field contains the filter chain that provides the Display View. This view is used by the user interfaces to display the text in a formatted form. Refer to the Database Administration Guide for further details and examples of the Display View. As with the Search View, this filter chain should always end with the `meta' filter.

-

The `RawView' field contains the filter chain that provides the Raw View. This view is currently unused by XPAT user interfaces, but is reserved for future use. You should set it to be the same as the DisplayView filter chain (because each view definition must contain at least one filter).

-

The `DisplayFmt' field contains a short label that identifies the Display View output format. This label is necessary in databases where several different Display View data formats may be produced (e.g., SGML text from SGML files, raw word processor data from the word processor files, and GIF from the graphics files). The user interface may need to send each of these data formats to its own viewer program. Since the `DisplayFmt' field is one of the meta-fields that the `meta' filter generates, the user interface can use it to automatically send the Display View data to the right viewer program.

You can put any label you want in this field, as long as you configure the user interface to send data with that label to the correct viewer. Short identifiers, such as `sgml', are recommended to minimize the size of the FileMap files (since the Display Format label for each file is kept in that file's entry in the FileMap).
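
As a rough illustration of how a user interface might act on the `DisplayFmt' label, the following Python sketch maps labels to viewer commands. The viewer program names and the function `viewer_command` are placeholders invented for this example; they are not part of XPAT.

```python
# Hypothetical table: DisplayFmt label -> viewer command (placeholder names).
VIEWERS = {
    'sgml': ['sgml-viewer'],
    'wp': ['wp-viewer'],
    'gif': ['image-viewer'],
}


def viewer_command(display_fmt, path):
    """Return the command line that would display 'path' for this label."""
    try:
        return VIEWERS[display_fmt] + [path]
    except KeyError:
        raise ValueError('no viewer configured for label %r' % display_fmt)
```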

-

The `DefaultDataTag' field contains the default data tag name that will be used to construct the OTDefaultData regions. If this field is not specified, the default tag name OTData will be assumed.

-

The `FileGroup' sections specify the files which are to be included in the database, for the enclosing `FilterChain' section. Each `FileGroup' section defines a directory and a file pattern within that directory. Multiple `FileGroup' sections may be used to define different directories and patterns. Each `FileGroup' section contains an `MfsDir' field, an `MfsFile' field and an `MfsExpand' field.

-

The `MfsDir' field defines the path to the directory which contains the files.

-

The `MfsFile' field defines the file pattern which specifies the files. This pattern consists of a complete or partial filename which may contain the following wildcards:

-: `*', which matches any string, including the null string. For example, `*.wp' will match any filename ending with a `.wp' extension.
-: `?', which matches a single character. For example, `A??.wp' will match any filename starting with an `A', followed by any two characters, and ending with a `.wp' extension.
-: `[...]', which matches any one of the characters enclosed between `[' and `]'. For example, `str[XYZ]' will match the filenames `strX', `strY' and `strZ'. In addition, the character `-' may be used between two characters to specify the range between (and including) those two characters. For example, `str[1-5]' will match the filenames `str1', `str2', ..., `str5'.
-: `\', which escapes special characters.
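
The wildcard rules above closely resemble shell-style matching, so they can be tried out with Python's standard fnmatch module. This is only an approximation of mfsbld's matcher: in particular, fnmatch does not treat `\' as an escape character, so that rule is not covered here. The helper name `matches` is hypothetical.

```python
import fnmatch


def matches(pattern, filename):
    """Approximate an MfsFile pattern test with shell-style matching.

    fnmatchcase is case-sensitive on every platform. It supports '*',
    '?' and '[...]' (including ranges), but not backslash escapes.
    """
    return fnmatch.fnmatchcase(filename, pattern)


# The examples from the list above:
print(matches('*.wp', 'report.wp'))   # True
print(matches('A??.wp', 'Abc.wp'))    # True
print(matches('str[XYZ]', 'strY'))    # True
print(matches('str[1-5]', 'str6'))    # False
```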

-

The `MfsExpand' field defines whether mfsbld should look only in the directory specified by the `MfsDir' field or also in the entire subtree rooted at that directory, to find files matching the pattern. The valid keywords in the `MfsExpand' field are `tree' and `file'. The `tree' keyword tells mfsbld to search recursively through the directory and all of its subdirectories, while the `file' keyword tells mfsbld to search only the specified directory itself.
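
The difference between the two keywords can be sketched in Python as follows. This illustrates the described behavior and is not mfsbld's implementation; the function `find_files` is hypothetical, and the pattern matching uses fnmatch as an approximation of the `MfsFile' wildcards.

```python
import fnmatch
import os


def find_files(mfs_dir, mfs_file, mfs_expand):
    """List files under mfs_dir whose names match the mfs_file pattern.

    'file' looks only in mfs_dir itself; 'tree' also descends into
    every subdirectory below it.
    """
    if mfs_expand == 'file':
        found = []
        for name in os.listdir(mfs_dir):
            path = os.path.join(mfs_dir, name)
            if os.path.isfile(path) and fnmatch.fnmatchcase(name, mfs_file):
                found.append(path)
        return sorted(found)
    if mfs_expand == 'tree':
        found = []
        for root, _dirs, names in os.walk(mfs_dir):
            for name in names:
                if fnmatch.fnmatchcase(name, mfs_file):
                    found.append(os.path.join(root, name))
        return sorted(found)
    raise ValueError("MfsExpand must be 'tree' or 'file'")
```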

MFS DATA DICTIONARY EXAMPLE

This is an example `Text' section of a Data Dictionary for an MFS database.

  <Text>
    <MfsFiles>
      <FileMap>amap</FileMap>
      <FilterChain>
        <SearchView>wfw(05,0)|meta</SearchView>
        <DisplayView>meta</DisplayView>
        <RawView>meta</RawView>
        <DisplayFmt>wp</DisplayFmt>
        <FileGroup>
          <MfsDir>/usr/wordprocessor/doc</MfsDir>
          <MfsFile>*.wp</MfsFile>
          <MfsExpand>tree</MfsExpand>
        </FileGroup>
        <FileGroup>
          <MfsDir>../../mydata/texts</MfsDir>
          <MfsFile>MyFile*.[4-6]</MfsFile>
          <MfsExpand>file</MfsExpand>
        </FileGroup>
      </FilterChain>
      <FilterChain>
        <SearchView>sys("uncompress")|wfw(21,4X)|meta</SearchView>
        <DisplayView>sys("uncompress")|meta</DisplayView>
        <RawView>flat</RawView>
        <DisplayFmt>wfw</DisplayFmt>
        <FileGroup>
          <MfsDir>/u2/compressed/texts</MfsDir>
          <MfsFile>*.Z</MfsFile>
          <MfsExpand>tree</MfsExpand>
        </FileGroup>
      </FilterChain>
    </MfsFiles>
  </Text>
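
To see how the nesting of this section fits together, a shortened copy of the example can be read with Python's xml.etree.ElementTree. This sketch assumes that the `Text' section is well-formed XML, which this man page does not guarantee for every Data Dictionary; the function name `summarize` is hypothetical.

```python
import xml.etree.ElementTree as ET

# A shortened, well-formed copy of the example Text section.
TEXT_SECTION = """
<Text>
  <MfsFiles>
    <FileMap>amap</FileMap>
    <FilterChain>
      <SearchView>wfw(05,0)|meta</SearchView>
      <DisplayView>meta</DisplayView>
      <RawView>meta</RawView>
      <DisplayFmt>wp</DisplayFmt>
      <FileGroup>
        <MfsDir>/usr/wordprocessor/doc</MfsDir>
        <MfsFile>*.wp</MfsFile>
        <MfsExpand>tree</MfsExpand>
      </FileGroup>
    </FilterChain>
  </MfsFiles>
</Text>
"""


def summarize(text_section):
    """Return the FileMap prefix and, per FilterChain, its file groups."""
    root = ET.fromstring(text_section.strip())
    mfs = root.find('MfsFiles')
    chains = []
    for chain in mfs.findall('FilterChain'):
        groups = [(g.findtext('MfsDir'), g.findtext('MfsFile'),
                   g.findtext('MfsExpand'))
                  for g in chain.findall('FileGroup')]
        chains.append({'search': chain.findtext('SearchView'),
                       'format': chain.findtext('DisplayFmt'),
                       'groups': groups})
    return {'filemap': mfs.findtext('FileMap'), 'chains': chains}
```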

FILE MAP FILES

As mentioned above, the FileMap consists of three files:

-: The FileMap file itself with a `.fmp' extension.
-: The Filter List file with a `.lmp' extension.
-: The Compiled FileMap file with a `.xmp' extension.

These files are generated by mfsbld. The `.fmp' and `.lmp' files are both ASCII files; the `.xmp' file is a binary file. All XPAT programs that use MFS load the `.xmp' file into memory for fast access. This file consumes 8 bytes of memory for each file entry. Therefore, a FileMap containing 131,072 files will occupy 1 MB of memory.
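
The memory figure above follows directly from the per-entry size, as this small Python calculation shows. The constant and function names are chosen for this example only.

```python
BYTES_PER_ENTRY = 8  # per-file cost of the loaded `.xmp' file, as stated above


def xmp_memory_bytes(file_count):
    """Estimate the memory used by the Compiled FileMap for file_count files."""
    return file_count * BYTES_PER_ENTRY


# 131,072 files * 8 bytes = 1,048,576 bytes = 1 MB
print(xmp_memory_bytes(131072) / (1024 * 1024))
```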

USER META DATA

One of the meta-fields that the `meta' filter produces contains free text that is associated with the file. This free text is called the user meta-data. This data is usually a summary or a title for the file that users can use to quickly identify the file or its contents. This field is intended to allow database administrators to place database-specific and file-specific data into the meta-fields. This data can then be used in the user interface's summary list, allowing it to contain much more useful information than would otherwise be possible. Note that this field can be left empty if the existing information in the other meta-fields is sufficient to construct the summary list.

The user meta-data is incorporated into the FileMap at the time it is built, via a user meta-data template file. You can create an empty template file by using the `-t' option to the mfsbld program (refer to the mfsbld(1) man page for the details). The template file has the suffix `.dat'. The user meta-data are added to this template file using one of the methods described below. mfsbld is then executed again (without the `-t' option) to build the FileMap files. mfsbld automatically incorporates the data in the `.dat' file into the FileMap if the `.dat' file is present (it should be in the same location as the other FileMap files, as specified by the Data Dictionary's `FileMap' field).

NOTE: You usually do not run mfsbld by hand. Instead, you usually run dbbuild, which in turn calls mfsbld and all the other index-building programs. Refer to the dbbuild(1) man page for more details.

The `.dat' file that mfsbld creates consists of a number of entries, one for each file in the database. Each entry looks like the following (in the real file there are no newlines within an entry):

  <OTUserData><OTFile>../dir1/dir2/somefile.doc</OTFile>
  <OTFields></OTFields></OTUserData>

You should enter the user meta-data for each database file in the `OTFields' field in that file's `OTUserData' entry. You can do this either by editing it by hand or by using a program that automatically merges a pre-built list of user meta-data with this template file. If the user meta-data consists of pieces of text that come from the files themselves (such as title fields), you need to run a program to extract these pieces beforehand.

Note that in the simplest form, the user meta-data are just strings of text. In more complex forms, however, they may consist of one or more tagged fields. Because the user meta-data are meta-fields in the virtual text produced by the MFS system, you can simply build region indices on the tags in the user meta-data. For example, you might create a template file entry as follows:

  <OTUserData><OTFile>somefile.doc</OTFile><OTFields><HeadLine>This
  is the text of the headline</HeadLine><PageNum>344</PageNum></OTFields>
  </OTUserData>

In the above example, you could build regions on `HeadLine' tags and on `PageNum' tags. For example, you could use the `HeadLine' region in the summary list, and the `PageNum' field in the viewer program. Refer to the Database Administration Guide for more details and examples of this form of user meta-data and building region indices.

Note that you should edit ONLY the contents of the `OTFields' fields; the contents of the `OTFile' fields must remain unchanged or the user meta-data may not get properly matched up with the right files.
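
A program that merges a pre-built list of user meta-data into the template could look roughly like the following Python sketch. It relies only on the entry layout shown above; the function name `merge_user_data` and the dictionary of pre-built fields are assumptions made for this example, and no escaping of the inserted text is attempted.

```python
import re

# Matches one template entry; only the OTFields contents are replaced.
ENTRY_RE = re.compile(
    r'(<OTUserData>\s*<OTFile>)(.*?)(</OTFile>\s*<OTFields>)(.*?)'
    r'(</OTFields>\s*</OTUserData>)',
    re.DOTALL)


def merge_user_data(template, fields_by_file):
    """Fill OTFields for each entry whose OTFile path has pre-built data.

    The OTFile contents are copied through unchanged. Entries without
    pre-built data keep whatever OTFields contents they already had.
    """
    def fill(match):
        path = match.group(2)
        fields = fields_by_file.get(path, match.group(4))
        return (match.group(1) + path + match.group(3) + fields +
                match.group(5))
    return ENTRY_RE.sub(fill, template)
```

The dictionary keys must be exactly the paths that appear in the `OTFile' fields of the template, since those are what the entries are matched on.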

INTEGRITY CHECKS

mfsbld will generate integrity check data for the FileMap inside the `.xmp' file. This integrity check data is used by the various indexing and search programs to verify that the three FileMap files are consistent. It is important NOT TO EDIT FileMap files directly. Doing so may corrupt the database's integrity. xpat will detect such problems and will not allow searching in those cases.

As far as integrity checking of the actual database files is concerned, each entry in the FileMap contains the corresponding file's timestamp. This allows xpat to detect whether any of the files have been modified since the database was built. Note that xpat will still search databases that have modified files. However, searches will be performed against the versions of the files at the time they were indexed, while text output (e.g., to viewer programs) will use the current versions. To search the current versions of the files, you will have to rebuild or update the database (refer to the xpatbld(1) and dbbuild(1) man pages and the Database Administration Guide for more details).
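
The timestamp comparison described above can be sketched in Python. The real FileMap layout is not documented in this page, so the sketch takes a plain dictionary of path to recorded modification time as a stand-in; the function name `modified_since_build` is hypothetical.

```python
import os


def modified_since_build(recorded):
    """Return the paths whose current timestamp differs from the recorded one.

    'recorded' maps each file path to the modification time noted when
    the database was built (a stand-in for the FileMap entries). A file
    that is missing or unreadable is also reported as changed.
    """
    changed = []
    for path, built_mtime in sorted(recorded.items()):
        try:
            current = os.path.getmtime(path)
        except OSError:
            changed.append(path)
            continue
        if current != built_mtime:
            changed.append(path)
    return changed
```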

This document was created by man2html, using the manual pages.
Time: 18:03:38 GMT, March 26, 2001