MFS
Section: File Formats (5)
Updated: November 2000
NAME
mfs - XPAT Multi-File/Filter Support system
DESCRIPTION
The MFS (Multi-File/Filter Support) system is an extension to
the I/O subsystem that is used by all
XPAT index-building and search programs. It allows these
programs to handle databases consisting of
many separate files in multiple formats.
Conceptually, the MFS system extracts the text from all the individual
files, adds some structure to those extracted pieces of text, and groups them
all together
as a single `virtual text'. The XPAT programs then operate on
that virtual text as if it were a single file. The MFS system creates
this virtual text through the use of a FileMap, which is essentially
a directory of all the files in the database, along with extra
information about each file (such as the timestamps).
The FileMap must be built before
any of the other index-building and search programs can be used.
This is accomplished with the mfsbld program which reads the MFS
specifications from the database's Data Dictionary. These
specifications are described in the next section.
MFS SECTIONS IN THE DATA DICTIONARY
The difference between the Data Dictionary of a regular database and that of an MFS
database lies in the contents of the `Text' section (which is enclosed
by <Text> and </Text> tags). In a regular database,
this section contains a `Files' section which specifies the
database's text file. In an MFS database, the `Files' section is
replaced by an `MfsFiles' section. This `MfsFiles' section specifies all
the files that make
up the MFS database, along with extra information for each one, such
as the method used to extract its text.
The following sections describe the contents of the `MfsFiles' section
in the Data Dictionary. You may want to refer
to the example at the end of this man page while reading these
descriptions. Refer to the Database Administration Guide for more details on the
`Files' section used in regular databases.
Each `MfsFiles' section contains a `FileMap'
field and one or more `FilterChain' sections.
The `FileMap' field specifies the
prefix of the FileMap files. These files have
the suffixes `.fmp', `.lmp' and
`.xmp'. When mfsbld creates these files, it names them with the
prefix specified in the `FileMap' field in the Data Dictionary.
The `FilterChain' section defines the files and data filtering
parameters for one type of file in the database.
Most MFS databases consist of files of several different file types;
each file type is defined in its own `FilterChain' section. Each
`FilterChain' section defines three types of parameters: the filter
chains for the required database views, the Display View Format label,
and the files in the database of the file type being defined.
A filter chain is a group of data filters that perform some
transformation on the data.
In doing so, each filter chain provides a view of the data.
For example, a filter chain that extracts the text from a word
processor file provides a view of the text of the file. In contrast,
a filter chain that extracts the text along with the formatting information
provides a formatted view of the data. Each `FilterChain'
section in the Data Dictionary contains filter chain definitions for three database views.
These filter chains are contained in the `SearchView', `DisplayView', and
`RawView' fields.
In addition to those fields, each `FilterChain' section contains
a `DisplayFmt' field and one or more `FileGroup' sections.
The `SearchView' field specifies the filter chain for the
Search View of the database.
The Search View is used for indexing and search
purposes. The filter chain is written as a list of filters in the
style of Unix pipeline specifications. For example, the
`SearchView' field
<SearchView>wfw(05,3)|meta</SearchView>
describes two filters. The first filter is the `wfw' filter
with the parameter `(05,3)'. Because this filter
is the first one in the chain, it reads the
data files directly. The `wfw' filter is actually a ``super-filter''
supplied by XPAT, which can extract the text from a wide variety of
data formats. The second filter
is the `meta' filter. It receives its input from the
`wfw' filter and sends its output to the indexing and search
programs. The `meta' filter is a special filter, also supplied
by XPAT, that
surrounds the text coming from the previous filters in the chain with
tags. These tags provide file-level structure to text that may be
otherwise unstructured.
It also inserts extra fields, called meta-fields, in front of that text.
These meta-fields contain the FileMap information for the file, which is
very useful to user interfaces.
The `meta' filter can generate the meta-fields very quickly
(usually much faster than the other filters in the chain can generate
the filtered text). For this reason, the meta-fields are usually
used to build summary lists in the user interfaces.
You should always terminate every Search
View filter chain with the `meta' filter.
Refer to the Database Administration Guide for
further details of the `wfw' and `meta' filters.
Refer to the section below on User Meta Data for more information on
adding custom information to the meta-fields.
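The pipeline notation described above can be sketched in a few lines of Python. This is an illustration only, not part of XPAT: the helper names `parse_filter_chain' and `ends_with_meta' are hypothetical, and the sketch assumes that the `|' character never appears inside a filter's parameter list.

```python
import re

# Hypothetical helpers, not part of XPAT. A filter is a name optionally
# followed by a parenthesized parameter list, e.g. wfw(05,3) or meta.
FILTER_RE = re.compile(r"^\s*([A-Za-z_]\w*)\s*(?:\((.*)\))?\s*$")


def parse_filter_chain(chain):
    """Split a filter chain into (name, parameters) pairs, in pipeline order.

    Assumption: '|' only ever separates filters and never occurs inside
    a parameter list.
    """
    filters = []
    for part in chain.split("|"):
        match = FILTER_RE.match(part)
        if match is None:
            raise ValueError(f"unrecognized filter: {part!r}")
        filters.append((match.group(1), match.group(2)))
    return filters


def ends_with_meta(chain):
    """Return True if the last filter in the chain is the `meta' filter."""
    filters = parse_filter_chain(chain)
    return filters[-1][0] == "meta"


print(parse_filter_chain("wfw(05,3)|meta"))
# prints [('wfw', '05,3'), ('meta', None)]
```

A check such as `ends_with_meta' could be run over every `SearchView' and `DisplayView' field before building the database, since both chains are expected to end with the `meta' filter.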
The `DisplayView' field contains the filter chain that
provides the Display View. This view is used by the user interfaces
to display the text in a formatted form. Refer to the Database Administration Guide
for further details and examples of the Display View.
As with the Search View, this filter chain
should always end with the `meta' filter.
The `RawView' field contains the filter chain that
provides the Raw View. This view is currently
unused by XPAT user interfaces, but is reserved for
future use. You should set it to be the same as the DisplayView
filter chain (because each view definition must
contain at least one filter).
The `DisplayFmt' field contains a short label that identifies the
Display View output format.
This label is necessary in databases where several different Display View
data formats may be produced (e.g., SGML text from SGML files, raw word
processor data from the word processor files, and GIF from the
graphics files). The user interface may need to send each of these
data formats to its own viewer program. Since the `DisplayFmt' field
is one of the meta-fields that the `meta' filter generates,
the user interface can use it to automatically send the Display View
data to the right viewer program.
You can put any label you want in this field, as long as you
configure the user interface to send data
with that label to the correct viewer. Short identifiers,
such as `sgml', are recommended to
minimize the size of the FileMap files (since the Display Format label
for each file is kept in that file's entry in the FileMap).
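As a rough illustration of how a user interface might act on the `DisplayFmt' meta-field, the following Python sketch maps labels to viewer programs. The labels, the viewer command names and the `viewer_for' function are all hypothetical; the real mapping is whatever the user interface is configured to use.

```python
# Hypothetical mapping from DisplayFmt labels to viewer commands.
# Neither the labels nor the commands are defined by XPAT; they stand in
# for whatever the user interface has been configured with.
VIEWERS = {
    "sgml": "sgml-viewer",
    "wp": "wp-viewer",
    "gif": "image-viewer",
}


def viewer_for(display_fmt, default=None):
    """Return the viewer command for a DisplayFmt label, or the default."""
    return VIEWERS.get(display_fmt, default)


print(viewer_for("sgml"))  # prints sgml-viewer
```

Keeping the labels short, as recommended above, costs nothing in a table like this one, since the label is only used as a lookup key.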
The `DefaultDataTag' field contains the default data tag name that will
be used
to construct the OTDefaultData regions.
If this field is not specified, the default tag name OTData
will be assumed.
The `FileGroup' sections specify the files which
are to be included in the database, for the enclosing `FilterChain'
section. Each `FileGroup' section defines
a directory and a file pattern within that directory.
Multiple `FileGroup' sections
may be used to define different directories and patterns.
Each `FileGroup' section contains an `MfsDir' field,
an `MfsFile' field and an `MfsExpand' field.
The `MfsDir' field defines
the path to the directory which contains the files.
The `MfsFile' field defines the file pattern which specifies the
files. This pattern consists of a complete or partial filename which
may contain the following wildcards:
- `*', which matches any string, including the null string.
For example, `*.wp' will match any filename ending with
a `.wp' extension.
- `?', which matches a single character.
For example, `A??.wp' will match any filename starting with
an `A', followed by any two characters, and ending with
a `.wp' extension.
- `[...]', which matches any one of the characters enclosed
between `[' and `]'.
For example, `str[XYZ]' will match the filenames `strX',
`strY' and `strZ'. In addition, the character
`-' may be used between two characters to specify the range
between (and including) those two characters.
For example, `str[1-5]' will match the filenames `str1',
`str2', ..., `str5'.
- `\', which escapes special characters.
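The wildcard examples above can be tried out with Python's standard `fnmatch' module, which implements similar shell-style patterns. This is only an approximation of the MFS rules: `fnmatch' does not treat `\' as an escape character, so that wildcard is not covered, and the `matching_files' helper is a hypothetical name.

```python
from fnmatch import fnmatchcase


def matching_files(names, pattern):
    """Return the names that match a shell-style pattern, case-sensitively.

    Uses Python's fnmatch rules as a stand-in for the MfsFile wildcards;
    backslash escaping is not supported by fnmatch.
    """
    return [name for name in names if fnmatchcase(name, pattern)]


names = ["report.wp", "A12.wp", "A1.wp", "strX", "str3", "str6"]

print(matching_files(names, "*.wp"))      # prints ['report.wp', 'A12.wp', 'A1.wp']
print(matching_files(names, "A??.wp"))    # prints ['A12.wp']
print(matching_files(names, "str[XYZ]"))  # prints ['strX']
print(matching_files(names, "str[1-5]"))  # prints ['str3']
```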
The `MfsExpand' field defines whether
mfsbld should look in just the directory specified by the `MfsDir' field
or also in the entire subtree rooted at that directory, to find files
matching the pattern.
The valid keywords in the `MfsExpand' field are `tree' and
`file'. The `tree' keyword tells mfsbld to search recursively
through the entire subtree rooted at the specified directory, while the `file' keyword
tells mfsbld to find only the files in the specified directory itself.
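The difference between the two keywords can be sketched in Python as follows. This is not how mfsbld is implemented; the function name `expand_file_group' is hypothetical, and `fnmatch' is used as an approximation of the `MfsFile' pattern rules.

```python
import os
from fnmatch import fnmatchcase


def expand_file_group(mfs_dir, mfs_file, mfs_expand):
    """List the files selected by one FileGroup (illustrative sketch only).

    mfs_expand is 'file' (look only in mfs_dir) or 'tree' (also look in
    every directory below mfs_dir).
    """
    if mfs_expand == "file":
        found = [
            os.path.join(mfs_dir, name)
            for name in os.listdir(mfs_dir)
            if fnmatchcase(name, mfs_file)
            and os.path.isfile(os.path.join(mfs_dir, name))
        ]
    elif mfs_expand == "tree":
        found = []
        # os.walk visits mfs_dir itself and then every subdirectory.
        for root, _dirs, files in os.walk(mfs_dir):
            found.extend(
                os.path.join(root, name)
                for name in files
                if fnmatchcase(name, mfs_file)
            )
    else:
        raise ValueError(f"MfsExpand must be 'tree' or 'file', not {mfs_expand!r}")
    return sorted(found)
```

For a directory that holds `a.wp' and a subdirectory `sub' holding `b.wp', the `file' keyword would select only `a.wp', while the `tree' keyword would select both files.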
MFS DATA DICTIONARY EXAMPLE
This is an example `Text' section of a Data Dictionary for an MFS database.
<Text>
<MfsFiles>
<FileMap>amap</FileMap>
<FilterChain>
<SearchView>wfw(05,0)|meta</SearchView>
<DisplayView>meta</DisplayView>
<RawView>meta</RawView>
<DisplayFmt>wp</DisplayFmt>
<FileGroup>
<MfsDir>/usr/wordprocessor/doc</MfsDir>
<MfsFile>*.wp</MfsFile>
<MfsExpand>tree</MfsExpand>
</FileGroup>
<FileGroup>
<MfsDir>../../mydata/texts</MfsDir>
<MfsFile>MyFile*.[4-6]</MfsFile>
<MfsExpand>file</MfsExpand>
</FileGroup>
</FilterChain>
<FilterChain>
<SearchView>sys("uncompress")|wfw(21,4X)|meta</SearchView>
<DisplayView>sys("uncompress")|meta</DisplayView>
<RawView>flat</RawView>
<DisplayFmt>wfw</DisplayFmt>
<FileGroup>
<MfsDir>/u2/compressed/texts</MfsDir>
<MfsFile>*.Z</MfsFile>
<MfsExpand>tree</MfsExpand>
</FileGroup>
</FilterChain>
</MfsFiles>
</Text>
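To show how the nesting of these sections fits together, the following Python sketch reads a shortened `Text' section and lists its filter chains and file groups. It assumes the section can be parsed as well-formed XML, which this man page does not guarantee for real Data Dictionaries; the `summarize_mfs_files' function and the `SAMPLE' string are hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened copy of an MFS `Text' section, used only for illustration.
SAMPLE = """<Text>
<MfsFiles>
<FileMap>amap</FileMap>
<FilterChain>
<SearchView>wfw(05,0)|meta</SearchView>
<DisplayView>meta</DisplayView>
<RawView>meta</RawView>
<DisplayFmt>wp</DisplayFmt>
<FileGroup>
<MfsDir>/usr/wordprocessor/doc</MfsDir>
<MfsFile>*.wp</MfsFile>
<MfsExpand>tree</MfsExpand>
</FileGroup>
</FilterChain>
</MfsFiles>
</Text>"""


def summarize_mfs_files(text_section):
    """Return (FileMap prefix, list of filter chain summaries).

    Assumption: the section is well-formed XML with exactly one
    MfsFiles element directly under Text.
    """
    root = ET.fromstring(text_section)
    mfs_files = root.find("MfsFiles")
    chains = []
    for chain in mfs_files.findall("FilterChain"):
        groups = [
            (g.findtext("MfsDir"), g.findtext("MfsFile"), g.findtext("MfsExpand"))
            for g in chain.findall("FileGroup")
        ]
        chains.append(
            {
                "search_view": chain.findtext("SearchView"),
                "display_fmt": chain.findtext("DisplayFmt"),
                "file_groups": groups,
            }
        )
    return mfs_files.findtext("FileMap"), chains


file_map, chains = summarize_mfs_files(SAMPLE)
print(file_map)  # prints amap
```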
FILE MAP FILES
As mentioned above, the FileMap consists of three files:
- The FileMap file itself, with a `.fmp' extension.
- The Filter List file, with a `.lmp' extension.
- The Compiled FileMap file, with a `.xmp' extension.
These files are generated by mfsbld.
The `.fmp' and `.lmp' files are both
ASCII files; the `.xmp' file is a binary file.
All XPAT programs that use MFS load the `.xmp'
file into memory for fast access.
This file consumes 8 bytes of memory for each file entry. Therefore,
a FileMap containing 131,072 files will occupy 1 MB of memory.
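The memory figure above follows directly from the 8 bytes per entry. As a small worked check in Python (the function name `xmp_memory_bytes' is hypothetical, and the estimate covers only the per-entry cost stated here):

```python
XMP_BYTES_PER_ENTRY = 8  # per-file cost stated in this man page


def xmp_memory_bytes(file_count):
    """Estimate the memory used by the loaded `.xmp' file, in bytes."""
    return file_count * XMP_BYTES_PER_ENTRY


print(xmp_memory_bytes(131072))          # prints 1048576
print(xmp_memory_bytes(131072) / 2**20)  # prints 1.0 (that is, 1 MB)
```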
USER META DATA
One of the meta-fields that the `meta' filter produces contains
free text that is associated with the file. This free text is called
the user meta-data. This data is usually a summary
or a title for the file that users can use to quickly identify the
file or its contents. This field is intended to allow database
administrators to place database-specific and file-specific data into
the meta-fields. This data can then be used in the user interface's
summary list, allowing it to contain much more useful
information than would otherwise be possible. Note that
this field can be left empty if the existing information
in the other meta-fields is sufficient to construct the summary list.
The user meta-data is incorporated into the FileMap at the time it is
built, via a user meta-data template file. You can create an empty
template file by using the `-t' option to the mfsbld program
(refer to the mfsbld(1) man
page for the details). The template file has the suffix `.dat'.
The user meta-data are added to this template file using one of the
methods described below. mfsbld is then executed again (without the
`-t' option) to build the FileMap
files. mfsbld automatically incorporates the data in the
`.dat' file into the FileMap if the `.dat' file
is present (it should be
in the same location as the other FileMap files, as specified by
the Data Dictionary's `FileMap' field).
NOTE: You usually do not run mfsbld by hand. Instead, you usually
run dbbuild, which in turn calls mfsbld and all the other index-building programs.
Refer to the dbbuild(1) man page for more details.
The `.dat' file that mfsbld creates consists of a number
of entries, one for each file in the database. Each entry looks like
the following (in the real file there are no newlines in each entry):
<OTUserData><OTFile>../dir1/dir2/somefile.doc</OTFile>
<OTFields></OTFields></OTUserData>
You should enter the user meta-data for each database file in the
`OTFields' field in that file's `OTUserData' entry.
You can do this either by editing it by hand or by using a
program that automatically merges a pre-built list
of user meta-data with this template file. If the user meta-data
consists of pieces of text that come from the files themselves (such
as title fields), you need to have run a program to extract these pieces
beforehand.
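A merge program of the kind mentioned above might look like the following Python sketch. It is an illustration under stated assumptions, not a tool supplied with XPAT: the function name `merge_user_metadata' is hypothetical, the meta-data is assumed to be available as a dictionary keyed by the exact `OTFile' value, and no escaping is applied to the inserted text.

```python
import re

# One template entry: the OTFile value is captured so that it can be
# looked up, and it is always written back unchanged.
ENTRY_RE = re.compile(
    r"(<OTUserData><OTFile>(.*?)</OTFile>\s*<OTFields>)"
    r"(.*?)"
    r"(</OTFields>\s*</OTUserData>)",
    re.DOTALL,
)


def merge_user_metadata(template_text, metadata_by_file):
    """Fill the OTFields of each entry from a {OTFile value: text} mapping.

    Entries whose file has no meta-data in the mapping are left as they
    are. Only the contents of OTFields are ever modified.
    """

    def fill(match):
        head, path, old_fields, tail = match.groups()
        return head + metadata_by_file.get(path, old_fields) + tail

    return ENTRY_RE.sub(fill, template_text)


template = (
    "<OTUserData><OTFile>a.doc</OTFile><OTFields></OTFields></OTUserData>\n"
    "<OTUserData><OTFile>b.doc</OTFile><OTFields></OTFields></OTUserData>\n"
)
print(merge_user_metadata(template, {"a.doc": "Annual report"}))
```

The output would then be written back to the `.dat' file before running mfsbld again without the `-t' option.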
In the simplest form, the user meta-data are just strings of
text. In more complex forms, they may consist of one or more
tagged fields. Because the user meta-data are
meta-fields in the virtual text produced by the MFS system,
you can build region indices on the tags
in the user meta-data. For example, you might create a template file
entry as follows:
<OTUserData><OTFile>somefile.doc</OTFile><OTFields><HeadLine>This
is the text of the headline</HeadLine><PageNum>344</PageNum></OTFields>
</OTUserData>
In the above example, you could build regions on `HeadLine' tags
and on `PageNum' tags. You could then use the `HeadLine'
region in the
summary list, and the `PageNum' region in the viewer program.
Refer to the Database Administration Guide for more details and examples
of this form of user meta-data and building region indices.
Note that you should edit ONLY the contents of the `OTFields' fields;
the contents of the `OTFile' fields must remain unchanged or the user
meta-data may not get properly matched up with the right files.
INTEGRITY CHECKS
mfsbld will generate integrity check data for the FileMap inside the
`.xmp' file. This integrity check data is used by the various
indexing and search programs to verify
that the three FileMap files are consistent.
Do NOT edit the FileMap files directly. Doing
so may corrupt the database's integrity. xpat will detect such
problems and will not allow searching in those cases.
As far as integrity checking of the actual database files is
concerned, each entry in the FileMap contains the corresponding file's
timestamp. This allows xpat to detect if any of the files
have been modified since the database was built. Note that xpat will
still search databases that have modified files. However, searches
will be performed against
the versions of the files at the time they were indexed, while text
output (e.g., to viewer programs) will
use the current versions, so the two may disagree. In order to search the current versions
of the files, you will have to rebuild or update the database
(refer to the xpatbld(1) and dbbuild(1) man pages and the Database Administration Guide for more
details).
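The timestamp comparison described above can be sketched in Python. The on-disk layout of the FileMap is not documented here, so the sketch assumes the recorded timestamps have already been read into a dictionary; the function name `modified_since_build' is hypothetical.

```python
import os


def modified_since_build(indexed_mtimes):
    """Return the paths whose files changed after the database was built.

    indexed_mtimes is a hypothetical {path: modification time} mapping
    holding the timestamps recorded at build time. A file that can no
    longer be read is reported as changed as well.
    """
    changed = []
    for path, indexed_mtime in indexed_mtimes.items():
        try:
            current_mtime = os.path.getmtime(path)
        except OSError:
            changed.append(path)
            continue
        if current_mtime != indexed_mtime:
            changed.append(path)
    return changed
```

A non-empty result would suggest that the database should be rebuilt or updated before its search results are relied on.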
SEE ALSO
Database Administration Guide
mfsbld(1), xpatffw(1), xpatffi(1), data_dict(5)