XPAT index builder for XPAT databases (man page)

XPATBLD

Section: User Commands (1)

NAME

xpatbld - XPAT index builder for xpat databases

SYNOPSIS

xpatbldu [-v] [-r] [-m memory k | K | m | M ] [-d region_name] [-i int_filename] [-s merge_filename] [-t text_filename [-o out_filename] ] [-D data_dictionary [-I index_name] ] [-p index_point_filename]

DESCRIPTION

xpatbld builds a Main Index for either the text text_filename or the text declared in data_dictionary. When a text_filename is specified, an index is created with the default set of character mappings and index points. The default is sgml, which is explained below. An alternate set of character mappings and index point specifications may be selected with the -c option. When a data_dictionary is specified, an index is created for the text specified in the Data Dictionary (see data_dict(5) for more details on the Data Dictionary). The default for xpatbldu is hard-coded to support XML and US-ASCII with punctuation and non-printing characters mapped to space. This simple default is used in lieu of command-line options given the wide range of possible alphabets available under Unicode.

In general, the following material applies equally to xpatbld and xpatbldu. An exception for xpatbldu is noted in the memory requirements section.

If the -I option is also specified when a data_dictionary is specified, information from the index named index_name will be used for character mappings, index points, and other index-related information. If -I is not used, then information from the first index in the Data Dictionary is used. If index_name does not exist in the Data Dictionary, xpatbld behaves as if -I had not been specified.

Index building is a two phase process. In the first phase, xpatbld divides the entire text into blocks, and indexes each block. It then writes the index for each block to a separate intermediate file. The amount of memory allocated to xpatbld determines the size of the block of text that is indexed. If the memory allocation is sufficient the entire text is indexed in one pass. If the memory allocation is not sufficient to index the entire text, then the first phase is divided into several passes. After each pass, xpatbld calculates how to merge the index just created with all the previously written intermediate indices. xpatbld then writes a file of merge instructions for the newly created partial index. When the entire text has been processed in this manner, the second phase begins.

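In the second phase, the intermediate files are transformed into the final format and are re-written. These files are then merged according to the information in the merge instruction files to produce a final index file. (The Usage Notes section describes the same process as three phases, counting this final merge as a separate third phase.)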

The names of both the final index file and the intermediate files can be controlled using the -o, -i, and -s options. At the end of a successful index build all intermediate and merge instruction files are automatically removed.

OPTIONS

-v: verbose - produce some additional messages concerning the execution of xpatbld.
-r: restart - use the log file to restart xpatbld. xpatbld restarts after the last checkpoint in the .log file. See the Usage Notes section of this man page for a discussion of restarting.
-m memory [ k | K | m | M ]: memory size - use memory kilobytes or megabytes of physical memory for building the index. A larger memory allocation results in faster indexing. The default memory allocation is 500 KB, of which 400 KB are used for internal buffers, leaving 100 KB for indexing. See the Database Administration Guide and the discussion on ``Memory Usage'', below, for more details.
-d region_name: region - build an index over only the region of text indicated by region_name. See xpatrgn(1), multirgn(1), sgmlrgn(1), data_dict(5) and regions(5) for more information on regions. region_name must be specified in the Data Dictionary.
-i int_filename: intermediate filename - name the intermediate files int_filename.iN, where N is an integer pass number. The amount of disk space used by the partial index files is equal to the size of the final index file. The amount of disk space used for the merge instruction files is an additional 1/4 the size of the final index. See the -s option for more details on the merge instruction files. The -i option allows the partial index files to be placed on a different disk, if necessary. In the second phase of xpatbld, each intermediate file is rewritten using the name int_filename.tN. If the -i option is not specified, int_filename is set to out_filename.
-p index_point_filename: index point filename - use the index points produced by another index builder. The normal index point specification used by xpatbld is two characters in length and satisfies most needs. However, some specialized databases may require a more complicated index point specification. In these cases, a specialized index builder is programmed and run over the text, producing a file containing the four-byte, zero-based offsets of index points into the text. Using the -p option, xpatbld can be made to use the index point file created by such a specialized index builder.
-s merge_filename: merge filename - name the merge instruction files merge_filename.mN where N is an integer pass number. The total size of these files will be about 1/4 the size of the final index. The -s option allows the merge files to be put on a different disk, if necessary. If the -s option is not used, merge_filename is set to the value of out_filename.
-t text_filename: text filename - specify the name of the text file to index. This option cannot be used in conjunction with the -D option. The default character mapping (sgml) is used unless an alternate set is selected with the -c option.
-o out_filename: output filename - name the output files out_filename.idx (index file) and out_filename.dd (Data Dictionary). The -o option can only be used in conjunction with the -t option. If out_filename is not specified, the default name `out' is used.
-c [ none | basic | isolatin | sgml ]: character mapping - specify the character mapping to use. none specifies that no character mappings are to be used. basic maps upper case characters to lower case, and maps backspaces, newlines, tabs, punctuation and special characters to blank. isolatin is similar to basic but includes the extended characters of the ISO character set. sgml (the default) is similar to isolatin but has character mappings and index points tailored to SGML-style tags. xpatbld writes the character mapping to the new Data Dictionary file for subsequent modifications by the user. To avoid overwriting any existing character mapping specifications, this option can only be used with the -t option.
-D data_dictionary: Data Dictionary - index the text specified in data_dictionary. Use the character mappings and index points specified in index_name (specified with the -I option) or the defaults if index_name is not specified. The -D option may not be used in conjunction with the -t option.
-I index_name: index name - index the text using the character mappings and index points specified in the index section of the Data Dictionary named index_name. If this option is not used then the first specified index in the Data Dictionary is used. See data_dict(5) for more information on the Data Dictionary.

EXAMPLES

The following is a sample xpatbld run:

xpatbld -v -m 12m -i /u1/data -s /u2/data -D data.dd

This will build an index on the text specified in the Data Dictionary file named data.dd. It will use 12 megabytes of physical memory to do the index building. Intermediate index files will be written to the directory /u1, merge instruction files will be written to the directory /u2, and the final index, the log file, and the Data Dictionary will be written to the directory containing the Data Dictionary. Each of the files written will have the file name prefix data. xpatbld will write verbose output to standard output (stdout) concerning each pass of each phase in the index building process. If the above xpatbld command is stopped before completing, it may be restarted with the command:

        xpatbld -v -r -m 12m -i /u1/data -s /u2/data -D data.dd

USAGE NOTES

General Operation

xpatbld indexes texts in three phases. In the first phase, it breaks up the text into chunks that will fit into memory. It then creates an intermediate partial index file for each chunk. These intermediate partial index files have the suffixes `.i1', `.i2', `.i3', and so on. It also creates a ``merge instruction'' file for each intermediate partial index file. These merge instruction files have the suffixes `.m1', `.m2', `.m3', and so on.

In the second phase, xpatbld replaces the intermediate partial index files by final partial index files. These final partial index files have the suffixes `.t1', `.t2', `.t3', and so on. As xpatbld creates each one, it removes the corresponding intermediate partial index file.

In the third phase, the merge instruction files are used to merge the final partial index files into a final Main Index (`.idx') file. When xpatbld has finished writing the Main Index file it removes all the partial index files and the merge files.

Because of the complex nature of the algorithm, it is important to carefully calculate how much memory and disk space to allocate to xpatbld when it builds a Main Index. Accurate index building time calculations are also useful to help plan the index building process of large databases. The following sections will discuss those three topics.

Memory Usage

In general, the more memory available to xpatbld, the faster it will run. However, it is important that the memory that you tell xpatbld to use is the available physical memory. The available physical memory is the total physical memory (RAM) installed in the machine, minus the amount of RAM used by the operating system and any other processes running on the machine (note that this is different from the amount of virtual memory that these processes may require). The amount of memory the operating system uses varies widely from machine to machine. On smaller machines (with 4 MB of RAM or less) the operating system may take up 2 MB or less, while on larger machines (64 MB of RAM or more) it can use 8 MB or more (due to the various buffers and other space that the kernel uses to manage the larger configuration).

xpatbld uses the memory you allocate as follows. First, it uses 400 KB for internal buffers. It then divides the remainder into two pieces and uses one piece to load chunks of text and the other piece to build partial indices on those chunks. This means that the number of chunks that xpatbld divides the text into is equal to the total size of the text times 2, divided by the amount of memory you allocated (minus 400 KB). This also means that the maximum amount of memory that xpatbld needs is twice the size of the text, plus 400 KB.

For example, if the text is 500 MB and you tell xpatbld to use 60 MB of memory, it will divide the text into (500 MB * 2 / (60 MB - 0.4 MB)) = 16.8 chunks (or 17 chunks, rounded up to the next whole number). It also means that the maximum amount of physical memory that xpatbld would need to index that text is 500 MB * 2 + 0.4 MB = 1000.4 MB (or around a gigabyte).
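The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the formula stated in this section, not part of xpatbld; the function names are hypothetical, and 400 KB is treated as 0.4 MB, as in the example.

```python
import math

BUFFER_MB = 0.4  # 400 KB of internal buffers, written as 0.4 MB as in the text


def estimate_chunks(text_mb: float, memory_mb: float) -> int:
    """Chunks = text size * 2 / (allocated memory - buffers), rounded up."""
    usable_mb = memory_mb - BUFFER_MB
    if usable_mb <= 0:
        raise ValueError("memory allocation must exceed the 400 KB of buffers")
    return math.ceil(text_mb * 2 / usable_mb)


def max_useful_memory_mb(text_mb: float) -> float:
    """Largest allocation xpatbld can use: twice the text size plus buffers."""
    return text_mb * 2 + BUFFER_MB


if __name__ == "__main__":
    print(estimate_chunks(500, 60))    # 500 MB text, 60 MB memory -> 17
    print(max_useful_memory_mb(500))   # about 1000.4 MB
```

For xpatbldu, pass the adjusted text size discussed in the following paragraphs rather than the raw file size.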

Note that xpatbldu uses UCS-2 encoding internally and so requires 2 bytes to store each character rather than 1 byte for xpatbld. This doubles the memory requirement for the piece of memory used to load chunks of text. Therefore the amount of memory to allocate for xpatbldu is different than for xpatbld in the following way.

xpatbld is used to index iso-8859-* encoded data where each character is 1 byte. So to say that a text file is 500 MB (as in the above example arithmetic) is to say that the file contains 500 MB of characters.
xpatbldu is used to index UTF-8 encoded Unicode Plane 0 data where each character can use up to 3 bytes. Therefore the size of the file IS NOT the number of characters to be stored in memory allocated to perform the indexing. To determine how many characters are represented in the file you can run xpatutf8check on the file. It reports the number of characters in the file. As mentioned above it takes 2 bytes to store each character in memory during indexing. So, for example, if you have a 500 MB file, xpatutf8check might report that there are 384 MB of characters. Now, observing that the xpatbldu internal UCS-2 encoding requires 2 bytes per character the actual memory requirement is 384 MB*2=768MB. It is this number you should use in calculations instead of the 500 MB file size.
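As a rough illustration of the xpatbldu adjustment, the following Python sketch counts the characters in UTF-8 data and doubles the count to obtain the figure to use in the memory calculations. It is a hypothetical stand-in for the character count that xpatutf8check reports, not a description of how that tool works, and it assumes the input is already valid UTF-8. All function names are assumptions made for this example.

```python
def count_utf8_chars(data: bytes) -> int:
    """Count characters in valid UTF-8 data.

    Every character contributes exactly one byte that is not a
    continuation byte (continuation bytes have the form 10xxxxxx).
    """
    return sum(1 for b in data if (b & 0xC0) != 0x80)


def ucs2_bytes(char_count: int) -> int:
    """Bytes needed to hold the text at 2 bytes per character."""
    return 2 * char_count


def effective_text_bytes(path: str, block_size: int = 1 << 20) -> int:
    """Read a file in blocks and return the size to use in the calculations."""
    chars = 0
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            # Counting per block is safe: each byte is classified on its own.
            chars += count_utf8_chars(block)
    return ucs2_bytes(chars)
```

With the figures from the example above, 384 MB of characters gives `ucs2_bytes(384)`, that is 768 MB, as the text size to plug into the chunk calculation.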

Note that in MFS databases, the size of the text in the above calculation is the size of the filtered text. This amount is usually considerably less than the total size of all the files in the database because each file contains a significant amount of word processor overhead.

It is usually well worth monitoring xpatbld for pagefault activity as it processes the first few chunks. You should restart with less memory (if there is a lot of pagefault activity) or more memory (if there is no pagefault activity). The ideal memory specification is just under the point where pagefaults begin. This is especially important when you are building an index on a large text file (e.g., where the size of the text file is 10 times or more the size of available physical memory). In such cases, if too much or too little memory is allocated, xpatbld will take MUCH longer than necessary. You can monitor xpatbld's performance using the vmstat(8) and sar(8) programs (at least one of which should be available on every type of Unix operating system).

Disk Usage

The size of the Main Index file, in relation to the size of the text, varies depending on the indexing parameters used to build the index. There are two broad categories of indices: word indices and character indices. A word index has an index point at the beginning of every word, while a character index has an index point at every character. The size of the Main Index file, in bytes, is four times the number of index points in the text, plus 512 bytes for the file header. The Main Index file for a typical word index on English text is around 75% the size of the text. In contrast, the Main Index file for a character index is roughly 4 times the size of the text. Most databases have word indices built on them.

While these guidelines characterize the size of the Main Index once it has been built, xpatbld requires more disk space than the final index size, while it is building the index. This extra space is required for the partial index files and the merge instruction files. For a large index it is important that the required disk space be calculated properly.

The intermediate partial indices and the final partial indices will each total the size of the final complete index. However, because the final indices replace the intermediate ones, only the space equal to the size of the final index is needed for them. The merge instruction files will total about 1/4 the size of the final index. And enough space is needed for the final index. These components add up to 2 1/4 times the size of the Main Index file, or roughly 170% the size of the text, for word indices.
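The disk space figures above can be combined into a small estimate. The Python sketch below simply restates the rules given in this section (four bytes per index point plus a 512-byte header, and a peak of 2 1/4 times the final index size); the function names are hypothetical and the 1/4 factor for merge files is the approximate value quoted here.

```python
HEADER_BYTES = 512          # Main Index file header
BYTES_PER_INDEX_POINT = 4   # each index point takes four bytes


def main_index_bytes(index_points: int) -> int:
    """Size of the finished Main Index (.idx) file."""
    return BYTES_PER_INDEX_POINT * index_points + HEADER_BYTES


def peak_build_bytes(index_bytes: int) -> float:
    """Approximate disk space needed while the index is being built."""
    partial = index_bytes       # .iN files, replaced one by one by .tN files
    merge = index_bytes / 4     # .mN merge instruction files (approximate)
    final = index_bytes         # the .idx file itself
    return partial + merge + final


if __name__ == "__main__":
    idx = main_index_bytes(1_000_000)
    print(idx)                    # 4000512
    print(peak_build_bytes(idx))  # 9001152.0
```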

Disk space trick: In an extremely tight situation it is possible to build a word index using about 1 1/4 the size of the final index. The trick is to allow xpatbld to proceed until ALL the final partial indices have been built. At this point xpatbld will start writing the final complete index file (you can tell when this happens by regularly listing the contents of the directory where the final `.idx' file will reside and waiting until that file is created and starts to grow). When xpatbld starts writing the final index, all of the information for index building is in the partial indices and the merge files; the text is no longer needed. If the text is backed up on tape, it may be removed while xpatbld writes the final complete index. After xpatbld has finished creating the final index file, it will automatically remove all the partial index files. There will then be room to restore the text.

Disk space available on a network may be used to store the merge instruction files, which are written and then read only once, or the final index, which is written only once. The text and the intermediate index files are used very heavily and should be on the same machine that xpatbld is running on.

Timing Calculation

In a large xpatbld run it is useful to be able to estimate how long the complete index build will take. You can use the following method to compute this estimate.

As described above, xpatbld breaks the text up into chunks that will fit into approximately half of the allocated memory. You can estimate the exact number of chunks more accurately while xpatbld is running by inspecting the contents of the log file (which has a `.log' extension). That file records exactly how many characters are processed in each chunk. The number of characters in the various chunks will not be exactly the same, but should all be relatively close to some average value. The total number of chunks is then the size of the text divided by the average chunk size.

Once you have determined the number of chunks, you can move on to determine the times for the various steps in the operation. As mentioned above, xpatbld works by first building the partial index file for each chunk and then building the merge file. The partial index files all take approximately the same amount of time to build. However, the process of calculating the merge files takes longer with each successive chunk. The merge file calculation for a given chunk involves (n - 1) separate steps, where n is the chunk number. Those steps all take approximately the same amount of time.

You can determine the time it takes to build the index for each chunk, and the time for each separate merge step by looking at the timestamps on the `.iN' and `.mN' files. The following table provides an example of the first three chunks of a typical build:

File       Timestamp   Elapsed Time
demo.i1    10:13       -
demo.m1    10:13       0 mins
demo.i2    10:18       5 mins
demo.m2    10:21       3 mins
demo.i3    10:26       5 mins
demo.m3    10:33       7 mins

In the above example, each partial index file appears to take around 5 minutes to build, while each step in the merge file calculation appears to take around 3.5 minutes (chunk 1 needs no merge steps and takes 0 minutes, chunk 2 needs one step and takes 3 minutes, and chunk 3 needs two steps and takes 7 minutes).
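One way to extract these figures from a directory listing is sketched below in Python. It is a hypothetical helper, not something shipped with xpatbld: it assumes the files are listed in creation order with HH:MM timestamps from the same day, and it divides the elapsed time of each `.mN' file by its (N - 1) merge steps, as described above.

```python
from datetime import datetime


def minutes_between(earlier: str, later: str) -> float:
    """Elapsed minutes between two HH:MM timestamps on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(later, fmt) - datetime.strptime(earlier, fmt)
    return delta.total_seconds() / 60


def step_times(listing):
    """Return (index build times, per-step merge times) in minutes.

    listing is a sequence of (filename, "HH:MM") pairs in creation order.
    """
    index_times, merge_step_times = [], []
    for (_, prev_ts), (name, ts) in zip(listing, listing[1:]):
        elapsed = minutes_between(prev_ts, ts)
        suffix = name.rsplit(".", 1)[1]          # e.g. "i2" or "m3"
        kind, chunk = suffix[0], int(suffix[1:])
        if kind == "i":
            index_times.append(elapsed)
        elif kind == "m" and chunk > 1:          # chunk 1 has no merge steps
            merge_step_times.append(elapsed / (chunk - 1))
    return index_times, merge_step_times


if __name__ == "__main__":
    demo = [("demo.i1", "10:13"), ("demo.m1", "10:13"),
            ("demo.i2", "10:18"), ("demo.m2", "10:21"),
            ("demo.i3", "10:26"), ("demo.m3", "10:33")]
    print(step_times(demo))   # ([5.0, 5.0], [3.0, 3.5])
```

The per-step merge times from early chunks are based on very few steps, so values from later chunks are likely to be more representative.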

The total time for the complete index build can be determined by the following formula. If there are n chunks, then there are n Phase 1 indexing operations, n * (n - 1) / 2 Phase 1 merge steps, n Phase 2 indexing passes and one Phase 3 merge operation. The Phase 1 and Phase 2 indexing steps all take approximately the same amount of time (5 minutes in the above example). The time for the Phase 3 merge is insignificant with respect to the total time of the other passes, so it is not included in the overall calculation. The total time is then given by the formula,

2 * I * n + M * n * (n - 1) / 2

where I is the indexing time and M is the merge step time. In our example, n is 11, I is 5 minutes and M is 3.5 minutes, so the total time estimate is 302.5 minutes, or around 5 hours.
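The formula is easy to evaluate for other values of n, I and M. The following Python sketch is a direct transcription of it; the function name is hypothetical.

```python
def total_build_minutes(n: int, index_minutes: float, merge_step_minutes: float) -> float:
    """Estimated build time: 2 * I * n + M * n * (n - 1) / 2.

    n                   number of chunks
    index_minutes       I, time for one Phase 1 or Phase 2 indexing step
    merge_step_minutes  M, time for one Phase 1 merge step
    """
    return 2 * index_minutes * n + merge_step_minutes * n * (n - 1) / 2


if __name__ == "__main__":
    minutes = total_build_minutes(11, 5, 3.5)
    print(minutes)        # 302.5
    print(minutes / 60)   # a little over 5 hours
```

Because the merge term grows with the square of n, halving the number of chunks (by allocating more physical memory) reduces the estimate by much more than half for large builds.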

Restarting

xpatbld may be stopped at any time. xpatbld can then be restarted with the -r option. The restart will be from the last ``checkpoint'' written to the log file. Checkpoints are written after each intermediate index file is written, after all merging has been calculated for an index file, and after each final index file is written. When restarting, the memory allocation (specified with the -m option) must be the same as for the initial run. The -o, -i, and -s options may be changed provided that all files related to the option are moved to the new name and location. The -o option affects the `.log' file. The `.idx' and `.dd' files, also affected by the -o option, are completely rewritten by a restart. The -s option may be changed provided that the `.mN' files are moved. The -i option may be changed provided that the `.iN' and `.tN' files are moved.

FILES

data_dictionary.dd Data Dictionary file
int_filename.i[0-9]+ partial index files built by Phase I
merge_filename.m[0-9]+ merge instruction files built by Phase I
int_filename.t[0-9]+ partial index files built by Phase II
out_filename.idx output index file
out_filename.log log file