Last updated: 2002-07-08 12:17:12 EDT
Doc Title: Preparing Data for Text Class
Author 1: Powell, Chris
CVS Revision: $Revision: 1.6 $
Preparing Data for Index Building (Text Class)
Setting up directories
You will need to identify directories where you plan to store your SGML or XML source files, your index file (approximately 75% of the size of your SGML source), your "region" files and other information such as data dictionaries, and the files you use to prepare your data. We recommend you use the following structure:
- Store specialized scripts for your collection and its Makefile in $DLXSROOT/bin/c/collid/, where $DLXSROOT is the "tree" where you install all DLXS components, c is the first letter of the name of the collection you are indexing, and collid is the collection ID of the collection you are indexing. For example, if your collection ID is "moa" and your DLXSROOT is "/l1", you will place the Makefile in /l1/bin/m/moa/, e.g., /l1/bin/m/moa/Makefile. See directory conventions for more information.
- Store your source texts and any DTDs, doctype, and files for preparing your data in $DLXSROOT/prep/c/collid/. Unlike the contents of other directories, everything in prep should be ultimately expendable in the production environment.
- Store the finalized, concatenated SGML file for your text collection in $DLXSROOT/obj/c/collid/, e.g., /l1/obj/m/moa/moa.sgm.
- Store index, region, data dictionary, and init files in $DLXSROOT/idx/c/collid/, e.g., /l1/idx/m/moa/moa.idx. See the XPAT documentation for more on these types of files.
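As a rough illustration of this layout, the following Bourne-shell sketch prints the four directories for a given collection ID. The function name dlxs_dirs is hypothetical and not part of DLXS; the sketch assumes only the path conventions described above.

```shell
#!/bin/sh
# Hypothetical helper (not part of DLXS): print the four collection
# directories for a given DLXSROOT and collection ID.
dlxs_dirs() {
    root=$1
    collid=$2
    # First letter of the collection ID, e.g. "m" for "moa".
    c=$(printf '%s' "$collid" | cut -c1)
    # bin = scripts and Makefile, prep = source and preparation files,
    # obj = concatenated SGML, idx = index, region, dd, and init files.
    for area in bin prep obj idx; do
        printf '%s/%s/%s/%s\n' "$root" "$area" "$c" "$collid"
    done
}
```

For example, dlxs_dirs /l1 moa prints /l1/bin/m/moa, /l1/prep/m/moa, /l1/obj/m/moa, and /l1/idx/m/moa, one per line. The output could then be passed to mkdir -p to create the directories.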
The files that are located in $DLXSROOT/bin/s/sampletc and $DLXSROOT/prep/s/sampletc should be copied into your collection directories and used to index your collection. The following files may need to be edited so that the #! line points to the location of perl on your system:
- $DLXSROOT/bin/t/text/isolat128bit.pl
- $DLXSROOT/bin/t/text/output.dd.frag.pl
- $DLXSROOT/bin/t/text/inc.extra.dd.pl
- $DLXSROOT/bin/t/text/cleanfiles.pl
- $DLXSROOT/bin/t/text/catsourcefiles.pl
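If you prefer not to edit each script by hand, the #! line can be rewritten with a small shell function. This is a minimal sketch, not a DLXS tool: the name fix_shebang is hypothetical, it assumes the script's first line is already a #! line, and it discards any interpreter options (such as -w) that followed the old path.

```shell
#!/bin/sh
# Hypothetical helper (not part of DLXS): rewrite the first line of a
# script so that its #! points at the given perl interpreter.
fix_shebang() {
    script=$1
    perl_path=$2
    # Refuse to touch files whose first line is not a #! line.
    head -n 1 "$script" | grep -q '^#!' || return 1
    tmp=$(mktemp) || return 1
    # Emit the new #! line, then the rest of the file from line 2 on.
    {
        printf '#!%s\n' "$perl_path"
        tail -n +2 "$script"
    } > "$tmp" || { rm -f "$tmp"; return 1; }
    # Overwrite in place so the script keeps its permissions.
    cat "$tmp" > "$script"
    rm -f "$tmp"
}
```

A call such as fix_shebang $DLXSROOT/bin/t/text/cleanfiles.pl /usr/bin/perl would update one script; /usr/bin/perl is only an example path, so substitute the location reported by which perl on your system.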
The following files will need to be edited to reflect your collection names and paths:
- $DLXSROOT/bin/s/sampletc/Makefile
- $DLXSROOT/prep/s/sampletc/sampletc.blank.dd
- $DLXSROOT/prep/s/sampletc/sampletc.extra.srch
- $DLXSROOT/prep/s/sampletc/sampletc.inp
Preparing your data
Within your prep directory, create a data subdirectory for your collection and copy the texts for your collection into it. In our example collection for the Making of America, this would be $DLXSROOT/prep/m/moa/data/. Ensure that your converted documents validate against the TextClass DTD and conform to the text structure document. Now you are ready for your final document preparation.
- You need to decide whether you wish to keep character entities (for example, &eacute;) in your text files or replace them with their 8-bit ISO Latin 1 equivalents (for example, é). If you choose to replace your character entities, you will be able to search for blessed, for example, and retrieve both blesséd and blessed, because the indexing process maps both of these characters to just e. Otherwise, you would have to search for blesséd to retrieve the word with the diacritic. If you want to convert your entities, use the make convert command in the Makefile stored at $DLXSROOT/bin/c/collid/Makefile. See also this reference on converting your data to Unicode.
- Normalize the SGML files. Normalization adjusts the SGML tagging, where necessary, so that it is consistent in the case and order of element attributes. It may be run as a batch in the $DLXSROOT/prep/c/collid/data/ directory using the following shell commands (this is for tcsh; different syntax may be appropriate in different shells):
foreach file (*.sgm)
  sgmlnorm $DLXSROOT/prep/s/sampletc/sampletc.text.inp $file > $file.norm
end
- Concatenate the separate normalized files into one collection file. If you do not care about the order in which the files occur, this command will suffice:
cat *.norm > $DLXSROOT/obj/c/collid/collid.sgm
- Before indexing, check to see if node attributes have been applied when the documents were converted to Text Class -- they will appear in the DIV tags for each division and will look like this: <DIV1 NODE="AAN8938.0001.001:1">. If they have not, use the make nodefy command in the Makefile stored at $DLXSROOT/bin/c/collid/Makefile. Node attributes are necessary for building reliable results lists in Text Class; the lack of nodes will result in an assertion error in the middleware.
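For readers who do not use tcsh, the normalization and concatenation steps can be sketched in Bourne-shell syntax, together with a quick check for missing node attributes. The function names below are hypothetical and not part of DLXS. The check assumes upper-case tag names, as in the example above, and a grep that supports the -o option.

```shell
#!/bin/sh
# Bourne-shell equivalent of the tcsh loop: normalize every *.sgm file
# in the current directory, then concatenate the results.
# $1 = input file passed to sgmlnorm ahead of each document,
# $2 = path of the concatenated collection file.
normalize_and_concat() {
    inp=$1
    out=$2
    for file in *.sgm; do
        # If nothing matches, the pattern stays literal; skip it.
        [ -e "$file" ] || continue
        sgmlnorm "$inp" "$file" > "$file.norm" || return 1
    done
    cat ./*.norm > "$out"
}

# Hypothetical check (not part of DLXS): count opening DIVn tags that
# carry no NODE attribute. Newlines are flattened first so that tags
# wrapped across lines still match.
count_divs_without_node() {
    # grep -c exits non-zero when the count is 0, hence "|| true".
    tr '\n' ' ' < "$1" | grep -o '<DIV[0-9][^>]*>' | grep -vc 'NODE=' || true
}
```

Run normalize_and_concat from the data directory, giving it the sgmlnorm input file and the collection file path as arguments. If count_divs_without_node then prints anything other than 0 for the collection file, node attributes are missing and make nodefy is needed.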