Last updated 2002-07-08 12:17:12 EDT
Doc Title Preparing Data for Text Class
Author 1 Powell, Chris
CVS Revision $Revision: 1.6 $
Preparing Data for Index Building (Text Class)

Setting up directories

You will need to identify directories where you plan to store your SGML or XML source file, your index file (approximately 75% of the size of your SGML source), your "region" files and other information such as data dictionaries, and files you use to prepare your data. We recommend you use the following structure:

The files located in $DLXSROOT/bin/s/sampletc and $DLXSROOT/prep/s/sampletc should be copied into your collection directories and used to index your collection. The following files may need to be edited so that the #! line points to your location of perl:

The following files will need to be edited to reflect your collection names and paths:

Preparing your data

Within your prep directory, create a data subdirectory for your collection and copy the texts for your collection into it. In our example collection for the Making of America, this would be $DLXSROOT/prep/m/moa/data/. Ensure that your converted documents validate against the TextClass DTD and conform to the text structure document. Now you are ready for your final document preparation.

  1. You need to decide whether you wish to keep character entities (for example, &eacute;) in your text files or replace them with their 8-bit ISO Latin 1 equivalents (for example, é). If you choose to replace your character entities, you will be able to search for blessed, for example, and retrieve both blesséd and blessed, because the indexing process maps both é and e to just e. Otherwise, you would have to search for blesséd to retrieve the word with the diacritic. If you want to convert your entities, use the make convert command in the Makefile stored at $DLXSROOT/bin/c/collid/Makefile. See also this reference on converting your data to Unicode.
  2. Normalize the SGML files. Normalization adjusts the SGML tagging, where necessary, so that it is consistent in case and in the order of element attributes. This may be run as a batch in the $DLXSROOT/prep/c/collid/data/ directory using the following shell command (this is for tcsh; different syntax may be appropriate in different shells):
    foreach file (*.sgm)
    sgmlnorm $DLXSROOT/prep/s/sampletc/sampletc.text.inp $file > $file.norm
    end
  3. Concatenate the separate normalized files into one collection file. If the order in which the files occur does not matter, this command will suffice (the shell expands *.norm in sorted name order): cat *.norm > $DLXSROOT/bin/c/collid/collid.sgm
  4. Before indexing, check to see if node attributes have been applied when the documents were converted to Text Class -- they will appear in the DIV tags for each division and will look like this: <DIV1 NODE="AAN8938.0001.001:1">. If they have not, use the make nodefy command in the Makefile stored at $DLXSROOT/bin/c/collid/Makefile. Node attributes are necessary for building reliable results lists in Text Class; the lack of nodes will result in an assertion error in the middleware.
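Steps 3 and 4 above can be sketched as two small shell helpers. This is only an illustration, not part of the DLXS distribution: the function names concat_in_order and count_divs_without_node are hypothetical, as is the idea of a manifest file listing the normalized files in the order you want. The node check assumes normalized SGML in which tag and attribute names are upper case and each DIV start tag sits on a single line, and it relies on grep -o, which GNU and BSD grep provide but POSIX does not require.

```shell
#!/bin/sh
# Hypothetical helpers; names and the manifest convention are assumptions.

# Concatenate files in the order given by a manifest (one file name per
# line), for cases where the sorted order of "cat *.norm" is not wanted.
concat_in_order() {
    manifest=$1
    output=$2
    : > "$output"                        # start with an empty output file
    while IFS= read -r name; do
        [ -n "$name" ] || continue       # skip blank lines in the manifest
        cat "$name" >> "$output" || return 1
    done < "$manifest"
}

# Print the number of DIVn start tags that carry no NODE attribute.
# Assumes upper-case tags and one start tag per line (see lead-in).
count_divs_without_node() {
    grep -o '<DIV[0-9][^>]*>' "$1" | grep -v 'NODE="' | wc -l | tr -d ' '
}
```

For example, running count_divs_without_node on $DLXSROOT/bin/c/collid/collid.sgm and getting a result other than 0 would suggest that make nodefy still needs to be run before indexing.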