Last updated	2002-07-08 12:17:12 EDT
Doc Title	Preparing Data for Text Class
Author 1	Powell, Chris
CVS Revision	$Revision: 1.6 $

Preparing Data for Index Building (Text Class)

Setting up directories

You will need to identify directories where you plan to store your SGML or XML source files, your index file (approximately 75% of the size of your SGML source), your "region" files, other information such as data dictionaries, and the files you use to prepare your data. We recommend the following structure:

Store specialized scripts for your collection and its Makefile in $DLXSROOT/bin/c/collid/, where $DLXSROOT is the "tree" where you install all DLXS components, c is the first letter of the name of the collection you are indexing, and collid is the collection ID of the collection you are indexing. For example, if your collection ID is "moa" and your DLXSROOT is "/l1", you will place the Makefile in /l1/bin/m/moa/, e.g., /l1/bin/m/moa/Makefile. See directory conventions for more information.
Store your source texts and any DTDs, doctype, and files for preparing your data in $DLXSROOT/prep/c/collid/. Unlike the contents of other directories, everything in prep should be ultimately expendable in the production environment.
Store the finalized, concatenated SGML file for your text collection in $DLXSROOT/obj/c/collid/, e.g., /l1/obj/m/moa/moa.sgm.
Store index, region, data dictionary, and init files in $DLXSROOT/idx/c/collid/, e.g., /l1/idx/m/moa/moa.idx. See the XPAT documentation for more on these types of files.
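The layout above can be created in one step. The following is a minimal Bourne-shell sketch, not part of the DLXS distribution: make_collection_dirs is a hypothetical helper name, and it assumes only the four directory areas described above.

```shell
#!/bin/sh
# Sketch only: create the recommended directories for one collection.
# make_collection_dirs is a hypothetical helper, not a DLXS tool.
make_collection_dirs() {
    root=$1      # the DLXS tree, e.g. /l1
    collid=$2    # the collection ID, e.g. moa
    # The first letter of the collection ID names the intermediate directory.
    c=$(printf '%s' "$collid" | cut -c1)
    for area in bin prep obj idx; do
        mkdir -p "$root/$area/$c/$collid" || return 1
    done
}
```

For the example collection, make_collection_dirs /l1 moa would create /l1/bin/m/moa, /l1/prep/m/moa, /l1/obj/m/moa, and /l1/idx/m/moa.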

The files that are located in $DLXSROOT/bin/s/sampletc and $DLXSROOT/prep/s/sampletc should be copied into your collection directories and used to index your collection. The following files may need to be edited so that the #! line points to the location of perl on your system:

$DLXSROOT/bin/t/text/isolat128bit.pl
$DLXSROOT/bin/t/text/output.dd.frag.pl
$DLXSROOT/bin/t/text/inc.extra.dd.pl
$DLXSROOT/bin/t/text/cleanfiles.pl
$DLXSROOT/bin/t/text/catsourcefiles.pl

The following files will need to be edited to reflect your collection names and paths:

$DLXSROOT/bin/s/sampletc/Makefile
$DLXSROOT/prep/s/sampletc/sampletc.blank.dd
$DLXSROOT/prep/s/sampletc/sampletc.extra.srch
$DLXSROOT/prep/s/sampletc/sampletc.inp
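After copying these files, it can help to see which lines still mention the sample collection. The sketch below only reports such lines and changes nothing. find_sample_refs is a hypothetical helper name, and treating the string "sampletc" as the marker for lines that need attention is an assumption.

```shell
#!/bin/sh
# Sketch only: list every line under a directory that still mentions
# the sample collection ID, so it can be edited by hand.
# find_sample_refs is a hypothetical helper, not a DLXS tool.
find_sample_refs() {
    dir=$1
    # /dev/null makes grep print file names even when given one file.
    find "$dir" -type f -exec grep -n 'sampletc' /dev/null {} +
}
```

Each line of output has the form file:line number:text, which can serve as a checklist while editing.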

Preparing your data

Within your prep directory, create a data subdirectory for your collection and copy the texts for your collection into it. In our example collection for the Making of America, this would be $DLXSROOT/prep/m/moa/data/. Ensure that your converted documents validate against the TextClass DTD and conform to the text structure document. Now you are ready for your final document preparation.

You need to decide whether you wish to keep character entities (for example, &eacute;) in your text files or replace them with their 8-bit ISO Latin 1 equivalents (for example, é). If you choose to replace your character entities, you will be able to search for blessed, for example, and retrieve both blesséd and blessed, because the indexing process maps both of these characters to just e. Otherwise, you would have to search for bless&eacute;d to retrieve the word with the diacritic. If you want to convert your entities, use the make convert command in the Makefile stored at $DLXSROOT/bin/c/collid/Makefile. See also this reference on converting your data to Unicode.
Normalize the SGML files, which, if necessary, adjusts the SGML tagging so that it is consistent in terms of case and order of element attributes. This may be run in a batch in the $DLXSROOT/prep/c/collid/data/ directory using the following shell command (this is for tcsh; different syntax may be appropriate in different shells):
foreach file (*.sgm)
sgmlnorm $DLXSROOT/prep/s/sampletc/sampletc.text.inp $file > $file.norm
end
Concatenate the separate normalized files into one collection file. If you do not care about the order in which the files will occur, this command will suffice: cat *.norm > $DLXSROOT/obj/c/collid/collid.sgm
Before indexing, check to see if node attributes have been applied when the documents were converted to Text Class -- they will appear in the DIV tags for each division and will look like this: <DIV1 NODE="AAN8938.0001.001:1">. If they have not, use the make nodefy command in the Makefile stored at $DLXSROOT/bin/c/collid/Makefile. Node attributes are necessary for building reliable results lists in Text Class; the lack of nodes will result in an assertion error in the middleware.
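For sites that use a Bourne-style shell rather than tcsh, the steps above might be sketched as follows. Everything here is illustrative and not part of DLXS: the function names, the NORMALIZE variable, and the idea of an order list file are all assumptions. The grep -o option used here is not in POSIX but is available in GNU and BSD grep. The node check works tag by tag and assumes each DIV start tag sits on a single line.

```shell
#!/bin/sh
# Sketch only: Bourne-shell helpers mirroring the preparation steps.
# All names below are hypothetical, not DLXS tools.

# Command used to normalize one file; override it for other setups.
NORMALIZE=${NORMALIZE:-sgmlnorm}

# Step 1 aid: count each character entity used in the given files,
# to help decide whether conversion is worthwhile.
list_entities() {
    grep -o -h '&[A-Za-z][A-Za-z0-9]*;' "$@" | sort | uniq -c
}

# Step 2: normalize every .sgm file in the current directory.
# $1 is the file passed ahead of each document, as in the tcsh loop.
normalize_all() {
    decl=$1
    for file in *.sgm; do
        [ -f "$file" ] || continue        # no .sgm files matched
        $NORMALIZE "$decl" "$file" > "$file.norm" || return 1
    done
}

# Step 3: concatenate files in the order given in a list file
# (one file name per line), for when document order matters.
concat_in_order() {
    list=$1
    out=$2
    : > "$out"                            # start with an empty output file
    while IFS= read -r name || [ -n "$name" ]; do
        [ -n "$name" ] || continue        # skip blank lines
        cat "$name" >> "$out" || return 1
    done < "$list"
}

# Step 4: print DIV start tags that carry no NODE attribute.
# Exit status 0 means at least one such tag was found.
divs_missing_node() {
    grep -o -i -E '<DIV[0-9]+[^>]*>' "$@" | grep -v -i 'NODE='
}
```

A possible sequence, run from the data directory: normalize_all $DLXSROOT/prep/s/sampletc/sampletc.text.inp, then concat_in_order order.txt collid.sgm, then divs_missing_node collid.sgm. If the last command prints anything, the make nodefy step is needed.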