To make the most of Finding Aids Class and Text Class in DLXS Release 14, you will want to convert or otherwise handle the character entities, numeric entities, or Latin 1 8-bit characters that have been the staples of SGML (and XML, despite the default encoding of UTF-8) for so long. Even with finding aids that are already in EAD 2002 XML, you will need to do some testing of character encodings, conversion of these encodings (if they exist) to UTF-8, normalization, and conversion of SGML to XML (strange but true).
There are a number of possibilities you may encounter.
There are a number of tools you can use to identify what you have before you.
jhove -c /l/local/jhove/conf/jhove.conf -m utf8-hul file.xml
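If jhove isn't installed, iconv itself can double as a quick validity check (a Bourne-shell sketch; file.xml is a placeholder name): it exits non-zero at the first byte sequence that isn't well-formed UTF-8.

```shell
# Exit status tells you whether the file is clean UTF-8 (plain ASCII
# passes too, since ASCII is a subset of UTF-8).
if iconv -f UTF-8 -t UTF-8 file.xml > /dev/null 2>&1; then
    echo "file.xml: valid UTF-8 (or plain ASCII)"
else
    echo "file.xml: not valid UTF-8"
fi
```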
If you have a mixed bag of encodings and entities in your documents, there's a definite order in which you want to approach the conversion task, to avoid having a mixture of Latin 1 and UTF-8 in one document at any point in the transformation.
iconv -f iso88591 -t utf8 oldfile > newfile
/l1/bin/t/text/isocer2utf8 oldfile > newfile
/l1/bin/t/text/ncr2utf8 oldfile > newfile
This would be a good point to run findEntities.pl again to see what (if anything) you have left, and to re-validate using jhove or utf8chars to ensure that you have done no harm.
In the directory /l1/workshop-samples/sooty, you will find four sample files that we'll examine for character encoding and then convert to UTF-8: findaid1.xml, findaid2.xml, text1.xml, and text2.sgm. Copy these to your own directory -- they are completely expendable and won't serve any further purpose. They are merely illustrative of all the possibilities you might encounter and how you may want to handle them.
First, we'll look at which character or numeric entities, if any, are used in these documents.
foreach file (findaid*)
echo $file
$DLXSROOT/bin/t/text/findEntities.pl $file
end
foreach file (text*)
echo $file
$DLXSROOT/bin/t/text/findEntities.pl $file
end
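If you just want a quick census without the script, a rough approximation of what findEntities.pl reports can be had with grep (a sketch; the counts-and-names layout is mine, not the real tool's output format):

```shell
# List each distinct entity or character reference in a file,
# most frequent first.
grep -o '&[#A-Za-z][^;]*;' findaid1.xml | sort | uniq -c | sort -rn
```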
We have some CERs and NCRs to deal with, aside from the five XML-approved entities (&amp;, &gt;, &lt;, &apos;, and &quot;). So, we know we'll be needing both isocer2utf8 and ncr2utf8. Next, we'll see what characters we have (Latin 1? UTF-8? Something else?). We'll run through all three tools, just for the sake of completeness, in order of speediness and terseness.
foreach file (findaid*)
echo $file
xpatutf8check $file
end
foreach file (text*)
echo $file
xpatutf8check $file
end
We now know that both the text files are either UTF-8 or plain ASCII (because of the output of these two tests), but there's a problem with one of the finding aids. jhove will tell us a bit more about our materials. You'll note we don't need to echo the filename as that's part of the jhove report. You'll also notice jhove is not so fast.
foreach file (findaid*)
jhove -c /l/local/jhove/conf/jhove.conf -m utf8-hul $file
end
foreach file (text*)
jhove -c /l/local/jhove/conf/jhove.conf -m utf8-hul $file
end
So, the second file in each set is plain ASCII (the Basic Latin block) with entities, the first finding aid is not UTF-8, and the first text file is. Let's look a bit more at the two non-ASCII files with the slowest and most verbose tool of them all. We're not doing a foreach this time, but we wouldn't need to echo the filename either, as it is again part of what the tool reports.
utf8chars findaid1.xml
utf8chars text1.xml
We can see the exact problem with findaid1.xml -- there's an 8-bit Latin 1 e acute before Boekeloo on line 37. We can also see all the UTF-8 characters in text1.xml -- this is the sort of information that is useful when the time comes to map characters and encodings in the xpatu data dictionary.
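To see the offending bytes for yourself, sed and od make a handy microscope (a sketch; the line number comes from the utf8chars report). A lone 0xE9 byte is Latin 1 e acute, whereas the UTF-8 encoding would be the two-byte pair 0xC3 0xA9.

```shell
# Dump the bytes of line 37 in hex; look for a stray e9.
sed -n '37p' findaid1.xml | od -An -tx1
```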
Now that we know which items need what character treatments, we'll convert them. text1.xml is completely fine, so we'll leave it as is. findaid1.xml has the one Latin 1 character, so we'll use iconv to convert it to UTF-8. It had no entities of any kind, so we'll be done with it after this step.
iconv -f iso88591 -t utf8 findaid1.xml > findaid1.xml.utf
Next, findaid2.xml had numeric character references. It is fine and can be indexed as-is, but users would need to search for the numeric string in the midst of words (&#xE9; for é, for example). So, we'll use ncr2utf8 to convert the references into the characters. WARNING! &#x26; is the ampersand (as is &#38;) -- if you convert these to the character, you will run into validation problems down the road, as bare ampersands are not permitted in XML. Don't get carried away!
ncr2utf8 findaid2.xml > findaid2.xml.utf
Finally, text2.sgm has ISO character entity references (from Latin 1, Greek, and Publishing) that need to be converted to UTF-8 with isocer2utf.
isocer2utf8 text2.sgm > text2.sgm.utf
Note that the ampersand CER was not processed. This is perfectly correct.
Some of you may be in a position where you'll want to be converting your SGML files to XML. Many of you will be fortunate enough to have files already in XML -- say, finding aids in EAD 2002. However, these will have to be normalized, too, to avoid problems with xpatu and xmlrgn down the road by ensuring that all the attributes are in the same order as specified in the DTD. Because of known but uncorrected problems in the normalization tools, you will end up with SGML and will need to convert that to XML.
Because the file we want to work with is now UTF-8, we need to set some environment variables to let the tools from the sp package know this is UTF-8. It doesn't matter that you've set your PuTTY window to UTF-8: if you are using osx, osgmlnorm, or onsgmls, you must set your environment properly.
setenv SP_CHARSET_FIXED YES
setenv SP_ENCODING utf-8
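If your login shell is Bourne-style (sh or bash) rather than csh/tcsh, the equivalent incantation is:

```shell
# Same sp-package settings, Bourne-shell syntax.
export SP_CHARSET_FIXED=YES
export SP_ENCODING=utf-8
```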
First we normalize, invoking a declaration to handle the non-SGML UTF-8 characters without claiming that the material itself is XML.
osgmlnorm $DLXSROOT/misc/sgml/xmlentities.dcl sample.inp text2.sgm.utf > text2.sgm.norm
Now I'll test the output with one of the UTF-8 tools to make sure that it's come through unscathed, and with findEntities.pl to see what has happened with the remaining XML-friendly entities, and it's fine. Now to convert our SGML to XML using osx.
osx -x no-nl-in-tag -x empty -E 500 -f errors $DLXSROOT/misc/sgml/xmlentities.dcl sample.inp text2.sgm.norm > text2.xml
Again I'll test the output with one of the UTF-8 tools to make sure that it's come through unscathed, and with findEntities.pl to see what has happened with the remaining XML-friendly entities, and again it's fine.
Just for fun, we'll normalize the files already in XML, to show that things get changed from XML to SGML against their will.
osgmlnorm $DLXSROOT/misc/sgml/xml.dcl $DLXSROOT/prep/s/sampletc_utf8/sampletc_utf8.text.inp text1.xml > text1.xml.norm
osx -x no-nl-in-tag -x empty -E 5000 -f error $DLXSROOT/misc/sgml/xml.dcl $DLXSROOT/prep/s/sampletc_utf8/sampletc_utf8.text.inp text1.xml.norm > text1.xml.norm.xml