DLXS Unicode Data Preparation and Online Presentation Issues

Introduction

This document describes in some detail the issues involved in Unicode data preparation and indexing, middleware configuration, template issues and user input. In its data preparation and indexing aspect, it is mainly applicable to TextClass, BibClass and FindaidClass. With respect to the remaining issues, it relates to all the classes.

For information on data preparation for individual classes that is not specific to Unicode, check the following links:

About Unicode

The authoritative source for information about Unicode is the Unicode Consortium. There you will find the complete standard and many helpful links to other sources of information on Unicode.

First, some definitions. A Character Repertoire is a collection of abstract characters independent of how they look when printed. A Coded Character Set is an assignment of a unique number to each character in a Character Repertoire. The ISO/IEC 10646 Coded Character Set assigns a unique number to virtually every character in all the world's alphabets. These numbers are called Code Points. Unicode is a standard built on top of ISO/IEC 10646 that, in addition to specifying the assignment of number to character, deals with things like collation, bi-directionality, normalization and, most importantly, encoding. A Character Encoding Scheme (encoding) specifies how the number that stands for a character is stored in a file or in computer memory.

There are several Character Encoding Schemes defined by the Unicode Standard, but the one of interest to us is called UTF-8. The UTF-8 encoding of the Unicode Coded Character Set is the preferred encoding for Unicode on the Web. It is a multi-byte encoding, meaning that it uses from 1 to 4 bytes to encode the Unicode Code Point (number) of a given character. (The original UTF-8 design allowed sequences of up to 6 bytes, but the current definition stops at 4 because Unicode Code Points end at U+10FFFF.) UTF-8 and US-ASCII are identical in the range 0-7F hex. Above 7F, 2 or more bytes are required to encode the number assigned to a Unicode character. With Unicode it is possible for one document to contain characters from many different alphabets and to treat them uniformly for search purposes.
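The relationship between Code Point and encoded length can be checked directly. The following is a minimal Python sketch, not part of DLXS; the sample characters and the helper name utf8_length are chosen here purely for illustration.

```python
# Report how many bytes UTF-8 needs for characters in different ranges.

def utf8_length(ch):
    """Return the number of bytes in the UTF-8 encoding of one character."""
    return len(ch.encode("utf-8"))

# ASCII letter, accented Latin letter, Greek letter, CJK ideograph,
# and a character outside the Basic Multilingual Plane.
samples = ["A", "\u00e9", "\u03b1", "\u4e2d", "\U0001D11E"]

for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} takes {utf8_length(ch)} byte(s): {encoded.hex()}")
```

Characters up to U+007F encode to the same single byte as in US-ASCII, which is why a pure ASCII file is already valid UTF-8.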

DLXS Background

Prior to release 12, DLXS depended on a variety of mechanisms to handle non-ASCII character data. These included:

These mechanisms are not required if the data is in Unicode, especially now that Unicode fonts are widely available in the current generation of web browsers.

Platform Requirements

It is necessary to use the latest software versions recommended in DLXS System Requirements.

There are a few terminal emulators that handle UTF-8 encoded Unicode reasonably well:

Data Conversion

If your data does not come to you in Unicode UTF-8 encoded XML, conversion is necessary. A typical conversion might be as follows. Note that you may need to perform only one of steps (A) and (B), depending on what form your data takes: non-ASCII characters in your data may be represented by entities or encoded directly in, for instance, ISO-8859-1. If your data mixes the two forms, both steps are required.

A useful reference to Unicode characters is the file UnicodeData.txt available from the Unicode Consortium and delivered with Perl 5.8 under, for example, PERLROOT/perl/lib/5.8.3/unicore/.

(A) Convert the data to the Unicode UTF-8 encoding

Use the iconv program. The following example on Linux assumes your data is initially encoded in ISO-8859-1/Latin1:

iconv -c -f ISO-8859-1 -t UTF-8 -o outfile infile

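Use the Perl Unicode::MapUTF8 module in a script like the following:

#!/l/local/bin/perl -w
# Transcode ISO-8859-1 input (named files or STDIN) to UTF-8 on STDOUT.
use strict;
use Unicode::MapUTF8 qw(to_utf8);
while (<>) {
    # to_utf8 returns the current line re-encoded as UTF-8
    print to_utf8({ -string => $_, -charset => 'ISO-8859-1' });
}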

Use a program like XMLSpy to read in your file and write it out UTF-8 encoded.

(B) Convert numeric character references (NCRs) and SGML character entity references (CERs) to Unicode UTF-8 encoded characters

Since your ultimate goal is to have UTF-8 encoded XML, recall that XML has 5 predefined CERs which you do not need to convert and which the utilities described below do not touch. They are &amp;, &lt;, &gt;, &apos; and &quot;.

Programs such as XMLSpy or osx may do the needed conversions for you but vary in their handling of SGML SDATA and NDATA entities. In some cases you may also benefit from the following two utilities.

For NCRs, i.e. references of the form &#DDDD; where D is a decimal digit or &#xXXXX; where X is a hexadecimal digit, you can use the DLXS utility program DLXSROOT/bin/t/text/ncr2utf8 run as:

ncr2utf8 inputfile > outputfile
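To make the transformation concrete, here is a minimal Python sketch of NCR replacement. It is not the ncr2utf8 utility and makes no claim about how that program works; the names NCR_PATTERN and ncr_to_text are hypothetical. As an assumption of this sketch, references to the markup characters &, < and > are left as references so that the result stays well-formed.

```python
import re

# Matches hexadecimal (&#xC5;) and decimal (&#197;) numeric character references.
NCR_PATTERN = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")

# Assumption of this sketch: never expand references to &, < and >,
# because the literal characters would break XML well-formedness.
MARKUP_CODE_POINTS = {0x26, 0x3C, 0x3E}

def ncr_to_text(text):
    """Replace each numeric character reference with the character it names."""
    def replace(match):
        hex_digits, dec_digits = match.groups()
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        if code_point in MARKUP_CODE_POINTS:
            return match.group(0)
        try:
            return chr(code_point)
        except (ValueError, OverflowError):
            # Not a valid Unicode Code Point: leave the reference untouched.
            return match.group(0)
    return NCR_PATTERN.sub(replace, text)

print(ncr_to_text("&#197;ngstr&#xF6;m"))
```

To apply this to a file, read it and write the result with an explicit encoding, for example open(path, encoding="utf-8"), so that the output is stored as UTF-8.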

For CERs, e.g. references like &Aring;, you may need to analyze the references present in your data. The program DLXSROOT/bin/t/text/findEntities.pl will generate a list of CERs in your data.

It is likely that most or even all CERs in your data will come from one of the ISO Character Entity Sets: ISOamsa, ISOamsb, ISOamsc, ISOamsn, ISOamso, ISOamsr, ISOcyr1, ISOcyr2, ISOgrk1, ISOgrk2, ISOgrk3, ISOgrk4, ISOlat1, ISOlat2, ISOmfrk, ISOnum, ISOpub, ISOtech, MMLalias or MMLextra. You can use DLXSROOT/bin/t/text/isocer2utf8 run as:

isocer2utf8 inputfile > outputfile
to translate these CERs directly to UTF-8. Running findEntities.pl after this will identify any CERs outside these ISO sets.
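The same idea applies to CERs, with a lookup table in place of arithmetic. The Python sketch below is not isocer2utf8 and does not cover the full ISO entity sets: as a stand-in it uses the HTML 4 entity table shipped with Python (html.entities.name2codepoint), which overlaps with ISOlat1 and parts of some other sets. The names cer_to_text and XML_PREDEFINED are hypothetical.

```python
import re
from html.entities import name2codepoint

# The five CERs predefined by XML are left alone.
XML_PREDEFINED = {"amp", "lt", "gt", "apos", "quot"}

CER_PATTERN = re.compile(r"&([A-Za-z][A-Za-z0-9.]*);")

def cer_to_text(text):
    """Return (converted text, sorted list of entity names that were not resolved)."""
    unresolved = set()

    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED:
            return match.group(0)
        code_point = name2codepoint.get(name)
        if code_point is None:
            # Unknown to this table: keep the reference and report it.
            unresolved.add(name)
            return match.group(0)
        return chr(code_point)

    return CER_PATTERN.sub(replace, text), sorted(unresolved)

converted, leftover = cer_to_text("&Aring;ngstr&ouml;m &amp; &bogus;")
print(converted)
print(leftover)
```

The list of unresolved names plays the same role as a second findEntities.pl run: it shows which references still need a mapping of their own.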

Another option is to use an SGML parser like onsgmls together with Character Entity Declarations that substitute the Unicode NCR for the CER in the parsed output followed by a run of ncr2utf8 to complete the conversion.

Note that if you started with SGML, you may need to touch up the SGML to make it (and its DTD) XML compliant if you rely solely on the small utility programs supplied with the DLXS release. This process is outside the scope of this document (but see DLXSROOT/misc/sgml/textclass.stripped.xml.dtd for an example of the XML version of textclass.dtd). At this point you should have UTF-8 encoded XML data ready to index.

Indexing

Refer to files in DLXSROOT/prep/s/sampletc_utf8 and DLXSROOT/bin/s/sampletc_utf8 for the following discussion.

DLXS delivers a Makefile to take you through the process of building the main XPAT index and the fabricated region indexes. The process is very similar for Latin1 encoded SGML data and UTF-8 encoded XML data. This process is outlined in TextClass Indexing. The main difference between the non-Unicode Makefile and the Unicode Makefile is that xpatbldu, xpatu and xmlrgn are used instead of xpatbld, xpat and sgmlrgn.

Be sure your XML data file begins with the XML declaration:

<?xml version="1.0" encoding="UTF-8"?>
Without this declaration, xmlrgn will not build correct region indexes.
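A quick pre-flight check before running the Makefile can catch a missing declaration. The following Python sketch is not part of DLXS; the function name declares_utf8 is hypothetical, and the check is deliberately strict in that the declaration must be the very first bytes of the file.

```python
import re

# An XML declaration with version 1.x and an explicit encoding name.
DECL_PATTERN = re.compile(
    rb"^<\?xml\s+version\s*=\s*[\"']1\.[0-9]+[\"']"
    rb"\s+encoding\s*=\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']"
)

def declares_utf8(head):
    """Return True if the given leading bytes start with a UTF-8 XML declaration."""
    match = DECL_PATTERN.match(head)
    return bool(match) and match.group(1).upper() == b"UTF-8"

print(declares_utf8(b'<?xml version="1.0" encoding="UTF-8"?>\n<doc/>'))
```

In practice the argument would be the first hundred or so bytes of the data file, read in binary mode.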

The most important input to the indexing process is the XPAT Data Dictionary. If your data spans several languages, especially languages with non-Latin alphabets, you will need to configure a Data Dictionary that takes this into account. The sampletc_utf8.blank.dd can be used as a starting point and, with some editing, is sufficient for Latin-based languages. There are two sections in the Data Dictionary that need attention: the Index Points and the Mappings.

Once these sections in the Data Dictionary have been configured the indexing process can proceed via the Makefile. Note that if you have XML element or attribute names that contain non-ASCII characters in your document you should use multirgn to generate the region indexes due to a limitation in xmlrgn. It is expected that this case is rare.

Index Point specification

This specification tells XPAT what points in the data to index. Typically, XPAT is directed to index and search beginning at an alphabetic character following a blank space, i.e. a word. Here is the Index Point specification section of the sampletc_utf8.blank.dd in prep:

   <IndexPoints>
        <IndexPt> &printable.</IndexPt>
        <IndexPt>&printable.-</IndexPt>
        <IndexPt>-&printable.</IndexPt>
        <IndexPt>&printable.&lt.</IndexPt>
        <IndexPt>&printable.&amp.</IndexPt>
        <IndexPt> &Latin.</IndexPt>
        <IndexPt>&Latin.-</IndexPt>
        <IndexPt>-&Latin.</IndexPt>
        <IndexPt>&Latin.&lt.</IndexPt>
        <IndexPt>&Latin.&amp.</IndexPt>
        <IndexPt> &Greek.</IndexPt>
        <IndexPt>&Greek.-</IndexPt>
        <IndexPt>-&Greek.</IndexPt>
        <IndexPt>&Greek.&lt.</IndexPt>
        <IndexPt>&Greek.&amp.</IndexPt>
   </IndexPoints>
The sampletc_utf8.xml data file contains characters from the Latin and Greek alphabets. Index points are defined for the characters from each of those alphabets using XPAT Unicode metacharacters like "&Latin." and "&Greek.". These metacharacters group Unicode characters into "blocks" which correspond roughly to alphabets. The document The XPAT Data Dictionary has a list of these Unicode metacharacters together with the characters that belong to each block (about midway through the section). If your character data is Latin-based it will probably suffice to simply remove the Greek elements from sampletc_utf8.blank.dd.

It is not advisable to specify all the blocks so as to create a "universal" Data Dictionary. This would impose a performance and memory penalty on XPAT at runtime.
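To decide which blocks a Data Dictionary actually needs, it can help to take an inventory of the alphabets present in the data. The Python sketch below is not part of DLXS; it groups alphabetic characters by the first word of their Unicode character name (LATIN, GREEK, CYRILLIC and so on), which only approximates Unicode blocks and does not necessarily match the XPAT metacharacter names. The function name script_inventory is hypothetical.

```python
import unicodedata
from collections import Counter

def script_inventory(text):
    """Count alphabetic characters by the first word of their Unicode name."""
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            # Characters without a name are grouped under UNNAMED.
            name = unicodedata.name(ch, "UNNAMED")
            counts[name.split()[0]] += 1
    return counts

sample = "logos \u03bb\u03cc\u03b3\u03bf\u03c2"
for script, count in script_inventory(sample).most_common():
    print(script, count)
```

Run over the whole data file (opened with encoding="utf-8"), the counts show which alphabets occur often enough to deserve index points.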

Not all languages have a concept of upper and lower case.

Languages such as Chinese do not separate "words" with spaces. This presents a problem for XPAT. A partial solution is to specify every character to be an index point:

<IndexPt>&Hangul.&Hangul.</IndexPt>
This would result in an index 4 times the size of the data and a large runtime memory requirement for the XPAT index point table. As of this writing it should be considered experimental. False hits are possible, but they should become less likely as the length of the query increases.

Mappings specification

Case insensitivity makes it easier for users to enter query terms. This is implemented in the Mappings section by mapping uppercase characters to their lowercase equivalent. Keyboards in the United States usually do not have keys for the accented characters used in European languages. These accented characters are mapped to their unaccented forms in the Mappings section. This allows search and retrieval whether the character appears accented or unaccented in the data. Apropos of Unicode, here is a part of the Mappings section devoted to mapping uppercase Greek to lowercase:

 
        ...
        <Map><From>U+0391</From><To>U+03B1</To></Map>
        <Map><From>U+0392</From><To>U+03B2</To></Map>
        <Map><From>U+0393</From><To>U+03B3</To></Map>
        <Map><From>U+0394</From><To>U+03B4</To></Map>
        <Map><From>U+0395</From><To>U+03B5</To></Map>
        ...
      
Note that the Greek characters are specified using the "U+" Unicode notation. The number following the "U+" is the Unicode Code Point for the character expressed in hexadecimal notation. From this one can see that the Data Dictionary can be built entirely from ASCII characters. It is not necessary to have a UTF-8 enabled editor. The XPAT Unicode implementation currently accepts values up to U+FFFF (65535). This covers all the characters defined in Unicode Plane 0, also referred to as the Basic Multilingual Plane.

While there are characters in higher planes, they are relatively rare, and this XPAT limitation is not expected to present an obstacle to indexing your Unicode-based texts. Should the need arise, XPAT can be extended to use a full 32-bit word internally. As there is little need for this currently, it is more memory efficient to use a 16-bit word to store characters in memory.
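Because Map entries are plain ASCII in "U+" notation, they lend themselves to being generated rather than typed. The following Python sketch, which is not part of DLXS, prints uppercase-to-lowercase entries for a range of Code Points using Python's own case tables. The function name case_map_entries is hypothetical; results that are not a single character in the Basic Multilingual Plane are skipped to respect the U+FFFF limit.

```python
def case_map_entries(first, last):
    """Build <Map> lines mapping each uppercase Code Point in [first, last] to lowercase."""
    entries = []
    for code_point in range(first, last + 1):
        upper = chr(code_point)
        lower = upper.lower()
        # Skip caseless or unassigned Code Points, multi-character results,
        # and anything beyond the Basic Multilingual Plane.
        if lower == upper or len(lower) != 1 or ord(lower) > 0xFFFF:
            continue
        entries.append(
            f"<Map><From>U+{code_point:04X}</From><To>U+{ord(lower):04X}</To></Map>"
        )
    return entries

# Greek capital letters Alpha through Omega.
for line in case_map_entries(0x0391, 0x03A9):
    print(line)
```

The generated lines should still be reviewed by hand, since the mappings that suit your audience (for example, folding accented characters to their base letters) go beyond simple case conversion.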

You will need to analyze your texts to decide what sort of mapping may be useful to your target audiences. There are many issues to consider. Input mechanisms dominate these considerations.

DLXS is exploring the addition of a configurable JavaScript popup virtual keyboard to allow users to enter characters from alphabets for which they lack a physical keyboard.

Collmgr Fields / Configuration

To put your data online you will naturally need to define a collection in the collection database using Collmgr. There are two differences between a non-Unicode collection and a Unicode collection. First, there is currently no support for a Unicode Wordwheel. Leave the wwappmodule, wwdd, wwrealms and wwrealmseng fields blank.

Second, to configure a Unicode collection, set the locale field to a UTF-8 locale value such as en_US.UTF-8. You can get a list of locale values recognized by your Unix system by typing locale -a at the shell prompt. A UTF-8 locale setting affects several areas of functionality in the middleware.

Templates

At the present time a large number of HTML templates have a <META> tag with charset=iso-8859-1. These templates must continue to work for data from non-Unicode collections while at the same time supporting Unicode data. Rather than adding a PI to all these templates or duplicating them we have chosen to process them automatically on output from the middleware. The middleware changes occurrences of charset=iso-8859-1 (if present) to charset=UTF-8 when outputting processed HTML templates. Templates intended to support Unicode character data should have the <META> tag with charset=iso-8859-1 (<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">) in their headers. The upshot of this is that since all templates probably conform to this requirement already no changes should be needed.
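The substitution described above amounts to a single search and replace on the processed template. The Python sketch below only illustrates that idea and is not the middleware's own code; the function name declare_utf8 is hypothetical, and matching the charset name case-insensitively is an assumption of this sketch.

```python
import re

# Matches charset=iso-8859-1 in any letter case, with optional spaces around "=".
CHARSET_PATTERN = re.compile(r"charset\s*=\s*iso-8859-1", re.IGNORECASE)

def declare_utf8(html):
    """Rewrite an ISO-8859-1 charset declaration to UTF-8; other text is unchanged."""
    return CHARSET_PATTERN.sub("charset=UTF-8", html)

template = '<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">'
print(declare_utf8(template))
```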

Unicode, User Input and Form Submission

The encoding of user input to HTML forms is a complex area not made any easier by browser bugs and standards that do not address the problem fully. The best discussion of this topic is by A.J.Flavell. Basically the problem is that there is no reliable way for the browser to convey to the middleware what encoding is in effect for the data entered into a form by the user. Quoting Mr. Flavell:

In practice, browsers normally display the contents of text fields according to the character coding (charset) that applies for the HTML page as a whole; and when it submits the text fields they are effectively in this same coding. Thus if the server sent out the (page containing the) form with a definite charset specification, it could normally assume that the submitted data can be interpreted in accordance with the same charset. There are however anomalies of various kinds, some of which have been seen and understood by the author of this note, some of which have been seen and not understood, and some of which are only anecdotal at the moment.

In addition to these considerations, some users may be typing-in or pasting-in text from an application that uses their local character coding (practical examples being macRoman on a Mac; or MS-DOS CP850 being copied out of a DOS window on an MS Windows PC), into a text field of a document that used the author's - different - character coding (let's say for the simplest example, iso-8859-1): the user might then submit the form, disregarding that what they are seeing in the text area is not what they intended to send. [...]
Given this state of affairs we can see that user data entry is not 100% reliable. Nonetheless, it is reasonable to assume the following in a page sent by the middleware with charset=UTF-8:

Beyond these assertions it is impossible to generalize about how copying and pasting characters from arbitrary sources into an input field might be expected to behave.

Current Limitations in DLXS Middleware (Release 12)

The middleware supports collections with different character encodings in single collection mode and in cross-collection mode. However, the encodings are limited to Unicode UTF-8 and ISO-8859-1 (Latin1).

Any user input that is not valid UTF-8 is assumed to be Latin1 encoded. This input is transcoded to UTF-8 under this assumption. Because ASCII is, by definition, valid UTF-8, ASCII input is not changed, and XPAT queries against Latin1-based collections will proceed successfully if the data dictionary maps accented characters to their unaccented base characters. A Latin1 XPAT search result is converted to UTF-8 to enable the data to pass through the XML/XSLT parsers on output and to display correctly in the web template, which is set to charset=UTF-8. This creates a minor deficiency if the user copies a string of accented characters from the results back into the search form. The characters are now UTF-8 encoded and will not be found in a Latin1 encoded collection.
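The fallback rule for user input can be sketched in a few lines. This Python example illustrates the rule as stated and is not the middleware's own code; the function name normalize_query is hypothetical.

```python
def normalize_query(raw):
    """Decode submitted bytes as UTF-8, falling back to Latin1 if they are not valid UTF-8."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Every byte sequence is valid ISO-8859-1, so this fallback cannot fail.
        return raw.decode("iso-8859-1")

# The same word submitted in both encodings normalizes to the same string.
print(normalize_query("caf\u00e9".encode("utf-8")) == normalize_query("caf\u00e9".encode("iso-8859-1")))
```

The rule is a heuristic. A Latin1 byte sequence that happens to be valid UTF-8 is read as UTF-8, so the two Latin1 characters 0xC3 0xA9 arrive as a single "é". Such sequences are rare in real queries.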