Unicode in DLXS

This topic touches on XPAT indexing, data preparation and middleware configuration and behavior as they relate to Unicode in DLXS.

Unicode and Character Sets
DLXS data preparation and Unicode
Unicode XPAT indexing
Middleware configuration, requirements and behavior for Unicode

Documentation on this topic can also be found at:

http://www.dlxs.org/docs/13/class/unicode.html

Unicode and Character Sets

Your data may be pure ASCII, which defines 128 characters (code points 0 to 127). The ISO-8859-* encodings support 256 characters, but only one set of 256 at a time. Latin2 covers German and Polish. Latin5 covers German and Turkish. There is no single-byte encoding that covers both German and Russian, for example.

Methods to represent characters from other alphabets:

Character Entity References (CER), e.g. &Agrave; in SGML for LATIN CAPITAL LETTER A WITH GRAVE
- Not searchable. Not easily entered by the user. Not XML. Less support in browsers than for NCR
Numeric character references (NCR), e.g. &#x00C0;, which refers to the Unicode Code Point U+00C0 for LATIN CAPITAL LETTER A WITH GRAVE
- Not searchable. Not easily entered by the user.
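To illustrate that both reference styles are only indirect names for one character, here is a minimal Python sketch. It is not part of DLXS (the DLXS tool for this job is ncr2utf8, listed under Tools). It uses the standard library's html.unescape, which knows the HTML entity names; whether those match the entity sets declared in your SGML DTD is an assumption you should check.

```python
import html

# Both reference styles name the same character, U+00C0.
cer = "&Agrave;"   # character entity reference
ncr = "&#x00C0;"   # numeric character reference (hexadecimal)

resolved_cer = html.unescape(cer)
resolved_ncr = html.unescape(ncr)

# Once resolved, the character can be stored directly as UTF-8 bytes,
# which is the form a Unicode-aware index can search.
utf8_bytes = resolved_ncr.encode("utf-8")

print(resolved_cer == resolved_ncr)
print("U+%04X" % ord(resolved_ncr))
print(utf8_bytes)
```

A search for the character itself cannot match the literal text of either reference, which is why references need to be resolved before indexing.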

Unicode Definitions

Character Repertoire - collection of abstract characters independent of how they look when printed.
Coded Character Set - assignment of a unique number to each character in a Character Repertoire.
Code Point - the unique number assigned to a character in a Coded Character Set.
ISO/IEC 10646 - The Coded Character Set for Unicode. The Unicode standard specifies additional properties for each character.
Character Encoding Scheme - or Encoding for short, specifies how the number assigned to a character is stored in a file or in computer memory.
- UTF-8, UTF-16
- UTF-8 is a multi-byte variable length encoding. A byte is not a character, except for code points 0 to 127.
- ASCII text is valid UTF-8 by design: code points 0 to 127 are stored as the same single bytes.
- Basic Multilingual Plane (BMP). 65,536 code points. Divided into blocks of varying size by alphabet.
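The difference between a code point and its encoded bytes can be seen in a short Python sketch. The sample characters and the helper name utf8_length are chosen here for illustration only.

```python
def utf8_length(ch):
    """Return the number of bytes UTF-8 uses to store one character."""
    return len(ch.encode("utf-8"))

# One character each from ASCII, Latin-1 Supplement, Greek and Han.
samples = ["A", "\u00e9", "\u03b1", "\u4e2d"]

for ch in samples:
    encoded = ch.encode("utf-8")
    print("U+%04X -> %d byte(s): %s" % (ord(ch), utf8_length(ch), encoded.hex()))

lengths = [utf8_length(ch) for ch in samples]

# Pure ASCII text produces identical bytes under both encodings.
ascii_text = "DLXS"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")
```

Only the first sample occupies a single byte, so counting bytes is not the same as counting characters once the data leaves the ASCII range.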

Reasons to use Unicode

Can represent more than one alphabet in a single document or web page.
Searchable. (xpatu)
Programming is simpler.
Users can enter unaccented Latin characters and get results for accented Latin characters via XPAT mapping functionality.
Non-ASCII characters can be entered by users via national keyboards, virtual keyboards, IMEs, copy/paste.
Can be collated.
Fundamental to XML.

DLXS data preparation and Unicode

Tools are getting better, more plentiful.

Viewers, Terminal emulators

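Do you see Ã© instead of é? Then UTF-8 encoded text is being displayed as if it were Latin-1, and you need a UTF-8 capable viewer or terminal such as the following.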
Linux
- GNOME terminal
- yudit editor
- XEmacs editor (caveats)
- xterm -u8 -fn '-misc-fixed-medium-r-semicondensed--13-120-75-75-c-60-iso10646-1'
- Bitstream Cyberbit and MS Arial Unicode fonts
Windows
- PuTTY terminal with Hummingbird Exceed X Server version 8 or higher on Windows
- MS Arial Unicode
- XMLSpy
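The Ã© symptom mentioned above can be reproduced in a few lines of Python. This is only a sketch of the mechanism: the bytes are correct UTF-8, and the viewer interprets them with the wrong encoding.

```python
original = "\u00e9"                     # é, LATIN SMALL LETTER E WITH ACUTE

utf8_bytes = original.encode("utf-8")   # two bytes: 0xC3 0xA9

# A Latin-1 viewer shows each byte as its own character.
misread = utf8_bytes.decode("latin-1")

# Reversing the wrong step recovers the original character,
# which shows that the data itself was never damaged.
repaired = misread.encode("latin-1").decode("utf-8")

print(misread)
print(repaired)
```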

Tools

The goal is to get your data into UTF-8 encoded XML. You need to know how characters in your data have been encoded in order to transform to another encoding.

iconv -c -f ISO-8859-1 -t UTF-8 -o outfile infile
- iconv -l shows a list of supported encodings
DLXSROOT/bin/t/text/ncr2utf8
xpatutf8check
jhove -c /l/local/jhove/conf/jhove.conf -m utf8-hul
DLXSROOT/bin/t/text/utf8chars
OpenSP osx
XMLSpy
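As a rough illustration of the two steps these tools cover (transcoding from a known source encoding, then checking that the result is valid UTF-8), here is a Python sketch. The function names are hypothetical, and the validity test is simply Python's strict UTF-8 decoder; what xpatutf8check and jhove verify in detail is not described here and may differ.

```python
def latin1_to_utf8(data):
    """Transcode bytes known to be ISO-8859-1 into UTF-8 bytes."""
    return data.decode("iso-8859-1").encode("utf-8")

def is_valid_utf8(data):
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# "café" as it would be stored in a Latin-1 file: the é is one byte, 0xE9.
latin1_bytes = b"caf\xe9"
utf8_bytes = latin1_to_utf8(latin1_bytes)

print(is_valid_utf8(latin1_bytes))   # the untranscoded input
print(is_valid_utf8(utf8_bytes))     # the transcoded output
print(utf8_bytes)
```

The transcoding step only gives correct results if the source really is ISO-8859-1, which is why you need to know how your data was encoded before converting it.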

Unicode XPAT indexing

This applies to the XPAT-based classes: TextClass, FindaidClass and BibClass. ImageClass is MySQL-based. More on this when we cover data preparation for the classes in detail.

xpatbldu, xpatu and xmlrgn are used instead of xpatbld, xpat and sgmlrgn.
Sample Makefile in DLXSROOT/bin/s/sampletc_utf8
The XML declaration <?xml version="1.0" encoding="UTF-8"?> is important for xmlrgn.
Index point meta characters (e.g. &Greek. or &Cyrillic.) are based on the Unicode block definitions, Perl unicode lib (e.g. lib/5.8.3/unicore/lib/Latin.pl) and modified as described in the XPAT data dictionary document.

Specify index points in the data dictionary (.dd file) based on the alphabets in your data.

            <IndexPoints>
            <IndexPt> &printable.</IndexPt>
            <IndexPt>&printable.-</IndexPt>
            <IndexPt>-&printable.</IndexPt>
            <IndexPt>&printable.<.</IndexPt>
            <IndexPt>&printable.&.</IndexPt>

            <IndexPt> &Latin.</IndexPt>
            <IndexPt>&Latin.-</IndexPt>
            <IndexPt>-&Latin.</IndexPt>
            <IndexPt>&Latin.<.</IndexPt>

            <IndexPt> &Greek.</IndexPt>
            <IndexPt>&Greek.-</IndexPt>
            <IndexPt>-&Greek.</IndexPt>
            <IndexPt>&Greek.<.</IndexPt>
            </IndexPoints>

Also specify character mappings in the data dictionary, based on the characters that occur in your data. Note the U+XXXX notation; refer to the Unicode Character Database for the code points. The mappings are mainly used for case mapping in alphabets that have case.

            ...
            <Map><From>U+0391</From><To>U+03B1</To></Map>
            <Map><From>U+0392</From><To>U+03B2</To></Map>
            <Map><From>U+0393</From><To>U+03B3</To></Map>
            <Map><From>U+0394</From><To>U+03B4</To></Map>
            <Map><From>U+0395</From><To>U+03B5</To></Map>
            ...
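Typing case mappings by hand is error-prone, so one option is to generate them. The following Python sketch emits Map lines in the format shown above from Python's built-in Unicode case data. The function name case_maps is hypothetical, and the code assumes that a simple one-to-one uppercase-to-lowercase mapping is what you want; review the output against your data before adding it to a .dd file.

```python
import unicodedata

def case_maps(first, last):
    """Build <Map> lines for uppercase letters in the range first..last."""
    lines = []
    for cp in range(first, last + 1):
        ch = chr(cp)
        lower = ch.lower()
        # Skip unassigned code points, non-uppercase characters, and
        # characters whose lowercase form is not a single character.
        if unicodedata.category(ch) != "Lu" or len(lower) != 1 or lower == ch:
            continue
        lines.append("<Map><From>U+%04X</From><To>U+%04X</To></Map>"
                     % (cp, ord(lower)))
    return lines

# Greek capital letters ALPHA (U+0391) through OMEGA (U+03A9).
greek = case_maps(0x0391, 0x03A9)

for line in greek[:5]:
    print(line)
```

The range contains one unassigned code point (U+03A2), which the category test skips.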

DLPS Unicode examples

OAIster is 100% UTF-8 encoded XML indexed by xpatbldu and multirgn and searched using xpatu.
Supports Latin, Greek, Cyrillic, Han, Hiragana, Katakana and Hangul characters.
Highlighting based on .dd file character mappings.
OAIster data dictionary

Workshop example is 100% UTF-8 encoded XML containing English, French and Greek, indexed by xpatbldu and xmlrgn and searched using xpatu. Wordwheel building is also Unicode based.

Middleware configuration, requirements and behavior for Unicode

Use Perl 5.8.3 or higher. 5.8.8 is better. Avoid 5.8.6 (debugger problem).

If your data is UTF-8 encoded Unicode, set the collection manager (collmgr) locale field to en_US.UTF-8. Middleware will use xpatu to read the index. That is all.

To make legacy Latin-1 encoded SGML data work:

The collection manager (collmgr) locale field should be set to en_US (or left empty) to use xpat instead of xpatu to read the index.
If there are character entity references like "&eacute;", declare them in the DLXSROOT/web/(c)/(collection)/entitiesdoctype.chnk file (copied from DLXSROOT/misc/sgml), if not already present in that file.

The basic assumption INSIDE the middleware is that ANY input (user typed or search results from XPAT) is UTF-8 encoded.

USER input that is not valid UTF-8 will be transcoded into UTF-8 FROM LATIN-1
SEARCH ENGINE results from XPAT are processed through the DlpsUtils::Sgml2XmlFilter:
- to transcode into UTF-8 FROM LATIN-1
- to change SGML-style singletons (e.g. <LB>) to XML-style singletons (e.g. <LB/>).
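The two behaviors described above can be sketched in Python. This is not the middleware code: DlpsUtils::Sgml2XmlFilter is a Perl module whose implementation is not shown here, and the function names, the regular expression and the list of singleton element names below are all assumptions made for illustration.

```python
import re

def to_utf8_text(raw):
    """Decode bytes as UTF-8, falling back to Latin-1 if they are not valid."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")

def close_singletons(markup, names=("LB",)):
    """Rewrite SGML-style empty elements such as <LB> as <LB/>."""
    alternatives = "|".join(re.escape(name) for name in names)
    # Match <NAME> or <NAME attrs>, but not tags that already end in "/>".
    pattern = re.compile(r"<(%s)(\s[^<>]*?)?(?<!/)>" % alternatives)
    return pattern.sub(
        lambda m: "<%s%s/>" % (m.group(1), m.group(2) or ""), markup)

print(to_utf8_text(b"caf\xe9"))           # Latin-1 input
print(to_utf8_text(b"caf\xc3\xa9"))       # UTF-8 input
print(close_singletons("line one<LB>line two"))
```

The fallback works because any byte sequence is decodable as Latin-1, while most non-ASCII Latin-1 text is not valid UTF-8.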

Downside: Searches for accented characters will fail in Latin-1 collections because the user's search term will be converted to UTF-8 but the collection data will be Latin-1. Unaccented searches will still work.
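The reason for this downside is visible at the byte level. In the following Python sketch a simple byte comparison stands in for the search (XPAT itself is not involved), and the sample text is invented.

```python
# Collection data as stored in a legacy Latin-1 index.
collection_data = "un caf\u00e9 noir".encode("latin-1")

# Search terms as the middleware holds them: UTF-8 encoded.
accented_term = "caf\u00e9".encode("utf-8")
plain_term = "noir".encode("utf-8")

# é is the single byte 0xE9 in the data but the two bytes 0xC3 0xA9
# in the search term, so the accented term cannot match.
accented_found = accented_term in collection_data
plain_found = plain_term in collection_data

print(accented_found)
print(plain_found)
```

Unaccented terms consist only of ASCII bytes, which are identical in Latin-1 and UTF-8, so they still match.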
