Unicode in DLXS

This topic touches on XPAT indexing, data preparation and middleware configuration as they relate to Unicode in DLXS.

Unicode in General
DLXS data preparation and Unicode
Unicode XPAT indexing
Middleware configuration for Unicode

Documentation on this topic can also be found at: http://docs.dlxs.org/class/dlxs-unicode.xml

Unicode in General

Talk about characters is often imprecise. The term "character set" is best avoided because it blurs the distinct concepts defined below.

Definitions

Character Repertoire - collection of abstract characters independent of how they look when printed.
Coded Character Set - assignment of a unique number to each character in a Character Repertoire.
Code Point - the unique number assigned to a character in a Coded Character Set.
ISO/IEC 10646 - The Coded Character Set for Unicode. The Unicode standard specifies additional properties for each character.
Character Encoding Scheme - or Encoding for short; specifies how the number assigned to a character is stored in a file or in computer memory.
- Examples: UTF-8, UTF-16.
- UTF-8 is a multi-byte, variable-length encoding (one to four bytes per character). A byte is a whole character only for ASCII.
- ASCII text is valid UTF-8 by design.
- Basic Multilingual Plane (BMP): the first 65,536 code points (U+0000 to U+FFFF), divided into blocks of varying size, largely by script.
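
The distinction between a code point and its encoded bytes can be illustrated with a short Python sketch. The variable names are only for illustration; the character is the same LATIN CAPITAL LETTER A WITH GRAVE used as an example later in this document.

```python
import unicodedata

ch = "\u00C0"                           # one abstract character
name = unicodedata.name(ch)             # its name in the Unicode character database
code_point = f"U+{ord(ch):04X}"         # the number assigned by the coded character set
utf8_bytes = ch.encode("utf-8")         # how UTF-8 stores that number
utf16_bytes = ch.encode("utf-16-be")    # how UTF-16 (big-endian, no BOM) stores it
ascii_same = "A".encode("ascii") == "A".encode("utf-8")  # ASCII text is valid UTF-8

print(name, code_point, utf8_bytes.hex(), utf16_bytes.hex(), ascii_same)
```

One character, one code point, but two bytes in UTF-8: this is why byte counts and character counts differ outside ASCII.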

DLXS multi-lingual character support before Unicode

The ASCII encoding supports only 128 characters. The ISO-8859-* encodings support 256 characters, but only one set of 256 characters at a time. Latin2 covers German and Polish. Latin5 covers German and Turkish. There is no single-byte encoding covering German and Russian.
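
As a rough illustration of this limitation, the following Python sketch tries to encode a string mixing German and Russian in several encodings. The sample string and the helper name `can_encode` are made up for this example.

```python
text = "Grüße, Москва"   # German and Russian in one string

def can_encode(s, encoding):
    """Return True if every character of s exists in the given encoding."""
    try:
        s.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

encodings = ("ascii", "iso-8859-1", "iso-8859-2", "iso-8859-5", "utf-8")
results = {enc: can_encode(text, enc) for enc in encodings}

for enc, ok in results.items():
    print(f"{enc:12} {'ok' if ok else 'cannot represent this text'}")
```

Of the encodings tried here, only UTF-8 can represent the whole string.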

Use of .gif images of characters and <img src="Agrave.gif"> tags.
- Not font dependent.
- Not searchable, not scalable, slow. A lot of work to generate.
Character Entity References (CER), e.g. &Agrave; in SGML for LATIN CAPITAL LETTER A WITH GRAVE
- Font dependent.
- Not searchable. Not easily entered by the user. Not XML. Less browser support than for NCR.
Numeric character references (NCR), e.g. &#x00C0;, which refers to U+00C0, the Unicode Code Point for LATIN CAPITAL LETTER A WITH GRAVE
- Font dependent.
- Not searchable. Not easily entered by the user.
Conversion of NCR and CER to ISO-8859-1 encoding.
- Font dependent.
- Searchable. Easily entered by the user via XPAT mapping functionality.
- Limited to one alphabet per document.
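
To make the relationship between CER, NCR and actual characters concrete, here is a minimal Python sketch that resolves both kinds of reference to the same character. It uses the standard library's `html.unescape`, which knows the HTML entity names and may not match the SGML entity sets in your data; it is not the DLXS conversion tooling.

```python
import html

samples = ["&Agrave;", "&#x00C0;", "&#192;"]   # CER, hexadecimal NCR, decimal NCR
decoded = [html.unescape(s) for s in samples]

for ref, ch in zip(samples, decoded):
    # Show the reference, the resolved code point and its UTF-8 bytes.
    print(f"{ref:10} U+{ord(ch):04X} {ch.encode('utf-8').hex()}")
```

All three references resolve to U+00C0. Once the data holds the character itself rather than a reference to it, the text can be indexed and searched directly.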

Reasons to use Unicode

See previous section.
Can represent more than one alphabet in a single document or web page.
Searchable. (xpatu)
Programming is simpler.
Latin characters can be easily entered by users via XPAT mapping functionality.
Non-Latin characters can be entered by users via national keyboards, virtual keyboards, IMEs, copy/paste.
Can be collated.
Fundamental to XML.
Still font dependent.


DLXS data preparation and Unicode

Tools for working with Unicode are getting better and more plentiful.

Terminal emulators

The Ã© (instead of é) problem: UTF-8 bytes displayed by a terminal that assumes a single-byte encoding such as ISO-8859-1.
Linux
- GNOME terminal
- xterm -u8 -fn '-misc-fixed-medium-r-semicondensed--13-120-75-75-c-60-iso10646-1'
- Bitstream Cyberbit and MS Arial Unicode fonts
Windows
- PuTTY with Hummingbird Exceed X Server version 8 or higher on Windows
- MS Arial Unicode
- XMLSpy
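
The Ã© problem listed under terminal emulators can be reproduced in a few lines of Python. This sketch only demonstrates the mechanism; it says nothing about how any particular terminal is configured.

```python
original = "é"
utf8_bytes = original.encode("utf-8")          # two bytes: C3 A9

# A terminal that assumes ISO-8859-1 shows each byte as its own character.
misread = utf8_bytes.decode("iso-8859-1")

# Reversing the wrong decoding recovers the original character.
repaired = misread.encode("iso-8859-1").decode("utf-8")

print(utf8_bytes.hex(), misread, repaired)
```

Seeing two accented Latin characters where one was expected is a strong hint that the data is UTF-8 and the display is not.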

Tools

The goal is to get your data into UTF-8 encoded XML. You need to know how characters in your data have been encoded in order to transform to another encoding.

iconv -c -f ISO-8859-1 -t UTF-8 -o outfile infile
DLXSROOT/bin/t/text/ncr2utf8
DLXSROOT/bin/t/text/isocer2utf8
OpenSP osx
XMLSpy
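
For readers without iconv at hand, the transcoding step can be approximated in Python. This is a sketch under stated assumptions: the function name `transcode` is hypothetical, the source encoding must be known in advance, and `errors="ignore"` only roughly mirrors the behavior of iconv's -c flag by dropping characters that cannot be written in the target encoding.

```python
import os
import tempfile

def transcode(infile, outfile, from_enc="iso-8859-1", to_enc="utf-8"):
    """Read infile as from_enc and write the same text to outfile as to_enc."""
    with open(infile, "r", encoding=from_enc, newline="") as src, \
         open(outfile, "w", encoding=to_enc, errors="ignore", newline="") as dst:
        for line in src:
            dst.write(line)

# Demonstration on a temporary file holding ISO-8859-1 bytes.
workdir = tempfile.mkdtemp()
infile = os.path.join(workdir, "infile")
outfile = os.path.join(workdir, "outfile")
with open(infile, "wb") as f:
    f.write(b"caf\xe9\n")                      # "café" in ISO-8859-1

transcode(infile, outfile)

with open(outfile, "rb") as f:
    converted = f.read()
print(converted)
```

If the declared source encoding is wrong, the output will still be valid UTF-8 but will contain the wrong characters, so check a sample of the result by eye.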


Unicode XPAT indexing

Indexing is covered more fully in the discussion of data preparation. This section only highlights some differences in programs and processes.

xpatbldu, xpatu and xmlrgn are used instead of xpatbld, xpat and sgmlrgn.
Sample Makefile in DLXSROOT/bin/s/sampletc_utf8
The XML declaration <?xml version="1.0" encoding="UTF-8"?> is important for xmlrgn.
Index point meta characters (e.g. &Greek. or &Cyrillic.) are based on the Unicode block definitions and the Perl Unicode library (e.g. lib/5.8.3/unicore/lib/Latin.pl), modified as described in the XPAT data dictionary document.
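
Before indexing, it can be useful to confirm that a file both carries the XML declaration and really is UTF-8. The following Python sketch is one way to do that; the function name `check_utf8_xml` and the strictness of the pattern are assumptions for illustration, and a file starting with a byte order mark would not match this pattern.

```python
import re

# Matches an XML declaration that names UTF-8 (case-insensitive) at the start of the data.
XML_DECL = re.compile(rb'^<\?xml\s+version="1\.0"\s+encoding="(?i:utf-8)"\s*\?>')

def check_utf8_xml(data):
    """Return (has_declaration, is_valid_utf8) for a bytes object."""
    has_decl = XML_DECL.match(data) is not None
    try:
        data.decode("utf-8")
        valid = True
    except UnicodeDecodeError:
        valid = False
    return has_decl, valid

good = b'<?xml version="1.0" encoding="UTF-8"?>\n<a>\xc3\xa9</a>'
bad = b'<a>\xe9</a>'                  # Latin-1 byte, no declaration

print(check_utf8_xml(good), check_utf8_xml(bad))
```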

Specify index points in the data dictionary (.dd file) based on the alphabets in your data.

            <IndexPoints>
            <IndexPt> &printable.</IndexPt>
            <IndexPt>&printable.-</IndexPt>
            <IndexPt>-&printable.</IndexPt>
            <IndexPt>&printable.<.</IndexPt>
            <IndexPt>&printable.&.</IndexPt>

            <IndexPt> &Latin.</IndexPt>
            <IndexPt>&Latin.-</IndexPt>
            <IndexPt>-&Latin.</IndexPt>
            <IndexPt>&Latin.<.</IndexPt>

            <IndexPt> &Greek.</IndexPt>
            <IndexPt>&Greek.-</IndexPt>
            <IndexPt>-&Greek.</IndexPt>
            <IndexPt>&Greek.<.</IndexPt>
            </IndexPoints>

Also specify character mappings in the data dictionary, based on the characters that occur in your data. Mappings use U+XXXX notation; refer to the Unicode character database for the code points. This is mainly for case mapping in alphabets that have case.

            ...
            <Map><From>U+0391</From><To>U+03B1</To></Map>
            <Map><From>U+0392</From><To>U+03B2</To></Map>
            <Map><From>U+0393</From><To>U+03B3</To></Map>
            <Map><From>U+0394</From><To>U+03B4</To></Map>
            <Map><From>U+0395</From><To>U+03B5</To></Map>
            ...
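
Writing these mappings by hand is tedious, so a script can draft them from the Unicode character database. The Python sketch below emits lines in the same form as the excerpt above for the basic Greek capitals. The helper name `case_maps` is hypothetical, and the output should be reviewed against your data rather than pasted in blindly.

```python
import unicodedata

def case_maps(first, last):
    """Yield <Map> lines for the uppercase letters in a code point range."""
    for cp in range(first, last + 1):
        ch = chr(cp)
        lower = ch.lower()
        # Keep only uppercase letters with a distinct one-character lowercase form.
        if unicodedata.category(ch) == "Lu" and len(lower) == 1 and lower != ch:
            yield f"<Map><From>U+{cp:04X}</From><To>U+{ord(lower):04X}</To></Map>"

# GREEK CAPITAL LETTER ALPHA through OMEGA; unassigned code points are skipped.
greek = list(case_maps(0x0391, 0x03A9))
print("\n".join(greek[:5]))
```

The same function can be pointed at other ranges, such as the Cyrillic capitals, when those scripts occur in the data.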

Run the Makefile.

DLPS production example

OAIster is 100% UTF-8 encoded XML, indexed by xpatbldu and multirgn and searched using xpatu.
Supports Latin, Greek, Cyrillic, Han, Hiragana, Katakana and Hangul characters.
Highlighting based on .dd file character mappings.
OAIster data dictionary


Middleware configuration for Unicode

The pre-Unicode XPAT version is 5.2.3. The first Unicode aware XPAT version is 5.3.0.

5.3 XPAT can read 5.2 indexes, i.e. 5.3 is backward compatible
5.2 XPAT cannot read 5.3 indexes

Configuration and requirements.

Perl 5.8.3 or higher.
In the collection manager (collmgr) the locale field should be set to en_US.UTF-8. Any value not including "UTF-8" means the middleware will assume Latin1 encoding and will use xpat instead of xpatu to read the index.
The middleware automatically changes the charset in <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"> to UTF-8. This tag tells the browser what encoding to use for the page.
The middleware does not support collections with different character encodings in cross-collection mode. For coherent results, all collections searched together must share a single encoding, either all UTF-8 or all ISO-8859-1. If the collections are a mix of UTF-8 and ISO-8859-1, they will all be treated as ISO-8859-1 in cross-collection mode (with predictably strange results). Under these circumstances, UTF-8 encoded collections should be offered in single-collection mode only.
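
The locale and cross-collection rules above can be summarized in a short Python sketch. This is only a model of the behavior described here, not the middleware's actual code (which is Perl); all three function names are hypothetical.

```python
def uses_utf8(locale):
    """A collection counts as Unicode only if its locale value includes "UTF-8"."""
    return "UTF-8" in locale

def xpat_program(locale):
    """Pick the search program implied by a collection's locale."""
    return "xpatu" if uses_utf8(locale) else "xpat"

def cross_collection_encoding(locales):
    """A mix of encodings is treated as ISO-8859-1 in cross-collection mode."""
    return "UTF-8" if all(uses_utf8(loc) for loc in locales) else "ISO-8859-1"

print(xpat_program("en_US.UTF-8"))
print(xpat_program("en_US"))
print(cross_collection_encoding(["en_US.UTF-8", "en_US"]))
```

The last call shows why a single ISO-8859-1 collection in a cross-collection search is enough to make the UTF-8 collections display incorrectly.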
