Unicode in DLXS
This topic touches on XPAT indexing, data preparation and middleware configuration as they relate to Unicode in DLXS.
Documentation on this topic can also be found at: http://docs.dlxs.org/class/dlxs-unicode.xml
Unicode in General
There is a lot of fuzziness in talk about characters. "Character set" considered harmful: the term blurs the distinct concepts listed below.
- Character Repertoire - collection of abstract characters independent of how they look when printed.
- Coded Character Set - assignment of a unique number to each character in a Character Repertoire.
- Code Point - the unique number assigned to a character in a Coded Character Set.
- ISO/IEC 10646 - The Coded Character Set for Unicode. The Unicode standard specifies additional properties for each character.
- Character Encoding Scheme - or Encoding for short, specifies how the number assigned to a character is stored in a file or in computer memory.
- UTF-8, UTF-16
- UTF-8 is a multi-byte variable length encoding. A byte is not a character, mostly.
- ASCII is a subset of UTF-8 by design: a pure ASCII file is already valid UTF-8.
- Basic Multilingual Plane (BMP): the first 65,536 code points, divided into blocks of varying size by alphabet.
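The difference between a code point and its encoded bytes can be seen in a few lines of Python. This is only an illustrative sketch; the sample characters and the helper name `describe` are assumptions chosen for the example, not part of DLXS.

```python
# Code points versus encoded bytes, using only the standard library.
samples = ["A", "À", "α", "語"]

def describe(ch):
    """Return (code point as U+XXXX, UTF-8 byte length, UTF-16 byte length)."""
    code_point = "U+%04X" % ord(ch)
    utf8_len = len(ch.encode("utf-8"))
    utf16_len = len(ch.encode("utf-16-be"))  # big-endian, no byte order mark
    return code_point, utf8_len, utf16_len

for ch in samples:
    # ascii() keeps the output printable on terminals that are not UTF-8 aware.
    print(ascii(ch), describe(ch))

# An ASCII-only string has identical bytes in ASCII and in UTF-8.
assert "DLXS".encode("ascii") == "DLXS".encode("utf-8")
```

One character can occupy one, two or three bytes in UTF-8 here, which is why byte counts and character counts must not be confused.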
The ASCII encoding supports only 128 characters. The ISO-8859-* encodings support 256 characters but only one set of 256 characters at a time. Latin2 covers German and Polish. Latin5 covers German and Turkish. There is no single-byte encoding covering German and Russian.
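The limitation can be checked directly. The following sketch tries a handful of encodings against a string that mixes German and Russian; the sample text and the list of candidate encodings are assumptions made for illustration, and the list is not exhaustive.

```python
# Which encodings can represent a text that mixes German and Russian?
text = "Größe и размер"  # sample string chosen for illustration

candidates = ["ascii", "iso-8859-1", "iso-8859-2",
              "iso-8859-5", "iso-8859-9", "utf-8"]

def can_encode(s, encoding):
    """True if every character of s exists in the given encoding."""
    try:
        s.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

results = {enc: can_encode(text, enc) for enc in candidates}
for enc, ok in results.items():
    print("%-11s %s" % (enc, "ok" if ok else "cannot represent the text"))
```

Each single-byte encoding handles one half of the string at best: ISO-8859-2 can encode the German word and ISO-8859-5 the Russian words, but neither can encode both.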
- Use of .gif images of characters and <img src="Agrave.gif"> tags.
- Not font dependent.
- Not searchable, not scalable, slow. A lot of work to generate.
- Character Entity References (CER), e.g. &Agrave; in SGML for LATIN CAPITAL LETTER A WITH GRAVE
- Font dependent.
- Not searchable. Not easily entered by the user. Not XML. Less support in browsers than for NCR.
- Numeric character references (NCR), e.g. &#xC0;, where C0 is the hexadecimal Unicode Code Point for LATIN CAPITAL LETTER A WITH GRAVE
- Font dependent.
- Not searchable. Not easily entered by the user.
- Convert NCR and CER to iso8859-1 encoding.
- Font dependent.
- Searchable. Easily entered by the user via XPAT mapping functionality.
- Limited to one alphabet per document.
- See previous section.
- Can represent more than one alphabet in a single document or web page.
- Searchable. (xpatu)
- Programming is simpler.
- Latin characters can be easily entered by users via XPAT mapping functionality.
- Non-Latin characters can be entered by users via national keyboards, virtual keyboards, IMEs, copy/paste.
- Can be collated.
- Fundamental to XML.
- Still font dependent.
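Why references are "not searchable" while real characters are can be sketched as follows. The sample strings are assumptions for illustration, and Python's `html.unescape` stands in for whatever resolves the references; it is not a DLXS tool.

```python
import html

needle = "À"  # U+00C0, as a user would type or paste it

as_cer = "&Agrave; la carte"   # character entity reference
as_ncr = "&#xC0; la carte"     # numeric character reference
as_text = "À la carte"         # the character itself

def found(haystack):
    """Plain substring search, the way a naive index lookup would behave."""
    return needle in haystack

print(found(as_cer), found(as_ncr), found(as_text))

# Once the references are resolved, all three forms are the same string.
assert html.unescape(as_cer) == as_text
assert html.unescape(as_ncr) == as_text
```

Only the form that stores the character itself matches the search; the two reference forms match only after they have been resolved.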
DLXS data preparation and Unicode
Tools are getting better and more plentiful every day.
- The Ã© (instead of é) problem: UTF-8 bytes displayed by a tool that assumes ISO-8859-1.
- Linux
- GNOME terminal
- xterm -u8 -fn '-misc-fixed-medium-r-semicondensed--13-120-75-75-c-60-iso10646-1'
- Bitstream Cyberbit and MS Arial Unicode fonts
- Windows
- PuTTY with Hummingbird Exceed X Server version 8 or higher on Windows
- MS Arial Unicode
- XMLSpy
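The garbled-accent problem mentioned in the list above is easy to reproduce. This sketch assumes the usual cause, namely UTF-8 bytes being interpreted as ISO-8859-1 by a terminal or editor that is not Unicode aware.

```python
original = "é"                               # U+00E9
utf8_bytes = original.encode("utf-8")        # two bytes: 0xC3 0xA9

# A tool that assumes ISO-8859-1 shows each byte as a separate character.
misread = utf8_bytes.decode("iso-8859-1")

# The damage is reversible as long as the bytes themselves were not altered.
repaired = misread.encode("iso-8859-1").decode("utf-8")

print(ascii(original), utf8_bytes, ascii(misread), ascii(repaired))
```

Seeing two characters where one accented letter is expected is therefore usually a display problem rather than a data problem.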
The goal is to get your data into UTF-8 encoded XML. You need to know
how characters in your data have been encoded in order to transform to
another encoding.
- iconv -c -f ISO-8859-1 -t UTF-8 -o outfile infile
- DLXSROOT/bin/t/text/ncr2utf8
- DLXSROOT/bin/t/text/isocer2utf8
- OpenSP osx
- XMLSpy
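As a rough illustration of what such conversions involve, the sketch below transcodes ISO-8859-1 bytes to UTF-8 and expands numeric character references. It is not the implementation of `iconv`, `ncr2utf8` or `isocer2utf8`; the function names are hypothetical, and the decision to leave references to `<`, `>` and `&` untouched is an assumption made so that the output stays well-formed XML.

```python
import re

def latin1_to_utf8(data: bytes) -> bytes:
    """Transcode bytes known to be ISO-8859-1 into UTF-8."""
    return data.decode("iso-8859-1").encode("utf-8")

# Matches hexadecimal (&#xC0;) and decimal (&#192;) references.
_NCR = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")
_KEEP_ESCAPED = {"<", ">", "&"}  # expanding these would break the markup

def expand_ncrs(text: str) -> str:
    """Replace numeric character references with the characters they name."""
    def replace(match):
        hex_digits, dec_digits = match.groups()
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        try:
            ch = chr(code_point)
        except (ValueError, OverflowError):
            return match.group(0)  # not a valid code point: leave as is
        return match.group(0) if ch in _KEEP_ESCAPED else ch
    return _NCR.sub(replace, text)

print(latin1_to_utf8(b"caf\xe9"))
print(ascii(expand_ncrs("&#xC0; &#192; &#60;b&#62; &amp;")))
```

The transcoding step only gives correct results when the source encoding really is ISO-8859-1, which is why knowing the encoding of the input comes first.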
Unicode XPAT indexing
More when we talk about data preparation more fully. For now this is just to highlight some differences in programs and processes.
- xpatbldu, xpatu and xmlrgn are used instead of xpatbld, xpat and sgmlrgn.
- Sample Makefile in DLXSROOT/bin/s/sampletc_utf8
- <?xml version="1.0" encoding="UTF-8"?> Important for xmlrgn
- Index point meta characters (e.g. &Greek. or &Cyrillic.) are based on the Unicode block definitions and the Perl Unicode library (e.g. lib/5.8.3/unicore/lib/Latin.pl), modified as described in the XPAT data dictionary document.
- Specify index points in the data dictionary (.dd file) based on the alphabets in your data.
<IndexPoints>
<IndexPt> &printable.</IndexPt>
<IndexPt>&printable.-</IndexPt>
<IndexPt>-&printable.</IndexPt>
<IndexPt>&printable.&lt.</IndexPt>
<IndexPt>&printable.&amp.</IndexPt>
<IndexPt> &Latin.</IndexPt>
<IndexPt>&Latin.-</IndexPt>
<IndexPt>-&Latin.</IndexPt>
<IndexPt>&Latin.&lt.</IndexPt>
<IndexPt> &Greek.</IndexPt>
<IndexPt>&Greek.-</IndexPt>
<IndexPt>-&Greek.</IndexPt>
<IndexPt>&Greek.&lt.</IndexPt>
</IndexPoints>
- Specify character mappings in the data dictionary as well, based on the characters that occur in your data. Code points are written in U+XXXX notation; refer to the Unicode Character Database for the values. The mappings are mainly used for case mapping in alphabets that have case.
...
<Map><From>U+0391</From><To>U+03B1</To></Map>
<Map><From>U+0392</From><To>U+03B2</To></Map>
<Map><From>U+0393</From><To>U+03B3</To></Map>
<Map><From>U+0394</From><To>U+03B4</To></Map>
<Map><From>U+0395</From><To>U+03B5</To></Map>
...
- Run Makefile
- OAIster is 100% UTF-8 encoded XML indexed by xpatbldu and multirgn and searched using xpatu.
- Supports Latin, Greek, Cyrillic, Han, Hiragana, Katakana and Hangul characters.
- Highlighting based on .dd file character mappings.
- OAIster data dictionary
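Case-mapping entries of the kind shown in the data dictionary excerpt above need not be typed by hand. The following sketch generates them from Python's built-in case tables; the function name `case_map_entries` is hypothetical, and Python's tables may not match the Unicode Character Database version you work from, so the output should be reviewed before it goes into a .dd file.

```python
def case_map_entries(first, last):
    """Build <Map> lines mapping upper-case code points to lower case.

    Only simple one-to-one mappings are emitted; code points without a
    distinct single-character lower-case form are skipped.
    """
    entries = []
    for cp in range(first, last + 1):
        upper = chr(cp)
        lower = upper.lower()
        if len(lower) == 1 and lower != upper:
            entries.append(
                "<Map><From>U+%04X</From><To>U+%04X</To></Map>"
                % (cp, ord(lower))
            )
    return entries

# Greek capital letters alpha through epsilon.
for line in case_map_entries(0x0391, 0x0395):
    print(line)
```

For the range U+0391 to U+0395 this yields the same five lines as the excerpt.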
Middleware configuration for Unicode
The pre-Unicode XPAT version is 5.2.3. The first Unicode aware XPAT version is 5.3.0.
- 5.3 XPAT can read 5.2 indexes, i.e. 5.3 is backward compatible
- 5.2 XPAT cannot read 5.3 indexes
Configuration and requirements.
- Perl 5.8.3 or higher.
- In the collection manager (collmgr) the locale field should be set to en_US.UTF-8. Any value not including "UTF-8" means the middleware will assume Latin1 encoding and will use xpat instead of xpatu to read the index.
- The charset in the <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"> tag is automatically changed to UTF-8 by the middleware. This tag tells the browser what encoding to use for the page.
- The middleware does not support collections with different character encodings in cross-collection mode. For coherent results, collections must all be of a single encoding in cross-collection mode, either all UTF-8 or all ISO-8859-1. If collections exist in both UTF-8 and ISO-8859-1 they will be treated as ISO-8859-1 in cross-collection mode (with predictably strange results). UTF-8 encoded Unicode collections should be offered solely in single collection mode under these circumstances.
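The locale and cross-collection rules above can be restated as a small decision sketch. This is not the middleware's code: both function names are hypothetical, and the case-sensitive substring test on "UTF-8" is an assumption based only on the wording of the rule.

```python
def xpat_binary_for(locale):
    """Pick the search binary from a collection's collmgr locale value."""
    # Assumption: the check is a plain, case-sensitive substring test.
    return "xpatu" if "UTF-8" in locale else "xpat"

def cross_collection_encoding(locales):
    """Encoding assumed when several collections are searched together."""
    # Mixed sets fall back to ISO-8859-1, as described in the rules above.
    if all("UTF-8" in loc for loc in locales):
        return "UTF-8"
    return "ISO-8859-1"

print(xpat_binary_for("en_US.UTF-8"))
print(xpat_binary_for("en_US"))
print(cross_collection_encoding(["en_US.UTF-8", "en_US"]))
```

The last call shows the case to avoid: one ISO-8859-1 collection in the set is enough for the UTF-8 collections to be read with the wrong encoding.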