Unicode in DLXS
This topic touches on XPAT indexing, data preparation, and middleware configuration and behavior as they relate to Unicode in DLXS.
Documentation on this topic can also be found at:
http://docs.dlxs.org/class/unicode.html (Release 11a)
http://www.dlxs.org/docs/12a/class/unicode.html (Release 12a)
Unicode in General
There is a lot of fuzziness in how people talk about characters; the term "character set" considered harmful. Some precise terms:
- Character Repertoire - collection of abstract characters independent of how they look when printed.
- Coded Character Set - assignment of a unique number to each character in a Character Repertoire.
- Code Point - the unique number assigned to a character in a Coded Character Set.
- ISO/IEC 10646 - The Coded Character Set for Unicode. The Unicode standard specifies additional properties for each character.
- Character Encoding Scheme - or Encoding for short, specifies how the number assigned to a character is stored in a file or in computer memory.
- UTF-8, UTF-16
- UTF-8 is a multi-byte, variable-length encoding. A byte is not, in general, a whole character.
- ASCII is valid UTF-8 by design: every pure-ASCII file is already a UTF-8 file.
- Basic Multilingual Plane (BMP): 65,536 code points, divided into blocks of varying size by alphabet/script.
Your data may be pure ASCII, which encodes 128 characters. The ISO-8859-* encodings support 256 characters each, but only one set of 256 at a time. Latin2 covers German and Polish; Latin5 covers German and Turkish. There is no single-byte encoding covering both German and Russian, for example.
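To see the limitation concretely, here is a short Python sketch (illustrative only, not part of DLXS) showing that German and Russian text cannot share any single-byte ISO-8859-* encoding, while UTF-8 holds both:

```python
german = "Müller"
russian = "Москва"

# Latin-1 can encode the German name but not the Russian one.
german.encode("iso-8859-1")            # works: u-umlaut is 0xFC
try:
    russian.encode("iso-8859-1")
except UnicodeEncodeError:
    print("ISO-8859-1 cannot encode Cyrillic")

# ISO-8859-5 (Cyrillic) has the opposite problem with the umlaut.
try:
    german.encode("iso-8859-5")
except UnicodeEncodeError:
    print("ISO-8859-5 cannot encode u-umlaut")

# UTF-8 encodes both, using one or more bytes per character.
both = (german + " " + russian).encode("utf-8")
print(len(both))  # 20 bytes for 13 characters
```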
Methods to represent characters from other alphabets:
- Character Entity References (CER), e.g. &Agrave; in SGML for LATIN CAPITAL LETTER A WITH GRAVE
- Not searchable. Not easily entered by the user. Not XML (entities beyond the five predefined ones must be declared). Less browser support than for NCRs.
- Numeric Character References (NCR), e.g. &#xC0;, where C0 is the Unicode Code Point for LATIN CAPITAL LETTER A WITH GRAVE
- Not searchable. Not easily entered by the user.
- Can represent more than one alphabet in a single document or web page.
- UTF-8 encoding of the characters themselves
- Searchable (xpatu).
- Programming is simpler.
- Latin characters can be easily entered by users via XPAT mapping functionality.
- Non-ASCII characters can be entered by users via national keyboards, virtual keyboards, IMEs, copy/paste.
- Can be collated.
- Fundamental to XML.
- Better font support than for character entity references.
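The contrast between an NCR and raw UTF-8 bytes can be sketched in a few lines of Python (illustrative only, not part of DLXS):

```python
# U+00C0 is LATIN CAPITAL LETTER A WITH GRAVE.
ch = "\u00C0"

# As a numeric character reference: the code point in hex, wrapped in &#x...;
ncr = "&#x{:X};".format(ord(ch))
print(ncr)            # &#xC0;

# As raw UTF-8: the same character becomes two bytes in the file itself.
utf8 = ch.encode("utf-8")
print(utf8)           # b'\xc3\x80'
```

The NCR is seven ASCII characters standing in for one character; the UTF-8 form is the character itself, which is why it can be indexed and searched directly.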
DLXS data preparation and Unicode
Tools are getting better and more plentiful every day.
- Do you see "é" when the file actually contains different bytes? Or: What You See Is Not (Always) What You Have (WYSINAWYH).
- Linux
- GNOME terminal
- xterm -u8 -fn '-misc-fixed-medium-r-semicondensed--13-120-75-75-c-60-iso10646-1'
- Bitstream Cyberbit and MS Arial Unicode fonts
- Windows
- PuTTY with Hummingbird Exceed X Server version 8 or higher on Windows
- MS Arial Unicode
- XMLSpy
The goal is to get your data into UTF-8 encoded XML. You need to know
how characters in your data have been encoded in order to transform to
another encoding.
- iconv -c -f ISO-8859-1 -t UTF-8 -o outfile infile
- DLXSROOT/bin/t/text/ncr2utf8
- xpatutf8check
- jhove -c /l/local/jhove/conf/jhove.conf -m utf8-hul
- DLXSROOT/bin/t/text/utf8chars
- OpenSP osx
- XMLSpy
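As a rough illustration of what a validity checker such as xpatutf8check does, here is a hypothetical Python stand-in (not the DLXS tool itself): strict decoding either passes or reports the offset of the first invalid byte.

```python
def check_utf8(data: bytes):
    """Return None if data is valid UTF-8, else the offset of the first bad byte."""
    try:
        data.decode("utf-8", errors="strict")
        return None
    except UnicodeDecodeError as e:
        return e.start

# A correctly encoded file passes.
assert check_utf8("déjà vu".encode("utf-8")) is None
# A stray Latin-1 byte (0xE9 for é) is flagged at its offset.
assert check_utf8(b"caf\xe9") == 3
```

Tools like iconv -c silently drop such bytes; checking first tells you whether you need to transcode at all, and from which source encoding.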
Unicode XPAT indexing
More on this when we cover data preparation more fully. For now, this just highlights some differences in programs and processes.
- xpatbldu, xpatu and xmlrgn are used instead of xpatbld, xpat and sgmlrgn.
- Sample Makefile in DLXSROOT/bin/s/sampletc_utf8
- <?xml version="1.0" encoding="UTF-8"?> Important for xmlrgn
- Index point meta characters (e.g. &Greek. or &Cyrillic.) are based on the Unicode block definitions and the Perl Unicode library (e.g. lib/5.8.3/unicore/lib/Latin.pl), modified as described in the XPAT data dictionary document.
- Specify index points in the data dictionary (.dd file) based on the alphabets in your data.
<IndexPoints>
<IndexPt> &printable.</IndexPt>
<IndexPt>&printable.-</IndexPt>
<IndexPt>-&printable.</IndexPt>
<IndexPt>&printable.<.</IndexPt>
<IndexPt>&printable.&.</IndexPt>
<IndexPt> &Latin.</IndexPt>
<IndexPt>&Latin.-</IndexPt>
<IndexPt>-&Latin.</IndexPt>
<IndexPt>&Latin.<.</IndexPt>
<IndexPt> &Greek.</IndexPt>
<IndexPt>&Greek.-</IndexPt>
<IndexPt>-&Greek.</IndexPt>
<IndexPt>&Greek.<.</IndexPt>
</IndexPoints>
- Specify character mappings in the data dictionary, also based on the characters that occur in your data. Note the U+XXXX notation; refer to the Unicode Character Database. This is mainly for case mapping in alphabets that have case.
...
<Map><From>U+0391</From><To>U+03B1</To></Map>
<Map><From>U+0392</From><To>U+03B2</To></Map>
<Map><From>U+0393</From><To>U+03B3</To></Map>
<Map><From>U+0394</From><To>U+03B4</To></Map>
<Map><From>U+0395</From><To>U+03B5</To></Map>
...
- Run Makefile
- OAIster is 100% UTF-8 encoded XML, indexed by xpatbldu and multirgn and searched using xpatu.
- Supports Latin, Greek, Cyrillic, Han, Hiragana, Katakana and Hangul characters.
- Highlighting based on .dd file character mappings.
- OAIster data dictionary
- The workshop example is 100% UTF-8 encoded XML containing English, French and Greek, indexed by xpatbldu and xmlrgn and searched using xpatu. Wordwheel?
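A hedged Python sketch (illustrative only, not DLXS code) of what the Greek <Map> entries above accomplish: folding capitals U+0391.. to lowercase U+03B1.. so a search matches either case.

```python
# The first few mappings from the data dictionary, as code point pairs.
greek_map = {0x0391: 0x03B1, 0x0392: 0x03B2, 0x0393: 0x03B3,
             0x0394: 0x03B4, 0x0395: 0x03B5}

def fold(text):
    """Apply the case map; characters without a mapping pass through."""
    return "".join(chr(greek_map.get(ord(c), ord(c))) for c in text)

print(fold("\u0391\u0392\u0393"))   # ΑΒΓ folds to αβγ
```

XPAT applies the equivalent mapping at index and query time, which is also what drives hit highlighting across case variants.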
Middleware configuration, requirements and behavior for Unicode
XPAT version 5.3.2
- 5.3 XPAT can read 5.2 indexes, i.e. 5.3 is backward compatible
- 5.2 XPAT cannot read 5.3 indexes
Perl 5.8.3 or higher is required. 5.8.8 is better. Avoid 5.8.6 (debugger problem).
Configuration and behavior
To make legacy Latin-1 encoded SGML data work:
- The collection manager (collmgr) locale field should be set to en_US to use xpat instead of xpatu to read the index.
- If there are character entity references like "&eacute;", declare them in the DLXSROOT/web/(c)/(collection)/entitiesdoctype.chnk file (copied from DLXSROOT/misc/sgml), if not already present in that file.
The basic assumption is that ANY input (user-typed or search results from XPAT) is UTF-8 encoded XML. Why? How? From what encoding?
- User input that is not valid UTF-8 will be transcoded into UTF-8, and reserved characters are turned into character entity references like &amp;. Why? What effect on searching for tags? debug=qmap
- Search results from XPAT are processed through DlpsUtils::Sgml2XmlFilter to transcode into UTF-8 and to change SGML-style singletons (e.g. <LB>) to XML-style singletons (e.g. <LB/>).
Downside: Searches for accented characters will fail in Latin-1 collections because the user's search term will be converted to UTF-8 but the collection data will be Latin-1. Unaccented searches will still work.
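The failure is visible at the byte level; a small Python sketch (illustrative only, not DLXS code):

```python
# The user's "é" arrives as UTF-8 (two bytes), but a Latin-1 collection
# stores it as one byte, so the sequences never match in the index.
utf8_bytes = "é".encode("utf-8")          # b'\xc3\xa9'
latin1_bytes = "é".encode("iso-8859-1")   # b'\xe9'
print(utf8_bytes != latin1_bytes)          # True: no match at the byte level

# Unaccented terms are pure ASCII, which is byte-identical in both encodings.
print("cafe".encode("utf-8") == "cafe".encode("iso-8859-1"))  # True
```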
All XML templates have <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> elements to ensure user input is UTF-8 and to tell the browser to use UTF-8 encoding when rendering the page content.
The middleware supports collections with different character encodings in cross-collection mode. This is possible because of the Latin-1 -> UTF-8 transcoding on both input and output.