Unicode in DLXS
This topic touches on XPAT indexing, data preparation and middleware configuration and behavior as they relate to Unicode in DLXS.
Documentation on this topic can also be found at:
http://www.dlxs.org/docs/13/class/unicode.html
Unicode and Character Sets
Your data may be pure ASCII, which defines 128 characters (code points 0 to 127). The ISO-8859-* encodings each support 256 characters, but only one set of 256 characters at a time. Latin2 covers German and Polish. Latin5 covers German and Turkish. There is no single-byte encoding covering both German and Russian, for example.
Methods to represent characters from other alphabets:
- Character Entity References (CER), e.g. &Agrave; in SGML for LATIN CAPITAL LETTER A WITH GRAVE
- Not searchable. Not easily entered by the user. Not XML. Less support in browsers than for NCR
- Numeric character references (NCR), e.g. &#xC0; where C0 is the hexadecimal Unicode Code Point for LATIN CAPITAL LETTER A WITH GRAVE
- Not searchable. Not easily entered by the user.
- Character Repertoire - collection of abstract characters independent of how they look when printed.
- Coded Character Set - assignment of a unique number to each character in a Character Repertoire.
- Code Point - the unique number assigned to a character in a Coded Character Set.
- ISO/IEC 10646 - The Coded Character Set for Unicode. The Unicode standard specifies additional properties for each character.
- Character Encoding Scheme - or Encoding for short, specifies how the number assigned to a character is stored in a file or in computer memory.
- UTF-8, UTF-16
- UTF-8 is a multi-byte variable length encoding. A byte is not a character, except for code points 0 to 127.
- ASCII text is valid UTF-8 by design
- Basic multilingual plane. 65536 characters. Divided into blocks of varying size by alphabet.
- Can represent more than one alphabet in a single document or web page.
- Searchable. (xpatu)
- Programming is simpler.
- Users can enter unaccented Latin characters and get results for accented Latin characters via XPAT mapping functionality.
- Non-ASCII characters can be entered by users via national keyboards, virtual keyboards, IMEs, copy/paste.
- Can be collated.
- Fundamental to XML.
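The distinction between a code point and its encoded bytes can be made concrete with a short Python sketch. It is an illustration only and not part of DLXS; the helper name `describe` and the sample characters are assumptions chosen for this example.

```python
# Code point versus UTF-8 bytes for a few sample characters.
# "\u00C0" is LATIN CAPITAL LETTER A WITH GRAVE, "\u03B1" is GREEK SMALL
# LETTER ALPHA, "\u4E2D" is a Han character.
samples = ["A", "\u00C0", "\u03B1", "\u4E2D"]

def describe(ch):
    """Return (U+XXXX notation, number of UTF-8 bytes) for one character."""
    encoded = ch.encode("utf-8")
    return "U+%04X" % ord(ch), len(encoded)

for ch in samples:
    code_point, nbytes = describe(ch)
    # Print the code point, the byte count and the bytes in hexadecimal.
    print(code_point, nbytes, ch.encode("utf-8").hex())

# ASCII text is byte-for-byte identical whether encoded as ASCII or UTF-8.
assert "DLXS".encode("ascii") == "DLXS".encode("utf-8")
```

Only the characters in the ASCII range occupy a single byte; the others take two or three bytes each, which is why a byte cannot be treated as a character in UTF-8 data.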
DLXS data preparation and Unicode
Tools are getting better, more plentiful.
- Do you see Ã© instead of é? Then UTF-8 encoded text is being displayed as if it were Latin-1.
- Linux
- GNOME terminal
- yudit editor
- XEmacs editor (caveats)
- xterm -u8 -fn '-misc-fixed-medium-r-semicondensed--13-120-75-75-c-60-iso10646-1'
- Bitstream Cyberbit and MS Arial Unicode fonts
- Windows
- PuTTY terminal with Hummingbird Exceed X Server version 8 or higher on Windows
- MS Arial Unicode
- XMLSpy
The goal is to get your data into UTF-8 encoded XML. You need to know
how characters in your data have been encoded in order to transform to
another encoding.
- iconv -c -f ISO-8859-1 -t UTF-8 -o outfile infile
iconv -l shows a list of supported encodings
- DLXSROOT/bin/t/text/ncr2utf8
- xpatutf8check
- jhove -c /l/local/jhove/conf/jhove.conf -m utf8-hul
- DLXSROOT/bin/t/text/utf8chars
- OpenSP osx
- XMLSpy
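For readers who want to see what the transcoding and checking steps amount to, here is a minimal Python sketch. It is not the implementation of iconv, ncr2utf8 or xpatutf8check; the function names are hypothetical, and the choice to leave references to markup characters untouched is an assumption made so the output stays well-formed XML.

```python
import re

def latin1_to_utf8(raw: bytes) -> bytes:
    """Transcode ISO-8859-1 bytes to UTF-8 bytes (what the iconv call above does)."""
    return raw.decode("iso-8859-1").encode("utf-8")

def is_valid_utf8(raw: bytes) -> bool:
    """Report whether the bytes form a valid UTF-8 sequence."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# Matches hexadecimal (&#xC0;) and decimal (&#192;) numeric character references.
_NCR = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")

def resolve_ncrs(text: str) -> str:
    """Replace numeric character references with the characters they name."""
    def repl(m):
        cp = int(m.group(1), 16) if m.group(1) else int(m.group(2))
        # Leave out-of-range values, surrogates and markup characters as they are.
        if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF or chr(cp) in "<>&":
            return m.group(0)
        return chr(cp)
    return _NCR.sub(repl, text)

print(latin1_to_utf8(b"caf\xe9"))
print(is_valid_utf8(b"caf\xe9"))
print(resolve_ncrs("&#xC0; la carte"))
```

Checking validity before and after conversion is a cheap way to catch data whose source encoding was guessed wrongly, although valid UTF-8 does not prove that the right source encoding was assumed.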
Unicode XPAT indexing
This applies to the XPAT-based classes: TextClass, FindaidClass and BibClass. ImageClass is MySQL-based. More on this when we cover data preparation for the classes in detail.
- xpatbldu, xpatu and xmlrgn are used instead of xpatbld, xpat and sgmlrgn.
- Sample Makefile in DLXSROOT/bin/s/sampletc_utf8
- <?xml version="1.0" encoding="UTF-8"?> Important for xmlrgn
- Index point meta characters (e.g. &Greek. or &Cyrillic.) are based on the Unicode block definitions, Perl unicode lib (e.g. lib/5.8.3/unicore/lib/Latin.pl) and modified as described in the XPAT data dictionary document.
- Specify index points in the data dictionary (.dd file) based on the alphabets in your data.
<IndexPoints>
<IndexPt> &printable.</IndexPt>
<IndexPt>&printable.-</IndexPt>
<IndexPt>-&printable.</IndexPt>
<IndexPt>&printable.<.</IndexPt>
<IndexPt>&printable.&.</IndexPt>
<IndexPt> &Latin.</IndexPt>
<IndexPt>&Latin.-</IndexPt>
<IndexPt>-&Latin.</IndexPt>
<IndexPt>&Latin.<.</IndexPt>
<IndexPt> &Greek.</IndexPt>
<IndexPt>&Greek.-</IndexPt>
<IndexPt>-&Greek.</IndexPt>
<IndexPt>&Greek.<.</IndexPt>
</IndexPoints>
- Specify character mappings in the data dictionary also based on the characters that occur in your data. Note U+XXXX notation. Refer to the Unicode character database. This is mainly for case mapping for alphabets that have case.
...
<Map><From>U+0391</From><To>U+03B1</To></Map>
<Map><From>U+0392</From><To>U+03B2</To></Map>
<Map><From>U+0393</From><To>U+03B3</To></Map>
<Map><From>U+0394</From><To>U+03B4</To></Map>
<Map><From>U+0395</From><To>U+03B5</To></Map>
...
- OAIster is 100% UTF-8 encoded XML indexed by xpatbldu and multirgn and searched using xpatu.
- Supports Latin, Greek, Cyrillic, Han, Hiragana, Katakana and Hangul characters.
- Highlighting based on .dd file character mappings.
- OAIster data dictionary
- Workshop example is 100% UTF-8 encoded XML containing English, French and Greek and indexed by xpatbldu and xmlrgn and searched using xpatu. Wordwheel building is also Unicode based.
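The case mappings in the data dictionary follow the U+XXXX notation shown earlier, so they can be generated rather than typed by hand. The sketch below is one possible approach and not a DLXS tool: it relies on Python's built-in Unicode case data, and the function name `case_maps` is hypothetical. Review generated entries against the characters that actually occur in your data.

```python
def case_maps(first: int, last: int):
    """Build <Map> lines mapping uppercase to lowercase for a code point range."""
    lines = []
    for cp in range(first, last + 1):
        upper = chr(cp)
        lower = upper.lower()
        # Keep only one-to-one mappings that actually change the character.
        if len(lower) == 1 and lower != upper:
            lines.append(
                "<Map><From>U+%04X</From><To>U+%04X</To></Map>" % (cp, ord(lower))
            )
    return lines

# Greek capital alpha through capital epsilon.
for line in case_maps(0x0391, 0x0395):
    print(line)
```

For the range U+0391 to U+0395 this reproduces the five Greek mappings listed in the sample above.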
Middleware configuration, requirements and behavior for Unicode
Use Perl 5.8.3 or higher. 5.8.8 is better. Avoid 5.8.6 (debugger problem).
If your data is UTF-8 encoded Unicode, set the collection manager (collmgr) locale field to en_US.UTF-8. Middleware will use xpatu to read the index. That is all.
To make legacy Latin-1 encoded SGML data work:
- The collection manager (collmgr) locale field should be set to en_US (or left empty) to use xpat instead of xpatu to read the index.
- If there are character entity references like "&eacute;", declare them in the DLXSROOT/web/(c)/(collection)/entitiesdoctype.chnk file (copied from DLXSROOT/misc/sgml), if not already present in that file.
The basic assumption INSIDE the middleware is that ANY input (user typed or search results from XPAT) is UTF-8 encoded.
- USER input that is not valid UTF-8 will be transcoded into UTF-8 FROM LATIN-1
- SEARCH ENGINE results from XPAT are processed through the DlpsUtils::Sgml2XmlFilter:
- to transcode into UTF-8 FROM LATIN-1
- to change SGML-style singletons (e.g. <LB>) to XML-style singletons (e.g. <LB/>).
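The two steps above can be sketched in Python. This is an illustration of the idea only, not the code of DlpsUtils::Sgml2XmlFilter (the middleware itself is Perl); the function names and the `names` parameter listing the empty elements are hypothetical.

```python
import re

def to_utf8_text(raw: bytes) -> str:
    """Decode input as UTF-8; if it is not valid UTF-8, treat it as Latin-1."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Every byte sequence is decodable as ISO-8859-1, so this cannot fail.
        return raw.decode("iso-8859-1")

def xml_singletons(markup: str, names=("LB",)) -> str:
    """Rewrite SGML-style empty elements such as <LB> as XML-style <LB/>."""
    alternatives = "|".join(map(re.escape, names))
    # The lookbehind skips tags that already end in "/>".
    pattern = re.compile(r"<(%s)((?:\s[^<>]*?)?)(?<!/)>" % alternatives)
    return pattern.sub(lambda m: "<%s%s/>" % (m.group(1), m.group(2)), markup)

print(to_utf8_text(b"caf\xe9"))
print(xml_singletons("line one<LB>line two"))
```

The fallback in `to_utf8_text` is a heuristic: Latin-1 bytes that happen to form valid UTF-8 sequences would be read as UTF-8.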
Downside: Searches for accented characters will fail in Latin-1 collections because the user's search term will be converted to UTF-8 but the collection data will be Latin-1. Unaccented searches will still work.
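The mismatch can be seen at the byte level with a small Python sketch. It is a simplification for illustration and does not model how XPAT matches terms through its index points and mappings; the function name `byte_search` and the sample text are assumptions.

```python
def byte_search(term: str, data: bytes) -> bool:
    """Look for the UTF-8 bytes of a search term in raw collection bytes."""
    return term.encode("utf-8") in data

# The same text stored in a Latin-1 collection and in a UTF-8 collection.
text = "caf\u00e9 society"
latin1_data = text.encode("iso-8859-1")
utf8_data = text.encode("utf-8")

print(byte_search("caf\u00e9", latin1_data))  # accented term, Latin-1 data
print(byte_search("society", latin1_data))    # unaccented term, Latin-1 data
print(byte_search("caf\u00e9", utf8_data))    # accented term, UTF-8 data
```

The accented term is two bytes (C3 A9) in UTF-8 but a single byte (E9) in Latin-1, so it cannot match the Latin-1 data, while the ASCII-only term is identical in both encodings.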