If the character encoding of the harvested data does not pass the Perl Encode::is_utf8 test, the data is assumed to be Latin-1 encoded and is transcoded from Latin-1 (iso-8859-1) to UTF-8. This heuristic is useful in the vast majority of cases. It is important to note it is impossible to determine the encoding if it is not specified or adherence to OAI specifications has lapsed.
With the exception of the five reserved XML CERs (lt, gt, apos, quot, amp), the ISO Character Entity References from ISOAMSA, ISOAMSB, ISOAMSC ISOAMSN, ISOAMSO, ISOAMSR, ISOCYR1, ISOCYR2, ISOGRK1, ISOGRK2, ISOGRK3, ISOGRK4, ISOLAT1, ISOLAT2, ISOMFRK, ISONUM, ISOPUB, ISOTECH, MMLALIAS, and MMLEXTRA are translated into their corresponding UTF-8 encoded Unicode characters. It is usually an error for CERs from these sets to appear in the XML because they require an internal subset declaration to make the XML valid. This also improves searchability of the records and decreases file size.
Numeric Character References of the form XXXX;
where XXXX represents a hexadecimal number of one to 4 hexadecimal digits and YYYY;
where YYYY represents a decimal number of one to 4 decimal digits are mapped to their equivalent UTF-8 encoded Unicode characters. This is not strictly necessary but improves searchability. The five reserved XML characters, if represented by NCRs, are not mapped.
"Legal characters are tab, carriage return, line feed, and the legal characters of Unicode and ISO/IEC 10646. The versions of these standards cited in A.1 Normative References were current at the time this document was prepared. New characters may be added to these standards by amendments or new editions. Consequently, XML processors must accept any character in the range specified for Char. The use of "compatibility characters", as defined in section 6.8 of Unicode (see also D21 in section 3.6 of Unicode3), is discouraged. Character Range Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]."For this purpose we use a modified version of the utf8conditioner written by Simeon Warner at Cornell University (simeon at cs dot cornell dot edu).