The XML FAQ: Can XML use non-Latin characters?


The XML FAQ — Frequently-Asked Questions about the Extensible Markup Language

Section 3: Authors

Q 3.9: Can XML use non-Latin characters?

Yes, this is the default

Yes, the XML Specification explicitly says XML uses ISO 10646, the international standard character repertoire which covers most known languages. Unicode is an identical repertoire, and the two standards track each other. The spec says (2.2): ‘All XML processors must accept the UTF-8 and UTF-16 encodings of ISO 10646…’. There is a Unicode FAQ at http://www.unicode.org/faq/ and an example of the range of alphabets and symbols at http://www.cogsci.ed.ac.uk/~richard/unicode-sample-3-2.html.

While XML software may allow you to enter any Unicode character into a document, your readers can only see the characters if their computer has a suitable font! Not all typefaces and font files have the entire Unicode repertoire (ones that do are huge).

UTF-8 is an encoding of Unicode into 8-bit units: the first 128 code points are encoded as single bytes identical to ASCII, and bytes with the high bit set are used to encode everything else in Unicode as sequences of between 2 and 4 bytes (the original definition allowed sequences of up to 6 bytes, but the current definition stops at 4). UTF-8 in its single-octet form is therefore the same as ISO 646 IRV (ASCII), so you can continue to use ASCII for English or other languages using the Latin alphabet without diacritics (accents). Note that UTF-8 is incompatible with ISO 8859-1 (ISO Latin-1) after code point 127 decimal (the end of ASCII).
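As a rough illustration of the byte lengths described above, the following Python sketch encodes four sample characters (chosen here purely for illustration, one for each UTF-8 sequence length) and compares the UTF-8 and ISO 8859-1 forms of an accented letter. The variable names are not from any specification.

```python
# Sample characters, one for each UTF-8 sequence length:
# U+0041 (A), U+00E9 (e with acute), U+20AC (euro sign), U+1F600 (an emoji).
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

# Number of bytes UTF-8 needs for each sample character.
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}
for ch, n in utf8_lengths.items():
    print(f"U+{ord(ch):04X} -> {n} byte(s): {ch.encode('utf-8').hex()}")

# Pure ASCII text is byte-for-byte identical in ASCII and UTF-8.
ascii_same = "plain text".encode("ascii") == "plain text".encode("utf-8")

# Above code point 127 the encodings diverge:
# ISO 8859-1 stores U+00E9 as one byte, UTF-8 as two.
latin1_bytes = "\u00e9".encode("iso-8859-1")
utf8_bytes = "\u00e9".encode("utf-8")
print(latin1_bytes, utf8_bytes)
```

A file saved as ISO 8859-1 but read as UTF-8 (or the reverse) will therefore show damaged accented letters even though its unaccented text looks fine.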

UTF-16 is an encoding of Unicode into 16-bit units, which lets it represent all 17 planes of Unicode: the first plane directly, and the 16 supplementary planes through pairs of units. UTF-16 is incompatible with ASCII because it uses two 8-bit bytes per character (four bytes above U+FFFF).

Peter Flynn writes:

The encoding specification can refer to any character set your software supports, but the XML Specification only requires that applications support UTF-8 and UTF-16. Some of the common encodings supported by software include:
US-ASCII
Characters TAB, LF, CR, space, and the printable characters 33 to 126 (decimal) only (all other control characters are forbidden by XML).
ISO-8859-1
(Western European Latin-1) As ASCII plus codes 128 to 255 (decimal). Covers most (but not all) western European accented letters.
ISO-8859-2 to 15
These other parts of ISO-8859 cover the remaining and different sets of Latin-based alphabetic and other symbols.
‘Codepages’ and other obsolescent sets
Some software may also support various obsolete ‘codepages’, such as IBM-850, Microsoft Windows-1252, Apple Macintosh Roman, DEC Multinational and other non-standard character encodings, but these are generally non-portable and should be avoided where possible.
One common practice in western Europe is to use ISO-8859-1 so that the majority of common accented letters can be used as single bytes, and to use character entity references or numeric character references for all other characters. This has the advantage that such files can be opened in almost any single-byte editor. The drawback is that numeric character references are not mnemonic, and character entities have to be declared in the DTD or internal subset, but if they are rare, this may not be a serious problem.
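The ISO-8859-1 practice described above can be sketched in Python with the standard library's xml.etree.ElementTree parser. The document, its element name, and its content are hypothetical: the accented letter is stored as a single Latin-1 byte, and the euro sign, which ISO-8859-1 does not contain, is written as a numeric character reference.

```python
import xml.etree.ElementTree as ET

# A hypothetical document stored as ISO-8859-1 bytes.
# 0xE9 is the single Latin-1 byte for U+00E9 (e with acute);
# &#8364; is a decimal reference to U+20AC (euro sign).
doc = (
    b'<?xml version="1.0" encoding="ISO-8859-1"?>\n'
    b'<price>caf\xe9: 3 &#8364;</price>'
)

# The parser reads the encoding declaration and decodes the bytes accordingly.
root = ET.fromstring(doc)
text = root.text
print(text)
```

Without the encoding declaration a parser would assume UTF-8, and the lone 0xE9 byte would be rejected as malformed.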

Bertilo Wennergren writes:

UTF-16 is an encoding that represents each Unicode character of the first plane (the first 64K characters) of Unicode with a 16-bit unit — in practice with two bytes for each character. Thus it is backwards compatible with neither ASCII nor Latin-1. UTF-16 can also access an additional 1 million characters by a mechanism known as surrogate pairs (two 16-bit units for each character).
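To make the surrogate-pair mechanism concrete, here is a small Python sketch. The helper utf16_units is a name invented for this example, and the emoji U+1F600 is just a convenient character outside the first plane. The manual arithmetic follows the standard UTF-16 rule: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves.

```python
import struct

def utf16_units(ch):
    """Return the 16-bit code units of ch in big-endian UTF-16 (no byte-order mark)."""
    data = ch.encode("utf-16-be")
    return list(struct.unpack(f">{len(data) // 2}H", data))

# A first-plane character needs a single 16-bit unit.
bmp = utf16_units("\u00e9")

# A character above U+FFFF needs a surrogate pair (two 16-bit units).
astral = utf16_units("\U0001F600")

# The same pair computed by hand.
offset = 0x1F600 - 0x10000
high = 0xD800 + (offset >> 10)    # top 10 bits
low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits

# Even plain ASCII letters take two bytes, so UTF-16 is not ASCII-compatible.
ascii_in_utf16 = "A".encode("utf-16-be")

print([hex(u) for u in bmp], [hex(u) for u in astral], hex(high), hex(low))
```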
‘…the mechanisms for signalling which of the two are in use, and for bringing other encodings into play, are […] in the discussion of character encodings.’ The XML Specification explains how to specify in your XML file which coded character set you are using.
‘Regardless of the specific encoding used, any character in the ISO 10646 character set may be referred to by the decimal or hexadecimal equivalent of its bit string’: so no matter which character set you personally use, you can still refer to specific individual characters from elsewhere in the encoded repertoire by using &#dddd; (decimal character code) or &#xHHHH; (hexadecimal character code; the x must be lowercase, but the hexadecimal digits may be in either case). The terminology can get confusing, as can the numbers: see the ISO 10646 Concept Dictionary. Rick Jelliffe has XML-ised the ISO character entity sets. Mike Brown's encoding information at http://skew.org/xml/tutorial/ is a very useful explanation of the need for correct encoding. There is an excellent online database of glyphs and characters in many encodings from the Estonian Language Institute server at http://www.eki.ee/letter/.
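A short Python sketch of the decimal and hexadecimal reference forms, again using the standard library's xml.etree.ElementTree. The element name and the choice of the euro sign (U+20AC, decimal 8364) are only for illustration. The final check reflects the XML grammar, in which the x of a hexadecimal reference must be lowercase.

```python
import xml.etree.ElementTree as ET

# Three references to the same character: decimal, hex with uppercase digits,
# and hex with lowercase digits.
doc = "<p>&#8364; &#x20AC; &#x20ac;</p>"
root = ET.fromstring(doc)
chars = root.text.split()
print(chars)

# An uppercase X is not allowed by the XML grammar, so a conforming parser
# should reject the document as not well-formed.
try:
    ET.fromstring("<p>&#X20AC;</p>")
    uppercase_x_rejected = False
except ET.ParseError:
    uppercase_x_rejected = True
print(uppercase_x_rejected)
```

Because the source here is pure ASCII, it can be stored in any of the encodings discussed above and will still yield the same characters after parsing.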