Saturday, June 10, 2006

Unicode in Java: not so fast (but XML is better)!

Unicode BMP Mapping

If I asked you whether Java supports Unicode, you would likely say yes, and you would be right. But did you know that this is not the end of the story? Did you know, for example, that a Java char (or its wrapper class Character) does not actually represent a Unicode character? Did you know that String.length() does not always return the number of characters in a string, and that String.charAt(12) does not always return the character actually at position 12? If so, you probably know more than I do and you can stop reading here!

The reason is quite simple: Unicode is able to address over one million code points (1,114,112 to be exact, from U+0000 through U+10FFFF), and clearly this doesn't fit in the 16 bits that a char represents. You need 21 bits to represent that many values.
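The arithmetic is easy to check from Java itself. The following is only a small sketch: the class name CodeSpace and the helper bitsNeeded are made up for illustration, while Character.MAX_CODE_POINT and Character.MAX_VALUE are standard constants (the former was added in Java 5).

```java
final class CodeSpace {
    // Hypothetical helper: number of bits needed to store values 0..maxValue.
    static int bitsNeeded(int maxValue) {
        return 32 - Integer.numberOfLeadingZeros(maxValue);
    }

    public static void main(String[] args) {
        // Code points run from 0 to Character.MAX_CODE_POINT (0x10FFFF).
        System.out.println(Character.MAX_CODE_POINT + 1);          // 1114112
        System.out.println(bitsNeeded(Character.MAX_CODE_POINT));  // 21
        // A char tops out at 0xFFFF, which needs exactly 16 bits.
        System.out.println(bitsNeeded(Character.MAX_VALUE));       // 16
    }
}
```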

In reality, a Java char holds a single 16-bit UTF-16 code unit, and a Java String represents a string in UTF-16 format. A char can hold any code point of the Basic Multilingual Plane (BMP): see the image of the BMP, where each square represents 256 characters, for a total of 65,536 characters, each identified by a number or "code point". That range includes the surrogate code points (shown in gray), which do not stand for characters themselves but are reserved for use in pairs to encode characters outside the BMP. This is usually all right, because most modern languages are fully represented using the BMP, and each code point in the BMP fits in 16 bits. In other words, in general one Unicode character fits in a Java char and all is well.

However, there are 16 planes of supplementary characters that lie outside the BMP. Such Unicode characters are represented in UTF-16 (and thus in Java strings) with two 16-bit char values, called a surrogate pair.
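To make the pairing concrete, here is a minimal sketch of how UTF-16 splits a supplementary code point into two char values. The class and method names (SurrogatePairs, encode) are illustrative only, and the method assumes it is handed a valid code point. The standard library does the same job with Character.toChars(int), which the example uses as a cross-check.

```java
final class SurrogatePairs {
    // Encode one code point as UTF-16. Assumes a valid code point (0..0x10FFFF).
    static char[] encode(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            return new char[] { (char) codePoint };   // BMP: a single char suffices
        }
        int offset = codePoint - 0x10000;                 // 20 significant bits remain
        char high = (char) (0xD800 + (offset >> 10));     // top 10 bits
        char low  = (char) (0xDC00 + (offset & 0x3FF));   // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        // U+1D11E is MUSICAL SYMBOL G CLEF, a character outside the BMP.
        char[] pair = encode(0x1D11E);
        System.out.println(String.format("%04X %04X", (int) pair[0], (int) pair[1])); // D834 DD1E
        System.out.println(java.util.Arrays.equals(pair, Character.toChars(0x1D11E))); // true
    }
}
```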

Amazingly enough, the standard Java library did not handle Unicode code points correctly at all until JDK 1.5 (AKA Java 5), released in late 2004. In that version, the Character and String classes were augmented with some helper methods to handle Unicode code points using int values. However, for backward compatibility reasons, the semantics of methods such as String.length() and String.charAt() were not modified: they still count and index 16-bit char values rather than characters, so they are still "wrong".
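A short sketch of the difference, using a string that contains one supplementary character. The wrapper class and its method names are hypothetical; codePointCount, codePointAt and offsetByCodePoints are the actual String methods added in Java 5.

```java
final class CodePointMethods {
    // 'a', U+1D11E (MUSICAL SYMBOL G CLEF, stored as a surrogate pair), 'b'
    static final String SAMPLE = "a\uD834\uDD1Eb";

    // Number of characters (code points), as opposed to s.length().
    static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    // The n-th character (0-based), as opposed to s.charAt(n).
    static int codePointAtIndex(String s, int n) {
        return s.codePointAt(s.offsetByCodePoints(0, n));
    }

    public static void main(String[] args) {
        System.out.println(SAMPLE.length());              // 4 (char values)
        System.out.println(codePointLength(SAMPLE));      // 3 (characters)
        System.out.println(SAMPLE.charAt(2) == 'b');      // false: index 2 is the low surrogate
        System.out.println(codePointAtIndex(SAMPLE, 2) == 'b'); // true
    }
}
```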

If you really want to handle Unicode, including supplementary character planes, in your Java application, you have to be really careful and stay aware of the existence of surrogate pairs. While there are historical (Unicode initially addressed 16 bits only) and convenience (most characters fit in 16 bits, so why use more memory by default?) reasons for this situation, it is a shame that the char type and Character class do not, actually, always represent characters, and that the String class's methods do not do what you think they do.
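In practice, being careful mostly means never stepping through a string one char at a time when you mean one character at a time, and never cutting a string in the middle of a pair. Below is one possible sketch of both; SafeTraversal and its two methods are hypothetical names, not library API, and they rely only on String and Character methods available since Java 5.

```java
import java.util.ArrayList;
import java.util.List;

final class SafeTraversal {
    // Walk the string by code point, advancing 1 or 2 chars as needed.
    static List<Integer> codePoints(String s) {
        List<Integer> result = new ArrayList<Integer>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            result.add(cp);
            i += Character.charCount(cp);   // 2 for a surrogate pair, else 1
        }
        return result;
    }

    // Keep at most maxCodePoints characters without splitting a surrogate pair.
    static String truncate(String s, int maxCodePoints) {
        if (s.codePointCount(0, s.length()) <= maxCodePoints) {
            return s;
        }
        return s.substring(0, s.offsetByCodePoints(0, maxCodePoints));
    }

    public static void main(String[] args) {
        String sample = "a\uD834\uDD1Eb";   // 'a', U+1D11E, 'b'
        System.out.println(codePoints(sample).size());     // 3
        System.out.println(truncate(sample, 2).length());  // 3 chars: 'a' plus the full pair
        System.out.println(sample.substring(0, 2).length()); // 2 chars: ends on a lone high surrogate
    }
}
```

The last line shows the failure mode: a plain substring(0, 2) ends on half of a pair, which is no longer well-formed UTF-16.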

The good news is that when working directly with XML technologies such as XPath and XSLT, you are shielded from such issues. For example, the XPath 1.0 recommendation explicitly comments: "In many programming languages, a string is represented by a sequence of 16-bit Unicode code values; implementations of XPath in such languages must take care to ensure that a surrogate pair is correctly treated as a single XPath character." XQuery 1.0 and XPath 2.0 Functions and Operators comments as well: "A surrogate [meaning a surrogate pair, or two 16-bit values in UTF-16] counts as one character, not two."

2 comments:

  1. Alessandro Vernet, June 12, 2006 at 6:22 AM

    And Erik, what are surrogate code points, surrogate pairs, and surrogate counts?

    Alex

  2. Alex,

    I have updated the post a little bit to clarify some things, but here are more details.

    A "surrogate code point" is a "Unicode code point in the range U+D800 through U+DFFF." Those are shown in gray in the image, and they are reserved for UTF-16 encoding.

    A surrogate pair is a sequence of two 16-bit values, a high surrogate (U+D800 through U+DBFF) followed by a low surrogate (U+DC00 through U+DFFF), that together identify a single character outside the BMP.

    There is no such thing as "surrogate counts": the sentence just means that a "surrogate [pair]" should be counted as a single character in XPath.

    Finally, a plane is "a range of 65,536 [...] contiguous Unicode code points". There are 17 planes in Unicode. Plane 0 is the Basic Multilingual Plane (BMP).

    The Unicode glossary (http://www.unicode.org/glossary/) is quite useful for answering these questions.

    -Erik
