Monday, May 2, 2011

Auto-fixing Windows/Unicode Character Encoding Issues


Nowadays, we use Unicode for almost everything, and Unicode supports a (very!) large number of characters. Unicode assigns a code (a number) to each character. In most cases this code takes 2 bytes in memory (e.g. in Java, and in Windows since NT, which both use UTF-16), and a variable number of bytes when sent over the wire (typically with the UTF-8 encoding).
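As a rough illustration of the "2 bytes in memory, variable on the wire" point, here is a small Python sketch. The helper name `byte_lengths` is made up for this example; it simply encodes one character both ways and reports the sizes.

```python
def byte_lengths(ch):
    """Return (UTF-16 size, UTF-8 size) in bytes for a single character."""
    # "utf-16-le" is used so that no byte order mark is added to the count.
    return len(ch.encode("utf-16-le")), len(ch.encode("utf-8"))


for ch in ["A", "\u00e9", "\u2122", "\U0001F600"]:
    utf16, utf8 = byte_lengths(ch)
    print(f"U+{ord(ch):04X}: UTF-16 = {utf16} bytes, UTF-8 = {utf8} bytes")
```

The last character sits outside the range that fits in 2 bytes, which is why "in most cases" is the right qualifier: UTF-16 needs 4 bytes for it.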

Before we started using Unicode, a single byte per character was used for western languages, with the ISO-8859-1 encoding. This encoding was fine in most cases but didn't contain some useful characters, such as curved quotes, both “double” and ‘single’, the trademark symbol™, or ligatures like œ.

<rabithole>The Œ and œ are the only French characters not present in ISO-8859-1, and it has been said that this is because the French member of the ISO committee missed a session, and that members from other countries simply decided to remove those characters during that session. (A perhaps more credible explanation is that the ISO committee members concluded that Œ and œ are ligatures rather than characters, and thus have no place in ISO-8859-1.)</rabithole>

Microsoft based its own character set on ISO-8859-1, but, in its infinite wisdom, decided to use some reserved codes of ISO-8859-1 (in the 0x80 to 0x9F range) for those "useful" characters, creating the Windows-1252 encoding.
The first 256 characters in Unicode come from ISO-8859-1, not Windows-1252, which means that the code for all those "useful" characters is higher than 255 in Unicode. The problem is that documents encoded with Windows-1252 are often incorrectly advertised to be in ISO-8859-1. The mistake is easy to make, as it works "in most cases". The error is so common that the HTML5 spec says that browsers should parse documents advertised as using ISO-8859-1 as Windows-1252 (not trusting the advertised encoding!).
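To make the mismatch concrete, here is a short Python sketch that decodes the same byte with both encodings. It relies only on Python's built-in `cp1252` and `latin-1` codecs (the latter being Python's name for ISO-8859-1); the variable names are just for illustration.

```python
# The byte 0x99 is the trademark symbol in Windows-1252.
raw = b"\x99"

as_cp1252 = raw.decode("cp1252")   # what the author of the document meant
as_latin1 = raw.decode("latin-1")  # what you get if the label is trusted

print(f"Windows-1252: U+{ord(as_cp1252):04X}")
print(f"ISO-8859-1:   U+{ord(as_latin1):04X}")

# ISO-8859-1 maps every byte to the Unicode code point with the same number.
all_bytes = bytes(range(256))
same_numbers = all_bytes.decode("latin-1") == "".join(chr(i) for i in range(256))
print("ISO-8859-1 matches the first 256 code points:", same_numbers)
```

Decoded as Windows-1252, the byte becomes U+2122, well above 255. Decoded as ISO-8859-1, it becomes U+0099, a control code with no visible glyph.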

HTML5 only saves you as long as a Windows-1252 document is advertised as ISO-8859-1 to the browser. But if your Windows-1252 document is incorrectly opened as ISO-8859-1, and then saved as UTF-8, you end up with characters in the U+0080 to U+009F range, which Unicode reserves for control codes, and which browsers won't display as the characters you intended. Luckily, the solution to this problem isn't very complicated: it is just a matter of changing the code for characters that exist in Windows-1252 but not in ISO-8859-1 to their valid Unicode code, using a simple conversion table. Since version 3.9, you can set up Orbeon Forms to do this conversion automatically for you.
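The conversion table idea can be sketched in a few lines of Python. This is not how Orbeon Forms implements it; it is only a minimal illustration, and the names `build_c1_to_cp1252_table`, `C1_TO_CP1252`, and `fix_windows_1252_chars` are invented for this example. The table is derived from Python's built-in `cp1252` codec rather than typed by hand.

```python
def build_c1_to_cp1252_table():
    """Map code points 0x80..0x9F to the character Windows-1252 assigns them."""
    table = {}
    for code in range(0x80, 0xA0):
        try:
            table[code] = bytes([code]).decode("cp1252")
        except UnicodeDecodeError:
            # A few bytes (such as 0x81) have no character in Python's
            # cp1252 codec; those code points are left untouched.
            continue
    return table


C1_TO_CP1252 = build_c1_to_cp1252_table()


def fix_windows_1252_chars(text):
    """Replace stray U+0080..U+009F code points with the intended characters."""
    return text.translate(C1_TO_CP1252)


# Simulate the accident: Windows-1252 bytes opened as ISO-8859-1.
broken = b"\x93quoted\x94 \x99".decode("latin-1")
fixed = fix_windows_1252_chars(broken)
print(fixed)
```

Because the table only touches code points in the 0x80 to 0x9F range, text that is already correct passes through unchanged, so applying the fix twice is harmless.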
