Tuesday, January 8, 2013

Automatic remapping of Windows-1252 characters to Unicode

Nowadays, almost everything uses Unicode, which supports a large (very large!) number of characters. Unicode assigns a code (number) to each character, and in most cases this code is represented:

  • With UTF-16 when in memory (e.g. in Java and Windows since NT);
  • With UTF-8 when sent over the wire.

Before Unicode, Western languages typically used a single byte per character, with the ISO Latin-1 (ISO 8859-1) encoding. This encoding was fine for most Western characters but lacked some useful ones, such as curved quotes, the euro sign, the trademark sign, and many others. In their infinite wisdom, Microsoft decided to use some reserved codes of Latin-1 (the range 0x80 to 0x9F, which Latin-1 sets aside for control characters) for those "useful" characters, creating the Windows-1252 encoding.

Unicode is based on Latin-1, and not Windows-1252: the first 256 Unicode codes match Latin-1, which means that the code for all those "useful" characters is higher than 255 in Unicode. The problem is that documents encoded with Windows-1252 are often incorrectly opened as Latin-1. The mistake is easy to make, as it works "in most cases". The error is so common that the HTML5 spec says that a browser should parse a document advertised as using Latin-1 as Windows-1252 (not trusting the advertised encoding!).

But should you incorrectly take that Windows-1252 encoded file as a Latin-1 encoded file, and pass along its content to a system where it is saved as Unicode, you might end up with control characters. If the text is then sent back to the browser, still as Unicode, those control characters will show as squares, instead of the curved quotes, euro sign, or trademark sign you originally intended.
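To make the failure concrete, here is a minimal Python sketch of what happens to a few of those "useful" characters. The sample string and variable names are made up for illustration; `cp1252` and `latin-1` are the names Python's standard codecs use for the two encodings.

```python
# Curved quotes, euro sign, and trademark sign, as the author intended them.
original = "\u201cPrice: \u20ac10\u2122\u201d"

# The bytes as they would be stored in a Windows-1252 file.
raw = original.encode("cp1252")

# Decoding with the right encoding round-trips the text.
decoded_ok = raw.decode("cp1252")

# Decoding as Latin-1 never raises an error, because every byte is a valid
# Latin-1 character, but bytes 0x80-0x9F become C1 control characters.
decoded_wrong = raw.decode("latin-1")

# Collect the control characters that ended up in the mis-decoded text.
c1_controls = [hex(ord(ch)) for ch in decoded_wrong if 0x80 <= ord(ch) <= 0x9F]
print(c1_controls)
```

The mis-decoded string has the same length as the original, which is part of why the mistake goes unnoticed until the text is displayed.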

Luckily, there is a way to safely and automatically fix incorrectly encoded documents. This is done by changing the code for characters that exist in Windows-1252, but not Latin-1, to their valid Unicode code, using a simple conversion table. Orbeon automatically does this for you. And of course, configuration properties allow you to disable this automatic conversion, or even to set up your own custom conversion.
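The remapping idea can be sketched in a few lines of Python. This is only an illustration of the technique, not Orbeon's actual implementation; the function and table names are hypothetical, and the table is derived from Python's own `cp1252` codec rather than from the conversion table Orbeon uses.

```python
def build_remap_table():
    """Map each code in 0x80-0x9F to the character Windows-1252 assigns to it."""
    table = {}
    for code in range(0x80, 0xA0):
        try:
            table[code] = bytes([code]).decode("cp1252")
        except UnicodeDecodeError:
            # A few codes are unassigned in Python's cp1252 codec;
            # leave those characters untouched.
            pass
    return table


REMAP = build_remap_table()


def remap_windows_1252(text):
    """Replace C1 control characters with their Windows-1252 equivalents."""
    return text.translate(REMAP)


# Text that was decoded as Latin-1 by mistake.
broken = "\x93Price: \x8010\x99\x94"
fixed = remap_windows_1252(broken)
print(fixed)
```

The fix is safe in the sense that it only touches codes in the 0x80 to 0x9F range, which are control characters that essentially never appear on purpose in text documents; everything else passes through unchanged.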

2 comments:

  1. Nice post! I think the link in the "conversion table" is not working.

  2. Dan, thank you for the note; I updated the link to the conversion table.
