Wednesday, June 14, 2006

How Browsers Have Been Saving Us from Incorrect Encodings

[Image: Windows 1252 code page]

Erik had an entry just a few days ago about Unicode, and now we'll look at two encodings: ISO-8859-1 and windows-1252. Character encoding may not be the most exciting topic around, but it is one we need to get right.

ISO-8859-1, also called Latin 1, is the default encoding on most UNIX operating systems. The default encoding on Windows is not ISO-8859-1, but windows-1252. The Windows encoding is a superset of ISO-8859-1 and differs only by using printable characters instead of control characters in the 0x80 to 0x9F range. Some relatively common characters, like the euro sign (€) and the trademark sign (™), are mapped to code points in this range (shown in yellow in the image here).
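To make the difference concrete, here is a small Python sketch that decodes the same bytes with both encodings. It relies only on Python's built-in `cp1252` and `iso-8859-1` codecs; the variable names are just for illustration.

```python
# Bytes 0x80 and 0x99 are the euro and trademark signs in windows-1252.
data = bytes([0x80, 0x99])

as_cp1252 = data.decode("cp1252")      # printable characters
as_latin1 = data.decode("iso-8859-1")  # C1 control characters

print([f"U+{ord(c):04X}" for c in as_cp1252])
print([f"U+{ord(c):04X}" for c in as_latin1])

# Collect every byte value on which the two decoders disagree.
# errors="replace" is needed because Python's cp1252 codec leaves a few
# bytes in this range (0x81, for example) undefined.
differing = [
    b for b in range(256)
    if bytes([b]).decode("cp1252", errors="replace")
    != bytes([b]).decode("iso-8859-1")
]
print(f"{len(differing)} differing bytes, "
      f"from 0x{differing[0]:02X} to 0x{differing[-1]:02X}")
```

Every byte outside the 0x80 to 0x9F range decodes to the same character under both encodings, which is why the mislabeling so often goes unnoticed.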

It is very common for people to mistakenly declare that a document is encoded in ISO-8859-1 when it is in fact encoded in windows-1252. So what happens with all those documents incorrectly marked as ISO-8859-1? Are they rendered incorrectly? Well, no: in most cases they are rendered correctly, as if the windows-1252 encoding had been specified. The reason is that the control characters of ISO-8859-1 that map to printable characters in windows-1252 are not valid in HTML. So when those appear in an ISO-8859-1 document, the browser can either consider the whole document invalid or show the corresponding printable character from windows-1252. Most browsers go with the second option.
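The fallback described above can be approximated in a few lines of Python. This is only a sketch of the idea, not how any particular browser implements it; the function name `decode_like_browser` and the set of labels it checks are assumptions made for the example.

```python
def decode_like_browser(data: bytes, declared: str) -> str:
    """Decode bytes, treating a declared ISO-8859-1 as windows-1252.

    Hypothetical helper: real browsers recognize many more labels.
    """
    label = declared.strip().lower()
    if label in {"iso-8859-1", "latin1", "latin-1"}:
        # Prefer the printable windows-1252 characters over the
        # C1 control characters that ISO-8859-1 assigns to 0x80-0x9F.
        label = "windows-1252"
    # Replace undecodable bytes rather than rejecting the whole document.
    return data.decode(label, errors="replace")


# A page saved as windows-1252 but declared as ISO-8859-1.
page = "Price: 10 €".encode("cp1252")

print(decode_like_browser(page, "ISO-8859-1"))  # euro sign survives
print(repr(page.decode("iso-8859-1")))          # strict decoding yields a control character
```

Decoding strictly as ISO-8859-1 turns the euro sign into the control character U+0080, while the lenient version shows what the author intended.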

It looks like in quite a few cases our browsers have been saving us, maybe without us even knowing about it.
