Wednesday, July 12, 2006

Ruby: Not All Rosy


First, a disclaimer: no, I am not switching to Rails as a web platform any time soon (XForms is way too cool for building Ajax-based user interfaces), but I happen to be writing a bit about Ruby, Rails, and XML in our upcoming Web 2.0 book.

This post is a reaction in shock to the fact that Ruby doesn't support Unicode. Yes, that's right! I did quite a bit of research online on this, and it is just a fact: as of Ruby 1.8.4, there is no serious support in the language for internationalization as we understand it in Java or .NET. As per the documentation, a Ruby string is just a series of bytes.
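
To make "a series of bytes" concrete, here is a minimal sketch of what byte-oriented string semantics look like. It does not run on Ruby 1.8.4: it assumes Ruby 2.0 or later and uses String#b only to obtain a raw byte view of a UTF-8 string, as an approximation of the behavior described above. The method name byte_view_stats is made up for this illustration.

```ruby
# Assumption: Ruby 2.0 or later. String#b returns a copy of the string
# tagged as ASCII-8BIT, which is used here to imitate a string that is
# "just a series of bytes". This is an illustration, not Ruby 1.8.4 itself.

# Hypothetical helper: compares the character view and the byte view of a string.
def byte_view_stats(text)
  bytes = text.b                 # raw byte view of the same data
  second = bytes[1]              # indexing the byte view yields a single byte
  {
    chars: text.length,          # number of characters
    bytes: bytes.length,         # number of bytes
    # Reinterpret the single byte as UTF-8 and check whether it is a whole character.
    second_byte_is_valid_utf8: second.dup.force_encoding("UTF-8").valid_encoding?
  }
end

stats = byte_view_stats("h\u00E9llo")   # "héllo", written with an escape
puts "characters: #{stats[:chars]}, bytes: #{stats[:bytes]}"
puts "second byte is a whole character? #{stats[:second_byte_is_valid_utf8]}"
```

With a byte view, the length is 6 rather than 5, and indexing position 1 returns half of the é rather than a character. That is the kind of surprise a byte-based string leaves to the programmer.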

To be fair, some classes like Regexp have built-in support for UTF-8, but this is by no means a universal thing in the Ruby Core API. The good news for Ruby users is that the issue seems to have been recognized (although I have read a fair amount of denial online by people who do not quite grasp what Unicode is for), and support for so-called multilingualization is meant to arrive with Ruby 1.9 or 2.0. But this has been in the making for a very long time.

I read people arguing something along the lines of "Ruby supports Unicode since you can put anything in a string" or "there is no particular reason to support Unicode rather than other encodings". Now this is just plain wrong, and here is why:

  • Unicode is not technically a character encoding. It is true that Unicode assigns so-called code points to characters, and encodings typically do that as well. But Unicode assigns distinct code points to all the different characters of all the languages of the world (at least, that is the goal). With Unicode, you can then encode those code points using different character encodings, such as UTF-8 and UTF-16. The real character encodings here are UTF-8 and UTF-16, not Unicode.

  • I already talked about how Java was broken with respect to Unicode because of its 16-bit characters and its lack, until Java 1.5, of even minimal support for surrogate pairs in UTF-16. But by treating strings as sequences of bytes, Ruby is not even at that level of brokenness. At least Java and C# are correct 99.99 percent of the time (I am making up this statistic, of course, based on the fact that the Unicode supplementary planes are very rarely used today). Not using Unicode means that methods and classes must special-case encodings such as UTF-8, as Regexp does. Not having at least UTF-8 or UTF-16 support in strings, and having other "encodings" as alternatives, means that it is difficult to know what to expect when you look at a byte in a string.

  • It is widely recognized that using Unicode is the best available solution for internationalization. Witness the fact that Java (since 1995), .NET, Windows, Mac OS X, and many Unices, as well as XML of course, have made Unicode central to their text handling. There is a reason for this: so far, only Unicode allows you to represent all the characters of all the modern languages (and many dead languages as well). It is probably not perfect, but take your issues and problems to the Unicode Consortium.
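
The distinction between a code point and its encodings, and the surrogate pair issue mentioned for Java, can be sketched as follows. This assumes Ruby 1.9 or later (not the 1.8.4 discussed here), where String#encode and String#ord are available; the helper name describe is made up for this illustration.

```ruby
# Assumption: Ruby 1.9 or later, where strings carry an encoding.
# Hypothetical helper: shows one code point and two of its encoded forms.
def describe(char)
  {
    # The abstract Unicode code point, independent of any encoding.
    code_point: format("U+%04X", char.ord),
    # The bytes used by the UTF-8 encoding of that code point.
    utf8_bytes: char.encode("UTF-8").bytes.map { |b| format("%02X", b) },
    # The 16-bit code units used by UTF-16 (big-endian), read two bytes at a time.
    utf16_units: char.encode("UTF-16BE").unpack("n*").map { |u| format("%04X", u) }
  }
end

e_acute = describe("\u00E9")     # LATIN SMALL LETTER E WITH ACUTE
g_clef  = describe("\u{1D11E}")  # MUSICAL SYMBOL G CLEF, in a supplementary plane

[e_acute, g_clef].each do |d|
  puts "#{d[:code_point]}: UTF-8 #{d[:utf8_bytes].join(' ')}, UTF-16 #{d[:utf16_units].join(' ')}"
end
```

U+00E9 is one code point, two bytes in UTF-8, and one 16-bit unit in UTF-16. U+1D11E is still one code point, but it takes four bytes in UTF-8 and two 16-bit units (a surrogate pair) in UTF-16. That second case is the one a 16-bit char type gets wrong, and a plain byte view gets both cases wrong.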

But this is fairly well known, and I have come to the conclusion that the whole issue is the good old one of Han unification. According to Wikipedia, "most of the opposition to Han unification appears to be Japanese". And guess what: the original author of Ruby, Yukihiro "Matz" Matsumoto, is Japanese, and Ruby has lots of Japanese users. This is in no way a criticism of Japan (I love Japan), but a historical explanation of the fact that back in 1993-1995, when Ruby was developed, choosing Unicode was a no-brainer for Java but not for Japanese developers.

My current understanding (and I may be wrong) is that the latest versions of Unicode and the clarification of the relationship between characters and fonts should make the point quite moot, and that there is really no reason not to standardize on Unicode as the internal representation of characters. So go Ruby, be bold and integrate Unicode already!

NOTE: I found this interesting and recent Unicode Technical Note on the subject of the encoding of the Han script (scroll to the second subtitle). It is written by the Ideographic Rapporteur Group, which advises Unicode and ISO on those matters, with members from China, Korea, Japan, Taiwan, and Vietnam. This reinforces my belief that Han unification in Unicode is a good idea.
