I'm beginning to love Unicode. At first I just started to yawn whenever I heard it mentioned. Yes, very un-geekly. But now I think it is a very good thing.
If you don't already know, Unicode is a unified system for encoding pretty much all the characters in all current languages, some 100,000 or so. It is meant to replace hundreds of different, incompatible local methods of encoding characters.
Before, there was ASCII and its many 8-bit extensions. One character per byte gives 256 possibilities, all of the combinations of the 8 binary bits in a byte. There was wide agreement on the first 128, which is ASCII proper, whereas the last 128 changed from country to country, language to language. That worked OK when we were only talking about European languages, with Latin letters like the ones I'm writing, as long as one remembered which country's character set was in use. But it is hopelessly inadequate for many other writing systems, particularly Asian ones like Chinese, which has thousands of characters. There one would use some scheme that stores each character in several bytes, and one would load special software on one's computer to be able to enter and view the characters. Anyone who didn't have it simply wouldn't see them.
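To make the old problem concrete, here is a small Python sketch. The language and the three code pages are my own picks for illustration, nothing more. It decodes the very same byte under three legacy 8-bit character sets and gets three different letters.

```python
# One byte from the "upper 128", read under three legacy 8-bit character sets.
raw = bytes([0xE9])

as_western = raw.decode("latin-1")     # ISO 8859-1, Western European
as_cyrillic = raw.decode("cp1251")     # Windows Cyrillic
as_greek = raw.decode("iso8859_7")     # ISO 8859-7, Greek

for name, ch in [("latin-1", as_western),
                 ("cp1251", as_cyrillic),
                 ("iso8859_7", as_greek)]:
    # Show the character and the Unicode code point it maps to.
    print(f"{name:10} -> {ch}  (U+{ord(ch):04X})")
```

Unless you know which character set the writer had in mind, there is no way to tell which of the three letters was meant.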
Anyway, Unicode simplifies all of that. One coding system for all of it. It might still be tricky to figure out how to enter the various characters, but at least each one has its own code, a number usually written as four to six hexadecimal digits.
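If you want to see those codes for yourself, a few lines of Python will do. This is just a sketch with sample characters I picked; `ord` and `chr` are the built-ins that convert between a character and its code number. Most everyday characters have a four-digit code, while some of the rarer ones need five or six digits.

```python
# A few sample characters and their Unicode code points.
samples = ["A", "\u00e9", "\u660e", "\U0001D11E"]  # A, e-acute, Chinese "ming", musical G clef

code_points = [f"U+{ord(ch):04X}" for ch in samples]

for ch, cp in zip(samples, code_points):
    print(f"{ch} has the code {cp}")

# Going the other way: from the number back to the character.
ming = chr(0x660E)
print(ming)
```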
For practical purposes, on the web, the winning approach is a compromise called UTF-8. Instead of storing every Unicode code as a fixed-width number, it stores characters as a variable number of bytes, from one to four. Normal English text, which fits in the first half of the byte range (the 128 ASCII codes), is stored exactly the same way as before. Anything else is handled with additional bytes.
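Here is a little Python sketch of that variable-length business, again with sample characters of my own choosing. It encodes each one as UTF-8 and counts the bytes, and then checks that plain English text comes out byte for byte the same as ASCII.

```python
# How many bytes does UTF-8 need for each character?
samples = ["A", "\u00e9", "\u660e", "\U0001D11E"]  # A, e-acute, Chinese "ming", musical G clef

byte_lengths = []
for ch in samples:
    encoded = ch.encode("utf-8")
    byte_lengths.append(len(encoded))
    hex_bytes = " ".join(f"{b:02x}" for b in encoded)
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {hex_bytes}")

# Plain English text is identical in ASCII and UTF-8.
english = "Normal English text"
same_as_ascii = english.encode("utf-8") == english.encode("ascii")
print(same_as_ascii)
```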
Now, I don't totally grok the whole encoding scheme, but that doesn't really matter, because I probably don't have to do it in my head. The main thing is to use UTF-8 wherever I possibly can. I'm making a couple of programs right now where it is a must, and where it makes everything nicely simple. One is a newsfeed aggregator, which needs to be able to show the content of any feed in any language. The other is a mail client, which needs to do the same. And it seems like I succeeded with relatively little pain. OK, I don't always know when I need to encode and when I need to decode and when I need to leave things alone, but a little trial and error sorts it out. And then it is basically making sure that web pages are served with the UTF-8 encoding, and that my database stores things in UTF-8. MySQL 4.1 handles the last part nicely. And then any modern browser should see the characters as they're meant to be seen, no matter if they're Chinese, Hebrew, or whatever.
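For what it's worth, the rule of thumb that seems to cover the encode/decode confusion is: decode bytes into text once on the way in, work with text in the middle, and encode to UTF-8 once on the way out. Below is a minimal Python sketch of that idea. The function names `read_incoming` and `render_page` are made up for illustration and are not from my actual programs, and the Latin-1 "feed" is an invented example.

```python
from html import escape

def read_incoming(raw, declared_encoding="utf-8"):
    """Decode once, at the edge, using whatever encoding the source declares."""
    # errors="replace" keeps going on bad bytes instead of raising.
    return raw.decode(declared_encoding, errors="replace")

def render_page(text):
    """Encode once, on the way out, and tell the browser it is UTF-8."""
    header = "Content-Type: text/html; charset=utf-8"
    body = "<html><body><p>" + escape(text) + "</p></body></html>"
    return header, body.encode("utf-8")

# Hypothetical feed item that arrives in a legacy encoding.
legacy_bytes = "caf\u00e9".encode("latin-1")

text = read_incoming(legacy_bytes, "latin-1")   # bytes -> text
header, page_bytes = render_page(text)          # text -> UTF-8 bytes

print(header)
print(page_bytes)
```

The one thing to avoid is decoding or encoding the same data twice, which is where most of the garbage characters come from.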
That also leaves a sore spot: the programs I really need to convert but haven't yet. This weblog program here does not yet handle Unicode, so I can't just type in a bunch of stuff to impress you. Well, one can always do it with some special HTML codes, like here: ⽇⽉ . Oops, that didn't work either. I was trying to write Ming in Chinese. Anyway, that isn't what the UTF-8 thing is about. One should be able to just type things in one's own language without having to worry much about codes.
[ Programming | 2005-01-25 18:25 ]