Ming the Mechanic: Unicode

Ming the Mechanic:
Unicode
The NewsLog of Flemming Funch

Unicode

2005-01-25 18:25
1 comment

by Flemming Funch

I'm beginning to love Unicode. At first I just started to yawn whenever I heard it mentioned. Yes, very un-geekly. But now I think it is a very good thing.

If you don't already know, Unicode is a unified system for encoding pretty much all the characters in all current languages. Some 100,000 or so. To replace hundreds of different incompatible local methods of entering and encoding characters.

Before, there was ASCII. One character per byte, which gives 256 possibilities, all of the combinations of the 8 binary bits in a byte. There was wide agreement on the first 128, whereas the last 128 changed from country to country, language to language. That worked ok when we were only talking about European languages, with latin letters like I'm writing, and if one just remembered what country's character set is used. But it is hopelessly inadequate for many other character sets, particularly Asian ones, like Chinese that has thousands of characters. Then one would use some system of storing each character in several bytes, and one would load special software on one's computer to be able to enter and view the characters, and they wouldn't be visible if one didn't have it.

Anyway, Unicode simplifies all of that. One coding system for all of it. It might still be tricky to figure out how to enter the various characters, but at least each one has its own code, a 4-digit hexadecimal code.

For practical purposes, on the web, the winning approach is a compromise called UTF-8. Instead of the straight Unicode, it will store characters as a variable number of bytes, from one to four. Normal English text, which would fit in the first 1/2 byte of ASCII, will be stored exactly the same way. But anything else can be done by the use of additional bytes.

Now, I don't totally grok the whole encoding scheme, but that doesn't really matter, because I probably don't have to do it in my head. The main thing is to use UTF-8 wherever I possibly can. I'm making a couple of programs right now where it is a must, and where it makes everything nicely simple. One is a newsfeed aggregator, which needs to be able to show the content of any feed in any language. The other is a mail client, which needs to do the same. And it seems like I succeeded with relatively little pain. OK, I don't always know when I need to encode and when I need to decode and when I need to leave things alone, but a little trial and error sorts it out. And then it is basically making sure that web pages are served with the UTF-8 encoding, and that my database stores things in UTF-8. MySQL 4.1 handles the last part nicely. And then any modern browser should see the characters as they're meant to be seen, no matter if they're Chinese, Hebrew, or whatever.

That also leaves a sore spot for the programs I really need to convert, but I haven't yet. This weblog program here does not yet handle unicode, so I can't just type in a bunch of stuff to impress you. Well, one can always do it with some special HTML codes, like here: ⽇⽉ . Oops, that didn't work either. I was trying to write Ming in Chinese. Anyway, that isn't what the UTF-8 thing is about. One should be able to just type things in in one's own language without having to worry much about codes.

[< Back] [Ming the Mechanic]

Category: Programming

1 comment

25 Jan 2005 @ 18:39 by swanny : Thanks
Thanks for this update ming
Yes Chinese is quite an interesting language yet actually quite simple
Their grammar consists of basically "object action subject"
although they have as you say mega characters where words or ideas
are represented as whole charactors which are sort of pictograms of sorts
It is neat some of these translation tools too though they still seem to do it somewhat
mediocrely as the grammar or syntax still varys from language to language
and there is no universal syntax or grammar code or not as far as I know
some being simple some complex..... ie chinese = I go lake (and thats about
as simple as grammar comes and then there well stuff like computer syntax
and xml which is kinda "greek" to me
hee hee

Other stories in Programming
2014-11-01 17:33: The conversation of work
2007-02-24 14:20: Writing books in HTML/CSS
2007-02-05 15:21: Software is hard
2006-11-19 21:30: Thingamy
2005-12-14 15:15: Ruby on Rails
2005-03-19 16:04: Comment and Refererrer Spam
2005-02-23 21:34: Wikipedia
2005-02-22 17:32: Mail
2005-02-10 16:00: More Google wizardry
2005-02-04 15:14: The Six Laws of the New Software

[< Back] [Ming the Mechanic]

[PermaLink]?

Link to this article as: http://ming.tv/flemming2.php/__show_article/_a000010-001455.htm

Main Page: ming.tv