Ming the Mechanic

The NewsLog of Flemming Funch

Tuesday, January 25, 2005

I'm beginning to love Unicode. At first I just started to yawn whenever I heard it mentioned. Yes, very un-geekly. But now I think it is a very good thing.

If you don't already know, Unicode is a unified system for encoding pretty much all the characters in all current languages, some 100,000 or so. It is meant to replace hundreds of different incompatible local methods of entering and encoding characters.

Before, there was ASCII and its 8-bit extensions. One character per byte, which gives 256 possibilities, all of the combinations of the 8 binary bits in a byte. ASCII proper only defines the first 128, and there was wide agreement on those, whereas the last 128 changed from country to country, language to language. That worked OK when we were only talking about European languages, with Latin letters like I'm writing, as long as one remembered which country's character set was in use. But it is hopelessly inadequate for many other writing systems, particularly Asian ones, like Chinese, which has thousands of characters. For those, one would use some system of storing each character in several bytes, and one would load special software on one's computer to be able to enter and view the characters, and they wouldn't be visible if one didn't have it.
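To make the "last 128 changed from country to country" part concrete, here is a small Python sketch. The three code pages are just examples I picked; the byte value 0xE9 is arbitrary. The same byte comes out as a different letter depending on which character set one assumes, while a byte below 128 reads the same everywhere.

```python
# One byte value from the upper half of the range, read under three
# different legacy 8-bit code pages (chosen only as examples).
raw = bytes([0xE9])

as_western = raw.decode("latin-1")     # Western European: e with acute accent
as_cyrillic = raw.decode("cp1251")     # Windows Cyrillic: short i
as_greek = raw.decode("iso8859_7")     # Greek: iota

for name, text in [("latin-1", as_western),
                   ("cp1251", as_cyrillic),
                   ("iso8859_7", as_greek)]:
    # !a prints an ASCII-safe representation, so this works on any terminal.
    print(f"{name:10} 0xE9 -> {text!a} (U+{ord(text):04X})")

# Below 128, all three code pages agree with plain ASCII.
ascii_byte = bytes([0x41])
agree = {ascii_byte.decode(enc) for enc in ("latin-1", "cp1251", "iso8859_7")}
```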

Anyway, Unicode simplifies all of that. One coding system for all of it. It might still be tricky to figure out how to enter the various characters, but at least each one has its own code, usually written as a hexadecimal number of four to six digits.
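As a rough illustration of "each one has its own code", Python's built-in `ord` returns that number for any character. The helper name `code_point` and the sample characters are my own choices for the example. Most everyday characters get a four-digit hexadecimal code, while newer additions such as emoji need five.

```python
def code_point(ch: str) -> str:
    """Return the conventional U+XXXX label for a single character."""
    return f"U+{ord(ch):04X}"

# Sample characters, written as escapes so the file stays plain ASCII:
samples = [
    "A",            # Latin capital A
    "\u00e9",       # é
    "\u660e",       # 明
    "\U0001F600",   # 😀 (outside the first 65,536 codes)
]

labels = [code_point(ch) for ch in samples]
print(labels)
```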

For practical purposes, on the web, the winning approach is a compromise called UTF-8. Instead of storing every Unicode code at one fixed width, it stores characters as a variable number of bytes, from one to four. Normal English text, which fits in the first half of the byte range (the 128 ASCII codes), is stored exactly the same way as in ASCII. Anything else is handled with additional bytes.
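Here is a quick way to see the one-to-four-bytes thing in Python. The sample characters are arbitrary picks of mine, one for each length.

```python
samples = [
    "A",            # plain ASCII letter
    "\u00e9",       # é
    "\u660e",       # 明
    "\U0001F600",   # 😀
]

for ch in samples:
    encoded = ch.encode("utf-8")
    # Show the character (ASCII-safe), its byte count, and the bytes in hex.
    print(f"{ch!a}: {len(encoded)} byte(s): {encoded.hex(' ')}")

lengths = [len(ch.encode("utf-8")) for ch in samples]

# Plain English text comes out byte-for-byte identical to ASCII.
english = "Plain English text"
same_as_ascii = english.encode("utf-8") == english.encode("ascii")
```

So an old ASCII file is already a valid UTF-8 file, which is a big part of why this compromise won on the web.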

Now, I don't totally grok the whole encoding scheme, but that doesn't really matter, because I probably don't have to do it in my head. The main thing is to use UTF-8 wherever I possibly can. I'm making a couple of programs right now where it is a must, and where it makes everything nicely simple. One is a newsfeed aggregator, which needs to be able to show the content of any feed in any language. The other is a mail client, which needs to do the same. And it seems like I succeeded with relatively little pain. OK, I don't always know when I need to encode and when I need to decode and when I need to leave things alone, but a little trial and error sorts it out. And then it is basically making sure that web pages are served with the UTF-8 encoding, and that my database stores things in UTF-8. MySQL 4.1 handles the last part nicely. And then any modern browser should see the characters as they're meant to be seen, no matter if they're Chinese, Hebrew, or whatever.
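The "when do I encode, when do I decode" confusion mostly goes away with one rule of thumb: decode bytes into text once on the way in, work with text in the middle, and encode to UTF-8 once on the way out. Below is a minimal Python sketch of that idea. It is not what my programs actually do, and the function names `decode_incoming` and `build_response` are made up for the illustration.

```python
def decode_incoming(raw: bytes, declared_charset: str = "utf-8") -> str:
    """Turn bytes from a feed or mail message into text, exactly once."""
    try:
        return raw.decode(declared_charset)
    except (UnicodeDecodeError, LookupError):
        # The charset label was wrong or unknown: fall back to UTF-8 and
        # replace undecodable bytes with U+FFFD so the damage is visible.
        return raw.decode("utf-8", errors="replace")


def build_response(page: str):
    """Encode a page as UTF-8 and say so in the Content-Type header."""
    headers = {"Content-Type": "text/html; charset=utf-8"}
    return headers, page.encode("utf-8")


# A feed item that arrived in Latin-1, re-served as UTF-8.
item = decode_incoming(b"caf\xe9", "iso-8859-1")
headers, body = build_response(f"<p>{item}</p>")
```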

That also leaves a sore spot: the programs I really need to convert, but haven't yet. This weblog program here does not yet handle Unicode, so I can't just type in a bunch of stuff to impress you. Well, one can always do it with some special HTML codes, like here: ⽇⽉ . Oops, that didn't work either. I was trying to write Ming in Chinese. Anyway, that isn't what the UTF-8 thing is about. One should be able to just type things in one's own language without having to worry much about codes.
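For the record, the "special HTML codes" are numeric character references. This little Python sketch shows how one is produced and read back. I'm assuming here that the character I wanted is 明 (U+660E); the variable names are just for illustration.

```python
import html

ming = "\u660e"  # 明, assumed to be the intended character

# Encoding to ASCII with xmlcharrefreplace turns anything non-ASCII
# into a decimal numeric character reference.
reference = ming.encode("ascii", "xmlcharrefreplace").decode("ascii")

# The hexadecimal form is equally valid in HTML.
hex_form = f"&#x{ord(ming):X};"

# A browser (or html.unescape) turns either form back into the character.
round_trip = html.unescape(reference)
print(reference, hex_form)
```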
[ Programming | 2005-01-25 18:25 ]

Webcam dsfdsf? fdsfdfdsdsfd?

So, I continue to have a bit of fun with that webcam thing I did. In part because there still are several thousand people coming by looking at it every day. So I add a few improvements once in a while.

Mikel Maron made the nice suggestion that one could establish the more precise location of the different cams collaboratively, and then one could maybe do fun things like having them pop up on a world map or something. So, I added forms for people to correct or expand the information on each location. Like, if they know the city, or the name of the building, company, bridge, or whatever, they can type it in. And while I was at it, I added a comment feature.

OK, so, presto, instant collaboration. Within a couple of hours lots of helpful (or maybe bored) visitors had figured out where a bunch of these places were, and they had typed them in.

But, at the same time, what is going on is that these webcams seem terribly interesting to Chinese- or Japanese-speaking people. 70,000 people came from just one Japanese softcore porn news site that for some reason linked to it.

But then there's a slight, eh, communication problem here. Or language problem. Or character set problem. See, I've set it up so that the forms where you leave comments or update the info can take Unicode characters. So if somebody wants to type a comment in Japanese, they should be able to do that. And some people do. But the explanatory text on my page is in English. And it seems that a large number of people don't really have any clue what any of it says, but they have a certain compulsion to type things into any field that they see. So, if there's a button that leads to a form where you can correct the city of the camera, they'll click on it, and they'll enter (I suppose) their own information. Or they say Hi or something. See, I find it very mysterious what they actually are writing. It is for sure nothing like English. But it isn't what will appear as Chinese or Japanese characters either. Rather, it looks to me like what one would type if one was just entering some random test garbage, by quickly running one's fingers over a few adjacent keys. But the strange thing is that dozens and dozens of different people (with different IPs) are entering either very similar, or exactly the same, text. This kind of thing:

Facility: fdsfdfdsdsfd

City: dsfdsf

Yeah, I can type that with 3 fingers without moving my hand from the keyboard. But why would multiple people type exactly the same thing?? Does it say something common in Chinese?

Now, we have a bit of a cryptographic puzzle here. Notice that "Facility" (the name of the field) has twice as many letters as "City": 8 versus 4. And "fdsfdfdsdsfd" has twice as many letters as "dsfdsf": 12 versus 6. Consider the possibility that somebody might think they're supposed to enter the exact word they see into the field. Like some kind of access verification. And they use some kind of foreign character input method that encodes Latin characters as one and a half bytes, which would turn 8 letters into 12 and 4 into 6. If so, I can't quite seem to decode the system.
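Just to double-check the counting, here is a tiny Python sketch over the two entries quoted above. It only confirms the ratio and which keys were used; it doesn't decode anything.

```python
# Field label -> what people typed into it (taken from the entries above).
entries = {"Facility": "fdsfdfdsdsfd", "City": "dsfdsf"}

# Typed characters per letter of the field label.
ratios = {label: len(typed) / len(label) for label, typed in entries.items()}

# Which distinct keys show up across both entries.
keys_used = sorted(set("".join(entries.values())))

print(ratios)
print(keys_used)
```

Both fields come out at 1.5 typed characters per label letter, and only three keys are involved, all neighbors on a QWERTY home row.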

Or, are we dealing with some kind of Input Method Editor (IME) that lets people form Chinese symbols by repetitive use of keys on a QWERTY keyboard? Does anybody know?

This is a bit like receiving signals from some alien civilization. Where's the signal in the noise? How might these folks have encoded their symbols, and what strange things might they be referring to? Are they friendly? dsfdsf?

Otherwise, if anybody here actually speaks Chinese or Japanese, could you give me a translation, preferably into the proper character set, of a sentence like: "This is the information for the camera location. Please do not enter your own personal information here!"
[ Programming | 2005-01-25 20:25 ]

Main Page: ming.tv