Ming the Mechanic

The NewsLog of Flemming Funch

Wednesday, February 23, 2005

Last week Google offered to host part of Wikipedia's content. Yesterday Wikipedia was knocked offline for some hours by a power failure and subsequent database corruption.

Now, I pay attention to those things not just because Wikipedia is a great resource that needs to be supported, but also because I'm working on a clone of it, and I've been busy downloading from it recently.

The intriguing thing is, first of all, that this is even a possible and acceptable thing to do. Lots of people have put a lot of work into Wikipedia, and it is generously offered as a free service by a non-profit foundation. And not only that, you can go and download the whole database, and the software needed for running a local copy of it on another server, because it is all based on free licenses and open source. And, for that matter, it is in many ways a good thing if people serve up copies of it. It takes a load off their servers, and others might find new things to do with it.

Anyway, that's what I'm trying to do. At this point I've mostly succeeded in making it show what it should show, which is a good start.

Even though the parts in principle are freely available, it is still a pretty massive undertaking. Just the database of current English language articles is around 2GB. And then there are the pictures. They offer in one download the pictures that are in the "Commons" section, i.e. they're public domain. That's around 2GB there too. But most of the pictures are considered Fair Use, meaning they're just being used without particularly having gotten a license for it. So, they can't just share them the same way. But I can still go and download them, of course, just one at a time. I set up a program to pick up about 1 per second. That is considered decent behavior in these matters. Might sound like a lot, but it shouldn't be much of a burden for the server. For example, the Ask Jeeves/Teoma web spider hits my own server about once per second, all day, every day, and that's perfectly alright with me. Anyway, the Wikipedia picture pickup took about a week like that, adding up to something like 20GB.
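A minimal sketch of that kind of polite, roughly one-per-second pickup, in Python. This is not the program described above, just an illustration of the idea. The names fetch_politely and fetch_with_urllib are made up for the example, and the fetch function is passed in as a parameter so the pacing logic can be tried out without touching anybody's server.

```python
import time
from typing import Callable, Iterable, List, Tuple
from urllib.request import urlopen


def fetch_with_urllib(url: str) -> bytes:
    """One possible fetch function, using only the standard library."""
    with urlopen(url, timeout=30) as response:
        return response.read()


def fetch_politely(
    urls: Iterable[str],
    fetch: Callable[[str], bytes] = fetch_with_urllib,
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Tuple[str, int]]:
    """Fetch each URL in turn, pausing between requests.

    Returns (url, byte_count) pairs for the downloads that succeeded.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # pause before every request except the first
        try:
            data = fetch(url)
        except OSError as error:
            # urllib's network errors are OSError subclasses. Skip the
            # picture instead of retrying in a tight loop.
            print(f"skipped {url}: {error}")
            continue
        results.append((url, len(data)))
    return results
```

A real run would also write each picture to disk and remember which ones were already fetched, so an interrupted job can resume. As a rough sanity check on the numbers: one request per second for a week is about 600,000 requests, so 20GB works out to something like 33KB per picture, assuming the program ran more or less continuously.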

Okay, that's the data. But what the database contains is wiki markup. And what the Wikipedia/MediaWiki system uses is pretty damned extensive markup, with loads of features, templates, etc., which needs to be interpreted to show it as a webpage. My first attempt was to try the MediaWiki software which Wikipedia runs on, which I can freely download and easily enough install. But picking out pieces of it is quite a different matter. It is enormously complex, and everything is tied to everything else. I tried just picking out the parsing module. It turned out to depend on some other modules, which depended on some other modules, and pretty soon it became unwieldy, and I just didn't understand it. Then I looked at some of the other pieces of software one can download which are meant to produce static copies of Wikipedia. They're very nice work, but they either didn't do it quite like I wanted, or didn't work for me, or were missing something important, like the pictures. So I ended up mostly doing it from scratch, based on the Wikipedia specs for the markup. Although I also learned a number of things from wiki2static, a Perl program which does an excellent job of parsing the Wikipedia markup, in a way I actually can understand. It still became a very sizable undertaking. I had a bit of a head start in that I've previously made my own wiki program, which actually uses a subset of Wikipedia's markup.
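To give a feel for what "interpreting the markup" means, here is a small Python sketch that handles only a tiny subset: == headings ==, '''bold''', ''italic'', and [[internal links]]. It is neither the code described above nor MediaWiki's parser, and the /wiki/ link prefix is an assumption made for the example. Templates, tables, images, lists and nested formatting are exactly the parts that turn this into a few thousand lines, and none of them are handled here.

```python
import html
import re
from urllib.parse import quote

HEADING = re.compile(r"^(={2,6})\s*(.+?)\s*\1\s*$")
LINK = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")
BOLD = re.compile(r"'''(.+?)'''")
ITALIC = re.compile(r"''(.+?)''")


def _link(match):
    """Turn [[Target]] or [[Target|label]] into an HTML anchor."""
    target = match.group(1).strip()
    label = match.group(2) or target
    # The text was HTML-escaped already, so undo that before URL-quoting.
    href = "/wiki/" + quote(html.unescape(target).replace(" ", "_"))
    return f'<a href="{href}">{label}</a>'


def render_inline(text):
    """Convert inline markup on one line of wiki text to HTML."""
    text = html.escape(text, quote=False)  # keep apostrophes for '' and '''
    text = LINK.sub(_link, text)
    text = BOLD.sub(r"<b>\1</b>", text)  # bold first, so ''' is not read as ''
    text = ITALIC.sub(r"<i>\1</i>", text)
    return text


def render(wikitext):
    """Convert wiki text to HTML: headings, plus paragraphs split on blank lines."""
    blocks = []
    paragraph = []

    def flush():
        if paragraph:
            blocks.append("<p>" + " ".join(paragraph) + "</p>")
            paragraph.clear()

    for line in wikitext.splitlines():
        heading = HEADING.match(line)
        if heading:
            flush()
            level = len(heading.group(1))  # "==" becomes <h2>, "===" <h3>, ...
            blocks.append(f"<h{level}>{render_inline(heading.group(2))}</h{level}>")
        elif not line.strip():
            flush()
        else:
            paragraph.append(render_inline(line.strip()))
    flush()
    return "\n".join(blocks)
```

Calling render() on a short article fragment returns an HTML string with one heading or paragraph per line. Doing the substitutions with regular expressions in a fixed order is the simplest thing that works for a subset like this; it stops working once markup can nest, which is where a real parser comes in.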

As it says on the Wikipedia download site:

These dumps are not suitable for viewing in a web browser or text editor unless you do a little preprocessing on them first.

A "little preprocessing", ha, that's good. Well, a few thousand lines of code and a few days of non-stop server time seem to do it.

Anyway, it is a little mindblowing how cool it is that masses of valuable information are freely shared, and that with "a little preprocessing" one can use it in different contexts, build on top of it, and do new and different things, without having to reinvent the wheel first.

But the people who make these things available in the first place need support. Volunteers, contributors, bandwidth, money.
[ Programming | 2005-02-23 21:34 ]