Ming the Mechanic:
Wikipedia

The NewsLog of Flemming Funch
Wikipedia 2005-02-23 21:34
4 comments
picture by Flemming Funch

Last week Google offered to host part of Wikipedia's content. Yesterday Wikipedia was knocked offline for some hours by a power failure and subsequent database corruption.

Now, I pay attention to those things not just because Wikipedia is a great resource that needs to be supported, but also because I'm working on a clone of it, and I've been busy downloading from it recently.

The intriguing thing, first of all, is that this is even a possible and acceptable thing to do. Lots of people have put a lot of work into Wikipedia, and it is generously offered as a free service by a non-profit foundation. Not only that, you can go and download the whole database, plus the software needed for running a local copy of it on another server, because it is all based on free licenses and open source. And, for that matter, it is in many ways a good thing if people serve up copies of it. It takes load off their servers, and others might find new things to do with it.

Anyway, that's what I'm trying to do. At this point I've mostly succeeded in making it show what it should show, which is a good start.

Even though the parts are in principle freely available, it is still a pretty massive undertaking. Just the database of current English language articles is around 2GB. And then there are the pictures. They offer in one download the pictures that are in the "Commons" section, i.e. they're public domain. That's around 2GB too. But most of the pictures are considered Fair Use, meaning they're just being used without anybody particularly having gotten a license for them. So, they can't just share them the same way. But I can still go and download them, of course, just one at a time. I set up a program to pick up about 1 per second. That is considered decent behavior in that kind of matter. It might sound like a lot, but it shouldn't be much of a burden for the server. For example, the Ask Jeeves/Teoma web spider hits my own server about once per second, all day, every day, and that's perfectly alright with me. Anyway, the Wikipedia picture pickup took about a week like that, adding up to something like 20GB.
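To make the "about 1 per second" idea concrete, here is a minimal sketch in Python of a throttled pickup loop. It is not the program I actually ran. The names polite_fetch and fetch_with_urllib are made up for illustration, and the clock and sleep parameters are only there so the pacing can be checked without waiting in real time.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple
from urllib.request import urlopen


def polite_fetch(
    urls: Iterable[str],
    fetch: Callable[[str], bytes],
    min_interval: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[Tuple[str, bytes]]:
    """Yield (url, body) pairs, starting at most one request per min_interval seconds."""
    last_start = None
    for url in urls:
        if last_start is not None:
            # Wait out whatever is left of the interval since the previous request began.
            wait = min_interval - (clock() - last_start)
            if wait > 0:
                sleep(wait)
        last_start = clock()
        yield url, fetch(url)


def fetch_with_urllib(url: str) -> bytes:
    """Download one URL with the standard library (hypothetical helper)."""
    with urlopen(url, timeout=30) as response:
        return response.read()
```

It would be used along the lines of "for url, body in polite_fetch(image_urls, fetch_with_urllib): ...", where image_urls is whatever list of picture addresses has been collected. As a sanity check on the numbers: a week is 7 x 86,400 = 604,800 seconds, so at one request per second 20GB works out to an average of roughly 33KB per picture, assuming the loop was busy the whole time.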

Okay, that's the data. But what the database contains is wiki markup. And what the Wikipedia/MediaWiki system uses is pretty damned extensive markup, with loads of features, templates, etc., which needs to be interpreted to show it as a webpage. My first attempt was to try the MediaWiki software which Wikipedia runs on, which I can freely download and easily enough install. But picking out pieces of it is quite a different matter. It is enormously complex, and everything is tied to everything else. I tried just picking out the parsing module. It depended on some other modules, which depended on yet other modules, and pretty soon it became unwieldy, and I just didn't understand it. Then I looked at some of the other pieces of software one can download which are meant to produce static copies of Wikipedia. They're very nice work, but they either didn't do it quite like I wanted, or didn't work for me, or were missing something important, like the pictures. So I ended up mostly doing it from scratch, based on the Wikipedia specs for the markup, although I also learned a number of things from wiki2static, a Perl program which does an excellent job of parsing the Wikipedia markup, in a way I can actually understand. It still became a very sizable undertaking. I had a bit of a head start in that I've previously made my own wiki program, which actually uses a subset of Wikipedia's markup.
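To give a feel for what "interpreting the markup" means, here is a small sketch in Python that handles only a tiny subset: bold, italics, internal links and headings. It is not my actual parser and not how MediaWiki or wiki2static do it. The function name render_line and the /wiki/ link prefix are assumptions made up for the example, and templates, tables, lists, images and nested quote markup are left out entirely.

```python
import html
import re
from urllib.parse import quote


def _link(match):
    """Turn a [[Target]] or [[Target|label]] match into an HTML link."""
    target = match.group(1).strip()
    label = (match.group(2) or target).strip()
    # Assumed URL scheme: spaces become underscores under a /wiki/ prefix.
    href = "/wiki/" + quote(html.unescape(target).replace(" ", "_"))
    return f'<a href="{href}">{label}</a>'


def render_line(line: str) -> str:
    """Render one line of (a small subset of) wiki markup as HTML."""
    # Escape <, > and & first, but leave quotes alone: they are markup here.
    text = html.escape(line, quote=False)
    text = re.sub(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]", _link, text)
    # Bold must be handled before italics, since ''' contains ''.
    text = re.sub(r"'''(.+?)'''", r"<b>\1</b>", text)
    text = re.sub(r"''(.+?)''", r"<i>\1</i>", text)
    heading = re.fullmatch(r"(={2,6})\s*(.+?)\s*\1", text)
    if heading:
        level = len(heading.group(1))
        return f"<h{level}>{heading.group(2)}</h{level}>"
    return text
```

For example, render_line("See [[New York City|NYC]] for ''details''.") gives 'See <a href="/wiki/New_York_City">NYC</a> for <i>details</i>.'. Even this toy version shows why the order of the rules matters, and the real markup has far more rules than this.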

As it says on the Wikipedia download site:
These dumps are not suitable for viewing in a web browser or text editor unless you do a little preprocessing on them first.
A "little preprocessing", ha, that's good. Well, a few thousand lines of code and a few days of non-stop server time seem to do it.

Anyway, it is a little mindblowing how cool it is that masses of valuable information are freely shared, and that with "a little preprocessing" one can use it in different contexts, build on top of it, and do new and different things, without having to reinvent the wheel first.

But the people who make these things available in the first place need support. Volunteers, contributors, bandwidth, money.



4 comments

24 Feb 2005 @ 00:47 by Ge Zi @24.126.199.23 : so ......
.... why exactly did you then reinvent the wheel, Flemming?
But I sure appreciate the amount of work something like that must be. I am just now about through with the one website which I used to get more familiar with PHP and where I even use xmlHTTP!!! And not only that - it even works. But this combination of JavaScript and PHP was really a killer - similar enough to get confused and dissimilar enough to not work.
I still sometimes forget the $ in front of variables, and PHP doesn't remind me either.
Say, is there any better way to debug a script than to put in var_dumps, upload, and run again? Especially if a form is involved that needs filling out again and again until all the bugs are out ;-)



24 Feb 2005 @ 02:40 by ming : Coding
I don't use anything better. I put various things into html comments, to check that things are alright. But no real debugger or anything. Oh, it exists, I just haven't found a great need.

On reinventing the wheel... well, if I only take a copy of Wikipedia, while rewriting the code, and making it do the same thing, yeah, not much need for that in itself. But having the database locally opens up new possibilities. Like wikipedizing text. You know, take a piece of text and automatically highlight all the words found in the encyclopedia as links that you can click on. Somebody else thought of that already, so that's not new either. But adding different ways of searching it is an obvious start. Full text search, thumbnail previews, etc. Anyway, part of what I'm trying to do is to collect a number of different data sources and find useful ways of combining them or cross-relating them.
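A rough sketch of what wikipedizing could look like, in Python. This is only an illustration, not something I have built. The function name wikipedize, the set of titles and the /wiki/ link prefix are assumptions. One big regular expression would not scale to the full list of article titles, where a database lookup or a trie would be needed, and titles that begin or end with punctuation would need extra care.

```python
import html
import re
from typing import Set
from urllib.parse import quote


def wikipedize(text: str, titles: Set[str], base_url: str = "/wiki/") -> str:
    """Return HTML where each phrase matching a known article title is a link."""
    if not titles:
        return html.escape(text)
    # Longest titles first, so "New York City" wins over "New York".
    ordered = sorted(titles, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in ordered) + r")\b",
        re.IGNORECASE,
    )
    # Map lowercased matches back to the title as it is spelled in the index.
    canonical = {t.lower(): t for t in titles}

    pieces = []
    pos = 0
    for m in pattern.finditer(text):
        pieces.append(html.escape(text[pos:m.start()]))
        title = canonical.get(m.group(0).lower(), m.group(0))
        href = base_url + quote(title.replace(" ", "_"))
        pieces.append(f'<a href="{href}">{html.escape(m.group(0))}</a>')
        pos = m.end()
    pieces.append(html.escape(text[pos:]))
    return "".join(pieces)
```

Calling wikipedize("Paris is in France.", {"Paris", "France"}) turns both place names into links and leaves the rest of the sentence alone.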

Sometimes it can be fun to take apart somebody else's wheel, and put it together a little differently, and discover something new.  



24 Feb 2005 @ 06:10 by Ge Zi @24.126.199.23 : jawoll
That I can understand now.
I liked to dissect code, but mostly when I was paid for doing that by the hour ;-)
I sometimes get so sucked into this stuff - it's terrible - an addiction, really.

My very first big project for the Japanese company I contracted with for so long was something like that. I got that xlisp interpreter and built on top of it, but first I had to understand how it worked, and this thing is one recursive sucker! No chance to trace, because you never knew how deep the recursion was.
At that time the idea of open source was very new to me - and I guess not only to me - and I remember sometimes feeling bad that I just took that code and used it - but yes, each source file credited the original author :-)



24 Feb 2005 @ 16:51 by ming : Code Addictions
See, I'm a bit addicted to thinking it would be better doing it myself when I look at other people's code. Oh, I can learn some new tricks, but often I find it a bit unbearable to have to change somebody else's code, when there's a lot of it, and it is all over the place. So, after a few days of pulling my hair out, I get the strong urge to start all over and do it myself. Which I've done many times, and then I end up being stuck with it.

But whenever there's a neatly packaged library for doing something, I'm all for using it. Even though I easily forget that too. I was just recently doing a thing in C for an RSS aggregator that needed to pick up large numbers of feeds in parallel. And I don't normally work in C. Somehow I decided to do everything from scratch. Which I did, but the code for picking up a simple file over the web is hundreds of lines, and there are some complex things to deal with, like chunked transfer encoding, and I have all the normal C problems of maybe having forgotten to check for a null pointer or something. Whereas there are some perfectly nice libraries, like libwww, that reduce the job to just a few lines. And I could have spent the time on something that wasn't already invented.
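To show the kind of detail that piles up when doing it by hand, here is a sketch of just the chunked transfer decoding step. It is in Python rather than C, and it is not the code from my aggregator. The function name decode_chunked is made up, and it assumes the whole response body is already in memory and ignores any trailer headers.

```python
def decode_chunked(body: bytes) -> bytes:
    """Reassemble an HTTP/1.1 body that was sent with Transfer-Encoding: chunked."""
    out = bytearray()
    pos = 0
    while True:
        line_end = body.find(b"\r\n", pos)
        if line_end == -1:
            raise ValueError("truncated chunk header")
        # The size is hexadecimal; anything after ';' is a chunk extension.
        size_field = body[pos:line_end].split(b";", 1)[0].strip()
        size = int(size_field.decode("ascii"), 16)
        pos = line_end + 2
        if size == 0:
            # A zero-sized chunk marks the end; optional trailers are ignored here.
            return bytes(out)
        chunk = body[pos:pos + size]
        # Each chunk's data must be followed by its own CRLF.
        if len(chunk) < size or body[pos + size:pos + size + 2] != b"\r\n":
            raise ValueError("truncated or malformed chunk")
        out += chunk
        pos += size + 2
```

With a library none of this is visible. In Python, for instance, urllib.request.urlopen(url).read() hands back the reassembled body, which is the same point as with libwww in C.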










Link to this article as: http://ming.tv/flemming2.php/__show_article/_a000010-001480.htm
Main Page: ming.tv