Ming the Mechanic
The NewsLog of Flemming Funch

Wednesday, June 4, 2003

 Self-repairing computers
Scientific American has an article about self-repairing computers. Sounded promising, but instead it illustrates how we're in dire need of a paradigm change in computer design. The thinking of the researchers mentioned in the article is that reboots usually fix things, but a full reboot takes too long, so we need micro-reboots, where smaller portions of the system can be rebooted by themselves. Well, sure, that's maybe a good thing, but if that's the best we can look forward to in terms of fault-tolerant computers, that is incredibly lame. Then Microsoft can come up with a micro-reinstall that will reinstall parts of your system several times each second.

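Not that the micro-reboot idea is hard to picture. Here's a rough doodle of it (my own made-up toy, not anything from the article): a little supervisor that, when one component gets wedged, throws away and rebuilds just that component while the rest of the program keeps running.

class Component:
    """A small unit of the program that can be torn down and rebuilt on its own."""
    def __init__(self, name):
        self.name = name
        self.healthy = True

    def work(self):
        if not self.healthy:
            raise RuntimeError(f"{self.name} is wedged")
        return f"{self.name}: ok"

class Supervisor:
    """Micro-reboot: when one component fails, rebuild only that component."""
    def __init__(self, names):
        self.components = {n: Component(n) for n in names}

    def run_once(self):
        for name, comp in list(self.components.items()):
            try:
                print(comp.work())
            except Exception as err:
                print(f"restarting just {name} after: {err}")
                self.components[name] = Component(name)  # rebuild this piece only

sup = Supervisor(["network", "disk", "ui"])
sup.components["disk"].healthy = False   # pretend the disk component got wedged
sup.run_once()   # disk gets rebuilt; network and ui never notice
sup.run_once()   # everything healthy again

That's nice as far as it goes, but notice that the supervisor itself is still a single point of failure.
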
What would be more interesting would be to rethink the way we do most of our technology. Most devices we use have single points of failure, and we've somehow ended up designing our software in a similar fashion. Your computer is full of tiny little copper wires, and cutting even one of a great many of them would make the computer stop functioning. If you send it in for repair, the answer will be that you need to buy a new motherboard. Even if you don't break anything, just cutting the power for 1/10 of a second will bring your computer down. And as far as the software goes, a misplaced comma, or a zero that should be a one, is often enough to bring down the whole thing. It just seems so primitive.

Compare that with organic life forms. Look at anything that is alive and you'll find that it is self-repairing and extremely fault-tolerant. Most animals will keep going even with missing limbs, wounded, being fed crap, and in unfamiliar circumstances. The only thing that will bring an organism down is some kind of systemic failure. Not just losing a few million cells, but several big things going wrong at the same time, sabotaging how the whole thing coordinates its activities.

Why can't I have technology like that? Why can't I use programs like that? Stuff that keeps working, even with heavy damage. Well, as a programmer I can of course start answering that: we don't know how to do it, except on a small scale. We can make servers that have multiple hot-swappable power supplies. We can, with some effort, set up mirrored servers that will take over for each other. We can set up battery-backed power supplies that take over when the power goes out. And in software we can divide the functionality into "objects" that each check and double-check everything and try to recover from problems. But it isn't any pervasive philosophy, and it is built on a fragile foundation. If we are concerned about something needing to always be up, we might put in two or three of a certain component. But even when that mostly works, it is a very feeble attempt at fault tolerance. Nature's way seems to be to have thousands or millions of independent but cooperating units, each having a knowledge of what needs to be done, but each going about it slightly differently.

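To make the contrast concrete, here is about the closest we usually get to nature's approach, a toy version of what's sometimes called N-version programming (again my own made-up example): several independently written pieces of code all doing the same job slightly differently, with the answers put to a vote, so that one broken unit doesn't matter much.

from collections import Counter

# Three independently written versions of the same job,
# each going about it slightly differently.
def sum_by_loop(xs):
    total = 0
    for x in xs:
        total += x
    return total

def sum_builtin(xs):
    return sum(xs)

def sum_backwards(xs):
    return sum(reversed(xs))

def redundant_sum(xs, versions=(sum_by_loop, sum_builtin, sum_backwards)):
    """Run every version, ignore the ones that crash, let the majority answer win."""
    answers = []
    for version in versions:
        try:
            answers.append(version(xs))
        except Exception:
            pass  # one broken unit shouldn't take the whole thing down
    if not answers:
        raise RuntimeError("every version failed")
    winner, votes = Counter(answers).most_common(1)[0]
    return winner

print(redundant_sum([1, 2, 3, 4]))   # 10, even if one version were buggy or crashing

Three hand-written versions of a sum is obviously a far cry from millions of cells, but it gives the flavor.
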
Translated to the software world, does that mean we ought to write all programs as a large number of independent agents or objects or even viruses that somehow work together in getting things done? Or would neural networks do the trick? I wish I knew.
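
Here is a crude doodle of that agent idea, anyway (just a made-up toy, not a real architecture): a pile of tasks and a bunch of interchangeable agents, where any individual agent can crash mid-task and the work still gets finished, because the task simply goes back on the pile for somebody else.

import random

random.seed(1)

tasks = list(range(20))                      # the work to be done
agents = [f"agent-{i}" for i in range(10)]   # interchangeable workers, none of them special
completed = {}

while tasks:
    task = tasks.pop(0)
    agent = random.choice(agents)
    if random.random() < 0.3:    # this agent crashes in the middle of the task
        tasks.append(task)       # no drama: the task goes back on the pile for someone else
    else:
        completed[task] = agent

print(f"all {len(completed)} tasks done; no single agent was essential")
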
[ 2003-06-04 18:43 | 5 comments | PermaLink ]