We write for posterity

The Google Analytics numbers for this blog are dismal. (Hi Mom! Hi Friends!) I think it’s because right now I don’t have much to say that’s both unique and interesting. Partly this is because so much of my life is off limits: I don’t want to talk about the joys & cares of raising a family, and I musn’t talk about the joys & cares of raising a new product. What’s left are comments on the general state of the web, and essays on general topics like this one.

Why write then, and who am I writing for? I write because something inside me compells me to, and because it helps me think to get my ideas down in written form. Who do I write for? From my Analytics numbers it’s clear that I’m writing primarily for search engines (Hi Googlebot!) rather than people. And that’s something interesting to think about: Baring a world-wide disaster or cultural revoloution, what I write today will persist for thousands and probably even millions of years, and will be read countless times by search engines, and only occasionally, if at all, by people.

My words will be torn apart and merged with other web pages from other authors, becoming a mulch out of which new insights will be gleaned. (Hmm, not unlike how my body will be recyled when I die, its atoms used to make new things.)

Perhaps the last time my essay will ever be read by a live human is in some far distant future when some graduate student is writing an essay on early-web-era civilization, and is trying to find out what those poor benighted souls thought of the future. (Hi posterity!)

No doubt my words will be automatically translated from 21st- century English into whatever language wins the world-wide language wars. Perhaps my essay will even be automatically annotated, with a description of who I was, and a best guess at what I looked like, from searching the world’s photo archives. There will be footnotes and links to explain the archaic topics I’m referencing. “Search engine” - they used to store data in seperate computers, and brute-force building the search index. How primitive! How quaint!

And no doubt the grad-student-of-the-future will glance over my words, then move on to the hundreds of other essays on similar themes. (Good luck with your own essay, future-guy!)

3dMark price/performance charts

3DMark is a GPU/CPU benchmark used by PC gamers to measure system performance. Here are some great charts showing

My home computer system is very weak compared to these charts, except in one dimension, which is that my 1600 x 1200 display puts me in the top 10% of gamers. Woot!

While many people (myself included) have switched to laptops and/or all-in-ones, if you’re planning on building a new desktop, check out the Ars Technica system guide. The guide does a good job of speccing out a “Budget Box”, a “Hot Rod”, and a “God Box”, and it’s updated every quarter.

Languages that look interesting

Currently I’m reading up on the following computer languages:

Python - fun, easy to learn, batteries included

Boo - fun like Python, but with macros and type declarations so that it can run fast.

Erlang - very brief code. I’m impressed by how concise the Wings3D source code is.

Typed Scheme - Scheme with type checking. (Could in theory run fast.)

I may try implementing my old “Dandy” game in these languages to see how they feel.

Darwin Ports issue with "patch"

Ever since I’ve upgraded to Apple Macintosh OS X 10.5 Leopard, I’ve run into problems using the Darwinports “port” command to install new software.

The problem is that for some reason the version of GNU “patch” that I have installed in /usr/bin/patch is version 2.5.8, and it doesn’t operate the way that Darwin ports expects. A typical error message is:

---> Applying patches to erlang
Error: Target org.macports.patch returned: shell command " cd "/opt/local/var/macports/build/_opt_local_var_macports_sources_rsync.macports.org_release_ports_lang_erlang/work/erlang-R12B-0" && patch -p0 < '/opt/local/var/macports/sources/rsync.macports.org/release/ports/lang/erlang/files/patch-toolbar.erl'" returned error 2
Command output: Get file lib/toolbar/src/toolbar.erl from Perforce with lock? [y]
Perforce client error:
Connect to server failed; check $P4PORT.
TCP connect to perforce failed.
perforce: host unknown.
patch: **** Can't get file lib/toolbar/src/toolbar.erl from Perforce

Error: Status 1 encountered during processing.

The work-around is to define the environment variable POSIXLY_CORRECT=1 , as in:

POSIXLY_CORRECT=1 sudo port install erlang

Now, I’ve done some web searching, and I haven’t seen anyone else complaining about this problem, so perhaps there’s something odd about my setup.

Web scraping in Java, F#, Python, and not Lisp

Yesterday I wrote a web scraper. A web scraper is a program that crawls over a set of web pages, following links and collecting data. Another name for this kind of program is a “spider”, because it “crawls” the web.

In the past I’ve written scrapers in Java and F#, with good results. But yesterday, when I wanted to write a new scraper, I though I’d try using a dynamically-typed language instead.

What’s a dynamically-typed language you ask? Well, computer languages can generally be divided into two camps, depending on whether they make you declare the type of data that can be stored in a variable or not. Declaring the type up front can make the program run faster, but it’s more work for the developer. Java and F#, the languages I previously used to write a web scraper, are statically typed languages, although F# uses type inference so you don’t actually have to declare types very often – the computer figures it out for you.

In order to scrape HTML you need three things:

  1. a language
  2. a library that fetches HTTP pages
  3. a library that parses the HTML into a tree of HTML tags

Unless you’re using Mono or Microsoft’s Common Language Runtime, the language you choose will restrict the libraries that you can use.

So, the first thing I needed to do was choose a dynamic language. Since I just finished reading “Practical Common Lisp”, an excellent advanced tutorial on the Lisp language, I though I’d try using Lisp. But that didn’t work out very well at all. Lisp has neither a standard implementation nor a set of standard libraries for downloading web pages and parsing HTML. I did some Googling to try and find some combination of parts that would work for me. Unfortunately, it seemed that every web page I visited recommended a different combination of libraries, and none of the combinations I tried worked for me. In the end I just gave up in frustration.

Then, I turned to Python. I had not used Python much, but I knew it had a reputation as an easy-to-use language with a lot of easy-to-use libraries. And you know what? It really was easy! I did some web searches, copied some example code, and voila, I had a working web spider in about an hour. And the program was easy to write every step of the way. I used the standard CPython implementation for the language, Python’s built-in urllib2 library to fetch the web data, and the Beautiful Soup library for parsing the HTML.

How does the Python compare to Java and F# for web scraping?

Python Benefits:

  • Very brief, easy to write code
  • Libraries built in or easy to find
  • Lots of web examples
  • I didn’t have to think: I just used for loops and subroutine calls.
  • Very fast turn-around.
  • Easy to create and iterate over lists of strings.

Python non-issues for this application:

  • Didn’t matter that the language was slow, because this task is totally I/O bound.
  • Didn’t matter that the IDE is poor, using print and developing interactively was fine

F# Benefits:

  • Good IDE (Visual Studio)
  • Both URL fetching and HTML parsing libraries built in to CLR

F# Non-issues:

F# Drawbacks:

  • The CLR libraries for URL fetching and HTML parsing are more difficult to use than Python. It takes more steps to complete similar operations.
  • Strong typing gets in the way of writing simple code.
  • odd language syntax compared to Algol-derived languages.
  • Hard-to-understand error messages from the compiler.
  • Mixed functional/imperative programming is more complicated than just imperative programing.
  • The language and library encourages you to use advanced concepts to do simple things. In my web scraper I wrote a lot of classes and had methods that took complicated curried functions as arguments. This made the code hard to debug. In retrospect perhaps I should have just used lists of strings, the same as I did in Python. Since F# supports lists of strings pretty well, maybe this is my problem rather than F#’s. ;-)

Java benefits:

  • Good debugger
  • Good libraries
  • Multithreading

Java drawbacks:

  • Very wordy language
  • Very wordy libraries

Lisp drawbacks:

  • No standard implementation
  • No standard libraries

Looking to the future, I’d be interested in writing a web scraper in IronPython, which has good IDE support, and in C# 3.0, which has some support for type inference.

In any event, I’m left with a very favorable impression of Python, and plan to look into it some more. In the past I was put off from it because it was slow, but now I see how useful it is when speed doesn’t matter.

[Note: When I first wrote this article I was under the impression that CPython didn’t support threads. I since discovered (by reading the Python in a Nutshell book) that it does support threads. Once I knew this, I was able to easily add multi-threading to the web scraper. CPython’s threads are somewhat limited: only one thread is allowed to run Python code at a time. But that’s fine for this application, where the multiple threads spend most of their time blocked waiting for C-based network I/O ]

Hot Chips Conference archives

Curious about the internal designs of GPUs, CPUs, and game consoles? Tired of lame articles full of uninformed speculation and fanboy rants? Then check out the archives of the “Hot Chips” conference, an annual conference where computer chip designers get together to brag about their latest chips. The conference presentation slides are all online, and they’re full of good technical information on GPUs, CPUs. and even game consoles. Of course, presenters often gloss over any technical problems with their chips, so you won’t get the full picture. But these presentations offer a detailed technical look inside otherwise secret system architectures.

(For what it’s worth the web site is poorly organized, and many links are broken – you sometimes have to edit the URLs slightly to find the correct links.)

Some highlights:

Reality Co-Processor, Ken Hayes (Silicon Graphics, Inc.) - All about the Nintendo 64.

Gekko: A PowerPC compatible processor supporting high- performance 3D Graphics - Gamecube CPU

Multiple Cell Papers (PS3)

Xbox 360 System Architecture

Nintendo Wii security defeated by the Tweezer Attack

According to a presentation at the 24th Chaos Communication Congress, hackers have apparently been able to defeat the Nintendo Wii game console’s security system using tweezers to bypass the hardware memory protection.

The way it works is that the Wii runs in two modes: a GameCube emulation mode, which has access to just 1/8th of the total memory, and Wii mode, that has access to all the memory.

Hackers had already figured out how to run their own code in GameCube mode. So the trick was to run their code in GameCube mode, then use the tweezers to short out the address lines to allow the hacker’s code to access parts of rest of the memory. By shorting different address lines different portions of memory were made available. By collecting enough shards they eventually mapped all of memory.

Apparently the Wii operating system keeps its digital signature keys in this protected memory, and once the digital signatures were found it was possible to sign and run homebrew code on the Wii.

It is not clear to me whether the attack is a per-machine attack or a break-once-run-everywhere attack.

Searching for old web pages: The Wayback machine is cool!

I was searching for information on LINJ, a Lisp language that compiles into human-readable Java code. Unfortunately, the LINJ web site http://www.evaluator.pt/linj.html, is currently offline.

Luckily, it turned out that the Internet Archive Wayback machine had cached both that page, and the download files that that page had pointed to. Very cool!

Similarly, I was looking for the source to the Windows CE port of Quake 3, and found that the project’s web site http://www.noctemware.com/q3ce.html had been abandoned and taken over by spammers. Luckily the Wayback machine had cached both the original web page and the downloads.

Let this be a lesson to you aspiring open source developers out there: It’s better to store small open-source projects in a large “won’t-ever-go-away” source repository like SourceForge or Google Code than to use your own vanity domain hosting. Of course, even using a large popular repository is not failure-proof. Some large code repositories from the early days of the Internet, like DEC’s ftp site, have gone away after their owning company was bought by another company. Perhaps some sort of distributed system of discoverable git repositories is the answer.

In the case of Quake 3 for Windows CE, I’ve contacted the author, Christien Rioux, and with his kind permission I’ve set up a code.google.com project so that other people can more easily find the sources (and binaries): http://code.google.com/p/q3ce .

Why the Farmers won

I’ve been thinking about Lisp lately. A powerful language, with many excellent features, but not hugely successful. And I thought of one reason why:

The excellent book Guns, Germs and Steel hypothesizes that agriculture displaced hunting and gathering because people who practiced agriculture stayed in one place, and were able to have one child per year, as opposed to the hunter- gatherers, who had to wait until their children were old enough to walk before having another child. This reproductive rate difference was amplified by the ability for a given unit of land to support more farmers than hunters. As a result, agriculture displaced hunting even though individual hunters were far healthier (as seen by their skeleton height) than the farmers that displaced them.

So it is possible for a poorer technology to displace a better one, if it has compensating advantages. And I think that’s what’s hit Lisp. C and Java, which are each less powerful and more wordy than Lisp, are more successful for reasons other than power and brevity. Perhaps because both languages allow lots of reusable code to be written by ordinary programmers. Maybe Lispers are like healthy hunters, being displaced by hordes of sickly farmers.

Well, no doubt the Lispers will take some comfort in the Java and C programmers being displaced in turn by whatever language next becomes even more successful, just as hunter-gatherers may take pleasure in the trend that farmers are gradually being displaced by urban dwellers. (Come to think of it, Lispers have already seen their original competitor Fortran displaced by C/C++, and in turn much of C++ has been displaced by Java and C#.)

I think something similar is happening in productivity applications, as a generation of not-very-good-but-web-based productivity applications is displacing the Microsoft Office suite. (For example, I am typing this blog entry into a simple and ugly HTML-based web form rather than a beautiful Word document.)