Sebastian Kirsch: Blog

Wednesday, 28 September 2005

Don Knuth doesn’t validate

Filed under: — Sebastian Kirsch @ 11:30

Prof. Donald E. Knuth is one of the über-gods of computer science; most people know him as the author of numerous computer science papers, of “The Art of Computer Programming” and of the TeX typesetting system, and as the editor of the Journal of Algorithms and the ACM Transactions on Algorithms.

Knuth is known to be very fastidious in his work. The quality of the TeX system is legendary; Knuth pays a reward for every bug found in TeX, which started at $2.56 (one hexadecimal dollar), and doubled every year (until it was frozen at $327.68). TeX is essentially frozen, and has been in use unchanged for 20 years. While the first three volumes of The Art of Computer Programming have already been published, at the age of 67, Knuth is still preparing the remaining four volumes, expecting to work on them for another 20 years.
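The reward schedule is easy to check with a few lines of Python. This is only a sketch: it works in cents to avoid floating-point rounding, and the function name is made up for illustration.

```python
def doublings_until_cap(start_cents=256, cap_cents=32768):
    """Count how often the reward doubles before it reaches the cap."""
    reward, doublings = start_cents, 0
    while reward < cap_cents:
        reward *= 2
        doublings += 1
    return doublings, reward

doublings, final_cents = doublings_until_cap()
print(doublings, final_cents / 100)   # 7 doublings take $2.56 to $327.68
print(hex(256))                       # 256 cents is 0x100, "one hexadecimal dollar"
```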

Unfortunately, he is also known for underestimating timescales: When he started work on TeX, he expected to complete the project during his sabbatical; in fact, it took about eight years. The first draft of The Art of Computer Programming was supposed to be a book on compiler design, but turned into a 3000-page manuscript on the fundamentals of computing.

Recently, Knuth clashed with the administrators of the W3C Validator, a validation service for HTML documents. The complete email exchange started with this message. Knowing a little about Knuth’s personality and mode of work, I found it quite hilarious.

Oh, the reason why this message was not sent from Knuth’s own email address is that he doesn’t have one. This lucky person quit using email in 1990; he has a secretary for that.

(Thanks to ScottyTM)

Tuesday, 27 September 2005

Python for Science

Filed under: — Sebastian Kirsch @ 14:20

Lately, I’ve become very fond of the Python language for data analysis tasks. For small, off-the-cuff problems, I usually cobble something together on the command line, using a combination of various shell commands and command-line Perl. But when the problem becomes difficult enough to warrant writing a script for it, I usually switch to Python. One reason is that, when compared to Perl, Python code is much more aesthetically pleasing – to my eye, at least. Python is fun to write, without bouncing off the characters $@%{} all the time. Notice how those characters by themselves already look like a swearword?

The other reason is the number of modules that make data analysis a breeze. A good overview is on Konrad Hinsen’s page on scientific computing with Python. The modules I use most frequently are:

  • Numeric, a module for vectors, matrices and linear algebra. It supports everything from basic matrix operations like multiplication or inversion to eigenvector decomposition and singular value decomposition. Numeric is also the base for a couple of other modules. A very convenient feature of Numeric is “ufuncs”, or universal functions: Most binary operations (like addition, multiplication, etc.) and functions (like sine, cosine, etc.) are defined for any combination of scalars, vectors and matrices. If one operand is a scalar, the operation will be executed, using the scalar value, on every member of the matrix or vector. Likewise, functions are executed for every member of a vector or a matrix. Ufuncs also have special methods “reduce” and “accumulate” that accumulate the results of the function for every member of a matrix or vector.
  • Numarray is similar to Numeric (and maintained in the same SourceForge project), but faster for large arrays. It is the designated successor and supposed to be mostly compatible with Numeric, but lacks support for universal functions. A number of older packages still depend on Numeric, which is why both are still being maintained.
  • Scientific Python is a collection of modules for a wide range of applications, including vectors and tensors, quaternions, automatic differentiation, linear and non-linear regression, and statistics. It also contains interfaces to the MPI library (message-passing interface, for parallel computation) and other libraries, as well as visualization functions.
  • VPython is a simple 3D visualization module; I came across it when I wanted to prepare some animations for a talk. Scientific Python contains support for VPython for visualizations.
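To illustrate what “reduce” and “accumulate” do, here is a rough analogue that uses only the Python standard library. It is not Numeric’s API; it merely mimics the behaviour described above on a plain list.

```python
from functools import reduce
from itertools import accumulate
from operator import add

values = [1, 2, 3, 4]

# Scalar combined with a vector: the scalar is applied to every member.
scaled = [3 * v for v in values]

# "reduce" folds the whole vector into a single value with the binary function.
total = reduce(add, values)

# "accumulate" keeps every intermediate result of that fold.
running = list(accumulate(values, add))

print(scaled, total, running)
```

With Numeric itself, the last two steps correspond to add.reduce and add.accumulate applied to an array.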

In every case, there is probably another application more powerful or more suitable to the task; for example, one could probably use either Mathematica, Matlab or R for any of these tasks. There are also numerous C++ and Java libraries.

But there are some important differences:

  1. Python is free; this also applies to R and most libraries.
  2. Python has a reasonably shallow learning curve.
  3. Python allows me to pull modules for different tasks into a single application, something not possible with specialized tools. For example, I could do analysis on data stored in an SQL database as easily as I can do it for a text file.
  4. Python has an interpreter (unlike C++ or Java), allowing me to play with my code and my data, and try out different approaches.
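As a small illustration of the third point, the sketch below runs the same analysis function on numbers read from a text source and from an SQL table. The table name, the sample values and the in-memory stand-ins for a file and a database are all made up for this example.

```python
import io
import sqlite3

def mean(values):
    """Arithmetic mean of any iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)

# Source 1: a text "file" with one number per line (in-memory stand-in).
text_file = io.StringIO("1.0\n2.0\n6.0\n")
from_text = mean(float(line) for line in text_file)

# Source 2: an SQL table, here in an in-memory SQLite database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE samples (value REAL)")
db.executemany("INSERT INTO samples VALUES (?)", [(1.0,), (2.0,), (6.0,)])
from_sql = mean(row[0] for row in db.execute("SELECT value FROM samples"))
db.close()

print(from_text, from_sql)
```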

In summary, it is usually really fast and easy to whip up a simple Python application to solve my problems.

The following is an example of how to implement linear regression using Numeric, written before I discovered Scientific Python. Assuming that the variables x and y are matrices with 1 column and n rows:

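from Numeric import add   # x and y should be float arrays to avoid integer division

# Means of x and y
mx = add.reduce(x) / x.shape[0]
my = add.reduce(y) / y.shape[0]
# Sums of squared deviations: x against y, and x against itself
ssxy = add.reduce((x - mx) * (y - my))
ssxx = add.reduce((x - mx) * (x - mx))
# Slope b and intercept a of the regression line y = a + b * x
b = ssxy / ssxx
a = my - b * mx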
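For readers without Numeric at hand, the same calculation can be sketched in plain Python. The function name and the sample data (points on the line y = 1 + 2x) are made up for illustration.

```python
def linear_regression(xs, ys):
    """Least-squares fit of y = a + b * x; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    ssxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ssxx = sum((x - mx) ** 2 for x in xs)
    b = ssxy / ssxx          # slope
    a = my - b * mx          # intercept
    return a, b

a, b = linear_regression([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(a, b)   # the data lie exactly on y = 1 + 2x
```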

Another thing I needed recently was nonlinear regression – fitting a non-linear function to data points. The Python module Scientific.Functions.LeastSquares will do that, and it will even compute the first derivative of the function by itself (which is necessary for determining the Jacobian matrix used by the Levenberg-Marquardt algorithm. Or so I am told). The resulting program was as small as this:

First, one defines the model that is to be fitted on the data, as a regular Python function. This function takes the parameters to be fitted as a tuple in the first argument, and the x-value of a data point as the second argument. In my case, the data points were supposed to follow a power-law distribution:

def zipf(params, point):
    k = params[0]
    a = params[1]
    return a * pow(point, -k)

The data is assumed to be a list of 2-tuples, i.e. tuples (<x-value>, <y-value>). This list is fed to the regression function, along with the model and starting values for the parameters:

from Scientific.Functions.LeastSquares import leastSquaresFit
(res, chisq) = leastSquaresFit(zipf, (1, 100), data)

The variable “res” contains the parameter combination providing the best fit, while “chisq” is the sum of the squared errors, a measure for the quality of the fit. More complicated models (with more than two parameters) are also possible. In my case, I decided to stick with a simple power-law distribution – a more complicated model provided a better fit, but with parameters that were way off the mark. The problem of overfitting …
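To make the meaning of “chisq” concrete, the following sketch computes the sum of squared errors by hand for the power-law model. It does not use Scientific Python; the helper chi_squared and the data points are invented for this example.

```python
def zipf(params, point):
    k, a = params
    return a * pow(point, -k)

def chi_squared(model, params, data):
    """Sum of squared differences between the model and the observed y-values."""
    return sum((model(params, x) - y) ** 2 for x, y in data)

# Made-up data points that lie exactly on y = 100 / x.
data = [(1, 100.0), (2, 50.0), (4, 25.0)]

perfect = chi_squared(zipf, (1, 100), data)   # correct exponent k = 1
worse = chi_squared(zipf, (2, 100), data)     # wrong exponent k = 2
print(perfect, worse)
```

The fitting routine searches for the parameter tuple that makes this value as small as possible.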

I also implemented simple interval arithmetic in Python, in a fashion roughly similar to this module by Kragen Sitaker, but with support for most operators and mixed arithmetic (where one argument is an interval and one is a scalar). Because Python has an interpreter, I can simply use it as a calculator for intervals.
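A minimal sketch of the idea might look like the following. It is neither my original code nor Kragen Sitaker’s module; the class covers only addition, subtraction and multiplication, and it ignores floating-point rounding of the interval bounds.

```python
class Interval:
    """Closed interval [lo, hi] with basic arithmetic."""

    def __init__(self, lo, hi=None):
        self.lo = lo
        self.hi = lo if hi is None else hi

    @staticmethod
    def _coerce(other):
        # Treat a plain number x as the degenerate interval [x, x].
        return other if isinstance(other, Interval) else Interval(other)

    def __add__(self, other):
        other = self._coerce(other)
        return Interval(self.lo + other.lo, self.hi + other.hi)

    __radd__ = __add__   # scalar + interval

    def __sub__(self, other):
        other = self._coerce(other)
        return Interval(self.lo - other.hi, self.hi - other.lo)

    def __rsub__(self, other):   # scalar - interval
        return self._coerce(other) - self

    def __mul__(self, other):
        other = self._coerce(other)
        products = [self.lo * other.lo, self.lo * other.hi,
                    self.hi * other.lo, self.hi * other.hi]
        return Interval(min(products), max(products))

    __rmul__ = __mul__   # scalar * interval

    def __eq__(self, other):
        other = self._coerce(other)
        return (self.lo, self.hi) == (other.lo, other.hi)

    def __repr__(self):
        return f"Interval({self.lo}, {self.hi})"


x = Interval(1, 2)
print(2 * x + 1)                         # mixed arithmetic with scalars
print(x - x)                             # not zero: the bounds are independent
print(Interval(-1, 2) * Interval(3, 4))
```

Typed into the interactive interpreter, expressions like these give the “calculator” use mentioned above.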

Sunday, 25 September 2005

Frank Fremerey: Blütenraum

Filed under: — Sebastian Kirsch @ 23:24

Since 22 September, a photo exhibition by Frank Fremerey titled “Blütenraum” has been running at the Knusperhäuschen. A nice opportunity to treat yourself to a cup of tea there and enjoy the photos.

Address: Am Dreieck 3, Bonn-Innenstadt (pretty much exactly between Münsterplatz and Friedensplatz)

I know Frank from my time at c’t, where he worked as an editor back then. On my first day in the editorial office, he was the only one who walked up to me in the corridor, shook my hand and said: “Hello, you’re new here? I’m Frank. See you around! Have fun!” Later I ran into him again in Bonn: He lives in the same building that houses the company I work for. Frank is one of those people you keep bumping into every now and then – currently mostly in the stairwell or on the street…

Friday, 09 September 2005

One year of blogging!

Filed under: — Sebastian Kirsch @ 10:05

Even I have to meta-blog every once in a while: The first article on this blog was published on 20th Aug. 2004, which means that I have been blogging for over a year.

I have written 131 articles during this time, statistically speaking a little more than one every three days. This is roughly the posting frequency I intended when I started this blog. The longest article was 7202 characters long, the shortest 93 characters, with an average length of about 1600 characters.
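The posting frequency is easy to verify with Python’s datetime module. The dates and the article count are taken from the text above; the variable names are just for this sketch.

```python
from datetime import date

first_post = date(2004, 8, 20)
today = date(2005, 9, 9)
articles = 131

days = (today - first_post).days
days_per_article = round(days / articles, 2)
print(days, days_per_article)   # 385 days, about 2.94 days per article
```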

However, things look a little different when one looks at the distribution of articles by month:

month # articles
8/2004 2
9/2004 3
12/2004 26
1/2005 31
2/2005 24
3/2005 16
4/2005 10
5/2005 8
6/2005 9
7/2005 4
8/2005 5

What’s that? Has my blogging hype peaked in the first months of this year? Or is this a completely normal development? Have I run out of topics to post about?

Or might it simply be due to the fact that I started preparing my diploma thesis in March? And that a) not very much has been happening for me since then, because I’ve been busy working on my thesis, and b) when I come home in the evening, I usually have other things on my mind than writing even more?

I guess we’ll find out when I hand in my thesis. Which is in another two months. Stay tuned.

Wednesday, 07 September 2005

More things to see and do

Filed under: — Sebastian Kirsch @ 10:03

From 17 September 2005 to 8 January 2006, the Lenbachhaus in Munich is showing a Franz Marc retrospective. On display are more than 100 paintings, 145 works on paper, as well as sculptures and decorative arts. According to the website, it is the largest retrospective to date of the artist, who was killed in the First World War at the age of 36.

There is a new Jim Jarmusch film called “Broken Flowers”, starring Bill Murray. Spiegel Online has a very positive review.

Already showing in cinemas is Das wandelnde Schloss (Howl’s Moving Castle), a new film by Hayao Miyazaki, the director of classics such as “Prinzessin Mononoke”, “Spirited Away” and “Tonari no Totoro”. There is probably not much more to say about that.

Sunday, 04 September 2005

Max-Ernst-Museum in Brühl

Filed under: — Sebastian Kirsch @ 22:27

As WDR reports, the Max-Ernst-Museum has just opened in Brühl. I saw Max Ernst’s work many years ago at the Lenbachhaus in Munich and was thrilled. I am all the more pleased to have a collection of his work close by now.

In my opinion, this adds another highlight to the already richly endowed museum landscape of Cologne and Bonn. Anyone who has seen their fill of the Museum Ludwig, the Wallraf-Richartz-Museum, the Bundeskunsthalle, the Kunstmuseum Bonn and the various other museums in the region is bound to find something new here.

Provided they can find the museum at all. Brühl, where is that?

For non-Rhinelanders: Brühl lies exactly halfway between Bonn and Cologne. It’s where Phantasialand is.

For people from Bonn or Cologne: Brühl is where the Regionalexpress between Bonn and Cologne stops.

Copyright © 1999–2004 Sebastian Marius Kirsch, all rights reserved.