Sebastian Kirsch: Blog

Monday, 31 July 2006

Estimating the PageRank distribution

Filed under: — Sebastian Kirsch @ 23:51

During my diploma thesis, I worked extensively with the PageRank algorithm – a fairly well-understood algorithm by now. But one problem consistently bugged me: What is the distribution of PageRank values on a graph?

We know quite a lot about distributions in all kinds of networks, for example social networks or communication networks. They are power-law networks; we know their degree distribution, we know the distribution of component sizes, and we even know how to generate them artificially. But what is the distribution of PageRank values when we apply the algorithm to such a graph?

A number of sources (Pandurangan et al., 2002; Donato et al., 2004; Craswell et al., 2005) suggest that PageRank follows a power law. These sources usually present a picture such as the following to support their claim (taken from Donato et al., 2004):

[PageRank distribution from Donato et al., 2004]

There are two problems with this kind of illustration. The first is parameter estimation: How did they estimate the parameters for the power-law distribution? The second is less obvious: Is it even appropriate to display PageRank values in such a way, and can you infer something about PageRank from this kind of diagram?

On the horizontal axis of this diagram, you have the PageRank values, and on the vertical axis, you have the number of vertices with this PageRank value. This kind of diagram is often used for degree distributions: You plot the vertex degree and the number of vertices with this degree on a doubly logarithmic scale, and if it comes out as a straight line, you have a power-law distribution.
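For illustration, the counting step behind such a degree-distribution diagram can be sketched in a few lines of Python. The toy adjacency list here is invented; a real experiment would use the actual graph data:

```python
# Count how many vertices have each degree; on a real data set one would
# then plot this histogram on a doubly logarithmic scale and look for a
# straight line. The graph below is a tiny invented example.
from collections import Counter

graph = {
    "a": ["b", "c", "d"],
    "b": ["a"],
    "c": ["a"],
    "d": ["a"],
}

degrees = [len(neighbours) for neighbours in graph.values()]
histogram = Counter(degrees)  # degree -> number of vertices with that degree

# With matplotlib, the power-law check would then be something like:
#   plt.loglog(sorted(histogram), [histogram[k] for k in sorted(histogram)], "o")
print(histogram)
```

This works for degrees precisely because they are integers: every vertex falls into a natural bin.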

The difference between degree distributions and a PageRank distribution is that degree distributions are discrete, but PageRank is a continuous measure. There are no discrete “bins” into which to sort the vertices according to their PageRank value. By counting how many vertices have a certain PageRank value, you implicitly discretize the values. What are the criteria for this discretization? The computational accuracy of your platform? The termination condition of your PageRank implementation? Some other discretization? Are the intervals all of the same length, or do they have different lengths?

These questions are not addressed in the cited papers. Therefore, I tried to find a way to estimate the distribution of PageRank without using artificial discretization.

The diagram above is an example of a probability mass function. It is very easy to plot the probability mass function from a sample for a discrete distribution: You simply count the number of occurrences of the discrete values in your sample.

Unfortunately, there is no probability mass function for continuous distributions – because there is nothing to count. There is just the probability density function, which is much harder to estimate from a sample. Usually, you estimate the probability density function from a sample by using a kernel density estimator: You presume that each of your samples is generated by a normal distribution centered on the data point, with a fixed variance, and add up all those normal distributions to arrive at an estimate of the probability density function for the data set.
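The recipe just described can be written down directly. This sketch (with an invented sample and a fixed bandwidth h; a real kernel density estimator would choose h from the data) simply averages a Gaussian density centred on each data point:

```python
# Kernel density estimation by hand, exactly as described in the text:
# centre a normal density with fixed bandwidth h on every data point and
# average them. Sample and bandwidth are invented for illustration.
import numpy as np

def kde(sample, grid, h=0.3):
    """Average of Gaussian kernels centred on each data point."""
    sample = np.asarray(sample)[:, None]  # shape (n, 1) for broadcasting
    kernels = np.exp(-((grid - sample) ** 2) / (2 * h * h))
    kernels /= h * np.sqrt(2 * np.pi)     # normalize each Gaussian
    return kernels.mean(axis=0)           # estimated density on the grid

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=1.0, size=500)
grid = np.linspace(2, 8, 61)
density = kde(data, grid)

print(grid[density.argmax()])  # the peak should lie near the true mean
```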

But kernel density estimation has two problems. It breaks down if the distribution is very dissimilar to a normal distribution – for example a power-law distribution: The vast majority of data points for a power-law distribution will be at one extreme of the distribution, and a small number will be at the other extreme. But it is precisely this small number of data points we are interested in, since they determine (for example) the structure of the network. Kernel density estimation is prone to smooth over those data points. The second problem is that this gives you only an estimate of the density function, but it does not give you parameters for a density function in closed form. If you estimate those parameters from the estimated density function, you are estimating twice – with dismal prospects for the accuracy of your estimate.

Which brings us back to the original question: How do you estimate the parameters for a given probability distribution?

Basically, I know of three methods: a) a priori estimation, b) curve fitting, or c) using an estimator for the given distribution.

Method a) can be used to derive a distribution from a generating model for the given graph – for example, it can be used to prove that the degree distribution of an Erdős-Rényi graph follows the Poisson distribution. However, my understanding of random graph processes and stochastics is not sufficient to go this route.

Method b) is what I used for estimating parameters of discrete probability distributions in my thesis: I plotted the probability mass function of the distribution and used non-linear least squares regression to fit the mass function of a known distribution (the Zeta distribution in my case) to my data points. While this is not a stochastically valid estimate, it worked well enough for my purposes.

In order to apply this method, I needed to find a suitable distribution, and a suitable representation of my sample to fit the distribution to. A likely candidate for the distribution is the Pareto distribution, the continuous counterpart to the Zeta distribution. I could not use the probability density function for estimating the parameters, because of the problems outlined above. But it is very easy to derive the cumulative distribution function for a sample of continuous values – again by counting the number of data points below a certain threshold (thresholds can be set at each data point).
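Computing this empirical cumulative distribution function takes only a sort and a running count; a minimal sketch, with a hand-made sample:

```python
# Empirical CDF of a continuous sample: at each data point, the fraction
# of observations less than or equal to it. No artificial discretization
# is needed - the thresholds are the data points themselves.
import numpy as np

def ecdf(sample):
    """Return sorted values and the empirical CDF evaluated at them."""
    xs = np.sort(sample)
    ys = np.arange(1, len(xs) + 1) / len(xs)
    return xs, ys

xs, ys = ecdf([0.2, 0.1, 0.4, 0.3])
print(xs.tolist(), ys.tolist())  # [0.1, 0.2, 0.3, 0.4] [0.25, 0.5, 0.75, 1.0]
```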

A disadvantage of the cumulative distribution is that the power-law relationship is not as easy to see as it is with the probability mass or density functions: For the density or mass function, you simply plot on a doubly logarithmic scale, and if it is a straight line: power-law distribution. The cumulative distribution is therefore at a definite disadvantage in terms of visual impact.

But it has the advantage that I do not have to estimate it, and that I can fit the cumulative distribution function of the Pareto distribution to it, using, again, non-linear least squares regression. Unfortunately, this did not give very good results: I could not find a set of parameters that fit this distribution well.
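A sketch of what such a fit looks like, assuming scipy.optimize.curve_fit as the regression routine and synthetic Pareto data in place of the actual PageRank values (on synthetic data the fit recovers the exponent; on my real data it did not work nearly as well):

```python
# Fitting the Pareto CDF, F(x) = 1 - (xm / x)**alpha, to the empirical
# CDF by non-linear least squares. The sample is synthetic, drawn from a
# known Pareto law (xm = 1, alpha = 2.5), purely for illustration.
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(1)
sample = np.sort(1.0 + rng.pareto(2.5, size=2000))  # classical Pareto, xm = 1

xm = sample.min()  # fix the location parameter at the sample minimum
ecdf = np.arange(1, len(sample) + 1) / len(sample)

def pareto_cdf(x, alpha):
    """CDF of the Pareto distribution with fixed minimum xm."""
    return 1.0 - (xm / x) ** alpha

params, _ = curve_fit(pareto_cdf, sample, ecdf, p0=(1.0,))
alpha_hat = float(params[0])
print(round(alpha_hat, 2))  # close to the true exponent of 2.5
```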

Therefore, method c) remains: Using an estimator for the distribution. For example, for the normal distribution, the sample mean is a maximum likelihood estimator for the mean of the distribution, and the sample variance is a maximum likelihood estimator for the variance.

For the Pareto distribution, a maximum likelihood estimator for the minimum possible value is the minimum of the sample. An estimator for the exponent is given in the article on the Pareto distribution cited above.
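A minimal sketch of these estimators – the sample minimum for the location parameter, and the standard maximum likelihood formula alpha = n / Σ ln(x_i / x_min) for the shape. The tiny sample is hand-made and only exercises the code:

```python
# Maximum likelihood estimators for the Pareto distribution: the location
# parameter is the sample minimum, and the shape parameter is
#   alpha_hat = n / sum(ln(x_i / xm_hat)).
import math

def pareto_mle(sample):
    xm = min(sample)
    n = len(sample)
    alpha = n / sum(math.log(x / xm) for x in sample)
    return xm, alpha

xm, alpha = pareto_mle([1.0, 1.5, 2.0, 4.0])
print(xm, round(alpha, 3))
```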

Using this method, I could find parameters for the Pareto distribution for my data set. The final estimate looked like this; the dashed line is the estimate, and the solid line is the real cumulative distribution. It is not a perfect fit, but it seems to go in the right direction:

[cumulative distribution function]

The maximum likelihood estimator seems to over-estimate the exponent; it puts the exponent at about 3.78. Using curve fitting, I got an even higher exponent: about 4.61, which is (visually at least) a worse estimate than the admittedly questionable one I derived with the maximum likelihood estimator.

For comparison, here is the estimated probability density function, along with a scatterplot of the data. It is rather useless for assessing the quality of the estimate, but it is visually pleasing, and gives a certain idea of the nature of the distribution:

[probability density function]

I cannot compare my results directly with the results cited above for the web graph, since I do not have the necessary dataset. I used a social network with about 1300 nodes – orders of magnitude smaller than the web graph, but it should follow a similar distribution. I do however hope that it is a valid method for estimating the parameters – and does not share the fundamental problem of treating PageRank as a discrete variable.

Two questions remain: Is the Pareto distribution really the distribution governing PageRank, or is there some other factor? And what would the results be for a larger graph – ideally the web graph? Would the quality of the estimate be better, or would it be worse? Do we perhaps have to look in a completely different direction for the distribution of PageRank?

UPDATE: Some further reading: Problems with fitting to the power-law distribution (Goldstein et al., 2004), Estimators of the regression parameters of the zeta distribution (Doray & Arsenault, 2002), Power laws, Pareto distributions and Zipf’s law (Newman, 2005).

Sunday, 09 April 2006

London, Day 1

Filed under: — Sebastian Kirsch @ 23:30

I had to get up at five in the morning today to catch my plane, which was bad enough. But what was worse was realizing that, due to the time difference, I had effectively gotten up at four – this day would in effect be one hour longer. Ouch.

I had realized yesterday evening that there would in fact be another group from the University of Bonn presenting, from the Clausen group. And as it happened, I actually met the two of them while waiting for boarding. Bastian would be presenting the results of his thesis, about information retrieval for motion capture data, and Tido came along for emotional support. I wish I had had someone with me for support, but at least I would not be alone in representing the University of Bonn now.

The flight to Stansted was uneventful if packed; Tido, Bastian and I split up at Liverpool Street station, and I proceeded to my accommodation – where of course I was told that check-in was not until two o’clock. But at least I could leave my luggage.

In search of breakfast, I went to Gloucester Rd, where I found Jakob’s cafe and deli, a very nice little cafe. It does not look like much from the outside, but the back room has a glass ceiling, so it is very light and airy. It would be perfect for reading and working, in fact, and one gentleman in the back room was doing exactly that. The ham and cheese sandwich I ordered was actually nothing like an English sandwich – it was made from flat bread, toasted, with an extra helping of tomatoes and a salad on the side, so it was more like a complete meal. The coffee was very strong, but good, and I can only recommend their organic carrot cake: a very nice blend of carrots, cinnamon and ginger, with freshly grated carrots and coconut on top. I do not think I will be allowed there again though: In my sleep-deprived stupor, I forgot to tip.

After I had replenished myself, I walked through Kensington Palace Garden to Notting Hill. I had been told to go to Notting Hill for breakfast and for going out in the evening, so I wanted to scout around the area. The walk along Portobello Rd to the market area turned out to be unexpectedly long. I was surprised that so many stores were open on a Sunday – though unfortunately, nothing interested me enough to actually buy it.

And while walking to Ladbroke Grove Station, something happened which is rather peculiar to me: I seem to have an unnatural propensity for being asked for directions. It happens everywhere – be it London, Amsterdam or Zürich, people stop me all the time and ask me the way. It happened four times today alone. Do I look so savvy and trustworthy? Do I look like a native everywhere I go? I don’t know.

After walking around in the vicinity of Notting Hill Gate for a while, I decided that there actually are not that many restaurants or pubs in this area, so I took the tube to Leicester Square – which turned out to be where it is really at. Just behind Leicester square is London’s chinatown, where you cannot throw the proverbial brick without hitting at least half a dozen Chinese restaurants. Behind that, you get into Soho, with lots of pubs and restaurants.

By now, the hotel check-in would be open, so I walked back to Leicester Square by way of Piccadilly Circus and took the Piccadilly line back to the hotel. Which was good, because there were delays on the Circle and District lines due to a “signal failure”.

On a related note, I think it is great that the tube lines all have names – I can already remember them now, whereas with numbered lines, I always get confused which one to take.

Oh, and in the tube station, I was asked to help carry a pram down the stairs, with lots of “Thank you, mate” and “Now run away quickly in case there are any other stairs” – I guess I do have such a helpful air about me. So after checking into the hotel, which is in Beit Quadrangle, quite literally next door to the Royal Albert Hall, I had a lie-down for half an hour and proceeded to the Science Museum, which is also just a stone’s throw from the hotel.

Admission to the Science Museum is actually free of charge, but they had a special treat for me: An exhibition of art from the Pixar studios, makers of movies like Finding Nemo or The Incredibles. They had concept art from most of Pixar’s movies and shorts, and were also screening their shorts on monitors. (You can see most of their shorts on their web site.) When seeing the final movie, one usually does not realize how much concept art precedes the making of such a movie, and that Pixar’s animators are all artists in their own right. And I guess it is true that one has to master the traditional arts before one can become a truly great animator – because one has to have the facility to actually see the finished movie before it is made, and to transform one’s ideas into a visual, two-dimensional rendition.

Dinner was at Thai Square on Exhibition Rd, after another two hours’ sleep, and the spicy lamb was very good. They also have a special kind of Thai iced tea with coconut milk, which one should try. And since the weather had turned rather ghastly, I concluded the evening with a pint of bitter at the local pub.

And, at the danger of being stoned for this remark by my English friends: English beer still tastes like horse’s piss. Hand pump or not. Give me a stout any day, but not again this stuff.

It is now half past eleven, and I am ready to go to bed – it has been a long day. And tomorrow will be a big day: My presentation at the conference itself.

Saturday, 08 April 2006

Off to London

Filed under: — Sebastian Kirsch @ 21:16

I have already talked about it at length, now it’s finally there: I’m flying to London tomorrow morning for ECIR 2006. The conference doesn’t begin until Monday, so I have Sunday off to explore London, and I’m getting back on Wednesday night.

The timeslot for my talk is in the second session, first slot, on Monday at 14:00. I’m glad to have such an early slot, because this means that I’m done with my part rather quickly and can enjoy the rest of the conference without it looming in my subconscious. And I’m hoping for an alert and attentive audience on the first day of the conference.

Friday, 23 December 2005

Tidings of comfort and joy!

Filed under: — Sebastian Kirsch @ 02:16

It must be the season. I have two very good pieces of news:

Yesterday, I passed the final exam for my minor subject, computational linguistics – and with flying colours, too!

I have to admit that I was rather nervous beforehand: I have a rather technical and scientific background, and as such had my difficulties with my computational linguistics lectures. After all, my professor is someone who starts his lectures on communication theory with Socrates and Plato. But all went well, and I was even lauded after the exam. I think the formulation was “If our computational linguistics majors understood the subject as well as you do, we’d be happy.”

This exam concludes my studies. Now all that is left is waiting for my advisor to grade my thesis, and waiting for the bureaucracy to write me a certificate. Then I will officially be a “Diplom-Informatiker” (i.e. a Master of Science in Computer Science).

The second piece of good news is that our paper for ECIR 2006 was accepted. There were 178 papers submitted, and ours was one of 37 that were accepted – which makes me very proud. We got some very positive comments from the reviewers; mostly along the lines of “interesting and original work” and “there really is a need in our community for more of this kind of studies”. This kind of appreciation really encourages me to further pursue this line of work. I think that my thesis can only be a preliminary treatment, and the whole field of social information retrieval is only just beginning to emerge. I just have to convince someone to fund me for further research…

I will also present the paper at ECIR in April 2006; the conference venue is Imperial College, London, and I’m quite giddy with anticipation to present my work in such a location.

So these are two great Christmas presents for myself – and a marvelous conclusion to this year.

Tuesday, 22 November 2005

Diplomanden-Seminar

Filed under: — Sebastian Kirsch @ 20:03

Now that I have successfully submitted my thesis, I am due to give a talk in the Diplomanden-Seminar. I will give it under the title

“Social Information Retrieval”
on Friday, 25 November 2005, 15:00–17:00 c.t.
in lecture hall A207,
AVZ-III, Römerstrasse 164,
53117 Bonn

Everyone interested is cordially invited.

For those unfamiliar with the area: take the bus (551, 628, 638) to the stop “Pädagogische Fakultät”, go in through the main entrance, take the stairs on the right up to the second floor, then follow the signs.

Addendum: The slides for the talk are now online as well.

Tuesday, 08 November 2005

It is done!

Filed under: — Sebastian Kirsch @ 16:41

Finally, it is done. I submitted my diploma thesis last Thursday. I will publish it on this site as soon as I receive my advisor’s opinion.

My advisor and I also wrote a paper on the subject of my thesis, which we submitted to ECIR 2006. I hope it will be accepted for publication – I’d like to go to London for the conference.

Now all that is left for me is to take an exam in my minor subject (computational linguistics), and I will receive my diploma.

Tuesday, 27 September 2005

Python for Science

Filed under: — Sebastian Kirsch @ 14:20

Lately, I’ve become very fond of the Python language for data analysis tasks. For small, off-the-cuff problems, I usually cobble something together on the command line, using a combination of various shell commands and command-line Perl. But when the problem becomes difficult enough to warrant writing a script for it, I usually switch to Python. One reason is that, when compared to Perl, Python code is much more aesthetically pleasing – to my eye, at least. Python is fun to write, without bouncing off the characters $@%{} all the time. Notice how those characters by themselves already look like a swearword?

The other reason is the number of modules making data analysis a breeze. A good overview is on Konrad Hinsen’s page on scientific computing with Python. The modules I use most frequently are

  • Numeric, a module for vectors, matrices and linear algebra. It supports everything from basic matrix operations like multiplication or inversion to eigenvector decomposition and singular value decomposition. Numeric is also the base for a couple of other modules. A very convenient feature of Numeric is “ufuncs”, or universal functions: Most binary operations (like addition, multiplication, etc.) and functions (like sine, cosine, etc.) are defined for any combination of scalars, vectors and matrices. If one operand is a scalar, the operation will be executed, using the scalar value, on every member of the matrix or vector. Likewise, functions are executed for every member of a vector or a matrix. Ufuncs also have special methods “reduce” and “accumulate” that accumulate the results of the function over every member of a matrix or vector.
  • Numarray is similar to Numeric (and maintained in the same SourceForge project), but faster for large arrays. It is the designated successor and is supposed to be mostly compatible with Numeric, but lacks support for universal functions. A number of older packages still depend on Numeric, which is why both are still being maintained.
  • Scientific Python is a collection of modules for a wide range of applications, including vectors and tensors, quaternions, automatic differentiation, linear and non-linear regression, and statistics. It also contains interfaces to the MPI library (message-passing interface, for parallel computation) and other libraries, as well as visualization functions.
  • VPython is a simple 3D visualization module; I came across it when I wanted to prepare some animations for a talk. Scientific Python contains support for VPython for visualizations.
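The ufunc behaviour described in the first item can be sketched in a few lines. The example below uses NumPy, which later superseded both Numeric and Numarray but kept the same add.reduce/add.accumulate interface:

```python
# Universal functions in action: elementwise operations with scalar
# operands, plus the reduce and accumulate methods. Shown with NumPy,
# whose ufunc interface matches the Numeric one described in the text.
from numpy import add, array, sin

v = array([1.0, 2.0, 3.0])

print((v + 10).tolist())           # scalar applied elementwise: [11.0, 12.0, 13.0]
print(add.reduce(v))               # sum over all elements: 6.0
print(add.accumulate(v).tolist())  # running sums: [1.0, 3.0, 6.0]
print(sin(array([0.0])).tolist())  # functions are elementwise too: [0.0]
```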

In every case, there is probably another application more powerful or more suitable to the task; for example, one could probably use either Mathematica, Matlab or R for any of these tasks. There are also numerous C++ and Java libraries.

But there are some important differences:

  1. Python is free; this also applies to R and most libraries.
  2. Python has a reasonably shallow learning curve.
  3. Python allows me to pull modules for different tasks into a single application, something not possible with specialized tools. For example, I could do analysis on data stored in an SQL database as easily as I can do it for a text file.
  4. Python has an interactive interpreter (unlike C++ or Java), allowing me to play with my code and my data, and try out different approaches.

In summary, it is usually really fast and easy to whip up a simple Python application to solve my problems.

The following is an example of how to implement linear regression using Numeric, written before I discovered Scientific Python. It assumes that the variables x and y are matrices with one column and n rows:

from Numeric import add  # the ufunc providing add.reduce for summing

mx = add.reduce(x) / x.shape[0]         # mean of x
my = add.reduce(y) / y.shape[0]         # mean of y
ssxy = add.reduce((x - mx) * (y - my))  # sum of products of deviations
ssxx = add.reduce((x - mx) * (x - mx))  # sum of squared deviations of x
b = ssxy / ssxx                         # slope
a = my - b * mx                         # intercept

Another thing I needed recently was nonlinear regression – fitting a non-linear function to data points. The Python module Scientific.Functions.LeastSquares will do that, and it will even compute the first derivative of the function by itself (which is needed to determine the Jacobian matrix used by the Levenberg-Marquardt algorithm. Or so I am told.) The resulting program was as small as this:

First, one defines the model that is to be fitted on the data, as a regular Python function. This function takes the parameters to be fitted as a tuple in the first argument, and the x-value of a data point as the second argument. In my case, the data points were supposed to follow a power-law distribution:

def zipf(params, point):
    k = params[0]  # exponent of the power law
    a = params[1]  # scale factor
    return a * pow(point, -k)

The data is assumed to be a list of 2-tuples, i.e. tuples (<x-value>, <y-value>). This list is fed to the regression function, along with the model and starting values for the parameters:

from Scientific.Functions.LeastSquares import leastSquaresFit

(res, chisq) = leastSquaresFit(zipf, (1, 100), data)

The variable “res” contains the parameter combination providing the best fit, while “chisq” is the sum of the squared errors, a measure of the quality of the fit. More complicated models (with more than two parameters) are also possible. In my case, I decided to stick with a simple power-law distribution – a more complicated model provided a better fit, but with parameters that were way off the mark. The problem of overfitting …

I also implemented simple interval arithmetic in Python, in a fashion roughly similar to this module by Kragen Sitaker, but with support for most operators and mixed arithmetic (where one argument is an interval and one is a scalar). Because Python has an interactive interpreter, I can simply use it as a calculator for intervals.
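A minimal sketch of such an interval class – this is not the actual module, just the pattern: operator overloading plus coercion of scalar operands. Only addition and multiplication are shown:

```python
# A tiny interval-arithmetic class supporting +, *, and mixed
# scalar/interval operands. Invented for illustration; the real module
# supported most operators.
class Interval:
    def __init__(self, lo, hi):
        self.lo, self.hi = min(lo, hi), max(lo, hi)

    def _coerce(self, other):
        # Treat a scalar operand as a degenerate interval [x, x].
        return other if isinstance(other, Interval) else Interval(other, other)

    def __add__(self, other):
        o = self._coerce(other)
        return Interval(self.lo + o.lo, self.hi + o.hi)

    __radd__ = __add__  # scalar + interval works too

    def __mul__(self, other):
        o = self._coerce(other)
        products = [self.lo * o.lo, self.lo * o.hi, self.hi * o.lo, self.hi * o.hi]
        return Interval(min(products), max(products))

    __rmul__ = __mul__

    def __repr__(self):
        return f"[{self.lo}, {self.hi}]"

# Usable directly in the interactive interpreter as an interval calculator:
print(Interval(1, 2) + 3)   # [4, 5]
print(2 * Interval(-1, 3))  # [-2, 6]
```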

Thursday, 05 May 2005

Time is running …

Filed under: — Sebastian Kirsch @ 14:46

I filed the application for my diploma thesis on Tuesday, so the clock is now officially ticking: I have to turn in the thesis within six months, by the 5th of November. Wish me luck.

Tuesday, 26 April 2005

A layman’s take on aspect-oriented programming

Filed under: — Sebastian Kirsch @ 14:12

During my classes on software technology, we were also treated to a short introduction to aspect-oriented programming. Aspect-oriented programming is the latest trend1 in software engineering: It aims to implement separation of concerns not in different classes or objects, but as different “aspects” of the same object. Typically, the cited uses are a “logging/debugging aspect”, a “security aspect”2, a “persistence aspect” etc.

What’s an aspect? Twenty years ago, people were asking the same thing about object-oriented programming: What’s an object? As no coherent definition of an aspect has yet emerged, I will describe aspects by their most common implementation, AspectJ, a Java dialect. AspectJ introduces the concept of join point interception: at certain points in the program flow, additional code (the aspect) may be introduced into the program. “Certain points” in this case are, basically, every method call. One may write a function (an aspect, also called “advice”) that will be executed before certain method calls, after certain method calls, or even around certain method calls. This function will be executed in the context of the method call in question, receive all of its arguments, and may basically do what it wants. The methods to which an aspect may be applied can be specified using wildcards.
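This kind of before/after advice can be mimicked in a dynamic language without touching the interpreter. A hypothetical Python sketch (the names apply_advice and Account are invented for illustration):

```python
# Mimicking join point interception: wrap an existing method so that
# "before" and "after" advice run around it, receiving the call's
# arguments. All names here are invented for illustration.
import functools

def apply_advice(cls, method_name, before=None, after=None):
    original = getattr(cls, method_name)

    @functools.wraps(original)
    def woven(self, *args, **kwargs):
        if before:
            before(self, *args, **kwargs)  # advice sees the call's arguments
        result = original(self, *args, **kwargs)
        if after:
            after(self, *args, **kwargs)
        return result

    setattr(cls, method_name, woven)  # the "weaving" step

class Account:
    def __init__(self):
        self.balance = 0
    def deposit(self, amount):
        self.balance += amount

log = []
apply_advice(Account, "deposit",
             before=lambda self, amount: log.append(f"depositing {amount}"))

a = Account()
a.deposit(10)
print(a.balance, log)  # 10 ['depositing 10']
```

AspectJ additionally selects join points with wildcards and weaves at compile time; this sketch shows only the interception idea.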

There is a workgroup for AOP and OOP at Bonn University that has developed an even more general aspect language called LogicAJ, as well as tools for transforming Java source code and for run-time interception of Java method calls.

Obviously, this is a very powerful technique. The first time I saw it, I said to the instructor, “We’ve been doing that in Perl for years, manipulating the symbol table of other modules and exchanging their subroutines for our own. And we always felt bad about it.”3 I was wary of it; it somehow went against my instincts. (That’s not to say I couldn’t become used to it; I also became used to function and operator overloading in C++.)

My reasoning was thus: I perceive a trend towards more and more structural and semantic clues in programming languages, in order to help the programmer provide better structure for his code. In LISP, we had functions and lists. That was it; LISP is one of the most syntax-free languages – but on the other hand, it allowed you to do everything, to program in any way you wanted. In Java, one of the most successful languages at the moment, we have a distinction between classes and interfaces, objects, object methods and variables, class methods and variables, static variables, final variables, subclasses and subinterfaces, etc. ad nauseam. Some languages provide specialized syntax for member initialization (for example C++), for choosing the appropriate function depending on the type of the argument (function overloading, in C++ or Java) or even according to the value of the argument (pattern matching, for example in the ML language).

And now people are using this artificial structure to put a totally generic system on top of the language. Which, predictably, leads to problems. The application of aspects to a base program is euphemistically called “weaving” – which works all right and is understandable as long as you have only one aspect. But what happens when several aspects are applied to the same base program? How do they interact? In what order should they be executed? What if one aspect relies on functionality provided by a different aspect? Can an aspect be applied to another aspect (leading to a meta-aspect)? These problems are still unsolved. I doubt that they ever will be – the formal semantics of even the simplest imperative languages are thorny enough.

Slashdot picked up this problem in a recent article. The article is based on papers from Passau University that attack the soundness of aspect-oriented programming. They even go as far as titling one paper “Aspect-Oriented Programming Considered Harmful” (recalling Edsger Dijkstra’s seminal paper “Go To Statement Considered Harmful”). They state some of the problems of aspects much more succinctly than I could.

In a way, aspect-oriented programming is the antithesis to another trend in language design – functional programming. Functional programming tries to minimize or even remove side effects of code, specifying that a function with the same input data will always return the same data – regardless of the order in which function calls happen. Aspect-oriented programming does exactly the opposite – the whole purpose of an aspect is introducing side effects to existing code. One may argue which way is better – no side effects or all side effects.

Functional programming does have some advantages: for one, functional code is usually much easier to debug and to verify than imperative code (and, by extension, object-oriented and aspect-oriented code) – owing to the fact that it is virtually free of side effects. Some organizations that tried it reported a marked increase in programmer productivity; Ericsson springs to mind, with the Erlang language. A good overview of the characteristics and advantages of functional languages is in the corresponding Wikipedia article.

As I said in the headline, I am just a layman in the field of programming language design. I am not particularly attached to a specific programming language, but I have seen lots of them. Why I just seem to perceive all their worst points, I don’t know; I guess I just have a critic in me.

Footnotes:
1It’s hard not to say “fad” instead of “trend”.
2Whatever security actually is in this context.
3It turns out that there is actually a Perl module for aspect-oriented programming that does exactly that; unlike Java, Perl requires no changes to the interpreter itself in order to implement aspect-oriented programming. The same is true for LISP, where aspects can be introduced using macros. In fact, LISP with its macro mechanism was the first language to introduce the idea of code as a “different kind of data” that can be manipulated by other code.

Saturday, 19 March 2005

Interview with Tim Bray, Inventor of XML and RDF

Filed under: — Sebastian Kirsch @ 11:11

ACM Queue is proving itself again as a source of worthwhile information on the current state and trends in computer science and IT. This time, it has an interview with Tim Bray, the inventor of XML and RDF.

I find that I share many of Bray’s views, especially his pessimism about the so-called Semantic Web. Having read AI and computational linguistics in the course of my studies, I concur with Bray when he says, “You know, K[nowledge] R[epresentation] didn’t suddenly become easy just because it’s got pointy brackets.” He also mentions Doug Lenat, founder of the Cyc project, which produced one of the largest and most complete knowledge bases and has been going on for more than 15 years. There is an open off-shoot called OpenCyc which contains part of the software and of the knowledge base. Cyc is an admirable project, but its applications are scarce, despite the enormous amount of energy that has been spent on it. It’s unlikely that knowledge representation in XML will change that – KR touches on many very deep and thorny problems in machine reasoning, inference, logic, and computational linguistics.

Bray also mentions the applications of XML and its relationship to older information exchange formats like ASN.1, and declares that the key point of XML is that it doesn’t describe the format of the data, but its meaning.

Personally, I am happy about the widespread adoption of XML – writing parsers is extremely hairy, and defining a document format in a non-ambiguous manner is doubly so. An open interchange format can give a boost to the development of applications, since the tedious tasks of serialization and transport can be delegated to the XML library. On the other hand, I do think that XML is excessively noisy for some applications, and that more efficient interchange formats exist, for example for purely numerical data.

Tuesday, 22 February 2005

Managing BibTeX databases

Filed under: — Sebastian Kirsch @ 19:03

Like many people who write in LATEX, I use BibTeX for automatically generating bibliographies. I used to manage the databases by hand, but after writing several papers that needed similar references, with dozens of entries in the database, this got a bit unwieldy.

A search on freshmeat.net found a number of frontends for BibTeX, but the most advanced seemed to be JabRef. It’s written in Java and is the successor of both JBibTeXManager and BibKeeper.

JabRef has a rather friendly interface, can manage several databases at once, copy between databases, and can import and export a number of formats. It can also import data from CiteSeer and PubMed. This means that once you have researched a paper on either of those two repositories, you can import the bibliography data by simply entering the reference in JabRef. You can also copy the citation command to the clipboard, or insert it directly into LyX.

The entry format is customizable, so you can add new entry types or add new fields to existing types. If you enter a URL for a paper, you can open it in the web browser with one click; if you enter the path name of the corresponding PDF file, you can open that in a viewer. This makes it the poor man’s document management system.

I usually don’t rely on GUI tools very much, but in this case, it seems worth it. I think JabRef is going to make managing my references a lot easier.

Thursday, 10 February 2005

Kernel Methods for Information Extraction II

Filed under: — Sebastian Kirsch @ 18:15

Today I gave my talk in the graduate seminar “Information Extraction” at the Institut für Kommunikationsforschung. The slides are already available online.

In retrospect, it was perhaps a bit presumptuous to try to explain the fundamentals of machine learning, support vector machines, kernel methods, and kernels on structured data to computational linguists in 45 minutes. I hope, however, that I at least managed to convey the underlying ideas reasonably well – for everything else, there is the paper.

In conversations with fellow students after the talk, I noticed that for most students, machine learning is still synonymous with neural networks. There seems to be an almost religious belief in the capabilities of neural networks: if only we could simulate enough neurons, machines would magically be capable of human-like cognitive feats.

What seems to be forgotten is that neural networks are basically just “hardware” – a pure model of computation, similar to von Neumann machines or Turing machines. What is computed with them is a question of the “software” – in the case of a neural network, a question of the weights and the transfer characteristics of the individual units. The belief that larger neural networks will suddenly be capable of cognitive feats therefore seems to me similar to the belief that Intel only has to build a fast enough processor, and the programs we run on it will magically have many more features.

The problem therefore lies in the “software” – the hypothesis that our learning algorithm computes. In this area, the support vector machine is, in my opinion, far ahead of the neural network: a classical neural network uses gradient descent to find a hypothesis that represents a local minimum of the error function. Nothing is said about the quality of this hypothesis or its ability to generalize. The choice of hypothesis in a support vector machine, on the other hand, is motivated by statistical learning theory; it is chosen so that, under the assumptions of statistical learning theory, it offers the best generalization.

The one advantage of the neural network – transforming the original problem into a high-dimensional space, in which it becomes linearly separable – is equally available through kernel methods. Here, kernel methods are in fact more flexible, since they offer more possible transformations and still provide good regularization even in extremely high-dimensional spaces.

Connectionist methods certainly have their place; the aspect of self-organization in particular is very interesting and deserves further study. In machine learning, however, there is much to be said for further research on kernel methods and large margin classifiers.

Tuesday, 08 February 2005

Guardian Unlimited on tag-based information sharing

Filed under: — Sebastian Kirsch @ 18:28

The Guardian has an article on tag-based information sharing under the title “Steal this bookmark!” (though I don’t know what tagging has to do with Abbie Hoffman). They write about del.icio.us, Flickr and 43 Things, as well as the tag-tracking studies by Technorati. They also do a good job of describing the problems with tags – namely, that they are a complete nightmare for AI researchers who try to fit the whole world into a carefully constructed, strictly hierarchical, redundancy-free and consistent semantic network. On the other hand, tags seem to work, whereas carefully constructed … etc. don’t.

Friday, 04 February 2005

Kernel Methods for Information Extraction

Filed under: — Sebastian Kirsch @ 11:17

A preliminary version of my seminar paper on “Kernel Methods for Information Extraction” is now available (in PDF format).

The text attempts to give a brief introduction to machine learning, with particular attention to kernel methods, support vector machines and the voted perceptron, as well as kernel functions on structured data such as trees. I have tried to keep the descriptions practical and non-technical, so as to make it easier for computational linguists to get into the subject. The practical part is based on Dmitri Zelenko’s 2004 work on relation extraction.

I will give the accompanying talk on 10 February 2005 in the block seminar on Information Extraction at the Institut für Kommunikationsforschung und Phonetik of the University of Bonn.

Monday, 31 January 2005

The Sapir-Whorf hypothesis

Filed under: — Sebastian Kirsch @ 14:50

I learned the correct name today for a theory I have known for a very long time: the concept that your language shapes and influences your thoughts is formally called the Sapir-Whorf hypothesis. I have also heard it attributed to Ludwig Wittgenstein and Ferdinand de Saussure; more information is on Wikipedia.

The Sapir-Whorf hypothesis had an influence on science as well as literature. George Orwell’s classic “1984” contains a language called Newspeak that is designed to eradicate disloyal thought by abolishing words like “free”. The reasoning is that once you cannot fight for freedom, because the very word does not exist in your language, rebellion becomes pointless. Another example is Samuel R. Delany’s “Babel-17”, where a fictional language denies its speakers independent thought and transforms them into traitors and terrorists; it is used as a weapon of war. The notion that certain works lose their meaning when translated into another language (for example, the Quran) is also an example of this principle.

The validity of the Sapir-Whorf hypothesis is disputed; if it were entirely true, then translation between different languages would be impossible. Chomsky’s quest for a universal grammar innate to every human also runs counter to the Sapir-Whorf hypothesis. But neurolinguistic research suggests that certain areas of the brain are indeed influenced by a speaker’s native language.

Leaving the realm of human languages, how does this hypothesis apply to computer programming languages? From a purely functional point of view, all programming languages are equal – they are all Turing-complete, which means that any of them can be used to solve any problem as well as the others. As far as the computer is concerned, it does not matter which programming language is used.

The reason why there are still hundreds of different languages and language dialects is not the computer, but the human who has to read and write the code. So the fundamental problem in language design is not designing a language that will allow you to do as much as possible, but one that will allow you to think about your problems in the most meaningful way.

The people who invent those languages appear to have been aware of their effect on the human mind for quite some time. Everyone knows Edsger Dijkstra’s quote, “The use of COBOL cripples the mind; its teaching should therefore be regarded as a criminal offense.” And of course the famous “A FORTRAN programmer can write FORTRAN in any language.” The implication is that your first programming language indeed shapes your mind and determines how you write code in subsequently learned languages.

In my experience, the language you use does indeed affect how you think about problems. I started out with imperative programming languages (C, Pascal), went on to object-oriented languages (Modula-3, Delphi, C++), learned functional programming in my first year at university (with Standard ML), and acquired a host of languages on the way (among them Perl, Python, PHP, PostScript, SQL, and a number of “lesser” languages like Bourne Shell or awk). The only thing I never properly learned is a logical programming language like Prolog (though I did hear a lecture about automated theorem proving, which essentially detailed the inference process of Prolog.)

Object-oriented programmers (especially those reared on UML etc.) tend to see everything as objects that pass messages to each other; objects with a state, a lifetime, and methods that can be called. While this is suited to most modeling tasks, good object-oriented design for other domains is very difficult, as no readily apparent mapping between parts of the problem and a hierarchy of objects exists. That is one reason, I think, why all the exercises in classes on object-oriented software design are always about modeling some part of the real world, and not about modeling an abstract concept. Tasks like stream processing or symbol manipulation are hard to capture properly in object-oriented models.

Purely functional programming is another extreme; it is extremely well suited for symbol manipulation and stream processing. When I learned it, functional programming inspired a view in my mind of lists and other data structures, flowing through functions and getting mangled and massaged on the way. I solved some exercises for the above mentioned lecture on theorem proving in ML, and the code was exceptionally clear and concise; some theorem provers were less than 20 lines of code. But once imperative elements enter the language, and the need to keep state arises, the functional model becomes rather unwieldy. There are object-oriented functional languages (OCaml being the most prominent example), but I have no experience with them.
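The same “data flowing through functions” style can be approximated even in Python (the example below is a made-up illustration, not taken from the ML exercises mentioned above):

```python
from functools import reduce

# A list flowing through a pipeline of functions -- deduplicate,
# map to lengths, fold to a sum -- with no mutable state kept anywhere.
tokens = ["the", "cat", "sat", "on", "the", "mat"]

unique = reduce(lambda acc, t: acc if t in acc else acc + [t], tokens, [])
lengths = list(map(len, unique))
total = sum(lengths)

print(unique)  # ['the', 'cat', 'sat', 'on', 'mat']
print(total)   # 14
```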

Further evidence that the code you write influences how you think about and formulate algorithms offline: I have encountered a number of algorithms that were exceptionally difficult to program in a modern, imperative or object-oriented language. They were formulated without loops or subroutines, and used jump instructions extensively. The only explanation I can think of is that the authors were used to programming languages that promote this style, like Fortran or assembly language. Writing a clean C++ or ML version of such an algorithm requires extensive reformulation, whereas other algorithms (for example, the ones mentioned above) could be implemented in ML in a minimum of time, straight from the description.

But the fact that different programming models promote different approaches to problems can also be used to your advantage. If you know a number of languages, then you can – bar external requirements – choose the language that makes your problem easiest to think about. If you are doing symbol manipulation, choose a functional programming language (or a language with substantial functional features, like Python). If the task is modeling, choose an object-oriented language. If you want to query complex data structures, use a logical language like Prolog. Especially in research, many problems can be solved more efficiently by combining a hodge-podge of different languages, using each one in the area where it is best.

For commercial software development, other factors like maintainability tend to be more important, but even there, using an embedded language for configuration or extension can be an advantage.

There are also hybrid languages, like OCaml, which is a functional object-oriented language, or Curry, which is a logical-functional hybrid. Their effect on the programmer’s psyche and problem-solving technique is unclear to me.

Other programming languages claim to be agnostic about the programming model, for example C++. While C++ is mostly associated with object-oriented programming, Stroustrup maintains that other styles, like functional programming, are also possible in C++. Because of its poor support for features like garbage collection, however, C++ does not provide the convenience of other functional programming environments.

Tuesday, 18 January 2005

Illustration software

Filed under: — Sebastian Kirsch @ 22:17

Preparing a paper for a seminar, I’m once again faced with the problem of creating illustrations. Just some basic diagrams of graphs, trees, points, lines, arrows, some formulas.

I have a couple of rather simple requirements: it should produce decent-looking pictures, should provide a means of specifying sizes, spacing and alignment in a meaningful way, and should export to something that can be integrated with pdflatex (because I’m writing the paper in LATEX).

Oh, and I’m cheap too. I know I could buy professional graphics software (like Adobe Illustrator) for something like EUR700, but could I afford it? No. Would I ever use it to its potential? No. My old favourite, CorelDRAW!, is not available for Mac OS X, unfortunately.

After trying a couple of alternatives – like OpenOffice.org Draw, and OmniGraffle, which came with my PowerBook – and asking a couple of friends, I found myself going back to the traditional tools: MetaPost and XFig.

MetaPost is kind of like a programming language for pictures, with the ability to solve systems of linear equations. That way, you almost never have to specify any coordinates – you just say the equivalent of “Well, those two boxes are 2cm apart, they are evenly spaced, and the centers of this box and that one are in one line,” and the layout is done automatically. MetaPost can process your text using LATEX, so math is no problem, and you can use the exact same fonts as in your main documents. MetaPost files can also be imported natively into pdflatex.

XFig is more of a traditional drawing program. It was one of the first drawing programs available for SunView, written in 1985, and later ported to X11. That means it has been around for about 20 years! The user interface is somewhat strange (for example, it relies heavily on having a three-button mouse), but overall, it’s rather usable. And one very important point is that it allows exporting to MetaPost – having the same graphics format and the same fonts provides some sort of continuity for graphics that were made with different software.

So I use MetaPost for things that are easy to specify algorithmically (tree structures, for example) and XFig for everything else. Both programs are scriptable, so creating a little workflow that recompiles everything when a drawing changes is not too difficult, using a Makefile.

I still wonder whether my choice of software limits me in my graphical expression; after all, good illustrations are an integral part of a good paper, and can serve to explain much better than text or formulas.

Oh, the topic of the seminar is information extraction; I’m reporting on relation extraction using kernel methods. The paper and slides should appear on my academic page as soon as they’re finished.

Sunday, 16 January 2005

Jeff Dean’s behind-the-scenes look on Google

Filed under: — Sebastian Kirsch @ 21:18

In this 1-hour lecture (available as a Windows Media or Real Video format streaming video), Jeff Dean talks about Google’s architecture, why they do things the way they do it, and the frameworks they developed for their applications. He also gives a few examples of Google’s current and future projects.

Google is usually pretty secretive about the inner workings of its products; apart from the original papers on PageRank and some engineering papers about the Google File System, little is known about their software. On the other hand, the number of successful projects that have come out of Google in the last few years is astonishing: Groups, Pictures, Gmail, Blogger, Froogle, Desktop Search, localized search, Scholar, and so on.

Furthermore, these are difficult things to do. Research Index has been trying for years to do what Google Scholar is doing now: provide an exhaustive search aid for scientific papers, finding every single place where a certain paper is published, extracting authors, dates, references, etc. Research Index has always struggled because of a lack of funding and computing power. For Google, it seems to work quite well. (I find myself using Google Scholar more often than Research Index, because it’s more reliable.)

So what is the secret behind Google’s success? I haven’t been able to work that out. They hire top-notch people (the list of papers by Google employees is very impressive), they have enormous amounts of computing resources (though Google has never published any numbers, estimates range from 70,000 to 100,000 CPUs), they have access to huge datasets (basically, their whole web index), and employees are encouraged to actually use those resources for their research projects: every employee gets to devote 20% of his time to research, and apparently it’s not unheard of to requisition a cluster of a couple of thousand CPUs for a research project.

Do I want to work there? Hell yes.

Tuesday, 11 January 2005

Struggling with the voted perceptron

Filed under: — Sebastian Kirsch @ 21:51

I’m currently struggling with the voted perceptron learning algorithm by Yoav Freund and Robert Schapire: I’ve now read three different versions of this paper, and some things about the algorithm strike me as “not quite right.”

The first thing is a minor issue: The voted perceptron algorithm is specified as starting with setting k := 0 and v1 := 0, but then immediately starts to make predictions with vk without ever initializing v0. That can’t be quite right. My guess is that they wanted to start with v0 := 0 (because a prediction vector of 0 is pretty pointless.)

The next problem is a little more involved. As a decision function, they use the standard f(x) := sign(w·x) – but that’s not right; the classical perceptron algorithm uses f(x) := sign(w·x − b), where b is the threshold of the decision function. The threshold is usually hidden in this equation because the “zeroth element” of the vector x is taken to be x0 ≡ 1. This saves you the trouble of having to update the threshold by hand.

The voted perceptron does not have this implicit assumption, and being a kernel algorithm, it cannot have it – how would you specify that a component is equivalent to 1 in feature space?

According to everything I learnt in my lectures on neural networks and AI, a perceptron cannot work with a threshold of zero, because then the origin is part of all decision hyperplanes. A threshold of anything but zero is OK, because you can scale the vector w to get the right decision hyperplane. But the perceptron is supposed to be able to solve all linearly separable problems, not only those that are separable with a hyperplane containing the origin.

The only implementation of the voted perceptron I could find (in the Weka package) neatly sidesteps this problem by always using the polynomial kernel k(x, y) = (1 + x·y)^d; if no exponent is specified, they assume d = 1, and thus get their threshold. But this is not how the voted perceptron is specified. One solution would be to specify the algorithm with x·x′ + 1 in the non-kernelized form, and k(x, x′) + 1 in the kernelized form, everywhere the scalar product/kernel is used.
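To make the point concrete, here is a minimal kernel perceptron sketch in Python (the plain mistake-driven variant, not the full voted algorithm) using the proposed k(x, x′) + 1 trick; the data and function names are made up for illustration:

```python
import numpy as np

def kernel(x, y):
    # linear kernel plus 1: the "+ 1" plays the role of the implicit
    # threshold component x0 = 1 of the classical perceptron
    return np.dot(x, y) + 1.0

def kernel_perceptron(X, y, epochs=20):
    """Minimal kernel perceptron; alpha[i] counts mistakes on example i,
    so the implicit weight vector starts out as the zero vector."""
    alpha = np.zeros(len(X))
    for _ in range(epochs):
        for i in range(len(X)):
            s = sum(alpha[j] * y[j] * kernel(X[j], X[i]) for j in range(len(X)))
            if y[i] * s <= 0:  # mistake (sign(0) counts as a mistake)
                alpha[i] += 1
    return alpha

def predict(alpha, X, y, x):
    s = sum(alpha[j] * y[j] * kernel(X[j], x) for j in range(len(X)))
    return 1 if s > 0 else -1

# A linearly separable 1-D problem whose separating point does NOT lie
# at the origin -- unsolvable with a zero threshold, solvable with the +1.
X = np.array([[0.0], [1.0], [3.0], [4.0]])
y = np.array([-1, -1, 1, 1])
alpha = kernel_perceptron(X, y)
print([predict(alpha, X, y, np.array([v])) for v in (0.5, 3.5)])  # [-1, 1]
```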

It’s a strange thing that no publication makes note of this problem; I will have to talk it through with someone.

Friday, 07 January 2005

Chaos everywhere

Filed under: — Sebastian Kirsch @ 01:28

There is a common perception that software projects are notable for their rate of failure, but hard data on this is difficult to come by. Prominent examples in Germany are the recent Toll Collect disaster (which delayed a toll for lorries on German motorways for years) and the introduction of the new unemployment benefit (Arbeitslosengeld II), which caused the bank account data of thousands of people receiving the dole to be garbled, resulting in them not getting their money.

Some data can be found in the Standish Group’s chaos reports. They report that in 1994, a staggering 16% of all surveyed software projects were completed on time and within budget; 31% of the software projects were never completed at all. The figures for 2000 are slightly better: 28% of the projects succeeded and only 23% failed outright. So all the advances in software design, object-oriented programming, modeling tools, CASE tools, IDEs etc. managed to decrease the failure rate by about a quarter and increase the success rate by about three quarters.
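A quick back-of-the-envelope check of the relative changes:

```python
# Standish chaos report figures quoted above
success_1994, success_2000 = 0.16, 0.28
failure_1994, failure_2000 = 0.31, 0.23

success_gain = (success_2000 - success_1994) / success_1994
failure_drop = (failure_1994 - failure_2000) / failure_1994

print(f"success rate up by {success_gain:.0%}")    # up by 75%
print(f"failure rate down by {failure_drop:.0%}")  # down by 26%
```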

Still, this is a pretty dismal track record. If we built houses the way we developed software in 2000, the first woodpecker to come along would still destroy civilisation. For two thirds of software projects, it means a very bumpy ride to completion, requiring either massive cost overruns or severe delays, and if you are already on that track, you only have a two-thirds chance of ever completing the project.

The reports also contain a detailed analysis of the cause of failure; I think this should be required reading for any IT project manager. Unfortunately, these reports were not part of any of my software engineering classes.

Delicious bookmarks

Filed under: — Sebastian Kirsch @ 00:57

I have not written about it in this blog yet, but my diploma thesis is going to be about community structures in knowledge networks, with applications in knowledge management and collaborative web search.

One tool for collaborative management of bookmarks is del.icio.us. Its greatest virtue is its simplicity: Every user can upload URLs and attach tags and a description to them. The URLs can later be queried by username or a combination of tags. For every URL, information about who else bookmarked this item is available, plus the tags used to bookmark the item. You can also subscribe to the bookmark list of other users and be notified when they bookmark new sites. (But you cannot find out who subscribes to your bookmarks.)

The system is completely centralized. There is no real community support for knowledge sharing. It also relies on manual tagging (though this could probably be hacked to include referrer information from web search engines.)

As with similar systems (AudioScrobbler comes to mind), it’s appealing at first because it is very simple, but the collaborative nature is limited because of the lack of communication facilities between users and the lack of community support. I suppose these features could be added eventually.

Tuesday, 04 January 2005

PowerPoint Is Evil

Filed under: — Sebastian Kirsch @ 23:58

Something for my mom, who is an elementary school teacher and who is learning PowerPoint at the moment:

This Wired article by Edward Tufte details why PowerPoint (and related presentation software, but mostly PowerPoint, through its ubiquity) makes us stupid: why it degrades the quality of communication and turns every well-thought-out argument into a piece of mindless marketing drivel.

Tufte also wrote a monograph on the subject of PowerPoint abuse, called The cognitive style of PowerPoint; sample pages from the monograph are on his web site on a page about the Columbia accident.

Tufte is an artist and graphics designer who has written several books about the visual display of information. (A department in which I am sorely lacking, so perhaps I should treat myself to one of his books eventually.)

So, please, mom, think of the children: Don’t let them use PowerPoint!

Thursday, 30 December 2004

All your Bayes …

Filed under: — Sebastian Kirsch @ 03:01

This is really, really bad. The person in front is Vladimir Vapnik, by the way (of Support Vector Machine and statistical learning theory fame.)

Wednesday, 29 December 2004

To-do list

Filed under: — Sebastian Kirsch @ 15:15

My current to-do list as regards my studies:

  • Finish a report on a practical course I did last year; topic: kernel methods in drug screening. Deadline: sometime mid-January.
  • Write a paper and slides for a seminar talk; topic: kernel methods for information extraction. Deadline: 5th of February.
  • Write a proposal for my diploma thesis; preliminary topic: community approaches for social networks. Deadline: 1st of March.

January is going to be a busy month.

Wednesday, 22 December 2004

Writing English

Filed under: — Sebastian Kirsch @ 13:07

For the German language, there exists a classic book for journalists that teaches a clear and understandable style: Deutsch für Profis by Wolf Schneider, the former head of the Hamburg journalists’ school. It is required reading for a number of magazines, for example c’t magazine.

I was looking for a similar book for English style and found the following works:

The last three are probably of more interest to newspaper journalists, but I will check out the first two to see whether they are comparable to the German style guide mentioned before.

Math software

Filed under: — Sebastian Kirsch @ 09:05

On the subject of open source math software – this one isn’t exactly open source, but it’s free as in beer, nonetheless: The original Graphing Calculator from Mac OS 9 is available for free as a beta version for Mac OS X. Looks neat.

Wednesday, 15 December 2004

Clustering with Spectral Methods

Filed under: — Sebastian Kirsch @ 20:54

I found a diploma thesis from Konstanz about graph clustering with spectral methods. So much to read, so little time …

Probabilistic/fuzzy clustering

Filed under: — Sebastian Kirsch @ 12:51

I just heard a lecture on fuzzy clustering (fuzzy k-means, autoclass, LSA, PLSA), and I’m wondering whether these techniques can also be applied to graph clustering – finding clusters of a graph and calculating the probability that a given node belongs to the different clusters. This will need some more thinking and research.
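As a rough illustration of what such “soft” cluster memberships would look like, here is a plain fuzzy c-means sketch in Python; applying it to a graph would additionally require embedding the vertices in a vector space first (for example spectrally), which the sketch simply assumes has been done:

```python
import numpy as np

def fuzzy_cmeans(points, k=2, m=2.0, iters=50, seed=0):
    """Plain fuzzy c-means: returns soft memberships u[i, c] in [0, 1],
    one row per point, rows summing to 1."""
    rng = np.random.default_rng(seed)
    u = rng.random((len(points), k))
    u /= u.sum(axis=1, keepdims=True)
    for _ in range(iters):
        w = u ** m
        # weighted cluster centers
        centers = (w.T @ points) / w.sum(axis=0)[:, None]
        # distances of every point to every center
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        # standard membership update for fuzzifier m
        u = 1.0 / (d ** (2 / (m - 1)))
        u /= u.sum(axis=1, keepdims=True)
    return u

# two obvious groups of "vertices" in a 2-D embedding; the last point
# sits between them and gets a genuinely mixed membership (both near 0.5)
pts = np.array([[0, 0], [0, 1], [1, 0], [9, 9], [9, 10], [10, 9], [5, 5.0]])
u = fuzzy_cmeans(pts)
print(np.round(u[-1], 2))
```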

Tuesday, 14 December 2004

Open Source Math Software For Education?

Filed under: — Sebastian Kirsch @ 02:07

A Slashdot thread on Open Source Math Software; I’m looking forward to seeing the comments roll in, as this is a question I’d love to have an answer to.

For my (limited) needs, I’ve used a number of math packages, namely bc, octave and numerical python for calculations, Yacas for symbolic computations, Gnuplot, Metapost and recently visual python for graphs, and GMM++ for matrix and vector calculations in C++. None of these tools is entirely satisfactory, but combined, they usually got the job done. It’s just annoying that each and every one of them uses its own language which you have to learn before you can use it. (Not counting the python and C++ libraries, of course.)

Other people use R, the swiss army chainsaw of mathematics, for everything. So far, it has been too intimidating for me to learn.

One suggestion from the Slashdot thread that I will have to check out is Maxima: It’s a symbolic computation package derived from an MIT project from the 70s called Macsyma. I don’t know whether it will be of much use, since apparently it hasn’t been under active development (just maintenance) since 1982.


Copyright © 1999--2004 Sebastian Marius Kirsch webmaster@sebastian-kirsch.org , all rights reserved.