### Struggling with the voted perceptron

I’m currently struggling with the voted perceptron learning algorithm by Yoav Freund and Robert Schapire: I’ve now read three different versions of this paper, and some things about the algorithm strike me as “not quite right".

The first thing is a minor issue: The voted perceptron algorithm is specified as starting with setting k := 0 and v_{1} := 0, but then immediately starts to make predictions with v_{k} without ever initializing v_{0}. That can’t be quite right. My guess is that they wanted to start with v_{0} := 0 (because a prediction vector of 0 is pretty pointless.)

The next problem is a little more involved. As a decision function, they use the standard f(x) := sign(**w** ⋅ **x**) – but that’s not right, the classical perceptron algorithm uses f(x) := sign(**w** ⋅ **x** - b), where b is the threshold of the decision function. It’s just that the threshold is usually hidden in this equation, because the “zeroth element” of vector **x** is taken to be x_{0} ≡ 1. This saves you the trouble of having to update the threshold by hand.

The voted perceptron does not have this implicit assumption, and being a kernel algorithm, it cannot have it – how would you specify that a component is equivalent to 1 in feature space?

According to everything I learnt in my lectures on neural networks and AI, a perceptron cannot work with a threshold of zero, because then the origin is part of all decision hyperplanes. A threshold of anything but zero is OK, because you can scale the vector **w** to get the right decision hyperplane. But the perceptron is supposed to be able to solve all linearly separable problems, not only those that are separable with a hyperplane containing the origin.

The only implementation of the voted perceptron I could find (in the Weka package) neatly sidesteps this problem by always using the polynomial kernel k(**x**, **y**) = (1 + **x** ⋅ **y**)^{d}; if no exponent is specified, they assume d = 1; and thus get their threshold. But this is not how the voted perceptron is specified. One solution would be to specify the algorithm with **x** ⋅ **x’** + 1 in the non-kernelized form, and k(**x**,**x’**) + 1 in the kernelized form everywhere the scalar product/kernel is used.

It’s a strange thing that no publication makes note of this problem; I will have to talk it through with someone.

## 1 Comment

RSS feed for comments on this post.

## Leave a comment

Sorry, the comment form is closed at this time.

The problem with origin is solved in some implementation adding a dimension at the feature vector, as done in the Novikoff (1962) [perceptron termination] theorem proof.

xx = [x, R] , ww = [w, b/R]

Comment by Matteo Bertini — Wednesday, 16 February 2005 @ 16:36