I’m currently struggling with the voted perceptron learning algorithm by Yoav Freund and Robert Schapire: I’ve now read three different versions of this paper, and some things about the algorithm strike me as “not quite right”.
The first thing is a minor issue: the voted perceptron algorithm is specified as starting with setting k := 0 and v1 := 0, but then immediately starts to make predictions with vk without ever initializing v0. That can’t be quite right. My guess is that they wanted to start with v0 := 0 (a prediction vector of 0 makes pointless predictions anyway, but at least vk is then defined).
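To make sure I understand the training phase, here is a minimal sketch in Python of how I read it, with the initialization fixed to start from v0 := 0 (the code and all names are mine, not from the paper):

```python
import numpy as np

def train_voted_perceptron(X, y, epochs=10):
    """Sketch of the voted perceptron training phase.

    X is an (n, d) array of examples, y holds labels in {-1, +1}.
    Returns the (prediction vector, vote count) pairs used at test time.
    Starts from v_0 := 0, which I assume is what the authors intended.
    """
    n, d = X.shape
    v = [np.zeros(d)]   # v_0 := 0 -- the initialization missing from the paper
    c = [0]             # c_k counts how many examples v_k survived
    k = 0
    for _ in range(epochs):
        for i in range(n):
            y_hat = 1.0 if v[k] @ X[i] > 0 else -1.0  # ties count as -1 here
            if y_hat == y[i]:
                c[k] += 1                     # correct: v_k earns another vote
            else:
                v.append(v[k] + y[i] * X[i])  # mistake: perceptron update
                c.append(1)
                k += 1
    return list(zip(v, c))
```

At test time the stored vectors then vote: the prediction for x is sign(∑k ck ⋅ sign(vk ⋅ x)).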
The next problem is a little more involved. As a decision function, they use the standard f(x) := sign(w ⋅ x), but that’s not right: the classical perceptron algorithm uses f(x) := sign(w ⋅ x - b), where b is the threshold of the decision function. The threshold is usually hidden in the first form because the “zeroth element” of the vector x is taken to be x0 ≡ 1, so that w0 plays the role of -b. This saves you the trouble of having to update the threshold by hand.
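For reference, the trick in code form (a toy illustration of mine, with arbitrary numbers):

```python
import numpy as np

w, b = np.array([2.0, -1.0]), 0.5   # arbitrary weights and threshold
x = np.array([0.3, 0.7])

# Explicit threshold: f(x) = sign(w . x - b)
f_explicit = np.sign(w @ x - b)

# Hidden threshold: x0 = 1 is prepended, and w0 = -b absorbs the bias
x_aug = np.concatenate(([1.0], x))
w_aug = np.concatenate(([-b], w))
f_implicit = np.sign(w_aug @ x_aug)

assert f_explicit == f_implicit
```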
The voted perceptron does not have this implicit assumption, and being a kernel algorithm, it cannot have it – how would you specify that a component is equivalent to 1 in feature space?
According to everything I learnt in my lectures on neural networks and AI, a perceptron cannot work with a threshold of zero, because then the origin is part of all decision hyperplanes. A threshold of anything but zero is OK, because you can scale the vector w to get the right decision hyperplane. But the perceptron is supposed to be able to solve all linearly separable problems, not only those that are separable with a hyperplane containing the origin.
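A one-dimensional toy example (mine, in Python) makes this concrete: the two points below are linearly separable by any threshold between 1 and 2, but no hyperplane through the origin can split them, since sign(w ⋅ 1) = sign(w ⋅ 2) for every w:

```python
import numpy as np

X = np.array([[1.0], [2.0]])   # both points on the same side of the origin
y = np.array([1.0, -1.0])

def perceptron(X, y, epochs=100):
    """Plain perceptron with zero threshold: f(x) = sign(w . x)."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:   # mistake (or tie): update
                w += yi * xi
    return w

w = perceptron(X, y)
print([np.sign(w @ xi) for xi in X])   # same sign for both: cannot separate

X_aug = np.hstack([np.ones((2, 1)), X])   # x0 = 1 restores the threshold
w_aug = perceptron(X_aug, y)
print([np.sign(w_aug @ xi) for xi in X_aug])   # [1.0, -1.0]: separated
```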
The only implementation of the voted perceptron I could find (in the Weka package) neatly sidesteps this problem by always using the polynomial kernel k(x, y) = (1 + x ⋅ y)^d; if no exponent is specified, they assume d = 1, and thus get their threshold. But this is not how the voted perceptron is specified. One solution would be to specify the algorithm with x ⋅ x’ + 1 in the non-kernelized form, and k(x, x’) + 1 in the kernelized form, everywhere the scalar product/kernel is used.
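In code, that fix is a one-line kernel wrapper (my own sketch; with the plain dot product it reduces exactly to Weka’s polynomial kernel at d = 1):

```python
def with_threshold(kernel):
    """Wrap a kernel so that k'(x, y) = k(x, y) + 1.

    If k(x, y) = phi(x) . phi(y), then k(x, y) + 1 is the dot product
    of (1, phi(x)) and (1, phi(y)): the added 1 acts like a constant
    feature x0 = 1 in feature space, restoring the threshold term.
    """
    return lambda x, y: kernel(x, y) + 1

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

linear = with_threshold(dot)           # equals (1 + x . y)**1
print(linear([1.0, 2.0], [3.0, 4.0]))  # 12.0
```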
It’s strange that no publication makes note of this problem; I will have to talk it through with someone.