



Solutions to Problem Set 4 of CS446: Pattern Recognition and Machine Learning, covering margin computation, VC dimension, and kernel construction.




CS446: Pattern Recognition and Machine Learning Fall 2008
Solution Handed In: October 22, 2008
(a) With the given hyperplane there are 12 positive examples and 38 negative examples. The margin of this data is $\gamma = 0.158113883008419$.

(b) i. One possible hyperplane for this disjunction is:

$$w = \langle 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0 \rangle, \quad \theta = 0.5$$

In other words, $w_3, w_6, w_9, w_{12}$ are 1, all other $w_i$ are 0, and $\theta$ is 0.5. This way, if any of $x_3$, $x_6$, $x_9$, or $x_{12}$ is set, the dot product will be greater than 0.5 and the example is classified as positive. This hyperplane yields a margin of $\gamma = 0.25$ on the data.

ii. The minimum distance between a positive and negative example is 1.

(c) i. Similar to above, one possible hyperplane for this disjunction is:

$$w = \langle 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0 \rangle, \quad \theta = 0.5$$

This hyperplane yields a margin of $\gamma = 0.166666666666667$ on the data.

ii. The minimum distance between a positive and negative example is 1.

(d) In the full example space of all $2^{20}$ binary vectors, the minimum distance between a positive and a negative example is just one bit for any non-constant function: somewhere the label must change from positive to negative, and the two examples on either side of that boundary differ in a single bit. What we have here is only 50 points sampled from this space, so the closest oppositely labeled pair can be much further apart. Given that, with the sparse disjunction there are more unused dimensions in which the examples can differ. For two examples to be close but classified differently, they need to be close in the 16 unused dimensions but differ in at least one of the 4 used dimensions. For the denser disjunction, on the other hand, two examples need to be close in the 11 unused dimensions but differ in at least one of the 9 used dimensions. This means that randomly sampled examples with different labels are more likely to be close when there are fewer unused dimensions for them to differ on.
Given that the distance is smaller (the distance is an upper bound on the achievable margin) and that the margin found is smaller for dense disjunctions, we have less "wiggle room" to learn a linear separator, so we will need more examples to find one that agrees with the training data. As we saw with Novikoff's bound, the number of mistakes Perceptron makes is inversely related to the size of the margin: the bound grows as $1/\gamma^2$.
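The margins reported in (b) and (c) are consistent with measuring the margin as the smallest geometric distance from an example to the hyperplane, $\min_x |w \cdot x - \theta| / \|w\|$. The sketch below assumes that definition. It does not use the assignment's 50 data points; the two toy examples and the helper names `margin` and `disjunction_weights` are hypothetical and chosen only for illustration.

```python
import math

def margin(w, theta, examples):
    """Smallest geometric distance from any example to the hyperplane w.x = theta."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(
        abs(sum(wi * xi for wi, xi in zip(w, x)) - theta) for x in examples
    ) / norm

def disjunction_weights(n, active):
    """Weight vector with 1 at the (1-indexed) positions in `active`, 0 elsewhere."""
    return [1 if i in active else 0 for i in range(1, n + 1)]

n = 20
w_sparse = disjunction_weights(n, {3, 6, 9, 12})         # 4 used dimensions
w_dense = disjunction_weights(n, set(range(2, 19, 2)))   # x2, x4, ..., x18

# Hypothetical toy data: the all-zeros vector (negative for both disjunctions)
# and a vector with only x6 set (positive for both disjunctions).
zero = [0] * n
only_x6 = [1 if i == 6 else 0 for i in range(1, n + 1)]
toy = [zero, only_x6]

print(margin(w_sparse, 0.5, toy))  # 0.25
print(margin(w_dense, 0.5, toy))   # approximately 0.1667
```

With $\theta = 0.5$ and 0/1 weights, the closest possible example sits at $|w \cdot x - \theta| = 0.5$, so the margin is $0.5 / \|w\|$: $0.5/2$ for the sparse disjunction and $0.5/3$ for the dense one.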
(a) The positive region of our hypothesis is convex, since it is the intersection of two convex regions (the half-planes of two linear classifiers). This means that any set of points where one point is inside the convex hull of the others, or on a line segment between two other points, cannot be shattered: if we label this interior point negative and all others positive, the negative point is inside any convex region containing the positives. Thus we can restrict our investigation to sets of points in convex position, that is, points that form the vertices of a convex polygon. With 1 hinge we can shatter 5 points, as depicted in figure 1. We can think of the 1-hinge region as a triangle whose base projects out to infinity. With 5 points in the form of a pentagon, no labeling leaves a negative example inside the projected triangle containing the positive points.
Figure 1: 5 points shatterable by H. Only interesting configurations shown.
To complete our proof that $VC(H) = 5$, we must also show that there does not exist any set of 6 points that H can shatter. As described above, we need only consider sets of 6 points that form a convex hexagon with no three points collinear. For any such hexagon, label the points alternately positive and negative. Any projected triangle that includes the three positive points must then include a negative point, since every angle formed by the 3 positive examples contains a negative example (see figure 2).
Figure 2: A 6-point labeling not separable by H.
(b) Using the same intuition as above, the 2-hinge lines form a space of convex quadrilaterals with one end projected to infinity, or closed triangles if all three lines in-
describe its location, each line at beginning and end can be described by its slope, and a point inside needs two parameters, so that's $2K + 4$ parameters. A K-hinge line can form at most a closed convex polygon with $K + 1$ sides, or an open convex polygon with $K + 2$ sides. Such a polygon can shatter $2(K + 1) + 1 = 2K + 3$ points arranged on a convex hull. For any labeling with $K + 1$ or fewer positive points, we can wrap a hull around these points that does not include a negative point. For any labeling with $K + 2$ or more positive points, there must be two positive points adjacent on the hull, so there is no negative point between them and we can expand our polygon around both. If we move up to $2K + 4$ points on a convex hull, there now exists a labeling with $K + 2$ positive points such that a negative point lies between each pair of neighboring positive points. Any region we create with $K + 2$ lines will include some negative point between two positive points, so no arrangement of $2K + 4$ points can be shattered. The VC dimension of K-hinge lines is thus $2K + 3$.
(a)

$$K(x_1, x_2) = \binom{\mathrm{same}(x_1, x_2)}{k}$$

where $\mathrm{same}(x_1, x_2)$ is the number of features set to the same value in the two input vectors. Computing same takes time linear in the number of features.

(b) The kernel perceptron algorithm is executed similarly to perceptron with a few modifications. First, instead of using the function
$$f(x) = Th_\theta(w \cdot x)$$

to classify examples, we use the following function, which transforms the examples into a higher dimensional space and reorganizes the computation of the high dimensional dot product:

$$f(x) = Th_\theta\left(\sum_{x_m \in M} S(x_m) K(x, x_m)\right)$$

where M is the set of training examples that the algorithm has previously made a mistake on, and $S(x_m) \in \{-1, 1\}$ is the label of example $x_m$. We make these ideas explicit in the KernelPerceptron algorithm.
KernelPerceptron(S ∈ (X × Y)^m)
    M ← ∅
    foreach (x, y) ∈ S
        if y ≠ Th_θ( Σ_{(x_m, y_m) ∈ M} y_m K(x, x_m) )
            M ← M ∪ {(x, y)}
    return λx : Th_θ( Σ_{(x_m, y_m) ∈ M} y_m K(x, x_m) )

In the last line of the algorithm, λx : ... represents an anonymous function that takes x as an argument and returns some evaluation in terms of it.
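The kernel from (a) and the KernelPerceptron pseudocode can be sketched in Python as follows. This is an illustrative sketch, not the course's reference implementation. It assumes that $Th_\theta$ returns +1 when its argument strictly exceeds $\theta$ and -1 otherwise. The `epochs` parameter is a hypothetical addition: the pseudocode makes a single pass, which is the default here. All function names are chosen for illustration.

```python
from math import comb

def same(x1, x2):
    """Number of features set to the same value in both vectors (linear time)."""
    return sum(1 for a, b in zip(x1, x2) if a == b)

def kernel(x1, x2, k):
    """K(x1, x2) = C(same(x1, x2), k)."""
    return comb(same(x1, x2), k)

def threshold(value, theta):
    """Th_theta (assumed convention): +1 if value > theta, else -1."""
    return 1 if value > theta else -1

def kernel_perceptron(samples, k, theta, epochs=1):
    """Return a classifier built from the examples the algorithm got wrong."""
    mistakes = []  # the set M of (x_m, y_m) pairs

    def predict(x):
        score = sum(y_m * kernel(x, x_m, k) for x_m, y_m in mistakes)
        return threshold(score, theta)

    for _ in range(epochs):  # epochs > 1 is an assumption, not in the pseudocode
        for x, y in samples:
            if predict(x) != y:
                mistakes.append((x, y))

    # predict closes over `mistakes`, so it plays the role of the lambda
    # returned on the last line of the pseudocode.
    return predict

# Hypothetical toy data with labels in {-1, 1}.
samples = [((1, 1, 0), 1), ((0, 0, 1), -1)]
f = kernel_perceptron(samples, k=2, theta=0.5)
print(f((1, 1, 0)))  # 1
print(f((0, 0, 1)))  # -1
print(f((1, 1, 1)))  # 1
```

The weight vector in the blown-up space is never built. Only the mistake set M is stored, and each prediction costs one kernel evaluation per stored mistake.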
(c) In class, we proved a theorem by Novikoff showing that perceptron will make at most $R^2 / \gamma^2$ mistakes on any training set known to be linearly separable. This analysis was done under the assumption that perceptron makes classifications with the function
f (x) = sign(w · x)
which implies that the threshold is being learned. (Why?)

In our implementation of kernel perceptron, we did not learn the threshold; however, we used a slightly different f, one with a fixed, positive threshold. Novikoff's bound can still be used as an upper bound for our algorithm. (Why?) All we need to do is compute the values of R and γ. Every example in the original space translates to an example containing $\binom{n}{k}$ active features in the blown-up space, plus the "threshold feature," which is always on. Therefore,

$$R^2 = \binom{n}{k} + 1$$
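The count of active features can be checked by building the blown-up representation explicitly for a small example. The sketch below assumes a 0/1 encoding with one feature per conjunction of k literals over distinct variables, plus one always-on threshold feature. Under that encoding the squared norm of every blown-up example is $\binom{n}{k} + 1$. The names `blow_up` and `squared_norm` are hypothetical.

```python
from itertools import combinations, product
from math import comb

def blow_up(x, k):
    """Explicit feature map: one 0/1 feature per conjunction of k literals,
    plus an always-on threshold feature (assumed encoding)."""
    n = len(x)
    phi = {}
    for idxs in combinations(range(n), k):        # which k variables
        for values in product((0, 1), repeat=k):  # required value of each one
            active = all(x[i] == v for i, v in zip(idxs, values))
            phi[(idxs, values)] = int(active)
    phi["threshold"] = 1
    return phi

def squared_norm(phi):
    """Squared Euclidean norm of a feature dictionary."""
    return sum(v * v for v in phi.values())

x = (1, 0, 1, 1)
k = 2
phi = blow_up(x, k)
print(squared_norm(phi), comb(len(x), k) + 1)  # 7 7
```

For each choice of k variables exactly one assignment of values matches x, so exactly $\binom{n}{k}$ conjunction features are active regardless of which example is chosen.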
γ is more complicated. It’s defined in terms of an oracle weight vector u such that ||u|| = 1. Given a data set, we’d like to maximize γ to make the bound as tight as possible. But we’re interested in learning any k-DNF concept, so we must find the concept that yields the smallest maximum γ. In the blown-up feature space, we’re learning a simple disjunction of r ≤ n variables. Without loss of generality, we can assume the u maximizing γ for such a function has the following form:
$$u = \frac{(u_1 = 1, u_2 = 1, \ldots, u_r = 1, u_{r+1} = 0, \ldots, u_n = 0, u_{n+1} = \theta)}{\sqrt{r + \theta^2}}$$

where $-1 < \theta < 0$. The positive example $x^+$ closest to the hyperplane induced by u contains a single active feature out of the first r, in addition to the last feature, which must be active. Therefore,

$$u \cdot x^+ = \frac{1 + \theta}{\sqrt{r + \theta^2}}$$

For every negative example $x^-$, we have

$$u \cdot x^- = \frac{\theta}{\sqrt{r + \theta^2}}$$

And since

$$\gamma = \min(|u \cdot x^+|, |u \cdot x^-|)$$