









Madam Amrita Ahuja distributed this handout in class of Artificial Intelligence course at Central University of Jammu and Kashmir. This handout explains important concepts including: Vector, Machines, Lagrange, Multiplier, Equation, Svm, Decision, Boundary, Constraints, Kernels










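Note that $\alpha_i \ge 0$, and $\alpha_i = 0$ for non-support vectors.
$\vec{w} = \sum_i \alpha_i y_i \vec{x}_i$: for use with a linear kernel. The summation only contains support vectors. Support vectors are training data points with $\alpha_i > 0$.
$\vec{w} = \sum_i \alpha_i y_i \phi(\vec{x}_i)$: for use with a decomposable kernel (see definition below).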
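In SVMs we are trying to find a decision boundary that maximizes the "margin" or the "width of the road" separating the positive from the negative training data points.
To find this we minimize $\frac{1}{2} \lVert \vec{w} \rVert^2$ subject to the constraints $y_i (\vec{w} \cdot \vec{x}_i + b) \ge 1$.
The resulting Lagrange multiplier equation we try to optimize is:
$L = \frac{1}{2} \lVert \vec{w} \rVert^2 - \sum_i \alpha_i \left[ y_i (\vec{w} \cdot \vec{x}_i + b) - 1 \right]$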
Solving the above Lagrangian optimization problem gives us w, b, and the alphas, the parameters that determine a unique maximal margin (road) solution. On the maximum margin "road", the +ve and -ve points that lie on the "gutter" lines are called support vectors. The decision boundary lies in the middle of the road. The definition of the "road" depends only on the support vectors, so changing (adding or deleting) non-support vector points will not change the solution. Note that the widest "road" is a 2D concept. If the problem is in 3D we want the widest region bounded by two planes; in even higher dimensions, a subspace bounded by two hyperplanes.
Solving for the Lagrange multipliers in general requires numerical optimization methods that are beyond the scope of this class. In practice, you use Quadratic Programming solvers. A popular algorithm for solving SVMs is Platt's SMO (Sequential Minimal Optimization) algorithm.
For SVM problems on quizzes, we generally just ask you to solve for the values of w, b and alphas using algebra and/or geometry.
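A. Equations derived from optimizing the Lagrangian:
Partial derivative of L with respect to b: $\sum_i \alpha_i y_i = 0$. The sum of all alphas (support vector weights), taken with their signs, should add to 0.
Partial derivative of L with respect to $\vec{w}$: $\vec{w} = \sum_i \alpha_i y_i \vec{x}_i$. The vector $\vec{w}$ is the sum of the support vectors, each weighted by its alpha and its y.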
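The two conditions in section A can be checked numerically once a candidate set of alphas is in hand. The following is a minimal sketch, assuming illustrative support vectors and alpha values chosen only for this example; the helper names are hypothetical.

```python
from fractions import Fraction

# Illustrative (assumed) support vectors: (point, label y, alpha).
support_vectors = [
    ((0, 0), +1, Fraction(1, 16)),
    ((4, 4), -1, Fraction(1, 16)),
]

def signed_alpha_sum(svs):
    """Sum of alpha_i * y_i; this must be 0 at the optimum."""
    return sum(alpha * y for _, y, alpha in svs)

def weight_vector(svs):
    """w = sum_i alpha_i * y_i * x_i, computed one component at a time."""
    dim = len(svs[0][0])
    return tuple(sum(alpha * y * x[d] for x, y, alpha in svs) for d in range(dim))

balance = signed_alpha_sum(support_vectors)
w = weight_vector(support_vectors)
print(balance, w)
```

Exact fractions are used so that the "adds to 0" check does not depend on floating-point tolerance.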
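B. Equations from the boundaries and constraints:
Decision boundary: $h(\vec{x}) = \sum_i \alpha_i y_i K(\vec{x}_i, \vec{x}) + b \ge 0$. General form, for any kernel. To classify an unknown $\vec{u}$, we compute the kernel function against each of the support vectors. Support vectors are training data points with $\alpha_i > 0$.
Decision boundary: $h(\vec{x}) = \vec{w} \cdot \vec{x} + b \ge 0$. For use with a linear kernel.
Gutters: $\sum_i \alpha_i y_i K(\vec{x}_i, \vec{x}) + b = \pm 1$. General form, for any kernel.
Gutters: $\vec{w} \cdot \vec{x} + b = \pm 1$. For use when the kernel is linear.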
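The general-form decision rule can be sketched in a few lines. This is only an illustration under stated assumptions: the support vectors, alphas and intercept below are hypothetical values, and `decision_value` and `classify` are names invented for the sketch.

```python
def linear_kernel(u, v):
    """Plain dot product; any other kernel function K(u, v) can be swapped in."""
    return sum(a * b for a, b in zip(u, v))

def decision_value(u, support_vectors, b, kernel=linear_kernel):
    """h(u) = sum_i alpha_i * y_i * K(x_i, u) + b over the support vectors."""
    return sum(alpha * y * kernel(x, u) for x, y, alpha in support_vectors) + b

def classify(u, support_vectors, b, kernel=linear_kernel):
    """Return +1 on or above the decision boundary, -1 below it."""
    return 1 if decision_value(u, support_vectors, b, kernel) >= 0 else -1

# Hypothetical trained model: (point, label y, alpha) triples plus intercept b.
support_vectors = [((0, 0), +1, 0.0625), ((4, 4), -1, 0.0625)]
b = 1.0

for u in [(0, 0), (1, 1), (3, 3), (4, 4)]:
    print(u, decision_value(u, support_vectors, b), classify(u, support_vectors, b))
```

Only the support vectors appear in the sum, which is why points with alpha equal to 0 never need to be stored.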
Linear Kernel: In document classification, feature vectors are composed of binary word features: I(word=foo) outputs 1 if the word "foo" appears in the document, 0 if it does not.
Each document is represented as a feature vector of length |vocabulary|. The support vectors found are generally particularly salient documents (documents best at discriminating the topics being classified).
Alternate formula for the two support vector case:
This equation is useful when solving SVM problems in 1D or 2D, where the width of the road can be visually determined.
Common SVM Kernels:
nearby
Sigmoidal (tanh) Kernel
● Allows for combination of linear decision boundaries
Properties of tanh:
● Similar to the sigmoid function
● Ranges from -1 to +1.
● tanh(x) => +1 when x >> 0
● tanh(x) => -1 when x << 0
Resulting decision boundaries are logical combinations of linear boundaries. Not too different from second layer neurons in Neural Nets.
Like RBF, may exhibit overfitting when improperly used.
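Linear combination of Kernels
Idea: Kernel functions are closed under addition and scaling (by a positive number).
Scaling: $K'(\vec{u}, \vec{v}) = a K(\vec{u}, \vec{v})$ for a > 0
Linear combination: $K'(\vec{u}, \vec{v}) = a K_1(\vec{u}, \vec{v}) + b K_2(\vec{u}, \vec{v})$ for a, b > 0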
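The closure property can be illustrated by building new kernels out of existing ones. This is a sketch under assumptions: the RBF form exp(-|u-v|^2 / (2 sigma^2)) is the standard textbook one rather than a formula taken from this handout, the helper names are hypothetical, and the 2x2 Gram-matrix check is only a spot check, not a proof.

```python
import math

def linear_kernel(u, v):
    return sum(a * b for a, b in zip(u, v))

def rbf_kernel(u, v, sigma=1.0):
    # Assumed standard RBF form; sigma is an illustrative parameter.
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-squared_distance / (2 * sigma ** 2))

def scale(kernel, a):
    """K'(u, v) = a * K(u, v), valid only for a > 0."""
    if a <= 0:
        raise ValueError("scale factor must be positive")
    return lambda u, v: a * kernel(u, v)

def combine(k1, k2, a, b):
    """K'(u, v) = a * K1(u, v) + b * K2(u, v), valid only for a, b > 0."""
    if a <= 0 or b <= 0:
        raise ValueError("both coefficients must be positive")
    return lambda u, v: a * k1(u, v) + b * k2(u, v)

def gram_2x2_is_psd(kernel, p, q, tol=1e-12):
    """Spot check: a 2x2 Gram matrix is PSD iff its diagonal and determinant are >= 0."""
    kpp, kqq, kpq = kernel(p, p), kernel(q, q), kernel(p, q)
    return kpp >= -tol and kqq >= -tol and kpp * kqq - kpq * kpq >= -tol

mixed = combine(linear_kernel, rbf_kernel, 2.0, 0.5)
p, q = (1, 2), (3, -1)
print(mixed(p, q), gram_2x2_is_psd(mixed, p, q))
```

Rejecting non-positive coefficients in `scale` and `combine` mirrors the condition in the text: a negative multiple of a kernel is in general no longer a valid kernel.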
This is a step-by-step solution to Problem 2.A from 2006 quiz 4: We are given a graph with two points on the x-y axes: a +ve point $x_1$ at (0, 0) and a -ve point $x_2$ at (4, 4).
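Can an SVM separate this, i.e. is it linearly separable? Heck yeah! A straight line between the two points does it.
Part 2A: Provide a decision boundary: We can find the decision boundary by graphical inspection. It is the perpendicular bisector of the segment joining the two points, the line $y = -x + 4$.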
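Given the equation for the decision boundary, we next massage the algebra to get the decision boundary to conform to the desired form, namely $\vec{w} \cdot \vec{x} + b \ge 0$. The +ve point at (0, 0) must land on the positive side, so we write the boundary as:
$-x - y + 4 \ge 0$
Now we can read the solution from the equation coefficients:
$w_1 = -1$, $w_2 = -1$, $b = 4$
Next, using our formula for the width of the road, $2 / \lVert \vec{w} \rVert$, we check that these weights give a road width of:
$\frac{2}{\sqrt{(-1)^2 + (-1)^2}} = \frac{2}{\sqrt{2}} = \sqrt{2}$.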
WAIT! This is clearly not the width of the "widest" road/margin. We remember that any multiple c (c > 0) of the boundary equation is still the same decision boundary. So all equations of the form
$-cx - cy + 4c \ge 0$
describe this decision boundary. So here is a more general solution:
$w_1 = -c$, $w_2 = -c$, $b = 4c$
or $\vec{w} = [-c, -c]$ and $b = 4c$
Using The Width of the Road Constraint
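Graphically we see that the widest margin should be the distance between the two points:
$\sqrt{4^2 + 4^2} = 4\sqrt{2}$
The solution weight vector and intercept can be found by solving for c, constrained by the known width of the road. Length of $\vec{w}$ in terms of c:
$\lVert \vec{w} \rVert = \sqrt{(-c)^2 + (-c)^2} = c\sqrt{2}$
Now plugging all this into the margin width equation and solving for c, we get:
$\frac{2}{c\sqrt{2}} = 4\sqrt{2}$ => $2 = 4\sqrt{2} \cdot c\sqrt{2}$ => $2 = 8c$ => $c = \frac{1}{4}$
This means the true weight vector and intercept for the SVM solution should be:
$\vec{w} = [-\frac{1}{4}, -\frac{1}{4}]$ and $b = 1$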
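Next we solve for the alphas, using the $\vec{w}$ vector and the equation $\vec{w} = \sum_i \alpha_i y_i \vec{x}_i$.
Plugging in the vector values of the support vectors and $\vec{w}$:
$[-\frac{1}{4}, -\frac{1}{4}] = \alpha_1 (+1) [0, 0] + \alpha_2 (-1) [4, 4]$
We get two identical equations:
$-\frac{1}{4} = -4\alpha_2$, so $\alpha_2 = \frac{1}{16}$. Since the alphas with their signs must add to 0, $\alpha_1 = \alpha_2 = \frac{1}{16}$.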
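The geometric route for the two-point example above can be replayed numerically as a sanity check. This is a minimal sketch, assuming the boundary direction found by inspection, (-1, -1) with offset 4, and assuming the two support vectors sit directly across the road from each other so that the road width equals the distance between them; variable names are illustrative.

```python
import math

x_pos, y_pos = (0.0, 0.0), +1   # +ve support vector
x_neg, y_neg = (4.0, 4.0), -1   # -ve support vector

# Unscaled boundary read off by inspection: -x - y + 4 >= 0.
direction = (-1.0, -1.0)
offset = 4.0

# Known road width: the distance between the two support vectors.
width = math.dist(x_pos, x_neg)

# Solve 2 / (c * |direction|) = width for the scale factor c.
c = 2.0 / (width * math.hypot(*direction))
w = tuple(c * d for d in direction)
b = c * offset

# w = alpha_pos*y_pos*x_pos + alpha_neg*y_neg*x_neg; x_pos is the origin,
# so the first component alone determines alpha_neg.
alpha_neg = w[0] / (y_neg * x_neg[0])
# The signed alphas must add to 0, so the two alphas are equal here.
alpha_pos = alpha_neg

print(c, w, b, alpha_pos, alpha_neg)
```

The shortcut for `alpha_neg` works only because one support vector is at the origin; in general the alphas come from solving the full system of equations.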
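Step 2: Write out the system of equations, using the SVM constraints:
Constraint 1: $\sum_i \alpha_i y_i = 0$
Constraint 2: positive gutter, $\sum_i \alpha_i y_i K(\vec{x}_i, \vec{x}) + b = +1$ for each +ve support vector $\vec{x}$.
Constraint 3: negative gutter, $\sum_i \alpha_i y_i K(\vec{x}_i, \vec{x}) + b = -1$ for each -ve support vector $\vec{x}$.
This will yield 4 equations.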
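| Constraint | Coefficient of $\alpha_A$ | Coefficient of $\alpha_B$ | Coefficient of $\alpha_C$ | Coefficient of b | Right-hand side |
| C1 | -1 | -1 | 1 | 0 | 0 |
| C3.A | $y_A K(A, A) = -1 \cdot 0 = 0$ | $y_B K(B, A) = -1 \cdot 0 = 0$ | $y_C K(C, A) = +1 \cdot 2 = 2$ | 1 | -1 |
| C3.B | $y_A K(A, B) = -1 \cdot 0 = 0$ | $y_B K(B, B) = -1 \cdot 2 = -2$ | $y_C K(C, B) = +1 \cdot 2 = 2$ | 1 | -1 |
| C2.C | $y_A K(A, C) = -1 \cdot 0 = 0$ | $y_B K(B, C) = -1 \cdot 2 = -2$ | $y_C K(C, C) = +1 \cdot 4 = 4$ | 1 | +1 |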
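For clarity here are the four equations:
C1: $-\alpha_A - \alpha_B + \alpha_C = 0$
C3.A: $\alpha_A y_A K(A, A) + \alpha_B y_B K(B, A) + \alpha_C y_C K(C, A) + b = -1$
C3.B: $\alpha_A y_A K(A, B) + \alpha_B y_B K(B, B) + \alpha_C y_C K(C, B) + b = -1$
C2.C: $\alpha_A y_A K(A, C) + \alpha_B y_B K(B, C) + \alpha_C y_C K(C, C) + b = +1$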
Step 3: Use your favorite method of solving linear equations to solve for the 4 unknowns. Answer:
This is a more general way to solve for the SVM parameters, without the help of geometry. This method can be applied to problems where the "margin" width or the boundary equation cannot be derived by inspection (e.g. in more than 2 dimensions).
NOTE: We used the gutter constraints as equalities above because we are told that the given points lie on the "gutter". More realistically, if we were given more points, and not all points lay on the gutters, then we would be solving a system of inequalities (because the gutter equations are really constraints on >= 1 or <= -1).
The quadratic programming solvers used to solve SVMs do in fact just that: they minimize a target function subject to a system of linear inequality constraints.
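The "system of equations" method can be sketched with a small exact solver. This is an illustration under assumptions: it uses a two-point problem with a +ve point at (0, 0) and a -ve point at (4, 4), a linear kernel, and treats both points as lying on the gutters; `solve_linear_system` is a hypothetical helper written for the sketch, not a library routine, and it assumes the system has a unique solution.

```python
from fractions import Fraction

def solve_linear_system(rows):
    """Gauss-Jordan elimination on an augmented matrix, using exact fractions."""
    m = [[Fraction(v) for v in row] for row in rows]
    n = len(m)
    for col in range(n):
        # Find a row with a non-zero entry in this column (assumes one exists).
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        pivot_value = m[col][col]
        m[col] = [v / pivot_value for v in m[col]]
        for r in range(n):
            if r != col and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[col])]
    return [row[-1] for row in m]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Assumed training set: (point, label); both points lie on the gutters.
points = [((0, 0), +1), ((4, 4), -1)]

# Unknowns are (alpha_1, alpha_2, b); the last column is the right-hand side.
rows = [[y for _, y in points] + [0, 0]]                   # sum_i alpha_i*y_i = 0
for x_j, y_j in points:                                    # gutter equation at x_j
    rows.append([y_i * dot(x_i, x_j) for x_i, y_i in points] + [1, y_j])

alpha_1, alpha_2, b = solve_linear_system(rows)
print(alpha_1, alpha_2, b)
```

With more points than gutters the equalities become inequalities, and a quadratic programming solver replaces the elimination step.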
From Part 2E of 2006 Q4. You are given the graph below and the following kernel:
and you are asked to find the equation of the decision boundary.
Step 1: First, decompose the kernel into a dot product of functions:
Answer:
Step 2: Convert all our original points into the new space using the transform. (We are going from 2D to 1D). Positive points are at:
Negative points are at:
Step 3: Plot the points in the new space. They fall on a line from 0 to 8, with positive points at 0, 2, 4 and negative points at 6, 8.
The support vectors are the points at values 4 and 6, and the margin lies between them.
Hence the decision boundary (maximum margin) should be at the midpoint, 5, with the positive side being transformed values less than 5.
The inequality is < because the positive points are all less than 5.
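Once the points have been mapped to one dimension, the maximum-margin boundary is just the midpoint between the two innermost points. A minimal sketch, using the transformed values from Step 3; the function name is hypothetical.

```python
positives = [0, 2, 4]   # transformed +ve points
negatives = [6, 8]      # transformed -ve points

def max_margin_threshold_1d(lower_class, upper_class):
    """Return (threshold, road width) for two classes separable on a line."""
    inner_lower = max(lower_class)   # support vector of the lower class
    inner_upper = min(upper_class)   # support vector of the upper class
    if inner_lower >= inner_upper:
        raise ValueError("classes are not separable by a single threshold")
    return (inner_lower + inner_upper) / 2, inner_upper - inner_lower

threshold, width = max_margin_threshold_1d(positives, negatives)

def classify(value):
    """+1 when the transformed value is below the threshold, else -1."""
    return 1 if value < threshold else -1

print(threshold, width, [classify(v) for v in positives + negatives])
```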
Expanding the determined decision boundary in terms of components of x, we get:
Square both sides:
Suppose you have the above set of points. Let's solve the SVM parameters by inspection.
=> =>
plugging in in length of w, and solving for c:
=>
and
a) From expanding the first equation, we get:
which leads to two equations:
or and or
b) From expanding the second equation , we get:
or
c) Putting the equations from a) and b) together we can solve for the other two alphas.
or and similarly for
We see that the two +ve support vector alphas are split based on the ratio of distances
determined by s and t. If t = s were equal, then = =
Q: Suppose we moved point A to the origin at (0, 0). What happens to and?
A: This configuration basically implies s = 0; so we get: and.
Conceptually, point A now becomes the sole primary support vector on the +ve side because it sits directly across from point B. Point A takes up all the share of the "pressure" in holding up the margin; point C, though still on the gutter, effectively becomes a non-support vector. So this implies that points on the gutter may not always serve the role of being a support vector.
Q: Suppose we changed k by moving point B up or down the y-axis. What happens to the alphas?
A: All the alphas are proportional to
If k decreases, the road narrows and the alphas increase. Analogy: the supports need to apply more "pressure" to push the margin tighter.
If k increases, the road widens and the alphas decrease. Analogy: a wider road needs less "pressure" on the supports to hold it in place.
Possible Termination conditions:
A calculator-free method for finding weight updates quickly
Replace the Weight Update Step 1c above with these steps.
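where the denominator d is the same for all weights.
Compute the new denominator for incorrect points, which is the sum of all the incorrect numerators times two.
Compute the new denominator for (uncircled) correct points: the sum of all the correct numerators times two.
Each new weight keeps its old numerator. It takes the incorrect-points denominator if the point was incorrect, and the correct-points denominator if it was correct.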
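The numerator and denominator bookkeeping can be checked against exact fractions. This is a sketch, assuming weights of the form numerator/d and using the fact that after a boosting reweighting the misclassified points and the correctly classified points each carry total weight 1/2; the function name and the four-point example are illustrative.

```python
from fractions import Fraction

def update_weights(weights, is_correct):
    """Rescale so that incorrect and correct points each sum to 1/2.

    Dividing by twice the group's total weight is the same as keeping each
    numerator and using 2 * (sum of that group's numerators) as denominator.
    """
    wrong_total = sum(w for w, ok in zip(weights, is_correct) if not ok)
    right_total = sum(w for w, ok in zip(weights, is_correct) if ok)
    if wrong_total == 0 or right_total == 0:
        raise ValueError("need at least one correct and one incorrect point")
    return [
        w / (2 * right_total) if ok else w / (2 * wrong_total)
        for w, ok in zip(weights, is_correct)
    ]

# Four points with weight 1/4 each (numerator 1, common denominator 4);
# only the third point is misclassified by the chosen stump.
old_weights = [Fraction(1, 4)] * 4
is_correct = [True, True, False, True]

new_weights = update_weights(old_weights, is_correct)
print(new_weights)
```

In the example the incorrect numerators sum to 1, so the incorrect point gets denominator 2; the correct numerators sum to 3, so the correct points get denominator 6.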
Quizzes often ask you for the error of the final H(x) ensemble classifier on the training data. Here is a quick way to compute the output of H(x) without calculating logarithms.
Step 1: Compute the sign of each stump h(x) on the given data point.
Step 2: Compute the product of the log arguments of the +ve stumps and the product of the log arguments of the -ve stumps.
If the product for the +ve stumps is larger, H(x) outputs +ve. If the product for the -ve stumps is larger, H(x) outputs -ve.
Example: suppose the log arguments of three stumps $h_1$, $h_2$, $h_3$ are 5, 2 and 2.
If $h_1$ is +ve, $h_2$ is +ve and $h_3$ is -ve: (5 * 2) > 2, so H(x) should output +ve.
If $h_1$ is +ve, $h_2$ is -ve and $h_3$ is -ve: 5 > (2 * 2), so H(x) should output +ve.
Step 3: Once you've computed all of the H(x) output values on the training data points, count the number of cases where H(x) disagrees with the true output. That is the error.
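The shortcut works because a sum of logarithms is the logarithm of a product, so comparing products gives the same sign as comparing weighted votes. A minimal sketch, assuming three stumps whose weights are logarithms of 5, 2 and 2 (any common positive factor in front of the logarithms does not change the sign); function names are hypothetical.

```python
import math

def vote_by_products(stump_outputs, log_arguments):
    """Compare the product of log arguments of +ve stumps against -ve stumps."""
    positive = math.prod(a for h, a in zip(stump_outputs, log_arguments) if h > 0)
    negative = math.prod(a for h, a in zip(stump_outputs, log_arguments) if h < 0)
    if positive == negative:
        return 0  # tie: the weighted vote is exactly zero
    return 1 if positive > negative else -1

def vote_by_logs(stump_outputs, log_arguments):
    """Reference version: sign of the sum of log(argument) * stump output."""
    total = sum(h * math.log(a) for h, a in zip(stump_outputs, log_arguments))
    return 1 if total > 0 else -1

log_arguments = [5, 2, 2]
cases = [(+1, +1, -1), (+1, -1, -1), (-1, +1, +1)]
results = [vote_by_products(case, log_arguments) for case in cases]
print(results)
```

An empty product is 1, which matches a logarithm sum of 0 when no stump votes on one side.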
Dear TA, how do I determine if a stump will "never" be used (such as for part 1.A of 2006 Q4)?
Test stumps that are never used are ones that make more errors than some pre-existing test stump. In other words, if the set of mistakes stump X makes is a strict superset of the mistakes stump Y makes, then Error(X) > Error(Y) is always true, no matter what weight distribution we use. Hence we will always choose Y over X because it makes fewer errors. So X will never be used!
Here is the answer to problem 1A from the 2006 Q4 with explanation. Setup: We are given the tests and the mistakes they make on the training examples, and we are asked to cross out the tests that are never used.
| Test | Misclassified examples | Never used? | Reason |
| TRUE | 1,2,3,5 | Yes | superset of G=Y or U!=N |
| FALSE | 4,6 | Yes | superset of U=M |
| C=Y | 1,6 | No | |
| C=N | 2,3,4,5 | Yes | superset of G=Y or U=M |
| U=Y | 1,2,3,6 | Yes | superset of U!=N |
| U!=Y | 4,5 | Yes | superset of G=Y or U=M |
| U=N | 4,5,6 | Yes | superset of G=Y or U=M |
| U!=N | 1,2,3 | No | |
| U=M | 4 | No | |
| U!=M | 1,2,3,5,6 | Yes | superset of G=Y, C=Y or U!=N |
| G=Y | 5 | No | |
| G=N | 1,2,3,4,6 | Yes | superset of U=M, C=Y or U!=N |
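The superset rule can be applied mechanically to the table. This is a minimal sketch: the mistake sets are transcribed from the table, with the sets for U!=M and G=N taken as the complements of U=M and G=Y over examples 1 to 6 (an assumption), and the function name is hypothetical.

```python
mistakes = {
    "TRUE": {1, 2, 3, 5}, "FALSE": {4, 6},
    "C=Y": {1, 6},        "C=N": {2, 3, 4, 5},
    "U=Y": {1, 2, 3, 6},  "U!=Y": {4, 5},
    "U=N": {4, 5, 6},     "U!=N": {1, 2, 3},
    "U=M": {4},           "U!=M": {1, 2, 3, 5, 6},  # assumed complement of U=M
    "G=Y": {5},           "G=N": {1, 2, 3, 4, 6},   # assumed complement of G=Y
}

def never_used(mistakes):
    """Map each dominated test to the tests whose mistakes are a strict subset."""
    dominated = {}
    for name, errors in mistakes.items():
        better = sorted(
            other for other, other_errors in mistakes.items()
            if other_errors < errors   # '<' on sets means strict subset
        )
        if better:
            dominated[name] = better
    return dominated

dominated = never_used(mistakes)
usable = set(mistakes) - set(dominated)
print(sorted(usable))
```

The code lists every dominating test, so a row may show more reasons than the table, which names only a few.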
Food For thought: Suppose we were to come up with a strong classifier that is a uniform combination of stumps (equal weights). Q: How many mis-classifications would the following classifier commit?
H(x) = h(FALSE) + h(C=Y) + h(U!=Y)
A: Combining the misclassification sets of the stumps: {4, 6}, {1, 6}, {4, 5}.
Points 1, 5 will be misclassified by 1 stump and correctly classified by 2 stumps. So H(x) will be correct on 1, 5.
Points 4, 6 will be misclassified by 2 stumps and correctly classified by 1 stump. So H(x) will misclassify 4, 6.
Therefore the points H(x) will misclassify will be {4, 6}.
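The counting argument can be sketched as a majority vote over the stumps' mistake sets. This assumes equal stump weights and an odd number of stumps, so no ties arise; the function name is hypothetical.

```python
from collections import Counter

def ensemble_mistakes(stump_mistakes):
    """Points that more than half of the equally weighted stumps get wrong."""
    wrong_counts = Counter(p for errors in stump_mistakes for p in errors)
    half = len(stump_mistakes) / 2
    return {point for point, count in wrong_counts.items() if count > half}

# Mistake sets of h(FALSE), h(C=Y) and h(U!=Y).
stump_mistakes = [{4, 6}, {1, 6}, {4, 5}]
misclassified = ensemble_mistakes(stump_mistakes)
print(sorted(misclassified))
```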
The proof for incorrect points will yield the same result. This shows that the denominator update rule used in step 3 can be derived directly from the weight update equations so it is correct.