Support Vector Machines-Artificial Intelligence-Tutorial Handout, Exercises of Artificial Intelligence

Madam Amrita Ahuja distributed this handout in class of Artificial Intelligence course at Central University of Jammu and Kashmir. This handout explains important concepts including: Vector, Machines, Lagrange, Multiplier, Equation, Svm, Decision, Boundary, Constraints, Kernels

Typology: Exercises

2011/2012

Uploaded on 07/31/2012

shaina_44kin
SVM and Boosting

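Note that $\alpha_i \ge 0$, and $\alpha_i = 0$ for non-support vectors.

$\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i$, for use when using a linear kernel. The summation only contains support vectors. Support vectors are training data points with $\alpha_i > 0$.

$\mathbf{w} = \sum_i \alpha_i y_i \phi(\mathbf{x}_i)$, for use when using a decomposable kernel (see definition below).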

Support Vector Machines

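In SVMs we are trying to find a decision boundary that maximizes the "margin" or the "width of the road" separating the positive from the negative training data points.

To find this we minimize $\frac{1}{2}\|\mathbf{w}\|^2$ subject to the constraints $y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1$.

The resulting Lagrange multiplier equation we try to optimize is:

$$L = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_i \alpha_i \left[ y_i(\mathbf{w} \cdot \mathbf{x}_i + b) - 1 \right]$$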

Solving the above Lagrangian optimization problem gives us w, b, and the alphas, the parameters that determine a unique maximal margin (road) solution. On the maximum margin "road", the +ve and -ve points that lie on the "gutter" lines are called support vectors. The decision boundary lies at the middle of the road. The definition of the "road" depends only on the support vectors, so changing (adding or deleting) non-support vector points will not change the solution. Note that the widest "road" is a 2D concept. If the problem is in 3D we want the widest region bounded by two planes; in even higher dimensions, a region bounded by two hyperplanes.

Solving for the Lagrange multipliers in general requires numerical optimization methods that are beyond the scope of this class. In practice, you use Quadratic Programming solvers. A popular algorithm for solving SVMs is Platt's SMO (Sequential Minimal Optimization) algorithm.

For SVM problems on quizzes, we generally just ask you to solve for the values of w, b and alphas using algebra and/or geometry.

Useful Equations for solving SVM questions

A. Equations derived from optimizing the Lagrangian:

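  1. Partial of the Lagrangian with respect to b: from $\partial L / \partial b = 0$,

     $$\sum_i \alpha_i y_i = 0$$

     The alphas (support vector weights), taken with their signs $y_i$, should add to 0.

  2. Partial of the Lagrangian with respect to w: from $\partial L / \partial \mathbf{w} = 0$,

     $$\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i$$

     The vector w is the sum of the support vectors, each weighted by its alpha and its sign $y_i$.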

B. Equations from the boundaries and constraints:

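  3. The decision boundary:

     $$h(\mathbf{x}) = \sum_i \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) + b \ge 0$$

     General form, for any kernel. To classify an unknown $\mathbf{x}$, we compute the kernel function against each of the support vectors. Support vectors are training data points with $\alpha_i > 0$.

     $$h(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + b \ge 0$$

     For use when the kernel is linear.

  4. Positive gutter:

     $$h(\mathbf{x}) = \sum_i \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) + b = 1$$

     General form, for any kernel.

     $$h(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + b = 1$$

     For use when the kernel is linear.

  5. Negative gutter:

     $$h(\mathbf{x}) = \sum_i \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) + b = -1$$

     General form, for any kernel.

     $$h(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + b = -1$$

     For use when the kernel is linear.

  6. The width of the margin (or road):

     $$m = \frac{2}{\|\mathbf{w}\|}$$

     where $\|\mathbf{w}\| = \sqrt{\sum_j w_j^2}$.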
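As a quick illustration of how the equations above fit together, here is a minimal Python sketch. It assumes a linear kernel, and the support vectors, labels, alphas, and b are illustrative values chosen so that both Lagrangian equations hold; the function names are hypothetical and not taken from any library.

```python
from math import sqrt

def linear_kernel(u, v):
    """Linear kernel: the plain dot product of u and v."""
    return sum(ui * vi for ui, vi in zip(u, v))

def decision_value(x, support_vectors, labels, alphas, b, kernel=linear_kernel):
    """h(x) = sum_i alpha_i * y_i * K(x_i, x) + b, summed over support vectors only."""
    total = sum(a * y * kernel(sv, x)
                for sv, y, a in zip(support_vectors, labels, alphas))
    return total + b

def weight_vector(support_vectors, labels, alphas):
    """w = sum_i alpha_i * y_i * x_i (valid for the linear kernel)."""
    dims = len(support_vectors[0])
    return [sum(a * y * sv[d] for sv, y, a in zip(support_vectors, labels, alphas))
            for d in range(dims)]

def margin_width(w):
    """Width of the road: 2 / ||w||."""
    return 2.0 / sqrt(sum(wi * wi for wi in w))

# Illustrative values (assumed for this sketch): one +ve and one -ve support vector.
support_vectors = [(0.0, 0.0), (4.0, 4.0)]
labels = [+1, -1]
alphas = [1 / 16, 1 / 16]   # the signed alphas add to 0
b = 1.0

w = weight_vector(support_vectors, labels, alphas)
print("w =", w)
print("h at the +ve support vector:", decision_value((0.0, 0.0), support_vectors, labels, alphas, b))
print("h at the -ve support vector:", decision_value((4.0, 4.0), support_vectors, labels, alphas, b))
print("margin width:", margin_width(w))
```

The two support vectors evaluate to +1 and -1, which is exactly what the gutter equations require, and the sign of `decision_value` classifies any other point.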

Alternate formula for the two support vector case:

This equation is useful when solving SVM problems in 1D or 2D, where the width of the road can be visually determined.

Common SVM Kernels:

Linear Kernel

In document classification, feature vectors are composed of binary word features: I(word=foo) outputs 1 if the word "foo" appears in the document, 0 if it does not.

Each document is represented as a feature vector of length |vocabulary|. The support vectors found are generally particularly salient documents (documents best at discriminating the topics being classified).

Sigmoidal (tanh) Kernel

● Allows for combination of linear decision boundaries.

Properties of tanh:

● Similar to the sigmoid function.
● Ranges from -1 to +1.
● tanh(x) => +1 when x >> 0
● tanh(x) => -1 when x << 0

Resulting decision boundaries are logical combinations of linear boundaries. Not too different from second layer neurons in Neural Nets.

Like RBF, may exhibit overfitting when improperly used.

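Linear combination of Kernels

Idea: Kernel functions are closed under addition and scaling (by a positive number).

Scaling: $K'(\mathbf{u}, \mathbf{v}) = a\,K(\mathbf{u}, \mathbf{v})$ for a > 0

Linear combination: $K'(\mathbf{u}, \mathbf{v}) = a\,K_1(\mathbf{u}, \mathbf{v}) + b\,K_2(\mathbf{u}, \mathbf{v})$ for a, b > 0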
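The closure idea can be sketched in Python as functions that take kernels and return new kernels. This is only an illustration: the helper names are hypothetical, and the tanh kernel below uses one common form, tanh(k u·v + c), with assumed parameters k and c.

```python
from math import tanh

def linear_kernel(u, v):
    """Linear kernel: dot product of u and v."""
    return sum(ui * vi for ui, vi in zip(u, v))

def sigmoid_kernel(u, v, k=1.0, c=0.0):
    """A common tanh kernel form; k and c are assumed parameters for this sketch."""
    return tanh(k * linear_kernel(u, v) + c)

def scale_kernel(kernel, a):
    """Return the kernel a * K(u, v); the scale a must be positive."""
    if a <= 0:
        raise ValueError("scale factor must be positive")
    return lambda u, v: a * kernel(u, v)

def combine_kernels(kernel_1, kernel_2, a, b):
    """Return the kernel a * K1(u, v) + b * K2(u, v); a and b must be positive."""
    if a <= 0 or b <= 0:
        raise ValueError("coefficients must be positive")
    return lambda u, v: a * kernel_1(u, v) + b * kernel_2(u, v)

u, v = (1, 2), (3, 4)
doubled = scale_kernel(linear_kernel, 2)
mixed = combine_kernels(linear_kernel, sigmoid_kernel, 2, 3)
print(doubled(u, v))             # 2 * (1*3 + 2*4)
print(mixed(u, v), mixed(v, u))  # the combination stays symmetric in its arguments
```

Returning closures keeps each combined kernel usable anywhere a plain kernel function is expected.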

Method 1 of Solving SVM parameters by inspection:

This is a step-by-step solution to Problem 2.A from 2006 quiz 4: We are given a graph with one +ve and one -ve point on the x-y axes:

a +ve point $\mathbf{x}_1$ at (0, 0) and a -ve point $\mathbf{x}_2$ at (4, 4).

Can an SVM separate this? I.e., is it linearly separable? Heck yeah! Using the line above.

Part 2A: Provide a decision boundary. We can find the decision boundary by graphical inspection.

  1. The decision boundary lies on the line: y = -x + 4
  2. We have a +ve support vector at (0, 0) with line equation y = -x
  3. We have a -ve support vector at (4, 4) with line equation y = - x + 8

Given the equation for the decision boundary, we next massage the algebra to get the decision boundary to conform with the desired form, namely $\mathbf{w} \cdot \mathbf{x} + b \ge 0$:

  1. $y < -x + 4$ (< because +ve is below the line)
  2. $-y > x - 4$ (multiplied by -1)
  3. $(-1)x + (-1)y + 4 > 0$ (writing out the coefficients explicitly)

Now we can read the solution from the equation coefficients:

$w_1 = -1$, $w_2 = -1$, $b = 4$

Next, using our formula for the width of the road, we check that these weights give a road width of:

$$\frac{2}{\|\mathbf{w}\|} = \frac{2}{\sqrt{(-1)^2 + (-1)^2}} = \frac{2}{\sqrt{2}} = \sqrt{2}$$

WAIT! This is clearly not the width of the "widest" road/margin. We remember that any multiple c (c > 0) of the boundary equation is still the same decision boundary. So every equation of the form

$$-cx - cy + 4c > 0$$

describes this decision boundary. So here is a more general solution:

$w_1 = -c$, $w_2 = -c$, $b = 4c$

or $\mathbf{w} = [-c, -c]$ and $b = 4c$

Using The Width of the Road Constraint

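Graphically we see that the widest margin should be the distance between the two support vectors:

$$\sqrt{4^2 + 4^2} = 4\sqrt{2}$$

The solution weight vector and intercept can be found by solving for c, constrained by the known width of the road. Length of $\mathbf{w}$ in terms of c:

$$\|\mathbf{w}\| = \sqrt{(-c)^2 + (-c)^2} = c\sqrt{2}$$

Now plug all this into the margin width equation and solve for c:

$$\frac{2}{c\sqrt{2}} = 4\sqrt{2} \;\Rightarrow\; c = \frac{2}{4\sqrt{2} \cdot \sqrt{2}} \;\Rightarrow\; c = \frac{2}{8} \;\Rightarrow\; c = \frac{1}{4}$$

This means the true weight vector and intercept for the SVM solution should be:

$\mathbf{w} = \left[-\frac{1}{4}, -\frac{1}{4}\right]$ and $b = 1$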

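Next we solve for the alphas, using the w vector and the Lagrangian equation $\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i$.

Plug in the vector values of the support vectors and w:

$$\begin{bmatrix} -1/4 \\ -1/4 \end{bmatrix} = \alpha_1 (+1) \begin{bmatrix} 0 \\ 0 \end{bmatrix} + \alpha_2 (-1) \begin{bmatrix} 4 \\ 4 \end{bmatrix}$$

We get two identical equations:

$$-\frac{1}{4} = -4\alpha_2$$

so $\alpha_2 = \frac{1}{16}$. Since the alphas with their signs must add to 0, $\alpha_1 = \alpha_2 = \frac{1}{16}$.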
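The rescaling trick in this method (multiply the boundary coefficients by c so that the road width matches what we see on the graph) can be sketched in Python. The helper name `rescale_to_margin` is hypothetical, and the sketch assumes we already know the target width, here the distance between the two support vectors.

```python
from math import sqrt

def rescale_to_margin(w, b, width):
    """Scale (w, b) by c > 0 so that 2 / ||c * w|| equals the given road width."""
    norm = sqrt(sum(wi * wi for wi in w))
    c = 2.0 / (width * norm)
    return [c * wi for wi in w], c * b, c

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

# Coefficients read off the boundary y = -x + 4 written as -x - y + 4 > 0.
w_raw, b_raw = [-1.0, -1.0], 4.0

# Road width seen on the graph: distance from the +ve point (0, 0) to the -ve point (4, 4).
x_pos, x_neg = (0.0, 0.0), (4.0, 4.0)
width = sqrt(sum((p - n) ** 2 for p, n in zip(x_pos, x_neg)))

w_opt, b_opt, c = rescale_to_margin(w_raw, b_raw, width)
print("c =", c)
print("w =", w_opt, "b =", b_opt)
# Both support vectors should land on their gutters (+1 and -1), up to rounding.
print(dot(w_opt, x_pos) + b_opt, dot(w_opt, x_neg) + b_opt)
```

Checking that the support vectors evaluate to +1 and -1 is a cheap way to catch a wrong sign or a wrong scale before solving for the alphas.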

Step 2: Write out the system of equations, using SVM constraints:

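Constraint 1: $\sum_i \alpha_i y_i = 0$

Constraint 2: $\sum_i \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) + b = +1$ for each point $\mathbf{x}$ on the positive gutter.

Constraint 3: $\sum_i \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) + b = -1$ for each point $\mathbf{x}$ on the negative gutter.

This will yield 4 equations.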

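| Equation | Coefficient of $\alpha_A$ | Coefficient of $\alpha_B$ | Coefficient of $\alpha_C$ | Coefficient of b | Right-hand side |
|---|---|---|---|---|---|
| C1 | -1 | -1 | 1 | 0 | 0 |
| C3.A | $y_A K(A, A) = -1 \cdot 0 = 0$ | $y_B K(B, A) = -1 \cdot 0 = 0$ | $y_C K(C, A) = +1 \cdot 2 = 2$ | 1 | -1 |
| C3.B | $y_A K(A, B) = -1 \cdot 0 = 0$ | $y_B K(B, B) = -1 \cdot 2 = -2$ | $y_C K(C, B) = +1 \cdot 2 = 2$ | 1 | -1 |
| C2.C | $y_A K(A, C) = -1 \cdot 0 = 0$ | $y_B K(B, C) = -1 \cdot 2 = -2$ | $y_C K(C, C) = +1 \cdot 4 = 4$ | 1 | +1 |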

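For clarity here are the four equations:

C1: $-\alpha_A - \alpha_B + \alpha_C = 0$

C3.A: $2\alpha_C + b = -1$

C3.B: $-2\alpha_B + 2\alpha_C + b = -1$

C2.C: $-2\alpha_B + 4\alpha_C + b = +1$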

Step 3: Use your favorite method of solving linear equations to solve for the 4 unknowns. Answer:

This is a more general way to solve for SVM parameters, without the help of geometry. This method can be applied to problems where the "margin" width or boundary equation cannot be derived by inspection (e.g. in more than two dimensions).

NOTE: We used the gutter constraints as equalities above because we are told that the given points lie on the "gutter". More realistically, if we were given more points, and not all points lay on the gutters, then we would be solving a system of inequalities (because the gutter equations are really constraints on >= 1 or <= -1).

In the quadratic programming solvers used to solve SVMs, we are in fact doing just that: we are minimizing a target function subject to a system of linear inequality constraints.
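To make the system-of-equations method concrete, here is a hedged Python sketch that applies it to the simpler two-point problem from Method 1 (a +ve point at (0, 0) and a -ve point at (4, 4)) with a linear kernel. Both points are assumed to lie on their gutters, so the constraints become equalities. The solver is a small exact Gauss-Jordan routine written for this sketch, not a library call.

```python
from fractions import Fraction

def solve_linear_system(coefficients, rhs):
    """Gauss-Jordan elimination with exact fractions; assumes a unique solution exists."""
    n = len(coefficients)
    m = [[Fraction(v) for v in row] + [Fraction(r)] for row, r in zip(coefficients, rhs)]
    for col in range(n):
        # Find a row at or below the diagonal with a non-zero entry in this column.
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [v / m[col][col] for v in m[col]]
        for r in range(n):
            if r != col and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [row[-1] for row in m]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

points = [(0, 0), (4, 4)]
labels = [+1, -1]

# Unknowns are ordered [alpha_1, alpha_2, b].
rows = [[labels[0], labels[1], 0]]   # Constraint 1: sum_i alpha_i * y_i = 0
rhs = [0]
for x_j, y_j in zip(points, labels):
    # Gutter constraint at x_j: sum_i alpha_i * y_i * K(x_i, x_j) + b = y_j
    rows.append([y_i * dot(x_i, x_j) for x_i, y_i in zip(points, labels)] + [1])
    rhs.append(y_j)

alpha_1, alpha_2, b = solve_linear_system(rows, rhs)
print(alpha_1, alpha_2, b)
```

Exact fractions are used so the result can be compared directly with hand algebra; the values agree with what Method 1 finds by inspection.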

Example of SVMs with a Non-Linear Kernel

From Part 2E of 2006 Q4. You are given the graph below and the following kernel:

and you are asked to solve for the equation of the decision boundary.

Step 1: First, decompose the kernel into a dot product of functions:

Answer:

Step 2: Convert all our original points into the new space using the transform. (We are going from 2D to 1D). Positive points are at:

Negative points are at:

Step 3: Plot the points in the new space. This appears as a line from 0 to 8, with positive points at 0, 2, 4 and negative points at 6, 8.

The support vectors are the positive point at 4 and the negative point at 6, so the margin lies between the values 4 and 6.

Hence the decision boundary (maximum margin) should be at the midpoint, where $\phi$ denotes the transform from Step 1:

$$\phi(\mathbf{x}) < 5$$

The < is due to the positive points all being less than 5.
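Once the points are mapped to one dimension, finding the maximum margin boundary reduces to taking a midpoint. A minimal Python sketch, assuming the transformed values listed in Step 3 and that all positives lie below all negatives (the function name is hypothetical):

```python
def max_margin_threshold_1d(positives, negatives):
    """Return (threshold, width) for 1D points where every positive is below every negative."""
    highest_positive = max(positives)
    lowest_negative = min(negatives)
    if highest_positive >= lowest_negative:
        raise ValueError("points are not separable with positives below negatives")
    threshold = (highest_positive + lowest_negative) / 2
    width = lowest_negative - highest_positive
    return threshold, width

# Transformed positions of the points in the new 1D space.
threshold, width = max_margin_threshold_1d([0, 2, 4], [6, 8])
print(threshold, width)   # points with transformed value below the threshold are +ve
```

Only the innermost point of each class matters here, which mirrors the earlier remark that the road is defined by the support vectors alone.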

Expanding the determined decision boundary in terms of components of x, we get:

Square both sides:

An Abstract Lesson on Support Vector Behavior

Suppose you have the above set of points. Let's solve the SVM parameters by inspection.

  1. Boundary equation:

=> =>

  2. Read off the w and b and multiply by c (c > 0):
  3. Now apply the width of the road/margin constraint:

plugging in the length of w, and solving for c:

=>

  4. Now we have the SVM optimal solutions for w and b:
  5. Next, solve for the alphas using the two Lagrangian equations:

$\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i$ and $\sum_i \alpha_i y_i = 0$

a) From expanding the first equation, we get:

which leads to two equations:

or and or

b) From expanding the second equation, we get:

$\alpha_A - \alpha_B + \alpha_C = 0$ or $\alpha_B = \alpha_A + \alpha_C$

c) Putting the equations from a) and b) together we can solve for the other two alphas.

or and similarly for

We see that the two +ve support vector alphas are split based on the ratio of distances determined by s and t. If s = t, then $\alpha_A = \alpha_C = \frac{\alpha_B}{2}$.
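The algebra of this example can be checked numerically with a short Python sketch. The coordinates are an assumption made for illustration: the +ve points are taken to be A = (-s, 0) and C = (t, 0), and the -ve point B = (0, k), which gives a boundary at y = k/2 and a road of width k. Under that assumed layout, w = [0, -2/k] and b = 1, and the alphas follow from the two Lagrangian equations.

```python
from fractions import Fraction

def abstract_lesson_alphas(s, t, k):
    """Alphas for the ASSUMED layout A=(-s,0) +ve, C=(t,0) +ve, B=(0,k) -ve."""
    s, t, k = Fraction(s), Fraction(t), Fraction(k)
    alpha_b = 2 / k**2                  # from the y-component of w = sum alpha*y*x
    alpha_a = alpha_b * t / (s + t)     # x-component gives s*alpha_A = t*alpha_C
    alpha_c = alpha_b * s / (s + t)     # and alpha_A + alpha_C = alpha_B
    return alpha_a, alpha_b, alpha_c

def check_lagrangian_equations(s, t, k):
    """Recompute w from the alphas and confirm both Lagrangian equations hold."""
    alpha_a, alpha_b, alpha_c = abstract_lesson_alphas(s, t, k)
    s, t, k = Fraction(s), Fraction(t), Fraction(k)
    points = [(-s, Fraction(0)), (Fraction(0), k), (t, Fraction(0))]
    labels = [+1, -1, +1]
    alphas = [alpha_a, alpha_b, alpha_c]
    w = [sum(a * y * p[d] for a, y, p in zip(alphas, labels, points)) for d in range(2)]
    signed_sum = sum(a * y for a, y in zip(alphas, labels))
    return w == [0, -2 / k] and signed_sum == 0

print(abstract_lesson_alphas(1, 3, 2))   # unequal split between A and C
print(abstract_lesson_alphas(2, 2, 2))   # s = t: A and C each get half of alpha_B
print(abstract_lesson_alphas(0, 3, 2))   # s = 0: A carries everything, alpha_C is 0
print(check_lagrangian_equations(1, 3, 2))
```

The point closer to the position directly across from B receives the larger alpha, and every alpha scales with 1/k² in this layout.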

Observation A:

Q: Suppose we moved point A to the origin at (0, 0). What happens to $\alpha_A$ and $\alpha_C$?

A: This configuration basically implies s = 0, so we get $\alpha_A = \alpha_B$ and $\alpha_C = 0$.

Conceptually, point A now becomes the sole primary +ve support vector because it sits directly across from point B. Point A takes up all the share of the "pressure" in holding up the margin; point C, though still on the gutter, effectively becomes a non-support vector. So this implies that points on the gutter do not always serve the role of being a support vector.

Observation B:

Q: Suppose we changed k by moving point B up or down the y-axis. What happens to the alphas?

A: All the alphas are proportional to $\frac{1}{k^2}$.

If k decreases, the road narrows and the alphas increase. Analogy: the supports need to apply more "pressure" to push the margin tighter.

If k increases, the road widens and the alphas decrease. Analogy: a wider road needs less "pressure" on the supports to hold it in place.

Possible Termination conditions:

  1. Stop after T rounds (we manually set some T.)
  2. Stop after H(x) (final classifier) has error = 0 on training data or < some error threshold.
  3. Stop when you can't find any more stumps h(x) where weighted error is < 0.5. (i.e. All stumps have E = 0.5).

The Numerator-Denominator method

A calculator-free method for finding weight updates quickly

Replace the Weight Update Step 1c above with these steps.

  1. Write all weights in the form of

     $$w_i = \frac{n_i}{d}$$

     where the denominator d is the same for all weights.

  2. Circle the data points that are incorrectly classified.
  3. Compute the new denominator for the (circled) incorrectly classified points:

     $$d_{\text{incorrect}} = 2 \sum_{i \in \text{incorrect}} n_i$$

     which is the sum of all the incorrect numerators times two. Compute the new denominator for the (uncircled) correct points:

     $$d_{\text{correct}} = 2 \sum_{i \in \text{correct}} n_i$$

     which is the sum of all the correct numerators times two.

  4. New weights are the old numerators divided by the updated denominators found in step 3:

     $$w_i^{\text{new}} = \frac{n_i}{d_{\text{incorrect}}} \text{ if incorrect}, \qquad w_i^{\text{new}} = \frac{n_i}{d_{\text{correct}}} \text{ if correct.}$$

  5. Adjust all the numerators and denominators such that the denominator is again the same for all weights. Optional: check that the correct weights add up to 1/2 and the incorrect weights also add up to 1/2.
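The numerator-denominator steps translate almost line for line into Python. This is a sketch under the assumption that at least one point is misclassified and at least one is correct; the function name and the four-point example are made up for illustration.

```python
from fractions import Fraction

def numerator_denominator_update(numerators, misclassified):
    """Update boosting weights given numerators over a shared denominator.

    numerators: list of integer numerators n_i (the shared denominator is not needed).
    misclassified: set of indices of the incorrectly classified (circled) points.
    """
    incorrect_total = sum(n for i, n in enumerate(numerators) if i in misclassified)
    correct_total = sum(n for i, n in enumerate(numerators) if i not in misclassified)
    d_incorrect = 2 * incorrect_total   # new denominator for circled points
    d_correct = 2 * correct_total       # new denominator for uncircled points
    return [Fraction(n, d_incorrect if i in misclassified else d_correct)
            for i, n in enumerate(numerators)]

# Four points with equal weights 1/4 (numerators 1, 1, 1, 1); point 0 is misclassified.
new_weights = numerator_denominator_update([1, 1, 1, 1], {0})
print(new_weights)

# Optional check from step 5: each group should now carry total weight 1/2.
print(new_weights[0], sum(new_weights[1:]))
```

`Fraction` reduces each weight automatically; to continue by hand you would rewrite them over a common denominator again, as step 5 describes.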

A Shortcut on computing the output of H(x).

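Quizzes often ask you for the error of the final H(x) ensemble classifier on the training data. Here is a quick way to compute the output of H(x) without calculating logarithms.

Step 1: Compute the sign of each stump h(x) on the given data point.

Step 2: Compute the product of the log arguments of the +ve stumps and the product of the log arguments of the -ve stumps. (The log argument of a stump is the quantity inside the logarithm of its voting weight.)

If the +ve product is greater than the -ve product, H(x) outputs +ve. If the +ve product is less than the -ve product, H(x) outputs -ve.

Example: suppose the stumps $h_1$, $h_2$, $h_3$ have log arguments 5, 2, and 2.

If $h_1$ is +ve, $h_2$ is +ve, and $h_3$ is -ve: (5 * 2) > 2, so H(x) should output +ve.

If $h_1$ is +ve, $h_2$ is -ve, and $h_3$ is -ve: 5 > (2 * 2), so H(x) should output +ve.

Step 3: Once you've computed all of the H(x) output values on the training data points, count the number of cases where H(x) disagrees with the true output. That is the error.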
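The shortcut works because comparing sums of logarithms is the same as comparing products of their arguments. A small Python sketch, using the log arguments 5, 2, 2 from the example; the function names are hypothetical, and the log-based version is included only to cross-check the shortcut.

```python
from math import log

def ensemble_sign_shortcut(log_arguments, stump_outputs):
    """Sign of H(x) by comparing products of log arguments; returns +1, -1, or 0 on a tie."""
    positive_product = 1
    negative_product = 1
    for argument, output in zip(log_arguments, stump_outputs):
        if output > 0:
            positive_product *= argument
        else:
            negative_product *= argument
    if positive_product > negative_product:
        return +1
    if positive_product < negative_product:
        return -1
    return 0

def ensemble_sign_with_logs(log_arguments, stump_outputs):
    """Same decision computed the long way, as a weighted vote with log weights."""
    total = sum(output * log(argument)
                for argument, output in zip(log_arguments, stump_outputs))
    return +1 if total > 0 else -1

log_arguments = [5, 2, 2]
for outputs in ([+1, +1, -1], [+1, -1, -1], [-1, +1, +1]):
    print(outputs, ensemble_sign_shortcut(log_arguments, outputs),
          ensemble_sign_with_logs(log_arguments, outputs))
```

Any common positive factor in front of the logarithms does not change the comparison, so it is left out of both functions.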

FAQ

Dear TA, how do I determine if a stump will "never" be used (such as for part 1.A of 2006 Q4)?

Test stumps that are never used are ones that make more errors than some pre-existing test stump. In other words, if the set of mistakes stump X makes is a proper superset of the set of mistakes stump Y makes, then Error(X) > Error(Y) is always true, no matter what weight distribution we use. Hence we will always choose Y over X because it makes fewer errors. So X will never be used!

Here is the answer to problem 1A from the 2006 Q4 with explanation. Setup: We are given the tests and the mistakes they make on the training examples, and we are asked to cross out the tests that are never used.

| Test | Misclassified examples | Never used? | Reason |
|---|---|---|---|
| TRUE | 1,2,3,5 | Yes | superset of G=Y or U!=N |
| FALSE | 4,6 | Yes | superset of U=M |
| C=Y | 1,6 | No | |
| C=N | 2,3,4,5 | Yes | superset of G=Y or U=M |
| U=Y | 1,2,3,6 | Yes | superset of U!=N |
| U!=Y | 4,5 | Yes | superset of G=Y or U=M |
| U=N | 4,5,6 | Yes | superset of G=Y or U=M |
| U!=N | 1,2,3 | No | |
| U=M | 4 | No | |
| U!=M | 1,2,3,5,6 | Yes | superset of G=Y, C=Y or U!=N |
| G=Y | 5 | No | |
| G=N | 1,2,3,4,6 | Yes | superset of U=M, C=Y or U!=N |
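The superset rule is easy to automate with Python sets. The sketch below uses the misclassification sets from this problem; the sets for U!=M and G=N are taken to be the complements of U=M and G=Y over examples 1 to 6, which is an inference rather than something stated explicitly. The function name is hypothetical.

```python
# Misclassified training examples for each test stump.
mistakes = {
    "TRUE": {1, 2, 3, 5}, "FALSE": {4, 6},
    "C=Y": {1, 6}, "C=N": {2, 3, 4, 5},
    "U=Y": {1, 2, 3, 6}, "U!=Y": {4, 5},
    "U=N": {4, 5, 6}, "U!=N": {1, 2, 3},
    "U=M": {4}, "U!=M": {1, 2, 3, 5, 6},   # U!=M assumed to be the complement of U=M
    "G=Y": {5}, "G=N": {1, 2, 3, 4, 6},    # G=N assumed to be the complement of G=Y
}

def never_used_stumps(mistakes):
    """Map each dominated stump to the stumps whose mistakes are a proper subset of its own."""
    dominated = {}
    for name, errors in mistakes.items():
        better = [other for other, other_errors in mistakes.items()
                  if other != name and other_errors < errors]   # '<' is proper subset
        if better:
            dominated[name] = better
    return dominated

dominated = never_used_stumps(mistakes)
usable = sorted(set(mistakes) - set(dominated))
print("possibly used:", usable)
for name, better in dominated.items():
    print(name, "is never used; always worse than", better)
```

The proper-subset operator matters: two stumps with identical mistake sets would tie rather than dominate one another.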

Food for thought: Suppose we were to come up with a strong classifier that is a uniform combination of stumps (equal weights). Q: How many misclassifications would the following classifier commit?

H(x) = h(FALSE) + h(C=Y) + h(U!=Y)

A: Combine the misclassification sets of the stumps: {4, 6}, {1, 6}, {4, 5}. Points 1 and 5 will be misclassified by 1 stump and correctly classified by 2 stumps, so H(x) will be correct on 1 and 5. Points 4 and 6 will be misclassified by 2 stumps and correctly classified by 1 stump, so H(x) will misclassify 4 and 6. Points 2 and 3 are classified correctly by all three stumps. Therefore H(x) will misclassify the points {4, 6}, which is 2 misclassifications.
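The same counting argument can be written as a short Python sketch of an equal-weight vote. It assumes six training examples numbered 1 to 6 and that a point is misclassified by the ensemble when more stumps get it wrong than right; the function name is hypothetical.

```python
def uniform_ensemble_mistakes(stump_mistakes, examples):
    """Examples that an equal-weight vote of the stumps gets wrong."""
    wrong = set()
    for example in examples:
        votes_wrong = sum(1 for errors in stump_mistakes if example in errors)
        votes_right = len(stump_mistakes) - votes_wrong
        if votes_wrong > votes_right:
            wrong.add(example)
    return wrong

# Mistake sets for h(FALSE), h(C=Y), h(U!=Y).
stump_mistakes = [{4, 6}, {1, 6}, {4, 5}]
ensemble_errors = uniform_ensemble_mistakes(stump_mistakes, range(1, 7))
print(sorted(ensemble_errors), len(ensemble_errors))
```

With an odd number of stumps there are no ties, so the strict comparison is enough here.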

The proof for incorrect points will yield the same result. This shows that the denominator update rule used in step 3 can be derived directly from the weight update equations so it is correct.