

This is the third homework assignment for the data mining course (CS 395T) taught by Inderjit Dhillon during spring 2008. The assignment covers linear separability, logistic regression, the Perceptron algorithm, and support vector machines (SVMs). Students are required to prove theorems, analyze error functions, and derive algorithms for solving SVM problems.


Lecturer: Inderjit Dhillon
Date Due: April 7, 2008
Keywords: Classification, Perceptron, Support Vector Machines
$$E(w) = -\sum_{n=1}^{N} \bigl( y_n \log z_n + (1 - y_n) \log(1 - z_n) \bigr),$$
where $z_n = \sigma(w^T x_n)$, $w$ specifies a hyperplane, and $\sigma$ is the logistic sigmoid function defined by
$$\sigma(a) = \frac{1}{1 + \exp(-a)}.$$
Prove that the error function E(w) is a convex function and provide a condition on the input data so that E(w) has a unique minimum.
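As a numerical sanity check (not a proof) of the convexity claim, the sketch below evaluates $E(w)$ on made-up data and tests the midpoint inequality $E\bigl(\tfrac{w_1 + w_2}{2}\bigr) \le \tfrac{1}{2}\bigl(E(w_1) + E(w_2)\bigr)$ for random pairs of weight vectors. The data, the label convention $y_n \in \{0, 1\}$, and the helper names `sigmoid`, `error`, and `midpoint_gap` are assumptions introduced for illustration only.

```python
import math
import random

def sigmoid(a):
    """Logistic sigmoid, written to avoid overflow for large |a|."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)

def error(w, X, y):
    """E(w) = -sum_n [y_n log z_n + (1 - y_n) log(1 - z_n)], z_n = sigmoid(w^T x_n)."""
    total = 0.0
    for x_n, y_n in zip(X, y):
        z_n = sigmoid(sum(wi * xi for wi, xi in zip(w, x_n)))
        total -= y_n * math.log(z_n) + (1 - y_n) * math.log(1 - z_n)
    return total

def midpoint_gap(w1, w2, X, y):
    """(E(w1) + E(w2)) / 2 - E((w1 + w2) / 2); nonnegative when E is convex."""
    mid = [(a + b) / 2 for a, b in zip(w1, w2)]
    return (error(w1, X, y) + error(w2, X, y)) / 2 - error(mid, X, y)

# Made-up inputs (assumption): 20 points in the plane, padded with a constant 1,
# with arbitrary labels in {0, 1}.
random.seed(0)
X = [[random.uniform(-1, 1), random.uniform(-1, 1), 1.0] for _ in range(20)]
y = [random.randint(0, 1) for _ in range(20)]

gaps = []
for _ in range(100):
    w1 = [random.uniform(-3, 3) for _ in range(3)]
    w2 = [random.uniform(-3, 3) for _ in range(3)]
    gaps.append(midpoint_gap(w1, w2, X, y))

# A small tolerance absorbs floating-point rounding.
print("midpoint inequality held on all sampled pairs:", min(gaps) >= -1e-9)
```

A check like this can only fail to find a counterexample; the proof itself still has to come from the structure of $E(w)$.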
that the input data is padded with one, i.e., $x_t = \begin{bmatrix} x_t \\ 1 \end{bmatrix}$. Recall the update:
$$w_{t+1} = w_t + y_t x_t, \quad \text{if } y_t (w_t^T x_t) < 0, \text{ i.e., a mistake.}$$
Assume that all the input data points have bounded Euclidean norm, i.e., $\|x_t\| \le R$, and are linearly separable with finite margin $\gamma$, i.e., there exists a hyperplane specified by $w^*$ such that:
$$y_t (w^{*T} x_t) \ge \gamma, \quad \forall t.$$
(a) Prove that the following holds after $t$ updates: $w^{*T} w_t \ge t\gamma$.
(b) Prove that $\|w_t\|_2^2 \le t R^2$.
(c) Using parts (a) and (b), prove that the Perceptron algorithm converges to a separating hyperplane after at most $\frac{R^2 \|w^*\|_2^2}{\gamma^2}$ steps.
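The bound in part (c) can be observed empirically. The following is a minimal sketch, not a reference implementation: it runs the update rule on made-up separable data and compares the number of updates with $R^2 \|w^*\|_2^2 / \gamma^2$. Several things here are assumptions not stated in the text: the start vector $w_0 = 0$, the non-strict mistake test $y_t (w_t^T x_t) \le 0$ (chosen so that the zero start vector triggers updates), the generating vector `w_star`, and the helper names.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def perceptron(X, y, max_passes=1000):
    """Cycle through the data, applying w <- w + y_t x_t on each mistake.

    Assumptions: w starts at zero, and y_t * w^T x_t <= 0 counts as a mistake.
    Returns the final weight vector and the number of updates performed.
    """
    w = [0.0] * len(X[0])
    updates = 0
    for _ in range(max_passes):
        clean_pass = True
        for x_t, y_t in zip(X, y):
            if y_t * dot(w, x_t) <= 0:
                w = [wi + y_t * xi for wi, xi in zip(w, x_t)]
                updates += 1
                clean_pass = False
        if clean_pass:  # no mistakes in a full pass: w separates the data
            break
    return w, updates

# Made-up separable data (assumption): points padded with 1, labeled by a
# hypothetical w_star, keeping only points at least 0.3 away from its boundary.
random.seed(1)
w_star = [1.0, -2.0, 0.5]
X, y = [], []
while len(X) < 50:
    x = [random.uniform(-1, 1), random.uniform(-1, 1), 1.0]
    s = dot(w_star, x)
    if abs(s) >= 0.3:
        X.append(x)
        y.append(1 if s > 0 else -1)

w, updates = perceptron(X, y)
R = max(math.sqrt(dot(x, x)) for x in X)
gamma = min(y_t * dot(w_star, x_t) for x_t, y_t in zip(X, y))
bound = R ** 2 * dot(w_star, w_star) / gamma ** 2
print("updates:", updates, "bound:", bound)
```

The bound depends on which separating $w^*$ is used to measure $\gamma$; any valid choice gives an upper limit on the number of updates, and a larger-margin choice gives a tighter one.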
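$$\max_{\alpha} W(\alpha), \quad \text{where } W(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} y_i y_j \alpha_i \alpha_j K_{ij}$$
subject to
$$\sum_{i=1}^{N} y_i \alpha_i = 0, \qquad (1)$$
$$\alpha_i \ge 0, \quad i = 1, \ldots, N. \qquad (2)$$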
CS 395T: Data Mining: A Mathematical Perspective
In the above problem, $K_{ij}$ could be $x_i^T x_j$ or $K_{ij} = \kappa(x_i, x_j) = h(x_i)^T h(x_j)$. Note that the matrix $K$ is positive semi-definite. The dual variables $\alpha_1, \ldots, \alpha_N$ are said to be feasible if (1) and (2) are satisfied.
We will consider the following strategy for optimizing this problem: at each iteration, we start with a feasible $\alpha$ and then update exactly two $\alpha$'s at a time. The update must maintain feasibility. Assume without loss of generality that the variables to be updated are $\alpha_1$ and $\alpha_2$. In the following, you will derive an update to $\alpha_1$ and $\alpha_2$ that maximizes the dual problem given above when only $\alpha_1$ and $\alpha_2$ are allowed to change.
a) $\alpha_1$ and $\alpha_2$ are to be updated to $\bar{\alpha}_1$ and $\bar{\alpha}_2$. Using the constraints on $\alpha$ from the dual problem, show that if $y_1 = y_2$, then $\bar{\alpha}_2 \le \alpha_1 + \alpha_2$, and if $y_1 \ne y_2$, then $\bar{\alpha}_2 \ge \alpha_2 - \alpha_1$.
b) Given that $y_1 \alpha_1 + y_2 \alpha_2 = \text{constant} = y_1 \bar{\alpha}_1 + y_2 \bar{\alpha}_2$, express this equivalently as $\alpha_1 + s\alpha_2 = \gamma$, where $s = y_1 y_2$. Furthermore, let
$$v_i = \sum_{j=3}^{N} y_j \alpha_j K_{ij}, \quad i = 1, 2.$$
Write the dual objective as a function of $\alpha_1$ and $\alpha_2$ (fixing the other $\alpha$ variables as constants), then use the equation $\alpha_1 + s\alpha_2 = \gamma$ to express the dual as a function of only $\alpha_2$, yielding
$$W(\alpha_2) = \gamma - s\alpha_2 + \alpha_2 - \frac{1}{2} K_{11} (\gamma - s\alpha_2)^2 - \frac{1}{2} K_{22} \alpha_2^2 - s K_{12} (\gamma - s\alpha_2) \alpha_2 - y_1 (\gamma - s\alpha_2) v_1 - y_2 \alpha_2 v_2 + \text{constant}.$$
c) Differentiate $W(\alpha_2)$ with respect to $\alpha_2$ to calculate the maximizing $\bar{\alpha}_2$. Let $d_{12} = K_{11} - 2K_{12} + K_{22}$ for notational convenience. Justify why this solution is a maximum (not a minimum).
d) Let $E_i = f(x_i) - y_i = \bigl( \sum_{j=1}^{N} \alpha_j y_j K_{ij} + w_0 \bigr) - y_i$, i.e., the difference between the predicted value and the true class label. Simplify your result in part c) to obtain the following:
$$\bar{\alpha}_2 = \alpha_2 + \frac{y_2 (E_1 - E_2)}{d_{12}}$$
and then, using part a), obtain the final solution for $\bar{\alpha}_2$ as:
$$\bar{\alpha}_2 := \begin{cases} \max(0, \min(\bar{\alpha}_2, \alpha_1 + \alpha_2)) & \text{if } y_1 = y_2, \\ \max(\bar{\alpha}_2, \alpha_2 - \alpha_1, 0) & \text{if } y_1 \ne y_2. \end{cases}$$
Furthermore, show that $\bar{\alpha}_1 = \alpha_1 + y_1 y_2 (\alpha_2 - \bar{\alpha}_2)$. This update results in a non-decreasing dual, and repeating over pairs of $\alpha$ eventually leads to global convergence of the SVM problem.
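To see the pairwise update in action, here is a minimal sketch that applies the formulas for $\bar{\alpha}_2$ and $\bar{\alpha}_1$ to every pair in turn and records the dual objective after each step. The toy data set, the linear kernel, the fixed sweep count, the choice to skip pairs with $d_{12} \le 0$, and the function names are all assumptions made for illustration. The offset $w_0$ is left out because it cancels in $E_1 - E_2$.

```python
def dual_objective(alpha, y, K):
    """W(alpha) = sum_i alpha_i - 1/2 sum_ij y_i y_j alpha_i alpha_j K_ij."""
    n = len(alpha)
    quad = sum(y[i] * y[j] * alpha[i] * alpha[j] * K[i][j]
               for i in range(n) for j in range(n))
    return sum(alpha) - 0.5 * quad

def pair_update(alpha, y, K, i, j):
    """Update alpha[i] (in the role of alpha_1) and alpha[j] (alpha_2) in place."""
    n = len(alpha)
    # Predictions without the offset w_0, which cancels in E_1 - E_2.
    f_i = sum(alpha[k] * y[k] * K[i][k] for k in range(n))
    f_j = sum(alpha[k] * y[k] * K[j][k] for k in range(n))
    e_diff = (f_i - y[i]) - (f_j - y[j])
    d = K[i][i] - 2.0 * K[i][j] + K[j][j]
    if d <= 0.0:
        return  # degenerate pair; skipping it is an assumption of this sketch
    new_j = alpha[j] + y[j] * e_diff / d  # unclipped maximizer
    if y[i] == y[j]:
        new_j = max(0.0, min(new_j, alpha[i] + alpha[j]))
    else:
        new_j = max(new_j, alpha[j] - alpha[i], 0.0)
    alpha[i] += y[i] * y[j] * (alpha[j] - new_j)  # uses the old alpha[j]
    alpha[j] = new_j

# Made-up, linearly separable toy data (assumption) with a linear kernel.
X = [[2.0, 2.0], [1.0, 3.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
y = [1, 1, 1, -1, -1, -1]
K = [[sum(a * b for a, b in zip(u, v)) for v in X] for u in X]

alpha = [0.0] * len(X)  # alpha = 0 satisfies constraints (1) and (2)
history = [dual_objective(alpha, y, K)]
for _ in range(20):  # fixed number of sweeps over all pairs
    for i in range(len(X)):
        for j in range(i + 1, len(X)):
            pair_update(alpha, y, K, i, j)
            history.append(dual_objective(alpha, y, K))

print("final dual value:", history[-1])
print("sum_i y_i alpha_i:", sum(yi * ai for yi, ai in zip(y, alpha)))
```

Sweeping over all pairs in a fixed order is the simplest possible schedule; practical solvers choose pairs with heuristics and stop on an optimality test rather than a fixed sweep count.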