Data Mining: Homework 3 - Linear Separability, Logistic Regression, Perceptron, and SVM

This is the third homework assignment for the data mining course (CS 395T) taught by Inderjit Dhillon in Spring 2008. It covers linear separability, logistic regression, the Perceptron algorithm, and support vector machines (SVM). Students are asked to prove theorems, analyze an error function, and derive an algorithm for solving the SVM problem.


Uploaded on 08/27/2009



CS 395T Data Mining: A Mathematical Perspective Spring 2008

Homework 3

Lecturer: Inderjit Dhillon

Date Due: April 7, 2008

Keywords: Classification, Perceptron, Support Vector Machines

  1. Given two sets of data points $X = \{x_1, x_2, \ldots, x_n\}$ and $Y = \{y_1, y_2, \ldots, y_n\}$, prove that the two sets ($X$ and $Y$) are linearly separable if and only if their convex hulls do not intersect.
  2. Given training instances $(x_n, y_n)$ with $y_n \in \{0, 1\}$, consider the following error function for logistic regression:

$$E(w) = -\sum_{n=1}^{N} \bigl( y_n \log z_n + (1 - y_n) \log(1 - z_n) \bigr),$$

where $z_n = \sigma(w^T x_n)$, $w$ specifies a hyperplane, and $\sigma$ is the logistic sigmoid function defined by

$$\sigma(a) = \frac{1}{1 + \exp(-a)}.$$

Prove that the error function $E(w)$ is a convex function and provide a condition on the input data so that $E(w)$ has a unique minimum.
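The convexity claim in problem 2 can be checked numerically before attempting the proof. The following is a minimal sketch, not a proof and not part of the assignment: it evaluates $E(w)$ on hypothetical random data and tests two consequences of convexity, namely the midpoint inequality and the non-negativity of the quadratic form $v^T H v$, where $H = \sum_n z_n (1 - z_n) x_n x_n^T$ is the standard Hessian of this error function. All function names and the toy data are assumptions made for illustration.

```python
import math
import random


def sigmoid(a):
    """Logistic sigmoid sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + math.exp(-a))


def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(ui * vi for ui, vi in zip(u, v))


def error(w, X, y):
    """E(w) = -sum_n [y_n log z_n + (1 - y_n) log(1 - z_n)], z_n = sigma(w^T x_n)."""
    total = 0.0
    for x_n, y_n in zip(X, y):
        z_n = sigmoid(dot(w, x_n))
        total -= y_n * math.log(z_n) + (1 - y_n) * math.log(1.0 - z_n)
    return total


def hessian_quadratic_form(w, X, v):
    """v^T H v with H = sum_n z_n (1 - z_n) x_n x_n^T; every term is >= 0."""
    total = 0.0
    for x_n in X:
        z_n = sigmoid(dot(w, x_n))
        total += z_n * (1.0 - z_n) * dot(v, x_n) ** 2
    return total


def midpoint_gap(u, v, X, y):
    """(E(u) + E(v)) / 2 - E((u + v) / 2); non-negative when E is convex."""
    mid = [(ui + vi) / 2.0 for ui, vi in zip(u, v)]
    return 0.5 * (error(u, X, y) + error(v, X, y)) - error(mid, X, y)


def random_vector(rng, dim, scale=1.0):
    """Vector with entries drawn uniformly from [-scale, scale]."""
    return [rng.uniform(-scale, scale) for _ in range(dim)]


if __name__ == "__main__":
    rng = random.Random(0)  # hypothetical toy data, not from the assignment
    X = [random_vector(rng, 3) for _ in range(20)]
    y = [rng.randint(0, 1) for _ in range(20)]
    u, v = random_vector(rng, 3, 2.0), random_vector(rng, 3, 2.0)
    print("midpoint gap:", midpoint_gap(u, v, X, y))
    print("v^T H v:", hessian_quadratic_form(u, X, v))
```

The quadratic form vanishes only when $v$ is orthogonal to every $x_n$, which hints at the kind of condition on the input data that the second half of the problem asks for.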

  3. In this exercise, we will prove correctness and convergence of the Perceptron algorithm for linearly separable data. Let $w_t$ represent the hyperplane at step $t$ and $(x_t, y_t)$ represent an input instance with $y_t \in \{1, -1\}$. Note that the input data is padded with one, i.e., $x_t = \begin{bmatrix} x_t \\ 1 \end{bmatrix}$. Recall the update:

$$w_{t+1} = w_t + y_t x_t, \quad \text{if } y_t (w_t^T x_t) < 0, \text{ i.e., a mistake.}$$

Assume that all the input data points have bounded Euclidean norm, i.e., $\|x_t\| \le R$, and are linearly separable with finite margin $\gamma$, i.e., there exists a hyperplane specified by $w^*$ such that:

$$y_t (w^{*T} x_t) \ge \gamma, \quad \forall t.$$

(a) Prove that the following holds after $t$ updates: $w^{*T} w_t \ge t\gamma$.

(b) Prove that: $\|w_t\|_2^2 \le t R^2$.

(c) Using parts (a) and (b), prove that the Perceptron algorithm converges to a separating hyperplane after at most $R^2 \|w^*\|_2^2 / \gamma^2$ steps.
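The update rule and the bound in part (c) of the Perceptron problem can be tried out on synthetic data. The sketch below is an illustration under stated assumptions, not the course's reference implementation: it starts from $w_0 = 0$ (the assignment does not state the initialization) and treats a zero score as a mistake (`<= 0` rather than `< 0`) so that the zero start triggers updates. The bound from part (c) still applies under that choice. The data generator, the function names, and the separator `w_star` are hypothetical.

```python
import math
import random


def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(ui * vi for ui, vi in zip(u, v))


def make_separable_data(w_star, n, min_margin=0.2, seed=0):
    """Random padded points labeled by w_star, keeping only margin >= min_margin."""
    rng = random.Random(seed)
    samples = []
    while len(samples) < n:
        # The last coordinate is the padding 1 described in the problem.
        x = [rng.uniform(-1.0, 1.0) for _ in range(len(w_star) - 1)] + [1.0]
        score = dot(w_star, x)
        if abs(score) >= min_margin:
            samples.append((x, 1 if score > 0 else -1))
    return samples


def perceptron(samples, max_passes=1000):
    """Run the update w <- w + y x on mistakes; return (w, number of updates)."""
    w = [0.0] * len(samples[0][0])  # assumption: start from the zero vector
    updates = 0
    for _ in range(max_passes):
        mistakes = 0
        for x, y in samples:
            if y * dot(w, x) <= 0:  # assumption: a zero score counts as a mistake
                w = [wi + y * xi for wi, xi in zip(w, x)]
                updates += 1
                mistakes += 1
        if mistakes == 0:  # a full pass without mistakes: w separates the data
            break
    return w, updates


def mistake_bound(samples, w_star):
    """R^2 ||w*||^2 / gamma^2, with R and gamma measured on the given samples."""
    R = max(math.sqrt(dot(x, x)) for x, _ in samples)
    gamma = min(y * dot(w_star, x) for x, y in samples)
    return R ** 2 * dot(w_star, w_star) / gamma ** 2


if __name__ == "__main__":
    w_star = [1.0, -2.0, 0.5]  # hypothetical separating hyperplane
    data = make_separable_data(w_star, 50)
    w, updates = perceptron(data)
    print("updates:", updates, "bound:", mistake_bound(data, w_star))
```

Comparing `updates` with `mistake_bound` on a few seeds shows that the bound is typically loose, since it depends on the chosen $w^*$ and on the worst-case margin only.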

  4. In this exercise, we will derive an algorithm for solving the SVM problem. Recall the dual formulation for the linearly-separable SVM:

$$\max_{\alpha} W(\alpha), \quad \text{where } W(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} y_i y_j \alpha_i \alpha_j K_{ij}$$

subject to

$$\sum_{i=1}^{N} y_i \alpha_i = 0, \tag{1}$$

$$\alpha_i \ge 0, \quad i = 1, \ldots, N. \tag{2}$$


In the above problem, $K_{ij}$ could be $x_i^T x_j$ or $K_{ij} = \kappa(x_i, x_j) = h(x_i)^T h(x_j)$. Note that the matrix $K$ is positive semi-definite. The dual variables $\alpha_1, \ldots, \alpha_N$ are said to be feasible if (1) and (2) are satisfied.

We will consider the following strategy for optimizing this problem: at each iteration, we start with a feasible $\alpha$ and then update exactly two $\alpha$'s at a time. The update must maintain feasibility. Assume without loss of generality that the variables to be updated are $\alpha_1$ and $\alpha_2$. In the following, you will derive an update to $\alpha_1$ and $\alpha_2$ that maximizes the dual problem given above when only $\alpha_1$ and $\alpha_2$ are allowed to change.

a) $\alpha_1$ and $\alpha_2$ are to be updated to $\bar{\alpha}_1$ and $\bar{\alpha}_2$. Using the constraints on $\alpha$ from the dual problem, show that if $y_1 = y_2$, then $\bar{\alpha}_2 \le \alpha_1 + \alpha_2$, and if $y_1 \ne y_2$, then $\bar{\alpha}_2 \ge \alpha_2 - \alpha_1$.

b) Given that $y_1 \alpha_1 + y_2 \alpha_2 = \text{constant} = y_1 \bar{\alpha}_1 + y_2 \bar{\alpha}_2$, express this equivalently as $\alpha_1 + s\alpha_2 = \gamma$, where $s = y_1 y_2$. Furthermore, let

$$v_i = \sum_{j=3}^{N} y_j \alpha_j K_{ij}, \quad i = 1, 2.$$

Write the dual objective as a function of $\alpha_1$ and $\alpha_2$ (fixing the other $\alpha$ variables as constants), then use the equation $\alpha_1 + s\alpha_2 = \gamma$ to express the dual as a function of only $\alpha_2$, yielding

$$W(\alpha_2) = \gamma - s\alpha_2 + \alpha_2 - \frac{1}{2} K_{11} (\gamma - s\alpha_2)^2 - \frac{1}{2} K_{22} \alpha_2^2 - s K_{12} (\gamma - s\alpha_2) \alpha_2 - y_1 (\gamma - s\alpha_2) v_1 - y_2 \alpha_2 v_2 + \text{constant}.$$

c) Differentiate $W(\alpha_2)$ with respect to $\alpha_2$ to calculate the maximizing $\bar{\alpha}_2$. Let $d_{12} = K_{11} - 2K_{12} + K_{22}$ for notational convenience. Justify why this solution is a maximum (not a minimum).

d) Let $E_i = f(x_i) - y_i = \bigl( \sum_{j=1}^{N} \alpha_j y_j K_{ij} + w_0 \bigr) - y_i$, i.e., the difference between the predicted value and the true class label. Simplify your result in part c) to obtain the following:

$$\bar{\alpha}_2 = \alpha_2 + \frac{y_2 (E_1 - E_2)}{d_{12}}$$

and then, using part a), obtain the final solution for $\bar{\alpha}_2$ as:

$$\bar{\alpha}_2 := \begin{cases} \max(0, \min(\bar{\alpha}_2, \alpha_1 + \alpha_2)) & \text{if } y_1 = y_2, \\ \max(\bar{\alpha}_2, \alpha_2 - \alpha_1, 0) & \text{if } y_1 \ne y_2. \end{cases}$$

Furthermore, show that $\bar{\alpha}_1 = \alpha_1 + y_1 y_2 (\alpha_2 - \bar{\alpha}_2)$. This update results in a non-decreasing dual, and repeating it over pairs of $\alpha$ eventually converges to the global solution of the SVM problem.
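The pair update stated in part d) can be written out directly. The following is a rough sketch of a single update step under stated assumptions, not a complete SVM solver: it takes the two indices as given (no selection heuristic), uses a precomputed kernel matrix `K`, skips the update when $d_{12} \le 0$, and defaults the offset $w_0$ to zero since it cancels in $E_1 - E_2$. The function names and the toy data in the usage note are hypothetical.

```python
def linear_kernel_matrix(X):
    """Gram matrix K_ij = x_i^T x_j for a list of points."""
    return [[sum(a * b for a, b in zip(xi, xj)) for xj in X] for xi in X]


def dual_objective(alpha, y, K):
    """W(alpha) = sum_i alpha_i - 1/2 sum_ij y_i y_j alpha_i alpha_j K_ij."""
    n = len(alpha)
    quad = sum(y[i] * y[j] * alpha[i] * alpha[j] * K[i][j]
               for i in range(n) for j in range(n))
    return sum(alpha) - 0.5 * quad


def prediction_error(i, alpha, y, K, w0=0.0):
    """E_i = f(x_i) - y_i with f(x_i) = sum_j alpha_j y_j K_ij + w0."""
    f_i = sum(alpha[j] * y[j] * K[i][j] for j in range(len(alpha))) + w0
    return f_i - y[i]


def update_pair(i1, i2, alpha, y, K, w0=0.0):
    """Return a new alpha after updating only entries i1 and i2 as in part d)."""
    a1, a2 = alpha[i1], alpha[i2]
    d12 = K[i1][i1] - 2.0 * K[i1][i2] + K[i2][i2]
    if d12 <= 0.0:
        # Assumption: leave the pair unchanged in the degenerate case.
        return list(alpha)
    e1 = prediction_error(i1, alpha, y, K, w0)
    e2 = prediction_error(i2, alpha, y, K, w0)
    a2_new = a2 + y[i2] * (e1 - e2) / d12  # unconstrained maximizer
    if y[i1] == y[i2]:
        a2_new = max(0.0, min(a2_new, a1 + a2))  # clip to [0, a1 + a2]
    else:
        a2_new = max(a2_new, a2 - a1, 0.0)  # clip from below only
    a1_new = a1 + y[i1] * y[i2] * (a2 - a2_new)  # keeps sum_i y_i alpha_i fixed
    new_alpha = list(alpha)
    new_alpha[i1], new_alpha[i2] = a1_new, a2_new
    return new_alpha
```

As a usage example, one could take four hypothetical points `[[2, 2], [1, 3], [-1, -1], [-2, 0]]` with labels `[1, 1, -1, -1]`, start from the feasible vector `[0.1, 0.2, 0.25, 0.05]`, and call `update_pair` on a few index pairs, checking after each call that $\sum_i y_i \alpha_i$ stays at zero and that `dual_objective` does not decrease.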