Midterm Exam for Machine Learning Course, Exams of Machine Learning

A midterm exam for the Machine Learning course at Carnegie Mellon University. The exam consists of 6 questions covering topics such as Bayes Optimal Classification, Logistic Regression, Regression, SVM, and Boosting. The exam allows the use of any material brought by the student, including books, class notes, and printouts of class materials. The exam has a time limit of 90 minutes and a total score of 100 points.

10-701/15-781 Machine Learning - Midterm Exam, Fall 2010
Aarti Singh
Carnegie Mellon University
1. Personal info:
Name:
Andrew account:
E-mail address:
2. There should be 15 numbered pages in this exam (including this cover sheet).
3. You can use any material you brought: any book, class notes, your print outs of
class materials that are on the class website, including annotated slides and relevant
readings, and Andrew Moore’s tutorials. You cannot use materials brought by other
students. Calculators are not necessary. Laptops, PDAs, phones and Internet access
are not allowed.
4. If you need more room to work out your answer to a question, use the back of the page
and clearly mark on the front of the page if we are to look at what’s on the back.
5. Work efficiently. Some questions are easier, some more difficult. Be sure to give yourself
time to answer all of the easy ones, and avoid getting bogged down in the more difficult
ones before you have answered the easier ones.
6. You have 90 minutes.
7. Good luck!
Question  Topic                          Max. score  Score
1         Short questions                20
2         Bayes Optimal Classification   15
3         Logistic Regression            18
4         Regression                     16
5         SVM                            16
6         Boosting                       15
          Total                          100

1 Short Questions [20 pts]

Are the following statements True/False? Explain your reasoning in only 1 sentence.

1. Density estimation (using, say, the kernel density estimator) can be used to perform classification.

True: Estimate the joint density P(Y, X), then use it to calculate P(Y|X).
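
The idea in this answer can be sketched in a few lines. The following is a minimal illustration, assuming a one-dimensional input, a Gaussian kernel, and a hand-picked bandwidth `h`; the function names and the toy data are hypothetical and not part of the exam.

```python
import numpy as np

def gaussian_kde_1d(train, query, h):
    """Kernel density estimate of p(x) at each query point (Gaussian kernel, bandwidth h)."""
    diffs = (query[:, None] - train[None, :]) / h
    kernel_vals = np.exp(-0.5 * diffs**2) / np.sqrt(2.0 * np.pi)
    return kernel_vals.mean(axis=1) / h

def kde_classify(X, y, query, h=0.5):
    """Predict argmax_c P(Y=c) * p(x | Y=c), which is proportional to P(Y=c | x)."""
    classes = np.unique(y)
    # Estimated joint density P(Y=c, x), one row per class.
    joint = np.stack([
        np.mean(y == c) * gaussian_kde_1d(X[y == c], query, h)
        for c in classes
    ])
    return classes[np.argmax(joint, axis=0)]

# Toy data (made up): class 0 clusters near -2, class 1 near +2.
X = np.array([-2.1, -1.9, -2.0, 2.0, 1.8, 2.2])
y = np.array([0, 0, 0, 1, 1, 1])
pred = kde_classify(X, y, np.array([-1.5, 1.5]))
```

Dividing the joint estimate by its sum over classes would give the posterior P(Y|X) itself; the argmax is unchanged by that normalization.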

2. The correspondence between logistic regression and Gaussian Naïve Bayes (with identity class covariances) means that there is a one-to-one correspondence between the parameters of the two classifiers.

False: Each LR model parameter corresponds to a whole set of possible GNB classifier parameters. There is no one-to-one correspondence because logistic regression is discriminative and therefore doesn’t model P(X), while GNB does model P(X).

3. The training error of 1-NN classifier is 0.

True: Each point is its own neighbor, so the 1-NN classifier achieves perfect classification on the training data.
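
A quick numerical check of this claim, as a sketch that assumes Euclidean distance and no duplicated training points with conflicting labels (the random data and function name are illustrative only):

```python
import numpy as np

def one_nn_predict(X_train, y_train, X_query):
    """Label each query point with the label of its nearest training point."""
    # Squared Euclidean distance between every query and every training point.
    d2 = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    return y_train[np.argmin(d2, axis=1)]

rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 2))
y_train = rng.integers(0, 2, size=50)  # arbitrary labels, no structure needed

# Each training point is at distance 0 from itself, so it retrieves its own label.
train_error = np.mean(one_nn_predict(X_train, y_train, X_train) != y_train)
```

Even with completely random labels the training error is zero, which is why training error says nothing about how well 1-NN generalizes.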

4. As the number of data points grows to infinity, the MAP estimate approaches the MLE estimate for all possible priors. In other words, given enough data, the choice of prior is irrelevant.

False: A simple counterexample is the prior which assigns probability 1 to a single choice of parameter θ.

5. Cross validation can be used to select the number of iterations in boosting; this procedure may help reduce overfitting.

True: The number of iterations in boosting controls the complexity of the model; therefore, a model selection procedure like cross validation can be used to select the appropriate model complexity and reduce the possibility of overfitting.

6. The kernel density estimator is equivalent to performing kernel regression with the value $Y_i = 1/n$ at each point $X_i$ in the original data set.

False: Kernel regression predicts the value of a point as the weighted average of the values at nearby points; therefore, if all of the points have the same value, then kernel regression will predict a constant (in this case, $1/n$) for all inputs.
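
The difference can be seen numerically. The sketch below assumes a Gaussian kernel and the Nadaraya-Watson form of kernel regression (a normalized weighted average); the data points and bandwidth are made up for illustration.

```python
import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def kernel_regression(X, Y, query, h):
    """Weighted average of the Y values, with kernel weights normalized to sum to 1."""
    w = gaussian_kernel((query[:, None] - X[None, :]) / h)
    return (w * Y).sum(axis=1) / w.sum(axis=1)

def kde(X, query, h):
    """Kernel density estimate: average of kernels, with no normalization of weights."""
    return gaussian_kernel((query[:, None] - X[None, :]) / h).mean(axis=1) / h

X = np.array([0.0, 0.2, 0.5, 3.0])
n = len(X)
query = np.array([0.1, 1.5, 3.0])

reg = kernel_regression(X, np.full(n, 1.0 / n), query, h=0.5)  # constant 1/n everywhere
dens = kde(X, query, h=0.5)                                   # varies with the query point
```

The normalization by the sum of the weights is what makes the regression output constant, whereas the density estimate keeps the information about how many points are nearby.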

7. We learn a classifier f by boosting weak learners h. The functional form of f’s decision boundary is the same as h’s, but with different parameters (e.g., if h was a linear classifier, then f is also a linear classifier).

False: For example, the functional form of a decision stump is a single axis-aligned split of the input space, but the functional form of the boosted classifier is a linear combination of decision stumps, which can form a more complex (piecewise linear) decision boundary.

2. For the following dataset, circle the classifier which has larger Leave-One-Out Cross-validation error.

a) 1-NN

b) 3-NN

Solution: 1-NN since 1-NN CV err: 5/10, 3-NN CV err: 1/

2 Bayes Optimal Classification [15 pts]

In classification, the loss function we usually want to minimize is the 0/1 loss:

$$\ell(f(x), y) = \mathbf{1}\{f(x) \neq y\}$$

where $f(x), y \in \{0, 1\}$ (i.e., binary classification). In this problem we will consider the effect of using an asymmetric loss function:

$$\ell_{\alpha,\beta}(f(x), y) = \alpha \mathbf{1}\{f(x) = 1, y = 0\} + \beta \mathbf{1}\{f(x) = 0, y = 1\}$$

Under this loss function, the two types of errors receive different weights, determined by $\alpha, \beta > 0$.

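1. [4 pts] Determine the Bayes optimal classifier, i.e. the classifier that achieves minimum risk assuming P(x, y) is known, for the loss $\ell_{\alpha,\beta}$ where $\alpha, \beta > 0$.

Solution: We can write

$$\begin{aligned}
\arg\min_f \mathbb{E}\,\ell_{\alpha,\beta}(f(x), y)
&= \arg\min_f \mathbb{E}_{X,Y}\big[\alpha \mathbf{1}\{f(X) = 1, Y = 0\} + \beta \mathbf{1}\{f(X) = 0, Y = 1\}\big] \\
&= \arg\min_f \mathbb{E}_{X}\Big[\mathbb{E}_{Y|X}\big[\alpha \mathbf{1}\{f(X) = 1, Y = 0\} + \beta \mathbf{1}\{f(X) = 0, Y = 1\}\big]\Big] \\
&= \arg\min_f \mathbb{E}_{X}\Big[\int_y \big(\alpha \mathbf{1}\{f(X) = 1, y = 0\} + \beta \mathbf{1}\{f(X) = 0, y = 1\}\big)\, dP(y|x)\Big] \\
&= \arg\min_f \int_x \big[\alpha \mathbf{1}\{f(x) = 1\} P(y = 0|x) + \beta \mathbf{1}\{f(x) = 0\} P(y = 1|x)\big]\, dP(x)
\end{aligned}$$

We may minimize the integrand at each x by taking:

$$f(x) = \begin{cases} 1 & \beta P(y = 1|x) \geq \alpha P(y = 0|x) \\ 0 & \alpha P(y = 0|x) > \beta P(y = 1|x). \end{cases}$$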
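
As a small illustration of this decision rule, the sketch below applies it to a few posterior values. It assumes the posterior P(y = 1|x) is already known and passed in as `p1`; the function name and the numbers are hypothetical.

```python
import numpy as np

def bayes_optimal_decision(p1, alpha, beta):
    """Return 1 where beta * P(y=1|x) >= alpha * P(y=0|x), else 0."""
    p1 = np.asarray(p1, dtype=float)
    return (beta * p1 >= alpha * (1.0 - p1)).astype(int)

# With alpha = 3 and beta = 1, a false positive costs three times a false negative,
# so the rule predicts 1 only when P(y=1|x) >= alpha / (alpha + beta) = 0.75.
decisions = bayes_optimal_decision([0.2, 0.6, 0.75, 0.9], alpha=3.0, beta=1.0)
# decisions -> [0, 0, 1, 1]
```

Setting α = β recovers the usual 0/1-loss rule, which thresholds the posterior at 1/2.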

2. [3 pts] Suppose that the class y = 0 is extremely uncommon (i.e., P(y = 0) is small). This means that the classifier f(x) = 1 for all x will have good risk. We may try to put the two classes on even footing by considering the risk:

$$R = P(f(x) = 1 \mid y = 0) + P(f(x) = 0 \mid y = 1)$$

Show how this risk is equivalent to choosing a certain α, β and minimizing the risk where the loss function is $\ell_{\alpha,\beta}$.

Solution: Notice that

$$\begin{aligned}
\mathbb{E}\,\ell_{\alpha,\beta}(f(x), y) &= \alpha P(f(x) = 1, y = 0) + \beta P(f(x) = 0, y = 1) \\
&= \alpha P(f(x) = 1 \mid y = 0) P(y = 0) + \beta P(f(x) = 0 \mid y = 1) P(y = 1)
\end{aligned}$$

which is the same as the given risk R (and therefore has the same minimizer) if $\alpha = \frac{1}{P(y = 0)}$ and $\beta = \frac{1}{P(y = 1)}$.


3 Logistic Regression [18 pts]

We consider here a discriminative approach for solving the classification problem illustrated in Figure 1.

Figure 1: The 2-dimensional labeled training set, where ‘+’ corresponds to class y = 1 and ‘O’ corresponds to class y = 0.

1. We attempt to solve the binary classification task depicted in Figure 1 with the simple linear logistic regression model

$$P(y = 1 \mid \vec{x}, \vec{w}) = g(w_0 + w_1 x_1 + w_2 x_2) = \frac{1}{1 + \exp(-w_0 - w_1 x_1 - w_2 x_2)}$$

Notice that the training data can be separated with zero training error with a linear separator.

Consider training regularized linear logistic regression models where we try to maximize

$$\sum_{i=1}^{n} \log P(y_i \mid x_i, w_0, w_1, w_2) - C w_j^2$$

for very large C. The regularization penalties used in penalized conditional log-likelihood estimation are $-C w_j^2$, where $j \in \{0, 1, 2\}$. In other words, only one of the parameters is regularized in each case. Given the training data in Figure 1, how does the training error change with regularization of each parameter $w_j$? State whether the training error increases or stays the same (zero) for each $w_j$ for very large C. Provide a brief justification for each of your answers.

(b) [3 pts] For very large C, with the same L1-norm regularization for $w_1$ and $w_2$ as above, which value(s) do you expect $w_0$ to take? Explain briefly. (Note that the number of points from each class is the same.) (You can give a range of values for $w_0$ if you deem necessary.)

SOLUTION: For very large C, we argued that both $w_1$ and $w_2$ will go to zero. Note that when $w_1 = w_2 = 0$, the log-probability of the labels is finite and is maximized at $w_0 = 0$, where it equals $n \log(0.5)$. In other words, $P(y = 1 \mid \vec{x}, \vec{w}) = P(y = 0 \mid \vec{x}, \vec{w}) = 0.5$. We expect so because the number of elements in each class is the same and so we would like to predict each one with the same probability, and $w_0 = 0$ makes $P(y = 1 \mid \vec{x}, \vec{w}) = 0.5$.

(c) [3 pts] Assume that we obtain more data points from the ‘+’ class that corresponds to y = 1 so that the class labels become unbalanced. Again for very large C, with the same L1-norm regularization for $w_1$ and $w_2$ as above, which value(s) do you expect $w_0$ to take? Explain briefly. (You can give a range of values for $w_0$ if you deem necessary.)

SOLUTION: For very large C, we argued that both $w_1$ and $w_2$ will go to zero. With unbalanced classes where the number of ‘+’ labels is greater than that of ‘o’ labels, we want to have $P(y = 1 \mid \vec{x}, \vec{w}) > P(y = 0 \mid \vec{x}, \vec{w})$. For that to happen the value of $w_0$ should be greater than zero, which makes $P(y = 1 \mid \vec{x}, \vec{w}) > 0.5$.
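
The answers to (b) and (c) can be made concrete. Assuming $w_0$ itself is not penalized and $w_1 = w_2 = 0$, the log-likelihood is $n_+ \log g(w_0) + n_- \log(1 - g(w_0))$, and setting its derivative to zero gives $g(w_0) = n_+ / (n_+ + n_-)$, i.e. $w_0 = \log(n_+ / n_-)$. The sketch below evaluates this; the function name and the class counts are illustrative only.

```python
import math

def intercept_only_w0(n_pos, n_neg):
    """Maximizer of the log-likelihood when w1 = w2 = 0 (w0 assumed unpenalized).

    At the optimum g(w0) equals the fraction of positive labels,
    so w0 is the log-odds log(n_pos / n_neg).
    """
    return math.log(n_pos / n_neg)

balanced = intercept_only_w0(10, 10)    # 0.0, so P(y=1|x) = 0.5
unbalanced = intercept_only_w0(30, 10)  # log(3) > 0, so P(y=1|x) = 0.75
```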

4 Kernel regression [16 pts]

Now let’s consider the non-parametric kernel regression setting. In this problem, you will investigate univariate locally linear regression where the estimator is of the form:

$$\hat{f}(x) = \beta_1 + \beta_2 x$$

and the solution for parameter vector $\beta = [\beta_1\ \beta_2]$ is obtained by minimizing the weighted least square error:

$$J(\beta_1, \beta_2) = \sum_{i=1}^{n} W_i(x)\,(Y_i - \beta_1 - \beta_2 X_i)^2 \quad \text{where} \quad W_i(x) = \frac{K\left(\frac{X_i - x}{h}\right)}{\sum_{i=1}^{n} K\left(\frac{X_i - x}{h}\right)}$$

where K is a kernel with bandwidth h. Observe that the weighted least squares error can be expressed in matrix form as

$$J(\beta_1, \beta_2) = (Y - A\beta)^T W (Y - A\beta),$$

where Y is a vector of n labels in the training example, W is an n × n diagonal matrix with the weight of each training example on the diagonal, and

$$A = \begin{bmatrix} 1 & X_1 \\ 1 & X_2 \\ \vdots & \vdots \\ 1 & X_n \end{bmatrix}$$

1. [4 pts] Derive an expression in matrix form for the solution vector $\hat{\beta}$ that minimizes the weighted least square.

Solution: Differentiating the objective function wrt β, we have:

$$\frac{\partial J(\beta)}{\partial \beta} = 2A^T W A\beta - 2A^T W^T Y.$$

Therefore, the solution $\hat{\beta}$ satisfies the following normal equations:

$$A^T W A\beta = A^T W^T Y$$

And if $A^T W A$ is invertible, then the solution is $\hat{\beta} = (A^T W A)^{-1} A^T W^T Y$. (Note that $W = W^T$, so the solution can be written in terms of either.)
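
The closed-form solution can be sketched directly from the normal equations. This is a minimal example, assuming a Gaussian kernel for K and a single query point `x0`; the function name, bandwidth, and data are hypothetical.

```python
import numpy as np

def local_linear_fit(X, Y, x0, h):
    """Solve the normal equations A^T W A beta = A^T W Y at the query point x0."""
    k = np.exp(-0.5 * ((X - x0) / h) ** 2)     # Gaussian kernel (an assumed choice of K)
    W = np.diag(k / k.sum())                    # normalized weights W_i(x0) on the diagonal
    A = np.column_stack([np.ones_like(X), X])   # rows [1, X_i]
    # Solving the linear system avoids forming the explicit inverse.
    return np.linalg.solve(A.T @ W @ A, A.T @ W @ Y)

X = np.array([0.0, 1.0, 2.0, 3.0])
Y = 2.0 + 3.0 * X                               # exactly linear toy data
beta_hat = local_linear_fit(X, Y, x0=1.5, h=1.0)
prediction = beta_hat[0] + beta_hat[1] * 1.5    # f_hat(x0) = beta_1 + beta_2 * x0
```

On exactly linear data the fit recovers the intercept and slope regardless of the weights, which makes it a convenient sanity check.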

2. [3 pts] When is the above solution unique?

Solution: When $A^T W A$ is invertible. Since W is a diagonal matrix, $A^T W A = (W^{1/2} A)^T (W^{1/2} A)$ and hence rank($A^T W A$) = min(n, 2) (refer to TK’s recitation notes; this assumes the weights are positive and the $X_i$ are not all equal). Since a matrix is invertible if it is full rank, a unique solution exists if n ≥ 2.

3. [3 pts] If the solution is not unique, one approach is to optimize the objective function J using gradient descent. Write the update equation for gradient descent in this case.

Note: Your answer must be expressed in terms of the matrices defined above.
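
Using the gradient derived in part 1, one plausible form of the update is β ← β − η (2AᵀWAβ − 2AᵀWY) for a step size η > 0. The sketch below runs that update; the step size, iteration count, uniform weights, and data are assumptions chosen for illustration, and the example uses a case with a unique minimizer so that the result is easy to verify.

```python
import numpy as np

def weighted_ls_gradient_descent(A, W, Y, step=0.1, n_iter=5000):
    """Minimize (Y - A beta)^T W (Y - A beta) by plain gradient descent from beta = 0."""
    beta = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * A.T @ W @ A @ beta - 2.0 * A.T @ W @ Y
        beta = beta - step * grad
    return beta

X = np.array([0.0, 1.0, 2.0, 3.0])
A = np.column_stack([np.ones_like(X), X])
W = np.diag(np.full(4, 0.25))       # uniform weights, purely for illustration
Y = 2.0 + 3.0 * X
beta_gd = weighted_ls_gradient_descent(A, W, Y)
```

The step size must be small relative to the largest eigenvalue of 2AᵀWA for the iteration to converge; when the minimizer is not unique, the iterates still converge to one of the minimizers, which depends on the starting point.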

5 SVM [16 pts]

5.1 L2 SVM

Let $\{(x_i, y_i)\}_{i=1}^{l}$ be a set of l training pairs of feature vectors and labels. We consider binary classification, and assume $y_i \in \{-1, +1\}\ \forall i$. The following is the primal formulation of L2 SVM, a variant of the standard SVM obtained by squaring the hinge loss:

$$\begin{aligned}
\min_{w, b, \xi} \quad & \frac{1}{2} w^\top w + \frac{C}{2} \sum_{i=1}^{l} \xi_i^2 \\
\text{s.t.} \quad & y_i (w^\top x_i + b) \geq 1 - \xi_i, \quad i \in \{1, \ldots, l\}, \\
& \xi_i \geq 0, \quad i \in \{1, \ldots, l\}.
\end{aligned}$$

1. [4 pts] Show that removing the last set of constraints $\{\xi_i \geq 0\ \forall i\}$ does not change the optimal solution to the primal problem.

Solution: Let $(w^*, b^*, \xi^*)$ be the optimal solution to the problem without the last set of constraints. It suffices to show that $\xi_i^* \geq 0\ \forall i$. Suppose it is not the case; then there exists some $\xi_j^* < 0$. Then we have

$$y_j ((w^*)^\top x_j + b^*) \geq 1 - \xi_j^* > 1,$$

implying that $\xi_j' = 0$ is a feasible solution and yet gives a smaller objective value since $(\xi_j')^2 = 0 < (\xi_j^*)^2$, a contradiction to the assumption that $\xi_j^*$ is optimal.

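2. [3 pts] After removing the last set of constraints, we get a simpler problem:

$$\begin{aligned}
\min_{w, b, \xi} \quad & \frac{1}{2} w^\top w + \frac{C}{2} \sum_{i=1}^{l} \xi_i^2 \\
\text{s.t.} \quad & y_i (w^\top x_i + b) \geq 1 - \xi_i, \quad i \in \{1, \ldots, l\}.
\end{aligned} \tag{1}$$

Give the Lagrangian of (1).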

Solution: The Lagrangian is

$$L(w, b, \xi, \alpha) := \frac{1}{2} w^\top w + \frac{C}{2} \sum_{i=1}^{l} \xi_i^2 - \sum_{i=1}^{l} \alpha_i \big(y_i (w^\top x_i + b) - 1 + \xi_i\big),$$

where $\alpha_i \geq 0,\ \forall i$ are the Lagrange multipliers.

3. [6 pts] Derive the dual of (1). How is it different from the dual of the standard SVM with the hinge loss?

Solution: Taking partial derivatives of the Lagrangian wrt w, b and $\xi_i$,

$$\begin{aligned}
\nabla_w L(w, b, \xi, \alpha) = 0 &\iff w = \sum_{i=1}^{l} \alpha_i y_i x_i, \\
\partial_b L(w, b, \xi, \alpha) = 0 &\iff \sum_{i=1}^{l} \alpha_i y_i = 0, \\
\partial_{\xi_i} L(w, b, \xi, \alpha) = 0 &\iff \xi_i = \alpha_i / C.
\end{aligned}$$

Plugging these back into the Lagrangian, rearranging terms and keeping constraints on the Lagrange multipliers, we obtain the dual

$$\begin{aligned}
\max_{\alpha} \quad & -\frac{1}{2} \alpha^\top (Q + I/C)\, \alpha + \mathbf{1}^\top \alpha \\
\text{s.t.} \quad & y^\top \alpha = 0, \quad \alpha_i \geq 0\ \forall i,
\end{aligned}$$

where $\mathbf{1}$ is a vector of ones, I is the identity matrix, y is the vector of labels $y_i$, and Q is the l-by-l kernel matrix such that $Q_{ij} = y_i y_j x_i^\top x_j$. Compared with the dual of the standard SVM, the quadratic term is regularized by an additional positive diagonal matrix, and thus has stronger convexity leading to faster convergence. The other difference is that the dual variables here are only bounded from below, but in the standard SVM the dual variables are bounded both from above (by C) and from below. In the L2 SVM, the tradeoff parameter C enters the dual only through the diagonal term I/C rather than as an upper bound on the $\alpha_i$, so the solution still depends on C.
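
The effect of the extra diagonal term can be checked numerically. The sketch below builds Q for a tiny made-up data set whose feature vectors are collinear, so that Q is singular, and compares its smallest eigenvalue with that of Q + I/C; the data and the value of C are assumptions for illustration.

```python
import numpy as np

# Toy data (hypothetical): all feature vectors lie on one line, so Q has rank 1.
X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0]])
y = np.array([1.0, 1.0, -1.0])
C = 10.0

Yx = y[:, None] * X
Q = Yx @ Yx.T                                   # Q_ij = y_i y_j x_i^T x_j

min_eig_Q = np.linalg.eigvalsh(Q).min()                         # approximately 0
min_eig_reg = np.linalg.eigvalsh(Q + np.eye(len(y)) / C).min()  # approximately 1/C
```

Because every eigenvalue is shifted up by 1/C, the dual objective of the L2 SVM is strictly concave even when Q alone is only positive semidefinite.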

5.2 Leave-one-out Error and Support Vectors

[3 pts] Consider the standard two-class SVM with the hinge loss. Argue that under a given value of C,

$$\text{LOO error} \leq \frac{\#\text{SVs}}{l}$$

where l is the size of the training data and #SVs is the number of support vectors obtained by training SVM on the entire set of training data.

Solution: Since the decision function only depends on the support vectors, removing a non-support vector from the training data and then re-training an SVM would lead to the same decision function. Also, non-support vectors must be classified correctly. As a result, errors found in the leave-one-out validation must be caused by removing the support vectors, proving the desired result.

6 Boosting [15 pts]

should be able to yield three decision regions. Also notice that at least one of $h_1$ or $h_2$ misclassifies two positive examples. If only $h_2$ misclassifies two positive examples, the possible decisions are (1) sign($\alpha_1 - \alpha_2$) on those two positive examples, (2) sign($\alpha_1 + \alpha_2$) on the remaining positive examples and (3) sign($\alpha_1 - \alpha_2$) on the negative examples, which don’t yield zero training error since signs on (1) and (3) agree. If both $h_1$ and $h_2$ misclassify two positive examples, we have (1) sign($\alpha_1 - \alpha_2$) on two positive examples, (2) sign($-\alpha_1 + \alpha_2$) on the remaining positive examples and (3) sign($-\alpha_1 - \alpha_2$) on the negative examples, which again don’t yield zero training error since signs on (1) and (2) don’t agree.

2. [2 pts] Why do we want to use “weak” learners when boosting?

Solution: To prevent overfitting, since the complexity of the overall learner increases at each step. Starting with weak learners implies the final classifier will be less likely to overfit.

3. [4 pts] Suppose AdaBoost is run on m training examples, and suppose on each round that the weighted training error $\epsilon_t$ of the $t$-th weak hypothesis is at most $1/2 - \gamma$, for some number $\gamma > 0$. After how many iterations, T, will the combined hypothesis H be consistent with the m training examples, i.e., achieve zero training error? Your answer should only be expressed in terms of m and γ. (Hint: What is the training error when 1 example is misclassified?)

Solution: Training error when 1 example is misclassified = 1/m. Therefore, we need to guarantee that training error is < 1/m. Since $\epsilon_t \leq 1/2 - \gamma$, from class notes we know that

$$\text{Training err of the combined hypothesis } H \leq \exp(-2T\gamma^2)$$

The upper bound is < 1/m if $T > \ln m / (2\gamma^2)$.
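
To get a feel for the size of this bound, the sketch below computes the smallest integer T that satisfies it. The function name and the example values of m and γ are hypothetical.

```python
import math

def rounds_for_zero_training_error(m, gamma):
    """Smallest integer T with T > ln(m) / (2 * gamma^2)."""
    return math.floor(math.log(m) / (2.0 * gamma**2)) + 1

# Example: 1000 training points, each weak learner beats chance by gamma = 0.1.
T = rounds_for_zero_training_error(m=1000, gamma=0.1)
bound = math.exp(-2.0 * T * 0.1**2)   # upper bound on training error after T rounds
# T -> 346, and bound is below 1/1000
```

The number of rounds grows only logarithmically in m but quadratically in 1/γ, so the edge of the weak learners over chance matters far more than the data set size.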