
CS 189
Spring 2017
Introduction to Machine Learning Final

  • Please do not open the exam before you are instructed to do so.
  • The exam is closed book, closed notes except your two-page cheat sheet.
  • Electronic devices are forbidden on your person, including cell phones, iPods, headphones, and laptops. Turn your cell phone off and leave all electronics at the front of the room, or risk getting a zero on the exam.
  • You have 3 hours.
  • Please write your initials at the top right of each odd-numbered page (e.g., write “JS” if you are Jonathan Shewchuk). Finish this by the end of your 3 hours.
  • Mark your answers on the exam itself in the space provided. Do not attach any extra sheets.
  • The total number of points is 150. There are 26 multiple choice questions worth 3 points each, and 7 written questions worth a total of 72 points.
  • For multiple answer questions, fill in the bubbles for ALL correct choices: there may be more than one correct choice, but there is always at least one correct choice. NO partial credit on multiple answer questions: the set of all correct answers must be checked.

First name

Last name

SID

First and last name of student to your left

First and last name of student to your right

Q1. [78 pts] Multiple Answer

Fill in the bubbles for ALL correct choices: there may be more than one correct choice, but there is always at least one correct choice. NO partial credit: the set of all correct answers must be checked. (In the answer key below, ● marks a choice that should be filled in and ○ marks one that should not.)

(1) [3 pts] Which of the following are NP-hard problems? Let $X \in \mathbb{R}^{n \times d}$ be a design matrix, let $y \in \mathbb{R}^n$ be a vector of labels, let $L$ be the Laplacian matrix of some $n$-vertex graph, and let $\mathbf{1} = [1 \; 1 \; \ldots \; 1]^\top$.

● $\min_{\mu, y} \sum_{i=1}^{k} \sum_{y_j = i} |X_j - \mu_i|^2$ where each $\mu_i$ is the mean of sample points assigned class $i$
○ $\min_{y} \sum_{i=1}^{k} \sum_{y_j = i} |X_j - \mu_i|^2$ with each $\mu_i$ fixed
○ $\min_{y \in \mathbb{R}^n} \frac{1}{4} y^\top L y$ subject to $|y|^2 = n$ and $\mathbf{1}^\top y = 0$
● $\min_{y \in \mathbb{R}^n} \frac{1}{4} y^\top L y$ subject to $\forall j,\ y_j \in \{-1, +1\}$ and $\mathbf{1}^\top y = 0$

(2) [3 pts] Which clustering algorithms permit you to decide the number of clusters after the clustering is done?

○ k-means clustering
● agglomerative clustering with single linkage
● a k-d tree used for divisive clustering
○ spectral graph clustering with 3 eigenvectors

(3) [3 pts] For which of the following does normalizing your input features influence the predictions?

○ decision tree (with usual splitting method)
● Lasso
● neural network
● soft-margin support vector machine

(4) [3 pts] With the SVD, we write $X = UDV^\top$. For which of the following matrices are the eigenvectors the columns of $U$?

○ $X^\top X$
● $X X^\top$
○ $X^\top X X^\top X$
● $X X^\top X X^\top$

(5) [3 pts] Why is PCA sometimes used as a preprocessing step before regression?

● To reduce overfitting by removing poorly predictive dimensions.
○ To expose information missing from the input data.
● To make computation faster by reducing the dimensionality of the data.
○ For inference and scientific discovery, we prefer features that are not axis-aligned.

(6) [3 pts] Consider the matrix $X = \sum_{i=1}^{r} \alpha_i u_i v_i^\top$ where each $\alpha_i$ is a scalar and each $u_i$ and $v_i$ is a vector. It is possible that the rank of $X$ might be

○ $r + 1$
● $r$
● $r - 1$
● $0$

(7) [3 pts] Why would we use a random forest instead of a decision tree?

○ For lower training error.
● To reduce the variance of the model.
● To better approximate posterior probabilities.
○ For a model that is easier for a human to interpret.

(8) [3 pts] What tends to be true about increasing the k in k-nearest neighbors?

● The decision boundary tends to get smoother.
● The bias tends to increase.
○ The variance tends to increase.
● As the number of sample points approaches infinity (with n/k → ∞), the error rate approaches less than twice the Bayes risk (assuming training and test points are drawn independently from the same distribution).

(9) [3 pts] Which of the following statements are true about the entropy of a discrete probability distribution?

● It is a useful criterion for picking splits in decision trees.
○ It is a convex function of the class probabilities.
● It is maximized when the probability distribution is uniform.
○ It is minimized when the probability distribution is uniform.

(10) [3 pts] A low-rank approximation of a matrix can be useful for

● removing noise.
● discovering latent categories in the data.
● filling in unknown values.
● matrix compression.

(11) [3 pts] Let $L$ be the Laplacian matrix of a graph with $n$ vertices. Let
$$\beta = \min_{\substack{y \in \mathbb{R}^n,\ \forall i,\ y_i \in \{-1, +1\} \\ \mathbf{1}^\top y = 0}} y^\top L y \qquad \text{and} \qquad \gamma = \min_{\substack{y \in \mathbb{R}^n,\ |y|^2 = n \\ \mathbf{1}^\top y = 0}} y^\top L y.$$
Which of the following are true for every Laplacian matrix $L$?

● $\beta \geq \gamma$
○ $\beta \leq \gamma$
○ $\beta > \gamma$
○ $\beta < \gamma$

(12) [3 pts] Which of the following are true about decision trees?

○ They can be used only for classification.
○ The tree depth never exceeds O(log n) for n sample points.
○ All the leaves must be pure.
● Pruning usually achieves better test accuracy than stopping early.

(13) [3 pts] Which of the following is an effective way of reducing overfitting in neural networks?

● Augmenting the training data with similar synthetic examples
● Weight decay (i.e., $\ell_2$ regularization)
○ Increasing the number of layers
● Dropout

(14) [3 pts] If the VC dimension of a hypothesis class H is an integer D < ∞ (i.e., VC(H) = D), this means

● there exists some set of D points shattered by H.
○ all sets of D points are shattered by H.
● no set of D + 1 points is shattered by H.
● $\Pi_H(D) = 2^D$.

(15) [3 pts] Consider the minimizer $w^*$ of the $\ell_2$-regularized least squares objective $J(w) = |Xw - y|^2 + \lambda |w|^2$ with λ > 0. Which of the following are true?

○ $Xw^* = y$
○ $w^* = X^{+} y$, where $X^{+}$ is the pseudoinverse of $X$
○ $w^*$ exists if and only if $X^\top X$ is nonsingular
● The minimizer $w^*$ is unique

(16) [3 pts] You are training a neural network, but the training error is high. Which of the following, if done in isolation, has a better-than-tiny chance of reducing the training error?

● Adding another hidden layer
● Normalizing the input data
● Adding more units to hidden layers
○ Training on more data

(17) [3 pts] Filters in the late layers of a convolutional neural network designed to classify objects in photographs likely represent

○ edge detectors.
● concepts such as “there is an animal.”
● concepts such as “this image contains wheels.”
○ concepts such as “Jen is flirting with Dan.”

(18) [3 pts] Which of the following techniques usually speeds up the training of a sigmoid-based neural network on a classification task?

○ Using batch descent instead of stochastic
○ Increasing the learning rate with every iteration
● Having a good initialization of the weights
● Using the cross-entropy loss instead of the mean squared error

(19) [3 pts] In a soft-margin support vector machine, decreasing the slack penalty term C causes

○ more overfitting.
● less overfitting.
○ a smaller margin.
● less sensitivity to outliers.

(20) [3 pts] The shortest distance from a point z to a hyperplane $w^\top x = 0$ is

○ $w^\top z$
● $\dfrac{w^\top z}{|w|}$
○ $\dfrac{w^\top z}{|w|^2}$
○ $|w| \cdot |z|$

(21) [3 pts] The Bayes decision rule

● does the best a classifier can do, in expectation
○ can be computed exactly from a large sample
● chooses the class with the greatest posterior probability, if we use the 0-1 risk function
● minimizes the risk functional

(22) [3 pts] Which of the following are techniques commonly used in training neural nets?

○ linear programming
● backpropagation
○ Newton’s method
● cross-validation

(23) [3 pts] Which of these statements about learning theory are correct?

● The VC dimension of halfplanes is 3.
○ For a fixed set of training points, the more dichotomies Π we have, the higher the probability that the training error is close to the true risk.
○ The VC dimension of halfspaces in 3D is ∞.
● For a fixed hypothesis class H, the more training points we have, the higher the probability that the training error is close to the true risk.

(24) [3 pts] Which of the following statements are true for a design matrix $X \in \mathbb{R}^{n \times d}$ with d > n? (The rows are n sample points and the columns represent d features.)

○ Least-squares linear regression computes the weights $w = (X^\top X)^{-1} X^\top y$.
○ X has exactly d − n eigenvectors with eigenvalue zero.
○ The sample points are linearly separable.
● At least one principal component direction is orthogonal to a hyperplane that contains all the sample points.

(25) [3 pts] Which of the following visuals accurately represent the clustering produced by greedy agglomerative hierarchical clustering with centroid linkage on the set of feature vectors {(−2, −2), (−2, 0), (1, 3), (2, 2), (3, 4)}?

[The answer choices are figures (plots of the points and dendrograms over the leaves (−2, −2), (−2, 0), (1, 3), (2, 2), (3, 4)); they are not reproduced in this text-only copy.]

(26) [3 pts] Which of the following statements is true about the standard k-means clustering algorithm?

○ The random partition initialization method usually outperforms the Forgy method.
● After a sufficiently large number of iterations, the clusters will stop changing.
○ It is computationally infeasible to find the optimal clustering of n = 15 points in k = 3 clusters.
○ You can use the metric $d(x, y) = \dfrac{x \cdot y}{|x| \cdot |y|}$.

Q2. [9 pts] A Miscellany

(a) [3 pts] Consider a convolutional neural network for reading the handwritten MNIST letters, which are 28 × 28 images. Suppose the first hidden layer is a convolutional layer with 20 different 5 × 5 filters, applied to the input image with a stride of 1 (i.e., every filter is applied to every 5 × 5 patch of the image, with patches allowed to overlap). Each filter has a bias weight. How many weights (parameters) does this layer use?

20 × (5 × 5 + 1) = 520.
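A quick sanity check of this count, sketched in plain Python (assuming a single-channel 28 × 28 input and one bias per filter, as in the problem):

```python
# Sketch: count the parameters of a conv layer with 20 filters of size 5x5,
# a single input channel, and one bias weight per filter.
num_filters = 20
filter_h, filter_w = 5, 5
in_channels = 1

weights_per_filter = in_channels * filter_h * filter_w + 1  # +1 for the bias
total_params = num_filters * weights_per_filter
print(total_params)  # 520

# With stride 1 and no padding, each filter yields a 24x24 activation map,
# but the parameter count is independent of output size (weights are shared).
output_h = 28 - filter_h + 1  # 24
```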

(b) [3 pts] Let X be an n × d design matrix representing n sample points in $\mathbb{R}^d$. Let $X = UDV^\top$ be the singular value decomposition of X. We stated in lecture that row i of the matrix UD gives the coordinates of sample point $X_i$ in principal coordinates space, i.e., $X_i \cdot v_j$ for each j, where $X_i$ is the ith row of X and $v_j$ is the jth column of V. Show that this is true.

As V is an orthogonal matrix, we can write $XV = UDV^\top V = UD$. By the definition of matrix multiplication, $(UD)_{ij} = X_i \cdot v_j$.
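A small NumPy sketch of the same identity on a random matrix (the names X, U, D, V follow the problem; the 6 × 4 shape is an arbitrary illustration):

```python
import numpy as np

# Numerically check that XV = UD, i.e., row i of UD holds the
# principal coordinates X_i · v_j.
rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4))            # n = 6 sample points in R^4

U, s, Vt = np.linalg.svd(X, full_matrices=False)
D = np.diag(s)
V = Vt.T

UD = U @ D
XV = X @ V                                  # entry (i, j) is X_i · v_j
print(np.allclose(UD, XV))                  # True, since V^T V = I
```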

(c) [3 pts] Let $x, y \in \mathbb{R}^d$ be two points (e.g., sample or test points). Consider the function $k(x, y) = x^\top \mathrm{rev}(y)$ where $\mathrm{rev}(y)$ reverses the order of the components in y. For example, $\mathrm{rev}([y_1 \;\; y_2]^\top) = [y_2 \;\; y_1]^\top$. Show that k cannot be a valid kernel function. Hint: remember how the kernel function is defined, and show a simple two-dimensional counterexample.

We have that k((−1, 1), (−1, 1)) = −2, but this is impossible as, if k is a valid kernel, then there is some function Φ such that $k(x, x) = \Phi(x)^\top \Phi(x) \geq 0$.
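The counterexample is easy to verify numerically; a tiny NumPy sketch, with rev implemented by slicing:

```python
import numpy as np

# k(x, x) must be nonnegative for a valid kernel, because
# k(x, x) = Φ(x)·Φ(x) = |Φ(x)|^2.
def k(x, y):
    return x @ y[::-1]           # x^T rev(y): reverse y's components

x = np.array([-1.0, 1.0])
print(k(x, x))                   # -2.0 < 0, so k cannot be a kernel
```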

Q3. [10 pts] Maximum Likelihood Estimation for Reliability Testing

Suppose we are reliability testing n units taken randomly from a population of identical appliances. We want to estimate the mean failure time of the population. We assume the failure times come from an exponential distribution with parameter λ > 0, whose probability density function is $f(x) = \lambda e^{-\lambda x}$ (on the domain x ≥ 0) and whose cumulative distribution function is $F(x) = \int_0^x f(t) \, dt = 1 - e^{-\lambda x}$.

(a) [6 pts] In an ideal (but impractical) scenario, we run the units until they all fail. The failure times are $t_1, t_2, \ldots, t_n$. Formulate the likelihood function $L(\lambda; t_1, \ldots, t_n)$ for our data. Then find the maximum likelihood estimate $\hat\lambda$ for the distribution’s parameter.

$$L(\lambda; t_1, \ldots, t_n) = \prod_{i=1}^{n} f(t_i) = \prod_{i=1}^{n} \lambda e^{-\lambda t_i} = \lambda^n e^{-\lambda \sum_{i=1}^{n} t_i}$$
$$\ln L(\lambda) = n \ln \lambda - \lambda \sum_{i=1}^{n} t_i$$
$$\frac{\partial}{\partial \lambda} \ln L(\lambda) = \frac{n}{\lambda} - \sum_{i=1}^{n} t_i = 0$$
$$\hat\lambda = \frac{n}{\sum_{i=1}^{n} t_i}$$
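A brief NumPy simulation (illustrative rate and sample size only) confirming that the estimator $\hat\lambda = n / \sum_i t_i$ recovers the true parameter:

```python
import numpy as np

# Simulate failure times from an exponential distribution and apply the MLE.
rng = np.random.default_rng(1)
true_lambda = 0.5
n = 10_000
t = rng.exponential(scale=1.0 / true_lambda, size=n)   # NumPy uses scale = 1/λ

lambda_hat = n / t.sum()
print(lambda_hat)        # close to 0.5 for large n
```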

(b) [4 pts] In a more realistic scenario, we run the units for a fixed time T. We observe r unit failures, where 0 ≤ r ≤ n, and there are n − r units that survive the entire time T without failing. The failure times are $t_1, t_2, \ldots, t_r$.

Formulate the likelihood function $L(\lambda; n, r, t_1, \ldots, t_r)$ for our data. Then find the maximum likelihood estimate $\hat\lambda$ for the distribution’s parameter.

Hint 1: What is the probability that a unit will not fail during time T? Hint 2: It is okay to define L(λ) in a way that includes contributions (densities and probability masses) that are not commensurate with each other. Then the constant of proportionality of L(λ) is meaningless, but that constant is irrelevant for finding the best-fit parameter $\hat\lambda$. Hint 3: If you’re confused, for part marks write down the likelihood that r units fail and n − r units survive; then try the full problem. Hint 4: If you do it right, $\hat\lambda$ will be the number of observed failures divided by the sum of unit test times.

$$L(\lambda; n, r, t_1, \ldots, t_r) \propto \prod_{i=1}^{r} f(t_i) \cdot (1 - F(T))^{n-r} = \prod_{i=1}^{r} \lambda e^{-\lambda t_i} \left( e^{-\lambda T} \right)^{n-r} = \lambda^r e^{-\lambda \sum_{i=1}^{r} t_i} \, e^{-\lambda (n-r) T}$$
$$\ln L(\lambda) = r \ln \lambda - \lambda \sum_{i=1}^{r} t_i - \lambda (n-r) T + \text{constant}$$
$$\frac{\partial}{\partial \lambda} \ln L(\lambda) = \frac{r}{\lambda} - \sum_{i=1}^{r} t_i - (n-r) T = 0$$
$$\hat\lambda = \frac{r}{\sum_{i=1}^{r} t_i + (n-r) T}$$
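A matching sketch for the censored setting, with an illustrative cutoff T and true rate (not part of the original solution):

```python
import numpy as np

# Units failing after time T are only known to have survived; the MLE divides
# the number of observed failures by the total test time  Σ t_i + (n - r) T.
rng = np.random.default_rng(2)
true_lambda, n, T = 0.5, 10_000, 3.0

times = rng.exponential(scale=1.0 / true_lambda, size=n)
failed = times <= T
t_observed = times[failed]            # the r observed failure times
r = failed.sum()

lambda_hat = r / (t_observed.sum() + (n - r) * T)
print(r, lambda_hat)                  # λ̂ close to 0.5
```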

Q4. [10 pts] Decision Trees

Consider the design matrix [not reproduced in this copy] representing 6 sample points, each with two features $f_1$ and $f_2$. The labels for the data are [not reproduced in this copy].

In this question, we build a decision tree of depth 2 by hand to classify the data.

(a) [2 pts] What is the entropy at the root of the tree?

$-0.5 \log_2 0.5 - 0.5 \log_2 0.5 = 1$
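A small helper (a sketch, not part of the original solution) that reproduces this value from the 0.5/0.5 class proportions used above:

```python
import numpy as np

def entropy(probs):
    """Shannon entropy in bits; 0 log 0 is taken to be 0."""
    probs = np.asarray(probs, dtype=float)
    probs = probs[probs > 0]
    return -np.sum(probs * np.log2(probs))

print(entropy([0.5, 0.5]))   # 1.0, matching the answer
```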

(b) [3 pts] What is the rule for the first split? Write your answer in a form like $f_1 \geq 4$ or $f_2 \geq 3$. Hint: you should be able to eyeball the best split without calculating the entropies.

If we sort by $f_1$, the features and the corresponding labels are [table not reproduced]. If we sort by $f_2$, we have [table not reproduced]. The best split is $f_1 \geq 7$.

(c) [3 pts] For each of the two treenodes after the first split, what is the rule for the second split?

For the treenode with labels (1, 1), there’s no need to split again. For the treenode with labels (0, 1, 0, 0), if we sort by $f_1$, we have [table not reproduced]. If we sort using $f_2$, we get [table not reproduced]. We easily see we should choose $f_2 \geq 2$.

(d) [2 pts] Let’s return to the root of the tree, and suppose we’re incompetent tree builders. Is there a (not trivial) split at the root that would have given us an information gain of zero? Explain your answer.

Yes. The rules $f_1 \geq 5$, $f_2 \geq 3$, or $f_2 \geq 5$ would all fail to reduce the weighted average entropy below 1.

Q5. [11 pts] Bagging and Random Forests

We are building a random forest for a 2-class classification problem with t decision trees and bagging. The input is an n × d design matrix X representing n sample points in $\mathbb{R}^d$ (quantitative real-valued features only). For the ith decision tree we create an n-point training set $X^{(i)}$ through standard bagging. At each node of each tree, we randomly select k of the features (this random subset is selected independently for each treenode) and choose the single-feature split that maximizes the information gain, compared to all possible single-feature splits on those k features. Assume that we can radix sort real numbers in linear time, and we can randomly select an item from a set in constant time.

(a) [3 pts] Remind us how bagging works. How do we generate the data sets $X^{(i)}$? What do we do with duplicate points?

For each training set $X^{(i)}$, we select n sample points from X uniformly at random with replacement. Duplicate points have proportionally greater weight in the entropy (or other cost function) calculations. (We will accept an answer that states that duplicate points are treated as if they were separate points infinitesimally close together.)
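A minimal sketch of the resampling step (illustrative shapes; not part of the original solution):

```python
import numpy as np

# Generate one bagged training set X^(i) by drawing n row indices uniformly
# at random with replacement.  Duplicated rows simply appear multiple times
# and therefore carry proportionally more weight in the entropy computation.
rng = np.random.default_rng(3)
n, d = 8, 5
X = rng.standard_normal((n, d))
y = rng.integers(0, 2, size=n)

idx = rng.integers(0, n, size=n)      # n draws with replacement
X_bag, y_bag = X[idx], y[idx]
print(sorted(idx))                    # some indices repeat, some are missing
```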

(b) [3 pts] Fill in the blanks to derive the overall running time to construct a random forest with bagging and random subset selection. Let h be the height/depth (they’re the same thing) of the tallest/deepest tree in the forest. You must use the tightest bounds possible with respect to n, d, t, k, h, and n′.

Consider choosing the split at a treenode whose box contains n′ sample points. We can choose the best split for these n′ sample points in O(______) time. Therefore, the running time per sample point in that node is O(______). Each sample point in $X^{(i)}$ participates in at most O(______) treenodes, so each sample point contributes at most O(______) to the time. Therefore, the total running time for one tree is O(______). We have t trees, so the total running time to create the random forest is O(______).

The blanks in order: O(n′k), O(k), O(h), O(kh), O(nkh), O(nkht). They are each worth half a point.

(c) [2 pts] If we instead use a support vector machine to choose the split in each treenode, how does that change the asymptotic query time to classify a test point?

It slows queries down by a factor of Θ(d) or Θ(k) (depending whether you run the SVM on k features or all d features; we’ll accept either interpretation), because it is necessary to inspect all d (or k) features of the query point at each treenode.

(d) [3 pts] Why does bagging by itself (without random subset selection) tend not to improve the performance of decision trees as much as we might expect?

It is common that the same few features tend to dominate in all of the subsets, so almost all the trees will tend to have very similar early splits, and therefore all the trees will produce very similar estimates. The models are not decorrelated enough.

Q6. [11 pts] One-Dimensional ConvNet Backprop

Consider this convolutional neural network architecture. In the first layer, we have a one-dimensional convolution with a single filter of size 3 such that $h_i = s\!\left(\sum_{j=1}^{3} v_j x_{i+j-1}\right)$. The second layer is fully connected, such that $z = \sum_{i=1}^{4} w_i h_i$. The hidden units’ activation function s(x) is the logistic (sigmoid) function with derivative $s'(x) = s(x)(1 - s(x))$. The output unit is linear (no activation function). We perform gradient descent on the loss function $R = (y - z)^2$, where y is the training label for x.

(a) [1 pt] What is the total number of parameters in this neural network? Recall that convolutional layers share weights.

There are no bias terms.

The answer is 7. There are 3 parameters in layer 1 and 4 parameters in layer 2.

(b) [4 pts] Compute $\partial R / \partial w_i$.

$$\frac{\partial R}{\partial w_i} = -2(y - z)\, h_i$$

(c) [1 pt] Vectorize the previous expression; that is, write $\partial R / \partial w$.

$$\frac{\partial R}{\partial w} = -2(y - z)\, h$$

(d) [5 pts] Compute $\partial R / \partial v_j$.

$$\frac{\partial R}{\partial v_j} = -2(y - z) \frac{\partial z}{\partial v_j} = -2(y - z) \sum_{i=1}^{4} \frac{\partial z}{\partial h_i} \frac{\partial h_i}{\partial v_j} = -2(y - z) \sum_{i=1}^{4} w_i\, h_i (1 - h_i)\, x_{i+j-1}$$
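A short NumPy sketch of this network’s forward pass and the gradients derived above, checked against finite differences (random illustrative values; $x \in \mathbb{R}^6$, $v \in \mathbb{R}^3$, $w \in \mathbb{R}^4$, zero-based indexing):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, v, w):
    # h_i = s(v · x[i:i+3]) for i = 0..3; z = w · h
    h = sigmoid(np.array([v @ x[i:i + 3] for i in range(4)]))
    return h, w @ h

rng = np.random.default_rng(4)
x, v, w = rng.standard_normal(6), rng.standard_normal(3), rng.standard_normal(4)
y = 0.7

h, z = forward(x, v, w)
dR_dw = -2 * (y - z) * h
dR_dv = -2 * (y - z) * np.array(
    [np.sum(w * h * (1 - h) * x[j:j + 4]) for j in range(3)]
)

# Finite-difference check of dR/dv (the two columns should agree closely).
eps = 1e-6
for j in range(3):
    vp, vm = v.copy(), v.copy()
    vp[j] += eps
    vm[j] -= eps
    Rp = (y - forward(x, vp, w)[1]) ** 2
    Rm = (y - forward(x, vm, w)[1]) ** 2
    print(dR_dv[j], (Rp - Rm) / (2 * eps))
```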

Q7. [11 pts] Spectral Graph Partitioning

(a) [3 pts] Write down the Laplacian matrix $L_G$ of the following graph G. Every edge has weight 1.

$$L_G = \begin{bmatrix}
2 & -1 & -1 & 0 & 0 & 0 & 0 \\
-1 & 2 & -1 & 0 & 0 & 0 & 0 \\
-1 & -1 & 2 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & -1 \\
0 & 0 & 0 & 0 & 1 & -1 & 0 \\
0 & 0 & 0 & 0 & -1 & 1 & 0 \\
0 & 0 & 0 & -1 & 0 & 0 & 1
\end{bmatrix}$$

(b) [2 pts] Find three orthogonal eigenvectors of $L_G$, all having eigenvalue 0.

The indicator vectors of the three connected components work:
$$x = \begin{bmatrix} 1 & 1 & 1 & 0 & 0 & 0 & 0 \end{bmatrix}^\top, \quad y = \begin{bmatrix} 0 & 0 & 0 & 1 & 0 & 0 & 1 \end{bmatrix}^\top, \quad z = \begin{bmatrix} 0 & 0 & 0 & 0 & 1 & 1 & 0 \end{bmatrix}^\top.$$
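A quick NumPy check that these indicator vectors are orthogonal eigenvectors of $L_G$ with eigenvalue 0:

```python
import numpy as np

# The Laplacian from part (a).
L = np.array([
    [ 2, -1, -1,  0,  0,  0,  0],
    [-1,  2, -1,  0,  0,  0,  0],
    [-1, -1,  2,  0,  0,  0,  0],
    [ 0,  0,  0,  1,  0,  0, -1],
    [ 0,  0,  0,  0,  1, -1,  0],
    [ 0,  0,  0,  0, -1,  1,  0],
    [ 0,  0,  0, -1,  0,  0,  1],
])

x = np.array([1, 1, 1, 0, 0, 0, 0])   # component {1, 2, 3}
y = np.array([0, 0, 0, 1, 0, 0, 1])   # component {4, 7}
z = np.array([0, 0, 0, 0, 1, 1, 0])   # component {5, 6}

for v in (x, y, z):
    assert np.all(L @ v == 0)                          # eigenvalue 0
assert x @ y == 0 and x @ z == 0 and y @ z == 0        # mutually orthogonal
```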

(c) [2 pts] Use two of those three eigenvectors (it doesn’t matter which two) to assign each vertex of G a spectral vector in $\mathbb{R}^2$. Draw these vectors in the plane, and explain how they partition G into three clusters. (Optional alternative: if you can draw 3D figures well, you are welcome to use all three eigenvectors and assign each vertex a spectral vector in $\mathbb{R}^3$.)

The eigenvectors x and y give the embedding
$$1, 2, 3 \mapsto \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \qquad 4, 7 \mapsto \begin{bmatrix} 0 \\ 1 \end{bmatrix}, \qquad 5, 6 \mapsto \begin{bmatrix} 0 \\ 0 \end{bmatrix}.$$
Each of the three clusters is mapped to a single point in $\mathbb{R}^2$.

(d) [3 pts] Let $K_n$ be the complete graph on n vertices (every pair of vertices is connected by an edge of weight 1) and let $L_{K_n}$ be its Laplacian matrix. The eigenvectors of $L_{K_n}$ are $v_1 = \mathbf{1} = [1 \; \ldots \; 1]^\top$ and every vector that is orthogonal to $\mathbf{1}$. What are the eigenvalues of $L_{K_n}$?

$\lambda_1 = 0$ and $\lambda_2 = \cdots = \lambda_n = n$.
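A brief numerical check, using the fact that $L_{K_n} = nI - \mathbf{1}\mathbf{1}^\top$ (degree $n-1$ on the diagonal, $-1$ off the diagonal); the choice n = 6 is illustrative:

```python
import numpy as np

n = 6
L = n * np.eye(n) - np.ones((n, n))   # Laplacian of the complete graph K_n
eigvals = np.linalg.eigvalsh(L)
print(np.round(eigvals, 8))           # one 0 and n-1 copies of n
```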

(e) [1 pt] What property of these eigenvalues gives us a hint that the complete graph does not have any good partitions?

$\lambda_2$ is large, so there is no low-sparsity cut. (The optimal cut has sparsity $\geq \lambda_2 / 2$.)

Q8. [10 pts] We Hope You Learned This

Consider learning closed intervals on the real line. Our hypothesis class H consists of all intervals of the form [a, b] where a < b and a, b ∈ ℝ. We interpret an interval (hypothesis) [a, b] ∈ H as a classifier that identifies a point x as being in class C if a ≤ x ≤ b, and identifies x as not being in class C if x < a or x > b.

(a) [2 pts] Consider a set containing two distinct points on the real line. Which such sets can be shattered by H?

All sets of two distinct points can be shattered.

(b) [2 pts] Show that no three points can be shattered by H.

Let $X = \{x_1, x_2, x_3\}$ with $x_1 \leq x_2 \leq x_3$. No interval can contain $x_1$ and $x_3$ without containing $x_2$. (I.e., suppose $x_1$ and $x_3$ are in class C, but $x_2$ is not.)

(c) [2 pts] Write down the shatter function $\Pi_H(n)$. Explain your answer.

$$\Pi_H(n) = \binom{n}{2} + n + 1 = \frac{n^2 + n}{2} + 1$$

There are $\binom{n}{2}$ dichotomies with two or more points (imagine choosing the first and the last sample point in class C), n dichotomies with one point, and one dichotomy with no points.
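A brute-force check of this count for small n (a sketch assuming the n points are distinct and sorted, so an interval can pick out exactly the contiguous runs of points, plus the empty set):

```python
# Enumerate the dichotomies induced by intervals [a, b] on n distinct points
# and compare with Π_H(n) = C(n, 2) + n + 1.
def interval_dichotomies(n):
    dichotomies = {frozenset()}                           # the all-negative dichotomy
    for i in range(n):
        for j in range(i, n):
            dichotomies.add(frozenset(range(i, j + 1)))   # points inside [x_i, x_j]
    return len(dichotomies)

for n in range(1, 7):
    predicted = n * (n - 1) // 2 + n + 1
    print(n, interval_dichotomies(n), predicted)          # the two counts agree
```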

(d) [2 pts] Consider another hypothesis class $H_2$. Each hypothesis in $H_2$ is a union of two intervals. $H_2$ is the set of all such hypotheses (i.e., every union of two intervals on the number line). For example, [3, 7] ∪ [8.5, 10] ∈ $H_2$; that’s the set of all points x such that 3 ≤ x ≤ 7 or 8.5 ≤ x ≤ 10.

What is the largest number of distinct points that $H_2$ can shatter? Explain why no larger number can be shattered.

Four. If you have five distinct points, $H_2$ cannot include the first, third, and fifth points while excluding the second and fourth.

(e) [2 pts] Which hypothesis class has a greater sample complexity, H or $H_2$? Explain why.

$H_2$, because its shatter function grows faster (quartic rather than quadratic) and its VC dimension is greater.