The final exam for the CS 189 Introduction to Machine Learning course offered in Spring 2017. The exam is closed book and closed notes except for a two-page cheat sheet. It consists of 26 multiple-choice questions worth 3 points each and 7 written questions worth a total of 72 points, covering topics such as clustering algorithms, PCA, SVD, neural networks, decision trees, and learning theory.
Turn your cell phone off and leave all electronics at the front of the room, or risk getting a zero on the exam. You have 3 hours to finish the exam.
The exam consists of 26 multiple-choice questions worth 3 points each and 7 written questions worth a total of 72 points.
Fill in the bubbles for ALL correct choices: there may be more than one correct choice, but there is always at least one correct choice. NO partial credit on multiple-answer questions: the set of all correct answers must be checked.
First name
Last name
First and last name of student to your left
First and last name of student to your right
Fill in the bubbles for ALL correct choices: there may be more than one correct choice, but there is always at least one correct
choice. NO partial credit: the set of all correct answers must be checked.
(1) [3 pts] Which of the following are NP-hard problems? Let X ∈ R^{n×d} be a design matrix, let y ∈ R^n be a vector of labels, let L be the Laplacian matrix of some n-vertex graph, and let 1 = [1 1 ... 1]^⊤.
min_{μ,y} Σ_{i=1}^{k} Σ_{j : y_j = i} |X_j − μ_i|^2, where each μ_i is the mean of the sample points assigned class i
© min_{y} Σ_{i=1}^{k} Σ_{j : y_j = i} |X_j − μ_i|^2, with each μ_i fixed
© min_{y ∈ R^n} (1/4) y^⊤ L y subject to |y|^2 = n; 1^⊤ y = 0
min_{y ∈ R^n} (1/4) y^⊤ L y subject to ∀j, y_j ∈ {−1, +1}; 1^⊤ y = 0
(2) [3 pts] Which clustering algorithms permit you to decide the number of clusters after the clustering is done?
© k-means clustering
agglomerative clustering with single linkage
a k-d tree used for divisive clustering
© spectral graph clustering with 3 eigenvectors
(3) [3 pts] For which of the following does normalizing your input features influence the predictions?
© decision tree (with usual splitting method)
Lasso
neural network
soft-margin support vector machine
(4) [3 pts] With the SVD, we write X = UDV^⊤. For which of the following matrices are the eigenvectors the columns of U?
© X
XX^⊤
© X^⊤X
XX^⊤XX^⊤
(5) [3 pts] Why is PCA sometimes used as a preprocessing step before regression?
To reduce overfitting by removing poorly predictive dimensions.
© To expose information missing from the input data.
To make computation faster by reducing the dimensionality of the data.
© For inference and scientific discovery, we prefer features that are not axis-aligned.
(6) [3 pts] Consider the matrix X = Σ_{i=1}^{r} α_i u_i v_i^⊤, where each α_i is a scalar and each u_i and v_i is a vector. It is possible that the rank of X might be
© r + 1
r
r − 1
(7) [3 pts] Why would we use a random forest instead of a decision tree?
© For lower training error.
To reduce the variance of the model.
To better approximate posterior probabilities.
© For a model that is easier for a human to interpret.
(8) [3 pts] What tends to be true about increasing the k in k-nearest neighbors?
The decision boundary tends to get smoother.
The bias tends to increase.
© The variance tends to increase.
As the number of sample points approaches infinity (with n/k → ∞), the error rate approaches less than twice the Bayes risk (assuming training and test points are drawn independently from the same distribution).
(9) [3 pts] Which of the following statements are true about the entropy of a discrete probability distribution?
It is a useful criterion for picking splits in decision trees.
© It is a convex function of the class probabilities.
It is maximized when the probability distribution is uniform.
© It is minimized when the probability distribution is uniform.
(10) [3 pts] A low-rank approximation of a matrix can be useful for
removing noise.
discovering latent categories in the data.
filling in unknown values.
matrix compression.
(11) [3 pts] Let L be the Laplacian matrix of a graph with n vertices. Let
β = min_{y ∈ R^n : ∀i, y_i ∈ {−1, +1}; 1^⊤ y = 0} y^⊤ L y   and   γ = min_{y ∈ R^n : |y|^2 = n; 1^⊤ y = 0} y^⊤ L y.
Which of the following are true for every Laplacian matrix L?
β ≥ γ
© β ≤ γ
© β > γ
© β < γ
(12) [3 pts] Which of the following are true about decision trees?
© They can be used only for classification.
© The tree depth never exceeds O(log n) for n sample points.
© All the leaves must be pure.
Pruning usually achieves better test accuracy than stopping early.
(13) [3 pts] Which of the following is an effective way of reducing overfitting in neural networks?
Augmenting the training data with similar synthetic examples
Weight decay (i.e., ℓ2 regularization)
© Increasing the number of layers
Dropout
(14) [3 pts] If the VC dimension of a hypothesis class H is an integer D < ∞ (i.e., VC(H) = D), this means
there exists some set of D points shattered by H.
© all sets of D points are shattered by H.
no set of D + 1 points is shattered by H.
Π_H(D) = 2^D.
(15) [3 pts] Consider the minimizer w* of the ℓ2-regularized least squares objective J(w) = |Xw − y|_2^2 + λ|w|_2^2 with λ > 0. Which of the following are true?
© Xw* = y
© w* = X^+ y, where X^+ is the pseudoinverse of X
© w* exists if and only if X^⊤X is nonsingular
The minimizer w* is unique
(16) [3 pts] You are training a neural network, but the training error is high. Which of the following, if done in isolation, has
a better-than-tiny chance of reducing the training error?
Adding another hidden layer
Normalizing the input data
Adding more units to hidden layers
© Training on more data
(17) [3 pts] Filters in the late layers of a convolutional neural network designed to classify objects in photographs likely
represent
© edge detectors.
concepts such as “there is an animal.”
concepts such as “this image contains wheels.”
© concepts such as “Jen is flirting with Dan.”
(18) [3 pts] Which of the following techniques usually speeds up the training of a sigmoid-based neural network on a classification task?
© Using batch descent instead of stochastic
© Increasing the learning rate with every iteration
Having a good initialization of the weights
Using the cross-entropy loss instead of the mean
squared error
(19) [3 pts] In a soft-margin support vector machine, decreasing the slack penalty term C causes
© more overfitting.
less overfitting.
© a smaller margin.
less sensitivity to outliers.
(20) [3 pts] The shortest distance from a point z to a hyperplane w^⊤x = 0 is
© w^⊤z
w^⊤z / |w|
© w^⊤z / |w|^2
© |w| · |z|
(21) [3 pts] The Bayes decision rule
does the best a classifier can do, in expectation
© can be computed exactly from a large sample
chooses the class with the greatest posterior probability, if we use the 0-1 risk function
minimizes the risk functional
(22) [3 pts] Which of the following are techniques commonly used in training neural nets?
© linear programming
backpropagation
© Newton’s method
cross-validation
(23) [3 pts] Which of these statements about learning theory are correct?
The VC dimension of halfplanes is 3.
© For a fixed set of training points, the more dichotomies Π we have, the higher the probability that the training error is close to the true risk.
© The VC dimension of halfspaces in 3D is ∞.
For a fixed hypothesis class H, the more training points we have, the higher the probability that the training error is close to the true risk.
(24) [3 pts] Which of the following statements are true for a design matrix X ∈ R^{n×d} with d > n? (The rows are n sample points and the columns represent d features.)
© Least-squares linear regression computes the weights w = (X^⊤X)^{−1} X^⊤ y.
© X has exactly d − n eigenvectors with eigenvalue zero.
© The sample points are linearly separable.
At least one principal component direction is orthogonal to a hyperplane that contains all the sample points.
(25) [3 pts] Which of the following visuals accurately represent the clustering produced by greedy agglomerative hierarchical
clustering with centroid linkage on the set of feature vectors {(-2, -2), (-2, 0), (1, 3), (2, 2), (3, 4)}?
[The answer choices are figures: plots of the five points in the xy-plane (x from −4 to 4, y from −3 to 5) showing candidate groupings, and cluster trees over the labeled points (-2, -2), (-2, 0), (1, 3), (2, 2), (3, 4). The figures are not reproducible in this text copy.]
(26) [3 pts] Which of the following statements is true about the standard k-means clustering algorithm?
© The random partition initialization method usually outperforms the Forgy method.
After a sufficiently large number of iterations, the clusters will stop changing.
© It is computationally infeasible to find the optimal clustering of n = 15 points in k = 3 clusters.
© You can use the metric d(x, y) = (x · y) / (|x| |y|)
(a) [3 pts] Consider a convolutional neural network for reading the handwritten MNIST letters, which are 28 × 28 images.
Suppose the first hidden layer is a convolutional layer with 20 different 5 × 5 filters, applied to the input image with a
stride of 1 (i.e., every filter is applied to every 5 × 5 patch of the image, with patches allowed to overlap). Each filter has
a bias weight. How many weights (parameters) does this layer use?
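For reference, a quick count (treating each bias as a parameter, as the question states): each of the 20 filters has 5 × 5 = 25 weights plus 1 bias, and the weights are shared across all patch positions, so the layer uses 20 × (25 + 1) = 520 parameters. The number of patches does not enter the count because of weight sharing.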
(b) [3 pts] Let X be an n × d design matrix representing n sample points in R^d. Let X = UDV^⊤ be the singular value decomposition of X. We stated in lecture that row i of the matrix UD gives the coordinates of sample point X_i in principal coordinates space, i.e., X_i · v_j for each j, where X_i is the ith row of X and v_j is the jth column of V. Show that this is true.
As V is an orthogonal matrix, we can write XV = UDV^⊤V = UD. By the definition of matrix multiplication, (UD)_{ij} = (XV)_{ij} = X_i · v_j.
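For intuition, this identity is easy to verify numerically (a minimal NumPy sketch; the design matrix below is arbitrary illustrative data, not from the exam):

import numpy as np

X = np.array([[2., 0., 1.],
              [1., 3., 0.],
              [0., 1., 4.],
              [3., 2., 2.],
              [1., 1., 1.]])

U, d, Vt = np.linalg.svd(X, full_matrices=False)   # X = U diag(d) V^T
V = Vt.T

UD = U * d          # row i holds the principal coordinates of sample i
XV = X @ V          # entry (i, j) is X_i . v_j
print(np.allclose(UD, XV))      # True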
(c) [3 pts] Let x, y ∈ R^d be two points (e.g., sample or test points). Consider the function k(x, y) = x^⊤ rev(y), where rev(y) reverses the order of the components in y. For example, rev([1, 2, 3]^⊤) = [3, 2, 1]^⊤. Show that k cannot be a valid kernel function.
Hint: remember how the kernel function is defined, and show a simple two-dimensional counterexample.
We have that k((−1, 1), (−1, 1)) = −2, but this is impossible: if k were a valid kernel, there would be some function Φ such that k(x, x) = Φ(x)^⊤Φ(x) ≥ 0.
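A two-line numerical check of this counterexample (a sketch; rev is implemented here simply as NumPy slice reversal):

import numpy as np

def k(x, y):
    return x @ y[::-1]          # x^T rev(y)

x = np.array([-1.0, 1.0])
print(k(x, x))                  # prints -2.0; a valid kernel would require k(x, x) = |Phi(x)|^2 >= 0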
Suppose we are reliability testing n units taken randomly from a population of identical appliances. We want to estimate the mean failure time of the population. We assume the failure times come from an exponential distribution with parameter λ > 0, whose probability density function is f(x) = λe^{−λx} (on the domain x ≥ 0) and whose cumulative distribution function is F(x) = ∫_0^x f(t) dt = 1 − e^{−λx}.
(a) [6 pts] In an ideal (but impractical) scenario, we run the units until they all fail. The failure times are t_1, t_2, ..., t_n.
Formulate the likelihood function L(λ; t_1, ..., t_n) for our data. Then find the maximum likelihood estimate λ̂ for the distribution's parameter.
L(λ; t_1, ..., t_n) = ∏_{i=1}^{n} f(t_i) = ∏_{i=1}^{n} λe^{−λt_i} = λ^n e^{−λ Σ_{i=1}^{n} t_i}
ln L(λ) = n ln λ − λ Σ_{i=1}^{n} t_i
∂/∂λ ln L(λ) = n/λ − Σ_{i=1}^{n} t_i = 0   ⟹   λ̂ = n / Σ_{i=1}^{n} t_i
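A quick simulation can sanity-check the estimate λ̂ = n / Σ_{i=1}^{n} t_i (a minimal NumPy sketch; the true λ and sample size below are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
lam_true, n = 2.0, 100_000
t = rng.exponential(scale=1.0 / lam_true, size=n)   # simulated failure times

lam_hat = n / t.sum()                               # the MLE derived above
print(lam_hat)                                      # close to 2.0 for large n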
(b) [4 pts] In a more realistic scenario, we run the units for a fixed time T. We observe r unit failures, where 0 ≤ r ≤ n, and there are n − r units that survive the entire time T without failing. The failure times are t_1, t_2, ..., t_r.
Formulate the likelihood function L(λ; n, r, t_1, ..., t_r) for our data. Then find the maximum likelihood estimate λ̂ for the distribution's parameter.
Hint 1: What is the probability that a unit will not fail during time T? Hint 2: It is okay to define L(λ) in a way that includes contributions (densities and probability masses) that are not commensurate with each other. Then the constant of proportionality of L(λ) is meaningless, but that constant is irrelevant for finding the best-fit parameter λ̂. Hint 3: If you're confused, for part marks write down the likelihood that r units fail and n − r units survive; then try the full problem. Hint 4: If you do it right, λ̂ will be the number of observed failures divided by the sum of unit test times.
L(λ; n, r, t_1, ..., t_r) ∝ ∏_{i=1}^{r} f(t_i) · (1 − F(T))^{n−r} = ∏_{i=1}^{r} λe^{−λt_i} · (e^{−λT})^{n−r} = λ^r e^{−λ Σ_{i=1}^{r} t_i} e^{−λ(n−r)T}
ln L(λ) = r ln λ − λ Σ_{i=1}^{r} t_i − λ(n − r)T + constant
∂/∂λ ln L(λ) = r/λ − Σ_{i=1}^{r} t_i − (n − r)T = 0   ⟹   λ̂ = r / (Σ_{i=1}^{r} t_i + (n − r)T)
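The censored estimate λ̂ = r / (Σ_{i=1}^{r} t_i + (n − r)T) can be sanity-checked the same way (a sketch; the true λ, n, and censoring time T below are arbitrary):

import numpy as np

rng = np.random.default_rng(1)
lam_true, n, T = 2.0, 100_000, 0.4
times = rng.exponential(scale=1.0 / lam_true, size=n)   # latent failure times

observed = times[times <= T]                            # the r failures we actually see
r = len(observed)
lam_hat = r / (observed.sum() + (n - r) * T)
print(lam_hat)                                          # close to 2.0 for large n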
Consider the design matrix [the 6 × 2 matrix of feature values is not reproduced in this copy] representing 6 sample points, each with two features f_1 and f_2. The labels for the data are [not reproduced]. In this question, we build a decision tree of depth 2 by hand to classify the data.
(a) [2 pts] What is the entropy at the root of the tree?
−0.5 log_2 0.5 − 0.5 log_2 0.5 = 1.
(b) [3 pts] What is the rule for the first split? Write your answer in a form like f_1 ≥ 4 or f_2 ≥ 3. Hint: you should be able to eyeball the best split without calculating the entropies.
If we sort by f_1, the features and the corresponding labels are [not reproduced]; if we sort by f_2, we have [not reproduced].
The best split is f_1 ≥ [threshold not reproduced].
(c) [3 pts] For each of the two treenodes after the first split, what is the rule for the second split?
For the treenode with labels (1, 1), there’s no need to split again.
For the treenode with labels (0, 1, 0, 0), if we sort by f_1, we have [not reproduced]; if we sort using f_2, we get [not reproduced]. We easily see we should choose f_2 ≥ [threshold not reproduced].
(d) [2 pts] Let’s return to the root of the tree, and suppose we’re incompetent tree builders. Is there a (not trivial) split at the
root that would have given us an information gain of zero? Explain your answer.
Yes. The rules f_1 ≥ 5, f_2 ≥ 3, or f_2 ≥ 5 would all fail to reduce the weighted average entropy below 1.
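For reference, a minimal sketch of the entropy and information-gain computations used throughout this question (NumPy; the labels at the bottom are hypothetical, since the exam's design matrix is not reproduced in this copy):

import numpy as np

def entropy(labels):
    # Shannon entropy (base 2) of an array of class labels.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def info_gain(parent, left, right):
    # Entropy of the parent minus the weighted average entropy of the two children.
    n = len(left) + len(right)
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / n
    return entropy(parent) - weighted

# Hypothetical labels: a split whose children are each still 50/50 has zero information gain.
parent = np.array([0, 0, 0, 1, 1, 1])
print(info_gain(parent, np.array([0, 1]), np.array([0, 0, 1, 1])))   # 0.0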
We are building a random forest for a 2-class classification problem with t decision trees and bagging. The input is an n × d design matrix X representing n sample points in R^d (quantitative real-valued features only). For the ith decision tree we create an n-point training set X^(i) through standard bagging. At each node of each tree, we randomly select k of the features (this random subset is selected independently for each treenode) and choose the single-feature split that maximizes the information gain, compared to all possible single-feature splits on those k features. Assume that we can radix sort real numbers in linear time, and we can randomly select an item from a set in constant time.
(a) [3 pts] Remind us how bagging works. How do we generate the data sets X^(i)? What do we do with duplicate points?
For each training set X^(i), we select n sample points from X uniformly at random with replacement. Duplicate points have proportionally greater weight in the entropy (or other cost function) calculations. (We will accept an answer that states that duplicate points are treated as if they were separate points infinitesimally close together.)
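A minimal sketch of generating one bagged training set X^(i) (NumPy; the toy X and y below are illustrative):

import numpy as np

def bagged_sample(X, y, rng):
    # Draw n indices uniformly at random with replacement; duplicates are kept,
    # which is what gives repeated points proportionally greater weight.
    n = X.shape[0]
    idx = rng.integers(0, n, size=n)
    return X[idx], y[idx]

rng = np.random.default_rng(42)
X = np.arange(12.0).reshape(6, 2)
y = np.array([0, 1, 0, 1, 0, 1])
X_i, y_i = bagged_sample(X, y, rng)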
(b) [3 pts] Fill in the blanks to derive the overall running time to construct a random forest with bagging and random subset selection. Let h be the height/depth (they're the same thing) of the tallest/deepest tree in the forest. You must use the tightest bounds possible with respect to n, d, t, k, h, and n′.
Consider choosing the split at a treenode whose box contains n′ sample points. We can choose the best split for these n′ sample points in O( ) time. Therefore, the running time per sample point in that node is O( ).
Each sample point in X^(i) participates in at most O( ) treenodes, so each sample point contributes at most O( ) to the time. Therefore, the total running time for one tree is O( ).
We have t trees, so the total running time to create the random forest is O( ).
The blanks in order: O(n′k), O(k), O(h), O(kh), O(nkh), O(nkht). They are each worth half a point. (For the first blank: radix sorting the n′ points on each of the k chosen features and scanning each sorted list for the best split takes O(n′k) time.)
(c) [2 pts] If we instead use a support vector machine to choose the split in each treenode, how does that change the asymptotic query time to classify a test point?
It slows queries down by a factor of Θ(d) or Θ(k) (depending whether you run the SVM on k features or all d features—we'll accept either interpretation), because it is necessary to inspect all d (or k) features of the query point at each treenode.
(d) [3 pts] Why does bagging by itself (without random subset selection) tend not to improve the performance of decision
trees as much as we might expect?
It is common that the same few features tend to dominate in all of the subsets, so almost all the trees will tend to have
very similar early splits, and therefore all the trees will produce very similar estimates. The models are not decorrelated
enough.
Consider this convolutional neural network architecture.
In the first layer, we have a one-dimensional convolution with a single filter of size 3 such that h_i = s(Σ_{j=1}^{3} v_j x_{i+j−1}). The second layer is fully connected, such that z = Σ_{i=1}^{4} w_i h_i. The hidden units' activation function s(x) is the logistic (sigmoid) function with derivative s′(x) = s(x)(1 − s(x)). The output unit is linear (no activation function). We perform gradient descent on the loss function R = (y − z)^2, where y is the training label for x.
(a) [1 pt] What is the total number of parameters in this neural network? Recall that convolutional layers share weights.
There are no bias terms.
The answer is 7. There are 3 parameters in layer 1 and 4 parameters in layer 2.
(b) [4 pts] Compute ∂R/∂w_i.
∂R/∂w_i = −2(y − z) h_i
(c) [1 pt] Vectorize the previous expression—that is, write ∂R/∂w.
∂R/∂w = −2(y − z) h
(d) [5 pts] Compute ∂R/∂v_j.
∂R/∂v_j = −2(y − z) ∂z/∂v_j = −2(y − z) Σ_{i=1}^{4} (∂z/∂h_i)(∂h_i/∂v_j) = −2(y − z) Σ_{i=1}^{4} w_i h_i (1 − h_i) x_{i+j−1}
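These formulas can be verified with a finite-difference gradient check (a minimal sketch of the forward pass and of ∂R/∂v_j; the inputs, weights, and label below are arbitrary):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(v, w, x):
    # h_i = s(sum_{j=1}^{3} v_j x_{i+j-1}) for i = 1..4 (0-indexed below); z = sum_i w_i h_i
    h = sigmoid(np.array([v @ x[i:i + 3] for i in range(4)]))
    return h, h @ w

rng = np.random.default_rng(0)
v, w, x, y = rng.normal(size=3), rng.normal(size=4), rng.normal(size=6), 0.7

h, z = forward(v, w, x)
grad_v = np.array([-2 * (y - z) * np.sum(w * h * (1 - h) * x[j:j + 4]) for j in range(3)])

eps = 1e-6
for j in range(3):
    vp, vm = v.copy(), v.copy()
    vp[j] += eps
    vm[j] -= eps
    numeric = ((y - forward(vp, w, x)[1]) ** 2 - (y - forward(vm, w, x)[1]) ** 2) / (2 * eps)
    print(np.isclose(numeric, grad_v[j]))   # True for each j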
(a) [3 pts] Write down the Laplacian matrix L_G of the following graph G. Every edge has weight 1. [The graph and the answer matrix are not reproduced in this copy.]
(b) [2 pts] Find three orthogonal eigenvectors of L_G, all having eigenvalue 0.
x, y, z = [the three eigenvectors are not reproduced in this copy].
(c) [2 pts] Use two of those three eigenvectors (it doesn't matter which two) to assign each vertex of G a spectral vector in R^2. Draw these vectors in the plane, and explain how they partition G into three clusters. (Optional alternative: if you can draw 3D figures well, you are welcome to use all three eigenvectors and assign each vertex a spectral vector in R^3.)
The eigenvectors x and y give the embedding [drawing not reproduced]. Each of the three clusters is mapped to a single point in R^2.
(d) [3 pts] Let K_n be the complete graph on n vertices (every pair of vertices is connected by an edge of weight 1) and let L_{K_n} be its Laplacian matrix. The eigenvectors of L_{K_n} are v_1 = 1 and every vector that is orthogonal to 1. What are the eigenvalues of L_{K_n}?
λ_1 = 0 and λ_2 = · · · = λ_n = n.
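One way to see this: L_{K_n} = nI − 11^⊤, so L_{K_n} 1 = n1 − n1 = 0 (eigenvalue 0), and for any v orthogonal to 1, L_{K_n} v = nv − 1(1^⊤v) = nv (eigenvalue n).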
(e) [1 pt] What property of these eigenvalues gives us a hint that the complete graph does not have any good partitions?
λ_2 is large, so there is no low-sparsity cut. (The sparsity of the optimal cut is at least proportional to λ_2, so a large λ_2 rules out any sparse cut.)
Consider learning closed intervals on the real line. Our hypothesis class H consists of all intervals of the form [a, b] where
a < b and a, b ∈ R. We interpret an interval (hypothesis) [a, b] ∈ H as a classifier that identifies a point x as being in class C if
a ≤ x ≤ b, and identifies x as not being in class C if x < a or x > b.
(a) [2 pts] Consider a set containing two distinct points on the real line. Which such sets can be shattered by H?
All sets of two distinct points can be shattered.
(b) [2 pts] Show that no three points can be shattered by H.
Let X = {x_1, x_2, x_3} with x_1 ≤ x_2 ≤ x_3. No interval can contain x_1 and x_3 without containing x_2. (I.e., suppose x_1 and x_3 are in class C, but x_2 is not.)
(c) [2 pts] Write down the shatter function Π_H(n). Explain your answer.
Π_H(n) = (n choose 2) + n + 1.
There are (n choose 2) dichotomies with two or more points in class C (imagine choosing the first and the last sample point in class C), n dichotomies with one point, and one dichotomy with no points.
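As a sanity check, Π_H(3) = 3 + 3 + 1 = 7 < 2^3 = 8, which is consistent with part (b): no set of three points can be shattered, so the VC dimension of H is 2.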
(d) [2 pts] Consider another hypothesis class H_2. Each hypothesis in H_2 is a union of two intervals. H_2 is the set of all such hypotheses (i.e., every union of two intervals on the number line). For example, [3, 7] ∪ [8.5, 10] ∈ H_2; that's the set of all points x such that 3 ≤ x ≤ 7 or 8.5 ≤ x ≤ 10.
What is the largest number of distinct points that H_2 can shatter? Explain why no larger number can be shattered.
Four. If you have five distinct points, H_2 cannot include the first, third, and fifth points while excluding the second and fourth.
(e) [2 pts] Which hypothesis class has a greater sample complexity, H or H_2? Explain why.
H_2, because its shatter function grows faster (quartic rather than quadratic) and its VC dimension is greater.