CIS 520: Machine Learning Final Exam, Fall 2013

The final exam for the CIS 520: Machine Learning course at the University of Pennsylvania in Fall 2013. The exam consists of 93 questions and allows two one-page, two-sided cheat sheets. The exam covers topics such as ridge regression, linear regression, conjugate prior, Lasso, stepwise regression, and elastic net. The exam policy, time limit, and grading instructions are also provided.

Typology: Exams

2012/2013

Uploaded on 05/11/2023

tanvir

UNIVERSITY of PENNSYLVANIA
CIS 520: Machine Learning
Final, Fall 2013
Exam policy: This exam allows two one-page, two-sided cheat sheets; No other materials.
Time: 2 hours. Be sure to write your name and Penn student ID (the 8 bigger digits on your ID
card) on the scantron form and fill in the associated bubbles in pencil. If you are taking this as a
WPE, then enter only your WPE exam number.
If you think a question is ambiguous, mark what you think is the best answer. The questions seek
to test your general understanding; they are not intentionally “trick questions.” As always, we
will consider written regrade requests if your interpretation of a question differed from what we
intended. We will only grade the scantron forms.
For the “TRUE or FALSE” questions, note that “TRUE” is (a) and “FALSE” is (b). For the
multiple choice questions, select exactly one answer.
The exam is 10 pages long and has 93 questions.
Name:

  1. [0 points] This is version A of the exam. Please fill in the “bubble” for that letter.
  2. [1 points] True or False? Ridge regression finds the global optimum for minimizing its loss function (squared error plus an appropriate penalty).

F SOLUTION: True

  1. [2 points] Ridge regression minimizes which of the following? (Assume, as usual, n observations).

(a) $\sum_i (y_i - w^\top x_i)^2 + \lambda \|w\|_2^2$

(b) $\sum_i (y_i - w^\top x_i)^2 + \lambda \|w\|_2$

(c) $(1/n) \sum_i (y_i - w^\top x_i)^2 + \lambda \|w\|_2^2$

(d) $(1/n) \sum_i (y_i - w^\top x_i)^2 + \lambda \|w\|_2$

(e) $\sum_i (y_i - w^\top x_i)^2 - \lambda \|w\|_2$

F SOLUTION: A
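
As a rough numerical illustration of option (a), the sketch below minimizes the sum of squared errors plus $\lambda \|w\|_2^2$ in closed form and checks that random perturbations never lower the objective. The data, the seed, and the helper names `ridge_objective` and `ridge_closed_form` are illustrative assumptions, not part of the exam.

```python
import numpy as np

def ridge_objective(w, X, y, lam):
    """Option (a): sum of squared errors plus lam * ||w||_2^2."""
    residual = y - X @ w
    return residual @ residual + lam * (w @ w)

def ridge_closed_form(X, y, lam):
    """Solve (X^T X + lam I) w = X^T y, the stationarity condition of option (a)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Synthetic data (assumed for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=40)
lam = 5.0

w_hat = ridge_closed_form(X, y, lam)
best_value = ridge_objective(w_hat, X, y, lam)

# The objective is strictly convex for lam > 0, so no perturbation should do better.
perturbed = [ridge_objective(w_hat + 0.01 * rng.normal(size=3), X, y, lam)
             for _ in range(100)]
is_minimum = all(value >= best_value for value in perturbed)

# Gradient of option (a) at the closed-form solution; it should vanish.
gradient = -2 * X.T @ (y - X @ w_hat) + 2 * lam * w_hat
print(is_minimum, np.abs(gradient).max())
```

Because the objective is convex, the stationary point found this way is the global optimum, which is consistent with the earlier True or False question on ridge regression.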

  1. [2 points] When doing linear regression with n = 1,000,000 observations and p = 10,000 features, if one expects around 500 or 1,000 features to enter the model, the best penalty to use is

(a) AIC penalty (b) BIC penalty (c) RIC penalty (d) no penalty

F SOLUTION: B

  1. [1 points] True or False? The conjugate prior to the Gaussian distribution is a Gaussian distribution.

F SOLUTION: True

  1. [1 points] True or False? Lasso finds the global optimum for minimizing its loss function (squared error plus an appropriate penalty).

F SOLUTION: True

  1. [1 points] True or False? Stepwise regression with a BIC penalty term finds the global optimum for minimizing its loss function (squared error plus an appropriate penalty).

F SOLUTION: False

  1. [1 points] True or False? SVMs are generally formulated as MLE algorithms.

F SOLUTION: False

  1. [1 points] True or False? One can make a good argument that minimizing an $L_1$ loss penalty in regression gives “better” results than the more traditional $L_2$ loss function minimized by ordinary least squares.

F SOLUTION: True

  1. [1 points] True or False? $L_2$ penalized linear regression is, in general, more sensitive to outliers than $L_1$ penalized linear regression. (I.e. one point that is far from the predicted regression line will have more effect on the regression coefficients.)

F SOLUTION: True

  1. [1 points] True or False? Large margin methods like SVMs tend to be slightly less accurate in predictions (when measured with an $L_0$ loss function) than logistic regression.

F SOLUTION: False

  1. [1 points] True or False? One can do kernelized logistic regression to get many of the same benefits one would get using kernels in SVMs.

F SOLUTION: True

  1. [1 points] True or False? It is not possible to both reduce bias and reduce variance by changing model forms (e.g. by adding a kernel to an SVM).

F SOLUTION: False

  1. [1 points] True or False? It is sometimes useful to do linear regression in the dual space.

F SOLUTION: True
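
One way to see why the dual can be useful: with far more features than observations, the dual form of ridge regression solves an n x n system instead of a p x p one and gives the same weights. This is a minimal sketch on synthetic data; the sizes, seed, and variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, lam = 5, 50, 1.0          # few observations, many features (assumed sizes)
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# Primal ridge: solve a p x p system in X^T X.
w_primal = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Dual ridge: solve an n x n system in the Gram matrix X X^T, then map back.
alpha = np.linalg.solve(X @ X.T + lam * np.eye(n), y)
w_dual = X.T @ alpha

same_solution = np.allclose(w_primal, w_dual)
print(same_solution)
```

The dual only touches the data through the Gram matrix $X X^\top$, which is also what makes kernelized regression possible.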

  1. [1 points] True or False? Linear SVMs tend to overfit more than standard logistic regression.

F SOLUTION: False

  1. [2 points] You are doing ridge regression. You first estimate a regression model with some data. You then estimate a second model with four times as many observations (but the same ridge penalty). Roughly how do you expect the regression coefficients to change when more data is used?

(a) The coefficients should on average shrink towards zero (become smaller in absolute value). (b) The coefficients should on average move away from zero (become larger in absolute value). (c) The coefficients will not change.

F SOLUTION: B
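
A small simulation can make this answer concrete. The sketch assumes the unnormalized objective (sum of squared errors plus $\lambda \|w\|_2^2$), a fixed penalty, and synthetic Gaussian data; the sample sizes and the name `simulate` are hypothetical. With more data the squared-error term grows relative to the fixed penalty, so the fitted coefficients are shrunk less.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution for sum of squared errors + lam * ||w||_2^2."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(2)
w_true = np.ones(3)   # assumed ground-truth coefficients
lam = 100.0           # the same ridge penalty for both fits

def simulate(n):
    """Draw n synthetic observations and fit ridge regression to them."""
    X = rng.normal(size=(n, 3))
    y = X @ w_true + 0.1 * rng.normal(size=n)
    return ridge_fit(X, y, lam)

w_small = simulate(50)    # first model
w_large = simulate(200)   # four times as many observations

norm_small = np.linalg.norm(w_small)
norm_large = np.linalg.norm(w_large)
print(norm_small, norm_large)
```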

  1. [1 points] True or False? Suppose we have two datasets $X_1$ and $X_2$ which each represent the same observations (and the same Y labels) except that $X_1$’s features are a subset of the features of $X_2$. (I.e., $X_2$ is $X_1$ with extra columns added.) Stepwise linear regression with an $L_0$ penalty will always add at least as many features when trained on the bigger feature set, $X_2$.

F SOLUTION: False

  1. [1 points] True or False? RIC (Risk Inflation Criterion) can be viewed as an MDL method.

F SOLUTION: True

  1. [1 points] True or False? If you expect a tiny fraction of the features to enter a model, BIC is a better penalty to use than RIC.

F SOLUTION: False

  1. [1 points] True or False? If you expect a tiny fraction of the features to enter a model, an appropriate $L_0$ penalty will probably work better than an $L_1$ penalty.

F SOLUTION: True

  1. [1 points] True or False? The elastic net tends to select fewer features than well-optimized $L_0$ penalty methods.

F SOLUTION: True

  1. [1 points] True or False? At each round $t$ of boosting, the minimum error of the classifier on the weighted dataset, $\epsilon_t$, is a monotone non-decreasing function, i.e. $\epsilon_i \le \epsilon_j$ for all $i < j$.

F SOLUTION: False

  1. [1 points] True or False? Hinge loss is upper bounded by the loss function of boosting.

F SOLUTION: True
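
The bound can be checked numerically. The sketch assumes the usual definitions in terms of the margin $m = y f(x)$: hinge loss $\max(0, 1 - m)$ and the exponential loss $e^{-m}$ used by AdaBoost. The grid of margins is an arbitrary choice.

```python
import numpy as np

margins = np.linspace(-3.0, 3.0, 601)        # m = y * f(x) on an assumed grid
hinge = np.maximum(0.0, 1.0 - margins)       # SVM hinge loss
exponential = np.exp(-margins)               # exponential (boosting) loss

# exp(-m) >= 1 - m and exp(-m) >= 0 for every m, so the bound should hold pointwise.
# The small tolerance only absorbs floating-point rounding where the two curves touch.
bounded = bool(np.all(hinge <= exponential + 1e-12))
print(bounded)
```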

  1. [1 points] True or False? Boosting usually averages over many decision trees, each of which is learned without pruning.

F SOLUTION: False

  1. [1 points] True or False? In general, if you have multiple methods for making predictions, it is better to pick the best one rather than using the majority vote of the methods as the prediction.

F SOLUTION: False

  1. [1 points] True or False? Perceptrons are an online alternative to SVMs that generally converge to the same solution as SVMs.

F SOLUTION: False

  1. [1 points] True or False? Averaged and voted perceptrons result in models that contain similar numbers of parameters and hence are of similar accuracy and similar computational cost to use in making predictions.

F SOLUTION: False

  1. [2 points] Which of the following classifiers has the lowest 0-1 error ($L_0$ loss) given a training set with an infinite number of observations?

(a) Standard SVM with optimal regularization (b) Logistic regression with optimal regularization (c) Naive Bayes (d) Bayesian classifier (one that classifies using an estimated model p(y|x; θ) using the true distributional form, i.e. the correct equation for p(y|x))

F SOLUTION: D

  1. [2 points] An SVM using a Gaussian kernel gives the same separating hyperplane as a linear SVM when σ in the denominator of the exponential in the Gaussian approaches:

(a) 0 (b) ∞ (c) the Gaussian kernel SVM does not give the same hyperplane as the linear SVM in any limit

F SOLUTION: C: As σ goes to infinity, the Gaussian kernel becomes $\exp(-\|a - b\|^2/\sigma^2) \to \exp(0) = 1$: a singular kernel which is not the same as the linear kernel. Therefore, it does not become the linear SVM under any circumstance.
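
The limit in this solution can be illustrated numerically. The sketch assumes the usual squared-distance form of the Gaussian kernel, $\exp(-\|a - b\|^2/\sigma^2)$, and uses random points; `gaussian_gram` is a hypothetical helper name.

```python
import numpy as np

def gaussian_gram(X, sigma):
    """Gram matrix of exp(-||a - b||^2 / sigma^2) over the rows of X."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / sigma ** 2)

rng = np.random.default_rng(3)
X = rng.normal(size=(5, 2))      # five assumed sample points in the plane

K_wide = gaussian_gram(X, sigma=1e4)

# For very large sigma every entry approaches 1, so the Gram matrix is
# (nearly) the rank-one all-ones matrix rather than the linear Gram matrix X X^T.
near_constant = np.allclose(K_wide, np.ones((5, 5)), atol=1e-6)
print(near_constant)
```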

  1. [1 points] True or False? SVMs are, in general, more sensitive to the actual distribution of the data than logistic regression is.

F SOLUTION: False

  1. [2 points] Which model generally has the highest bias?

(a) linear SVM (b) Gaussian SVM (c) perceptron with constant update

F SOLUTION: C

  1. [2 points] It is best to use the primal SVM when ... (Choose the best answer.)

(a) p >> n (b) n >> p (c) we don’t have a kernel (d) the dual doesn’t exist

F SOLUTION: B

  1. [2 points] What is the objective for $L_2$ regularized $L_1$-loss SVM?

(a) $\min \|w\|_1 + C \sum_i \xi_i$ subject to $y_i(w^\top x_i + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$ for $i = 1 \ldots n$

(b) $\min \frac{1}{2}\|w\|_2^2 + C \sum_i \xi_i$ subject to $y_i(w^\top x_i + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$ for $i = 1 \ldots n$

F SOLUTION: D: As C increases, the weight for the $\frac{1}{2}\|w\|_2^2$ term goes down, therefore the margin shrinks. Since the margin shrinks, the model has high variance but low bias.

  1. [2 points] Suppose we have two datasets $X_1$, $X_2$ representing the same observations (with the same Y labels) except that $X_1$’s features are a subset of the features of $X_2$. If we apply $L_2$ regularized $L_1$ loss SVM to this dataset then which is true?

(a) The number of support vectors for $X_1$ ≤ the number of support vectors for $X_2$. (b) The number of support vectors for $X_1$ ≥ the number of support vectors for $X_2$. (c) Not enough information to tell.

F SOLUTION: A: For intuition: in a very high dimensional space, almost all points are support vectors, so if you keep fewer features, there will be relatively fewer support vectors.

  1. [2 points] Kernels work naturally with which of the following:

I. PCA II. linear regression III. SVMs IV. decision trees

(a) I and III (b) II and III (c) I, II and III (d) all of them

F SOLUTION: C

  1. [1 points] True or False? $k(x, y) = \exp(-\|x - y\|)$ is a valid kernel for any norm $\|z\|_p$.

F SOLUTION: True

  1. [1 points] True or False? $k(x, y) = \|x - y\|_2^2 + c^2$ is a valid kernel.

F SOLUTION: True

  1. [1 points] True or False? $k(x, y) = \sum_{i=1}^{n} \max(x_i, y_i)$ is a valid kernel.

F SOLUTION: False
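
A small counterexample shows why the answer is False: a valid kernel must give a positive semidefinite Gram matrix for every set of points. The sketch uses two one-dimensional points, 0 and 1, chosen here purely for illustration; `max_kernel` is a hypothetical helper name.

```python
import numpy as np

def max_kernel(x, y):
    """k(x, y) = sum_i max(x_i, y_i)."""
    return float(np.sum(np.maximum(x, y)))

# Two one-dimensional points (an assumed counterexample).
points = [np.array([0.0]), np.array([1.0])]
K = np.array([[max_kernel(a, b) for b in points] for a in points])

# K = [[0, 1], [1, 1]] has determinant -1, so one eigenvalue is negative.
min_eig = np.linalg.eigvalsh(K).min()
print(K, min_eig)
```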

  1. [2 points] The number of parameters needed to specify a Gaussian Mixture Model with 4 clusters, data of dimension 5, and diagonal covariances is:

(a) Between 5 and 15 (b) Between 16 and 26 (c) Between 27 and 37 (d) Between 38 and 48 (e) More than 49

F SOLUTION: D: 4*5 centers, 4*5 covariances, 4-1 cluster probabilities; 3 + 20 + 20 = 43 (the clusters can have different diagonals)

  1. [2 points] The number of parameters needed to specify a Gaussian Mixture Model with 3 clusters, data of dimension 4, and spherical covariances is:

(a) Between 5 and 15 (b) Between 16 and 26 (c) Between 27 and 37 (d) Between 38 and 48 (e) More than 49

F SOLUTION: B: 3*4 centers, 3 covariances, 3-1 cluster probabilities; 12 + 3 + 2 = 17 (the clusters can be different sizes)
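
The counting in the two Gaussian Mixture Model solutions above can be written as a short helper. The function name `gmm_param_count` and its argument names are hypothetical; it only covers the diagonal and spherical cases that the questions ask about.

```python
def gmm_param_count(k, d, covariance):
    """Free parameters of a k-cluster GMM in d dimensions.

    Counts k*d means, k-1 mixing weights (they sum to one), and either
    k*d variances (diagonal) or k variances (spherical), with each cluster
    allowed its own covariance.
    """
    means = k * d
    weights = k - 1
    if covariance == "diagonal":
        cov = k * d
    elif covariance == "spherical":
        cov = k
    else:
        raise ValueError("covariance must be 'diagonal' or 'spherical'")
    return means + cov + weights

diag_count = gmm_param_count(4, 5, "diagonal")        # 20 + 20 + 3
spherical_count = gmm_param_count(3, 4, "spherical")  # 12 + 3 + 2
print(diag_count, spherical_count)
```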

  1. [2 points] Suppose you are given an EM algorithm that finds maximum likelihood estimates for a model with latent variables. You are asked to modify the algorithm so that it finds MAP estimates instead. Which step or steps do you need to modify?

(a) Expectation (b) Maximization (c) No modification is necessary (d) Both

F SOLUTION: B

  1. [1 points] True or False? EM is a search algorithm for finding maximum likelihood (or sometimes MAP) estimates. Thus, it can be replaced by another search algorithm that also maximizes the same likelihood function.

F SOLUTION: True

  1. [1 points] True or False? PCA can be formulated as an optimization problem that finds the set of k basis functions that best represent a set of observations X, in the sense of minimizing the $L_2$ reconstruction error.

F SOLUTION: True

  1. [1 points] True or False? PCA can be formulated as an optimization problem that finds the (orthogonal) directions of maximum covariance of a set of observations X.

F SOLUTION: True

  1. [1 points] True or False? Thin SVD of a rectangular matrix $X = U_1 D_1 V_1^\top$ (where $X$ is $n \times p$ with $n > p$ and $D_1$ is $k \times k$) and that of its transpose $X^\top = U_2 D_2 V_2^\top$ (again with $D_2$ also $k \times k$) always yields $U_1 = V_2$ as long as there are no repeated singular values.

F SOLUTION: True
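
The relationship can be checked numerically with NumPy's `np.linalg.svd`. This is a sketch on a random full-rank matrix (so k = p here), with sizes and seed chosen arbitrarily. Numerical routines fix each singular vector only up to sign, so the comparison below is made on absolute values; that sign handling is an assumption of this sketch, not something stated in the exam.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(6, 3))      # n = 6 > p = 3, assumed sizes

# Thin SVD of X and of its transpose.
U1, d1, V1t = np.linalg.svd(X, full_matrices=False)
U2, d2, V2t = np.linalg.svd(X.T, full_matrices=False)
V2 = V2t.T                       # right singular vectors of X^T as columns

same_values = np.allclose(d1, d2)
# Columns agree up to a sign flip, so compare magnitudes.
same_vectors = np.allclose(np.abs(U1), np.abs(V2), atol=1e-6)
print(same_values, same_vectors)
```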

  1. [1 points] True or False? A $k \times k \times k$ tensor Γ can be viewed as a mapping from three length k vectors ($x_1$, $x_2$, and $x_3$) to a scalar (y) such that the mapping is linear in each of the input vectors. Thus, given a bunch of observations of the form (y, $x_1$, $x_2$, $x_3$) the elements of Γ can be estimated using linear regression.

F SOLUTION: True
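
To see why linear regression suffices: y is linear in the entries of Γ once the products of the input coordinates are used as features. The sketch below assumes noiseless synthetic data with k = 2; all names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
k, n_obs = 2, 200                              # assumed sizes
gamma_true = rng.normal(size=(k, k, k))        # tensor to be recovered

x1, x2, x3 = (rng.normal(size=(n_obs, k)) for _ in range(3))

# y_n = sum_{a,b,c} Gamma[a,b,c] * x1[n,a] * x2[n,b] * x3[n,c]
y = np.einsum("abc,na,nb,nc->n", gamma_true, x1, x2, x3)

# Features: all k^3 products x1[a] * x2[b] * x3[c], flattened in the same
# (a, b, c) order as Gamma, so y = features @ Gamma.ravel().
features = np.einsum("na,nb,nc->nabc", x1, x2, x3).reshape(n_obs, k ** 3)

gamma_hat, *_ = np.linalg.lstsq(features, y, rcond=None)
gamma_hat = gamma_hat.reshape(k, k, k)

recovered = np.allclose(gamma_hat, gamma_true)
print(recovered)
```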

  1. [1 points] True or False? Deep Neural Networks are usually semi-parametric models.

F SOLUTION: True

  1. [1 points] True or False? Deep neural networks are almost always supervised learning methods.

F SOLUTION: False

  1. [1 points] True or False? Random Forests is a generative model.

F SOLUTION: False

  1. [1 points] True or False? Random Forests tend to work well for problems like image classification.

F SOLUTION: True

  1. [2 points] Radial basis functions ____ the dimensionality of the feature space

(a) only increase (b) only decrease (c) can either increase, decrease, or not change (d) don’t change

F SOLUTION: C

  1. [1 points] True or False? Latent Dirichlet Allocation is a latent variable model that can be estimated using EM-style optimization.

F SOLUTION: True

The following questions refer to the following figure.

[Figure: directed graphical model over the nodes A, B, C, D, E, F, G, H, I, J, K, L; edges not shown.]

  1. [1 points] True or False? ¬(A⊥B|H)

F SOLUTION: True

  1. [1 points] True or False? (C⊥I|E, L, G)

F SOLUTION: False

  1. [1 points] True or False? (A⊥L|J, K, L)

F SOLUTION: True

  1. [1 points] True or False? (B⊥K|H)

F SOLUTION: False

  1. [1 points] True or False? (G⊥H|A, B, C)

F SOLUTION: False

  1. [1 points] True or False? J d-separates F and L

F SOLUTION: True

  1. [2 points] In EM models, which of the following equations do you use to derive the parameters θ?

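(a) $\log P(D \mid \theta) = \sum_i \log \sum_z q(Z_i = z \mid x_i) \frac{p_\theta(Z_i = z, x_i)}{q(Z_i = z \mid x_i)}$

(b) $F(q, \theta) = \sum_i \left( H(q(Z_i \mid x_i)) + \sum_z q(Z_i = z \mid x_i) \log p_\theta(Z_i = z, x_i) \right)$

(c) $F(q, \theta) = \sum_i \left( -KL(q(Z_i \mid x_i) \,\|\, p_\theta(Z_i \mid x_i)) + \log p_\theta(x_i) \right)$

(d) $\log P(D \mid \theta) = \sum_i \sum_z q(Z_i = z \mid x_i) \log \frac{p_\theta(Z_i = z, x_i)}{q(Z_i = z \mid x_i)}$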

F SOLUTION: B is the best answer (the first term drops out), but I gave everyone full credit on this just to be generous

For the next question, consider the Bayes Net below with parameter values labeled. This is an instance of an HMM. (Similar to homework 8)

[Figure: HMM with hidden states $X_1 \to X_2$ and observations $O_1$, $O_2$, where each $O_i$ depends on $X_i$.]

$P(X_1 = a) = 0.4$

$P(X_2 = a \mid X_1 = a) = 0.7$, $P(X_2 = a \mid X_1 = b) = 0.2$

$P(O_i = 0 \mid X_i = a) = 0.6$, $P(O_i = 0 \mid X_i = b) = 0.3$

  1. [3 points] Suppose you have the observation sequence $O_1 = 1$, $O_2 = 0$. What is the prediction of Viterbi Decoding? (Maximize $P(X_1, X_2 \mid O_1 = 1, O_2 = 0)$)

(a) X 1 = a, X 2 = a (b) X 1 = a, X 2 = b (c) X 1 = b, X 2 = a (d) X 1 = b, X 2 = b

F SOLUTION: D
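
The decoding can be reproduced with a short Viterbi sketch. The probabilities are those given in the question, with the complementary values (for example $P(X_1 = b) = 0.6$) filled in so that each distribution sums to one; the dictionary layout and the function name `viterbi` are illustrative choices.

```python
states = ("a", "b")
prior = {"a": 0.4, "b": 0.6}
transition = {"a": {"a": 0.7, "b": 0.3},   # transition[prev][next]
              "b": {"a": 0.2, "b": 0.8}}
emission = {"a": {0: 0.6, 1: 0.4},         # emission[state][observation]
            "b": {0: 0.3, 1: 0.7}}

def viterbi(observations):
    """Return (joint probability, state path) of the most likely hidden sequence."""
    # best[s] = (probability of the best path ending in state s, that path)
    best = {s: (prior[s] * emission[s][observations[0]], (s,)) for s in states}
    for obs in observations[1:]:
        new_best = {}
        for s in states:
            # Best predecessor for state s at this step.
            prob, path = max(
                (best[r][0] * transition[r][s], best[r][1]) for r in states
            )
            new_best[s] = (prob * emission[s][obs], path + (s,))
        best = new_best
    return max(best.values())

prob, path = viterbi([1, 0])
print(path, prob)
```

The four joint probabilities are 0.0672 for (a, a), 0.0144 for (a, b), 0.0504 for (b, a), and 0.1008 for (b, b), so the maximizing sequence is $X_1 = b$, $X_2 = b$, matching answer (d).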