Machine Learning Midterm Exam: 10-601, Exams of Machine Learning

Carnegie Mellon University (CMU)Machine Learning

A midterm exam for the Machine Learning course at a university. The exam consists of 9 questions worth a total of 100 points and covers topics such as GMM, decision trees, HMM, Markov decision process, SVM, boosting, and model selection. The exam is open book and open notes but no electronic devices are allowed. questions and solutions related to short answers, GMM-gamma mixture model, and HMM.

Typology: Exams

2021/2022

Uploaded on 05/11/2023

kaijiang 🇺🇸


10-601 Machine Learning, Midterm Exam

Instructors: Tom Mitchell, Ziv Bar-Joseph

Wednesday 12th December, 2012

There are 9 questions, for a total of 100 points.

This exam has 20 pages, make sure you have all pages before you begin.

This exam is open book, open notes, but no computers or other electronic devices.

This exam is challenging, but don’t worry because we will grade on a curve. Work efficiently.

Good luck!

Name:

Andrew ID:

Question Points Score

Short Answers 11

GMM - Gamma Mixture Model 10

Decision trees and Hierarchical clustering 8

D-separation 9

HMM 12

Markov Decision Process 12

SVM 12

Boosting 14

Model Selection 12

Total: 100

Question 1. Short Answers

(a) [3 points] For data D and hypothesis H, say whether or not the following equations must always be true.

$\sum_h P(H = h \mid D = d) = 1$ ... is this always true?

Solution: yes

$\sum_h P(D = d \mid H = h) = 1$ ... is this always true?

Solution: no

$\sum_h P(D = d \mid H = h)\,P(H = h) = 1$ ... is this always true?

Solution: no

(b) [2 points] For the following equations, describe the relationship between them. Write one of four answers: (1) “=” (2) “≤” (3) “≥” (4) “(depends)” Choose the most specific relation that always holds; “(depends)” is the least specific. Assume all probabilities are non-zero.

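$P(H = h \mid D = d)$ ___ $P(H = h)$

$P(H = h \mid D = d)$ ___ $P(D = d \mid H = h)\,P(H = h)$

Solution: $P(H \mid D)$ (depends) $P(H)$. $P(H \mid D) \geq P(D \mid H)\,P(H)$: the right-hand side is the numerator in Bayes rule, which still has to be divided by the normalizer $P(D)$, and $P(D)$ is at most 1. Tricky... $P(H \mid D) = P(D \mid H)\,P(H)/P(D) \geq P(D \mid H)\,P(H)$, with strict inequality whenever $P(D) < 1$.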
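The relations in (a) and (b) can be checked on a small numeric example. The sketch below uses a hypothetical two-hypothesis model; the names `prior` and `likelihood` and all the numbers are illustrative assumptions, not part of the exam.

```python
# Hypothetical two-hypothesis example; the numbers are illustrative only.
prior = {"h1": 0.3, "h2": 0.7}          # P(H = h)
likelihood = {"h1": 0.9, "h2": 0.2}     # P(D = d | H = h) for one fixed d

# Normalizer P(D = d) = sum_h P(D = d | H = h) P(H = h)
evidence = sum(likelihood[h] * prior[h] for h in prior)

# Posterior P(H = h | D = d) by Bayes rule
posterior = {h: likelihood[h] * prior[h] / evidence for h in prior}

for h in prior:
    numerator = likelihood[h] * prior[h]
    print(h, "posterior:", round(posterior[h], 4),
          "prior:", prior[h], "numerator:", round(numerator, 4))

# Sums over h: only the posterior is guaranteed to sum to 1.
print(sum(posterior.values()), sum(likelihood.values()), evidence)
```

In this example the posterior is larger than the prior for `h1` and smaller for `h2`, which is why the first relation is "(depends)", while the posterior is never smaller than the Bayes-rule numerator.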

(c) [2 points] Suppose you are training Gaussian Naive Bayes (GNB) on the training set shown below. The dataset satisfies Gaussian Naive Bayes assumptions. Assume that the variance is independent of instances but dependent on classes, i.e. $\sigma_{ik} = \sigma_k$, where $i$ indexes instances $X^{(i)}$ and $k \in \{1, 2\}$ indexes classes. Draw the decision boundaries when you train GNB

a. using the same variance for both classes, $\sigma_1 = \sigma_2$
b. using separate variance for each class, $\sigma_1 \neq \sigma_2$

(d) [2 points] Assume that we have two possible conditional distributions (P (y = 1|x, w)) obtained by training a logistic regression on the dataset shown in the figure below:

In the first case, the value of P (y = 1|x, w) is equal to 1/3 for all the data points. In the second case, P (y = 1|x, w) is equal to zero for x = 1 and is equal to 1 for all other data points. One of these conditional distributions is obtained by finding the maximum likelihood of the parameter w. Which one is the MLE solution? Justify your answer in at most three sentences.

Solution: The MLE solution is the first case where the value of P (y = 1|x, w) is equal to 1/3 for all the data points.

(e) [2 points] Principal component analysis is a dimensionality reduction method that projects a dataset into its most variable components. You are given the following 2D datasets; draw the first and second principal components on each plot.

[Figure: 2D scatter plots with axes Dimension 1 and Dimension 2]

Solution:

ii. [1 point] As you increase K, you will always get better likelihood of the data.

Solution: (All or none. 1 pt iff you get the answer and the explanation correct) false. Won’t improve after K > N

Question 3. Decision trees and Hierarchical clustering

Assume we are trying to learn a decision tree. Our input data consists of N samples, each with k attributes (N ≫ k). We define the depth of a tree as the maximum number of nodes between the root and any of the leaf nodes (including the leaf, not the root).

(a) [2 points] If all attributes are binary, what is the maximal number of leaf (decision) nodes that we can have in a decision tree for this data? What is the maximal possible depth of a decision tree for this data?

Solution: $2^{(k-1)}$. Each feature can only be used once in each path from root to leaf. The maximum depth is O(k).

(b) [2 points] If all attributes are continuous, what is the maximum number of leaf nodes that we can have in a decision tree for this data? What is the maximal possible depth for a decision tree for this data?

Solution: Continuous values can be used multiple times, so the maximum number of leaf nodes can be the same as the number of samples, N and the maximal depth can also be N.

(c) [2 points] When using single link what is the maximal possible depth of a hierarchical clustering tree for the data in 1? What is the maximal possible depth of such a hierarchical clustering tree for the data in 2?

Solution: When using single link with binary data, we can obtain cases where we are always growing the cluster by 1 node at a time leading to a tree of depth N. This is also clearly the case for continuous values.

(d) [2 points] Would your answers to (3) change if we were using complete link instead of single link? If so, would it change for both types of data? Briefly explain.

Solution: The answer for continuous values remains the same (it's easy to design a dataset where each new sample is farther from any of the previous samples). For binary data, if k is small compared to N, we will not be able to continue to add one node at a time to the initial cluster, and so the depth will be lower than N.

Question 5. HMM

The figure above presents two HMMs. States are represented by circles and transitions by edges. In both, emissions are deterministic and listed inside the states.

Transition probabilities and starting probabilities are listed next to the relevant edges. For example, in HMM 1 we have a probability of 0.5 to start with the state that emits A and a probability of 0.5 to transition to the state that emits B if we are now in the state that emits A.

In the questions below, $O_{100} = A$ means that the 100th symbol emitted by the HMM is A.

(a) [3 points] What is $P(O_{100} = A, O_{101} = A, O_{102} = A)$ for HMM1?

Solution: Note that $P(O_{100}=A, O_{101}=A, O_{102}=A) = P(O_{100}=A, O_{101}=A, O_{102}=A, S_{100}=A, S_{101}=A, S_{102}=A)$, since if we are not always in state A we will not be able to emit A. Given the Markov property this can be written as:

$P(O_{100}=A, O_{101}=A, O_{102}=A, S_{100}=A, S_{101}=A, S_{102}=A) = P(O_{100}=A \mid S_{100}=A)\,P(S_{100}=A)\,P(O_{101}=A \mid S_{101}=A)\,P(S_{101}=A \mid S_{100}=A)\,P(O_{102}=A \mid S_{102}=A)\,P(S_{102}=A \mid S_{101}=A)$

The emission probabilities in the above equation are all 1. The transitions are all 0.5. So the only question is: what is $P(S_{100}=A)$? Since the model is fully symmetric, the answer to this is 0.5, and so the total equation evaluates to $0.5^3$.
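The factorization used for HMM 1 can be sketched in a few lines. This is a minimal illustration that relies only on what the text states about HMM 1 (deterministic emissions, every transition 0.5, and a state marginal of 0.5 at step 100 by symmetry); the function and variable names are hypothetical.

```python
# HMM 1 as described in the text: two states that emit A and B
# deterministically, marginal probability 0.5 each, every transition 0.5.
MARGINAL = {"A": 0.5, "B": 0.5}
TRANS = {("A", "A"): 0.5, ("A", "B"): 0.5,
         ("B", "A"): 0.5, ("B", "B"): 0.5}

def sequence_probability(symbols, marginal, trans):
    """Probability of an emitted sequence when emissions are deterministic,
    so the hidden state sequence equals the symbol sequence."""
    prob = marginal[symbols[0]]
    for prev, cur in zip(symbols, symbols[1:]):
        prob *= trans[(prev, cur)]
    return prob

p_aaa = sequence_probability("AAA", MARGINAL, TRANS)
p_abab = sequence_probability("ABAB", MARGINAL, TRANS)
print(p_aaa, p_abab)
```

For HMM 1 every length-3 sequence gets probability 0.5^3 and every length-4 sequence gets 0.5^4, because no transition is favored over another.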

(b) [3 points] What is P (O 100 = A, O 101 = A, O 102 = A) for HMM2?

Solution: $0.5 \times 0.8^2$

(c) [3 points] Let $P_1$ be: $P_1 = P(O_{100} = A, O_{101} = B, O_{102} = A, O_{103} = B)$ for HMM1 and let $P_2$ be: $P_2 = P(O_{100} = A, O_{101} = B, O_{102} = A, O_{103} = B)$ for HMM2. Choose the correct answer from the choices below and briefly explain.

(a) $P_1 > P_2$
(b) $P_2 > P_1$
(c) $P_1 = P_2$
(d) Impossible to tell the relationship between the two probabilities

Solution: (a). $P_1$ evaluates to $0.5^4$ while $P_2$ is $0.5 \times 0.2^4$, so clearly $P_1 > P_2$.

(d) [3 points] Assume you are told that a casino has been using one of the two HMMs to generate streams of letters. You are also told that among the first 1000 letters emitted, 500 are As and 500 are Bs. Which of the following answers is the most likely (briefly explain):

(a) The casino has been using HMM 1
(b) The casino has been using HMM 2
(c) Impossible to tell

Solution: (c). While we saw in the previous question that switching between A and B is much less likely in HMM2, this only matters if we switch at every step. When aggregating over 1000 steps, since the two HMMs are both symmetric, both are likely to generate the same number of As and Bs.

Solution: Yes, for example by multiplying all the rewards by two.

(e) [3 points] One of the important problems in MDPs is to decide what the value of the discount factor should be. For now assume that we don't know the value of the discount factor, but an expert tells us that the action sequence {fast, slow, slow} is preferred to the action sequence {slow, fast, fast} if we start from either of the states cool or warm. What does this tell us about the discount factor? What range of discount factors is consistent with this preference?

Solution: The discounted sum of future rewards using discount factor λ is calculated by $r + r\lambda + r\lambda^2 + \dots$. So by solving the inequality below, we can find a range for the discount factor λ: $10 + 4\lambda + 4\lambda^2 > 4 + 10\lambda + 10\lambda^2$
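The inequality in the solution can be solved and sanity-checked numerically. The sketch below assumes nothing beyond the inequality itself; the helper name `preferred_gap` is hypothetical. Moving all terms to one side gives 6 − 6λ − 6λ² > 0, i.e. λ² + λ − 1 < 0, whose positive root is (√5 − 1)/2.

```python
import math

def preferred_gap(lam):
    """Left side minus right side of 10 + 4λ + 4λ² > 4 + 10λ + 10λ²."""
    return (10 + 4 * lam + 4 * lam ** 2) - (4 + 10 * lam + 10 * lam ** 2)

# The inequality simplifies to λ² + λ - 1 < 0; its positive root bounds λ.
upper_bound = (-1 + math.sqrt(5)) / 2
print(round(upper_bound, 4))

# Spot checks on either side of the bound.
print(preferred_gap(0.5) > 0, preferred_gap(0.7) > 0)
```

Under this reading, the stated preference is consistent with discount factors in the range 0 ≤ λ < (√5 − 1)/2 ≈ 0.618.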

Question 7. SVM

(a) Kernels

i. [4 points] In class we learnt that SVM can be used to classify linearly inseparable data by transforming it to a higher dimensional space with a kernel $K(x, z) = \phi(x)^T \phi(z)$, where $\phi(x)$ is a feature mapping. Let $K_1$ and $K_2$ be $\mathbb{R}^n \times \mathbb{R}^n$ kernels, $K_3$ be a $\mathbb{R}^d \times \mathbb{R}^d$ kernel and $c \in \mathbb{R}^+$ be a positive constant. $\phi_1 : \mathbb{R}^n \to \mathbb{R}^d$, $\phi_2 : \mathbb{R}^n \to \mathbb{R}^d$, and $\phi_3 : \mathbb{R}^d \to \mathbb{R}^d$ are feature mappings of $K_1$, $K_2$ and $K_3$ respectively. Explain how to use $\phi_1$ and $\phi_2$ to obtain the following kernels.

a. $K(x, z) = c K_1(x, z)$

b. $K(x, z) = K_1(x, z) K_2(x, z)$

Solution:
a. $\phi(x) = \sqrt{c}\,\phi_1(x)$
b. $\phi(x) = \phi_1(x)\phi_2(x)$
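Both constructions can be verified numerically. The sketch below is an illustration under assumptions: the feature maps `phi1` and `phi2` are hypothetical maps on R^2 chosen only for the example, and the product in part b is read as the flattened outer product (all pairwise products of the components of the two feature vectors), which is one standard way to realize a product of kernels.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Hypothetical feature maps on R^2, chosen only for illustration.
def phi1(x):
    return [x[0], x[1], x[0] * x[1]]

def phi2(x):
    return [1.0, x[0] ** 2, x[1] ** 2]

def k1(x, z):
    return dot(phi1(x), phi1(z))

def k2(x, z):
    return dot(phi2(x), phi2(z))

c = 3.0

def phi_scaled(x):
    # Feature map for c * K1: scale phi1 by sqrt(c).
    return [math.sqrt(c) * a for a in phi1(x)]

def phi_product(x):
    # Feature map for K1 * K2: all pairwise products (flattened outer product).
    return [a * b for a in phi1(x) for b in phi2(x)]

x, z = [1.0, 2.0], [0.5, -1.5]
print(dot(phi_scaled(x), phi_scaled(z)), c * k1(x, z))
print(dot(phi_product(x), phi_product(z)), k1(x, z) * k2(x, z))
```

Each printed pair should agree up to floating-point rounding, since the inner product of the pairwise-product features factors into the product of the two inner products.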

ii. [2 points] One of the most commonly used kernels in SVM is the Gaussian RBF kernel: $k(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$. Suppose we have three points, $z_1$, $z_2$, and $x$. $z_1$ is geometrically very close to $x$, and $z_2$ is geometrically far away from $x$. What is the value of $k(z_1, x)$ and $k(z_2, x)$? Choose one of the following:

a. $k(z_1, x)$ will be close to 1 and $k(z_2, x)$ will be close to 0.
b. $k(z_1, x)$ will be close to 0 and $k(z_2, x)$ will be close to 1.
c. $k(z_1, x)$ will be close to $c_1$, $c_1 \gg 1$, and $k(z_2, x)$ will be close to $c_2$, $c_2 \ll 0$, where $c_1, c_2 \in \mathbb{R}$
d. $k(z_1, x)$ will be close to $c_1$, $c_1 \ll 0$, and $k(z_2, x)$ will be close to $c_2$, $c_2 \gg 1$, where $c_1, c_2 \in \mathbb{R}$

Solution: Correct answer is a. The RBF kernel generates a "bump" around the center $x$. For points $z_1$ close to the center of the bump, $K(z_1, x)$ will be close to 1; for points away from the center of the bump, $K(z_2, x)$ will be close to 0.
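The "bump" behavior is easy to see numerically. The sketch below assumes the standard Gaussian RBF form exp(−‖u − v‖² / (2σ²)) with σ = 1; the points `x`, `z1`, and `z2` are hypothetical values picked to be near and far.

```python
import math

def rbf_kernel(u, v, sigma=1.0):
    """Gaussian RBF kernel exp(-||u - v||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2 * sigma ** 2))

x = [0.0, 0.0]
z1 = [0.1, 0.0]    # close to x
z2 = [10.0, 0.0]   # far from x
print(rbf_kernel(z1, x), rbf_kernel(z2, x))
```

The kernel value is always in (0, 1], equals 1 only when the two points coincide, and decays toward 0 as the distance grows, which rules out options c and d.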

iii. [3 points] You are given the following 3 plots, which illustrates a dataset with two classes. Draw the decision boundary when you train an SVM classifier with linear, polynomial (order

and RBF kernels respectively. Classes have equal number of instances.

Solution:

(b) [3 points] Hard Margin SVM

[Figure 2: four training points; legend: Class −, Class +]

Support vector machines learn a decision boundary leading to the largest margin from both classes. You are training SVM on a tiny dataset with 4 points shown in Figure 2. This dataset consists of two examples with class label -1 (denoted with plus), and two examples with class label +1 (denoted with triangles).

i. Find the weight vector w and bias b. What's the equation corresponding to the decision boundary?

Solution: SVM tries to maximize the margin between two classes. Therefore, the optimal decision boundary is diagonal and it crosses the point (3,4). It is perpendicular to the line between support vectors (4,5) and (2,3), hence its slope is m = -1. Thus the line equation is $(x_2 - 4) = -1(x_1 - 3)$, i.e. $x_1 + x_2 = 7$. From this equation, we can deduce that the weight vector has to be of the form $(w_1, w_2)$, where $w_1 = w_2$. It also has to satisfy the following equations: $2w_1 + 3w_2 + b = 1$ and $4w_1 + 5w_2 + b = -1$

Hence $w_1 = w_2 = -1/2$ and $b = 7/2$
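The two margin equations in the solution can be solved mechanically. The sketch below takes the support vectors (2,3) and (4,5), with the labels implied by those equations, and uses the constraint w1 = w2; the variable names are hypothetical.

```python
# Support vectors and labels as used in the solution's two equations.
support_vectors = [((2.0, 3.0), 1.0), ((4.0, 5.0), -1.0)]

# With w1 = w2 = w, each constraint w*(x1 + x2) + b = y is linear in (w, b).
(xa, ya), (xb, yb) = support_vectors
sa, sb = sum(xa), sum(xb)            # x1 + x2 for each support vector
w = (ya - yb) / (sa - sb)            # subtract the two equations
b = ya - w * sa                      # back-substitute

def decision_value(point):
    return w * point[0] + w * point[1] + b

print(w, b, decision_value((3.0, 4.0)))
```

The decision value is 0 at the midpoint (3,4), so the boundary w1·x1 + w2·x2 + b = 0 is the same line as x1 + x2 = 7.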

ii. Circle the support vectors and draw the decision boundary.

Solution: See the solution above

Solution: $\epsilon_t = \frac{1}{3}$, $\alpha_t = \frac{1}{2}\ln(2) = 0.3465$. For data points that are classified correctly, $D_2(i) = \frac{(1/3)\exp(-0.3465)}{Z_2} \approx 0.25$, and for the data point that is classified incorrectly, $D_2(i) = \frac{(1/3)\exp(0.3465)}{Z_2} \approx 0.5$, where $Z_2$ is the normalization factor.
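The weight update in this solution follows the usual AdaBoost recipe and can be reproduced in a few lines. The sketch below assumes three training points with uniform initial weights and exactly one misclassified point, which matches the weighted error of 1/3; the original dataset is not shown here, so this setup is an assumption.

```python
import math

# Three training points with uniform initial weights; one is misclassified,
# matching the weighted error of 1/3 in the solution.
weights = [1 / 3, 1 / 3, 1 / 3]
correct = [True, True, False]

epsilon = sum(w for w, ok in zip(weights, correct) if not ok)
alpha = 0.5 * math.log((1 - epsilon) / epsilon)

# Down-weight correct points, up-weight mistakes, then normalize by Z.
unnormalized = [w * math.exp(-alpha if ok else alpha)
                for w, ok in zip(weights, correct)]
z = sum(unnormalized)
new_weights = [u / z for u in unnormalized]
print(round(alpha, 4), [round(w, 4) for w in new_weights])
```

After normalization the misclassified point carries about half of the total weight (roughly 0.5 versus 0.25 for each correct point), so the next weak learner is pushed to get it right.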

(e) [3 points] Can boosting algorithm perfectly classify all the training examples? If no, briefly explain why. If yes, what is the minimum number of iteration?

Solution: No, since the data is not linearly separable.

(f) [1 point] True/False The training error of boosting classifier (combination of all the weak classifier) monotonically decreases as the number of iterations in the boosting algorithm increases. Justify your answer in at most two sentences.

Solution: False. Boosting minimizes the loss function $\sum_{i=1}^{m} \exp(-y_i f(x_i))$, which doesn't necessarily mean that the training error monotonically decreases. Please look at slides 14-18: http://www.cs.cmu.edu/~tom/10601_fall2012/slides/boosting.pdf.

Question 9. Model Selection

(a) [2 points] Consider learning a classifier in a situation with 1000 features total. 50 of them are truly informative about class. Another 50 features are direct copies of the first 50 features. The final 900 features are not informative. Assume there is enough data to reliably assess how useful features are, and the feature selection methods are using good thresholds.

How many features will be selected by mutual information filtering?

Solution: about 100

How many features will be selected by a wrapper method?

Solution: about 50

(b) Consider k-fold cross-validation. Let's consider the tradeoffs of larger or smaller k (the number of folds). For each, please select one of the multiple choice options.

i. [2 points] With a higher number of folds, the estimated error will be, on average,

(a) Higher.
(b) Lower.
(c) Same.
(d) Can’t tell.

Solution: Lower (because more training data)
