Midterm Exam in Machine Learning, Exams of Probability and Statistics

A midterm exam in Machine Learning from Carnegie Mellon University. The exam consists of 5 questions with a total score of 100. The topics covered include Short questions, MLE/MAP, Bayes Nets, EM, and Regression. The exam is open book and open notes, and no computers or internet access is allowed. The exam duration is 80 minutes. The document also includes personal information and instructions for the exam.

Typology: Exams

2010/2011

Uploaded on 05/11/2023

10-601 Machine Learning
Midterm Exam
Fall 2011
Tom Mitchell, Aarti Singh
Carnegie Mellon University
1. Personal information:
Name:
Andrew account:
E-mail address:
2. There should be 11 numbered pages in this exam.
3. This exam is open book, open notes. No computers or internet access is allowed.
4. You do not need a calculator.
5. If you need more room to answer a question, use the back of the page and clearly mark on the front of the page if we are to look at the back.
6. Work efficiently. Answer the easier questions first.
7. You have 80 minutes.
8. Good luck!
Question  Topic            Max. score  Score
1         Short questions  35
2         MLE/MAP          15
3         Bayes Nets       15
4         EM               15
5         Regression       20
Total                      100


1 Short Questions [35 pts]

Answer True/False in the following 8 questions. Explain your reasoning in 1 sentence.

  1. [3 pts] Suppose you are given a dataset of cellular images from patients with and without cancer. If you are required to train a classifier that predicts the probability that the patient has cancer, you would prefer to use Decision trees over logistic regression.

SOLUTION: FALSE. Decision trees only provide a label estimate, whereas logistic regression provides the probability of a label (patient has cancer) for a given input (cellular image).

  1. [3 pts] Suppose the dataset in the previous question had 900 cancer-free images and 100 images from cancer patients. If I train a classifier which achieves 85% accuracy on this dataset, it is a good classifier.

SOLUTION: FALSE. This is not a good accuracy on this dataset, since a classifier that outputs "cancer-free" for all input images will have better accuracy (90%).
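
The 90% figure is the accuracy of always predicting the majority class. A minimal Python sketch of that baseline, using only the class counts stated in the question (the function name is illustrative, not part of the exam):

```python
# Class counts taken from the question: 900 cancer-free, 100 cancer images.
n_negative = 900
n_positive = 100

def majority_baseline_accuracy(n_neg: int, n_pos: int) -> float:
    """Accuracy of a classifier that always predicts the larger class."""
    return max(n_neg, n_pos) / (n_neg + n_pos)

baseline = majority_baseline_accuracy(n_negative, n_positive)
print(f"always-'cancer-free' baseline accuracy: {baseline:.2f}")
print(f"an 85% classifier beats the baseline: {0.85 > baseline}")
```

Comparing against this baseline is a quick sanity check whenever the classes are imbalanced.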

  1. [3 pts] A classifier that attains 100% accuracy on the training set and 70% accuracy on test set is better than a classifier that attains 70% accuracy on the training set and 75% accuracy on test set.

SOLUTION: FALSE. The second classifier has better test accuracy, which reflects the true accuracy, whereas the first classifier is overfitting.

  1. [3 pts] A football coach whispers a play number n to two players A and B independently. Due to crowd noise, each player imperfectly and independently draws a conclusion about what the play number was. A thinks he heard the number nA, and B thinks he heard nB. True or false: nA and nB are marginally dependent but conditionally independent given the true play number n.

SOLUTION: TRUE. Knowing the value of nA tells us something about nB, so P(nA|nB) ≠ P(nA) and they are marginally dependent; but given n, nA and nB are determined independently. This also follows from the Bayes net in which n is the only parent of both nA and nB.

  1. [3 pts] A, B and C are three Boolean random variables. The following equality holds without any assumptions on the joint distribution P(A, B, C):

P(A|B) = P(A|B, C = 0)P(C = 0) + P(A|B, C = 1)P(C = 1).

SOLUTION: FALSE. Since C is a Boolean random variable, we have

P(A|B) = P(A, C = 0|B) + P(A, C = 1|B) = P(A|B, C = 0)P(C = 0|B) + P(A|B, C = 1)P(C = 1|B)

where the last step follows from the definition of conditional probability. The weights are P(C = c|B), not P(C = c), so the stated equality needs an additional assumption, such as C being independent of B.
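
The weighting of each term in this sum can be checked numerically. The sketch below builds a hypothetical joint distribution over three Boolean variables (all probabilities are made up for illustration) and compares P(A=1|B=1) with the sum weighted by P(C|B) and the sum weighted by P(C):

```python
from itertools import product

# Hypothetical parameters, chosen only so that C depends strongly on B.
p_b1 = 0.5
p_c1_given_b = {0: 0.1, 1: 0.9}
p_a1_given_c = {0: 0.2, 1: 0.8}

def bern(p_one: float, value: int) -> float:
    """Probability that a Boolean variable with P(1) = p_one takes `value`."""
    return p_one if value == 1 else 1.0 - p_one

# Joint table P(A, B, C) = P(B) P(C|B) P(A|C).
joint = {
    (a, b, c): bern(p_b1, b) * bern(p_c1_given_b[b], c) * bern(p_a1_given_c[c], a)
    for a, b, c in product((0, 1), repeat=3)
}

def prob(a=None, b=None, c=None) -> float:
    """Marginal probability of the given partial assignment."""
    return sum(
        p for (ja, jb, jc), p in joint.items()
        if (a is None or ja == a) and (b is None or jb == b) and (c is None or jc == c)
    )

def a_given_bc(c: int) -> float:
    return prob(a=1, b=1, c=c) / prob(b=1, c=c)

lhs = prob(a=1, b=1) / prob(b=1)  # P(A=1 | B=1)
weighted_by_c_given_b = sum(a_given_bc(c) * prob(b=1, c=c) / prob(b=1) for c in (0, 1))
weighted_by_c = sum(a_given_bc(c) * prob(c=c) for c in (0, 1))

print(f"P(A=1|B=1)               = {lhs:.2f}")
print(f"sum weighted by P(C|B=1) = {weighted_by_c_given_b:.2f}")
print(f"sum weighted by P(C)     = {weighted_by_c:.2f}")
```

On this example the sum weighted by P(C|B) reproduces P(A=1|B=1), while the sum weighted by P(C) does not.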

The following three short questions are not True/False questions. Please provide explanations for your answers.

  1. [3 pts] The Bayes net below implies that A is conditionally independent of B given C (A ⊥⊥ B|C). Prove this, based on its factorization of the joint distribution, and on the definition of conditional independence.

SOLUTION: By the Bayes net factorization of the joint distribution, P(A, B, C) = P(C)P(A|C)P(B|C),

and by the definition of conditional probability, P(A, B, C) = P(C)P(A, B|C).

Equating the two and dividing by P(C) gives P(A, B|C) = P(A|C)P(B|C), which is the definition of conditional independence, i.e. A is conditionally independent of B given C (A ⊥⊥ B|C).

  1. [3 pts] Which of the following classifiers can perfectly classify the following data:

(a) Decision Tree (b) Logistic Regression (c) Gaussian Naive Bayes

SOLUTION: Decision Tree only. A decision tree of depth 2 that first splits on X1 and then on X2 will perfectly classify it. Logistic regression leads to linear decision boundaries, hence cannot classify this data perfectly. Because of the conditional independence assumption, each class is modeled by a Gaussian with no covariance between features, and no such Gaussian can peak at the points of only one class, so Gaussian Naive Bayes cannot classify this data perfectly.

2 MLE/MAP Estimation [15 pts]

In this question you will estimate the probability of a coin landing heads using MLE and MAP estimates.

Suppose you have a coin whose probability of landing heads is p = 0.5, that is, it is a fair coin. However, you do not know p and would like to form an estimator θ̂ for the probability of landing heads p. In class, we derived an estimator that assumed p can take on any value in the interval [0, 1]. In this question, you will derive an estimator that assumes p can take on only two possible values: 0.3 or 0.6.

Note: $P_{\hat{\theta}}[\text{heads}] = \hat{\theta}$.

Hint: All the calculations involved here are simple. You do not require a calculator.

  1. [5 pts] You flip the coin 3 times and note that it landed 2 times on tails and 1 time on heads. Find the maximum likelihood estimate θ̂ of p over the set of possible values {0.3, 0.6}.

Solution:

$$\hat{\theta} = \arg\max_{\theta \in \{0.3, 0.6\}} P_\theta[D] = \arg\max_{\theta \in \{0.3, 0.6\}} P_\theta[\text{heads}] \, P_\theta[\text{tails}]^2 = \arg\max_{\theta \in \{0.3, 0.6\}} \theta (1 - \theta)^2$$

We observe that

$$P_{\theta = 0.3}[D] = 0.3 \cdot 0.7^2 = 0.147 > P_{\theta = 0.6}[D] = 0.6 \cdot 0.4^2 = 0.096,$$

which implies that θ̂ = 0.3.
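
The comparison above can be reproduced in a few lines of Python. This is only a sketch of the two-candidate search; the variable and function names are illustrative:

```python
# Candidate values for p and the observed data (1 head, 2 tails), as in the question.
candidates = (0.3, 0.6)
n_heads, n_tails = 1, 2

def likelihood(theta: float, heads: int, tails: int) -> float:
    """Probability of one particular sequence with the given head/tail counts."""
    return theta ** heads * (1.0 - theta) ** tails

likelihoods = {theta: likelihood(theta, n_heads, n_tails) for theta in candidates}
theta_mle = max(likelihoods, key=likelihoods.get)

for theta, value in likelihoods.items():
    print(f"theta = {theta}: likelihood = {value:.3f}")
print("MLE:", theta_mle)
```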

  2. [4 pts] Suppose that you have the following prior on the parameter p:

P[p = 0.3] = 0.3 and P[p = 0.6] = 0.7.

Again, you flip the coin 3 times and note that it landed 2 times on tails and 1 time on heads. Find the MAP estimate θ̂ of p over the set {0.3, 0.6}, using this prior.

Solution:

$$\hat{\theta} = \arg\max_{\theta \in \{0.3, 0.6\}} P_\theta[D] \, P[\theta]$$

We observe that

$$P_{\theta = 0.3}[D] \, P[\theta = 0.3] = 0.147 \cdot 0.3 = 0.0441 < P_{\theta = 0.6}[D] \, P[\theta = 0.6] = 0.096 \cdot 0.7 = 0.0672,$$

which implies that θ̂_MAP = 0.6.
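
As a quick check of the MAP computation, the sketch below multiplies each candidate's likelihood by its prior and picks the larger product. The names are illustrative, and the posterior is left unnormalized because the normalizer does not affect the argmax:

```python
# Prior from the question and the same data as before (1 head, 2 tails).
prior = {0.3: 0.3, 0.6: 0.7}
n_heads, n_tails = 1, 2

def likelihood(theta: float, heads: int, tails: int) -> float:
    return theta ** heads * (1.0 - theta) ** tails

# Unnormalized posterior: likelihood times prior for each candidate.
unnormalized_posterior = {
    theta: likelihood(theta, n_heads, n_tails) * p_theta
    for theta, p_theta in prior.items()
}
theta_map = max(unnormalized_posterior, key=unnormalized_posterior.get)

for theta, value in unnormalized_posterior.items():
    print(f"theta = {theta}: likelihood * prior = {value:.4f}")
print("MAP:", theta_map)
```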

  3. [3 pts] Suppose that the number of times you flip the coin tends to infinity. What would be the maximum likelihood estimate θ̂ of p over the set {0.3, 0.6} in that case? Justify your answer. Solution: With the number of flips N tending to infinity, the proportion of heads tends to 0.5, so the likelihood behaves like (θ(1 − θ))^(N/2). Since 0.6 · 0.4 = 0.24 > 0.3 · 0.7 = 0.21, the MLE would be 0.6, the candidate closer to 0.5.
  4. [3 pts] Suppose that the number of times you flip the coin tends to infinity. What would be the MAP estimate θ̂ of p over the set {0.3, 0.6}, using the prior defined in part 2 of this question? Justify your answer. Solution: With the number of flips tending to infinity, the effect of the prior becomes negligible. Therefore, the MAP estimate will be the same as the MLE.
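
The asymptotic argument can be illustrated with log-likelihoods. The sketch below makes the idealizing assumption that exactly half of the N flips are heads, which is not something the exam states for finite N; it shows that the log-likelihood gap between the candidates grows linearly in N while the log-prior term stays fixed:

```python
import math

candidates = (0.3, 0.6)
log_prior = {0.3: math.log(0.3), 0.6: math.log(0.7)}  # prior from part 2

def log_likelihood(theta: float, n_flips: int) -> float:
    """Log-likelihood assuming (idealized) exactly n_flips / 2 heads."""
    heads = n_flips / 2
    tails = n_flips - heads
    return heads * math.log(theta) + tails * math.log(1.0 - theta)

def mle(n_flips: int) -> float:
    return max(candidates, key=lambda t: log_likelihood(t, n_flips))

def map_estimate(n_flips: int) -> float:
    return max(candidates, key=lambda t: log_likelihood(t, n_flips) + log_prior[t])

for n in (10, 100, 1000, 10000):
    gap = log_likelihood(0.6, n) - log_likelihood(0.3, n)
    print(f"N = {n:>5}: MLE = {mle(n)}, MAP = {map_estimate(n)}, log-likelihood gap = {gap:.2f}")
```

Because the gap scales with N, any fixed prior that gives both candidates nonzero mass is eventually outweighed by the data.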

(d) B is independent of C given only A
(e) B is not independent of C given A and D

Solution: Any of the following satisfy the above:

[Figure: candidate Bayes nets over nodes A, B, C, D; the edges are not recoverable from this text.]

  1. [4 pts] Consider the graph drawn below. Assume that each variable can only take on values true and false.

(a) How many parameters are necessary to specify the joint distribution P(A, B, C, D, E, F, G) for this Bayes net? You may answer by writing the number of parameters directly next to each graph node.

Solution: See below for the number of parameters needed for each node. Total is 17.

(b) Please give the minimum number of Bayes net parameters required to fully specify the distribution P(G|A, B, C, D, E, F). Briefly justify your answer.

Solution: Note that the Markov blanket for G consists only of F. Thus, P(G|A, B, C, D, E, F) = P(G|F), and only two parameters are needed to specify this distribution: one for each value of F.

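[Figure: Bayes net over nodes A, B, C, D, E, F, G with the number of parameters written next to each node; the edges and per-node counts are not recoverable from this text.]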
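
The counting rule behind part (a) is that a Boolean node needs one parameter per joint assignment of its parents. The sketch below applies that rule to a small hypothetical graph; it is not the graph from the exam figure, whose edges are not available here:

```python
# Hypothetical Bayes net, NOT the exam's graph: node -> list of parent nodes.
example_parents = {
    "A": [],
    "B": ["A"],
    "C": ["A", "B"],
}

def count_parameters(parents: dict) -> dict:
    """Each Boolean node needs one free parameter per joint assignment of its
    parents, i.e. 2 ** (number of parents)."""
    return {node: 2 ** len(node_parents) for node, node_parents in parents.items()}

per_node = count_parameters(example_parents)
total = sum(per_node.values())
print(per_node, "total:", total)

# A node with a single Boolean parent, like G with parent F in part (b):
print(count_parameters({"F": [], "G": ["F"]})["G"])
```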

  1. [2 pts] Given the graph provided above, please state if the following are true or false.

(a) E is conditionally independent of G given F. Solution: True.

(b) A is conditionally independent of C given B and G. Solution: False.

5 Bias and Variance in Linear Regression [20 pts]

In this question, we will explore bias and variance in linear regression. Assume that a total of N data points of the form (xi, yi) are generated from the following (true) model:

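$$x_i \sim \mathrm{Unif}(0, 1), \quad y_i = f(x_i) + \epsilon_i, \quad \epsilon_i \sim N(0, 1), \quad f(x) = x$$

We assume $x_i \perp \epsilon_j \ \forall i, j$ and $\epsilon_i \perp \epsilon_j \ \forall i \neq j$ (note $a \perp b$ means a and b are independent).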

You may find the following pieces of information useful when solving this problem:

  • bias^2 = $\int_x (E_D[h_D(x)] - f(x))^2 \, p(x) \, dx$
  • variance = $\int_x E_D[(h_D(x) - E_D[h_D(x)])^2] \, p(x) \, dx$
  • $\hat{\mu} \sim N(\mu, \frac{1}{N})$ if $\hat{\mu}$ is the MLE estimator with N data points
  • If $x \sim \mathrm{Unif}(0, 1)$, then $\int_0^1 p(x) \, dx = 1$, and therefore $p(x) = 1$.

We begin by examining the case where we are not aware that y depends on x. Instead, our (incorrect) model is that f (x) has some constant value f (x) = μ, and therefore

$$x_i \sim \mathrm{Unif}(0, 1), \quad y_i \sim N(\mu, 1) \quad \text{with } x_i \perp y_i.$$

We use the MLE estimator for μ. That is, we let $\hat{\mu} = \frac{1}{N} \sum_{i=1}^{N} y_i$. The prediction of our trivial regression model for the value of $y_i$ is $\hat{\mu}$, regardless of the value of $x_i$.

  1. [2 pts] What is the value for $E_D[h_D(x)]$ in this case? Here $E_D$ refers to the expected value over different training data sets of size N, and $h_D(x)$ is the predictor learned from a specific data set D.

SOLUTION: $E_D[h_D(x)] = E_D[\hat{\mu}] = \frac{1}{2}$, since $E[y_i] = E[x_i] + E[\epsilon_i] = \frac{1}{2} + 0$.

  1. [3 pts] What is the bias of this trivial regression model?

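SOLUTION: $\text{Bias}^2 = \int_0^1 \left(\frac{1}{2} - x\right)^2 (1) \, dx = -\frac{1}{3} \left(\frac{1}{2} - x\right)^3 \Big|_0^1 = \frac{1}{12}$

The bias is thus $\frac{1}{2\sqrt{3}}$.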
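
The squared bias of the constant predictor, which works out to 1/12, can be checked numerically. A minimal sketch using a midpoint Riemann sum, with p(x) = 1 on [0, 1] and expected prediction 1/2 as stated in the problem (the function name and step count are illustrative):

```python
import math

def bias_squared(expected_prediction: float = 0.5, n_steps: int = 100_000) -> float:
    """Midpoint-rule approximation of the integral over [0, 1] of
    (E_D[h_D(x)] - f(x))^2 p(x) dx, with f(x) = x and p(x) = 1."""
    dx = 1.0 / n_steps
    total = 0.0
    for k in range(n_steps):
        x = (k + 0.5) * dx
        total += (expected_prediction - x) ** 2 * dx
    return total

b2 = bias_squared()
print(f"bias^2 ~ {b2:.6f} (1/12 = {1 / 12:.6f})")
print(f"bias   ~ {math.sqrt(b2):.6f} (1/(2*sqrt(3)) = {1 / (2 * math.sqrt(3)):.6f})")
```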

  1. [2 pts] What is the variance of this trivial regression model?

SOLUTION: The variance is the variance of the MLE estimator. By the third bullet, this is $\frac{1}{N}$.

  1. [1 pts] What is the unavoidable error in this learning setting?

SOLUTION: The unavoidable error is introduced by the noise $\epsilon_i$; it equals the noise variance, which is 1 by assumption.

  1. [2 pts] How do each of bias, variance, and unavoidable error change as N → ∞?

SOLUTION: The unavoidable error and bias do not change. The variance goes to 0 as N → ∞.

SOLUTION: The unavoidable error is still introduced by $\epsilon_i$, and is 1 by assumption.

  1. [2 pts] In the figure below, draw the two learned regression models if we have an infinite number of data points.

SOLUTION: Model 1 (the trivial model) is the horizontal line at y = 1/2. Model 2 (the linear regression model) is the diagonal line y = x.

[Figure: empty plot with x on the horizontal axis (ticks from 0.0 to 1.0) and y on the vertical axis.]
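
The two limiting lines can be approximated by fitting both models to a large simulated sample from the true model. This is a rough sketch: the sample size and random seed are arbitrary choices, and a finite sample only approximates the infinite-data limit:

```python
import random

random.seed(0)        # arbitrary seed, for reproducibility only
n_points = 200_000    # large, as a stand-in for "infinite" data

# True model from the problem: x ~ Unif(0, 1), y = x + eps, eps ~ N(0, 1).
xs = [random.random() for _ in range(n_points)]
ys = [x + random.gauss(0.0, 1.0) for x in xs]

# Model 1: constant predictor, the sample mean of y.
mu_hat = sum(ys) / n_points

# Model 2: ordinary least squares line y = intercept + slope * x.
x_mean = sum(xs) / n_points
covariance = sum((x - x_mean) * (y - mu_hat) for x, y in zip(xs, ys))
x_variance = sum((x - x_mean) ** 2 for x in xs)
slope = covariance / x_variance
intercept = mu_hat - slope * x_mean

print(f"Model 1: y = {mu_hat:.3f}")
print(f"Model 2: y = {intercept:.3f} + {slope:.3f} x")
```

With this many points, the constant fit should land near 1/2 and the fitted line near y = x.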