
A midterm exam in Machine Learning from Carnegie Mellon University. The exam consists of 5 questions with a total score of 100 points, covering Short questions, MLE/MAP, Bayes Nets, EM, and Regression. The exam is open book and open notes; no computers or internet access is allowed, and the duration is 80 minutes. The document also includes fields for personal information and instructions for the exam.

Typology: Exams

2010/2011


10-601 Machine Learning
Midterm Exam, Fall 2011
Tom Mitchell, Aarti Singh
Carnegie Mellon University

1. Personal information:
   • Name:
   • Andrew account:
   • E-mail address:
2. There should be 11 numbered pages in this exam.
3. This exam is open book, open notes. No computers or internet access is allowed.
4. You do not need a calculator.
5. If you need more room to answer a question, use the back of the page and clearly mark on the front of the page if we are to look at the back.
6. Work efficiently. Answer the easier questions first.
7. You have 80 minutes.
8. Good luck!

Question  Topic            Max. score
1         Short questions  35
2         MLE/MAP          15
3         Bayes Nets       15
4         EM               15
5         Regression       20
          Total            100

1 Short Questions [35 pts]

Answer True/False in the following 8 questions. Explain your reasoning in 1 sentence.

1. [3 pts] Suppose you are given a dataset of cellular images from patients with and without cancer. If you are required to train a classifier that predicts the probability that the patient has cancer, you would prefer to use decision trees over logistic regression.

SOLUTION: FALSE. Decision trees only provide a label estimate, whereas logistic regression provides the probability of a label (patient has cancer) for a given input (cellular image).

2. [3 pts] Suppose the dataset in the previous question had 900 cancer-free images and 100 images from cancer patients. If I train a classifier which achieves 85% accuracy on this dataset, it is a good classifier.

SOLUTION: FALSE. This is not a good accuracy on this dataset, since a classifier that outputs "cancer-free" for all input images will have better accuracy (90%).

3. [3 pts] A classifier that attains 100% accuracy on the training set and 70% accuracy on the test set is better than a classifier that attains 70% accuracy on the training set and 75% accuracy on the test set.

SOLUTION: FALSE.
The second classifier has better test accuracy, which reflects the true accuracy; the first classifier is overfitting.

4. [3 pts] A football coach whispers a play number n to two players A and B independently. Due to crowd noise, each player imperfectly and independently draws a conclusion about what the play number was. A thinks he heard the number nA, and B thinks he heard nB. True or false: nA and nB are marginally dependent but conditionally independent given the true play number n.

SOLUTION: TRUE. Knowledge of the value of nA tells us something about nB, so P(nA | nB) ≠ P(nA); hence they are marginally dependent. But given n, nA and nB are determined independently. This also follows from the Bayes net in which n is the parent of both nA and nB.

10. [3 pts] Which of the following classifiers can perfectly classify the following data: (a) Decision Tree (b) Logistic Regression (c) Gaussian Naive Bayes

SOLUTION: Decision Tree only. A decision tree of depth 2 which first splits on X1 and then on X2 will perfectly classify it. Logistic regression leads to linear decision boundaries, hence cannot classify this data perfectly. Due to the conditional independence requirement, it is not possible to fit a Gaussian that peaks at the labels of only one class and has no covariance between features, so Gaussian Naive Bayes cannot classify this data perfectly.

11. [5 pts] Boolean random variables A and B have the joint distribution specified in the table below.

A  B  P(A, B)
0  0  0.32
0  1  0.48
1  0  0.08
1  1  0.12

Given the above table, please compute the following five quantities:

SOLUTION:
P(A = 0) = P(A = 0, B = 0) + P(A = 0, B = 1) = 0.32 + 0.48 = 0.8
P(A = 1) = 1 − P(A = 0) = 0.2
P(B = 1) = P(B = 1, A = 0) + P(B = 1, A = 1) = 0.48 + 0.12 = 0.6
P(B = 0) = 1 − P(B = 1) = 0.4
P(A = 1 | B = 0) = P(A = 1, B = 0) / P(B = 0) = 0.08 / 0.4 = 0.2

Are A and B independent? Justify your answer.

SOLUTION: YES.
Using the calculations above:
P(A = 0) P(B = 0) = 0.8 × 0.4 = 0.32 = P(A = 0, B = 0)
P(A = 0) P(B = 1) = 0.8 × 0.6 = 0.48 = P(A = 0, B = 1)
P(A = 1) P(B = 0) = 0.2 × 0.4 = 0.08 = P(A = 1, B = 0)
P(A = 1) P(B = 1) = 0.2 × 0.6 = 0.12 = P(A = 1, B = 1)

2 MLE/MAP Estimation [15 pts]

In this question you will estimate the probability of a coin landing heads using MLE and MAP estimates. Suppose you have a coin whose probability of landing heads is p = 0.5, that is, it is a fair coin. However, you do not know p and would like to form an estimator θ̂ for the probability of landing heads p. In class, we derived an estimator that assumed p can take on any value in the interval [0, 1]. In this question, you will derive an estimator that assumes p can take on only two possible values: 0.3 or 0.6. Note: P_θ̂[heads] = θ̂.

Hint: All the calculations involved here are simple. You do not require a calculator.

1. [5 pts] You flip the coin 3 times and note that it landed 2 times on tails and 1 time on heads. Find the maximum likelihood estimate θ̂ of p over the set of possible values {0.3, 0.6}.

Solution:
θ̂ = argmax_{θ ∈ {0.3, 0.6}} P_θ[D]
  = argmax_{θ ∈ {0.3, 0.6}} P_θ[heads] P_θ[tails]²
  = argmax_{θ ∈ {0.3, 0.6}} θ(1 − θ)²
We observe that
P_{θ=0.3}[D] / P_{θ=0.6}[D] = (0.3 × 0.7²) / (0.6 × 0.4²) = 0.49 / 0.32 > 1,
which implies that θ̂ = 0.3.

2. [4 pts] Suppose that you have the following prior on the parameter p: P[p = 0.3] = 0.3 and P[p = 0.6] = 0.7. Again, you flip the coin 3 times and note that it landed 2 times on tails and 1 time on heads. Find the MAP estimate θ̂ of p over the set {0.3, 0.6}, using this prior.

Solution:
θ̂ = argmax_{θ ∈ {0.3, 0.6}} P_θ[D] P[θ]
We observe that
(P_{θ=0.3}[D] P[θ = 0.3]) / (P_{θ=0.6}[D] P[θ = 0.6]) = (0.3 × 0.7² × 0.3) / (0.6 × 0.4² × 0.7) = 0.21 / 0.32 < 1,
which implies that θ̂_MAP = 0.6.
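The two ratio computations above are easy to check numerically. The Python sketch below (not part of the original exam; the names are illustrative) scores both candidate values of p on the observed data (1 head, 2 tails), first by likelihood alone and then weighted by the prior from question 2:

```python
# Numeric check of the MLE and MAP derivations over the candidate set {0.3, 0.6}.
candidates = [0.3, 0.6]
prior = {0.3: 0.3, 0.6: 0.7}  # prior from question 2: P[p=0.3]=0.3, P[p=0.6]=0.7

def likelihood(theta, heads=1, tails=2):
    # P(D | theta) for independent coin flips: theta^heads * (1 - theta)^tails
    return theta**heads * (1 - theta)**tails

mle = max(candidates, key=likelihood)
map_est = max(candidates, key=lambda t: likelihood(t) * prior[t])
print(mle)      # 0.3
print(map_est)  # 0.6
```

Maximizing the likelihood alone picks 0.3 (0.147 vs. 0.096), while weighting by the prior tips the posterior score to 0.6 (0.0441 vs. 0.0672), matching the derivation above.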
(d) B is independent of C given only A
(e) B is not independent of C given A and D

Solution: Several networks over A, B, C, D satisfy the above constraints. [Figure: example networks, not reproduced here.]

3. [4 pts] Consider the graph drawn below. Assume that each variable can only take on values true and false.

(a) How many parameters are necessary to specify the joint distribution P(A, B, C, D, E, F, G) for this Bayes net? You may answer by writing the number of parameters directly next to each graph node.

Solution: The number of parameters needed per node (1 each for A, B, C; 4 each for D, E, F; 2 for G) gives a total of 17.

(b) Please give the minimum number of Bayes net parameters required to fully specify the distribution P(G | A, B, C, D, E, F). Briefly justify your answer.

Solution: Note that the Markov blanket for G consists only of F. Thus P(G | A, B, C, D, E, F) = P(G | F), and only two parameters are needed to specify this distribution.

4. [2 pts] Given the graph provided above, please state if the following are true or false.

(a) E is conditionally independent of G given F. Solution: True.
(b) A is conditionally independent of C given B and G. Solution: False.

4 EM [15 pts]

In this question you will apply EM to train the following simple Bayes net (X1 the parent of X2), using the following data set, for which X2 is unobserved in training example 4.

Example  X1  X2
1        0   1
2        0   0
3        1   0
4        1   ?
5        0   1

The EM process has run for several iterations. At this point the parameter estimates are:
θ̂_{X1=1} = P̂(X1 = 1) = 0.4
θ̂_{X2=1|X1=1} = P̂(X2 = 1|X1 = 1) = 0.4
θ̂_{X2=1|X1=0} = P̂(X2 = 1|X1 = 0) = 0.66

1. [2 pts] What is calculated in the next E step?
Answer: The expected value of X2 for example 4: P(X2 = 1 | X1 = 1; θ).

2. [5 pts] What precisely is the result of the next E step? Show your work.
P̂(X2 = 1 | X1 = 1) = θ̂_{X2=1|X1=1} = 0.4

3. [3 pts] What is calculated in the next M step?
New estimates for θ̂_{X1=1} and θ̂_{X2=1|X1=0} (which do not change), and θ̂_{X2=1|X1=1}.

4. [5 pts] What precisely is the result of the next M step? Show your work.
θ̂_{X1=1} = 2/5 = 0.4
θ̂_{X2=1|X1=0} = 2/3 ≈ 0.66
θ̂_{X2=1|X1=1} = 0.4/2 = 0.2

Now assume we notice that y in fact depends on x. Therefore, we change to a linear regression model (with zero intercept), which assumes the data are generated as follows:
x_i ~ Unif(0, 1),  y_i = f(x_i) + ε_i,  ε_i ~ N(0, 1),  f(x) = ax
We also assume (as in the true model) that x_i ⊥ ε_j for all i, j, and ε_i ⊥ ε_j for all i ≠ j.

6. [3 pts] We choose our estimator â for a to minimize the sum of squared errors. That is, we choose â such that
â = argmin_a (1/2) Σ_{i=1}^{N} (y_i − a x_i)²
Derive the closed-form expression for â. Once we have chosen the value of â, we now have a regression model that predicts y_i = â x_i.

SOLUTION: Let f(a) = (1/2) Σ_{i=1}^{N} (y_i − a x_i)². Then
∂f/∂a = Σ_{i=1}^{N} −x_i (y_i − a x_i).
Setting the derivative to 0, we obtain:
Σ_{i=1}^{N} −x_i (y_i − a x_i) = 0  ⟹  Σ_{i=1}^{N} x_i y_i = a Σ_{i=1}^{N} x_i²  ⟹  â = (Σ_{i=1}^{N} x_i y_i) / (Σ_{i=1}^{N} x_i²).

7. [2 pts] What is the bias of this linear regression model?
SOLUTION: The bias of the regression model is 0.

8. [2 pts] As N → ∞, what is the variance of this linear regression model?
SOLUTION: The variance of the linear regression model goes to 0 as N → ∞.

9. [1 pt] What is the unavoidable error in this learning setting?
SOLUTION: The unavoidable error is still introduced by ε_i, and is 1 by assumption.

10. [2 pts] In the figure below, draw the two learned regression models if we have an infinite number of data points.
SOLUTION: Model 1 (the trivial model) is the horizontal line; Model 2 (the linear regression model) is the diagonal line. [Figure: the two models plotted on axes x, y ∈ [0, 1].]
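The closed-form estimator from question 6 is simple to implement. The Python sketch below (not from the exam; the data values are made up for illustration) computes â = Σ x_i y_i / Σ x_i² and checks that on noise-free data generated with a = 2 the estimator recovers the true slope:

```python
# Closed-form least-squares slope for the zero-intercept model y = a*x,
# following a-hat = sum(x_i * y_i) / sum(x_i^2) from question 6.
def fit_slope(xs, ys):
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

# Illustrative data: with no noise (eps_i = 0) and true slope a = 2,
# the estimator recovers a exactly.
xs = [0.1, 0.4, 0.5, 0.9]
ys = [2 * x for x in xs]
print(fit_slope(xs, ys))  # 2.0
```

With noisy y_i the estimate fluctuates around the true a, but as discussed in questions 7 and 8 it is unbiased and its variance shrinks as N grows.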