Machine Learning (CS 5350/CS 6350)                                Due 08 Mar 2007

HW3: Unsupervised Learning

Note: Please submit the written aspects of assignments in PostScript or PDF format only. I highly recommend you use LaTeX to prepare the assignments. A solution will be posted here after the due date. See http://www.cs.utah.edu/classes/cs5350/handin.html for hand-in instructions.

1 PRML Exercises

Complete the following exercises from PRML: 9.14, 9.15. Bonus: 12.27.

2 Written Exercises

1. What is the relationship between k-means and EM for Gaussian mixture models? (≈ 100 words)

2. Explain EM intuitively. (≈ 100 words)

3. What are the primary pros and cons of PCA? How does it compare to factor analysis and to information-gain-based feature selection? (≈ 100 words)

3 Programming

There are, again, two parts to the programming assignment: the first is clustering, the second is dimensionality reduction. The second part is quite short, but it requires that you succeed at the first. CS 6350 students have to do a bit more work in the first part.

Clustering

Here, we implement EM for Gaussian mixture models. I have provided a Matlab shell for this as well as data. You need to implement: (1) initialization; (2) the E-step; (3) the M-step; and (4) the computation of the complete and incomplete data log-likelihoods. You should run the algorithm with full covariance matrices.

In the process of building the Gaussian mixture models, you will plot, by iteration, the complete data log-likelihood and the incomplete data log-likelihood. A good way to debug your code (debugging is especially difficult for unsupervised learning algorithms) is to make sure that the incomplete data log-likelihood increases monotonically.
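To make the four required pieces concrete, here is a hedged, minimal sketch of the same EM loop in pure Python, for a one-dimensional mixture with scalar variances rather than the full-covariance Matlab version the assignment actually requires. All function and variable names here are my own illustration, not names from the provided shell.

```python
import math
import random

def gauss(x, m, v):
    """Density of the univariate normal N(m, v) evaluated at x."""
    return math.exp(-(x - m) ** 2 / (2.0 * v)) / math.sqrt(2.0 * math.pi * v)

def em_gmm_1d(xs, k, iters=50, seed=0):
    """EM for a 1-D Gaussian mixture; returns (weights, means, variances, ll_trace)."""
    rng = random.Random(seed)
    n = len(xs)
    mean = sum(xs) / n
    # (1) Initialization: uniform mixing weights, means sampled from the data,
    #     all variances set to the overall data variance.
    pi = [1.0 / k] * k
    mu = rng.sample(xs, k)
    var = [sum((x - mean) ** 2 for x in xs) / n] * k
    ll_trace = []
    for _ in range(iters):
        # (2) E-step: responsibilities r[i][j] = p(z_i = j | x_i), and the
        #     incomplete-data log-likelihood sum_i log sum_j pi_j N(x_i | mu_j, var_j).
        r, ll = [], 0.0
        for x in xs:
            dens = [pi[j] * gauss(x, mu[j], var[j]) for j in range(k)]
            s = sum(dens)
            ll += math.log(s)
            r.append([d / s for d in dens])
        ll_trace.append(ll)
        # (3) M-step: re-estimate weights, means, and variances from the
        #     responsibilities; the variance floor guards against collapse.
        for j in range(k):
            nj = sum(r[i][j] for i in range(n))
            pi[j] = nj / n
            mu[j] = sum(r[i][j] * xs[i] for i in range(n)) / nj
            var[j] = max(sum(r[i][j] * (xs[i] - mu[j]) ** 2 for i in range(n)) / nj, 1e-6)
    return pi, mu, var, ll_trace
```

Note that `ll_trace` records the incomplete-data log-likelihood at each iteration; checking that it never decreases is exactly the debugging test described above.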
There are various plotting commands strewn throughout the code to plot the complete/incomplete log-likelihoods, so you should be able to debug that way.

Once you have the GMM algorithm running, we will try it with different values of k ∈ {2, 3, 4, 5, 6, 7, 8, 9, 10}. For each of these, you should run the GMM with 10 different initializations and choose as your final clustering the one of these 10 with the highest complete log-likelihood. Once you have these complete log-likelihoods, plot them as a function of k. Which value of k would you choose based on these plots?

For this part of the assignment, you should hand in: your code; the plot of the incomplete/complete log-likelihood for the best run for each k; the plot of k vs. the complete log-likelihood; and an answer to the "choose k" question.
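The best-of-10 restart selection described above is a small wrapper around your trainer. As a sketch, the following Python treats the trainer as a black box; `fit` and `toy_fit` are hypothetical stand-ins of my own invention, not part of the provided Matlab shell:

```python
def best_of_restarts(fit, data, k, restarts=10):
    """Run `fit` with seeds 0..restarts-1 and keep the highest-scoring run.

    `fit(data, k, seed)` should return (model, score); in the assignment the
    score would be the complete-data log-likelihood of the trained mixture.
    """
    best_model, best_score = None, float("-inf")
    for seed in range(restarts):
        model, score = fit(data, k, seed)
        if score > best_score:
            best_model, best_score = model, score
    return best_model, best_score

def toy_fit(data, k, seed):
    # Hypothetical stand-in: a deterministic, seed-dependent pretend
    # log-likelihood, so the selection loop can run without a real GMM.
    return {"k": k, "seed": seed}, -float(abs(hash((k, seed))) % 1000)

# Sweep k as in the assignment and record the best score per value of k;
# these are the numbers you would plot against k.
best_scores = {k: best_of_restarts(toy_fit, None, k)[1] for k in range(2, 11)}
```

With a real trainer substituted for `toy_fit`, plotting `best_scores` against k gives exactly the k-vs-complete-log-likelihood plot requested above.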