Advanced Data Analysis Exam 2: Mystery Multivariate Data, Exercises of Advanced Data Analysis

Carnegie Mellon University (CMU), Advanced Data Analysis

Instructions and questions for a take-home exam in Advanced Data Analysis at Carnegie Mellon University. The exam covers initial exploration of data, joint distributions, principal components, factor analysis, and factor model selection using R. Students must submit both written responses and R code.

Typology: Exercises

2010/2011

Uploaded on 11/03/2011


Exam 2: Mystery Multivariate Data

36-402, Advanced Data Analysis

Due at 5 pm on Tuesday, 12 April 2011

Please read the background section, and all of the questions, carefully before beginning to work.

You will be sent a data set (CSV format) by e-mail to your Andrew account. Each data set is slightly different. Work only with your own. The origin of your data has been suppressed, so all columns are merely named “X.1” through “X.10”, and the rows are just named from 1 to 1000. If you have not received a data set, or cannot open it, or it has the wrong format, contact Prof. Shalizi by 9 am on Wednesday, 6 April. If you do not do so, the presumption will be that you have received and can read your data.

You must turn in both a written response to the questions, and all of your supporting R code.

Turn in a hard-copy of the write-up to Prof. Shalizi, either in his office (Baker Hall 229C) or in his mailbox in the statistics department (Baker Hall 232). Include a signed copy of the last page of this exam as a cover sheet.

Turn in your code by uploading a plain text file to Blackboard. Name the file andrewID-2.R, where of course andrewID is your actual Andrew username. Make sure the file is in plain text format, so that it can be loaded into R and run; files in other formats will not be graded. Please submit your code only once; if you submit multiple versions, there is no guarantee that the version we grade will be the latest one. (Appeals to change grades on that basis will be denied.) Please do not submit your write-up electronically.

All work must be submitted by 5 pm on Tuesday. If you have not been able to finish the exam by that point, please turn in whatever you have done, for partial credit. Late exams will get no credit.

Many questions ask you to explain, describe or comment. Since communication is an essential part of data analysis, you will be graded on your writing. Be clear, be concise, and use your own words.

Note that the answers to some questions are very different for different data sets.


Initial exploration (15 points)

(a) (4 points) Check whether the marginal distribution for each variable is Gaussian; both graphical and quantitative tests are acceptable.
(b) (5 points) Explain what the test you used in problem 1a does, and why it works.
(c) (3 points) Explain a different procedure that you could have used.
(d) (3 points) Explain why making a scatter-plot of variable values against row numbers would not let you check whether a distribution is Gaussian.
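As a rough illustration of part (a), the sketch below pairs one quantitative test (Shapiro-Wilk) with one graphical check (normal Q-Q plots). It is only one of several acceptable approaches. The data frame is simulated as a stand-in, since the real data set is sent to each student separately; the seed, the values, and the deliberately skewed column X.3 are assumptions made for the example.

```r
# Simulated stand-in for the exam data: 1000 rows, columns X.1 ... X.10.
# These values are made up; the real data set is distributed per student.
set.seed(402)
n <- 1000
x <- as.data.frame(matrix(rnorm(n * 10), nrow = n))
names(x) <- paste0("X.", 1:10)
x$X.3 <- rexp(n)  # one deliberately non-Gaussian column, for contrast

# Quantitative check: Shapiro-Wilk p-value for each marginal distribution.
sw.p <- sapply(x, function(v) shapiro.test(v)$p.value)
print(signif(sw.p, 3))

# Graphical check: a normal Q-Q plot for each marginal distribution.
old.par <- par(mfrow = c(2, 5), mar = c(4, 4, 2, 1))
for (name in names(x)) {
  qqnorm(x[[name]], main = name)
  qqline(x[[name]], col = "red")
}
par(old.par)
```

With real data, replace the simulated frame with the result of read.csv() on your own file.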

A joint distribution (15 points)

(a) (4 points) Using npudens from the np package, or any similar function, make a kernel density estimate of the joint distribution of X. and X.10. Plot it; contour, color or perspective plots are all acceptable.
(b) (5 points) Fit a two-dimensional Gaussian to the same data, and plot it in the same way.
(c) (1 point) Plot the difference between the non-parametric and parametric density estimates. Comment.
(d) (5 points) Could you use the same procedure as in Problem 1 to check that the joint distribution is Gaussian? If so, explain how to modify it and why it still works. If not, explain why it cannot be adjusted, and describe a different procedure which you could use. Extra credit (10 points): implement your test and report your results.
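Parts (a) to (c) can be sketched as follows. This is a sketch under stated assumptions, not the expected solution: it swaps in kde2d from the MASS package as the "similar function" rather than npudens, and the two variables u and v are simulated placeholders for a pair of columns from your own data.

```r
library(MASS)  # kde2d(); used here in place of np::npudens

# Made-up, correlated pair standing in for two columns of the exam data.
set.seed(402)
n <- 1000
u <- rnorm(n)
v <- 0.6 * u + rnorm(n, sd = 0.8)

# Non-parametric estimate: kernel density on a 50 x 50 grid.
kde <- kde2d(u, v, n = 50)

# Parametric estimate: bivariate Gaussian with the sample mean and
# covariance, evaluated on the same grid points.
mu <- c(mean(u), mean(v))
Sigma <- cov(cbind(u, v))
grid <- as.matrix(expand.grid(u = kde$x, v = kde$y))
md2 <- mahalanobis(grid, center = mu, cov = Sigma)  # squared distances
gauss <- matrix(exp(-md2 / 2) / (2 * pi * sqrt(det(Sigma))),
                nrow = length(kde$x))

# The two estimates and their difference, plotted the same way.
contour(kde$x, kde$y, kde$z, xlab = "u", ylab = "v", main = "Kernel estimate")
contour(kde$x, kde$y, gauss, xlab = "u", ylab = "v", main = "Gaussian fit")
contour(kde$x, kde$y, kde$z - gauss, xlab = "u", ylab = "v",
        main = "Kernel estimate minus Gaussian fit")
```

Evaluating both densities on one shared grid is what makes the difference plot meaningful; persp() or image() could replace contour() without changing the rest.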

Principal components (15 points)

(a) (10 points) Find the first five principal components. Make a plot of these components. The horizontal axis should run over the integers from 1 to 10, the vertical axis should run from −1 to +1, and the points should indicate the projection of the components on to the corresponding observable variables. Put all five components in the same plot, using color or line style to distinguish them. (Alternately, use a three-dimensional plot.) Make sure the results come through clearly when printed. Comment on any patterns you see in the components. Grading will reflect how much visual clarity you can give this plot.
(b) (5 points) Plot the amount of variance retained by the first q components vs. q, for up to 10 components.
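One possible layout for the two plots is sketched below using prcomp from base R. The data are simulated from a made-up two-factor structure, and standardizing the variables (scale. = TRUE) is a choice made for the example, not something the exam prescribes.

```r
# Made-up data with two latent directions plus noise; a stand-in for the
# per-student exam data.
set.seed(402)
n <- 1000
f <- matrix(rnorm(n * 2), n, 2)
w <- matrix(runif(2 * 10, -1, 1), 2, 10)
x <- f %*% w + matrix(rnorm(n * 10, sd = 0.5), n, 10)
colnames(x) <- paste0("X.", 1:10)

# Principal components; scale. = TRUE standardizes each variable first.
pca <- prcomp(x, scale. = TRUE)
loadings5 <- pca$rotation[, 1:5]  # projections of PC1..PC5 on the variables

# All five components in one plot, distinguished by color, line type,
# and plotting symbol so they stay legible in print.
matplot(1:10, loadings5, type = "b", pch = 1:5, lty = 1:5, col = 1:5,
        ylim = c(-1, 1), xlab = "Variable index", ylab = "Projection")
legend("topright", legend = paste0("PC", 1:5), pch = 1:5, lty = 1:5,
       col = 1:5, ncol = 5, cex = 0.8)

# Share of variance retained by the first q components, q = 1..10.
var.retained <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
plot(1:10, var.retained, type = "b", ylim = c(0, 1),
     xlab = "Number of components q", ylab = "Fraction of variance retained")
```

Because each component is a unit vector, every projection lies between −1 and +1, which is why the vertical axis limits in part (a) are sufficient.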

Factor analysis (15 points)

(a) (6 points) Explain what this function does. What are the inputs? What are the outputs?
(b) (6 points) Explain what each line does.
(c) (4 points) Is the abs necessary in the second line of the while loop? Is it necessary in the next-to-last line of the while loop?
(d) (4 points) Run this function on your data. Describe the results.

Mixtures (Extra Credit, 50 points)

Install the mixtools package from CRAN, and read sections 1, 2, and 6.1 of the paper describing it by Benaglia et al. (http://www.jstatsoft.org/v32/i06).

- (10 points) Using mvnormalmixEM(), fit Gaussian mixture models to your data, varying the number of clusters or mixture components from 2 to 6. Plot the likelihood as a function of the number of clusters. Hint: The default settings for maxit and epsilon will take forever; more reasonable ones here would be maxit=100 and epsilon=1e-1. Explain, in your write-up, what these settings mean. Fitting the model with seven clusters might then still take up to an hour on a slow machine.
- (10 points) Describe how the boot.comp function works.
- (10 points) Use boot.comp to select the number of components to use. Hint: You will want to use fewer than the default number of bootstrap replicates, and also pass along the fitting arguments. Even so, this may take a very long time.
- (10 points) Describe a way to decide whether to use this selected mixture model, or the factor model you selected earlier, and explain why this comparison should be reliable.
- (10 points) Implement your comparison. Which model is favored?
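The first bullet might be approached along the lines of the sketch below. It assumes the mixtools package is installed and skips the fit otherwise. The two-cluster data are simulated, and the range of k is cut to 2 through 4 only to keep the example quick; neither choice comes from the exam.

```r
# Runs only if mixtools is available; install.packages("mixtools") otherwise.
if (requireNamespace("mixtools", quietly = TRUE)) {
  # Made-up two-dimensional data with two well-separated clusters.
  set.seed(402)
  x <- rbind(matrix(rnorm(400, mean = 0), ncol = 2),
             matrix(rnorm(400, mean = 4), ncol = 2))

  # Fit one Gaussian mixture per candidate number of components, using
  # the looser stopping settings suggested in the hint.
  ks <- 2:4
  loglik <- sapply(ks, function(k) {
    fit <- mixtools::mvnormalmixEM(x, k = k, maxit = 100, epsilon = 1e-1)
    fit$loglik  # log-likelihood at the final EM iterate
  })

  plot(ks, loglik, type = "b", xlab = "Number of mixture components",
       ylab = "Log-likelihood")
}
```

EM results depend on the random starting values, so repeated runs can give somewhat different log-likelihoods; fixing the seed makes a given run reproducible.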

36-402, Advanced Data Analysis, Spring 2011

Examination 2

I have read the university policy on cheating and plagiarism (http://www.cmu.edu/policies/documents/Cheating.html). I have completed this take-home examination honestly, without giving or receiving prohibited assistance to anyone.

Signed:

Name:
