Analyzing Duration-Waiting Time Relationship in 'Old Faithful' Geyser Dataset, Exercises of Advanced Data Analysis

A homework assignment for a data analysis course (36-402, Spring 2011). Students analyze the "Old Faithful" geyser dataset, which has two columns: duration (the length of the latest eruption) and waiting (the interval from the end of one eruption to the start of the next). The tasks are linear regression, plotting squared residuals, estimating the variance as a function of duration, weighted least squares regression, nonparametric kernel regression, and estimating the conditional density of waiting given duration. The goal is to understand the relationship between duration and waiting time and to judge whether the data are compatible with homoskedastic noise.

Typology: Exercises

2010/2011

Uploaded on 11/03/2011

Homework Assignment 3: Old Heteroskedastic
36-402, Data Analysis, Spring 2011
Due at the start of class, 1 February 2011
The data set geyser in the library MASS contains a series of consecutive
observations on the “Old Faithful” geyser at Yellowstone National Park, famed for
the approximate regularity of its eruptions. There are two columns: duration,
the length of the latest eruption of the geyser (in minutes), and waiting, the
interval from the end of one eruption to the start of the next (also in minutes).
Begin by obtaining the library (if you don’t have it already) and loading the
data set. You should be able to reproduce this:
> summary(geyser)
    waiting          duration
 Min.   : 43.00   Min.   :0.8333
 1st Qu.: 59.00   1st Qu.:2.0000
 Median : 76.00   Median :4.0000
 Mean   : 72.31   Mean   :3.4608
 3rd Qu.: 83.00   3rd Qu.:4.3833
 Max.   :108.00   Max.   :5.4500
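
The loading step described above can be sketched as follows. This is a minimal example, assuming a standard R installation; MASS normally ships with R, so the commented install line is only a fallback.

```r
# install.packages("MASS")  # only needed if MASS is not already installed
library(MASS)               # provides the geyser data frame

data(geyser)                # load the data set into the workspace
str(geyser)                 # two numeric columns: waiting and duration
summary(geyser)             # should match the summary shown above
```

If the printed summary does not match, a likely cause is a different data set with a similar name (for example `faithful` in base R) being used instead of `MASS::geyser`.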
1. (5 points) Linearly regress waiting on duration. Plot the data points,
together with the regression line. Comment (briefly) on any noteworthy
features of the plot.
2. (5 points) Plot the squared residuals from the linear regression versus
duration. Comment.
3. (20 points) Using the method presented in the notes to the lecture for
25 January, estimate the variance as a function of duration. Add it to
the plot from the previous problem. Does the estimated variance function
seem compatible with homoskedastic noise?
4. (10 points) Re-do the linear regression with weighted least squares,
making the weights inversely proportional to the estimated variance function.
What happens to the linear regression coefficients? Are the changes
statistically significant? Does it seem like they matter?
5. (5 points) Do a nonparametric kernel regression of waiting on duration.
Plot the results along with the raw data. Comment on how the results
differ from the linear regression.
6. (20 points) Repeat the variance estimation using the residuals from the kernel regression. Compare this estimated variance function to the previous one.
7. (25 points) Use npcdens to estimate the conditional density of waiting given duration. Plot the results. (Three-dimensional, contour and level plots are all acceptable. Ask the instructor if you have another idea. You may find the examples in help(npplot) useful.)
8. (10 points) Describe how the plot of the conditional density relates to the plots you made in problems 5 and 6.
9. (10 points, extra credit) Suppose the Park Service wanted to provide tourists with estimates of the time until the next eruption of the geyser, including a margin of error. What model would you recommend they use? (Explain.)
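
As a starting point for the first four problems, the mechanics might look like the sketch below. It is not the method from the 25 January lecture notes, which are not reproduced here: the variance step assumes one common approach (smoothing the log of the squared residuals with `loess` and exponentiating), and the names `gey`, `ols`, `var_fit`, `var_hat`, and `wls` are illustrative only.

```r
library(MASS)   # geyser data frame: columns waiting and duration

# Problem 1: ordinary least squares of waiting on duration
ols <- lm(waiting ~ duration, data = geyser)
plot(waiting ~ duration, data = geyser,
     xlab = "Duration (minutes)", ylab = "Waiting (minutes)")
abline(ols, col = "blue")            # add the fitted regression line

# Problem 2: squared residuals against duration
gey <- transform(geyser, sq_res = residuals(ols)^2)
plot(sq_res ~ duration, data = gey,
     xlab = "Duration (minutes)", ylab = "Squared residuals")

# Problem 3 (ASSUMED method, not necessarily the one from the lecture notes):
# smooth the log squared residuals, then exponentiate so the estimated
# variance stays positive everywhere
var_fit <- loess(log(sq_res) ~ duration, data = gey)
var_hat <- exp(fitted(var_fit))
ord <- order(gey$duration)           # sort by duration so the curve draws cleanly
lines(gey$duration[ord], var_hat[ord], col = "red")
abline(h = mean(gey$sq_res), lty = 2)  # constant-variance reference level

# Problem 4: weights inversely proportional to the estimated variance
wls <- lm(waiting ~ duration, data = gey, weights = 1 / var_hat)
print(rbind(OLS = coef(ols), WLS = coef(wls)))  # compare the coefficients
```

Comparing the red curve with the dashed horizontal line gives a visual check on homoskedasticity, and `summary(ols)` and `summary(wls)` report the standard errors needed to judge whether the coefficient changes matter. The interpretation is left to the reader.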