Statistical Analysis of Geyser Data: Heteroskedasticity and Kernel Regression, Exercises of Advanced Data Analysis

Solutions to Homework Assignment 3 for 36-402, Advanced Data Analysis, Spring 2011. The assignment checks the geyser data for heteroskedasticity and uses kernel regression to estimate both the conditional variance and the waiting times. It includes plots, regression models, and interpretations of the results.


Homework Assignment 3: Old Heteroskedastic
36-402, Advanced Data Analysis, Spring 2011
SOLUTIONS
# Setup
library(MASS)
data(geyser)
summary(geyser)
1. Answer:
# Plot the data points
plot(geyser$duration, geyser$waiting, cex = 0.5, pch = 16,
     main = "Waiting time as function of geyser duration",
     xlab = "Duration (minutes)", ylab = "Waiting time (minutes)")
mtext("Black line = LS regression line")
# Build linear model:
lm.1 = lm(waiting ~ duration, data = geyser)
# Add the regression line to the plot
abline(lm.1)
See Figure 1. Values of duration seem to cluster around 2 minutes and
4-5 minutes. The cluster on the right has more variation in waiting times
compared to the cluster on the left.
2. Answer:
# Plot the squared residuals against duration
plot(geyser$duration, lm.1$residuals^2, cex = 0.5, pch = 16,
     main = "Squared residuals versus geyser duration",
     xlab = "Geyser duration", ylab = "Squared residuals from linear model")
See Figure 2. The squared residuals seem to be largest, on average, near
duration = 4.5, and substantially smaller for the cluster of observations
near duration = 2.
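The visual impression from the plot can be checked numerically. The following is a minimal sketch, not part of the original solutions: it assumes that a cutoff of 3 minutes separates the two duration clusters (the cutoff and the names `long`, `sq.res`, and `cluster.means` are illustrative choices) and compares the mean squared residual on each side.

```r
library(MASS)  # ships with R; provides the geyser data
data(geyser)

# Same linear model as in question 1
lm.1 <- lm(waiting ~ duration, data = geyser)

# Assumption: a cutoff of 3 minutes separates the two duration clusters
long <- geyser$duration > 3
sq.res <- residuals(lm.1)^2

# Mean squared residual within each cluster
cluster.means <- tapply(sq.res, long, mean)
names(cluster.means) <- c("duration <= 3", "duration > 3")
print(cluster.means)
```

If the spread really differs between the clusters, the two means should differ noticeably, which is the pattern described for Figure 2.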
3. Answer:
# To estimate the variance, I will use the same kernel regression function
# from HW Assignment 2:

Figure 1: Waiting time as function of geyser duration. Axes: Duration (minutes) vs. Waiting time (minutes). Black line = LS regression line.

lm.4 = lm(waiting ~ duration, data = geyser, weights = 1/reg.var3$mean)
# Find the coefficient for the unweighted linear model
> summary(lm.1)$coefficients #output includes the following:
            Estimate  Std. Error
(Intercept) 99.309856 1.
duration    -7.800326 0.
# Find the coefficient for the weighted linear model
> summary(lm.4)$coefficients #output includes the following:
            Estimate  Std. Error
(Intercept) 98.896429 1.
duration    -7.808154 0.
The difference between the two coefficients for duration is only about
0.008, while the corresponding standard errors are at least 50 times as
large, indicating that the difference in slopes is not statistically
significant. Even if this difference were statistically significant, it
is too small to matter much, at least from the perspective of a casual
tourist who wants to know when the next eruption will occur.
The difference between intercepts is quite a bit larger, at about 0.4, but
again the standard errors are at least several times larger.
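The weighted fit above relies on a variance estimate (reg.var3) whose code is not shown in this preview. The sketch below illustrates the same idea under stated assumptions: it smooths the squared residuals with a small Gaussian-kernel smoother, then uses the reciprocal of the estimated variance as regression weights. The helper `nw.smooth` and the bandwidth `h = 0.3` are hypothetical choices for illustration, not the function from HW Assignment 2, so the numbers it prints need not match those reported above.

```r
library(MASS)  # ships with R; provides the geyser data
data(geyser)

# Hypothetical helper (not the HW 2 function): Nadaraya-Watson smoother
# with a Gaussian kernel, evaluated at the points x0
nw.smooth <- function(x, y, x0, h) {
  sapply(x0, function(z) {
    w <- dnorm((x - z) / h)
    sum(w * y) / sum(w)
  })
}

# Unweighted fit and its squared residuals
ols <- lm(waiting ~ duration, data = geyser)

# Assumption: bandwidth h = 0.3 minutes, picked by eye
var.hat <- nw.smooth(geyser$duration, residuals(ols)^2,
                     geyser$duration, h = 0.3)

# Weighted least squares with weights inversely proportional to the variance
wls <- lm(waiting ~ duration, data = geyser, weights = 1 / var.hat)

# Compare the change in each coefficient with its unweighted standard error
coef.diff <- coef(wls) - coef(ols)
std.err <- summary(ols)$coefficients[, "Std. Error"]
print(cbind(difference = coef.diff, std.error = std.err,
            ratio = abs(coef.diff) / std.err))
```

A ratio well below 1 in the last column corresponds to the argument made above: the change in the coefficient is small relative to its standard error.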

1. Answer:
library(np)  # provides npreg
# Do a nonparametric kernel regression of waiting on duration
regression5 = npreg(waiting ~ duration, data = geyser, residuals = T)
# Plot the data and then plot the regression results
plot(geyser$duration, geyser$waiting, cex = 0.5, pch = 16,
     main = "Waiting time as function of geyser duration\nwith a kernel regression curve",
     xlab = "Duration (minutes)", ylab = "Waiting time (minutes)")
# Pair each observation, indexed by duration, with its fitted value
fits = cbind(geyser$duration, regression5$mean)
# Order the observation-fit pairs by duration (column 1)
fits5 = fits[order(fits[,1]),]
# Add the ordered fits to the plot
lines(fits5, col=2, lwd = 2.5)
See Figure 3. The kernel regression (the red curve) is not restricted to
be linear, resulting in a curvy fit. The kernel regression suggests that the
dependence of waiting on duration is rather flat for duration between 0.
and 3 minutes, and then decreases sharply for larger values of duration.
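For readers without the np package, a similar curve can be sketched with `ksmooth` from base R. This is a stand-in, not the method used in the solutions: `npreg` selects its bandwidth by cross-validation, whereas the bandwidth of 1 below is an assumption chosen by eye, so the resulting curve will differ in detail. Because `ksmooth` returns its evaluation points in increasing order, the result can be passed to `lines()` without a separate sorting step.

```r
library(MASS)  # ships with R; provides the geyser data
data(geyser)

# Assumption: bandwidth = 1 is picked by eye, not by cross-validation
fit <- ksmooth(geyser$duration, geyser$waiting, kernel = "normal",
               bandwidth = 1, n.points = 200)

# Plot the data, then overlay the smoothed curve
plot(geyser$duration, geyser$waiting, cex = 0.5, pch = 16,
     main = "Waiting time as function of geyser duration\nwith a ksmooth curve",
     xlab = "Duration (minutes)", ylab = "Waiting time (minutes)")
lines(fit$x, fit$y, col = 2, lwd = 2.5)
```

A smaller bandwidth gives a wigglier curve and a larger one a flatter curve, so it is worth trying a few values before reading much into the shape.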
2. Answer:

Figure 3: Waiting time as function of geyser duration with a kernel regression curve. Axes: Duration (minutes) vs. Waiting time (minutes).

Figure 4: Estimated variance of waiting given duration. Solid black is for unweighted linear regression, and dashed blue is for kernel regression. Axes: Duration (minutes) vs. Estimated variance of waiting given duration.

Figure 5: Conditional density estimate for waiting given duration. Axes: duration, waiting, Conditional Density.