STAT 200 - Practice Exam 3
Solutions appear at the end of the document. (Don’t forget to also look over Homework assignments. Highly recommended.)
Chapter 9 Questions
9.2 (p. 255)
9.4 (p. 255)
9.12 (p. 264)
9.13 (p. 265)
Chapter 10 Questions
10.2 (p. 283)
10.6 (p. 297)
This problem should have a 10th observation with price=$559 and bedrooms=5.
10.10 (p. 298)
10.11 (p. 298)
10.13 (p. 299)
10.15 (p. 299)
10.21 (p. 310)
Which of the two models is better?
10.22 (p. 311)
10.23 (p. 311)
If you believe the model is invalid, find a better one.
10.27 (p. 311)
The Price variable should be capitalized.
Question 9.2 (p. 255)
poll = c(315, 197, 141, 39, 16, 79)
actual = c(48.6, 31.5, 12.5, 2.8, 0.6, 4.0)
Let’s compare the polling percentages with the actual percentages:
cbind(poll/sum(poll), actual/sum(actual))
      [,1]  [,2]
[1,] 0.400 0.486
[2,] 0.250 0.315
[3,] 0.179 0.125
[4,] 0.050 0.028
[5,] 0.020 0.006
[6,] 0.100 0.040
Some are close, but some are off by 6, 7, or even 8 percentage points. We will perform a chi-squared
goodness-of-fit test to be sure. Remember to supply the actual percentages through the p= option,
but first they must be converted to proportions.
actual.p = actual/sum(actual)
chisq.test(poll, p=actual.p)
Chi-squared test for given probabilities
data: poll
X-squared = 152.6, df = 5, p-value < 2.2e-16
The p-value is far below any usual significance level, so we reject the null hypothesis that the
poll follows the actual distribution. The sample data are not consistent with the actual results.
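To see what chisq.test is computing, the statistic can be rebuilt by hand. The following is a minimal sketch using the poll and actual values given above; the names expected, x2, and p.val are illustrative and not part of the original solution.

```r
# Observed poll counts and actual percentages, as given in the problem
poll   = c(315, 197, 141, 39, 16, 79)
actual = c(48.6, 31.5, 12.5, 2.8, 0.6, 4.0)

# Expected counts under the null: sample size times the actual proportions
expected = sum(poll) * actual / sum(actual)

# Chi-squared statistic: sum of (observed - expected)^2 / expected
x2 = sum((poll - expected)^2 / expected)

# Upper-tail p-value with (number of categories - 1) degrees of freedom
p.val = pchisq(x2, df = length(poll) - 1, lower.tail = FALSE)

x2
p.val
```

The value of x2 should agree with the X-squared value that chisq.test reports.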
Question 9.4 (p. 255)
library(UsingR); data(pi2000)
First we must build a table counting the number of appearances of each digit.
table(pi2000)
pi2000
  0   1   2   3   4   5   6   7   8   9
181 213 207 189 195 205 200 197 202 211
Since this dataset has 2000 digits total, we expect to see each individual digit about 2000/10 = 200
times if they appear with equal probability. Let’s do a chi-squared goodness-of-fit test. Since a
null hypothesis of equal probabilities is the default in R, we don’t need to specify the p= option.
chisq.test(table(pi2000))
Chi-squared test for given probabilities
data: table(pi2000)
X-squared = 4.42, df = 9, p-value = 0.
The p-value is large, so we fail to reject the null hypothesis. The data are consistent with the
first 2000 digits appearing with equal probability. Note that if you instead type
chisq.test(pi2000) without putting the digits in table form, you’re incorrectly testing the
2000 digits individually (notice df=1999) instead of their 10 groups.
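The same test can be run without loading UsingR by typing in the ten counts from the table above. This is a small sketch under that assumption; counts, expected, and result are illustrative names.

```r
# Digit counts for 0-9, copied from the table(pi2000) output above
counts = c(181, 213, 207, 189, 195, 205, 200, 197, 202, 211)

# Each digit is expected sum(counts)/10 times under equal probabilities
expected = rep(sum(counts) / length(counts), length(counts))

# Goodness-of-fit test on the 10 counts, so df = 10 - 1 = 9
result = chisq.test(counts)

result$statistic   # chi-squared statistic
result$parameter   # degrees of freedom
result$p.value
```

Because the input is already a vector of 10 group counts, the degrees of freedom come out as 9 rather than 1999.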
Question 10.2 (p. 283)
library(UsingR); data(MLBattend)
attach(MLBattend)
model.10.2 = lm(attendance~wins)
model.10.2
Call:
lm(formula = attendance ~ wins)
Coefficients:
(Intercept)         wins
    -378164        27345
The linear model is Attendance = -378164 + 27345(Wins), so predicted Attendance increases by 27345
with each additional win. Thus a team that jumps from 80 wins to 90 increases Wins by 10 and
predicted Attendance by 27345 × 10:
[1] 273450
We could also solve this by finding the predicted values and then their difference:
predict(model.10.2, data.frame(wins=c(80,90)))
      1       2
1809451 2082903
2082903-1809451
[1] 273452
The two answers differ slightly because the printed coefficients are rounded.
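The reason both approaches agree is that the intercept cancels when two predictions are subtracted. A minimal sketch using the rounded coefficients printed above; the helper name attendance.hat is hypothetical.

```r
# Rounded coefficients copied from the printed model above
b0 = -378164
b1 = 27345

# Hypothetical helper: predicted attendance for a given number of wins
attendance.hat = function(wins) b0 + b1 * wins

# The intercept cancels in the difference, leaving slope * (90 - 80)
change = attendance.hat(90) - attendance.hat(80)
change
```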
Question 10.6 (p. 297)
price = c(300, 250, 400, 550, 317, 389, 425, 289, 389, 559)
bedrooms = c(3, 3, 4, 5, 4, 3, 6, 3, 4, 5)
model.10.6 = lm(price~bedrooms)
The scatterplot with regression line appears below.
plot(price~bedrooms)
abline(model.10.6)
[Figure: scatterplot of price versus bedrooms with the fitted regression line]
Testing whether an extra bedroom is worth $60,000 versus the alternative that it is worth more is
the same as testing the null hypothesis that the slope equals 60,000 (H0: β1 = 60,000) versus the
alternative that the slope is greater than 60,000 (HA: β1 > 60,000). Because price is recorded in
thousands of dollars, the value 60,000 corresponds to 60 in the model's units.
summary(model.10.6)
...
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)     94.4       98.0    0.96    0.
bedrooms        73.1       23.8    3.08    0.015 *
...
Our summary statistics show the observed estimate for the slope to be $73,100 with a standard error
of $23,800. Remember to use the t-distribution with n - p = n - 2 = 8 degrees of freedom.
t = (73.1 - 60) / 23.8
t
[1] 0.
p.value = pt(t, df=10-2, lower.tail=F)
p.value
[1] 0.
The p-value indicates we fail to reject the null hypothesis, and we conclude that there is not
significant evidence to suggest an extra bedroom is worth more than $60,000.
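The same test can be carried out from the fitted model without retyping rounded numbers. The following is a sketch under the assumption that the data are entered as above; est, se, t.stat, and p.value are illustrative names.

```r
price    = c(300, 250, 400, 550, 317, 389, 425, 289, 389, 559)
bedrooms = c(3, 3, 4, 5, 4, 3, 6, 3, 4, 5)
model.10.6 = lm(price ~ bedrooms)

# Pull the unrounded slope estimate and its standard error from the summary
coefs = coef(summary(model.10.6))
est = coefs["bedrooms", "Estimate"]
se  = coefs["bedrooms", "Std. Error"]

# One-sided test of H0: slope = 60 against HA: slope > 60 (price is in $1000s)
t.stat  = (est - 60) / se
p.value = pt(t.stat, df = df.residual(model.10.6), lower.tail = FALSE)

t.stat
p.value
```

Using df.residual avoids hard-coding n - 2, so the same lines would work if an observation were added or removed.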
Question 10.10 (p. 298)
year = 1952:1962
pop = c(724, 176, 920, 1392, 1392, 1448, 1212, 1672, 2068, 1980, 2116)
model.10.10 = lm(pop~year)
model.10.10
...
Coefficients:
(Intercept)         year
    -318864          164
The scatterplot with regression line appears below.
plot(pop~year)
abline(model.10.10)
[Figure: scatterplot of pop versus year with the fitted regression line]
predict(model.10.10, data.frame(year=1963))
[1] 2355
The model predicts 2355 seals in 1963. It is inappropriate to use this model to predict the seal
population in 2004: that year lies about 40 years beyond the data, and a trend fitted to
1952-1962 is not indicative of current conditions or patterns in population growth (or decay).
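One way to see part of the problem with extrapolating is to compare prediction intervals near the data and far from it. This is a hedged sketch: the intervals assume the linear model keeps holding, which is exactly what is in doubt for 2004, so they understate the real uncertainty. The names new.years, pred, and width are illustrative.

```r
year = 1952:1962
pop  = c(724, 176, 920, 1392, 1392, 1448, 1212, 1672, 2068, 1980, 2116)
model.10.10 = lm(pop ~ year)

# 95% prediction intervals one year past the data and far beyond it
new.years = data.frame(year = c(1963, 2004))
pred = predict(model.10.10, new.years, interval = "prediction")

# Width of each interval: upper limit minus lower limit
width = pred[, "upr"] - pred[, "lwr"]
cbind(new.years, pred, width)
```

The interval for 2004 is wider than the one for 1963 because the standard error of a prediction grows with the distance from the mean of the observed years.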
Question 10.15 (p. 299)
library(UsingR); data(galton)
attach(galton)
model.10.15 = lm(child~parent)
summary(model.10.15)
...
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  23.9415     2.8109    8.52   <2e-16 ***
parent        0.6463     0.0411   15.71   <2e-16 ***
...
Test the null hypothesis that the slope equals 1 (H0: β1 = 1) versus the alternative that the slope
is not equal to 1 (HA: β1 ≠ 1). Remember to use the t-distribution with n - 2 degrees of freedom.
t = (0.6463 - 1) / 0.0411
t
[1] -8.
p.value = pt(t, df=length(parent)-2, lower.tail=T) * 2
p.value
[1] 3.212e-
The p-value indicates we reject the null hypothesis, and we conclude that there is significant
evidence to suggest that the slope does not equal 1.
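The one-sided test in Question 10.6 and the two-sided test here differ only in the hypothesized value and the tail used, so both fit into one small function. The helper slope.t.test below is hypothetical (it is not part of R or UsingR), and the example uses R's built-in cars data purely as a stand-in for galton.

```r
# Hypothetical helper: t-test of H0: coefficient = beta0 for one term of an lm fit
slope.t.test = function(model, term, beta0 = 0,
                        alternative = c("two.sided", "less", "greater")) {
  alternative = match.arg(alternative)
  coefs = coef(summary(model))
  t.stat = (coefs[term, "Estimate"] - beta0) / coefs[term, "Std. Error"]
  df = df.residual(model)
  # Two-sided: double the tail beyond |t|; one-sided: use the matching tail
  p.value = switch(alternative,
    two.sided = 2 * pt(abs(t.stat), df, lower.tail = FALSE),
    less      = pt(t.stat, df, lower.tail = TRUE),
    greater   = pt(t.stat, df, lower.tail = FALSE))
  c(t = t.stat, df = df, p.value = p.value)
}

# Stand-in example on R's built-in cars data (not the galton data)
fit = lm(dist ~ speed, data = cars)
slope.t.test(fit, "speed", beta0 = 1)
```

With the galton fit loaded, the corresponding call would be slope.t.test(model.10.15, "parent", beta0 = 1). Doubling the tail beyond |t| gives the same answer as the document's pt(t, ..., lower.tail=T) * 2 whenever t is negative.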
Question 10.21 (p. 310)
This picks up where the example in our class notes leaves off.
library(UsingR); data(galileo); attach(galileo)
gal.3 = lm(h.d ~ init.h + I(init.h^2) + I(init.h^3))
gal.4 = update(gal.3, . ~ . + I(init.h^4))
Recall that the cubic polynomial model (gal.3) is a highly significant model in all respects.
summary(gal.3)
Call:
lm(formula = h.d ~ init.h + I(init.h^2) + I(init.h^3))
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -7.82e+02   8.48e+01   -9.22   0.0027 **
init.h       2.77e+00   2.65e-01   10.47   0.0019 **
I(init.h^2) -2.07e-03   2.63e-04   -7.87   0.0043 **
I(init.h^3)  5.48e-07   8.33e-08    6.58   0.0072 **
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.01 on 3 degrees of freedom
Multiple R-Squared: 0.999, Adjusted R-squared: 0.
F-statistic: 1.6e+03 on 3 and 3 DF, p-value: 2.66e-
As for the model which adds on a fourth-degree term, we see the following.
summary(gal.4)
Call:
lm(formula = h.d ~ init.h + I(init.h^2) + I(init.h^3) + I(init.h^4))
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.32e+03   2.34e+02   -5.64    0.030 *
init.h       5.07e+00   9.87e-01    5.14    0.036 *
I(init.h^2) -5.61e-03   1.51e-03   -3.72    0.    .
I(init.h^3)  2.89e-06   9.92e-07    2.91    0.
I(init.h^4) -5.61e-10   2.37e-10   -2.36    0.
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.52 on 2 degrees of freedom
Multiple R-Squared: 1, Adjusted R-squared: 1
F-statistic: 3.02e+03 on 4 and 2 DF, p-value: 0.
The new term lowered the residual standard error only a little and raised R^2 only a little. The
p-value for the F-test went up quite a bit, but the model is still significant overall. Most
notable, however, is that the p-values of the marginal t-tests for the coefficients have all gone
up, rendering some variables insignificant. Since the summary statistics have improved only
marginally while the individual t-tests have greatly worsened, we reject this model (gal.4) in
favor of the previous cubic polynomial model (gal.3).
This is confirmed by the ANOVA test (partial F-test), in which the null hypothesis says to keep
the smaller model while the alternative says to keep the larger model because the new parameters
are significant. Here, we see that the p-value is not significant, so we fail to reject the null and
keep the smaller gal.3 model.
anova(gal.3, gal.4)
Analysis of Variance Table

Model 1: h.d ~ init.h + I(init.h^2) + I(init.h^3)
Model 2: h.d ~ init.h + I(init.h^2) + I(init.h^3) + I(init.h^4)
  Res.Df  RSS Df Sum of Sq    F Pr(>F)
1      3  48.
2      2 12.7  1      35.5 5.58  0.
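The partial F statistic in this table can be rebuilt from the sums of squares it reports. The sketch below uses the rounded values printed in the anova output, so the result agrees with the table only to within rounding; the variable names are illustrative.

```r
# Values read off the anova(gal.3, gal.4) table above (rounded as printed)
rss.large = 12.7   # residual sum of squares of gal.4
df.large  = 2      # residual degrees of freedom of gal.4
ss.extra  = 35.5   # drop in RSS from adding I(init.h^4)
df.extra  = 1      # number of parameters added

# Partial F statistic: extra sum of squares per added parameter,
# divided by the larger model's residual mean square
F.stat = (ss.extra / df.extra) / (rss.large / df.large)

# Upper-tail p-value from the F distribution
p.value = pf(F.stat, df.extra, df.large, lower.tail = FALSE)

F.stat
p.value
```

When only one parameter is added, this F statistic is the square of that parameter's marginal t statistic in the larger model (here -2.36 squared, up to rounding).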
We can also compare the AIC values. Oddly, gal.4 has the lower AIC value which suggests we
should keep that model. But the other evidence suggesting gal.3 is better far outweighs this.
AIC(gal.3); AIC(gal.4)
[1] 43.
[1] 36.
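The AIC result is less odd once the formula is written out: for a linear model, AIC depends on the residual sum of squares through n*log(RSS/n), and with so few observations a large relative drop in RSS can outweigh the penalty of 2 per added parameter. The sketch below checks the formula against R's AIC function on the built-in cars data, used here only as a stand-in since it needs no extra package.

```r
# Stand-in fit on R's built-in cars data (not the galileo data)
fit = lm(dist ~ speed, data = cars)

n   = nobs(fit)
rss = sum(resid(fit)^2)
k   = length(coef(fit)) + 1   # regression coefficients plus the error variance

# Maximized Gaussian log-likelihood written in terms of RSS
log.lik = -n / 2 * (log(2 * pi) + log(rss / n) + 1)

# AIC = -2 * log-likelihood + 2 * (number of estimated parameters)
aic.manual = -2 * log.lik + 2 * k

c(manual = aic.manual, builtin = AIC(fit))
```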
Question 10.23 (p. 311)
par(mfrow=c(2,2))
plot(model.10.23, which=1:4)
It appears the residuals have mean 0, but the constant variance assumption is shaky: in plot 1 the
spread starts small, grows much larger in the middle, then shrinks for large fitted values of
Attendance. This could just be because of the outliers, though. Also, the largest points lying
above the line in the normal Q-Q plot indicate the residuals are skewed right, but there
aren’t issues in the middle, so this isn’t too bad.
All in all, we could go either way as to the validity of this model. But we do see an opportunity
to improve it by removing the insignificant predictive variable, Wins.
model.10.23b = update(model.10.23, . ~ . - wins)
summary(model.10.23b)
...
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)   -448780     197034   -2.28    0.023 *
year             5175       1197    4.32  1.7e-05 ***
runs.scored      2956        218   13.57  < 2e-16 ***
games.behind   -17332       1946   -8.91  < 2e-16 ***
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 622000 on 834 degrees of freedom
Multiple R-Squared: 0.325, Adjusted R-squared: 0.
F-statistic: 134 on 3 and 834 DF, p-value: <2e-16
plot(model.10.23b,which=1:4)
Now all independent variables are significant. Equally good is that the residual standard error, the
coefficient of determination, and the F-test p-value didn’t budge. A look at the diagnostic plots
also shows results similar to the previous model. Thus, we prefer to model Attendance on the
explanatory variables Year, Runs Scored, and Games Behind.
For additional analysis, we look at the AIC values and the ANOVA test, both of which support
our decision to keep the new, smaller model.
AIC(model.10.23); AIC(model.10.23b)
[1] 24744
[1] 24743
anova(model.10.23b, model.10.23)
Analysis of Variance Table

Model 1: attendance ~ year + runs.scored + games.behind
Model 2: attendance ~ year + runs.scored + wins + games.behind
  Res.Df      RSS Df Sum of Sq    F Pr(>F)
1    834 3.23e+
2    833 3.22e+14  1  4.39e+11 1.13  0.
Question 10.27 (p. 311)
library(MASS); data(Cars93); attach(Cars93)
model.10.27 = lm(MPG.city~EngineSize+Weight+Passengers+Price)
summary(model.10.27)
...
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 46.38941    2.09752   22.12  < 2e-16 ***
EngineSize   0.19612    0.58888    0.33    0.
Weight      -0.00821    0.00134   -6.11  2.6e-08 ***
Passengers   0.26962    0.42495    0.63    0.
Price       -0.03580    0.04918   -0.73    0.
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.06 on 88 degrees of freedom
Multiple R-Squared: 0.716, Adjusted R-squared: 0.
F-statistic: 55.6 on 4 and 88 DF, p-value: <2e-16
The variable Weight is the only one marked as significant. The summary statistics all look pretty
good, though. At this point, you can remove the weakest contributor (EngineSize) and see how
the new model responds, continuing this process until you reach a stopping point and find the
best model available based on the marginal t-tests and summary statistics.
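The manual process just described can be written as a loop. The following is only a sketch of one possible stopping rule (every remaining predictor has a marginal p-value below 0.05), not the textbook's procedure. The helper backward.by.p is hypothetical, and it assumes all predictors are numeric so that coefficient names match term names.

```r
library(MASS)   # provides the Cars93 data

# Hypothetical helper: drop the predictor with the largest marginal p-value,
# refit, and repeat until every remaining predictor has p < alpha
backward.by.p = function(model, alpha = 0.05) {
  repeat {
    coefs = coef(summary(model))
    terms.left = setdiff(rownames(coefs), "(Intercept)")
    if (length(terms.left) == 0) break
    p.vals = setNames(coefs[terms.left, "Pr(>|t|)"], terms.left)
    if (max(p.vals) < alpha) break
    worst = names(which.max(p.vals))
    model = update(model, as.formula(paste(". ~ . -", worst)))
  }
  model
}

full = lm(MPG.city ~ EngineSize + Weight + Passengers + Price, data = Cars93)
reduced = backward.by.p(full)
coef(summary(reduced))
```

Passing data = Cars93 instead of calling attach keeps the refits inside update tied to the same data frame.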
In a quicker fashion we can do this based solely on AIC values. We do this with the stepAIC
function beginning with the full model given above.
library(MASS)
stepAIC(model.10.27)
Start: AIC= 212.
MPG.city ~ EngineSize + Weight + Passengers + Price

             Df Sum of Sq  RSS AIC
- EngineSize  1         1  825 211
- Passengers  1         4  828 211
- Price       1         5  829 211
<none>                     824 213
- Weight      1       350 1174 244

<< Skipping several steps >>
...
Step: AIC= 208.7
MPG.city ~ Weight

         Df Sum of Sq  RSS AIC
<none>                 840 209
- Weight  1      2066 2906 322
...
Using AIC as the only criterion, we see that modeling MPG.city on Weight alone provides us
with the lowest AIC (208.7) of all models considered. You can scroll up and down the output
through each step to verify this, but since it is best, R puts that model at the very end and lists
its coefficient estimates.
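Rather than scrolling through the printed steps, the result of stepAIC can be stored and inspected directly. A minimal sketch, assuming the model is refit with data = Cars93 instead of attach; the names full and best are illustrative.

```r
library(MASS)   # provides stepAIC and the Cars93 data

full = lm(MPG.city ~ EngineSize + Weight + Passengers + Price, data = Cars93)

# trace = FALSE suppresses the step-by-step printout; the returned
# object is the final model chosen by the search
best = stepAIC(full, trace = FALSE)

formula(best)      # model formula at the stopping point
extractAIC(best)   # equivalent degrees of freedom and the AIC used by stepAIC
best$anova         # record of which terms were dropped at each step
```

The AIC values that stepAIC reports come from extractAIC, which leaves out an additive constant, so they differ from the values returned by AIC() but rank models fitted to the same data in the same order.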