Final Exam Paper - Statistical Analysis | STAT 200, Exams of Statistics

Material Type: Exam; Class: Statistical Analysis; Subject: Statistics; University: University of Illinois - Urbana-Champaign; Term: Spring 2000;

Uploaded on 03/10/2009

STAT 200 - Practice Exam 3
Solutions appear at the end of the document.
(Don’t forget to also look over Homework assignments. Highly recommended.)
Chapter 9 Questions
9.2 (p. 255)
9.4 (p. 255)
9.12 (p. 264)
9.13 (p. 265)
Chapter 10 Questions
10.2 (p. 283)
10.6 (p. 297)
This problem should have a 10th observation with price=$559 and bedrooms=5.
10.10 (p. 298)
10.11 (p. 298)
10.13 (p. 299)
10.15 (p. 299)
10.21 (p. 310)
Which of the two models is better?
10.22 (p. 311)
10.23 (p. 311)
If you believe the model is invalid, find a better one.
10.27 (p. 311)
The Price variable should be capitalized.

Question 9.2 (p. 255)

poll = c(315, 197, 141, 39, 16, 79)
actual = c(48.6, 31.5, 12.5, 2.8, 0.6, 4.0)

Let’s compare the polling percentages with the actual percentages:

cbind(poll/sum(poll), actual/sum(actual))
      [,1]  [,2]
[1,] 0.400 0.486
[2,] 0.250 0.315
[3,] 0.179 0.125
[4,] 0.050 0.028
[5,] 0.020 0.006
[6,] 0.100 0.040

Some are close, but others are off by 6 to 8 percentage points. We will perform a chi-squared goodness-of-fit test to be sure. Remember to pass the actual percentages in the p= option, but first they must be converted to decimal proportions.

actual.p = actual/sum(actual)
chisq.test(poll, p=actual.p)

Chi-squared test for given probabilities

data: poll
X-squared = 152.6, df = 5, p-value < 2.2e-16

The p-value is far below any usual significance level, so we reject the null hypothesis that the poll follows the actual distribution. The sample data are not consistent with the actual results.
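The statistic reported by chisq.test can be reproduced by hand, which makes it clearer where the large value comes from. The following is a minimal sketch using only base R; the variable names expected, x.squared and p.value are illustrative and not part of the original solution.

```r
# Observed poll counts and actual percentages, as given in the problem.
poll <- c(315, 197, 141, 39, 16, 79)
actual <- c(48.6, 31.5, 12.5, 2.8, 0.6, 4.0)

# Expected counts under the null: sample size times each actual proportion.
actual.p <- actual / sum(actual)
expected <- sum(poll) * actual.p

# Pearson statistic and its upper-tail p-value on k - 1 degrees of freedom.
x.squared <- sum((poll - expected)^2 / expected)
df <- length(poll) - 1
p.value <- pchisq(x.squared, df = df, lower.tail = FALSE)

print(round(expected, 1))
print(c(x.squared = x.squared, df = df, p.value = p.value))
```

One expected count falls just below 5, so chisq.test may warn that the chi-squared approximation could be inaccurate. With a statistic this large, the conclusion is unlikely to change.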

Question 9.4 (p. 255)

library(UsingR); data(pi2000)

First we must build a table counting the number of appearances of each digit.

table(pi2000)
pi2000
  0   1   2   3   4   5   6   7   8   9
181 213 207 189 195 205 200 197 202 211

Since this dataset has 2000 digits in total, we expect to see each digit about 2000/10 = 200 times if all ten appear with equal probability. Let's do a chi-squared goodness-of-fit test. Since a null hypothesis of equal probabilities is the default in R, we don't need to specify the p= option.

chisq.test(table(pi2000))

Chi-squared test for given probabilities

data: table(pi2000)
X-squared = 4.42, df = 9, p-value = 0.

The p-value is large, so we fail to reject the null hypothesis of goodness-of-fit. The data are consistent with the first 2000 digits appearing with equal probability. Note that if you instead type chisq.test(pi2000) without putting the digits in table form, you're incorrectly testing the digits individually (notice df=1999) instead of their 10 groups.
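As a check on the equal-probability test, the statistic can be rebuilt from the ten digit counts alone. This is a small sketch in base R that hard-codes the counts from the table above rather than loading UsingR; the names counts and expected are illustrative.

```r
# Digit counts for the first 2000 digits of pi, copied from the table above.
counts <- c(181, 213, 207, 189, 195, 205, 200, 197, 202, 211)

# Under equal probabilities every digit has the same expected count.
expected <- sum(counts) / length(counts)

# Pearson statistic on (number of groups - 1) degrees of freedom.
x.squared <- sum((counts - expected)^2 / expected)
df <- length(counts) - 1
p.value <- pchisq(x.squared, df = df, lower.tail = FALSE)

print(c(expected = expected, x.squared = x.squared, df = df))
print(p.value)
```

The degrees of freedom come from the 10 groups, not the 2000 individual digits, which is why the counts must be tabulated first.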

Question 10.2 (p. 283)

library(UsingR); data(MLBattend)
attach(MLBattend)
model.10.2 = lm(attendance~wins)
model.10.2

Call:
lm(formula = attendance ~ wins)

Coefficients:
(Intercept)         wins
    -378164        27345

The linear model is Attendance = -378164 + 27345(Wins), so predicted Attendance increases by 27345 with each additional win. Thus a team that jumps from 80 wins to 90 increases Wins by 10 and Attendance by...

27345 * 10
[1] 273450

We could also solve this by finding the predicted values and then their difference. The result differs slightly from 273450 because the printed coefficients are rounded.

predict(model.10.2, data.frame(wins=c(80,90)))
      1       2
1809451 2082903
2082903-1809451
[1] 273452
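The two approaches agree because the intercept cancels when predictions are subtracted. Here is a minimal sketch using the rounded coefficients from the printed model; the helper attendance.hat is a hypothetical name introduced for illustration.

```r
# Rounded coefficients as printed in the model output.
b0 <- -378164
b1 <- 27345

# Predicted attendance as a function of wins.
attendance.hat <- function(wins) b0 + b1 * wins

# The intercept cancels in the difference, leaving slope * (90 - 80).
change <- attendance.hat(90) - attendance.hat(80)
print(change)
print(b1 * (90 - 80))
```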

Question 10.6 (p. 297)

price = c(300, 250, 400, 550, 317, 389, 425, 289, 389, 559)
bedrooms = c(3, 3, 4, 5, 4, 3, 6, 3, 4, 5)
model.10.6 = lm(price~bedrooms)

The scatterplot with regression line appears below.

plot(price~bedrooms)
abline(model.10.6)

[Figure: scatterplot of price (vertical axis, 250 to 550) against bedrooms (horizontal axis, 3 to 6) with the fitted regression line.]

Testing whether an extra bedroom is worth $60,000 versus the alternative that it is worth more is the same as testing the null hypothesis that the slope equals 60,000 (H0: β1 = 60,000) versus the alternative that the slope is greater than 60,000 (HA: β1 > 60,000). Because price is recorded in thousands of dollars, the null value enters the calculation as 60.

summary(model.10.6)
...
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)     94.4       98.0    0.96    0.
bedrooms        73.1       23.8    3.08    0.015 *
...

Our summary statistics show the observed estimate for the slope to be $73,100 with a standard error of $23,800. Remember to use the t-distribution with n - p = n - 2 degrees of freedom.

t = (73.1 - 60) / 23.8
t
[1] 0.
p.value = pt(t, df=10-2, lower.tail=F)
p.value
[1] 0.

The p-value indicates we fail to reject the null hypothesis, and we conclude that there is not significant evidence to suggest an extra bedroom is worth more than $60,000.
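The same one-sided test can be run without retyping rounded numbers by pulling the slope and standard error straight from the fitted model. This is a sketch in base R; the names fit, coefs and null.value are illustrative, and the data are the ten observations given in the problem.

```r
# Data from the problem, including the 10th observation; price is in thousands of dollars.
price <- c(300, 250, 400, 550, 317, 389, 425, 289, 389, 559)
bedrooms <- c(3, 3, 4, 5, 4, 3, 6, 3, 4, 5)
fit <- lm(price ~ bedrooms)

# Pull the slope estimate and its standard error from the coefficient table.
coefs <- coef(summary(fit))
slope <- coefs["bedrooms", "Estimate"]
se <- coefs["bedrooms", "Std. Error"]

# One-sided test of H0: slope = 60 against HA: slope > 60.
null.value <- 60
t.stat <- (slope - null.value) / se
p.value <- pt(t.stat, df = df.residual(fit), lower.tail = FALSE)

print(c(slope = slope, se = se, t = t.stat, p.value = p.value))
```

Note that the t value of 3.08 printed by summary() tests a slope of 0, which is why the statistic has to be recomputed for a null value of 60.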

Question 10.10 (p. 298)

year = 1952:1962
pop = c(724, 176, 920, 1392, 1392, 1448, 1212, 1672, 2068, 1980, 2116)
model.10.10 = lm(pop~year)
model.10.10
...
Coefficients:
(Intercept)         year
    -318864          164

The scatterplot with regression line appears below.

plot(pop~year)
abline(model.10.10)

[Figure: scatterplot of pop (vertical axis, 500 to 2000) against year (horizontal axis, 1952 to 1962) with the fitted regression line.]

predict(model.10.10, data.frame(year=1963))
[1] 2355

The model predicts 2355 seals in 1963. It is inappropriate to use this model to predict the seal population in 2004: that year lies about 40 years beyond the observed data, and a linear trend fitted to 1952-1962 is not indicative of current conditions or patterns in population growth (or decay).
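To see how far an extrapolation reaches beyond the data, the predictions can be tabulated next to their distance from the last observed year. This is a sketch in base R assuming the years run from 1952 to 1962, one per population value; the column name years.beyond.data is illustrative.

```r
# Seal counts for the 11 years 1952 through 1962.
year <- 1952:1962
pop <- c(724, 176, 920, 1392, 1392, 1448, 1212, 1672, 2068, 1980, 2116)
fit <- lm(pop ~ year)

# Predict one year past the data and 42 years past the data.
new.years <- data.frame(year = c(1963, 2004))
predictions <- unname(predict(fit, new.years))

# Report how far each prediction lies beyond the last observed year.
report <- data.frame(
  year = new.years$year,
  predicted = round(predictions),
  years.beyond.data = new.years$year - max(year)
)
print(report)
```

R returns a number for 2004 without complaint, so the judgment about whether the extrapolation is meaningful has to come from the analyst.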

Question 10.15 (p. 299)

library(UsingR); data(galton)
attach(galton)
model.10.15 = lm(child~parent)
summary(model.10.15)
...
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  23.9415     2.8109    8.52   <2e-16 ***
parent        0.6463     0.0411   15.71   <2e-16 ***
...

Test the null hypothesis that the slope equals 1 (H0: β1 = 1) versus the alternative that the slope is not equal to 1 (HA: β1 ≠ 1). Remember to use the t-distribution with n - 2 degrees of freedom.

t = (0.6463 - 1) / 0.0411
t
[1] -8.
p.value = pt(t, df=length(parent)-2, lower.tail=T) * 2
p.value
[1] 3.212e-

The p-value indicates we reject the null hypothesis, and we conclude that there is significant evidence to suggest that the slope does not equal 1.
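Doubling the lower tail works here only because the t statistic is negative. A version based on the absolute value handles either sign. The following is a sketch in base R; slope.test is a hypothetical helper, and df = 926 is an assumption that presumes the galton data set has 928 rows.

```r
# Two-sided t-test of H0: slope = null.value, from a reported estimate and standard error.
slope.test <- function(estimate, se, null.value, df) {
  t.stat <- (estimate - null.value) / se
  # Double the tail area beyond |t| so the sign of t does not matter.
  p.value <- 2 * pt(abs(t.stat), df = df, lower.tail = FALSE)
  c(t = t.stat, p.value = p.value)
}

# Estimate and standard error from the summary output.
# df = 926 is an assumption: it presumes galton has 928 rows.
result <- slope.test(estimate = 0.6463, se = 0.0411, null.value = 1, df = 926)
print(result)
```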

Question 10.21 (p. 310)

This picks up where the example in our class notes leaves off.

library(UsingR); data(galileo); attach(galileo)
gal.3 = lm(h.d ~ init.h + I(init.h^2) + I(init.h^3))
gal.4 = update(gal.3, . ~ . + I(init.h^4))

Recall that the cubic polynomial model (gal.3) is a highly significant model in all respects.

summary(gal.3)

Call:
lm(formula = h.d ~ init.h + I(init.h^2) + I(init.h^3))

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -7.82e+02   8.48e+01   -9.22   0.0027 **
init.h       2.77e+00   2.65e-01   10.47   0.0019 **
I(init.h^2) -2.07e-03   2.63e-04   -7.87   0.0043 **
I(init.h^3)  5.48e-07   8.33e-08    6.58   0.0072 **

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.01 on 3 degrees of freedom
Multiple R-Squared: 0.999, Adjusted R-squared: 0.
F-statistic: 1.6e+03 on 3 and 3 DF, p-value: 2.66e-

As for the model which adds on a fourth-degree term, we see the following.

summary(gal.4)

Call:
lm(formula = h.d ~ init.h + I(init.h^2) + I(init.h^3) + I(init.h^4))

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.32e+03   2.34e+02   -5.64    0.030 *
init.h       5.07e+00   9.87e-01    5.14    0.036 *
I(init.h^2) -5.61e-03   1.51e-03   -3.72    0.    .
I(init.h^3)  2.89e-06   9.92e-07    2.91    0.
I(init.h^4) -5.61e-10   2.37e-10   -2.36    0.

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.52 on 2 degrees of freedom
Multiple R-Squared: 1, Adjusted R-squared: 1
F-statistic: 3.02e+03 on 4 and 2 DF, p-value: 0.

The new term lowered the residual standard error a little (from 4.01 to 2.52) and raised R^2 a little. The p-value for the F-test went up quite a bit, but the model is still significant overall. Most notable, however, is that the p-values of the marginal t-tests for the coefficients have all gone up, rendering some terms insignificant. Since the summary statistics have only marginally improved while the individual t-tests have greatly worsened, we would reject this model (gal.4) in favor of the previous cubic polynomial model (gal.3).

This is confirmed by the ANOVA test (partial F-test), in which the null hypothesis says to keep the smaller model while the alternative says to keep the larger model because the new parameters are significant. Here, the p-value is not significant, so we fail to reject the null and keep the smaller gal.3 model.

anova(gal.3, gal.4)
Analysis of Variance Table

Model 1: h.d ~ init.h + I(init.h^2) + I(init.h^3)
Model 2: h.d ~ init.h + I(init.h^2) + I(init.h^3) + I(init.h^4)
  Res.Df  RSS Df Sum of Sq    F Pr(>F)
1      3 48.
2      2 12.7  1      35.5 5.58 0.
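The partial F statistic in this table is the drop in residual sum of squares per added parameter, divided by the larger model's residual mean square. The sketch below recomputes it in base R from the rounded table entries, so it may differ from the printed 5.58 in the last digit; all variable names are illustrative.

```r
# Values read from the ANOVA table (rounded as printed).
rss.larger <- 12.7   # residual sum of squares for gal.4
df.larger <- 2       # residual degrees of freedom for gal.4
extra.ss <- 35.5     # reduction in RSS from adding the quartic term
extra.df <- 1        # number of added parameters

# Partial F statistic and its upper-tail p-value.
f.stat <- (extra.ss / extra.df) / (rss.larger / df.larger)
p.value <- pf(f.stat, df1 = extra.df, df2 = df.larger, lower.tail = FALSE)

print(c(F = f.stat, p.value = p.value))
```

With only 2 residual degrees of freedom in the larger model, even a sizeable F value does not reach significance.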

We can also compare the AIC values. Oddly, gal.4 has the lower AIC value, which suggests we should keep that model. But the other evidence suggesting gal.3 is better far outweighs this.

AIC(gal.3); AIC(gal.4)
[1] 43.
[1] 36.
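The AIC values can be traced back to the residual sums of squares. For a Gaussian linear model, AIC() uses -2 times the log-likelihood plus twice the parameter count, with the error variance counted as a parameter. The sketch below first checks a hand-written formula against AIC() on the built-in cars data, then applies it to the two models. The helper aic.by.hand is hypothetical, n = 7 follows from 3 residual degrees of freedom plus 4 coefficients, and the RSS of 48.2 for gal.3 is an assumption obtained as 12.7 + 35.5 from the ANOVA table.

```r
# Hand-computed AIC for a Gaussian linear model, in the convention used by AIC().
# n.coef counts regression coefficients; the error variance adds one more parameter.
aic.by.hand <- function(rss, n, n.coef) {
  log.lik <- -n / 2 * (log(2 * pi) + log(rss / n) + 1)
  -2 * log.lik + 2 * (n.coef + 1)
}

# Check the formula against AIC() on a data set that ships with base R.
fit <- lm(dist ~ speed, data = cars)
by.hand <- aic.by.hand(sum(resid(fit)^2), nrow(cars), length(coef(fit)))
print(c(by.hand = by.hand, builtin = AIC(fit)))

# Apply it to the two polynomial models (RSS of 48.2 for gal.3 is an assumption).
aic.gal.3 <- aic.by.hand(48.2, n = 7, n.coef = 4)
aic.gal.4 <- aic.by.hand(12.7, n = 7, n.coef = 5)
print(c(gal.3 = aic.gal.3, gal.4 = aic.gal.4))
```

The penalty for the extra coefficient is only 2 units, while the drop in RSS with so few observations moves the likelihood term by much more, which is how gal.4 ends up with the lower AIC.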

Question 10.23 (p. 311)

par(mfrow=c(2,2))
plot(model.10.23, which=1:4)

It appears the residuals have mean 0, but the constant variance assumption is shaky because in plot 1 the variance starts small, grows much larger in the middle, then shrinks for large values of Attendance. This could just be because of the outliers, though. Also, the higher-valued points sitting above the line in the normal Q-Q plot indicate the residuals are skewed right, but there aren't issues in the middle, so this isn't too bad.

All in all, we could go either way as to the validity of this model. But we do see an opportunity to improve it by removing the insignificant predictor, Wins.

model.10.23b = update(model.10.23, . ~ . - wins)
summary(model.10.23b)
...
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)   -448780     197034   -2.28    0.023 *
year             5175       1197    4.32  1.7e-05 ***
runs.scored      2956        218   13.57  < 2e-16 ***
games.behind   -17332       1946   -8.91  < 2e-16 ***

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 622000 on 834 degrees of freedom
Multiple R-Squared: 0.325, Adjusted R-squared: 0.
F-statistic: 134 on 3 and 834 DF, p-value: <2e-16

plot(model.10.23b,which=1:4)

Now all independent variables are significant. Equally good is that the residual standard error, the coefficient of determination, and the F-test p-value didn't budge. A look at the diagnostic plots also shows similar results to the previous model. Thus, we prefer to model Attendance on the explanatory variables Year, Runs Scored, and Games Behind.

For additional analysis, we look at the AIC values and the ANOVA test, both of which support

our decision to keep the new, smaller model.

AIC(model.10.23); AIC(model.10.23b)
[1] 24744
[1] 24743

anova(model.10.23b, model.10.23)
Analysis of Variance Table

Model 1: attendance ~ year + runs.scored + games.behind
Model 2: attendance ~ year + runs.scored + wins + games.behind
  Res.Df      RSS Df Sum of Sq    F Pr(>F)
1    834 3.23e+14
2    833 3.22e+14  1  4.39e+11 1.13 0.

Question 10.27 (p. 311)

library(MASS); data(Cars93); attach(Cars93)
model.10.27 = lm(MPG.city~EngineSize+Weight+Passengers+Price)
summary(model.10.27)
...
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 46.38941    2.09752   22.12  < 2e-16 ***
EngineSize   0.19612    0.58888    0.33  0.
Weight      -0.00821    0.00134   -6.11  2.6e-08 ***
Passengers   0.26962    0.42495    0.63  0.
Price       -0.03580    0.04918   -0.73  0.

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.06 on 88 degrees of freedom
Multiple R-Squared: 0.716, Adjusted R-squared: 0.
F-statistic: 55.6 on 4 and 88 DF, p-value: <2e-16

The variable Weight is the only one marked as significant. The summary statistics all look pretty good, though. At this point, you can remove the weakest contributor (EngineSize) and see how the new model responds, continuing this process until you reach a stopping point and find the best model available based on the marginal t-tests and summary statistics.

A quicker approach is to select the model based solely on AIC values. We do this with the stepAIC function, beginning with the full model given above.

library(MASS)
stepAIC(model.10.27)
Start: AIC= 212.
MPG.city ~ EngineSize + Weight + Passengers + Price

             Df Sum of Sq  RSS AIC
- EngineSize  1         1  825 211
- Passengers  1         4  828 211
- Price       1         5  829 211
<none>                     824 213
- Weight      1       350 1174 244

<< Skipping several steps >>
...

Step: AIC= 208.7
MPG.city ~ Weight

         Df Sum of Sq  RSS AIC
<none>                 840 209
- Weight  1      2066 2906 322
...

Using AIC as the only criterion, we see that modeling MPG.city on Weight alone gives the lowest AIC (208.7) of all models considered. You can scroll up and down the output through each step to verify this, but since it is the best, R puts that model at the very end and lists its coefficient estimates.
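The AIC column in the step-wise output is on a different scale from AIC(), because additive constants are dropped. A sketch of the convention used by extractAIC for linear models, n * log(RSS / n) + 2 * edf, reproduces the table entries from the rounded RSS values. The helper step.aic is a hypothetical name, and n = 93 follows from 88 residual degrees of freedom plus 5 coefficients in the full model.

```r
# AIC on the scale reported in step-wise output: n * log(RSS / n) + 2 * edf,
# where edf is the number of estimated regression coefficients.
step.aic <- function(rss, n, edf) n * log(rss / n) + 2 * edf

n <- 93  # 88 residual degrees of freedom + 5 coefficients in the full model

# RSS values are the rounded entries from the stepAIC tables.
models <- data.frame(
  model = c("full", "full without Weight", "Weight only"),
  rss = c(824, 1174, 840),
  edf = c(5, 4, 2)
)
models$aic <- step.aic(models$rss, n, models$edf)
print(models)
```

Because only differences in AIC matter when ranking models fitted to the same data, the dropped constants do not affect which model is selected.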