Applied Regression Analysis (STAT 51200): Midterm Exam 1 with Answers
Statistics 512 Midterm Exam 1
Thursday, October 03, 2002
Time: 75 minutes
Instructor: Dr. K. L. Simonsen
Name: Answers.
You may detach the “Output” pages from the back of the exam. It is not necessary to hand those in.
This exam is open-book and open-notes. Calculators are permitted.
Please
- do not cheat on this exam
- circle your answers where appropriate
- write your solution clearly and legibly so that I can follow it
- cross out your mistakes; do not erase large quantities of work
- use the back of a page if you run out of room and indicate this on the question page
- move on if you get stuck
- do not leave the room until the end of the class period
Question Possible Actual
Total 75
(9)
1. List three major assumptions in simple linear regression following a), b), and c) below. For each
assumption, (i) describe one way that you could check that assumption and (ii) suggest one thing
you might try if that assumption is violated.
a) Assumption:
linear relationship
(i) Diagnostic:
scatterplot
(ii) Remedial:
Transform Y or X, or use polynomial regression
b) Assumption:
Normally distributed errors
(i) Diagnostic:
Q-Q plot (normal probability plot) or histogram of the residuals
(ii) Remedial:
Transformation of Y
c) Assumption:
Constant variance
(i) Diagnostic:
Residual plot (vs. X or Ŷ)
(ii) Remedial:
Weighted regression, variance-stabilizing transformation
(20)
2. In a simple linear regression problem with n = 62, the following estimates were obtained:
b₀ = 18, b₁ = -3, s{b₁} = 1, MSE = 4, X̄ = 5.
a) Write the equation for the simple linear regression model.
Y = β₀ + β₁X + ε or Yᵢ = β₀ + β₁Xᵢ + εᵢ
b) Write the estimated regression line.
Ŷ = 18 – 3X
c) Predict the value of Y when X = 5.
Ŷ = 18 – (3)(5) = 18 – 15 = 3
d) Calculate the estimated standard deviation of the error term in the model.
s = √MSE = √4 = 2
e) Give a 95% confidence interval for the slope of the regression line.
t_c = t(1-α/2, n-2) = t(0.975, 60) = 2.000
CI = b₁ ± t_c s{b₁} = -3 ± (2)(1) = -3 ± 2 = [-5, -1]
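The interval arithmetic in part e) can be checked with a short Python sketch. The function name `slope_ci` is hypothetical, and the critical value is typed in from a t table (t(0.975, 60) ≈ 2.000) rather than computed.

```python
# Values from the problem statement for Question 2.
b1 = -3.0       # estimated slope
s_b1 = 1.0      # estimated standard error of the slope
t_crit = 2.000  # t(0.975, 60), taken from a t table (assumption: not computed here)

def slope_ci(estimate, std_err, t_value):
    """Return (lower, upper) for estimate +/- t_value * std_err."""
    half_width = t_value * std_err
    return estimate - half_width, estimate + half_width

lower, upper = slope_ci(b1, s_b1, t_crit)
print(f"95% CI for the slope: [{lower:.3f}, {upper:.3f}]")
```

Because the whole interval lies below zero, it agrees with rejecting a zero slope at the 5% level.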
f) Give a 95% prediction interval for a new observation Y when X = 5.
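\[
s^2\{\text{pred}\} = MSE\left[1 + \frac{1}{n} + \frac{(X_h - \bar{X})^2}{SS_X}\right]
= 4\left[1 + \frac{1}{62} + \frac{(5 - 5)^2}{SS_X}\right] = 4.0645
\]
\[
s\{\text{pred}\} = \sqrt{4.0645} = 2.016
\]
\[
\text{PI} = \hat{Y}_h \pm t_c\, s\{\text{pred}\} = 3 \pm (2)(2.016) = 3 \pm 4.032 = [-1.032,\ 7.032]
\]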
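The prediction-interval calculation in part f) can be reproduced with the sketch below. It assumes, as the calculation does, that X_h = 5 coincides with the sample mean of X, so the term involving SS_X vanishes. SS_X is not given in the problem, so the value passed in is only a placeholder. The function name `pred_variance` is hypothetical.

```python
import math

# Values from the problem statement for Question 2.
n = 62
mse = 4.0
b0, b1 = 18.0, -3.0
x_bar = 5.0     # sample mean of X
x_h = 5.0       # X value for the new observation
t_crit = 2.000  # t(0.975, 60) from a t table (assumption: not computed here)

def pred_variance(mse, n, x_h, x_bar, ss_x):
    """Estimated variance of the prediction error for a new observation at x_h."""
    return mse * (1 + 1 / n + (x_h - x_bar) ** 2 / ss_x)

# SS_X is unknown; any positive placeholder gives the same answer because x_h == x_bar.
s2_pred = pred_variance(mse, n, x_h, x_bar, ss_x=1.0)
s_pred = math.sqrt(s2_pred)
y_hat = b0 + b1 * x_h
pi_lower = y_hat - t_crit * s_pred
pi_upper = y_hat + t_crit * s_pred
print(f"s2(pred) = {s2_pred:.4f}, s(pred) = {s_pred:.3f}")
print(f"95% PI: [{pi_lower:.3f}, {pi_upper:.3f}]")
```

For any X_h other than the mean, the actual SS_X would be needed and the interval would be wider.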
(23)
3. Refer to the SAS output marked OUTPUT FOR PROBLEM 3. The data are from a study of 78
seventh-grade students. The goal is to predict GRADE (average school grade on a scale of 0 to 11) from
variables which include IQ (score on an I.Q. test) and GENDER (0 = female, 1 = male).
a) Using the output for the simple linear regression, does there appear to be a linear relationship
between GRADE and IQ? Give a test statistic with degrees of freedom and p-value to support
your answer (you may use other evidence as well).
The scatterplot shows a linear relationship.
t = 7.14 with 76 df and P < 0.0001 indicates the slope is non-zero.
b) Individual #51 has GRADE = 0.53 and IQ = 103. What value of GRADE is predicted for this
individual by the estimated simple linear regression model? Calculate the residual ei for this
observation. The studentized residual for this individual is equal to –3.895. On that basis do
you consider this observation to be an outlier? Explain.
Predicted GRADE = -3.557 + 0.101 IQ. With IQ = 103 and the unrounded estimates from the output, we get
predicted GRADE = -3.55706 + (0.10102)(103) = -3.55706 + 10.40506 = 6.848
ei = 0.53 – 6.848 = -6.318
A studentized residual of –3.895 is large compared to a t-distribution; therefore I would
consider this value to be an outlier.
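As a check on the arithmetic in part b), the fitted value and residual for individual #51 can be computed from the unrounded estimates in OUTPUT FOR PROBLEM 3. This is only a sketch; the function names are hypothetical.

```python
# Unrounded estimates from the simple linear regression output (grade on iq).
B0 = -3.55706
B1 = 0.10102

def predict_grade(iq):
    """Fitted GRADE from the estimated simple linear regression line."""
    return B0 + B1 * iq

def residual(observed_grade, iq):
    """Observed minus fitted value."""
    return observed_grade - predict_grade(iq)

y_hat_51 = predict_grade(103)
e_51 = residual(0.53, 103)
print(f"fitted = {y_hat_51:.3f}, residual = {e_51:.3f}")
```

Using the rounded coefficients (-3.557 and 0.101) gives 6.846 instead of 6.848, a difference that comes only from rounding.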
c) The variable IQGEN is the product of IQ and GENDER. Examine the output for the model
involving these three variables. Write down the estimated regression equation for this model.
Also write down the two separate fitted lines for female and male students.
GRADE = –2.252 + 0.094 IQ – 3.842 GENDER + 0.027 IQGEN
female (GENDER = 0) : GRADE = –2.252 + 0.094 IQ
male (GENDER = 1): GRADE = (–2.252 – 3.842) + (0.094 + 0.027) IQ = –6.094 + 0.121 IQ
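The two fitted lines in part c) follow mechanically from the interaction model: setting GENDER to 0 or 1 shifts the intercept by the GENDER coefficient and the slope by the IQGEN coefficient. A minimal sketch, using the rounded coefficients from the answer above and a hypothetical helper `line_for`:

```python
# Rounded coefficients of GRADE on IQ, GENDER, and IQGEN = IQ * GENDER.
COEF = {"intercept": -2.252, "iq": 0.094, "gender": -3.842, "iqgen": 0.027}

def line_for(gender):
    """Return (intercept, slope) of the fitted GRADE-vs-IQ line for gender 0 or 1."""
    intercept = COEF["intercept"] + gender * COEF["gender"]
    slope = COEF["iq"] + gender * COEF["iqgen"]
    return intercept, slope

female_line = line_for(0)
male_line = line_for(1)
print("female: GRADE = %.3f + %.3f IQ" % female_line)
print("male:   GRADE = %.3f + %.3f IQ" % male_line)
```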
d) Examine the results of the t-tests for the three regression coefficients as well as the result of the
(general linear) F-test labelled “SAMELINE”. The results of this general linear test were
produced with the SAS input line “test gender, iqgen;”. State the null hypotheses
tested by each of these four tests and whether that hypothesis is rejected. What apparent
conflict do you see between the results of these tests? Explain why such a conflict might arise
and suggest one possible action that might be used to eliminate this conflict.
1: H0: β_IQ = 0, rejected
2: H0: β_GENDER = 0, not rejected
3: H0: β_IQGEN = 0, not rejected
4: H0: β_GENDER = β_IQGEN = 0, rejected
Tests 2 and 3 each fail to reject, yet test 4 rejects the hypothesis that both coefficients
are zero, so the results appear to conflict. This type of conflict often arises under
multicollinearity: each t-test asks whether one variable adds anything given the others, so two
highly correlated variables can each look redundant while being jointly important. Here IQGEN is
highly correlated with GENDER; the output from PROC CORR on the last page gives
Corr(IQGEN, GENDER) = 0.986. We could either eliminate one variable (presumably IQGEN) from the
model, or standardize the variables in hopes of reducing the multicollinearity.
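The SAMELINE statistic can be reproduced from the two ANOVA tables in OUTPUT FOR PROBLEM 3, treating the IQ-only model as the reduced model and the IQ, GENDER, IQGEN model as the full model. The sketch below uses the error sums of squares and degrees of freedom from that output. The function name is hypothetical and no p-value is computed.

```python
# Error sums of squares and error df from OUTPUT FOR PROBLEM 3.
SSE_REDUCED, DF_REDUCED = 203.10809, 76  # grade on iq
SSE_FULL, DF_FULL = 184.00205, 74        # grade on iq, gender, iqgen

def general_linear_f(sse_r, df_r, sse_f, df_f):
    """Return (numerator mean square, denominator mean square, F statistic)."""
    numerator_ms = (sse_r - sse_f) / (df_r - df_f)
    denominator_ms = sse_f / df_f
    return numerator_ms, denominator_ms, numerator_ms / denominator_ms

num_ms, den_ms, f_stat = general_linear_f(SSE_REDUCED, DF_REDUCED, SSE_FULL, DF_FULL)
print(f"numerator MS = {num_ms:.5f}, F = {f_stat:.2f} on (2, 74) df")
```

The numerator mean square and F value agree with the "Test sameline" table (9.55302 and 3.84).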
(23)
4. Refer to the SAS output labelled OUTPUT FOR PROBLEM 4. This continues the analysis begun
in problem 3 using GRADE, IQ, and GENDER. Now the additional variables AGE (in years) and
SC (score on a “self-concept” scale) are included (and IQGEN is removed). You may also use the
OUTPUT FOR PROBLEM 3 results for this problem.
a) Examine the results of the model that includes IQ, AGE, GENDER, and SC (the “full” model).
Which variable(s), if any, would you consider eliminating from the model? Justify your answer
extensively using information such as the results of hypothesis tests, extra sums of squares, and
R^2 values, as well as any other evidence that may support your argument.
I would eliminate AGE because 1) its t-test is not significant at the 5% level (P = 0.0723)
when all other variables are in the model; 2) it has the smallest type II SS; 3) its type I SS
is small even though it is second on the list; 4) the R^2 value with it included is 54% and
without it is 52%, which is not a big difference; and 5) most of the individuals are either 12
or 13 years old, so AGE varies little in this sample.
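The R^2 comparison and the extra sum of squares for AGE can be checked from the two ANOVA tables in OUTPUT FOR PROBLEM 4. In the sketch below the total sum of squares is computed as model SS plus error SS, and the variable names are hypothetical.

```python
import math

# Model and error sums of squares from OUTPUT FOR PROBLEM 4.
SSR_FULL, SSE_FULL, DF_ERROR_FULL = 183.86686, 155.56003, 73  # iq, age, gender, sc
SSR_NO_AGE, SSE_NO_AGE = 176.78223, 162.64466                 # iq, sc, gender

ss_total = SSR_FULL + SSE_FULL  # corrected total SS

r2_full = SSR_FULL / ss_total
r2_no_age = SSR_NO_AGE / ss_total

# Extra sum of squares for AGE given the other three predictors (its type II SS).
extra_ss_age = SSE_NO_AGE - SSE_FULL
f_age = extra_ss_age / (SSE_FULL / DF_ERROR_FULL)

print(f"R^2 full = {r2_full:.4f}, R^2 without AGE = {r2_no_age:.4f}")
print(f"extra SS for AGE = {extra_ss_age:.5f}, partial F = {f_age:.3f}")
print(f"sqrt(partial F) = {math.sqrt(f_age):.2f}")
```

The square root of the partial F matches the magnitude of the t value reported for AGE (1.82), as expected for a test of a single coefficient.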
b) Does multicollinearity appear to be an issue in this analysis? Explain your reasoning, making
specific reference to the parameter estimates and the results of hypothesis tests, as well as any
other evidence that may support your argument.
There is certainly evidence of multicollinearity. On a pairwise level, IQ is correlated with SC
and AGE. For IQ, the type I SS (136.3) and the type II SS (47.1) are very different,
indicating that some of the information contained in IQ is also represented by the remaining
variables. However, there is also evidence that the multicollinearity does not cause too many
problems in this analysis. The parameter estimates do not change enormously when the
model is changed: in the four models shown, the coefficients for IQ are 0.101, 0.094, 0.074,
and 0.084. The coefficients for SC are essentially the same in the two models in which it
appears. The coefficients for GENDER are –0.92 and –0.97, quite similar, in the two models
that do not include IQGEN, but –3.84 in the model that includes IQGEN. The big change with
the latter model indicates a problem only between those variables, which is also supported by
the fact that Corr(GENDER, IQGEN) = 0.986. In the final model, all the t-tests are
significant, indicating that these variables are significant in spite of any multicollinearity.
c) Which variable do you think is the most important explanatory variable? Do you recommend
using this variable alone in the model? Justify your answer.
IQ is the most important. It has the smallest p-value and the largest type I and type II SS.
However, the model using IQ alone has R^2 = 40% compared to R^2 = 54% for the full model,
so the other variables contribute substantially to the fit. Thus I would not recommend using
this variable alone.
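One way to quantify how much the other variables add beyond IQ is a general linear test of the IQ-only model against the full model. This test is not part of the SAS output; the sketch below is an illustration that computes it from the reported sums of squares. The resulting F statistic would be compared with an F(3, 73) table.

```python
# Sums of squares from OUTPUT FOR PROBLEM 3 (iq only) and OUTPUT FOR PROBLEM 4 (full model).
SSR_IQ, SSE_IQ, DF_ERROR_IQ = 136.31881, 203.10809, 76
SSR_FULL, SSE_FULL, DF_ERROR_FULL = 183.86686, 155.56003, 73

ss_total = SSR_IQ + SSE_IQ  # corrected total SS

r2_iq = SSR_IQ / ss_total
r2_full = SSR_FULL / ss_total

# Extra sum of squares for AGE, GENDER, and SC jointly, given IQ.
extra_ss = SSE_IQ - SSE_FULL
f_extra = (extra_ss / (DF_ERROR_IQ - DF_ERROR_FULL)) / (SSE_FULL / DF_ERROR_FULL)

print(f"R^2 (IQ only) = {r2_iq:.4f}, R^2 (full) = {r2_full:.4f}")
print(f"F for adding AGE, GENDER, SC = {f_extra:.2f} on (3, 73) df")
```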
d) What are the dimensions of the design matrix for the full model in this problem?
n = 78 and p = 5 (an intercept column plus IQ, AGE, GENDER, and SC), so the design matrix is n × p = 78 × 5
e) What proportion of the variation in GRADE is explained by the full model?
R^2 = 54%
OUTPUT FOR PROBLEM 3
The REG Procedure
Model: MODEL
Dependent Variable: grade

Analysis of Variance
                            Sum of        Mean
Source              DF     Squares      Square   F Value   Pr > F
Model                1   136.31881   136.31881     51.01   <.
Error               76   203.10809   2.
Corrected Total     77   339.

Root MSE          1.63477    R-Square   0.
Dependent Mean    7.44654    Adj R-Sq   0.
Coeff Var         21.

Parameter Estimates
                  Parameter   Standard
Variable    DF     Estimate      Error   t Value   Pr > |t|   95% Confidence Limits
Intercept    1     -3.55706    1.55176     -2.29     0.0247   -6.64766   -0.
iq           1      0.10102    0.01414      7.14     <.0001    0.07285   0.

The REG Procedure
Model: MODEL
Dependent Variable: grade

Analysis of Variance
                            Sum of        Mean
Source              DF     Squares      Square   F Value   Pr > F
Model                3   155.42484    51.80828     20.84   <.
Error               74   184.00205   2.
Corrected Total     77   339.

Root MSE          1.57687    R-Square   0.
Dependent Mean    7.44654    Adj R-Sq   0.
Coeff Var         21.

Parameter Estimates
                  Parameter   Standard
Variable    DF     Estimate      Error   t Value   Pr > |t|
Intercept    1     -2.25235    2.15377     -1.05   0.
iq           1      0.09400    0.02017      4.66   <.
gender       1     -3.84266    3.03670     -1.27   0.
iqgen        1      0.02656    0.02784      0.95   0.

Test sameline Results for Dependent Variable grade
                       Mean
Source         DF    Square   F Value   Pr > F
Numerator       2   9.55302      3.84   0.
Denominator    74   2.
[Figure: scatterplot of grade (axis 0 to 12) against iq (axis 70 to 140) with fitted line grade = -3.5571 + 0.101 iq; N = 78, Rsq = 0.4016, Adj Rsq = 0.3937, RMSE = 1.6348]
[Figure: second plot against iq (axis 70 to 140), vertical axis from -8 to 4, carrying the same fitted-line annotation]
OUTPUT FOR PROBLEM 4
The REG Procedure
Model: MODEL
Dependent Variable: grade

Analysis of Variance
                            Sum of        Mean
Source              DF     Squares      Square   F Value   Pr > F
Model                4   183.86686    45.96672     21.57   <.
Error               73   155.56003   2.
Corrected Total     77   339.

Root MSE          1.45978    R-Square   0.
Dependent Mean    7.44654    Adj R-Sq   0.
Coeff Var         19.

Parameter Estimates
                  Parameter   Standard
Variable    DF     Estimate      Error   t Value   Pr > |t|    Type I SS   Type II SS
Intercept    1      3.62511    4.43504      0.82     0.4164   4325.17293   1.
iq           1      0.07401    0.01573      4.70     <.0001    136.31881   47.
age          1     -0.52028    0.28534     -1.82     0.0723      8.58581   7.
gender       1     -0.91623    0.34531     -2.65     0.0098     15.00824   15.
sc           1      0.05166    0.01541      3.35     0.0013     23.95401   23.

The REG Procedure
Model: MODEL
Dependent Variable: grade

Analysis of Variance
                            Sum of        Mean
Source              DF     Squares      Square   F Value   Pr > F
Model                3   176.78223    58.92741     26.81   <.
Error               74   162.64466   2.
Corrected Total     77   339.

Root MSE          1.48253    R-Square   0.
Dependent Mean    7.44654    Adj R-Sq   0.
Coeff Var         19.

Parameter Estimates
                  Parameter   Standard
Variable    DF     Estimate      Error   t Value   Pr > |t|
Intercept    1     -4.05384    1.41211     -2.87   0.
iq           1      0.08412    0.01495      5.62   <.
sc           1      0.05129    0.01565      3.28   0.
gender       1     -0.96852    0.34948     -2.77   0.
The CORR Procedure
6 Variables: grade iq age sc iqgen gender

Simple Statistics
Variable    N        Mean    Std Dev         Sum    Minimum   Maximum
grade      78     7.44654    2.09956   580.83000    0.53000   10.
iq         78   108.92308   13.17097        8496   72.00000   136.
age        78    12.74359    0.63319   994.00000   12.00000   15.
sc         78    56.96154   12.41223        4443   20.00000   80.
iqgen      78    66.85897   55.44758        5215          0   136.
gender     78     0.60256    0.49254    47.00000          0   1.

Pearson Correlation Coefficients, N = 78
             grade         iq        age         sc      iqgen   gender
grade      1.00000    0.63373   -0.38927    0.54183   -0.00505   -0.
iq         0.63373    1.00000   -0.38236    0.49315    0.30884   0.
age       -0.38927   -0.38236    1.00000   -0.17808   -0.04358   0.
sc         0.54183    0.49315   -0.17808    1.00000    0.16141   0.
iqgen     -0.00505    0.30884   -0.04358    0.16141    1.00000   0.
gender    -0.09733    0.19142    0.00214    0.09519    0.98562   1.