Midterm Exam 1 with Answers - Applied Regression Analysis | STAT 51200, Exams of Statistics

Material Type: Exam; Class: Applied Regression Analysis; Subject: STAT-Statistics; University: Purdue University - Main Campus; Term: Fall 2002;

Uploaded on 07/30/2009

Statistics 512 Midterm Exam 1

Thursday, October 03, 2002

Time: 75 minutes

Instructor: Dr. K. L. Simonsen

Name: Answers.

You may detach the “Output” pages from the back of the exam. It is not necessary to hand those in.

This exam is open-book and open-notes. Calculators are permitted.

Please

  • do not cheat on this exam
  • circle your answers where appropriate
  • write your solution clearly and legibly so that I can follow it
  • cross out your mistakes; do not erase large quantities of work
  • use the back of a page if you run out of room and indicate this on the question page
  • move on if you get stuck
  • do not leave the room until the end of the class period

Question Possible Actual

Total 75

(9)

1. List three major assumptions in simple linear regression following a), b), and c) below. For each

assumption, (i) describe one way that you could check that assumption and (ii) suggest one thing

you might try if that assumption is violated.

a) Assumption:

linear relationship

(i) Diagnostic:

scatterplot

(ii) Remedial:

Transform Y or X, or use polynomial regression

b) Assumption:

Normally distributed errors

(i) Diagnostic:

Qqplot (normal probability plot) or histogram of residuals

(ii) Remedial:

Transformation of Y

c) Assumption:

Constant variance

(i) Diagnostic:

Residual plot (vs. X or Ŷ)

(ii) Remedial:

Weighted regression, variance-stabilizing transformation

(20)

2. In a simple linear regression problem with n = 62, the following estimates were obtained:

b₀ = 18, b₁ = -3, s{b₁} = 1, MSE = 4, X̄ = 5.

a) Write the equation for the simple linear regression model.

Y = β₀ + β₁X + ε, or Yᵢ = β₀ + β₁Xᵢ + εᵢ

b) Write the estimated regression line.

Ŷ = 18 – 3X

c) Predict the value of Y when X = 5.

Ŷ = 18 – (3)(5) = 18 – 15 = 3

d) Calculate the estimated standard deviation of the error term in the model.

s = √MSE = √4 = 2

e) Give a 95% confidence interval for the slope of the regression line.

tc = t(1-α/2, n-2) = t(0.975, 60) ≈ 2.000

CI = b₁ ± tc s{b₁} = -3 ± (2)(1) = -3 ± 2 = [-5, -1]
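
The interval in part e) can be checked with a few lines of Python. This is only a sketch: the critical value is typed in from a t-table (an assumption, rounded to three decimals) rather than computed, and the variable names are illustrative.

```python
# Values given in Problem 2
b1 = -3.0      # estimated slope
s_b1 = 1.0     # estimated standard error of the slope, s{b1}
n = 62
df = n - 2     # error degrees of freedom

# t(0.975, 60) read from a t-table (assumed value, rounded)
t_crit = 2.000

half_width = t_crit * s_b1
ci_slope = (b1 - half_width, b1 + half_width)
print(f"95% CI for the slope with {df} df: [{ci_slope[0]:.1f}, {ci_slope[1]:.1f}]")
```

Because the interval excludes zero, it is consistent with rejecting a zero slope at the 5% level.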

f) Give a 95% prediction interval for a new observation Y when X = 5.

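\[
s^2\{\text{pred}\} = MSE\left[1 + \frac{1}{n} + \frac{(X_h - \bar{X})^2}{SS_X}\right] = 4\left[1 + \frac{1}{62} + \frac{(5 - 5)^2}{SS_X}\right] = 4 \times 1.01613 = 4.0645
\]

\[
s\{\text{pred}\} = \sqrt{4.0645} = 2.016
\]

\[
CI = \hat{Y}_h \pm t_c \, s\{\text{pred}\} = 3 \pm (2)(2.016) = 3 \pm 4.032 = [-1.032,\ 7.032]
\]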
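
The prediction interval in part f) can be reproduced as follows. This is a sketch under two assumptions: the value 5 given in the problem is read as the sample mean of X, and the critical value 2.000 is taken from a t-table. SS_X is not given in the problem, so `ss_x` below is a hypothetical placeholder; it has no effect here because the new X equals the mean.

```python
import math

mse = 4.0
n = 62
x_bar = 5.0     # assumed to be the sample mean of X
x_h = 5.0       # X value for the new observation
t_crit = 2.000  # t(0.975, 60) from a t-table (assumed, rounded)

def pred_variance(mse, n, x_h, x_bar, ss_x):
    """Estimated variance of the prediction error for a new observation at x_h."""
    return mse * (1 + 1 / n + (x_h - x_bar) ** 2 / ss_x)

ss_x = 1.0  # hypothetical placeholder; irrelevant because x_h == x_bar

y_hat_h = 18 - 3 * x_h
s2_pred = pred_variance(mse, n, x_h, x_bar, ss_x)
s_pred = math.sqrt(s2_pred)
lower = y_hat_h - t_crit * s_pred
upper = y_hat_h + t_crit * s_pred
print(f"s2_pred = {s2_pred:.4f}, s_pred = {s_pred:.3f}")
print(f"95% prediction interval: [{lower:.3f}, {upper:.3f}]")
```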

(23)

3. Refer to the SAS output marked OUTPUT FOR PROBLEM 3. The data are from a study of 78

seventh-grade students. The goal is to predict GRADE (average school grade on a scale of 0 to 11) from

variables which include IQ (score on an I.Q. test) and GENDER (0 = female, 1 = male).

a) Using the output for the simple linear regression, does there appear to be a linear relationship

between GRADE and IQ? Give a test statistic with degrees of freedom and p-value to support

your answer (you may use other evidence as well).

The scatterplot shows a linear relationship.

t = 7.14 with 76 df and P < 0.0001 indicates the slope is non-zero.

b) Individual #51 has GRADE = 0.53 and IQ = 103. What value of GRADE is predicted for this

individual by the estimated simple linear regression model? Calculate the residual ei for this

observation. The studentized residual for this individual is equal to –3.895. On that basis do

you consider this observation to be an outlier? Explain.

GRADE = -3.557 + 0.101 IQ. With IQ = 103, we get

GRADE = -3.557 + (0.101)(103) ≈ 6.848 (using the unrounded estimates -3.55706 and 0.10102 from the output).

ei = 0.53 – 6.848 = -6.318

A studentized residual of –3.895 is large compared to a t-distribution; therefore I would

consider this value to be an outlier.
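
A short check of the fitted value and residual for individual #51, using the parameter estimates printed in the SAS output. This is a sketch with illustrative variable names; it also shows how much the answer moves if the coefficients are rounded to three decimals first.

```python
# Parameter estimates from the simple linear regression output
b0, b1 = -3.55706, 0.10102

# Observation #51
iq, grade = 103, 0.53

predicted = b0 + b1 * iq
residual = grade - predicted

# Same calculation with coefficients rounded to three decimals
predicted_rounded = -3.557 + 0.101 * iq

print(f"predicted = {predicted:.3f}, residual = {residual:.3f}")
print(f"predicted with rounded coefficients = {predicted_rounded:.3f}")
```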

c) The variable IQGEN is the product of IQ and GENDER. Examine the output for the model

involving these three variables. Write down the estimated regression equation for this model.

Also write down the two separate fitted lines for female and male students.

GRADE = –2.252 + 0.094 IQ – 3.842 GENDER + 0.027 IQGEN

female (GENDER = 0) : GRADE = –2.252 + 0.094 IQ

male (GENDER = 1): GRADE = (–2.252 – 3.842) + (0.094 + 0.027) IQ = –6.094 + 0.121 IQ
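
The two fitted lines follow mechanically from the interaction model: setting GENDER to 0 or 1 shifts the intercept by the GENDER coefficient and the slope by the IQGEN coefficient. A minimal sketch, using the rounded coefficients from the answer above (the function name is illustrative):

```python
# Rounded coefficients from the estimated regression equation
b0, b_iq, b_gender, b_iqgen = -2.252, 0.094, -3.842, 0.027

def fitted_line(gender):
    """Return (intercept, slope) of GRADE on IQ for gender = 0 (female) or 1 (male)."""
    intercept = b0 + b_gender * gender
    slope = b_iq + b_iqgen * gender
    return intercept, slope

female = fitted_line(0)
male = fitted_line(1)
print(f"female: GRADE = {female[0]:.3f} + {female[1]:.3f} IQ")
print(f"male:   GRADE = {male[0]:.3f} + {male[1]:.3f} IQ")
```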

d) Examine the results of the t-tests for the three regression coefficients as well as the result of the

(general linear) F-test labelled “SAMELINE”. The results of this general linear test were

produced with the SAS input line “test gender, iqgen;”. State the null hypotheses

tested by each of these four tests and whether that hypothesis is rejected. What apparent

conflict do you see between the results of these tests? Explain why such a conflict might arise

and suggest one possible action that might be used to eliminate this conflict.

1: βIQ = 0 rejected

2: βGENDER = 0 not rejected

3: βIQGEN = 0 not rejected

4: βGENDER = βIQGEN = 0 rejected

The results of tests 2 and 3 contradict the results of test 4, which says that the two

coefficients are not both zero. This type of conflict often arises in cases of multicollinearity.

In this case we suspect that IQGEN is highly correlated with GENDER, and in fact the output from

PROC CORR on the last page tells us that Corr(IQGEN, GENDER) = 0.986. We could either

eliminate one variable (presumably IQGEN) from the model, or standardize the variables in

hopes of removing multicollinearity.
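
The SAMELINE F statistic can be reproduced from the two ANOVA tables in OUTPUT FOR PROBLEM 3 as a general linear test, comparing the reduced model (IQ only) with the full model (IQ, GENDER, IQGEN). A sketch with illustrative variable names; the p-value is not computed here.

```python
# Error sums of squares and degrees of freedom from the SAS output
sse_reduced, df_reduced = 203.10809, 76   # model with IQ only
sse_full, df_full = 184.00205, 74         # model with IQ, GENDER, IQGEN

# General linear test: F = [(SSE_R - SSE_F) / (df_R - df_F)] / (SSE_F / df_F)
numerator_ms = (sse_reduced - sse_full) / (df_reduced - df_full)
denominator_ms = sse_full / df_full
f_stat = numerator_ms / denominator_ms

print(f"numerator mean square = {numerator_ms:.5f}")
print(f"F = {f_stat:.2f} on {df_reduced - df_full} and {df_full} df")
```

The numerator mean square and F value agree with the "Test sameline" lines in the output (9.55302 and 3.84).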

(23)

4. Refer to the SAS output labelled OUTPUT FOR PROBLEM 4. This continues the analysis begun

in problem 3 using GRADE, IQ, and GENDER. Now the additional variables AGE (in years) and

SC (score on a “self-concept” scale) are included (and IQGEN is removed). You may also use the

OUTPUT FOR PROBLEM 3 results for this problem.

a) Examine the results of the model that includes IQ, AGE, GENDER, and SC (the “full” model).

Which variable(s), if any, would you consider eliminating from the model? Justify your answer

extensively using information such as the results of hypothesis tests, extra sums of squares, and

R^2 values, as well as any other evidence that may support your argument.

I would eliminate AGE because 1) its t-test is not significant (P = 0.0723) when all other variables

are in the model; 2) it has the smallest type II SS; 3) its type I SS is small even though it is second

on the list; 4) the R^2 value with it included is 54% and without it is 52%, which is not a big

difference; and 5) most of the individuals are either 12 or 13 years old, so AGE varies little in this sample.
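
The extra sum of squares for AGE can be recovered from the two models in OUTPUT FOR PROBLEM 4, and the resulting partial F statistic should match the square of the t statistic for AGE. A sketch using only numbers printed in the output; variable names are illustrative.

```python
# Full model: IQ, AGE, GENDER, SC
sse_full, df_full = 155.56003, 73
# Reduced model: IQ, SC, GENDER (AGE dropped)
sse_reduced, df_reduced = 162.64466, 74

# Extra sum of squares SSR(AGE | IQ, GENDER, SC)
extra_ss = sse_reduced - sse_full
mse_full = sse_full / df_full
f_stat = (extra_ss / (df_reduced - df_full)) / mse_full

# t statistic for AGE from its estimate and standard error in the full model
t_age = -0.52028 / 0.28534

print(f"extra SS for AGE = {extra_ss:.5f}")
print(f"partial F = {f_stat:.3f}, t^2 = {t_age ** 2:.3f}")
```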

b) Does multicollinearity appear to be an issue in this analysis? Explain your reasoning, making

specific reference to the parameter estimates and the results of hypothesis tests, as well as any

other evidence that may support your argument.

There is certainly evidence of multicollinearity. On a pairwise level, IQ is correlated with SC

and AGE. For IQ, the type I SS (136.3) and the type II SS (47.1) are very different,

indicating that some of the information contained in IQ is also represented by the remaining

variables. However, there is also evidence that the multicollinearity does not cause too many

problems in this analysis. The parameter estimates do not change enormously when the

model is changed: in the four models shown, the coefficients for IQ are 0.101, 0.094, 0.074,

and 0.084. The coefficients for SC are essentially the same in the two models in which it

appears. The coefficients for GENDER are –0.91 and –0.97, quite similar, in the two models

that do not include IQGEN, but –3.84 in the model that includes IQGEN. The big change with

the latter model indicates a problem only between those variables, which is also supported by

the fact that Corr(GENDER, IQGEN) = 0.986. In the final model, all the t-tests are

significant, indicating that these variables are significant in spite of any multicollinearity.

c) Which variable do you think is the most important explanatory variable? Do you recommend

using this variable alone in the model? Justify your answer.

IQ is the most important. It has the smallest p-value and the largest type I and type II SS.

However, the model using IQ alone has R^2 = 40% compared to R^2 = 54% for the full model,

so the other variables contribute substantially to the fit. Thus I would not recommend using

this variable alone.

d) What are the dimensions of the design matrix for the full model in this problem?

n = 78, p = 5 (an intercept column plus IQ, AGE, GENDER, and SC), so the matrix is n × p = 78 × 5

e) What proportion of the variation in GRADE is explained by the full model?

R^2 = 54%
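
The R^2 values quoted in the answers can be recomputed as SSM / SSTO from the Model and Error sums of squares in the ANOVA tables. A sketch; the dictionary keys are illustrative labels, not SAS model names.

```python
# (model SS, error SS) for each fitted model, from the SAS output
anova = {
    "iq": (136.31881, 203.10809),
    "iq, gender, iqgen": (155.42484, 184.00205),
    "iq, age, gender, sc": (183.86686, 155.56003),
    "iq, sc, gender": (176.78223, 162.64466),
}

# R^2 = SSM / (SSM + SSE)
r_squared = {name: ssm / (ssm + sse) for name, (ssm, sse) in anova.items()}

for name, r2 in r_squared.items():
    print(f"{name:22s} R^2 = {r2:.4f}")
```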

OUTPUT FOR PROBLEM 3

The REG Procedure
Model: MODEL
Dependent Variable: grade

Analysis of Variance

                           Sum of        Mean
Source            DF      Squares      Square   F Value   Pr > F
Model              1    136.31881   136.31881     51.01   <.
Error             76    203.10809          2.
Corrected Total   77         339.

Root MSE          1.63477    R-Square   0.
Dependent Mean    7.44654    Adj R-Sq   0.
Coeff Var             21.

Parameter Estimates

                 Parameter   Standard
Variable    DF    Estimate      Error   t Value   Pr > |t|   95% Confidence Limits
Intercept    1    -3.55706    1.55176     -2.29     0.0247   -6.64766   -0.
iq           1     0.10102    0.01414      7.14     <.0001    0.07285    0.

The REG Procedure
Model: MODEL
Dependent Variable: grade

Analysis of Variance

                           Sum of        Mean
Source            DF      Squares      Square   F Value   Pr > F
Model              3    155.42484    51.80828     20.84   <.
Error             74    184.00205          2.
Corrected Total   77         339.

Root MSE          1.57687    R-Square   0.
Dependent Mean    7.44654    Adj R-Sq   0.
Coeff Var             21.

Parameter Estimates

                 Parameter   Standard
Variable    DF    Estimate      Error   t Value   Pr > |t|
Intercept    1    -2.25235    2.15377     -1.05   0.
iq           1     0.09400    0.02017      4.66   <.
gender       1    -3.84266    3.03670     -1.27   0.
iqgen        1     0.02656    0.02784      0.95   0.

Test sameline Results for Dependent Variable grade

                         Mean
Source         DF      Square   F Value   Pr > F
Numerator       2     9.55302      3.84   0.
Denominator    74          2.

[Figure: scatterplot of grade (0 to 12) versus iq (70 to 140) with the fitted line grade = -3.5571 + 0.101 iq; N = 78, Rsq = 0.4016, Adj Rsq = 0.3937, RMSE = 1.6348]

[Figure: second plot for the same fitted line, vertical axis from -8 to 4, versus iq (70 to 140)]

OUTPUT FOR PROBLEM 4

The REG Procedure
Model: MODEL
Dependent Variable: grade

Analysis of Variance

                           Sum of        Mean
Source            DF      Squares      Square   F Value   Pr > F
Model              4    183.86686    45.96672     21.57   <.
Error             73    155.56003          2.
Corrected Total   77         339.

Root MSE          1.45978    R-Square   0.
Dependent Mean    7.44654    Adj R-Sq   0.
Coeff Var             19.

Parameter Estimates

                 Parameter   Standard
Variable    DF    Estimate      Error   t Value   Pr > |t|    Type I SS   Type II SS
Intercept    1     3.62511    4.43504      0.82     0.4164   4325.17293   1.
iq           1     0.07401    0.01573      4.70     <.0001    136.31881   47.
age          1    -0.52028    0.28534     -1.82     0.0723      8.58581   7.
gender       1    -0.91623    0.34531     -2.65     0.0098     15.00824   15.
sc           1     0.05166    0.01541      3.35     0.0013     23.95401   23.

The REG Procedure
Model: MODEL
Dependent Variable: grade

Analysis of Variance

                           Sum of        Mean
Source            DF      Squares      Square   F Value   Pr > F
Model              3    176.78223    58.92741     26.81   <.
Error             74    162.64466          2.
Corrected Total   77         339.

Root MSE          1.48253    R-Square   0.
Dependent Mean    7.44654    Adj R-Sq   0.
Coeff Var             19.

Parameter Estimates

                 Parameter   Standard
Variable    DF    Estimate      Error   t Value   Pr > |t|
Intercept    1    -4.05384    1.41211     -2.87   0.
iq           1     0.08412    0.01495      5.62   <.
sc           1     0.05129    0.01565      3.28   0.
gender       1    -0.96852    0.34948     -2.77   0.

The CORR Procedure

6 Variables: grade iq age sc iqgen gender

Simple Statistics

Variable    N        Mean    Std Dev         Sum    Minimum   Maximum
grade      78     7.44654    2.09956   580.83000    0.53000   10.
iq         78   108.92308   13.17097        8496   72.00000   136.
age        78    12.74359    0.63319   994.00000   12.00000   15.
sc         78    56.96154   12.41223        4443   20.00000   80.
iqgen      78    66.85897   55.44758        5215          0   136.
gender     78     0.60256    0.49254    47.00000          0   1.

Pearson Correlation Coefficients, N = 78

             grade         iq        age         sc      iqgen   gender
grade      1.00000    0.63373   -0.38927    0.54183   -0.00505   -0.
iq         0.63373    1.00000   -0.38236    0.49315    0.30884   0.
age       -0.38927   -0.38236    1.00000   -0.17808   -0.04358   0.
sc         0.54183    0.49315   -0.17808    1.00000    0.16141   0.
iqgen     -0.00505    0.30884   -0.04358    0.16141    1.00000   0.
gender    -0.09733    0.19142    0.00214    0.09519    0.98562   1.