Midterm Exam Answers - Applied Software Design | STAT 51200, Exams of Statistics

Material Type: Exam; Professor: Zhang; Class: Applied Regression Analysis; Subject: STAT-Statistics; University: Purdue University - Main Campus; Term: Spring 2009;

Uploaded on 07/30/2009


STAT 512 (Section 3) Midterm Exam

Time: 120 minutes (16 pages in total)

Name: (2 points) _____Solution Key____________________

I. (10 points) List the three major assumptions of a simple linear regression model. For each

assumption, give one way that you might check whether that assumption is satisfied and identify one

remedy that might be used to adjust for the problem if it exists.

Assumption #1: Normality of Error Terms

Diagnostic: Examine a normal Q-Q plot of the residuals for linearity; examine a histogram of the residuals; Shapiro-Wilk test

Remedy: Transformation on Y (may use Box-Cox to find one)

Assumption #2: Constancy of Variance (of error terms)

Diagnostic: Examine residual plots (vs. Y-hat or the X’s) for different “vertical spreads”; modified Levene or Breusch-Pagan tests

Remedy: A transformation on Y may help, or use weighted least squares

Assumption #3: Independence of Error Terms (if you substituted Linearity, you lost 2 points)

Diagnostic: Plot the residuals against time or sequence

Remedy: Account for dependency in the model (perhaps by including time or see Ch12)

II. (22 points) In a simple linear regression problem with sample size n = 42, the following estimates were obtained:

$b_0 = 12$, $b_1 = -6.4$, $s\{b_1\} = 2.0$, $MSE = 2.25$, $\bar{X} = 5.5$.

  1. (3 points) Write the equation for the simple linear regression model. Be sure to include a statement of assumptions.

$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$, where $\varepsilon_i \overset{iid}{\sim} N(0, \sigma^2)$, $i = 1, 2, \ldots, n$.

  2. (2 points) Write the equation for the estimated regression line.

The estimated equation is $\hat{Y}_i = 12 - 6.4 X_i$.

  3. (2 points) Calculate the estimated standard deviation of the error term in the model.

We have $s = \sqrt{MSE} = \sqrt{2.25} = 1.5$.

  4. (3 points) Give a 95% confidence interval for the slope of the regression line.

Confidence intervals are of the form: point estimate ± t-critical × SE(point estimate). In this case, the critical value comes from a t-distribution with n − 2 = 40 degrees of freedom. From the table, $t_c = t(0.975;\ df = 40) = 2.021$. So the CI is given by $-6.4 \pm (2.021)(2.0)$, or $(-10.442,\ -2.358)$.

  5. (2 points) Test the null hypothesis that $\beta_1 = -10$ against the alternative hypothesis that $\beta_1 \neq -10$. State your decision rule and give your conclusion.

Decision rule: reject the null at the 5% significance level if −10 falls outside the 95% confidence interval for $\beta_1$. Since −10 is inside the confidence interval computed in part (4), we do not have evidence against the null, and therefore cannot reject $\beta_1 = -10$ at the 5% level.
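The interval in part (4) and the decision in part (5) can be reproduced with a few lines of arithmetic. This is a minimal sketch in Python (not part of the exam); it takes the tabled critical value 2.021 as given rather than computing it, and the variable names are illustrative.

```python
# Slope estimate, its standard error, and the tabled critical value t(0.975; 40).
b1 = -6.4
se_b1 = 2.0
t_crit = 2.021  # read from a t table, as in the solution

# 95% confidence interval: point estimate +/- t_crit * standard error.
half_width = t_crit * se_b1
lower, upper = b1 - half_width, b1 + half_width
print(f"95% CI for beta_1: ({lower:.3f}, {upper:.3f})")

# Two-sided test of H0: beta_1 = -10 at the 5% level via the interval.
beta1_null = -10
reject = not (lower <= beta1_null <= upper)
print("reject H0" if reject else "fail to reject H0")
```

Using the interval for the test works because a two-sided t-test at level 0.05 rejects exactly when the hypothesized value lies outside the 95% confidence interval.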

  6. (2 points) Predict the value of a new observation Y when X = 5.

The predicted value at X = 5 is $\hat{Y}_h = 12 - 6.4(5) = -20$.

  7. (3 points) Give a 95% prediction interval for the predicted Y in (6).

We need SSX to get the variance for prediction. We note that $s^2\{b_1\} = MSE / SSX$, and so this formula may be used to get $SSX = 2.25 / 2.0^2 = 0.5625$. Thus, with $(X_h - \bar{X})^2 = 0.25$,

$s^2\{pred\} = 2.25\left(1 + \frac{1}{42} + \frac{0.25}{0.5625}\right) = 3.3036$.

The predicted value at X = 5 is −20 (from part (6)). So the prediction interval is given by $-20 \pm 2.021\sqrt{3.3036}$, or $(-23.6733,\ -16.3267)$.

  8. (3 points) Suppose you are designing a new experiment to obtain a more precise prediction interval for Y at X = 5. Please explain what you could do in the experimental design to minimize the width of the prediction interval.

Note that $s^2\{pred\} = MSE\left(1 + \frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum_i (X_i - \bar{X})^2}\right)$. So there are three things we can do: (1) add observations to increase n, (2) make $\bar{X} = 5$ so that the third term has a small (here zero) numerator, and (3) increase the spread in the X’s so that SSX in the denominator is increased.
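To see how each of the three design changes affects the prediction variance, one can evaluate the formula under a few alternative designs. The sketch below is illustrative Python; the alternative designs (n = 84, $\bar{X} = 5$, SSX = 2.25) are hypothetical, the baseline $\bar{X} = 5.5$ is an assumption inferred from the working in part (7), and MSE is assumed to stay at 2.25.

```python
def s2_pred(mse, n, x_h, x_bar, ssx):
    """Estimated variance for predicting one new observation at x_h."""
    return mse * (1 + 1 / n + (x_h - x_bar) ** 2 / ssx)

mse, x_h = 2.25, 5.0

# Baseline design (x_bar = 5.5 is an inferred assumption).
baseline = s2_pred(mse, n=42, x_h=x_h, x_bar=5.5, ssx=0.5625)

# Hypothetical redesigns, changing one factor at a time.
more_obs = s2_pred(mse, n=84, x_h=x_h, x_bar=5.5, ssx=0.5625)   # (1) larger n
centered = s2_pred(mse, n=42, x_h=x_h, x_bar=5.0, ssx=0.5625)   # (2) X_bar = X_h
wider_x = s2_pred(mse, n=42, x_h=x_h, x_bar=5.5, ssx=2.25)      # (3) larger SSX

for label, value in [("baseline", baseline), ("more obs", more_obs),
                     ("centered at 5", centered), ("wider X spread", wider_x)]:
    print(f"{label:>15}: s2_pred = {value:.4f}")
```

In every design the variance stays above MSE, because the leading 1 in the formula reflects the variability of the new observation itself.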

  9. (2 points) You would like to obtain prediction intervals at X = 2, X = 5, X = 7 and X = 8. Explain what you would do to obtain these four intervals with a family confidence level of 95%.

We would use a Bonferroni correction, taking $\alpha = 0.05 / 4 = 0.0125$ for each interval, so that each two-sided interval uses the critical value $t(1 - 0.0125/2;\ 40) = t(0.99375;\ 40)$.

III. (24 points) An analysis has 67 observations and 4 predictor variables (i.e., $X_1$, $X_2$, $X_3$, and $X_4$). Also included in the analysis are the interaction between the first two predictor variables (i.e., $X_5$), and the interaction between the last two predictor variables (i.e., $X_6$).

  1. (10 points) Complete the following ANOVA table.

Source    DF    Sum of Squares    Mean Square    F Value

Model      6             15000           2500       12.5

Error     60             12000            200

Total     66             27000
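The table entries are tied together by a few identities, so any missing cell can be filled from the others. The sketch below is illustrative Python; the source does not show which cells were given on the exam, so it assumes the model and total sums of squares are the starting point.

```python
# Design facts from the problem: 67 observations, 6 predictors
# (4 main effects plus 2 interactions).
n = 67
num_predictors = 6

# Assumed starting values (which cells were given is not shown in the source).
ss_model = 15000
ss_total = 27000

# Degrees of freedom.
df_model = num_predictors
df_total = n - 1
df_error = df_total - df_model

# Sums of squares, mean squares, and the overall F statistic.
ss_error = ss_total - ss_model
ms_model = ss_model / df_model
ms_error = ss_error / df_error
f_value = ms_model / ms_error

print(f"Model {df_model:>3} {ss_model:>7} {ms_model:>8.1f} {f_value:>6.2f}")
print(f"Error {df_error:>3} {ss_error:>7} {ms_error:>8.1f}")
print(f"Total {df_total:>3} {ss_total:>7}")
```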

  2. (5 points) Perform the test for model significance, making sure to state your hypotheses, decision rule, and conclusion.

To test for model significance we have $H_0: \beta_1 = \beta_2 = \beta_3 = \beta_4 = \beta_5 = \beta_6 = 0$ against the alternative hypothesis that some $\beta_i$ is non-zero. We reject the null if F > F(0.95; 6, 60) = 2.25. Since F = 12.5 > 2.25, we reject the null and conclude that at least one of the predictors is important.

  3. (4 points) Please describe a test of whether the first predictor should be included in the model at all, making sure to state your hypotheses and decision rule (when to reject the null hypothesis).

Because $X_1$ enters the model both directly and through the interaction $X_5$, the test for the first predictor is $H_0: \beta_1 = \beta_5 = 0$ against the alternative hypothesis that either $\beta_1$ or $\beta_5$ is non-zero. This is a general linear test comparing the full model with the reduced model that omits $X_1$ and $X_5$, with $F = \frac{(SSE_R - SSE_F)/2}{MSE_F}$. We reject the null if the F-statistic is greater than F(0.95; 2, 60) = 3.15.

  4. (5 points) Suppose the averages of $X_1$, $X_2$, $X_3$, and $X_4$ are very large. Please state the potential problem in testing for significance of the two interactions, and give a brief explanation of what you would do about it.

We may have a multicollinearity issue: when the predictors have large means, each interaction term is highly correlated with the predictors it is built from. We should check pairwise (and perhaps more complicated) correlations among the predictors. Probably the best solution is to standardize these predictors and then recalculate the interactions from the standardized values.
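A small simulation illustrates why large means cause trouble and why recentering helps. The sketch below is illustrative Python on synthetic data (means of 100 and unit standard deviations are hypothetical choices, not values from the exam); it only centers the predictors, which is the part of standardizing that removes this kind of correlation.

```python
import random
from statistics import mean

def corr(a, b):
    """Pearson correlation between two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

random.seed(0)

# Two independent synthetic predictors with large means (hypothetical data).
x1 = [100 + random.gauss(0, 1) for _ in range(200)]
x2 = [100 + random.gauss(0, 1) for _ in range(200)]

# Interaction built from the raw predictors.
raw_interaction = [a * b for a, b in zip(x1, x2)]

# Interaction rebuilt after centering each predictor at its mean.
m1, m2 = mean(x1), mean(x2)
c1 = [a - m1 for a in x1]
c2 = [b - m2 for b in x2]
centered_interaction = [a * b for a, b in zip(c1, c2)]

r_raw = corr(x1, raw_interaction)
r_centered = corr(c1, centered_interaction)
print(f"corr(X1, X1*X2) with raw predictors:      {r_raw:.3f}")
print(f"corr(X1, X1*X2) with centered predictors: {r_centered:.3f}")
```

With raw predictors the product behaves almost like a linear combination of $X_1$ and $X_2$, so the interaction carries little independent information; after centering, that linear component drops out.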

IV. (42 points) Refer to the SAS output at the back of the exam for this section. (You may detach it.)

A group of high-technology companies agreed to share employee salary information in an effort to establish salary ranges for technical positions in research and development. Data obtained for each employee included current salary (SALARY, in $1,000), a coded variable indicating whether the highest academic degree obtained is a master’s degree (MASTER = 0 for a bachelor’s degree, MASTER = 1 for a master’s degree), and years of experience since the last degree (EXPERIENCE, in years).

  1. (2 points) How many observations are used in the analysis?

There are 28 observations (since there are 27 total d.f.). The 29th and 30th rows are not observations; we are just using them to get predicted values, etc. for those values of $X_1$ and $X_2$.

  2. (2 points) The distribution of years of experience appears to be skewed (i.e., non-normal). Is this a concern? Explain.

It is not a big concern as we do not assume normality (or even symmetry) for the predictor variables.

No major outliers show up on the histogram – only skewness. However, since experience was so

skewed, we might not get as good predictive ability when years of experience are higher.

  3. (3 points) Give the estimated regression equation for the fitted multiple regression model.

$\hat{Y} = 29.66064 + 1.36486 X_1 + 10.85754 X_2$, where $X_1$ = EXPERIENCE and $X_2$ = MASTER.

  4. (3 points) Interpret the “slope” parameter estimates from your model in part (3).

For employees holding the same degree, each additional year of experience increases the expected salary by 1.36486 × $1,000. Similarly, for employees with the same years of experience, the expected salary of an employee with a master’s degree is 10.85754 × $1,000 higher than that of an employee with a bachelor’s degree. NOTE: the unit of salary is $1,000.

  5. (3 points) What are the predicted and residual values for Observation #10?

From the output for Obs #10 we see that $\hat{Y}_{10} = 41.8830$ and $e_{10} = 2.\ldots$

  6. (3 points) Test whether the model is significant. Give the null and alternative hypotheses, test statistic, p-value, and your conclusion.

The null hypothesis is $H_0: \beta_1 = \beta_2 = 0$ and we test against the alternative hypothesis $H_a: \beta_1 \neq 0$ or $\beta_2 \neq 0$. The F-statistic is 34.93 and the p-value is very small (from SAS), so we conclude that at least one of the variables is in fact important.

  7. (3 points) Test whether the variable MASTER is significant in a model already containing EXPERIENCE. Give the null and alternative hypotheses, test statistic, p-value, and your conclusion.

The null hypothesis is $H_0: \beta_2 = 0$ and we test against the alternative hypothesis $H_a: \beta_2 \neq 0$. Using the variable-added-last t-test given in SAS, the test statistic is t = 7.64 and the p-value for the test is very small (from SAS). Hence we reject $H_0: \beta_2 = 0$ and conclude that MASTER is significant in a model containing EXPERIENCE.

  8. (2 points) Which of the two variables explains more of the extra variation in the salary when the other one is already included in the model?

SSM(EXPERIENCE|MASTER) = 299.63556 and SSM(MASTER|EXPERIENCE) = 781.11474. So MASTER explains more of the extra variation in the salary.

  9. (3 points) For an employee who has worked for four years since obtaining a bachelor’s degree (his/her last degree), what is the predicted salary for the next year?

Reading off the output for “Obs.” #29 we see that the predicted value is 36.4849.

  10. (3 points) Give a confidence interval for the predicted salary in part (9).

Reading off the prediction interval (not the CI for the mean response), we have (28.6064, 44.3635).

  11. (3 points) For all employees who have worked four years since obtaining a master’s degree (their last degree), what is the average salary of these employees in the next year?

Reading off the output for “Obs.” #30 we see that the predicted value is 47.3425.

  12. (3 points) Give a confidence interval for the average salary in part (11).

Reading off the interval for the mean (the CI for the mean response), we have (45.0482, 49.6368).
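The values read from the SAS output for “Obs.” #29 and #30 can be cross-checked against the fitted equation from part (3). The sketch below is illustrative Python; it assumes EXPERIENCE = 5 for both rows (four years worked plus the coming year), an inference that is consistent with the reported predictions but is not stated explicitly in the source.

```python
# Coefficients of the fitted equation from part (3); salary is in $1,000.
b0, b_experience, b_master = 29.66064, 1.36486, 10.85754

def predict_salary(experience, master):
    """Fitted salary (in $1,000) for given years of experience and degree code."""
    return b0 + b_experience * experience + b_master * master

experience_next_year = 5  # assumption: 4 years worked + 1 for "next year"

pred_bachelor = predict_salary(experience_next_year, master=0)  # "Obs." #29
pred_master = predict_salary(experience_next_year, master=1)    # "Obs." #30
print(f"bachelor's: {pred_bachelor:.4f}, master's: {pred_master:.4f}")

# Both reported intervals should be centered on their point predictions.
pi_obs29 = (28.6064, 44.3635)   # prediction interval, part (10)
ci_obs30 = (45.0482, 49.6368)   # CI for the mean response, part (12)
mid_obs29 = sum(pi_obs29) / 2
mid_obs30 = sum(ci_obs30) / 2
print(f"interval midpoints: {mid_obs29:.4f}, {mid_obs30:.4f}")
```

The point estimate is computed the same way in parts (9) and (11); only the interval type differs, with the prediction interval for a single employee being wider than the confidence interval for the mean salary.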

  13. (3 points) Examine the diagnostic plots for the regression analysis. Identify any potential violations of the assumptions.

There are some major violations here. First, the error terms may not have constant variance. Second, there is a linear trend between the residuals and the data recording sequence, which may imply a violation of the independent-errors assumption.

  14. (3 points) Suggest remedies for anything you identified in part (13).

We would try a transformation on Y to perhaps eliminate the normality/constant-variance problems. WLS might also be used. For the dependent errors, we may have to identify any potential predictors that were omitted from the model.

  15. (3 points) Explain the difference between the residuals (RESIDUAL) and studentized residuals (STUDENT RESIDUAL) columns in the output.

The studentized residuals have been scaled: each residual is divided by its standard error. It is easier to identify outliers from these since, in theory, they are approximately N(0,1).