Material Type: Exam; Professor: Zhang; Class: Applied Regression Analysis; Subject: STAT (Statistics); University: Purdue University, Main Campus; Term: Spring 2009
Time: 120 minutes (16 pages in total)
Name: (2 points) _____Solution Key____________________
I. (10 points) List the three major assumptions of a simple linear regression model. For each
assumption, give one way that you might check whether that assumption is satisfied and identify one
remedy that might be used to adjust for the problem if it exists.
Assumption #1: Normality of Error Terms
Diagnostic: Examine a normal Q-Q plot of the residuals for linearity; examine a histogram of the residuals; run the Shapiro-Wilk test.
Remedy: Transformation on Y (may use Box-Cox to find one)
Assumption #2: Constancy of Variance (of error terms)
Diagnostic: Examine residual plots (vs. Ŷ or the X's) for differing "vertical spreads"; run the modified Levene test or the Breusch-Pagan test.
Remedy: Transformations on Y may help, or use weighted least squares.
Assumption #3: Independence of Error Terms (if you substituted Linearity, you lost 2 points)
Diagnostic: Plot the residuals against time or sequence
Remedy: Account for the dependency in the model (perhaps by including time as a predictor; see Chapter 12).
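For concreteness, here is a minimal Python sketch of these checks on simulated data, using scipy and statsmodels; the Durbin-Watson statistic is used as one standard numeric check of independence alongside the residual-vs-sequence plot.

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.stattools import durbin_watson

# Hypothetical data: a simple linear regression of y on x.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 2.0 + 1.5 * x + rng.normal(0, 1, 50)

X = sm.add_constant(x)        # design matrix with intercept
fit = sm.OLS(y, X).fit()
resid = fit.resid

# 1) Normality of errors: Shapiro-Wilk test on the residuals.
print("Shapiro-Wilk:", stats.shapiro(resid))

# 2) Constancy of variance: Breusch-Pagan test against the design matrix.
print("Breusch-Pagan:", het_breuschpagan(resid, X))

# 3) Independence of errors: Durbin-Watson statistic on the residuals in
#    time/sequence order (values near 2 suggest no first-order correlation).
print("Durbin-Watson:", durbin_watson(resid))
```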
II. (22 points) In a simple linear regression problem with sample size n = 42, the following summary statistics were obtained:

b0 = 12, b1 = −6.4, s{b1} = 2.0, MSE = 2.25, X̄ = 5.

State the regression model, including a statement of assumptions.

Y_i = β0 + β1 X_i + ε_i, where ε_i ~ iid N(0, σ²), i = 1, 2, ..., n.

The estimated equation is Ŷ_i = 12 − 6.4 X_i.

We have s = √MSE = √2.25 = 1.5.
Confidence intervals are of the form: Pt. Est. ± t-crit × SE(Pt. Est.). In this case, the critical value comes from a t-distribution with n − 2 = 40 degrees of freedom. We estimate from the table that t_c = t(0.975; 40) ≈ 2.021, so the 95% confidence interval for β1 is −6.4 ± 2.021 × 2.0 = (−10.44, −2.36).
Test H0: β1 = −10 against the alternative hypothesis that β1 ≠ −10. State your decision rule and give your conclusion.

Since −10 is in the confidence interval computed in Part (e), we do not have any evidence against the null, and therefore cannot reject β1 = −10 at the 95% confidence level.
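As a quick arithmetic check, the interval and test above can be reproduced in Python with scipy:

```python
from scipy import stats

n, b1, se_b1 = 42, -6.4, 2.0
t_crit = stats.t.ppf(0.975, df=n - 2)   # t(0.975; 40) ≈ 2.021

lower, upper = b1 - t_crit * se_b1, b1 + t_crit * se_b1
print(lower, upper)   # ≈ (-10.44, -2.36); -10 is inside, so H0 is not rejected
```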
Give a 95% prediction interval for Y at X = 5.

We have s²{b1} = MSE / SSX, so SSX = MSE / s²{b1} = 2.25 / 2.0² = 0.5625. Since X_h = X̄ = 5,

s²{pred} = MSE [1 + 1/n + (X_h − X̄)²/SSX] = 2.25 × (1 + 1/42 + 0) ≈ 2.304, giving s{pred} ≈ 1.518.

The predicted value at X = 5 is Ŷ = 12 − 6.4(5) = −20 (as computed earlier). So the interval is given by −20 ± 2.021 × 1.518, or approximately (−23.07, −16.93).
Now consider the prediction interval of Y at X = 5. Please explain what you could do in the experimental design in order to minimize the width of the prediction interval.
The prediction variance is

s²{pred} = MSE [1 + 1/n + (X_h − X̄)² / Σ(X_i − X̄)²].

So there are three things we can do: (1) add observations to increase n; (2) make X̄ = 5 (i.e., center the design at X_h) so that the third term has a small numerator; and (3) increase the spread in the X's so that SSX in the denominator is increased.
Finally, explain what you would do to obtain these four intervals with a family confidence level of 95%.

We would want to use a Bonferroni correction, taking α = 0.05/4 = 0.0125 for each individual interval.
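For example, with 40 degrees of freedom (carrying over the earlier parts), the Bonferroni-adjusted critical value would be:

```python
from scipy import stats

alpha = 0.05 / 4                           # family level split over 4 intervals
print(stats.t.ppf(1 - alpha / 2, df=40))   # ≈ 2.6, versus ≈ 2.02 unadjusted
```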
III. (24 points) An analysis has 67 observations and 4 predictor variables (i.e., X1, X2, X3, and X4). Also included in the analysis are the interaction between the first two predictor variables (i.e., X5) and the interaction between the last two predictor variables (i.e., X6).

Source    DF    Sum of Squares    Mean Square    F Value
Model      6             15000           2500       12.5
Error     60             12000            200
Total     66             27000
Test whether the regression model is significant overall. Give your hypotheses, test statistic, decision rule, and conclusion.
To test for model significance we have H0: β1 = β2 = β3 = β4 = β5 = β6 = 0 against the alternative hypothesis that some βi is non-zero. We reject the null since F = 12.5 > F(0.95; 6, 60) = 2.25. So we conclude that at least one of the predictors is important.
Explain how you would test whether the first predictor contributes to the model at all, making sure to state your hypotheses and decision rule (when to reject the null hypothesis).
To test for significance of the first predictor we have H0: β1 = β5 = 0 against the alternative hypothesis that either β1 or β5 is non-zero. We will reject the null if the F-statistic's value is greater than F(0.95; 2, 60) = 3.15.
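That F-statistic comes from the general linear test comparing the full model to the reduced model without X1 and X5. The reduced-model SSE is not given in this problem, so the value below is a placeholder; the function itself is just the standard formula.

```python
from scipy import stats

def general_linear_test_F(sse_reduced, sse_full, n_dropped, df_error_full):
    """F = [(SSE_R - SSE_F) / (# coefficients dropped)] / MSE_F."""
    return ((sse_reduced - sse_full) / n_dropped) / (sse_full / df_error_full)

# Full model: SSE = 12000 on 60 error df (from the ANOVA table above).
# sse_reduced = 13500 is a hypothetical value for illustration only.
F = general_linear_test_F(sse_reduced=13500, sse_full=12000,
                          n_dropped=2, df_error_full=60)
print(F, ">", stats.f.ppf(0.95, 2, 60))   # reject H0 when F > 3.15
```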
Identify a potential problem in testing for significance of the two interactions, and give a brief explanation of what you would do about it.
We may have a MULTICOLLINEARITY issue due to the large mean values of the predictors. We should check pairwise and perhaps more complicated correlations among the predictors. Probably the best solution would be to standardize these predictors and recalculate the interactions.
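A minimal numpy sketch of that remedy (centering, the key step in standardizing, applied to hypothetical data) shows why it helps:

```python
import numpy as np

# Hypothetical predictors with large means, as in the concern above.
rng = np.random.default_rng(1)
x1 = rng.normal(100, 10, 67)
x2 = rng.normal(100, 10, 67)

raw_x5 = x1 * x2                          # interaction from the raw predictors
c1, c2 = x1 - x1.mean(), x2 - x2.mean()   # centered predictors
cen_x5 = c1 * c2                          # interaction from centered predictors

print(np.corrcoef(x1, raw_x5)[0, 1])      # large (≈ 0.7 here): collinearity
print(np.corrcoef(c1, cen_x5)[0, 1])      # near 0 after centering
```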
IV. (42 points) Refer to the SAS output at the back of the exam for this section. (You may detach it.)
A group of high-technology companies agreed to share employee salary information in an effort to
establish salary ranges for technical positions in research and development. Data obtained for each
employee included current salary (i.e., SALARY, in $1,000s), a coded variable indicating whether the highest academic degree obtained is a master's degree (i.e., MASTER = 0 for a bachelor's degree, and MASTER = 1 for a master's degree), and years of experience since the last degree (i.e., EXPERIENCE, in number of years).
There are 28 observations (since there are 27 total d.f.). The 29th and 30th rows are not observations; we are just using them to get predicted values, etc., for those values of X1 and X2.
Is the shape of the histogram of EXPERIENCE a concern? Explain.
It is not a big concern as we do not assume normality (or even symmetry) for the predictor variables.
No major outliers show up on the histogram – only skewness. However, since experience was so
skewed, we might not get as good predictive ability when years of experience are higher.
The fitted equation is Ŷ = 29.66064 + 1.36486 X1 + 10.85754 X2, where X1 = EXPERIENCE and X2 = MASTER.
For employees holding the same degree, the expected salary increases by 1.36486×$1,000 for each additional year of experience. Similarly, the expected salary of an employee with a master's degree is 10.85754×$1,000 higher than that of an employee with a bachelor's degree. NOTE: the unit of SALARY is $1,000.
From the output for Obs. #10 we see that Ŷ10 = 41.8830 and ê10 = 2.…
Test whether the regression relation is significant overall. Give the null and alternative hypotheses, test statistic, p-value, and your conclusion.

The null hypothesis is H0: β1 = β2 = 0 and we test against the alternative hypothesis Ha: β1 ≠ 0 or β2 ≠ 0. The F-statistic is 34.93 and the p-value is very small (from SAS), so we conclude that at least one of the variables is in fact important.
Test whether MASTER is significant in a model already containing EXPERIENCE. Give the null and alternative hypotheses, test statistic, p-value, and your conclusion.

The null hypothesis is H0: β2 = 0 and we test against the alternative hypothesis Ha: β2 ≠ 0. Using the variable-added-last t-test given in SAS, the test statistic is t = 7.64 and the p-value for the test is very small (from SAS). Hence we reject H0 and conclude that MASTER is significant in a model containing EXPERIENCE.
Which variable explains more of the extra variation in SALARY when the other one is already included in the model?
SSM(EXPERIENCE|MASTER)=299.63556 and SSM(MASTER|EXPERIENCE)=781.11474. So
MASTER explains more of the extra variation in the salary.
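These extra sums of squares are sequential (Type I) sums of squares, which depend on the order in which the variables enter the model. A sketch of how they could be reproduced with statsmodels, on a simulated stand-in for the exam's data set:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Simulated stand-in for the 28-employee data set.
rng = np.random.default_rng(2)
df = pd.DataFrame({"EXPERIENCE": rng.integers(1, 20, 28),
                   "MASTER": rng.integers(0, 2, 28)})
df["SALARY"] = (29.7 + 1.36 * df["EXPERIENCE"] + 10.9 * df["MASTER"]
                + rng.normal(0, 3, 28))

# The MASTER row of the first table is SSM(MASTER | EXPERIENCE); the
# EXPERIENCE row of the second table is SSM(EXPERIENCE | MASTER).
print(anova_lm(smf.ols("SALARY ~ EXPERIENCE + MASTER", data=df).fit(), typ=1))
print(anova_lm(smf.ols("SALARY ~ MASTER + EXPERIENCE", data=df).fit(), typ=1))
```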
For a new employee with the given years of experience whose highest degree is a bachelor's, what is the predicted salary for the next year?
Reading off the output for “Obs.” #29 we see that the predicted value is 36.4849.
Again reading off for the prediction interval (not the CI for the mean response), we have (28.6064, 44.3635).
For the group of employees described in observation #30, what is the average salary of these employees in the next year?
Reading off the output for “Obs.” #30 we see that the predicted value is 47.3425.
Again reading off for the mean interval (the CI for the mean response), we have (45.0482, 49.6368).
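In statsmodels, the analogue of these two SAS columns is get_prediction: continuing the simulated fit from the sketch above, obs_ci_* gives the prediction interval (as read for Obs. #29) and mean_ci_* the confidence interval for the mean response (as read for Obs. #30); the X values below are placeholders.

```python
import pandas as pd

fit = smf.ols("SALARY ~ EXPERIENCE + MASTER", data=df).fit()
new = pd.DataFrame({"EXPERIENCE": [5, 13], "MASTER": [0, 1]})  # placeholder X's

frame = fit.get_prediction(new).summary_frame(alpha=0.05)
print(frame[["mean", "obs_ci_lower", "obs_ci_upper"]])    # prediction interval
print(frame[["mean", "mean_ci_lower", "mean_ci_upper"]])  # CI for mean response
```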
Examine the residual plots in the output and comment on any violations of the assumptions.
There are some MAJOR violations here. First of all, the error terms may not have constant variances. Secondly, there is a linear trend between the residuals and the data-recording sequence, which may imply a violation of the independent-errors assumption.
We will want to try a transformation on Y to perhaps eliminate the normality/constant-variance problems. WLS might also be used. For the dependent errors, we may have to identify potential predictors that were ignored.
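If the weighted-least-squares route were taken, one common two-step recipe (a sketch on simulated data, not necessarily the approach intended by the key) estimates the error standard deviation from a first OLS fit and weights by its inverse square:

```python
import numpy as np
import statsmodels.api as sm

# Simulated data whose error variance grows with x.
rng = np.random.default_rng(3)
x = rng.uniform(1, 10, 60)
y = 3 + 2 * x + rng.normal(0, 0.5 * x)

X = sm.add_constant(x)
ols = sm.OLS(y, X).fit()

# Model |residual| on X to estimate the error sd, then weight by 1/sd^2.
sd_hat = sm.OLS(np.abs(ols.resid), X).fit().fittedvalues
weights = 1.0 / np.clip(sd_hat, 1e-6, None) ** 2

wls = sm.WLS(y, X, weights=weights).fit()
print(wls.params, wls.bse)   # coefficients near OLS, more trustworthy SEs
```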
Explain the difference between the residual (RESIDUAL) and studentized residual (STUDENT RESIDUAL) columns in the output.
The studentized residuals have been scaled; that is, each residual is divided by its estimated standard error. It is easier to identify outliers from these since, in theory, they are approximately N(0, 1).
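Concretely, the internally studentized residual is r_i = e_i / √(MSE(1 − h_ii)), where h_ii is the leverage of observation i; a sketch of computing both columns with statsmodels on simulated data:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import OLSInfluence

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, 30)
y = 1 + 2 * x + rng.normal(0, 1, 30)

fit = sm.OLS(y, sm.add_constant(x)).fit()

raw = fit.resid                                         # RESIDUAL column
student = OLSInfluence(fit).resid_studentized_internal  # STUDENT RESIDUAL column
print(np.abs(student).max())   # flag observations with |r_i| much beyond 2
```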