Download Lecture Notes on Linear Regression Diagnostic and Variable Transformations | STAT 431 and more Study notes Statistics in PDF only on Docsity!
Statistics 431:
Statistical Inference
Lecture 17: Linear regression diagnostics and
variable transformations
Regression diagnostics
- (^) The conditions required for inference from simple linear regression must be
checked:
- (^) Linearity. Diagnostic: residual plot.
- (^) Constant variance. Diagnostic: residual plot.
- (^) Normality. Diagnostics:
- histogram of residuals.
- normal quantile plot of residuals.
- (^) Independence. Diagnostic: residual plot.
- (^) Outliers and influential points. Diagnostic: scatterplot, Cook’s distances.
Bivariate Fit of HousePrice By CrimeRate
HousePrice
CrimeRate
Summary of Fit
Residual
Distri
Resid
3
CrimeRate
Summary of Fit
RSquare 0.
RSquare Adj 0.
Root Mean Square Error 78861.
Mean of Response 158464.
Observations (or Sum Wgts) 98
Parameter Estimates
Term Estimate Std Error t Ratio Prob>|t|
Intercept 225233.55 16404.02 13.73 <.
CrimeRate -2288.689 491.5375 -4.66 <.
Residual plots
- (^) A residual plot is a scatterplot of each residual ei vs its corresponding xi.
- (^) A residual plot can suggest nonlinearity and/or non-constant variance, both
violations of the assumptions in the ideal linear model.
- (^) If the ideal model holds, the residual plot should contain patternless scatter.
- (^) The residuals should be centered around a horizontal line at 0. A nonlinear
regression function μ( x ) will lead to a violation of this constraint. For example, the residuals on the left and right might be generally above zero, and those in the middle generally below zero.
- (^) The residuals should have the same spread around zero, at all x values. If,
for example, the residual plot has a “bulge” in the model, the constant-variance assumption has been violated.
- (^) Look for pronounced patterns. No residual plot will satisfy you as being
“perfectly” patternless (not even when we do a regression on data simulated from the ideal linear model!).
0
100000
200000
300000
Residual
10 20 30 40 50 60 70 CrimeRate
Distributions
Residuals HousePrice
- (^) Some evidence of upwards curvature in the average residuals at the low and high end of crime rate.
DisplayFeet
0
100
Residual
0 1 2 3 4 5 6 7 8
DisplayFeet
Residual plot shows mean residual less
1 and 7 display feet and greater than ze
- (^) The residual plot shows negative mean residuals at 1 foot and 7 feet, and positive mean residuals elsewhere.
- (^) What does this indicate?
Examining and Transforming Data Tukey’s bulging rule
Figure 15. Mosteller and Tukey’s bulging rule for selecting linearizing trans ormations.
10
Display effect: Transformations to
x
0
100
200
300
400
Sales
0 1 2 3 4 5 6 7 8 DisplayFeet
0
50 Residual
0 1 2 3 4 5 6 7 8 DisplayFeet
- (^) Still looks like overestimation at 1 foot and underestimation at 2–3 feet.
Display effect: Transformations to log x
0
100
200
300
400
Sales
0 1 2 3 4 5 6 7 8 DisplayFeet
0
50
100
Residual
0 1 2 3 4 5 6 7 8 DisplayFeet
- (^) This is a slight improvement.
Heteroscedasticity
- (^) You may have already forgotten that the ideal linear model assumes all the
distrns Y | X = x have the same variance σ 2. This is homoscedasticity.
- (^) When this assumption is violated, we have non-constant variances, or
heteroscedasticity.
- (^) To detect non-constant variance, look at the residual plot. If the vertical
spread of residuals is not the same from left to right, the constant variance assumption is in trouble. E.g.,
- “Fan pattern”: spread increases from left to right
- “Inverse fan pattern”: spread decreases from left to right
- (^) Consequences of heteroscedasticity:
- Hypothesis tests and confidence intervals for β’s are misspecified.
- But, least-squares estimates ( βˆ 0 , βˆ 1 ) are still unbiased.
- (^) What to do: try transformations of Y.
Another residual plot
- (^) We can plot residuals ei against their corresponding fitted values y ˆ i , rather
than against xi.
- (^) For simple linear regression, this is just an affine transformation of X (to
β^ ˆ 0 + ˆβ 1 X ): it moves the whole residual cloud up or down, and stretches the x axis, but doesn’t change the conclusions we draw.
- (^) However, it is useful in multiple linear regression, when there are many X
variables to choose from.
Heteroscedasticity
- When the requirement of a constant variance is
violated we have a condition of heteroscedasticity.
The spread increases with y^
y^
Residual
^y
++
+^ +
+^ ++
Display effect example
0
50
100
Residual
0 1 2 3 4 5 6 7 8 DisplayFeet
Distributions Residuals Sales
-50 0 50 100
No strong indication of violation of constant variance or
normality assumptions.
- (^) Nothing strongly suggests heteroscedasticity or departure from normality.
Residual plots vs time
- (^) A confounder is a variable that is not included among the covariates in a
study but influences both the covariates and the response.
- (^) Plots of the residuals versus the temporal or spatial order of observation
can reveal
- serial correlation (violation of independence)
- confounding variables associated with time and space (multiple regression can be used to account for these variables).
- (^) Example: the math department at a university must plan for the number of
instructors in large service courses. The goal is to predict enrollment in such courses Y based on number of first-year students X.