Lecture Notes on Linear Regression Diagnostic and Variable Transformations | STAT 431, Study notes of Statistics

Material Type: Notes; Class: STATISTICAL INFERENCE; Subject: Statistics; University: University of Pennsylvania; Term: Unknown 1989;

Typology: Study notes

Pre 2010

Uploaded on 03/28/2010

koofers-user-a6p
koofers-user-a6p 🇺🇸

3

(1)

10 documents

1 / 22

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
Statistics 431:
Statistical Inference
Lecture 17: Linear regression diagnostics and
variable transformations
pf3
pf4
pf5
pf8
pf9
pfa
pfd
pfe
pff
pf12
pf13
pf14
pf15
pf16

Partial preview of the text

Download Lecture Notes on Linear Regression Diagnostic and Variable Transformations | STAT 431 and more Study notes Statistics in PDF only on Docsity!

Statistics 431:

Statistical Inference

Lecture 17: Linear regression diagnostics and

variable transformations

Regression diagnostics

  • (^) The conditions required for inference from simple linear regression must be

checked:

  • (^) Linearity. Diagnostic: residual plot.
  • (^) Constant variance. Diagnostic: residual plot.
  • (^) Normality. Diagnostics:
    • histogram of residuals.
    • normal quantile plot of residuals.
  • (^) Independence. Diagnostic: residual plot.
  • (^) Outliers and influential points. Diagnostic: scatterplot, Cook’s distances.

Bivariate Fit of HousePrice By CrimeRate

HousePrice

CrimeRate

Summary of Fit

Residual

Distri

Resid

3

CrimeRate

Summary of Fit

RSquare 0.

RSquare Adj 0.

Root Mean Square Error 78861.

Mean of Response 158464.

Observations (or Sum Wgts) 98

Parameter Estimates

Term Estimate Std Error t Ratio Prob>|t|

Intercept 225233.55 16404.02 13.73 <.

CrimeRate -2288.689 491.5375 -4.66 <.

Residual plots

  • (^) A residual plot is a scatterplot of each residual ei vs its corresponding xi.
  • (^) A residual plot can suggest nonlinearity and/or non-constant variance, both

violations of the assumptions in the ideal linear model.

  • (^) If the ideal model holds, the residual plot should contain patternless scatter.
  • (^) The residuals should be centered around a horizontal line at 0. A nonlinear

regression function μ( x ) will lead to a violation of this constraint. For example, the residuals on the left and right might be generally above zero, and those in the middle generally below zero.

  • (^) The residuals should have the same spread around zero, at all x values. If,

for example, the residual plot has a “bulge” in the model, the constant-variance assumption has been violated.

  • (^) Look for pronounced patterns. No residual plot will satisfy you as being

“perfectly” patternless (not even when we do a regression on data simulated from the ideal linear model!).

0

100000

200000

300000

Residual

10 20 30 40 50 60 70 CrimeRate

Distributions

Residuals HousePrice

  • (^) Some evidence of upwards curvature in the average residuals at the low and high end of crime rate.

DisplayFeet

0

100

Residual

0 1 2 3 4 5 6 7 8

DisplayFeet

Residual plot shows mean residual less

1 and 7 display feet and greater than ze

  • (^) The residual plot shows negative mean residuals at 1 foot and 7 feet, and positive mean residuals elsewhere.
  • (^) What does this indicate?

Examining and Transforming Data Tukey’s bulging rule

Figure 15. Mosteller and Tukey’s bulging rule for selecting linearizing trans ormations.

10

Display effect: Transformations to

x

0

100

200

300

400

Sales

0 1 2 3 4 5 6 7 8 DisplayFeet

0

50 Residual

0 1 2 3 4 5 6 7 8 DisplayFeet

  • (^) Still looks like overestimation at 1 foot and underestimation at 2–3 feet.

Display effect: Transformations to log x

0

100

200

300

400

Sales

0 1 2 3 4 5 6 7 8 DisplayFeet

0

50

100

Residual

0 1 2 3 4 5 6 7 8 DisplayFeet

  • (^) This is a slight improvement.

Heteroscedasticity

  • (^) You may have already forgotten that the ideal linear model assumes all the

distrns Y | X = x have the same variance σ 2. This is homoscedasticity.

  • (^) When this assumption is violated, we have non-constant variances, or

heteroscedasticity.

  • (^) To detect non-constant variance, look at the residual plot. If the vertical

spread of residuals is not the same from left to right, the constant variance assumption is in trouble. E.g.,

  • “Fan pattern”: spread increases from left to right
  • “Inverse fan pattern”: spread decreases from left to right
  • (^) Consequences of heteroscedasticity:
  • Hypothesis tests and confidence intervals for β’s are misspecified.
  • But, least-squares estimates ( βˆ 0 , βˆ 1 ) are still unbiased.
  • (^) What to do: try transformations of Y.

Another residual plot

  • (^) We can plot residuals ei against their corresponding fitted values y ˆ i , rather

than against xi.

  • (^) For simple linear regression, this is just an affine transformation of X (to

β^ ˆ 0 + ˆβ 1 X ): it moves the whole residual cloud up or down, and stretches the x axis, but doesn’t change the conclusions we draw.

  • (^) However, it is useful in multiple linear regression, when there are many X

variables to choose from.

Heteroscedasticity

  • When the requirement of a constant variance is

violated we have a condition of heteroscedasticity.

  • (^) + (^) +

  • (^) +

The spread increases with y^

y^

Residual

^y

++

+^ +

+^ ++

Display effect example

0

50

100

Residual

0 1 2 3 4 5 6 7 8 DisplayFeet

Distributions Residuals Sales

-50 0 50 100

No strong indication of violation of constant variance or
normality assumptions.
  • (^) Nothing strongly suggests heteroscedasticity or departure from normality.

Residual plots vs time

  • (^) A confounder is a variable that is not included among the covariates in a

study but influences both the covariates and the response.

  • (^) Plots of the residuals versus the temporal or spatial order of observation

can reveal

  • serial correlation (violation of independence)
  • confounding variables associated with time and space (multiple regression can be used to account for these variables).
  • (^) Example: the math department at a university must plan for the number of

instructors in large service courses. The goal is to predict enrollment in such courses Y based on number of first-year students X.