Validating Linear Model: Parameters, Hypotheses, and Residual Diagnosis - Prof. Emiliano V, Study notes of Mathematics

This document, from the University of Connecticut - Storrs, Fall 2009 semester, covers the validation of a linear model in the context of applied actuarial statistics. Topics include interpreting parameter estimates, hypothesis testing for the slope, testing variable significance, interval estimates, prediction and prediction intervals, analyzing residuals, checking normality, detecting constant variance, and dealing with unusual observations. The document also includes R source code for fitting the model and analyzing the residuals.

Typology: Study notes

Pre 2010

Uploaded on 02/24/2010

Validating the Linear Model
Math 3621 Applied Actuarial Statistics
Fall 2009 semester
EA Valdez
University of Connecticut - Storrs
Lecture Week 5

Outline

Model interpretations
    Inference on the slope estimates
    Testing variable significance
    Interval estimates
    Prediction and prediction interval
Fitting regression line with R
    R source codes for fitting model
    Prediction intervals
Analyzing the residuals
    Checking Normality
    Detecting constant variance
    Unusual observations
    Some advice on unusual observations

Interpreting the parameter estimates

Recall the estimated regression equation:

$\hat{Y} = b_0 + b_1 X$,

where $b_0$ is the intercept and $b_1$ is the slope coefficient.

Interpreting these parameter estimates:

We expect $Y = b_0$ when $X = 0$, but only if a value of $X = 0$ makes sense for the data.

We expect $Y$ to change by $b_1$ whenever $X$ increases by one unit.
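As a small illustration of the two interpretations, the sketch below plugs in the rounded estimates that appear in the R output of these notes (intercept about 5866, slope about 0.2113, for purchase price regressed on income). The helper name `expected_price` and the 1000-unit income increase are chosen here for illustration only.

```r
# Rounded estimates taken from the summary(lm1) output in these notes
b0 <- 5866     # intercept
b1 <- 0.2113   # slope: expected change in price per one-unit change in income

# Expected purchase price as a function of income (hypothetical helper)
expected_price <- function(income) b0 + b1 * income

# Slope interpretation: a 1000-unit increase in income changes the
# expected price by 1000 * b1
delta <- expected_price(51000) - expected_price(50000)
print(delta)

# Intercept interpretation: expected price at income = 0. The plot in the
# notes shows incomes from roughly 2e+04 upward, so income = 0 is an
# extrapolation and the value should not be read literally.
print(expected_price(0))
```

The intercept is still needed to position the line, even when its literal reading at X = 0 is not meaningful.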

Is the independent variable important? The t-test

In assessing whether $X$ is an important (or significant) predictor variable, we conduct the t-test for

$H_0: \beta_1 = 0$ vs $H_a: \beta_1 \neq 0$.

The test statistic simplifies to

$t\text{-ratio} = t(b_1) = \dfrac{b_1}{se(b_1)}$.

Reject $H_0$ if $|t(b_1)| > t_{\alpha/2,\,n-2}$ and say that there is reason to believe the independent variable $X$ is an important predictor. Otherwise, we do not reject $H_0$, and there is no reason to believe that $X$ is an important predictor.
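The decision rule can be sketched numerically with the rounded slope estimate and standard error from the R output in these notes. The significance level of 5% is an assumption made here; the notes do not fix a value of α.

```r
# Rounded values from the summary(lm1) output in these notes
b1    <- 0.2113    # slope estimate
se_b1 <- 0.01508   # standard error of the slope
df    <- 60        # residual degrees of freedom, n - 2
alpha <- 0.05      # significance level (assumed, not stated in the notes)

t_ratio <- b1 / se_b1                  # test statistic for H0: beta1 = 0
t_crit  <- qt(1 - alpha / 2, df)       # upper alpha/2 critical value
p_value <- 2 * pt(-abs(t_ratio), df)   # two-sided p-value

reject <- abs(t_ratio) > t_crit
print(c(t_ratio = t_ratio, t_crit = t_crit))
print(reject)
```

Because the inputs are rounded, the computed t-ratio agrees with the reported t value only up to rounding.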

Constructing a confidence interval for β 1

A $100(1-\alpha)\%$ confidence interval for the slope $\beta_1$ is given by

$b_1 \pm t_{\alpha/2,\,n-2}\, se(b_1) = b_1 \pm t_{\alpha/2,\,n-2}\, \dfrac{s}{s_X \sqrt{n-1}}$,

where $t_{\alpha/2,\,n-2}$ refers to the $100(1-\alpha/2)$-th (upper) percentile of a t-distribution with $n-2$ degrees of freedom.
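A minimal sketch of this interval follows. The first part uses the rounded slope and standard error from the R output in these notes, with an assumed 95% level. The second part checks the standard-error identity on simulated data; that data set, its seed, and the names `x`, `y`, and `fit` are hypothetical, since the course data are not reproduced here.

```r
# Part 1: 95% interval from the rounded values reported in the notes
b1 <- 0.2113; se_b1 <- 0.01508; df <- 60
alpha <- 0.05                                  # assumed confidence level 95%
t_crit <- qt(1 - alpha / 2, df)
ci <- b1 + c(-1, 1) * t_crit * se_b1
names(ci) <- c("lower", "upper")
print(round(ci, 4))

# Part 2: check se(b1) = s / (s_X * sqrt(n - 1)) on hypothetical data
set.seed(1)
x <- runif(62, 20000, 100000)
y <- 5866 + 0.2113 * x + rnorm(62, sd = 2459)
fit <- lm(y ~ x)
s <- summary(fit)$sigma                        # residual standard error
se_formula <- s / (sd(x) * sqrt(length(x) - 1))
se_lm <- summary(fit)$coefficients["x", "Std. Error"]
print(all.equal(se_formula, se_lm))
```

On a fitted model, `confint(fit)` returns the same kind of interval directly.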

R source codes for fitting model

# fitting the linear model with income as predictor to purchase price
> lm1 <- lm(price~income)
> summary(lm1)

Call:
lm(formula = price ~ income)

Residuals:
Min 1Q Median 3Q Max

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.866e+03  7.498e+02   7.824  9.8e-11 ***
income      2.113e-01  1.508e-02  14.009  < 2e-16 ***
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Residual standard error: 2459 on 60 degrees of freedom
Multiple R-Squared: 0.7659, Adjusted R-squared: 0.
F-statistic: 196.3 on 1 and 60 DF, p-value: < 2.2e-

# ANOVA table
> anova(lm1)
Analysis of Variance Table

Response: price
          Df     Sum Sq    Mean Sq F value    Pr(>F)
income     1 1186892153 1186892153  196.26 < 2.2e-16 ***
Residuals 60  362851718    6047529
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

# prediction interval for a new observation with income = 75000
> new.income <- data.frame(income=c(75000))
> predict(lm1,new.income,interval="prediction")
          fit      lwr    upr
[1,] 21715.63 16676.12 26755.
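The pieces of this output are linked by simple identities, which the sketch below checks using only the numbers printed in the ANOVA table and the coefficient table. The variable names are introduced here for illustration; small discrepancies are due to rounding in the printed output.

```r
# Values copied from the ANOVA table and coefficient table in these notes
ss_reg  <- 1186892153   # regression sum of squares (income row)
ss_res  <- 362851718    # residual sum of squares
df_res  <- 60           # residual degrees of freedom
t_slope <- 14.009       # reported t value for income

ms_res <- ss_res / df_res              # residual mean square
s      <- sqrt(ms_res)                 # residual standard error
r2     <- ss_reg / (ss_reg + ss_res)   # coefficient of determination
f_stat <- ss_reg / ms_res              # F statistic (regression df = 1)

print(round(c(ms_res = ms_res, s = s, r2 = r2, f_stat = f_stat), 4))

# With a single predictor, the F statistic equals the squared t-ratio
print(c(t_squared = t_slope^2, f_stat = f_stat))
```

This is why, with one predictor, the t-test for the slope and the F-test in the ANOVA table lead to the same conclusion.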

Prediction intervals

# the prediction interval
> new <- data.frame(income=seq(min(income),max(income),5))
> predinc.plim <- predict(lm1,new,interval="prediction")
> matplot(new$income,predinc.plim,col=c("black","red","red"),lty=c(1,2,3),type="l",
+   xlab="income",ylab="purchase price",
+   main="The estimated regression line with prediction intervals")
> points(income,price,col="blue",cex=1.5)

[Figure: "The estimated regression line with prediction intervals". Fitted line and prediction limits plotted against income (x-axis), with the observed purchase prices (y-axis) overlaid as points.]
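For reference, the limits returned by `predict(..., interval="prediction")` follow the usual formula for a new observation at $x^*$: $\hat{y}^* \pm t_{\alpha/2,\,n-2}\, s \sqrt{1 + 1/n + (x^* - \bar{x})^2 / ((n-1) s_X^2)}$. The sketch below computes this by hand and compares it with `predict()`. The course data are not reproduced in these notes, so the data here are simulated; the seed, sample values, and the choice of $x^* = 75000$ at the default 95% level are assumptions.

```r
# Hypothetical data standing in for the car price data (not the real data)
set.seed(2009)
n <- 62
income <- runif(n, 20000, 100000)
price  <- 5866 + 0.2113 * income + rnorm(n, sd = 2459)
lm1 <- lm(price ~ income)

x_star <- 75000                               # new income value
s      <- summary(lm1)$sigma                  # residual standard error
y_hat  <- sum(coef(lm1) * c(1, x_star))       # point prediction b0 + b1 * x_star

# Standard error of prediction for a single new observation
se_pred <- s * sqrt(1 + 1 / n +
                    (x_star - mean(income))^2 / ((n - 1) * var(income)))
t_crit  <- qt(0.975, df = n - 2)              # 95% level, as in predict()'s default

by_hand  <- c(fit = y_hat,
              lwr = y_hat - t_crit * se_pred,
              upr = y_hat + t_crit * se_pred)
built_in <- predict(lm1, data.frame(income = x_star), interval = "prediction")
print(rbind(by_hand, built_in))
```

The term $(x^* - \bar{x})^2$ explains why the prediction band in the plot widens as income moves away from its sample mean.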

Detecting constant variance

To check for constant variance (homoscedasticity), plot the residuals against the fitted values.

# detect homoscedasticity
> plot(fitted(lm1),residuals(lm1),xlab="Fitted",ylab="Residuals",cex=1.2,
+   main="Fitted values vs residuals for the Car Price data")
> abline(h=0,col="blue")

[Figure: "Fitted values vs residuals for the Car Price data". Residuals (y-axis) plotted against fitted values (x-axis), with a horizontal reference line at zero.]

Checking heteroscedasticity - what to look for?

[Figure: four example residual plots, with panel titles "no problem", "mild heteroscedasticity", "strong heteroscedasticity", and "non-linear".]
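Residual plots of the four kinds named in the panel titles can be generated with a short simulation. This is only a sketch: the data-generating models, sample size, and seed below are invented for illustration and are not the ones used to draw the original figure.

```r
set.seed(5)
n <- 100
x <- runif(n, 1, 10)

# Four hypothetical data-generating processes, one per panel title
y_list <- list(
  "no problem"                = 2 + 3 * x + rnorm(n, sd = 2),
  "mild heteroscedasticity"   = 2 + 3 * x + rnorm(n, sd = 0.5 + 0.3 * x),
  "strong heteroscedasticity" = 2 + 3 * x + rnorm(n, sd = 0.1 * x^2),
  "non-linear"                = 2 + 0.5 * (x - 5)^2 + rnorm(n, sd = 1)
)

# Fit a straight line to each response
fits <- lapply(y_list, function(y) lm(y ~ x))

# Residuals against fitted values, one panel per case
op <- par(mfrow = c(2, 2))
for (nm in names(fits)) {
  plot(fitted(fits[[nm]]), residuals(fits[[nm]]),
       xlab = "Fitted", ylab = "Residuals", main = nm)
  abline(h = 0, col = "blue")
}
par(op)
```

What to look for: an even band around zero suggests no problem, a funnel that widens with the fitted values suggests heteroscedasticity, and a curved pattern suggests that a straight line is the wrong functional form.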

Outliers and high leverage points

An observation is considered “unusual” if it lies far from the majority of the data set.

It could be “unusual”:

in the vertical direction, in which case we call it an outlier; or

in the horizontal direction, in which case we call it a high leverage point.

It is possible that an observation is unusual in both directions, and hence both an outlier and a high leverage point.

In the next few slides, we illustrate the effect of unusual observations by considering a “fictitious” data set, as in the Example on page 42 of the Frees book.
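The two directions can be quantified with standard R diagnostics: `hatvalues()` measures leverage (unusual in the horizontal direction) and `rstandard()` gives standardized residuals (unusual in the vertical direction). The sketch below uses invented data; the 19 base points and the coordinates of the added points labelled A, B, and C are hypothetical and are not the values from the Frees example.

```r
# Hypothetical base data: 19 points scattered around a line
set.seed(42)
x_base <- seq(1, 5.5, by = 0.25)
y_base <- 1.9 + 0.6 * x_base + rnorm(19, sd = 0.3)

# Hypothetical unusual points (coordinates invented for illustration)
extra <- data.frame(x = c(3, 9.5, 9.5), y = c(8, 7.6, 2.5),
                    row.names = c("A", "B", "C"))

# Fit base + one added point; report diagnostics for the added point
diagnose <- function(label) {
  x <- c(x_base, extra[label, "x"])
  y <- c(y_base, extra[label, "y"])
  fit <- lm(y ~ x)
  k <- length(x)                                   # index of the added point
  c(leverage  = unname(hatvalues(fit)[k]),
    std_resid = unname(rstandard(fit)[k]))
}

diag_table <- t(sapply(rownames(extra), diagnose))
print(round(diag_table, 2))
```

In this setup A sits in the middle of the x range but far above the line (an outlier), B is far to the right but close to the line (a high leverage point), and C is far to the right and far below the line (both).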

An example data set with unusual observations

The following plot shows three “unusual” observations, denoted by points A, B, and C.

[Figure: scatter plot of y against x showing the 19 base points together with the labelled points A, B, and C.]

Effects on the regression

The following table summarizes the results of running the various regression models to assess the impact of the unusual observations.

Data       b0     b1     s      R^2 (%)  t(b1)
Base       1.869  0.611  0.288  89.0     11.
Base + A   1.750  0.693  0.846  53.7     4.
Base + B   1.775  0.640  0.285  94.7     18.
Base + C   3.356  0.155  0.865  10.3     1.
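A comparison of this kind can be produced by refitting the model with each unusual point added in turn. The sketch below does so on invented data, so its numbers will not match the table; the base points, the coordinates of A, B, and C, and the helper `summarise_fit` are all hypothetical.

```r
# Hypothetical base data and unusual points (not the Frees data)
set.seed(42)
x_base <- seq(1, 5.5, by = 0.25)                   # 19 base points
y_base <- 1.9 + 0.6 * x_base + rnorm(19, sd = 0.3)
extra  <- list(A = c(3, 8), B = c(9.5, 7.6), C = c(9.5, 2.5))

# Collect the same summary columns as the table in the notes
summarise_fit <- function(x, y) {
  sm <- summary(lm(y ~ x))
  c(b0     = sm$coefficients[1, 1],
    b1     = sm$coefficients[2, 1],
    s      = sm$sigma,
    R2_pct = 100 * sm$r.squared,
    t_b1   = sm$coefficients[2, 3])
}

results <- rbind(
  Base = summarise_fit(x_base, y_base),
  t(sapply(extra, function(p)
    summarise_fit(c(x_base, p[1]), c(y_base, p[2]))))
)
rownames(results)[-1] <- paste("Base +", names(extra))
print(round(results, 3))
```

The qualitative pattern to look for is the one in the table: an outlier in the middle of the x range mainly inflates s and lowers R^2, while a point that is both an outlier and high leverage can also change the slope substantially.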

Some advice on unusual observations

What can be done about unusual observations?

Check for possible errors in data entry.

Investigate why the unusual observation occurred.

Consider excluding it first, then including it again, to evaluate its impact.

Unusual observations may not be mistakes or aberrations; they may occur naturally.

It could be dangerous to exclude them immediately and altogether.

Check out p. 68 of Faraway. We’ll deal with this issue again later in multiple regression.