Lecture Slides on Variable Selection and Model Building | STATS 0120B, Study notes of Statistics

Material Type: Notes; Class: INT APPLD REGR ANLY; Subject: Statistics; University: University of California - Los Angeles; Term: Spring 2007;

Typology: Study notes

Pre 2010

Uploaded on 09/17/2009

koofers-user-a54
koofers-user-a54 🇺🇸

10 documents

1 / 12

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
Stats 120B (Spring 2007) Variable Selection and Model Building Instructor: H. Xu
1 All possible regressions: The Hald Cement Data (MPV 9.1)
Data concerning the heat evolved in calories per gram of cement (y) as a function of the amount of
each of four ingredients in the mix
x1=tricalcium aluminate, x2=tricalcium silicate, x3=tetracalcium alumino ferrite x4=dicalcium sili-
cate
y
5 10 15 20
5 10 15 20
80 100
5 10 15 20
x1
x2
30 50 70
5 10 15 20
x3
80 100
30 50 70
10 30 50
10 30 50
x4
> dat=read.table("cement.dat", h=T); attach(dat)
> plot(dat)
> # create a function to obtain lm() information
1
pf3
pf4
pf5
pf8
pf9
pfa

Partial preview of the text

Download Lecture Slides on Variable Selection and Model Building | STATS 0120B and more Study notes Statistics in PDF only on Docsity!

Stats 120B (Spring 2007) Variable Selection and Model Building Instructor: H. Xu

1 All possible regressions: The Hald Cement Data (MPV 9.1)

• Data concerning the heat evolved in calories per gram of cement (y) as a function of the amount of

each of four ingredients in the mix

• x1=tricalcium aluminate, x2=tricalcium silicate, x3=tetracalcium alumino ferrite x4=dicalcium sili-

cate

y

l l

l

l

l

l l

l

l

l

l

l l

l l

l

l

l

l l

l

l

l

l

l l

l l

l

l

l

l l

l

l

l

l

l l

l l

l

l

l

l l

l

l

l

l

l l

l

l

l l l

l

l l l

l

l

l l x

l

l

l l l

l

l l l

l

l

ll l

l

ll l

l

l l l

l

l

ll l

l

l l l

l

l l l

l

l

ll

l l

l

l

l l

l

l

l l l

ll

l l

l

l

l l

l

l

l l l

ll

x

l l

l

l

l l

l

l

l l l

ll

l l

l

l

l l

l

l

l l l

ll

l

l

l l l

l

l

l l

l

l

ll l

l

ll l

l

l

l l

l

l

ll l

l

l l l

l

l

l l

l

l

ll

x

l

l

l l l

l

l

l l

l

l

ll

l l

l

l

l l

l

l

l l

l

ll

l l

l

l

l l

l

l

l l

l

ll

l l

l

l

l l

l

l

l l

l

ll

l l

l

l

l l

l

l

l l

l

ll

x

> dat=read.table("cement.dat", h=T); attach(dat)

> plot(dat)

> # create a function to obtain lm() information

> lm.info = function(g, sigma.full)

+ { # sigma: estimated sigma from the full model

+ p = ncol(g$model)

+ n = nrow(g$model)

+ if(n-p != g$df) stop("n-p != g$df")

+ rss = sum(resid(g)^2)

+ gs= summary(g)

+ Cp = rss/sigma.full^2 - n + 2*p # Cp

+ aic = extractAIC(g,k=2)[2] # give AIC

+ bic = extractAIC(g,k=log(n))[2] # gives BIC

+ press = sum((resid(g)/(1-ls.diag(g)$hat))^2) # press

+ c(p=p, df=n-p, ss=rss, rsq=gs$r.squared, rsq.a=gs$adj.r.squared, ms=rss/g$df,

+ Cp=Cp, aic=aic, bic=bic, press=press)

> # do all possible subset regressions

> full=lm(y~., dat)

> sigma.full = summary(full)$sigma

> info = NULL

> g=lm(y~1, dat)

> info = rbind(info, lm.info(g, sigma.full))

> subsets = list(1,2,3,4, c(1,2), c(1,3), c(1,4), c(2,3), c(2,4), c(3,4),

+ c(1,2,3), c(1,2,4), c(1,3,4), c(2,3,4), 1:4)

> for(i in 1:length(subsets)){

+ sub = c(1,subsets[[i]]+1)

+ g=lm(y~., dat[,sub])

+ info = rbind(info, lm.info(g, sigma.full))

> round(info, 2)

p df ss rsq rsq.a ms Cp aic bic press

[1,] 1 12 2715.76 0.00 0.00 226.31 442.92 71.44 72.01 3187.

[2,] 2 11 1265.69 0.53 0.49 115.06 202.55 63.52 64.65 1699.

[3,] 2 11 906.34 0.67 0.64 82.39 142.49 59.18 60.31 1202.

[4,] 2 11 1939.40 0.29 0.22 176.31 315.15 69.07 70.20 2616.

[5,] 2 11 883.87 0.67 0.64 80.35 138.73 58.85 59.98 1194.

[6,] 3 10 57.90 0.98 0.97 5.79 2.68 25.42 27.11 93.

[7,] 3 10 1227.07 0.55 0.46 122.71 198.09 65.12 66.81 2218.

[8,] 3 10 74.76 0.97 0.97 7.48 5.50 28.74 30.44 121.

[9,] 3 10 415.44 0.85 0.82 41.54 62.44 51.04 52.73 701.

[10,] 3 10 868.88 0.68 0.62 86.89 138.23 60.63 62.32 1461.

[11,] 3 10 175.74 0.94 0.92 17.57 22.37 39.85 41.55 294.

[12,] 4 9 48.11 0.98 0.98 5.35 3.04 25.01 27.27 90.

[13,] 4 9 47.97 0.98 0.98 5.33 3.02 24.97 27.23 85.

[14,] 4 9 50.84 0.98 0.98 5.65 3.50 25.73 27.99 94.

[15,] 4 9 73.81 0.97 0.96 8.20 7.34 30.58 32.84 146.

[16,] 5 8 47.86 0.98 0.97 5.98 5.00 26.94 29.77 110.

> # all subset regressions, a simple way but don’t tell which model is the best

> library(leaps)

> b <- regsubsets(y~x1+x2+x3+x4, dat)

> (rs <- summary(b))

Subset selection object

Call: regsubsets.formula(y ~ x1 + x2 + x3 + x4, dat)

4 Variables (and intercept)

Forced in Forced out

2 Stepwise regressions: The Hald Cement Data (MPV 9.2-9.4)

> ## forward selection using Cp

> g1=lm(y~1, dat)

> step(g1, scope=~x1+x2+x3+x4, data=dat, direction="forward", scale=sigma.full^2) # Cp

Start: AIC= 442.

y ~ 1

Df Sum of Sq RSS Cp

+ x4 1 1831.90 883.87 138.

+ x2 1 1809.43 906.34 142.

+ x1 1 1450.08 1265.69 202.

+ x3 1 776.36 1939.40 315.

2715.76 442.

Step: AIC= 138.

y ~ x

Df Sum of Sq RSS Cp

+ x1 1 809.10 74.76 5.

+ x3 1 708.13 175.74 22.

+ x2 1 14.99 868.88 138.

883.87 138.

Step: AIC= 5.

y ~ x4 + x

Df Sum of Sq RSS Cp

+ x2 1 26.789 47.973 3.

+ x3 1 23.926 50.836 3.

74.762 5.

Step: AIC= 3.

y ~ x4 + x1 + x

Df Sum of Sq RSS Cp

47.973 3.

+ x3 1 0.109 47.864 5.

Call:

lm(formula = y ~ x4 + x1 + x2, data = dat)

Coefficients:

(Intercept) x4 x1 x

> ## backward elimination using Cp

> step(full, data=dat, direction="backward", scale=sigma.full^2) # Cp

Start: AIC= 5

y ~ x1 + x2 + x3 + x

Df Sum of Sq RSS Cp

  • x3 1 0.109 47.973 3.
  • x4 1 0.247 48.111 3.

Df Sum of Sq RSS Cp

Call: lm(formula = y ~ x1 + x2, data = dat)

stepwise regression using Cp

    • x2 1 2.972 50.836 3.
  • 47.864 5.
    • x1 1 25.951 73.815 7.
  • Step: AIC= 3.
    • y ~ x1 + x2 + x
    • x4 1 9.93 57.90 2.
  • 47.97 3.
    • x2 1 26.79 74.76 5.
    • x1 1 820.91 868.88 138.
  • Step: AIC= 2.
    • y ~ x1 + x
  • 57.90 2. Df Sum of Sq RSS Cp
    • x1 1 848.43 906.34 142.
    • x2 1 1207.78 1265.69 202.
  • (Intercept) x1 x Coefficients: - 52.5773 1.4683 0.
  • Start: AIC= > step(full, dat, direction="both", scale=sigma.full^2) # Cp
    • y ~ x1 + x2 + x3 + x
    • x3 1 0.109 47.973 3. Df Sum of Sq RSS Cp
    • x4 1 0.247 48.111 3.
    • x2 1 2.972 50.836 3.
  • 47.864 5.
    • x1 1 25.951 73.815 7.
  • Step: AIC= 3.
    • y ~ x1 + x2 + x
    • x4 1 9.93 57.90 2. Df Sum of Sq RSS Cp
  • 47.97 3.
    • x2 1 26.79 74.76 5.
    • x1 1 820.91 868.88 138.
  • Step: AIC= 2.
    • y ~ x1 + x
  • 57.90 2. Df Sum of Sq RSS Cp

3 Case study: The asphalt data (MPV 9.4)

  • Data concerning the rut depth of 31 asphalt pavements prepared under different conditions specified

by 5 regressors.

  • A 6th regressor is used as an indicator variable to separate the data into 2 sets of runs.
  • y=rut depth per million wheel passes, x1=viscosity of the asphalt, x2=percentage of asphalt in the

surface courses, x3=percentage of asphalt in the base course, x4=the run (0 or 1), x5=percentage of

fines in the surface course, x6=percentage of voids in the surface course.

depth

l

lll lll

ll

l

l

l

l

l l

l

lllll lllll (^) l ll l l

l

ll l ll^ l

ll

l

l

l

l

l l

l

l (^) lllll lll llll^ ll

l

l ll l ll l l

l

l

l

l

l l

l

l l lllll lll llll l

l

lll lll

ll

l

l

l

l

l l

l

lllllllllllllll

l

l l l ll l

l l

l

l

l

l

l l

l

lll l^ l ll l lll^ llll 0

l

l (^) ll l ll

ll

l

l

l

l

l l

l

ll (^) ll ll llll (^) lllll

l l lllll lllllll l l

llllll l

l

l

l

l

l l l l

viscosity

lll l l lllll llllll l l lll l

l

l

l

l

l

l l l l (^) ll l l llll lllll ll^ lllllll

l

l

l

l

l

l l l l (^) llllllllllllllll llllll

l

l

l

l

l

l l l l (^) l llllll ll l^ llll l lllllll

l

l

l

l

l

l l l l (^) ll l lll l l ll llllllllll^ l l

l

l

l

l

l

l l l l

l

l lll

l l ll

l

l l

l l

l l l

l

l

l l l

l l ll lll

l l (^) l

l lll

l lll

l

ll

l l

l l (^) l

l

l

l l l

l l l l ll l

l l (^) suface l

l ll^ l

l l (^) l l

l

l l

l l

l ll

l

l

l l l

l l l l l ll

l l (^) l

l lll

l lll

l

ll

l l

l l (^) l

l

l

l l l

l l ll lll

l l (^) l

l l ll

l ll^ l

l

l l

l l

l ll

l

l

l l l

l l l l l ll

l l

l

l l ll

l lll

l

l l

l l

l ll

l

l

l l l

l l ll l ll

l l

l

l

ll

l

l

l

l

l l l l

l l

l

l ll

l l l

ll

ll l

l l

l

l

l l

l

ll

l

l

l

l

l ll l

ll

l

l ll

l l l

l l^

l l l

l l

l

l

l l

l

ll

l

l

l

l

l l l l

l l

l

l l l

l l l

l l

ll l

l l

l

l

l base

l

l

ll

l

l

l

l

l ll l

ll

l

l ll

l l l

ll

ll l

l l

l

l

l l

l

l l

l

l

l

l

l ll l

l (^) l

l

l l l

l l l

l l^

ll l

l l

l

l

l l

l

ll

l

l

l

l

l l l l

l (^) l

l

l l l

l l l

l l^

ll l

l l

l

l

l

l l lllll lllllll l l

lllllllllllllll

ll llllllllllllll

lllll lllll l ll l l

lll l l lllll llllll

l lllll lll llll ll

ll l ll lll ll l lllll

l l lllll lll llll l

run

l lll l lllll l ll lll

lll l l ll l lll llll

ll l ll l l lllll llll

ll ll ll llll lllll

l

l

l

ll

l l

l

l

l l l

l

l

l

l l

l

l

l

ll

l

l l l

l l

l

l

l

l

l

l

ll l l

l

l

l l l

l

l

l

l l

l

l

l

ll

l

l l l

l l

l

l

l

l

l

l

ll l l

l

l

l l l

l

l

l

l l

l

l

l

l l

l

l l l

l l

l

l

l

l

l

l

l l l l

l

l

l l l

l

l

l

l l

l

l

l

l l

l

l l l

l l

l

l

l

l

l

l

ll l l

l

l

l l l

l

l

l

l l

l

l

l

ll

l

l l l

l l

l

l

l fines

l

l

l

l l

l l

l

l

l l l

l

l

l

l l

l

l

l

l l

l

l l l

l l

l

l

l

l l

l l

l

l

l (^) ll

l

l l l

l

l

l l

l l

l

l

l

l

ll l

l

ll l

l l l

l l

l

l

lll

l

l l l

l

l

l (^) l l l

l

l

l

l

l l l

l

ll l

l

l l

l l

l

l

lll

l

l l l

l

l

ll

l l

l

l

l

l

ll l

l

ll l

l l l

l l

l

l

l l l l

l l l

l

l

ll

l l

l

l

l

l

ll l

l

l l l

l

l l

l l

l

l

lll

l

l l l

l

l

l (^) l l l

l

l

l

l

ll l

l

ll l

l l l

l l

l

l

ll (^) l

l

l l l

l

l

ll

l l

l

l

l

l

ll l

l

l l l

l

voids

> dat=read.table("asphalt.dat", h=T); attach(dat)

> plot(dat)

> # see if we need transform the predictors

> library(alr3) # bctrans

> library(MASS) # boxcox

> ans=bctrans(depth~viscosity+suface+base+run+fines+voids, dat)

Error in bctrans1(mf, Y = y, ..., call = match.call(expand.dots = TRUE)) :

All values must be > 0; use family="yeo.johnson"

> ans=bctrans(depth~viscosity+suface+base+fines+voids, dat)

> summary(ans)

box.cox Transformations to Multinormality

Est.Power Std.Err. Wald(Power=0) Wald(Power=1)

viscosity -0.0224 0.1063 -0.2105 -9.

suface 4.7640 2.3060 2.0659 1.

base -3.5401 4.2424 -0.8344 -1.

fines -0.7789 1.8330 -0.4250 -0.

voids 0.5065 1.2129 0.4176 -0.

LRT df p.value

LR test, all lambda equal 0 5.860003 5 0.

LR test, all lambda equal 1 89.039393 5 0.

> lrt.bctrans(ans,lrt=list(c(0,1,1,1,1)))

LRT df p.value

LR test, all lambda equal 0 5.860003 5 0.

LR test, all lambda equal 1 89.039393 5 0.

LR test, lambda = 0 1 1 1 1 4.842049 5 0.

> # log transform for viscosity

> log.visc = log(viscosity)

> g=lm(depth~log.visc+suface+base+run+fines+voids)

> summary(g)

Estimate Std. Error t value Pr(>|t|)

(Intercept) -14.9592 25.2881 -0.592 0.

log.visc -3.1515 0.9194 -3.428 0.00220 **

suface 3.9706 2.4966 1.590 0.

base 1.2631 3.9703 0.318 0.

run 1.9655 3.6472 0.539 0.

fines 0.1164 1.0124 0.115 0.

voids 0.5893 1.3244 0.445 0.

Residual standard error: 3.324 on 24 degrees of freedom

Multiple R-Squared: 0.806,Adjusted R-squared: 0.

F-statistic: 16.62 on 6 and 24 DF, p-value: 1.743e-

> # check residual plots to see if need transform the response

> par(mfrow=c(2,4)); par(mar=c(5,4,0,0)+.1)

> plot(g, 2:1)

> plot(log.visc, resid(g))

> plot(suface, resid(g))

> plot(base, resid(g))

> plot(run, resid(g))

> plot(fines, resid(g))

> plot(voids, resid(g))

> ## now refit with transformed response

> g=lm(log(depth)~log.visc+suface+base+run+fines+voids)

> summary(g)

Estimate Std. Error t value Pr(>|t|)

(Intercept) -1.23294 2.34970 -0.525 0.

log.visc -0.55769 0.08543 -6.528 9.45e-07 ***

suface 0.58358 0.23198 2.516 0.019 *

base -0.10337 0.36891 -0.280 0.

run -0.34005 0.33889 -1.003 0.

fines 0.09775 0.09407 1.039 0.

voids 0.19885 0.12306 1.616 0.

Residual standard error: 0.3088 on 24 degrees of freedom

Multiple R-Squared: 0.961,Adjusted R-squared: 0.

F-statistic: 98.47 on 6 and 24 DF, p-value: 1.059e-

> par(mfrow=c(2,4)); par(mar=c(5,4,0,0)+.1)

> plot(g, 1:2)

> plot(log.visc, resid(g))

> plot(suface, resid(g))

> plot(base, resid(g))

> plot(run, resid(g))

> plot(fines, resid(g))

> plot(voids, resid(g))

Fitted values

Residuals

l

l

l

l

l

ll l l

l

l

l l l

l

l

l

l

l

l

l

l

l

l

l

l

ll

l

l

l

l

l

l

l

l

ll l l

l

l

l l l

l

l

l

l

l

l

l

l

l

l

l

l

ll

l

l

l

Theoretical Quantiles

Standardized residuals

l

l

l

l

l

ll l l

l

l

l l

l

l

l

l

l

l

l

l

l

l

l

l

l

l l

l

l

l

log.visc

resid(g)

l

l

l

l

l

l l l l

l

l

l l

l

l

l

l

l

l

l

l

l

l

l

l

l

ll

l

l

l

suface

resid(g)

l

l

l

l

l

l (^) l l l

l

l

l l

l

l

l

l

l

l

l

l

l

l

l

l

l

l l

l

l

l

base

resid(g)

l

l

l

l

l

ll l l

l

l

ll

l

l

l

l

l

l

l

l

l

l

l

l

l

ll

l

l

l

run

resid(g)

l

l

l

l

l

l l l l

l

l

l l

l

l

l

l

l

l

l

l

l

l

l

l

l

l l

l

l

l

fines

resid(g)

l

l

l

l

l

l (^) l l l

l

l

l l

l

l

l

l

l

l

l

l

l

l

l

l

l

l (^) l

l

l

l

voids

resid(g)

> # do model/variable selection

> library(leaps)

> b <- regsubsets(log(depth)~log.visc+suface+base+run+fines+voids, dat)

> summary(b)

Subset selection object

Call: regsubsets.formula(log(depth) ~ log.visc + suface + base + run +

fines + voids, dat)

6 Variables (and intercept)

Forced in Forced out

log.visc FALSE FALSE

suface FALSE FALSE

base FALSE FALSE

run FALSE FALSE

fines FALSE FALSE

voids FALSE FALSE

1 subsets of each size up to 6

Selection Algorithm: exhaustive

log.visc suface base run fines voids

> # stepwise regressions using AIC, BIC and Cp

> full=lm(log(depth)~log.visc+suface+base+run+fines+voids)

> g1=lm(log(depth)~1, dat)

> step(g1, scope=~log.visc+suface+base+run+fines+voids, data=dat, direction="both", trace=0) # AIC

Call:

lm(formula = log(depth) ~ log.visc + suface + voids, data = dat)

Coefficients:

(Intercept) log.visc suface voids

> step(full, dat, direction="both", k=log(nrow(dat)), trace=0) # BIC

Call:

lm(formula = log(depth) ~ log.visc + suface + voids)

Coefficients:

(Intercept) log.visc suface voids

> step(full, dat, direction="both", scale=sigma.hat(full)^2, trace=0) # Cp

Call:

lm(formula = log(depth) ~ log.visc + suface + voids)

Coefficients:

(Intercept) log.visc suface voids