Download Lecture Slides on Variable Selection and Model Building | STATS 0120B and more Study notes Statistics in PDF only on Docsity!
Stats 120B (Spring 2007) Variable Selection and Model Building Instructor: H. Xu
1 All possible regressions: The Hald Cement Data (MPV 9.1)
• Data concerning the heat evolved in calories per gram of cement (y) as a function of the amount of
each of four ingredients in the mix
• x1=tricalcium aluminate, x2=tricalcium silicate, x3=tetracalcium alumino ferrite x4=dicalcium sili-
cate
y
l l
l
l
l
l l
l
l
l
l
l l
l l
l
l
l
l l
l
l
l
l
l l
l l
l
l
l
l l
l
l
l
l
l l
l l
l
l
l
l l
l
l
l
l
l l
l
l
l l l
l
l l l
l
l
l l x
l
l
l l l
l
l l l
l
l
ll l
l
ll l
l
l l l
l
l
ll l
l
l l l
l
l l l
l
l
ll
l l
l
l
l l
l
l
l l l
ll
l l
l
l
l l
l
l
l l l
ll
x
l l
l
l
l l
l
l
l l l
ll
l l
l
l
l l
l
l
l l l
ll
l
l
l l l
l
l
l l
l
l
ll l
l
ll l
l
l
l l
l
l
ll l
l
l l l
l
l
l l
l
l
ll
x
l
l
l l l
l
l
l l
l
l
ll
l l
l
l
l l
l
l
l l
l
ll
l l
l
l
l l
l
l
l l
l
ll
l l
l
l
l l
l
l
l l
l
ll
l l
l
l
l l
l
l
l l
l
ll
x
> dat=read.table("cement.dat", h=T); attach(dat)
> plot(dat)
> # create a function to obtain lm() information
> lm.info = function(g, sigma.full)
+ { # sigma: estimated sigma from the full model
+ p = ncol(g$model)
+ n = nrow(g$model)
+ if(n-p != g$df) stop("n-p != g$df")
+ rss = sum(resid(g)^2)
+ gs= summary(g)
+ Cp = rss/sigma.full^2 - n + 2*p # Cp
+ aic = extractAIC(g,k=2)[2] # give AIC
+ bic = extractAIC(g,k=log(n))[2] # gives BIC
+ press = sum((resid(g)/(1-ls.diag(g)$hat))^2) # press
+ c(p=p, df=n-p, ss=rss, rsq=gs$r.squared, rsq.a=gs$adj.r.squared, ms=rss/g$df,
+ Cp=Cp, aic=aic, bic=bic, press=press)
> # do all possible subset regressions
> full=lm(y~., dat)
> sigma.full = summary(full)$sigma
> info = NULL
> g=lm(y~1, dat)
> info = rbind(info, lm.info(g, sigma.full))
> subsets = list(1,2,3,4, c(1,2), c(1,3), c(1,4), c(2,3), c(2,4), c(3,4),
+ c(1,2,3), c(1,2,4), c(1,3,4), c(2,3,4), 1:4)
> for(i in 1:length(subsets)){
+ sub = c(1,subsets[[i]]+1)
+ g=lm(y~., dat[,sub])
+ info = rbind(info, lm.info(g, sigma.full))
> round(info, 2)
p df ss rsq rsq.a ms Cp aic bic press
[1,] 1 12 2715.76 0.00 0.00 226.31 442.92 71.44 72.01 3187.
[2,] 2 11 1265.69 0.53 0.49 115.06 202.55 63.52 64.65 1699.
[3,] 2 11 906.34 0.67 0.64 82.39 142.49 59.18 60.31 1202.
[4,] 2 11 1939.40 0.29 0.22 176.31 315.15 69.07 70.20 2616.
[5,] 2 11 883.87 0.67 0.64 80.35 138.73 58.85 59.98 1194.
[6,] 3 10 57.90 0.98 0.97 5.79 2.68 25.42 27.11 93.
[7,] 3 10 1227.07 0.55 0.46 122.71 198.09 65.12 66.81 2218.
[8,] 3 10 74.76 0.97 0.97 7.48 5.50 28.74 30.44 121.
[9,] 3 10 415.44 0.85 0.82 41.54 62.44 51.04 52.73 701.
[10,] 3 10 868.88 0.68 0.62 86.89 138.23 60.63 62.32 1461.
[11,] 3 10 175.74 0.94 0.92 17.57 22.37 39.85 41.55 294.
[12,] 4 9 48.11 0.98 0.98 5.35 3.04 25.01 27.27 90.
[13,] 4 9 47.97 0.98 0.98 5.33 3.02 24.97 27.23 85.
[14,] 4 9 50.84 0.98 0.98 5.65 3.50 25.73 27.99 94.
[15,] 4 9 73.81 0.97 0.96 8.20 7.34 30.58 32.84 146.
[16,] 5 8 47.86 0.98 0.97 5.98 5.00 26.94 29.77 110.
> # all subset regressions, a simple way but don’t tell which model is the best
> library(leaps)
> b <- regsubsets(y~x1+x2+x3+x4, dat)
> (rs <- summary(b))
Subset selection object
Call: regsubsets.formula(y ~ x1 + x2 + x3 + x4, dat)
4 Variables (and intercept)
Forced in Forced out
2 Stepwise regressions: The Hald Cement Data (MPV 9.2-9.4)
> ## forward selection using Cp
> g1=lm(y~1, dat)
> step(g1, scope=~x1+x2+x3+x4, data=dat, direction="forward", scale=sigma.full^2) # Cp
Start: AIC= 442.
y ~ 1
Df Sum of Sq RSS Cp
+ x4 1 1831.90 883.87 138.
+ x2 1 1809.43 906.34 142.
+ x1 1 1450.08 1265.69 202.
+ x3 1 776.36 1939.40 315.
2715.76 442.
Step: AIC= 138.
y ~ x
Df Sum of Sq RSS Cp
+ x1 1 809.10 74.76 5.
+ x3 1 708.13 175.74 22.
+ x2 1 14.99 868.88 138.
883.87 138.
Step: AIC= 5.
y ~ x4 + x
Df Sum of Sq RSS Cp
+ x2 1 26.789 47.973 3.
+ x3 1 23.926 50.836 3.
74.762 5.
Step: AIC= 3.
y ~ x4 + x1 + x
Df Sum of Sq RSS Cp
47.973 3.
+ x3 1 0.109 47.864 5.
Call:
lm(formula = y ~ x4 + x1 + x2, data = dat)
Coefficients:
(Intercept) x4 x1 x
> ## backward elimination using Cp
> step(full, data=dat, direction="backward", scale=sigma.full^2) # Cp
Start: AIC= 5
y ~ x1 + x2 + x3 + x
Df Sum of Sq RSS Cp
- x3 1 0.109 47.973 3.
- x4 1 0.247 48.111 3.
Df Sum of Sq RSS Cp
Call: lm(formula = y ~ x1 + x2, data = dat)
stepwise regression using Cp
- 47.864 5.
- Step: AIC= 3.
- 47.97 3.
- Step: AIC= 2.
- 57.90 2. Df Sum of Sq RSS Cp
- x2 1 1207.78 1265.69 202.
- (Intercept) x1 x Coefficients: - 52.5773 1.4683 0.
- Start: AIC= > step(full, dat, direction="both", scale=sigma.full^2) # Cp
- x3 1 0.109 47.973 3. Df Sum of Sq RSS Cp
- 47.864 5.
- Step: AIC= 3.
- x4 1 9.93 57.90 2. Df Sum of Sq RSS Cp
- 47.97 3.
- Step: AIC= 2.
- 57.90 2. Df Sum of Sq RSS Cp
3 Case study: The asphalt data (MPV 9.4)
- Data concerning the rut depth of 31 asphalt pavements prepared under different conditions specified
by 5 regressors.
- A 6th regressor is used as an indicator variable to separate the data into 2 sets of runs.
- y=rut depth per million wheel passes, x1=viscosity of the asphalt, x2=percentage of asphalt in the
surface courses, x3=percentage of asphalt in the base course, x4=the run (0 or 1), x5=percentage of
fines in the surface course, x6=percentage of voids in the surface course.
depth
l
lll lll
ll
l
l
l
l
l l
l
lllll lllll (^) l ll l l
l
ll l ll^ l
ll
l
l
l
l
l l
l
l (^) lllll lll llll^ ll
l
l ll l ll l l
l
l
l
l
l l
l
l l lllll lll llll l
l
lll lll
ll
l
l
l
l
l l
l
lllllllllllllll
l
l l l ll l
l l
l
l
l
l
l l
l
lll l^ l ll l lll^ llll 0
l
l (^) ll l ll
ll
l
l
l
l
l l
l
ll (^) ll ll llll (^) lllll
l l lllll lllllll l l
llllll l
l
l
l
l
l l l l
viscosity
lll l l lllll llllll l l lll l
l
l
l
l
l
l l l l (^) ll l l llll lllll ll^ lllllll
l
l
l
l
l
l l l l (^) llllllllllllllll llllll
l
l
l
l
l
l l l l (^) l llllll ll l^ llll l lllllll
l
l
l
l
l
l l l l (^) ll l lll l l ll llllllllll^ l l
l
l
l
l
l
l l l l
l
l lll
l l ll
l
l l
l l
l l l
l
l
l l l
l l ll lll
l l (^) l
l lll
l lll
l
ll
l l
l l (^) l
l
l
l l l
l l l l ll l
l l (^) suface l
l ll^ l
l l (^) l l
l
l l
l l
l ll
l
l
l l l
l l l l l ll
l l (^) l
l lll
l lll
l
ll
l l
l l (^) l
l
l
l l l
l l ll lll
l l (^) l
l l ll
l ll^ l
l
l l
l l
l ll
l
l
l l l
l l l l l ll
l l
l
l l ll
l lll
l
l l
l l
l ll
l
l
l l l
l l ll l ll
l l
l
l
ll
l
l
l
l
l l l l
l l
l
l ll
l l l
ll
ll l
l l
l
l
l l
l
ll
l
l
l
l
l ll l
ll
l
l ll
l l l
l l^
l l l
l l
l
l
l l
l
ll
l
l
l
l
l l l l
l l
l
l l l
l l l
l l
ll l
l l
l
l
l base
l
l
ll
l
l
l
l
l ll l
ll
l
l ll
l l l
ll
ll l
l l
l
l
l l
l
l l
l
l
l
l
l ll l
l (^) l
l
l l l
l l l
l l^
ll l
l l
l
l
l l
l
ll
l
l
l
l
l l l l
l (^) l
l
l l l
l l l
l l^
ll l
l l
l
l
l
l l lllll lllllll l l
lllllllllllllll
ll llllllllllllll
lllll lllll l ll l l
lll l l lllll llllll
l lllll lll llll ll
ll l ll lll ll l lllll
l l lllll lll llll l
run
l lll l lllll l ll lll
lll l l ll l lll llll
ll l ll l l lllll llll
ll ll ll llll lllll
l
l
l
ll
l l
l
l
l l l
l
l
l
l l
l
l
l
ll
l
l l l
l l
l
l
l
l
l
l
ll l l
l
l
l l l
l
l
l
l l
l
l
l
ll
l
l l l
l l
l
l
l
l
l
l
ll l l
l
l
l l l
l
l
l
l l
l
l
l
l l
l
l l l
l l
l
l
l
l
l
l
l l l l
l
l
l l l
l
l
l
l l
l
l
l
l l
l
l l l
l l
l
l
l
l
l
l
ll l l
l
l
l l l
l
l
l
l l
l
l
l
ll
l
l l l
l l
l
l
l fines
l
l
l
l l
l l
l
l
l l l
l
l
l
l l
l
l
l
l l
l
l l l
l l
l
l
l
l l
l l
l
l
l (^) ll
l
l l l
l
l
l l
l l
l
l
l
l
ll l
l
ll l
l l l
l l
l
l
lll
l
l l l
l
l
l (^) l l l
l
l
l
l
l l l
l
ll l
l
l l
l l
l
l
lll
l
l l l
l
l
ll
l l
l
l
l
l
ll l
l
ll l
l l l
l l
l
l
l l l l
l l l
l
l
ll
l l
l
l
l
l
ll l
l
l l l
l
l l
l l
l
l
lll
l
l l l
l
l
l (^) l l l
l
l
l
l
ll l
l
ll l
l l l
l l
l
l
ll (^) l
l
l l l
l
l
ll
l l
l
l
l
l
ll l
l
l l l
l
voids
> dat=read.table("asphalt.dat", h=T); attach(dat)
> plot(dat)
> # see if we need transform the predictors
> library(alr3) # bctrans
> library(MASS) # boxcox
> ans=bctrans(depth~viscosity+suface+base+run+fines+voids, dat)
Error in bctrans1(mf, Y = y, ..., call = match.call(expand.dots = TRUE)) :
All values must be > 0; use family="yeo.johnson"
> ans=bctrans(depth~viscosity+suface+base+fines+voids, dat)
> summary(ans)
box.cox Transformations to Multinormality
Est.Power Std.Err. Wald(Power=0) Wald(Power=1)
viscosity -0.0224 0.1063 -0.2105 -9.
suface 4.7640 2.3060 2.0659 1.
base -3.5401 4.2424 -0.8344 -1.
fines -0.7789 1.8330 -0.4250 -0.
voids 0.5065 1.2129 0.4176 -0.
LRT df p.value
LR test, all lambda equal 0 5.860003 5 0.
LR test, all lambda equal 1 89.039393 5 0.
> lrt.bctrans(ans,lrt=list(c(0,1,1,1,1)))
LRT df p.value
LR test, all lambda equal 0 5.860003 5 0.
LR test, all lambda equal 1 89.039393 5 0.
LR test, lambda = 0 1 1 1 1 4.842049 5 0.
> # log transform for viscosity
> log.visc = log(viscosity)
> g=lm(depth~log.visc+suface+base+run+fines+voids)
> summary(g)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -14.9592 25.2881 -0.592 0.
log.visc -3.1515 0.9194 -3.428 0.00220 **
suface 3.9706 2.4966 1.590 0.
base 1.2631 3.9703 0.318 0.
run 1.9655 3.6472 0.539 0.
fines 0.1164 1.0124 0.115 0.
voids 0.5893 1.3244 0.445 0.
Residual standard error: 3.324 on 24 degrees of freedom
Multiple R-Squared: 0.806,Adjusted R-squared: 0.
F-statistic: 16.62 on 6 and 24 DF, p-value: 1.743e-
> # check residual plots to see if need transform the response
> par(mfrow=c(2,4)); par(mar=c(5,4,0,0)+.1)
> plot(g, 2:1)
> plot(log.visc, resid(g))
> plot(suface, resid(g))
> plot(base, resid(g))
> plot(run, resid(g))
> plot(fines, resid(g))
> plot(voids, resid(g))
> ## now refit with transformed response
> g=lm(log(depth)~log.visc+suface+base+run+fines+voids)
> summary(g)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.23294 2.34970 -0.525 0.
log.visc -0.55769 0.08543 -6.528 9.45e-07 ***
suface 0.58358 0.23198 2.516 0.019 *
base -0.10337 0.36891 -0.280 0.
run -0.34005 0.33889 -1.003 0.
fines 0.09775 0.09407 1.039 0.
voids 0.19885 0.12306 1.616 0.
Residual standard error: 0.3088 on 24 degrees of freedom
Multiple R-Squared: 0.961,Adjusted R-squared: 0.
F-statistic: 98.47 on 6 and 24 DF, p-value: 1.059e-
> par(mfrow=c(2,4)); par(mar=c(5,4,0,0)+.1)
> plot(g, 1:2)
> plot(log.visc, resid(g))
> plot(suface, resid(g))
> plot(base, resid(g))
> plot(run, resid(g))
> plot(fines, resid(g))
> plot(voids, resid(g))
Fitted values
Residuals
l
l
l
l
l
ll l l
l
l
l l l
l
l
l
l
l
l
l
l
l
l
l
l
ll
l
l
l
l
l
l
l
l
ll l l
l
l
l l l
l
l
l
l
l
l
l
l
l
l
l
l
ll
l
l
l
Theoretical Quantiles
Standardized residuals
l
l
l
l
l
ll l l
l
l
l l
l
l
l
l
l
l
l
l
l
l
l
l
l
l l
l
l
l
log.visc
resid(g)
l
l
l
l
l
l l l l
l
l
l l
l
l
l
l
l
l
l
l
l
l
l
l
l
ll
l
l
l
suface
resid(g)
l
l
l
l
l
l (^) l l l
l
l
l l
l
l
l
l
l
l
l
l
l
l
l
l
l
l l
l
l
l
base
resid(g)
l
l
l
l
l
ll l l
l
l
ll
l
l
l
l
l
l
l
l
l
l
l
l
l
ll
l
l
l
run
resid(g)
l
l
l
l
l
l l l l
l
l
l l
l
l
l
l
l
l
l
l
l
l
l
l
l
l l
l
l
l
fines
resid(g)
l
l
l
l
l
l (^) l l l
l
l
l l
l
l
l
l
l
l
l
l
l
l
l
l
l
l (^) l
l
l
l
voids
resid(g)
> # do model/variable selection
> library(leaps)
> b <- regsubsets(log(depth)~log.visc+suface+base+run+fines+voids, dat)
> summary(b)
Subset selection object
Call: regsubsets.formula(log(depth) ~ log.visc + suface + base + run +
fines + voids, dat)
6 Variables (and intercept)
Forced in Forced out
log.visc FALSE FALSE
suface FALSE FALSE
base FALSE FALSE
run FALSE FALSE
fines FALSE FALSE
voids FALSE FALSE
1 subsets of each size up to 6
Selection Algorithm: exhaustive
log.visc suface base run fines voids
> # stepwise regressions using AIC, BIC and Cp
> full=lm(log(depth)~log.visc+suface+base+run+fines+voids)
> g1=lm(log(depth)~1, dat)
> step(g1, scope=~log.visc+suface+base+run+fines+voids, data=dat, direction="both", trace=0) # AIC
Call:
lm(formula = log(depth) ~ log.visc + suface + voids, data = dat)
Coefficients:
(Intercept) log.visc suface voids
> step(full, dat, direction="both", k=log(nrow(dat)), trace=0) # BIC
Call:
lm(formula = log(depth) ~ log.visc + suface + voids)
Coefficients:
(Intercept) log.visc suface voids
> step(full, dat, direction="both", scale=sigma.hat(full)^2, trace=0) # Cp
Call:
lm(formula = log(depth) ~ log.visc + suface + voids)
Coefficients:
(Intercept) log.visc suface voids