














































[Week 11] Model Selection -- F test, Backward, Forward and Stepwise Variable Selection
Typology: Lecture notes















































Lecture 10: Model Selection

- The general F-test
- Model Selection
- Backward Variable Selection
- Forward Variable Selection
- Stepwise Variable Selection
- Akaike information criterion
- Bayesian information criterion
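For the multiple regression model

$$Y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik} + \varepsilon_i, \qquad \varepsilon_i \sim \mathrm{NID}(0, \sigma^2),$$

we sometimes want to test a hypothesis that constrains (‘fixes the values of’) several parameters simultaneously. Thus, possible models are:

$$Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \varepsilon_i,$$
$$Y_i = \beta_0 + \beta_2 x_{i2} + \varepsilon_i,$$
$$Y_i = \beta_0 + \beta_1 x_{i1} + \varepsilon_i,$$
$$Y_i = \beta_0 + \varepsilon_i.$$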
- Setting two parameter values to zero, i.e.
  $H_0: \beta_1 = \beta_2 = 0$ versus $H_1: \beta_1 \neq 0 \lor \beta_2 \neq 0$.
- Example: $k \geq 2$, all (slope) parameter values set to zero, i.e.
  $H_0: \beta_1 = \dots = \beta_k = 0$ versus $H_1: \exists\, \beta_j \neq 0,\ j = 1, \dots, k$.
- This test problem is also known as the omnibus test problem or overall test.
- We can test each of the constraints separately using t-tests, but how do we combine the outcomes to get a single p-value for the combined claim?
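The general F-test compares the residual sum of squares $S_1$ of the model under $H_0$ with the residual sum of squares $S_0$ of the model under $H_1$:

$$F = \frac{(S_1 - S_0)/(p - q)}{S_0/(n - p)}, \qquad F \sim F_{p-q,\,n-p} \text{ under } H_0.$$

- p and q typically denote the number of parameters in $H_1$ and $H_0$, respectively.
- The denominator $S_0/(n - p) = \hat{\sigma}^2$ is the adjustment for (estimator of) scale.
- $(p - q)$ is the number of linearly independent constraints imposed by $H_0$.
- If $H_0: \beta_j = 0$ for one and only one $j \in \{0, \dots, k\}$, then the F-test reduces to the usual t-test. Note: $F_{1,\,n-p} = t^2_{n-p}$.
- The F-statistic provided in summary(lm(...)) is the statistic for the omnibus test problem.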
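The relation $F_{1,\,n-p} = t^2_{n-p}$ can be checked numerically. The following is a minimal sketch on simulated data; the names n, x1, x2, y, full and reduced are illustrative assumptions and do not come from the lecture's data set.

```r
# Simulated data (hypothetical; not the lecture's data set)
set.seed(1)
n  <- 40
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 + 0.5 * x2 + rnorm(n)

full    <- lm(y ~ x1 + x2)   # model under H1, p = 3 parameters
reduced <- lm(y ~ x1)        # model under H0: beta_2 = 0, q = 2 parameters

S0 <- deviance(full)         # RSS of the model under H1
S1 <- deviance(reduced)      # RSS of the model under H0
p  <- length(coef(full))
q  <- length(coef(reduced))

# General F statistic: ((S1 - S0) / (p - q)) / (S0 / (n - p))
f.obs <- ((S1 - S0) / (p - q)) / (S0 / (n - p))

# t statistic for x2 taken from the full model's coefficient table
t.obs <- summary(full)$coefficients["x2", "t value"]

# With a single constraint the two statistics agree: F = t^2
c(F = f.obs, t.squared = t.obs^2)
```

Because only one constraint is imposed here, p - q = 1 and the F statistic equals the squared t statistic up to rounding error.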
M0 = lm(L ~ ., data = dat)
summary(M0)
S0 = deviance(M0)   # RSS of the full model, used in the F statistic
Information for the null model

- For $H_0: \beta_1 = \beta_2 = 0$ the model is $L_i = \beta_0 + \varepsilon_i$.
- $S_1 = 718.729$ with df = 11 is the RSS from the null model, obtained with

M1 = lm(L ~ 1, data = dat)
summary(M1)
S1 = deviance(M1)
S1
df.residual(M1)   # residual degrees of freedom of the null model
Test for the null model vs full model

The observed f value is

$$f^* = \frac{(S_1 - S_0)/2}{S_0/9} = 18.616 \quad\Rightarrow\quad p\text{-value} = P(F_{2,9} \geq 18.616) = 0.00063.$$

Thus, we reject $H_0$.
f.obs = (S1 - S0) / 2 / (S0 / 9); f.obs
1 - pf(f.obs, 2, 9) # omnibus p-value
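The same comparison can be obtained from anova() and from the F-statistic stored in summary(). The sketch below assumes a simulated data frame called sim (hypothetical, with n = 12 so that the degrees of freedom match the 11 and 9 used above); it is not the lecture's dat. The last line only re-evaluates the tail probability for the reported value 18.616.

```r
# Simulated stand-in for the lecture's data (hypothetical values)
set.seed(2)
n   <- 12
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$L <- 3 + 1.5 * sim$x1 - sim$x2 + rnorm(n)

M0 <- lm(L ~ ., data = sim)   # full model
M1 <- lm(L ~ 1, data = sim)   # null model (intercept only)

S0 <- deviance(M0)
S1 <- deviance(M1)
df.num <- df.residual(M1) - df.residual(M0)   # number of constraints, p - q
df.den <- df.residual(M0)                     # n - p

# F statistic and p-value by hand
f.obs <- ((S1 - S0) / df.num) / (S0 / df.den)
p.val <- pf(f.obs, df.num, df.den, lower.tail = FALSE)

# Same test via anova() on the nested pair of models
tab     <- anova(M1, M0)
f.anova <- tab[["F"]][2]
p.anova <- tab[["Pr(>F)"]][2]

# Omnibus F statistic as reported by summary()
f.summary <- unname(summary(M0)$fstatistic["value"])

c(by.hand = f.obs, anova = f.anova, summary = f.summary)

# Tail probability for the value reported in the notes
p.lecture <- pf(18.616, 2, 9, lower.tail = FALSE)
p.lecture   # approximately 0.00063
```

Using lower.tail = FALSE instead of 1 - pf(...) avoids loss of precision when the p-value is very small.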
Although we can drop height or weight individually, we cannot drop both variables.
- In choosing between models, statisticians have two aims:
  - to choose a simple (i.e. not too complex) model;
  - to choose a model that fits the data well.
- One way to measure the complexity of a linear regression model is the number of regression parameters, p. The greater this value, the more complex the model.
- One way to measure the closeness of fit of the model to the data is the residual sum of squares (RSS).
- Think of model comparison like shopping: is it worth spending more (parameters) in order to get a better (fitting) model?
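The trade-off between complexity and fit can be made concrete by tabulating p and RSS for a sequence of nested models. This is a small sketch on simulated data; the data frame d and its columns are assumptions made for illustration only.

```r
# Simulated data in which only x1 truly affects y (hypothetical)
set.seed(3)
n <- 50
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 1 + 2 * d$x1 + rnorm(n)

# Nested models of increasing complexity
fits <- list(
  lm(y ~ 1, data = d),
  lm(y ~ x1, data = d),
  lm(y ~ x1 + x2, data = d),
  lm(y ~ x1 + x2 + x3, data = d)
)

# Complexity (number of parameters p) against closeness of fit (RSS)
tradeoff <- data.frame(
  p   = sapply(fits, function(m) length(coef(m))),
  RSS = sapply(fits, deviance)
)
tradeoff
```

For nested least-squares fits the RSS can never increase when a parameter is added, so RSS alone always favours the largest model; the question is whether the reduction is worth the extra parameter.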
- Let m denote any subset of $p_m$ distinct elements from $\{1, \dots, p\}$.
  Remark: Typically the intercept is forced to be part of the model.
- Let M denote a set of linear regression models for the relationship between Y and X.
  Remark: Often M is reduced by preselection.

Example

- There are $2^4 = 16$ distinct subsets of $\{1, 2, 3, 4\}$: $\emptyset, \{1\}, \{2\}, \{1, 2\}, \{3\}, \dots, \{1, 2, 3, 4\}$.
- If the intercept is forced to be part of the model, then there are $2^{4-1} = 2^k = 8$ possible subsets.
- In the ‘worst’ case there are $2^k$ (or $2^p$, $p = k + 1$, if the intercept can be excluded) possible submodels in M if no preselection of models takes place.
- This is computationally (and manually) expensive.

Example

- There are $2^{13+1} = 16{,}384$ possible regression models.
- Remark: A search scheme that considers only a quadratically growing number (in p) of models, i.e. $p^2 = 14^2 = 196$, would fit roughly 84 times fewer models than a full search ($16{,}384 / 196 \approx 84$).
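The counts above can be reproduced by enumerating the subsets directly. The sketch below uses p = 4 elements with the first one playing the role of the intercept; the column names are illustrative assumptions.

```r
p <- 4          # number of candidate parameters, intercept included
k <- p - 1      # number of slope parameters

# Each row is one subset: TRUE means the element is part of the model
subsets <- expand.grid(rep(list(c(FALSE, TRUE)), p))
names(subsets) <- c("intercept", paste0("x", 1:k))
nrow(subsets)                      # 2^p subsets

# Keep only the subsets in which the intercept is forced into the model
forced <- subsets[subsets$intercept, ]
nrow(forced)                       # 2^k subsets

# Full search versus a quadratically growing search for 13 predictors
n.full <- 2^(13 + 1)
n.quad <- (13 + 1)^2
c(full = n.full, quadratic = n.quad, ratio = n.full / n.quad)
```

expand.grid() is convenient for small p only; the number of rows doubles with every additional candidate, which is exactly why exhaustive search becomes impractical.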
... or Bushwalking in M
Automated variable selection procedures ‘walk along’ the following ‘path’:
- Data on production of cheddar cheese from the LaTrobe Valley of Victoria.
- Taste of the final product is related to the concentration of several chemicals in the cheese.
- n = 30 samples of cheese were tasted by experts, and the following four variables were recorded:
taste    Tasters’ ratings
Acetic   Acetic acid in the cheese
H2S      Hydrogen sulphide in the cheese
Lactic   Lactic acid in the cheese
- It is of interest to the manufacturers to relate the cheese’s taste to the ‘chemical’ variables.
- Therefore we construct a multiple linear regression model of taste on the other variables.
- Variable selection will allow us to produce a parsimonious model.
- Backward variable selection starts with the full model (i.e. with all predictors).
- Let us have a look at the data first; then we will run a backward selection based on the F-test, deleting the least significant variable as long as $p_{out} > 5\%$.
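Backward selection with the F-test can be sketched as a short loop around drop1(). The example below runs on simulated data rather than the cheese data, and it reads $p_{out}$ as the largest p-value among the variables still in the model, which is an interpretation; the names d, alpha.out and model are hypothetical.

```r
# Simulated data: only x1 has a real effect on y (hypothetical)
set.seed(4)
n <- 30
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 2 + 3 * d$x1 + rnorm(n)

alpha.out <- 0.05                        # deletion threshold (5%)
model <- lm(y ~ x1 + x2 + x3, data = d)  # start from the full model

repeat {
  terms.left <- attr(terms(model), "term.labels")
  if (length(terms.left) == 0) break     # nothing left to delete

  # F-test for deleting each remaining variable, one at a time
  tab <- drop1(model, test = "F")
  pv  <- tab[["Pr(>F)"]][-1]             # first row is "<none>"
  names(pv) <- rownames(tab)[-1]

  if (max(pv) <= alpha.out) break        # every variable is significant
  worst <- names(pv)[which.max(pv)]      # least significant variable
  model <- update(model, as.formula(paste(". ~ . -", worst)))
}

formula(model)                           # selected model
```

Each pass refits the model after a single deletion, so the p-values are always computed conditional on the variables that remain.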