Lecture 10: Model Selection and Comparison in Regression Analysis, Lecture notes of Statistics

[Week 11] Model Selection -- F test, Backward, Forward and Stepwise Variable Selection

Typology: Lecture notes

2018/2019

Uploaded on 06/15/2019

kefart

Lecture 10: Model Selection

Outline

Lecture 10: Model Selection

- The general F-test
- Model Selection
- Backward Variable Selection
- Forward Variable Selection
- Stepwise Variable Selection
- Akaike information criterion
- Bayesian information criterion

Example – Two explanatory variables

For the multiple regression model

$$Y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik} + \varepsilon_i, \qquad \varepsilon_i \sim \mathrm{NID}(0, \sigma^2)$$

we sometimes want to test a hypothesis that constrains (‘fixes values’) several parameters simultaneously. Thus, possible models are:

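$$
\begin{aligned}
Y_i &= \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \varepsilon_i, \text{ or} \\
Y_i &= \beta_0 + \beta_2 x_{i2} + \varepsilon_i, \text{ or} \\
Y_i &= \beta_0 + \beta_1 x_{i1} + \varepsilon_i, \text{ or} \\
Y_i &= \beta_0 + \varepsilon_i
\end{aligned}
$$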

Example: k ≥ 2, set two parameter values to zero

- Setting two parameter values to zero, i.e.

  $$H_0: \beta_1 = \beta_2 = 0 \quad \text{versus} \quad H_1: \beta_1 \neq 0 \lor \beta_2 \neq 0.$$

- Example: k ≥ 2, all (slope) parameter values set to zero

  $$H_0: \beta_1 = \dots = \beta_k = 0 \quad \text{versus} \quad H_1: \exists\, \beta_j \neq 0,\ j = 1, \dots, k.$$

- This test problem is also known as the omnibus test problem or overall test.
- We can test each of the constraints separately using t-tests, but how do we combine the outcomes to get a single p-value for the combined claim?
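One way to answer the last question is a single F-test that compares the residual sums of squares of the two nested models. The following is a sketch of the statistic, with notation assumed from the remarks on the F-statistic in these notes: $S_1$ and $S_0$ are the RSS under $H_0$ and $H_1$, $q$ and $p$ are the numbers of parameters under $H_0$ and $H_1$, and $n$ is the sample size.

```latex
F = \frac{(S_1 - S_0)/(p - q)}{S_0/(n - p)}
\;\sim\; F_{p-q,\; n-p} \quad \text{under } H_0 .
```

Large values of $F$ speak against $H_0$, so the p-value is the upper tail probability $P(F_{p-q,\,n-p} \geq f^*)$ at the observed value $f^*$.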

Theory – Remarks on the F-statistic

- $p$ and $q$ typically denote the number of parameters in $H_1$ and $H_0$, respectively.
- The denominator $S_0/(n - p) = \hat{\sigma}^2$ is the adjustment for (estimator of) scale.
- $(p - q)$ is the number of linearly independent constraints imposed by $H_0$.
- If $H_0: \beta_j = 0$ for one and only one $j \in \{0, \dots, k\}$ then the F-test just reduces to the usual t-test. Note: $F_{1,n-p} = t_{n-p}^2$.
- The F-statistic provided in summary(lm(...)) is the statistic for the omnibus test problem.
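The identity $F_{1,n-p} = t_{n-p}^2$ can be checked numerically. The sketch below uses R's built-in mtcars data as a stand-in (it is not a data set from this lecture), and the choice of variables is arbitrary.

```r
# Sketch: check F_{1, n-p} = t_{n-p}^2 on the built-in mtcars data
# (an assumed stand-in, not one of the lecture's data sets).
full    <- lm(mpg ~ wt + hp, data = mtcars)  # H1: p = 3 parameters
reduced <- lm(mpg ~ wt, data = mtcars)       # H0: coefficient of hp is 0

# t statistic and p-value for hp from the full model's coefficient table
t_hp <- coef(summary(full))["hp", "t value"]
p_t  <- coef(summary(full))["hp", "Pr(>|t|)"]

# F statistic and p-value from the F-test comparing the nested models
cmp  <- anova(reduced, full)
f_hp <- cmp$F[2]
p_f  <- cmp[["Pr(>F)"]][2]

c(t_squared = t_hp^2, F = f_hp)  # the two statistics agree
c(p_t = p_t, p_F = p_f)          # and so do the p-values
```

Because only one constraint is imposed, the squared t statistic and the F statistic carry the same information.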

Example – Catheter data

M0 = lm(L ~ ., data = dat)
summary(M0)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  21.0084     8.7512   2.401   0.0399 *
H             0.1964     0.3606   0.545   0.
W             0.1908     0.1652   1.155   0.

Residual standard error: 3.943 on 9 degrees of freedom
Multiple R-squared: 0.8053, Adjusted R-squared: 0.
F-statistic: 18.62 on 2 and 9 DF, p-value: 0.

Example – Catheter data

Information for the null model

- For $H_0: \beta_1 = \beta_2 = 0$ the model is $L_i = \beta_0 + \varepsilon_i$.
- $S_1 = 718.729$ with df = 11 is the RSS from the null model, obtained with

M1 = lm(L ~ 1, data = dat)
summary(M1)

            Estimate Std. Error t value Pr(>|t|)
(Intercept)   36.208      2.333   15.52 7.97e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.083 on 11 degrees of freedom

S1 = deviance(M1)
S1
[1] 718.

M1$df
[1] 11

Example – Catheter data

Test for the null model vs full model

The observed f value is

$$f^* = \frac{(S_1 - S_0)/2}{S_0/9} = 18.616 \quad \Rightarrow \quad p\text{-value} = P(F_{2,9} \geq 18.616) = 0.00063.$$

Thus, we reject $H_0$.

S0 = deviance(M0)   # RSS of the full model
f.obs = (S1 - S0) / 2 / (S0 / 9); f.obs
[1] 18.

1 - pf(f.obs, 2, 9) # omnibus p-value
[1] 0.

Although we can drop either height or weight individually (neither t-test in the full model is significant), we cannot drop both variables at once.
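The hand computation above generalises to any pair of nested linear models. The following is a minimal sketch under stated assumptions: `general_f_test` is a hypothetical helper name, and the built-in mtcars data stand in for the catheter data, which are not bundled with R.

```r
# Hypothetical helper: general F-test for two nested lm fits.
general_f_test <- function(reduced, full) {
  S1  <- deviance(reduced)                         # RSS under H0
  S0  <- deviance(full)                            # RSS under H1
  df1 <- df.residual(reduced) - df.residual(full)  # p - q constraints
  df2 <- df.residual(full)                         # n - p
  f_obs <- ((S1 - S0) / df1) / (S0 / df2)
  p_val <- pf(f_obs, df1, df2, lower.tail = FALSE)
  c(F = f_obs, df1 = df1, df2 = df2, p.value = p_val)
}

# Stand-in example: omnibus test of mpg ~ wt + hp against the null model
full <- lm(mpg ~ wt + hp, data = mtcars)
null <- lm(mpg ~ 1, data = mtcars)
res  <- general_f_test(null, full)
res

# The same test is available directly as anova(null, full), and the
# omnibus F statistic is also stored in summary(full)$fstatistic.
anova(null, full)
```

Using `pf(..., lower.tail = FALSE)` instead of `1 - pf(...)` gives the same upper tail probability and avoids cancellation when the p-value is very small.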

Remarks – General thoughts on choosing between models

- In choosing between models, statisticians have two aims:
  - to choose a simple (i.e. not too complex) model;
  - to choose a model that fits the data well.
- One way to measure the complexity of a linear regression model is by the number of regression parameters, p. The greater this value, the more complex the model.
- One way to measure the closeness of fit of the model to the data is the residual sum of squares (RSS).
- Think of model comparison like shopping: is it worth spending more (parameters) in order to get a better (fitting) model?

Theory – Possible subsets

- Let m denote any subset of $p_m$ distinct elements from $\{1, \dots, p\}$. Remark: Typically the intercept is forced to be part of the model.
- Let M denote a set of linear regression models for the relationship between Y and X. Remark: Often M is reduced by preselection.

Example

- There are $2^4 = 16$ distinct subsets of $\{1, 2, 3, 4\}$: $\emptyset, \{1\}, \{2\}, \{1, 2\}, \{3\}, \dots, \{1, 2, 3, 4\}$.
- If the intercept is forced to be part of the model, then there are $2^{4-1} = 2^k = 8$ possible subsets (here p = 4, so k = 3).
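The count of $2^k = 8$ submodels can be made concrete by listing them. This is a small sketch; the covariate names `x1`, `x2`, `x3` and the response name `y` are placeholders, not variables from the lecture.

```r
# Enumerate all submodels for k = 3 covariates, intercept always included.
covariates <- c("x1", "x2", "x3")  # placeholder names
k <- length(covariates)

subsets <- list(character(0))      # empty subset: intercept-only model
for (size in seq_len(k)) {
  subsets <- c(subsets, combn(covariates, size, simplify = FALSE))
}

# Turn each subset into a model formula string with placeholder response y
formulas <- vapply(
  subsets,
  function(s) {
    if (length(s) == 0) "y ~ 1" else paste("y ~", paste(s, collapse = " + "))
  },
  character(1)
)

print(formulas)
length(formulas)  # 2^k = 8
```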

Theory – Powerset

- In the 'worst' case there are $2^k$ (or $2^p$, $p = k + 1$, if the intercept can be excluded) possible submodels in M if no preselection of models takes place.
- This is computationally/manually intensive.

Example

- With k = 13 covariates and an optional intercept, there are $2^{13+1} = 16{,}384$ possible regression models.
- Remark: A search scheme that considers only a quadratically growing number (in p) of models, i.e. $p^2 = 14^2 = 196$, would be more than 80 times faster than a full search ($16{,}384 / 196 \approx 84$)!

Theory – Automated variable selection algorithms...

... or Bushwalking in M

Automated variable selection procedures ‘walk along’ the following ‘path’:

  1. Choose a model to start with, e.g.
     - the model with no covariates (null model),
     - or the model with all covariates included (full model).
  2. Test to see if there is an advantage in adding or removing covariates.
  3. Repeat adding/removing variables until there is no advantage in changing the model.

Such a strategy requires visiting only a quadratic number of models (in the number of covariates)!
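To see where the quadratic order comes from, consider one simple variant as an illustration: backward elimination from the full model with $k$ covariates, assuming exactly one variable is removed per step and no removed variable re-enters. Step 1 compares $k$ candidate deletions, step 2 compares $k - 1$, and so on, so the number of candidate models fitted beyond the starting model is at most

```latex
k + (k - 1) + \dots + 1 = \frac{k(k+1)}{2},
\qquad \text{e.g. } k = 13:\quad \frac{13 \cdot 14}{2} = 91 \;\ll\; 2^{13} = 8192 .
```

Procedures that allow both adding and removing variables can visit more models than this bound, so the count applies only to the one-directional variant assumed here.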

Example – Cheese tasting data

- Data on production of cheddar cheese from the LaTrobe Valley of Victoria.
- Taste of the final product is related to the concentration of several chemicals in the cheese.
- n = 30 samples of cheese were tasted by experts, and the following four variables recorded:

taste    Tasters' ratings
Acetic   Acetic acid in cheese
H2S      Hydrogen sulphide in cheese
Lactic   Lactic acid in the cheese

Example – Cheese data: Backward selection

- It is of interest to the manufacturers to relate the cheese's taste to the 'chemical' variables.
- Therefore we construct a multiple linear regression model of taste on the other variables.
- Variable selection will allow us to produce a parsimonious model.
- Backward variable selection starts with the full model (i.e. with all predictors).
- Let us have a look at the data first. We will then run a backward selection based on the F-test, deleting the least significant variable as long as its p-value exceeds $p_{\text{out}} = 5\%$.
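The backward procedure just described can be sketched in a few lines of R. This is an illustrative sketch, not the lecture's own code: `backward_f` is a hypothetical helper, and the built-in mtcars data with an arbitrary set of predictors stand in for the cheese data, which are not bundled with R.

```r
# Sketch of backward elimination by F-test with threshold p_out = 5%.
# backward_f is a hypothetical helper; mtcars is an assumed stand-in data set.
backward_f <- function(fit, p_out = 0.05) {
  repeat {
    # Stop if only the intercept is left
    if (length(attr(terms(fit), "term.labels")) == 0) break

    tab      <- drop1(fit, test = "F")  # F-test for deleting each term
    pvals    <- tab[["Pr(>F)"]][-1]     # first row is "<none>"
    terms_in <- rownames(tab)[-1]

    # Stop once every remaining term is significant at level p_out
    if (max(pvals) <= p_out) break

    worst <- terms_in[which.max(pvals)]  # least significant variable
    fit   <- update(fit, as.formula(paste(". ~ . -", worst)))
  }
  fit
}

full  <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
final <- backward_f(full)
formula(final)
drop1(final, test = "F")  # all remaining p-values are at most p_out
```

With the cheese data loaded into a data frame, the same helper could be started from the full model of taste on Acetic, H2S and Lactic. R's built-in `step()` offers a related automated search, but it selects by AIC rather than by F-test p-values.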