Model Selection: Association vs. Prediction in Biostatistics

This handout gives guidelines for model selection when the goal is association and when the goal is prediction, in the context of biostatistics and epidemiology. It covers the relative importance of regression coefficient estimates, variable selection, automated procedures, the area under the ROC curve, and goodness of fit tests. It also describes model selection strategies for association (confirmatory and exploratory) and for prediction, and what to present in each case, including suggestions for future studies.


Uploaded on 03/10/2009

Biostat/Epi 536
October 21, 2008
Is each of the following important...

                                     for ASSOCIATION?   for PREDICTION?
regression coefficient estimates     Yes                No
what variables are in the model      Yes                No*
the use of automated procedures      No                 Yes
the area under the ROC curve         No                 Yes
goodness of fit tests                No                 Yes

* Once we’ve chosen the set of variables that we want to consider, we don’t care which of them are actually included in the final model. However, the choice of this set of variables may depend on what data is available or what type of data we want to use in our prediction.

Model Selection for ASSOCIATION:

Confirmatory [can be used to test hypotheses]
Exploratory [can generate hypotheses, but cannot test them]

Main Association Variable
  Confirmatory:
  - include whether or not significant
  - include in form consistent with prior hypothesis
  Exploratory:
  - include whether or not significant
  - include in best-fitting form

Interactions with Main Association Variable
  Confirmatory:
  - include only if specified in prior hypothesis
  - include only in form consistent with prior hypothesis
  Exploratory:
  - include only if specified in prior hypothesis
  - explore functional forms and choose one that fits well [but include main effects if interaction is included]

Adjustment Variables (e.g. confounders, precision variables)
  Confirmatory:
  - include only if specified in prior hypothesis
  - include in as rich a form as possible to minimize residual confounding
  - statistical significance is irrelevant
  Exploratory:
  - examine possible confounders to see if controlling for them changes coefficient of interest
  - examine different forms, and choose richer model when there is a difference in the coefficient of interest**

Presentation
  Confirmatory:
  - explanation of what was controlled, and how
  - adjusted odds ratios, with CIs
  - possibly unadjusted odds ratios, with CIs
  - sometimes partially adjusted odds ratios
  Exploratory:
  - explanation of what was controlled, and how
  - adjusted odds ratios, with CIs
  - possibly unadjusted odds ratios, with CIs
  - sometimes partially adjusted odds ratios
  - brief description of model selection and suggestions for future studies of hypotheses that were generated

** F-tests should only be used to compare functional form (e.g. splines versus linear) if the splines are created with the main association variable. If they are created with an adjustment variable, then the choice of form should be made based on the coefficient of interest only. If this coefficient is different with different forms of adjustment, choose the richer one.
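
The exploratory guidance on adjustment variables (check whether controlling for a possible confounder changes the estimate of interest, then report unadjusted and adjusted odds ratios with CIs) can be illustrated with a small sketch. The handout would presumably do this with regression models; to stay self-contained, this example swaps in stratified 2x2 tables and the Mantel-Haenszel estimator instead. All counts are invented for illustration, and the function names are hypothetical.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio with a Woolf (log-scale) 95% CI for one 2x2 table.

    a: exposed cases, b: exposed non-cases,
    c: unexposed cases, d: unexposed non-cases.
    """
    or_hat = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(or_hat) - z * se_log)
    upper = math.exp(math.log(or_hat) + z * se_log)
    return or_hat, lower, upper

def mantel_haenszel_or(strata):
    """Mantel-Haenszel odds ratio pooled over strata of (a, b, c, d) tables."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den

# Invented counts: one table per level of a possible confounder.
strata = [
    (10, 90, 40, 360),   # confounder absent
    (160, 240, 40, 60),  # confounder present
]

# Unadjusted (crude) table: collapse over the confounder.
crude_table = tuple(sum(t[i] for t in strata) for i in range(4))
crude_or, crude_lo, crude_hi = odds_ratio_ci(*crude_table)
adjusted_or = mantel_haenszel_or(strata)

print(f"unadjusted OR = {crude_or:.2f} (95% CI {crude_lo:.2f} to {crude_hi:.2f})")
print(f"adjusted OR   = {adjusted_or:.2f}")
```

In these invented data the stratum-specific odds ratios are both 1, so the adjusted estimate differs sharply from the unadjusted one. That is the kind of change in the estimate of interest that would argue for keeping the adjustment variable. How large a change matters is a judgment call that the handout does not quantify.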



Model Selection for PREDICTION:

1. Decide how you’re going to validate your model. If you decide to split your data into two samples (one for model-building and one for validation), split it and don’t look at the validation data again until after you’ve chosen a model.

  • make sure you split your data randomly, i.e. only split at the middle record if you are sure the data have been entered randomly
  • generate a random variable and sort based on that to be sure that order is random
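
The "generate a random variable and sort" advice can be sketched as follows. This is a minimal illustration, not the course's own code: the function name, the seed, and the 50/50 split are assumptions, and `records` stands in for whatever rows the dataset holds.

```python
import random

def random_split(records, seed=536, build_fraction=0.5):
    """Split records into model-building and validation samples.

    Each record gets a random key, the records are sorted on that key,
    and the sorted list is cut at the requested fraction. A fixed seed
    (an arbitrary choice here) makes the split reproducible.
    """
    rng = random.Random(seed)
    keyed = sorted((rng.random(), i) for i in range(len(records)))
    cut = int(len(records) * build_fraction)
    build = [records[i] for _, i in keyed[:cut]]
    validation = [records[i] for _, i in keyed[cut:]]
    return build, validation

# Stand-in data: record IDs in entry order, which may not be random.
records = list(range(100))
build, validation = random_split(records)
print(len(build), len(validation))
```

Because the cut is made after sorting on the random key, the original entry order of the data no longer matters.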

2. Choose the set of variables that you’ll consider for inclusion in your model. Decide whether or not you want to look at interactions and higher-order terms.

  • could be limited by outside criteria
  • could be based either on relationships in the data or on a priori hypotheses
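
To make the size of the candidate set concrete, the sketch below builds a list of candidate terms from a few variables. The variable names and the `a:b` and `v^2` naming convention are hypothetical, chosen only for illustration.

```python
from itertools import combinations

def candidate_terms(variables, interactions=True, squares=()):
    """List candidate terms: main effects, optional pairwise
    interactions ("a:b"), and optional squared terms ("v^2")."""
    terms = list(variables)
    if interactions:
        terms += [f"{a}:{b}" for a, b in combinations(variables, 2)]
    terms += [f"{v}^2" for v in squares]
    return terms

terms = candidate_terms(["age", "bmi", "smoker"], squares=["age"])
n_possible_models = 2 ** len(terms)  # every term is either in or out
print(terms)
print(n_possible_models)
```

Three variables already give seven candidate terms and 128 possible models once interactions and one squared term are allowed, so the choice made at this step largely determines which selection procedures are practical in the next step.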

3. Determine what procedure(s) you’ll use to choose from among these variables.

  • small number of variables: compare AIC (or BIC) for all possible combinations, and choose the model with the lowest AIC (or BIC)
  • larger number of variables: use automated procedure(s), e.g. forward selection, backward selection, forward stepwise, backward stepwise
  • perhaps try more than one, and compare results
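
For the small-number-of-variables case, comparing AIC over all possible combinations can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the course's procedure: the data are simulated (only `x1` truly affects the outcome), the logistic model is fitted with a hand-written Newton-Raphson routine, and all names are hypothetical. In practice a statistics package would do the fitting.

```python
import itertools
import math
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def logistic_loglik(X, y, iters=25):
    """Fit logistic regression by Newton-Raphson; return the maximized log-likelihood."""
    p = len(X[0])
    beta = [0.0] * p
    for _ in range(iters):
        mu = [1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, row)))) for row in X]
        grad = [sum((yi - mi) * row[j] for yi, mi, row in zip(y, mu, X)) for j in range(p)]
        hess = [[sum(mi * (1 - mi) * row[j] * row[k] for mi, row in zip(mu, X))
                 for k in range(p)] for j in range(p)]
        beta = [b + s for b, s in zip(beta, solve(hess, grad))]
    eta = [sum(b * v for b, v in zip(beta, row)) for row in X]
    return sum(yi * e - math.log1p(math.exp(e)) for yi, e in zip(y, eta))

def aic_all_subsets(data, y, names):
    """Return (AIC, subset) for every subset of names, lowest AIC first.

    AIC = 2k - 2 log L, where k counts the intercept plus the subset's terms.
    """
    results = []
    for size in range(len(names) + 1):
        for subset in itertools.combinations(names, size):
            X = [[1.0] + [row[v] for v in subset] for row in data]
            k = len(subset) + 1
            results.append((2 * k - 2 * logistic_loglik(X, y), subset))
    return sorted(results)

# Simulated data (an assumption for illustration): x2 and x3 are pure noise.
rng = random.Random(536)
names = ["x1", "x2", "x3"]
data = [{v: rng.gauss(0, 1) for v in names} for _ in range(300)]
y = [1 if rng.random() < 1 / (1 + math.exp(-1.5 * row["x1"])) else 0 for row in data]

results = aic_all_subsets(data, y, names)
best_aic, best_subset = results[0]
for aic, subset in results:
    print(f"{aic:8.2f}  {subset}")
```

Replacing `2 * k` with `math.log(len(y)) * k` would rank the same models by BIC instead. The number of models doubles with each added term, which is why the handout reserves this approach for a small number of variables.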

4. Find the best model using your chosen procedure(s).

  • add back in main effects for interactions and higher-order terms even if an automated procedure takes them out (or, to avoid this problem, group terms with parentheses in your Stata command)
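
If an automated procedure has already dropped a main effect, the add-back rule can be applied as a post-processing step. The sketch below assumes a hypothetical naming convention (`a:b` for an interaction, `v^2` for a squared term) and is independent of any particular statistics package.

```python
def enforce_hierarchy(selected):
    """Return the selected terms plus any main effects that an
    interaction ("a:b") or squared term ("v^2") depends on."""
    terms = list(selected)
    for term in selected:
        if ":" in term:
            parents = term.split(":")
        elif "^" in term:
            parents = [term.split("^")[0]]
        else:
            parents = []
        for parent in parents:
            if parent not in terms:
                terms.append(parent)
    return terms

# Hypothetical output of a stepwise procedure that dropped some main effects.
selected = ["age:smoker", "bmi^2", "sex"]
final_terms = enforce_hierarchy(selected)
print(final_terms)
```

Grouping terms in the selection command, as the handout suggests, avoids the problem up front. A check like this one only repairs the term list afterwards, and the model still has to be refitted with the restored terms.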

5. Plot an ROC curve, compute the area under it, and calculate goodness of fit tests, using a validation dataset or resampling methods.

  • either use the validation data that you set aside before choosing your model or use resampling methods such as the bootstrap [you don’t need to know how to do this for this class]
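
A minimal sketch of the ROC computation on a validation sample follows. It assumes the chosen model has already produced a predicted probability for each validation record; the four outcome/score pairs are invented, and the function names are hypothetical. Both outcome classes must be present in the validation data.

```python
def roc_points(y_true, scores):
    """ROC curve as (false positive rate, true positive rate) points,
    lowering the threshold through each distinct score in turn."""
    pairs = sorted(zip(scores, y_true), reverse=True)
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Process all records tied at this score together.
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc_trapezoid(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Invented validation outcomes and predicted probabilities.
y_valid = [0, 0, 1, 1]
p_valid = [0.1, 0.4, 0.35, 0.8]

points = roc_points(y_valid, p_valid)
auc = auc_trapezoid(points)
print(points)
print(auc)
```

The `points` list is what would be plotted (false positive rate on the horizontal axis, true positive rate on the vertical axis). Computing it from records that were not used for model building is what keeps the reported area from being optimistic.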

6. Present your data as discussed in class.

  • describe the process you used to choose your model
  • present the results of goodness of fit tests, preferably based on a validation sample or resampling methods
  • present the ROC curve and the calculated area under it, preferably based on a validation sample or resampling methods
  • in general, do not report p-values from your model because they will be artificially low
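
The warning about artificially low p-values can be illustrated with a small simulation. This is a simplified sketch, not the handout's setting: the outcome is continuous rather than binary, every predictor is pure noise, "selection" just keeps the predictor most correlated with the outcome, and the p-value uses the Fisher z approximation. The sample size, number of predictors, number of simulations, and seed are all assumptions.

```python
import math
import random

def corr(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def approx_p_value(r, n):
    """Two-sided p-value for a correlation via the Fisher z approximation."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))

def selected_predictor_rejection_rate(n=50, k=10, sims=500, seed=536):
    """Fraction of simulations in which the best of k noise predictors
    has a nominal p-value below 0.05."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(sims):
        y = [rng.gauss(0, 1) for _ in range(n)]
        best_r = max(
            (corr([rng.gauss(0, 1) for _ in range(n)], y) for _ in range(k)),
            key=abs,
        )
        if approx_p_value(best_r, n) < 0.05:
            hits += 1
    return hits / sims

rate = selected_predictor_rejection_rate()
print(rate)
```

No predictor here is related to the outcome, so a valid test would reject about 5% of the time. With 10 roughly independent candidates, the chance that at least one has a nominal p-value below 0.05 is about 1 - 0.95^10, or roughly 0.40, so the reported p-value of a selected variable overstates the evidence.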