Model Building and Variable Selection for Binomial Data: A Comprehensive Approach, Study notes of Mathematics

An in-depth exploration of model selection strategies for binomial data, focusing on the balance between model complexity and interpretability. The notes cover indications of collinearity and numerical instability, model building strategies using univariate analysis and multiple logistic regression, and automated variable selection. They also emphasize the importance of assessing assumptions and identifying confounding variables.

Uploaded on 07/30/2009

koofers-user-jbt

Binomial Data - continued

2.7 Model Building and Variable Selection

Note: These notes are a revision of P. K. Choudhary's lecture notes at the University of Texas at Dallas. I have edited them for our class and have included a section on automated variable selection.

MODEL SELECTION

Competing goals:

  • Should be complex enough to fit the data well.
  • Should be simple to interpret; it should smooth the data rather than overfit it.

Issue: How to select a parsimonious (simple) model that fits the data well?

  • Unrealistic to hope to find the true model for a real dataset.
  • Part science, part statistics, part experience and part common sense.
  • Fewer parameters lead to more precise estimates.
  • Watch out for collinearity, which shows up as correlation among the estimated coefficients. If two covariates are highly correlated, we do not need both of them in the model.

Indications of collinearity:

  • Large standard errors.
  • Look at the correlation matrix of the estimated coefficients. In R, use cov2cor(vcov(fit)), where fit contains the glm fit.
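The conversion from a covariance matrix of the estimated coefficients to a correlation matrix can be sketched as follows. This is a minimal illustration in Python rather than R; the covariance values are hypothetical, and the 0.8 cutoff used for flagging is an assumption for the example, not a rule from these notes.

```python
import math

def cov_to_cor(cov):
    """Convert a covariance matrix (list of lists) to a correlation matrix."""
    n = len(cov)
    sd = [math.sqrt(cov[i][i]) for i in range(n)]  # standard errors
    return [[cov[i][j] / (sd[i] * sd[j]) for j in range(n)] for i in range(n)]

def flag_pairs(cor, threshold=0.8):
    """Return index pairs (i, j), i < j, whose |correlation| meets the threshold."""
    n = len(cor)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if abs(cor[i][j]) >= threshold]

# Hypothetical covariance matrix for the coefficients (intercept, x1, x2).
vcov = [
    [0.40, -0.10, -0.05],
    [-0.10, 0.25, 0.18],
    [-0.05, 0.18, 0.16],
]

cor = cov_to_cor(vcov)
flagged = flag_pairs(cor)  # assumed cutoff of 0.8
for row in cor:
    print([round(value, 3) for value in row])
print("highly correlated coefficient pairs:", flagged)
```

A flagged pair suggests that the two corresponding covariates carry overlapping information, so one of them may be dropped.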

Indications of numerical instability:

  • Error messages from the fitting program.
  • Collinearity.
  • Large standard errors.
  • Zero or near-zero cell counts.
  • Complete or near-complete separation. Complete separation means the covariates split the responses perfectly: all zero responses occur at one set of covariate values and all one responses at another, with no overlap between the two. The MLE does not exist in this case.
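For a single continuous covariate, complete separation can be checked directly by comparing the ranges of the covariate in the two response groups. The function below is a hypothetical helper written for illustration; it handles one covariate only, whereas separation in a multiple model can occur along a linear combination of covariates and needs a more general check.

```python
def completely_separated(x, y):
    """Return True if covariate x perfectly splits the 0/1 responses y."""
    x0 = [xi for xi, yi in zip(x, y) if yi == 0]
    x1 = [xi for xi, yi in zip(x, y) if yi == 1]
    if not x0 or not x1:
        raise ValueError("both response levels must be observed")
    # No overlap: every zero lies strictly below every one, or the reverse.
    return max(x0) < min(x1) or max(x1) < min(x0)

# Hypothetical data: the first set is separated, the second overlaps.
print(completely_separated([1, 2, 3, 4], [0, 0, 1, 1]))
print(completely_separated([1, 2, 3, 4], [0, 1, 0, 1]))
```

Replacing the strict inequalities with `<=` would also flag the near-complete case in which the two groups share only a boundary value.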

Model building strategy: (seven or fewer explanatory variables)

Step 1: Use univariate analysis, one covariate at a time, to identify important covariates: the ones that are at least moderately associated with the response.

  • Analyze contingency tables for each categorical covariate. Pay particular attention to cells with low counts. May need to collapse categories in a sensible fashion.
  • Use nonparametric smoothing for each continuous covariate. Can also categorize the covariate and plot the mean response (the estimate of π) in each group against the group midpoint. To get a plot on the logit scale, plot the logit transformation of this mean response. This plot also suggests the appropriate scale for the variable.
  • One can also fit logistic regression models with one covariate at a time and analyze the fits. In particular, look at the estimated coefficients, their standard errors and the likelihood ratio test for the significance of the coefficient. Rule of thumb: select all the variables whose p-value < 0.25, along with the variables of known clinical importance.
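The single-covariate screening in the last bullet can be sketched as follows. This is a minimal illustration, assuming ungrouped 0/1 responses and a hand-written Newton-Raphson fit in place of R's glm; the data values and function names are hypothetical. The likelihood ratio statistic compares the one-covariate model with the intercept-only model and is referred to a chi-square distribution with 1 degree of freedom.

```python
import math

def log_lik(b0, b1, x, y):
    """Bernoulli log-likelihood for logit(pi) = b0 + b1 * x."""
    total = 0.0
    for xi, yi in zip(x, y):
        eta = b0 + b1 * xi
        # log(1 + exp(eta)), computed in a numerically stable way
        total += yi * eta - (max(eta, 0.0) + math.log1p(math.exp(-abs(eta))))
    return total

def fit_univariate(x, y, iters=25):
    """Newton-Raphson fit of a one-covariate logistic model; returns (b0, b1)."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p            # score for the intercept
            g1 += (yi - p) * xi     # score for the slope
            h00 += w                # information matrix entries
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

def lr_test(x, y):
    """Likelihood ratio test of the slope; returns (b1, G, p-value)."""
    b0, b1 = fit_univariate(x, y)
    n, s = len(y), sum(y)
    p_bar = s / n
    ll_null = s * math.log(p_bar) + (n - s) * math.log(1.0 - p_bar)
    g = 2.0 * (log_lik(b0, b1, x, y) - ll_null)
    # Upper tail of chi-square with 1 df: P(X > g) = erfc(sqrt(g / 2))
    p_value = math.erfc(math.sqrt(max(g, 0.0) / 2.0))
    return b1, g, p_value

# Hypothetical data with overlapping responses (no separation).
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

b1, g, p_value = lr_test(x, y)
keep = p_value < 0.25  # rule of thumb from Step 1
print("slope:", round(b1, 3), "G:", round(g, 3), "p-value:", round(p_value, 4))
print("carry forward to Step 2:", keep)
```

The fixed iteration count is a simplification; a production fit would test for convergence and guard against separation, where the estimates diverge.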

Step 2: Fit a multiple logistic regression model using the variables selected in step 1.

  • Verify the importance of each variable in this multiple model using the Wald statistic.
  • Compare the coefficient of each variable with the coefficient from the model containing only that variable.
  • Eliminate any variable that doesn't appear to be important, and fit a new model. Check if the new model is significantly different from the old model. If it is, then the deleted variable was important.
  • Repeat this process of deleting, refitting and verifying until it appears that all the important variables are included in the model. At this point, add back into the model the variables that were not selected for the original multiple model.
  • Assess the joint significance of the variables that were not selected. This step is important as it helps to identify the confounding variables. Make changes in the model, if necessary.
  • At the end, we have the preliminary main effects model - it contains the important variables.
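Two of the checks in Step 2 reduce to simple arithmetic on the fitted output. The sketch below assumes the estimate and its standard error are already available from a fitted model; the numbers are hypothetical. Using the percent change in a coefficient as the comparison between the univariate and multiple models is an illustrative choice here, not something these notes prescribe.

```python
import math

def wald_test(estimate, std_error):
    """Wald z statistic and two-sided p-value from the standard normal."""
    z = estimate / std_error
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value

def percent_change(univariate_coef, multiple_coef):
    """Percent change in a coefficient when moving to the multiple model."""
    return 100.0 * (multiple_coef - univariate_coef) / univariate_coef

# Hypothetical estimate and standard error from a multiple logistic model.
z, p_value = wald_test(estimate=0.84, std_error=0.31)
print("z:", round(z, 3), "p-value:", round(p_value, 4))

# Hypothetical coefficients for the same variable in the two models.
print("percent change:", round(percent_change(0.50, 0.40), 1))
```

A large change in a coefficient after other variables enter or leave the model is one sign that a confounding variable is involved.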

Step 2 (alternate): Build the main effects model.

  • Verify the importance of each variable in this multiple model using the change in the deviance. Begin with the full model based on the explanatory variables found in Step 1 and work backward, deleting one variable at a time. Check if the new model is significantly different from the old model. If it is, then the deleted variable was important.
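The change-in-deviance comparison for two nested models can be sketched as follows. The deviance values are hypothetical, and the chi-square tail probability is computed from the closed-form series for integer degrees of freedom so that the example needs only the standard library.

```python
import math

def chi2_sf(x, df):
    """Upper tail P(X > x) for a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k < df/2} (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= (x / 2.0) / k
            total += term
        return math.exp(-x / 2.0) * total
    # Odd df: normal tail plus a finite series
    tail = math.erfc(math.sqrt(x / 2.0))
    term = math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)
    total = 0.0
    for k in range((df - 1) // 2):
        total += term
        term *= x / (2 * k + 3)
    return tail + total

def deviance_test(dev_reduced, dev_full, df_diff):
    """Change in deviance between nested models and its p-value."""
    change = dev_reduced - dev_full
    return change, chi2_sf(change, df_diff)

# Hypothetical deviances: the reduced model drops two parameters.
change, p_value = deviance_test(dev_reduced=152.3, dev_full=145.1, df_diff=2)
print("change in deviance:", round(change, 2), "p-value:", round(p_value, 4))
```

A small p-value indicates that the reduced model fits significantly worse, so the deleted variables were important.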

Model Building Strategy: Automatic stepwise selection procedure (10 or more explanatory variables)

The use of automated explanatory variable selection is somewhat controversial. The Northeast SAS Users Paper 222-26 by Shtatland, Cain, and Barton seems to be a reasonable attempt to balance the over- and under-parameterization of models chosen by blindly applying automated selection procedures. We will apply their procedure to the Pima Indian diabetes study found in the SAS file pima logistic modelbuilding.sas.

Diagnostics: Validate your model as we have previously discussed. Model building is iterative. The previous steps may have yielded several candidate models from which to choose.