Model Building and Variable Selection for Binomial Data: A Comprehensive Approach, Study notes of Mathematics

Tennessee Technological University (TTU), Mathematics

An in-depth exploration of model selection strategies for binomial data, focusing on the balance between model complexity and interpretability. Topics include indications of collinearity and numerical instability, model building strategies using univariate analysis and multiple logistic regression, and automated variable selection. The document also emphasizes the importance of assessing assumptions and identifying confounding variables.

Typology: Study notes

Pre 2010

Uploaded on 07/30/2009

koofers-user-jbt 🇺🇸

Binomial Data - continued

2.7 Model Building and Variable Selection

Note: These notes are a revision of P. K. Choudhary's lecture notes at the University of Texas at Dallas. I have edited them for our class and have included a section on automated variable selection.

MODEL SELECTION

Competing goals:

• Should be complex enough to fit the data well.
• Should be simple to interpret: it should smooth the data rather than overfit it.

Issue: How to select a parsimonious (simple) model that fits the data well?

• It is unrealistic to hope to find the true model for a real dataset.
• Model selection is part science, part statistics, part experience and part common sense.
• Fewer parameters lead to more precise estimates.
• Watch out for collinearity, which shows up as correlation in the estimated coefficients. If two covariates are highly correlated, we do not need both of them in the model.

Indications of collinearity:

• Large standard errors.
• Look at the correlation matrix of the estimated coefficients. In R, use cov2cor(vcov(fit)), where fit contains the glm fit.
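As a rough illustration of the check described above, the sketch below rescales a covariance matrix of coefficient estimates into a correlation matrix and flags highly correlated pairs. It is written in plain Python for illustration only; the matrix values, the helper names cov_to_corr and flag_pairs, and the 0.9 threshold are all assumptions, not part of the original notes.

```python
import math

def cov_to_corr(cov):
    """Scale a covariance matrix (list of lists) to a correlation matrix."""
    sd = [math.sqrt(cov[i][i]) for i in range(len(cov))]
    return [[cov[i][j] / (sd[i] * sd[j]) for j in range(len(cov))]
            for i in range(len(cov))]

def flag_pairs(corr, threshold=0.9):
    """Return index pairs whose absolute correlation meets the threshold.

    The default threshold of 0.9 is an arbitrary, hypothetical cutoff.
    """
    n = len(corr)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if abs(corr[i][j]) >= threshold]

# Hypothetical covariance matrix of the (intercept, x1, x2) estimates.
vcov = [[0.40, -0.10, -0.08],
        [-0.10, 0.25, 0.23],
        [-0.08, 0.23, 0.25]]

corr = cov_to_corr(vcov)
flagged = flag_pairs(corr)
print(flagged)  # index pairs of coefficients that look collinear
```

A flagged pair suggests that the two corresponding covariates carry overlapping information, so one of them may be dropped.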

Indications of numerical instability:

• Error messages from the fitting program.
• Collinearity.
• Large standard errors.
• Zero or near-zero cell counts.
• Complete or near-complete separation. Complete separation means all zero responses appear at one combination of covariates and all one responses appear at another combination. There is no overlap in the covariates for the two responses. The MLE does not exist in this case.
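To make the separation problem concrete, here is a minimal sketch for the special case of a single numeric covariate. It classifies the data as completely separated, near-completely separated, or overlapping, and shows that under complete separation the log-likelihood keeps increasing as the slope grows, which is why no finite MLE exists. The function names and the toy data are hypothetical.

```python
import math

def separation_status(x, y):
    """Classify one numeric covariate as 'complete', 'near-complete' or 'none'."""
    x0 = [xi for xi, yi in zip(x, y) if yi == 0]
    x1 = [xi for xi, yi in zip(x, y) if yi == 1]
    if not x0 or not x1:
        raise ValueError("both response values must be present")
    if max(x0) < min(x1) or max(x1) < min(x0):
        return "complete"
    if max(x0) == min(x1) or max(x1) == min(x0):
        return "near-complete"  # the two groups touch only at a boundary value
    return "none"

def log_sigmoid(z):
    """Numerically stable log(1 / (1 + exp(-z)))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def log_lik(slope, x, y, cut):
    """Log-likelihood of the model logit(pi) = slope * (x - cut)."""
    return sum(log_sigmoid(slope * (xi - cut)) if yi == 1
               else log_sigmoid(-slope * (xi - cut))
               for xi, yi in zip(x, y))

# Hypothetical toy data: every 0 lies below 3.5 and every 1 above it.
x = [1, 2, 3, 4, 5, 6]
y = [0, 0, 0, 1, 1, 1]
print(separation_status(x, y))
for slope in (1, 5, 25):
    # The log-likelihood creeps toward 0 as the slope increases.
    print(slope, log_lik(slope, x, y, cut=3.5))
```

With several covariates the same idea applies to linear combinations of the covariates, which this one-dimensional check does not cover.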


Model building strategy (seven or fewer explanatory variables):

Step 1: Use univariate analysis to identify important covariates - the ones that are at least moderately associated with response - one covariate at a time.

• Analyze contingency tables for each categorical covariate. Pay particular attention to cells with low counts. You may need to collapse categories in a sensible fashion.
• Use nonparametric smoothing for each continuous covariate. One can also categorize the covariate and plot the mean response (the estimate of π) in each group against the group mid-point. To get a plot on the logit scale, plot the logit transformation of this mean response. This plot also suggests the appropriate scale of the variable.
• One can also fit logistic regression models with one covariate at a time and analyze the fits. In particular, look at the estimated coefficients, their standard errors and the likelihood ratio test for the significance of the coefficient. Rule of thumb: select all the variables whose p-value < 0.25, along with the variables of known clinical importance.
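The one-covariate-at-a-time screening in Step 1 can be sketched as follows. This is a minimal illustration in plain Python, assuming a single numeric covariate, a hand-rolled Newton-Raphson fit, and a made-up dataset; in practice one would use a GLM routine such as R's glm. The function names are hypothetical.

```python
import math

def fit_logistic_1d(x, y, iters=25):
    """Newton-Raphson fit of logit(pi) = b0 + b1 * x.

    Returns (b0, b1, log-likelihood). Assumes the data are not separated.
    """
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p            # score for the intercept
            g1 += (yi - p) * xi     # score for the slope
            h00 += w                # information matrix entries
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    loglik = 0.0
    for xi, yi in zip(x, y):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
        loglik += yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return b0, b1, loglik

def null_loglik(y):
    """Log-likelihood of the intercept-only model."""
    n, s = len(y), sum(y)
    pbar = s / n
    return s * math.log(pbar) + (n - s) * math.log(1.0 - pbar)

def lr_test_1df(x, y):
    """Likelihood ratio test of b1 = 0; returns (b1, G, p-value) with 1 df."""
    _, b1, ll1 = fit_logistic_1d(x, y)
    g = max(2.0 * (ll1 - null_loglik(y)), 0.0)
    return b1, g, math.erfc(math.sqrt(g / 2.0))  # chi-square(1) upper tail

# Hypothetical data with overlapping responses.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]
b1, g, p_value = lr_test_1df(x, y)
keep = p_value < 0.25  # the rule of thumb from the notes
print(b1, g, p_value, keep)
```

The liberal 0.25 cutoff is deliberate: at the screening stage it is cheaper to carry a weak variable forward than to discard a confounder too early.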

Step 2: Fit a multiple logistic regression model using the variables selected in step 1.

• Verify the importance of each variable in this multiple model using the Wald statistic.
• Compare the coefficient of each variable with the coefficient from the model containing only that variable.
• Eliminate any variable that does not appear to be important, and fit a new model. Check if the new model is significantly different from the old model. If it is, then the deleted variable was important.
• Repeat this process of deleting, refitting and verifying until it appears that all the important variables are included in the model. At this point, add back into the model the variables that were not selected for the original multiple model.
• Assess the joint significance of the variables that were not selected. This step is important as it helps to identify the confounding variables. Make changes in the model, if necessary.
• At the end, we have the preliminary main effects model, which contains the important variables.
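Two of the checks in Step 2 reduce to simple arithmetic, sketched below: the Wald statistic for a single coefficient, and the percent change in a coefficient between the one-variable model and the multiple model. The numbers are made up, the function names are hypothetical, and the 20% flag for a noteworthy change is an assumed convention rather than something stated in the notes.

```python
import math

def wald_test(beta, se):
    """Return the Wald z statistic and its two-sided normal p-value."""
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p

def pct_change(beta_univariate, beta_multiple):
    """Percent change in a coefficient when other covariates are added."""
    return 100.0 * (beta_multiple - beta_univariate) / beta_univariate

# Hypothetical estimates for one covariate.
z, p = wald_test(beta=0.40, se=0.15)
change = pct_change(beta_univariate=0.50, beta_multiple=0.40)

# Assumed convention: a shift of 20% or more hints at confounding.
possible_confounding = abs(change) >= 20.0
print(z, p, change, possible_confounding)
```

A coefficient that moves substantially once other covariates enter the model is a sign that one of those covariates is confounding its relationship with the response.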

Step 2 (alternate): Build the main effects model.

• Verify the importance of each variable in this multiple model using the change in the deviance. Begin with the full model based on the explanatory variables found in Step 1 and work backward using the change-in-deviance analysis. Check if the new model is significantly different from the old model. If it is, then the deleted variable was important.
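The change-in-deviance comparison can be sketched as below. Dropping variables from the full model raises the deviance by G, and G is compared with a chi-square distribution whose degrees of freedom equal the number of parameters removed. This is a plain-Python illustration: chi2_sf uses the standard closed-form series for integer degrees of freedom, and the deviance values and function names are hypothetical.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square with positive integer df."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k < df/2} (x/2)^k / k!
        term = total = 1.0
        for k in range(1, df // 2):
            term *= (x / 2.0) / k
            total += term
        return math.exp(-x / 2.0) * total
    # Odd df: normal tail plus a finite series.
    tail = math.erfc(math.sqrt(x / 2.0))
    term = math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)
    total = 0.0
    for r in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * r + 1)
    return tail + total

def deviance_change_test(dev_reduced, dev_full, df_diff):
    """Return (G, p-value) for comparing a reduced model with the full model."""
    g = dev_reduced - dev_full
    return g, chi2_sf(g, df_diff)

# Hypothetical deviances: the reduced model drops one variable.
g, p = deviance_change_test(dev_reduced=121.1, dev_full=120.4, df_diff=1)
print(g, p)  # a large p-value suggests the dropped variable was not important
```

Working backward from the full model, one would repeat this test after each deletion and stop when removing any remaining variable gives a significant increase in deviance.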

Model Building Strategy: Automatic stepwise selection procedure (10 or more explanatory variables)

The use of automated explanatory variable selection is somewhat controversial. The Northeast SAS Users Paper 222-26 by Shtatland, Cain, and Barton seems to be a reasonable attempt to balance over- and under-parameterization of models chosen by blindly applying automated selection procedures. We will apply their procedure to the Pima Indian diabetes study found in the SAS file pima logistic modelbuilding.sas.

Diagnostics: Validate your model as we have previously discussed. Model building is iterative. The previous steps may have yielded several candidate models from which to choose.
