
















Prof. Lumley Material Type: Notes; Class: ADV APPL LIN MODELS; Subject: Biostatistics; University: University of Washington - Seattle; Term: Autumn 2005;

















A Thomas Lumley production starring Ben French
2005-10-
Logistic regression for a rare event is relatively inefficient. For a single binary predictor we have a 2 × 2 table

         X=0    X=1
Y=0      a      b      m0
Y=1      c      d      m1
         n0     n1

The estimated variance of β̂ is

$$\widehat{\mathrm{var}}[\hat\beta] = \frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}$$

If a and b are much larger than c and d, the variance of β̂ depends mostly on c and d.
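The variance formula can be checked numerically. The following is a minimal sketch with hypothetical counts (they are not taken from the notes); it shows how much of the variance comes from the two event cells, and how little is lost by keeping only a tenth of the non-events.

```r
# Hypothetical 2 x 2 counts for a rare outcome (illustrative only)
tab <- c(a = 9000, b = 1000, c = 60, d = 40)

# Log odds ratio and its estimated variance 1/a + 1/b + 1/c + 1/d
log_or  <- log(tab[["a"]] * tab[["d"]] / (tab[["b"]] * tab[["c"]]))
var_all <- sum(1 / tab)

# Part of the variance contributed by the event row (cells c and d)
var_cd <- sum(1 / tab[c("c", "d")])
share_from_events <- var_cd / var_all

# Variance if only 10% of the non-events (cells a and b) were kept
var_sub <- sum(1 / (tab * c(0.1, 0.1, 1, 1)))

round(c(log_or = log_or, var_all = var_all,
        share_from_events = share_from_events, var_sub = var_sub), 4)
```

With these counts nearly all of the variance comes from the event row, which is why discarding most of the non-events costs relatively little precision.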
We could also correct for biased sampling depending on Y using sampling weights.
Suppose in a population sample we get

         X=0    X=1
Y=0      a      b      m0
Y=1      c      d      m1
         n0     n1
The odds ratio is ad/bc
If we sample all the cases and a fraction π of the controls, the expected value of the sample table looks like

         X=0    X=1
Y=0      aπ     bπ     m0π
Y=1      c      d      m1

The probability-weighted sample odds ratio estimates

$$\frac{d \cdot a\pi/\pi}{c \cdot b\pi/\pi} = \frac{ad}{bc}$$

The unweighted sample odds ratio estimates

$$\frac{d \cdot a\pi}{c \cdot b\pi} = \frac{ad}{bc}$$
The odds ratio is consistently estimated using any (correct or incorrect) value for π. If we can use an arbitrary value of π in estimation we should choose the value that gives the greatest precision, which is π = 1.
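The invariance to the assumed sampling fraction can be seen with a few lines of arithmetic. This is a sketch using hypothetical population counts and expected (not random) sample tables; the function name `sample_or` is made up for illustration.

```r
# Hypothetical population counts (illustrative only)
a <- 9000; b <- 1000   # controls with X = 0, X = 1
c1 <- 60;  d <- 40     # cases with X = 0, X = 1
pop_or <- (a * d) / (b * c1)

sample_or <- function(pi_true, pi_assumed) {
  # Expected sample table: all cases, a fraction pi_true of the controls
  a_s <- a * pi_true
  b_s <- b * pi_true
  # Sampled controls get weight 1 / pi_assumed; cases get weight 1
  (d * a_s / pi_assumed) / (c1 * b_s / pi_assumed)
}

c(population = pop_or,
  correct_wt = sample_or(0.1, 0.1),
  wrong_wt   = sample_or(0.1, 0.5),
  unweighted = sample_or(0.1, 1))
```

The assumed fraction cancels between numerator and denominator, so all four values agree.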
The case–control study is fundamental to epidemiology, particularly cancer epidemiology. The main difficulty is ensuring that the controls and cases really are sampled from the same population.
For example, to estimate the risk from cellphone use when driving the cases are car crashes; the controls should be a random sample of non-crashes from people driving at the same time as the crash.
If cases are heart attacks treated at UWMC, controls should be a random sample of people who didn’t have a heart attack but would have been treated at UWMC if they had.
Suppose we want an analysis stratified by a confounder, and so we have K 2 × 2 tables and the Mantel–Haenszel estimator.
The MH estimating equations for the common odds ratio ψ are
$$\sum_{k=1}^{K} \left( a_k d_k - b_k c_k \psi \right) = 0$$
If the data came by case–control sampling with sampling fractions πk in stratum k we get
$$\sum_{k=1}^{K} \left( \frac{a_k}{\pi_k} d_k - \frac{b_k}{\pi_k} c_k \psi \right) = 0$$
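As a rough numerical check of the stratified case, the sketch below solves the estimating equation in the form written in these notes, using hypothetical stratum tables built to share a common odds ratio. The counts, the sampling fractions `pi_k`, and the helper `solve_psi` are all assumptions for illustration, and expected counts stand in for random samples.

```r
# Hypothetical strata with a common odds ratio psi = 3 (illustrative only)
psi <- 3
b  <- c(400, 300, 200)      # controls, X = 1
cc <- c(30, 50, 80)         # cases,    X = 0
d  <- c(20, 40, 90)         # cases,    X = 1
a  <- psi * b * cc / d      # controls, X = 0, chosen so a * d / (b * cc) = psi

# Solve sum(a * d - b * cc * psi) = 0 for psi
solve_psi <- function(a, b, cc, d) sum(a * d) / sum(b * cc)

# Stratum-specific control sampling fractions (hypothetical)
pi_k <- c(0.05, 0.2, 0.5)
a_s  <- a * pi_k            # expected sampled controls, X = 0
b_s  <- b * pi_k            # expected sampled controls, X = 1

c(population = solve_psi(a, b, cc, d),
  weighted   = solve_psi(a_s / pi_k, b_s / pi_k, cc, d),
  unweighted = solve_psi(a_s, b_s, cc, d))
```

When every stratum really has the same odds ratio, each term of the sum is zero at the true ψ, so rescaling strata by different sampling fractions does not move the solution.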
With multiple confounders a stratified analysis is not feasible and we need logistic regression. We also need logistic regression for a continuous exposure variable.
When X is high-dimensional or continuous and has unspecified distribution it is harder to work with P [X|Y ] than in 2 × 2 tables, but the probability-weighting approach is still straightforward.
The probability-weighted estimating functions for logistic regression are

$$\sum_{i=1}^{n} \frac{x_i}{\pi_i} (Y_i - \mu_i) = 0$$

where π_i = 1 for cases and π_i = π for controls.
We hope that these estimating equations are unbiased for any choice of π, which would allow us to use the ordinary logistic regression equations with π = 1, which are the most efficient.
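A small simulation can illustrate what this would mean in practice. The sketch below assumes a hypothetical population model (intercept -4, slope 1, one standard normal covariate) and a control sampling fraction of 0.05; none of these values come from the notes. The weighted fit uses `quasibinomial()` only to avoid the non-integer-weights warning; the point estimates are those of the weighted estimating equations.

```r
set.seed(1)

# Hypothetical population: logit P(Y = 1 | x) = alpha + beta * x
N     <- 200000
alpha <- -4
beta  <- 1
x <- rnorm(N)
y <- rbinom(N, 1, plogis(alpha + beta * x))

# Case-control sample: keep every case and a fraction pi_ctrl of controls
pi_ctrl <- 0.05
keep <- y == 1 | runif(N) < pi_ctrl
cc <- data.frame(x = x[keep], y = y[keep])
cc$w <- ifelse(cc$y == 1, 1, 1 / pi_ctrl)   # weight 1 / pi_i

# Ordinary logistic regression (pi = 1) and the probability-weighted fit
fit_unweighted <- glm(y ~ x, family = binomial(), data = cc)
fit_weighted   <- glm(y ~ x, family = quasibinomial(), data = cc, weights = w)

rbind(unweighted = coef(fit_unweighted),
      weighted   = coef(fit_weighted))
```

Both slopes should land near the population value, while only the weighted intercept should be near alpha; the unweighted intercept is shifted by roughly -log(pi_ctrl).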
Start with the population or cohort of size N from which the sample is taken. Write Ri = 1 if person i in the population is observed.
Suppose that in the population $\operatorname{logit} E[Y_i \mid X_i = x_i] = \operatorname{logit} \mu_i = \alpha + x_i\beta$ and that we fit a logistic regression model with linear predictor $\operatorname{logit} \tilde\mu_i = \tilde\alpha + x_i\beta$. Write $\pi$ for the true control sampling fraction, so that $P[R_i = 1 \mid Y_i = 0] = \pi$, and $\tilde\pi$ for the assumed control sampling fraction.
The estimating equations for ˜α and β are
$$\sum_{i=1}^{N} U_i = \sum_{i=1}^{N} x_i \frac{R_i}{\pi_i} (Y_i - \tilde\mu_i) = 0$$
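One way to see why only the intercept is affected is to take the expectation of a single term. This is a sketch under the assumption that the weight applied is 1 for cases and $1/\tilde\pi$ for sampled controls (that is, reading the weight in the estimating equation as the assumed sampling fraction), with all cases observed and controls observed with probability $\pi$.

```latex
\begin{aligned}
E[U_i \mid x_i]
  &= x_i \left\{ \mu_i (1 - \tilde\mu_i)
     - (1 - \mu_i)\,\frac{\pi}{\tilde\pi}\,\tilde\mu_i \right\} \\
E[U_i \mid x_i] = 0 \text{ for all } x_i
  &\iff \frac{\tilde\mu_i}{1 - \tilde\mu_i}
     = \frac{\tilde\pi}{\pi} \cdot \frac{\mu_i}{1 - \mu_i} \\
  &\iff \operatorname{logit} \tilde\mu_i
     = \alpha + \log\frac{\tilde\pi}{\pi} + x_i \beta
\end{aligned}
```

Under these assumptions the equations are unbiased when $\tilde\alpha = \alpha + \log(\tilde\pi/\pi)$ with the same $\beta$. Taking $\tilde\pi = \pi$ recovers $\alpha$, and taking $\tilde\pi = 1$ gives $\tilde\alpha = \alpha - \log\pi$.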
The assumption that the logistic regression model is true in the population was critical. If the model for E[Y] is misspecified, the case–control and cohort logistic regressions do not estimate the same β.
Suppose there is an interaction between exposure and age, and we do not model this interaction. The cohort logistic regression model estimates a weighted average of the effect of exposure at different ages, weighted according to the population distribution of age.
The case–control logistic regression also estimates a weighted average, but weighted according to the case–control distribution of age. If age is a risk factor for disease [and it always is], the case–control logistic regression gives more weight to the effect of exposure at higher ages.
Biostatisticians usually consider this a worthwhile tradeoff. Survey statisticians may disagree.
We have shown that logistic regression is consistent with any set of probability weights. The most efficient set of weights comes from ignoring the case–control sampling.
Prentice & Pyke (Biometrika, 1979) show that in fact logistic regression ignoring the case–control sampling is the maximum likelihood estimator. If X is not discrete it is a nonparametric MLE, which does not necessarily imply anything optimal about its properties.
Breslow, Robins & Wellner (2000) showed that logistic regression is semiparametric efficient in case–control studies.
> summary(model1)

Call:
glm(formula = cbind(ncases, ncontrols) ~ agegp + tobgp + alcgp,
    family = binomial(), data = esoph)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.6891  -0.5618  -0.2168   0.2314   2.

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -5.9108     1.0302  -5.737 9.61e-09 ***
agegp35-44    1.6095     1.0676   1.508 0.
agegp45-54    2.9752     1.0242   2.905 0.003675 **
agegp55-64    3.3584     1.0198   3.293 0.000991 ***
agegp65-74    3.7270     1.0253   3.635 0.000278 ***
agegp75+      3.6818     1.0645   3.459 0.000543 ***
tobgp10-19    0.3407     0.2054   1.659 0.       .
tobgp20-29    0.3962     0.2456   1.613 0.
tobgp30+      0.8677     0.2765   3.138 0.001701 **
alcgp40-79    1.1216     0.2384   4.704 2.55e-06 ***
alcgp80-119   1.4471     0.2628   5.506 3.68e-08 ***
alcgp120+     2.1154     0.2876   7.356 1.90e-13 ***

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 227.241  on 87  degrees of freedom
Residual deviance:  53.973  on 76  degrees of freedom
AIC: 225.
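A fit of this form can be reproduced along the following lines. This is a sketch, not necessarily how `model1` was created for these notes: it assumes the `esoph` data shipped with R, and converts its ordered factors to unordered ones so that the coefficients are treatment-coded (names such as `agegp35-44`). The numbers may not match the output above exactly if the dataset differs between R versions.

```r
# esoph ships with R (package datasets). Its factors are ordered, so convert
# them to unordered factors to get treatment contrasts.
dat <- transform(esoph,
                 agegp = factor(agegp, ordered = FALSE),
                 tobgp = factor(tobgp, ordered = FALSE),
                 alcgp = factor(alcgp, ordered = FALSE))

model1 <- glm(cbind(ncases, ncontrols) ~ agegp + tobgp + alcgp,
              family = binomial(), data = dat)

# Odds ratios with Wald 95% intervals, computed from the coefficient table
est <- coef(summary(model1))
or_table <- exp(cbind(
  OR    = est[, "Estimate"],
  lower = est[, "Estimate"] - 1.96 * est[, "Std. Error"],
  upper = est[, "Estimate"] + 1.96 * est[, "Std. Error"]
))
round(or_table[-1, ], 2)   # drop the intercept row
```

Because the data are case–control, the intercept has no population interpretation, which is why the intercept row is dropped from the odds-ratio table.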
Analysis of Deviance Table

Model 1: cbind(ncases, ncontrols) ~ agegp + tobgp
Model 2: cbind(ncases, ncontrols) ~ agegp + tobgp + alcgp
  Resid. Df Resid. Dev Df Deviance P(>|Chi|)
1        79    120.
2        76     53.973  3   66.054 2.984e-
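A table like this comes from comparing nested fits. The sketch below assumes the `esoph` data shipped with R and uses a hypothetical name `model0` for the smaller model; it also recomputes the likelihood ratio test by hand from the two deviances.

```r
# Nested models: with and without the alcohol term
model0 <- glm(cbind(ncases, ncontrols) ~ agegp + tobgp,
              family = binomial(), data = esoph)
model1 <- update(model0, . ~ . + alcgp)

# Likelihood ratio (deviance) test via anova()
lrt <- anova(model0, model1, test = "Chisq")
lrt

# The same test by hand: drop in deviance against a chi-squared distribution
dev_drop <- deviance(model0) - deviance(model1)
df_drop  <- df.residual(model0) - df.residual(model1)
p_hand   <- pchisq(dev_drop, df = df_drop, lower.tail = FALSE)
c(deviance = dev_drop, df = df_drop, p = p_hand)
```

The deviances do not depend on how the factors are coded, so no conversion of the ordered factors is needed for this comparison.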
Wald tests
> library(survey)
> regTermTest(model1, ~alcgp)
Wald test for alcgp
 in glm(formula = cbind(ncases, ncontrols) ~ agegp + tobgp + alcgp,
    family = binomial(), data = esoph)
Chisq = 57.89887  on  3  df: p= 1.652e-

> regTermTest(model1, ~tobgp)
Wald test for tobgp
 in glm(formula = cbind(ncases, ncontrols) ~ agegp + tobgp + alcgp,
    family = binomial(), data = esoph)
Chisq = 10.76880  on  3  df: p= 0.
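For readers without the survey package, a Wald chi-squared statistic of this kind can be computed in base R from the coefficients and their covariance matrix. This is a sketch of the standard quadratic form, not the internals of `regTermTest`; the helper name `wald_term` is hypothetical, and the values may differ from the output above if the `esoph` data differ between R versions.

```r
model1 <- glm(cbind(ncases, ncontrols) ~ agegp + tobgp + alcgp,
              family = binomial(), data = esoph)

wald_term <- function(fit, term) {
  # Positions of the coefficients that belong to `term`
  term_id <- which(attr(terms(fit), "term.labels") == term)
  idx <- which(attr(model.matrix(fit), "assign") == term_id)

  b <- coef(fit)[idx]
  V <- vcov(fit)[idx, idx, drop = FALSE]

  # Wald statistic b' V^{-1} b, referred to chi-squared on length(idx) df
  chisq <- drop(t(b) %*% solve(V, b))
  c(chisq = chisq, df = length(idx),
    p = pchisq(chisq, df = length(idx), lower.tail = FALSE))
}

wald_term(model1, "alcgp")
wald_term(model1, "tobgp")
```

The statistic tests that all coefficients of the term are zero, so it is the same whichever contrast coding is used for the factor.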