Case Control Study - Lecture Notes | BIOST 570, Study notes of Biostatistics

Prof. Lumley Material Type: Notes; Class: ADV APPL LIN MODELS; Subject: Biostatistics; University: University of Washington - Seattle; Term: Autumn 2005;

Case–control studies
A Thomas Lumley production
starring Ben French
BIOST 570
2005-10-24

Rare events

Logistic regression for a rare event is relatively inefficient. For a single binary predictor we have a 2 × 2 table

         X=0   X=1
Y=0       a     b     m0
Y=1       c     d     m1
          n0    n1

The estimated variance of $\hat\beta$ is

$$\widehat{\mathrm{var}}[\hat\beta] = \frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}$$

If a and b are much larger than c and d, the variance of $\hat\beta$ depends mostly on c and d.
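A small numerical sketch of this point. The counts are hypothetical and chosen only for illustration: with 50 events in each exposure group, cutting the non-events from 20,000 to 200 raises the variance by less than half.

```python
def log_or_variance(a, b, c, d):
    """Estimated variance of the log odds ratio from a 2 x 2 table."""
    return 1 / a + 1 / b + 1 / c + 1 / d

# Hypothetical tables: c = d = 50 events, with many or few non-events (a, b).
full = log_or_variance(10_000, 10_000, 50, 50)
subsample = log_or_variance(100, 100, 50, 50)

print(f"variance using 20,000 non-events: {full:.4f}")
print(f"variance using 200 non-events:    {subsample:.4f}")
print(f"ratio: {subsample / full:.2f}")
```

Almost all of the information is in the rare cells, which is the motivation for sampling every case and only a fraction of the controls.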

Probability weights

We could also correct for biased sampling depending on Y using sampling weights.

Suppose in a population sample we get

         X=0   X=1
Y=0       a     b     m0
Y=1       c     d     m1
          n0    n1

The odds ratio is ad/bc

If we sample all the cases and a fraction π of the controls the expected value of the sample table looks like

         X=0   X=1
Y=0       aπ    bπ    m0π
Y=1       c     d     m1

Probability weights

The probability-weighted sample odds ratio estimates

$$\frac{d \cdot a\pi/\pi}{c \cdot b\pi/\pi} = \frac{ad}{bc}$$

The unweighted sample odds ratio estimates

$$\frac{d \cdot a\pi}{c \cdot b\pi} = \frac{ad}{bc}$$

The odds ratio is consistently estimated using any (correct or incorrect) value for π. If we can use an arbitrary value of π in estimation we should choose the value that gives the greatest precision, which is π = 1.
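The cancellation can be checked directly. The sketch below uses a hypothetical population table and a hypothetical true sampling fraction, and works with the expected sample counts rather than random draws; `weighted_odds_ratio` is an illustrative helper, not a library function.

```python
def weighted_odds_ratio(a_s, b_s, c_s, d_s, assumed_pi):
    """Odds ratio after weighting the sampled controls (a_s, b_s) by 1/assumed_pi."""
    a_w = a_s / assumed_pi
    b_w = b_s / assumed_pi
    return (a_w * d_s) / (b_w * c_s)

# Hypothetical population table (Y=0 row: a, b; Y=1 row: c, d).
a, b, c, d = 8000, 2000, 60, 40
true_pi = 0.05

# Expected sample: all cases, a fraction true_pi of the controls.
a_s, b_s = a * true_pi, b * true_pi

population_or = (a * d) / (b * c)
estimates = {
    assumed: weighted_odds_ratio(a_s, b_s, c, d, assumed)
    for assumed in (0.01, 0.05, 0.5, 1.0)
}

for assumed, value in estimates.items():
    print(f"assumed pi = {assumed}: odds ratio = {value:.4f}")
print(f"population odds ratio = {population_or:.4f}")
```

The assumed fraction appears in both numerator and denominator, so every choice returns the population odds ratio; setting it to 1 is the unweighted analysis.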

Case–control studies

The case–control study is fundamental to epidemiology, particularly cancer epidemiology. The main difficulty is ensuring that the controls and cases really are sampled from the same population.

For example, to estimate the risk from cellphone use when driving, the cases are car crashes; the controls should be a random sample of non-crashes from people driving at the same time as the crash.

If cases are heart attacks treated at UWMC, controls should be a random sample of people who didn’t have a heart attack but would have been treated at UWMC if they had.

Stratified analysis

Suppose we want an analysis stratified by a confounder, and so we have K 2 × 2 tables and the Mantel–Haenszel estimator.

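The MH estimating equations for the common odds ratio $\psi$ are

$$\sum_{k=1}^{K} \left( a_k d_k - b_k c_k \psi \right) = 0$$

If the data came by case–control sampling with sampling fractions $\pi_k$ in stratum $k$ we get

$$\sum_{k=1}^{K} \left( \frac{a_k}{\pi_k} d_k - \frac{b_k}{\pi_k} c_k \psi \right) = 0$$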
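A minimal sketch of the stratified case, using the estimating equation in the form given above (without per-stratum size weights) and assuming the second equation weights the sampled control counts by $1/\pi_k$. The two strata and their sampling fractions are hypothetical, constructed so that the odds ratio is exactly 2 in each stratum.

```python
def common_odds_ratio(tables):
    """Solve sum_k (a_k d_k - b_k c_k psi) = 0 for psi."""
    numerator = sum(a * d for a, b, c, d in tables)
    denominator = sum(b * c for a, b, c, d in tables)
    return numerator / denominator

# Hypothetical population tables (a, b, c, d) per stratum; odds ratio 2 in each.
population = [(100, 50, 10, 10), (200, 100, 5, 5)]
fractions = [0.1, 0.5]  # hypothetical control sampling fraction per stratum

# Expected case-control sample: controls (a, b) thinned by pi_k, cases kept.
sampled = [(a * pi, b * pi, c, d) for (a, b, c, d), pi in zip(population, fractions)]

# Probability-weighted version: sampled control counts divided by pi_k.
reweighted = [(a / pi, b / pi, c, d) for (a, b, c, d), pi in zip(sampled, fractions)]

psi_population = common_odds_ratio(population)
psi_unweighted = common_odds_ratio(sampled)
psi_weighted = common_odds_ratio(reweighted)
print(psi_population, psi_unweighted, psi_weighted)
```

When the odds ratio really is common to all strata, each stratum's term is zero at the true $\psi$, so multiplying a stratum by any $\pi_k$ leaves the solution unchanged.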

Logistic regression

With multiple confounders a stratified analysis is not feasible and we need logistic regression. We also need logistic regression for a continuous exposure variable.

When X is high-dimensional or continuous and has unspecified distribution it is harder to work with P[X|Y] than in 2 × 2 tables, but the probability-weighting approach is still straightforward.

The probability-weighted estimating functions for logistic regression are

$$\sum_{i=1}^{n} \frac{x_i}{\pi_i} (Y_i - \mu_i) = 0$$

where $\pi_i = 1$ for cases and $\pi_i = \pi$ for controls.

We hope that these estimating equations are unbiased for any choice of π, which would allow us to use the ordinary logistic regression equations with π = 1, which are the most efficient.

Logistic regression

Start with the population or cohort of size N from which the sample is taken. Write Ri = 1 if person i in the population is observed.

Suppose that in the population $\operatorname{logit} E[Y_i \mid X_i = x_i] = \operatorname{logit} \mu_i = \alpha + x_i\beta$ and that we fit a logistic regression model with linear predictor $\operatorname{logit} \tilde\mu_i = \tilde\alpha + x_i\beta$. Write $\pi$ for the true control sampling fraction, so that $P[R_i = 1 \mid Y_i = 0] = \pi$, and $\tilde\pi$ for the assumed control sampling fraction.

The estimating equations for $\tilde\alpha$ and $\beta$ are

$$\sum_{i=1}^{N} U_i = \sum_{i=1}^{N} x_i \frac{R_i}{\pi_i} (Y_i - \tilde\mu_i) = 0$$
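A sketch of why these equations can have mean zero, under the assumption that the weight in $U_i$ is 1 for cases and the assumed fraction $\tilde\pi$ for controls, and that cases are always sampled. Cases contribute $Y_i - \tilde\mu_i = 1 - \tilde\mu_i$ with probability $\mu_i$; controls contribute $-\tilde\mu_i$, are sampled with probability $\pi$, and are weighted by $1/\tilde\pi$.

```latex
\begin{aligned}
E[U_i \mid x_i]
  &= x_i \left\{ \mu_i (1 - \tilde\mu_i)
     - \frac{\pi}{\tilde\pi} (1 - \mu_i)\, \tilde\mu_i \right\} \\[4pt]
E[U_i \mid x_i] = 0
  &\iff \frac{\tilde\mu_i}{1 - \tilde\mu_i}
     = \frac{\tilde\pi}{\pi} \cdot \frac{\mu_i}{1 - \mu_i} \\[4pt]
  &\iff \tilde\alpha + x_i \beta
     = \alpha + \log\frac{\tilde\pi}{\pi} + x_i \beta
\end{aligned}
```

So the equations are unbiased at the population $\beta$ with a shifted intercept $\tilde\alpha = \alpha + \log(\tilde\pi/\pi)$. With $\tilde\pi = 1$ (ordinary logistic regression) the intercept is $\alpha - \log\pi$ and the slope is unchanged.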

Logistic regression

The assumption that the logistic regression model is true in the population was critical. If the model for E[Y|X] is misspecified, the case–control and cohort logistic regressions do not estimate the same β.

Suppose there is an interaction between exposure and age, and we do not model this interaction. The cohort logistic regression model estimates a weighted average of the effect of exposure at different ages, weighted according to the population distribution of age.

The case–control logistic regression also estimates a weighted average, but weighted according to the case–control distribution of age. If age is a risk factor for disease [and it always is], the case–control logistic regression gives more weight to the effect of exposure at higher ages.

Biostatisticians usually consider this a worthwhile tradeoff. Survey statisticians may disagree.

Likelihood

We have shown that logistic regression is consistent for β with any set of probability weights. The most efficient set of weights comes from ignoring the case–control sampling.
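The consistency claim can be checked numerically. The sketch below is illustrative only: the exposure distribution, coefficients, and sampling fraction are hypothetical, cases are assumed to be sampled with probability 1, and the expectation is computed exactly over a discrete X rather than by simulation.

```python
import math

def expit(t):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-t))

def expected_score(alpha, beta, true_pi, assumed_pi, x_dist):
    """Population mean of the weighted logistic score (intercept, slope),
    evaluated at the true slope and intercept alpha + log(assumed_pi/true_pi)."""
    alpha_tilde = alpha + math.log(assumed_pi / true_pi)
    score_intercept = 0.0
    score_slope = 0.0
    for x, prob in x_dist:
        mu = expit(alpha + beta * x)            # population P(Y=1 | x)
        mu_fit = expit(alpha_tilde + beta * x)  # fitted mean in the sample
        # Cases: sampled with probability 1, weight 1, residual 1 - mu_fit.
        case_part = mu * (1.0 - mu_fit)
        # Controls: sampled with probability true_pi, weight 1/assumed_pi,
        # residual -mu_fit.
        control_part = (1.0 - mu) * (true_pi / assumed_pi) * mu_fit
        score_intercept += prob * (case_part - control_part)
        score_slope += prob * x * (case_part - control_part)
    return score_intercept, score_slope

# Hypothetical exposure distribution: (value, probability) pairs.
x_dist = [(0.0, 0.5), (1.0, 0.3), (2.5, 0.2)]
scores = {
    assumed: expected_score(-4.0, 0.7, 0.02, assumed, x_dist)
    for assumed in (0.02, 0.2, 1.0)
}
for assumed, (s0, s1) in scores.items():
    print(f"assumed pi = {assumed}: mean score = ({s0:.2e}, {s1:.2e})")
```

Both components of the mean score vanish (up to rounding) for every assumed fraction, so only the intercept depends on the weights.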

Prentice & Pyke (Biometrika, 1979) show that in fact logistic regression ignoring the case–control sampling is the maximum likelihood estimator. If X is not discrete it is a nonparametric MLE, which does not necessarily imply anything optimal about its properties.

Breslow, Robins & Wellner (2000) showed that logistic regression is semiparametric efficient in case–control studies.

Example

summary(model1)

Call:
glm(formula = cbind(ncases, ncontrols) ~ agegp + tobgp + alcgp,
    family = binomial(), data = esoph)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.6891  -0.5618  -0.2168   0.2314       2.

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)   -5.9108     1.0302  -5.737 9.61e-09 ***
agegp35-44     1.6095     1.0676   1.508       0.
agegp45-54     2.9752     1.0242   2.905 0.003675 **
agegp55-64     3.3584     1.0198   3.293 0.000991 ***
agegp65-74     3.7270     1.0253   3.635 0.000278 ***

Example

             Estimate Std. Error z value Pr(>|z|)
agegp75+       3.6818     1.0645   3.459 0.000543 ***
tobgp10-19     0.3407     0.2054   1.659       0. .
tobgp20-29     0.3962     0.2456   1.613       0.
tobgp30+       0.8677     0.2765   3.138 0.001701 **
alcgp40-79     1.1216     0.2384   4.704 2.55e-06 ***
alcgp80-119    1.4471     0.2628   5.506 3.68e-08 ***
alcgp120+      2.1154     0.2876   7.356 1.90e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 227.241  on 87  degrees of freedom
Residual deviance:  53.973  on 76  degrees of freedom
AIC: 225.
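Because the intercept absorbs the sampling fraction, the quantities of interest in a case–control fit are the odds ratios. The sketch below converts the alcohol coefficients printed above into odds ratios with Wald intervals and recomputes the z statistics. The estimates and standard errors are copied from the output; the 1.96 multiplier and the helper name `wald_summary` are illustrative choices.

```python
import math

# (estimate, standard error) copied from the coefficient table above.
alcohol = {
    "alcgp40-79": (1.1216, 0.2384),
    "alcgp80-119": (1.4471, 0.2628),
    "alcgp120+": (2.1154, 0.2876),
}

def wald_summary(estimate, se, z_crit=1.96):
    """Odds ratio, approximate 95% Wald interval, z statistic, two-sided p."""
    z = estimate / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    interval = (math.exp(estimate - z_crit * se), math.exp(estimate + z_crit * se))
    return {"odds_ratio": math.exp(estimate), "ci": interval, "z": z, "p": p}

results = {name: wald_summary(est, se) for name, (est, se) in alcohol.items()}
for name, r in results.items():
    low, high = r["ci"]
    print(f"{name}: OR = {r['odds_ratio']:.2f} ({low:.2f}, {high:.2f}), "
          f"z = {r['z']:.3f}, p = {r['p']:.3g}")
```

The recomputed z statistics agree with the printed column up to rounding of the inputs.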

Example

Analysis of Deviance Table

Model 1: cbind(ncases, ncontrols) ~ agegp + tobgp
Model 2: cbind(ncases, ncontrols) ~ agegp + tobgp + alcgp
  Resid. Df Resid. Dev Df Deviance P(>|Chi|)
1        79       120.
2        76     53.973  3   66.054   2.984e-

Wald tests

library(survey)
regTermTest(model1,~alcgp)
Wald test for alcgp
 in glm(formula = cbind(ncases, ncontrols) ~ agegp + tobgp + alcgp,
    family = binomial(), data = esoph)
Chisq =  57.89887  on  3  df: p= 1.652e-

Example

regTermTest(model1,~tobgp)
Wald test for tobgp
 in glm(formula = cbind(ncases, ncontrols) ~ agegp + tobgp + alcgp,
    family = binomial(), data = esoph)
Chisq =  10.76880  on  3  df: p= 0.
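A minimal sketch of how a 3-df Wald statistic maps to a p-value, assuming the statistic is referred to a chi-squared distribution with 3 degrees of freedom, as the "on 3 df" in the output suggests. It uses the closed-form upper tail for 3 df; `chisq_sf_3df` is an illustrative helper, not part of the survey package.

```python
import math

def chisq_sf_3df(x):
    """Upper tail probability P(X > x) for a chi-squared variable with 3 df.

    Closed form for 3 df: erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2).
    """
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

# Wald statistics copied from the regTermTest output above.
p_alcgp = chisq_sf_3df(57.89887)
p_tobgp = chisq_sf_3df(10.76880)

print(f"alcgp: p = {p_alcgp:.3e}")
print(f"tobgp: p = {p_tobgp:.3e}")
```

The same function applies to the likelihood ratio comparison of the two nested models, whose deviance difference is also on 3 df.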