Partial preview of the text
Download Lecture Slides on Missing Data with Examples | BIOST 570 and more Study notes Biostatistics in PDF only on Docsity!
Missing data: examples
Thomas Lumley
BIOST 570
Example
(X, Z) are bivariate Normal with mean zero, variance 1, correla-
tion 0.5.
Z is missing with probability exp(x)/(1 + exp(x))
Complete-case linear regression for Z ∼ X is not biased
Complete-case linear regression for X ∼ Z is biased.
Z is Missing At Random, so multiple imputation will work
R depends on observable X, so weighting will work.
Example
Set up data
> px<-exp(dd$x)/(1+exp(dd$x))
> drop<-rbinom(nrow(dd),1,1-px)
Example
> lm(z~x,data=dd)
Call:
lm(formula = z ~ x, data = dd)
Coefficients:
(Intercept) x
> lm(z~x,data=dd,subset=!drop)
Call:
lm(formula = z ~ x, data = dd, subset = !drop)
Coefficients:
(Intercept) x
Weighting
> lm(x~z,data=dd,subset=!drop,weights=1/px)
Call:
lm(formula = x ~ z, data = dd, subset = !drop, weights = 1/px)
Coefficients:
(Intercept) z
> pmodel<-glm(drop~x,data=dd,family=binomial)
> estp<-1-fitted(pmodel)
> lm(x~z,data=dd,subset=!drop,weights=1/estp)
Call:
lm(formula = x ~ z, data = dd, subset = !drop, weights = 1/px)
Coefficients:
(Intercept) z
Imputation
> impmodel<-lm(z~x,data=dd,subset=!drop)
> summary(impmodel)
Call:
lm(formula = z ~ x, data = dd, subset = !drop)
Residuals:
Min 1Q Median 3Q Max
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.01012 0.07117 -0.142 0.
x 0.55677 0.07007 7.946 1.46e-13 ***
Signif. codes: 0 ’’ 0.001 ’’ 0.01 ’’ 0.05 ’.’ 0.1 ’ ’ 1
Residual standard error: 0.9003 on 197 degrees of freedom
Multiple imputation
> dd1<-dd
> dd1$z<-ifelse(drop,
rnorm(nrow(dd),m=predict(impmodel,newdata=dd),s=0.9003),
dd$z)
> dd2<-dd
> dd2$z<-ifelse(drop,
rnorm(nrow(dd),m=predict(impmodel,newdata=dd),s=0.9003),
dd$z)
> dd3<-dd
> dd3$z<-ifelse(drop,
rnorm(nrow(dd),m=predict(impmodel,newdata=dd),s=0.9003),
dd$z)
> dd4<-dd
> dd4$z<-ifelse(drop,
rnorm(nrow(dd),m=predict(impmodel,newdata=dd),s=0.9003),
dd$z)
Multiple imputation
> dd5<-dd
> dd5$z<-ifelse(drop,
rnorm(nrow(dd),m=predict(impmodel,newdata=dd),s=0.9003),
dd$z)
> library(mitools)
> ddimp<-imputationList(list(dd1,dd2,dd3,dd4,dd5))
> models<-with(ddimp, lm(x~z))
> summary(MIcombine(models))
Multiple imputation results:
with.imputationList(ddimp, lm(x ~ z))
MIcombine.default(models)
results se (lower upper) missInfo
(Intercept) 0.03576729 0.04634838 -0.05618825 0.1277228 22 %
z 0.47037529 0.04374016 0.38360450 0.5571461 21 %
Likelihood
The observed data loglikelihood is univariate normal for X alone,
bivariate normal for (X, Z)
> ell<-function(theta){
xmu<-theta[1]
mu<-theta[1:2]
Sigma<-matrix(theta[c(3,4,4,5)],2)
xsigma<-sqrt(theta[3])
ellx<-sum(dnorm(dd$x[!!drop],m=xmu,s=xsigma,log=TRUE))
ellxz<-
for(i in which(!drop)){
xz<-as.matrix(dd[i,c("x","z")])
ellxz<-ellxz-log(2pidet(Sigma))
-(xz-mu)%%solve(2Sigma,t(xz-mu))
ellxz+ellx}
Likelihood
Direct optimization gives
> optim(c(0,0.25,0.9,0.5,0.7),ell,method="BFGS",
control=list(trace=TRUE,fnscale=-1))
initial value 759.
iter 10 value 724.
final value 724.
converged
$par
[1] 0.04750160 0.01633095 0.62181402 0.34618559 0.
$value
[1] -724.
$counts
function gradient
EM algorithm
The EM algorithm would be more work that direct maximization
in this case. For the cases with Z missing we need to
compute E[Z|X] and E[Z
|X] as functions of the bivariate normal
parameters in order to compute
E
[
((X, Z) − μ)Σ
((X, Z) − μ)
T
|X
]
If we were computing analytic rather than numeric first deriva-
tives (which often helps optimization substantially) the EM
algorithm would mean that we only needed one set of derivatives
rather than two.
The EM algorithm is most used not for real missing data but
for likelihoods depending on latent variables, where treating the
latent variables as missing data simplifies the likelihood.