Data Analysis Simulation 1, Exercises - Engineering, Exercises of Advanced Data Analysis

Data Analysis Simulation 1, Exercises - Engineering, Advanced Data Analysis, Diabetes

Typology: Exercises

2010/2011

Uploaded on 11/03/2011

bridge
bridge 🇺🇸

4.9

(13)

287 documents

1 / 14

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
Homework Assignment 7: Diabetes
36-402, Advanced Data Analysis, Spring 2011
SOLUTIONS
1. Answer:
# Look what variables are in the data:
colnames(pima)
# Consider maximums and minimums to check that the variable values are feasible
summary(pima)
help(pima)
From the variable descriptions, it is not perfectly clear what feasible values
would be for each variable, but we strongly suspect that body mass index
should never be zero (that would imply zero weight), nor should diastolic
blood pressure, blood glucose level, or the triceps skin fold thickness. This
is supported both by common sense and also by the fact that the distri-
bution for each of these variables appears to have an irregular spike at
0. Insulin also has such a spike. There is a form of diabetes (“type 1”)
where insulin levels are very low or even indetectable, but this is actu-
ally rare among Native Americans, as opposed to some parts of Europe
(Scandanavia, Sardinia)1, so while these zeroes are, most likely, mistakes
or missing values, there is no penalty for not removing them. Leaving
them in changes the estimates in later problems, but not hugely.
Let’s remove these observations:
good.data = (pima$glucose != 0) & (pima$diastolic != 0) & (pima$triceps != 0) &
(pima$bmi != 0) & (pima$insulin != 0)
pima.clean = pima[good.data,]
# Make note that we’ve deleted about 50% of the observations!
nrow(pima.clean)/nrow(pima)
# Make a scatterplot (figure 1)
plot(pima.clean, cex = 0.4, pch = 16)
See Figure 1 for an overview of the pairwise relationships. The data looks
“healthy” now that the zero observations are removed.
1See, e.g., The Merck Manual, 18th edition, ch. 158, online at http://www.merckmanuals.
com/professional/sec12/ch158/ch158b.html.
1
pf3
pf4
pf5
pf8
pf9
pfa
pfd
pfe

Partial preview of the text

Download Data Analysis Simulation 1, Exercises - Engineering and more Exercises Advanced Data Analysis in PDF only on Docsity!

Homework Assignment 7: Diabetes

36-402, Advanced Data Analysis, Spring 2011

SOLUTIONS

  1. Answer:

Look what variables are in the data:

colnames(pima)

Consider maximums and minimums to check that the variable values are feasible

summary(pima) help(pima)

From the variable descriptions, it is not perfectly clear what feasible values would be for each variable, but we strongly suspect that body mass index should never be zero (that would imply zero weight), nor should diastolic blood pressure, blood glucose level, or the triceps skin fold thickness. This is supported both by common sense and also by the fact that the distri- bution for each of these variables appears to have an irregular spike at

  1. Insulin also has such a spike. There is a form of diabetes (“type 1”) where insulin levels are very low or even indetectable, but this is actu- ally rare among Native Americans, as opposed to some parts of Europe (Scandanavia, Sardinia)^1 , so while these zeroes are, most likely, mistakes or missing values, there is no penalty for not removing them. Leaving them in changes the estimates in later problems, but not hugely. Let’s remove these observations:

good.data = (pima$glucose != 0) & (pima$diastolic != 0) & (pima$triceps != 0) & (pima$bmi != 0) & (pima$insulin != 0) pima.clean = pima[good.data,]

Make note that we’ve deleted about 50% of the observations!

nrow(pima.clean)/nrow(pima)

Make a scatterplot (figure 1)

plot(pima.clean, cex = 0.4, pch = 16)

See Figure 1 for an overview of the pairwise relationships. The data looks “healthy” now that the zero observations are removed. (^1) See, e.g., The Merck Manual, 18th edition, ch. 158, online at http://www.merckmanuals. com/professional/sec12/ch158/ch158b.html.

Figure 1:

upper.lin = pred.lin$fit + 1.96*pred.lin$se.fit

95% C.I. lower bound:

ilogit(lower.lin)

Equivalently, exp(lower.lin)/(1+exp(lower.lin))

95% C.I. upper bound:

ilogit(upper.lin)

Equivalently, exp(upper.lin)/(1+exp(upper.lin))

We obtain a lower bound of about 0.030 and an upper bound of about 0 .097. Note that instead of computing the standard error for the prediction explicitly, we are using the se.f it = T option in the predict function. By using predict() with the type="response" option, one can get a standard error on the untransformed, probability scale; but the sampling distribution here is not as nearly Gaussian as it is on the transformed, logit scale, so using this mechanically to get an approximate confidence interval is not very accurate, unless the standard error is very small and/or the predicted probability is zero close to 0.5. (This was discussed in lecture and in Faraway.)

  1. Answer: Note that odds= (^1) −pp. If βb is the estimated coefficient for bmi, the logistic regression equation is

log(odds) = βb ∗ bmi + other

where “other” includes the intercept and all the other terms and their coefficients. Then the (multiplicative) change in odds of testing positive for diabetes for someone moving from the third quartile to the first quartile in bmi is the ratio eβb∗bmi.1st.quantile+other eβb∗bmi.3rd.quantile+other^

= eβb∗(bmi.1st.quantile−bmi.3rd.quantile)

We code this in R:

odds.change = exp(coefficients(logit.model)["bmi"]* (quantile(pima.clean$bmi,0.25)-quantile(pima.clean$bmi,0.75)))

> as.numeric(odds.change) [1] 0.

(We could replace the difference in quantiles with -IQR(pima.clean$bmi), but the IQR() function is a little obscure.) Therefore, all else held con- stant, the model suggests that a woman at the first quantile on bmi has about half the odds of testing positive for diabetes compared to being at the third quantile. But how confident can we be of this? From the table in problem 2, we see the standard error of the bmi coefficient βb is about 0.027. The es- timated coefficients in a logistic regression have approximately Gaussian

distributions around their true values, so a 95% confidence interval for βb is 0. 075 ± 1. 96 ∗ 0 .027, or (0. 017 , 0 .12). In R:

bmi.se = summary(logit.model)$coefficients["bmi","Std. Error"] bmi.lower = as.numeric(coefficients(logit.model)["bmi"] - 1.96bmi.se) bmi.upper = as.numeric(coefficients(logit.model)["bmi"] + 1.96bmi.se)

Plugging the lower and upper bounds into the formula for the change in odds gives a confidence interval for the change in odds:

odds.change.upper = exp(bmi.lower* (quantile(pima.clean$bmi,0.25)-quantile(pima.clean$bmi,0.75))) > as.numeric(odds.change.upper) [1] 0.

odds.change.lower = exp(bmi.upper* (quantile(pima.clean$bmi,0.25)-quantile(pima.clean$bmi,0.75))) > as.numeric(odds.change.lower) [1] 0.

We can be rather confident that moving to the lower bmi is associated with decreasing the odds of testing postive for diabetes, by a factor between 0.34 and 0.86. Alternatively, a somewhat more accurate method of calculating confidence intervals for logistic regression coefficients is available in the package MASS (as illustrated by Faraway). Using that here:

> library(MASS) > confint(logit.model,"bmi") Waiting for profiling to be done... 2.5 % 97.5 % 0.01766988 0.

gives a very similar confidence interval for βb, and so a very similar confi- dence interval for the change in odds, namely (0. 336 , 0 .858).

  1. Answer: We can directly compare the mean diastolic blood pressure for women who test positive for diabetes against those who don’t:

Diastolic blood pressure for those testing negative:

> mean(pima.clean$diastolic[pima.clean$test == 0]) [1] 68.

Diastolic blood pressure for those testing positive:

> mean(pima.clean$diastolic[pima.clean$test == 1]) [1] 74.

data well if the difference between the true probability of testing positive for diabetes and the predicted probability is small over the covariate space. Option 1 (based on Faraway): A simple way to test that the predicted and the truth do not deviate much is as follows. Gather all observations for which the predicted value ˆp is in a small interval, say, (0, 0 .5). Count how many times the diabetes test is positive in this group. Then use the pˆ values to compute the number of positive diabetes test results we would expect to see in this group, based on the model. If the difference between observed and expected is large, we have evidence to reject that the model fits well. To be more thorough, we actually want to consider more than just the interval (0, 0 .5). Partioning the full interval (0, 1) into 10 pieces is standard practice, and then we can use the chi-squared statistic to compute a p-value for the goodness of fit. Option 2 (based on lecture notes): A non-parametric model should match the training data (as measured by log-likelihood or deviance) at least as well as the logistic regression model, because the former is more flexi- ble. When the logistic regression is correct, however, the non-parametric model ends up approximating the parametric model, and the difference should be small and shrinking with sample size. We do not have a closed- form expression for the distribution of the difference in deviances, but we can find it by simulating from the logistic regression model (parametric bootstrapping).

  1. Answer:

Option 1:

Function to compute a p-value for the goodness-of-fit

using the Hosmer and Lemeshow goodness of fit test.

This code is taken from http://sas-and-r.blogspot.com/

hosmerlem = function(y, yhat, g=10) {

divide yhat into 10 equal bins after ordering:

cutyhat = cut(yhat,breaks = quantile(yhat, probs=seq(0, 1, 1/g)), include.lowest=TRUE)

tabulate the association between y and the 10 yhat bins:

obs = xtabs(cbind(1 - y, y) ~ cutyhat)

tabulate the expected association based on the predicted values:

expect = xtabs(cbind(1 - yhat, yhat) ~ cutyhat)

compute a statistic that should be distributed Chi-squared under the null:

chisq = sum((obs - expect)^2/expect)

The p-value:

P = 1 - pchisq(chisq, g - 2) return(list(chisq=chisq,p.value=P)) }

Use the function to compute a p-value for goodness of fit:

> hosmerlem(y=pima.clean$test, yhat=logit.model$fitted) $p.value [1] 0.

The p-value is quite large, so this test does not provide evidence against the hypothesis that the model fits the data well.

Option 2 follows the notes: fit a GAM, compare the deviances, then simu- late from the logistic regression and compare deviances on the simulations. For conformity with the notes, this uses the gam() function from the pack- age gam, but using the one from mgcv, as on the midterm, wouldn’t change things much.

pima.gam <- gam(test ~ lo(pregnant) + lo(glucose) + lo(diastolic) + lo(triceps) +lo(insulin) +lo(bmi) + lo(diabetes) + lo(age), data=pima.clean, family=binomial)

(See Figure 2 for the plots of the partial response functions.) Common mistakes here were to forget to enclose the terms to be smoothed in lo() or s(), and forgetting to specify family=binomial.

Now compare the deviances:

> (diff.deviance.obs <- logit.model$deviance - pima.gam$deviance) [1] 36.

As expected, the GAM does better at fitting the data than the logistic model. Is this within the range we’d expect if the logistic model were true?

sim.from.logistic <- function(model=logit.model) {

Presume that the fitted values of the model are probabilities

This is true of logistic regression

p.hat <- fitted(model) n <- length(p.hat)

Generate binary data with those probabilities

y.new <- rbinom(n,size=1,prob=p.hat) return(y.new) }

sim.diff.deviance <- function(model=logit.model,x=pima.clean[,-9]) {

simulate from the logistic model

y.new <- sim.from.logistic(model) GLM.new <- glm(y.new ~ x[,1] + x[,2] + x[,3] + x[,4] + x[,5] + x[,6]

  • x[,7] + x[,8], family=binomial) GAM.new <- gam(y.new ~ lo(x[,1]) + lo(x[,2]) + lo(x[,3]) + lo(x[,4]) + lo(x[,5]) + lo(x[,6]) + lo(x[,7]) + lo(x[,8]), family=binomial)

the sampling distribution under the null. If the GAM was correct, we’d have a quite large probability (about 82%) of being over that threshold, and so of noticing that the logistic regression was wrong. Similarly, the probability of getting such a small difference in deviances is only about 7%, so we have pretty reasonable (though not conclusive!) evidence against the GAM. (Figure 4.) Of course there could be other nonlinear models which are closer to the logistic regression, and which we would not have good evidence to reject, but at least large nonlinearities, like in Figure 2, can be ruled out with some confidence.

  1. Answer: See problem 7 for the code fitting the GAM.

plot(fitted(pima.gam), fitted(logit.model), xlab="Nonparametric fitted probabilities", ylab="Logistic fitted probabilities", cex = 0.4, pch = 16) abline(0,1,col="grey")

The two models seem to agree tolerably about who has very low probabil- ity of becoming diabetic (though the logistic regression gives them some- what higher probabilities), but the disagreements become more common and bigger at higher predicted probabilities.

par(mfrow=c(2,4)) plot(pima.gam,se=TRUE) par(mfrow=c(1,1))

Figure 2: Partial response functions for the generalized additive model fit in Problem 7; dotted lines show ±2 standard errors. Note that the vertical axes have different scales across the plots.

plot(density(null.dist.diff.deviance),xlim=c(0,105), main="Sampling distributions of the difference in deviance",xlab="") lines(density(alt.dist),col="grey") abline(v=diff.deviance.obs,lty=2)

Figure 4: Sampling distributions for the difference-in-deviance test statistic when simulating from the logistic regression (black curve) or the GAM (grey curve); dashed vertical line is at the value observed on the read data.

Figure 5: