








Study with the several resources on Docsity
Earn points by helping other students or get them with a premium plan
Prepare for your exams
Study with the several resources on Docsity
Earn points to download
Earn points by helping other students or get them with a premium plan
Data Analysis Simulation 1, Exercises - Engineering, Advanced Data Analysis, Diabetes
Typology: Exercises
1 / 14
This page cannot be seen from the preview
Don't miss anything!









colnames(pima)
summary(pima) help(pima)
From the variable descriptions, it is not perfectly clear what feasible values would be for each variable, but we strongly suspect that body mass index should never be zero (that would imply zero weight), nor should diastolic blood pressure, blood glucose level, or the triceps skin fold thickness. This is supported both by common sense and also by the fact that the distri- bution for each of these variables appears to have an irregular spike at
good.data = (pima$glucose != 0) & (pima$diastolic != 0) & (pima$triceps != 0) & (pima$bmi != 0) & (pima$insulin != 0) pima.clean = pima[good.data,]
nrow(pima.clean)/nrow(pima)
plot(pima.clean, cex = 0.4, pch = 16)
See Figure 1 for an overview of the pairwise relationships. The data looks “healthy” now that the zero observations are removed. (^1) See, e.g., The Merck Manual, 18th edition, ch. 158, online at http://www.merckmanuals. com/professional/sec12/ch158/ch158b.html.
Figure 1:
upper.lin = pred.lin$fit + 1.96*pred.lin$se.fit
ilogit(lower.lin)
ilogit(upper.lin)
We obtain a lower bound of about 0.030 and an upper bound of about 0 .097. Note that instead of computing the standard error for the prediction explicitly, we are using the se.f it = T option in the predict function. By using predict() with the type="response" option, one can get a standard error on the untransformed, probability scale; but the sampling distribution here is not as nearly Gaussian as it is on the transformed, logit scale, so using this mechanically to get an approximate confidence interval is not very accurate, unless the standard error is very small and/or the predicted probability is zero close to 0.5. (This was discussed in lecture and in Faraway.)
log(odds) = βb ∗ bmi + other
where “other” includes the intercept and all the other terms and their coefficients. Then the (multiplicative) change in odds of testing positive for diabetes for someone moving from the third quartile to the first quartile in bmi is the ratio eβb∗bmi.1st.quantile+other eβb∗bmi.3rd.quantile+other^
= eβb∗(bmi.1st.quantile−bmi.3rd.quantile)
We code this in R:
odds.change = exp(coefficients(logit.model)["bmi"]* (quantile(pima.clean$bmi,0.25)-quantile(pima.clean$bmi,0.75)))
> as.numeric(odds.change) [1] 0.
(We could replace the difference in quantiles with -IQR(pima.clean$bmi), but the IQR() function is a little obscure.) Therefore, all else held con- stant, the model suggests that a woman at the first quantile on bmi has about half the odds of testing positive for diabetes compared to being at the third quantile. But how confident can we be of this? From the table in problem 2, we see the standard error of the bmi coefficient βb is about 0.027. The es- timated coefficients in a logistic regression have approximately Gaussian
distributions around their true values, so a 95% confidence interval for βb is 0. 075 ± 1. 96 ∗ 0 .027, or (0. 017 , 0 .12). In R:
bmi.se = summary(logit.model)$coefficients["bmi","Std. Error"] bmi.lower = as.numeric(coefficients(logit.model)["bmi"] - 1.96bmi.se) bmi.upper = as.numeric(coefficients(logit.model)["bmi"] + 1.96bmi.se)
Plugging the lower and upper bounds into the formula for the change in odds gives a confidence interval for the change in odds:
odds.change.upper = exp(bmi.lower* (quantile(pima.clean$bmi,0.25)-quantile(pima.clean$bmi,0.75))) > as.numeric(odds.change.upper) [1] 0.
odds.change.lower = exp(bmi.upper* (quantile(pima.clean$bmi,0.25)-quantile(pima.clean$bmi,0.75))) > as.numeric(odds.change.lower) [1] 0.
We can be rather confident that moving to the lower bmi is associated with decreasing the odds of testing postive for diabetes, by a factor between 0.34 and 0.86. Alternatively, a somewhat more accurate method of calculating confidence intervals for logistic regression coefficients is available in the package MASS (as illustrated by Faraway). Using that here:
> library(MASS) > confint(logit.model,"bmi") Waiting for profiling to be done... 2.5 % 97.5 % 0.01766988 0.
gives a very similar confidence interval for βb, and so a very similar confi- dence interval for the change in odds, namely (0. 336 , 0 .858).
> mean(pima.clean$diastolic[pima.clean$test == 0]) [1] 68.
> mean(pima.clean$diastolic[pima.clean$test == 1]) [1] 74.
data well if the difference between the true probability of testing positive for diabetes and the predicted probability is small over the covariate space. Option 1 (based on Faraway): A simple way to test that the predicted and the truth do not deviate much is as follows. Gather all observations for which the predicted value ˆp is in a small interval, say, (0, 0 .5). Count how many times the diabetes test is positive in this group. Then use the pˆ values to compute the number of positive diabetes test results we would expect to see in this group, based on the model. If the difference between observed and expected is large, we have evidence to reject that the model fits well. To be more thorough, we actually want to consider more than just the interval (0, 0 .5). Partioning the full interval (0, 1) into 10 pieces is standard practice, and then we can use the chi-squared statistic to compute a p-value for the goodness of fit. Option 2 (based on lecture notes): A non-parametric model should match the training data (as measured by log-likelihood or deviance) at least as well as the logistic regression model, because the former is more flexi- ble. When the logistic regression is correct, however, the non-parametric model ends up approximating the parametric model, and the difference should be small and shrinking with sample size. We do not have a closed- form expression for the distribution of the difference in deviances, but we can find it by simulating from the logistic regression model (parametric bootstrapping).
Option 1:
hosmerlem = function(y, yhat, g=10) {
cutyhat = cut(yhat,breaks = quantile(yhat, probs=seq(0, 1, 1/g)), include.lowest=TRUE)
obs = xtabs(cbind(1 - y, y) ~ cutyhat)
expect = xtabs(cbind(1 - yhat, yhat) ~ cutyhat)
chisq = sum((obs - expect)^2/expect)
P = 1 - pchisq(chisq, g - 2) return(list(chisq=chisq,p.value=P)) }
> hosmerlem(y=pima.clean$test, yhat=logit.model$fitted) $p.value [1] 0.
The p-value is quite large, so this test does not provide evidence against the hypothesis that the model fits the data well.
Option 2 follows the notes: fit a GAM, compare the deviances, then simu- late from the logistic regression and compare deviances on the simulations. For conformity with the notes, this uses the gam() function from the pack- age gam, but using the one from mgcv, as on the midterm, wouldn’t change things much.
pima.gam <- gam(test ~ lo(pregnant) + lo(glucose) + lo(diastolic) + lo(triceps) +lo(insulin) +lo(bmi) + lo(diabetes) + lo(age), data=pima.clean, family=binomial)
(See Figure 2 for the plots of the partial response functions.) Common mistakes here were to forget to enclose the terms to be smoothed in lo() or s(), and forgetting to specify family=binomial.
Now compare the deviances:
> (diff.deviance.obs <- logit.model$deviance - pima.gam$deviance) [1] 36.
As expected, the GAM does better at fitting the data than the logistic model. Is this within the range we’d expect if the logistic model were true?
sim.from.logistic <- function(model=logit.model) {
p.hat <- fitted(model) n <- length(p.hat)
y.new <- rbinom(n,size=1,prob=p.hat) return(y.new) }
sim.diff.deviance <- function(model=logit.model,x=pima.clean[,-9]) {
y.new <- sim.from.logistic(model) GLM.new <- glm(y.new ~ x[,1] + x[,2] + x[,3] + x[,4] + x[,5] + x[,6]
the sampling distribution under the null. If the GAM was correct, we’d have a quite large probability (about 82%) of being over that threshold, and so of noticing that the logistic regression was wrong. Similarly, the probability of getting such a small difference in deviances is only about 7%, so we have pretty reasonable (though not conclusive!) evidence against the GAM. (Figure 4.) Of course there could be other nonlinear models which are closer to the logistic regression, and which we would not have good evidence to reject, but at least large nonlinearities, like in Figure 2, can be ruled out with some confidence.
plot(fitted(pima.gam), fitted(logit.model), xlab="Nonparametric fitted probabilities", ylab="Logistic fitted probabilities", cex = 0.4, pch = 16) abline(0,1,col="grey")
The two models seem to agree tolerably about who has very low probabil- ity of becoming diabetic (though the logistic regression gives them some- what higher probabilities), but the disagreements become more common and bigger at higher predicted probabilities.
par(mfrow=c(2,4)) plot(pima.gam,se=TRUE) par(mfrow=c(1,1))
Figure 2: Partial response functions for the generalized additive model fit in Problem 7; dotted lines show ±2 standard errors. Note that the vertical axes have different scales across the plots.
plot(density(null.dist.diff.deviance),xlim=c(0,105), main="Sampling distributions of the difference in deviance",xlab="") lines(density(alt.dist),col="grey") abline(v=diff.deviance.obs,lty=2)
Figure 4: Sampling distributions for the difference-in-deviance test statistic when simulating from the logistic regression (black curve) or the GAM (grey curve); dashed vertical line is at the value observed on the read data.
Figure 5: