



Data Analysis The Truth About Linear Regression 1, Exercises - Engineering - Prof. Cosma Shalizi, Advanced Data Analysis, The Advantages of Backwardness




This problem set was based on the preliminary analysis in the paper
E. Maasoumi, J. S. Racine and T. Stengos, “Growth and convergence: a profile of distribution dynamics and mobility”, Journal of Econometrics 136 (2007): 483–508, http://surface.syr.edu/cgi/viewcontent.cgi?article=1004&context=ecn
install.packages("np")
require(np)
data(oecdpanel)
lm.1 = lm(growth ~ initgdp, data = oecdpanel)
> signif(lm.1$coefficients,3)
(Intercept)     initgdp
   -0.02130          0.
The coefficient of initgdp in the linear model is about $5 \times 10^{-3}$, suggesting that higher growth rates are associated with higher initial levels of GDP, contrary to the claim we want to test.
[Plot omitted: "MSE vs. bandwidth", 5-fold CV (black, solid) and in-sample (green, dashed); x-axis: kernel bandwidth, y-axis: MSE.]
Figure 1: The black solid line is the MSE for predictions on new observations in 5-fold cv. The green, dashed line is in-sample MSE when all of the data is used (so the training set = testing set = the whole data).
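A curve like the one in Figure 1 can be produced by looping over candidate bandwidths and recording both the cross-validated and the in-sample MSE. The following is only a minimal sketch under stated assumptions: it uses simulated data rather than oecdpanel, and a hand-written Gaussian-kernel Nadaraya-Watson smoother (the hypothetical helper `nw.predict`) rather than the np package, so the numbers it yields say nothing about the real data.

```r
# Simulated stand-in data (an assumption; this is NOT oecdpanel)
set.seed(1)
n <- 200
x <- runif(n, 6, 9)
y <- 0.05 * sin(2 * x) + rnorm(n, sd = 0.02)

# Hypothetical helper: Gaussian-kernel Nadaraya-Watson estimate at points x0,
# fitted to (x, y) with bandwidth h
nw.predict <- function(x0, x, y, h) {
  sapply(x0, function(pt) {
    w <- dnorm((x - pt) / h)   # kernel weights of the training points
    sum(w * y) / sum(w)        # weighted average of the responses
  })
}

bandwidths <- seq(0.05, 0.5, by = 0.05)
nfolds <- 5
# Randomly assign each case to one of nfolds folds of (nearly) equal size
case.folds <- sample(rep(1:nfolds, length.out = n))

cv.mses <- numeric(length(bandwidths))
insample.mses <- numeric(length(bandwidths))
for (i in seq_along(bandwidths)) {
  h <- bandwidths[i]
  fold.mses <- numeric(nfolds)
  for (fold in 1:nfolds) {
    train <- case.folds != fold
    test <- case.folds == fold
    preds <- nw.predict(x[test], x[train], y[train], h)
    fold.mses[fold] <- mean((y[test] - preds)^2)
  }
  cv.mses[i] <- mean(fold.mses)
  # In-sample MSE: fit and evaluate on the full data set
  insample.mses[i] <- mean((y - nw.predict(x, x, y, h))^2)
}

# Bandwidth with the smallest cross-validated MSE
best.h <- bandwidths[which.min(cv.mses)]
```

Plotting `cv.mses` and `insample.mses` against `bandwidths` gives the two curves; the in-sample curve tends to favor the smallest bandwidth, which is why the cross-validated curve is the one used for selection.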
[Plot omitted: "growth vs. initgdp"; x-axis: initgdp, y-axis: growth.]
Figure 2: The red line is the line for the linear model. The green points are the kernel regression fitted values with bandwidth = 0.3.
In particular, the sign of the coefficient for initgdp is opposite to what we obtained in problem 1. This result suggests that, “holding constant popgro and inv”, high growth rates are associated with low initial levels of GDP, consistent with the claim we want to test.
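To see how a coefficient can change sign once other covariates enter a linear model, here is a small simulated illustration. It is a sketch on made-up data, not the oecdpanel analysis: the variables `x`, `z` and `y` and all the coefficients are assumptions chosen so that the effect is visible.

```r
# Simulated data (an assumption; unrelated to oecdpanel)
set.seed(42)
n <- 500
z <- rnorm(n)                  # covariate that also drives the response
x <- z + rnorm(n, sd = 0.5)    # predictor of interest, correlated with z
y <- -0.5 * x + 2 * z + rnorm(n, sd = 0.5)

marginal <- lm(y ~ x)          # regression on x alone
adjusted <- lm(y ~ x + z)      # regression on x, including z as well

# Coefficient on x in each model
signif(coef(marginal)["x"], 3)
signif(coef(adjusted)["x"], 3)
```

In this construction the coefficient on `x` is positive when `z` is omitted and negative when `z` is included, because `x` partly acts as a proxy for `z` in the smaller model.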
oecd.npr <- npreg(growth ~ initgdp + popgro + inv, data=oecdpanel, tol=0.1, ftol=0.1)
summary(oecd.npr)
#output contains this:
              initgdp popgro inv
Bandwidth(s):   0.211 0.0591  0.
The selected bandwidth for initgdp is quite a bit smaller than 0.3.
med.popgro = median(oecdpanel$popgro)
med.inv = median(oecdpanel$inv)
> signif(med.popgro,3)
[1] -2.
> signif(med.inv,3)
[1] -1.
There are several ways to get the predictions. One is to make a new data frame which has the values we want. Here we re-use the real initgdp values (we could also use an equally-spaced grid), but sort them in increasing order, and fix popgro and inv to their medians.
initgdp.sort = sort(oecdpanel$initgdp)
med.oecdpanel = data.frame(initgdp=initgdp.sort, popgro=med.popgro, inv=med.inv)
mod4.med.preds = predict(lm.4, newdata = med.oecdpanel)
mod5.med.preds = predict(oecd.npr, newdata = med.oecdpanel)
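The equally-spaced grid mentioned as an alternative could look roughly like the sketch below. It runs on a simulated data frame (`toy`, a hypothetical stand-in that only borrows the column names of oecdpanel) and a plain linear model, so the values are illustrative only; the same `newdata` construction would apply to any fitted model with a `predict` method.

```r
# Simulated stand-in with the same column names as oecdpanel
# (an assumption; the values are arbitrary)
set.seed(7)
toy <- data.frame(initgdp = runif(100, 6, 9),
                  popgro = rnorm(100),
                  inv = rnorm(100))
toy$growth <- 0.1 - 0.01 * toy$initgdp + 0.02 * toy$inv +
  rnorm(100, sd = 0.02)

toy.lm <- lm(growth ~ initgdp + popgro + inv, data = toy)

# Equally spaced grid over the observed range of initgdp,
# with the other covariates held at their medians
grid.df <- data.frame(
  initgdp = seq(min(toy$initgdp), max(toy$initgdp), length.out = 50),
  popgro = median(toy$popgro),
  inv = median(toy$inv))

grid.preds <- predict(toy.lm, newdata = grid.df)
```

A grid gives evenly spaced points for drawing a curve, whereas re-using the sorted observed values concentrates predictions where the data actually lie.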
pdf("graphics/p6.pdf")
plot(oecdpanel$initgdp, oecdpanel$growth, lwd = 2, cex = 0.3, xlab = "initgdp", ylab = "growth",
kernel.fold.mses = vector(length=nfolds)
for (fold in 1:nfolds) {
  train = oecdpanel[case.folds!=fold,]
  test = oecdpanel[case.folds==fold,]
  linear.model = lm(growth ~ initgdp+popgro+inv, data=train)
  kernel.model = npreg(growth ~ initgdp + popgro + inv, data=train, tol=0.1, ftol=0.1)
  linear.predictions = predict(linear.model, newdata=test)
  kernel.predictions = predict(kernel.model, newdata=test)
  linear.fold.mses[fold] = mean((test$growth - linear.predictions)^2)
  kernel.fold.mses[fold] = mean((test$growth - kernel.predictions)^2)
}
cv.linear.mod = mean(linear.fold.mses)
cv.kernel.mod = mean(kernel.fold.mses)
Based on the cross-validation scores (0.000782 for the linear model and 0.000754 for the kernel regression), one would prefer the kernel regression model.