Data Analysis The Truth About Linear Regression 1, Exercises - Engineering - Prof. Cosma Shalizi, Advanced Data Analysis, The Advantages of Backwardness

2010/2011

Uploaded on 11/03/2011

Homework Assignment 2: The Advantages of Backwardness
36-402, Advanced Data Analysis, Spring 2011
SOLUTIONS
This problem set was based on the preliminary analysis in the paper
E. Maasoumi, J. S. Racine and T. Stengos, “Growth and convergence: a
profile of distribution dynamics and mobility”, Journal of Econometrics
136 (2007): 483–508,
http://surface.syr.edu/cgi/viewcontent.cgi?article=1004&context=ecn
# Setup
install.packages("np")
require(np)
data(oecdpanel)
1. Answer:
# Fit a linear model of growth on initgdp
lm.1 = lm(growth ~ initgdp, data = oecdpanel)
> signif(lm.1$coefficients,3)
(Intercept) initgdp
-0.02130 0.00533
The coefficient of initgdp in the linear model is about $5 \times 10^{-3}$,
suggesting that higher growth rates are associated with higher initial
levels of GDP, contrary to the claim we want to test.
2. Answer: See Figure 1.
The black line suggests that choosing a bandwidth near 0.3 minimizes
the MSE for predictions on new observations. The green line shows the
in-sample MSE, where the model is fit on the whole data and then
evaluated on that same data. The important thing to notice from the
green line is that the in-sample fit always improves as the bandwidth
decreases, because the test set = the training set = the whole data,
i.e., there is no cross-validation. To find the bandwidth that is best
for prediction on new observations, it is important that the training
set is separate from the test set.
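The contrast between the two curves can be reproduced in a small sketch. The code below is an illustration on synthetic data, not the homework's own code: the simulated data, the hand-written Gaussian-kernel smoother nw.predict, and the bandwidth grid are all assumptions made for this example.

```r
# Illustration only: synthetic data, not the oecdpanel data.
set.seed(402)
n <- 200
x <- runif(n, 0, 3)
y <- sin(2 * x) + rnorm(n, sd = 0.3)

# Hypothetical helper: Nadaraya-Watson estimate with a Gaussian kernel of
# bandwidth h, fit on (x.train, y.train) and evaluated at each point of x.new.
nw.predict <- function(x.train, y.train, x.new, h) {
  sapply(x.new, function(x0) {
    w <- dnorm((x.train - x0) / h)
    sum(w * y.train) / sum(w)
  })
}

bandwidths <- c(0.02, 0.05, 0.1, 0.2, 0.4, 0.8)
nfolds <- 5
# Shuffle a balanced vector of fold labels to assign each case to a fold
case.folds <- sample(rep(1:nfolds, length.out = n))

# In-sample MSE: training set = test set = whole data
in.sample.mse <- sapply(bandwidths, function(h) {
  mean((y - nw.predict(x, y, x, h))^2)
})

# 5-fold CV MSE: each fold is predicted from a fit to the other four
cv.mse <- sapply(bandwidths, function(h) {
  fold.mses <- sapply(1:nfolds, function(fold) {
    train <- case.folds != fold
    test <- case.folds == fold
    mean((y[test] - nw.predict(x[train], y[train], x[test], h))^2)
  })
  mean(fold.mses)
})

best.h <- bandwidths[which.min(cv.mse)]
```

Plotting in.sample.mse and cv.mse against bandwidths should show the same qualitative pattern as Figure 1: the in-sample curve favors the smallest bandwidth, while the cross-validated curve picks a larger one.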

Figure 1: MSE vs. kernel bandwidth. The black solid line is the MSE for predictions on new observations in 5-fold CV. The green dashed line is the in-sample MSE when all of the data is used (so the training set = testing set = the whole data).

Figure 2: growth vs. initgdp. The red line is the line for the linear model. The green points are the kernel regression fitted values with bandwidth = 0.3.

In particular, the sign of the coefficient for initgdp is opposite to what we obtained in problem 1. This result suggests that, “holding constant popgro and inv”, high growth rates are associated with low initial levels of GDP, consistent with the claim we want to test.
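A coefficient can change sign in this way whenever the added covariates are correlated with the original predictor. The sketch below shows the mechanism on synthetic data; the variables x, z, and y and all coefficients are hypothetical, and nothing here is a claim about why the sign flips in the OECD data.

```r
# Illustration only: synthetic data with hypothetical variables.
set.seed(1)
n <- 500
z <- rnorm(n)                    # stands in for an omitted covariate
x <- z + rnorm(n, sd = 0.5)      # predictor, positively correlated with z
y <- -1 * x + 3 * z + rnorm(n)   # x has a negative effect once z is held fixed

# Slope on x when z is left out of the model
marginal.slope <- coef(lm(y ~ x))["x"]
# Slope on x when z is controlled for
partial.slope <- coef(lm(y ~ x + z))["x"]
```

Here the regression of y on x alone picks up the positive effect of z through its correlation with x, so marginal.slope is positive even though partial.slope is negative.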

5. Answer: Run the provided code:

oecd.npr <- npreg(growth ~ initgdp + popgro + inv, data=oecdpanel,
  tol=0.1, ftol=0.1)
summary(oecd.npr)
# output contains this:
#               initgdp popgro inv
# Bandwidth(s): 0.211   0.0591 0.

The selected bandwidth for initgdp is quite a bit smaller than 0.3.

6. Answer:
# Find and report the medians:
med.popgro = median(oecdpanel$popgro)
med.inv = median(oecdpanel$inv)
> signif(med.popgro,3)
[1] -2.
> signif(med.inv,3)
[1] -1.

There are several ways to get the predictions. One is to make a new data frame which has the values we want. Here we re-use the real initgdp values (we could also use an equally-spaced grid), but sort them in increasing order, and fix popgro and inv to their medians.

# Make a new data set med.oecdpanel that contains initgdp, popgro, and inv,
# with popgro and inv replaced by their respective medians.
# Use the actual values of initgdp, but sort them for easier plotting later.
initgdp.sort = sort(oecdpanel$initgdp)
med.oecdpanel = data.frame(initgdp=initgdp.sort, popgro=med.popgro, inv=med.inv)

# Find the predicted growth rates under the model from problem 4
mod4.med.preds = predict(lm.4, newdata = med.oecdpanel)
# Find the predicted growth rates under the model from problem 5
mod5.med.preds = predict(oecd.npr, newdata = med.oecdpanel)
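The equally-spaced grid mentioned above as an alternative could be built along the following lines. This is a sketch on a synthetic stand-in data frame (toy), which borrows the column names of oecdpanel but none of its values; the grid size of 50 is an arbitrary choice.

```r
# Illustration only: toy is a synthetic stand-in, not oecdpanel.
set.seed(2)
toy <- data.frame(initgdp = runif(100, 6, 9),
                  popgro = rnorm(100),
                  inv = rnorm(100))
toy$growth <- 0.1 - 0.01 * toy$initgdp + 0.02 * toy$inv + rnorm(100, sd = 0.02)

# Equally spaced initgdp values across the observed range, with the
# other two covariates fixed at their medians (recycled to all rows)
grid.df <- data.frame(
  initgdp = seq(min(toy$initgdp), max(toy$initgdp), length.out = 50),
  popgro = median(toy$popgro),
  inv = median(toy$inv)
)

toy.lm <- lm(growth ~ initgdp + popgro + inv, data = toy)
grid.preds <- predict(toy.lm, newdata = grid.df)
```

Because the grid is already in increasing order, grid.preds can be passed straight to lines() without sorting. The same grid.df could be handed to predict() for a fitted npreg object in place of toy.lm.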

# Plot growth vs. initgdp for med.data and add the fitted values for both models:
pdf("graphics/p6.pdf")
# Make scatterplot of initgdp versus growth:
plot(oecdpanel$initgdp, oecdpanel$growth, lwd = 2, cex = 0.3, xlab = "initgdp", ylab = "growth",

kernel.fold.mses = vector(length=nfolds)
for (fold in 1:nfolds) {
  train = oecdpanel[case.folds!=fold,]
  test = oecdpanel[case.folds==fold,]
  # Fit the models
  linear.model = lm(growth ~ initgdp+popgro+inv, data=train)
  kernel.model = npreg(growth ~ initgdp + popgro + inv, data=train,
    tol=0.1, ftol=0.1)
  # Predict on the test set
  linear.predictions = predict(linear.model, newdata=test)
  kernel.predictions = predict(kernel.model, newdata=test)
  linear.fold.mses[fold] = mean((test$growth - linear.predictions)^2)
  kernel.fold.mses[fold] = mean((test$growth - kernel.predictions)^2)
}
cv.linear.mod = mean(linear.fold.mses)
cv.kernel.mod = mean(kernel.fold.mses)

Based on the cross-validation scores (0.000782 for the linear model and 0.000754 for the kernel regression), one would prefer the kernel regression model.
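The loop above relies on nfolds, case.folds, and linear.fold.mses having been set up beforehand. One way such a setup might look is sketched below on synthetic data. It is not the original setup code: the data frame toy is made up, and base R's loess is used as a stand-in for npreg so that the example needs no extra packages.

```r
# Illustration only: synthetic data, with loess standing in for npreg.
set.seed(7)
n <- 300
toy <- data.frame(x = runif(n, 0, 3))
toy$y <- sin(2 * toy$x) + rnorm(n, sd = 0.3)

# Assign cases to folds by shuffling a balanced vector of fold labels
nfolds <- 5
case.folds <- sample(rep(1:nfolds, length.out = n))

linear.fold.mses <- numeric(nfolds)
smooth.fold.mses <- numeric(nfolds)
for (fold in 1:nfolds) {
  train <- toy[case.folds != fold, ]
  test <- toy[case.folds == fold, ]
  linear.model <- lm(y ~ x, data = train)
  # surface = "direct" lets loess predict at test points that fall
  # outside the range of the training x values
  smooth.model <- loess(y ~ x, data = train, span = 0.5,
                        control = loess.control(surface = "direct"))
  linear.fold.mses[fold] <- mean((test$y - predict(linear.model, newdata = test))^2)
  smooth.fold.mses[fold] <- mean((test$y - predict(smooth.model, newdata = test))^2)
}
cv.linear <- mean(linear.fold.mses)
cv.smooth <- mean(smooth.fold.mses)
```

Both models are scored on exactly the same folds, so the comparison of cv.linear with cv.smooth is not affected by differences in how the data were split.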

8. Answer: The initial analysis, in problems 1 to 3, undermined the idea of catching up. In the subsequent problems, we began controlling for other variables, and found that the idea of catching up might be correct. Both analyses are legitimate, if interpreted carefully, but the thing that makes the second analysis interesting is that it is more suggestive of causality.

If we can establish that having a lower initial GDP in some sense causes higher growth rates, then we might have arrived at a conclusion with a lot of policy relevance. It would suggest, for instance, that we should expect a country’s growth rate to decline as it got richer, and not attribute that to problems with the economy. It would also suggest that growth should be extra high after recessions and wars, which lower GDP, but that those rates couldn’t be sustained (and blowing up part of the country to raise the growth rate would be perverse).

The kernel regression model complicates matters by allowing us to model growth as a possibly non-linear function of initgdp. Hence, the question “Is the catching-up theory correct?” does not necessarily have a yes/no answer; it appears that the theory may hold true for specific ranges of initgdp but not everywhere. It is important to keep in mind that while the idea of catching up may be legitimate, it seems unlikely to account for much of the variability in growth rates, as you might guess from the last plot.