Data Analysis The Truth About Linear Regression 1, Exercises - Engineering, Exercises of Advanced Data Analysis

Carnegie Mellon University (CMU)Advanced Data Analysis

Data Analysis The Truth About Linear Regression 1, Exercises - Engineering - Prof. Cosma Shalizi, Advanced Data Analysis, The Advantages of Backwardness

Typology: Exercises

2010/2011

Uploaded on 11/03/2011

bridge 🇺🇸

Homework Assignment 2: The Advantages of Backwardness

36-402, Advanced Data Analysis, Spring 2011

SOLUTIONS

This problem set was based on the preliminary analysis in the paper

E. Maasoumi, J. S. Racine and T. Stengos, “Growth and convergence: a profile of distribution dynamics and mobility”, Journal of Econometrics 136 (2007): 483–508, http://surface.syr.edu/cgi/viewcontent.cgi?article=1004&context=ecn

# Setup
install.packages("np")
require(np)
data(oecdpanel)

1. Answer:

# Fit a linear model of growth on initgdp
lm.1 = lm(growth ~ initgdp, data = oecdpanel)
> signif(lm.1$coefficients,3)
(Intercept)     initgdp
   -0.02130     0.00533

The coefficient of initgdp in the linear model is about 5 × 10^-3, suggesting that higher growth rates are associated with higher initial levels of GDP, contrary to the claim we want to test.

2. Answer: See Figure 1.

The black line suggests that a bandwidth near 0.3 minimizes the MSE for predictions on new observations. The green line shows the MSE when the model is fit to the whole data set and then evaluated on that same data. The important thing to notice about the green line is that the in-sample fit always improves as the bandwidth decreases: the test set = the training set = the whole data, so there is no cross-validation and nothing penalizes overfitting. To find the bandwidth that is best for predicting new observations, the training set must be kept separate from the test set.
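The contrast between in-sample and out-of-sample error can be reproduced on a small scale. The following is a minimal sketch under stated assumptions, not the code used for Figure 1: the data are simulated rather than taken from oecdpanel, nw.predict is a hypothetical hand-written Gaussian-kernel smoother standing in for npreg, and a single train/test split replaces 5-fold cross-validation to keep the example short.

```r
# Simulated data (an assumption for illustration; not the oecdpanel data)
set.seed(1)
n <- 200
x <- runif(n, 0, 3)
y <- sin(2 * x) + rnorm(n, sd = 0.3)

# Hypothetical helper: Nadaraya-Watson kernel regression, Gaussian kernel
nw.predict <- function(x.train, y.train, x.new, h) {
  sapply(x.new, function(x0) {
    w <- dnorm((x.train - x0) / h)   # kernel weights around x0
    sum(w * y.train) / sum(w)        # weighted average of the responses
  })
}

# Bandwidths listed from largest to smallest
bandwidths <- c(0.8, 0.4, 0.2, 0.1, 0.05, 0.02)

# In-sample MSE: fit and evaluate on the same points
in.sample.mse <- sapply(bandwidths, function(h)
  mean((y - nw.predict(x, y, x, h))^2))

# Held-out MSE: fit on one half, evaluate on the other half
train <- sample(n, n / 2)
held.out.mse <- sapply(bandwidths, function(h)
  mean((y[-train] - nw.predict(x[train], y[train], x[-train], h))^2))

print(data.frame(bandwidths, in.sample.mse, held.out.mse))
```

In this sketch the in-sample column keeps shrinking as the bandwidth shrinks, while the held-out column does not follow it down, which is the pattern the green and black lines are meant to show.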

Figure 1 (MSE vs. kernel bandwidth): The black solid line is the MSE for predictions on new observations in 5-fold CV. The green dashed line is the in-sample MSE when all of the data is used (so the training set = testing set = the whole data).

Figure 2 (growth vs. initgdp): The red line is the fit of the linear model. The green points are the kernel regression fitted values with bandwidth = 0.3.

In particular, the sign of the coefficient for initgdp is opposite to what we obtained in problem 1. This result suggests that, “holding constant popgro and inv”, high growth rates are associated with low initial levels of GDP, consistent with the claim we want to test.
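A coefficient can change sign once correlated covariates are added to the model. The sketch below illustrates this on simulated data only: the variables x, y and z and all the numbers in it are assumptions chosen to make the effect visible, and it says nothing about the actual oecdpanel fits.

```r
# Simulated illustration (all variables here are hypothetical)
set.seed(2)
n <- 500
z <- rnorm(n)                                # control variable
x <- z + rnorm(n, sd = 0.5)                  # predictor, correlated with z
y <- -0.5 * x + 2 * z + rnorm(n, sd = 0.5)   # true effect of x is negative

# Slope on x when z is left out of the model
marginal <- unname(coef(lm(y ~ x))["x"])

# Slope on x when z is held constant
partial <- unname(coef(lm(y ~ x + z))["x"])

print(c(marginal = marginal, partial = partial))
```

Here the slope on x alone is positive because x acts partly as a proxy for z, while the slope with z in the model is negative, close to the value used in the simulation.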

5. Answer: Run the provided code:

oecd.npr <- npreg(growth ~ initgdp + popgro + inv, data=oecdpanel, tol=0.1, ftol=0.1)
summary(oecd.npr)

# output contains this:
#               initgdp popgro inv
# Bandwidth(s): 0.211 0.0591 0.

The selected bandwidth for initgdp is quite a bit smaller than 0.3.

6. Answer:

# Find and report the medians:
med.popgro = median(oecdpanel$popgro)
med.inv = median(oecdpanel$inv)
> signif(med.popgro,3)
[1] -2.
> signif(med.inv,3)
[1] -1.

There are several ways to get the predictions. One is to make a new data frame which has the values we want. Here we re-use the real initgdp values (we could also use an equally-spaced grid), but sort them in increasing order, and fix popgro and inv to their medians.

# Make a new data set med.oecdpanel that contains initgdp, popgro,
# and inv, but replace popgro and inv with their respective medians
# Use the actual values of initgdp, but sort them for easier plotting later
initgdp.sort = sort(oecdpanel$initgdp)
med.oecdpanel = data.frame(initgdp=initgdp.sort, popgro=med.popgro, inv=med.inv)

# Find the predicted growth rates under the model from problem 4
mod4.med.preds = predict(lm.4, newdata = med.oecdpanel)

# Find the predicted growth rates under the model from problem 5
mod5.med.preds = predict(oecd.npr, newdata = med.oecdpanel)

# Plot growth vs. initgdp for med.data and add the fitted values for both models:
pdf("graphics/p6.pdf")

# Make scatterplot of initgdp versus growth:
plot(oecdpanel$initgdp, oecdpanel$growth, lwd = 2, cex = 0.3, xlab = "initgdp", ylab = "growth",

kernel.fold.mses = vector(length=nfolds)
for (fold in 1:nfolds) {
  train = oecdpanel[case.folds!=fold,]
  test = oecdpanel[case.folds==fold,]

  # Fit the models
  linear.model = lm(growth ~ initgdp+popgro+inv, data=train)
  kernel.model = npreg(growth ~ initgdp + popgro + inv, data=train, tol=0.1, ftol=0.1)

  # Predict on the test set
  linear.predictions = predict(linear.model, newdata=test)
  kernel.predictions = predict(kernel.model, newdata=test)
  linear.fold.mses[fold] = mean((test$growth - linear.predictions)^2)
  kernel.fold.mses[fold] = mean((test$growth - kernel.predictions)^2)
}
cv.linear.mod = mean(linear.fold.mses)
cv.kernel.mod = mean(kernel.fold.mses)

Based on the cross-validation scores (0.000782 for the linear model and 0.000754 for the kernel regression), one would prefer the kernel regression model.
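The part of the solution that sets up the folds is not shown in this excerpt. The following self-contained sketch shows one common way to do it, under loudly stated assumptions: the data frame sim is simulated, nfolds is set to 5 only as an example, the random fold assignment is one convention among several, and loess is swapped in for npreg so that the example runs with base R alone.

```r
# Simulated data (an assumption; not the oecdpanel data)
set.seed(3)
n <- 300
sim <- data.frame(x = runif(n, -2, 2))
sim$y <- sim$x^2 + rnorm(n, sd = 0.5)

# Assign every case to one of nfolds folds, in random order
nfolds <- 5   # example value, not taken from the assignment
case.folds <- sample(rep(1:nfolds, length.out = n))

linear.fold.mses <- numeric(nfolds)
flexible.fold.mses <- numeric(nfolds)
for (fold in 1:nfolds) {
  train <- sim[case.folds != fold, ]
  test  <- sim[case.folds == fold, ]

  # Fit both models on the training folds only
  linear.model <- lm(y ~ x, data = train)
  # loess stands in for npreg; surface = "direct" lets predict() handle
  # test points outside the range of the training x values
  flexible.model <- loess(y ~ x, data = train,
                          control = loess.control(surface = "direct"))

  # Score both models on the held-out fold
  linear.fold.mses[fold] <-
    mean((test$y - predict(linear.model, newdata = test))^2)
  flexible.fold.mses[fold] <-
    mean((test$y - predict(flexible.model, newdata = test))^2)
}

# Average the per-fold errors to get one CV score per model
cv.scores <- c(linear = mean(linear.fold.mses),
               flexible = mean(flexible.fold.mses))
print(cv.scores)
```

Because the simulated relationship is strongly curved, the flexible fit gets the lower cross-validated MSE here; on real data the gap can be much smaller, as it is for the two scores reported above.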

Answer: The initial analysis, in problems 1 to 3, undermined the idea of catching up. In the subsequent problems, we began controlling for other variables, and found that the idea of catching up might be correct. Both analyses are legitimate, if interpreted carefully, but the thing that makes the second analysis interesting is that it is more suggestive of causality. If we can establish that having a lower initial GDP in some sense causes higher growth rates, then we might have arrived at a conclusion with a lot of policy relevance. It would suggest, for instance, that we should expect a country’s growth rate to decline as it gets richer, and not attribute that to problems with the economy. It would also suggest that growth should be extra high after recessions and wars, which lower GDP, but that those rates couldn’t be sustained (and blowing up part of the country to raise the growth rate would be perverse).

The kernel regression model complicates matters by allowing us to model growth as a possibly non-linear function of initgdp. Hence, the question “Is the catching up theory correct?” does not necessarily have a yes/no answer; it appears that the theory may hold true for specific ranges of initgdp but not everywhere.

It is important to keep in mind that while the idea of catching up may be legitimate, it seems unlikely to account for much of the variability in growth rates, as you might guess from the last plot.
