Data Analysis Urban Scaling Soln, Exercises - Engineering, Exercises of Advanced Data Analysis

Data Analysis Urban Scaling Soln, Exercises - Engineering - Prof. Cosma Shalizi, Advanced Data Analysis

Typology: Exercises

2010/2011

Uploaded on 11/03/2011

bridge
bridge 🇺🇸

4.9

(13)

287 documents

1 / 15

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
Midterm Exam 1: Urban Scaling, Continued
36-402, Advanced Data Analysis, Spring 2011
SOLUTIONS
General set-up:
gmp = read.csv(file = "gmp-2006.csv")
Your data file was derived from this data file, plus or minus 4% noise for each
observation.
1. Answer: The basic scatterplot function will drop points where either the
horizontal or vertical coordinate is NA. Some but not all smoothers will also
do so. However, smooth.spline will not. Because extracting the points
where neither variable is a repetitive operation, it’s better to turn it into
a function. Fortunately, R has already done this for us, in the na.omit
function. Thus na.omit(gmp) would be a version of the data set with only
rows with complete data. This is however too aggressive for this problem
and the next if we’re interested in finance, there’s no sense dropping a
city because we don’t have information about professional services.
## Here is a function for adding spline fits to the desired plots
add.spline.fit = function(x.name = "pop", y.name = "finance"){
# Remove rows of data with NA y-values
relevant.data = na.omit(gmp[,c(x.name,y.name)])
# note selecting just the columns needed for the plot:
# relevant.data will have two columns, and x will come first
# Fit the spline (using generalized cross validation)
my.spline = smooth.spline(x=relevant.data[,1],y=relevant.data[,2])
# Add to the current plot
lines(my.spline, lwd = 2, col = 2)
# The "$x" attribute of my.spline has the unique values of x IN ORDER
# while the "$y" has the corresponding fitted values; since the lines()
# function looks for either separate x and y arguments, or one argument with
# components named x and y, this works.
}
# Open up a 4-pane plotting frame
par(mfrow = c(2,2))
1
pf3
pf4
pf5
pf8
pf9
pfa
pfd
pfe
pff

Partial preview of the text

Download Data Analysis Urban Scaling Soln, Exercises - Engineering and more Exercises Advanced Data Analysis in PDF only on Docsity!

Midterm Exam 1: Urban Scaling, Continued

36-402, Advanced Data Analysis, Spring 2011

SOLUTIONS

General set-up:

gmp = read.csv(file = "gmp-2006.csv")

Your data file was derived from this data file, plus or minus 4% noise for each observation.

  1. Answer: The basic scatterplot function will drop points where either the horizontal or vertical coordinate is NA. Some but not all smoothers will also do so. However, smooth.spline will not. Because extracting the points where neither variable is a repetitive operation, it’s better to turn it into a function. Fortunately, R has already done this for us, in the na.omit function. Thus na.omit(gmp) would be a version of the data set with only rows with complete data. This is however too aggressive for this problem and the next — if we’re interested in finance, there’s no sense dropping a city because we don’t have information about professional services.

Here is a function for adding spline fits to the desired plots

add.spline.fit = function(x.name = "pop", y.name = "finance"){

Remove rows of data with NA y-values

relevant.data = na.omit(gmp[,c(x.name,y.name)])

note selecting just the columns needed for the plot:

relevant.data will have two columns, and x will come first

Fit the spline (using generalized cross validation)

my.spline = smooth.spline(x=relevant.data[,1],y=relevant.data[,2])

Add to the current plot

lines(my.spline, lwd = 2, col = 2)

The "$x" attribute of my.spline has the unique values of x IN ORDER

while the "$y" has the corresponding fitted values; since the lines()

function looks for either separate x and y arguments, or one argument with

components named x and y, this works.

}

Open up a 4-pane plotting frame

par(mfrow = c(2,2))

plot(gmp$pop, gmp$finance, cex=0.4, lwd=1.4, pch=16, main = "Finance", xlab = "Population", ylab = "Share",log="x") add.spline.fit(y.name = "finance")

plot(gmp$pop, gmp$professional.and.technical, cex=0.4, lwd=1.4, pch=16, main = "Professional and technical services", xlab = "Population", ylab = "Share",log="x") add.spline.fit(y.name = "professional.and.technical")

plot(gmp$pop, gmp$ict, cex=0.4, lwd=1.4, pch=16, main = "Information technology", xlab = "Population", ylab = "Share",log="x") add.spline.fit(y.name = "ict")

plot(gmp$pop, gmp$management, cex=0.4, lwd=1.4, pch=16, main = "Management", xlab = "Population", ylab = "Share",log="x") add.spline.fit(y.name = "management")

See Figure 1. The relationship between finance and pop is nearly mono- tonic; increasing pop generally corresponds to increasing finance. For professional services and management, there is a leveling off on the right half of each plot, but no decline. Only information technology lacks a clear pattern. Notice that all the plots show population on a logarithmic scale — this is just for visual clarity and is not required. Since there are not that many large cities, using an ordinary linear scale for population would bunch most of the data up on the left-hand side of each plot, making it harder to see what’s happening. If we used a different smoother, and still wanted to use lines to add the smoothed values to the plot, we would have to make sure they came in some sort of sensible order — otherwise we’d get line segments connecting points for cities in alphabetical order, which is a mess. Acceptable alter- natives include: using points on the fitted values; re-arranging the fitted values in order of increasing (or decreasing) population; using predict to get values at regularly spaced and increasing grid points, and then lines to plot them (examples in the notes); or even just observing that there was a problem with the plot, which you did not know how to solve.

  1. Answer: Re-use the machinery from the last problem.

par(mfrow = c(2,2))

plot(gmp$finance, gmp$pcgmp, cex=0.4, lwd=1.4, pch=16, main = "pcgmp vs finance share", xlab = "Share", ylab = "pcgmp") add.spline.fit(x.name = "finance", y.name = "pcgmp")

plot(gmp$professional.and.technical, gmp$pcgmp, cex=0.4, lwd=1.4, pch=16, main = "pcgmp vs professional share", xlab = "Share", ylab = "pcgmp") add.spline.fit(x.name = "professional.and.technical", y.name = "pcgmp")

plot(gmp$ict, gmp$pcgmp, cex=0.4, lwd=1.4, pch=16, main = "pcgmp vs ict share", xlab = "Share", ylab = "pcgmp") add.spline.fit(x.name = "ict", y.name = "pcgmp")

plot(gmp$management, gmp$pcgmp, cex=0.4, lwd=1.4, pch=16, main = "pcgmp vs management share", xlab = "Share", ylab = "pcgmp") add.spline.fit(x.name = "management", y.name = "pcgmp")

See Figure 2. The plots generally suggest that increasing the share for specialist service industries corresponds to increase productivity. The re- lationship is not especially clean for professional and technical services. For ICT, the only non-monotonicity is driven by the contrast between the city with the second-highest share of ICT (San Jose-Sunnyvale-Santa Clara, CA) and that with the highest share (Corvallis, OR) — which is the unsurprising fact that Silicon Valley is quite rich.

  1. (a) Answer: The first two terms, α 0 and b ln N , are the power-law scaling part of the model. These terms are parametric: they assume a specific functional form for the relationship between the response and ln N , involving only a finite number (in this case, two) unknown constants. The fj are the non-parametric terms — or alternatively, “infinitely parametric”, since in principle they need to be estimated at infinitely many points, and we (or the gam function) use smoothing to interpolate between the points we see. — It’s defensible to argue that α 0 is not really a parametric term, since it just does the work of centering the fitted values. (b) Answer: We used gam from the mgcv package: library(mgcv)

0.05 0.15 0.25 0.

20000

40000

60000

80000

pcgmp vs finance share

Share

pcgmp

0.05 0.10 0.

20000

40000

60000

80000

pcgmp vs professional share

Share

pcgmp

0.0 0.1 0.2 0.3 0.4 0.5 0.

20000

40000

60000

80000

pcgmp vs ict share

Share

pcgmp

0.00 0.01 0.02 0.03 0.04 0.

20000

40000

60000

80000

pcgmp vs management share

Share

pcgmp

Figure 2:

0.05 0.10 0.15 0.20 0.25 0.30 0.

-0.

finance

s(finance,1)

0.05 0.10 0.

-0.

professional.and.technical

s(professional.and.technical,1)

0.0 0.1 0.2 0.3 0.

-0.

ict

s(ict,1.97)

0.00 0.01 0.02 0.03 0.04 0.

-0.

management

s(management,1.52)

Figure 3:

-2 -1 0 1 2

-0.

-0.

Normal Q-Q Plot

Theoretical Quantiles

Sample Quantiles

10.2 10.4 10.6 10.8 11.0 11.

-0.

-0.

Residuals vs. fitted values

Fitted values

Residuals

Figure 4:

gmp.plus.10 <- gmp gmp.plus.10$pop <- 1.1*gmp.plus.10$pop new.p.law.preds <- predict(lm.p.law,newdata=gmp.plus.10) Now we just look at the average change from old to new predictions: > mean(new.p.law.preds - old.p.law.preds) [1] 0. (You can check that the answer doesn’t depend on how the new predictions are calculated.) This approach will work for any kind of model which has a predict method. There is a short-cut, however, in this case, since we are looking at a linear model without interactions. Since, for any num- bers M and N , log M N = log M + log N , increasing population by a factor of 1.1 is the same as adding log 1.1 to log N. In the power law model, then, the change in the predicted value is

∆ loĝ y = (b 0 + b 1 log 1. 1 N ) − (b 0 + b 1 log N ) = (b 0 + b 1 log 1.1 + b 1 log N ) − (b 0 + b 1 log N ) = b 1 log 1. 1

no matter what N was to start with. This is > log(1.1) * coefficients(lm.p.law)[2] log(pop)

which agrees with what we got by direct calculation, as it should.

(b) Answer: Again, we can use direct calculation. Because the log- additive model needs the industry shares, the new data frame we give predict must have those columns; so we’ll use the gmp.plus. data frame made in part (a); but we need to drop rows with NAs, since the model doesn’t know what to do with them. old.lam.preds <- predict(gam.3) # Gives fitted values without newdata new.lam.preds <- predict(gam.3,newdata=na.omit(gmp.plus.10)) Taking the mean change: > mean(new.lam.preds - old.lam.preds) [1] -0. Alternatively, the shortcut for part (a) still works here, because we (i) we are making an additive change to log N and to no other input variable, (ii) the response, in the model, depends linearly on log N , and (iii) there is no interaction in the model between log N and any other variable. We do however need to extract the parametric coef- ficient b.

> log(1.1)*coefficients(gam.3)["log(pop)"] log(pop) -0. Again, this agrees exactly with direct calculation. (c) Answer: The two models lead to different conclusions: The power- law model says that increases in population lead to (or at least pre- dict) increases in productivity, whereas the additive model suggests that increases in population, all else being equal, lead to a slight de- crease in productivity. (Notice that this matches the sign the two models give to the coefficient for log N .) To the extent that larger populations go along with higher produc- tivity in the power-law model, the additive model suggests that this is because bigger cities have bigger shares of high-productivity indus- tries. Since the data arise from observing how cities happen to be, rather than experimentally making cities be certain ways and seeing what happens, it is hazardous to draw causal conclusions from either model.

  1. (a) Answer:

in-sample MSE for log.y in problem 3

> mean(residuals(gam.3)^2) [1] 0.

in-sample MSE squared error for log.y in power-law model

> mean(residuals(lm.p.law)^2) [1] 0. We see that the in-sample MSE for the additive model is about 0. 007 less than the in-sample MSE for the power-law model. This should be no surprise; the additive model includes all the terms of the power-law model and then some, meaning that additive model has the flexibility to fit the data more closely than the power-law model. (b) Answer: There are several ways to do this problem. Since the log-additive model is more flexible than the power law model, and includes it as a special case, we cannot just compare in- sample MSEs. (See part (a) again.) We could however ask whether the amount by which the log-additive model does better in sample is compatible with the power law model actually being right. This is what the partial F -test from 401 checks for linear models (with Gaussian noise, etc.); similarly for the log-likelihood ratio test. The testing procedure of the notes for lecture 10 works along the same lines. Any of them could be used, but all of them would require sim- ulating from the null model to get the right sampling distribution. (Just plugging in to a table for the F distribution would not work.) So this could be done along the lines of Lecture 10 and Homework 6, or of Homework 5.

fold.mses.gam[fold] = mean((truth - predictions.gam)^2)

power-law MSE:

fold.mses.lm[fold] = mean((truth - predictions.lm)^2) }

Report an estimate of how much the power-law model MSE exceeds

the power-law MSE:

return(mean(fold.mses.gam) - mean(fold.mses.lm)) }

Running on the data,

> observed.cv <- do.nfold.cv(data=gmp.clean) > observed.cv [1] -0.

So the additive model seems to generalize better than the power law. This does not tell us, however, whether it does so much better that we ought to give up on the power law.

Using system.time, the cross-validation takes about 0.5 seconds to run on my laptop; in fact just doing each fold takes about 0.1 second, mostly from fitting the additive model. Doing 1000 cross-validations should therefore take about 0. 5 ∗ 1000 = 500 seconds, or a litle over 8 minutes.

We also need to simulate from the null model, i.e., the power law. This model says nothing about how city size or any of the other ex- planatory variables are distributed, so we will follow the resampling- the-residuals idea from the lecture on bootstrapping. This takes no position on whether the noise is Gaussian, but it does assume that the noise is independent of the explanatory variables. One could do something more elaborate, e.g. a kernel density estimate of the dis- tribution of residuals conditional on the input variables, but what we’re doing here is sufficient.

simulate.power.law <- function(data=gmp.clean,model=lm.p.law) { n <- nrow(data) fitted.pcgmp <- predict(model,newdata=data) noise <- sample(residuals(model),size=n,replace=TRUE)

Or use the resample() command defined in the lecture notes

data$pcgmp <- exp(fitted.pcgmp + noise) return(data) }

Finally, we put this together:

> bootstrap.cv.mses <- replicate(1000,do.nfold.cv(data=simulate.power.law())) > (1+sum(bootstrap.cv.mses < observed.cv))/(1+length(bootstrap.cv.mses)) [1] 0.

This is 1/1001, which means that the amount by which the additive model beats the power law in real life is bigger than on all the 1000 simulations. In fact, the smallest value in bootstrap.cv.mses is − 5. 5 × 10 −^5 ; the observed value is almost 200 times as large. We can conclude with high confidence that the additive model does in fact generalize better than the power law. (This indeed took eight and a half minutes to run, long enough to work on something small.)

  1. (a) Answer: If we take the power law model at face value, then cities should become more productive simply by increasing their popula- tion; the economic changes would take care of themselves. Presum- ably there are limits to this (it is hard to see how it could work if neighboring cities simply dumped all their retirees, disabled people, homeless people, etc. into one city), but the model suggests that, within the ordinary demographic range for America, a city could in- deed become more productive just by attracting more people. On the other hand, the log-additive model implies that population as such does not matter, that what’s important is the city’s industrial mix, and that the correlation of city size with productivity is due to higher-productivity industries tending to be located in bigger cities. This suggests that cities should concentrate on attracting those in- dustries (which might in fact lead to population growth). It does not say anything about how best to do that.^3 Since the additive model seems to be superior to the power law model, I would tentatively say that changing the industrial mix is more important than changing population, which seems to have no effect for cities with the same industry shares. (The question was limited to “American cities”, because the data provides no information on cities elsewhere.) (b) Answer: The simplest way to improve the analysis would be to get the missing data values! We could also try incorporating interactions between variables (perhaps by a kernel regression model), or addi- tional demographic variables (e.g., what fraction of the population is of working age?), or additional industrial variables (e.g., shares of other industries, more detailed industry categories). More generally, it is hard to tell from static data like this whether increasing a city’s population attracts more high-productivity in- dustries, or population tends to accumulate around such industries. Carefully looking at changes over time could be helpful to figuring out which comes first. (Perhaps they feed back on each other in a virtuous cycle.) So would including additional variables. If we could incorporate other variables which might effect both population and (^3) It is also not clear whether every city could follow this strategy at once — it might be that there is only so much demand for financial or management services.