









Study with the several resources on Docsity
Earn points by helping other students or get them with a premium plan
Prepare for your exams
Study with the several resources on Docsity
Earn points to download
Earn points by helping other students or get them with a premium plan
Data Analysis Urban Scaling Soln, Exercises - Engineering - Prof. Cosma Shalizi, Advanced Data Analysis
Typology: Exercises
1 / 15
This page cannot be seen from the preview
Don't miss anything!










General set-up:
gmp = read.csv(file = "gmp-2006.csv")
Your data file was derived from this data file, plus or minus 4% noise for each observation.
add.spline.fit = function(x.name = "pop", y.name = "finance"){
relevant.data = na.omit(gmp[,c(x.name,y.name)])
my.spline = smooth.spline(x=relevant.data[,1],y=relevant.data[,2])
lines(my.spline, lwd = 2, col = 2)
}
par(mfrow = c(2,2))
plot(gmp$pop, gmp$finance, cex=0.4, lwd=1.4, pch=16, main = "Finance", xlab = "Population", ylab = "Share",log="x") add.spline.fit(y.name = "finance")
plot(gmp$pop, gmp$professional.and.technical, cex=0.4, lwd=1.4, pch=16, main = "Professional and technical services", xlab = "Population", ylab = "Share",log="x") add.spline.fit(y.name = "professional.and.technical")
plot(gmp$pop, gmp$ict, cex=0.4, lwd=1.4, pch=16, main = "Information technology", xlab = "Population", ylab = "Share",log="x") add.spline.fit(y.name = "ict")
plot(gmp$pop, gmp$management, cex=0.4, lwd=1.4, pch=16, main = "Management", xlab = "Population", ylab = "Share",log="x") add.spline.fit(y.name = "management")
See Figure 1. The relationship between finance and pop is nearly mono- tonic; increasing pop generally corresponds to increasing finance. For professional services and management, there is a leveling off on the right half of each plot, but no decline. Only information technology lacks a clear pattern. Notice that all the plots show population on a logarithmic scale — this is just for visual clarity and is not required. Since there are not that many large cities, using an ordinary linear scale for population would bunch most of the data up on the left-hand side of each plot, making it harder to see what’s happening. If we used a different smoother, and still wanted to use lines to add the smoothed values to the plot, we would have to make sure they came in some sort of sensible order — otherwise we’d get line segments connecting points for cities in alphabetical order, which is a mess. Acceptable alter- natives include: using points on the fitted values; re-arranging the fitted values in order of increasing (or decreasing) population; using predict to get values at regularly spaced and increasing grid points, and then lines to plot them (examples in the notes); or even just observing that there was a problem with the plot, which you did not know how to solve.
par(mfrow = c(2,2))
plot(gmp$finance, gmp$pcgmp, cex=0.4, lwd=1.4, pch=16, main = "pcgmp vs finance share", xlab = "Share", ylab = "pcgmp") add.spline.fit(x.name = "finance", y.name = "pcgmp")
plot(gmp$professional.and.technical, gmp$pcgmp, cex=0.4, lwd=1.4, pch=16, main = "pcgmp vs professional share", xlab = "Share", ylab = "pcgmp") add.spline.fit(x.name = "professional.and.technical", y.name = "pcgmp")
plot(gmp$ict, gmp$pcgmp, cex=0.4, lwd=1.4, pch=16, main = "pcgmp vs ict share", xlab = "Share", ylab = "pcgmp") add.spline.fit(x.name = "ict", y.name = "pcgmp")
plot(gmp$management, gmp$pcgmp, cex=0.4, lwd=1.4, pch=16, main = "pcgmp vs management share", xlab = "Share", ylab = "pcgmp") add.spline.fit(x.name = "management", y.name = "pcgmp")
See Figure 2. The plots generally suggest that increasing the share for specialist service industries corresponds to increase productivity. The re- lationship is not especially clean for professional and technical services. For ICT, the only non-monotonicity is driven by the contrast between the city with the second-highest share of ICT (San Jose-Sunnyvale-Santa Clara, CA) and that with the highest share (Corvallis, OR) — which is the unsurprising fact that Silicon Valley is quite rich.
0.05 0.15 0.25 0.
20000
40000
60000
80000
pcgmp vs finance share
Share
pcgmp
0.05 0.10 0.
20000
40000
60000
80000
pcgmp vs professional share
Share
pcgmp
0.0 0.1 0.2 0.3 0.4 0.5 0.
20000
40000
60000
80000
pcgmp vs ict share
Share
pcgmp
0.00 0.01 0.02 0.03 0.04 0.
20000
40000
60000
80000
pcgmp vs management share
Share
pcgmp
Figure 2:
0.05 0.10 0.15 0.20 0.25 0.30 0.
-0.
finance
s(finance,1)
0.05 0.10 0.
-0.
professional.and.technical
s(professional.and.technical,1)
0.0 0.1 0.2 0.3 0.
-0.
ict
s(ict,1.97)
0.00 0.01 0.02 0.03 0.04 0.
-0.
management
s(management,1.52)
Figure 3:
-2 -1 0 1 2
-0.
-0.
Normal Q-Q Plot
Theoretical Quantiles
Sample Quantiles
10.2 10.4 10.6 10.8 11.0 11.
-0.
-0.
Residuals vs. fitted values
Fitted values
Residuals
Figure 4:
gmp.plus.10 <- gmp gmp.plus.10$pop <- 1.1*gmp.plus.10$pop new.p.law.preds <- predict(lm.p.law,newdata=gmp.plus.10) Now we just look at the average change from old to new predictions: > mean(new.p.law.preds - old.p.law.preds) [1] 0. (You can check that the answer doesn’t depend on how the new predictions are calculated.) This approach will work for any kind of model which has a predict method. There is a short-cut, however, in this case, since we are looking at a linear model without interactions. Since, for any num- bers M and N , log M N = log M + log N , increasing population by a factor of 1.1 is the same as adding log 1.1 to log N. In the power law model, then, the change in the predicted value is
∆ loĝ y = (b 0 + b 1 log 1. 1 N ) − (b 0 + b 1 log N ) = (b 0 + b 1 log 1.1 + b 1 log N ) − (b 0 + b 1 log N ) = b 1 log 1. 1
no matter what N was to start with. This is > log(1.1) * coefficients(lm.p.law)[2] log(pop)
which agrees with what we got by direct calculation, as it should.
(b) Answer: Again, we can use direct calculation. Because the log- additive model needs the industry shares, the new data frame we give predict must have those columns; so we’ll use the gmp.plus. data frame made in part (a); but we need to drop rows with NAs, since the model doesn’t know what to do with them. old.lam.preds <- predict(gam.3) # Gives fitted values without newdata new.lam.preds <- predict(gam.3,newdata=na.omit(gmp.plus.10)) Taking the mean change: > mean(new.lam.preds - old.lam.preds) [1] -0. Alternatively, the shortcut for part (a) still works here, because we (i) we are making an additive change to log N and to no other input variable, (ii) the response, in the model, depends linearly on log N , and (iii) there is no interaction in the model between log N and any other variable. We do however need to extract the parametric coef- ficient b.
> log(1.1)*coefficients(gam.3)["log(pop)"] log(pop) -0. Again, this agrees exactly with direct calculation. (c) Answer: The two models lead to different conclusions: The power- law model says that increases in population lead to (or at least pre- dict) increases in productivity, whereas the additive model suggests that increases in population, all else being equal, lead to a slight de- crease in productivity. (Notice that this matches the sign the two models give to the coefficient for log N .) To the extent that larger populations go along with higher produc- tivity in the power-law model, the additive model suggests that this is because bigger cities have bigger shares of high-productivity indus- tries. Since the data arise from observing how cities happen to be, rather than experimentally making cities be certain ways and seeing what happens, it is hazardous to draw causal conclusions from either model.
> mean(residuals(gam.3)^2) [1] 0.
> mean(residuals(lm.p.law)^2) [1] 0. We see that the in-sample MSE for the additive model is about 0. 007 less than the in-sample MSE for the power-law model. This should be no surprise; the additive model includes all the terms of the power-law model and then some, meaning that additive model has the flexibility to fit the data more closely than the power-law model. (b) Answer: There are several ways to do this problem. Since the log-additive model is more flexible than the power law model, and includes it as a special case, we cannot just compare in- sample MSEs. (See part (a) again.) We could however ask whether the amount by which the log-additive model does better in sample is compatible with the power law model actually being right. This is what the partial F -test from 401 checks for linear models (with Gaussian noise, etc.); similarly for the log-likelihood ratio test. The testing procedure of the notes for lecture 10 works along the same lines. Any of them could be used, but all of them would require sim- ulating from the null model to get the right sampling distribution. (Just plugging in to a table for the F distribution would not work.) So this could be done along the lines of Lecture 10 and Homework 6, or of Homework 5.
fold.mses.gam[fold] = mean((truth - predictions.gam)^2)
fold.mses.lm[fold] = mean((truth - predictions.lm)^2) }
return(mean(fold.mses.gam) - mean(fold.mses.lm)) }
Running on the data,
> observed.cv <- do.nfold.cv(data=gmp.clean) > observed.cv [1] -0.
So the additive model seems to generalize better than the power law. This does not tell us, however, whether it does so much better that we ought to give up on the power law.
Using system.time, the cross-validation takes about 0.5 seconds to run on my laptop; in fact just doing each fold takes about 0.1 second, mostly from fitting the additive model. Doing 1000 cross-validations should therefore take about 0. 5 ∗ 1000 = 500 seconds, or a litle over 8 minutes.
We also need to simulate from the null model, i.e., the power law. This model says nothing about how city size or any of the other ex- planatory variables are distributed, so we will follow the resampling- the-residuals idea from the lecture on bootstrapping. This takes no position on whether the noise is Gaussian, but it does assume that the noise is independent of the explanatory variables. One could do something more elaborate, e.g. a kernel density estimate of the dis- tribution of residuals conditional on the input variables, but what we’re doing here is sufficient.
simulate.power.law <- function(data=gmp.clean,model=lm.p.law) { n <- nrow(data) fitted.pcgmp <- predict(model,newdata=data) noise <- sample(residuals(model),size=n,replace=TRUE)
data$pcgmp <- exp(fitted.pcgmp + noise) return(data) }
Finally, we put this together:
> bootstrap.cv.mses <- replicate(1000,do.nfold.cv(data=simulate.power.law())) > (1+sum(bootstrap.cv.mses < observed.cv))/(1+length(bootstrap.cv.mses)) [1] 0.
This is 1/1001, which means that the amount by which the additive model beats the power law in real life is bigger than on all the 1000 simulations. In fact, the smallest value in bootstrap.cv.mses is − 5. 5 × 10 −^5 ; the observed value is almost 200 times as large. We can conclude with high confidence that the additive model does in fact generalize better than the power law. (This indeed took eight and a half minutes to run, long enough to work on something small.)