


















Study with the several resources on Docsity
Earn points by helping other students or get them with a premium plan
Prepare for your exams
Study with the several resources on Docsity
Earn points to download
Earn points by helping other students or get them with a premium plan
Data Analysis The Bootstrap, Lecture Slide - Engineering - Prof. Cosma Shalizi, Advanced Data Analysis, Bootstrap, Stochastic Models, Uncertainty, Sampling Distributions, Bootstrap Principle, Bias Correction, Bootstrap Confidence Intervals, percentile method, Double bootstrap, Pareto's Law of Wealth Inequality, Non-parametric Bootstrapping, statistical error, jack-knife, Bootstrapping Regression Models, Re-sampling Residuals, Re-sampling Residuals with Heteroskedasticity
Typology: Slides
1 / 26
This page cannot be seen from the preview
Don't miss anything!



















1 Stochastic Models, Uncertainty, Sampling Distributions 2
2 The Bootstrap Principle 4 2.1 Variances and Standard Errors................... 6 2.2 Bias Correction............................ 6 2.3 Confidence Intervals......................... 6 2.3.1 Other Bootstrap Confidence Intervals........... 9 2.4 Hypothesis Testing.......................... 10 2.4.1 Double bootstrap hypothesis testing............ 10 2.5 Parametric Bootstrapping Example: Pareto’s Law of Wealth In- equality................................ 11
3 Non-parametric Bootstrapping 15 3.1 Parametric vs. Nonparametric Bootstrapping........... 17
4 Bootstrapping Regression Models 18 4.1 Re-sampling Points: Parametric Example............. 18 4.2 Re-sampling Points: Non-parametric Example........... 20 4.3 Re-sampling Residuals: Example.................. 23 4.4 Re-sampling Residuals with Heteroskedasticity.......... 24
5 Bootstrap with Dependent Data 24
6 Things Bootstrapping Does Poorly 24
7 Further Reading 25 Last time, I promise that if we only knew how to simulate from our models, we could use them to do all sorts of wonderful things — and, in particular, that we could use them to quantify uncertainty. The key technique here is what’s called bootstrapping, or the bootstrap, which is the subject of this set of notes.
Statistics is the branch of applied mathematics which studies ways of drawing inferences from limited and imperfect data. We want to know how a neuron in a rat’s brain responds when one of its whiskers gets tweaked, or how many rats live in Pittsburgh, or how high the water will get under the 16mathrmth^ Street bridge during May, or the typical course of daily temperatures in the city over the year, or the relationship between the number of birds of prey in Schenley Park in the spring and the number of rats the previous fall. We have some data on all of these things. But we know that our data is incomplete, and experience tells us that repeating our experiments or observations, even taking great care to replicate the conditions, gives more or less different answers every time. It is foolish to treat any inference from the data in hand as certain. If all data sources were totally capricious, there’d be nothing to do be- yond piously qualifying every conclusion with “but we could be wrong about this”. A mathematical science of statistics is possible because while repeating an experiment gives different results, some kinds of results are more common than others; their relative frequencies are reasonably stable. We thus model the data-generating mechanism through probability distributions and stochastic processes. When and why we can use stochastic models are very deep questions, but ones for another time. If we can use them in our problem, quantities like the ones I mentioned above are represented as functions of the stochastic model, i.e., of the underlying probability distribution. Since a function of a function is a “functional”, and these quantities are functions of the true probability dis- tribution function, we’ll call these functionals or statistical functionals^1. Functionals could be single numbers (like the total rat population), or vectors, or even whole curves (like the expected time-course of temperature over the year, or the regression of hawks now on rats earlier). Statistical inference becomes estimating those functionals, or testing hypotheses about them. These estimates and other inferences are functions of the data values, which means that they inherit variability from the underlying stochastic process. If we “re-ran the tape” (as the late, great Stephen Jay Gould used to say), we would get different data, with a certain characteristic distribution, and applying a fixed procedure would yield different inferences, again with a certain distribu- tion. Statisticians want to use this distribution to quantify the uncertainty of the inferences. For instance, the standard error is an answer to the question “By how much would our estimate of this functional vary, typically, from one replication of the experiment to another?” (It presumes a particular mean- ing for “typically vary”, as root mean square deviation around the mean.) A confidence region on a parameter, likewise, is the answer to “What are all the values of the parameter which could have produced this data with at least some
(^1) Most writers in theoretical statistics just call them “parameters” in a generalized sense, but I will try to restrict that word to actual parameters specifying statistical models, to minimize confusion. I may slip up.
This yoked a very broad and abstract theory of inference to very narrow and concrete practical formulas, an uneasy combination often preserved in basic statistics classes. The arrival of (comparatively) cheap and fast computers made it feasible for scientists and statisticians to record lots of data and to fit models to it, so they did. Sometimes the models were conventional ones, including the special-case assumptions, which often enough turned out to be detectably, and consequen- tially, wrong. At other times, scientists wanted more complicated or flexible models, some of which had been proposed long before, but now moved from being theoretical curiosities to stuff that could run overnight^3. In principle, asymptotics might handle either kind of problem, but convergence to the limit could be unacceptably slow, especially for more complex models. By the 1970s, then, statistics faced the problem of quantifying the uncer- tainty of inferences without using either implausibly-helpful assumptions or asymptotics; all of the solutions turned out to demand even more computa- tion. Here we will examine what may be the most successful solution, Bradley Efron’s proposal to combine estimation with simulation, which he gave the less- that-clear but persistent name of “the bootstrap” (Efron, 1979).
Remember that the key to dealing with uncertainty in parameters and func- tionals is the sampling distribution of estimators. Knowing what distribution we’d get for our estimates on repeating the experiment would give us things like standard errors. Efron’s insight was that we can simulate replication. After all, we have already fitted a model to the data, which is a guess at the mechanism which generated the data. Running that mechanism generates simulated data which, by hypothesis, has the same distribution as the real data. Feeding the simulated data through our estimator gives us one draw from the sampling dis- tribution; repeating this many times yields the sampling distribution. Since we are using the model to give us its own uncertainty, Efron called this “bootstrap- ping”; unlike the Baron Munchhausen’s plan for getting himself out of a swamp by pulling himself out by his bootstraps, it works. Figure 1 sketches the over-all process: fit a model to data, use the model to calculate the functional, then get the sampling distribution by generating new, synthetic data from the model and repeating the estimation on the simulation output. To fix notation, we’ll say that the original data is x. (In general this is a whole data frame, not a single number.) Our parameter estimate from the data is θˆ. Surrogate data sets simulated from the fitted model will be X˜ 1 , X˜ 2 ,... X˜B. The corresponding re-estimates of the parameters on the surrogate data are θ˜ 1 , θ˜ 2 ,... θ˜B. The functional of interest is estimated by the statistic T , with sample value ˆt = T (x), and values of the surrogates of ˜t 1 = T ( X˜ 1 ), ˜t 2 = T ( X˜ 2 ),
(^3) Kernel regression, kernel density estimation, and nearest neighbors prediction were all proposed in the 1950s, but didn’t begin to be widely used until the 1970s, or even the 1980s.
. -0.
-0.
q0.01 = -0.
. -0.
-0. -0.
q0.01 = -0.
Figure 1: Schematic for model-based bootstrapping: simulated values are gen- erated from the fitted model, then treated like the original data, yielding a new estimate of the functional of interest, here called q 0. 01 (as a hint towards this week’s homework).
rboot <- function(B, statistic, simulator, ...) { tboots <- replicate(B, statistic(simulator(...))) return(tboots) }
bootstrap.se <- function(simulator, statistic, B, ...) { tboots <- rboot(B, statistic, simulator, ...) se <- sd(tboots) return(se) }
Code Example 1: Sketch of code for calculating bootstrap standard errors. The function rboot generates B bootstrap samples (using the simulator func- tion) and calculates the statistic g on them (using statistic). simulator needs to be a function which returns a surrogate data set in a form suitable for statistic. Any arguments needed by simulator are passed to it through the ... meta-argument. Because every use of bootstrapping is going to need to do this, it makes sense to break it out as a separate function, rather than writing the same code many times (with many chances of getting it wrong). bootstrap.se just calls rboot and takes a standard deviation. (Again, ... is used to pass along arguments.)
bootstrap.bias <- function(simulator, statistic, B, t.hat, ...) { tboots <- rboot(B, statistic, simulator, ...) bias <- mean(tboots) - t.hat return(bias) }
Code Example 2: Sketch of code for bootstrap bias correction. Arguments are as in Code Example 1, except that t.hat is the estimate on the original data.
bootstrap.ci.basic <- function(simulator, statistic, B, t.hat, alpha, ...) { tboots <- rboot(B,statistic, simulator, ...) ci.lower <- 2t.hat - quantile(tboots,1-alpha/2) ci.upper <- 2t.hat - quantile(tboots,alpha/2) return(list(ci.lower=ci.lower,ci.upper=ci.upper)) }
Code Example 3: Sketch of code for calculating the basic bootstrap confidence interval. See Code Examples 2 and 1.
confidence level is 1 − α, then we want
Pr (t 0 ∈ C) = 1 − α (4)
no matter what the true value of t 0. When we calculate a confidence interval, our inability to deal with distributions exactly means that the true confidence level, or coverage of the interval, is not quite the desired confidence level 1 − α; the closer it is, the better the approximation, and the more accurate the confidence interval. Call the upper and lower limits of the confidence interval are Cu and Cl. Moreover, let’s say that we want this to be symmetric, with probability α/2 on either side of the confidence interval. (This is why we called the confidence level 1 − α, instead of α.) Then, looking at the lower limit,
α/ 2 = Pr (Cl ≥ t 0 ) (5) = Pr
Cl − ˆt ≥ t 0 − ˆt
= Pr
tˆ − Cl ≤ ˆt − t 0
Likewise, at the upper end,
α/2 = Pr
ˆt − Cu ≥ tˆ − t 0
The point of these manipulations is that bootstrapping gives us the distribution of ˜t − ˆt, which is approximately the same as the distribution of ˆt − t 0. Knowing that distribution, and ˆt, we can solve for the confidence limits Cl and Cu:
Cl = ˆt − (Q˜t(1 − α/2) − ˆt) (9) Cu = ˆt − (Q˜t(α/2) − ˆt) (10)
where Qt˜ is the quantile function of ˜t. This is the basic bootstrap confidence interval, or the pivotal CI. It is simple and reasonably accurate, and makes a very good default choice for how to find a confidence interval.
boot.pvalue <- function(test,simulator,B,testhat, ...) { testboot <- rboot(B=B, statistic=test, simulator=simulator, ...) p <- (sum(test >= testhat)+1)/(B+1) return(p) }
Code Example 4: Bootstrap p-value calculation. testhat should be the value of the test statistic on the actual data. test is a function which takes in a data set and calculates the test statistic, under the presumption that large values indicate departure from the null hypothesis. Note the +1 in the numerator and denominator of the p-value — it would be more straightforward to leave them off, but this is a little more stable when B is comparatively small. (Also, it keeps us from ever reporting a p-value of exactly 0.)
For hypothesis tests, we may want to calculate two sets of sampling distribu- tions: the distribution of the test statistic under the null tells us about the size of the test and significance levels, and the distribution under the alternative tells about power and realized power. We can find either with bootstrapping, by simulating from either the null or the alternative. In such cases, the statistic of interest, which I’ve been calling T , is the test statistic. Code Example 4 illustrates how to find a p-value by simulating under the null hypothesis. The same procedure would work to calculate power, only we’d need to simulate from the alternative hypothesis, and testhat would be set to the critical value of T separating acceptance from rejection, not the observed value.
2.4.1 Double bootstrap hypothesis testing
When the hypothesis we are testing involves estimated parameters, we may need to correct for this. Suppose, for instance, that we are doing a goodness-of-fit test. If we estimate our parameters on the data set, we adjust our distribution so that it matches the data. It is thus not surprising if it seems to fit the data well! (Essentially, it’s the problem of evaluating performance by looking at in-sample fit, which is more or less where we began the course.) Some test statistics have distributions which are not affected by estimating parameters, at least not asymptotically. In other cases, one can analytically come up with correction terms. When these routes are blocked, one uses a double bootstrap, where a second level of bootstrapping checks how much estimation improves the apparent fit of the model. This is perhaps most easily explained in pseudo-code (Code Example 5).
doubleboot.pvalue <- function(test,simulator,B1,B2, estimator, thetahat, testhat, ...) { for (i in 1:B1) { xboot <- simulator(theta=thetahat, ...) thetaboot <- estimator(xboot) testboot[i] <- test(xboot) pboot[i] <- boot.pvalue(test,simulator,B2,testhat=testboot[i], theta=thetaboot, ...) } p <- (sum(testboot >= testhat)+1)/(B1+1) p.adj <- (sum(pboot <= p)+1)/(B1+1) }
Code Example 5: Code sketch for “double bootstrap” significance testing. The inner or second bootstrap is used to calculate the distribution of nominal bootstrap p-values. For this to work, we need to draw our second-level bootstrap samples from θ˜, the bootstrap re-estimate, not from θˆ, the data estimate. The code presumes the simulator function takes a theta argument allowing this.
The Pareto distribution^4 , or power-law distribution, is a popular model for data with “heavy tails”, i.e. where the probability density f (x) goes to zero only very slowly as x → ∞. (It was briefly mentioned last time in the notes.) The probability density is
f (x) =
θ − 1 x 0
x x 0
)−θ (15)
where x 0 is the minimum scale of the distribution, and θ is the scaling expo- nent. (Exercise: show that x 0 is the mode of the distribution.) The Pareto is highly right-skewed, with the mean being much larger than the median. If we know x 0 , one can show (Exercise: do so!) that the maximum likeli- hood estimator of the exponent θ is
θˆ = 1 + ∑n n i=1 log^
xi x 0
and that this is consistent^5 , and efficient. Picking x 0 is a harder problem (see Clauset et al. 2009) — for the present purposes, pretend that the Oracle tells us. The file pareto.R, on the class website, contains a number of functions related to the Pareto distribution, including a function pareto.fit for estimating it. (There’s an example of its use below.)
(^4) Named after Vilfredo Pareto, the highly influential late-19th/early-20th century economist, political scientist, and proto-Fascist. (^5) Because the sample mean of log X converges, under the law of large numbers
1e+09 2e+09 5e+09 1e+10 2e+10 5e+
Net worth (dollars)
Fraction of individuals at or above that net worth
plot.survival.loglog(wealth,xlab="Net worth (dollars)", ylab="Fraction of individuals at or above that net worth") rug(wealth,side=1,col="grey") curve((302/400)ppareto(x,threshold=9e8,exponent=2.34,lower.tail=FALSE), add=TRUE,lty=2,from=9e8,to=2max(wealth))
Figure 2: Upper cumulative distribution function (or “survival function”) of net worth for the 400 richest individuals in the US (2000 data). The solid line shows the fraction of the 400 individuals whose net worth W equaled or exceeded a given value w, Pr (W ≥ w). (Note the logarithmic scale for both axes.) The dashed line is a maximum-likelihood estimate of the Pareto distribution, taking x 0 = $9 × 108. (This threshold was picked using the method of Clauset et al. 2009.) Since there are 302 individuals at or above the threshold, the cumulative distribution function of the Pareto has to be reduced by a factor of (302/400).
pareto.ci <- function(B,exponent,x0,n,alpha) { tboot <- rboot.pareto(B,exponent,x0,n) ci.lower <- 2exponent - quantile(tboot,1-alpha/2) ci.upper <- 2exponent - quantile(tboot,alpha/2) return(list(ci.lower=ci.lower, ci.upper=ci.upper)) }
Code Example 7: Parametric bootstrap confidence interval for the Pareto scaling exponent.
didn’t require asymptotic assumptions. Asymptotically, the bias is known to go to zero; at this size, bootstrapping gives a bias of 3 × 10 −^3 , which is effectively negligible. We can also get the confidence interval (Code Example 7). Using, again, 104 bootstrap replications, the 95% CI is (2. 16 , 2 .47). In theory, the confidence interval could be calculated exactly, but it involves the inverse gamma distribu- tion (Arnold, 1983), and it is quite literally faster to write and do the bootstrap than go to the library and look it up. A more challenging problem is goodness-of-fit. Here we’ll use the Kolmogorov- Smirnov statistic.^8 Code Example 8 calculates the p-value. With ten thousand bootstrap replications,
ks.pvalue.pareto(1e4,wealth,2.34,9e8) [1] 0.
Ten thousand replicates is enough that we should be able to accurately esti- mate probabilities of around 0√ .01 (since the binomial standard error will be
(^4) ≈ 9. 9 × 10 − (^4) ; if it weren’t, we might want to increase B. Simply plugging in to the standard formulas, and thereby ignoring the ef- fects of estimating the scaling exponent, gives a p-value of 0.16, which is not outstanding but not awful either. Properly accounting for the flexibility of the model, however, the discrepancy between what it predicts and what the data shows is so large that it would take an awfully big (one-a-hundred) coincidence to produce it just by chance. We have, therefore, detected that the Pareto distribution makes systematic errors for this data, but we don’t know much about what they are. We will look in later lectures at techniques which can begin to tell us something about how it fails.
so this gives us the asymptotic variance. (^8) The pareto.R file contains a function, pareto.tail.ks.test, which does a goodness-of- fit test for fitting a power-law to the tail of the distribution. That differs somewhat from what follows, because it takes into account the extra uncertainty which comes from having to estimate x 0. Here, I am pretending that an Oracle told us x 0 = 9 × 108.
-0.
-0.
q0.01 = -0.
-0. -0. -0.
q0.01 = -0.
Figure 3: Schematic for non-parametric bootstrapping. New data is simulated by re-sampling from the original data (with replacement), and parameters are calculated either directly from the empirical distribution, or by applying a model to this surrogate data.
bution, without the mediation of a parametric model. Efron’s non-parametric bootstrap treats the original data set as a complete population and draws a new, simulated sample from it, picking each observation with equal probability (allowing repeated values) and then re-running the estimation (Figure 3). In fact, this is usually what people mean when they talk about “the bootstrap” without any modifier. Everything we did with parametric bootstrapping can also be done with non- parametric bootstrapping — the only thing that’s changing is the distribution the surrogate data is coming from. NP-bootstrap should remind you of k-fold cross-validation. The analog of leave-one-out CV is a procedure called the jack-knife, where we repeat the estimate n times on n − 1 of the data points, holding each one out in turn. It’s historically important (it dates back to the 1940s), but generally doesn’t work as well as the non-parametric bootstrap. An important variant is the smoothed bootstrap, where we re-sample the data points and then perturb each by a small amount of noise, generally Gaussian. This corresponds exactly to sampling from a kernel density estimate
resample <- function(x) { sample(x,size=length(x),replace=TRUE) }
resamp.pareto <- function(B,data,x0) { replicate(B,pareto.fit(resample(data),threshold=x0)$exponent) }
resamp.pareto.CI <- function(B,data,alpha,x0) { thetahat <- pareto.fit(data,threshold=x0)$exponent thetaboot <- resamp.pareto(B,data,x0) ci.lower <- thetahat - (quantile(thetaboot,1-alpha/2) - thetahat) ci.upper <- thetahat - (quantile(thetaboot,alpha/2) - thetahat) return(list(ci.lower=ci.lower,ci.upper=ci.upper)) }
Code Example 9: Non-parametric bootstrap confidence intervals for the Pareto scaling exponent.
(see last lecture). Code Example 9 shows how to use re-sampling to get a 95% confidence interval for the Pareto exponent^10. With B = 10^4 , it gives the 95% confidence interval for the scaling exponent as (2. 18 , 2 .48). The fact that this is very close to the interval we got from parametric bootstrapping should actually reassure us about its validity.
When we have a properly specified model, simulating from the model gives more accurate results (at the same n) than does re-sampling the empirical distribution — parametric estimates of the distribution converge faster than the empirical distribution does. If on the other hand the parametric model is mis-specified, then it is rapidly converging to the wrong distribution. This is of course just another bias-variance trade-off. Since I am suspicious of most parametric modeling assumptions, I prefer re- sampling, when I can figure out how to do it, or at least until I have convinced myself that a parametric model is very good approximation to reality.
(^10) Even if the Pareto model is wrong, the estimator of the exponent will converge on the value which gives, in a certain sense, the best approximation to the true distribution from among all power laws. Econometricians call such parameter values thepseudo-true; we are getting a confidence interval for the pseudo-truth. In this case, the pseudo-true scaling exponent can still be a useful way of summarizing how heavy tailed the income distribution is, despite the fact that the power law makes systematic errors.
geyser.lm.cis <- function(B,alpha) { tboot <- replicate(B,est.waiting.on.duration(resample.geyser())) low.quantiles <- apply(tboot,1,quantile,probs=alpha/2) high.quantiles <- apply(tboot,1,quantile,probs=1-alpha/2) low.cis <- 2coefficients(geyser.lm) - high.quantiles high.cis <- 2coefficients(geyser.lm) - low.quantiles cis <- rbind(low.cis,high.cis) return(cis) }
Code Example 10: Bootstrapped confidence intervals for the linear model of Old Faithful, based on re-sampling data points. Relies on functions defined in the text.
We can check this by running summary on its output, and seeing that it gives about the same quartiles and mean for both variables as summary(geyser)^11. Next, we define the estimator:
est.waiting.on.duration <- function(data) { fit <- lm(waiting ~ duration, data=data) return(coefficients(fit)) }
We can check that this function works by seeing that est.waiting.on.duration(geyser) gives the same results as coefficients(geyser.lm). Putting the pieces together according to the basic confidence interval recipe (Code Example 10), we get
signif(geyser.lm.cis(B=1e4,alpha=0.05),3) (Intercept) duration low.cis 96.5 -8. high.cis 102.0 -6.
Notice that we do not have to assume homoskedastic Gaussian noise — for- tunately, because, as you saw in the homework, that’s a very bad assumption here^12.
(^11) The minimum and maximum won’t match up well — why not? (^12) We have calculated 95% confidence intervals for the intercept β 0 and the slope β 1 sepa- rately. These intervals cover their coefficients all but 5% of the time. Taken together, they give us a rectangle in (β 0 , β 1 ) space, but the coverage probability of this rectangle could be anywhere from 95% all the way down to 90%. To get a confidence region which simultane- ously covers both coefficients 95% of the time, we have two big options. One is to stick to a box-shaped region and just increase the confidence level on each coordinate (to 97.5%). The other is to define some suitable metric of how far apart coefficient vectors are (e.g., ordinary Euclidean distance), find the 95% percentile of the distribution of this metric, and trace the appropriate contour around βˆ 0 , βˆ 1.
Nothing in the logic of re-sampling data points for regression requires us to use a parametric model. Here we’ll provide 95% confidence bounds for the kernel smoothing of the geyser data. We use the same simulator, but start with a different regression curve, and need a different estimator.
library(np) npr.waiting.on.duration <- function(data) { bw <- npregbw(waiting ~ duration, data=data) fit <- npreg(bw) return(fit) }
Now we construct pointwise 95% confidence bands for the regression curve. For this end, we don’t really need to keep around the whole kernel regression object — we’ll just use its predicted values on a uniform grid of points, extending slightly beyond the range of the data (Code Example 11). Observe that this will go through bandwidth selection again for each bootstrap sample. This is slow, but it is the most secure way of getting good confidence bands. Applying the bandwidth we found on the data to each re-sample would be faster, but would introduce an extra level of approximation, since we wouldn’t be treating each simulation run the same as the original data. Figure 4 shows the curve fit to the data, the 95% confidence limits, and (faintly) all of the bootstrapped curves. Doing the 800 bootstrap replicates took 14 minutes on my laptop^13 ; if speed had been really important I could have told npreg to be less aggressive about optimizing the bandwidth (and maybe shut down other programs).
(^13) Specifically, I ran system.time(geyser.npr.cis <- npr.cis(B=800,alpha=0.05)), which not only did the calculations and stored them in geyser.npr.cis, but told me how much time it took R to do them.