Chapter 11 The Bootstrap, Study notes of Statistics

The bootstrap is a method for estimating the variance of an estimator and for finding approximate confidence intervals for parameters.

Typology: Study notes

2021/2022

Uploaded on 07/05/2022

allan.dev
Chapter 11

The Bootstrap

This chapter covers the following topics:

  • What is the Bootstrap?
  • Why Does it Work?
  • Examples of the Bootstrap.

11.1 Introduction

Most of this volume is devoted to parametric inference. In this chapter we depart from

the parametric framework and discuss a nonparametric technique called the bootstrap.

The bootstrap is a method for estimating the variance of an estimator and for finding

approximate confidence intervals for parameters. Although the method is nonparametric,

it can be used for inference about parameters in parametric and nonparametric models

which is why we include it in this volume.

11.2 A More General Notion of “Parameter”

We begin by broadening what we mean by a parameter. Let us begin with a few examples.

  1. Let $X_1, \ldots, X_n \sim P$ where $P \in (P_\theta : \theta \in \Theta)$. Let $\hat\theta_n$ be the maximum likelihood estimator of $\theta$. We would like to estimate the variance of $\hat\theta_n$ and we want a $1-\alpha$ confidence interval for $\theta$.

  2. Let $X_1, \ldots, X_n \sim P$ and let $\theta = T(P)$ denote the mean of $P$. Hence, $\theta = E[X_i] = \int x \, dP(x)$. Let $\hat\theta_n = \frac{1}{n}\sum_{i=1}^n X_i$. Again, we would like to estimate the variance of $\hat\theta_n$ and we want a $1-\alpha$ confidence interval for $\theta$.

  3. Let $X_1, \ldots, X_n \sim P$ and let $\theta = T(P)$ denote the median of $P$. Hence, $P(X_i \le \theta) = P(X_i \ge \theta) = 1/2$. Let $\hat\theta_n$ denote the sample median. Yet again, we would like to estimate the variance of $\hat\theta_n$ and we want a $1-\alpha$ confidence interval for $\theta$.

In the first example, $\theta$ denotes the parameter of a parametric model. In the second and third examples, we are in a nonparametric situation; in these cases we think of a “parameter” as a function of the distribution $P$ and we write $\theta = T(P)$. The bootstrap can be used in both the parametric and nonparametric settings.

Let $P_n$ be the empirical distribution. This is the discrete distribution that puts mass $1/n$ at each datapoint $X_i$. Hence,

$$P_n(A) = \frac{1}{n}\sum_{i=1}^n I(X_i \in A). \tag{11.1}$$

In the nonparametric case, we will estimate the parameter $\theta = T(P)$ by $\hat\theta_n = T(P_n)$ which is called the plug-in estimator. For example, when $\theta = T(P) = \int x \, dP(x)$ is the mean, the plug-in estimator is

$$\hat\theta_n = T(P_n) = \int x \, dP_n(x) = \frac{1}{n}\sum_{i=1}^n X_i$$

which is the sample mean.

A sample of size $n$ drawn from $P_n$ is called a bootstrap sample, denoted by

$$X_1^*, \ldots, X_n^* \sim P_n.$$

Bootstrap samples play an important role in what follows. Note that drawing an iid sample $X_1^*, \ldots, X_n^*$ from $P_n$ is equivalent to drawing $n$ observations, with replacement, from the original data $\{X_1, \ldots, X_n\}$. Thus, bootstrap sampling is often described as “resampling the data.” This can be a bit confusing and we think it is much clearer to think of a bootstrap sample $X_1^*, \ldots, X_n^*$ as $n$ draws from the empirical distribution $P_n$.
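The equivalence between drawing from $P_n$ and resampling with replacement can be sketched in a few lines of Python. This is only an illustration: the function name `bootstrap_sample` and the five data values are made up for the example, and only the standard library is used.

```python
import random

def bootstrap_sample(data, rng):
    """Draw n iid observations from the empirical distribution P_n,
    i.e. sample n values from `data` with replacement."""
    return rng.choices(data, k=len(data))

rng = random.Random(0)              # fixed seed, for reproducibility only
data = [2.1, 3.4, 1.9, 5.0, 4.2]    # hypothetical observations
star = bootstrap_sample(data, rng)
print(star)   # typically some values repeat and others are left out
```

Each call puts probability $1/n$ on every original data point for every draw, which is exactly what sampling from $P_n$ means.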

11.3 The Bootstrap

Now we give the bootstrap algorithms for estimating the variance of $\hat\theta_n$ and for constructing confidence intervals. The explanation of why (and when) the bootstrap gives valid estimates is deferred until Section 11.5. Let $\hat\theta_n = g(X_1, \ldots, X_n)$ denote some estimator.

Figure 11.1: 50 points drawn from the model $Y_i = -1 + 2X_i - X_i^2 + \epsilon_i$ where $X_i \sim \mathrm{Uniform}(0, 2)$ and $\epsilon_i \sim N(0, .2^2)$. In this case, the maximum of the polynomial occurs at $\theta = 1$. The true and estimated curves are shown in the figure. At the bottom of the plot we show the 95 percent bootstrap confidence interval based on $B = 1{,}000$.

Theorem 139. Under appropriate regularity conditions,

$$P(\theta \in C_n) = 1 - \alpha - O\left(\frac{1}{\sqrt{n}}\right)$$

as $n \to \infty$.

11.4 Examples

Example 140. Consider the polynomial regression model $Y = g(X) + \epsilon$ where $X, Y \in \mathbb{R}$ and $g(x) = \beta_0 + \beta_1 x + \beta_2 x^2$. Given data $(X_1, Y_1), \ldots, (X_n, Y_n)$ we can estimate $\beta = (\beta_0, \beta_1, \beta_2)$ with the least squares estimator $\hat\beta$. Suppose that $g(x)$ is concave and we are interested in the location at which $g(x)$ is maximized. Setting $g'(x) = \beta_1 + 2\beta_2 x = 0$ shows that the maximum occurs at $x = \theta$ where $\theta = -(1/2)\beta_1/\beta_2$. A point estimate of $\theta$ is $\hat\theta = -(1/2)\hat\beta_1/\hat\beta_2$. Now we use the bootstrap to get a confidence interval for $\theta$. Figure 11.1 shows 50 points drawn from the above model with $\beta_0 = -1$, $\beta_1 = 2$, $\beta_2 = -1$. The $X_i$'s were sampled uniformly on $[0, 2]$ and we took $\epsilon_i \sim N(0, .2^2)$. In this case, $\theta = 1$. The true and estimated curves are shown in the figure. At the bottom of the plot we show the 95 percent bootstrap confidence interval based on $B = 1{,}000$.
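A rough simulation of this example is sketched below. Several choices here are assumptions rather than something the text specifies: the pairs $(X_i, Y_i)$ are resampled jointly, the interval is a simple percentile interval (this excerpt does not show which interval was used for Figure 11.1), and the helper names `fit_quadratic` and `argmax_location` are made up for the illustration.

```python
import random

def fit_quadratic(xs, ys):
    """Least squares fit of y = b0 + b1*x + b2*x^2 via the normal equations."""
    S = [sum(x ** k for x in xs) for k in range(5)]              # sums of x^k
    R = [sum((x ** k) * y for x, y in zip(xs, ys)) for k in range(3)]
    # Augmented 3x4 system (X^T X) b = X^T y.
    A = [[S[i + j] for j in range(3)] + [R[i]] for i in range(3)]
    for col in range(3):                     # elimination with partial pivoting
        piv = max(range(col, 3), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(col + 1, 3):
            f = A[r][col] / A[col][col]
            for c in range(col, 4):
                A[r][c] -= f * A[col][c]
    b = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):                      # back substitution
        b[i] = (A[i][3] - sum(A[i][j] * b[j] for j in range(i + 1, 3))) / A[i][i]
    return b

def argmax_location(xs, ys):
    """Point estimate theta_hat = -(1/2) * b1_hat / b2_hat."""
    _, b1, b2 = fit_quadratic(xs, ys)
    return -0.5 * b1 / b2

rng = random.Random(1)
n, B = 50, 1000
xs = [rng.uniform(0, 2) for _ in range(n)]
ys = [-1 + 2 * x - x * x + rng.gauss(0, 0.2) for x in xs]
theta_hat = argmax_location(xs, ys)

boot = []
for _ in range(B):
    idx = [rng.randrange(n) for _ in range(n)]   # resample (X_i, Y_i) pairs
    boot.append(argmax_location([xs[i] for i in idx], [ys[i] for i in idx]))
boot.sort()
lo, hi = boot[int(0.025 * B)], boot[int(0.975 * B) - 1]   # percentile interval
print(theta_hat, (lo, hi))
```

The printed numbers depend on the seed and will not reproduce the figure exactly; the point is only the structure: fit, resample, refit, and read off quantiles.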


Example 141. Let $(X_1, Y_1, Z_1), \ldots, (X_n, Y_n, Z_n) \sim P$ where $X_i \in \mathbb{R}$, $Y_i \in \mathbb{R}$, $Z_i \in \mathbb{R}^d$. The partial correlation of $X$ and $Y$ given $Z$ is

$$\theta = -\frac{\Omega_{12}}{\sqrt{\Omega_{11}\Omega_{22}}}$$

where $\Omega = \Sigma^{-1}$ and $\Sigma$ is the covariance matrix of $W = (X, Y, Z)^T$. The partial correlation measures the linear dependence between $X$ and $Y$ after removing the effect of $Z$. For illustration, suppose we generate the data as follows: we take $Z \sim N(0, 1)$, $X = 10Z + \epsilon$ and $Y = 10Z + \delta$ where $\epsilon, \delta \sim N(0, 1)$. The correlation between $X$ and $Y$ is very large. But the partial correlation is 0. We generated $n = 100$ data points from this model. The sample correlation was 0.99. However, the estimated partial correlation was -0.16 which is much closer to 0. The 95 percent bootstrap confidence interval is $[-.33, .02]$ which includes the true value, namely, 0.
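The following sketch simulates data of this kind and bootstraps the partial correlation. It assumes scalar $Z$, so that $-\Omega_{12}/\sqrt{\Omega_{11}\Omega_{22}}$ can be written with the cofactors of the $3 \times 3$ covariance matrix (the determinant cancels). The percentile interval and the helper names `cov` and `partial_corr` are choices made for the illustration, and the output depends on the seed, so it will not match the numbers quoted above.

```python
import math
import random

def cov(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((u - ma) * (v - mb) for u, v in zip(a, b)) / len(a)

def partial_corr(x, y, z):
    """-Omega_12 / sqrt(Omega_11 * Omega_22) for scalar Z, via cofactors of Sigma."""
    sxx, syy, szz = cov(x, x), cov(y, y), cov(z, z)
    sxy, sxz, syz = cov(x, y), cov(x, z), cov(y, z)
    num = sxy * szz - sxz * syz
    den = math.sqrt((sxx * szz - sxz ** 2) * (syy * szz - syz ** 2))
    return num / den

rng = random.Random(3)
n, B = 100, 1000
z = [rng.gauss(0, 1) for _ in range(n)]
x = [10 * zi + rng.gauss(0, 1) for zi in z]
y = [10 * zi + rng.gauss(0, 1) for zi in z]

corr_xy = cov(x, y) / math.sqrt(cov(x, x) * cov(y, y))
pc_hat = partial_corr(x, y, z)

boot = []
for _ in range(B):
    idx = [rng.randrange(n) for _ in range(n)]   # resample triples (X_i, Y_i, Z_i)
    boot.append(partial_corr([x[i] for i in idx],
                             [y[i] for i in idx],
                             [z[i] for i in idx]))
boot.sort()
lo, hi = boot[int(0.025 * B)], boot[int(0.975 * B) - 1]   # percentile interval
print(corr_xy, pc_hat, (lo, hi))
```

Resampling whole triples keeps the dependence between $X$, $Y$ and $Z$ intact inside each bootstrap sample.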

11.5 Why Does the Bootstrap Work?

To explain why the bootstrap works, let us begin with a heuristic. Let

$$F_n(t) = P(\sqrt{n}(\hat\theta_n - \theta) \le t)$$

and let

$$\hat F_n(t) = P(\sqrt{n}(\hat\theta_n^* - \hat\theta_n) \le t \mid X_1, \ldots, X_n)$$

be the bootstrap approximation to $F_n$. We do not know $F_n$ but we do know $\hat F_n$ in the sense that it depends only on the observed data. Usually, $F_n$ will be close to some limiting distribution $L$. Similarly, $\hat F_n$ will be close to some limiting distribution $\hat L$. Moreover, $L$ and $\hat L$ will be close which implies that $F_n$ and $\hat F_n$ are close. In practice, we usually approximate $\hat F_n$ by its Monte Carlo version

$$\bar F(t) = \frac{1}{B}\sum_{j=1}^B I(\sqrt{n}(\hat\theta_j^* - \hat\theta_n) \le t).$$

But $\bar F$ is close to $\hat F_n$ as long as we take $B$ large. See Figure 11.2.

Now we will give more detail in a simple, special case. Suppose that $X_1, \ldots, X_n \sim P$ where $X_i$ has mean $\mu$ and variance $\sigma^2$. Suppose we want to construct a confidence interval for $\mu$. Let $\hat\mu_n = \frac{1}{n}\sum_{i=1}^n X_i$ and define

$$F_n(t) = P(\sqrt{n}(\hat\mu_n - \mu) \le t). \tag{11.3}$$

[Diagram: $F_n$, $\hat F_n$, $L$, $\hat L$ and $\bar F$, joined by links labeled $O(1/\sqrt{n})$, $O_P(1/\sqrt{n})$, $O_P(1/\sqrt{n})$ and $O(1/\sqrt{B})$.]

Figure 11.2: The distribution $F_n(t) = P(\sqrt{n}(\hat\theta_n - \theta) \le t)$ is close to some limit distribution $L$. Similarly, the bootstrap distribution $\hat F_n(t) = P(\sqrt{n}(\hat\theta_n^* - \hat\theta_n) \le t \mid X_1, \ldots, X_n)$ is close to some limit distribution $\hat L$. Since $\hat L$ and $L$ are close, it follows that $F_n$ and $\hat F_n$ are close. In practice, we approximate $\hat F_n$ with its Monte Carlo version $\bar F$ which we can make as close to $\hat F_n$ as we like by taking $B$ large.

To prove this result, let us recall the Berry-Esseen Theorem from Chapter 2. For convenience, we repeat the theorem here.

Theorem 143 (Berry-Esseen Theorem). Let $X_1, \ldots, X_n$ be i.i.d. with mean $\mu$ and variance $\sigma^2$. Let $\mu_3 = E[|X_i - \mu|^3] < \infty$. Let $\bar X_n = n^{-1}\sum_{i=1}^n X_i$ be the sample mean and let $\Phi$ be the cdf of a $N(0, 1)$ random variable. Let $Z_n = \frac{\sqrt{n}(\bar X_n - \mu)}{\sigma}$. Then

$$\sup_z \left| P(Z_n \le z) - \Phi(z) \right| \le \frac{33}{4}\frac{\mu_3}{\sigma^3\sqrt{n}}.$$
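As a numerical sanity check of a Berry-Esseen bound of this form, the sketch below computes the exact Kolmogorov distance between the standardized mean of Bernoulli($p$) variables and the standard Normal, and compares it with $\frac{33}{4}\mu_3/(\sigma^3\sqrt{n})$. The constant $33/4$ is the one appearing in the proof that follows; the Bernoulli choice, $n = 10{,}000$, $p = 0.3$ and the function names are assumptions made for the illustration.

```python
import math

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def kolmogorov_distance_bernoulli(n, p):
    """Exact sup_z |P(Z_n <= z) - Phi(z)| when X_i ~ Bernoulli(p)."""
    sd = math.sqrt(n * p * (1 - p))
    cdf, worst = 0.0, 0.0
    for k in range(n + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                   + k * math.log(p) + (n - k) * math.log(1 - p))
        z = (k - n * p) / sd                  # value of Z_n when the sum equals k
        phi = normal_cdf(z)
        worst = max(worst, abs(cdf - phi))    # just to the left of the jump
        cdf += math.exp(log_pmf)
        worst = max(worst, abs(cdf - phi))    # at the jump
    return worst

def berry_esseen_bound(n, p):
    sigma = math.sqrt(p * (1 - p))
    mu3 = p * (1 - p) * ((1 - p) ** 2 + p ** 2)   # E|X - p|^3 for Bernoulli(p)
    return (33 / 4) * mu3 / (sigma ** 3 * math.sqrt(n))

n, p = 10_000, 0.3
dist = kolmogorov_distance_bernoulli(n, p)
bound = berry_esseen_bound(n, p)
print(dist, bound)
```

Because the cdf of $Z_n$ is a step function and $\Phi$ is continuous and increasing, the supremum is attained at a jump point, which is why it suffices to check each jump from the left and at the jump itself.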

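Proof of the Bootstrap Theorem. Let $\Phi_\sigma(t)$ denote the cdf of a Normal with mean 0 and variance $\sigma^2$. Let $\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \hat\mu_n)^2$. Thus, $\hat\sigma^2 = \mathrm{Var}(\sqrt{n}(\hat\mu_n^* - \hat\mu_n) \mid X_1, \ldots, X_n)$. Now, by the triangle inequality,

$$\sup_t |\hat F_n(t) - F_n(t)| \le \sup_t |F_n(t) - \Phi_\sigma(t)| + \sup_t |\Phi_\sigma(t) - \Phi_{\hat\sigma}(t)| + \sup_t |\hat F_n(t) - \Phi_{\hat\sigma}(t)| = I + II + III.$$

Let $Z \sim N(0, 1)$. Then, $\sigma Z \sim N(0, \sigma^2)$ and from the Berry-Esseen theorem,

$$I = \sup_t |F_n(t) - \Phi_\sigma(t)| = \sup_t \left| P\left(\sqrt{n}(\hat\mu_n - \mu) \le t\right) - P(\sigma Z \le t) \right|$$

$$= \sup_t \left| P\left(\frac{\sqrt{n}(\hat\mu_n - \mu)}{\sigma} \le \frac{t}{\sigma}\right) - P\left(Z \le \frac{t}{\sigma}\right) \right| \le \frac{33}{4}\frac{\mu_3}{\sigma^3\sqrt{n}}.$$

Using the same argument on the third term, we have that

$$III = \sup_t |\hat F_n(t) - \Phi_{\hat\sigma}(t)| \le \frac{33}{4}\frac{\hat\mu_3}{\hat\sigma^3\sqrt{n}}$$

where $\hat\mu_3 = \frac{1}{n}\sum_{i=1}^n |X_i - \hat\mu_n|^3$ is the empirical third moment. By the strong law of large numbers, $\hat\mu_3$ converges almost surely to $\mu_3$. So, almost surely, for all large $n$, $\hat\mu_3 \le 2\mu_3$ and so $III \le \frac{33}{4}\frac{2\mu_3}{\hat\sigma^3\sqrt{n}}$. From the fact that $\hat\sigma - \sigma = O_P(\sqrt{1/n})$ it may be shown that $II = \sup_t |\Phi_\sigma(t) - \Phi_{\hat\sigma}(t)| = O_P(\sqrt{1/n})$. (This may be seen by Taylor expanding $\Phi_{\hat\sigma}(t)$ around $\sigma$.) This completes the proof. $\square$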

We have shown that $\sup_t |\hat F_n(t) - F_n(t)| = O_P\left(\frac{1}{\sqrt{n}}\right)$. From this, it may be shown that, for each $0 < \beta < 1$, $t_\beta - z_\beta = O_P\left(\frac{1}{\sqrt{n}}\right)$. From this, one can prove Theorem 139.

So far we have focused on the mean. Similar theorems may be proved for more general

parameters. The details are complex so we will not discuss them here. We give a little more

information in the appendix. For a thorough treatment, we refer the reader to Chapter 23

of van der Vaart (1998).

that the distribution of $X_i$ is sub-Gaussian, although this is stronger than needed. This means that $E(e^{t^T X}) \le e^{c\|t\|^2}$ for some $c > 0$.

Let $\mu = E[X_i] \in \mathbb{R}^d$. Here is a bootstrap algorithm for constructing a confidence set for $\mu$.

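High Dimensional Bootstrap

  1. Draw a bootstrap sample $X_1^*, \ldots, X_n^* \sim P_n$. Compute $\hat\mu_n^* = \frac{1}{n}\sum_{i=1}^n X_i^*$.

  2. Repeat the previous step, $B$ times, yielding estimators $\hat\mu_{n,1}^*, \ldots, \hat\mu_{n,B}^*$.

  3. Let $\hat F_n(t) = \frac{1}{B}\sum_{j=1}^B I\left(\sqrt{n}\,\|\hat\mu_{n,j}^* - \hat\mu_n\|_\infty \le t\right)$.

  4. Let $C_n = \left\{ a \in \mathbb{R}^d : \|a - \hat\mu_n\|_\infty \le \frac{t_\alpha}{\sqrt{n}} \right\}$ where $t_\alpha = \hat F_n^{-1}(1 - \alpha)$.

  5. Output $C_n$.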
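The five steps above can be sketched as follows. This is a minimal pure-Python illustration, not a tuned implementation: the function names, the default `B`, and the simulated Gaussian data are assumptions made for the example. The set $C_n$ is represented by its center $\hat\mu_n$ and its sup-norm radius $t_\alpha/\sqrt{n}$.

```python
import math
import random

def mean_vector(rows):
    n, d = len(rows), len(rows[0])
    return [sum(r[j] for r in rows) / n for j in range(d)]

def sup_norm_confidence_set(rows, alpha=0.05, B=300, seed=0):
    """Return (center, radius) with C_n = {a : ||a - center||_inf <= radius}."""
    rng = random.Random(seed)
    n = len(rows)
    center = mean_vector(rows)
    stats = []
    for _ in range(B):
        star = rng.choices(rows, k=n)                 # step 1: bootstrap sample
        m = mean_vector(star)                         # bootstrap mean
        # steps 2-3: sqrt(n) * sup-norm distance to the original mean
        stats.append(math.sqrt(n) * max(abs(a - b) for a, b in zip(m, center)))
    stats.sort()
    # step 4: t_alpha = smallest t with F_hat(t) >= 1 - alpha
    k = min(B, math.ceil((1 - alpha) * B))
    t_alpha = stats[k - 1]
    return center, t_alpha / math.sqrt(n)             # step 5

rng = random.Random(7)
n, d = 100, 10
rows = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]   # hypothetical data
center, radius = sup_norm_confidence_set(rows)
print(radius)
```

A point $a$ belongs to the set exactly when every coordinate of $a$ is within `radius` of the corresponding coordinate of `center`, so $C_n$ is a hypercube.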

Theorem 144 (Chernozhukov, Chetverikov and Kato, 2014). Suppose that $d = o(e^{n^{1/8}})$. Then

$$P(\mu \in C_n) \ge 1 - \alpha - \frac{c \log d}{n^{1/8}}$$

for some $c > 0$.

Under the stated conditions, the same result applies to higher-order moments. If $\theta = g(\mu)$ for some function $g$ then we can get a confidence set for $\theta$ by applying $g$ to $C_n$. We call this the projected confidence set. That is, if we define $A_n = \{g(\mu) : \mu \in C_n\}$ then it follows that

$$P(\theta \in A_n) \ge 1 - \alpha - \frac{c \log d}{n^{1/8}}.$$

Alternatively, we can apply the bootstrap to $\sqrt{n}(g(\hat\mu) - g(\mu))$. However, we do not automatically get the same coverage guarantee that the projected set has.

Example 145. Let us consider constructing a confidence set for a high-dimensional covariance matrix. Let $X_1, \ldots, X_n \in \mathbb{R}^k$ be a random sample and let $\Sigma = \mathrm{Var}(X)$ which is a $k \times k$ matrix. There are $d = O(k^2)$ parameters here. Let $\hat\Sigma = (1/n)\sum_{i=1}^n (X_i - \bar X_n)(X_i - \bar X_n)^T$. Also, let $\gamma = \mathrm{vec}(\Sigma)$ and $\hat\gamma = \mathrm{vec}(\hat\Sigma)$, where vec takes a matrix and converts it into a vector by stacking the columns. We can then apply the bootstrap algorithm above to $\sqrt{n}(\hat\gamma - \gamma)$ to get the bootstrap quantile $t_\alpha$. Let $\ell_n = \hat\gamma - t_\alpha/\sqrt{n}$ and $u_n = \hat\gamma + t_\alpha/\sqrt{n}$. We can then unstack $\ell_n$ and $u_n$ into $k \times k$ matrices $L_n$ and $U_n$. It then follows that

$$P(L_n \le \Sigma \le U_n) \ge 1 - \alpha - \frac{c \log d}{n^{1/8}}$$

where $A \le B$ means that $A_{jk} \le B_{jk}$ for all $(j, k)$.

11.8 Subsampling

11.9 Finite Sample Methods

11.9.1 The Permutation Test

In this section we discuss a nonparametric hypothesis testing method. The test is not based

on the bootstrap but we include it here because it is similar in spirit to the bootstrap. Let

$$X_1, \ldots, X_n \sim F, \qquad Y_1, \ldots, Y_m \sim G$$

be two independent samples and suppose we want to test the hypothesis

$$H_0 : F = G \quad \text{versus} \quad H_1 : F \ne G. \tag{11.7}$$

The permutation test gives an exact (nonasymptotic), nonparametric method for testing this hypothesis. Let $Z = (X, Y)$ where $X = (X_1, \ldots, X_n)^T$ and $Y = (Y_1, \ldots, Y_m)^T$. Define a vector $W$ of length $N = n + m$ that indicates which group $Z_i$ is from. Thus, $W_i = 1$ if $i \le n$ and $W_i = 2$ if $i > n$. The data look like this:

$$\begin{array}{l|ccc|ccc}
(X, Y)^T & X_1 & \cdots & X_n & Y_1 & \cdots & Y_m \\
Z & Z_1 & \cdots & Z_n & Z_{n+1} & \cdots & Z_{n+m} \\
W & 1 & \cdots & 1 & 2 & \cdots & 2
\end{array}$$

Let $T = T(Z, W)$ be any test statistic. For example, consider $T = |\bar X - \bar Y|$. We can write $T$ as a function of $Z$ and $W$ as follows. Define $X(Z, W) = \{Z_i : W_i = 1\}$ and $Y(Z, W) = \{Z_i : W_i = 2\}$ and then $T = |\bar X - \bar Y| = |\bar X(Z, W) - \bar Y(Z, W)|$.

Let $T^* = T(Z, W^*)$ where $W^*$ denotes a random permutation of $W$. Define the permutation p-value

$$p = P(T^* \ge t) \tag{11.8}$$

where t = T (Z, W ) is the observed value of the test statistic. This p-value defines an exact

test. The steps of the algorithm are as follows:

Figure 11.3: Top left: $X_1, \ldots, X_n$. Top right: $Y_1, \ldots, Y_m$. Bottom left: values of the test statistic from 1,000 permutations.

It would be difficult to find a useful expression for the distribution of the test statistic $T$ under the null hypothesis $H_0 : F = G$. However, we can compute the p-value easily using the permutation test. Figure 11.3 shows an example. The top left plot shows $n = 10$ observations from $F$ and the top right plot shows $m = 10$ observations from $G$. (We took $F$ to be bivariate normal and $G$ to be a mixture of two normals.) The test statistic is 0.45 and the p-value, based on $B = 1{,}000$ permutations, is 0.006, suggesting that we should reject $H_0$. The bottom left shows a histogram of the values of $T$ from the 1,000 permutations. The vertical line is the observed value of $T$. The p-value is the fraction of statistics greater than $T$.
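A Monte Carlo version of the permutation test can be sketched as below. It uses the univariate statistic $T = |\bar X - \bar Y|$ mentioned earlier rather than the bivariate setup of Figure 11.3, approximates the p-value with $B$ random permutations instead of enumerating all of them, and counts permuted statistics that are at least as large as the observed one. The function name `perm_test` and the toy data are assumptions made for the illustration.

```python
import random

def perm_test(x, y, B=1000, seed=0):
    """Monte Carlo permutation p-value for T = |mean(X) - mean(Y)|."""
    rng = random.Random(seed)
    n = len(x)
    z = list(x) + list(y)

    def stat(v):
        # The first n entries play the role of group 1, the rest group 2.
        return abs(sum(v[:n]) / n - sum(v[n:]) / (len(v) - n))

    t_obs = stat(z)
    count = 0
    for _ in range(B):
        rng.shuffle(z)       # same effect as randomly permuting the labels W
        if stat(z) >= t_obs:
            count += 1
    return t_obs, count / B

x = [0.1, 0.4, -0.3, 0.8, 0.2, -0.5, 0.6, 0.0]    # hypothetical sample from F
y = [1.9, 2.4, 1.1, 2.8, 2.2, 1.5, 2.6, 2.0]      # hypothetical sample from G
t_obs, p_value = perm_test(x, y)
print(t_obs, p_value)
```

Shuffling the pooled values while keeping the group sizes fixed is equivalent to permuting $W$, because only the assignment of values to groups matters for $T$.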

11.9.2 Confidence Rectangles for Quantiles

11.9.3 Confidence Rectangles for Means

11.9.4 Conformal Methods

11.10 Summary

The bootstrap provides nonparametric standard errors and confidence intervals. To draw a bootstrap sample we draw $n$ observations $X_1^*, \ldots, X_n^*$ from the empirical distribution $P_n$. This is equivalent to drawing $n$ observations with replacement from the original data $X_1, \ldots, X_n$. We then compute the estimator $\hat\theta^* = g(X_1^*, \ldots, X_n^*)$. If we repeat this whole process $B$ times we get $\hat\theta_1^*, \ldots, \hat\theta_B^*$. The standard deviation of these values approximates the standard error of $\hat\theta_n = g(X_1, \ldots, X_n)$.
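The recipe in this summary translates almost line by line into code. The sketch below is one possible rendering using only the standard library; the function name `bootstrap_se`, the choice $B = 1000$, and the simulated $N(0, 1)$ data are assumptions made for the example.

```python
import random
import statistics

def bootstrap_se(data, estimator, B=1000, seed=0):
    """Standard deviation of B bootstrap replicates of `estimator`."""
    rng = random.Random(seed)
    n = len(data)
    reps = [estimator(rng.choices(data, k=n)) for _ in range(B)]
    return statistics.pstdev(reps)

rng = random.Random(42)
data = [rng.gauss(0, 1) for _ in range(100)]        # hypothetical sample
se_mean = bootstrap_se(data, statistics.mean)
se_median = bootstrap_se(data, statistics.median)
print(se_mean, se_median)
```

For the sample mean the result should be close to $\hat\sigma/\sqrt{n}$, which gives a quick way to check the code; for the median no such simple formula is available, which is where the bootstrap is most useful.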

11.11 Bibliographic Remarks

Further details on statistical functionals can be found in [51], [13], [52], [23] and [59].

The jackknife was invented by [47] and [58]. The bootstrap was invented by [20]. There

are several books on these topics including [22], [13], [29] and [52]. Also, see Section

3.6 of [60].

Appendix

More on Plug-in Estimators. Let $\theta = T(P)$. The plug-in estimator of $\theta$ is $\hat\theta_n = T(P_n)$ where $P_n$ is the empirical distribution that puts mass $1/n$ at each $X_i$. For example, suppose that $T(P) = \int x \, dP(x)$ is the mean. Then $T(P_n) = \int x \, dP_n(x) = n^{-1}\sum_{i=1}^n X_i$ since integrating with respect to $P_n$ corresponds to summing over the discrete measure with mass $1/n$ at $X_i$.

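As another example, suppose that $\theta = T(P)$ is the variance of $X$. Let $\mu$ denote the mean. Then

$$\theta = E(X - \mu)^2 = \int (x - \mu)^2 \, dP(x) = \int x^2 \, dP(x) - \left(\int x \, dP(x)\right)^2.$$

Thus, the plug-in estimator is

$$\hat\theta_n = \int x^2 \, dP_n(x) - \left(\int x \, dP_n(x)\right)^2 = \frac{1}{n}\sum_{i=1}^n X_i^2 - \left(\frac{1}{n}\sum_{i=1}^n X_i\right)^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar X_n)^2.$$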

For one more example, let $\theta$ be the $\alpha$ quantile of $X$. Here it is convenient to work with the cdf $F(x) = P(X \le x)$. Thus $\theta = T(P) = T(F) = F^{-1}(\alpha)$ where $F^{-1}(y) = \inf\{x : F(x) \ge y\}$. The empirical cdf is $F_n(x) = n^{-1}\sum_{i=1}^n I(X_i \le x)$ and

$$\hat\theta_n = T(F_n) = \inf\{x : F_n(x) \ge \alpha\}.$$

In other words, $\hat\theta_n$ is just the corresponding sample quantile.
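The plug-in variance and plug-in quantile can be computed directly from their definitions. The short sketch below is illustrative only; the function names are made up, and the quantile function assumes $0 < \alpha \le 1$.

```python
import math

def plugin_variance(xs):
    """(1/n) sum X_i^2 - ((1/n) sum X_i)^2, the variance of P_n."""
    n = len(xs)
    return sum(x * x for x in xs) / n - (sum(xs) / n) ** 2

def plugin_quantile(xs, alpha):
    """inf{x : F_n(x) >= alpha} for 0 < alpha <= 1."""
    s = sorted(xs)
    k = math.ceil(alpha * len(s))     # smallest k with k/n >= alpha
    return s[max(k, 1) - 1]

xs = [1, 2, 3, 4]
print(plugin_variance(xs))        # 1.25
print(plugin_quantile(xs, 0.5))   # 2
```

Note that the plug-in variance divides by $n$, not $n - 1$, because it is the variance of the empirical distribution itself.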

Hadamard Differentiability. The key condition needed for the bootstrap is Hadamard differentiability. Let $\mathcal{P}$ denote all distributions on the real line and let $\mathcal{D}$ denote the linear space generated by $\mathcal{P}$. Write $T((1 - \epsilon)P + \epsilon Q) = T(P + \epsilon D)$ where $D = Q - P \in \mathcal{D}$. The