Nonparametric Bootstrap: Estimating Population Parameters without Assuming a Model

An introduction to the nonparametric bootstrap, a statistical technique for estimating population parameters without assuming a specific distribution model. The notes cover the concept of bootstrapping, the role of the empirical distribution function, and the steps of a nonparametric bootstrap analysis. They also include an example using city population data and R code implementing the bootstrap.

22S:166
Introduction to the Bootstrap
Lecture 8
September 18, 2006
Kate Cowles
374 SH, 335-0727

Resources

  • Efron, B. (1982) The Jackknife, the Bootstrap, and Other Resampling Plans. Number 38 in CBMS-NSF Regional Conference Series in Applied Mathematics. Philadelphia: SIAM.
  • Efron, B. and Tibshirani, R.J. (1993) An Introduction to the Bootstrap. New York: Chapman & Hall.
  • Davison, A.C. and Hinkley, D.V. (1997) Bootstrap Methods and their Application. New York: Cambridge University Press.

Review concepts

  • suppose we have one sample of n data values: $y_1, \ldots, y_n$
  • sample values considered outcomes of i.i.d. random variables $Y_1, \ldots, Y_n$
  • probability density function (pdf) or probability mass function (pmf) f
  • cumulative distribution function (cdf) F
  • sample will be used to make inference
    • about population characteristic θ
    • using statistic T whose value in sample is t
  • questions of interest regarding T
    • bias?
    • standard error?
    • quantiles?
    • how to compute confidence limits for θ?
    • likely values under a null hypothesis of interest?

Two classes of statistical methods

  • parametric
    • particular mathematical model for behavior of random variables $Y_j$
    • pdf or pmf f is completely determined by values of unknown parameters ψ
    • quantity of interest in statistical analysis θ is a component or function of ψ
  • nonparametric
    • uses only the fact that the $Y_j$s are i.i.d.
    • no mathematical model for their distribution
    • (may be useful to do a nonparametric analysis even if a reasonable parametric model exists)
      • to assess sensitivity of conclusions to assumptions of parametric model
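The contrast between the two classes can be made concrete by simulating one new sample under each. The sketch below is an illustration only: the data values are made up, and the exponential model is just one possible parametric choice (the same family used for the air conditioning example later in these notes).

```r
set.seed(1)

# Hypothetical positive-valued sample (made up for illustration)
y <- c(12, 3, 45, 7, 30, 18, 5, 60)
n <- length(y)

# Parametric: ASSUME the Y_j are exponential, so f is fixed by one
# unknown parameter (the mean), estimated here by the sample mean
rate_hat <- 1 / mean(y)
y_par <- rexp(n, rate = rate_hat)      # new sample from the fitted model

# Nonparametric: use only that the Y_j are i.i.d.; draw a new sample
# from the observed values themselves, with replacement
y_nonpar <- sample(y, size = n, replace = TRUE)

print(y_par)
print(y_nonpar)
```

The parametric draw can produce any positive value, whereas the nonparametric draw can only repeat values already present in the sample.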

The empirical distribution

  • puts probability mass $1/n$ at each sample value $y_j$
  • empirical distribution function (edf) or $\hat{F}$
    • nonparametric mle of F
    • sample proportion $\hat{F}(y) = \#\{y_j \le y\} / n$
      • where # denotes the number of items in a set
  • edf plays role of fitted model when no mathematical form is assumed for F
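A minimal sketch of the edf in R, assuming a small made-up sample: the hand-written `edf_manual` (a hypothetical helper name) follows the counting definition directly, and base R's `ecdf()` returns the same step function.

```r
# Hypothetical sample of n = 8 values (made up for illustration)
y <- c(3.1, 0.7, 2.4, 5.0, 2.4, 1.8, 4.2, 0.9)
n <- length(y)

# Direct definition: Fhat(q) = #{y_j <= q} / n
edf_manual <- function(q) sum(y <= q) / n

# Base R equivalent: ecdf() returns the edf as a function
Fhat <- ecdf(y)

print(edf_manual(2.4))   # proportion of sample values <= 2.4
print(Fhat(2.4))         # same value from ecdf()

# Probability mass at each distinct value: count / n
# (a value that occurs twice carries mass 2/n)
print(table(y) / n)
```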

Example for the nonparametric bootstrap: City population data

  • for each of n = 49 U.S. cities, two data values
    • $u_j$ = population in 1920 (in 1000s)
    • $x_j$ = population in 1930 (in 1000s)
  • population of interest is all U.S. cities
  • the 49 cities are assumed to be a simple random sample from this population
  • define (U, X) as pair of population values for a randomly selected city
  • then if we knew $\theta = E(X)/E(U)$ and the total 1920 population for the U.S., we could estimate the total 1930 population of the U.S.
  • want to estimate θ without assuming any parametric model for X and U
  • sample-based statistic is $T = \bar{X}/\bar{U}$
  • observations 1 to 10 of this dataset are included with the boot package for R
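To show how the ratio statistic would be used, here is a small sketch on made-up numbers. The data frame `toy` and the known total `total_1920` are hypothetical and are not the city data; only the form of the calculation follows the slide.

```r
# Hypothetical toy data (NOT the city data), populations in 1000s
toy <- data.frame(u = c(40, 25, 75, 10, 50),    # 1920 populations
                  x = c(55, 30, 120, 20, 75))   # 1930 populations

# Sample-based statistic T = mean(x) / mean(u)
t_obs <- mean(toy$x) / mean(toy$u)

# If the total 1920 population were known, T would scale it up
# to an estimate of the total 1930 population
total_1920 <- 1000                  # assumed known total (made up)
est_total_1930 <- t_obs * total_1920

print(t_obs)
print(est_total_1930)
```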

elements of the vector being sampled.

> x <- seq(1:25)
> x
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
> sample(x, 25)
[1] 2 20 3 9 6 8 15 10 23 1 19 25 12 21 14 4 13 24 17 5 11 18 7 22 16
> sample(x, 25, replace = TRUE)
[1] 4 6 16 11 21 17 6 12 5 8 15 19 23 16 15 20 18 19 21 5 25 7 8 20 3

Bias correction using the bootstrap

  • notation
    • θ – true and unknown population quantity value
    • $\hat{\theta}$ – estimate of θ based on sample data
    • $\hat{\theta}^*_b$ – estimate of θ from b-th bootstrap sample

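Bias correction continued

  • So in a sense:
    • the $\hat{\theta}^*_b$s are to $\hat{\theta}$ as $\hat{\theta}$ is to θ
  • bootstrap estimate of bias
    • Note: bias $= E_F(\hat{\theta} - \theta)$

$$\widehat{\mathrm{bias}}_{\mathrm{boot}} = \frac{1}{B} \sum_{b=1}^{B} \left( \hat{\theta}^*_b - \hat{\theta} \right) = \hat{\theta}^*_{\cdot} - \hat{\theta}$$

  • So bias-corrected point estimate is

$$\tilde{\theta} = \hat{\theta} - \left( \hat{\theta}^*_{\cdot} - \hat{\theta} \right) = 2\hat{\theta} - \hat{\theta}^*_{\cdot}$$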
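As a worked check of these two formulas, one can plug in the B = 1000 city-data run reported in the function-call results later in these notes, which prints an mle of 1.5203125 and a bootstrap mean of 1.5712273 (rounded). The arithmetic below is a reader's check computed from those two printed values, not part of the original output.

```latex
\begin{align*}
\widehat{\mathrm{bias}}_{\mathrm{boot}}
  &= \hat{\theta}^*_{\cdot} - \hat{\theta}
   = 1.5712273 - 1.5203125 = 0.0509148, \\
\tilde{\theta}
  &= 2\hat{\theta} - \hat{\theta}^*_{\cdot}
   = 2(1.5203125) - 1.5712273 = 1.4693977.
\end{align*}
```

The bootstrap estimates sit on average above the sample estimate, so the correction moves the point estimate downward by the same amount.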

Percentile method for confidence intervals

  • denote cdf of bootstrap distribution of $\hat{\theta}^*$ as $\widehat{\mathrm{CDF}}(t) = \Pr_*(\hat{\theta}^* \le t)$
  • if bootstrap distribution is obtained by simulation then

$$\widehat{\mathrm{CDF}}(t) \simeq \frac{\#\{\hat{\theta}^*_b \le t\}}{B}$$

  • define confidence interval as interval between appropriate quantiles
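The percentile method can be sketched in a few lines of R. This is an illustration under stated assumptions: the sample `y` is made up, the statistic is the sample mean rather than the ratio used for the city data, and `cdf_hat` is a hypothetical helper name.

```r
set.seed(2006)

# Hypothetical sample (made up); the statistic here is the sample mean
y <- c(4.1, 5.3, 2.8, 6.0, 3.7, 5.9, 4.4, 7.2, 3.3, 5.0)
B <- 2000

# B bootstrap replicates of the statistic
boots <- replicate(B, mean(sample(y, replace = TRUE)))

# Simulation estimate of the bootstrap cdf: #{theta*_b <= t} / B
cdf_hat <- function(t) mean(boots <= t)

# 95% percentile interval: 0.025 and 0.975 quantiles of the replicates.
# type = 1 inverts the empirical cdf without interpolation.
ci <- quantile(boots, probs = c(0.025, 0.975), type = 1, names = FALSE)

print(ci)
print(c(cdf_hat(ci[1]), cdf_hat(ci[2])))
```

Using `type = 1` keeps the interval endpoints equal to actual bootstrap replicates; R's default quantile type interpolates between neighbouring order statistics instead.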

R code for the City Data

Main driver program

drive.bootstrap <- function(mdata, mlefunc, bootfunc, B) {
    mle <- mlefunc(mdata)            # compute mle
    print(c("mle", mle))
    boots <- bootfunc(mdata, B)      # generate bootstrap samples
    print("quantiles")
    print(quantile(boots, c(0.005, 0.025, 0.5, 0.975, 0.995)))
    meanb <- mean(boots)
    print(c("mean", meanb))
    stderr <- sqrt(var(boots))
    print(c("stderr", stderr))
    biasc <- 2 * mle - meanb
    print("bias-corrected point estimate")
    biasc
}

Function defining computation of $\hat{\theta}$

meanratio <- function(mdata) {
    # ratio of sample means; mdata must have columns named x and u
    mean(mdata$x) / mean(mdata$u)
}

Function carrying out desired type of bootstrap

nonparboot.ratio <- function(mydat, B) {
    # nonparametric bootstrap for ratio of means
    # data object must contain 2 columns of data
    # returns B bootstrap estimates of ratio of means
    if (ncol(mydat) != 2) {
        print("input matrix must have 2 columns of numeric data")
    } else {
        bootratio <- numeric()
        n <- nrow(mydat)    # number of observations
        for (i in 1:B) {
            # sample of size n from integers 1:n with replacement
            index1 <- sample(n, replace = TRUE)
            # bootstrap sample; rows of data corresponding to index
            boot1 <- mydat[index1, ]
            bootratio <- c(bootratio, mean(boot1$x) / mean(boot1$u))
        }
        bootratio
    }
}

Function call and results

> library(boot)
> data(city)
> drive.bootstrap(city, meanratio, nonparboot.ratio, 81)
[1] "mle"       "1.5203125"
[1] "quantiles"
    0.5%     2.5%      50%    97.5%    99.5%
1.229048 1.276364 1.523894 2.277978 2.
[1] "mean"             "1.58087829542998"
[1] "stderr"            "0.268075889571327"
[1] "bias-corrected point estimate"
[1] 1.

> drive.bootstrap(city, meanratio, nonparboot.ratio, 1000)
[1] "mle"       "1.5203125"
[1] "quantiles"
    0.5%     2.5%      50%    97.5%    99.5%
1.195834 1.254945 1.529401 2.118204 2.
[1] "mean"             "1.57122730135862"
[1] "stderr"            "0.239441130366229"
[1] "bias-corrected point estimate"
[1] 1.

R code for parametric bootstrap for the air conditioning data

parboot.logexpmean <- function(mdata, B) {
    # parametric bootstrap for log of exponential mean parm
    # data object must contain 1 column of data
    # returns B bootstrap estimates of log exponential mean parm
    if (ncol(mdata) != 1) {
        print("input data must have 1 column of numeric data")
    } else {
        mle <- logexpmean(mdata)
        bootlogexpmean <- numeric()
        n <- nrow(mdata)    # number of observations
        for (i in 1:B) {
            # bootstrap sample; n random draws from exponential distribution
            # with mle from mdata as parameter (rate = 1 / mean = exp(-mle))
            boot1 <- rexp(n, exp(-mle))
            bootlogexpmean <- c(bootlogexpmean, logexpmean(as.matrix(boot1)))
        }
        bootlogexpmean
    }
}

logexpmean <- function(mdata) {
    if (ncol(mdata) != 1) {
        print("input data must have 1 column of numeric data")
    } else {
        # mean of the single column; works for a matrix or a data frame
        log(mean(mdata[, 1]))
    }
}

drive.bootstrap2 <- function(mdata, mlefunc, bootfunc, B) {
    mle <- mlefunc(mdata)            # compute mle
    print(c("mle", mle))
    boots <- bootfunc(mdata, B)      # generate bootstrap samples
    print("quantiles")
    print(quantile(boots, c(0.005, 0.025, 0.5, 0.975, 0.995)))
    meanb <- mean(boots)
    print(c("mean", meanb))
    stderr <- sqrt(var(boots))
    print(c("stderr", stderr))
    biasc <- 2 * mle - meanb
    print("bias-corrected point estimate")
    print(biasc)
    # now bias-corrected percentile method of C.I.
    # first get Pr(a bootstrap estimate is <= the sample estimate)
    cof <- length(boots[boots <= mle]) / B
    print(c("cof", cof))
    z0 <- qnorm(cof)                 # corresponding normal quantile
    z.alpha <- qnorm(0.975)
    p.low.end <- pnorm(2 * z0 - z.alpha)
    p.high.end <- pnorm(2 * z0 + z.alpha)
    print(c(p.low.end, p.high.end))
    boots <- sort(boots)
    print("bias corrected C.I.")
    # non-integer indices are truncated toward zero by R
    print(c(boots[B * p.low.end], boots[B * p.high.end]))
}

> drive.bootstrap2(aircondit, logexpmean, parboot.logexpmean, 100)
[1] "mle"              "4.68290253452844"
[1] "quantiles"
    0.5%     2.5%      50%    97.5%    99.5%
3.834368 4.094229 4.710413 5.181885 5.
[1] "mean"             "4.67889044080901"
[1] "stderr"            "0.286036072515565"
[1] "bias-corrected point estimate"
[1] 4.
[1] "cof"  "0.45"
[1] 0.01350800 0.
[1] "bias corrected C.I."
[1] 3.684830 5.

> drive.bootstrap2(aircondit, logexpmean, parboot.logexpmean, 100)
[1] "mle"              "4.68290253452844"
[1] "quantiles"
    0.5%     2.5%      50%    97.5%    99.5%
4.046429 4.096043 4.684174 5.134347 5.
[1] "mean"             "4.66888582075081"
[1] "stderr"            "0.274983035251789"
[1] "bias-corrected point estimate"
[1] 4.
[1] "cof" "0.5"
[1] 0.025 0.
[1] "bias corrected C.I."
[1] 4.081475 5.

> drive.bootstrap2(aircondit, logexpmean, parboot.logexpmean, 1000)
[1] "mle"              "4.68290253452844"
[1] "quantiles"
    0.5%     2.5%      50%    97.5%    99.5%
3.760392 3.997390 4.640964 5.166151 5.
[1] "mean"             "4.62143415974039"
[1] "stderr"           "0.29366260048462"
[1] "bias-corrected point estimate"
[1] 4.
[1] "cof"   "0.563"
[1] 0.05021169 0.
[1] "bias corrected C.I."
[1] 4.113786 5.

> drive.bootstrap2(aircondit, logexpmean, parboot.logexpmean, 25000)
[1] "mle"              "4.68290253452844"
[1] "quantiles"
    0.5%     2.5%      50%    97.5%    99.5%
3.814595 4.033918 4.654331 5.180473 5.
[1] "mean"             "4.64203157705189"
[1] "stderr"            "0.292952308908569"
[1] "bias-corrected point estimate"
[1] 4.
[1] "cof"     "0.53676"
[1] 0.03791469 0.
[1] "bias corrected C.I."
[1] 4.098616 5.

> drive.bootstrap2(aircondit, logexpmean, parboot.logexpmean, 25000)
[1] "mle"              "4.68290253452844"
[1] "quantiles"
    0.5%     2.5%      50%    97.5%    99.5%
3.795942 4.014882 4.654935 5.179212 5.
[1] "mean"             "4.64014458348023"
[1] "stderr"            "0.295191848042946"
[1] "bias-corrected point estimate"
[1] 4.
[1] "cof"     "0.53592"
[1] 0.03756714 0.
[1] "bias corrected C.I."
[1] 4.082338 5.

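Bias-correcting bootstrap confidence intervals

  • recall:

$$\widehat{\mathrm{CDF}}(q) = \Pr_*(\hat{\theta}^* \le q) = \frac{\#\{\hat{\theta}^*_b \le q\}}{B}$$

  • if $\widehat{\mathrm{CDF}}(\hat{\theta}) \ne 0.5$, then bias correction to percentile method c.i. may be in order
  • let $z_0 = \Phi^{-1}(\widehat{\mathrm{CDF}}(\hat{\theta}))$
  • what S-PLUS/R function evaluates $\Phi^{-1}$? (qnorm, as used in the driver function above)
  • then bias-corrected $1 - \alpha$ c.i. is

$$\left( \widehat{\mathrm{CDF}}^{-1}\!\left( \Phi(2 z_0 - z_{\alpha/2}) \right),\; \widehat{\mathrm{CDF}}^{-1}\!\left( \Phi(2 z_0 + z_{\alpha/2}) \right) \right)$$

  • here $z_{\alpha/2}$ is upper α/2 point of standard normal: $\Phi(z_{\alpha/2}) = 1 - \alpha/2$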
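As a worked check, take the first B = 100 air conditioning run shown earlier, which prints cof = 0.45 (the value of $\widehat{\mathrm{CDF}}(\hat{\theta})$) and a lower probability of 0.01350800. With α = 0.05 the steps are as follows; the rounded figures are a reader's calculation from standard normal values, not printed output.

```latex
\begin{align*}
z_0 &= \Phi^{-1}(0.45) \approx -0.1257,
  \qquad z_{\alpha/2} = z_{0.025} \approx 1.9600, \\
\Phi(2 z_0 - z_{\alpha/2}) &\approx \Phi(-2.2113) \approx 0.0135, \\
\Phi(2 z_0 + z_{\alpha/2}) &\approx \Phi(1.7086) \approx 0.956.
\end{align*}
```

The lower value agrees with the printed 0.01350800. Because cof is below 0.5, both probabilities shift downward from the uncorrected 0.025 and 0.975; in the second run, where cof = 0.5, $z_0 = 0$ and the interval reduces to the ordinary percentile interval.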