Nonparametric Curve Estimation: Pointwise vs. Simultaneous Confidence & Bias Correction, Study notes of Data Analysis & Statistical Methods

Study notes on nonparametric curve estimation methods, focusing on pointwise and simultaneous confidence regions. They cover the impact of bandwidth on bias and variance, and the challenges of bias correction in nonparametric problems using the bootstrap. The notes also include an exercise on the mean and variance of the estimator.

Typology: Study notes

Pre 2010

Uploaded on 07/30/2009

koofers-user-i62-2

METHODOLOGY AND THEORY

FOR THE BOOTSTRAP

(Seventh set of two lectures)

Main topic of these lectures: Bootstrap methods for nonparametric curve estimation

Pointwise versus simultaneous confidence regions

We shall dispose of this topic first, so that we can focus subsequently on other issues.

Suppose we have an estimator $\hat g$ of a function $g$ on an interval $I$, and, for a given level $1 - \alpha$ of probability, have constructed a confidence region, or “tube,” for $g$, consisting of a boundary above and a boundary below the curve represented by the formula $y = \hat g(x)$, for $x \in I$.

Pointwise versus simultaneous confidence regions (cont. 1)

The region can be interpreted as the union of intervals $(\hat g_1(x), \hat g_2(x))$, for $x \in I$. Of course, $\hat g_1$ and $\hat g_2$ are constructed from data, and satisfy $\hat g_1 \le \hat g_2$.

Such a region is commonly referred to as a “$(1 - \alpha)$-level confidence region for $g$ on the interval $I$.”

We can interpret the statement in two ways. Either (i) the interval $(\hat g_1(x), \hat g_2(x))$ covers $g(x)$ with probability approximately $1 - \alpha$, for each $x \in I$; or (ii) the probability that the graph represented by the equation $y = g(x)$ lies within the tube converges to $1 - \alpha$ as $n$ increases. Interpretation (i) is the pointwise one; interpretation (ii) is the simultaneous one.

Pointwise versus simultaneous confidence regions (cont. 3)

This close relationship vanishes in nonparametric cases, however. There, simultaneous confidence regions are generally an order of magnitude wider than their parametric counterparts.

Although the factor by which the width increases is proportional only to $(\log n)^{1/2}$, in asymptotic terms, the increase is generally substantial, and this alone causes simultaneous bands to be unpopular.

When coupled with the relative lack of interest in predicting the value of $E(Y \mid X = x_0)$ simultaneously for many values of $x_0$, this means that the pointwise interpretation is the obvious choice in at least the setting of nonparametric regression. We shall adopt it in the density estimation context, too.

Our treatment of confidence regions in the setting of nonparametric curve estimation will address only the case of nonparametric density estimation. Nonparametric regression is broadly similar.

Local estimation

Nonparametric curve estimators (usually, estimators of densities or regression means) work without imposing structural assumptions.

Typically the only conditions required are that the function in question have sufficiently many bounded derivatives. That is, it should be sufficiently smooth.

Since only smoothness is assumed, the value of the estimator at a particular point, x say, is based largely, if not wholly, on data values close to x.

In density estimation, where we have a sample $X_1, \ldots, X_n$ from the distribution with density $f$, this means that to estimate $f$ at $x$ we use only those $X_i$'s that are close to $x$.

In regression, where estimators are based on data $(x_1, Y_1), \ldots, (x_n, Y_n)$ and we wish to estimate $g(x) = E(Y \mid X = x)$, we use only those pairs $(x_i, Y_i)$ for which $x_i$ is close to $x$.

Example: Nonparametric density estimation

Suppose we sample independent and identically distributed data $\mathcal X = \{X_1, \ldots, X_n\}$ from a distribution with density $f$. We wish to estimate this function.

Let $K$ be a bounded, compactly supported, symmetric probability density, let $h > 0$ denote a “bandwidth,” and put

$$\hat f(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - X_i}{h}\right).$$

This is our estimator of f (x).

The estimator $\hat f$ is itself a density; it is nonnegative and it integrates to 1, since $K$ has both those properties.
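The estimator and its two density properties can be checked numerically. The following is a minimal sketch, not part of the lectures: the biweight kernel $K(u) = \frac{15}{16}(1 - u^2)^2$ on $[-1, 1]$, the simulated normal sample, the bandwidth $h = 0.5$ and the helper names `biweight`, `kde` and `integrate` are all assumptions made for illustration.

```python
import random

def biweight(u):
    """Biweight kernel (assumed choice): (15/16) * (1 - u^2)^2 on [-1, 1], else 0."""
    return 15.0 / 16.0 * (1.0 - u * u) ** 2 if abs(u) <= 1.0 else 0.0

def kde(x, data, h, kernel=biweight):
    """Kernel density estimate: (nh)^{-1} * sum_i K((x - X_i) / h)."""
    n = len(data)
    return sum(kernel((x - xi) / h) for xi in data) / (n * h)

def integrate(func, a, b, m=4000):
    """Midpoint-rule approximation to the integral of func over [a, b]."""
    step = (b - a) / m
    return step * sum(func(a + (j + 0.5) * step) for j in range(m))

# Hypothetical data: 100 draws from a standard normal distribution.
rng = random.Random(0)
sample = [rng.gauss(0.0, 1.0) for _ in range(100)]
h = 0.5

# The estimate vanishes outside [min - h, max + h], so integrate over that range.
total = integrate(lambda x: kde(x, sample, h), min(sample) - h, max(sample) + h)
print("integral of the estimate:", round(total, 4))
print("estimate at 0:", round(kde(0.0, sample, h), 4))
```

The integral should come out very close to 1 for any sample and any $h > 0$, because each summand $h^{-1} K\{(x - X_i)/h\}$ integrates to 1.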

Reliance of $\hat f$ on bandwidth

Note that

$$\psi_i(x) = h^{-1} K\left(\frac{x - X_i}{h}\right)$$

is itself a density, for each fixed $X_i$: $\psi_i \ge 0$, $\int \psi_i = 1$.

The density $\psi_i$ gets narrower and taller as $h$ decreases. Our estimator $\hat f$ is obtained by simply averaging the values of the $\psi_i$'s. Clearly, adjusting the bandwidth affects the shape, and hence the properties, of $\hat f$.

Exercise. Prove from these results, and elementary calculus, that if $h = h(n) \to 0$ as $n \to \infty$, in such a manner that $nh \to \infty$, and if $f$ has two continuous derivatives in a neighbourhood of $x$, then

$$E\{\hat f(x)\} = f(x) + \tfrac12 \kappa_2 h^2 f''(x) + o(h^2), \qquad \mathrm{var}\{\hat f(x)\} = (nh)^{-1} \kappa f(x) + o\{(nh)^{-1}\},$$

where $\kappa = \int K^2$ and $\kappa_2 = \int u^2 K(u)\,du$.

The first result here implies that $\hat f(x)$ is asymptotically unbiased for $f(x)$. That is, as $n$ increases, the difference between $E\{\hat f(x)\}$ and the quantity $f(x)$ that $\hat f(x)$ is estimating converges to zero.

The second result implies that the variance of $\hat f(x)$ converges to zero as $n$ increases.
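One possible route through the Exercise is sketched here; it is an outline under the stated assumptions, not the lecturer's own solution. It uses the change of variable $u = (x - y)/h$, a second-order Taylor expansion of $f$, and the facts that $\int K = 1$ and $\int u K(u)\,du = 0$ (by symmetry of $K$).

```latex
\begin{align*}
E\{\hat f(x)\}
  &= \frac{1}{h} \int K\!\left(\frac{x - y}{h}\right) f(y)\,dy
   = \int K(u)\, f(x - hu)\,du \\
  &= \int K(u) \left\{ f(x) - hu\, f'(x) + \tfrac12 h^2 u^2 f''(x) + o(h^2) \right\} du \\
  &= f(x) + \tfrac12 \kappa_2 h^2 f''(x) + o(h^2), \\[6pt]
\mathrm{var}\{\hat f(x)\}
  &= \frac{1}{n}\, \mathrm{var}\left\{ \frac{1}{h} K\!\left(\frac{x - X_1}{h}\right) \right\}
   = \frac{1}{n} \left[ \frac{1}{h} \int K(u)^2 f(x - hu)\,du - \big( E\{\hat f(x)\} \big)^2 \right] \\
  &= (nh)^{-1} \kappa f(x) + o\{(nh)^{-1}\}.
\end{align*}
```

The remainder in the Taylor expansion is $o(h^2)$ uniformly in $u$ because $K$ is compactly supported. In the variance, the squared-mean term contributes only $O(n^{-1})$, which is $o\{(nh)^{-1}\}$ since $h \to 0$.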

Mean squared error

Therefore, the mean squared error of $\hat f(x)$ is given by

$$E\{\hat f(x) - f(x)\}^2 = \mathrm{var}\{\hat f(x)\} + \{E \hat f(x) - f(x)\}^2 = \frac{C_1}{nh} + C_2 h^4 + o\{(nh)^{-1} + h^4\},$$

where the constants $C_1 = \kappa f(x)$ and $C_2 = \frac14 \kappa_2^2 f''(x)^2$ depend on $x$.

(Proof: Use the results of the Exercise.)

It follows that the optimal choice of $h$, for the purpose of minimising mean squared error, is of size $n^{-1/5}$.

This order of magnitude of bandwidth brings the variance and squared-bias terms, of respective orders $(nh)^{-1}$ and $h^4$, into balance.
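The $n^{-1/5}$ rate can be made explicit by minimising the leading terms of the mean squared error. This is a short worked calculation, assuming $C_1 > 0$ and $C_2 > 0$ (that is, $f(x) > 0$ and $f''(x) \ne 0$) and ignoring the remainder term.

```latex
\begin{align*}
\frac{d}{dh} \left( \frac{C_1}{nh} + C_2 h^4 \right)
  = -\frac{C_1}{n h^2} + 4 C_2 h^3 = 0
  \quad &\Longleftrightarrow \quad
  h^5 = \frac{C_1}{4 C_2\, n} \\
  \quad &\Longleftrightarrow \quad
  h = \left( \frac{C_1}{4 C_2} \right)^{1/5} n^{-1/5}.
\end{align*}
```

Substituting this bandwidth back, both $C_1/(nh)$ and $C_2 h^4$ are of order $n^{-4/5}$, so the minimal mean squared error is of order $n^{-4/5}$.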

Choice of kernel

Common choices of $K$ are the standard normal density,

$$K(u) = (2\pi)^{-1/2} \exp\left(-\tfrac12 u^2\right),$$

and the $k$-weight kernel,

$$K(u) = c_k \left(1 - u^2\right)^k$$

for $|u| \le 1$ (with $K(u) = 0$ otherwise), where $k \ge 1$ is an integer and the constant $c_k$ is chosen so that $K$ integrates to 1. The case $k = 2$ is popular; then $K$ is called the “biweight kernel.”
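The normalising constant of the $k$-weight kernel can be computed exactly by expanding $(1 - u^2)^k$ with the binomial theorem and integrating term by term over $[-1, 1]$. The sketch below does this with rational arithmetic; the function name `k_weight_constant` is hypothetical.

```python
from fractions import Fraction
from math import comb

def k_weight_constant(k):
    """Return c_k such that c_k * (1 - u^2)^k integrates to 1 over [-1, 1]."""
    # (1 - u^2)^k = sum_j (-1)^j * C(k, j) * u^(2j), and the integral of
    # u^(2j) over [-1, 1] equals 2 / (2j + 1).
    integral = sum(
        Fraction((-1) ** j * comb(k, j) * 2, 2 * j + 1) for j in range(k + 1)
    )
    return 1 / integral

for k in (1, 2, 3):
    print(k, k_weight_constant(k))
# prints:
# 1 3/4
# 2 15/16
# 3 35/32
```

For $k = 2$ this gives the biweight kernel $K(u) = \frac{15}{16}(1 - u^2)^2$.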

High-order kernels

More generally, we can take $K$ to satisfy

$$\int_{-\infty}^{\infty} u^j K(u)\,du \;\begin{cases} = 1 & \text{if } j = 0, \\ = 0 & \text{if } 1 \le j \le r - 1, \\ \ne 0 & \text{if } j = r, \end{cases} \tag{1}$$

for a given integer $r \ge 1$.

When using a kernel of this type the variance of $\hat f$ remains of size $(nh)^{-1}$, but the order of bias changes from $h^2$ to $h^r$. Therefore, by choosing $r > 2$ we can improve, at least in theory, the mean-square performance of $\hat f$. In particular, the order of mean squared error can be reduced to $n^{-2r/(2r+1)}$ by choosing $h$ of size $n^{-1/(2r+1)}$.

Note, however, that if $r > 2$ and (1) is to hold, then $K$ must take negative values, and as a result $\hat f$ is no longer guaranteed to be nonnegative.
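As an illustration of a kernel with $r = 4$, consider $K_4(u) = \frac12 (3 - u^2)\,\phi(u)$, where $\phi$ is the standard normal density. This particular kernel is an assumption chosen for the example and is not taken from the lectures. The sketch below checks its moments numerically and confirms that it takes negative values; the helper names `phi`, `k4` and `moment` are hypothetical.

```python
import math

def phi(u):
    """Standard normal density."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def k4(u):
    """Assumed fourth-order kernel: 0.5 * (3 - u^2) * phi(u)."""
    return 0.5 * (3.0 - u * u) * phi(u)

def moment(kernel, j, lo=-10.0, hi=10.0, m=20000):
    """Midpoint-rule approximation to the integral of u^j * kernel(u)."""
    step = (hi - lo) / m
    total = 0.0
    for i in range(m):
        u = lo + (i + 0.5) * step
        total += u ** j * kernel(u)
    return step * total

moments = [moment(k4, j) for j in range(5)]
for j, value in enumerate(moments):
    print(j, round(value, 6))

# The kernel is negative wherever u^2 > 3.
print("K4(2) =", round(k4(2.0), 6))
```

The moments of order 0 to 4 should be close to 1, 0, 0, 0 and $-3$ respectively, so the moment condition holds with $r = 4$, and $K_4(2) < 0$ shows the loss of nonnegativity.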

We shall take r = 2 in all the arguments below.

Nonparametric and semiparametric problems (cont.)

The effective number of parameters that are being fitted, when constructing the density estimator $\hat f$ in a given interval, equals the number of bandwidths that can be fitted into the interval.

Therefore the number of fitted parameters diverges at rate $n^{1/5}$ as sample size grows.

In comparison, estimation of “global” characteristics, such as mean, variance and other moments, is a semiparametric problem. Although in such cases estimation involves a potentially infinite number of unknowns, conventional convergence rates can be attained.

Implications for the bootstrap

We have not, so far, encountered cases where the effective number of parameters grew unboundedly as sample size increased.

The implications for the bootstrap are manifested in at least two ways: difficulties with bias, and a worsening of overall convergence rate (including the order of magnitude of coverage error).

Both these difficulties are manifested to some extent in more conventional, parametric problems, where the number of parameters is large, although fixed (as sample size increases).

In such instances, difficulties with bias and accuracy arise frequently, although in a theoretical treatment they do not result in an actual deterioration of convergence rate.

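Bias in semiparametric problems (cont.)

Moreover, the bootstrap estimator of bias,

$$\widehat{\mathrm{bias}} = E(\hat\theta^* \mid \mathcal X) - \hat\theta,$$

accurately approximates bias. Indeed,

$$\mathrm{bias} = \widehat{\mathrm{bias}} + O_p\left(n^{-3/2}\right),$$

and this high degree of precision led us to suggest $\hat\theta - \widehat{\mathrm{bias}}$ as a bias-corrected estimator of $\theta$:

$$\hat\theta_{bc} = \hat\theta - \widehat{\mathrm{bias}} = \hat\theta - \{E(\hat\theta^* \mid \mathcal X) - \hat\theta\} = 2\hat\theta - E(\hat\theta^* \mid \mathcal X).$$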
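The bias correction can be sketched in code for a simple smooth functional. The example below is an assumption chosen for illustration, not one taken from the lectures: $\theta = \mu^2$ is estimated by $\hat\theta = \bar X^2$, for which $E(\hat\theta^* \mid \mathcal X) - \hat\theta = n^{-2} \sum_i (X_i - \bar X)^2$ exactly. The Monte Carlo approximation with `n_boot` resamples, the seeds and the function names are likewise hypothetical.

```python
import random

def bootstrap_bias(data, statistic, n_boot=4000, seed=0):
    """Monte Carlo approximation to E(theta_hat* | data) - theta_hat."""
    rng = random.Random(seed)
    n = len(data)
    theta_hat = statistic(data)
    # Each resample draws n values from the data, with replacement.
    boot_mean = sum(
        statistic(rng.choices(data, k=n)) for _ in range(n_boot)
    ) / n_boot
    return boot_mean - theta_hat

def squared_mean(xs):
    """Plug-in estimator of the squared population mean."""
    m = sum(xs) / len(xs)
    return m * m

# Hypothetical data: 50 draws from a standard normal distribution.
data_rng = random.Random(1)
data = [data_rng.gauss(0.0, 1.0) for _ in range(50)]

theta_hat = squared_mean(data)
bias_hat = bootstrap_bias(data, squared_mean)
theta_bc = theta_hat - bias_hat  # equals 2 * theta_hat - E(theta_hat* | data)

# Exact bootstrap bias estimate for this statistic, for comparison.
n = len(data)
xbar = sum(data) / n
exact_bias_hat = sum((x - xbar) ** 2 for x in data) / n ** 2

print("Monte Carlo bias estimate:", bias_hat)
print("exact bootstrap bias estimate:", exact_bias_hat)
print("bias-corrected estimate:", theta_bc)
```

The two bias estimates should agree up to Monte Carlo error, which shrinks as `n_boot` grows.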

Bias in nonparametric problems

Reflecting the infinite-parameter nature of nonparametric density estimation, the bootstrap fails rather spectacularly to approximate bias.

To appreciate this point, let $\mathcal X^* = \{X_1^*, \ldots, X_n^*\}$ denote a resample drawn by sampling randomly, with replacement, from $\mathcal X$. Then the standard bootstrap form of $\hat f$ is $\hat f^*$, defined by

$$\hat f^*(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - X_i^*}{h}\right).$$

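Now, conditional on $\mathcal X$, each $X_i^*$ takes each of the values $X_1, \ldots, X_n$ with probability $1/n$, so

$$E\left\{ K\left(\frac{x - X_i^*}{h}\right) \,\middle|\, \mathcal X \right\} = \frac{1}{n} \sum_{j=1}^{n} K\left(\frac{x - X_j}{h}\right) = h\,\hat f(x).$$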

Therefore,

$$E\{\hat f^*(x) \mid \mathcal X\} = \hat f(x),$$

implying that the bootstrap estimator of bias is

$$\widehat{\mathrm{bias}} = E\{\hat f^*(x) \mid \mathcal X\} - \hat f(x) = 0.$$
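The identity $E\{\hat f^*(x) \mid \mathcal X\} = \hat f(x)$ can be verified exactly on a tiny sample by enumerating every resample. The following is a minimal sketch under assumed inputs: four hypothetical data values, a biweight kernel, bandwidth $h = 0.3$ and evaluation point $x = 0.5$. With $n = 4$ there are $4^4 = 256$ equally likely ordered resamples.

```python
import itertools

def biweight(u):
    """Biweight kernel (assumed choice): (15/16) * (1 - u^2)^2 on [-1, 1], else 0."""
    return 15.0 / 16.0 * (1.0 - u * u) ** 2 if abs(u) <= 1.0 else 0.0

def kde(x, data, h):
    """Kernel density estimate: (nh)^{-1} * sum_i K((x - X_i) / h)."""
    n = len(data)
    return sum(biweight((x - xi) / h) for xi in data) / (n * h)

data = [0.1, 0.4, 0.45, 0.9]  # hypothetical sample
h, x = 0.3, 0.5

f_hat = kde(x, data, h)

# Every ordered resample of size n has conditional probability n^(-n),
# so the conditional expectation is a plain average over all of them.
resamples = list(itertools.product(data, repeat=len(data)))
boot_mean = sum(kde(x, r, h) for r in resamples) / len(resamples)
bias_hat = boot_mean - f_hat

print("f_hat(x) =", f_hat)
print("E{f_hat*(x) | data} =", boot_mean)
print("bootstrap bias estimate =", bias_hat)
```

Up to floating-point rounding the bootstrap bias estimate is zero, whatever the data, kernel or bandwidth, even though the true bias of $\hat f(x)$ is of order $h^2$.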