






























Nonparametric curve estimation methods, focusing on pointwise and simultaneous confidence regions. It covers the impact of bandwidth on bias and variance, and the challenges of bias correction in nonparametric problems using the bootstrap. The document also includes an exercise on mean and variance of the estimator.































(Seventh set of two lectures)
Main topic of these lectures: Bootstrap methods for nonparametric curve estimation
Pointwise versus simultaneous confidence regions
We shall dispose of this topic first, so that we can focus subsequently on other issues.
Suppose we have an estimator $\hat g$ of a function $g$ on an interval $I$, and, for a given level $1 - \alpha$ of probability, have constructed a confidence region, or “tube,” for $g$, consisting of a boundary above and a boundary below the curve represented by the formula $y = \hat g(x)$, for $x \in I$.
Pointwise versus simultaneous confidence regions (cont. 1)
The region can be interpreted as the union of intervals $(\hat g_1(x), \hat g_2(x))$, for $x \in I$. Of course, $\hat g_1$ and $\hat g_2$ are constructed from data, and satisfy $\hat g_1 \le \hat g_2$.
Such a region is commonly referred to as a “$(1 - \alpha)$-level confidence region for $g$ on the interval $I$.”
We can interpret the statement in two ways. Either (i) the interval $(\hat g_1(x), \hat g_2(x))$ covers $g(x)$ with probability approximately $1 - \alpha$, for each $x \in I$ (the pointwise interpretation); or (ii) the probability that the graph represented by the equation $y = g(x)$ lies within the tube converges to $1 - \alpha$ as $n$ increases (the simultaneous interpretation).
Pointwise versus simultaneous confidence regions (cont. 3)
This close relationship vanishes in nonparametric cases, however. There, simultaneous confidence regions are generally an order of magnitude wider than their parametric counterparts.
Although the factor by which the width increases is proportional only to $(\log n)^{1/2}$, in asymptotic terms, the increase is generally substantial, and this alone causes simultaneous bands to be unpopular.
When coupled with the relative lack of interest in predicting the value of $E(Y \mid X = x_0)$ simultaneously for many values of $x_0$, this means that the pointwise interpretation is the obvious choice in at least the setting of nonparametric regression. We shall adopt it in the density estimation context, too.
Our treatment of confidence regions in the setting of nonparametric curve estimation will address only the case of nonparametric density estimation. Nonparametric regression is broadly similar.
Local estimation
Nonparametric curve estimators (usually, estimators of densities or regression means) work without imposing structural assumptions.
Typically the only conditions required are that the function in question have sufficiently many bounded derivatives. That is, it should be sufficiently smooth.
Since only smoothness is assumed, the value of the estimator at a particular point, x say, is based largely, if not wholly, on data values close to x.
In density estimation, where we have a sample $X_1, \ldots, X_n$ from the distribution with density $f$, this means that to estimate $f$ at $x$ we use only those $X_i$'s that are close to $x$.
In regression, where estimators are based on data $(x_1, Y_1), \ldots, (x_n, Y_n)$ and we wish to estimate $g(x) = E(Y \mid X = x)$, we use only those pairs $(x_i, Y_i)$ for which $x_i$ is close to $x$.
Example: Nonparametric density estimation
Suppose we sample independent and identically distributed data $\mathcal{X} = \{X_1, \ldots, X_n\}$ from a distribution with density $f$. We wish to estimate this function.
Let $K$ be a bounded, compactly supported, symmetric probability density, let $h > 0$ denote a “bandwidth,” and put
$$\hat f(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - X_i}{h}\right).$$
This is our estimator of $f(x)$.
The estimator $\hat f$ is itself a density; it is nonnegative and it integrates to 1, since $K$ has both those properties.
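The definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the biweight kernel, the simulated N(0,1) sample, the bandwidth value 0.5 and the names `biweight`, `kde` and `integrate` are all choices made here, not part of the notes. The final lines check numerically that the estimate integrates to approximately 1.

```python
import math
import random

def biweight(u):
    """Biweight kernel: a bounded, compactly supported, symmetric density."""
    return 15.0 / 16.0 * (1.0 - u * u) ** 2 if abs(u) <= 1.0 else 0.0

def kde(x, data, h, kernel=biweight):
    """Kernel density estimate f_hat(x) = (nh)^{-1} sum_i K((x - X_i)/h)."""
    n = len(data)
    return sum(kernel((x - xi) / h) for xi in data) / (n * h)

def integrate(func, a, b, m=2000):
    """Midpoint rule on [a, b] with m cells."""
    step = (b - a) / m
    return step * sum(func(a + (j + 0.5) * step) for j in range(m))

rng = random.Random(0)                      # fixed seed: illustrative data only
sample = [rng.gauss(0.0, 1.0) for _ in range(200)]
h = 0.5                                     # bandwidth chosen arbitrarily here
lo, hi = min(sample) - h, max(sample) + h   # f_hat vanishes outside [lo, hi]
total_mass = integrate(lambda x: kde(x, sample, h), lo, hi)
print(kde(0.0, sample, h), total_mass)
```

Because the kernel has support [-1, 1], only observations within distance h of x contribute to `kde(x, ...)`, which is the "local" character of the estimator described earlier.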
Reliance of $\hat f$ on bandwidth
Note that
$$\psi_i(x) = h^{-1} K\left(\frac{x - X_i}{h}\right)$$
is itself a density, for each fixed $X_i$: $\psi_i \ge 0$, $\int \psi_i = 1$.
The density $\psi_i$ gets narrower and taller as $h$ decreases. Our estimator $\hat f$ is obtained by simply averaging the values of the $\psi_i$'s. Clearly, adjusting the bandwidth affects the shape, and hence the properties, of $\hat f$.
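A small numerical sketch of the "narrower and taller" behaviour, assuming a biweight kernel and a hypothetical data point at 0.3 (neither is specified in the notes): for each bandwidth it reports the peak height of the scaled kernel, the width of its support, and its total mass.

```python
def biweight(u):
    """Biweight kernel on [-1, 1]."""
    return 15.0 / 16.0 * (1.0 - u * u) ** 2 if abs(u) <= 1.0 else 0.0

def psi(x, xi, h):
    """psi_i(x) = h^{-1} K((x - X_i)/h): a density centred at xi with spread h."""
    return biweight((x - xi) / h) / h

def mass(xi, h, m=2000):
    """Midpoint-rule integral of psi_i over its support [xi - h, xi + h]."""
    step = 2.0 * h / m
    return step * sum(psi(xi - h + (j + 0.5) * step, xi, h) for j in range(m))

xi = 0.3                                   # hypothetical data point
for h in (1.0, 0.5, 0.1):
    print(f"h={h}: peak={psi(xi, xi, h):.4f}, "
          f"support width={2 * h}, mass={mass(xi, h):.6f}")
```

The peak equals K(0)/h, so halving the bandwidth doubles the height while the mass stays at 1.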
Exercise. Prove from these results, and elementary calculus, that if $h = h(n) \to 0$ as $n \to \infty$, in such a manner that $nh \to \infty$; and if $f$ has two continuous derivatives in a neighbourhood of $x$; then
$$E\{\hat f(x)\} = f(x) + \tfrac{1}{2} \kappa_2 h^2 f''(x) + o(h^2), \qquad \mathrm{var}\{\hat f(x)\} = (nh)^{-1} \kappa f(x) + o\{(nh)^{-1}\},$$
where $\kappa = \int K^2$ and $\kappa_2 = \int u^2 K(u)\, du$.
The first result here implies that $\hat f(x)$ is asymptotically unbiased for $f(x)$. That is, as $n$ increases, the difference between $E\{\hat f(x)\}$, and the quantity $f(x)$ that $\hat f(x)$ is estimating, converges to zero.
The second result implies that the variance of $\hat f(x)$ converges to zero as $n$ increases.
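The two expansions can be checked numerically without simulation, since the exact mean and variance of the estimator are one-dimensional integrals: the mean is $\int K(u) f(x - hu)\, du$ and the variance is $n^{-1}[h^{-1} \int K(u)^2 f(x - hu)\, du - \{E \hat f(x)\}^2]$. The sketch below assumes a biweight kernel, a standard normal $f$, the point $x = 0$ and the bandwidth $h = n^{-1/5}$; all of these, and the function names, are illustrative choices rather than anything fixed by the notes.

```python
import math

def biweight(u):
    return 15.0 / 16.0 * (1.0 - u * u) ** 2 if abs(u) <= 1.0 else 0.0

def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def integrate(func, a, b, m=2000):
    """Midpoint rule on [a, b] with m cells."""
    step = (b - a) / m
    return step * sum(func(a + (j + 0.5) * step) for j in range(m))

KAPPA = integrate(lambda u: biweight(u) ** 2, -1.0, 1.0)       # int K^2   (5/7 for the biweight)
KAPPA2 = integrate(lambda u: u * u * biweight(u), -1.0, 1.0)   # int u^2 K (1/7 for the biweight)

def exact_mean_var(x, n, h, f=normal_pdf):
    """Exact mean and variance of f_hat(x) for i.i.d. data with density f."""
    m1 = integrate(lambda u: biweight(u) * f(x - h * u), -1.0, 1.0)
    m2 = integrate(lambda u: biweight(u) ** 2 * f(x - h * u), -1.0, 1.0) / h
    return m1, (m2 - m1 * m1) / n

def asymptotic_mean_var(x, n, h):
    """Leading terms from the Exercise, specialised to the N(0,1) density."""
    f0 = normal_pdf(x)
    f2 = (x * x - 1.0) * f0            # second derivative of the N(0,1) density
    return f0 + 0.5 * KAPPA2 * h * h * f2, KAPPA * f0 / (n * h)

x = 0.0
for n in (10**3, 10**5, 10**7):
    h = n ** -0.2
    (m, v), (am, av) = exact_mean_var(x, n, h), asymptotic_mean_var(x, n, h)
    bias_ratio = (m - normal_pdf(x)) / (am - normal_pdf(x))
    print(n, round(h, 4), bias_ratio, v / av)
```

Both printed ratios should approach 1 as n grows. The variance ratio converges more slowly because the neglected term is of relative order h.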
Mean squared error
Therefore, the mean squared error of $\hat f(x)$ is given by
$$E\{\hat f(x) - f(x)\}^2 = \mathrm{var}\{\hat f(x)\} + \{E \hat f(x) - f(x)\}^2 = \frac{C_1}{nh} + C_2 h^4 + o\{(nh)^{-1} + h^4\},$$
where the constants $C_1 = \kappa f(x)$ and $C_2 = \tfrac{1}{4} \kappa_2^2 f''(x)^2$ depend on $x$.
(Proof: Use the results of the Exercise.)
It follows that the optimal choice of $h$, for the purpose of minimising mean squared error, is of size $n^{-1/5}$.
This order of magnitude of bandwidth brings the variance and squared-bias terms, of respective orders $(nh)^{-1}$ and $h^4$, into balance.
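The $n^{-1/5}$ rate can be made explicit by minimising the leading terms of the mean squared error. The short derivation below is a sketch that assumes $f''(x) \neq 0$, so that $C_2 > 0$; the symbols $M(h)$ and $h_{\mathrm{opt}}$ are notation introduced here, not in the notes.

```latex
\begin{align*}
M(h) &= \frac{C_1}{nh} + C_2 h^4, \\
M'(h) &= -\frac{C_1}{n h^2} + 4 C_2 h^3 = 0
  \quad\Longrightarrow\quad
  h_{\mathrm{opt}} = \left(\frac{C_1}{4 C_2}\right)^{1/5} n^{-1/5}, \\
M(h_{\mathrm{opt}}) &= \frac{5}{4}\, C_1^{4/5} (4 C_2)^{1/5}\, n^{-4/5}.
\end{align*}
```

At $h_{\mathrm{opt}}$ the variance term is exactly four times the squared-bias term, and both are of order $n^{-4/5}$.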
Choice of kernel
Common choices of $K$ are the standard normal density,
$$K(u) = \frac{1}{\sqrt{2\pi}} \exp\left(-\tfrac{1}{2} u^2\right),$$
and the $k$-weight kernel,
$$K(u) = c_k \left(1 - u^2\right)^k$$
for $|u| \le 1$ (and $K(u) = 0$ otherwise), where $k \ge 1$ is an integer and the constant $c_k$ is chosen so that $K$ integrates to 1. The case $k = 2$ is popular; then $K$ is called the “biweight kernel.”
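Both kernels are easy to code. The sketch below computes the normalising constant from the Beta-integral identity $\int_{-1}^{1} (1 - u^2)^k\, du = \sqrt{\pi}\, \Gamma(k+1) / \Gamma(k + 3/2)$; the function names are hypothetical and chosen only for this illustration.

```python
import math

def gaussian_kernel(u):
    """Standard normal density."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def c_k(k):
    """Normalising constant of the k-weight kernel:
    1 / int_{-1}^{1} (1 - u^2)^k du = Gamma(k + 3/2) / (sqrt(pi) * Gamma(k + 1))."""
    return math.gamma(k + 1.5) / (math.sqrt(math.pi) * math.gamma(k + 1.0))

def k_weight_kernel(u, k=2):
    """k-weight kernel c_k (1 - u^2)^k on [-1, 1]; k = 2 gives the biweight."""
    return c_k(k) * (1.0 - u * u) ** k if abs(u) <= 1.0 else 0.0

for k in (1, 2, 3):
    print(k, c_k(k))
```

For k = 1, 2, 3 the constants are 3/4, 15/16 and 35/32 respectively.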
High-order kernels
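More generally, we can take $K$ to satisfy
$$\int_{-\infty}^{\infty} u^j K(u)\, du \;\begin{cases} = 1 & \text{if } j = 0, \\ = 0 & \text{if } 1 \le j \le r - 1, \\ \neq 0 & \text{if } j = r, \end{cases} \tag{1}$$
for a given integer $r \ge 1$.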
When using a kernel of this type the variance of $\hat f$ remains of size $(nh)^{-1}$, but the order of bias changes from $h^2$ to $h^r$. Therefore, by choosing $r > 2$ we can improve, at least in theory, the mean-square performance of $\hat f$. In particular, the order of mean squared error can be reduced to $n^{-2r/(2r+1)}$ by choosing $h$ of size $n^{-1/(2r+1)}$.
Note, however, that if $r > 2$ and (1) is to hold, then $K$ must take negative values, and as a result $\hat f$ is no longer guaranteed to be nonnegative.
We shall take r = 2 in all the arguments below.
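For readers who want a concrete case with $r > 2$, the following sketch checks the moment conditions numerically for one commonly used fourth-order kernel, $K(u) = \tfrac{1}{2}(3 - u^2)\phi(u)$, where $\phi$ is the standard normal density. This particular kernel is an example chosen here, not one given in the notes, and it is not compactly supported; the integration range and grid are also arbitrary choices.

```python
import math

def normal_pdf(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def fourth_order_kernel(u):
    """K(u) = (3 - u^2) phi(u) / 2, a Gaussian-based kernel with r = 4."""
    return 0.5 * (3.0 - u * u) * normal_pdf(u)

def moment(kernel, j, a=-12.0, b=12.0, m=24000):
    """Midpoint-rule approximation to int u^j K(u) du (tails beyond [a, b] ignored)."""
    step = (b - a) / m
    total = 0.0
    for i in range(m):
        u = a + (i + 0.5) * step
        total += u ** j * kernel(u)
    return step * total

moments = [moment(fourth_order_kernel, j) for j in range(5)]
print([round(v, 6) for v in moments])       # moments of order 0, 1, 2, 3, 4
print(fourth_order_kernel(2.0) < 0.0)       # the kernel is negative for u^2 > 3
```

The zeroth moment is 1, moments one to three vanish, and the fourth moment is nonzero, so the kernel has order 4; the price is that it takes negative values in its tails.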
Nonparametric and semiparametric problems (cont.)
The effective number of parameters that are being fitted, when constructing the density estimator $\hat f$ in a given interval, equals the number of bandwidths that can be fitted into the interval.
Therefore the number of fitted parameters diverges at rate $n^{1/5}$ as sample size grows.
In comparison, estimation of “global” characteristics, such as mean, variance and other moments, is a semiparametric problem. Although in such cases estimation involves a potentially infinite number of unknowns, conventional convergence rates can be attained.
Implications for the bootstrap
We have not, so far, encountered cases where the effective number of parameters grew unboundedly as sample size increased.
The implications for the bootstrap are manifested in at least two ways: difficulties with bias, and a worsening of overall convergence rate (including the order of magnitude of coverage error).
Both these difficulties are manifested to some extent in more conventional, parametric problems, where the number of parameters is large, although fixed (as sample size increases).
In such instances, difficulties with bias and accuracy arise frequently, although in a theoretical treatment they do not result in an actual deterioration of convergence rate.
Bias in semiparametric problems (cont.)
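Moreover, the bootstrap estimator of bias,
$$\widehat{\mathrm{bias}} = E(\hat\theta^* \mid \mathcal{X}) - \hat\theta,$$
accurately approximates bias. Indeed,
$$\widehat{\mathrm{bias}} = \mathrm{bias} + O_p\left(n^{-3/2}\right),$$
and this high degree of precision led us to suggest $\hat\theta - \widehat{\mathrm{bias}}$ as a bias-corrected estimator of $\theta$:
$$\hat\theta_{\mathrm{bc}} = \hat\theta - \widehat{\mathrm{bias}} = \hat\theta - \{E(\hat\theta^* \mid \mathcal{X}) - \hat\theta\} = 2\hat\theta - E(\hat\theta^* \mid \mathcal{X}).$$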
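As a hedged illustration of the bias-corrected estimator, take $\theta$ to be the square of the population mean and $\hat\theta$ the square of the sample mean. This choice of $\theta$, the exponential sample, the seed and the function names are assumptions made for the sketch; the notes do not specify an example here. The conditional expectation $E(\hat\theta^* \mid \mathcal{X})$ is approximated by Monte Carlo, and for this estimator it can also be written in closed form, which gives a check.

```python
import random

def theta_hat(sample):
    """Plug-in estimator of theta = (population mean)^2."""
    m = sum(sample) / len(sample)
    return m * m

def bootstrap_bias(sample, estimator, n_boot=4000, rng=None):
    """Monte Carlo approximation to E(theta_hat* | X) - theta_hat."""
    rng = rng or random.Random()
    n = len(sample)
    total = 0.0
    for _ in range(n_boot):
        resample = [sample[rng.randrange(n)] for _ in range(n)]   # draw with replacement
        total += estimator(resample)
    return total / n_boot - estimator(sample)

rng = random.Random(1)                                  # fixed seed: illustrative data only
data = [rng.expovariate(1.0) for _ in range(30)]
bias_hat = bootstrap_bias(data, theta_hat, rng=rng)
theta_bc = theta_hat(data) - bias_hat                   # equals 2*theta_hat - E(theta_hat* | X)

# Closed form for this estimator: E(mean*^2 | X) - mean^2 = sum (X_i - mean)^2 / n^2.
mean = sum(data) / len(data)
exact_bias_hat = sum((x - mean) ** 2 for x in data) / len(data) ** 2
print(bias_hat, exact_bias_hat, theta_bc)
```

The Monte Carlo value differs from the closed form only through resampling noise, which shrinks as `n_boot` increases.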
Bias in nonparametric problems
Reflecting the infinite-parameter nature of nonparametric density estimation, the bootstrap fails rather spectacularly to approximate bias.
To appreciate this point, let $\mathcal{X}^* = \{X_1^*, \ldots, X_n^*\}$ denote a resample drawn by sampling randomly, with replacement, from $\mathcal{X}$. Then the standard bootstrap form of $\hat f$ is $\hat f^*$, defined by
$$\hat f^*(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - X_i^*}{h}\right).$$
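Now, since each $X_i^*$ is equally likely to be any one of $X_1, \ldots, X_n$,
$$E\left\{ K\left(\frac{x - X_i^*}{h}\right) \,\Big|\, \mathcal{X} \right\} = \frac{1}{n} \sum_{j=1}^{n} K\left(\frac{x - X_j}{h}\right) = h \hat f(x).$$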
Therefore,
$$E\{\hat f^*(x) \mid \mathcal{X}\} = \hat f(x),$$
implying that the bootstrap estimator of bias is
$$\widehat{\mathrm{bias}} = E\{\hat f^*(x) \mid \mathcal{X}\} - \hat f(x) = 0.$$
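The vanishing bias estimate is easy to observe numerically. The sketch below assumes a biweight kernel, a simulated N(0,1) sample of size 100, the point x = 0.5 and bandwidth h = 0.4 (all arbitrary choices for illustration). It approximates the conditional expectation of the resampled estimator by Monte Carlo and compares it with the original estimate.

```python
import random

def biweight(u):
    return 15.0 / 16.0 * (1.0 - u * u) ** 2 if abs(u) <= 1.0 else 0.0

def kde(x, data, h):
    """Kernel density estimate f_hat(x) = (nh)^{-1} sum_i K((x - X_i)/h)."""
    return sum(biweight((x - xi) / h) for xi in data) / (len(data) * h)

rng = random.Random(2)                       # fixed seed: illustrative data only
data = [rng.gauss(0.0, 1.0) for _ in range(100)]
x, h, n_boot = 0.5, 0.4, 2000
n = len(data)

f_hat = kde(x, data, h)

# Exact conditional expectation of one resampled kernel term: the sample average.
kernel_term_mean = sum(biweight((x - xj) / h) for xj in data) / n

# Monte Carlo approximation to E{f_hat*(x) | X} over n_boot resamples.
boot_mean = sum(
    kde(x, [data[rng.randrange(n)] for _ in range(n)], h) for _ in range(n_boot)
) / n_boot

print(f_hat, boot_mean, boot_mean - f_hat)   # difference is resampling noise only
```

Whatever the true bias of the estimator at x, the bootstrap estimate of it is zero up to Monte Carlo error, which is the failure described above.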