


These notes present Hoeffding's inequality and apply it to gauge how close empirical risks are to their expected values when bounding classification error. They also introduce Bonferroni's bound and use it to bound the probability that any empirical risk deviates significantly from its expected value. Finally, they explain why the resulting bound is distribution-free and what it implies for the minimum empirical risk classifier.



This work is produced by The Connexions Project and licensed under the Creative Commons Attribution License †
Given a set of training data $\{X_i, Y_i\}_{i=1}^{n}$ and a finite collection of candidate functions $\mathcal{F}$, select $\hat{f}_n \in \mathcal{F}$ that (hopefully) is a good predictor for future cases. That is,
$$\hat{f}_n = \arg\min_{f \in \mathcal{F}} \hat{R}_n(f) \tag{1}$$
where $\hat{R}_n(f)$ is the empirical risk. For any particular $f \in \mathcal{F}$, the corresponding empirical risk is defined as
$$\hat{R}_n(f) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{\{f(X_i) \neq Y_i\}}. \tag{2}$$
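As a minimal sketch of definitions (1) and (2), the Python below computes the empirical risk of each classifier in a small finite class and returns a minimizer. The threshold classifiers, the toy data, and the names `empirical_risk`, `erm`, and `make_threshold` are illustrative assumptions, not part of the notes.

```python
from typing import Callable, Sequence

Classifier = Callable[[float], int]


def empirical_risk(f: Classifier, xs: Sequence[float], ys: Sequence[int]) -> float:
    """Fraction of training pairs (x, y) with f(x) != y, as in equation (2)."""
    return sum(1 for x, y in zip(xs, ys) if f(x) != y) / len(xs)


def erm(candidates: Sequence[Classifier], xs: Sequence[float], ys: Sequence[int]) -> Classifier:
    """Return a candidate with the smallest empirical risk, as in equation (1)."""
    return min(candidates, key=lambda f: empirical_risk(f, xs, ys))


def make_threshold(t: float) -> Classifier:
    # Hypothetical classifier family: predict 1 when x >= t, else 0.
    return lambda x: int(x >= t)


# Toy training set (assumed for illustration only).
xs = [0.1, 0.2, 0.4, 0.6, 0.7, 0.9]
ys = [0, 0, 1, 1, 1, 1]

thresholds = [0.25, 0.5, 0.75]
candidates = [make_threshold(t) for t in thresholds]
risks = [empirical_risk(f, xs, ys) for f in candidates]
best = erm(candidates, xs, ys)

for t, r in zip(thresholds, risks):
    print(f"threshold={t}: empirical risk={r:.3f}")
```

Because the class is finite, the minimization is a plain search over the candidates; ties are broken in favor of the first minimizer in the list.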
Hoeffding's inequality (Chernoff's bound in this case) allows us to gauge how close $\hat{R}_n(f)$ is to the true risk of $f$, $R(f)$, in probability:
$$P\left(\left|\hat{R}_n(f) - R(f)\right| \ge \epsilon\right) \le 2e^{-2n\epsilon^2}. \tag{3}$$
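One way to get a feel for inequality (3) is a small simulation for a single fixed classifier. The sketch below assumes the classifier's true risk is 0.3 and that its errors on the training points are i.i.d. Bernoulli draws; the function names, the seed, and all numeric settings are assumptions chosen for illustration.

```python
import math
import random


def hoeffding_bound(n: int, eps: float) -> float:
    """Right-hand side of equation (3): 2 * exp(-2 * n * eps^2)."""
    return 2.0 * math.exp(-2.0 * n * eps ** 2)


def deviation_frequency(risk: float, n: int, eps: float, trials: int, seed: int = 0) -> float:
    """Estimate P(|R_hat_n(f) - R(f)| >= eps) by simulation.

    Each trial draws n i.i.d. Bernoulli(risk) error indicators, which is how
    the losses 1{f(X_i) != Y_i} of one fixed classifier f behave.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        errors = sum(1 for _ in range(n) if rng.random() < risk)
        if abs(errors / n - risk) >= eps:
            hits += 1
    return hits / trials


n, eps = 100, 0.1
bound = hoeffding_bound(n, eps)
freq = deviation_frequency(risk=0.3, n=n, eps=eps, trials=2000)
print(f"simulated frequency = {freq:.4f}, Hoeffding bound = {bound:.4f}")
```

The simulated frequency should sit below the bound; the gap reflects that Hoeffding's inequality holds for every distribution and is therefore not tight for any particular one.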
Since our selection process involves deciding among all $f \in \mathcal{F}$, we would like to gauge how close the empirical risks are to their expected values. We can do this by studying the probability that one or more of the empirical risks deviates significantly from its expected value. This is captured by the probability
$$P\left(\max_{f \in \mathcal{F}} \left|\hat{R}_n(f) - R(f)\right| \ge \epsilon\right). \tag{4}$$
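∗Version 1.2: Feb 11, 2009 10:37 am US/Central †http://creativecommons.org/licenses/by/2.0/

Note that the event
$$\max_{f \in \mathcal{F}} \left|\hat{R}_n(f) - R(f)\right| \ge \epsilon \tag{5}$$
is equivalent to the union of the events
$$\bigcup_{f \in \mathcal{F}} \left\{ \left|\hat{R}_n(f) - R(f)\right| \ge \epsilon \right\}. \tag{6}$$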
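Therefore, we can use Bonferroni's bound (also known as the union of events bound or union bound) to obtain
$$
\begin{aligned}
P\left(\max_{f \in \mathcal{F}} \left|\hat{R}_n(f) - R(f)\right| \ge \epsilon\right)
&= P\left(\bigcup_{f \in \mathcal{F}} \left\{\left|\hat{R}_n(f) - R(f)\right| \ge \epsilon\right\}\right) \\
&\le \sum_{f \in \mathcal{F}} P\left(\left|\hat{R}_n(f) - R(f)\right| \ge \epsilon\right) \\
&\le \sum_{f \in \mathcal{F}} 2e^{-2n\epsilon^2} \\
&= 2|\mathcal{F}|e^{-2n\epsilon^2},
\end{aligned}
\tag{7}
$$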
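where $|\mathcal{F}|$ is the number of classifiers in $\mathcal{F}$. In the proof of Hoeffding's inequality we also obtained a one-sided inequality that implied
$$P\left(R(f) - \hat{R}_n(f) \ge \epsilon\right) \le e^{-2n\epsilon^2} \tag{8}$$
and hence
$$P\left(\max_{f \in \mathcal{F}} \left(R(f) - \hat{R}_n(f)\right) \ge \epsilon\right) \le |\mathcal{F}|e^{-2n\epsilon^2}. \tag{9}$$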
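To see how the size of the class enters these bounds, the sketch below evaluates the right-hand sides of (7) and (9) for a few sample sizes. The function names and the example values (1000 classifiers, ε = 0.1) are assumptions for illustration; the cap at 1 is added only because a probability bound above 1 carries no information.

```python
import math


def union_bound(num_classifiers: int, n: int, eps: float) -> float:
    """Two-sided bound (7): 2 |F| exp(-2 n eps^2), capped at 1."""
    return min(1.0, 2.0 * num_classifiers * math.exp(-2.0 * n * eps ** 2))


def one_sided_union_bound(num_classifiers: int, n: int, eps: float) -> float:
    """One-sided bound (9): |F| exp(-2 n eps^2), capped at 1."""
    return min(1.0, num_classifiers * math.exp(-2.0 * n * eps ** 2))


size, eps = 1000, 0.1
for n in (100, 500, 1000):
    two_sided = union_bound(size, n, eps)
    one_sided = one_sided_union_bound(size, n, eps)
    print(f"n={n}: two-sided={two_sided:.3e}, one-sided={one_sided:.3e}")
```

The bound grows only linearly in the number of classifiers but decays exponentially in the sample size, so a moderately larger sample compensates for a much larger class.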
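We can restate the inequality above as follows. For all $\delta > 0$, with probability at least $1 - \delta$, the following holds simultaneously for all $f \in \mathcal{F}$:
$$R(f) \le \hat{R}_n(f) + \sqrt{\frac{\log|\mathcal{F}| + \log(1/\delta)}{2n}}. \tag{10}$$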
This follows by setting $\delta = |\mathcal{F}|e^{-2n\epsilon^2}$ and solving for $\epsilon$. Thus with high probability (at least $1 - \delta$), the true risk of every $f \in \mathcal{F}$ is bounded by the empirical risk of $f$ plus a constant that depends on $\delta > 0$, the number of training samples $n$, and the size of $\mathcal{F}$. Most importantly, the bound does not depend on the unknown distribution $P_{XY}$. Therefore, we can call this a distribution-free bound.
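The algebra behind the substitution can be written out as a short worked step, starting from the right-hand side of the one-sided bound (9):

```latex
\begin{align*}
\delta = |\mathcal{F}|\, e^{-2n\epsilon^2}
  &\iff \log\delta = \log|\mathcal{F}| - 2n\epsilon^2 \\
  &\iff \epsilon^2 = \frac{\log|\mathcal{F}| + \log(1/\delta)}{2n} \\
  &\iff \epsilon = \sqrt{\frac{\log|\mathcal{F}| + \log(1/\delta)}{2n}} .
\end{align*}
```

As an illustrative calculation with assumed values $|\mathcal{F}| = 1000$, $\delta = 0.05$, and $n = 1000$ (natural logarithms), the deviation term is roughly $0.07$.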
We can use the distribution-free bound above to obtain a bound on the expected performance of the minimum empirical risk classifier
$$\hat{f}_n = \arg\min_{f \in \mathcal{F}} \hat{R}_n(f). \tag{11}$$
We are interested in bounding
$$E\left[R(\hat{f}_n)\right] - \min_{f \in \mathcal{F}} R(f), \tag{12}$$
the expected risk of $\hat{f}_n$ minus the minimum risk over all $f \in \mathcal{F}$. Note that this difference is always non-negative since $\hat{f}_n$ is at best as good as
$$f^* = \arg\min_{f \in \mathcal{F}} R(f). \tag{13}$$
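Write $C(\mathcal{F}, n, \delta) = \sqrt{\frac{\log|\mathcal{F}| + \log(1/\delta)}{2n}}$ for the deviation term in (10). On the event, of probability at least $1 - \delta$, where (10) holds, we have $R(\hat{f}_n) \le \hat{R}_n(\hat{f}_n) + C(\mathcal{F}, n, \delta) \le \hat{R}_n(f^*) + C(\mathcal{F}, n, \delta)$, because $\hat{f}_n$ minimizes the empirical risk. On the complementary event, of probability at most $\delta$, the difference $R(\hat{f}_n) - \hat{R}_n(f^*)$ is at most 1. Thus
$$E\left[R(\hat{f}_n) - \hat{R}_n(f^*)\right] \le C(\mathcal{F}, n, \delta) + \delta. \tag{23}$$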
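So we have
$$E\left[R(\hat{f}_n)\right] - \min_{f \in \mathcal{F}} R(f) \le \sqrt{\frac{\log|\mathcal{F}| + \log(1/\delta)}{2n}} + \delta,$$
since $E\left[\hat{R}_n(f^*)\right] = R(f^*) = \min_{f \in \mathcal{F}} R(f)$.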
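In particular, for $\delta = 1/\sqrt{n}$, we have
$$E\left[R(\hat{f}_n)\right] - \min_{f \in \mathcal{F}} R(f) \le \sqrt{\frac{\log|\mathcal{F}| + \frac{1}{2}\log n}{2n}} + \frac{1}{\sqrt{n}} \le \sqrt{\frac{\log|\mathcal{F}| + \log n + 2}{n}},$$
since $\sqrt{x} + \sqrt{y} \le \sqrt{2}\sqrt{x + y}$, $\forall x, y > 0$.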
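For completeness, the elementary inequality and its use in the last step can be checked as follows. The symbols $x$ and $y$ are instantiated with the two terms obtained for $\delta = 1/\sqrt{n}$; the final relaxation from $\tfrac{1}{2}\log n$ to $\log n$ only needs $n \ge 1$.

```latex
\begin{align*}
\left(\sqrt{x} + \sqrt{y}\right)^2
  &= x + y + 2\sqrt{xy} \;\le\; 2(x + y)
  && \text{because } 2\sqrt{xy} \le x + y, \\
\sqrt{x} + \sqrt{y} &\le \sqrt{2(x + y)} . \\[4pt]
\text{With } x = \frac{\log|\mathcal{F}| + \tfrac{1}{2}\log n}{2n},
  \quad y = \frac{1}{n}: \qquad
\sqrt{2(x + y)}
  &= \sqrt{\frac{\log|\mathcal{F}| + \tfrac{1}{2}\log n + 2}{n}}
  \;\le\; \sqrt{\frac{\log|\mathcal{F}| + \log n + 2}{n}} .
\end{align*}
```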
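Let $\mathcal{F}$ be the collection of all classifiers with $M$ equal-volume cells. Then $|\mathcal{F}| = 2^M$, and the histogram classification rule
$$\hat{f}_n = \arg\min_{f \in \mathcal{F}} \left\{ \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{\{f(X_i) \neq Y_i\}} \right\}$$
satisfies
$$E\left[R(\hat{f}_n)\right] - \min_{f \in \mathcal{F}} R(f) \le \sqrt{\frac{M \log 2 + 2 + \log n}{n}},$$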
which suggests the choice $M = \log_2 n$ (balancing $M \log 2$ with $\log n$), resulting in
$$E\left[R(\hat{f}_n)\right] - \min_{f \in \mathcal{F}} R(f) = O\left(\sqrt{\frac{\log n}{n}}\right).$$
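The effect of choosing M = log2(n) can be checked numerically. The sketch below evaluates the histogram bound for a few sample sizes and compares it with the rate sqrt(log n / n); the function name and the chosen sample sizes (powers of two, so that log2(n) is an integer) are assumptions for illustration.

```python
import math


def histogram_excess_risk_bound(num_cells: int, n: int) -> float:
    """sqrt((M log 2 + 2 + log n) / n), using |F| = 2^M for M cells."""
    return math.sqrt((num_cells * math.log(2) + 2.0 + math.log(n)) / n)


for n in (2 ** 10, 2 ** 14, 2 ** 18):
    m = int(math.log2(n))  # the choice M = log2(n) suggested in the text
    bound = histogram_excess_risk_bound(m, n)
    rate = math.sqrt(math.log(n) / n)
    print(f"n={n}, M={m}: bound={bound:.4f}, bound/rate={bound / rate:.3f}")
```

With this choice of M the ratio between the bound and sqrt(log n / n) equals sqrt(2 + 2 / log n), which stays bounded as n grows, in line with the stated O(sqrt(log n / n)) rate.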