Error Bounds for Classification: Hoeffding's Inequality and Bonferroni's Bound, Study notes of Probability and Statistics

These notes cover Hoeffding's inequality and its use in gauging how close empirical risks are to their expected values in the context of classification error bounds. They also introduce Bonferroni's bound and use it to bound the probability that any empirical risk deviates significantly from its expected value. Finally, they explain the distribution-free nature of the resulting bound and its implications for the minimum empirical risk classifier.

Typology: Study notes

2011/2012

Uploaded on 10/19/2012

Connexions module: m16265

Classification Error Bounds

Robert Nowak

This work is produced by The Connexions Project and licensed under the Creative Commons Attribution License

1 Recap: Classifier design

Given a set of training data $\{X_i, Y_i\}_{i=1}^{n}$ and a finite collection of candidate functions $\mathcal{F}$, select $\hat{f}_n \in \mathcal{F}$ that (hopefully) is a good predictor for future cases. That is,

$$\hat{f}_n = \arg\min_{f \in \mathcal{F}} \hat{R}_n(f) \tag{1}$$

where $\hat{R}_n(f)$ is the empirical risk. For any particular $f \in \mathcal{F}$, the corresponding empirical risk is defined as

$$\hat{R}_n(f) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{\{f(X_i) \neq Y_i\}}. \tag{2}$$
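
As a concrete illustration of (1) and (2), the following minimal sketch computes empirical risks and picks the minimizer over a small finite class. The threshold classifiers, the toy data, and the names `empirical_risk`, `erm`, and `make_threshold` are assumptions made for this example only; they do not come from the notes.

```python
from typing import Callable, List, Sequence

Classifier = Callable[[float], int]


def empirical_risk(f: Classifier, xs: Sequence[float], ys: Sequence[int]) -> float:
    """Fraction of training points that f misclassifies, as in (2)."""
    n = len(xs)
    return sum(1 for x, y in zip(xs, ys) if f(x) != y) / n


def erm(candidates: List[Classifier], xs: Sequence[float], ys: Sequence[int]) -> Classifier:
    """Return a candidate with the smallest empirical risk, as in (1)."""
    return min(candidates, key=lambda f: empirical_risk(f, xs, ys))


def make_threshold(t: float) -> Classifier:
    """Hypothetical classifier: predict 1 when x >= t, else 0."""
    return lambda x: 1 if x >= t else 0


# A finite class F with four candidates (illustrative choice).
candidates = [make_threshold(t) for t in (0.2, 0.4, 0.6, 0.8)]

# Toy training data (illustrative, not from the notes).
xs = [0.1, 0.3, 0.5, 0.7, 0.9]
ys = [0, 0, 1, 1, 1]

risks = [empirical_risk(f, xs, ys) for f in candidates]
best = erm(candidates, xs, ys)
print(risks)
```

On this toy data the threshold at 0.4 separates the two labels exactly, so it attains zero empirical risk and is the one selected.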

2 Hoeffding's inequality

Hoeffding's inequality (Chernoff's bound in this case) allows us to gauge how close $\hat{R}_n(f)$ is to the true risk of $f$, $R(f)$, in probability:

$$P\left( \left| \hat{R}_n(f) - R(f) \right| \geq \varepsilon \right) \leq 2 e^{-2 n \varepsilon^2}. \tag{3}$$
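
To get a feel for the scale of (3), the sketch below evaluates the right-hand side and compares it with a simulated deviation frequency for a single fixed classifier. The true risk `p = 0.3`, the sample size, the number of trials, and the function names are all assumptions chosen for illustration.

```python
import math
import random


def hoeffding_two_sided(n: int, eps: float) -> float:
    """Right-hand side of (3): 2 * exp(-2 * n * eps^2)."""
    return 2.0 * math.exp(-2.0 * n * eps ** 2)


def deviation_frequency(p: float, n: int, eps: float, trials: int, seed: int = 0) -> float:
    """Simulate how often |empirical risk - p| >= eps for one fixed classifier.

    Each training point is misclassified independently with probability p,
    so the empirical risk is an average of n Bernoulli(p) indicators.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        errors = sum(1 for _ in range(n) if rng.random() < p)
        if abs(errors / n - p) >= eps:
            hits += 1
    return hits / trials


bound = hoeffding_two_sided(n=200, eps=0.1)
freq = deviation_frequency(p=0.3, n=200, eps=0.1, trials=2000)
print(f"bound={bound:.4f}, simulated={freq:.4f}")
```

The bound holds for every value of `p`, so for any particular `p` the simulated frequency is typically well below it.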

Since our selection process involves deciding among all $f \in \mathcal{F}$, we would like to gauge how close the empirical risks are to their expected values. We can do this by studying the probability that one or more of the empirical risks deviates significantly from its expected value. This is captured by the probability

$$P\left( \max_{f \in \mathcal{F}} \left| \hat{R}_n(f) - R(f) \right| \geq \varepsilon \right). \tag{4}$$

Version 1.2: Feb 11, 2009 10:37 am US/Central
http://creativecommons.org/licenses/by/2.0/
http://cnx.org/content/m16265/1.2/

Note that the event

$$\left\{ \max_{f \in \mathcal{F}} \left| \hat{R}_n(f) - R(f) \right| \geq \varepsilon \right\} \tag{5}$$

is equivalent to the union of the events

$$\bigcup_{f \in \mathcal{F}} \left\{ \left| \hat{R}_n(f) - R(f) \right| \geq \varepsilon \right\}. \tag{6}$$

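Therefore, we can use Bonferroni's bound (aka the union of events bound or union bound) to obtain

$$
\begin{aligned}
P\left( \max_{f \in \mathcal{F}} \left| \hat{R}_n(f) - R(f) \right| \geq \varepsilon \right)
&= P\left( \bigcup_{f \in \mathcal{F}} \left\{ \left| \hat{R}_n(f) - R(f) \right| \geq \varepsilon \right\} \right) \\
&\leq \sum_{f \in \mathcal{F}} P\left( \left| \hat{R}_n(f) - R(f) \right| \geq \varepsilon \right) \\
&\leq \sum_{f \in \mathcal{F}} 2 e^{-2 n \varepsilon^2} \\
&= 2 |\mathcal{F}| e^{-2 n \varepsilon^2}
\end{aligned}
\tag{7}
$$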

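where $|\mathcal{F}|$ is the number of classifiers in $\mathcal{F}$. In the proof of Hoeffding's inequality we also obtained a one-sided inequality that implied

$$P\left( R(f) - \hat{R}_n(f) \geq \varepsilon \right) \leq e^{-2 n \varepsilon^2} \tag{8}$$

and hence

$$P\left( \max_{f \in \mathcal{F}} \left( R(f) - \hat{R}_n(f) \right) \geq \varepsilon \right) \leq |\mathcal{F}| e^{-2 n \varepsilon^2}. \tag{9}$$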
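
The two union bounds are easy to evaluate numerically. The following sketch simply codes the final expressions $2|\mathcal{F}|e^{-2n\varepsilon^2}$ and $|\mathcal{F}|e^{-2n\varepsilon^2}$; the function names and the example values of $|\mathcal{F}|$, $n$, and $\varepsilon$ are assumptions for illustration.

```python
import math


def union_bound_two_sided(num_classifiers: int, n: int, eps: float) -> float:
    """2 |F| exp(-2 n eps^2): bound on P(max_f |R_hat(f) - R(f)| >= eps)."""
    return 2.0 * num_classifiers * math.exp(-2.0 * n * eps ** 2)


def union_bound_one_sided(num_classifiers: int, n: int, eps: float) -> float:
    """|F| exp(-2 n eps^2): bound on P(max_f (R(f) - R_hat(f)) >= eps), as in (9)."""
    return num_classifiers * math.exp(-2.0 * n * eps ** 2)


# Illustrative values: 1000 candidate classifiers, eps = 0.05.
for n in (500, 2000, 5000):
    two = union_bound_two_sided(1000, n, 0.05)
    one = union_bound_one_sided(1000, n, 0.05)
    # Values above 1 carry no information, since probabilities never exceed 1.
    print(n, two, one)
```

The price of covering all of $\mathcal{F}$ at once is only the multiplicative factor $|\mathcal{F}|$, which the exponential term overwhelms once $n$ is large relative to $\log|\mathcal{F}| / \varepsilon^2$.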

We can restate the inequality above as follows. For all $\delta > 0$, with probability at least $1 - \delta$, the following holds for all $f \in \mathcal{F}$:

$$R(f) \leq \hat{R}_n(f) + \sqrt{\frac{\log|\mathcal{F}| + \log(1/\delta)}{2n}}. \tag{10}$$

This follows by setting $\delta = |\mathcal{F}| e^{-2 n \varepsilon^2}$ and solving for $\varepsilon$. Thus with high probability ($1 - \delta$), the true risk of every $f \in \mathcal{F}$ is bounded by the empirical risk of $f$ plus a constant that depends on $\delta > 0$, the number of training samples $n$, and the size of $\mathcal{F}$. Most importantly, the bound does not depend on the unknown distribution $P_{XY}$. Therefore, we can call this a distribution-free bound.
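
The step "set $\delta = |\mathcal{F}| e^{-2n\varepsilon^2}$ and solve for $\varepsilon$" can be checked numerically. This is a minimal sketch; `deviation_term` and `samples_needed` are hypothetical helper names, and the second function (solving for $n$ instead of $\varepsilon$) is an extra rearrangement added here for illustration.

```python
import math


def deviation_term(num_classifiers: int, n: int, delta: float) -> float:
    """sqrt((log|F| + log(1/delta)) / (2n)): the additive term in the bound."""
    return math.sqrt((math.log(num_classifiers) + math.log(1.0 / delta)) / (2.0 * n))


def samples_needed(num_classifiers: int, eps: float, delta: float) -> int:
    """Smallest integer n for which |F| exp(-2 n eps^2) <= delta."""
    return math.ceil((math.log(num_classifiers) + math.log(1.0 / delta)) / (2.0 * eps ** 2))


# Round trip: plugging the solved eps back in recovers delta.
eps = deviation_term(num_classifiers=1000, n=5000, delta=0.05)
recovered_delta = 1000 * math.exp(-2.0 * 5000 * eps ** 2)

# How many samples make the deviation term at most 0.05 with delta = 0.05?
n_req = samples_needed(num_classifiers=1000, eps=0.05, delta=0.05)
print(eps, recovered_delta, n_req)
```

Note that the term grows only logarithmically in $|\mathcal{F}|$ and $1/\delta$, while it shrinks like $1/\sqrt{n}$.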

3 Error Bounds

We can use the distribution-free bound above to obtain a bound on the expected performance of the minimum empirical risk classifier

$$\hat{f}_n = \arg\min_{f \in \mathcal{F}} \hat{R}_n(f). \tag{11}$$

We are interested in bounding

$$E\left[ R(\hat{f}_n) \right] - \min_{f \in \mathcal{F}} R(f), \tag{12}$$

the expected risk of $\hat{f}_n$ minus the minimum risk over all $f \in \mathcal{F}$. Note that this difference is always non-negative since $\hat{f}_n$ is at best as good as

$$f^* = \arg\min_{f \in \mathcal{F}} R(f). \tag{13}$$

Thus

$$E\left[ R(\hat{f}_n) - \hat{R}_n(f^*) \right] \leq C(\mathcal{F}, n, \delta) + \delta. \tag{23}$$
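
The steps leading from (13) to (23) do not appear in this excerpt. One standard route, offered here only as a sketch and not as the author's derivation, conditions on the high-probability event of the distribution-free bound from Section 2. It assumes $C(\mathcal{F}, n, \delta) = \sqrt{(\log|\mathcal{F}| + \log(1/\delta))/(2n)}$ and introduces the event $\Omega$ as a hypothetical piece of notation.

```latex
% Sketch only: the event \Omega and this particular route are assumptions,
% not taken from the source text.
\begin{align*}
\Omega &= \Bigl\{ R(f) \le \hat{R}_n(f) + C(\mathcal{F}, n, \delta)
          \ \text{for all } f \in \mathcal{F} \Bigr\},
          \qquad P(\Omega) \ge 1 - \delta, \\
\text{on } \Omega: \quad
R(\hat{f}_n) &\le \hat{R}_n(\hat{f}_n) + C(\mathcal{F}, n, \delta)
              \le \hat{R}_n(f^*) + C(\mathcal{F}, n, \delta), \\
\text{on } \Omega^{c}: \quad
R(\hat{f}_n) - \hat{R}_n(f^*) &\le 1, \\
E\bigl[ R(\hat{f}_n) - \hat{R}_n(f^*) \bigr]
  &\le C(\mathcal{F}, n, \delta)\, P(\Omega) + 1 \cdot P(\Omega^{c})
   \le C(\mathcal{F}, n, \delta) + \delta .
\end{align*}
```

The second line uses the fact that $\hat{f}_n$ minimizes the empirical risk, so $\hat{R}_n(\hat{f}_n) \le \hat{R}_n(f^*)$; the third uses that risks lie between 0 and 1.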

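Since $f^*$ does not depend on the training data, $E[\hat{R}_n(f^*)] = R(f^*) = \min_{f \in \mathcal{F}} R(f)$. So we have

$$E\left[ R(\hat{f}_n) \right] - \min_{f \in \mathcal{F}} R(f) \leq \sqrt{\frac{\log|\mathcal{F}| + \log(1/\delta)}{2n}} + \delta, \quad \forall \delta > 0. \tag{24}$$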

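In particular, for $\delta = 1/\sqrt{n}$, we have

$$E\left[ R(\hat{f}_n) \right] - \min_{f \in \mathcal{F}} R(f) \leq \sqrt{\frac{\log|\mathcal{F}| + \tfrac{1}{2}\log n}{2n}} + \frac{1}{\sqrt{n}} \leq \sqrt{\frac{\log|\mathcal{F}| + \log n + 2}{n}},$$

since $\sqrt{x} + \sqrt{y} \leq \sqrt{2}\,\sqrt{x + y}$ for all $x, y > 0$, and $\tfrac{1}{2}\log n \leq \log n$.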
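
The simplification for $\delta = 1/\sqrt{n}$ can be sanity-checked numerically. The sketch below evaluates the right-hand side of (24) at $\delta = 1/\sqrt{n}$ and compares it with the simplified form $\sqrt{(\log|\mathcal{F}| + \log n + 2)/n}$; the function names and the choice $|\mathcal{F}| = 64$ are assumptions for illustration.

```python
import math


def excess_risk_bound(num_classifiers: int, n: int, delta: float) -> float:
    """Right-hand side of (24): sqrt((log|F| + log(1/delta)) / (2n)) + delta."""
    return math.sqrt((math.log(num_classifiers) + math.log(1.0 / delta)) / (2.0 * n)) + delta


def simplified_bound(num_classifiers: int, n: int) -> float:
    """Simplified form sqrt((log|F| + log n + 2) / n)."""
    return math.sqrt((math.log(num_classifiers) + math.log(n) + 2.0) / n)


rows = []
for n in (10, 100, 1000, 10000):
    exact = excess_risk_bound(64, n, delta=1.0 / math.sqrt(n))
    loose = simplified_bound(64, n)
    rows.append((n, exact, loose))
    print(n, round(exact, 4), round(loose, 4))
```

The simplified form is slightly looser at every $n$ but has the same $\sqrt{\log n / n}$ decay, which is all the final rate statement needs.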

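4 Application: Histogram Classifier

Let $\mathcal{F}$ be the collection of all classifiers with $M$ equal volume cells. Then $|\mathcal{F}| = 2^M$, since each cell can be assigned either of two labels, and the histogram classification rule

$$\hat{f}_n = \arg\min_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{\{f(X_i) \neq Y_i\}}$$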

satisfies

$$E\left[ R(\hat{f}_n) \right] - \min_{f \in \mathcal{F}} R(f) \leq \sqrt{\frac{M \log 2 + 2 + \log n}{n}},$$

which suggests the choice $M = \log_2 n$ (balancing $M \log 2$ with $\log n$), resulting in

$$E\left[ R(\hat{f}_n) \right] - \min_{f \in \mathcal{F}} R(f) = O\left( \sqrt{\frac{\log n}{n}} \right).$$
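
To see the effect of the choice $M = \log_2 n$, the following sketch evaluates the histogram bound and compares it with the rate $\sqrt{\log n / n}$. Rounding $\log_2 n$ to an integer number of cells is an assumption made here so that $M$ is a valid cell count; the function names are also hypothetical.

```python
import math


def histogram_bound(num_cells: int, n: int) -> float:
    """sqrt((M log 2 + 2 + log n) / n) for the class with |F| = 2^M."""
    return math.sqrt((num_cells * math.log(2.0) + 2.0 + math.log(n)) / n)


def suggested_cells(n: int) -> int:
    """M = log2(n), rounded to an integer (rounding is an assumption)."""
    return max(1, round(math.log2(n)))


ratios = []
for k in range(4, 21, 4):
    n = 2 ** k
    m = suggested_cells(n)
    bound = histogram_bound(m, n)
    rate = math.sqrt(math.log(n) / n)
    ratios.append(bound / rate)
    print(n, m, round(bound, 4), round(bound / rate, 3))
```

With $M = \log_2 n$ the numerator becomes $2\log n + 2$, so the ratio of the bound to $\sqrt{\log n / n}$ is $\sqrt{2 + 2/\log n}$, which stays bounded as $n$ grows.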