Uniform Bounds, Lecture Notes - Mathematics, Study notes of Probability and Statistics

Uniform Bounds, Lecture Notes - Mathematics, Probability, Uniform Bounds, Finite Classes, Shattering, Vapnik-Chervonenkis, Sauer's Theorem, Bounding Expectations

Typology: Study notes

2010/2011

Uploaded on 11/03/2011

sergeybrin
Lecture Notes 3

1 Uniform Bounds

Recall that, if $X_1, \ldots, X_n \sim \mathrm{Bernoulli}(p)$ and $\hat{p}_n = n^{-1} \sum_{i=1}^n X_i$ then, from Hoeffding’s inequality,

$$\mathbb{P}(|\hat{p}_n - p| > \epsilon) \le 2 e^{-2n\epsilon^2}.$$
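To get a feel for the size of this bound, the following minimal sketch evaluates the right-hand side and inverts it for the sample size. The function names and the tolerance `delta` are illustrative choices, not part of the notes; the sample-size formula comes from solving $2e^{-2n\epsilon^2} \le \delta$ for $n$.

```python
import math

def hoeffding_bound(n, eps):
    """Right-hand side of Hoeffding's inequality: 2 * exp(-2 * n * eps^2)."""
    return 2 * math.exp(-2 * n * eps ** 2)

def sample_size(eps, delta):
    """Smallest integer n with 2 * exp(-2 * n * eps^2) <= delta.

    Solving the inequality for n gives n >= log(2 / delta) / (2 * eps^2).
    """
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

# Illustrative values: accuracy eps = 0.1 with failure probability 0.05.
n = sample_size(0.1, 0.05)
print(n, hoeffding_bound(n, 0.1))
```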

Sometimes we want to say more than this.

Example 1 Suppose that $X_1, \ldots, X_n$ have cdf $F$. Let

$$F_n(t) = \frac{1}{n} \sum_{i=1}^n I(X_i \le t).$$

We call $F_n$ the empirical cdf. How close is $F_n$ to $F$? That is, how big is $|F_n(t) - F(t)|$? From Hoeffding’s inequality,

$$\mathbb{P}(|F_n(t) - F(t)| > \epsilon) \le 2 e^{-2n\epsilon^2}.$$

But that is only for one point $t$. How big is $\sup_t |F_n(t) - F(t)|$? We would like a bound of the form

$$\mathbb{P}\left( \sup_t |F_n(t) - F(t)| > \epsilon \right) \le \text{something small}.$$
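The quantity $\sup_t |F_n(t) - F(t)|$ can be computed exactly for a given sample. The sketch below does this under the assumption that $F$ is the Uniform(0,1) cdf, so $F(t) = t$ on $[0,1]$; the function names and the seed are illustrative.

```python
import random

def empirical_cdf(sample):
    """Return F_n as a function: F_n(t) = (1/n) * #{i : X_i <= t}."""
    xs = list(sample)
    n = len(xs)
    def F_n(t):
        return sum(1 for x in xs if x <= t) / n
    return F_n

def sup_deviation_uniform(sample):
    """sup_t |F_n(t) - t| for a sample with values in [0, 1].

    F_n is a step function, so the supremum is attained at, or just
    before, one of the sorted sample points.
    """
    xs = sorted(sample)
    n = len(xs)
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))

random.seed(0)  # arbitrary seed, only for reproducibility
sample = [random.random() for _ in range(1000)]
print(sup_deviation_uniform(sample))
```

Running this for increasing sample sizes shows the supremum shrinking, which is the behavior a uniform bound is meant to quantify.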

Example 2 Suppose that $X_1, \ldots, X_n \sim P$. Let

$$P_n(A) = \frac{1}{n} \sum_{i=1}^n I(X_i \in A).$$

How close is $P_n(A)$ to $P(A)$? That is, how big is $|P_n(A) - P(A)|$? From Hoeffding’s inequality,

$$\mathbb{P}(|P_n(A) - P(A)| > \epsilon) \le 2 e^{-2n\epsilon^2}.$$

But that is only for one set $A$. How big is $\sup_{A \in \mathcal{A}} |P_n(A) - P(A)|$ for a class of sets $\mathcal{A}$? We would like a bound of the form

$$\mathbb{P}\left( \sup_{A \in \mathcal{A}} |P_n(A) - P(A)| > \epsilon \right) \le \text{something small}.$$

Example 3 (Classification.) Suppose we observe data $(X_1, Y_1), \ldots, (X_n, Y_n)$ where $Y_i \in \{0, 1\}$. Let $(X, Y)$ be a new pair. Suppose we observe $X$. Now we want to predict $Y$. A classifier $h$ is a function $h(x)$ which takes values in $\{0, 1\}$. When we observe $X$ we predict $Y$ with $h(X)$. The classification error, or risk, is the probability of an error:

$$R(h) = \mathbb{P}(Y \ne h(X)).$$

The training error is the fraction of errors on the observed data $(X_1, Y_1), \ldots, (X_n, Y_n)$:

$$\hat{R}(h) = \frac{1}{n} \sum_{i=1}^n I(Y_i \ne h(X_i)).$$

By Hoeffding’s inequality,

$$\mathbb{P}(|\hat{R}(h) - R(h)| > \epsilon) \le 2 e^{-2n\epsilon^2}.$$

How do we choose a classifier? One way is to start with a set of classifiers $\mathcal{H}$. Then we define $\hat{h}$ to be the member of $\mathcal{H}$ that minimizes the training error. Thus

$$\hat{h} = \mathrm{argmin}_{h \in \mathcal{H}} \hat{R}(h).$$

An example is the set of linear classifiers. Suppose that $x \in \mathbb{R}^d$. A linear classifier has the form $h(x) = 1$ if $\beta^T x \ge 0$ and $h(x) = 0$ if $\beta^T x < 0$, where $\beta = (\beta_1, \ldots, \beta_d)^T$ is a set of parameters. Although $\hat{h}$ minimizes $\hat{R}(h)$, it does not minimize $R(h)$. Let $h_*$ minimize the true error $R(h)$. A fundamental question is: how close is $R(\hat{h})$ to $R(h_*)$? We will see later that $R(\hat{h})$ is close to $R(h_*)$ if $\sup_h |\hat{R}(h) - R(h)|$ is small. So we want

$$\mathbb{P}\left( \sup_h |\hat{R}(h) - R(h)| > \epsilon \right) \le \text{something small}.$$
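Minimizing the training error over a class can be sketched for a small, finite class. The example below uses a hypothetical family of one-dimensional threshold classifiers $h_c(x) = I(x \ge c)$ and a made-up data set; neither appears in the notes, and they serve only to make $\hat{R}(h)$ and $\hat{h}$ concrete.

```python
def training_error(h, data):
    """R_hat(h): fraction of pairs (x, y) with h(x) != y."""
    return sum(1 for x, y in data if h(x) != y) / len(data)

def make_threshold(c):
    """Hypothetical one-dimensional classifier h_c(x) = 1 if x >= c else 0."""
    return lambda x: 1 if x >= c else 0

def erm(thresholds, data):
    """Return the threshold whose classifier minimizes the training error.

    Ties are broken in favor of the first minimizer in `thresholds`.
    """
    return min(thresholds, key=lambda c: training_error(make_threshold(c), data))

# Made-up data: pairs (x_i, y_i) with y_i in {0, 1}.
data = [(0.1, 0), (0.3, 0), (0.4, 1), (0.6, 0), (0.7, 1), (0.9, 1)]
thresholds = [0.0, 0.2, 0.35, 0.5, 0.65, 0.8, 1.0]

c_hat = erm(thresholds, data)
print(c_hat, training_error(make_threshold(c_hat), data))
```

The training error of the selected classifier is an optimistic estimate of its risk, which is why a bound on $\sup_h |\hat{R}(h) - R(h)|$ is needed rather than a bound for a single fixed $h$.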

More generally, we can state our goal as follows. For any function $f$ define

$$P(f) = \int f(x)\, dP(x), \qquad P_n(f) = \frac{1}{n} \sum_{i=1}^n f(X_i).$$

Let $\mathcal{F}$ be a set of functions. In our first example, each $f$ was of the form $f_t(x) = I(x \le t)$ and $\mathcal{F} = \{f_t : t \in \mathbb{R}\}$. We want to bound

$$\mathbb{P}\left( \sup_{f \in \mathcal{F}} |P_n(f) - P(f)| > \epsilon \right).$$

We will see that the bounds we obtain have the form

$$\mathbb{P}\left( \sup_{f \in \mathcal{F}} |P_n(f) - P(f)| > \epsilon \right) \le c_1 \kappa(\mathcal{F}) e^{-c_2 n \epsilon^2}$$

where $c_1$ and $c_2$ are positive constants and $\kappa(\mathcal{F})$ is a measure of the size (or complexity) of the class $\mathcal{F}$. Similarly, if $\mathcal{A}$ is a class of sets then we want a bound of the form

$$\mathbb{P}\left( \sup_{A \in \mathcal{A}} |P_n(A) - P(A)| > \epsilon \right) \le c_1 \kappa(\mathcal{A}) e^{-c_2 n \epsilon^2}$$

where $P_n(A) = n^{-1} \sum_{i=1}^n I(X_i \in A)$. Bounds like these are called uniform bounds since they hold uniformly over a class of functions or over a class of sets.

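  1. $\mathcal{A}$ = all discs in $\mathbb{R}^d$.
  2. $\mathcal{A}$ = all rectangles in $\mathbb{R}^d$.
  3. $\mathcal{A}$ = all half-spaces in $\mathbb{R}^d$ = $\{x : \beta^T x \ge 0\}$.
  4. $\mathcal{A}$ = all convex sets in $\mathbb{R}^d$.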

Let $F = \{x_1, \ldots, x_n\}$ be a finite set. Let $G$ be a subset of $F$. Say that $\mathcal{A}$ picks out $G$ if

$$A \cap F = G$$

for some $A \in \mathcal{A}$. For example, let $\mathcal{A} = \{(a, b) : a \le b\}$. Suppose that $F = \{1, 2, 7, 8, 9\}$ and $G = \{2, 7\}$. Then $\mathcal{A}$ picks out $G$ since $A \cap F = G$ if we choose $A = (1.5, 7.5)$ for example. Let $s(\mathcal{A}, F)$ be the number of these subsets picked out by $\mathcal{A}$. Of course $s(\mathcal{A}, F) \le 2^n$.

Example 4 Let $\mathcal{A} = \{(a, b) : a \le b\}$ and $F = \{1, 2, 3\}$. Then $\mathcal{A}$ can pick out:

$$\emptyset, \{1\}, \{2\}, \{3\}, \{1, 2\}, \{2, 3\}, \{1, 2, 3\}.$$

So $s(\mathcal{A}, F) = 7$. Note that $7 < 8 = 2^3$. If $F = \{1, 6\}$ then $\mathcal{A}$ can pick out:

$$\emptyset, \{1\}, \{6\}, \{1, 6\}.$$

In this case $s(\mathcal{A}, F) = 4 = 2^2$.
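The counts in Example 4 can be reproduced by enumeration. The sketch below relies on the observation that an interval picks out exactly the runs of consecutive points of $F$ (in sorted order) together with the empty set; the function names are illustrative.

```python
def picked_out_by_intervals(points):
    """Return all subsets G of `points` with G = A ∩ F for some interval A.

    An interval can only capture a run of consecutive points in sorted
    order, or no points at all.
    """
    xs = sorted(set(points))
    subsets = {frozenset()}
    for i in range(len(xs)):
        for j in range(i, len(xs)):
            subsets.add(frozenset(xs[i:j + 1]))
    return subsets

def s(points):
    """Number of subsets of `points` picked out by the class of intervals."""
    return len(picked_out_by_intervals(points))

print(s([1, 2, 3]))  # F = {1, 2, 3}
print(s([1, 6]))     # F = {1, 6}
```

For $F = \{1, 2, 3\}$ the subset $\{1, 3\}$ is the one that is missing, since any interval containing 1 and 3 also contains 2.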

We say that $F$ is shattered if $s(\mathcal{A}, F) = 2^n$ where $n$ is the number of points in $F$.

Let $\mathcal{F}_n$ denote all finite sets with $n$ elements.

Define the shatter coefficient

$$s_n(\mathcal{A}) = \sup_{F \in \mathcal{F}_n} s(\mathcal{A}, F).$$

Note that $s_n(\mathcal{A}) \le 2^n$.

The following theorem is due to Vapnik and Chervonenkis. The proof is beyond the scope of the course. (If you take 10-702/36-702 you will learn the proof.)

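| Class $\mathcal{A}$ | VC dimension $V_{\mathcal{A}}$ |
|---|---|
| $\mathcal{A} = \{A_1, \ldots, A_N\}$ | $\le \log_2 N$ |
| Intervals $[a, b]$ on the real line | $2$ |
| Discs in $\mathbb{R}^2$ | $3$ |
| Closed balls in $\mathbb{R}^d$ | $\le d + 2$ |
| Rectangles in $\mathbb{R}^d$ | $2d$ |
| Half-spaces in $\mathbb{R}^d$ | $d + 1$ |
| Convex polygons in $\mathbb{R}^2$ | $\infty$ |
| Convex polygons with $d$ vertices | $2d + 1$ |

Table 1: The VC dimension of some classes $\mathcal{A}$.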

Theorem 5 Let $\mathcal{A}$ be a class of sets. Then

$$\mathbb{P}\left( \sup_{A \in \mathcal{A}} |P_n(A) - P(A)| > \epsilon \right) \le 8\, s_n(\mathcal{A})\, e^{-n\epsilon^2/32}. \qquad (1)$$
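A quick numerical look at the right-hand side of (1) shows how large $n$ must be before the bound says anything. The sketch below plugs in the class of intervals on the real line, for which counting runs of consecutive points (as in Example 4) gives $s_n(\mathcal{A}) = n(n+1)/2 + 1$; that closed form and the function names are assumptions of this illustration rather than statements from the notes.

```python
import math

def interval_shatter_coeff(n):
    """Shatter coefficient of intervals: n(n+1)/2 nonempty runs plus the empty set."""
    return n * (n + 1) // 2 + 1

def vc_bound(n, eps, shatter_coeff):
    """Right-hand side of (1): 8 * s_n(A) * exp(-n * eps^2 / 32)."""
    return 8 * shatter_coeff * math.exp(-n * eps ** 2 / 32)

# Illustrative accuracy eps = 0.1 at several sample sizes.
for n in (10**4, 10**5, 10**6):
    print(n, vc_bound(n, 0.1, interval_shatter_coeff(n)))
```

Because of the constant 32 in the exponent, the bound exceeds 1 (and is therefore uninformative) at moderate sample sizes, and only becomes small once the exponential factor overwhelms the polynomial growth of $s_n(\mathcal{A})$.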

This partly solves one of our problems. But, how big can $s_n(\mathcal{A})$ be? Sometimes $s_n(\mathcal{A}) = 2^n$ for all $n$. For example, let $\mathcal{A}$ be all polygons in the plane. Then $s_n(\mathcal{A}) = 2^n$ for all $n$. But, in many cases, we will see that $s_n(\mathcal{A}) = 2^n$ for all $n$ up to some integer $d$ and then $s_n(\mathcal{A}) < 2^n$ for all $n > d$.

The Vapnik-Chervonenkis (VC) dimension is

$$d = d(\mathcal{A}) = \text{largest } n \text{ such that } s_n(\mathcal{A}) = 2^n.$$

In other words, $d$ is the size of the largest set that can be shattered.

Thus, $s_n(\mathcal{A}) = 2^n$ for all $n \le d$ and $s_n(\mathcal{A}) < 2^n$ for all $n > d$. The VC dimensions of some common examples are summarized in Table 1. Now here is an interesting question: for $n > d$ how does $s_n(\mathcal{A})$ behave? It is less than $2^n$ but how much less?

Theorem 6 (Sauer’s Theorem) Suppose that $\mathcal{A}$ has finite VC dimension $d$. Then, for all $n \ge d$,

$$s_n(\mathcal{A}) \le (n + 1)^d. \qquad (2)$$
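Sauer's bound can be checked numerically for intervals on the real line, which have VC dimension 2 according to Table 1. The sketch again assumes the closed form $s_n(\mathcal{A}) = n(n+1)/2 + 1$ obtained by counting runs of consecutive points; the function names are illustrative.

```python
def interval_shatter_coeff(n):
    """Shatter coefficient of intervals: n(n+1)/2 nonempty runs plus the empty set."""
    return n * (n + 1) // 2 + 1

def sauer_bound(n, d):
    """Right-hand side of (2): (n + 1)^d."""
    return (n + 1) ** d

d = 2  # VC dimension of intervals, as listed in Table 1
for n in (2, 3, 5, 10, 100):
    # Columns: n, exact shatter coefficient, Sauer bound, trivial bound 2^n.
    print(n, interval_shatter_coeff(n), sauer_bound(n, d), 2 ** n)
```

The exact coefficient equals $2^n$ up to $n = 2$, falls below $2^n$ from $n = 3$ on, and stays under the polynomial bound $(n+1)^2$, which is the qualitative picture the theorem describes.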

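Hence, for any $s$,

$$\begin{aligned}
\mathbb{E}(Z_n^2) &= \int_0^\infty \mathbb{P}(Z_n^2 > t)\, dt \\
&= \int_0^s \mathbb{P}(Z_n^2 > t)\, dt + \int_s^\infty \mathbb{P}(Z_n^2 > t)\, dt \\
&\le s + \int_s^\infty \mathbb{P}(Z_n^2 > t)\, dt \\
&\le s + 2N \int_s^\infty e^{-2nt}\, dt \\
&= s + 2N\, \frac{e^{-2ns}}{2n} \\
&= s + \frac{N}{n}\, e^{-2ns}.
\end{aligned}$$

Let $s = \log(N)/(2n)$. Then

$$\mathbb{E}(Z_n^2) \le s + \frac{N}{n}\, e^{-2ns} = \frac{\log N}{2n} + \frac{1}{n} = \frac{\log N + 2}{2n}.$$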

Finally, we use Cauchy-Schwarz:

$$\mathbb{E}(Z_n) \le \sqrt{\mathbb{E}(Z_n^2)} \le \sqrt{\frac{\log N + 2}{2n}} = O\left( \sqrt{\frac{\log N}{n}} \right).$$

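In summary:

$$\mathbb{E}\left( \max_{1 \le j \le N} |P_n(A_j) - P(A_j)| \right) = O\left( \sqrt{\frac{\log N}{n}} \right).$$

For a single set $A$ we would have $\mathbb{E}|P_n(A) - P(A)| \le O(1/\sqrt{n})$. The bound for the maximum over $N$ sets is larger only by a factor of order $\sqrt{\log N}$, so it increases very slowly with $N$.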
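The explicit bound $\sqrt{(\log N + 2)/(2n)}$ can be compared with a Monte Carlo estimate of the expected maximum deviation. The sketch below assumes, as the summary suggests, that $Z_n = \max_{1 \le j \le N} |P_n(A_j) - P(A_j)|$. The Uniform(0,1) data, the sets $A_j = [0, j/(N+1))$, the seed, and the function names are all illustrative choices, not taken from the notes.

```python
import math
import random

def expected_max_bound(N, n):
    """Explicit bound on E(Z_n): sqrt((log N + 2) / (2 n))."""
    return math.sqrt((math.log(N) + 2) / (2 * n))

def simulate_max_deviation(N, n, reps, rng):
    """Monte Carlo estimate of E max_j |P_n(A_j) - P(A_j)|.

    Data are Uniform(0,1) and A_j = [0, j/(N+1)), so P(A_j) = j/(N+1).
    """
    probs = [j / (N + 1) for j in range(1, N + 1)]
    total = 0.0
    for _ in range(reps):
        xs = [rng.random() for _ in range(n)]
        total += max(abs(sum(1 for x in xs if x < p) / n - p) for p in probs)
    return total / reps

rng = random.Random(1)  # arbitrary seed, only for reproducibility
print(simulate_max_deviation(10, 200, 200, rng), expected_max_bound(10, 200))
```

The simulated value should fall below the bound, since the bound holds for any collection of $N$ sets and is not tuned to this particular example.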