Uniform Bounds, Lecture Notes - Mathematics, Study notes of Probability and Statistics

Carnegie Mellon University (CMU), Probability and Statistics

Uniform Bounds, Lecture Notes - Mathematics, Probability, Uniform Bounds, Finite Classes, Shattering, Vapnik-Chervonenkis, Sauer's Theorem, Bounding Expectations

Typology: Study notes

2010/2011

Uploaded on 11/03/2011

sergeybrin 🇺🇸



Lecture Notes 3

1 Uniform Bounds

Recall that, if $X_1, \ldots, X_n \sim \mathrm{Bernoulli}(p)$ and $\hat{p}_n = n^{-1} \sum_{i=1}^n X_i$ then, from Hoeffding's inequality,

$$P(|\hat{p}_n - p| > \epsilon) \le 2 e^{-2n\epsilon^2}.$$
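As a quick numerical illustration of the inequality just recalled, the following sketch estimates the deviation probability by simulation and compares it with the Hoeffding bound. The function names, the seed, and the particular values of n, p and ε are choices made for this example only, not part of the notes.

```python
import math
import random

def hoeffding_bound(n, eps):
    """Right-hand side of Hoeffding's inequality: 2 * exp(-2 * n * eps^2)."""
    return 2.0 * math.exp(-2.0 * n * eps ** 2)

def deviation_frequency(n, p, eps, trials, seed=0):
    """Monte Carlo estimate of P(|p_hat_n - p| > eps) for Bernoulli(p) data."""
    rng = random.Random(seed)
    exceed = 0
    for _ in range(trials):
        # p_hat_n is the sample mean of n Bernoulli(p) draws.
        p_hat = sum(rng.random() < p for _ in range(n)) / n
        if abs(p_hat - p) > eps:
            exceed += 1
    return exceed / trials

# Illustrative values (assumptions for this sketch).
n, p, eps = 100, 0.5, 0.1
freq = deviation_frequency(n, p, eps, trials=2000)
bound = hoeffding_bound(n, eps)
print(f"estimated probability {freq:.4f}, Hoeffding bound {bound:.4f}")
```

The estimate should sit below the bound; the bound holds for every p, so it is typically not tight for any particular one.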

Sometimes we want to say more than this.

Example 1 Suppose that $X_1, \ldots, X_n$ have cdf $F$. Let

$$F_n(t) = \frac{1}{n} \sum_{i=1}^n I(X_i \le t).$$

We call $F_n$ the empirical cdf. How close is $F_n$ to $F$? That is, how big is $|F_n(t) - F(t)|$? From Hoeffding's inequality,

$$P(|F_n(t) - F(t)| > \epsilon) \le 2 e^{-2n\epsilon^2}.$$

But that is only for one point $t$. How big is $\sup_t |F_n(t) - F(t)|$? We would like a bound of the form

$$P\left( \sup_t |F_n(t) - F(t)| > \epsilon \right) \le \text{something small}.$$
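To make the quantity in Example 1 concrete, here is a minimal sketch that computes $\sup_t |F_n(t) - F(t)|$ exactly for a sample. It assumes $F$ is continuous and uses the Uniform(0, 1) distribution purely as an illustration; `sup_distance` and `uniform_cdf` are names invented for this example.

```python
import random

def sup_distance(sample, cdf):
    """sup_t |F_n(t) - F(t)| for a continuous cdf F.

    F_n is a step function that jumps at the sample points, so the
    supremum is attained at, or just before, one of those points.
    """
    xs = sorted(sample)
    n = len(xs)
    worst = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # With 0-based index i: F_n is i/n just below x and (i+1)/n at x.
        worst = max(worst, abs((i + 1) / n - f), abs(f - i / n))
    return worst

def uniform_cdf(t):
    """cdf of the Uniform(0, 1) distribution (illustrative choice of F)."""
    return min(1.0, max(0.0, t))

rng = random.Random(1)
for n in (100, 1000, 10000):
    sample = [rng.random() for _ in range(n)]
    print(n, round(sup_distance(sample, uniform_cdf), 4))
```

The printed distances should shrink as n grows, which is the behaviour a uniform bound is meant to quantify.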

Example 2 Suppose that $X_1, \ldots, X_n \sim P$. Let

$$P_n(A) = \frac{1}{n} \sum_{i=1}^n I(X_i \in A).$$

How close is $P_n(A)$ to $P(A)$? That is, how big is $|P_n(A) - P(A)|$? From Hoeffding's inequality,

$$P(|P_n(A) - P(A)| > \epsilon) \le 2 e^{-2n\epsilon^2}.$$

But that is only for one set $A$. How big is $\sup_{A \in \mathcal{A}} |P_n(A) - P(A)|$ for a class of sets $\mathcal{A}$? We would like a bound of the form

$$P\left( \sup_{A \in \mathcal{A}} |P_n(A) - P(A)| > \epsilon \right) \le \text{something small}.$$

Example 3 (Classification.) Suppose we observe data $(X_1, Y_1), \ldots, (X_n, Y_n)$ where $Y_i \in \{0, 1\}$. Let $(X, Y)$ be a new pair. Suppose we observe $X$. Now we want to predict $Y$. A classifier $h$ is a function $h(x)$ which takes values in $\{0, 1\}$. When we observe $X$ we predict $Y$ with $h(X)$. The classification error, or risk, is the probability of an error:

$$R(h) = P(Y \ne h(X)).$$

The training error is the fraction of errors on the observed data $(X_1, Y_1), \ldots, (X_n, Y_n)$:

$$\hat{R}(h) = \frac{1}{n} \sum_{i=1}^n I(Y_i \ne h(X_i)).$$

By Hoeffding's inequality,

$$P(|\hat{R}(h) - R(h)| > \epsilon) \le 2 e^{-2n\epsilon^2}.$$

How do we choose a classifier? One way is to start with a set of classifiers $\mathcal{H}$. Then we define $\hat{h}$ to be the member of $\mathcal{H}$ that minimizes the training error. Thus

$$\hat{h} = \mathrm{argmin}_{h \in \mathcal{H}} \hat{R}(h).$$

An example is the set of linear classifiers. Suppose that $x \in \mathbb{R}^d$. A linear classifier has the form $h(x) = 1$ if $\beta^T x \ge 0$ and $h(x) = 0$ if $\beta^T x < 0$, where $\beta = (\beta_1, \ldots, \beta_d)^T$ is a set of parameters.

Although $\hat{h}$ minimizes $\hat{R}(h)$, it does not minimize $R(h)$. Let $h^*$ minimize the true error $R(h)$. A fundamental question is: how close is $R(\hat{h})$ to $R(h^*)$? We will see later that $R(\hat{h})$ is close to $R(h^*)$ if $\sup_h |\hat{R}(h) - R(h)|$ is small. So we want

$$P\left( \sup_h |\hat{R}(h) - R(h)| > \epsilon \right) \le \text{something small}.$$
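The link between the sup gap and the quality of $\hat{h}$ can be seen in a small simulation. The sketch below is an illustration under assumptions that are not in the notes: one-dimensional data with $X \sim$ Uniform(0, 1), labels $Y = I(X \ge 0.5)$ flipped with probability 0.1, and a finite class of threshold classifiers $h_c(x) = I(x \ge c)$ on a grid. Under this model the true risk has a closed form, so the gap can be computed exactly.

```python
import random

def make_data(n, noise, rng):
    """X ~ Uniform(0, 1); Y = I(X >= 0.5), each label flipped w.p. `noise`."""
    data = []
    for _ in range(n):
        x = rng.random()
        y = int(x >= 0.5)
        if rng.random() < noise:
            y = 1 - y
        data.append((x, y))
    return data

def training_error(c, data):
    """Fraction of observations misclassified by h_c(x) = I(x >= c)."""
    return sum(int(x >= c) != y for x, y in data) / len(data)

def true_risk(c, noise):
    """Exact risk of h_c under the assumed model, valid for 0 <= c <= 1."""
    return noise + (1 - 2 * noise) * abs(c - 0.5)

noise = 0.1
thresholds = [k / 20 for k in range(21)]  # the finite class H (assumed)
data = make_data(2000, noise, random.Random(0))

train = {c: training_error(c, data) for c in thresholds}
c_hat = min(thresholds, key=train.get)    # empirical risk minimizer
sup_gap = max(abs(train[c] - true_risk(c, noise)) for c in thresholds)
excess = true_risk(c_hat, noise) - noise  # R(h_hat) - R(h*), h* is c = 0.5

print("ERM threshold:", c_hat)
print("sup gap:", round(sup_gap, 4))
print("excess risk:", round(excess, 4))
```

In this setup the excess risk can never exceed twice the sup gap, by the standard argument of adding and subtracting the training errors of $\hat{h}$ and $h^*$; the simulation simply shows the two numbers side by side.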

More generally, we can state our goal as follows. For any function $f$ define

$$P(f) = \int f(x)\, dP(x), \qquad P_n(f) = \frac{1}{n} \sum_{i=1}^n f(X_i).$$

Let $\mathcal{F}$ be a set of functions. In our first example, each $f$ was of the form $f_t(x) = I(x \le t)$ and $\mathcal{F} = \{f_t : t \in \mathbb{R}\}$. We want to bound

$$P\left( \sup_{f \in \mathcal{F}} |P_n(f) - P(f)| > \epsilon \right).$$

We will see that the bounds we obtain have the form

$$P\left( \sup_{f \in \mathcal{F}} |P_n(f) - P(f)| > \epsilon \right) \le c_1\, \kappa(\mathcal{F})\, e^{-c_2 n \epsilon^2}$$

where $c_1$ and $c_2$ are positive constants and $\kappa(\mathcal{F})$ is a measure of the size (or complexity) of the class $\mathcal{F}$. Similarly, if $\mathcal{A}$ is a class of sets then we want a bound of the form

$$P\left( \sup_{A \in \mathcal{A}} |P_n(A) - P(A)| > \epsilon \right) \le c_1\, \kappa(\mathcal{A})\, e^{-c_2 n \epsilon^2}$$

where $P_n(A) = n^{-1} \sum_{i=1}^n I(X_i \in A)$. Bounds like these are called uniform bounds since they hold uniformly over a class of functions or over a class of sets.

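$\mathcal{A}$ = all discs in $\mathbb{R}^d$.
$\mathcal{A}$ = all rectangles in $\mathbb{R}^d$.
$\mathcal{A}$ = all half-spaces in $\mathbb{R}^d$ = $\{x : \beta^T x \ge 0\}$.
$\mathcal{A}$ = all convex sets in $\mathbb{R}^d$.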

Let $F = \{x_1, \ldots, x_n\}$ be a finite set. Let $G$ be a subset of $F$. Say that $\mathcal{A}$ picks out $G$ if

$$A \cap F = G$$

for some $A \in \mathcal{A}$. For example, let $\mathcal{A} = \{(a, b) : a \le b\}$. Suppose that $F = \{1, 2, 7, 8, 9\}$ and $G = \{2, 7\}$. Then $\mathcal{A}$ picks out $G$ since $A \cap F = G$ if we choose $A = (1.5, 7.5)$, for example. Let $s(\mathcal{A}, F)$ be the number of these subsets picked out by $\mathcal{A}$. Of course $s(\mathcal{A}, F) \le 2^n$.

Example 4 Let $\mathcal{A} = \{(a, b) : a \le b\}$ and $F = \{1, 2, 3\}$. Then $\mathcal{A}$ can pick out:

$$\emptyset, \{1\}, \{2\}, \{3\}, \{1, 2\}, \{2, 3\}, \{1, 2, 3\}.$$

So $s(\mathcal{A}, F) = 7$. Note that $7 < 8 = 2^3$. If $F = \{1, 6\}$ then $\mathcal{A}$ can pick out:

$$\emptyset, \{1\}, \{6\}, \{1, 6\}.$$

In this case $s(\mathcal{A}, F) = 4 = 2^2$.
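The counts in Example 4 can be reproduced by brute force. The sketch below enumerates the subsets that intervals pick out from a finite set of real numbers; it relies on the observation that an interval can only capture a run of consecutive points (in sorted order), plus the empty set. The function names are made up for this illustration.

```python
def picked_out(points):
    """All subsets of `points` of the form A ∩ F with A = (a, b), a <= b."""
    pts = sorted(set(points))
    subsets = {frozenset()}  # taking a = b gives the empty set
    for i in range(len(pts)):
        for j in range(i, len(pts)):
            # An interval captures exactly a run of consecutive points.
            subsets.add(frozenset(pts[i:j + 1]))
    return subsets

def s(points):
    """Number of subsets of `points` picked out by the class of intervals."""
    return len(picked_out(points))

for F in ([1, 2, 3], [1, 6], [1, 2, 7, 8, 9]):
    print(F, "picked out:", s(F), "all subsets:", 2 ** len(F))
```

For $F = \{1, 2, 3\}$ the only missing subset is $\{1, 3\}$, since any interval containing 1 and 3 also contains 2.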

We say that $F$ is shattered if $s(\mathcal{A}, F) = 2^n$ where $n$ is the number of points in $F$.

Let $\mathcal{F}_n$ denote all finite sets with $n$ elements.

Define the shatter coefficient

$$s_n(\mathcal{A}) = \sup_{F \in \mathcal{F}_n} s(\mathcal{A}, F).$$

Note that $s_n(\mathcal{A}) \le 2^n$.

The following theorem is due to Vapnik and Chervonenkis. The proof is beyond the scope of the course. (If you take 10-702/36-702 you will learn the proof.)

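| Class $\mathcal{A}$ | VC dimension $V_{\mathcal{A}}$ |
|---|---|
| $\mathcal{A} = \{A_1, \ldots, A_N\}$ | $\le \log_2 N$ |
| Intervals $[a, b]$ on the real line | $2$ |
| Discs in $\mathbb{R}^2$ | $3$ |
| Closed balls in $\mathbb{R}^d$ | $\le d + 2$ |
| Rectangles in $\mathbb{R}^d$ | $2d$ |
| Half-spaces in $\mathbb{R}^d$ | $d + 1$ |
| Convex polygons in $\mathbb{R}^2$ | $\infty$ |
| Convex polygons with $d$ vertices | $2d + 1$ |

Table 1: The VC dimension of some classes $\mathcal{A}$.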

Theorem 5 Let $\mathcal{A}$ be a class of sets. Then

$$P\left( \sup_{A \in \mathcal{A}} |P_n(A) - P(A)| > \epsilon \right) \le 8\, s_n(\mathcal{A})\, e^{-n\epsilon^2/32}. \qquad (1)$$

This partly solves one of our problems. But how big can $s_n(\mathcal{A})$ be? Sometimes $s_n(\mathcal{A}) = 2^n$ for all $n$. For example, let $\mathcal{A}$ be all polygons in the plane. Then $s_n(\mathcal{A}) = 2^n$ for all $n$. But, in many cases, we will see that $s_n(\mathcal{A}) = 2^n$ for all $n$ up to some integer $d$ and then $s_n(\mathcal{A}) < 2^n$ for all $n > d$.

The Vapnik-Chervonenkis (VC) dimension is

$$d = d(\mathcal{A}) = \text{largest } n \text{ such that } s_n(\mathcal{A}) = 2^n.$$

In other words, $d$ is the size of the largest set that can be shattered.

Thus, $s_n(\mathcal{A}) = 2^n$ for all $n \le d$ and $s_n(\mathcal{A}) < 2^n$ for all $n > d$. The VC dimensions of some common examples are summarized in Table 1. Now here is an interesting question: for $n > d$ how does $s_n(\mathcal{A})$ behave? It is less than $2^n$ but how much less?
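For the class of intervals on the real line, this question can be explored by direct counting. The sketch below is illustrative only: it uses the fact that, for intervals, every set of $n$ distinct points yields the same number of picked-out subsets, so evaluating on the points $1, \ldots, n$ gives $s_n(\mathcal{A})$. The helper names and the cap `n_max` are assumptions of this example.

```python
def shatter_coefficient_intervals(n):
    """s_n(A) for A = intervals on the real line.

    For intervals, any n distinct points give the same count, so the
    supremum over n-point sets is computed on the points 1, ..., n.
    """
    points = list(range(1, n + 1))
    subsets = {frozenset()}
    for i in range(n):
        for j in range(i, n):
            # Intervals pick out runs of consecutive points.
            subsets.add(frozenset(points[i:j + 1]))
    return len(subsets)

def vc_dimension(shatter, n_max):
    """Largest n <= n_max with shatter(n) == 2**n (0 if there is none)."""
    d = 0
    for n in range(1, n_max + 1):
        if shatter(n) == 2 ** n:
            d = n
    return d

for n in range(1, 9):
    print(n, shatter_coefficient_intervals(n), 2 ** n)
print("VC dimension (searching up to n = 8):",
      vc_dimension(shatter_coefficient_intervals, 8))
```

The count grows like $n(n+1)/2 + 1$, a polynomial in $n$, while $2^n$ grows exponentially; the search recovers the value 2 listed for intervals in Table 1.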

Theorem 6 (Sauer's Theorem) Suppose that $\mathcal{A}$ has finite VC dimension $d$. Then, for all $n \ge d$,

$$s_n(\mathcal{A}) \le (n + 1)^d. \qquad (2)$$
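Substituting Sauer's bound into Theorem 5 gives $P(\sup_{A \in \mathcal{A}} |P_n(A) - P(A)| > \epsilon) \le 8 (n+1)^d e^{-n\epsilon^2/32}$. As a rough sketch of what this implies numerically, the code below searches for the smallest $n$ at which this combined bound drops below a target $\delta$. The function names, the search strategy, and the example values of $d$, $\epsilon$ and $\delta$ are assumptions of this illustration; it requires $d \ge 1$ and $0 < \delta < 1$.

```python
import math

def log_vc_bound(n, d, eps):
    """Logarithm of 8 * (n + 1)^d * exp(-n * eps^2 / 32)."""
    return math.log(8.0) + d * math.log(n + 1) - n * eps ** 2 / 32.0

def sample_size(d, eps, delta):
    """Smallest n with 8 (n+1)^d exp(-n eps^2 / 32) <= delta.

    Assumes d >= 1 and 0 < delta < 1. The bound equals 8 at n = 0 and
    rises until n + 1 = 32 d / eps^2, so the search starts past that peak,
    where the bound is decreasing in n.
    """
    target = math.log(delta)
    lo = math.ceil(32 * d / eps ** 2)  # bound still exceeds delta here
    hi = 2 * lo
    while log_vc_bound(hi, d, eps) > target:
        hi *= 2
    # Invariant: bound(lo) > delta >= bound(hi); bisect on the integers.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if log_vc_bound(mid, d, eps) > target:
            lo = mid
        else:
            hi = mid
    return hi

# Illustrative values (assumptions): intervals (d = 2), eps = 0.1, delta = 0.05.
n_needed = sample_size(d=2, eps=0.1, delta=0.05)
print("n needed:", n_needed)
```

The resulting sample sizes are large because the constants 8 and 1/32 are conservative; the point of the bound is the rate in $n$, not the constants.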

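Hence, for any $s$,

$$\begin{aligned}
E(Z_n^2) &= \int_0^\infty P(Z_n^2 > t)\, dt \\
&= \int_0^s P(Z_n^2 > t)\, dt + \int_s^\infty P(Z_n^2 > t)\, dt \\
&\le s + \int_s^\infty P(Z_n^2 > t)\, dt \\
&\le s + 2N \int_s^\infty e^{-2nt}\, dt \\
&= s + 2N\, \frac{e^{-2ns}}{2n} \\
&= s + \frac{N}{n}\, e^{-2ns}.
\end{aligned}$$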

Let $s = \log(N)/(2n)$. Then

$$E(Z_n^2) \le s + \frac{N}{n}\, e^{-2ns} = \frac{\log N}{2n} + \frac{1}{n} = \frac{\log N + 2}{2n}.$$

Finally, we use Cauchy-Schwarz:

$$E(Z_n) \le \sqrt{E(Z_n^2)} \le \sqrt{\frac{\log N + 2}{2n}} = O\left( \sqrt{\frac{\log N}{n}} \right).$$

In summary:

$$E\left( \max_{1 \le j \le N} |P_n(A_j) - P(A_j)| \right) = O\left( \sqrt{\frac{\log N}{n}} \right).$$

For a single set $A$ we would have $E|P_n(A) - P(A)| \le O(1/\sqrt{n})$. The bound only increases logarithmically with $N$.
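The explicit bound $\sqrt{(\log N + 2)/(2n)}$ from the derivation can be compared with a simulated expectation. The sketch below is one possible check under assumptions that are not in the notes: $P$ is Uniform(0, 1) and the finite class consists of the $N$ sets $A_j = [0, j/N]$, so that $P(A_j) = j/N$. The function names, seed, and the values of $n$, $N$ and the number of replications are illustrative.

```python
import bisect
import math
import random

def max_deviation(sample, N):
    """max_j |P_n(A_j) - P(A_j)| with A_j = [0, j/N] and P = Uniform(0, 1)."""
    xs = sorted(sample)
    n = len(xs)
    # bisect_right counts the sample points that are <= j / N.
    return max(
        abs(bisect.bisect_right(xs, j / N) / n - j / N)
        for j in range(1, N + 1)
    )

def expected_max_deviation(n, N, reps, seed=0):
    """Monte Carlo estimate of E max_j |P_n(A_j) - P(A_j)|."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        total += max_deviation([rng.random() for _ in range(n)], N)
    return total / reps

def expectation_bound(n, N):
    """The bound sqrt((log N + 2) / (2 n)) obtained in the derivation."""
    return math.sqrt((math.log(N) + 2.0) / (2.0 * n))

n, N = 200, 50
estimate = expected_max_deviation(n, N, reps=500)
bound = expectation_bound(n, N)
print(f"simulated expectation {estimate:.4f}, bound {bound:.4f}")
```

The bound holds for any $N$ sets, so for a structured class such as these nested intervals the simulated value is expected to fall well below it.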
