



Uniform Bounds, Lecture Notes - Mathematics, Probability, Uniform Bounds, Finite Classes, Shattering, Vapnik-Chervonenkis, Sauer's Theorem, Bounding Expectations




Recall that, if $X_1, \dots, X_n \sim \mathrm{Bernoulli}(p)$ and $\hat p_n = n^{-1} \sum_{i=1}^n X_i$ then, from Hoeffding's inequality,
$$\mathbb{P}(|\hat p_n - p| > \epsilon) \le 2 e^{-2n\epsilon^2}.$$
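As a quick sanity check, the bound can be compared with a simulation. The sketch below is illustrative only: the function names and the values of `n`, `p`, `eps` and the number of trials are assumptions chosen for the example, not part of the notes.

```python
import math
import random

def hoeffding_bound(n, eps):
    """Right-hand side of Hoeffding's inequality: 2 * exp(-2 * n * eps^2)."""
    return 2.0 * math.exp(-2.0 * n * eps ** 2)

def empirical_deviation_rate(n, p, eps, trials, seed=0):
    """Fraction of simulated Bernoulli(p) samples of size n with |p_hat - p| > eps."""
    rng = random.Random(seed)
    exceed = 0
    for _ in range(trials):
        p_hat = sum(rng.random() < p for _ in range(n)) / n
        if abs(p_hat - p) > eps:
            exceed += 1
    return exceed / trials

# Illustrative values (assumptions, not from the notes).
n, p, eps = 200, 0.3, 0.1
bound = hoeffding_bound(n, eps)
rate = empirical_deviation_rate(n, p, eps, trials=2000)
print(f"Hoeffding bound = {bound:.4f}, simulated rate = {rate:.4f}")
```

The simulated rate should sit below the bound, usually by a wide margin, since Hoeffding's inequality holds for every $p$ and is not tight for any particular one.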
Sometimes we want to say more than this.
Example 1 Suppose that $X_1, \dots, X_n$ have cdf $F$. Let
$$F_n(t) = \frac{1}{n} \sum_{i=1}^n I(X_i \le t).$$
We call $F_n$ the empirical cdf. How close is $F_n$ to $F$? That is, how big is $|F_n(t) - F(t)|$? From Hoeffding's inequality,
$$\mathbb{P}(|F_n(t) - F(t)| > \epsilon) \le 2 e^{-2n\epsilon^2}.$$
But that is only for one point $t$. How big is $\sup_t |F_n(t) - F(t)|$? We would like a bound of the form
$$\mathbb{P}\Bigl( \sup_t |F_n(t) - F(t)| > \epsilon \Bigr) \le \text{something small}.$$
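To make the quantity $\sup_t |F_n(t) - F(t)|$ concrete, here is a minimal sketch that computes it exactly for a continuous cdf. The choice of the Uniform(0,1) distribution, the helper names and the sample sizes are assumptions made for illustration.

```python
import random

def sup_cdf_distance(sample, cdf):
    """Compute sup_t |F_n(t) - F(t)| for a continuous cdf F.

    F_n is a step function that jumps by 1/n at each order statistic, so the
    supremum is approached either at a sample point or just before it.
    """
    xs = sorted(sample)
    n = len(xs)
    worst = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # i/n is F_n at x; (i-1)/n is the value of F_n just below x.
        worst = max(worst, i / n - f, f - (i - 1) / n)
    return worst

def uniform_cdf(t):
    """cdf of Uniform(0, 1), used here as an illustrative F."""
    return min(1.0, max(0.0, t))

rng = random.Random(1)
for n in (100, 1000, 10000):
    sample = [rng.random() for _ in range(n)]
    print(n, round(sup_cdf_distance(sample, uniform_cdf), 4))
```

The printed distances should shrink as $n$ grows, which is the behaviour a uniform bound is meant to quantify.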
Example 2 Suppose that $X_1, \dots, X_n \sim P$. Let
$$P_n(A) = \frac{1}{n} \sum_{i=1}^n I(X_i \in A).$$
How close is $P_n(A)$ to $P(A)$? That is, how big is $|P_n(A) - P(A)|$? From Hoeffding's inequality,
$$\mathbb{P}(|P_n(A) - P(A)| > \epsilon) \le 2 e^{-2n\epsilon^2}.$$
But that is only for one set $A$. How big is $\sup_{A \in \mathcal{A}} |P_n(A) - P(A)|$ for a class of sets $\mathcal{A}$? We would like a bound of the form
$$\mathbb{P}\Bigl( \sup_{A \in \mathcal{A}} |P_n(A) - P(A)| > \epsilon \Bigr) \le \text{something small}.$$
Example 3 (Classification.) Suppose we observe data $(X_1, Y_1), \dots, (X_n, Y_n)$ where $Y_i \in \{0, 1\}$. Let $(X, Y)$ be a new pair. Suppose we observe $X$. Now we want to predict $Y$. A classifier $h$ is a function $h(x)$ which takes values in $\{0, 1\}$. When we observe $X$ we predict $Y$ with $h(X)$. The classification error, or risk, is the probability of an error:
$$R(h) = \mathbb{P}(Y \ne h(X)).$$
The training error is the fraction of errors on the observed data $(X_1, Y_1), \dots, (X_n, Y_n)$:
$$\hat R(h) = \frac{1}{n} \sum_{i=1}^n I(Y_i \ne h(X_i)).$$
By Hoeffding's inequality, for any fixed $h$,
$$\mathbb{P}(|\hat R(h) - R(h)| > \epsilon) \le 2 e^{-2n\epsilon^2}.$$
How do we choose a classifier? One way is to start with a set of classifiers $\mathcal{H}$. Then we define $\hat h$ to be the member of $\mathcal{H}$ that minimizes the training error. Thus
$$\hat h = \operatorname{argmin}_{h \in \mathcal{H}} \hat R(h).$$
An example is the set of linear classifiers. Suppose that $x \in \mathbb{R}^d$. A linear classifier has the form $h(x) = 1$ if $\beta^T x \ge 0$ and $h(x) = 0$ if $\beta^T x < 0$, where $\beta = (\beta_1, \dots, \beta_d)^T$ is a vector of parameters.
Although $\hat h$ minimizes $\hat R(h)$, it does not in general minimize $R(h)$. Let $h_*$ minimize the true error $R(h)$. A fundamental question is: how close is $R(\hat h)$ to $R(h_*)$? We will see later that $R(\hat h)$ is close to $R(h_*)$ if $\sup_h |\hat R(h) - R(h)|$ is small. So we want
$$\mathbb{P}\Bigl( \sup_{h \in \mathcal{H}} |\hat R(h) - R(h)| > \epsilon \Bigr) \le \text{something small}.$$
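The choice of $\hat h$ by minimizing training error can be sketched in a few lines. This is a toy illustration under stated assumptions: the class is a small hypothetical set of one-dimensional threshold classifiers $h_c(x) = I(x \ge c)$ rather than the linear classifiers above, and the data points are made up for the example.

```python
def training_error(h, data):
    """R_hat(h): fraction of pairs (x, y) in the data with h(x) != y."""
    return sum(h(x) != y for x, y in data) / len(data)

def make_threshold(c):
    """Hypothetical 1-D classifier h_c(x) = 1 if x >= c, else 0."""
    return lambda x: 1 if x >= c else 0

def erm(classifiers, data):
    """Return (label, error) of the classifier with the smallest training error."""
    scored = ((label, training_error(h, data)) for label, h in classifiers.items())
    return min(scored, key=lambda pair: pair[1])

# Made-up training data: (x, y) pairs with one noisy label at x = 0.3.
data = [(0.1, 0), (0.2, 0), (0.3, 1), (0.4, 0), (0.45, 0),
        (0.6, 1), (0.7, 1), (0.9, 1)]

# A finite class H indexed by the threshold c.
H = {c: make_threshold(c) for c in (0.0, 0.25, 0.5, 0.75, 1.0)}

best_c, best_err = erm(H, data)
print("selected threshold:", best_c, "training error:", best_err)
```

The training error of the selected classifier is an optimistic estimate of its risk, because the same data were used to pick it; that is why a bound on the supremum over the whole class is needed.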
More generally, we can state our goal as follows. For any function $f$ define
$$P(f) = \int f(x)\, dP(x), \qquad P_n(f) = \frac{1}{n} \sum_{i=1}^n f(X_i).$$
Let $\mathcal{F}$ be a set of functions. In our first example, each $f$ was of the form $f_t(x) = I(x \le t)$ and $\mathcal{F} = \{f_t : t \in \mathbb{R}\}$. We want to bound
$$\mathbb{P}\Bigl( \sup_{f \in \mathcal{F}} |P_n(f) - P(f)| > \epsilon \Bigr).$$
We will see that the bounds we obtain have the form
$$\mathbb{P}\Bigl( \sup_{f \in \mathcal{F}} |P_n(f) - P(f)| > \epsilon \Bigr) \le c_1\, \kappa(\mathcal{F})\, e^{-c_2 n \epsilon^2}$$
where $c_1$ and $c_2$ are positive constants and $\kappa(\mathcal{F})$ is a measure of the size (or complexity) of the class $\mathcal{F}$. Similarly, if $\mathcal{A}$ is a class of sets then we want a bound of the form
$$\mathbb{P}\Bigl( \sup_{A \in \mathcal{A}} |P_n(A) - P(A)| > \epsilon \Bigr) \le c_1\, \kappa(\mathcal{A})\, e^{-c_2 n \epsilon^2}$$
where $P_n(A) = n^{-1} \sum_{i=1}^n I(X_i \in A)$. Bounds like these are called uniform bounds since they hold uniformly over a class of functions or over a class of sets.
Let $F = \{x_1, \dots, x_n\}$ be a finite set. Let $G$ be a subset of $F$. Say that $\mathcal{A}$ picks out $G$ if
$$A \cap F = G$$
for some $A \in \mathcal{A}$. For example, let $\mathcal{A} = \{(a, b) : a \le b\}$. Suppose that $F = \{1, 2, 7, 8, 9\}$ and $G = \{2, 7\}$. Then $\mathcal{A}$ picks out $G$ since $A \cap F = G$ if we choose $A = (1.5, 7.5)$, for example. Let $s(\mathcal{A}, F)$ be the number of subsets of $F$ picked out by $\mathcal{A}$. Of course $s(\mathcal{A}, F) \le 2^n$.
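Example 4 Let $\mathcal{A} = \{(a, b) : a \le b\}$ and $F = \{1, 2, 3\}$. Then $\mathcal{A}$ can pick out:
$$\emptyset,\ \{1\},\ \{2\},\ \{3\},\ \{1, 2\},\ \{2, 3\},\ \{1, 2, 3\}.$$
So $s(\mathcal{A}, F) = 7$. Note that $7 < 8 = 2^3$; the missing subset is $\{1, 3\}$, since any interval containing 1 and 3 also contains 2. If $F = \{1, 6\}$ then $\mathcal{A}$ can pick out:
$$\emptyset,\ \{1\},\ \{6\},\ \{1, 6\}.$$
In this case $s(\mathcal{A}, F) = 4 = 2^2$.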
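The counts in Example 4 can be reproduced by enumeration. The sketch below relies on one observation, stated as an assumption of the code: an interval picks out either the empty set or a run of consecutive points in sorted order, so listing those runs lists every subset that can be picked out. The function names are hypothetical.

```python
def picked_out_by_intervals(points):
    """All subsets G of `points` with G = A ∩ F for some interval A = (a, b).

    An interval can only pick out the empty set or a block of consecutive
    points (in sorted order), so it suffices to enumerate those blocks.
    """
    xs = sorted(set(points))
    subsets = {frozenset()}
    for i in range(len(xs)):
        for j in range(i, len(xs)):
            subsets.add(frozenset(xs[i:j + 1]))
    return subsets

def is_shattered_by_intervals(points):
    """True if intervals pick out all 2^n subsets of the n distinct points."""
    return len(picked_out_by_intervals(points)) == 2 ** len(set(points))

for F in ([1, 2, 3], [1, 6]):
    subsets = picked_out_by_intervals(F)
    print(F, "s(A, F) =", len(subsets), "shattered:", is_shattered_by_intervals(F))
```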
We say that $F$ is shattered if $s(\mathcal{A}, F) = 2^n$ where $n$ is the number of points in $F$.
Let $\mathcal{F}_n$ denote all finite sets with $n$ elements.
Define the shatter coefficient
$$s_n(\mathcal{A}) = \sup_{F \in \mathcal{F}_n} s(\mathcal{A}, F).$$
Note that $s_n(\mathcal{A}) \le 2^n$.
The following theorem is due to Vapnik and Chervonenkis. The proof is beyond the scope of the course. (If you take 10-702/36-702 you will learn the proof.)
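| Class $\mathcal{A}$ | VC dimension $V_{\mathcal{A}}$ |
| --- | --- |
| $\mathcal{A} = \{A_1, \dots, A_N\}$ | $\le \log_2 N$ |
| Intervals $[a, b]$ on the real line | $2$ |
| Discs in $\mathbb{R}^2$ | $3$ |
| Closed balls in $\mathbb{R}^d$ | $\le d + 2$ |
| Rectangles in $\mathbb{R}^d$ | $2d$ |
| Half-spaces in $\mathbb{R}^d$ | $d + 1$ |
| Convex polygons in $\mathbb{R}^2$ | $\infty$ |
| Convex polygons with $d$ vertices | $2d + 1$ |

Table 1: The VC dimension of some classes $\mathcal{A}$.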
Theorem 5 Let $\mathcal{A}$ be a class of sets. Then
$$\mathbb{P}\Bigl( \sup_{A \in \mathcal{A}} |P_n(A) - P(A)| > \epsilon \Bigr) \le 8\, s_n(\mathcal{A})\, e^{-n\epsilon^2/32}. \qquad (1)$$
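To get a feel for the constants in (1), the right-hand side can be evaluated numerically. The sketch below is illustrative and assumes the class of intervals on the real line, for which $n$ distinct points admit $n(n+1)/2$ nonempty runs of consecutive points plus the empty set (consistent with the count of 7 in Example 4). This closed form, the function names and the value of `eps` are assumptions of the example, not statements from the notes.

```python
import math

def vc_tail_bound(shatter_coef, n, eps):
    """Right-hand side of (1): 8 * s_n(A) * exp(-n * eps^2 / 32)."""
    return 8.0 * shatter_coef * math.exp(-n * eps ** 2 / 32.0)

def intervals_shatter_coef(n):
    """Shatter coefficient of intervals on the line (assumed closed form):
    n(n+1)/2 nonempty runs of consecutive points, plus the empty set."""
    return n * (n + 1) // 2 + 1

eps = 0.2  # illustrative tolerance
for n in (1_000, 10_000, 50_000, 100_000):
    bound = vc_tail_bound(intervals_shatter_coef(n), n, eps)
    print(f"n = {n:>7}: bound = {bound:.3e}")
```

For small $n$ the bound exceeds 1 and says nothing; it becomes useful only once the exponential factor overwhelms the polynomial growth of the shatter coefficient.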
This partly solves one of our problems. But how big can $s_n(\mathcal{A})$ be? Sometimes $s_n(\mathcal{A}) = 2^n$ for all $n$. For example, let $\mathcal{A}$ be all polygons in the plane. Then $s_n(\mathcal{A}) = 2^n$ for all $n$. But, in many cases, we will see that $s_n(\mathcal{A}) = 2^n$ for all $n$ up to some integer $d$ and then $s_n(\mathcal{A}) < 2^n$ for all $n > d$.
The Vapnik-Chervonenkis (VC) dimension is
$$d = d(\mathcal{A}) = \text{largest } n \text{ such that } s_n(\mathcal{A}) = 2^n.$$
In other words, d is the size of the largest set that can be shattered.
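For a finite class of sets on a small finite ground set, the shatter coefficient and the VC dimension can be found by brute force. The sketch below makes two simplifying assumptions that should be kept in mind: the supremum over $n$-point sets is restricted to subsets of a fixed ground set, and the class is a hypothetical family of integer intervals $\{i, \dots, j\}$ standing in for intervals on the real line.

```python
from itertools import combinations

def trace_count(sets, points):
    """s(A, F): number of distinct subsets A ∩ F as A ranges over `sets`."""
    f = frozenset(points)
    return len({frozenset(a) & f for a in sets})

def shatter_coefficient(sets, ground, n):
    """Largest s(A, F) over n-point subsets F of the ground set."""
    return max(trace_count(sets, pts) for pts in combinations(ground, n))

def vc_dimension(sets, ground):
    """Largest n such that some n-point subset of `ground` is shattered.

    Stopping at the first failure is valid because every subset of a
    shattered set is itself shattered.
    """
    d = 0
    for n in range(1, len(ground) + 1):
        if shatter_coefficient(sets, ground, n) == 2 ** n:
            d = n
        else:
            break
    return d

ground = list(range(1, 7))
# Hypothetical class: the empty set and all integer intervals {i, ..., j}.
intervals = [set()] + [set(range(i, j + 1)) for i in ground for j in ground if i <= j]

for n in range(1, 5):
    print(n, shatter_coefficient(intervals, ground, n), 2 ** n)
print("VC dimension:", vc_dimension(intervals, ground))
```

The enumeration is exponential in the size of the ground set, so this is only practical for toy examples.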
Thus, $s_n(\mathcal{A}) = 2^n$ for all $n \le d$ and $s_n(\mathcal{A}) < 2^n$ for all $n > d$. The VC dimensions of some common examples are summarized in Table 1. Now here is an interesting question: for $n > d$ how does $s_n(\mathcal{A})$ behave? It is less than $2^n$ but how much less?
Theorem 6 (Sauer's Theorem) Suppose that $\mathcal{A}$ has finite VC dimension $d$. Then, for all $n \ge d$,
$$s_n(\mathcal{A}) \le (n + 1)^d. \qquad (2)$$
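Combining (1) with (2) gives the bound $8 (n+1)^d e^{-n\epsilon^2/32}$, which can be inverted numerically to find a sample size that makes the bound at most a target $\delta$. The sketch below is one way to do this under stated assumptions: the function names and the values of `d`, `eps` and `delta` are illustrative, $0 < \delta < 1$ is assumed, and the computation is done in log space to avoid overflow.

```python
import math

def sauer_bound(n, d):
    """Upper bound (n + 1)^d on the shatter coefficient, from (2)."""
    return (n + 1) ** d

def log_combined_bound(n, d, eps):
    """log of 8 * (n + 1)^d * exp(-n * eps^2 / 32)."""
    return math.log(8.0) + d * math.log(n + 1) - n * eps ** 2 / 32.0

def sample_size(d, eps, delta):
    """Smallest n with 8 (n+1)^d exp(-n eps^2 / 32) <= delta, for 0 < delta < 1.

    The log bound is concave in n and starts above log(delta), so it crosses
    the target exactly once; doubling followed by bisection finds the crossing.
    """
    target = math.log(delta)
    hi = 1
    while log_combined_bound(hi, d, eps) > target:
        hi *= 2
    lo = hi // 2  # the bound still exceeds delta at lo
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if log_combined_bound(mid, d, eps) > target:
            lo = mid
        else:
            hi = mid
    return hi

# Polynomial versus exponential growth for d = 2.
for n in (5, 10, 20):
    print(n, sauer_bound(n, 2), 2 ** n)

print("n needed for d=2, eps=0.1, delta=0.05:", sample_size(2, 0.1, 0.05))
```

The resulting sample sizes are large because the constants in (1) are conservative; the point of the calculation is the polynomial dependence on $n$ that Sauer's theorem provides.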
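Here $Z_n = \max_{1 \le j \le N} |P_n(A_j) - P(A_j)|$ for a finite class $\{A_1, \dots, A_N\}$, so Hoeffding's inequality and the union bound give $\mathbb{P}(Z_n^2 > t) = \mathbb{P}(Z_n > \sqrt{t}) \le 2N e^{-2nt}$. Hence, for any $s > 0$,
$$\begin{aligned}
\mathbb{E}(Z_n^2) &= \int_0^\infty \mathbb{P}(Z_n^2 > t)\, dt \\
&= \int_0^s \mathbb{P}(Z_n^2 > t)\, dt + \int_s^\infty \mathbb{P}(Z_n^2 > t)\, dt \\
&\le s + \int_s^\infty \mathbb{P}(Z_n^2 > t)\, dt \\
&\le s + 2N \int_s^\infty e^{-2nt}\, dt \\
&= s + 2N\, \frac{e^{-2ns}}{2n} \\
&= s + \frac{N}{n}\, e^{-2ns}.
\end{aligned}$$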
Let $s = \log(N)/(2n)$, so that $e^{-2ns} = 1/N$. Then
$$\mathbb{E}(Z_n^2) \le s + \frac{N}{n}\, e^{-2ns} = \frac{\log N}{2n} + \frac{1}{n} = \frac{\log N + 2}{2n}.$$
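Finally, we use Cauchy-Schwarz:
$$\mathbb{E}(Z_n) \le \sqrt{\mathbb{E}(Z_n^2)} \le \sqrt{\frac{\log N + 2}{2n}} = O\Biggl( \sqrt{\frac{\log N}{n}} \Biggr).$$
In summary:
$$\mathbb{E}\Bigl( \max_{1 \le j \le N} |P_n(A_j) - P(A_j)| \Bigr) = O\Biggl( \sqrt{\frac{\log N}{n}} \Biggr).$$
For a single set $A$ we would have $\mathbb{E}|P_n(A) - P(A)| \le O(1/\sqrt{n})$. Going from one set to $N$ sets therefore inflates the bound only by a factor of order $\sqrt{\log N}$: the bound grows only logarithmically with $N$.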
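The algebra above and the slow growth in $N$ can both be checked numerically. In the sketch below the function names and the values of `N` and `n` are assumptions chosen for illustration; the formulas are the ones derived in the text.

```python
import math

def second_moment_bound(N, n, s):
    """Bound s + (N / n) * exp(-2 n s) on E(Z_n^2), valid for any s > 0."""
    return s + (N / n) * math.exp(-2.0 * n * s)

def expected_max_bound(N, n):
    """Bound sqrt((log N + 2) / (2 n)) on E(Z_n)."""
    return math.sqrt((math.log(N) + 2.0) / (2.0 * n))

# Check that s = log(N) / (2n) gives (log N + 2) / (2n).
N, n = 50, 100
s = math.log(N) / (2 * n)
print(second_moment_bound(N, n, s), (math.log(N) + 2) / (2 * n))

# The bound on E(Z_n) grows slowly as the number of sets increases.
for N in (1, 10, 1_000, 1_000_000):
    print(N, round(expected_max_bound(N, 1000), 4))
```

Increasing the number of sets from 1 to a million raises the bound by less than a factor of 3 at a fixed sample size, which is the practical meaning of the logarithmic dependence on $N$.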