




A lecture note from Carnegie Mellon University (CMU) 18-859S, Spring 2007, covering the topics of learning decision trees and DNFs (disjunctive normal forms) in the context of boolean functions. The lecture, delivered by Ryan O’Donnell and scribed by Suresh Purini, discusses the concepts of concentration of functions, learning algorithms, and the relationship between decision trees and DNFs. It also covers the Fourier expansion of boolean functions and the learnability of decision trees.





Analysis of Boolean Functions (CMU 18-859S, Spring 2007)
Feb. 18, 2007
Lecturer: Ryan O’Donnell Scribe: Suresh Purini
We recall the following definition and two important learning algorithms discussed in the previous
lecture.
Definition 1.1 Given a collection $\mathcal{S}$ of subsets of $[n]$, we say $f : \{-1,1\}^n \to \mathbb{R}$ has $\epsilon$-concentration on $\mathcal{S}$ if
$$\sum_{S \notin \mathcal{S}} \hat{f}(S)^2 \le \epsilon.$$
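To make Definition 1.1 concrete, the following minimal Python sketch computes every Fourier coefficient of a small function by brute force and measures the Fourier weight lying outside a chosen collection. The helper names `fourier_coefficients` and `weight_outside`, and the choice of majority on 3 bits as the example, are illustrative assumptions and not part of the lecture.

```python
from itertools import combinations, product
from math import prod

def fourier_coefficients(f, n):
    """Return {S: f_hat(S)}, where f_hat(S) = E_x[f(x) * prod_{i in S} x_i]
    and x is uniform on {-1, 1}^n (brute force over all 2^n points)."""
    points = list(product([-1, 1], repeat=n))
    coeffs = {}
    for k in range(n + 1):
        for S in combinations(range(n), k):
            total = sum(f(x) * prod(x[i] for i in S) for x in points)
            coeffs[frozenset(S)] = total / len(points)
    return coeffs

def weight_outside(coeffs, collection):
    """Sum of f_hat(S)^2 over the sets S that are NOT in `collection`."""
    return sum(c * c for S, c in coeffs.items() if S not in collection)

def maj3(x):
    # Hypothetical example function: majority of three +/-1 bits.
    return 1 if sum(x) > 0 else -1

coeffs = fourier_coefficients(maj3, 3)
low_degree = {S for S in coeffs if len(S) <= 1}   # collection of sets of size <= 1
outside = weight_outside(coeffs, low_degree)      # maj3 is `outside`-concentrated on it
```

Under these assumptions, `maj3` is $\epsilon$-concentrated on the sets of size at most 1 for every $\epsilon$ that is at least `outside`.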
Theorem 1.2 Let $\mathcal{C}$ be a class of $n$-bit functions such that every $f \in \mathcal{C}$ is $\epsilon$-concentrated on $\mathcal{S} = \{S \subseteq [n] : |S| \le d\}$. Then the function class $\mathcal{C}$ is learnable under the uniform distribution to an accuracy of $O(\epsilon)$, with probability at least $1 - \delta$, in time $\mathrm{poly}(|\mathcal{S}|, 1/\epsilon)\,\mathrm{poly}(n)\log(1/\delta)$ using random examples only.
This algorithm is called the Low Degree algorithm and was proposed by Linial, Mansour and Nisan
[3]. See Theorem 5.4 in lecture notes 8.
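The idea behind the Low Degree algorithm can be sketched in a few lines of Python: estimate each coefficient $\hat{f}(S)$ with $|S| \le d$ as an empirical average over random examples, then output the sign of the resulting low-degree polynomial. This is only a sketch under stated assumptions: the function name `low_degree_learn`, the sample size, the fixed seed, and the target function (majority on 3 bits) are hypothetical choices, and no attempt is made to match the sample complexity in Theorem 1.2.

```python
import random
from itertools import combinations, product
from math import prod

def low_degree_learn(examples, n, d):
    """Estimate f_hat(S) for all |S| <= d from labelled examples (x, f(x)) and
    return the estimates plus the hypothesis x -> sign(sum_S est(S) * chi_S(x))."""
    est = {}
    for k in range(d + 1):
        for S in combinations(range(n), k):
            est[S] = sum(y * prod(x[i] for i in S) for x, y in examples) / len(examples)

    def hypothesis(x):
        value = sum(c * prod(x[i] for i in S) for S, c in est.items())
        return 1 if value >= 0 else -1

    return est, hypothesis

def maj3(x):
    # Hypothetical target function: majority of three +/-1 bits.
    return 1 if sum(x) > 0 else -1

rng = random.Random(0)          # fixed seed so the run is reproducible
n, d = 3, 1
examples = []
for _ in range(2000):           # uniform random examples only, no queries
    x = tuple(rng.choice([-1, 1]) for _ in range(n))
    examples.append((x, maj3(x)))

est, h = low_degree_learn(examples, n, d)
agrees = all(h(x) == maj3(x) for x in product([-1, 1], repeat=n))
```

The running time is dominated by the loop over the $|\mathcal{S}|$ candidate sets, which is where the $\mathrm{poly}(|\mathcal{S}|)$ factor in the theorem comes from.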
Theorem 1.3 Let $\mathcal{C}$ be a class of $n$-bit functions such that every $f \in \mathcal{C}$ is $\epsilon$-concentrated on some collection $\mathcal{S}$. Then the function class $\mathcal{C}$ is learnable using membership queries (Goldreich-Levin algorithm) in $\mathrm{poly}(|\mathcal{S}|, 1/\epsilon)\,\mathrm{poly}(n)\log(1/\delta)$ time.
This algorithm is called the Kushilevitz-Mansour algorithm [2]. See Corollary 5.5 in the lecture notes.
A decision tree is a binary tree in which the internal nodes are labeled with variables and the leaves
are labeled with either $-1$ or $+1$. The left and right edges leaving each internal node
are labeled $-1$ and $+1$ respectively. We can think of the decision tree as defining a boolean function
in the natural way. For example, the decision tree in Figure 1 defines a boolean function
whose DNF formula is $x_1 x_2 x_3 + x_1 \bar{x}_2 x_4 + \bar{x}_1 x_2$.
Note that, given any boolean function we can come up with a corresponding decision tree.
Let $P$ be a path in the decision tree. An example of a path in Figure 1 is $P = (x_1 = -1, x_2 = +1, x_4 = -1)$.
Figure 1:
Let $1_P : \{-1,1\}^n \to \{0,1\}$ be the indicator function for the path $P$. For example,
$$1_P(x) = \begin{cases} 1 & \text{if } x_1 = -1,\ x_2 = +1,\ x_4 = -1, \\ 0 & \text{otherwise.} \end{cases}$$
Observation 2.1 A boolean function $f$ can be expressed in terms of the path functions $1_P$ corresponding to the various paths in the decision tree of $f$ as follows:
$$f(x) = \sum_{\text{paths } P} 1_P(x)\, f(P),$$
where $f(P)$ is the label on the leaf reached when the function $f$ takes the path $P$ in its decision tree.
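The path decomposition in Observation 2.1 can be checked mechanically. The sketch below evaluates a decision tree directly and, separately, as a sum of path indicators times leaf labels. The tuple encoding of trees and the particular tree `TREE` are hypothetical (Figure 1 is not reproduced here), and variables are indexed from 0.

```python
from itertools import product

# A tree is either a leaf label (+1 or -1) or a tuple (i, left, right):
# query x[i], follow `left` if x[i] == -1 and `right` if x[i] == +1.
TREE = (0,
        (1, (2, -1, 1), (3, -1, 1)),   # subtree taken when x[0] == -1
        (1, -1, 1))                    # subtree taken when x[0] == +1

def evaluate(tree, x):
    """Walk from the root to a leaf and return the leaf label."""
    while not isinstance(tree, int):
        i, left, right = tree
        tree = left if x[i] == -1 else right
    return tree

def paths(tree, prefix=()):
    """Yield (path, leaf_label); a path is a tuple of (variable, value) pairs."""
    if isinstance(tree, int):
        yield prefix, tree
        return
    i, left, right = tree
    yield from paths(left, prefix + ((i, -1),))
    yield from paths(right, prefix + ((i, +1),))

def indicator(path, x):
    """1_P(x): 1 if x satisfies every constraint on the path, else 0."""
    return int(all(x[i] == b for i, b in path))

def via_paths(tree, x):
    """f(x) computed as the sum over paths P of 1_P(x) * f(P)."""
    return sum(indicator(p, x) * label for p, label in paths(tree))

agree = all(evaluate(TREE, x) == via_paths(TREE, x)
            for x in product([-1, 1], repeat=4))
```

Exactly one path indicator is nonzero for each input, which is why the sum collapses to a single leaf label.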
Observation 2.2 Let $V$ be the set of variables occurring in a path function $1_P$ and let $d$ be the cardinality of the set $V$. Then the Fourier expansion of $1_P$ looks like
$$\sum_{S \subseteq V} \pm 2^{-d} \chi_S.$$
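The observation is easy to verify by writing the path function as a product of one-variable indicators. For example, when $P = (x_1 = -1, x_2 = +1, x_4 = -1)$, the path function $1_P$, which corresponds to the DNF term $x_1 \bar{x}_2 x_4$, is
$$1_P = \left(\tfrac{1}{2} - \tfrac{1}{2} x_1\right)\left(\tfrac{1}{2} + \tfrac{1}{2} x_2\right)\left(\tfrac{1}{2} - \tfrac{1}{2} x_4\right).$$
Expanding the product gives one term $\pm 2^{-3} \chi_S$ for each $S \subseteq \{1, 2, 4\}$.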
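The expansion of a path indicator can also be carried out symbolically. The following sketch multiplies out $\prod_i (\tfrac{1}{2} + \tfrac{1}{2} b_i x_i)$ for a path given as (variable, value) constraints. The function name `path_indicator_expansion` is an illustrative assumption, and the code assumes that no variable is repeated along the path.

```python
from fractions import Fraction
from itertools import combinations

def path_indicator_expansion(path):
    """Expand prod_i (1/2 + b_i * x_i / 2) over the constraints (i, b_i) on a path
    into a dictionary {S: coefficient of chi_S}. Assumes distinct variables."""
    d = len(path)
    coeffs = {}
    for k in range(d + 1):
        for chosen in combinations(path, k):
            sign = 1
            for _, b in chosen:
                sign *= b                      # each chosen factor contributes b_i
            S = frozenset(i for i, _ in chosen)
            coeffs[S] = Fraction(sign, 2 ** d)
    return coeffs

# The path P = (x1 = -1, x2 = +1, x4 = -1) from the text, keyed by variable index.
P = ((1, -1), (2, +1), (4, -1))
expansion = path_indicator_expansion(P)
l1_norm = sum(abs(c) for c in expansion.values())
```

Every coefficient has magnitude $2^{-d}$ and there are $2^d$ of them, so the sum of their absolute values is 1.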
Proposition 2.3 If $f : \{-1,1\}^n \to \{-1,1\}$ is computable by a depth-$d$ decision tree then
1. The Fourier expansion of $f$ has degree at most $d$, i.e., $\sum_{|S| > d} \hat{f}(S)^2 = 0$.
2. All Fourier coefficients are integer multiples of $2^{-d}$.
3. The number of nonzero Fourier coefficients is at most $4^d$.
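The three claims of Proposition 2.3 can be checked by brute force on a small example. The sketch below uses a hypothetical depth-2 tree on three variables (encoded as nested tuples, with variables indexed from 0) and exact rational arithmetic; none of the names come from the lecture.

```python
from fractions import Fraction
from itertools import combinations, product
from math import prod

# Hypothetical depth-2 tree: (i, left, right) queries x[i] and goes left on -1.
TREE = (0, (1, -1, 1), (2, 1, -1))
N, DEPTH = 3, 2

def evaluate(tree, x):
    while not isinstance(tree, int):
        i, left, right = tree
        tree = left if x[i] == -1 else right
    return tree

def fourier_coefficients(f, n):
    """Exact coefficients f_hat(S) = E_x[f(x) * chi_S(x)] as Fractions."""
    points = list(product([-1, 1], repeat=n))
    return {
        frozenset(S): Fraction(sum(f(x) * prod(x[i] for i in S) for x in points),
                               len(points))
        for k in range(n + 1)
        for S in combinations(range(n), k)
    }

coeffs = fourier_coefficients(lambda x: evaluate(TREE, x), N)
nonzero = {S: c for S, c in coeffs.items() if c != 0}

max_degree = max(len(S) for S in nonzero)                        # claim 1
all_multiples = all((c * 2 ** DEPTH).denominator == 1
                    for c in coeffs.values())                    # claim 2
num_nonzero = len(nonzero)                                       # claim 3
```

For this tree the function equals $x_1$ on one branch and $-x_2$ on the other (in the 0-based indexing of the code), so all four nonzero coefficients are $\pm\tfrac{1}{2}$.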
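Proposition 2.11 If $f$ has a decision tree of size $s$, then $\|\hat{f}\|_1 \le s$.
Proof: By Observation 2.1 and the triangle inequality,
$$\|\hat{f}\|_1 = \Big\| \sum_{\text{paths } P} \widehat{1_P}\, f(P) \Big\|_1 \le \sum_{\text{paths } P} \|\widehat{1_P}\|_1 \le s.$$
The first inequality uses $|f(P)| = 1$. The second uses $\|\widehat{1_P}\|_1 = 1$ (by Observation 2.2, $1_P$ has $2^d$ Fourier coefficients, each of magnitude $2^{-d}$) and the fact that a tree of size $s$ has at most $s$ paths.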
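Proposition 2.12 Given any function $f$ with $\|f\|_2^2 \le 1$ and $\epsilon > 0$, let $\mathcal{S} = \{S \subseteq [n] : |\hat{f}(S)| \ge \epsilon / \|\hat{f}\|_1\}$. Then $f$ is $\epsilon$-concentrated on $\mathcal{S}$. Note that $|\mathcal{S}| \le \left(\|\hat{f}\|_1 / \epsilon\right)^2$.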
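Proof:
$$\sum_{S \notin \mathcal{S}} \hat{f}(S)^2 \le \max_{S \notin \mathcal{S}} |\hat{f}(S)| \cdot \sum_{S \notin \mathcal{S}} |\hat{f}(S)| \le \max_{S \notin \mathcal{S}} |\hat{f}(S)| \cdot \Big( \sum_{S \notin \mathcal{S}} |\hat{f}(S)| + \sum_{S \in \mathcal{S}} |\hat{f}(S)| \Big) \le \frac{\epsilon}{\|\hat{f}\|_1} \cdot \|\hat{f}\|_1 = \epsilon.$$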
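As a numerical illustration of Proposition 2.12, the sketch below builds the collection of sets whose coefficients have magnitude at least $\epsilon / \|\hat{f}\|_1$ and checks both the concentration claim and the size bound. The example function (a $\pm 1$-valued AND of three bits), the value of `eps`, and the helper names are hypothetical choices.

```python
from itertools import combinations, product
from math import prod

def fourier_coefficients(f, n):
    """Brute-force coefficients f_hat(S) = E_x[f(x) * prod_{i in S} x_i]."""
    points = list(product([-1, 1], repeat=n))
    return {
        frozenset(S): sum(f(x) * prod(x[i] for i in S) for x in points) / len(points)
        for k in range(n + 1)
        for S in combinations(range(n), k)
    }

def heavy_collection(coeffs, eps):
    """Return ({S : |f_hat(S)| >= eps / ||f_hat||_1}, ||f_hat||_1)."""
    l1 = sum(abs(c) for c in coeffs.values())
    return {S for S, c in coeffs.items() if abs(c) >= eps / l1}, l1

def and3(x):
    # Hypothetical example: outputs -1 only when all three bits are -1.
    return -1 if all(v == -1 for v in x) else 1

eps = 0.7
coeffs = fourier_coefficients(and3, 3)
collection, l1 = heavy_collection(coeffs, eps)
outside_weight = sum(c * c for S, c in coeffs.items() if S not in collection)
size_bound = (l1 / eps) ** 2
```

Here the empty set carries coefficient $\tfrac{3}{4}$ and every other coefficient has magnitude $\tfrac{1}{4}$, so with this `eps` only the empty set survives the threshold.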
Corollary 2.13 Any class of functions $\mathcal{C} = \{f : \|f\|_2^2 \le 1 \text{ and } \|\hat{f}\|_1 \le s\}$ is learnable with
random examples in time $\mathrm{poly}(s, 1/\epsilon)$.
Let us now consider functions computable by decision trees whose nodes branch on
arbitrary parities of variables. Figure 2 shows an example of a function computed by a decision
tree that branches on the parities of various subsets of variables. Another example is the parity function, which is
computable by a depth-1 parity decision tree.
Proposition 2.14 If a function $f : \{-1,1\}^n \to \{-1,1\}$ is expressible as a size-$s$ decision tree on
parities, then $\|\hat{f}\|_1 \le s$.
Figure 2:
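Proof: Let $1_P$ be the $\{0,1\}$-indicator function for a path $P$ in the decision tree. Let the path be $P = (\chi_{S_1} = b_1, \ldots, \chi_{S_d} = b_d)$, i.e., we get the path $P$ by taking the edges labeled $b_1, \ldots, b_d$ starting from the root node. We have
$$1_P = \left(\tfrac{1}{2} + \tfrac{1}{2} b_1 \chi_{S_1}\right) \cdots \left(\tfrac{1}{2} + \tfrac{1}{2} b_d \chi_{S_d}\right).$$
It can be seen that $\|\widehat{1_P}\|_1 = 1$. Since $f(x) = \sum_{\text{paths } P} 1_P(x)\, f(P)$, we have $\|\hat{f}\|_1 \le s$. $\Box$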
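The product formula for a parity path can be expanded using the identity $\chi_A \chi_B = \chi_{A \triangle B}$. The sketch below does this for a hypothetical two-step path and computes the resulting $\ell_1$ norm of the coefficients. The function name `parity_path_expansion` and the example path are assumptions made for illustration.

```python
from fractions import Fraction

def parity_path_expansion(path):
    """Fourier expansion of prod_i (1/2 + b_i * chi_{S_i} / 2) for constraints
    (S_i, b_i), using chi_A * chi_B = chi_{A symmetric-difference B}."""
    coeffs = {frozenset(): Fraction(1)}
    for S, b in path:
        S = frozenset(S)
        nxt = {}
        for T, c in coeffs.items():
            # Multiply the term c * chi_T by (1/2 + (b/2) * chi_S).
            for U, factor in ((T, Fraction(1, 2)), (T ^ S, Fraction(b, 2))):
                nxt[U] = nxt.get(U, Fraction(0)) + c * factor
        coeffs = {U: c for U, c in nxt.items() if c != 0}
    return coeffs

# Hypothetical path: chi_{0,1} = -1 at the root, then chi_{1,2} = +1.
path = [({0, 1}, -1), ({1, 2}, +1)]
expansion = parity_path_expansion(path)
l1 = sum(abs(c) for c in expansion.values())
```

For this path the four coefficients are $\pm\tfrac{1}{4}$ on the sets $\emptyset$, $\{0,1\}$, $\{1,2\}$ and $\{0,2\}$, giving $\ell_1$ norm 1.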
Definition 2.15 An AND of parities is called a coset.
Remark 2.16 If a function $f : \{-1,1\}^n \to \{-1,1\}$ is expressible as $\sum_{i=1}^{s} \pm 1_{P_i}$, where the $P_i$'s are
cosets, then $\|\hat{f}\|_1 \le s$.
Remark 2.17 Proposition 2.14 implies that we can learn all parity functions in $\mathrm{poly}(1/\epsilon)$ time.
Observe that this result does not follow directly from ordinary decision trees, since an ordinary decision tree computing the parity of $k$ variables needs $2^k$ leaves.
Theorem 2.18 [1] If a function $f : \{-1,1\}^n \to \{-1,1\}$ has $\|\hat{f}\|_1 \le s$, then
$$f = \sum_{i=1}^{2^{2^{O(s^4)}}} \pm 1_{P_i},$$
where the $P_i$'s are cosets.
Proposition 3.1 If $f$ has a size-$s$ DNF formula, it is $\epsilon$-close to a width-$\log(s/\epsilon)$ DNF.
Proof: Let $(I, X)$ be a random restriction with $\rho = \frac{1}{10w}$. We know from Håstad's switching lemma that
$f_{X \to \bar{I}}$ has decision tree depth greater than $d$ with probability less than $2^{-d}$. Hence the following sum is
nonzero (and at most 1) with probability less than $2^{-d}$:
$$\sum_{S \subseteq I,\ |S| > d} \widehat{f_{X \to \bar{I}}}(S)^2.$$
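Therefore, we have
$$
\begin{aligned}
2^{-d} &\ge \mathbf{E}_{(X,I)}\Big[ \sum_{S \subseteq I,\ |S| > d} \widehat{f_{X \to \bar{I}}}(S)^2 \Big] \\
&= \mathbf{E}_{I}\Big[ \mathbf{E}_{X \in \{-1,1\}^{|\bar{I}|}}\Big[ \sum_{S \subseteq I,\ |S| > d} \widehat{f_{X \to \bar{I}}}(S)^2 \Big] \Big] \\
&= \mathbf{E}_{I}\Big[ \sum_{S \subseteq I,\ |S| > d} \mathbf{E}_{X \in \{-1,1\}^{|\bar{I}|}}\big[ F_{S \subseteq I}(X)^2 \big] \Big] \qquad \big(\text{recall } F_{S \subseteq I}(x) = \widehat{f_{x \to \bar{I}}}(S)\big) \\
&= \mathbf{E}_{I}\Big[ \sum_{S \subseteq I,\ |S| > d}\ \sum_{T \subseteq \bar{I}} \widehat{F_{S \subseteq I}}(T)^2 \Big] \\
&= \mathbf{E}_{I}\Big[ \sum_{S \subseteq I,\ |S| > d}\ \sum_{T \subseteq \bar{I}} \hat{f}(S \cup T)^2 \Big] \\
&= \sum_{U} \hat{f}(U)^2 \Pr_{I}\big[ |U \cap I| > d \big].
\end{aligned}
$$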
Suppose $|U| \ge 20dw$. Then $|U \cap I|$ is binomially distributed with mean $|U|\rho \ge 20dw\rho = 2d$. Using the
Chernoff bound, we get that $\Pr_I[|U \cap I| > d] \ge \frac{1}{2}$ when $d \ge 5$. Therefore we have the following:
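$$
\begin{aligned}
&\sum_{U} \hat{f}(U)^2 \Pr_{I}\big[ |U \cap I| > d \big] \le 2^{-d} \\
\Rightarrow\ &\sum_{|U| \ge 20dw} \hat{f}(U)^2 \cdot \tfrac{1}{2} \le 2^{-d} \\
\Rightarrow\ &\sum_{|U| \ge 20dw} \hat{f}(U)^2 \le 2^{-d+1}.
\end{aligned}
$$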
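The binomial tail used in the last step can be evaluated exactly for small parameters. The sketch below computes $\Pr[\mathrm{Bin}(20dw, \rho) > d]$ with $\rho = \frac{1}{10w}$, which is the probability $\Pr_I[|U \cap I| > d]$ for a set with $|U| = 20dw$. The function name and the sample values $w = 3$, $d = 5$ are illustrative assumptions.

```python
from math import comb

def prob_more_than(m, p, d):
    """Exact Pr[Bin(m, p) > d], computed as 1 - Pr[Bin(m, p) <= d]."""
    at_most = sum(comb(m, k) * p ** k * (1 - p) ** (m - k) for k in range(d + 1))
    return 1 - at_most

w, d = 3, 5                      # hypothetical width and depth parameters
rho = 1 / (10 * w)               # restriction parameter from the proof
tail = prob_more_than(20 * d * w, rho, d)   # mean is 20*d*w*rho = 2*d
```

For these parameters the mean of the binomial is $2d = 10$, and the computed tail is comfortably above $\frac{1}{2}$, which is the direction needed to pass from the weighted sum to the bound $2^{-d+1}$.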
Remark 3.9 By putting $dw = w \log(1/\epsilon)$, we get the theorem 3.
Further References Yishay Mansour’s survey paper [4] also contains some of the ideas in these
lecture notes.
[1] B. Green and T. Sanders. A quantitative version of the idempotent theorem in harmonic analysis. ArXiv Mathematics e-prints, Nov. 2006.
[2] E. Kushilevitz and Y. Mansour. Learning decision trees using the Fourier spectrum. In STOC ’91: Proceedings of the twenty-third annual ACM symposium on Theory of computing, pages 455–464, New York, NY, USA, 1991. ACM Press.
[3] N. Linial, Y. Mansour, and N. Nisan. Constant depth circuits, Fourier transform, and learnability. J. ACM, 40(3):607–620, 1993.
[4] Y. Mansour. Learning boolean functions via the Fourier transform. In V. Roychowdhury, K.-Y. Siu, and A. Orlitsky, editors, Theoretical Advances in Neural Computation and Learning. Kluwer, 1994.