Learning Boolean Functions: Decision Trees and DNFs, Slides of Computer Architecture and Organization

A lecture note from Carnegie Mellon University (CMU) 18-859S, Spring 2007, on learning decision trees and DNFs (disjunctive normal forms) in the context of boolean functions. The lecture, delivered by Ryan O’Donnell and scribed by Suresh Purini, discusses the concentration of functions, learning algorithms, and the relationship between decision trees and DNFs. It also covers the Fourier expansion of boolean functions and the learnability of decision trees.

Analysis of Boolean Functions (CMU 18-859S, Spring 2007)
Lecture 9: Learning Decision Trees and DNFs
Feb. 18, 2007
Lecturer: Ryan O’Donnell Scribe: Suresh Purini
1 Two Important Learning Algorithms
We recall the following definition and two important learning algorithms discussed in the previous lecture.
Definition 1.1 Given a collection $\mathcal{S}$ of subsets of $[n]$, we say $f : \{-1,1\}^n \to \mathbb{R}$ has $\epsilon$-concentration on $\mathcal{S}$ if $\sum_{S \notin \mathcal{S}} \hat{f}(S)^2 \le \epsilon$.
Theorem 1.2 Let $\mathcal{C}$ be a class of $n$-bit functions such that every $f \in \mathcal{C}$ is $\epsilon$-concentrated on $\mathcal{S} = \{S \subseteq [n] : |S| \le d\}$. Then the function class $\mathcal{C}$ is learnable under the uniform distribution to an accuracy of $O(\epsilon)$, with probability at least $1 - \delta$, in time $\mathrm{poly}(|\mathcal{S}|, 1/\epsilon)\,\mathrm{poly}(n)\log(1/\delta)$ using random examples only.
This algorithm is called the Low Degree algorithm and was proposed by Linial, Mansour and Nisan in [3]. See Theorem 5.4 in the notes for Lecture 8.
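As a rough illustration of the Low Degree algorithm, the following Python sketch estimates every coefficient $\hat{f}(S)$ with $|S| \le d$ as an empirical average of $f(x)\chi_S(x)$ over uniform random examples, then predicts with the sign of the resulting polynomial. The function names (`low_degree_learn`, `hypothesis`, `sample_examples`) and the 0-based variable indexing are assumptions made for this sketch, and the number of samples needed for a given accuracy is not analysed here.

```python
import itertools
import random


def chi(S, x):
    """Parity character chi_S(x) = product of x[i] for i in S, x in {-1,+1}^n."""
    p = 1
    for i in S:
        p *= x[i]
    return p


def low_degree_learn(examples, n, d):
    """Estimate f^(S) for every |S| <= d as the empirical mean of f(x)*chi_S(x).

    `examples` is a list of (x, f(x)) pairs with x drawn uniformly at random.
    Returns a dict mapping each subset (a tuple of indices) to its estimate.
    """
    m = len(examples)
    coeffs = {}
    for k in range(d + 1):
        for S in itertools.combinations(range(n), k):
            coeffs[S] = sum(y * chi(S, x) for x, y in examples) / m
    return coeffs


def hypothesis(coeffs, x):
    """Predict with the sign of the estimated low-degree polynomial at x."""
    value = sum(c * chi(S, x) for S, c in coeffs.items())
    return 1 if value >= 0 else -1


def sample_examples(f, n, m, rng):
    """Draw m uniform random labelled examples of f (helper for experiments)."""
    xs = [tuple(rng.choice((-1, 1)) for _ in range(n)) for _ in range(m)]
    return [(x, f(x)) for x in xs]
```

For example, calling `low_degree_learn(sample_examples(f, 4, 4000, random.Random(0)), 4, 2)` with `f = lambda x: x[0] * x[1]` recovers a coefficient of exactly 1 on the set `(0, 1)`, since $f(x)\chi_S(x) = 1$ on every sample for that set, while the other estimates are sample averages of mean-zero quantities.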
Theorem 1.3 Let $\mathcal{C}$ be a class of $n$-bit functions such that every $f \in \mathcal{C}$ is $\epsilon$-concentrated on some collection $\mathcal{S}$. Then the function class $\mathcal{C}$ is learnable using membership queries (Goldreich-Levin algorithm) in time $\mathrm{poly}(|\mathcal{S}|, 1/\epsilon)\,\mathrm{poly}(n)\log(1/\delta)$.
This algorithm is called the Kushilevitz-Mansour algorithm [2]. See Corollary 5.5 in the notes for Lecture 8.
2 Learning Decision Trees
A decision tree is a binary tree in which the internal nodes are labeled with variables and the leaves are labeled with either $-1$ or $+1$. The left and right edges leaving any internal node are labeled $-1$ and $+1$ respectively. A decision tree defines a boolean function in the natural way: starting at the root, follow the edge labeled with the value of the variable at the current node until a leaf is reached, and output the label of that leaf. For example, the decision tree in Figure 1 defines a boolean function whose DNF formula is $x_1 x_2 x_3 + x_1 \bar{x}_2 x_4 + \bar{x}_1 x_2$.
Note that every boolean function can be computed by some decision tree.
Let $P$ be a path in the decision tree. An example of a path in Figure 1 is $P = (x_1 = -1, x_2 = +1, x_4 = -1)$.

Figure 1:

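Let $1_P : \{-1,1\}^n \to \{0,1\}$ be the indicator function for the path $P$. For example, for $P = (x_1 = -1, x_2 = +1, x_4 = -1)$,

$$1_P(x) = \begin{cases} 1 & \text{if } x_1 = -1,\ x_2 = +1,\ x_4 = -1, \\ 0 & \text{otherwise.} \end{cases}$$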

Observation 2.1 A boolean function $f$ can be expressed in terms of the path functions $1_P$ corresponding to the paths in a decision tree for $f$ as follows:

$$f(x) = \sum_{\text{paths } P} 1_P(x)\, f(P),$$

where $f(P)$ is the label on the leaf reached by following the path $P$ in the decision tree.
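The decomposition in Observation 2.1 can be checked mechanically. Figure 1 is not reproduced in these notes, so the tree below is a hypothetical one chosen to be consistent with the DNF $x_1 x_2 x_3 + x_1 \bar{x}_2 x_4 + \bar{x}_1 x_2$ under the assumption that $-1$ encodes "true"; the tuple encoding and all function names are likewise assumptions of this sketch.

```python
import itertools

# Hypothetical stand-in for Figure 1 (assumption: -1 encodes "true").
# An internal node is (variable, subtree on edge -1, subtree on edge +1);
# an int is a leaf label.
TREE = (1, (2, (3, -1, 1), (4, -1, 1)), (2, -1, 1))


def evaluate(tree, x):
    """Follow the edges selected by x (a dict variable -> -1/+1) down to a leaf."""
    while not isinstance(tree, int):
        var, minus_branch, plus_branch = tree
        tree = minus_branch if x[var] == -1 else plus_branch
    return tree


def paths(tree, prefix=()):
    """Yield (path, leaf_label); a path is a tuple of (variable, value) pairs."""
    if isinstance(tree, int):
        yield prefix, tree
        return
    var, minus_branch, plus_branch = tree
    yield from paths(minus_branch, prefix + ((var, -1),))
    yield from paths(plus_branch, prefix + ((var, 1),))


def indicator(path, x):
    """1_P(x): 1 if x agrees with every (variable, value) pair on the path."""
    return int(all(x[var] == val for var, val in path))


def via_paths(tree, x):
    """Right-hand side of Observation 2.1: sum over paths of 1_P(x) * f(P)."""
    return sum(indicator(p, x) * leaf for p, leaf in paths(tree))


def all_inputs(variables=(1, 2, 3, 4)):
    """Every assignment of -1/+1 to the given variables, as dicts."""
    for values in itertools.product((-1, 1), repeat=len(variables)):
        yield dict(zip(variables, values))
```

Exactly one path indicator is nonzero on any input, so `via_paths(TREE, x)` picks out the single leaf that `evaluate(TREE, x)` reaches.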

Observation 2.2 Let $V$ be the set of variables occurring in a path function $1_P$ and let $d$ be the cardinality of the set $V$. Then the Fourier expansion of $1_P$ looks like

$$\sum_{S \subseteq V} \pm 2^{-d} \chi_S.$$

It is easy to see the proof of the above observation by noting that the Fourier expansion for the path function $1_P$, when $P = (x_1 = -1, x_2 = +1, x_4 = -1)$, is

$$1_P = x_1 \bar{x}_2 x_4 = \left(\tfrac{1}{2} - \tfrac{1}{2} x_1\right)\left(\tfrac{1}{2} + \tfrac{1}{2} x_2\right)\left(\tfrac{1}{2} - \tfrac{1}{2} x_4\right).$$
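A brute-force check of this expansion is sketched below. It computes $\hat{g}(S) = \mathbf{E}_x[g(x)\chi_S(x)]$ exactly with rational arithmetic for the path $P = (x_1 = -1, x_2 = +1, x_4 = -1)$ on $n = 4$ variables. The helper names and the 0-based indexing (so $x_1$ is `x[0]`) are assumptions of this sketch.

```python
import itertools
from fractions import Fraction


def fourier_coefficients(g, n):
    """Brute-force g^(S) = E_x[g(x) * chi_S(x)] over all x in {-1,+1}^n.

    Returns a dict mapping each subset S (a tuple of 0-based indices) to an
    exact Fraction. Runs in time about 4^n, so only suitable for tiny n.
    """
    points = list(itertools.product((-1, 1), repeat=n))
    coeffs = {}
    for k in range(n + 1):
        for S in itertools.combinations(range(n), k):
            total = 0
            for x in points:
                chi = 1
                for i in S:
                    chi *= x[i]
                total += g(x) * chi
            coeffs[S] = Fraction(total, len(points))
    return coeffs


def path_indicator(x):
    """1_P for P = (x1 = -1, x2 = +1, x4 = -1); x is 0-indexed."""
    return int(x[0] == -1 and x[1] == 1 and x[3] == -1)
```

Here $V = \{x_1, x_2, x_4\}$ and $d = 3$, so the nonzero coefficients should be exactly those on subsets of `{0, 1, 3}`, each equal to $\pm 1/8$ with the sign given by the product expansion.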

Proposition 2.3 If $f : \{-1,1\}^n \to \{-1,1\}$ is computable by a depth-$d$ decision tree then

1. The Fourier expansion of $f$ has degree at most $d$, i.e., $\sum_{|S| > d} \hat{f}(S)^2 = 0$.

2. All Fourier coefficients are integer multiples of $2^{-d}$.

3. The number of nonzero Fourier coefficients is at most $4^d$.
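The three properties can be sanity-checked on a small example. The sketch below uses a hypothetical depth-2 tree on 5 variables (the root queries `x[0]`, then `x[1]` or `x[2]`); the tree, the function names, and the 0-based indexing are assumptions made for illustration, and a brute-force check on one tree is of course not a proof.

```python
import itertools
from fractions import Fraction


def fourier_coefficients(f, n):
    """Exact f^(S) = E_x[f(x) * chi_S(x)] for every S, by enumeration."""
    points = list(itertools.product((-1, 1), repeat=n))
    coeffs = {}
    for k in range(n + 1):
        for S in itertools.combinations(range(n), k):
            total = 0
            for x in points:
                chi = 1
                for i in S:
                    chi *= x[i]
                total += f(x) * chi
            coeffs[S] = Fraction(total, len(points))
    return coeffs


def tree_function(x):
    """Hypothetical depth-2 decision tree: query x[0], then x[1] or x[2]."""
    if x[0] == -1:
        return -1 if x[1] == -1 else 1
    return 1 if x[2] == -1 else -1


def check_proposition_2_3(f, n, d):
    """Return (degree_ok, multiples_ok, count_ok) for the three claims."""
    coeffs = fourier_coefficients(f, n)
    nonzero = {S: c for S, c in coeffs.items() if c != 0}
    degree_ok = all(len(S) <= d for S in nonzero)
    multiples_ok = all((c * 2**d).denominator == 1 for c in nonzero.values())
    count_ok = len(nonzero) <= 4**d
    return degree_ok, multiples_ok, count_ok
```

For this tree, `check_proposition_2_3(tree_function, 5, 2)` reports all three properties as holding. The only nonzero coefficients sit on `(1,)`, `(0, 1)`, `(2,)` and `(0, 2)`, each of magnitude $1/2$.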

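Proposition 2.11 If $f$ has a decision tree of size $s$, then $\|\hat{f}\|_1 \le s$.

Proof:

$$\|\hat{f}\|_1 = \Big\| \sum_{\text{paths } P} f(P)\, \widehat{1_P} \Big\|_1 \le \sum_{\text{paths } P} |f(P)|\, \|\widehat{1_P}\|_1 = \sum_{\text{paths } P} \|\widehat{1_P}\|_1 \le s,$$

since $|f(P)| = 1$, each $\|\widehat{1_P}\|_1 = 1$ by Observation 2.2 ($2^d$ coefficients, each of magnitude $2^{-d}$), and the tree has at most $s$ paths.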

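Proposition 2.12 Given any function $f$ with $\|f\|_2^2 \le 1$ and $\epsilon > 0$, let $\mathcal{S} = \{S \subseteq [n] : |\hat{f}(S)| \ge \epsilon / \|\hat{f}\|_1\}$. Then $f$ is $\epsilon$-concentrated on $\mathcal{S}$. Note that $|\mathcal{S}| \le \left(\|\hat{f}\|_1 / \epsilon\right)^2$.

Proof:

$$\begin{aligned}
\sum_{S \notin \mathcal{S}} \hat{f}(S)^2 &\le \max_{S \notin \mathcal{S}} |\hat{f}(S)| \Big[ \sum_{S \notin \mathcal{S}} |\hat{f}(S)| \Big] \\
&\le \max_{S \notin \mathcal{S}} |\hat{f}(S)| \Big[ \sum_{S \notin \mathcal{S}} |\hat{f}(S)| + \sum_{S \in \mathcal{S}} |\hat{f}(S)| \Big] \\
&\le \frac{\epsilon}{\|\hat{f}\|_1} \cdot \|\hat{f}\|_1 = \epsilon.
\end{aligned}$$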
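The thresholding in Proposition 2.12 is easy to try on a concrete spectrum. The sketch below takes a dictionary of Fourier coefficients, keeps the sets whose coefficient has magnitude at least $\epsilon / \|\hat{f}\|_1$, and reports the Fourier weight left outside. The example spectrum is that of the 3-bit AND (with $-1$ read as "true"), written out by hand; the names `AND3_COEFFS` and `large_coefficients` are assumptions of this sketch.

```python
from fractions import Fraction

# Spectrum of f(x) = -1 if x = (-1, -1, -1) else +1, i.e. f = 1 - 2 * 1_{all -1}.
# Hand-computed example data (an assumption of this sketch, not from the notes).
AND3_COEFFS = {
    (): Fraction(3, 4),
    (0,): Fraction(1, 4), (1,): Fraction(1, 4), (2,): Fraction(1, 4),
    (0, 1): Fraction(-1, 4), (0, 2): Fraction(-1, 4), (1, 2): Fraction(-1, 4),
    (0, 1, 2): Fraction(1, 4),
}


def large_coefficients(coeffs, eps):
    """Return (collection, tail_weight, l1_norm) for the given spectrum.

    collection  : sets S with |f^(S)| >= eps / ||f^||_1
    tail_weight : sum of f^(S)^2 over S outside the collection
    Assumes the spectrum is not identically zero (otherwise l1_norm is 0).
    """
    l1 = sum(abs(c) for c in coeffs.values())
    threshold = eps / l1
    collection = {S for S, c in coeffs.items() if abs(c) >= threshold}
    tail = sum(c * c for S, c in coeffs.items() if S not in collection)
    return collection, tail, l1
```

With $\epsilon = 3/4$ the $\ell_1$ norm is $5/2$, the threshold is $3/10$, only the empty set survives, and the tail weight is $7/16 \le \epsilon$, in line with the proposition.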

Corollary 2.13 Any class of functions $\mathcal{C} = \{f : \|f\|_2^2 \le 1 \text{ and } \|\hat{f}\|_1 \le s\}$ is learnable with random examples in time poly($s$, $1/\epsilon$

Let us now consider functions computable by decision trees whose nodes branch on arbitrary parities of variables. Figure 2 contains an example of a function computed by a decision tree that branches on the parities of various subsets of variables. Another example is the parity function itself, which is computable by a depth-1 parity decision tree.

Proposition 2.14 If a function $f : \{-1,1\}^n \to \{-1,1\}$ is expressible as a size-$s$ decision tree on parities, then $\|\hat{f}\|_1 \le s$.

Figure 2:

Proof: Let $1_P$ be the $\{0,1\}$-indicator function for a path $P$ in the decision tree. Let the path be $P = (\chi_{S_1} = b_1, \dots, \chi_{S_d} = b_d)$, i.e., we get the path $P$ by taking the edges labeled $b_1, \dots, b_d$ starting from the root node. We have

$$1_P = \left(\tfrac{1}{2} + \tfrac{1}{2} b_1 \chi_{S_1}\right) \cdots \left(\tfrac{1}{2} + \tfrac{1}{2} b_d \chi_{S_d}\right).$$

It can be seen that $\|\widehat{1_P}\|_1 = 1$. Since $f(x) = \sum_{\text{paths } P} 1_P(x) f(P)$, we have $\|\hat{f}\|_1 \le s$. $\Box$
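To see where the value of $\|\widehat{1_P}\|_1$ comes from, it may help to expand the product for $d = 2$. The worked case below assumes $S_1$ and $S_2$ are nonempty and distinct, so that the four characters that appear are distinct. It uses $\chi_{S_1}\chi_{S_2} = \chi_{S_1 \triangle S_2}$, where $\triangle$ denotes symmetric difference.

```latex
\begin{aligned}
1_P &= \left(\tfrac12 + \tfrac12 b_1\chi_{S_1}\right)\left(\tfrac12 + \tfrac12 b_2\chi_{S_2}\right) \\
    &= \tfrac14\left(1 + b_1\chi_{S_1} + b_2\chi_{S_2} + b_1 b_2\,\chi_{S_1 \triangle S_2}\right), \\
\|\widehat{1_P}\|_1 &= 4 \cdot \tfrac14 = 1 .
\end{aligned}
```

The same pattern gives $2^d$ terms of magnitude $2^{-d}$ for general $d$, provided the products of the $\chi_{S_i}$ are all distinct characters.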

Definition 2.15 An AND of parities is called a coset.

Remark 2.16 If a function $f : \{-1,1\}^n \to \{-1,1\}$ is expressible as $\sum_{i=1}^{s} \pm 1_{P_i}$, where the $P_i$'s are cosets, then $\|\hat{f}\|_1 \le s$.

Remark 2.17 Proposition 2.14 implies that we can learn all parity functions in $\mathrm{poly}(1/\epsilon)$ time. Observe that this result is not apparent from ordinary decision trees for parity functions.

Theorem 2.18 [1] If a function $f : \{-1,1\}^n \to \{-1,1\}$ has $\|\hat{f}\|_1 \le s$, then

$$f = \sum_{i=1}^{2^{2^{O(s^4)}}} \pm 1_{P_i},$$

where the $P_i$'s are cosets.

3 Learning DNFs

Proposition 3.1 If $f$ has a size-$s$ DNF formula, it is $\epsilon$-close to a width-$\log(s/\epsilon)$ DNF.

Proof: Let $(I, X)$ be a random restriction with $\rho = \frac{1}{10w}$. We know from Håstad's switching lemma that $f_{X \to \bar{I}}$ has decision tree depth greater than $d$ with probability less than $2^{-d}$. Hence the following sum is nonzero (and at most 1) with probability less than $2^{-d}$:

$$\sum_{S \subseteq I,\ |S| > d} \widehat{f_{X \to \bar{I}}}(S)^2.$$

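Therefore, we have

$$\begin{aligned}
2^{-d} &\ge \mathop{\mathbf{E}}_{(X,I)} \Big[ \sum_{S \subseteq I,\ |S| > d} \widehat{f_{X \to \bar{I}}}(S)^2 \Big] \\
&= \mathop{\mathbf{E}}_{I} \Big[ \mathop{\mathbf{E}}_{X \in \{-1,1\}^{|\bar{I}|}} \Big[ \sum_{S \subseteq I,\ |S| > d} \widehat{f_{X \to \bar{I}}}(S)^2 \Big] \Big] \\
&= \mathop{\mathbf{E}}_{I} \Big[ \sum_{S \subseteq I,\ |S| > d} \mathop{\mathbf{E}}_{X \in \{-1,1\}^{|\bar{I}|}} \big[ F_{S \subseteq I}(X)^2 \big] \Big] \qquad \text{(recall } F_{S \subseteq I}(x) = \widehat{f_{x \to \bar{I}}}(S)\text{)} \\
&= \mathop{\mathbf{E}}_{I} \Big[ \sum_{S \subseteq I,\ |S| > d}\ \sum_{T \subseteq \bar{I}} \widehat{F_{S \subseteq I}}(T)^2 \Big] \\
&= \mathop{\mathbf{E}}_{I} \Big[ \sum_{S \subseteq I,\ |S| > d}\ \sum_{T \subseteq \bar{I}} \hat{f}(S \cup T)^2 \Big] \\
&= \sum_{U} \hat{f}(U)^2 \Pr_{I}\big[|U \cap I| > d\big].
\end{aligned}$$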

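Suppose $|U| \ge 20dw$. Then $|U \cap I|$ is binomially distributed with mean $|U|\rho \ge 20dw\rho = 2d$. Using the Chernoff bound, we get that $\Pr_I[|U \cap I| > d] \ge \frac{1}{2}$ when $d \ge 5$. Therefore we have the following:

$$\begin{aligned}
\sum_{U} \hat{f}(U)^2 \Pr_{I}\big[|U \cap I| > d\big] \le 2^{-d}
&\implies \frac{1}{2} \sum_{|U| \ge 20dw} \hat{f}(U)^2 \le 2^{-d} \\
&\implies \sum_{|U| \ge 20dw} \hat{f}(U)^2 \le 2^{-d+1}.
\end{aligned}$$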
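The probability used in the Chernoff step can also be computed exactly for small parameters. The sketch below evaluates $\Pr[\mathrm{Bin}(m, \rho) > d]$ directly from the binomial formula, with $m = 20dw$ and $\rho = 1/(10w)$ so that the mean is $2d$. The function names are assumptions of this sketch, and the sketch treats only the boundary case $|U| = 20dw$.

```python
from math import comb


def tail_probability(size, rho, d):
    """Exact Pr[Bin(size, rho) > d], computed as 1 - Pr[Bin(size, rho) <= d]."""
    at_most_d = sum(
        comb(size, k) * rho**k * (1 - rho) ** (size - k) for k in range(d + 1)
    )
    return 1.0 - at_most_d


def restriction_tail(d, w):
    """Pr[|U ∩ I| > d] when |U| = 20*d*w and each element joins I w.p. 1/(10*w)."""
    return tail_probability(20 * d * w, 1.0 / (10 * w), d)
```

Since the mean $2d$ is twice the cutoff $d$, `restriction_tail(d, w)` comes out well above $1/2$ for values such as $d = 5, w = 1$. This is the direction of the inequality that the final bound relies on.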

Remark 3.9 By putting $dw = w \log(\frac{1}{\epsilon})$, we get the theorem 3.
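Spelling out the substitution, under the assumption that the logarithm is base 2: taking $d = \log(1/\epsilon)$ in the bound $\sum_{|U| \ge 20dw} \hat{f}(U)^2 \le 2^{-d+1}$ reached at the end of the proof gives

```latex
\sum_{|U| \,\ge\, 20\, w \log(1/\epsilon)} \hat{f}(U)^2 \;\le\; 2^{-\log(1/\epsilon) + 1} \;=\; 2\epsilon .
```

So all but $2\epsilon$ of the Fourier weight sits on sets of size below $20 w \log(1/\epsilon)$, which is the kind of low-degree concentration that Theorem 1.2 asks for. The Chernoff step requires $d \ge 5$, which here means $\epsilon \le 2^{-5}$.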

Further References Yishay Mansour’s survey paper [4] also contains some of the ideas in these lecture notes.

References

[1] B. Green and T. Sanders. A quantitative version of the idempotent theorem in harmonic analysis. ArXiv Mathematics e-prints, Nov. 2006.

[2] E. Kushilevitz and Y. Mansour. Learning decision trees using the Fourier spectrum. In STOC ’91: Proceedings of the twenty-third annual ACM symposium on Theory of computing, pages 455–464, New York, NY, USA, 1991. ACM Press.

[3] N. Linial, Y. Mansour, and N. Nisan. Constant depth circuits, Fourier transform, and learnability. J. ACM, 40(3):607–620, 1993.

[4] Y. Mansour. Learning boolean functions via the Fourier transform. In V. Roychowdhury, K.-Y. Siu, and A. Orlitsky, editors, Theoretical Advances in Neural Computation and Learning. Kluwer, 1994.