Naive Bayes Classifier: A Probabilistic Approach to Text Classification - Prof. Dan Roth, Study notes of Computer Science

These notes give an overview of the naive Bayes classifier, a probabilistic machine learning algorithm used for text classification. They cover the naive Bayes assumptions, the learning method, and the process of estimating probabilities. They also include worked classification examples and a discussion of smoothing and robust estimation of probabilities.

Typology: Study notes

Pre 2010

Uploaded on 03/16/2009

koofers-user-o95


Bayesian Classifier

f: X → V, where V is a finite set of values

Instances x ∈ X can be described as a collection of features:

$$x = (x_1, x_2, \ldots, x_n), \quad x_i \in \{0,1\}$$

Given an example, assign it the most probable value in V:

$$v_{MAP} = \arg\max_{v_j \in V} P(v_j \mid x) = \arg\max_{v_j \in V} P(v_j \mid x_1, x_2, \ldots, x_n)$$

Bayes Rule:

$$P(v \mid x) = \frac{P(x \mid v)\, P(v)}{P(x)}$$

Notational convention: P(y) means P(Y=y)

$$v_{MAP} = \arg\max_{v_j \in V} \frac{P(x_1, x_2, \ldots, x_n \mid v_j)\, P(v_j)}{P(x_1, x_2, \ldots, x_n)} = \arg\max_{v_j \in V} P(x_1, x_2, \ldots, x_n \mid v_j)\, P(v_j)$$
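The last equality drops the denominator because P(x_1, ..., x_n) is the same for every candidate value v_j. The sketch below checks this on a small made-up model; the class names `a` and `b`, the prior, and the likelihood table are hypothetical and only serve the illustration.

```python
from fractions import Fraction as F

# Hypothetical toy model (not from the notes): two classes over two binary features.
prior = {"a": F(1, 4), "b": F(3, 4)}
likelihood = {
    "a": {(0, 0): F(1, 10), (0, 1): F(2, 10), (1, 0): F(3, 10), (1, 1): F(4, 10)},
    "b": {(0, 0): F(4, 10), (0, 1): F(3, 10), (1, 0): F(2, 10), (1, 1): F(1, 10)},
}

def posterior(x):
    """P(v | x) by Bayes rule, with the evidence P(x) computed explicitly."""
    evidence = sum(likelihood[v][x] * prior[v] for v in prior)
    return {v: likelihood[v][x] * prior[v] / evidence for v in prior}

def v_map(x):
    """argmax over v of P(x | v) P(v); the evidence P(x) is never needed."""
    return max(prior, key=lambda v: likelihood[v][x] * prior[v])

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    post = posterior(x)
    # The class with the largest posterior is the class with the largest P(x | v) P(v).
    assert max(post, key=post.get) == v_map(x)
    print(x, v_map(x), {v: str(p) for v, p in post.items()})
```

Exact fractions are used so that the comparison is free of rounding effects.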

Naïve Bayes

$$v_{MAP} = \arg\max_{v_j \in V} P(x_1, x_2, \ldots, x_n \mid v_j)\, P(v_j)$$

Assumption: feature values are independent given the target value

$$\begin{aligned}
P(x_1, x_2, \ldots, x_n \mid v_j) &= P(x_1 \mid x_2, \ldots, x_n, v_j)\, P(x_2, \ldots, x_n \mid v_j) \\
&= P(x_1 \mid x_2, \ldots, x_n, v_j)\, P(x_2 \mid x_3, \ldots, x_n, v_j)\, P(x_3, \ldots, x_n \mid v_j) \\
&= \ldots \\
&= P(x_1 \mid x_2, \ldots, x_n, v_j)\, P(x_2 \mid x_3, \ldots, x_n, v_j)\, P(x_3 \mid x_4, \ldots, x_n, v_j) \cdots P(x_n \mid v_j) \\
&= \prod_{i=1}^{n} P(x_i \mid v_j)
\end{aligned}$$
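Every line of this derivation except the last is the chain rule and holds for any distribution; only the final step uses the independence assumption. A minimal numeric check of that distinction, using an arbitrary made-up joint over three binary features for one fixed class (the weights are hypothetical), might look like this:

```python
from itertools import product
from fractions import Fraction as F

# Hypothetical joint P(x1, x2, x3 | v) for one fixed class v: arbitrary
# positive weights, normalised. Nothing here comes from the notes.
points = list(product([0, 1], repeat=3))
weights = [1, 2, 3, 4, 5, 6, 7, 8]
total = sum(weights)
joint = {x: F(w, total) for x, w in zip(points, weights)}

def marginal(fixed):
    """Probability that x[i] == fixed[i] for every index i in `fixed`."""
    return sum(p for x, p in joint.items()
               if all(x[i] == b for i, b in fixed.items()))

def chain_rule(x):
    """P(x1 | x2, x3) * P(x2 | x3) * P(x3), all conditional on the class."""
    p3 = marginal({2: x[2]})
    p23 = marginal({1: x[1], 2: x[2]})
    p123 = marginal({0: x[0], 1: x[1], 2: x[2]})
    return (p123 / p23) * (p23 / p3) * p3

def naive_product(x):
    """P(x1) * P(x2) * P(x3): the naive Bayes factorisation."""
    return marginal({0: x[0]}) * marginal({1: x[1]}) * marginal({2: x[2]})

chain_exact = all(chain_rule(x) == joint[x] for x in points)
naive_exact = all(naive_product(x) == joint[x] for x in points)
print(chain_exact, naive_exact)
```

For this particular joint the chain-rule product reproduces every entry while the naive product does not, since the features were not constructed to be independent.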

Naïve Bayes

Assumption: feature values are independent given the target value

$$P(x_1 = b_1, x_2 = b_2, \ldots, x_n = b_n \mid v = v_j) = \prod_{i=1}^{n} P(x_i = b_i \mid v = v_j)$$

Generative model:

First choose a value $v_j \in V$ according to $P(v_j)$

For each $v_j$: choose $x_1, x_2, \ldots, x_n$ according to $P(x_k \mid v_j)$
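The two-step generative story can be sketched as a sampler. The function name `sample_instance`, the parameter values, and the seed are assumptions made for illustration; features are taken to be binary, as in the notes.

```python
import random

def sample_instance(prior, cond, rng):
    """Draw (v, x) from the naive Bayes generative model.

    prior: dict mapping class value v -> P(v)
    cond:  dict mapping class value v -> list of P(x_k = 1 | v), one per feature
    """
    values = list(prior)
    # Step 1: choose the target value according to P(v).
    v = rng.choices(values, weights=[prior[u] for u in values], k=1)[0]
    # Step 2: given v, draw every feature independently of the others.
    x = tuple(1 if rng.random() < p else 0 for p in cond[v])
    return v, x

# Hypothetical parameters for two classes and three binary features.
prior = {0: 0.3, 1: 0.7}
cond = {0: [0.9, 0.2, 0.5], 1: [0.1, 0.8, 0.5]}

rng = random.Random(0)
samples = [sample_instance(prior, cond, rng) for _ in range(20000)]
frac_v1 = sum(v for v, _ in samples) / len(samples)
x1_given_v1 = [x[0] for v, x in samples if v == 1]
freq_x1_given_v1 = sum(x1_given_v1) / len(x1_given_v1)
print(round(frac_v1, 2), round(freq_x1_given_v1, 2))
```

With enough samples the empirical frequencies should approach the parameters that generated them, which is what the counting-based learning method relies on.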

Naïve Bayes

Assumption: feature values are independent given the target value

Learning method: Estimate n|V| parameters (the conditional probabilities P(x_i = 1 | v_j) for binary features, together with the class priors P(v_j)) and use them to compute the new value.

This is learning without search. Given a collection of training examples, you just compute the best hypothesis (given the assumptions).

This is learning without trying to achieve consistency or even approximate consistency.
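"Learning without search" amounts to one counting pass over the data. The following is a minimal sketch under the assumption of binary features and plain relative-frequency estimates (no smoothing); `fit_naive_bayes`, `predict`, and the six training examples are hypothetical names and data, not taken from the notes.

```python
from collections import Counter, defaultdict

def fit_naive_bayes(examples):
    """Estimate P(v) and P(x_i = 1 | v) by relative frequency.

    examples: list of (x, v) pairs, where x is a tuple of 0/1 feature values.
    """
    n = len(examples[0][0])
    class_counts = Counter(v for _, v in examples)
    ones = defaultdict(lambda: [0] * n)  # ones[v][i] = count of x_i = 1 in class v
    for x, v in examples:
        for i, bit in enumerate(x):
            ones[v][i] += bit
    prior = {v: c / len(examples) for v, c in class_counts.items()}
    cond = {v: [k / class_counts[v] for k in ones[v]] for v in class_counts}
    return prior, cond

def predict(prior, cond, x):
    """Return the class maximising P(v) * prod_i P(x_i | v)."""
    def score(v):
        s = prior[v]
        for p, bit in zip(cond[v], x):
            s *= p if bit else 1 - p
        return s
    return max(prior, key=score)

# Hypothetical training set: two binary features, two classes.
examples = [((1, 0), 1), ((1, 1), 1), ((0, 1), 0),
            ((0, 0), 0), ((1, 1), 1), ((0, 1), 0)]
prior, cond = fit_naive_bayes(examples)
n_conditionals = sum(len(c) for c in cond.values())  # n * |V|
print(prior, cond, n_conditionals, predict(prior, cond, (1, 0)))
```

Nothing is optimised and no hypothesis space is explored: the estimates are fixed by the counts, and the fitted model is not guaranteed to classify its own training data correctly.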

Why does it work?

Conditional Independence

Notice that the feature values are assumed to be conditionally independent given the target value; they are not required to be independent.

Example: f(x,y) = x ∧ y over the product distribution defined by p(x=0) = p(x=1) = 1/2 and p(y=0) = p(y=1) = 1/2.

The distribution is defined so that x and y are independent: p(x,y) = p(x)p(y) for every value of x and y.

But, given that f(x,y) = 0:

p(x=1 | f=0) = p(y=1 | f=0) = 1/3, while p(x=1, y=1 | f=0) = 0,

so x and y are not conditionally independent.
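The numbers in this example can be confirmed by enumerating the four equally likely outcomes. This is a small sketch; the helper name `prob` is an assumption.

```python
from itertools import product
from fractions import Fraction as F

# Uniform product distribution over (x, y), as in the example.
joint = {(x, y): F(1, 4) for x, y in product([0, 1], repeat=2)}

def prob(event, given=lambda x, y: True):
    """P(event | given) by direct enumeration of the four outcomes."""
    den = sum(p for (x, y), p in joint.items() if given(x, y))
    num = sum(p for (x, y), p in joint.items() if given(x, y) and event(x, y))
    return num / den

def f_is_0(x, y):
    """The conditioning event f(x, y) = x AND y = 0."""
    return (x & y) == 0

px = prob(lambda x, y: x == 1, f_is_0)
py = prob(lambda x, y: y == 1, f_is_0)
pxy = prob(lambda x, y: x == 1 and y == 1, f_is_0)
print(px, py, pxy, px * py)  # prints: 1/3 1/3 0 1/9
```

Conditional independence would require the third value to equal the fourth, and 0 differs from 1/9.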

Conditional Independence

The other direction also does not hold: x and y can be conditionally independent but not independent.

f=0: p(x=1|f=0) = 1, p(y=1|f=0) = 0

f=1: p(x=1|f=1) = 0, p(y=1|f=1) = 1

and assume, say, that p(f=0) = p(f=1) = 1/2.

Given the value of f, x and y are independent.

What about unconditional independence?

p(x=1) = p(x=1|f=0)p(f=0) + p(x=1|f=1)p(f=1) = 0.5 + 0 = 0.5

p(y=1) = p(y=1|f=0)p(f=0) + p(y=1|f=1)p(f=1) = 0 + 0.5 = 0.5

But p(x=1, y=1) = p(x=1,y=1|f=0)p(f=0) + p(x=1,y=1|f=1)p(f=1) = 0, whereas p(x=1)p(y=1) = 0.25, so x and y are not independent.
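The same marginalisation over f can be written out directly. In this sketch the dictionary names are assumptions; the probabilities are the ones stated in the example, with p(f=0) = p(f=1) = 1/2.

```python
from fractions import Fraction as F

p_f = {0: F(1, 2), 1: F(1, 2)}
p_x1 = {0: F(1), 1: F(0)}  # p(x=1 | f)
p_y1 = {0: F(0), 1: F(1)}  # p(y=1 | f)

# Sum over f. The joint term uses conditional independence given f:
# p(x=1, y=1 | f) = p(x=1 | f) * p(y=1 | f).
p_x = sum(p_x1[f] * p_f[f] for f in p_f)
p_y = sum(p_y1[f] * p_f[f] for f in p_f)
p_xy = sum(p_x1[f] * p_y1[f] * p_f[f] for f in p_f)
print(p_x, p_y, p_xy, p_x * p_y)  # prints: 1/2 1/2 0 1/4
```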

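Example

| Day | Outlook | Temperature | Humidity | Wind | PlayTennis |
|-----|----------|-------------|----------|--------|------------|
| 1 | Sunny | Hot | High | Weak | No |
| 2 | Sunny | Hot | High | Strong | No |
| 3 | Overcast | Hot | High | Weak | Yes |
| 4 | Rain | Mild | High | Weak | Yes |
| 5 | Rain | Cool | Normal | Weak | Yes |
| 6 | Rain | Cool | Normal | Strong | No |
| 7 | Overcast | Cool | Normal | Strong | Yes |
| 8 | Sunny | Mild | High | Weak | No |
| 9 | Sunny | Cool | Normal | Weak | Yes |
| 10 | Rain | Mild | Normal | Weak | Yes |
| 11 | Sunny | Mild | Normal | Strong | Yes |
| 12 | Overcast | Mild | High | Strong | Yes |
| 13 | Overcast | Hot | Normal | Weak | Yes |
| 14 | Rain | Mild | High | Strong | No |

$$v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_{i} P(x_i \mid v_j)$$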

Example

Compute P(PlayTennis = yes); P(PlayTennis = no)

Compute P(outlook = s/oc/r | PlayTennis = yes/no) (6 numbers)

Compute P(Temp = h/mild/cool | PlayTennis = yes/no) (6 numbers)

Compute P(humidity = hi/nor | PlayTennis = yes/no) (4 numbers)

Compute P(wind = w/st | PlayTennis = yes/no) (4 numbers)
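All of these numbers are relative frequencies read off the 14-day table. A minimal counting sketch follows; the names `DATA`, `prior`, and `cond` are assumptions, while the rows are copied from the table.

```python
from collections import Counter
from fractions import Fraction

FEATURES = ("Outlook", "Temperature", "Humidity", "Wind")

# One tuple per day: the four feature values followed by PlayTennis.
DATA = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]

label_counts = Counter(row[-1] for row in DATA)
pair_counts = Counter(
    (name, row[i], row[-1]) for row in DATA for i, name in enumerate(FEATURES)
)

def prior(label):
    """Relative-frequency estimate of P(PlayTennis = label)."""
    return Fraction(label_counts[label], len(DATA))

def cond(feature, value, label):
    """Relative-frequency estimate of P(feature = value | PlayTennis = label)."""
    return Fraction(pair_counts[(feature, value, label)], label_counts[label])

print(prior("Yes"), prior("No"))
print(cond("Outlook", "Sunny", "Yes"), cond("Outlook", "Sunny", "No"))
```

`Fraction` reduces automatically, so an estimate such as 3/9 is reported as 1/3.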

Given a new instance:

(Outlook=sunny; Temperature=cool; Humidity=high; Wind=strong)

Predict: PlayTennis = ?

Example

Given: (Outlook=sunny; Temperature=cool; Humidity=high; Wind=strong)

P(PlayTennis = yes) = 9/14 = 0.64

P(PlayTennis = no) = 5/14 = 0.36

P(outlook = sunny | yes) = 2/9; P(outlook = sunny | no) = 3/5

P(temp = cool | yes) = 3/9; P(temp = cool | no) = 1/5

P(humidity = hi | yes) = 3/9; P(humidity = hi | no) = 4/5

P(wind = strong | yes) = 3/9; P(wind = strong | no) = 3/5

P(yes, …) ~ 0.0053

P(no, …) ~ 0.0206

What if we were asked about Outlook=OC?
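The two scores are the prior times the four conditional estimates for each class. The sketch below recomputes them from the table counts and then swaps in Outlook=overcast, which is one reading of the question above: no "no" day in the table is overcast, so the unsmoothed estimate P(overcast | no) is 0 and the whole "no" product vanishes. The names `scores` and `overcast_cond` are assumptions.

```python
from fractions import Fraction as F
from math import prod

prior = {"yes": F(9, 14), "no": F(5, 14)}

# P(feature value | class) for the instance (sunny, cool, high, strong),
# counted from the 14-day table.
cond = {
    "yes": [F(2, 9), F(3, 9), F(3, 9), F(3, 9)],
    "no": [F(3, 5), F(1, 5), F(4, 5), F(3, 5)],
}

def scores(prior, cond):
    """P(v) * prod_i P(x_i | v) for each class v."""
    return {v: prior[v] * prod(cond[v]) for v in prior}

sunny = scores(prior, cond)
print({v: round(float(s), 4) for v, s in sunny.items()})

# Same instance with Outlook=overcast: 4 of the 9 'yes' days and
# 0 of the 5 'no' days are overcast.
overcast_cond = {
    "yes": [F(4, 9)] + cond["yes"][1:],
    "no": [F(0, 5)] + cond["no"][1:],
}
overcast = scores(prior, overcast_cond)
print({v: round(float(s), 4) for v, s in overcast.items()})
```

A single zero estimate overrides all other evidence for that class, which is the usual motivation for smoothing the estimates.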

P(no | instance) = 0.0206/(0.0053+0.0206) = 0.795
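Dividing each score by the sum of the scores turns them into posterior probabilities under the naive Bayes model. A small sketch, with `normalise` as an assumed helper name and the two rounded scores from the example as input:

```python
def normalise(scores):
    """Rescale non-negative class scores so that they sum to 1."""
    total = sum(scores.values())
    return {v: s / total for v, s in scores.items()}

posterior = normalise({"yes": 0.0053, "no": 0.0206})
print(round(posterior["no"], 3), round(posterior["yes"], 3))
```

Normalising does not change which class wins; it only makes the margin between the classes readable as a probability.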

Naïve Bayes: Two classes

Notice that the naïve Bayes method gives a method for predicting rather than an explicit classifier.

$$v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_{i} P(x_i \mid v_j)$$

In the case of two classes, v ∈ {0,1}, we predict that v = 1 iff:

$$\frac{P(v_j = 1) \prod_{i=1}^{n} P(x_i \mid v_j = 1)}{P(v_j = 0) \prod_{i=1}^{n} P(x_i \mid v_j = 0)} > 1$$

Denote: $p_i = P(x_i = 1 \mid v = 1)$, $q_i = P(x_i = 1 \mid v = 0)$

$$\frac{P(v_j = 1) \prod_{i=1}^{n} p_i^{x_i} (1 - p_i)^{1 - x_i}}{P(v_j = 0) \prod_{i=1}^{n} q_i^{x_i} (1 - q_i)^{1 - x_i}} > 1$$

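Naïve Bayes: Two classes

In the case of two classes, v ∈ {0,1}, we predict that v = 1 iff:

$$\frac{P(v_j = 1) \prod_{i=1}^{n} p_i^{x_i} (1 - p_i)^{1 - x_i}}{P(v_j = 0) \prod_{i=1}^{n} q_i^{x_i} (1 - q_i)^{1 - x_i}} = \frac{P(v_j = 1) \prod_{i=1}^{n} (1 - p_i) \left( \frac{p_i}{1 - p_i} \right)^{x_i}}{P(v_j = 0) \prod_{i=1}^{n} (1 - q_i) \left( \frac{q_i}{1 - q_i} \right)^{x_i}} > 1$$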
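The two ways of writing the ratio can be checked numerically against each other. The sketch below does that for every binary vector of length 3, using hypothetical values for P(v=1), p_i, and q_i. It also takes logarithms of the rewritten ratio, a step this excerpt does not show, to express the same test as a linear threshold b + Σ w_i x_i > 0.

```python
import math
from itertools import product

# Hypothetical parameters for n = 3 binary features (not from the notes).
prior1 = 0.4          # P(v = 1); P(v = 0) = 1 - prior1
p = [0.8, 0.3, 0.6]   # p_i = P(x_i = 1 | v = 1)
q = [0.2, 0.5, 0.6]   # q_i = P(x_i = 1 | v = 0)

def ratio_bernoulli(x):
    """Ratio with each factor written as p_i^x_i * (1 - p_i)^(1 - x_i)."""
    num = prior1 * math.prod(pi**xi * (1 - pi)**(1 - xi) for pi, xi in zip(p, x))
    den = (1 - prior1) * math.prod(qi**xi * (1 - qi)**(1 - xi) for qi, xi in zip(q, x))
    return num / den

def ratio_odds(x):
    """Same ratio with each factor written as (1 - p_i) * (p_i / (1 - p_i))^x_i."""
    num = prior1 * math.prod((1 - pi) * (pi / (1 - pi))**xi for pi, xi in zip(p, x))
    den = (1 - prior1) * math.prod((1 - qi) * (qi / (1 - qi))**xi for qi, xi in zip(q, x))
    return num / den

# Log of the rewritten ratio: a bias term plus one weight per feature.
w = [math.log(pi / (1 - pi)) - math.log(qi / (1 - qi)) for pi, qi in zip(p, q)]
b = math.log(prior1 / (1 - prior1)) + sum(
    math.log((1 - pi) / (1 - qi)) for pi, qi in zip(p, q)
)

def log_ratio_linear(x):
    """b + sum_i w_i * x_i, which should equal log of the ratio."""
    return b + sum(wi * xi for wi, xi in zip(w, x))

for x in product([0, 1], repeat=3):
    r = ratio_bernoulli(x)
    assert math.isclose(r, ratio_odds(x))
    assert math.isclose(math.log(r), log_ratio_linear(x), abs_tol=1e-12)
    print(x, round(r, 3), "predict 1" if r > 1 else "predict 0")
```

The third feature has p_3 = q_3, so its weight is zero and it never affects the prediction.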