














































These notes give an overview of the naive Bayes classifier, a probabilistic machine learning algorithm used for text classification. They cover the assumptions of naive Bayes, the learning method, and the process of estimating probabilities. They also include examples of text classification using naive Bayes and a discussion of smoothing and robust estimation of probabilities.















































Each example is a vector of n Boolean features:

$$x = (x_1, x_2, \ldots, x_n), \qquad x_i \in \{0,1\}$$

The MAP prediction is the most probable target value given the example:

$$v_{MAP} = \arg\max_{v_j \in V} P(v_j \mid x) = \arg\max_{v_j \in V} P(v_j \mid x_1, x_2, \ldots, x_n)$$

By Bayes' rule, and because the denominator does not depend on $v_j$:

$$v_{MAP} = \arg\max_{v_j \in V} \frac{P(x_1, x_2, \ldots, x_n \mid v_j)\, P(v_j)}{P(x_1, x_2, \ldots, x_n)} = \arg\max_{v_j \in V} P(x_1, x_2, \ldots, x_n \mid v_j)\, P(v_j)$$
Assumption: feature values are independent given the target value
$$v_{MAP} = \arg\max_{v_j \in V} P(x_1, x_2, \ldots, x_n \mid v_j)\, P(v_j)$$

$$\begin{aligned}
P(x_1, x_2, \ldots, x_n \mid v_j) &= P(x_1 \mid x_2, \ldots, x_n, v_j)\, P(x_2, \ldots, x_n \mid v_j) \\
&= P(x_1 \mid x_2, \ldots, x_n, v_j)\, P(x_2 \mid x_3, \ldots, x_n, v_j)\, P(x_3, \ldots, x_n \mid v_j) \\
&= \cdots \\
&= P(x_1 \mid x_2, \ldots, x_n, v_j)\, P(x_2 \mid x_3, \ldots, x_n, v_j)\, P(x_3 \mid x_4, \ldots, x_n, v_j) \cdots P(x_n \mid v_j) \\
&= \prod_{i=1}^{n} P(x_i \mid v_j)
\end{aligned}$$
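To see what the independence assumption buys, it can help to count parameters. The sketch below is an illustration rather than part of the original notes: it assumes n Boolean features, and the function names are hypothetical. Without the assumption, each class needs one probability per joint assignment of the features (minus one, since they sum to 1); with it, each class needs only one number per feature.

```python
def num_parameters_full_joint(n: int, num_classes: int) -> int:
    """Parameters needed to store P(x_1, ..., x_n | v) exactly for Boolean x_i.

    Each class has 2**n joint assignments; one is determined by the others
    because the probabilities sum to 1.
    """
    return num_classes * (2 ** n - 1)


def num_parameters_naive(n: int, num_classes: int) -> int:
    """Parameters needed under the naive Bayes assumption: P(x_i = 1 | v)."""
    return num_classes * n


# Example: 10 Boolean features, 2 classes.
full = num_parameters_full_joint(10, 2)
naive = num_parameters_naive(10, 2)
print(full, naive)
```

Class priors P(v) are needed in both cases and are not counted here.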
Generative model:
First choose a value $v_j \in V$ according to $P(v_j)$.
For each $v_j$: choose $x_1, x_2, \ldots, x_n$ according to $P(x_k \mid v_j)$.
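The two-step generative story can be sketched as a small sampler. This is a minimal illustration under assumptions: the features are Boolean, and the prior and conditional tables below are made-up numbers, not values from these notes. The function name `sample_example` is hypothetical.

```python
import random


def sample_example(prior, cond, rng):
    """Draw one (v, x) pair from the naive Bayes generative model.

    prior: dict mapping each label v to P(v)
    cond:  dict mapping each label v to a list of P(x_k = 1 | v), one per feature
    """
    labels = list(prior)
    # Step 1: choose the target value according to P(v).
    v = rng.choices(labels, weights=[prior[label] for label in labels], k=1)[0]
    # Step 2: given v, draw each feature independently according to P(x_k | v).
    x = [1 if rng.random() < p else 0 for p in cond[v]]
    return v, x


rng = random.Random(0)
prior = {0: 0.4, 1: 0.6}                          # hypothetical P(v)
cond = {0: [0.1, 0.8, 0.5], 1: [0.7, 0.2, 0.5]}   # hypothetical P(x_k = 1 | v)
samples = [sample_example(prior, cond, rng) for _ in range(5)]
for v, x in samples:
    print(v, x)
```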
$$v_{MAP} = \arg\max_{v_j \in V} P(x_1, x_2, \ldots, x_n \mid v_j)\, P(v_j)$$

$$P(x_1 = b_1, x_2 = b_2, \ldots, x_n = b_n \mid v = v_j) = \prod_{i=1}^{n} P(x_i = b_i \mid v = v_j)$$
Learning method: Estimate n|V| parameters, the conditional probabilities P(x_i = 1 | v_j) for each feature and each target value (together with the priors P(v_j)), and use them to compute the value predicted for a new example.
This is learning without search. Given a collection of training examples, you just compute the best hypothesis (given the assumptions).
This is learning without trying to achieve consistency or even approximate consistency.
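A minimal sketch of "learning without search", assuming Boolean features and plain relative-frequency estimates (the notes do not fix the estimator at this point, so that choice is an assumption). The function name `fit_naive_bayes` and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def fit_naive_bayes(examples):
    """Estimate naive Bayes parameters by counting.

    examples: list of (x, v), where x is a tuple of 0/1 features and v a label.
    Returns (prior, cond) with prior[v] = P(v) and cond[v][i] = P(x_i = 1 | v).
    """
    label_counts = Counter(v for _, v in examples)
    n = len(examples[0][0])
    ones = defaultdict(lambda: [0] * n)  # ones[v][i] = #examples with label v and x_i = 1
    for x, v in examples:
        for i, bit in enumerate(x):
            ones[v][i] += bit
    total = len(examples)
    prior = {v: count / total for v, count in label_counts.items()}
    cond = {v: [ones[v][i] / label_counts[v] for i in range(n)] for v in label_counts}
    return prior, cond


# Made-up training set with n = 2 features and |V| = 2 labels.
toy = [((1, 0), 1), ((1, 1), 1), ((0, 1), 0), ((0, 0), 0)]
prior, cond = fit_naive_bayes(toy)
print(prior)
print(cond)
```

A single pass over the data yields all n|V| conditional estimates; no hypothesis space is searched and no attempt is made to fit the training labels exactly.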
Why does it work?
Notice that the feature values are required to be conditionally independent given the target value; they are not required to be (unconditionally) independent.
Example: f(x,y) = x ∧ y over the product distribution defined by p(x=0) = p(x=1) = 1/2 and p(y=0) = p(y=1) = 1/2.
The distribution is defined so that x and y are independent: p(x,y) = p(x)p(y) for every value of x and y.
But, given that f(x,y) = 0, the three remaining assignments (0,0), (0,1) and (1,0) are equally likely, so:
p(x=1 | f=0) = p(y=1 | f=0) = 1/3, while p(x=1, y=1 | f=0) = 0 ≠ 1/9,
so x and y are not conditionally independent.
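The numbers in this example can be checked by enumerating the four assignments exactly. The sketch below uses exact fractions; the helper name `prob` is hypothetical.

```python
from fractions import Fraction
from itertools import product

half = Fraction(1, 2)
# Joint distribution over (x, y): two independent fair bits.
joint = {(x, y): half * half for x, y in product((0, 1), repeat=2)}


def f(x, y):
    return x & y  # Boolean AND


def prob(event, given=lambda x, y: True):
    """P(event | given), computed by summing over the joint table."""
    numerator = sum(p for (x, y), p in joint.items() if given(x, y) and event(x, y))
    denominator = sum(p for (x, y), p in joint.items() if given(x, y))
    return numerator / denominator


def f_is_0(x, y):
    return f(x, y) == 0


p_x = prob(lambda x, y: x == 1, f_is_0)               # P(x=1 | f=0)
p_y = prob(lambda x, y: y == 1, f_is_0)               # P(y=1 | f=0)
p_xy = prob(lambda x, y: x == 1 and y == 1, f_is_0)   # P(x=1, y=1 | f=0)
print(p_x, p_y, p_xy, p_x * p_y)
```

Conditional independence would require the joint conditional to equal the product of the two conditionals, which fails here.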
The other direction also does not hold: x and y can be conditionally independent but not independent.
f=0: p(x=1 | f=0) = 1, p(y=1 | f=0) = 0
f=1: p(x=1 | f=1) = 0, p(y=1 | f=1) = 1
and assume, say, that p(f=0) = p(f=1) = 1/2.
Given the value of f, x and y are independent (each is fully determined by f).
What about unconditional independence?
p(x=1) = p(x=1 | f=0)p(f=0) + p(x=1 | f=1)p(f=1) = 0.5 + 0 = 0.5
p(y=1) = p(y=1 | f=0)p(f=0) + p(y=1 | f=1)p(f=1) = 0 + 0.5 = 0.5
But p(x=1, y=1) = p(x=1, y=1 | f=0)p(f=0) + p(x=1, y=1 | f=1)p(f=1) = 0, whereas p(x=1)p(y=1) = 0.25, so x and y are not independent.
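The same arithmetic can be written out with exact fractions. This is a small check of the second example, assuming p(f=0) = p(f=1) = 1/2; the variable names are chosen here for illustration.

```python
from fractions import Fraction

p_f = {0: Fraction(1, 2), 1: Fraction(1, 2)}
p_x1_given_f = {0: Fraction(1), 1: Fraction(0)}   # P(x=1 | f)
p_y1_given_f = {0: Fraction(0), 1: Fraction(1)}   # P(y=1 | f)

# Marginals by the law of total probability.
p_x1 = sum(p_x1_given_f[f] * p_f[f] for f in p_f)
p_y1 = sum(p_y1_given_f[f] * p_f[f] for f in p_f)

# Joint, using conditional independence of x and y given f.
p_x1_y1 = sum(p_x1_given_f[f] * p_y1_given_f[f] * p_f[f] for f in p_f)

print(p_x1, p_y1, p_x1_y1, p_x1 * p_y1)
```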
Day  Outlook   Temperature  Humidity  Wind    PlayTennis
1    Sunny     Hot          High      Weak    No
2    Sunny     Hot          High      Strong  No
3    Overcast  Hot          High      Weak    Yes
4    Rain      Mild         High      Weak    Yes
5    Rain      Cool         Normal    Weak    Yes
6    Rain      Cool         Normal    Strong  No
7    Overcast  Cool         Normal    Strong  Yes
8    Sunny     Mild         High      Weak    No
9    Sunny     Cool         Normal    Weak    Yes
10   Rain      Mild         Normal    Weak    Yes
11   Sunny     Mild         Normal    Strong  Yes
12   Overcast  Mild         High      Strong  Yes
13   Overcast  Hot          Normal    Weak    Yes
14   Rain      Mild         High      Strong  No
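$$v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_{i} P(x_i \mid v_j)$$

Parameters to estimate from the table:
P(PlayTennis = yes)
P(PlayTennis = no)
P(Outlook = sunny/overcast/rain | PlayTennis = yes/no) (6 numbers)
P(Temperature = hot/mild/cool | PlayTennis = yes/no) (6 numbers)
P(Humidity = high/normal | PlayTennis = yes/no) (4 numbers)
P(Wind = weak/strong | PlayTennis = yes/no) (4 numbers)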
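The 2 + 6 + 6 + 4 + 4 numbers can be obtained from the PlayTennis table by counting. The sketch below assumes relative-frequency estimates with no smoothing; the names `ROWS`, `prior` and `cond` are chosen here for illustration.

```python
from collections import Counter
from fractions import Fraction

ATTRS = ("Outlook", "Temperature", "Humidity", "Wind")
# The 14 training examples; the last field is PlayTennis.
ROWS = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]

label_counts = Counter(row[-1] for row in ROWS)
prior = {label: Fraction(count, len(ROWS)) for label, count in label_counts.items()}

# pair_counts[(attribute, value, label)] = number of rows with that combination.
pair_counts = Counter(
    (attr, row[i], row[-1]) for row in ROWS for i, attr in enumerate(ATTRS)
)
values = {attr: sorted({row[i] for row in ROWS}) for i, attr in enumerate(ATTRS)}

# cond[(attribute, value, label)] = estimated P(attribute = value | PlayTennis = label).
cond = {
    (attr, val, label): Fraction(pair_counts[(attr, val, label)], label_counts[label])
    for attr in ATTRS
    for val in values[attr]
    for label in label_counts
}

print(prior)
print(len(cond), "conditional probabilities")
```

Note that some estimates come out as exactly zero, for example P(Outlook = Overcast | PlayTennis = No), because no such row appears in the table.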
(Outlook=sunny; Temperature=cool; Humidity=high; Wind=strong)
PlayTennis=?
Given: (Outlook=sunny; Temperature=cool; Humidity=high; Wind=strong)
P(PlayTennis = yes) = 9/14 ≈ 0.64
P(PlayTennis = no) = 5/14 ≈ 0.36
P(outlook = sunny | yes) = 2/9      P(outlook = sunny | no) = 3/5
P(temp = cool | yes) = 3/9          P(temp = cool | no) = 1/5
P(humidity = hi | yes) = 3/9        P(humidity = hi | no) = 4/5
P(wind = strong | yes) = 3/9        P(wind = strong | no) = 3/5
P(yes, …) ≈ 0.0053
P(no, …) ≈ 0.0206
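The two scores are just products of the prior and the four conditional estimates counted from the PlayTennis table. A short sketch with exact fractions, using hypothetical names `prior`, `likelihoods` and `score`:

```python
from fractions import Fraction as F

# Estimates counted from the 14 examples (9 yes, 5 no) for the query
# (Outlook=sunny, Temperature=cool, Humidity=high, Wind=strong).
prior = {"yes": F(9, 14), "no": F(5, 14)}
likelihoods = {
    "yes": [F(2, 9), F(3, 9), F(3, 9), F(3, 9)],
    "no": [F(3, 5), F(1, 5), F(4, 5), F(3, 5)],
}


def score(label):
    """P(label) * prod_i P(x_i | label) for the query instance."""
    result = prior[label]
    for p in likelihoods[label]:
        result *= p
    return result


scores = {label: score(label) for label in prior}
prediction = max(scores, key=scores.get)
for label, s in scores.items():
    print(label, s, round(float(s), 4))
print("prediction:", prediction)
```

The larger score belongs to "no", so the naive Bayes prediction for this instance is PlayTennis = no.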
What if we were asked about Outlook = overcast (OC)?
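In the PlayTennis table no "No" example has Outlook = Overcast, so the counting estimate of P(outlook = overcast | no) is 0/5 = 0 and the whole product for "no" vanishes regardless of the other features. One common remedy is add-one (Laplace) smoothing; the sketch below is an assumption about how one might apply it here, not necessarily the estimator these notes go on to use, and `smoothed_estimate` is a hypothetical helper.

```python
from fractions import Fraction


def smoothed_estimate(count, class_total, num_values, alpha=1):
    """Add-alpha estimate of P(attribute = value | class).

    count:       examples of the class with this attribute value
    class_total: examples of the class
    num_values:  number of distinct values the attribute can take
    """
    return Fraction(count + alpha, class_total + alpha * num_values)


# Outlook counts among the 5 "No" examples: Sunny 3, Overcast 0, Rain 2.
no_counts = {"Sunny": 3, "Overcast": 0, "Rain": 2}
raw = {value: Fraction(c, 5) for value, c in no_counts.items()}
smoothed = {value: smoothed_estimate(c, 5, 3) for value, c in no_counts.items()}

print(raw["Overcast"], smoothed["Overcast"])
print(sum(smoothed.values()))
```

The smoothed estimates are all positive and still sum to 1 over the three Outlook values.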
Normalizing the two scores for this instance: P(no | instance) = 0.0206/(0.0053 + 0.0206) ≈ 0.795, so the prediction is PlayTennis = no.
Notice that the naïve Bayes method gives a method for predicting rather than an explicit classifier.
In the case of two classes, v ∈ {0,1}, we predict that v = 1 iff:
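$$\frac{P(v_j = 1) \prod_{i=1}^{n} P(x_i \mid v_j = 1)}{P(v_j = 0) \prod_{i=1}^{n} P(x_i \mid v_j = 0)} > 1$$

Denote $p_i = P(x_i = 1 \mid v = 1)$ and $q_i = P(x_i = 1 \mid v = 0)$. Since $x_i \in \{0,1\}$, the condition becomes:

$$\frac{P(v_j = 1) \prod_{i=1}^{n} p_i^{x_i} (1 - p_i)^{1 - x_i}}{P(v_j = 0) \prod_{i=1}^{n} q_i^{x_i} (1 - q_i)^{1 - x_i}} > 1$$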
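Writing each factor $p_i^{x_i}(1 - p_i)^{1 - x_i}$ as $(1 - p_i)\left(\frac{p_i}{1 - p_i}\right)^{x_i}$, and likewise for $q_i$, in the case of two classes, v ∈ {0,1}, we predict that v = 1 iff:

$$\frac{P(v_j = 1) \prod_{i=1}^{n} p_i^{x_i} (1 - p_i)^{1 - x_i}}{P(v_j = 0) \prod_{i=1}^{n} q_i^{x_i} (1 - q_i)^{1 - x_i}} = \frac{P(v_j = 1) \prod_{i=1}^{n} (1 - p_i) \left(\frac{p_i}{1 - p_i}\right)^{x_i}}{P(v_j = 0) \prod_{i=1}^{n} (1 - q_i) \left(\frac{q_i}{1 - q_i}\right)^{x_i}} > 1$$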
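A quick numerical check that the two forms of the two-class ratio agree, and a predictor built on it. This is a sketch under assumptions: features are Boolean, every p_i and q_i lies strictly between 0 and 1 (so the divisions are defined), and the parameter values below are made up. The function names are hypothetical.

```python
import math


def ratio_product_form(x, p, q, prior1):
    """P(v=1) prod p_i^x_i (1-p_i)^(1-x_i) / P(v=0) prod q_i^x_i (1-q_i)^(1-x_i)."""
    numerator, denominator = prior1, 1 - prior1
    for xi, pi, qi in zip(x, p, q):
        numerator *= pi ** xi * (1 - pi) ** (1 - xi)
        denominator *= qi ** xi * (1 - qi) ** (1 - xi)
    return numerator / denominator


def ratio_rewritten_form(x, p, q, prior1):
    """Same ratio with each factor written as (1-p_i) * (p_i / (1-p_i))^x_i."""
    numerator, denominator = prior1, 1 - prior1
    for xi, pi, qi in zip(x, p, q):
        numerator *= (1 - pi) * (pi / (1 - pi)) ** xi
        denominator *= (1 - qi) * (qi / (1 - qi)) ** xi
    return numerator / denominator


def predict(x, p, q, prior1):
    """Predict v = 1 iff the ratio exceeds 1."""
    return 1 if ratio_product_form(x, p, q, prior1) > 1 else 0


# Hypothetical parameters: p_i = P(x_i=1 | v=1), q_i = P(x_i=1 | v=0), P(v=1) = 0.5.
p = [0.8, 0.3, 0.6]
q = [0.2, 0.5, 0.4]
prior1 = 0.5

for x in [(1, 0, 1), (0, 1, 0)]:
    r1 = ratio_product_form(x, p, q, prior1)
    r2 = ratio_rewritten_form(x, p, q, prior1)
    print(x, r1, r2, predict(x, p, q, prior1))
```

In the rewritten form the only dependence on the example is through the exponents x_i, which is what makes the rule easy to turn into a comparison of sums after taking logarithms.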