






An overview of text categorization techniques, focusing on k-nearest neighbors and on the Bernoulli and multinomial models for naive Bayes document classification. Naive Bayes is a probabilistic classifier that estimates the conditional probability of a class given a document, assuming that the terms in a document are conditionally independent given the class. The Bernoulli model records only whether each term occurs in a document, while the multinomial model also counts how often it occurs. Both models are widely used for text classification tasks such as email filtering, document retrieval, and sentiment analysis.
Typology: Slides







- predict a topic of a Web page
- decide whether a Web page is relevant with respect to the interests of a given user
- k-nearest neighbors (k-NN)
- Naïve Bayes
- support vector machines
- learns by memorizing all the training instances
- measure distances between x and all training instances
- return a set N(x, D, k) of the k points closest to x
- predict a class for x by majority voting
- asymptotic error rate of the 1-NN classifier is always less than twice the optimal Bayes error
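The k-NN prediction steps above can be sketched in a few lines of Python. This is a minimal illustration, not the slides' own implementation: the function name `knn_predict`, the representation of the training set as `(vector, label)` pairs, and the choice of Euclidean distance are all assumptions made for the example.

```python
import math
from collections import Counter


def knn_predict(x, training_set, k):
    """Predict a class for x by majority vote among its k nearest neighbors.

    training_set is a list of (vector, label) pairs. Euclidean distance is
    an assumption of this sketch; the slides do not fix a distance measure.
    """
    # Measure distances between x and all training instances.
    by_distance = sorted(training_set, key=lambda pair: math.dist(x, pair[0]))
    # N(x, D, k): the k points closest to x.
    neighbors = by_distance[:k]
    # Majority voting over the labels of the neighbors.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated groups of 2-D points.
D = [((0, 0), "a"), ((0, 1), "a"), ((5, 5), "b"), ((6, 5), "b"), ((5, 6), "b")]
print(knn_predict((0.2, 0.1), D, k=3))
```

Because the classifier only memorizes the training instances, all the work happens at prediction time: every query costs one distance computation per training instance.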
- An event: a document as a whole
- Each word $\omega$ is encoded as:
    1, if $\omega$ is in the document
    0, otherwise
- the class is the only cause of appearance of each word in a document
- tossing |V| independent coins
- the occurrence of each word in a document is a Bernoulli event
- $x_j = 1\,[0]$: $\omega_j$ does [does not] occur in $d$
- $P(\omega_j \mid c)$: probability that $\omega_j$ occurs in documents of class $c$
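$$P(d \mid c) = \prod_{j=1}^{|V|} \left[ x_j \, P(\omega_j \mid c) + (1 - x_j)\bigl(1 - P(\omega_j \mid c)\bigr) \right]$$

$$P(d) = \sum_{c} P(c) \, P(d \mid c)$$

$$P(c \mid d) = \frac{P(c) \, P(d \mid c)}{P(d)}$$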
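A small worked example may help make the Bernoulli model concrete. The sketch below computes the document likelihood as a product over the whole vocabulary (one factor per word, whether or not it occurs) and then applies Bayes' rule. The function names, the two-word vocabulary, and all probability values are hypothetical and chosen only for illustration; the per-word probabilities are assumed to lie strictly between 0 and 1 so that the logarithm is defined.

```python
import math


def bernoulli_log_likelihood(doc_words, word_probs):
    """log P(d|c) = sum_j log[x_j * p_j + (1 - x_j) * (1 - p_j)].

    doc_words is the set of words occurring in the document; word_probs maps
    every vocabulary word w_j to p_j = P(w_j | c), with 0 < p_j < 1.
    """
    total = 0.0
    for word, p in word_probs.items():
        x = 1 if word in doc_words else 0  # x_j: does w_j occur in d?
        total += math.log(x * p + (1 - x) * (1 - p))
    return total


def posterior(doc_words, priors, cond_probs):
    """P(c|d) for every class c, via Bayes' rule."""
    joint = {
        c: priors[c] * math.exp(bernoulli_log_likelihood(doc_words, cond_probs[c]))
        for c in priors
    }
    evidence = sum(joint.values())  # P(d) = sum_c P(c) P(d|c)
    return {c: value / evidence for c, value in joint.items()}


# Hypothetical parameters for a two-word vocabulary and two classes.
priors = {"sport": 0.5, "politics": 0.5}
cond_probs = {
    "sport": {"ball": 0.8, "vote": 0.1},
    "politics": {"ball": 0.2, "vote": 0.7},
}
print(posterior({"ball"}, priors, cond_probs))
```

Note that the absence of "vote" contributes a factor too: under this model, words that do not occur in the document are evidence just as much as words that do.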
- throwing a die with |V| faces |d| times
- occurrence of each word is a multinomial event
- $n_j$ is the number of occurrences of $\omega_j$ in $d$
- $P(\omega_j \mid c)$: probability that $\omega_j$ occurs at any position $t \in [1, \ldots, |d|]$
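$$P(d \mid c) = G \prod_{j=1}^{|V|} P(\omega_j \mid c)^{n_j}$$

where $G$ collects the factors that do not depend on the class $c$, such as the multinomial coefficient.

$$P(d) = \sum_{c} P(c) \, P(d \mid c)$$

$$P(c \mid d) = \frac{P(c) \, P(d \mid c)}{P(d)}$$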
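The multinomial model can be illustrated in the same way. The sketch below scores a document by summing, for each word, its count times the log of its class-conditional probability, and drops any constant factor that is the same for every class, since it does not change which class wins. The function names and the probability values are hypothetical, and every token is assumed to be in the vocabulary with a nonzero probability.

```python
import math
from collections import Counter


def multinomial_log_likelihood(tokens, word_probs):
    """log of prod_j P(w_j | c)^{n_j}.

    This equals log P(d|c) up to an additive constant that is identical for
    all classes. tokens is the document as a list of words; word_probs maps
    each vocabulary word to P(w_j | c) > 0.
    """
    counts = Counter(tokens)  # n_j: number of occurrences of w_j in d
    return sum(n * math.log(word_probs[w]) for w, n in counts.items())


def classify(tokens, priors, cond_probs):
    """Return the class c maximizing log P(c) + log P(d|c)."""
    scores = {
        c: math.log(priors[c]) + multinomial_log_likelihood(tokens, cond_probs[c])
        for c in priors
    }
    return max(scores, key=scores.get)


# Hypothetical parameters; each class distribution sums to 1 over the vocabulary.
priors = {"sport": 0.5, "politics": 0.5}
cond_probs = {
    "sport": {"ball": 0.7, "vote": 0.3},
    "politics": {"ball": 0.2, "vote": 0.8},
}
print(classify(["ball", "ball", "vote"], priors, cond_probs))
```

Working in log space is a design choice rather than part of the model: it avoids numerical underflow when documents are long and the product has many small factors.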
- Estimate parameters $\theta$ from the available data
- Training data set is a collection of labeled documents $\{ (d_i, c_i),\ i = 1, \ldots, n \}$
- $\theta_{c,j} = P(\omega_j \mid c)$ must satisfy $\sum_j \theta_{c,j} = 1$ for each class $c$
- $q_j$ and $\alpha$ are hyperparameters of the Dirichlet prior
- $n_{ij}$ is the number of occurrences of $\omega_j$ in $d_i$
$$\hat{\theta}_{c,j} = \frac{\alpha q_j + \sum_{i : c_i = c} n_{ij}}{\alpha + \sum_{l=1}^{|V|} \sum_{i : c_i = c} n_{il}}$$
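$$\hat{\theta}_{c} = \frac{\alpha' q'_c + N_c}{\alpha' + n}$$

where $N_c$ is the number of training documents of class $c$, and $q'_c$ and $\alpha'$ are the hyperparameters of the prior over classes.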
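The smoothed estimates can be sketched as follows: for each class, the word counts summed over its training documents are added to $\alpha q_j$ and divided by the total count plus $\alpha$, and the class prior is estimated the same way from document counts. The function name `estimate_parameters` is hypothetical, and the uniform choices $q_j = 1/|V|$ and $q'_c = 1/(\text{number of classes})$ are assumptions of this example rather than something the slides prescribe.

```python
from collections import Counter


def estimate_parameters(labeled_docs, vocabulary, alpha, alpha_prime):
    """Smoothed estimates of theta_{c,j} = P(w_j | c) and theta_c = P(c).

    labeled_docs is a list of (tokens, label) pairs. Uniform prior means
    q_j = 1/|V| and q'_c = 1/(number of classes) are assumed in this sketch.
    """
    classes = sorted({label for _, label in labeled_docs})
    q = 1.0 / len(vocabulary)
    q_prime = 1.0 / len(classes)
    n = len(labeled_docs)  # number of training documents

    word_counts = {c: Counter() for c in classes}  # sum of n_ij over i with c_i = c
    doc_counts = Counter()                         # N_c
    for tokens, label in labeled_docs:
        doc_counts[label] += 1
        word_counts[label].update(t for t in tokens if t in vocabulary)

    theta_word, theta_class = {}, {}
    for c in classes:
        total = sum(word_counts[c].values())  # sum over l of the class counts n_il
        theta_word[c] = {
            w: (alpha * q + word_counts[c][w]) / (alpha + total) for w in vocabulary
        }
        theta_class[c] = (alpha_prime * q_prime + doc_counts[c]) / (alpha_prime + n)
    return theta_word, theta_class


# Hypothetical training set of three short documents.
docs = [
    (["ball", "ball", "goal"], "sport"),
    (["vote", "ball"], "politics"),
    (["goal"], "sport"),
]
theta_word, theta_class = estimate_parameters(
    docs, ["ball", "goal", "vote"], alpha=3, alpha_prime=2
)
print(theta_word["sport"], theta_class)
```

With $\alpha = |V|$ and uniform $q_j$, the word estimate reduces to add-one (Laplace) smoothing, so a word never seen in a class still receives a nonzero probability and the estimates for each class sum to 1.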