Text Categorization: Naive Bayes and Bernoulli Model for Document Classification, Slides of Fundamentals of E-Commerce

An overview of text categorization techniques, focusing on Naive Bayes and the Bernoulli model for document classification. Naive Bayes is a probabilistic algorithm that estimates the conditional probability of a class given a document. The Bernoulli model assumes that the terms in a document are conditionally independent given the class. Both models are widely used for text classification tasks such as email filtering, document retrieval, and sentiment analysis.

Typology: Slides

2012/2013

Uploaded on 07/30/2013

asif.ali

Text Categorization
- Grouping textual documents into different fixed classes
- Examples
  - predict a topic of a Web page
  - decide whether a Web page is relevant with respect to the interests of a given user
- Machine learning techniques
  - k nearest neighbors (k-NN)
  - Naïve Bayes
  - support vector machines

k Nearest Neighbors
- Memory based
  - learns by memorizing all the training instances
- Prediction of x's class
  - measure distances between x and all training instances
  - return a set N(x, D, k) of the k points closest to x
  - predict a class for x by majority voting
- Performs well in many domains
  - asymptotic error rate of the 1-NN classifier is at most twice the optimal Bayes error
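
The k-NN prediction steps can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes documents are already represented as equal-length numeric vectors and uses Euclidean distance, neither of which is specified in the slides. The names `knn_predict` and `training_set` are hypothetical.

```python
import math
from collections import Counter

def knn_predict(x, training, k):
    """Predict the class of x by majority vote among its k nearest neighbors.

    training: list of (vector, label) pairs (assumed representation).
    """
    # Measure the distance between x and every training instance
    # (Euclidean distance is an assumption; any metric could be used).
    by_distance = sorted(training, key=lambda pair: math.dist(x, pair[0]))
    # N(x, D, k): the k training points closest to x.
    neighbors = by_distance[:k]
    # Majority vote over the neighbors' labels; ties go to the label
    # that was counted first.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy data: two-dimensional vectors with made-up labels.
training_set = [
    ((1.0, 0.0), "sports"),
    ((0.9, 0.1), "sports"),
    ((0.0, 1.0), "politics"),
    ((0.1, 0.8), "politics"),
]
print(knn_predict((0.8, 0.2), training_set, k=3))
```

Because the classifier only memorizes the training set, all of the cost is paid at prediction time: every query computes a distance to every stored instance.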

Bernoulli Model
- An event: a document as a whole
  - a bag of words
  - words are attributes of the event
  - vocabulary term $\omega$ is a Bernoulli attribute
    - 1, if $\omega$ is in the document
    - 0, otherwise
  - binary attributes are mutually independent given the class
    - the class is the only cause of appearance of each word in a document

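Bernoulli Model
- Generating a document
  - tossing $|V|$ independent coins
  - the occurrence of each word in a document is a Bernoulli event
    - $x_j = 1\,[0]$: $\omega_j$ does [does not] occur in $d$
    - $P(\omega_j \mid c)$: probability of observing $\omega_j$ in documents of class $c$

$$P(d \mid c) = \prod_{j=1}^{|V|} \left[ x_j\, P(\omega_j \mid c) + (1 - x_j)\bigl(1 - P(\omega_j \mid c)\bigr) \right]$$

$$P(c \mid d) = \frac{P(d \mid c)\, P(c)}{P(d)} \propto P(d \mid c)\, P(c)$$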
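
As a rough illustration of the Bernoulli model, the sketch below computes log P(d|c) as a sum over the whole vocabulary (each term contributes log P(ω|c) if present and log(1 − P(ω|c)) if absent) and then applies Bayes' rule. The function names, the toy vocabulary, and the probability values are all made up for the example, and the probabilities are assumed to lie strictly between 0 and 1.

```python
import math

def bernoulli_log_likelihood(doc_terms, vocabulary, p_term_given_c):
    """log P(d|c): one Bernoulli factor per vocabulary term."""
    log_p = 0.0
    for term in vocabulary:
        p = p_term_given_c[term]
        # x_j = 1 contributes P(w_j|c); x_j = 0 contributes 1 - P(w_j|c).
        log_p += math.log(p) if term in doc_terms else math.log(1.0 - p)
    return log_p

def bernoulli_posterior(doc_terms, vocabulary, p_term_given, prior):
    """P(c|d) for every class c via Bayes' rule."""
    joint = {
        c: math.exp(bernoulli_log_likelihood(doc_terms, vocabulary, p_term_given[c])) * prior[c]
        for c in prior
    }
    evidence = sum(joint.values())  # P(d)
    return {c: value / evidence for c, value in joint.items()}

# Hypothetical parameters for a two-term vocabulary.
vocabulary = ["goal", "vote"]
p_term_given = {
    "sports": {"goal": 0.8, "vote": 0.1},
    "politics": {"goal": 0.1, "vote": 0.7},
}
prior = {"sports": 0.5, "politics": 0.5}

posterior = bernoulli_posterior({"goal"}, vocabulary, p_term_given, prior)
print(posterior)  # "sports" gets 0.36 / 0.375 = 0.96
```

Note that absent terms matter here: the missing word "vote" counts as evidence against the "politics" class.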

Multinomial Model
- Generating a document
  - throwing a die with $|V|$ faces $|d|$ times
  - occurrence of each word is a multinomial event
    - $n_j$ is the number of occurrences of $\omega_j$ in $d$
    - $P(\omega_j \mid c)$: probability that $\omega_j$ occurs at any position $t \in [1, \ldots, |d|]$
    - $G$: normalization constant

$$P(c \mid d) = \frac{P(d \mid c)\, P(c)}{P(d)} \propto P(d \mid c)\, P(c)$$

$$P(d \mid c) = G\, P(|d|) \prod_{j=1}^{|V|} P(\omega_j \mid c)^{n_j}$$
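
A minimal sketch of classification under the multinomial model follows. It scores each class by log P(c) plus the sum of n_j · log P(ω_j|c), dropping the normalization constant G and the document-length term on the assumption that neither depends on the class. The names `multinomial_log_score` and `classify`, and all parameter values, are hypothetical; every token is assumed to be in the vocabulary with nonzero probability.

```python
import math
from collections import Counter

def multinomial_log_score(tokens, p_term_given_c):
    """Class-dependent part of log P(d|c): sum over terms of n_j * log P(w_j|c)."""
    counts = Counter(tokens)  # n_j for each term that occurs in the document
    return sum(n * math.log(p_term_given_c[term]) for term, n in counts.items())

def classify(tokens, p_term_given, prior):
    """Return the class maximizing log P(c) + class-dependent log-likelihood."""
    scores = {
        c: math.log(prior[c]) + multinomial_log_score(tokens, p_term_given[c])
        for c in prior
    }
    return max(scores, key=scores.get)

# Hypothetical parameters; each class distribution sums to 1 over the vocabulary.
p_term_given = {
    "sports": {"goal": 0.7, "vote": 0.1, "team": 0.2},
    "politics": {"goal": 0.1, "vote": 0.7, "team": 0.2},
}
prior = {"sports": 0.5, "politics": 0.5}

print(classify(["goal", "goal", "team"], p_term_given, prior))
```

Unlike the Bernoulli model, repeated occurrences of a word count multiple times, while absent vocabulary terms contribute nothing to the score.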

Learning Naïve Bayes
- Estimate parameters $\theta$ from the available data
- Training data set is a collection of labeled documents $\{ (d_i, c_i),\ i = 1, \ldots, n \}$

Learning Multinomial Model
- Generative parameters $\theta_{c,j} = P(\omega_j \mid c)$
  - must satisfy $\sum_j \theta_{c,j} = 1$ for each class $c$
- Distributions of terms given the class
  - $q_j$ and $\alpha$ are hyperparameters of the Dirichlet prior
  - $n_{ij}$ is the number of occurrences of $\omega_j$ in $d_i$

$$\hat{\theta}_{c,j} = \frac{\alpha q_j + \sum_{i:\, c_i = c} n_{ij}}{\alpha + \sum_{l=1}^{|V|} \sum_{i:\, c_i = c} n_{il}}$$

- Unconditional class probabilities

$$\hat{\theta}_c = \frac{\alpha' q'_c + N_c}{\alpha' + n}$$
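
The estimation step can be sketched as follows, assuming the smoothed estimates take the usual Dirichlet-prior form: the term estimate for class c is (α·q_j + count of ω_j in class c) divided by (α + total term count in class c), and the class estimate is (α'·q'_c + N_c) divided by (α' + n). Reading N_c as the number of training documents labeled c is an assumption, as are the uniform defaults for q and q' and the function names.

```python
from collections import Counter, defaultdict

def estimate_term_distributions(training, vocabulary, alpha=1.0, q=None):
    """Return theta[c][term], the smoothed estimate of P(term|c).

    training: list of (tokens, label) pairs.
    """
    if q is None:
        # Uniform prior mean over the vocabulary (an assumption).
        q = {term: 1.0 / len(vocabulary) for term in vocabulary}
    counts = defaultdict(Counter)
    for tokens, label in training:
        # Accumulate n_ij over all documents d_i with c_i == label.
        counts[label].update(t for t in tokens if t in q)
    theta = {}
    for c, term_counts in counts.items():
        total = sum(term_counts.values())
        theta[c] = {
            term: (alpha * q[term] + term_counts[term]) / (alpha + total)
            for term in vocabulary
        }
    return theta

def estimate_class_priors(training, alpha_prime=1.0, q_prime=None):
    """Return the smoothed estimate of P(c) for each class seen in training."""
    n_c = Counter(label for _, label in training)  # N_c (assumed meaning)
    n = len(training)
    if q_prime is None:
        q_prime = {c: 1.0 / len(n_c) for c in n_c}  # uniform (an assumption)
    return {c: (alpha_prime * q_prime[c] + n_c[c]) / (alpha_prime + n) for c in n_c}

# Toy training set with made-up documents.
training = [
    (["goal", "goal", "team"], "sports"),
    (["vote", "team"], "politics"),
]
vocabulary = ["goal", "vote", "team"]

theta = estimate_term_distributions(training, vocabulary, alpha=3.0)
priors = estimate_class_priors(training, alpha_prime=2.0)
print(theta["sports"])
print(priors)
```

With a uniform q and α equal to the vocabulary size, α·q_j is 1 for every term, which corresponds to add-one (Laplace) smoothing. Smoothing keeps unseen terms from receiving zero probability, which would otherwise zero out the whole product for a class.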