Prepare for your exams
Get points
Guidelines and tips
Sell on Docsity
Docsity AI

Prepare for your exams

Study with the several resources on Docsity

Earn points to download

Earn points by helping other students or get them with a premium plan

Guidelines and tips

Sell on Docsity

Docsity AI

Prepare for your exams

Study with the several resources on Docsity

Find documents

Prepare for your exams with the study notes shared by other students like you on Docsity

Search for your university

Find the specific documents for your university's exams

Docsity AINEW

Summarize your documents, ask them questions, convert them into quizzes and concept maps

Explore questions

Clear up your doubts by reading the answers to questions asked by your fellow students

Earn points to download

Earn points by helping other students or get them with a premium plan

Share documents

20 Points

For each uploaded document

Answer questions

5 Points

For each given answer (max 1 per day)

All the ways to get free points

Get points immediately

Choose a premium plan with all the points you need

Study Opportunities

Choose your next study program

Get in touch with the best universities in the world. Search through thousands of universities and official partners

Community

Ask the community

Ask the community for help and clear up your study doubts

Free resources

Our save-the-student-ebooks!

Download our free guides on studying techniques, anxiety management strategies, and thesis advice from Docsity tutors

Natural language processing session 9, Lecture notes of Natural Language Processing (NLP)

Illinois Institute of Technology (IIT)Natural Language Processing (NLP)

Session 9 - elabborate notes and lectures by Derrik Higgins

Typology: Lecture notes

2018/2019

Uploaded on 11/18/2019

mohammed-jawhar 🇺🇸

1.5

(2)

2 documents

1 / 51

This page cannot be seen from the preview

Don't miss anything!

Text Categorization and

Naïve Bayes

CS-585

Natural Language Processing

Derrick Higgins

(with slides from William W. Cohen and Chris Manning)

Discover Lecture notes of Natural Language Processing (NLP) Illinois Institute of Technology (IIT)

Partial preview of the text

Download Natural language processing session 9 and more Lecture notes Natural Language Processing (NLP) in PDF only on Docsity!

Text Categorization and

Naïve Bayes

CS- 585

Natural Language Processing Derrick Higgins (with slides from William W. Cohen and Chris Manning)

TEXT CATEGORIZATION (CLASSIFICATION)

Text Classification: Examples

Classify news stories as World, US, Business, SciTech, Sports, Entertainment, Health, Other
Add MeSH terms to Medline abstracts (e.g. “Conscious Sedation” [E03.250])
Classify business names by industry.
Classify student essays as A,B,C,D, or F.
Classify email as Spam, Other.
Classify email to tech staff as Mac, Windows, ..., Other.
Classify pdf files as ResearchPaper, Other
Classify documents as WrittenByReagan, GhostWritten
Classify movie reviews as Favorable,Unfavorable,Neutral.
Classify technical papers as Interesting, Uninteresting.
Classify web sites of companies by Standard Industrial Classification (SIC) code.
Classify jokes as Funny, NotFunny.

Text Classification: Examples

Best-studied benchmark: Reuters- 21578 newswire stories
- 9603 train, 3299 test documents, 80-100 words each, 93 classes ARGENTINE 1986/87 GRAIN/OILSEED REGISTRATIONS BUENOS AIRES, Feb 26 Argentine grain board figures show crop registrations of grains, oilseeds and their products to February 11, in thousands of tonnes, showing those for future shipments month, 1986/87 total and 1985/86 total to February 12, 1986, in brackets:
- Bread wheat prev 1,655.8, Feb 872.0, March 164.6, total 2,692.4 (4,161.0).
- Maize Mar 48.0, total 48.0 (nil).
- Sorghum nil (nil)
- Oilseed export registrations were:
- Sunflowerseed total 15.0 (7.9)
- Soybean May 20.0, total 20.0 (nil) The board also detailed export registrations for subproducts, as follows.... Categories: grain, wheat (of 93 binary choices)

Bag of words representation

ARGENTINE 1986/87 GRAIN/OILSEED REGISTRATIONS BUENOS AIRES, Feb 26 Argentine grain board figures show crop registrations of grains, oilseeds and their products to February 11, in thousands of tonnes, showing those for future shipments month, 1986/87 total and 1985/86 total to February 12, 1986, in brackets:

Bread wheat prev 1,655.8, Feb 872.0, March 164.6, total 2,692.4 (4,161.0).
Maize Mar 48.0, total 48.0 (nil).
Sorghum nil (nil)
Oilseed export registrations were:
Sunflowerseed total 15.0 (7.9)
Soybean May 20.0, total 20.0 (nil) The board also detailed export registrations for subproducts, as follows.... Categories: grain, wheat

Bag of words representation

xxxxxxxxxxxxxxxxxxx GRAIN/OILSEED xxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxx grain xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx grains, oilseeds xxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxx tonnes, xxxxxxxxxxxxxxxxx shipments xxxxxxxxxxxx total xxxxxxxxx total xxxxxxxx xxxxxxxxxxxxxxxxxxxx:

Xxxxx wheat xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx, total xxxxxxxxxxxxxxxx
Maize xxxxxxxxxxxxxxxxx
Sorghum xxxxxxxxxx
Oilseed xxxxxxxxxxxxxxxxxxxxx
Sunflowerseed xxxxxxxxxxxxxx
Soybean xxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.... Categories: grain, wheat

NAÏVE BAYES

Text Classification with Naive Bayes

Represent document x as set of (𝑤 𝑖

𝑖 ) pairs:

𝑥 = {(grain, 3 ), (wheat, 1 ), … , (the, 6 )}
For each y, build a probabilistic model Pr(𝑋|𝑌 = 𝑦) of “documents” in class y
Pr(𝑋 = {(grain, 3 ), … }|𝑌 = wheat) = …
Pr(𝑋 = {(grain, 3 ), … }|𝑌 = nonWheat) = …
To classify, find the y which was most likely to generate x—i.e., which gives x the best score according to Pr(𝑥|𝑦)
𝑓(𝑥) = argmax 𝑦 Pr(𝑥|𝑦)×Pr(𝑦)

Text Classification with Naive Bayes

How to estimate Pr(𝑋|𝑌)?
Simplest useful process to generate a bag of words: - pick word 1 according to Pr(𝑊|𝑌) - repeat for word 2, 3, .... - each word is generated independently of the others (which is clearly not true) but means Õ = = = = n i n i w w Y y w Y y 1 1 Pr( ,..., | ) Pr( | ) How to estimate Pr(W|Y)?

Two Unreasonable Assumptions

Bag-of-words:

The order of the words in document d

makes no difference (but repetitions do)

Conditional Independence:

Words appear independently of each

other, given the document class

(e.g., if you see “car”, the word “drive” is no more likely to appear than if you saw “dog”)

Simple Smoothing

If 𝑋 contains a vocabulary word that does not occur with class 𝑌 = 𝑦 in the training: 𝑃(𝑋|𝑌 = 𝑦) = 0 , no matter what else is there!
Solution:
- Assign small probability to unseen words,
- Taking away probability from seen words
- Every word that occurred 𝑁 times with class 𝑌 = 𝑦, we will pretend actually occurred 𝑁 + 𝛼 times

Text Classification with Naive Bayes

How to estimate Pr(X|Y)? Õ =

n i n i w w Y y w Y y 1 1 Pr( ,..., | ) Pr( | ) ... and also imagine 𝛼 “pseudo-occurrences” of 𝑤𝑖 in class 𝑌 = 𝑦

Pr 𝑤 T

UVWXY Z[ ∧ ]^_ a UVWXY ]^_a|b|

Avoiding Underflow

Consider:
- Many docs have more than 100 words
- Word probabilities will each be <0.
- So, P(X|Y)<
  - 100 for any document X èUNDERFLOW!!
Solution: log 𝑎 > log 𝑏 iff 𝑎 > 𝑏 Use log[𝑃(𝑋|𝑌)𝑃(𝑌)] = log 𝑃(𝑋|𝑌) + log 𝑃(𝑌) log 𝑃(𝑋|𝑌) = Σ 𝑤𝑖𝜀𝑋 log 𝑃(𝑤 𝑖

Text Classification with Naive Bayes

Putting this together: for each document xi with label yi d_count[yi]++ d_count++ for each word wij in xi w_count[wij][yi]++ w_count[yi]++
- to classify a new x=w 1 ...wn , pick y with top score : 𝑠𝑐𝑜𝑟𝑒 𝑦, 𝑤q, ⋯ , 𝑤s = log 𝑑𝑐𝑜𝑢𝑛𝑡[𝑦] 𝑑𝑐𝑜𝑢𝑛𝑡 + v T^q X log 𝑤𝑐𝑜𝑢𝑛𝑡 𝑤T 𝑦 + 0. 5 𝑤𝑐𝑜𝑢𝑛𝑡 𝑦 + 1 key point: we only need counts for words that actually appear in x

Natural language processing session 9, Lecture notes of Natural Language Processing (NLP)

Related documents

Partial preview of the text

Download Natural language processing session 9 and more Lecture notes Natural Language Processing (NLP) in PDF only on Docsity!

Text Categorization and

Naïve Bayes

CS- 585

Text Classification: Examples

Text Classification: Examples

Bag of words representation

Bag of words representation

The order of the words in document d

makes no difference (but repetitions do)

Words appear independently of each

other, given the document class

Simple Smoothing

Avoiding Underflow