Text Categorization and Naïve Bayes
CS-585 Natural Language Processing
Derrick Higgins (with slides from William W. Cohen and Chris Manning)
TEXT CATEGORIZATION (CLASSIFICATION)
Text Classification: Examples
- Classify news stories as World, US, Business, SciTech, Sports, Entertainment, Health, Other
- Add MeSH terms to Medline abstracts (e.g. “Conscious Sedation” [E03.250])
- Classify business names by industry.
- Classify student essays as A, B, C, D, or F.
- Classify email as Spam, Other.
- Classify email to tech staff as Mac, Windows, ..., Other.
- Classify PDF files as ResearchPaper, Other
- Classify documents as WrittenByReagan, GhostWritten
- Classify movie reviews as Favorable, Unfavorable, Neutral.
- Classify technical papers as Interesting, Uninteresting.
- Classify web sites of companies by Standard Industrial Classification (SIC) code.
- Classify jokes as Funny, NotFunny.
Text Classification: Examples
- Best-studied benchmark: Reuters-21578 newswire stories
- 9603 train, 3299 test documents, 80-100 words each, 93 classes
ARGENTINE 1986/87 GRAIN/OILSEED REGISTRATIONS
BUENOS AIRES, Feb 26
Argentine grain board figures show crop registrations of grains, oilseeds and their products to February 11, in thousands of tonnes, showing those for future shipments month, 1986/87 total and 1985/86 total to February 12, 1986, in brackets:
- Bread wheat prev 1,655.8, Feb 872.0, March 164.6, total 2,692.4 (4,161.0).
- Maize Mar 48.0, total 48.0 (nil).
- Sorghum nil (nil)
- Oilseed export registrations were:
- Sunflowerseed total 15.0 (7.9)
- Soybean May 20.0, total 20.0 (nil)
The board also detailed export registrations for subproducts, as follows....
Categories: grain, wheat (of 93 binary choices)
Bag of words representation
ARGENTINE 1986/87 GRAIN/OILSEED REGISTRATIONS
BUENOS AIRES, Feb 26
Argentine grain board figures show crop registrations of grains, oilseeds and their products to February 11, in thousands of tonnes, showing those for future shipments month, 1986/87 total and 1985/86 total to February 12, 1986, in brackets:
- Bread wheat prev 1,655.8, Feb 872.0, March 164.6, total 2,692.4 (4,161.0).
- Maize Mar 48.0, total 48.0 (nil).
- Sorghum nil (nil)
- Oilseed export registrations were:
- Sunflowerseed total 15.0 (7.9)
- Soybean May 20.0, total 20.0 (nil)
The board also detailed export registrations for subproducts, as follows....
Categories: grain, wheat
Bag of words representation
xxxxxxxxxxxxxxxxxxx GRAIN/OILSEED xxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxx grain xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx grains, oilseeds xxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxx tonnes, xxxxxxxxxxxxxxxxx shipments xxxxxxxxxxxx total xxxxxxxxx total xxxxxxxx xxxxxxxxxxxxxxxxxxxx:
- Xxxxx wheat xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx, total xxxxxxxxxxxxxxxx
- Maize xxxxxxxxxxxxxxxxx
- Sorghum xxxxxxxxxx
- Oilseed xxxxxxxxxxxxxxxxxxxxx
- Sunflowerseed xxxxxxxxxxxxxx
- Soybean xxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.... Categories: grain, wheat
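The masking above keeps only which words occur and how often. A minimal Python sketch of that representation follows; the `bag_of_words` helper and its tokenization rule (lowercase, alphabetic runs only) are assumptions made for illustration, not something the slides specify.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Map a text to {word: count}, discarding word order."""
    # Assumed tokenization: lowercase, keep runs of letters only.
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

doc = ("Argentine grain board figures show crop registrations "
       "of grains, oilseeds and their products")
bag = bag_of_words(doc)

# Word order is lost, repetitions are kept.
same = bag_of_words("wheat total wheat") == bag_of_words("total wheat wheat")
```

Note that this simple tokenizer treats "grain" and "grains" as different words; whether to merge them (stemming) is a separate design choice.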
NAÏVE BAYES
Text Classification with Naive Bayes
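- Represent document x as a set of $(w_i, f_i)$ pairs, where $f_i$ is the number of times word $w_i$ occurs:
- $x = \{(\text{grain}, 3), (\text{wheat}, 1), \ldots, (\text{the}, 6)\}$
- For each y, build a probabilistic model $\Pr(X \mid Y = y)$ of “documents” in class y
- $\Pr(X = \{(\text{grain}, 3), \ldots\} \mid Y = \text{wheat}) = \ldots$
- $\Pr(X = \{(\text{grain}, 3), \ldots\} \mid Y = \text{nonWheat}) = \ldots$
- To classify, find the y which was most likely to generate x, that is, the y which gives x the best score according to $\Pr(x \mid y)$, weighted by the class prior $\Pr(y)$
- $f(x) = \operatorname{argmax}_y \Pr(x \mid y) \times \Pr(y)$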
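The decision rule can be sketched in a few lines of Python. The `classify` function and the numbers below are hypothetical, chosen only to show that the prior Pr(y) can overturn a higher likelihood Pr(x|y).

```python
def classify(likelihood, prior):
    """Return the class y that maximizes Pr(x|y) * Pr(y)."""
    return max(prior, key=lambda y: likelihood[y] * prior[y])

# Hypothetical values for a single document x (not taken from real data).
likelihood = {"wheat": 2e-5, "nonWheat": 3e-6}   # Pr(x | y)
prior = {"wheat": 0.1, "nonWheat": 0.9}          # Pr(y)

label = classify(likelihood, prior)
```

Here "wheat" has the larger likelihood, but the rare-class prior makes "nonWheat" the higher-scoring label; with equal priors the answer flips.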
Text Classification with Naive Bayes
- How to estimate Pr(X|Y)?
- Simplest useful process to generate a bag of words:
  - pick word 1 according to Pr(W|Y)
  - repeat for word 2, 3, ....
  - each word is generated independently of the others (which is clearly not true), but this means
$$\Pr(w_1, \ldots, w_n \mid Y = y) = \prod_{i=1}^{n} \Pr(w_i \mid Y = y)$$
- How to estimate Pr(W|Y)?
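Under this generative process the document likelihood is just a product of per-word probabilities. The sketch below assumes a hypothetical table `p_wheat` standing in for Pr(W | Y = wheat); the values are invented for illustration.

```python
import math

def likelihood(words, word_prob):
    """Pr(w_1..w_n | Y=y) as a product of independent per-word probabilities."""
    return math.prod(word_prob[w] for w in words)

# Hypothetical Pr(W | Y = wheat) over a three-word vocabulary.
p_wheat = {"grain": 0.2, "wheat": 0.5, "the": 0.3}

doc = ["the", "grain", "wheat", "grain"]
p_doc = likelihood(doc, p_wheat)          # 0.3 * 0.2 * 0.5 * 0.2
p_shuffled = likelihood(sorted(doc), p_wheat)
```

Because multiplication is commutative, reordering the words leaves the likelihood unchanged, which is why a bag of words is all the model needs.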
Two Unreasonable Assumptions
- Bag of words: the order of the words in document d makes no difference (but repetitions do)
- Conditional independence: words appear independently of each other, given the document class (e.g., if you see “car”, the word “drive” is no more likely to appear than if you saw “dog”)
Simple Smoothing
- If $X$ contains a vocabulary word that does not occur with class $Y = y$ in the training data, then $P(X \mid Y = y) = 0$, no matter what else is there!
- Solution: assign a small probability to unseen words, taking that probability away from seen words
- Pretend that every word that occurred $k$ times with class $Y = y$ actually occurred $k + \alpha$ times
Text Classification with Naive Bayes
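- How to estimate Pr(X|Y)?
$$\Pr(w_1, \ldots, w_n \mid Y = y) = \prod_{i=1}^{n} \Pr(w_i \mid Y = y)$$
- Estimate each $\Pr(w_i \mid Y = y)$ from counts, and also imagine $\alpha$ “pseudo-occurrences” of $w_i$ in class $Y = y$:
$$\Pr(w_i \mid Y = y) = \frac{\mathrm{count}(w_i \wedge Y = y) + \alpha}{\mathrm{count}(Y = y) + \alpha |V|}$$
where $\mathrm{count}(Y = y)$ is the number of word occurrences in class $y$ and $|V|$ is the vocabulary size.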
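A small sketch of this smoothed estimate, assuming each word's count is increased by α and the result is normalized over a fixed vocabulary so the probabilities still sum to 1. The function name `smoothed_prob`, the counts, and the vocabulary are hypothetical.

```python
from collections import Counter

def smoothed_prob(word, class_counts, vocab, alpha=1.0):
    """(count(w, y) + alpha) / (count(y) + alpha * |V|)."""
    total = sum(class_counts.values())  # word occurrences seen in class y
    return (class_counts[word] + alpha) / (total + alpha * len(vocab))

# Hypothetical training counts for class "wheat".
wheat_counts = Counter({"grain": 3, "wheat": 1})
vocab = {"grain", "wheat", "maize", "the"}

p_unseen = smoothed_prob("maize", wheat_counts, vocab)   # (0 + 1) / (4 + 4)
p_total = sum(smoothed_prob(w, wheat_counts, vocab) for w in vocab)
```

The unseen word "maize" now gets a small nonzero probability, paid for by slightly lower estimates for "grain" and "wheat".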
Avoiding Underflow
- Consider:
- Many docs have more than 100 words
- Word probabilities will each be < 0.1
- So $P(X \mid Y) < 10^{-100}$ for any document $X$ ⇒ UNDERFLOW!!
- Solution: $\log a > \log b$ iff $a > b$, so compare log scores instead:
$$\log[P(X \mid Y) P(Y)] = \log P(X \mid Y) + \log P(Y)$$
$$\log P(X \mid Y) = \sum_{w_i \in X} \log P(w_i \mid Y)$$
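The effect is easy to reproduce. Python floats are 64-bit, so this sketch uses a hypothetical 400-word document (rather than 100 words) to push the raw product below the representable range; the per-word probabilities are made up.

```python
import math

# Hypothetical per-word probabilities of one 400-word document under two classes.
probs_a = [0.1] * 400
probs_b = [0.05] * 400

# Raw products underflow to exactly 0.0, so the two classes cannot be compared.
prod_a = math.prod(probs_a)
prod_b = math.prod(probs_b)

# Sums of logs stay well within range and preserve the ordering.
log_a = sum(math.log(p) for p in probs_a)
log_b = sum(math.log(p) for p in probs_b)
```

Both products collapse to zero, while the log scores remain distinct and still rank class a above class b.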
Text Classification with Naive Bayes
- Putting this together:
    for each document x_i with label y_i:
        d_count[y_i]++
        d_count++
        for each word w_ij in x_i:
            w_count[w_ij][y_i]++
            w_count[y_i]++
- To classify a new $x = w_1 \ldots w_n$, pick the $y$ with the top score:
$$\mathrm{score}(y, w_1, \ldots, w_n) = \log \frac{\mathrm{d\_count}[y]}{\mathrm{d\_count}} + \sum_{i=1}^{n} \log \frac{\mathrm{w\_count}[w_i][y] + 0.5}{\mathrm{w\_count}[y] + 1}$$
- Key point: we only need counts for words that actually appear in x
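The counting and scoring steps above can be sketched as a small Python class. The class name `NaiveBayes`, the attribute names `d_total` and `w_total` (standing in for the unindexed `d_count` and the per-class `w_count[y]`), and the toy training documents are assumptions for illustration; the smoothing constants 0.5 and 1 follow the score formula.

```python
import math
from collections import defaultdict

class NaiveBayes:
    def __init__(self):
        self.d_count = defaultdict(int)   # documents per class
        self.d_total = 0                  # documents overall
        self.w_count = defaultdict(lambda: defaultdict(int))  # w_count[w][y]
        self.w_total = defaultdict(int)   # word occurrences per class

    def train(self, docs):
        """docs: iterable of (list_of_words, label) pairs."""
        for words, y in docs:
            self.d_count[y] += 1
            self.d_total += 1
            for w in words:
                self.w_count[w][y] += 1
                self.w_total[y] += 1

    def score(self, y, words):
        """Log prior plus the sum of smoothed log word probabilities."""
        s = math.log(self.d_count[y] / self.d_total)
        for w in words:
            # .get avoids creating entries for words never seen in training
            c = self.w_count.get(w, {}).get(y, 0)
            s += math.log((c + 0.5) / (self.w_total[y] + 1))
        return s

    def classify(self, words):
        return max(self.d_count, key=lambda y: self.score(y, words))

# Hypothetical toy training set.
nb = NaiveBayes()
nb.train([
    (["grain", "wheat", "export"], "wheat"),
    (["wheat", "tonnes", "grain"], "wheat"),
    (["stocks", "shares", "market"], "nonWheat"),
    (["market", "bank", "rates"], "nonWheat"),
])
label = nb.classify(["wheat", "grain"])
```

Scoring touches only the words in the document being classified, so the cost per class is linear in document length rather than vocabulary size.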