Natural language processing session 9, Lecture notes of Natural Language Processing (NLP)

Session 9 - elaborate notes and lectures by Derrick Higgins

Typology: Lecture notes

2018/2019

Uploaded on 11/18/2019

Text Categorization and
Naïve Bayes
CS-585
Natural Language Processing
Derrick Higgins
(with slides from William W. Cohen and Chris Manning)

TEXT CATEGORIZATION (CLASSIFICATION)

Text Classification: Examples

  • Classify news stories as World, US, Business, SciTech, Sports, Entertainment, Health, Other
  • Add MeSH terms to Medline abstracts (e.g. “Conscious Sedation” [E03.250])
  • Classify business names by industry.
  • Classify student essays as A, B, C, D, or F.
  • Classify email as Spam, Other.
  • Classify email to tech staff as Mac, Windows, ..., Other.
  • Classify pdf files as ResearchPaper, Other
  • Classify documents as WrittenByReagan, GhostWritten
  • Classify movie reviews as Favorable, Unfavorable, Neutral.
  • Classify technical papers as Interesting, Uninteresting.
  • Classify web sites of companies by Standard Industrial Classification (SIC) code.
  • Classify jokes as Funny, NotFunny.

Text Classification: Examples

  • Best-studied benchmark: Reuters-21578 newswire stories
    • 9603 train, 3299 test documents, 80-100 words each, 93 classes

ARGENTINE 1986/87 GRAIN/OILSEED REGISTRATIONS
BUENOS AIRES, Feb 26
Argentine grain board figures show crop registrations of grains, oilseeds and their products to February 11, in thousands of tonnes, showing those for future shipments month, 1986/87 total and 1985/86 total to February 12, 1986, in brackets:

  • Bread wheat prev 1,655.8, Feb 872.0, March 164.6, total 2,692.4 (4,161.0).
  • Maize Mar 48.0, total 48.0 (nil).
  • Sorghum nil (nil)
  • Oilseed export registrations were:
  • Sunflowerseed total 15.0 (7.9)
  • Soybean May 20.0, total 20.0 (nil)

The board also detailed export registrations for subproducts, as follows....

Categories: grain, wheat (of 93 binary choices)

Bag of words representation

ARGENTINE 1986/87 GRAIN/OILSEED REGISTRATIONS BUENOS AIRES, Feb 26 Argentine grain board figures show crop registrations of grains, oilseeds and their products to February 11, in thousands of tonnes, showing those for future shipments month, 1986/87 total and 1985/86 total to February 12, 1986, in brackets:

  • Bread wheat prev 1,655.8, Feb 872.0, March 164.6, total 2,692.4 (4,161.0).
  • Maize Mar 48.0, total 48.0 (nil).
  • Sorghum nil (nil)
  • Oilseed export registrations were:
  • Sunflowerseed total 15.0 (7.9)
  • Soybean May 20.0, total 20.0 (nil)

The board also detailed export registrations for subproducts, as follows....

Categories: grain, wheat

Bag of words representation

xxxxxxxxxxxxxxxxxxx GRAIN/OILSEED xxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxx grain xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx grains, oilseeds xxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxx tonnes, xxxxxxxxxxxxxxxxx shipments xxxxxxxxxxxx total xxxxxxxxx total xxxxxxxx xxxxxxxxxxxxxxxxxxxx:

  • Xxxxx wheat xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx, total xxxxxxxxxxxxxxxx
  • Maize xxxxxxxxxxxxxxxxx
  • Sorghum xxxxxxxxxx
  • Oilseed xxxxxxxxxxxxxxxxxxxxx
  • Sunflowerseed xxxxxxxxxxxxxx
  • Soybean xxxxxxxxxxxxxxxxxxxxxx

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx....

Categories: grain, wheat
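
The masked story above keeps only which words occur and how often. A minimal Python sketch of that representation follows; the function name `bag_of_words` and the lowercase, letters-only tokenizer are illustrative assumptions, not part of the slides.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Map a document to its word counts, discarding word order."""
    # Assumption: a token is a maximal run of letters, lowercased.
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

doc = ("Argentine grain board figures show crop registrations "
       "of grains, oilseeds and their products")
bow = bag_of_words(doc)
print(bow["grain"], bow["grains"])
```

Because the result is a multiset of counts, two documents with the same words in a different order map to the same representation.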

NAรVE BAYES

Text Classification with Naive Bayes

  • Represent document $x$ as a set of $(w_i, f_i)$ pairs, where $f_i$ is the number of times word $w_i$ occurs:
  • $x = \{(\text{grain}, 3), (\text{wheat}, 1), \ldots, (\text{the}, 6)\}$
  • For each $y$, build a probabilistic model $\Pr(X \mid Y = y)$ of “documents” in class $y$
  • $\Pr(X = \{(\text{grain}, 3), \ldots\} \mid Y = \text{wheat}) = \ldots$
  • $\Pr(X = \{(\text{grain}, 3), \ldots\} \mid Y = \text{nonWheat}) = \ldots$
  • To classify, find the $y$ which was most likely to generate $x$, i.e., which gives $x$ the best score according to $\Pr(x \mid y)$, weighted by the class prior $\Pr(y)$
  • $f(x) = \arg\max_y \Pr(x \mid y) \times \Pr(y)$

Text Classification with Naive Bayes

  • How to estimate $\Pr(X \mid Y)$?
  • Simplest useful process to generate a bag of words:
    • pick word 1 according to $\Pr(W \mid Y)$
    • repeat for word 2, 3, ....
    • each word is generated independently of the others (which is clearly not true), but this means

$$\Pr(w_1, \ldots, w_n \mid Y = y) = \prod_{i=1}^{n} \Pr(w_i \mid Y = y)$$

  • How to estimate $\Pr(W \mid Y)$?
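
Under the independence assumption, the likelihood of a document is just the product of per-word probabilities. A small sketch, assuming Python 3.8+ for `math.prod`; the probability table is hypothetical and `doc_likelihood` is an illustrative name.

```python
import math

def doc_likelihood(words, word_prob):
    """Pr(w_1, ..., w_n | Y = y) as a product of per-word probabilities."""
    return math.prod(word_prob[w] for w in words)

# Hypothetical Pr(W | Y = wheat); the values are illustrative only.
p_wheat = {"grain": 0.5, "wheat": 0.3, "the": 0.2}

print(doc_likelihood(["grain", "the", "grain"], p_wheat))
```

Since multiplication is commutative, reordering the words leaves the likelihood unchanged, which is exactly the bag-of-words assumption.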

Two Unreasonable Assumptions

  • Bag-of-words: The order of the words in document d makes no difference (but repetitions do)
  • Conditional Independence: Words appear independently of each other, given the document class (e.g., if you see “car”, the word “drive” is no more likely to appear than if you saw “dog”)

Simple Smoothing

  • If $X$ contains a vocabulary word that does not occur with class $Y = y$ in the training data: $P(X \mid Y = y) = 0$, no matter what else is there!
  • Solution:
    • Assign small probability to unseen words,
    • Taking away probability from seen words
    • Every word that occurred $c$ times with class $Y = y$, we will pretend actually occurred $c + \alpha$ times

Text Classification with Naive Bayes

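  • How to estimate $\Pr(X \mid Y)$?

$$\Pr(w_1, \ldots, w_n \mid Y = y) = \prod_{i=1}^{n} \Pr(w_i \mid Y = y)$$

  • ... and also imagine $\alpha$ “pseudo-occurrences” of $w_i$ in class $Y = y$:

$$\Pr(w_i \mid Y = y) = \frac{\text{count}(w_i \wedge Y = y) + \alpha}{\text{count}(Y = y) + \alpha |V|}$$

Here $\text{count}(Y = y)$ is the total number of word occurrences in class $y$, and $|V|$ is the vocabulary size.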
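
A minimal sketch of the add-α estimate, (count of the word in the class + α) divided by (total word count in the class + α times the vocabulary size). The counts, the vocabulary, the default `alpha=0.5`, and the name `smoothed_prob` are assumptions made for illustration.

```python
from collections import Counter

def smoothed_prob(word, class_counts, vocab, alpha=0.5):
    """Add-alpha estimate of Pr(word | Y = y) from one class's word counts."""
    total = sum(class_counts.values())
    return (class_counts[word] + alpha) / (total + alpha * len(vocab))

# Hypothetical counts for class "wheat"; "maize" never occurred with it.
vocab = {"grain", "wheat", "the", "maize"}
wheat_counts = Counter({"grain": 3, "wheat": 1, "the": 6})

# The unseen word gets a small positive probability instead of zero.
print(smoothed_prob("maize", wheat_counts, vocab))
# The estimates over the vocabulary still sum to 1 (up to rounding).
print(sum(smoothed_prob(w, wheat_counts, vocab) for w in vocab))
```

The α|V| term in the denominator is what keeps the smoothed estimates a proper distribution over the vocabulary.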

Avoiding Underflow

  • Consider:
    • Many docs have more than 100 words
    • Word probabilities will each be small, below some bound $p < 1$
    • So $P(X \mid Y) < p^{100}$ for any such document $X$ → UNDERFLOW!!
  • Solution: $\log a > \log b$ iff $a > b$, so compare log scores instead:
    • $\log[P(X \mid Y) P(Y)] = \log P(X \mid Y) + \log P(Y)$
    • $\log P(X \mid Y) = \sum_{w_i \in X} \log P(w_i \mid Y)$
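
The effect is easy to reproduce with ordinary double-precision floats. In this sketch the document length (400 words) and the per-word probability (1e-3) are hypothetical values chosen so that the product falls below the float64 range.

```python
import math

# Hypothetical document: 400 words, each with probability 1e-3 under class y.
word_probs = [1e-3] * 400

product = 1.0
for p in word_probs:
    product *= p          # true value is 1e-1200, far below the float64 range

# Summing logs keeps the score in a comfortable numeric range.
log_score = sum(math.log(p) for p in word_probs)

print(product)    # 0.0
print(log_score)
```

Because the logarithm is monotonic, comparing log scores across classes picks the same class as comparing the raw products would.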

Text Classification with Naive Bayes

  • Putting this together:
    • for each document x_i with label y_i:
        d_count[y_i]++
        d_count++
        for each word w_ij in x_i:
            w_count[w_ij][y_i]++
            w_count[y_i]++
    • to classify a new x = w_1 ... w_n, pick y with top score:

$$\text{score}(y, w_1, \ldots, w_n) = \log \frac{\text{d\_count}[y]}{\text{d\_count}} + \sum_{i=1}^{n} \log \frac{\text{w\_count}[w_i][y] + 0.5}{\text{w\_count}[y] + 1}$$

    • key point: we only need counts for words that actually appear in x
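
The counting and scoring steps can be sketched as a small Python class. This is an illustration under stated assumptions, not the course's reference implementation: the class and method names are hypothetical, the training documents are made up, and the smoothing uses α in the numerator and α|V| in the denominator rather than the constants 0.5 and 1 in the score formula.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    def __init__(self, alpha=0.5):
        self.alpha = alpha
        self.d_count = Counter()              # documents per class
        self.w_count = defaultdict(Counter)   # w_count[y][w]: occurrences of w in class y
        self.w_total = Counter()              # total word occurrences per class
        self.vocab = set()                    # all words seen in training

    def train(self, docs):
        """docs: iterable of (list_of_words, label) pairs."""
        for words, y in docs:
            self.d_count[y] += 1
            for w in words:
                self.w_count[y][w] += 1
                self.w_total[y] += 1
                self.vocab.add(w)

    def score(self, y, words):
        """log Pr(y) + sum over i of log Pr(w_i | y), with add-alpha smoothing."""
        n_docs = sum(self.d_count.values())
        s = math.log(self.d_count[y] / n_docs)
        denom = self.w_total[y] + self.alpha * len(self.vocab)
        for w in words:
            # Only counts for words that appear in the document are touched.
            s += math.log((self.w_count[y][w] + self.alpha) / denom)
        return s

    def classify(self, words):
        return max(self.d_count, key=lambda y: self.score(y, words))

# Made-up training data for illustration.
train = [
    ("grain wheat export tonnes".split(), "wheat"),
    ("wheat crop grain".split(), "wheat"),
    ("stocks shares market".split(), "other"),
    ("market prices shares fell".split(), "other"),
]
nb = NaiveBayes()
nb.train(train)
print(nb.classify("grain export".split()))
```

Scores are accumulated in log space, so long documents do not underflow, and the loop in `score` visits only the words of the document being classified.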