Natural Language Processing: CSE 591 Fall 2008 Course Outline - Prof. Joerg Hakenberg

An outline of the natural language processing (NLP) course CSE 591 offered in Fall 2008. The course covers topics such as tokenization, stemming, part-of-speech tagging, text mining, information retrieval, and word sense disambiguation. Students take part in programming exercises, homework assignments, and a class project. The notes also discuss features and vector space models, stop words, and stemming.

Uploaded on 09/02/2009

CSE 591
Natural language processing
-Tokenization, Stemming, Part-of-speech-
Fall 2008

Class format

• two exams: mid-term and final

- test theoretical aspects and principles

• four homeworks, ca. two weeks each

- mostly programming exercises

- groups of two students possible

- or: one larger class project, one student each

• programming exercise + presentation

• reading + 10-page summary + presentation

• exams account for 30% of final grade each

• homeworks 10% each (project: 40%)

• a few excess points per exam/homework

Features and VSM

• bag-of-words: represent document as set of words

• a feature is a property [of a document]

- mainly: observation, i.e., occurrence of a certain term

• a document may be represented by a list of features

➠ a feature vector

• vector space model (VSM)

• comparison of documents/queries

➱ comparison of vectors

- Cosine similarity: angle between vectors

- Euclidean distance: ordinary distance between points

[Figure: term lists for Doc1, Doc2, and a Query over the terms and, by, cat, dog, eat, rain, the, was]
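The two vector comparisons can be sketched in a few lines of Python. This is a minimal illustration, not course material: the example texts are invented (they are not the Doc1/Doc2 from the slide), the function names are hypothetical, and term counts are used as feature values instead of plain set membership.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Map a text to term counts (lowercased, split on whitespace)."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Ordinary distance between the two points in term space."""
    terms = a.keys() | b.keys()
    # A Counter returns 0 for terms it has not seen.
    return math.sqrt(sum((a[t] - b[t]) ** 2 for t in terms))

# Invented example texts (assumption, not taken from the slide).
doc1 = bag_of_words("the cat and the dog")
doc2 = bag_of_words("rain and rain")
query = bag_of_words("cat dog")

print(cosine_similarity(doc1, query))
print(cosine_similarity(doc2, query))
print(euclidean_distance(doc1, query))
```

Here the query shares two terms with the first text and none with the second, so the first text gets the higher cosine score.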

Features of documents

• Words vs tokens

- token: “anything treated as a single symbol during syntax analysis”

- token: include punctuation (as separate tokens)

- token: split at hyphens/slashes

- “… CD95/Fas-mediated apoptosis …”

- separate words/tokens from punctuation

• Word n-grams: n adjacent words

• noun phrases, verb phrases, adjective phrases

- “abstruse scientific publication”

- head noun: publication
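A rough sketch of the tokenization rules and of word n-grams follows. The regular expression is an assumption chosen for illustration (ASCII letters and digits only); real tokenizers handle many more cases, and the function names are hypothetical.

```python
import re

# Runs of letters/digits form one token; every other non-space character
# (punctuation, hyphen, slash) becomes a separate token.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]")

def tokenize(text):
    """Split text into word tokens and single-character punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

def word_ngrams(tokens, n):
    """Return all sequences of n adjacent tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = tokenize("CD95/Fas-mediated apoptosis.")
print(tokens)
# ['CD95', '/', 'Fas', '-', 'mediated', 'apoptosis', '.']

print(word_ngrams(["abstruse", "scientific", "publication"], 2))
# [('abstruse', 'scientific'), ('scientific', 'publication')]
```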

Zipfian distribution

  • words follow a Zipf distribution across documents
    • few words occur very frequently
    • some words occur a medium number of times
    • most words occur infrequently
  • Zipf’s law: “frequency of a word is inversely proportional to its rank”
    • “the” has rank 1, frequency of 61847

    Rank  Word   PoS   Freq
    1     the    Det   61847
    2     of     Prep  29391
    3     and    Conj  26817
    4     a      Det   21626
    5     in     Prep  18214
    6     to     Inf   16284
    7     it     Pron  10875
    8     is     Verb   9982
    9     to     Prep   9343
    10    was    Verb   9236
    11    I      Pron   8875
    12    for    Prep   8412
    13    that   Conj   7308
    14    you    Pron   6954
    15    he     Pron   6810
    16    be*    Verb   6644
    17    with   Prep   6575
    18    on     Prep   6475
    19    by     Prep   5096
    20    at     Prep   4790
    21    have*  Verb   4735
    22    are    Verb   4707
    23    not    Neg    4626

English 250,000 words
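If frequency were exactly inversely proportional to rank, the product rank × frequency would be constant. A quick check on the first five rows of the frequency table (values copied from the slide; the function name is hypothetical) shows that the law holds only approximately:

```python
# (word, frequency) pairs copied from the top of the frequency table.
top_words = [("the", 61847), ("of", 29391), ("and", 26817), ("a", 21626), ("in", 18214)]

def rank_times_frequency(word_freqs):
    """Rank words by descending frequency and compute rank * frequency."""
    ranked = sorted(word_freqs, key=lambda wf: wf[1], reverse=True)
    return [(word, rank, freq, rank * freq)
            for rank, (word, freq) in enumerate(ranked, start=1)]

for word, rank, freq, product in rank_times_frequency(top_words):
    print(f"{rank:>2} {word:<4} {freq:>6} {product:>6}")
```

The products for these five rows range from 58782 to 91070: the same order of magnitude, but not a constant.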

Word stems and lemmata

• a document might contain different word forms to refer to the same concept

• a feature vector should contain one entry for each

word

• organ, organs, organelle in different documents

- same concept

- word stem: organ

• organize, organization

- different from organ

- lemma: organize vs organ

organ, organs, organelle, organize, organized, organization, …

➠ organ, organize, …
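A toy suffix stripper reproduces the mapping for the six example words. The rule list is hand-written for exactly these words (an assumption made for illustration); it is not the Porter stemmer or any other published algorithm, and it will mis-stem many other words.

```python
# Hypothetical suffix rules, tried in order; tuned only to the slide's examples.
SUFFIX_RULES = [("ization", "ize"), ("ized", "ize"), ("elle", ""), ("s", "")]

def toy_stem(word, min_stem_length=3):
    """Apply the first matching suffix rule that leaves a long enough stem."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        stem = word[:len(word) - len(suffix)]
        if word.endswith(suffix) and len(stem) >= min_stem_length:
            return stem + replacement
    return word

for w in ["organ", "organs", "organelle", "organize", "organized", "organization"]:
    print(w, "->", toy_stem(w))
```

The first three words map to "organ" and the last three to "organize", so a feature vector would need two entries instead of six.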

Stop words and stems

[Figure: panels labeled “Stemming”, “10000 stop words”, and “100 stop words”]

What’s left

• different documents that use entirely different words

to refer to the same/very similar concepts

- organization, venture, company, agency, association

- may be combined into one feature

• later in this course: word sense disambiguation

- bank vs bank vs bank

Part-of-speech

• useful for generating features for tasks in NLP/ML

- IR, NER, indexing, WSD, etc.

- NER: a verb will most likely not refer to a person,

location, disease, gene, …

- WSD: “to duck” vs “the duck”

• necessary for chunking and parsing

- subsequent tasks in NLP

- noun phrases

- subj/obj relations

duck_verb dog_verb break_noun …

duck_noun dog_noun break_verb …

Part-of-speech tagging

• POS tagging: assigning the POS tag to a word

• POS is not fixed for a given word/word form:

- break, duck, dogs - noun or verb?

• consider word and its context in a sentence

- “the duck”

• closed word classes vs open word classes

- preposition: closed (for, by, …)

- nouns: open (computer, Frappuccino, …)

- consider context and features of each word, e.g., its suffix: -sed

➠ likely a verb

• most common tag + proper noun for unknowns ➱ 90% accuracy

• rule-based, stochastic, and neural approaches
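The baseline mentioned above (most common tag per word, proper noun for unknown words) can be sketched as follows. The tiny training set is invented, the function names are hypothetical, and the Penn Treebank style tag names (DT, NN, NNP, ...) are an assumption; the slides do not name a tag set.

```python
from collections import Counter, defaultdict

def train_baseline(tagged_sentences):
    """Return a lexicon mapping each lowercased word to its most frequent tag."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tag_counts.most_common(1)[0][0]
            for word, tag_counts in counts.items()}

def tag_baseline(words, lexicon, unknown_tag="NNP"):
    """Tag known words with their most common tag, unknown words as proper nouns."""
    return [(w, lexicon.get(w.lower(), unknown_tag)) for w in words]

# Invented toy training data (assumption).
training = [
    [("the", "DT"), ("duck", "NN"), ("swims", "VBZ")],
    [("the", "DT"), ("duck", "NN"), ("eats", "VBZ")],
    [("they", "PRP"), ("duck", "VBP")],
]
lexicon = train_baseline(training)
print(tag_baseline(["The", "duck", "saw", "Frappuccino"], lexicon))
# [('The', 'DT'), ('duck', 'NN'), ('saw', 'NNP'), ('Frappuccino', 'NNP')]
```

The wrong tag for "saw" shows the limit of this baseline: it ignores context entirely, which is what the taggers below try to fix.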

Brill tagger

• initially tags each word by assigning the most likely category (based on

a training corpus: 90% WSJ)

• unseen words: capitalization and 3-letter suffix

• error rate of 7.9%

• acquires patches from a patch corpus (5% WSJ)

• 8 templates: change tag a to b when

• preceding (following) tag is z

• word two before (after) has tag z

• preceding tag is z and following tag is w

• one of the three tags before (after) is z

• reduces error rate to 5.1% using 71 patches
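Applying one patch of the first template ("change tag a to b when the preceding tag is z") might look like the sketch below. The function name and the example patch are chosen for illustration, the tag names assume a Penn Treebank style tag set, and the step that learns patches from the patch corpus is not shown.

```python
def apply_patch(tags, from_tag, to_tag, prev_tag):
    """Change from_tag to to_tag wherever the preceding tag is prev_tag.

    Context is checked against the input tags, so changes made in this
    pass do not trigger further changes in the same pass.
    """
    result = list(tags)
    for i in range(1, len(tags)):
        if tags[i] == from_tag and tags[i - 1] == prev_tag:
            result[i] = to_tag
    return result

# "to duck": the initial tagger picks the most likely tag NN for "duck";
# the patch NN -> VB after TO corrects it.
print(apply_patch(["TO", "NN"], "NN", "VB", "TO"))   # ['TO', 'VB']
# "the duck": the patch does not fire after a determiner.
print(apply_patch(["DT", "NN"], "NN", "VB", "TO"))   # ['DT', 'NN']
```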

TnT - Tags’n’Trigrams

• uses a 2nd-order hidden Markov model (HMM)

• tokens

• trigrams, bigrams, unigrams

• unseen words: 98% of words with -able are adjectives

• P(t|4-suffix), smoothed by P(t|3-suffix) ... P(t)

• capitalization: P(t3|t1,t2) => P(t3,c3|t1,c1,t2,c2)

• accuracy of 96.7% on WSJ (85.5% for unseen tokens)
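One common way to combine trigram, bigram, and unigram statistics is linear interpolation of the three relative-frequency estimates. The sketch below assumes fixed placeholder weights and an invented toy tag corpus; it shows only the transition probability P(t3|t1,t2), not the full HMM decoder, the suffix model, or the capitalization flags.

```python
from collections import Counter

def count_ngrams(tag_sequences):
    """Count tag unigrams, bigrams, and trigrams in a list of tag sequences."""
    uni, bi, tri = Counter(), Counter(), Counter()
    for tags in tag_sequences:
        for i, t in enumerate(tags):
            uni[t] += 1
            if i >= 1:
                bi[(tags[i - 1], t)] += 1
            if i >= 2:
                tri[(tags[i - 2], tags[i - 1], t)] += 1
    return uni, bi, tri

def interpolated_prob(t1, t2, t3, uni, bi, tri, lambdas=(0.2, 0.3, 0.5)):
    """P(t3 | t1, t2) as a weighted sum of unigram, bigram, trigram estimates.

    The weights are placeholders (assumption); they must sum to 1.
    """
    l1, l2, l3 = lambdas
    total = sum(uni.values())
    p_uni = uni[t3] / total if total else 0.0
    p_bi = bi[(t2, t3)] / uni[t2] if uni[t2] else 0.0
    p_tri = tri[(t1, t2, t3)] / bi[(t1, t2)] if bi[(t1, t2)] else 0.0
    return l1 * p_uni + l2 * p_bi + l3 * p_tri

# Invented toy tag sequences (assumption).
corpus = [["DT", "NN", "VBZ"], ["DT", "NN", "VBZ"], ["DT", "JJ", "NN"]]
uni, bi, tri = count_ngrams(corpus)
print(interpolated_prob("DT", "NN", "VBZ", uni, bi, tri))  # seen trigram
print(interpolated_prob("JJ", "NN", "VBZ", uni, bi, tri))  # unseen trigram, still > 0
```

The second call shows why the lower-order estimates matter: a trigram that never occurred in training still receives a non-zero probability.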

Period disambiguation

• fixed rule-based vs machine learning

- ML: supervised vs unsupervised

• rule-based: some heuristics + special cases

- l.c. + period + white space + u.c.

- no split within parentheses

- no split after known abbreviations (vs., etc., pp., …)

- no split if a sentence of fewer than 3 words would result
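The four heuristics can be combined into a small splitter. This is a sketch under simplifying assumptions: the abbreviation list contains only the three examples from the slide, parentheses are tracked by counting, the minimum-length rule is applied only to the sentence left of the candidate split, and the function name is hypothetical.

```python
import re

KNOWN_ABBREVIATIONS = {"vs.", "etc.", "pp."}

# Candidate boundary: lower-case letter + period + white space + upper-case letter.
BOUNDARY = re.compile(r"(?<=[a-z])\.\s+(?=[A-Z])")

def split_sentences(text, min_words=3):
    """Split text at candidate boundaries that pass all heuristics."""
    sentences = []
    start = 0
    for match in BOUNDARY.finditer(text):
        end = match.start() + 1  # position just after the period
        candidate = text[start:end]
        words = candidate.split()
        # More "(" than ")" so far means the period is inside parentheses.
        inside_parens = text.count("(", 0, end) > text.count(")", 0, end)
        if words[-1] in KNOWN_ABBREVIATIONS or inside_parens:
            continue
        if len(words) < min_words:
            continue
        sentences.append(candidate.strip())
        start = match.end()
    rest = text[start:].strip()
    if rest:
        sentences.append(rest)
    return sentences

print(split_sentences("We compare cats vs. Dogs here. It rained a lot today."))
# ['We compare cats vs. Dogs here.', 'It rained a lot today.']
```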

Unsupervised P.D.

• unsupervised: no (manually labeled) training data

• example: Schmid 2000

• two passes over a text:

(1) gather frequencies of

• likely abbreviations and names (initials)

- sequence of characters that always appears with trailing period

• lower case words

- indicator for split when appearing capitalized after period

• words before+after numbers

- number disambiguation

(2) use information to disambiguate periods

• >99.5% accuracy on Wall-Street-Journal corpus (WSJ)
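The two-pass idea can be illustrated with a heavily simplified sketch. It is not Schmid's actual statistical model: it uses hard rules instead of probabilities, skips the number handling, and its thresholds, example text, and function names are assumptions.

```python
from collections import Counter

def gather_statistics(tokens):
    """Pass 1: count forms with and without trailing period, and lower-case words."""
    with_period, without_period, lowercase_words = Counter(), Counter(), Counter()
    for tok in tokens:
        if tok.endswith(".") and len(tok) > 1:
            with_period[tok[:-1].lower()] += 1
        else:
            without_period[tok.lower()] += 1
            if tok.islower():
                lowercase_words[tok] += 1
    return with_period, without_period, lowercase_words

def find_boundaries(tokens):
    """Pass 2: return indices of tokens whose period ends a sentence."""
    with_period, without_period, lowercase_words = gather_statistics(tokens)
    boundaries = []
    for i, tok in enumerate(tokens):
        if not (tok.endswith(".") and len(tok) > 1):
            continue
        form = tok[:-1].lower()
        # Likely abbreviation: seen repeatedly, and never without the period.
        likely_abbreviation = without_period[form] == 0 and with_period[form] > 1
        if i + 1 == len(tokens):
            boundaries.append(i)
            continue
        nxt = tokens[i + 1]
        # A capitalized word that also occurs in lower case suggests a sentence start.
        next_starts_sentence = nxt[:1].isupper() and lowercase_words[nxt.lower().rstrip(".")] > 0
        if next_starts_sentence or not likely_abbreviation:
            boundaries.append(i)
    return boundaries

# Invented example text (assumption), tokenized on white space.
text = "We saw cats vs. dogs today. The cats won vs. the dogs. It was fun."
print(find_boundaries(text.split()))
# [5, 11, 14]
```

In this example "vs." only ever appears with a trailing period, so neither of its periods is treated as a sentence boundary.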