Natural Language Processing: CSE 591 Fall 2008 Course Outline - Prof. Joerg Hakenberg

An outline of the natural language processing (NLP) course CSE 591 offered in Fall 2008. The course covers topics such as tokenization, stemming, part-of-speech tagging, text mining, information retrieval, and word sense disambiguation. Students participate in programming exercises, homework assignments, and a class project. The document also discusses features and vector space models, stop words, and word stems.

CSE 591

Natural language processing

Tokenization, Stemming, Part-of-speech

Fall 2008

Class format

  • two exams: mid-term and final
    • test theoretical aspects and principles
  • four homework assignments, ca. two weeks each
    • mostly programming exercises
    • groups of two students possible
    • or: one larger class project, one student each
      • programming exercise + presentation
      • reading + 10-page summary + presentation
  • exams account for 30% of the final grade each
  • homework assignments 10% each (project: 40%)
  • a few excess points per exam/homework

Last week

  • Introduction to text mining
    • problems and examples
    • NER, EMN, WSD, …
  • Information retrieval
    • document representation
      • bag-of-words
      • feature, feature vector, vector space
    • document similarity
      • Cosine coefficient, Dice, Jaccard, …
  • IR vs question answering

[Figure: two document vectors in a “cat”/“dog” term space, with angle α between them]

  • bag-of-words: represent a document as a set of words
  • a feature is a property [of a document]
    • mainly: an observation, i.e., the occurrence of a certain term
  • a document may be represented by a list of features ➠ a feature vector
  • vector space model (VSM)
  • comparison of documents/queries ➱ comparison of vectors (see the sketch below)
    • Cosine similarity: angle between vectors
    • Euclidean distance: ordinary distance between points
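The vector comparison is easy to make concrete. Below is a minimal sketch of cosine similarity over bag-of-words vectors, plus the Jaccard coefficient over word sets; the two example documents and the naive whitespace tokenization are illustrative, not from the slides.

```python
import math
from collections import Counter

def cosine(doc1, doc2):
    """Cosine similarity: cosine of the angle between two bag-of-words vectors."""
    v1, v2 = Counter(doc1.lower().split()), Counter(doc2.lower().split())
    dot = sum(v1[t] * v2[t] for t in v1)  # terms missing from v2 count as 0
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (norm1 * norm2)

def jaccard(doc1, doc2):
    """Jaccard coefficient: overlap of the two word sets."""
    s1, s2 = set(doc1.lower().split()), set(doc2.lower().split())
    return len(s1 & s2) / len(s1 | s2)

print(cosine("the cat and the dog", "the cat was in the rain"))
print(jaccard("the cat and the dog", "the cat was in the rain"))
```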

Features and VSM

[Figure: feature vectors for Doc1, Doc2, and a query over the terms and, by, cat, dog, eat, rain, the, was]

Features of documents

  • Words vs tokens
    • token: “anything treated as a single symbol during syntax analysis”
    • token: include punctuation (as separate tokens)
    • token: split at hyphens/slashes
    • “… CD95/Fas-mediated apoptosis …”
    • separate words/tokens from punctuation (see the tokenizer sketch after this list)
  • Word n-grams: n adjacent words
  • noun phrases, verb phrases, adjective phrases
    • “abstruse scientific publication”
    • head noun: publication
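As a concrete illustration of the splitting policy above, here is a hedged sketch of a regex tokenizer that keeps punctuation as separate tokens and splits at hyphens and slashes (whether to split there is a design decision, as the slide notes):

```python
import re

# \w+ grabs runs of letters/digits/underscores; [^\w\s] emits every
# other non-space character (punctuation, hyphens, slashes) as its
# own token.
TOKEN = re.compile(r"\w+|[^\w\s]")

def tokenize(text):
    return TOKEN.findall(text)

print(tokenize("CD95/Fas-mediated apoptosis."))
# ['CD95', '/', 'Fas', '-', 'mediated', 'apoptosis', '.']
```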

Stop words

  • some words are more important than others
    • when comparing documents
  • see last week: weighting schemes, e.g., TF*IDF
  • sometimes useful to filter out words that occur in every document
    • and, or, the, for, by, in, at, … = stop words
    • numbers, ranges, percentages, dates, … ➠ replace with markup
    • stop word filtering (see the sketch after this list)
    • in general: the most frequent words in a document collection ➠ domain dependent
    • filtering is also task-dependent
      • disambiguation of “pen”: playpen vs writing utensil
      • “into the pen” ➠ playpen, cote
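A minimal sketch of the filtering step, assuming a tiny hand-picked stop-word list and a simple number pattern (real systems would derive the list from collection frequencies and, as noted above, adapt it to the task):

```python
import re

STOP_WORDS = {"and", "or", "the", "for", "by", "in", "at", "of"}
NUMBER = re.compile(r"^\d+([.,]\d+)?%?$")  # numbers and percentages

def filter_tokens(tokens):
    kept = []
    for tok in tokens:
        if tok.lower() in STOP_WORDS:
            continue                 # drop stop words entirely
        if NUMBER.match(tok):
            kept.append("<NUM>")     # replace numbers with markup
        else:
            kept.append(tok)
    return kept

print(filter_tokens(["found", "in", "98", "%", "of", "cases"]))
# ['found', '<NUM>', '%', 'cases']
```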

Zipfian distribution

  • words follow a Zipf distribution across documents
    • few words occur very frequently
    • some words occur a medium number of times
    • most words occur infrequently
  • Zipf’s law: “frequency of a word is inversely proportional to its rank”
    • “the” has rank 1, frequency 61847 (English: ca. 250,000 words; see the check below)

      Word   PoS    Freq
      the    Det    61847
      of     Prep   29391
      and    Conj   26817
      a      Det    21626
      in     Prep   18214
      to     Inf    16284
      it     Pron   10875
      is     Verb    9982
      to     Prep    9343
      was    Verb    9236
      I      Pron    8875
      for    Prep    8412
      that   Conj    7308
      you    Pron    6954
      he     Pron    6810
      be*    Verb    6644
      with   Prep    6575
      on     Prep    6475
      by     Prep    5096
      at     Prep    4790
      have*  Verb    4735
      are    Verb    4707
      not    Neg     4626

Word stems and lemmata

  • a document might contain different word forms that refer to the same concept
  • a feature vector should contain one entry for each word
  • organ, organs, organelle in different documents
    • same concept
    • word stem: organ
  • organize, organization
    • different from organ
    • lemma: organize vs organ

[Figure: word forms organ, organs, organelle, organize, organized, organization collapsed to the two features organ and organize]

Porter stemmer

  • Five+ sets of rules that successively alter/remove suffixes (excerpt):

    1a   SSES -> SS                      caresses  -> caress
         IES  -> I                       ponies    -> poni,  ties -> ti
         SS   -> SS                      caress    -> caress
         S    ->                         cats      -> cat
    1b   (*v*) ED  ->                    plastered -> plaster,  bled -> bled
         (*v*) ING ->                    motoring  -> motor,    sing -> sing
    1b’  AT -> ATE                       conflat(ed) -> conflate
         BL -> BLE                       troubl(ed)  -> trouble
         IZ -> IZE                       siz(ed)     -> size
         (*d and not (*L or *S or *Z))
              -> single letter           hopp(ing) -> hop,  tann(ed) -> tan,
                                         fall(ing) -> fall
    5a   (m>1) E ->                      probate -> probat,  rate -> rate
         (m=1 and not *o) E ->           cease -> ceas
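In practice one rarely reimplements these rules by hand. A sketch using the Porter stemmer shipped with NLTK (assumes `pip install nltk`; exact outputs can differ slightly between Porter variants):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["caresses", "ponies", "cats", "plastered",
             "motoring", "hopping", "conflated", "probate"]:
    print(word, "->", stemmer.stem(word))
# e.g., caresses -> caress, motoring -> motor, hopping -> hop
```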

Stop words and stems

[Figure: vocabulary reduction from stemming and from filtering the 100 vs 10,000 most frequent stop words]

What’s left

  • different documents may use entirely different words to refer to the same/very similar concepts
  • organization, venture, company, agency, association
  • may be combined into one feature
  • later in this course: word sense disambiguation
    • bank vs bank vs bank

Part-of-speech

  • part-of-speech tag: syntactic category of a word
  • 8 basic POS in English: noun, verb, adjective, adverb, pronoun, preposition, conjunction, interjection
  • but many more categories/sub-categories
    • noun (singular), noun (plural), proper noun (singular), …
    • verb (past tense), verb (present tense), …
  • typically, we distinguish between 50 and 150 POS tags
    • see, for instance, the Brown Corpus, http://www.scs.leeds.ac.uk/ccalas/tagsets/brown.html
    • a text collection with >1 million words

Part-of-speech

  • useful for generating features for tasks in NLP/ML
    • IR, NER, indexing, WSD, etc.
    • NER: a verb will most likely not refer to a person, location, disease, gene, …
    • WSD: “to duck” vs “the duck”
  • necessary for chunking and parsing
    • subsequent tasks in NLP
    • noun phrases
    • subj/obj relations

[Figure: POS-disambiguated features duck_verb, dog_verb, break_noun vs duck_noun, dog_noun, break_verb]

Part-of-speech tagging

  • POS tagging: assigning the POS tag to a word
  • POS is not fixed for a given word/word form:
    • break, duck, dogs - noun or verb?
  • consider the word and its context in a sentence
    • “the duck”
  • closed word classes vs open word classes
    • prepositions: closed (for, by, …)
    • nouns: open (computer, Frappuccino, …)
    • consider context and features of each word, e.g., its suffix: -sed ➠ likely a verb
  • most common tag + proper noun for unknowns ➱ 90% accuracy (see the baseline sketch below)
  • rule-based, stochastic, and neural approaches
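The ~90% baseline mentioned above is simple to sketch: give each known word its most frequent training tag, and guess proper noun for capitalized unknowns. The Penn-style tag names and the toy training data here are illustrative.

```python
from collections import Counter, defaultdict

def train_baseline(tagged_sentences):
    """Remember the most frequent tag seen for each word in training."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag(words, model, default="NN"):
    # Unknown words: proper noun (NNP) if capitalized, else the default tag.
    return [(w, model.get(w.lower(), "NNP" if w[:1].isupper() else default))
            for w in words]

model = train_baseline([[("the", "DT"), ("duck", "NN"), ("swam", "VBD")]])
print(tag(["the", "duck", "Frappuccino"], model))
# [('the', 'DT'), ('duck', 'NN'), ('Frappuccino', 'NNP')]
```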

Brill tagger

  • “error-driven transformation-based tagger”
  • if the word is known, assigns its most frequent tag
  • unknown ➠ assigns “noun” / “proper noun” (by case)
  • predefined rules change tags (transformations)
  • learns rules from labeled examples (e.g., Brown corpus)
    • lexical rule: word ➠ tag IF condition (suffix is -tion)
    • contextual rule: tag1 ➠ tag2 IF condition (preceding tag is noun)
  • get the error rate of each candidate rule (select the best)
  • apply it to the whole document, add it to the rule set
  • until applying the rule set does not change the document anymore ➱ a model that can be applied to arbitrary text (see the sketch below)
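A minimal sketch of the transformation loop: contextual rules rewrite tags until nothing changes. The single rule and the Penn-style tags are illustrative, not Brill’s learned rule set.

```python
def apply_rules(tagged, rules):
    """Apply contextual rules (old_tag, new_tag, required_previous_tag)
    repeatedly until the tagging no longer changes."""
    changed = True
    while changed:
        changed = False
        for i, (word, tag) in enumerate(tagged):
            prev = tagged[i - 1][1] if i > 0 else None
            for old, new, prev_required in rules:
                if tag == old and prev == prev_required:
                    tagged[i] = (word, new)
                    tag = new
                    changed = True
    return tagged

# "change NN to VB when the preceding tag is TO", fixing "to duck"
rules = [("NN", "VB", "TO")]
print(apply_rules([("to", "TO"), ("duck", "NN")], rules))
# [('to', 'TO'), ('duck', 'VB')]
```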

Brill tagger

  • initially tags each word by assigning the most likely category (based on a training corpus: 90% of the WSJ)
  • unseen words: capitalization and 3-letter suffix
  • error rate of 7.9%
  • acquires patches from a patch corpus (5% of the WSJ)
    • 8 templates: change tag a to b when
      • preceding (following) tag is z
      • word two before (after) has tag z
      • preceding tag is z and following tag is w
      • one of the three tags before (after) is z
      • ...
    • reduces the error rate to 5.1% using 71 patches

TnT - Tags’n’Trigrams

  • uses a 2nd-order hidden Markov model (HMM)
    • states: POS tags, emissions: tokens
    • interpolates trigrams, bigrams, unigrams (see the sketch below)
  • unseen words: 98% of words ending in -able are adjectives
    • P(t|4-suffix), smoothed by P(t|3-suffix) ... P(t)
    • capitalization: P(t3|t1,t2) => P(t3,c3|t1,c1,t2,c2)
  • accuracy of 96.7% on the WSJ (85.5% for unseen tokens)
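The core of such a tagger is the smoothed trigram estimate. A sketch with made-up probabilities and fixed interpolation weights (TnT actually estimates the lambda weights by deleted interpolation):

```python
def trigram_prob(t1, t2, t3, uni, bi, tri, lambdas=(0.1, 0.3, 0.6)):
    """Interpolated P(t3 | t1, t2) from unigram, bigram and trigram
    relative frequencies; unseen n-grams contribute 0."""
    l1, l2, l3 = lambdas
    return (l1 * uni.get(t3, 0.0)
            + l2 * bi.get((t2, t3), 0.0)
            + l3 * tri.get((t1, t2, t3), 0.0))

# Illustrative relative frequencies, not corpus-derived numbers.
uni = {"VB": 0.2}
bi = {("TO", "VB"): 0.7}
tri = {("DT", "TO", "VB"): 0.9}
print(trigram_prob("DT", "TO", "VB", uni, bi, tri))  # 0.02 + 0.21 + 0.54 = 0.77
```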

Period disambiguation

  • many (sub)tasks work on the sentence level instead of the whole text
  • POS tagging, parsing
  • relation mining, machine translation, summarization
  • splitting a text into sentences is not a trivial task
  • good heuristic: split at a “lower case + period + white space + upper case” sequence
    • “... found to be 9.8 (p < 0.0005) ...” => 2x no
    • “... on the X chromosome. The ...” => yes
    • “... as published by A. Greenfield et al. in ...” => 2x no
    • “... by A. Greenfield et al. This study ...” => no, yes
    • “... found (A. Greenfield. ICMLʼ07, 14-28) ...” => 2x no
    • “... found as fibronectin A. Greenfield showed that ...” => yes

Period disambiguation

  • fixed rule-based vs machine learning
    • ML: supervised vs unsupervised
  • rule-based: some heuristics + special cases (see the sketch after this list)
    • lower case + period + white space + upper case
    • no split within parentheses
    • no split after known abbreviations (vs., etc., pp., …)
    • no split if a sentence of <3 words would result
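A sketch of these heuristics as a splitter. The abbreviation list is a tiny illustrative sample, and the parenthesis and sentence-length rules are omitted for brevity:

```python
import re

ABBREVIATIONS = {"vs.", "etc.", "pp."}
# Candidate boundaries: period + white space + upper-case letter.
CANDIDATE = re.compile(r"\.\s+(?=[A-Z])")

def split_sentences(text):
    sentences, start = [], 0
    for m in CANDIDATE.finditer(text):
        last = text[start:m.start() + 1].split()[-1]  # token ending in "."
        if last in ABBREVIATIONS or re.fullmatch(r"[A-Z]\.", last):
            continue  # known abbreviation or initial like "A." -> no split
        sentences.append(text[start:m.end()].rstrip())
        start = m.end()
    if start < len(text):
        sentences.append(text[start:])
    return sentences

print(split_sentences("Found on the X chromosome. The study by "
                      "A. Greenfield et al. This confirms it."))
# ['Found on the X chromosome.', 'The study by A. Greenfield et al.',
#  'This confirms it.']
```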

Unsupervised P.D.

  • unsupervised: no (manually labeled) training data
  • example: Schmid 2000
  • two passes over a text:
    1. gather frequencies of
      • likely abbreviations and names (initials): character sequences that always appear with a trailing period
      • lower-case words: indicators for a split when they appear capitalized after a period
      • words before+after numbers ➠ number disambiguation
    2. use this information to disambiguate periods (see the sketch below)
  • 99.5% accuracy on the Wall Street Journal corpus (WSJ)
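A sketch of the first pass, counting how often each token type carries a trailing period to find likely abbreviations; the threshold is illustrative, and Schmid’s actual statistics are richer, as the list above indicates:

```python
from collections import Counter

def likely_abbreviations(tokens, threshold=0.99):
    """Token types that (almost) always appear with a trailing period
    are likely abbreviations or initials."""
    with_period, total = Counter(), Counter()
    for tok in tokens:
        base = tok.rstrip(".")
        if not base:
            continue            # skip bare periods
        total[base] += 1
        if tok.endswith("."):
            with_period[base] += 1
    return {t for t in total
            if with_period[t] > 1 and with_period[t] / total[t] >= threshold}

print(likely_abbreviations("et al. et al. et al. in the in".split()))
# {'al'} -- "al" always carries a period, "et"/"in"/"the" do not
```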

Chunking

  • identifies constituents in a sentence
  • splits a sentence into chunks:
    • noun groups, verb groups
    • compound nouns, etc.
  • also called shallow parsing or light parsing
  • no internal structure, no role in the sentence ➱ that would be full sentence parsing (next week)

Chunking

  • usually solved using regular expressions over POS tags
    • tag patterns
  • simple example: <NN|JJ>+ <N.>
  • more complex example: <JJ.>? <NN.>+
  • example phrases (with their tags):
    • another/DT sharp/JJ dive/NN
    • trade/NN figures/NNS
    • any/DT new/JJ policy/NN measures/NNS
    • earlier/JJR stages/NNS
    • Panamanian/JJ dictator/NN Manuel/NNP Noriega/NNP
    • his/PRP$ Mansion/NNP House/NNP speech/NN
    • the/DT price/NN cutting/VBG
    • 3/CD %/NN to/TO 4/CD %/NN
    • more/JJR than/IN 10/CD %/NN
    • the/DT fastest/JJS developing/VBG trends/NNS
    • 's/POS skill/NN

Chinking

  • chunking the other way around
  • find tokens/sequences that are not part of a chunk ➱ “barked at” is a chink

[ the/DT little/JJ yellow/JJ dog/NN ] barked/VBD at/IN [ the/DT cat/NN ]
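A sketch using NLTK’s regular-expression chunker on the tagged sentence above (assumes `nltk` is installed; the NP grammar is illustrative, not the one from the slide). Chinking everything else out of one big chunk would give the same bracketing:

```python
import nltk

# Chunk NPs: optional determiner, any adjectives, one or more nouns.
grammar = r"NP: {<DT>?<JJ.*>*<NN.*>+}"
parser = nltk.RegexpParser(grammar)

sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),
            ("dog", "NN"), ("barked", "VBD"), ("at", "IN"),
            ("the", "DT"), ("cat", "NN")]
print(parser.parse(sentence))
# (S (NP the/DT little/JJ yellow/JJ dog/NN) barked/VBD at/IN
#    (NP the/DT cat/NN))  -- "barked at" is left outside: the chink
```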

Summary

  • tasks in text analysis build on NLP components
    • generate features for IR, NER, WSD, etc.
    • tokenization, stemming, part-of-speech tagging
      • “bob” is not a person when POS=verb
      • generate “comparable” features
        • organ, organs, and organelles ➠ same feature
        • “the duck” and “to duck” ➠ different features
    • period disambiguation
      • translate text sentence-by-sentence
      • facilitate full sentence parsing

Bibliography

  • POS tagging
    • Eric Brill: A Simple Rule-Based Part of Speech Tagger. ANLP, 152-155, 1992.
    • Eric Brill: Some Advances in Transformation-Based Part of Speech Tagging. AAAI, 722-727, 1994.
    • Thorsten Brants: TnT - A Statistical Part-of-Speech Tagger. ANLP-NAACL, 2000.
  • Stemming
    • M.F. Porter: An algorithm for suffix stripping. Program, 14(3), 130-137, 1980.
  • Period disambiguation
    • Grefenstette & Tapanainen: What is a word, what is a sentence? Problems of tokenization. 1994.
    • H. Schmid: Unsupervised learning of period disambiguation for tokenisation. Tech report, 2000.
    • Mikheev: Document centered approach to text normalization. ACM SIGIR, 2000.
  • Biomedical NLP / text mining
    • Bruijn & Martin: Getting to the (C)ore of knowledge: mining biomedical literature. 2002.
    • Cohen & Hersh: A survey of current work in biomedical text mining. 2005.
    • Cohen & Hunter: Getting started in text mining. PLoS Comp Biol, 2008.
  • Books on various topics: search for last name, year & title to get PDFs, e.g., via CiteSeerX.