
Natural Language Processing: CSE 591 Fall 2008 Course Outline - Prof. Joerg Hakenberg, Study notes of Computer Science

Arizona State University (ASU) - Tempe Computer Science

Prof. Joerg Hakenberg

An outline of the natural language processing (NLP) course CSE 591, offered in fall 2008. The course covers topics such as tokenization, stemming, part-of-speech tagging, text mining, information retrieval, and word sense disambiguation. Students participate in programming exercises, homework assignments, and a class project. The document also discusses features and vector space models, stop words, and stemming.

Typology: Study notes

Pre 2010

Uploaded on 09/02/2009


CSE 591

Natural language processing

-Tokenization, Stemming, Part-of-speech-

Fall 2008


Class format

• two exams: mid-term and final

- test theoretical aspects and principles

• four homeworks, ca. two weeks each

- mostly programming exercises

- groups of two students possible

- or: one larger class project, one student each

• programming exercise + presentation

• reading + 10-page summary + presentation

• exams account for 30% of final grade each

• homeworks 10% each (project: 40%)

• a few excess points per exam/homework

Features and VSM

• bag-of-words: represent a document as an unordered collection of its words

• a feature is a property [of a document]

- mainly: an observation, i.e., the occurrence of a certain term

• a document may be represented by a list of features

➠ a feature vector

• vector space model (VSM)

• comparison of documents/queries

➱ comparison of vectors

- cosine similarity: cosine of the angle between vectors

- Euclidean distance: ordinary distance between points

[Figure: term lists for Doc1, Doc2, and Query over the terms and, by, cat, dog, eat, rain, the, was]
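
The two vector comparisons can be sketched in a few lines of Python. This is a minimal illustration, assuming raw term counts as feature values; the function names and the example document are hypothetical and not taken from the course material.

```python
import math
from collections import Counter


def to_vector(tokens, vocabulary):
    """Bag-of-words feature vector: one term count per vocabulary entry."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (0.0 if either vector is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Ordinary straight-line distance between the points u and v."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical example: one document and one query over a shared vocabulary.
vocabulary = ["and", "cat", "dog", "the"]
doc = to_vector("the cat and the dog".split(), vocabulary)
query = to_vector(["cat", "dog"], vocabulary)
print(cosine_similarity(doc, query), euclidean_distance(doc, query))
```

Cosine similarity ignores vector length, so a long and a short document about the same topic can still score as similar; Euclidean distance does not have this property.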

Features of documents

• Words vs tokens

- token: “anything treated as a single symbol during syntax analysis”

- token: include punctuation (as separate tokens)

- token: split at hyphens/slashes

- “… CD95/Fas-mediated apoptosis …”

- separate words/tokens from punctuation

• Word n-grams: n adjacent words

• noun phrases, verb phrases, adjective phrases

- “abstruse scientific publication”

- head noun: publication
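
A minimal sketch of such a tokenizer and of word n-grams, assuming a single regular expression is enough for illustration: runs of letters and digits become tokens, and every punctuation character (including hyphens and slashes) becomes a token of its own. The names `tokenize` and `word_ngrams` are hypothetical.

```python
import re

# A token is either a run of word characters or one non-space punctuation character.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Split text into tokens; punctuation, hyphens and slashes are separate tokens."""
    return TOKEN_PATTERN.findall(text)


def word_ngrams(tokens, n):
    """All sequences of n adjacent tokens, in order of appearance."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


print(tokenize("CD95/Fas-mediated apoptosis"))
print(word_ngrams(["abstruse", "scientific", "publication"], 2))
```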

Zipfian distribution

words follow a Zipf distribution across documents
- few words occur very frequently
- some words occur a medium number of times
- most words occur infrequently
Zipf’s law: “frequency of a word is inversely proportional to its rank”
- “the” has rank 1, frequency of 61847

Word   PoS   Freq
the    Det   61847
of     Prep  29391
and    Conj  26817
a      Det   21626
in     Prep  18214
to     Inf   16284
it     Pron  10875
is     Verb  9982
to     Prep  9343
was    Verb  9236
I      Pron  8875
for    Prep  8412
that   Conj  7308
you    Pron  6954
he     Pron  6810
be*    Verb  6644
with   Prep  6575
on     Prep  6475
by     Prep  5096
at     Prep  4790
have*  Verb  4735
are    Verb  4707
not    Neg   4626

English 250,000 words
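
Zipf's law can be checked against a frequency list with a short sketch like the following. It assumes the simplest form of the law, f(r) ≈ f(1) / r; the helper names are hypothetical.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, word, frequency) triples, most frequent word first."""
    counts = Counter(tokens)
    return [(rank, word, freq)
            for rank, (word, freq) in enumerate(counts.most_common(), start=1)]


def zipf_estimate(top_frequency, rank):
    """Frequency predicted for a rank if frequency is inversely proportional to rank."""
    return top_frequency / rank


# Using the rank-1 frequency from the table ("the", 61847):
print(zipf_estimate(61847, 2))  # compare with the observed 29391 for "of"
print(zipf_estimate(61847, 3))  # compare with the observed 26817 for "and"
```

For rank 2 the estimate is 30923.5 against an observed 29391, so the law is an approximation rather than an exact fit.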

Word stems and lemmata

• a document might contain different word forms to refer to the same concept

• a feature vector should contain one entry for each word

• organ, organs, organelle in different documents

- same concept

- word stem: organ

• organize, organization

- different from organ

- lemma: organize vs organ

organ, organs, organelle, organize, organized, organization, …

➠ organ, organize, …
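
The effect on the feature vector can be sketched as follows. Note the assumption: the mapping below is a hand-written lookup table built from the example words above, not a stemming algorithm, and `STEMS` and `stem_features` are hypothetical names.

```python
from collections import Counter

# Hand-written lookup for illustration only (an assumption, not a real stemmer).
STEMS = {
    "organ": "organ",
    "organs": "organ",
    "organelle": "organ",
    "organize": "organize",
    "organized": "organize",
    "organization": "organize",
}


def stem_features(tokens, stems=STEMS):
    """Count features after mapping each word form to its stem (or itself if unknown)."""
    return Counter(stems.get(tok.lower(), tok.lower()) for tok in tokens)


print(stem_features(["Organs", "organ", "organized", "organelle", "cat"]))
```

Three different word forms collapse into the single feature `organ`, so documents using different forms end up with overlapping feature vectors.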

Stop words and stems

[Figure; labels: stemming, 10000 stop words, 100 stop words]

What’s left

• different documents that use entirely different words to refer to the same/very similar concepts

- organization, venture, company, agency, association

- may be combined into one feature

• later in this course: word sense disambiguation

- bank vs bank vs bank

Part-of-speech

• useful for generating features for tasks in NLP/ML

- IR, NER, indexing, WSD, etc.

- NER: a verb will most likely not refer to a person, location, disease, gene, …

- WSD: “to duck” vs “the duck”

• necessary for chunking and parsing

- subsequent tasks in NLP

- noun phrases

- subj/obj relations

duck_verb dog_verb break_noun …

duck_noun dog_noun break_verb …

Part-of-speech tagging

• POS tagging: assigning the POS tag to a word

• POS is not fixed for a given word/word form:

- break, duck, dogs - noun or verb?

• consider word and its context in a sentence

- “the duck”

• closed word classes vs open word classes

- preposition: closed (for, by, …)

- nouns: open (computer, Frappuccino, …)

- consider context and features of each word, e.g., its suffix: -sed

➠ likely a verb

• baseline: assign each known word its most common tag and tag unknown words as proper nouns ➱ 90% accuracy

• rule-based, stochastic, and neural approaches
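
The most-common-tag baseline fits in a few lines. This is a minimal sketch, assuming Penn Treebank style tag names (`NNP` for proper nouns) and a tiny made-up training list; neither comes from the slides.

```python
from collections import Counter, defaultdict


def train_baseline(tagged_corpus):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag_baseline(words, model, unknown_tag="NNP"):
    """Tag known words with their most common tag, unknown words as proper nouns."""
    return [(w, model.get(w.lower(), unknown_tag)) for w in words]


# Hypothetical training data: "duck" is seen twice as a noun, once as a verb.
corpus = [("the", "DT"), ("duck", "NN"), ("duck", "NN"), ("duck", "VB"), ("to", "TO")]
model = train_baseline(corpus)
print(tag_baseline(["to", "duck", "Frappuccino"], model))
```

The baseline ignores context entirely, which is why it tags "duck" as a noun even after "to".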

Brill tagger

• initially tags each word by assigning the most likely category (based on a training corpus: 90% WSJ)

• unseen words: capitalization and 3-letter suffix

• error rate of 7.9%

• acquires patches from a patch corpus (5% WSJ)

• 8 templates: change tag a to b when

- preceding (following) tag is z

- word two before (after) has tag z

- preceding tag is z and following tag is w

- one of the three tags before (after) is z

• reduces error rate to 5.1% using 71 patches
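
Applying one patch of the first template ("change tag a to b when the preceding tag is z") can be sketched as below. This shows only how a single, already learned patch is applied, not how patches are acquired; the tag names and the function `apply_patch` are assumptions for illustration.

```python
def apply_patch(tags, from_tag, to_tag, prev_tag):
    """Change from_tag to to_tag wherever the preceding tag is prev_tag.

    Contexts are read from the input tags, so changes made by this patch
    do not trigger further changes within the same pass.
    """
    patched = list(tags)
    for i in range(1, len(tags)):
        if tags[i] == from_tag and tags[i - 1] == prev_tag:
            patched[i] = to_tag
    return patched


# Initial tagging of "to duck" with the most likely category per word:
initial = ["TO", "NN"]
# Hypothetical patch: change NN to VB when the preceding tag is TO.
print(apply_patch(initial, "NN", "VB", "TO"))
```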

TnT - Tags’n’Trigrams

• uses a 2nd-order hidden Markov model (HMM)

• tokens

• trigrams, bigrams, unigrams

• unseen words: 98% of words with -able are adjectives

• P(t|4-suffix), smoothed by P(t|3-suffix) ... P(t)

• capitalization: P(t3|t1,t2) => P(t3,c3|t1,c1,t2,c2)

• accuracy of 96.7% on WSJ (85.5% for unseen tokens)
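
The suffix-based guess for unseen words can be sketched as follows: start from P(t) and refine it with ever longer suffixes up to four characters. The fixed weight `theta` and the tiny training list are assumptions chosen for illustration, and the code is a simplified sketch rather than TnT's actual implementation.

```python
from collections import Counter, defaultdict


def train_suffix_model(tagged_words, max_len=4):
    """Count tags for every word-final suffix of length 0..max_len."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        w = word.lower()
        for i in range(min(max_len, len(w)) + 1):
            counts[w[len(w) - i:]][tag] += 1  # i == 0 gives the empty suffix
    return counts


def tag_probability(tag, word, counts, max_len=4, theta=0.5):
    """P(tag | suffix), smoothed from P(tag) up to the longest seen suffix."""
    w = word.lower()
    prob = counts[""][tag] / sum(counts[""].values())  # unconditional P(t)
    for i in range(1, min(max_len, len(w)) + 1):
        seen = counts.get(w[len(w) - i:])
        if not seen:
            break  # longer suffixes cannot have been seen either
        estimate = seen[tag] / sum(seen.values())
        prob = (estimate + theta * prob) / (1 + theta)
    return prob


# Hypothetical training data.
training = [("readable", "ADJ"), ("portable", "ADJ"), ("table", "NOUN"), ("run", "VERB")]
model = train_suffix_model(training)
print(tag_probability("ADJ", "washable", model), tag_probability("NOUN", "washable", model))
```

Each smoothing step is a weighted average of two distributions, so the probabilities over all tags still sum to one.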

Period disambiguation

• fixed rule-based vs machine learning

- ML: supervised vs unsupervised

• rule-based: some heuristics + special cases

- split at: lower-case letter + period + white space + upper-case letter

- no split within parentheses

- no split after known abbreviations (vs., etc., pp., …)

- no split if a sentence of fewer than 3 words would result
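
These heuristics can be combined into a small sentence splitter. This is a minimal sketch under stated assumptions: the abbreviation list holds only the three examples above, the minimum-length rule is checked only for the sentence that ends at the candidate period, and the function name is hypothetical.

```python
import re

# Only the abbreviations named above; a real list would be much longer.
KNOWN_ABBREVIATIONS = {"vs.", "etc.", "pp."}

# Candidate boundary: lower-case letter, period, white space, upper-case letter.
CANDIDATE = re.compile(r"(?<=[a-z])\.(?=\s+[A-Z])")


def split_sentences(text, min_words=3):
    """Split text at periods that pass all rule-based checks."""
    sentences = []
    start = 0
    for match in CANDIDATE.finditer(text):
        end = match.end()  # position just after the period
        candidate = text[start:end]
        words = candidate.split()
        if candidate.count("(") > candidate.count(")"):
            continue  # inside an open parenthesis
        if words[-1].lower() in KNOWN_ABBREVIATIONS:
            continue  # period belongs to a known abbreviation
        if len(words) < min_words:
            continue  # resulting sentence would be too short
        sentences.append(candidate.strip())
        start = end
    rest = text[start:].strip()
    if rest:
        sentences.append(rest)
    return sentences


text = ("He compared cats vs. Dogs in the rain. "
        "The dog ate (it was hungry. Very hungry) and slept. It rained.")
for sentence in split_sentences(text):
    print(sentence)
```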

Unsupervised period disambiguation

• unsupervised: no (manually labeled) training data

• example: Schmid 2000

• two passes over a text:

(1) gather frequencies of

• likely abbreviations and names (initials)

- sequence of characters that always appears with trailing period

• lower case words

- indicator for split when appearing capitalized after period

• words before+after numbers

- number disambiguation

(2) use information to disambiguate periods

• >99.5% accuracy on the Wall Street Journal corpus (WSJ)
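
The two-pass idea can be sketched as follows. This is a heavily simplified illustration and not Schmid's actual model: it uses hard yes/no decisions instead of probabilities, skips the number handling, and the decision rule in the second pass is an assumption made for this example.

```python
from collections import Counter


def gather_statistics(tokens):
    """Pass 1: find likely abbreviations and count lower-case words."""
    with_period, without_period, lowercase_words = Counter(), Counter(), Counter()
    for tok in tokens:
        has_period = tok.endswith(".") and len(tok) > 1
        stem = tok[:-1] if has_period else tok
        (with_period if has_period else without_period)[stem.lower()] += 1
        if stem.islower():
            lowercase_words[stem] += 1
    # Likely abbreviation: never seen without a trailing period.
    abbreviations = {w for w in with_period if without_period[w] == 0}
    return abbreviations, lowercase_words


def split_sentences(text):
    """Pass 2: use the gathered statistics to decide which periods end a sentence."""
    tokens = text.split()
    abbreviations, lowercase_words = gather_statistics(tokens)
    sentences, current = [], []
    for i, tok in enumerate(tokens):
        current.append(tok)
        if not (tok.endswith(".") and len(tok) > 1):
            continue
        following = tokens[i + 1] if i + 1 < len(tokens) else None
        if following is None:
            boundary = True
        elif not following[0].isupper():
            boundary = False
        elif tok[:-1].lower() in abbreviations:
            # After an abbreviation, split only if the next word is also seen in lower case.
            boundary = lowercase_words[following.rstrip(".").lower()] > 0
        else:
            boundary = True
        if boundary:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


print(split_sentences("Dr. Smith saw the dog. The dog saw Dr. Jones. Jones saw the dog."))
```

On very short texts a sentence-final word that never occurs elsewhere looks like an abbreviation, which is why the approach relies on frequencies gathered over a large text.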