N-grams and Probabilistic Context-free Grammars in Natural Language Processing, Study notes of Computer Science

University of San Francisco (USF), Computer Science

This document from the University of San Francisco's Department of Computer Science discusses the use of n-grams and probabilistic context-free grammars (CFGs) in natural language processing. It covers the advantages and disadvantages of IR-based and classical NLP approaches, the concept of n-grams, smoothing techniques, and applications such as segmentation and language detection. Probabilistic CFGs and their role in faster parsing and information extraction are also explored.

Typology: Study notes

Pre 2010

Uploaded on 07/30/2009

koofers-user-iuj-1 🇺🇸


Artificial Intelligence

Programming

Statistical NLP

Chris Brooks

Department of Computer Science

University of San Francisco

Outline

n-grams

Applications of n-grams

review - Context-free grammars

Probabilistic CFGs

Information Extraction

Department of Computer Science, University of San Francisco

Advantages of IR approaches

Recall that IR-based approaches use the “bag of words” model.

TFIDF is used to account for word frequency.

Takes information about common words into account.

Can deal with grammatically incorrect sentences.

Gives us a “degree of correctness”, rather than just yes or no.

Disadvantages of IR approaches

No use of structural information.

Not even co-occurrence of words

Can’t deal with synonyms or dereferencing pronouns

Very little semantic analysis.


Advantages of classical NLP

Classical NLP approaches use a parser to generate a parse tree.

This can then be used to transform knowledge into a form that can be reasoned with.

Identifies sentence structure

Easier to do semantic interpretation

Can handle anaphora, synonyms, etc.


Disadvantages of classical NLP

Doesn’t take frequency into account

No way to choose between different parses for a sentence

Can’t deal with incorrect grammar

Requires a lexicon.

Maybe we can incorporate both statistical information and structure.


n-grams

The simplest way to add structure to our IR approach is to count the occurrence not only of single tokens, but of sequences of tokens.

So far, we’ve considered words as tokens.

A token is sometimes called a gram.

An n-gram model considers the probability that a sequence of n tokens occurs in a row.

More precisely, it is the probability $P(token_i \mid token_{i-1}, token_{i-2}, \ldots, token_{i-n+1})$: each token is conditioned on the $n-1$ tokens before it.

n-grams

Our approach in assignment 3 uses 1-grams, or unigrams.

We could also choose to count bigrams, or 2-grams.

The sentence “Every good boy deserves fudge” contains the bigrams “every good”, “good boy”, “boy deserves”, “deserves fudge”.

We could continue this approach to 3-grams, or 4-grams, or 5-grams.

Longer n-grams give us more accurate information about content, since they include phrases rather than single words.

What’s the downside here?
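The bigrams listed for “Every good boy deserves fudge” can be reproduced with a sliding window over the token list. This is a minimal sketch; the helper name `ngrams` and the choice to lowercase and split on whitespace are assumptions made for illustration, not part of the assignment.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "Every good boy deserves fudge"
tokens = sentence.lower().split()  # assumption: lowercase, split on whitespace

bigrams = ngrams(tokens, 2)   # four bigrams for a five-word sentence
trigrams = ngrams(tokens, 3)  # three trigrams
```

A sentence of k tokens yields k - n + 1 n-grams, so the same function covers 3-grams, 4-grams, and so on.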

Sampling theory

We need to be able to estimate the probability of each n-gram occurring.

In assignment 3, we do this by collecting a corpus and counting the distribution of words in the corpus.

If the corpus is too small, these counts may not be reflective of an n-gram’s true frequency.

Many n-grams will not appear at all in our corpus.

For example, if we have a lexicon of 20,000 words, there are:

$20{,}000^2 = 400$ million distinct bigrams

$20{,}000^3 = 8$ trillion distinct trigrams

$20{,}000^4 = 1.6 \times 10^{17}$ distinct 4-grams
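The counts above follow from one independent choice per position, so a lexicon of V words gives V^n possible n-grams. The sketch below checks the arithmetic and shows why most n-grams go unseen; the one-billion-token corpus is a hypothetical figure chosen only for illustration.

```python
VOCABULARY_SIZE = 20_000  # lexicon size used in the notes


def distinct_ngrams(vocabulary_size, n):
    """Number of possible n-grams: one independent choice per position."""
    return vocabulary_size ** n


def max_coverage(corpus_tokens, vocabulary_size, n):
    """Upper bound on the fraction of possible n-grams a corpus can contain.

    A corpus of T tokens holds at most T - n + 1 n-grams, even if every one
    of them were different.
    """
    observed_at_most = max(corpus_tokens - n + 1, 0)
    return min(1.0, observed_at_most / distinct_ngrams(vocabulary_size, n))


for n in (2, 3, 4):
    print(n, distinct_ngrams(VOCABULARY_SIZE, n))

# Hypothetical corpus of one billion tokens (not a figure from the notes).
trigram_coverage = max_coverage(1_000_000_000, VOCABULARY_SIZE, 3)
```

Even under this generous assumption, the corpus can contain only a tiny fraction of the possible trigrams.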

Smoothing

So, when we are estimating n-gram counts from a corpus, there will be many n-grams that we never see.

This might occur in your assignment - what if there’s a word in your similarity set that’s not in the corpus?

The simplest thing to do is add-one smoothing. We start each n-gram with a count of 1, rather than zero.

Easy, but not very theoretically satisfying.
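A minimal sketch of add-one smoothing for unigrams follows. It assumes a fixed, known vocabulary, so the denominator is the number of observed tokens plus one extra count per vocabulary word. The function name and the toy corpus are illustrative only.

```python
from collections import Counter


def add_one_probabilities(tokens, vocabulary):
    """Estimate unigram probabilities with add-one smoothing.

    Every word in `vocabulary` starts with a count of 1, so a word that never
    appears in `tokens` still receives a small non-zero probability.
    """
    counts = Counter(t for t in tokens if t in vocabulary)
    total = sum(counts.values()) + len(vocabulary)  # real counts + one per word
    return {word: (counts[word] + 1) / total for word in vocabulary}


corpus = "the cat sat on the mat".split()  # toy corpus, for illustration only
vocabulary = set(corpus) | {"dog"}         # "dog" never occurs in the corpus

probs = add_one_probabilities(corpus, vocabulary)
```

Here "dog" gets probability 1/12 rather than zero, and the probabilities still sum to 1 across the vocabulary.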

Linear Interpolation Smoothing

We can also use estimates of shorter-length n-grams to help out.

Assumption: the sequence $w_1, w_2, w_3$ and the sequence $w_1, w_2$ are related.

More precisely, we want to know $P(w_3 \mid w_2, w_1)$. We count all 1-grams, 2-grams, and 3-grams.

We estimate $P(w_3 \mid w_2, w_1)$ as

$c_1 P(w_3 \mid w_2, w_1) + c_2 P(w_3 \mid w_2) + c_3 P(w_3)$

So where do we get $c_1, c_2, c_3$?

They might be fixed, based on past experience.

Or, we could learn them.
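The interpolated estimate can be sketched directly from raw counts. The weights (0.6, 0.3, 0.1) below are placeholder values for the fixed-weights case, not values from the course. The function names and the toy corpus are likewise assumptions.

```python
from collections import Counter


def train(tokens):
    """Count all 1-grams, 2-grams, and 3-grams in a token list."""
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
    return uni, bi, tri


def interpolated(w1, w2, w3, counts, c=(0.6, 0.3, 0.1)):
    """Estimate P(w3 | w2, w1) as c1*P(w3|w2,w1) + c2*P(w3|w2) + c3*P(w3).

    The weights in `c` are placeholders chosen for illustration; they should
    sum to 1 so the result remains a probability.
    """
    uni, bi, tri = counts
    total = sum(uni.values())
    p_tri = tri[(w1, w2, w3)] / bi[(w1, w2)] if bi[(w1, w2)] else 0.0
    p_bi = bi[(w2, w3)] / uni[w2] if uni[w2] else 0.0
    p_uni = uni[w3] / total if total else 0.0
    c1, c2, c3 = c
    return c1 * p_tri + c2 * p_bi + c3 * p_uni


tokens = "the cat sat on the mat the cat ran".split()  # toy corpus
counts = train(tokens)

seen = interpolated("the", "cat", "sat", counts)   # trigram occurs in the corpus
unseen = interpolated("on", "the", "cat", counts)  # trigram absent, bigram present
```

The second query shows the point of the technique: the trigram "on the cat" never occurs, yet the estimate stays above zero because the bigram and unigram terms still contribute.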

Application: segmentation

One application of n-gram models is segmentation.

Splitting a sequence of characters into tokens, or finding word boundaries.

Speech-to-text systems

Chinese and Japanese

genomic data

Documents in which other characters represent spaces.

The algorithm for doing this is called Viterbi segmentation.

(Like parsing, it’s a form of dynamic programming.)

Example

Trace for the input “cattlefish” (partially recovered):

best: [1. ... 0. ... 0.2]

words: [’c’ ’ca’ ’cat’ ’catt’ ’cattl’ ’cattle’ ’cattlef’ ’cattlefi’ ’cattlefis’ ’fish’]

i = 10

push ’fish’ onto result, i = i - 4

push ’cattle’ onto result, i = 0

What’s going on here?

The Viterbi algorithm is searching through the space of all combinations of substrings.

States with high probability mass are pursued.

The ’best’ array is used to prevent the algorithm from repeatedly expanding portions of the search space.

This is an example of dynamic programming (like chart parsing).
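The following is a minimal sketch of Viterbi segmentation with a unigram word model, using `best` and `words` arrays in the sense described above. It is not the course's own code: the word probabilities are invented for the example, and unknown substrings are simply given probability zero, which is an assumption.

```python
def viterbi_segment(text, word_prob, max_word_len=20):
    """Split `text` into the most probable sequence of words.

    best[i]  = probability of the best segmentation of text[:i]
    words[i] = the last word in that best segmentation
    """
    n = len(text)
    best = [1.0] + [0.0] * n
    words = [""] * (n + 1)

    # Fill the tables left to right, reusing best[j] for every shorter prefix.
    for i in range(1, n + 1):
        for j in range(max(0, i - max_word_len), i):
            candidate = text[j:i]
            p = best[j] * word_prob.get(candidate, 0.0)
            if p > best[i]:
                best[i] = p
                words[i] = candidate

    if best[n] == 0.0:
        return [], 0.0  # no segmentation is possible with the known words

    # Walk backwards from the end, pushing each chosen word onto the result.
    result = []
    i = n
    while i > 0:
        result.append(words[i])
        i -= len(words[i])
    result.reverse()
    return result, best[n]


# Hypothetical unigram probabilities, invented for this example.
word_prob = {"cat": 0.1, "cattle": 0.05, "fish": 0.08, "tle": 0.0001, "a": 0.2}

segments, probability = viterbi_segment("cattlefish", word_prob)
```

With these made-up probabilities the result is "cattle" followed by "fish". The alternative "cat" + "tle" + "fish" is considered too, but loses because its product is smaller. Each `best[j]` is computed once and reused, which is where the dynamic programming comes in.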

Application: language detection

n-grams have also been successfully used to detect the language a document is in.

Approach: consider letters as tokens, rather than words.

Gather a corpus in a variety of different languages (Wikipedia works well here.)

Process the documents, and count all two-grams.

Estimate probabilities for language L with $\frac{\text{count}}{\#\text{ of 2-grams}}$. Call this $P_L$.

Assumption: different languages have characteristic two-grams.

Application: language detection

To classify a document by language:

Find all two-grams in the document. Call this set T.

For each language L, the likelihood that the document is of language L is:

$P_L(t_1) \times P_L(t_2) \times \ldots \times P_L(t_n)$

The language with the highest likelihood is the most probable language.

(this is a form of Bayesian inference - we’ll spend more time on this later in the semester.)
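The classification steps above can be sketched as follows. Two choices here go beyond the notes and are assumptions: logs are summed rather than multiplying raw probabilities, to avoid numerical underflow, and add-one smoothing is applied so that an unseen two-gram does not zero out the product. The tiny training strings stand in for a real corpus.

```python
import math
from collections import Counter


def letter_bigrams(text):
    """All overlapping two-letter sequences in `text`, lowercased."""
    text = text.lower()
    return [text[i:i + 2] for i in range(len(text) - 1)]


def train(corpus_text):
    """Return bigram counts and the total number of bigrams for one language."""
    counts = Counter(letter_bigrams(corpus_text))
    return counts, sum(counts.values())


def log_likelihood(document, model):
    """log of P_L(t_1) * P_L(t_2) * ... * P_L(t_n) for the document's bigrams.

    Assumptions beyond the notes: logs are summed to avoid underflow, and
    add-one smoothing keeps unseen bigrams from zeroing out the product.
    """
    counts, total = model
    vocab = len(counts) + 1  # +1 reserves mass for any unseen bigram
    return sum(
        math.log((counts[t] + 1) / (total + vocab))
        for t in letter_bigrams(document)
    )


def classify(document, models):
    """Pick the language whose model gives the document the highest likelihood."""
    return max(models, key=lambda lang: log_likelihood(document, models[lang]))


# Tiny stand-in corpora; a real system would train on far more text.
models = {
    "english": train("the quick brown fox jumps over the lazy dog and the cat"),
    "spanish": train("el rapido zorro marron salta sobre el perro perezoso y el gato"),
}

guess = classify("the dog and the fox", models)
```

Because the logarithm is monotonic, the language with the highest summed log-probability is also the one with the highest product.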

Going further

n-grams and segmentation provide some interesting ideas:

We can combine structure with statistical knowledge.

Probabilities can be used to help guide search

Probabilities can help a parser choose between different outcomes.

But, no structure used apart from collocation.

Maybe we can apply these ideas to grammars.

Reminder: CFGs

Recall context-free grammars from the last lecture

Single non-terminal on the left, anything on the right.

S -> NP VP

VP -> Verb | Verb PP

Verb -> ’run’ | ’sleep’

We can construct sentences that have more than one legal parse.

“Squad helps dog bite victim”

CFGs don’t give us any information about which parse to select.

Probabilistic CFGs

A probabilistic CFG is just a regular CFG with probabilities attached to the right-hand sides of rules.

They have to sum up to 1 across all rules with the same left-hand side.

They indicate how often a particular non-terminal derives that right-hand side.

Example

S -> NP VP (1.0)

PP -> P NP (1.0)

VP -> V NP (0.7)

VP -> VP PP (0.3)

P -> with (1.0)

V -> saw (1.0)

NP -> NP PP (0.4)

NP -> astronomers (0.1)

NP -> stars (0.18)

NP -> saw

NP -> ears (0.18)

NP -> telescopes

Disambiguation

The probability of a parse tree being correct is just the product of each rule in the tree being derived.

This lets us compare two parses and say which is more likely.

Parse 1: [S (1.0) [NP (0.1) astronomers] [VP (0.7) [V (1.0) saw] [NP (0.4) [NP (0.18) stars] [PP (1.0) [P (1.0) with] [NP (0.18) ears]]]]]

Parse 2: [S (1.0) [NP (0.1) astronomers] [VP (0.3) [VP (0.7) [V (1.0) saw] [NP (0.18) stars]] [PP (1.0) [P (1.0) with] [NP (0.18) ears]]]]

P1 = 1.0 * 0.1 * 0.7 * 1.0 * 0.4 * 0.18 * 1.0 * 1.0 * 0.18 = 0.0009072

P2 = 1.0 * 0.1 * 0.3 * 0.7 * 1.0 * 1.0 * 0.18 * 1.0 * 0.18 = 0.0006804
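The product over rules can be computed by walking each parse tree. In this sketch the rule probabilities are the ones attached to the nodes of the two trees for “astronomers saw stars with ears”. The nested-tuple tree encoding and the function name are assumptions made for illustration.

```python
# Rule probabilities used in the two parse trees (taken from the slide).
RULES = {
    ("S", ("NP", "VP")): 1.0,
    ("PP", ("P", "NP")): 1.0,
    ("VP", ("V", "NP")): 0.7,
    ("VP", ("VP", "PP")): 0.3,
    ("NP", ("NP", "PP")): 0.4,
    ("P", ("with",)): 1.0,
    ("V", ("saw",)): 1.0,
    ("NP", ("astronomers",)): 0.1,
    ("NP", ("stars",)): 0.18,
    ("NP", ("ears",)): 0.18,
}


def tree_probability(tree):
    """Multiply the probabilities of every rule used in a parse tree.

    A tree is (label, child, child, ...); a leaf child is a plain string.
    """
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = RULES[(label, rhs)]
    for child in children:
        if not isinstance(child, str):
            p *= tree_probability(child)
    return p


pp = ("PP", ("P", "with"), ("NP", "ears"))

# Parse 1: the PP attaches to the noun phrase "stars".
parse1 = ("S", ("NP", "astronomers"),
          ("VP", ("V", "saw"), ("NP", ("NP", "stars"), pp)))

# Parse 2: the PP attaches to the verb phrase "saw stars".
parse2 = ("S", ("NP", "astronomers"),
          ("VP", ("VP", ("V", "saw"), ("NP", "stars")), pp))

p1 = tree_probability(parse1)
p2 = tree_probability(parse2)
```

Up to floating-point rounding, `p1` is 0.0009072 and `p2` is 0.0006804, so the first parse (stars with ears) is the more likely one under this grammar.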

Faster Parsing

We can also use probabilities to speed up parsing.

Recall that both top-down and chart parsing proceed in a primarily depth-first fashion.

They choose a rule to apply, and based on its right-hand side, they choose another rule.

Probabilities can be used to better select which rule to apply, or which branch of the search tree to follow.

This is a form of best-first search.

Information Extraction

An increasingly common application of parsing is information extraction.

This is the process of creating structured information (database or knowledge base entries) from unstructured text.

Example:

Suppose we want to build a price comparison agent that can visit sites on the web and find the best deals on flatscreen TVs?

Suppose we want to build a database about video games. We might do this by hand, or we could write a program that could parse Wikipedia pages and insert knowledge such as madeBy(Blizzard,WorldOfWarcraft) into a knowledge base.

Extracting specific information

A program that fetches HTML pages and extracts specific information is called a scraper.

Simple scrapers can be built with regular expressions.

For example, prices typically have a dollar sign, some digits, a period, and two digits.

\$[0-9]+\.[0-9]{2}

This approach will work, but it has several limitations

Can only handle simple extractions

Brittle and page specific
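A minimal sketch of the price pattern using Python's `re` module follows. The dollar sign and the period are escaped, since both are special characters in a regular expression. The HTML fragment is made up for the example.

```python
import re

# A dollar sign, one or more digits, a literal period, then exactly two digits.
# "$" and "." are escaped because both are special characters in a regex.
PRICE = re.compile(r"\$[0-9]+\.[0-9]{2}")

# Hypothetical HTML fragment, made up for this example.
html = '<li>42" flatscreen TV <b>$499.99</b></li><li>Wall mount $25.00</li>'

prices = PRICE.findall(html)

# A price written with a thousands separator is missed entirely.
missed = PRICE.findall("Home theater bundle: $1,299.99")
```

`prices` holds the two matches, `$499.99` and `$25.00`, while `missed` is empty. That illustrates the brittleness: any variation in how a page formats its prices needs its own pattern.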
