N-grams and Probabilistic Context-free Grammars in Natural Language Processing, Study notes of Computer Science

These notes from the University of San Francisco's Department of Computer Science discuss the use of n-grams and probabilistic context-free grammars (PCFGs) in natural language processing. They cover the advantages and disadvantages of IR-based and classical NLP approaches, the concept of n-grams, smoothing techniques, and applications such as segmentation and language detection. Probabilistic CFGs and their role in faster parsing and information extraction are also explored.

Typology: Study notes

Pre 2010

Uploaded on 07/30/2009

Artificial Intelligence Programming
Statistical NLP
Chris Brooks
Department of Computer Science
University of San Francisco
Outline
n-grams
Applications of n-grams
review - Context-free grammars
Probabilistic CFGs
Information Extraction
Department of Computer Science, University of San Francisco
Advantages of IR approaches
Recall that IR-based approaches use the “bag of words” model.
TF-IDF is used to account for word frequency.
Takes information about common words into account.
Can deal with grammatically incorrect sentences.
Gives us a “degree of correctness”, rather than just yes or no.
Disadvantages of IR approaches
No use of structural information.
Not even co-occurrence of words
Can’t deal with synonyms or dereferencing pronouns
Very little semantic analysis.
Advantages of classical NLP
Classical NLP approaches use a parser to generate a parse tree.
This can then be used to transform knowledge into a form that can be reasoned with.
Identifies sentence structure
Easier to do semantic interpretation
Can handle anaphora, synonyms, etc.
Disadvantages of classical NLP
Doesn’t take frequency into account
No way to choose between different parses for a sentence
Can’t deal with incorrect grammar
Requires a lexicon.
Maybe we can incorporate both statistical information and structure.

n-grams
The simplest way to add structure to our IR approach is to count the occurrence not only of single tokens, but of sequences of tokens.
So far, we’ve considered words as tokens.
A token is sometimes called a gram.
An n-gram model considers the probability that a sequence of n tokens occurs in a row.
More precisely, it is the probability of a token given the n-1 tokens before it: P(token_i | token_{i-1}, token_{i-2}, ..., token_{i-n+1})

n-grams
Our approach in assignment 3 uses 1-grams, or unigrams.
We could also choose to count bigrams, or 2-grams.
The sentence “Every good boy deserves fudge” contains the bigrams “every good”, “good boy”, “boy deserves”, “deserves fudge”.
We could continue this approach to 3-grams, or 4-grams, or 5-grams.
Longer n-grams give us more accurate information about content, since they include phrases rather than single words.
What’s the downside here?
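
As a rough illustration of the bigram example, here is a minimal Python sketch. The helper name `ngrams` and the choice to lowercase and split on whitespace are assumptions made for this example, not part of the assignment code.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Tokenize naively: lowercase, split on whitespace (an assumption for this sketch).
sentence = "Every good boy deserves fudge"
tokens = sentence.lower().split()

bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

for gram in bigrams:
    print(" ".join(gram))
```

A sentence of k tokens contains k - n + 1 n-grams, so the five-word sentence yields four bigrams and three trigrams.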

Sampling theory
We need to be able to estimate the probability of each n-gram occurring.
In assignment 3, we do this by collecting a corpus and counting the distribution of words in the corpus.
If the corpus is too small, these counts may not be reflective of an n-gram’s true frequency.
Many n-grams will not appear at all in our corpus.
For example, if we have a lexicon of 20,000 words, there are:
20,000^2 = 400 million distinct bigrams
20,000^3 = 8 trillion distinct trigrams
20,000^4 = 1.6 × 10^17 distinct 4-grams
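
The sparsity problem can be seen even on a tiny corpus. The sketch below first reproduces the 20,000-word arithmetic, then estimates bigram probabilities by relative frequency. The toy corpus and the names `bigram_probability`, `seen_types` and `possible_types` are assumptions for illustration only.

```python
from collections import Counter

# How many distinct n-grams are possible with a 20,000-word lexicon?
LEXICON_SIZE = 20_000
for n in (2, 3, 4):
    print(f"possible {n}-grams: {LEXICON_SIZE ** n:.1e}")

# Toy corpus (an assumption for this sketch).
corpus = "the cat sat on the mat the cat ate".split()

# Count each pair of adjacent tokens.
bigram_counts = Counter(zip(corpus, corpus[1:]))
total_bigrams = sum(bigram_counts.values())

def bigram_probability(first, second):
    """Relative-frequency estimate: count of this bigram over all bigrams seen."""
    return bigram_counts[(first, second)] / total_bigrams

seen_types = len(bigram_counts)            # distinct bigrams observed
possible_types = len(set(corpus)) ** 2     # distinct bigrams the vocabulary allows

print(seen_types, "of", possible_types, "possible bigrams were observed")
print(bigram_probability("the", "cat"))    # seen twice
print(bigram_probability("cat", "the"))    # never seen, so the estimate is zero
```

Any bigram missing from the corpus gets probability zero under this estimate, which is the problem smoothing addresses.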

Smoothing
So, when we are estimating n-gram counts from a corpus, there will be many n-grams that we never see.
This might occur in your assignment - what if there’s a word in your similarity set that’s not in the corpus?
The simplest thing to do is add-one smoothing. We start each n-gram with a count of 1, rather than zero.
Easy, but not very theoretically satisfying.
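
A minimal sketch of add-one smoothing for unigrams, assuming a toy corpus and a vocabulary that contains one word ("dog") the corpus never mentions. Starting every count at 1 means the denominator must grow by the number of vocabulary entries so that the probabilities still sum to 1. The function name `add_one_probability` is hypothetical.

```python
from collections import Counter

def add_one_probability(item, counts, num_types):
    """Add-one estimate: (count + 1) / (total count + number of possible items)."""
    total = sum(counts.values())
    return (counts[item] + 1) / (total + num_types)

# Toy data (assumptions for this sketch).
corpus = "the cat sat on the mat the cat ate".split()
counts = Counter(corpus)
vocabulary = set(corpus) | {"dog"}   # "dog" is in the lexicon but not the corpus

p_the = add_one_probability("the", counts, len(vocabulary))
p_dog = add_one_probability("dog", counts, len(vocabulary))
p_sum = sum(add_one_probability(w, counts, len(vocabulary)) for w in vocabulary)

print(p_the, p_dog, p_sum)
```

The unseen word now gets a small non-zero probability, at the cost of taking some probability mass away from the words that were actually observed.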

Linear Interpolation Smoothing
We can also use estimates of shorter-length n-grams to help out.
Assumption: the sequence w_1, w_2, w_3 and the sequence w_1, w_2 are related.
More precisely, we want to know P(w_3 | w_2, w_1).
We count all 1-grams, 2-grams, and 3-grams.
We estimate P(w_3 | w_2, w_1) as c_1 P(w_3 | w_2, w_1) + c_2 P(w_3 | w_2) + c_3 P(w_3)
So where do we get c_1, c_2, c_3?
They might be fixed, based on past experience.
Or, we could learn them.
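
The interpolation formula can be sketched as follows. The toy corpus, the helper names and the fixed weights (0.6, 0.3, 0.1) are assumptions chosen for illustration. The weights are picked to sum to 1 so that the result remains a probability, and each term on the right-hand side is a plain relative-frequency estimate from the counts.

```python
from collections import Counter

def train(tokens):
    """Count all 1-grams, 2-grams and 3-grams in a token list."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
    return unigrams, bigrams, trigrams

def ratio(numerator, denominator):
    """Relative frequency, defined as 0 when the context was never seen."""
    return numerator / denominator if denominator else 0.0

def interpolated(w1, w2, w3, model, weights=(0.6, 0.3, 0.1)):
    """Estimate P(w3 | w2, w1) as c1*P(w3|w2,w1) + c2*P(w3|w2) + c3*P(w3)."""
    unigrams, bigrams, trigrams = model
    c1, c2, c3 = weights
    p_trigram = ratio(trigrams[(w1, w2, w3)], bigrams[(w1, w2)])
    p_bigram = ratio(bigrams[(w2, w3)], unigrams[w2])
    p_unigram = ratio(unigrams[w3], sum(unigrams.values()))
    return c1 * p_trigram + c2 * p_bigram + c3 * p_unigram

model = train("the cat sat on the mat the cat ate".split())
p_seen = interpolated("the", "cat", "sat", model)    # trigram appears in the corpus
p_unseen = interpolated("on", "the", "ate", model)   # trigram and bigram never appear

print(p_seen, p_unseen)
```

The unseen trigram still receives a non-zero estimate because the unigram term contributes. Learning the weights instead of fixing them would typically use held-out data, which this sketch does not attempt.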

Application: segmentation
One application of n-gram models is segmentation.
Splitting a sequence of characters into tokens, or finding word boundaries.
Speech-to-text systems
Chinese and Japanese
Genomic data
Documents in which some other character represents a space.
The algorithm for doing this is called Viterbi segmentation.
(Like parsing, it’s a form of dynamic programming.)

Example
Viterbi segmentation of the input ’cattlefish’ (the trace on this slide is only partly recoverable):
best: [1.0 ...] (the remaining values did not survive)
words: [’c’, ’ca’, ’cat’, ’catt’, ’cattl’, ’cattle’, ’cattlef’, ’cattlefi’, ’cattlefis’, ’fish’]
Push ’fish’ onto result.
Push ’cattle’ onto result.

What’s going on here?
The Viterbi algorithm is searching through the space of all combinations of substrings.
States with high probability mass are pursued.
The ’best’ array is used to prevent the algorithm from repeatedly expanding portions of the search space.
This is an example of dynamic programming (like chart parsing).
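
The following is a minimal sketch of Viterbi segmentation with a unigram word model, not the code used in the lecture. The word probabilities, the fixed probability for unknown words and the names `viterbi_segment` and `word_probability` are all assumptions. `best[i]` holds the probability of the best segmentation of the first i characters, and `words[i]` holds the last word of that segmentation.

```python
# Toy unigram probabilities (assumptions for this sketch).
UNIGRAM_PROBABILITY = {"cat": 0.1, "cattle": 0.05, "fish": 0.2}
UNKNOWN_PROBABILITY = 1e-10   # assumed tiny probability for any unlisted string

def word_probability(word):
    return UNIGRAM_PROBABILITY.get(word, UNKNOWN_PROBABILITY)

def viterbi_segment(text, word_prob):
    n = len(text)
    best = [1.0] + [0.0] * n    # best[i]: probability of best split of text[:i]
    words = [""] * (n + 1)      # words[i]: final word of that best split
    for i in range(1, n + 1):
        for j in range(i):
            candidate = text[j:i]
            # Reuse the stored answer for text[:j] instead of re-searching it.
            p = best[j] * word_prob(candidate)
            if p > best[i]:
                best[i] = p
                words[i] = candidate
    # Walk backwards from the end, pushing each final word onto the result.
    result = []
    i = n
    while i > 0:
        result.append(words[i])
        i -= len(words[i])
    result.reverse()
    return result, best[n]

segments, probability = viterbi_segment("cattlefish", word_probability)
print(segments, probability)
```

Because each `best[j]` is computed once and then reused, the work is quadratic in the length of the text rather than exponential in the number of possible split points.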

Application: language detection
n-grams have also been successfully used to detect the language a document is in.
Approach: consider letters as tokens, rather than words.
Gather a corpus in a variety of different languages (Wikipedia works well here).
Process the documents, and count all two-grams.
Estimate the probability of each two-gram t in language L with count(t) / (# of two-grams). Call this P_L(t).
Assumption: different languages have characteristic two-grams.

Application: language detection
To classify a document by language:
Find all two-grams in the document. Call this set T.
For each language L, the likelihood that the document is of language L is: P_L(t_1) × P_L(t_2) × ... × P_L(t_n)
The language with the highest likelihood is the most probable language.
(This is a form of Bayesian inference - we’ll spend more time on this later in the semester.)
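
A small sketch of the classifier, under several assumptions: the two one-sentence training texts stand in for a real corpus, add-one smoothing is swapped in so that an unseen two-gram does not force the whole product to zero, and log-probabilities are summed instead of multiplying raw probabilities, which picks the same winner while avoiding underflow. All function names are hypothetical.

```python
import math
from collections import Counter

def two_grams(text):
    """All letter two-grams in the text, including spaces."""
    text = text.lower()
    return [text[i:i + 2] for i in range(len(text) - 1)]

def train(corpora):
    """corpora maps a language name to training text."""
    counts = {lang: Counter(two_grams(text)) for lang, text in corpora.items()}
    types = set()
    for language_counts in counts.values():
        types.update(language_counts)
    # One extra type reserves probability for two-grams never seen in training.
    return counts, len(types) + 1

def log_likelihood(document, language_counts, num_types):
    """Sum of log P_L(t) over the set T of two-grams in the document."""
    total = sum(language_counts.values())
    return sum(
        math.log((language_counts[t] + 1) / (total + num_types))
        for t in set(two_grams(document))
    )

def classify(document, model):
    counts, num_types = model
    return max(counts, key=lambda lang: log_likelihood(document, counts[lang], num_types))

# Tiny stand-in corpora (assumptions for this sketch).
model = train({
    "english": "the quick brown fox jumps over the lazy dog and then the other thing",
    "spanish": "el rapido zorro marron salta sobre el perro perezoso y luego la otra cosa",
})

print(classify("the dog and the fox", model))
print(classify("el perro y el zorro", model))
```

With real data the training texts would be far larger, and the set of languages would be whatever the corpus covers.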

Going further
n-grams and segmentation provide some interesting ideas:
We can combine structure with statistical knowledge.
Probabilities can be used to help guide search.
Probabilities can help a parser choose between different outcomes.
But, no structure used apart from co-location.
Maybe we can apply these ideas to grammars.

Reminder: CFGs
Recall context-free grammars from the last lecture.
Single non-terminal on the left, anything on the right.
S -> NP VP
VP -> Verb | Verb PP
Verb -> ’run’ | ’sleep’
We can construct sentences that have more than one legal parse.
“Squad helps dog bite victim”
CFGs don’t give us any information about which parse to select.

Probabilistic CFGs
A probabilistic CFG is just a regular CFG with probabilities attached to the right-hand sides of rules.
For each non-terminal, the probabilities of its rules have to sum to 1.
They indicate how often a particular non-terminal derives that right-hand side.

Example
S -> NP VP (1.0)
PP -> P NP (1.0)
VP -> V NP (0.7)
VP -> VP PP (0.3)
P -> with (1.0)
V -> saw (1.0)
NP -> NP PP (0.4)
NP -> astronomers (0.1)
NP -> stars (0.18)
NP -> saw
NP -> ears (0.18)
NP -> telescopes

Disambiguation
The probability of a parse tree being correct is just the product of the probabilities of each rule used to derive the tree.
This lets us compare two parses and say which is more likely.

Parse 1:
S (1.0)
  NP (0.1) astronomers
  VP (0.7)
    V (1.0) saw
    NP (0.4)
      NP (0.18) stars
      PP (1.0)
        P (1.0) with
        NP (0.18) ears

Parse 2:
S (1.0)
  NP (0.1) astronomers
  VP (0.3)
    VP (0.7)
      V (1.0) saw
      NP (0.18) stars
    PP (1.0)
      P (1.0) with
      NP (0.18) ears

P1 = 1.0 * 0.1 * 0.7 * 1.0 * 0.4 * 0.18 * 1.0 * 1.0 * 0.18 = 0.0009072
P2 = 1.0 * 0.1 * 0.3 * 0.7 * 1.0 * 1.0 * 0.18 * 1.0 * 0.18 = 0.0006804
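
The product-of-rules calculation can be sketched in a few lines of Python. The rule probabilities are the ones shown on the two parse trees for “astronomers saw stars with ears”. The nested-tuple tree encoding and the names `RULE_PROBABILITY` and `tree_probability` are assumptions made for this example.

```python
# Rule probabilities taken from the two parse trees; keys are (left side, right side).
RULE_PROBABILITY = {
    ("S", ("NP", "VP")): 1.0,
    ("PP", ("P", "NP")): 1.0,
    ("VP", ("V", "NP")): 0.7,
    ("VP", ("VP", "PP")): 0.3,
    ("P", ("with",)): 1.0,
    ("V", ("saw",)): 1.0,
    ("NP", ("NP", "PP")): 0.4,
    ("NP", ("astronomers",)): 0.1,
    ("NP", ("stars",)): 0.18,
    ("NP", ("ears",)): 0.18,
}

def tree_probability(tree):
    """A tree is (label, child, ...); a child is either a subtree or a word string."""
    label, *children = tree
    right_side = tuple(c if isinstance(c, str) else c[0] for c in children)
    probability = RULE_PROBABILITY[(label, right_side)]
    for child in children:
        if not isinstance(child, str):
            probability *= tree_probability(child)
    return probability

# Parse 1: "with ears" attaches to "stars".
parse1 = ("S", ("NP", "astronomers"),
               ("VP", ("V", "saw"),
                      ("NP", ("NP", "stars"),
                             ("PP", ("P", "with"), ("NP", "ears")))))

# Parse 2: "with ears" attaches to the verb phrase.
parse2 = ("S", ("NP", "astronomers"),
               ("VP", ("VP", ("V", "saw"), ("NP", "stars")),
                      ("PP", ("P", "with"), ("NP", "ears"))))

p1 = tree_probability(parse1)
p2 = tree_probability(parse2)
print(p1, p2, "parse 1 preferred" if p1 > p2 else "parse 2 preferred")
```

The two trees share most of their rules, so the comparison comes down to NP -> NP PP (0.4) against VP -> VP PP (0.3).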

Faster Parsing
We can also use probabilities to speed up parsing.
Recall that both top-down and chart parsing proceed in a primarily depth-first fashion.
They choose a rule to apply, and based on its right-hand side, they choose another rule.
Probabilities can be used to better select which rule to apply, or which branch of the search tree to follow.
This is a form of best-first search.

Information Extraction
An increasingly common application of parsing is information extraction.
This is the process of creating structured information (database or knowledge base entries) from unstructured text.
Example:
Suppose we want to build a price comparison agent that can visit sites on the web and find the best deals on flatscreen TVs.
Suppose we want to build a database about video games. We might do this by hand, or we could write a program that could parse Wikipedia pages and insert knowledge such as madeBy(Blizzard, WorldOfWarcraft) into a knowledge base.

Extracting specific information
A program that fetches HTML pages and extracts specific information is called a scraper.
Simple scrapers can be built with regular expressions.
For example, prices typically have a dollar sign, some digits, a period, and two digits.
\$[0-9]+\.[0-9]{2}
This approach will work, but it has several limitations:
Can only handle simple extractions
Brittle and page specific
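
Here is a minimal sketch of the price pattern using Python's `re` module. In regular-expression syntax the dollar sign and the period are metacharacters, so the pattern below escapes both. The sample HTML string and the name `extract_prices` are made up for illustration, and no page is actually fetched.

```python
import re

# A literal dollar sign, one or more digits, a literal period, exactly two digits.
PRICE_PATTERN = re.compile(r"\$[0-9]+\.[0-9]{2}")

def extract_prices(html):
    """Return every substring of the page that looks like a price."""
    return PRICE_PATTERN.findall(html)

# Hypothetical page fragment (an assumption for this sketch).
sample = '<li>42" flatscreen TV: <b>$499.99</b></li><li>Was $1,299.00</li>'

prices = extract_prices(sample)
print(prices)

# Without the escapes, "$" means "end of input", so nothing can match after it.
unescaped = re.findall(r"$[0-9]+.[0-9]{2}", sample)
print(unescaped)
```

The second price is missed because of its thousands separator, which is a small example of how brittle this kind of extraction can be.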