Artificial Intelligence Programming
Statistical NLP
Chris Brooks
Department of Computer Science
University of San Francisco

Outline
n-grams
  Applications of n-grams
Review: context-free grammars
Probabilistic CFGs
Information Extraction
Department of Computer Science — University of San Francisco – p.1/

Advantages of IR approaches
Recall that IR-based approaches use the “bag of words” model.
TFIDF is used to account for word frequency.
  Takes information about common words into account.
  Can deal with grammatically incorrect sentences.
  Gives us a “degree of correctness”, rather than just yes or no.

Disadvantages of IR approaches
No use of structural information.
  Not even co-occurrence of words.
Can’t deal with synonyms or dereferencing pronouns.
Very little semantic analysis.

Advantages of classical NLP
Classical NLP approaches use a parser to generate a parse tree.
This can then be used to transform knowledge into a form that can be reasoned with.
  Identifies sentence structure.
  Easier to do semantic interpretation.
  Can handle anaphora, synonyms, etc.

Disadvantages of classical NLP
Doesn’t take frequency into account.
No way to choose between different parses for a sentence.
Can’t deal with incorrect grammar.
Requires a lexicon.
Maybe we can incorporate both statistical information and structure.

n-grams
The simplest way to add structure to our IR approach is to count the occurrence not only of single tokens, but of sequences of tokens.
So far, we’ve considered words as tokens.
A token is sometimes called a gram.
An n-gram model considers the probability that a sequence of n tokens occurs in a row.
More precisely, it is the probability of a token given the n-1 tokens that precede it:
  $P(token_i \mid token_{i-1}, token_{i-2}, \ldots, token_{i-n+1})$

n-grams
Our approach in assignment 3 uses 1-grams, or unigrams.
We could also choose to count bigrams, or 2-grams.
The sentence “Every good boy deserves fudge” contains the bigrams “every good”, “good boy”, “boy deserves”, “deserves fudge”.
We could continue this approach to 3-grams, or 4-grams, or 5-grams.
Longer n-grams give us more accurate information about content, since they include phrases rather than single words.
What’s the downside here?
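
The bigram example above can be reproduced in a few lines. This is a minimal sketch, assuming whitespace tokenization and lowercasing; the helper name ngrams is made up for illustration and is not taken from the assignment code.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple (hypothetical helper)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "every good boy deserves fudge".split()

bigrams = ngrams(tokens, 2)
bigram_counts = Counter(bigrams)   # n-gram -> number of occurrences

for gram in bigrams:
    print(" ".join(gram))
```

Changing the second argument to 3 yields the three trigrams of the same sentence, which hints at the downside: every longer n-gram is seen less often.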

Sampling theory
We need to be able to estimate the probability of each n-gram occurring.
  In assignment 3, we do this by collecting a corpus and counting the distribution of words in the corpus.
  If the corpus is too small, these counts may not be reflective of an n-gram’s true frequency.
Many n-grams will not appear at all in our corpus.
For example, if we have a lexicon of 20,000 words, there are:
  $20{,}000^2 = 400$ million distinct bigrams
  $20{,}000^3 = 8$ trillion distinct trigrams
  $20{,}000^4 = 1.6 \times 10^{17}$ distinct 4-grams

Smoothing
So, when we are estimating n-gram counts from a corpus, there will be many n-grams that we never see.
  This might occur in your assignment: what if there’s a word in your similarity set that’s not in the corpus?
The simplest thing to do is add-one smoothing. We start each n-gram with a count of 1, rather than zero.
  Easy, but not very theoretically satisfying.
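
Add-one smoothing can be sketched as follows. Giving every possible n-gram a starting count of 1 means the denominator grows by the number of possible n-grams. The tiny corpus is invented for illustration, the function name is hypothetical, and the lexicon size of 20,000 is borrowed from the sampling example.

```python
from collections import Counter

def add_one_probability(gram, counts, num_possible):
    """Estimate P(gram) when every possible n-gram starts with a count of 1."""
    total = sum(counts.values())
    return (counts[gram] + 1) / (total + num_possible)

corpus = "the cat sat on the mat".split()   # toy corpus, assumed for illustration
counts = Counter(corpus)                    # unigram counts
lexicon_size = 20000                        # number of possible unigrams

seen = add_one_probability("the", counts, lexicon_size)     # (2 + 1) / (6 + 20000)
unseen = add_one_probability("dog", counts, lexicon_size)   # (0 + 1) / (6 + 20000)
print(seen, unseen)
```

The unseen word now gets a small non-zero probability instead of zero, at the cost of taking probability mass away from words that were actually observed.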

Linear Interpolation Smoothing
We can also use estimates of shorter-length n-grams to help out.
  Assumption: the sequence $w_1, w_2, w_3$ and the sequence $w_1, w_2$ are related.
More precisely, we want to know $P(w_3 \mid w_2, w_1)$.
  We count all 1-grams, 2-grams, and 3-grams.
We estimate $P(w_3 \mid w_2, w_1)$ as
  $c_1 P(w_3 \mid w_2, w_1) + c_2 P(w_3 \mid w_2) + c_3 P(w_3)$
where each probability on the right is the raw estimate taken from the counts.
So where do we get $c_1, c_2, c_3$?
  They might be fixed, based on past experience.
  Or, we could learn them.
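
The interpolated estimate can be sketched as below. The weights (0.6, 0.3, 0.1) are arbitrary placeholders rather than recommended values; they are chosen to sum to 1 so the result is still a probability. The toy corpus and the function name are assumptions for illustration.

```python
from collections import Counter

def interpolated_trigram(w1, w2, w3, uni, bi, tri, c=(0.6, 0.3, 0.1)):
    """Estimate P(w3 | w2, w1) as c1*P(w3|w2,w1) + c2*P(w3|w2) + c3*P(w3)."""
    c1, c2, c3 = c
    total = sum(uni.values())
    # Raw relative-frequency estimates; fall back to 0 when the context is unseen.
    p_tri = tri[(w1, w2, w3)] / bi[(w1, w2)] if bi[(w1, w2)] else 0.0
    p_bi = bi[(w2, w3)] / uni[w2] if uni[w2] else 0.0
    p_uni = uni[w3] / total if total else 0.0
    return c1 * p_tri + c2 * p_bi + c3 * p_uni

tokens = "the cat sat on the mat and the cat ran".split()
uni = Counter(tokens)
bi = Counter(zip(tokens, tokens[1:]))
tri = Counter(zip(tokens, tokens[1:], tokens[2:]))

p_seen = interpolated_trigram("the", "cat", "sat", uni, bi, tri)
p_unseen = interpolated_trigram("on", "cat", "sat", uni, bi, tri)
print(p_seen, p_unseen)
```

The trigram "on cat sat" never occurs in the toy corpus, yet it still receives a non-zero estimate because the bigram "cat sat" and the unigram "sat" were observed.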

Application: segmentation
One application of n-gram models is segmentation:
splitting a sequence of characters into tokens, or finding word boundaries.
  Speech-to-text systems
  Chinese and Japanese
  Genomic data
  Documents with other characters representing spaces
The algorithm for doing this is called Viterbi segmentation.
  (Like parsing, it’s a form of dynamic programming.)
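
Example
[Trace of Viterbi segmentation on the input ’cattlefish’: the best array (starting at 1.0) and the words array (’c’, ’ca’, ’cat’, ’catt’, ’cattl’, ’cattle’, ’cattlef’, ’cattlefi’, ..., ’fish’). Walking back from position 10, the algorithm pushes ’fish’ onto the result, then ’cattle’, and stops at position 0.]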

What’s going on here?
The Viterbi algorithm is searching through the space of all combinations of substrings.
  States with high probability mass are pursued.
The ’best’ array is used to prevent the algorithm from repeatedly expanding portions of the search space.
This is an example of dynamic programming (like chart parsing).
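
A minimal sketch of Viterbi segmentation with unigram word probabilities follows. The word probabilities are invented for illustration, and giving every unknown substring the same tiny probability (unknown_prob) is an assumption of this sketch, not something the slides specify.

```python
def viterbi_segment(text, word_prob, unknown_prob=1e-10):
    """Return (words, probability) for the most probable segmentation of text."""
    n = len(text)
    best = [1.0] + [0.0] * n   # best[i]: probability of the best split of text[:i]
    words = [""] * (n + 1)     # words[i]: last word of that best split
    for i in range(1, n + 1):
        for j in range(i):
            word = text[j:i]
            # Reuse best[j] instead of re-deriving the best split of text[:j].
            p = word_prob.get(word, unknown_prob) * best[j]
            if p >= best[i]:
                best[i] = p
                words[i] = word
    # Walk back from the end, pushing each word onto the front of the result.
    result = []
    i = n
    while i > 0:
        result.insert(0, words[i])
        i -= len(words[i])
    return result, best[n]

# Hypothetical unigram probabilities, assumed for illustration.
word_prob = {"cat": 0.02, "cattle": 0.01, "fish": 0.02}

segmentation, probability = viterbi_segment("cattlefish", word_prob)
print(segmentation, probability)
```

Each entry of best is filled in once and then reused by every longer prefix, which is what keeps the search from expanding the same portion of the space repeatedly.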

Application: language detection
n-grams have also been successfully used to detect the language a document is in.
Approach: consider letters as tokens, rather than words.
Gather a corpus in a variety of different languages (Wikipedia works well here).
Process the documents, and count all two-grams.
Estimate probabilities for language L with
  $P_L(t) = \frac{\text{count}(t)}{\text{number of 2-grams}}$
Call this $P_L$.
Assumption: different languages have characteristic two-grams.

Application: language detection
To classify a document by language:
  Find all two-grams in the document. Call this set T.
  For each language L, the likelihood that the document is of language L is:
    $P_L(t_1) \times P_L(t_2) \times \ldots \times P_L(t_n)$
  The language with the highest likelihood is the most probable language.
  (This is a form of Bayesian inference; we’ll spend more time on this later in the semester.)
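
The classification procedure can be sketched as follows. Two details here are assumptions added to make the sketch practical: log-probabilities are summed instead of multiplying raw probabilities (to avoid underflow on long documents), and add-one smoothing over an assumed 27-symbol alphabet keeps unseen two-grams from zeroing out the product. The two training sentences stand in for a real corpus.

```python
import math
from collections import Counter

def letter_bigrams(text):
    """Split text into overlapping two-letter tokens."""
    text = text.lower()
    return [text[i:i + 2] for i in range(len(text) - 1)]

def train(samples):
    """Map each language name to a Counter of its letter two-grams."""
    return {lang: Counter(letter_bigrams(text)) for lang, text in samples.items()}

def log_likelihood(document, counts, alphabet_size=27):
    """Sum of log P_L(t) over the document's two-grams, with add-one smoothing."""
    total = sum(counts.values())
    possible = alphabet_size ** 2   # assumed: 26 letters plus space
    return sum(math.log((counts[t] + 1) / (total + possible))
               for t in letter_bigrams(document))

def classify(document, models):
    """Return the language whose model gives the document the highest likelihood."""
    return max(models, key=lambda lang: log_likelihood(document, models[lang]))

# Toy training text, assumed for illustration; a real model needs far more data.
samples = {
    "english": "the quick brown fox jumps over the lazy dog and then the cat",
    "spanish": "el rapido zorro marron salta sobre el perro perezoso y luego el gato",
}
models = train(samples)
print(classify("the dog", models), classify("el perro", models))
```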

Going further
n-grams and segmentation provide some interesting ideas:
  We can combine structure with statistical knowledge.
  Probabilities can be used to help guide search.
  Probabilities can help a parser choose between different outcomes.
But no structure is used apart from collocation.
Maybe we can apply these ideas to grammars.

Reminder: CFGs
Recall context-free grammars from the last lecture.
Single non-terminal on the left, anything on the right.
  S -> NP VP
  VP -> Verb | Verb PP
  Verb -> ’run’ | ’sleep’
We can construct sentences that have more than one legal parse.
  “Squad helps dog bite victim”
CFGs don’t give us any information about which parse to select.

Probabilistic CFGs
A probabilistic CFG is just a regular CFG with probabilities attached to the right-hand sides of rules.
  For each non-terminal, they have to sum to 1.
They indicate how often a particular non-terminal derives that right-hand side.
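
Example
  S  -> NP VP   (1.0)      NP -> NP PP        (0.4)
  PP -> P NP    (1.0)      NP -> astronomers  (0.1)
  VP -> V NP    (0.7)      NP -> stars        (0.18)
  VP -> VP PP   (0.3)      NP -> saw
  P  -> with    (1.0)      NP -> ears         (0.18)
  V  -> saw     (1.0)      NP -> telescopes
The two NP rules listed without a value share the remaining 0.14 of the NP probability mass.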

Disambiguation
The probability of a parse tree being correct is just the product of the probabilities of each rule used to derive the tree.
This lets us compare two parses and say which is more likely.

Parse 1 (the PP attaches to the noun phrase):
  S (1.0)
    NP (0.1) astronomers
    VP (0.7)
      V (1.0) saw
      NP (0.4)
        NP (0.18) stars
        PP (1.0)
          P (1.0) with
          NP (0.18) ears

Parse 2 (the PP attaches to the verb phrase):
  S (1.0)
    NP (0.1) astronomers
    VP (0.3)
      VP (0.7)
        V (1.0) saw
        NP (0.18) stars
      PP (1.0)
        P (1.0) with
        NP (0.18) ears

P1 = 1.0 * 0.1 * 0.7 * 1.0 * 0.4 * 0.18 * 1.0 * 1.0 * 0.18 = 0.0009072
P2 = 1.0 * 0.1 * 0.3 * 0.7 * 1.0 * 1.0 * 0.18 * 1.0 * 0.18 = 0.0006804
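
The two products can be checked mechanically. This is a minimal sketch in which a parse tree is a nested tuple of the form (label, child, ...) and a leaf is a plain string; that representation and the names RULE_PROB and tree_probability are assumptions made for illustration. Only the rules that appear in the two trees are included.

```python
# Rule probabilities that appear in the two parse trees.
RULE_PROB = {
    ("S", ("NP", "VP")): 1.0,
    ("PP", ("P", "NP")): 1.0,
    ("VP", ("V", "NP")): 0.7,
    ("VP", ("VP", "PP")): 0.3,
    ("P", ("with",)): 1.0,
    ("V", ("saw",)): 1.0,
    ("NP", ("NP", "PP")): 0.4,
    ("NP", ("astronomers",)): 0.1,
    ("NP", ("stars",)): 0.18,
    ("NP", ("ears",)): 0.18,
}

def tree_probability(tree):
    """Multiply the probability of every rule used in the tree."""
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = RULE_PROB[(label, rhs)]
    for child in children:
        if not isinstance(child, str):
            p *= tree_probability(child)
    return p

# Parse 1: "with ears" modifies "stars".
parse1 = ("S", ("NP", "astronomers"),
          ("VP", ("V", "saw"),
           ("NP", ("NP", "stars"),
            ("PP", ("P", "with"), ("NP", "ears")))))

# Parse 2: "with ears" modifies the seeing.
parse2 = ("S", ("NP", "astronomers"),
          ("VP", ("VP", ("V", "saw"), ("NP", "stars")),
           ("PP", ("P", "with"), ("NP", "ears"))))

p1 = tree_probability(parse1)
p2 = tree_probability(parse2)
print(p1, p2)
```

Under these rule probabilities the first parse comes out more likely, so a parser that ranks by probability would attach the prepositional phrase to the noun phrase.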

Faster Parsing
We can also use probabilities to speed up parsing.
Recall that both top-down and chart parsing proceed in a primarily depth-first fashion.
  They choose a rule to apply, and based on its right-hand side, they choose another rule.
Probabilities can be used to better select which rule to apply, or which branch of the search tree to follow.
This is a form of best-first search.
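
One way to picture this is a top-down parser whose frontier is a priority queue ordered by derivation probability. The sketch below is an illustration under assumptions, not the parser from the previous lecture: it expands the leftmost non-terminal, prunes forms that can no longer match the sentence, and includes only the rules whose probabilities appear in the disambiguation example.

```python
import heapq

GRAMMAR = {  # non-terminal -> list of (probability, right-hand side)
    "S": [(1.0, ("NP", "VP"))],
    "PP": [(1.0, ("P", "NP"))],
    "VP": [(0.7, ("V", "NP")), (0.3, ("VP", "PP"))],
    "P": [(1.0, ("with",))],
    "V": [(1.0, ("saw",))],
    "NP": [(0.4, ("NP", "PP")), (0.1, ("astronomers",)),
           (0.18, ("stars",)), (0.18, ("ears",))],
}

def first_nonterminal(form, grammar):
    """Index of the leftmost non-terminal in form, or None if all terminals."""
    return next((i for i, s in enumerate(form) if s in grammar), None)

def best_first_parse(sentence, grammar, start="S"):
    """Return (probability, rules used) for the most probable derivation, or None."""
    sentence = tuple(sentence)
    # Heap entries: (-probability, sentential form, rules applied so far).
    frontier = [(-1.0, (start,), ())]
    while frontier:
        neg_p, form, rules = heapq.heappop(frontier)   # most probable first
        k = first_nonterminal(form, grammar)
        if k is None:
            if form == sentence:
                return -neg_p, list(rules)
            continue
        for p, rhs in grammar[form[k]]:
            new_form = form[:k] + rhs + form[k + 1:]
            # No rule derives the empty string, so longer forms cannot match.
            if len(new_form) > len(sentence):
                continue
            # The terminals before the next non-terminal must match the sentence.
            j = first_nonterminal(new_form, grammar)
            prefix = len(new_form) if j is None else j
            if new_form[:prefix] != sentence[:prefix]:
                continue
            heapq.heappush(frontier, (neg_p * p, new_form, rules + ((form[k], rhs),)))
    return None

result = best_first_parse("astronomers saw stars with ears".split(), GRAMMAR)
print(result)
```

Because applying a rule can only lower a derivation's probability, the first complete derivation popped from the queue is the most probable one, and less likely branches may never be expanded at all.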

Information Extraction
An increasingly common application of parsing is information extraction.
This is the process of creating structured information (database or knowledge base entries) from unstructured text.
Example:
  Suppose we want to build a price comparison agent that can visit sites on the web and find the best deals on flatscreen TVs.
  Suppose we want to build a database about video games. We might do this by hand, or we could write a program that could parse Wikipedia pages and insert knowledge such as madeBy(Blizzard, WorldOfWarcraft) into a knowledge base.

Extracting specific information
A program that fetches HTML pages and extracts specific information is called a scraper.
Simple scrapers can be built with regular expressions.
  For example, prices typically have a dollar sign, some digits, a period, and two digits.
  \$[0-9]+\.[0-9]{2}   (the dollar sign and the period must be escaped to match literally)
This approach will work, but it has several limitations:
  Can only handle simple extractions.
  Brittle and page specific.
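
The price pattern can be tried with Python's re module. The HTML fragment below is invented for illustration; a real scraper would fetch the page first.

```python
import re

# Dollar sign, one or more digits, a literal period, exactly two digits.
PRICE = re.compile(r"\$[0-9]+\.[0-9]{2}")

# Hypothetical page fragment, assumed for illustration.
html = '<li>42" flatscreen TV <b>$499.99</b></li><li>Wall mount: $24.50</li>'

prices = PRICE.findall(html)
print(prices)

# Brittleness: a thousands separator is enough to defeat the pattern.
missed = PRICE.findall("Sale price: $1,299.00")
print(missed)
```

The first call finds both prices, while the second finds nothing, which illustrates why this kind of extraction is brittle and page specific.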