



































































lecture notes from: Joshua Goodman, L. Kosseim, D. Klein
Typology: Lecture notes




































































Lecture 9
Many slides from:
Why we need to model language
Probability background
n-gram model
Parameter Estimation Techniques
Language Model for Speech Recognition
Slides 2-7 are from Joshua Goodman's slides: research.microsoft.com/~joshuago/lm
Language Model for Speech Recognition
A language model is a probability distribution over word/character sequences
We would like to find a language model P s.t.
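[Figure: Venn diagram of the set "Babies" with subsets "Baby boys", "John", and "Brown eyes"]
[Figure: the same diagram restricted to "Babies", "Baby boys", and "John", illustrating P(baby is a boy) and P(baby is named John, baby is a boy)]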
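The two quantities in the diagram can be related through conditional probability. The sketch below uses made-up counts (the variable names and all numbers are assumptions for illustration, not taken from the slides) to show how P(named John | boy) follows from P(named John, boy) and P(boy).

```python
from fractions import Fraction

# Hypothetical counts, invented purely for illustration.
total_babies = 1000
boys = 500
boys_named_john = 10

# P(baby is a boy)
p_boy = Fraction(boys, total_babies)
# P(baby is named John, baby is a boy)
p_john_and_boy = Fraction(boys_named_john, total_babies)
# Conditional probability: P(named John | boy) = P(named John, boy) / P(boy)
p_john_given_boy = p_john_and_boy / p_boy

print(p_boy, p_john_and_boy, p_john_given_boy)
```

Exact fractions are used instead of floats so the ratio is easy to check by hand.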
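Bayes rule:
[Figure: Venn diagram of "Babies" with subsets "Baby boys" and "John"]
$$P(\text{named John} \mid \text{boy}) = \frac{P(\text{boy} \mid \text{named John}) \, P(\text{named John})}{P(\text{boy})}$$
$$P(\text{word sequence} \mid \text{acoustics}) = \frac{P(\text{acoustics} \mid \text{word sequence}) \times P(\text{word sequence})}{P(\text{acoustics})}$$
P(word sequence): from language model
P(acoustics | word sequence): reasonably easy to model
P(acoustics): usually don't need this
P(word sequence | acoustics): very hard to model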
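As a rough illustration of why P(acoustics) is usually not needed: when comparing candidate word sequences for the same audio, the denominator is identical for all of them, so ranking by P(acoustics | word sequence) × P(word sequence) is enough. The candidate sentences, the scores, and the function name `best_word_sequence` below are all hypothetical.

```python
# Hypothetical scores for two candidate transcriptions of the same audio.
# "acoustic" stands for P(acoustics | word sequence),
# "lm" stands for P(word sequence) from the language model.
candidates = {
    "recognize speech": {"acoustic": 0.0008, "lm": 0.0100},
    "wreck a nice beach": {"acoustic": 0.0010, "lm": 0.0001},
}

def best_word_sequence(candidates):
    """Pick the candidate maximizing P(acoustics | words) * P(words).

    P(acoustics) is the same for every candidate, so dividing by it
    would not change which candidate wins.
    """
    return max(
        candidates,
        key=lambda words: candidates[words]["acoustic"] * candidates[words]["lm"],
    )

best = best_word_sequence(candidates)
print(best)
```

In this toy setup the second candidate has the slightly better acoustic score, but the language model score decides the ranking.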
Language Modeling
the first word in a sequence is “an”
the second word in a sequence is “apple”
the third word in a sequence is “fell”
Hard to compute P(and nothing but the truth)
Step 1: Decompose probability using conditional probability:
$$
\begin{aligned}
P(\text{and nothing but the truth})
&= P(\text{and nothing but the}) \times P(\text{truth} \mid \text{and nothing but the}) \\
&= P(\text{and nothing but}) \times P(\text{the} \mid \text{and nothing but}) \times P(\text{truth} \mid \text{and nothing but the}) \\
&= P(\text{and nothing}) \times P(\text{but} \mid \text{and nothing}) \times P(\text{the} \mid \text{and nothing but}) \times P(\text{truth} \mid \text{and nothing but the}) \\
&= P(\text{and}) \times P(\text{nothing} \mid \text{and}) \times P(\text{but} \mid \text{and nothing}) \times P(\text{the} \mid \text{and nothing but}) \times P(\text{truth} \mid \text{and nothing but the})
\end{aligned}
$$
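The decomposition can be generated mechanically: each word is conditioned on all the words before it. The following is a minimal sketch; the function name `chain_rule_factors` is an assumption made for this example, and it only builds the symbolic factors without estimating any probabilities.

```python
def chain_rule_factors(words):
    """List the chain-rule factors of P(w_1 ... w_m) as strings."""
    factors = []
    for k, word in enumerate(words):
        history = words[:k]  # every word before position k
        if history:
            factors.append(f"P({word} | {' '.join(history)})")
        else:
            # The first word has no history to condition on.
            factors.append(f"P({word})")
    return factors

factors = chain_rule_factors("and nothing but the truth".split())
print(" × ".join(factors))
```

The last factor conditions on four preceding words, which shows how quickly the histories grow with sentence length.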
Predict the next word/character given the n-1 previous words/characters.
Human subjects were shown 100 characters of text and were asked to guess the next character.
As context increases, entropy decreases.
The smaller the entropy, the larger the probability of predicting the next letter.
But only a few words are enough to make a good prediction of the next word, in most cases.
Evidence that we only need to look back at the n-1 previous words.
[Figure: entropy (H) plotted against context]
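To make the link between entropy and predictability concrete, the sketch below computes Shannon entropy, H = -Σ p log2 p, for two next-letter distributions. Both distributions are made up for illustration: one where four letters are equally likely, and one where context has made a single letter very likely.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical next-letter distributions over four candidate letters.
uniform = [0.25, 0.25, 0.25, 0.25]  # little context: every letter equally likely
peaked = [0.85, 0.05, 0.05, 0.05]   # more context: one letter dominates

h_uniform = entropy(uniform)
h_peaked = entropy(peaked)
print(h_uniform, h_peaked)
```

The peaked distribution has the lower entropy, and its most likely letter is also guessed correctly more often (0.85 versus 0.25).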
$$P(w_k \mid w_1 w_2 \dots w_{k-1}) = P(w_k \mid w_{k+1-n} \dots w_{k-1})$$
Unigram: previous words do not matter
Bigram: only the previous one word matters
Trigram: only the previous two words matter
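The three cases differ only in how much history is kept. A minimal sketch, assuming 0-based positions and a hypothetical helper named `ngram_context`, returns the n-1 words that an n-gram model conditions on when predicting the word at position k:

```python
def ngram_context(words, k, n):
    """Return the (at most) n-1 words preceding position k (0-based)."""
    start = max(0, k - (n - 1))  # do not run off the start of the sentence
    return words[start:k]

words = "and nothing but the truth".split()
k = 4  # position of "truth"

contexts = {}
for name, n in [("unigram", 1), ("bigram", 2), ("trigram", 3)]:
    contexts[name] = ngram_context(words, k, n)
    print(name, contexts[name])
```

For "truth", the unigram model keeps no history, the bigram model keeps "the", and the trigram model keeps "but the".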