Language Models for natural language processing, Lecture notes of Natural Language Processing (NLP)

lecture notes from: Joshua Goodman, L. Kosseim, D. Klein

Typology: Lecture notes

2018/2019

Uploaded on 11/15/2019

CS442/542b: Artificial Intelligence II
Prof. Olga Veksler
Lecture 9
NLP: Language Models
Many slides from: Joshua Goodman, L. Kosseim, D. Klein

Outline

  • Why we need to model language
  • Probability background
      • Basic probability axioms
      • Conditional probability
      • Bayes' rule
  • n-gram model
  • Parameter Estimation Techniques
      • MLE
      • Smoothing

Language Model for Speech Recognition

Slides 2-7 are from Joshua Goodman's slides: research.microsoft.com/~joshuago/lm-tutorial-public.ppt

What is a Language Model?

A language model is a probability distribution over word/character sequences.

We would like to find a language model P such that P(“And nothing but the truth”) is higher than P(“And nuts sing on the roof”).

Joint probabilities 

P(X, Y) means the probability that X and Y are both true. For example: P(brown eyes, boy) = (number of baby boys with brown eyes) / (total number of babies)

[Figure: Venn diagram of the set "Babies", with regions labeled "Baby boys", "John", and "Brown eyes"]

Conditional Probability 

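$$P(X \mid Y) = \frac{P(X, Y)}{P(Y)}$$

For example:

$$P(\text{baby is named John} \mid \text{baby is a boy}) = \frac{P(\text{baby is named John},\ \text{baby is a boy})}{P(\text{baby is a boy})}$$

[Figure: Venn diagram of the set "Babies", with regions labeled "Baby boys" and "John"]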
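The joint and conditional probabilities above can be estimated by counting. The following is a minimal sketch in Python; the `babies` table, its field layout, and the helper names are invented here purely for illustration and are not part of the original slides.

```python
from fractions import Fraction

# Hypothetical toy data: (name, sex, eye colour) for six babies.
babies = [
    ("John", "boy", "brown"),
    ("John", "boy", "blue"),
    ("Adam", "boy", "brown"),
    ("Mary", "girl", "brown"),
    ("Anna", "girl", "blue"),
    ("Emma", "girl", "brown"),
]

def prob(event, population):
    """P(event): fraction of the population for which event(b) is true."""
    return Fraction(sum(1 for b in population if event(b)), len(population))

def is_boy(b):
    return b[1] == "boy"

def is_john(b):
    return b[0] == "John"

# Joint probability: both conditions hold, divided by ALL babies.
p_john_and_boy = prob(lambda b: is_john(b) and is_boy(b), babies)
p_boy = prob(is_boy, babies)

# Conditional probability: P(X | Y) = P(X, Y) / P(Y).
p_john_given_boy = p_john_and_boy / p_boy

print(p_john_and_boy, p_boy, p_john_given_boy)
```

Note that the joint probability divides by the total number of babies, while the conditional probability effectively divides by the number of boys only.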

Bayes Rule 

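Bayes rule:

$$P(X \mid Y) = \frac{P(Y \mid X)\, P(X)}{P(Y)}$$

For example:

$$P(\text{named John} \mid \text{boy}) = \frac{P(\text{boy} \mid \text{named John})\, P(\text{named John})}{P(\text{boy})}$$

[Figure: Venn diagram of the set "Babies", with regions labeled "Baby boys" and "John"]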
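To make Bayes rule concrete, here is a small Python sketch of the baby-name example. All three input probabilities are made-up values chosen only to keep the arithmetic simple, and the function name `bayes` is an assumption of this sketch.

```python
from fractions import Fraction

# Hypothetical numbers, chosen only to make the arithmetic easy to follow.
p_boy = Fraction(1, 2)             # P(boy)
p_john = Fraction(1, 50)           # P(named John)
p_boy_given_john = Fraction(1, 1)  # P(boy | named John): assume every John is a boy

def bayes(p_y_given_x, p_x, p_y):
    """Bayes rule: P(X | Y) = P(Y | X) * P(X) / P(Y)."""
    return p_y_given_x * p_x / p_y

# P(named John | boy) = P(boy | named John) * P(named John) / P(boy)
p_john_given_boy = bayes(p_boy_given_john, p_john, p_boy)
print(p_john_given_boy)
```

The rule lets us compute a conditional probability in one direction from the conditional probability in the other direction, which is what the speech recognition setting relies on.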

Speech Recognition Example

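$$P(\text{word sequence} \mid \text{acoustics}) = \frac{P(\text{acoustics} \mid \text{word sequence}) \times P(\text{word sequence})}{P(\text{acoustics})}$$

  • P(word sequence | acoustics): very hard to model
  • P(acoustics | word sequence): reasonably easy to model
  • P(word sequence): from language model
  • P(acoustics): usually don't need this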
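One way to see why P(acoustics) is usually not needed: when comparing candidate word sequences for the same audio, the denominator is identical for all of them, so only the product P(acoustics | word sequence) × P(word sequence) matters. The sketch below illustrates this in Python. The two score tables are hypothetical placeholders, not output of any real acoustic or language model.

```python
import math

# Hypothetical scores for two candidate transcriptions of the same audio.
# acoustic[w] stands in for P(acoustics | word sequence w),
# language[w] stands in for P(word sequence w) from a language model.
acoustic = {
    "and nothing but the truth": 0.30,
    "and nuts sing on the roof": 0.35,
}
language = {
    "and nothing but the truth": 1e-3,
    "and nuts sing on the roof": 1e-9,
}

def decode(candidates):
    """Pick the word sequence maximizing P(acoustics | w) * P(w).

    P(acoustics) is the same for every candidate, so it is dropped.
    Log-probabilities are summed to avoid underflow on long sequences.
    """
    return max(
        candidates,
        key=lambda w: math.log(acoustic[w]) + math.log(language[w]),
    )

best = decode(list(acoustic))
print(best)
```

In this toy setup the acoustically slightly better candidate loses, because the language model assigns it a far lower probability.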

Language Modeling 

In our case, events will be sequences of words, for example “an apple fell”.

P(“an apple fell”) is the probability of the joint event that:

  • the first word in a sequence is “an”
  • the second word in a sequence is “apple”
  • the third word in a sequence is “fell”

P(fell | an apple) should be read as the probability that the third word in a sequence is “fell” given that the previous 2 words are “an apple”.

How Language Models work 

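Hard to compute P(and nothing but the truth)

Step 1: Decompose probability using conditional probability:

$$\begin{aligned}
P(\text{and nothing but the truth})
&= P(\text{truth} \mid \text{and nothing but the}) \times P(\text{and nothing but the}) \\
&= P(\text{truth} \mid \text{and nothing but the}) \times P(\text{the} \mid \text{and nothing but}) \times P(\text{and nothing but}) \\
&= P(\text{truth} \mid \text{and nothing but the}) \times P(\text{the} \mid \text{and nothing but}) \times P(\text{but} \mid \text{and nothing}) \times P(\text{and nothing}) \\
&= P(\text{truth} \mid \text{and nothing but the}) \times P(\text{the} \mid \text{and nothing but}) \times P(\text{but} \mid \text{and nothing}) \times P(\text{nothing} \mid \text{and}) \times P(\text{and})
\end{aligned}$$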
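The decomposition can be checked numerically. The Python sketch below estimates each conditional probability by counting sentence prefixes in a tiny invented corpus, multiplies them together, and compares the result with the directly counted joint probability. The corpus and the helper names (`conditional`, `chain_rule`) are assumptions made for this illustration.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical toy corpus; each sentence is one observed word sequence.
corpus = [
    "and nothing but the truth",
    "and nothing but the truth",
    "and nothing but trouble",
    "and then nothing",
    "the truth hurts",
]
sentences = [s.split() for s in corpus]

# Count how many sentences start with each prefix (including the empty one).
prefix_counts = Counter()
for words in sentences:
    for k in range(len(words) + 1):
        prefix_counts[tuple(words[:k])] += 1

def conditional(word, history):
    """P(next word = word | sequence starts with history), by counting."""
    history = tuple(history)
    return Fraction(prefix_counts[history + (word,)], prefix_counts[history])

def chain_rule(words):
    """P(w1 ... wk) as the product of P(wi | w1 ... w(i-1))."""
    p = Fraction(1)
    for i, word in enumerate(words):
        p *= conditional(word, words[:i])
    return p

target = "and nothing but the truth".split()
p_chain = chain_rule(target)
p_direct = Fraction(prefix_counts[tuple(target)], len(sentences))
print(p_chain, p_direct)
```

Both numbers agree because the intermediate counts cancel in the product. The decomposition by itself does not make estimation easier, since the long histories are still rare in any realistic corpus.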

“Shannon Game” (Shannon, 1951)

“I am going to make a collect …”

  • Predict the next word/character given the n-1 previous words/characters.
  • Human subjects were shown 100 characters of text and were asked to guess the next character.
  • As context increases, entropy decreases.
      • The smaller the entropy, the larger the probability of predicting the next letter.
  • But only a few words are enough to make a good prediction of the next word, in most cases.
  • Evidence that we only need to look back at the n-1 previous words.

[Figure: plot of entropy (H), vertical axis from 0 to 3, against context]
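The trend "more context, lower entropy" can be reproduced roughly on any text. The following Python sketch computes the empirical conditional entropy of the next character given the previous `context_len` characters. The sample text and the function name are assumptions of this sketch, and it uses counts from a text rather than human guesses as in Shannon's experiment.

```python
import math
from collections import Counter, defaultdict

def conditional_entropy(text, context_len):
    """Empirical H(next char | previous context_len chars), in bits."""
    follow = defaultdict(Counter)  # context -> counts of the next character
    for i in range(context_len, len(text)):
        follow[text[i - context_len:i]][text[i]] += 1
    total = sum(sum(c.values()) for c in follow.values())
    h = 0.0
    for counts in follow.values():
        n_ctx = sum(counts.values())
        for n in counts.values():
            # p(context, char) * log2(1 / p(char | context))
            h -= (n / total) * math.log2(n / n_ctx)
    return h

# Hypothetical sample text; any longer text could be used instead.
text = "the cat sat on the mat and the cat ate the rat " * 20
for k in range(4):
    print(k, round(conditional_entropy(text, k), 3))
```

On a short or repetitive text such as this one, the estimates for long contexts are optimistic, because almost every long context is followed by the same character each time it occurs.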

n-grams

n-gram model: the probability of a word depends only on the n-1 previous words (the history):

$$P(w_k \mid w_1 w_2 \ldots w_{k-1}) = P(w_k \mid w_{k+1-n} \ldots w_{k-1})$$

This is called the Markov assumption: only the closest n-1 previous words are relevant:

  • Unigram: previous words do not matter
  • Bigram: only the previous one word matters
  • Trigram: only the previous two words matter
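As an illustration of the bigram case, the sketch below estimates P(word | previous word) by relative frequency and scores a sentence as a product of bigram probabilities. The toy corpus, the `<s>` and `</s>` boundary markers, and the function names are assumptions introduced for this example. No smoothing is applied, so any unseen bigram gets probability zero.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical toy corpus; <s> and </s> mark sentence start and end.
corpus = [
    "<s> an apple fell </s>",
    "<s> an apple a day </s>",
    "<s> an orange fell </s>",
]
tokens = [s.split() for s in corpus]

unigram_counts = Counter(w for sent in tokens for w in sent)
bigram_counts = Counter(
    (prev, word) for sent in tokens for prev, word in zip(sent, sent[1:])
)

def p_bigram(word, prev):
    """Relative-frequency estimate: count(prev word) / count(prev)."""
    return Fraction(bigram_counts[(prev, word)], unigram_counts[prev])

def p_sentence(words):
    """Bigram approximation: product of P(w_k | w_(k-1)) over the sentence."""
    p = Fraction(1)
    for prev, word in zip(words, words[1:]):
        p *= p_bigram(word, prev)
    return p

print(p_bigram("apple", "an"))
print(p_sentence("<s> an apple fell </s>".split()))
```

A trigram version would key the counts on the two previous words instead of one, at the cost of needing much more data to see each history often enough.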