















Study with the several resources on Docsity
Earn points by helping other students or get them with a premium plan
Prepare for your exams
Study with the several resources on Docsity
Earn points to download
Earn points by helping other students or get them with a premium plan
Main points of this lecture are: Language Modeling, Statistical Language Modelling, Language Models, Distribution, Gram Models, Distributed Representations, High-Dimensional, Estimation Of Distributions, Word Representations, Neural Language
Typology: Study notes
1 / 23
This page cannot be seen from the preview
Don't miss anything!
















ALL
NTRODUCTION TO
ACHINE
EARNING
PPLICATIONS OF MACHINE LEARNING TO
LANGUAGE MODELING
1
Goal: Model the joint distribution of words in a sentence.
Such a model can be used to
predict the next word given several preceding ones
-
arrange bags of words into sentences
-
assign probabilities to documents
Applications: speech recognition, machine translation,information retrieval.
Most statistical language models are based on the Markovassumption:
The distribution of the next word depends on only
n
words that immediately precede it.
-
This assumption is clearly wrong but useful – it makesthe task much more tractable.
2
n
Let
s
be the number of times a sequence of words
s
occurs in the training set.
Then we can estimate a trigram model as follows:
P
(
w
3
| w
1
, w
2
) =
w
1
w
2
w
3
w
1
w
2
Problem: if
w
3
w
2
w
1
does occur in the training set, it is
assigned zero probability.
That’s bad – the model does not generalize to new wordtriples!
One solution: smooth the trigram estimates byinterpolating them with the bigram estimates
P
(
w
3
| w
1
, w
2
) =
λ
×
w
1
w
2
w
3
w
1
w
2
−
λ
)
×
w
2
w
3
w
2
Can also smooth with the unigram estimates and theuniform distribution.
4
n
n
n
-gram models don’t take advantage of the fact that some words are used in similar ways.
Suppose you know that words
snow
and
rain
are used in
similar ways, as are
Monday
and
Tuesday
If you are told that the following sentence is probable:
It’s going to rain on Monday.
Then you can infer that the following sentence is alsoprobable:
It’s going to snow on Tuesday.
n
-gram models cannot generalize this way because all words are treated as arbitrary symbols, with each wordbeing equally (dis)similar to all others.
Using distributed representations for words allowssimilarity between words to be captured.
5
7
8
The original and still the most popular neural languagemodel.
A lookup table is used to map context words to featurevectors.
Architecture: 1-hidden layer neural net
Input: sequence of the context word feature vectors.
-
Output: distribution over the next word (softmax overwords).
Outperforms
n
-gram models on small (
∼
1M words)
datasets.
For better results, predictions of a NPLM are interpolatedwith those of an
n
-gram model.
10
11
Computing the probability of the given word being thenext word requires considering all
N
words in the
vocabulary.^ –
Need to normalize over all words because the space ofwords is unstructured.
Idea (due to Bengio): Organize words in the vocabularyinto a (somewhat balanced) binary tree and exploit itsstructure to speed up normalization.
Construct a binary tree over words^ ∗
words are associated with leaf nodes ∗
one word per leaf
Predicting the next word: replace one
N
-way decision
by a sequence of
O
(log
N
)
two-way decision.
∗
Can achieve exponential speedup!
13
To define a distribution over leaf nodes:
Specify the probability of taking the left branch at eachnon-leaf node.
-
Then the probability of a leaf node is simply theprobability of the sequence of left/right decisions thatlead from the root node to the leaf node.
14
Let
d
be the binary string / code that encodes the sequence
of left-right decisions in the tree that lead to word
w
Each non-leaf node in the tree is given a feature vector thatcaptures the difference between the words in its left andright subtrees.
The probability of taking the left branch at a particularnode is given by
P
(
d
i^
= 1
| q
, wi
1:
n
−
1
) =
σ
(ˆ
r
T
q
)i
,
where
ˆr
is computed as in the LBL model and
q
i^
is the
feature vector for the node.
Then the probability of word
w
being the next word is
simply the probability of
d
under the binary decision
model:
P
(
w
n
=
w
| w
1:
n
−
1
) =
∏
i
P
(
d
|i q
, wi
1:
n
−
1
)
.
16
We would like to cluster words based on the distributionof contexts in which they occur.
This distribution is hard to estimate and work with due tothe high dimensionality of the space of contexts (the samesparsity problem
n
-gram models suffer from).
To avoid this problem, we represent contexts usingdistributed representations and cluster words based ontheir
expected
context representation.
To construct a word tree:1. Train a model using a random (balanced) tree over
words.
occurrences of the given word.
representations.
17
We compared the models on the APNews dataset:
A collection of Associated Press news stories (16 millionwords)
-
Training/validation/test split: 14M/1M/1M words
Preprocessing (Bengio):
convert all words to lower case
-
map all rare words and proper nouns to special symbols
-
Result: just under 18000 unique words.
Models were compared based on the perplexity theyassigned to the test set.
Perplexity is the geometric average of
1
P
(
wn
|w
1:
n
−
1
)
19
The effect of the feature dimensionality and the tree-buildingalgorithm on the test set perplexity of the model.
Feature
Perplexity using
Perplexity using
Reduction
dimensionality
a RANDOM tree
a BALANCED tree
in perplexity
25
50
75
100
20