Language Modeling - Introduction to Machine Learning - Lecture Notes, Study notes of Machine Learning

Main points of this lecture are: Language Modeling, Statistical Language Modelling, Language Models, Distribution, Gram Models, Distributed Representations, High-Dimensional, Estimation Of Distributions, Word Representations, Neural Language

Typology: Study notes

2012/2013

Uploaded on 04/30/2013

bassu
bassu 🇮🇳

4.5

(42)

141 documents

1 / 23

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
CSC2515 FALL 2008
INTRODUCTION TO MACHINE LEARNING
APPLICATIONS OF MACHINE LEARNING TO
LANGUAGE MODELING
1
pf3
pf4
pf5
pf8
pf9
pfa
pfd
pfe
pff
pf12
pf13
pf14
pf15
pf16
pf17

Partial preview of the text

Download Language Modeling - Introduction to Machine Learning - Lecture Notes and more Study notes Machine Learning in PDF only on Docsity!

CSC2515 F

ALL

I

NTRODUCTION TO

M

ACHINE

L

EARNING

A

PPLICATIONS OF MACHINE LEARNING TO

LANGUAGE MODELING

1

Statistical language modelling

Goal: Model the joint distribution of words in a sentence.

Such a model can be used to

predict the next word given several preceding ones

-

arrange bags of words into sentences

-

assign probabilities to documents

Applications: speech recognition, machine translation,information retrieval.

Most statistical language models are based on the Markovassumption:

The distribution of the next word depends on only

n

words that immediately precede it.

-

This assumption is clearly wrong but useful – it makesthe task much more tractable.

2

Training

n

-gram models

Let

s

be the number of times a sequence of words

s

occurs in the training set.

Then we can estimate a trigram model as follows:

P

(

w

3

| w

1

, w

2

) =

w

1

w

2

w

3

w

1

w

2

Problem: if

w

3

w

2

w

1

does occur in the training set, it is

assigned zero probability.

That’s bad – the model does not generalize to new wordtriples!

One solution: smooth the trigram estimates byinterpolating them with the bigram estimates

P

(

w

3

| w

1

, w

2

) =

λ

×

w

1

w

2

w

3

w

1

w

2

  • (

λ

)

×

w

2

w

3

w

2

Can also smooth with the unigram estimates and theuniform distribution.

4

Why

n

-gram models are hopeless for large

n

n

-gram models don’t take advantage of the fact that some words are used in similar ways.

Suppose you know that words

snow

and

rain

are used in

similar ways, as are

Monday

and

Tuesday

If you are told that the following sentence is probable:

It’s going to rain on Monday.

Then you can infer that the following sentence is alsoprobable:

It’s going to snow on Tuesday.

n

-gram models cannot generalize this way because all words are treated as arbitrary symbols, with each wordbeing equally (dis)similar to all others.

Using distributed representations for words allowssimilarity between words to be captured.

5

Word representations embedded in 2D (I)

7

Word representations embedded in 2D (II)

8

Neural Probabilistic Language Model

(Bengio et al., 2000)

The original and still the most popular neural languagemodel.

A lookup table is used to map context words to featurevectors.

Architecture: 1-hidden layer neural net

Input: sequence of the context word feature vectors.

-

Output: distribution over the next word (softmax overwords).

Outperforms

n

-gram models on small (

1M words)

datasets.

For better results, predictions of a NPLM are interpolatedwith those of an

n

-gram model.

10

Neural Probabilistic Language Model

11

Structuring the vocabulary

Computing the probability of the given word being thenext word requires considering all

N

words in the

vocabulary.^

Need to normalize over all words because the space ofwords is unstructured.

Idea (due to Bengio): Organize words in the vocabularyinto a (somewhat balanced) binary tree and exploit itsstructure to speed up normalization.

Construct a binary tree over words^ ∗

words are associated with leaf nodes ∗

one word per leaf

Predicting the next word: replace one

N

-way decision

by a sequence of

O

(log

N

)

two-way decision.

Can achieve exponential speedup!

13

Tree-based factorization

To define a distribution over leaf nodes:

Specify the probability of taking the left branch at eachnon-leaf node.

-

Then the probability of a leaf node is simply theprobability of the sequence of left/right decisions thatlead from the root node to the leaf node.

14

Hierarchical log-bilinear model

(Mnih & Hinton, 2008)

Let

d

be the binary string / code that encodes the sequence

of left-right decisions in the tree that lead to word

w

Each non-leaf node in the tree is given a feature vector thatcaptures the difference between the words in its left andright subtrees.

The probability of taking the left branch at a particularnode is given by

P

(

d

i^

= 1

| q

, wi

1:

n

1

) =

σ

r

T

q

)i

,

where

ˆr

is computed as in the LBL model and

q

i^

is the

feature vector for the node.

Then the probability of word

w

being the next word is

simply the probability of

d

under the binary decision

model:

P

(

w

n

=

w

| w

1:

n

1

) =

i

P

(

d

|i q

, wi

1:

n

1

)

.

16

Data-driven tree construction

We would like to cluster words based on the distributionof contexts in which they occur.

This distribution is hard to estimate and work with due tothe high dimensionality of the space of contexts (the samesparsity problem

n

-gram models suffer from).

To avoid this problem, we represent contexts usingdistributed representations and cluster words based ontheir

expected

context representation.

To construct a word tree:1. Train a model using a random (balanced) tree over

words.

  1. Compute the expected predicted representation over all

occurrences of the given word.

  1. Perform hierarchical clustering on these expected

representations.

17

Dataset and evaluation

We compared the models on the APNews dataset:

A collection of Associated Press news stories (16 millionwords)

-

Training/validation/test split: 14M/1M/1M words

Preprocessing (Bengio):

convert all words to lower case

-

map all rare words and proper nouns to special symbols

-

Result: just under 18000 unique words.

Models were compared based on the perplexity theyassigned to the test set.

Perplexity is the geometric average of

1

P

(

wn

|w

1:

n

1

)

19

Random vs. non-random trees

The effect of the feature dimensionality and the tree-buildingalgorithm on the test set perplexity of the model.

Feature

Perplexity using

Perplexity using

Reduction

dimensionality

a RANDOM tree

a BALANCED tree

in perplexity

25

50

75

100

20