Language Modeling - Introduction to Machine Learning - Lecture Notes, Study notes of Machine Learning

Birla Institute of Technology and Science Machine Learning

Main points of this lecture are: Language Modeling, Statistical Language Modelling, Language Models, Distribution, Gram Models, Distributed Representations, High-Dimensional, Estimation Of Distributions, Word Representations, Neural Language

Typology: Study notes

2012/2013

Uploaded on 04/30/2013

bassu 🇮🇳


CSC2515 FALL 2008

INTRODUCTION TO MACHINE LEARNING

APPLICATIONS OF MACHINE LEARNING TO

LANGUAGE MODELING


Statistical language modelling

Goal: Model the joint distribution of words in a sentence.

Such a model can be used to

predict the next word given several preceding ones

arrange bags of words into sentences

assign probabilities to documents

Applications: speech recognition, machine translation, information retrieval.

Most statistical language models are based on the Markov assumption:

The distribution of the next word depends only on the $n-1$ words that immediately precede it.

This assumption is clearly wrong but useful: it makes the task much more tractable.

Training $n$-gram models

Let $c(w_1, \dots, w_k)$ be the number of times the word sequence $w_1, \dots, w_k$ occurs in the training set.

Then we can estimate a trigram model as follows:

$$P(w_3 \mid w_1, w_2) = \frac{c(w_1, w_2, w_3)}{c(w_1, w_2)}$$

Problem: if $w_1, w_2, w_3$ does not occur in the training set, it is assigned zero probability.

That's bad: the model does not generalize to new word triples!
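The count-based trigram estimate and its zero-probability problem can be illustrated with a minimal sketch. The toy corpus and the function names are assumptions made for illustration; they are not part of the lecture.

```python
from collections import Counter

def train_trigram_counts(tokens):
    """Count each trigram and each two-word context that is followed by a word."""
    trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
    contexts = Counter(zip(tokens, tokens[1:-1]))
    return trigrams, contexts

def trigram_prob(w1, w2, w3, trigrams, contexts):
    """Relative-frequency estimate of P(w3 | w1, w2); 0.0 if the context is unseen."""
    if contexts[(w1, w2)] == 0:
        return 0.0
    return trigrams[(w1, w2, w3)] / contexts[(w1, w2)]

# Hypothetical toy training set.
tokens = "it is going to rain on monday and it is going to snow on tuesday".split()
trigrams, contexts = train_trigram_counts(tokens)

p_seen = trigram_prob("going", "to", "rain", trigrams, contexts)
p_unseen = trigram_prob("rain", "on", "tuesday", trigrams, contexts)
print(p_seen, p_unseen)
```

Here "rain on tuesday" never occurs in the toy corpus, so the estimate gives it probability zero even though it is a perfectly plausible word triple.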

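One solution: smooth the trigram estimates by interpolating them with the bigram estimates:

$$P(w_3 \mid w_1, w_2) = \lambda \frac{c(w_1, w_2, w_3)}{c(w_1, w_2)} + (1 - \lambda) \frac{c(w_2, w_3)}{c(w_2)}$$

where $c(\cdot)$ is the training-set count of a word sequence and $\lambda \in [0, 1]$ is the interpolation weight.

Can also smooth with the unigram estimates and the uniform distribution.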
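A minimal sketch of interpolating trigram and bigram estimates follows. The weight `lam=0.7`, the toy corpus, and all function names are assumptions; in practice the weight would be tuned on held-out data.

```python
from collections import Counter

def count_ngrams(tokens, k):
    """Count all length-k word sequences in the token list."""
    return Counter(zip(*(tokens[i:] for i in range(k))))

def ratio(numerator, denominator):
    """Safe division that returns 0.0 for an unseen context."""
    return numerator / denominator if denominator else 0.0

def interpolated_trigram_prob(w1, w2, w3, counts, lam=0.7):
    """Mix the trigram estimate with the bigram estimate using weight lam."""
    uni, bi, tri = counts
    p_tri = ratio(tri[(w1, w2, w3)], bi[(w1, w2)])
    p_bi = ratio(bi[(w2, w3)], uni[(w2,)])
    return lam * p_tri + (1.0 - lam) * p_bi

# Hypothetical toy training set.
tokens = "it is going to rain on monday and it is going to snow on tuesday".split()
counts = tuple(count_ngrams(tokens, k) for k in (1, 2, 3))

p_seen = interpolated_trigram_prob("going", "to", "rain", counts)
p_unseen = interpolated_trigram_prob("rain", "on", "tuesday", counts)
print(p_seen, p_unseen)
```

The unseen triple "rain on tuesday" now gets a nonzero probability because the bigram "on tuesday" was observed.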

Why $n$-gram models are hopeless for large $n$

$n$-gram models don't take advantage of the fact that some words are used in similar ways.

Suppose you know that the words snow and rain are used in similar ways, as are Monday and Tuesday.

If you are told that the following sentence is probable:

It's going to rain on Monday.

Then you can infer that the following sentence is also probable:

It's going to snow on Tuesday.

$n$-gram models cannot generalize this way because all words are treated as arbitrary symbols, with each word being equally (dis)similar to all others.

Using distributed representations for words allows similarity between words to be captured.
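The contrast between arbitrary symbols and distributed representations can be sketched with cosine similarity. The 2-D feature values below are hand-picked, hypothetical numbers chosen only to illustrate the point; a real model would learn them.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

vocab = ["rain", "snow", "monday", "tuesday"]
idx = {w: i for i, w in enumerate(vocab)}

# Words as arbitrary symbols: one-hot vectors, all equally dissimilar.
one_hot = np.eye(len(vocab))

# Hypothetical learned feature vectors (assumed values, not from the lecture).
features = np.array([
    [0.9, 0.1],  # rain
    [0.8, 0.2],  # snow
    [0.1, 0.9],  # monday
    [0.2, 0.8],  # tuesday
])

sim_one_hot = cosine(one_hot[idx["rain"]], one_hot[idx["snow"]])
sim_rain_snow = cosine(features[idx["rain"]], features[idx["snow"]])
sim_rain_monday = cosine(features[idx["rain"]], features[idx["monday"]])
print(sim_one_hot, sim_rain_snow, sim_rain_monday)
```

With one-hot symbols every pair of distinct words has similarity zero, whereas the feature vectors place rain close to snow and far from Monday.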

Word representations embedded in 2D (I)

Word representations embedded in 2D (II)

Neural Probabilistic Language Model

(Bengio et al., 2000)

The original and still the most popular neural language model.

A lookup table is used to map context words to feature vectors.

Architecture: 1-hidden-layer neural net

Input: sequence of the context word feature vectors.

Output: distribution over the next word (softmax over words).

Outperforms $n$-gram models on small ($\sim$1M words) datasets.

For better results, predictions of an NPLM are interpolated with those of an $n$-gram model.
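A simplified forward pass for this kind of architecture might look as follows. This is a sketch under assumptions: the layer sizes, the tanh hidden nonlinearity, the random untrained weights, and all names are illustrative and are not taken from the lecture.

```python
import numpy as np

def nplm_forward(context_ids, params):
    """Return a distribution over the next word given the context word ids."""
    lookup, w_hidden, b_hidden, w_out, b_out = params
    x = lookup[context_ids].reshape(-1)      # look up and concatenate feature vectors
    h = np.tanh(w_hidden @ x + b_hidden)     # single hidden layer
    logits = w_out @ h + b_out
    logits = logits - logits.max()           # stabilise the softmax numerically
    exp = np.exp(logits)
    return exp / exp.sum()                   # softmax over all words

# Assumed toy sizes: vocabulary, feature dimension, context length, hidden units.
V, D, CONTEXT, H = 10, 4, 3, 8
rng = np.random.default_rng(0)
params = (
    0.1 * rng.normal(size=(V, D)),            # lookup table: one feature vector per word
    0.1 * rng.normal(size=(H, CONTEXT * D)),  # input-to-hidden weights
    np.zeros(H),
    0.1 * rng.normal(size=(V, H)),            # hidden-to-output weights
    np.zeros(V),
)

probs = nplm_forward(np.array([1, 5, 7]), params)
print(probs.shape, probs.sum())
```

Note that the final softmax touches every word in the vocabulary, which is the cost that a tree-structured output layer is meant to avoid.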

Neural Probabilistic Language Model

Structuring the vocabulary

Computing the probability of the given word being the next word requires considering all $N$ words in the vocabulary.

Need to normalize over all words because the space of words is unstructured.

Idea (due to Bengio): Organize words in the vocabulary into a (somewhat balanced) binary tree and exploit its structure to speed up normalization.

Construct a binary tree over words:

words are associated with leaf nodes

one word per leaf

Predicting the next word: replace one $N$-way decision by a sequence of $O(\log N)$ two-way decisions.

Can achieve exponential speedup!

Tree-based factorization

To define a distribution over leaf nodes:

Specify the probability of taking the left branch at each non-leaf node.

Then the probability of a leaf node is simply the probability of the sequence of left/right decisions that leads from the root node to the leaf node.
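The factorization can be checked on a small example. In this sketch a non-leaf node is a tuple `(p_left, left_subtree, right_subtree)` and a leaf is a word string; the tree shape and the branch probabilities are made-up values for illustration.

```python
def leaf_probabilities(node, prob=1.0, out=None):
    """Multiply branch probabilities along each root-to-leaf path."""
    if out is None:
        out = {}
    if isinstance(node, str):          # leaf: record the accumulated path probability
        out[node] = prob
        return out
    p_left, left, right = node
    leaf_probabilities(left, prob * p_left, out)
    leaf_probabilities(right, prob * (1.0 - p_left), out)
    return out

# Hypothetical tree over four words with assumed left-branch probabilities.
tree = (0.6,
        (0.5, "rain", "snow"),
        (0.25, "monday", "tuesday"))

probs = leaf_probabilities(tree)
print(probs, sum(probs.values()))
```

Because each node splits its probability mass between its two children, the leaf probabilities sum to one without any explicit normalization over the vocabulary.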

Hierarchical log-bilinear model

(Mnih & Hinton, 2008)

Let $d$ be the binary string / code that encodes the sequence of left-right decisions in the tree that lead to word $w$.

Each non-leaf node in the tree is given a feature vector that captures the difference between the words in its left and right subtrees.

The probability of taking the left branch at a particular node is given by

$$P(d_i = 1 \mid q_i, w_{1:n-1}) = \sigma(\hat{r}^\top q_i)$$

where $\sigma$ is the logistic function, $\hat{r}$ is computed as in the LBL model and $q_i$ is the feature vector for the node.

Then the probability of word $w$ being the next word is simply the probability of $d$ under the binary decision model:

$$P(w_n = w \mid w_{1:n-1}) = \prod_i P(d_i \mid q_i, w_{1:n-1})$$
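The product of binary decisions can be sketched as below. The predicted representation `r_hat` is treated as given (here a random vector), and the node feature vectors, the four-word tree, and all names are assumptions for illustration; a code bit of 1 means "take the left branch".

```python
import numpy as np

def sigmoid(x):
    """Logistic function."""
    return 1.0 / (1.0 + np.exp(-x))

def word_prob(r_hat, code, node_vectors):
    """Probability of a word: product of the left/right decisions on its path."""
    prob = 1.0
    for d_i, q_i in zip(code, node_vectors):
        p_left = sigmoid(r_hat @ q_i)
        prob *= p_left if d_i == 1 else 1.0 - p_left
    return prob

rng = np.random.default_rng(0)
r_hat = rng.normal(size=3)                          # stand-in for the predicted representation
q_root, q_left, q_right = rng.normal(size=(3, 3))   # hypothetical node feature vectors

# Hypothetical depth-2 tree: each word has a 2-bit code and the nodes on its path.
paths = {
    "rain":    ([1, 1], [q_root, q_left]),
    "snow":    ([1, 0], [q_root, q_left]),
    "monday":  ([0, 1], [q_root, q_right]),
    "tuesday": ([0, 0], [q_root, q_right]),
}
probs = {w: word_prob(r_hat, code, qs) for w, (code, qs) in paths.items()}
print(probs, sum(probs.values()))
```

Each word's probability needs only the nodes on its own path, so the cost grows with the depth of the tree rather than with the vocabulary size.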

Data-driven tree construction

We would like to cluster words based on the distribution of contexts in which they occur.

This distribution is hard to estimate and work with due to the high dimensionality of the space of contexts (the same sparsity problem $n$-gram models suffer from).

To avoid this problem, we represent contexts using distributed representations and cluster words based on their expected context representation.

To construct a word tree:

1. Train a model using a random (balanced) tree over words.

2. Compute the expected predicted representation over all occurrences of the given word.

3. Perform hierarchical clustering on these expected representations.
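The last two steps can be sketched as follows, assuming the predicted representations from an already trained model are available as an array. The text above does not specify the clustering algorithm, so this sketch substitutes a simple stand-in: recursively splitting the words in half along the direction of greatest variance. The data values and function names are hypothetical.

```python
import numpy as np

def expected_representations(word_ids, predicted, vocab_size):
    """Average the predicted representation over all occurrences of each word."""
    sums = np.zeros((vocab_size, predicted.shape[1]))
    np.add.at(sums, word_ids, predicted)
    counts = np.bincount(word_ids, minlength=vocab_size)
    return sums / np.maximum(counts, 1)[:, None]

def build_tree(words, vectors):
    """Recursively split the words in two along the top principal direction."""
    if len(words) == 1:
        return words[0]
    centered = vectors - vectors.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    order = np.argsort(centered @ vt[0], kind="stable")
    half = len(words) // 2
    left, right = order[:half], order[half:]
    return (build_tree([words[i] for i in left], vectors[left]),
            build_tree([words[i] for i in right], vectors[right]))

# Hypothetical data: the predicted representation at each occurrence of a word.
words = ["rain", "snow", "monday", "tuesday"]
word_ids = np.array([0, 1, 2, 3, 0, 1, 2, 3])
predicted = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9],
                      [1.0, 0.2], [0.9, 0.1], [0.0, 0.8], [0.1, 0.9]])

means = expected_representations(word_ids, predicted, len(words))
tree = build_tree(words, means)
print(tree)
```

On this toy data the weather words end up in one subtree and the day names in the other, which is the kind of grouping the clustering step is meant to produce.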

Dataset and evaluation

We compared the models on the APNews dataset:

A collection of Associated Press news stories (16 million words)

Training/validation/test split: 14M/1M/1M words

Preprocessing (Bengio):

convert all words to lower case

map all rare words and proper nouns to special symbols

Result: just under 18000 unique words.

Models were compared based on the perplexity they assigned to the test set.

Perplexity is the geometric average of $1 / P(w_n \mid w_{1:n-1})$.
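As a minimal sketch, perplexity can be computed from the probabilities a model assigns to each test word given its context. The probability lists below are made-up inputs, and the function name is an assumption.

```python
import math

def perplexity(word_probs):
    """Geometric mean of 1/p over the predicted words, computed in log space."""
    log_sum = sum(math.log(p) for p in word_probs)
    return math.exp(-log_sum / len(word_probs))

# A model that always spreads its mass uniformly over 4 words.
uniform_ppl = perplexity([0.25] * 10)

# Hypothetical per-word probabilities from some other model.
mixed_ppl = perplexity([0.5, 0.125])
print(uniform_ppl, mixed_ppl)
```

Working in log space avoids numerical underflow when the product runs over a long test set. A uniform guess over $k$ words gives perplexity $k$, so lower values mean the model is less "surprised" by the test data.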

Random vs. non-random trees

The effect of the feature dimensionality and the tree-building algorithm on the test set perplexity of the model.

Feature dimensionality | Perplexity using a RANDOM tree | Perplexity using a BALANCED tree | Reduction in perplexity

100
