Lecture 18: Information and learning

TTIC 31020: Introduction to Machine Learning

Instructor: Greg Shakhnarovich

TTI–Chicago

November 5, 2010

Bayesian Information Criterion

For a MoG model with k components in R^d:

|θ| = k (d + d(d + 1)/2) + k − 1.

For a model class M with parameters θ_M, we find ML (or MAP) estimates of the parameters on X = [x_1, ..., x_N]:

L*(M) ≜ max_{θ_M} log p(X | M; θ_M).

e.g., M = {mixtures of 5 Gaussians}

The BIC score for the model M on data X:

BIC(M) = L*(M) − (1/2) |θ_M| log N.
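As an added illustration (not part of the lecture; the helper names and the log-likelihood values are made up), a minimal Python sketch of the MoG parameter count and the resulting BIC score:

```python
import numpy as np

def mog_num_params(k: int, d: int) -> int:
    """|theta| for a k-component, full-covariance MoG in R^d:
    k means (d each), k covariances (d(d+1)/2 each), k-1 mixing weights."""
    return k * (d + d * (d + 1) // 2) + (k - 1)

def bic_score(loglik: float, k: int, d: int, n: int) -> float:
    """BIC(M) = L*(M) - 0.5 * |theta_M| * log N (higher is better)."""
    return loglik - 0.5 * mog_num_params(k, d) * np.log(n)

# Hypothetical maximized log-likelihoods for two model classes on N = 1000 points in R^3.
print(bic_score(loglik=-4200.0, k=2, d=3, n=1000))
print(bic_score(loglik=-4150.0, k=5, d=3, n=1000))
```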

Learning vs. communication

Suppose we want to communicate the data set X.

The receiver knows the model class M(θ).

We need to communicate θ̂ and the prediction errors.

The goal of learning: find the most efficient way of communicating this information.

  • A thought experiment: suppose we have an N-component mixture, with zero-covariance Gaussians centered on each data point.
  • No prediction errors! But we need to send the N means, and so we do not gain anything ⇒ we haven’t learned anything.


Coding: example

Suppose we had an alphabet with 8 letters; each letter appears with probability 1/8.

How many bits do we need to code an n-letter message?

Three bits per letter (log_2 8 = 3) ⇒ 3n bits total.

Optimal coding

Suppose we had an alphabet with m letters a_1, ..., a_m.

Probabilistic model of the language: for a letter A, p(A = a_i) = p_i.

We need to encode an n-letter message.

  • General idea for optimal code: encode the most frequent symbols with the shortest codewords.

Example: Huffman’s code (a minimal construction is sketched below). Suppose p(a) = 1/2, p(b) = p(c) = 1/4.

  • Code words: a → 0, b → 10, c → 11.
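As an added sketch (not from the notes), a minimal Huffman-code construction in Python using the standard-library heapq; the probabilities are those of the example above:

```python
import heapq

def huffman_code(probs: dict) -> dict:
    """Build a Huffman code for a {symbol: probability} dict; returns {symbol: bitstring}."""
    # Each heap entry: (subtree probability, tie-breaker, {symbol: partial codeword}).
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, code1 = heapq.heappop(heap)  # least probable subtree
        p2, _, code2 = heapq.heappop(heap)  # second least probable subtree
        # Merge: prepend 0 to one subtree's codewords and 1 to the other's.
        merged = {s: "0" + c for s, c in code1.items()}
        merged.update({s: "1" + c for s, c in code2.items()})
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

print(huffman_code({"a": 0.5, "b": 0.25, "c": 0.25}))
# -> {'a': '0', 'b': '10', 'c': '11'}: the code from the lecture example.
```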

Optimal codelength

Shannon’s optimal code: the codeword for letter a has length l(a) = log 1/p(a).

  • If the probabilities are not powers of two, we will incur some cost: ⌈log 1/p_i⌉.

Asymptotically, as n → ∞, the expected code length is

(1/n) E_{a∼p}[ Σ_{i=1}^n l(a_i) ] = Σ_{i=1}^m p_i log(1/p_i)

bits per letter, assuming i.i.d. letters. The quantity

H(A) ≜ Σ_{i=1}^m p_i log(1/p_i) = − Σ_{i=1}^m p_i log p_i

is the entropy of the random variable A.
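A quick numerical check I'm adding here: for the distribution in the Huffman example, the entropy equals the expected codeword length.

```python
import math

def entropy_bits(probs):
    """H(A) = -sum_i p_i log2 p_i, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

probs = [0.5, 0.25, 0.25]        # distribution from the Huffman example
code_lengths = [1, 2, 2]         # lengths of the codewords a -> 0, b -> 10, c -> 11
expected_length = sum(p * l for p, l in zip(probs, code_lengths))
print(entropy_bits(probs), expected_length)   # both equal 1.5 bits per letter
```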

Entropy as a measure of uncertainty

Entropy H(A) = − Σ_{i=1}^m p(a_i) log p(a_i) gives the amount of information gained from observing an instance of A.

  • Measured in bits (if using log_2) or in nats (if using log_e).

Example: Bernoulli, A ∈ {0, 1}, p(A = 1) = θ.

  • Highest entropy: fair coin.
  • Lowest entropy: fully biased coin.

[Figure: the binary entropy H(θ) for θ ∈ [0, 1], maximal at θ = 1/2.]
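A small added sketch of the binary entropy function, confirming that the fair coin has the highest entropy and a fully biased coin the lowest:

```python
import math

def binary_entropy(theta: float) -> float:
    """H(theta) = -theta log2(theta) - (1-theta) log2(1-theta), with H(0) = H(1) = 0."""
    if theta in (0.0, 1.0):
        return 0.0
    return -theta * math.log2(theta) - (1 - theta) * math.log2(1 - theta)

for theta in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(theta, round(binary_entropy(theta), 3))
# theta = 0.5 gives 1.0 bit (fair coin); theta = 0 or 1 gives 0 bits (no uncertainty).
```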

Coding of real numbers

With real-valued messages the code length depends on precision.

  • Precision means “the number of values we want to distinguish”;
  • Precision m means that we can discretize the real-valued A into m bins;
  • We will need log m nats to encode one value, assuming all values are equally likely.

A more precise treatment:
  • Compute p_i as the probability that the real value A falls into the i-th bin.
  • The expected optimal code length is again

Σ_{i=1}^m p_i log(1/p_i).
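To illustrate (an added sketch under my own assumptions: a standard Gaussian source, with bin probabilities estimated by a histogram), the optimal code for the discretized variable needs fewer nats than the naive log m:

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=100_000)  # real-valued A

m = 32                                      # precision: number of bins to distinguish
counts, _ = np.histogram(samples, bins=m)
p = counts / counts.sum()                   # p_i: probability of falling in bin i
p = p[p > 0]

naive_nats = np.log(m)                      # log m nats if all bins were equally likely
optimal_nats = np.sum(p * np.log(1.0 / p))  # sum_i p_i log(1/p_i)
print(naive_nats, optimal_nats)             # the optimal code is shorter on average
```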

Description length: data

We have a generative model p(x | θ) that, for the given data set, produces the log-likelihood

log p(X | θ) = Σ_{i=1}^N log p(x_i | θ).

We can build the code for the data, assuming p(x | θ) is the true distribution.

  • If the receiver knows θ, this is all we need to send!

The description length of the message containing the data:

Σ_{i=1}^N log( 1 / p(x_i | θ) ) = − Σ_{i=1}^N log p(x_i | θ) = − log p(X | θ).
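A concrete (added) sketch, assuming a univariate Gaussian model fit by maximum likelihood: the description length of the data, in nats, is just the negative log-likelihood.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(loc=2.0, scale=1.5, size=500)   # the data set to communicate

# ML estimates of theta = (mu, sigma) for a Gaussian model p(x | theta).
mu_hat, sigma_hat = X.mean(), X.std()

# log p(x_i | theta_hat) for a univariate Gaussian
log_p = -0.5 * np.log(2 * np.pi * sigma_hat**2) - (X - mu_hat) ** 2 / (2 * sigma_hat**2)

dl_data = -np.sum(log_p)   # description length of the data: -log p(X | theta_hat), in nats
print(dl_data)
```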


Description length: data and parameters

The receiver can’t know θ, so we need to also transmit θ̂. We will discretize θ̂ to precision 1/√N.

  • Intuitive argument: with N data points, the estimation error for θ̂ is roughly 1/√N.
  • We will need log √N nats to encode each component of θ̂.

Total description length with k parameters:

DL(X, θ̂) ≈ − Σ_{i=1}^N log p(x_i | θ̂) + k log √N = − Σ_{i=1}^N log p(x_i | θ̂) + (k/2) log N

BIC:

BIC(M) = L*(M) − (1/2) k log N,

i.e., minimizing MDL ⇒ maximizing BIC score.
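A tiny added check with made-up maximized log-likelihoods: minimizing the description length and maximizing BIC select the same model.

```python
import numpy as np

N = 1000
# Hypothetical maximized log-likelihoods L*(M), keyed by number of parameters k.
models = {5: -4300.0, 19: -4180.0, 49: -4165.0}

dl  = {k: -ll + 0.5 * k * np.log(N) for k, ll in models.items()}  # MDL, in nats
bic = {k:  ll - 0.5 * k * np.log(N) for k, ll in models.items()}  # BIC score

print(min(dl, key=dl.get), max(bic, key=bic.get))  # the same k wins under both criteria
```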

BIC for classification

A similar communication setup:

  • The receiver knows the model class and the inputs x_1, ..., x_N.
  • We need to send θ̂ and the classification errors ŷ − y.
  • Again, a trivial solution: memorize the data (NN classifier); (almost) zero DL for errors, but high DL for the parameters.

Here the code length is given by the conditional entropy

H(y | x) = − Σ_{c=1}^C p(y = c | x) log p(y = c | x)

The BIC score:

BIC(M) = Σ_{i=1}^N log p(y_i | x_i, θ̂) − (π(M)/2) log N,

where π(M) is the number of parameters of M.
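An added illustration of the conditional entropy term, for a single input with a 3-class predictive distribution (the numbers are made up):

```python
import numpy as np

def conditional_entropy_bits(p_y_given_x: np.ndarray) -> float:
    """H(y | x) = -sum_c p(y=c|x) log2 p(y=c|x) for one input x."""
    p = p_y_given_x[p_y_given_x > 0]
    return float(-np.sum(p * np.log2(p)))

print(conditional_entropy_bits(np.array([0.8, 0.1, 0.1])))  # confident prediction: ~0.92 bits
print(conditional_entropy_bits(np.array([1/3, 1/3, 1/3])))  # uniform: log2(3) ≈ 1.58 bits
```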

Learning and coding

Suppose we have a discrete random variable X with distribution p, where p_i ≜ Pr(X = i), i = 1, ..., m.

Optimal code (knowing p) has expected length per observation

L(p) = − Σ_{i=1}^m p_i log p_i.

Suppose now we think (estimate) the distribution is p̂ = q.

  • We build a code with codeword lengths − log q_i;
  • The expected length is

L(q) = − Σ_{i=1}^m p_i log q_i.
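A small added numerical check: coding with an estimated distribution q never beats the optimal code built from the true p; the gap L(q) − L(p) is the KL divergence and is nonnegative.

```python
import numpy as np

p = np.array([0.5, 0.25, 0.25])   # true distribution
q = np.array([0.7, 0.2, 0.1])     # estimated (wrong) distribution

L_p = -np.sum(p * np.log2(p))     # entropy: expected length of the optimal code
L_q = -np.sum(p * np.log2(q))     # expected length when coding with q instead
print(L_p, L_q, L_q - L_p)        # L(q) >= L(p); the gap is KL(p || q)
```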