Lecture 18: Information and learning

TTIC 31020: Introduction to Machine Learning

Instructor: Greg Shakhnarovich

TTI–Chicago

November 5, 2010

Bayesian Information Criterion

For a MoG model with k components in R^d:

|θ| = k (d + d(d + 1)/2) + k − 1.

For a model class M with parameters θ_M, we find ML (or MAP) estimates of the parameters on X = [x_1, ..., x_N]:

L*(M) ≜ max_{θ_M} log p(X | M; θ_M).

e.g., M = {mixtures of 5 Gaussians}

The BIC score for the model M on data X:

BIC(M) = L*(M) − (1/2) |θ_M| log N.
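As an added illustration (not part of the lecture; the helper names and the log-likelihood values are made up), a minimal Python sketch of the MoG parameter count and the resulting BIC score:

```python
import numpy as np

def mog_num_params(k: int, d: int) -> int:
    """|theta| for a k-component, full-covariance MoG in R^d:
    k means (d each), k covariances (d(d+1)/2 each), k-1 mixing weights."""
    return k * (d + d * (d + 1) // 2) + (k - 1)

def bic_score(loglik: float, k: int, d: int, n: int) -> float:
    """BIC(M) = L*(M) - 0.5 * |theta_M| * log N (higher is better)."""
    return loglik - 0.5 * mog_num_params(k, d) * np.log(n)

# Hypothetical maximized log-likelihoods for two model classes on N = 1000 points in R^3.
print(bic_score(loglik=-4200.0, k=2, d=3, n=1000))
print(bic_score(loglik=-4150.0, k=5, d=3, n=1000))
```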

Learning vs. communication

Suppose we want to communicate the data set X.

The receiver knows the model class M(θ).

We need to communicate θ̂ and the prediction errors.

The goal of learning: find the most efficient way of communicating this information.

  • A thought experiment: suppose we have an N-component mixture, with zero-covariance Gaussians centered on each data point.
  • No prediction errors! But we need to send the N means, and so we do not gain anything ⇒ we haven’t learned anything.


Coding: example

Suppose we had an alphabet with 8 letters; each letter appears with probability 1/8.

How many bits do we need to code an n-letter message?

Three bits per letter (log_2 8 = 3) ⇒ 3n bits total.

Optimal coding

Suppose we had an alphabet with m letters a_1, ..., a_m.

Probabilistic model of the language: for a letter A, p(A = a_i) = p_i.

We need to encode an n-letter message.

  • General idea for optimal code: encode the most frequent symbols with the shortest codewords.

Example: Huffman’s code (a minimal construction is sketched below). Suppose p(a) = 1/2, p(b) = p(c) = 1/4.

  • Code words: a → 0, b → 10, c → 11.
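As an added sketch (not from the notes), a minimal Huffman-code construction in Python using the standard-library heapq; the probabilities are those of the example above:

```python
import heapq

def huffman_code(probs: dict) -> dict:
    """Build a Huffman code for a {symbol: probability} dict; returns {symbol: bitstring}."""
    # Each heap entry: (subtree probability, tie-breaker, {symbol: partial codeword}).
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, code1 = heapq.heappop(heap)  # least probable subtree
        p2, _, code2 = heapq.heappop(heap)  # second least probable subtree
        # Merge: prepend 0 to one subtree's codewords and 1 to the other's.
        merged = {s: "0" + c for s, c in code1.items()}
        merged.update({s: "1" + c for s, c in code2.items()})
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

print(huffman_code({"a": 0.5, "b": 0.25, "c": 0.25}))
# -> {'a': '0', 'b': '10', 'c': '11'}: the code from the lecture example.
```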

Optimal codelength

Shannon’s optimal code: the codeword for letter a has length l(a) = log 1/p(a).

  • If the probabilities are not powers of two, we will incur some cost: ⌈log 1/p_i⌉.

Asymptotically, as n → ∞, the expected code length is

(1/n) E_{a∼p}[ Σ_{i=1}^n l(a_i) ] = Σ_{i=1}^m p_i log(1/p_i)

bits per letter, assuming i.i.d. letters. The quantity

H(A) ≜ Σ_{i=1}^m p_i log(1/p_i) = − Σ_{i=1}^m p_i log p_i

is the entropy of the random variable A.
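A quick numerical check I'm adding here: for the distribution in the Huffman example, the entropy equals the expected codeword length.

```python
import math

def entropy_bits(probs):
    """H(A) = -sum_i p_i log2 p_i, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

probs = [0.5, 0.25, 0.25]        # distribution from the Huffman example
code_lengths = [1, 2, 2]         # lengths of the codewords a -> 0, b -> 10, c -> 11
expected_length = sum(p * l for p, l in zip(probs, code_lengths))
print(entropy_bits(probs), expected_length)   # both equal 1.5 bits per letter
```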

Entropy as a measure of uncertainty

Entropy H(A) = − Σ_{i=1}^m p(a_i) log p(a_i) gives the amount of information gained from observing an instance of A.

  • Measured in bits (if using log_2) or in nats (if using log_e).

Example: Bernoulli, A ∈ {0, 1}, p(A = 1) = θ.

  • Highest entropy: fair coin.
  • Lowest entropy: fully biased coin.

[Figure: the binary entropy H(θ) for θ ∈ [0, 1], maximal at θ = 1/2.]
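A small added sketch of the binary entropy function, confirming that the fair coin has the highest entropy and a fully biased coin the lowest:

```python
import math

def binary_entropy(theta: float) -> float:
    """H(theta) = -theta log2(theta) - (1-theta) log2(1-theta), with H(0) = H(1) = 0."""
    if theta in (0.0, 1.0):
        return 0.0
    return -theta * math.log2(theta) - (1 - theta) * math.log2(1 - theta)

for theta in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(theta, round(binary_entropy(theta), 3))
# theta = 0.5 gives 1.0 bit (fair coin); theta = 0 or 1 gives 0 bits (no uncertainty).
```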

Coding of real numbers

With real-valued messages the code length depends on precision.

  • Precision means “the number of values we want to distinguish”;
  • Precision m means that we can discretize the real-valued A into m bins;
  • We will need log m nats to encode one value, assuming all values are equally likely.

A more precise treatment:
  • Compute p_i as the probability that the real value A falls into the i-th bin.
  • The expected optimal code length is again

Σ_{i=1}^m p_i log(1/p_i).
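To illustrate (an added sketch under my own assumptions: a standard Gaussian source, with bin probabilities estimated by a histogram), the optimal code for the discretized variable needs fewer nats than the naive log m:

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=100_000)  # real-valued A

m = 32                                      # precision: number of bins to distinguish
counts, _ = np.histogram(samples, bins=m)
p = counts / counts.sum()                   # p_i: probability of falling in bin i
p = p[p > 0]

naive_nats = np.log(m)                      # log m nats if all bins were equally likely
optimal_nats = np.sum(p * np.log(1.0 / p))  # sum_i p_i log(1/p_i)
print(naive_nats, optimal_nats)             # the optimal code is shorter on average
```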

Description length: data

We have a generative model p(x | θ) that, for the given data set, produces the log-likelihood

log p(X | θ) = Σ_{i=1}^N log p(x_i | θ).

We can build the code for the data, assuming p(x | θ) is the true distribution.

  • If the receiver knows θ, this is all we need to send!

The description length of the message containing the data:

Σ_{i=1}^N log( 1 / p(x_i | θ) ) = − Σ_{i=1}^N log p(x_i | θ) = − log p(X | θ).
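A concrete (added) sketch, assuming a univariate Gaussian model fit by maximum likelihood: the description length of the data, in nats, is just the negative log-likelihood.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(loc=2.0, scale=1.5, size=500)   # the data set to communicate

# ML estimates of theta = (mu, sigma) for a Gaussian model p(x | theta).
mu_hat, sigma_hat = X.mean(), X.std()

# log p(x_i | theta_hat) for a univariate Gaussian
log_p = -0.5 * np.log(2 * np.pi * sigma_hat**2) - (X - mu_hat) ** 2 / (2 * sigma_hat**2)

dl_data = -np.sum(log_p)   # description length of the data: -log p(X | theta_hat), in nats
print(dl_data)
```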


Description length: data and parameters

The receiver can’t know θ, so we need to also transmit θ̂. We will discretize θ̂ to precision 1/√N.

  • Intuitive argument: with N data points, the estimation error for θ̂ is roughly 1/√N.
  • We will need log √N nats to encode each component of θ̂.

Total description length with k parameters:

DL(X, θ̂) ≈ − Σ_{i=1}^N log p(x_i | θ̂) + k log √N = − Σ_{i=1}^N log p(x_i | θ̂) + (k/2) log N

BIC:

BIC(M) = L*(M) − (1/2) k log N,

i.e., minimizing MDL ⇒ maximizing BIC score.
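A tiny added check with made-up maximized log-likelihoods: minimizing the description length and maximizing BIC select the same model.

```python
import numpy as np

N = 1000
# Hypothetical maximized log-likelihoods L*(M), keyed by number of parameters k.
models = {5: -4300.0, 19: -4180.0, 49: -4165.0}

dl  = {k: -ll + 0.5 * k * np.log(N) for k, ll in models.items()}  # MDL, in nats
bic = {k:  ll - 0.5 * k * np.log(N) for k, ll in models.items()}  # BIC score

print(min(dl, key=dl.get), max(bic, key=bic.get))  # the same k wins under both criteria
```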

BIC for classification

A similar communication setup:

  • The receiver knows the model class and the inputs x_1, ..., x_N.
  • We need to send θ̂ and the classification errors ŷ − y.
  • Again, a trivial solution: memorize the data (NN classifier); (almost) zero DL for errors, but high DL for the parameters.

Here the code length is given by the conditional entropy

H(y | x) = − Σ_{c=1}^C p(y = c | x) log p(y = c | x)

The BIC score:

BIC(M) = Σ_{i=1}^N log p(y_i | x_i, θ̂) − (π(M)/2) log N,

where π(M) is the number of parameters of M.
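An added illustration of the conditional entropy term, for a single input with a 3-class predictive distribution (the numbers are made up):

```python
import numpy as np

def conditional_entropy_bits(p_y_given_x: np.ndarray) -> float:
    """H(y | x) = -sum_c p(y=c|x) log2 p(y=c|x) for one input x."""
    p = p_y_given_x[p_y_given_x > 0]
    return float(-np.sum(p * np.log2(p)))

print(conditional_entropy_bits(np.array([0.8, 0.1, 0.1])))  # confident prediction: ~0.92 bits
print(conditional_entropy_bits(np.array([1/3, 1/3, 1/3])))  # uniform: log2(3) ≈ 1.58 bits
```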

Learning and coding

Suppose we have a discrete random variable X with distribution p, where p_i ≜ Pr(X = i), i = 1, ..., m.

Optimal code (knowing p) has expected length per observation

L(p) = − Σ_{i=1}^m p_i log p_i.

Suppose now we think (estimate) the distribution is p̂ = q.

  • We build a code with codeword lengths − log q_i;
  • The expected length is

L(q) = − Σ_{i=1}^m p_i log q_i.
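A small added numerical check: coding with an estimated distribution q never beats the optimal code built from the true p; the gap L(q) − L(p) is the KL divergence and is nonnegative.

```python
import numpy as np

p = np.array([0.5, 0.25, 0.25])   # true distribution
q = np.array([0.7, 0.2, 0.1])     # estimated (wrong) distribution

L_p = -np.sum(p * np.log2(p))     # entropy: expected length of the optimal code
L_q = -np.sum(p * np.log2(q))     # expected length when coding with q instead
print(L_p, L_q, L_q - L_p)        # L(q) >= L(p); the gap is KL(p || q)
```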