TTIC 31020: Introduction to Machine Learning
Instructor: Greg Shakhnarovich
TTI–Chicago
November 5, 2010
For a MoG model with k components in R^d:
|θ| = k (d + d(d + 1)/2) + k − 1.
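As a quick sanity check on this count, here is a minimal Python sketch (the function name mog_num_params is just for illustration):

    def mog_num_params(k, d):
        # k mean vectors (d each), k full covariance matrices (d(d+1)/2 each),
        # and k - 1 free mixing weights (they must sum to 1)
        return k * (d + d * (d + 1) // 2) + (k - 1)

    print(mog_num_params(5, 2))   # 29 parameters for 5 full-covariance Gaussians in R^2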
For a model class M with parameters θ_M, we find ML (or MAP) estimates of the parameters on X = [x_1, ..., x_N]:

L*(M) ≜ max_{θ_M} log p(X | M; θ_M).
e.g., M = {mixtures of 5 Gaussians}
The BIC score for the model M on data X:

BIC(M) = L*(M) − (|θ_M| / 2) log N.
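A minimal sketch of this score, assuming we already have the maximized log-likelihood L*(M) (the numbers below are purely illustrative):

    import math

    def bic_score(loglik_star, num_params, N):
        # BIC(M) = L*(M) - |theta_M|/2 * log N  (higher is better in this convention)
        return loglik_star - 0.5 * num_params * math.log(N)

    # comparing two hypothetical mixture fits on N = 1000 points in R^2
    print(bic_score(-2450.0, 29, 1000))   # 5 components
    print(bic_score(-2440.0, 59, 1000))   # 10 components: better fit, but larger penalty

Note that some libraries report BIC as −2 L*(M) + |θ_M| log N with lower being better; the sign and scale differ, but the ranking of models is the same.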
Suppose we want to communicate the data set X.
The receiver knows the model class M(θ).
We need to communicate: θ̂ and the prediction errors.
The goal of learning: find the most efficient way of communicating this information.
Suppose we had an alphabet with 8 letters; each letter appears with probability 1/8.
How many bits do we need to code an n-letter message?
Three bits per letter ⇒ 3n bits total.
Suppose we had an alphabet with m letters a_1, ..., a_m.
Probabilistic model of the language: for a letter A, p(A = a_i) = p_i.
We need to encode an n-letter message.
Example: Huffman’s code. Suppose p(a) = 1/2, p(b) = p(c) = 1/4.
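For this particular distribution the optimal prefix code is easy to write down by hand; a small sketch (the codewords match the usual Huffman construction):

    probs = {'a': 0.5, 'b': 0.25, 'c': 0.25}
    code  = {'a': '0', 'b': '10', 'c': '11'}   # prefix-free

    expected_len = sum(probs[s] * len(code[s]) for s in probs)
    print(expected_len)   # 1.5 bits per letter, vs. 2 bits for a fixed-length code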
Shannon’s optimal code: the codeword for letter a has length l(a) = log(1/p(a)).
The expected code length is then

(1/n) E_{a∼p} [ ∑_{i=1}^n l(a_i) ] = ∑_{i=1}^m p_i log(1/p_i)

bits per letter, assuming i.i.d. letters. The quantity

∑_{i=1}^m p_i log(1/p_i) = −∑_{i=1}^m p_i log p_i

is the entropy of the random variable A.
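This quantity can be computed directly; a small sketch using base-2 logs so the answer is in bits:

    import math

    def entropy(p, base=2.0):
        # H = -sum_i p_i log p_i, with 0 log 0 taken to be 0
        return -sum(pi * math.log(pi, base) for pi in p if pi > 0)

    print(entropy([0.5, 0.25, 0.25]))   # 1.5 bits: matches the Huffman code above
    print(entropy([1/8.0] * 8))         # 3.0 bits: the uniform 8-letter alphabet above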
Entropy H(A) = −∑_{i=1}^m p(a_i) log p(a_i) gives the amount of information gained from observing an instance of A.
Example: Bernoulli, A ∈ {0, 1}, p(A = 1) = θ.

[Plot: the binary entropy H(θ) as a function of θ.]
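The shape of that curve can be checked numerically (a small sketch, values in bits):

    import math

    # binary entropy H(theta) = -theta log theta - (1 - theta) log(1 - theta)
    for theta in [0.01, 0.1, 0.5, 0.9, 0.99]:
        H = -sum(p * math.log(p, 2) for p in (theta, 1.0 - theta) if p > 0)
        print(theta, H)
    # maximal (1 bit) at theta = 0.5, approaching 0 as theta -> 0 or theta -> 1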
With real-valued messages the code length depends on precision.
We have a generative model p(x | θ), which for the given data set produces the log-likelihood

log p(X | θ) = ∑_{i=1}^N log p(x_i | θ).
We can build the code for the data, assuming p (x | θ) is the true distribution.
The description length of the message containing the data:

∑_{i=1}^N log(1 / p(x_i | θ)) = −∑_{i=1}^N log p(x_i | θ) = −log p(X | θ).
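As an illustration (not from the lecture), the code length in nats of an i.i.d. sample under a univariate Gaussian fit by maximum likelihood:

    import math, random

    random.seed(0)
    X = [random.gauss(0.0, 1.0) for _ in range(1000)]

    mu = sum(X) / len(X)                              # ML mean
    var = sum((x - mu) ** 2 for x in X) / len(X)      # ML variance

    # -log p(X | theta_hat): the length of the message encoding the data given the model
    neg_loglik = sum(0.5 * math.log(2 * math.pi * var) + (x - mu) ** 2 / (2 * var) for x in X)
    print(neg_loglik)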
The receiver can’t know θ, so we need to also transmit θ̂. We discretize θ to precision 1/√N, i.e. √N distinct values per component ⇒ (1/2) log N nats to encode each component of θ̂. Total description length with k parameters:

DL(X, θ̂) ≈ −∑_{i=1}^N log p(x_i | θ̂) + (k/2) log N
i.e., minimizing MDL ⇒ maximizing BIC score.
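Concretely, with the definitions above the description length is just the negative BIC score, so the two criteria rank models identically; a minimal sketch (reusing the illustrative numbers from the BIC example):

    import math

    def description_length(loglik_hat, num_params, N):
        # DL(X, theta_hat) ~ -log p(X | theta_hat) + k/2 * log N = -BIC(M)
        return -loglik_hat + 0.5 * num_params * math.log(N)

    print(description_length(-2450.0, 29, 1000))   # equals -bic_score(-2450.0, 29, 1000)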
A similar communication setup applies in the supervised case, where the receiver already knows the inputs x and we need to transmit the labels y:

H(y | x) = −∑_{c=1}^C p(y = c | x) log p(y = c | x)
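For a single input x with predicted class probabilities, this is just the entropy of the predictive distribution; a small sketch with illustrative numbers:

    import math

    p_y_given_x = [0.7, 0.2, 0.1]   # hypothetical p(y = c | x) for one input x

    H = -sum(p * math.log(p, 2) for p in p_y_given_x if p > 0)
    print(H)   # bits needed to transmit the label y when the receiver already knows x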
The BIC score:

BIC(M) = ∑_{i=1}^N log p(y_i | x_i, θ̂) − (π(M)/2) log N,

where π(M) = |θ_M| is the number of parameters of M.
Suppose we have a discrete random variable X with distribution p, p_i ≜ Pr(X = i), i = 1, ..., m.
Optimal code (knowing p) has expected length per observation
L(p) = −∑_{i=1}^m p_i log p_i.
Suppose now we think (estimate) the distribution is p̂ = q.
The expected code length per observation using the code built for q is

L(q) = −∑_{i=1}^m p_i log q_i.
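A quick numerical check that coding with the wrong distribution can only cost us bits, i.e. L(q) ≥ L(p) (the distributions below are illustrative):

    import math

    p = [0.5, 0.25, 0.25]          # true distribution
    q = [1/3.0, 1/3.0, 1/3.0]      # assumed distribution used to build the code

    L_p = -sum(pi * math.log(pi, 2) for pi in p)               # optimal: the entropy, 1.5 bits
    L_q = -sum(pi * math.log(qi, 2) for pi, qi in zip(p, q))   # about 1.585 bits
    print(L_p, L_q)   # the gap is the KL divergence between p and q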