



Chris Piech
CS
Handout # May 13th, 2016
Consider IID random samples $X_1, X_2, \dots, X_n$, where each $X_i$ is a sample from the density function $f(X_i \mid \theta)$. We are going to introduce a new way of choosing parameters called Maximum Likelihood Estimation (MLE). We want to select the parameters (θ) that make the observed data the most likely. Note that we are now using notation that shows that the density of X depends on its parameters, θ.
First we define the likelihood of our data given parameters θ:
$$L(\theta) = \prod_{i=1}^{n} f(X_i \mid \theta)$$
This is the probability (or, for continuous data, the joint density) of all of our data. It evaluates to a product because all $X_i$ are independent. Now we choose the value of θ that maximizes the likelihood function. Formally, $\hat{\theta} = \operatorname{argmax}_{\theta} L(\theta)$.
A cool property of argmax is that since log is a monotonically increasing function, the argmax of a function is the same as the argmax of the log of the function! That’s nice because logs make the math simpler. Instead of using the likelihood, you should use the log likelihood: LL(θ).
$$LL(\theta) = \log \prod_{i=1}^{n} f(X_i \mid \theta) = \sum_{i=1}^{n} \log f(X_i \mid \theta)$$
To use a maximum likelihood estimator, first write the log likelihood of the data given your parameters. Then choose the values of the parameters that maximize the log likelihood function. Argmax can be computed in many ways. Most require computing the first derivative of the function.
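As a quick numerical illustration of the argmax property, here is a minimal Python sketch. The Bernoulli model, the made-up data, and the helper names (`pmf`, `likelihood`, `log_likelihood`) are assumptions chosen for this example, not part of the handout. It searches a grid of candidate θ values and checks that the likelihood and the log likelihood pick the same one. It also shows a practical side benefit of logs: with many samples the raw product can underflow to zero in floating point, while the sum of logs stays finite.

```python
import math

def pmf(x, theta):
    # Illustrative model (an assumption): Bernoulli with parameter theta.
    return theta if x == 1 else 1.0 - theta

def likelihood(data, theta):
    # L(theta): product of the per-sample probabilities.
    return math.prod(pmf(x, theta) for x in data)

def log_likelihood(data, theta):
    # LL(theta): sum of the per-sample log probabilities.
    return sum(math.log(pmf(x, theta)) for x in data)

data = [1, 0, 0, 1, 1, 0, 1, 1, 1, 0]      # made-up observations: 6 ones, 4 zeros
grid = [k / 100 for k in range(1, 100)]    # candidate thetas strictly inside (0, 1)

# Grid search: the same theta should win under both objectives.
best_by_l = max(grid, key=lambda t: likelihood(data, t))
best_by_ll = max(grid, key=lambda t: log_likelihood(data, t))
print(best_by_l, best_by_ll)

# With 10,000 samples the product is too small for a float,
# but the log likelihood is still a usable finite number.
big = data * 1000
underflowed = likelihood(big, 0.6)
still_finite = log_likelihood(big, 0.6)
print(underflowed, still_finite)
```

Grid search is only one way to compute an argmax; it is used here because it needs no calculus, not because it is the preferred method.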
Consider IID random variables $X_1, X_2, \dots, X_n$ where $X_i \sim \text{Ber}(p)$. First we are going to write the PMF of a Bernoulli in a crazy way: the probability mass function is $f(X_i \mid p) = p^{X_i}(1-p)^{1-X_i}$. Wow! What's up with that? First convince yourself that when $X_i = 0$ and $X_i = 1$ this returns the right probabilities. We write the PMF this way because it's differentiable.
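One way to convince yourself is to evaluate the exponent form at both outcomes. The short Python sketch below does that; the function name `bernoulli_pmf` and the value p = 0.3 are arbitrary choices made for illustration.

```python
import math

def bernoulli_pmf(x, p):
    # Exponent form of the Bernoulli PMF: p^x * (1 - p)^(1 - x), for x in {0, 1}.
    return p ** x * (1 - p) ** (1 - x)

p = 0.3                          # arbitrary example parameter
prob_one = bernoulli_pmf(1, p)   # the (1 - p) factor has exponent 0, leaving p
prob_zero = bernoulli_pmf(0, p)  # the p factor has exponent 0, leaving 1 - p
print(prob_one, prob_zero)
```

When x = 1 the second factor is raised to the power 0 and drops out, and when x = 0 the first factor drops out, so the single formula covers both cases.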
Now let’s do some MLE estimation:
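$$\begin{aligned}
L(\theta) &= \prod_{i=1}^{n} p^{X_i}(1-p)^{1-X_i} \\
LL(\theta) &= \sum_{i=1}^{n} \log\left[ p^{X_i}(1-p)^{1-X_i} \right] \\
&= \sum_{i=1}^{n} \left[ X_i \log p + (1-X_i)\log(1-p) \right] \\
&= Y \log p + (n-Y)\log(1-p) \qquad \text{where } Y = \sum_{i=1}^{n} X_i
\end{aligned}$$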
Great Scott! Now we simply need to choose the value of p that maximizes our log likelihood. One way to do that is to find the first derivative and set it equal to 0.
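$$\begin{aligned}
\frac{\partial LL(p)}{\partial p} &= Y\,\frac{1}{p} + (n-Y)\,\frac{-1}{1-p} = 0 \\
\hat{p} &= \frac{Y}{n} = \frac{\sum_{i=1}^{n} X_i}{n}
\end{aligned}$$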
All that work and we get the same thing as the method of moments and the sample mean...
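The Bernoulli result can be sanity-checked numerically. The following is a minimal Python sketch under stated assumptions: the ten observations are made up, and the function names are hypothetical. It computes the sample mean, confirms that the derivative of the log likelihood is (numerically) zero there, and confirms that nearby values of p give a lower log likelihood.

```python
import math

def bernoulli_log_likelihood(y, n, p):
    # LL(p) = Y log p + (n - Y) log(1 - p), where Y is the number of ones.
    return y * math.log(p) + (n - y) * math.log(1.0 - p)

def bernoulli_ll_derivative(y, n, p):
    # d LL / d p = Y / p - (n - Y) / (1 - p)
    return y / p - (n - y) / (1.0 - p)

samples = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]   # made-up data: 3 ones out of 10
n = len(samples)
y = sum(samples)

p_hat = y / n                               # closed-form MLE: the sample mean
slope_at_p_hat = bernoulli_ll_derivative(y, n, p_hat)

ll_at_p_hat = bernoulli_log_likelihood(y, n, p_hat)
ll_below = bernoulli_log_likelihood(y, n, p_hat - 0.01)
ll_above = bernoulli_log_likelihood(y, n, p_hat + 0.01)
print(p_hat, slope_at_p_hat, ll_below, ll_at_p_hat, ll_above)
```

Note that the closed form breaks down in the sketch if every sample is 0 or every sample is 1, since the log of 0 is undefined; the example data avoids that case.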
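Consider IID random variables $X_1, X_2, \dots, X_n$ where $X_i \sim N(\mu, \sigma^2)$.
$$\begin{aligned}
L(\theta) &= \prod_{i=1}^{n} f(X_i \mid \mu, \sigma^2) \\
&= \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(X_i-\mu)^2}{2\sigma^2}} \\
LL(\theta) &= \sum_{i=1}^{n} \log\left[ \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(X_i-\mu)^2}{2\sigma^2}} \right] \\
&= \sum_{i=1}^{n} \left[ -\log\left(\sqrt{2\pi}\,\sigma\right) - \frac{1}{2\sigma^2}(X_i-\mu)^2 \right]
\end{aligned}$$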
If we choose the values of $\hat{\mu}$ and $\hat{\sigma}^2$ that maximize the likelihood, we get $\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} X_i$ and $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} (X_i - \hat{\mu})^2$.
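The handout states these estimates without the intermediate step. One way to reach them, sketched here as a supplement rather than as the author's own derivation, is to take the partial derivatives of the Normal log likelihood with respect to μ and σ and set each to zero:

$$\begin{aligned}
\frac{\partial LL}{\partial \mu} &= \sum_{i=1}^{n} \frac{X_i - \mu}{\sigma^2} = 0
\quad\Longrightarrow\quad \hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} X_i \\
\frac{\partial LL}{\partial \sigma} &= \sum_{i=1}^{n} \left[ -\frac{1}{\sigma} + \frac{(X_i - \mu)^2}{\sigma^3} \right] = 0
\quad\Longrightarrow\quad \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} (X_i - \hat{\mu})^2
\end{aligned}$$

The variance estimate divides by n, not n − 1, so it is the biased sample variance.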
Assume that $Y = \theta X + Z$ where $Z \sim N(0, \sigma^2)$ and X has an unknown distribution. The equations imply that $Y \mid X \sim N(\theta X, \sigma^2)$. Choose a value of θ that maximizes the probability of the data: $(X_1, Y_1), (X_2, Y_2), \dots, (X_n, Y_n)$.
We approach this problem by finding a function for the log likelihood of the data given θ. Then we find the value of θ that maximizes the log likelihood function. To start, use the PDF of a Normal to express the probability of $Y \mid X, \theta$:
$$f(Y_i \mid X_i, \theta) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(Y_i - \theta X_i)^2}{2\sigma^2}}$$
Now we are ready to write the likelihood function, then take its log to get the log likelihood function:
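$$\begin{aligned}
L(\theta) &= \prod_{i=1}^{n} f(Y_i, X_i \mid \theta) && \text{Let's break up this joint} \\
&= \prod_{i=1}^{n} f(Y_i \mid X_i, \theta)\, f(X_i) && f(X_i) \text{ is independent of } \theta \\
&= \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(Y_i - \theta X_i)^2}{2\sigma^2}}\, f(X_i) && \text{Substitute in the definition of } f(Y_i \mid X_i)
\end{aligned}$$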
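$$\begin{aligned}
LL(\theta) &= \log L(\theta) \\
&= \log \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(Y_i - \theta X_i)^2}{2\sigma^2}}\, f(X_i) && \text{Substitute in } L(\theta) \\
&= \sum_{i=1}^{n} \log\left[ \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(Y_i - \theta X_i)^2}{2\sigma^2}} \right] + \sum_{i=1}^{n} \log f(X_i) && \text{Log of a product is the sum of logs} \\
&= n \log \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (Y_i - \theta X_i)^2 + \sum_{i=1}^{n} \log f(X_i)
\end{aligned}$$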
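Remove positive constant multipliers and terms that don’t include θ. We are left with trying to find a value of θ that maximizes the negated sum of squared errors:
$$\begin{aligned}
\hat{\theta} &= \underset{\theta}{\operatorname{argmax}} \; -\sum_{i=1}^{n} (Y_i - \theta X_i)^2 \\
&= \underset{\theta}{\operatorname{argmin}} \; \sum_{i=1}^{n} (Y_i - \theta X_i)^2
\end{aligned}$$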
This result says that the value of θ that makes the data most likely is one that minimizes the squared error of predictions of Y. We will see in a few days that this is the basis for linear regression.
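For this one-parameter model the minimizer has a closed form. Setting the derivative of the sum of squared errors with respect to θ to zero gives θ = Σ XᵢYᵢ / Σ Xᵢ². That step is not in the handout; it follows from the argmin above. The Python sketch below applies it to made-up data; the function names `squared_error` and `fit_theta` are hypothetical. It then checks that nudging θ in either direction increases the squared error.

```python
def squared_error(xs, ys, theta):
    # Sum over i of (Y_i - theta * X_i)^2
    return sum((y - theta * x) ** 2 for x, y in zip(xs, ys))

def fit_theta(xs, ys):
    # Solving d/dtheta sum (Y_i - theta X_i)^2 = 0 gives
    # theta = sum(X_i * Y_i) / sum(X_i^2). Requires at least one nonzero X_i.
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

xs = [1.0, 2.0, 3.0, 4.0, 5.0]       # made-up inputs
ys = [2.1, 3.9, 6.2, 7.8, 10.1]      # made-up outputs, roughly Y = 2X plus noise

theta_hat = fit_theta(xs, ys)
err_at_hat = squared_error(xs, ys, theta_hat)
err_below = squared_error(xs, ys, theta_hat - 0.01)
err_above = squared_error(xs, ys, theta_hat + 0.01)
print(theta_hat, err_below, err_at_hat, err_above)
```

The noise variance σ² never appears in the sketch, which matches the derivation: it only scales the objective and does not change which θ wins.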