
Maximum Likelihood
Chris Piech
CS109
Handout #35
May 13th, 2016
Consider IID random samples $X_1, X_2, \dots, X_n$ where $X_i$ is a sample from the density function $f(X_i \mid \theta)$. We are
going to introduce a new way of choosing parameters called Maximum Likelihood Estimation (MLE). We
want to select the parameters ($\theta$) that make the observed data the most likely. Note that we are now using
notation that shows that the density of $X$ depends on its parameters, $\theta$.
First we define the likelihood of our data given parameters $\theta$:
$$L(\theta) = \prod_{i=1}^{n} f(X_i \mid \theta)$$
This is the probability of all of our data (the joint density, when the $X_i$ are continuous). It evaluates to a product because all $X_i$ are independent. Now we
choose the value of $\theta$ that maximizes the likelihood function. Formally:
$$\hat{\theta} = \arg\max_{\theta} L(\theta)$$
A cool property of argmax is that since log is a monotonically increasing function, the argmax of a function is the same
as the argmax of the log of the function! That's nice because logs make the math simpler. Instead of
likelihood, you should use log likelihood: $LL(\theta)$.
$$LL(\theta) = \log \prod_{i=1}^{n} f(X_i \mid \theta) = \sum_{i=1}^{n} \log f(X_i \mid \theta)$$
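Besides simplifying the math, the sum form is also kinder to floating-point arithmetic. The sketch below is only an illustration: it assumes a normal density for $f$ and uses made-up samples (neither comes from the handout), and the function names are hypothetical. It compares the raw product with the sum of logs.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x (an assumed choice of f for this demo)."""
    coeff = 1.0 / (math.sqrt(2.0 * math.pi) * sigma)
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

def likelihood(data, mu, sigma):
    """L(theta): product of the densities of all samples."""
    result = 1.0
    for x in data:
        result *= normal_pdf(x, mu, sigma)
    return result

def log_likelihood(data, mu, sigma):
    """LL(theta): sum of the log densities of all samples."""
    return sum(math.log(normal_pdf(x, mu, sigma)) for x in data)

small_data = [0.5, -0.2, 1.1]   # hypothetical samples
large_data = [0.1] * 2000       # many samples, each with density below 1

# On a small data set the two forms agree: log L(theta) equals LL(theta).
print(math.log(likelihood(small_data, 0.0, 1.0)),
      log_likelihood(small_data, 0.0, 1.0))

# On a large data set the product underflows to 0.0,
# while the sum of logs stays finite.
print(likelihood(large_data, 0.0, 1.0),
      log_likelihood(large_data, 0.0, 1.0))
```

With thousands of samples the product of many numbers below 1 is too small for a double to represent, which is one practical reason to work with $LL(\theta)$.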
To use a maximum likelihood estimator, first write the log likelihood of the data given your parameters. Then
choose the value of the parameters that maximizes the log likelihood function. Argmax can be computed in many
ways. Most require computing the first derivative of the function.
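One derivative-based way to compute an argmax is gradient ascent. The handout does not prescribe this method; the sketch below is just one possibility, under assumed conditions: normal data with known variance 1, made-up samples, a finite-difference derivative, and a hand-picked learning rate. All names are hypothetical.

```python
def log_likelihood_mu(mu, data):
    """LL(mu) for N(mu, 1) data, dropping terms that do not depend on mu."""
    return sum(-0.5 * (x - mu) ** 2 for x in data)

def numerical_derivative(f, theta, h=1e-6):
    """Central-difference approximation of df/dtheta."""
    return (f(theta + h) - f(theta - h)) / (2.0 * h)

def gradient_ascent(f, theta0, learning_rate=0.01, steps=1000):
    """Repeatedly step in the direction that increases f."""
    theta = theta0
    for _ in range(steps):
        theta += learning_rate * numerical_derivative(f, theta)
    return theta

data = [2.1, 1.9, 3.2, 2.8, 2.5]  # hypothetical samples

mu_hat = gradient_ascent(lambda mu: log_likelihood_mu(mu, data), theta0=0.0)
sample_mean = sum(data) / len(data)

print(mu_hat)       # estimate found by following the derivative
print(sample_mean)  # the two values should be very close
```

When the derivative can be solved for zero by hand, as in the examples that follow, no iteration is needed.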
Bernoulli MLE Estimation
Consider IID random variables $X_1, X_2, \dots, X_n$ where $X_i \sim \text{Ber}(p)$. First we are going to write the PMF of a
Bernoulli in a crazy way: the probability mass function is $f(X_i \mid p) = p^{X_i}(1-p)^{1-X_i}$. Wow! What's up with
that? First convince yourself that when $X_i = 0$ and $X_i = 1$ this returns the right probabilities. We write the
PMF this way because it's differentiable.
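A quick way to convince yourself is to evaluate the expression at both outcomes. This is a minimal check, with a hypothetical function name and an arbitrary value of $p$:

```python
def bernoulli_pmf(x, p):
    """PMF of Ber(p) in the single-expression form p^x * (1 - p)^(1 - x)."""
    return (p ** x) * ((1 - p) ** (1 - x))

p = 0.3  # arbitrary success probability chosen for the check

print(bernoulli_pmf(1, p))  # x = 1: p^1 * (1 - p)^0, which is p
print(bernoulli_pmf(0, p))  # x = 0: p^0 * (1 - p)^1, which is 1 - p
```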
Now let’s do some MLE estimation:
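$$L(\theta) = \prod_{i=1}^{n} p^{X_i}(1-p)^{1-X_i}$$
$$LL(\theta) = \sum_{i=1}^{n} \log\left[p^{X_i}(1-p)^{1-X_i}\right]$$
$$= \sum_{i=1}^{n} \left[X_i \log p + (1-X_i)\log(1-p)\right]$$
$$= Y \log p + (n-Y)\log(1-p) \quad \text{where } Y = \sum_{i=1}^{n} X_i$$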
Great Scott! Now we simply need to choose the value of $p$ that maximizes our log-likelihood. One way to do
that is to find the first derivative and set it equal to 0.
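$$\frac{\partial LL(p)}{\partial p} = Y\frac{1}{p} + (n-Y)\frac{-1}{1-p} = 0$$
Multiplying through by $p(1-p)$ gives $Y(1-p) = (n-Y)p$, so $Y = np$ and:
$$\hat{p} = \frac{Y}{n} = \frac{\sum_{i=1}^{n} X_i}{n}$$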
All that work and we get the same thing as method of moments and sample mean...
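The closed form can be sanity-checked numerically. The sketch below uses made-up coin flips and a coarse grid of candidate values for $p$ (both are assumptions for illustration, as are the function names) and compares the grid maximizer of the log likelihood with the sample mean.

```python
import math

def bernoulli_log_likelihood(p, data):
    """LL(p) = Y log p + (n - Y) log(1 - p), with Y the number of ones."""
    y = sum(data)
    n = len(data)
    return y * math.log(p) + (n - y) * math.log(1 - p)

data = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]  # hypothetical coin flips

# Closed-form MLE: the sample mean.
p_closed_form = sum(data) / len(data)

# Brute-force argmax over a grid of candidates strictly inside (0, 1).
candidates = [i / 100 for i in range(1, 100)]
p_grid = max(candidates, key=lambda p: bernoulli_log_likelihood(p, data))

print(p_closed_form, p_grid)  # the two estimates should match
```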

Normal MLE Estimation

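Consider IID random variables $X_1, X_2, \dots, X_n$ where $X_i \sim N(\mu, \sigma^2)$.

$$L(\theta) = \prod_{i=1}^{n} f(X_i \mid \mu, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(X_i-\mu)^2}{2\sigma^2}}$$

$$LL(\theta) = \sum_{i=1}^{n} \log\left[\frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(X_i-\mu)^2}{2\sigma^2}}\right] = \sum_{i=1}^{n} \left[-\log\left(\sqrt{2\pi}\,\sigma\right) - \frac{1}{2\sigma^2}(X_i-\mu)^2\right]$$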

If we choose the values of $\hat{\mu}$ and $\hat{\sigma}^2$ that maximize likelihood, we get: $\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} X_i$ and $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} (X_i - \hat{\mu})^2$.
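These two estimators are easy to compute directly. The following sketch uses made-up samples and hypothetical function names; it also checks that nudging either estimate lowers the log likelihood. Note that the MLE of the variance divides by $n$, so it matches Python's `statistics.pvariance` (population variance) rather than the $n-1$ version.

```python
import math
import statistics

def normal_log_likelihood(data, mu, sigma2):
    """LL(mu, sigma^2) = sum of [-log(sqrt(2 pi) sigma) - (x - mu)^2 / (2 sigma^2)]."""
    n = len(data)
    return (-n * math.log(math.sqrt(2.0 * math.pi * sigma2))
            - sum((x - mu) ** 2 for x in data) / (2.0 * sigma2))

def normal_mle(data):
    """Closed-form maximum likelihood estimates of the mean and variance."""
    n = len(data)
    mu_hat = sum(data) / n
    sigma2_hat = sum((x - mu_hat) ** 2 for x in data) / n  # divides by n
    return mu_hat, sigma2_hat

data = [2.1, 1.9, 3.2, 2.8, 2.5]  # hypothetical samples

mu_hat, sigma2_hat = normal_mle(data)
print(mu_hat, sigma2_hat)
print(statistics.pvariance(data))  # population variance, also divides by n

# Moving away from the estimates should not increase the log likelihood.
best = normal_log_likelihood(data, mu_hat, sigma2_hat)
print(best >= normal_log_likelihood(data, mu_hat + 0.1, sigma2_hat))
print(best >= normal_log_likelihood(data, mu_hat, sigma2_hat * 1.2))
```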

Linear Transform Plus Noise

Assume that $Y = \theta X + Z$ where $Z \sim N(0, \sigma^2)$ and $X$ has an unknown distribution. The equations imply that $Y \mid X \sim N(\theta X, \sigma^2)$. Choose a value of $\theta$ that maximizes the probability of the data: $(X_1, Y_1), (X_2, Y_2), \dots, (X_n, Y_n)$.

We approach this problem by finding a function for the log likelihood of the data given $\theta$. Then we find the value of $\theta$ that maximizes the log likelihood function. To start, use the PDF of a Normal to express the density of $Y \mid X, \theta$:

$$f(Y_i \mid X_i, \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(Y_i - \theta X_i)^2}{2\sigma^2}}$$

Now we are ready to write the likelihood function, then take its log to get the log likelihood function:

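$$
\begin{aligned}
L(\theta) &= \prod_{i=1}^{n} f(Y_i, X_i \mid \theta) && \text{Let's break up this joint} \\
&= \prod_{i=1}^{n} f(Y_i \mid X_i, \theta)\, f(X_i) && f(X_i) \text{ is independent of } \theta \\
&= \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(Y_i - \theta X_i)^2}{2\sigma^2}} f(X_i) && \text{Substitute in the definition of } f(Y_i \mid X_i)
\end{aligned}
$$

$$
\begin{aligned}
LL(\theta) &= \log L(\theta) \\
&= \log \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(Y_i - \theta X_i)^2}{2\sigma^2}} f(X_i) && \text{Substitute in } L(\theta) \\
&= \sum_{i=1}^{n} \log\left[\frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(Y_i - \theta X_i)^2}{2\sigma^2}}\right] + \sum_{i=1}^{n} \log f(X_i) && \text{Log of a product is the sum of logs} \\
&= n \log \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (Y_i - \theta X_i)^2 + \sum_{i=1}^{n} \log f(X_i)
\end{aligned}
$$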

Remove positive constant multipliers and terms that don't include $\theta$. We are left with trying to find a value of $\theta$ that maximizes:

$$\hat{\theta} = \arg\max_{\theta} \; -\sum_{i=1}^{n} (Y_i - \theta X_i)^2$$

$$= \arg\min_{\theta} \; \sum_{i=1}^{n} (Y_i - \theta X_i)^2$$

This result says that the value of θ that makes the data most likely is one that minimizes the squared error of predictions of Y. We will see in a few days that this is the basis for linear regression.
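For this one-parameter model the minimizer has a closed form. The handout does not state it, but setting the derivative of the squared error with respect to $\theta$ to zero gives $\hat{\theta} = \sum_i X_i Y_i / \sum_i X_i^2$. The sketch below tries this on simulated data; the true $\theta$, the noise level, the uniform distribution for $X$, and all function names are assumptions chosen for illustration.

```python
import random

def squared_error(theta, xs, ys):
    """Sum of squared prediction errors, sum_i (Y_i - theta * X_i)^2."""
    return sum((y - theta * x) ** 2 for x, y in zip(xs, ys))

def fit_theta(xs, ys):
    """Minimizer of the squared error: sum(X_i * Y_i) / sum(X_i^2)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

random.seed(0)                 # fixed seed so the run is repeatable
true_theta, sigma = 2.0, 1.0   # assumed values used to simulate data

# Simulate Y = theta * X + Z with Z ~ N(0, sigma^2).
xs = [random.uniform(-5.0, 5.0) for _ in range(1000)]
ys = [true_theta * x + random.gauss(0.0, sigma) for x in xs]

theta_hat = fit_theta(xs, ys)
print(theta_hat)  # should land near true_theta

# Nearby values of theta should not give a smaller squared error.
print(squared_error(theta_hat, xs, ys) <= squared_error(theta_hat + 0.01, xs, ys))
print(squared_error(theta_hat, xs, ys) <= squared_error(theta_hat - 0.01, xs, ys))
```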