Maximum Likelihood and Maximum Likelihood Estimation - Prof. Manfredi, Lecture Notes in Statistics

A definition of the concepts of maximum likelihood and maximum likelihood estimate (MLE) for the distribution of a sample. The notes also illustrate the difference between a Bernoulli sample and its realisation, and present the likelihood function for a Bernoulli sample from a population with unknown probability density or probability mass function. The properties of the maximum likelihood estimate, including its invariance, are also discussed.

Type: Lecture notes

2022/2023

Uploaded on 06/04/2024

giacomo-segnini

13/11/2023
Department of Economics & Management
Master of Science in Economics, 2023-2024
Advanced statistics
Lecture 18, THU 09/11/2023. Introduction to the «inferential
model» & likelihood-based inference.
Prof. Piero Manfredi
Dep Economics & Management
University of Pisa
The three main problems of statistical estimation
Remarks on the quality control example. The choice of the sample proportion phat (=8%) as an estimate of the population proportion was just intuitive: we estimated the population proportion by the sample proportion (analogy argument or «plug-in» principle). A number of questions arise:
POINT ESTIMATION (me).
INTERVAL ESTIMATION (prof. Giusti).
PROPERTIES OF ESTIMATORS (prof. Giusti).
Keywords of parametric inference
Population
(Unknown) parameters.
Sampling & Bernoulli sample. Sample distribution.
Chance variability of sampling («sampling variability»)
(parametric) statistical model.
Estimator/estimate.
Estimation & testing.
Likelihood: the unifying tool of inferential procedures.
Likelihood theory
In what follows we introduce the core approach to (point) parameter estimation, namely the maximum likelihood approach.
This is the single most important concept of statistical theory. It provides a flexible and powerful approach to point estimation.
The intuitive idea is to focus on the available observed sample datum and to choose as best estimate of an unknown parameter its most likely value, i.e., the value that maximises the probability of observing that datum.
This probability is called the likelihood. It is the unifying tool of all inferential procedures in the classical approach to statistical theory.
We start by introducing the concept of likelihood function.
We continue by presenting the maximum likelihood approach to (parametric) estimation.
We finally define the concepts of maximum likelihood estimate and of maximum likelihood estimator (MLE).
(Refreshing) the distribution of the sample (Lect 15)
Let (X1,…,Xn) be a sample (from population X) which can take on values (x1,…,xn) (the «possible realisations»).
The distribution of the sample (DS), i.e., the joint probability mass function/joint density function of the sample observations in the order of selection, is

$$
DS = \begin{cases}
p_{X_1,\dots,X_n}(x_1,\dots,x_n) = \Pr(X_1 = x_1, X_2 = x_2, \dots, X_n = x_n) & X \text{ discrete with PMF } p_X(x) \\
f_{X_1,\dots,X_n}(x_1,\dots,x_n) & X \text{ continuous with PDF } f_X(x)
\end{cases}
$$

For Bernoulli samples (i.e., IID)

$$
DS = \begin{cases}
p_X(x_1)\, p_X(x_2) \cdots p_X(x_n) & X \text{ discrete with PMF } p_X(x) \\
f_X(x_1)\, f_X(x_2) \cdots f_X(x_n) & X \text{ continuous with PDF } f_X(x)
\end{cases}
$$
Advanced Statistics 2023-2024: Mid-term 31/10/2022 (prof P. Manfredi, Duration: 100 min)
Name…..………………………….………………..Surname……………….………………………….Student number
(matricola)………………….;
……
3. State the following definitions related to sampling theory: (0) the population of interest (X); (1) simple random sample; (2) distribution of a sample; (3) Bernoulli sample. Then, given a population X, clarify the difference between the concept of (Bernoulli) sample (X1,X2,..,Xn) from this population and its realization (x1,x2,..,xn). Finally, postulating that X is a Poisson population (of given parameter 𝜗) for the number of car accidents per day on the roads of a certain country, assume you draw a Bernoulli sample of n roads and let (x1,x2,..,xn) be the resulting numbers of daily accidents on each selected road. Provide the probability distribution of the sample. Last, assuming that 𝜗 = 2.4/day, find the probability of observing the following sample of size 3: (x1=0, x2=0, x3=4).
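A minimal sketch of the last step of the question, assuming the standard Poisson PMF $p_X(x,\vartheta) = e^{-\vartheta}\vartheta^{x}/x!$, so that the distribution of a Bernoulli (IID) sample is the product of the marginal PMFs. The function names are illustrative only and do not come from the course material.

```python
from math import exp, factorial

def poisson_pmf(x, theta):
    """Poisson probability mass function at x for mean theta."""
    return exp(-theta) * theta**x / factorial(x)

def sample_probability(xs, theta):
    """Distribution of an IID sample: product of the marginal PMFs."""
    prob = 1.0
    for x in xs:
        prob *= poisson_pmf(x, theta)
    return prob

# Sample of size 3 from the question, with theta = 2.4 accidents per day
p = sample_probability([0, 0, 4], 2.4)
print(p)
```

In closed form this is $e^{-3 \cdot 2.4} \cdot 2.4^{4} / 4!$, roughly 0.00103.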

The «quality control» problem: the population and its (unknown) parameters

In the «quality control» example the «real-world» underlying population is any delivery of N=5000 («population size») trims from factory B to factory A (really not interesting!).

❑ From a statistical viewpoint we are only interested in the binary nature of delivered trims: defective (=1) or non-defective (=0): a Bernoulli population!

❑ More specifically: we are interested in inferring the proportion of defective items (𝝑), not in the sample (this we have observed!) but in the population.

❑ 𝝑 is the unknown characteristic of the population, i.e., the unknown parameter of our problem, which we want to infer using the data in the sample.

$$
X \sim \text{Bernoulli}(\vartheta)
$$

The «quality control» problem: Random sample vs its realization

In our example the factory samples n=50 items from the population (the delivery of N=5000). Ex ante (i.e., before the sample is drawn), we have a random experiment described by a random vector of size n=50, (X1,X2,…,X50), where each component Xi is Bernoulli distributed.

Ex post (i.e., after the sample has been drawn), we have a vector of n=50 “numbers” (x1,x2,…,x50), which we call the (observed) realization of the sample. Assume for example that the 4 defective items are observed in the drawings i=3, 17, 41, 44. The sample realisation is therefore the vector:

$$
(x_1,\dots,x_{50}), \qquad x_i = \begin{cases} 1 & i = 3, 17, 41, 44 \\ 0 & \text{elsewhere} \end{cases}
$$

NB. In statistical theory we are obviously interested in the “ex-ante” sample (as a random vector). The «ex-post» sample is just a collection of numbers.

The «quality control» problem: the probability of the observed sample (the datum…)

❑ The observed («realised») sample:

$$
(x_1,\dots,x_{50}), \qquad x_i = \begin{cases} 1 & i = 3, 17, 41, 44 \\ 0 & \text{elsewhere} \end{cases}
$$

❑ The (ex-ante) probability of observing exactly this sample:

$$
\Pr(\underbrace{0,0,1,0,\dots,0,1,0,\dots,0,1,0,0,1,0,0,0,0,0,0}_{i \,=\, 1,2,3,4,\dots,16,17,18,\dots,40,41,42,43,44,45,\dots,50}) = \vartheta^{4}(1-\vartheta)^{46}
$$

The «quality control» problem: remarks about the probability of the observed sample (the datum…)

$$
P(X_1 = x_1, \dots, X_{50} = x_{50}) = \vartheta^{4}(1-\vartheta)^{46}
$$

❑ It is unknown: it depends on the unknown parameter.

❑ It actually represents the likelihood of any Bernoulli sample with a sample sum of s=4 successes over n=50 drawings.

❑ The sample being “Bernoulli”, i.e., IID, we computed the probability of the sample as if the drawing were “with replacement”.

❑ (After Fisher) it is called the «likelihood» of the data.

Likelihood

  • Core concept of «classical» (frequentist) statistical inference.

  • Introduced by R. A. Fisher in 1912.
    • The only «auctoritas» is the data.
    • (Appropriate) design of data collection (= «design of experiments») is critical to allow statistical analyses to provide good results.

The likelihood function

Definition. Given a Bernoulli sample (X1,X2,…,Xn) with realisation (x1,x2,…,xn) from a population X with density $f_X(x,\vartheta)$ / probability mass function $p_X(x,\vartheta)$, depending on a vector of unknown parameters $\vartheta \in \Theta$, the likelihood function is:

$$
L(\vartheta) = L(\vartheta; x_1,\dots,x_n) = \begin{cases}
\prod_{j=1}^{n} p_X(x_j,\vartheta) & \text{if the population is discrete with probability mass function } p_X(x,\vartheta) \\
\prod_{j=1}^{n} f_X(x_j,\vartheta) & \text{if the population is continuous with density function } f_X(x,\vartheta)
\end{cases}
$$

Remark. The set $\Theta$ (the domain of the likelihood function) is called the “parametric space”.


Remark: Until now the sample proportion ($\hat p$ = 0.08) was just an intuitive («analogical») estimate of the unknown parameter, given by the population proportion p.

Exercise. Confirm that $\vartheta$ = 0.08 is a maximum point of the likelihood function.

[Figure: plot of the likelihood function with its maximum point marked «Max».]

Remark. Given the data actually observed, $\hat p$ = 0.08 is the «most likely» value of the parameter, i.e., the value yielding the largest likelihood. Note that «most likely» does not (at all!) mean «most probable», given that the unknown parameter is a constant (and not a random variable!).

We have now found that the observed sample proportion $\hat p$ is much more than an «intuitive estimate»: it is the value which maximises the likelihood function of our problem.

We say that the realized sample proportion $\hat p$ is a maximum likelihood estimate (MLE) of the unknown population proportion.
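As a numerical companion to the exercise, the sketch below evaluates the log of the quality control likelihood $\vartheta^{4}(1-\vartheta)^{46}$ on a grid over (0,1) and picks the grid point with the largest value. The grid resolution and the function name are arbitrary choices made for illustration; a grid search suggests where the maximum lies but does not replace the analytical argument.

```python
import math

def log_likelihood(theta, s=4, n=50):
    """Log of theta^s * (1 - theta)^(n - s) for the quality control data."""
    return s * math.log(theta) + (n - s) * math.log(1 - theta)

# Grid over the open interval (0, 1) in steps of 0.001
grid = [k / 1000 for k in range(1, 1000)]
theta_hat = max(grid, key=log_likelihood)
print(theta_hat)
```

The maximising grid point coincides with the sample proportion s/n = 4/50.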

Remark: in the original problem the sample was actually drawn without replacement. This implies some changes, especially considering that the population is finite, with size N=5000. In this circumstance the observed sample implies a reduction of the parametric space: since the number of defective items is at least 4 out of 5000 (given that 4 defective items have actually been observed) and at most 4954 (given that 46 non-defective items have been observed), we have $\Theta$ = {i/5000, for i=4,5,..,4954}. Moreover the likelihood is no longer binomial, but hypergeometric. Exercise. Write down the hypergeometric likelihood of the previous problem.

Obviously if the population is very large (“infinite”, as might be the case for a national human community) then the sample must necessarily be small compared to the population, which in any case means that the choice $\Theta = (0,1)$ remains an excellent approximation.
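One possible way to write the hypergeometric likelihood of the exercise, assuming i denotes the unknown number of defective items in the delivery (so that the proportion is i/N), is $L(i) = \binom{i}{4}\binom{5000-i}{46} / \binom{5000}{50}$ for i = 4,…,4954. The sketch below evaluates it over the reduced parametric space; the variable and function names are illustrative assumptions.

```python
from math import comb

N, n, s = 5000, 50, 4          # population size, sample size, observed defectives
TOTAL = comb(N, n)             # number of possible samples without replacement

def hypergeom_likelihood(i):
    """Probability of s defectives in the sample if the delivery holds i defectives."""
    return comb(i, s) * comb(N - i, n - s) / TOTAL

# Reduced parametric space: at least s defectives, at least n - s non-defectives
candidates = range(s, N - (n - s) + 1)
i_hat = max(candidates, key=hypergeom_likelihood)
print(i_hat, i_hat / N)
```

Under these assumptions the maximising proportion agrees with the binomial result, which is consistent with the remark that $\Theta = (0,1)$ is an excellent approximation when the sample is small relative to the population.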

Bernoulli populations: general treatment

We seek the ML estimate of a Bernoulli proportion based on a generic Bernoulli sample of size n with realization (x1,x2,…,xn) yielding any possible number of successes. The general form of the likelihood is

$$
L(\vartheta) = L(\vartheta; x_1,\dots,x_n) = \prod_{i=1}^{n} p_X(x_i,\vartheta) = \prod_{i=1}^{n} \vartheta^{x_i}(1-\vartheta)^{1-x_i} = \vartheta^{\sum_i x_i}(1-\vartheta)^{n-\sum_i x_i} = \vartheta^{s}(1-\vartheta)^{n-s} = \vartheta^{n\bar x}(1-\vartheta)^{n-n\bar x}
$$

Note the likelihood has been parametrised in two equivalent forms:

a) Using the observed sample sum (s), i.e., the number of successes in the sample.

b) Using the observed sample mean, i.e., the sample proportion $\hat p$ = s/n.

Remark. The form of the likelihood written in the previous slide is «abstract». Indeed, in any actual problem it is the observed datum which ultimately determines the relevant parametric space and therefore the form of the MLE.

0 < s < n:

$$
L(\vartheta) = \vartheta^{s}(1-\vartheta)^{n-s}, \qquad \Theta = (0,1)
$$

s=0 («only failures» observed):

$$
L(\vartheta) = (1-\vartheta)^{n}, \qquad \Theta = [0,1)
$$

s=n («only successes» observed). Then:

$$
L(\vartheta) = \vartheta^{n}, \qquad \Theta = (0,1]
$$

Remark («regular» vs «non regular» likelihood problems). The cases s=n and s=0, where the MLE was detected by investigating the behaviour of the likelihood function at the boundaries of the parametric set (due to the monotonic shape of the likelihood function), are called «non regular» likelihood problems, in contrast with the «regular» ML problem considered before (0 < s < n).
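The three cases can be worked out as follows. This is a sketch under the Bernoulli likelihood $L(\vartheta) = \vartheta^{s}(1-\vartheta)^{n-s}$, with the parametric sets (0,1), [0,1) and (0,1] assumed for the regular case, s=0 and s=n respectively.

$$
\begin{aligned}
0 < s < n:&\quad LL(\vartheta) = s\log\vartheta + (n-s)\log(1-\vartheta), \qquad \frac{dLL}{d\vartheta} = \frac{s}{\vartheta} - \frac{n-s}{1-\vartheta} = 0 \ \Rightarrow\ \hat\vartheta_{ML} = \frac{s}{n}, \\
&\quad \frac{d^2LL}{d\vartheta^2} = -\frac{s}{\vartheta^2} - \frac{n-s}{(1-\vartheta)^2} < 0 \quad \text{(strictly concave, so the maximum is unique)}; \\
s = 0:&\quad L(\vartheta) = (1-\vartheta)^{n} \text{ is strictly decreasing on } [0,1) \ \Rightarrow\ \hat\vartheta_{ML} = 0; \\
s = n:&\quad L(\vartheta) = \vartheta^{n} \text{ is strictly increasing on } (0,1] \ \Rightarrow\ \hat\vartheta_{ML} = 1.
\end{aligned}
$$

In all three cases the estimate equals the sample proportion s/n; only the way the maximum is located (stationary point versus boundary point) differs.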

We obtain:

$$
LL(\mu) = \log L(\mu) = \log H - Kn\left(\mu^{2} - 2\bar x \mu\right)
$$

$$
\frac{dLL}{d\mu} = -Kn\left(2\mu - 2\bar x\right) = 0 \quad\Rightarrow\quad \hat\mu_{ML} = \bar x
$$

Therefore, based on previous arguments, we can confirm that the sample mean

$$
\hat\mu_{ML} = \bar x
$$

is a global maximum point of the likelihood function.

Exercise (trivial but useful). Apply a second order test to confirm that $\hat\mu_{ML} = \bar x$ is a local maximum of the likelihood function.
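A possible outline of the second order test, assuming the log-likelihood has the form $LL(\mu) = \log H - Kn(\mu^{2} - 2\bar x\mu)$ and that the constant K is positive (H and K are defined on slides not reproduced here, so the sign of K is an assumption):

$$
\frac{dLL}{d\mu} = -Kn(2\mu - 2\bar x), \qquad \frac{d^{2}LL}{d\mu^{2}} = -2Kn < 0 \quad \text{for every } \mu .
$$

Since the second derivative is negative at the stationary point $\mu = \bar x$, that point is a local maximum; because it is negative everywhere, the log-likelihood is strictly concave and the maximum is also global.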

Exercise. Find the MLE for:

a) Mean and standard deviation ($\mu$, $\sigma$) of a normal population.
b) Mean and variance ($\mu$, $\sigma^2$) of a normal population.
c) The rate of a negative exponential population from complete observation of the waiting times (done by PM).
d) The mean of a Poisson population (done by PM).
e) The rate of a negative exponential population under censored observations (NOT in syllabus 2020-2021).
f) The parameter of a population uniformly distributed over $(0,\vartheta)$: $f_X(x,\vartheta) = \frac{1}{\vartheta} I_{(0,\vartheta)}(x)$.
g) The parameter of the 1-parameter “exponential tent” (Laplace) density, defined over $(-\infty,+\infty)$.
h) Intercept, slope, and variance of errors (Var($\varepsilon$)) for the basic linear model under normal errors (TA).

Remark. Bolded underlined exercises done during tutoring.

Normal case: joint ML estimation of ($\mu$,$\sigma$)

(fully developed in class during tutoring) The vector of unknown parameters is $\vartheta = (\mu,\sigma)$. The corresponding likelihood function is:

$$
L(\mu,\sigma) = \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x_i-\mu)^2}{2\sigma^2}} = (2\pi)^{-n/2}\,\sigma^{-n}\, e^{-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2}
$$

over the parametric set:

$$
\Theta = \{(\mu,\sigma): -\infty < \mu < +\infty,\ 0 < \sigma < +\infty\}
$$

Preliminary remark. The likelihood is well behaved (non-negative, tending to zero at the boundary points of the parametric set, continuous and differentiable). Therefore it has at least one maximum point over the parametric set, which we detect by equating to zero the (first) partial derivatives of the LL function:

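$$
LL(\mu,\sigma) = \log L(\mu,\sigma) = n\left(-\log\sigma - \log\sqrt{2\pi}\right) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2
$$

The first order conditions (= likelihood equations) are:

$$
\begin{cases}
\dfrac{\partial LL}{\partial\mu} = \dfrac{1}{\sigma^2}\sum_{i=1}^{n}(x_i-\mu) = \dfrac{n\bar x - n\mu}{\sigma^2} = 0 \\[2ex]
\dfrac{\partial LL}{\partial\sigma} = -\dfrac{n}{\sigma} + \dfrac{1}{\sigma^3}\sum_{i=1}^{n}(x_i-\mu)^2 = 0
\end{cases}
$$

The first equation immediately gives:

$$
\mu = \bar x
$$

Plugging this result into the second likelihood equation we get:

$$
\frac{\partial LL}{\partial\sigma} = -\frac{n}{\sigma} + \frac{1}{\sigma^3}\sum_{i=1}^{n}(x_i-\bar x)^2 = 0 \quad\Rightarrow\quad \sigma^2 = \frac{\sum_{i=1}^{n}(x_i-\bar x)^2}{n}
$$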

Exercise. Confirm that the pair

$$
\hat\mu_{ML} = \bar x, \qquad \hat\sigma_{ML} = \sqrt{\frac{\sum_{i=1}^{n}(x_i-\bar x)^2}{n}}
$$

is the desired ML estimate for ($\mu$,$\sigma$).

Sol. Yes, there is only one candidate point. Therefore, based on the characteristics of the likelihood function, this is necessarily the unique MLE.

Exercise. Use the usual second order test (i.e., show that the Hessian matrix is negative-definite) to show first that the previous pair actually represents a local maximum of the likelihood function.
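The closed-form pair (sample mean, square root of the mean squared deviation) can be checked numerically. The sketch below uses a small made-up data set (hypothetical values, not from the lecture) and compares the normal log-likelihood at the candidate point with its values at nearby points. This is only a sanity check, not a substitute for the Hessian argument.

```python
import math

def normal_mle(xs):
    """Closed-form ML estimates (mu_hat, sigma_hat) for a normal sample."""
    n = len(xs)
    mu_hat = sum(xs) / n
    sigma_hat = math.sqrt(sum((x - mu_hat) ** 2 for x in xs) / n)
    return mu_hat, sigma_hat

def log_likelihood(mu, sigma, xs):
    """Normal log-likelihood LL(mu, sigma) for the sample xs."""
    n = len(xs)
    return (n * (-math.log(sigma) - math.log(math.sqrt(2 * math.pi)))
            - sum((x - mu) ** 2 for x in xs) / (2 * sigma ** 2))

data = [4.1, 5.3, 6.0, 4.7, 5.9]   # hypothetical observations
mu_hat, sigma_hat = normal_mle(data)
best = log_likelihood(mu_hat, sigma_hat, data)

# Compare with the eight neighbouring points at distance 0.1 in each coordinate
neighbours = [(mu_hat + dm, sigma_hat + ds)
              for dm in (-0.1, 0.0, 0.1) for ds in (-0.1, 0.0, 0.1)
              if (dm, ds) != (0.0, 0.0)]
is_local_max = all(log_likelihood(m, s, data) < best for m, s in neighbours)
print(mu_hat, sigma_hat, is_local_max)
```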


Remark. Note that one can straightforwardly compute the square of the MLE for sigma, obtaining:

$$
\left(\hat\sigma_{ML}\right)^2 = \frac{\sum_{i=1}^{n}(x_i-\bar x)^2}{n}
$$

One would like to know whether it is also correct to say that:

$$
\widehat{\left(\sigma^2\right)}_{ML} = \left(\hat\sigma_{ML}\right)^2 = \frac{\sum_{i=1}^{n}(x_i-\bar x)^2}{n}
$$

i.e., whether by taking the square of the MLE for the standard deviation one obtains the MLE for the population variance.

The answer is again YES, by the invariance property of MLE (presented during lectures, which we recall here).

The «invariance» property of MLE

Theorem (strong invariance of MLE). Let $L(\vartheta)$ be a likelihood of parameter $\vartheta$, where $\vartheta \in \Theta$, having MLE $\hat\vartheta_{ML}$. Let g(.) be a biunivocal (one-to-one) function and consider the parameter $\tau = g(\vartheta)$. Then:

$$
\widehat{g(\vartheta)}_{ML} = g\left(\hat\vartheta_{ML}\right)
$$

Remark. Note that $\tau = g(\vartheta)$ is a parameter which will have its own parametric space $g(\Theta)$.

Remark. The invariance theorem actually holds under much more general conditions on the form of the function g.

Exercise (confirming previous finding). Instead of applying the invariance property, one might confirm the previous result by computing the MLE for the parametric vector $\vartheta = (\mu,\sigma^2)$ (instead of ($\mu$,$\sigma$)).

Hint. We write the new likelihood as follows, by setting $\sigma^2 = v$:

$$
L(\mu,v) = (2\pi v)^{-n/2}\, e^{-\frac{1}{2v}\sum_{i=1}^{n}(x_i-\mu)^2}
$$
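A sketch of how the exercise can be carried out, writing $v$ for the variance $\sigma^2$ (the symbol $v$ is a notational assumption made here) and using the normal likelihood reparametrised in $(\mu, v)$:

$$
\begin{aligned}
LL(\mu,v) &= -\frac{n}{2}\log(2\pi v) - \frac{1}{2v}\sum_{i=1}^{n}(x_i-\mu)^2, \\
\frac{\partial LL}{\partial\mu} &= \frac{1}{v}\sum_{i=1}^{n}(x_i-\mu) = 0 \ \Rightarrow\ \hat\mu_{ML} = \bar x, \\
\frac{\partial LL}{\partial v} &= -\frac{n}{2v} + \frac{1}{2v^{2}}\sum_{i=1}^{n}(x_i-\bar x)^2 = 0 \ \Rightarrow\ \hat v_{ML} = \frac{\sum_{i=1}^{n}(x_i-\bar x)^2}{n}.
\end{aligned}
$$

The resulting estimate of the variance equals the square of the MLE of the standard deviation, in agreement with the invariance property.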

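Proof of the invariance property. We look at the likelihood of the parameter $\tau = g(\vartheta)$ at a generic point $\tau_0$ (note that the parametric space is $g(\Theta)$): $L(\tau_0) = ?$

[Figure: diagram of the map g between $\vartheta$ and $\tau_0$, with labels $L(\vartheta_0)$ and $L(\vartheta_1)$.]

Since g is biunivocal, each $\tau_0$ corresponds to exactly one $\vartheta_0$. It therefore holds:

$$
L(\tau_0) = L(\vartheta_0)
$$

where:

$$
\tau_0 = g(\vartheta_0)
$$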

MLE of intercept, slope and variance of the basic linear model with normal homoscedastic errors.

The «classical» simple linear model

  • This is the baseline model of any econometric textbook.
  • Linear stochastic relation between an output (Y) (also termed the «response») and an input given by the sum of a (linear) deterministic component (𝛼 + 𝛽𝑋) and a random disturbance (𝜀).
  • The independent variable (X) is also termed the «explanatory» factor, or the «predictor».
  • The input (X) is fixed (and thus by definition uncorrelated with the error).
  • Standard «measurement errors» assumptions about the disturbances.

$$
Y_i = \alpha + \beta X_i + \varepsilon_i, \qquad \varepsilon_i \sim IIDN(0,\sigma^2), \qquad E(\varepsilon_i\varepsilon_j) = 0 \quad (i \ne j)
$$


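Exercise (properties of the ML-least square estimator). Find the expectation and variance of the «ML-least square» estimator for the slope and show its unbiasedness and consistency.

Sol. Let:

$$
x_i = X_i - \bar X
$$

Then, using $\sum_i x_i = 0$ and $\sum_i x_i X_i = \sum_i x_i^2$:

$$
\hat\beta_{ML\_OLS} = \frac{\sum_{i=1}^{n}(X_i-\bar X)(Y_i-\bar Y)}{\sum_{i=1}^{n}(X_i-\bar X)^2} = \frac{\sum_{i=1}^{n}(X_i-\bar X)Y_i}{\sum_{i=1}^{n}(X_i-\bar X)^2} = \frac{\sum_{i=1}^{n}x_iY_i}{\sum_{i=1}^{n}x_i^2} = \frac{\sum_{i=1}^{n}x_i(\alpha+\beta X_i+\varepsilon_i)}{\sum_{i=1}^{n}x_i^2} = \beta + \frac{\sum_{i=1}^{n}x_i\varepsilon_i}{\sum_{i=1}^{n}x_i^2}
$$

From which:

$$
E\left(\hat\beta_{ML\_OLS}\right) = E\left(\beta + \frac{\sum_{i=1}^{n}x_i\varepsilon_i}{\sum_{i=1}^{n}x_i^2}\right) = \beta + \frac{\sum_{i=1}^{n}x_iE(\varepsilon_i)}{\sum_{i=1}^{n}x_i^2} = \beta
$$

$$
Var\left(\hat\beta_{ML\_OLS}\right) = Var\left(\beta + \frac{\sum_{i=1}^{n}x_i\varepsilon_i}{\sum_{i=1}^{n}x_i^2}\right) = \frac{\sum_{i=1}^{n}x_i^2\,Var(\varepsilon_i)}{\left(\sum_{i=1}^{n}x_i^2\right)^2} = \frac{\sigma^2}{\sum_{i=1}^{n}x_i^2}
$$

Therefore the expectation equals $\beta$ for every n, and the variance tends to zero as n grows, provided that $\sum_{i=1}^{n}x_i^2 \to \infty$.

Which shows that the ML_OLS estimator for the slope is unbiased and consistent.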
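The two results can be illustrated with a small Monte Carlo experiment. The sketch below is only an illustration under assumed values: the parameters (alpha=1, beta=2, sigma=1), the fixed design X=0,…,19, the seed and the number of replications are all arbitrary choices, not taken from the lecture.

```python
import random

def ols_slope(X, Y):
    """Slope estimate sum(x_i * Y_i) / sum(x_i^2), with x_i = X_i - mean(X)."""
    x_bar = sum(X) / len(X)
    x = [Xi - x_bar for Xi in X]
    return sum(xi * Yi for xi, Yi in zip(x, Y)) / sum(xi * xi for xi in x)

random.seed(0)                      # fixed seed so the run is reproducible
alpha, beta, sigma = 1.0, 2.0, 1.0  # hypothetical true parameters
X = list(range(20))                 # fixed (non-random) input
x_bar = sum(X) / len(X)
sxx = sum((Xi - x_bar) ** 2 for Xi in X)

slopes = []
for _ in range(2000):
    # New IID normal errors at each replication, same fixed X
    Y = [alpha + beta * Xi + random.gauss(0, sigma) for Xi in X]
    slopes.append(ols_slope(X, Y))

mean_slope = sum(slopes) / len(slopes)
var_slope = sum((b - mean_slope) ** 2 for b in slopes) / len(slopes)
theoretical_var = sigma ** 2 / sxx  # sigma^2 / sum(x_i^2)
print(mean_slope, var_slope, theoretical_var)
```

The average of the simulated slopes should lie close to the assumed beta, and their empirical variance close to $\sigma^2/\sum_i x_i^2$.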

Estimating waiting-times: the Exponential case

  • The classical problem: full (“complete”) observation of the waiting time until the event of interest has occurred.
  • “Destructive observations”. Common in technology and industrial production processes & related quality control procedures, where the event most often represents a «failure». Convenient when the scrutinised items have a short lifespan and low cost.
  • Incomplete observation plans and the censoring problem.
    • Social sciences observations, with long waiting times and costly observations.
  • The issue of censored observations.
  • Event history techniques.

The classical problem: full observation of the waiting time (until the failure/transition has occurred).

Example. To evaluate the reliability of its product, a factory producing batteries performs the following experiment: a Bernoulli sample of n=10 batteries is followed under standard operation rules until the event of interest (failure) has occurred. The following durations at failure were observed (in days):

Battery    1     2     3     4     5     6     7     8     9     10
Duration   7.5   9.2   12.7  6.9   10.4  10.6  11.4  14.4  8.8   10.

Assuming that the waiting time to failure is distributed according to a negative exponential density of unknown rate $\vartheta$, find the corresponding maximum likelihood estimate of $\vartheta$, and of the population mean.

Solution. As the distribution of the population is negative exponential, the likelihood function is:

$$
L(\vartheta) = \prod_{i=1}^{n} f_X(x_i,\vartheta) = \prod_{i=1}^{n} \vartheta e^{-\vartheta x_i} = \vartheta^{n} e^{-\vartheta\sum_{i=1}^{n}x_i}
$$

The likelihood is non-negative, continuous and differentiable over the parametric set, with L(0)=0 and L(∞)=0. The LL function therefore admits at least one local maximum. The LL is:

$$
LL(\vartheta) = \log L(\vartheta) = n\log\vartheta - n\vartheta\bar x, \qquad \Theta = \{\vartheta: \vartheta > 0\}
$$

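Exercise. Show there is a unique ML estimate, given by:

$$
\hat\vartheta_{ML} = \frac{1}{\bar x}
$$

Therefore:

$$
\bar x = 10.2 \text{ days} \quad\Rightarrow\quad \hat\vartheta_{ML} = \frac{1}{10.2} \approx 0.098 \text{ day}^{-1}
$$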
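One way to justify the uniqueness claim, assuming the exponential log-likelihood $LL(\vartheta) = n\log\vartheta - n\vartheta\bar x$ on $\vartheta > 0$:

$$
\frac{dLL}{d\vartheta} = \frac{n}{\vartheta} - n\bar x = 0 \ \Rightarrow\ \vartheta = \frac{1}{\bar x}, \qquad \frac{d^{2}LL}{d\vartheta^{2}} = -\frac{n}{\vartheta^{2}} < 0 \quad \text{for every } \vartheta > 0 .
$$

The log-likelihood is strictly concave on the whole parametric set, so its only stationary point is the unique global maximum.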

Exercise. With reference to the previous reliability problem, find the corresponding maximum likelihood estimate of the average duration of batteries.

Solution. In principle we can re-solve (please do it…) the problem on the likelihood of the exponential population reparametrised as:

$$
f_X(x,\mu) = \frac{1}{\mu}\, e^{-x/\mu}\, I_{(0,+\infty)}(x), \qquad \Theta = \{\mu: \mu > 0\}
$$

However, by applying the invariance property of MLE we immediately find:

$$
\hat\mu_{ML} = \frac{1}{\hat\vartheta_{ML}} = \bar x
$$
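The agreement between the two routes can be checked numerically. The sketch below uses only the quantities stated in the example (n=10 and a sample mean of 10.2 days) and maximises the reparametrised log-likelihood $LL(\mu) = -n\log\mu - n\bar x/\mu$ on a grid. The grid bounds and step, as well as the function names, are arbitrary illustrative choices.

```python
import math

n, x_bar = 10, 10.2   # sample size and sample mean (days) from the example

def ll_rate(theta):
    """Exponential log-likelihood in the rate: n*log(theta) - n*theta*x_bar."""
    return n * math.log(theta) - n * theta * x_bar

def ll_mean(mu):
    """Same log-likelihood reparametrised in the mean mu = 1/theta."""
    return -n * math.log(mu) - n * x_bar / mu

theta_hat = 1 / x_bar                          # closed-form MLE of the rate
mu_grid = [k / 10 for k in range(1, 501)]      # 0.1, 0.2, ..., 50.0 days
mu_hat = max(mu_grid, key=ll_mean)             # direct maximisation over mu
print(theta_hat, mu_hat, 1 / theta_hat)
```

The direct maximiser over the mean matches the reciprocal of the rate estimate, and both parametrisations attain the same maximum log-likelihood value, as the invariance property predicts.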
