Likelihood And Maximum Likelihood Theory Explained, Schemes and Mind Maps of Data Analysis & Statistical Methods

This discussion develops a rigorous, exam-oriented understanding of likelihood-based inference for a B.Sc. Statistics (Honours) student. It begins with formal calculus-based definitions of likelihood and maximum likelihood estimation. From these foundations, partial likelihood is derived systematically by conditioning on suitable σ-fields, factoring joint densities, and using infinitesimal hazard arguments, with the Cox proportional hazards model as the main example. Key assumptions—independent noninformative censoring and tie handling—are stated explicitly. The logic behind consistency of partial likelihood estimators, despite discarding nuisance parameters, is explained. Finally, the Breslow estimator for the baseline cumulative hazard is derived as a natural plug-in estimator following partial likelihood, completing the inferential framework.

Typology: Schemes and Mind Maps

2024/2025

Uploaded on 02/15/2026

unknown user
unknown user 🇮🇳


A student of B.Sc. Statistics (Honours) at Calcutta University, India, is reading books on statistical inference. Examples of such books are Theory of Statistical Inference by S. Zacks, Statistical Inference by Casella & Berger, Essential Statistical Inference by Boos & Stefanski, Linear Statistical Inference by C. R. Rao, etc.

A few months ago he learned what likelihood is and the definition of maximum likelihood.

Your task is to write down the formal definition of likelihood and the definition of maximum likelihood.

Note: Keep your approach rigorous and calculus-based rather than measure-theoretic, so that the Indian student can prepare the concepts well and write them in the B.Sc. Statistics (Honours) exam.

Below is a rigorous, calculus-based, exam-ready formulation, written in the style expected in a B.Sc. Statistics (Honours), Calcutta University answer. I will avoid measure-theoretic language, but still keep the definitions precise and mathematically correct, consistent with Zacks, Casella–Berger, Boos–Stefanski, and C. R. Rao.

1. Statistical Model and Joint Density / Mass Function

Let
\[ X_1, X_2, \dots, X_n \]
be a random sample from a population whose distribution depends on an unknown parameter
\[ \theta \in \Theta \subset \mathbb{R}^k. \]

Assume that the joint distribution of the sample has either

a joint probability mass function
\[ f(x_1, \dots, x_n; \theta) \quad \text{(discrete case)}, \]
or a joint probability density function
\[ f(x_1, \dots, x_n; \theta) \quad \text{(continuous case)}. \]

Here:

$x_1, \dots, x_n$ are realized observations,
$\theta$ is unknown but fixed.

2. Definition of Likelihood Function

Definition (Likelihood Function)

For a given observed sample
\[ (x_1, x_2, \dots, x_n), \]
the likelihood function of the parameter $\theta$ is defined as
\[ L(\theta \mid x_1, \dots, x_n) = f(x_1, \dots, x_n; \theta) \]

viewed as a function of $\theta$, with the data held fixed.

Important Conceptual Remark (Very Exam-Relevant)

When written as $f(x_1, \dots, x_n; \theta)$, it is a probability (or density) in the variable $x$. When written as $L(\theta \mid x_1, \dots, x_n)$, the same expression is viewed as a function of $\theta$ and is not a probability distribution.

This distinction is emphasized in C. R. Rao and Casella–Berger.

3. Likelihood for Independent Observations

If $X_1, \dots, X_n$ are independent and identically distributed with common pdf/pmf $f(x; \theta)$, then the joint density is
\[ f(x_1, \dots, x_n; \theta) = \prod_{i=1}^{n} f(x_i; \theta). \]

Hence, the likelihood function is
\[ L(\theta) = \prod_{i=1}^{n} f(x_i; \theta). \]

4. Log-Likelihood Function

Because products are inconvenient for calculus, we define the log-likelihood function
\[ \ell(\theta) = \log L(\theta) = \sum_{i=1}^{n} \log f(x_i; \theta). \]

Key Property

Since the logarithm is a strictly increasing function, maximizing $L(\theta)$ is equivalent to maximizing $\ell(\theta)$.

5. Definition of Maximum Likelihood Estimator (MLE)

Definition (Maximum Likelihood Estimator)

A statistic $\hat{\theta} = \hat{\theta}(X_1, \dots, X_n)$ is called a maximum likelihood estimator (MLE) of $\theta$ if it satisfies
\[ L(\hat{\theta}) = \sup_{\theta \in \Theta} L(\theta), \]
or equivalently,
\[ \ell(\hat{\theta}) = \sup_{\theta \in \Theta} \ell(\theta), \]
provided the supremum is attained in $\Theta$.

6. Calculus-Based Characterization (First-Order Conditions)

If $\theta$ is a scalar parameter, $\ell$ is differentiable, and $\hat{\theta}$ lies in the interior of $\Theta$, then a necessary condition for $\hat{\theta}$ to be an MLE is
\[ \left. \frac{d}{d\theta} \ell(\theta) \right|_{\theta = \hat{\theta}} = 0. \]

That is, […]
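As a small numerical illustration of the first-order condition, the sketch below assumes an Exponential model with rate $\theta$, so that $\ell(\theta) = n \log\theta - \theta \sum x_i$. The model choice, the data values, and the function names are assumptions made for illustration only; they are not part of the original answer. The closed-form root of the score equation, $\hat{\theta} = n / \sum x_i$, is compared with a coarse grid search.

```python
import math

def log_likelihood(rate, data):
    """Log-likelihood of an i.i.d. Exponential(rate) sample: n*log(rate) - rate*sum(x)."""
    return len(data) * math.log(rate) - rate * sum(data)

def score(rate, data):
    """First derivative of the log-likelihood with respect to rate."""
    return len(data) / rate - sum(data)

# Hypothetical observations (illustrative values only).
data = [1.0, 2.0, 0.5, 0.5, 1.0]

# Closed-form solution of the score equation d/d(rate) log L = 0.
mle = len(data) / sum(data)

# Independent check: maximize the log-likelihood over a grid of candidate rates.
grid = [k / 1000 for k in range(1, 5001)]
grid_max = max(grid, key=lambda r: log_likelihood(r, data))

print("closed-form MLE:", mle)
print("score at MLE:", score(mle, data))
print("grid maximizer:", grid_max)
```

The second derivative of this log-likelihood is $-n/\theta^2 < 0$, so the stationary point found from the score equation is indeed a maximum in this example.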

likelihood starting from the formal definition of likelihood or maximum likelihood and by changing some appropriate settings in the concept in the middle of the proof?


Short answer (exam-ready): Yes — under clear probabilistic/conditioning assumptions you can derive a partial likelihood formally from the usual definition of likelihood. However you must not think of a partial likelihood as “just dropping terms” of the full likelihood without justification. A partial likelihood is obtained by (one of) these rigorous operations applied to the full joint likelihood:

  • conditioning on a suitable σ-field / statistic (so the conditional distribution depends only on the parameter of interest), or
  • factoring the joint density and extracting the factor that contains the parameter of interest, or
  • taking a limit (e.g. from a discrete-time approximation) that yields factors depending only on the parameter of interest.

Below I give a precise, calculus-based explanation + a canonical worked derivation (Cox model) that you can write in an exam.

1. Formal framework (same starting point as MLE)

Let the observed data be $X$. The usual (full) likelihood for a parameter $(\theta, \eta)$ is
\[ L(\theta, \eta \mid X) = f_X(X; \theta, \eta), \]
viewed as a function of $(\theta, \eta)$ with $X$ fixed. Suppose we are primarily interested in $\theta$ while $\eta$ is a nuisance parameter.

A partial likelihood for $\theta$ is any function $L_p(\theta \mid X)$ constructed from the joint model such that (1) $L_p(\theta \mid X)$ is (proportional to) the probability (or density) of some function/statistic of the data conditional on other parts of the data, and (2) inference about $\theta$ based on $L_p$ is valid under stated model assumptions.

Important conceptual point for exams: partial likelihood is a genuine likelihood for $\theta$ only when it equals (or is proportional to) some conditional or marginal probability/density that depends on $\theta$ but not on nuisance parameters. You must show this equality/derivation; otherwise it is an ad hoc object.

2. Ways to obtain a partial likelihood from the full likelihood

1. Conditional likelihood approach. If the joint density factors or the conditional distribution of some components of the data given others depends only on $\theta$, then that conditional density is a likelihood for $\theta$. Formally, if
\[ f_X(x; \theta, \eta) = g(T(x); \theta)\, h(u(x); \eta) \]
or more generally if
\[ f_X(x; \theta, \eta) = f_{A,B}(a, b; \theta, \eta) = f_{A \mid B}(a \mid b; \theta)\, f_B(b; \eta, \theta), \]
and $f_{A \mid B}(a \mid b; \theta)$ depends only on $\theta$, then
\[ L_p(\theta) = f_{A \mid B}(A_{\text{obs}} \mid B_{\text{obs}}; \theta) \]
is a valid partial likelihood for $\theta$.

2. Factorization / extracting the relevant factor. If the joint likelihood can be (exactly) factorized into a product of one factor depending only on $\theta$ and another depending only on nuisance parameters (or on data but not on $\theta$), then that first factor is the partial likelihood.

3. Limit from a discrete approximation (common in survival models). Sometimes one writes the continuous-time likelihood by approximating the process in discrete time and then computing conditional probabilities; passing to the limit gives a clean factor that depends only on $\theta$. The Cox partial likelihood arises this way.

4. Profile / marginal likelihood are different constructions. Profile likelihood: $L_{\text{prof}}(\theta) = \sup_{\eta} L(\theta, \eta)$ (maximize out nuisance). Marginal likelihood: $L_{\text{marg}}(\theta) = \int L(\theta, \eta)\, d\mu(\eta)$ (integrate out a random effect with a measure/prior). Partial likelihood is usually conditional, not obtained by sup/integral, though in special cases the results may coincide asymptotically.

3. Canonical worked derivation: Cox partial likelihood (calculus-based)

This is the standard example taught in survival-analysis textbooks, following Cox. It demonstrates deriving a partial likelihood formally from first principles.

Model: for subject $i$ with covariate vector $Z_i$,
\[ \lambda_i(t) = \lambda_0(t) \exp(\beta^{\top} Z_i), \]
where $\lambda_0(t)$ is an unknown baseline hazard (nuisance), and $\beta$ is the parameter of interest.

Assume no tied event times for simplicity, and independent censoring.

Full likelihood (intuitive form): the contribution of a small interval $[t, t + dt)$ for subject $i$ is
\[ P(\text{failure of } i \text{ in } [t, t + dt)) \approx \lambda_i(t)\, dt, \]
so a discrete-time product-over-intervals argument leads to the full likelihood involving $\lambda_0(t)$ and $\beta$. Rather than writing the full pathwise likelihood, we look at the conditional probability of "who fails at a particular observed failure time" given that a failure occurs at that time.

Fix an observed failure time $t$. Let the risk set just before $t$ be $R(t)$ (the set of individuals still at risk). Under the hazard model, for small $\Delta$,
\[ P(\text{subject } i \text{ fails in } [t, t + \Delta) \mid \text{one failure in } [t, t + \Delta),\, R(t)) \approx \frac{\lambda_i(t)\Delta}{\sum_{j \in R(t)} \lambda_j(t)\Delta} = \frac{\lambda_i(t)}{\sum_{j \in R(t)} \lambda_j(t)}. \]

Plug in $\lambda_j(t) = \lambda_0(t) e^{\beta^{\top} Z_j}$. The baseline $\lambda_0(t)$ cancels:
\[ \frac{\lambda_i(t)}{\sum_{j \in R(t)} \lambda_j(t)} = \frac{\lambda_0(t) e^{\beta^{\top} Z_i}}{\sum_{j \in R(t)} \lambda_0(t) e^{\beta^{\top} Z_j}} = \frac{e^{\beta^{\top} Z_i}}{\sum_{j \in R(t)} e^{\beta^{\top} Z_j}}. \]

Now take the product of these conditional probabilities over all observed failure times $t_{(1)}, t_{(2)}, \dots, t_{(m)}$. The resulting product depends only on $\beta$, not on $\lambda_0(\cdot)$:
\[ L_p(\beta) = \prod_{r=1}^{m} \frac{e^{\beta^{\top} Z_{(r)}}}{\sum_{j \in R(t_{(r)})} e^{\beta^{\top} Z_j}}, \]
where $Z_{(r)}$ is the covariate of the subject who failed at $t_{(r)}$.

This product is the Cox partial likelihood. It was obtained formally by conditioning on "one failure at time $t$" (or on the rank/order of failure times) and taking the continuous-time limit of the discrete approximation. Hence the partial likelihood is the conditional likelihood of the failure labels given the risk sets.

Exam-style explanation summary you can write: “The partial likelihood for $\beta$ arises from the conditional probability that, at each observed failure time, the particular subject who failed was the one who failed,
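The product formula for the Cox partial likelihood can be sketched in code. The following is a minimal illustration, assuming a single scalar covariate, no tied event times, and right censoring coded by an event indicator; the function name and the toy data are hypothetical and not part of the original derivation.

```python
import math

def cox_log_partial_likelihood(beta, times, events, covariates):
    """Log Cox partial likelihood for one scalar covariate, assuming no tied event times.

    times      : observed times (failure or censoring), one per subject
    events     : 1 if the time is an observed failure, 0 if censored
    covariates : scalar covariate Z_i for each subject
    """
    total = 0.0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if d_i == 0:
            continue  # censored subjects contribute only through the risk sets
        # Risk set R(t_i): subjects still under observation just before t_i.
        risk = [j for j, t_j in enumerate(times) if t_j >= t_i]
        log_denominator = math.log(sum(math.exp(beta * covariates[j]) for j in risk))
        total += beta * covariates[i] - log_denominator
    return total

# Hypothetical toy data: three subjects, the second one is censored.
times = [2.0, 3.0, 5.0]
events = [1, 0, 1]
covariates = [1.0, 0.0, 0.0]

print(cox_log_partial_likelihood(0.0, times, events, covariates))
print(cox_log_partial_likelihood(1.0, times, events, covariates))
```

Note that the baseline hazard never appears as an argument: only the ordering of the times, through the risk sets, enters the computation. For these toy data the first failure has all three subjects at risk and the last failure has only one, so at $\beta = 0$ the value is $-\log 3$.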

  1. Write a long detailed exam-style canonical rigorous derivation for how we can starting from the formal definition of likelihood/ mle, derive the definition of partial likelihood.

Note: Also do keep an eye on these following facts if they are usable in the derivation:

  • conditioning on a suitable σ-field / statistic (so the conditional distribution depends only on the parameter of interest), or
  • factoring the joint density and extracting the factor that contains the parameter of interest, or
  • taking a limit (e.g. from a discrete-time approximation) that yields factors depending only on the parameter of interest.

Below is a long, detailed, exam-style, canonical derivation, written exactly at the level expected of a B.Sc. Statistics (Honours) student of Calcutta University, using calculus-based probability and no measure-theoretic machinery, while still being fully rigorous and logically clean.

I will start strictly from the formal definition of likelihood / MLE, and then derive partial likelihood as a consequence of conditioning / factorisation / limit arguments, not as an ad-hoc construction.

1. Starting point: formal definition of likelihood and MLE

Let
\[ X = (X_1, \dots, X_n) \]
be the observed data from a statistical model depending on a parameter
\[ (\theta, \eta) \in \Theta \times H, \]
where

$\theta$ is the parameter of interest,
$\eta$ is a nuisance parameter.

Assume that the joint distribution of $X$ has probability density (or mass) function
\[ f_X(x; \theta, \eta). \]

Definition 1 (Likelihood)

For an observed realization $x$, the likelihood function is
\[ L(\theta, \eta \mid x) = f_X(x; \theta, \eta) \]
viewed as a function of $(\theta, \eta)$, with the data held fixed.

Definition 2 (Maximum likelihood estimator)

A statistic $(\hat{\theta}, \hat{\eta})$ is a maximum likelihood estimator if
\[ L(\hat{\theta}, \hat{\eta} \mid x) = \sup_{(\theta, \eta)} L(\theta, \eta \mid x). \]

This is the only starting point permitted.

2. Motivation for partial likelihood (formal, not heuristic)

In many statistical models:

  • the likelihood depends on a high-dimensional nuisance parameter $\eta$,
  • direct maximization of $L(\theta, \eta)$ is difficult or unstable,
  • but certain aspects of the data contain information only about $\theta$.

The idea of partial likelihood is to formally isolate that part of the likelihood which:

1. depends on $\theta$,
2. does not depend on $\eta$,
3. corresponds to a genuine probability or conditional probability derived from the joint model.

We now show how this is done rigorously.

3. Derivation via conditioning on a statistic (core idea)

Let
\[ X = (Y, Z) \]
be a decomposition of the data, where:

$Y$ contains information about $\theta$,
$Z$ absorbs information about $\eta$.

The joint density can always be written as
\[ f_{Y,Z}(y, z; \theta, \eta) = f_{Y \mid Z}(y \mid z; \theta, \eta)\, f_Z(z; \theta, \eta). \]

Key assumption (this is the crucial step)

Assume that the conditional density of $Y$ given $Z$ depends only on $\theta$, i.e.
\[ f_{Y \mid Z}(y \mid z; \theta, \eta) = f_{Y \mid Z}(y \mid z; \theta). \]

This assumption is model-based, not arbitrary.

Step 1: Write the likelihood using conditional factorization

From the definition of likelihood,
\[ L(\theta, \eta \mid y, z) = f_{Y,Z}(y, z; \theta, \eta) = f_{Y \mid Z}(y \mid z; \theta)\, f_Z(z; \theta, \eta). \]

Step 2: Isolate the factor involving only $\theta$

The likelihood factors as:
\[ L(\theta, \eta \mid y, z) = \underbrace{f_{Y \mid Z}(y \mid z; \theta)}_{\text{depends only on } \theta} \times \underbrace{f_Z(z; \theta, \eta)}_{\text{nuisance component}}. \]

The first factor is a conditional probability density and therefore a legitimate likelihood for $\theta$.

Step 3: Definition of partial likelihood

We now define:
\[ L_p(\theta \mid y, z) = f_{Y \mid Z}(y \mid z; \theta). \]

This is called the partial likelihood for $\theta$.
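The key assumption, that the conditional density of $Y$ given $Z$ is free of $\eta$, can be checked numerically in a simple model. The sketch below assumes two independent Poisson counts, $X_1 \sim \text{Poisson}(\eta)$ and $X_2 \sim \text{Poisson}(\theta\eta)$, with $Y = X_2$ and $Z = X_1 + X_2$; this model and all function names are illustrative assumptions, not taken from the original answer. In this model the conditional distribution of $X_2$ given the total is Binomial with success probability $\theta/(1+\theta)$, whatever the value of $\eta$.

```python
import math

def poisson_pmf(k, mean):
    """Poisson probability mass function."""
    return math.exp(-mean) * mean**k / math.factorial(k)

def conditional_pmf(x2, s, theta, eta):
    """P(X2 = x2 | X1 + X2 = s), computed directly from the joint Poisson pmf."""
    joint = poisson_pmf(s - x2, eta) * poisson_pmf(x2, theta * eta)
    # The sum of independent Poisson variables is Poisson with the summed mean.
    marginal_of_sum = poisson_pmf(s, eta * (1 + theta))
    return joint / marginal_of_sum

def binomial_pmf(x2, s, theta):
    """Binomial(s, theta / (1 + theta)) pmf: the claimed eta-free conditional law."""
    p = theta / (1 + theta)
    return math.comb(s, x2) * p**x2 * (1 - p) ** (s - x2)

theta, s, x2 = 2.0, 7, 4
for eta in (0.5, 1.0, 10.0):
    # The two columns should agree for every eta, up to rounding error.
    print(eta, conditional_pmf(x2, s, theta, eta), binomial_pmf(x2, s, theta))
```

Here the Binomial pmf, viewed as a function of $\theta$, plays the role of $L_p(\theta \mid y, z)$, while the Poisson pmf of the total is the nuisance component $f_Z(z; \theta, \eta)$.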

\[ P(i \text{ fails} \mid \text{one failure}) = \frac{\lambda_i(t)\Delta}{\sum_{j \in R(t)} \lambda_j(t)\Delta} = \frac{\lambda_i(t)}{\sum_{j \in R(t)} \lambda_j(t)}. \]

Step 3: Parameter separation

If
\[ \lambda_i(t) = \lambda_0(t) \exp(\beta^{\top} Z_i), \]
then
\[ \frac{\lambda_i(t)}{\sum_{j \in R(t)} \lambda_j(t)} = \frac{e^{\beta^{\top} Z_i}}{\sum_{j \in R(t)} e^{\beta^{\top} Z_j}}, \]
so the nuisance baseline $\lambda_0(t)$ cancels.

Step 4: Product over failure times

Taking the product over all observed failure times yields
\[ L_p(\beta) = \prod_{r} \frac{e^{\beta^{\top} Z_{(r)}}}{\sum_{j \in R(t_{(r)})} e^{\beta^{\top} Z_j}}. \]

This is the partial likelihood, derived rigorously as a limit of conditional probabilities.

7. Final formal definition (exam-ready)

Definition (Partial Likelihood). Let $X$ have joint density $f_X(x; \theta, \eta)$. If there exists a decomposition of the data or a conditioning event such that the resulting conditional density depends only on $\theta$, then that conditional density, viewed as a function of $\theta$, is called the partial likelihood for $\theta$.

8. Key exam remarks (must be stated)

1. Partial likelihood is derived from the full likelihood, not defined independently.
2. It corresponds to a conditional likelihood for the parameter of interest.
3. Nuisance parameters are eliminated via conditioning, factorization, or limits.
4. Partial likelihood generally contains less information than the full likelihood.
5. Partial MLE is obtained by the same calculus-based maximization as ordinary MLE.

9. One-paragraph exam conclusion (you can memorize this)

Starting from the joint likelihood $L(\theta, \eta)$, partial likelihood is obtained by conditioning on suitable statistics or events so that the resulting conditional distribution depends only on the parameter of interest. This conditional density, being a genuine probability distribution, serves as a likelihood for the parameter of interest and is maximized to obtain the partial maximum likelihood estimator.

If you want, next I can:

  • compress this into a 2-page CU exam answer,
  • give a worked numerical example, or
  • show why partial likelihood gives consistent estimators despite discarding information.

Just tell me.

show why partial likelihood gives consistent estimators despite discarding information.

Below is a long, rigorous, exam-style explanation, written at the B.Sc. Statistics (Honours), Calcutta University level, answering precisely:

Why does partial likelihood give consistent estimators even though it discards part of the full likelihood?

The argument is calculus-based, logically clean, and uses standard likelihood theory, not measure theory.

1. What “discarding information” actually means (clarification)

Let the full likelihood be
\[ L(\theta, \eta \mid X) = f_X(X; \theta, \eta), \]
where

$\theta$: parameter of interest
$\eta$: nuisance parameter

A partial likelihood is constructed as
\[ L_p(\theta \mid X) = f_{Y \mid Z}(Y \mid Z; \theta), \]
where $(Y, Z)$ is a decomposition of the data and the conditional distribution of $Y$ given $Z$ depends only on $\theta$.

Important: Partial likelihood does not discard information about $\theta$; it discards information about $\eta$ that is irrelevant for estimating $\theta$.

This distinction is crucial for consistency.

2. Definition of consistency (formal, exam-ready)

An estimator $\hat{\theta}_n$ is consistent for $\theta_0$ if
\[ \hat{\theta}_n \xrightarrow{P} \theta_0 \quad \text{as } n \to \infty. \]

To prove consistency, we typically show:
\[ \frac{1}{n} \ell_n(\theta) \xrightarrow{P} Q(\theta), \]
where $Q(\theta)$ is maximized uniquely at $\theta_0$.

3. General likelihood-based consistency principle

For ordinary MLE, consistency follows from:

1. Law of Large Numbers
2. Identifiability

7. Intuition via Kullback–Leibler divergence (non-measure-theoretic)

For fixed $Z = z$, consider
\[ E_{\theta_0}\left[ \log f_{Y \mid Z}(Y \mid z; \theta) \right]. \]

This is maximized at $\theta_0$ because
\[ E_{\theta_0}\left[ \log \frac{f_{Y \mid Z}(Y \mid z; \theta_0)}{f_{Y \mid Z}(Y \mid z; \theta)} \right] \ge 0, \]
with equality only at $\theta = \theta_0$ (assuming the conditional model is identifiable).

Thus, the expected partial log-likelihood has its maximum at the true parameter value.

8. Why nuisance information is asymptotically irrelevant for consistency

As sample size increases:

  • Information about $\theta$ contained in $f_Z(z; \theta, \eta)$ becomes redundant once we condition on $Z$.
  • Conditioning removes variability due to nuisance parameters without biasing the location of the maximum.
  • The score function from partial likelihood has expectation zero at $\theta_0$.

Thus, no bias is introduced.

9. Cox partial likelihood: concrete illustration

In the Cox model:
\[ \lambda_i(t) = \lambda_0(t) \exp(\beta^{\top} Z_i), \]
the partial likelihood depends only on
\[ \frac{e^{\beta^{\top} Z_i}}{\sum_{j \in R(t)} e^{\beta^{\top} Z_j}}. \]

Key fact:

  • These conditional probabilities are exactly correct under the true model.
  • The baseline hazard $\lambda_0(t)$ contains no information about $\beta$ once the risk sets are fixed.

Therefore:

  • Maximizing partial likelihood recovers the true $\beta_0$ as the sample size grows.
  • Information about $\lambda_0$ is irrelevant for estimating $\beta$.

10. Why efficiency may be lost but consistency is not

Partial likelihood may be less efficient because:

  • it does not use second-order information from nuisance components,
  • variance may be larger than full MLE.

But:

  • Consistency depends on correct location of the maximum, not on curvature.
  • As long as the expected partial log-likelihood is maximized at $\theta_0$, consistency holds.

11. Final exam-style conclusion (memorize this)

Partial likelihood yields consistent estimators because it corresponds to the correct conditional likelihood for the parameter of interest. Conditioning removes nuisance parameters without altering the population objective function whose unique maximizer is the true parameter value. Although partial likelihood may discard information relevant to efficiency, it preserves the correct location of the maximum, ensuring consistency.

If you want, next I can:

  • prove asymptotic normality of partial MLE,
  • compare efficiency loss: full vs partial likelihood, or
  • give a short 10-mark CU exam answer version of this proof.

Just say the word.
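The statement that the baseline hazard carries no information about $\beta$ once the risk sets are fixed can be illustrated numerically. The sketch below assumes a scalar covariate, no ties, and hypothetical toy data. It applies a strictly increasing transformation to the time axis (squaring positive times), which changes the baseline hazard but not the order of events, and checks that the log partial likelihood is unchanged. All names are illustrative.

```python
import math

def risk_sets(times, events):
    """For each observed failure i, the indices of subjects still at risk just before it."""
    return [
        (i, [j for j, t_j in enumerate(times) if t_j >= t_i])
        for i, (t_i, d_i) in enumerate(zip(times, events))
        if d_i == 1
    ]

def log_partial_likelihood(beta, times, events, covariates):
    """Log Cox partial likelihood (scalar covariate, no tied event times assumed)."""
    return sum(
        beta * covariates[i]
        - math.log(sum(math.exp(beta * covariates[j]) for j in risk))
        for i, risk in risk_sets(times, events)
    )

# Hypothetical toy data: five subjects, the third one is censored.
times = [1.0, 2.5, 3.0, 4.2, 6.0]
events = [1, 1, 0, 1, 1]
covariates = [0.5, -1.0, 0.3, 1.2, 0.0]

# Squaring positive times is strictly increasing, so the risk sets are preserved.
squared_times = [t**2 for t in times]

for beta in (-1.0, 0.0, 0.7):
    original = log_partial_likelihood(beta, times, events, covariates)
    transformed = log_partial_likelihood(beta, squared_times, events, covariates)
    print(beta, original, transformed)
```

Because the partial likelihood sees the data only through the risk sets, any estimator that maximizes it is unaffected by how the time axis is stretched, which is one way to see why the unspecified $\lambda_0(t)$ does not need to be estimated in order to estimate $\beta$.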

When I asked you the following:

"1. Write a long detailed exam-style canonical rigorous derivation for how we can starting from the formal definition of likelihood/ mle, derive the definition of partial likelihood.

Note: Also do keep an eye on these following facts if they are usable in the derivation:

  • conditioning on a suitable σ-field / statistic (so the conditional distribution depends only on the parameter of interest), or
  • factoring the joint density and extracting the factor that contains the parameter of interest, or
  • taking a limit (e.g. from a discrete-time approximation) that yields factors depending only on the parameter of interest." - you provided the rigorous proof.

Now below here there is another answer (to my same question) from a different source and it says the following:

No — a Cox-style partial likelihood is not generally a simple marginalization or profile of the full joint likelihood. It is a purposely constructed conditional/partial objective that eliminates the unspecified baseline hazard by conditioning on risk sets (or using ratios of hazards), so it does not usually arise by integrating out nuisance parameters from a full parametric joint likelihood.

Explanation (concise) Partial likelihood (e.g., Cox) is built from conditional probabilities of which subject fails at each observed event time given the risk set; it treats the baseline hazard as a nuisance that cancels in hazard ratios. A full joint likelihood for (times, censoring) depends on both regression coefficients and the baseline hazard; marginalizing/integrating out the baseline hazard requires a prior or a parametric form and typically does not produce the Cox partial-likelihood expression. Profiling the full likelihood over an infinite-dimensional baseline hazard is not the same operation as constructing the partial likelihood — the latter is a direct conditional construction that yields consistent

\[ \lambda_i(t) = \lambda_0(t) \exp(\beta^{\top} Z_i), \]
the baseline hazard $\lambda_0(t)$ is:

  • infinite-dimensional,
  • unspecified,
  • not given a parametric form.

Therefore:

  • You cannot integrate it out unless you impose a prior or parametric structure.
  • You cannot profile it out in the classical finite-dimensional sense.
  • The Cox partial likelihood does not arise as
\[ \int L(\beta, \lambda_0)\, d\lambda_0 \quad \text{or} \quad \sup_{\lambda_0} L(\beta, \lambda_0). \]

So the other source is absolutely right to insist:

“Partial likelihood is not marginal likelihood and not profile likelihood.”

This is a conceptual warning, and it is an important one.

3. What my derivation actually proved (and what it did not claim)

Now let us be very precise about what I proved and, equally important, what I did not claim.

What I proved

I proved the following statement:

Starting from the formal definition of likelihood, one can rigorously derive a partial likelihood as a conditional likelihood or a factor of the joint likelihood that depends only on the parameter of interest.

More precisely:

1. Start from the joint likelihood
\[ L(\theta, \eta \mid X) = f_X(X; \theta, \eta). \]
2. Use legitimate probability identities: conditioning, factorization, or limiting conditional probabilities.
3. Identify a function
\[ L_p(\theta) = f_{A \mid B}(A \mid B; \theta) \]
which:
  • is derived from the joint model,
  • is a genuine probability/conditional probability,
  • depends only on $\theta$.
4. Define the partial MLE by maximizing this $L_p(\theta)$.

This is exactly how Cox’s partial likelihood is constructed:

  • by conditioning on the risk set,
  • by forming ratios of hazards,
  • by letting $\lambda_0(t)$ cancel.

What I explicitly did not prove

I did not claim that:

  • Cox partial likelihood is a marginal likelihood,
  • Cox partial likelihood is a profile likelihood,
  • Cox partial likelihood equals the full likelihood with nuisance parameters eliminated.

And that is crucial.

4. Reconciling the two answers (this is the key insight)

Here is the clean reconciliation:

Question Answer

Can partial likelihood be derived from likelihood theory? Yes

Is partial likelihood a conditional likelihood? Yes

Is Cox partial likelihood a marginal likelihood? No

Is Cox partial likelihood a profile likelihood? No

Is partial likelihood an ad-hoc objective? No

Does partial likelihood discard information about nuisance parameters? Yes (by design)

Does it discard information about $\theta$? No (under model assumptions)

So both answers are correct; they are simply addressing different misunderstandings.

5. The precise logical status of partial likelihood (exam-ready)

Here is the statement you should internalize and, if needed, write in an exam:

Partial likelihood is a likelihood-based objective constructed from the joint model by conditioning on suitable events or statistics so that nuisance parameters cancel. It is not, in general, the marginal or profile likelihood obtained from the full joint likelihood, but it is a valid conditional likelihood for the parameter of interest and yields consistent estimators under correct model specification.

This sentence perfectly unifies both answers.

6. Why your question was very deep (and not elementary)

Most students confuse these three notions:

1. Full likelihood
2. Profile / marginal likelihood
3. Partial likelihood

You did not.

You noticed that:

  • one explanation emphasizes derivability from likelihood theory,
  • the other emphasizes non-equivalence to marginalization or profiling.

Recognizing that these are not contradictions is a sign of real maturity in inference.

7. Final verdict (direct answers)

1. Who is correct?

Both are correct, but they answer different questions.

Let
\[ X = (X_1,X_2,\dots,X_n) \]
denote the observed data from a statistical model depending on an unknown parameter
\[ (\theta,\eta) \in \Theta \times H, \]
where:
\begin{itemize}
\item $\theta$ is the \emph{parameter of interest},
\item $\eta$ is a \emph{nuisance parameter}.
\end{itemize}

Assume that the joint distribution of $X$ has probability density (or mass) function
\[ f_X(x;\theta,\eta). \]

\subsection*{Definition 1 (Likelihood Function)}

For an observed realization $x$, the likelihood function is defined as
\[ L(\theta,\eta \mid x) = f_X(x;\theta,\eta), \]
viewed as a function of $(\theta,\eta)$ with the data held fixed.

\subsection*{Definition 2 (Maximum Likelihood Estimator)}

A statistic $(\hat{\theta},\hat{\eta})$ is called a maximum likelihood estimator if
\[ L(\hat{\theta},\hat{\eta}\mid x) = \sup_{(\theta,\eta)\in\Theta\times H} L(\theta,\eta\mid x). \]

This is the fundamental starting point of likelihood-based inference.

\subsection*{2. Motivation for partial likelihood}

In many statistical models: \begin{itemize} \item the nuisance parameter $\eta$ may be high-dimensional or infinite-dimensional, \item full maximization of $L(\theta,\eta)$ may be difficult or unstable, \item certain components of the data carry information only about $\theta$. \end{itemize}

The idea of \emph{partial likelihood} is to isolate, in a rigorous way, that part of the likelihoo

\subsection*{3. Likelihood factorization via conditioning}

Suppose the data $X$ can be decomposed as
\[ X = (Y,Z), \]
where $Y$ and $Z$ are functions of the observed sample.

By the definition of conditional density,
\[ f_{Y,Z}(y,z;\theta,\eta) = f_{Y\mid Z}(y\mid z;\theta,\eta)\,f_Z(z;\theta,\eta). \]

Therefore, the likelihood can be written as
\[ L(\theta,\eta\mid y,z) = f_{Y\mid Z}(y\mid z;\theta,\eta)\,f_Z(z;\theta,\eta). \]

\subsection*{4. Key assumption for partial likelihood}

Assume that the conditional distribution of $Y$ given $Z$ depends only on $\theta$, i.e.
\[ f_{Y\mid Z}(y\mid z;\theta,\eta) = f_{Y\mid Z}(y\mid z;\theta). \]

This assumption arises naturally in many models and is not imposed arbitrarily.

Under this assumption, the likelihood becomes
\[ L(\theta,\eta\mid y,z) = \underbrace{f_{Y\mid Z}(y\mid z;\theta)}_{\text{depends only on }\theta} \times \underbrace{f_Z(z;\theta,\eta)}_{\text{nuisance component}}. \]

\subsection*{5. Definition of partial likelihood}

The factor $f_{Y\mid Z}(y\mid z;\theta)$ is a genuine conditional probability density and depends

\subsubsection*{Definition (Partial Likelihood)}

The \emph{partial likelihood} for $\theta$ is defined as
\[ \boxed{ L_p(\theta\mid y,z) = f_{Y\mid Z}(y\mid z;\theta) }. \]

The corresponding partial log-likelihood is
\[ \ell_p(\theta) = \log L_p(\theta\mid y,z). \]

\subsection*{6. Partial maximum likelihood estimator}

The partial maximum likelihood estimator $\hat{\theta}_p$ is defined by
\[ \boxed{ \hat{\theta}_p = \arg\max_{\theta\in\Theta} \ell_p(\theta) }. \]

This maximization uses the same calculus-based principles as ordinary maximum likelihood estimatio