



















This discussion develops a rigorous, exam-oriented understanding of likelihood-based inference for a B.Sc. Statistics (Honours) student. It begins with formal calculus-based definitions of likelihood and maximum likelihood estimation. From these foundations, partial likelihood is derived systematically by conditioning on suitable σ-fields, factoring joint densities, and using infinitesimal hazard arguments, with the Cox proportional hazards model as the main example. Key assumptions—independent noninformative censoring and tie handling—are stated explicitly. The logic behind consistency of partial likelihood estimators, despite discarding nuisance parameters, is explained. Finally, the Breslow estimator for the baseline cumulative hazard is derived as a natural plug-in estimator following partial likelihood, completing the inferential framework.
Typology: Schemes and Mind Maps
Uploaded on 02/15/2026




















A student of B.Sc. Statistics (Honours) at Calcutta University, India, is reading books on statistical inference. Examples of such books are Theory of Statistical Inference by S. Zacks, Statistical Inference by Casella & Berger, Essential Statistical Inference by Boos & Stefanski, Linear Statistical Inference by C. R. Rao, etc.
A few months ago he learned what likelihood is and the definition of maximum likelihood.
Your task is to write down the formal definition of likelihood and the definition of maximum likelihood.
Note: Keep your approach rigorous and calculus-based instead of measure-theoretic, so that the Indian student can prepare the concepts well and write them in the B.Sc. Statistics (Honours) exam.
Below is a rigorous, calculus-based, exam-ready formulation, written in the style expected in a B.Sc. Statistics (Honours), Calcutta University answer. I will avoid measure-theoretic language, but still keep the definitions precise and mathematically correct, consistent with Zacks, Casella–Berger, Boos–Stefanski, and C. R. Rao.
1. Statistical Model and Joint Density / Mass Function
Let
\[ X_1, X_2, \dots, X_n \]
be a random sample from a population whose distribution depends on an unknown parameter
\[ \theta \in \Theta \subset \mathbb{R}^k. \]
Assume that the joint distribution of the sample has either
a joint probability mass function
\[ f(x_1, \dots, x_n; \theta) \quad \text{(discrete case)} \]
or a joint probability density function
\[ f(x_1, \dots, x_n; \theta) \quad \text{(continuous case)}. \]
Here:
$x_1, \dots, x_n$ are realized observations,
$\theta$ is unknown but fixed.
2. Definition of Likelihood Function
Definition (Likelihood Function)
For a given observed sample
\[ (x_1, x_2, \dots, x_n), \]
the likelihood function of the parameter $\theta$ is defined as
\[ L(\theta \mid x_1, \dots, x_n) = f(x_1, \dots, x_n; \theta), \]
viewed as a function of $\theta$, with the data held fixed.
Important Conceptual Remark (Very Exam-Relevant)
When written as $f(x_1, \dots, x_n; \theta)$, it is a probability (or density) in the variable $x$. When written as $L(\theta \mid x_1, \dots, x_n)$, the same expression is viewed as a function of $\theta$ and is not a probability distribution.
This distinction is emphasized in C. R. Rao and Casella–Berger.
3. Likelihood for Independent Observations
If $X_1, \dots, X_n$ are independent and identically distributed with common pdf/pmf $f(x; \theta)$, then the joint density is
\[ f(x_1, \dots, x_n; \theta) = \prod_{i=1}^{n} f(x_i; \theta). \]
Hence, the likelihood function is
\[ L(\theta) = \prod_{i=1}^{n} f(x_i; \theta). \]
4. Log-Likelihood Function
Because products are inconvenient for calculus, we define the log-likelihood function
\[ \ell(\theta) = \log L(\theta) = \sum_{i=1}^{n} \log f(x_i; \theta). \]
Key Property
Since the logarithm is a strictly increasing function, maximizing $L(\theta)$ is equivalent to maximizing $\ell(\theta)$.
5. Definition of Maximum Likelihood Estimator (MLE)
Definition (Maximum Likelihood Estimator)
A statistic $\hat{\theta} = \hat{\theta}(X_1, \dots, X_n)$ is called a maximum likelihood estimator (MLE) of $\theta$ if it satisfies
\[ L(\hat{\theta}) = \sup_{\theta \in \Theta} L(\theta), \]
or equivalently,
\[ \ell(\hat{\theta}) = \sup_{\theta \in \Theta} \ell(\theta), \]
provided the supremum is attained in $\Theta$.
6. Calculus-Based Characterization (First-Order Conditions)
If $\theta$ is a scalar parameter, $\ell$ is differentiable, and $\hat{\theta}$ lies in the interior of $\Theta$, then a necessary condition for $\hat{\theta}$ to be an MLE is
\[ \left. \frac{d}{d\theta} \ell(\theta) \right|_{\theta = \hat{\theta}} = 0. \]
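As a check on the definitions above, here is a minimal worked sketch for one specific model. The exponential density $f(x;\theta) = \theta e^{-\theta x}$ is an assumption chosen for illustration; it does not appear in the original answer.

```latex
% Illustrative model (assumed): X_1,...,X_n i.i.d. with
% f(x;\theta) = \theta e^{-\theta x}, x > 0, \theta > 0.
% Likelihood and log-likelihood:
\[ L(\theta) = \prod_{i=1}^{n} \theta e^{-\theta x_i}
             = \theta^{n} \exp\Big( -\theta \sum_{i=1}^{n} x_i \Big) \]
\[ \ell(\theta) = n \log \theta - \theta \sum_{i=1}^{n} x_i \]
% First-order condition:
\[ \frac{d\ell}{d\theta} = \frac{n}{\theta} - \sum_{i=1}^{n} x_i = 0
   \quad \Longrightarrow \quad
   \hat{\theta} = \frac{n}{\sum_{i=1}^{n} x_i} = \frac{1}{\bar{x}} \]
% Second-order check:
\[ \frac{d^{2}\ell}{d\theta^{2}} = -\frac{n}{\theta^{2}} < 0
   \quad \text{for all } \theta > 0 \]
```

Because the second derivative is negative everywhere on $\theta > 0$, the stationary point is the unique global maximizer, so the first-order condition is also sufficient in this example.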
likelihood starting from the formal definition of likelihood or maximum likelihood and by changing some appropriate settings in the concept in the middle of the proof?
Short answer (exam-ready): Yes. Under clear probabilistic/conditioning assumptions you can derive a partial likelihood formally from the usual definition of likelihood. However, you must not think of a partial likelihood as "just dropping terms" of the full likelihood without justification. A partial likelihood is obtained by one of these rigorous operations applied to the full joint likelihood:
conditioning on a suitable σ-field / statistic (so the conditional distribution depends only on the parameter of interest), or
factoring the joint density and extracting the factor that contains the parameter of interest, or
taking a limit (e.g. from a discrete-time approximation) that yields factors depending only on the parameter of interest.
Below I give a precise, calculus-based explanation and a canonical worked derivation (Cox model) that you can write in an exam.
Let the observed data be $X$. The usual (full) likelihood for a parameter $(\theta, \eta)$ is
\[ L(\theta, \eta \mid X) = f_X(X; \theta, \eta), \]
viewed as a function of $(\theta, \eta)$ with $X$ fixed. Suppose we are primarily interested in $\theta$ while $\eta$ is a nuisance parameter.
A partial likelihood for $\theta$ is any function $L_p(\theta \mid X)$ constructed from the joint model such that (1) $L_p(\theta \mid X)$ is (proportional to) the probability (or density) of some function/statistic of the data conditional on other parts of the data, and (2) inference about $\theta$ based on $L_p$ is valid under stated model assumptions.
Important conceptual point for exams: partial likelihood is a genuine likelihood for $\theta$ only when it equals (or is proportional to) some conditional or marginal probability/density that depends on $\theta$ but not on nuisance parameters. You must show this equality/derivation; otherwise it is an ad hoc object.
1. Conditional likelihood approach. If the joint density factors or the conditional distribution of some components of the data given others depends only on $\theta$, then that conditional density is a likelihood for $\theta$. Formally, if
\[ f_X(x; \theta, \eta) = g(T(x); \theta)\, h(u(x); \eta) \]
or more generally if
\[ f_X(x; \theta, \eta) = f_{A,B}(a, b; \theta, \eta) = f_{A \mid B}(a \mid b; \theta)\, f_B(b; \eta, \theta), \]
and $f_{A \mid B}(a \mid b; \theta)$ depends only on $\theta$, then
\[ L_p(\theta) = f_{A \mid B}(A_{\text{obs}} \mid B_{\text{obs}}; \theta) \]
is a valid partial likelihood for $\theta$.
2. Factorization / extracting the relevant factor. If the joint likelihood can be (exactly) factorized into a product of one factor depending only on $\theta$ and another depending only on nuisance parameters (or on data but not on $\theta$), then that first factor is the partial likelihood.
3. Limit from a discrete approximation (common in survival models). Sometimes one writes the continuous-time likelihood by approximating the process in discrete time and then computing conditional probabilities; passing to the limit gives a clean factor that depends only on $\theta$. The Cox partial likelihood arises this way.
4. Profile / marginal likelihood are different constructions.
Profile likelihood: $L_{\text{prof}}(\theta) = \sup_{\eta} L(\theta, \eta)$ (maximize out nuisance).
Marginal likelihood: $L_{\text{marg}}(\theta) = \int L(\theta, \eta)\, d\mu(\eta)$ (integrate out a random effect with a measure/prior).
Partial likelihood is usually conditional, not obtained by sup/integral, though in special cases the results may coincide asymptotically.
This is the standard textbook example, due to Cox. It demonstrates deriving a partial likelihood formally from first principles.
Model: for subject $i$ with covariate vector $Z_i$,
\[ \lambda_i(t) = \lambda_0(t) \exp(\beta^{\top} Z_i), \]
where $\lambda_0(t)$ is an unknown baseline hazard (nuisance), and $\beta$ is the parameter of interest.
Assume no tied event times for simplicity, and independent censoring.
Full likelihood (intuitive form): the contribution of a small interval $[t, t + dt)$ for subject $i$, given that $i$ is still at risk at time $t$, is
\[ P(\text{failure of } i \text{ in } [t, t + dt)) \approx \lambda_i(t)\, dt, \]
so a discrete-time product-over-intervals argument leads to the full likelihood involving $\lambda_0(t)$ and $\beta$. Rather than writing the full pathwise likelihood, we look at the conditional probability of "who fails at a particular observed failure time" given that a failure occurs at that time.
Fix an observed failure time $t$. Let the risk set just before $t$ be $R(t)$ (the set of individuals still at risk). Under the hazard model, for small $\Delta$,
\[ P(\text{subject } i \text{ fails in } [t, t + \Delta) \mid \text{one failure in } [t, t + \Delta),\ R(t)) \approx \frac{\lambda_i(t)\Delta}{\sum_{j \in R(t)} \lambda_j(t)\Delta} = \frac{\lambda_i(t)}{\sum_{j \in R(t)} \lambda_j(t)}. \]
Plug in $\lambda_j(t) = \lambda_0(t) e^{\beta^{\top} Z_j}$. The baseline $\lambda_0(t)$ cancels:
\[ \frac{\lambda_i(t)}{\sum_{j \in R(t)} \lambda_j(t)} = \frac{\lambda_0(t) e^{\beta^{\top} Z_i}}{\sum_{j \in R(t)} \lambda_0(t) e^{\beta^{\top} Z_j}} = \frac{e^{\beta^{\top} Z_i}}{\sum_{j \in R(t)} e^{\beta^{\top} Z_j}}. \]
Now take the product of these conditional probabilities over all observed failure times $t_{(1)}, t_{(2)}, \dots, t_{(m)}$. The resulting product depends only on $\beta$, not on $\lambda_0(\cdot)$:
\[ L_p(\beta) = \prod_{r=1}^{m} \frac{e^{\beta^{\top} Z_{(r)}}}{\sum_{j \in R(t_{(r)})} e^{\beta^{\top} Z_j}}, \]
where $Z_{(r)}$ is the covariate of the subject who failed at $t_{(r)}$.
This product is the Cox partial likelihood. It was obtained formally by conditioning on "one failure at time $t$" (or on the rank/order of failure times) and taking the continuous-time limit of the discrete approximation. Hence the partial likelihood is the conditional likelihood of the failure labels given the risk sets.
Exam-style explanation summary you can write: "The partial likelihood for $\beta$ arises from the conditional probability that, at each observed failure time, the particular subject who failed was the one who failed, …
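To make the Cox construction concrete, here is a small worked sketch. The three subjects, their scalar covariates, and the failure order are hypothetical values chosen for illustration, with no censoring and no ties.

```latex
% Hypothetical data (assumed): scalar covariates Z_1 = 0, Z_2 = 1, Z_3 = 2.
% Observed failure order: subject 2, then subject 1, then subject 3.
% Risk sets just before each failure time:
\[ R(t_{(1)}) = \{1, 2, 3\}, \qquad R(t_{(2)}) = \{1, 3\}, \qquad R(t_{(3)}) = \{3\} \]
% One conditional factor per failure time:
\[ L_p(\beta)
   = \frac{e^{\beta}}{1 + e^{\beta} + e^{2\beta}}
     \cdot \frac{1}{1 + e^{2\beta}}
     \cdot \frac{e^{2\beta}}{e^{2\beta}}
   = \frac{e^{\beta}}{(1 + e^{\beta} + e^{2\beta})(1 + e^{2\beta})} \]
% Sanity check at \beta = 0 (covariates have no effect):
\[ L_p(0) = \frac{1}{3} \cdot \frac{1}{2} \cdot 1 = \frac{1}{6} = \frac{1}{3!} \]
```

The value at $\beta = 0$ equals the probability of one particular ordering of three exchangeable subjects, which is what the conditional construction predicts. The baseline hazard $\lambda_0(t)$ never enters the calculation.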
Note: Also do keep an eye on these following facts if they are usable in the derivation:
Below is a long, detailed, exam-style, canonical derivation, written exactly at the level expected of a B.Sc. Statistics (Honours) student of Calcutta University, using calculus-based probability and no measure-theoretic machinery, while still being fully rigorous and logically clean.
I will start strictly from the formal definition of likelihood / MLE, and then derive partial likelihood as a consequence of conditioning / factorisation / limit arguments, not as an ad-hoc construction.
Let
\[ X = (X_1, \dots, X_n) \]
be the observed data from a statistical model depending on a parameter
\[ (\theta, \eta) \in \Theta \times H, \]
where
$\theta$ is the parameter of interest,
$\eta$ is a nuisance parameter.
Assume that the joint distribution of $X$ has probability density (or mass) function
\[ f_X(x; \theta, \eta). \]
Definition 1 (Likelihood)
For an observed realization $x$, the likelihood function is
\[ L(\theta, \eta \mid x) = f_X(x; \theta, \eta), \]
viewed as a function of $(\theta, \eta)$, with the data held fixed.
Definition 2 (Maximum likelihood estimator)
A statistic $(\hat{\theta}, \hat{\eta})$ is a maximum likelihood estimator if
\[ L(\hat{\theta}, \hat{\eta} \mid x) = \sup_{(\theta, \eta)} L(\theta, \eta \mid x). \]
This is the only starting point permitted.
In many statistical models:
the likelihood depends on a high-dimensional nuisance parameter $\eta$,
direct maximization of $L(\theta, \eta)$ is difficult or unstable,
but certain aspects of the data contain information only about $\theta$.
The idea of partial likelihood is to formally isolate that part of the likelihood which:
1. depends on $\theta$,
2. does not depend on $\eta$,
3. corresponds to a genuine probability or conditional probability derived from the joint model.
We now show how this is done rigorously.
Let
\[ X = (Y, Z) \]
be a decomposition of the data, where:
$Y$ contains information about $\theta$,
$Z$ absorbs information about $\eta$.
The joint density can always be written as
\[ f_{Y,Z}(y, z; \theta, \eta) = f_{Y \mid Z}(y \mid z; \theta, \eta)\, f_Z(z; \theta, \eta). \]
Key assumption (this is the crucial step)
Assume that the conditional density of $Y$ given $Z$ depends only on $\theta$, i.e.
\[ f_{Y \mid Z}(y \mid z; \theta, \eta) = f_{Y \mid Z}(y \mid z; \theta). \]
This assumption is model-based, not arbitrary.
Step 1: Write the likelihood using conditional factorization
From the definition of likelihood,
\[ L(\theta, \eta \mid y, z) = f_{Y,Z}(y, z; \theta, \eta) = f_{Y \mid Z}(y \mid z; \theta)\, f_Z(z; \theta, \eta). \]
Step 2: Isolate the factor involving only $\theta$
The likelihood factors as:
\[ L(\theta, \eta \mid y, z) = \underbrace{f_{Y \mid Z}(y \mid z; \theta)}_{\text{depends only on } \theta} \times \underbrace{f_Z(z; \theta, \eta)}_{\text{nuisance component}}. \]
The first factor is a conditional probability density and therefore a legitimate likelihood for $\theta$.
Step 3: Definition of partial likelihood
We now define:
\[ L_p(\theta \mid y, z) = f_{Y \mid Z}(y \mid z; \theta). \]
This is called the partial likelihood for $\theta$.
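The conditioning argument above can be illustrated with a standard two-sample Poisson model. The model and the symbols $X_1$, $X_2$, $S$ are assumptions introduced for this sketch only; they play the roles of the abstract $(Y, Z)$ with $Y = X_2$ and $Z = S$.

```latex
% Assumed model: X_1 ~ Poisson(\eta), X_2 ~ Poisson(\theta\eta), independent.
% \theta (ratio of means) is of interest, \eta is nuisance. Put S = X_1 + X_2.
\[ f(x_1, x_2; \theta, \eta)
   = \frac{e^{-\eta}\eta^{x_1}}{x_1!} \cdot \frac{e^{-\theta\eta}(\theta\eta)^{x_2}}{x_2!} \]
% Marginal of S: Poisson with mean \eta(1+\theta), depends on both parameters.
\[ f_S(s; \theta, \eta) = \frac{e^{-\eta(1+\theta)}\,\{\eta(1+\theta)\}^{s}}{s!} \]
% Conditional of X_2 given S = s: Binomial(s, \theta/(1+\theta)), free of \eta.
\[ f_{X_2 \mid S}(x_2 \mid s; \theta)
   = \binom{s}{x_2} \Big( \frac{\theta}{1+\theta} \Big)^{x_2}
     \Big( \frac{1}{1+\theta} \Big)^{s - x_2} \]
% Partial likelihood and its maximizer (for 0 < x_2 < s):
\[ \ell_p(\theta) = x_2 \log\theta - s \log(1+\theta) + \text{const} \]
\[ \frac{d\ell_p}{d\theta} = \frac{x_2}{\theta} - \frac{s}{1+\theta} = 0
   \quad \Longrightarrow \quad
   \hat{\theta}_p = \frac{x_2}{s - x_2} = \frac{x_2}{x_1} \]
```

Here the discarded factor $f_S(s;\theta,\eta)$ still involves $\theta$, but only through the product $\eta(1+\theta)$, which is why conditioning on $S$ removes $\eta$ cleanly.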
\[ P(i \text{ fails} \mid \text{one failure}) = \frac{\lambda_i(t)\Delta}{\sum_{j \in R(t)} \lambda_j(t)\Delta} = \frac{\lambda_i(t)}{\sum_{j \in R(t)} \lambda_j(t)}. \]
Step 3: Parameter separation
If
\[ \lambda_i(t) = \lambda_0(t) \exp(\beta^{\top} Z_i), \]
then
\[ \frac{\lambda_i(t)}{\sum_{j \in R(t)} \lambda_j(t)} = \frac{e^{\beta^{\top} Z_i}}{\sum_{j \in R(t)} e^{\beta^{\top} Z_j}}, \]
so the nuisance baseline $\lambda_0(t)$ cancels.
Step 4: Product over failure times
Taking the product over all observed failure times yields
\[ L_p(\beta) = \prod_{r} \frac{e^{\beta^{\top} Z_{(r)}}}{\sum_{j \in R(t_{(r)})} e^{\beta^{\top} Z_j}}. \]
This is the partial likelihood, derived rigorously as a limit of conditional probabilities.
1. Partial likelihood is derived from the full likelihood, not defined independently.
2. It corresponds to a conditional likelihood for the parameter of interest.
3. Nuisance parameters are eliminated via conditioning, factorization, or limits.
4. Partial likelihood generally contains less information than the full likelihood.
5. Partial MLE is obtained by the same calculus-based maximization as ordinary MLE.
If you want, next I can:
compress this into a 2-page CU exam answer, give a worked numerical example, or show why partial likelihood gives consistent estimators despite discarding information.
Just tell me.
Definition (Partial Likelihood). Let $X$ have joint density $f_X(x; \theta, \eta)$. If there exists a decomposition of the data or a conditioning event such that the resulting conditional density depends only on $\theta$, then that conditional density, viewed as a function of $\theta$, is called the partial likelihood for $\theta$.
Starting from the joint likelihood $L(\theta, \eta)$, partial likelihood is obtained by conditioning on suitable statistics or events so that the resulting conditional distribution depends only on the parameter of interest. This conditional density, being a genuine probability distribution, serves as a likelihood for the parameter of interest and is maximized to obtain the partial maximum likelihood estimator.
show why partial likelihood gives consistent estimators despite discarding information.
Below is a long, rigorous, exam-style explanation, written at the B.Sc. Statistics (Honours), Calcutta University level, answering precisely:
Why does partial likelihood give consistent estimators even though it discards part of the full likelihood?
The argument is calculus-based, logically clean, and uses standard likelihood theory, not measure theory.
Let the full likelihood be
\[ L(\theta, \eta \mid X) = f_X(X; \theta, \eta), \]
where
$\theta$: parameter of interest
$\eta$: nuisance parameter
A partial likelihood is constructed as
\[ L_p(\theta \mid X) = f_{Y \mid Z}(Y \mid Z; \theta), \]
where $(Y, Z)$ is a decomposition of the data and the conditional distribution of $Y$ given $Z$ depends only on $\theta$.
Important: The factor that partial likelihood keeps is itself a correct likelihood for $\theta$. What it discards is the factor $f_Z$, which carries the information about $\eta$ (and possibly some information about $\theta$, which can cost efficiency but does not move the location of the maximum).
This distinction is crucial for consistency.
An estimator $\hat{\theta}_n$ is consistent for $\theta_0$ if
\[ \hat{\theta}_n \xrightarrow{P} \theta_0 \quad \text{as } n \to \infty. \]
To prove consistency, we typically show:
\[ \frac{1}{n} \ell_n(\theta) \xrightarrow{P} Q(\theta), \]
where
$Q(\theta)$ is maximized uniquely at $\theta_0$.
For ordinary MLE, consistency follows from:
1. Law of Large Numbers
For fixed $Z = z$, consider
\[ E_{\theta_0}\big[ \log f_{Y \mid Z}(Y \mid z; \theta) \big]. \]
This is maximized at $\theta_0$ because
\[ E_{\theta_0}\left[ \log \frac{f_{Y \mid Z}(Y \mid z; \theta_0)}{f_{Y \mid Z}(Y \mid z; \theta)} \right] \ge 0, \]
with equality only at $\theta = \theta_0$ (provided different values of $\theta$ give different conditional densities).
Thus, the expected partial log-likelihood has its maximum at the true parameter value.
As sample size increases:
Information about $\theta$ contained in $f_Z(z; \theta, \eta)$ is not needed once we condition on $Z$.
Conditioning removes variability due to nuisance parameters without biasing the location of the maximum.
The score function from partial likelihood has expectation zero at $\theta_0$.
Thus, no bias is introduced in the location of the maximum.
In the Cox model
\[ \lambda_i(t) = \lambda_0(t) \exp(\beta^{\top} Z_i), \]
the partial likelihood depends only on
\[ \frac{e^{\beta^{\top} Z_i}}{\sum_{j \in R(t)} e^{\beta^{\top} Z_j}}. \]
Key fact:
These conditional probabilities are exactly correct under the true model.
The baseline hazard $\lambda_0(t)$ cancels from them, so they involve only $\beta$ once the risk sets are fixed.
Therefore:
Maximizing partial likelihood recovers the true $\beta_0$ as the sample size grows.
Information about $\lambda_0$ is not needed for estimating $\beta$.
Partial likelihood may be less efficient because:
it ignores whatever information about $\theta$ is contained in the discarded factor,
variance may be larger than that of the full MLE.
But:
Consistency depends on correct location of the maximum, not on curvature.
As long as the expected partial log-likelihood is maximized uniquely at $\theta_0$, consistency holds.
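The two facts used above (the inequality for the expected log-ratio and the zero-mean score) can be sketched as follows. The sketch assumes a continuous $Y$, a common support for all $\theta$, and enough regularity to differentiate under the integral sign; these conditions are assumptions of the sketch, not statements from the original answer.

```latex
% Write f_\theta(y) for f_{Y|Z}(y | z; \theta), with z held fixed.
% (a) Jensen's inequality (log is concave):
\[ E_{\theta_0}\!\left[ \log \frac{f_{\theta}(Y)}{f_{\theta_0}(Y)} \right]
   \le \log E_{\theta_0}\!\left[ \frac{f_{\theta}(Y)}{f_{\theta_0}(Y)} \right]
   = \log \int f_{\theta}(y)\, dy = \log 1 = 0 \]
% Hence E_{\theta_0}[\log f_\theta(Y)] \le E_{\theta_0}[\log f_{\theta_0}(Y)],
% with equality only if f_\theta = f_{\theta_0}.
% (b) The partial score has mean zero at \theta_0:
\[ E_{\theta_0}\!\left[ \frac{\partial}{\partial\theta} \log f_{\theta}(Y)
     \Big|_{\theta=\theta_0} \right]
   = \int \frac{\partial f_{\theta}(y)}{\partial\theta} \Big|_{\theta=\theta_0} dy
   = \frac{\partial}{\partial\theta} \int f_{\theta}(y)\, dy \Big|_{\theta=\theta_0}
   = \frac{\partial}{\partial\theta}(1) = 0 \]
```

Both steps use only the fact that $f_{\theta}$ integrates to one for every $\theta$, which holds for a conditional density exactly as it does for an ordinary one.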
If you want, next I can:
prove asymptotic normality of the partial MLE, compare efficiency loss (full vs partial likelihood), or give a short 10-mark CU exam answer version of this proof.
Just say the word.
Partial likelihood yields consistent estimators because it corresponds to the correct conditional likelihood for the parameter of interest. Conditioning removes nuisance parameters without altering the population objective function whose unique maximizer is the true parameter value. Although partial likelihood may discard information relevant to efficiency, it preserves the correct location of the maximum, ensuring consistency.
When I asked you the following:
"1. Write a long detailed exam-style canonical rigorous derivation for how we can starting from the formal definition of likelihood/ mle, derive the definition of partial likelihood.
Note: Also do keep an eye on these following facts if they are usable in the derivation:
Now below here there is another answer (to my same question) from a different source and it says the following:
No — a Cox-style partial likelihood is not generally a simple marginalization or profile of the full joint likelihood. It is a purposely constructed conditional/partial objective that eliminates the unspecified baseline hazard by conditioning on risk sets (or using ratios of hazards), so it does not usually arise by integrating out nuisance parameters from a full parametric joint likelihood.
Explanation (concise):
Partial likelihood (e.g., Cox) is built from conditional probabilities of which subject fails at each observed event time given the risk set; it treats the baseline hazard as a nuisance that cancels in hazard ratios.
A full joint likelihood for (times, censoring) depends on both regression coefficients and the baseline hazard; marginalizing/integrating out the baseline hazard requires a prior or a parametric form and typically does not produce the Cox partial-likelihood expression.
Profiling the full likelihood over an infinite-dimensional baseline hazard is not the same operation as constructing the partial likelihood; the latter is a direct conditional construction that yields consistent
In the Cox model
\[ \lambda_i(t) = \lambda_0(t) \exp(\beta^{\top} Z_i), \]
the baseline hazard $\lambda_0(t)$ is:
infinite-dimensional,
unspecified,
not given a parametric form.
Therefore:
You cannot integrate it out unless you impose a prior or parametric structure.
You cannot profile it out in the classical finite-dimensional sense.
The Cox partial likelihood does not arise as
\[ \int L(\beta, \lambda_0)\, d\lambda_0 \quad \text{or} \quad \sup_{\lambda_0} L(\beta, \lambda_0). \]
So the other source is absolutely right to insist:
"Partial likelihood is not marginal likelihood and not profile likelihood."
This is a conceptual warning, and it is an important one.
Now let us be very precise about what I proved and, equally important, what I did not claim.
What I proved
I proved the following statement:
Starting from the formal definition of likelihood, one can rigorously derive a partial likelihood as a conditional likelihood or a factor of the joint likelihood that depends only on the parameter of interest.
More precisely:
1. Start from the joint likelihood
\[ L(\theta, \eta \mid X) = f_X(X; \theta, \eta). \]
2. Use legitimate probability identities: conditioning, factorization, or limiting conditional probabilities.
3. Identify a function
\[ L_p(\theta) = f_{A \mid B}(A \mid B; \theta) \]
which: is derived from the joint model, is a genuine probability/conditional probability, depends only on $\theta$.
4. Define the partial MLE by maximizing this $L_p(\theta)$.
This is exactly how Cox's partial likelihood is constructed:
by conditioning on the risk set,
by forming ratios of hazards,
by letting $\lambda_0(t)$ cancel.
What I explicitly did not prove
I did not claim that:
Cox partial likelihood is a marginal likelihood,
Cox partial likelihood is a profile likelihood,
Cox partial likelihood equals the full likelihood with nuisance parameters eliminated.
And that is crucial.
Here is the clean reconciliation:
\begin{tabular}{ll}
Question & Answer \\
\hline
Can partial likelihood be derived from likelihood theory? & Yes \\
Is partial likelihood a conditional likelihood? & Yes \\
Is Cox partial likelihood a marginal likelihood? & No (not by integrating out $\lambda_0$) \\
Is Cox partial likelihood a profile likelihood? & No (not in the classical finite-dimensional sense) \\
Is partial likelihood an ad-hoc objective? & No \\
Does partial likelihood discard information about nuisance parameters? & Yes (by design) \\
Does it discard information about $\theta$? & No (under model assumptions) \\
\end{tabular}
So both answers are correct; they are simply addressing different misunderstandings.
Here is the statement you should internalize and, if needed, write in an exam:
Partial likelihood is a likelihood-based objective constructed from the joint model by conditioning on suitable events or statistics so that nuisance parameters cancel. It is not, in general, the marginal or profile likelihood obtained from the full joint likelihood, but it is a valid conditional likelihood for the parameter of interest and yields consistent estimators under correct model specification.
This sentence unifies both answers.
Most students confuse these three notions:
1. Full likelihood …
You did not.
You noticed that:
one explanation emphasizes derivability from likelihood theory,
the other emphasizes non-equivalence to marginalization or profiling.
Recognizing that these are not contradictions is a sign of real maturity in inference.
1. Who is correct?
✅ Both are correct, but they answer different questions.
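To see concretely what the partial likelihood leaves out, it may help to write a standard form of the full likelihood for right-censored data next to it. The notation $T_i$ (observed time), $\delta_i$ (1 for a failure, 0 for censoring) and $\Lambda_0$ (cumulative baseline hazard) is introduced here as an assumption of this sketch, under independent noninformative censoring.

```latex
% Assumed notation: T_i observed time, \delta_i failure indicator,
% \Lambda_0(t) = \int_0^t \lambda_0(u)\,du.
% Full likelihood, up to factors free of (\beta, \lambda_0):
\[ L(\beta, \lambda_0)
   = \prod_{i=1}^{n}
     \big[ \lambda_0(T_i)\, e^{\beta^{\top} Z_i} \big]^{\delta_i}
     \exp\!\big\{ -\Lambda_0(T_i)\, e^{\beta^{\top} Z_i} \big\} \]
% Partial likelihood: one factor per observed failure, no \lambda_0 anywhere.
\[ L_p(\beta)
   = \prod_{r=1}^{m}
     \frac{e^{\beta^{\top} Z_{(r)}}}{\sum_{j \in R(t_{(r)})} e^{\beta^{\top} Z_j}} \]
```

In the full likelihood $\lambda_0$ appears both pointwise, through $\lambda_0(T_i)$, and through the integral $\Lambda_0(T_i)$, so it cannot be removed by cancelling a single common factor. The partial likelihood avoids the problem by working only with the conditional probabilities at the failure times.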
Let
\[ X = (X_1,X_2,\dots,X_n) \]
denote the observed data from a statistical model depending on an unknown parameter
\[ (\theta,\eta) \in \Theta \times H, \]
where: \begin{itemize} \item $\theta$ is the \emph{parameter of interest}, \item $\eta$ is a \emph{nuisance parameter}. \end{itemize}
Assume that the joint distribution of $X$ has probability density (or mass) function
\[ f_X(x;\theta,\eta). \]
\subsection*{Definition 1 (Likelihood Function)}
For an observed realization $x$, the likelihood function is defined as
\[ L(\theta,\eta \mid x) = f_X(x;\theta,\eta), \]
viewed as a function of $(\theta,\eta)$ with the data held fixed.
\subsection*{Definition 2 (Maximum Likelihood Estimator)}
A statistic $(\hat{\theta},\hat{\eta})$ is called a maximum likelihood estimator if
\[ L(\hat{\theta},\hat{\eta}\mid x) = \sup_{(\theta,\eta)\in\Theta\times H} L(\theta,\eta\mid x). \]
This is the fundamental starting point of likelihood-based inference.
\subsection*{2. Motivation for partial likelihood}
In many statistical models: \begin{itemize} \item the nuisance parameter $\eta$ may be high-dimensional or infinite-dimensional, \item full maximization of $L(\theta,\eta)$ may be difficult or unstable, \item certain components of the data carry information only about $\theta$. \end{itemize}
The idea of \emph{partial likelihood} is to isolate, in a rigorous way, that part of the likelihood that depends on $\theta$ but not on $\eta$.
\subsection*{3. Likelihood factorization via conditioning}
Suppose the data $X$ can be decomposed as
\[ X = (Y,Z), \]
where $Y$ and $Z$ are functions of the observed sample.
By the definition of conditional density,
\[ f_{Y,Z}(y,z;\theta,\eta) = f_{Y\mid Z}(y\mid z;\theta,\eta)\,f_Z(z;\theta,\eta). \]
Therefore, the likelihood can be written as
\[ L(\theta,\eta\mid y,z) = f_{Y\mid Z}(y\mid z;\theta,\eta)\,f_Z(z;\theta,\eta). \]
\subsection*{4. Key assumption for partial likelihood}
Assume that the conditional distribution of $Y$ given $Z$ depends only on $\theta$, i.e.
\[ f_{Y\mid Z}(y\mid z;\theta,\eta) = f_{Y\mid Z}(y\mid z;\theta). \]
This assumption is model-based: it holds in many models but must be checked for the model at hand, and it is not imposed arbitrarily.
Under this assumption, the likelihood becomes
\[ L(\theta,\eta\mid y,z) = \underbrace{f_{Y\mid Z}(y\mid z;\theta)}_{\text{depends only on }\theta} \times \underbrace{f_Z(z;\theta,\eta)}_{\text{nuisance component}}. \]
\subsection*{5. Definition of partial likelihood}
The factor $f_{Y\mid Z}(y\mid z;\theta)$ is a genuine conditional probability density and depends only on $\theta$.
\subsubsection*{Definition (Partial Likelihood)}
The \emph{partial likelihood} for $\theta$ is defined as
\[ \boxed{ L_p(\theta\mid y,z) = f_{Y\mid Z}(y\mid z;\theta) }. \]
The corresponding partial log-likelihood is
\[ \ell_p(\theta) = \log L_p(\theta\mid y,z). \]
\subsection*{6. Partial maximum likelihood estimator}
The partial maximum likelihood estimator $\hat{\theta}_p$ is defined by
\[ \boxed{ \hat{\theta}_p = \arg\max_{\theta\in\Theta} \ell_p(\theta) }. \]
This maximization uses the same calculus-based principles as ordinary maximum likelihood estimation.
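As an illustration of this maximization, here is a sketch for the Cox partial likelihood with a scalar coefficient. The symbols $\beta$, $Z_j$, $Z_{(r)}$, $R(t_{(r)})$ and the weights $w_{rj}$ are assumed from the standard proportional hazards setup and are not defined in this section.

```latex
% Assumed setup: scalar \beta, failure times t_{(1)},...,t_{(m)}, no ties;
% Z_{(r)} is the covariate of the subject failing at t_{(r)}.
\[ \ell_p(\beta) = \sum_{r=1}^{m}
   \Big[ \beta Z_{(r)} - \log \sum_{j \in R(t_{(r)})} e^{\beta Z_j} \Big] \]
% Risk-set weights (they sum to one over j in R(t_{(r)})):
\[ w_{rj}(\beta) = \frac{e^{\beta Z_j}}{\sum_{k \in R(t_{(r)})} e^{\beta Z_k}} \]
% Partial score equation (first-order condition):
\[ \frac{d\ell_p}{d\beta}
   = \sum_{r=1}^{m} \Big[ Z_{(r)} - \sum_{j \in R(t_{(r)})} w_{rj}(\beta)\, Z_j \Big] = 0 \]
% Second derivative: minus a sum of weighted variances, hence non-positive.
\[ \frac{d^{2}\ell_p}{d\beta^{2}}
   = -\sum_{r=1}^{m} \Big[ \sum_{j \in R(t_{(r)})} w_{rj}(\beta)\, Z_j^{2}
     - \Big( \sum_{j \in R(t_{(r)})} w_{rj}(\beta)\, Z_j \Big)^{2} \Big] \le 0 \]
```

The score compares each failing subject's covariate with the weighted average covariate in its risk set. Since the second derivative is never positive, $\ell_p$ is concave in $\beta$, so any solution of the score equation is a maximizer.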