Optimal control And Statistics inference Interconnection Explained, Schemes and Mind Maps of Mathematical Statistics

The discussion developed a geometric view of identifiability, control, and optimal design in statistics. Fisher information was treated intrinsically on parameter manifolds, with rank deficiency characterized by score vectors failing to span the cotangent space. This failure was shown to coincide with non-identifiability and with Lie Algebra Rank Condition (LARC) failure in controlled observation models. A concrete example demonstrated how a single control collapses Fisher rank, while adding a second control restores full rank, identifiability, and controllability. Finally, Hamilton–Jacobi–Bellman design was linked directly to this geometry: optimal probing controls are precisely those that inject Fisher information into missing directions, automatically repairing rank loss.

Typology: Schemes and Mind Maps

2025/2026

Uploaded on 02/15/2026

unknown user
unknown user 🇮🇳


Is optimal control used in mathematical statistics (especially in inference)? If yes, discuss how with two theoretical examples. If no, just write 'no'.

Yes. Optimal control is used in mathematical statistics, and especially in inference. Below I give two theoretical examples that show how control / optimal-stopping / dynamic-programming ideas enter statistical inference (one classical, one modern/adaptive). I state each problem formally and sketch how the optimal-control viewpoint yields the statistical rule.

1) Sequential hypothesis testing as an optimal-stopping / control problem (SPRT)

Setup. Observations $X_1, X_2, \dots$ are i.i.d. with density $f_0$ under $H_0$ and $f_1$ under $H_1$. We must choose a stopping time $\tau$ and a terminal decision $\delta \in \{0, 1\}$ to detect which hypothesis is true.

Control formulation. Introduce a cost-per-sample $c > 0$ and penalties $C_{10}, C_{01}$ for type-I and type-II errors respectively. The objective is to choose $(\tau, \delta)$ to minimize the expected total cost

$$J(\tau, \delta) = \mathbb{E}\big[c\,\tau + C_{10}\,\mathbf{1}\{\delta = 1, H = 0\} + C_{01}\,\mathbf{1}\{\delta = 0, H = 1\}\big],$$

where the expectation is under the unknown true hypothesis with prescribed prior or taken in a minimax sense. This is a stochastic optimal-stopping problem with state equal to the posterior (or likelihood ratio).

Sufficient statistic / state. The (unnormalized) likelihood ratio after $n$ observations,

$$\Lambda_n = \prod_{i=1}^{n} \frac{f_1(X_i)}{f_0(X_i)}$$

(or the posterior $\pi_n = \Pr(H = 1 \mid X_{1:n})$), is a Markov state. The decision to stop or continue depends only on $\Lambda_n$.

Dynamic programming / Bellman equation. Let $V(\lambda)$ be the minimal expected future cost when the current likelihood ratio is $\lambda$. Bellman optimality gives

$$V(\lambda) = \min\Big\{ \underbrace{\min\{C_{10}\Pr(H = 0 \mid \lambda),\; C_{01}\Pr(H = 1 \mid \lambda)\}}_{\text{cost if stop and choose best decision}},\; c + \mathbb{E}[V(\lambda W)] \Big\},$$

where $W = f_1(X)/f_0(X)$ is the one-step likelihood ratio random variable and the expectation is taken under the current posterior predictive.

Result / structure. The optimal rule from this Bellman equation is to stop when $\lambda$ leaves an interval $(A, B)$ and continue while $\lambda \in (A, B)$. Choosing the Lagrange multipliers appropriately (or matching error constraints) yields Wald’s Sequential Probability Ratio Test (SPRT) thresholds. Thus classical SPRT is the solution of an optimal-stopping control problem: you control the stopping time (and terminal decision) to trade off sampling cost vs. error penalties.

Why this is “control”: you actively control the sampling (when to stop) based on the evolving state (likelihood ratio). The Bellman equation and stopping boundaries are derived via dynamic programming, the central tool in optimal control.

2) Adaptive experimental design / active learning as a control problem (minimizing estimation error)

Setup (linear-Gaussian). Unknown parameter $\theta \in \mathbb{R}^p$. At each time $t$ you choose an input (design/control) $x_t \in \mathcal{X}$ and then observe

$$Y_t = x_t^\top \theta + \varepsilon_t, \qquad \varepsilon_t \sim N(0, \sigma^2),$$

independent. You plan $T$ experiments and want a final estimator $\hat{\theta}_T$ with small mean-squared error.

Bayesian prior / state. Put a Gaussian prior $\theta \sim N(m_0, \Sigma_0)$. After $t$ steps the posterior remains Gaussian, $N(m_t, \Sigma_t)$. The posterior covariance evolves deterministically (given the choices $x_1, \dots, x_t$):

$$\Sigma_t^{-1} = \Sigma_{t-1}^{-1} + \frac{1}{\sigma^2}\, x_t x_t^\top,$$

or equivalently

$$\Sigma_t = \Big(\Sigma_{t-1}^{-1} + \frac{1}{\sigma^2}\, x_t x_t^\top\Big)^{-1}.$$

So the posterior covariance $\Sigma_t$ is the state of a deterministic control system, and the design vectors $x_t$ are the control inputs.

Control objective. Choose $x_1, \dots, x_T$ to minimize a functional of the terminal covariance, e.g. $J = \operatorname{tr}(\Sigma_T)$ (A-optimality) or $\log\det(\Sigma_T)$ (D-optimality), subject to constraints $x_t \in \mathcal{X}$. Equivalently, minimize expected squared error of the Bayes estimator.

Optimal-control viewpoint. This is a finite-horizon deterministic control problem on the manifold of symmetric positive definite matrices:

$$\min_{x_{1:T}} \Phi(\Sigma_T) \quad \text{subject to} \quad \Sigma_{t+1} = F(\Sigma_t, x_{t+1}) = \Big(\Sigma_t^{-1} + \frac{1}{\sigma^2}\, x_{t+1} x_{t+1}^\top\Big)^{-1}.$$

Apply dynamic programming / Pontryagin’s principle (or greedy approximations) to derive optimal policies. In special cases analytic structure appears:

One-step (greedy) choice: minimizing $\operatorname{tr}(\Sigma_{t+1})$ given $\Sigma_t$ (with the norm of $x_{t+1}$ constrained, e.g. $\|x_{t+1}\| = 1$) leads to choosing $x_{t+1}$ in the direction of the eigenvector corresponding to the largest eigenvalue of $\Sigma_t$ (intuitively, probe the most uncertain direction). That is a control law mapping the state $\Sigma_t$ to $x_{t+1}$.

For low-dimensional or structured $\mathcal{X}$, dynamic programming can produce the globally optimal adaptive design; in general this is computationally hard and approximate control methods (certainty equivalence, greedy, approximate dynamic programming) are used.

Extensions / dual control. If the $x_t$ also influence a dynamical system you wish to control (not just probe), you face the dual control problem: choose controls that simultaneously regulate the system and reduce parameter uncertainty. That is literally an optimal control problem with an augmented state containing both system state and parameter posterior, which is central in adaptive control and Bayesian experimental design.

Short summary / takeaways

Yes, optimal-control ideas (dynamic programming, Bellman equations, Hamilton–Jacobi–Bellman methods, Pontryagin principle) appear naturally in statistical inference when decisions are made adaptively: when to stop sampling (sequential tests), which experiments to run next (adaptive design / active learning), how to allocate sampling budget (best-arm identification / bandits), and how to control systems while learning parameters (dual control).

Two prototypical theoretical examples are (1) the SPRT viewed as an optimal-stopping problem on the likelihood-ratio state, and (2) Bayesian adaptive design viewed as a finite-horizon control problem on the posterior-covariance state. Both are best framed and solved (exactly or approximately) using optimal-control tools.

If you want, I can (a) write out the Bellman equation and carry through a more explicit derivation of the SPRT thresholds, or (b) set up and solve a small numeric adaptive-design example (linear Gaussian, $p = 2$) to show the covariance trajectory under greedy vs optimal policies. Which would you prefer?
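As a rough illustration of the covariance recursion and the greedy rule described above, here is a minimal Python sketch for $p = 2$. The prior covariance, noise variance, horizon, unit-norm design constraint, and the function names (`posterior_cov_update`, `greedy_design`, `run_designs`) are all assumptions chosen for illustration; the greedy policy is compared with a fixed design, not with the globally optimal policy.

```python
import numpy as np

def posterior_cov_update(Sigma, x, sigma2):
    """One Bayesian update, Sigma_new = (Sigma^{-1} + x x^T / sigma2)^{-1},
    written in Sherman-Morrison form to avoid explicit inverses."""
    Sx = Sigma @ x
    return Sigma - np.outer(Sx, Sx) / (sigma2 + x @ Sx)

def greedy_design(Sigma):
    """Unit-norm design along the top eigenvector of Sigma
    (the currently most uncertain direction)."""
    _, eigvecs = np.linalg.eigh(Sigma)  # eigenvalues in ascending order
    return eigvecs[:, -1]

def run_designs(Sigma0, sigma2, T, policy):
    """Apply `policy` (state -> design vector) for T steps; record tr(Sigma_t)."""
    Sigma = Sigma0.copy()
    traces = [float(np.trace(Sigma))]
    for _ in range(T):
        x = policy(Sigma)
        Sigma = posterior_cov_update(Sigma, x, sigma2)
        traces.append(float(np.trace(Sigma)))
    return Sigma, traces

# Illustrative (assumed) numbers: anisotropic prior, unit noise, six experiments.
Sigma0 = np.diag([4.0, 1.0])
sigma2 = 1.0
T = 6
fixed_x = np.array([0.0, 1.0])  # a non-adaptive design that always probes coordinate 2

Sigma_greedy, tr_greedy = run_designs(Sigma0, sigma2, T, greedy_design)
Sigma_fixed, tr_fixed = run_designs(Sigma0, sigma2, T, lambda S: fixed_x)

print("tr(Sigma_t), greedy:", np.round(tr_greedy, 4))
print("tr(Sigma_t), fixed :", np.round(tr_fixed, 4))
```

In this toy setting the fixed design never reduces uncertainty along the first coordinate, while the greedy rule alternates between directions as their posterior variances shrink.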

| Topic in optimal control | Requires calculus of variations? | Notes |
|---|---|---|
| LQR / Riccati equations | ❌ No | Linear algebra + ODEs. |
| Pontryagin Maximum Principle | ⚠ Helpful but not required | A short intro to variations is usually provided in the chapter. |
| Variational formulations (Euler–Lagrange) | ✔ Yes | But this is an optional part of optimal control. |

4. Recommendation

If your background is in mathematical statistics and you want to use optimal control for inference, sequential

decisions, Bayesian design, etc., then:

Start with dynamic programming / Bellman equations first

(no calculus of variations needed),

Only later, when you need continuous-time control or Hamiltonian methods, pick up the necessary

calculus-of-variations concepts (takes about 1–2 days of study).

Final answer

You can study optimal control without knowing calculus of variations.

It becomes useful only when deriving continuous-time necessary conditions like Pontryagin’s principle, and

even then textbooks teach the necessary parts along the way.

If you want, I can give you (a) a minimal “calculus of variations crash course,” or (b) a recommended

sequence for learning optimal control starting from a statistics background.

One thing about calculus of variation.

We often do general calculus, then vector calculus and

multidimensional calculus (e.g. calculus I, calculus II, calculus III etc.

courses).

Does there exist any vector calculus of variation or multivariate calculus

of variation?

Also the theoretical form of calculus is Analysis. So does there exist any

theoretical version of calculus of variation such as Analysis of variation

or variational analysis?


Short answer first: Yes — there are multivariate / vector versions of the calculus of variations, and there is a

broad modern theoretical field called variational analysis that generalizes and extends the classical calculus

of variations.

Below I’ll sketch the landscape, give the standard formulations, and list concrete examples and pointers so

you can see how the notions you know from single-variable calculus of variations generalize.

1) “Vector” / multivariate calculus of variations — two senses

There are two natural ways the single-variable calculus of variations is extended; both are common and

important.

(A) Vector-valued trajectories (multiple components, 1D domain).

Domain is an interval $a \le x \le b$. The unknown is a vector function $y(x) = (y_1(x), \dots, y_n(x)) : [a, b] \to \mathbb{R}^n$. A typical functional

$$J[y] = \int_a^b L\big(x, y(x), y'(x)\big)\,dx$$

has essentially the same variational derivation as the scalar case; stationarity yields a system of Euler–Lagrange ODEs, one equation per component:

$$\frac{\partial L}{\partial y_i} - \frac{d}{dx}\frac{\partial L}{\partial y_i'} = 0, \qquad i = 1, \dots, n.$$

This is often called the calculus of variations for vector-valued functions (or for curves in $\mathbb{R}^n$). Applications: mechanics with multiple coordinates, optimal control (via calculus of variations), elasticity in 1D, etc.

(B) Multiple integrals / fields (multidimensional domain).

Domain is a region $\Omega \subset \mathbb{R}^m$ and the unknown is a field $u : \Omega \to \mathbb{R}^n$. Functionals take the form

$$J[u] = \int_\Omega F\big(x, u(x), \nabla u(x)\big)\,dx.$$

Variational derivative leads to Euler–Lagrange PDEs (system if $n > 1$):

$$\frac{\partial F}{\partial u^\alpha}(x, u, \nabla u) - \sum_{j=1}^{m} \frac{\partial}{\partial x_j}\left(\frac{\partial F}{\partial(\partial_{x_j} u^\alpha)}\right) = 0, \qquad \alpha = 1, \dots, n.$$

Examples: the Dirichlet energy $F = \tfrac{1}{2}|\nabla u|^2$ gives Laplace’s equation $-\Delta u = 0$; the area functional gives the minimal-surface PDE; nonlinear elasticity and many PDEs arise this way. This branch is usually called “calculus of variations in several variables,” “variational methods for PDEs,” or “field-theoretic calculus of variations.”

2) Theoretical versions / modern umbrella: “Variational analysis”

What you called “Analysis of variation” is essentially captured by several rigorous, more abstract fields:

Classical/theoretical calculus of variations (existence & regularity): uses functional analysis (Sobolev spaces, weak derivatives) to prove existence of minimizers (Tonelli, direct method), study regularity, and handle lower semicontinuity. Key texts: Gelfand–Fomin (introductory), Dacorogna (direct methods).

Variational methods in PDEs / functional analysis: formulates PDEs as Euler–Lagrange equations of energy functionals; uses minimization, critical-point theory (e.g. Mountain Pass) to produce solutions. Evans, Struwe, Lions cover this.

Variational analysis (modern, as a discipline): a broad, rigorous framework that includes nonsmooth analysis, convex analysis, generalized gradients/subdifferentials, set-valued mappings, epi-convergence, and stability of optimization problems. The canonical reference is Rockafellar & Wets, Variational Analysis. This field unifies and extends calculus of variations into optimization, nonsmooth problems, and control.

Geometric / global variational theory: calculus of variations on manifolds, Morse theory (critical-point topology), symplectic/Hamiltonian approaches, and Noether’s theorem (symmetries → conservation laws).

Γ-convergence / variational convergence: theory of convergence of functionals (used in homogenization, phase transitions, Γ-limits).

3) Typical theoretical tools that appear (and how they differ from elementary CV)

Weak formulations & Sobolev spaces: instead of classical differentiability, work with weak derivatives and minimize in $W^{1,p}$ or other Sobolev spaces. This is essential to get existence theorems for PDE

than control theory.

Optimal estimation (Cramér–Rao, unbiasedness, Bayes risk, minimax, decision theory) is rooted in

statistical decision theory and information theory, not control.

Optimal/experimental design (A-/D-/E-optimality, Fisher information, equivalence theorems) is mainly linear algebra, convex analysis, and classical statistics.

Textbooks in these areas develop the necessary dynamic programming or variational arguments internally;

they don’t presuppose an optimal-control course.

Where optimal-control is relevant

Any adaptive, sequential, or real-time decision problem often uses the same tools as control:

dynamic programming / Bellman equation (connects directly to control theory),

optimal-stopping as a special case of stochastic control,

bandits and active learning use control-style exploration–exploitation tradeoffs,

Bayesian adaptive design and dual control explicitly combine estimation + control.

Continuous-time problems (e.g. quickest change detection in diffusion models, stochastic control

formulations of sequential tests) may draw on HJB/Pontryagin ideas.

Typical prerequisites you will actually need

Probability theory (measure-theoretic if you want full rigor)

Stochastic processes / martingales (for stopping)

Mathematical statistics (estimators, Fisher information, decision theory)

Linear algebra and convex optimization (for design and computational methods)

Basic dynamic programming / discrete-time decision theory — this is often enough to bridge to control

ideas you’ll meet

Practical recommendation

If your goal is standard inference, sequential tests, estimation, or nonadaptive design: learn probability,

decision theory, and optimization first — no control class needed.

If you plan to work on adaptive experiments, reinforcement learning, bandits, or problems that merge

control and learning: learn dynamic programming and then an intro to optimal control (LQR,

Pontryagin/PMP at a high level), because the shared vocabulary and methods will pay off.

If you want, I can give a compact two-page cheat-sheet: (A) which control concepts map to which statistical

problems, and (B) a 6–8 week study plan tailored to your current math/stats background. Want that?

Write down the domain/fields of mathematical statistics that use

optimal control and geometric control extensively. Also is geometric

control a subset of optimal control?

Below is a clean, high-level map of where optimal control and geometric control appear inside mathematical statistics, and a clarification of the relationship between the two control fields.

✅ 1. Domains in mathematical statistics that use optimal control

While many parts of statistics do not use control theory, there are specific subfields where optimal control

ideas are central — especially whenever decisions are made sequentially, adaptively, or under uncertainty.

(A) Sequential analysis / sequential inference

Sequential probability ratio tests (SPRT)

Quickest change detection

Sequential confidence intervals

Stopping-time–based estimation

These problems reduce to optimal stopping, a special case of stochastic optimal control.

Tools: dynamic programming, Snell envelopes, HJB equations (in continuous time).
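To make the "optimal stopping via dynamic programming" statement concrete, here is a minimal value-iteration sketch for a Bayesian sequential test between two Bernoulli hypotheses. Everything specific in it is an assumption for illustration: the success probabilities `p0` and `p1`, the costs `c`, `C10`, `C01`, the posterior grid, and the function name `sequential_test_value_iteration`. The state is the posterior probability of $H_1$ rather than the likelihood ratio; the two are in one-to-one correspondence.

```python
import numpy as np

def sequential_test_value_iteration(p0=0.4, p1=0.6, c=0.01, C10=1.0, C01=1.0,
                                    n_grid=2001, max_iter=5000, tol=1e-10):
    """Value iteration for testing Bernoulli(p0) (H0) against Bernoulli(p1) (H1).

    State: pi = P(H1 | data). Returns the grid, the value function, the
    stopping cost, and a boolean mask marking the continuation region.
    """
    pi = np.linspace(0.0, 1.0, n_grid)
    # Cost of stopping now and taking the better terminal decision.
    g = np.minimum(C10 * (1.0 - pi), C01 * pi)
    # Posterior-predictive probabilities of the next observation being 1 or 0.
    pred1 = pi * p1 + (1.0 - pi) * p0
    pred0 = 1.0 - pred1
    # Bayes updates of the posterior after each possible observation.
    post1 = pi * p1 / pred1
    post0 = pi * (1.0 - p1) / pred0

    def continuation_cost(V):
        # Pay c for one more sample, then face the value at the updated posterior.
        return c + pred1 * np.interp(post1, pi, V) + pred0 * np.interp(post0, pi, V)

    V = g.copy()
    for _ in range(max_iter):
        V_new = np.minimum(g, continuation_cost(V))  # Bellman operator
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    continue_mask = continuation_cost(V) < g
    return pi, V, g, continue_mask

pi, V, g, continue_mask = sequential_test_value_iteration()
idx = np.flatnonzero(continue_mask)
lower, upper = pi[idx[0]], pi[idx[-1]]
print(f"continue sampling while {lower:.3f} <= P(H1 | data) <= {upper:.3f}")
```

The continuation region comes out as a single interval of posterior values, which is the interval structure of the SPRT expressed on the posterior scale.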

(B) Bayesian decision theory (sequential decisions)

Bayes estimators under sampling cost

Decision rules based on posterior updates

Multi-stage Bayes risk minimization

Posterior belief is the “state”; the decision rule is a control policy.

(C) Optimal experimental design (adaptive design)

Adaptive D-, A-, E-optimal designs

Bayesian optimal design

Design of experiments for nonlinear models

Information-based control of experiments

This is deterministic optimal control of the Fisher information or of the posterior covariance.

(D) Active learning, multi-armed bandits, reinforcement learning

Best-arm identification (pure exploration)

Bayesian optimization

Adaptive sampling policies

Bandits are literally stochastic control with partial information (belief state is a posterior, control is arm

selection).

(E) Filtering, estimation, and control of stochastic systems

Kalman filter (linear-quadratic-Gaussian control structure)

Kalman–Bucy filtering

Joint estimation + control (dual control, adaptive control)

Statistical inference for dynamical systems strongly overlaps with control.

(F) Stochastic differential equations (SDEs) in statistics

Continuous-time statistical problems for diffusions often lead to:

Stochastic optimal control

Hamilton–Jacobi–Bellman equations

Girsanov/likelihood-ratio controls in filtering and smoothing

(G) Information geometry + statistical manifolds

This is where geometric control enters more explicitly:

Geodesic flows for statistical estimators

Natural gradient descent = steepest descent on a statistical manifold

Optimal transport methods in inference

Fisher–Rao geometry and geodesics
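As a small illustration of the natural-gradient item above, the following sketch uses a Bernoulli model, where the Fisher information is $I(p) = 1/(p(1-p))$. The sample, the starting point, the learning rates, and the function names are assumptions made for this example only.

```python
import numpy as np

def avg_score(p, xbar):
    """Gradient in p of the average Bernoulli(p) log-likelihood, given sample mean xbar."""
    return (xbar - p) / (p * (1.0 - p))

def fisher_info(p):
    """Fisher information of a single Bernoulli(p) observation."""
    return 1.0 / (p * (1.0 - p))

def plain_gradient_step(p, xbar, lr):
    """Ordinary gradient ascent step in the p-coordinate."""
    return p + lr * avg_score(p, xbar)

def natural_gradient_step(p, xbar, lr=1.0):
    """Gradient ascent preconditioned by the inverse Fisher information."""
    return p + lr * avg_score(p, xbar) / fisher_info(p)

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.3, size=1000)   # synthetic data, true p = 0.3 (assumed)
xbar = x.mean()                       # the maximum-likelihood estimate of p

p_start = 0.9
p_nat = natural_gradient_step(p_start, xbar)            # lands on the MLE in one step
p_plain = plain_gradient_step(p_start, xbar, lr=0.01)   # moves only part of the way

print("MLE:", xbar, "natural step:", p_nat, "plain step:", p_plain)
```

Here the Fisher preconditioner cancels the $1/(p(1-p))$ factor in the score, so a unit natural-gradient step reaches the maximum-likelihood estimate regardless of the starting point; this exact cancellation is special to the one-parameter Bernoulli model in this parameterization.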

Studies control systems using differential geometry:

manifolds

Lie groups

distributions and Frobenius integrability

controllability via Lie brackets

sub-Riemannian geometry

geodesic flows of control systems

Geometric control is more structural, focusing on:

controllability

feedback linearization

symmetry

minimal paths in geometric structures

Optimal control and geometric control intersect when:

the state space is a manifold, OR

the optimal-control problem has a geometric structure (e.g. sub-Riemannian geodesics).

So the logical relationship is:

Geometric control is not a subset of optimal control; rather, it provides the geometric foundations for certain optimal control problems, especially when the dynamics or constraints are geometric.

But many optimal control problems on manifolds are solved using geometric control tools.

✅ Summary Table

| Area of Mathematical Statistics | Uses Optimal Control? | Uses Geometric Control? |
|---|---|---|
| Sequential analysis | ✔ strongly | rarely |
| Sequential decision/Bayes risk | ✔ | rarely |
| Optimal experimental design | ✔ | sometimes (geometric design) |
| Bandits / RL | ✔ (dynamic programming) | no |
| Filtering / SDE inference | ✔ | sometimes (Lie-group filters) |
| Information geometry | sometimes | ✔ strongly |
| Shape statistics, directional data | sometimes | ✔ strongly |
| Optimal transport / Schrödinger bridge | ✔ | ✔ strongly |

If you want

I can also produce:

a Venn diagram showing optimal control vs geometric control vs mathematical statistics;

a study plan for entering the control-theoretic side of statistics;

examples of specific theorems (e.g., Fisher geodesics, Lie-algebra controllability, HJB on manifolds) used in statistics.

Just tell me which one.

examples of specific theorems (e.g., Fisher geodesics, Lie-algebra

controllability, HJB on manifolds) used in statistics.


Nice — tasty, concrete request. Below are six specific theorems / results (with short statements, the math

idea, and exactly how each gets used in statistics). I focus on the items you named (Fisher geodesics, Lie-

algebra controllability, HJB on manifolds) and add a few neighboring control/variational results that

statisticians actually lean on.

I’ll mark a few historical names and a key reference so you can chase original sources.

1) Fisher–Rao / Fisher information as a Riemannian metric

Ronald A. Fisher

Statement (informal). For a parametric family $\{p(x; \theta)\}_{\theta \in \Theta}$, the Fisher information matrix

$$I(\theta) = \mathbb{E}_\theta\big[\nabla_\theta \log p(X; \theta)\, \nabla_\theta \log p(X; \theta)^\top\big]$$

defines a Riemannian metric on the parameter manifold. Infinitesimally, the Kullback–Leibler divergence between $\theta$ and $\theta + d\theta$ equals $\tfrac{1}{2}\, d\theta^\top I(\theta)\, d\theta + o(\|d\theta\|^2)$.

Why useful in statistics.

The Fisher–Rao metric gives the natural geometry of parameter space; geodesics describe “least-distinguishable” paths of parameters.

Asymptotic efficiency/Cramér–Rao type bounds are local metric statements: covariance matrices of regular unbiased estimators cannot beat $I(\theta)^{-1}$.

Algorithms: natural gradient descent (Amari) uses the inverse Fisher metric as a preconditioner, and can be treated as a geometric control/flow on the statistical manifold.

Concrete use: choosing experiment directions to most rapidly reduce uncertainty corresponds to following steepest descent w.r.t. the Fisher metric.

2) Fisher geodesics and Rao distance (geodesic characterization)

Result (informal). The geodesic distance induced by the Fisher–Rao metric gives a canonical measure of statistical separation (the Rao distance). For exponential families geodesics and dual connections have closed forms and lead to simple expressions for information-theoretic divergences (useful in asymptotic inference).

Why useful in statistics.

Used to compare models and approximate likelihood contours; underlies information geometry tools used in asymptotic expansions, hypothesis testing geometry, and design on manifolds.

In constrained estimation on a manifold (e.g., directional data on the sphere), geodesic projection is the natural estimator correction step.

3) Lie-algebra rank condition (LARC) / bracket-generating controllability

Theorem (LARC, informal). Consider a control system

$$\dot{x} = f_0(x) + \sum_{i=1}^{m} u_i f_i(x)$$

on a smooth manifold $M$. If the Lie algebra generated by the vector fields $\{f_0, f_1, \dots, f_m\}$ spans the tangent space $T_x M$ at every point $x$ (i.e., the fields are bracket-generating), then the system is (small-time) locally accessible; for driftless systems ($f_0 = 0$) this gives controllability (Chow–Rashevskii type result).
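A quick numerical way to see the rank condition at work is sketched below. The two driftless systems on $\mathbb{R}^3$ (a Heisenberg-type pair and a pair of constant fields) are standard textbook-style examples chosen here as assumptions; they do not come from a statistical model. The helper names `jacobian`, `lie_bracket`, and `larc_rank` are hypothetical, and only first-order brackets are checked, which happens to suffice for these examples (the full condition allows iterated brackets).

```python
import numpy as np

def jacobian(f, x, h=1e-6):
    """Central-difference Jacobian of a vector field f at the point x."""
    n = x.size
    J = np.zeros((n, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = h
        J[:, j] = (f(x + e) - f(x - e)) / (2.0 * h)
    return J

def lie_bracket(f, g):
    """Lie bracket [f, g](x) = Dg(x) f(x) - Df(x) g(x)."""
    return lambda x: jacobian(g, x) @ f(x) - jacobian(f, x) @ g(x)

def larc_rank(fields, x, tol=1e-6):
    """Rank at x of the span of the fields and their first-order brackets."""
    vecs = [f(x) for f in fields]
    for i in range(len(fields)):
        for j in range(i + 1, len(fields)):
            vecs.append(lie_bracket(fields[i], fields[j])(x))
    return int(np.linalg.matrix_rank(np.column_stack(vecs), tol=tol))

# Heisenberg-type pair: two inputs, three states, bracket supplies the third direction.
f1 = lambda x: np.array([1.0, 0.0, -x[1]])
f2 = lambda x: np.array([0.0, 1.0, x[0]])

# Constant pair: the bracket vanishes, so the third direction is never generated.
g1 = lambda x: np.array([1.0, 0.0, 0.0])
g2 = lambda x: np.array([0.0, 1.0, 0.0])

x0 = np.array([0.3, -0.7, 1.1])
print("bracket [f1, f2] at x0:", lie_bracket(f1, f2)(x0))
print("rank with brackets, Heisenberg-type:", larc_rank([f1, f2], x0))
print("rank with brackets, constant fields:", larc_rank([g1, g2], x0))
```

In the first system the bracket $[f_1, f_2] = (0, 0, 2)$ fills in the missing direction, so the rank is 3; in the second the rank stays at 2 and the third coordinate cannot be reached.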

6) Variational analysis / subdifferential calculus used in modern statistics

Variational Analysis

Result (paraphrase). Rockafellar–Wets style variational analysis supplies generalized derivative rules

(subdifferentials, epi-convergence, marginal functions) that let you analyze nonsmooth objective maps and

limits of optimization problems.

Why useful in statistics.

Non-smooth estimators (LASSO, quantile regression, M-estimators) and model selection criteria are

analyzed using subdifferential calculus and epi-convergence — these are exactly variational-analysis

tools.

In adaptive control + statistics (dual control, active learning with nonsmooth cost), the stability and

convergence of approximate policies are studied via variational convergence.
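The role of subdifferential calculus for a nonsmooth estimator can be seen in the simplest LASSO-type problem, $\min_x \tfrac{1}{2}(x - z)^2 + \lambda |x|$, applied coordinate-wise. The sketch below uses the well-known soft-thresholding solution and checks the subgradient optimality condition $z - x \in \lambda\, \partial|x|$; the input values and function names are assumptions for illustration.

```python
import numpy as np

def soft_threshold(z, lam):
    """Coordinate-wise minimizer of 0.5 * (x - z)**2 + lam * |x|."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def satisfies_subgradient_condition(x, z, lam, tol=1e-12):
    """Check z - x in lam * subdiff(|x|) coordinate-wise:
    equal to lam * sign(x) where x != 0, and inside [-lam, lam] where x == 0."""
    r = z - x
    nonzero = x != 0
    ok_nonzero = np.all(np.abs(r[nonzero] - lam * np.sign(x[nonzero])) <= tol)
    ok_zero = np.all(np.abs(r[~nonzero]) <= lam + tol)
    return bool(ok_nonzero and ok_zero)

z = np.array([3.0, -0.5, 1.2, -2.0])   # illustrative unpenalized estimates
lam = 1.0
x_hat = soft_threshold(z, lam)

print("soft-thresholded estimate:", x_hat)
print("optimality condition holds:", satisfies_subgradient_condition(x_hat, z, lam))
```

The condition at coordinates where the estimate is exactly zero is an inclusion rather than an equation, which is why ordinary derivative-based arguments are replaced by subdifferentials here.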

How to read these results as a statistician (practical takeaways)

Fisher geometry gives you the local metric for uncertainty and is the right lens for local asymptotics,

natural gradients, and information-based design.

LARC / bracket conditions are the checklist: if you can’t generate tangent directions with your

experimental inputs, you cannot hope to estimate certain parameters no matter how clever your

estimator is.

HJB / PMP give recipes for deriving optimal adaptive/sampling policies; solving them exactly is typically

hard, but they guide approximations (greedy policies, certainty-equivalence, numerical dynamic

programming).

Variational analysis is essential when objective functions are nonsmooth or when you take limits (e.g., sample size $n \to \infty$, vanishing regularization).

If you want, I’ll: (a) derive a short worked example showing how the HJB gives the SPRT stopping boundary

for a simple diffusion; or (b) show a compact example where LARC fails and an estimator cannot identify a

parameter. Pick one and I’ll expand it into a step-by-step worked example.

derive a short worked example showing how the HJB gives the SPRT

stopping boundary for a simple diffusion


Nice choice — this is a clean illustration of how a continuous-time HJB / dynamic-programming approach

produces the same stopping rule as the classical Sequential Probability Ratio Test (SPRT). I’ll work a complete

short example for the simple drift–diffusion model, show the Bellman/HJB equation on the log-likelihood

(LLR) state, reduce it to an ordinary differential equation, and show how the stopping boundaries are

obtained by value-matching + smooth-fit. (I’ll keep the algebra compact but explicit so you can follow each

step.)

Note: the classical discrete-time SPRT was introduced by Abraham Wald; here we treat a continuous-time

analogue via HJB.

1. Model and Bayesian cost

Observe a continuous process $(X_t)_{t \ge 0}$ satisfying

$$dX_t = \theta\,dt + dW_t,$$

where $W_t$ is standard Brownian motion and $\theta \in \{-\mu, +\mu\}$ with $\mu > 0$. Two hypotheses:

$H_0 : \theta = -\mu$,

$H_1 : \theta = +\mu$.

Start with prior $P(H_1) = P(H_0) = \tfrac{1}{2}$. Sampling costs $c > 0$ per unit time; terminal (0–1) loss: cost $1$ if you choose the wrong hypothesis, $0$ if you choose the correct one. We seek a stopping time $\tau$ and terminal decision $\delta \in \{0, 1\}$ minimizing Bayes risk

$$J = \mathbb{E}\big[c\,\tau + \mathbf{1}\{\text{error at } \tau\}\big].$$

Because of symmetry, the optimal stopping rule will be: stop when the log-likelihood ratio $L_t$ exits an interval $(-a, a)$ and choose the sign of $L_t$.

2. Log-likelihood ratio as state and its dynamics

The log-likelihood ratio (LLR) of $H_1$ vs $H_0$ at time $t$ is

$$L_t = \log \frac{d\mathbb{P}_{H_1}(X_{[0,t]})}{d\mathbb{P}_{H_0}(X_{[0,t]})} = \int_0^t (\theta_1 - \theta_0)\,dX_s - \frac{1}{2}\int_0^t (\theta_1^2 - \theta_0^2)\,ds.$$

With $\theta_1 = +\mu$, $\theta_0 = -\mu$ we have $\theta_1 - \theta_0 = 2\mu$ and $\theta_1^2 - \theta_0^2 = 0$, so the formula simplifies to

$$L_t = 2\mu X_t.$$

This is a stochastic process driven by $X_t$. Using $X_t = \theta t + W_t$ we obtain the dynamics under each hypothesis:

$$L_t = 2\mu(\theta t + W_t) \;\Rightarrow\; dL_t = 2\mu\,dX_t = 2\mu\theta\,dt + 2\mu\,dW_t.$$

So under $H_1$ the drift of $L$ is $2\mu^2$ (positive), and under $H_0$ it is $-2\mu^2$ (negative); its diffusion coefficient is $2\mu$.

For dynamic programming we need the conditional (posterior predictive) evolution of $L$ given the current information. Let $p_t = P(H_1 \mid \mathcal{F}_t)$. In terms of $L_t$,

$$p_t = \frac{e^{L_t}}{1 + e^{L_t}}.$$

Given $\mathcal{F}_t$, the next increment $dX_t$ has conditional drift $(2p_t - 1)\mu\,dt$ (mixture of $\pm\mu$), so the conditional drift of $L$ is

$$\mathbb{E}[dL_t \mid \mathcal{F}_t] = 2\mu \cdot \big((2p_t - 1)\mu\,dt\big) = 4\mu^2\big(p_t - \tfrac{1}{2}\big)\,dt.$$

Using $p_t = \frac{e^{L_t}}{1 + e^{L_t}}$ we can write this drift as a function of $L$:

$$m(L) = 4\mu^2\Big(\frac{e^{L}}{1 + e^{L}} - \frac{1}{2}\Big) = 2\mu^2 \tanh\Big(\frac{L}{2}\Big).$$

The diffusion coefficient (conditional variance) of $L$ is $(2\mu)^2\,dt = 4\mu^2\,dt$. So the posterior-conditional dynamics of the LLR (as a Markov process under the filter measure) are

$$dL_t = m(L_t)\,dt + 2\mu\,d\widetilde{W}_t, \qquad m(L) = 2\mu^2 \tanh\Big(\frac{L}{2}\Big),$$

where $\widetilde{W}$ is the innovation Brownian motion.

3. Bellman / HJB equation
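The step from the LLR dynamics to a formula for $V'(L)$ can be sketched as follows. This is a reconstruction under stated assumptions, not a quotation: it assumes the stopping cost is $g(L) = \min\{p, 1-p\} = 1/(1 + e^{|L|})$, that $V$ is even in $L$ (so $V'(0) = 0$), and it uses the generator implied by the drift $2\mu^2\tanh(L/2)$ and diffusion coefficient $2\mu$ given above.

```latex
% Sketch under the assumptions named in the lead-in (symmetric prior, V even in L).
\begin{align*}
\text{Generator of } L_t:\quad
  & \mathcal{L}V(L) = 2\mu^2\, V''(L) + 2\mu^2 \tanh\!\big(\tfrac{L}{2}\big)\, V'(L), \\
\text{HJB variational inequality:}\quad
  & \min\big\{\, g(L) - V(L),\; c + \mathcal{L}V(L) \,\big\} = 0, \\
\text{Continuation region } |L| < a:\quad
  & 2\mu^2\, V'' + 2\mu^2 \tanh\!\big(\tfrac{L}{2}\big)\, V' + c = 0, \\
\text{Integrating factor } \cosh^2\!\big(\tfrac{L}{2}\big):\quad
  & \Big(\cosh^2\!\big(\tfrac{L}{2}\big)\, V'\Big)'
    = -\frac{c}{2\mu^2}\cosh^2\!\big(\tfrac{L}{2}\big)
    = -\frac{c}{4\mu^2}\,(1 + \cosh L), \\
\text{Using } V'(0) = 0:\quad
  & V'(L) = -\frac{c}{4\mu^2}\, \frac{L + \sinh L}{\cosh^2(L/2)}.
\end{align*}
```

The last line is the kind of expression to which a smooth-fit condition $V'(a) = g'(a)$ can be applied at the boundary.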

Integrate once more to get $V(L)$ (the integral can be written in closed form, but for the purpose of finding the threshold we only need the values of $V$ and $V'$ at $L = a$).

5. Matching at the stopping boundary $\pm a$

Apply the smooth-fit condition at $L = a$: the expression (***) evaluated at $L = a$ must equal the derivative of the stopping cost:

$$V'(a) = -\frac{e^{a}}{(1 + e^{a})^{2}}.$$

Thus the threshold $a > 0$ is the solution of

$$-\frac{c}{4\mu^{2}}\,\frac{a + \sinh a}{\cosh^{2}(a/2)} = -\frac{e^{a}}{(1 + e^{a})^{2}}.$$

We can simplify the right hand side in hyperbolic terms: $\dfrac{e^{a}}{(1 + e^{a})^{2}} = \dfrac{1}{(e^{a/2} + e^{-a/2})^{2}} = \tfrac{1}{4}\operatorname{sech}^{2}(a/2)$.

Indeed

$$\frac{e^{a}}{(1 + e^{a})^{2}} = \frac{1}{(e^{a/2} + e^{-a/2})^{2}} = \frac{1}{4\cosh^{2}(a/2)} = \frac{1}{4}\operatorname{sech}^{2}\left(\frac{a}{2}\right).$$

(You may check this identity by writing $\cosh(a/2) = (e^{a/2} + e^{-a/2})/2$.)

So the smooth-fit equation becomes

$$\frac{c}{4\mu^{2}}\,\frac{a + \sinh a}{\cosh^{2}(a/2)} = \frac{1}{4}\operatorname{sech}^{2}\left(\frac{a}{2}\right).$$

Multiply both sides by $4\cosh^{2}(a/2)$ and cancel the common factor to obtain the implicit equation for $a$:

$$\frac{c}{\mu^{2}}\,(a + \sinh a) = 1. \tag{$\star$}$$

Thus the threshold $a$ is the (unique) positive solution of $(c/\mu^{2})(a + \sinh a) = 1$. Once $a$ is known, the stopping rule is:

$$\tau = \inf\{t \ge 0 : |L_t| \ge a\} = \inf\{t \ge 0 : |2\mu X_t| \ge a\},$$

and the terminal decision is $\mathbf{1}\{L_\tau > 0\}$.

(One can check value-matching $V(a) = g(a)$ is satisfied automatically given the chosen constant of integration obtained when integrating $V'$ and using symmetry; in practice both value-matching and smooth-fit determine the integration constant and the threshold.)

6. Interpretation and connection to SPRT

The HJB / Bellman approach produced an ODE whose boundary condition yields the threshold equation ($\star$). That threshold is the continuous-time Bayesian analogue of Wald’s discrete SPRT thresholds.

For small sampling cost $c$ (you tolerate long sampling to reduce error), the threshold $a$ is large. In the asymptotic/small-$c$ regime the stopping boundaries translate to the familiar log-likelihood thresholds $\log\frac{1-\alpha}{\alpha}$ when one instead prescribes Type-I/II error rates rather than a sampling cost.

The HJB viewpoint clarifies why the optimal rule is "stop when the LLR leaves an interval": the continuation region is an interval around zero determined by the PDE/ODE; that is a general phenomenon in optimal stopping problems for one-dimensional diffusions.

7. Summary (quick recap)

1. Put the LLR $L_t$ as the Markov state (because the posterior is a function of it).

2. Write the HJB variational inequality $\min\{g - V,\; c + \mathcal{L}V\} = 0$.

3. On the continuation interval solve the ODE $2\mu^{2} V'' + 2\mu^{2}\tanh(L/2)\,V' + c = 0$.

4. Use symmetry + smooth-fit/value-matching at $\pm a$ to obtain the implicit threshold equation $(c/\mu^{2})(a + \sinh a) = 1$.

5. The optimal rule is stop when $|L_t| \ge a$ and pick the sign of $L_t$; this matches the SPRT structure.

If you like, I can:

(A) compute a numerical example (choose $\mu$ and $c$) and plot sample $L_t$ trajectories showing stopping at $\pm a$;

(B) show the limiting connection to Wald’s discrete SPRT thresholds (map cost $c$ to error probabilities $\alpha, \beta$); or

(C) expand the algebra and produce a closed form expression for $V(L)$ (integrating the formula for $V'(L)$).

Which follow-up would be most useful?
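The implicit threshold equation $(c/\mu^{2})(a + \sinh a) = 1$ has no elementary closed-form solution, but its left side is strictly increasing in $a$, so a bracketing root finder is enough. The sketch below uses plain bisection; the function names and the sample values of `c` and `mu` are illustrative assumptions.

```python
import math

def threshold_residual(a, c, mu):
    """Left side minus right side of (c/mu^2)*(a + sinh a) = 1."""
    return (c / mu**2) * (a + math.sinh(a)) - 1.0

def solve_threshold(c, mu, tol=1e-12):
    """Bisection for the unique positive root (the residual is increasing in a)."""
    lo, hi = 0.0, 1.0
    while threshold_residual(hi, c, mu) < 0.0:  # grow the bracket until it contains the root
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if threshold_residual(mid, c, mu) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

mu = 1.0  # illustrative signal strength (assumption)
for c in (0.1, 0.01, 0.001):  # illustrative sampling costs (assumption)
    a = solve_threshold(c, mu)
    print(f"c={c:g}  threshold a={a:.4f}")
```

Smaller sampling costs should produce larger thresholds, in line with the interpretation above.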

  1. show the limiting connection to Wald’s discrete SPRT thresholds (map cost $c$ to error probabilities $\alpha, \beta$);
  2. expand the algebra and produce a closed form expression for $V(L)$ (integrating the formula for $V'(L)$).

Great — let’s do both carefully and keep the logic visible. I’ll stay fully analytic and show exactly where the

classical Wald SPRT thresholds emerge from the HJB solution.

I’ll reuse the diffusion/SPRT setup from the previous message and keep the same notation.

Part 1. Limiting connection to Wald’s discrete SPRT thresholds

(mapping sampling cost $c$ ↔ error probabilities $\alpha, \beta$)

1. Continuous-time Bayes rule recap

We found that the optimal stopping rule is

$$\tau = \inf\{t : |L_t| \ge a\},$$

with $a > 0$ determined by the implicit equation

$$\frac{c}{\mu^{2}}\,(a + \sinh a) = 1. \tag{1}$$

At stopping, the terminal decision is

$$\delta = \mathbf{1}\{L_\tau > 0\}.$$

2. Error probabilities induced by threshold $a$

Under the true hypothesis $H_0$ (negative drift), the LLR process drifts to $-\infty$. The probability of making a type-I error (decide $H_1$) is

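$$V'(L) = -\frac{c}{4\mu^{2}}\,\frac{L + \sinh L}{\cosh^{2}(L/2)}. \tag{6}$$

We now integrate this explicitly.

1. Algebraic simplification

Use identities

$$\cosh^{2}\left(\frac{L}{2}\right) = \frac{1 + \cosh L}{2}, \qquad \sinh L = 2\sinh\left(\frac{L}{2}\right)\cosh\left(\frac{L}{2}\right).$$

Then

$$\frac{\sinh L}{\cosh^{2}(L/2)} = 2\tanh\left(\frac{L}{2}\right).$$

So rewrite (6) as

$$V'(L) = -\frac{c}{4\mu^{2}}\left[\frac{L}{\cosh^{2}(L/2)} + 2\tanh\left(\frac{L}{2}\right)\right]. \tag{7}$$

2. Integrate term by term

First term

$$\int \frac{L}{\cosh^{2}(L/2)}\,dL.$$

Let $u = L/2$. Then

$$\int \frac{L}{\cosh^{2}(L/2)}\,dL = 4\int u\,\operatorname{sech}^{2}(u)\,du = 4\,(u\tanh u - \log\cosh u).$$

Returning to $L$:

$$\int \frac{L}{\cosh^{2}(L/2)}\,dL = 2L\tanh\left(\frac{L}{2}\right) - 4\log\cosh\left(\frac{L}{2}\right).$$

Second term

$$\int 2\tanh\left(\frac{L}{2}\right)dL = 4\log\cosh\left(\frac{L}{2}\right).$$

3. Combine

The logarithmic terms cancel cleanly. Therefore

$$V(L) = -\frac{c}{4\mu^{2}}\cdot 2L\tanh\left(\frac{L}{2}\right) + C = -\frac{c}{2\mu^{2}}\,L\tanh\left(\frac{L}{2}\right) + C. \tag{8}$$

4. Determine the constant $C$

Use value-matching at the stopping boundary $L = a$:

$$V(a) = g(a) = \frac{1}{1 + e^{a}}.$$

Thus

$$C = \frac{1}{1 + e^{a}} + \frac{c}{2\mu^{2}}\,a\tanh\left(\frac{a}{2}\right).$$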
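The integration from (6) to (8) can be cross-checked numerically: integrating $V'(L) = -\frac{c}{4\mu^2}\frac{L + \sinh L}{\cosh^2(L/2)}$ from $0$ to $L$ should reproduce $-\frac{c}{2\mu^2} L\tanh(L/2)$, since the constant $C$ drops out of the difference. The sketch below uses a hand-written composite Simpson rule; the helper names and the values `c = 0.05`, `mu = 1.0` are illustrative assumptions.

```python
import math

def v_prime(L, c, mu):
    """Equation (6): V'(L) = -(c / (4 mu^2)) * (L + sinh L) / cosh^2(L/2)."""
    return -(c / (4.0 * mu**2)) * (L + math.sinh(L)) / math.cosh(L / 2.0) ** 2

def v_without_constant(L, c, mu):
    """Equation (8) with the integration constant C omitted."""
    return -(c / (2.0 * mu**2)) * L * math.tanh(L / 2.0)

def simpson(f, x0, x1, n=2000):
    """Composite Simpson rule on [x0, x1] with n (even) subintervals."""
    h = (x1 - x0) / n
    total = f(x0) + f(x1)
    for i in range(1, n):
        total += (4.0 if i % 2 else 2.0) * f(x0 + i * h)
    return total * h / 3.0

c, mu = 0.05, 1.0  # illustrative values (assumption)
gaps = []
for L in (0.5, 1.0, 2.0, 3.0):
    numeric = simpson(lambda x: v_prime(x, c, mu), 0.0, L)
    closed = v_without_constant(L, c, mu) - v_without_constant(0.0, c, mu)
    gaps.append(abs(numeric - closed))
    print(f"L={L}: quadrature={numeric:.10f}  closed form={closed:.10f}")
```

Agreement at several values of $L$ supports the claim that the logarithmic terms cancel.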

5. Final closed form

For $|L| \le a$,

$$V(L) = \frac{1}{1 + e^{a}} + \frac{c}{2\mu^{2}}\left[a\tanh\left(\frac{a}{2}\right) - L\tanh\left(\frac{L}{2}\right)\right].$$

For $|L| \ge a$,

$$V(L) = \frac{1}{1 + e^{|L|}}.$$

This function is:

even in $L$,

$C^{1}$ (smooth-fit),

solves the HJB ODE inside $(-a, a)$,

equals the terminal Bayes risk at the boundary.

Big picture (why this matters)

Wald’s SPRT thresholds drop out as the large-threshold / small-cost limit of an HJB free-boundary problem.

The value function $V(L)$ has a closed analytic form, showing explicitly how posterior uncertainty, sampling cost, and information geometry interact.

This example is the cleanest bridge between classical mathematical statistics and stochastic optimal control: same rule, different language, deeper structure revealed.

If you want, next we can:

generalize this to asymmetric losses ($\alpha \neq \beta$),

redo it for unknown variance,

or show how the same HJB machinery yields optimal confidence sequences instead of tests.

  1. generalize this to asymmetric losses (𝛼≠𝛽),
  2. redo it for unknown variance,

Let’s keep this squarely in the mathematics and strip away folklore.

1. Asymmetric losses: continuous-time SPRT via HJB

Setup (diffusion likelihood ratio)

Consider the classical signal-in-noise model

$$dX_t = \mu\,dt + \sigma\,dW_t, \qquad H_0 : \mu = \mu_0, \quad H_1 : \mu = \mu_1, \quad \mu_1 > \mu_0,$$

with $\sigma$ known.

The log-likelihood ratio $L_t$ evolves as

$$dL_t = a\,dt + b\,dW_t, \qquad a = \frac{(\mu_1 - \mu_0)^{2}}{2\sigma^{2}}, \quad b = \frac{\mu_1 - \mu_0}{\sigma}$$

(this is the dynamics under $H_1$; under $H_0$ the drift is $-a$).

We choose a stopping time $\tau$ and terminal decision $\delta \in \{0, 1\}$.
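As a small illustration of the setup, the coefficients $a = (\mu_1 - \mu_0)^2/(2\sigma^2)$ and $b = (\mu_1 - \mu_0)/\sigma$ can be computed directly. The sketch also evaluates the LLR from an observed path value using the standard Girsanov formula $L_t = \frac{\mu_1 - \mu_0}{\sigma^2}\left(X_t - \frac{\mu_0 + \mu_1}{2}t\right)$ with $X_0 = 0$, which is an assumption here since it is not written out above. Function names and numerical values are hypothetical.

```python
def llr_coefficients(mu0, mu1, sigma):
    """Return (a, b) for dL_t = a dt + b dW_t under H1."""
    delta = mu1 - mu0
    a = delta**2 / (2.0 * sigma**2)
    b = delta / sigma
    return a, b

def llr_from_observation(x_t, t, mu0, mu1, sigma):
    """LLR of H1 against H0 given X_t = x_t at time t (assumes X_0 = 0)."""
    return (mu1 - mu0) / sigma**2 * (x_t - 0.5 * (mu0 + mu1) * t)

mu0, mu1, sigma = 0.0, 1.0, 2.0  # illustrative values (assumption)
a, b = llr_coefficients(mu0, mu1, sigma)
t = 3.0
# Noise-free paths: X_t = mu1 * t under H1 and X_t = mu0 * t under H0.
llr_h1 = llr_from_observation(mu1 * t, t, mu0, mu1, sigma)
llr_h0 = llr_from_observation(mu0 * t, t, mu0, mu1, sigma)
print("a, b:", a, b)
print("mean LLR at time t under H1 and H0:", llr_h1, llr_h0)
```

Along noise-free paths the LLR equals $+at$ under $H_1$ and $-at$ under $H_0$. The coefficients also satisfy $a = b^2/2$.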