

























































The discussion developed a geometric view of identifiability, control, and optimal design in statistics. Fisher information was treated intrinsically on parameter manifolds, with rank deficiency characterized by score vectors failing to span the cotangent space. This failure was shown to coincide with non-identifiability and with Lie Algebra Rank Condition (LARC) failure in controlled observation models. A concrete example demonstrated how a single control collapses Fisher rank, while adding a second control restores full rank, identifiability, and controllability. Finally, Hamilton–Jacobi–Bellman design was linked directly to this geometry: optimal probing controls are precisely those that inject Fisher information into missing directions, automatically repairing rank loss.
Typology: Schemes and Mind Maps
Uploaded on 02/15/2026


























































Is optimal control used in mathematical statistics (especially in
inference). If yes, discuss how with two theoretical examples. If no, just
write 'no'.
Yes, optimal control is used in mathematical statistics, and especially in inference. Below I give two
theoretical examples that show how control / optimal-stopping / dynamic-programming ideas enter
statistical inference (one classical, one modern/adaptive). I state each problem formally and sketch how the
optimal-control viewpoint yields the statistical rule.
Setup. Observations $X_1, X_2, \dots$ are i.i.d. with density $f_0$ under $H_0$ and $f_1$ under $H_1$. We must choose a stopping time $\tau$ and a terminal decision $\delta \in \{0, 1\}$ to detect which hypothesis is true.

Control formulation. Introduce a cost per sample $c > 0$ and penalties $C_{10}, C_{01}$ for type-I and type-II errors respectively. The objective is to choose $(\tau, \delta)$ to minimize the expected total cost

$$J(\tau, \delta) = \mathbb{E}\big[\, c\,\tau + C_{10}\,\mathbf{1}\{\delta = 1, H = 0\} + C_{01}\,\mathbf{1}\{\delta = 0, H = 1\} \,\big],$$

where the expectation is under the unknown true hypothesis with prescribed prior or taken in a minimax sense. This is a stochastic optimal-stopping problem with state equal to the posterior (or likelihood ratio).

Sufficient statistic / state. The (unnormalized) likelihood ratio after $n$ observations,

$$\Lambda_n = \prod_{i=1}^{n} \frac{f_1(X_i)}{f_0(X_i)}$$

(or the posterior $\pi_n = \Pr(H = 1 \mid X_{1:n})$), is a Markov state. The decision to stop or continue depends only on $\Lambda_n$.

Dynamic programming / Bellman equation. Let $V(\lambda)$ be the minimal expected future cost when the current likelihood ratio is $\lambda$. Bellman optimality gives

$$V(\lambda) = \min\Big\{ \underbrace{\min\{ C_{10} \Pr(H = 0 \mid \lambda),\; C_{01} \Pr(H = 1 \mid \lambda) \}}_{\text{cost if stop and choose best decision}},\;\; c + \mathbb{E}[V(\lambda \cdot W)] \Big\},$$

where $W = f_1(X)/f_0(X)$ is the one-step likelihood ratio random variable and the expectation is taken under the current posterior predictive.

Result / structure. The optimal rule from this Bellman equation is to stop when $\lambda$ leaves an interval $(A, B)$ and continue while $\lambda \in (A, B)$. Choosing the Lagrange multipliers appropriately (or matching error constraints) yields Wald's Sequential Probability Ratio Test (SPRT) thresholds. Thus classical SPRT is the solution of an optimal-stopping control problem: you control the stopping time (and terminal decision) to trade off sampling cost vs. error penalties.

Why this is "control": you actively control the sampling (when to stop) based on the evolving state (likelihood ratio). The Bellman equation and stopping boundaries are derived via dynamic programming, the central tool in optimal control.
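As a rough illustration of the Bellman recursion above, here is a minimal value-iteration sketch. It is not part of the original answer and rests on several assumptions made only for illustration: the observations are Bernoulli (so the expectation is an exact two-term sum), the state is the posterior $\pi = \Pr(H = 1 \mid \text{data})$ rather than the likelihood ratio, and the names `solve_bellman`, `stop_cost`, and `interp`, the grid size, and all default parameter values are hypothetical.

```python
def stop_cost(pi, c10, c01):
    """Expected penalty if we stop now and pick the better terminal decision."""
    return min(c10 * (1.0 - pi), c01 * pi)


def interp(values, pi):
    """Linear interpolation of grid values at a posterior pi in [0, 1]."""
    n = len(values) - 1
    pos = pi * n
    lo = min(int(pos), n - 1)
    w = pos - lo
    return (1.0 - w) * values[lo] + w * values[lo + 1]


def solve_bellman(p0=0.4, p1=0.6, c=0.01, c10=1.0, c01=1.0,
                  n_grid=400, n_iter=200):
    """Value iteration for V(pi) = min{stop cost, c + E[V(next posterior)]}.

    Assumption: X ~ Bernoulli(p0) under H0 and Bernoulli(p1) under H1.
    """
    grid = [i / n_grid for i in range(n_grid + 1)]
    V = [stop_cost(pi, c10, c01) for pi in grid]
    for _ in range(n_iter):
        new_V = []
        for pi in grid:
            cont = c  # pay for one more sample
            for x in (0, 1):
                l1 = p1 if x == 1 else 1.0 - p1   # likelihood of x under H1
                l0 = p0 if x == 1 else 1.0 - p0   # likelihood of x under H0
                pred = pi * l1 + (1.0 - pi) * l0  # posterior predictive of x
                post = pi * l1 / pred             # Bayes update of the state
                cont += pred * interp(V, post)
            new_V.append(min(stop_cost(pi, c10, c01), cont))
        V = new_V
    # Grid points where continuing is strictly cheaper than stopping.
    cont_region = [pi for pi, v in zip(grid, V)
                   if v < stop_cost(pi, c10, c01) - 1e-12]
    return grid, V, cont_region


grid, V, cont_region = solve_bellman()
print("continue sampling for pi between",
      min(cont_region), "and", max(cont_region))
```

Under these assumptions the continuation set should come out as an interval of posteriors around 1/2, which corresponds to an interval $(A, B)$ for the likelihood ratio.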
Setup (linear-Gaussian). Unknown parameter $\theta \in \mathbb{R}^p$. At each time $t$ you choose an input (design/control) $x_t \in \mathcal{X}$ and then observe

$$Y_t = x_t^\top \theta + \varepsilon_t, \qquad \varepsilon_t \sim N(0, \sigma^2),$$

independent. You plan $T$ experiments and want a final estimator $\hat\theta_T$ with small mean-squared error.

Bayesian prior / state. Put a Gaussian prior $\theta \sim N(m_0, \Sigma_0)$. After $t$ steps the posterior remains Gaussian, $N(m_t, \Sigma_t)$. The posterior covariance evolves deterministically (given the choices $x_1, \dots, x_t$):

$$\Sigma_t^{-1} = \Sigma_{t-1}^{-1} + \frac{1}{\sigma^2}\, x_t x_t^\top,$$

or equivalently

$$\Sigma_t = \Big( \Sigma_{t-1}^{-1} + \frac{1}{\sigma^2}\, x_t x_t^\top \Big)^{-1}.$$

So the posterior covariance $\Sigma_t$ is the state of a deterministic control system, and the design vectors $x_t$ are the control inputs.

Control objective. Choose $x_1, \dots, x_T$ to minimize a functional of the terminal covariance, e.g. $J = \operatorname{tr}(\Sigma_T)$ (A-optimality) or $\log\det(\Sigma_T)$ (D-optimality), subject to constraints $x_t \in \mathcal{X}$. For A-optimality this is equivalent to minimizing the expected squared error of the Bayes estimator.

Optimal-control viewpoint. This is a finite-horizon deterministic control problem on the manifold of symmetric positive definite matrices:

$$\min_{x_{1:T}} \Phi(\Sigma_T) \quad \text{subject to} \quad \Sigma_{t+1} = F(\Sigma_t, x_{t+1}) = \Big( \Sigma_t^{-1} + \frac{1}{\sigma^2}\, x_{t+1} x_{t+1}^\top \Big)^{-1}.$$

Apply dynamic programming / Pontryagin's principle (or greedy approximations) to derive optimal policies. In special cases analytic structure appears:

One-step (greedy) choice: minimizing $\operatorname{tr}(\Sigma_{t+1})$ given $\Sigma_t$ (under a norm constraint on $x_{t+1}$) leads to choosing $x_{t+1}$ in the direction of the eigenvector corresponding to the largest eigenvalue of $\Sigma_t$ (intuitively, probe the most uncertain direction). That is a control law mapping the state $\Sigma_t$ to $x_{t+1}$.

For low-dimensional or structured $\mathcal{X}$, dynamic programming can produce the globally optimal adaptive design; in general this is computationally hard and approximate control methods (certainty equivalence, greedy, approximate dynamic programming) are used.

Extensions / dual control. If the $x_t$ also influence a dynamical system you wish to control (not just probe), you face the dual control problem: choose controls that simultaneously regulate the system and reduce parameter uncertainty. That is literally an optimal control problem with an augmented state containing both system state and parameter posterior, and it is central in adaptive control and Bayesian experimental design.

Short summary / takeaways

Yes, optimal-control ideas (dynamic programming, Bellman equations, Hamilton–Jacobi–Bellman methods, Pontryagin principle) appear naturally in statistical inference when decisions are made adaptively: when to stop sampling (sequential tests), which experiments to run next (adaptive design / active learning), how to allocate sampling budget (best-arm identification / bandits), and how to control systems while learning parameters (dual control).

Two prototypical theoretical examples are (1) the SPRT viewed as an optimal-stopping problem on the likelihood-ratio state, and (2) Bayesian adaptive design viewed as a finite-horizon control problem on the posterior-covariance state. Both are best framed and solved (exactly or approximately) using optimal-control tools.

If you want, I can (a) write out the Bellman equation and carry through a more explicit derivation of the SPRT thresholds, or (b) set up and solve a small numeric adaptive-design example (linear Gaussian, $p = 2$) to show the covariance trajectory under greedy vs optimal policies. Which would you prefer?
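The greedy one-step rule from the adaptive-design example can be sketched in a few lines. This is only an illustrative sketch, not the numeric example offered above: it assumes a finite candidate set of design vectors, uses the Sherman–Morrison form of the covariance update (algebraically the same as the inverse-precision update in the text when $\Sigma$ is symmetric), and the names `update_cov` and `greedy_design` and all numbers are hypothetical.

```python
def mat_vec(S, x):
    """Matrix-vector product for a square matrix stored as a list of rows."""
    return [sum(S[i][j] * x[j] for j in range(len(x))) for i in range(len(S))]


def update_cov(S, x, sigma2):
    """Posterior covariance after observing Y = x^T theta + noise.

    Sherman-Morrison form of (S^{-1} + x x^T / sigma2)^{-1};
    assumes S is symmetric.
    """
    Sx = mat_vec(S, x)
    denom = sigma2 + sum(xi * sxi for xi, sxi in zip(x, Sx))
    p = len(x)
    return [[S[i][j] - Sx[i] * Sx[j] / denom for j in range(p)]
            for i in range(p)]


def trace(S):
    return sum(S[i][i] for i in range(len(S)))


def greedy_design(S0, candidates, sigma2, T):
    """At each step pick the candidate minimizing tr(Sigma_{t+1}) (A-optimality)."""
    S = [row[:] for row in S0]
    picks = []
    for _ in range(T):
        best = min(range(len(candidates)),
                   key=lambda k: trace(update_cov(S, candidates[k], sigma2)))
        picks.append(best)
        S = update_cov(S, candidates[best], sigma2)
    return picks, S


# Hypothetical p = 2 example: prior variance 4 along e1 and 1 along e2.
S0 = [[4.0, 0.0], [0.0, 1.0]]
candidates = [[1.0, 0.0], [0.0, 1.0]]
picks, S_T = greedy_design(S0, candidates, sigma2=1.0, T=3)
print(picks, trace(S_T))
```

In this toy setting the greedy rule starts by probing the direction with the largest prior variance, in line with the eigenvector heuristic described above. A globally optimal policy would require dynamic programming over the whole horizon.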
| Topic in optimal control | Requires calculus of variations? | Notes |
|---|---|---|
| LQR / Riccati equations | ❌ No | Linear algebra + ODEs. |
| Pontryagin Maximum Principle | ⚠ Helpful but not required | A short intro to variations is usually provided in the chapter. |
| Variational formulations (Euler–Lagrange) | ✔ Yes | But this is an optional part of optimal control. |
If your background is in mathematical statistics and you want to use optimal control for inference, sequential
decisions, Bayesian design, etc., then:
Start with dynamic programming / Bellman equations first
(no calculus of variations needed),
Only later, when you need continuous-time control or Hamiltonian methods, pick up the necessary
calculus-of-variations concepts (takes about 1–2 days of study).
You can study optimal control without knowing calculus of variations.
It becomes useful only when deriving continuous-time necessary conditions like Pontryagin’s principle, and
even then textbooks teach the necessary parts along the way.
If you want, I can give you (a) a minimal “calculus of variations crash course,” or (b) a recommended
sequence for learning optimal control starting from a statistics background.
One thing about calculus of variation.
We often do general calculus, then vector calculus and
multidimensional calculus (e.g. calculus I, calculus II, calculus III etc.
courses).
Does there exist any vector calculus of variation or multivariate calculus
of variation?
Also the theoretical form of calculus is Analysis. So does there exist any
theoretical version of calculus of variation such as Analysis of variation
or variational analysis?
Short answer first: Yes — there are multivariate / vector versions of the calculus of variations, and there is a
broad modern theoretical field called variational analysis that generalizes and extends the classical calculus
of variations.
Below I’ll sketch the landscape, give the standard formulations, and list concrete examples and pointers so
you can see how the notions you know from single-variable calculus of variations generalize.
There are two natural ways the single-variable calculus of variations is extended; both are common and important.

(A) Vector-valued trajectories (multiple components, 1D domain).
Domain is an interval $a \le x \le b$. The unknown is a vector function $y(x) = (y_1(x), \dots, y_n(x)) : [a, b] \to \mathbb{R}^n$. A typical functional

$$J[y] = \int_a^b L\big(x, y(x), y'(x)\big)\, dx$$

has essentially the same variational derivation as the scalar case; stationarity yields a system of Euler–Lagrange ODEs, one equation per component:

$$\frac{\partial L}{\partial y_i} - \frac{d}{dx} \frac{\partial L}{\partial y_i'} = 0, \qquad i = 1, \dots, n.$$

This is often called the calculus of variations for vector-valued functions (or for curves in $\mathbb{R}^n$). Applications: mechanics with multiple coordinates, optimal control (via calculus of variations), elasticity in 1D, etc.

(B) Multiple integrals / fields (multidimensional domain).
Domain is a region $\Omega \subset \mathbb{R}^m$ and the unknown is a field $u : \Omega \to \mathbb{R}^n$. Functionals take the form

$$J[u] = \int_\Omega F\big(x, u(x), \nabla u(x)\big)\, dx.$$

Variational derivative leads to Euler–Lagrange PDEs (a system if $n > 1$):

$$\frac{\partial F}{\partial u^\alpha}(x, u, \nabla u) - \sum_{j=1}^{m} \frac{\partial}{\partial x_j} \left( \frac{\partial F}{\partial(\partial_{x_j} u^\alpha)}(x, u, \nabla u) \right) = 0, \qquad \alpha = 1, \dots, n.$$

Examples: the Dirichlet energy $F = \tfrac{1}{2} |\nabla u|^2$ gives Laplace's equation $-\Delta u = 0$; the area functional gives the minimal-surface PDE; nonlinear elasticity and many PDEs arise this way. This branch is usually called "calculus of variations in several variables," "variational methods for PDEs," or "field-theoretic calculus of variations."

What you called "Analysis of variation" is essentially captured by several rigorous, more abstract fields:

Classical/theoretical calculus of variations (existence & regularity): uses functional analysis (Sobolev spaces, weak derivatives) to prove existence of minimizers (Tonelli, direct method), study regularity, and handle lower semicontinuity. Key texts: Gelfand–Fomin (introductory), Dacorogna (direct methods).

Variational methods in PDEs / functional analysis: formulates PDEs as Euler–Lagrange equations of energy functionals; uses minimization, critical-point theory (e.g. Mountain Pass) to produce solutions. Evans, Struwe, Lions cover this.

Variational analysis (modern, as a discipline): a broad, rigorous framework that includes nonsmooth analysis, convex analysis, generalized gradients/subdifferentials, set-valued mappings, epi-convergence, and stability of optimization problems. The canonical reference is Rockafellar & Wets, Variational Analysis. This field unifies and extends calculus of variations into optimization, nonsmooth problems, and control.

Geometric / global variational theory: calculus of variations on manifolds, Morse theory (critical-point topology), symplectic/Hamiltonian approaches, and Noether's theorem (symmetries → conservation laws).

Γ-convergence / variational convergence: theory of convergence of functionals (used in homogenization, phase transitions, Γ-limits).

Weak formulations & Sobolev spaces: instead of classical differentiability, work with weak derivatives and minimize in $W^{1,p}$ or other Sobolev spaces. This is essential to get existence theorems for PDE
than control theory.
Optimal estimation (Cramér–Rao, unbiasedness, Bayes risk, minimax, decision theory) is rooted in
statistical decision theory and information theory, not control.
Optimal/experimental design (A-, D-, and E-optimality, Fisher information, equivalence
theorems) is mainly linear algebra, convex analysis, and classical statistics.
Textbooks in these areas develop the necessary dynamic programming or variational arguments internally;
they don’t presuppose an optimal-control course.
Adaptive, sequential, or real-time decision problems often use the same tools as control:
dynamic programming / Bellman equation (connects directly to control theory),
optimal-stopping as a special case of stochastic control,
bandits and active learning use control-style exploration–exploitation tradeoffs,
Bayesian adaptive design and dual control explicitly combine estimation + control.
Continuous-time problems (e.g. quickest change detection in diffusion models, stochastic control
formulations of sequential tests) may draw on HJB/Pontryagin ideas.
Probability theory (measure-theoretic if you want full rigor)
Stochastic processes / martingales (for stopping)
Mathematical statistics (estimators, Fisher information, decision theory)
Linear algebra and convex optimization (for design and computational methods)
Basic dynamic programming / discrete-time decision theory — this is often enough to bridge to control
ideas you’ll meet
If your goal is standard inference, sequential tests, estimation, or nonadaptive design: learn probability,
decision theory, and optimization first — no control class needed.
If you plan to work on adaptive experiments, reinforcement learning, bandits, or problems that merge
control and learning: learn dynamic programming and then an intro to optimal control (LQR,
Pontryagin/PMP at a high level), because the shared vocabulary and methods will pay off.
If you want, I can give a compact two-page cheat-sheet: (A) which control concepts map to which statistical
problems, and (B) a 6–8 week study plan tailored to your current math/stats background. Want that?
Write down the domain/fields of mathematical statistics that use
optimal control and geometric control extensively. Also is geometric
control a subset of optimal control?
Below is a clean, high-level map of where optimal control and geometric control appear inside
mathematical statistics, and a clarification of the relationship between the two control fields.
While many parts of statistics do not use control theory, there are specific subfields where optimal control
ideas are central — especially whenever decisions are made sequentially, adaptively, or under uncertainty.
(A) Sequential analysis / sequential inference
Sequential probability ratio tests (SPRT)
Quickest change detection
Sequential confidence intervals
Stopping-time–based estimation
These problems reduce to optimal stopping, a special case of stochastic optimal control.
Tools: dynamic programming, Snell envelopes, HJB equations (in continuous time).
(B) Bayesian decision theory (sequential decisions)
Bayes estimators under sampling cost
Decision rules based on posterior updates
Multi-stage Bayes risk minimization
Posterior belief is the “state”; the decision rule is a control policy.
(C) Optimal experimental design (adaptive design)
Adaptive D-, A-, E-optimal designs
Bayesian optimal design
Design of experiments for nonlinear models
Information-based control of experiments
This is deterministic optimal control of the Fisher information or of the posterior covariance.
(D) Active learning, multi-armed bandits, reinforcement learning
Best-arm identification (pure exploration)
Bayesian optimization
Adaptive sampling policies
Bandits are literally stochastic control with partial information (belief state is a posterior, control is arm
selection).
(E) Filtering, estimation, and control of stochastic systems
Kalman filter (linear-quadratic-Gaussian control structure)
Kalman–Bucy filtering
Joint estimation + control (dual control, adaptive control)
Statistical inference for dynamical systems strongly overlaps with control.
(F) Stochastic differential equations (SDEs) in statistics
Continuous-time statistical problems for diffusions often lead to:
Stochastic optimal control
Hamilton–Jacobi–Bellman equations
Girsanov/likelihood-ratio controls in filtering and smoothing
(G) Information geometry + statistical manifolds
This is where geometric control enters more explicitly:
Geodesic flows for statistical estimators
Natural gradient descent = steepest descent on a statistical manifold
Optimal transport methods in inference
Fisher–Rao geometry and geodesics
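To make the "natural gradient = steepest descent on a statistical manifold" item concrete, here is a minimal sketch for a single Bernoulli parameter. It is an illustration only: the model choice, the function names `natural_gradient_step` and `plain_gradient_step`, and the step sizes are assumptions, not something taken from the discussion above.

```python
def score(p, xbar):
    """Gradient in p of the average Bernoulli log-likelihood with sample mean xbar."""
    return (xbar - p) / (p * (1.0 - p))


def fisher_information(p):
    """Fisher information of one Bernoulli(p) observation."""
    return 1.0 / (p * (1.0 - p))


def plain_gradient_step(p, xbar, lr):
    """Ordinary gradient ascent on the average log-likelihood."""
    return p + lr * score(p, xbar)


def natural_gradient_step(p, xbar, lr=1.0):
    """Gradient ascent preconditioned by the inverse Fisher information."""
    return p + lr * score(p, xbar) / fisher_information(p)


p_start, xbar = 0.2, 0.7
print("plain step:  ", plain_gradient_step(p_start, xbar, lr=0.01))
print("natural step:", natural_gradient_step(p_start, xbar))
```

In this one-parameter family the Fisher preconditioner cancels the curvature factor $p(1-p)$ exactly, so a unit natural-gradient step lands on the sample mean (the maximum-likelihood estimate), whereas the plain step depends on how the model is parameterized.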
Geometric control studies control systems using differential geometry:
manifolds
Lie groups
distributions and Frobenius integrability
controllability via Lie brackets
sub-Riemannian geometry
geodesic flows of control systems
Geometric control is more structural, focusing on:
controllability
feedback linearization
symmetry
minimal paths in geometric structures
Optimal control and geometric control intersect when:
the state space is a manifold, OR
the optimal-control problem has a geometric structure (e.g. sub-Riemannian geodesics).
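So the logical relationship is an overlap rather than a containment: the two fields intersect, but geometric control is not a subset of optimal control.
But many optimal control problems on manifolds are solved using geometric control tools.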
| Area of Mathematical Statistics | Uses Optimal Control? | Uses Geometric Control? |
|---|---|---|
| Sequential analysis | ✔ strongly | rarely |
| Sequential decision/Bayes risk | ✔ | rarely |
| Optimal experimental design | ✔ | sometimes (geometric design) |
| Bandits / RL | ✔ (dynamic programming) | no |
| Filtering / SDE inference | ✔ | sometimes (Lie-group filters) |
| Information geometry | sometimes | ✔ strongly |
| Shape statistics, directional data | sometimes | ✔ strongly |
| Optimal transport / Schrödinger bridge | ✔ | ✔ strongly |
I can also produce:
a Venn diagram showing optimal control vs geometric control vs mathematical statistics;
a study plan for entering the control-theoretic side of statistics;
examples of specific theorems (e.g., Fisher geodesics, Lie-algebra controllability, HJB on manifolds) used
in statistics.
Just tell me which one.
Geometric control is not a subset of optimal control; rather, it provides the geometric foundations for
certain optimal control problems, especially when the dynamics or constraints are geometric.
examples of specific theorems (e.g., Fisher geodesics, Lie-algebra
controllability, HJB on manifolds) used in statistics.
Nice — tasty, concrete request. Below are six specific theorems / results (with short statements, the math
idea, and exactly how each gets used in statistics). I focus on the items you named (Fisher geodesics, Lie-
algebra controllability, HJB on manifolds) and add a few neighboring control/variational results that
statisticians actually lean on.
I’ll mark a few historical names and a key reference so you can chase original sources.
1) Fisher–Rao / Fisher information as a Riemannian metric
Ronald A. Fisher
Statement (informal). For a parametric family $\{p(x; \theta)\}_{\theta \in \Theta}$, the Fisher information matrix

$$I(\theta) = \mathbb{E}_\theta\big[ \nabla_\theta \log p(X; \theta)\, \nabla_\theta \log p(X; \theta)^\top \big]$$

defines a Riemannian metric on the parameter manifold. Infinitesimally, the Kullback–Leibler divergence between $\theta$ and $\theta + d\theta$ equals $\tfrac{1}{2}\, d\theta^\top I(\theta)\, d\theta + o(\|d\theta\|^2)$.

Why useful in statistics.

The Fisher–Rao metric gives the natural geometry of parameter space; geodesics describe "least-distinguishable" paths of parameters.

Asymptotic efficiency/Cramér–Rao type bounds are local metric statements: the covariance matrix of a regular unbiased estimator cannot fall below $I(\theta)^{-1}$ in the positive-semidefinite order.

Algorithms: natural gradient descent (Amari) uses the inverse Fisher metric as a preconditioner and can be treated as a geometric control/flow on the statistical manifold.

Concrete use: choosing experiment directions to most rapidly reduce uncertainty corresponds to following steepest descent w.r.t. the Fisher metric.

2) Fisher geodesics and Rao distance (geodesic characterization)

Result (informal). The geodesic distance induced by the Fisher–Rao metric gives a canonical measure of statistical separation (the Rao distance). For exponential families geodesics and dual connections have closed forms and lead to simple expressions for information-theoretic divergences (useful in asymptotic inference).

Why useful in statistics.

Used to compare models and approximate likelihood contours; underlies information geometry tools used in asymptotic expansions, hypothesis testing geometry, and design on manifolds.

In constrained estimation on a manifold (e.g., directional data on the sphere), geodesic projection is the natural estimator correction step.

3) Lie-algebra rank condition (LARC) / bracket-generating controllability

Theorem (LARC, informal). Consider a control system

$$\dot{x} = f_0(x) + \sum_{i=1}^{m} u_i f_i(x)$$

on a smooth manifold $M$. If the Lie algebra generated by the vector fields $\{f_0, f_1, \dots, f_m\}$ spans the tangent space $T_x M$ at every point $x$ (i.e., the fields are bracket-generating), then the system is (small-time) locally accessible. For driftless systems ($f_0 = 0$) on a connected manifold this gives full controllability (Chow–Rashevskii type result).
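The bracket-generating condition in the LARC statement can be checked numerically on a small example. The sketch below is an illustration under stated assumptions: it uses the driftless "nonholonomic integrator" on $\mathbb{R}^3$ with hypothetical fields $f_1 = (1, 0, -y)$ and $f_2 = (0, 1, x)$, the bracket convention $[f, g] = Dg\, f - Df\, g$, and finite-difference Jacobians; none of these choices come from the original answer.

```python
def jacobian(f, x, h=1e-6):
    """Central-difference Jacobian: J[i][j] = d f_i / d x_j."""
    n = len(x)
    cols = []
    for j in range(n):
        xp, xm = list(x), list(x)
        xp[j] += h
        xm[j] -= h
        fp, fm = f(xp), f(xm)
        cols.append([(fp[i] - fm[i]) / (2.0 * h) for i in range(n)])
    return [[cols[j][i] for j in range(n)] for i in range(n)]


def lie_bracket(f, g, x):
    """Lie bracket [f, g](x) = Dg(x) f(x) - Df(x) g(x)."""
    Jf, Jg = jacobian(f, x), jacobian(g, x)
    fx, gx = f(x), g(x)
    n = len(x)
    return [sum(Jg[i][j] * fx[j] - Jf[i][j] * gx[j] for j in range(n))
            for i in range(n)]


def det3(a, b, c):
    """Determinant of the 3x3 matrix with rows a, b, c."""
    return (a[0] * (b[1] * c[2] - b[2] * c[1])
            - a[1] * (b[0] * c[2] - b[2] * c[0])
            + a[2] * (b[0] * c[1] - b[1] * c[0]))


# Hypothetical control vector fields on R^3 (driftless system).
def f1(x):
    return [1.0, 0.0, -x[1]]


def f2(x):
    return [0.0, 1.0, x[0]]


point = [0.3, -1.2, 0.5]
bracket = lie_bracket(f1, f2, point)
rank_test = det3(f1(point), f2(point), bracket)
print("bracket:", bracket, "determinant:", rank_test)
```

A nonzero determinant means $f_1$, $f_2$, and $[f_1, f_2]$ span the tangent space at that point, even though the two control fields alone span only a plane. With a single control field, every bracket vanishes and the span stays one-dimensional, which is the kind of rank failure the checklist is meant to detect.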
6) Variational analysis / subdifferential calculus used in modern statistics
Variational Analysis
Result (paraphrase). Rockafellar–Wets style variational analysis supplies generalized derivative rules
(subdifferentials, epi-convergence, marginal functions) that let you analyze nonsmooth objective maps and
limits of optimization problems.
Why useful in statistics.
Non-smooth estimators (LASSO, quantile regression, M-estimators) and model selection criteria are
analyzed using subdifferential calculus and epi-convergence — these are exactly variational-analysis
tools.
In adaptive control + statistics (dual control, active learning with nonsmooth cost), the stability and
convergence of approximate policies are studied via variational convergence.
How to read these results as a statistician (practical takeaways)
Fisher geometry gives you the local metric for uncertainty and is the right lens for local asymptotics,
natural gradients, and information-based design.
LARC / bracket conditions are the checklist: if you can’t generate tangent directions with your
experimental inputs, you cannot hope to estimate certain parameters no matter how clever your
estimator is.
HJB / PMP give recipes for deriving optimal adaptive/sampling policies; solving them exactly is typically
hard, but they guide approximations (greedy policies, certainty-equivalence, numerical dynamic
programming).
Variational analysis is essential when objective functions are nonsmooth or when you take limits (e.g.,
sample size $n \to \infty$, vanishing regularization).
If you want, I’ll: (a) derive a short worked example showing how the HJB gives the SPRT stopping boundary
for a simple diffusion; or (b) show a compact example where LARC fails and an estimator cannot identify a
parameter. Pick one and I’ll expand it into a step-by-step worked example.
derive a short worked example showing how the HJB gives the SPRT
stopping boundary for a simple diffusion
Nice choice — this is a clean illustration of how a continuous-time HJB / dynamic-programming approach
produces the same stopping rule as the classical Sequential Probability Ratio Test (SPRT). I’ll work a complete
short example for the simple drift–diffusion model, show the Bellman/HJB equation on the log-likelihood
(LLR) state, reduce it to an ordinary differential equation, and show how the stopping boundaries are
obtained by value-matching + smooth-fit. (I’ll keep the algebra compact but explicit so you can follow each
step.)
Note: the classical discrete-time SPRT was introduced by Abraham Wald; here we treat a continuous-time
analogue via HJB.
1. Model and Bayesian cost
Observe a continuous process $(X_t)_{t \ge 0}$ satisfying

$$dX_t = \theta\, dt + dW_t,$$

where $W_t$ is standard Brownian motion and $\theta \in \{-\mu, +\mu\}$ with $\mu > 0$. Two hypotheses:

$H_0 : \theta = -\mu$,
$H_1 : \theta = +\mu$.

Start with prior $\tfrac{1}{2}$ on each hypothesis. Sampling costs $c > 0$ per unit time; terminal (0–1) loss: cost $1$ if you choose the wrong hypothesis, $0$ if you choose the correct one. We seek a stopping time $\tau$ and terminal decision $\delta \in \{0, 1\}$ minimizing Bayes risk

$$J = \mathbb{E}\big[ c\,\tau + \mathbf{1}\{\text{error at } \tau\} \big].$$

Because of symmetry, the optimal stopping rule will be: stop when the log-likelihood ratio $L_t$ exits an interval $(-a, a)$ and choose the sign of $L_t$.

2. Log-likelihood ratio as state and its dynamics

The log-likelihood ratio (LLR) of $H_1$ vs $H_0$ at time $t$ is

$$L_t = \log \frac{d\mathbb{P}_{H_1}(X_{[0,t]})}{d\mathbb{P}_{H_0}(X_{[0,t]})} = \int_0^t (\theta_1 - \theta_0)\, dX_s - \frac{1}{2} \int_0^t (\theta_1^2 - \theta_0^2)\, ds.$$

With $\theta_1 = +\mu$, $\theta_0 = -\mu$ we have $\theta_1 - \theta_0 = 2\mu$ and $\theta_1^2 - \theta_0^2 = 0$, so the formula simplifies to

$$L_t = 2\mu X_t.$$

This is a stochastic process driven by $X_t$. Using $X_t = \theta t + W_t$ we obtain the dynamics under each hypothesis:

$$L_t = 2\mu(\theta t + W_t) \;\Rightarrow\; dL_t = 2\mu\, dX_t = 2\mu\theta\, dt + 2\mu\, dW_t.$$

So under $H_1$ the drift of $L$ is $2\mu^2$ (positive), and under $H_0$ it is $-2\mu^2$ (negative); its diffusion coefficient is $2\mu$.

For dynamic programming we need the conditional (posterior predictive) evolution of $L_t$ given the current information. Let $p_t = \mathbb{P}(H_1 \mid \mathcal{F}_t)$. In terms of $L_t$,

$$p_t = \frac{e^{L_t}}{1 + e^{L_t}}.$$

Given $\mathcal{F}_t$, the next increment $dX_t$ has conditional drift $(2p_t - 1)\mu\, dt$ (mixture of $\pm\mu$), so the conditional drift of $L$ is

$$\mathbb{E}[dL_t \mid \mathcal{F}_t] = 2\mu \cdot \big( (2p_t - 1)\mu\, dt \big) = 4\mu^2 \big( p_t - \tfrac{1}{2} \big)\, dt.$$

Using $p_t = \dfrac{e^{L_t}}{1 + e^{L_t}}$ we can write this drift as a function of $L$:

$$m(L) = 4\mu^2 \Big( \frac{e^L}{1 + e^L} - \frac{1}{2} \Big) = 2\mu^2 \tanh\!\Big( \frac{L}{2} \Big).$$

The diffusion coefficient (conditional variance) of $L$ is $(2\mu)^2\, dt = 4\mu^2\, dt$. So the posterior-conditional dynamics of the LLR (as a Markov process under the filter measure) are

$$dL_t = m(L_t)\, dt + 2\mu\, d\bar{W}_t, \qquad m(L) = 2\mu^2 \tanh\!\Big( \frac{L}{2} \Big),$$

where $\bar{W}_t$ is the innovation Brownian motion.

3. Bellman / HJB equation
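The body of this step is not visible in the preview. As a hedged reconstruction from the LLR dynamics in Section 2 and the recap in Section 7 (an assumption about what the hidden page states, not a quotation of it), the HJB variational inequality and its first integral would read as follows, with stopping cost $g(L) = 1/(1 + e^{|L|})$ and generator $\mathcal{L}$ of the filtered LLR process:

```latex
% Variational inequality (stop vs. continue), with generator of
% dL_t = 2\mu^2 \tanh(L_t/2)\,dt + 2\mu\, d\bar W_t:
\min\bigl\{\, g(L) - V(L),\; c + \mathcal{L}V(L) \,\bigr\} = 0,
\qquad
\mathcal{L}V = 2\mu^2 V'' + 2\mu^2 \tanh\!\bigl(\tfrac{L}{2}\bigr) V'.

% In the continuation region (-a, a):
2\mu^2 V'' + 2\mu^2 \tanh\!\bigl(\tfrac{L}{2}\bigr) V' + c = 0.

% Multiply by \cosh^2(L/2) / (2\mu^2); the left side becomes an exact derivative:
\frac{d}{dL}\Bigl[ \cosh^2\!\bigl(\tfrac{L}{2}\bigr) V'(L) \Bigr]
  = -\frac{c}{2\mu^2} \cosh^2\!\bigl(\tfrac{L}{2}\bigr)
  = -\frac{c}{4\mu^2}\,(1 + \cosh L).

% Integrate from 0, using V'(0) = 0 (V is even by symmetry):
V'(L) = -\frac{c}{4\mu^2}\, \frac{L + \sinh L}{\cosh^2(L/2)}.
```

The last line is the expression for $V'$ that the smooth-fit condition in step 5 is applied to.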
Integrate once more to get $V(L)$ (the integral can be written in closed form, but for the purpose of finding the threshold we only need the values of $V$ and $V'$ at $L = a$).

5. Matching at the stopping boundary

Apply the smooth-fit condition at $L = a$: the expression (***) evaluated at $L = a$ must equal the derivative of the stopping cost:

$$V'(a) = -\frac{e^a}{(1 + e^a)^2}.$$

Thus the threshold $a > 0$ is the solution of

$$-\frac{c}{4\mu^2}\, \frac{a + \sinh a}{\cosh^2(a/2)} = -\frac{e^a}{(1 + e^a)^2}.$$

We can simplify the right hand side in hyperbolic terms: $\dfrac{e^a}{(1 + e^a)^2} = \dfrac{1}{(e^{-a/2} + e^{a/2})^2} = \dfrac{\operatorname{sech}^2(a/2)}{4}$. Indeed

$$\frac{e^a}{(1 + e^a)^2} = \frac{e^{-a}}{(1 + e^{-a})^2} \overset{\text{(algebra)}}{=} \frac{1}{4} \operatorname{sech}^2\!\Big( \frac{a}{2} \Big).$$

(You may check this identity by writing $\cosh(a/2) = (e^{a/2} + e^{-a/2})/2$.)

So the smooth-fit equation becomes

$$\frac{c}{4\mu^2}\, \frac{a + \sinh a}{\cosh^2(a/2)} = \frac{1}{4} \operatorname{sech}^2\!\Big( \frac{a}{2} \Big).$$

Multiply both sides by $4 \cosh^2(a/2)$ and cancel the common factor to obtain the implicit equation for $a$:

$$\frac{c}{\mu^2}\, (a + \sinh a) = 1. \qquad (✶)$$

Thus the threshold $a$ is the (unique) positive solution of $(c/\mu^2)(a + \sinh a) = 1$. Once $a$ is known, the stopping rule is:

$$\tau = \inf\{ t \ge 0 : |L_t| \ge a \} = \inf\{ t \ge 0 : |2\mu X_t| \ge a \},$$

and the terminal decision is $\mathbf{1}\{L_\tau > 0\}$.

(One can check value-matching $V(a) = g(a)$ is satisfied automatically given the chosen constant of integration obtained when integrating $V'$ and using symmetry; in practice both value-matching and smooth-fit determine the integration constant and the threshold.)

6. Interpretation and connection to SPRT

The HJB / Bellman approach produced an ODE whose boundary condition yields the threshold equation (✶). That threshold is the continuous-time Bayesian analogue of Wald's discrete SPRT thresholds.

For small sampling cost $c$ (you tolerate long sampling to reduce error), the threshold $a$ is large. In the asymptotic/small-$c$ regime the stopping boundaries translate to the familiar log-likelihood thresholds $\log \frac{1 - \alpha}{\alpha}$ when one instead prescribes Type-I/II error rates rather than a sampling cost.

The HJB viewpoint clarifies why the optimal rule is "stop when the LLR leaves an interval": the continuation region is an interval around zero determined by the PDE/ODE; that is a general phenomenon in optimal stopping problems for one-dimensional diffusions.

7. Summary (quick recap)

1. Put the LLR $L_t$ as the Markov state (because the posterior is a function of it).
2. Write the HJB variational inequality $\min\{ g - V,\; c + \mathcal{L}V \} = 0$.
3. On the continuation interval solve the ODE $2\mu^2 V'' + 2\mu^2 \tanh(L/2)\, V' + c = 0$.
4. Use symmetry + smooth-fit/value-matching at $\pm a$ to obtain the implicit threshold equation $(c/\mu^2)(a + \sinh a) = 1$.
5. The optimal rule is stop when $|L_t| \ge a$ and pick the sign of $L_t$; this matches the SPRT structure.

If you like, I can:
(A) compute a numerical example (choose $\mu$ and $c$) and plot sample trajectories $L_t$ showing stopping at $\pm a$;
(B) show the limiting connection to Wald's discrete SPRT thresholds (map cost $c$ to error probabilities $\alpha, \beta$); or
(C) expand the algebra and produce a closed form expression for $V(L)$ (integrating the formula for $V'(L)$).
Which follow-up would be most useful?
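The implicit threshold equation $(c/\mu^2)(a + \sinh a) = 1$ has no closed-form solution, but the left side is increasing in $a$, so a bisection search is enough. The sketch below is illustrative only: the function name `sprt_threshold` and the parameter values are assumptions.

```python
import math


def sprt_threshold(mu, c, n_iter=200):
    """Solve (c / mu^2) * (a + sinh(a)) = 1 for a > 0 by bisection."""
    target = mu * mu / c          # a + sinh(a) must equal this value
    lo, hi = 0.0, 1.0
    while hi + math.sinh(hi) < target:   # grow the bracket until it contains the root
        hi *= 2.0
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if mid + math.sinh(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


mu = 1.0
for c in (0.1, 0.01, 0.001):
    a = sprt_threshold(mu, c)
    # Stopping cost at the boundary, g(a) = 1 / (1 + e^a): the posterior
    # probability of error when the LLR first reaches +a or -a.
    print(f"c={c}: a={a:.4f}, error probability at stopping={1.0 / (1.0 + math.exp(a)):.5f}")
```

Lowering the sampling cost should push the threshold outward and shrink the error probability at stopping, consistent with the interpretation in Section 6.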
(map cost 𝑐 to error probabilities
𝛼,𝛽);
(integrating the formula for
𝑉′(𝐿)).
Great — let’s do both carefully and keep the logic visible. I’ll stay fully analytic and show exactly where the
classical Wald SPRT thresholds emerge from the HJB solution.
I’ll reuse the diffusion/SPRT setup from the previous message and keep the same notation.
Part 1. Limiting connection to Wald’s discrete SPRT thresholds
(mapping sampling cost $c$ ↔ error probabilities $\alpha, \beta$)

1. Continuous-time Bayes rule recap

We found that the optimal stopping rule is

$$\tau = \inf\{ t : |L_t| \ge a \},$$

with $a > 0$ determined by the implicit equation

$$\frac{c}{\mu^2}\, (a + \sinh a) = 1. \qquad (1)$$

At stopping, the terminal decision is

$$\delta = \mathbf{1}\{ L_\tau > 0 \}.$$

2. Error probabilities induced by threshold $a$

Under the true hypothesis $H_0$ (negative drift), the LLR process drifts to $-\infty$. The probability of making a type-I error (decide $H_1$) is
$$V'(L) = -\frac{c}{4\mu^2}\, \frac{L + \sinh L}{\cosh^2(L/2)}. \qquad (6)$$

We now integrate this explicitly.

1. Algebraic simplification

Use identities

$$\cosh^2\!\Big( \frac{L}{2} \Big) = \frac{1 + \cosh L}{2}, \qquad \sinh L = 2 \sinh\!\Big( \frac{L}{2} \Big) \cosh\!\Big( \frac{L}{2} \Big).$$

Then

$$\frac{\sinh L}{\cosh^2(L/2)} = 2 \tanh\!\Big( \frac{L}{2} \Big).$$

So rewrite (6) as

$$V'(L) = -\frac{c}{4\mu^2} \left[ \frac{L}{\cosh^2(L/2)} + 2 \tanh\!\Big( \frac{L}{2} \Big) \right].$$

2. Integrate term by term

First term

$$\int \frac{L}{\cosh^2(L/2)}\, dL.$$

Let $u = L/2$. Then

$$= 4 \int u \operatorname{sech}^2(u)\, du = 4 \big( u \tanh u - \log \cosh u \big).$$

Returning to $L$:

$$= 2L \tanh\!\Big( \frac{L}{2} \Big) - 4 \log \cosh\!\Big( \frac{L}{2} \Big).$$

Second term

$$\int 2 \tanh\!\Big( \frac{L}{2} \Big)\, dL = 4 \log \cosh\!\Big( \frac{L}{2} \Big).$$

3. Combine

The logarithmic terms cancel cleanly. Therefore

$$V(L) = -\frac{c}{4\mu^2} \cdot 2L \tanh\!\Big( \frac{L}{2} \Big) + C = -\frac{c}{2\mu^2}\, L \tanh\!\Big( \frac{L}{2} \Big) + C.$$

4. Determine the constant

Use value-matching at the stopping boundary $L = a$:

$$V(a) = g(a) = \frac{1}{1 + e^a}.$$

Thus

$$C = \frac{1}{1 + e^a} + \frac{c}{2\mu^2}\, a \tanh\!\Big( \frac{a}{2} \Big).$$

5. Final closed form

For $|L| \le a$,

$$V(L) = \frac{1}{1 + e^a} + \frac{c}{2\mu^2} \left[ a \tanh\!\Big( \frac{a}{2} \Big) - L \tanh\!\Big( \frac{L}{2} \Big) \right].$$

For $|L| \ge a$,

$$V(L) = \frac{1}{1 + e^{|L|}}.$$

This function is:
even in $L$,
$C^1$ across the boundary (smooth-fit),
solves the HJB ODE inside $(-a, a)$,
equals the terminal Bayes risk at the boundary.

Big picture (why this matters)

Wald's SPRT thresholds drop out as the large-threshold / small-cost limit of an HJB free-boundary problem.

The value function has a closed analytic form, showing explicitly how posterior uncertainty, sampling cost, and information geometry interact.

This example is the cleanest bridge between classical mathematical statistics and stochastic optimal control: same rule, different language, deeper structure revealed.

If you want, next we can:
generalize this to asymmetric losses ($\alpha \neq \beta$),
redo it for unknown variance $\sigma^2$,
or show how the same HJB machinery yields optimal confidence sequences instead of tests.
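The closed form for $V$ can be sanity-checked numerically. The sketch below is a hedged illustration: it picks a threshold $a$ first and backs out the matching cost $c = \mu^2 / (a + \sinh a)$ from the threshold equation, then uses finite differences (the step sizes are arbitrary choices) to test the HJB ODE inside $(-a, a)$ and the smooth-fit condition at $a$. The function names are hypothetical.

```python
import math


def cost_for_threshold(a, mu):
    """Sampling cost c for which a solves (c / mu^2)(a + sinh a) = 1."""
    return mu * mu / (a + math.sinh(a))


def value(L, a, mu, c):
    """Closed-form value function: stopping cost outside (-a, a), HJB solution inside."""
    if abs(L) >= a:
        return 1.0 / (1.0 + math.exp(abs(L)))
    inner = a * math.tanh(a / 2.0) - L * math.tanh(L / 2.0)
    return 1.0 / (1.0 + math.exp(a)) + c / (2.0 * mu * mu) * inner


def hjb_residual(L, a, mu, c, h=1e-4):
    """Finite-difference residual of 2 mu^2 V'' + 2 mu^2 tanh(L/2) V' + c."""
    v0 = value(L, a, mu, c)
    vp = value(L + h, a, mu, c)
    vm = value(L - h, a, mu, c)
    d1 = (vp - vm) / (2.0 * h)
    d2 = (vp - 2.0 * v0 + vm) / (h * h)
    return 2.0 * mu * mu * d2 + 2.0 * mu * mu * math.tanh(L / 2.0) * d1 + c


mu, a = 1.0, 2.0
c = cost_for_threshold(a, mu)

# One-sided slope of V just inside the boundary vs. slope of the stopping cost.
h = 1e-5
left_slope = (value(a, a, mu, c) - value(a - h, a, mu, c)) / h
stop_slope = -math.exp(a) / (1.0 + math.exp(a)) ** 2

print("ODE residuals:", [hjb_residual(L, a, mu, c) for L in (0.0, 0.7, -1.3)])
print("smooth fit:", left_slope, "vs", stop_slope)
```

The residuals should be near zero up to discretization error, and the two slopes should agree only because $c$ was chosen to satisfy the threshold equation. For any other $c$ the ODE still holds inside the interval but the slopes at $a$ no longer match.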
Let’s keep this squarely in the mathematics and strip away folklore.
1. Asymmetric losses: continuous-time SPRT via HJB

Setup (diffusion likelihood ratio)

Consider the classical signal-in-noise model

$$dX_t = \mu\, dt + \sigma\, dW_t, \qquad H_0 : \mu = \mu_0, \quad H_1 : \mu = \mu_1, \quad \mu_1 > \mu_0,$$

with known $\sigma$.

The log-likelihood ratio $L_t$ evolves (under $H_1$; under $H_0$ the drift is $-a$) as

$$dL_t = a\, dt + b\, dW_t, \qquad a = \frac{(\mu_1 - \mu_0)^2}{2\sigma^2}, \quad b = \frac{\mu_1 - \mu_0}{\sigma}.$$

We choose a stopping time $\tau$ and terminal decision $\delta \in \{0, 1\}$.
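For reference, the drift and diffusion coefficients of the LLR can be derived in a few lines from Girsanov's formula. This is a sketch that assumes $X_0 = 0$, and it is not taken from the original answer:

```latex
% Log-likelihood ratio of H_1 vs H_0 for dX_t = \mu\,dt + \sigma\,dW_t, X_0 = 0:
L_t = \frac{\mu_1 - \mu_0}{\sigma^2}\, X_t - \frac{\mu_1^2 - \mu_0^2}{2\sigma^2}\, t.

% Substitute dX_t = \mu_i\,dt + \sigma\,dW_t under H_i (i = 0, 1):
dL_t = \left[ \frac{(\mu_1 - \mu_0)\,\mu_i}{\sigma^2}
            - \frac{\mu_1^2 - \mu_0^2}{2\sigma^2} \right] dt
       + \frac{\mu_1 - \mu_0}{\sigma}\, dW_t.

% The bracket equals +a under H_1 and -a under H_0, with
a = \frac{(\mu_1 - \mu_0)^2}{2\sigma^2} = \frac{b^2}{2},
\qquad
b = \frac{\mu_1 - \mu_0}{\sigma}.
```

As a consistency check, $\mu_1 = \mu$, $\mu_0 = -\mu$, $\sigma = 1$ gives $b = 2\mu$ and $a = 2\mu^2$, which matches the symmetric example treated earlier.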