















Study with the several resources on Docsity
Earn points by helping other students or get them with a premium plan
Prepare for your exams
Study with the several resources on Docsity
Earn points to download
Earn points by helping other students or get them with a premium plan
An overview of the stochastic subgradient method (ssm), a popular optimization algorithm used in machine learning, convex optimization, and stochastic programming. The ssm uses noisy unbiased subgradients to iteratively update the estimate of the minimum, ensuring convergence in expectation, probability, and almost surely. The assumptions, convergence results, and proof of the ssm, as well as applications in stochastic programming, on-line learning, and adaptive signal processing.
Typology: Slides
1 / 23
This page cannot be seen from the preview
Don't miss anything!
















noisy unbiased subgradient
stochastic subgradient method
convergence proof
stochastic programming
expected value of convex function
on-line learning and adaptive signal processing
Prof. S. Boyd, EE364b, Stanford University
docsity.com
random vector
g
n
is a
noisy unbiased subgradient
for
f
n
at
x
if for all
z
f
z
f
x
˜g
T
z
x
i.e.
g
g
∂f
x
same as
g
g
v
, where
g
∂f
x
v
v
can represent error in computing
g
, measurement noise, Monte Carlo
sampling error, etc.
Prof. S. Boyd, EE364b, Stanford University
1
docsity.com
stochastic subgradient method
is the subgradient method, using noisy
unbiased subgradients
x
(
k
+1)
x
(
k
)
α
k
g
(
k
)
x
(
k
)
is
k
th iterate
g
(
k
)
is any noisy unbiased subgradient of (convex)
f
at
x
(
k
)
i.e.
g
(
k
)
x
(
k
)
g
(
k
)
∂f
x
(
k
)
α
k
is the
k
th step size
define
f
(
k
)
best
= min
f
x
(1)
,... , f
x
(
k
)
Prof. S. Boyd, EE364b, Stanford University
3
docsity.com
f
⋆
= inf
x
f
x
, with
f
x
⋆
f
⋆
g
(
k
)
22
2
for all
k
x
(1)
x
⋆
22
2
(can take
here)
step sizes are square-summable but not summable
α
k
∞
∑ k
=
α
2 k
α
2 2
∞
k
=
α
k
these assumptions are stronger than needed, just to simplify proofs Prof. S. Boyd, EE364b, Stanford University
4
docsity.com
key quantity:
expected Euclidean distance squared to the optimal set
x
(
k
+1)
x
⋆
22
x
(
k
)
x
(
k
)
α
k
g
(
k
)
x
⋆
22
x
(
k
)
x
(
k
)
x
⋆
2 2
α
k
˜g
(
k
)
T
x
(
k
)
x
⋆
x
(
k
)
α
2 k
˜g
(
k
)
2 2
x
(
k
)
x
(
k
)
x
⋆
2 2
α
k
g
(
k
)
x
(
k
)
T
x
(
k
)
x
⋆
α
2 k
g
(
k
)
2 2
x
(
k
)
x
(
k
)
x
⋆
2 2
α
k
f
x
(
k
)
f
⋆
α
2 k
˜g
(
k
)
2 2
x
(
k
)
using
g
(
k
)
x
(
k
)
∂f
x
(
k
)
Prof. S. Boyd, EE364b, Stanford University
6
docsity.com
now take expectation:
x
(
k
+1)
x
⋆
2 2
x
(
k
)
x
⋆
2 2
α
k
f
x
(
k
)
f
⋆
α
2 k
g
(
k
)
2 2
apply recursively, and use
g
(
k
)
22
2
to get
x
(
k
+1)
x
⋆
2 2
x
(1)
x
⋆
2 2
k
i
=
α
i
f
x
(
i
)
f
⋆
2
k
i
=
α
2 i
and so
min
i
=
,...,k
f
x
(
i
)
f
⋆
2
2
α
22
k i
=
α
i
Prof. S. Boyd, EE364b, Stanford University
7
docsity.com
piecewise linear minimization
minimize
f
x
) = max
i
=
,...,m
a
Ti
x
b
i
we use stochastic subgradient algorithm with noisy subgradient
g
(
k
)
g
(
k
)
v
(
k
)
g
(
k
)
∂f
x
(
k
)
v
(
k
)
independent zero mean random variables
Prof. S. Boyd, EE364b, Stanford University
9
docsity.com
problem instance:
n
variables,
m
terms,
f
⋆
α
k
/k
v
(
k
)
are IID
noise since
g
1000
2000
3000
4000
5000
10
−
10
−
10
−
10
0
k
f
)k(
best
f −
⋆
noise-free caserealization 1realization 2
Prof. S. Boyd, EE364b, Stanford University
10 docsity.com
empirical distributions of
f
(
k
)
best
f
⋆
at
k
k
, and
k
10
−
10
−
10
−
10
0
0
30 20 10
10
−
10
−
10
−
10
0
0
30 20 10
10
−
10
−
10
−
10
0
0
30 20 10
k
k
k
Prof. S. Boyd, EE364b, Stanford University
12 docsity.com
minimize
f
0
x, ω
subject to
f
i
x, ω
i
,... , m
if
f
i
x, ω
is convex in
x
for each
ω
, problem is convex
‘certainty-equivalent’ problem
minimize
f
0
x,
ω
subject to
f
i
x,
ω
i
,... , m
(if
f
i
x, ω
is convex in
ω
, gives a lower bound on optimal value of
stochastic problem) Prof. S. Boyd, EE364b, Stanford University
13 docsity.com
suppose
x, w
is convex in
x
for each
w
and
x, w
x
x, w
f
x
x, w
x, w
p
w
dw
is convex
a subgradient of
f
at
x
is
g
x, w
x, w
p
w
dw
∂f
x
a noisy unbiased subgradient of
f
at
x
is
g
M
i
=
x, w
i
where
w
1
,... , w
M
are
independent samples (Monte Carlo)
Prof. S. Boyd, EE364b, Stanford University
15 docsity.com
minimize
f
x
max
i
=
,...,m
a
Ti
x
b
i
where
a
i
and
b
i
are random
evaluate noisy subgradient using Monte Carlo method with
samples,
and run stochastic subgradient methodcompare to:
certainty equivalent: minimize
f
ce
x
) = max
i
=
,...,m
a
Ti
x
b
i
heuristic: minimize
f
heur
x
) = max
i
=
,...,m
a
Ti
x
b
i
λ
x
2
Prof. S. Boyd, EE364b, Stanford University
16 docsity.com
f
⋆
estimated by running the method with
for long time
500
1000
1500
2000
10
−
10
−
10
−
10
0
k
f
)k(
best
f
⋆
M
= 1
M
= 10
M
= 100
M
= 1000
Prof. S. Boyd, EE364b, Stanford University
18 docsity.com
x, y
n
have some joint distribution
find weight vector
w
n
for which
w
T
x
is a good estimator of
y
choose
w
to minimize expected value of a convex
loss function
l
w
l
w
T
x
y
l
u
u
2
: mean-square error
l
u
u
: mean-absolute error
at each step (
e.g.
, time sample), we are given a sample
x
(
k
)
, y
(
k
)
from
the distribution
Prof. S. Boyd, EE364b, Stanford University
19 docsity.com