Stochastic Subgradient Methods-Optimization Techniques-Lecture Notes, Study notes of Convex Optimization

Prof. Asit Maddukuri provided this handout for Convex Optimization course at Aligarh Muslim University. Its main points are: Stochastic, Subgradient, Methods, Noisy, Unbiased, Convergence, Iteration, Number, Empirical, Distributions

Typology: Study notes

2011/2012

Uploaded on 07/15/2012

Stochastic Subgradient Methods

Stephen Boyd and Almir Mutapcic

Notes for EE364b, Stanford University, Winter 2006-07

January 23, 2007

1 Noisy unbiased subgradient

Suppose $f : \mathbf{R}^n \to \mathbf{R}$ is a convex function. We say that a random vector $\tilde g \in \mathbf{R}^n$ is a noisy (unbiased) subgradient of $f$ at $x \in \mathbf{dom}\, f$ if $g = \mathbf{E}\, \tilde g \in \partial f(x)$, i.e., we have

$$f(z) \geq f(x) + (\mathbf{E}\, \tilde g)^T (z - x)$$

for all $z$. Thus, $\tilde g$ is a noisy unbiased subgradient of $f$ at $x$ if it can be written as $\tilde g = g + v$, where $g \in \partial f(x)$ and $v$ has zero mean.

If $x$ is also a random variable, then we say that $\tilde g$ is a noisy subgradient of $f$ at $x$ (which is random) if

$$\forall z \quad f(z) \geq f(x) + \mathbf{E}(\tilde g \mid x)^T (z - x)$$

holds almost surely. We can write this compactly as $\mathbf{E}(\tilde g \mid x) \in \partial f(x)$. ('Almost surely' is to be understood here.)

The noise can represent (presumably small) error in computing a true subgradient, error that arises in Monte Carlo evaluation of a function defined as an expected value, or measurement error.

Some references for stochastic subgradient methods are [Sho98, §2.4], [Pol87, Chap. 5]. Some books on stochastic programming in general are [BL97, Pre95, Mar05].
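As a small illustration of the definition, the following sketch builds $\tilde g = g + v$ for $f(x) = \|x\|_1$ and checks empirically that the sample mean of $\tilde g$ is close to a true subgradient. The choice of $f$, the Gaussian noise model, the noise level `sigma`, and all function names are assumptions made for this example only.

```python
import random

def subgrad_l1(x):
    """One subgradient of f(x) = ||x||_1: the sign vector (0 where x_i = 0)."""
    return [(xi > 0) - (xi < 0) for xi in x]

def noisy_subgrad_l1(x, sigma, rng):
    """g_tilde = g + v, where v is zero-mean Gaussian noise, so E g_tilde = g."""
    return [gi + rng.gauss(0.0, sigma) for gi in subgrad_l1(x)]

rng = random.Random(0)          # fixed seed so the run is repeatable
x = [1.5, -2.0, 0.3]
N = 20000                       # number of noisy subgradients to average

mean = [0.0] * len(x)
for _ in range(N):
    g_tilde = noisy_subgrad_l1(x, sigma=0.5, rng=rng)
    mean = [m + g / N for m, g in zip(mean, g_tilde)]

print("true subgradient:", subgrad_l1(x))
print("empirical mean of noisy subgradients:", mean)
```

Any single `g_tilde` can be far from `subgrad_l1(x)`; only its mean is required to lie in $\partial f(x)$.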

2 Stochastic subgradient method

The stochastic subgradient method is essentially the subgradient method, but using noisy subgradients and a more limited set of step size rules. In this context, the slow convergence of subgradient methods helps us, since the many steps help 'average out' the statistical errors in the subgradient evaluations.

We'll consider the simplest case, unconstrained minimization of a convex function $f : \mathbf{R}^n \to \mathbf{R}$. The stochastic subgradient method uses the standard update

$$x^{(k+1)} = x^{(k)} - \alpha_k \tilde g^{(k)},$$

where $x^{(k)}$ is the $k$th iterate, $\alpha_k > 0$ is the $k$th step size, and $\tilde g^{(k)}$ is a noisy subgradient of $f$ at $x^{(k)}$, i.e.,

$$\mathbf{E}(\tilde g^{(k)} \mid x^{(k)}) = g^{(k)} \in \partial f(x^{(k)}).$$

Even more so than with the ordinary subgradient method, we can have $f(x^{(k)})$ increase during the algorithm, so we keep track of the best point found so far, and the associated function value

$$f_{\rm best}^{(k)} = \min\{f(x^{(1)}), \ldots, f(x^{(k)})\}.$$

The sequences $x^{(k)}$, $\tilde g^{(k)}$, and $f_{\rm best}^{(k)}$ are, of course, stochastic processes.
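The update and the bookkeeping for $f_{\rm best}^{(k)}$ can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names, the test function $f(x) = \|x\|_1$, the Gaussian noise, and the step size $\alpha_k = 1/k$ are all assumptions chosen for the example, and $f$ is assumed to be exactly computable so that the best value can be tracked.

```python
import random

def stochastic_subgradient(f, noisy_subgrad, x1, step, iters):
    """Run x <- x - step(k) * g_tilde for k = 1..iters, tracking the best point.

    The returned f_best is the minimum of f over x^(1), ..., x^(iters+1).
    """
    x = list(x1)
    x_best, f_best = list(x), f(x)
    for k in range(1, iters + 1):
        g_tilde = noisy_subgrad(x)
        x = [xi - step(k) * gi for xi, gi in zip(x, g_tilde)]
        fx = f(x)
        if fx < f_best:             # f(x^(k)) may increase, so keep the best
            x_best, f_best = list(x), fx
    return x_best, f_best

# Hypothetical test problem: f(x) = ||x||_1 with additive Gaussian noise.
rng = random.Random(0)

def f(x):
    return sum(abs(xi) for xi in x)

def noisy_subgrad(x):
    return [(xi > 0) - (xi < 0) + rng.gauss(0.0, 0.5) for xi in x]

x1 = [1.5, -2.0, 0.3]
x_best, f_best = stochastic_subgradient(
    f, noisy_subgrad, x1, step=lambda k: 1.0 / k, iters=5000)
print("f(x1) =", f(x1), " f_best =", f_best)
```

Because the initial point is included in the minimum, `f_best` can never exceed `f(x1)`, whatever the noise does.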

3 Convergence

We'll prove a very basic convergence result for the stochastic subgradient method, using step sizes that are square-summable but not summable,

$$\alpha_k \geq 0, \qquad \sum_{k=1}^\infty \alpha_k^2 = \|\alpha\|_2^2 < \infty, \qquad \sum_{k=1}^\infty \alpha_k = \infty.$$
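For a concrete instance of these conditions (a worked example added here, with $a > 0$ an arbitrary constant), take $\alpha_k = a/k$. Then

$$\sum_{k=1}^\infty \alpha_k^2 = a^2 \sum_{k=1}^\infty \frac{1}{k^2} = \frac{a^2 \pi^2}{6} < \infty, \qquad \sum_{k=1}^K \alpha_k = a \sum_{k=1}^K \frac{1}{k} \geq a \ln(K+1) \to \infty \ \text{ as } K \to \infty,$$

so both conditions hold. By contrast, a constant step size $\alpha_k = a$ is not square-summable, and $\alpha_k = a/k^2$ is summable, so neither satisfies the conditions.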

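We assume there is an $x^\star$ that minimizes $f$, and a $G$ for which $\mathbf{E}\, \|\tilde g^{(k)}\|_2^2 \leq G^2$ for all $k$. We also assume that $R$ satisfies $\mathbf{E}\, \|x^{(1)} - x^\star\|_2^2 \leq R^2$. We will show that

$$\mathbf{E}\, f_{\rm best}^{(k)} \to f^\star$$

as $k \to \infty$, i.e., we have convergence in expectation. We also have convergence in probability: for any $\epsilon > 0$,

$$\lim_{k \to \infty} \mathbf{Prob}(f_{\rm best}^{(k)} \geq f^\star + \epsilon) = 0.$$

(More sophisticated methods can be used to show almost sure convergence.) We have

$$\begin{aligned}
\mathbf{E}\left( \|x^{(k+1)} - x^\star\|_2^2 \,\middle|\, x^{(k)} \right)
&= \mathbf{E}\left( \|x^{(k)} - \alpha_k \tilde g^{(k)} - x^\star\|_2^2 \,\middle|\, x^{(k)} \right) \\
&= \|x^{(k)} - x^\star\|_2^2 - 2\alpha_k\, \mathbf{E}\left( \tilde g^{(k)T} (x^{(k)} - x^\star) \,\middle|\, x^{(k)} \right) + \alpha_k^2\, \mathbf{E}\left( \|\tilde g^{(k)}\|_2^2 \,\middle|\, x^{(k)} \right) \\
&= \|x^{(k)} - x^\star\|_2^2 - 2\alpha_k\, \mathbf{E}(\tilde g^{(k)} \mid x^{(k)})^T (x^{(k)} - x^\star) + \alpha_k^2\, \mathbf{E}\left( \|\tilde g^{(k)}\|_2^2 \,\middle|\, x^{(k)} \right) \\
&\leq \|x^{(k)} - x^\star\|_2^2 - 2\alpha_k (f(x^{(k)}) - f^\star) + \alpha_k^2\, \mathbf{E}\left( \|\tilde g^{(k)}\|_2^2 \,\middle|\, x^{(k)} \right),
\end{aligned}$$

where the inequality holds almost surely, and follows because $\mathbf{E}(\tilde g^{(k)} \mid x^{(k)}) \in \partial f(x^{(k)})$. Now we take expectation to get

$$\mathbf{E}\, \|x^{(k+1)} - x^\star\|_2^2 \leq \mathbf{E}\, \|x^{(k)} - x^\star\|_2^2 - 2\alpha_k (\mathbf{E}\, f(x^{(k)}) - f^\star) + \alpha_k^2 G^2,$$

using $\mathbf{E}\, \|\tilde g^{(k)}\|_2^2 \leq G^2$. Recursively applying this inequality yields

$$\mathbf{E}\, \|x^{(k+1)} - x^\star\|_2^2 \leq \mathbf{E}\, \|x^{(1)} - x^\star\|_2^2 - 2 \sum_{i=1}^k \alpha_i (\mathbf{E}\, f(x^{(i)}) - f^\star) + G^2 \sum_{i=1}^k \alpha_i^2.$$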
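One standard way to get from this inequality to the claimed convergence is sketched below. It uses only the assumptions already stated ($\alpha_i \geq 0$, the bounds $R$ and $G$, and the step size conditions), and is offered as a plausible completion rather than a reproduction of the notes. Since the left-hand side is nonnegative and $\mathbf{E}\, \|x^{(1)} - x^\star\|_2^2 \leq R^2$,

$$2 \sum_{i=1}^k \alpha_i (\mathbf{E}\, f(x^{(i)}) - f^\star) \leq R^2 + G^2 \sum_{i=1}^k \alpha_i^2 \leq R^2 + G^2 \|\alpha\|_2^2.$$

Bounding each term of the sum below by the smallest one gives

$$\min_{i=1,\ldots,k} \left( \mathbf{E}\, f(x^{(i)}) - f^\star \right) \leq \frac{R^2 + G^2 \|\alpha\|_2^2}{2 \sum_{i=1}^k \alpha_i}.$$

The expectation of a minimum is at most the minimum of the expectations, so

$$\mathbf{E}\, f_{\rm best}^{(k)} - f^\star = \mathbf{E} \min_{i=1,\ldots,k} f(x^{(i)}) - f^\star \leq \min_{i=1,\ldots,k} \mathbf{E}\, f(x^{(i)}) - f^\star \leq \frac{R^2 + G^2 \|\alpha\|_2^2}{2 \sum_{i=1}^k \alpha_i},$$

and the right-hand side tends to zero because $\sum_{i} \alpha_i = \infty$. Convergence in probability then follows from Markov's inequality applied to the nonnegative random variable $f_{\rm best}^{(k)} - f^\star$:

$$\mathbf{Prob}(f_{\rm best}^{(k)} - f^\star \geq \epsilon) \leq \frac{\mathbf{E}\, (f_{\rm best}^{(k)} - f^\star)}{\epsilon} \to 0.$$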

[Plot omitted: $f_{\rm best}^{(k)} - f^\star$ versus $k$; legend: noise-free case, realization 1, realization 2.]

Figure 1: The value of $f_{\rm best}^{(k)} - f^\star$ versus iteration number $k$, for the subgradient method with step size $\alpha_k = 1/k$. The plot shows a noise-free realization, and two realizations with subgradient noise.

[Plot omitted: $\mathbf{E}\, f_{\rm best}^{(k)} - f^\star$ versus $k$.]

Figure 2: Average and one standard deviation error bars for $f_{\rm best}^{(k)} - f^\star$ versus iteration number $k$, computed using 100 realizations, every 250 iterations.

[Plot omitted: three histogram panels, titled $k = 250$, $k = 1000$, and $k = 5000$.]

Figure 3: Empirical distributions of $f_{\rm best}^{(k)} - f^\star$ at $k = 250$, $k = 1000$, and $k = 5000$ iterations, based on 100 realizations.

5.1 Noisy subgradient of expected function value

Suppose $F : \mathbf{R}^n \times \mathbf{R}^p \to \mathbf{R}$, and $F(x, w)$ is convex in $x$ for each $w$. We define

$$f(x) = \mathbf{E}\, F(x, w) = \int F(x, w)\, p(w)\; dw,$$

where $p$ is the density of $w$. (The integral is in $\mathbf{R}^p$.) The function $f$ is convex. We'll show how to compute a noisy unbiased subgradient of $f$ at $x$.

The function $f$ comes up in many applications. We can think of $x$ as some kind of design variable to be chosen, and $w$ as some kind of parameter that is random, i.e., subject to statistical fluctuation. The function $F$ tells us the cost of choosing $x$ when $w$ takes a particular value; the function $f$, which is deterministic, gives the average cost of choosing $x$, taking the statistical variation of $w$ into account. Note that the dimension of $w$, how it enters $F$, and the distribution are not restricted; we only require that for each value of $w$, $F$ is convex in $x$.

Except in some very special cases, we cannot easily compute $f(x)$ exactly. However, we can approximately compute $f$ using Monte Carlo methods, if we can cheaply generate samples of $w$ from its distribution. (This depends on the distribution.) We generate $M$ independent samples $w_1, \ldots, w_M$, and then take

$$\hat f(x) = \frac{1}{M} \sum_{i=1}^M F(x, w_i)$$

as our estimate of $f(x)$. We hope that if $M$ is large enough, we get a good estimate. In fact, $\hat f(x)$ is a random variable with $\mathbf{E}\, \hat f(x) = f(x)$, and a variance equal to $c/M$, where $c$ is the variance of $F(x, w)$. If we know or bound $c$, then we can at least bound the probability of a given level of error, i.e., $\mathbf{Prob}(|\hat f(x) - f(x)| \geq \epsilon)$. In many cases it's possible to carry out a much more sophisticated analysis of the error in Monte Carlo methods, but we won't pursue that here. A summary for our purposes is: we cannot evaluate $f(x)$ exactly, but we can get a good approximation, with (possibly) much effort.

Let $G : \mathbf{R}^n \times \mathbf{R}^p \to \mathbf{R}^n$ be a function that satisfies

$$G(x, w) \in \partial_x F(x, w)$$

for each $x$ and $w$. In other words, $G(x, w)$ selects a subgradient for each value of $x$ and $w$. If $F(x, w)$ is differentiable in $x$ then we must have $G(x, w) = \nabla_x F(x, w)$. We claim that

$$g = \mathbf{E}\, G(x, w) = \int G(x, w)\, p(w)\; dw \in \partial f(x).$$

To see this, note that for each $w$ and any $z$ we have

$$F(z, w) \geq F(x, w) + G(x, w)^T (z - x),$$

since $G(x, w) \in \partial_x F(x, w)$. Multiplying this by $p(w)$, which is nonnegative, and integrating gives

$$\begin{aligned}
\int F(z, w)\, p(w)\; dw &\geq \int \left( F(x, w) + G(x, w)^T (z - x) \right) p(w)\; dw \\
&= f(x) + g^T (z - x).
\end{aligned}$$

Since the lefthand side is $f(z)$, we've shown $g \in \partial f(x)$.

Now we can explain how to compute a noisy unbiased subgradient of $f$ at $x$. Generate independent samples $w_1, \ldots, w_M$. We then take

$$\tilde g = \frac{1}{M} \sum_{i=1}^M G(x, w_i).$$

In other words, we evaluate a subgradient of $F(\cdot, w)$ at $x$ for $M$ random samples of $w$, and take $\tilde g$ to be the average. At the same time we can also compute $\hat f(x)$, the Monte Carlo estimate of $f(x)$. We have $\mathbf{E}\, \tilde g = \mathbf{E}\, G(x, w) = g$, which we showed above is a subgradient of $f$ at $x$. Thus, $\tilde g$ is a noisy unbiased subgradient of $f$ at $x$.

This result is independent of $M$. We can even take $M = 1$. In this case, $\tilde g = G(x, w_1)$. In other words, we simply generate one sample $w_1$ and use the subgradient of $F$ for that value of $w$. In this case $\tilde g$ could hardly be called a good approximation of a subgradient of $f$, but its mean is a subgradient, so it is a valid noisy unbiased subgradient.

On the other hand, we can take $M$ to be large. In this case, $\tilde g$ is a random vector with mean $g$ and very small variance, i.e., it is a good estimate of $g$, a subgradient of $f$ at $x$.
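The construction above can be tried on a toy problem where the exact answers are known. The sketch below assumes a scalar $x$, $F(x, w) = |x - w|$, and $w \sim \mathcal{N}(0, 1)$; these choices, and the helper names, are illustrative assumptions, picked because $f(x) = x(2\Phi(x) - 1) + 2\phi(x)$ and $f'(x) = 2\Phi(x) - 1$ are then available in closed form for comparison.

```python
import math
import random

def F(x, w):
    """Cost for design x when the random parameter takes value w."""
    return abs(x - w)

def G(x, w):
    """A subgradient of F(., w) at x: the sign of x - w."""
    return (x > w) - (x < w)

def monte_carlo(x, M, rng):
    """Return (f_hat, g_tilde) computed from M independent samples of w."""
    ws = [rng.gauss(0.0, 1.0) for _ in range(M)]
    f_hat = sum(F(x, w) for w in ws) / M
    g_tilde = sum(G(x, w) for w in ws) / M
    return f_hat, g_tilde

def Phi(t):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

x = 0.5
g_exact = 2.0 * Phi(x) - 1.0
f_exact = x * g_exact + 2.0 * math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

rng = random.Random(1)
for M in (1, 100, 10000):
    print(M, monte_carlo(x, M, rng))

f_hat, g_tilde = monte_carlo(x, 20000, rng)
print("exact:", f_exact, g_exact, " estimate:", f_hat, g_tilde)
```

With $M = 1$ the value of `g_tilde` is always $\pm 1$, which illustrates the remark that a single-sample $\tilde g$ is a poor approximation of $g$ yet still unbiased.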

5.2 Example

We consider the problem of minimizing the expected value of a piecewise-linear convex function with random coefficients,

$$\mbox{minimize} \quad f(x) = \mathbf{E} \max_{i=1,\ldots,m} (a_i^T x + b_i),$$

with variable $x \in \mathbf{R}^n$. The data vectors $a_i \in \mathbf{R}^n$ and $b_i \in \mathbf{R}$ are random with some given distribution. We can compute an (unbiased) approximation of $f(x)$, and a noisy unbiased subgradient $g \in \partial f(x)$, using Monte Carlo methods.

We consider a problem instance with $n = 20$ variables and $m = 100$ terms. We assume that $a_i \sim \mathcal{N}(\bar a_i, 5I)$ and $b \sim \mathcal{N}(\bar b, 5I)$. The mean values $\bar a_i$ and $\bar b$ are generated from unit normal distributions (and are the same as the constant values $a_i$ and $b_i$ used in previous examples). We take $x^{(1)} = 0$ as the starting point and use the step size rule $\alpha_k = 1/k$.

We first compare the solution of the stochastic problem $x_{\rm stoch}$ (obtained from the stochastic subgradient method) with $x_{\rm ce}$, the solution of the certainty equivalent problem

$$\mbox{minimize} \quad f_{\rm ce}(x) = \max_{i=1,\ldots,m} (\mathbf{E}\, a_i^T x + \mathbf{E}\, b_i)$$

(which is the same piecewise-linear minimization problem considered in earlier examples). We also compare it with $x_{\rm heur}$, the solution of the problem

$$\mbox{minimize} \quad f_{\rm heur}(x) = \max_{i=1,\ldots,m} (\mathbf{E}\, a_i^T x + \mathbf{E}\, b_i + \lambda \|x\|_2),$$

where $\lambda$ is a positive parameter. The extra terms are meant to account for the variation in $a_i^T x + b_i$ caused by variation in $a_i$; the problem above can be cast as an SOCP and easily solved. We chose $\lambda = 1$, after some experimentation.
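A rough sketch of the stochastic subgradient method on this kind of problem is given below. It is not the code behind the reported results: to keep a pure-Python run short it assumes a much smaller instance ($n = 5$, $m = 20$) than the one in the text, a hypothetical seed, and $M = 10$ samples per subgradient, and it estimates $f$ on a fixed set of held-out samples purely for monitoring.

```python
import math
import random

def sample_data(a_bar, b_bar, rng, var=5.0):
    """Draw one realization: a_i ~ N(a_bar_i, var*I), b ~ N(b_bar, var*I)."""
    s = math.sqrt(var)
    A = [[aij + s * rng.gauss(0.0, 1.0) for aij in row] for row in a_bar]
    b = [bi + s * rng.gauss(0.0, 1.0) for bi in b_bar]
    return A, b

def pwl_value_and_subgrad(x, A, b):
    """Value of max_i (a_i^T x + b_i) and a subgradient a_j for an active j."""
    vals = [sum(aij * xj for aij, xj in zip(row, x)) + bi
            for row, bi in zip(A, b)]
    j = max(range(len(vals)), key=vals.__getitem__)
    return vals[j], A[j]

def noisy_subgrad(x, a_bar, b_bar, M, rng):
    """Average of M sampled subgradients: a noisy unbiased subgradient of f."""
    g = [0.0] * len(x)
    for _ in range(M):
        A, b = sample_data(a_bar, b_bar, rng)
        _, a_j = pwl_value_and_subgrad(x, A, b)
        g = [gi + aji / M for gi, aji in zip(g, a_j)]
    return g

def estimate_f(x, samples):
    """Monte Carlo estimate of f(x) on a fixed list of (A, b) samples."""
    return sum(pwl_value_and_subgrad(x, A, b)[0] for A, b in samples) / len(samples)

rng = random.Random(0)
n, m, M, K = 5, 20, 10, 300          # assumed sizes, smaller than in the text
a_bar = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]
b_bar = [rng.gauss(0.0, 1.0) for _ in range(m)]
held_out = [sample_data(a_bar, b_bar, rng) for _ in range(500)]

x = [0.0] * n                         # x^(1) = 0
f_start = estimate_f(x, held_out)
for k in range(1, K + 1):
    g = noisy_subgrad(x, a_bar, b_bar, M, rng)
    x = [xi - gi / k for xi, gi in zip(x, g)]   # step size alpha_k = 1/k
f_end = estimate_f(x, held_out)
print("estimated f at start:", f_start, " after", K, "iterations:", f_end)
```

Changing `M` trades per-iteration cost against subgradient noise, which is the trade-off the experiments in this section explore.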

[Plot omitted: $f_{\rm best}^{(k)} - f^\star$ versus $k$; legend: $M = 1$, $M = 10$, $M = 100$, $M = 1000$.]

Figure 5: The value of $f_{\rm best}^{(k)} - f^\star$ versus iteration number $k$, for the stochastic subgradient method with step size rule $\alpha_k = 1/k$. The plot shows one realization for noisy subgradients evaluated using $M = 1$, $M = 10$, $M = 100$, and $M = 1000$.

M = 1 seems to be the best choice. In any case, there seems little advantage (at least in this example) to using a value of M larger than 10.

6 More examples

6.1 Minimizing expected maximum violation

The vector $x \in \mathbf{R}^n$ is to be chosen subject to some (deterministic) linear inequalities $Fx \preceq g$. These can represent manufacturing limits, cost limits, etc. Another set of random inequalities is given as $Ax \preceq b$, where $A$ and $b$ come from some distribution. We would like these inequalities to hold, but because $A$ and $b$ vary, there may be no choice of $x$ that results in $Ax \preceq b$ almost surely. We still would like to choose $x$ so that $Ax \preceq b$ holds often, and when it is violated, the violation is small.

Perhaps the most natural problem formulation is to maximize the yield, defined as $\mathbf{Prob}(Ax \preceq b)$. In some very special cases (e.g., $A$ is deterministic and $b$ has log-concave density), this can be converted to a convex problem; but in general it is not convex. Also, yield is not sensitive to how much the inequalities $Ax \preceq b$ are violated; a small violation is the same as a large violation, as far as the yield is concerned.

Instead, we will work with the expected maximum violation of the inequalities. We define the maximum violation as

$$\max((Ax - b)_+) = \max_i\, (a_i^T x - b_i)_+,$$

where $a_i^T$ are the rows of $A$. The maximum violation is zero if and only if $Ax \preceq b$; it is positive otherwise. The expected violation is

$$\mathbf{E} \max((Ax - b)_+) = \mathbf{E} \left( \max_i\, (a_i^T x - b_i)_+ \right).$$

It is a complicated but convex function of $x$, and gives a measure of how often, and by how much, the constraints $Ax \preceq b$ are violated. It is convex no matter what the distribution of $A$ and $b$ is.

As an aside, we note that many other interesting measures of violation could also be used, such as the expected total (or average) violation, $\mathbf{E}\, \mathbf{1}^T (Ax - b)_+$, or the expected sum of the squares of the individual violations, $\mathbf{E}\, \|(Ax - b)_+\|_2^2$. We can even measure violation by the expected distance from the desired polyhedron, $\mathbf{E}\, \mathbf{dist}(x, \{z \mid Az \preceq b\})$, which we call the expected violation distance.

Back to our main story, and using expected (maximum) violation, we have the problem

$$\begin{array}{ll}
\mbox{minimize} & \mathbf{E} \max((Ax - b)_+) = \mathbf{E} \left( \max_i\, (a_i^T x - b_i)_+ \right) \\
\mbox{subject to} & Fx \preceq g.
\end{array}$$

The data are $F$, $g$, and the distribution of $A$ and $b$.

We'll use a stochastic projected subgradient method to solve this problem. Given a point $x$ that satisfies $Fx \preceq g$, we need to generate a noisy unbiased subgradient $\tilde g$ of the objective. To do this we first generate a sample of $A$ and $b$. If $Ax \preceq b$, we can take $\tilde g = 0$. (Note: if a subgradient is zero, it means we're done; but that's not the case here.) In the stochastic subgradient algorithm, $\tilde g = 0$ means we don't update the current value of $x$. But we don't stop the algorithm, as we would in a deterministic projected subgradient method. If $Ax \preceq b$ is violated, choose $j$ for which $a_j^T x - b_j = \max_i\, (a_i^T x - b_i)_+$. Then we can take $\tilde g = a_j$. We then set $x_{\rm temp} = x - \alpha_k a_j$, where $\alpha_k$ is the step size. Finally, we project $x_{\rm temp}$ back to the feasible set to get the updated value of $x$. (This can be done by solving a QP.)

We can generate a number of samples of $A$ and $b$, and use the method above to find an unbiased noisy subgradient for each case. We can use the average of these as $\tilde g$. If we take enough samples, we can simultaneously estimate the expected maximum violation for the current value of $x$.

A reasonable starting point is a point well inside the certainty equivalent inequalities $\mathbf{E}\, Ax \preceq \mathbf{E}\, b$, that also satisfies $Fx \preceq g$. Simple choices include the analytic, Chebyshev, or maximum volume ellipsoid centers, or the point that maximizes the margin $\min_i\, (\mathbf{E}\, b_i - \mathbf{E}\, a_i^T x)$, subject to $Fx \preceq g$.
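The procedure just described can be sketched as follows. To avoid a QP solver, this sketch assumes the deterministic constraints $Fx \preceq g$ are a simple box $l \leq x_i \leq u$, for which projection is coordinate-wise clipping; a general $F$, $g$ would need a QP for the projection step. The distribution of $A$ and $b$ (Gaussian perturbations of hypothetical means `A_bar`, `b_bar`), the sizes, and the helper names are likewise assumptions for illustration.

```python
import random

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def max_violation_subgrad(x, A, b):
    """Return max_i (a_i^T x - b_i)_+ and a subgradient of it at x."""
    viol = [dot(row, x) - bi for row, bi in zip(A, b)]
    j = max(range(len(viol)), key=viol.__getitem__)
    if viol[j] <= 0.0:                 # Ax <= b holds for this sample
        return 0.0, [0.0] * len(x)
    return viol[j], list(A[j])         # most violated row a_j

def project_box(x, lo, hi):
    """Euclidean projection onto {x : lo <= x_i <= hi} (assumed feasible set)."""
    return [min(max(xi, lo), hi) for xi in x]

def sample_Ab(A_bar, b_bar, sigma, rng):
    """Hypothetical data model: Gaussian perturbations of A_bar and b_bar."""
    A = [[a + sigma * rng.gauss(0.0, 1.0) for a in row] for row in A_bar]
    b = [bi + sigma * rng.gauss(0.0, 1.0) for bi in b_bar]
    return A, b

rng = random.Random(0)
n, m, M, K = 3, 6, 5, 200
A_bar = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]
b_bar = [1.0] * m
lo, hi = -1.0, 1.0

x = [0.0] * n      # strictly satisfies A_bar x < b_bar and lies in the box
for k in range(1, K + 1):
    g = [0.0] * n
    for _ in range(M):                 # average M sampled subgradients
        A, b = sample_Ab(A_bar, b_bar, 0.3, rng)
        _, gs = max_violation_subgrad(x, A, b)
        g = [gi + gsi / M for gi, gsi in zip(g, gs)]
    x_temp = [xi - gi / k for xi, gi in zip(x, g)]   # alpha_k = 1/k
    x = project_box(x_temp, lo, hi)
print("final x:", x)
```

When every sampled system is satisfied, `g` is zero and `x` is left unchanged for that iteration, but the loop continues, matching the remark above.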

6.2 On-line learning and adaptive signal processing

We suppose that $(x, y) \in \mathbf{R}^n \times \mathbf{R}$ have some joint distribution. Our goal is to find a weight vector $w \in \mathbf{R}^n$ for which $w^T x$ is a good estimator of $y$. (This is linear regression; we can add

[Plot omitted: prediction error versus $k$.]

Figure 6: Prediction error $w^{(k)T} x^{(k+1)} - y^{(k+1)}$ versus iteration number $k$.

[Plot omitted: histogram of prediction errors.]

Figure 7: Empirical distribution of prediction errors for $w^\star$, based on 1000 samples.

Acknowledgments

May Zhou helped with a preliminary version of this document. We thank Abbas El Gamal for helping us simplify the convergence proof. Trevor Hastie suggested the on-line learning example.

References

[BL97] J. R. Birge and F. Louveaux. Introduction to Stochastic Programming. Springer, New York, 1997.

[Mar05] K. Marti. Stochastic Optimization Methods. Springer, 2005.

[Pol87] B. Polyak. Introduction to Optimization. Optimization Software, Inc., 1987.

[Pre95] A. Prekopa. Stochastic Programming. Kluwer Academic Publishers, 1995.

[Sho98] N. Shor. Nondifferentiable Optimization and Polynomial Problems. Nonconvex Optimization and its Applications. Kluwer, 1998.