Convex Analysis and Nonsmooth Optimization

Dmitriy Drusvyatskiy

March 29, 2020


Chapter 1 Background

This chapter sets the notation and reviews the background material that will be used throughout the rest of the book. The reader can safely skim this chapter during the first pass and refer back to it when necessary. The discussion is purposefully kept brief. The comments section at the end of the chapter lists references where a more detailed treatment may be found.

Roadmap. Sections 1.1-1.3 review basic constructs of linear algebra, including inner products, norms, linear maps and their adjoints, as well as eigenvalue and singular value decompositions. Section 1.4 establishes notation for basic set operations, such as sums and images/preimages of sets. Section 1.5 focuses on topological preliminaries; the main results are the Bolzano-Weierstrass theorem and a variant of the extreme value theorem. The final Sections 1.6-1.7 formally define first and second-order derivatives of multivariate functions, establish estimates on the error in Taylor approximations, and deduce derivative-based conditions for local optimality. The material in Sections 1.6-1.7 is often covered superficially in undergraduate courses, and therefore we provide an entirely self-contained treatment.

1.1 Inner products and linear maps

Throughout, we fix a Euclidean space E, meaning that E is a finite-dimensional real vector space endowed with an inner product 〈·, ·〉. Recall that an inner product on E is an assignment 〈·, ·〉 : E × E → R satisfying the following three properties for all x, y, z ∈ E and scalars a, b ∈ R:

(Symmetry) 〈x, y〉 = 〈y, x〉


Linear mappings A : E → E, between a Euclidean space E and itself, are called linear operators, and are said to be self-adjoint if the equality $A = A^*$ holds. Self-adjoint operators on $\mathbf{R}^n$ are precisely those operators that are representable as symmetric matrices. A self-adjoint operator A is positive semi-definite, denoted $A \succeq 0$, whenever

$$\langle Ax, x\rangle \ge 0 \quad \text{for all } x \in E.$$

Similarly, a self-adjoint operator A is positive definite, denoted $A \succ 0$, whenever

$$\langle Ax, x\rangle > 0 \quad \text{for all } 0 \ne x \in E.$$

A positive semidefinite linear operator A is positive definite if and only if A is invertible.
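As a small numerical illustration of these definitions, the following sketch tests a symmetric 2×2 matrix for positive (semi)definiteness through its smallest eigenvalue and compares the outcome with invertibility. The helper names and the tolerance `tol` are illustrative assumptions, not notation from the text.

```python
import math

def eigenvalues_sym2(a, b, d):
    """Eigenvalues (smallest first) of the symmetric matrix [[a, b], [b, d]]."""
    mean = (a + d) / 2.0
    radius = math.hypot((a - d) / 2.0, b)
    return mean - radius, mean + radius

def is_psd(a, b, d, tol=1e-12):
    # <Ax, x> >= 0 for all x  iff  the smallest eigenvalue is nonnegative.
    return eigenvalues_sym2(a, b, d)[0] >= -tol

def is_pd(a, b, d, tol=1e-12):
    # <Ax, x> > 0 for all x != 0  iff  the smallest eigenvalue is positive.
    return eigenvalues_sym2(a, b, d)[0] > tol

def is_invertible(a, b, d, tol=1e-12):
    # A 2x2 matrix is invertible exactly when its determinant is nonzero.
    return abs(a * d - b * b) > tol

# [[1, 1], [1, 1]] is positive semidefinite but singular, hence not positive definite.
for entries in [(2.0, 1.0, 2.0), (1.0, 1.0, 1.0), (1.0, 2.0, 1.0)]:
    print(entries, is_psd(*entries), is_pd(*entries), is_invertible(*entries))
```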

1.2 Norms

A norm on a vector space V is a function ‖·‖ : V → R for which the following three properties hold for all points x, y ∈ V and scalars a ∈ R:

(Absolute homogeneity) ‖ax‖ = |a| · ‖x‖

(Triangle inequality) ‖x + y‖ ≤ ‖x‖ + ‖y‖

(Positivity) Equality ‖x‖ = 0 holds if and only if x = 0.

The inner product in the Euclidean space E always induces a norm $\|x\| = \sqrt{\langle x, x\rangle}$. Unless specified otherwise, the symbol ‖x‖ for x ∈ E will always denote this induced norm. For example, the dot product on $\mathbf{R}^n$ induces the usual 2-norm $\|x\|_2 := \sqrt{x_1^2 + \dots + x_n^2}$, while the trace product on $\mathbf{R}^{m\times n}$ induces the Frobenius norm $\|X\|_F := \sqrt{\operatorname{tr}(X^T X)}$. Other important examples of norms are the $\ell_p$-norms on $\mathbf{R}^n$:

$$\|x\|_p = \begin{cases} (|x_1|^p + \dots + |x_n|^p)^{1/p} & \text{for } 1 \le p < \infty, \\ \max\{|x_1|, \dots, |x_n|\} & \text{for } p = \infty. \end{cases}$$
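The definition of the $\ell_p$-norms translates directly into code. The sketch below is a minimal illustration; the function name `lp_norm` and the use of `math.inf` to encode p = ∞ are assumptions made here for convenience.

```python
import math

def lp_norm(x, p):
    """The l_p norm of a sequence x for 1 <= p <= infinity (pass math.inf for p = infinity)."""
    if p < 1:
        raise ValueError("p must satisfy 1 <= p <= infinity")
    if math.isinf(p):
        return max(abs(t) for t in x)
    return sum(abs(t) ** p for t in x) ** (1.0 / p)

x = [3.0, -4.0]
for p in (1, 1.5, 2, 5, math.inf):
    print(p, lp_norm(x, p))
```

For a fixed vector, the printed values do not increase as p grows, which matches the nesting of the unit balls in Figure 1.1.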

The most notable of these are the $\ell_1$, $\ell_2$, and $\ell_\infty$ norms; see Figure 1.1. For an arbitrary norm ‖·‖ on E, the dual norm $\|\cdot\|^*$ on E is defined by

$$\|v\|^* := \max\{\langle v, x\rangle : \|x\| \le 1\}.$$

Thus $\|v\|^*$ is the maximal value that the linear function $x \mapsto \langle v, x\rangle$ takes over the closed unit ball of the norm ‖·‖. For example, the $\ell_p$ and $\ell_q$ norms on $\mathbf{R}^n$ are dual to each other whenever $p^{-1} + q^{-1} = 1$ and $p, q \in [1, \infty]$. In particular, the $\ell_2$-norm on $\mathbf{R}^n$ is self-dual; the same goes for the Frobenius norm on $\mathbf{R}^{m\times n}$ (why?). For an arbitrary norm ‖·‖ on E, the Cauchy-Schwarz inequality holds: $|\langle x, y\rangle| \le \|x\| \cdot \|y\|^*$.

Figure 1.1: Unit balls of $\ell_p$-norms. Panels: (a) p = 1, (b) p = 1.5, (c) p = 2, (d) p = 5, (e) p = ∞.
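The duality between the $\ell_p$ and $\ell_q$ norms can be checked numerically. The sketch below uses the standard construction of a point in the unit ball of the $\ell_p$-norm at which the linear function $x \mapsto \langle v, x\rangle$ reaches $\|v\|_q$. The function names are hypothetical, and the construction is a well-known one rather than something stated in the text.

```python
import math

def lp_norm(x, p):
    if math.isinf(p):
        return max(abs(t) for t in x)
    return sum(abs(t) ** p for t in x) ** (1.0 / p)

def inner(x, y):
    return sum(a * b for a, b in zip(x, y))

def conjugate_exponent(p):
    """Return q with 1/p + 1/q = 1, using the convention 1/infinity = 0."""
    if p == 1:
        return math.inf
    if math.isinf(p):
        return 1.0
    return p / (p - 1.0)

def dual_maximizer(v, p):
    """A point x with ||x||_p <= 1 and <v, x> = ||v||_q, where q is conjugate to p."""
    q = conjugate_exponent(p)
    if all(t == 0 for t in v):
        return [0.0] * len(v)
    if p == 1:
        # Put all the mass on one coordinate of largest magnitude.
        i = max(range(len(v)), key=lambda k: abs(v[k]))
        return [math.copysign(1.0, v[i]) if k == i else 0.0 for k in range(len(v))]
    if math.isinf(p):
        return [math.copysign(1.0, t) if t != 0 else 0.0 for t in v]
    scale = lp_norm(v, q) ** (q - 1.0)
    return [math.copysign(abs(t) ** (q - 1.0), t) / scale for t in v]

v = [1.0, -2.0, 3.0]
for p in (1, 1.5, 2, math.inf):
    x = dual_maximizer(v, p)
    print(p, lp_norm(x, p), inner(v, x), lp_norm(v, conjugate_exponent(p)))
```

In each printed row, the last two numbers should agree up to rounding, while the second number does not exceed one.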

Exercise 1.2. Given a positive definite linear operator A on E, show that the assignment $\langle v, w\rangle_A := \langle Av, w\rangle$ is an inner product on E, with the induced norm $\|v\|_A = \sqrt{\langle Av, v\rangle}$. Show that the dual norm with respect to the original inner product 〈·, ·〉 is $\|v\|_A^* = \|v\|_{A^{-1}} = \sqrt{\langle A^{-1}v, v\rangle}$.

All norms on E are “equivalent” in the sense that any two are within a constant factor of each other. More precisely, for any two norms $\rho_1(\cdot)$ and $\rho_2(\cdot)$, there exist constants α, β > 0 satisfying

$$\alpha\rho_1(x) \le \rho_2(x) \le \beta\rho_1(x) \quad \text{for all } x \in E.$$

Case in point, for any vector $x \in \mathbf{R}^n$, the relations hold:

$$\|x\|_2 \le \|x\|_1 \le \sqrt{n}\,\|x\|_2, \qquad \|x\|_\infty \le \|x\|_2 \le \sqrt{n}\,\|x\|_\infty, \qquad \|x\|_\infty \le \|x\|_1 \le n\|x\|_\infty.$$
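The following sketch checks these relations on a given vector and shows, using the all-ones vector, that the upper constants √n and n are attained and therefore grow with the dimension. The function names and the tolerance are illustrative assumptions.

```python
import math

def norms(x):
    """Return (||x||_1, ||x||_2, ||x||_inf)."""
    n1 = sum(abs(t) for t in x)
    n2 = math.sqrt(sum(t * t for t in x))
    ninf = max(abs(t) for t in x)
    return n1, n2, ninf

def satisfies_relations(x, tol=1e-9):
    """Check the three chains of inequalities relating the 1-, 2- and infinity-norms."""
    n = len(x)
    n1, n2, ninf = norms(x)
    slack = tol * (1.0 + n1)  # absorbs floating-point rounding
    return (n2 <= n1 + slack and n1 <= math.sqrt(n) * n2 + slack
            and ninf <= n2 + slack and n2 <= math.sqrt(n) * ninf + slack
            and ninf <= n1 + slack and n1 <= n * ninf + slack)

# For the all-ones vector the norms are n, sqrt(n) and 1 respectively.
for n in (10, 10_000):
    ones = [1.0] * n
    print(n, norms(ones), satisfies_relations(ones))
```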

For our purposes, the term “equivalent” is a misnomer: the proportionality constants α, β strongly depend on the (often enormous) dimension of the vector space E. Hence measuring quantities in different norms can yield strikingly different conclusions.

Consider a linear map A : E → Y, and norms $\|\cdot\|_E$ on E and $\|\cdot\|_Y$ on Y. We define the induced norm of A by

$$\|A\|_{E,Y} := \max_{x:\ \|x\|_E \le 1} \|Ax\|_Y.$$


Any symmetric matrix $A \in \mathbf{S}^n$ admits an eigenvalue decomposition, meaning a factorization of the form

$$A = U\Lambda U^T,$$

where $U \in O(n)$ is orthogonal and $\Lambda \in \mathbf{S}^n$ is a diagonal matrix. The diagonal elements of Λ are precisely the eigenvalues of A and the columns of U are corresponding eigenvectors. More generally, any rectangular matrix $A \in \mathbf{R}^{m\times n}$ admits a singular value decomposition, meaning a factorization of the form

$$A = U\Sigma V^T,$$

where $U \in O(m)$ and $V \in O(n)$ are orthogonal matrices and $\Sigma \in \mathbf{R}^{m\times n}$ is a diagonal matrix with nonnegative diagonal entries. The diagonal elements of Σ are uniquely defined and are called the singular values of A. Supposing without loss of generality m ≤ n, the singular values of A are precisely the square roots of the eigenvalues of $AA^T$, and we denote them by

$$\sigma_1(A) \ge \sigma_2(A) \ge \dots \ge \sigma_m(A) \ge 0.$$

In particular, the operator norm $\|A\|_{op}$ of any matrix $A \in \mathbf{R}^{m\times n}$ equals its maximal singular value $\sigma_1(A)$. See Figure 1.2 for an illustration.
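For a 2×2 matrix, both claims can be verified by hand or in a few lines of code. The sketch below computes the singular values as square roots of the eigenvalues of $AA^T$ and compares the largest one with a brute-force estimate of the operator norm obtained by sampling the unit circle. The function names and the sample count are assumptions chosen for illustration.

```python
import math

def singular_values_2x2(A):
    """Singular values (largest first) of A = [[a, b], [c, d]],
    computed as square roots of the eigenvalues of A A^T."""
    (a, b), (c, d) = A
    # Entries of the symmetric matrix A A^T = [[p, q], [q, r]].
    p = a * a + b * b
    q = a * c + b * d
    r = c * c + d * d
    mean = (p + r) / 2.0
    radius = math.hypot((p - r) / 2.0, q)
    return math.sqrt(mean + radius), math.sqrt(max(mean - radius, 0.0))

def sampled_operator_norm(A, samples=20_000):
    """Approximate max ||Ax||_2 over unit vectors x by sampling the unit circle."""
    (a, b), (c, d) = A
    best = 0.0
    for k in range(samples):
        theta = 2.0 * math.pi * k / samples
        x, y = math.cos(theta), math.sin(theta)
        best = max(best, math.hypot(a * x + b * y, c * x + d * y))
    return best

A = [[1.0, 2.0], [3.0, 4.0]]
print(singular_values_2x2(A), sampled_operator_norm(A))
```

The sampled value can only underestimate $\sigma_1(A)$, since it maximizes over finitely many unit vectors.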

Figure 1.2: The shaded ellipse is the image of the unit disk by a nonsingular matrix $A \in \mathbf{R}^{2\times 2}$. The radii of the circumscribed and inscribed circles are $\sigma_1(A)$ and $\sigma_2(A)$, respectively.

1.4 Set operations

In this section, we review notation for sums, generated cones, and images/preimages of sets. For any two sets A, B ⊂ E and λ ∈ R, define the set operations:

λA := {λa : a ∈ A} and A + B := {a + b : a ∈ A, b ∈ B}.

Thus the points in λA are simply the points in A scaled by λ. One can visualize the sum A + B by writing it more suggestively as

$$A + B = \bigcup_{a \in A} (a + B).$$

Thus A + B is formed from the union of the shifted sets a + B over all points a ∈ A. In particular, forming the sum of a set A ⊂ E and a unit ball B ⊂ E has the effect of “fattening” A. The symbol A − B is defined similarly. The cone generated by a set A ⊂ E will be denoted by

R+A := {λx : x ∈ A, λ ≥ 0 }.

See Figure 1.3 for an illustration of the generated cone and sum operation.

Figure 1.3: Sum and cone operations. Panels: (a) generated cone $\mathbf{R}_+A$; (b) disk plus square, A + B.

For any map F : E → Y and sets A ⊂ E and B ⊂ Y, define the two sets

$$FA = \{F(x) : x \in A\} \quad \text{and} \quad F^{-1}B = \{x : F(x) \in B\}.$$

The set FA is called the image of A under F, while $F^{-1}B$ is called the preimage of B under F. Notice that the sum A + B can also be written as the linear image of the product set Q := A × B under the map F(x, y) = x + y.
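For finite sets, all of these operations can be written out explicitly. The sketch below represents points as integer tuples and checks that the three descriptions of A + B (direct definition, union of shifts, and image of the product set) agree. The function names and the example sets are assumptions for illustration; preimages are computed only within a finite candidate domain.

```python
from itertools import product

def scale(lam, A):
    """The set lam * A = {lam * a : a in A} for a finite set A of tuples."""
    return {tuple(lam * t for t in a) for a in A}

def minkowski_sum(A, B):
    """The set A + B = {a + b : a in A, b in B}."""
    return {tuple(s + t for s, t in zip(a, b)) for a in A for b in B}

def image(F, A):
    return {F(x) for x in A}

def preimage(F, B, domain):
    """Points x of a finite candidate `domain` with F(x) in B."""
    return {x for x in domain if F(x) in B}

A = {(0, 0), (1, 0)}
B = {(0, 0), (0, 1)}

# A + B as the union of the shifted copies a + B ...
union_of_shifts = set().union(*(minkowski_sum({a}, B) for a in A))

# ... and as the image of the product set A x B under (x, y) -> x + y.
def add(pair):
    x, y = pair
    return tuple(s + t for s, t in zip(x, y))

image_of_product = image(add, set(product(A, B)))
print(minkowski_sum(A, B) == union_of_shifts == image_of_product)
```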


Figure 1.4: Closed functions. Panels: (a) closed; (b) not closed.

The following exercise shows that the infimal value of a closed function on a compact set is always attained.

Exercise 1.7 (Existence of minimizers on compact sets). Consider a closed function f : E → R and a compact set Q ⊂ E. Then the infimum value $\inf_{x \in Q} f(x)$ is attained at some point in Q.

[Hint: Apply the Bolzano-Weierstrass Theorem to the sequence $x_i \in Q$ satisfying $f(x_i) \to \inf_Q f$ and invoke lower-semicontinuity.]

An important downside of the above exercise is that it only guarantees existence of minimizers over compact sets. In light of the exponential example mentioned previously, if we wish to guarantee existence of minimizers over E, then we must focus on a favorable class of functions.

Definition 1.8 (Coercive). A function f : E → R is coercive if for any sequence xi with ‖xi‖ → ∞, it must be that f (xi) → ∞.

Equivalently, a function f is coercive precisely when the sublevel sets {x : f(x) ≤ r} are bounded for every r ∈ R (check this!). For example, the function $f(x) = e^{x^2}$ is coercive while the exponential $f(x) = e^x$ is not.

Exercise 1.9 (Existence of unconstrained minimizers). Any coercive closed function f : E → R has a minimizer. [Hint: Choose r ∈ R such that the sublevel set L = {x : f(x) ≤ r} is nonempty and apply Exercise 1.7.]
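The sublevel-set characterization of coercivity can be explored on a grid. The sketch below is a rough numerical illustration only: it takes x ↦ exp(x²) as the coercive example, and the function name, the level r, and the window radii are all assumptions. It reports how far a sublevel set extends inside a window [−radius, radius].

```python
import math

def sublevel_extent(f, r, radius, steps=2001):
    """Smallest and largest grid point x in [-radius, radius] with f(x) <= r, or None."""
    grid = [-radius + 2.0 * radius * k / (steps - 1) for k in range(steps)]
    hits = [x for x in grid if f(x) <= r]
    return (min(hits), max(hits)) if hits else None

def exp_of_square(x):
    return math.exp(x * x)

# Radii are kept small so that exp(x * x) does not overflow.
for radius in (5.0, 10.0, 20.0):
    print(radius,
          sublevel_extent(math.exp, 1.0, radius),       # reaches the left edge of every window
          sublevel_extent(exp_of_square, 3.0, radius))  # stays inside a fixed bounded interval
```

For the exponential, the reported left endpoint always coincides with the edge of the window, which is consistent with an unbounded sublevel set; for exp(x²) the extent stops changing once the window is large enough.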

1.6 Differentiability

For the rest of the section, let E and Y be two Euclidean spaces, and U an open subset of E. A mapping F : Q → Y, defined on a subset Q ⊂ E, is continuous at a point x ∈ Q if for any sequence $x_i$ in Q converging to x, the values $F(x_i)$ converge to F(x). We say that F is continuous if it is continuous at every x ∈ Q. We say that F is L-Lipschitz continuous if

$$\|F(y) - F(x)\| \le L\|y - x\| \quad \text{for all } x, y \in Q.$$

A function f : U → R is differentiable at a point x in U if there exists a vector, denoted by ∇f(x) ∈ E, satisfying

$$\lim_{h \to 0} \frac{f(x + h) - f(x) - \langle\nabla f(x), h\rangle}{\|h\|} = 0. \tag{1.1}$$

Rather than carrying such fractions around, which can be cumbersome, it is convenient to introduce the following notation. The symbol o(r) will always stand for a term satisfying $0 = \lim_{r \downarrow 0} o(r)/r$. Then the equation (1.1) simply amounts to the expression

$$f(x + h) = f(x) + \langle\nabla f(x), h\rangle + o(\|h\|).$$

The vector ∇f(x) is called the gradient of f at x. In the most familiar setting $E = \mathbf{R}^n$, the gradient is simply the vector of partial derivatives

$$\nabla f(x) = \begin{pmatrix} \dfrac{\partial f(x)}{\partial x_1} \\ \dfrac{\partial f(x)}{\partial x_2} \\ \vdots \\ \dfrac{\partial f(x)}{\partial x_n} \end{pmatrix}.$$
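The defining property of the gradient can be observed numerically: the first-order remainder divided by ‖h‖ should shrink as h shrinks. The sketch below uses an arbitrary smooth test function on R², chosen here purely for illustration, together with its hand-computed partial derivatives.

```python
import math

def f(x):
    """Example function on R^2 (an arbitrary choice for illustration)."""
    return math.sin(x[0]) + x[0] * x[1] ** 2

def grad_f(x):
    """Vector of partial derivatives of f."""
    return [math.cos(x[0]) + x[1] ** 2, 2.0 * x[0] * x[1]]

def remainder_ratio(x, direction, t):
    """|f(x + h) - f(x) - <grad f(x), h>| / ||h|| for h = t * direction."""
    h = [t * d for d in direction]
    xh = [a + b for a, b in zip(x, h)]
    linear_term = sum(g * s for g, s in zip(grad_f(x), h))
    return abs(f(xh) - f(x) - linear_term) / math.sqrt(sum(s * s for s in h))

x = [0.7, -1.3]
for t in (1e-1, 1e-2, 1e-3):
    print(t, remainder_ratio(x, [0.6, 0.8], t))
```

Because this test function is twice differentiable, the ratio decays roughly linearly in t, which is faster than the mere convergence to zero required by the definition.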

If the gradient mapping $x \mapsto \nabla f(x)$ is well-defined and continuous on U, we say that f is $C^1$-smooth. If the stronger property

$$\|\nabla f(y) - \nabla f(x)\|^* \le \beta\|y - x\| \quad \text{holds for all } x, y \in U,$$

then we say that f is β-smooth. Recall that ‖·‖ denotes the Euclidean norm in E and $\|\cdot\|^*$ is the dual norm. More generally, a mapping F : U → Y is differentiable at x ∈ U if there exists a linear mapping from E to Y, denoted by ∇F(x), satisfying

$$F(x + h) = F(x) + \nabla F(x)h + o(\|h\|).$$

The linear mapping ∇F(x) is called the Jacobian of F at x. If the assignment $x \mapsto \nabla F(x)$ is continuous, we say that F is $C^1$-smooth. In the most familiar


Assuming A is self-adjoint, show that f is coercive if and only if A is positive definite.

Exercise 1.11. Define the function $f(x) = \tfrac{1}{2}\|F(x)\|^2$, where F : E → Y is a $C^1$-smooth mapping. Prove the identity $\nabla f(x) = \nabla F(x)^* F(x)$.

Exercise 1.12. Consider a function f : U → R and a linear mapping A : Y → E and define the composition h(x) = f(Ax).

Show that if f is differentiable at Ax, then $\nabla h(x) = A^*\nabla f(Ax)$.
Show that if f is twice differentiable at Ax, then $\nabla^2 h(x) = A^*\nabla^2 f(Ax)A$.
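Before proving the first identity of Exercise 1.12, it can be sanity-checked numerically. The sketch below is not a proof: it picks an arbitrary 2×3 matrix A and an arbitrary smooth f (both assumptions made here), uses the fact that the adjoint with respect to the dot products is the transpose, and compares $A^*\nabla f(Ax)$ with a central finite-difference gradient of h.

```python
import math

A = [[1.0, 2.0, 0.0],
     [0.0, -1.0, 3.0]]  # a linear map from R^3 to R^2 (arbitrary example)

def apply(M, x):
    return [sum(m * t for m, t in zip(row, x)) for row in M]

def apply_adjoint(M, u):
    """With the dot product on both spaces, the adjoint of M is its transpose."""
    return [sum(M[i][j] * u[i] for i in range(len(M))) for j in range(len(M[0]))]

def f(u):
    return math.sin(u[0]) + u[0] * u[1] ** 2

def grad_f(u):
    return [math.cos(u[0]) + u[1] ** 2, 2.0 * u[0] * u[1]]

def h(x):
    return f(apply(A, x))

def grad_h(x):
    # The formula under test: grad h(x) = A^* grad f(Ax).
    return apply_adjoint(A, grad_f(apply(A, x)))

def central_difference(func, x, eps=1e-6):
    g = []
    for j in range(len(x)):
        xp = list(x)
        xm = list(x)
        xp[j] += eps
        xm[j] -= eps
        g.append((func(xp) - func(xm)) / (2.0 * eps))
    return g

x = [0.3, -0.2, 0.5]
print(grad_h(x))
print(central_difference(h, x))
```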

Exercise 1.13. Define the two sets

$$\mathbf{R}^n_{++} := \{x \in \mathbf{R}^n : x_i > 0 \text{ for all } i = 1, \dots, n\}, \qquad \mathbf{S}^n_{++} := \{X \in \mathbf{S}^n : X \succ 0\}.$$

Consider the two functions $f : \mathbf{R}^n_{++} \to \mathbf{R}$ and $F : \mathbf{S}^n_{++} \to \mathbf{R}$ given by

$$f(x) = -\sum_{i=1}^{n} \log x_i \quad \text{and} \quad F(X) = -\ln\det(X),$$

respectively. Note, from basic properties of the determinant, the equality F(X) = f(λ(X)), where we set $\lambda(X) := (\lambda_1(X), \dots, \lambda_n(X))$.

Find the derivatives ∇f(x) and $\nabla^2 f(x)$ for $x \in \mathbf{R}^n_{++}$.

Using the property tr(AB) = tr(BA), prove $\nabla F(X) = -X^{-1}$ and $\nabla^2 F(X)[V] = X^{-1}VX^{-1}$ for any $X \succ 0$. [Hint: To compute ∇F(X), justify

$$F(X + tV) - F(X) + t\langle X^{-1}, V\rangle = -\ln\det\left(I + tX^{-1/2}VX^{-1/2}\right) + t\operatorname{tr}\left(X^{-1/2}VX^{-1/2}\right).$$

By rewriting the expression in terms of eigenvalues of $X^{-1/2}VX^{-1/2}$, deduce that the right-hand side is o(t). To compute the Hessian, observe

$$(X + V)^{-1} = X^{-1/2}\left(I + X^{-1/2}VX^{-1/2}\right)^{-1}X^{-1/2},$$

and then use the expansion $(I + A)^{-1} = I - A + A^2 - A^3 + \dots = I - A + O(\|A\|_{op}^2)$, whenever $\|A\|_{op} < 1$.]

Show $\langle\nabla^2 F(X)[V], V\rangle = \|X^{-1/2}VX^{-1/2}\|_F^2$ for any $X \succ 0$ and $V \in \mathbf{S}^n$. Deduce that the operator $\nabla^2 F(X) : \mathbf{S}^n \to \mathbf{S}^n$ is positive definite.
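The two derivative formulas for F(X) = −ln det(X) can be checked numerically in the 2×2 case, where determinants and inverses have closed forms. The sketch below is a sanity check under assumed example matrices X and V, not a solution of the exercise: it compares finite-difference directional derivatives with $\langle -X^{-1}, V\rangle$ and $\langle X^{-1}VX^{-1}, V\rangle$ in the trace inner product.

```python
import math

def det2(X):
    return X[0][0] * X[1][1] - X[0][1] * X[1][0]

def inv2(X):
    d = det2(X)
    return [[X[1][1] / d, -X[0][1] / d], [-X[1][0] / d, X[0][0] / d]]

def matmul2(P, Q):
    return [[sum(P[i][k] * Q[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def trace_product(P, Q):
    """The trace inner product <P, Q> = tr(P^T Q), i.e. the sum of entrywise products."""
    return sum(P[i][j] * Q[i][j] for i in range(2) for j in range(2))

def F(X):
    return -math.log(det2(X))

def add_scaled(X, t, V):
    return [[X[i][j] + t * V[i][j] for j in range(2)] for i in range(2)]

def directional_derivative(X, V, t=1e-6):
    return (F(add_scaled(X, t, V)) - F(add_scaled(X, -t, V))) / (2.0 * t)

def second_directional_derivative(X, V, t=1e-4):
    return (F(add_scaled(X, t, V)) - 2.0 * F(X) + F(add_scaled(X, -t, V))) / (t * t)

X = [[2.0, 0.5], [0.5, 1.0]]    # symmetric positive definite (example)
V = [[1.0, -1.0], [-1.0, 3.0]]  # symmetric direction (example)
Xi = inv2(X)
print(directional_derivative(X, V), -trace_product(Xi, V))
print(second_directional_derivative(X, V),
      trace_product(matmul2(matmul2(Xi, V), Xi), V))
```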

1.7 Accuracy in approximation and optimality conditions

A set Q in E is convex if for any two points x, y ∈ Q and real λ ∈ [0, 1], the point λx + (1 − λ)y lies in Q. In other words, a set Q is convex if and only if the line segment joining any two points x, y ∈ Q lies entirely in Q. Throughout this section, we let U be an open, convex subset of E. Consider a $C^1$-smooth function f : U → R and a point x ∈ U. Classically, the linear function

l(x; y) = f (x) + 〈∇f (x), y − x〉

is a “best first-order approximation” of f near x. If f is C^2 -smooth, then the quadratic function

$$Q(x; y) = f(x) + \langle\nabla f(x), y - x\rangle + \tfrac{1}{2}\langle\nabla^2 f(x)(y - x), y - x\rangle$$

is a “best second-order approximation” of f near x. These two functions play a fundamental role when designing and analyzing algorithms: they furnish simple linear and quadratic local models of f. In this section, we aim to quantify how closely l(x; ·) and Q(x; ·) approximate f. All results will follow quickly by restricting multivariate functions to line segments and then applying the fundamental theorem of calculus for univariate functions. To this end, the following observation plays a basic role.

Exercise 1.14. Consider a function f : U → R and two points x, y ∈ U. Define the univariate function ϕ : [0, 1] → R given by ϕ(t) = f(x + t(y − x)) and let xt := x + t(y − x) for any t.

Show that if f is C^1 -smooth, then equality

ϕ′(t) = 〈∇f (xt), y − x〉 holds for any t ∈ (0, 1).

Show that if f is C^2 -smooth, then equality

ϕ′′(t) = 〈∇^2 f (xt)(y − x), y − x〉 holds for any t ∈ (0, 1).
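Both formulas of Exercise 1.14 can be sanity-checked by differentiating ϕ numerically. The sketch below assumes an arbitrary smooth test function on R² with hand-computed gradient and Hessian; it is an illustration, not a proof.

```python
import math

def f(u):
    return math.sin(u[0]) + u[0] * u[1] ** 2

def grad_f(u):
    return [math.cos(u[0]) + u[1] ** 2, 2.0 * u[0] * u[1]]

def hess_f(u):
    return [[-math.sin(u[0]), 2.0 * u[1]],
            [2.0 * u[1], 2.0 * u[0]]]

def point_on_segment(x, y, t):
    """The point x_t = x + t (y - x)."""
    return [a + t * (b - a) for a, b in zip(x, y)]

def phi(x, y, t):
    return f(point_on_segment(x, y, t))

def phi_prime(x, y, t):
    """<grad f(x_t), y - x>."""
    d = [b - a for a, b in zip(x, y)]
    return sum(g * s for g, s in zip(grad_f(point_on_segment(x, y, t)), d))

def phi_second(x, y, t):
    """<hess f(x_t) (y - x), y - x>."""
    d = [b - a for a, b in zip(x, y)]
    H = hess_f(point_on_segment(x, y, t))
    Hd = [sum(H[i][j] * d[j] for j in range(2)) for i in range(2)]
    return sum(a * b for a, b in zip(Hd, d))

x, y, t = [0.2, -0.5], [1.0, 0.7], 0.3
e1, e2 = 1e-5, 1e-4
print(phi_prime(x, y, t), (phi(x, y, t + e1) - phi(x, y, t - e1)) / (2 * e1))
print(phi_second(x, y, t),
      (phi(x, y, t + e2) - 2 * phi(x, y, t) + phi(x, y, t - e2)) / e2 ** 2)
```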


Corollary 1.17 (First and second order expansions). Suppose that f : U → R is $C^1$-smooth. Then for any point $\bar{x} \in U$, the estimate holds:

$$\lim_{x, y \to \bar{x}} \frac{f(y) - l(x; y)}{\|y - x\|} = 0.$$

If f is in addition $C^2$-smooth, then the estimate holds:

$$\lim_{x, y \to \bar{x}} \frac{f(y) - Q(x; y)}{\|y - x\|^2} = 0.$$

When the mappings ∇f and ∇^2 f are Lipschitz continuous, one has even greater control on the accuracy of approximation, in essence passing from little-o terms to big-O terms.

Exercise 1.18 (Accuracy in approximation under Lipschitz conditions). Suppose f : U → R is a β-smooth function. Then for any points x, y ∈ U the inequality

$$\left|f(y) - l(x; y)\right| \le \frac{\beta}{2}\|y - x\|^2 \quad \text{holds.}$$

Moreover, if f is $C^2$-smooth and satisfies the estimate

$$\|\nabla^2 f(y) - \nabla^2 f(x)\|_{op} \le M\|y - x\| \quad \text{for all } x, y \in U,$$

then the inequality

$$\left|f(y) - Q(x; y)\right| \le \frac{M}{6}\|y - x\|^3$$

holds for all x, y ∈ U.

[Hint: This follows directly from Corollary 1.16.]
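A one-dimensional example makes the two bounds concrete. For f = sin on the real line, the derivative cos and the second derivative −sin are both 1-Lipschitz, so one may take β = 1 and M = 1; the cubic bound then carries the standard constant M/6. The sketch below, with illustrative function names, checks both inequalities at a pair of points.

```python
import math

BETA = 1.0  # Lipschitz constant of f' = cos
M = 1.0     # Lipschitz constant of f'' = -sin

def linear_model(x, y):
    """l(x; y) = f(x) + f'(x)(y - x) for f = sin."""
    return math.sin(x) + math.cos(x) * (y - x)

def quadratic_model(x, y):
    """Q(x; y) = l(x; y) + (1/2) f''(x)(y - x)^2 for f = sin."""
    return linear_model(x, y) - 0.5 * math.sin(x) * (y - x) ** 2

def bounds_hold(x, y, tol=1e-12):
    gap = abs(y - x)
    first = abs(math.sin(y) - linear_model(x, y)) <= 0.5 * BETA * gap ** 2 + tol
    second = abs(math.sin(y) - quadratic_model(x, y)) <= M / 6.0 * gap ** 3 + tol
    return first and second

print(bounds_hold(0.3, 1.7), bounds_hold(-2.0, 2.5))
```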

Corollary 1.17 and Exercise 1.18 play central roles in optimization, as will become clear in later chapters. We end this section with one useful consequence of Corollary 1.17: derivative-based conditions for a point to be a local minimizer of a smooth function. A point x is called a local minimizer of a function f : E → R if there exists a convex neighborhood U of x such that f (x) ≤ f (y) for all y ∈ U. Observe that naively checking if x is a local minimizer of f from the very definition requires evaluation of f at every point near x, an impossible task. We now derive a verifiable necessary condition for local optimality based on the gradient.

Theorem 1.19. (First-order necessary conditions) Suppose that x is a local minimizer of a function f : U → R. If f is differentiable at x, then equality ∇f (x) = 0 holds.


Proof. Set v := −∇f(x). Then for all small t > 0, the definition of differentiability implies

$$0 \le \frac{f(x + tv) - f(x)}{t} = -\|\nabla f(x)\|^2 + \frac{o(t)}{t}.$$

Letting t tend to zero yields ∇f (x) = 0, as claimed.
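The quantity appearing in the proof can be computed directly. The sketch below evaluates the difference quotient along v = −∇f(x) for a simple quadratic, chosen here as an assumption for illustration. At a point with nonzero gradient the quotient approaches −‖∇f(x)‖² < 0, so the point cannot be a local minimizer; at the minimizer the quotient vanishes.

```python
def f(u):
    return (u[0] - 1.0) ** 2 + 2.0 * u[1] ** 2

def grad_f(u):
    return [2.0 * (u[0] - 1.0), 4.0 * u[1]]

def difference_quotient(x, t):
    """(f(x + t v) - f(x)) / t for the direction v = -grad f(x)."""
    v = [-g for g in grad_f(x)]
    xt = [a + t * b for a, b in zip(x, v)]
    return (f(xt) - f(x)) / t

x = [0.0, 1.0]  # gradient is (-2, 4), so -||grad f(x)||^2 = -20
for t in (1e-1, 1e-2, 1e-4):
    print(t, difference_quotient(x, t))
```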

To obtain verifiable sufficient conditions for optimality, higher order derivatives are required.

Theorem 1.20. (Second-order conditions) Consider a $C^2$-smooth function f : U → R and fix a point x ∈ U. Then the following are true.

(Necessary conditions) If x ∈ U is a local minimizer of f, then

$$\nabla f(x) = 0 \quad \text{and} \quad \nabla^2 f(x) \succeq 0.$$

(Sufficient conditions) If the relations

$$\nabla f(x) = 0 \quad \text{and} \quad \nabla^2 f(x) \succ 0$$

hold, then x is a local minimizer of f. More precisely, it holds:

$$\liminf_{y \to x} \frac{f(y) - f(x)}{\tfrac{1}{2}\|y - x\|^2} \ge \lambda_n(\nabla^2 f(x)).$$

Proof. Suppose first that x is a local minimizer of f. Then Theorem 1.19 guarantees ∇f(x) = 0. Consider an arbitrary vector v ∈ E. Then for all small t > 0, we deduce from a second-order expansion (1.2) the estimate

$$0 \le \frac{f(x + tv) - f(x)}{\tfrac{1}{2}t^2} = \langle\nabla^2 f(x)v, v\rangle + \frac{o(t^2)}{t^2}.$$

Letting t tend to zero yields $\langle\nabla^2 f(x)v, v\rangle \ge 0$ for all v ∈ E, as claimed.

Suppose ∇f(x) = 0 and $\nabla^2 f(x) \succ 0$. Let ε > 0 be such that $B_\varepsilon(x) \subset U$. Then for points y sufficiently close to x, the second-order expansion (1.2) yields the estimate

$$\frac{f(y) - f(x)}{\tfrac{1}{2}\|y - x\|^2} = \left\langle\nabla^2 f(x)\frac{y - x}{\|y - x\|}, \frac{y - x}{\|y - x\|}\right\rangle + \frac{o(\|y - x\|^2)}{\|y - x\|^2} \ge \lambda_n(\nabla^2 f(x)) + \frac{o(\|y - x\|^2)}{\|y - x\|^2}.$$

Letting y tend to x, the result follows.
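In two variables, the second-order conditions reduce to a check on the smallest eigenvalue of a symmetric 2×2 Hessian. The sketch below applies the two parts of Theorem 1.20 to a gradient and Hessian supplied by the caller. The function names, return strings, and tolerance are assumptions made for illustration; when the Hessian is only positive semidefinite, neither part of the theorem decides the question.

```python
import math

def eigenvalues_sym2(H):
    """Eigenvalues (smallest first) of a symmetric 2x2 matrix H."""
    (a, b), (_, d) = H
    mean = (a + d) / 2.0
    radius = math.hypot((a - d) / 2.0, b)
    return mean - radius, mean + radius

def classify_critical_point(grad, H, tol=1e-10):
    """Apply the second-order conditions at a point with gradient `grad` and Hessian `H`."""
    if math.hypot(*grad) > tol:
        return "not critical"
    lam_min, _ = eigenvalues_sym2(H)
    if lam_min > tol:
        return "local minimizer (sufficient condition holds)"
    if lam_min < -tol:
        return "not a local minimizer (necessary condition fails)"
    return "inconclusive (Hessian is only positive semidefinite)"

origin_grad = [0.0, 0.0]
# f(x, y) = x^2 + y^2, f(x, y) = x^2 - y^2, and f(x, y) = x^4 + y^4 at the origin.
for H in ([[2.0, 0.0], [0.0, 2.0]], [[2.0, 0.0], [0.0, -2.0]], [[0.0, 0.0], [0.0, 0.0]]):
    print(H, classify_critical_point(origin_grad, H))
```

The last case shows the gap between the necessary and sufficient conditions: x⁴ + y⁴ has a minimizer at the origin, yet its Hessian there is zero, and the same Hessian arises for functions with no minimizer at the origin.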
