













































































This chapter sets the notation and reviews the background material that will be used throughout the rest of the book. The reader can safely skim this chapter during the first pass and refer back to it when necessary. The discussion is purposefully kept brief. The comments section at the end of the chapter lists references where a more detailed treatment may be found.
Roadmap. Sections 1.1-1.3 review basic constructs of linear algebra, including inner products, norms, linear maps and their adjoints, as well as eigenvalue and singular value decompositions. Section 1.4 establishes notation for basic set operations, such as sums and images/preimages of sets. Section 1.5 focuses on topological preliminaries; the main results are the Bolzano-Weierstrass theorem and a variant of the extreme value theorem. The final Sections 1.6-1.7 formally define first- and second-order derivatives of multivariate functions, establish estimates on the error in Taylor approximations, and deduce derivative-based conditions for local optimality. The material in Sections 1.6-1.7 is often covered superficially in undergraduate courses, and therefore we provide an entirely self-contained treatment.
1.1 Inner products and linear maps
Throughout, we fix a Euclidean space E, meaning that E is a finite-dimensional real vector space endowed with an inner product 〈·, ·〉. Recall that an inner product on E is an assignment 〈·, ·〉 : E × E → R satisfying the following three properties for all x, y, z ∈ E and scalars a, b ∈ R:
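(Symmetry) 〈x, y〉 = 〈y, x〉
(Bilinearity) 〈ax + by, z〉 = a〈x, z〉 + b〈y, z〉
(Positive definiteness) 〈x, x〉 ≥ 0, and equality 〈x, x〉 = 0 holds if and only if x = 0.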
Linear mappings A : E → E, between a Euclidean space E and itself, are called linear operators, and are said to be self-adjoint if the equality A = A∗ holds. Self-adjoint operators on R^n are precisely those operators that are representable as symmetric matrices. A self-adjoint operator A is positive semidefinite, denoted A ⪰ 0, whenever
〈Ax, x〉 ≥ 0 for all x ∈ E.
Similarly, a self-adjoint operator A is positive definite, denoted A ≻ 0, whenever 〈Ax, x〉 > 0 for all 0 ≠ x ∈ E.
A positive semidefinite linear operator A is positive definite if and only if A is invertible.
1.2 Norms
A norm on a vector space V is a function ‖·‖ : V → R for which the following three properties hold for all points x, y ∈ V and scalars a ∈ R:
(Absolute homogeneity) ‖ax‖ = |a| · ‖x‖
(Triangle inequality) ‖x + y‖ ≤ ‖x‖ + ‖y‖
(Positivity) Equality ‖x‖ = 0 holds if and only if x = 0.
The inner product in the Euclidean space E always induces a norm ‖x‖ = √〈x, x〉. Unless specified otherwise, the symbol ‖x‖ for x ∈ E will always denote this induced norm. For example, the dot product on R^n induces the usual 2-norm ‖x‖_2 := √(x_1^2 + ... + x_n^2), while the trace product on R^{m×n} induces the Frobenius norm ‖X‖_F := √(tr(X^T X)). Other important examples of norms are the l_p-norms on R^n:
$$\|x\|_p = \begin{cases} (|x_1|^p + \cdots + |x_n|^p)^{1/p} & \text{for } 1 \le p < \infty, \\ \max\{|x_1|, \ldots, |x_n|\} & \text{for } p = \infty. \end{cases}$$
The most notable of these are the l_1, l_2, and l_∞ norms; see Figure 1.1. For an arbitrary norm ‖·‖ on E, the dual norm ‖·‖∗ on E is defined by
‖v‖∗ := max{〈v, x〉 : ‖x‖ ≤ 1}.
Thus ‖v‖∗ is the maximal value that the linear function x ↦ 〈v, x〉 takes over the closed unit ball of the norm ‖·‖. For example, the l_p and l_q norms on R^n are dual to each other whenever p^{-1} + q^{-1} = 1 and p, q ∈ [1, ∞]. In particular, the l_2-norm on R^n is self-dual; the same goes for the Frobenius norm on R^{m×n} (why?). For an arbitrary norm ‖·‖ on E, the Cauchy-Schwarz inequality holds: |〈x, y〉| ≤ ‖x‖ · ‖y‖∗.
Figure 1.1: Unit balls of l_p-norms for (a) p = 1, (b) p = 1.5, (c) p = 2, (d) p = 5, (e) p = ∞.
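The duality between the l_p and l_q norms can be checked numerically. The following is a minimal sketch in plain Python, not part of the original notes: the helper names (`lp_norm`, `conjugate_exponent`, `dual_maximizer`) and the test vector are illustrative assumptions. For 1 < p < ∞ it builds the standard point x in the unit l_p ball at which the linear function x ↦ 〈v, x〉 attains the value ‖v‖_q.

```python
import math

def lp_norm(x, p):
    """Return the l_p norm of a sequence x for 1 <= p <= infinity."""
    if p == math.inf:
        return max(abs(xi) for xi in x)
    return sum(abs(xi) ** p for xi in x) ** (1.0 / p)

def conjugate_exponent(p):
    """Return q with 1/p + 1/q = 1, using the convention 1/infinity = 0."""
    if p == 1:
        return math.inf
    if p == math.inf:
        return 1.0
    return p / (p - 1.0)

def dual_maximizer(v, p):
    """Return x with ||x||_p = 1 and <v, x> = ||v||_q (needs 1 < p < inf, v != 0)."""
    q = conjugate_exponent(p)
    scale = lp_norm(v, q) ** (q - 1)
    # Standard maximizer: x_i = sign(v_i) |v_i|^(q-1) / ||v||_q^(q-1).
    return [math.copysign(abs(vi) ** (q - 1), vi) / scale for vi in v]

v = [3.0, -4.0, 1.0]          # hypothetical test vector
p = 3.0
q = conjugate_exponent(p)     # q = 1.5
x = dual_maximizer(v, p)
inner = sum(vi * xi for vi, xi in zip(v, x))

print(lp_norm(x, p))          # x lies on the unit sphere of the l_p norm
print(inner, lp_norm(v, q))   # the two values agree up to rounding
```

Any other point of the unit l_p ball gives a value of 〈v, x〉 no larger than ‖v‖_q, which is the Cauchy-Schwarz inequality stated above specialized to this pair of norms.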
Exercise 1.2. Given a positive definite linear operator A on E, show that the assignment 〈v, w〉_A := 〈Av, w〉 is an inner product on E, with the induced norm ‖v‖_A = √〈Av, v〉. Show that the dual norm with respect to the original inner product 〈·, ·〉 is ‖v‖∗_A = ‖v‖_{A^{-1}} = √〈A^{-1}v, v〉.
All norms on E are “equivalent” in the sense that any two are within a constant factor of each other. More precisely, for any two norms ρ_1(·) and ρ_2(·), there exist constants α, β > 0 satisfying
αρ_1(x) ≤ ρ_2(x) ≤ βρ_1(x) for all x ∈ E.
Case in point, for any vector x ∈ R^n, the following relations hold:
‖x‖_2 ≤ ‖x‖_1 ≤ √n ‖x‖_2,
‖x‖_∞ ≤ ‖x‖_2 ≤ √n ‖x‖_∞,
‖x‖_∞ ≤ ‖x‖_1 ≤ n ‖x‖_∞.
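The relations between the l_1, l_2, and l_∞ norms are easy to test numerically. The sketch below is an illustration rather than part of the notes; the helper names `norms` and `satisfies_relations` are hypothetical. It also evaluates the all-ones vector, for which the dimension-dependent upper bounds hold with equality.

```python
import math

def norms(x):
    """Return the (l_1, l_2, l_inf) norms of a sequence x."""
    one = sum(abs(xi) for xi in x)
    two = math.sqrt(sum(xi * xi for xi in x))
    inf = max(abs(xi) for xi in x)
    return one, two, inf

def satisfies_relations(x, tol=1e-9):
    """Check the three chains of inequalities relating l_1, l_2, l_inf."""
    n = len(x)
    one, two, inf = norms(x)
    return (two <= one + tol and one <= math.sqrt(n) * two + tol
            and inf <= two + tol and two <= math.sqrt(n) * inf + tol
            and inf <= one + tol and one <= n * inf + tol)

print(satisfies_relations([1.0, -2.0, 3.5, 0.0]))

# The all-ones vector attains the upper bounds: the printed ratios
# are sqrt(n), sqrt(n), and n, so they grow with the dimension.
for n in (10, 10_000, 1_000_000):
    one, two, inf = norms([1.0] * n)
    print(n, one / two, two / inf, one / inf)
```

The growing ratios show concretely how far apart two "equivalent" norms can be in high dimension.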
For our purposes, the term “equivalent” is a misnomer: the proportionality constants α, β strongly depend on the (often enormous) dimension of the vector space E. Hence measuring quantities in different norms can yield strikingly different conclusions. Consider a linear map A : E → Y, and norms ‖ · ‖E on E and ‖ · ‖Y on Y. We define the induced norm of A by
‖A‖_{E,Y} := max_{x: ‖x‖_E ≤ 1} ‖Ax‖_Y.
Any symmetric matrix A ∈ S^n admits an eigenvalue decomposition, meaning a factorization of the form
A = UΛU^T,
where U ∈ O(n) is orthogonal and Λ ∈ S^n is a diagonal matrix. The diagonal elements of Λ are precisely the eigenvalues of A and the columns of U are corresponding eigenvectors. More generally, any rectangular matrix A ∈ R^{m×n} admits a singular value decomposition, meaning a factorization of the form
A = UΣV^T,
where U ∈ O(m) and V ∈ O(n) are orthogonal matrices and Σ ∈ R^{m×n} is a diagonal matrix with nonnegative diagonal entries. The diagonal elements of Σ are uniquely defined and are called the singular values of A. Supposing without loss of generality m ≤ n, the singular values of A are precisely the square roots of the eigenvalues of AA^T, and we denote them by
σ_1(A) ≥ σ_2(A) ≥ ... ≥ σ_m(A) ≥ 0.
In particular, the operator norm ‖A‖_op of any matrix A ∈ R^{m×n} equals its maximal singular value σ_1(A). See Figure 1.2 for an illustration.
Figure 1.2: The shaded ellipse is the image of the unit disk under a nonsingular matrix A ∈ R^{2×2}. The radii of the circumscribed and inscribed circles are σ_1(A) and σ_2(A), respectively.
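For a 2×2 matrix, the picture in Figure 1.2 can be reproduced with a few lines of arithmetic. The following sketch is an illustration under stated assumptions: the function names and the sample matrix are hypothetical, and the closed-form eigenvalue formula is specific to the 2×2 case. It computes the singular values as square roots of the eigenvalues of AA^T and compares them with the largest and smallest lengths ‖Ax‖ over sampled unit vectors x.

```python
import math

def singular_values_2x2(a, b, c, d):
    """Singular values (sigma_1 >= sigma_2 >= 0) of the matrix [[a, b], [c, d]].

    They are the square roots of the eigenvalues of A A^T, a symmetric
    2x2 matrix whose eigenvalues follow from the quadratic formula.
    """
    # Entries of the symmetric matrix A A^T = [[p, r], [r, s]].
    p = a * a + b * b
    s = c * c + d * d
    r = a * c + b * d
    mean = 0.5 * (p + s)
    radius = math.hypot(0.5 * (p - s), r)
    lam_max = mean + radius
    lam_min = max(mean - radius, 0.0)  # clip tiny negative rounding errors
    return math.sqrt(lam_max), math.sqrt(lam_min)

def stretch(a, b, c, d, theta):
    """Length of A x for the unit vector x = (cos theta, sin theta)."""
    x, y = math.cos(theta), math.sin(theta)
    return math.hypot(a * x + b * y, c * x + d * y)

A = (2.0, 1.0, 0.0, 1.0)   # hypothetical nonsingular matrix [[2, 1], [0, 1]]
sigma1, sigma2 = singular_values_2x2(*A)

# Sample the image of the unit circle; its extreme radii approximate
# the radii of the circumscribed and inscribed circles.
lengths = [stretch(*A, 2 * math.pi * k / 3600) for k in range(3600)]
print(sigma1, max(lengths))
print(sigma2, min(lengths))
```

The largest sampled length approximates the operator norm, in line with the identity ‖A‖_op = σ_1(A).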
1.4 Set operations
In this section, we review notation for sums, generated cones, and images/preimages of sets. For any two sets A, B ⊂ E and λ ∈ R, define the set operations:
λA := {λa : a ∈ A} and A + B := {a + b : a ∈ A, b ∈ B}.
Thus the points in λA are simply the points in A scaled by λ. One can visualize the sum A + B by writing it more suggestively as
⋃_{a∈A} (a + B).
Thus A + B is formed from the union of the shifted sets a + B over all points a ∈ A. In particular, forming the sum of a set A ⊂ E and a unit ball B ⊂ E has the effect of “fattening” A. The symbol A − B is defined similarly. The cone generated by a set A ⊂ E will be denoted by
R+A := {λx : x ∈ A, λ ≥ 0 }.
See Figure 1.3 for an illustration of the generated cone and sum operation.
Figure 1.3: Sum and cone operations: (a) generated cone R+A; (b) disk plus square, A + B.
For any map F : E → Y and sets A ⊂ E and B ⊂ Y, define the two sets
FA = {F(x) : x ∈ A} and F^{-1}B = {x : F(x) ∈ B}.
The set FA is called the image of A under F, while F^{-1}B is called the preimage of B under F. Notice that the sum A + B can also be written as the linear image of the product set Q := A × B under the map F(x, y) = x + y.
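For finite sets, these operations can be written out directly. The sketch below is a toy illustration, not part of the notes: points are stored as tuples, all function names are hypothetical, and `preimage` only searches a finite candidate domain supplied by the caller. It confirms that the sum A + B coincides both with the union of the shifted sets a + B and with the image of the product set A × B under F(x, y) = x + y.

```python
from itertools import product

def scale(lam, A):
    """Return the set lam * A = {lam * a : a in A} for points stored as tuples."""
    return {tuple(lam * ai for ai in a) for a in A}

def minkowski_sum(A, B):
    """Return A + B = {a + b : a in A, b in B}."""
    return {tuple(ai + bi for ai, bi in zip(a, b)) for a in A for b in B}

def image(F, A):
    """Return the image F A = {F(x) : x in A}."""
    return {F(x) for x in A}

def preimage(F, B, domain):
    """Return the points x of a finite candidate `domain` with F(x) in B."""
    return {x for x in domain if F(x) in B}

def add_pair(pair):
    """The map F(x, y) = x + y acting on a pair of points."""
    x, y = pair
    return tuple(xi + yi for xi, yi in zip(x, y))

A = {(0, 0), (1, 0)}
B = {(0, 0), (0, 1), (0, 2)}

# A + B as the union of the shifted copies a + B.
union_of_shifts = set().union(*(minkowski_sum({a}, B) for a in A))

# A + B as the image of the product set A x B under F(x, y) = x + y.
image_of_product = image(add_pair, set(product(A, B)))

print(sorted(minkowski_sum(A, B)))
print(minkowski_sum(A, B) == union_of_shifts == image_of_product)
```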
Figure 1.4: Closed functions: (a) closed; (b) not closed.
The following exercise shows that the infimal value of a closed function on a compact set is always attained.
Exercise 1.7 (Existence of minimizers on compact sets). Consider a closed function f : E → R and a compact set Q ⊂ E. Then the infimal value inf_{x∈Q} f(x) is attained at some point in Q.
[Hint: Apply the Bolzano-Weierstrass Theorem to a sequence x_i ∈ Q satisfying f(x_i) → inf_Q f and invoke lower-semicontinuity.]
An important downside of the above exercise is that it only guarantees existence of minimizers over compact sets. In light of the exponential example mentioned previously, if we wish to guarantee existence of minimizers over E, then we must focus on a favorable class of functions.
Definition 1.8 (Coercive). A function f : E → R is coercive if for any sequence x_i with ‖x_i‖ → ∞, it must be that f(x_i) → ∞.
Equivalently, a function f is coercive precisely when the sublevel sets {x : f(x) ≤ r} are bounded for every r ∈ R (check this!). For example, the function f(x) = e^{x^2} is coercive while the exponential f(x) = e^x is not.
Exercise 1.9 (Existence of unconstrained minimizers). Any coercive closed function f : E → R has a minimizer. [Hint: Choose r ∈ R such that the sublevel set L = {x : f(x) ≤ r} is nonempty and apply Exercise 1.7.]
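The sublevel-set characterization of coercivity can be illustrated on a grid. The sketch below is only a sampled picture, not a proof: the grid, the level r = 10, and the function names are assumptions made for the example, with exp(x²) standing in for a coercive function and exp(x) for a non-coercive one.

```python
import math

def f_coercive(x):
    """f(x) = exp(x^2): grows without bound as |x| grows."""
    return math.exp(x * x)

def f_not_coercive(x):
    """f(x) = exp(x): stays bounded as x tends to minus infinity."""
    return math.exp(x)

def sublevel_points(f, r, grid):
    """Return the grid points x with f(x) <= r (a sampled sublevel set)."""
    return [x for x in grid if f(x) <= r]

grid = [k * 0.5 for k in range(-40, 41)]   # sample points in [-20, 20]
r = 10.0
bounded = sublevel_points(f_coercive, r, grid)
unbounded = sublevel_points(f_not_coercive, r, grid)

print(min(bounded), max(bounded))       # stays inside a small interval
print(min(unbounded), max(unbounded))   # reaches the left edge of the grid
```

Enlarging the grid leaves the first sampled sublevel set unchanged, while the second keeps extending to the left. This is why exp(x) has infimum 0 over the real line without attaining it.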
1.6 Differentiability
For the rest of the section, let E and Y be two Euclidean spaces, and U an open subset of E. A mapping F : Q → Y, defined on a subset Q ⊂ E, is continuous at a point x ∈ Q if for any sequence x_i in Q converging to x, the values F(x_i) converge to F(x). We say that F is continuous if it is continuous at every x ∈ Q. We say that F is L-Lipschitz continuous if
‖F (y) − F (x)‖ ≤ L‖y − x‖ for all x, y ∈ Q.
A function f : U → R is differentiable at a point x in U if there exists a vector, denoted by ∇f(x) ∈ E, satisfying
$$\lim_{h \to 0} \frac{f(x + h) - f(x) - \langle \nabla f(x), h \rangle}{\|h\|} = 0. \tag{1.1}$$
Rather than carrying such fractions around, which can be cumbersome, it is convenient to introduce the following notation. The symbol o(r) will always stand for a term satisfying lim_{r↓0} o(r)/r = 0. Then the equation (1.1) simply amounts to the expression
f (x + h) = f (x) + 〈∇f (x), h〉 + o(‖h‖).
The vector ∇f(x) is called the gradient of f at x. In the most familiar setting E = R^n, the gradient is simply the vector of partial derivatives
$$\nabla f(x) = \begin{pmatrix} \partial f(x)/\partial x_1 \\ \partial f(x)/\partial x_2 \\ \vdots \\ \partial f(x)/\partial x_n \end{pmatrix}.$$
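The definition of the gradient can be checked numerically on a simple example. The following sketch is an illustration under stated assumptions: the quadratic test function, its hand-computed gradient, and the helper names are hypothetical choices. It compares the vector of partial derivatives with a central-difference approximation and then evaluates the quotient appearing in the definition of differentiability for shrinking displacements h.

```python
import math

def f(x):
    """Hypothetical test function f(x) = x_1^2 + 3 x_1 x_2 + 2 x_2^2."""
    return x[0] ** 2 + 3.0 * x[0] * x[1] + 2.0 * x[1] ** 2

def grad_f(x):
    """Gradient of f computed by hand: the vector of partial derivatives."""
    return [2.0 * x[0] + 3.0 * x[1], 3.0 * x[0] + 4.0 * x[1]]

def numerical_gradient(func, x, step=1e-6):
    """Approximate each partial derivative by a central difference."""
    grad = []
    for i in range(len(x)):
        forward = list(x)
        backward = list(x)
        forward[i] += step
        backward[i] -= step
        grad.append((func(forward) - func(backward)) / (2.0 * step))
    return grad

def first_order_ratio(x, h):
    """Return |f(x + h) - f(x) - <grad f(x), h>| / ||h||."""
    shifted = [xi + hi for xi, hi in zip(x, h)]
    linear = sum(gi * hi for gi, hi in zip(grad_f(x), h))
    return abs(f(shifted) - f(x) - linear) / math.hypot(*h)

x = [1.0, 2.0]
print(grad_f(x), numerical_gradient(f, x))
for t in (1e-1, 1e-2, 1e-3):
    print(t, first_order_ratio(x, [t, t]))   # shrinks proportionally to t
```

For this quadratic the quotient equals 6t/√2 along h = (t, t), so it tends to zero with h, as the definition requires.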
If the gradient mapping x ↦ ∇f(x) is well-defined and continuous on U, we say that f is C^1-smooth. If the stronger property
‖∇f(y) − ∇f(x)‖∗ ≤ β‖y − x‖ holds for all x, y ∈ U,
then we say that f is β-smooth. Recall that ‖·‖ denotes the Euclidean norm in E and ‖·‖∗ is the dual norm. More generally, a mapping F : U → Y is differentiable at x ∈ U if there exists a linear mapping from E to Y, denoted by ∇F(x), satisfying
F(x + h) = F(x) + ∇F(x)h + o(‖h‖).
The linear mapping ∇F(x) is called the Jacobian of F at x. If the assignment x ↦ ∇F(x) is continuous, we say that F is C^1-smooth. In the most familiar
Exercise 1.11. Define the function f(x) = ½‖F(x)‖^2, where F : E → Y is a C^1-smooth mapping. Prove the identity ∇f(x) = ∇F(x)∗F(x).
Exercise 1.12. Consider a function f : U → R and a linear mapping A : Y → E and define the composition h(x) = f(Ax).
Exercise 1.13. Define the two sets
R^n_{++} := {x ∈ R^n : x_i > 0 for all i = 1, ..., n}, S^n_{++} := {X ∈ S^n : X ≻ 0}.
Consider the two functions f : R^n_{++} → R and F : S^n_{++} → R given by
f(x) = −∑_{i=1}^n log x_i and F(X) = −ln det(X),
respectively. Note, from basic properties of the determinant, the equality F(X) = f(λ(X)), where we set λ(X) := (λ_1(X), ..., λ_n(X)).
(X + V)^{-1} = X^{-1/2}
and then use the expansion (I + A)^{-1} = I − A + A^2 − A^3 + ... = I − A + O(‖A‖^2_op), whenever ‖A‖_op < 1.]
X^{-1/2} V X^{-1/2}‖^2_F for any X ≻ 0 and V ∈ S^n. Deduce that the operator ∇^2 F(X) : S^n → S^n is positive definite.
1.7 Accuracy in approximation and optimality conditions
A set Q in E is convex if for any two points x, y ∈ Q and real λ ∈ [0, 1], the point λx + (1 − λ)y lies in Q. In other words, a set Q is convex if and only if the line segment joining any two points x, y ∈ Q lies entirely in Q. Throughout this section, we let U be an open, convex subset of E. Consider a C^1-smooth function f : U → R and a point x ∈ U. Classically, the linear function
l(x; y) = f(x) + 〈∇f(x), y − x〉
is a “best first-order approximation” of f near x. If f is C^2-smooth, then the quadratic function
Q(x; y) = f(x) + 〈∇f(x), y − x〉 + ½〈∇^2 f(x)(y − x), y − x〉
is a “best second-order approximation” of f near x. These two functions play a fundamental role when designing and analyzing algorithms: they furnish simple linear and quadratic local models of f. In this section, we aim to quantify how closely l(x; ·) and Q(x; ·) approximate f. All results will follow quickly by restricting multivariate functions to line segments and then applying the fundamental theorem of calculus for univariate functions. To this end, the following observation plays a basic role.
Exercise 1.14. Consider a function f : U → R and two points x, y ∈ U. Define the univariate function ϕ : [0, 1] → R given by ϕ(t) = f(x + t(y − x)) and let x_t := x + t(y − x) for any t.
ϕ′(t) = 〈∇f(x_t), y − x〉 holds for any t ∈ (0, 1).
ϕ′′(t) = 〈∇^2 f(x_t)(y − x), y − x〉 holds for any t ∈ (0, 1).
Corollary 1.17 (First and second order expansions). Suppose that f : U → R is C^1-smooth. Then for any point x̄ ∈ U, the estimate holds:
$$\lim_{x, y \to \bar{x}} \frac{f(y) - l(x; y)}{\|y - x\|} = 0.$$
If f is in addition C^2-smooth, then the estimate holds:
$$\lim_{x, y \to \bar{x}} \frac{f(y) - Q(x; y)}{\|y - x\|^2} = 0.$$
When the mappings ∇f and ∇^2 f are Lipschitz continuous, one has even greater control on the accuracy of approximation, in essence passing from little-o terms to big-O terms.
Exercise 1.18 (Accuracy in approximation under Lipschitz conditions). Suppose f : U → R is a β-smooth function. Then for any points x, y ∈ U the inequality
$$|f(y) - l(x; y)| \le \frac{\beta}{2} \|y - x\|^2$$
holds. Moreover, if f is C^2-smooth and satisfies the estimate
‖∇^2 f(y) − ∇^2 f(x)‖_op ≤ M‖y − x‖ for all x, y ∈ U,
then the inequality
$$|f(y) - Q(x; y)| \le \frac{M}{6} \|y - x\|^3$$
holds for all x, y ∈ U.
[Hint: This follows directly from Corollary 1.16.]
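The two bounds of Exercise 1.18 can be observed on a one-dimensional example. The sketch below is an illustration, not a solution of the exercise: it assumes the hypothetical choice f = sin on U = R, for which both the first and the second derivative are 1-Lipschitz, so β = 1 and M = 1, and it checks the inequalities on a grid of pairs (x, y).

```python
import math

def linear_model(x, y):
    """l(x; y) = f(x) + f'(x)(y - x) for f = sin."""
    return math.sin(x) + math.cos(x) * (y - x)

def quadratic_model(x, y):
    """Q(x; y) = l(x; y) + (1/2) f''(x)(y - x)^2 for f = sin."""
    return linear_model(x, y) - 0.5 * math.sin(x) * (y - x) ** 2

BETA = 1.0   # |cos y - cos x| <= |y - x|, so f' is 1-Lipschitz
M = 1.0      # |sin y - sin x| <= |y - x|, so f'' is 1-Lipschitz

def check_bounds(x, y, tol=1e-12):
    """Return True if both approximation bounds hold at the pair (x, y)."""
    d = abs(y - x)
    first = abs(math.sin(y) - linear_model(x, y)) <= 0.5 * BETA * d ** 2 + tol
    second = abs(math.sin(y) - quadratic_model(x, y)) <= M / 6.0 * d ** 3 + tol
    return first and second

# Grid of base points x and displacements y - x (an arbitrary choice).
pairs = [(0.1 * i, 0.1 * i + 0.25 * j)
         for i in range(-20, 21) for j in range(-8, 9)]
print(all(check_bounds(x, y) for x, y in pairs))
```

The small tolerance only absorbs floating-point rounding; the inequalities themselves are exact consequences of Taylor's theorem for this choice of f.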
Corollary 1.17 and Exercise 1.18 play central roles in optimization, as will become clear in later chapters. We end this section with one useful consequence of Corollary 1.17: derivative-based conditions for a point to be a local minimizer of a smooth function. A point x is called a local minimizer of a function f : E → R if there exists a convex neighborhood U of x such that f (x) ≤ f (y) for all y ∈ U. Observe that naively checking if x is a local minimizer of f from the very definition requires evaluation of f at every point near x, an impossible task. We now derive a verifiable necessary condition for local optimality based on the gradient.
Theorem 1.19. (First-order necessary conditions) Suppose that x is a local minimizer of a function f : U → R. If f is differentiable at x, then equality ∇f (x) = 0 holds.
Proof. Set v := −∇f(x). Then for all small t > 0, the definition of differentiability implies
$$0 \le \frac{f(x + tv) - f(x)}{t} = -\|\nabla f(x)\|^2 + \frac{o(t)}{t}.$$
Letting t tend to zero yields ∇f (x) = 0, as claimed.
To obtain verifiable sufficient conditions for optimality, higher order derivatives are required.
Theorem 1.20. (Second-order conditions) Consider a C^2-smooth function f : U → R and fix a point x ∈ U. Then the following are true.
1. If x is a local minimizer of f, then
∇f(x) = 0 and ∇^2 f(x) ⪰ 0.
2. If the relations
∇f(x) = 0 and ∇^2 f(x) ≻ 0
hold, then x is a local minimizer of f. More precisely, it holds:
$$\liminf_{y \to x} \frac{f(y) - f(x)}{\tfrac{1}{2} \|y - x\|^2} \ge \lambda_n(\nabla^2 f(x)).$$
Proof. Suppose first that x is a local minimizer of f. Then Theorem 1.19 guarantees ∇f(x) = 0. Consider an arbitrary vector v ∈ E. Then for all small t > 0, we deduce from a second-order expansion (1.2) the estimate
$$0 \le \frac{f(x + tv) - f(x)}{\tfrac{1}{2} t^2} = \langle \nabla^2 f(x) v, v \rangle + \frac{o(t^2)}{t^2}.$$
Letting t tend to zero yields 〈∇^2 f(x)v, v〉 ≥ 0 for all v ∈ E, as claimed.
Suppose ∇f(x) = 0 and ∇^2 f(x) ≻ 0. Let ε > 0 be such that B_ε(x) ⊂ U. Then for points y sufficiently close to x, the second-order expansion (1.2) yields the estimate
$$\frac{f(y) - f(x)}{\tfrac{1}{2} \|y - x\|^2} = \left\langle \nabla^2 f(x) \frac{y - x}{\|y - x\|}, \frac{y - x}{\|y - x\|} \right\rangle + \frac{o(\|y - x\|^2)}{\|y - x\|^2} \ge \lambda_n(\nabla^2 f(x)) + \frac{o(\|y - x\|^2)}{\|y - x\|^2}.$$
Letting y tend to x, the result follows.
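The first- and second-order conditions translate into a simple test once the gradient and Hessian at a point are known. The sketch below is an illustration restricted to functions of two variables: the function names are hypothetical, and the gradients and Hessians of the three example functions at the origin were computed by hand. The test uses the closed-form eigenvalues of a symmetric 2×2 matrix.

```python
import math

def eig_sym_2x2(a, b, c):
    """Eigenvalues (largest, smallest) of the symmetric matrix [[a, b], [b, c]]."""
    mean = 0.5 * (a + c)
    radius = math.hypot(0.5 * (a - c), b)
    return mean + radius, mean - radius

def classify(grad, hess, tol=1e-12):
    """Apply the first- and second-order conditions at a point.

    grad is the gradient (g1, g2); hess holds the Hessian entries (a, b, c)
    of the symmetric matrix [[a, b], [b, c]].
    """
    if math.hypot(*grad) > tol:
        return "not a local minimizer (gradient is nonzero)"
    lam_max, lam_min = eig_sym_2x2(*hess)
    if lam_min > tol:
        return "local minimizer (Hessian is positive definite)"
    if lam_min < -tol:
        return "not a local minimizer (Hessian has a negative eigenvalue)"
    return "inconclusive (Hessian is positive semidefinite but singular)"

# Gradient and Hessian entries at the origin, computed by hand.
examples = {
    "x^2 + 3y^2": ((0.0, 0.0), (2.0, 0.0, 6.0)),
    "x^2 - y^2": ((0.0, 0.0), (2.0, 0.0, -2.0)),
    "x^2 + y^4": ((0.0, 0.0), (2.0, 0.0, 0.0)),
}
for name, (grad, hess) in examples.items():
    print(name, "->", classify(grad, hess))
```

The third example shows the gap between the necessary and the sufficient conditions: the origin minimizes x² + y⁴, yet the Hessian there is singular, so the second-order test alone cannot decide.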