A Singularly Valuable Decomposition, Essays (university) of Communications Engineering

A Singularly Valuable Decomposition: The SVD of a Matrix

Typology: Essays (university)

2018/2019

Uploaded on 03/03/2019

ahhmed
ahhmed 🇪🇬

1 document

1 / 27

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
A Singularly Valuable Decomposition: The SVD of a Matrix
Dan Kalman
The American University
Washington, DC 20016
February 13, 2002
Every teacher of linear algebra should be familiar with the matrix singular value decomposition (or
SVD). It has interesting and attractive algebraic properties, and conveys important geometrical and
theoretical insights about linear transformations. The close connection between the SVD and the well
known theory of diagonalization for symmetric matrices makes the topic immediately accessible to linear
algebra teachers, and indeed, a natural extension of what these teachers already know. At the same
time, the SVD has fundamental importance in several different applications of linear algebra. Strang
was aware of these facts when he introduced the SVD in his now classical text [22, page 142], observing
...it is not nearly as famous as it should be.
Golub and Van Loan ascribe a central significance to the SVD in their definitive explication of numerical
matrix methods [8, page xiv] stating
...perhaps the most recurring theme in the book is the practical and theoretical value of [the
SVD].
Additional evidence of the significance of the SVD is its central role in a number of papers in recent
years in Mathematics Magazine and The American Mathematical Monthly (for example [2, 3, 17, 23]).
Although it is probably not feasible to include the SVD in the first linear algebra course, it definitely
deserves a place in more advanced undergraduate courses, particularly those with a numerical or applied
emphasis. My primary goals in this article are to bring the topic to the attention of a broad audience,
and to reveal some of the facets that give it both practical and theoretical significance.
1
pf3
pf4
pf5
pf8
pf9
pfa
pfd
pfe
pff
pf12
pf13
pf14
pf15
pf16
pf17
pf18
pf19
pf1a
pf1b

Partial preview of the text

Download A Singularly Valuable Decomposition and more Essays (university) Communications Engineering in PDF only on Docsity!

A Singularly Valuable Decomposition: The SVD of a Matrix

Dan Kalman

The American University

Washington, DC 20016

February 13, 2002

Every teacher of linear algebra should be familiar with the matrix singular value decomposition (or SVD). It has interesting and attractive algebraic properties, and conveys important geometrical and theoretical insights about linear transformations. The close connection between the SVD and the well known theory of diagonalization for symmetric matrices makes the topic immediately accessible to linear algebra teachers, and indeed, a natural extension of what these teachers already know. At the same time, the SVD has fundamental importance in several different applications of linear algebra. Strang was aware of these facts when he introduced the SVD in his now classical text [22, page 142], observing

... it is not nearly as famous as it should be.

Golub and Van Loan ascribe a central significance to the SVD in their definitive explication of numerical matrix methods [8, page xiv] stating

... perhaps the most recurring theme in the book is the practical and theoretical value of [the SVD].

Additional evidence of the significance of the SVD is its central role in a number of papers in recent years in Mathematics Magazine and The American Mathematical Monthly (for example [2, 3, 17, 23]).

Although it is probably not feasible to include the SVD in the first linear algebra course, it definitely deserves a place in more advanced undergraduate courses, particularly those with a numerical or applied emphasis. My primary goals in this article are to bring the topic to the attention of a broad audience, and to reveal some of the facets that give it both practical and theoretical significance.

Theory

The SVD is intimately related to the familiar theory of diagonalizing a symmetric matrix. Recall that if A is a symmetric real n × n matrix, there is an orthogonal matrix V and a diagonal D such that A = V DV T^. Here the columns of V are eigenvectors for A and form an orthonormal basis for Rn; the diagonal entries of D are the eigenvalues of A. To emphasize the connection with the SVD, we will refer to V DV T^ as the eigenvalue decomposition, or EVD, for A.

For the SVD we begin with an arbitrary real m × n matrix A. As we shall see, there are orthogonal matrices U and V and a diagonal matrix, this time denoted Σ, such that A = U ΣV T^. In this case, U is m × m and V is n × n, so that Σ is rectangular with the same dimensions as A. The diagonal entries of Σ, that is the Σii = σi, can be arranged to be nonnegative and in order of decreasing magnitude. The positive ones are called the singular values of A. The columns of U and V are called left and right singular vectors, for A.

The analogy between the EVD for a symmetric matrix and SVD for an arbitrary matrix can be extended a little by thinking of matrices as linear transformations. For a symmetric matrix A, the transformation takes Rn^ to itself, and the columns of V define an especially nice basis. When vectors are expressed relative to this basis, we see that the transformation simply dilates some components and contracts others, according to the magnitudes of the eigenvalues (with a reflection through the origin tossed in for negative eigenvalues). Moreover, the basis is orthonormal, which is the best kind of basis to have.

Now let’s look at the SVD for an m × n matrix A. Here the transformation takes Rn^ to a different space, Rm, so it is reasonable to ask for a natural basis for each of domain and range. The columns of V and U provide these bases. When they are used to represent vectors in the domain and range of the transformation, the nature of the transformation again becomes transparent: it simply dilates some components and contracts others, according to the magnitudes of the singular values, and possibly discards components or appends zeros as needed to account for a change in dimension. From this perspective, the SVD tells us how to choose orthonormal bases so that the transformation is represented by a matrix with the simplest possible form, that is, diagonal.

How do we choose the bases {v 1 , v 2 , · · · , vn} and {u 1 , u 2 , · · · , um} for the domain and range? There is no difficulty in obtaining a diagonal representation. For that, we need only Avi = σiui, which is easily arranged. Select an orthonormal basis {v 1 , v 2 , · · · , vn} for Rn^ so that the first k elements span the row space of A and the remaining n − k elements span the null space of A, where k is the rank of A. Then for 1 ≤ i ≤ k define ui to be a unit vector parallel to Avi, and extend this to a basis for Rm. Relative to these bases, A will have a diagonal representation. But in general, although the v’s are orthogonal, there is no reason to expect the u’s to be. The possibility of choosing the v–basis so that its orthogonality is preserved under A is the key point. We show next that the EVD of the n × n symmetric matrix AT^ A provides just such a basis, namely, the eigenvectors of AT^ A.

Let AT^ A = V DV T^ , with the diagonal entries λi of D arranged in nonincreasing order, and let the

v 1 ... vk

vk+1 ... vn

u 1 ... uk

uk+1 ... um

A

null(A)

null(A)

range(A)

range(A)

Figure 1: Strang’s Diagram

computations. Procedures that seem elegant and direct in theory can sometimes be dramatically flawed as prescriptions for computational algorithms. The SVD illustrates this point. Our construction appears to offer a straightforward algorithm for the SVD: form AT^ A, compute its eigenvalues and eigenvectors, and then find the SVD as described above. Here practice and theory go their separate ways. As we shall see later, the computation of AT^ A can be subject to a serious loss of precision. It turns out that direct methods exist for finding the SVD of A without forming AT^ A, and indeed in many applications the practical importance of the SVD is that it allows one to find the EVD of AT^ A without ever forming this numerically treacherous product.

Let us now proceed to several additional aspects of the SVD.

The SVD and EVD. Our construction showed how to determine the SVD of A from the EVD of the symmetric matrix AT^ A. Conversely, it is easy to recover the EVD of AT^ A from the SVD of A. For suppose the singular value decomposition A = U ΣV T^ is given. Clearly, AT^ A = V ΣT^ ΣV T^ and

AAT^ = U ΣΣT^ U T^. Now in either order the product of Σ and ΣT^ is a square diagonal matrix whose first k diagonal entries are the σ^2 i , and with any remaining diagonal entries equal to 0. Thus, AT^ A = V ΣT^ ΣV T is the EVD of AT^ A and and AAT^ = U ΣΣT^ U T^ is the EVD of AAT^. Our argument also yields a uniqueness result for the singular value decomposition. In any SVD of A, the right singular vectors (columns of V ) must be the eigenvectors of AT^ A, the left singular vectors (columns of U ) must be the eigenvectors of AAT^ , and the singular values must be the square roots of the nonzero eigenvalues common to these two symmetric matrices. Thus, up to possible orthogonal transformations in multidimensional eigenspaces of AT^ A and AAT^ , the matrices V and U in the SVD are uniquely determined. Finally, note that if A itself is square and symmetric, each eigenvector for A with eigenvalue λ is an eigenvector for A^2 = AT^ A = AAT^ with eigenvalue λ^2. Hence the left and right singular vectors for A are simply the eigenvectors for A, and the singular values for A are the absolute values of its eigenvalues. That is, the EVD and SVD essentially coincide for symmetric A, and are actually identical if A has no negative eigenvalues. In particular, for any A, the SVD and EVD of AT^ A are the same.

A geometric interpretation of the SVD. One way to understand how A deforms space is to consider its action on the unit sphere in Rn. An arbitrary element x of this unit sphere can be represented by x = x 1 v 1 + x 2 v 2 + · · · + xnvn with

∑n 1 x 2 i = 1.^ The image is^ Ax^ =^ σ^1 x^1 u^1 +^ · · ·^ +^ σkxkuk. Letting yi = xiσi, we see the image of the unit sphere consists of the vectors y 1 u 1 + y 2 u 2 + · · · + ykuk, where

y^21 σ 12

y 22 σ^22

y^2 k σ k^2

∑^ k

1

x^2 i

≤ 1

If A has full column rank, so that k = n, the inequality is actually a strict equality. Otherwise, some of the xi are missing on the right, and the sum can be anything from 0 to 1. This shows that A maps the unit sphere of Rn^ to a k–dimensional ellipsoid with semi-axes in the directions ui and with the magnitudes σi. If k = n the image is just the surface of the ellipsoid, otherwise it is the solid ellipsoid. In summary, we can visualize the effect A as follows: it first collapses n − k dimensions of the domain, then distorts the remaining dimensions, stretching and squeezing the unit k–sphere into an ellipsoid, and finally embeds the ellipsoid in Rm. This is illustrated for n = m = 3 and k = 2 in Fig. (2).

As an immediate consequence, we see that ‖A‖, the operator norm of A, defined as the maximum value of |Av| for v on the unit sphere, is simply σ 1 , the largest singular value of A. Put another way, we have the inequality |Ax| ≤ σ 1 |x| for all x ∈ Rn, with equality only when x is a multiple of v 1.

Partitioned matrices and the outer product form of the SVD. When viewed in a purely algebraic sense, any zero rows and columns of the matrix Σ are superfluous. They can be eliminated if

in some expositions: Any m × n matrix A of rank k can be expressed in the form A = U ΣV T^ where U is an m × k matrix such that U T^ U = I, Σ is a k × k diagonal matrix with positive entries in decreasing order on the diagonal, and V is an n × k matrix such that V T^ V = I.

The partitioned matrix formulation of the SVD is a little unusual in one respect. Usually in a matrix product XY , we focus on the rows in X and on the columns in Y. Here, the factors are expressed in just the opposite way. This is the ideal situation to apply what is called an outer product expansion for the product of two matrices. In general, if X is an m × k matrix with columns xi and Y is a k × n matrix with rows yTi , the matrix product XY can be expressed as

XY =

∑^ k

i=

xiyTi

Each of the terms xiyiT is an outer product of vectors xi and yj. It is simply the standard matrix product of a column matrix and a row matrix. The result can be visualized in terms of a multiplication table, with the entries of xi listed along the left margin and those of yj across the top. The body of the table is the outer product of xi and yj. This idea is illustrated in Fig. (3), showing the outer product of (a, b, c)

and (p, q, r). Observe that in the figure, each column is a multiple of

[ (^) a bc

]

and each row is a multiple

p q r a ap aq ar b bp bq br c cp cq cr

Figure 3: Outer Product as Multiplication Table

of

[

p q r

]

, so that the outer product is clearly of rank 1. In just the same way, the outer product xiyTi is a rank 1 matrix with columns which are multiples of xi and rows which are multiples of yi.

We shall return to outer product expansions in one of the applications of the SVD. Here, we simply apply the notion to express the SVD of A in a different form.

Let

X =

[

u 1 · · · uk

]

σ 1

.. . σk

 =^

[

σ 1 u 1 · · · σkuk

]

and

Y =

vT 1 .. . vTk

Then A = XY can be expressed as an outer product expansion

A =

∑^ k

i=

σiuivTi

This is yet another form of the SVD, and it provides an alternate way of expressing how A transforms an arbitrary vector x. Clearly,

Ax =

∑^ k

i=

σiuivTi x

Since vTi x is a scalar, we can rearrange the order of the factors to

Ax =

∑^ k

i=

viT xσiui

Now in this sum Ax is expressed as a linear combination of the vectors ui. Each coefficient is a product of two factors, vTi x and σi. Of course, viT x = vi · x is just the ith^ component of x relative to the orthonormal basis {v 1 , · · · , vn}. Viewed in this light, the outer product expansion reconfirms what we already know: under the action of A each v component of x becomes a u component after scaling by the appropriate σ.

Applications

Generally, the SVD finds application in problems involving large matrices, with dimensions that can reach into the thousands. It is the existence of efficient and accurate computer algorithms for its computation that makes the SVD so useful in these applications. There is beautiful mathematics involved in the derivation of the algorithms, and the subject is worth looking into. However, for the present discussion of applications, we will treat the computation of the SVD as if performed by a black box. By way of analogy, consider any application of trigonometry. When we require the value of a sine or cosine, we simply push the buttons on a calculator with never a thought to the internal workings. We are confident that the calculator returns a close enough approximation to serve our purposes, and think no more about it. So, too, in the present discussion, let the reader be confident that computers can quickly and accurately approximate the SVD of arbitrary matrices. To extend the analogy, we will concentrate on when and why to push the SVD button, and how to use the results.

The SVD is an important tool in several different applications. Here we will present a detailed discussion of two samples that are apparently quite unrelated: linear least squares optimization and data compression with reduced rank approximations. Before proceeding to these topics, a few other applications will be briefly mentioned.

Applications in Brief. One of the most direct applications of the SVD is to the problem of computing the EVD of a matrix product AT^ A. This type of problem is encountered frequently under the name of

There is a geometric interpretation to the general least squares problem. We are seeking an element of the subspace S spanned by the ai which is closest to b. The solution is the projection of b on S, and is characterized by the condition that the error vector (that is, the vector difference between b and its projection) should be orthogonal to S. Orthogonality to S is equivalent to orthogonality to each of the ai. Thus, the optimal solution vector x must satisfy ai · (Ax − b) = 0 for all i. Equivalently, in matrix form, AT^ (Ax − b) = 0.

Rewrite the equation as AT^ Ax = AT^ b, a set of equations for the xi generally referred to as the normal equations for the linear least squares problem. Observe that the independence of the columns of A implies the invertibility of AT^ A. Therefore, we have x = (AT^ A)−^1 AT^ b.

That is a beautiful analysis. It is neat, it is elegant, it is clear. Unfortunately it is also poorly behaved when it is implemented in approximate computer arithmetic. Indeed, this is the classic example of the gap between theory and practice alluded to earlier. Numerically, the formation of AT^ A can dramatically degrade the accuracy of a computation; it is a step to be avoided. A detailed discussion of the reasons for this poor performance is provided in [8]. Here, we will be content to give some insight into the problem, using both a heuristic analysis and a numerical example. First, though, let us see how the SVD solves the least squares problem.

We are to choose the vector x so as to minimize |Ax − b|. Let the SVD of A be U ΣV T^ (where U and V are square orthogonal matrices, and Σ is rectangular with the same dimensions as A). Then we have

Ax − b = U ΣV T^ x − b = U (ΣV T^ x) − U (U T^ b) = U (Σy − c)

where y = V T^ x and c = U T^ b. Now U is an orthogonal matrix, and so preserves lengths. That is, |U (Σy − c)| = |Σy − c|, and hence |Ax − b| = |Σy − c|. This suggests a method for solving the least squares problem. First, determine the SVD of A and calculate c as the product of U T^ and b. Next, solve the least squares problem for Σ and c. That is, find a vector y so that |Σy − c| is minimal. (We shall see in a moment that the diagonal nature of Σ makes this step trivial.) Now y = V T^ x so we can determine x as V y. That gives the solution vector x as well as the magnitude of the error, |Σy − c|.

In effect, the SVD has allowed us to make a change of variables so that the least squares problem is reduced to a diagonal form. In this special form, the solution is easily obtained. We seek y to minimize the norm of the vector Σy − c. Let the nonzero diagonal entries of Σ be σi for 1 ≤ i ≤ k, and let the

components of y, which is the unknown, be yi for 1 ≤ i ≤ n. Then

Σy =

σ 1 y 1 .. . σkyk 0 0 .. . 0

so Σy − c =

σ 1 y 1 − c 1 .. . σkyk − ck −ck+ −ck+ .. . −cm

By inspection, when yi = ci/σi for 1 ≤ i ≤ k, Σy − c assumes its minimal length, which is given by [ (^) m ∑

i=k+

c^2 i

] 1 / 2

Note that when k = m, the sum is vacuous. In this case, the columns of A span Rm^ so the least squares problem can be solved with zero error. Also observe that when k is less than n, there is no constraint on the values of yk+1 through yn. These components can be assigned arbitrary values with no effect on the length of Σy − c.

As the preceding analysis shows, the SVD allows us to transform a general least squares problem into one that can be solved by inspection, with no restriction on the rank of the data matrix A. Indeed, we can combine the transformation steps and the solution of the simplified problem as follows. First, we know that c = U T^ b. The calculation of y from c amounts to multiplying by the matrix Σ+^ defined by transposing Σ and inverting the nonzero diagonal entries. Then y = Σ+c will have its first k entries equal to ci/σi, as required. Any remaining entries (which were previously unconstrained) all vanish. Finally, we compute x = V y. This gives the compact solution

x = V Σ+U T^ b (2)

At this point, the subject of generalized inverses surfaces. Given b ∈ Rm, there is a unique element x ∈ Rn^ of minimal norm for which Ax is nearest b. The mapping from b to that x defines what is called the Moore-Penrose generalized inverse, A+, of A. It is clear that Σ+, as defined in the solution of the lease squared problem for Σ and c, is the Moore-Penrose inverse of Σ. Moreover, Eq. (2) shows that when A = U ΣV T^ is the SVD of A, the generalized inverse is given by A+^ = V Σ+U T^. A nice treatment of the generalized inverse that is both compact and self-contained is given in [5], although there is no mention of the SVD in that work. Generalized inverses and their connection with the SVD are also discussed in [8, 14, 22, 23].

Now we are in a position to compare the solutions obtained using the SVD and the normal equations. The solutions are theoretically equivalent, so if they are carried out in exact arithmetic, they must yield the same results. But when executed in limited precision on a computer, they can differ significantly.

(ATA)-

However, in solving the normal equations we are interested in the inverse mapping, (AT^ A)−^1 , which has the same eigenvectors as AT^ A, but whose eigenvalues are the reciprocals 1/λi. Now it is λn, alone, that is significant, and the ellipsoid is essentially one dimensional (see Fig. (5)). In every direction except vn, the image of a random unit vector will have no noticable contribution. Arguing as before, significant

Figure 5: Unit Sphere Deformed by (AT^ A)−^1

errors are introduced in the contributions of all of the vi except vn.

Let us make this analysis more concrete. Let a particular vector x be selected, with x = x 1 v 1 + x 2 v 2 + · · · + xnvn. Thinking of x as a column vector, consider a single entry, c, and write ci for the corresponding entry of xivi. Then c = c 1 + c 2 + · · · + cn. In the image (AT^ A)−^1 x the corresponding entry will be c 1 /λ 1 + c 2 /λ 2 + · · · + cn/λn. Now if the ci are all of roughly comparable magnitude, then the final term will dwarf all the rest. The situation illustrated above will occur. The cn/λn term will play the role of 12345678.9012345; every other term will be in the position of .123456789012345. In effect, several decimal digits of each term, save the last, are neglected. Of course, the number of neglected digits depends on the difference in magnitude between λ 1 and λn. If λ 1 /λn is on the order of 10τ^ , then λ 1 and λn differ by τ orders of magnitude, and the inferior terms of the sum lose τ digits of accuracy. If τ is large enough, the contributions can be completely lost from every term except the last.

Does this loss of accuracy really matter? Isn’t the value of y = (AT^ A)−^1 x still correct to the accuracy of the computation? If y is really all we care about, yes. But y is supposed to be an approximate solution of AT^ Ay = x, and we should judge the adequacy of this approximation by looking at the difference between AT^ Ay and x. In exact arithmetic, the first n − 1 ci/λi, which were nearly negligable in the computation of y, regain their significance and make an essential contribution to the restoration of x. But the digits that were lost in the limited precision arithmetic cannot be recovered, and so every component, except the last, is either corrupted or lost entirely. Thus, while y = (AT^ A)−^1 x might be correct to the limits of computational precision, the computed value of AT^ Ay can be incorrect in every decimal place.

In the context of the normal equations, the computation of x = (AT^ A)−^1 AT^ b is supposed to result

in a product Ax close to b. This calculation is subject to the same errors as those described above. Small errors in the computation of (AT^ A)−^1 can be inflated by the final multiplication by A to severely degrade the accuracy of the final result.

Based on the foregoing, it is the range of the magnitudes of the eigenvalues of AT^ A that determines the effects of limited precision in the computation of the least-squares vector x, and these effects will be most severe when the ratio λ 1 /λn is large. As many readers will have recognized, this ratio is the condition number of the matrix AT^ A. More generally, for any matrix, the condition number is the ratio of greatest to least singular values. For a square matrix A, the condition number can be interpreted as the reciprocal of the distance to the nearest singular matrix ([8, p. 26]). A large condition number for a matrix is a sign that numerical instability may plague many types of calculations, particularly the solution of linear systems.

Now all of this discussion applies equally to the SVD solution of the least squares problem. There too we have to invert a matrix, multiplying U T^ b by Σ+. Indeed, the singular values are explicitly inverted, and we can see that a component in the direction of the smallest positive singular value gets inflated, in the nonsingular case by the factor 1/σn. Arguing as before, the effects of truncation are felt when the condition number σ 1 /σn is large. So the benefit of the SVD solution is not that the effects of near dependence of the columns of A are avoided. But let us compare the two condition numbers. We know that the eigenvalues of AT^ A are the squares of the singular values of A. Thus λ 1 /λn = (σ 1 /σn)^2 , the condition number of AT^ A is the square of the condition number of A. In view of our heuristic error analysis, this relation between the condition numbers explains a rule of thumb for numerical computation: the computation using AT^ A needs to be performed using roughly twice as many digits, to assure a final result with accuracy comparable to that provided by the SVD of A.

The heuristic analysis is plausible, but by itself cannot be completely convincing. The following numerical example provides some evidence in support of the heuristic argument. Of course, there is no substitute for a careful mathematical analysis, and the interested reader is encouraged to consult [8] for a discussion of condition number.

A Numerical Example We will simulate a least squares analysis where experimental data have been gathered for four variables and we are attempting to estimate one, the dependent variable, by a linear combination of the other three independent variables. For simplicity we suppose each variable is specified by just four data values, so the columns a 1 , a 2 , a 3 of A, as well as the dependent column b, lie in R^4. We would like the columns of A to be nearly dependent, so that we can observe the situation described in the heuristic analysis. This will be accomplished by making one column, say a 3 , differ from a linear combination of the other two by a very small random vector. Similarly, we want the dependent vector b to be near, but not in, the range of A, so we will define it, too, as a linear combination of a 1 and a 2 plus a small random vector. We then compute the least squares coefficients x = (x 1 , x 2 , x 3 ) by the two formulas x = V Σ+U T^ b and x = (AT^ A)−^1 AT^ b, and compare the errors by calculating the magnitude of the residual, |b − Ax|, in each case.

In choosing the random vectors in the definitions of a 3 and b, we have different goals in mind. The

and 1/2. Thus, the length of the random component of c3 will be at most 10−^7 , in accord with the discussion above.

Next, we define the b vector in a similar way, by adding a small random vector to a specified linear combination of columns of A.

b = 2c1 - 7c2 + .0001(rand(4,1) - .5[1 1 1 1]’)

This time the random vector will have length no greater than 10−^4 , again reflecting the previous discus- sion.

The SVD of A is quickly determined by MATLAB:

[U,S,V] = svd(A)

The three matrices U, S (which represents Σ), and V are displayed on the screen and kept in the computer memory. The singular values when we ran the program turned out to be 59.810, 2.5976 and 1. 0578 × 10 −^8. The condition number of A is therefore about 6 · 109 , and has the desired magnitude.

To compute the matrix S+^ we need to transpose the diagonal matrix S and invert the nonzero diagonal entries. This matrix will be denoted by G. It is defined by the following MATLAB commands:

G = S’

G(1,1) = 1/S(1,1)

G(2,2) = 1/S(2,2)

G(3,3) = 1/S(3,3)

Now let’s see how well the SVD solves the least squares problem. We multiply x = V Σ+U T^ b by the matrix A and see how far the result is from b. Upon receiving the commands

r1 = b - AVGU’b e1 = sqrt(r1’*r1)

MATLAB responds

e1 = 4.2681e-

As desired, the computed magnitude of the residual |b − Ax| is a bit smaller than 10−^4. So the SVD provided a satisfactory solution of our least squares problem.

The solution provided by the normal equations is x = (AT^ A)−^1 AT^ b. We enter

r2 = b - Ainv(A’A)A’b e2 = sqrt(r2’*r2)

and MATLAB responds

e2 = 51.

which is of the same order of magnitude as |b|, the distance from b to the origin! As we anticipated from our heuristic analysis, the solution to the least squares problem computed using the normal equations does a poor job of approximating b as a linear combination of the columns of A. What is more, the computed residual for this method is of no value as an estimate of the distance of b from the range of A.

This completes our in depth look at least squares problems. The next section provides a detailed look at another area of application: reduced rank approximation.

Data compression using reduced rank approximations. The expression of the SVD in terms of the outer product expansion was introduced earlier. This representation emerges in a natural way in another application of the SVD – data compression. In this application we begin with an m × n matrix A of numerical data, and our goal is to describe a close approximation to A using many fewer numbers than the mn original entries. The matrix is not considered as a linear transformation, or indeed as an algebraic object at all. It is simply a table of mn numbers and we would like to find an approximation that captures the most significant features of the data.

Because the rank of a matrix specifies the number of linearly independent columns (or rows), it is a measure of redundancy. A matrix of low rank has a large amount of redundancy, and so can be expressed much more efficiently than simply by listing all the entries. As a graphic example, suppose a scanner is used to digitize a photograph, replacing the image by an m × n matrix of pixels, each assigned a gray level on a scale of 0 to 1. Any large-scale features in the image will be reflected by redundancy in the columns or rows of pixels, and thus we may hope to recapture these features in an approximation by a matrix of lower rank than min(m, n).

The extreme case is a matrix of rank one. If B is such a matrix, then the columns are all multiples of one another – the column space is one dimensional. If u is the single element of a basis, then each column is a multiple of u. We represent the coefficients as vi, meaning that the ith^ column of B is given by viu, so that B = [v 1 u v 2 u · · · vnu] = uvT^. Thus, any rank one matrix can be expressed as an outer product, that is, as the product of a column and row. The mn entries of the matrix are determined by the m entries of the column and the n entries of the row. For this reason, we can achieve a large compression of the data matrix A if it can be approximated by a rank one matrix. Instead of the mn entries of A, we need only m + n numbers to represent this rank one approximation to A. It is natural to seek the best rank one approximation.

We will call B the best rank one approximation to A if the error matrix B − A has minimal norm, where now we define the (Frobenius) norm |X| of a matrix X to be simply the square root of the sum of

σ 1 u 1 v 1 T is one possible rank one approximation, and it gives rise to an error of σ 22 + · · · + σ k^2. The idea is to show that this is the smallest the error can be. For any rank one A 1 , using the fact that the Frobenius norm is preserved under multiplication by orthogonal matrices, |A−A 1 | = |U ΣV T^ −A 1 | = |Σ−U T^ A 1 V |, where A = U ΣV T^ expresses the SVD of A. Write U T^ A 1 V as αxyT^ with positive α and unit vectors x ∈ Rm^ and y ∈ Rn. Next use the properties of the matrix inner product:

|Σ − αxyT^ |^2 = |Σ|^2 + α^2 − 2 αΣ · xyT

Focusing on the final term, the matrix inner product of the diagonal matrix Σ with the outer product xyT^ is given by

Σ · xyT^ =

∑^ k

i=

σixiyi

∑^ k

i=

σi|xi||yi|

≤ σ 1

∑^ k

i=

|xi||yi|

= σ 1 x∗^ · y∗

where x∗^ = (|x 1 |, · · · , |xk|) and y∗^ = (|y 1 |, · · · , |yk|). By the Cauchy–Schwarz inequality, x∗^ · y∗^ ≤ |x∗||y∗| ≤ |x||y| = 1. Combined with the last equation above, this gives Σ · xyT^ ≤ σ 1.

Returning to the main thread of the argument, we now have |Σ − αxyT^ |^2 ≥ |Σ|^2 + α^2 − 2 ασ 1 = |Σ|^2 + (α − σ 1 )^2 − σ^21. We immediately see that the right-hand side is minimized when α is taken to be σ 1 , so that |Σ − αxyT^ |^2 ≥ |Σ|^2 − σ 12. Moreover, if the σi are distinct, this minimum is actually obtained only when α = σ 1 and x and y are of the form e 1 = (1, 0 , · · · , 0) in Rm^ and Rn, respectively. Finally, A 1 = αU xyT^ V T^ = σ 1 (U e 1 )(V e 1 )T^ = σ 1 u 1 vT 1. That is what we wished to show. (If σ 1 is a repeated singular value, a slight modification of the argument is needed. It amounts to making an orthogonal change of basis in the subspace of right singular vectors corresponding to σ 1 .)

This result shows that the SVD can be used to find the best rank one approximation to a matrix. But in many cases, the approximation will be too crude for practical use in data compression. The obvious next step is to look for good rank r approximations for r > 1. In this regard, there is an attractive greedy algorithm. To begin, choose a rank one A 1 for which the error E 1 = A − A 1 has minimal norm. Next choose a rank one matrix A 2 for which the norm of E 2 = E 1 − A 2 is minimal. Then A 1 + A 2 is a rank two approximation to A with error E 2. Continue in this fashion, each time choosing a best possible rank one approximation to the error remaining from the previous step. The procedure is a greedy algorithm because at each step we attempt to capture as large a piece of A as possible. After r steps, the sum A 1 + · · · + Ar is a rank r approximation to A. The process can be repeated for k steps (where k is the rank of A), at which point the error is reduced to zero. This results in the decomposition

A =

i

Ai

which is none other than the SVD of A. To be more precise, each Ai can be expressed as the product of a positive scalar σi with an outer product uivTi of unit vectors. Then A is given by

A =

i

σiuivTi

which, true to the notation, is the SVD of A expressed in the form of the outer product expansion.

This statement follows from the earlier result linking the best rank one approximation of a matrix to its largest singular value and corresponding singular vectors. Assuming the singular values are distinct, write A =

∑k i=1 σiuiv T i.^ It is immediate that^ A^1 must equal^ σ^1 u^1 v T 1 ,^ and^ E^1 =^

∑k i=2 σiuiv T i. But that gives the SVD of E 1 , from which we can instantly obtain the best rank one approximation A 2 = σ 2 u 2 vT 2. Clearly, this argument can be repeated until the complete decomposition of A is obtained. (As before, if the singular values of A are not distinct, the argument must be modified slightly to take into account orthogonal changes of basis in the subspace of right singular vectors corresponding to a particular singular value.)

The understanding we now have of the connection between successive best rank one estimates and the SVD can be summarized as follows. The outer product expansion form for the SVD, A =

σiuivTi , expresses A as a sum of rank one matrices that are orthogonal with respect to the matrix inner product. Truncating the sum at r terms defines a rank r matrix Sr =

∑r i=1 σiuiv T i. Approximating^ A^ with^ Sr leaves an error of Er = A − Sr =

∑k i=r+1 σiuiv

T i with^ |Er^ |

(^2) = ∑k i=r+1 σ

2 i and^ |A|

2 = |S

r |

2 + |E

r | (^2). The

sum Sr is the result of making successive best rank one approximations. But there is a stronger result: Sr is actually the best rank r approximation possible. Proofs of this assertion can be found in Lawson and Hanson [14, pages 23–26] and Leon [16, pages 422–424].

The properties of reduced rank approximations can now be used to clarify a remark made earlier. In the example showing the advantage of the SVD in least squares problems, the matrix A had three independent columns, but the final column differed from a linear combination of the other two by a small random vector. Subtracting that small random vector from the third column produces a rank two matrix B that closely approximates A. In fact, |A − B| is equal to the norm of the random vector, expected to be about 5 × 10 −^8. Of course, the minimal error for a rank two approximation to A must be precisely σ 3 , so we can conclude that σ 3 ≤ |A − B| ≈ 5 × 10 −^8. This shows that the magnitude of the random vector provides a bound on the least singular value of A. In particular, the order of magnitude of the least singular value can be controlled by choosing the random vector appropriately. This is what allowed us to construct A with a prescribed condition number in the least squares example.

In practice, the SVD can be used to choose the rank r, as well as the best rank r approximation to A. Note that with r terms, the SVD outer product expansion results in a relative error

|Er | |A|

∑k i=r+1 σ

2 i ∑k i=1 σ 2 i

Typically, the value of r is chosen to reduce this relative error to some prespecified threshhold. There is a nice visual example in [17] of this idea, used to approximate an image of a surface in R^3. Image