Conjugate Gradient - Matrix Computation - Lecture Slides, Slides of Advanced Computer Architecture

These lecture slides are very easy to understand and very helpful to built a concept about the Matrix computation.The key points discuss in these slides are:Conjugate Gradient, Convergence Rate, Optimality of Conjugate Gradients, Arbitrary Point, Monotonicity Property, Optimization Algorithm, Nonlinear Equation, Quadratic Function, Lanczos Iteration, Ritz Values

Typology: Slides

2012/2013

Uploaded on 04/27/2013

ashalata
ashalata 🇮🇳

3.8

(18)

106 documents

1 / 24

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
pf3
pf4
pf5
pf8
pf9
pfa
pfd
pfe
pff
pf12
pf13
pf14
pf15
pf16
pf17
pf18

Partial preview of the text

Download Conjugate Gradient - Matrix Computation - Lecture Slides and more Slides Advanced Computer Architecture in PDF only on Docsity!

  • Lecture

Overview

Conjugate gradient Convergence rate of conjugate gradient Preconditioning

Optimality of conjugate gradients (cont’d)

An inner product of rn and a vector in Kn is zero A crucial property that makes CG powerful It implies that ‖e‖^2 A = e> n Aen + (∆x)>A(∆x) Only the second term depends on ∆x and since A is positive definite, the first term is larger or equal to 0 The second term is 0 if and only if ∆x = 0 , i.e., xn = x Thus ‖e‖A is minimal if and only if xn = x as claimed The monotonicity property is a consequence of the inclusion Kn ⊆ Kn+1, and since Kn is a subset of IRm^ of dimension n as long as convergence has not yet been achieved, convergence must be achieved in at most m steps That is, each step of conjugate direction cuts down the error term component by component

Optimality of conjugate gradients (cont’d)

The guarantee that the CG iteration converges in at most m steps is void in floating point arithmetic For arbitrary matrices A on a real computer, no decisive reduction in ‖en‖A will necessarily be observed at all when n = m In practice, however, CG is used not for arbitrary matrices but for matrices whose spectra are well behaved (partially due to preconditioning) that convergence to a desired accuracy is achieved for n  m The theoretical exact convergence at n = m has no relevance to this use of the CG iteration in scientific computing

Conjugate gradients as an optimization algorithm (cont’d)

Cannot use ‖e‖A or ‖e‖^2 A as neither can be evaluated without knowing x∗ On the other hand, given A and b and x ∈ IRm, the quantity

φ(x) =

x>Ax − x>b

can certainly be evaluated as

‖en‖^2 A = e> n Aen = (x∗ − xn)>A(x∗ − xn) = x> n Axn − 2 x> n Ax∗ + x>∗ Ax∗ = x> n Axn − 2 x> n b + x>∗ b = 2 φ(xn) + constant

Like ‖e‖^2 A, it must achieve its minimum uniquely at x = x∗

Conjugate gradients as an optimization algorithm (cont’d)

The CG iteration can be interpreted as an iterative process for minimizing the quadratic function φ(x) of x ∈ IRm At each step, an iterate xn = xn− 1 + αnpn− 1 is computed that minimizes φ(x) over all x in the one dimensional space xn− 1 + 〈pn− 1 〉 It can be readily confirmed that the formula

αn =

r> n− 1 rn− 1 p> n− 1 Apn− 1

ensures αn is optimal in the sense among all step lengths α What makes the CG iteration remarkable is the choice of the search direction pn− 1 , which has the special property that minimizing φ(x) over xn− 1 + 〈pn− 1 〉 actually minimizes it over all of Kn

Conjugate gradients and polynomial approximation

Connection between Krylov subspace iteration and polynomials of matrices The Arnoldi and Lanczos iterations solve the Arnoldi/Lanczos approximation problem Find pn^ ∈ Pn^ such that ‖pn(A)b‖ = minimum The GMRES iteration solves the GMRES approximation problem Find pn ∈ Pn such that ‖pn(A)b‖ = minimum For CG, the appropriate approximation problem involves the A-norm of the error Find pn ∈ Pn such that ‖pn(A)e 0 ‖A = minimum where e 0 denotes the initial error e 0 = x∗ − x 0 = x∗, and Pn is again defined as GMRES (i.e., the set of polynomials p of degree ≤ n with p(0) = 1)

CG and polynomial approximation

Theorem

If the CG iteration has not already converged before step n (i.e., rn− 1 6 = 0 ), then ‖pn(A)e 0 ‖A has a unique solution pn ∈ Pn, and the iterate xn has error en = pn(A)e 0 for this same polynomial pn. Consequently, we have ‖en‖A ‖e 0 ‖A

= inf p∈Pn

‖p(A)e 0 ‖A ‖e 0 ‖A

≤ inf p∈Pn

max λ∈Λ(A)

|p(λ)| (2)

where Λ(A) denotes the spectrum of A

From theorem in the last lecture, it follows that en = p(A)e 0 for some p ∈ Pn The equality is a consequence of (2) and monotonic convergence (1)

Rate of CG convergence

First, we suppose that the eigenvalues are perfectly clustered but assume nothing about the locations of these clusters

Theorem

If A has only n distinct eigenvalues, then the CG iteration converges in at most n steps

This is a corollary of (2), since a polynomial p(x) =

∏n j=1(1^ −^ x/λj^ )^ ∈^ Pn^ exists that is zero at any specified set of n points {λj } At the other extreme, suppose we know nothing about any clustering of the eigenvalues but only that their distances from the origin vary by at most a factor κ ≥ 1 In other words, suppose we know only the 2-norm condition number κ = λmax /λmin, where λmax and λmin are the extreme eigenvalues of A

Rate of CG convergence (cont’d)

Theorem

Let the CG iteration be applied to a symmetric positive definite matrix problem Ax = b, where A has 2-norm condition number κ. Then the A-norm of the errors satisfy

‖en‖A ‖e 0 ‖A

/[( √

κ + 1 √ κ − 1

)n

κ + 1 √ κ − 1

)−n] ≤ 2

κ − 1 √ κ + 1

)n

See text for proof using Chebyshev polynomials Since (^) √ κ − 1 √ κ + 1

κ as κ → ∞, it implies that if κ is large but not too large, convergence to a specified tolerance can be expected in O(

κ) iterations An upper bound, and convergence may be faster for special right hand sides or if the spectrum is clustered

Example: CG convergence (cont’d)^ ^ ^ ^ ^ ^ ^  yH

10 −16 0 10 20

10 −

10 −

10 −

100

104



`^ ‡y^ yH^

`

By yE

`^ ‡y^ 

`^ By^ M

0 '0#  '   ’   ’8   Eyy&X Ey"y L“  8 j

%‘e8 #   ^ %‘@ 9 ”"8 $#‘ By &%^ yH • .K‘g”    By( MU•–’  '  ^04 :9^ yy^ ! 8“  Ž’

)( *   ^ ’  ”

' % 'x_!" Ha ^0 ^0 "# ^ ('^ W Y9.% % P"_^ %"U]" % 0 % "O'i9.D^ %# Z]  %    )='0i^ Myx  ) C_F^ (H

' k<9C" <8" %Ib5: + "ar] "K^ ‡yO^0 y, 8y^ yoyE8y^ y H,h=j"y^ MN( $ '0V y Mu % ^ PL"

%V 8 "" % )>  6" b9D Z ) %hf"5=# "U#a 5 = %,+ I$ XYy ,- y+ )CV]  0   0  % `;a8' +)*  "  

u% %  &9.D $, 'hn" %"U]" % 0 % 8" % ^ Z ) %h,"5:"#*y^ yEhm % "  EXYy^ 0/ h-^ )*y.+^ M

0  %"U]" % 0 %  `Ny /Ž @9[ R] MEhEM+4u % " -" $Yy  *hC9.DJ'\ E 0 "' 6"

* %  x ( %"U]" % 0 %^ My+ Z %) 2" &"' ^ ( xYy1^ - Y9[ % ZW )^ 0^  0  U]"'# 2^ `oy^ M h,9.'(Y9^ E %0V']" "hn /jy*h^ '^ u % " %h 

e NSe_‚ 0 % 2)= D']" 6%h 3(Ho' % '-"_Ž @ # ' "4* #  +"_[" %  W9.'@' (H ' % '4' 2a+  2 % % h5#E )) ")  `  

   7 2a  2 Yb ("."'+  D 75#E' YP %'   O

* "* %)^ " 7Y9  '^ ^   %D\ 0  0 "#  `‡y^ yH^ %#* ] 2"_^0 '0# ^ ('^ 7a^ 8  $ 7^ 8 %a%

- )* 75: % 9C  "V\I_‚^ K M(/ u^0  " N^ M^1  a)' "h ()= % O&5V;+‚K &"["5=#^ "+ %#Vf"_ 4 y"y +*_! #^ XYy^ P

 %D\hU"f ]" % Z+a ]" %R_Ž  . 25:  E Z: Z (Y " ' '02 @0OX`)  /) K % ." f # "%"'8b29[ %' * % YPL5= %Y]" % )= " #aeh

For τ = 0.01, A has 3,092 nonzero entries and κ ≈ 1 .06, the CG convergence takes place in 9 steps For τ = 0.05, A has 13,062 nonzero entries with κ ≈ 1 .83, and convergence takes place in 19 steps For τ = 0.1, A has 25,526 nonzero entries with κ ≈ 10 .3 and the process converges in 20 steps For τ = 0.2 with 50,834 nonzero entries, there is no convergence at all For this example, the CG beats Cholesky factorization by a factor of about 700 in terms of operation counts

Preconditioning

The convergence of a matrix iteration depends on the properties of the matrix - the eigenvalues, the singular values, or sometimes other information In many cases, the problem of interest can be transformed so that the properties of the matrix are improved drastically The process of preconditioning is essential to most successful applications of iterative methods

Preconditioning for Ax = b (cont’d)

Two extreme cases: I (^) If M = A, then (5) is the same as (3), and nothing has been gained I (^) If M = I , then (4) is the same as (3), and then it is a trivial solution Between these two extremes lie the useful preconditioners, I (^) structured enough (5) can be solved quickly I (^) but close enough to A in some sense that an iteration for (4) converges more quickly than an iteration for (3) What does it mean for M to be “close enough” to A? If the eigenvalues of M−^1 A are close to 1 and ‖M−^1 A − I ‖ 2 is small, then any of the iterations we have discussed can be expected to converge quickly However, preconditioners that do not satisfy such a strong condition may also perform well A simple rule of thumb: preconditioner M is good if M−^1 A is not too far from normal and its eigenvalues are clustered

Left, right and Hermitian preconditioners

What we have described may be more precisely terms as left preconditioner Another idea is to transform Ax = b into AM−^1 y = b with x = M−^1 y in which case M is called a right preconditioner If A is Hermitian positive definite, then it is usual to preserve this property in preconditioning Suppose M is also Hermitian positive definite, with M = CC ∗^ for some C , then (3) is equivalent to

[C −^1 AC −∗]C ∗x = C −^1 b

The matrix in brackets is Hermitian positive definite, so this equation can be solved by conjugate gradient or related iterations At the same time, since C −^1 AC −∗^ is similar to C −∗C −^1 A = M−^1 A, it is enough to examine the eigenvalues of the non-Hermitian matrix M−^1 A to investigate convergence