CG method
Assume A, the coefficient matrix of our system Ax = b, positive definite (recall that A and b are real). In the general scheme choose H = A, d 0 = r 0 = b − Ax 0 , d k = r k + β k−1 d k−1 , k = 1, 2, . . ., (r k = b − Ax k ) where β k−1 is such that
(d k , d k−1 ) A = 0
(d k conjugate to d k−1 ). Here below is the algorithm we obtain:
x 0 ∈ R n , r 0 = b − Ax 0 , d 0 = r 0 . For k = 0, 1, . . . , {
τ k = d
T k
r
kd
TkAd
kx k+1 = x k + τ k d k
r k+1 = b − Ax k+1 = r k − τ k Ad k
β k = − r
T k+1
Ad
kd
Tk
Ad
kd k+1 = r k+1 + β k d k }
known as Conjugate Gradient (CG) algorithm.
Remarks. Note that
0 = (x − x k+1 ) T Hd k = (x − x k+1 ) T Ad k = r T k+1 d k , r T k+1 d k+1 = kr k+1 k 2 2 . As a consequence, if at step s we have r s = b − Ax s 6= 0, then d s 6= 0, τ s is well defined and not zero . . .: the algorithm works.
If r 0 , . . ., r m−1 are non null and r m = 0, then β m−1 = 0, d m = 0, τ m cannot be defined, but it doen’t matter since x m = A −1 b . This hypothesis is effectively verified, in fact there exists m ≤ n = the order of A, such that r m = 0 (see below).
Alternative expressions for τ k and β k hold:
τ k = d T
k r k d T
k Ad k
= r T
k r k d T
k Ad k
(d k = r k + β k−1 d k−1 , r T k d k−1 = 0), β k = − r
T k+1
Ad
kd
Tk
Ad
k= − r
T
k+1
τ
k−1(r
k−r
k+1) d
Tk
τ
k−1(r
k−r
k+1)
= −r
T
k+1
r
k+r
Tk+1r
k+1d
Tk
r
k= r
T k+1
r
k+1r
Tk
r
k. The latter identity uses the result
r T k+1 r k = 0
(residual at step k + 1 is orthogonal to residual at step k, exactly as in the Gradient method) which is not obvious:
r T
k+1 r k = r T k+1 (d k − β k−1 d k−1 ) = −β k−1 r T
k+1 d k−1
= −β k−1 (r k − τ k Ad k ) T d k−1 = β k−1 τ k d T
k Ad k−1 = 0.
First main result: If r 0 , r 1 , . . . , r p are non null, then
d T l Ad j = 0, r T l r j = 0, 0 ≤ j < l ≤ p.
That is, each new residual (search direction) is orthogonal (conjugate) to all previous residuals (search directions). As a consequence, the residual r m must be null for some m ≤ n, or, equivalently, CG finds the solution of Ax = b in at most n steps.
Proof. . . .
A useful representation of the residuals. If r 0 , r 1 , . . . , r k−1 are non null, then there exist polynomials s k (λ), q k (λ) such that
r k = s k (A)r 0 , d k = q k (A)r 0 ,
s k (λ) = (−1) k τ 0 τ 1 · · · τ k−1 λ k + . . . + 1, τ 0 τ 1 · · · τ k−1 6= 0.
Proof (by induction). The equality r 0 = s 0 (A)r 0 holds if s 0 (λ) = 1; d 0 = q 0 (A)r 0 holds if q 0 (λ) = 1. Moreover,
r k+1 = r k − τ k Ad k = s k (A)r 0 − τ k Aq k (A)r 0 = s k+1 (A)r 0
if s k+1 (λ) = s k (λ) − τ k λq k (λ), and
d k+1 = r k+1 + β k d k = s k+1 (A)r 0 + β k q k (A)r 0 = q k+1 (A)r 0
if q k+1 (λ) = s k+1 (λ) + β k q k (λ). Finally, since
s k+1 (λ) = s k (λ) − τ k λ(s k (λ) + β k−1 q k−1 (λ)),
the coefficient of λ k+1 in s k+1 (λ) is −τ k times the coefficient of λ k in s k (λ).
Thus, by the inductive assumption, it must be (−1) k+1 τ 0 τ 1 · · · τ k−1 τ k . Also, the coefficient of λ 0 in s k+1 (λ) is equal to the coefficient of λ 0 in s k (λ), which is 1 by the inductive assumption.
Second main result: r k = 0 for some k ≤ #{distinct eigenvalues of A}.
Proof. Let µ 1 , µ 2 , . . . , µ m be the distinct eigenvalues of A (m ≤ n = order of A). Assume that CG requires more than m steps to converge. So, the vectors r 0 , r 1 , . . . , r m are non null, and, by the First main result, orthogonal (⇒ linearly independent). Let V be an orthonormal matrix whose columns are eigenvectors of A, thus V T = V −1 and AV = V D for D diagonal with the eigenvalues of A as diagonal entries. Observe that there is a degree-m polynomial which is null in A,
m
Y
j=1
(A−µ j I) =
m
Y
j=1
(V DV T −µ j I) =
m
Y
j=1
V (D−µ j I)V T == V
m
Y
j=1
(D−µ j I)V T = 0.
As a consequence the matrices A 0 = I, A, . . ., A m are linearly dependent. But this implies that the dimension of the space
S m+1 (r 0 ) = Span {r 0 , Ar 0 , . . . , A m r 0 } = Span {r 0 , r 1 , . . . , r m }
is smaller than m + 1, which is absurd. It follows that one of the vectors r i , i = 0, . . . , m, must be null.
Let Π 1 k be the set of all polynomials of degree exactly k whose graphic
pass through (0, 1). We now see that the polynomial s k (λ) in the expression
r k = s k (A)r 0 is a very particular polynomial in the class Π 1 k : it makes the norm
of the vector p k (A)r 0 , p k ∈ Π 1 k , minimum (for a suitable choice of the norm).
This result let us give estimates of the rate of convergence of CG, as precise as good is the knowledge about the location of the eigenvalues of A. For example, if it is known that the eigenvalues of A cluster around 1, then CG must converge with a superlinear rate of convergence (see toe 1a).
Notice that r k = s k (A)r 0 = r 0 + ˆ h k , for a particular vector ˆ h k in the space M = Span {Ar 0 , A 2 r 0 , . . . , A k r 0 }. Take a generic vector h k in this space. Then
kr 0 + h k k 2 A
−1= kr 0 + ˆ h k + h k − ˆ h k k 2 A
−1= kr 0 + ˆ h k k 2 A
−1+ kh k − ˆh k k 2 A
−1+ 2(r 0 + ˆ h k , h k − ˆ h k ) A
−1. Now observe that the latter inner product is null, in fact, for j = 0, . . . , k − 1, 0 = r T k r j = r T k A −1 Ar j = (r k , Ar j ) A
−1, that is, r k is A −1 -orthogonal to the space Span {Ar 0 , Ar 1 , . . . , Ar k−1 }, but this space is exactly M. The thesis follows since h k − ˆ h k ∈ M. So we have:
kr 0 + h k k 2 A
−1= kr 0 + ˆ h k k 2 A
−1+ kh k − ˆh k k 2 A
−1≥ kr 0 + ˆ h k k 2 A
−1. In other words,
kr k k 2 A
−1= kr 0 + ˆ h k k 2 A
−1= min{kr 0 + h k k 2 A
−1: h k ∈ M}
= min{kp k (A)r 0 k 2 A
−1: p k ∈ Π 1 k }. (m) Comparison with GMRES. Notice that for any h k ∈ M we have
r 0 + h k = b − A(x 0 + z), z = −A −1 h k ∈ S k (r 0 ) = Span {r 0 , Ar 0 , . . . , A k−1 r 0 }.
Thus, the vector x k generated by the CG method is of type x 0 + ˆ z where ˆ z solves the problem
kb − A(x 0 + ˆ z )k A
−1= min{kb − A(x 0 + z)k A
−1: z ∈ S k (r 0 )} (p) (S k (r 0 ) is known as Krilov space). GMRES is a method able to solve Ax = b in at most n steps under the only assumption det(A) 6= 0. (Like CG, GMRES in order to be competitive must be used as an iterative method, i.e. less than n steps must be sufficient to give a good approximation of x). In the k-th step of GMRES it is defined a vector x k of type x 0 + ˆ z where ˆ z solves exactly the problem (p) but the norm involved is the euclidean one. So, CG is a minimal residual algorithm different from GMRES| A pd .
It is easy to see that the condition (m) can be rewritten as follows:
kx − x k k 2 A = min
p
k∈Π
1kkp k (A)(x − x 0 )k 2 A .
Now we give a bound for the quantity kp k (A)(x − x 0 )k 2 A , p k ∈ Π 1 k , which can be evaluated if (besides A, b) also some information about the location of the eigenvalues λ i of A is given. Let v i 6= 0 be such that Av i = λ i v i , v T i v j = δ ij . Then
kp k (A)(x − x 0 )k 2 A = (x − x 0 ) T Ap k (A) 2 (x − x 0 ) = (P α i v i ) T P α i Ap k (A) 2 v i
= (P α i v i ) T P α i Ap k (λ i ) 2 v i = (P α i v i ) T P α i λ i p k (λ i ) 2 v i
= P α 2 i λ i p k (λ i ) 2 ≤ max i |p k (λ i )| 2 kx − x 0 k 2 A .
So, we obtain the following
Third main result: If x k is the k-th vector generated by CG when applied to solve the pd linear system Ax = b, then
kx − x k k 2 A = min
p
k∈Π
1kkp k (A)(x − x 0 )k 2 A ≤ max i |p k (λ i )| 2 kx − x 0 k 2 A , ∀ p k ∈ Π 1 k .
So, if S ⊂ R, p k ∈ Π 1 k , M k ∈ R are known such that λ i ∈ S ∀ i and |p k (λ)| ≤ M k
∀ λ ∈ S, then kx − x k k A ≤ M k kx − x 0 k A .
Let us see two applications of the latter result. As consequences of the first application we observe that CG (considered as an iterative method) has a linear rate of convergence, is in general faster than G, and is competitive (f.i. with direct methods) if λ max and λ min are comparable. However, as a consequence of the second application, the latter condition is not necessary: the rate of convergence of CG remains high (so, CG remains competitive) if most of the eigenvalues are in [λ min , ˆ λ] with λ min and ˆ λ comparable. Further useful applications of the Third main result hold. In particular, as a consequence of one of these (see toe 1a), it can be stated that CG has a superlinear rate of convergence if most of the eigenvalues of A are in the interval S = [1 − ε, 1 + ε]
(. . .).
(1)
S = [λ min , λ max ], p k (x) = T k
λ
max+λ
min−2x λ
max−λ
minT k
λ
max+λ
minλ
max−λ
min⇒
kx − x k k A < 2
p µ 2 (A) − 1 pµ 2 (A) + 1
! k
kx − x 0 k A , µ 2 (A) = λ max
λ min . (2)
S = [λ min , ˆ λ] ∪ {λ i : λ i > ˆ λ}, r ˆ λ = #{i : λ i > ˆ λ},
p k (x) = Π i: λ
i>ˆ λ
1 − x
λ i
T k−r
λˆˆ
λ+λ
min−2x λ−λ ˆ
minT k−r
λˆˆ
λ+λ
minλ−λ ˆ
min⇒
kx − x k k A < 2
q λ/λ ˆ min − 1 q λ/λ ˆ min + 1
k−r
ˆλkx − x 0 k A , k ≥ r λ ˆ .
The applications (1) and (2) of the Third main result suggest an idea. When λ min and λ max are not comparable and the eigenvalues of A are uniformly dis- tributed in the interval [λ min , λ max ] (in this case all n steps of CG are required in order to give a good approximation of x), replace the given system Ax = b with an equivalent system ˜ A˜ x = ˜ b , ˜ A = E −1 AE −T , ˜ x = E T x , ˜ b = E −1 b , det(E) 6= 0, where the matrix E is such that µ 2 ( ˜ A) < µ 2 (A) and has one of the following properties
• µ 2 ( ˜ A) << µ 2 (A)
• ˜ A has much less distinct eigenvalues than A
• ˜ A has the eigenvalues much more clustered (around 1) than A Then apply CG to ˜ A˜ x = ˜ b .
If such matrix E can be found, then the pd matrix P = EE T is said pre- conditioner.
Note that E −T AE ˜ T = P −1 A, so one could look directly for a pd matrix P such that the (real positive) eigenvalues of P −1 A have the required properties.
For example, in order to obtain something of type P −1 A ≈ I (which would result in a very high increase of the CG rate of convergence) one could choose P as an approximation A of A. We shall see that applying CG to ˜ A˜ x = ˜ b requires, for each step, a surplus of computation: solve a system of type P z = h k . This computation must not make CG slow, in other words P must be a lower complexiy matrix than A. Also notice that E 1 and E 2 , E 1 6= E 2 , E 1 E 1 T = E 2 E 2 T , define matrices ˜ A 1 = E 1 −1 AE 1 −T and ˜ A 2 = E 2 −1 AE 2 −T , ˜ A 1 6= ˜ A 2 , with the same spectrum. For this reason one prefers to call preconditioner P instead of E.
A final remark. The vector x = A −1 b we are looking for is also the minimum point of the function F (z) = 1 2 z T Az − z T b . Analogously, ˜ x = ˜ A −1 b ˜ is the minimum point of the function ˜ F (z) = 1 2 z T Az − z ˜ T b ˜ . The preconditioning technique replaces the (sections of the) contours of F with the more spherical (sections of the) contours of ˜ F , and this results in a more efficient minimization when using gradient-type methods.
Let us write the preconditioned version of the CG algorithm, well defined once that A, b and the preconditioner P are given.
Let us apply CG to the system ˜ A˜ x = ˜ b :
˜
x 0 ∈ R n , ˜r 0 = ˜ b − ˜ A˜ x 0 , ˜ d 0 = ˜r 0 . For k = 0, 1, . . . , {
˜ τ k = ˜ r
T k
˜ r
kd ˜
TkA˜ ˜ d
k˜
x k+1 = ˜ x k + ˜ τ k d ˜ k
˜r k+1 = ˜ b − ˜ A˜ x k+1 = ˜r k − ˜τ k A˜ ˜ d k β ˜ k = ˜ r
T k+1
˜ r
k+1˜ r
Tk