Adaptive Matrix Algebras In Unconstrained Optimization
S. Cipolla, C. Di Fiore
University of Rome “Tor Vergata”
Department of Mathematics
4th International Conference on Matrix Methods in Mathematics and Applications, 24-28 August 2015, Moscow
The Problem
“In 1963 I attended a meeting at Imperial College, London, where most of the participants agreed that the general algorithms of that time for nonlinear optimization calculations were unlikely to be successful if there were more than 10 variables, unless one had an approximation to the solution in the region of convergence of Newton’s method. However, because I had studied the report of Davidon that presented the first variable metric algorithm, I already had a computer program that would calculate least values of functions of up to 100 variables using only function values and first derivatives.”
M. J. D. Powell
$f:\mathbb{R}^n\to\mathbb{R}$ lower bounded; find $x^*$ such that $f(x^*)=\min_{x\in\mathbb{R}^n}f(x)$.
Matrix Algebras [1980-2001]
Let $U$ be a unitary matrix and define
$$\mathcal{L}=\{Ud(z)U^H:\ z\in\mathbb{C}^n\}=\operatorname{sd}U,\qquad d(z)=\operatorname{diag}(z_1,\dots,z_n).$$
Given $A\in M_n(\mathbb{C})$, define
$$\mathcal{L}_A=\arg\min_{X\in\mathcal{L}}\|X-A\|_F,\qquad \|A\|_F^2=\sum_{r,t=1}^n a_{rt}\,\overline{a_{rt}}.$$
Properties of $\mathcal{L}_A$
- $\mathcal{L}_A$ is well defined, because $\mathcal{L}$ is a closed subspace of $\mathbb{C}^{n\times n}$ (Hilbert projection theorem);
- $\mathcal{L}_A=Ud(z_A)U^H$, where $[z_A]_i=[U^HAU]_{ii}$, $i=1,\dots,n$;
- $A\in\mathbb{R}^{n\times n}$, $U\in\mathbb{R}^{n\times n}$ ($U^H=U^T$) $\Rightarrow$ $\mathcal{L}_A\in\mathbb{R}^{n\times n}$;
- $A$ S.P.D. (real symmetric positive definite), $U\in\mathbb{R}^{n\times n}$ ($U^H=U^T$) $\Rightarrow$ $\mathcal{L}_A$ S.P.D.;
- $\operatorname{tr}\mathcal{L}_A=\sum_i[z_A]_i=\operatorname{tr}A$;
- $\det\mathcal{L}_A=\prod_i[z_A]_i\ge\det A$.
Let $\chi(M)$ be the number of FLOPs sufficient to perform the matrix-vector product $Mx$, $x\in\mathbb{C}^n$.
If $L\in\mathcal{L}=\operatorname{sd}U$, then $\chi(L)=\chi(U^H)+\chi(U)+n$.
$\chi(U)=O(n)\ \Longrightarrow\ \chi(L)=O(n)$ for all $L\in\mathcal{L}$.
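As an illustration (a sketch added here, not from the talk), the projection $\mathcal{L}_A$ and its fast matvec fit in a few lines of NumPy; here $U$ is taken to be $F^H$, with $F$ the unitary DFT matrix, so that products with $U$ and $U^H$ cost $O(n\log n)$ via the FFT:

import numpy as np

def project_onto_sdU(A):
    """z_A = diag(U^H A U) for U = F^H; U is formed explicitly
    only to keep the sketch short."""
    n = A.shape[0]
    U = np.fft.ifft(np.eye(n), axis=0, norm="ortho")   # columns U e_j = F^H e_j
    return np.diag(U.conj().T @ A @ U)

def LA_matvec(zA, x):
    """O(n log n) product L_A x = U d(z_A) U^H x via two FFTs."""
    return np.fft.ifft(zA * np.fft.fft(x, norm="ortho"), norm="ortho")

rng = np.random.default_rng(0)
M = rng.standard_normal((8, 8))
A = M @ M.T + 8 * np.eye(8)                  # S.P.D. test matrix
zA = project_onto_sdU(A)
print(np.isclose(zA.sum(), np.trace(A)))     # tr L_A = tr A
print(np.prod(zA).real >= np.linalg.det(A))  # det L_A >= det A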
Generalized quasi-Newton: Algorithm Structure
Algorithm 0.1: Generalized Quasi-Newton [2003]
Data: $x_0\in\mathbb{R}^n$; $g_0=\nabla f(x_0)$; $B_0$ S.P.D.; $d_0\in\mathbb{R}^n$, $d_0^Tg_0<0$;
1 for $k=0,1,\dots$ do
2   $x_{k+1}=x_k+\lambda_kd_k$;
3   $s_k=x_{k+1}-x_k$;
4   $y_k=g_{k+1}-g_k$;
5   $B_{k+1}=\Phi(\tilde B_k,s_k,y_k)$;
6   define $\tilde B_{k+1}$ S.P.D. $\Rightarrow$ $d_{k+1}=-\tilde B_{k+1}^{-1}g_{k+1}$ (NS);
    $d_{k+1}=-B_{k+1}^{-1}g_{k+1}$ $\Rightarrow$ define $\tilde B_{k+1}$ S.P.D. (S);
$$\Phi(\tilde B_k,s_k,y_k)=\tilde B_k+\frac{1}{y_k^Ts_k}\,y_ky_k^T-\frac{1}{s_k^T\tilde B_ks_k}\,\tilde B_ks_ks_k^T\tilde B_k$$
is the generalized BFGS-type updating formula.
Remarks
- $\tilde B_k$ is an S.P.D. approximation of $B_k$; if $\tilde B_k=B_k$ for all $k=0,1,\dots$ we obtain the classical BFGS method;
- the NS and S algorithms generate COMPLETELY DIFFERENT sequences $\{x_k\}_{k\in\mathbb{N}}$, $\{g_k\}_{k\in\mathbb{N}}$, $\{B_k\}_{k\in\mathbb{N}}$!
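A direct transcription of the update $\Phi$ (an illustrative sketch, not code from the talk):

import numpy as np

def bfgs_type_update(B_tilde, s, y):
    """Generalized BFGS-type update Phi(B~_k, s_k, y_k).
    B_tilde is an S.P.D. approximation of B_k; the curvature
    condition s^T y > 0 keeps the result S.P.D."""
    sy = y @ s                    # y_k^T s_k
    Bs = B_tilde @ s              # B~_k s_k
    return B_tilde + np.outer(y, y) / sy - np.outer(Bs, Bs) / (s @ Bs)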
Generalized quasi-Newton: Properties
Algorithm 0.2: Generalized q-N
Data: $x_0\in\mathbb{R}^n$; $g_0=\nabla f(x_0)$; $B_0$ S.P.D.; $d_0\in\mathbb{R}^n$, $d_0^Tg_0<0$;
1 for $k=0,1,\dots$ do
2   $x_{k+1}=x_k+\lambda_kd_k$;
3   $s_k=x_{k+1}-x_k$;
4   $y_k=g_{k+1}-g_k$;
5   $B_{k+1}=\Phi(\tilde B_k,s_k,y_k)$;
6   $d_{k+1}=-\tilde B_{k+1}^{-1}g_{k+1}$ (NS) or $d_{k+1}=-B_{k+1}^{-1}g_{k+1}$ (S);
Properties
- $\Phi(\tilde B_k,s_k,y_k)s_k=y_k$, hence
  $B_{k+1}s_k=y_k$ → Secant algorithm (S);
  $\tilde B_{k+1}s_k\neq y_k$ → Non-Secant algorithm (NS);
- if $g_k^Td_k<0$ and $\lambda_k$ satisfies the Wolfe conditions ($0<c_1<c_2<1$)
  $$f(x_{k+1})\le f(x_k)+c_1\lambda_kg_k^Td_k,\qquad \nabla f(x_k+\lambda_kd_k)^Td_k\ge c_2\,g_k^Td_k,$$
  then $s_k^Ty_k>0$ and $f(x_{k+1})<f(x_k)$;
- $s_k^Ty_k>0$ and $\tilde B_k$ S.P.D. $\Rightarrow$ $\Phi(\tilde B_k,s_k,y_k)$ S.P.D.;
- $\tilde B_{k+1}$ S.P.D. $\Rightarrow$ $g_{k+1}^Td_{k+1}<0$ (NS);
- $B_{k+1}=\Phi(\tilde B_k,s_k,y_k)$ S.P.D. $\Rightarrow$ $g_{k+1}^Td_{k+1}<0$ (S).
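In practice, $\lambda_k$ can be produced by any line search enforcing the two conditions above; as a sketch (illustrative test problem, not from the talk), SciPy's strong-Wolfe line search:

import numpy as np
from scipy.optimize import line_search

A = np.diag([1.0, 10.0])            # hypothetical quadratic test problem
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

x = np.array([1.0, 1.0])
d = -grad(x)                        # a descent direction: g^T d < 0
# line_search returns a step satisfying the strong Wolfe conditions
# with parameters 0 < c1 < c2 < 1.
lam = line_search(f, grad, x, d, c1=1e-4, c2=0.9)[0]
s = lam * d
y = grad(x + s) - grad(x)
print(s @ y > 0, f(x + s) < f(x))   # curvature condition and descent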
Generalized quasi-Newton: Complexity
Algorithm 0.3: Generalized q-N
Data: $x_0\in\mathbb{R}^n$; $g_0=\nabla f(x_0)$; $B_0$ S.P.D.; $d_0\in\mathbb{R}^n$, $d_0^Tg_0<0$;
1 for $k=0,1,\dots$ do
2   $x_{k+1}=x_k+\lambda_kd_k$;
3   $s_k=x_{k+1}-x_k$;
4   $y_k=g_{k+1}-g_k$;
5   $B_{k+1}=\Phi(\tilde B_k,s_k,y_k)$;
6   $d_{k+1}=-\tilde B_{k+1}^{-1}g_{k+1}$ (NS) or $d_{k+1}=-B_{k+1}^{-1}g_{k+1}$ (S);
$$B_{k+1}^{-1}=\Psi(\tilde B_k^{-1},s_k,y_k)=\Big(I-\frac{y_ks_k^T}{y_k^Ts_k}\Big)^T\tilde B_k^{-1}\Big(I-\frac{y_ks_k^T}{y_k^Ts_k}\Big)+\frac{s_ks_k^T}{y_k^Ts_k}.$$
Complexity
- if $\tilde B_k=B_k$ for all $k=0,1,\dots$ we obtain BFGS, whose complexity is:
  O(n²) FLOPs per step;
  O(n²) memory allocations;
- if $\tilde B_k\neq B_k$, the algorithm's complexity is:
  time complexity per step:
  - the number of FLOPs sufficient to compute $\tilde B_k^{-1}$, where $\tilde B_k$ is an approximation of $B_k$;
  - the number of FLOPs sufficient to multiply $\tilde B_k^{-1}$ by a vector;
  - O(n) further FLOPs;
  space complexity:
  - the number of memory allocations sufficient to store $\tilde B_k^{-1}$;
  - O(n) further memory allocations.
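The O(n) overhead per step becomes apparent if $\Psi(\tilde B_k^{-1},s_k,y_k)$ is applied to a vector through a matvec oracle for $\tilde B_k^{-1}$; a minimal sketch (the oracle `Hmv`, assumed given, computes $\tilde B_k^{-1}v$):

import numpy as np

def psi_matvec(Hmv, s, y, v):
    """B_{k+1}^{-1} v = Psi(B~^{-1}, s, y) v with one oracle call
    Hmv(x) = B~^{-1} x plus O(n) extra FLOPs."""
    rho = 1.0 / (y @ s)
    w = v - rho * y * (s @ v)        # (I - rho y s^T) v
    Hw = Hmv(w)                      # the only oracle call
    u = Hw - rho * s * (y @ Hw)      # (I - rho s y^T) B~^{-1} w
    return u + rho * s * (s @ v)     # + rho s s^T v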
Global convergence of generalized $QN_{NS}$
Algorithm 0.4: $QN_{NS}$
Data: $x_0\in\mathbb{R}^n$; $g_0=\nabla f(x_0)$; $B_0$ S.P.D.; $d_0\in\mathbb{R}^n$, $d_0^Tg_0<0$;
1 for $k=0,1,\dots$ do
2   $x_{k+1}=x_k+\lambda_kd_k$;
3   $s_k=x_{k+1}-x_k$;
4   $y_k=g_{k+1}-g_k$;
5   $B_{k+1}=\Phi(\tilde B_k,s_k,y_k)$;
6   $d_{k+1}=-\tilde B_{k+1}^{-1}g_{k+1}$;
Non-Secant Global Convergence [2003]
If $\tilde B_k$ is such that
$$\operatorname{tr}B_k\ge\operatorname{tr}\tilde B_k,\qquad \det B_k\le\det\tilde B_k\qquad (1)$$
and there exists $M>0$ such that
$$\frac{\|y_k\|^2}{y_k^Ts_k}\le M,\qquad (2)$$
then $\liminf\|g_k\|=0$.
NOTE 1: (1) is verified if $\tilde B_k=\mathcal{L}_{B_k}$ for some $\mathcal{L}=\operatorname{sd}U$.
NOTE 2: (2) is verified if $f$ is convex.
Global convergence of $L^{(k)}QN_{NS}$ (“pure projections”)
Algorithm 0.5: $L^{(k)}QN_{NS}$
Data: $x_0\in\mathbb{R}^n$; $g_0=\nabla f(x_0)$; $\mathcal{L}^{(0)}$; $B_0$ S.P.D.; $d_0\in\mathbb{R}^n$, $d_0^Tg_0<0$;
1 for $k=0,1,\dots$ do
2   $x_{k+1}=x_k+\lambda_kd_k$;
3   $s_k=x_{k+1}-x_k$;
4   $y_k=g_{k+1}-g_k$;
5   $B_{k+1}=\Phi(\mathcal{L}^{(k)}_{B_k},s_k,y_k)$;
6   define $\mathcal{L}^{(k+1)}$;
7   $d_{k+1}=-[\mathcal{L}^{(k+1)}_{B_{k+1}}]^{-1}g_{k+1}$;
For any $\mathcal{L}^{(k+1)}=\{U_{k+1}d(z)U_{k+1}^H:\ z\in\mathbb{C}^n\}$ it holds that
$$\operatorname{tr}B_{k+1}=\operatorname{tr}\mathcal{L}^{(k+1)}_{B_{k+1}},\qquad \det B_{k+1}\le\det\mathcal{L}^{(k+1)}_{B_{k+1}}.$$
Remark
The choice $\mathcal{L}^{(k)}\equiv\mathcal{L}$ for $k=0,1,\dots$ is allowed! ($LQN_{NS}$)
$L^{(k)}QN_{NS}$ and $LQN_{NS}$ are CONVERGENT but NOT EFFICIENT [2003, 2015].
Global convergence of generalized $QN_S$
Algorithm 0.6: $QN_S$
Data: $x_0\in\mathbb{R}^n$; $g_0=\nabla f(x_0)$; $B_0$ S.P.D.; $d_0\in\mathbb{R}^n$, $d_0^Tg_0<0$;
1 for $k=0,1,\dots$ do
2   $x_{k+1}=x_k+\lambda_kd_k$;
3   $s_k=x_{k+1}-x_k$;
4   $y_k=g_{k+1}-g_k$;
5   $B_{k+1}=\Phi(\tilde B_k,s_k,y_k)$;
6   $d_{k+1}=-B_{k+1}^{-1}g_{k+1}$;
Secant Global Convergence [2015]
If $\tilde B_k$ is such that
$$\operatorname{tr}B_k\ge\operatorname{tr}\tilde B_k,\qquad \det B_k\le\det\tilde B_k,\qquad \frac{\|B_ks_k\|^2}{(s_k^TB_ks_k)^2}\le\frac{\|\tilde B_ks_k\|^2}{(s_k^T\tilde B_ks_k)^2}\quad (*)$$
and there exists $M>0$ such that $\frac{\|y_k\|^2}{y_k^Ts_k}\le M$, then $\liminf\|g_k\|=0$.
$$B_ks_k=\sigma\tilde B_ks_k\ \Longleftrightarrow\ -\tilde B_k^{-1}g_k=\sigma d_k\ \Longrightarrow\ (*)$$
$\sigma=\,?$
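A one-line check, added for clarity: substituting $B_ks_k=\sigma\tilde B_ks_k$ into $(*)$ gives
$$\frac{\|B_ks_k\|^2}{(s_k^TB_ks_k)^2}=\frac{\sigma^2\|\tilde B_ks_k\|^2}{(\sigma\,s_k^T\tilde B_ks_k)^2}=\frac{\|\tilde B_ks_k\|^2}{(s_k^T\tilde B_ks_k)^2},$$
so $(*)$ holds (with equality) for any $\sigma>0$.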
One-step analysis of the self-correction properties of generalized $QN_S$ [1987-2015]
Assume
$$m\|z\|^2\le z^TG(x)z\le M\|z\|^2\qquad \forall\,x\in\{x\in\mathbb{R}^n:\ f(x)\le f(x_0)\}.$$
Then
$$\bar Gs_k=y_k,\qquad \bar G=\int_0^1G(x_k+\tau s_k)\,d\tau.$$
For $B_{k+1}=\Phi(B_k,s_k,y_k)$:
$$\operatorname{tr}B_{k+1}=\operatorname{tr}B_k-\frac{\|B_ks_k\|^2}{s_k^TB_ks_k}+\frac{\|y_k\|^2}{y_k^Ts_k},\qquad \det B_{k+1}=\det B_k\,\frac{s_k^T\bar Gs_k}{s_k^TB_ks_k}.$$
For $B_{k+1}=\Phi(\tilde B_k,s_k,y_k)$:
$$\operatorname{tr}B_{k+1}=\operatorname{tr}\tilde B_k-\frac{\|\tilde B_ks_k\|^2}{s_k^T\tilde B_ks_k}+\frac{\|y_k\|^2}{y_k^Ts_k},\qquad \det B_{k+1}=\det\tilde B_k\,\frac{s_k^T\bar Gs_k}{s_k^T\tilde B_ks_k}.$$
Hence, if $B_ks_k=\sigma\tilde B_ks_k$, $\operatorname{tr}\tilde B_k=\operatorname{tr}B_k$, $\det\tilde B_k\ge\det B_k$:
$$\operatorname{tr}B_{k+1}=\operatorname{tr}B_k-\frac{1}{\sigma}\,\frac{\|B_ks_k\|^2}{s_k^TB_ks_k}+\frac{\|y_k\|^2}{y_k^Ts_k},\qquad \det B_{k+1}=\sigma\det\tilde B_k\,\frac{s_k^T\bar Gs_k}{s_k^TB_ks_k}.$$
$\sigma=1$ $\Rightarrow$ self-correction properties analogous to BFGS!
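For completeness, the trace identities follow directly from the form of $\Phi$, using $\operatorname{tr}(uv^T)=v^Tu$ and the symmetry of $\tilde B_k$:
$$\operatorname{tr}B_{k+1}=\operatorname{tr}\tilde B_k+\operatorname{tr}\frac{y_ky_k^T}{y_k^Ts_k}-\operatorname{tr}\frac{\tilde B_ks_ks_k^T\tilde B_k}{s_k^T\tilde B_ks_k}=\operatorname{tr}\tilde B_k+\frac{\|y_k\|^2}{y_k^Ts_k}-\frac{\|\tilde B_ks_k\|^2}{s_k^T\tilde B_ks_k}.$$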
Global convergence of $L^{(k)}QN_S$, $\tilde B_k=$ “pure” projection
At each step impose
$$\frac{\|B_ks_k\|^2}{(s_k^TB_ks_k)^2}=\frac{\|\tilde B_ks_k\|^2}{(s_k^T\tilde B_ks_k)^2}.$$
Algorithm 0.7: $L^{(k)}QN_S$
Data: $x_0\in\mathbb{R}^n$; $g_0=\nabla f(x_0)$; $\mathcal{L}^{(0)}=\operatorname{sd}U_0$; $B_0$ S.P.D.; $d_0\in\mathbb{R}^n$, $d_0^Tg_0<0$;
1 for $k=0,1,\dots$ do
2   $x_{k+1}=x_k+\lambda_kd_k$;
3   $s_k=x_{k+1}-x_k$;
4   $y_k=g_{k+1}-g_k$;
5   $B_{k+1}=\Phi(\mathcal{L}^{(k)}_{B_k},s_k,y_k)$;
6   $d_{k+1}=-B_{k+1}^{-1}g_{k+1}$;
7   choose $\mathcal{L}^{(k+1)}$;
Choosing $\mathcal{L}^{(k+1)}$ → Totally Non-Linear Problem
To guarantee convergence: find $U_{k+1}$ such that
$$B_{k+1}s_{k+1}=\sigma\,\mathcal{L}^{(k+1)}_{B_{k+1}}s_{k+1},$$
where $\mathcal{L}^{(k+1)}_{B_{k+1}}=U_{k+1}d(z_{k+1})U_{k+1}^T$ and $[z_{k+1}]_i=[U_{k+1}^TB_{k+1}U_{k+1}]_{ii}>0$, $i=1,\dots,n$.
Global convergence of $L^{(k)}QN_S$, $\tilde B_k=$ “hybrid” projection
The matrix $\tilde B_{k+1}$, an approximation of $B_{k+1}=\Phi(\tilde B_k,s_k,y_k)$, must be S.P.D. and in $\mathcal{L}^{(k+1)}$, i.e. it must have the structure
$$\tilde B_{k+1}=U_{k+1}d(z_{k+1})U_{k+1}^T,\qquad U_{k+1}\ \text{unitary},\quad z_{k+1}>0,\quad \chi(U_{k+1})\ll n^2.$$
Which structure should $\mathcal{L}^{(k+1)}=\operatorname{sd}U_{k+1}$ have, and which spectrum $z_{k+1}$ should $\tilde B_{k+1}$ have, in order to guarantee convergence?
Algorithm 0.8: Hybrid $L^{(k)}QN_S$
Data: ...
1 for $k=0,1,\dots$ do
2   $x_{k+1}=x_k+\lambda_kd_k$;
3   $s_k=x_{k+1}-x_k$;
4   $y_k=g_{k+1}-g_k$;
5   $B_{k+1}=\Phi(\tilde B_k,s_k,y_k)$;
6   $d_{k+1}=-B_{k+1}^{-1}g_{k+1}$;
7   define $z_{k+1}>0$;
8   choose $\mathcal{L}^{(k+1)}$ (i.e. choose $U_{k+1}$);
9   define $\tilde B_{k+1}=U_{k+1}d(z_{k+1})U_{k+1}^T$;
Partially Non-Linear Problem
To guarantee convergence it is sufficient to solve:
Given $z=z_{k+1}>0$ such that $\det B_{k+1}\le\prod_i z_i$ and $\operatorname{tr}B_{k+1}\ge\sum_i z_i$,
find $U_{k+1}$ unitary such that $B_{k+1}s_{k+1}=\sigma\,U_{k+1}d(z)U_{k+1}^Ts_{k+1}$.
EXAMPLE: try $[z_{k+1}]_i=[U_k^TB_{k+1}U_k]_{ii}$, i.e. $z_{k+1}=\lambda(\mathcal{L}^{(k)}_{B_{k+1}})$.
Existence of the solution for PNLP (using σ as a parameter) [2015]
Given $z>0$, there exist $U_{k+1}$ unitary and $\sigma_{k+1}>0$ such that
$$B_{k+1}s_{k+1}=\sigma_{k+1}U_{k+1}d(z)U_{k+1}^Ts_{k+1}$$
if and only if the following Kantorovich condition holds:
$$\frac{4z_mz_M}{(z_m+z_M)^2}\le\frac{(s_{k+1}^T(-g_{k+1}))^2}{\|s_{k+1}\|^2\,\|g_{k+1}\|^2}.$$
“⇐”: starting from $z>0$ such that the hypotheses are fulfilled, we build explicitly
$$U_{k+1}=dHc(z)=H(w_{k+1})H(v_{k+1}),$$
where $H(x)$ is the Householder matrix $I-\frac{2}{\|x\|^2}xx^T$. Let us denote the corresponding algebra $\mathcal{L}^{(k+1)}=\operatorname{sd}U_{k+1}=:[2Ho]^{(k+1)}$.
Observe that $\chi(H(x))=O(n)$ for all $x\in\mathbb{R}^n$ $\Longrightarrow$ $\chi(L)=O(n)$ for all $L\in[2Ho]^{(k+1)}$.
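A small sketch of why matrices in $[2Ho]^{(k+1)}$ admit O(n) matvecs (here `w`, `v`, `z` are assumed to come from the construction above):

import numpy as np

def householder_matvec(x, u):
    """H(u) x = x - 2 (u^T x / u^T u) u, in O(n) FLOPs."""
    return x - 2.0 * (u @ x) / (u @ u) * u

def two_ho_matvec(x, w, v, z):
    """L x for L = U d(z) U^T, U = H(w) H(v): four Householder
    products and one diagonal scaling, all O(n)."""
    t = householder_matvec(householder_matvec(x, w), v)          # U^T x
    return householder_matvec(householder_matvec(z * t, v), w)   # U d(z) U^T x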
Existence of the solution for PNLP with σ = 1
Given $z>0$ such that
$$z_m<\frac{s_{k+1}^TB_{k+1}s_{k+1}}{\|s_{k+1}\|^2}<z_M,\qquad \|B_{k+1}s_{k+1}\|^2-(z_m+z_M)\,s_{k+1}^TB_{k+1}s_{k+1}+z_Mz_m\|s_{k+1}\|^2\le0,$$
and there exists $j\in\{1,\dots,n\}\setminus\{m,M\}$ such that
$$z_j\in\Big[\frac{s_{k+1}^TB_{k+1}s_{k+1}\,z_M-\|B_{k+1}s_{k+1}\|^2}{z_M\|s_{k+1}\|^2-s_{k+1}^TB_{k+1}s_{k+1}},\ \frac{s_{k+1}^TB_{k+1}s_{k+1}\,z_m-\|B_{k+1}s_{k+1}\|^2}{z_m\|s_{k+1}\|^2-s_{k+1}^TB_{k+1}s_{k+1}}\Big]=:[\tilde\theta_s,\tilde\beta_s]$$
(we will write $P(z_m,z_M)=\text{True}$), then there exists a unitary $U_{k+1}$ such that $B_{k+1}s_{k+1}=U_{k+1}d(z)U_{k+1}^Ts_{k+1}$.
“⇐”: starting from $z>0$ such that $P(z_m,z_M)=T$, we build explicitly $U_{k+1}=dHc(z)=H(w_{k+1})H(v_{k+1})$, where $H(x)$ is the Householder matrix $I-\frac{2}{\|x\|^2}xx^T$. In this case too, let us denote the corresponding algebra $\mathcal{L}^{(k+1)}=\operatorname{sd}U_{k+1}=:[2Ho]^{(k+1)}$.
Observe that $\chi(H(x))=O(n)$ for all $x\in\mathbb{R}^n$ $\Longrightarrow$ $\chi(L)=O(n)$ for all $L\in[2Ho]^{(k+1)}$.
Algorithm 0.9: Hybrid $L^{(k)}QN_S$
Data: ...
1 for $k=0,1,\dots$ do
2   $x_{k+1}=x_k+\lambda_kd_k$;
3   $s_k=x_{k+1}-x_k$;
4   $y_k=g_{k+1}-g_k$;
5   $B_{k+1}=\Phi(\tilde B_k,s_k,y_k)$;
6   $d_{k+1}=-B_{k+1}^{-1}g_{k+1}$;
7   consider $[2Ho]^{(k)}_{B_{k+1}}$;
8   compute $z_{k+1}=\lambda([2Ho]^{(k)}_{B_{k+1}})$;
9   if $P([z_{k+1}]_m,[z_{k+1}]_M)=T$ then
10    $U_{k+1}=dHc(z_{k+1})$;
11    $\tilde B_{k+1}=U_{k+1}d(z_{k+1})U_{k+1}^T$;
12    $[2Ho]^{(k+1)}=\operatorname{sd}U_{k+1}$;
13  else
14    $z_{k+1}:=SC(z_{k+1})$;
15    $U_{k+1}=dHc(z_{k+1})$;
16    $\tilde B_{k+1}=U_{k+1}d(z_{k+1})U_{k+1}^T$;
17    $[2Ho]^{(k+1)}=\operatorname{sd}U_{k+1}$;
Complexity
Set $[2Ho]^{(k)}=\operatorname{sd}U_k$, where $U_k=H(w_k)H(v_k)$.
O(n) memory allocations are sufficient for the implementation.
- Line (6): O(n) FLOPs to invert a matrix in $[2Ho]^{(k)}$ and to multiply a matrix in $[2Ho]^{(k)}$ by a vector;
- Line (8): O(n) FLOPs, via a simple eigenvalue updating formula $z_k\to z_{k+1}=\lambda([2Ho]^{(k)}_{B_{k+1}})$;
- Line (10) or (15): O(n) FLOPs for the construction of $U=dHc(z)$.
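The updating formula at line (8) can be made explicit (derivation added here for completeness, from $\tilde B_k=U_kd(z_k)U_k^T$ and the form of $\Phi$): with $\hat s=U_k^Ts_k$ and $\hat y=U_k^Ty_k$,
$$[z_{k+1}]_i=\big[U_k^T\Phi(\tilde B_k,s_k,y_k)U_k\big]_{ii}=[z_k]_i+\frac{\hat y_i^{\,2}}{y_k^Ts_k}-\frac{([z_k]_i\hat s_i)^2}{\sum_j[z_k]_j\hat s_j^{\,2}},$$
which indeed costs O(n) FLOPs once $\hat s$ and $\hat y$ are available (each an O(n) product by $U_k^T$).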
If $P(z_m,z_M)=\text{False}$? What is the computational cost of $SC$?
If $P([z_{k+1}]_m,[z_{k+1}]_M)=\text{False}$? Spectral correction: the strategy
Given $z_i=(U_k^HB_{k+1}U_k)_{ii}>0$ such that $P(z_m,z_M)=F$, we need to produce a correction $\tilde z$ of $z$, $\tilde z:=SC(z)$, such that
$$P(\tilde z_m,\tilde z_M)=T;\qquad \sum_{i=1}^n\tilde z_i\le\operatorname{tr}B_{k+1};\qquad \prod_{i=1}^n\tilde z_i\ge\det B_{k+1}.$$
The last two conditions hold any time
$$\tilde z_i=(\tilde V^HB_{k+1}\tilde V)_{ii}$$
for some unitary $\tilde V$.
Spectral correction: the theoretical framework for σ = 1
Theorem
Let $B$ be an S.P.D. matrix and $s\in\mathbb{R}^n$ a given vector. Then
$$\|Bs\|^2-(\lambda_m+\lambda_M)\,s^TBs+\lambda_M\lambda_m\|s\|^2\le0.$$
Assumption ($Bx_j=\lambda_jx_j$, where the $\lambda_j$ are all simple):
$$\frac{s}{\|s\|}\neq x_j\quad\text{for all }j\in\{1,\dots,n\}.$$
Theorem
For all $s\in\mathbb{R}^n$,
$$\frac{s^TBs}{\|s\|^2}\in[\theta_s,\beta_s]:=\Big[\frac{s^TBs\,\lambda_M-\|Bs\|^2}{\lambda_M\|s\|^2-s^TBs},\ \frac{s^TBs\,\lambda_m-\|Bs\|^2}{\lambda_m\|s\|^2-s^TBs}\Big],\qquad \theta_s\le\beta_s.$$
Theorem
$\beta_s\ge\lambda_{r(m)}$ and $\theta_s\le\lambda_{l(M)}$.
Theorem
Suppose the following four eigenpairs of $B_{k+1}$ are available:
$(\lambda_m,x_m)$, $(\lambda_{r(m)},x_{r(m)})$, $(\lambda_{l(M)},x_{l(M)})$ and $(\lambda_M,x_M)$.
Then there exists a unitary $V$ such that, defining $z_i=(V^HB_{k+1}V)_{ii}$ for $i=1,\dots,n$, it holds that $P(z_m,z_M)=T$:
$$\frac{s_{k+1}^TB_{k+1}s_{k+1}}{\|s_{k+1}\|^2}\in(\lambda_m,\lambda_{r(m)})\ \Longrightarrow\ Ve_m=x_m,\ Ve_{r(m)}=x_{r(m)},\ Ve_M=x_M;$$
$$\frac{s_{k+1}^TB_{k+1}s_{k+1}}{\|s_{k+1}\|^2}\in(\lambda_{l(M)},\lambda_M)\ \Longrightarrow\ Ve_m=x_m,\ Ve_{l(M)}=x_{l(M)},\ Ve_M=x_M;$$
$$\frac{s_{k+1}^TB_{k+1}s_{k+1}}{\|s_{k+1}\|^2}\in(\lambda_{r(m)},\lambda_{l(M)})\ \Longrightarrow\ Ve_m=x_m,\ Ve_l=x_{s_{k+1}}\ (l\neq m,M),\ Ve_M=x_M,$$
where $x_{s_{k+1}}$ is such that $\|x_{s_{k+1}}\|=1$, $x_{s_{k+1}}^TB_{k+1}x_{s_{k+1}}=\frac{s_{k+1}^TB_{k+1}s_{k+1}}{\|s_{k+1}\|^2}$ and $x_{s_{k+1}}\in\langle x_{r(m)},x_{l(M)}\rangle$.
(Figure: the eigenvalues $\lambda_m\le\lambda_{r(m)}\le\lambda_{l(M)}\le\lambda_M$ on the real line.)
Algorithm 0.10: Coupled Subspace Iteration
Data: $z^0:=z$; $\theta_{s,0}=\frac{s^TBs\,z_M-\|Bs\|^2}{z_M\|s\|^2-s^TBs}$, $\beta_{s,0}=\frac{s^TBs\,z_m-\|Bs\|^2}{z_m\|s\|^2-s^TBs}$;
$S_B^{(0)}=[Ue_M,\,Ue_t]$, $S_{B^{-1}}^{(0)}=[Ue_b,\,Ue_m]$, where $t,b\neq M,m$;
$z_M^0=z_M$, $z_t^0=z_t$, $z_b^0=z_b$, $z_m^0=z_m$;
1 $i=0$;
2 while STOP CONDITION do
3   $Z_B^{(i+1)}=BS_B^{(i)}$;
4   $Z_B^{(i+1)}=:Q_B^{(i+1)}R_B^{(i+1)}$ (QR decomposition);
5   find $P_B^{(i+1)}$ such that $[Q_B^{(i+1)}P_B^{(i+1)}]^HBQ_B^{(i+1)}P_B^{(i+1)}=\operatorname{diag}(z_M^{(i+1)},z_t^{(i+1)})$;
6   $S_B^{(i+1)}:=Q_B^{(i+1)}P_B^{(i+1)}$;
7   $Z_{B^{-1}}^{(i+1)}=B^{-1}S_{B^{-1}}^{(i)}$;
8   $Z_{B^{-1}}^{(i+1)}=:Q_{B^{-1}}^{(i+1)}R_{B^{-1}}^{(i+1)}$ (QR decomposition);
9   find $P_{B^{-1}}^{(i+1)}$ such that $[Q_{B^{-1}}^{(i+1)}P_{B^{-1}}^{(i+1)}]^HB^{-1}Q_{B^{-1}}^{(i+1)}P_{B^{-1}}^{(i+1)}=\operatorname{diag}(z_b^{(i+1)},z_m^{(i+1)})$;
10  $S_{B^{-1}}^{(i+1)}:=Q_{B^{-1}}^{(i+1)}P_{B^{-1}}^{(i+1)}$;
11  $\theta_{s,i+1}=\frac{s^TBs\,z_M^{(i+1)}-\|Bs\|^2}{z_M^{(i+1)}\|s\|^2-s^TBs}$;
12  $\beta_{s,i+1}=\frac{s^TBs\,z_m^{(i+1)}-\|Bs\|^2}{z_m^{(i+1)}\|s\|^2-s^TBs}$;
13  $i:=i+1$;
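A compact NumPy sketch of the two coupled block iterations (illustrative: a dense `B` and a fixed iteration count stand in for the fast matvecs and the STOP CONDITION; `np.linalg.eigh` plays the role of the diagonalizing rotations $P$):

import numpy as np

def rayleigh_ritz_step(B, S):
    """One block power step followed by the Rayleigh-Ritz rotation:
    returns S_new with S_new^T B S_new diagonal, and its diagonal."""
    Q, _ = np.linalg.qr(B @ S)            # Z = B S, then QR
    z, P = np.linalg.eigh(Q.T @ B @ Q)    # 2x2 symmetric eigenproblem
    return Q @ P, z

def coupled_subspace_iteration(B, S_B, S_Binv, iters=20):
    Binv = np.linalg.inv(B)               # sketch only; use solves in practice
    for _ in range(iters):
        S_B, zB = rayleigh_ritz_step(B, S_B)           # -> x_M, x_l(M)
        S_Binv, zI = rayleigh_ritz_step(Binv, S_Binv)  # -> x_m, x_r(m)
    return S_B, zB, S_Binv, zI

The diagonal entries zB and zI then refresh $\theta_{s,i}$ and $\beta_{s,i}$ at lines 11-12 until the inclusion test for $[\tilde\theta_s,\tilde\beta_s]$ can be satisfied.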
Let $z_i=(U_k^HB_{k+1}U_k)_{ii}>0$ be such that $P(z_m,z_M)=F$.
Lemma
Algorithm 0.10 produces the mutually orthogonal sequences $\{v_M^i\}_i$, $\{v_m^i\}_i$, $\{v_t^i\}_i$, $\{v_b^i\}_i$ (columns of the matrices $S_B^{(i)}$, $S_{B^{-1}}^{(i)}$) and the sequences $\{z_M^i\}_i$, $\{z_m^i\}_i$, $\{z_t^i\}_i$, $\{z_b^i\}_i$, such that
$$\lim_{i\to\infty}(v_M^i,z_M^i)=(x_M,\lambda_M),\qquad \lim_{i\to\infty}(v_m^i,z_m^i)=(x_m,1/\lambda_m),$$
$$\lim_{i\to\infty}(v_t^i,z_t^i)=(x_{l(M)},\lambda_{l(M)}),\qquad \lim_{i\to\infty}(v_b^i,z_b^i)=(x_{r(m)},1/\lambda_{r(m)}).$$
[1] C. Di Fiore, P. Zellini, Matrix algebras in optimal preconditioning, Linear Algebra and its Applications, 335, 1-54 (2001).
[2] C. Di Fiore, S. Fanelli, F. Lepore, P. Zellini, Matrix algebras in Quasi-Newton methods for unconstrained minimization, Numerische Mathematik, 94, 479-500 (2003).
[3] A. Bortoletti, C. Di Fiore, S. Fanelli, P. Zellini, A new class of quasi-Newtonian methods for optimal learning in MLP-networks, IEEE Transactions on Neural Networks, 14, 263-273 (2003).
[4] S. Cipolla, C. Di Fiore, F. Tudisco, P. Zellini, Adaptive matrix algebras in unconstrained minimization, Linear Algebra and its Applications, 471, 544-568 (2015).
[5] R. H. Byrd, J. Nocedal, Y. Yuan, Global Convergence of a Class of Quasi-Newton Methods on Convex Problems, SIAM Journal on Numerical Analysis, 24(5), 1171-1190 (1987).