
Randomized methods for low-rank approximation of matrices


Academic year: 2021


Dipartimento di Matematica

Randomized methods for low-rank approximation of matrices

Master Degree Thesis

Candidate: Marilena Pandolfi

Supervisors: Prof. Leonardo Robol Prof. Dario Andrea Bini


Contents

Introduction

1 Formulation of the problem
1.1 Preliminaries
1.2 Results from linear algebra
1.3 Results from theory of probability involving Gaussian matrices

2 Randomized SVD
2.1 Introduction
2.2 Fast matrix-vector product calculation
2.3 Flat singular spectrum
2.4 Fixed-accuracy problem

3 Other low-rank approximation methods
3.1 Lanczos bidiagonalization process
3.2 Rank-revealing QR

4 Applications and numerical results
4.1 Some numerical examples
4.2 Application: quasi-Toeplitz matrices

Appendix

Bibliography


Introduction

In several applications involving large-scale computations, one has to deal with matrices of very large size whose rank is low with respect to their dimensions. Working with matrices of this kind can be computationally very expensive, and several difficulties are typically encountered.

An important issue concerns the storage of the matrix. Matrices of huge size can hardly be stored in a fast memory, say the RAM, so a mass-storage memory is needed. Mass-storage devices, such as solid state drives, are typically much slower than RAM, and fetching data on demand introduces unacceptable latencies.

Moreover, even if we were able to store a huge $n \times n$ matrix $A$ entirely in the fast memory, the number of arithmetic operations involving all the entries of $A$ is at least proportional to $n^2$, so processing the matrix would still require a huge amount of CPU time.

Therefore it is essential to find compressed representations of matrices with the features described above, given in terms of lower-dimensional matrices. Any matrix $A \in \mathbb{R}^{m\times n}$ with rank $k \ll m, n$ can be represented as a sum of $k$ rank-1 matrices, each of which is the product of a column vector and a row vector. In other words, there exist matrices $B \in \mathbb{R}^{m\times k}$ and $C \in \mathbb{R}^{k\times n}$ such that $A = BC$. Relying on this decomposition reduces the storage from $mn$ to $k(m+n)$ memory locations, and also reduces the number of arithmetic operations needed to implement elementary matrix transformations.
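As a rough quantitative illustration of the savings just described, the following Python/NumPy sketch (with arbitrary illustrative sizes $m$, $n$, $k$, not taken from the thesis) compares the dense representation of a rank-$k$ matrix with the factored form $A = BC$, both for storage and for a matrix-vector product evaluated as $B(Cx)$.

```python
import numpy as np

m, n, k = 2000, 1500, 20           # illustrative sizes with k << m, n
rng = np.random.default_rng(0)

B = rng.standard_normal((m, k))    # factor B in R^{m x k}
C = rng.standard_normal((k, n))    # factor C in R^{k x n}
A = B @ C                          # dense rank-k matrix, formed only for comparison

x = rng.standard_normal(n)

# Storage: m*n entries for A versus k*(m+n) entries for the pair (B, C).
print("dense storage   :", A.size)
print("factored storage:", B.size + C.size)

# Matrix-vector product: A @ x costs O(m*n) flops,
# while B @ (C @ x) costs O(k*(m+n)) flops and gives the same vector.
y_dense = A @ x
y_factored = B @ (C @ x)
print("max difference  :", np.max(np.abs(y_dense - y_factored)))
```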

On the other hand, the representation $A = BC$ is not available directly from the entries of $A$ and needs to be computed. This is a computational bottleneck that we are going to analyze, with the aim of finding low-rank decompositions of a given matrix $A$ that can be computed at the lowest possible computational cost. Indeed, since all the operations will be performed in floating point arithmetic, we do not need to find an exact decomposition, but we may relax the goal of our computation by also accepting approximate decompositions of $A$. That is, we are looking for low-rank decompositions of a matrix which differ from $A$ by a small multiple of the machine precision.

In the literature we can find different attempts to obtain a low-rank approximation of a matrix. In most cases, one searches for an approximation to a classical decomposition of the matrix (such as the QR decomposition or the SVD). In [18, 19] a bidiagonalization procedure based on a Lanczos method is introduced, which provides an approximate SVD of a given matrix. In [4], a rank-revealing QR decomposition is proposed whose truncation provides a low-rank factorization maintaining a certain level of accuracy in the approximation. These are examples of deterministic approaches to the problem, in contrast to the randomized approaches presented in [10] and [14]. In these papers, the idea is to find an approximation to the range of $A$ by taking advantage of the properties of random vectors. More specifically, when drawing $k$ random vectors independently from each other using a suitable distribution, there is a high probability that they are linearly independent and that they form a well-conditioned basis. This is the idea behind the randomized approach: applying $A$ to these random vectors provides an approximation of the range of $A$, and from this approximation one can construct an approximate low-rank decomposition of the matrix.

In addition, in [10] it is explained how to find an approximate decomposition of $A$ by means of deterministic methods, once an approximation of the range of $A$ has been computed with a randomized approach. At this point it is not too difficult to switch from one approximate decomposition to another (for instance, computing an approximate SVD starting from an approximate QR decomposition).

Notice that, due to the roundoff errors, the rank of a matrix computed in floating point arithmetic is likely different from the rank of the matrix computed in exact arithmetic. This is the reason why it is useful to introduce the concept of δ-rank of a matrix, that is the number of singular values of the matrix that are greater than the largest singular value multiplied by δ.

The Singular Value Decomposition of a matrix $A$ can be truncated to obtain the best low-rank approximation of $A$ with respect to both the spectral (2-) norm and the Frobenius norm.

The thesis aims at describing randomized algorithms for computing an approximate SVD at a low computational cost. Clearly, to approximate a given matrix $A$ with satisfactory efficiency, we have to make a few assumptions about the matrix. We can approximate the range of $A$ by selecting a certain number of random vectors and multiplying them by $A$. Consequently, we require that the matrix-vector product involving $A$ is cheaply computable. This is not too restrictive, as in many cases of interest the matrix $A$ has a block structure which favors a fast matrix-vector product, or it is sparse, so that the product can be computed by means of a small number of floating point operations, typically proportional to the matrix size. For the best results we also require that the singular values of $A$ decay quickly. We will show that this additional hypothesis can be removed by means of minor adjustments to the algorithm.

This work is structured in four chapters, plus an appendix. Chapter 1 is divided into three sections. Section 1.1 introduces the notation and gives a formal description of the problem. The Singular Value Decomposition is defined, and some general and useful properties of the singular values are reported. This is followed by an introduction to the probability results needed to describe the Gaussian random vectors which will be sampled to compute an approximation to the range of a given matrix. Section 1.2 gives some results from linear algebra, such as aspects related to the positive semidefiniteness of a matrix, to the singular values and to the orthogonal projectors onto subspaces. Some further results from the theory of probability are reported in Section 1.3, dealing with Gaussian random matrices and their Frobenius and spectral norms.

Chapter 2 describes a randomized algorithm to compute an approximate SVD of a matrix for which computing a matrix-vector product is computationally efficient. The problem can be viewed in two different versions: a fixed-rank problem and a fixed-accuracy problem. The former version aims at finding an approximation with a certain rank, trying to maximize the accuracy; the latter aims at finding an approximation having a certain accuracy, trying to minimize the rank.

After an introduction to the chapter, Section 2.2 describes a randomized algorithm for the fixed-rank problem, which works well when the starting matrix has a fast decay of the singular values. A high-probability estimate of the approximation error, both in Frobenius and in spectral norm, is then provided, together with a bound for the expected value of the error.

Section 2.3 deals with the case where the matrix has a slowly decaying singular spectrum. The fixed-rank problem is considered also in this case. The algorithm is slightly modified to improve the accuracy, and the analysis of the error is repeated.


Section 2.4 addresses the fixed-accuracy problem: the algorithm is adapted in order to construct a basis for the range of the given matrix by adding vectors iteratively, until the required precision is achieved.

Some deterministic low-rank approximation methods, alternative to the presented approach, are described in Chapter 3. In detail, in Section 3.1 we present the Lanczos bidiagonalization process. This algorithm takes advantage of the relation between the singular values and singular vectors of a matrix $A$ and the eigenvalues and eigenvectors of the symmetric matrix $A^T A$ to which it is applied.

Section 3.2 deals with the rank-revealing QR decomposition. We present an efficient algorithm providing an approximation of a given matrix $A$ by computing only a few steps of a rank-revealing QR decomposition. Then we truncate the rank-revealing QR to obtain a low-rank approximation of the matrix.

Finally, Chapter 4 shows some applications of the randomized algorithms. In Section 4.1 we test the algorithms for the fixed-rank and the fixed-accuracy problem. This is done using low-rank random matrices and positive definite Hankel matrices. In Section 4.2 we test the fixed-accuracy algorithm using semi-infinite Toeplitz matrices with low-rank corrections. In particular, we consider the sum and the product of matrices of this kind. We approximate the low-rank correction of the resulting matrix with the fixed-accuracy algorithm and, starting from this, we approximate the whole matrix.

We conclude by focusing on a particular application: assume we have to compute a matrix $A = T(a) + E$, the sum of a Toeplitz matrix $T(a)$, associated with the symbol $a$, and a low-rank correction $E$. This matrix is a known function of given matrices $A_i = T(a_i) + E_i$, say $A = A_1 A_2 \cdots A_q$. Assume also that the symbol $a$ is easily computable as a function of $a_1, \ldots, a_q$, so that the goal of the computation is to compute $E$. Computing a low-rank approximation of $E = A - T(a)$ with the fixed-accuracy algorithm does not require knowledge of the whole matrix $A$, provided that a procedure to compute the matrix-vector products $Av$ and $A^T v$ is available. Given that, we may arrive at an effective representation of $A$ in terms of $T(a)$ and $E$. As an interesting example, we consider the case where $A$ is the product of the matrices $A_j = T(a_j) + E_j$, for $j = 1, \ldots, q$, where $T(a_j)$ is a semi-infinite Toeplitz matrix and $E_j$ is a low-rank correction, so that the matrix-vector product $A_j v$ can be easily computed. We show that a good approximation of the matrix $A$ can be achieved in a more efficient way than by computing $A$ using the arithmetic operations of the Matlab CQT-toolbox described in [2].

The Appendix reports some classical theoretical results which support the theory developed in the thesis.


Chapter 1

Formulation of the problem

In this chapter we introduce the problem of approximating a low-rank matrix of large size by means of a product of lower-dimensional matrices.

In Section 1.1 we introduce the notation used throughout the text and we show that the economy-size SVD satisfies all the requirements of a low-rank decomposition of the starting matrix. Despite this fact, computing an SVD can be very expensive, so we search for a less expensive algorithm that computes an approximation to an SVD of the matrix. In addition, Section 1.1 contains some key results from probability, which will be useful in the analysis of the randomized algorithms presented in the next chapter, which is the core of this work.

Section 1.2 contains some results involving positive semidefinite matrices, and also some basic properties concerning the singular values. In the last part of the section we introduce the orthogonal projectors onto subspaces, together with some related properties. This material is useful to better understand the results presented in the next chapters.

This is also true concerning the contents of Section1.3, where we deal with Gaussian matrices. In particular we present some bounds on singular values of Gaussian matrices, and some results involving Frobenius and spectral norm of this kind of matrices.

1.1 Preliminaries

In this section we introduce some notation used throughout the text and give a more formal definition of the problem we are going to deal with.

For simplicity of notation we will use only real matrices, but many of the results presented in the following chapters can be generalized to complex matrices.

We denote by N the set of natural numbers, by Z the set of integers and by R the set of real numbers.

We denote by $\mathbb{R}^{m\times n}$ the set of $m \times n$ matrices $A = (a_{ij})$ with entries $a_{ij} \in \mathbb{R}$, for $i = 1, \ldots, m$ and $j = 1, \ldots, n$. $I_n$ will be the $n \times n$ identity matrix. Writing $\mathrm{rk}(A)$ we mean the rank of a given matrix $A$. With $\mathrm{range}(A)$ we denote the space spanned by the columns of $A$. Given a vector $v = (v_i) \in \mathbb{R}^n$, let $\|v\|_2 = \left(\sum_{i=1}^n v_i^2\right)^{1/2}$. For a matrix $A \in \mathbb{R}^{m\times n}$, we denote by $\|A\|_2 := \max_{x \neq 0} \|Ax\|_2 / \|x\|_2$ the $\ell^2$-operator norm of the matrix, also called spectral norm, and by $\|A\|_F := \left(\sum_{j,k} |a_{jk}|^2\right)^{1/2}$ its Frobenius norm. We recall that $\|A\|_2 \leq \|A\|_F \leq \sqrt{\min\{m,n\}}\, \|A\|_2$.

The symbol $\lesssim$ will be used meaning that the quantity on its left is less than or equal to a positive multiple, independent of the involved quantities, of the quantity on its right. The symbol $\approx$ means that the quantity on its right is an approximation of the quantity on its left, within a suitably specified precision.

Lemma 1.1.1. Given a matrix $A \in \mathbb{R}^{m\times n}$, let $B \in \mathbb{R}^{k\times l}$, with $1 \leq k \leq m$ and $1 \leq l \leq n$, be a submatrix of $A$. It holds that $\|B\|_2 \leq \|A\|_2$.

Proof. Let $x \in \mathbb{R}^l$ be a vector of unit norm such that $\|B\|_2 = \|Bx\|_2$. Assume, without loss of generality, that the rows and columns of $A$ selected to form $B$ are the uppermost $k$ and the leftmost $l$, respectively. Let us consider $y = \begin{pmatrix} x \\ 0 \end{pmatrix} \in \mathbb{R}^n$, and split $A$ into blocks:
$$A = \begin{pmatrix} B & A_{12} \\ A_{21} & A_{22} \end{pmatrix}.$$
We can easily see that $Ay = \begin{pmatrix} Bx \\ A_{21}x \end{pmatrix}$, so that $\|Ay\|_2^2 = \|Bx\|_2^2 + \|A_{21}x\|_2^2 \geq \|Bx\|_2^2$. As a consequence we have
$$\|A\|_2 = \max_{z \colon \|z\|_2 = 1} \|Az\|_2 \geq \|Ay\|_2 \geq \|Bx\|_2 = \|B\|_2. \qquad \square$$

Given an $m \times n$ matrix $A$, we denote by $A^T \in \mathbb{R}^{n\times m}$ its transpose. A matrix $A$ is called symmetric if $A = A^T$. A matrix $Q \in \mathbb{R}^{n\times n}$ is called orthogonal if $QQ^T = Q^T Q = I_n$. A matrix $U \in \mathbb{R}^{m\times k}$ is orthonormal if its columns form a set of orthonormal vectors, that is, $U^T U = I_k$. An orthonormal matrix $U$ preserves the Euclidean norm of the vectors to which it is applied: for any $x \in \mathbb{R}^k$ we have $\|Ux\|_2 = \|x\|_2$. Multiplying a matrix $A$ by orthonormal matrices preserves both the spectral and the Frobenius norm: $\|UAV^T\|_2 = \|A\|_2$ and $\|UAV^T\|_F = \|A\|_F$ for any matrix $A$ and any orthonormal matrices $U$ and $V$.

The problem we are dealing with can be intuitively summarized as follows. It is common to work with a high-dimensional matrix whose information content is, instead, not so big. In fact, what is important is the knowledge of the space spanned by the columns (or the rows) of this matrix. If we have a matrix $A \in \mathbb{R}^{m\times n}$ with rank $k \ll m, n$, we would like to represent it as a product of lower-dimensional matrices. Indeed, this representation would simplify the computations involving the matrix, leading to a smaller number of operations and to a reduced memory storage. An idea, of course, is to start from the standard decompositions of the matrix, for example the SVD (Singular Value Decomposition). We will show that the rank-$k$ truncated SVD of a matrix, defined below, minimizes both the Frobenius and the spectral norm of the difference from $A$ among all rank-$k$ matrices. But this is not sufficient, because computing this decomposition can be very expensive in terms of floating-point operations (flops). Therefore we will try to find an alternative approximate low-rank decomposition that can be computed at low cost and can provide approximations comparable to the machine precision.

To clarify this idea, let us define, for a given matrix, the SVD and the approximate SVD with fixed accuracy.

Definition 1.1. An SVD (Singular Value Decomposition) of a matrix $A \in \mathbb{R}^{m\times n}$ is the decomposition of $A$ into the product $A = U\Sigma V^T$, where $U \in \mathbb{R}^{m\times m}$ and $V \in \mathbb{R}^{n\times n}$ are orthogonal matrices, whose columns $u_1, \ldots, u_m$ and $v_1, \ldots, v_n$ are the left and the right singular vectors of $A$, respectively. $\Sigma$ is a nonnegative diagonal matrix, whose diagonal entries $\Sigma_{ii} = \sigma_i$, for $i = 1, \ldots, r$, where $r = \min\{m, n\}$, are the singular values of $A$, nonincreasingly ordered (i.e. $\sigma_1 \geq \cdots \geq \sigma_r \geq 0$).

Remark 1.2. Thanks to the ordering imposed on the singular values in the definition above, $\Sigma$ is completely determined by $A$. The matrices $U$ and $V$, on the other hand, need not be unique. Indeed, for instance, when $A$ has a repeated singular value $\sigma_i = \sigma_{i+1}$, we can swap the columns $u_i$ and $u_{i+1}$, and also $v_i$ and $v_{i+1}$, and obtain a different SVD. So, the singular vectors are essentially unique only if $\sigma_i \neq \sigma_j$ for all $i \neq j$.

Let us introduce a block partitioning notation that is useful to handle structured matrices. Given an SVD of $A \in \mathbb{R}^{m\times n}$, $A = U\Sigma V^T$, we are interested in approximating the subspace spanned by the first $k$ left singular vectors. If the decay of the singular values is fast enough we will also consider another parameter $j$, trying to obtain sharper bounds. We will consider the block partitioning of the matrices involved, depending on the parameters $j$ and $k$:
$$A = \begin{pmatrix} U_1 & U_2 & U_3 \end{pmatrix} \begin{pmatrix} \Sigma_1 & & \\ & \Sigma_2 & \\ & & \Sigma_3 \end{pmatrix} \begin{pmatrix} V_1^T \\ V_2^T \\ V_3^T \end{pmatrix}, \qquad (1.1)$$
where $U_1 \in \mathbb{R}^{m\times k}$, $U_2 \in \mathbb{R}^{m\times j}$, $U_3 \in \mathbb{R}^{m\times (m-k-j)}$, $\Sigma_1 \in \mathbb{R}^{k\times k}$, $\Sigma_2 \in \mathbb{R}^{j\times j}$, $\Sigma_3 \in \mathbb{R}^{(m-k-j)\times (n-k-j)}$, $V_1^T \in \mathbb{R}^{k\times n}$, $V_2^T \in \mathbb{R}^{j\times n}$ and $V_3^T \in \mathbb{R}^{(n-k-j)\times n}$. Multiplying $A$ by a sample matrix $\Omega$ yields
$$Y = A\Omega = U \begin{pmatrix} \Sigma_1 V_1^T \Omega \\ \Sigma_2 V_2^T \Omega \\ \Sigma_3 V_3^T \Omega \end{pmatrix} = U \begin{pmatrix} \Sigma_1 \Omega_1 \\ \Sigma_2 \Omega_2 \\ \Sigma_3 \Omega_3 \end{pmatrix}, \qquad \text{where } \Omega_1 := V_1^T \Omega,\ \Omega_2 := V_2^T \Omega,\ \Omega_3 := V_3^T \Omega.$$

Definition 1.3. A so-called economy-size SVD of a matrix $A \in \mathbb{R}^{m\times n}$ with $\mathrm{rk}(A) = k$ is given by $A = U\Sigma V^T$, in which $U \in \mathbb{R}^{m\times k}$ and $V \in \mathbb{R}^{n\times k}$ are orthonormal matrices, while $\Sigma \in \mathbb{R}^{k\times k}$ is such that $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_k)$, with $\sigma_1 \geq \cdots \geq \sigma_k > 0$.

Remark 1.4. For a rank-$k$ matrix $A \in \mathbb{R}^{m\times n}$, an economy-size SVD can be obtained by considering the block partitioning notation of its SVD, with parameters $k$ and $j = 0$. In this case $\Sigma_3$ is a null block, so we have $A = U\Sigma V^T = U_1 \Sigma_1 V_1^T$, and the sizes of the matrices involved in the latter product are the same as those of the corresponding matrices in the definition of the economy-size SVD.

Remark 1.5. An SVD of $A \in \mathbb{R}^{m\times n}$, $A = U\Sigma V^T$, can be equivalently written as a sum of rank-1 matrices, $A = \sum_{i=1}^{r} \sigma_i u_i v_i^T$, where $r = \min\{m, n\}$.
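The definitions above are easy to check numerically. The following hedged sketch (Python/NumPy, with an arbitrary random test matrix) computes a full and an economy-size SVD with `numpy.linalg.svd` and verifies Remark 1.5, namely that $A$ equals the sum of the rank-1 terms $\sigma_i u_i v_i^T$.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 4))

# Full SVD: U (6x6), s (4,), Vt (4x4); economy-size SVD: U (6x4), Sigma (4x4), Vt (4x4).
U, s, Vt = np.linalg.svd(A, full_matrices=True)
Ue, se, Vte = np.linalg.svd(A, full_matrices=False)

Sigma = np.zeros(A.shape)
np.fill_diagonal(Sigma, s)
print("full reconstruction error   :", np.linalg.norm(A - U @ Sigma @ Vt))
print("economy reconstruction error:", np.linalg.norm(A - Ue @ np.diag(se) @ Vte))

# Remark 1.5: A as a sum of rank-1 matrices sigma_i * u_i * v_i^T.
A_sum = sum(se[i] * np.outer(Ue[:, i], Vte[i, :]) for i in range(len(se)))
print("rank-1 sum error            :", np.linalg.norm(A - A_sum))
```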

Before giving the definition of an approximate SVD of a matrix A, we describe some general properties of the singular values.

We notice that the squares of the singular values of a matrix $A$ are the eigenvalues of $A^T A$, and the singular values themselves are the absolute values of the eigenvalues of $\begin{pmatrix} 0 & A^T \\ A & 0 \end{pmatrix}$. Indeed, if $\mathrm{rk}(A) = k$ and we consider an SVD $A = U\Sigma V^T$ with square orthogonal matrices $U$ and $V$ (so $\Sigma$ is a rectangular matrix, with the first diagonal elements given by $\sigma_1, \ldots, \sigma_k$ and all other elements zero), we obtain $U^T A V = \Sigma$. Then, multiplying this expression on the left by its transpose, we get
$$V^T(A^T A)V = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_k^2, 0, \ldots, 0) \in \mathbb{R}^{n\times n}.$$
Analogously, multiplying the same expression on the right by its transpose, we obtain
$$U^T(AA^T)U = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_k^2, 0, \ldots, 0) \in \mathbb{R}^{m\times m}.$$
This proves that the squares of the singular values of $A$ are the eigenvalues of $A^T A$ or, analogously, of $AA^T$. On the other hand, supposing $m \geq n$, if we partition $U$ into blocks $U = \begin{pmatrix} \widetilde{U}_1 & \widetilde{U}_2 \end{pmatrix}$, with $\widetilde{U}_1 \in \mathbb{R}^{m\times n}$, and we consider
$$Q = \frac{1}{\sqrt{2}} \begin{pmatrix} V & V & 0 \\ \widetilde{U}_1 & -\widetilde{U}_1 & \sqrt{2}\,\widetilde{U}_2 \end{pmatrix},$$
then we get
$$Q^T \begin{pmatrix} 0 & A^T \\ A & 0 \end{pmatrix} Q = \mathrm{diag}(\sigma_1, \ldots, \sigma_n, -\sigma_1, \ldots, -\sigma_n, 0, \ldots, 0) \in \mathbb{R}^{(m+n)\times(m+n)}.$$
In fact we have $\begin{pmatrix} \widetilde{U}_1^T \\ \widetilde{U}_2^T \end{pmatrix} A V = \Sigma = \begin{pmatrix} \widetilde{\Sigma} \\ 0 \end{pmatrix}$, where $\widetilde{\Sigma} = \mathrm{diag}(\sigma_1, \ldots, \sigma_n)$. In particular it follows that $\widetilde{U}_1^T A V = \widetilde{\Sigma} = \widetilde{\Sigma}^T = V^T A^T \widetilde{U}_1$ and $\widetilde{U}_2^T A V = 0 = V^T A^T \widetilde{U}_2$. Now we compute:
$$Q^T \begin{pmatrix} 0 & A^T \\ A & 0 \end{pmatrix} Q = \frac{1}{2} \begin{pmatrix} V^T A^T \widetilde{U}_1 + \widetilde{U}_1^T A V & -V^T A^T \widetilde{U}_1 + \widetilde{U}_1^T A V & \sqrt{2}\, V^T A^T \widetilde{U}_2 \\ V^T A^T \widetilde{U}_1 - \widetilde{U}_1^T A V & -V^T A^T \widetilde{U}_1 - \widetilde{U}_1^T A V & \sqrt{2}\, V^T A^T \widetilde{U}_2 \\ \sqrt{2}\, \widetilde{U}_2^T A V & \sqrt{2}\, \widetilde{U}_2^T A V & 0 \end{pmatrix} = \begin{pmatrix} \widetilde{\Sigma} & 0 & 0 \\ 0 & -\widetilde{\Sigma} & 0 \\ 0 & 0 & 0 \end{pmatrix}.$$
We deduce that the singular values of $A$ are the absolute values of the eigenvalues of $\begin{pmatrix} 0 & A^T \\ A & 0 \end{pmatrix}$.

The following theorem characterizes the singular values of a matrix $A$: the $(k+1)$-st largest singular value is the minimum spectral norm of the difference between $A$ and a rank-$k$ matrix.

Theorem 1.1.2 (Eckart-Young-Mirsky: the spectral norm case). Given a matrix $A \in \mathbb{R}^{m\times n}$, let $r = \min\{m, n\}$. If $A$ has singular values $\sigma_1 \geq \cdots \geq \sigma_r$, it holds that $\sigma_1 = \|A\|_2$ and
$$\sigma_{k+1} = \min\{\|A - B\|_2 \mid \mathrm{rk}(B) = k\}, \qquad \forall k = 1, \ldots, r-1.$$

Proof. The proof can be divided into two parts. Consider the SVD $A = U\Sigma V^T$ and the block partitioning (1.1), with parameters $k$ and $j = 0$; denote $A_k = U_1 \Sigma_1 V_1^T$, for all $k = 1, \ldots, r$. First we show that $\|A - A_k\|_2 = \sigma_{k+1}$, then we prove that for all rank-$k$ matrices $B$ we have $\|A - A_k\|_2 \leq \|A - B\|_2$.

Noticing that $A_k = \sum_{i=1}^{k} \sigma_i u_i v_i^T$, we have $A - A_k = \sum_{i=k+1}^{r} \sigma_i u_i v_i^T$. To compute the spectral norm of $A - A_k$ we have to maximize $\|(A - A_k)v\|_2$, with $\|v\|_2 = 1$. Let us take $v = \sum_{i=1}^{r} \alpha_i v_i$: the columns of $V$ are orthonormal vectors, so we have $1 = \|v\|_2^2 = \sum_{i=1}^{r} \alpha_i^2$. It holds that
$$\|(A - A_k)v\|_2 = \left\| \sum_{i=k+1}^{r} \sigma_i u_i v_i^T \sum_{j=1}^{r} \alpha_j v_j \right\|_2 = \left\| \sum_{i=k+1}^{r} \alpha_i \sigma_i u_i v_i^T v_i \right\|_2 = \left\| \sum_{i=k+1}^{r} \alpha_i \sigma_i u_i \right\|_2 = \sqrt{\sum_{i=k+1}^{r} \alpha_i^2 \sigma_i^2},$$
where in the second and third equalities we used the fact that the columns of $V$ are orthonormal vectors. A unit vector $v$ which maximizes this quantity has coefficients $\alpha_{k+1} = 1$ and $\alpha_i = 0$ for $i \neq k+1$. Using this we achieve $\|A - A_k\|_2 = \sigma_{k+1}$. Notice that the maximizer is not always unique: in case $\sigma_{k+1} = \sigma_{k+2}$, another unit vector which maximizes the considered quantity has coefficients $\alpha_{k+2} = 1$ and $\alpha_i = 0$ for all $i \neq k+2$. Moreover, even when $\sigma_{k+1} > \sigma_{k+2}$, if $v$ is a maximizer, also $-v$ is a maximizer.

Now we prove that $\|A - A_k\|_2 \leq \|A - B\|_2$ for any matrix $B$ with rank at most $k$. If the rank of $A$ is less than or equal to $k$ the inequality is true because we have $\|A - A_k\|_2 = 0$. So, let us suppose that the rank of $A$ is greater than $k$. By contradiction, suppose that there exists a matrix $B$, with rank at most $k$, such that $\|A - B\|_2 < \sigma_{k+1} = \|A - A_k\|_2$. Given that $B \in \mathbb{R}^{m\times n}$, the null space $\ker(B)$ of $B$ has dimension at least $n - k$. Let $v_1, \ldots, v_{k+1}$ be the first $k+1$ right singular vectors of $A$. By a dimension argument, there exists a vector $z \neq 0$ in $\ker(B) \cap \mathrm{Span}\{v_1, \ldots, v_{k+1}\}$: without loss of generality we may suppose that $\|z\|_2 = 1$. If we show that $\|(A - B)z\|_2 \geq \sigma_{k+1}$, we obtain $\|A - B\|_2 \geq \sigma_{k+1}$, which is a contradiction. In particular, since $Bz = 0$, we have $\|A - B\|_2^2 \geq \|(A - B)z\|_2^2 = \|Az\|_2^2$. Because $z \in \mathrm{Span}\{v_1, \ldots, v_{k+1}\}$ and it is a unit vector, we have
$$\|Az\|_2^2 = \left\| \sum_{i=1}^{r} \sigma_i u_i v_i^T z \right\|_2^2 = \sum_{i=1}^{r} \sigma_i^2 (v_i^T z)^2 = \sum_{i=1}^{k+1} \sigma_i^2 (v_i^T z)^2 \geq \sigma_{k+1}^2 \sum_{i=1}^{k+1} (v_i^T z)^2 = \sigma_{k+1}^2 \|z\|_2^2 = \sigma_{k+1}^2.$$
We also used the fact that in the definition of the SVD the singular values are ordered in a nonincreasing way. As a consequence we obtain $\|A - B\|_2^2 \geq \sigma_{k+1}^2$. This proves that
$$\min\{\|A - B\|_2 \mid \mathrm{rk}(B) = k\} = \sigma_{k+1}. \qquad \square$$

In the proof we observed that, for all $k$, the value $\min\{\|A - B\|_2 \mid \mathrm{rk}(B) = k\}$ is attained by the truncated SVD $A_k := U_1 \Sigma_1 V_1^T$, where $U_1 \in \mathbb{R}^{m\times k}$, $\Sigma_1 \in \mathbb{R}^{k\times k}$ and $V_1^T \in \mathbb{R}^{k\times n}$.

Let us consider the analogous problem involving the Frobenius norm,
$$\min\{\|A - B\|_F \mid \mathrm{rk}(B) = k\},$$
for all $k = 1, \ldots, r = \min\{m, n\}$. It can be shown that the truncated SVD $A_k$, defined above, is a minimizer also in this case.

Theorem 1.1.3 (Eckart-Young-Mirsky: the Frobenius norm case). Given a matrix $A \in \mathbb{R}^{m\times n}$, let $r = \min\{m, n\}$. Consider an SVD $A = U\Sigma V^T$ and, with the notation above, truncate it to obtain $A_k = U_1 \Sigma_1 V_1^T$. For any $k$, $A_k$ is a solution of the problem $\min\{\|A - B\|_F \mid \mathrm{rk}(B) \leq k\}$, and the achieved minimum is $\sqrt{\sum_{i=k+1}^{r} \sigma_i^2}$.

Proof. Since the matrices $U$ and $V$ are orthogonal, by definition of the Frobenius norm we have $\|A - A_k\|_F^2 = \left\| \sum_{i=k+1}^{r} \sigma_i u_i v_i^T \right\|_F^2 = \sum_{i=k+1}^{r} \sigma_i^2$. Let us consider a rank-$j$ matrix $B$, with $j \leq k$. If $\rho_1 \geq \cdots \geq \rho_r \geq 0$ are the singular values of $A - B$, we have $\|A - B\|_F^2 = \sum_{i=1}^{r} \rho_i^2$. It suffices to show that $\rho_i \geq \sigma_{k+i}$ for all $i = 1, \ldots, r - k$. From this fact it would follow that $\sum_{i=1}^{r} \rho_i^2 \geq \sum_{i=k+1}^{r} \sigma_i^2$, which is equivalent to the thesis. Denote by $\lambda_i(\cdot)$ the $i$-th eigenvalue, in nondecreasing order, of the matrix in the argument. Consider the matrices
$$\widetilde{B} = \begin{pmatrix} 0 & B^T \\ B & 0 \end{pmatrix}, \qquad \widetilde{A} = \begin{pmatrix} 0 & A^T \\ A & 0 \end{pmatrix}, \qquad \widetilde{A} - \widetilde{B} := \begin{pmatrix} 0 & (A - B)^T \\ A - B & 0 \end{pmatrix}.$$
If we apply Weyl's inequalities (Theorem A.3 in the Appendix) to them, we obtain that
$$\lambda_i(\widetilde{A}) = \lambda_i((\widetilde{A} - \widetilde{B}) + \widetilde{B}) \leq \lambda_{i+j}(\widetilde{A} - \widetilde{B}) + \lambda_{m+n-j}(\widetilde{B})$$
for $j = 0, \ldots, m+n-i$ and for all $i = 1, \ldots, m+n$. Because we are considering the eigenvalues in nondecreasing order, we have $\sigma_k = \lambda_{m+n-k+1}(\widetilde{A})$ and $\rho_k = \lambda_{m+n-k+1}(\widetilde{A} - \widetilde{B})$. The eigenvalue inequality above can be rewritten, with a change of indices, as
$$\lambda_{m+n-i+1}(\widetilde{A}) \leq \lambda_{m+n-i+k+1}(\widetilde{A} - \widetilde{B}) + \lambda_{m+n-k}(\widetilde{B}),$$
that, in terms of singular values, becomes
$$\sigma_i \leq \rho_{i-k} + \eta_{k+1},$$
where $\eta_1 \geq \cdots \geq \eta_{\min\{m,n\}}$ are the singular values of $B$. This can finally be rewritten as
$$\sigma_{i+k} \leq \rho_i + \eta_{k+1}.$$
The fact that $\mathrm{rk}(B) = j \leq k < k+1$ implies that $\eta_{k+1} = 0$, so we obtain $\rho_i \geq \sigma_{k+i}$. $\square$

The Eckart-Young-Mirsky theorems prove that the rank-$k$ matrix $A_k$ is the best rank-$k$ approximant of $A$ in both the Frobenius and the spectral norm, because it minimizes both $\|A - B\|_2$ and $\|A - B\|_F$ over all rank-$k$ matrices $B$.
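The optimality stated by the two Eckart-Young-Mirsky theorems can be observed numerically. In the sketch below (Python/NumPy, arbitrary random test matrix) the truncated SVD $A_k$ is formed explicitly and its errors are compared with the predicted values $\sigma_{k+1}$ and $\left(\sum_{i>k}\sigma_i^2\right)^{1/2}$.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((8, 6))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 3
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # truncated SVD, the best rank-k approximant

err_2 = np.linalg.norm(A - A_k, 2)
err_F = np.linalg.norm(A - A_k, 'fro')
print("spectral error :", err_2, " expected sigma_{k+1} =", s[k])
print("Frobenius error:", err_F, " expected             =", np.sqrt(np.sum(s[k:] ** 2)))
```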

If $A \in \mathbb{R}^{m\times n}$ with $m \geq n$, the cost of computing an SVD is $O(mn^2)$ flops (floating-point operations). Let us consider the case we are interested in, namely when $\mathrm{rk}(A) = k \ll m, n$ and we look for a rank-$k$ approximation of $A$. As we will see in Section 2.1, computing $A_k$, the best rank-$k$ approximation, would be much more expensive than computing a different rank-$k$ approximation through randomized methods. In the next chapters different approaches will be shown for finding an approximation to $A$ at a lower cost, by slightly reducing the attained accuracy.

The lemma below gives a characterization of the singular values of a matrix $A \in \mathbb{R}^{m\times n}$ through the Courant-Fischer Min-Max Theorem. Thanks to the characterization of the spectral norm of a matrix as its largest singular value, this result will be useful in comparing the spectral norms of matrices that are, for example, in a matrix-submatrix relation.

Lemma 1.1.4. The $k$-th singular value of a given matrix $A \in \mathbb{R}^{m\times n}$ is given by
$$\sigma_k = \max_{S \subseteq \mathbb{R}^n,\ \dim(S) = k}\ \min_{v \in S,\ v \neq 0} \frac{\|Av\|_2}{\|v\|_2}$$
for all $k = 1, \ldots, \min\{m, n\}$.

Proof. The proof of the lemma is just an application of the Min-Max Theorem A.1 in the Appendix to the matrix $A^T A$, whose eigenvalues are the squares of the singular values of $A$. In fact, this theorem states that the $k$-th largest eigenvalue of a symmetric matrix $B$ can be written as
$$\max_{S \subseteq \mathbb{R}^n,\ \dim(S) = k}\ \min_{0 \neq y \in S} \frac{y^T B y}{y^T y};$$
applying this formula to $A^T A$, we have $y^T A^T A y = \|Ay\|_2^2$ and $y^T y = \|y\|_2^2$, so we reach a characterization for the singular values of $A$. $\square$

The above lemma enables us to prove the following result which states that any matrix B constructed considering a subset of k columns of a given matrix A has the i-th singular value bounded from above by the i-th singular value of A.

Proposition 1.1.5. Let $B \in \mathbb{R}^{m\times k}$ be a matrix containing any subset of $k$ columns of $A \in \mathbb{R}^{m\times n}$. Then $\sigma_i(B) \leq \sigma_i(A)$ for each $i \leq k$.

Proof. Assume, without loss of generality, that $B$ contains the $k$ leftmost columns of $A$. We stated that $\sigma_i(B) = \max_{S \subseteq \mathbb{R}^k,\ \dim(S) = i}\ \min_{v \in S,\ v \neq 0} \|Bv\|_2 / \|v\|_2$. Let $S$ be a subspace attaining this maximum and let $\bar{v} \in S$ be a vector for which the inner minimum is attained. The padded vector $\hat{v} = \begin{pmatrix} \bar{v} \\ 0 \end{pmatrix} \in \mathbb{R}^n$ belongs to the correspondingly padded subspace $\hat{S} \subseteq \mathbb{R}^n$, and $\|\hat{v}\|_2 = \|\bar{v}\|_2$. Because $B$ is a submatrix involving all the rows of $A$, we also have $\|A\hat{v}\|_2 = \|B\bar{v}\|_2$, so that
$$\sigma_i(B) = \min_{v \in S,\ v \neq 0} \frac{\|Bv\|_2}{\|v\|_2} = \frac{\|B\bar{v}\|_2}{\|\bar{v}\|_2} = \frac{\|A\hat{v}\|_2}{\|\hat{v}\|_2}.$$
Whence we have
$$\sigma_i(A) = \max_{S' \subseteq \mathbb{R}^n,\ \dim(S') = i}\ \min_{w \in S',\ w \neq 0} \frac{\|Aw\|_2}{\|w\|_2} \geq \min_{w \in \hat{S},\ w \neq 0} \frac{\|Aw\|_2}{\|w\|_2} = \frac{\|A\hat{v}\|_2}{\|\hat{v}\|_2} = \sigma_i(B),$$
since every $w \in \hat{S}$ is of the form $w = \begin{pmatrix} v \\ 0 \end{pmatrix}$ with $v \in S$ and satisfies $Aw = Bv$. $\square$

The rank of a matrix can be alternatively defined as the number of non-zero singular values, counted with multiplicity. Following this interpretation we may introduce the concept of δ-rank associated with a fixed accuracy δ > 0, which will be used to define an approximate SVD of a matrix.

Definition 1.6. For a fixed accuracy $\delta > 0$ and for a given matrix $A$ with singular values $\sigma_i$, we define the $\delta$-rank of $A$ as the maximum integer $k$ such that $\sigma_k > \delta\sigma_1$.
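Definition 1.6 translates directly into a few lines of code. The helper below (Python/NumPy; the function name `delta_rank` is only an illustrative choice, not notation from the thesis) counts the singular values exceeding $\delta\sigma_1$ and contrasts the result with the full numerical rank of a slightly perturbed low-rank matrix.

```python
import numpy as np

def delta_rank(A, delta):
    """Number of singular values of A exceeding delta times the largest one."""
    s = np.linalg.svd(A, compute_uv=False)
    if s.size == 0 or s[0] == 0.0:
        return 0
    return int(np.count_nonzero(s > delta * s[0]))

rng = np.random.default_rng(3)
# A numerically low-rank matrix: exact rank 5 plus a tiny perturbation.
A = rng.standard_normal((100, 5)) @ rng.standard_normal((5, 80))
A += 1e-10 * rng.standard_normal(A.shape)

# The perturbation makes the matrix full rank in exact terms,
# while the delta-rank still detects the underlying rank 5.
print("numerical rank:", np.linalg.matrix_rank(A))
print("delta-rank    :", delta_rank(A, 1e-8))
```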

Definition 1.7. An approximation with accuracy $\delta > 0$ of an SVD of a matrix $A \in \mathbb{R}^{m\times n}$ is a product $\widehat{U}_\delta \widehat{\Sigma}_\delta \widehat{V}_\delta^T$, in which the columns $\{\hat{u}_1, \ldots, \hat{u}_k\}$ of $\widehat{U}_\delta$ and $\{\hat{v}_1, \ldots, \hat{v}_k\}$ of $\widehat{V}_\delta$ are orthonormal, and $\widehat{\Sigma}_\delta = \mathrm{diag}(\hat{\sigma}_1, \ldots, \hat{\sigma}_k)$ with $\hat{\sigma}_i \geq 0$ for all $i = 1, \ldots, k$, such that
$$\left\| A - \widehat{U}_\delta \widehat{\Sigma}_\delta \widehat{V}_\delta^T \right\|_2 = \left\| A - \sum_{j=1}^{k} \hat{\sigma}_j \hat{u}_j (\hat{v}_j)^T \right\|_2 \leq \delta.$$

Definition 1.8. Given a matrix $A \in \mathbb{R}^{m\times n}$ and an SVD $A = U\Sigma V^T$, we define the Moore-Penrose inverse (also called pseudoinverse) of $A$ as $A^+ = V\Sigma^+ U^T$, where $\Sigma^+ = \mathrm{diag}(\sigma_1^+, \ldots, \sigma_{\min\{m,n\}}^+) \in \mathbb{R}^{n\times m}$, with
$$\sigma_i^+ = \begin{cases} \sigma_i^{-1} & \text{if } \sigma_i \neq 0, \\ 0 & \text{if } \sigma_i = 0. \end{cases}$$

Remark 1.9. When $A^T A$ is invertible and $m \geq n$, the Moore-Penrose inverse of $A$ is given by $(A^T A)^{-1} A^T$. Indeed we have
$$(A^T A)^{-1} A^T = (V\Sigma^T U^T U\Sigma V^T)^{-1} V\Sigma^T U^T = (V\Sigma^T\Sigma V^T)^{-1} V\Sigma^T U^T = V(\Sigma^T\Sigma)^{-1} V^T V\Sigma^T U^T = V(\Sigma^T\Sigma)^{-1}\Sigma^T U^T = V\Sigma^+ U^T,$$
as in the definition. The matrix $\Sigma^T\Sigma$ is invertible because $A^T A = V\Sigma^T\Sigma V^T$ is invertible, so $0$ is not an eigenvalue of $\Sigma^T\Sigma$, that is, no singular value of $A$ vanishes; in this case $(\Sigma^T\Sigma)^{-1}\Sigma^T = \Sigma^+$.

When $A^T A$ is invertible, the quantity $A^+ b$ provides the least squares solution of $\min_x \|Ax - b\|_2$. Indeed, in this case $x = A^+ b = (A^T A)^{-1} A^T b$ satisfies the normal equations $A^T A x = A^T b$, and $Ax = A(A^T A)^{-1} A^T b$ is the orthogonal projection of $b$ onto $\mathrm{range}(A)$.
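A quick numerical check of Remark 1.9 (Python/NumPy, arbitrary sizes): for a matrix with full column rank, the pseudoinverse computed from the SVD coincides with $(A^TA)^{-1}A^T$, and $A^+b$ agrees with the least squares solution returned by `numpy.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((10, 4))          # full column rank with probability 1
b = rng.standard_normal(10)

A_pinv = np.linalg.pinv(A)                # V Sigma^+ U^T, computed via the SVD
normal_eq = np.linalg.inv(A.T @ A) @ A.T  # (A^T A)^{-1} A^T
print("pinv vs normal equations:", np.max(np.abs(A_pinv - normal_eq)))

x = A_pinv @ b
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
print("least squares solutions agree:", np.allclose(x, x_lstsq))
```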

To achieve our purpose of finding an approximate SVD of a matrix we will use randomized methods. First of all, we introduce the idea for the approximation of the range of a matrix A, by using random vectors.


Given $A \in \mathbb{R}^{m\times n}$ with $\mathrm{rk}(A) = k$ (or, more generally, with $\delta$-rank equal to $k$), we are interested in identifying its range (or at least most of it). To do that, we consider a set $\{\omega_i\}_{i=1}^{k}$ of random Gaussian vectors in $\mathbb{R}^n$: in [10] it is well explained how they are likely to be linearly independent. So, given $\Omega := [\omega_1, \ldots, \omega_k]$ and $Y := A\Omega$, we expect that the columns of $Y$ span most of the range of $A$. Indeed, if we consider separately the images $y_i = A\omega_i$, they are $k$ random samples from the range of $A$. Because the randomly sampled vectors $\{\omega_i\}_{i=1}^{k}$ are, intuitively, linearly independent with high probability, no linear combination of them falls into the null space of $A$, and also the vectors $\{y_i\}_{i=1}^{k}$ are linearly independent with high probability. In particular, we would like to obtain a matrix $Y \in \mathbb{R}^{m\times k}$ such that $\mathrm{range}(Y) = \mathrm{range}(A)$, but it will be sufficient to find a matrix $Y$ such that the distance between $\mathrm{range}(Y)$ and $\mathrm{range}(A)$ is small enough, for a suitably defined distance $\mathrm{dist}(\cdot, \cdot)$. In these terms, we construct a matrix $Y$ such that, for a fixed accuracy $\mu > 0$, $\mathrm{dist}(\mathrm{range}(Y), \mathrm{range}(A)) \leq \mu$.

In addition, we orthonormalize the columns of $Y$, obtaining an orthonormal basis $Q$ of (most of) the range of $A$. So, going back to the problem of computing an approximate decomposition of $A$, given an error tolerance $\epsilon$, we determine a matrix $Q$ with a number of orthonormal columns depending on $\epsilon$, such that $\|A - QQ^T A\|_2 \leq \epsilon$. In fact, we will show that, by construction, the approximate SVD that we achieve is the exact SVD of the matrix $QQ^T A$. In addition, in the next section, Remark 1.22 will show that if $Q$ has orthonormal columns, then $QQ^T$ is the orthogonal projector onto the range of $Q$. So, we are searching for a subspace of a given dimension onto which to project our matrix $A$, trying to minimize the resulting approximation error. Gaussian random vectors are often used in the procedure described.
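The construction just described — draw Gaussian vectors, form $Y = A\Omega$, orthonormalize — can be sketched in a few lines. The code below (Python/NumPy) is only an illustration of the idea on a matrix of exact rank $k$; the precise algorithms, the choice of the number of samples and the error analysis are the subject of Chapter 2.

```python
import numpy as np

rng = np.random.default_rng(5)
m, n, k = 200, 150, 10
A = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))   # rank-k test matrix

Omega = rng.standard_normal((n, k))      # k Gaussian random vectors
Y = A @ Omega                            # k random samples from range(A)
Q, _ = np.linalg.qr(Y)                   # orthonormal basis of range(Y)

# If range(Q) captures range(A), the projection Q Q^T A reproduces A.
err = np.linalg.norm(A - Q @ (Q.T @ A), 2)
print("||A - QQ^T A||_2 =", err)
```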

Let us fix the notation and give the basis to the probabilistic tools we need throughout the text.

First, we need to define what a probability function is and on which spaces it can be defined. In general, given a set $F$ and a subset $A$, we denote by $A^C$ the complementary set of $A$ within $F$, that is, $A^C = F \setminus A$.

Definition 1.10. Given a set $F$, a $\sigma$-algebra on $F$ is a collection $\mathcal{F}$ of subsets of $F$ such that:
• $\emptyset \in \mathcal{F}$, $F \in \mathcal{F}$;
• if $A \in \mathcal{F}$, then $A^C \in \mathcal{F}$;
• if $\{A_n\}_{n\geq 1}$ is a sequence such that $A_n \in \mathcal{F}$ for all $n$, then $\bigcup_{n=1}^{\infty} A_n \in \mathcal{F}$.

Definition 1.11. Given a set $F$ and a $\sigma$-algebra $\mathcal{F}$ on it, a probability is a function $P(\cdot) : \mathcal{F} \to [0, 1]$ such that:
• given a sequence $\{A_n\}_{n\geq 1} \subset \mathcal{F}$ such that $A_i \cap A_j = \emptyset$ for any $i \neq j$, then $P\!\left(\bigcup_{n=1}^{\infty} A_n\right) = \sum_{n=1}^{+\infty} P(A_n)$;
• $P(F) = 1$.

$(F, \mathcal{F}, P(\cdot))$ is a probability space and the elements of $\mathcal{F}$ are called events. We now define a random variable.

Definition 1.12. Given a probability space $(F, \mathcal{F}, P(\cdot))$, a real random variable is a measurable function $X : (F, \mathcal{F}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$, where $\mathcal{B}(\mathbb{R})$ is the Borel $\sigma$-algebra, defined as the smallest $\sigma$-algebra containing the open sets.

Given a random variable $X$, we define its expected value as $\mathbb{E}[X] = \int_F X(\omega)\, dP(\omega)$.

Definition 1.13. For a real-valued random variable $X$ we define the Cumulative Distribution Function (CDF) as $F_X : \mathbb{R} \to [0, 1]$ such that $F_X(x) := P(X \leq x)$.

Definition 1.14. A real random variable $X$ has a Probability Density Function (PDF) $f$ if for all $A \in \mathcal{B}(\mathbb{R})$ it holds that $P(X \in A) = \int_A f(x)\, dx$.

If a real random variable $X$ has a probability density function $f$, then, given a function $\varphi$ that is bounded and measurable with respect to $\mathcal{B}(\mathbb{R})$, it holds that $\mathbb{E}[\varphi(X)] = \int_{-\infty}^{+\infty} \varphi(x) f(x)\, dx$.

In this work we consider $F = \mathbb{R}$ and $\mathcal{F} = \mathcal{B}(\mathbb{R})$.

If $X$ is a random variable, $\mathbb{E}_q[X]$ denotes $\left(\mathbb{E}[|X|^q]\right)^{1/q}$. If the quantity $\mathbb{E}[|X|^q]$ is finite, it is called the $q$-th moment of the random variable.

Definition 1.15. Given a random variable $X$ having a second moment, we define its variance as $\mathrm{Var}(X) = \mathbb{E}[|X|^2] - \mathbb{E}[X]^2$.

For two random variables X and Y , we write X ∼ Y if they have the same probability distribution.

A standard Gaussian random variable $X$ has probability density function $f(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$. In this case $X$ has expected value (also called mean) $0$ and variance $1$.

A standard Gaussian matrix is a matrix $\Omega$ whose entries are independent, identically distributed Gaussian random variables with zero mean and unit variance. The notation we use to say that each entry of the matrix is a standard Gaussian variable is $\Omega_{ij} \sim N(0,1)$ for all $i, j$. A Gaussian matrix $\Omega \in \mathbb{R}^{m\times n}$ has a distribution $MN_{m,n}(M, U, V)$, where $M \in \mathbb{R}^{m\times n}$ is the matrix of means, while $U$ is the covariance matrix among rows and $V$ is the covariance matrix among columns. Given $D \in \mathbb{R}^{r\times m}$ and $C \in \mathbb{R}^{n\times s}$, both of full rank, and $X \sim MN_{m,n}(M, U, V)$, it holds that $DXC \sim MN_{r,s}(DMC, DUD^T, C^T V C)$: this fact can be proved by comparing the probability density functions of the random variables on both sides.

Remark 1.16. Given a standard Gaussian matrix $X \sim MN_{m,n}(0, I, I)$, its distribution is invariant under orthogonal transformations. Indeed, given orthogonal matrices $U \in \mathbb{R}^{m\times m}$ and $V \in \mathbb{R}^{n\times n}$, we obtain $UXV \sim MN_{m,n}(U0V, UIU^T, V^T I V)$, that is, $UXV \sim MN_{m,n}(0, I, I)$.

Now we show some basic results involving probability.

Definition 1.17. Given a probability space $(F, \mathcal{F}, P(\cdot))$ and two events $A, B \in \mathcal{F}$ such that $P(B) \neq 0$, we define the conditional probability of $A$ with respect to $B$ as the quantity
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}.$$

Given a probability space $(F, \mathcal{F}, P(\cdot))$ and two events $A, B \in \mathcal{F}$, the following properties are a straightforward consequence of the definition of a probability function:
• if $B \subseteq A$, then $P(B) \leq P(A)$;
• $P(A \cup B) + P(A \cap B) = P(A) + P(B)$;
• $P(A) = P(A \mid B)\, P(B) + P(A \mid B^C)\, P(B^C)$, so $P(A) \leq P(A \mid B) + P(A \mid B^C)$.

Another relevant notion is that of independence of random variables. To define that, we need to say what a π-system generated by a random variable is.

Definition 1.18. Given a real-valued random variable $X$, the $\pi$-system generated by $X$ is the set $\mathcal{I}_X = \{X^{-1}((-\infty, x]),\ x \in \mathbb{R}\}$.


Definition 1.19. Two random variables $X$ and $Y$, defined on the same probability space, are independent if and only if $P(A \cap B) = P(A)\,P(B)$ for all $A \in \mathcal{I}_X$ and for all $B \in \mathcal{I}_Y$.

The definition of independence implies that if X and Y are independent random variables then P (X|Y ) = P (X), so what concerns X is not affected by Y , and viceversa.

Our last definition of this introductory part introduces the conditional expectation; it is followed by some relevant properties.

Proposition 1.1.6. Let $(F, \mathcal{F}, P(\cdot))$ be a probability space, and $X$ a real random variable with finite expected value. Let $\mathcal{E} \subseteq \mathcal{F}$ be another $\sigma$-algebra. Then there exists an $\mathcal{E}$-measurable random variable $Y$ such that $\int_A X(\omega)\, dP(\omega) = \int_A Y(\omega)\, dP(\omega)$ for any $A \in \mathcal{E}$. Such a variable $Y$ is called conditional expectation and is also denoted by $\mathbb{E}[X \mid \mathcal{E}]$.

Remark 1.20. Given a random variable $Z$, we define the $\sigma$-algebra generated by $Z$ as $\sigma(Z) := \{Z^{-1}(A) \mid A \in \mathcal{B}(\mathbb{R}^n)\}$. If we write $\mathbb{E}[X \mid Z]$, where $Z$ is a random variable, we mean $\mathbb{E}[X \mid \sigma(Z)]$.

The following properties hold:
• $\mathbb{E}[X] = \mathbb{E}[\mathbb{E}[X \mid \mathcal{E}]]$;
• if $X$ is measurable with respect to $\mathcal{E}$, then $\mathbb{E}[X \mid \mathcal{E}] = X$;
• if $X$ is independent from $\mathcal{E}$, then $\mathbb{E}[X \mid \mathcal{E}] = \mathbb{E}[X]$.

An important inequality is the following:

Proposition 1.1.7 (Markov’s inequality). If X is a nonnegative random variable and a > 0, it holds that P (X ≥ a) ≤ E [X] /a.

Before showing some more specific results from the theory of probability that will be used in the following chapters, in the next section we recall some further results from linear algebra, needed throughout the text.

1.2 Results from linear algebra

Concerning the results presented in this section, we refer the reader to [10], while concerning orthogonal projectors we refer to [8].

A symmetric matrix $M$ is positive semidefinite if $u^T M u \geq 0$ for every vector $u$, and positive definite if the inequality is strict for all $u \neq 0$. On the set of symmetric matrices we can define a partial order: $M \preceq N$ if and only if $N - M$ is positive semidefinite. In these terms we write $M \succeq 0$ meaning that the matrix $M$ is positive semidefinite. A positive semidefinite matrix has nonnegative real eigenvalues; a positive definite matrix has only positive eigenvalues.

The singular values of a symmetric positive semidefinite matrix are the eigenvalues themselves. Indeed, for a symmetric matrix $A$ the singular values are the absolute values of the eigenvalues, since the eigenvalues of $A^T A = A^2$ are both the squares of the eigenvalues of $A$ and the squares of its singular values. Hence, if the matrix is also positive semidefinite, the eigenvalues of $A$ are its singular values. In the case of a symmetric positive semidefinite matrix $M$ we have $\|M\|_2 = \max_{u \neq 0} \frac{u^T M u}{u^T u}$. Then, if $M \preceq N$, it holds that $\|M\|_2 \leq \|N\|_2$.

The following result is a straightforward consequence of the latter observation:

Proposition 1.2.1 (Conjugation rule). If $M \succeq 0$, for every matrix $A$ we have $A^T M A \succeq 0$; more generally, if $M \preceq N$, then $A^T M A \preceq A^T N A$.

Consider a symmetric positive semidefinite perturbation $M$ of the identity matrix and the inverse of $I + M$. We use the previous result to prove that the difference between the identity and $(I + M)^{-1}$ is bounded, in the positive semidefinite order, by the perturbation itself.

Proposition 1.2.2. Given $M \succeq 0$, then $I - (I + M)^{-1} \preceq M$.

Proof. Since $M$ is positive semidefinite we can consider its positive semidefinite square root $R = M^{1/2}$ (Definition A.1 in the Appendix). We have
$$I - (I + R^2)^{-1} = (I + R^2)^{-1}(I + R^2 - I) = (I + R^2)^{-1} R^2 = R(I + R^2)^{-1}R \preceq RIR = R^2 = M,$$
where we used the fact that rational functions of a diagonalizable matrix (ratios of two polynomials, with a nonvanishing denominator) commute, and we applied Proposition 1.2.1 with the hypothesis $(I + R^2)^{-1} \preceq I$. Furthermore, $I + R^2 = I + M \succeq I$, because obviously $I + M - I = M$ is positive semidefinite. So, looking at the eigenvalues, those of $I + R^2$ are greater than or equal to $1$, hence their reciprocals are less than or equal to $1$. It follows that $I - (I + R^2)^{-1} \succeq 0$, that is, $(I + R^2)^{-1} \preceq I$, by definition. $\square$

The result presented below generalizes the fact that the spectral norm of a symmetric positive semidefinite matrix can be controlled by its trace. Indeed, the singular values of such a matrix are its eigenvalues, so the spectral norm of the matrix is its largest eigenvalue, which is of course less than or equal to the sum of all the eigenvalues. The generalization presented here refers to a matrix that is symmetric by blocks, that is, of the kind $\begin{pmatrix} A & B \\ B^T & C \end{pmatrix}$, where $A$ and $C$ are symmetric.

Proposition 1.2.3. Given a partitioned positive semidefinite matrix $M = \begin{pmatrix} A & B \\ B^T & C \end{pmatrix}$, it holds that $\|M\|_2 \leq \|A\|_2 + \|C\|_2$.

Proof. Since $M$ is positive semidefinite, we have
$$\|M\|_2 = \sup_{\|x\|_2^2 + \|y\|_2^2 = 1} \begin{pmatrix} x^T & y^T \end{pmatrix} \begin{pmatrix} A & B \\ B^T & C \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} \leq \sup_{\|x\|_2^2 + \|y\|_2^2 = 1} \left( \|A\|_2 \|x\|_2^2 + 2\|B\|_2 \|x\|_2 \|y\|_2 + \|C\|_2 \|y\|_2^2 \right).$$
Theorem 7.7.9 in [12] states that if $M$ is symmetric positive semidefinite, there exists a matrix $X$ such that $\|X\|_2 \leq 1$ and $B = A^{1/2} X C^{1/2}$, so that $\|B\|_2^2 \leq \|A\|_2 \|C\|_2$. Thus,
$$\|M\|_2 \leq \sup_{\|x\|_2^2 + \|y\|_2^2 = 1} \left( \|A\|_2^{1/2} \|x\|_2 + \|C\|_2^{1/2} \|y\|_2 \right)^2 \leq \|A\|_2 + \|C\|_2. \qquad \square$$

The following lemma relates the norm of the pseudoinverse $(A^T A)^{-1} A^T$ of a given full-rank matrix $A$ to the minimum singular value of $A$.

Lemma 1.2.4. Given $A \in \mathbb{R}^{m\times n}$ with $m \geq n$ such that $A^T A$ is invertible, then
$$\left\| (A^T A)^{-1} A^T \right\|_2 = \frac{1}{\sigma_n},$$
where $\sigma_n$ is the least singular value of $A$.

Proof. We have $A^T A = V\Sigma^T\Sigma V^T$, so $(A^T A)^{-1} A^T = V(\Sigma^T\Sigma)^{-1}V^T V\Sigma^T U^T = V(\Sigma^T\Sigma)^{-1}\Sigma^T U^T = V\Sigma^+ U^T$. As a consequence, the singular values of $(A^T A)^{-1} A^T$ are the reciprocals of those of $A$, so the largest is $1/\sigma_n$:
$$\left\| (A^T A)^{-1} A^T \right\|_2 = \frac{1}{\sigma_n}. \qquad \square$$

The following lemma states that the distance between the $k$-th singular values of two matrices is bounded by the norm of the difference of the matrices.

Lemma 1.2.5. Let $A$ and $R$ be $m \times n$ matrices. If $\tau_1 \geq \cdots \geq \tau_{\min\{m,n\}}$ are the singular values of $A + R$ and $\sigma_1 \geq \cdots \geq \sigma_{\min\{m,n\}}$ are those of $A$, then $|\tau_k - \sigma_k| \leq \|R\|_2$ for all $k = 1, \ldots, \min\{m,n\}$.

Proof. Applying the Weyl inequalities (see Theorem A.3 in the Appendix) to the symmetric matrices $\widetilde{A} := \begin{pmatrix} 0 & A^T \\ A & 0 \end{pmatrix}$ and $\widetilde{A} + \widetilde{R} := \begin{pmatrix} 0 & (A+R)^T \\ A+R & 0 \end{pmatrix}$ yields
$$\lambda_i(\widetilde{A} + \widetilde{R}) \leq \lambda_{i+j}(\widetilde{A}) + \lambda_{m+n-j}(\widetilde{R}) \qquad (1.2)$$
for $j = 0, \ldots, m+n-i$, $i = 1, \ldots, m+n$, where $\lambda_i(\cdot)$ denotes the $i$-th eigenvalue, in nondecreasing order, of the matrix in the argument, and
$$\lambda_i(\widetilde{A} + \widetilde{R}) \geq \lambda_{i-j+1}(\widetilde{A}) + \lambda_j(\widetilde{R}) \qquad (1.3)$$
for $j = 1, \ldots, i$, $i = 1, \ldots, m+n$. Now we use the relations between the sets of eigenvalues of $\widetilde{A}$, $\widetilde{R}$ and $\widetilde{A} + \widetilde{R}$ and the sets of singular values of $A$, $R$ and $A+R$, respectively. If $\eta_1 \geq \cdots \geq \eta_{\min\{m,n\}}$ are the singular values of $R$, following the same argument as in the proof of Theorem 1.1.3, we obtain from the eigenvalue inequality (1.2), rewritten with $i = m+n-k+1$ and $j = 0$, that
$$\tau_k \leq \sigma_k + \eta_1.$$
Using the eigenvalue inequality (1.3) with $i = m+n-k+1$ and $j = 1$, we find that $\lambda_{m+n-k+1}(\widetilde{A} + \widetilde{R}) \geq \lambda_{m+n-k+1}(\widetilde{A}) + \lambda_1(\widetilde{R})$, which, in terms of singular values, is
$$\tau_k \geq \sigma_k - \eta_1.$$
It follows that $|\tau_k - \sigma_k| \leq \eta_1 = \|R\|_2$, for all $k = 1, \ldots, \min\{m,n\}$. $\square$

Now we define the orthogonal projectors and show some properties that will be useful in the next chapters.

Definition 1.21. An orthogonal projector is a symmetric matrix $P$ such that $P^2 = P$. It holds that $0 \preceq P \preceq I$.

An orthogonal projector is completely determined by its range: we write $P_M$ to denote the unique orthogonal projector with $\mathrm{range}(P_M) = \mathrm{range}(M)$.

If $M$ has full column rank, the projector onto its range is $P_M = M(M^T M)^{-1} M^T$. Given an orthogonal projector $P$, the orthogonal projector onto the complementary subspace $\mathrm{range}(P)^{\perp}$ is $I - P$.

Remark 1.22. In particular, if $M$ has orthonormal columns, the projector onto its range is $P_M = MM^T$.

Proposition 1.2.6. Given an orthogonal matrix $U$, we have $U^T P_M U = P_{U^T M}$.

Proof. $P := U^T P_M U$ is an orthogonal projector. Indeed, it is symmetric,
$$P^T = U^T P_M^T U = U^T P_M U = P,$$
because $P_M$ is symmetric; moreover
$$P^2 = U^T P_M U U^T P_M U = U^T P_M^2 U = U^T P_M U = P,$$
because the same holds for $P_M$, which is itself a projector. Now,
$$\mathrm{range}(P) = U^T \mathrm{range}(M) = \mathrm{range}(U^T M),$$
so that $P = P_{U^T M}$. $\square$

We are going to compare the spectral norms of the orthogonal projections of a matrix onto vector spaces that are in a subspace relation.

Proposition 1.2.7. Let $N, M$ be matrices in $\mathbb{R}^{n\times n}$, and assume that $\mathrm{range}(N) \subset \mathrm{range}(M)$. Then, for each matrix $A$, it holds that $\|P_N A\|_2 \leq \|P_M A\|_2$ and that $\|(I - P_M)A\|_2 \leq \|(I - P_N)A\|_2$.

Proof. Since $P_N$ is a projector, we have $P_N \preceq I$, so by the conjugation rule $P_M P_N P_M \preceq P_M$. In addition, $\mathrm{range}(N) \subset \mathrm{range}(M)$ implies that $P_M P_N = P_N$, so
$$P_M P_N P_M = P_N P_M = (P_M P_N)^T = P_N^T = P_N.$$
Therefore $P_N \preceq P_M$, which implies $A^T P_N A \preceq A^T P_M A$, so we have
$$\|P_N A\|_2^2 = \|A^T P_N A\|_2 \leq \|A^T P_M A\|_2 = \|P_M A\|_2^2.$$
On the other hand, the orthogonal complements satisfy $\mathrm{range}(M)^{\perp} \subset \mathrm{range}(N)^{\perp}$, so the same argument can be used to prove that $\|(I - P_M)A\|_2 \leq \|(I - P_N)A\|_2$. $\square$

Another property, which will be useful in dealing with matrices with a slow decay of singular values, is summed up by the following proposition:

Proposition 1.2.8. Let $P$ be an orthogonal projector, and $M$ be a matrix. For each $q > 0$ we have
$$\|PM\|_2 \leq \left\| P (M M^T)^q M \right\|_2^{1/(2q+1)}.$$

Proof. First of all we show that, given an orthogonal projector $R$ and a nonnegative diagonal matrix $D$, if $\alpha \geq 1$ is a natural number then $\|RDR\|_2^{\alpha} \leq \|RD^{\alpha}R\|_2$. To this aim, let $x$ be a unit vector such that $x^T(RDR)x = \|RDR\|_2$. We must have $Rx = x$: otherwise it would be $\|Rx\|_2 < 1$, and then, taking $y = \frac{Rx}{\|Rx\|_2}$, we would obtain
$$y^T(RDR)y = \frac{(Rx)^T(RDR)(Rx)}{\|Rx\|_2^2} = \frac{x^T(RDR)x}{\|Rx\|_2^2} > x^T(RDR)x,$$
which contradicts the hypothesis on $x$. So, if $x_j$ are the entries of $x$ and $d_j$ are the diagonal elements of $D$, it holds that
$$\|RDR\|_2^{\alpha} = (x^T(RDR)x)^{\alpha} = (x^T D x)^{\alpha} = \left( \sum_j d_j x_j^2 \right)^{\alpha} \leq \sum_j d_j^{\alpha} x_j^2 = x^T D^{\alpha} x = (Rx)^T D^{\alpha} (Rx) \leq \|R D^{\alpha} R\|_2,$$
where we have applied Jensen's inequality (see Theorem A.4 in the Appendix) to the function $z \mapsto |z|^{\alpha}$, which is convex for $\alpha \geq 1$, with $\sum_j x_j^2 = 1$.

Now we consider an SVD $M = U\Sigma V^T$ and we compute:
$$\|PM\|_2^{2(2q+1)} = \|PMM^TP\|_2^{2q+1} = \left\| (U^T P U)\Sigma^2 (U^T P U) \right\|_2^{2q+1} \leq \left\| (U^T P U)\Sigma^{2(2q+1)} (U^T P U) \right\|_2 = \left\| P (MM^T)^{2q+1} P \right\|_2 = \left\| P (MM^T)^q M M^T (MM^T)^q P \right\|_2 = \left\| P (MM^T)^q M \right\|_2^2,$$
which implies the sought inequality. Notice that we have used the relation proved in the first part applied to $U^T P U$, which is an orthogonal projector. $\square$

1.3 Results from theory of probability involving Gaussian matrices

The results presented in this section can be found mainly in [14, 10]. We refer also to [5,7] for some proofs.

First, we recall that a standard Gaussian matrix is a matrix Ω whose entries are independent, identically distributed, zero-mean and unit-variance Gaussian random variables.

The first result we are going to state provides a lower bound, holding with high probability, on the least singular value of a rectangular standard Gaussian matrix. We will use this result because the spectral norm of the pseudoinverse of a standard Gaussian matrix is the inverse of the least singular value of the matrix itself. The statement is preceded by the definition of the $\Gamma$ function, used in the proof.

Definition 1.23. We define the $\Gamma$ function on the set of complex numbers with positive real part as $\Gamma(z) = \int_0^{\infty} x^{z-1} e^{-x}\, dx$. For any $n \in \mathbb{N}$, $\Gamma(n) = (n-1)!$.

Lemma 1.3.1. Given integers $0 < k \leq l$, let $\Omega \in \mathbb{R}^{l\times k}$ be a standard Gaussian matrix and $\beta > 0$ such that
$$\tilde{q} := 1 - \frac{1}{\sqrt{2\pi(l-k+1)}} \left( \frac{e}{(l-k+1)\beta} \right)^{l-k+1} \qquad (1.4)$$
is nonnegative. Then the least singular value of $\Omega$ is greater than or equal to $1/(\sqrt{l}\,\beta)$ with probability at least $\tilde{q}$.

Proof. Let $\lambda_{\min}$ be the least eigenvalue of $\Omega^T\Omega$, corresponding to the square of the least singular value of $\Omega$. First one proves that, for every $A > 0$ and $x > 0$,
$$P\!\left(\lambda_{\min} \leq \frac{A^2 l}{x^2}\right) < \frac{1}{\Gamma(l-k+2)} \left( \frac{Al}{x} \right)^{l-k+1}.$$
Then we use the fact that $\Gamma(x+1) > \sqrt{2\pi}\, x^{x+\frac{1}{2}} e^{-x}$, obtaining
$$P\!\left(\lambda_{\min} \leq \frac{A^2 l}{x^2}\right) < \frac{1}{\sqrt{2\pi}\, (l-k+1)^{l-k+\frac{3}{2}}\, e^{-(l-k+1)}} \left( \frac{Al}{x} \right)^{l-k+1}.$$
The first bound can be obtained in the following way: if $f_{\lambda_{\min}}$ is the probability density function of $\lambda_{\min}$, we have $P\!\left(\lambda_{\min} \leq \frac{A^2 l}{x^2}\right) = \int_0^{A^2 l/x^2} f_{\lambda_{\min}}(t)\, dt$. Let us consider an inequality which can be found in [6]:
$$f_{\lambda_{\min}}(x) \leq L_{k,l}\, e^{-x/2}\, x^{\frac{l-k-1}{2}}, \qquad \text{where } L_{k,l} = 2^{\frac{l-k-1}{2}}\, \frac{\Gamma\!\left(\frac{l+1}{2}\right)}{\Gamma\!\left(\frac{k}{2}\right)\Gamma(l-k+1)}.$$
We obtain
$$P\!\left(\lambda_{\min} \leq \frac{A^2 l}{x^2}\right) \leq L_{k,l} \int_0^{A^2 l/x^2} e^{-t/2}\, t^{\frac{l-k-1}{2}}\, dt \leq L_{k,l} \int_0^{A^2 l/x^2} t^{\frac{l-k-1}{2}}\, dt = L_{k,l}\, \frac{2}{l-k+1} \left( \frac{A^2 l}{x^2} \right)^{\frac{l-k+1}{2}} = \frac{\Gamma\!\left(\frac{l+1}{2}\right)}{\Gamma\!\left(\frac{k}{2}\right)\Gamma(l-k+2)} \left( \frac{2}{l} \right)^{\frac{l-k+1}{2}} \left( \frac{Al}{x} \right)^{l-k+1}.$$
In addition, since $k \leq l$, one can prove that $\Gamma\!\left(\frac{k}{2}\right)\left(\frac{l}{2}\right)^{\frac{l-k+1}{2}} > \Gamma\!\left(\frac{l+1}{2}\right)$, so that
$$P\!\left(\lambda_{\min} \leq \frac{A^2 l}{x^2}\right) < \frac{1}{\Gamma(l-k+2)} \left( \frac{Al}{x} \right)^{l-k+1}.$$
For more details see [5]. Now, choosing $A$ and $x$ such that $\frac{Al}{x} = \frac{1}{\beta}$, we have, equivalently,
$$P\!\left(\lambda_{\min} \geq \frac{A^2 l}{x^2}\right) \geq 1 - \frac{1}{\sqrt{2\pi}\, (l-k+1)^{l-k+\frac{3}{2}}\, e^{-(l-k+1)}} \left( \frac{Al}{x} \right)^{l-k+1} = 1 - \frac{1}{\sqrt{2\pi(l-k+1)}} \left( \frac{e}{(l-k+1)\beta} \right)^{l-k+1} = \tilde{q}.$$
So the right hand side is what we wanted to achieve. On the other hand, with this choice, the left hand side becomes $P\!\left(\lambda_{\min} \geq \frac{1}{l\beta^2}\right)$, that is, $P\!\left(\sigma_{\min}^2 \geq \frac{1}{l\beta^2}\right) = P\!\left(\sigma_{\min} \geq \frac{1}{\sqrt{l}\,\beta}\right)$: this proves the result. $\square$

The following result provides an upper bound, holding with high probability, on the largest singular value of a rectangular standard Gaussian matrix.

Lemma 1.3.2. Assume we are given positive integers $l, m, n$ such that $l, m \leq n$, a standard Gaussian matrix $\Omega \in \mathbb{R}^{l\times m}$, and a real $\gamma > 1$ such that
$$\tilde{p} := 1 - \frac{1}{4(\gamma^2 - 1)\sqrt{\pi n \gamma^2}} \left( \frac{2\gamma^2}{e^{\gamma^2 - 1}} \right)^{n} \qquad (1.5)$$
is nonnegative. Then the largest singular value of $\Omega$ is at most $\sqrt{2n}\,\gamma$ with probability at least $\tilde{p}$.

Proof. The result follows from the two facts below:

• Given $k, l, m, n$ such that $k \leq m$ and $l \leq n$, a matrix $A \in \mathbb{R}^{m\times n}$ and a submatrix $B \in \mathbb{R}^{k\times l}$, the largest singular value of $B$ is less than or equal to the largest singular value of $A$.
• Let $\Omega \in \mathbb{R}^{n\times n}$ be a standard Gaussian matrix and $\gamma > 1$ be a constant satisfying the hypothesis of the lemma. Then the largest singular value of $\Omega$ is at most $\sqrt{2n}\,\gamma$ with probability at least $\tilde{p}$.

The first fact is a consequence of the fact that the spectral norm of a submatrix $B$ of a matrix $A$ is always less than or equal to the spectral norm of $A$, as shown in Lemma 1.1.1.

The proof of the second fact relies on the argument used in the proof of equation (8.3) in [7]: there one reads that, if $\lambda_1$ is the largest eigenvalue of $\Omega\Omega^T$, then
$$P\!\left(\lambda_1 > 2n\sigma^2\gamma^2\right) < 1 - \tilde{p}.$$
Therefore, since in our case the variance is $\sigma^2 = 1$, the largest singular value of $\Omega$, which equals $\sqrt{\lambda_1}$, satisfies $P\!\left(\sqrt{\lambda_1} \leq \sqrt{2n}\,\gamma\right) = P\!\left(\lambda_1 \leq 2n\gamma^2\right) \geq \tilde{p}$. The formula is proved by writing $P\!\left(\lambda > 2n\sigma^2\gamma^2\right) = \int_{2n\gamma^2}^{\infty} \varphi(\lambda)\, d\lambda$, where $\lambda = \lambda_1$ for simplicity and $\varphi(\lambda)$ is the probability density of $\lambda_1$. In particular, it results that
$$\varphi(\lambda) \leq \frac{\lambda^{n-3/2}\, e^{-\lambda/(2\sigma^2)}}{(2\sigma^2)^{n-1/2}\, \pi^{1/2}\, (\Gamma(n/2))^2}.$$
This leads to
$$P\!\left(\lambda > 2n\sigma^2\gamma^2\right) < \frac{(\gamma^2 n)^{n-1/2}\, e^{-\gamma^2 n}}{\pi^{1/2}\, (\Gamma(n/2))^2\, (\gamma^2 - 1)\, n} < \frac{1}{4(\gamma^2 - 1)(\gamma^2 \pi n)^{1/2}} \left( \frac{2\gamma^2}{e^{\gamma^2 - 1}} \right)^{n},$$
where we have used Stirling's formula $(\Gamma(n/2))^2 > \dfrac{\pi\, n^{n-1}}{e^{n}\, 2^{n-2}}$. $\square$

Next, we consider the norm (both Frobenius and spectral) of a Gaussian matrix. First we give an upper bound to the expected value of the norm of a scaled Gaussian matrix:

Proposition 1.3.3. Given two matrices $S$ and $T$ and a standard Gaussian matrix $\Omega$, then
$$\mathbb{E}\!\left[ \|S\Omega T\|_F^2 \right]^{1/2} = \|S\|_F \|T\|_F \qquad \text{and} \qquad \mathbb{E}\!\left[ \|S\Omega T\|_2 \right] \leq \|S\|_2 \|T\|_F + \|S\|_F \|T\|_2.$$

Proof. For the first relation, because the distribution of a Gaussian matrix is invariant under orthogonal transformations and the Frobenius norm is orthogonally invariant, we can suppose that $S$ and $T$ are diagonal (if not, we can take an SVD of each and use the invariance under orthogonal transformations). Then
$$\mathbb{E}\!\left[ \|S\Omega T\|_F^2 \right] = \mathbb{E}\!\left[ \sum_{j,k} |s_{jj}\, g_{jk}\, t_{kk}|^2 \right] = \sum_{j,k} |s_{jj}|^2 |t_{kk}|^2 = \|S\|_F^2 \|T\|_F^2.$$
The fact that the right hand side is also orthogonally invariant completes the proof. For the second relation we refer to Theorem 3.20 in [13], whose proof relies on Gaussian processes. $\square$

Another relevant result is the following, whose proof is postponed to the end of this section.

Proposition 1.3.4. Let $\Omega \in \mathbb{R}^{k\times(k+p)}$ be a standard Gaussian matrix, with $p \geq 4$. For all $t \geq 1$ we have
$$P\!\left( \|\Omega^+\|_F \geq \sqrt{\frac{3k}{p+1}}\; t \right) \leq t^{-p} \qquad \text{and} \qquad P\!\left( \|\Omega^+\|_2 \geq \frac{e\sqrt{k+p}}{p+1}\, t \right) \leq t^{-(p+1)}.$$


Next we present a bound for the expected norm of the pseudoinverse of a Gaussian matrix. To do that, we need to introduce the χ2 distribution of a random variable.

Definition 1.24. The $\chi_k^2$ distribution, where $k$ is the number of degrees of freedom, is the distribution of a sum of $k$ squared independent standard Gaussian random variables. We notice that, if $X \sim \chi_k^2$, it holds that $\mathbb{E}[X] = k$.

Proposition 1.3.5. Let $\Omega \in \mathbb{R}^{k\times(k+p)}$ be a standard Gaussian matrix, with $k, p \geq 2$. Then
$$\mathbb{E}\!\left[ \|\Omega^+\|_F^2 \right]^{1/2} = \sqrt{\frac{k}{p-1}} \qquad \text{and} \qquad \mathbb{E}\!\left[ \|\Omega^+\|_2 \right] \leq \frac{e\sqrt{k+p}}{p}.$$

Proof. For the first identity we observe that $\|\Omega^+\|_F^2 = \mathrm{tr}[(\Omega^+)^T \Omega^+] = \mathrm{tr}[(\Omega\Omega^T)^{-1}]$, where the second equality holds almost surely because $\Omega\Omega^T$ is invertible with probability 1. The diagonal entries of $(\Omega\Omega^T)^{-1}$ are the inverses of $\chi^2_{p+1}$ random variables. The probability distribution of $\Omega\Omega^T$ is the so-called Wishart distribution $W_k(I, k+p)$, where $k+p$ (the number of columns of $\Omega$) is the number of degrees of freedom, $k$ (the number of rows of $\Omega$) is the dimension (if $k = 1$ we have a $\chi^2_{k+p}$ distribution), and $I$ is the covariance matrix of the rows of $\Omega$, thought of as Gaussian vectors. The distribution of $(\Omega\Omega^T)^{-1}$ is the so-called inverse Wishart distribution $W_k^{-1}(I, k+p)$. The diagonal elements of such a matrix (in the particular case when the matrix parameter is the identity) follow an inverse $\chi^2$ distribution with $k + p - k + 1 = p+1$ degrees of freedom. So $\|\Omega^+\|_F^2 = \sum_{j=1}^{k} X_j^{-1}$, where $X_j \sim \chi^2_{p+1}$. It holds that $\mathbb{E}[X_j^{-1}] = \frac{1}{p-1}$ and, as a consequence, $\mathbb{E}\!\left[ \|\Omega^+\|_F^2 \right] = \frac{k}{p-1}$.

To prove the second inequality, let $C = \frac{1}{\sqrt{2\pi(p+1)}} \left[ \frac{e\sqrt{k+p}}{p+1} \right]^{p+1}$. For all $E > 0$ we have
$$\mathbb{E}\!\left[ \|\Omega^+\|_2 \right] = \int_0^{\infty} P\!\left( \|\Omega^+\|_2 > t \right) dt \leq E + \int_E^{\infty} P\!\left( \|\Omega^+\|_2 > t \right) dt \leq E + C \int_E^{\infty} t^{-(p+1)}\, dt = E + \frac{C}{p}\, E^{-p},$$
where the second inequality follows from the tail bound obtained in the proof of Proposition 1.3.4. The right hand side is minimized by choosing $E = C^{1/(p+1)}$, so we have
$$\mathbb{E}\!\left[ \|\Omega^+\|_2 \right] \leq \left( 1 + \frac{1}{p} \right) C^{1/(p+1)} = \frac{e\sqrt{k+p}}{p} \cdot \frac{1}{(2\pi(p+1))^{1/(2(p+1))}} < \frac{e\sqrt{k+p}}{p}. \qquad \square$$

The following proposition characterizes the tail of the cumulative distribution function of Lipschitz functions of a Gaussian matrix: such a random variable follows a subgaussian distribution.

Proposition 1.3.6. Let $h : \mathbb{R}^{m\times n} \to \mathbb{R}$, and assume that there exists a constant $L > 0$ such that $|h(X) - h(Y)| \leq L \|X - Y\|_F$ for all $X, Y \in \mathbb{R}^{m\times n}$. If $\Omega$ is a standard Gaussian matrix, then
$$P\!\left( h(\Omega) \geq \mathbb{E}[h(\Omega)] + Lt \right) \leq e^{-t^2/2}.$$

Proof. A proof of this result can be found in [16]; it relies on the use of Brownian motion and stochastic integration. $\square$

The last result of this part shows some large deviation bounds for the norm of the pseudoinverse of a Gaussian matrix. Before giving this statement we present two auxiliary results.

Proposition 1.3.7. Let $X$ be a random variable with a $\chi_k^2$ distribution. If $0 \leq q < k/2$, we have
$$\mathbb{E}\!\left[ X^{-q} \right] = \frac{\Gamma(k/2 - q)}{2^q\, \Gamma(k/2)}.$$
In particular it holds that $\mathbb{E}\!\left[ X^{-1} \right] = (k-2)^{-1}$.

Proof. The density function of a $\chi_k^2$ random variable is $f(t) = \frac{1}{2^{k/2}\Gamma(k/2)}\, t^{k/2-1} e^{-t/2}$ for $t \geq 0$. By the integral formula for the expected value we obtain
$$\mathbb{E}\!\left[ X^{-q} \right] = \int_0^{\infty} t^{-q} f(t)\, dt = \frac{1}{2^{k/2}\Gamma(k/2)} \int_0^{\infty} t^{k/2-q-1} e^{-t/2}\, dt = \frac{2^{k/2-q}}{2^{k/2}\Gamma(k/2)} \int_0^{\infty} w^{k/2-q-1} e^{-w}\, dw = \frac{\Gamma(k/2-q)}{2^q\, \Gamma(k/2)},$$
where we used the change of variable $t = 2w$ and the definition of $\Gamma(k/2-q)$. $\square$

Lemma 1.3.8. Given a random variable $X \sim \chi_k^2$ with $k \geq 5$, for $2 \leq q \leq (k-1)/2$ we have $\mathbb{E}_q[X^{-1}] < 3/k$.

Proof. If $q < (k-1)/2$ the property follows from Hölder's inequality. Let us prove the result for $q = (k-1)/2$. From Proposition 1.3.7 we have
$$\mathbb{E}_q[X^{-1}] = \left( \frac{\Gamma(k/2 - q)}{2^q\, \Gamma(k/2)} \right)^{1/q} = \left( \frac{\Gamma(1/2)}{2^q\, \Gamma(k/2)} \right)^{1/q}.$$
Using Stirling's approximation we obtain $\Gamma(k/2) \geq \sqrt{2\pi}\, (k/2)^{(k-1)/2} e^{-k/2}$. In addition we have $\Gamma(1/2) = \sqrt{\pi}$, so that we obtain
$$\mathbb{E}_q[X^{-1}] \leq \left( \frac{\sqrt{\pi}}{2^q\, \sqrt{2\pi}\, (k/2)^{(k-1)/2}\, e^{-k/2}} \right)^{1/q} = \frac{e}{k} \left( \frac{e}{2} \right)^{1/(2q)} < \frac{3}{k},$$
using the fact that $q \geq 2$. $\square$

Proof of Proposition 1.3.4. The result involving the spectral norm is a straightforward consequence of Lemma 1.3.1. Indeed, since the spectral norm of the pseudoinverse of a matrix is the inverse of its smallest singular value, applying the cited lemma yields
$$P\!\left( \|\Omega^+\|_2 > x \right) = P\!\left( \sigma_{\min} < \frac{1}{x} \right) < \frac{1}{\sqrt{2\pi(p+1)}} \left( \frac{e\sqrt{k+p}}{(p+1)x} \right)^{p+1}.$$
Substituting $x$ with $\frac{e\sqrt{k+p}}{p+1}\, t$, the right hand side becomes $\frac{1}{\sqrt{2\pi(p+1)}}\, t^{-(p+1)}$, which is less than $t^{-(p+1)}$. This leads to
$$P\!\left( \|\Omega^+\|_2 \geq \frac{e\sqrt{k+p}}{p+1}\, t \right) \leq t^{-(p+1)}.$$
On the other hand, the result involving the Frobenius norm is equivalent to
$$P\!\left( \|\Omega^+\|_F^2 > \frac{3k}{p+1}\, t \right) \leq t^{-p/2}.$$

To this end, we define the random variable $Z := \|\Omega^+\|_F^2 = \mathrm{tr}((\Omega\Omega^T)^{-1})$. Indeed we have $\|\Omega^+\|_F^2 = \mathrm{tr}((\Omega^+)^T \Omega^+)$, and
$$(\Omega^+)^T \Omega^+ = (\Omega\Omega^T)^{-1}\, \Omega\, \Omega^T\, (\Omega\Omega^T)^{-1} = (\Omega\Omega^T)^{-1},$$
where the latter equality holds almost surely, with the same probability with which $\Omega\Omega^T$ is invertible (see Lemma 1.3.1). We can represent $Z$ as $\sum_{j=1}^{k} X_j^{-1}$, where $X_j^{-1}$ is an inverted $\chi^2_{p+1}$ random variable, for all $j$: as observed in Proposition 1.3.5, the diagonal elements of $(\Omega\Omega^T)^{-1}$ follow this distribution. Then, if $q := \frac{p}{2}$, we can bound the $q$-th moment as follows:
$$\mathbb{E}_q[Z] = \mathbb{E}_q\!\left[ \sum_{j=1}^{k} X_j^{-1} \right] \leq \sum_{j=1}^{k} \mathbb{E}_q[X_j^{-1}].$$
Now, in general, if $X \sim \chi^2_m$ with $m \geq 5$ and $2 \leq q \leq \frac{m-1}{2}$, then $\mathbb{E}_q[X^{-1}] < \frac{3}{m}$, as Lemma 1.3.8 states. So we obtain $\mathbb{E}_q[X_j^{-1}] < \frac{3}{p+1}$ for all $j = 1, \ldots, k$, which leads to
$$\left[ \mathbb{E}[Z^q] \right]^{1/q} = \mathbb{E}_q[Z] < \frac{3k}{p+1}.$$
Using Markov's inequality, we find that
$$P\!\left( Z \geq \frac{3k}{p+1}\, t \right) = P\!\left( Z^q \geq \left( \frac{3k}{p+1}\, t \right)^{q} \right) \leq \left( \frac{3k}{p+1}\, t \right)^{-q} \mathbb{E}[Z^q] < t^{-q}. \qquad \square$$

Let us illustrate the latter result by considering particular values of $t$, $k$ and $p$, generating some random matrices and computing the actual norms of their pseudoinverses.

Example 1.1. Take $k = 500$, $p = 4$ and generate $1000$ standard Gaussian matrices $\{\Omega_{k,i}\}_{i=1}^{1000}$ in $\mathbb{R}^{k\times(k+p)}$. In this case, Proposition 1.3.4 states that
$$P\!\left( \|\Omega^+\|_F \geq t \right) \leq \left( t \sqrt{\frac{p+1}{3k}} \right)^{-p} = \left( t \sqrt{\frac{1}{300}} \right)^{-4} \approx 9.02 \cdot 10^4\, t^{-4}$$
and
$$P\!\left( \|\Omega^+\|_2 \geq t \right) \leq \left( \frac{t(p+1)}{e\sqrt{k+p}} \right)^{-(p+1)} \approx 2.71 \cdot 10^5\, t^{-5}.$$
We have computed both the Frobenius and the spectral norm of the pseudoinverses of the generated matrices and, for each $t \in \{1, \ldots, 150\}$, we have counted in how many cases these quantities were greater than $t$. The two plots in Figure 1.1 show the comparison between the empirical values and the functions that theoretically estimate them. In particular, the plot on the left shows the comparison between the function $f_1(t) := \left( t\sqrt{\frac{p+1}{3k}} \right)^{-p}$ and the quantity $s_1(t) := \frac{1}{1000}\sum_{i=1}^{1000} \delta_{\{\|\Omega^+\|_F \geq t\}}$, where $\delta_{\{\|\Omega^+\|_F \geq t\}} = 1$ if and only if $\|\Omega^+\|_F \geq t$ and it is $0$ otherwise. The plot on the right shows the comparison between the function $f_2(t) := \left( \frac{t(p+1)}{e\sqrt{k+p}} \right)^{-(p+1)}$ and the quantity $s_2(t) := \frac{1}{1000}\sum_{i=1}^{1000} \delta_{\{\|\Omega^+\|_2 \geq t\}}$, where $\delta$ is defined as above. Both plots use a logarithmic scale on the ordinate axis.

We verified that for each $t$ the empirically computed value $s_1(t)$ or $s_2(t)$ is less than the corresponding value of the function $f_1(t)$ or $f_2(t)$, respectively, and that asymptotically they are very close.

Figure 1.1: Comparison, in logarithmic scale on the ordinate axis, between the bounds of Proposition 1.3.4 and the empirical frequencies. Left (Frobenius norm): the function $f_1(t) := \left(t\sqrt{\frac{p+1}{3k}}\right)^{-p}$ versus the quantity $s_1(t) := \frac{1}{1000}\sum_{i=1}^{1000}\delta_{\{\|\Omega^+\|_F \geq t\}}$. Right (spectral norm): the function $f_2(t) := \left(\frac{t(p+1)}{e\sqrt{k+p}}\right)^{-(p+1)}$ versus the quantity $s_2(t) := \frac{1}{1000}\sum_{i=1}^{1000}\delta_{\{\|\Omega^+\|_2 \geq t\}}$.
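The experiment of Example 1.1 can be reproduced along the following lines (Python/NumPy). To keep the run cheap, a reduced number of trials is used here, so the empirical frequencies are only indicative; the bounds $f_1$ and $f_2$ are those of Proposition 1.3.4.

```python
import numpy as np

rng = np.random.default_rng(7)
k, p, trials = 500, 4, 200          # Example 1.1 uses 1000 trials; 200 keeps this cheap

fro, spec = [], []
for _ in range(trials):
    Omega = rng.standard_normal((k, k + p))
    P = np.linalg.pinv(Omega)
    fro.append(np.linalg.norm(P, 'fro'))
    spec.append(np.linalg.norm(P, 2))
fro, spec = np.array(fro), np.array(spec)

for t in (5, 10, 20, 40):
    s1 = np.mean(fro >= t)                                   # empirical frequency
    f1 = (t * np.sqrt((p + 1) / (3 * k))) ** (-p)             # bound for the Frobenius norm
    s2 = np.mean(spec >= t)
    f2 = (t * (p + 1) / (np.e * np.sqrt(k + p))) ** (-(p + 1))  # bound for the spectral norm
    print(f"t={t:3d}  s1={s1:.3f} (f1={f1:.3f})   s2={s2:.3f} (f2={f2:.3f})")
```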

Chapter 2

Randomized SVD

In this chapter, we discuss a class of randomized algorithms which can be used to compute an approximate SVD of a given matrix $A$. The case of interest is when $A \in \mathbb{R}^{m\times n}$ is a matrix characterized by a large number of rows and columns, but also by a $\delta$-rank $k$ that is small compared to the dimensions of the matrix. In this case we would like to construct an approximation to the standard decomposition in which the matrices involved have small dimensions.

2.1 Introduction

Examples of matrices with low $\delta$-rank are the Cauchy matrices $C = (c_{ij}) \in \mathbb{R}^{m\times n}$, with entries $c_{ij} = \frac{s_i t_j}{x_i + y_j}$, constructed from vectors $s = (s_i) \in \mathbb{R}^m$ and $t = (t_j) \in \mathbb{R}^n$, and from two disjoint sets $\{0 < x_1 < \cdots < x_m\}$ and $\{0 < y_1 < \cdots < y_n\}$ of positive real numbers. Other matrices characterized by a low $\delta$-rank, and by singular values which decay rapidly to 0, are the positive definite Hankel matrices, which have constant entries along the anti-diagonals. They can be represented by a vector $a \in \mathbb{R}^{m+n-1}$, such that $H(a) = (h_{ij})$, with $h_{ij} = a_{i+j-1}$.

We will state two versions of the problem: the fixed-rank and the fixed-accuracy problem. The next Sections 2.2 and 2.3 focus on the fixed-rank problem, in which we achieve a good approximant, with fixed rank, to the matrix A.

The algorithm we are going to use, in its most general form, can be divided into two stages, following the structure described in [10]:

• Stage A: use a randomized scheme to approximate the range of $A$ and compute an orthonormal basis $Q$ of the obtained space. Here we seek a matrix $Q$ containing as few columns as possible, such that $A \approx QQ^T A$. In other words, since $QQ^T A = P_Q A$, we would like to find a matrix $Q$ such that the projection of $A$ onto the subspace $\mathrm{range}(Q)$ is a good approximation to $A$.

• Stage B: construct an approximation to a standard factorization (the SVD in this case) of $A$ using the matrix $Q$; to this end we use deterministic approaches. A sketch of both stages is given right after this list.
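A minimal sketch of the two stages in Python/NumPy follows. It mirrors the Stage A / Stage B structure described above, but the concrete parameter choices (a target rank $k$ plus a small oversampling $p$, and the use of a QR factorization to orthonormalize) are implementation assumptions of this illustration rather than prescriptions taken from the text.

```python
import numpy as np

def randomized_svd(A, k, p=10, rng=None):
    """Two-stage randomized SVD sketch: returns U, s, Vt truncated to rank k."""
    rng = np.random.default_rng(rng)
    m, n = A.shape

    # Stage A: sample the range of A with a Gaussian test matrix and orthonormalize.
    Omega = rng.standard_normal((n, k + p))
    Q, _ = np.linalg.qr(A @ Omega)            # A ~ Q Q^T A

    # Stage B: deterministic SVD of the small matrix B = Q^T A, lifted back by Q.
    B = Q.T @ A                               # (k+p) x n
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ Ub
    return U[:, :k], s[:k], Vt[:k, :]

# Usage on a synthetic matrix with rapidly decaying singular values.
rng = np.random.default_rng(8)
m, n = 400, 300
U0, _ = np.linalg.qr(rng.standard_normal((m, n)))
V0, _ = np.linalg.qr(rng.standard_normal((n, n)))
sigma = 2.0 ** -np.arange(n)
A = (U0 * sigma) @ V0.T

U, s, Vt = randomized_svd(A, k=20)
print("approximation error:", np.linalg.norm(A - (U * s) @ Vt, 2))
```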

Section 2.4, instead, refers to the fixed-accuracy problem: to solve it, we can fix a rank, solve the fixed-rank problem and check the accuracy. If it is worse than requested, we repeat the procedure with a larger rank, until the desired accuracy is reached. A simplified sketch of this strategy follows.
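The following Python/NumPy sketch illustrates the rank-increasing strategy just described. Here the accuracy check is performed by explicitly forming $A - QQ^TA$, which is affordable only for moderate sizes and is an assumption of this illustration; the fixed-accuracy algorithm discussed in Section 2.4 relies on cheaper error estimates, and the doubling of the candidate rank is likewise only one possible choice.

```python
import numpy as np

def fixed_accuracy_basis(A, tol, k0=10, rng=None):
    """Increase the number of sampled columns until ||A - Q Q^T A||_2 <= tol."""
    rng = np.random.default_rng(rng)
    m, n = A.shape
    k = k0
    while True:
        Omega = rng.standard_normal((n, k))
        Q, _ = np.linalg.qr(A @ Omega)
        if np.linalg.norm(A - Q @ (Q.T @ A), 2) <= tol or k >= min(m, n):
            return Q
        k = min(2 * k, min(m, n))            # accuracy not reached: enlarge the rank and retry

rng = np.random.default_rng(9)
A = rng.standard_normal((300, 30)) @ rng.standard_normal((30, 200))   # exact rank 30
Q = fixed_accuracy_basis(A, tol=1e-8 * np.linalg.norm(A, 2))
print("columns of Q:", Q.shape[1], " error:", np.linalg.norm(A - Q @ (Q.T @ A), 2))
```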

Throughout this chapter, we denote by $\sigma_1 \geq \cdots \geq \sigma_{\min\{m,n\}}$ the singular values of $A \in \mathbb{R}^{m\times n}$.

In particular, we refer to the case where the matrix $A$ comes with an algorithm allowing a fast matrix-vector product computation. Section 2.2 describes an algorithm
