Alma Mater Studiorum – Università di Bologna
DOTTORATO DI RICERCA IN
Scienze Statistiche
Ciclo XXVIII
Settore Concorsuale di afferenza: 13/D1
Settore Scientifico disciplinare: SECS/S-01
TITOLO TESI
Large covariance matrix estimation
by composite minimization
Presentata da: Matteo Farnè

Coordinatore Dottorato: Prof.ssa Alessandra Luati
Relatore: Prof.ssa Angela Montanari
by composite minimization
Matteo Farnè
Dipartimento di Scienze Statistiche, Università di Bologna
"Illi autem octo cursus, in quibus eadem vis est duorum, septem efficiunt distinctos intervallis sonos, qui numerus rerum omnium fere nodus est; quod docti homines nervis imitati atque cantibus aperuerunt sibi reditum in hunc locum, sicut alii, qui praestantibus ingeniis in vita humana divina studia coluerunt." Marcus Tullius Cicero, Somnium Scipionis, 6.18
At the end of a particularly exciting and laborious work, I feel the need to address my thanks to the people who accompanied and helped me to complete such a huge effort.

First of all, I desire to express my gratitude to Professor Angela Montanari, my supervisor and mentor, for the continuous support and encouragement, for the valuable advice and suggestions, for her passion and for the unique chances of growth and maturation she offered me across all my study path.

I want to thank Professor Trevor Hastie, for the brief but decisive research period spent at Stanford, which gave me the necessary insights to develop my own theory and to effectively provide original solutions to my research problem.

I have to thank Dr Yuan Liao for helpful discussions and precious references, Professor Roberto Rocci for his punctual annotations, Professor Enrico Bernardi for his useful ideas on the mathematical part and Professor Nicola Guglielmi for his strong directions on the algorithmic part.

A particularly warm acknowledgment goes to the Supervisory Statistics Division of the European Central Bank, where I spent a semester as a PhD trainee: they welcomed me and allowed me to complete my PhD studies developing crucial skills for real data analysis in large dimensions. A specific thanksgiving goes to Angelos Vouldis, wonderful research supervisor, to Micheal Fedesin, for his trust and enthusiasm, and to Patrick Hogan for his valuable managing.

Finally, I want to thank my family, for the long and patient guide, aid and support, my friends, for they often alleviated this labour, and Silvia, for her continuous, passionate and enthusiastic listening and incitement.
The present thesis concerns large covariance matrix estimation via composite minimization under the assumption of low rank plus sparse structure. Existing methods like POET (Principal Orthogonal complEment Thresholding) perform estimation by extracting principal components and then applying a soft thresholding algorithm. In contrast, our method recovers the low rank plus sparse decomposition of the covariance matrix by least squares minimization under nuclear norm plus $l_1$ norm penalization. This non-smooth convex minimization procedure is based on semidefinite programming and subdifferential methods, resulting in two separable problems solved by a singular value thresholding plus soft thresholding algorithm.

The most recent estimator in the literature is called LOREC (LOw Rank and sparsE Covariance estimator) and provides non-asymptotic error rates as well as identifiability conditions in the context of algebraic geometry. Our work shows that the unshrinkage of the estimated eigenvalues of the low rank component improves the performance of LOREC considerably. The same method also recovers covariance structures with very spiked latent eigenvalues, as in the POET setting, thus overcoming the necessary condition $p \le n$. In addition, it is proved that our method recovers structures with intermediate degrees of spikiness, obtaining a loss which is bounded accordingly. Then, an ad hoc model selection criterion which detects the optimal point in terms of the composite penalty is proposed. Empirical results coming from a wide original simulation study, where various low rank plus sparse settings are simulated according to different parameter values, are described, outlining in detail the improvements upon existing methods. Two real data-sets are finally explored, highlighting the usefulness of our method in practical applications.

Keywords: covariance matrix, nuclear norm, thresholding, low rank plus sparse decomposition, unshrinkage.
Acknowledgments
Abstract
1 Introduction
2 State of the art
2.1 Sample covariance matrix estimators
2.1.1 The Maximum Likelihood covariance estimator
2.1.2 The unbiased covariance estimator: fixed n context
2.1.3 Covariance matrix estimation: the IID data context
2.2 Conditioning properties
2.2.1 Matrix conditioning as an ill-posed inverse problem
2.3 Ledoit and Wolf's approach
2.3.1 General Asymptotics
2.4 Sparse covariance matrix estimation
2.5 Factor analysis based estimators
2.5.1 Strict factor model
2.5.2 PCA and factor analysis
2.5.3 Approximate factor model
2.5.4 POET estimator
3 Numerical and computational aspects
3.1 An historical review
3.1.1 l1 norm heuristics
3.1.2 Nuclear norm heuristics
3.1.3 l1 norm plus nuclear norm
3.2 Analytical and algorithmic aspects
3.2.1 Numerical context: a semidefinite program
3.2.2 Solution methods
4 Low rank plus sparse decomposition
4.1 Identification and recovery
4.1.2 Approximate recovery: a functional approach
4.1.3 Approximate recovery: an extended algebraic approach
4.1.4 Approximate recovery: LOREC approach
5 Improving LOREC
5.1 Theoretical advances
5.2 Simulation setting
5.2.1 Simulation algorithm
5.2.2 Simulated settings and comparison quantities
5.2.3 A new model selection criterion
5.3 Data analysis results
5.3.1 Simulation results
5.3.2 Real data results
Introduction
The present thesis concerns large dimensional covariance matrix estimation. Estimation of population covariance matrices from samples of multivariate data is of interest in many high-dimensional inference problems - principal components analysis, classification by discriminant analysis, inferring a graphical model structure, and others. Depending on the different goal, the interest is sometimes in inferring the eigenstructure of the covariance matrix (as in PCA) and sometimes in estimating its inverse (as in discriminant analysis or in graphical models). Examples of application areas where these problems arise include gene arrays, fMRI, text retrieval, image classification, spectroscopy, climate studies, finance and macro-economic analysis.
The theory of multivariate analysis for normal variables has been well worked out, see, for example, Anderson ([2]). However, it became apparent that exact expressions were cumbersome, and that multivariate data were rarely Gaussian. The remedy was asymptotic theory for large samples and fixed, relatively small dimensions.
In recent years, data sets that do not fit into this framework have become very common: the data are very high-dimensional and sample sizes can be very small relative to dimension. The most traditional covariance estimator, the sample covariance matrix, is shown to be dramatically ill-conditioned in a large dimensional context, where the process dimension $p$ is close to or even larger than the sample size $n$, even in the case that the true covariance matrix is well-conditioned. Some solutions to this drawback have been proposed in the asymptotic context (for example [75], [15], [45]). An alternative recent approach is by numerical optimization, which provides, in the non-asymptotic context, some solutions improving upon the mentioned ones. As described in the existing literature, two key properties of the matrix estimation process assume a particular relevance in large dimensions:
1. well conditioning, i.e. numerical stability;
2. identifiability.
Both properties are crucial for the theoretical recovery and the practical use of the estimate. A bad-conditioned estimate suffers from collinearity and causes its inverse, the precision matrix, to amplify dramatically any error in the data. A large dimension may cause the impossibility to identify the unknown covariance structure and the difficulty to interpret the results.
The first property is strongly related to regularization techniques. A basic reference in this respect is Tibshirani (1996) ([108]), where the LASSO estimation algorithm in the context of regression models was first proposed. The second property can be ensured by dimensionality reduction methods, which can be used to reduce the parameter space dimensionality.
Regularization approaches to large covariance matrix estimation have therefore started to be presented in the literature, both from theoretical and practical points of view. Some authors propose shrinkage towards the identity matrix ([75]), others consider tapering the sample covariance matrix, that is, gradually shrinking the off-diagonal elements toward zero ([54]). At the same time, a common approach is to encourage sparsity, either by a penalized likelihood approach ([53]) or by thresholding the sample covariance matrix ([100]).
For this reason, our research studies a specific regularization problem under the assumption of low rank plus sparse decomposition for the covariance matrix. Such a problem is solved exploiting non-smooth convex optimization methods. This approach allows us to properly address both reconditioning and dimensionality reduction issues and is proved to be effective even in a large dimensional context.
Our dissertation moves from a detailed outline of asymptotic approaches. In Chapter 2, we provide a thorough description of the motivation of our work and a review of some relevant asymptotic methods for covariance estimation. Maximum likelihood estimators and unbiased finite estimators are described ([2]). Specific treatment of the conditioning problem for covariance matrix estimates is given. The covariance shrinkage estimator derived by Ledoit and Wolf in the general asymptotic framework is described ([75]). Sparse covariance estimators are shown together with the underlying assumptions and the estimation error rates, with particular reference to the thresholding estimator of [15]. The POET (Principal Orthogonal complEment Thresholding) estimator ([45]), which combines Principal Component Analysis and thresholding algorithms, is analyzed in detail.
In Chapter 3, we define the regularization problem mentioned above. It is a nuclear norm plus $l_1$ norm approximation problem, and works under the assumption of low rank plus sparse structure for the covariance matrix. It is composed of a least squares loss and a composite non-smooth penalty, which is the sum of the nuclear norm of the low rank component and the $l_1$ norm of the sparse component. The numerical rationale behind the problem formulation is provided. It is shown how this problem can be recast from the point of view of numerical analysis as a semidefinite program (SDP). Nonstandard optimization tools, such as subgradient minimization methods, are needed to solve it. We describe the most recent solution algorithm and point out its effectiveness.
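The two elementary operations underlying such algorithms are easy to state concretely. Below is a minimal numpy sketch of singular value thresholding (the proximal operator of the nuclear norm) and of entrywise soft thresholding (the proximal operator of the $l_1$ norm); the function names and the snippet itself are ours, not the solver described in the literature:

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding: shrink each singular value of M by tau."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def soft_threshold(M, tau):
    """Entrywise soft thresholding: shrink each entry of M toward zero by tau."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

# svt shrinks singular values, soft_threshold shrinks entries
print(svt(np.diag([3.0, 1.0]), 1.0))               # diag(2, 0)
print(soft_threshold(np.array([2.0, -0.5]), 1.0))  # [1, 0]
```

A composite solver alternates (or separates) these two operators, one acting on the low rank component and one on the sparse component.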
In Chapter 4, we provide a wide review of existing non-asymptotic methods. The evolution path of the most recent works is figured out. The most recent developments of the numerical approach under the assumption of low rank plus sparse structure for the covariance matrix are described, starting from the basic contribution by Chandrasekaran et al. ([30]), which first proves the exact recovery of the covariance matrix in the noiseless context. This result is achieved by minimizing a specific convex non-smooth objective, which is the sum of the nuclear norm of the low rank component and the $l_1$ norm of the sparse component.

Then, the first approximate solution to recovery and identifiability in the noisy context, coming from [1], is described. In the following, the extension of [30] providing the first exact solution of the numerical problem in the noisy graphical model setting ([31]) is shown in detail. In that context, the objective is a least squares loss penalized by the above mentioned composite penalty, and its optimization allows to recover the inverse covariance matrix. In conclusion, the extension of this framework to the covariance matrix estimation context, coming from [77], is explained. The resulting estimator is called LOREC (LOw Rank and sparsE Covariance estimator).
In the last chapter (Chapter 5), an improvement over the solution described in [77] is proposed, based on the unshrinkage of the estimated eigenvalues of the low rank component. Luo's approach is completed by deriving the rates of the sparse component estimate, and the conditions for its positive definiteness and invertibility. In addition, the rates of LOREC under the conditions of POET, and, more importantly, in a context where the eigenvalues of the low rank component are allowed to grow with $p^\alpha$, $\alpha \in [0, 1]$ (generalized spikiness context), are provided.
In the following, we show the results of our procedure on both simulated and real data sets. We illustrate a new model selection criterion which is proved to be effective in our context. An original simulation study is presented, where extensive simulation results are pointed out, as well as the simulation algorithm and the estimation assessment framework.
In the end, the performance of our new proposed estimator is compared to the one of LOREC and POET under various settings. Two real examples are provided where our model is effective with respect to the competitors. In particular, the second example is a banking supervisory dataset which collects supervisory reporting indicators of the most relevant Euro Area banks. We explicitly thank the Supervisory Statistics Division of the European Central Bank, where the author spent a semester as a PhD trainee, for the allowance to use these data in anonymous form for research purposes.
Covariance matrix estimation:
state of the art
In this chapter, a short review of existing solutions to the problem of covariance matrix estimation is provided. Particular attention is given to the two properties displayed in the Introduction (well conditioning and identifiability) and to the performance of existing methods in the large dimensional context. An exhaustive review can be found in Pourahmadi (2013) ([95]).
This Chapter shows a path across existing estimators aimed at outlining the two mentioned features (well conditioning and identifiability) for each estimation setting, especially when $p$ is very large compared to the sample size $n$, or even larger. This is why, for each estimator, a detailed discussion of the asymptotic framework and of the assumptions needed to ensure consistency (i.e. the convergence to the theoretical covariance matrix) is provided.

Existing asymptotic approaches to the estimation problem are described in this Chapter, while non-asymptotic approaches will be the object of the next chapters. The description of past approaches is intended to display the main issues encountered by existing methods, with particular reference to the large dimensional context, and the reasons why we need to develop an alternative numerical approach to the covariance estimation problem.
The first paragraph (2.1) is devoted to covariance matrix estimation under the assumption of normality for the data. The maximum likelihood estimator, i.e. the sample covariance matrix, is introduced and justified. The unbiased sample covariance matrix, under the assumption of fixed $n$, is then outlined. A specific remark on the asymptotic distribution of the sample covariance matrix under the assumption of independence and identical distribution for the data concludes the section.

In the second paragraph (2.2), the conditioning properties of the sample covariance matrix are explored. The reason why the sample covariance matrix is bad-conditioned when the dimension is close to the sample size is deeply explained and analyzed, as well as the reason why the inverse covariance matrix dramatically amplifies the estimation error in case of bad conditioning.
The third paragraph (2.3) widely describes a successful attempt to address the problem of reconditioning the sample covariance matrix when the dimension is larger than the sample size: the shrinkage estimator by Ledoit & Wolf ([75]). Their motivations, their results and their asymptotic context are properly highlighted, trying to retain the key elements of their approach.

The fourth paragraph (2.4) briefly outlines existing sparsity estimators, with particular reference to the thresholding estimator by Bickel & Levina ([15]), which is described in detail with respect to model assumptions and convergence rates. There we point out the strong link between sparsity assumptions and shrinkage thresholding. That family of estimators shows how it is possible to use sparsity to recondition the covariance estimate and to significantly reduce the number of parameters.
The fifth paragraph (2.5) describes covariance matrix estimators based on factor model assumptions. A brief overview of factor model specifications and underlying assumptions across history is provided, discussing the different asymptotic contexts. The relationship between Principal Component Analysis (PCA, [72]) and factor modelling (see [59]) is crucial in this respect. Finally, the POET estimator ([45]), based on the assumption of an approximate factor model with a sparse residual matrix, is widely illustrated, pointing out the crucial assumptions for consistency and identifiability.
In [45], the population covariance matrix is assumed to be the sum of a low rank and a sparse component. POET works under the assumption of a sparse residual covariance matrix and pervasive eigenvalues of the low rank component (as $p \to \infty$). This structure is particularly convenient in a large dimensional context, and tackles both the issues mentioned above, as we will widely explain. For the same reasons, the factor analysis assumption is a key to approach covariance estimation in large dimensions. The asymptotic correspondence between PCA and factor estimation is there established according to the underlying assumptions and then exploited.

Before starting, we describe the basic matrix terminology. We restrict our analysis to the real case. The spectral theorem ensures that, when $M$ is a positive semidefinite squared $p$-dimensional real matrix with rank $r$, there exist an orthogonal $p \times r$ matrix $U$ and a diagonal $r \times r$ matrix $\Lambda$ such that
$$M = U \Lambda U' = \sum_{i=1}^{r} \lambda_i u_i u_i', \quad (2.1)$$
which is the eigenvalue decomposition of $M$. The scalars $\lambda_1, \dots, \lambda_r$ are called the eigenvalues of $M$ and are strictly larger than 0. The $r$ columns of $U$ are the eigenvectors of $M$. If $M$ is symmetric, the eigenvalues coincide with the singular values $\sigma_{1,\dots,r}$, which are the square roots of the eigenvalues of $M'M$, i.e. the absolute values of the eigenvalues of $M$. A fortiori, this happens if $M$ is a covariance matrix, which is symmetric and positive definite.

The relevant norms we are going to use throughout the entire thesis are (see also [62]):
• $||M||_2 = \sqrt{\sigma_{\max}(M'M)}$ is the spectral norm of $M$, which is its largest singular value.

• $||M||_\infty = \max_{i,j} |m_{ij}|$ is the infinity norm of $M$, which is the largest entry in magnitude.

• $||M||_F = \sqrt{\mathrm{trace}(M'M)} = \sqrt{\sum_i \sum_j m_{ij}^2}$ is the Frobenius norm of $M$, which is the square root of the sum of the squared entries of $M$.

• $||M||_* = \mathrm{trace}(\sqrt{M'M}) = \sum_{i=1}^{p} \sigma_i$, the sum of the singular values of $M$. $||M||_*$ is called nuclear norm. If $M$ is a Positive SemiDefinite matrix (PSD), $||M||_* = \mathrm{tr}(M)$, because the eigenvalues and the singular values exactly coincide.

• $||M||_1 = \sum_i \sum_j |m_{ij}|$: the sum of the absolute values of the entries of $M$.

For a $p$-dimensional vector $x$, the relevant norms for our purpose are:

• $||x||_2 = \sqrt{\sum_i x_i^2}$, the Euclidean norm of $x$.

• $||x||_1 = \sum_{i=1}^{p} |x_i|$, the $l_1$ norm of $x$.

• $||x||_\infty = \max_i |x_i|$, the maximum norm of $x$.
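These definitions are easy to verify numerically. The following small numpy sketch (ours, not part of the thesis) computes the matrix norms above and checks that, for a PSD matrix, the nuclear norm indeed reduces to the trace:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
M = A @ A.T                                  # a positive semidefinite matrix

s = np.linalg.svd(M, compute_uv=False)       # singular values of M
spectral  = s.max()                          # ||M||_2
frobenius = np.sqrt((M ** 2).sum())          # ||M||_F
nuclear   = s.sum()                          # ||M||_*
l1        = np.abs(M).sum()                  # ||M||_1
infinity  = np.abs(M).max()                  # ||M||_inf

# For a PSD matrix, the nuclear norm equals the trace
assert np.isclose(nuclear, np.trace(M))
```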
2.1 Sample covariance matrix estimators
In this paragraph we focus on the most used estimator of the covariance matrix: the sample covariance matrix. First, we derive it as the maximum likelihood estimator of the covariance matrix under the assumption of multivariate normality for our data (2.1.1). Maximum likelihood estimators are consistent when $n \to \infty$. This is why we then derive the unbiased covariance estimator under the assumption of finite $n$ (2.1.2), which is a slightly modified version of the sample covariance matrix. These two estimators converge asymptotically when $n \to \infty$, under the assumption of $p$ fixed. At the end of this paragraph, we briefly sketch the behaviour of this estimator under the assumption of independence and identical distribution for our data when $n \to \infty$ (2.1.3). Our main reference for this argument is the famous book by Anderson ([2]).
2.1.1 The Maximum Likelihood covariance estimator
Suppose we have a sample $(x_1, \dots, x_n)$ from a real-valued $p$-dimensional normal random variable $x \sim N_p(\mu^*, \Sigma^*)$, with $p \le n$. The $p \times p$ matrix $\Sigma^* = E((x - \mu^*)(x - \mu^*)')$ is real, symmetric and positive definite, while $\mu^* = E(x)$ is a $p \times 1$ vector.

The density of $x$ is the following:
$$f(x|\mu^*, \Sigma^*) = (2\pi)^{-\frac{1}{2}p} \, |\Sigma^*|^{-\frac{1}{2}} \exp\left(-\frac{1}{2}(x - \mu^*)'\Sigma^{*-1}(x - \mu^*)\right),$$
where $\mu^*$ is a $p \times 1$ vector and $\Sigma^*$ is a $p \times p$ invertible (positive definite) matrix.

The likelihood function is
$$L(\mu^*, \Sigma^*) = \prod_{i=1}^{n} N(x_i|\mu^*, \Sigma^*) = (2\pi)^{-\frac{1}{2}pn} \, |\Sigma^*|^{-\frac{1}{2}n} \exp\left[-\frac{1}{2}\sum_{i=1}^{n} (x_i - \mu^*)'\Sigma^{*-1}(x_i - \mu^*)\right].$$
The log-likelihood is then
$$\log L(\mu^*, \Sigma^*) = -\frac{1}{2}pn\log 2\pi - \frac{1}{2}n\log|\Sigma^*| - \frac{1}{2}\sum_{i=1}^{n} (x_i - \mu^*)'\Sigma^{*-1}(x_i - \mu^*).$$
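As a numerical sanity check, the log-likelihood above can be evaluated directly and compared at the maximum likelihood solution and at nearby parameter values. This is a numpy sketch of ours (`log_lik` is our own helper, not thesis code):

```python
import numpy as np

def log_lik(X, mu, Sigma):
    """Gaussian log-likelihood, matching the formula above."""
    n, p = X.shape
    Xc = X - mu
    sign, logdet = np.linalg.slogdet(Sigma)
    # quadratic term: sum_i (x_i - mu)' Sigma^{-1} (x_i - mu)
    quad = np.einsum('ij,jk,ik->', Xc, np.linalg.inv(Sigma), Xc)
    return -0.5 * n * p * np.log(2 * np.pi) - 0.5 * n * logdet - 0.5 * quad

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))
mu_ml = X.mean(axis=0)
Sigma_ml = np.cov(X.T, bias=True)        # divisor n, as in the ML estimator

# The ML pair attains a higher log-likelihood than nearby alternatives
assert log_lik(X, mu_ml, Sigma_ml) > log_lik(X, mu_ml + 0.1, Sigma_ml)
assert log_lik(X, mu_ml, Sigma_ml) > log_lik(X, mu_ml, Sigma_ml + 0.1 * np.eye(2))
```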
We denote by $\hat{\mu}_{ML}$ and $\hat{\Sigma}_{ML}$ the vector and the positive definite matrix maximizing $\log L$. They are the maximum likelihood estimators of $\mu^*$ and $\Sigma^*$. Since $\log$ is an increasing function, $\log L$ and $L$ share the same maximum with respect to our parameter estimates.

The following important theorem holds:
Theorem 2.1.1. If $x_1, \dots, x_n$ constitute a sample from $N(\mu^*, \Sigma^*)$ with $p < n$, the maximum likelihood estimators of $\mu^*$ and $\Sigma^*$ are $\hat{\mu}_{ML} = \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ and $\hat{\Sigma}_{ML} = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})'$, respectively.

The proof can be found in Anderson (1958), page 67 and following. It exploits the properties of the arithmetic mean and of positive definite matrices. The key argument is that $\log L$ can be rewritten in the following way:
$$-\frac{1}{2}pn\log 2\pi - \frac{1}{2}n\log|\Sigma^*| - \frac{1}{2}\mathrm{tr}(\Sigma^{*-1}D) - \frac{1}{2}n(\bar{x} - \mu^*)'\Sigma^{*-1}(\bar{x} - \mu^*),$$
where $D = \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})'$.

In order to perform maximization, the necessary assumption is that $\Sigma^*$ is a positive definite matrix. This condition is necessary to ensure that the term $-\frac{1}{2}n(\bar{x} - \mu^*)'\Sigma^{*-1}(\bar{x} - \mu^*)$ achieves a maximum for $\mu^* = \bar{x}$, and that the term $-\frac{1}{2}n\log|\Sigma^*| - \frac{1}{2}\mathrm{tr}(\Sigma^{*-1}D)$ achieves a maximum for $\Sigma^* = \frac{1}{n}D$.
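Both estimators of Theorem 2.1.1 take one line each to compute. The following numpy sketch (ours) also shows that `np.cov` with `bias=True` uses the same divisor $n$ as $\hat{\Sigma}_{ML}$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 500, 3
X = rng.standard_normal((n, p))          # a sample of size n from N_p(0, I)

mu_ml = X.mean(axis=0)                   # mu_hat_ML: the sample mean
Xc = X - mu_ml
sigma_ml = (Xc.T @ Xc) / n               # Sigma_hat_ML: divisor n

# np.cov with bias=True also divides by n; rescaling by n/(n-1) gives
# the unbiased estimator of Section 2.1.2
assert np.allclose(sigma_ml, np.cov(X.T, bias=True))
```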
ML estimators show a number of interesting optimality properties. In particular, they are consistent and asymptotically efficient ([34]). A theorem by Cramér ensures that $\hat{\mu}_{ML}$ and $\hat{\Sigma}_{ML}$ are asymptotically minimum variance unbiased estimators. These properties hold only as $n \to \infty$.

Note that also the condition $p < n$ is necessary in order to perform maximization. In order to see this point, we need to recall a basic theorem ([2], p.77):

Theorem 2.1.2. The maximum likelihood estimator $\hat{\mu}_{ML} = \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, from $N(\mu^*, \Sigma^*)$, is distributed according to $N(\mu^*, \frac{1}{n}\Sigma^*)$ and independently of $\hat{\Sigma}_{ML} = \hat{\Sigma} = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})'$. $n\hat{\Sigma}$ is distributed according to $\sum_{i=1}^{n-1} z_i z_i'$, where $z_i \sim N(0, \Sigma^*)$, and $z_1, \dots, z_{n-1}$ are independent.

This theorem states that, under the multivariate normality assumption for the data, $n\hat{\Sigma}$ is the sum of $n-1$ squared $p$-dimensional matrices having rank 1. If $p \ge n$, $n\hat{\Sigma}$ will never have full rank $p$.
In addition, it has been shown by Wishart ([113]) that $D = n\hat{\Sigma}$ is a matrix-valued random variable having the following distribution:
$$f(D|\Sigma^*) = \frac{|D|^{\frac{1}{2}(n-p-1)} \exp\left(-\frac{1}{2}\mathrm{tr}(\Sigma^{*-1}D)\right)}{2^{\frac{1}{2}np} \, \pi^{\frac{p(p-1)}{4}} \, |\Sigma^*|^{\frac{1}{2}n} \prod_{i=1}^{p} \Gamma\left[\frac{1}{2}(n+1-i)\right]},$$
which is a Wishart distribution with $\nu = n-1$ degrees of freedom, where $\Gamma(t) = \int_0^\infty x^{t-1}e^{-x}\,dx$ is the usual Gamma function. The proof is reported in [2] (p.252 and following). It massively exploits linear transforms of random variables, and is based on the properties of the Gram-Schmidt orthogonalization algorithm.

This result was first derived for a bivariate distribution by Fisher ([51]), where the distribution of the correlation coefficient (first defined by Karl Pearson in [91]) was also derived.
We can now understand why $p < n$ is a necessary condition. If $n \le p$, $f(D|\Sigma^*)$ is no longer a density, so that it is no longer possible to derive the asymptotic distribution for $\hat{\Sigma}$ (i.e., all the usual optimality properties of ML estimators are lost). In fact, $|D|$ would be zero, and the distribution would thus be degenerate, having null measure in $\mathbb{R}^{p \times p}$ everywhere. Note also that if $n = p + 1$, $f(D|\Sigma^*)$ has no mode, analogously to the $\chi^2$ distribution with two degrees of freedom.
In the same way, denoting by $T$ the quantity $T = (\bar{x} - \mu^*)'W^{-1}(\bar{x} - \mu^*)$, where $W = \frac{D}{n-1}$, it has been shown by Hotelling ([64]) that
$$\frac{\nu - p + 1}{\nu p}\, T^2 \sim F,$$
where $F$ is Fisher's distribution with $p$ and $\nu - p + 1$ degrees of freedom ($\nu = n - 1$). $T^2$ is called Hotelling's T-squared distribution. It is non-singular if and only if both $\hat{\mu}$ and $\hat{\Sigma}$ are non-singular, i.e. if $\Sigma^*$ is positive definite and $\nu - p + 1 > 0$ (equivalent to $n > p$).

So, both the sample mean and the sample covariance matrix are ML estimators of the true mean and the true covariance matrix if and only if the true covariance matrix is positive definite and the dimension $p$ is strictly smaller than the sample size $n$. In particular, the distribution of the sample covariance matrix is $\frac{1}{n}\,\mathrm{Wishart}(\Sigma^*, n-1)$. This means that $\hat{\Sigma}$ is biased if $n$ is finite. Note that this distribution does not change even when the true mean $\mu^*$ is known, unless $\bar{x}$ is replaced by the true $\mu^*$. In that case, the degrees of freedom are $n$ and the resulting estimator $\frac{1}{n}\sum_{i=1}^{n} (x_i - \mu^*)(x_i - \mu^*)'$ is unbiased.

2.1.2 The unbiased covariance estimator: fixed n context

In order to derive the finite sample unbiased estimator of the covariance matrix, the key result is Theorem 2.1.2 about the distribution of $D = n\hat{\Sigma} = \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})'$ shown above. A corollary of that theorem states:
Corollary 2.1.1. Let $x_1, \dots, x_n$ $(n > p)$ be independently distributed, each according to $N(\mu^*, \Sigma^*)$. The distribution of $\hat{\Sigma}_\nu = \frac{1}{\nu}\sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})'$ is $\mathrm{Wishart}(\Sigma^*, \nu)$, where $\nu = n - 1$.

This result means that $\hat{\Sigma}_{n-1} = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})'$ is the unbiased estimator of the covariance matrix when the sample size $n$ is finite. This estimator will be the input of our new estimation procedure in Chapter 4. Clearly, $\hat{\Sigma}_{n-1}$ and $\hat{\Sigma}_n$ converge asymptotically to the same estimator. We are now going to derive the asymptotic (normal) distribution of the sample covariance matrix in the more general case of IID data.

2.1.3 Covariance matrix estimation: the IID data context

Let us suppose
$x_i \sim \mathrm{IID}(\mu^*, \Sigma^*)$, $i = 1, \dots, n$. We want to derive the asymptotic distribution of
$$\hat{\Sigma}_n = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})'.$$
Under the IID hypothesis, we have:
$$E(x_i x_i') = V(x_i) + E(x_i)E(x_i') = \Sigma^* + \mu^*\mu^{*\prime},$$
$$V(x_i x_i') = V(x_i) + V(x_i) = \Sigma^* + \Sigma^* = 2\Sigma^*.$$
Our target can be rewritten as the sum of three components:
$$\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})' = \sum_{i=1}^{n} \frac{x_i x_i'}{n} - 2\sum_{i=1}^{n} \frac{\bar{x} x_i'}{n} + \sum_{i=1}^{n} \frac{\bar{x}\bar{x}'}{n}.$$
Since $\sum_{i=1}^{n} \frac{x_i}{n} \xrightarrow{prob} \mu^*$, we have that
$$-2\bar{x}\sum_{i=1}^{n} \frac{x_i'}{n} + \sum_{i=1}^{n} \frac{\bar{x}\bar{x}'}{n} = -2\bar{x}\bar{x}' + \bar{x}\bar{x}' = -\bar{x}\bar{x}'$$
converges in probability as follows:
$$-\bar{x}\bar{x}' \xrightarrow{prob} -\mu^*\mu^{*\prime}. \quad (2.2)$$
Now, the first component $\sum_{i=1}^{n} \frac{x_i x_i'}{n}$ can be rewritten as $\frac{1}{\sqrt{n}}\sum_{i=1}^{n} \frac{x_i x_i'}{\sqrt{n}}$. So, by the Central Limit Theorem, we have
$$\frac{1}{n}\sum_{i=1}^{n} x_i x_i' \xrightarrow{distrib} N\left(\mu^*\mu^{*\prime} + \Sigma^*, \frac{2\Sigma^*}{n}\right).$$
Recalling (2.2), we have that
$$\hat{\Sigma}_n \xrightarrow{distrib} N\left(\Sigma^*, \frac{2\Sigma^*}{n}\right). \quad (2.3)$$
These results find confirmation in [58].

2.2 The sample covariance matrix: conditioning properties
We are now going to briefly talk about matrix conditioning. Let us suppose $p$ and $n$ are fixed. If $n > p$, the expected value of $\hat{\Sigma}_{\nu=n-1}$ is $\Sigma^*$, and the variances of its entries are
$$V(\hat{\sigma}_{n,ij}) = \frac{\sigma^{*2}_{ij} + \sigma^*_{ii}\sigma^*_{jj}}{n-1}.$$
This highlights why the variance of $\hat{\Sigma}_n$ increases as the true condition number of $\Sigma^*$ increases. If the condition number $c = \sigma_{\max}/\sigma_{\min}$ increases, the correlation between the components $x_i$ and $x_j$ increases, because $\Sigma^*$ is closer to collinearity. Consequently, $V(\hat{\sigma}_{n,ij})$ increases, because $\sigma^{*2}_{ij}$ is closer to its maximum, which is $\sigma^*_{ii}\sigma^*_{jj}$ (by the Cauchy-Schwarz inequality).

Coming back to the main point, it is crucial to study the behaviour of the sample eigenvalues. In the matrix estimation context there is a relevant issue about numerical conditioning, i.e. the behaviour of the sample maximum and minimum singular values, of a $p \times n$ data matrix $X$.
Theorem 2.2.1 ([39]). Given natural numbers $n, p$ with $p < n + 1$, and a $p \times n$ data matrix $X$ with independent entries of variance $\frac{1}{n}$, the largest and smallest singular values $\sigma_{\max}(X)$ and $\sigma_{\min}(X)$ are such that
$$\max\left\{\Pr\left[\sigma_{\max} \ge 1 + \sqrt{\frac{p}{n}} + t\right],\; \Pr\left[\sigma_{\min} \le 1 - \sqrt{\frac{p}{n}} - t\right]\right\} \le \exp\left(-\frac{nt^2}{2}\right),$$
for any $t > 0$.

This theorem was proved by using arguments from random matrix theory and the geometry of Banach spaces. It is an essential result to provide a probabilistic bound for the error distance $||\hat{\Sigma}_n - \Sigma^*||_2$, where $\hat{\Sigma}_n = \frac{1}{n}XX' = \frac{1}{n}\sum_{i=1}^{n} x_i x_i'$. In fact, the following Lemma holds:
Lemma 2.2.1. Let $\psi = ||\Sigma^*||_2$. Given any $\delta > 0$ and $\phi > 0$ with $\psi \le 8\phi$, let the number of samples $n$ be such that $n \ge \frac{64p\phi^2}{\delta^2}$. Then we have that
$$\Pr\left[||\hat{\Sigma}_n - \Sigma^*||_2 \ge \delta\right] \le 2\exp\left(-\frac{n\delta^2}{128\psi^2}\right).$$

This Lemma is based on a specific assumption on $\psi$, the largest eigenvalue of $\Sigma^*$. By appropriately setting the parameter $\psi$, we can obtain the probabilistic bound accordingly. The Lemma relies on the fact that the spectral norm is unitarily invariant, so that it is possible to assume a diagonal structure for $\hat{\Sigma}$ without loss of generality and then apply the previous Theorem 2.2.1.

It is remarkable that, without further assumptions, $\hat{\Sigma}_n$ is not invertible if $p > n$ (since it is perfectly collinear, having clearly at most rank $n$, and for the rest null eigenvalues). Even if $p \le n$, in the case the ratio $p/n$ is less than 1 but not negligible, the estimated (maximum and minimum) eigenvalues are numerically unstable, since the probabilistic bound is too large. This may result in bad conditioning (i.e. too large a condition number) for $\hat{\Sigma}_n$. This is why in the Big Data context, when $p$ is very large, it is frequent to have an ill-conditioned sample covariance matrix, since it is difficult to have enough observations to keep the ratio $p/n$ negligible ([75]).
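The instability of the extreme sample eigenvalues is easy to reproduce in a few lines. The following simulation sketch of ours works in the spirit of the Figure 2.1 example: $p = 100$ IID standard normal coordinates (true covariance $I$, true eigenvalues all equal to 1) and growing $n$:

```python
import numpy as np

rng = np.random.default_rng(42)
p = 100
for n in (50, 100, 1000, 10000):
    X = rng.standard_normal((n, p))      # n IID draws from N_p(0, I)
    S = X.T @ X / n                      # sample covariance (true mean known)
    eig = np.linalg.eigvalsh(S)
    # the spread of the sample eigenvalues around the true value 1
    print(f"n={n:6d}  min={eig.min():.3f}  max={eig.max():.3f}")
```

For $n = 50 < p$ the smallest eigenvalues are exactly zero (rank deficiency); as $p/n$ shrinks, both extremes concentrate around 1.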
The example in Figure 2.1 clearly outlines the described drawback. The eigenvalues of the covariance matrix of a simulated $n \times p$ process $\epsilon_i = N_p(0, \frac{1}{n}I)$, with $p = 100$ and $n \in \{10, 50, 100, 500, 1000, 10000\}$, are plotted. The figure displays how the dispersion of the eigenvalues decreases as $p/n$ decreases. All distributions tend to the Marcenko-Pastur distribution, which is proved to be the limiting distribution of the eigenvalues of IID random variables (in the Kolmogorov asymptotic framework, see [79]). The rank is always equal to $\min(p, n-1)$. If $p = n$, the matrix is thus singular.

Figure 2.1: Eigenvalues of the sample covariance matrix of $\epsilon_i = N_p(0, \frac{1}{n}I)$, $p = 100$, $n$ varying.

We have provided this simple example to state that without further assumptions on the eigen-structure (values and vectors) of $\Sigma^*$, the condition $p \le n$ is unavoidable in order to guarantee the positive definiteness (and thus the invertibility) of our covariance estimate. Anyway, the recovery of the eigen-structure of a covariance matrix is strongly related to the underlying assumptions and to the asymptotic context.
We now enumerate three parameter settings relevant for our dissertation:

1. $p$ and $n$ fixed: this is the case of $\hat{\Sigma}_{n-1}$, and of all the numerical estimators we will analyze in the next chapters ([31], [1], [77], [15]);

2. $p$ fixed, $n \to \infty$: this is the case of $\hat{\Sigma}_{ML}$, or of the approximate factor model ([29]);

3. $\frac{p}{n} \to c$ as $n \to \infty$: here we find the General asymptotic framework, used by Ledoit and Wolf to ensure the consistency of their estimator ([75]), and the Kolmogorov asymptotic framework (where also $p \to \infty$). Also the consistency properties of the thresholding estimator ([15]) and of the POET estimator ([45]) are derived under a similar framework, where a function of $p$ and $n$ tends to 0 while $n \to \infty$. See sections (2.4) and (2.5) for more explanations.

In the first context, with fixed $p$ and $n$, the outlined results concerning numerical conditioning for the sample covariance matrix hold, and the condition $p \le n$ is unavoidable without further assumptions to derive finite sample bounds. This is why one of the aims of the present work is trying to exploit results from the third asymptotic framework (in terms of model assumptions) to establish bounds in the finite sample context, dropping the condition $p \le n$.

2.2.1 Matrix conditioning as an ill-posed inverse problem

We now explain in detail why a bad-conditioned sample matrix is a fatal drawback for us. The reason stands in the consequences deriving from the inversion of a bad-conditioned matrix.
Let us now consider the standard linear system $Ax = b$, where $A$ is $p \times p$ and $x$, $b$ are $p \times 1$. If our aim is to derive $b$ (the output), we are solving the direct problem. If our aim is to derive $x$ (the input), we are solving the inverse problem. If $A$ is full rank, Cramer's theorem ensures that the inverse problem has the exact solution $x^* = A^{-1}b$. Otherwise, if $A$ has rank $r < p$, we need to solve the least squares problem $\min_{x \in \mathbb{R}^p} \|Ax - b\|^2$, and we have
$$x^* = \sum_{i=1}^{r} \frac{|u_i' b|}{\lambda_i} u_i, \qquad (2.4)$$
$$\|Ax^* - b\|^2 = \sum_{i=r+1}^{p} \|u_i' b\|^2.$$
This fundamental result was proved in [40]. How reliable is the solution $x^*$? Hadamard ([57]) outlined the three characteristics of a well-posed problem:

• existence: the problem admits at least one solution;
• uniqueness: the problem has at most one solution;
• stability: the problem is not sensitive to data perturbations.

In our context, if $A$ is full rank, the inverse problem may be ill-posed since it violates the stability condition. If $A$ is not full rank, the inverse problem is ill-posed since it violates the existence and the uniqueness conditions (there are only approximate solutions, no exact ones). The least squares system serves to identify a solution in any case, even if there were none. Anyway, (2.1) and (2.4) enable us to understand why the inverses of ill-conditioned matrices are numerically unstable. The solution of the direct problem is
$Ax = U\Lambda U' x = \sum_{i=1}^{p} \lambda_i (u_i' x) u_i$, which dampens the components corresponding to the smallest eigenvalues of $A$. On the contrary, (2.4) shows us that the solution of the inverse problem amplifies the effects of the same components. If we assume that $b$ is perturbed, i.e. $b_\epsilon = b + \epsilon$, we note that
$$x_\epsilon = x^* + \sum_{i=1}^{r} \frac{|u_i' \epsilon|}{\lambda_i} u_i.$$
So, if $A$ is ill-conditioned (i.e. we have very small eigenvalues), the effect of data perturbation is amplified, and the solution may not be effectively usable in applications.

This is why Picard ([93]) elaborated a condition under which the inverse solution is reliable. It states that $x^* = \sum_{i=1}^{r} \frac{|u_i' b|}{\lambda_i} u_i < \infty$ if and only if $|u_i' b|$ decays more rapidly than the corresponding $\lambda_i$ for all $i$, which occurs if $\lambda_i > \tau\ \forall i$, where $\tau$ is the threshold at which the singular values are levelled by the noise.
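The amplification mechanism can be seen in a two-dimensional toy example. This is our own sketch (the matrix and the noise level are arbitrary choices): a symmetric matrix with eigenvalues $1$ and $10^{-6}$ turns a tiny perturbation of $b$ into a large error on $x$.

```python
import numpy as np

# Symmetric ill-conditioned matrix: eigenvalues 1 and 1e-6
A = np.array([[1.0, 0.0],
              [0.0, 1e-6]])
x = np.array([1.0, 1.0])
b = A @ x                      # direct problem: well behaved
eps = np.array([1e-8, 1e-8])   # tiny perturbation of the data b
x_eps = np.linalg.solve(A, b + eps)
# The inverse problem divides by the small eigenvalue, so the
# perturbation is amplified by up to 1/1e-6 = 1e6.
amp = np.linalg.norm(x_eps - x) / np.linalg.norm(eps)
print(amp)
```

Here the error on $x$ is about $7 \times 10^5$ times larger than the perturbation of $b$, in line with the factor $1/\lambda_{\min}$ suggested by (2.4).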
If this condition is violated, a regularization method, like the truncated singular value decomposition (TSVD, see [55]), Tikhonov's regularization method ([109]), or other regression methods (like the ridge one), is needed. This is why the nonasymptotic approach to covariance matrix estimation essentially consists in specifying appropriate regularization problems under suitable conditions for deriving improved error rates, as we will widely describe in the following chapters.
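As a sketch of the TSVD regularization just mentioned (our own illustration; the matrix, the noise level, and the threshold `tau` are arbitrary), singular values below `tau` are simply discarded before inversion, which stabilizes the solution at the price of some bias:

```python
import numpy as np

def tsvd_solve(A, b, tau):
    """Truncated SVD: discard singular values below tau before inverting."""
    U, s, Vt = np.linalg.svd(A)
    keep = s > tau
    return Vt[keep].T @ ((U[:, keep].T @ b) / s[keep])

# Ill-conditioned system: one singular value sits at the noise level.
A = np.diag([2.0, 1.0, 1e-10])
b = A @ np.ones(3) + 1e-6          # small measurement noise on b
x_naive = np.linalg.solve(A, b)    # exact inverse amplifies the noise
x_tsvd = tsvd_solve(A, b, tau=1e-6)
print(np.linalg.norm(x_naive))     # huge: about 1e4
print(np.linalg.norm(x_tsvd))      # moderate: about sqrt(2)
```

A diagonal matrix is used only for readability; an orthogonal rotation of $A$ would not change the picture.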
Note that there is a huge literature dealing with the distribution of eigenvalues. We mention again the Marcenko-Pastur law, which describes the behaviour of the singular values of a rectangular random matrix having Gaussian entries ([79]). Tracy and Widom ([107]) found the limiting distribution of the singular values of a large dimensional random Hermitian matrix. Johnstone ([70]) found the limiting distribution of the largest eigenvalue in principal component analysis (for $n \leq p$, under the assumption of independent normality for the columns of the data matrix), which is proportional to a Wishart of order $1$. A recent work by Chiani ([33]) derived the exact distribution of the largest eigenvalues for real Wishart matrices and Gaussian Orthogonal Ensembles.

The work in [70], in particular, outlined that for large $p$ it can be easier to recover the top $r$ eigenvalues if they are particularly spiked, because the distribution of the $(r+1)$-th eigenvalue is bounded by a Tracy-Widom law of lower dimensions ($n \times (p-r)$ with respect to $n \times p$). Thus, the $(r+1)$-th eigenvalue of a set of $p$ eigenvalues where $r$ are spiked is stochastically smaller than the largest eigenvalue of a setting of $(p-r) < p$ non-spiked variables. This fact suggests that large dimensions ($p \to \infty$) can help the recovery of strong eigenvalues and somehow justifies the use of the "scree-plot" to choose the number of eigenvalues.

There are also some results on the distribution of the smallest eigenvalues. We refer to [8] for a general review.
All in all, the problem of reconditioning our covariance matrix estimate is approached differently according to the related asymptotic context. In Chapter 4 we will focus on the non-asymptotic context, outlining various solutions recently provided. Now, we will focus on the description of key covariance estimators in the asymptotic context where both $p$ and $n$ are allowed to tend to $\infty$. The estimator we are about to describe belongs to the class of shrinkage estimators ([68]), which represent a widely used approach in this context as an effective regularization method. It is relevant to note that the distributional assumption of normality is no longer needed, since the approach we are going to describe is distribution-free.
2.3 Shrinkage towards the identity: Ledoit and Wolf's approach
Ledoit and Wolf were the first to derive in [75] a consistent estimator of the covariance matrix in a new asymptotic framework, called the general asymptotic framework. They proposed a way to temper the numerical instability of sample eigenvalues, explicitly reconditioning them by shrinkage. The adoption of a new asymptotic framework was needed to ensure that the shrinkage intensity is positive, avoiding that it vanishes in the limit. Their estimator is also Bayesian in nature, since it is a combination of a priori and sample information. They call it an Empirical Bayesian estimator.
The motivating result of their analysis is reported below.

Theorem 2.3.1. The eigenvalues are the most dispersed diagonal elements that can be obtained by rotation of a symmetric matrix.

The proof exploits the invariance of the trace under rotation.
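This over-dispersion is easy to verify by simulation. The following is our own sketch with numpy (the linear spectrum is an arbitrary choice): the sample eigenvalues spread out around the grand mean much more than the true ones.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 50, 100
true_eigs = np.linspace(0.5, 1.5, p)        # true spectrum, grand mean 1
# Draw n observations from N_p(0, diag(true_eigs)) by scaling columns
X = rng.standard_normal((n, p)) * np.sqrt(true_eigs)
S = X.T @ X / n                             # sample covariance
sample_eigs = np.linalg.eigvalsh(S)
disp_true = np.var(true_eigs)
disp_sample = np.var(sample_eigs)
print(disp_true, disp_sample)   # sample eigenvalues are over-dispersed
```

Even though the trace is estimated almost unbiasedly, the dispersion of the sample eigenvalues is several times larger than that of the true ones at this ratio $p/n = 0.5$.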
This causes the largest sample eigenvalues to be positively biased and the smallest to be negatively biased, with a bias that increases in $p/n$ (recall Theorem 2.2.1). The pattern of sample eigenvalues depends on the Marcenko-Pastur distribution, which holds in the Kolmogorov asymptotic framework. As described, under Kolmogorov asymptotics the ratio $p/n$ tends to a specific constant, while both $p$ and $n$ tend to infinity.

Here we report the solution proposed by Ledoit and Wolf to the described problem. Their idea is to shrink the sample covariance matrix towards the identity matrix, solving the following optimization problem (thus reconditioning the eigenvalues):
$$\min_{\rho_1, \rho_2} E[\|\Sigma - \Sigma^*\|^2] \qquad \text{s.t. } \Sigma = \rho_1 I_p + \rho_2 \hat{\Sigma}_n,$$
where $\rho_1$ and $\rho_2$ are nonrandom coefficients.

The theoretical solution to this problem is the optimal linear shrinkage estimator
$$\Sigma_{LW} = \frac{\beta^2}{\gamma^2}\,\mu I + \frac{\alpha^2}{\gamma^2}\,\hat{\Sigma}_n \qquad (2.5)$$
with $E[\|\Sigma_{LW} - \Sigma^*\|^2] = \frac{\alpha^2 \beta^2}{\gamma^2}$, where:
$$\mu = \langle \Sigma^*, I \rangle; \quad \alpha^2 = \|\Sigma^* - \mu I\|^2; \quad \beta^2 = E[\|\hat{\Sigma}_n - \Sigma^*\|^2]; \quad \gamma^2 = E[\|\hat{\Sigma}_n - \mu I\|^2].$$

Their derivation exploits the natural Pythagorean relationship
$$\alpha^2 + \beta^2 = \gamma^2. \qquad (2.6)$$
In this view, the ratio $\frac{\beta^2}{\gamma^2}$ is called the optimal shrinkage intensity.

The most important interpretation of this approach for our purposes is the following. It is well known (Theorem 2.2.1) that the sample eigenvalues of IID data have bounded error with respect to the true ones, so that, under the condition
$p \leq n$ ($p$ and $n$ fixed),
$$\frac{1}{p} E\left(\sum_{i=1}^{p} \hat{\lambda}_i\right) = \frac{1}{p}\sum_{i=1}^{p} \lambda_i,$$
i.e. the trace of $\Sigma^*$ is unbiasedly estimated.

At the same time, Theorem 2.3.1 shows that sample eigenvalues have a larger dispersion around their grand mean with respect to the true ones (assuming that the eigenvectors are reliable). From (2.6) we can argue that
$$\frac{1}{p} E\left[\sum_{i=1}^{p} (\hat{\lambda}_i - \mu)^2\right] = \frac{1}{p}\sum_{i=1}^{p} (\lambda_i - \mu)^2 + E[\|\hat{\Sigma}_n - \Sigma^*\|^2],$$
i.e. the excess dispersion of the sample eigenvalues is the error of the sample covariance matrix. This is why here the authors bound $E[\|\hat{\Sigma}_n - \Sigma^*\|^2]$ by bounding $\frac{1}{p} E\left[\sum_{i=1}^{p} (\hat{\lambda}_i - \mu)^2\right]$, where $\mu = 1$. So,
$\Sigma_{LW}$ implicitly performs the reconditioning of the eigenvalues, since
$$\lambda_{i,LW} = \frac{\beta^2}{\gamma^2}\mu + \frac{\alpha^2}{\gamma^2}\hat{\lambda}_i, \quad \forall i = 1, \ldots, p.$$
The quantity $\frac{1}{p} E\left[\sum_{i=1}^{p} (\lambda_{i,LW} - \mu)^2\right]$ is equal to $\frac{\alpha^4}{\gamma^2}$, and is even smaller than the dispersion of the true eigenvalues, for the reasons described above. Note that this method is very similar in its meaning to the $\max \log\det$ heuristics for nuclear norm minimization (see [49]).

2.3.1 General Asymptotics
In order to derive a feasible estimator, we now need to move to a new asymptotic framework, since the optimal shrinkage intensity $\frac{\beta^2}{\gamma^2}$ vanishes as $\beta^2 = E[\|\hat{\Sigma}_n - \Sigma^*\|^2]$ vanishes when $n \to \infty$ in the standard asymptotic framework (as proved in paragraph 2.1.3, see convergence (2.3)). This fact, when $p$ is close to $n$ or even larger, is inconsistent with reality. So, a new asymptotic framework, called General Asymptotics, is needed, where $\beta^2$ does not vanish in the limit.
Consider $n = 1, 2, \ldots$ indexing a sequence of statistical models; for every $n$, $X_n$ is a $p_n \times n$ matrix of $n$ iid observations on a system of $p_n$ zero-mean random variables with covariance matrix $\Sigma_n$. The following assumption characterizes this context:

A1. There exists a constant $K_1$, independent of $n$, such that $p_n / n \leq K_1$.

It is remarkable that in this setting $p$
can change and even go to infinity, but it is not required to. Differently from the Kolmogorov asymptotic framework (the one of the Marcenko-Pastur law), it is not even necessary that this ratio tend to a finite constant.

Two further assumptions are needed to derive a consistent estimator of $\Sigma_{LW}$. If $\Sigma_n = \Gamma_n \Lambda_n \Gamma_n'$, the product $Y_n = \Gamma_n' X_n$ is a set of uncorrelated variables spanning the same space as the original ones. The following restrictions on the higher moments of $Y_n$ are imposed:

A2. There exists a constant $K_2$, independent of $n$, such that
$$\frac{1}{p_n}\sum_{i=1}^{p_n} E[(y_{i1}^n)^8] \leq K_2;$$

A3.
$$\lim_{n\to\infty} \frac{p_n^2}{n^2} \cdot \frac{\sum_{(i,j,k,l)\in Q_n} \left(\mathrm{Cov}(y_{i1}y_{j1},\, y_{k1}y_{l1})\right)^2}{\mathrm{Card}(Q_n)} = 0,
where $Q_n$ denotes the set of all quadruples made of four distinct integers between $1$ and $p_n$.

Assumption 2 states that the eighth moment of $y$ is bounded (on average). Assumption 3 states that products of uncorrelated random variables are themselves uncorrelated (on average, in the limit). In the case when general asymptotics degenerate into standard asymptotics ($p/n \to 0$), Assumption 3 is trivially verified as a consequence of Assumption 2. For what previously stated, Assumption 3 is verified when the random variables are normally or even elliptically distributed, since the sample covariance of (uncorrelated) normal variables is asymptotically unbiased. Anyway, A3 is much weaker than that situation.

These assumptions are specifically needed to derive the sample counterparts of $\mu$, $\gamma^2$, $\beta^2$.

Note that these two assumptions heavily involve the eigenstructure (eigenvalues and eigenvectors) of the true covariance matrix. Here we need to impose restrictions on eighth moments, due to the particular nature of their optimal weights. Anyway, the need to control the pervasiveness of the latent structure in the covariance matrix is crucial for model recovery. We also underline how much latent factorial assumptions can impact on covariance estimation. This is why we are going to specifically discuss the relationship between factor modelling and covariance estimation in paragraph (2.5).
Under these assumptions, Ledoit and Wolf approach the study of the consistency of their estimator. In their context, the reference norm is given by $\|A\|_n^2 = \mathrm{tr}(AA')/p_n$, such that the identity matrix always has norm one, and the reference cross product is $\langle A_1, A_2 \rangle_n = \mathrm{tr}(A_1 A_2')/p_n$. The problem of obtaining meaningful absolute rates in high dimensions is another relevant issue. As we will see, in [45] the authors derive asymptotic rates for the relative error matrix (and not the covariance matrix itself). Instead, under the nonasymptotic setting (Chapter 4), we will obtain finite absolute rates, even under the same assumptions of [45].

We are now going to show why the sample covariance matrix is not consistent in this context, differently from the finite $p$ context, where the covariance matrix is asymptotically consistent under the assumption of normality. The authors show that the quantities $\mu_n = \langle \Sigma_n, I \rangle$, $\alpha_n^2 = \|\Sigma_n - \mu_n I\|^2$, $\beta_n^2 = E[\|\hat{\Sigma}_n - \Sigma_n\|^2]$, $\gamma_n^2 = E[\|\hat{\Sigma}_n - \mu_n I\|^2]$ are bounded in the general asymptotic framework when $n \to \infty$
. Then, they prove the following important theorem.

Theorem 2.3.2. Define $\theta_n^2 = \mathrm{Var}\left(\frac{1}{p_n}\sum_{i=1}^{p_n} (y_{i1}^n)^2\right)$. Then $\theta_n^2$ is bounded as $n \to \infty$, and we have:
$$\lim_{n\to\infty}\left\{ E[\|\hat{\Sigma}_n - \Sigma_n\|^2] - \frac{p_n}{n}\left(\mu_n^2 + \theta_n^2\right) \right\} = 0.$$
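A quick Monte Carlo check of this expression (our own sketch with numpy, not part of the original derivation): take Gaussian data with $\Sigma_n = I$, so that $\mu_n = 1$ and $\theta_n^2 = \mathrm{Var}(\frac{1}{p}\sum_i y_{i1}^2) = 2/p$, and compare the simulated expected loss, in the norm $\|A\|^2 = \mathrm{tr}(AA')/p$, with $\frac{p}{n}(\mu_n^2 + \theta_n^2)$.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, reps = 40, 80, 500

losses = []
for _ in range(reps):
    X = rng.standard_normal((n, p))        # N_p(0, I) data, known zero mean
    S = X.T @ X / n                        # sample covariance
    D = S - np.eye(p)
    losses.append(np.sum(D * D) / p)       # ||S - Sigma||^2 = tr(DD')/p

lhs = np.mean(losses)                      # simulated E||S - Sigma||^2
rhs = (p / n) * (1.0 + 2.0 / p)            # (p/n)(mu^2 + theta^2)
print(lhs, rhs)                            # the two values nearly coincide
```

Note that here $p/n = 0.5$, so the expected loss stabilizes around $0.5$ instead of vanishing: the sample covariance matrix is not consistent in this regime.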
This result states that the sample covariance matrix is not consistent under the general asymptotic framework, since its expected loss is lower bounded by $\frac{p_n}{n}\mu_n^2$, which does not usually vanish. (Recall that $\theta_n^2$ vanishes asymptotically under the assumption of normality, by convergence (2.2).) There are two interesting exceptions:

• when $\frac{p_n}{n} \to 0$, we fall into the standard asymptotic context, where the sample covariance matrix is consistent. The only difference is that the more general case $p = o(n)$ is allowed, i.e. $p$ is allowed to be unbounded and grow towards infinity;

• $\mu_n^2 \to 0$ and $\theta_n^2 \to 0$. $\mu_n^2 \to 0$ implies that most of the random variables have vanishing variances, i.e. there are $O(n)$ asymptotically degenerate variables. So, if the number of nondegenerate random variables is NOT negligible with respect to the number of observations, the sample covariance matrix is not consistent.

Inconsistency is due to the disequilibrium between the number of datapoints, $n p_n$, and the number of parameters, $p_n(p_n+1)/2$. This is a key point in our analysis, which is unsolved by the approach of Ledoit and Wolf. In fact, they write there is no DIRECT consistent estimator of the covariance matrix under general asymptotics. Their strategy is to derive a consistent estimator of their theoretical estimator, which is proved to have the minimum risk among all the linear combinations of $I_p$ and $\hat{\Sigma}_n$ and is shown to be better conditioned than the sample covariance matrix.

So, shrinkage matters unless $\frac{p_n}{n}$ is negligible with respect to $\frac{\gamma^2}{\mu^2}$, i.e. unless the dispersion of the sample eigenvalues is much larger than $\frac{p_n}{n}$.
To conclude this section, we now explain how Ledoit and Wolf derive a consistent estimator for $\Sigma_{LW}$. They introduce sample counterparts of their key quantities:
$$m_n = \langle \hat{\Sigma}_n, I \rangle_n, \qquad d_n^2 = \|\hat{\Sigma}_n - m_n I\|_n^2,$$
$$\bar{b}_n^2 = \frac{1}{n^2}\sum_{k=1}^{n} \|x_{\cdot k}^n (x_{\cdot k}^n)' - \hat{\Sigma}_n\|_n^2, \qquad b_n^2 = \min(\bar{b}_n^2, d_n^2), \qquad a_n^2 = d_n^2 - b_n^2,$$
where $x_{\cdot k}^n$ denotes the $k$-th column of $X_n$. All these sample counterparts are consistent in the general asymptotic framework, i.e. they converge to $\mu_n$, $\alpha_n^2$, $\beta_n^2$, $\gamma_n^2$ respectively in quadratic mean. Then, their feasible consistent estimator is
$$\hat{\Sigma}_{LW} = \frac{b_n^2}{d_n^2}\, m_n I_n + \frac{a_n^2}{d_n^2}\, \hat{\Sigma}_n. \qquad (2.7)$$

This estimator is consistent in the general asymptotic framework with respect to $\Sigma_{LW}$, i.e. they share the same asymptotic expected loss. Thus, the expected quadratic loss $\frac{\alpha^2 \beta^2}{\gamma^2}$ can be consistently estimated in quadratic mean by $\frac{a_n^2 b_n^2}{d_n^2}$.

$\hat{\Sigma}_{LW}$ is shown to have an important optimality property: it has the same asymptotic risk as the theoretical optimal linear combination of $\hat{\Sigma}_n$ and $I_n$ with random coefficients. In addition, its condition number is proved to be bounded in probability, which is very important for practical use.

The approach by Ledoit and Wolf is undoubtedly very elegant. However, there is still one main difficulty: their estimator is excessively better conditioned than the true covariance matrix, i.e. it is often too biased, due to the presence of the identity matrix in the estimator. This is why another major point of our dissertation will deal with the need of "unshrinking" the estimated eigenvalues.
In fact, the numerical issue is not the only relevant reason for desiring a well conditioned estimate of the covariance matrix. Deep statistical reasons lie behind this need: we suppose that the true covariance matrix $\Sigma^*$ is well conditioned, that is, there is no multi-collinearity among our $p$ variables. In this respect, a well conditioned estimate is crucial also for fitting purposes, i.e. to improve the statistical properties of the estimate.

We recall the previous shrinkage estimator by the same authors ([74]) in the market portfolio context. There, the authors specify for the covariance matrix a single-index model (consistently with the basic theory of asset prices, see [103]), which is essentially a one-factor latent model, and then estimate the covariance matrix deriving the optimal shrinkage intensity towards the single-index target as described. This single-index covariance matrix estimator is an interesting contact point between latent variable models and shrinkage methods.
Before passing to the analysis of factor-based covariance matrix estimators (paragraph (2.5)), we now briefly outline the covariance estimators based on pure sparsity assumptions, with particular reference to the use of shrinkage thresholding. In this context, sparsity means that our true covariance matrix has a prevalence of zeros.
2.4 Sparse covariance matrix estimation

In this section we list the most relevant estimators based on a pure sparsity assumption, which can be effective for reducing the number of parameters and reconditioning the estimate, removing unnecessary off-diagonal correlations. If $p/n \to c \in (0,1)$ (general asymptotic framework), the eigenvalues of $\hat{\Sigma}_n$ follow the Marcenko-Pastur law, supported on $\left[(1-\sqrt{c})^2, (1+\sqrt{c})^2\right]$. If $p/n$ does not tend to a constant, we do not have any guarantee. For this reason, enforcing sparsity can be a key for obtaining a full rank estimate in high dimensions, even when $n < p + 1$. However, there are lots of different types of sparsity assumptions, methods and asymptotic frameworks to prove consistency.

The natural context which gave rise to the concept of sparsity lies in a data-set showing a clear index ordering among variables. This condition arises easily for spatial data, when the variables are geographical areas for which a proximity matrix is naturally defined. Applications include spectroscopy and climate data.
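As a minimal illustration of a sparsity-enforcing operation of this kind (our own sketch; the threshold $0.05$ is arbitrary), soft-thresholding zeroes the small off-diagonal covariances while keeping the variances:

```python
import numpy as np

def soft_threshold(S, t):
    """Soft-threshold the off-diagonal entries of S at level t,
    setting small correlations to zero (variances are kept)."""
    T = np.sign(S) * np.maximum(np.abs(S) - t, 0.0)
    np.fill_diagonal(T, np.diag(S))
    return T

S = np.array([[1.00, 0.30, 0.02],
              [0.30, 1.00, -0.01],
              [0.02, -0.01, 1.00]])
T = soft_threshold(S, 0.05)
print(T)
# the small entries (0.02, -0.01) are zeroed; 0.30 shrinks to 0.25
```

Different estimators in this family differ mainly in the thresholding rule (hard, soft, adaptive) and in the matrix class over which consistency is proved.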
For this kind of data, several methods have been developed. Banding the covariance matrix, by appropriately defining a banding parameter, is one effective solution. In that approach ([14]), the matrix reference class is $\Sigma^* \in U(\epsilon_0)$, where
$$U(\epsilon_0) = \left\{ \Sigma^* \in \mathbb{R}^{p \times p} : 0 < \epsilon_0 \leq \Lambda_i(\Sigma^*) \leq \epsilon_0^{-1} < +\infty,\ \max_j \sum_i \{|\sigma_{ij}^*| : |i - j| > k\} \leq C k^{-\alpha} \right\}, \qquad (2.8)$$
which is the class of matrices having uniformly bounded eigenvalues and banded covariance.
For any