Statistical Analysis and Mathematical Modeling of Interest Rates


Master's Thesis

Statistical Analysis and Mathematical Modeling of Interest Rates

Candidate: Maurizio Ragusa

Advisor: Prof. Maurizio Pratelli

Co-examiner: Prof. Dario Trevisan


Contents

Introduction
1 First Definitions and Notations
2 Principal Component Analysis
2.1 The Idea
2.2 Transposing the problem using the Spectral Theorem
2.3 Algebraic and Geometric Properties of the PCs
2.4 Application of PCA to the FRC models
2.4.1 PCA of the yield curve
2.4.2 PCA of the daily changes
3 A Phenomenological Approach
3.1 Postulation of a behaviour
3.2 A three-factor model for the FRC
3.3 A qualitative justification for the three-factor model
4 Abstract Modeling
4.0.1 Basic definitions
4.1 Short rate models
4.1.1 Vasicek Model
4.1.2 Cox-Ingersoll-Ross model
4.1.3 Hull-White model
4.2 Term Structure Factor Models
4.2.1 Multifactor Short Rate Models
4.3 Forward rate models and the Heath-Jarrow-Morton theorem
4.3.1 Example
4.3.2 Heath-Jarrow-Morton theorem
4.3.3 The Mercurio and Moraleda Model
Bibliography


Introduction

Forecasting the trend of interest rates is one of the most challenging issues faced by modern mathematical finance. Besides the obvious practical reasons for investigating the problem, the mathematical interest arises from the fact that a possible solution requires a deep knowledge of different instruments, such as stochastic differential equations and all the theory behind them, which adds considerable complexity to the mathematical approach.

At the center of our discussion will be the definition of a bond. It is a contract depending on two different dates: the time t at which it is subscribed, and the maturity T > t at which the money lent at time t is returned to the buyer, with an interest rate (possibly negative, though this is quite rare) decided at time t. In other words, one lends a certain sum at time t, and at time T this sum is returned with interest. The price of the bond per unit of money therefore depends on two parameters, t and T, and it will be denoted by P(t, T).

Since the price is random, it will be mathematically modeled as a stochastic process. However, it should not be interpreted like the price of an ordinary good, say, for example, the price of oil, since in that case the price depends on only one parameter. The dependence on the maturity T thus adds another level of complexity to the problem.

A further complication is the fact that the price of the bond P(t, T) must satisfy some technical hypotheses. First of all, it is obvious that P(T, T) = 1: if one lends 1 dollar at time T, it is impossible that the return be greater or smaller than 1 dollar. In both cases, indeed, we would obtain an arbitrage, that is, informally speaking, an investment which brings an immediate, riskless profit.

Another requirement on P(t, T) is the regularity of the trajectories for each fixed t. There is no conceptual reason for imposing this hypothesis; it comes from empirical observations of bonds on the real market. A clever way of overcoming this hurdle is to introduce the forward rate curve:

f(t, T) = −∂_T ln P(t, T),


which will have a precise financial meaning. It is equivalent to study the bond price or the forward rate curve, since from the first we can easily recover the second, and vice versa, by a simple transformation. Furthermore, if we study the forward rate curve and only impose the continuity of its trajectories (a hypothesis which will hold quite easily, for example, by supposing that f(t, T) is an Itô process, as we will do in the last chapter), then the regularity of P(t, T) for each fixed t comes for free. Thus, studying the forward rate curve instead of the bond price is in fact a winning idea that will be used extensively throughout the dissertation.

The entire work contains two souls: a phenomenological one and a theoretical one.

The phenomenological soul retraces the work of Bouchaud, Sagna et al. [Bou99] and consists of the observation and analysis of empirical data (more precisely, the values of the forward rate curve from the U.S. Treasury from 2006 to 2017, for thirty different maturities). Starting from them, we will try to give a simple model which describes the forward rate curve trend by decomposing it as a linear sum of three different factors:

f(t, T) = r(t) + s(t) Y(T − t)/Y(T_max − t) + ψ_1(t) α_1(T − t).    (1)

Each factor will have a precise financial meaning that will be explored in depth in the third chapter.

In doing that, the use of a statistical instrument called Principal Component Analysis (abbreviated PCA) will be fundamental. It is very useful to reduce the dimension of the problem from N, possibly very large, to a much smaller k. If, for example, we consider a random vector X in R^N, for each k < N the PCA lets us identify the k-dimensional subspace in which the distribution of X is mostly concentrated, or equivalently, the k-dimensional subspace that minimizes the variance of the components on its orthogonal complement. That will give us the best linear k-dimensional approximation of X, in a sense that we will make precise in the second chapter.

In our context, the power of this instrument will be amplified by the fact that we only need k = 3 to have an excellent approximation of the forward rate curve.

The second soul is much more abstract and involves the construction of a model for the forward rate in a stochastic sense. An SDE-type dynamics will be given in order to determine the law of the forward rate curve, seen as a stochastic process. It is a classical approach with a very rich history, which we will partially retrace in the last chapter. We will start from the first (and simplest) models, the short rate ones, and arrive at the more general forward rate models and the Heath-Jarrow-Morton theorem, which is a key result in mathematical finance.


It gives necessary and sufficient conditions for the absence of arbitrage in a model, and these conditions depend only on the dynamics of the forward rate. Between the short rate models and the Heath-Jarrow-Morton theorem, we will also present the factor models for the forward rate. They are a good compromise between the simplicity of the former and the strong generality of the latter. In a few words, we will suppose that the forward rate can be written as a function of finitely many factors:

f(t, T) = f(t, Z_t^1, ..., Z_t^k).

This assumption can be justified by the previous phenomenological approach. Indeed, equation (1) can be seen as a special case of a factor model.

In the very final part of the dissertation, a concrete example of a forward rate model will be given. It was first proposed by Mercurio and Moraleda [Mer00] in 1997; its strength comes from its simplicity, which helps a lot in practical implementations, and from a good adherence to real data. It can be seen as an attempt to join the two different souls, since statistical observations are used to determine the dynamics of the forward rate curve in such a way as to be in line with the theoretical results proved before.


Chapter 1

First Definitions and Notations

In order to fix notation, let us briefly introduce the basic definitions of interest rate theory, which we will use extensively in what follows.

Definition 1.1 (Zero Coupon Bond). A Zero Coupon Bond (often simply abbreviated as Bond) is a security bought at price P(t, T) at time t and worth 1 at time T, with T ≥ t. The function P(t, T), defined on {0 ≤ t ≤ T ≤ T*} for a fixed time horizon T*, is called the price of the zero coupon bond and it will often be identified with the bond itself. T will be called the maturity of the bond, while the value T − t will be the time to maturity.

Remark 1.2. Before going deeper into the theory and defining all the different types of rates, it is important to make a distinction between the government rates, deduced from bonds issued by governments, and the interbank ones, which are computed from a reference rate, usually the LIBOR (London InterBank Offered Rate), fixed daily in London.

This difference is just practical: from a theoretical point of view we could equivalently study the bond instead of its interest rates, or vice versa. This is the reason why the two kinds of interest rates are modeled in the same way, and in what follows no distinction will be made between them.

Obviously, we must have P(T, T) = 1. Furthermore, the price must not be negative, so P(t, T) > 0 for each t and T. In order to make the problem mathematically tractable, we add hypotheses on the regularity of P(t, T), imposing that for each fixed T with 0 < T ≤ T*, the family {P(t, T)}_{0≤t≤T} is a stochastic process (in the last chapter we will also suppose that it is an Itô process) on a probability space (Ω, F, P) equipped with a filtration (F_t)_t such that all the processes P(·, T) are adapted, for each T. Another important assumption will be that T ↦ P(t, T) is regular (at least C^1) for every fixed t. Nothing is assumed about the regularity in t.


Remark 1.3. The hypothesis on the regularity with respect to the maturity T comes from empirical observations of the market, as we will see in the next chapter.

Definition 1.4. The LIBOR interest rate is defined as:

L(t, T) = (1 − P(t, T)) / (P(t, T)(T − t)).

Definition 1.5. The Yield of a zero coupon bond is defined as:

Y(t, T) = −ln P(t, T) / (T − t).

The Yield Curve is the yield seen as a function of the maturity T, for fixed t.

Note that this definition stems from a simple computation of continuously compounded interest rates. Indeed, we can characterize Y(t, T) as the function such that

1/P(t, T) = e^{Y(t,T)(T−t)},

i.e. the yield is the interest rate of a security worth 1 at time t (and consequently, by definition, worth 1/P(t, T) at time T). Similar remarks can be made for the LIBOR rate, but in that case the interest rate is computed linearly.

Let us now introduce the instantaneous forward rate, which will play a fundamental role in the next chapters. Let t, T and S be dates such that t < T < S.

Definition 1.6. The Forward Rate at time t, for the time interval [T, S], is the value f(t, T, S) such that the following identity holds:

e^{f(t,T,S)(S−T)} P(t, S) = P(t, T),

which we can interpret as the interest rate for an investment decided at time t, started at time T and with maturity S. It can be written as

f(t, T, S) = −(ln P(t, S) − ln P(t, T)) / (S − T).

Definition 1.7. The instantaneous forward rate curve (often abbreviated FRC) is the forward rate calculated for an infinitesimal interval of maturity, say [T, T + dT]. In formulas,

f(t, T) = lim_{S→T} f(t, T, S) = −∂_T ln P(t, T) = −∂_T P(t, T) / P(t, T).


In what follows we shall often refer to the instantaneous forward rate, but for brevity we will omit the term instantaneous. There will be no misunderstanding with the forward rate f(t, T, S), because the latter will not be mentioned anymore. Indeed, we introduced it only to give a financial meaning to the instantaneous forward rate.

Definition 1.8. The short rate or spot rate is the instantaneous forward rate with maturity T = t:

r(t) = f(t, t) = −(∂_T P(t, T) / P(t, T))|_{T=t}.

Simple algebraic manipulations give us

P(t, T) = e^{−∫_t^T f(t,u) du}.

Indeed, in terms of the yield we have:

Y(t, T) = −(1/(T − t)) ln P(t, T) = (1/(T − t)) ∫_t^T f(t, u) du.

These relations between the yield, the zero coupon bond and the forward rate curve imply that if we know one of them, we know all of them. This does not hold for the spot rate: if we know it, we cannot compute the forward rate, unless we make additional hypotheses on the market. This principle will be fundamental for the models we will see later.
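As a minimal numerical illustration of these relations (a sketch on a hypothetical yearly grid of times to maturity, with made-up curve values, not market data):

import numpy as np

theta = np.arange(1, 31)                      # times to maturity, in years
f = 0.02 + 0.01 * np.sqrt(theta / theta[-1])  # an illustrative forward rate curve

# Treat f as piecewise constant on yearly intervals, so that the integral
# of f(t, u) for u between t and t + theta becomes a cumulative sum.
P = np.exp(-np.cumsum(f))                     # P(t, t + theta) = exp(-int f)
Y = -np.log(P) / theta                        # Y(t, t + theta) = -ln P / theta

# With piecewise constant rates, the yield is the running average of the
# forward curve: the two carry the same information, as stated above.
assert np.allclose(Y, np.cumsum(f) / theta)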

We could also introduce many different options on bonds. One of the most important (and the only one we will actually use later) is the European call on a bond.

Definition 1.9. A European call on a zero coupon bond with maturity S > T and strike price X is the random variable, measurable with respect to F_T, defined as follows:

C(T, S, X) = (P(T, S) − X)^+.

One of the main problems in mathematical finance is to find, where it exists, its right value at time t, that is, a value which does not allow any kind of arbitrage. We do not yet introduce a rigorous notion of arbitrage, leaving the reader with a heuristic understanding of it. A formal definition, adapted to the context of interest rates, will be given in the last chapter, where it will be fundamental in building possible models for the market of interest rates.


Chapter 2

Principal Component Analysis

In this chapter we introduce the reader to the theory of principal component analysis, a very useful tool not only in financial mathematics but also in many other statistical contexts. We will see that this technique gives an excellent approximation of the interest rate curve, letting us reduce the dimension of the space of all possible forward rate curves from N (possibly very large) to 3 or 4.

For this purpose, we will first forget the financial setting of our dissertation, giving some theoretical results about principal component analysis which hold in general. In doing so, we will try to give some justification for the efficiency and validity of this instrument. Then we will return to the study of the FRC and, analyzing some empirical data coming from the U.S. Department of the Treasury, we will try to convince the reader of the soundness of using PCA in this context. The statistical analysis will also permit us to justify some of the considerations we will make in the following chapters about possible models for the short and forward rate curves.

2.1 The Idea

Let X be a random vector in R^n with finite variance. We should imagine n as a very large number, so that X is a vector depending on many variables. In many practical cases, however, X is not an effective n-dimensional vector: it actually is a k-dimensional vector, with k ≪ n, in the sense that X lives in a k-dimensional subspace, up to a small error. The aim of principal component analysis is exactly to find the best k-dimensional subspace, the one which minimizes the error.

Let us say in more precise terms what we are looking for: let us call V the "minimizing" subspace (we will discuss its existence and uniqueness later) and P : R^n → R^n the orthogonal projection onto V. Then we can write


X = PX + E,

where E is the error, which lives in V^⊥ (so it is basically an (n−k)-dimensional random vector), and it must have the "minimizing" property. Now, it is rather delicate to give a rigorous meaning to the term "minimizing": if E were a one-dimensional random variable, then we could simply require that Var(E) be the smallest possible. In general, we should minimize the size of Cov(E), but it is hard to work with that, because we should say what we mean by the "size" of a covariance matrix. We could proceed in this direction, for example, by introducing the concept of generalized covariance.

However, there is another equivalent but conceptually easier way to proceed, and it consists of eluding the problem by making the following two simple observations:

• PX and E are uncorrelated, so Cov(X) = Cov(PX) + Cov(E). This means that minimizing the size of Cov(E) is equivalent to maximizing that of Cov(PX);
• we can reason step by step, i.e. find first the best minimizing one-dimensional subspace, and then iterate the process on its orthogonal complement. It is quite reasonable to suppose that, if we call V_1, ..., V_n these subspaces, then for each k we have that V_1 ⊕ V_2 ⊕ ... ⊕ V_k is the k-dimensional minimizing subspace.

The advantage is quite obvious: we factorize the problem into simpler ones. Moreover, we can easily give a precise meaning to the concept of maximization. Indeed, projections onto the V_i fundamentally give us one-dimensional random variables, so the heuristic notion of maximization given before has a natural interpretation as the maximization of their variance.

At last, the problem can be summarized in the following:

Problem 2.1. Find

max_{α ∈ S} Var(α^t X), where Var(α^t X) = α^t Cov(X) α

and S is the unit sphere in R^n, i.e. S = {α ∈ R^n : ‖α‖ = 1}.

Indeed, we can see the variables α^t X, with α^t α = 1, as the orthogonal projections of the random vector X onto all the possible lines through the origin of R^n.

Let us call α_1 the maximizer. Once we have it, we can go on to find the vector α_2 that maximizes the function α^t Cov(X) α over the set {α ∈ R^n : ‖α‖ = 1} ∩ Span(α_1)^⊥.

Iterating this procedure, we get n vectors α_1, ..., α_n and, by what we said before, we should expect that Span(α_1, ..., α_k) is the subspace minimizing the error, in the sense discussed at the beginning. This assertion does in fact hold and will be proved in the next section.


An elementary example

Let us think of our problem in R^2. If M is our covariance matrix, which we suppose positive definite, then the equation {α ∈ R^2 : α^t M α = c} defines an ellipse for every c > 0. Geometrically, the aim of the last section is to find the maximum c such that the ellipse intersects the unit sphere S = {α^t α = 1}. It is not difficult to prove that the maximum is reached when the intersection happens in two double points lying on the minor axis of the ellipse, and to show that this axis also corresponds to one of the eigenspaces of M.

2.2 Transposing the problem using the Spectral Theorem

The idea that the maximum is reached when α is an eigenvector of Cov(X) holds in general. Moreover, the following stronger fact holds.

Lemma 2.2. Using the notations of the previous section, we have

max_{α ∈ S} Var(α^t X) = λ_1,

where S is the unit sphere and λ_1 is the greatest eigenvalue of M = Cov(X). This maximum is reached when α is the corresponding eigenvector.

In general, if α_1, ..., α_n are unit eigenvectors corresponding to the eigenvalues λ_1 > λ_2 > ... > λ_n > 0, then

max_{α ∈ S_k ∩ S} Var(α^t X) = λ_{k+1},

where S_k is the orthogonal complement of the first k eigenvectors:

S_k = (Span{α_1, ..., α_k})^⊥.

This maximum is reached when α = α_{k+1}, the eigenvector corresponding to λ_{k+1}.

Remark 2.3. In the previous lemma, we made an implicit assumption: we supposed that the covariance matrix is nonsingular and has simple eigenvalues. If this assumption does not hold, there can be some complications; fortunately, in most of our applications these hypotheses will always be satisfied, so we will not stress them. We fix this convention in the following:


All matrices we consider (unless explicitly stated otherwise) will have these properties:

• nonsingularity;
• simplicity of the eigenvalues, that is, the eigenspaces are all one-dimensional.

Note also that under these hypotheses the eigenvectors α_k are unique up to a change of sign.

Proof. An easy proof can be given using Lagrange multipliers. However, here we give another proof, which uses only basic notions of linear algebra.

We know that M is a positive definite symmetric matrix, so by the spectral theorem we can find an orthonormal basis of eigenvectors of M. Let α_1, ..., α_n be this basis, with M α_k = λ_k α_k. Then

α_1^t M α_1 = λ_1 α_1^t α_1 = λ_1.

On the other hand, if α is a unit vector, then we have α = Σ_k c_k α_k and

α^t M α = α^t M (Σ_{k=1}^n c_k α_k) = α^t (Σ_{k=1}^n c_k λ_k α_k) = (Σ_{h=1}^n c_h α_h^t)(Σ_{k=1}^n c_k λ_k α_k) = Σ_{k,h=1}^n c_h c_k λ_k δ_{hk} = Σ_{k=1}^n c_k^2 λ_k ≤ λ_1 Σ_{k=1}^n c_k^2 = λ_1.

The second statement can be proven similarly, considering the restrictions to the subsequent orthogonal complements.

Definition 2.4. With the same notations as before, we call the variable α_k^t X the k-th population principal component (abbreviated PC) of X. Furthermore, the eigenvectors α_k will be called loadings.

The term population is needed to distinguish these principal components from the sample principal components, which we will discuss later. Let us observe that the original problem of finding the α_k, originally a maximization problem, has been converted into a simpler problem of linear algebra.

2.3 Algebraic and Geometric Properties of the PCs

Now that we have the definition of the PCs, we can try to justify the optimality we talked about at the beginning of this chapter. In particular, the point of this section is to prove the "minimizing" properties we heuristically stated before, and we will do that by listing some of the fundamental properties of the principal components.

First, let us fix the notation of this section once and for all. X will be a random vector in R^n, and M its covariance matrix. The eigenvalues of M will be denoted λ_1 > λ_2 > ... > λ_n > 0, and the respective unit eigenvectors will be α_1, ..., α_n. At last, A will be the matrix whose k-th column is α_k, so that

A^t M A = Λ, with Λ_{i,j} = λ_i δ_{i,j}.

Proposition 2.5. The PCs define the principal axes of the ellipsoids

x^t M^{-1} x = const.

Proof. If we denote z = A^t x, then by orthogonality of A we have x = Az, so:

const = x^t M^{-1} x = (Az)^t M^{-1} (Az) = z^t A^t M^{-1} A z = z^t (A^t M A)^{-1} z = z^t Λ^{-1} z,

where we have used the fact that A^{-1} = A^t. This last equation can be rewritten as

Σ_k z_k^2 / λ_k = const,

which defines an ellipsoid whose principal axes are the directions spanned by the eigenvectors of Λ^{-1}. To conclude, we only have to note that the eigenvectors of M and of M^{-1} are the same.

These facts have a clear probabilistic meaning, at least under stronger hypotheses on X.

Corollary 2.6. If X is a Gaussian random vector, then the equations of the ellipsoids

x^t M^{-1} x = const

define contours of constant probability for the distribution of X.

Proof. In a Gaussian vector, the contours of constant probability are uniquely determined by the covariance matrix.

Proposition 2.7. Let A_m ∈ R^{n×m} be the matrix whose columns are the first m columns of A. Then the random vector

Y = A_m^t X

maximizes the variance, that is, tr(Cov(Y)) is the maximum possible variance among all orthonormal linear transformations in R^{n×m}.


Proof. Let B ∈ R^{n×m} be another orthonormal linear transformation; our aim is to prove that

tr(Cov(Z)) ≤ tr(Cov(Y)), where Z = B^t X.

Since the columns α_k of A form a basis for R^n, we can write

β_k = Σ_j c_{jk} α_j

for each column β_k of B. In other words, if C = (c_{jk}) ∈ R^{n×m}, we have B = AC and then

Cov(Z) = B^t M B = C^t A^t M A C = C^t Λ C.

From that we can compute

tr(Cov(Z)) = tr(C^t Λ C) = Σ_{j,k} λ_j c_{jk}^2 = Σ_{j=1}^n (Σ_{k=1}^m c_{jk}^2) λ_j.

At this point, note that C is orthonormal; in fact

C^t C = B^t A A^t B = B^t B = I_m,

and as a consequence we have

tr(C^t C) = Σ_{j,k} c_{jk}^2 = tr(I_m) = m, and Σ_{k=1}^m c_{jk}^2 ≤ 1 for each j.

Indeed, the first equation is obvious, while the inequality follows by completing the matrix C to an orthonormal square n × n matrix, and by recalling that the sum of squares over each row of an orthonormal matrix is equal to one.

From the last two equations and from the fact that λ_1 > λ_2 > ... > λ_n, we immediately deduce that the quantity

Σ_{j=1}^n (Σ_{k=1}^m c_{jk}^2) λ_j

is maximized when

Σ_{k=1}^m c_{jk}^2 = 1 for j = 1, ..., m and 0 for j = m + 1, ..., n,

and this is exactly the case when B = A_m.


The last proposition is in line with the considerations we made at the beginning of the chapter, when we wanted to maximize, in some sense, the size of Cov(Y), with Y a projection of X onto a k-dimensional subspace. Here the size of Cov(Y) is measured by its trace.

Indeed, with a similar proof we can also show the following:

Proposition 2.8. Let Â_m ∈ R^{n×m} be the matrix whose columns are the last m columns of A. Then the random vector

Ŷ = Â_m^t X

minimizes the variance, that is, tr(Cov(Ŷ)) is the minimum variance among all orthonormal linear transformations.

Remark 2.9. This proposition is a formalization of what we said before: if we take the first k principal components, then the error in the estimation of X can be written in terms of the last n − k components, and with this decomposition the error has the smallest size (in the precise sense that the trace of the covariance matrix of the error is minimum).

With the following proposition, we’ll take another measure of the size of a covariance matrix, and we’ll prove that the principal components still give the best k-dimensional approximation of X. First, let’s recall that the generalized variance of a random vector Y is defined as det(Cov(Y )).

Proposition 2.10. With the same notations as in 2.7, the random vector Y maximizes the generalized variance, that is, det(Cov(Y)) is maximal among all orthonormal linear transformations.

Before beginning the proof, we need the following characterization, which can be seen as a generalization of Lemma 2.2:

Lemma 2.11. If S_k = Span(α_1, ..., α_{k−1})^⊥ and Ŝ_k = Span(α_{k+1}, ..., α_n)^⊥, then

λ_k = sup_{α ∈ S_k \ {0}} (α^t M α)/(α^t α) = inf_{α ∈ Ŝ_k \ {0}} (α^t M α)/(α^t α).

Proof. The first equality follows directly from Lemma 2.2. The second follows from it indirectly, by taking the matrix M^{-1} instead of M, and observing that its eigenvalues are the reciprocals of those of M, while the eigenvectors are the same.

Proof of 2.10. Let B ∈ R^{n×m} be another orthonormal linear transformation and Z = B^t X. Then

det(Cov(Z)) = det(B^t M B) ≤ det(A_m^t M A_m) = ∏_{k=1}^m λ_k.

Indeed, it is immediate to verify that the eigenvalues of A_m^t M A_m are λ_1, ..., λ_m, since

A_m^t M A_m e_k = A_m^t M α_k = λ_k A_m^t α_k = λ_k e_k

for every k = 1, ..., m. Thus it is sufficient to prove that, if μ_1 > μ_2 > ... > μ_m are the eigenvalues of B^t M B, then

μ_k ≤ λ_k for 1 ≤ k ≤ m.

Let γ_k be an eigenvector of μ_k, for each k. Then, if T_k = Span(γ_{k+1}, ..., γ_m)^⊥ ⊂ R^m and S_k = Span(α_1, ..., α_{k−1})^⊥, we have, by the second relation of the previous lemma,

(γ^t B^t M B γ)/(γ^t γ) ≥ μ_k

for any non-zero vector γ ∈ T_k. Consider the subspace S̃_k = B(T_k). Since B is injective, we have dim(S̃_k) = dim(T_k) = k. The subspaces S_k and S̃_k must intersect non-trivially, by Grassmann's formula. Hence there is a non-zero vector α ∈ S_k ∩ S̃_k, with α = Bγ, and thus

μ_k ≤ (γ^t B^t M B γ)/(γ^t γ) = (γ^t B^t M B γ)/(γ^t B^t B γ) = (α^t M α)/(α^t α) ≤ λ_k.

We now show another optimality criterion typical of the principal components.

Proposition 2.12. Let X_1 and X_2 be independent identically distributed random vectors in R^n, and A_m ∈ R^{n×m} the matrix of their first m principal components, as usual. Then the quantity E[‖Y_1 − Y_2‖^2] = E[‖B^t(X_1 − X_2)‖^2], where Y_i = B^t X_i and B is a generic orthonormal linear transformation from R^n to R^m, is maximized by taking B = A_m.

Proof. Let the common expectation of X_1 and X_2 be μ, and their covariance matrix M. Take an orthonormal linear transformation B; then Z_1 = B^t X_1 and Z_2 = B^t X_2 have the same distribution, with expectation B^t μ and covariance B^t M B. Moreover we have

E[‖B^t(X_1 − X_2)‖^2] = E[(Z_1 − Z_2)^t (Z_1 − Z_2)]
= E[((Z_1 − B^t μ) − (Z_2 − B^t μ))^t ((Z_1 − B^t μ) − (Z_2 − B^t μ))]
= E[(Z_1 − B^t μ)^t (Z_1 − B^t μ)] − 2 E[(Z_1 − B^t μ)^t (Z_2 − B^t μ)] + E[(Z_2 − B^t μ)^t (Z_2 − B^t μ)],

and the second term vanishes because X_1 and X_2, and hence Z_1 and Z_2, are independent. Now, for i = 1, 2 we can write

E[(Z_i − B^t μ)^t (Z_i − B^t μ)] = E[tr((Z_i − B^t μ)^t (Z_i − B^t μ))] = E[tr((Z_i − B^t μ)(Z_i − B^t μ)^t)] = tr(E[(Z_i − B^t μ)(Z_i − B^t μ)^t]) = tr(Cov(Z_i)) = tr(B^t M B).

Note that for the second equality we used that tr(PQ) = tr(QP). We have thus obtained that

E[‖B^t(X_1 − X_2)‖^2] = 2 tr(B^t M B),

and, by Proposition 2.7, we know that this quantity is maximized when B = A_m. Hence the proposition is proved.

The last proposition has a clear geometric interpretation: the expected squared distance between the projections of two independent identically distributed random vectors of n variables onto a subspace of dimension m is maximized when this subspace is the one spanned by the first m principal components.

PCs of the Sample Covariance Matrix

The main problem with the previous section is that we need to know the law (or at least the covariance matrix) of a random vector X in order to compute its PCs. However, in statistics, what we would like to find is precisely the law of X. The idea of this section is thus to translate everything discussed before into a statistical context. Our setting changes as follows: instead of a single random vector X, we have a set of N independent observations X_1, ..., X_N of the same vector X, whose law is unknown, X_k = (X_{k1}, ..., X_{kn})^t for k = 1, ..., N. Since we do not know the covariance of X, we replace it with the sample covariance matrix S:

S_{ij} = (1/(N−1)) Σ_{k=1}^N (X_{ki} − X̄_i)(X_{kj} − X̄_j), i.e. S = (1/(N−1)) Ψ^t Ψ,

where X̄_j = (Σ_{i=1}^N X_{ij})/N, for j = 1, ..., n, is the sample mean and Ψ ∈ R^{N×n} is the matrix of data, Ψ_{ij} = X_{ij} − X̄_j.

Remark 2.13. The choice of the sample mean and covariance as estimators, respectively, of the mean and of the covariance of X is justified by theoretical results we briefly recall.

If we have a sequence X_1, ..., X_N, ... of observations of X (which in this context we must interpret as a sequence of independent random vectors with the same distribution as X) satisfying a moment condition, then we know, by the strong law of large numbers, that

X̄_N → E[X] almost surely as N → ∞,
S_N → Cov(X) almost surely as N → ∞,

where X̄_N and S_N are respectively the sample mean and covariance of the first N observations. So X̄_N and S_N, for N sufficiently large, are close enough to the true mean and covariance of X.

The hypothesis of independence of the observations can be relaxed. In fact, we will see later that it is sufficient to suppose that the X_k form a stationary sequence.

Returning to the principal component analysis of an unknown random vector, we could start with the same arguments as in the previous section, but instead of finding the maximum of Var(α^t X) over {α^t α = 1}, we now have to maximize the sample variance

(1/(N−1)) Σ_{i=1}^N (α^t X_i − α^t X̄)^2

with the same restriction {α^t α = 1}. Then we iterate the process in the same way as before, obtaining n orthogonal unit vectors α_1, ..., α_n.

Definition 2.14. The sample principal components associated to N independent observations X_1, ..., X_N of the random vector X are the random variables α_1^t X, ..., α_n^t X.

Obviously, we cannot compute the law of the sample PCs, but we do have their scores

Z_{ik} = α_k^t X_i.

Note that if we expand the sample variance of α^t X = Σ_j α_j X_j, we obtain:

(1/(N−1)) Σ_{i=1}^N (α^t X_i − α^t X̄)^2 = (1/(N−1)) Σ_{i=1}^N (Σ_{j=1}^n α_j X_{ij} − α_j X̄_j)^2
= (1/(N−1)) Σ_{i=1}^N Σ_{j,k=1}^n α_j α_k (X_{ij} − X̄_j)(X_{ik} − X̄_k)
= (1/(N−1)) Σ_{j,k=1}^n α_j α_k (Σ_{i=1}^N (X_{ij} − X̄_j)(X_{ik} − X̄_k)) = α^t S α,

where S is the sample covariance. The problem can thus again be reduced to a problem of linear algebra, and the results proved in the previous section also hold in this case. In particular:

Proposition 2.15. If α_k^t X are the sample principal components, then the α_k are the eigenvectors of S, with corresponding eigenvalues λ_1 > λ_2 > ... > λ_n > 0.

Remark 2.16. Similarly to the discussion in the previous section, we have implicitly supposed that S is a positive definite matrix with simple eigenvalues. Again, these hypotheses are not too strong, and we will assume them for the rest of the chapter.

The properties of the PCs listed in the previous section can be generalized to the sample PCs. Some of these generalizations are obvious; here we report those which require a non-trivial proof.

Proposition 2.17. Let X_1, ..., X_N be observations of a random vector X ∈ R^n and, with the same notations as in the previous sections, A_m ∈ R^{n×m} the matrix whose columns are the first m principal components of the sample covariance matrix S. Then A_m minimizes the sum of squared distances between the X_i and their projections through A_m, among all the orthonormal linear transformations in R^{n×m}.

Proof. Let B ∈ Rn×m another linear orthonormal transformation, and Z

i = BtXi

for each i = 1, ..., N . Let Vi ∈ Rn be the position of Zi in terms of the original

coordinates in Rn, and Ei = Xi− Vi. Thus the quantity we want to minimize is N X i=1 kEik2 = N X i=1 EitEi.

But, since kEik2 = kXik2− kVik2, we equivalently need to maximize N X i=1 VitVi = N X i=1 ZitZi

where the equality follows from the fact that the Zi are obtained from the Vi by an

(20)

N X i=1 ZitZi = X i XitBBtXi = tr( X i XitBBtXi) = X i tr(XitBBtXi) = X i tr(BtXiXitB) = = tr(Bt(X i XiXit)B) = tr(B tΨtΨB) = (N − 1)tr(BtSB)

and, by the Proposition 2.7, this quantity is maximized when B = Am.

This proposition can be viewed as an equivalent way of defining the principal components. Indeed, they can be defined as the linear functions which successively define subspaces of dimension 1, 2, ..., n−1 for which the sum of squared perpendicular distances of the observations X_1, ..., X_N from the subspace is minimized.

Proposition 2.18. Let Ψ ∈ R^{N×n} be the matrix of data, Ψ_{ij} = X_{ij} − X̄_j, and A_m the matrix of the first m eigenvectors defined as above. Suppose that Ψ has rank n. Then the quantity

‖ΦΦ^t − ΨΨ^t‖,

where Φ ∈ R^{N×m} is the matrix of data projected onto a subspace through an orthonormal linear transformation and the norm chosen is the Euclidean norm, is minimized by A_m.

Remark 2.19. Let us say in a few words what this proposition means. The matrix ΨΨ^t is such that its i-th diagonal element is

Σ_j (X_{ij} − X̄_j)^2,

which is simply the squared distance of X_i from the center of gravity X̄ of the observations X_1, ..., X_N. The off-diagonal elements are

Σ_j (X_{hj} − X̄_j)(X_{ij} − X̄_j),

which can be seen as the cosine of the angle between the lines joining X_h and X_i to X̄, multiplied by their distances from X̄.

In other words, the matrix ΨΨ^t gives information about the position of the points X_1, ..., X_N with respect to their empirical average, and thus the quantity

‖ΦΦ^t − ΨΨ^t‖

is a measure of the distortion introduced by the projection onto the subspace associated to Φ.


Before proving the assertion, we need to state a technical result of linear algebra.

Lemma 2.20. Suppose we have a symmetric (positive definite, with simple eigenvalues) matrix F ∈ R^{n×n}. Then the minimum of

‖F − G‖

over all the symmetric matrices G of rank m < n is reached when G has the same first m eigenvalues and eigenvectors as F.

Proof of 2.18. Let B be another orthonormal projection onto an m-dimensional subspace of R^n. By definition, we have Φ = ΨB, so

‖ΦΦ^t − ΨΨ^t‖ = ‖ΨBB^tΨ^t − ΨΨ^t‖.

By the lemma, the minimum is reached when ΨBB^tΨ^t has the same first m eigenvalues and eigenvectors as ΨΨ^t. Let us compute them. We know the eigenvectors and eigenvalues of Ψ^tΨ: they are α_k and λ_k respectively, with ‖α_k‖ = 1. Then it is immediate to find that λ_k and (1/√λ_k) Ψα_k, for k = 1, ..., n, are the eigenvalues and (unit) eigenvectors of ΨΨ^t (the remaining eigenvalues are zero). So the first m of them must also be eigenvalues and eigenvectors of ΨBB^tΨ^t. But then the α_k are also eigenvectors of BB^t. Indeed,

BB^t α_k = (Ψ^tΨ)^{-1} Ψ^tΨ BB^t Ψ^tΨ (Ψ^tΨ)^{-1} α_k = (1/λ_k)(Ψ^tΨ)^{-1} Ψ^tΨ BB^t Ψ^tΨ α_k = (1/λ_k)(Ψ^tΨ)^{-1} Ψ^t (ΨBB^tΨ^t)(Ψα_k) = (Ψ^tΨ)^{-1} Ψ^tΨ α_k = α_k.

Another way of seeing this fact is that the matrix B is minimizing if and only if it projects R^n onto Span(α_1, ..., α_m). The matrix A_m satisfies this hypothesis, and thus the thesis is proved.

2.4 Application of PCA to the FRC models

The PCA has an excellent application in the approximation of the forward rate curve. The reason is that the first three PCs usually explain more than 99% of the variance, as we shall see soon. This suggests that the effective dimension of the space of all possible FRCs is three.

A theoretical implication of this observation is a justification for the study of certain parametric estimations of the yield curve, such as the Nelson-Siegel or the Svensson parametrization.


Figure 2.1: Variance of the principal components of the U.S. Yield Curve.

2.4.1 PCA of the yield curve

As an application of the theory discussed before, we come back to the study of the FRC models, showing that in the most relevant cases PCA gives an excellent approximation of the yield curve of interest rates. To this purpose, we have taken data on the U.S. yield curve from the U.S. Department of the Treasury, and we have analyzed them using the statistical software R.

The data span from 09/02/2006 to 30/12/2016 and consist of all the daily values of the yield curve for the following maturities: 1, 3 and 6 months, and 1, 2, 3, 5, 7, 10, 20, 30 years. Some of these data were not available, especially at the long maturities for the oldest dates. However, the missing data are so few that their absence can practically be ignored.

We have organized the data in a matrix with 11 columns and 2725 rows, each row representing a daily observation. Using the notations of the previous sections, we have observations X_1, ..., X_N with N = 2725. Each X_k is a vector in R^11, and the matrix of data is Ψ ∈ R^{2725×11}.
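The thesis carried out this computation in R; the following Python sketch (with a hypothetical file name and layout) shows the equivalent steps:

import numpy as np

# Hypothetical input: a 2725 x 11 array of daily U.S. yield curve values,
# one column per maturity (1m, 3m, 6m, 1y, ..., 30y).
data = np.loadtxt("us_yield_curve.csv", delimiter=",")

Psi = data - data.mean(axis=0)         # center each maturity column
S = Psi.T @ Psi / (len(data) - 1)      # 11 x 11 sample covariance matrix

lam, alpha = np.linalg.eigh(S)
order = np.argsort(lam)[::-1]
lam, alpha = lam[order], alpha[:, order]

# Standard deviation, proportion and cumulative proportion of variance
# for each component, as reported in Table 2.1 below.
sd = np.sqrt(lam)
prop = lam / lam.sum()
cum = np.cumsum(prop)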

We computed the principal components of the sample covariance matrix S obtained from these observations. The variances of each PC are described in the histogram of Fig. 2.1.

Indeed, a deeper study of the variance of each PC gives Table 2.1. The k-th row corresponds to the k-th PC. The first column reports the standard deviation of each PC, that is,

SD(α_k^t X) = √Var(α_k^t X).


         Standard deviation   Proportion of Variance   Cumulative Proportion
Comp.1   4.5837397            0.9344436                0.9344436
Comp.2   1.13074701           0.05686487               0.99130844
Comp.3   0.3863012            0.0066369                0.9979453
Comp.4   0.170060438          0.001286233              0.999231574
Comp.5   0.0822923443         0.0003011841             0.9995327582
Comp.6   0.0769479797         0.0002633344             0.9997960926
Comp.7   0.048376607          0.000104084              0.999900177
Comp.8   2.683453e-02         3.202589e-05             9.999322e-01
Comp.9   2.635332e-02         3.088758e-05             9.999631e-01
Comp.10  2.106229e-02         1.972988e-05             9.999828e-01
Comp.11  1.965418e-02         1.717999e-05             1.000000e+00

Table 2.1: Analysis of variance of the principal components.

The second column shows the proportion of variance of each PC relative to the global variance. Finally, the third column indicates the cumulative proportion of variance; in other words, the k-th row contains the sum of the first k proportions, that is, the information we preserve if we forget the last n − k components.

Note in particular that the first loading explains 93.4% of the global variance, while the first three loadings explain over 99.9% of it. This shows that the effective dimension of the space of all yield curves is three, as announced at the beginning of this section.

In Fig. 2.2 we plotted the first four loadings as functions of the time to maturity, linearly interpolating the values between two consecutive maturities. Looking at them, we can make the following statistical considerations:

• the first loading is almost constant. Indeed, it can be seen as the average yield curve over the maturities. Note that it is always negative, even though we would expect something always positive: the reason is that each loading is defined up to a change of sign, as we remarked in 2.3;
• the second loading is monotone increasing. It represents the upward trend (or the downward trend if the component is negative) in the yield;
• the third loading has a curious convex behavior, which suggests that it represents the curvature of the yield curve;
• the fourth loading has an irregular behaviour, and it is not easy to give an economic interpretation of it. In fact, it is usually considered as noise. We remark that this component already gives very little information about the yield curve, its contribution to the total variance being extremely low.

Figure 2.2: Plots of the first four loadings of the Yield Curve.

2.4.2 PCA of the daily changes

In the study of the principal components of the yield curve, there is a theoretical problem we have to consider. The soundness of using the sample covariance matrix

S_{ij} = (1/(N−1)) Σ_{k=1}^N (X_{ki} − X̄_i)(X_{kj} − X̄_j)

as an estimator stems from Remark 2.13: in a few words, S gives a good approximation of Cov(X) (where X, as usual, is the random vector to be estimated) when N is sufficiently large.

However, this holds only if the strong law of large numbers holds too, which in turn requires some hypotheses on the sequence X_k. The most classical of these hypotheses are finite mean, independence and identical distribution; but in our situation the assumption of independence of the observations is quite unrealistic. Indeed, it would mean that the interest rates at time t are independent from those at time t′ for t ≠ t′, which is unreasonable.


Definition 2.21. A sequence X_k is stationary if its joint laws depend only on the differences of the times, that is,

(X_{k_1}, ..., X_{k_r}) has the same law as (X_{k_1+h}, ..., X_{k_r+h})

for each h > 0 and for each r-tuple of integers (k_1, ..., k_r).

Under this additional hypothesis, it is a non-trivial result of probability theory that we still have convergence of the sample covariance matrix S to the covariance matrix of X.

Therefore, if we really want to study the yield curve, a more reasonable (in the sense of theoretically more correct) approach is to study its increments. Thus we now repeat the study made before, but instead of the daily data we take the daily increments as observations:

δY(t, T) = Y(t + 1, T) − Y(t, T),

where 1 day is taken as the unit of time.
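In code, the switch from levels to daily increments is one line (continuing the hypothetical Python sketch above):

import numpy as np

data = np.loadtxt("us_yield_curve.csv", delimiter=",")  # hypothetical file, as above

# Daily increments: delta Y(t, T) = Y(t+1, T) - Y(t, T); one row fewer.
increments = np.diff(data, axis=0)

Psi_delta = increments - increments.mean(axis=0)
S_delta = Psi_delta.T @ Psi_delta / (len(increments) - 1)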

Remark 2.22. We could ask whether the hypothesis of stationarity of the daily increments is reasonable. It means that Y(t + 1, T) − Y(t, T) has the same law as Y(t + h + 1, T) − Y(t + h, T), i.e. the law of the increments does not depend on the day, but only on the interval of time. This is a very sensible supposition, at least if we consider intervals of time small enough to regard the laws of the market as constant in time.

Let us then study the matrix Ψ_δ ∈ R^{2724×11} of the daily increments, and compute the principal components associated to its sample covariance matrix.

In Fig. 2.3 we plotted the variance of each principal component. Note that the first four components already contain most of the information, but not in such an overwhelming way as in the previous study. In particular, the first four loadings explain only 96.3% of the total variance, instead of the 99.9% explained by the principal components of the yield curve.

Figure 2.3: Variance of the principal components of daily changes of the U.S. Yield Curve.

Conversely, the plot of the first four loadings is quite similar (see Fig. 2.4):

• the first and most important loading still presents little variation as a function of the time to maturity. It does not represent the average yield anymore; instead, it is usually interpreted as a parallel shift of the actual yield curve;
• the second loading has a similar monotone trend, but this time it is decreasing. Its impact on the yield curve is interpreted in a similar way as in the previous section;
• the third loading has the same shape as the third loading of the yield curve, and indeed it is interpreted in the same way;
• the fourth loading is noise, and its behavior is relatively wide.

In conclusion, the principal components of the daily changes are slightly less expressive than those given by the study of the daily curve, even if they give analogous results. On the other hand, these results are justified by more reasonable theoretical assumptions, which give more sense to the use of the sample covariance matrix. This is the reason why the daily changes of the yield curve are usually studied more often than the yield curve itself.


Chapter 3

A Phenomenological Approach

In this chapter we follow a different approach to build a realistic model for the forward rate curve (FRC). The starting point is to collect empirical data of past FRCs; as in the previous chapter, we take them from the U.S. Treasury. Next we compute some derived processes which will let us better understand the nature of the forward rate itself. For this aim, the Principal Component Analysis introduced in the last chapter gives us great help, because it lets us decompose the FRC as a linear sum of three factors, each one with a precise financial meaning that we will try to deduce at the end of the chapter.

The main advantage of this three-factor model is its simplicity, along with the fact that it matches the real data we started with perfectly, for the tautological reason that we built the model starting from them. Moreover, this model can be seen as a factor model, as we will note in the next chapter, when we define this kind of model. For this reason, it is relatively easy to extrapolate some pricing formulas.

This approach was first introduced by Bouchaud et al. in their paper Phenomenology of the interest rate curve [Bou99]. They studied the daily prices of Eurodollar futures contracts from 1990 to 1996, for times to maturity of less than 8 years. Thus, our aim will fundamentally be to understand whether the model they proposed is effectively applicable to other, more recent markets, trying also to understand what happens for a larger interval of times to maturity.

Collection of data

As mentioned before, the data we analyze have been taken from the U.S. Treasury and span from the 10th October 2006 to the 27th January 2017, covering all the 2746 daily values of the FRC. Note that the interval of time considered is almost the same as in the previous chapter, but this time we directly study the forward rate curve instead of the yield curve. This is theoretically equivalent, but it will be necessary, since what we want is to find a model for the forward rate. The times to maturity considered are θ = 1, ..., 30 years. As done for the yield curve in the previous chapter, we collect the data in a 2746 × 30 matrix (which we have called DATA).

Note that we could easily repeat the procedure carried out before, that is, compute the principal components of the matrix DATA, obtaining an excellent three- or four-dimensional linear approximation of the FRC. However, this decomposition, although particularly easy, is not so expressive, and it is quite difficult to give each of the factors a precise financial meaning.

Conversely, with the following approach we will obtain another three-factor decomposition, qualitatively quite similar, slightly less strong than that one, but much more expressive from a financial point of view.

Anyway, it is important to remark that we will not forget the theory of principal components. Indeed, it will be crucial in order to find a good decomposition which approximates the forward rate curve well.

3.1 Postulation of a behaviour

We first postulate the following behaviour for the forward rate:

f(t, t + θ) = r(t) + s(t) Y(θ)/Y(θ_max) + ξ(t, θ)    (3.1)

with (t, t + θ) ∈ I × Ω, where I is the set of the 2746 dates for which we have collected data, while Ω = {1, ..., 30} is the set of the times to maturity, measured in years. Furthermore, the following notations are used:

• r(t) is the spot rate;
• θ_max is the longest available time to maturity and θ_min is the shortest one (thus θ_min = 1 and θ_max = 30);
• s(t) = f(t, t + θ_max) − f(t, t + θ_min) is the maximum spread;
• Y is a real deterministic function;
• ξ(t, θ) represents the deviation from the empirical average curve over time t. Keeping that in mind, we assume that the time average of ξ is equal to zero:

∫_I ξ(t, θ) dt = 0.


Since the measure spaces we are working with are all discrete, here all the integrals are in practice finite sums. We could switch to discrete notation, but for the sake of convenience, and to preserve continuity with the first chapter, we keep the notation used for continuous spaces. Furthermore, note that we have changed notation, using the time to maturity θ = T − t instead of the maturity T; this will be much more convenient in what follows.

Our fundamental aim will be to deduce the following model:

f(t, t + θ) ≈ r(t) + s(t) √(θ/θ_max) + ψ_1(t) α_1(θ).

In other words, we will show that, at least in the U.S. market, the function Y(θ) = √θ and a one-dimensional linear approximation ψ_1(t) α_1(θ) of ξ fit the empirical data given by our analysis very well.

Remark 3.1. Let us remember that r(t) appears in the formulas for mathematical convenience. In the market we are considering, the best approximation of r(t) is given by the FRC at the shortest maturity, so that

r(t) ≈ f(t, t + θ_min),

where we recall that θ_min = 1 year.

In Figure 3.1 we plotted the spot rate r(t) and the maximum spread s(t) as functions of t. In particular, we can note that the maximum spread s(t) is not always positive, although that might seem a natural assumption: positivity would reflect the fact that longer maturities should usually be worth more than shorter ones. The opposite happens in situations where the market expects interest rates to have reached a high. This is evidently the condition in which the market lives in the first part of the interval of time analyzed (from 2007 to 2010).

Figure 3.1: Maximum spread (in blue) and spot rate (in black).

Using the fact that, by hypothesis, ∫_I ξ(t, θ) dt = 0, we can find an easy way to compute Y(θ). Indeed, by equation 3.1 we have:

∫_I (f(t, t + θ) − r(t)) dt = (Y(θ)/Y(θ_max)) ∫_I s(t) dt + ∫_I ξ(t, θ) dt,

from which

Y(θ)/Y(θ_max) = ∫_I (f(t, t + θ) − r(t)) dt / ∫_I s(t) dt = ∫_I (f(t, t + θ) − f(t, t + θ_min)) dt / ∫_I (f(t, t + θ_max) − f(t, t + θ_min)) dt.

In Figure 3.2 we plotted this curve as a function of the time to maturity. In particular, if we pay attention to the shortest maturities (i.e. θ ≤ 10 years), we note a shape that fits the curve √θ quite well. In fact, this result does not surprise us, since it was already obtained by Bouchaud in their paper. For longer maturities, the curve √θ does not fit the observations anymore, and this shows that the model given by Bouchaud is consistent only for short maturities.
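A Python sketch of this empirical average (assuming a hypothetical file holding the 2746 × 30 DATA matrix described above, columns indexed by θ = 1, ..., 30):

import numpy as np

DATA = np.loadtxt("us_frc.csv", delimiter=",")   # hypothetical file: f(t, t + theta)

r = DATA[:, 0]                       # spot rate proxy: shortest maturity (1 year)
s = DATA[:, -1] - DATA[:, 0]         # maximum spread s(t)

# Time averages play the role of the integrals over I.
Y_ratio = (DATA - r[:, None]).mean(axis=0) / s.mean()   # Y(theta) / Y(theta_max)

# Residual xi(t, theta); its time average is zero by construction.
xi = DATA - r[:, None] - np.outer(s, Y_ratio)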

Once the function Y(θ) has been found, we can also study ξ(t, θ). In Figure 3.3 we plotted the standard deviation over time,

SD(ξ(·, θ)) = √Var(ξ(·, θ)) = √((1/|I|) ∫_I ξ^2(t, θ) dt),

as a function of the time to maturity θ.

Note that the maximum is reached at time to maturity θ ≈ 10 years; then the curve decreases, until it reaches the value zero for the longest maturity (30 years), which is also the value reached for the shortest maturity (1 year). This behavior at the extreme maturities should not surprise us, since it follows from our hypotheses. Indeed,

f(t, t + θ_max) = r(t) + s(t) Y(θ_max)/Y(θ_max) + ξ(t, θ_max)  ⇒  f(t, t + θ_max) − r(t) = s(t) + ξ(t, θ_max),

from which ξ(t, θ_max) = 0, by definition of s(t). Similarly, we also have

f(t, t + θ_min) = r(t) = r(t) + s(t) Y(θ_min)/Y(θ_max) + ξ(t, θ_min),

and thus ξ(t, θ_min) = 0, since Y(θ_min) = 0.

Figure 3.2: Time average spread as a function of the time to maturity.

         Standard deviation   Proportion of Variance   Cumulative Proportion
Comp.1   6.7874447            0.9649524                0.9649524
Comp.2   1.19599741           0.02996082               0.99491320
Comp.3   0.476842377          0.004762587              0.999675787
Comp.4   0.1204742328         0.0003040056             0.9999797927
Comp.5   2.886327e-02         1.744956e-05             9.999972e-01
Comp.6   1.119013e-02         2.622792e-06             9.999999e-01
Comp.7   2.482844e-03         1.291196e-07             1.000000e+00
Comp.8   5.034260e-04         5.308412e-09             1.000000e+00
Comp.9   9.321440e-05         1.819949e-10             1.000000e+00
Comp.10  3.443950e-05         2.484316e-11             1.000000e+00

Table 3.1: Analysis of variance of the first principal components for ξ.

This humped behaviour of the standard deviation is quite typical. For this reason, in the last chapter we will examine a model which artificially reproduces the hump, in order to fit reality better.

A deeper study of ξ is made by computing the principal components of the sample covariance matrix given by the observations ξ(t, ·), for t ranging over all the considered trading days. Note that this matrix is equal, up to a factor N − 1, to the matrix M whose (i, j) entry is

M_{i,j} = ∫_I ξ(t, θ_i) ξ(t, θ_j) dt

for 1 ≤ i, j ≤ 30, since the time average of ξ is equal to zero.

The proportion of variance of the first ten principal components of M is shown in the histogram of Figure 3.4 and in Table 3.1.

Looking at the data obtained, it is clear that the only relevant principal components are the first two (having a cumulative proportion of variance higher than 99%), and already the first loading alone explains most of the global variation (96.5%).

Figure 3.4: Principal component analysis for ξ(t, θ).

For a closer study of the loadings, we plotted the first four of them in Figure 3.5. For our purposes, only the first is really interesting: note that it has a convex behavior that partially recalls the third loading of the yield curve studied in the previous chapter. In fact, this suggests that the decomposition as a sum of three factors (the spot rate r(t), the maximum spread s(t) multiplied by the function Y(θ), and the error ξ(t, θ)) is, at least qualitatively, similar to the one given by the principal component analysis applied to the data of the daily forward rate curves. We will try to explain this statement better in the next section.

Statistics of the daily shifts

As done in the previous chapter, we now want to study the daily increments δf (t, T ) = f (t + 1, T ) − f (t, T ) of the forward rates. To do that, we need the following definitions.

Definition 3.2. The volatility of the daily increments is defined as:

σ(θ) = √((1/|I|) ∫_I δf(t, t + θ)^2 dt).

Definition 3.3. The spread correlation function is defined as:

C(θ) = (1/(σ^2(θ_min) |I|)) ∫_I δf(t, t + θ_min)(δf(t, t + θ) − δf(t, t + θ_min)) dt.

In other words, recalling that f(t, t + θ_min) ≈ r(t) is the spot rate, C(θ) measures the correlation between the spot rate shifts and the (partial) spread shifts at time to maturity θ.
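Both quantities are straightforward to compute from the data matrix (continuing the hypothetical Python sketch, with DATA as above):

import numpy as np

DATA = np.loadtxt("us_frc.csv", delimiter=",")   # hypothetical file, as above
df = np.diff(DATA, axis=0)                       # daily shifts delta f(t, t + theta)

# Volatility of the daily increments, one value per time to maturity.
sigma = np.sqrt((df ** 2).mean(axis=0))

# Spread correlation function of Definition 3.3.
d_spot = df[:, 0]                                # shifts of the shortest maturity
C = (d_spot[:, None] * (df - d_spot[:, None])).mean(axis=0) / sigma[0] ** 2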

Figure 3.5: Plots of the first four loadings of the sample covariance matrix of ξ(t, θ).

The volatility σ(θ) is plotted in Figure 3.6. We can see that it has a maximum at around θ = 5 years, then decreases, and for θ ≥ 20 it starts to grow again until it reaches its absolute maximum at the largest maturity. This means that the middle maturities (4-8 years) and the longest ones (22-30 years) are the most subject to an uncertain evolution.

In Figure 3.7 we plotted the spread correlation function C(θ). We can see a positive correlation for the shortest maturities, then a rapid decrease. At around θ = 5 years the spread increments seem to be uncorrelated (at least to a first approximation) with the short rate ones, a result already shown by Bouchaud. Furthermore, our analysis shows that for longer maturities the correlation tends to become strongly negative. This means that the evolutions of the spot rate and of the spread are negatively correlated when the maturities are long.

3.2 A three-factor model for the FRC

Let us come back to equation 3.1:

f(t, t + θ) = r(t) + s(t) Y(θ)/Y(θ_max) + ξ(t, θ).

Figure 3.6: Volatility of the daily increments as a function of the time to maturity.

We have just seen that ξ(t, θ) can be linearly approximated using only the first loading. Indeed, if we regard ξ(t) as the vector in R^30 of the error, whose coordinates are indexed by the times to maturity, then by the theory of principal components we can write, for each t,

ξ(t) = Σ_{k=1}^{30} ψ_k(t) α_k,

where α_k is the k-th loading; then for each θ (i.e. for each component of ξ) we have

ξ(t, θ) = Σ_{k=1}^{30} ψ_k(t) α_k(θ) ≈ ψ_1(t) α_1(θ)

up to a small error of less than 4% of the total variance. We thus have a particularly simple factorization of the FRC as a sum of three quantities:

f(t, t + θ) ≈ r(t) + s(t) Y(θ)/Y(θ_max) + ψ_1(t) α_1(θ).    (3.2)
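Continuing the hypothetical Python sketch (with DATA and the hypothetical file name as before), the three-factor reconstruction and its residual can be checked directly:

import numpy as np

DATA = np.loadtxt("us_frc.csv", delimiter=",")          # hypothetical file, as above
r = DATA[:, 0]
s = DATA[:, -1] - DATA[:, 0]
Y_ratio = (DATA - r[:, None]).mean(axis=0) / s.mean()
xi = DATA - r[:, None] - np.outer(s, Y_ratio)

lam, alpha = np.linalg.eigh(np.cov(xi, rowvar=False))
a1 = alpha[:, np.argmax(lam)]            # first loading alpha_1 of xi
psi1 = xi @ a1                           # first factor psi_1(t) = alpha_1^t xi(t)

# Three-factor approximation (3.2) and its residual, which should carry
# only a few percent of the total variance.
f_approx = r[:, None] + np.outer(s, Y_ratio) + np.outer(psi1, a1)
residual = DATA - f_approx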

Remark 3.4. Keeping in mind the study of the principal components of the yield curve made in the previous chapter, we can suggest an interesting parallelism. Let us briefly recall the qualitative properties of the first three loadings as functions of the time to maturity:

1. the first loading is almost constant (it is the average yield over maturities);
2. the second one is monotone increasing (it represents the upward trend);
3. the third one is a convex function (it is the curvature of the yield).

Note that these three properties are also satisfied respectively by r(t), s(t) Y(θ)/Y(θ_max) and ξ(t, θ). This is quite interesting, because we might be led to think that these two factorizations of the forward rate curve are in fact the same. However, this does not hold, since the decomposition given by equation 3.1 is not orthogonal: for example,

∫_I r(t) s(t) Y(θ)/Y(θ_max) dt = (Y(θ)/Y(θ_max)) ∫_I f(t, θ_min)(f(t, θ_max) − f(t, θ_min)) dt,

and in general there is no hope that this integral equals zero. In our particular case, furthermore, a practical computation gives values very far from zero.

Nevertheless, the similarities between the two approximations of the forward rate curve cannot be due to chance. We rather see them as a reason to believe in the efficiency of this model, and as a partial theoretical justification for the approximation given by 3.2.


3.3 A qualitative justification for the three-factor model

We will now try to give some financial interpretations in order to justify some of the results found in the previous section.

First, we want to explain why, on average, the forward rate seems to follow a √θ law. Indeed, we have seen that the spread

s(t, θ) = f(t, t + θ) − r(t)

is on average proportional to √θ. Here we give an interpretation of this fact, first proposed by Bouchaud. Let us first make the following assumption:

The spot rate r(t) follows a random walk.

Obviously, this is not a realistic assumption, but it will be useful to give a heuristic sense to the model just proposed.

Now, let t > 0 be a future date and think of t = 0 as the present date. The trend of the spot rate r(t) can be positive or negative, but it is reasonable to suppose that, on average, it will be close to the present value r(0):

E[r(t)] = E[r(0)].

This means that also the average FRC should be constant, up to a small correction. Indeed, we have to think of the FRC f(t, t + θ) as something related to the value of the spot rate the market expects at time t + θ. However, the typical size of its shift over a time θ is of the order of σ√θ, where σ is the volatility of r(t). Thus money lenders tend to protect themselves against this probable rise of the spot rate by adding a risk premium to their estimate of the average future rate. Hence they set the price of the forward rates in such a way that the probability that the value of the spot rate at time t + θ exceeds f(t, t + θ) is less than p (usually close to 0.2). In mathematical terms, if P(r′, t′ | r, t) is the conditional probability that the spot rate equals r′ at time t′ > t, given its value r at time t, we have

∫_{f(t,t+θ)}^{∞} P(r′, t + θ | r, t) dr′ = p.

Computing the value of f(t, t + θ) by inverting this formula, and recalling the simple random walk behavior supposed at the beginning, we find

f(t, t + θ) − r(t) = Aσ√θ    (3.3)

at least on average. Note that this is a typical Value at Risk procedure, in which investors try to hedge possible losses.
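To make the inversion explicit, one can assume for the sketch that the random walk is Gaussian (this specific distributional form is our assumption, not stated in the text), so that r(t + θ) − r(t) has standard deviation σ√θ; in LaTeX:

% Gaussian sketch of the inversion leading to (3.3).
% Assumption: r(t+\theta) \sim \mathcal{N}\bigl(r(t), \sigma^2\theta\bigr).
\[
p \;=\; \mathbb{P}\bigl(r(t+\theta) > f(t, t+\theta)\bigr)
  \;=\; 1 - \Phi\!\left(\frac{f(t, t+\theta) - r(t)}{\sigma\sqrt{\theta}}\right),
\]
% where $\Phi$ is the standard normal cdf. Inverting:
\[
f(t, t+\theta) - r(t) \;=\; \Phi^{-1}(1-p)\,\sigma\sqrt{\theta}
  \;=\; A\,\sigma\sqrt{\theta},
\qquad A = \Phi^{-1}(1-p) \approx 0.84 \ \text{for } p = 0.2 .
\]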


Remark 3.5. The shape of the average spread is quite close to that of √θ, but it does not fit it as perfectly as the article of Bouchaud claims. This is probably due to some corrections the market applies to the estimate of the future spot rate. Furthermore, let us recall that the √θ term stems from the supposition that r(t) follows a random walk, but this may not be the best fit available. It has been shown, for example (see [Che]), that the average spread of the EURIBOR follows a θ^{3/2} law instead of a √θ one.

Remark 3.6. The reader may argue that 3.3 could lead to an arbitrage. Indeed, buying the spot rate and selling the forward rate should give a profit of Aσ√θ. Anyway, we recall that equation 3.3 describes an average behavior, while in reality we also have to consider the deformation given by ξ(t, θ), which could lead to some losses. However, the strategy of borrowing short-term money and lending long-term money can easily yield a profit, and in fact it is one of the most popular strategies used by banks.

If we want to discuss the shape of the FRC at a given instant of time instead of on average, we have to consider also the error ξ(t, θ), which we have approximated with the first eigenvector of its covariance matrix, ψ_1(t) α_1(θ). A qualitative meaning of these quantities is given below. At time t, the spot rate r(t) follows a biased random walk. However, the walk appears not to be centered around r(t) itself, because the market expects the existence of a trend m(t), estimated from the past behavior of r(t). Furthermore, it seems too simplistic to think that the bias depends only on the time t; in general, we have to suppose a dependence on the maturity t + θ as well. Thus, we should talk about an anticipated bias m(t, t + θ), which can be written as follows:

m(t, t + θ) = m1(t)χ(θ)

where χ(θ) is a persistence function, which can be interpreted as a measure of how much the trend is expected to persist in the future. In applications, it is usually normalized in such a way that χ(0) = 1. Instead, m_1(t) is an estimate made from the observation of the past variations of the spot rate. Integrating the anticipated bias over the maturities, we obtain that the probability distribution P(r′, t + θ | r, t) of the spot rate taking the value r′ at time t + θ, given its value at time t, is centered around

r(t) + ∫_0^θ m(t, t + u) du,

and using the same Value-at-Risk arguments as before, we have

f(t, t + θ) = r(t) + Aσ√θ + ∫_0^θ m(t, t + u) du.
