Stochastic Processes

(Master degree in Engineering)

Franco Flandoli


Contents

Preface

Chapter 1. Preliminaries of Probability
  1. Transformation of densities
  2. About covariance matrices
  3. Gaussian vectors

Chapter 2. Stochastic processes. Generalities
  1. Discrete time stochastic process
  2. Stationary processes
  3. Time series and empirical quantities
  4. Gaussian processes
  5. Discrete time Fourier transform
  6. Power spectral density
  7. Fundamental theorem on PSD
  8. Signal to noise ratio
  9. An ergodic theorem

Chapter 3. ARIMA models
  1. Definitions
  2. Stationarity, ARMA and ARIMA processes
  3. Correlation function
  4. Power spectral density

Preface

These notes are planned to be the last part of a course of Probability and Stochastic Processes. The first part is devoted to the introduction to the following topics, taken for instance from the book of Baldi (in Italian) or of Billingsley (in English):

Probability space $(\Omega, \mathcal{F}, P)$
Conditional probability and independence of events
Factorization formula and Bayes formula
Concept of random variable X, random vector $X = (X_1, \dots, X_n)$
Law of a r.v., probability density (discrete and continuous)
Distribution function and quantiles
Joint law of a vector and marginal laws, relations
(Transformation of densities and moments) (see complements below)
Expectation, properties
Moments, variance, standard deviation, properties
Covariance and correlation coefficient, covariance matrix
Generating function and characteristic function
(Discrete r.v.: Bernoulli, binomial, Poisson, geometric)
Continuous r.v.: uniform, exponential, Gaussian, Weibull, Gamma
Notions of convergence of r.v.
(Limit theorems: LLN, CLT; Chebyshev inequality.)

Since we need some more specialized material, Chapter 1 is a complement to this list of items.


Chapter 1. Preliminaries of Probability

1. Transformation of densities

Exercise 1. If X has cdf $F_X(x)$ and g is increasing and continuous, then $Y = g(X)$ has cdf
$$F_Y(y) = F_X(g^{-1}(y))$$
for all y in the image of g. If g is decreasing and continuous, the formula is
$$F_Y(y) = 1 - F_X(g^{-1}(y)).$$

Exercise 2. If X has continuous pdf $f_X(x)$ and g is increasing and differentiable, then $Y = g(X)$ has pdf
$$f_Y(y) = \frac{f_X(g^{-1}(y))}{g'(g^{-1}(y))} = \left.\frac{f_X(x)}{g'(x)}\right|_{y=g(x)}$$
for all y in the image of g. If g is decreasing and differentiable, the formula is
$$f_Y(y) = -\left.\frac{f_X(x)}{g'(x)}\right|_{y=g(x)}.$$
Thus, in general, we have the following result.

Proposition 1. If g is monotone and differentiable, the transformation of densities is given by
$$f_Y(y) = \left.\frac{f_X(x)}{|g'(x)|}\right|_{y=g(x)}.$$

Remark 1. Under proper assumptions, when g is not injective the formula generalizes to
$$f_Y(y) = \sum_{x:\, y = g(x)} \frac{f_X(x)}{|g'(x)|}.$$
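As a quick numerical illustration (added here, not part of the original notes), one can check Proposition 1 in R for the increasing, differentiable map $g(x) = e^x$ applied to a standard Gaussian X; the sample size and plotting range are arbitrary choices.

set.seed(1)
x <- rnorm(100000)
y <- exp(x)                              # Y = g(X) with g(x) = exp(x), increasing and differentiable
f_Y <- function(y) dnorm(log(y)) / y     # f_X(g^{-1}(y)) / |g'(g^{-1}(y))|, as in Proposition 1
plot(density(y, from = 0.01, to = 6), main = "Y = exp(X): empirical vs. formula")
curve(f_Y(x), from = 0.01, to = 6, add = TRUE, col = "red", lty = 2)

The kernel density estimate of the simulated sample and the curve given by the formula should essentially overlap.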

Remark 2. A second proof of the previous formula comes from the following characterization of the density: f is the density of X if and only if
$$E[h(X)] = \int_{\mathbb{R}} h(x) f(x)\, dx$$
for all continuous bounded functions h. Let us use this fact to prove that $f_Y(y) = \left.\frac{f_X(x)}{|g'(x)|}\right|_{y=g(x)}$ is the density of $Y = g(X)$. Let us compute $E[h(Y)]$ for a generic continuous bounded function h. We have, from the definition of Y and from the characterization applied to X,
$$E[h(Y)] = E[h(g(X))] = \int_{\mathbb{R}} h(g(x)) f(x)\, dx.$$

Let us change variable $y = g(x)$, under the assumption that g is monotone, bijective and differentiable. We have $x = g^{-1}(y)$, $dx = \frac{1}{|g'(g^{-1}(y))|}\, dy$ (we put the absolute value since we do not change the extremes of integration, but just rewrite $\int_{\mathbb{R}}$), so that
$$\int_{\mathbb{R}} h(g(x)) f(x)\, dx = \int_{\mathbb{R}} h(y)\, f(g^{-1}(y))\, \frac{1}{|g'(g^{-1}(y))|}\, dy.$$
If we set $f_Y(y) := \left.\frac{f_X(x)}{|g'(x)|}\right|_{y=g(x)}$, we have proved that
$$E[h(Y)] = \int_{\mathbb{R}} h(y) f_Y(y)\, dy$$
for every continuous bounded function h. By the characterization, this implies that $f_Y(y)$ is the density of Y. This proof is thus based on the change of variable formula.

Remark 3. The same proof works in the multidimensional case, using the change of variable formula for multiple integrals. Recall that in place of $dy = g'(x)\, dx$ one has to use $dy = |\det Dg(x)|\, dx$, where $Dg$ is the Jacobian (the matrix of first derivatives) of the transformation $g : \mathbb{R}^n \to \mathbb{R}^n$. In fact we need the inverse transformation, so we use the corresponding formula
$$dx = |\det Dg^{-1}(y)|\, dy = \frac{1}{|\det Dg(g^{-1}(y))|}\, dy.$$
With the same passages performed above, one gets the following result.

Proposition 2. If g is a differentiable bijection and $Y = g(X)$, then
$$f_Y(y) = \left.\frac{f_X(x)}{|\det Dg(x)|}\right|_{y=g(x)}.$$

Exercise 3. If X (in $\mathbb{R}^n$) has density $f_X(x)$ and $Y = UX$, where U is an orthogonal linear transformation of $\mathbb{R}^n$ (meaning that $U^{-1} = U^T$), then Y has density
$$f_Y(y) = f_X(U^T y).$$

1.1. Linear transformation of moments. The solution of the following exercises is based on the linearity of the expected value (and thus of the covariance in each argument).

Exercise 4. Let $X = (X_1, \dots, X_n)$ be a random vector, A a $d \times n$ matrix, $Y = AX$. Let $\mu_X = (\mu_{X_1}, \dots, \mu_{X_n})$ be the vector of mean values of X, namely $\mu_{X_i} = E[X_i]$. Then $\mu_Y := A \mu_X$ is the vector of mean values of Y, namely $\mu_{Y_i} = E[Y_i]$.

Exercise 5. Under the same assumptions, if $Q_X$ and $Q_Y$ are the covariance matrices of X and Y, then
$$Q_Y = A Q_X A^T.$$
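A possible Monte Carlo check of Exercises 4 and 5 in R (an added illustration; the $2 \times 3$ matrix A and the means below are arbitrary choices):

set.seed(2)
n <- 100000
X <- cbind(rnorm(n, mean = 1), rnorm(n, mean = -2), rnorm(n, mean = 0.5))  # rows are samples of X in R^3
A <- matrix(c(1, 0,  2,
              0, 3, -1), nrow = 2, byrow = TRUE)                           # a 2 x 3 matrix
Y <- X %*% t(A)                          # each row is A x for one sample x
colMeans(Y); A %*% colMeans(X)           # empirical mean of Y versus A mu_X
cov(Y);      A %*% cov(X) %*% t(A)       # empirical covariance of Y versus A Q_X A^T

The two pairs of outputs should agree up to Monte Carlo error.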


2. About covariance matrices

The covariance matrix Q of a vector $X = (X_1, \dots, X_n)$, defined as $Q_{ij} = \mathrm{Cov}(X_i, X_j)$, is symmetric:
$$Q_{ij} = \mathrm{Cov}(X_i, X_j) = \mathrm{Cov}(X_j, X_i) = Q_{ji}$$
and non-negative definite:
$$x^T Q x = \sum_{i,j=1}^n Q_{ij} x_i x_j = \sum_{i,j=1}^n \mathrm{Cov}(X_i, X_j)\, x_i x_j = \sum_{i,j=1}^n \mathrm{Cov}(x_i X_i, x_j X_j) = \mathrm{Cov}\Big(\sum_{i=1}^n x_i X_i, \sum_{j=1}^n x_j X_j\Big) = \mathrm{Var}[W] \ge 0$$
where $W = \sum_{i=1}^n x_i X_i$.

The spectral theorem states that any symmetric matrix Q can be diagonalized, namely there exists an orthonormal basis $e_1, \dots, e_n$ of $\mathbb{R}^n$ in which Q takes the form
$$Q_e = \begin{pmatrix} \lambda_1 & 0 & 0 \\ 0 & \ddots & 0 \\ 0 & 0 & \lambda_n \end{pmatrix}.$$
Moreover, the numbers $\lambda_i$ are eigenvalues of Q, and the vectors $e_i$ are corresponding eigenvectors. Since the covariance matrix Q is also non-negative definite, we have
$$\lambda_i \ge 0, \qquad i = 1, \dots, n.$$

Remark 4. To understand this theorem better, recall a few facts of linear algebra. $\mathbb{R}^n$ is a vector space with a scalar product $\langle \cdot, \cdot \rangle$, namely a set of elements (called vectors) with certain operations (sum of vectors, multiplication by real numbers, scalar product between vectors) and properties. We may call intrinsic the objects defined in these terms, as opposed to the objects defined by means of numbers with respect to a given basis. A vector $x \in \mathbb{R}^n$ is an intrinsic object; but we can write it as a sequence of numbers $(x_1, \dots, x_n)$ in infinitely many ways, depending on the basis we choose. Given an orthonormal basis $u_1, \dots, u_n$, the components of a vector $x \in \mathbb{R}^n$ in this basis are the numbers $\langle x, u_j \rangle$, $j = 1, \dots, n$. A linear map L in $\mathbb{R}^n$, given the basis $u_1, \dots, u_n$, can be represented by the matrix of components $\langle L u_i, u_j \rangle$.

We shall write $y^T x$ for $\langle x, y \rangle$ (or $\langle y, x \rangle$).

Remark 5. After these general comments, we see that a matrix represents a linear transformation, given a basis. Thus, given the canonical basis of $\mathbb{R}^n$, which we shall denote by $u_1, \dots, u_n$, and given the matrix Q, a linear transformation L from $\mathbb{R}^n$ to $\mathbb{R}^n$ is defined. The spectral theorem states that there is a new orthonormal basis $e_1, \dots, e_n$ of $\mathbb{R}^n$ such that, if $Q_e$ represents the linear transformation L in this new basis, then $Q_e$ is diagonal.

Remark 6. Let us recall more facts about linear algebra. Start with an orthonormal basis $u_1, \dots, u_n$, which we call the canonical or original basis. Let $e_1, \dots, e_n$ be another orthonormal basis. The vector $u_1$, in the canonical basis, has components
$$u_1 = \begin{pmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix}$$
and so on for the other vectors. Each vector $e_j$ has certain components. Denote by U the matrix whose first column has the same components as $e_1$ (those in the canonical basis), and so on for the other columns. We could write $U = (e_1, \dots, e_n)$. Also, $U_{ij} = e_j^T u_i$. Then
$$U \begin{pmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix} = e_1$$
and so on, namely U represents the linear map which maps the canonical (original) basis of $\mathbb{R}^n$ into $e_1, \dots, e_n$. This is an orthogonal transformation:
$$U^{-1} = U^T.$$
Indeed, $U^{-1}$ maps $e_1, \dots, e_n$ into the canonical basis (by the above property of U), and $U^T$ does the same:
$$U^T e_1 = \begin{pmatrix} e_1^T e_1 \\ e_2^T e_1 \\ \vdots \\ e_n^T e_1 \end{pmatrix} = \begin{pmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix}$$
and so on.

Remark 7. Let us now go back to the covariance matrix Q and the matrix $Q_e$ given by the spectral theorem: $Q_e$ is a diagonal matrix which represents the same linear transformation L in a new basis $e_1, \dots, e_n$. Assume we do not know anything else, except that they describe the same map L and that $Q_e$ is diagonal, namely of the form
$$Q_e = \begin{pmatrix} \lambda_1 & 0 & 0 \\ 0 & \ddots & 0 \\ 0 & 0 & \lambda_n \end{pmatrix}.$$
Let us deduce a number of facts:

i) $Q_e = U Q U^T$;
ii) the diagonal elements $\lambda_j$ are eigenvalues of L, with eigenvectors $e_j$;
iii) $\lambda_j \ge 0$, $j = 1, \dots, n$.

To prove (i), recall from above that
$$(Q_e)_{ij} = e_j^T L e_i \quad \text{and} \quad Q_{ij} = u_j^T L u_i.$$

Moreover, $U_{ij} = e_j^T u_i$, hence $e_j = \sum_{k=1}^n U_{kj} u_k$, and thus
$$(Q_e)_{ij} = e_j^T L e_i = \sum_{k,k'=1}^n U_{ki} U_{k'j}\, u_{k'}^T L u_k = \sum_{k,k'=1}^n U_{ki}\, Q_{kk'}\, U_{k'j} = (U Q U^T)_{ij}.$$

To prove (ii), let us write the vector $L e_1$ in the basis $e_1, \dots, e_n$: $e_1$ is the vector $(1, 0, \dots, 0)^T$, the map L is represented by $Q_e$, hence $L e_1$ is equal to
$$Q_e \begin{pmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix} = \begin{pmatrix} \lambda_1 \\ 0 \\ \vdots \\ 0 \end{pmatrix} = \lambda_1 \begin{pmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix}$$
which is $\lambda_1 e_1$ in the basis $e_1, \dots, e_n$. We have checked that $L e_1 = \lambda_1 e_1$, namely that $\lambda_1$ is an eigenvalue and $e_1$ is a corresponding eigenvector. The proof for $\lambda_2$, etc., is the same. To prove (iii), just note that, in the basis $e_1, \dots, e_n$,
$$e_j^T Q_e e_j = \lambda_j.$$
But
$$e_j^T Q_e e_j = e_j^T U Q U^T e_j = v^T Q v \ge 0$$
where $v = U^T e_j$, having used the property that Q is non-negative definite. Hence $\lambda_j \ge 0$.

3. Gaussian vectors

Recall that a Gaussian, or Normal, r.v. $N(\mu, \sigma^2)$ is a r.v. with probability density
$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{|x-\mu|^2}{2\sigma^2}\right).$$
We have shown that $\mu$ is the mean value and $\sigma^2$ the variance. The standard Normal is the case $\mu = 0$, $\sigma^2 = 1$. If Z is a standard normal r.v., then $\mu + \sigma Z$ is $N(\mu, \sigma^2)$.

We may give the definition of Gaussian vector in two ways, generalizing either the expression of the density or the property that $\mu + \sigma Z$ is $N(\mu, \sigma^2)$. Let us start with a lemma.

Lemma 1. Given a vector $\mu = (\mu_1, \dots, \mu_n)$ and a symmetric positive definite $n \times n$ matrix Q (namely $v^T Q v > 0$ for all $v \neq 0$), consider the function
$$f(x) = \frac{1}{\sqrt{(2\pi)^n \det(Q)}} \exp\left(-\frac{(x-\mu)^T Q^{-1} (x-\mu)}{2}\right)$$
where $x = (x_1, \dots, x_n) \in \mathbb{R}^n$. Notice that the inverse $Q^{-1}$ is well defined for positive definite matrices, $(x-\mu)^T Q^{-1} (x-\mu)$ is a positive quantity for $x \neq \mu$, and $\det(Q)$ is a positive number. Then:

i) $f(x)$ is a probability density;

ii) if $X = (X_1, \dots, X_n)$ is a random vector with such joint probability density, then $\mu$ is the vector of mean values, namely
$$\mu_i = E[X_i],$$
and Q is the covariance matrix:
$$Q_{ij} = \mathrm{Cov}(X_i, X_j).$$

Proof. Step 1. In this step we explain the meaning of the expression $f(x)$. We have recalled above that any symmetric matrix Q can be diagonalized, namely there exists an orthonormal basis $e_1, \dots, e_n$ of $\mathbb{R}^n$ in which Q takes the form
$$Q_e = \begin{pmatrix} \lambda_1 & 0 & 0 \\ 0 & \ddots & 0 \\ 0 & 0 & \lambda_n \end{pmatrix}.$$
Moreover, the numbers $\lambda_i$ are eigenvalues of Q, and the vectors $e_i$ are corresponding eigenvectors. See above for more details. Let U be the matrix introduced there, such that $U^{-1} = U^T$. Recall the relation $Q_e = U Q U^T$.

Since $v^T Q v > 0$ for all $v \neq 0$, we deduce
$$v^T Q_e v = v^T U Q U^T v > 0$$
for all $v \neq 0$ (since $U^T v \neq 0$). Taking $v = e_i$, we get $\lambda_i > 0$.

Therefore the matrix $Q_e$ is invertible, with inverse given by
$$Q_e^{-1} = \begin{pmatrix} \lambda_1^{-1} & 0 & 0 \\ 0 & \ddots & 0 \\ 0 & 0 & \lambda_n^{-1} \end{pmatrix}.$$
It follows that also Q, being equal to $U^T Q_e U$ (the relation $Q = U^T Q_e U$ comes from $Q_e = U Q U^T$), is invertible, with inverse $Q^{-1} = U^T Q_e^{-1} U$. Easily one gets $(x-\mu)^T Q^{-1} (x-\mu) > 0$ for $x \neq \mu$. Moreover,
$$\det(Q) = \det(U^T)\, \det(Q_e)\, \det(U) = \lambda_1 \cdots \lambda_n$$
because
$$\det(Q_e) = \lambda_1 \cdots \lambda_n$$
and $\det(U)^2 = 1$. The latter property comes from
$$1 = \det I = \det(U^T U) = \det(U^T) \det(U) = \det(U)^2$$
(a fact also used in Exercise 3). Therefore $\det(Q) > 0$. The formula for $f(x)$ is meaningful and defines a positive function.

Step 2. Let us prove that $f(x)$ is a density. By the theorem of change of variables in multidimensional integrals, with the change of variables $x = U^T y$,
$$\int_{\mathbb{R}^n} f(x)\, dx = \int_{\mathbb{R}^n} f(U^T y)\, dy$$
because $|\det U^T| = 1$ (and the Jacobian of a linear transformation is the linear map itself). Now, since $U Q^{-1} U^T = Q_e^{-1}$, $f(U^T y)$ is equal to the following function:
$$f_e(y) = \frac{1}{\sqrt{(2\pi)^n \det(Q_e)}} \exp\left(-\frac{(y-\mu_e)^T Q_e^{-1} (y-\mu_e)}{2}\right)$$
where $\mu_e = U\mu$. Since
$$(y-\mu_e)^T Q_e^{-1} (y-\mu_e) = \sum_{i=1}^n \frac{(y_i - (\mu_e)_i)^2}{\lambda_i}$$
and $\det(Q_e) = \lambda_1 \cdots \lambda_n$, we get
$$f_e(y) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\lambda_i}} \exp\left(-\frac{(y_i - (\mu_e)_i)^2}{2\lambda_i}\right).$$
Namely, $f_e(y)$ is the product of n Gaussian densities $N((\mu_e)_i, \lambda_i)$. We know from the theory of joint probability densities that the product of densities is the joint density of a vector with independent components. Hence $f_e(y)$ is a probability density. Therefore $\int_{\mathbb{R}^n} f_e(y)\, dy = 1$. This proves $\int_{\mathbb{R}^n} f(x)\, dx = 1$, so that f is a probability density.

Step 3. Let $X = (X_1, \dots, X_n)$ be a random vector with joint probability density f, written in the original basis. Let $Y = UX$. Then (Exercise 3) Y has density $f_Y(y)$ given by $f_Y(y) = f(U^T y)$. Thus
$$f_Y(y) = f_e(y) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\lambda_i}} \exp\left(-\frac{(y_i - (\mu_e)_i)^2}{2\lambda_i}\right).$$
Thus $(Y_1, \dots, Y_n)$ are independent $N((\mu_e)_i, \lambda_i)$ r.v.'s and therefore
$$E[Y_i] = (\mu_e)_i, \qquad \mathrm{Cov}(Y_i, Y_j) = \delta_{ij}\, \lambda_i.$$
From Exercises 4 and 5 we deduce that $X = U^T Y$ has mean
$$\mu_X = U^T \mu_Y$$
and covariance
$$Q_X = U^T Q_Y U.$$
Since $\mu_Y = \mu_e$ and $\mu_e = U\mu$, we readily deduce $\mu_X = U^T U \mu = \mu$. Since $Q_Y = Q_e$ and $Q = U^T Q_e U$, we get $Q_X = Q$. The proof is complete.

Definition 1. Given a vector $\mu = (\mu_1, \dots, \mu_n)$ and a symmetric positive definite $n \times n$ matrix Q, we call Gaussian vector of mean $\mu$ and covariance Q a random vector $X = (X_1, \dots, X_n)$ having joint probability density function
$$f(x) = \frac{1}{\sqrt{(2\pi)^n \det(Q)}} \exp\left(-\frac{(x-\mu)^T Q^{-1} (x-\mu)}{2}\right)$$
where $x = (x_1, \dots, x_n) \in \mathbb{R}^n$. We write $X \sim N(\mu, Q)$.


The only drawback of this definition is the restriction to strictly positive definite matrices Q. It is sometimes useful to have the notion of Gaussian vector also in the case when Q is only non-negative definite (sometimes called the degenerate case). For instance, we shall see that any linear transformation of a Gaussian vector is a Gaussian vector, but in order to state this theorem in full generality we need to consider also the degenerate case. In order to give a more general definition, let us take the idea recalled above for the 1-dimensional case: affine transformations of Gaussian r.v.'s are Gaussian.

Definition 2. i) The standard d-dimensional Gaussian vector is the random vector $Z = (Z_1, \dots, Z_d)$ with joint probability density
$$f(z_1, \dots, z_d) = \prod_{i=1}^d p(z_i) \quad \text{where} \quad p(z) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{z^2}{2}}.$$
ii) All other Gaussian vectors $X = (X_1, \dots, X_n)$ (in any dimension n) are obtained from standard ones by affine transformations:
$$X = AZ + b$$
where A is a matrix and b is a vector. If X has dimension n, we require A to be an $n \times d$ matrix and b to have dimension n (but n can be different from d).

The graph of the density of a standard 2-dimensional Gaussian vector is shown below.

[Figure: bell-shaped surface of the standard 2-dimensional Gaussian density over the (x, y) plane.]

The graph of the density of the other Gaussian vectors can be guessed by linear deformations of the base plane xy (deformations defined by A) and shifts (by b). For instance, if
$$A = \begin{pmatrix} 2 & 0 \\ 0 & 1 \end{pmatrix},$$
the matrix which enlarges the x axis by a factor 2, we get the graph below.

[Figure: the same bell-shaped surface stretched by a factor 2 along the x axis.]

First, let us compute the mean and covariance matrix of a vector of the form $X = AZ + b$, with Z of standard type. From Exercises 4 and 5 we readily have:

Proposition 3. The mean $\mu$ and covariance matrix Q of a vector X of the previous form are given by
$$\mu = b, \qquad Q = AA^T.$$

When two different definitions are given for the same object, one has to prove their equivalence. If Q is positive definite, the two definitions aim to describe the same object; for Q non-negative definite but not strictly positive definite, we have only the second definition, so there is no compatibility to check.

Proposition 4. If Q is positive definite, then Definitions 1 and 2 are equivalent. More precisely, if $X = (X_1, \dots, X_n)$ is a Gaussian random vector with mean $\mu$ and covariance Q in the sense of Definition 1, then there exist a standard Gaussian random vector $Z = (Z_1, \dots, Z_n)$ and an $n \times n$ matrix A such that
$$X = AZ + \mu.$$
One can take $A = \sqrt{Q}$, as described in the proof. Vice versa, if $X = (X_1, \dots, X_n)$ is a Gaussian random vector in the sense of Definition 2, of the form $X = AZ + b$, then X is Gaussian in the sense of Definition 1, with mean $\mu$ and covariance Q given by the previous proposition.

Proof. Let us prove the first claim. Let us define $\sqrt{Q} = U^T \sqrt{Q_e}\, U$, where $\sqrt{Q_e}$ is simply defined as
$$\sqrt{Q_e} = \begin{pmatrix} \sqrt{\lambda_1} & 0 & 0 \\ 0 & \ddots & 0 \\ 0 & 0 & \sqrt{\lambda_n} \end{pmatrix}.$$
We have
$$\left(\sqrt{Q}\right)^T = U^T \left(\sqrt{Q_e}\right)^T U = U^T \sqrt{Q_e}\, U = \sqrt{Q}$$
and
$$\left(\sqrt{Q}\right)^2 = U^T \sqrt{Q_e}\, U\, U^T \sqrt{Q_e}\, U = U^T \sqrt{Q_e}\sqrt{Q_e}\, U = U^T Q_e U = Q$$
because $\sqrt{Q_e}\sqrt{Q_e} = Q_e$. Set

$$Z = \sqrt{Q}^{-1}(X - \mu),$$
where we notice that $\sqrt{Q}$ is invertible, from its definition and the strict positivity of the $\lambda_i$. Then Z is Gaussian. Indeed, from the formula for the transformation of densities,
$$f_Z(z) = \left.\frac{f_X(x)}{|\det Dg(x)|}\right|_{z=g(x)}$$
where $g(x) = \sqrt{Q}^{-1}(x - \mu)$; hence $\det Dg(x) = \det\left(\sqrt{Q}^{-1}\right) = \frac{1}{\sqrt{\lambda_1} \cdots \sqrt{\lambda_n}}$; therefore
$$f_Z(z) = \prod_{i=1}^n \sqrt{\lambda_i}\; \frac{1}{\sqrt{(2\pi)^n \det(Q)}} \exp\left(-\frac{(\sqrt{Q}z + \mu - \mu)^T Q^{-1} (\sqrt{Q}z + \mu - \mu)}{2}\right)$$
$$= \frac{1}{\sqrt{(2\pi)^n}} \exp\left(-\frac{(\sqrt{Q}z)^T Q^{-1} (\sqrt{Q}z)}{2}\right) = \frac{1}{\sqrt{(2\pi)^n}} \exp\left(-\frac{z^T z}{2}\right)$$
which is the density of a standard Gaussian vector. From the definition of Z we get $X = \sqrt{Q}\, Z + \mu$, so the first claim is proved.

The proof of the second claim is a particular case of the next exercise, which we leave to the reader.

Exercise 6. Let $X = (X_1, \dots, X_n)$ be a Gaussian random vector, B an $m \times n$ matrix, c a vector of $\mathbb{R}^m$. Then
$$Y = BX + c$$
is a Gaussian random vector of dimension m. The relation between the means is
$$\mu_Y = B\mu_X + c$$
and between the covariances
$$Q_Y = B Q_X B^T.$$

Remark 8. We see from the exercise that we may start with a non-degenerate vector X and get a degenerate one Y , if B is not a bijection. This always happens when m > n.

Remark 9. The law of a Gaussian vector is determined by the mean vector and the covariance matrix. This fundamental fact will be used below when we study stochastic processes.

Remark 10. Some of the previous results are very useful if we want to generate random vectors according to a prescribed Gaussian law. Assume we have a prescribed mean $\mu$ and covariance Q, n-dimensional, and want to generate a random sample $(x_1, \dots, x_n)$ from such an $N(\mu, Q)$. Then we may generate n independent samples $z_1, \dots, z_n$ from the standard one-dimensional Gaussian law and compute
$$\sqrt{Q}\, z + \mu$$
where $z = (z_1, \dots, z_n)$. In order to obtain the entries of the matrix $\sqrt{Q}$, if the software does not provide them directly (certain packages do), we may use the formula $\sqrt{Q} = U^T \sqrt{Q_e}\, U$. The matrix $\sqrt{Q_e}$ is obvious. In order to get the matrix U, recall that its columns are the vectors $e_1, \dots, e_n$ written in the original basis, and such vectors are an orthonormal basis of eigenvectors of Q. Thus one has to use at least a software package that computes the spectral decomposition of a matrix, to get $e_1, \dots, e_n$.
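A minimal R sketch of this recipe (the values of $\mu$ and Q below are arbitrary; eigen() returns the orthonormal eigenvectors $e_1, \dots, e_n$ as the columns of a matrix, from which the square root of Q is assembled):

mu <- c(1, -2)
Q  <- matrix(c(4, 1,
               1, 2), nrow = 2)              # a symmetric positive definite covariance matrix
dec   <- eigen(Q, symmetric = TRUE)          # spectral decomposition of Q
sqrtQ <- dec$vectors %*% diag(sqrt(dec$values)) %*% t(dec$vectors)   # square root of Q
set.seed(3)
z <- matrix(rnorm(2 * 10000), nrow = 2)      # columns are independent standard Gaussian samples
x <- sqrtQ %*% z + mu                        # columns are samples from N(mu, Q)
rowMeans(x)                                  # should be close to mu
cov(t(x))                                    # should be close to Q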


Chapter 2. Stochastic processes. Generalities

1. Discrete time stochastic process

We call a discrete time stochastic process any sequence $X_0, X_1, X_2, \dots, X_n, \dots$ of random variables defined on a probability space $(\Omega, \mathcal{F}, P)$, taking values in $\mathbb{R}$. This definition is not so rigid with respect to small details: the same name is given to sequences $X_1, X_2, \dots, X_n, \dots$, or to the case when the r.v.'s $X_n$ take values in a space different from $\mathbb{R}$. We shall also describe below the case when the time index takes negative values.

The main objects attached to a r.v. are its law, its first and second moments (and possibly higher order moments, the characteristic or generating function, and the distribution function). We do the same for a process $(X_n)_{n \ge 0}$: the probability density of the r.v. $X_n$, when it exists, will be denoted by $f_n(x)$, the mean by $\mu_n$, the standard deviation by $\sigma_n$. Often we shall write t in place of n, but nevertheless here t will always be a non-negative integer. So, our first concepts are:

i) the mean function and variance function:
$$\mu_t = E[X_t], \qquad \sigma_t^2 = \mathrm{Var}[X_t], \qquad t = 0, 1, 2, \dots$$
In addition, the time-correlation is very important. We introduce three functions:

ii) the autocovariance function $C(t,s)$, $t, s = 0, 1, 2, \dots$:
$$C(t,s) = E[(X_t - \mu_t)(X_s - \mu_s)]$$
and the function
$$R(t,s) = E[X_t X_s]$$
(the name will be discussed below). They are symmetric ($R(t,s) = R(s,t)$, and the same for $C(t,s)$), so it is sufficient to know them for $t \ge s$. We have
$$C(t,s) = R(t,s) - \mu_t \mu_s, \qquad C(t,t) = \sigma_t^2.$$
In particular, when $\mu_t \equiv 0$ (which is often the case), $C(t,s) = R(t,s)$. Most of the importance will be given to $\mu_t$ and $R(t,s)$. In addition, let us introduce:

iii) the autocorrelation function
$$\rho(t,s) = \frac{C(t,s)}{\sigma_t \sigma_s}.$$
We have
$$\rho(t,t) = 1, \qquad |\rho(t,s)| \le 1.$$
The functions $C(t,s)$, $R(t,s)$, $\rho(t,s)$ are used to detect repetitions in the process, self-similarities under time shift. For instance, if $(X_n)_{n \ge 0}$ is roughly periodic of period P, $\rho(t+P, t)$ will be significantly higher than the other values of $\rho(t,s)$ (except $\rho(t,t)$, which is always equal to 1). Also a trend is a form of repetition, of self-similarity under time shift, and indeed when there is a trend all values of $\rho(t,s)$ are quite high compared to the cases without trend. See the numerical example below.

Other objects (when defined) related to the time structure are:

iv) the joint probability density
$$f_{t_1, \dots, t_n}(x_1, \dots, x_n), \qquad t_n \ge \dots \ge t_1,$$
of the vector $(X_{t_1}, \dots, X_{t_n})$, and

v) the conditional density
$$f_{t|s}(x|y) = \frac{f_{t,s}(x,y)}{f_s(y)}, \qquad t > s.$$

Now, a remark about the name of $R(t,s)$. In Statistics and Time Series Analysis, the name autocorrelation function is given to $\rho(t,s)$, as we said above. But in certain disciplines related to signal processing, $R(t,s)$ is called the autocorrelation function. There is no special reason except the fact that $R(t,s)$ is the fundamental quantity to be understood and investigated, the others ($C(t,s)$ and $\rho(t,s)$) being simple transformations of $R(t,s)$; thus $R(t,s)$ is given the name which most reminds one of the concept of self-relation between values of the process at different times. In the sequel we shall use both languages and sometimes we shall call $\rho(t,s)$ the autocorrelation coefficient.

The last object we introduce is concerned with two processes simultaneously, $(X_n)_{n \ge 0}$ and $(Y_n)_{n \ge 0}$. It is called:

vi) the cross-correlation function
$$C_{X,Y}(t,s) = E[(X_t - E[X_t])(Y_s - E[Y_s])].$$
This function is a measure of the similarity between two processes, shifted in time. For instance, it can be used for the following purpose: one of the two processes, say Y, is known and has a known shape of interest for us; the other process, X, is the process under investigation, and we would like to detect portions of X which have a shape similar to Y. Hence we shift X in all possible ways and compute the correlation with Y.

When more than one process is investigated, it may be better to write $R_X(t,s)$, $C_X(t,s)$ and so on for the quantities associated with the process X.

1.1. Example 1: white noise. The white noise with intensity $\sigma^2$ is the process $(X_n)_{n \ge 0}$ with the following properties:

i) $X_0, X_1, X_2, \dots, X_n, \dots$ are independent r.v.'s;
ii) $X_n \sim N(0, \sigma^2)$.

It is a very elementary process, with a trivial time-structure, but it will be used as a building block for other classes of processes, or as a comparison object to understand the features of more complex cases. The following picture has been obtained with the R software by the commands x<-rnorm(1000); ts.plot(x).

[Figure: sample trajectory of a white noise of length 1000.]

Let us compute all its relevant quantities (the check is left as an exercise):
$$\mu_t = 0, \qquad \sigma_t^2 = \sigma^2, \qquad R(t,s) = C(t,s) = \sigma^2 \delta(t-s),$$
where the symbol $\delta(t-s)$ denotes 0 for $t \neq s$ and 1 for $t = s$,
$$\rho(t,s) = \delta(t-s), \qquad f_{t_1,\dots,t_n}(x_1,\dots,x_n) = \prod_{i=1}^n p(x_i) \quad \text{where} \quad p(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{x^2}{2\sigma^2}},$$
$$f_{t|s}(x|y) = p(x).$$

1.2. Example 2: random walk. Let $(W_n)_{n \ge 0}$ be a white noise (or, more generally, a process with independent identically distributed $W_0, W_1, W_2, \dots$). Set
$$X_0 = 0, \qquad X_{n+1} = X_n + W_n, \quad n \ge 0.$$
This is a random walk. White noise has been used as a building block: $(X_n)_{n \ge 0}$ is the solution of a recursive linear equation, driven by white noise (we shall see more general examples later on). The following picture has been obtained with the R software by the commands x<-rnorm(1000); y<-cumsum(x); ts.plot(y).

[Figure: sample trajectory of a random walk of length 1000.]

The random variables $X_n$ are not independent ($X_{n+1}$ obviously depends on $X_n$). One has
$$X_{n+1} = \sum_{i=0}^n W_i.$$
We have the following facts. We prove them by means of the iterative relation (this generalizes better to more complex discrete linear equations). First,
$$\mu_0 = 0, \qquad \mu_{n+1} = \mu_n, \quad n \ge 0,$$
hence $\mu_n = 0$ for every $n \ge 0$.

By induction, $X_n$ and $W_n$ are independent for every n, hence:

Exercise 7. Denote by $\sigma^2$ the intensity of the white noise; find a relation between $\sigma_{n+1}^2$ and $\sigma_n^2$ and prove that
$$\sigma_n = \sqrt{n}\, \sigma, \quad n \ge 0.$$
An intuitive interpretation of the result of the exercise is that $X_n$ behaves like $\sqrt{n}$, in a very rough way.

As to the time-dependent structure, $C(t,s) = R(t,s)$, and:

Exercise 8. Prove that $R(m,n) = n\sigma^2$ for all $m \ge n$ (prove it for $m = n$, $m = n+1$, $m = n+2$ and extend). Then prove that
$$\rho(m,n) = \sqrt{\frac{n}{m}}.$$
The result of this exercise implies that
$$\rho(m,1) \to 0 \quad \text{as } m \to \infty.$$
We may interpret this result by saying that the random walk loses memory of the initial position.
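A Monte Carlo illustration of the two exercises (added here; the intensity $\sigma^2 = 1$ and the time horizon are arbitrary choices):

set.seed(4)
n_steps <- 400
n_paths <- 5000
W <- matrix(rnorm(n_paths * n_steps), nrow = n_paths)   # white noise increments, sigma = 1
X_n <- rowSums(W)                        # value of each random walk after n_steps steps
sd(X_n)                                  # close to sqrt(n_steps) = 20, as in Exercise 7
cor(rowSums(W[, 1:100]), X_n)            # close to sqrt(100/400) = 0.5, as in Exercise 8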

2. Stationary processes

A process is called wide-sense stationary if $\mu_t$ and $R(t+n, t)$ are independent of t. It follows that also $\sigma_t$, $C(t+n, t)$ and $\rho(t+n, t)$ are independent of t. Thus we speak of:

i) the mean $\mu$;
ii) the standard deviation $\sigma$;
iii) the covariance function $C(n) := C(n, 0)$;
iv) the autocorrelation function (in the improper sense described above) $R(n) := R(n, 0)$;
v) the autocorrelation coefficient (or also autocorrelation function, in the language of Statistics) $\rho(n) := \rho(n, 0)$.

A process is called strongly stationary if the law of the generic vector $(X_{n_1+t}, \dots, X_{n_k+t})$ is independent of t. This implies wide-sense stationarity. The converse is not true in general, but it is true for Gaussian processes (see below).

2.1. Example: white noise. We have
$$R(t,s) = \sigma^2 \delta(t-s),$$
hence
$$R(n) = \sigma^2 \delta(n).$$

2.2. Example: linear equation with damping. Consider the recurrence relation
$$X_{n+1} = \alpha X_n + W_n, \quad n \ge 0,$$
where $(W_n)_{n \ge 0}$ is a white noise with intensity $\sigma^2$ and
$$\alpha \in (-1, 1).$$
The following picture has been obtained with the R software by the following commands ($\alpha = 0.9$, $X_0 = 0$):

w <- rnorm(1000)
x <- rnorm(1000)
x[1] = 0
for (i in 1:999) {
  x[i+1] <- 0.9*x[i] + w[i]
}
ts.plot(x)

[Figure: sample trajectory of the damped linear equation with alpha = 0.9.]

It has some features similar to white noise, but it is less random, more persistent in the direction where it moves.

Let $X_0$ be a r.v. independent of the white noise, with zero average and variance $\tilde{\sigma}^2$. Let us show that $(X_n)_{n \ge 0}$ is stationary (in the wide sense) if $\tilde{\sigma}^2$ is properly chosen with respect to $\sigma^2$.

First, we have
$$\mu_0 = 0, \qquad \mu_{n+1} = \alpha \mu_n, \quad n \ge 0,$$
hence $\mu_n = 0$ for every $n \ge 0$. The mean function is constant.

As a preliminary computation, let us impose that the variance function is constant. By induction, $X_n$ and $W_n$ are independent for every n, hence
$$\sigma_{n+1}^2 = \alpha^2 \sigma_n^2 + \sigma^2, \quad n \ge 0.$$
If we want $\sigma_{n+1}^2 = \sigma_n^2$ for every $n \ge 0$, we need
$$\sigma_n^2 = \alpha^2 \sigma_n^2 + \sigma^2, \quad n \ge 0,$$
namely
$$\sigma_n^2 = \frac{\sigma^2}{1 - \alpha^2}, \quad n \ge 0.$$
In particular, this implies the relation
$$\tilde{\sigma}^2 = \frac{\sigma^2}{1 - \alpha^2}.$$
It is here that we first see the importance of the condition $|\alpha| < 1$.

If we assume this condition on the law of $X_0$, then we find
$$\sigma_1^2 = \alpha^2 \frac{\sigma^2}{1 - \alpha^2} + \sigma^2 = \frac{\sigma^2}{1 - \alpha^2} = \sigma_0^2$$
and so on, $\sigma_{n+1}^2 = \sigma_n^2$ for every $n \ge 0$. Thus the variance function is constant.

Finally, we have to show that $R(t+n, t)$ is independent of t. We have
$$R(t+1, t) = E[(\alpha X_t + W_t) X_t] = \alpha \sigma_t^2 = \frac{\alpha \sigma^2}{1 - \alpha^2},$$
which is independent of t;
$$R(t+2, t) = E[(\alpha X_{t+1} + W_{t+1}) X_t] = \alpha R(t+1, t) = \frac{\alpha^2 \sigma^2}{1 - \alpha^2},$$
and so on,
$$R(t+n, t) = E[(\alpha X_{t+n-1} + W_{t+n-1}) X_t] = \alpha R(t+n-1, t) = \dots = \alpha^n R(t, t) = \frac{\alpha^n \sigma^2}{1 - \alpha^2},$$
which is independent of t. The process is stationary. We have
$$R(n) = \frac{\alpha^n \sigma^2}{1 - \alpha^2}.$$
It also follows that
$$\rho(n) = \alpha^n.$$
The autocorrelation coefficient (as well as the autocovariance function) decays exponentially in time.
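A short R check (an added illustration with $\alpha = 0.9$ and $\sigma = 1$) that the empirical autocorrelation of a simulated trajectory matches $\rho(n) = \alpha^n$ when $X_0$ is drawn from the stationary law:

set.seed(5)
alpha <- 0.9
N <- 10000
w <- rnorm(N)
x <- numeric(N)
x[1] <- rnorm(1, sd = 1 / sqrt(1 - alpha^2))        # X_0 from the stationary law
for (i in 1:(N - 1)) x[i + 1] <- alpha * x[i] + w[i]
emp <- acf(x, lag.max = 10, plot = FALSE)$acf[, 1, 1]
round(cbind(empirical = emp, theoretical = alpha^(0:10)), 3)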

2.3. Processes defined also for negative times. We may extend the previous definitions a little and call a discrete time stochastic process also the two-sided sequences $(X_n)_{n \in \mathbb{Z}}$ of random variables. Such processes are thus defined also for negative times. The idea is that the physical process they represent started in the far past and continues in the future.

This notion is particularly natural in the case of stationary processes. The function $R(n)$ (similarly for $C(n)$ and $\rho(n)$) is thus defined also for negative n:
$$R(n) = E[X_n X_0], \quad n \in \mathbb{Z}.$$
By stationarity,
$$R(-n) = R(n)$$
because $R(-n) = E[X_{-n} X_0] = E[X_{-n+n} X_{0+n}] = E[X_0 X_n] = R(n)$. Therefore we see that this extension does not contain much new information; however, it is useful, or at least it simplifies some computations.

3. Time series and empirical quantities

A time series is a sequence of real numbers $x_1, \dots, x_n$. Empirical samples also have the same form. The name time series is appropriate when the index i of $x_i$ has the meaning of time.

A finite realization of a stochastic process is a time series. Ideally, when we have an experimental time series, we think that there is a stochastic process behind it. Thus we try to apply the theory of stochastic processes.

Recall from elementary statistics that empirical estimates of mean values of a single r.v. X are computed from an empirical sample $x_1, \dots, x_n$ of that r.v.; the higher n is, the better the estimate. A single sample $x_1$ is not sufficient to estimate moments of X.

Similarly, we may hope to compute empirical estimates of $R(t,s)$ etc. from time series. But here, when the stochastic process has special properties (stationary and ergodic; see below for the concept of ergodicity), one sample is sufficient! By "one sample" we mean one time series (which is one realization of the process, just as the single $x_1$ is one realization of the r.v. X). Again, the higher n is, the better the estimate, but here n refers to the length of the time series.

Consider a time series $x_1, \dots, x_n$. In the sequel, t and $n_t$ are such that
$$t + n_t = n.$$
Let us define
$$\bar{x}_t = \frac{1}{n_t} \sum_{i=1}^{n_t} x_{i+t}, \qquad \hat{\sigma}_t^2 = \frac{1}{n_t} \sum_{i=1}^{n_t} (x_{i+t} - \bar{x}_t)^2,$$
$$\hat{R}(t) = \frac{1}{n_t} \sum_{i=1}^{n_t} x_i x_{i+t}, \qquad \hat{C}(t) = \frac{1}{n_t} \sum_{i=1}^{n_t} (x_i - \bar{x}_0)(x_{i+t} - \bar{x}_t),$$
$$\hat{\rho}(t) = \frac{\hat{C}(t)}{\hat{\sigma}_0 \hat{\sigma}_t} = \frac{\sum_{i=1}^{n_t} (x_i - \bar{x}_0)(x_{i+t} - \bar{x}_t)}{\sqrt{\sum_{i=1}^{n_t} (x_i - \bar{x}_0)^2 \sum_{i=1}^{n_t} (x_{i+t} - \bar{x}_t)^2}}.$$
These quantities are taken as approximations of
$$\mu_t, \quad \sigma_t^2, \quad R(t, 0), \quad C(t, 0), \quad \rho(t, 0),$$
respectively. In the case of stationary processes, they are approximations of
$$\mu, \quad \sigma^2, \quad R(t), \quad C(t), \quad \rho(t).$$
In the section on ergodic theorems we shall see rigorous relations between these empirical and theoretical functions.
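As an illustration (a hypothetical helper, not part of the notes), the empirical quantities above can be coded directly in R; here $\bar{x}_0$ is taken as the mean of the full series, i.e. the case $t = 0$ of $\bar{x}_t$:

emp_stats <- function(x, t) {
  n  <- length(x)
  nt <- n - t
  x0 <- mean(x)                                  # xbar_0 (case t = 0, where n_0 = n)
  xt <- mean(x[(1 + t):n])                       # xbar_t
  Rhat   <- sum(x[1:nt] * x[(1 + t):n]) / nt
  Chat   <- sum((x[1:nt] - x0) * (x[(1 + t):n] - xt)) / nt
  rhohat <- sum((x[1:nt] - x0) * (x[(1 + t):n] - xt)) /
            sqrt(sum((x[1:nt] - x0)^2) * sum((x[(1 + t):n] - xt)^2))
  c(Rhat = Rhat, Chat = Chat, rhohat = rhohat)
}
set.seed(6)
emp_stats(rnorm(1000), t = 5)    # for a white noise series, rhohat is near 0 for t > 0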

The empirical correlation coefficient
$$\hat{\rho}_{X,Y} = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2 \sum_{i=1}^n (y_i - \bar{y})^2}}$$
between two sequences $x_1, \dots, x_n$ and $y_1, \dots, y_n$ is a measure of their linear similarity. If there are coefficients a and b such that the residuals
$$\varepsilon_i = y_i - (a x_i + b)$$
are small, then $|\hat{\rho}_{X,Y}|$ is close to 1; precisely, $\hat{\rho}_{X,Y}$ is close to 1 if $a > 0$ and close to $-1$ if $a < 0$. A value of $\hat{\rho}_{X,Y}$ close to 0 means that no such linear relation is really good (in the sense of small residuals). Precisely, the smallness of the residuals must be understood in comparison with the empirical variance $\hat{\sigma}_Y^2$ of $y_1, \dots, y_n$: one can prove that, for the least squares choice of a and b,
$$\hat{\rho}_{X,Y}^2 = 1 - \frac{\hat{\sigma}_\varepsilon^2}{\hat{\sigma}_Y^2}$$
(the so-called explained variance, the proportion of variance which has been explained by the linear model). After these remarks, the intuitive meaning of $\hat{R}(t)$, $\hat{C}(t)$ and $\hat{\rho}(t)$ should be clear: they measure the linear similarity between the time series and its t-translation. This is useful to detect repetitions, periodicity, and trend.
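The explained-variance identity can be verified numerically in R (an added illustration on simulated data; lm computes the least squares coefficients a and b):

set.seed(7)
x <- rnorm(200)
y <- 2 * x + 1 + rnorm(200, sd = 0.5)            # data with an approximate linear relation
fit <- lm(y ~ x)
res <- residuals(fit)
cor(x, y)^2                                      # empirical rho_{X,Y}^2
1 - mean(res^2) / mean((y - mean(y))^2)          # 1 - sigma_eps^2 / sigma_Y^2, the same number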

Example 1. Consider the following time series, taken from the EUROSTAT database. It collects export data concerning motor vehicle accessories, from January 1995 to December 2008.

[Figure: the EUROSTAT export time series, January 1995 - December 2008.]

Its empirical autocorrelation function $\hat{\rho}(t)$ is given by

[Figure: empirical autocorrelation function of the full series.]

We see high values (the values of $\hat{\rho}(t)$ are always smaller than 1 in absolute value) for all time lags t. The reason is the trend of the original time series (which is highly non-stationary).

Example 2. If we consider only the last few years of the same time series, precisely January 2005 - December 2008, the data are much more stationary and the trend is less strong. The autocorrelation function $\hat{\rho}(t)$ is now given by

[Figure: empirical autocorrelation function of the restricted series.]

where we notice a moderate annual periodicity.

4. Gaussian processes

If the generic vector $(X_{t_1}, \dots, X_{t_n})$ is jointly Gaussian, we say that the process is Gaussian. The law of a Gaussian vector is determined by the mean vector and the covariance matrix. Hence the laws of the marginals of a Gaussian process are determined by the mean function $\mu_t$ and the autocorrelation function $R(t,s)$.

Proposition 5. For Gaussian processes, stationarity in the wide and in the strong sense are equivalent.

Proof. Given a Gaussian process $(X_n)_{n \in \mathbb{N}}$, the generic vector $(X_{t_1+s}, \dots, X_{t_n+s})$ is Gaussian, hence with law determined by the mean vector of components
$$E[X_{t_i+s}] = \mu_{t_i+s}$$
and the covariance matrix of components
$$\mathrm{Cov}(X_{t_i+s}, X_{t_j+s}) = R(t_i+s, t_j+s) - \mu_{t_i+s}\, \mu_{t_j+s}.$$
If the process is stationary in the wide sense, then $\mu_{t_i+s} = \mu$ and
$$R(t_i+s, t_j+s) - \mu_{t_i+s}\, \mu_{t_j+s} = R(t_i - t_j) - \mu^2$$
do not depend on s. Then the law of $(X_{t_1+s}, \dots, X_{t_n+s})$ does not depend on s. This means that the process is stationary in the strict sense. The converse is a general fact. The proof is complete.

Most of the models in these notes are obtained by linear transformations of white noise. White noise is a Gaussian process. Linear transformations preserve Gaussianity. Hence the resulting processes are Gaussian. Since we deal very often with processes which are stationary in the wide sense, being Gaussian they are also strictly stationary.

5. Discrete time Fourier transform

Given a sequence $(x_n)_{n \in \mathbb{Z}}$ of real or complex numbers such that $\sum_{n \in \mathbb{Z}} |x_n|^2 < \infty$, we denote by $\hat{x}(\omega)$ or by $\mathcal{F}[x](\omega)$ the discrete time Fourier transform (DTFT) defined as
$$\hat{x}(\omega) = \mathcal{F}[x](\omega) = \frac{1}{\sqrt{2\pi}} \sum_{n \in \mathbb{Z}} e^{-i\omega n} x_n, \qquad \omega \in [0, 2\pi].$$
The function can be considered for all $\omega \in \mathbb{R}$, but it is $2\pi$-periodic. Sometimes the factor $\frac{1}{\sqrt{2\pi}}$ is not included in the definition; sometimes it is preferable to use the variant
$$\hat{x}(f) = \frac{1}{\sqrt{2\pi}} \sum_{n \in \mathbb{Z}} e^{-2\pi i f n} x_n, \qquad f \in [0, 1].$$
We make the choice above, independently of the fact that in certain applications it is customary or convenient to make other choices. The factor $\frac{1}{\sqrt{2\pi}}$ is included for symmetry with the inverse transform and the Plancherel formula (without $\frac{1}{\sqrt{2\pi}}$, a factor $\frac{1}{2\pi}$ appears in one of them).

The $L^2$-theory of Fourier series guarantees that the series $\sum_{n \in \mathbb{Z}} e^{-i\omega n} x_n$ converges in mean square with respect to $\omega$, namely, there exists a square integrable function $\hat{x}(\omega)$ such that
$$\lim_{N \to \infty} \int_0^{2\pi} \left| \frac{1}{\sqrt{2\pi}} \sum_{|n| \le N} e^{-i\omega n} x_n - \hat{x}(\omega) \right|^2 d\omega = 0.$$
The sequence $x_n$ can be reconstructed from its Fourier transform by means of the inverse Fourier transform
$$x_n = \frac{1}{\sqrt{2\pi}} \int_0^{2\pi} e^{i\omega n}\, \hat{x}(\omega)\, d\omega.$$
Among other properties, let us mention the Plancherel formula
$$\sum_{n \in \mathbb{Z}} |x_n|^2 = \int_0^{2\pi} |\hat{x}(\omega)|^2\, d\omega$$
and the fact that under the Fourier transform the convolution corresponds to the product:
$$\mathcal{F}\left[\sum_{n \in \mathbb{Z}} f(\cdot - n)\, g(n)\right](\omega) = \hat{f}(\omega)\, \hat{g}(\omega).$$

When
$$\sum_{n \in \mathbb{Z}} |x_n| < \infty,$$
the series $\sum_{n \in \mathbb{Z}} e^{-i\omega n} x_n$ is absolutely convergent, uniformly in $\omega \in [0, 2\pi]$, simply because
$$\sum_{n \in \mathbb{Z}} \sup_{\omega \in [0, 2\pi]} \left| e^{-i\omega n} x_n \right| = \sum_{n \in \mathbb{Z}} \sup_{\omega \in [0, 2\pi]} \left| e^{-i\omega n} \right| |x_n| = \sum_{n \in \mathbb{Z}} |x_n| < \infty.$$
In this case, we may also say that $\hat{x}(\omega)$ is a bounded continuous function, not only square integrable. Notice that the assumption $\sum_{n \in \mathbb{Z}} |x_n| < \infty$ implies $\sum_{n \in \mathbb{Z}} |x_n|^2 < \infty$, because $\sum_{n \in \mathbb{Z}} |x_n|^2 \le \sup_{n \in \mathbb{Z}} |x_n| \sum_{n \in \mathbb{Z}} |x_n|$ and $\sup_{n \in \mathbb{Z}} |x_n|$ is bounded when $\sum_{n \in \mathbb{Z}} |x_n|$ converges.

One can define the DTFT also for sequences which do not satisfy the assumption $\sum_{n \in \mathbb{Z}} |x_n|^2 < \infty$, in special cases. Consider for instance the sequence
$$x_n = a \sin(\omega_1 n).$$
Compute the truncation
$$\hat{x}_{2N}(\omega) = \frac{1}{\sqrt{2\pi}} \sum_{|n| \le N} e^{-i\omega n}\, a \sin(\omega_1 n).$$
Recall that
$$\sin t = \frac{e^{it} - e^{-it}}{2i}.$$
Hence $\sin(\omega_1 n) = \frac{e^{i\omega_1 n} - e^{-i\omega_1 n}}{2i}$ and
$$\sum_{|n| \le N} e^{-i\omega n}\, a \sin(\omega_1 n) = \frac{a}{2i} \sum_{|n| \le N} e^{-i(\omega - \omega_1) n} - \frac{a}{2i} \sum_{|n| \le N} e^{-i(\omega + \omega_1) n}.$$

The next lemma makes use of the concept of generalized function, or distribution, which is outside the scope of these notes. We still give the result, to be understood in some intuitive sense. We use the generalized function $\delta(t)$ called the Dirac delta, which is characterized by the property
$$\int_{-\infty}^{\infty} \delta(t - t_0) f(t)\, dt = f(t_0) \tag{5.1}$$
for all continuous compactly supported functions f. No usual function has this property. A way to get intuition is the following one. Consider a function $\delta_n(t)$ which is equal to zero for t outside $\left[-\frac{1}{2n}, \frac{1}{2n}\right]$, an interval of length $\frac{1}{n}$ around the origin, and equal to n in $\left[-\frac{1}{2n}, \frac{1}{2n}\right]$. Hence $\delta_n(t - t_0)$ is equal to zero for t outside $\left[t_0 - \frac{1}{2n}, t_0 + \frac{1}{2n}\right]$ and equal to n in $\left[t_0 - \frac{1}{2n}, t_0 + \frac{1}{2n}\right]$. We have
$$\int_{-\infty}^{\infty} \delta_n(t)\, dt = 1.$$
Now,
$$\int_{-\infty}^{\infty} \delta_n(t - t_0) f(t)\, dt = n \int_{t_0 - \frac{1}{2n}}^{t_0 + \frac{1}{2n}} f(t)\, dt,$$
which is the average of f around $t_0$. As $n \to \infty$, this average converges to $f(t_0)$ when f is continuous. Namely, we have
$$\lim_{n \to \infty} \int_{-\infty}^{\infty} \delta_n(t - t_0) f(t)\, dt = f(t_0),$$
which is the analog of identity (5.1), but expressed by means of traditional concepts. In a sense, thus, the generalized function $\delta(t)$ is the limit of the traditional functions $\delta_n(t)$. But we see that $\delta_n(t)$ converges to zero for all $t \neq 0$, and to $\infty$ for $t = 0$. So, in a sense, $\delta(t)$ is equal to zero for $t \neq 0$ and to $\infty$ for $t = 0$; but this is very poor information, because it does not allow one to deduce identity (5.1) (the way $\delta_n(t)$ goes to infinity is essential, not only the fact that $\delta(t)$ is $\infty$ for $t = 0$).

Lemma 2. Denote by $\delta(t)$ the generalized function such that
$$\int_{-\infty}^{\infty} \delta(t - t_0) f(t)\, dt = f(t_0)$$
for all continuous compactly supported functions f (it is called the Dirac delta distribution). Then
$$\lim_{N \to \infty} \sum_{|n| \le N} e^{-itn} = 2\pi\, \delta(t).$$
From this lemma it follows that
$$\lim_{N \to \infty} \sum_{|n| \le N} e^{-i\omega n}\, a \sin(\omega_1 n) = \frac{a\pi}{i}\, \delta(\omega - \omega_1) - \frac{a\pi}{i}\, \delta(\omega + \omega_1).$$
In other words:

Corollary 1. The sequence
$$x_n = a \sin(\omega_1 n)$$
has a generalized DTFT
$$\hat{x}(\omega) = \lim_{N \to \infty} \hat{x}_{2N}(\omega) = \frac{a\sqrt{\pi}}{\sqrt{2}\, i}\, \big(\delta(\omega - \omega_1) - \delta(\omega + \omega_1)\big).$$

This is only one example of the possibility to extend the definition and meaning of the DTFT outside the assumption $\sum_{n \in \mathbb{Z}} |x_n|^2 < \infty$. It is also very interesting for the interpretation of the concept of DTFT. If the signal $x_n$ has a periodic component (notice that the DTFT is linear) with angular frequency $\omega_1$, then its DTFT has two symmetric peaks (Dirac delta components) at $\pm\omega_1$. This way, the DTFT reveals the periodic components of the signal.
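A numerical illustration (added here; the truncation level, amplitude and frequency are arbitrary choices): compute the truncated DTFT $\hat{x}_{2N}$ of $x_n = a\sin(\omega_1 n)$ on a grid of $\omega$ and plot its modulus. Since the transform is $2\pi$-periodic, the peak at $-\omega_1$ appears at $2\pi - \omega_1$ on $[0, 2\pi]$.

N <- 200; a <- 1; omega1 <- pi / 4
n <- -N:N
x <- a * sin(omega1 * n)
omega <- seq(0, 2 * pi, length.out = 1000)
dtft <- sapply(omega, function(w) sum(exp(-1i * w * n) * x) / sqrt(2 * pi))   # truncated DTFT
plot(omega, Mod(dtft), type = "l", xlab = "omega", ylab = "modulus of the truncated DTFT")
abline(v = c(omega1, 2 * pi - omega1), lty = 2)   # positions of the two peaks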

Exercise 9. Prove that the sequence
$$x_n = a \cos(\omega_1 n)$$
has a generalized DTFT
$$\hat{x}(\omega) = \lim_{N \to \infty} \hat{x}_{2N}(\omega) = \frac{a\sqrt{\pi}}{\sqrt{2}}\, \big(\delta(\omega - \omega_1) + \delta(\omega + \omega_1)\big).$$

6. Power spectral density

Given a stationary process $(X_n)_{n \in \mathbb{Z}}$ with correlation function $R(n) = E[X_n X_0]$, $n \in \mathbb{Z}$, we call power spectral density (PSD) the function
$$S(\omega) = \frac{1}{\sqrt{2\pi}} \sum_{n \in \mathbb{Z}} e^{-i\omega n} R(n), \qquad \omega \in [0, 2\pi].$$
Alternatively, one can use the expression
$$S(f) = \frac{1}{\sqrt{2\pi}} \sum_{n \in \mathbb{Z}} e^{-2\pi i f n} R(n), \qquad f \in [0, 1],$$
which produces easier visualizations because we catch more easily the fractions of the interval $[0, 1]$.

Remark 11. In principle, to be defined, this series requires $\sum_{n \in \mathbb{Z}} |R(n)| < \infty$ or at least $\sum_{n \in \mathbb{Z}} |R(n)|^2 < \infty$. In practice, on one side the convergence may happen also in unexpected cases due to cancellations; on the other side, it may be acceptable to use a finite-time variant, something like $\sum_{|n| \le N} e^{-i\omega n} R(n)$, for practical purposes or from the computational viewpoint.

A priori, one may think that $S(f)$ may not be real valued. However, the function $R(n)$ is non-negative definite (this means $\sum_{i,j=1}^n R(t_i - t_j)\, a_i a_j \ge 0$ for all $t_1, \dots, t_n$ and $a_1, \dots, a_n$) and a theorem states that the Fourier transform of a non-negative definite function is a non-negative function. Thus, in the end, it turns out that $S(f)$ is real and also non-negative. We do not give the details of this fact here because it will be a consequence of the fundamental theorem below.

6.1. Example: white noise. We have
$$R(n) = \sigma^2 \delta(n),$$
hence
$$S(\omega) = \frac{\sigma^2}{\sqrt{2\pi}}, \qquad \omega \in \mathbb{R}.$$
The spectral density is constant. This is the origin of the name white noise.

6.2. Example: perturbed periodic time series. This example is numeric only. Produce with the R software the following time series:

t <- 1:100
y <- sin(t/3) + 0.3*rnorm(100)
ts.plot(y)

The empirical autocorrelation function, obtained by acf(y), is

[Figure: empirical autocorrelation function of the perturbed periodic series.]

and the power spectral density, suitably smoothed, obtained by spectrum(y, span=c(2,3)), is

[Figure: smoothed power spectral density estimate of the series.]

6.3. Pink, Brown, Blue, Violet noise. In certain applications one meets PSDs of special type which have been given names similar to white noise. Recall that white noise has a constant PSD. Pink noise has a PSD of the form
$$S(f) \sim \frac{1}{f}.$$
Brown noise:
$$S(f) \sim \frac{1}{f^2}.$$
Blue noise:
$$S(f) \sim f.$$
Violet noise:
$$S(f) \sim f^2.$$

7. Fundamental theorem on PSD

The following theorem is often stated without assumptions in the applied literature. One of the reasons is that it can be proved at various levels of generality, with different meanings of the limit operation (it is a limit of functions). We shall give a rigorous statement under a very precise assumption on the autocorrelation function $R(n)$; the convergence we prove is rather strong. The assumption is a little bit strange, but satisfied in all our examples. The assumption is that there exists a sequence $(\varepsilon_n)_{n \in \mathbb{N}}$ of positive numbers such that
$$\lim_{n \to \infty} \varepsilon_n = 0, \qquad \sum_{n \in \mathbb{N}} \frac{|R(n)|}{\varepsilon_n} < \infty. \tag{7.1}$$
This is just a little bit more restrictive than the condition $\sum_{n \in \mathbb{N}} |R(n)| < \infty$, which is natural to impose if we want uniform convergence of $\frac{1}{\sqrt{2\pi}} \sum_{n \in \mathbb{Z}} e^{-i\omega n} R(n)$ to $S(\omega)$. Any example of $R(n)$ satisfying $\sum_{n \in \mathbb{N}} |R(n)| < \infty$ that the reader may have in mind presumably satisfies assumption (7.1) easily.

Theorem 1 (Wiener-Khinchin). If $(X(n))_{n \in \mathbb{Z}}$ is a wide-sense stationary process satisfying assumption (7.1), then
$$S(\omega) = \lim_{N \to \infty} \frac{1}{2N+1}\, E\left[\left|\hat{X}_{2N}(\omega)\right|^2\right].$$
The limit is uniform in $\omega \in [0, 2\pi]$. Here $X_{2N}$ is the truncated process $X \cdot 1_{[-N,N]}$. In particular, it follows that $S(\omega)$ is real and non-negative.

Proof. Step 1. Let us prove the following main identity:
$$S(\omega) = \frac{1}{2N+1}\, E\left[\left|\hat{X}_{2N}(\omega)\right|^2\right] + r_N(\omega) \tag{7.2}$$
where the remainder $r_N$ is given by
$$r_N(\omega) = \frac{1}{2N+1}\, \mathcal{F}\left[\sum_{n \in \Lambda(N, \cdot)} E[X(\cdot + n)\, X(n)]\right](\omega)$$
with
$$\Lambda(N, t) = [-N, N_t^-) \cup (N_t^+, N],$$
$$N_t^+ = \begin{cases} N & \text{if } t \le 0 \\ N - t & \text{if } 0 < t \le N \\ 0 & \text{if } t > N \end{cases} \qquad\qquad N_t^- = \begin{cases} -N & \text{if } t \ge 0 \\ -N - t & \text{if } -N \le t < 0 \\ 0 & \text{if } t < -N. \end{cases}$$

Since $R(t) = E[X(t+n)\, X(n)]$ for all n, we obviously have, for every $N > 0$,
$$R(t) = \frac{1}{2N+1} \sum_{|n| \le N} E[X(t+n)\, X(n)].$$
Thus
$$S(\omega) = \hat{R}(\omega) = \frac{1}{2N+1}\, \mathcal{F}\left[\sum_{|n| \le N} E[X(\cdot + n)\, X(n)]\right](\omega). \tag{7.3}$$
Then recall that
$$\mathcal{F}\left[\sum_{n \in \mathbb{Z}} f(\cdot - n)\, g(n)\right](\omega) = \hat{f}(\omega)\, \hat{g}(\omega),$$
hence
$$\mathcal{F}\left[\sum_{n \in \mathbb{Z}} f(\cdot + n)\, g(n)\right](\omega) = \mathcal{F}\left[\sum_{n \in \mathbb{Z}} f(\cdot - n)\, g(-n)\right](\omega) = \hat{f}(\omega)\, \hat{g}(-\omega)$$
because
$$\mathcal{F}[g(-\cdot)](\omega) = \hat{g}(-\omega).$$
