Embedding and Complexity of Time Series: Theory and Applications


MASTER'S DEGREE PROGRAMME IN MATHEMATICS

MASTER'S THESIS

Embedding and Complexity

of a Time Series:

Theory and Applications

CANDIDATE:

Andrea Scarciglia

SUPERVISOR:

Prof. Claudio Bonanno

CO-SUPERVISOR:

Ing. Gaetano Valenza

ACADEMIC YEAR 2019/2020, GRADUATION SESSION 23 OCTOBER 2020


Abstract

In this thesis we introduce an approach to the study of time series by nonlinear dynamical systems. The main assumption we make throughout this work is that time series samples can be associated to points of a nonlinear ergodic dynamical system through a map, called measurement map.

We are interested in showing how to classify time series in terms of the predictability of their behavior. Firstly, we give a mathematical framework for some of the quantifiers that will be described later. We face the embedding problem: "almost any" measurement map has to be an embedding, that is, an injective immersion. The embedding requirement is motivated by the fact that the differential structure and injectivity must be preserved in order to compute quantifiers (e.g. Lyapunov exponents). Moreover, "almost every" has to be intended in the sense of prevalence, a generalization of the notion of "probability one" to infinite-dimensional spaces. Since attractors may not have integer dimension, fractal sets are introduced. The main theorem we use to solve the embedding problem is the Fractal Delay Embedding Prevalence Theorem, a modification of Whitney's embedding theorems more suitable for time series: it states that "almost every" measurement map is an embedding, provided that the samples are taken and organized in a particular manner.

After introducing notions of ergodic theory and discrete-time stochastic processes, we describe the Kolmogorov-Sinai (K-S) entropy as a quantifier of the predictability of an ergodic dynamical system (and consequently of a time series). Since the K-S entropy computation may be difficult, we obtain K-S entropy by using the Algorithmic Information Content (AIC) of strings.

We convert time series into strings composed of a finite number of symbols. For each symbolic string, we define its AIC as the length of the shortest binary program that reproduces the string as output, and its complexity as the mean AIC per symbol. The notion of complexity is related to that of K-S entropy by Brudno's theorem. It seems reasonable to determine the K-S entropy by the AIC, but this is not possible, since the AIC cannot be computed by any algorithm. In order to approximate the AIC, the simple Lempel-Ziv data compression algorithm is introduced: it is a lossless prefix-free code which asymptotically compresses almost every string from an ergodic stochastic process using as little storage as possible, in such a way that the expected binary code length per symbol is equal to the entropy of the process and that it is possible to reconstruct the


original series from the encoded one: these properties allow us to think of Lempel-Ziv data compression as an approximation of the AIC. All the tools introduced for computing the K-S entropy require limits, and consequently a large number of samples, which is not always available. Other quantifiers of unpredictability are shown: the generalized entropy of order 2 (K2), approximate entropy and sample entropy. They all rely on the geometrical attractor reconstruction, and they all estimate the probability of finding similar finite-length patterns in a given time series. K2 is a lower bound to the K-S entropy, while the other two entropies are seen to work well with a small number of points. Finally, in numerical simulations, we test algorithms for approximate entropy, sample entropy and Lempel-Ziv data compression on circle rotations, logistic maps and physiological data. The aim is to relate the entropies to the behavior of the dynamical system underlying the time series. In the case of physiological data, these quantifiers are shown to distinguish clinical behaviors.


Contents

Introduction

1 State-space Reconstruction
  1.1 Some Definitions
    1.1.1 Fractal sets
  1.2 Delay-coordinate Embedding
    1.2.1 Filtered delay-coordinate map
  1.3 Proofs
  1.4 How to estimate the embedding dimension

2 Entropy of a Time Series
  2.1 Preliminary Concepts and Tools
    2.1.1 Probability tools
    2.1.2 Entropy Theorem and Frequency-Typical Sets
  2.2 Kolmogorov-Sinai Entropy
  2.3 Symbolic Dynamics and AIC of a Time Series
  2.4 AIC and Data Compression
  2.5 Codes
  2.6 Prefix Codes
  2.7 Universal Codes
  2.8 Lempel-Ziv Algorithm
  2.9 Other Ways to Measure Entropy
    2.9.1 Approximate Entropy
    2.9.2 Sample Entropy

3 Numerical Simulations
  3.1 Chaotic Dynamical Systems and Entropy
  3.2 Circle Rotation
  3.3 Logistic map
  3.4 Physiological Data


Introduction

In the last thirty years, nonlinear dynamical systems have been studied intensively, not least because they play a fundamental role in modelling natural phenomena.

Indeed, by assuming that there is a nonlinear dynamical system behind observable phenomena, one can try to infer some of its properties from experimental measurements.

Usually, data are collected as a time series, that is, a sequence {Q_t} where the quantity Q_t is sampled τ time units apart, for some positive integer τ.

The aim of this thesis is to show a first approach to the study of time series by describing tools, testing them and giving a mathematical framework.

This thesis is divided into three chapters. The first is entirely dedicated to the geometrical embedding problem of a time series: as we will see in Chapter 2, the embedding dimension of a time series is the main ingredient in the formulation of efficient algorithms which allow one to classify time series in terms of their predictability.

Let (M, f) denote the discrete dynamical system underlying the dynamics where M is a k-dimensional manifold (the phase space). We are interested in a subset A ⊂ M, called the attractor, i.e. the set where all the points lie for very long times.

One thinks of the measurement as a smooth observation function h on the state space M. Each Q_t is assumed to be of the form Q_t = h(f^t(x)), that is, the result of evaluating the observation function h at the current state f^t(x). Therefore, we suppose that every sample Q_t of the time series is associated to a point of A by a measurement map F.

In this context, the attractor reconstruction consists in giving sufficient conditions for the following property to hold: "almost every" measurement map F, defined on A into the space where the samples live, has to be an embedding, that is, an injective immersion.

The embedding requirement is important for two reasons: it preserves the differential structure of the attractor (which is useful in the computation of Lyapunov exponents) and injectivity makes predictions more accurate and controls the behavior of the dynamics.


Since attractors may not have integer dimension, we introduce fractal sets and their geometrical dimension, known as the box-counting dimension.

The "almost every" in this property has to be intended in the sense of prevalence, which generalizes "probability one" to infinite-dimensional spaces.

Scientists can hardly assume the simultaneous measurements sampled in Q_t to be independent: in this case delay-coordinate embeddings are the most suitable maps for the attractor reconstruction, namely maps F(h, f, τ) : M → R^n defined as

F(h, f, τ)(x) = (h(x), h(f^{-τ}(x)), h(f^{-2τ}(x)), …, h(f^{-(n-1)τ}(x)))

where h is a smooth function.

The most important result in this direction is the Fractal Delay Embedding Prevalence Theorem (derived from Whitney's embedding theorems), which states that almost every (in the sense of prevalence) delay-coordinate map F(h, f, τ) is an embedding, provided that the dimension n of the image (chosen large enough) and the delay τ guarantee the non-existence of orbits of period τ and 2τ and the existence of only finitely many periodic orbits of f for each period 3τ, …, nτ.

This result is actually proved as a special case of a generalized version (the Fractal Filtered Delay Embedding Prevalence Theorem) which uses filters (i.e. special matrices) to modify the data: for example, we use filters if we are interested in an arithmetic or weighted average of the data.

After showing the proofs of the stated theorems, in the final section we face the problem of choosing the best embedding dimension of the attractor. The only requirement that we maintain is that dim F(A) = dim A, where F is a delay-coordinate map. Since all theorems only provide sufficient conditions, there is no mathematical justification for a best embedding dimension. In practice, the attractor dimension is computed for increasing values of n until a plateau is reached; this plateau value is taken as the embedding dimension.

In the second chapter we classify time series through quantifiers which take into account the amount of "order", where "order" means a higher possibility of finding similar patterns in a time series or of predicting its behavior.

We add the assumption that time series are outputs of an ergodic dynamical system. We list some facts and results about ergodic theory and stationary discrete-time stochastic processes whose random variables take values in a finite set A, the alphabet. An important result is the Entropy Theorem (or Shannon-McMillan-Breiman Theorem), which states the existence, for ergodic processes, of a positive number (the entropy) representing the exponential decay coefficient of the measure of the "typical sets" of the process.

The theorem also guarantees that almost every infinite sequence, output of an ergodic stationary discrete-time stochastic process, is contained in a set of measure one (the entropy-typical set).

For an ergodic dynamical system (X, µ, f), where X is assumed to be a compact metric space, we consider finite measurable partitions of X of the form Z = {I_1, …, I_N}, where the sets I_i are pairwise disjoint and measurable and the measure of the complement of their union is 0. Then we introduce the Kolmogorov-Sinai entropy h as the supremum of the entropies of a partition among all finite measurable partitions. If we assume that each set I_i carries a certain amount of information (a quantity which allows one to identify each element of the partition), the Kolmogorov-Sinai entropy is the highest average information needed to identify a sequence of sets of a finite partition. In practice, generating partitions are used to compute the K-S entropy instead of taking the supremum over all finite partitions, but these are not always available in analytic form. The entropy of a partition is exactly the same quantity as the one described in the Entropy Theorem.

In order to compute the K-S entropy we develop another approach. First of all, we convert our time series into a symbolic sequence over a finite alphabet A through symbolic dynamics. Defining the Algorithmic Information Content of a finite string s as the length of the shortest binary program that reproduces s, we define the complexity of an infinite string ω as the mean length per symbol of a binary program needed to identify its elements; in symbols,

K(ω) = lim sup_{n→∞} AIC(ω^n) / n

where ω^n denotes the first n symbols of ω.

How is the complexity related to the entropy of a partition? The answer is given by Brudno's theorem, which asserts that complexity and entropy of a partition are equal if the dynamical system is ergodic: as a corollary, we may exploit the AIC in the computation of the K-S entropy. Unfortunately, the AIC cannot be obtained by any algorithm. Nevertheless, nothing is lost, because it can be approximated by lossless data compression algorithms, that is, code maps which encode finite sequences of symbols into binary strings using as little storage as possible and allow one to reconstruct the original string from the encoded one.

We continue the discussion with the definition of codes and prefix-free codes and their main properties, especially their relation with the entropy of a process: for any ergodic stationary discrete-time stochastic process with measure µ and entropy h_µ it is possible to construct a prefix-free code (such as the Shannon codes) so that the expected length per encoded word is asymptotically equal to the entropy h_µ; moreover, no prefix-free code has an expected length per encoded word asymptotically smaller than the entropy of the system.

The measure of a process is rarely available for a given time series, so it is impossible to construct the previous prefix-free codes. We therefore face the problem of the universality of codes: are there prefix-free codes whose expected length per encoded word is asymptotically equal to the entropy for any ergodic process with entropy h? Are there prefix-free codes whose expected length per encoded word asymptotically beats the entropy on a set of positive measure?

The answer to the second question is negative, whereas we answer the first question positively by introducing the lossless simple Lempel-Ziv (LZ) data compression algorithm. A brief description of the algorithm and some examples are shown, together with results concerning entropy.
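As an illustrative sketch (not the thesis's exact formulation), the incremental parsing at the heart of LZ-type compression fits in a few lines; the function names and the crude entropy estimate c(n) log₂ c(n) / n, where c(n) is the number of parsed phrases, are assumptions of this example.

```python
import math

def lz78_parse(s):
    """Split s into the shortest phrases never seen before (incremental parsing)."""
    seen = set()
    phrases = []
    current = ""
    for ch in s:
        current += ch
        if current not in seen:   # a new phrase ends here
            seen.add(current)
            phrases.append(current)
            current = ""
    if current:                   # a possibly repeated tail phrase
        phrases.append(current)
    return phrases

def lz_entropy_estimate(s):
    """Rough entropy-rate estimate (bits per symbol) from the phrase count c(n)."""
    c = len(lz78_parse(s))
    return c * math.log2(c) / len(s)
```

For long outputs of an ergodic source this estimate approaches the entropy rate; a highly regular string such as "abab…" yields a value far below 1 bit per symbol, matching the intuition that predictable series compress well.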

The Lempel-Ziv algorithm is a powerful tool for approximating the AIC, and consequently for computing the K-S entropy of a time series. However, this approach may not be very useful in practice, because the entropy is reached only in the limit, and a time series may consist of a small number of samples, preventing the entropy value from being reached. For this reason we introduce other ways to measure entropy with short time series: the generalized entropy of order 2 (K2), approximate entropy and sample entropy.

All three quantifiers rely on the geometrical attractor reconstruction of Chapter 1: indeed, we regroup the raw data of the time series into m-dimensional vectors, where m is supposed to be the embedding dimension of the attractor of the underlying dynamical system (even if m will be obtained empirically). The idea behind the quantifiers can be described in the following way: we divide the phase space into a finite number of balls of radius ε and we count the number of times that two m-dimensional vectors lie in the same ball; in other words, we are counting the trajectories of length m that are sufficiently close to each other. Starting from this idea, the correlation sum and the correlation integral provide the empirical probability that any two trajectories of length m are ε-close.

We define the quantifiers from the correlation sum and integral and show their main properties: they are easily computed from a time series; K2 provides a lower bound to the K-S entropy; approximate entropy takes self-matches into account (the two trajectories can be the same), while sample entropy does not; approximate and sample entropy seem to work well even if few points are available (this is not true for K2). In addition, it proves more convenient to compute the approximate and sample entropy of a time series with fixed embedding dimension m and fixed ε, instead of varying these values until a plateau is reached: in this way, comparisons between different time series with the same number of points are more meaningful.
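The pattern-matching idea above can be sketched as a small sample-entropy-style estimator built directly on the correlation-sum counts. Normalization details vary across definitions in the literature, so this simplified version and its function name are assumptions of the example, not the thesis's exact algorithm.

```python
import math

def sample_entropy(x, m, r):
    """-ln(A/B): B counts pairs of m-point patterns within tolerance r
    (Chebyshev distance), A does the same for (m+1)-point patterns.
    Self-matches (i == j) are excluded."""
    n = len(x)

    def pairs(mm):
        count = 0
        for i in range(n - mm):
            for j in range(i + 1, n - mm):
                if max(abs(x[i + k] - x[j + k]) for k in range(mm)) <= r:
                    count += 1
        return count

    b = pairs(m)
    a = pairs(m + 1)
    return -math.log(a / b) if a > 0 and b > 0 else float("inf")
```

On a regular alternating series the estimate is close to zero (almost every m-match extends to an (m+1)-match), while an irregular series from the logistic map scores noticeably higher, in line with the role of these quantifiers as predictability measures.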

In the third and last chapter, we use numerical simulations to analyze the entropy quantifiers (K-S entropy with LZ compression, approximate and sample entropy) that we have introduced in Chapter 2, and relate them to the behavior of the underlying dynamical system.

A brief description of chaotic systems (in the sense of Devaney) is given: chaotic dynamical systems are characterized through the K-S entropy and the Lyapunov exponents. In particular, dynamical systems with lower K-S entropy turn out to be more easily predicted.

We test the quantifiers on three kinds of data: circle rotations, logistic maps and physiological data. Recall that for LZ compression we first convert the time series into a symbolic one.

In the first two cases we already know the theoretical value of the metric entropy: it is almost achieved by the LZ compression, even though the number of samples has to be very large; lower values of sample and approximate entropy are associated with low complexity (low metric entropy), while higher values are associated with more unpredictability. For the third case, we are given a set of physiological time series representing the interbeat intervals of two equally sized groups of people, old and young. In this case, without knowing the theoretical metric entropy of the time series, our goal is to distinguish clinically the mean heart behavior of old and young people with the three quantifiers above: old people show lower entropy values than young people. All the quantifiers distinguish the two cases, but the most marked distinction is achieved by the sample entropy, which works better with a small number of samples.


Chapter 1

State-space Reconstruction

In this chapter we show which conditions allow one to reconstruct an attractor from a collection of experimental data, in particular time series.

Following mainly Sauer et al. [18], the fundamental fact we are assuming is that the data we want to analyze are the outputs of a nonlinear dynamical system.

The attractor reconstruction turns out to be an important step towards the prediction of nonlinear time series and in the computation of invariant quantities used to characterize the dynamics of such series.

We will state Whitney’s theorems and similar theorems which ensure that "almost every" map from the attractor into the measurements space is an embedding. For this purpose fractal sets will be defined and prevalent sets will be used.

Moreover, the theory shown in this chapter will be the starting point for some algorithms of the next chapter.

The proofs of the theorems used are left for the last part of the chapter.

1.1 Some Definitions

Before giving some results, we formalize some concepts.

Definition 1.1.1 (Dynamical system). Given a set M, an additive monoid T and a function Φ : V ⊆ (T × M) → M, we define a dynamical system as a tuple (M, T, Φ) such that, writing I(x) = {t ∈ T | (t, x) ∈ V}:

• Φ(0, x) = x for all x ∈ M;

• Φ(t_2, Φ(t_1, x)) = Φ(t_1 + t_2, x) for all x ∈ M, t_1, t_1 + t_2 ∈ I(x), t_2 ∈ I(Φ(t_1, x)).

The function Φ(t, x) is called the evolution function of the dynamical system: it associates a unique image to each point of V (or M) depending on the parameter t. The space M is called the phase space.


Furthermore, if we consider x fixed, the function Φ(t, x) ≡ Φ_x(t) : I(x) → M is known as the flow through x: it describes the evolution of the system with initial condition x as t varies in I(x); the graph of a flow is called a trajectory (through x).

Under the same assumptions, a dynamical system is continuous when the set T is an open interval in R; it is discrete if T is the set of the integers.

Since time series are samples taken from a continuous measurable function, we will consider only discrete dynamical systems: in particular, the function Φ is given by

Φ(t, x) = f^t(x)

where f : M → M is a C^1 transformation and f^t(x) denotes the t-fold composition (f ∘ f ∘ ⋯ ∘ f)(x).
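As a tiny sketch, the discrete evolution Φ(t, x) = f^t(x) is nothing more than the repeated application of f; the helper name `iterate` and the choice of the doubling map as an example are assumptions of this illustration.

```python
def iterate(f, x0, t):
    """Return the finite orbit [x0, f(x0), ..., f^t(x0)] of Phi(t, x) = f^t(x)."""
    orbit = [x0]
    for _ in range(t):
        orbit.append(f(orbit[-1]))   # one application of f per time step
    return orbit

# Example: three steps of the doubling map x -> 2x mod 1
orbit = iterate(lambda x: (2 * x) % 1.0, 0.2, 3)
```

Every trajectory used in the numerical simulations of Chapter 3 (circle rotations, logistic maps) is an orbit of exactly this form.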

Remark 1.1.1. In the definition above, we do not require any hypothesis on M. For our purposes, however, we will assume M to be a manifold contained in R^k, for some positive integer k, with a distance d defined on M.

We are interested in the study of the attractor of a dynamical system as defined below.

Definition 1.1.2 (Attractor set). We define an attractor as a closed subset A ⊂ M which has a neighborhood U (called the fundamental neighborhood) such that the following conditions are satisfied:

• for every neighborhood V of A, we have f^t(U) ⊂ V when t is large enough;

• f^t(A) = A for every t;

• irreducibility: no proper subset of A satisfies the first two conditions.

The set W = ⋃_t f^{-t}(U) is called the basin of attraction of A: it consists of all points x ∈ M such that f^t(x) approaches A as t → ∞, and therefore W is independent of the choice of U.

In other words, the attractor is the set where the dynamics lies for long times.

For this reason, it can be convenient to use the following operational and intuitive definition.

Definition 1.1.3 (Attractor: operational definition). Given a dynamical system, we call the attractor the subset of the phase space on which the experimental points f^t(x) accumulate for large times t.

Attractors may not have an integer dimension. Attractors with fractional dimension are called strange attractors.


Remark 1.1.2. There is no unique definition of attractor: we have chosen the most suitable one for our discussion. Attractors may instead be defined as in [17], since it has been shown that attractors may fail to have a basin of attraction.

Suppose that a hidden dynamical system is defined in R^k, for some positive integer k, and that all trajectories are asymptotic to an attractor A. If we know how the system behaves over long times, then we can get some information about A.

Unfortunately, in the situation we consider, we do not know the dynamical system directly but only some observations, which we regard as outputs of the dynamical system. In particular, we may assume that we are able to collect n independent data simultaneously, namely measurements Q_1, …, Q_n, in the sense that Q_i is not a function of any Q_j for i ≠ j.

We can then associate the states Q_1, …, Q_n with a point of the phase space, which is assumed to be a subset of R^k. We have, indeed, the following

Definition 1.1.4. A function F : R^k → R^n which maps states to measurements,

F(state) = (Q_1, …, Q_n),

is called a measurement function.

Among all the measurements, we are interested in the time-dependent ones, which are the most available in practice.

Definition 1.1.5 (Time series). We define a time series {Q_t} as the list of samples of a measurable quantity sampled at intervals τ time units apart.

We can see a time series as the image of an observable function h defined on our dynamical system (M, T, Φ): each Q_t = h(x_t), t ∈ T, x_t ∈ M, is the result of evaluating the observation function h on the state x_t.

Attractor reconstruction is a necessary step towards the study of time series. Indeed, as we will see later, time series from a dynamical system can be characterized in terms of predictability, that is, how easy it is to predict the behavior of the time series (and consequently the behavior of the dynamical system), and in terms of sensitive dependence on initial conditions (these two issues will be discussed in the last chapter). Predictability is quantified by entropy, while sensitive dependence on initial conditions is quantified by the Lyapunov exponents. Some of the algorithms that provide entropy values for a time series are based on the reconstruction of the attractor.

This reconstruction is not the actual one: for example, it may happen that the time series is produced by an observable function defined on a 2-dimensional attractor while the reconstruction provides a higher value for the dimension of the attractor.


In fact, the reconstruction is done in such a way that any measurement function F we take from the attractor to the measurement space is still useful to compute entropy and Lyapunov exponents.

To be more precise, the quantifiers above are easy to compute if the measurement maps are assumed to be embeddings. Here we give the definition.

Definition 1.1.6 (Immersion). Let M be a compact smooth (at least C^1) differentiable manifold in R^k. A smooth map F (at least C^1) defined on M is an immersion if the derivative map DF(x) (represented by the Jacobian matrix of F at x) is one-to-one at every point x of M.

Remark 1.1.3. Since the Jacobian of F is a linear map, F is an immersion if DF(x) has full rank on the tangent space. In particular, no differential structure is lost under an immersion.

Definition 1.1.7 (Embedding). Let S be a set in R^k. A function F : S → R^n is called an embedding if it is an injective immersion.

Remark 1.1.4. If S is a compact manifold, then F is a diffeomorphism onto its image if F is a one-to-one immersion.

So, the attractor reconstruction requires that every measurement function F be an embedding, because it is important to preserve the differential structure and injectivity in order to compute entropy or Lyapunov exponents.

Clearly, whether or not a map from A to R^n is an embedding on A depends on the structure of the set A.

For this purpose, we are helped by the following

Theorem 1.1.1 (Whitney General Embedding Theorem). Suppose A is a smooth manifold of dimension d; then the set of maps which are embeddings of A into R^{2d+1} is an open and dense set in the C^1 topology of maps.

We want to highlight two important aspects: given an embedding F, all small perturbations of F are embeddings (because the set of embedding maps is open); furthermore, any map from A to R^{2d+1}, whether it is an embedding or not, is always arbitrarily close to an embedding (the set of embedding maps is dense).

It seems reasonable to conclude from Whitney's theorem that n = 2d + 1 simultaneous measurements are sufficient to reconstruct a d-dimensional state manifold in the measurement space R^n; however, open and dense subsets can be very small in terms of probability.

One would like to know whether a particular measurement map is an embedding with probability one, that is, to be almost sure that our data are given by an embedding map: in the next section we define an analogue of "probability one" for infinite-dimensional spaces and give an embedding theorem more suitable for our purpose.


Embedding Theorems in the Sense of Prevalence

We begin by giving the following

Definition 1.1.8 (Prevalent set). A Borel subset S of a normed linear space V is prevalent if there is a finite-dimensional subspace E ⊂ V such that for every v ∈ V, the sum v + e belongs to S for (Lebesgue measure supported on E) almost every e ∈ E.

The meaning of prevalence for S is that if we start at any point in V and explore along the finite-dimensional space E, then almost every point encountered will lie in S. For this reason we shall refer to E as a probe space.

In this chapter, where there is no possibility of confusion, we say that "almost every" map satisfies a certain property when the set of such maps is prevalent. Here are some properties of prevalent sets:

• If E ⊂ E0 are finite-dimensional subspaces of a normed linear space V and E is a probe space, then E0 is a probe space.

• A prevalent subset of a finite-dimensional vector space is simply a set whose complement has zero measure.

• The union or intersection of a finite number of prevalent sets is again prevalent.

• In any normed linear space, a prevalent set S is always dense: indeed, let c ∈ V. By prevalence, c + e ∈ S for almost every e ∈ E, so for any ε > 0 there exists e ∈ E with ‖e‖ < ε and c + e ∈ S; hence S is dense. In particular, a prevalent subset of a function space is dense in the C^k topology for any k.

If a condition holds for a prevalent set of functions, it makes sense to determine the smallest, or most efficient, probe space E. This corresponds to the minimal amount of perturbation that must be available to the system in order for the condition to hold with virtual certainty. We give an example to clarify the concept of prevalence.

Example 1.1.1. Consider the convergent Fourier series in one variable, which form an infinite-dimensional vector space with basis {e^{inx}}_{n=-∞}^{+∞}. In the sense of prevalence, almost every Fourier series has nonzero integral on [0, 2π]. The probe space E in this case is the one-dimensional space of constant functions. If E0 is a vector space of Fourier series which contains the constant functions, then for every Fourier series f, the integral of f + e will be nonzero for almost every e ∈ E0.


Theorem 1.1.2 (Whitney Embedding Prevalence Theorem). Let A be a d-dimensional compact manifold contained in R^k; then almost every map R^k → R^{2d+1} is an embedding of A.

Remark 1.1.5. The probe space E is the k(2d + 1)-dimensional space of linear maps from R^k to R^{2d+1}.

Furthermore, any given smooth map F is arbitrarily near an embedding with probability one (in the sense of prevalence).

Remark 1.1.6. Under the same hypotheses, Whitney showed that there exists an embedding map from R^k to R^{2d}. However, this result is useless for us because we do not know whether this map, under small perturbations, is still an embedding.

1.1.1 Fractal sets

Sometimes the attractor A is not a manifold, which prevents us from using Whitney's prevalence theorem. Fortunately, we can deal with these sets as well.

Definition 1.1.9. Given a compact set A ⊂ R^n and a small positive number ε, define A_ε = {x ∈ R^n : ‖x − a‖ ≤ ε for some a ∈ A} and let Vol(A_ε) be the n-dimensional outer volume of A_ε. We define the box-counting dimension of A as

boxdim(A) = n − lim_{ε→0} log Vol(A_ε) / log ε

if the limit exists. If not, the lower (respectively, upper) box-counting dimension can be defined by replacing the limit with lim inf (resp., lim sup).

If the box-counting limit exists, we can assume Vol(A_ε) ≈ ε^{n−d}, where d = boxdim(A).

An equivalent definition of the box-counting dimension is obtained by dividing R^n into ε-cubes with a grid based at points whose coordinates are integer multiples of ε, and considering

boxdim(A) = − lim_{ε→0} log N(ε) / log ε

if the limit exists, where N(ε) is the number of boxes that intersect A. Upper and lower box-counting dimensions are defined by replacing lim with lim sup / lim inf. In this case, N(ε) ≈ ε^{−d}, where d = boxdim(A).

It is easy to check that the two definitions are equivalent in a Euclidean space, using finite coverings and the relations between balls and n-cubes in R^n.

The first definition of the box-counting dimension makes sense in any metric space, while the second one applies only to Euclidean spaces. From now on, we will use the definition which concerns the ε-grid, since the attractors are assumed to lie in some R^k.
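The ε-grid definition translates directly into a numerical estimate for a finite sample of an attractor: count the occupied grid boxes at several scales and fit the slope of log N(ε) against −log ε. The function names below are illustrative, and the estimate is only meaningful over scales where the sample actually resolves the set.

```python
import math

def box_count(points, eps):
    """Number of eps-grid boxes that contain at least one point of the sample."""
    return len({tuple(math.floor(c / eps) for c in p) for p in points})

def boxdim_estimate(points, eps_values):
    """Least-squares slope of log N(eps) versus -log eps."""
    xs = [-math.log(e) for e in eps_values]
    ys = [math.log(box_count(points, e)) for e in eps_values]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den
```

Points sampled densely from a line segment give a slope close to 1, as expected for a 1-dimensional set.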

The next theorem helps us to deal with fractal sets.

Theorem 1.1.3 (Fractal Whitney Embedding Prevalence Theorem). Let A be a compact subset of R^k of box-counting dimension d and let n be an integer greater than 2d. For almost every smooth map F : R^k → R^n,

1. F is one-to-one on A;

2. F is an immersion on each compact subset C of a smooth manifold contained in A.

1.2 Delay-coordinate Embedding

Whitney’s embedding theorems are used if we have a large number of independent measurements. But, it is quite hard to collect a large number of independent quantities or to establish whether two different simultaneous measurements are indeed independent.

In general, the most common kind of data available is the time series, since it is easy to take measurements that depend only on time. For this reason, we should use methods in which only one measurable quantity is needed. We can face this problem by using delay coordinates.

Definition 1.2.1 (Delay-coordinate map of a flow). If Φ is a flow on a manifold M, τ a positive number (the delay), and h : M → R a smooth function, we define the delay-coordinate map of the flow F(h, Φ, τ) : M → R^n as

F(h, Φ, τ)(x) = (h(x), h(Φ^{-τ}(x)), h(Φ^{-2τ}(x)), …, h(Φ^{-(n-1)τ}(x))).
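For a sampled time series, the delay-coordinate map amounts to stacking the present sample together with past samples into vectors. A minimal sketch (the name `delay_vectors` is an assumption of this example):

```python
def delay_vectors(q, n, tau):
    """Build the delay vectors (q[t], q[t - tau], ..., q[t - (n-1)*tau]),
    mirroring F(h, Phi, tau)(x) = (h(x), h(Phi^{-tau}(x)), ...)."""
    start = (n - 1) * tau      # first index with a full history available
    return [tuple(q[t - i * tau] for i in range(n)) for t in range(start, len(q))]
```

These reconstructed vectors are exactly the m-dimensional points on which the correlation-sum-based quantifiers of Chapter 2 operate.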

By adding some hypotheses, we can state an embedding theorem for delay-coordinate maps.

Theorem 1.2.1 (Fractal Delay Embedding Prevalence Theorem). Let Φ be a flow on an open subset U of R^k and let A be a compact subset of U with box-counting dimension d. Let n > 2d be an integer and τ a strictly positive delay. Assume A contains at most a finite number of equilibria, no periodic orbits of Φ of period τ or 2τ, at most finitely many periodic orbits of period 3τ, 4τ, …, nτ, and that the linearizations of those periodic orbits have distinct eigenvalues. Then, for almost every smooth function h on U, the delay-coordinate map F(h, Φ, τ) : U → R^n is:

1. one-to-one on A;

2. an immersion on each compact subset C of a smooth manifold contained in A.

(20)

In general it is not always clear how to choose a suitable delay τ satisfying the hypotheses of the previous theorem: some methods will be discussed later.

However, if we assume that the flow Φ derives from a Lipschitz vector field ẋ = V(x), with ‖V(x) − V(y)‖ ≤ L‖x − y‖ for x, y ∈ A, it is sufficient to take τ smaller than π/L: this result is discussed in [20], where the author shows that in this case each periodic orbit has period at least 2π/L. We stress that, even if the vector field is assumed to be Lipschitz, the value of the constant L is not known a priori.

1.2.1

Filtered delay-coordinate map

By the previous theorems, the reconstruction procedure exploits information from the previous n time steps to identify a state of the original dynamical system in R^k.

For the purposes of measuring quantitative invariants of the dynamical system, noise reduction, or prediction, it can be advantageous to create an embedding that identifies a state using information from a larger number of previous time steps. However, working in R^n is difficult for large n. A way around this problem is to incorporate large numbers of previous data readings by "averaging" their contribution in some sense. To this end, we generalize the concept of delay-coordinate map.

Definition 1.2.2 (Filtered delay-coordinate map). Let U be an open subset of R^k, let g : U → U be a smooth diffeomorphism, and let h : U → R be a function. Let w− < w+ be integers and let w = w+ − w− + 1. For 1 ≤ i ≤ w, set g_i = g^{w−+i−1}, so that g_1 = g^{w−} and g_w = g^{w+}. Let B be an n × w matrix. We define the filtered delay-coordinate map F_{w−}^{w+}(B, h, g) : U → R^n by

F_{w−}^{w+}(B, h, g)(x) = B (h(g_1(x)), h(g_2(x)), . . . , h(g_w(x)))^T = B (h(g^{w−}(x)), . . . , h(g^{w+}(x)))^T.
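A small sketch of the filtered map: collect the window of readings h(g^{w−}(x)), . . . , h(g^{w+}(x)) and apply the filter matrix B. For illustration g below is the logistic map (not a diffeomorphism, used only to generate readings), h is the identity, and B averages the window and takes a first difference; all of these are assumed examples.

```python
import numpy as np

def iterate(g, x, p):
    """Compute g^p(x) for p >= 0."""
    for _ in range(p):
        x = g(x)
    return x

def filtered_delay_map(B, h, g, x, w_minus, w_plus):
    """F(B, h, g)(x) = B (h(g^{w-}(x)), ..., h(g^{w+}(x)))^T,
    sketched for non-negative powers of g."""
    readings = np.array([h(iterate(g, x, p)) for p in range(w_minus, w_plus + 1)])
    return B @ readings

g = lambda x: 3.9 * x * (1.0 - x)           # illustrative map on (0, 1)
B = np.array([[0.25, 0.25, 0.25, 0.25],     # window average
              [1.0, -1.0, 0.0, 0.0]])       # first difference
v = filtered_delay_map(B, h=lambda x: x, g=g, x=0.2, w_minus=0, w_plus=3)
```

Here n = 2 and w = 4, so each state is summarized by two filtered readings instead of four raw ones.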

In this section and in the following, g denotes a smooth diffeomorphism on an open neighbourhood U ⊂ R^k, and h_1, . . . , h_t form a basis for the polynomials in k variables of degree at most 2w. For a smooth function h_0 on R^k and for α ∈ R^t, define h_α = h_0 + Σ_{i=1}^t α_i h_i. For each positive integer p, denote A_p = {x ∈ A : g^p(x) = x}.

For p > q ≥ 0, let C_pq be a p × (p − (p, q)) matrix of the form

C_pq = ( I_{p−(p,q)} ; −I_{(p,q)} · · · I_{(p,q)} ),

a block matrix built from the identity block I_{p−(p,q)} and blocks ±I_{(p,q)}, where (·, ·) denotes the greatest common divisor (with the convention (p, 0) = 0). Finally, let C_pq^∞ be the ∞ × (p − (p, q)) matrix formed by repeating the block C_pq vertically and, for a positive integer w, define C_pq^w to be the matrix formed by the top w rows of C_pq^∞.

The following two theorems hold.

Theorem 1.2.2. Let g be a smooth diffeomorphism on an open neighbourhood U of R^k, and let A be a compact subset of U with boxdim(A) = d. Let n and w− < w+ be integers such that n ≤ w = w+ − w− + 1. Assume that the n × w matrix B satisfies:

A1. rank BC_p0^w > 2 · boxdim(A_p) for all 1 ≤ p ≤ w.

A2. rank BC_pq^w > boxdim(A_p) for all 1 ≤ q < p ≤ w.

Let h_1, . . . , h_t be as before. Then, for any smooth function h_0 on R^k and for almost every α ∈ R^t, it follows that if n > 2d, then F(B, h_α, g) : U → R^n is one-to-one on A.

Theorem 1.2.3. Let g be a smooth diffeomorphism on an open neighbourhood U of R^k, and let A be a compact subset of a smooth m-manifold in U. Assume that the linearizations of the periodic orbits of period less than w have distinct eigenvalues. Let n ≤ w be positive integers as in the previous theorem and assume that the n × w matrix B satisfies

A3. rank BD_p^w(λ_{i_1}, . . . , λ_{i_r}) > boxdim(A_p) + r − 1 for all 1 ≤ p < w and for all subsets {λ_{i_1}, . . . , λ_{i_r}} of eigenvalues of the linearization Dg^p at a point in A_p.

Let h_1, . . . , h_t be a basis for the polynomials in k variables of degree at most 2w. Then, for any smooth function h_0 on R^k and for almost every α ∈ R^t, it follows that if n ≥ 2m then F(B, h_α, g) : U → R^n is an immersion on A.

As a corollary of the two previous theorems, two particular properties hold for a diffeomorphism g.

Theorem 1.2.4. Let g be a diffeomorphism on an open subset U of R^k, let A be a compact subset of U with boxdim(A) = d, and let n > 2d be an integer. Assume that for every positive integer p ≤ n, the set A_p of periodic points of period p satisfies boxdim(A_p) < p/2, and that the linearization Dg^p at each of these orbits has distinct eigenvalues. Then for almost every smooth function h on U, the delay-coordinate map F(h, g) : U → R^n is:

• One-to-one on A.

• An immersion on each compact subset C of a smooth manifold contained in A.

Theorem 1.2.5 (Fractal Filtered Delay Embedding Prevalence Theorem). Let g be a diffeomorphism on an open subset U of R^k, and let A be a compact subset of U with boxdim(A) = d. For a positive integer n > 2d, let B be an n × w matrix of rank n. Assume g has no periodic points of period less than or equal to w. Then for almost every smooth function h, the delay-coordinate map F(B, h, g) : U → R^n is:

• One-to-one on A.

• An immersion on each closed subset C of a smooth manifold contained in A.

1.3

Proofs

We are going to prove the theorems we have stated before. All the following results are from [18].

Lemma 1.3.1. Let n and k be positive integers, y_1, . . . , y_n distinct points in R^k, u_1, . . . , u_n in R and v_1, . . . , v_n in R^k. Then:

1. There exists a polynomial h in k variables of degree at most n − 1 such that h(y_i) = u_i for i = 1, . . . , n.

2. There exists a polynomial h in k variables of degree at most n such that ∇h(y_i) = v_i for i = 1, . . . , n.

Proof. 1. After a linear change of coordinates we may assume that the first coordinates of y_1, . . . , y_n are distinct, and apply ordinary one-dimensional polynomial interpolation to them: the resulting polynomial of degree n − 1 (actually in one variable) is of the desired type.

2. If k = 1, we can find a polynomial of degree at most n − 1 that interpolates the data; taking an antiderivative of this polynomial, the conclusion follows. If k > 1 one can assume, again by a linear change of coordinates, that for every j ∈ {1, . . . , k} the j-th coordinates of y_1, . . . , y_n are distinct; then, as in the case k = 1, one finds for each j a polynomial h_j = h_j(x_j) of degree at most n whose derivative ∂h_j/∂x_j interpolates the j-th coordinates of the v_i, i ∈ {1, . . . , n}. The sum Σ_{j=1}^k h_j(x_j) gives a polynomial satisfying 2.
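The first part of the lemma is ordinary interpolation in the first coordinate. A short numerical check, with illustrative points and values (not from the text):

```python
import numpy as np

# Lemma 1.3.1(1) in coordinates: after a linear change of coordinates the
# first coordinates of the y_i are distinct, so a one-variable polynomial
# of degree n - 1 through (y_i1, u_i) already does the job.
y = np.array([[0.0, 1.0], [1.0, -2.0], [2.0, 0.5], [3.5, 4.0]])  # n = 4 points in R^2
u = np.array([1.0, 0.0, -1.0, 2.0])                              # prescribed values

coeffs = np.polyfit(y[:, 0], u, deg=len(u) - 1)  # exact fit of degree n - 1 = 3
h = np.poly1d(coeffs)                            # h depends on the first variable only
```

The fitted polynomial reproduces the prescribed values at the four points exactly, up to floating-point error.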

An important linear-algebra tool is the singular value decomposition: a matrix M ∈ C^{t×n} can be written as V S U*, where

• V is a t × t unitary matrix;

• U is an n × n unitary matrix;

• S is a t × n diagonal matrix whose diagonal elements σ_i := σ_{i,i} satisfy σ_1 ≥ σ_2 ≥ · · · ≥ σ_r > 0, with r ≤ min{t, n}, and all other entries of S are zero.

Then V S U* is said to be a singular value decomposition of M, and the elements σ_i are called singular values.

A theorem, which we do not prove, states that such a decomposition always exists but is not unique; moreover, rank(M) = rank(S).
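The identity rank(M) = rank(S) can be checked numerically: the rank equals the number of nonzero singular values. The matrix below is an illustrative example.

```python
import numpy as np

# rank(M) equals the number of nonzero singular values of M
M = np.array([[2.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [2.0, 1.0, 0.0]])        # third row = first + second, so rank 2
V, s, Uh = np.linalg.svd(M)            # M = V @ diag(s) @ Uh, s sorted descending
rank_from_svd = int(np.sum(s > 1e-10)) # count singular values above tolerance
```

In floating-point arithmetic "zero" singular values are only numerically zero, so a tolerance is needed.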

We now have the ingredients to prove the following

Lemma 1.3.2. Let F(x) = Mx + b be a map from R^t to R^n, where M is an n × t matrix and b ∈ R^n. For a positive integer r, let σ > 0 be the r-th largest singular value of M. Then

Vol(B_ρ ∩ F^{−1}(B_δ)) / Vol(B_ρ) < 2^{t/2} (δ/(ρσ))^r,

where B_ρ denotes the ball of radius ρ centered at the origin in R^t and B_δ is the n-dimensional ball of radius δ centered at the origin.

Proof. We may assume σ_1 = · · · = σ_r = σ and all other singular values equal to 0, since decreasing any singular value of M does not decrease the left-hand side. Let V S U^T be a singular value decomposition of M. One can see that MB_ρ = V S U^T B_ρ is an r-dimensional ball of radius ρσ in R^n: the orthogonal matrices V and U^T preserve dimensions and radii, while S "deletes" all but r coordinates and multiplies the radius of B_ρ by σ (only the first r coordinates of B_ρ survive).

It follows that B_ρ ∩ F^{−1}(B_δ) is a cylindrical subset of B_ρ with base dimension r and base radius δ/σ (it may even be empty if the translation vector b has large enough norm). This subset has volume less than (δ/σ)^r V_r ρ^{t−r} V_{t−r}, where

V_r = π^{r/2} / (r/2)!  if r is even,  V_r = 2^{(r+1)/2} π^{(r−1)/2} / r!!  if r is odd,

is the volume of the unit ball in R^r and r!! denotes the double factorial of r. So

Vol(B_ρ ∩ F^{−1}(B_δ)) / Vol(B_ρ) < (δ/σ)^r ρ^{t−r} V_r V_{t−r} / (ρ^t V_t) < 2^{t/2} (δ/(ρσ))^r,

where the last inequality is simple to prove when r and t are both even, while some algebra is required for the remaining cases. The last formula concludes the proof.
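A Monte Carlo sanity check of the bound in the simplest case, t = n = 2, M = σI (so r = 2) and b = 0; here F^{−1}(B_δ) is the ball of radius δ/σ, the exact volume ratio is (δ/(ρσ))^2, and the lemma's bound carries an extra factor 2^{t/2}. All numerical values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
t, r = 2, 2
rho, sigma, delta = 1.0, 2.0, 1.0

# uniform sample of B_rho in R^t by rejection from the enclosing square
pts = rng.uniform(-rho, rho, size=(200_000, t))
pts = pts[np.linalg.norm(pts, axis=1) <= rho]

# fraction of B_rho mapped by F(x) = sigma*x into B_delta
ratio = np.mean(np.linalg.norm(sigma * pts, axis=1) <= delta)
bound = 2 ** (t / 2) * (delta / (rho * sigma)) ** r
```

With these values the empirical ratio is close to 0.25, safely below the bound 0.5.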

Lemma 1.3.3. Let S be a bounded subset of R^k with boxdim(S) = d, and let G_0, G_1, . . . , G_t be Lipschitz maps from S to R^n. Assume that for each x in S, the r-th largest singular value of the n × t matrix

M_x = (G_1(x), . . . , G_t(x))

is at least σ > 0. For each α ∈ R^t define G_α = G_0 + Σ_{i=1}^t α_i G_i. Then for almost every (in the Lebesgue sense) α ∈ R^t, the set G_α^{−1}(0) has lower box-counting dimension at most d − r.

Proof. For a positive number ρ, let B_ρ be the ball of radius ρ centered at the origin in R^t. Without loss of generality, we may replace R^t by B_ρ in order to prove the lemma. For the remainder of the proof, we say that G_α has some property with probability p to mean that the Lebesgue measure of the set of α ∈ B_ρ for which G_α has the property is p times the measure of B_ρ. For example, if x ∈ S, then the previous lemma shows that |G_α(x)| = |G_0(x) + M_x α| ≤ ε holds, for α ∈ B_ρ, with probability at most 2^{t/2}(ε/(σρ))^r.

We can choose D > d and ε_0 > 0 in such a way that for every ε < ε_0 the following two facts hold:

• since S is bounded, S can be covered by ε^{−D} k-dimensional balls B(x, ε) of radius ε, centered at points x ∈ S;

• by the Lipschitz condition, there exists a constant C such that the image under any G_α, α ∈ B_ρ, of any ε-ball in R^k intersecting S is contained in a Cε-ball in R^n.

From now on we assume ε < ε_0. Let R_x = {α ∈ B_ρ : |G_α(x)| ≤ Cε}. The Lebesgue measure of R_x is at most a constant multiple of ε^r, since σ and ρ are constant (in other words, the measure of R_x bounds the probability that G_α(B(x, ε)) contains 0).

We want to estimate the probability that at least M of the ε^{−D} balls are mapped onto sets containing 0. Define the function f : B_ρ → N such that f(α) is the number of sets R_x, with x among the ε^{−D} centers of the balls covering S, to which α belongs. Then ∫ f(α) dm(α) = Σ_x m(R_x) ≤ ε^{−D} max_x m(R_x), and by Markov's inequality

m{α : f(α) > M} < (1/M) ∫ f(α) dm(α) ≤ ε^{−D} max_x m(R_x) / M ≤ C_1 ε^{r−D} / M.

In other words, taking M = ε^{−b}, the set G_α^{−1}(0) can be covered by fewer than M = ε^{−b} of the ε-balls except with probability at most C_1 ε^{b−(D−r)}. As long as b > D − r, this probability can be made as small as desired by decreasing ε.

Let p > 0. We choose a sequence {ε_i}_{i∈N} approaching 0 such that G_α^{−1}(0) can be covered by fewer than ε_i^{−b} balls except with probability at most p2^{−i}. Thus the lower box-counting dimension of G_α^{−1}(0) is at most b, except for a probability-p subset of α. Since p was arbitrary, lower boxdim(G_α^{−1}(0)) ≤ b for almost every α. In conclusion, since this holds for all b > d − r, lower boxdim(G_α^{−1}(0)) ≤ d − r.

Remark 1.3.1. In the case boxdim(S) does not exist, the lemma is still valid if we suppose that d is the lower box-counting dimension of S.

Lemma 1.3.4. Let S be a bounded subset of R^k with boxdim(S) = d, and let G_0, G_1, . . . , G_t be Lipschitz maps from S to R^n. Assume that for each x in S the rank of the n × t matrix

M_x = (G_1(x), . . . , G_t(x))

is at least r. For each α ∈ R^t define G_α = G_0 + Σ_{i=1}^t α_i G_i. Then for almost every (in the Lebesgue sense) α ∈ R^t, the set G_α^{−1}(0) is a nested countable union of sets of lower box-counting dimension at most d − r. If d < r, then G_α^{−1}(0) is empty for almost every α.

Proof. Apply lemma 1.3.3 to the sets S_σ = {x ∈ S : the r-th largest singular value of M_x is greater than σ}: for almost every α, the set G_α^{−1}(0) ∩ S_σ has lower box-counting dimension at most d − r, and is empty when d < r. Since S = ∪_{σ>0} S_σ (a nested countable union along σ = 1/j), the conclusion follows; in particular, if d < r then G_α^{−1}(0) = ∅ for almost every α.

Lemma 1.3.5. Let A be a compact subset of R^k and let F_0, F_1, . . . , F_t be Lipschitz maps from A to R^n. For each integer r ≥ 0, let S_r be the set of pairs x ≠ y in A × A for which the n × t matrix

M_xy = (F_1(x) − F_1(y), . . . , F_t(x) − F_t(y))

has rank r, and let d_r be the lower box-counting dimension of S_r. If d_r < r for all integers r ≥ 0 (for r = 0 this means that S_0 is empty), then for α outside a subset of measure zero in R^t the map F_α = F_0 + Σ_{i=1}^t α_i F_i : A → R^n is one-to-one.

Proof. Define G_i(x, y) = F_i(x) − F_i(y) for i = 0, . . . , t and G_α = G_0 + Σ_{i=1}^t α_i G_i for α ∈ R^t; by hypothesis the rank of the matrix M_xy = (G_1(x, y), . . . , G_t(x, y)) is r on S_r.

Now, if r > d_r, by lemma 1.3.4 we have that for almost every α ∈ R^t the image of S_r under G_α does not contain the origin; that is, F_α(x) ≠ F_α(y) for x ≠ y in S_r. If r > d_r for all r ≥ 0, note that each pair x ≠ y lies in some S_r: otherwise there would exist a pair x ≠ y with rank(M_xy) outside {0, . . . , min(n, t)}, which is absurd. It follows that F_α is one-to-one for almost every α, and this concludes the proof.

The next theorem shows that almost every linear transformation between two Euclidean spaces is one-to-one on a given compact set.

Theorem 1.3.1. Let A be a compact subset of R^k with lower boxdim(A) = d. If n > 2d, then almost every linear transformation from R^k to R^n is one-to-one on A.

Proof. Let x ≠ y be a pair in A × A and consider the matrix M_xy = (F_1(x − y), . . . , F_{nk}(x − y)) defined in lemma 1.3.5, where the {F_i} form a basis of the nk-dimensional space of linear transformations from R^k to R^n. It is easy to see that the rank of M_xy equals n: in fact, assuming without loss of generality that the first coordinate a of the vector x − y ≠ 0 is different from 0, we can choose n linear transformations of the basis such that F_j(x − y) = (0, . . . , a, . . . , 0), with a in the j-th place.

In particular, adopting the notation of lemma 1.3.5, S_n = A × A − ∆ (where ∆ denotes the diagonal of pairs x = y in the product) and S_r = ∅ for r ≠ n, because we have noticed that M_xy always has full rank for x ≠ y.

Since A × A − ∆ = S_n and lower boxdim(S_n) ≤ lower boxdim(A) + lower boxdim(A) = 2d < n, all the hypotheses of lemma 1.3.5 are satisfied; hence almost every F_α = Σ α_i F_i is one-to-one on A.
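The theorem can be illustrated numerically: a finite sample of a curve in R^5 (so d = 1) is mapped injectively to R^3 by a generic random linear map. The curve and the Gaussian choice of "almost every" linear map below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.linspace(0.0, 1.0, 500)
A = np.column_stack([t, t**2, np.sin(3 * t), np.cos(5 * t), t**3])  # curve in R^5

L = rng.normal(size=(3, 5))     # a "generic" linear transformation R^5 -> R^3
image = A @ L.T                 # projected sample in R^3

# injectivity on the sample: all image points are pairwise distinct
n_distinct = len(np.unique(image, axis=0))
```

A random Gaussian matrix avoids the measure-zero set of bad linear maps with probability one, mirroring the prevalence statement.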

In our state space reconstruction it is important to preserve the differential structure. We continue with the following definitions.

Definition 1.3.1. For a compact differentiable manifold M, let T(M) = {(x, v) : x ∈ M, v ∈ T_xM} be the tangent bundle of M and let S(M) = {(x, v) ∈ T(M) : |v| = 1} denote the unit tangent bundle of M.

Remark 1.3.2. We assume the following fact: if M is a k-dimensional differentiable manifold in Rn, then S(M) has dimension 2k − 1.

Lemma 1.3.6. Let A be a compact subset of a smooth manifold embedded in R^k. Let F_0, F_1, . . . , F_t : U → R^n be t + 1 smooth maps defined on an open neighbourhood U of A. For r ≥ 0 let S_r be the subset of S(A) on which the n × t matrix

(DF_1(x)(v), . . . , DF_t(x)(v))

has rank r, and let d_r = lower boxdim(S_r). Define F_α = F_0 + Σ_{i=1}^t α_i F_i : U → R^n. If d_r < r for all integers r ≥ 0, then for almost every α ∈ R^t the map F_α is an immersion on A.

Proof. Define G_i : S(A) → R^n by G_i(x, v) = DF_i(x)v. If d_r < r for all r ≥ 0, we apply lemma 1.3.4 to show that for almost every α, G_α^{−1}(0) ∩ S_r = ∅; since S(A) = ∪_r S_r, the following equalities hold:

G_α^{−1}(0) ∩ S(A) = G_α^{−1}(0) ∩ (∪_r S_r) = ∪_r (G_α^{−1}(0) ∩ S_r) = ∅,

hence G_α^{−1}(0) = ∅. It follows that no unit tangent vector is mapped to the origin, and F_α is an immersion.

We are now ready to prove the theorems stated in the previous section.

Proof of the Fractal Whitney Embedding Prevalence Theorem. It suffices to show that almost every smooth map is one-to-one and an immersion. To this end, we take as probe space the set E = {F_1, . . . , F_t} forming a basis of the nk-dimensional space of linear transformations from R^k to R^n. By the argument of theorem 1.3.1, F_α = Σ α_i F_i is one-to-one on A for almost every α ∈ R^t. If we add other smooth maps F_{t+1}, . . . , F_{t′} to the probe space, the rank of M_xy does not change for x ≠ y, so almost every linear combination of F_1, . . . , F_{t′} is one-to-one on A; hence almost every smooth map F is one-to-one on A (exactly in the sense of prevalence).

Then, recalling that boxdim(A) = d, we take a compact subset C of a smooth manifold contained in A, so that the dimension of C is at most d. From the previous remark we have boxdim(S(C)) ≤ 2d − 1. The proof is concluded by applying lemma 1.3.6 to the functions of the probe space E, with S_n = S(C) and S_r = ∅ for r ≠ n (in the notation of lemma 1.3.6), and noticing that boxdim(S_n) ≤ 2d − 1 < 2d < n.

Remark 1.3.3. Whitney Embedding Prevalence Theorem is also proved, since it is a particular case of the Fractal one.

Proof of Theorem 1.2.2. For i = 0, 1, . . . , t define

F_i(x) = B (h_i(g_1(x)), . . . , h_i(g_w(x)))^T,

so that, by definition, F(B, h_α, g) = F_0 + Σ_{i=1}^t α_i F_i. To use lemma 1.3.5, we have to study for each x ≠ y the rank of the matrix

M_xy = (F_1(x) − F_1(y), . . . , F_t(x) − F_t(y)),

which can be written as

B ( h_j(g_i(x)) − h_j(g_i(y)) )_{1≤i≤w, 1≤j≤t} = B J H,

where

H = ( h_j(z_i) )_{1≤i≤q, 1≤j≤t},

q ≤ 2w, the z_i are the distinct points among the g_i(x) and g_i(y), and J = J_xy is a w × q matrix each of whose rows consists of zeros except for one 1 and one −1. By the first part of lemma 1.3.1 the rank of H is q (a basis formed by monomials can be chosen). Now, the study is divided into three cases.

Case 1: x and y are not both periodic with period ≤ w. In this case J can be put in upper or lower triangular form and rank J = w. Since B, J and H are onto linear transformations, the product BJH is onto and has rank n. The set of pairs x ≠ y of case 1 has box-counting dimension at most 2d < n = rank(M_xy). If g has no periodic points of period ≤ w, we are done by lemma 1.3.5.

Case 2: x and y lie in distinct periodic orbits of period ≤ w. Assume p and q are minimal such that g^p(x) = x and g^q(y) = y, with 1 ≤ q ≤ p ≤ w. In this case the matrix J contains a copy of C_p0^w. Since H is onto, rank M_xy = rank BJ. By hypothesis A1, rank BJ ≥ rank BC_p0^w > 2 · boxdim(A_p), which bounds the box-counting dimension of the set of pairs treated in case 2. By lemma 1.3.5, for almost every α ∈ R^t, F_α(x) ≠ F_α(y) for every such pair x ≠ y.

Case 3: both x and y lie in the same periodic orbit of period ≤ w. Assume p and q are minimal such that g^p(x) = x and g^q(x) = y, with 1 ≤ q < p ≤ w. Since x and y lie in the same periodic orbit, the column space of J contains the column space of C_pq^w. Thus rank BJH = rank BJ ≥ rank BC_pq^w > boxdim(A_p), which is the dimension of the set of pairs of case 3. The thesis follows from lemma 1.3.5 again.

Proof of Theorem 1.2.3. We want to apply lemma 1.3.6. First of all, we have to check the rank of the n × t matrix

(DF_1(x)(v), . . . , DF_t(x)(v))

for each (x, v) in the unit tangent bundle S(A). For a given observation function h, the derivative of F(B, h, g) is

DF(B, h, g)(x)v = B (∇h(g^{w−}(x))^T Dg^{w−}(x)v, . . . , ∇h(g^{w+}(x))^T Dg^{w+}(x)v)^T.

We divide the proof into cases.

Case 1: if x is not a periodic point of period less than w, then g^{w−}(x), . . . , g^{w+}(x) are distinct points. Since g is a diffeomorphism and v ≠ 0, we have Dg^i(x)v ≠ 0 for all i. Therefore, by the second part of lemma 1.3.1, the set of vectors {DF(B, h_α, g)(x)v : α ∈ R^t} spans R^n. In the notation of lemma 1.3.6, the subset S_n contains all points of S(A) that are not periodic with period less than w, and d_n = lower boxdim(S_n) ≤ 2m − 1 < n. If g has no periodic points of period less than w, this already concludes the proof.

Case 2: if x is a periodic point of period p < w, then

DF(B, h, g)(x)v = B (H_1^T w_1, . . . , H_p^T w_p, H_1^T D_1 w_1, . . . , H_p^T D_p w_p, H_1^T D_1^2 w_1, . . .)^T    (1.1)

where x_i = g^{w−+i}(x) = x_{p+i}, H_i = ∇h(x_i), w_i = Dg(x_{i−1}) · · · Dg(x_1) Dg^{w−}(x)v and D_i = Dg(x_{i−1}) · · · Dg(x_1) Dg(x_p) · · · Dg(x_i).

Each matrix D_i has the same set of eigenvalues λ_1, . . . , λ_m and, by hypothesis, they are distinct. If u_1, . . . , u_m is a spanning set of eigenvectors for D_1, then one checks that u_{ij} = Dg(x_{i−1}) · · · Dg(x_1) u_j, for 1 ≤ i ≤ p and 1 ≤ j ≤ m, defines a spanning set {u_{i1}, . . . , u_{im}} of eigenvectors for D_i. Thus if w_1 = Σ_{j=1}^m a_j u_{1j} is the eigenvector expansion of w_1, then the eigenvector expansion of w_i is Σ_{j=1}^m a_j u_{ij}, with the same coefficients.

Thus DF(B, h, g)(x)v can be written as B times the w-vector

V_1 (a_1 u_{11}^T, . . . , a_m u_{1m}^T)^T H_1 + · · · + V_p (a_1 u_{p1}^T, . . . , a_m u_{pm}^T)^T H_p,    (1.2)

where V_i is the w × m matrix whose rows are zero except that, for each l ≥ 0 with i + lp ≤ w, row i + lp equals (λ_1^l, . . . , λ_m^l).

To find the rank of the matrix (1.1) for (x, v) with x periodic, we need to find the span of B times the vectors (1.2) for h = h_α = Σ α_i h_i, α ∈ R^t. Assume that the eigenvector expansion of v has exactly r nonzero coefficients, corresponding to the eigenvalues λ_{i_1}, . . . , λ_{i_r}. By the second part of lemma 1.3.1, the gradients H_1, . . . , H_p can be prescribed arbitrarily in R^k. Then, because the u_{ij}, 1 ≤ j ≤ m, are linearly independent, the vectors of the form (1.2) span a space of dimension min{w, rp} as α spans R^t.

Therefore, for this (x, v), the span of the vectors (1.1) has dimension equal to the rank of BD_p^w(λ_{i_1}, . . . , λ_{i_r}). By hypothesis, the boxdim of such pairs (x, v) in S(A) is boxdim(A_p) + r − 1, and the rank of the n × t matrix (1.1) is strictly larger. So lemma 1.3.6 applies and gives the conclusion.

Proof of Theorem 1.2.4. Apply theorems 1.2.2 and 1.2.3 with B = I_n.

Proof of Theorem 1.2.5. Since A_p is empty for 1 ≤ p ≤ w, the conditions A1–A3 of theorems 1.2.2 and 1.2.3 are satisfied immediately.

1.4

How to estimate the embedding dimension

When using a delay-coordinate map or a filtered delay-coordinate map to examine the image F(A) in R^n of a set A in R^k, the choice of n depends on the objective of our investigation, e.g. the calculation of dimension and Lyapunov exponents, prediction, or noise reduction [18].

To compute the dimension of A, all that is required is that dim F(A) = dim A, whatever notion of dimension is considered.

In practical situations, if attempts to measure boxdim(A) give answers that depend on n, with n > boxdim(A), the variation should be regarded as a numerical artifact, since there is no theoretical justification for deciding which of the values of n greater than boxdim(A) gives the more accurate result. The standard technique is to increase n until the observed value of boxdim F(A) reaches a plateau, and to use this result (the so-called plateau dimension).
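The plateau technique can be sketched numerically: estimate the box-counting dimension of delay reconstructions of a periodic signal (a closed orbit, so the true dimension is 1) for increasing embedding dimension n and look for the level-off. The signal, delay, and grid sizes below are illustrative choices.

```python
import numpy as np

def boxdim(points, epsilons):
    # slope of log N(eps) versus log(1/eps) over the eps-grid counts
    counts = [len(np.unique(np.floor(points / e), axis=0)) for e in epsilons]
    return np.polyfit(np.log(1.0 / np.array(epsilons)), np.log(counts), 1)[0]

def delay_vectors(s, n, tau):
    start = (n - 1) * tau
    return np.column_stack([s[start - j * tau: len(s) - j * tau] for j in range(n)])

s = np.sin(np.linspace(0.0, 400.0, 60_000))   # ~63 periods, ~943 samples each
dims = [boxdim(delay_vectors(s, n, tau=236), [0.2, 0.1, 0.05]) for n in (2, 3, 4, 5)]
# dims should level off near 1 once n exceeds 2 * boxdim(A) = 2
```

With tau near a quarter period, the n = 2 reconstruction is already an embedding of the circle, and the estimated dimension stays near 1 as n grows: the plateau.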


Chapter 2

Entropy of a Time Series

In this chapter, we want to characterize the behavior of a time series. First of all, we always assume that our time series is the output of an ergodic dynamical system. We introduce the main aspects and results of ergodic theory that will be used later.

Then, using finite measurable partitions, the Kolmogorov-Sinai entropy is introduced as a quantifier of the predictability of an ergodic dynamical system.

Simultaneously, the complexity of an infinite string of symbols is introduced through the Algorithmic Information Content (AIC), that is, the minimum amount of information needed to identify the string.

Following [3], we convert a time series into a sequence of symbols. The complexity of a time series can be identified with the K-S entropy through symbolic dynamics and Brudno's theorem: from these facts, we can compute the K-S entropy by means of the AIC.

Unfortunately, the AIC cannot be computed by any algorithm: for this reason, prefix-free codes and the simple Lempel-Ziv algorithm are introduced as approximations of the AIC.

Following P. Shields [19] and T. Cover [7], we describe the simple Lempel-Ziv algorithm and its properties.

In the last part, we introduce other ways to characterize a time series: the generalized entropy of order 2 [10], [8], the approximate entropy [14], and the sample entropy [16]. Their definitions are based on the embedding dimension of the attractor underlying the dynamics (see Chapter 1), even if these quantifiers are commonly used with fixed embedding dimensions.

2.1

Preliminary Concepts and Tools

We will try to infer entropy properties from strings that represent discrete-time stochastic processes.


Definition 2.1.1 (Process). A discrete-time stochastic process is a sequence

X1, X2, . . . , Xn, . . . of random variables defined on a probability space (X, B, µ).

The process has alphabet A if the range of each X_i is contained in A. In our discussion we only consider processes with a finite alphabet, that is, with |A| (the cardinality of A) finite. Moreover, "measure" will mean "probability measure" and "function" will mean "measurable function" with respect to some appropriate σ-algebra on a probability space.

The sequence a_m, a_{m+1}, . . . , a_n, where each a_i ∈ A, is denoted by a_m^n, while A_m^n is the set that contains all the sequences a_m^n. The notation A^n will refer to A_1^n.

Of particular interest is the notion of k-th order distribution of the process {X_k}, which is the measure µ_k on A^k defined by the formula

µ_k(a_1^k) = Prob(X_1^k = a_1^k),  a_1^k ∈ A^k.

The set {µ_k : k ≥ 1} is called the distribution of the process: a family of probability distributions, one for each k. An important fact is that the sequence of distributions cannot be completely arbitrary, because it has to satisfy a consistency relation which is implicit in the definition of a process. This relation can be expressed as

µ_k(a_1^k) = Σ_{a_{k+1}} µ_{k+1}(a_1^{k+1}).
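The consistency relation says that marginalizing µ_{k+1} over the last symbol must return µ_k. A quick check on an assumed toy process (i.i.d. bits with Prob(1) = p):

```python
from itertools import product

p = 0.3  # illustrative i.i.d. process on the alphabet A = {0, 1}

def mu(word):
    """mu_k evaluated on a word a_1^k for the i.i.d. process."""
    prob = 1.0
    for a in word:
        prob *= p if a == 1 else 1.0 - p
    return prob

# mu_k(a_1^k) = sum over a_{k+1} of mu_{k+1}(a_1^{k+1}) for every word
for k in (1, 2, 3):
    for word in product((0, 1), repeat=k):
        assert abs(mu(word) - sum(mu(word + (a,)) for a in (0, 1))) < 1e-12
```

Any family {µ_k} failing this check cannot be the distribution of a process.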

A process is defined by its joint distributions, without giving too much importance to the particular space on which the random variables X_n are defined. Therefore, one can choose the underlying space on which the X_n are defined, as long as the joint distributions are left unchanged.

It is convenient, then, to use the Kolmogorov model for a process, which is a particular way to define a space and a sequence of functions with the given distributions. We now see how this construction is carried out.

Let Ω = A^∞ denote the set of all infinite sequences with entries in A. We define the cylinder set determined by a_m^n, denoted by [a_m^n], as the following subset of Ω:

[a_m^n] = {x ∈ Ω : x_i = a_i, m ≤ i ≤ n}.

Let C_n be the collection of cylinder sets defined by sequences that belong to A^n and let C = ∪_n C_n. We denote by B the σ-algebra generated by the cylinder sets C; the members of B are called the Borel sets of Ω.

For each n ≥ 1 the coordinate function X̂_n : Ω → A is defined by X̂_n(x) = x_n, x ∈ Ω.

The following theorem, known as the Kolmogorov representation theorem, states that every process with finite alphabet A can be thought of as the coordinate function process {X̂_n} together with a Borel measure on Ω.


Theorem 2.1.1 (Kolmogorov Representation Theorem). If {µ_k} is a sequence of measures for which the consistency condition holds, then there is a unique Borel probability measure µ on Ω such that µ([a_1^k]) = µ_k(a_1^k) for each k and a_1^k. Equivalently, if {X_n} is a process with finite alphabet A, there is a unique Borel measure µ on Ω for which the sequence of coordinate functions {X̂_n} has the same distribution as {X_n}.
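For a concrete example, consider an assumed two-state Markov process: the Kolmogorov measure assigns to a cylinder [a_1^k] the product µ([a_1^k]) = π[a_1] · Π_i P[a_i, a_{i+1}]. The transition matrix and initial distribution below are illustrative.

```python
import numpy as np
from itertools import product

P = np.array([[0.9, 0.1],
              [0.4, 0.6]])        # transition probabilities (rows sum to 1)
pi = np.array([0.8, 0.2])         # initial distribution

def mu_cyl(word):
    """Kolmogorov measure of the cylinder set [a_1^k]."""
    prob = pi[word[0]]
    for a, b in zip(word, word[1:]):
        prob *= P[a, b]
    return prob

# the cylinder measures of all words of a fixed length k sum to one
totals = [sum(mu_cyl(w) for w in product((0, 1), repeat=k)) for k in (1, 2, 4)]
```

That the totals equal one for every k is exactly the consistency the representation theorem requires of the family {µ_k}.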

Such a measure µ will be called the Kolmogorov measure of the process, and the expression "let µ be a process" will stand for "let µ be the Kolmogorov measure of some process {X_n}".

Definition 2.1.2 (Stationary process). A process is stationary if the joint distributions do not depend on the choice of the time origin, that is,

Prob(X_i = a_i, m ≤ i ≤ n) = Prob(X_{i+1} = a_i, m ≤ i ≤ n)

for all m, n and a_m^n.

2.1.1

Probability tools

Markov’s inequality and the Borel-Cantelli principle will be used very frequently.

Lemma 2.1.1 (Markov’s Inequality). Let f be a non-negative measurable function on a probability space (X, B, µ). Then for every δ > 0

µ{x ∈ X : f(x) ≥ δ} ≤ (1/δ) ∫_X f(x) dµ.
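A Monte Carlo illustration of Markov's inequality on ([0, 1), Lebesgue) with the assumed choice f(x) = x², so that ∫ f dµ = 1/3 and µ{f ≥ δ} = 1 − √δ:

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.uniform(0.0, 1.0, 1_000_000)
f = x ** 2
mean_f = f.mean()                 # Monte Carlo estimate of the integral of f

checks = []
for delta in (0.1, 0.25, 0.5):
    lhs = np.mean(f >= delta)     # estimate of mu{x : f(x) >= delta}
    checks.append(lhs <= mean_f / delta)
```

The inequality is loose here (e.g. 0.5 against a bound of about 1.33 at δ = 0.25), which is typical: Markov's inequality trades sharpness for complete generality.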

Lemma 2.1.2 (The Borel-Cantelli Principle). If {C_n} is a sequence of measurable sets in a probability space (X, B, µ) such that Σ_n µ(C_n) < ∞, then for almost every x there is an N = N(x) such that x ∉ C_n for all n ≥ N.

In general, a property P is said to be measurable if the set of x for which P(x) is true is a measurable set.

If {P_n} is a sequence of measurable properties, then:

1. P_n(x) holds eventually almost surely if for almost every x there is an N = N(x) such that P_n(x) is true for n ≥ N.

2. P_n(x) holds infinitely often almost surely if for almost every x there is an increasing sequence {n_i} of integers, which may depend on x, such that P_{n_i}(x) is true for i = 1, 2, . . .

Almost sure convergence can be stated in equivalent forms.
