**MASTER'S DEGREE PROGRAMME IN MATHEMATICS**

**MASTER'S THESIS**

**Embedding and Complexity**

**of a Time Series:**

**Theory and Applications**

CANDIDATE:

**Andrea Scarciglia**

SUPERVISOR:

**Prof. Claudio Bonanno**

CO-SUPERVISOR:

**Ing. Gaetano Valenza**

**ACADEMIC YEAR 2019/2020**
**GRADUATION SESSION, 23 OCTOBER 2020**

**Abstract**

In this thesis we introduce an approach to the study of time series via nonlinear dynamical systems. The main assumption we make throughout this work is that the samples of a time series can be associated with the points of a nonlinear ergodic dynamical system through a map, called the measurement map.

We are interested in showing how to classify time series in terms of the predictability of their behavior. Firstly, we give a mathematical framework for some of the quantifiers that will be described later. We face the embedding problem: "almost any" measurement map has to be an embedding, that is, an injective immersion. The embedding requirement is motivated by the fact that it preserves the differential structure and the injectivity needed to compute quantifiers (e.g. Lyapunov exponents). Moreover, "almost every" has to be intended in the sense of prevalence, a generalization of the notion of "probability one" to infinite-dimensional spaces. Since attractors may not have integer dimension, fractal sets are introduced. The main theorem we use to solve the embedding problem is the Fractal Delay Embedding Prevalence Theorem, a slight modification of Whitney's embedding theorems that is more suitable for time series: it states that "almost every" measurement map is an embedding, provided that the samples are taken and organized in a particular manner.

After introducing notions of ergodic theory and discrete-time stochastic processes, we describe the Kolmogorov-Sinai (K-S) entropy as a quantifier of the predictability of an ergodic dynamical system (and consequently of a time series). Since the direct computation of the K-S entropy may be difficult, we approximate it by using the Algorithmic Information Content (AIC) of strings.

We convert time series into strings composed of a finite number of symbols. For each symbolic string, we define its AIC as the length of the shortest binary program that reproduces the string as output, and its complexity as the mean AIC per symbol. The notion of complexity is related to the notion of K-S entropy by Brudno's theorem. It seems reasonable to determine the K-S entropy through the AIC, but this is not possible since the AIC cannot be computed by any algorithm. In order to approximate the AIC, the simple Lempel-Ziv data compression algorithm is introduced: it is a lossless prefix-free code which asymptotically compresses almost every string from an ergodic stochastic process using the least storage possible, in such a way that the expected binary code length per symbol equals the entropy of the process and the original series can be reconstructed from the encoded one: these properties allow us to think of Lempel-Ziv data compression as an approximation of the AIC. All the tools introduced for computing the K-S entropy require limits, and consequently a large number of samples, which is not always available. Other quantifiers of unpredictability are shown: the generalized entropy of order 2 (K_2), the approximate entropy and the sample entropy. They all rely on the geometrical attractor reconstruction and they all estimate the probability of finding similar finite-length patterns in a given time series. K_2 is a lower bound to the K-S entropy, while the other two entropies are seen to work well with a small number of points. Finally, in numerical simulations, we test algorithms for approximate entropy, sample entropy and Lempel-Ziv data compression on circle rotations, logistic maps and physiological data. The aim is to relate the entropies to the behavior of the dynamical system underlying the time series. In the case of physiological data, these quantifiers are shown to distinguish clinical behaviors.

**Contents**

**Introduction** **III**

**1** **State-space Reconstruction** **1**

1.1 Some Definitions . . . 1

1.1.1 Fractal sets . . . 6

1.2 Delay-coordinate Embedding . . . 7

1.2.1 Filtered delay-coordinate map . . . 8

1.3 Proofs . . . 10

1.4 How to estimate the embedding dimension . . . 18

**2** **Entropy of a Time Series** **19**
2.1 Preliminary Concepts and Tools . . . 19

2.1.1 Probability tools . . . 21

2.1.2 Entropy Theorem and Frequency-Typical Sets . . . 24

2.2 Kolmogorov-Sinai Entropy . . . 26

2.3 Symbolic Dynamics and AIC of a Time Series . . . 28

2.4 AIC and Data Compression . . . 34

2.5 Codes . . . 35

2.6 Prefix Codes . . . 36

2.7 Universal Codes . . . 40

2.8 Lempel-Ziv Algorithm . . . 42

2.9 Other Ways to Measure Entropy . . . 51

2.9.1 Approximate Entropy . . . 56

2.9.2 Sample Entropy . . . 57

**3** **Numerical Simulations** **59**
3.1 Chaotic Dynamical Systems and Entropy . . . 59

3.2 Circle Rotation . . . 61

3.3 Logistic map . . . 66

3.4 Physiological Data . . . 70

**Introduction**

In the last thirty years, nonlinear dynamical systems have been intensively studied, not least because they have assumed a fundamental role in modelling natural phenomena.

Indeed, by assuming that there is a nonlinear dynamical system behind observable phenomena, one can try to infer some of its properties from experimental measurements.

Usually, data are collected as a time series, that is, a sequence {Q_t} where the quantity Q_t is sampled τ time units apart, for τ some positive integer.

The aim of this thesis is to show a first approach to the study of time series by describing tools, testing them and giving a mathematical framework.

This thesis is divided into three chapters. The first one is entirely dedicated to the geometrical embedding problem of a time series: as we will see in Chapter 2, the embedding dimension of a time series is the main ingredient in the formulation of efficient algorithms which allow us to classify time series in terms of their predictability.

Let (M, f) denote the discrete dynamical system underlying the dynamics, where M is a k-dimensional manifold (the phase space). We are interested in a subset A ⊂ M, called the attractor, i.e. the set where all the points lie for very long times.

One thinks of the measurement as a smooth observation function h on the state space M. Each Q_t is assumed to be of the form Q_t = h(f^t(x)), that is, the result of evaluating the observation function h at the current state f^t(x). Therefore, we suppose that each sample Q_t of the time series is associated with a point of A by a measurement map F.

In this context, the attractor reconstruction consists in giving sufficient conditions for the following property to hold: "almost every" measurement map F, defined from A into the space where the samples live, has to be an embedding, that is, an injective immersion.

The embedding requirement is important for two reasons: it preserves the differential structure of the attractor (which is useful in the computation of Lyapunov exponents) and injectivity makes predictions more accurate and controls the behavior of the dynamics.

Since attractors may not have an integer dimension, we introduce fractal sets and their geometrical dimension, known as the box-counting dimension. The "almost every" in the property above has to be intended in the sense of prevalence, which generalizes "probability one" to infinite-dimensional spaces.

Simultaneous measurements sampled in Q_t can hardly be assumed independent: in this case delay-coordinate embeddings are the most suitable maps for the attractor reconstruction, namely maps F(h, f, τ) : M → R^n defined as

F(h, f, τ)(x) = (h(x), h(f^{−τ}(x)), h(f^{−2τ}(x)), . . . , h(f^{−(n−1)τ}(x)))

where h is a smooth function.

The most important result in this direction is the *Fractal Delay Embedding Prevalence Theorem* (descending from Whitney's embedding theorems), which states that almost every (in the sense of prevalence) delay-coordinate map F(h, f, τ) is an embedding, provided that the dimension n of the image (chosen large enough) and the delay τ guarantee the non-existence of orbits of period τ and 2τ and the existence of only a finite number of periodic orbits of f for each period 3τ, . . . , nτ.

This result is actually proved in a generalized version (the *Fractal Filtered Delay Embedding Prevalence Theorem*) which uses filters (i.e. special matrices) to modify the data: for example, we use filters if we are interested in an arithmetic or weighted average of the data.

After showing the proofs of the stated theorems, in the final section we face the problem of choosing the best embedding dimension of the attractor. The only requirement that we maintain is that dim F(A) = dim A, where F is a delay-coordinate map. Since all the theorems only provide sufficient conditions, there is no mathematical justification for the best embedding dimension. In practice, the attractor dimension is computed for increasing values of n until a plateau is reached; this plateau is considered to be the embedding dimension.

In the second chapter we classify time series through quantifiers which take into account the amount of "order". "Order" here means a higher possibility of finding similar patterns in a time series, or of predicting its behavior.

We add the property that time series are outputs of an ergodic dynamical system. We list some facts and results about ergodic theory and stationary discrete-time stochastic processes whose random variables are defined on a finite set A, the alphabet. An important result is the *Entropy Theorem* (or Shannon-McMillan-Breiman Theorem), which states the existence, for an ergodic process, of a positive number (the entropy) representing the exponential decay coefficient of the measure of the "typical sets" of the process, and which guarantees that almost every infinite sequence, output of an ergodic stationary discrete-time stochastic process, is contained in a set of measure one (the entropy-typical set).

For an ergodic dynamical system (X, µ, f), where X is assumed to be a compact metric space, we consider a finite measurable partition of X of the form Z = {I_1, . . . , I_N}, where the sets I_i are pairwise disjoint and measurable and the measure of the complement of their union is 0. We then introduce the Kolmogorov-Sinai entropy h_µ(f) as the supremum of the entropies of a partition among all finite measurable partitions. If we assume that each set I_i carries a certain amount of information (a quantity which allows us to identify each element of the partition), the Kolmogorov-Sinai entropy is the highest average information needed to identify a sequence of sets of a finite partition. Actually, generating partitions allow one to compute the K-S entropy without taking the supremum among all finite partitions, but these are not always available in analytic form. We note that the entropy of a partition is exactly the same quantity as the one described in the Entropy Theorem.

In order to compute the K-S entropy we develop another approach. First of all, we convert our time series into a symbolic sequence over a finite alphabet A through symbolic dynamics. Defining the Algorithmic Information Content (AIC) of a finite string s as the length of the shortest binary program that reproduces s, we define the complexity of an infinite string ω as the mean length of a binary program needed to identify an element of the string, in symbols

K(ω) = lim sup_{n→∞} AIC(ω_n) / n

where ω_n denotes the first n symbols of ω.

How is the complexity related to the entropy of a partition? The answer is given by Brudno's theorem, which asserts that the complexity and the entropy of a partition are equal if the dynamical system is ergodic: as a corollary, we can exploit the AIC in the computation of the K-S entropy. Unfortunately, the AIC cannot be computed by any algorithm; nothing is lost, however, because it can be approximated by *lossless data compression algorithms*, i.e. code maps which encode finite sequences of symbols into binary strings using the least storage possible and allow the original string to be reconstructed from the encoded one.
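To make this concrete, the following minimal Python sketch (our own illustration, not from the thesis) estimates the per-symbol complexity of a symbolic sequence by lossless compression; the standard-library DEFLATE compressor `zlib` stands in here for the simple Lempel-Ziv algorithm discussed later:

```python
import random
import zlib

def compression_entropy_rate(symbols):
    """Estimate the complexity of a symbolic sequence, in bits per
    symbol, by the length of its losslessly compressed encoding: a
    computable stand-in for the (uncomputable) AIC per symbol."""
    data = "".join(map(str, symbols)).encode("ascii")
    return 8.0 * len(zlib.compress(data, 9)) / len(data)

random.seed(0)
constant = [0] * 10_000                                # perfectly predictable
coin = [random.randint(0, 1) for _ in range(10_000)]   # i.i.d. fair coin flips

# A predictable sequence compresses far better than a random one.
print(compression_entropy_rate(constant) < compression_entropy_rate(coin))  # True
```

As in the theory, the estimate is only asymptotic: for short sequences the compressor's fixed overhead inflates the per-symbol rate.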

We continue the discussion by giving the definitions of codes and prefix-free codes and their main properties, especially their relation with the entropy of a process: for any ergodic stationary discrete-time stochastic process with measure µ and entropy h_µ it is possible to construct a prefix-free code (such as the Shannon codes) so that the expected length per encoded word is asymptotically equal to the entropy h_µ; moreover, no prefix-free code can achieve an expected length per encoded word asymptotically smaller than the entropy of the process.
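A dyadic example makes both statements concrete. The sketch below (an illustration of the standard Shannon code construction, not code from the thesis) assigns codeword lengths l_i = ⌈−log_2 p_i⌉, checks the Kraft inequality, and verifies that the expected length lies between the entropy and the entropy plus one bit:

```python
import math

def shannon_code_lengths(probs):
    """Shannon code length assignment: l_i = ceil(-log2 p_i). These
    lengths satisfy the Kraft inequality, so a prefix-free code with
    exactly these codeword lengths exists."""
    return [math.ceil(-math.log2(p)) for p in probs]

probs = [0.5, 0.25, 0.125, 0.125]          # a dyadic distribution
lengths = shannon_code_lengths(probs)
kraft = sum(2.0 ** -l for l in lengths)
entropy = -sum(p * math.log2(p) for p in probs)
expected = sum(p * l for p, l in zip(probs, lengths))
print(lengths, kraft <= 1.0, entropy <= expected < entropy + 1.0)
# [1, 2, 3, 3] True True
```

For a dyadic distribution the expected length equals the entropy exactly; in general it exceeds it by less than one bit per word.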

The measure of a process is hardly available from a given time series, therefore it is impossible to construct the previous prefix-free codes. We thus face the problem of the *universality* of codes: is there a prefix-free code such that the expected length per encoded word is asymptotically equal to the entropy for every ergodic process with entropy h? Is there a prefix-free code whose expected length per encoded word asymptotically beats the entropy on a set of positive measure?

The answer to the second question is negative, whereas we answer the first question positively by introducing the lossless simple Lempel-Ziv (LZ) data compression algorithm. A brief description of the algorithm and some examples are shown, together with results concerning entropy.
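The core of the algorithm is an incremental parsing of the input into distinct phrases. A sketch of an LZ78-style parsing follows (a simplified illustration; the variant used in the thesis may differ in details):

```python
def lz78_phrases(s):
    """Incrementally parse s into distinct phrases: each new phrase is
    the shortest prefix of the remaining input not yet in the dictionary."""
    phrases, current, out = set(), "", []
    for symbol in s:
        current += symbol
        if current not in phrases:
            phrases.add(current)
            out.append(current)
            current = ""
    if current:  # a final, possibly repeated, leftover phrase
        out.append(current)
    return out

print(lz78_phrases("aababbbaaaba"))  # ['a', 'ab', 'abb', 'b', 'aa', 'aba']
```

The number of phrases c(n) produced on a string of length n yields the estimate c(n) log_2 c(n) / n bits per symbol, which converges to the entropy for almost every output of an ergodic source.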

The Lempel-Ziv algorithm is a powerful tool for approximating the AIC, and consequently for computing the K-S entropy of a time series. However, this approach may not be very useful in practice, because the entropy is reached only in a limit process, and a time series may consist of a small number of samples, preventing the entropy value from being reached. For this reason we introduce other ways to measure entropy with short time series:

the *generalized entropy of order 2* (K_2), the *approximate entropy* and the *sample entropy*.

All three quantifiers rely on the geometrical attractor reconstruction of Chapter 1: indeed, we regroup the raw data of the time series into m-dimensional vectors, where m is supposed to be the embedding dimension of the attractor of the underlying dynamical system (even if m will be obtained empirically). The idea behind the quantifiers can be described as follows: we divide the phase space into a finite number of balls of radius ε and count the number of times that two m-dimensional vectors lie in the same ball; in other words, we search for the number of trajectories of length m sufficiently close to each other. Starting from this idea, the correlation sum and the correlation integral provide the empirical probability that any two trajectories of length m are ε-close.
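As an illustration, the correlation sum can be computed directly from the delay vectors (a minimal sketch using the max norm; the function and its parameters are our own naming, not the thesis's notation):

```python
import numpy as np

def correlation_sum(series, m, eps, tau=1):
    """Empirical probability that two m-dimensional delay vectors built
    from the series are eps-close in the max norm (no self-matches)."""
    series = np.asarray(series, dtype=float)
    n = len(series) - (m - 1) * tau
    vecs = np.array([series[i:i + (m - 1) * tau + 1:tau] for i in range(n)])
    count = 0
    for i in range(n - 1):
        dists = np.max(np.abs(vecs[i + 1:] - vecs[i]), axis=1)
        count += np.sum(dists < eps)
    return 2.0 * count / (n * (n - 1))

# Sanity check: in a constant series every pair of vectors is eps-close.
print(correlation_sum([1.0] * 20, 2, 0.5))  # 1.0
```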

We define the quantifiers from the correlation sum and the correlation integral. Their main properties are shown: they are easily computed from a time series; K_2 provides a lower bound to the K-S entropy; approximate entropy takes self-matches into account (the two trajectories can coincide), while sample entropy does not; approximate and sample entropy seem to work well even if few points are available (which is not true for K_2). In addition, it proves more convenient to compute the approximate and sample entropy of a time series with fixed embedding dimension m and radius ε, instead of varying these values until a plateau is reached: in this way, comparisons between different time series with the same number of points are more meaningful.
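A minimal sketch of the sample-entropy computation (using the common convention that N − m templates are compared for both pattern lengths; the names are our own):

```python
import numpy as np

def sample_entropy(series, m=2, eps=0.2):
    """Sample entropy: -log of the conditional probability that two
    length-m patterns within tolerance eps (max norm) remain within
    eps when extended to length m + 1. Self-matches are excluded."""
    x = np.asarray(series, dtype=float)

    def pair_count(dim):
        templ = np.array([x[i:i + dim] for i in range(len(x) - m)])
        count = 0
        for i in range(len(templ) - 1):
            dists = np.max(np.abs(templ[i + 1:] - templ[i]), axis=1)
            count += np.sum(dists < eps)
        return count

    b, a = pair_count(m), pair_count(m + 1)
    return float(np.log(b / a)) if a > 0 and b > 0 else float("inf")

# A perfectly periodic series is fully predictable: sample entropy 0.
print(sample_entropy([0, 1] * 50))  # 0.0
```

On a chaotic or noisy series the matches at length m + 1 are rarer than at length m, so the estimate becomes strictly positive.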

In the third and last chapter, we use numerical simulations to analyze the entropy quantifiers (K-S entropy with LZ compression, approximate and sample entropy) that

we have introduced in Chapter 2 and relate them to the behavior of the underlying dynamical system.

A brief description of chaotic systems (in the sense of Devaney) is given: a characterization of chaotic dynamical systems is provided through the K-S entropy and the *Lyapunov exponents*. In particular, dynamical systems with lower K-S entropy turn out to be more easily predictable.

We test the quantifiers on three kinds of time series: circle rotations, logistic maps and physiological data. Recall that for LZ compression we first convert the time series into a symbolic one.

In the first two cases the theoretical value of the metrical entropy is known: it is almost achieved by LZ compression, even though the number of samples has to be very large; lower values of sample and approximate entropy are associated with low complexity (low metrical entropy), while higher values indicate more unpredictability. In the third case, we are given a set of physiological time series representing the interbeat intervals of two groups of people of the same size, old and young. Here, without knowing the theoretical metrical entropy of the time series, our goal is to distinguish clinically the mean heart behavior of old and young people with the three quantifiers above: old people show a lower value of entropy than young people. All the quantifiers distinguish the two cases, but the most marked distinction is achieved by the sample entropy, which works better with a small number of samples.

**Chapter 1**

**State-space Reconstruction**

In this chapter we show which conditions allow us to reconstruct an attractor from a collection of experimental data, in particular time series.

Following mainly Sauer et al. [18], the fundamental fact we are assuming is that the data we want to analyze are the outputs of a nonlinear dynamical system.

The attractor reconstruction turns out to be an important step towards the prediction of non-linear time series and in the computation of the invariant quantities used to characterize the dynamics of such series.

We will state Whitney's theorems and similar ones, which ensure that "almost every" map from the attractor into the measurement space is an embedding. For this purpose fractal sets will be defined and prevalent sets will be used.

Moreover, the theory shown in this chapter will be the starting point for some algorithms of the next chapter.

The proofs of the theorems used are left for the last part of the chapter.

**1.1** **Some Definitions**

Before giving some results, we formalize some concepts.

**Definition 1.1.1** **(Dynamical system).** *Given a set M, an additive monoid T and a function Φ : V ⊆ (T × M) → M, we define a dynamical system as a tuple (M, T, Φ) such that, setting I(x) = {t ∈ T | (t, x) ∈ V }:*

*• Φ(0, x) = x, ∀x ∈ M;*

*• Φ(t_2, Φ(t_1, x)) = Φ(t_1 + t_2, x), ∀x ∈ M, t_1, t_1 + t_2 ∈ I(x), t_2 ∈ I(Φ(t_1, x)).*

*The function Φ(t, x) is called the evolution function of the dynamical system: it associates a unique image to each point of V depending on the parameter t. The space M is called the phase space.*

Furthermore, if we fix x, the function Φ(t, x) ≡ Φ_x(t) : I(x) → M is known as the *flow through x*: it describes the evolution of the system with initial condition x as t varies in I(x); the graph of a flow is called a *trajectory* (through x).

Under the same assumptions, a dynamical system is *continuous* when the set T is an open interval in R; it is *discrete* if T is the set of the integers.

Since time series are samples taken from a continuous measurable function, we will consider only discrete dynamical systems: in particular, the function Φ is given by

Φ(t, x) = f^t(x)

where f : M → M is a C^1 transformation and f^t(x) denotes the t-fold function composition (f ◦ f ◦ · · · ◦ f)(x).

**Remark 1.1.1.** In the definition above, we do not require any hypothesis on M. For our purposes, anyway, we will assume M to be the manifold R^k, for k some positive integer, with a distance d defined on M.

We are interested in the study of the attractor of a dynamical system, as defined below.

*We are interested in the study of the attractor of a dynamical system as defined*
below.

**Definition 1.1.2** **(Attractor set).** *We define an attractor as a closed subset A ⊂ M having a neighborhood U (called the fundamental neighborhood) such that the following conditions are satisfied:*

*• for every neighborhood V of A, we have f^t(U) ⊂ V when t is large enough;*

*• f^t(A) = A for every t;*

*• irreducibility: no proper subset of A satisfies the first two conditions.*

*The set W = ∪_t f^{−t}(U) is called the basin of attraction of A: it consists of all points x ∈ M such that f^t(x) approaches A as t → ∞, and therefore W is independent of the choice of U.*

In other words, the attractor is the set where the dynamics lies for long times.

For this reason, it can be convenient to use the following operational and intuitive definition.

**Definition 1.1.3** **(Attractor: operational definition).** *Given a dynamical system, we call attractor the subset of the phase space on which the experimental points f^t(x) accumulate for large times t.*

Attractors may not have an integer dimension. Attractors with non-integer dimension are called *strange attractors*.

**Remark 1.1.2.** There is not a unique definition of attractor: we have chosen the one most suitable for our discussion. Indeed, attractors may alternatively be defined as in [17], since it has been shown that attractors may fail to have a basin of attraction.

Suppose that a hidden dynamical system is defined in R^k, for k a positive integer, and that all trajectories are asymptotic to an attractor A. If we know how the system behaves over long times, then we can get some information about A.

Unfortunately, in the situation we consider, we do not know the dynamical system directly but only some observations, which we regard as outputs of the dynamical system. In particular, we may assume that we are able to collect n independent data simultaneously, namely measurements Q_1, . . . , Q_n, in the sense that no Q_i is a function of any Q_j for i ≠ j.

We can then associate the states Q_1, . . . , Q_n with a point of the phase space, which is assumed to be a subset of R^k. We have, indeed, the following

**Definition 1.1.4.** *A function F : R^k → R^n which maps states to measurements,*

F(state) = (Q_1, . . . , Q_n),

*is called a measurement function.*

Among all the measurements, we are interested in the time-dependent ones. These are the most available in practice.

**Definition 1.1.5** **(Time series).** *We define a time series {Q_t} as the list of samples of a measurable quantity sampled at intervals τ time units apart.*

We can see a time series as the image of an observable function h defined on our dynamical system (M, T, Φ): each Q_t = h(x_t), t ∈ T, x_t ∈ M, is the result of evaluating the observation function h at the state x_t.

Attractor reconstruction is a necessary step towards the study of time series. Indeed, as we will see later, time series from a dynamical system can be characterized in terms of predictability, that is, how easy it is to predict the behavior of the time series (and consequently the behavior of the dynamical system), and in terms of "sensitive dependence on initial conditions" (these two issues will be discussed in the last chapter). Predictability is quantified by the *entropy*, while sensitive dependence on initial conditions by the *Lyapunov exponents*. Some of the algorithms that provide entropy values for a time series are based on the reconstruction of the attractor.

This reconstruction is not necessarily the actual one: for example, it may happen that the time series is produced by an observable function defined on a 2-dimensional attractor, while the reconstruction provides a higher value for the dimension of the attractor.

In fact, the reconstruction is done in such a way that any measurement function F we take from the attractor to the measurement space is still useful to compute entropy and Lyapunov exponents.

To be more precise, the quantifiers above are easy to compute if the measurement
*maps are assumed to be embeddings. Here we give the definition.*

**Definition 1.1.6** **(Immersion).** *Let M be a compact smooth (at least C^1) differentiable manifold of R^k. A smooth map (at least C^1) F defined on M is an immersion if the derivative map DF(x) (represented by the Jacobian matrix of F at x) is one-to-one at every point x of M.*

**Remark 1.1.3.** Since DF(x) is a linear map, F is an immersion if DF(x) has full rank on the tangent space. In particular, no differential structure is lost under an immersion.

**Definition 1.1.7** **(Embedding).** *Let S be a set in R^k. A function F : S → R^n is called an embedding if it is an injective immersion.*

**Remark 1.1.4.** If S is a compact manifold, then F is a diffeomorphism onto its image whenever F is a one-to-one immersion.

So, the attractor reconstruction requires that every measurement function F be an embedding, because it is important to preserve the differential structure and injectivity in order to compute entropy or Lyapunov exponents.

Clearly, whether or not a map from A to R^n is an embedding on A depends on the structure of the set A. To this end, we are helped by the following

**Theorem 1.1.1** **(Whitney General Embedding Theorem).** *Suppose A is a smooth manifold of dimension d; then the set of maps which are embeddings of A into R^{2d+1} is open and dense in the C^1 topology of maps.*

We want to highlight two important aspects: given an embedding F, all small perturbations of F are embeddings (the set of embedding maps is open); furthermore, any map from A to R^{2d+1}, whether it is an embedding or not, is arbitrarily close to an embedding (the set of embedding maps is dense).

It seems reasonable to conclude from Whitney's theorem that n = 2d + 1 simultaneous measurements are sufficient to reconstruct a d-dimensional state manifold in the measurement space R^n; however, open and dense subsets can be very small in terms of probability. One would like to know whether a particular measurement map is an embedding *with probability one*, that is, to be almost sure that our data are given by an embedding map: in the next section we define an equivalent notion of probability one in infinite-dimensional spaces and give an embedding theorem more suitable for our purpose.

**Embedding Theorems in the Sense of Prevalence**

We begin by giving the following

**Definition 1.1.8** **(Prevalent set).** *A Borel subset S of a normed linear space V is prevalent if there is a finite-dimensional subspace E ⊂ V such that for every v ∈ V, the sum v + e belongs to S for almost every e ∈ E, with respect to the Lebesgue measure supported on E.*

*The meaning of prevalence for S is that if we start at any point in V and explore along the finite-dimensional space E, then almost every point encountered will lie in S. For this reason we refer to E as the probe space.*

In this chapter, where there is no possibility of confusion, we say that "almost every" map satisfies a certain property when the set of such maps is prevalent. Here are some properties of prevalent sets:

• If E and E′ are subspaces of a normed linear space V with E ⊂ E′, and E is a probe space, then E′ is also a probe space.

• A prevalent subset of a finite-dimensional vector space is simply a set whose complement has zero Lebesgue measure.

• The union or intersection of a finite number of prevalent sets is again prevalent.

• In any normed linear space, a prevalent set is always dense: in fact, let S be a prevalent set with probe space E and let c ∈ V. For almost every e ∈ E we have c + e ∈ S; since the set of such e has full Lebesgue measure in E, for every ε > 0 it contains an element e with ‖e‖ < ε, so that the point c + e of S satisfies ‖c − (c + e)‖ < ε. Hence S is dense.

In particular, a prevalent subset of a function space is dense in the C^k topology for any k.

If a condition holds for a prevalent set of functions, it makes sense to determine the
*smallest, or most efficient, probe space E. This corresponds to the minimal amount of*
perturbation that must be available to the system in order for the condition to hold
with virtual certainty. We give an example to clarify the concept of prevalence.
**Example 1.1.1.** Consider the convergent Fourier series in one variable, which form an infinite-dimensional vector space with basis {e^{inx} : n ∈ Z}. In the sense of prevalence, almost every Fourier series has nonzero integral on [0, 2π]. The probe space E in this case is the one-dimensional space of constant functions. Indeed, if E′ is a vector space of Fourier series which contains the constant functions, then for every Fourier series f, the integral of f + e is nonzero for almost every e ∈ E′.

**Theorem 1.1.2** **(Whitney Embedding Prevalence Theorem).** *Let A be a d-dimensional compact manifold contained in R^k; then almost every map from R^k to R^{2d+1} is an embedding of A.*

**Remark 1.1.5.** The probe space E is the k(2d + 1)-dimensional space of linear maps from R^k to R^{2d+1}.

Furthermore, any given smooth map F is arbitrarily near an embedding *with probability one* (in the sense of prevalence).

**Remark 1.1.6.** Under the same hypotheses, Whitney showed that there exists an embedding map from R^k to R^{2d}. However, this theorem is useless for us because we do not know whether this map, under small perturbations, is still an embedding.

**1.1.1** **Fractal sets**

Sometimes the attractor A is not a manifold: this prevents us from using Whitney's prevalence theorems. Luckily, we can also deal with such sets.

**Definition 1.1.9.** *Given a compact set A ⊂ R^n and a small positive number ε, define A_ε = {x ∈ R^n | ‖x − a‖ ≤ ε for some a ∈ A} and let Vol(A_ε) be the n-dimensional outer volume of A_ε. We define the box-counting dimension of A as*

boxdim(A) = n − lim_{ε→0} log Vol(A_ε) / log ε

*if the limit exists. If not, the lower (respectively, upper) box-counting dimension can be defined by replacing the limit with lim inf (resp. lim sup).*

If the box-counting limit exists, we can assume Vol(A_ε) ≈ ε^{n−d}, where d = boxdim(A).

An equivalent definition of the box-counting dimension is obtained by dividing R^n into ε-cubes with a grid based at points whose coordinates are integer multiples of ε, and considering

boxdim(A) = − lim_{ε→0} log N(ε) / log ε

if the limit exists, where N(ε) is the number of boxes that intersect A. Upper and lower boxdim are defined by replacing lim with lim sup / lim inf. In this case, N(ε) ≈ ε^{−d}, where d = boxdim(A).

It is easy to check that the two definitions are equivalent in a Euclidean space: finite coverings and the relations between balls and n-cubes in R^n are used. The first definition of the box-counting dimension is suitable for any metric space, while the second one applies only to Euclidean spaces. From now on, we will use the definition based on the ε-grid, since the attractors are assumed to lie in some R^k.
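The ε-grid definition translates directly into a numerical estimate for a point cloud sampled from A: count the occupied ε-boxes for several values of ε and fit the slope of log N(ε) against log(1/ε). A sketch (our own helper, not from the thesis):

```python
import numpy as np

def boxcount_dimension(points, epsilons):
    """Estimate boxdim(A) from a point cloud sampled on A: count the
    eps-boxes of the grid hit by the points, then fit the slope of
    log N(eps) versus log(1/eps)."""
    points = np.asarray(points, dtype=float)
    counts = [len(set(map(tuple, np.floor(points / eps).astype(int))))
              for eps in epsilons]
    eps = np.asarray(epsilons, dtype=float)
    slope, _ = np.polyfit(np.log(1.0 / eps), np.log(counts), 1)
    return slope

# A straight segment in the plane should have dimension close to 1.
t = np.linspace(0.0, 1.0, 5000)
segment = np.column_stack([t, t])
print(round(boxcount_dimension(segment, [0.1, 0.05, 0.02, 0.01, 0.005]), 2))
```

In practice the fitted slope only approximates the limit, since ε cannot be taken smaller than the sampling resolution of the data.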

The next theorem helps us to deal with fractal sets.

**Theorem 1.1.3** **(Fractal Whitney Embedding Prevalence Theorem).** *Let A be a compact subset of R^k of box-counting dimension d and let n be an integer greater than 2d. For almost every smooth map F : R^k → R^n:*

*1. F is one-to-one on A;*

*2. F is an immersion on each compact subset C of a smooth manifold contained in A.*

**1.2** **Delay-coordinate Embedding**

Whitney’s embedding theorems are used if we have a large number of independent measurements. But, it is quite hard to collect a large number of independent quantities or to establish whether two different simultaneous measurements are indeed independent.

In general, the most common kind of data available is the time series, since it is easy to take measurements depending only on time. For this reason, we should use methods in which only one measurable quantity is needed. We can face this problem by using *delay coordinates*.

**Definition 1.2.1 (Delay-coordinate map of a flow).** *If Φ is a flow on a manifold M, τ a positive number (the delay), and h : M → R a smooth function, we define the delay-coordinate map of the flow F(h, Φ, τ) : M → R^n as:*

*F(h, Φ, τ)(x) = (h(x), h(Φ_{−τ}(x)), h(Φ_{−2τ}(x)), . . . , h(Φ_{−(n−1)τ}(x)))*
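On sampled data the delay-coordinate map simply stacks time-shifted readings of a single observable. A minimal sketch (the function name and the sine signal are illustrative assumptions of ours):

```python
import math

def delay_vectors(series, n, lag):
    """n-dimensional delay vectors (s[t], s[t-lag], ..., s[t-(n-1)*lag]),
    the discrete analogue of F(h, Phi, tau)(x)."""
    start = (n - 1) * lag
    return [tuple(series[t - i * lag] for i in range(n))
            for t in range(start, len(series))]

# Example: scalar readings h(Phi_t(x)) = sin(0.1 t), delay tau = 7 samples.
series = [math.sin(0.1 * t) for t in range(1000)]
vectors = delay_vectors(series, n=3, lag=7)
# Each vector plays the role of (h(x), h(Phi_{-tau}(x)), h(Phi_{-2tau}(x))).
```

Note that the first reconstructed state is only available once (n − 1)·lag past samples exist.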

By adding some hypotheses, we can state an embedding theorem for delay-coordinate maps.

**Theorem 1.2.1 (Fractal Delay Embedding Prevalence Theorem).** *Let Φ be a flow on an open subset U of R^k and let A be a compact subset of U with box-counting dimension d. Let n > 2d be an integer and τ a strictly positive delay. Assume A contains at most a finite number of equilibria, no periodic orbits of Φ of period τ or 2τ, at most finitely many periodic orbits of period 3τ, 4τ, . . . , nτ, and that the linearizations of those periodic orbits have distinct eigenvalues. It follows that for almost every smooth function h on U, the delay-coordinate map F(h, Φ, τ) : U → R^n is:*

*1. One-to-one on A.*

*2. An immersion on each compact subset C of a smooth manifold contained in A.*

In general it is not always clear how to choose a suitable delay τ satisfying the hypotheses of the previous theorem: some methods will be discussed later. However, if we assume that the flow Φ derives from a Lipschitz vector field ẋ = V(x), with ‖V(x) − V(y)‖ ≤ L‖x − y‖ for x, y ∈ A, it is sufficient to take τ smaller than π/L: this result is discussed in [20]. The author showed that in this case each periodic orbit has period at least 2π/L. We want to stress that even if we assume the vector field to be Lipschitz, the value of the constant L is not known a priori.

**1.2.1**

**Filtered delay-coordinate map**

From the previous theorems, in the reconstruction procedure we exploit information
*from the previous n time steps to identify a state of the original dynamical system in*
R*k*.

For purposes of measuring quantitative invariants of the dynamical system, noise reduction, or prediction, it can be advantageous to create an embedding that identifies a state with information from a larger number of previous time steps. However, working in R^n is difficult for large n. A way around this problem is to incorporate a large number of previous data readings by "averaging" their contributions in some sense. To this end, we generalize the concept of delay-coordinate map.

**Definition 1.2.2 (Filtered delay-coordinate map).** *Let U be an open subset of R^k, let g : U → U be a smooth diffeomorphism, and let h : U → R be a function. Let w− < w+ be integers and let w = w+ − w− + 1. For 1 ≤ i ≤ w, set g_i = g^{w−+i−1}, so that g_1 = g^{w−} and g_w = g^{w+}. Let B be an n × w matrix. We define the filtered delay-coordinate map F_{w−}^{w+}(B, h, g) : U → R^n by*

*F_{w−}^{w+}(B, h, g)(x) = B (h(g_1(x)), h(g_2(x)), . . . , h(g_w(x)))^T = B (h(g^{w−}(x)), . . . , h(g^{w+}(x)))^T*
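Concretely, the filtered map is just the matrix B applied to a window of w consecutive readings; choosing B with averaging rows implements the "averaging" of past readings mentioned above. A small sketch with hypothetical numbers:

```python
def filtered_delay_map(B, window):
    """Apply the n x w filter matrix B to the window (h(g_1(x)), ..., h(g_w(x)))."""
    return [sum(b * r for b, r in zip(row, window)) for row in B]

# w = 4 readings compressed into n = 2 coordinates by a moving-average filter.
window = [1.0, 2.0, 3.0, 4.0]        # h(g_1(x)), ..., h(g_4(x))
B = [[0.5, 0.5, 0.0, 0.0],           # average of the two oldest readings
     [0.0, 0.0, 0.5, 0.5]]           # average of the two newest readings
coords = filtered_delay_map(B, window)   # -> [1.5, 3.5]
```

With B equal to the n × n identity (w = n) this reduces to the plain delay-coordinate map.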

In this section and in the following, g denotes a smooth diffeomorphism on an open neighbourhood U ⊂ R^k, and h_1, . . . , h_t form a basis for the polynomials in k variables of degree at most 2w. For a smooth function h_0 on R^k and for α ∈ R^t, define h_α = h_0 + Σ_{i=1}^t α_i h_i. For each positive integer p denote by A_p = {x ∈ A : g^p(x) = x} the set of periodic points of period p.

Let C_{pq} be a p × (p − (p, q)) matrix of the form

*C_{pq} = ( I_{p−(p,q)} over −I_{(p,q)} · · · I_{(p,q)} )*

where (·, ·) denotes the g.c.d. (we assume that (p, 0) = 0) and p > q ≥ 0. Finally, let C_{pq}^∞ be the ∞ × (p − (p, q)) matrix formed by repeating the block C_{pq} vertically, and, for a positive integer w, define C_{pq}^w to be the matrix formed by the top w rows of C_{pq}^∞.

The following two theorems hold.

**Theorem 1.2.2.** *Let g be a smooth diffeomorphism on an open neighbourhood U of R^k, and let A be a compact subset of U with boxdim(A) = d. Let n and w− < w+ be integers such that n ≤ w = w+ − w− + 1. Assume that the n × w matrix B satisfies:*

**A1.** *rank BC_{p0}^w > 2 · boxdim(A_p) for all 1 ≤ p ≤ w.*

**A2.** *rank BC_{pq}^w > boxdim(A_p) for all 1 ≤ q < p ≤ w.*

*Let h_1, . . . , h_t be as before. Then, for any smooth function h_0 on R^k and for almost every α ∈ R^t, it follows that if n > 2d, then F(B, h_α, g) : U → R^n is one-to-one on A.*

**Theorem 1.2.3.** *Let g be a smooth diffeomorphism on an open neighbourhood U of R^k, and let A be a compact subset of a smooth m-manifold in U. Assume that the linearizations of the periodic orbits of period less than w have distinct eigenvalues. Let n ≤ w be positive integers as in the previous theorem, and assume that the n × w matrix B satisfies:*

**A3.** *rank BD_p^w(λ_{i_1}, . . . , λ_{i_r}) > boxdim(A_{p+r−1}) for all 1 ≤ p < w, and for all subsets λ_{i_1}, . . . , λ_{i_r} of eigenvalues of the linearization Dg^p at a point in A_p.*

*Let h_1, . . . , h_t be a basis for the polynomials in k variables of degree at most 2w. Then, for any smooth function h_0 on R^k and for almost every α ∈ R^t, it follows that if n ≥ 2m, then F(B, h_α, g) : U → R^n is an immersion on A.*

As a corollary of the two previous theorems, two particular properties hold for a
*diffeomorphism g.*

**Theorem 1.2.4.** *Let g be a diffeomorphism on an open subset U of R^k, let A be a compact subset of U with boxdim(A) = d, and let n > 2d be an integer. Assume that for every positive integer p ≤ n, the set A_p of periodic points of period p satisfies boxdim(A_p) < p/2, and that the linearization Dg^p at each of these orbits has distinct eigenvalues. Then for almost every smooth function h on U, the delay-coordinate map F(h, g) : U → R^n is:*

• One-to-one on A.

• An immersion on each compact subset C of a smooth manifold contained in A.

**Theorem 1.2.5 (Fractal Filtered Delay Embedding Prevalence Theorem).** *Let g be a diffeomorphism on an open subset U of R^k, and let A be a compact subset of U with boxdim(A) = d. For a positive integer n > 2d, let B be an n × w matrix of rank n. Assume g has no periodic points of period less than or equal to w. Then for almost every smooth function h, the delay-coordinate map F(B, h, g) : U → R^n is:*

• One-to-one on A.

• An immersion on each closed subset C of a smooth manifold contained in A.

**1.3**

**Proofs**

We are going to prove the theorems we have stated before. All the following results are from [18].

**Lemma 1.3.1.** *Let n and k be positive integers, y_1, . . . , y_n distinct points in R^k, u_1, . . . , u_n points in R and v_1, . . . , v_n points in R^k. It follows that:*

*1. There exists a polynomial h in k variables of degree at most n − 1 such that h(y_i) = u_i for i = 1, . . . , n.*

*2. There exists a polynomial h in k variables of degree at most n such that ∇h(y_i) = v_i for i = 1, . . . , n.*

*Proof.* 1. By a linear change of coordinates we may assume that the first coordinates of y_1, . . . , y_n are distinct; applying ordinary one-dimensional polynomial interpolation to these first coordinates gives a polynomial of degree n − 1 (actually in one variable) of the type desired.

2. If k = 1, we can find a polynomial of degree at most n − 1 that interpolates the data; by taking an antiderivative of such a polynomial, the conclusion follows. If k > 1 one can assume, again by a linear change of coordinates, that for every j ∈ {1, . . . , k} the j-th coordinates of y_1, . . . , y_n are distinct; then, as in the case k = 1, one can find for each j a polynomial h_j = h(x_j) of degree at most n whose derivative h_{x_j} interpolates the j-th coordinates of the v_i, i ∈ {1, . . . , n}. The sum Σ_{j=1}^k h_j(x) gives the polynomial which satisfies 2.

An important tool in differential geometry is the singular value decomposition: a matrix M ∈ C^{t×n} can be written as V SU*, where

• V is a t × t unitary matrix;

• U is an n × n unitary matrix;

• S is a t × n diagonal matrix whose diagonal elements σ_i := σ_{i,i} satisfy σ_1 ≥ σ_2 ≥ · · · ≥ σ_r ≥ 0, r ≤ min{t, n}, and all other entries of S are zero.

Then V SU* is said to be a singular value decomposition of M. The elements σ_i are called *singular values*. A theorem, which we do not prove, states that such a decomposition always exists but is not unique; moreover, rank(M) = rank(S).
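The statement rank(M) = rank(S) can be checked numerically: the rank of M equals the number of nonzero singular values. A sketch assuming NumPy is available (`np.linalg.svd` returns the three factors of the decomposition, with the singular values sorted in decreasing order):

```python
import numpy as np

# A 3 x 4 real matrix of rank 2: the third row is the sum of the first two.
M = np.array([[1.0, 0.0, 2.0, 0.0],
              [0.0, 1.0, 0.0, 3.0],
              [1.0, 1.0, 2.0, 3.0]])

V, sigma, U_star = np.linalg.svd(M)   # M = V S U*, sigma holds the diagonal of S
# Singular values come sorted: sigma_1 >= sigma_2 >= ... >= 0.
rank = int(np.sum(sigma > 1e-10))     # rank(M) = number of nonzero singular values
```

Here the third singular value is zero up to floating-point error, so the computed rank is 2.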

We now have the ingredients to prove the following

**Lemma 1.3.2.** *Let F(x) = Mx + b be a map from R^t to R^n, where M is an n × t matrix and b ∈ R^n. For a positive integer r, let σ > 0 be the r-th largest singular value of M. Then*

*Vol(B_ρ ∩ F^{−1}(B_δ)) / Vol(B_ρ) < 2^{t/2} (δ/ρσ)^r*

*where B_ρ denotes the ball of radius ρ centered at the origin in R^t and B_δ is the n-dimensional ball of radius δ centered at the origin.*

*Proof.* We may assume σ_1 = · · · = σ_r = σ and all other singular values equal to 0, since decreasing any singular value of M does not decrease the left-hand side. Let V SU^T be a singular value decomposition of M. One can see that MB_ρ = V SU^T B_ρ is an r-dimensional ball of radius ρσ: the orthogonal matrices V and U^T preserve dimensions and radii, while S "deletes" all but r coordinates and multiplies the radius of B_ρ by σ (only the first r coordinates of B_ρ survive).

It follows that B_ρ ∩ F^{−1}(B_δ) is a cylindrical subset of B_ρ with base dimension r and base radius δ/σ (it could even be ∅ if the translation vector b has large enough norm). This subset has volume less than (δ/σ)^r V_r ρ^{t−r} V_{t−r}, where

*V_r = π^{r/2} / (r/2)!* if r is even, *V_r = 2^{(r+1)/2} π^{(r−1)/2} / r!!* if r is odd,

is the volume of the unit ball in R^r and r!! denotes the semifactorial of r. So

*Vol(B_ρ ∩ F^{−1}(B_δ)) / Vol(B_ρ) < (δ/σ)^r ρ^{t−r} V_r V_{t−r} / (ρ^t V_t) < 2^{t/2} (δ/ρσ)^r*

where the last inequality is simple to prove when r and t are both even, while some algebra is required for the remaining cases. The last formula concludes the proof.

**Lemma 1.3.3.** *Let S be a bounded subset of R^k with boxdim(S) = d, and let G_0, G_1, . . . , G_t be Lipschitz maps from S to R^n. Assume that for each x in S, the r-th largest singular value of the n × t matrix*

*M_x = (G_1(x), . . . , G_t(x))*

*is at least σ > 0. For each α ∈ R^t we define G_α = G_0 + Σ_{i=1}^t α_i G_i. Then for almost every (in the Lebesgue sense) α ∈ R^t, the set G_α^{−1}(0) has lower box-counting dimension at most d − r.*

*Proof.* For a positive number ρ, let B_ρ be the ball of radius ρ centered at the origin in R^t. Without loss of generality, we may replace R^t by B_ρ in order to prove the lemma. For the remainder of the proof, we say that G_α has some property *with probability p* to mean that the Lebesgue measure of the set of α ∈ B_ρ for which G_α has the property is p times the measure of B_ρ. For example, if x ∈ S, then the previous lemma shows that |G_α(x)| = |G_0(x) + M_x(α)| ≤ ε for α ∈ B_ρ with probability at most 2^{t/2}(ε/σρ)^r.

We can choose D > d and ε_0 > 0 in such a way that the following two facts hold for every ε < ε_0:

• since S is bounded and D > d, S can be covered by ε^{−D} k-dimensional balls B(x, ε) of radius ε, centered at points x ∈ S;

• by the Lipschitz condition, there exists a constant C such that the image under any G_α, α ∈ B_ρ, of any ε-ball in R^k intersecting S is contained in a Cε-ball in R^n.

From now on we assume ε < ε_0.

For x ∈ S, let R_x = {α ∈ B_ρ : |G_α(x)| ≤ Cε}. The Lebesgue measure of R_x is at most a constant multiple of ε^r, since σ and ρ are constant (in other words, the measure of R_x is the probability that G_α(B(x, ε)) contains 0).

We want to estimate the probability that at least M of the ε^{−D} balls are mapped onto 0. To this end we define the function f : R^t → N such that f(α) equals the number of sets R_x, with x ranging over the ε^{−D} centers of the balls which cover S, to which α belongs. Then ∫ f(α) dm(α) = Σ_x m(R_x) ≤ ε^{−D} max_x m(R_x), and by Markov’s inequality

*m{α : f(α) > M} < (1/M) ∫ f(α) dm(α) ≤ ε^{−D} max_x m(R_x) / M < C_1 ε^{r−D} / M.*

In other words, G_α^{−1}(0) can be covered by fewer than M = ε^{−b} of the ε-balls except with probability at most C_1 ε^{b−(D−r)}. As long as b > D − r, this probability can be made as small as desired by decreasing ε.

Let p > 0. We choose a sequence {ε_i}_{i∈N} approaching 0 such that G_α^{−1}(0) can be covered by fewer than ε_i^{−b} balls except with probability at most p2^{−i}. Thus the lower box-counting dimension of G_α^{−1}(0) is at most b, except for a probability-p subset of α. Since p was arbitrary, lower boxdim(G_α^{−1}(0)) ≤ b for almost every α. In conclusion, since this holds for all b > D − r and all D > d, lower boxdim(G_α^{−1}(0)) ≤ d − r.

**Remark 1.3.1.** *In the case boxdim(S) does not exist, the lemma is still valid if we*
*suppose that d is the lower box-counting dimension of S.*

**Lemma 1.3.4.** *Let S be a bounded subset of R^k with boxdim(S) = d, and let G_0, G_1, . . . , G_t be Lipschitz maps from S to R^n. Assume that for each x in S the rank of the n × t matrix*

*M_x = (G_1(x), . . . , G_t(x))*

*is at least r. For each α ∈ R^t we define G_α = G_0 + Σ_{i=1}^t α_i G_i. Then for almost every (in the Lebesgue sense) α ∈ R^t, the set G_α^{−1}(0) is the nested countable union of sets of lower box-counting dimension at most d − r. If r > d, then G_α^{−1}(0) is empty for almost every α.*

*Proof.* Apply lemma 1.3.3 to the set S_σ = {x ∈ S : the r-th largest singular value of M_x is greater than σ} to get G_α^{−1}(0) ∩ S_σ = ∅ for almost every α. Then S = ∪_{σ>0} S_σ implies G_α^{−1}(0) = ∅.

**Lemma 1.3.5.** *Let A be a compact subset of R^k. Let F_0, F_1, . . . , F_t be Lipschitz maps from A to R^n. For each integer r ≥ 0, let S_r be the set of pairs x ≠ y in A × A for which the n × t matrix*

*M_{xy} = (F_1(x) − F_1(y), . . . , F_t(x) − F_t(y))*

*has rank r, and let d_r be the lower box-counting dimension of S_r. Then, for α outside a subset of measure zero in R^t, the map F_α = F_0 + Σ_{i=1}^t α_i F_i : A → R^n satisfies: if d_r < r for all integers r ≥ 0 (assuming that d_0 = 0 for the case r = 0), then the map F_α is one-to-one.*

*Proof.* Define G_i(x, y) = F_i(x) − F_i(y) for i = 0, . . . , t and G_α = G_0 + Σ_{i=1}^t α_i G_i for α ∈ R^t; by hypothesis, the rank of the matrix M_{xy} = (G_1(x, y), . . . , G_t(x, y)) is r on S_r. Now, if r > d_r, by lemma 1.3.4 we have that for almost every α ∈ R^t the image of S_r under G_α does not contain the origin, i.e. F_α(x) ≠ F_α(y) for x ≠ y in S_r. If r > d_r for all r ≥ 0, observe that each pair x ≠ y lies in some S_r: if not, there would exist a pair x ≠ y such that rank(M_{xy}) does not belong to {0, . . . , min(n, t)}, which is absurd. It follows that F_α is one-to-one for almost every α, and this concludes the proof.

The next theorem shows that almost every linear transformation between two Euclidean spaces is one-to-one on a set of sufficiently small dimension.

**Theorem 1.3.1.** *Let A be a compact subset of R^k with lower boxdim(A) = d. If n > 2d, then almost every linear transformation of R^k to R^n is one-to-one on A.*

*Proof.* Let x ≠ y be a pair in A × A and consider the matrix M_{xy} = (F_1(x − y), . . . , F_{nk}(x − y)) defined in lemma 1.3.5, where the {F_i} form a basis of the nk-dimensional space of linear transformations from R^k to R^n. It is easy to see that the rank of M_{xy} is equal to n: in fact, assuming without loss of generality that the first coordinate a of the vector x − y ≠ 0 is different from 0, we can find n linear transformations of the basis such that F_j(x − y) = (0, . . . , a, . . . , 0), with a in the j-th place (we are choosing the n basis maps that send the first coordinate of a vector to the j-th coordinate).

In particular, adopting the same notation of lemma 1.3.5, S_n = A × A − ∆ (where ∆ denotes the set of pairs x = y in the product) and S_r = ∅ for r ≠ n, because we have noticed that M_{xy} always has full rank for x ≠ y.

Since A × A − ∆ = S_n and lower boxdim(S_n) ≤ lower boxdim(A) + lower boxdim(A) = 2d < n, all the hypotheses of lemma 1.3.5 are satisfied; hence almost every F_α = Σ α_i F_i is one-to-one on A.

In our state space reconstruction it is important to preserve the differential structure. We continue with the following definitions.

**Definition 1.3.1.** *For a compact differentiable manifold M, let T (M) = {(x, v) : x ∈*

*M,* *v ∈ TxM }* *be the tangent bundle of M and let S(M) = {(x, v) ∈ T (M) : |v| = 1}*

*denote the unit tangent bundle of M.*

**Remark 1.3.2.** *We assume the following fact: if M is a k-dimensional differentiable*
manifold in R*n _{, then S(M) has dimension 2k − 1.}*

**Lemma 1.3.6.** *Let A be a compact subset of a smooth manifold embedded in R^k. Let F_0, F_1, . . . , F_t : R^k → R^n be t + 1 smooth maps defined on an open neighbourhood U of A. For r ≥ 0, let S_r be the subset of S(A) on which the n × t matrix*

*(DF_1(x)(v), . . . , DF_t(x)(v))*

*has rank r, and let d_r = lower boxdim(S_r). Define F_α = F_0 + Σ_{i=1}^t α_i F_i : U → R^n. If d_r < r for all integers r ≥ 0, then for almost every α ∈ R^t, the map F_α is an immersion on A.*

*Proof.* Define G_i : S(A) → R^n by G_i(x, v) = DF_i(x)v. If d_r < r for all r ≥ 0, then we apply lemma 1.3.4 to show that for almost every α, G_α^{−1}(0) ∩ S_r = ∅, and since S(A) = ∪_r S_r, the following equalities hold:

*G_α^{−1}(0) ∩ S(A) = G_α^{−1}(0) ∩ (∪_r S_r) = ∪_r (G_α^{−1}(0) ∩ S_r) = ∅,*

hence G_α^{−1}(0) = ∅. It follows that no unit tangent vector is mapped to the origin, and F_α is an immersion.

We are now ready to prove the theorems stated in the previous section.

*Proof of Fractal Whitney Embedding Prevalence Theorem.* If we show that almost every smooth map is one-to-one and an immersion, the theorem is proved. To this end, we take as probe space the set E = {F_1, . . . , F_t} formed by a basis of the nk-dimensional space of linear transformations from R^k to R^n. Referring to the terminology of lemma 1.3.5, theorem 1.3.1 shows that F_α = Σ α_i F_i is one-to-one on A for almost every α ∈ R^t. If we add other smooth maps F_{t+1}, . . . , F_{t′}, the rank of M_{xy} does not change if x ≠ y, so almost every linear combination of F_1, . . . , F_{t′} is one-to-one on A; hence almost every smooth map F is one-to-one on A (this is exactly in the sense of prevalence).

Then, recalling that boxdim(A) = d, we take a compact subset C of a smooth manifold contained in A, whose dimension is at most d. From remark 1.3.2 we have boxdim(S(C)) ≤ 2d − 1. The proof is concluded by applying lemma 1.3.6 to the functions of the probe space E, with S_n = S(C) and S_r = ∅ for r ≠ n (we are using the notation of lemma 1.3.6), and noticing that boxdim(S_n) ≤ 2d − 1 < 2d < n.

**Remark 1.3.3.** *The Whitney Embedding Prevalence Theorem is also proved, since it is a particular case of the fractal one.*

*Proof of Theorem 1.2.2.* For i = 1, . . . , t define

*F_i(x) = B (h_i(g_1(x)), . . . , h_i(g_w(x)))^T.*

By definition, F(B, h_α, g) = F_0 + Σ_{i=1}^t α_i F_i, where F_0 is defined analogously from h_0. To use lemma 1.3.5, we have to study for each x ≠ y the rank of the matrix

*M_{xy} = (F_1(x) − F_1(y), . . . , F_t(x) − F_t(y))*

which can be written as

*B (h_j(g_i(x)) − h_j(g_i(y)))_{1≤i≤w, 1≤j≤t} = BJH*

where

*H = (h_j(z_i))_{1≤i≤q, 1≤j≤t},*

q ≤ 2w, the z_i are distinct, and J = J_{xy} is a w × q matrix each of whose rows consists of zeros except for one 1 and one −1. By the first part of lemma 1.3.1 the rank of H is q (a basis formed by monomials can be chosen). Now, the study is divided into three cases:

**Case 1:** *x and y are not both periodic with period ≤ w.* In this case J is upper or lower triangular, and rank J = w. Since B, J and H are onto linear transformations, the product BJH is onto and has rank n. The set of pairs x ≠ y of case 1 has box-counting dimension at most 2d and rank(M_{xy}) = n. If g has no periodic points of period ≤ w, we are done by lemma 1.3.5.

**Case 2:** *x and y lie in distinct periodic orbits of period ≤ w.* Assume p and q are minimal such that g^p(x) = x and g^q(y) = y, and that 1 ≤ q ≤ p ≤ w. In this case the matrix J contains a copy of C_{p0}^w. Since H is onto, rank M_{xy} = rank BJ. By hypothesis A1, rank BJ ≥ rank BC_{p0}^w > 2 · boxdim(A_p), which is the box-counting dimension of the set of pairs treated in case 2. By lemma 1.3.5, for almost every α ∈ R^t, F_α(x) ≠ F_α(y) for every such pair x ≠ y.

**Case 3:** *Both x and y lie in the same periodic orbit of period ≤ w.* Assume p and q are minimal such that g^p(x) = x and g^q(x) = y, and that 1 ≤ q < p ≤ w. Since x and y lie in the same periodic orbit, the column space of J contains the column space of C_{pq}^w. Thus rank BJH = rank BJ ≥ rank BC_{pq}^w > boxdim(A_p), which is the dimension of the set of pairs of case 3. The thesis follows from lemma 1.3.5 again.

*Proof of Theorem 1.2.3.* We want to apply lemma 1.3.6. First of all we have to check the rank of the n × t matrix

*(DF_1(x)(v), . . . , DF_t(x)(v))*

for each (x, v) in the unit tangent bundle S(A). For a given observation function h, the derivative of F(B, h, g) is

*DF(B, h, g)(x)v = B (∇h(g^{w−}(x))^T Dg^{w−}(x)v, . . . , ∇h(g^{w+}(x))^T Dg^{w+}(x)v)^T*

We divide the proof into cases.

**Case 1:** *If x is not a periodic point of period less than w, then g^{w−}(x), . . . , g^{w+}(x) are distinct points.* Since g is a diffeomorphism, v ≠ 0 implies that Dg^i(x)v ≠ 0 for all i. Therefore, by the second part of lemma 1.3.1, the set of vectors {DF(B, h_α, g)(x)v : α ∈ R^t} spans R^n. In the notation of lemma 1.3.6, the subset S_n contains all points of S(A) that are not periodic with period less than w, and d_n = lower boxdim(S_n) ≤ 2m − 1 < n. If g has no periodic points of period less than w, lemma 1.3.6 already gives the conclusion.

**Case 2:** *If x is a periodic point of period p < w, then*

*DF(B, h, g)(x)v = B (H_1^T w_1, . . . , H_p^T w_p, H_1^T D_1 w_1, . . . , H_p^T D_p w_p, H_1^T D_1^2 w_1, . . .)^T*  (1.1)

*where x_i = g^{w−+i}(x) = x_{p+i}, H_i = ∇h(x_i), w_i = Dg(x_{i−1}) · · · Dg(x_1) Dg^{w−}(x)v and D_i = Dg(x_{i−1}) · · · Dg(x_1) Dg(x_p) · · · Dg(x_i).*

Each matrix D_i has the same set of eigenvalues λ_1, . . . , λ_m and, by hypothesis, they are distinct. If u_1, . . . , u_m is a spanning set of eigenvectors for D_1, then one checks that u_{ij} = Dg(x_{i−1}) · · · Dg(x_1) u_j, for 1 ≤ i ≤ p, 1 ≤ j ≤ m, defines a spanning set {u_{i1}, . . . , u_{im}} of eigenvectors for D_i. Thus if w_1 = Σ_{j=1}^m a_j u_{1j} is the eigenvector expansion of w_1, then the eigenvector expansion of w_i is Σ_{j=1}^m a_j u_{ij}, which has the same coefficients.

Thus DF(B, h, g)(x)v can be written as B times the w-vector

*Λ_1 (a_1 u_{11}^T, . . . , a_m u_{1m}^T)^T H_1 + · · · + Λ_p (a_1 u_{p1}^T, . . . , a_m u_{pm}^T)^T H_p*  (1.2)

where Λ_i denotes the w × m matrix whose rows are all zero except that, for s = 0, 1, 2, . . ., the row in position i + sp is (λ_1^s, . . . , λ_m^s).

To find the rank of the matrix (1.1) for (x, v) where x is periodic, we need to find the span of B times the vectors (1.2) for h = h_α = Σ α_i h_i, α ∈ R^t. Assume that the eigenvector expansion of v has exactly r nonzero coefficients. Then, because the u_{ij}, 1 ≤ j ≤ m, are linearly independent, the vectors of form (1.2) span a space of dimension min{w, rp} as α spans R^t.

Therefore, for this (x, v), the span of the vectors (1.1) has dimension equal to the rank of BD_p^w(λ_{i_1}, . . . , λ_{i_r}). By hypothesis, the boxdim of such pairs (x, v) in S(A) is boxdim(A_p) + r − 1, and the rank of the n × t matrix (1.1) is strictly larger, so that lemma 1.3.6 applies to give the conclusion.

*Proof of Theorem 1.2.4.* *Apply theorems 1.2.2 and 1.2.3 with B = In*.

*Proof of Theorem 1.2.5.* *Since Ap* *is empty for 1 ≤ p ≤ w, the conditions A1 − A3 of*

theorems 1.2.2 and 1.2.3 are satisfied immediately.

**1.4**

**How to estimate the embedding dimension**

When using a delay coordinate map or a filtered delay coordinate map to examine
*the image F (A) in Rn* of a set A in R*k, the choice of n depends on the objective of our*

investigation, i.e. calculation of dimension and Lyapunov exponents, prediction and noise reduction [18].

To compute the dimension of A, all that is required is that
*dimF (A) = dimA*

whatever dimension is considered.

In practical situations, if attempts to measure boxdim(A) give answers depending on n, with n > boxdim(A), then the variation should be regarded as a numerical artifact, since there is no theoretical justification for which of the values of n greater than boxdim(A) gives the more accurate result. The standard technique is to increase n until the observed value of boxdim F(A) reaches a plateau, and to use this value (the so-called *plateau dimension*).
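The plateau technique can be sketched numerically: embed the same series with increasing n and watch the estimated dimension stop growing. Below, a sinusoidal signal (whose delay image is a closed curve, so the plateau sits near 1) is used purely for illustration; the grid-based estimator and all parameters are our choices.

```python
import math

def boxdim_estimate(points, epsilons):
    """Grid estimate of boxdim: least-squares slope of log N(eps) vs log eps."""
    logs = []
    for eps in epsilons:
        boxes = {tuple(math.floor(c / eps) for c in p) for p in points}
        logs.append((math.log(eps), math.log(len(boxes))))
    mx = sum(x for x, _ in logs) / len(logs)
    my = sum(y for _, y in logs) / len(logs)
    return -(sum((x - mx) * (y - my) for x, y in logs)
             / sum((x - mx) ** 2 for x, _ in logs))

def delay_embed(series, n, lag):
    """n-dimensional delay vectors of a scalar series."""
    return [tuple(series[t - i * lag] for i in range(n))
            for t in range((n - 1) * lag, len(series))]

series = [math.sin(0.01 * t) for t in range(20000)]
epsilons = [0.2, 0.1, 0.05, 0.025]
dims = [boxdim_estimate(delay_embed(series, n, 25), epsilons) for n in (1, 2, 3, 4)]
# dims levels off near 1: increasing n does not change the dimension of the image
```

The estimates are rough (finite data, finite ε range), which is exactly why the plateau, rather than any single n, is read off in practice.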

**Chapter 2**

**Entropy of a Time Series**

In this chapter, we want to characterize the behavior of a time series. First of all,
*we always assume that our time series is the output of an ergodic dynamical system.*
We introduce the main aspects and the main results of the ergodic theory that will be
used later.

Then, using finite measurable partitions, we introduce the Kolmogorov-Sinai entropy, a quantifier of the predictability of an ergodic dynamical system.

*Simultaneously, the complexity of an infinite string of symbols is introduced through the*

*Algorithmic Information Content (AIC)*, that is the minimum amount of information

that can identify a string.

Following [3], we convert a time series into a sequence of symbols. Complexity of a
*time series can be identified with K-S entropy through symbolic dynamics and Brudno’s*
theorem: from these facts, we would compute K-S entropy by the AIC.

Unfortunately, the AIC cannot be computed by any algorithm: for this reason, prefix-free codes and the simple Lempel-Ziv algorithm are introduced as approximations of the AIC. Following P. Shields [19] and T. Cover [7], we describe the simple Lempel-Ziv algorithm and its properties.
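To anticipate the idea, Lempel-Ziv parses a string into consecutive shortest phrases never seen before, and the phrase count grows more slowly for more predictable strings. The following is a minimal LZ78-style parse for illustration, not necessarily the exact variant described in [19],[7]:

```python
import random

def lz_phrase_count(s):
    """Number of phrases in the parse of s into shortest never-seen-before phrases."""
    seen, phrase, count = set(), "", 0
    for symbol in s:
        phrase += symbol
        if phrase not in seen:   # shortest new phrase complete: record it, restart
            seen.add(phrase)
            count += 1
            phrase = ""
    return count + (1 if phrase else 0)  # a possibly repeated final phrase

random.seed(0)
periodic = "01" * 512                                      # highly predictable
noisy = "".join(random.choice("01") for _ in range(1024))  # incompressible-looking
# The periodic string parses into far fewer phrases than the noisy one.
```

The phrase count, suitably normalized, is what will later serve as a computable stand-in for the AIC per symbol.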

In the last part, we introduce other ways to characterize a time series: the *generalized entropy of order 2* [10],[8], the *approximate entropy* [14], and the *sample entropy* [16].

Their definitions are based on the embedding dimension of the attractor underlying the dynamics (see Chapter 1), even if these quantifiers are commonly used with fixed embedding dimensions.

**2.1**

**Preliminary Concepts and Tools**

We will try to infer entropy properties from strings that represent discrete-time stochastic processes.

**Definition 2.1.1 (Process).** *A discrete-time stochastic process is a sequence X_1, X_2, . . . , X_n, . . . of random variables defined on a probability space (X, B, µ).*

*The process has alphabet A if the range of each Xi* is contained in A. In our

discussion we only consider processes with a finite alphabet, that is |A| (cardinality
of A) finite. Moreover, "measure" will mean "probability measure" and "function" is
*intended as "measurable function" with respect to some appropriate σ-algebra on a*
probability space.

The sequence a_m, a_{m+1}, . . . , a_n, where each a_i ∈ A, is denoted by a_m^n, while A_m^n is the set containing all the sequences a_m^n. The notation A^n will refer to A_1^n.

Of particular interest is the notion of k-th order distribution of the process {X_k}, which is the measure µ_k on A^k defined by the formula

*µ_k(a_1^k) = Prob(X_1^k = a_1^k), a_1^k ∈ A^k.*

*The set {µk* *: k ≥ 1} is called the distribution of the process, that is a family of*

*probability distributions, one for each k. An important fact is that the sequence of*
*distributions cannot be completely arbitrary because they have to satisfy a consistency*
relation which is implicit in the definition of a process. This relation can be expressed
as

*µ_k(a_1^k) = Σ_{a_{k+1}} µ_{k+1}(a_1^{k+1}).*
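The consistency relation says precisely that µ_k is the marginal of µ_{k+1}. A quick check on a toy i.i.d. binary process (the probabilities are arbitrary choices of ours):

```python
from itertools import product

p = {0: 0.7, 1: 0.3}  # i.i.d. process on the alphabet {0, 1}

def mu(word):
    """k-th order distribution of the i.i.d. process: product of the marginals."""
    prob = 1.0
    for a in word:
        prob *= p[a]
    return prob

# Check mu_k(a_1^k) = sum over a_{k+1} of mu_{k+1}(a_1^{k+1}), here for k = 3.
consistent = all(abs(mu(w) - sum(mu(w + (b,)) for b in (0, 1))) < 1e-12
                 for w in product((0, 1), repeat=3))
```

For an i.i.d. process the check is immediate, but the same identity is what a general family {µ_k} must satisfy to define a process.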

A process is defined by its joint distributions without giving too much importance to
*the particular space on which the random variables Xn* are defined. Therefore, one can

*choose the underlying space on which Xn* are defined as long as the joint distributions

are left unchanged.

It is convenient, then, to use the Kolmogorov model for a process which is a particular way to define a space and a sequence of functions with the given distributions. We see how this construction is carried out.

Let Ω denote the set of all infinite sequences of symbols of A. We define the cylinder set determined by a_m^n, denoted by [a_m^n], as the following subset of Ω:

*[a_m^n] = {x ∈ Ω : x_i = a_i, m ≤ i ≤ n}.*

Let C_n be the collection of cylinder sets defined by sequences that belong to A^n, and let C = ∪_n C_n. We denote by B the σ-algebra generated by the cylinder sets C; the members of B are called the Borel sets of Ω. For each n ≥ 1 the coordinate function X̂_n : Ω → A is defined by X̂_n(x) = x_n, x ∈ Ω.

*The following theorem, known as Kolmogorov representation theorem, states that every*
process with finite alphabet A can be thought of as the coordinate function process
{ ˆ*Xn*} together with a Borel measure on Ω.

**Theorem 2.1.1 (Kolmogorov Representation Theorem).** *If {µ_k} is a sequence of measures for which the consistency condition holds, then there is a unique Borel probability measure µ on Ω such that µ([a_1^k]) = µ_k(a_1^k) for each k and a_1^k. Equivalently, if {X_n} is a process with finite alphabet A, there is a unique Borel measure µ on Ω for which the sequence of coordinate functions {X̂_n} has the same distribution as {X_n}.*

Such a measure µ will be called the Kolmogorov measure of the process, and the expression "let µ be a process" will stand for "let µ be the Kolmogorov measure of some process {X_n}".

**Definition 2.1.2 (Stationary process).** *A process is stationary if the joint distributions do not depend on the choice of the time origin, that is*

*Prob(X_i = a_i, m ≤ i ≤ n) = Prob(X_{i+1} = a_i, m ≤ i ≤ n)*

*for all m, n and a_m^n.*

**2.1.1**

**Probability tools**

Markov’s inequality and the Borel-Cantelli principle will be used very frequently.
**Lemma 2.1.1 (Markov’s Inequality).** *Let f be a non-negative measurable function on a probability space (X, B, µ). Then for every δ > 0*

*µ{x ∈ X : f(x) ≥ δ} ≤ (1/δ) ∫_X f(x) dµ.*
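As a quick illustration, Markov's inequality can be verified empirically for f(x) = x on ([0, 1], Lebesgue measure), where ∫ f dµ = 1/2; the sample size and thresholds below are arbitrary:

```python
import random

random.seed(1)
sample = [random.random() for _ in range(100000)]  # uniform measure on [0, 1]
mean = sum(sample) / len(sample)                   # estimates the integral of f

for delta in (0.5, 0.8, 0.9):
    tail = sum(1 for x in sample if x >= delta) / len(sample)
    # Markov: mu{x : f(x) >= delta} <= (1/delta) * integral of f
    assert tail <= mean / delta
```

For the uniform case the bound is loose (the true tail is 1 − δ versus the bound 1/(2δ)), which is typical: Markov trades sharpness for complete generality.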

**Lemma 2.1.2 (The Borel-Cantelli Principle).** *If {C_n} is a sequence of measurable sets in a probability space (X, B, µ) such that Σ_n µ(C_n) < ∞, then for almost every x there is an N = N(x) such that x ∉ C_n for all n ≥ N.*

In general, a property P is said to be measurable if the set of x for which P(x) is true is a measurable set. If {P_n} is a sequence of measurable properties, then:

1. *P_n(x) holds eventually almost surely* if for almost every x there is an N = N(x) such that P_n(x) is true for n ≥ N.

2. *P_n(x) holds infinitely often almost surely* if for almost every x there is an increasing sequence {n_i} of integers, which may depend on x, such that P_{n_i}(x) is true for i = 1, 2, . . .

Almost sure convergence can be stated in equivalent forms.