
Department of Mathematics

Master's Degree Course in Mathematics

Master's Thesis

A Mean Field Analysis of

Two-Layers Neural Networks with

General Convex Loss Function

Candidate: Francesco Demelas

Supervisor: Prof. Marco Romito


Contents

Introduction

1 Mean Field for Neural Networks
  1.1 Basic Concepts
  1.2 Main results and comparison with different works
  1.3 Convergence of the Risk

2 Weak Convergence of the Distributions
  2.1 Noiseless SGD
  2.2 Noisy SGD and generalization of the previous result

3 Characterization of Fixed Points
  3.1 Characterization of fixed points of the PDE
    3.1.1 Noiseless case
    3.1.2 Noisy case
  3.2 General continuity results
  3.3 Property of the solution of the PDE

4 Convergence: stability and instability
  4.1 Noiseless case
    4.1.1 Stability
    4.1.2 Instability
  4.2 Noisy case

5 Numerical experiments
  5.1 Numerical Results


Introduction

Nowadays neural networks are a useful tool for many tasks, even if there are few theoretical results that guarantee the effectiveness of this approach. Until a few years ago, one of the most powerful results in this field was the Universal Approximation Theorem, which guarantees that any continuous function can be well approximated by a two-layers neural network with convex activation functions and enough hidden nodes.

However, this result tells us nothing about the practical choice of the parameters of the network needed to achieve this approximation. In general, neural networks are trained using the Back-Propagation algorithm, and typically the Stochastic Gradient Descent algorithm, or one of its variants, is used to update the weights of the network.

In the last years several results have been obtained on the convergence of the parameters under Stochastic Gradient Descent, in particular using the mean field approach. The key idea of this method is to notice that the risk actually depends on the parameters only through their empirical distribution. This suggests to consider a risk function defined over the set of distributions on the space of parameters, which allows us to study the convergence through a partial differential equation, known as the distributional dynamics, that can be seen as a continuous counterpart of an iteration of the Stochastic Gradient Descent. This approach therefore allows the use of common tools of mathematical analysis; in particular, the distributional dynamics has a remarkable mathematical structure, as it corresponds to a gradient flow in the metric space of probability measures endowed with the Wasserstein metric.


Many of the results obtained in this direction use a quadratic loss function, thus optimizing the mean square error. In particular, the quadratic loss is convex and regular, and this slightly simplifies the optimization problem. In this work we analyse the mean field approach presented in [1] with a general convex loss function, and study the convergence of the parameters. This generalization is relevant as it extends the approach to other loss functions commonly used in machine learning, such as Cross-Entropy and Mean Absolute Error. The generalization is fundamental, because different loss functions can capture different aspects of the problem. In fact, the success of a learning problem can be enhanced by the choice of the most suitable loss function. For example, the cross-entropy is more appropriate for classification tasks, while the mean squared error and the mean absolute error generally work well for regression tasks.

We start by proving that the empirical distributions weakly converge to the continuous counterpart obtained through the distributional dynamics on any finite temporal interval. In particular, on each finite time interval, we obtain a bound on the difference between the discrete and the continuous risk that goes to zero (in probability) when the number of hidden units goes to infinity.

On the other hand, this bound depends exponentially on the final time. Hence it is interesting to analyse the convergence of the distributions in time, and we find that, under suitable assumptions, the continuous distributions weakly converge to a fixed point of the distributional dynamics. Moreover, considering a regularized version of the Stochastic Gradient Descent, known as Noisy SGD, we prove that, choosing the regularization hyper-parameters appropriately, the continuous distributions converge to a global minimum of the risk. Furthermore, we find that in the regularized case the discrete distributions can approximate this optimal distribution arbitrarily well.


Chapter 1

Mean Field for Neural Networks

In this chapter we start by giving the preliminary definition of a neural network and presenting stochastic gradient descent, one of the algorithms most used to update the weights during the training of a neural network. As the aim of this work is to analyse the convergence of this algorithm, we give the main idea of the method that we consider, that is the mean field approximation. This allows us to study the convergence of the parameters in stochastic gradient descent through their distribution and a related partial differential equation that describes the continuous dynamics. This partial differential equation, known as the distributional dynamics, can be viewed as the continuous counterpart of the stochastic gradient descent.

Then we give an informal overview of the main results, and compare them with other results obtained in different papers.

At the end of this chapter we present a first result that exploits the relationship between the minimum of the discrete risk and the minimum of its continuous counterpart, finding a characterization for the minimizer of the continuous counterpart of the risk.

1.1 Basic Concepts

Let us consider a standard supervised learning problem, with observations (x_i, y_i) ∈ R^d × R, i ∈ N, which are assumed to be independent and identically distributed from an unknown distribution P on R^d × R, whose conditional probability (for a.e. y) we assume to be absolutely continuous with respect to the Lebesgue measure µ defined on R^d. Here x_i ∈ R^d is a feature vector and y_i ∈ R is a label. We want to model the dependence of the label y_i on the feature vector x_i in order to assign labels to unlabelled samples.

Definition 1. A neural network is an oriented graph (V, E) with weights defined for each edge, such that:

• there are no cycles (feed-forward property);

• the nodes are organized in layers, i.e. V = ∪_{j=0}^T V_j with V_i ∩ V_j = ∅ for all i ≠ j, and the elements of E are all and only the edges that connect a node of V_j to one of V_{j+1} for some j = 0, …, T − 1;

• on each hidden node an activation function σ : R → R is defined.

In particular, we speak of a deep neural network when there are at least three hidden layers. Instead, for a two-layers neural network, known as a shallow neural network, we have T = 2, that is a single hidden layer.

In this work we consider a two-layers neural network with N hidden nodes, and we can write the prediction as

$$\hat y_N \equiv \hat y_N(x;\theta) = \frac{1}{N}\sum_{i=1}^N \sigma_*(x;\theta_i) = \frac{1}{N}\sum_{i=1}^N a_i\,\sigma(\langle w_i, x\rangle + b_i),$$

where σ : R → R and σ_* : R^d × R^D → R are activation functions and (θ_i)_{i≤N} ⊆ R^D are the parameters of the network, with θ_i = (a_i, b_i, w_i) ∈ R^D. In particular, the a_i are the weights of the arcs between the hidden layer (V_1) and the output layer (V_2), while the w_i and the b_i are the weights of the arcs between the input layer (V_0) and the hidden layer; in particular, b_i is the bias weight (it is attached to the arc connecting a constant node to the i-th node of the hidden layer).
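As an illustration, the prediction above takes only a few lines of NumPy. The choice σ = tanh and the random parameter values below are our own, purely for the example; the thesis only assumes a generic activation σ : R → R.

```python
import numpy as np

def predict(x, a, W, b):
    """Two-layers prediction (1/N) * sum_i a_i * sigma(<w_i, x> + b_i),
    with sigma = tanh as an illustrative choice of activation."""
    hidden = np.tanh(W @ x + b)        # shape (N,), one value per hidden unit
    return (a @ hidden) / a.shape[0]   # average over the N hidden units

# Toy instance: d = 3 input features, N = 100 hidden units, random weights.
rng = np.random.default_rng(0)
d, N = 3, 100
a, b = rng.normal(size=N), rng.normal(size=N)
W = rng.normal(size=(N, d))
x = rng.normal(size=d)
y_hat = predict(x, a, W, b)            # scalar prediction y_N(x; theta)
```

Note the 1/N mean-field scaling in front of the sum, which is what makes the empirical distribution of the θ_i the natural object later on.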

In general, in order to choose the parameters, we consider a certain loss function l : R × R → R and then choose the parameters that minimize the risk R_N(θ) = E[l(y, ŷ_N(x; θ))], where E is the expectation with respect to the data distribution P.


In practice, one of the most used algorithms for training the parameters of a neural network is Back-Propagation which, after an initialization of the network's parameters, iterates two steps:

• first it propagates the data, computing the output of the network with the current parameters;

• then, if the output does not coincide with the observed (real) value y, it updates the parameters.

The algorithm most commonly used to update the weights of the network is Stochastic Gradient Descent (or one of its variants).

Let us consider the following iteration of SGD for the two-layers neural network previously defined:

$$\theta_i^{k+1} = \theta_i^k - s_k\,\partial_2 l(y_k, \hat y(x_k;\theta^k))\,\nabla_{\theta_i}\sigma_*(x_k;\theta_i^k), \tag{1.1}$$

where θ^k = (θ_i^k)_{i≤N} denotes the parameters after k iterations, s_k is a step size and (x_k, y_k) is the k-th example. In particular, (1.1) is known as on-line stochastic gradient descent, because at each step we consider just one training sample. On the other hand, off-line stochastic gradient descent is the algorithm that considers, at each step, all the training data for the update of the parameters; in this case we use the following iterative formula:

$$\theta_i^{k+1} = \theta_i^k - s_k\sum_{j=1}^{n}\partial_2 l(y_j, \hat y(x_j;\theta^k))\,\nabla_{\theta_i}\sigma_*(x_j;\theta_i^k). \tag{1.2}$$

In general, we can consider a compromise between the two previous approaches, where we fix a batch size p with 1 ≤ p ≤ n, at each step take just p samples y_{j(1)}, …, y_{j(p)} from the dataset, and update the weights using only these samples by the following formula:

$$\theta_i^{k+1} = \theta_i^k - s_k\sum_{q=1}^{p}\partial_2 l(y_{j(q)}, \hat y(x_{j(q)};\theta^k))\,\nabla_{\theta_i}\sigma_*(x_{j(q)};\theta_i^k). \tag{1.3}$$
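A minimal sketch of one on-line SGD step (1.1). The squared loss, σ = tanh, and the parameter layout as rows (a_i, b_i, w_i) of an array are our own illustrative choices, not prescriptions of the thesis:

```python
import numpy as np

def sgd_step(theta, x, y, s, d2l):
    """One on-line SGD iteration (1.1): for each hidden unit i,
    theta_i <- theta_i - s * d2l(y, y_hat) * grad_{theta_i} sigma_*(x; theta_i),
    with sigma_*(x; theta_i) = a_i * tanh(<w_i, x> + b_i) (illustrative choice).
    Rows of theta are (a_i, b_i, w_i); d2l is a sub-gradient of l in y_hat."""
    a, b, W = theta[:, 0], theta[:, 1], theta[:, 2:]
    z = W @ x + b                      # pre-activations <w_i, x> + b_i
    h = np.tanh(z)
    y_hat = (a @ h) / len(a)           # network prediction
    g = d2l(y, y_hat)                  # scalar factor in (1.1)
    dsig = 1.0 - h ** 2                # tanh'
    grad = np.empty_like(theta)
    grad[:, 0] = h                     # d sigma_* / d a_i
    grad[:, 1] = a * dsig              # d sigma_* / d b_i
    grad[:, 2:] = (a * dsig)[:, None] * x[None, :]   # d sigma_* / d w_i
    return theta - s * g * grad

rng = np.random.default_rng(1)
N, d = 50, 2
theta = rng.normal(size=(N, d + 2))
x, y = rng.normal(size=d), 1.0
# Squared loss: d2l(y, y_hat) = -2 * (y - y_hat) is its derivative in y_hat.
theta_new = sgd_step(theta, x, y, s=0.01, d2l=lambda y, p: -2.0 * (y - p))
```

Note that, as in (1.1), the update of unit i uses the gradient of σ_*(x; θ_i) itself, without the 1/N factor of the prediction: the mean-field scaling absorbs it.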


In this thesis we explicitly consider just the on-line stochastic gradient descent, but similar approaches can be used to analyse the convergence of the other variants. Moreover, as we are considering independent and identically distributed data {(x_k, y_k)}_k with (x_k, y_k) ∼ P, this is equivalent to saying that training examples are never revisited (one-pass assumption). We can start by writing the population risk

$$R_N(\theta) = \mathbb{E}\Big[l\Big(y, \frac{1}{N}\sum_{i=1}^N \sigma_*(x;\theta_i)\Big)\Big],$$

where the expectation is taken with respect to the data probability P.

Now we can observe that the risk R_N(θ) depends on the parameters θ_1, …, θ_N only through their empirical distribution ρ_N := (1/N)∑_{i=1}^N δ_{θ_i}. This suggests to consider a risk function defined on ρ ∈ P(R^D), the space of probability distributions on R^D. Hence we take

$$R(\rho) = \mathbb{E}\Big[l\Big(y, \int \sigma_*(x;\theta)\,\rho(d\theta)\Big)\Big],$$

where we use the notation ρ(dθ) ≡ ρ(θ)dθ.

The asymptotic dynamics of ρ_t is defined by the following PDE, which is the continuum formulation of an iteration of the SGD; we shall refer to it as the distributional dynamics (DD):

$$\partial_t \rho_t = \xi(t)\,\nabla_\theta\cdot\big(\rho_t\,\nabla_\theta \Psi(\theta;\rho_t)\big), \qquad \Psi(\theta;\rho) = \frac{\delta R(\rho)}{\delta \rho(\theta)} = \mathbb{E}\Big[\partial_2 l\Big(y, \int \sigma_*(x;\bar\theta)\,\rho(d\bar\theta)\Big)\,\sigma_*(x;\theta)\Big], \tag{1.4}$$

where ∇_θ · v(θ) denotes the divergence of the vector field v(θ).
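The starting observation, namely that R_N(θ) depends on θ_1, …, θ_N only through their empirical distribution, can be checked directly: any permutation of the particles leaves ρ_N, and hence the prediction and the risk, unchanged. A small sketch with toy parameter values of our own (same row layout and σ = tanh as in the earlier sketches):

```python
import numpy as np

def y_hat(theta, x):
    """Prediction as a function of the particle array theta (rows (a, b, w))."""
    a, b, W = theta[:, 0], theta[:, 1], theta[:, 2:]
    return (a @ np.tanh(W @ x + b)) / len(a)

rng = np.random.default_rng(2)
N, d = 30, 4
theta = rng.normal(size=(N, d + 2))
x = rng.normal(size=d)

# Permuting the rows of theta leaves the empirical measure
# rho_N = (1/N) sum_i delta_{theta_i} unchanged, hence also the prediction.
perm = rng.permutation(N)
same = np.isclose(y_hat(theta, x), y_hat(theta[perm], x))
```

Here `same` is True: the prediction is a function of the empirical measure only, which is exactly what licenses lifting the risk to P(R^D).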

Assuming that the data probability P is absolutely continuous with respect to the Lebesgue measure µ, it is easy to see that the function Ψ(θ; ρ) is well defined. In fact, taking two different functions v_1, v_2 such that v_1(y, ŷ), v_2(y, ŷ) ∈ ∂_2 l(y, ŷ), we have that v_1 = v_2 almost everywhere. Now, as P(· | y) is absolutely continuous with respect to the Lebesgue measure, with density denoted by p(· | ·), we finally find that

$$\mathbb{E}\big[(v_1(y,\hat y(x;\theta)) - v_2(y,\hat y(x;\theta)))\,\sigma_*(x;\theta)\big] = \iint \big(v_1(y,\hat y(x;\theta)) - v_2(y,\hat y(x;\theta))\big)\,\sigma_*(x;\theta)\,\mathbb{P}(y)\,p(x\mid y)\,d\mu(x)\,dy = 0,$$

where ŷ(x; θ) ≡ ∫σ_*(x; θ̄)ρ(dθ̄).

1.2 Main results and comparison with different works

In this section we discuss the major results obtained in this work and compare them with similar results obtained in the past years. This thesis follows the same lines of [1], but we consider a more general convex and regular loss function. In particular, in [1] the authors minimize the mean square error, considering a quadratic loss function (seen as a function of the true output value and the prediction). From the theoretical point of view, we adapt their analysis to loss functions that are convex. On the other hand, from the computational point of view, we provide an analysis of the convergence that is valid for the most common loss functions used in machine learning, namely cross-entropy, mean absolute error and, obviously, mean squared error. The generalization is fundamental, because different loss functions can capture different aspects of the problem. In fact, the success of a learning problem can be enhanced by the choice of the most suitable loss function. For example, the cross-entropy is more appropriate for classification tasks, while the mean squared error and the mean absolute error generally work well for regression tasks.

We study the convergence of the parameters, obtained through an application of the Stochastic Gradient Descent, to the measure given by the continuous dynamics. We obtain the same result for both the standard (or noiseless) version of SGD and a regularized version, called Noisy Stochastic Gradient Descent, where we consider two regularization terms, one for the parameters and the other for the measure.

Moreover, we are interested in the convergence for both the discrete and the continuous problem when time goes to infinity, in order to study the convergence to a global minimum. To do that, we first look for a characterization of the fixed points of the distributional dynamics, and we find different characterizations for the noiseless and the noisy version.

This leads us to different stability results for the partial differential equation. In the noiseless case we provide some results for the fixed points of the distributional dynamics that can be written as a characteristic function of one parameter. Under some additional assumptions, on the support of the initial distribution and on the eigenvalues of the sub-gradient of the loss function, we find that the continuous dynamics converges to a stationary point exponentially fast. We also provide a sort of instability result, which gives sufficient conditions under which the dynamics cannot converge to a stationary point.

Instead, in the noisy case, the characterization of the fixed points tells us that there exists a unique fixed point, that it is absolutely continuous with respect to the Lebesgue measure, and that it must satisfy a certain formula. Also in this case we find that the continuous dynamics converges to a stationary point. Moreover, for the regularized version we find that the discrete distributions obtained through the stochastic gradient descent converge exponentially fast to a global optimum, and this convergence does not depend on the number of hidden units, when this is large enough.

An overview of results from different papers. There are several papers where the authors obtain similar results, with some variations on the setting. In particular, they consider different assumptions on the loss and the activation function, on the data and parameter spaces, and on the initial distribution.

In [9], [10], [11] and [12] the authors obtain results on the convergence of the discrete distribution (obtained through stochastic gradient descent) to the measure given by a continuous dynamics.

In [9] the authors study two-layers neural networks considering the mean squared error as loss function; their results are valid, for example, for two-layers neural networks with sigmoid activation function.

They start by proving the universal approximation theorem and then prove the convexification of the problem at the distributional level. This convexification is proved in a similar way as is done here (as well as in [10], [1], [11]): first proving the convergence of the discrete measure obtained through stochastic gradient descent to a measure obtained by solving the distributional dynamics, and then analysing the convergence in time.

For the first result, which represents a kind of Law of Large Numbers for neural networks, they require that the set of data and the set of parameters are closed manifolds of R^k, that the activation function is continuously differentiable with respect to its parameters, and that it satisfies the unit discriminating property: if ∫g(x)σ_*(x; θ)dx = 0 for almost every θ, then g ≡ 0 for almost every x.

To study the convergence in time, they strengthen the assumptions by additionally requiring that the initial distribution ρ_0 satisfies some technical conditions, such as:

• the support of ρ_0 must contain a smooth manifold M that separates the parameter set into {(a, θ_{−a}) : a > c_0} and {(a, θ_{−a}) : a < c_0} for some c_0 > 0;

• ŷ_in := ∫_R c ρ_0(dc, ·) has finite total variation;

• for each b ∈ R: ∫ e^{bc} ρ_0(dc, dz) < ∞.

Moreover, in this paper they also study an analogue of the Central Limit Theorem, obtaining a result of weak convergence in law for the fluctuation ω_t^{(N)} := N^{1/2}(ρ̄^{(N)} − ρ_t) when N goes to infinity and when t goes to infinity. Furthermore, it is interesting to notice that in this article all the results are presented for the off-line stochastic gradient descent, but they present a generalization for the on-line stochastic gradient descent, which is the case that we explicitly consider.


In [10], Sirignano and Spiliopoulos obtain the same results of weak convergence of the distribution, under slightly different assumptions: the activation functions have two continuous and bounded derivatives, the data satisfy (x_k, y_k) ∼_{i.i.d.} π(dx, dy) with π such that E_π[‖x_k‖] + E_π[|y_k|] < ∞, and they suppose that the initial parameters satisfy θ_0 ∼_{i.i.d.} µ_0, with µ_0 such that there exists 0 < q < ∞ for which E[exp(q|a_0^i|)] < C.

In order to study the convergence in time, and in particular to prove existence and uniqueness of the limit for t → ∞, they follow a different route. First they prove a result similar to De Finetti's theorem for exchangeable random variables; this ensures that propagation of chaos holds, that is, for each k ∈ N and f_1, …, f_k ∈ C_b^2(R^D):

$$\lim_{N\to\infty}\big\langle f_1(x^1)\times\cdots\times f_k(x^k),\ \hat\rho_N(dx^1,\cdots,dx^N)\big\rangle = \prod_{i=1}^k \langle f_i, \rho_*\rangle.$$

Then they prove the relative compactness of the family {ρ̂^{(N)}} in D_{P(R^D)}([0, T]), and this gives existence and uniqueness of the limit.

In [11], Chizat and Bach present a general framework that is valid not only for two-layers neural networks, but also for the sparse deconvolution problem and for the low-rank tensor decomposition problem. Furthermore, they take a risk function more general than the mean squared error, supposing that it is convex, differentiable, with Lipschitz differential on bounded sets and with bounded sub-level sets. They assume that there exists a regularization term that is semi-convex and bounded on a nested family (Q_r)_{r≥0} of non-empty closed subsets of the data set such that {u ∈ Ω : dist(u, Q_r) ≤ r_0} ⊆ Q_{r+r_0}. Moreover, on each Q_r the activation function is bounded with Lipschitz differential and, finally, there exist C_1, C_2 such that sup_{u∈Q_r}(‖∇σ_*‖ + ‖∂S‖) ≤ C_1 + C_2, where ‖∂S‖ is the sup norm of the sub-gradient of the regularization function.

Under these assumptions they prove the convergence of the discrete measures to the measure given by the continuous dynamics; but in order to obtain a convergence result when time goes to infinity, they require, as in the other papers, further conditions ensuring a shared homogeneity direction for σ_* and S, together with some good properties of the initial distribution that are preserved along the dynamics.

In practice, they explicitly consider two cases of study:

2-homogeneity: in this case they suppose that the activation function and the regularization function are 2-positively homogeneous, and σ_* must be differentiable with Lipschitz differential. In this case a density property is required for the set of regular values of θ → ⟨f, σ_*(θ)⟩ + H(θ).

1-homogeneity: in this case they assume that σ_* and H are 1-homogeneous with respect to the same parameter, that is σ_*(a, θ) = aσ(θ) and H(a, θ) = |a| H̃(θ), with H̃ bounded and differentiable with Lipschitz differential. Furthermore, in this setting they require the same property of the regular values as in the previous case, together with some boundary condition on σ_*.

In [12], the same authors of [11] consider stronger assumptions than in [11], first of all restricting to the problem of training two-layers neural networks, and considering just the case of 2-homogeneity with different types of norms, namely max-margin norms. Examples of these norms are the variation norm and the RKHS norm. Due to the choice of this norm, this paper of Chizat and Bach is not concerned with the mean squared error, but it is still interesting as it holds for the logistic and the exponential loss. In fact, they consider a general loss l : R → R_+ that is strictly increasing, differentiable with locally Lipschitz-continuous gradient, with exponential tails (that is, l(u) ∼ l′(u) ∼ exp(u) as u → −∞) and such that there exists c > 0 with l′(u) ≥ c for u ≥ 0. Therefore, with the max-margin norm, they find a risk function of the form (1/N)∑_{i=1}^N l(−y_i ŷ_m(x_i; θ)).

1.3 Convergence of the Risk

First of all we want to establish a formal relationship between R_N(θ) and R(ρ): we will prove that the minimum of the asymptotic risk R(ρ) provides a good approximation of the minimum of the finite risk R_N(θ).


Let Ψ(θ; ρ) = E[∂_2 l(y, ∫σ_*(x; θ̄)ρ(dθ̄)) σ_*(x; θ)], with ∂_2 l(y, ȳ) an element of the sub-gradient of l with respect to the second argument.

Observation 2. It is important to observe that, as Ψ is defined using a general element of the sub-gradient, it is not obvious that Ψ is well defined. We can notice that, as l is convex, the set of points where l is not differentiable has zero Lebesgue measure. Therefore, as already observed, if we assume that P(· | y) ≪ µ ∼ Leb, then Ψ is uniquely determined.

For the next result we use the convexity of l seen as a function of y and the prediction ŷ, and not as a function of y, θ, x. Throughout the thesis we make the following assumptions on the loss function.

Assumption 3. Let l : R × R ∋ (y, ŷ) → l(y, ŷ) ∈ R be a continuous and convex loss function, bounded from below, and such that there exists u : R × R → R_{≥0} symmetric and positive semi-definite for which, setting

$$W(\theta_1,\theta_2;\rho) \equiv \mathbb{E}\Big[u\Big(y,\int\sigma_*(x;\theta)\,\rho(d\theta)\Big)\,\sigma_*(x;\theta_1)\,\sigma_*(x;\theta_2)\Big],$$

we have, for each ρ_1, ρ_2 ∈ P(R^D) and for each y ∈ R,

$$\int \big(\Psi(\theta;\rho_1)-\Psi(\theta;\rho_2)\big)(\rho_1-\rho_2)(d\theta) \le \iint \mathbb{E}\Big[u\Big(y,\int\sigma_*(x;\theta)\,\rho_1(d\theta)\Big)\,\sigma_*(x;\theta_1)\,\sigma_*(x;\theta_2)\Big]\,(\rho_1-\rho_2)^{\otimes 2}(d\theta_1,d\theta_2), \tag{1.5}$$

that is, ⟨Ψ(·; ρ_1) − Ψ(·; ρ_2), ρ_1 − ρ_2⟩ ≤ ⟨W(·, ·; ρ_1), (ρ_1 − ρ_2)^{⊗2}⟩.

Under the previous Assumption 3 we can prove the following result.

Proposition 4. Let σ_* be such that ‖(σ_*(x; θ))²‖_∞ ≤ C, and assume that:

(a) inf_ρ R(ρ) is attained at a distribution ρ_* such that

$$\int \mathbb{E}\Big[u\Big(y,\int\sigma_*(x;\bar\theta)\,\rho_*(d\bar\theta)\Big)\,\sigma_*(x;\theta)\,\sigma_*(x;\theta)\Big]\,\rho_*(d\theta) \le K;$$

(b) there exists ε_0 > 0 such that for every ρ ∈ P(R^D) with R(ρ) ≤ inf_ρ R(ρ) + ε_0 we have

$$\int \mathbb{E}\Big[u\Big(y,\int\sigma_*(x;\bar\theta)\,\rho_*(d\bar\theta)\Big)\,\sigma_*(x;\theta)\,\sigma_*(x;\theta)\Big]\,\rho(d\theta) \le K.$$

Then

$$\Big|\inf_\theta R_N(\theta) - \inf_\rho R(\rho)\Big| \le \frac{K\cdot C}{N},$$

and the probability ρ_* is a global minimizer of R if and only if inf_{θ∈R^D} Ψ(θ, ρ_*) > −∞ and

$$\mathrm{Supp}(\rho_*) \subseteq \arg\min_{\theta\in\mathbb{R}^D}\Psi(\theta;\rho_*). \tag{1.6}$$
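The O(1/N) rate in the proposition can be illustrated numerically. The toy setup below is our own choice, not from the thesis: a single data point (x, y) = (1, 0), σ_*(x; θ) = a·tanh(wx + b) with θ = (a, b, w), quadratic loss, and a standard Gaussian ρ in place of ρ_*. In this setup ∫σ_* dρ = 0, so R(ρ) = y² = 0 and the gap E_θ[R_N] − R(ρ) reduces to E_θ[R_N].

```python
import numpy as np

rng = np.random.default_rng(3)
reps = 50_000  # independent draws of the N-particle parameter set

def expected_gap(N):
    # theta_i = (a_i, b_i, w_i) drawn i.i.d. from the fixed Gaussian rho.
    a = rng.normal(size=(reps, N))
    b = rng.normal(size=(reps, N))
    w = rng.normal(size=(reps, N))
    s_bar = (a * np.tanh(w + b)).mean(axis=1)   # prediction at x = 1
    # With y = 0 and squared loss, R_N(theta) = s_bar**2, while R(rho) = 0.
    return (s_bar ** 2).mean()                  # Monte Carlo E_theta[R_N]

g5, g50 = expected_gap(5), expected_gap(50)
print(g5 / g50)   # the gap scales like 1/N, so this ratio is close to 10
```

Multiplying N by 10 divides the estimated gap by roughly 10, matching the K/N bound proved below.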

Proof. We divide the proof in two parts.

1. We show that the minimization of the empirical risk gives results similar to the minimization of its continuous counterpart, namely

$$\Big|\inf_\theta R_N(\theta) - \inf_\rho R(\rho)\Big| \le \frac{K\cdot C}{N}. \tag{1.7}$$

2. We prove that the following is a necessary and sufficient condition for ρ_* to be a minimizer of the risk:

$$\mathrm{Supp}(\rho_*) \subseteq \arg\min_{\theta\in\mathbb{R}^D}\Psi(\theta;\rho_*). \tag{1.8}$$

Firstly, we can observe that R_N(θ) ≥ inf_ρ R(ρ) for every N ∈ N and every θ: indeed, considering the distribution ρ̂_N := (1/N)∑_{i=1}^N δ_{θ_i}, we find R(ρ̂_N) = R_N(θ). We now prove first (1) and then (2). For simplicity of notation we write ∫σ_*ρ ≡ ∫σ_*(x; θ)ρ(dθ). If we consider (θ_i)_{i≤N} ∼ ρ_* i.i.d., we obtain

$$\mathbb{E}_\theta[R_N] - R(\rho_*) = \mathbb{E}_\theta\Big[\mathbb{E}\Big[l\Big(y,\frac{1}{N}\sum_{i=1}^N\sigma_*\Big)\Big]\Big] - \mathbb{E}\Big[l\Big(y,\int\sigma_*\rho_*\Big)\Big] = \mathbb{E}_\theta\Big[\mathbb{E}\Big[l\Big(y,\frac{1}{N}\sum_{i=1}^N\sigma_*\Big) - l\Big(y,\int\sigma_*\rho_*\Big)\Big]\Big].$$

Using the convexity of l in the second argument and the fact that v is a sub-gradient of l (with respect to the second argument), we find the inequality

$$l\Big(y,\frac{1}{N}\sum_{i=1}^N\sigma_*\Big) - l\Big(y,\int\sigma_*\rho_*\Big) \le v\Big(y,\frac{1}{N}\sum_{i=1}^N\sigma_*\Big)\Big(\frac{1}{N}\sum_{i=1}^N\sigma_* - \int\sigma_*\rho_*\Big),$$

from which it follows that

$$\begin{aligned}
\mathbb{E}_\theta[R_N] - R(\rho_*) &\le \mathbb{E}_\theta\Big[\mathbb{E}\Big[v\Big(y,\tfrac{1}{N}\textstyle\sum\sigma_*\Big)\Big(\tfrac{1}{N}\textstyle\sum\sigma_* - \int\sigma_*\rho_*\Big)\Big]\Big]\\
&= \mathbb{E}_\theta\Big[\mathbb{E}\Big[v\Big(y,\int\sigma_*\rho_*\Big)\Big(\tfrac{1}{N}\textstyle\sum\sigma_* - \int\sigma_*\rho_*\Big)\Big]\Big] + \mathbb{E}_\theta\Big[\mathbb{E}\Big[\Big(v\Big(y,\int\sigma_*\rho_*\Big)-v\Big(y,\tfrac{1}{N}\textstyle\sum\sigma_*\Big)\Big)\Big(\int\sigma_*\rho_* - \tfrac{1}{N}\textstyle\sum\sigma_*\Big)\Big]\Big].
\end{aligned}$$

At this point, using inequality (1.5), we have

$$\mathbb{E}_\theta[R_N] - R(\rho_*) \le \mathbb{E}_\theta\Big[\mathbb{E}\Big[v\Big(y,\int\sigma_*\rho_*\Big)\Big(\tfrac{1}{N}\textstyle\sum\sigma_* - \int\sigma_*\rho_*\Big)\Big]\Big] + \mathbb{E}_\theta\Big[\mathbb{E}\Big[u\Big(y,\int\sigma_*\rho_*\Big)\Big(\int\sigma_*\rho_* - \tfrac{1}{N}\textstyle\sum\sigma_*\Big)^2\Big]\Big].$$

Let us now observe that

$$\mathbb{E}_\theta\Big[\mathbb{E}\Big[v\Big(y,\int\sigma_*\rho_*\Big)\Big(\frac{1}{N}\sum_{i=1}^N\sigma_* - \int\sigma_*\rho_*\Big)\Big]\Big] = 0,$$

indeed

$$\mathbb{E}_\theta\Big[\mathbb{E}\Big[v\Big(y,\int\sigma_*\rho_*\Big)\frac{1}{N}\sum_{i=1}^N\sigma_*\Big]\Big] = \mathbb{E}\Big[v\Big(y,\int\sigma_*\rho_*\Big)\frac{1}{N}\sum_{i=1}^N\mathbb{E}_\theta[\sigma_*(x;\theta_i)]\Big] = \mathbb{E}\Big[v\Big(y,\int\sigma_*\rho_*\Big)\int\sigma_*\rho_*\Big].$$

So, expanding the square and using that the θ_i are i.i.d. with law ρ_*, we obtain

$$\begin{aligned}
\mathbb{E}_\theta[R_N] - R(\rho_*) &\le \mathbb{E}_\theta\Big[\mathbb{E}\Big[u\Big(y,\int\sigma_*\rho_*\Big)\Big(\int\sigma_*\rho_* - \frac{1}{N}\sum_{i=1}^N\sigma_*\Big)^2\Big]\Big]\\
&= -\mathbb{E}\Big[u\Big(y,\int\sigma_*\rho_*\Big)\Big(\int\sigma_*\rho_*\Big)^2\Big] + \frac{1}{N}\int\mathbb{E}\Big[u\Big(y,\int\sigma_*\rho_*\Big)(\sigma_*(x;\theta))^2\Big]\rho_*(d\theta)\\
&\quad + \frac{N(N-1)}{N^2}\iint\mathbb{E}\Big[u\Big(y,\int\sigma_*\rho_*\Big)\sigma_*(x;\theta)\,\sigma_*(x;\bar\theta)\Big]\rho_*(d\theta)\,\rho_*(d\bar\theta).
\end{aligned}$$

Since the last double integral equals E[u(y, ∫σ_*ρ_*)(∫σ_*ρ_*)²], we conclude that

$$\mathbb{E}_\theta[R_N] - R(\rho_*) \le \frac{1}{N}\Big(\int\mathbb{E}\Big[u\Big(y,\int\sigma_*\rho_*\Big)(\sigma_*(x;\theta))^2\Big]\rho_*(d\theta) - \mathbb{E}\Big[u\Big(y,\int\sigma_*\rho_*\Big)\Big(\int\sigma_*\rho_*\Big)^2\Big]\Big) \le \frac{1}{N}\int\mathbb{E}\Big[u\Big(y,\int\sigma_*\rho_*\Big)(\sigma_*(x;\theta))^2\Big]\rho_*(d\theta) \le \frac{K}{N}.$$

So, as inf_θ R_N(θ) − R(ρ_*) ≤ E_θ[R_N(θ)] − R(ρ_*) ≤ K/N holds for each N, we obtain

$$0 \le \inf_\theta R_N(\theta) - R(\rho_*) \le \frac{K}{N},$$

and from this point (1) of the proof follows.

Ψ is lower semi-continuous with values in (−∞, +∞]; in particular, it follows that the set S_0(ρ) := arg min_θ Ψ(θ; ρ) is closed.

Now, considering the minimizer ρ_*, we want to prove that it satisfies Supp(ρ_*) ⊆ arg min_θ Ψ(θ; ρ_*) =: S_0(ρ_*). With computations similar to those of point (1), considering a generic ρ in place of the empirical measure, we find that

$$R(\rho) - R(\rho_*) \le \mathbb{E}\Big[u\Big(y,\int\sigma_*\rho_*\Big)\Big(\int\sigma_*(x;\theta)(\rho-\rho_*)(d\theta)\Big)^2\Big] + \mathbb{E}\Big[v\Big(y,\int\sigma_*\rho_*\Big)\int\sigma_*(x;\theta)(\rho-\rho_*)(d\theta)\Big] = \langle W(\cdot,\cdot,\rho_*), (\rho-\rho_*)^{\otimes 2}\rangle + \langle \Psi(\cdot,\rho_*), \rho-\rho_*\rangle, \tag{1.9–1.11}$$

where W(θ_1, θ_2, ρ) := E[u(y, ∫σ_*(x; θ̄)ρ(dθ̄)) σ_*(x; θ_1) σ_*(x; θ_2)].

Let us now study separately the two cases Ψ_* > −∞ and Ψ_* = −∞, where Ψ_* := inf_θ Ψ(θ, ρ_*). Assume first that Ψ_* > −∞, so that S_0(ρ_*) is a closed non-empty set. Let us consider θ_1 ∈ S_0(ρ_*) and assume by contradiction that there exists θ_0 ∈ Supp(ρ_*) such that θ_0 ∉ S_0(ρ_*). If B(θ_0, ε) is the ball of radius ε around θ_0, by lower semi-continuity there exist ε_0, Δ > 0 such that

$$\inf_{\theta\in B(\theta_0,\varepsilon_0)}\Psi(\theta;\rho_*) = \Psi_* + \Delta > \Psi_*.$$

Set t_0 := ρ_*(B(θ_0, ε_0)) > 0 and ν := 1_{B(θ_0, ε_0)} ρ_* / t_0. So, for all t ∈ [0, t_0], we can consider the probability measure

$$\rho_t = \rho_* - t\nu + t\delta_{\theta_1}.$$

Using the inequality obtained in (1.11), we find that

$$R(\rho_t) - R(\rho_*) \le \langle \Psi(\cdot,\rho_*),\, \delta_{\theta_1} - \nu\rangle\, t + \langle W(\cdot,\cdot,\rho_*),\, (\delta_{\theta_1} - \nu)^{\otimes 2}\rangle\, t^2 \le (\Psi_* - \Psi_* - \Delta)t + C_0 t^2 = -\Delta t + C_0 t^2, \tag{1.12–1.13}$$

where the second inequality follows from the fact that W is continuous and δ_{θ_1}, ν have compact support. Therefore, taking t small enough, we find R(ρ_t) < R(ρ_*), a contradiction.

Let us now consider the case Ψ_* := inf_θ Ψ(θ, ρ_*) = −∞. For every M ∈ N, let θ_M ∈ R^D be such that Ψ(θ_M, ρ_*) ≤ −M. As in the first part of the proof, let us assume by contradiction that there exists θ_0 ∈ Supp(ρ_*) with θ_0 ∉ S_0(ρ_*). Considering ν as above, we can define, for all t ∈ [0, t_0],

$$\rho_{M,t} = \rho_* - t\nu + t\delta_{\theta_M},$$

and for this choice

$$R(\rho_{M,t}) - R(\rho_*) \le \langle \Psi(\cdot,\rho_*),\, \delta_{\theta_M} - \nu\rangle\, t + \langle W(\cdot,\cdot,\rho_*),\, (\delta_{\theta_M} - \nu)^{\otimes 2}\rangle\, t^2 \le -(M + \Psi_0)t + C_0 M t^2, \tag{1.14–1.15}$$

where Ψ_0 := inf_{θ∈B(θ_0,ε_0)} Ψ(θ, ρ_*). So, choosing t ≡ t_M = min(t_0, (M + Ψ_0)/(C_0 M)) (which is positive for M big enough), we find that R(ρ_{M,t_M}) < R(ρ_*) for all M big enough, and from this the contradiction follows.

To conclude, it remains to prove that Supp(ρ_*) ⊆ arg min_θ Ψ(θ; ρ_*) is a sufficient condition for ρ_* to be a minimizer. We define Ψ_* := min_θ Ψ(θ; ρ_*) and, using the convexity of l, we obtain

$$R(\rho) - R(\rho_*) = \mathbb{E}\Big[l\Big(y,\int\sigma_*(x;\theta)\rho(d\theta)\Big) - l\Big(y,\int\sigma_*\rho_*\Big)\Big] \ge \int\mathbb{E}\Big[\partial_2 l\Big(y,\int\sigma_*\rho_*\Big)\sigma_*(x;\theta)\Big](\rho-\rho_*)(d\theta) = \langle \Psi(\cdot,\rho_*),\, \rho-\rho_*\rangle. \tag{1.16–1.18}$$

Since constants integrate to zero against ρ − ρ_*, this gives

$$R(\rho) \ge R(\rho_*) + \langle \Psi(\cdot,\rho_*),\, \rho-\rho_*\rangle = R(\rho_*) + \langle \Psi(\cdot,\rho_*) - \Psi_*,\, \rho-\rho_*\rangle.$$

Now take µ = Ψ(·, ρ_*) − Ψ_*, which is a non-negative measurable function, so that ⟨µ, ρ⟩ ≥ 0 for every ρ. Observing that if Supp(ρ_*) ⊆ arg min_θ Ψ(·; ρ_*) then ⟨Ψ(·, ρ_*) − Ψ_*, ρ_*⟩ = 0, we find R(ρ) ≥ R(ρ_*).

Observation 5. It is easy to see that if v is convex (respectively concave), then, taking u(y, ŷ) ∈ ∂_2 v(y, ŷ) (respectively −u such that u(y, ŷ) ∈ ∂̄_2 v(y, ŷ), the super-gradient) for each y, ŷ, u satisfies property (1.5). In fact, if v is convex (in the second argument), then we have

$$v(y, y_1) - v(y, y_2) \le u(y, y_1)(y_1 - y_2),$$

with u ≥ 0. Similarly, if v is concave, then

$$v(y, y_1) - v(y, y_2) \le -u(y, y_1)(y_1 - y_2),$$

and in this case −u is positive semi-definite.

Observation 6. We will show that l(y, ŷ) = (y − ŷ)², l(y, ŷ) = |y − ŷ| and l(y, ŷ) = −p(y)log(ŷ) satisfy Assumption 3 of Proposition 4.

l(y, ŷ) = (y − ŷ)²: we have l ∈ C^∞, so v = ∂_2 l = −2(y − ŷ) and u = ∂²_2 l = 2 (which is obviously bounded). It is immediate to observe that l and v are convex functions (in the second argument), for example because ∂²_2 l ≥ 0. It remains only to see that θ → Ψ(θ; ρ) is l.s.c., but in this case it follows immediately from Fatou's lemma, because Ψ(θ; ρ) = V(θ) + ∫U(θ, θ′)ρ(dθ′), where V(·) is a continuous function and U(·, ·) is bounded from below.

l(y, ŷ) = −p(y)log(ŷ): for the cross-entropy we can assume without loss of generality that ŷ, p(y) ∈ [0, 1]. It is easy to see that l is a convex function of ŷ (because log(ŷ) is concave). A simple calculation shows that ∂_2 l(y, ŷ) = −p(y)/ŷ and ∂²_2 l(y, ŷ) = p(y)/ŷ². Observing that p(y) ∈ [0, 1], it is easy to see that the function ŷ → −p(y)/ŷ is concave for ŷ ∈ (0, 1]. Moreover, even if E[∂²_2 l(y, ∫σ_*(x; θ)ρ_*)(σ_*(x; θ))²] is not bounded, it is simple to see that there exists ε_0 that satisfies condition (b) of Proposition 4. For the cross-entropy, Ψ(θ; ρ) = E[−(p(y)/∫σ_*(x; θ′)ρ(dθ′)) σ_*(x; θ)] is l.s.c. in θ.

l(y, ŷ) = |y − ŷ|: in this case l ∈ C⁰ but l ∉ C¹; we can however consider

$$v(y,\hat y) = \begin{cases} 1 & \text{if } \hat y > y,\\ -1 & \text{if } \hat y \le y, \end{cases} \qquad v(y,\hat y)\in\partial_2 l(y,\hat y).$$

It is easy to see that l is convex; moreover, v satisfies the following inequality:

$$\Big(v\Big(y,\int\sigma_*(x;\theta)\rho_*\Big) - v\Big(y,\int\sigma_*(x;\theta)\rho\Big)\Big)\Big(\int\sigma_*(x;\theta)\rho_*(d\theta) - \int\sigma_*(x;\theta)\rho(d\theta)\Big) \le 2\Big|\int\sigma_*(x;\theta)\rho_*(d\theta) - \int\sigma_*(x;\theta)\rho(d\theta)\Big|. \tag{1.19}$$

In fact, v is an increasing function, so

$$\Big(v\Big(y,\int\sigma_*\rho_*\Big) - v\Big(y,\int\sigma_*\rho\Big)\Big)\Big(\int\sigma_*\rho_* - \int\sigma_*\rho\Big) \ge 0,$$

and therefore

$$\Big(v\Big(y,\int\sigma_*\rho_*\Big) - v\Big(y,\int\sigma_*\rho\Big)\Big)\Big(\int\sigma_*\rho_* - \int\sigma_*\rho\Big) = \Big|\Big(v\Big(y,\int\sigma_*\rho_*\Big) - v\Big(y,\int\sigma_*\rho\Big)\Big)\Big(\int\sigma_*\rho_* - \int\sigma_*\rho\Big)\Big| \le \Big(\Big|v\Big(y,\int\sigma_*\rho_*\Big)\Big| + \Big|v\Big(y,\int\sigma_*\rho\Big)\Big|\Big)\Big|\int\sigma_*\rho_* - \int\sigma_*\rho\Big| \le 2\Big|\int\sigma_*\rho_* - \int\sigma_*\rho\Big|.$$

Observing that for each ρ we have ∫σ_*(x; θ)ρ(dθ) ∈ R, and remembering that the norms ‖·‖₁ and ‖·‖₂ are equivalent on R, we can find a constant C₂ such that |∫σ_*ρ_*(dθ) − ∫σ_*ρ(dθ)| ≤ C₂(∫σ_*ρ_*(dθ) − ∫σ_*ρ(dθ))², so (1.19) follows immediately and we obtain (1.5) with u ≡ 2C₂.

For the mean absolute error, Ψ(θ; ρ) = E[v(y, ∫σ_*(x; θ′)ρ(dθ′)) σ_*(x; θ)] is l.s.c. as a product of a l.s.c. function and a constant term that does not depend on θ.
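The three sub-gradients used above can be written down explicitly, and the defining sub-gradient inequality l(y, ŷ_1) − l(y, ŷ_2) ≤ v(y, ŷ_1)(ŷ_1 − ŷ_2) checked numerically. In the sketch below the sampling ranges and the choice p(y) = y for the cross-entropy are our own, purely for illustration:

```python
import numpy as np

# Sub-gradients (in the second argument) of the three losses of Observation 6;
# for the cross-entropy we take p(y) = y with y_hat in (0, 1].
def v_mse(y, p): return -2.0 * (y - p)          # derivative of (y - p)^2
def v_mae(y, p): return 1.0 if p > y else -1.0  # an element of the sub-gradient of |y - p|
def v_ce(y, p):  return -y / p                  # derivative of -y * log(p)

# Numerical check of the sub-gradient inequality for convex l:
# l(y, p1) - l(y, p2) <= v(y, p1) * (p1 - p2).
rng = np.random.default_rng(4)
y = 0.7
cases = [
    (lambda y, p: (y - p) ** 2,   v_mse, -2.0, 2.0),
    (lambda y, p: abs(y - p),     v_mae, -2.0, 2.0),
    (lambda y, p: -y * np.log(p), v_ce,  0.05, 1.0),
]
for l, v, lo, hi in cases:
    for _ in range(1000):
        p1, p2 = rng.uniform(lo, hi, size=2)
        assert l(y, p1) - l(y, p2) <= v(y, p1) * (p1 - p2) + 1e-9
```

The assertions pass for all three losses on their respective domains, consistently with the convexity used throughout the proof of Proposition 4.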

We often find empirically that the optimal density ρ_* is supported on a set of Lebesgue measure 0 (sometimes on a finite set of points). The following consequence of the previous results partially explains these findings.

Corollary 7. Assume θ → Ψ(θ; ρ_*) is an analytic function, namely there exists a locally bounded function θ → B(θ) such that ‖∇_θ^k Ψ(θ)‖ ≤ k! B(θ)^k for all k, θ. If ρ_* is a minimizer of R(ρ), then one of the following holds:

(a) Ψ(θ, ρ_*) = Ψ_* for some constant Ψ_* and all θ ∈ R^D;

(b) the support of ρ_* has zero Lebesgue measure.

If D = 1, then (b) can be replaced with:

(b') ρ_* is a convex combination of countably many point masses with no accumulation point (finitely many if Ψ(θ; ρ_*) → ∞ as |θ| → ∞).

Proof. Let Ψ∗ = minθ∈RDΨ(θ; ρ). We have that θ → Ψ(θ; ρ) is analytic and so is θ → Ψ(θ; ρ∗) − Ψ∗. Since Supp(ρ∗) ⊆ {θ : Ψ(θ; ρ∗) = Ψ∗}, the

claim follows from the fact that the set of zeros of a non-trivial analytic function has vanishing Lebesgue measure [3]. In the case D = 1, the set of zeros of an analytic function cannot have any accumulation point [4], which therefore allows to replace (b) with (b0).


Chapter 2

Weak Convergence of the

Distributions

We want to analyse the convergence of stochastic gradient descent through a continuous dynamic; the first step is therefore to show that the discrete measure converges to the continuous distribution. This result can be seen as a Law of Large Numbers for neural networks: in particular, it tells us that the discrete distribution converges weakly to the distribution described by a partial differential equation, called the distributional dynamics.

2.1 Noiseless SGD

First we need to define sub-Gaussian random variables: random variables whose distribution (known as a sub-Gaussian distribution) has strong tail decay. Informally, the tails of a sub-Gaussian distribution are dominated by the tails of a Gaussian.

Definition 8. A random variable X is called sub-Gaussian if there exist positive constants C, v > 0 such that for every t > 0 the probability distribution P of X satisfies

P(|X| > t) ≤ C e^{−vt²}.  (2.1)

A probability with the previous property is known as a sub-Gaussian probability distribution.

Sub-Gaussian random variables form a Birnbaum-Orlicz space with the norm

‖X‖_{ψ₂} = inf{ s > 0 : E[e^{(X/s)²}] − 1 ≤ 1 }.
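As a numerical aside (not part of the thesis), the tail condition (2.1) can be checked empirically: a standard Gaussian is sub-Gaussian with, for instance, C = 2 and v = 1/2, since P(|X| > t) ≤ 2e^{−t²/2} by the standard Gaussian tail bound. A minimal Python sketch:

```python
import math
import random

def tail_prob(samples, t):
    """Empirical estimate of P(|X| > t)."""
    return sum(1 for x in samples if abs(x) > t) / len(samples)

random.seed(0)
# N(0, 1) is sub-Gaussian: P(|X| > t) <= 2 exp(-t^2 / 2),
# so condition (2.1) holds with C = 2 and v = 1/2.
samples = [random.gauss(0.0, 1.0) for _ in range(100_000)]

C, v = 2.0, 0.5
for t in (0.5, 1.0, 2.0, 3.0):
    emp, bound = tail_prob(samples, t), C * math.exp(-v * t * t)
    assert emp <= bound, (t, emp, bound)
```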

For the next results, the following lemma for sub-Gaussian random variables will be useful.

Lemma 9 (Azuma-Hoeffding inequality). Let (X_k)_{k≥0} be a martingale taking values in R^d with respect to the filtration (F_k)_{k≥0}, with X₀ = 0. Assume that the following holds almost surely for all k ≥ 1:

E[e^{⟨λ, X_k − X_{k−1}⟩} | F_{k−1}] ≤ e^{L²‖λ‖²/2}.  (2.2)

Then we have

P( max_{k≤n} ‖X_k‖₂ ≥ 2L√n (√d + t) ) ≤ e^{−t²}.

Proof. Let Z_k = X_k − X_{k−1} be the martingale differences. By the sub-Gaussian condition (2.2) we get

E[e^{⟨λ, X_n⟩}] = E[ E[e^{⟨λ, X_n⟩} | F_{n−1}] ] = E[ E[e^{⟨λ, Z_n⟩} | F_{n−1}] e^{⟨λ, X_{n−1}⟩} ] ≤ e^{L²‖λ‖²/2} E[e^{⟨λ, X_{n−1}⟩}],

and by iterating the inequality we find E[e^{⟨λ, X_n⟩}] ≤ e^{nL²‖λ‖₂²/2}. Letting G ∼ N(0, I_d) be a standard Gaussian vector and ξ ≥ 0,

E[e^{ξ‖X_n‖₂²/2}] = E_G[ E[e^{√ξ ⟨G, X_n⟩}] ] ≤ E_G[e^{nL²ξ‖G‖₂²/2}] = (1 − nL²ξ)^{−d/2}.

By the Markov inequality, setting ξ = 1/(2nL²), we get, for all t ≥ 0,

P( ‖X_n‖₂ ≥ 2L√n (√d + t) ) ≤ e^{d/2 − (√d + t)²} ≤ e^{−t²}.

Finally, consider the stopping time τ = min{k : ‖X_k‖₂ ≥ 2L√n (√d + t)} and the stopped martingale X̄_k = X_{k∧τ}. Since {max_{k≤n} ‖X_k‖₂ ≥ 2L√n (√d + t)} = {‖X̄_n‖₂ ≥ 2L√n (√d + t)}, the claim follows by applying the previous inequality to X̄_n.

Assumption 10. In this section, and for many other results in this thesis, we consider the SGD dynamics with step size s_k = εξ(kε) under the assumptions:

A1 t ↦ ξ(t) is bounded and Lipschitz: ‖ξ‖∞, ‖ξ‖_Lip ≤ K₁, with ∫₀^∞ ξ(t) dt = ∞.

A2 The activation function (x; θ) ↦ σ*(x; θ) is bounded, with sub-Gaussian gradient: ‖σ*‖∞ ≤ K₂, ‖∇_θσ*(x; θ)‖_{ψ₂} ≤ K₂. Labels are bounded: |y_k| ≤ K₂.

A3 Let

V(θ) := Ψ(θ; ρ*) = E[ v(y, ∫σ*(x; θ̄)ρ*(dθ̄)) σ*(x; θ) ]

and

W(θ₁, θ₂; ρ) := E[ u(y, ∫σ*(x; θ̄)ρ(dθ̄)) σ*(x; θ₁) σ*(x; θ₂) ].

The gradients θ ↦ ∇V(θ), (θ₁, θ₂) ↦ ∇_{θ₁}W(θ₁, θ₂; ρ) are bounded and Lipschitz continuous, namely

‖∇_θV(θ)‖₂ ≤ K₃,  ‖∇_{θ₁}W(θ₁, θ₂; ρ)‖₂ ≤ K₃,

and

‖∇_θV(θ) − ∇_θV(θ′)‖₂ ≤ K₃‖θ − θ′‖₂,  ‖∇_{θ₁}W(θ₁, θ₂; ρ) − ∇_{θ₁}W(θ₁′, θ₂′; ρ)‖₂ ≤ K₃‖(θ₁, θ₂) − (θ₁′, θ₂′)‖₂.

We can observe that, instead of (A1), we can assume without loss of generality that ξ ≡ const.

We also introduce the following error term, which quantifies in a non-asymptotic sense the accuracy of our PDE model:

err_{N,D}(z) ≡ √(1/N ∨ ε) · [ √(D + log(N/ε)) + z ].  (2.3)

Denoting by ρ̂ₖ^{(N)} = N⁻¹ Σ_{i=1}^N δ_{θᵢᵏ} the empirical distribution of the parameters after k SGD steps, we will prove that ρ̂_{⌊t/ε⌋}^{(N)} ⇒ ρ_t when N → ∞, ε → 0. Here ⇒ denotes weak convergence. The asymptotic dynamics of ρ_t is defined by the following PDE, which we shall refer to as the distributional dynamics (DD):

∂_t ρ_t = ξ(t) ∇_θ·(ρ_t ∇_θΨ(θ; ρ_t)),   Ψ(θ; ρ) = δR(ρ)/δρ(θ) = E[ ∂₂l(y, ∫σ*(x; θ̄)ρ(dθ̄)) σ*(x; θ) ].  (2.4)

Here ∇_θ · v(θ) denotes the divergence of the vector field v(θ). In particular, (ρ_t)_{t≥0} is a weak solution of (2.4); that is, for any ζ ∈ C₀^∞(R × R^D) we have

∫_{R^D} ρ₀(θ) ζ₀(θ) dθ = −∫_{(0,∞)×R^D} [ ∂_t ζ_t(θ) − ξ(t) ⟨∇_θΨ(θ; ρ_t), ∇_θζ_t(θ)⟩ ] ρ_t(dθ) dt.  (2.5)

Observation 11. Indeed, let ϕ : R^D → R be a test function with bounded gradient; using the divergence theorem and observing that the integral over the boundary is zero, we have

d/dt ⟨ρ_t, ϕ⟩ = ξ(t) ∫ ϕ(θ) ∇·[ρ_t(θ) ∇Ψ(θ; ρ_t)] dθ = −ξ(t) ∫ ⟨∇ϕ(θ), ∇Ψ(θ; ρ_t)⟩ ρ_t(θ) dθ.

As in [14, Proposition 4.2], it is elementary to notice that this weak formulation is equivalent to the one introduced in (2.5).

We have that (2.4) can be interpreted as an evolution equation for the risk function in P(R^D); in particular, it can be viewed as a gradient flow for the cost function in the space (P(R^D), W₂) of probability measures on R^D endowed with the Wasserstein metric, defined as

W₂(ρ₁, ρ₂) = ( inf_{γ∈C(ρ₁,ρ₂)} ∫ ‖θ₁ − θ₂‖₂² γ(dθ₁, dθ₂) )^{1/2}.  (2.6)
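As an illustrative aside (not part of the thesis): for two empirical measures on R with the same number of atoms, the infimum in (2.6) is attained by the monotone coupling that matches sorted samples, which gives a quick way to evaluate W₂ numerically in one dimension. A minimal Python sketch:

```python
import math

def w2_empirical_1d(xs, ys):
    """W2 distance between two empirical measures (1/n) sum_i delta_{x_i} on R.

    In one dimension the optimal coupling is monotone: it pairs the i-th
    smallest atom of one measure with the i-th smallest atom of the other.
    """
    assert len(xs) == len(ys)
    xs, ys = sorted(xs), sorted(ys)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xs, ys)) / len(xs))

# Translating a measure by c moves it by exactly W2 = |c|.
mu = [0.0, 1.0, 2.0, 3.0]
nu = [x + 0.5 for x in mu]
print(w2_empirical_1d(mu, nu))  # -> 0.5
```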

Moreover, (2.4) is an evolution in the space of probability measures on R^D.


Now we can observe that, by Theorem 1.1 of [5], assumptions A1 and A3 are sufficient for the existence and uniqueness of the solution of the DD.

A very useful tool for the analysis of this PDE is provided by the following nonlinear dynamics. We introduce trajectories (θ̄ᵢᵗ)_{1≤i≤N, t∈R≥0} by letting θ̄ᵢ⁰ = θᵢ⁰ be the same initialization as for the SGD and, for t ≥ 0 (here P_X denotes the law of the random variable X):

θ̄ᵢᵗ = θᵢ⁰ − ∫₀ᵗ ξ(s) ∇Ψ(θ̄ᵢˢ; ρ_s) ds,   ρ_s = P_{θ̄ᵢˢ}.  (2.7)

This can be regarded as an equation for the law of the trajectory (θ̄ᵢᵗ)_{t∈R≥0}, with boundary condition determined by θ̄ᵢ⁰ ∼ ρ₀. As implied by Theorem 1.1 of [5], under the same assumptions A1 and A3, the nonlinear dynamics has a unique solution ρ_t that satisfies (2.4).

We now prove a first result that gives some continuity properties of the distribution solving (2.5) and of the associated solution of the nonlinear dynamics. This result will be useful for the main theorem of this chapter, which ensures the continuity (in time) of the distributions associated to the distributional dynamics in the topology of weak convergence, and a Lipschitz continuity for the trajectories of the nonlinear dynamics.

Lemma 12. Assume that conditions A1 and A3 hold. Let (ρ_t)_{t≥0} be the solution of the PDE (2.4). Let (θ̄ᵢᵗ)_{t≥0} be the solution of the nonlinear dynamics (2.7). Then t ↦ θ̄ᵢᵗ is K₁K₃-Lipschitz continuous, and t ↦ ρ_t is K₁K₃-Lipschitz continuous in the W₂ Wasserstein distance, with K₁ and K₃ as per conditions A1 and A3. In particular, t ↦ ρ_t is continuous in the topology of weak convergence.

Proof. Since ξ and ∇Ψ are bounded by K₁ and K₃ respectively, t ↦ θ̄ᵢᵗ is K₁K₃-Lipschitz continuous. Further, for the bounded Lipschitz distance between probability measures d_BL, defined as

d_BL(µ, ν) = sup{ ∫ f(x) µ(dx) − ∫ f(x) ν(dx) : ‖f‖∞ ≤ 1, ‖f‖_Lip ≤ 1 },


we have that

d_BL(µ, ν) ≤ 2 inf_{γ∈C(µ,ν)} ∫ (‖x − y‖₂ ∧ 1) γ(dx, dy) ≤ 4 d_BL(µ, ν).  (2.8)

So (2.8) implies that t ↦ ρ_t is Lipschitz continuous in the W₂ Wasserstein distance, namely

d_BL(ρ_t, ρ_s) ≤ 2 W₂(ρ_t, ρ_s) ≤ 2 (E‖θ̄ᵢᵗ − θ̄ᵢˢ‖₂²)^{1/2} ≤ 2K₁K₃ |t − s|.

We notice that, under the nonlinear dynamics, the trajectories (θ̄₁ᵗ)_{t∈R≥0}, …, (θ̄_Nᵗ)_{t∈R≥0} are independent and identically distributed. In particular, this implies that, almost surely,

(1/N) Σ_{i=1}^N δ_{θ̄ᵢᵗ} ⇒ ρ_t.
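This law-of-large-numbers statement is easy to visualize numerically: for iid draws from a fixed ρ_t, empirical averages of a bounded Lipschitz test function approach the corresponding integral as N grows. A toy sketch with illustrative choices not taken from the thesis (ρ_t taken to be N(0, 1) and f = tanh, for which ∫ f dρ_t = 0 by symmetry):

```python
import math
import random

random.seed(2)

def empirical_mean(f, n):
    """(1/N) sum_i f(theta_i) for theta_i drawn iid from rho_t = N(0, 1)."""
    return sum(f(random.gauss(0.0, 1.0)) for _ in range(n)) / n

# f = tanh is bounded and Lipschitz, and E[tanh(Z)] = 0 for Z ~ N(0, 1)
# by symmetry, so the empirical averages should shrink toward 0 as N grows.
for n in (100, 10_000, 100_000):
    print(n, abs(empirical_mean(math.tanh, n)))
```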

Starting from this observation, we now prove the weak convergence of the distributions obtained through stochastic gradient descent to the solution of the distributional dynamics. Formally, the following result holds.

Theorem 13. Assume that conditions A1, A2, A3 hold. For ρ₀ ∈ P(R^D), consider SGD with initialization (θᵢ⁰)_{i≤N} ∼_iid ρ₀ and step size s_k = εξ(kε). For t ≥ 0, let ρ_t be the solution of the PDE. Then, for any fixed t ≥ 0, ρ̂_{⌊t/ε⌋}^{(N)} ⇒ ρ_t almost surely along any sequence (N, ε = ε_N) such that N → ∞, ε_N → 0, N/log(N/ε_N) → ∞ and ε_N log(N/ε_N) → 0. Further, there exists a constant C (depending uniquely on the parameters Kᵢ of conditions A1–A3) such that, for any f : R^D → R with ‖f‖∞, ‖f‖_Lip ≤ 1 and any ε ≤ 1,

sup_{k∈[0,T/ε]∩N} | (1/N) Σ_{i=1}^N f(θᵢᵏ) − ∫ f(θ) ρ_{kε}(dθ) | ≤ C e^{CT} err_{N,D}(z),

sup_{k∈[0,T/ε]∩N} | R_N(θᵏ) − R(ρ_{kε}) | ≤ C e^{CT} err_{N,D}(z),

with probability 1 − e^{−z²}.


The convergence of the SGD process to the PDE model is an example of a phenomenon known in probability theory as propagation of chaos [5]. Throughout this proof, we will use K to denote generic constants depending on the constants K₁, K₂, K₃ of conditions A1, A2 and A3. It is convenient to introduce the notation z_k = (x_k, y_k) for the k-th example and to define

Fᵢ(θ; z_k) = −v(y_k, ŷ(x_k; θ)) ∇_{θᵢ}σ*(x_k; θᵢ),   θ = (θᵢ)_{i≤N} ∈ R^{D×N},

G(θ; ρ) = −∇Ψ(θ; ρ),   θ ∈ R^D.

Observation 14. We can note that assumption (A3) implies that G(θ; ρ) is bounded and Lipschitz continuous in θ and ρ, i.e.

‖G(θ; ρ)‖₂ ≤ K,  ‖G(θ₁; ρ) − G(θ₂; ρ)‖₂ ≤ K ‖θ₁ − θ₂‖₂,

and

‖G(θ; ρ₁) − G(θ; ρ₂)‖₂ ≤ K d_BL(ρ₁, ρ₂).

In fact we have the following.

• For the first bound we obtain

‖G(θ; ρ)‖₂ = ‖∇Ψ(θ; ρ)‖₂ ≤ ‖∇Ψ(θ; ρ*)‖₂ + ‖∇Ψ(θ; ρ) − ∇Ψ(θ; ρ*)‖₂
≤ ‖∇Ψ(θ; ρ*)‖₂ + max_{ρ_M∈{ρ,ρ*}} ‖∫ ∇₁W(θ, θ̄; ρ_M) (ρ − ρ*)(dθ̄)‖₂
≤ ‖∇Ψ(θ; ρ*)‖₂ + 2 max_{ρ_M∈{ρ,ρ*}} ‖∫ ∇₁W(θ, θ̄; ρ_M) ρ_M(dθ̄)‖₂ ≤ 3K.

• In a similar way,

‖G(θ₁; ρ) − G(θ₂; ρ)‖₂ ≤ ‖∇Ψ(θ₂; ρ*) − ∇Ψ(θ₁; ρ*)‖₂ + max_{ρ_M∈{ρ,ρ*}} ‖∫ (∇₁W(θ₂, θ̄; ρ_M) − ∇₁W(θ₁, θ̄; ρ_M)) (ρ − ρ*)(dθ̄)‖₂
≤ ‖∇Ψ(θ₂; ρ*) − ∇Ψ(θ₁; ρ*)‖₂ + 2 max_{ρ_M∈{ρ,ρ*}} ‖∫ (∇₁W(θ₂, θ̄; ρ_M) − ∇₁W(θ₁, θ̄; ρ_M)) ρ_M(dθ̄)‖₂ ≤ 3K ‖θ₂ − θ₁‖₂.  (2.9)

• Finally,

‖G(θ; ρ₁) − G(θ; ρ₂)‖₂ ≤ max_{ρ∈{ρ₁,ρ₂}} ‖∫ ∇₁W(θ, θ̄; ρ) (ρ₂ − ρ₁)(dθ̄)‖₂ ≤ K d_BL(ρ₁, ρ₂).

With these notations, we can rewrite the SGD dynamics as θᵢᵏ⁺¹ = θᵢᵏ + εξ(kε) Fᵢ(θᵏ; z_{k+1}), which yields

θᵢᵏ = θᵢ⁰ + ε Σ_{l=0}^{k−1} ξ(lε) Fᵢ(θˡ; z_{l+1}).  (2.10)

Recall that (θᵢ⁰)_{i≤N} ∼ ρ₀ independently.
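To make the iteration concrete, the following Python sketch (illustrative, not from the thesis) runs the update (2.10) for the squared loss, taking v(y, ŷ) = 2(ŷ − y), σ*(x; θ) = tanh(⟨θ, x⟩), a constant schedule ξ ≡ 1, and a synthetic teacher for the data; all of these choices are assumptions made only for this example:

```python
import math
import random

random.seed(1)
D, N = 3, 50            # parameter dimension and number of hidden units
eps = 0.01              # step-size scale (the epsilon of s_k = eps * xi(k * eps))
xi = lambda t: 1.0      # constant schedule, allowed by assumption A1

def sigma(x, th):       # activation sigma_*(x; theta) = tanh(<theta, x>)
    return math.tanh(sum(a * b for a, b in zip(th, x)))

def grad_sigma(x, th):  # gradient in theta of tanh(<theta, x>)
    s = sigma(x, th)
    return [(1.0 - s * s) * a for a in x]

teacher = [1.0, -1.0, 0.5]                                          # synthetic labels
theta = [[random.gauss(0, 1) for _ in range(D)] for _ in range(N)]  # iid from rho_0

for k in range(200):
    x = [random.gauss(0, 1) for _ in range(D)]      # fresh sample z_{k+1}
    y = sigma(x, teacher)
    y_hat = sum(sigma(x, th) for th in theta) / N   # network prediction
    v = 2.0 * (y_hat - y)                           # v(y, y_hat) for squared loss
    for th in theta:                                # theta_i += eps * xi * F_i
        g = grad_sigma(x, th)
        for j in range(D):
            th[j] -= eps * xi(k * eps) * v * g[j]
```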

For t ∈ R≥0 we define [t] = ε⌊t/ε⌋. Equation (2.10) should be compared with the nonlinear dynamics (2.7), which reads

θ̄ᵢᵗ = θᵢ⁰ + ∫₀ᵗ ξ(s) G(θ̄ᵢˢ; ρ_s) ds.  (2.11)

For the proof of Theorem 13 we first need some preliminary results, from which it follows almost immediately. We next state and prove the key estimate controlling the difference between the discrete and the nonlinear dynamics.


Lemma 15. Under the assumptions of Theorem 13, there exists a constant K, depending uniquely on K₁, K₂, K₃ of conditions A1, A2 and A3, such that for any T ≥ 0 we have

max_{i≤N} sup_{k∈[0,T/ε]∩N} ‖θᵢᵏ − θ̄ᵢ^{kε}‖₂ ≤ K e^{KT} · √(1/N ∨ ε) · [ √(D + log(N(T/ε ∨ 1))) + z ],  (2.12)

with probability at least 1 − e^{−z²}.

Proof. Consider, for simplicity of notation, t ∈ εN ∩ [0, T]. Taking the difference of equations (2.10) and (2.11),

‖θᵢ^{t/ε} − θ̄ᵢᵗ‖₂ = ‖ ∫₀ᵗ ξ(s) G(θ̄ᵢˢ; ρ_s) ds − ε Σ_{k=0}^{t/ε−1} ξ(kε) Fᵢ(θᵏ; z_{k+1}) ‖₂
≤ ∫₀ᵗ ‖ξ(s) G(θ̄ᵢˢ; ρ_s) − ξ([s]) G(θ̄ᵢ^{[s]}; ρ_{[s]})‖₂ ds
+ ∫₀ᵗ ‖ξ([s]) G(θ̄ᵢ^{[s]}; ρ_{[s]}) − ξ([s]) G(θᵢ^{⌊s/ε⌋}; ρ_{[s]})‖₂ ds
+ ‖ ε Σ_{k=0}^{t/ε−1} ξ(kε) { Fᵢ(θᵏ; z_{k+1}) − G(θᵢᵏ; ρ_{kε}) } ‖₂
≡ E₁ⁱ(t) + E₂ⁱ(t) + E₃ⁱ(t).  (2.13)

We next consider separately the three terms above. Using the Lipschitz continuity of G(θ; ρ) with respect to θ and ρ, condition A1 and Lemma 12 (which imply that ξ, θ̄ᵢ and ρ_s are Lipschitz continuous), we get

E₁ⁱ(t) ≤ t sup_{s∈[0,t]} { ‖ξ(s) G(θ̄ᵢˢ; ρ_s) − ξ([s]) G(θ̄ᵢˢ; ρ_s)‖₂ + ‖ξ([s]) G(θ̄ᵢˢ; ρ_s) − ξ([s]) G(θ̄ᵢ^{[s]}; ρ_s)‖₂ + ‖ξ([s]) G(θ̄ᵢ^{[s]}; ρ_s) − ξ([s]) G(θ̄ᵢ^{[s]}; ρ_{[s]})‖₂ } ≤ Ktε.  (2.14)

The bound for the second term is

E₂ⁱ(t) ≤ K ∫₀ᵗ ‖G(θ̄ᵢ^{[s]}; ρ_{[s]}) − G(θᵢ^{⌊s/ε⌋}; ρ_{[s]})‖₂ ds ≤ K² ∫₀ᵗ ‖θ̄ᵢ^{[s]} − θᵢ^{⌊s/ε⌋}‖₂ ds.  (2.15)

It remains only to bound the last term. We denote by F_k, for k ∈ N, the σ-algebra generated by (θᵢ⁰)_{i≤N} and z₁, …, z_k. Now, setting ρ̂ₖ^{(N)} := (1/N) Σ_{i≤N} δ_{θᵢᵏ}, let us first observe that we have

G(θᵢᵏ; ρ̂ₖ^{(N)}) = −E[ ∂₂l(y, ∫σ* ρ̂ₖ^{(N)}) ∇_{θᵢ}σ*(x; θᵢᵏ) ] = E[ Fᵢ(θᵏ; z_{k+1}) | F_k ].  (2.16)

Hence

E₃ⁱ(t) ≤ ‖ ε Σ_{k=0}^{t/ε−1} ξ(kε) { G(θᵢᵏ; ρ̂ₖ^{(N)}) − G(θᵢᵏ; ρ_{kε}) } ‖₂ + ‖ ε Σ_{k=0}^{t/ε−1} ξ(kε) Zᵢᵏ ‖₂ ≡ E₃,₀ⁱ(t) + Q₁ⁱ(t),  (2.17)

where Zᵢᵏ ≡ Fᵢ(θᵏ; z_{k+1}) − E[Fᵢ(θᵏ; z_{k+1}) | F_k]. We can then apply the Azuma-Hoeffding inequality (Lemma 9), observing that condition (2.2) follows from the fact that σ*(x; θ) is bounded and ∇_θσ*(x; θ) is sub-Gaussian (because the product of a sub-Gaussian random vector and a bounded random variable is sub-Gaussian), and hence each εξ(kε)Zᵢᵏ is Kε-sub-Gaussian. Then we obtain

P( max_{k∈[0,t/ε]∩N} Q₁ⁱ(k) ≥ K√(εt) (√D + u) ) ≤ e^{−u²},

and by taking a union bound over i ≤ N we get

P( max_{i≤N} max_{k∈[0,t/ε]∩N} Q₁ⁱ(k) ≤ K√(εt) (√(D + log N) + z) ) ≥ 1 − e^{−z²}.  (2.18)

For the term E₃,₀ⁱ(t), we use the Lipschitz continuity property (2.9), from which it follows that

‖G(θᵢᵏ; ρ̂ₖ^{(N)}) − G(θᵢᵏ; ρ_{kε})‖₂ ≤ ‖ ∫ ∇₁W(θᵢᵏ, θ̄; ρ̂ₖ^{(N)}) (ρ_{kε} − ρ̂ₖ^{(N)})(dθ̄) ‖₂
= ‖ (1/N) Σ_{j=1}^N [ ∇₁W(θᵢᵏ, θⱼᵏ; ρ̂ₖ^{(N)}) − E_θ̄[∇₁W(θᵢᵏ, θ̄ⱼ^{kε}; ρ̂ₖ^{(N)})] ] ‖₂
≤ ‖ (1/N) Σ_{j=1}^N [ ∇₁W(θᵢᵏ, θⱼᵏ; ρ̂ₖ^{(N)}) − ∇₁W(θᵢᵏ, θ̄ⱼ^{kε}; ρ̂ₖ^{(N)}) ] ‖₂ + ‖ (1/N) Σ_{j=1}^N [ ∇₁W(θᵢᵏ, θ̄ⱼ^{kε}; ρ̂ₖ^{(N)}) − E_θ̄[∇₁W(θᵢᵏ, θ̄ⱼ^{kε}; ρ̂ₖ^{(N)})] ] ‖₂
≤ (3K/N) Σ_{j=1}^N ‖θⱼᵏ − θ̄ⱼ^{kε}‖₂ + Q₂ⁱ(k) + 3K/N,

where Q₂ⁱ(k), for k ∈ N, is defined as

Q₂ⁱ(k) = ‖ (1/N) Σ_{j≤N, j≠i} [ ∇₁W(θᵢᵏ, θ̄ⱼ^{kε}; ρ̂ₖ^{(N)}) − E_θ̄[∇₁W(θᵢᵏ, θ̄ⱼ^{kε}; ρ̂ₖ^{(N)})] ] ‖₂.

Since for any fixed k the (θ̄ⱼ^{kε})_{j≤N, j≠i} are i.i.d. and independent of θᵢᵏ, and ∇₁W is bounded, we get by another application of the Azuma-Hoeffding inequality (Lemma 9)

P( Q₂ⁱ(k) ≥ K√(1/N) (√D + u) ) ≤ e^{−u²}.

Therefore, the union bound over k ∈ [0, t/ε] ∩ N and i ≤ N gives

P( max_{i≤N} max_{k∈[0,t/ε]∩N} Q₂ⁱ(k) ≤ K√(1/N) [ √(D + log(N(t/ε ∨ 1))) + z ] ) ≥ 1 − e^{−z²}.  (2.19)

Now, conditioning on the good events in equations (2.18) and (2.19), equation (2.17) gives

E₃ⁱ(t) ≤ (K/N) Σ_{j=1}^N ∫₀ᵗ ‖θⱼ^{⌊s/ε⌋} − θ̄ⱼ^{[s]}‖₂ ds + Q(t) + Kt/N,  (2.20)


where

Q(t) ≡ max_{i≤N} Q₁ⁱ(t) + t · max_{i≤N} max_{k∈[0,t/ε]∩N} Q₂ⁱ(k)
≤ K√(εt) [ √(D + log N) + z ] + tK√(1/N) [ √(D + log(N(t/ε ∨ 1))) + z ]
≤ K(√t ∨ t) √(1/N ∨ ε) [ √(D + log(N(t/ε ∨ 1))) + z ],  (2.21)

with probability at least 1 − e^{−z²}. We finally define the random variable

∆(t; N, ε) ≡ max_{i≤N} sup_{k∈[0,t/ε]∩N} ‖θᵢᵏ − θ̄ᵢ^{kε}‖₂.

Using the bounds (2.14), (2.15) and (2.20) in equation (2.13) we get

∆(t; N, ε) ≤ K ∫₀ᵗ ∆(s; N, ε) ds + Ktε + Kt/N + Q(t).

By Gronwall's inequality, we have

∆(t; N, ε) ≤ K e^{Kt} [ ε + 1/N + Q(t) ].

Using the bound (2.21), the claim follows.

Lemma 16. Under the assumptions of Theorem 13, we have

max_{k∈[0,T/ε]∩N} |R_N(θ̄ᵏ) − R_N(θᵏ)| ≤ K max_{i≤N} max_{k∈[0,T/ε]∩N} ‖θᵢᵏ − θ̄ᵢ^{kε}‖₂.  (2.22)

Proof. Let θ = (θ₁, …, θᵢ, …, θ_N) and θ′ = (θ₁, …, θᵢ′, …, θ_N) be two configurations that differ only in the i-th position. Then

|R_N(θ) − R_N(θ′)| ≤ (1/N) | E[ ∂₂l(y, (1/N) Σ_{j=1}^N σ*(x; θⱼ)) (σ*(x; θᵢ) − σ*(x; θᵢ′)) ] |
≤ (1/N) |Ψ(θᵢ; ρ*) − Ψ(θᵢ′; ρ*)| + (1/N) |Ψ(θᵢ; ρ̂_N) − Ψ(θᵢ′; ρ̂_N) − Ψ(θᵢ; ρ*) + Ψ(θᵢ′; ρ*)|
≤ (1/N) |Ψ(θᵢ; ρ*) − Ψ(θᵢ′; ρ*)|
+ max_{ρ∈{ρ̂ₖ^{(N)}, ρ*}} (1/N²) Σ_{j≠i} [ |W(θᵢ, θⱼ; ρ) − W(θᵢ′, θⱼ; ρ)| + |W(θᵢ, θᵢ; ρ) − W(θᵢ′, θᵢ′; ρ)| ]
+ max_{ρ∈{ρ̂ₖ^{(N)}, ρ*}} (1/N) ∫ |W(θᵢ, θ̄; ρ) − W(θᵢ′, θ̄; ρ)| ρ*(dθ̄)
≤ (K/N) (‖θᵢ − θᵢ′‖₂ ∧ 1).

Since θᵏ and θ̄ᵏ differ in at most N coordinates, equation (2.22) follows immediately.

Lemma 17. Under the assumptions of Theorem 13, we have

max_{k∈[0,T/ε]∩N} |R_N(θ̄ᵏ) − R(ρ_{kε})| ≤ K √(1/N) [ √(D + log(N(T/ε ∨ 1))) + z ],  (2.23)

with probability at least 1 − e^{−z²}.

Proof. Using the inequality obtained in the proof of the previous lemma,

|R_N(θ) − R_N(θ′)| ≤ (K/N) (‖θᵢ − θᵢ′‖₂ ∧ 1),

the Azuma-Hoeffding inequality (Lemma 9) and a union bound, we get

max_{k∈[0,T/ε]∩N} |R_N(θ̄ᵏ) − E R_N(θ̄ᵏ)| ≤ K √(1/N) [ √(D + log(N(T/ε ∨ 1))) + z ],

with probability at least 1 − e^{−z²}. The claim follows immediately since

| E_θ R_N(θ̄ᵗ) − R(ρ_t) | ≤ (1/N) max_{ρ∈{ρ̂ₜ^{(N)}, ρ_t}} | ∫ W(θ, θ; ρ) ρ_t(dθ) − ∫ W(θ₁, θ₂; ρ) ρ_t(dθ₁) ρ_t(dθ₂) | ≤ K/N,

where the first inequality can be obtained in a similar way as in Proposition 1.7, but considering the max over the inequalities.

Now we are ready to prove the main theorem of this chapter for the noiseless stochastic gradient descent.

Proof. The proof of Theorem 13 follows immediately by an application of the three previous lemmas (15, 16 and 17). In particular, the bound for a generic bounded Lipschitz function f follows the same argument as Lemmas 16 and 17. As a result, for any sequence (N, ε = ε_N) such that N → ∞ and ε_N → 0 with N/log(N/ε_N) → ∞ and ε_N log(N/ε_N) → 0, we have that ρ̂_{⌊t/ε⌋}^{(N)} converges weakly to ρ_t almost surely.

2.2 Noisy SGD and generalization of the previous result

We now introduce a regularized version of SGD, known as noisy SGD, and extend the previous result to this case. Throughout this section we assume that conditions A1, A2 and A3 hold. This variant of SGD considers the following iteration:

θᵢᵏ⁺¹ = (1 − λs_k) θᵢᵏ + s_k v(y_k, ŷ_k) ∇_{θᵢ}σ*(x_k; θᵢᵏ) + √(s_k/β) gᵢᵏ,  (2.24)

where λ ≤ 1, gᵢᵏ ∼ N(0, I_D) and ŷ_k = ŷ(x_k; θᵏ). The introduction of the term −λs_kθᵢᵏ corresponds to an ℓ₂ regularization, and √(s_k/β) gᵢᵏ is a noise term in the output, from which follows the name noisy stochastic gradient descent. As in the standard SGD, we obtain a scaling limit, which is similar to (1.1) but differs from it by the addition of a diffusion term:

∂_t ρ_t = ξ(t) ∇_θ·(ρ_t ∇_θΨ_λ(θ; ρ_t)) + ξ(t) β⁻¹ ∆_θ ρ_t,  (2.25)

where Ψ_λ(θ; ρ) = Ψ(θ; ρ) + (λ/2)‖θ‖₂² and ∆_θ f(θ) = Σ_{i=1}^D ∂²_{θᵢ} f(θ) denotes the usual Laplacian. As in the noiseless case, this equation should be interpreted in a weak sense and, similarly to (2.4), can be viewed as


a gradient flow, in this case for the free energy

F_{β,λ}(ρ) = R(ρ) + (λ/2) ∫ ‖θ‖₂² ρ(dθ) − β⁻¹ Ent(ρ),

where Ent(ρ) = −∫ ρ(θ) log(ρ(θ)) dθ, and by definition Ent(ρ) = −∞ if ρ is singular with respect to the Lebesgue measure.

Observation 18. We can obtain the weak formulation in a similar way as in Observation 11. Indeed, let ϕ : R^D → R be a test function with bounded gradient; using the divergence theorem and observing that the integral over the boundary is zero, we have

d/dt ⟨ρ_t, ϕ⟩ = ξ(t) ⟨∇_θ·(ρ_t ∇_θΨ_λ(θ; ρ_t)), ϕ⟩ + ξ(t) β⁻¹ ⟨∆_θ ρ_t, ϕ⟩
= −ξ(t) ∫ ⟨∇_θϕ(θ), ∇_θΨ(θ; ρ_t)⟩ ρ_t(dθ) − (λ/2) ξ(t) ∫ ⟨∇_θϕ(θ), ∇_θ‖θ‖₂²⟩ ρ_t(dθ) − β⁻¹ ξ(t) ∫ ⟨∇_θ log(ρ_t(θ)), ∇_θϕ(θ)⟩ ρ_t(dθ),

where in the last step we used ∇_θρ_t/ρ_t = ∇_θ log(ρ_t). As in [14, Proposition 4.2], this weak formulation is equivalent to

∫_{R^D} ρ₀(θ) ζ₀(θ) dθ = −∫_{(0,∞)×R^D} [ ∂_t ζ_t(θ) − ξ(t) ⟨∇_θΨ_λ(θ; ρ_t), ∇_θζ_t(θ)⟩ + β⁻¹ ξ(t) ∆_θζ_t(θ) ] ρ_t(dθ) dt  (2.26)

for each ζ ∈ C₀^∞(R × R^D).

Even in this case we can see the problem as a gradient flow in the space (P(R^D), W₂) for the free energy functional

F_{β,λ}(ρ) = R(ρ) + (λ/2) ∫ ‖θ‖₂² ρ(dθ) − β⁻¹ Ent(ρ).

We want to generalize the proof of Theorem 13 to noisy SGD at finite temperature β < ∞. As in the noiseless case, we can consider the equivalent formulation of (2.25) as a fixed-point distribution for the following nonlinear dynamics:

θ̄ᵢᵗ = θᵢ⁰ + ∫₀ᵗ ξ(s) G(θ̄ᵢˢ; ρ_s) ds + ∫₀ᵗ √(ξ(s)/β) dWᵢ(s),   ρ_s = P_{θ̄ᵢˢ}.  (2.27)

This is the integral form of a stochastic differential equation, where the {Wᵢ(s)}_{s≥0}, for i ≤ N, are independent D-dimensional Brownian motions, and G(θ; ρ) ≡ −∇Ψ_λ(θ; ρ).

By Theorem 1.1 of [5], assumptions A1 and A3 are sufficient for the existence and uniqueness of the solution of the DD; we will prove it explicitly in the last chapter, where we discuss the time convergence for this case. Moreover, the assumptions on Ψ, λ and ξ ensure that this nonlinear dynamics has a unique continuous solution.

We can compare this nonlinear dynamics with the discrete dynamics, writing an iteration as

θᵢᵏ = θᵢ⁰ + ε Σ_{l=0}^{k−1} ξ(lε) Fᵢ(θˡ; z_{l+1}) + ∫₀^{kε} √(ξ([s])/β) dWᵢ(s),  (2.28)

where

Fᵢ(θ; z_k) = −λθᵢ − v(y_k, ŷ(x_k; θ)) ∇_{θᵢ}σ*(x_k; θᵢ),   θ = (θᵢ)_{i≤N} ∈ R^{D×N}.

The following result gives some standard estimates on the solution of the SDE (2.27).

Lemma 19. Assume that ρ₀ is K₀²-sub-Gaussian, that ξ(s) and G(0; ρ_s) are K₀-bounded, and that G(θ; ρ_s) is K₀-Lipschitz in θ. Let (θ̄ᵢᵗ)_{t≥0, i≤N} be the solution of (2.27) with independent initialization (θᵢ⁰)_{i≤N} ∼ ρ₀, and let (ρ_t)_{t≥0} be the solution of the PDE (2.25). Then there exists a constant K, depending uniquely on K₀, such that:

(a) with probability at least 1 − e^{−z²},

sup_{i≤N} sup_{t∈[0,T]} ‖θ̄ᵢᵗ‖₂ ≤ K e^{KT} [ √(D + log N) + z ];  (2.29)

(b) with probability at least 1 − e^{−z²},

sup_{i≤N} sup_{k∈[0,T/ε]∩N} sup_{u∈[0,ε]} ‖θ̄ᵢ^{kε+u} − θ̄ᵢ^{kε}‖₂ ≤ K e^{KT} [ √(D + log(N(T/ε ∨ 1))) + z ] √ε;  (2.30)

(c) for any t, h ≥ 0, t + h ≤ T,

d_BL(ρ_t, ρ_{t+h}) ≤ W₂(ρ_t, ρ_{t+h}) ≤ K e^{KT} √(Dh).  (2.31)

Proof. We decompose the proof into three parts, one for each of equations (2.29), (2.30) and (2.31).

(a) It is useful to note that, given any D-dimensional K₀²-sub-Gaussian random vector X and Y ∼ N(0, I_D), the following inequality holds:

E_X[exp{τ‖X‖₂²/2}] = E_{X,Y}[exp{√τ ⟨Y, X⟩}] ≤ E_Y[exp{τK₀² ‖Y‖₂²/2}] = (1 − τK₀²)^{−D/2},  (2.32)

where the first equality follows by the symmetry of the distribution of a Gaussian vector, the inequality follows by the definition of a sub-Gaussian vector, and the last equality is a consequence of the fact that E_Y[exp{τK₀²‖Y‖₂²/2}] is the moment-generating function of ‖Y‖₂² ∼ χ²_D evaluated at τK₀²/2.

Note that the (θᵢ⁰)_{i≤N} are i.i.d. and K₀²-sub-Gaussian; using Markov's inequality and (2.32), we obtain

P( ‖θᵢ⁰‖₂ ≥ u ) ≤ E[exp{τ‖θᵢ⁰‖₂²/2}] exp{−τu²/2} ≤ (1 − τK₀²)^{−D/2} exp{−τu²/2}.

Now, taking the union bound over i ≤ N gives

P( max_{i≤N} ‖θ̄ᵢ⁰‖₂ ≥ u ) ≤ (1 − τK₀²)^{−D/2} exp{−τu²/2 + log N}.

In particular, by choosing τ = 1/(2K₀²) and u = 2K₀(√(D + log N) + z), we get

P( max_{i≤N} ‖θᵢ⁰‖₂ ≥ 2K₀(√(D + log N) + z) ) ≤ exp{−z²}.  (2.33)

On the other hand, defining W_{ξ,i}(t) ≡ ∫₀ᵗ √ξ(s) dWᵢ(s), we have Var(W_{ξ,i}ʲ(t)) = ∫₀ᵗ ξ(s) ds ≤ K₀t for j ≤ D.

Now we can observe that exp{τ‖W_{ξ,i}(t)‖₂²} is a sub-martingale; therefore, using Doob's martingale inequality, we have

P( sup_{t≤T} ‖W_{ξ,i}(t)‖₂ ≥ u ) ≤ E[exp{τ‖W_{ξ,i}(T)‖₂²/2}] · exp{−τu²/2} ≤ (1 − K₀Tτ)^{−D/2} exp{−τu²/2}.

Taking the union bound over i ≤ N gives

P( max_{i≤N} sup_{t≤T} ‖W_{ξ,i}(t)‖₂ ≥ u ) ≤ (1 − K₀Tτ)^{−D/2} exp{−τu²/2 + log N}.

If we set τ = 1/(2K₀T) and u = 2√(K₀T)(√(D + log N) + z) in the previous inequality, we get

P( max_{i≤N} sup_{t≤T} ‖W_{ξ,i}(t)‖₂ ≥ 2√(K₀T)(√(D + log N) + z) ) ≤ exp{−z²}.  (2.34)

Observing that ξ(s) and G(0; ρ_s) are K₀-bounded, and that G(θ; ρ_s) is K₀-Lipschitz in θ,


from equation (2.27) we deduce that there exists a constant K depending on K₀ such that

∆ᵢ(t) ≤ K ∫₀ᵗ ∆ᵢ(s) ds + K[W/√β + Θ],

where ∆ᵢ(t) ≡ sup_{s≤t} ‖θ̄ᵢˢ‖₂, W ≡ max_{i≤N} sup_{t≤T} ‖W_{ξ,i}(t)‖₂, and Θ ≡ max_{i≤N} ‖θᵢ⁰‖₂. Applying Gronwall's inequality we obtain

∆ᵢ(T) ≤ K exp(KT) [W/√β + Θ].

Now the probability bound (2.29) holds, using the high-probability bounds for Θ and W in equations (2.33) and (2.34).

(b) Define ∆ᵢ(h; k, ε) = sup_{0≤u≤h} ‖θ̄ᵢ^{kε+u} − θ̄ᵢ^{kε}‖₂. Noting that ξ(s) and G(0; ρ_s) are K₀-bounded, and that G(θ; ρ_s) is K₀-Lipschitz in θ, according to equation (2.27) we have

∆ᵢ(h; k, ε) ≤ K ( sup_{s≤T} ‖θ̄ᵢˢ‖₂ + 1 ) h + (1/√β) sup_{0≤u≤h} ‖W_{ξ,i,k}(u)‖₂,  (2.35)

where W_{ξ,i,k}(u) ≡ ∫_{kε}^{kε+u} √ξ(s) dWᵢ(s). We can proceed in a similar way as for the bound in equation (2.34), obtaining

max_{i≤N} sup_{k∈[0,T/ε]∩N} sup_{0≤u≤h} ‖W_{ξ,i,k}(u)‖₂ ≤ 2√(K₀h) [ √(D + log(N(T/ε ∨ 1))) + z ],  (2.36)

with probability at least 1 − e^{−z²}. Using the bounds (2.29) and (2.36) in equation (2.35), we have

max_{i≤N} sup_{k∈[0,T/ε]∩N} ∆ᵢ(h; k, ε) ≤ K e^{KT} [ √(D + log N) + z ] h + K ( √(D + log(N(T/ε ∨ 1))) + z ) √h ≤ K e^{KT} [ √(D + log(N(T/ε ∨ 1))) + z ] √h,

with probability at least 1 − e^{−z²}.


(c) Equation (2.31) holds directly by noting that

W₂(ρ_t, ρ_{t+h})² ≤ E[ ‖θ̄ᵗ − θ̄^{t+h}‖₂² ],

and applying an integration over z to a version of equation (2.30) without the union bound over i ≤ N and k ∈ [0, T/ε] ∩ N, i.e.

∫ P( sup_{h: t+h≤T} ‖θ̄^{t+h} − θ̄ᵗ‖₂ ≥ K e^{KT} √(Dh) + K e^{KT} √h · z ) dz ≤ ∫ e^{−z²} dz = √π.

By a substitution on the left-hand side (w = √D + z), the previous inequality is equivalent to

E[ sup_{h: t+h≤T} ‖θ̄^{t+h} − θ̄ᵗ‖₂ ] = ∫ P( sup_{h: t+h≤T} ‖θ̄^{t+h} − θ̄ᵗ‖₂ / (K e^{KT} √(Dh)) ≥ w ) dw ≤ √π,

i.e.

E[ sup_{h: t+h≤T} ‖θ̄^{t+h} − θ̄ᵗ‖₂ ] ≤ K e^{KT} √(Dh).

As in the noiseless case, the key step is to bound the difference between the nonlinear dynamics and the SGD dynamics. The following result is the analogue of Lemma 15 for the noisy version.

Lemma 20. Under the assumptions of Theorem 13, there exists a constant K depending uniquely on K₀, K₁, K₂, K₃, such that for any T ≥ 0 we have

max_{i≤N} sup_{k∈[0,T/ε]∩N} ‖θᵢᵏ − θ̄ᵢ^{kε}‖₂ ≤ K e^{KT} · √(1/N ∨ ε) · [ √(D + log(N(T/ε ∨ 1))) + z ],  (2.37)

with probability at least 1 − e^{−z²}.


Proof. We take the difference of equations (2.28) and (2.27), for t ∈ εN ∩ [0, T]:

‖θᵢ^{t/ε} − θ̄ᵢᵗ‖₂ ≤ ‖ ∫₀ᵗ [ ξ(s) G(θ̄ᵢˢ; ρ_s) − ξ([s]) G(θ̄ᵢ^{[s]}; ρ_{[s]}) ] ds ‖₂
+ ∫₀ᵗ ‖ ξ([s]) G(θ̄ᵢ^{[s]}; ρ_{[s]}) − ξ([s]) G(θᵢ^{⌊s/ε⌋}; ρ_{[s]}) ‖₂ ds
+ ‖ ε Σ_{k=0}^{t/ε−1} ξ(kε) { Fᵢ(θᵏ; z_{k+1}) − G(θᵢᵏ; ρ_{kε}) } ‖₂
+ ‖ ∫₀ᵗ ( √(ξ(s)/β) − √(ξ([s])/β) ) dWᵢ(s) ‖₂
≡ E₁ⁱ(t) + E₂ⁱ(t) + E₃ⁱ(t) + E₄ⁱ(t).  (2.38)

The terms E₂ⁱ(t) and E₃ⁱ(t) can be bounded in the same way as in Lemma 15, because the replacement of Ψ by Ψ_λ does not affect the bounds (2.15) and (2.17). To estimate E₄ⁱ(t), note that W_{ξ,i} ≡ ∫₀^T (√ξ(s) − √ξ([s])) dWᵢ(s) is a Gaussian random vector, W_{ξ,i} ∼ N(0, τ²I_D), where, using the Lipschitz continuity of ξ,

τ² = ∫₀^T ( √ξ(s) − √ξ([s]) )² ds ≤ KTε².

Using the Gaussian tail bound,

P( ‖W_{ξ,i}‖₂ ≥ (√D + z)τ ) ≤ e^{−z²/2},

and hence

P( max_{i≤N} max_{s≤T} E₄ⁱ(s) ≤ K(√(D + log N) + z) √T ε ) ≥ 1 − e^{−z²/2}.  (2.39)

It remains only to bound E₁ⁱ(t), which can be done by a slightly


modified version of the proof of Lemma 15. In fact,

E₁ⁱ(t) ≤ ‖ ∫₀ᵗ [ ξ(s) − ξ([s]) ] G(θ̄ᵢˢ; ρ_s) ds ‖₂ + ‖ ∫₀ᵗ ξ([s]) [ G(θ̄ᵢˢ; ρ_s) − G(θ̄ᵢˢ; ρ_{[s]}) ] ds ‖₂ + ‖ ∫₀ᵗ ξ([s]) [ G(θ̄ᵢˢ; ρ_{[s]}) − G(θ̄ᵢ^{[s]}; ρ_{[s]}) ] ds ‖₂
≡ E₁,Aⁱ(t) + E₁,Bⁱ(t) + E₁,Cⁱ(t).  (2.40)

To bound the first term E₁,Aⁱ(t), due to the Lipschitz property of ξ, the Lipschitz property of G(θ; ρ) and the boundedness of G(0; ρ), with probability at least 1 − e^{−z²} we have, for all i ≤ N and t ≤ T,

E₁,Aⁱ(t) ≤ TεK · sup_{s∈[0,T]} ‖G(θ̄ᵢˢ; ρ_s)‖₂ ≤ TεK · [ K sup_{s∈[0,T]} ‖θ̄ᵢˢ‖₂ + K ] ≤ K e^{KT} [ √(D + log N) + z ] ε.  (2.41)

Here the last inequality is due to equation (2.29) of the previous lemma (Lemma 19).

To bound the second term E₁,Bⁱ(t), using the fact that ∇₁W is bounded Lipschitz, we have, for all i ≤ N and t ≤ T,

E₁,Bⁱ(t) ≤ TK · sup_{θ∈R^D} ‖G(θ; ρ_s) − G(θ; ρ_{[s]})‖₂ ≤ TK² · d_BL(ρ_s, ρ_{[s]}) ≤ K e^{KT} √(Dε).  (2.42)

Here the last inequality is due to equation (2.31) of Lemma 19.

To bound the last term E₁,Cⁱ(t), with probability at least 1 − e^{−z²} we have, for all i ≤ N and t ≤ T,

E₁,Cⁱ(t) ≤ TK · sup_{s∈[0,T]} ‖G(θ̄ᵢˢ; ρ_{[s]}) − G(θ̄ᵢ^{[s]}; ρ_{[s]})‖₂ ≤ TK² · sup_{s∈[0,T]} ‖θ̄ᵢˢ − θ̄ᵢ^{[s]}‖₂ ≤ K e^{KT} [ √(D + log(N(T/ε ∨ 1))) + z ] √ε.  (2.43)


Here the last inequality is due to equation (2.30) in Lemma 19.

As a result, combining the bounds obtained for Eᵢ(t), i = 1, …, 4 (i.e. equations (2.15), (2.20), (2.21), (2.38), (2.39), (2.41), (2.42) and (2.43)), and defining

∆(t; N, ε) ≡ max_{i≤N} sup_{k∈[0,T/ε]∩N} ‖θᵢᵏ − θ̄ᵢ^{kε}‖₂,

we get

∆(t; N, ε) ≤ K ∫₀ᵗ ∆(s; N, ε) ds + Kt/N + E(T),

where

E(T) = K e^{KT} · √(1/N ∨ ε) · [ √(D + log(N(T/ε ∨ 1))) + z ].

Applying Gronwall's inequality we obtain the desired result.

The generalization of Theorem 13 to β < ∞ follows from this lemma exactly as in the noiseless case.


Chapter 3

Characterization of Fixed Points

and General Properties of the

Solutions

In this chapter we want to find a characterization of the fixed points of the distributional dynamics and prove some general properties of the solution. We start with the noiseless case, in which we prove that the risk is non-increasing along the trajectories of the PDE (2.4) and that a fixed point ρ has support contained in the set of stationary points of Ψ(·; ρ).

In this case we further investigate additional properties of the solution that will be useful in the following chapter in order to prove general stability results for the dynamics, and hence convergence of the dynamics as time goes to infinity. These properties concern the continuity of the solution, a non-decreasing property for the measure of Borel sets (which in particular implies that if we choose an initial distribution concentrated on a set of parameters with strictly positive value, then this property is preserved by the dynamics), and the fact that the set of limit points of the trajectories is connected and compact.

In the noisy case a slightly different characterization holds; in fact, we will see that there exists a unique fixed point, and so at most one with finite free energy. In this case a support property as in the noiseless case does not hold; instead, the distribution is absolutely continuous, with density satisfying the following equation:

ρ*(θ) = (1/Z(β)) exp{−βΨ(θ; ρ*)}.
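This self-consistency condition can be explored numerically by damped fixed-point iteration on a grid. The sketch below uses a toy one-dimensional mean-field potential Ψ(θ; ρ) = θ²/2 + m(ρ)θ, with m(ρ) the mean of ρ — an illustrative stand-in for the thesis's Ψ, chosen so that the fixed point is explicitly the Gaussian N(0, 1/β):

```python
import math

# Toy 1-D mean-field potential (an illustrative stand-in for the thesis's Psi):
#   Psi(theta; rho) = theta^2/2 + m(rho) * theta,   m(rho) = mean of rho.
# Self-consistency rho ~ exp(-beta * Psi(.; rho)) forces m = -m, i.e. m = 0,
# so the fixed point is the Gaussian N(0, 1/beta).
beta = 2.0
h = 0.01
grid = [i * h for i in range(-500, 501)]      # theta grid on [-5, 5]

def normalize(w):
    z = sum(w) * h
    return [x / z for x in w]

def mean(rho):
    return sum(t * r for t, r in zip(grid, rho)) * h

rho = normalize([1.0] * len(grid))            # start from the uniform density
for _ in range(200):
    m = mean(rho)
    gibbs = normalize([math.exp(-beta * (t * t / 2.0 + m * t)) for t in grid])
    rho = normalize([0.5 * r + 0.5 * g for r, g in zip(rho, gibbs)])   # damped update

second_moment = sum(t * t * r for t, r in zip(grid, rho)) * h
# At the fixed point: mean(rho) ~ 0 and second_moment ~ 1/beta.
```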

3.1 Characterization of fixed points of the PDE

3.1.1 Noiseless case

As already mentioned, we want to find a characterization of the fixed points of the dynamics. We can notice that in the main theorem of the previous chapter, even if the dependence of the error terms on N and D is rather mild, the error grows exponentially with the time horizon T. This dependence limits the applicability of this method to cases in which the distributional dynamics converges rapidly to a good solution. Therefore we want to obtain other results that exploit the long-time convergence of the distributional dynamics.

In the general setting of Theorem 13, we expect that the convergence of the distributional dynamics is not always fast, because this result also includes cases in which the dynamics is unstable. Observing that J(θ; ρ_t) = ρ_t(θ)∇_θΨ(θ; ρ_t) can be seen as a current, we have that the fixed points of the continuum dynamics are densities that correspond to zero current, as formally stated in the next result.

Proposition 21. Assume that, for each ρ, Ψ(·; ρ) is differentiable with bounded gradient. If ρ_t is a solution of the PDE (2.4), then R(ρ_t) is non-increasing. Furthermore, a probability distribution ρ is a fixed point of the PDE (2.4) if and only if

Supp(ρ) ⊆ {θ : ∇_θΨ(θ; ρ) = 0}.  (3.1)

Observation 22. It is interesting to note that the global optimizers of R(ρ) defined by condition (1.6) in Proposition 4 are fixed points, but the set of fixed points is, in general, larger than the set of optimizers.
