A Mathematical Framework for Stochastic Gradient Descent Algorithms

Academic year: 2021

A Mathematical Framework for Stochastic Gradient Algorithms

Candidate: Andrea Papini
Supervisors: Dott. Davide Bacciu, Prof. Marco Romito
Co-examiner: Dott. Dario Trevisan


“All these scattered things that organize my gestures, my thoughts, the beats of my heart.”


Contents

1 Introduction
  1.1 Thesis Overview
2 Stochastic Modified Equations
  2.1 Heuristic motivations
  2.2 Mathematical framework
  2.3 Approximation Theorems
    2.3.1 One-Step to N-step
    2.3.2 SME for SGD
    2.3.3 SME for momentum-SGD
    2.3.4 SME for Nesterov-SGD
  2.4 Toy Model of the SMEs to the analysis of SGD
    2.4.1 Analysis of SGD
    2.4.2 Analysis of MSGD
3 SGD Performs Variational Inference
  3.1 Stochastic Differential Equation Derivation
  3.2 Steady-State Distribution
  3.3 SGD performs variational inference
    3.3.1 Bayesian Inference
  3.4 SGD is out-of-equilibrium
    3.4.1 Linearization around a critical point
    3.4.2 General case
  3.5 Implication and Empirical characterization
    3.5.1 Empirical characterization of SGD
4 Tail Analysis of SGD: New Horizon
  4.1 Stable distributions and Lévy-Driven SDE
  4.2 Experimental Setup and Results
    4.2.1 Results
  4.3 Exit Time Under Tailed Gradient Noise
    4.3.1 Metastability and exit time of Lévy SDE
  4.4 Conclusion & Open Problems

Appendices
  A Existence, uniqueness and moments
  B Derivatives with respect to initial condition
  C Detailed Proof of Thm. 2.3.1
  D Detailed Proof of Thm. 2.3.4
  E Sketched Proof of Thm. 4.3.2
    E.1 Proof Overview
    E.2 Useful Lemmas


Deep Learning has achieved remarkable results during the last decade in a number of applied domains, from visual recognition and speech to natural language processing and robotics. Motivated by these achievements, in this master's thesis we give a survey of the theoretical study of Stochastic Gradient Descent (SGD), the most common training algorithm for Deep Neural Networks. SGD is widely used in deep learning with great success thanks to its computational efficiency. Beyond efficiency, understanding how SGD performs better than its full-batch counterpart in terms of test accuracy remains a major challenge ([CCS+16], [CS17], [COO+17], [WME18]).

The first chapter is devoted to a short survey of the main definitions and problems concerning the dynamics of Stochastic Gradient Descent, as well as the structure of the thesis.

The second chapter is the core of this work and is based on [LTE18]. We develop the mathematical foundations of the stochastic modified equations (SME) framework for analyzing the dynamics of stochastic gradient algorithms, in which the latter are approximated by a class of stochastic differential equations with small noise parameters. We believe that the SDE representation is sufficiently accurate for small step-sizes and a good, if not the best, proxy for understanding the behavior of SGD. With this goal in mind, we present a rigorous formulation to answer the following question: in what sense is a Stochastic Differential Equation (SDE) an approximation of a discrete stochastic gradient algorithm?

The third chapter starts with empirical observations concerning the extremal points found by the algorithm, showing that the spectrum of the Hessian at the extrema discovered by SGD on Convolutional Neural Networks ([WME18], [CCS+16]) has a great portion of eigenvalues near zero. This suggests investigating the fluctuations of Stochastic Gradient Descent through an asymptotic-in-time analysis, by showing, as in [COO+17], that SGD has a limit distribution satisfying a Fokker-Planck type equation. Following [CS17], we show that this implies that SGD does not even converge in the classical sense: most likely trajectories of SGD for deep networks do not behave like Brownian motion around critical points; instead, they resemble closed loops with deterministic components.


In the final chapter, we tackle the assumption that the gradient noise (GN) in the stochastic gradient descent algorithm is Gaussian in the large data regime. This assumption is often made for mathematical convenience, since it enables SGD to be analyzed as a stochastic differential equation (SDE) driven by a Brownian motion. We argue that the Gaussianity assumption might fail to hold in deep learning settings and hence render the Brownian motion-based analyses inappropriate. Accordingly, we start to analyze SGD as an SDE driven by a Lévy motion. Such SDEs can exhibit 'jumps', which force the SDE to transition from narrow minima to wider minima; the approach we follow is based on the existing metastability theory [NŞGR19], which is still in its early stages.


Chapter 1

Introduction

The techniques of artificial intelligence are to the mind what bureaucracy is to human social interaction.

Terry Winograd (1991)

The first chapter is devoted to a brief introduction to the problem we are going to focus on and a presentation of the structure of the thesis. We present some facts regarding the dynamics of Stochastic Gradient Descent seen through the lens of a heuristic continuous-time interpretation of the discrete-time algorithm: although these results are presented here as the state of the art on the dynamics of SGD, only later in the work, in chapter 2, do we build a clear and rigorous frame in which this kind of heuristic rests on solid mathematical ground [LTE18].

We start with the assumption that

dX_t = −∇f(X_t) dt + √(η Σ(X_t)) dW_t,

is a fair model for the discrete-time SGD, and this suggests investigating the fluctuations of Stochastic Gradient Descent through an asymptotic-in-time analysis, showing, as in [COO+17], that Stochastic Gradient Descent has a limit distribution satisfying a Fokker-Planck type equation.


1.1 Thesis Overview

Motivated by the remarkable achievements Deep Learning has obtained during the last decade in a number of applied domains, from visual recognition and speech to natural language processing and robotics [LBH15], we decided, in this master's thesis, to survey the theoretical study of Stochastic Gradient Descent (SGD), the most common training algorithm for Deep Neural Networks. Neural networks are a parametric class of functions whose parameters are found by minimizing a non-convex loss function, typically via stochastic gradient descent (SGD), using a variety of regularization techniques.

For the scope of this work, a Deep Neural Network is a nested composition of linear functions of an input ξ ∈ R^p and some non-linear functions {h_j}_{j_1,...,j_J} at each level. The output of a neural network can be written as

y(x, ξ) = h_{j_J}(x_p h_{j_{J−1}}(x_{p−1} · · · h_{j_1}(x_1 ξ))).

Here x_1, ..., x_{p−1} ∈ R^{p×p} and x_p ∈ R^{d×K} are the weights of the network, which we will consider as a single vector x ∈ R^d of suitable dimension after concatenating and rearranging each of them as a vector.
The various functions h_j are applied point-wise to each element of their argument and are called activation functions. For instance, in some state-of-the-art neural networks one takes h_j(ξ) := ReLU(ξ) = max(0, ξ), while the last non-linearity can depend on the task the network has to perform; e.g., for an N-classification problem the last non-linearity could be h_0(ξ) = I_{ξ≥0}, and the resulting output would be y(x, ξ) ∈ {1, ..., N}.

A neural network is thus a function with d parameters that produces an output y for each input ξ.

In the supervised learning setting, training a deep network for a task consists of using a finite sample of inputs and outputs D := {(ξ_i, y_i)}_{i=1}^n (a dataset) and minimizing the empirical loss

f(x) := (1/n) Σ_{i=1}^n f^{(i)}(x),

where f^{(i)}(x) is a loss function (typically non-convex) that measures the discrepancy between the prediction y(x, ξ_i) and the label y_i.
In many key tasks, and for the sake of presentation, deep learning reduces to solving the following optimization problem:

x* = arg min_{x∈R^d} { f(x) := (1/n) Σ_{i=1}^n f^{(i)}(x) },


where x denotes the weights of the neural network, f : R^d → R denotes the loss function, which is typically non-convex in x, and each f^{(i)} denotes the loss contributed by the data point i ∈ {1, ..., n}.

SGD is widely used in deep learning to perform this optimization, with great success thanks to its computational efficiency.

Beyond efficiency, understanding how SGD performs better than its full batch counterpart in terms of test accuracy remains a major challenge. Despite the fact that the impact of SGD on generalization has been studied, a satisfactory theory that can explain its success in a way that encompasses such peculiar empirical properties is still lacking.

The combination of experimental work and theoretical approaches has yielded many insights in the last decades ([CCS+16], [CS17], [COO+17], [WME18]), but we are still far from a real understanding; despite many attempts, the roots of this success remain elusive.

Stochastic gradient descent, in its simple form, can be written as

x_{k+1} = x_k − η ∇f_{γ_k}(x_k),   k ∈ {1, ..., K},

where K is the maximal iteration number and ∇f_{γ_k} denotes the stochastic gradient iteration, which can be stated in a more standard form as

∇f_{γ_k}(x) = (1/|Ω_k|) Σ_{i∈Ω_k} ∇f^{(i)}(x).

Here, Ω_k is a subset of all the data drawn with a distribution γ_k.
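The mini-batch iteration above can be sketched in a few lines. The following is a minimal illustration, not code from the thesis: the losses f^(i)(x) = ½(x − a_i)² form a hypothetical least-squares toy problem whose full-batch minimizer is the mean of the a_i.

```python
import numpy as np

def sgd(grad_i, x0, n, eta=0.1, batch=8, steps=500, rng=None):
    """Mini-batch SGD: at step k draw Omega_k uniformly from the n sample
    indices and move against the averaged sampled gradient."""
    rng = np.random.default_rng(rng)
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        omega = rng.integers(0, n, size=batch)            # Omega_k
        g = np.mean([grad_i(i, x) for i in omega], axis=0)
        x = x - eta * g                                   # x_{k+1} = x_k - eta * grad
    return x

# Toy losses f^(i)(x) = 0.5 * (x - a_i)**2, so grad f^(i)(x) = x - a_i and
# the total empirical loss is minimized at the mean of the a_i.
a = np.linspace(-1.0, 3.0, 50)
x_star = sgd(lambda i, x: x - a[i], x0=10.0, n=50, rng=0)
```

The iterates do not converge exactly: they fluctuate around the minimizer with a variance controlled by η and the batch size, which is precisely the phenomenon the SME framework quantifies.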

The second chapter is the core of this work and is based on [LTE18]. We develop the mathematical foundations of the stochastic modified equations (SME) framework for analyzing the dynamics of stochastic gradient algorithms, in which the latter are approximated by a class of stochastic differential equations with small noise parameters.

The starting motivation is the observation that the full-batch Gradient Descent (GD) iteration is an (explicit Euler) discretization of the continuous-time ordinary differential equation

dx/dt = −∇f(x),   (1.1)

and studying (1.1) can give us important insights about the dynamics of the discrete-time algorithm for small enough learning rates.
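As a concrete check of this observation, take the hypothetical quadratic f(x) = x²/2 (not an example from the text): GD produces (1 − η)^k x_0, while the gradient flow gives x(t) = e^{−t} x_0, and the two stay within O(η) of each other at matched times t = kη.

```python
import numpy as np

# Gradient descent on f(x) = x**2 / 2 is the explicit Euler scheme for the
# gradient flow dx/dt = -x, whose exact solution is x(t) = x0 * exp(-t).
eta, x0, K = 0.01, 1.0, 200

x_gd = x0
for _ in range(K):
    x_gd -= eta * x_gd           # x_{k+1} = x_k - eta * grad f(x_k)

x_ode = x0 * np.exp(-eta * K)    # continuous-time solution at t = K * eta
gap = abs(x_gd - x_ode)          # O(eta) discretization error
```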

The natural question that we answer in this work, in full generality, is how to extend this to SGD.

In recent work [WME18], a popular approach for analyzing SGD is to consider SGD as a discretization of a continuous-time process via the Euler–Maruyama approximation ([CCS+16], [COO+17], [CS17]), but some theoretical concerns have also been raised about the SDE approximation of SGD [Yai19]: first, the approximation is a heuristic interpretation and not a rigorous limit process; second, the fluctuations of the process are not yet fully understood.

We believe that the SDE representation is sufficiently accurate for small step-sizes and a good, if not the best, proxy for understanding the behavior of SGD. With this goal in mind we give a rigorous formulation, as in [LTE18], to answer the following questions: in what sense is a Stochastic Differential Equation (SDE) an approximation of a discrete stochastic gradient algorithm? How can the limitations of the heuristic derivation be overcome?

We write the SGD iteration as

x_{k+1} = x_k − η ∇f(x_k) + √η V_k(x_k, γ_k),   (1.2)

where V_k(x_k, γ_k) = √η (∇f(x_k) − ∇f_{γ_k}(x_k)) is a d-dimensional random vector. A straightforward calculation, using the canonical assumptions, shows that

E[V_k | x_k] = 0,
cov[V_k, V_k | x_k] = η Σ(x_k),
Σ(x_k) := E[(∇f_{γ_k}(x_k) − ∇f(x_k))(∇f_{γ_k}(x_k) − ∇f(x_k))^T | x_k],   (1.3)

i.e. V_k(x_k) conditional on x_k has mean 0 and covariance η Σ(x_k), where Σ is the conditional covariance of the stochastic gradient approximation ∇f_γ of ∇f.
Now we consider a time-homogeneous Itô stochastic differential equation (SDE) of the form

dX_t = b(X_t) dt + √η σ(X_t) dW_t,   (1.4)

where X_t ∈ R^d and t ≥ 0, and W_t denotes a standard d-dimensional Wiener process. We recall that b : R^d → R^d is known as the drift and σ : R^d → R^{d×d} as the diffusion matrix.
The key observation is that taking an Euler discretization of (1.4) with step-size η, approximating X_{kη} by X̂_k, yields the discrete iteration

X̂_{k+1} = X̂_k + η b(X̂_k) + η σ(X̂_k) Z_k,   (1.5)

where √η Z_k := W_{(k+1)η} − W_{kη}, so that the Z_k are i.i.d. standard d-dimensional normal random variables.
If we set b = −∇f, σ(x) = Σ(x)^{1/2} and identify t with kη, we then have matching first and second conditional moments. Hence, this motivates the approximating equation

dX_t = −∇f(X_t) dt + (η Σ(X_t))^{1/2} dW_t.


Fact 1 (Diffusion term). Note that, as this heuristic argument shows, the presence of the small parameter √η in the diffusion term is necessary to model the fact that when the learning rate decreases, the fluctuations of the stochastic gradient algorithm's iterates must also decrease.
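The Euler scheme described above can be sketched numerically. The drift b(x) = −x and constant diffusion σ = 1 below are hypothetical toy choices, not from the text; for them the SDE dX = −X dt + √η dW has stationary variance η/2, which the discretization should approximately reproduce.

```python
import numpy as np

def euler_maruyama(b, sigma, x0, eta, K, rng=None):
    """Euler discretization of dX = b(X)dt + sqrt(eta)*sigma(X) dW with
    step-size eta: X_{k+1} = X_k + eta*b(X_k) + eta*sigma(X_k)*Z_k, where
    Z_k are i.i.d. standard normals (the sqrt(eta) of the diffusion and
    the sqrt(eta) of the Brownian increment combine into a factor eta)."""
    rng = np.random.default_rng(rng)
    x = np.empty(K + 1)
    x[0] = x0
    for k in range(K):
        x[k + 1] = x[k] + eta * b(x[k]) + eta * sigma(x[k]) * rng.standard_normal()
    return x

# Ornstein-Uhlenbeck check: b(x) = -x, sigma = 1.
eta = 0.1
path = euler_maruyama(lambda x: -x, lambda x: 1.0, x0=0.0, eta=eta, K=200_000, rng=1)
var_hat = path[1000:].var()    # empirical stationary variance, approx eta/2
```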

Question 1. In what sense is an SDE like the one above an approximation of the SGD iteration (1.2)?

On a more specific level, we first need to notice that the stochastic process {x_k} induces a probability measure on the product space R^d × R^d × · · ·, whereas {X_t} induces a probability measure on C^0([0, ∞), R^d). Hence, we can only compare their values by sampling a discrete number of points from the latter.
Second, the process {x_k} is adapted to the filtration generated by {γ_k} (which in the case of SGD is the random sampling of the functions f_r), whereas the process {X_t} is adapted to an independent Wiener filtration. Hence it is not helpful to compare individual sample paths. Rather, we define below a sense of weak approximation by comparing the distributions of the two processes.

Definition 1.7. Let G ⊆ C^0 be defined by

G := {g ∈ C^0(R^d, R) | ∃ κ_1, κ_2 > 0 s.t. |g(x)| ≤ κ_1(1 + |x|^{2κ_2}), ∀x ∈ R^d},

the functions with at most polynomial growth.

For each α ≥ 1 we denote by G^α the set of α-times continuously differentiable functions whose partial derivatives up to and including order α lie in G.

Note that G^α ⊆ C^α, the usual space of α-times continuously differentiable functions.
Moreover, if g depends on additional parameters, we say that g ∈ G^α if the constants κ_1, κ_2 are independent of these parameters, i.e. g ∈ G^α uniformly.
We last note that the definition generalizes to vector-valued functions coordinate-wise in the co-domain.

Definition 1.8 (Weak Approximation). Let T > 0, η ∈ (0, 1 ∧ T ), α ≥ 1 an integer, and set N = ⌊T/η⌋. We say that a continuous-time stochastic process {X_t}_{t∈[0,T]} is an order-α weak approximation of a discrete stochastic process {x_k}_{k=0,...,N} if for every g ∈ G^{α+1} there exists C > 0, independent of η, such that

max_{k=0,...,N} |E g(x_k) − E g(X_{kη})| ≤ C η^α.   (1.9)

These are approximations of the distribution of sample paths, instead of the sample paths themselves. This is enforced by requiring that the expectations of the two processes over a sufficiently large class of test functions be close; in particular, this class includes all polynomials.
Thus, under our definition, all moments of the two processes become close at the rate η^α, and hence so must their distributions.

The notion of weak approximation must be contrasted with that of strong approximation, where one would for example require, in the case of mean-square approximation,

[E|x_k − X_{kη}|^2]^{1/2} ≤ C η^α.

This forces the actual sample paths of the two processes to be close per realization of the random process, which severely limits its applicability.

In fact, an advantage of our definition is that the approximating SDE can approximate discrete stochastic processes whose step-wise driving noise is not Gaussian, which is exactly what we need in order to analyze general stochastic gradient iterations.

As in [LTE18], we derive a second-order accurate weak approximation for the simple SGD iteration.

Theorem 1.1.1. Let T > 0, η ∈ (0, 1 ∧ T ) and set N = ⌊T/η⌋. Let {x_k}_{k≥0} be the SGD iterations defined as

x_{k+1} = x_k − η ∇f_{γ_k}(x_k),

where γ_k has the same distribution as γ. Suppose the following conditions are met:

(i) f ≡ E f_γ is twice continuously differentiable, ∇|∇f|^2 is Lipschitz, and f ∈ G^4;

(ii) ∇f_γ satisfies a Lipschitz condition: |∇f_γ(x) − ∇f_γ(y)| ≤ L_γ |x − y| a.s. for all x, y ∈ R^d, where L_γ is a random variable which is positive a.s. and E L_γ^m < ∞ for each m ≥ 1.

Define {X_t}_{t∈[0,T]} as a stochastic process satisfying the SDE

dX_t = −∇(f(X_t) + (η/4)|∇f(X_t)|^2) dt + (η Σ(X_t))^{1/2} dW_t,   X_0 = x_0,   (1.10)

with Σ(x) = E(∇f_γ(x) − ∇f(x))(∇f_γ(x) − ∇f(x))^T.
Then {X_t}_{t∈[0,T]} is an order-2 weak approximation of the SGD, i.e. for each g ∈ G^3 there exists a constant C > 0, independent of η, such that

max_{k=0,...,N} |E g(x_k) − E g(X_{kη})| ≤ C η^2.


We prove that this approximation can also be extended to a weak approximation of momentum Stochastic Gradient Descent and of the stochastic Nesterov accelerated gradient method, the methods most used in the Deep Neural Network framework and in the general setting of stochastic objectives.

We end the chapter by applying the SME framework developed above to analyze the dynamics of the stochastic gradient descent algorithm. We focus on simple but non-trivial models where, to a large extent, analytical computations using SMEs are tractable, giving us key insights into the algorithms that are otherwise difficult to obtain without appealing to the continuous formalism presented in this chapter. We consider primarily the following model:

Model. Let H ∈ R^{d×d} be a symmetric, positive definite matrix. Define the sample objective

f_γ(x) := ½ (x − γ)^T H (x − γ) − ½ Tr(H),   γ ∼ N(0, I_d),   (1.11)

which gives the total objective f(x) ≡ E f_γ(x) = ½ x^T H x.
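The model lends itself to a quick numerical illustration. The sketch below uses a hypothetical one-dimensional instance (d = 1, H = h > 0, not a computation from the thesis): here ∇f_γ(x) = h(x − γ), so Σ = h² and the first-order SME is an Ornstein–Uhlenbeck process whose stationary variance ηh/2 should match the empirical variance of the SGD iterates for small η.

```python
import numpy as np

# 1-D instance: f_gamma(x) = 0.5*h*(x - gamma)**2 - 0.5*h, gamma ~ N(0,1),
# so f(x) = 0.5*h*x**2 and Sigma(x) = h**2.  The first-order SME is the OU
# process dX = -h X dt + sqrt(eta)*h dW with stationary variance eta*h/2.
rng = np.random.default_rng(2)
h, eta, K = 1.0, 0.05, 400_000

x, xs = 1.0, []
for k in range(K):
    gamma = rng.standard_normal()
    x -= eta * h * (x - gamma)     # SGD step with the sampled gradient
    if k > 2000:                   # discard burn-in before measuring
        xs.append(x)

var_sgd = np.var(xs)               # empirical stationary variance of SGD
var_sme = eta * h / 2              # SME (OU) prediction
```

The exact stationary variance of the discrete iteration is ηh/(2 − ηh), which agrees with the SME prediction to leading order in η, illustrating the order of the weak approximation.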

Based on the results of the previous chapter, and following ([CCS+16], [CS17]), in chapter 3 we focus on the relation between the invariant measure of the continuous stochastic process and the algorithm parameters, namely the step-size η and the mini-batch size, through their effect on the noise covariance Σ. We conclude that the ratio of the learning rate to the batch size is the control parameter that determines the width of the minima found by SGD.

We start from the assumption that

dX_t = −∇f(X_t) dt + √(η Σ(X_t)) dW_t

is a fair model for the discrete-time SGD, proving that the conditions for a weak-order approximation are satisfied.

The chapter starts with empirical observations concerning the extremal points found by the algorithm, showing that the spectrum of the Hessian at the extrema discovered by SGD on Convolutional Neural Networks ([WME18], [CCS+16]) has a great portion of eigenvalues near zero.

This suggests investigating the fluctuations of Stochastic Gradient Descent through an asymptotic-in-time analysis, by showing, as in [COO+17], that Stochastic Gradient Descent has a limit distribution satisfying a Fokker-Planck type equation:


Lemma 1.1.2. The continuous-time limit equation for SGD,

dX_t = −∇f(X_t) dt + √(η Σ(X_t)) dW_t,

has a distribution ρ(z, t) = P(x(t) = z) that evolves according to the Fokker-Planck equation

∂_t ρ = ∇ · (∇f(x) ρ + η ∇ · (Σ(x) ρ)).

By defining

j(x) = −∇f(x) + Σ(x) ∇ log ρ(x) − η ∇ · Σ(x),

the part of the dynamics that cannot be written in terms of Σ and the limit distribution, we can prove a result on the limiting dynamics:

Theorem 1.1.3 (Most likely trajectories of SGD are limit cycles). The force j(x) introduces a deterministic component in SGD given by

ẋ = j(x).

Under the condition ∇ · j(x) = 0, which means that the system is conservative, the most likely trajectories of SGD traverse closed trajectories in weight space.

Although we will not discuss the topic in this thesis, we mention a link to the study of new algorithms such as Entropy-SGD [CCS+16] for training deep neural networks, which are motivated by the local geometry of the energy landscape and which, thanks to the rigorous framework developed in chapters 2 and 3, can now be fully understood.

In this context, instead of minimizing the original loss function f(x), they propose to maximize a new objective such as

F(x; β) := log ∫_{x′∈R^n} exp(−f(x′) − (β/2) ||x − x′||²₂) dx′.

The above is a log-partition function that measures both the depth of a valley at a location x ∈ R^n and, through the entropy of f, its flatness. This new objective function can be seen as the solution of a viscous Hamilton-Jacobi equation

∂_t u = −½ |∇u|² + (λ^{−1}/2) Δu,   for 0 < t ≤ γ,

which is equivalent to the Fokker-Planck equation of Lemma 1.1.2 satisfied by the stationary distribution of stochastic gradient descent. The study of such dynamics leads to new insight: local minima that generalize well, discovered by gradient descent, lie in "wide valleys" of the energy landscape rather than in sharp, isolated minima [WME18], and so they can be found by smoothing the energy landscape arising from the SGD dynamics and its continuous-time model.

In chapter 4, we tackle the assumption that the gradient noise (GN) in the stochastic gradient descent algorithm is Gaussian in the large data regime, an assumption usually justified by arguing that the classical central limit theorem (CLT) kicks in. This assumption is often made for mathematical convenience, since it enables SGD to be analyzed as a stochastic differential equation (SDE) driven by a Brownian motion. We argue that the Gaussianity assumption might fail to hold in deep learning settings and hence render the Brownian motion-based analyses inappropriate. Following [SSG19], we analyze the histogram of the norms of the stochastic gradient noise computed using a convolutional neural network on a real classification problem and compare it to the histogram of the norms of Gaussian random variables: the shape of the real histogram is very different from the Gaussian one and exhibits heavy-tailed behavior.

Inspired by non-Gaussian natural phenomena, we consider the Gradient Noise in a more general context and invoke the generalized Central Limit Theorem, which suggests that the GN converges to a heavy-tailed α-stable random variable.

Thus we can make the following assumption on the stochastic gradient noise:

√η V_k(x_k, γ_k) := √η (∇f(x_k) − ∇f_{γ_k}(x_k)) ∼ SαS(Σ(x_k)),

where SαS denotes a centered symmetric α-stable distribution, a special case of the α-stable distributions that is symmetric around the origin:

X ∼ SαS(Σ) ⟺ E[exp(iωX)] = exp(−|Σω|^α).
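Samples from a law with this characteristic function can be drawn with the Chambers–Mallows–Stuck transform. The sketch below is an illustration, not code from the thesis; it uses the standard scale Σ = 1, and the Gaussian edge case α = 2 (where SαS(1) is N(0, 2)) serves as a sanity check.

```python
import numpy as np

def sas_sample(alpha, size, rng=None):
    """Draw from a standard symmetric alpha-stable law SaS(1) via the
    Chambers-Mallows-Stuck transform.  For alpha = 2 this reduces to
    N(0, 2); for alpha = 1 it reduces to the Cauchy distribution."""
    rng = np.random.default_rng(rng)
    u = rng.uniform(-np.pi / 2, np.pi / 2, size)   # uniform angle
    w = rng.exponential(1.0, size)                 # unit exponential
    return (np.sin(alpha * u) / np.cos(u) ** (1 / alpha)
            * (np.cos(u - alpha * u) / w) ** ((1 - alpha) / alpha))

z = sas_sample(2.0, 100_000, rng=3)    # Gaussian case: variance 2
y = sas_sample(1.5, 100_000, rng=3)    # heavy-tailed: infinite variance
```

For α < 2 the samples have infinite variance, and a handful of extreme values dominates any batch average, which is exactly why the Brownian picture of the gradient noise breaks down.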

Developing results similar to [LTE18], we obtain a rigorous weak-approximation SDE for the dynamics:

dX_t = −∇f(X_t) dt + η^{(α−1)/α} Σ(X_t) dL^α_t,

where L^α_t is an α-stable Lévy motion with independent components. This equation exhibits a fundamentally different behavior than the one driven by a Brownian motion; the metastability behavior of these processes and their dynamics is still not clear at the moment, and its theory is still in an early phase. It could lead us to new algorithms with fast convergence to local minima whose performance is stable after adding noise to the dataset.

Accordingly, in the thesis we start to analyze SGD as an SDE driven by a Lévy motion. Such SDEs can exhibit 'jumps', which force the SDE to transition from narrow minima to wider minima; the approach we follow is based on the existing metastability theory [NŞGR19], which is still in its early stages. To validate the α-stable assumption, we use the following estimator.


Theorem 1.1.4 ([MMO15]). Let {X_i}_{i=1}^K be a collection of random variables with X_i ∼ SαS(Σ) and K = K_1 × K_2. Define Y_i := Σ_{j=1}^{K_1} X_{j+(i−1)K_1} for i ∈ [1, K_2]. Then the estimator

1/α̂ := (1/log K_1) ( (1/K_2) Σ_{i=1}^{K_2} log|Y_i| − (1/K) Σ_{i=1}^{K} log|X_i| ),   (1.12)

converges to 1/α almost surely as K_2 → ∞.

We use this estimator on sample models to validate experiments on common deep learning architectures, showing that in all settings the gradient noise is highly non-Gaussian and admits heavy tails.
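The estimator (1.12) is straightforward to implement. The sketch below is illustrative (the block sizes K_1, K_2 are hypothetical choices) and is sanity-checked on Gaussian data, where sums of K_1 i.i.d. terms scale like √K_1 and the estimator should therefore recover α ≈ 2.

```python
import numpy as np

def alpha_inv_hat(x, K1):
    """Estimator (1.12) of 1/alpha from [MMO15]: x holds K = K1*K2 samples,
    and each Y_i is the sum of a block of K1 consecutive entries."""
    K2 = x.size // K1
    y = x[: K1 * K2].reshape(K2, K1).sum(axis=1)    # Y_i = block sums
    return (np.mean(np.log(np.abs(y)))
            - np.mean(np.log(np.abs(x)))) / np.log(K1)

# Sanity check on Gaussian data (alpha = 2): E log|Y| - E log|X| equals
# (1/2) log K1 exactly, so the estimator should return about 1/2.
rng = np.random.default_rng(4)
est = alpha_inv_hat(rng.standard_normal(1_000_000), K1=100)
alpha_hat = 1.0 / est
```

Running the same estimator on stochastic gradient noise collected during training is what produces the tail-index measurements discussed in chapter 4.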

The last part of the chapter is devoted to explaining the results obtained in [NŞGR19] on the first exit time from minima of the new Lévy-driven dynamics and of its discrete-time counterpart.

We define the exit times of the SDE process and of the discrete-time one:

τ_a(ε) := inf{t ≥ 0 : ||X_t − x̄|| ∉ [0, a + ε]},
τ̄_a(ε) := inf{k ≥ 0 : ||x_k − x̄|| ∉ [0, a + ε]},

where x̄ is the candidate local minimum, and we derive a bound relating the exit times and the hyper-parameters of the model, such as:

Theorem 1.1.5 (4.3.2, [NŞGR19]). Under regularity assumptions, for δ ∈ (0, 1) the following inequality holds:

P[τ_a(ε) > Kη] − C_{K,η,ε,d} − δ ≤ P[τ̄_a(ε) > K] ≤ P[τ_a(ε) > Kη] + C_{K,η,ε,d} + δ.

The theorem helps us derive explicit conditions on the step-size such that the discrete-time system inherits the metastability behavior of its continuous-time limit, but we are still missing meaningful results on toy models and deep neural networks, as well as new theories that relax the hypotheses on the loss function f to obtain a better proxy for approximating the discrete dynamics of gradient descent.
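The contrast in exit behavior can be illustrated with a small Monte-Carlo experiment; the 1-D quadratic basin and all parameters below are hypothetical, not the experimental setup of [NŞGR19]. With Gaussian gradient noise the iterates essentially never leave the basin within the step budget, while heavy-tailed noise (here Cauchy, α = 1) escapes quickly through a single large jump.

```python
import numpy as np

def mean_exit_steps(noise, a=1.0, eta=0.1, trials=50, max_k=5_000, rng=None):
    """Monte-Carlo mean first-exit step of x_{k+1} = x_k - eta*x_k + eta*xi_k
    from the interval [-a, a] around the minimum 0 of f(x) = x**2 / 2;
    trials that never exit are censored at max_k."""
    rng = np.random.default_rng(rng)
    exits = []
    for _ in range(trials):
        x, k = 0.0, 0
        while abs(x) <= a and k < max_k:
            x += -eta * x + eta * noise(rng)
            k += 1
        exits.append(k)
    return np.mean(exits)

# Same noise scale, very different tails.
t_gauss = mean_exit_steps(lambda r: r.standard_normal(), rng=5)
t_levy = mean_exit_steps(lambda r: r.standard_cauchy(), rng=5)
```

The heavy-tailed runs exit orders of magnitude faster, which is the mechanism behind the claim that Lévy-driven SGD hops out of narrow basins.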


Chapter 2

Stochastic Modified Equations

A theory is just a mathematical model to describe observations.

Karl Popper

In this chapter we develop the mathematical foundations of the stochastic modified equations (SME) framework for analyzing the dynamics of stochastic gradient algorithms, in which the latter are approximated by a class of stochastic differential equations with small noise parameters. We prove that this approximation can be understood mathematically as a weak approximation, which leads to a number of precise and useful results on the approximation of stochastic gradient descent (SGD), momentum SGD and the stochastic Nesterov accelerated gradient method in the general setting of stochastic objectives.

2.1 Heuristic motivations

The stochastic gradient algorithms introduced in the previous chapter address an optimization problem of the general form

min_{x∈R^d} f(x) := E f_γ(x),

where {f_r(x) : r ∈ Γ} is a family of functions from R^d to R and γ is a Γ-valued random variable with respect to which we compute the expectation. In the empirical loss minimization introduced for supervised learning applications, γ is usually a uniform random variable taking values in Γ = {1, 2, ..., n}, where n is the total number of data points. In this case, f is the total empirical loss function

f(x) = (1/n) Σ_{i=1}^n f_i(x),

where f_r, r ∈ Γ, is the loss due to the r-th training sample. In this chapter we address the general situation of an expectation over arbitrary index sets and distributions, always keeping the empirical loss in mind as a guiding example for the theory.

One of the usual techniques to solve the minimization problem is to implement standard gradient descent (GD) on x, given by the iteration scheme

x_{k+1} = x_k − η ∇E f_γ(x_k),

for k ≥ 0, where η is the learning rate.

Applying this in our working environment requires the evaluation of the gradient of an expectation, which can be costly (in the Deep Learning setting n is very large, so computing the full empirical-loss gradient becomes prohibitive). As we have already seen in the previous chapter, we therefore introduce the stochastic gradient descent (SGD) algorithm, replacing the expectation of the gradient with a sampled gradient:

x_{k+1} = x_k − η ∇f_{γ_k}(x_k),

where each γ_k is an independent and identically distributed (i.i.d.) random variable with the same distribution as γ.

Throughout the chapter we work under the hypothesis

E[∇f_{γ_k}(x_k) | x_k] = ∇f(x_k),

so that the new iteration is a sampled version of the original approximation scheme.

We now introduce the stochastic modified equations framework. The starting motivation is the observation that the Gradient Descent (GD) iteration is an (explicit Euler) discretization of the continuous-time ordinary differential equation

dx/dt = −∇f(x),   (2.1)

and studying (2.1) can give us important insights into the dynamics of the discrete-time algorithm for small enough learning rates. The natural question when extending this to SGD is: what is the right continuous-time equation to consider?


We rewrite the SGD iteration as

x_{k+1} = x_k − η ∇f(x_k) + √η V_k(x_k, γ_k),   (2.2)

where V_k(x_k, γ_k) = √η (∇f(x_k) − ∇f_{γ_k}(x_k)) is a d-dimensional random vector. A straightforward calculation shows that

E[V_k | x_k] = 0,
cov[V_k, V_k | x_k] = η Σ(x_k),
Σ(x_k) := E[(∇f_{γ_k}(x_k) − ∇f(x_k))(∇f_{γ_k}(x_k) − ∇f(x_k))^T | x_k],   (2.3)

i.e. V_k(x_k) conditional on x_k has mean 0 and covariance η Σ(x_k), where Σ is the conditional covariance of the stochastic gradient approximation ∇f_γ of ∇f.

We now consider a time-homogeneous Itô stochastic differential equation (SDE) of the form

dX_t = b(X_t) dt + √η σ(X_t) dW_t,   (2.4)

where X_t ∈ R^d and t ≥ 0, and W_t denotes a standard d-dimensional Wiener process. We recall that b : R^d → R^d is known as the drift and σ : R^d → R^{d×d} as the diffusion matrix.

The key observation is that taking the Euler discretization of (2.4) with step-size η, approximating X_{kη} by X̂_k, yields the discrete iteration

X̂_{k+1} = X̂_k + η b(X̂_k) + η σ(X̂_k) Z_k,   (2.5)

where √η Z_k := W_{(k+1)η} − W_{kη}, so that the Z_k are i.i.d. standard d-dimensional normal random variables.

If we set b = −∇f, σ(x) = Σ(x)^{1/2} and identify t with kη, we then have matching first and second conditional moments. Hence, this motivates the approximating equation

dX_t = −∇f(X_t) dt + (η Σ(X_t))^{1/2} dW_t.   (2.6)

Fact 2 (Diffusion term). Note that, as this heuristic argument shows, the presence of the small parameter √η in the diffusion term is necessary to model the fact that when the learning rate decreases, the fluctuations of the stochastic gradient algorithm's iterates must also decrease.


2.2 Mathematical framework

Let (Ω, F, P) be a sufficiently rich probability space and let (Γ, F_Γ) be a measure space representing the index space for our random objectives. Let γ : Ω → Γ be a random variable and (r, x) ↦ f_r(x) a measurable mapping from Γ × R^d to R. Hence, for each x, the composition f_γ(x) is a random variable, and we assume the following:

Assumption 1. The random variable f_γ satisfies:

(i) f_γ(x) ∈ L^1 for all x ∈ R^d;

(ii) f_γ(x) is continuously differentiable in x almost surely, and for each R > 0 there exists a random variable M_{γ,R} such that a.s.

max_{|x|≤R} |∇f_γ(x)| ≤ M_{γ,R},   with E|M_{γ,R}| < ∞;

(iii) ∇f_γ(x) ∈ L^2 for all x ∈ R^d.

Note that in the empirical risk minimization case, where Γ is finite, the conditions above are usually trivially satisfied by the losses used in state-of-the-art deep learning.

Thanks to (i) we can define the total objective function we would like to minimize as an expectation:

f(x) := E f_γ(x) ≡ ∫ f_{γ(ω)}(x) dP(ω).   (2.7)

Moreover, via the dominated convergence theorem and assumption (ii), we get

E ∇f_γ = ∇E f_γ ≡ ∇f.

Definition 2.8 (General SG). Let {γ_k : k = 0, 1, ...} be a sequence of i.i.d. Γ-valued random variables with the same distribution as γ and let x_0 ∈ R^d be fixed. We define the generalized stochastic gradient iteration as the stochastic process

x_{k+1} = x_k + η h(x_k, γ_k, η)   (2.9)

for k ≥ 0, where h : R^d × Γ × R → R^d is a measurable function and η > 0 is the learning rate.

Fact 3. In the case of SGD we have h(x, r, η) = −∇f_r(x), but we consider the generalized version so that modified equations can be derived for SGD variants.


Let us now define the class of approximating continuous stochastic processes, which we call Stochastic Modified Equations.
Consider the time-homogeneous Itô diffusion process {X_t : t ≥ 0} represented by the stochastic differential equation

dX_t = b(X_t, η) dt + √η σ(X_t, η) dW_t,   X_0 = x_0,   (2.10)

where (W_t)_{t≥0} is a standard d-dimensional Wiener process independent of {γ_k}. Note that b : R^d × R → R^d, the approximating drift vector, and σ : R^d × R → R^{d×d}, the approximating diffusion matrix, will need to be chosen appropriately so that (2.9) is approximated by (2.10), in the sense we are about to describe.

We first need to notice that the stochastic process {x_k} induces a probability measure on the product space R^d × R^d × · · ·, whereas {X_t} induces a probability measure on C^0([0, ∞), R^d). Hence, we can only compare their values by sampling a discrete number of points from the latter.
Second, the process {x_k} is adapted to the filtration generated by {γ_k} (which in the case of SGD is the random sampling of the functions f_r), whereas the process {X_t} is adapted to an independent Wiener filtration. Hence it is not helpful to compare individual sample paths. Rather, we define below a sense of weak approximation by comparing the distributions of the two processes.

Definition 2.11. Let G ⊆ C0 be the set of functions with at most polynomial growth,

G := {g ∈ C0(R^d, R) | ∃ κ1, κ2 > 0 s.t. |g(x)| ≤ κ1(1 + |x|^{2κ2}), ∀x ∈ R^d}.

We denote, for each integer α ≥ 1, by G^α the set of α-times continuously differentiable functions whose partial derivatives up to and including order α belong to G.

Note that G^α ⊆ C^α, the usual space of α-times continuously differentiable functions. Moreover, if g depends on additional parameters, we say that g ∈ G^α if the constants κ1, κ2 are independent of these parameters, i.e. g ∈ G^α uniformly. We last note that the definition generalizes to vector-valued functions coordinate-wise in the co-domain.

Definition 2.12 (Weak Approximation). Let T > 0, η ∈ (0, 1 ∧ T), let α ≥ 1 be an integer, and set N = ⌊T/η⌋. We say that a continuous-time stochastic process {Xt}t∈[0,T] is an order-α weak approximation of a discrete stochastic process {xk}k=0,...,N if for every g ∈ G^{α+1} there exists C > 0, independent of η, such that

max_{k=0,...,N} |Eg(xk) − Eg(Xkη)| ≤ Cη^α.


These are approximations of the distribution of sample paths, instead of the sample paths themselves. This is enforced by requiring that the expectations of the two processes over a sufficiently large class of test functions, which in particular includes all polynomials, be close. Indeed, under our definition all moments of the two processes become close at the rate η^α, and hence so must their distributions.

The notion of weak approximation must be contrasted with that of strong approximation, where one would for example require, in the case of mean-square approximations,

(E|xk − Xkη|²)^{1/2} ≤ Cη^α.

This forces the actual sample paths of the two processes to be close, per realization of the random process, which severely limits its application.

In fact, an advantage of our definition is that the approximating SDE can approximate the discrete stochastic processes whose step-wise driving noise is not Gaussian, which is exactly what we need to analyze general stochastic gradient iterations.

2.3 Approximation Theorems

The derivation is based on the following two-step process:

1. Establish a connection between one-step approximation and approxi-mation on a finite time interval;

2. Construct a one-step approximation that is of order α + 1, and so the approximation on a finite interval is of order α.

2.3.1 One-Step to N-step

Let us first consider generally the question of the relationship between one-step approximations and approximation on a finite time interval.

Let T > 0, η ∈ (0, 1 ∧ T), and N = ⌊T/η⌋, and recall the SG iterations

xk+1 = xk + η h(xk, γk, η), x0 ∈ R^d, k = 0, ..., N, (2.14)

and the general candidate family of approximating SDEs

dXt^{η,ε} = b(Xt^{η,ε}, η, ε)dt + √η σ(Xt^{η,ε}, η, ε)dWt, X0 = x0, t ∈ [0, T], (2.15)

where ε ∈ (0, 1) is a mollification parameter, whose role will become apparent later.

To reduce notation in this section and improve readability, unless some limiting procedure is considered, we shall not explicitly write the dependence of Xt^{η,ε} on η, ε and simply denote the solution of the SDE by Xt.

Let us also write X̃k := Xkη, and denote by {Xt^{x,s}}t≥s the stochastic process obeying the same SDE, but with the initial condition Xs^{x,s} = x. We last write X̃k^{x,l} := X_{kη}^{x,lη} and denote by {xk^{x,l}}k≥l the SG process with xl = x.

Assumption 2. The functions b : R^d × (0, 1 ∧ T) × (0, 1) → R^d and σ : R^d × (0, 1 ∧ T) × (0, 1) → R^{d×d} satisfy:

(i) uniform linear growth condition:

|b(x, η, ε)|² + |σ(x, η, ε)|² ≤ L²(1 + |x|²)

for all x ∈ R^d, η ∈ (0, 1 ∧ T), ε ∈ (0, 1);

(ii) uniform Lipschitz condition:

|b(x, η, ε) − b(y, η, ε)| + |σ(x, η, ε) − σ(y, η, ε)| ≤ L|x − y|

for all x, y ∈ R^d, η ∈ (0, 1 ∧ T), ε ∈ (0, 1).

Note that (ii) implies (i) if there exists at least one x at which the supremum of b, σ over η, ε is finite. In particular, these conditions imply existence and uniqueness of the solution to the SDE.

Now, let us denote the one-step changes

∆(x) := x1^{x,0} − x,  ∆̃(x) := X̃1^{x,0} − x. (2.16)

We prove the following result that relates one-step approximations with approximation on a finite time interval.

Theorem 2.3.1. Let T > 0, η ∈ (0, 1 ∧ T), ε ∈ (0, 1) and N = ⌊T/η⌋. Let α ≥ 1 be an integer. Suppose further that the following conditions hold:

(i) there exist a function ρ : (0, 1) → R+ and K1 ∈ G, independent of η, ε, such that

|E ∏_{j=1}^{s} ∆^{(ij)}(x) − E ∏_{j=1}^{s} ∆̃^{(ij)}(x)| ≤ K1(x)(ηρ(ε) + η^{α+1}), for s = 1, 2, ..., α,

and

E ∏_{j=1}^{α+1} |∆^{(ij)}(x)| ≤ K1(x) η^{α+1},

where ij ∈ {1, ..., d};

(27)

(ii) for each m ≥ 1, the 2m-th moment of xk^{x,0} is uniformly bounded with respect to k and η, i.e. there exists a K2 ∈ G, independent of η and k, such that

E|xk^{x,0}|^{2m} ≤ K2(x), for all k = 0, ..., N.

Then, for each g ∈ G^{α+1}, there exists a constant C > 0, independent of η, ε, such that

max_{k=0,...,N} |Eg(xk) − Eg(Xkη)| ≤ C(η^α + ρ(ε)).

The proof requires a number of technical results that we have listed in the appendix, where their proofs are fully presented. Below, we give a proof using the main ingredients and refer to the relevant results in the appendix when needed.

Proof. We introduce the alternative notation Xt(x, s) ≡ Xt^{x,s} to avoid nested superscripts, and similarly for X̃k and xk.

Fix g ∈ G^{α+1} and 1 ≤ k ≤ N. Taking expectations, we have

Eg(Xkη) = Eg(X̃k) = Eg(X̃k(X̃1, 1)) − Eg(X̃k(x1, 1)) + Eg(X̃k(x1, 1)).

If k > 1, by noting that X̃k(x1, 1) = X̃k(X̃2(x1, 1), 2), we get

Eg(X̃k(x1, 1)) = Eg(X̃k(X̃2(x1, 1), 2)) − Eg(X̃k(x2, 2)) + Eg(X̃k(x2, 2)).

Continuing the process, we then have

Eg(X̃k) = Σ_{l=1}^{k−1} [Eg(X̃k(X̃l(xl−1, l−1), l)) − Eg(X̃k(xl, l))] + Eg(X̃k(xk−1, k−1)),

and hence, subtracting Eg(xk) ≡ Eg(xk(xk−1, k−1)), we have

Eg(X̃k) − Eg(xk) = Σ_{l=1}^{k−1} [Eg(X̃k(X̃l(xl−1, l−1), l)) − Eg(X̃k(xl, l))] + Eg(X̃k(xk−1, k−1)) − Eg(xk(xk−1, k−1)),

and so, conditioning on the intermediate states,

Eg(X̃k) − Eg(xk) = Σ_{l=1}^{k−1} ( E[E[g(X̃k(X̃l(xl−1, l−1), l)) | X̃l(xl−1, l−1)]] − E[E[g(X̃k(xl, l)) | xl]] ) + Eg(X̃k(xk−1, k−1)) − Eg(xk(xk−1, k−1)).

Let us write u(x, s) := Eg(Xkη(x, s)). Then, we have

|Eg(X̃k) − Eg(xk)| ≤ Σ_{l=1}^{k−1} |E u(X̃l(xl−1, l−1), lη) − E u(xl(xl−1, l−1), lη)| + |E g(X̃k(xk−1, k−1)) − E g(xk(xk−1, k−1))|.

By Prop. B.0.4, u(·, s) ∈ G^{α+1} uniformly in s, t, η and ε. Thus, by assumption (i) and Lem. C.0.2 we get

|Eg(xk) − Eg(X̃k)| ≤ (ηρ(ε) + η^{α+1}) ( Σ_{l=1}^{k−1} E K(xl−1) + E K(xk−1) ) ≤ (ηρ(ε) + η^{α+1}) Σ_{l=0}^{N} κ1(1 + E|xl|^{2κ2}),

where in the last inequality we have used the moment estimate from Thm. A.0.2. Finally, using assumption (ii) and the fact that N ≤ T/η, we have

|Eg(xk) − Eg(Xkη)| = |Eg(xk) − Eg(X̃k)| ≤ C(ρ(ε) + η^α).

2.3.2 SME for SGD

Theorem 2.3.1 allows us to prove the main approximation results of the present work. In particular, in this section we derive a second-order accurate weak approximation for the simple SGD iterations, and we deduce from it a simpler first-order accurate approximation.

As seen in Thm. 2.3.1, we need to verify conditions (i)-(ii) in order to prove the weak approximation result. These conditions mostly involve moment estimates, which we now perform.

Let us set in (2.15)

b(x, η, ε) = b0(x, ε) + ηb1(x, ε), σ(x, η, ε) = σ0(x, ε),

where b0, b1, σ0 are functions to be determined. We can get the following estimate.


Lemma 2.3.2. Let ∆̃(x) be defined as in (2.16). Suppose further that b0, b1, σ0 ∈ G³. Then we have

(i) E∆̃^{(i)}(x) = b0(x, ε)^{(i)} η + [ (1/2) b0(x, ε)^{(j)} ∂(j)b0(x, ε)^{(i)} + b1(x, ε)^{(i)} ] η² + O(η³);

(ii) E∆̃^{(i)}(x)∆̃^{(j)}(x) = [ b0(x, ε)^{(i)} b0(x, ε)^{(j)} + σ0(x, ε)^{(i,k)} σ0(x, ε)^{(j,k)} ] η² + O(η³);

(iii) E ∏_{j=1}^{3} |∆̃^{(ij)}(x)| = O(η³).

Proof. To obtain (i)-(iii), we simply apply Lem. D.0.1 with ψ(z) = ∏_{j=1}^{s} (z^{(ij)} − x^{(ij)}) for s = 1, 2, 3 respectively.

Now, we estimate the moments of the SGA iteration below.

Lemma 2.3.3. Let ∆(x) be defined as in (2.16) with the SGD iterations, i.e. h(x, γ, η) = −∇fγ(x). Suppose that fγ ∈ G¹ for each γ ∈ Γ. Then:

(i) E∆^{(i)}(x) = −∂(i)f(x) η;

(ii) E∆^{(i)}(x)∆^{(j)}(x) = ∂(i)f(x) ∂(j)f(x) η² + Σ(x)^{(i,j)} η²;

(iii) E ∏_{j=1}^{3} |∆^{(ij)}(x)| = O(η³),

where Σ(x) := E(∇fγ(x) − ∇f(x))(∇fγ(x) − ∇f(x))^T.

Proof. We have ∆(x) = −η∇fγ0(x). Taking expectations, the results easily follow.
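These moment identities can be checked numerically on a toy example; the sketch below assumes the hypothetical 1-d sample objective fγ(x) = (1/2)λ(x − γ)² with γ ∼ N(0, 1), for which ∇f(x) = λx and Σ(x) = λ².

```python
import numpy as np

# Monte Carlo check of Lemma 2.3.3 for f_gamma(x) = 0.5*lam*(x - gamma)^2, gamma ~ N(0,1):
# grad f_gamma(x) = lam*(x - gamma), grad f(x) = lam*x, Sigma(x) = lam^2.
rng = np.random.default_rng(1)
lam, eta, x = 2.0, 0.05, 1.5
gam = rng.normal(size=1_000_000)

delta = -eta * lam * (x - gam)            # one-step change Delta(x)
m1 = delta.mean()                          # (i):  should equal -eta * grad f(x)
m2 = (delta**2).mean()                     # (ii): should equal eta^2*((grad f)^2 + Sigma)

print(m1, -eta * lam * x)
print(m2, eta**2 * ((lam * x)**2 + lam**2))
```

With 10^6 samples the Monte Carlo estimates match the lemma's predictions to three or four digits.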

We now prove the main approximation theorem for the simple SGD. Before presenting the statement and proof, we shall note a few technical issues that prevent the direct application of Thm. 2.3.1 with the moment estimates in the lemmas above.

The latter suggest ignoring ε and setting

b0(x, ε) = −∇f(x),  b1(x, ε) = −(1/4)∇|∇f(x)|²,  σ0(x, ε) = Σ(x)^{1/2}.

Then, we would see from the lemmas that the SGD and the SDE have matching moments up to O(η³).

The first issue with this approach is that even if Σ(x) is sufficiently smooth (which may follow from the regularity of ∇fγ), the smoothness of Σ(x)1/2 cannot be guaranteed unless Σ(x) is positive-definite, which is often too strong an assumption in practice and excludes interesting cases where Σ(x) is a singular diffusion matrix.


Second, we would like to consider functions fγ that may not possess the higher-order derivatives required by the lemmas, beyond those required to define the modified equation itself. To fix both of these issues, we will use a simple mollification technique. This is the reason for the inclusion of the ε parameter in the results.

Definition 2.17. Let us denote by υ : R^d → R, υ ∈ C_c^∞(R^d), the standard mollifier

υ(x) := C exp(−1/(1 − |x|²)) if |x| < 1, and υ(x) := 0 if |x| ≥ 1,

where the constant C is chosen so that ∫_{R^d} υ(y)dy = 1. Further, define υε(x) := ε^{−d} υ(x/ε).

Let ψ ∈ L¹_loc(R^d); then we may define its mollification by

ψε(x) := (υε ∗ ψ)(x) = ∫_{R^d} υε(x − y)ψ(y)dy = ∫_{B(0,ε)} υε(y)ψ(x − y)dy,

where B(0, ε) is the d-dimensional ball of radius ε centered at 0. The mollification of a vector (or matrix) valued function is defined element-wise.
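As a quick numerical illustration (a sketch, assuming d = 1), one can build υ and υε on a grid and check that mollifying a non-smooth function such as ψ(x) = |x| reproduces it away from the kink while smoothing it near 0:

```python
import numpy as np

def trapz(y, x):
    """Simple trapezoidal rule (avoids NumPy version differences)."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

def upsilon(x):
    """Unnormalized 1-d standard mollifier: exp(-1/(1-x^2)) on |x| < 1, 0 outside."""
    out = np.zeros_like(x)
    inside = np.abs(x) < 1
    out[inside] = np.exp(-1.0 / (1.0 - x[inside]**2))
    return out

grid = np.linspace(-1.5, 1.5, 6001)
C = 1.0 / trapz(upsilon(grid), grid)            # normalization so the integral of upsilon is 1

def mollify(psi, eps, xs):
    # (upsilon_eps * psi)(x) = integral over B(0, eps) of upsilon_eps(y) * psi(x - y) dy
    ys = np.linspace(-eps, eps, 2001)
    w = C * upsilon(ys / eps) / eps              # upsilon_eps(y) = eps^{-1} upsilon(y / eps)
    return np.array([trapz(w * psi(x - ys), ys) for x in xs])

m = mollify(np.abs, 0.05, np.array([-0.5, 0.0, 0.5]))
print(m)    # approximately (0.5, small positive value, 0.5)
```

Away from the kink the mollification agrees with |x| up to quadrature error, while at x = 0 it returns a small positive value of order ε, as expected.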

The mollifier has very useful properties; in particular we will need the following well-known facts.

Fact 4. 1. If ψ ∈ L¹_loc(R^d), then ψε ∈ C^∞(R^d);

2. ψε(x) → ψ(x) as ε → 0 for almost every x ∈ R^d (with respect to the Lebesgue measure);

3. If ψ is continuous, then ψε(x) → ψ(x) as ε → 0 uniformly on compact subsets of R^d.

Next, we make use of the idea of weak derivatives.

Definition 2.18. Let Ψ ∈ L¹_loc(R^d) and let J be a multi-index of order |J|. Suppose that there exists a ψ ∈ L¹_loc(R^d) such that

∫_{R^d} Ψ(x) ∇^J φ(x) dx = (−1)^{|J|} ∫_{R^d} ψ(x) φ(x) dx for all φ ∈ C_c^∞.

Then we call ψ the order-J weak derivative of Ψ and write D^J Ψ = ψ. Note that when it exists, the weak derivative is unique almost everywhere, and if Ψ is differentiable, then ∇^J Ψ = D^J Ψ almost everywhere.

The introduction of weak derivatives motivates the definition of the weak version of the function spaces Gα.


Definition 2.19. For α ≥ 1, we define the space G^α_w as the space of functions in L¹_loc(R^d) that have weak derivatives up to order α and such that, for each multi-index J with |J| ≤ α, there exist positive integers κ1, κ2 with

|D^J g(x)| ≤ κ1(1 + |x|^{2κ2}) for a.e. x ∈ R^d.

As in Def. 2.11, if g depends on additional parameters, we say that g ∈ G^α_w if the above constants do not depend on the additional parameters. Also, vector-valued g are defined as above element-wise in the co-domain. Note that G^α_w is a subspace of the Sobolev space W^{α,1}_loc.

Theorem 2.3.4. Let T > 0, η ∈ (0, 1 ∧ T) and set N = ⌊T/η⌋. Let {xk}k≥0 be the SGD iterations defined as

xk+1 = xk − η∇fγk(xk),

where each γk has the same distribution as γ. Suppose the following conditions are met:

(i) f ≡ Efγ is twice continuously differentiable, ∇|∇f|² is Lipschitz, and f ∈ G^4_w;

(ii) ∇fγ satisfies a Lipschitz condition:

|∇fγ(x) − ∇fγ(y)| ≤ Lγ|x − y| a.s.

for all x, y ∈ R^d, where Lγ is a random variable which is positive a.s. and ELγ^m < ∞ for each m ≥ 1.

Define {Xt}t∈[0,T] as a stochastic process satisfying the SDE

dXt = −∇( f(Xt) + (1/4)η|∇f(Xt)|² )dt + √η Σ(Xt)^{1/2} dWt, X0 = x0, (2.20)

with Σ(x) = E(∇fγ(x) − ∇f(x))(∇fγ(x) − ∇f(x))^T.

Then {Xt}t∈[0,T] is an order-2 weak approximation of the SGD, i.e. for each g ∈ G³, there exists a constant C > 0 independent of η such that

max_{k=0,...,N} |Eg(xk) − Eg(Xkη)| ≤ Cη².

Proof. First, we check that Eq. 2.20 admits a unique solution, which amounts to checking the conditions in Thm. A.0.1.

Note that by assumption (ii), ∇f = E∇fγ is Lipschitz with constant ELγ.

To see that Σ(x)^{1/2} is also Lipschitz, observe that u(x) := ∇fγ(x) − ∇f(x) is Lipschitz (in the sense of (ii), with constant at most Lγ + ELγ), and

|Σ(x)^{1/2} − Σ(y)^{1/2}| = | ‖[u(x)u(x)^T]^{1/2}‖_{L²(Ω)} − ‖[u(y)u(y)^T]^{1/2}‖_{L²(Ω)} | ≤ ‖[u(x)u(x)^T]^{1/2} − [u(y)u(y)^T]^{1/2}‖_{L²(Ω)}.

Moreover, observe that for vectors u ∈ R^d the function u ↦ (uu^T)^{1/2} = uu^T/|u| is Lipschitz, which implies

|Σ(x)^{1/2} − Σ(y)^{1/2}| ≤ L′ ‖u(x) − u(y)‖_{L²(Ω)} ≤ L″ |x − y|.

The Lipschitz conditions on the drift and the diffusion matrix imply uniform linear growth, so by T hm. A.0.1, Eq.2.20 admits a unique solution.

For each ε ∈ (0, 1), define the mollified functions

b0(x, ε) = −υε ∗ ∇f(x),  b1(x, ε) = −(1/4) υε ∗ (∇|∇f(x)|²),  σ0(x, ε) = υε ∗ Σ(x)^{1/2}.

Observe that b0 + ηb1 and σ0 satisfy a Lipschitz condition in x uniformly in η, ε. To see this, note that for any Lipschitz function ψ with constant L, we have

|υε ∗ ψ(x) − υε ∗ ψ(y)| ≤ ∫_{B(0,ε)} υε(z)|ψ(x − z) − ψ(y − z)|dz ≤ L|x − y|,

which proves that b0 + ηb1 and σ0 are uniformly Lipschitz. Similarly, the linear growth condition follows.

Hence, we may define a family of stochastic processes {Xt^ε}ε∈(0,1) satisfying

dXt^ε = (b0(Xt^ε, ε) + ηb1(Xt^ε, ε))dt + √η σ0(Xt^ε, ε)dWt, X0^ε = x0,

each of which admits a unique solution by Thm. A.0.1.

Now we claim that b0(·, ε), b1(·, ε), σ0(·, ε) ∈ G³ uniformly in ε. To see this, simply observe that mollifications are smooth and that υε ∗ D^J ψ = ∇^J(υε ∗ ψ); moreover, the polynomial growth is preserved since, if ψ ∈ G, then

|ψε(x)| ≤ ∫_{B(0,ε)} υε(y)|ψ(x − y)|dy ≤ κ1( 1 + 2^{2κ2−1}|x|^{2κ2} + 2^{2κ2−1} ε^{−d} ∫_{B(0,ε)} |y|^{2κ2} dy ),

using |x − y|^{2κ2} ≤ 2^{2κ2−1}(|x|^{2κ2} + |y|^{2κ2}). But we have

ε^{−d} ∫_{B(0,ε)} |y|^{2κ2} dy ≤ C ε^{2κ2} ≤ C,

where C is independent of ε. This shows that ψε ∈ G uniformly in ε, which immediately implies that b0(·, ε), b1(·, ε), σ0(·, ε) ∈ G³ uniformly in ε.

Now, since b0(x, ε) → b0(x, 0) as ε → 0 (similarly for b1, σ0) and the limits are continuous, by Lem. 2.3.2, 2.3.3, D.0.2 and D.0.3 all conditions of Thm. 2.3.1 are satisfied, and hence we conclude that for each g ∈ G³

max_{k=0,...,N} |Eg(X^ε_{kη}) − Eg(xk)| ≤ C(η² + ρ(ε)),

where C is independent of η, ε and ρ(ε) → 0 as ε → 0.

Moreover, since b0^ε, b1^ε, σ0^ε converge uniformly on compact sets (to b0(x, 0), b1(x, 0), σ0(x, 0), respectively), we may apply Thm. A.0.3 to conclude that

sup_{t∈[0,T]} E|Xt^ε − Xt|² → 0 as ε → 0.

Thus, we have

|Eg(Xkη) − Eg(xk)| ≤ |Eg(X^ε_{kη}) − Eg(xk)| + |Eg(X^ε_{kη}) − Eg(Xkη)| ≤ C(η² + ρ(ε)) + (E|X^ε_{kη} − Xkη|²)^{1/2} ( ∫_0^1 E|∇g(λX^ε_{kη} + (1 − λ)Xkη)|² dλ )^{1/2}.

Using Thm. A.0.2 and the fact that ∇g ∈ G, the last expectation is finite, and hence taking the limit ε → 0 yields our result.

By going for a lower-order approximation, we obtain the following.

Corollary 2.3.5. Assume the same conditions as in Thm. 2.3.4, except that we replace (i) with

(i)′ f ≡ Efγ is continuously differentiable and f ∈ G³_w.

Define {Xt}t∈[0,T] as a stochastic process satisfying the SDE

dXt = −∇f(Xt)dt + √η Σ(Xt)^{1/2} dWt, X0 = x0, (2.21)

with Σ(x) = E(∇fγ(x) − ∇f(x))(∇fγ(x) − ∇f(x))^T.

Then {Xt}t∈[0,T] is an order-1 weak approximation of the SGD, i.e. for each g ∈ G², there exists a constant C > 0 independent of η such that

max_{k=0,...,N} |Eg(xk) − Eg(Xkη)| ≤ Cη.
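The O(η) rate of Cor. 2.3.5 can be checked numerically on a 1-d quadratic, where both sides are available in closed form; the sketch below assumes the hypothetical sample objective fγ(x) = (1/2)λ(x − γ)², γ ∼ N(0, 1), and the test function g(x) = x².

```python
import numpy as np

# SGD: x_{k+1} = (1 - eta*lam) x_k + eta*lam*gamma_k, so s_k := E x_k^2 satisfies
# the exact recursion s_{k+1} = (1 - eta*lam)^2 s_k + eta^2 lam^2.
# Order-1 SME (2.21): dX = -lam*X dt + sqrt(eta)*lam dW (an OU process), with
# E X_t^2 = x0^2 e^{-2 lam t} + 0.5*eta*lam*(1 - e^{-2 lam t}).
def weak_error(eta, lam=1.0, x0=1.0, T=2.0):
    N = int(T / eta)
    s, errs = x0**2, []
    for k in range(N + 1):
        t = k * eta
        ou = x0**2 * np.exp(-2*lam*t) + 0.5*eta*lam*(1 - np.exp(-2*lam*t))
        errs.append(abs(s - ou))                 # weak error for g(x) = x^2
        s = (1 - eta*lam)**2 * s + eta**2 * lam**2
    return max(errs)

e1, e2 = weak_error(0.1), weak_error(0.05)
print(e1, e2, e1 / e2)
```

Halving η roughly halves the maximal weak error, consistent with an order-1 approximation.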

Observation 1. • In the above results, the most restrictive condition is probably the Lipschitz condition on ∇fγ. Such Lipschitz conditions are important to ensure that the SMEs admit unique strong solutions and that the SGD iterates have uniformly bounded moments.

Note that following similar techniques in SDE analysis (e.g. Kloeden and Platen (2011) [KP92]), these global conditions may be relaxed to their respective local versions if we assume in addition a uniform global linear growth condition on ∇fγ.

Finally, for applications, typical loss functions have inward pointing gradients for all sufficiently large x, meaning that the SGD iterates will be uniformly bounded almost surely. Thus, we may simply modify the loss functions for large x (without affecting the SGA iterates) to satisfy the conditions above;

• The constant C does not depend on η but, as evidenced in the proof of the theorem, it generally depends on g, T, d and the various Lipschitz constants. For the fairly general situation we are considering, we do not derive tight estimates of these dependencies.

2.3.3 SME for momentum-SGD

As an example of application to variants of SGD, we study the SME for a popular variant called momentum-SGD (MSGD). We can understand MSGD as the usual SGD iteration with a "memory" term. In its most standard form, we have the iterations:

v̂k+1 = µ̂ v̂k − η̂ ∇fγk(xk);
xk+1 = xk + v̂k+1,

where µ̂ ∈ (0, 1), typically close to 1, is called the momentum parameter and η̂ is the learning rate.

Let us consider a rescaled version of the standard MSGD to make the analysis and the computation easier via the continuous-time approximations. For this purpose we set

η := √η̂,  vk := v̂k/η,  µ := (1 − µ̂)/η, (2.22)

obtaining

vk+1 = vk − µηvk − η∇fγk(xk);
xk+1 = xk + ηvk+1. (2.23)
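The rescaling (2.22) is exact at the level of iterates: running the standard MSGD and the rescaled iteration (2.23) with the same γk draws produces identical x-trajectories. A minimal numerical check, assuming the hypothetical quadratic fγ(x) = (1/2)λ(x − γ)²:

```python
import numpy as np

rng = np.random.default_rng(2)
lam, eta_hat, mu_hat, steps = 1.0, 0.01, 0.9, 200
grad = lambda x, g: lam * (x - g)       # grad f_gamma for f_gamma(x) = 0.5*lam*(x-gamma)^2
gam = rng.normal(size=steps)

eta = np.sqrt(eta_hat)                  # eta := sqrt(eta_hat)
mu = (1.0 - mu_hat) / eta               # mu  := (1 - mu_hat)/eta

xh, vh = 1.0, 0.0                       # standard MSGD iterates
x, v = 1.0, 0.0                         # rescaled iterates, v_k := vhat_k / eta
for g in gam:
    vh = mu_hat * vh - eta_hat * grad(xh, g)
    xh = xh + vh
    v = v - mu * eta * v - eta * grad(x, g)
    x = x + eta * v

print(abs(x - xh))    # zero up to floating-point round-off
```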

Here the range of momentum parameters we consider becomes µ ∈ (0, η̂^{−1/2}) (note that, since η is usually small, we can consider (0, ∞) for simplicity).

We are now ready to derive the SME satisfied by the new system of equations. Observe that this is again a special case of (2.14), with x replaced by (v, x) and

h(v, x, γ, η) = (−µv − ∇fγ(x), v − ηµv − η∇fγ(x)).


As we have already seen in Thm. 2.3.4, in order to derive the SMEs we need to match moments up to order 3. For this reason we define the one-step difference

∆(v, x) := (v1^{v,x,0} − v, x1^{v,x,0} − x).

We can derive the following expansions.

Lemma 2.3.6. Let ∆(v, x) be the one-step difference. Then:

(i) E∆^{(i)}(v, x) = η(−µv^{(i)} − ∂(i)f(x), v^{(i)}) + η²(0, −µv^{(i)} − ∂(i)f(x));

(ii) E∆^{(i)}(v, x)∆^{(j)}(v, x) = η²A + O(η³);

(iii) E ∏_{j=1}^{3} |∆^{(ij)}(v, x)| = O(η³),

where, as usual, Σ(x) := E(∇fγ(x) − ∇f(x))(∇fγ(x) − ∇f(x))^T and A is the matrix

A = ( ♣   −µv^{(i)}v^{(j)} − v^{(i)}∂(j)f(x) ;  −µv^{(i)}v^{(j)} − v^{(j)}∂(i)f(x)   v^{(i)}v^{(j)} ).

Here ♣ is defined as

µ²v^{(i)}v^{(j)} + µv^{(i)}∂(j)f(x) + µv^{(j)}∂(i)f(x) + Σ(x)^{(i,j)} + ∂(i)f(x)∂(j)f(x).

Proof. The proof is similar to that of the previous lemmas and follows from direct calculation of the moments.

Hence, similarly as in the previous lemmas, we see that in order to match the moments we may set

b0(v, x) = (−µv − ∇f(x), v),
b1(v, x) = −(1/2)( µ[µv + ∇f(x)] − ∇²f(x)v, µv + ∇f(x) ),
σ0(v, x) = ( Σ(x)^{1/2}  0 ; 0  0 ).

Proceeding as in the proof of Theorem 2.3.4, we arrive at the following approximation result, where we see that the SME for momentum-SGD takes the form of a Langevin equation.

Theorem 2.3.7. Assume the same conditions as in Thm. 2.3.4. Let µ > 0 be fixed and define {(Vt, Xt) : t ∈ [0, T]} as the stochastic process satisfying the SDE

dVt = −[ (µI + (1/2)η[µ²I − ∇²f(Xt)])Vt + (1 + (1/2)ηµ)∇f(Xt) ]dt + √η Σ(Xt)^{1/2} dWt, V0 = v0,
dXt = [ (1 − (1/2)ηµ)Vt − (1/2)η∇f(Xt) ]dt, X0 = x0, (2.24)


with Σ(x) as defined in Thm. 2.3.4.

Then, {(Vt, Xt) : t ∈ [0, T]} is an order-2 weak approximation of momentum-SGD. Moreover, under the relaxed assumptions of Cor. 2.3.5, we have the order-1 weak approximation

dVt = −[µVt + ∇f(Xt)]dt + √η Σ(Xt)^{1/2} dWt, V0 = v0,
dXt = Vt dt, X0 = x0. (2.25)

2.3.4 SME for Nesterov-SGD

We can also obtain the SME for the stochastic gradient version of the Nesterov accelerated gradient (NAG) method [NES83], which we refer to as SNAG.

In the non-stochastic case, the NAG method has been analyzed using the ODE approach [SBC14]; the derivations in this section can therefore be viewed as a stochastic parallel.

The NAG method is sometimes used with stochastic gradients, and hence it is useful to analyze its properties in this setting and compare it to MSGD. The unscaled NAG iterations are

v̂k+1 = µ̂k v̂k − η̂ ∇fγ(xk + µ̂k v̂k);
xk+1 = xk + v̂k+1,

with v̂0 = 0, which differs from the momentum iterations in that the gradient is now evaluated at a "predicted" position xk + µ̂k v̂k, instead of the original position xk.

Moreover, the momentum parameter is allowed to vary as k increases (the usual choice is µ̂k = (k − 1)/(k + 2)); this has important links to stability and acceleration in the deterministic case.
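A minimal sketch of the unscaled SNAG iteration with µ̂k = (k − 1)/(k + 2), on a hypothetical 1-d quadratic sample objective fγ(x) = (1/2)(x − γ)² with small gradient noise (both illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)
eta_hat, steps = 0.01, 500
grad = lambda x, g: x - g          # grad f_gamma for f_gamma(x) = 0.5*(x - gamma)^2

x, v = 5.0, 0.0
for k in range(steps):
    mu_k = (k - 1.0) / (k + 2.0)   # varying momentum parameter
    g = 0.1 * rng.normal()          # gamma_k ~ N(0, 0.01): small noise for illustration
    v = mu_k * v - eta_hat * grad(x + mu_k * v, g)   # gradient at the look-ahead point
    x = x + v

print(x)    # ends up fluctuating near the minimizer x = 0
```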

In particular, as shown in [NES83], it achieves O(1/k2) convergence rate for general convex functions.

We shall first consider the case of a constant momentum parameter µ̂k ≡ µ̂, and then the choice of varying momentum.

Constant momentum: Using the same rescaling as for MSGD, we have

vk+1 = vk − µηvk − η∇fγk(xk + η(1 − µη)vk);
xk+1 = xk + ηvk+1, (2.26)

which is (2.14) with

h(v, x, γ, η) = (−µv − ∇fγ(x + η(1 − µη)v), v − ηµv − η∇fγ(x + η(1 − µη)v)).

The following moment expansion can be derived.


Lemma 2.3.8. Let ∆(v, x) be as before. We have:

(i) E∆^{(i)}(v, x) = η(−µv^{(i)} − ∂(i)f(x), v^{(i)}) + η²(∂(i)∂(j)f(x)v^{(j)}, −µv^{(i)} − ∂(i)f(x + v)) + O(η³);

(ii) E∆^{(i)}(v, x)∆^{(j)}(v, x) = η²A + O(η³);

(iii) E ∏_{j=1}^{3} |∆^{(ij)}(v, x)| = O(η³),

where, as usual, Σ(x) := E(∇fγ(x) − ∇f(x))(∇fγ(x) − ∇f(x))^T and A is the matrix

A = ( ♣   −µv^{(i)}v^{(j)} − v^{(i)}∂(j)f(x + v) ;  −µv^{(i)}v^{(j)} − v^{(j)}∂(i)f(x + v)   v^{(i)}v^{(j)} ).

Here ♣ is defined as

µ²v^{(i)}v^{(j)} + µv^{(i)}∂(j)f(x + v) + µv^{(j)}∂(i)f(x + v) + Σ(x + v)^{(i,j)} + ∂(i)f(x + v)∂(j)f(x + v).

Proof. The proof follows from direct calculation of the moments and Taylor expansion, as in the previous lemmas.

We can match the moments by setting

b0(v, x) = (−µv − ∇f(x), v),
b1(v, x) = −(1/2)( µ[µv + ∇f(x)] + ∇²f(x)v, µv + ∇f(x) ),
σ0(v, x) = ( Σ(x)^{1/2}  0 ; 0  0 ),

where the sign of the ∇²f(x)v term is opposite to the MSGD case, and we obtain the following approximation result for SNAG.

Theorem 2.3.9. Assume the same conditions as in Thm. 2.3.4. Define {(Vt, Xt) : t ∈ [0, T]} as the stochastic process satisfying the SDE

dVt = −[ (µI + (1/2)η{µ²I + ∇²f(Xt)})Vt + (1 + (1/2)ηµ)∇f(Xt) ]dt + √η Σ(Xt)^{1/2} dWt, V0 = v0; (2.27)
dXt = [ (1 − (1/2)ηµ)Vt − (1/2)η∇f(Xt) ]dt, X0 = x0, (2.28)

with Σ as defined in Thm. 2.3.4.

Then, {(Vt, Xt) : t ∈ [0, T]} is an order-2 weak approximation of SNAG. Moreover, the same order-1 weak approximation (2.25) of MSGD holds for the stochastic Nesterov accelerated gradient.


The result above shows that for constant momentum parameters the modified equations for MSGD and SNAG are equivalent at leading order, but differ when we consider the second-order modified equation. Let us now discuss the case where the momentum parameter is allowed to vary.

Varying momentum: Let us take µ̂ varying as µ̂k = (k − 1)/(k + 2). Using the same rescaling as before,

η := √η̂,  vk := v̂k/η,  µk := (1 − µ̂k)/η,

we arrive at

vk+1 = vk − µkηvk − η∇fγ(xk + η(1 − µkη)vk);
xk+1 = xk + ηvk+1, (2.29)

where we can write µk = 3/(2η + kη).

Now, in order to apply our theoretical results to deduce the SME, simply notice that we may introduce an auxiliary scalar variable

zk+1 = zk + η, z0 = 0.

Then µk = 3/(2η + zk), and hence no term is explicitly k-dependent, so we may proceed formally as in the previous sections to arrive at the order-1 SME for SNAG with varying momentum:

dVt = −[ (3/t)Vt + ∇f(Xt) ]dt + √η Σ(Xt)^{1/2} dWt, V0 = v0,
dXt = Vt dt, X0 = x0. (2.30)

As we can see, this result is formal, because the term 3/t does not satisfy our global Lipschitz conditions unless we restrict to some interval [t0, T] with t0 > 0, in which case the above result becomes rigorous.

2.4 Toy Model of the SMEs to the analysis of SGD

In this section, we apply the SME framework developed above to analyze the dynamics of the stochastic gradient descent algorithms discussed so far. We shall focus on simple but non-trivial models where, to a large extent, analytical computations using the SME are tractable, giving us key insights into the algorithms that are otherwise difficult to obtain without appealing to the continuous formalism presented in this paper.


Model: Let H ∈ R^{d×d} be a symmetric, positive definite matrix. Define the sample objective

fγ(x) := (1/2)(x − γ)^T H(x − γ) − (1/2)Tr(H),  γ ∼ N(0, I_d), (2.31)

which gives the total objective f(x) ≡ Efγ(x) = (1/2)x^T Hx.

2.4.1 Analysis of SGD

We first derive the SME associated with (2.31), considering, for the sake of simplicity, the order-1 SME, as in (2.21). Since ∇fγ(x) = H(x − γ) and ∇f(x) = Hx, we compute

Σ(x) = E(∇fγ(x) − ∇f(x))(∇fγ(x) − ∇f(x))^T = E(H(x − γ) − Hx)(H(x − γ) − Hx)^T = E(Hγ)(Hγ)^T = H².

This proves that Σ(x) = H², so the SME for SGD applied to model (2.31) is

dXt = −HXt dt + √η H dWt.

This is a multi-dimensional Ornstein-Uhlenbeck (OU) process and admits the explicit solution

Xt = e^{−tH}( x0 + √η ∫_0^t e^{sH} H dWs ).

Observe that for each t ≥ 0, the distribution of Xt is Gaussian. Using Itô's isometry, we then deduce the dynamics of the objective function:

Ef(Xt) = (1/2)x0^T H e^{−2tH} x0 + (1/2)η ∫_0^t Tr(H³ e^{−2(t−s)H}) ds = (1/2)x0^T H e^{−2tH} x0 + (1/4)η Σ_{i=1}^{d} λi(H)²(1 − e^{−2tλi(H)}). (2.32)

As we can see from the computation, the first term decays linearly with asymptotic rate 2λd(H), and the second term is induced by noise, and its asymptotic value is proportional to the learning rate η. This is the well known two-phase behavior of SGD under constant learning rates: an initial descent phase induced by the deterministic gradient flow and an eventual fluctuation phase dominated by the variance of the stochastic gradients. In this sense, the SME makes the same predictions, and in fact we can see that


Figure 2.1: SME prediction vs SGD dynamics. (a) SME as a weak approximation of SGD: the weak error is computed with test function g = f. As predicted by the analysis, the order-2 SME (resp. order-1) should give a slope-2 (resp. slope-1) decrease in error as η decreases. The SME solution is computed using the exact formula derived via the Itô isometry, and the SGD expectation is averaged over 1e6 runs, T = 2.0. We can see that the predictions of Thm. 2.3.4 and Cor. 2.3.5 hold. (b) Descent rate vs condition number: H is generated with different condition numbers and the resulting descent rate of SGD is approximately ∝ k(H)^{−1}, as predicted.

it approximates the SGD iterations well as η decreases (Fig. 2.1), according to the rates we derived in Thm. 2.3.4 and Cor. 2.3.5.

Moreover, notice that by the identification t = kη (k being the SGD iteration number), the SME analysis tells us that the asymptotic linear convergence rate (in k, i.e. rate ∼ −log(Ef(xk))/k) in the descent phase of the SGD is 2λd(H)η.

In the numerical stability framework (even in the non-stochastic case), we usually require η ∝ 1/λ1(H); thus the maximal descent rate is inversely proportional to the condition number k(H) = λ1(H)/λd(H). We validate this observation by generating a collection of H's with varying condition numbers and applying SGD with η ∝ 1/λ1(H). In Fig. 2.1 we plot the initial descent rates versus the condition number of H, and we observe that we indeed have rate ∝ k(H)^{−1}.
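The agreement between the exact SGD moments and the SME prediction (2.32) can be checked directly, since for the quadratic model the second moment of SGD obeys an exact recursion (a sketch; the diagonal H below is an illustrative choice):

```python
import numpy as np

# SGD on model (2.31): x_{k+1} = x_k - eta*H*(x_k - gamma_k), gamma_k ~ N(0, I),
# so S_k := E x_k x_k^T satisfies S_{k+1} = (I - eta*H) S_k (I - eta*H) + eta^2 H^2,
# and E f(x_k) = 0.5*Tr(H S_k).  Compare with the SME prediction (2.32).
H = np.diag([1.0, 0.1])
eta, T = 0.05, 2.0
x0 = np.array([1.0, 1.0])
lam = np.diag(H)

S = np.outer(x0, x0)
M = np.eye(2) - eta * H
err = 0.0
for k in range(int(T / eta) + 1):
    t = k * eta
    sgd = 0.5 * np.trace(H @ S)
    sme = (0.5 * np.sum(lam * np.exp(-2*t*lam) * x0**2)   # 0.5 x0^T H e^{-2tH} x0 (H diagonal)
           + 0.25 * eta * np.sum(lam**2 * (1 - np.exp(-2*t*lam))))
    err = max(err, abs(sgd - sme))
    S = M @ S @ M + eta**2 * H @ H

print(err)    # small relative to f(x0) = 0.55, and shrinks with eta
```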

Alternate Model: Now we consider a slight variation of model (2.31). The goal is to show that the dynamics of SGD (and of the corresponding SME) is not always Gaussian-like, and thus modeling the SGD by an O-U process is not always valid.

Given the same positive-definite matrix H, we diagonalize it as H = QDQ^T, where Q is an orthogonal matrix and D is the diagonal matrix of eigenvalues.


We then define the sample objective

fγ(x) := (1/2)(Q^T x)^T [D + diag(γ)](Q^T x),  γ ∼ N(0, I), (2.33)

which gives the same total objective f(x) ≡ Efγ(x) = (1/2)x^T Hx. However, we have a different expression for Σ(x),

Σ(x) = Q diag(Q^T x)² Q^T,

which gives the SME

dXt = −HXt dt + √η Q|diag(Q^T Xt)|Q^T dWt, equal in distribution to dXt = −HXt dt + √η Q diag(Q^T Xt) Q^T dWt.

We can rewrite the above as

dXt = −HXt dt + √η Σ_{l=1}^{d} Q(l) Xt dW(l),t,

where Q(l) = Q diag(Q(l,·)) Q^T and Q(l,·) denotes the l-th row of Q.

Noting that, by the choice of Q and H, every pair of matrices in {H, Q(1), ..., Q(d)} commutes, we can find the explicit solution

Xt = exp( −(1/2)ηt + √η Σ_{l=1}^{d} Q(l) W(l),t ) e^{−Ht} x0.

This equation can be seen as a multidimensional Black-Scholes process; in particular, the distribution is not Gaussian for any t > 0. Taking the expectation we get

Ef(Xt) = (1/2) e^{ηt} x0^T H e^{−2Ht} x0.

We can study its behavior: if η < 2λd(H), then 2H − ηI is positive definite, and so Ef(Xt) → 0 exponentially at constant, non-zero η; otherwise, depending on the initial condition, x0 may fail to converge to 0. If η > 2λd(H) (which can happen when the condition number of H is large enough) and x0 is in general position, then we have asymptotic exponential divergence.
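The threshold η ≈ 2λd(H) can already be seen in a 1-d instance of (2.33) (Q = 1, D = λ), where the exact per-step growth factor of E xk² is available in closed form (a sketch):

```python
# 1-d alternate model: f_gamma(x) = 0.5*(lam + gamma)*x^2, gamma ~ N(0,1), so SGD reads
# x_{k+1} = (1 - eta*(lam + gamma_k)) x_k and E x_k^2 is multiplied by
# (1 - eta*lam)^2 + eta^2 at each step; the SME predicts
# E f(X_t) = 0.5*lam*x0^2*exp((eta - 2*lam)*t), i.e. divergence iff eta > 2*lam
# (the exact discrete threshold is 2*lam/(1 + lam^2), essentially 2*lam for small lam).
lam = 0.01

def growth_factor(eta):
    return (1 - eta * lam)**2 + eta**2   # per-step factor for E x_k^2

for eta in (0.005, 0.03):                # below and above the SME threshold 2*lam = 0.02
    print(eta, growth_factor(eta) > 1.0, eta > 2 * lam)
```

For η below the threshold the second moment contracts, and above it the variance-induced explosion sets in, matching the SME prediction.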

We follow the study in [LTE18] and analyze this model as in Fig. 2.2. As we have already noticed,

Ef(Xt) = (1/2) e^{ηt} x0^T H e^{−2Ht} x0

exhibits a variance-induced divergence, typically observed in Black-Scholes and geometric Brownian motion type stochastic processes. The term "variance-induced" is important here, since the deterministic part of the evolution equation is mean-reverting and in fact is identical to the stable OU process studied earlier.

In Fig. 2.2, we show the correspondence of the SME findings with the actual dynamics of the SGD iterations. In particular, we see that for small η we have exponential convergence of the SGD at constant learning rates, whereas for η > 2λd(H) the SGD iterates start to oscillate wildly and their mean value is dominated by a few large values and diverges approximately at the rate predicted by the SME.

Note that this divergence is predicted to occur at a finite η, and from the theory developed so far we cannot conclude that the SME approximation always holds accurately in this regime (the approximation is only guaranteed for η sufficiently small). Nevertheless, we observe, at least in this model, that the variance-induced divergence of the SGD happens as predicted by the SME.

Figure 2.2: SME prediction vs SGD dynamics in the model variant. (a) Order of convergence of the SME to the SGD: the setup is the same as in Fig. 2.1, and the analysis predicts the correct rate of weak error decay as η decreases. (b) SGD paths vs order-1 SME prediction: solid lines are SME exact solutions, dotted lines are the mean of SGD over 500 runs, and the 25-75 percentiles are shaded; we can observe convergence of Ef at constant η. (c) Variance-induced explosion: as predicted by the SME analysis, if η > 2λd(H) (here, λd(H) = 0.01) variance-induced instability sets in.


2.4.2 Analysis of MSGD

Let us now use the SME to analyze MSGD applied to the model. We have shown earlier that Σ(x) = H². Thus, according to Thm. 2.3.7, the order-1 SME for momentum stochastic gradient descent is

dVt = −[µVt + HXt]dt + √η H dWt,
dXt = Vt dt, (2.34)

where X0 = x0 and V0 = v0.

If we set Yt := (Vt, Xt) ∈ R^{2d}, let Ut be a 2d-dimensional Brownian motion whose first d coordinates equal Wt, and define the block matrices

A := ( µI  H ; −I  0 ),  B := ( H  0 ; 0  0 ),

we can then rewrite (2.34) as

dYt = −AYt dt + √η B dUt, Y0 = (0, x0).

By Itô's isometry, we have

Ef(Xt) = (1/2)( |diag(0, H)^{1/2} e^{−At} Y0|² + η ∫_0^t |diag(0, H)^{1/2} e^{−(t−s)A} B|² ds ).

We can see that a similar two-phase behavior is present, but the properties of the descent phase are now linked to the spectral properties of the matrix A (instead of H).

For this reason, before proceeding, we first observe that the eigenvalues of A can be written as
\[
\lambda(A) = \{\lambda_{+,i}, \lambda_{-,i}\}_{i=1}^{d}, \qquad
\lambda_{\pm,i} := \frac{1}{2}\left(\mu \pm \sqrt{\mu^2 - 4\lambda_i(H)}\right), \quad i = 1, \dots, d.
\]

As we can see, as long as μ > 0 (and H is positive definite), we have Re(λ_{±,i}) > 0 for all i. We now state the following lemma concerning the decay of the norm of $e^{-tA}$ when all eigenvalues of A have positive real part.

Lemma 2.4.1. Let A be a real square matrix all of whose eigenvalues have positive real part. Then:

1. for each ε > 0 there exists a constant C_ε > 0, independent of t but dependent on ε, such that
\[
|e^{-tA}| \le C_\varepsilon\, e^{-t\left(\min_i \operatorname{Re}\lambda_i(A) - \varepsilon\right)};
\]

2. if in addition A is diagonalizable, then there exists a constant C > 0, independent of t, such that
\[
|e^{-tA}| \le C\, e^{-t \min_i \operatorname{Re}\lambda_i(A)}.
\]

Proof. We first prove the first inequality; the second then follows easily.

(i) A is similar to a Jordan matrix J: there exists $P \in GL(2d)$ such that $A = P J P^{-1}$, hence
\[
e^{-tA} = P\, e^{-tJ}\, P^{-1},
\]
so that $|e^{-tA}| \le |P|\,|e^{-tJ}|\,|P^{-1}|$. Each Jordan block can be written as $J_k = \lambda_k I + N_k$, where $N_k$ is nilpotent ($N_k^k = 0$). Hence
\[
e^{-tJ_k} = e^{-\lambda_k t}\, e^{-t N_k}
= e^{-\lambda_k t} \sum_{m=0}^{k-1} \frac{(-t)^m}{m!}\, N_k^m
= e^{-(\lambda_k - \varepsilon)t} \left[\, \sum_{m=0}^{k-1} \frac{(-t)^m}{m!}\, N_k^m\, e^{-\varepsilon t} \right].
\]

Let us look at the bracketed sum. Since the nilpotent part of a Jordan block satisfies $|N_k^m| \le 1$, and $t^m e^{-\varepsilon t}$ is uniformly bounded for $t \ge 0$,
\[
\left|\, \sum_{m=0}^{k-1} \frac{(-t)^m}{m!}\, N_k^m\, e^{-\varepsilon t} \right|
\le \sum_{m=0}^{k-1} \frac{t^m}{m!}\, e^{-\varepsilon t}
\le C_\varepsilon.
\]
Thus for each ε > 0 the sum is uniformly bounded in t. Adding up the contributions of all the Jordan blocks, and noting that the slowest decay comes from the eigenvalue of A with the smallest real part, we obtain the result.

(ii) Arguing as in (i), we now have $A = P D P^{-1}$ with D the diagonal matrix of the eigenvalues of A and $P \in GL(2d)$. Defining $Q := P^{*} P$, we have
\[
|e^{-tA}|^2 = \operatorname{Tr}\!\left(e^{-tA^{*}} e^{-tA}\right)
= \operatorname{Tr}\!\left(Q^{-1} e^{-t D^{*}}\, Q\, e^{-tD}\right),
\]
which, since each entry of $e^{-tD}$ decays like $e^{-t \operatorname{Re}\lambda_i(A)}$, yields the estimate directly, without any dependence on an arbitrary ε.

Using the above results we can characterize the decay of the objective under momentum SGD.

From the eigenvalues of A,
\[
\lambda_{\pm,i} = \frac{1}{2}\left(\mu \pm \sqrt{\mu^2 - 4\lambda_i(H)}\right), \quad i = 1, \dots, d,
\]
we see that as long as $\mu^2 \neq 4\lambda_i(H)$ for all $i = 1, \dots, d$, A has 2d distinct eigenvalues and is therefore diagonalizable.
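The closed form for λ_{±,i} and the decay bound of Lemma 2.4.1 are easy to verify numerically; the sketch below assumes a diagonal H, with a friction μ chosen so that some modes are overdamped and others oscillatory (the values and the constant C = 10 are illustrative).

```python
import numpy as np
from scipy.linalg import expm

# Check lambda_{+/-,i} = (mu +/- sqrt(mu^2 - 4 lambda_i(H))) / 2 against
# numerically computed eigenvalues of A, then test the decay of |e^{-tA}|.
d = 3
lam = np.array([0.1, 1.0, 2.0])          # eigenvalues of H (H diagonal)
mu = 1.5
A = np.block([[mu * np.eye(d), np.diag(lam)],
              [-np.eye(d), np.zeros((d, d))]])

disc = np.emath.sqrt(mu**2 - 4 * lam)    # complex sqrt where mu^2 < 4 lambda_i
closed = np.concatenate([(mu + disc) / 2, (mu - disc) / 2])
numeric = np.linalg.eigvals(A)
# every closed-form eigenvalue appears in the numerical spectrum
assert all(np.min(np.abs(numeric - c)) < 1e-8 for c in closed)

# Lemma 2.4.1: |e^{-tA}| <= C exp(-t min_i Re(lambda_i(A))) for diagonalizable A
rate = closed.real.min()
for t in (5.0, 10.0):
    assert np.linalg.norm(expm(-t * A), 2) <= 10 * np.exp(-rate * t)
print("eigenvalue formula and decay bound verified")
```

Here the smallest real part comes from the overdamped mode λ_1(H) = 0.1, which therefore dictates the asymptotic rate of descent of the objective under momentum SGD.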
