Università di Pisa
DEPARTMENT OF MATHEMATICS, Master's Degree Programme in Mathematics
Master's Thesis
Defenses against Adversarial Examples
in
Deep Neural Networks
Candidate:
Fabio Brau
Supervisor:
Prof. Giancarlo Bigi
Co-examiner:
Prof. Federico Poloni
Contents

Introduction

1 Neural Networks
  1.1 Sequential Fully Connected Networks
  1.2 Sequential Convolutional Networks
  1.3 Neural Network with Skips
  1.4 Universal Approximation Theorem
  1.5 Error Bounds for ReLU Networks
  1.6 Training a Feed Forward Network

2 Defenses against Adversarial Examples
  2.1 Adversarial Examples
  2.2 Robust Classification
  2.3 Relaxation of the Adversarial Polytope
  2.4 Test of Robustness via Dual Problem
  2.5 Robust Training

3 Defenses for Leaky ReLU Networks
  3.1 Leaky Dual Bound

4 Experiments
  4.1 Adversarial Polytope
  4.2 Testing a Simple Binary Classifier
  4.3 Adversarial Examples in MNIST
  4.4 Robust Classifiers

References

Appendices

A An example of gradient
List of Figures

1.1 Neuron
1.2 Fully Connected Network
1.3 Valid Cross-Correlation Product
1.4 Kernel Transformation
1.5 Convolutional Neural Network
1.6 Neural Network with Skips
1.7 Graphical Proof
1.8 Square Network
2.1 Adversarial Example: Panda
2.2 ReLU Relaxation
2.3 Polytope and Convex Outer Bound
4.1 Adversarial Polytope of Small Networks
4.2 Two-Input Binary Classifier
4.3 Tested Binary Classifier
4.4 Graphical Test
4.5 Adversarial Example: MNIST
4.6 Test of the MNIST Classifier
4.7 Test of Robustness with Many Perturbations
4.8 Robust Binary Classifier
4.9 Robust MNIST Classifier
Introduction
Over the last three decades, the scientific community has intensified research on Neural Networks, reassessing their power in many theoretical and practical areas. First introduced by Rosenblatt [Ros58] in 1958 with the definition of the perceptron (an entity composed of an input, an output and weights that can be adjusted by an error-correction procedure), they were an attempt to model the human brain which, according to the neuroscience theories of the time, was thought of as a huge net of links between atomic entities called neurons. This first enthusiasm collapsed when Marvin Minsky and Seymour Papert showed that the perceptron cannot implement the XOR function.
Between 1985 and 1995, Yann LeCun [Cun89] used a modified version of the perceptron, introduced in 1972 under the name of multi-layer perceptron by the PhD student Paul Werbos, to solve the classification problem of handwritten digits. In the same years the mathematical community made a great discovery: neural networks are able to approximate continuous functions. Hornik [HSW89] showed that NNs are universal approximators, in the sense that they form a dense set in the space of continuous functions on a compact domain; moreover, he extended the density result to every probability measure space. Some years later, in 1993, Leshno [Les+93] made another important step in approximation theory, giving a characterization of all the shallow networks (networks with a single hidden layer) that are able to approximate any continuous function.
Some years later, the computer science community ([Hay09], [GBC16], Google, Amazon, etc.) applied neural networks to a huge variety of problems: speech recognition, image classification, email spam detection, and so on. It was observed that deep neural networks (with more than one hidden layer) are more powerful than shallow ones. This observation is supported by a recent paper of Yarotsky, who in 2017 gave error bounds for these networks, showing that they can approximate almost everywhere differentiable functions more efficiently (using fewer computation units and hence fewer parameters). Other mathematicians approached the problem from a geometrical point of view: the Italian mathematicians Monica Bianchini and Franco Scarselli [BS14], in 2014, described the complexity of networks using topological tools such as Betti numbers.
Classifiers are vulnerable to adversarial examples: mischievously computed images that, with high probability, are wrongly classified by the network even though they look absolutely ordinary to the human eye. This discovery opened a new line of research, which can be divided into two categories: the first seeks a way to test whether a given network is vulnerable to these attacks; the second seeks a way to build robust, “uncrackable” networks. Our work focuses on these problems, following the paper by Wong and Kolter [WK18], which produces a test (incomplete, i.e. affected by false-positive results) to verify the robustness of a given network.
In the first chapter we present the basic concepts of Neural Networks. In the first section we define the simplest kind of network, the sequential fully connected network, characterized by the presence of layers that can be thought of as compositions of an affine function and an activation function. In the second section we define a more complex structure, the convolutional layer, which allows building networks that take tensors as input instead of vectors without increasing the number of computation units. In the third section, after defining more general networks which do not satisfy the sequentiality property (they admit connections between non-adjacent layers), we provide a result showing that, under certain hypotheses on the activation function, a network with skips can be written as a fully connected network. In section four, we present Leshno's theorem, which characterizes all the activation functions inducing networks that are dense in the class of continuous functions. In section five, we present Yarotsky's theorems, which give conditions on the complexity of the network (defined by the number of layers, connections and neurons) showing that deep networks can approximate differentiable functions more accurately than shallow ones. Finally, in the last section, we describe the methods which allow training a network as a classifier given a set of training examples.
In the second chapter we treat the problem of robustness, which is the heart of this work. In the first section we introduce adversarial examples, showing how to generate them using the classical database of handwritten digits, MNIST [LC10]. In the second section we introduce a robust training algorithm able to produce robust networks, i.e. networks that are not susceptible to adversarial examples. In the third section we describe a method to approach the problem of testing robustness by building a set called the adversarial polytope. We first show that it can be computed exactly by solving a mixed-integer program (NP-hard, hence not feasible in practice), and then we compute a relaxed version of the polytope that will be useful in the next section. In section four, we present the method that Wong uses in his article to build a test by computing the convex outer bound via the dual problem, obtaining a dual network. In the last section we show that the test itself can be used to produce a robust training algorithm by inducing a loss function.
In the third chapter we show that the robustness test via the dual network, and consequently the robust training algorithm, used for ReLU networks can be applied, with some modifications, to deep networks with a similar activation function called Leaky ReLU, which depends on a parameter called the slope. This kind of activation function does not produce better classification accuracy if the slope is fixed; better results can be obtained if this parameter is adaptively learned, see [He+15] for more details.
In the last chapter we provide some experimental results. We focus on testing robustness in two cases: in the first we consider a small network inducing a binary classifier of a finite random set in the two-dimensional plane; in the second we consider a bigger network inducing a classifier of handwritten digits taken from the MNIST database.
Chapter 1
Neural Networks
In this chapter we focus on the foundational aspects of neural networks: how to define them in a rigorous and, at the same time, general way, and how a neural network behaves in the sense of approximation theory, recalling both old and more recent density theorems.
A first approach to modern machine learning techniques is the Perceptron, which must be considered a landmark in the development of neural networks. Between 1943 and 1958 McCulloch and Pitts introduced a first mathematical model of the neuron, conceived as a non-linear function depending on parameters. Frank Rosenblatt in 1958 [Ros58] gave the first algorithmic description of neurons and their organization in the brain, replacing the biological models of the nervous system with formally rigorous machines. At this stage a single neuron takes n input signals x_1, ..., x_n, processes them with an affine combination and produces a discrete output value via a sign truncation

    y = sign(x_1 w_1 + ··· + x_n w_n + b).
This kind of neuron produces two possible output signals: activated or not activated (1 or −1). The parameters can be derived by an error-correction algorithm to produce a binary classification, i.e. to separate two linearly separable classes; more information can be found in [Hay09]. In the following sections we treat a more general kind of network that obviously includes Rosenblatt's perceptron as a particular case.
Our work focuses in particular on fully connected Neural Networks with multiple layers and rectified linear units. In the recent literature many results suggest using these kinds of networks; in particular there are several motivations to exploit the ReLU activation function: some of them are empirical [GBB11] [Jar+09], some geometric [Mon+14] and some numerical [Yar17]. In the third chapter we will also treat another activation function: the Leaky ReLU.
1.1 Sequential Fully Connected Networks
Given a target function f*: R^n → R^m, we search for some parametric function of the form

    f_θ = g ∘ f_{θ_k} ∘ ··· ∘ f_{θ_1}

which approximates f* in some sense. Each function f_{θ_i} is called a hidden layer, or simply a layer; the function g is called the output layer.
There is a great variability due to the possibility of choosing different kinds of layers. In this section we define the simplest neural network: the Fully Connected Sequential Network.
Figure (1.1). Graphical representation of a neuron with n input signals: the inputs x_1, ..., x_n are weighted by the synaptic weights w_{k,1}, ..., w_{k,n}, combined with the bias b_k at the summing junction, passed through the activation function σ(·) and emitted as the output y.
Definition 1.1 (Layers and Neurons). Let σ: R → R be a fixed measurable function, which we call the activation function. A layer of type ⟨n, m⟩ is a function f_θ: R^n → R^m of the form

    f_θ(x) = σ(Wx + b),    (1.1)

where W ∈ R^{m×n}, b ∈ R^m and σ is applied componentwise. It will be useful to keep the parameter θ = {W, b} explicit in formulas. Each component (f_θ)_i, i = 1, ..., m, is called a neuron or unit; with this notation, a layer of type ⟨n, m⟩ contains m neurons. Output layers are defined like hidden layers except for the absence of the activation function: formally, an output layer of type ⟨n, m⟩ is a function of the form g(x) = Wx + b.
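As a concrete illustration (not part of the thesis), a layer of type ⟨n, m⟩ and an output layer in the sense of Definition 1.1 can be sketched in a few lines of NumPy; the function names are chosen here only for illustration.

```python
import numpy as np

def relu(z):
    # ReLU activation, applied componentwise: sigma(z) = max(0, z)
    return np.maximum(0.0, z)

def layer(x, W, b, sigma=relu):
    # Hidden layer of type <n, m>: f_theta(x) = sigma(W x + b),
    # with W of shape (m, n) and b of shape (m,).
    return sigma(W @ x + b)

def output_layer(x, W, b):
    # Output layer: affine map without activation, g(x) = W x + b.
    return W @ x + b

# Example: a layer of type <3, 2>, i.e. 2 neurons with 3 inputs each.
rng = np.random.default_rng(0)
W, b = rng.standard_normal((2, 3)), rng.standard_normal(2)
print(layer(rng.standard_normal(3), W, b))
```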
We must remember that this language is very close to the biological one, simply because these models were an attempt to describe the neurons of the brain [Ros58]. Figure 1.1 is a representation of the k-th neuron of a layer of type ⟨n, ·⟩.
The signals, which mathematically can be thought of as a vector of n components, propagate through the synapses up to the neuron center. Here each signal is combined with the others to produce a single output signal via an affine map depending on the weights and the bias term. The output is then modulated by the activation function, which plays a fundamental role, and finally leaves the neuron and propagates to the other neurons in the next layer (or to a muscle, in the biological model).
Observation 1.1. Observe that if we take σ(x) = sign(x), i.e. the function which returns −1 for negative values of x and 1 for values of x greater than or equal to 0, then we obtain a Rosenblatt perceptron.
If the activation function σ produces zero when the signal is negative and returns the signal itself when it is positive (i.e. σ(z) = max(0, z)), then the neuron is called a Rectified Linear Unit (ReLU). We will discuss later some aspects of function approximation via ReLU networks.
We now present the definition of a Neural Network with fully connected layers [Hay09] [GBC16]. This definition must not be thought of as the most general neural network, but it is sufficient for our purposes. Practical experiments can be done with these networks even on official datasets such as MNIST, which contains an enormous quantity of scanned handwritten digits. Image processing for bigger pictures is more expensive in terms of computational cost, and so it requires ad hoc networks using convolutional layers.
A sequential network, as the name suggests, is obtained by composing many layers of a certain type.
Figure (1.2). An example of a small fully connected network of depth 3 and type ⟨8, 9, 9, 9, 4⟩. The input signal, modelled as a vector of length 8, is processed by 3 hidden layers, and the output layer returns 4 output signals. The network contains 27 hidden neurons and 4 output units. The number of links is 270, which means that this small network contains 301 free parameters.
Definition 1.2 (Neural Network). Let k, n, m and m_1, ..., m_k be fixed positive integers and let σ: R → R be a fixed activation function. A Fully connected Neural Network (FNN) of depth k and type ⟨n, m_1, ..., m_k, m⟩ is a function f: R^n → R^m obtained by composition

    f_θ = g ∘ f_{θ_k} ∘ ··· ∘ f_{θ_1},    (1.2)

where

    f_{θ_i}(x) = σ(W^{(i)} x + b^{(i)})

is a layer of type ⟨m_{i−1}, m_i⟩ for i = 1, ..., k, with the convention n = m_0 and m = m_{k+1}, and where

    g(x) = W^{(k+1)} x + b^{(k+1)}

is the output layer of type ⟨m_k, m_{k+1}⟩. The set N_σ(m_0, ..., m_{k+1}) denotes the class of all FNNs of the same depth and type.
Observe that the depth of a neural network represents the number of hidden layers, i.e. a network of depth k contains k + 1 layers. If the depth is 1 the network is said to be shallow; if k is greater than 1 the network is said to be deep. According to the latter definition, N_σ(n, m_1, ..., m_k, m) contains neural networks with

• k hidden layers;
• 1 output layer;
• Σ_{i=1}^{k+1} m_i neurons;
• Σ_{i=0}^{k} m_i m_{i+1} synaptic links.

These values express, in a certain sense, a measure of the complexity of the network; this is a conventional way to classify networks [AB99].
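As a quick check of these complexity measures (not from the thesis), the following sketch computes them for the ⟨8, 9, 9, 9, 4⟩ network of Figure 1.2; the function name is an illustrative choice.

```python
def complexity(dims):
    # dims = (m_0, m_1, ..., m_{k+1}) with m_0 = n and m_{k+1} = m.
    hidden_neurons = sum(dims[1:-1])
    output_units = dims[-1]
    links = sum(a * b for a, b in zip(dims[:-1], dims[1:]))
    # free parameters = weights (one per link) + biases (one per hidden/output unit)
    parameters = links + hidden_neurons + output_units
    return hidden_neurons, output_units, links, parameters

print(complexity((8, 9, 9, 9, 4)))  # (27, 4, 270, 301), matching Figure 1.2
```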
If we consider all possible neural networks of depth k from R^n to R^m, we use the notation

    N^k_σ(n, m) = ∪_{μ ∈ N^k} N_σ(n, μ_1, ..., μ_k, m).
Observe that a fully connected network can be represented by a graph in which each node is a computation unit and in which links connect all and only the units belonging to consecutive layers. In these particular networks the signal cannot propagate between two non-adjacent layers. If we admit these connections, then the network is said to be not fully connected, or simply free; see Section 1.3.
1.2 Sequential Convolutional Networks
First we must remember that one of the many practical motivations that led to the development of modern neural networks is the recognition of images and the prediction of handwritten digits. Let us consider for a while a black and white photo, which can mathematically be thought of as a matrix. Through a vectorization process we can treat the image as a vector of a certain length, which can be fed to a fully connected network as described in the previous section.
Unfortunately, there are two big disadvantages in using fully connected networks. The first is that if the image has O(n) pixels in each dimension, then the first hidden layer must contain at least O(n^2) neurons, a big number of computation units. The second is that, due to the vectorization process, distant coordinates in the input layer can belong to nearby pixels in the image: we lose the information about local features of the image.
In 1989, in order to avoid these problems, LeCun defined a new category of neural networks named highly constrained networks [Cun89]. Moreover, in the article [Cun+92] he provided a method to recognize handwritten numbers, with a 1% error rate, taken from a zip-code dataset of the U.S. Postal Service, using convolutional layers to take the whole image as input without vectorization.
Referring to Goodfellow: “Convolutional networks are simply neural networks that use convolution instead of general matrix multiplication in at least one of their layers”. In this section we give a definition of this kind of network using (a little bit of) tensor algebra tools. This operation does not correspond to the usual convolution of functional analysis or Fourier analysis: it is much more similar to a discrete cross-correlation (used in signal theory). Before defining convolutional layers, it is useful to introduce the following notation.
Definition 1.3. Let m_1, ..., m_k be integer numbers. We use the following notation:

    T(m_1, ..., m_k) := R^{m_1} ⊗ ··· ⊗ R^{m_k},

where ⊗ denotes the tensor product. If τ ∈ T(m_1, ..., m_k) then we say that τ is a k-dimensional array with components (τ_{μ_1···μ_k}), μ_i = 1, ..., m_i. We identify the coordinates of the tensor τ in the canonical basis {e_{μ_1} ⊗ ··· ⊗ e_{μ_k}} with the tensor itself. The space of all tensors of any shape and dimension k is denoted by T_k.
For our purposes it is sufficient to treat 3-dimensional arrays (networks working with videos can use 4-dimensional arrays). If τ ∈ T(k, m, n) then we say that τ is an array of shape k@m × n, or also that τ is a tensor with k channels. Observe that an RGB image can be expressed as a tensor with 3 channels. Analogously, we say that an array of shape (k, m_1, ..., m_r) has k channels.
Definition 1.4 (valid cross-correlation). Let τ ∈ T(m_1, ..., m_k) and κ ∈ T(n_1, ..., n_k) be two k-dimensional tensors with n_i ≤ m_i. The valid cross-correlation product gives as output a k-dimensional tensor τ ⋆ κ ∈ T_k of shape (m_1 − n_1 + 1, ..., m_k − n_k + 1), with components

    (τ ⋆ κ)_{μ_1,...,μ_k} = Σ_{ν_1,...,ν_k} τ_{μ_1+ν_1−1,...,μ_k+ν_k−1} κ_{ν_1,...,ν_k}.

In image processing κ is called the kernel, due to its small size.
The definition above considers a particular kind of cross-correlation product; there are many variants used in signal analysis. Its continuous formulation is strictly linked with the Fourier Transform.
Figure (1.3). Example of the valid cross-correlation product between two matrices of shape 3 × 4 and 2 × 2: the matrix with entries a, ..., l is correlated with the kernel (α, β; γ, δ), producing a 2 × 3 matrix whose (1,1) entry is aα + bβ + eγ + fδ.
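A minimal NumPy sketch (not from the thesis) of the valid cross-correlation of Definition 1.4 in the 2-dimensional case; the concrete 3 × 4 input and the kernel values are arbitrary illustrative choices.

```python
import numpy as np

def valid_cross_correlation(tau, kappa):
    # Valid cross-correlation of Definition 1.4 for 2-dimensional arrays:
    # output shape (m1 - n1 + 1, m2 - n2 + 1).
    m1, m2 = tau.shape
    n1, n2 = kappa.shape
    out = np.zeros((m1 - n1 + 1, m2 - n2 + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(tau[i:i + n1, j:j + n2] * kappa)
    return out

tau = np.arange(12.0).reshape(3, 4)          # plays the role of the 3x4 matrix a..l
kappa = np.array([[1.0, 2.0], [3.0, 4.0]])   # plays the role of the 2x2 kernel
print(valid_cross_correlation(tau, kappa))   # a 2x3 array, as in Figure 1.3
```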
In image processing the application of a kernel κ to an image τ is called a convolution; machine learning theory has inherited this name.
Figure (1.4). [GBC16] The image on the right was formed by taking each pixel in the original image and subtracting the value of its neighbour on the left. The original black and white image on the left is modelled as a tensor of shape 1@280 × 320. The transformation was obtained with a cross-correlation product with the small kernel κ = [−1, 1]. The output image on the right has shape 1@280 × 319.
The valid cross-correlation product is the heart of a convolutional layer. The following definition is not the most general, but it is suitable for many applications.
Definition 1.5 (Conv2d). Let κ ∈ T(h, k, n_1, n_2) be a fixed 4-dimensional tensor, called the kernel, and let m_1, m_2 be integers greater than n_1, n_2 respectively. Let b ∈ T(h, m_1 − n_1 + 1, m_2 − n_2 + 1) be a fixed 3-dimensional tensor, called the bias. A 2-dimensional convolutional layer is a map

    Conv2d(·, κ): T(k, m_1, m_2) → T(h, m_1 − n_1 + 1, m_2 − n_2 + 1)

such that, if V has shape k@m_1 × m_2 and Z = Conv2d(V, κ), then

    Z_{i,:,:} = σ( b_{i,:,:} + Σ_{l=1}^{k} V_{l,:,:} ⋆ κ_{i,l,:,:} ),    (Conv2d)

where σ: R → R, applied componentwise, is an activation function.
Proposition 1.1. Let κ ∈ T(h, k, n_1, n_2) be a fixed kernel. It is easy to prove the following statements:

• The cross-correlation product ⋆ is linear if we fix the kernel on the right side of the product, i.e. for any V, W of the same shape, (V + W) ⋆ κ = V ⋆ κ + W ⋆ κ and (λV) ⋆ κ = λ(V ⋆ κ).

• The computational cost C_κ, i.e. the number of floating point multiplications and additions needed to apply the convolution to a tensor τ of shape k@m_1 × m_2, is linear in its dimension; in particular C_κ ≈ h k m_1 m_2 n_1 n_2.

• If we set the bias to zero, then the shape of the input tensor is not fixed, except for the number of channels.
This proposition has two useful corollaries. The first is that, by linearity, it is trivial to compute the derivative of Conv2d(τ, κ) with respect to the components of κ; this will be useful in the training stage. The second is that this operation reduces the computational cost. Since n_1 ≪ m_1 and n_2 ≪ m_2, we can apply the convolution even to images that a fully connected network cannot process (a common photo can contain 16M pixels). The transformation in Figure 1.4 costs 3 × 280 × 320 floating point operations and requires storing a kernel that has only 2 entries. The same transformation could be done with a usual matrix multiplication; even treating the matrix as sparse, it would require storing all the non-zero entries, which are about 280 × 320. These are the reasons why convolutional layers are more powerful.
Analogously, via the cross-correlation product, we can define Conv3d, a convolution which takes a 4-dimensional input tensor and performs a convolution with a 5-dimensional kernel tensor, as well as Conv1d and many others.
We must also observe that this kind of transformation cannot produce a big rescaling of the input. For this purpose there are numerous kinds of transformations, called pooling [GBC16].
Figure (1.5). An example of a deep neural network with mixed convolutional and fully connected layers used to classify handwritten digits. The input tensor is processed by two convolutional layers and two pooling layers that extract the features of the image, such as vertical and horizontal edges or other patterns. The output is vectorized and taken as input by the fully connected layers, which process the vector and solve the classification task.
In conclusion, convolutional networks, i.e. functions of the form f = f_θ ∘ f_{κ_r} ∘ ··· ∘ f_{κ_1}, turn out to be very powerful for dealing with photos, videos and other structured data. They require few parameters (for each channel the parameters are shared), have a small computational cost (linear in the dimension of the input and the number of channels), and can extract particular features from the image, such as the vertical edges of Figure 1.4.
1.3 Neural Network with Skips
The sequential networks described in the previous sections are the most common and largely used in practice. In these networks the input signal propagates through the units layer by layer, in a way similar to the human visual cortex: the output of a layer is the input of the next one. Not all networks satisfy this property. In this section we give a formal definition of networks with skips and analyse the link with the sequential ones described in the previous sections.
A free network f with L computation layers and U neurons is a function that can be described recursively by a set F = {u_j}_{j∈F} of computation units. The units are grouped in layers, F = ∪_{l=0}^{L} F_l, where F_l is the index set of the units in the l-th layer and F_L is the index set of the units in the output layer; these subsets form a partition of F. The input layer F_0 is the set of the projections onto the coordinates: formally, u_j(x) = x_j for j = 1, ..., n. The elements of the first layer F_1 are functions of the form

    u_j(x) = σ( Σ_{i∈I_j} w_{ij} x_i + b_j ),   j ∈ F_1,

where I_j ⊆ {1, ..., n} = F_0 for j ∈ F_1. Recursively, a unit in the layer F_l is a function of the form

    u_j(x) = σ( Σ_{i∈I_j} w_{ij} u_i(x) + b_j ),   j ∈ F_l,

where I_j ⊆ F_0 ∪ ··· ∪ F_{l−1}. It follows that a hidden unit is a combination of units belonging only to previous layers. As we expect, an output unit is a function of the form

    u_j(x) = Σ_{i∈I_j} w_{ij} u_i(x) + b_j,   j ∈ F_L,

where I_j ⊆ F_0 ∪ ··· ∪ F_{L−1}. The sets I_j give the rules whereby the units are connected. The set Θ_j = {w_{ij}}_{i∈I_j} ∪ {b_j} collects the weights and the bias of the j-th unit.
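To make the recursive construction concrete, here is a minimal sketch (not from the thesis) that evaluates a free ReLU network; the dictionaries encoding the index sets I_j, the weights w_{ij} and the biases b_j, as well as the tiny example network, are illustrative choices.

```python
import numpy as np

def relu(z):
    return max(0.0, z)

def eval_free_network(x, layers, weights, bias, output_ids):
    # layers: list of lists of unit indices, layers[0] being the input units;
    # weights[j]: dict {i: w_ij} encoding the index set I_j and its weights;
    # bias[j]: the bias b_j of unit j; output_ids: indices of the output units.
    values = {j: x[j] for j in layers[0]}          # u_j(x) = x_j on the input layer
    for layer in layers[1:]:
        for j in layer:
            s = bias[j] + sum(w * values[i] for i, w in weights[j].items())
            values[j] = s if j in output_ids else relu(s)  # no activation on outputs
    return np.array([values[j] for j in output_ids])

# A tiny free network with a skip: unit 3 reads the input unit 0 directly.
layers = [[0, 1], [2], [3]]
weights = {2: {0: 1.0, 1: -1.0}, 3: {2: 2.0, 0: 0.5}}   # I_3 = {2, 0}: a skip link
bias = {2: 0.0, 3: 0.1}
print(eval_free_network(np.array([1.0, 2.0]), layers, weights, bias, output_ids=(3,)))
```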
Figure (1.6). A free network with 3 layers, 3 input units, 8 hidden units and 1 output unit (without activation). The total number of weights is 19, the number of biases is 9.
Definition 1.6 (Free Neural Network). Formally, we define a free neural network as a pair (F, f), where F = {u_j}_{j∈F} is the set of units and where, using the previous construction, the function f: R^n → R^m is the vector-valued function described by the units of the output layer u_{j_i} ∈ F_L for i = 1, ..., m:

    f(x) = (u_{j_1}(x), ..., u_{j_m}(x)).

Sometimes we refer to the function f directly as the free network.
Clearly the two kinds of networks, the fully connected ones and the free ones, are strictly related to each other. It is natural to ask whether the first kind can be written in free form and, conversely, whether free networks admit a fully connected formulation. The following propositions give a positive answer to these questions.
Proposition 1.2. If f_θ ∈ N_σ(n, m_1, ..., m_k, m) is a fully connected neural network with n inputs and m outputs, then there exists a free network (F, f) with U = Σ_{i=1}^{k+1} m_i units and L = k + 1 layers such that f_θ = f.

Proof. According to the previous notation, we consider the set of units F = ∪_{l=0}^{L} F_l where we use the indexation

    F_l = { j : Σ_{i<l} m_i < j ≤ Σ_{i≤l} m_i }

and where the units u_j, for j ∈ F_l, are the components of the layer f_{θ_l}. Observe that for all j ∈ F_l the connection rules are the same and are given by I_j = F_{l−1}: this expresses the full connectivity of the neural network.
This proposition shows that there is no increase in the number of layers and units if we regard a fully connected network as a free one with connections only between adjacent layers. This conversion property fails in the opposite direction. We can say more by exploiting the following property.
Lemma 1.3. The activation function σ(x) = max(0, x) satisfies the identity

    x = σ(x) − σ(−x).
Thanks to this identity we are able to rewrite a ReLU network given in free form as a network in the form of Definition 1.2, but, in general, with a huge increase in the number of units per layer.
Theorem 1.4. If (F, f) is a free ReLU network with n input units, m output units, U computation units and L layers, then there exists a fully connected network f_θ: R^n → R^m of depth k = L − 1 and of a certain type ⟨n, m_1, ..., m_k, m⟩ such that:

• f_θ = f;

• the number of neurons of f_θ satisfies the inequalities

    U ≤ Σ_{i=1}^{k+1} m_i ≤ M(F),   where M(F) = 2(2^k − 1)U_0 + Σ_{l=1}^{k} (2^{k−l+1} − 1)U_l + m.
Proof. The aim of the proof is to build all the hidden layers f_{θ_i}: R^{m_i} → R^{m_{i+1}} and the output layer g: R^{m_k} → R^m. We proceed iteratively, building first f_{θ_1}, then f_{θ_2}, and so on. First we modify the network f into a new free network f̃ such that f ≡ f̃, from which the layer f_{θ_1} can be read off easily.

1. If a unit u_j in the second layer, or in a deeper one, is directly connected to the input x_i, bypassing the first layer, with a weight factor w_{ji} = α, then we create two dummy units u_μ and u_ν in the first layer with a single weight and no bias: formally, we put Θ_μ = {w_{μi}, b_μ} and Θ_ν = {w_{νi}, b_ν}, where w_{μi}, w_{νi} are 1 and −1 respectively, and b_μ = b_ν = 0.

2. Thanks to Lemma 1.3 we can replace the unit u_j with a new unit ũ_j in which Ĩ_j = I_j ∪ {μ, ν} and Θ̃_j = Θ_j ∪ {w̃_{jμ}, w̃_{jν}} \ {w_{ji}}, where

    w̃_{jμ} = α,   w̃_{jν} = −α.

In graph terminology, we are cutting the link between x_i and u_j and adding two new connections through the new units. Observe that the new unit ũ_j produces the same output as the old u_j; formally,

    u_j(x) = σ( Σ_{s∈I_j, s≠i} w_{js} u_s(x) + b_j + α x_i )
           = σ( Σ_{s∈I_j, s≠i} w_{js} u_s(x) + b_j + α σ(x_i) − α σ(−x_i) )
           = σ( Σ_{s∈I_j, s≠i} w_{js} u_s(x) + b_j + α u_μ(x) − α u_ν(x) ) = ũ_j(x).

3. We iterate this procedure for all the layers up to the output one. Rearranging the indexation we obtain a new equivalent free network f̃. Observe that the two free networks have the same number L of layers, while the number of units per layer can grow. In detail, let m_l and U_l be the number of units in the l-th layer of f̃ and f respectively; then

    U_0 = m_0 = n,
    U_l ≤ m_l ≤ U_l + 2 m_{l−1},   l = 1, ..., k,
    m_{k+1} = U_L = m.    (1.3)

The worst scenario happens when the units of each layer are connected to all the neurons of all the following layers; in this case the equalities hold and we have

    m_l = Σ_{i=0}^{l} 2^i U_{l−i},   l = 1, ..., k.

In the worst case the total number of neurons is therefore

    m + Σ_{l=1}^{k} Σ_{i=0}^{l} 2^i U_{l−i}.

Rearranging the terms, and observing that U_i appears k − i times with coefficients 2^j, we can bound the total number of neurons as

    U ≤ Σ_{l=1}^{L} m_l ≤ 2(2^k − 1) U_0 + Σ_{l=1}^{k} (2^{k−l+1} − 1) U_l + m.    (1.4)

In conclusion, since in f̃ there are no connections between non-adjacent layers, we are able to write f_{θ_l}(x) = σ(W^{(l)} x + b^{(l)}), where the weight matrix W^{(l)} and the bias b^{(l)} are defined as follows:

    w^{(l)}_{μν} = w_{(J(l)+μ),(J(l−1)+ν)}  if J(l−1)+ν ∈ I_{J(l)+μ} and J(l)+μ ∈ F_l, and 0 otherwise,
    b^{(l)}_μ = b_{J(l)+μ},  for J(l)+μ ∈ F_l,

where the function J is used for indexation: J(0) = 0 and J(l) = J(l−1) + m_{l−1}. The same procedure can be applied to the output layer g. We have thus obtained a version of f in fully connected form, f = g ∘ f_{θ_k} ∘ ··· ∘ f_{θ_1}, with no change in the number of layers and, in the worst case, with a huge increase in the number of units.
This proposition can be adapted, with minimal variations in the proof, to all the neural networks whose activation function σ satisfies the following property.
Figure (1.7). Graphical idea of the proof: (a) a link between units of non-consecutive layers; (b) two dummy units are added to propagate the information from the input through the first layer.
Definition 1.7 (Sequential Property). An activation function σ has the sequential property if there exist an integer M and three (M+1)-tuples α, β, γ such that

    x = Σ_{j=0}^{M} α_j σ(β_j x + γ_j)    (1.5)

for all x in R.
Observe that the ReLU activation function satisfies this property with M = 1, α_0 = β_0 = 1, α_1 = β_1 = −1 and γ_0 = γ_1 = 0.
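As a quick numerical sanity check (not part of the thesis), the identity (1.5) can be verified for the ReLU; an analogous computation, worked out here, gives the coefficients for the Leaky ReLU σ(x) = max(x, ax) with slope a, namely α_0 = −α_1 = 1/(1 + a), β_0 = −β_1 = 1, γ_0 = γ_1 = 0.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, a=0.1):
    return np.maximum(x, a * x)

x = np.linspace(-5.0, 5.0, 101)

# ReLU: x = sigma(x) - sigma(-x), i.e. M = 1, alpha = (1, -1), beta = (1, -1), gamma = (0, 0).
assert np.allclose(x, relu(x) - relu(-x))

# Leaky ReLU with slope a: x = (sigma(x) - sigma(-x)) / (1 + a).
a = 0.1
assert np.allclose(x, (leaky_relu(x, a) - leaky_relu(-x, a)) / (1 + a))
print("sequential property verified")
```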
1.4 Universal Approximation Theorem
Let f* be a target function; the (mathematical) aim is, given a tolerance ε and some functional norm ‖·‖, to find a neural network f ∈ N^k_σ of fixed depth such that

    ‖f* − f‖ < ε.
It is natural to ask the following questions. For which σ are neural networks able to approximate any function? Under which conditions? Does increasing the depth allow us to improve the accuracy? The following theorems give an exhaustive answer to the first question, and a partial answer to the second one for ReLU networks (and a little more).
In 1989 Hornik [HSW89] gave a first proof that neural networks with a particular activation function can approximate any function on any compact subset. This must be considered as a first, but not the most general, result.
Theorem 1.5 (Hornik). Let σ be a continuous, bounded and non-constant activation function and let N^1_σ(n, 1) be the class of all shallow networks as in Definition 1.2. Then, for any compact X ⊆ R^n and for any continuous function g: R^n → R, there exists a sequence (f_j) ⊆ N^1_σ(n, 1) such that

    lim_{j→∞} ‖(g − f_j)|_X‖_∞ = 0.¹

¹ The neural network f_j is a continuous function, so in this case ‖·‖_∞ reduces to the usual supremum norm on X.
We will not prove this statement; we will see it as a corollary of a more general one. In 1993 Leshno and others [Les+93] made a great discovery, giving a characterization of the activation functions for which shallow networks, and so in particular deep networks, approximate any continuous function. In this section we present this landmark of approximation theory with neural networks.
We must introduce some classical tools of modern analysis. Observe that, without loss of generality, we can treat only networks with a single output unit.
Definition 1.8 (essential supremum). Let u: Ω → R be a non-negative function, measurable with respect to the Lebesgue measure ν. Let S be the set of all real λ such that

    ν({u ≥ λ}) = 0.

If S = ∅ we say that u is essentially unbounded and we write ‖u‖_∞ = ∞. If S ≠ ∅ the essential supremum of u is defined as ‖u‖_∞ = inf S.

The essential supremum is a bound for a non-negative measurable function u outside of a negligible set: if β = ‖u‖_∞ < ∞, then

    {u > β} = ∪_{n=1}^{∞} u^{−1}((β + 1/n, ∞)),

from which ν({u > β}) = 0.
Definition 1.9 (L^∞ space). If f is a measurable function on Ω ⊆ R^n with respect to the Lebesgue measure ν, we define ‖f‖_∞ to be the essential supremum of |f|, and L^∞(Ω) the set of all f for which ‖f‖_∞ < ∞. The members of L^∞(Ω) are called essentially bounded measurable functions on Ω.

Definition 1.10 (L^∞_loc space). A function f: Ω → R is locally essentially bounded if for each compact set K ⊆ Ω we have f ∈ L^∞(K). L^∞_loc(Ω) is the set of all the locally essentially bounded functions on Ω.
Observation 1.2. For each open subset Ω ⊆ R^n we have C(Ω) ⊆ L^∞_loc(Ω), where C(Ω) is the set of all continuous functions f: Ω → R.
Definition 1.11 (density). For F ⊆ L^∞_loc(R^n), we say that F is dense in C(R^n) if for every continuous function g, for each compact set K and for each tolerance ε there exists f_{K,ε} ∈ F such that

    ‖(f_{K,ε} − g)|_K‖_∞ < ε.
Definition 1.12 (M-property). A function σ ∈ L^∞_loc(R) is said to satisfy the M-property if the closure of the set of its points of discontinuity has zero Lebesgue measure. Sometimes we denote by M the set of all locally essentially bounded functions which satisfy this property.
Lemma 1.6. If σ satisfies the M-property, then for any closed interval [a, b] and for any δ > 0 there exists a finite number of open intervals I_1, ..., I_N whose union U satisfies

    ν( ∪_{k=1}^{N} I_k ) = δ

and such that σ is uniformly continuous on [a, b] \ U.
Proof. This follows from the definition of the Lebesgue measure. Let C be the closure of the set of discontinuity points of σ. Since ν(C) = 0, for each δ there exists an open set Ũ ⊇ C ∩ [a, b], a union of a countable family of intervals, of measure δ. Compactness guarantees the possibility of extracting a finite subfamily whose union gives the required set U.
The following lemma, a property of the convolution operator with respect to polynomials, can be extracted from the original proof of the Leshno density theorem below.
Lemma 1.7 (Leshno). Let σ ∈ L^∞_loc be a fixed function. Then σ is a polynomial if and only if σ ∗ ϕ is a polynomial for each ϕ ∈ C^∞_0.²

² C^∞ functions with compact support.
We are now able to formulate Leshno's density theorem.
Theorem 1.8 (Leshno). Let σ ∈ L^∞_loc(R) satisfy the M-property. Then

    N^1_σ(n, 1) is dense in C(R^n)

if and only if

    σ is not an algebraic polynomial.
Proof. One of the two implications is easy. If σ is a polynomial of degree k, then N^1_σ(n, 1) is contained in the set of all polynomials of degree at most k in n variables. This implies that N^1_σ(n, 1) cannot be dense in C(R^n) (it cannot approximate a polynomial of degree greater than k), which proves one direction of the equivalence. From now on we suppose that σ is not a polynomial. Following the article, we divide the proof into steps.

(i) If N^1_σ(1, 1) is dense in C(R), then N^1_σ(n, 1) is dense in C(R^n).

Let us consider the vector subspace of C(R^n) defined as

    V = Span{ F ∈ C(R^n) : F(x) = f(a · x), f ∈ C(R), a ∈ R^n }.

Observe that this set contains all the polynomials in n variables. From the Stone-Weierstrass theorem in several variables we deduce that for each
compact set K ⊂ R^n the set V is dense in C(K). Let g ∈ C(R^n) and let K ⊆ R^n be a compact set. By the definition of density, for each ε > 0 there exist k ∈ N, f_i ∈ C(R) and a_i ∈ R^n for i = 1, ..., k, such that

    | g(x) − Σ_{i=1}^{k} f_i(a_i · x) | < ε/2,   ∀x ∈ K.    (1.6)

Let α ∈ R be such that {a_i · x : x ∈ K} ⊆ [−α, α] for each i = 1, ..., k. Since N^1_σ(1, 1) is dense in C(R) by hypothesis, there exist m_i ∈ N and shallow networks f_{θ_i} ∈ N_σ(1, m_i, 1), for i = 1, ..., k, such that

    | f_i(y) − f_{θ_i}(y) | < ε/(2k),   ∀y ∈ [−α, α].    (1.7)

This implies that, for all x ∈ K,

    | g(x) − Σ_{i=1}^{k} f_{θ_i}(a_i · x) |
        ≤ | g(x) − Σ_{i=1}^{k} f_i(a_i · x) | + Σ_{i=1}^{k} | f_i(a_i · x) − f_{θ_i}(a_i · x) |
        ≤ ε/2 + k · ε/(2k) = ε.    (1.8)

Observing that F(x) = Σ_{i=1}^{k} f_{θ_i}(a_i · x) is a shallow network in N^1_σ(n, 1), we deduce that N^1_σ(n, 1) is dense in C(R^n).

(ii) If σ ∈ C^∞(R), then N^1_σ(1, 1) is dense in C(R).
Let σ ∈ C^∞(R) and let K ⊆ R be a compact set. For each w, b ∈ R and h ≠ 0, let f_{w,b}(x) = σ(wx + b) be a simple shallow network. Observe that the difference quotient in the parameter w is itself a shallow network for each h:

    (1/h)(f_{w+h,b}(x) − f_{w,b}(x)) = (1/h)(σ((w + h)x + b) − σ(wx + b)) ∈ N_σ(1, 2, 1).

Since f_{w,b}(x) is differentiable in w, this implies that

    (d/dw) f_{w,b} ∈ cl(N^1_σ(1, 1)),    (1.9)

where the closure is taken with respect to the topology induced by the supremum norm over the networks restricted to the compact set K. With the same argument it follows that (d^k/dw^k) f_{w,b} ∈ cl(N^1_σ(1, 1)) for all k, and (d^k/dw^k) σ(wx + b) = x^k σ^{(k)}(wx + b). Since σ is not a polynomial, for each k there exists β_k ∈ R such that σ^{(k)}(β_k) ≠ 0. This implies that

    x^k σ^{(k)}(β_k) = (d^k/dw^k) σ(wx + b) |_{w=0, b=β_k},    (1.10)

and so cl(N^1_σ(1, 1)) contains every polynomial of any degree. By the Stone-Weierstrass approximation theorem we deduce that

    C(K) = cl(P) ⊆ cl(N^1_σ(1, 1)),    (1.11)

where P is the set of all polynomials in one variable. This is true for each compact K ⊆ R, and so N^1_σ(1, 1) is dense in C(R).

(iii) For each ϕ ∈ C^∞_0(R), σ ∗ ϕ belongs to the closure of N^1_σ(1, 1).
Let ϕ ∈ C^∞_0(R) and let α ∈ R be such that supp(ϕ) ⊆ [−α, α] =: K_α. Observe that the convolution σ ∗ ϕ: R → R, defined as

    (σ ∗ ϕ)(x) = ∫_R σ(x − y) ϕ(y) dy,    (1.12)

is well defined for each x ∈ R. We prove that σ ∗ ϕ can be approximated, with respect to the supremum norm on K_α, through the sequence of networks

    f_m(x) := (2α/m) Σ_{i=1}^{m} σ(x − y_i) ϕ(y_i),    (1.13)

where y_i = −α + 2iα/m for i = 1, ..., m.

Let −2α − 1 ≤ z_1 ≤ ··· ≤ z_r ≤ 2α + 1 be the points of discontinuity of σ in [−2α − 1, 2α + 1] and let ε > 0. Observe that there exists δ such that

    10 δ ‖σ|_{[−2α,2α]}‖_∞ ‖ϕ‖_∞ < ε.    (1.14)

Since σ satisfies the M-property, from Lemma 1.6 we deduce that there exist r(δ) ∈ N and a set U (a union of r(δ) open intervals) of measure equal to δ such that σ is uniformly continuous on [−2α, 2α] \ U. We choose m_1 ∈ N such that m_1 δ > α r(δ). Due to the uniform continuity of ϕ, there exists m_2 such that

    |ϕ(s) − ϕ(t)| < ε / (2α ‖σ|_{[−2α,2α]}‖_∞),   ∀ s, t with |s − t| < 2α/m_2.    (1.15)

Moreover, due to the uniform continuity of σ on the compact set [−2α, 2α] \ U, there exists m_3 such that

    |σ(s) − σ(t)| < ε/M,   ∀ |s − t| < 2α/m_3,    (1.16)

where M = ∫_R |ϕ(y)| dy. Let m be the maximum of m_1, m_2, m_3 and let x ∈ [−α, α] be a fixed point. Let ∆_i be the intervals [y_{i−1}, y_i] for i = 1, ..., m, where y_i is defined as above. By the triangular inequality and since supp(ϕ) ⊆ K_α, the following inequality holds:

    | ∫_R σ(x − y) ϕ(y) dy − Σ_{i=1}^{m} ∫_{∆_i} σ(x − y_i) ϕ(y) dy |
        ≤ Σ_{i=1}^{m} ∫_{∆_i} |σ(x − y) − σ(x − y_i)| |ϕ(y)| dy.    (1.17)
If x−∆i doesn’t intersect the set U (let be I the set of index for which this happens), then from inequality1.16 we deduce
Z ∆i |σ(x − y) − σ(x − yi)| |ϕ(y)| dy ≤ ε M Z ∆i |ϕ(y)| dy. (1.18) from which it is easy to deduce
X
i∈I Z
∆i
|σ(x − y) − σ(x − yi)| |ϕ(y)| dy < ε. (1.19)
Let ˜I be the set of i such that x − ∆i intersects the set U in a non empty set. Observe that the measure of the union of these sets is at most δ + 4α
mr(δ) but, by our choice of m (m ≥ m1), we have that δ +4αmr(δ) ≤ 5δand so from 1.14we obtain that
X
i∈ ˜I Z
∆i
|σ(x−y)−σ(x−yi)| |ϕ(y)| dy ≤ 2kσK2α]k kϕk∞5δ < ε. (1.20) From all these inequalities, that do not depend on x, we can conclude taking |(σ ∗ ϕ)(x) −2α m m X i=1 σ(x − yi)ϕ(yi)| (1.21) applying triangular inequality and adding and subtracting R∆iσ(x − yi)ϕ(y) dy.
(iv) If there exists ϕ ∈ C^∞_0(R) such that σ ∗ ϕ is not a polynomial, then N^1_σ(1, 1) is dense in C(R).

From the previous statement we deduce that σ ∗ ϕ belongs to the closure of N^1_σ(1, 1) on each compact set K, and so, for each w, b ∈ R, also (σ ∗ ϕ)(wx + b) is in the closure of the set of all shallow neural networks with a single input neuron. By a well known property of the convolution product, σ ∗ ϕ ∈ C^∞. Moreover, by hypothesis σ ∗ ϕ is not a polynomial, and so from statement (ii) N^1_{σ∗ϕ}(1, 1) is dense in C(R). It follows that each continuous function can be approximated on each compact set by a sequence of networks f_i with activation function σ ∗ ϕ; at the same time, each f_i can itself be approximated by a sequence f_i^{(j)} of networks in N^1_σ(1, 1). From this it is easy to deduce that N^1_σ(1, 1) is dense in C(R).

In conclusion, the remaining implication can be deduced from the following chain. If σ is not a polynomial then, from Lemma 1.7, there exists ϕ ∈ C^∞_0(R) such that σ ∗ ϕ is not a polynomial. From step (iv) this implies that N^1_σ(1, 1) is dense in C(R), and from step (i) this implies that N^1_σ(n, 1) is dense in C(R^n).
Corollary 1.9. ReLU (and also Leaky ReLU) neural networks, even with a single hidden layer, can approximate any continuous function to any degree of accuracy, provided sufficiently many neurons are available in the single hidden layer.
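The following sketch (not from the thesis) illustrates the corollary on a concrete example, under the assumption that a shallow ReLU network of the form g(x) = d + Σ_i c_i σ(x − a_i) suffices on [0, 1]: it builds by hand a network that reproduces the piecewise linear interpolant of a target function and checks that the uniform error shrinks as the hidden layer grows.

```python
import numpy as np

def shallow_relu_interpolant(f, N):
    # Build a shallow ReLU network g(x) = d + sum_i c_i * relu(x - a_i) on [0, 1]
    # reproducing the piecewise linear interpolant of f at the nodes a_i = i / N.
    nodes = np.linspace(0.0, 1.0, N + 1)
    slopes = np.diff(f(nodes)) / np.diff(nodes)          # slope on each sub-interval
    c = np.diff(slopes, prepend=0.0)                     # c_i = s_i - s_{i-1}
    a, d = nodes[:-1], f(nodes[0])
    return lambda x: d + np.maximum(0.0, x[:, None] - a[None, :]) @ c

f = lambda x: np.sin(2 * np.pi * x)
x = np.linspace(0.0, 1.0, 2001)
for N in (4, 16, 64):
    g = shallow_relu_interpolant(f, N)
    print(N, np.max(np.abs(f(x) - g(x))))   # uniform error decreases as N grows
```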
This theorem hides, as a particular case, a well known statement.
Corollary 1.10. Trigonometric polynomials are a particular case of neural networks of depth 1. In this sense, shallow networks can be thought of as a generalization of Fourier series.
Proof. Let σ(x) = sin(x); we can write the space of all neural networks of depth 1 in a more familiar way:

    N^1_σ(n, 1) = Span{ f: R^n → R | f(x) = σ(W^T x + b), W ∈ R^n, b ∈ R }.

The function sin(·) is a continuous non-polynomial function, so by Leshno's theorem N^1_{sin}(n, 1) is dense (with respect to the usual supremum norm) in C(K), where K is the compact set [0, 2π]^n. Observe also that the subspace of C(K) defined as

    F_n := Span{ sin(w^T x + b) : w ∈ N^n, b ∈ {0, π/2} }

is a subset of N^1_{sin}(n, 1) and, since sin(t + π/2) = cos(t), represents the class of all trigonometric polynomials in n variables. It is well known that this set is dense in C(K).
This proposition shows another property of shallow neural networks. Recall that the set S = Span{ sin(w^T x) }_{w∈N^n} is not dense in C(K) (in one variable, not all continuous functions are odd); this implies that the bias b cannot be neglected. The bias b, present in the unique hidden layer, is necessary to guarantee the universal approximation property.
Another weaker, but more useful, formulation of the theorem is the following.
Theorem 1.11. Let σ: R → R be a non-polynomial function, continuous except for a finite number of points. Given a continuous function f ∈ C(R^n), for each ε > 0 and for each compact set K there exist an integer m, a matrix W ∈ R^{m×n} and two vectors w, b ∈ R^m such that

    sup_{x∈K} | f(x) − w^T σ(W x + b) | < ε.

The number of rows m increases when we search for a more accurate approximation.
These results motivate the search for an approximating function in the set N^1_σ, but give no estimate of the error. In particular, what is the link between the accuracy and the increase in the number of hidden units? The paper [RBa94] gives an exhaustive answer to this question, but only in the case where σ is a sigmoid function. Therefore this result does not explain whether and why deep networks are more accurate than shallow ones with the same number of neurons.
In a recent paper [Yar17] the author gives error bounds for approximations by neural networks with rectified linear units.
1.5 Error Bounds for ReLU Networks
Shallow networks are able to approximate any continuous function. Moreover, recent results show that networks with several hidden layers may be more expressive than shallow ones of comparable size.
For example, in a recent paper [BS14] the authors exploit concepts coming from topology to produce a new measure of function complexity that can be applied to neural networks, especially those used in a classification framework. Precisely, let f_θ ∈ N^k_σ(n, 1) be a fully connected neural network with k hidden layers, n inputs and a single output. The idea of the authors is to investigate the properties of the set S_θ := {x ∈ R^n : f_θ(x) ≥ 0}, computing its Betti numbers in order to measure its complexity.
In this section we investigate aspects of deep networks with the ReLU activation function. We present Yarotsky's results from his recent article [Yar17] of 2017.
The first result shows that, in terms of the number of computation units and the number of hidden layers, using the ReLU activation function is not much different from using any piecewise linear activation function. Only in this section, σ will denote the function max(0, x).
Theorem 1.12. Let ρ: R → R be any continuous piecewise linear function with M breakpoints, where 1 ≤ M < ∞.

(i) Let F be a network having L layers, W weights, U computation units and activation function ρ. Then there exists a ReLU network G with L layers, not more than (M + 1)^2 W weights and not more than (M + 1)U units such that F ≡ G pointwise.

(ii) Let F: R^n → R^m be a ReLU network with L layers, W weights and U computation units. Then, for any compact set K ⊂ R^n, there exists a network G with activation function ρ, L layers, 4W weights and 2U units such that F|_K = G|_K pointwise.
Proof. (i) Let a_1 < ··· < a_M be the breakpoints of ρ, i.e. the points where left and right derivatives differ, ρ'_+(a_i) ≠ ρ'_−(a_i). Observe that we can express ρ in terms of combinations of σ,

    ρ(x) = c_0 σ(a_1 − x) + Σ_{i=1}^{M} c_i σ(x − a_i) + h,    (1.22)

with a suitable choice of the coefficients c_0, ..., c_M and h. This can be done iteratively. Let c_0, d_0 be such that

    ρ(x) = c_0(a_1 − x) + d_0,   for x ≤ a_1;    (1.23)

then clearly ρ(x) = ρ_0(x) := c_0 σ(a_1 − x) + d_0 on the interval (−∞, a_1]. Let c_1, d_1 be such that

    ρ(x) = c_1(x − a_1) + d_1,   for a_1 ≤ x ≤ a_2;

then ρ(x) = ρ_1(x) := c_1 σ(x − a_1) + d_1 on the interval [a_1, a_2]. We proceed in a similar manner on the next intervals [a_i, a_{i+1}] for i = 2, ..., M: let c_i, d_i be such that

    ρ(x) = c_i(x − a_i) + d_i − Σ_{j=1}^{i−1} ρ_j(x),   for a_i ≤ x ≤ a_{i+1} (or +∞),

where ρ_i(x) = c_i σ(x − a_i) + d_i. To conclude, it is enough to verify that ρ(x) = ρ_0(x) + Σ_{i=1}^{M} ρ_i(x). From the formulation (1.22) of ρ it follows that the computation performed by a single unit u_j,

    (x_i)_{i∈I_j} ↦ ρ( Σ_{i∈I_j} w_{ij} x_i + b_j ),

can be represented by a linear combination of a constant function and the computations of the following M + 1 units with activation function σ:

    (x_i)_{i∈I_j} ↦ σ( Σ_{i∈I_j} w_{ij} x_i + b_j − a_m ),   m = 1, ..., M,
    (x_i)_{i∈I_j} ↦ σ( a_1 − b_j − Σ_{i∈I_j} w_{ij} x_i ).

Observe that this can be done for each unit in each layer without any change in the depth of the network. Since each unit is replaced by M + 1 units, the total number of units in the new network is at most (M + 1)U. This also implies that the number of weights can increase at most by a factor (M + 1)^2.

(ii) Let a ∈ R be a point of discontinuity for the derivative of ρ and let r_0 be the distance of a from the nearest other breakpoint (any r_0 > 0 if ρ has a single breakpoint). It is easy to verify that, for any r > 0,

    σ(x) = [ ρ(a + (r_0/2r) x) − ρ(a − r_0/2 + (r_0/2r) x) − ρ(a) + ρ(a − r_0/2) ] / [ (ρ'_+(a) − ρ'_−(a)) (r_0/2r) ],   x ∈ [−r, r].    (1.24)

Indeed, since ρ is linear on [a − r_0, a[ and on [a, a + r_0], the function on the right side is piecewise linear on [−r, r]; it is zero on [−r, 0] and takes the value r at x = r, whence the identity. In a manner similar to the previous step, a single computation unit u_j belonging to a certain layer,

    (x_i)_{i∈I_j} ↦ σ( Σ_{i∈I_j} w_{ij} x_i + b_j ),

can be represented by a linear combination of a constant function and two units with activation function ρ:

    (x_i)_{i∈I_j} ↦ ρ( a + (r_0/2r)( Σ_{i∈I_j} w_{ij} x_i + b_j ) ),
    (x_i)_{i∈I_j} ↦ ρ( a − r_0/2 + (r_0/2r)( Σ_{i∈I_j} w_{ij} x_i + b_j ) ),    (1.25)

where r can be chosen from K at each layer in such a way that Σ_{i∈I_j} w_{ij} x_i + b_j ∈ [−r, r] holds. As before, we can substitute these new units and obtain a new network with twice the number of units per layer and consequently four times the number of links.
In particular, Leaky ReLU activation functions, being piecewise linear with 0 as unique breakpoint, produce networks equivalent to ReLU networks.
The following results must be considered as existence theorems. They do not produce a learning algorithm, but prove that deep ReLU networks express smooth functions more efficiently than shallow ones, in terms of the number of computation units.
The following result shows how ReLU neural networks with unbounded depth can approximate the simple polynomial function x^2 more efficiently than fixed-depth networks.
Figure (1.8). Graphical representation of the network f_4 approximating the function x^2. The error of this approximation is 2^{−6}. The network has 5 layers and 13 computation units. The equivalent fully connected network would be of type ⟨1, 5, 13, 29, 61, 1⟩ with 108 hidden neurons.
Proposition 1.13. Let f(x) = x^2 be defined on the domain [0, 1] and let ε > 0. Then there exists a network with skips (F, F), with U units and W weights, such that

    ‖F − f‖_∞ < ε

and U, W ≈ O(ln(1/ε)).
Proof. Let g: [0, 1] → [0, 1] be the function, called tooth or mirror, defined as

    g(x) = 2x if x < 1/2,   g(x) = 2(1 − x) if x ≥ 1/2.    (1.26)

Let g^{(s)} = g ∘ ··· ∘ g be the function obtained by composing g with itself s times. It is not too hard to prove that g^{(s)} is a piecewise linear function with 2^s − 1 breakpoints; in detail, the following identity holds:

    g^{(s)}(x) = 2^s (x − 2k/2^s)   for x ∈ [2k/2^s, (2k+1)/2^s], k = 0, ..., 2^{s−1} − 1,
    g^{(s)}(x) = 2^s (2k/2^s − x)   for x ∈ [(2k−1)/2^s, 2k/2^s], k = 1, ..., 2^{s−1}.    (1.27)

Let f_m be the piecewise linear interpolation of the function x^2 at the nodes k/2^m, k = 0, ..., 2^m, which can be described explicitly as

    f_m(x) = ((2k + 1)/2^m) (x − k/2^m) + k^2/2^{2m},   x ∈ [k/2^m, (k+1)/2^m],    (1.28)

for each k = 0, ..., 2^m − 1. It is easy to verify that f_m approximates x^2 with error ε_m = 2^{−2m−2} in the supremum norm. The heart of the proof is the identity

    f_{m−1}(x) − f_m(x) = g^{(m)}(x) / 2^{2m},

from which we recursively deduce

    f_m(x) = x − Σ_{s=1}^{m} g^{(s)}(x) / 2^{2s}.

We can now conclude by observing that f_m is a neural network with activation function g, with m + 1 layers, m + 1 neurons (1 unit per layer) and 2m + 3 weights. Since g(x) is piecewise linear with M = 3 breakpoints, there exists a neural network with ReLU activation function that computes the same function as f_m with at most 4(m + 1) units, m + 1 layers and 16(2m + 3) links. In this particular case, since g(x) = 2σ(x) − 4σ(x − 1/2) + 2σ(x − 1), the network f_m can be written as a network with activation function σ with a smaller number of neurons and weights, as in Figure 1.8. Observe finally that m ≈ O(ln(1/ε_m)).
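The construction in the proof can be checked numerically; the following sketch (not from the thesis) builds f_m through the tooth function and verifies that the uniform error is close to ε_m = 2^{−2m−2}.

```python
import numpy as np

def tooth(x):
    # The "tooth" (mirror) function g of equation (1.26).
    return np.where(x < 0.5, 2 * x, 2 * (1 - x))

def f_m(x, m):
    # f_m(x) = x - sum_{s=1}^{m} g^{(s)}(x) / 2^{2s}, the piecewise linear
    # interpolant of x^2 at the nodes k / 2^m.
    approx, gs = x.copy(), x.copy()
    for s in range(1, m + 1):
        gs = tooth(gs)                 # g^{(s)} by repeated composition
        approx -= gs / 2 ** (2 * s)
    return approx

x = np.linspace(0.0, 1.0, 100001)
for m in (2, 4, 6):
    err = np.max(np.abs(f_m(x, m) - x ** 2))
    print(m, err, 2.0 ** (-2 * m - 2))  # the error is close to eps_m = 2^(-2m-2)
```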
Corollary 1.14. Since xy = ½((x + y)^2 − x^2 − y^2), we can use the networks f_m to implement the product of real numbers.
Yarotsky also gives an upper bound on the number of units that a ReLU network must have in order to obtain an approximation of accuracy ε.
Theorem 1.15 (Yarotsky). For any d, n, for each f ∈ C^n([0, 1]^d) and for each ε > 0, there exists a ReLU network f_θ such that:

• ‖f − f_θ‖_∞ < ε;
• f_θ has depth less than c(ln(1/ε) + 1);
• f_θ has at most cε^{−d/n}(ln(1/ε) + 1) weights and computation units.

The network f_m above is able to approximate the square function, but the depth of the network increases to produce a better approximation. What happens to the number of computation units if we fix an upper bound on the number of layers? The following theorem gives an answer to this question, producing a lower bound on the minimum number of units of a network which guarantees an approximation with a fixed error.
Theorem 1.16 (Yarotsky). Let f ∈ C^2([0, 1]^d) be a non-affine function. Then, for any fixed k, a deep ReLU neural network of depth k (and so with L = k + 1 layers) that approximates f with error ε < 1 must have at least cε^{−1/(2k)} weights and computation units, where c is a constant that depends on f and k.
Corollary 1.17. A shallow ReLU network f_θ that approximates a function f ∈ C^2([0, 1]) (such as f(x) = x^2) with a precision ε < 1 must have at least cε^{−1/2} hidden units.
1.6 Training a Feed Forward Network
In the previous sections we have seen how neural networks can approximate functions under regularity assumptions. In practice, the target function f* which we want to approximate is unknown except at a certain finite set of points. In this section we do not describe the back-propagation technique that allows computing the parameters of the network; rather, we introduce some basic notions such as loss functions and backward gradient computation.
The set Ω ⊂ R^n × R^m of these points is called the set of training objects, or simply the training set; if (x, y) ∈ Ω then we say that y is the label, or target, of the input x.
In a philosophical sense this set expresses previous experience for our neural network f_θ. If we use this (and only this) information to find the best parameter θ then, practically, we are training, or feeding, the neural network f_θ, which can induce a classifier or approximate a particular function.
Clearly there is more than one method to search for the best parameter θ. The classic method consists in choosing a network (depending on the objects we are treating), initializing its parameters, propagating the training objects through the network, comparing the outputs with the known labels through a loss function, and modifying the parameters. Schematically, the classical learning of a feed forward network proceeds as follows:

• choose a loss function L: R^m × R^m → R (see below for more details);

• choose a neural network (fully connected, convolutional, with skips, etc.);

• minimize the empirical risk over a fixed subset of the training set Ω,

    min_θ (1/N) Σ_{(x,y)∈Ω} L(f_θ(x), y).    (mP)
These kinds of methods are called learning and, when k > 1, deep learning. For our purposes we treat only fully connected networks. It will be convenient, in pseudo-algorithms, to keep the parametric variable

    θ = (W^{(1)}, ..., W^{(k+1)}, b^{(1)}, ..., b^{(k+1)})

and to use different variables for different layers. When θ is fixed and we are treating the neural network as a function of the input variable x, we use the following names for the variables:

    z^{(0)} = x,
    ẑ^{(i)} = W^{(i)} z^{(i−1)} + b^{(i)},   i = 1, ..., k + 1,
    z^{(i)} = σ(ẑ^{(i)}),   i = 1, ..., k,

so f_θ is a neural network which maps x ↦ z^{(1)} ↦ ··· ↦ z^{(k+1)}. The variables ẑ^{(i)} denote the pre-activation values.

If we fix the input x it will be convenient to use the following variables/functions that depend only on the parameters θ_i. More in detail, for a fixed x ∈ R^n, we define
    h^{(1)}(θ_1; x) = σ(W^{(1)} x + b^{(1)}),
    h^{(i+1)}(h^{(i)}, θ_{i+1}; x) = σ(W^{(i+1)} h^{(i)} + b^{(i+1)}),   i = 1, ..., k − 1,
    h^{(k+1)}(h^{(k)}, W^{(k+1)}, b^{(k+1)}; x) = W^{(k+1)} h^{(k)} + b^{(k+1)}.    (1.29)

Sometimes we omit the parameter x in the formulas when no confusion may arise: h^{*} with parameter x is different from h^{*} with a different parameter x'.
Loss functions
The choice of the loss function L strictly depends on the task that we ask of our network. Typically we can distinguish two kinds of problems: function approximation and classification.
• Approximation: this task consists, as we said in the introduction, in choosing the parameters of the network to approximate, in a certain norm, a target function f* which typically is continuous or even differentiable. The input and the output of the neural network must be the same as those of f*. The set of training objects Ω can be thought of as a sampling, or discretization, of f* on a grid. Clearly the most common loss function used in this situation is a vector norm.
• Classification: in this kind of task, the neural network is asked to specify to which category an input belongs. Typically the categories, or classes, are encoded as vectors e_j ∈ R^m of the canonical basis, where m is the number of classes. If C = {1, ..., m} is a finite set of classes, then a training pair (x, y) ∈ Ω is a pair of vectors such that y = e_c if the input x belongs to the class c. Typical examples are given by the classification of handwritten digits and numbers (MNIST database), photos, speech, etc. The choice of the loss function for these kinds of problems is less intuitive. The CrossEntropyLoss (implemented in PyTorch) is one of the most famous; combined with Stochastic Gradient Descent it produces good results [GBC16]. This specific topic is beyond our aims.
The classification stage is pretty simple for this kind of neural network: given an input x with unknown target, the class of x is obtained by searching for the index of the maximum component of the output vector.
Definition 1.13 (Cross Entropy). The cross entropy loss function is defined as follows:
\[
L(y, y^*) = -\log\left( \frac{\exp(y \cdot y^*)}{\sum_j \exp(y_j)} \right).
\]
Because y^* is always some e_c ∈ R^m, this function is sometimes also given in the following form:
\[
L(y, c) = -\log\left( \frac{\exp(y_c)}{\sum_j \exp(y_j)} \right). \tag{1.30}
\]
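Formula (1.30) translates directly into code; the following numpy sketch subtracts the maximum of y before exponentiating, a standard numerical-stability trick that cancels out in the ratio and does not change the value.

import numpy as np

def cross_entropy(y, c):
    """L(y, c) = -log( exp(y_c) / sum_j exp(y_j) ), eq. (1.30);
    y is the output vector of the network, c the index of the true class."""
    y = y - np.max(y)                     # stability shift, cancels in the ratio
    return -(y[c] - np.log(np.sum(np.exp(y))))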
Definition 1.14. A loss function L : R^m × R^m → R is said to be monotonic if for each u, v, y ∈ R^m
\[
u \le v \;\Longrightarrow\; L(u, y) \le L(v, y),
\]
where, as usual, u ≤ v means that u_j ≤ v_j for every component j. This property is satisfied by many loss functions (such as the Cross Entropy).

Batch Learning
Batch learning is a method of supervised learning in which the updates of the weights {w^{(i)}_{μν}, b^{(i)}_ν} are performed after the presentation of all the training examples. Fix a loss function L and initialize a network f_θ of a certain depth. Let w be a generic parameter of the network at the epoch e; the update of this parameter is made by computing (see Appendix A for an example of the gradient computation of a simple sequential network) the value
\[
\delta = -\frac{\partial R(\theta)}{\partial w}, \qquad \text{where} \quad R(\theta) = \frac{1}{|\Omega|} \sum_{(x,y)\in\Omega} L\big(f_\theta(x), y\big)
\]
and Ω contains all the training examples, and then by adjusting the parameter with a learning rate η (that can be fixed or variable) as follows:
\[
w_{\mathrm{new}} = w_{\mathrm{old}} + \eta\,\delta. \tag{1.31}
\]
This procedure can be done in parallel. When all the parameters (both weights and biases) have been updated, the network is said to be in the epoch e + 1.
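The update rule (1.31) can be sketched as follows; here grad_risk is a hypothetical routine returning the gradient of the empirical risk R(θ) with respect to each parameter, in practice obtained by back-propagation as in Appendix A.

def batch_update(params, grad_risk, eta):
    """One batch-learning epoch: w_new = w_old + eta * delta,
    with delta = -dR(theta)/dw, applied to every parameter of the network."""
    grads = grad_risk(params)             # gradients over the whole training set
    return [w + eta * (-g) for w, g in zip(params, grads)]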
Online Learning
The online procedure is another method of supervised learning. In this case the adjustments of the synaptic weights of the deep neural network are performed on an example-by-example basis. Fixed a loss function L, consider the following empirical risk, computed on a single training example (x_e, y_e) ∈ Ω at a certain epoch e:
\[
R(\theta; x_e) = L\big(f_\theta(x_e), y_e\big).
\]
Let w be a generic parameter of the network, let η be a learning rate and α a parameter called momentum. The update is done by computing
\[
\Delta_{\mathrm{new}}(w; e) = -\eta\, \frac{\partial R(\theta; x_e)}{\partial w} + \alpha\, \Delta_{\mathrm{old}}(w; e-1)
\]
and substituting the old parameter with the following:
\[
w_{\mathrm{new}} = w_{\mathrm{old}} + \Delta_{\mathrm{new}}(w; e). \tag{1.32}
\]
When this procedure has been done for all the parameters, the network is said to be in the epoch e + 1. The choice of the training examples can be made randomly, and it is not forbidden that the same object is presented many times. Observe that this method is simple to implement but, due to its sequentiality, cannot be parallelized.
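A sketch of the online update (1.32) for a single generic parameter w; grad_w stands for the derivative of the per-example risk R(θ; x_e) with respect to w, which is assumed to be available.

def online_update(w_old, grad_w, delta_old, eta, alpha):
    """delta_new = -eta * dR(theta; x_e)/dw + alpha * delta_old,
    w_new = w_old + delta_new (eq. 1.32). Returns the updated parameter
    together with delta_new, to be reused at the next presentation."""
    delta_new = -eta * grad_w + alpha * delta_old
    return w_old + delta_new, delta_new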
Chapter 2
Defenses against Adversarial Examples
In the previous chapter we introduced basic concepts concerning neural networks: definitions, density theorems, approximation theorems, and a little about learning algorithms (an example of the gradient computation of a simple network can be found in Appendix A). What can we say about the stability of a neural network?
In [Sze+14] the authors give a method to compute an adversarial example, which can be thought of as a maliciously crafted object that misleads the network and produces a misclassification.
These attacks can be dangerous for systems based on automatic classifiers. In [WK18] the authors give a method to avoid these attacks by training the network with a robust loss function, producing a kind of warranty certificate.
2.1 Adversarial Examples
In this section we analyse stability properties of networks whose task is to classify objects.
The most common way to induce a classifier from a neural network is by applying the argmax function to the output layer. In particular, given a set of classes C = {1, · · · , m}, we suppose that f_θ : R^n → R^m is a neural network already trained over a training set of the form Ω ⊆ R^n × C. This neural network induces a classifier class : R^n → C defined as follows:
\[
\mathrm{class}(x) \in \operatorname*{argmax}_{j} \; \hat z^{(k+1)}_j \tag{2.1}
\]
where, as usual, ẑ^{(k+1)} = f_θ(x) is the output obtained by forwarding the input through the layers.
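In code the induced classifier (2.1) is a one-liner; the sketch below assumes that f_theta returns the output vector ẑ^{(k+1)} as a numpy array and numbers the classes from 0 instead of 1.

import numpy as np

def induced_class(f_theta, x):
    """class(x) = an index attaining the maximum of z_hat^(k+1) = f_theta(x)."""
    return int(np.argmax(f_theta(x)))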
In a famous article [Sze+14], the authors, analysing properties of deep neural networks, made an intriguing discovery: neural networks are vulnerable to adversarial examples. This means that the machine misclassifies objects that are only slightly different from correctly classified objects.
Figure (2.1). [GSS14] A famous adversarial example found by Goodfellow. Since the variation (i.e., the amplitude of the added noise) is so small, the human eye cannot detect differences between the perturbed image on the right and the clean image on the left. However, this thin perturbation is able to mislead the machine, producing a wrong classification.
Definition 2.1 (Adversarial Example). Let f_θ : R^n → R^m be a DNN which induces a classifier as defined in (2.1). Let ε be a fixed positive number; then x̃ is an adversarial example if there exists a training object (x, y) such that
\[
\| x - \tilde x \| \le \varepsilon
\]
and class(x̃) ≠ class(x), for some vector norm ‖·‖.
Clearly we cannot prove the existence of an adversarial example for each training object, but we can understand how it is possible to generate them with a good expectation of success.
Let us consider a single linear neuron described by the two parameters w, b. The perturbed input signal x̃ = x + η (with η ∈ [−1, 1]^n) is processed by an affine transformation to obtain
\[
\hat z = w^T(x + \eta) + b = w^T x + b + w^T\eta.
\]
This implies that the perturbation of the input causes the activation to grow by w^T η. We can maximize ẑ over η ∈ [−1, 1]^n by assigning η = sign(w), where, at this stage, w is the fixed parameter of the neuron, and obtain an adversarial perturbation. This attack can be extended to sequential networks (which are strongly non-linear functions) to obtain, sometimes, adversarial examples. The following procedure was discovered by Goodfellow and presented in the article [GSS14] as the fast gradient sign method.
Definition 2.2 (FGSM). Let f_θ be the network, x some known object correctly classified by the network with label y, and ε a small positive number. We suppose that the network has already been trained through a loss function L. The FGSM method computes adversarial examples for x, with respect to the supremum norm, by considering the perturbation
\[
\eta := \operatorname{sign}\big( \nabla_x L(f_\theta(x), y) \big)
\]
and putting x̃ = x + εη, obtaining an object which lies on the boundary of the cube neighborhood of center x and radius ε.
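A minimal PyTorch sketch of the FGSM attack of Definition 2.2, assuming that model is an already trained network and loss_fn the loss L used during training; names and signatures are illustrative.

import torch

def fgsm(model, loss_fn, x, y, eps):
    """Returns x_tilde = x + eps * sign(grad_x L(f_theta(x), y)),
    a point on the boundary of the sup-norm ball of radius eps around x."""
    x = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x), y)
    loss.backward()                         # fills x.grad with nabla_x L
    return (x + eps * x.grad.sign()).detach()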
It is natural to ask what a typical value of ε is. This depends on the nature of the problem. For example, the MNIST database [LC10], which we largely use in the experiments, offers a huge amount of handwritten digits archived as matrices of shape 28 × 28 in uint8 format. This means that the entries of these matrices are natural numbers from 0 to 255 and so, if we work with the rescaled images, this archiving method discards all information below 1/255. This produces a lower bound for ε: in other words, if we want to use this kind of attack in real applications we must produce realistic adversarial examples by considering ε ≥ 1/255.
On the other hand, values of ε larger than 0.1 produce variations of the images that are easily detectable by the human eye.
2.2 Robust Classification
A robust network is a network that is not affected by adversarial examples; a robust algorithm is a learning method that produces robust networks. Avoiding these attacks is useful in practical applications: for example, potential attacks include having malicious content identified as legitimate by self-guided vehicles, which clearly could produce catastrophic results [Pap+17].
Definition 2.3 (Robust Network). A network f_θ is said to be robust if there exists a certificate ε such that each point closer than ε to a training object x is classified with the same label as x: for each x̃ ∈ B_ε(x), class(x̃) = class(x), where B_ε(x) is the neighborhood of x of radius ε in some norm (in our work we focus only on the supremum norm).
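Definition 2.3 cannot be verified by enumerating the points of B_ε(x), but a naive empirical check can be sketched as follows (assuming a PyTorch model); random sampling can only falsify robustness, never certify it, which is precisely the gap filled by the certified approach of [WK18] discussed later in this chapter.

import torch

def empirical_robustness_check(model, x, eps, n_samples=1000):
    """Samples random points of the sup-norm ball B_eps(x) and checks that the
    induced class does not change. Passing this test does NOT prove that the
    network is robust at x; finding a counterexample proves that it is not."""
    with torch.no_grad():
        c = model(x).argmax()
        for _ in range(n_samples):
            eta = (2 * torch.rand_like(x) - 1) * eps   # uniform in [-eps, eps]^n
            if model(x + eta).argmax() != c:
                return False          # adversarial example found inside B_eps(x)
    return True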
The robustness of a classification algorithm is strictly related to the Lipschitz continuity of the neural network.
Proposition 2.1. Let f_θ : R^n → R^m be a deep neural network of depth k where σ is a Lipschitz function with constant l. Then, for a fixed parameter θ, for each x, x' ∈ R^n it holds that
\[
\| f_\theta(x) - f_\theta(x') \|_\infty \le l(\theta)\, \| x - x' \|_\infty
\]
where the Lipschitz constant is