
Reggio Emilia

PhD Programme in Mathematics
in agreement with the Università degli Studi di Ferrara and the Università degli Studi di Parma

Cycle XXXII, Coordinator: Prof. Cristian Giardinà

Interior Point Methods Meet Neural Networks:
an Application to Image Deblurring

Scientific Disciplinary Sector MAT/08

PhD Candidate: Dr. Carla Bertocchi

Advisor: Prof. Marco Prato


Contents

Introduction
Notations

1 Machine Learning Theory
1.1 Machine Learning Algorithms
1.2 Function Estimation Model
1.2.1 Expected vs Empirical Risk Minimization
1.2.2 Model Capacity
1.2.3 Bias and Variance Decomposition
1.2.4 Loss Functions
1.2.5 Hyperparameter Selection

2 Deep Learning Theory
2.1 Neural Networks History
2.1.1 The Formal Neuron
2.1.2 Rosenblatt's Single Layer Perceptron
2.1.3 Multilayer Feed-Forward Neural Networks
2.1.4 Convolutional Neural Networks
2.2 Neural Networks Training
2.2.1 Optimization Challenges
2.3 Gradient-based Learning Methods
2.3.1 Stochastic Gradient Descent Algorithm
2.3.2 Minibatch Gradient Descent
2.3.3 Backpropagation Algorithm
2.3.4 Other Optimization Algorithms

3 Inverse Problems and Imaging
3.1 Inverse Problems and Ill-Posedness
3.1.1 Linear Inverse Problems
3.2 Image Reconstruction Problem
3.2.1 Mathematical Modeling of Data Acquisition
3.3 Statistical Formulation of the Image Reconstruction Problem
3.3.1 Maximum Likelihood Approach
3.3.2 Maximum A Posteriori Approach
3.3.3 Regularization Functionals
3.3.4 Hard Constraints
3.4 Interior Point Methods
3.4.1 Problem Formulation
3.4.2 Interior Point Approaches
3.4.3 Proposed Iterative Schemes
3.4.4 Limitations
3.5 Proximity Operator of the Barrier
3.5.1 Affine Constraints
3.5.2 Hyperslab Constraints
3.5.3 Bounded $\ell_2$-norm

4 A Neural Network Based on a Proximal IPM for Image Restoration
4.1 Unfolding Methods
4.2 iRestNet Architecture
4.2.1 Hidden Structures
4.2.2 Differential Calculus
4.3 Network Stability
4.3.1 Relation to Generic Deep Neural Networks
4.3.2 Preliminary Results
4.3.3 Averaged Operator
4.3.4 Robustness of iRestNet to an Input Perturbation
4.4 Numerical Experiments
4.4.1 Problem Formulation for Gaussian Noise
4.4.2 Network Characteristics
4.4.3 Dataset and Experimental Settings
4.4.4 Training
4.4.5 Evaluation Metrics and Competitors
4.4.6 Results and Discussion
4.5 Extending iRestNet to Poisson Noise
4.5.1 Problem Formulation for Poisson Noise
4.5.2 Network Characteristics
4.5.3 Experimental Settings and Network Training
4.5.4 Results and Discussion

Appendix A Mathematical Background
A.1 Elements of Functional Analysis
A.2 Elements of Convex Analysis

The purpose of this thesis is to propose a novel Deep Learning model to address the problem of image reconstruction. To this end, we summarize the general theory of machine learning and, more particularly, the mathematics underlying Deep Learning. We introduce neural networks starting from their basic units, the formal neurons, which are combined to form complex and diverse structures, such as convolutional neural networks. We also study various aspects related to the training of neural networks and the various learning algorithms used to train them.

The problem of image reconstruction, or deblurring, is a well-known inverse problem. Before approaching it, we review the generalities of the problem, providing some details on the image formation process and on the noise deriving from data acquisition. Following a maximum a posteriori statistical approach, the problem is reformulated as a regularized optimization problem, in which the objective function to be minimized is the sum of a data discrepancy measure and a regularization term. Additional constraints can also be imposed to incorporate a priori knowledge on the desired solution. In our work we consider smooth data-fidelity and regularization terms, and we include the constraints in the objective function by means of a logarithmic barrier.

A proximal interior point method (IPM) is considered to address the minimization problem, in which the proximity operator is restricted to the barrier function only. Then, considering a finite number of iterations, the iterative algorithm is unfolded into a neural network, which we call iRestNet. In particular, the iterations of the IPM are incorporated into the layers of the network, whose training process merges with the optimization process. The key feature of our proposed approach is that the regularization parameter, the barrier parameter and the steplength needed in the iterations of the IPM are chosen automatically, by means of a particular Deep Learning strategy. In addition, we provide a stability result for the proposed network when the data fidelity term and the regularization function are quadratic.

We use benchmark image datasets to train and test the resulting neural network. For every experiment, a dataset is created with a different type of blur and corrupted with Gaussian noise. Comparisons with standard gradient projection methods, with recent machine learning algorithms and also with other unfolded methods have been performed, and the tests show good performance and the competitiveness of our approach.


We have also extended the proposed approach to deblurring problems on data corrupted with Poisson noise. Although the model still needs to be improved, preliminary results show that iRestNet achieves a good reconstruction quality in this case as well.

This thesis proposes a new Deep Learning strategy to address the problem of digital image reconstruction. To this end, we summarize the general theory of machine learning and, more specifically, the mathematics underlying Deep Learning. We introduce neural networks starting from their basic units, the formal neurons, which are combined to form complex and diverse structures, such as convolutional neural networks. We also study several aspects related to the training of neural networks and the various learning algorithms used to train them.

The image reconstruction problem, or deblurring, is a well-known inverse problem. Before approaching it, we review the generalities of the problem, providing some details on the image formation process and on the noise arising from data acquisition. Following a maximum a posteriori statistical approach, the deblurring problem is usually reformulated as a regularized optimization problem, in which the objective function to be minimized is given by the sum of a data-fidelity term and a regularization term. It is also possible to impose constraints on the objective function, which make it possible to integrate a priori knowledge on the desired solution. In our study we considered smooth data-fidelity and regularization terms, and we included box constraints in the objective function by means of a logarithmic barrier function.

To address the minimization process we consider an interior point method (IPM), in which the proximity operator is applied to the barrier function only. Starting from a finite number of iterations of the iterative method, these are incorporated into a neural network, which we call iRestNet. The peculiarity of the proposed approach lies in the fact that the regularization parameter, the barrier parameter and the steplength, required by the application of the IPM, are chosen automatically, by means of a particular Deep Learning strategy. Moreover, a theoretical result on the stability of the network under particular conditions is provided.

Several image datasets were used to train and test the resulting neural network. For each experiment, a set of images is created from a different type of blur and corrupted with Gaussian noise. In particular, we compared against standard gradient projection methods, recent machine learning algorithms and other unfolded methods. In conclusion, the tests show that iRestNet performs well and is competitive with the state of the art.

We have also extended the proposed approach to deblurring problems on data corrupted with Poisson noise. Although the model still needs to be refined, preliminary results demonstrate the effectiveness of iRestNet in this case as well.

Introduction

Machine learning research seeks to provide learning models with the ability to learn and improve automatically through observations and interaction with the world. The main objective of a machine learning system is to generalize from experience, to be able to perform accurately on new, unseen examples after having experienced a learning dataset. The training examples come from some generally unknown probability distribution, representative of the space of occurrences, and the machine learning algorithm builds a mathematical model about this space that enables it to produce sufficiently accurate predictions on new cases.

Over the last decade, machine learning has had a major impact in many areas of science, industry, and engineering, and Deep Learning has recently attracted even more attention. Deep Learning is a particular branch of machine learning which involves the use of artificial neural network models. These are engineered systems inspired by the biological brain, consisting of the interconnection of basic units called neurons, arranged in layers. By increasing the number of layers, a deep network can represent functions of increasing complexity. In particular, most tasks that consist of mapping an input vector to an output vector can be performed by Deep Learning, given sufficiently large models and sufficiently large datasets of labelled training examples. In recent years, the popularity of Deep Learning has greatly increased, thanks to the availability of more powerful computers and larger datasets, which allowed experimentation and the development of new techniques to train deeper networks. The training phase of a network requires minimizing the empirical risk over the training set, which is not a trivial problem. Most of the successful approaches are gradient-based learning methods, which aim to generate a sequence of iterates $\{\theta_k\}_{k \in \mathbb{N}}$ in the parameter space, so that the empirical risk is decreased along the iterations. In particular, the decrease of the objective function is obtained by moving along a descent direction, which is found by exploiting first-order information.
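As a minimal illustration, not taken from the thesis, a fixed-steplength gradient descent on the empirical risk of a linear least-squares model can be sketched as follows; the model, steplength and iteration count are assumptions made for the example.

```python
import numpy as np

def empirical_risk_grad(theta, X, y):
    """Gradient of the mean squared empirical risk for a linear model h(x, theta) = x^T theta."""
    M = X.shape[0]
    return (2.0 / M) * X.T @ (X @ theta - y)

def gradient_descent(X, y, alpha=0.1, n_iter=200):
    """Generate the iterates theta_k by moving along the negative gradient, a descent direction."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        theta = theta - alpha * empirical_risk_grad(theta, X, y)
    return theta
```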

The research activity presented in this thesis has dealt with the analysis of Deep Learning methods applied to inverse problems. We paid particular attention to image reconstruction problems, both by examining state-of-the-art deblurring algorithms, and by proposing our own deblurring method.

We consider inverse problems arising from the following mathematical model:

$$ y = D(H\bar{x}), \qquad (1) $$

where $y \in \mathbb{R}^m$ is the observed data, $\bar{x} \in \mathbb{R}^n$ is the sought image, $H \in \mathbb{R}^{m \times n}$ is the observation operator, which is assumed linear and known, and $D$ is the noise perturbation operator. In this context, both variational and Deep Learning approaches provide efficient methods for finding a good estimate of $\bar{x}$, while offering different benefits and drawbacks, which will be discussed below.
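For concreteness, a small synthetic instance of model (1) can be generated as in the sketch below; the Gaussian blur kernel and the noise level are illustrative assumptions, not the settings used in the experiments of this thesis.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(0)

x_bar = rng.random((64, 64))                   # sought image \bar{x}
Hx = gaussian_filter(x_bar, sigma=2.0)         # H \bar{x}: linear, known blur operator
y = Hx + 0.01 * rng.standard_normal(Hx.shape)  # D: additive Gaussian noise perturbation
```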

In order to find an adequate solution to an ill-posed inverse problem arising from (1), variational methods incorporate prior information on the sought variable $\bar{x}$ through constraints and regularization functions, such as the total variation and its various extensions [4]. Thus, the sought signal can be approximated by the minimizer of a regularized objective function, expressed as the sum of a data-fidelity term $f : \mathbb{R}^m \times \mathbb{R}^m \to \mathbb{R}$, which measures the fidelity of the solution to the observation model (1), and an additional regularization term $R : \mathbb{R}^n \to \mathbb{R}$, which encodes some prior information on the sought image and improves stability to noise. This leads to

$$ \min_{x \in C}\; f(Hx, y) + \lambda R(x), \qquad (2) $$

where $\lambda \in \mathbb{R}_{>0}$ is a regularization parameter, which balances the role of the two terms in the objective function, and $C$ is a subset of $\mathbb{R}^n$.

Over the past decades, iterative reconstruction methods have achieved remarkable results in solving inverse problems in imaging such as (2). Thanks to robust regularizers such as the total variation, practical algorithms have appeared that combine good image quality with reasonable computational complexity. In particular, it has recently been shown that combining the interior point framework with a proximal forward-backward strategy [34, 36] leads to very competitive solvers [32, 38, 39]. Based on these considerations, we consider a particular proximal interior point algorithm for the development of our own approach. Each iteration of the algorithm is made up of two steps: a forward step, which is a gradient step on the differentiable term, and a backward step, which involves the evaluation of the proximity operator of a logarithmic barrier.
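In schematic form, one iteration of such a scheme might be written as below; `grad_smooth` and `prox_barrier` are hypothetical callables standing for the gradient of the smooth term and the proximity operator of the logarithmic barrier, whose closed-form expressions for specific constraints are derived in Chapter 3.

```python
def proximal_ipm_step(x, alpha, mu, grad_smooth, prox_barrier):
    """One proximal interior point iteration: a forward gradient step on the
    smooth term, then a backward step via the proximity operator of the
    logarithmic barrier (scaled by the barrier parameter mu), which keeps
    the iterate strictly feasible."""
    z = x - alpha * grad_smooth(x)       # forward (explicit) step
    return prox_barrier(z, alpha * mu)   # backward (proximal) step
```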

Although useful, this approach is sometimes limited by its complexity: solving (2) may be too slow for real-time applications, especially if we consider the time required for parameter estimation. For example, $R$ is usually parametrized by one or several parameters, like $\lambda$, whose optimal choice may strongly depend on the data at hand. The parameters are often tuned manually or by following some heuristic rules. However, these methods are often time-consuming and their success is not always guaranteed, with a consequent loss in efficiency and versatility of the resulting reconstruction scheme. An accurate setting of the regularization parameter is particularly critical, because the quality of the reconstruction obtained with regularized approaches depends highly on it. Existing approaches for the selection of $\lambda$, which are based on statistical considerations, generally lead to a substantial increase in computational cost.


In recent years, convolutional neural networks (CNNs) have shown outstanding performance on object classification and segmentation tasks. Motivated by these successes, researchers have begun to apply CNNs to various applications related to inverse problems, such as denoising [149], non-blind and blind deblurring [124, 125, 144], super-resolution [92], or CT reconstruction [25, 76], and they have started to report improvements over state-of-the-art methods.

Applying neural networks directly to the blurred image does not seem to be the best way to deal with image reconstruction problems. Instead, DNNs for inverse problems are often preceded by a pre-processing step, where a rough estimate of $\bar{x}$ is found by using the inverse or pseudoinverse of $H$. The latter tends to strongly amplify noise; hence, in this context, DNNs are used as denoisers and artifact removers. A drawback of these methods is that, since the DNNs are used as black boxes, their explainability and reliability can be questioned. Furthermore, the pre-processing step can itself include a penalty, thus amounting to solving a problem of the form (2), where the regularization weight strongly depends on the noise level, e.g., [22, 124].

Our research mainly concerned the study of recent strategies that merge Deep Learning techniques with variational methods, in order to take the strengths of both. One straightforward way to combine the benefits of both variational methods and DNNs is to unfold an iterative method into a network structure, by turning each iteration into a layer of the network. In our work, we propose a novel neural network architecture called iRestNet [13, 37], which is obtained by unfolding the aforementioned proximal interior point algorithm over a finite number of iterations. One key feature of this algorithm is that it produces only feasible iterates, thanks to a logarithmic barrier. This barrier enables prior knowledge to be directly incorporated into iRestNet. The stepsize, barrier parameter, and regularization weight are untied across the network and learned for each layer in a supervised fashion. To train iRestNet we use gradient backpropagation, which requires the derivatives of the involved proximity operator with respect to its input and to the parameters to be learned. Therefore, we conducted an analysis of the proximity operator of the barrier and of its derivatives, for three examples of interest. Once the network has been trained, its application to test images requires only a short execution time per image, without any parameter search, as opposed to traditional variational methods.
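Conceptually, unfolding turns each iteration into a trainable layer. The PyTorch-style sketch below is our own illustration, not the actual iRestNet code: it simply makes the stepsize, barrier parameter and regularization weight learnable and untied per layer, mirroring the description above; `grad_fidelity`, `grad_reg` and `prox_barrier` are hypothetical callables.

```python
import torch
import torch.nn as nn

class UnfoldedIPMLayer(nn.Module):
    """One layer = one proximal interior point iteration, with a learnable
    stepsize (alpha), barrier parameter (mu) and regularization weight (lam)."""

    def __init__(self, grad_fidelity, grad_reg, prox_barrier):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(1.0))  # stepsize, untied per layer
        self.mu = nn.Parameter(torch.tensor(1.0))     # barrier parameter
        self.lam = nn.Parameter(torch.tensor(0.1))    # regularization weight
        self.grad_fidelity = grad_fidelity
        self.grad_reg = grad_reg
        self.prox_barrier = prox_barrier

    def forward(self, x, y):
        grad = self.grad_fidelity(x, y) + self.lam * self.grad_reg(x)
        z = x - self.alpha * grad                           # forward gradient step
        return self.prox_barrier(z, self.alpha * self.mu)   # proximal barrier step
```

A full network would then be a chain of such layers trained end-to-end with backpropagation.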

Related works apply deep unfolding to probabilistic models, such as Markov random fields [68] and topic models [31], and to different algorithms like primal-dual solvers [139] or the proximal gradient method [45, 97]. Classic optimization algorithms can be unfolded to perform many different tasks in image processing. For instance, FISTA and ISTA can be unfolded to perform sparse coding [59, 78], while ISTA and ADMM can also be unfolded for image compressive sensing [127, 147]. Deep unfolding is also used to learn shrinkage functions, which can be viewed as proximity operators of sparsity-promoting functions [123, 128], or to optimize hyperparameters in nonlinear reaction-diffusion models [30]. In the aforementioned works, some functions and operators are learned. To the best of our knowledge, iRestNet is the first architecture corresponding to a deep unfolded version of an interior point algorithm with untied stepsize and regularization parameter. As opposed to other unfolding methods like [45, 97], the proximity operator and the regularization term are kept explicit, which establishes a direct relation between the original algorithm and the network.

Only a few works so far have considered combining interior point methods (IPMs) with Deep Learning. Every layer of the network in [2] solves a small quadratic problem using an IPM, while in [134] hard constraints are enforced on the weights by using the logarithmic barrier function during training. It is worth noting that the goal of these approaches is to minimize the given objective function. Conversely, in our case a better indicator of perceptual quality (SSIM) is optimized during the training of iRestNet. For this reason, the output of the trained network is not necessarily a solution to problem (2). In addition, iRestNet appears to have more flexibility, since the regularization weight can vary among layers.

One critical issue concerning neural networks is to guarantee that their performance remains acceptable when the input is perturbed. In this direction, a recent work [35] provides a theoretical framework that makes it possible to evaluate the robustness of a network. We then conduct a stability analysis of the proposed network when the data fidelity term and the regularization function are quadratic, and we give explicit conditions under which the robustness of the proposed architecture is ensured.

Numerical experiments are also conducted to validate the theoretical results. We test iRestNet on a set of non-blind image deblurring problems, where the images to be reconstructed are natural color images blurred with different types of blur and corrupted with Gaussian noise. In this case, the problem is formulated as the constrained minimization of an objective function expressed as the sum of a mean-squared loss function and a smoothed total variation regularizer. The choice of the data-fidelity function is strictly related to the noise that affects the images, while the regularization function is chosen for its edge-preserving properties. From this problem formulation we derive the IPM iteration, and we use it to construct the layers of iRestNet. The network is then trained in a supervised fashion, in order to achieve optimal reconstruction quality, and it is evaluated on a test set of images. The performance of our approach is compared with state-of-the-art methods.

Further work is devoted to extending iRestNet to data corrupted with Poisson noise, with the aim of addressing a wider class of problems. This is useful because in imaging applications such as emission tomography, microscopy and astronomy, the main source of noise corrupting the images is photon counting [145], which is well described by a Poisson process. For the Poisson noise case, we present the preliminary results obtained in numerical experiments, aware that further work must be done in order to improve them.

This thesis is organized as follows.

In Chapters 1 and 2, we introduce the general theory of machine learning and Deep Learning, respectively. Indeed, it is helpful to establish some background in learning theory in order to use neural networks effectively and exploit their potential in applications to inverse problems. Specifically, in Chapter 1 the machine learning framework is presented from a mathematical point of view, paying particular attention to the concepts of supervised learning, empirical risk minimization and generalization. The purpose of Chapter 2 is to introduce the mathematical theories underlying Deep Learning and the algorithms used to train neural network structures, as well as to provide some historical perspective on the development of neural networks. We explain neural networks starting from their basic units, the formal neurons, and we see how to combine them to form complex and diverse structures, like convolutional neural networks. Then we examine various aspects of the optimization process involved in the training of these deep models, in order to tune their internal weights.

In Chapter 3, we give the generalities of image reconstruction problems, providing some details on the image formation process and on the noise arising during data acquisition. Moreover, we review the basics of the statistical framework underlying the solution of inverse problems, focusing particularly on the maximum a posteriori approach and on the regularization functionals which will be employed in the numerical experiments. We describe the proximal interior point method which is at the core of our approach, and we conclude the chapter by conducting an analysis of the proximity operator of the barrier and of its derivatives, for three examples of interest.

In Chapter 4, after a brief introduction to unfolding methods, we present the proposed neural network architecture and its associated backpropagation method. In Section 4.3, we conduct a stability analysis of the proposed network when the data fidelity term and the regularization function are quadratic. The rest of the chapter is dedicated to numerical experiments and comparisons with state-of-the-art methods for image deblurring, in the case of both Gaussian and Poisson noise. Starting from the experiments conducted with Gaussian noise, we introduce the variational formulation of the problem, the specific structure we defined for iRestNet, and how we conducted the training of the network. All the details about the experimental set-up are given, from the training to the testing of iRestNet and its comparison with other methods; numerical and visual results are also reported. In the last section of the chapter we adapt iRestNet to address the image reconstruction problem on Poisson data; in this case too, we provide the numerical experiments and the preliminary results we obtained. Finally, we draw conclusions by discussing these results.

In Appendix A, we report a summary of basic concepts of functional analysis and convex analysis, which can be useful for the reader.


The research presented in this PhD thesis appears in the following publications:

1) C. Bertocchi, E. Chouzenoux, M.-C. Corbineau, J.-C. Pesquet, and M. Prato. Deep unfolding of a proximal interior point method for image restoration. Inverse Problems, 36(3):034005, Feb. 2020.

2) M.-C. Corbineau, C. Bertocchi, E. Chouzenoux, M. Prato, and J.-C. Pesquet. Learned Image Deblurring by Unfolding a Proximal Interior Point Algorithm. In 2019 IEEE International Conference on Image Processing (ICIP), pages 4664-4668, IEEE, Sep. 2019.

All the codes and a demo for training and testing iRestNet are available at https://github.com/mccorbineau/iRestNet

Notations

• $\mathbb{R}^n$ denotes the $n$-dimensional Euclidean space endowed with the standard scalar product $(\cdot, \cdot)$ and the norm $\|\cdot\|$.

• $\|\cdot\|$ denotes the Euclidean norm (also called $\ell_2$-norm): $\|x\| = \|x\|_2 = \sqrt{x^T x}$.

• Given a normed vector space $X$, $\|\cdot\|_X$ denotes the norm in $X$ induced by the inner product $(\cdot, \cdot)_X$: $\|x\|_X = \sqrt{(x, x)_X}$.

• $M \in \mathbb{R}^{m \times n}$ denotes a real matrix with $m$ rows and $n$ columns.

• $I_n \in \mathbb{R}^{n \times n}$ denotes the $n \times n$ identity matrix, i.e., the square matrix with ones on the main diagonal and zeros elsewhere. We simply denote it by $I$ when the size can be trivially determined from the context.

• $\mathrm{Id}_n$ denotes the identity operator of $\mathbb{R}^n$.

• $e \in \mathbb{R}^m$ and $0 \in \mathbb{R}^m$ denote the $m$-vectors with all entries equal to 1 and 0, respectively.

• $|x|_+ = \max(0, x)$, that is, $x$ if $x > 0$ and $0$ otherwise.

• Given $A, B \in \mathbb{R}^{m \times n}$, $A \odot B$ denotes the Hadamard product of $A$ and $B$.

• $\prod$ denotes the product operation, e.g., $\prod_{i=1}^{n} i = 1 \cdot 2 \cdots n = n!$.

• If $x, y \in \mathbb{R}^n$, then $x^T y = \sum_{i=1}^{n} x_i y_i$ denotes the scalar product.

• If $x \in \mathbb{R}^n$, $x \geq 0 \Leftrightarrow x_i \geq 0$, $i = 1, \ldots, n$. An analogous notation holds for $>$, $\leq$, $<$.

• $\bar{\mathbb{R}} = \mathbb{R} \cup \{-\infty, +\infty\}$ is the set of extended real numbers.

• $\mathbb{R}_{\geq 0} = \{x \in \mathbb{R} : x \geq 0\}$ and $\mathbb{R}_{>0} = \{x \in \mathbb{R} : x > 0\}$ are the sets of non-negative and positive real numbers, respectively.

• Given $\rho \in \mathbb{R}_{>0}$ and $c \in \mathbb{R}^n$, $B(c, \rho) = \{x \in \mathbb{R}^n : \|x - c\| \leq \rho\}$ denotes the closed Euclidean ball of center $c$ and radius $\rho$.

• Given $S \subset \mathbb{R}^n$, $\mathrm{int}\, S = \{x \in S : \exists \rho \in \mathbb{R}_{>0},\ B(x, \rho) \subseteq S\}$ is the interior of the set $S$.

• For every $q \in \mathbb{N}$, $\Gamma_0(\mathbb{R}^q)$ denotes the set of functions which take values in $\mathbb{R} \cup \{+\infty\}$ and are proper, convex, and lower semicontinuous on $\mathbb{R}^q$.

• $\mathcal{R}(K) = \{y \in Y \mid y = K(x) \text{ for some } x \in X\} \subseteq Y$ is the range of the operator $K : X \to Y$.

• $\mathcal{N}(K) = \{x \in X \mid K(x) = 0\} \subseteq X$ is the kernel of the operator $K : X \to Y$.

• $\mathcal{L}(X, Y)$ denotes the space of the linear and continuous operators between two Hilbert spaces $X$ and $Y$.

• The matrix $D$ is used to denote a discrete gradient operator, while $D_v$ and $D_h$ are its vertical and horizontal components, respectively.

1 Machine Learning Theory

In this chapter we revisit the basic principles of machine learning. This general theory will be applied throughout the rest of the thesis, when we discuss the more specific context of deep learning and when we apply neural networks to a specific image deblurring problem.

1.1 Machine Learning Algorithms

First of all, a machine learning algorithm is usually defined as an algorithm able to "learn" from empirical data. But what does it mean for an algorithm to learn? In [102], Mitchell gives the following definition:

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

In this definition, learning is the way to become capable of performing a task. Many kinds of problems have been addressed with machine learning. Tasks like classification, regression, translation, speech recognition, anomaly detection, self-driving are only some of the problems that have contributed to the success of machine learning in recent years.

In order to evaluate the abilities of a machine learning algorithm on a given task, we also need a quantitative measure of its performance. Usually this measure P is specific to the task T being addressed. For example, for classification tasks we often measure the accuracy of the model, which is the proportion of examples for which the model gives the correct output among the total number of examples [101]. For tasks such as regression, other kinds of measures are better suited, like metrics that give a continuous-valued score for each example. For image denoising, many related works evaluate the learning performance using the peak signal-to-noise ratio (PSNR) or the structural similarity measure (SSIM), which are measures specific to the quality of signal reconstruction. In particular, SSIM will be treated in the experimental part of this thesis, in Section 4.4, dealing with the problem of image deblurring.

Regarding the experience E, a dataset is usually available, which is a collection of examples of objects or events related to the task to be performed. If we consider a dataset of $M$ elements, $\{x_1, \ldots, x_M\}$, every example, represented as a vector $x_i \in \mathbb{R}^n$, is a collection of features that have been quantitatively measured from the object or event, and that we want the machine learning system to process. In this case, each entry of the single example $x_i$ is a feature. Considering a neural network performing image denoising, for example, it is usual to work with a dataset of images, where the features to be processed are the pixel values.

Different types of learning paradigms are possible, depending on the type of dataset which is available during the learning process. The most common ones are classified as

• Unsupervised learning, in which the machine must learn to make sense of the data provided without a guide. This means that the algorithm classifies and organizes a series of inputs by studying their common characteristics. For example, in a classification problem the classes may not be known a priori, but must be learned by the machine, as the inputs to the system are unclassified examples.

Unsupervised algorithms are useful if we want to learn properties of the dataset structure: by observing several examples of a random vector x, they can obtain information on the probability distribution p(x) that generated the dataset. This is helpful for learning to draw new samples from the distribution, finding a manifold near which the data lie, or clustering the data into groups of related examples.

A classic unsupervised learning task is to find the best representation of the data, one that preserves as much information as possible about the input x while being simpler or more accessible than x itself. An example of an algorithm used for this is Principal Component Analysis (PCA) [56], which transforms the data into a representation with a lower dimensionality than the original input.

• Supervised learning, which aims to learn a function that maps an input to a desired output by looking at many examples of input-output pairs. It is as if the target for each training input were provided by an instructor who shows the machine what to do.

Thus, in supervised learning, each example is a pair consisting of an input object x, as described above, and an associated target y. The target can be of different types, depending on the task to be performed. For example, if we want to classify an object, the target could be either a scalar symbolizing the correct class, or the probability of belonging to that class. If instead we perform signal denoising, the target is usually the clean signal. The input-output pairs thus provided are analyzed by the learning algorithm and used to infer the underlying function which maps the inputs to the correct outputs. In this way, the hope is that good training on the given examples will allow the algorithm to determine the correct outputs also for unseen instances. This is not obvious, as explained in Section 1.2.1, because it requires the learning algorithm to be able to generalize from the training data to unseen situations.

Typical tasks treated with supervised learning include classification and regression. These are useful in many different applications, like medical diagnosis, voice identification, handwriting recognition, spam detection and many other tasks. Some traditional supervised learning algorithms are Support Vector Machines (SVM) [19, 40], the k-nearest neighbors algorithm [1], and decision trees [115].

Other learning paradigms exist; some machine learning algorithms do not just experience a fixed dataset. For example, a reinforcement learning algorithm learns to perform a given task by interacting with the environment, so that there is a feedback loop between the learning system (the agent) and its own experience. In particular, during the learning process these algorithms find the best possible behaviour or path to take in a specific situation by being penalized when they make wrong decisions and rewarded when they make right ones. In this sense, the experience is not provided as a fixed dataset of examples, but as a system of positive and negative rewards for the agent's actions.

Due to its generality, the field is studied in many disciplines, such as game theory, control theory, operations research, multi-agent systems, statistics and genetic algorithms; see [14, 130] for reference. Other learning paradigms are also possible, like semi-supervised or weakly supervised learning, but they are not treated in this thesis. From now on we will work in a purely supervised learning framework.

1.2 Function Estimation Model

Supervised learning involves learning from a set of data called the training set. Every element of the training set is an example input-output pair, where the input maps to an output. In this context, the learning problem consists of inferring the function that maps an input to its output, such that the learned function is able to predict the output for future inputs, which may differ from those used to train the model.

Generalization is the ability we want to acquire: performing accurately on new, unseen examples after having experienced the training set. To estimate this generalization ability, we typically measure the performance of the model on a test set of new examples, different from those used in training. However, in practice, when training a machine learning model, we have access to a single dataset of examples. The standard procedure is to set aside part of this original dataset and to use it as a test set on which to evaluate the performance of the model, i.e., its generalization ability. This is known as the holdout method [85]. In this way, the test set is composed of examples independent from those on which we train the model, but that follow the same probability distribution.


1.2.1 Expected vs Empirical Risk Minimization

To put it more formally, take $X \subseteq \mathbb{R}^n$ to be the space of all possible inputs, drawn independently from a fixed but unknown distribution $P(x)$. Also, consider $Y$ to be the space of all possible outputs, such that for every input vector $x$ an output $y \in Y$ is returned, according to a conditional distribution function $P(y|x)$, also fixed but unknown.

From the statistical learning theory perspective, we assume that there is a joint probability distribution $P(x, y)$ over $X \times Y$, called the data generating distribution, such that $P(x, y) = P(y|x)P(x)$. In this case, the dataset consists of random independent identically distributed (i.i.d.) samples drawn from $P(x, y)$. We take $M$ such samples and use them as the training set,

$$ S = \{(x_1, y_1), \ldots, (x_M, y_M)\}, $$

where every $x_i$ is an input vector from the training data, and $y_i$ is the corresponding output. The remaining samples are kept aside as the test set.

Note that it is important to make i.i.d. assumptions, which imply that the examples in each dataset are independent from each other, and that the training set and test set are identically distributed, drawn from the same probability distribution.

The main goal of supervised learning is to infer the function which maps the inputs to the correct outputs, without merely memorizing the given data, but generalizing beyond the examples of the training set. To do this, we select a family $\mathcal{H}$ of prediction functions, which are all the functions the algorithm will search through,

$$ \mathcal{H} := \{h(\cdot, \theta) \mid \theta \in \Omega\}, $$

parametrized by $\theta$ in the space of possible parameter values $\Omega$. This is called the hypothesis space. The inference problem consists of finding the function $h(x, \theta^*) \in \mathcal{H}$ such that, for every sample $(x, y)$ drawn from $P(x, y)$, the value $h(x, \theta^*)$ represents an accurate prediction of the output value $y$.

Reformulating the problem, we look for the function that minimizes the errors caused by incorrect predictions. For this purpose we choose a cost function, or loss function, $L : Y \times Y \to \mathbb{R}$, which is a cost measure $L(h(x, \theta), y)$ of the error between the output $h(x, \theta)$ predicted by the system and the real output $y$. The objective function we want to minimize is called the generalization error, or expected risk, and it is the expectation of the loss function with respect to the underlying probability distribution $P(x, y)$:

$$ J(\theta) = \mathbb{E}[L(h(x, \theta), y)] = \int_{X \times Y} L(h(x, \theta), y)\, dP(x, y). \qquad (1.1) $$

In this definition, the variable $\theta$ represents the part of the learning system which must be adapted to find the function $h(x, \theta^*) \in \mathcal{H}$ that minimizes the expected risk $J(\theta)$ over every possible input-output pair.


Now, even if we would like to minimize the expected risk, it is not possible to do so directly, because the joint probability distribution $P(x, y)$ is unknown by hypothesis, and the only available information is contained in the training set $S$.

In order to solve this problem, the induction principle of empirical risk minimization (ERM principle), from [136], suggests minimizing an approximation of the expected risk $J(\theta)$, constructed by averaging the loss function over the available examples of the training set $S$. This estimate is called the training error, or empirical risk, $J_M(\theta) : \Omega \to \mathbb{R}$,

$$ J_M(\theta) = \frac{1}{M} \sum_{i=1}^{M} L(h(x_i, \theta), y_i). \qquad (1.2) $$

The ERM principle assumes that the function $h(x, \theta^*)$ which minimizes $J_M(\theta)$ over $\theta \in \Omega$ results in an expected risk $J(\theta^*)$ which is close to its minimum. This is shown in [135], where general theorems prove that the minimization of the empirical risk $J_M(\theta)$ can provide a good estimate of the minimum expected risk, given that the size of the training set is large enough. Theoretically, the system is therefore able to generalize, i.e., to learn results that have general validity from a training set of finite size.
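As a toy example of (1.2), written by us under the assumption of a linear model and a squared loss, the empirical risk can be computed as follows.

```python
import numpy as np

def empirical_risk(theta, X, y, loss):
    """Empirical risk J_M(theta): average of the loss over the M training pairs."""
    predictions = X @ theta                # h(x_i, theta) for a linear model
    return np.mean(loss(predictions, y))

squared_loss = lambda pred, target: (pred - target) ** 2

rng = np.random.default_rng(42)
X = rng.standard_normal((100, 3))          # M = 100 samples, n = 3 features
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.1 * rng.standard_normal(100)
print(empirical_risk(theta_true, X, y, squared_loss))  # close to the noise variance 0.01
```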

Note that having generalization as the goal has interesting consequences for machine learning. In this case we do not have access to the function $J(\theta)$ we want to optimize; instead, we have to use the empirical risk as a surrogate. But this also means that, since the objective function is only an approximation of the true goal, we may not need to optimize it fully; in fact, a merely low value of $J_M(\theta)$ may yield better generalization performance than the global optimum $J_M(\theta^*)$. This argument justifies the introduction of early stopping techniques in the learning process, as we will see in Section 2.3.1.

1.2.2 Model Capacity

To train a machine learning model, we search for the system parameters $\theta$ that minimize the empirical risk on a given training set; we then test the model with the trained parameters on the test set, to evaluate its generalization ability.

In this process, the expected error on the test set is greater than or equal to the expected value of the training error, that is, the empirical risk $J_M(\theta)$. The factors determining how well a trained model will perform are its ability to:

- Minimize the training error.
- Minimize the gap between training and test error.

These two factors correspond to the two central challenges in machine learning: underfitting and overfitting. In particular, underfitting occurs when the model is not able to obtain a sufficiently low error value on the training set. Overfitting occurs when the gap between the training error and the test error is too large.


Figure 1.1: Consider training data generated by evaluating a quadratic function. Left: a linear function cannot capture the curvature of the data (underfitting). Center: a quadratic function generalizes well to unseen points. Right: a degree-9 polynomial fit to the data passes exactly through all the training points, but does not extract the correct structure (overfitting). Example from [56].

We can control whether a model is more likely to overfit or underfit by modifying its capacity. Informally, a model's capacity is its ability to fit a wide variety of functions. Models with insufficient capacity may struggle to fit the training set, falling into underfitting. Models with high capacity can solve complex tasks, but when their capacity is higher than needed they may overfit by memorizing properties of the training set that are not useful on the test set. Machine learning algorithms generally perform best when their capacity is adequate to the complexity of the task and to the amount of training data they are provided with.

An example of this principle is shown in Figure 1.1. Here we compare linear, quadratic and degree-9 polynomial predictors attempting to fit a training set drawn from an underlying quadratic function. The linear function is unable to capture the curvature in the data, so it underfits. The degree-9 predictor is capable of representing many functions that pass exactly through the training points, including the correct one, but there is little chance of choosing a solution that generalizes well when so many different solutions exist. In this example, the quadratic model matches the true underlying structure, so it generalizes well to new data.
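The comparison in Figure 1.1 can be reproduced in a few lines; the sketch below is our own, with an arbitrary noise level and sample size.

```python
import numpy as np

rng = np.random.default_rng(1)

# Training data from an underlying quadratic function, with a little noise
x_train = rng.uniform(-1, 1, 10)
y_train = 2 * x_train**2 - x_train + 0.05 * rng.standard_normal(10)

x_test = np.linspace(-1, 1, 100)
y_test = 2 * x_test**2 - x_test

for degree in (1, 2, 9):
    coeffs = np.polyfit(x_train, y_train, degree)   # least-squares polynomial fit
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_err:.4f}, test MSE {test_err:.4f}")
```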

From another point of view, this means that the empirical risk and the expected risk behave differently, depending on the model capacity. Typically, as the model capacity increases, the training error decreases until it asymptotes to its minimum value, while the generalization error follows a U-shaped curve. The gap between these two errors increases and eventually outweighs the decrease in training error, entering the overfitting regime above the optimal capacity. This general behaviour is illustrated in Figure 1.2.


Figure 1.2: Relationship between model capacity and error, from [56].

As we can see, in general, while simpler functions are more likely to generalize, we must still choose a sufficiently complex model to achieve low training error.

The model's capacity can be modified by selecting the hypothesis space $\mathcal{H}$ of the functions the learning algorithm can choose from; in particular, $\mathcal{H}$ determines the representational capacity of the model [56]. However, once the hypothesis space is chosen, empirical risk minimization can be a very difficult optimization problem, and in practice the learning algorithm does not find an exact minimizer, but only a function that significantly reduces the training error. This means that the effective capacity of the model may be less than its representational capacity. Considering the previous polynomial example, we choose $\mathcal{H}$ as the family of the polynomials of a particular degree $n$, and the optimization process then looks for the polynomial of degree $n$ that best approximates the data.

The most important results in statistical learning theory show that the discrepancy between empirical risk and expected risk is bounded from above by a quantity that grows as the model capacity grows, but shrinks as the training set size increases [135, 137]. These bounds provide a theoretical justification for machine learning algorithms, but they are rarely used in the context of Deep Learning, in which we will present our results. This is in part because these bounds are often quite loose, and in part because it can be difficult to determine the capacity of deep neural networks [56]. Also, in Deep Learning there is little theoretical understanding of the non-convex optimization problems involved in the training of neural networks, and the effective capacity of the models is often limited by the capabilities of the optimization algorithms used to train them.


1.2.3 Bias and Variance Decomposition

The field of statistics gives us many tools to understand generalization. We recall concepts such as parameter estimation, bias and variance, which are useful to formally characterize the notions of underfitting and overfitting.

Point estimation is the attempt to provide the best prediction of a certain quantity, which can be a single parameter, a vector of parameters in some parametric model (such as the weights in a neural network), or even a whole function. In the latter case we talk about function estimation, which is the same as point estimation in a function space.

Let $\{x_1, \ldots, x_M\}$ be a set of $M$ independent and identically distributed (i.i.d.) data points. A point estimator, or statistic, is any function of the data:

$$ \Theta_M = g(x_1, \ldots, x_M). \qquad (1.3) $$

This definition is very general: while almost any function qualifies as an estimator, a good estimator is a function whose output is close to the true underlying $\hat{\Theta}$ that generated the training data. Moreover, since the data are drawn from a random process while the true $\hat{\Theta}$ is fixed but unknown, the estimate $\Theta_M$ is a random variable.

Let us now recall some properties of point estimators.

The bias of an estimator is defined as the expected deviation from the true value $\hat{\Theta}$:

$$ \mathrm{bias}(\Theta_M) = \mathbb{E}(\Theta_M) - \hat{\Theta}, \qquad (1.4) $$

where the expectation is over the data (seen as samples of a random variable) and $\hat{\Theta}$ is the true underlying value used to define the data generating distribution. An estimator $\Theta_M$ is said to be unbiased if $\mathrm{bias}(\Theta_M) = 0$, which implies that $\mathbb{E}(\Theta_M) = \hat{\Theta}$. An estimator is said to be asymptotically unbiased if $\lim_{M \to \infty} \mathrm{bias}(\Theta_M) = 0$, which implies $\lim_{M \to \infty} \mathbb{E}(\Theta_M) = \hat{\Theta}$.

Another property of an estimator that we consider is how much we expect it to vary as a function of the data sample. Just as we computed the expectation of the estimator to determine its bias, we can compute its variance. The variance of an estimator is simply

$$ \mathrm{Var}(\Theta_M) = \mathbb{E}\big[(\Theta_M - \mathbb{E}[\Theta_M])^2\big]. $$

The square root of the variance is called the standard error, denoted $\mathrm{SE}(\Theta_M)$. The variance, or the standard error, of an estimator provides a measure of how we would expect the estimate computed from the data to vary as we independently resample the dataset from the underlying data-generating process.

Just as we might like an estimator to exhibit low bias, we would also like it to have relatively low variance. Bias and variance measure two different sources of error in an estimator: bias measures the expected deviation from the true value of the function, while variance provides a measure of the deviation from the expected estimator value that any particular sampling of the data is likely to cause. A way to negotiate this trade-off is to compare the mean squared error (MSE) of the estimates. In fact, the MSE measures the overall expected deviation between the estimator $\Theta_M$ and the true value of the parameter $\hat{\Theta}$:

$$ \mathrm{MSE} = \mathbb{E}\big[(\Theta_M - \hat{\Theta})^2\big] = \mathbb{E}\big[\Theta_M^2 + \hat{\Theta}^2 - 2\hat{\Theta}\Theta_M\big] = \mathbb{E}[\Theta_M^2] + \hat{\Theta}^2 - 2\hat{\Theta}\,\mathbb{E}[\Theta_M] = \mathrm{bias}(\Theta_M)^2 + \mathrm{Var}(\Theta_M). \qquad (1.5) $$

As we can see in equation (1.5), the MSE incorporates both the bias and the variance of $\Theta_M$. For this reason, to keep both quantities under control, estimators for which the mean squared error is small are preferred.
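To make the decomposition (1.5) concrete, the following Monte Carlo check, our own example, estimates the bias, variance and MSE of the sample mean as an estimator of the mean of a Gaussian; the two sides of (1.5) agree numerically.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_hat = 3.0           # true parameter of the data generating distribution (the text's \hat{Theta})
M, n_trials = 50, 10_000

# One estimate per independently resampled dataset of M points
estimates = np.array([rng.normal(theta_hat, 1.0, M).mean() for _ in range(n_trials)])

bias = estimates.mean() - theta_hat
variance = estimates.var()
mse = np.mean((estimates - theta_hat) ** 2)
print(f"bias^2 + variance = {bias**2 + variance:.5f}, MSE = {mse:.5f}")
```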

In the context of machine learning, several studies have analysed overfitting through the decomposition of the generalization error, measured by the MSE, into bias and variance [41, 52]. In this case, the bias measures how much the model output, averaged over all possible datasets, differs from the desired function, while the variance measures how much the output varies between different datasets.

Increasing the model capacity tends to increase variance and decrease bias. At the beginning of training, the data have had little influence, so the variance is small, while the bias is large because the model output is far from the desired function (underfitting). Late in training, the bias is small because the algorithm has learned the underlying function. However, if trained for too long, the algorithm will also learn the noise specific to the dataset, and the variance will be large; this is referred to as overfitting. It can be shown that the minimum total error occurs when the sum of bias and variance is minimal [91]. This is illustrated in Figure 1.3, where we see again the U-shaped curve of the generalization error as a function of capacity.

To summarize, in machine learning it is important to choose an appropriate model capacity: if the model is too complex, it will generalize poorly to unseen data (overfitting); if the capacity is too low, the model will not capture all the information in the data (underfitting). This is often referred to as the bias-variance trade-off, since a complex model exhibits large variance while an overly simple one is strongly biased. Empirically, there are several techniques to control this trade-off. Cross-validation, for example, is highly successful on many real-world tasks. Specific to machine learning, methods like early stopping (see Section 2.3.1) or regularization are well known, as in support vector machines and regularization networks [51, 65].

Cross-Validation Method

Dividing the data into a fixed training set and a fixed test set can be problematic if the resulting test set is small. In fact, a small test set implies statistical uncertainty around the estimated average test error, making it difficult to claim that one algorithm works better than another on the given task.


Figure 1.3: Example from [56] of the relationship between model capacity and generalization error, highlighting the dependence of the latter on bias and variance.

When the dataset contains hundreds of thousands of examples, this is not a serious issue. When the dataset is too small, alternative procedures enable one to use all of the available data for the estimation of the mean test error, at the price of an increased computational cost. These procedures are based on the idea of repeating the training and testing computation on different randomly chosen splits of the original dataset. The most common is the k-fold cross-validation procedure, in which a partition of the dataset is formed by splitting it into k non-overlapping subsets. The test error may then be estimated by taking the average test error across k trials: on trial i, the i-th subset of the data is used as the test set and the rest of the data is used as the training set.
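A bare-bones sketch of the procedure, written by us without external libraries, is given below; `train` and `test_error` are hypothetical callables supplied by the user.

```python
import numpy as np

def k_fold_cv(X, y, train, test_error, k=5, seed=0):
    """Estimate the mean test error by averaging over k non-overlapping folds."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))
    folds = np.array_split(indices, k)   # k non-overlapping subsets
    errors = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = train(X[train_idx], y[train_idx])
        errors.append(test_error(model, X[test_idx], y[test_idx]))
    return np.mean(errors)
```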

Regularized Models

So far, the only method we have discussed for modifying a learning algorithm is to increase or decrease the model's representational capacity, by adding or removing functions from the hypothesis space $\mathcal{H}$ of solutions the learning algorithm is able to choose from.

The behavior of our algorithm is strongly affected not just by how large we take $\mathcal{H}$, but also by the specific functions we pick from it. We can thus control the performance of our algorithm by preferring certain solutions in the hypothesis space over others. This means that all the functions remain eligible, but one may be preferred over another, and the unpreferred solution will be chosen only if it fits the training data significantly better than the preferred one.

More specifically, we can construct a regularized model that learns a function $h(x, \theta)$ by adding to the cost function a penalty term, for example on the norm of $\theta$, which has the effect of narrowing the set within which the parameters are chosen. This is essentially equivalent to imposing regularity conditions on the class of functions realized by the model [54]. For example, in the case of Tikhonov regularization, we modify the training criterion to include weight decay: the empirical risk we minimize is the sum of the error on the training set and a regularizer that expresses a preference for parameters with a smaller squared $\ell_2$-norm. Then, given a training set $S = \{(x_1, y_1), \ldots, (x_M, y_M)\}$, the function we minimize is of the type

$$ J_M(\theta) = \frac{1}{M} \sum_{i=1}^{M} L(h(x_i, \theta), y_i) + \lambda\, \theta^T \theta, \qquad (1.6) $$

where $\lambda$ is a value that controls the strength of our preference for smaller parameters. When $\lambda = 0$ we impose no preference, while a larger $\lambda$ forces the parameters to become smaller. Minimizing $J_M(\theta)$ then results in a choice of parameters that trades off fitting the training data against keeping the parameters small. Even though the model is capable of representing functions with a much more complicated shape, weight decay encourages it to use a simpler function described by smaller coefficients.
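For instance, under the assumption of a linear model and a squared loss (our own sketch), criterion (1.6) admits a closed-form minimizer, and the norm of the learned parameters shrinks as $\lambda$ grows.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize (1/M)||X theta - y||^2 + lam * theta^T theta in closed form."""
    M, n = X.shape
    return np.linalg.solve(X.T @ X / M + lam * np.eye(n), X.T @ y / M)

rng = np.random.default_rng(3)
X = rng.standard_normal((30, 5))
y = X @ np.array([1.0, 0.5, 0.0, 0.0, -1.0]) + 0.1 * rng.standard_normal(30)

for lam in (0.0, 0.1, 1.0):
    theta = ridge_fit(X, y, lam)
    print(f"lambda = {lam}: ||theta|| = {np.linalg.norm(theta):.3f}")  # norm shrinks as lambda grows
```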

Adding a penalty term to the cost function is a classic regularization strategy, used in different contexts. This strategy is introduced from a different and more formal point of view in Chapter 3; in particular, the Tikhonov regularizer and other regularization terms will be reviewed in Section 3.3.3, when approaching the problem of image reconstruction.

In machine learning, other ways of expressing preferences for different solutions exist, both implicit and explicit. Together, these approaches are known as regularization. In this context, we can say that regularization is any modification we make to a learning algorithm in order to reduce its generalization error, i.e., to mitigate overfitting.

1.2.4 Loss Functions

An important aspect of the design of a machine learning model is the choice of the loss function. Indeed, it has great influence on the solution $h(\cdot, \theta^*)$ found by the learning algorithm, and it also affects the convergence rate of the minimization process.

Given an input-output pair $(x, y)$, the loss function $L : Y \times Y \to \mathbb{R}$ represents the price $L(h(x, \theta), y)$ we are willing to pay by predicting $h(x, \theta)$ instead of $y$.

There are various factors involved in choosing a loss function, such as the task we approach and the model we use. We may also want the loss to have specific properties: for example, it may need a gradient large and predictable enough to be usable by gradient-based learning algorithms in the training process.

We can see in Figure 1.4 some of the most common loss functions, which can be classified into two major categories, depending on the type of learning task we are dealing with: regression losses and classification losses.

Figure 1.4: Different types of loss functions in one variable $t$, with: left) $t = h(x, \theta) - y$ for regression; right) $t = y\,h(x, \theta)$ for binary classification.

In particular, some of the most commonly used for regression are:

• the mean squared loss function (also known as the $\ell_2$-norm loss):

$$ L(h(x, \theta), y) = (y - h(x, \theta))^2; \qquad (1.7) $$

• the absolute value loss (also known as the $\ell_1$-norm loss):

$$ L(h(x, \theta), y) = |y - h(x, \theta)|; \qquad (1.8) $$

• the $\varepsilon$-insensitive loss proposed by Vapnik, for $\varepsilon > 0$:

$$ L(h(x, \theta), y) = \begin{cases} 0 & \text{if } |y - h(x, \theta)| \leq \varepsilon \\ |y - h(x, \theta)| - \varepsilon & \text{otherwise.} \end{cases} \qquad (1.9) $$

For classification tasks, given the binary nature of classification, a natural loss function would be the 0-1 indicator function, which takes the value 0 if the predicted output $h(x, \theta)$ is the same as the true output $y$, and 1 if the predicted output is different. For binary classification with $Y = \{-1, 1\}$, the 0-1 indicator function can be written as

$$ L(h(x, \theta), y) = H(-y\,h(x, \theta)), \qquad (1.10) $$

where $H$ is the Heaviside step function, defined as

$$ H(t) = \begin{cases} 0 & \text{if } t < 0 \\ 1 & \text{if } t \geq 0. \end{cases} \qquad (1.11) $$

However, this loss function is non-convex and non-smooth, and finding the optimal solution is an NP-hard combinatorial optimization problem. As a result, it is better to substitute continuous, convex loss functions which are tractable for commonly used learning algorithms. For example, with output space $Y = \{-1, 1\}$, a more tractable loss is the hinge loss,

$$ L(h(x, \theta), y) = \max(0,\ 1 - y\,h(x, \theta)) =: |1 - y\,h(x, \theta)|_+, \qquad (1.12) $$

where $h(x, \theta)$ is the classifier score. This loss is used for maximum-margin classification, most notably for Support Vector Machines. In fact, the hinge loss penalizes predictions such that $y\,h(x, \theta) < 1$, corresponding to the notion of a margin in an SVM.

Finally, the most common loss for classification problems may be the cross-entropy loss function. This loss measures the performance of multi-class classification models, whose outputs are the probabilities of the different classes. In this case, given an object x, y is usually chosen as the one-hot vector, with entry 1 corresponding to the true class of x. Then, if the probability of class i estimated by the model is p_i = h_i(x, θ), we can use cross-entropy to get a measure of dissimilarity between y_i and p_i,

L(h(x, θ), y) = − ∑_i y_i log p_i (1.13)

In particular, as we can see in Figure 1.5, the cross-entropy loss increases as the predicted probability diverges from the actual label, and heavily penalizes the predictions that are confident but wrong.

Figure 1.5: Cross-entropy loss function for multi-class classification, when the true label is equal to 1.
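To make the formulas above concrete, here is a minimal NumPy sketch of the losses (1.7)-(1.13); the function names and vectorized signatures are our own choices:

import numpy as np

def squared_loss(pred, y):                 # (1.7)
    return (y - pred) ** 2

def absolute_loss(pred, y):                # (1.8)
    return np.abs(y - pred)

def eps_insensitive_loss(pred, y, eps):    # (1.9): zero inside the eps-tube
    return np.maximum(np.abs(y - pred) - eps, 0.0)

def hinge_loss(score, y):                  # (1.12): y in {-1, 1}
    return np.maximum(0.0, 1.0 - y * score)

def cross_entropy_loss(p, y_onehot):       # (1.13): p are class probabilities
    return -np.sum(y_onehot * np.log(p + 1e-12), axis=-1)

# A confident but wrong prediction is heavily penalized (cf. Figure 1.5):
p = np.array([0.95, 0.04, 0.01])           # predicted class probabilities
y = np.array([0.0, 0.0, 1.0])              # one-hot true label
print(cross_entropy_loss(p, y))            # ~4.61 = -log(0.01)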

1.2.5 Hyperparameter Selection

A common trait in machine learning systems is that they are usually parametrized by a set of hyperparameters that we can use to control the behavior of the learning algorithm. These are used to configure various aspects of the learning algorithm, and thus significantly affect the resulting model's performance and the time required to train and test the model [33]. Examples of hyperparameters are the number and the size of the layers, which define the structure of a neural network, but also the threshold ε of an ε-insensitive loss function, or the step size of the iterative learning algorithm.

The values of these hyperparameters are not adapted by the learning algorithm itself, unlike the trainable parameters. This applies, for example, to all hyperparameters that control model capacity: if learned on the training set, such hyperparameters would aggressively increase the model capacity, resulting in overfitting.

To make this clearer, let's consider the polynomial example in Figure 1.1. If we treated the degree of the polynomial fitting a regression model as a trainable parameter, training would simply raise the degree until the model perfectly fits the data, giving a small training error but bad generalization performance. Instead, the polynomial degree is a hyperparameter, chosen outside of the training.

Since the goal is to find the model having the best performance on new data, the simplest approach to compare different models (with different hyperparameters) is to evaluate their performance on new data, different from that used for training, as we do with the test set. The subset of data used to guide the hyperparameter selection is called the validation set. However, the validation set cannot contain examples from the test set. In fact, we discussed how a test set is used to estimate the generalization error after the learning process has been completed; it is important that the test examples are not used to make choices about the model, including hyperparameter tuning.

Therefore, after putting the test set aside, we again split the training data into two disjoint subsets. Typically, one uses about 80% of the training data for training and 20% for validation [56]. The subset used to learn the trainable parameters is typically called the training set, even though this may be confused with the larger pool of data used for the entire training process. The other subset is the validation set, used to estimate the generalization error during or after training, allowing the hyperparameters to be updated accordingly. Finally, when the entire learning process, comprising both empirical risk minimization and hyperparameter optimization, is complete, the trained model with the hyperparameters having the best performance on the validation set is selected [15], and its generalization error is estimated on the test set.

In the polynomial example, the learning process implies that we train polynomials of different degrees on the examples of the training set. The polynomial that achieves the minimum error on the validation set is selected as the final model, and its performance is evaluated on the test set, as sketched in the example below.
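The following sketch illustrates this selection procedure on synthetic data (the data-generating cubic, the 80/20 split, and the degree grid are our own choices for illustration):

import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data: noisy samples of a cubic polynomial
x = rng.uniform(-1.0, 1.0, 100)
y = x ** 3 - x + 0.1 * rng.standard_normal(100)

# 80% / 20% split of the available training data
idx = rng.permutation(x.size)
tr, va = idx[:80], idx[80:]

# Grid search over the hyperparameter: the polynomial degree
best_degree, best_err = None, np.inf
for degree in range(1, 15):
    coeffs = np.polyfit(x[tr], y[tr], degree)   # fit on the training set
    val_err = np.mean((np.polyval(coeffs, x[va]) - y[va]) ** 2)
    if val_err < best_err:                      # keep the degree that
        best_degree, best_err = degree, val_err # generalizes best
print(best_degree, best_err)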

Hyperparameter selection is an open research field. Especially in deep learning, the training process already requires a great amount of computational resources and time, and hyperparameter search is commonly performed manually, via rules of thumb [70, 73], or by grid-search methods.


Chapter 2

Deep Learning Theory

Deep Learning is a particular branch of machine learning which provides a powerful and flexible framework for supervised learning.

Deep Learning models, also referred to as artificial neural networks (ANNs), are engineered systems inspired by the biological brain, consisting of the interconnection of neurons. Neurons can be seen as nodes of an oriented network, provided with processing capacity. They receive weighted signals from the environment or from other neurons as inputs, which are then processed. The output of the neuron is sent to other neurons, or to the network output, through other weighted connections.

A network is specified by i) the type of operations the neurons perform, ii) the values of the weights in the connections, which are determined by learning algorithms, and iii) its structure: the number of nodes, their arrangement in multiple layers, and the type of connections. By adding more layers and more neurons within a layer, a deep network can represent functions of increasing complexity. Most tasks that consist of mapping an input vector to an output vector can be performed by Deep Learning, given sufficiently large models and sufficiently large datasets of labelled training examples. In particular, the popularity of Deep Learning greatly increased in recent years, thanks to the availability of more powerful computers and larger datasets that allowed experimentation and the development of new techniques to train deeper networks.

The purpose of this chapter is to introduce the mathematical theories underlying Deep Learning and the algorithms used to train neural network structures, as well as to provide some historical perspective on the development of neural networks. For this purpose, we start by introducing the network's units: the formal neurons. We see how to combine them to form complex and different structures, like the convolutional neural networks that will be applied in the experimental part of this thesis. Then we examine various aspects of the optimization process involved in the training of these deep models, in order to tune their internal weights.


Figure 2.1: Formal neuron structure, based on the model proposed by [99], with g(t) ≡ sign(t). Picture taken from [60].

2.1 Neural Networks History

Even though Deep Learning only recently acquired global recognition, its history dates back to the early 1940s, with the first mathematical model of an artificial neuron proposed by McCulloch and Pitts in 1943 [99] and revisited by Rosenblatt in 1958 [116].

2.1.1 The Formal Neuron

In the simplest version proposed by McCulloch and Pitts, the artificial neuron, or formal neuron, is based on the principle of functioning of the biological neuron. The inputs to the formal neuron are multiplied by weights, representative of the synaptic connections, and the weighted sum of the inputs is compared to a threshold value. By applying an activation function, the neuron gives output 1 if the sum of the inputs is greater than the threshold, and -1 (or 0) otherwise. In mathematical terms, given an input vector x ∈ R^n, denoting with ω ∈ R^n the weights vector, with y ∈ {−1, 1} the output of the neuron, and with ξ ∈ R the threshold value, the output of the neuron w.r.t. its input can be written as

y(x) = g(∑_{i=1}^n ω_i x_i − ξ) ≡ g(ω^T x − ξ),

where g is the neuron activation function, in this case taken as the sign function:

g(t) ≡ sign(t) = 1 if t ≥ 0, and −1 if t < 0.

This model of the formal neuron is reported in Figure 2.1.
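As a minimal sketch (our own toy implementation; the unit weights and the threshold value are chosen only for illustration), the formal neuron can be coded directly from the formula above:

import numpy as np

def formal_neuron(x, w, xi):
    # y(x) = sign(w^T x - xi), with sign(t) = 1 for t >= 0 and -1 otherwise
    return 1 if np.dot(w, x) - xi >= 0 else -1

# With unit weights and threshold 1.5, the neuron fires only when
# both inputs are active (an AND-like behavior).
print(formal_neuron(np.array([1.0, 1.0]), np.array([1.0, 1.0]), 1.5))  # 1
print(formal_neuron(np.array([1.0, 0.0]), np.array([1.0, 1.0]), 1.5))  # -1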


2.1.2 Rosenblatt’s Single Layer Perceptron

In 1958, Rosenblatt reinterpreted the formal neuron as a binary classifier, able to sort the elements of a given dataset into two groups.

He sees the input x ∈ R^n as a generic vector, whose components are the features of the object to be classified. Assigning proper values to the weights vector ω ∈ R^n, the neuron is able to perform a binary classification, returning y(x) = 1 or y(x) = −1, based on the sign of the linear function a = ω^T x − ξ.

Moreover, Rosenblatt’s major achievement is the intuition that this model can actually learn from data. In fact, he devised a fairly simple algorithm that allows to learn the correct synaptic weights ω from the dataset itself. The weights and the threshold value are determined by a learning process starting from a training set of M input-output pairs,

S = {(x_m, y_m) | x_m ∈ R^n, y_m ∈ {−1, 1}, m = 1, ..., M},

where y_m is the correct classification associated with the input x_m. This learning algorithm is called the Perceptron, and it is reported in Algorithm 1. Once the formal neuron is trained on the samples of S, we can use its generalization skills to classify new inputs not belonging to S.

Now, let’s see how the Perceptron algorithm works. We want to train the neuron to correctly classify the samples of the training set, and this is done if the weights ω and the threshold ξ are determined in such a way that

(

ωTxm− ξ ≥ 0 if ym= 1

ωTxm− ξ < 0 if ym= −1

m = 1, ..., M (2.1) From a geometric point of view, system (2.1) search for the hyperplane H = {x ∈ Rn| ωTx =

ξ}, which separates the two sets

A = {xm | (xm, ym) ∈ S, ym = 1} B = {xm | (xm, ym) ∈ S, ym = −1}.

This way, the existence of ω and ξ, which solve (2.1) can be assured if and only if the sets A and B are linearly separable. Also, (2.1) admits solution if and only if the system below admits solution, ( ωTxm− ξ > 0 xm ∈ A ωTx m− ξ < 0 xm ∈ B m = 1, ..., M (2.2) We can add dummy components xm0 = −1 and ω0 = ξ, so that the input vectors and the

weights vector become vectors with n + 1 components, xm = (−1, xm1, . . . , xmn)

T, and ω =

(ξ, ω1, . . . , ωn)T. In this way, the threshold is considered as an additional bias term with weight

ξ, and finding the correct synaptic weights and threshold corresponds to solving the system (

ωTx

m> 0 xm ∈ A

ωTxm< 0 xm ∈ B

(36)

w.r.t ω ∈ Rn+1. Here we can presume, without loss of generality, that ||xm|| = 1, m = 1, . . . , M.

To solve system (2.3), Rosenblatt proposes an iterative algorithm that processes a single sample of the training set at each iteration. As we can see in Algorithm 1, at each cycle ω is updated in correspondence of the incorrectly classified examples, by adding a correction term to its current value: if the example x_m ∈ B and the Perceptron wrongly predicts sign(ω(k)^T x_m) = 1, the weight update is given by ω(k + 1) = ω(k) − x_m, while the threshold is increased by 1. If x_m belongs to class A and the Perceptron wrongly predicts sign(ω(k)^T x_m) = −1, the weight correction is ω(k + 1) = ω(k) + x_m, while the threshold is decreased by 1.

Algorithm 1 Perceptron training algorithm, from [60].

Given the inputs x_m ∈ R^n, with ‖x_m‖ = 1, and the targets y_m ∈ {−1, 1}, m = 1, . . . , M.
Set ω(0) = 0, k = 0, classified = 0.
while classified < M do
    for m = 1, . . . , M do
        if sign(ω(k)^T x_m) = y_m then
            classified = classified + 1
        else
            ω(k + 1) = ω(k) + y_m x_m
            k = k + 1
        end if
    end for
    if classified < M then
        classified = 0
    end if
end while

This means that along the iterations the update rule can turn previously well-classified samples into misclassified ones; the algorithm needs to keep cycling through the whole training set S until all the samples are correctly classified. However, it can be shown (see [60]) that, if the sets A and B are linearly separable, the algorithm determines in a finite number of iterations a weight vector ω̄ such that all the training set samples are correctly classified, i.e.,

y_m = sign(ω̄^T x_m), m = 1, . . . , M. (2.4)
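A direct NumPy translation of Algorithm 1 follows (a sketch of ours: the max_epochs safeguard and the toy dataset are our own additions, useful when separability is not guaranteed):

import numpy as np

def perceptron_train(X, y, max_epochs=1000):
    # Rows of X are the inputs x_m, already extended with the dummy
    # component x_m0 = -1 and normalized; y holds the targets in {-1, 1}.
    w = np.zeros(X.shape[1])                  # omega(0) = 0
    for _ in range(max_epochs):
        misclassified = 0
        for xm, ym in zip(X, y):
            if np.sign(w @ xm) != ym:         # wrongly classified sample
                w = w + ym * xm               # omega(k+1) = omega(k) + y_m x_m
                misclassified += 1
        if misclassified == 0:                # (2.4) holds for every sample
            return w
    return w                                  # A and B may not be separable

# Toy linearly separable dataset, dummy component -1 prepended
X = np.array([[-1.0,  2.0,  1.0], [-1.0,  1.5,  2.0],
              [-1.0, -1.0, -2.0], [-1.0, -2.0, -0.5]])
X = X / np.linalg.norm(X, axis=1, keepdims=True)   # ||x_m|| = 1
y = np.array([1, 1, -1, -1])
w = perceptron_train(X, y)
print(np.sign(X @ w) == y)                         # all True at convergence

On non-separable data, such as the XOR problem discussed next, the inner loop never empties the set of misclassified samples, which is why the safeguard on the number of epochs is useful.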

The Perceptron suffers from major limitations, because for many classification problems the sample sets A and B are not linearly separable. A simple example is given by the XOR function, an operation on two binary inputs that returns 1 when exactly one of the input values is equal to 1, and returns 0 otherwise. Figure 2.2 shows how a linear model is not able to represent the XOR function: there is no straight line capable of separating the different outputs.

Figure 2.2: XOR function (inputs x1, x2).

It was the inability of the basic Perceptron to solve such simple problems that led, in part, to a reduction in interest in neural network research during the 1970s. To overcome the limitations of the Perceptron, it is necessary to use more complex structures, which learn a different feature space in which a linear model is able to represent the solution.

This is achieved, for example, by Multilayer Perceptrons, which can solve arbitrary classification problems.

2.1.3 Multilayer Feed-Forward Neural Networks

The limits of the Perceptron motivated the study of more complex architectures, consisting of several layers of formal neurons connected in a chain. These architectures are called Multilayer Perceptrons (MLPs), or feedforward neural networks, in which it is assumed that there are no feedback connections, nor connections between the neurons of the same layer.

The first layer is a set of n input nodes, without processing capacity, that are associated with an input vector x ∈ R^n. The other formal neurons are organized in K ≥ 2 different layers, arranged in a chain structure, with each layer being a function of the layer that precedes it. In particular, each of the first K − 1 internal hidden layers consists of neurons (the hidden units) that act in parallel, each representing a vector-to-scalar function. Every neuron receives input from the neurons of the previous layer, computes its own activation value, and then its output contributes to the inputs of the neurons of the successive layer. In this sense, MLPs are said to be fully connected: all the neurons in a layer are connected to each neuron in the successive layer. Finally, the last layer is the output layer, which returns the outputs x^(K) of the network. The length K of the chain gives the depth of the model. Since each layer of the network may have a different number of neurons, we denote with d_k the dimension of the kth layer; the maximum layer size determines the width of the model.


Figure 2.3: Feedforward neural network structure with depth K = 2.

Depth and width are hyperparameters that are usually determined by performing different simulations and evaluating the resulting performances on the validation set.

The Multilayer Perceptron structure is shown in Figure 2.3. As we can see, for k = 1, ..., K, a weight ω_ji^(k) is associated with each oriented arc between the neurons x_i^(k−1) and x_j^(k). This weight represents the entity of the synaptic connection between the two neurons. Furthermore, like in Rosenblatt's Perceptron, biases are added as an additional input x_0^(k) = 1 with weight b_j^(k). Suppose that every formal neuron of the kth layer applies an activation function g^(k) : R → R to the weighted sum of the layer's inputs; this is usually a nonlinear activation function like, for example, a ReLU or a sigmoid function (see Section 2.1.3). Given an input vector x^(0) ∈ R^{d_0}, the output of each neuron in the network can be computed as

x_j^(k) = g^(k)( ∑_{i=1}^{d_{k−1}} ω_ji^(k) x_i^(k−1) + b_j^(k) ) for k = 1, ..., K, j = 1, ..., d_k, (2.5)

where d_k is the number of neurons in the kth layer. Reformulated in matrix form, it becomes

x^(k) = g^(k)(W^(k) x^(k−1) + b^(k)) for k = 1, ..., K, (2.6)

where W^(k) ∈ R^{d_k×d_{k−1}} is the weight matrix of the kth layer and b^(k) ∈ R^{d_k} is the bias vector.

Thus, if x^(0) ∈ R^{d_0} is the input to the network, equation (2.6) gives the expression to calculate, layer by layer, the output x^(K) of the network.
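A minimal NumPy sketch of the forward pass (2.6) follows (the function names, the activation choices, and the layer sizes, taken to mirror Figure 2.3, are our own assumptions):

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def relu(t):
    return np.maximum(t, 0.0)

def mlp_forward(x0, weights, biases, activations):
    # Applies x^(k) = g^(k)(W^(k) x^(k-1) + b^(k)) for k = 1, ..., K,
    # where weights[k-1] is W^(k) in R^{d_k x d_{k-1}} and biases[k-1] is b^(k).
    x = x0
    for W, b, g in zip(weights, biases, activations):
        x = g(W @ x + b)
    return x                                   # the network output x^(K)

# Depth K = 2, with layer sizes d0 = 3, d1 = 4, d2 = 2 as in Figure 2.3
rng = np.random.default_rng(0)
weights = [rng.standard_normal((4, 3)), rng.standard_normal((2, 4))]
biases = [rng.standard_normal(4), rng.standard_normal(2)]
print(mlp_forward(np.ones(3), weights, biases, [relu, sigmoid]))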
