Deep learning quantification

Academic year: 2021

Facoltà di Ingegneria

Scuola di Ingegneria Industriale e dell'Informazione
Dipartimento di Elettronica, Informazione e Bioingegneria

Master of Science in Computer Science and Engineering

Deep Learning Quantification

Supervisor: Prof. Stefano Ceri
Assistant Supervisors: Prof. Marco Brambilla, Prof. Pavlos Protopapas

Master Graduation Thesis by: Andrea Azzini (Student Id n. 837669), Giovanni Battista Conserva (Student Id n. 837974)


To our supervisors Stefano Ceri, Marco Brambilla and Pavlos Protopapas, thank you for your passion, dedication and professionalism. You led our learning experience for more than one year with your courses, the DataShack program and this thesis, adding an enormous contribution to the quality of our studies and our personal growth.

To Marta Confalonieri, for her terrific contribution to this work. To all the other professors and staff of Politecnico di Milano, thank you for this enriching and unforgettable experience. To the competent and warm Harvard IACS faculty, thank you for making us feel at home.

To our beloved families, for your love, generosity and all the sacrifices you made to support us.

To our friends, for always being there for us, in good and bad times.

To our amazing colleagues: each one of you added something to our path with your uniqueness.


CONTENTS

Abstract
Sommario

1 Introduction
  1.1 Context
  1.2 Problem Statement
  1.3 Proposed Solution
    1.3.1 Synthetic Data and Generalization
    1.3.2 Contextualization
  1.4 Structure of the Thesis

2 Background
  2.1 History of Neural Networks
    2.1.1 The Early Age of Neural Networks
    2.1.2 The Breakthrough of Deep Learning
  2.2 Architectural Basics of Neural Networks
    2.2.1 Training
    2.2.2 Backpropagation
  2.3 Architectural Issues of Neural Networks
    2.3.1 The Vanishing Gradient Problem
    2.3.2 Regularization and Dropout
    2.3.3 Batch Normalization
  2.4 Convolutional Neural Networks
    2.4.1 Image Classification
    2.4.2 Convolutional Layers
    2.4.3 Pooling Layers
  2.5 Recurrent Neural Networks
  2.6 Autoencoders

3 Materials and Methods
  3.1 Description
    3.1.1 Terminology
    3.1.2 Objectives
    3.1.3 Proposed Pipeline
  3.2 The Dataset
    3.2.1 Size, Shape and Classes
    3.2.2 Rendering Images
  3.3 Segmentation
    3.3.1 Classification
  3.4 Volume Reconstruction
    3.4.1 3D Recurrent Reconstruction Neural Network
    3.4.2 Optimization of Volume Reconstruction
    3.4.3 Absolute Volume Measurement

4 Experiments
  4.1 Segmentation
    4.1.1 Experiment 1: Single objects, monochromatic backgrounds
    4.1.2 Experiment 2: Single objects, random backgrounds
    4.1.3 Experiment 3: Multiple objects, monochromatic backgrounds
    4.1.4 Experiment 4: Multiple objects, random backgrounds
    4.1.5 Experiment 5: Single and multiple objects, monochromatic backgrounds
    4.1.6 Experiment 6: Single and multiple objects with complex variations
    4.1.7 Experiment 7: Mixed objects with complex variations
  4.2 The Bridge between Segmentation and 3D Reconstruction
  4.3 Volume Reconstruction
    4.3.1 Experiment 1: Novel vs Benchmark
    4.3.2 Experiment 2: Pre-Segmentation vs Non Segmentation
    4.3.3 Experiment 3: Proportions
    4.3.4 Experiment 4: Absolute Volume and Quantification

5 Conclusion

Bibliography

A Code Listings

L I S T O F F I G U R E S

Figure 1.1 Synthetic (i.e. computer generated) images.

Figure 2.1 The unnormalized trend of the "deep learning" query on Google. The curve follows an exponential growth, with a turning point traceable in 2012.

Figure 2.2 In feedforward networks, neurons are organized into layers through connections that do not form cycles.

Figure 2.3 Mapping of the input to the output through the sum of the weighted inputs, overall biases and activation functions.

Figure 2.4

Figure 2.5 Dropout operates by randomly removing each neuron from the original network with a certain pre-defined probability. The new network will then be trained in the same way, while the network evaluation will adjust the learned parameters, depending on the probability that a neuron is dropped out of the network.

Figure 2.6 A diagram of softmax regression. The input is mapped directly to the output through connections (weights) and biases. The softmax activation transforms the weighted sum into a probability distribution.

Figure 2.7 Neuron arrangement resembles pixel organization within an image. Receptive fields are responsible for the localized connections of the input to a hidden neuron.

Figure 2.8 Max-pooling layers pick the maximum activation and preserve it. They discard all the other activations, making the detected features less localized for the sake of model simplicity.

Figure 2.9 Convolutional and pooling layers are put together to form the core of convolutional neural networks. The orange and purple colors correspond to two input filters, which convolve into two feature maps, which finally get condensed by the pooling function.


Figure 2.10 Autoencoders are encoder-decoder architectures that first map the input data to a code, a reduced representation of the input, and then reconstruct the input from the code through a symmetric decoding.

Figure 3.1 The full pipeline. The input, in the form of one or multiple views, is first segmented, then divided into 2 pictures, with and without the Stabilo used as a reference. From these inputs 2 volume reconstructions are created, and by using the ratio computed in the segmentation phase the absolute volume is obtained by comparison.

Figure 3.2 This image grid shows the different categories of the dataset. You can easily see the difference between monochromatic or random backgrounds, and single and multiple objects.

Figure 3.3 Different decoder colors are associated with different tasks. The intermediate cost functions, belonging to the specific tasks, are represented in orange, while the final cost function is represented in red.

Figure 4.1 Different classes of objects in the training set. The collage includes synthetic images and label images.

Figure 4.2 Training accuracy over time during the first experiment. After 10,000 iterations, it reaches a value of 99%. We also report the loss function plot.

Figure 4.3 Test results over real images. The contour and shape of the objects are caught correctly, but the colors do not correspond to the training labels.

Figure 4.4 Different classes of objects in the training set, comprehending synthetic images and label images.

Figure 4.5 Training accuracy over time during the first experiment. We also report the loss function plot.

Figure 4.6 Test results over real images. The contour and shape of the objects are detected correctly, except for some issues with the egg class. Again, the colors do not correspond to the known classes, as seen in fig. 4.3.

Figure 4.7 Different classes of objects in the training set, comprehending synthetic images and label images.

Figure 4.8 Training accuracy over time during the first experiment. Both the accuracy and loss plots look saturated, and the variance is definitely higher than in the previous experiments.

Figure 4.9 Test results over real images. The contour and shape of the objects are more difficult to detect correctly. The colors do not correspond to the training labels at all, as seen in all previous experiments.

Figure 4.10 Different classes of images in the training set, comprehending synthetic images and label images.

Figure 4.11 Training accuracy over time during the first experiment. Both the accuracy and loss plots look absolutely saturated, and the variance is definitely higher than in the single-object experiments.

Figure 4.12 Test results over real images. The contour and shape of the objects are more difficult to detect correctly. The colors do not correspond to the training labels at all, as seen in all previous experiments.

Figure 4.13 Training accuracy after 10,000 iterations.

Figure 4.14 Test results over real images. The contour and shape of the objects are detected correctly in simple situations only.

Figure 4.15 Different objects in the training set, comprehending synthetic images and label images.

Figure 4.16 Training accuracy with the new training set.

Figure 4.17 Test results over real images. The contour and shape of the objects are detected very accurately.

Figure 4.18 Different objects in the training set, comprehending synthetic images and label images.

Figure 4.19 Training accuracy with the final training set.


Figure 4.20 Test results over real images. The contour and shape of the objects are detected very accurately. The class of the objects is overall understood.

Figure 4.21 The bridge between the segmentation and the 3D reconstructor modules. The first row represents the input of the pipeline. The row in the middle represents the output of the segmentation layer. The third row represents the input after being filtered by the segmentation output image.

Figure 4.22 Some views of a pear from the Como test set.

Figure 4.23 Some views of a pear from the Como test set cleaned from the background.

Figure 4.24 Prediction on the cleaned images after training with our training set.

Figure 4.25 Prediction on the cleaned images after training with ShapeNet.

Figure 4.26 From left to right: Como data set picture, its segmentation with SegNet, the predicted SegNet mask.

Figure 4.27 Prediction on the non-segmented vs segmented input with 1 view.

Figure 4.28 From left to right: Como data set picture, its segmentation with SegNet, the predicted SegNet mask.

Figure 4.29 Prediction on the non-segmented vs segmented input with 2 views.

Figure 4.30 Two pictures from the training set for proportions learning.

Figure 4.31 Objects in the proportion test set.

Figure 4.32 Smaller pear in the test set.

Figure 4.33 Prediction of the pear with and without the Stabilo. It may be noticed that the pear without the Stabilo, even though it is the same, is taller. This happens because the predictor always tries to fill the 32*32*32 space. This is why the prediction needs to be scaled according to the ratio between the Stabilo and the pear.

L I S T O F TA B L E S

Table 3.1 The table represents the classes we have chosen in order to carry out our experiments. The class selection has been made considering simplicity of recognition, but still non-trivial segmentation and 3D model reconstruction.

Table 3.2 Summary of the different categories of experiments we have carried out.

Table 4.1 ShapeNet: network pre-trained on ShapeNet, 50,000 iterations, Adam. ShapeNet + Novel: network pre-trained on ShapeNet, 50,000 iterations, then trained on our classes, 5,000 iterations, Adam. Novel: network trained on our classes, 5,000 iterations, Adam.

Table 4.2 Predicted vs real height of the pears above.

Table 4.3 Predicted vs real width of the pears above.

Table 4.4 Final quantification.

L I S T I N G S

Listing A.1 The models that have been implemented: SegNetAutoencoder and MiniAutoencoder.

Listing A.2 The classifier helper makes it possible to manipulate tensors for label selection.

Listing A.3 Train the autoencoder responsible for segmentation.


S O M M A R I O

The recent and massive use of neural networks has brought enormous advances in the field of computer vision.

In this thesis we address the problem of retrieving quantitative information, such as the volume and weight of objects, starting from images of them. We focus on the domain of culinary ingredients, aiming to develop an application that receives photos of food as input and provides indications such as their weight, kilocalories, and so on.

The proposed solution is a pipeline of convolutional neural networks for retrieving the volume, from which all the other information can be derived. The first convolutional network, based on the encoder-decoder architecture, is used to separate the ingredients from the background and classify them, thus obtaining filtered images. These intermediate images are used by a three-dimensional shape reconstruction network. This second network is able to reconstruct shapes even from single images of ingredients, thanks to the information about them learned during training. We propose a method for recovering the absolute value of the volume, starting from the relative one and from an object of known real-world size, used as a reference for comparison.

We also address the problem of building an appropriate training set, through the use of 3D models for the synthesis of 2D views. This approach scales easily to numerous classes of objects and contexts. We study the predictive power on real images of networks trained with synthetic images. We present experiments on the importance of background, textures, camera angle and lighting, obtaining important indications on how best to build future datasets. We implement a pipeline able to solve the quantification problem described above, and we propose a novel network architecture based on the multitask principle.

This thesis was carried out under the joint supervision of Politecnico di Milano and Harvard University. We spent three months in Cambridge, where we had the opportunity to present two workshops on deep learning at Harvard ComputeFest 2017.


A B S T R A C T

Practical applications and theoretical results in computer vision have dramatically improved since the massive utilization of neural networks.

In this thesis we face the problem of retrieving quantitative information about objects, i.e. volume and weight, from pictures. We focus on the food ingredients domain. The goal is an application that takes as input one or multiple pictures of ingredients, and is able to provide a variety of suggestions about them, like weight, calories, and so on.

To solve this problem we build a fully-convolutional pipeline of neural networks that goes from the pictures of ingredients to the reconstruction of their volume. We choose volume as the main measure to obtain, among other quantitative information. We use a convolutional encoder-decoder architecture to solve the subproblem of classifying the objects in a picture and separating them from the background. We feed the images preprocessed this way to a convolutional-recurrent network, to solve the problem of 3D shape reconstruction. This network can predict the 3D shape of an ingredient even from a single picture as input, thanks to its knowledge of ingredients learned during training. We then devise a method to obtain the absolute value of the volume, starting from the relative one and the reconstructed volume of an object of standard size in the real world, used for comparison.

To solve the problem of having enough labeled data in a scalable way, we study the utilization of synthetic 3D models, by creating a novel synthetic dataset. We then provide some experiments to study the power of networks trained with synthetically generated images to predict real ones. We study the influence of backgrounds, light conditions, textures and camera orientations on the predictive performance. In this way, we gain insights on how to properly build a synthetic dataset.

We provide a functioning pipeline that solves the aforementioned quantification task. Finally, we also propose a novel multitasking architecture.


This thesis was supervised by both Politecnico di Milano and Harvard University professors. We spent three months in Cambridge, where we also presented two workshops on deep learning at Harvard ComputeFest 2017.


Deep learning is one of the fastest-growing research fields of recent years. Its ever-increasing popularity is due to a series of successes in extremely diversified research areas, such as computer vision, natural language processing, robotics, medicine, and many others.

The success of applying deep learning techniques in numerous and disparate contexts has increasingly pushed research in an applicative direction. Indeed, although the theoretical concepts underlying these techniques have lately remained almost unchanged, the majority of today's research lies in the push to surpass the state of the art in specific contexts.

Our thesis falls into this class of research, aiming at the solution of a problem that we call deep learning quantification. Using the same architectural concepts underlying deep learning theory, our objective is to answer questions about quantitative information of objects represented in a scene. By scene we mean the representation of a group of objects in the same image, and by quantitative information we mean not only the correct identification of the number of objects, but also of their geometric measures, such as lengths, areas and volumes, and of possible composite measures. In the specific case of this thesis, the choice of domain fell on the culinary field, a sector in which the quantification problem is interesting for determining, for example, a series of nutritional values directly related to the dimensions of the ingredients themselves. Our objective is therefore not only the definition and implementation of a model that addresses these quantification needs, but also the demonstration of its correct use in the context just described.

In order to achieve our objectives, we made use of the following techniques:

• Pixel-wise semantic segmentation, that is, the categorization of the content of an image by coloring each pixel according to the object it belongs to


• 3D model reconstruction, that is, the reconstruction of a three-dimensional model starting from one or more two-dimensional images of the same object

The final model is an extremely deep neural network, whose individual segmentation and 3D reconstruction modules are modeled as encoding-decoding architectures, trainable both independently and end-to-end.

The true potential of extremely deep architectures can be reached only under two main conditions:

• the availability of a very large training set

• the extensive use of graphics processing units (GPUs) during model training

The training set suitable for solving our problem must consist not only of a multitude of images of predefined classes of objects, but also of the corresponding three-dimensional models of the objects themselves. In order to use a construction method that is scalable with respect to the ingredient classes, and rich in insights from a research perspective, we took on the additional challenge of synthesizing a dataset, creating the aforementioned images and models by means of 3D graphics software. The objects considered, belonging to five different classes, are varied in texture, light conditions, shadows, background, and a series of other characteristics, depending on the experiments performed.

The use of a synthesized dataset leads to a series of subproblems, such as the study of the generalization of our model. Indeed, when training the model with synthesized images, a drop in its accuracy is expected when it is tested on real images, unless certain precautions are adopted. Therefore, a further objective of our thesis is to show that, with a synthesis of the considered objects that is as detailed as possible, the model retains its predictive capabilities.

During the experimental phase of our work, we made use of several GPUs. In the period spent at Harvard, we used a local machine made available by the Institute for Applied Computational Science (IACS). In the period spent in Italy, we used a virtual machine made available by our supervisors.

The final result is a pipeline that, taking as input a series of images of different scenes, correctly segments them, reconstructs their three-dimensional models, and determines the related quantitative information. The architecture thus defined and implemented is based solely on the use of deep neural networks, in the well-known forms of convolutional neural networks, recurrent neural networks and autoencoders. Moreover, this result is obtained by training the aforementioned network with completely synthesized images, while running the tests on real images, thus obtaining a generalizable and scalable result.

This thesis was carried out under the joint supervision of Politecnico di Milano and Harvard University. We spent three months in Cambridge, where we had the opportunity to present two workshops on deep learning at Harvard ComputeFest 2017.


1 I N T R O D U C T I O N

1.1 Context

Deep learning has become a popular research subject, thanks to its achievements and results on many subfields and real-life problems, such as computer vision [28][34][45][50][37][20], natural language processing [48][10][27][21][29], robotics [31][38][32], cancer detection [15][19], autonomous cars [24][8], and more.

Despite its success in specific domains, deep learning is generally advancing at small incremental steps, and its methodologies have been applied to research topics that share a lot of implementation and practical aspects, but are still very domain-specific. For instance, convolutional neural networks have been tested thoroughly in many fields, and have been pushed beyond the state-of-the-art models of the past. New models have been proposed to solve very peculiar problems by changing only a few details, such as the nature of the loss function to be minimized, or the number and type of layers to be included in the network. Techniques such as Dropout [23] and Batch Normalization [25], which will be described in the following chapters, have now become a standard in the design and implementation of deep neural networks, because they have been proven to improve the performance and the results of the models dramatically.

It may seem that deep learning is a subject which has been already studied, and cannot offer more compelling research questions. However, only recently, new methods such as Generative Adversarial Networks [18] have revolutionized the way networks learn by competing against each other. From a microscopic viewpoint, they only use the building blocks of convolutional or recurrent neural networks, but it is the way those building blocks are put together to create an innovative pipeline that made those models incredibly successful. Similarly, with our thesis, we do not aim at creating new models to solve already existing problems and push the results beyond the state of the art. Instead, our objective is to create a pipeline of independent modules, aimed at solving a difficult, innovative problem, which we call Quantification.


1.2 Problem Statement

The human visual system is a masterpiece of nature. A large contribution to our skills and intelligence is due to vision. For us, it is enough to take a quick look at a picture to be able to answer a lot of questions about it. We are able to reproduce an internal representation of the 3D models in the scene, and by using that, plus our prior knowledge of the objects represented, we can estimate detailed measures. For instance, by looking at a picture of a man holding a very small cat, we can naturally estimate a narrow range for the animal's weight.

Computers are still far from being able to solve as efficiently the many tasks that are natural for us. Nonetheless, improvements in computer vision are making machines able to solve increasingly harder questions. Before discussing the main question we want to answer in this thesis, about quantification and the kind of quantitative information we want to retrieve from pictures, we first review related problems:

• Classification [43]: The classification task is about answering the question "what is in the picture?" with a single label, e.g. banana. It does not matter how many objects are in the scene; the answer is the label of the object detected with the highest probability. There is no quantitative information in this kind of prediction.

• Object detection [16]: The object detection task is more advanced, and targets the question "where is each object in the scene?". Thus, it is about assigning a frame to each object in the scene. By counting the detected frames, it is possible to obtain a first piece of quantitative information, the number of objects in the scene, with their corresponding classes.

• Segmentation [16]: Semantic segmentation, or pixel-wise classification, is about assigning a class to each pixel in the scene. It is more precise than object detection in terms of the estimated area, and gives another piece of quantitative information in terms of the number of pixels.

These tasks are not directly related to the 3D shape of objects, which is in fact really important for a broader understanding. Tasks regarding 3D information are:

• Object orientation [36]: A first example of a 3D-related task is object orientation, where the goal is to determine the orientation of objects relative to some coordinate system.


• Depth estimation [35]: Depth estimation is about producing a depth map, representing how close the objects in the scene are to the camera.

• 3D reconstruction [47]: Finally, an interesting task is 3D reconstruction, which is about recreating the 3D shape of objects from one or multiple 2D pictures representing the object taken from different camera angles.
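Among the 2D tasks above, segmentation is the one that already yields per-pixel quantitative information. A minimal sketch of this idea, where the 4x4 label map and the class ids are made-up illustrative values, not data from this work:

```python
import numpy as np

# A hypothetical 4x4 label map, as produced by pixel-wise segmentation.
# 0 = background, 1 = "banana", 2 = "egg" (illustrative class ids).
label_map = np.array([
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [2, 2, 0, 1],
    [2, 2, 0, 0],
])

# Counting the pixels of each class is the crude quantitative signal
# that segmentation adds on top of plain classification.
classes, counts = np.unique(label_map, return_counts=True)
pixel_counts = dict(zip(classes.tolist(), counts.tolist()))
print(pixel_counts)  # {0: 6, 1: 6, 2: 4}
```

Pixel counts alone are only areas in image space; turning them into volumes is what the 3D tasks are needed for.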

The main goal of this thesis is to develop a method to answer the question "how much of each object is in a scene?". By "scene" we mean the representation of a group of objects, via one or multiple pictures of the same group, taken from different angles. By "how much", we refer to something more detailed than just counting the number of objects in the scene. In fact, we want to retrieve as much quantitative information as possible. In the domain we chose to focus on, food ingredients, this includes, for instance, the calories of each ingredient, how many people one can serve with that food, and even suggestions about possible recipes. In particular, we are interested in retrieving weight, since a lot of other information can be obtained from it.
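Once a volume estimate is available, deriving weight and the information that follows from it reduces to simple arithmetic. A hedged sketch, where the density and calorie figures and the `quantify` helper are illustrative assumptions, not values from this thesis:

```python
# Hypothetical nutrition table: density (g/cm3) and kcal per 100 g.
# These figures are rough illustrative values, not data from the thesis.
NUTRITION = {
    "pear":   {"density": 0.60, "kcal_per_100g": 57.0},
    "banana": {"density": 0.94, "kcal_per_100g": 89.0},
}

def quantify(ingredient, volume_cm3):
    """Derive weight and calories once a volume estimate is available."""
    info = NUTRITION[ingredient]
    weight_g = volume_cm3 * info["density"]
    kcal = weight_g / 100.0 * info["kcal_per_100g"]
    return weight_g, kcal

weight, kcal = quantify("pear", 150.0)
print(round(weight), round(kcal))  # 90 51
```

This is why weight is the pivot quantity: every downstream suggestion is a table lookup plus a multiplication away from it.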

Our objective is not only to devise an abstract method to answer this question, but also to solve it practically and provide a proof of concept that works with real ingredients.

1.3 Proposed Solution

Our main objective is to verify whether it is possible to quantify objects using only deep learning techniques. In order to test this, we have performed many experiments to find the most suitable models for achieving our goals. We decided to divide the problem into two main parts: segmentation and 3D model reconstruction. Our pipeline has then been built around these two concepts.

The segmentation module of the pipeline is responsible for defining the shape and contour of the objects. As we will describe in the following sections, the segmentation module operates using a fully-convolutional autoencoder, which is fed with images representing objects belonging to different classes, encodes the object features, and decodes them while trying to label every pixel according to its already known class.
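The layer-by-layer details appear in later chapters; the following is only a toy numpy sketch of the encoder-decoder shape symmetry the module relies on, using max-pooling to compress and nearest-neighbour upsampling to restore resolution. The real SegNet-style autoencoder uses learned convolutional filters and unpooling instead; this sketch is an assumption-light analogy, not the actual architecture.

```python
import numpy as np

def max_pool_2x2(x):
    """Encoder step: 2x2 max-pooling halves each spatial dimension."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def upsample_2x2(x):
    """Decoder step: nearest-neighbour upsampling doubles each dimension."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

# A toy 8x8 "feature map" standing in for one input image channel.
x = np.arange(64, dtype=float).reshape(8, 8)

# Encode: two pooling stages compress 8x8 -> 4x4 -> 2x2 (the "code").
code = max_pool_2x2(max_pool_2x2(x))

# Decode: symmetric upsampling restores the original resolution, which
# is what allows the decoder to emit one class label per input pixel.
y = upsample_2x2(upsample_2x2(code))

print(x.shape, code.shape, y.shape)  # (8, 8) (2, 2) (8, 8)
```

The point of the symmetry is that the output keeps the input's spatial size, so each output position can carry the label of the corresponding input pixel.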

The reconstruction module needs the results of the segmen-tation module to reconstruct the 3D shape of the original ob-ject. This process paves the way for the final quantification.


While the two modules that constitute the pipeline are conceptually very simple, there are however a lot of unresolved issues. Firstly and most importantly, the training set to be used needs to include not only the input images and their labels, but also a 3D model as a label for the reconstructor. Secondly, the problem needs to be contextualized, because quantifying an object may mean very different things. Finally, there is the problem of retrieving not only relative measurements but absolute ones, so we need to devise precise criteria to estimate the reconstructed values. Before diving into the implementation specifics, we identified the best ways to solve these issues.
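The absolute-measurement issue is worth a quick numeric sketch. In the pipeline, a reference object of known size (the Stabilo pen) is reconstructed alongside the ingredient; the deliberately simplified example below assumes both reconstructions share one voxel scale, whereas the actual method additionally rescales by the ratio computed during segmentation. All voxel counts and the reference volume here are made-up numbers.

```python
# Both the ingredient and a reference object of known real-world size
# (the Stabilo pen used in this work) are reconstructed as voxel grids.
# All numbers below are made up for illustration.
pear_voxels = 9000     # hypothetical occupied voxels in the pear grid
stabilo_voxels = 1200  # hypothetical occupied voxels in the reference grid
stabilo_cm3 = 15.0     # assumed known volume of the reference, in cm3

# One voxel of the reference corresponds to this many cubic centimetres;
# applying the same factor to the pear yields its absolute volume.
cm3_per_voxel = stabilo_cm3 / stabilo_voxels
pear_cm3 = pear_voxels * cm3_per_voxel
print(pear_cm3)  # 112.5
```

The comparison object thus turns a relative voxel count into an absolute measure, which is the missing link between reconstruction and quantification.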

1.3.1 Synthetic Data and Generalization

To address the need for a very rich dataset, we concluded that searching for a preexisting dataset on the internet was practically infeasible. In fact, it is difficult, if not impossible, to find a dataset which satisfies all the requirements we have listed above. However, it is certainly possible to synthesize the dataset on our own, provided that it is done considering all the variation and complexity of the data we need to reproduce.

Although generating the data seemed a reasonable approach, synthetic images can be very different from their real counterparts, thus making the results extremely biased and inaccurate for a real-case scenario. The way we decided to tackle this problem is not only to generate data trying to make our synthetic images as close to their real twins as possible, but also to study the generalization power of our pipeline. By generalization, we mean the ability to change the test set from a synthetic to a real one without losing prediction capabilities.

More specifically, during our experiments, we have trained our models using only synthetic images. Then, module by module, we have tested the model against real images that we have collected. Naturally, the real images represent the same kinds of objects we trained the network with, only they are real pictures, not generated by a computer. If proved successful, generalization is a very powerful technique, because it demonstrates that, within certain contexts, it is possible to synthesize data without worrying about losing prediction power over real data, with a certain degree of confidence.


Figure 1.1: Synthetic (i.e. computer generated) images.

1.3.2 Contextualization

We chose food as a good environment for generating the data at the focus of this work. Food presents interesting challenges, like different textures, colors, and shapes, that make it perfect for the kind of research we want to conduct. It is also relatively easy to find real images of food to test generalization, and relatively easy to take pictures of our own. Finally, food quantification is directly related to nutrition and diet. Being able to know, for instance, how much of a particular food there is in a picture is directly related to the nutritional values of the food itself, thus making the problem interesting also from an application perspective. If our models work, it would be relatively easy to create an application that, given images of food, is able to predict its nutritional values instantly. The research question is then not only interesting from an architectural perspective, or because of its generalization capabilities, but also because of its applicative nature. Of course, we have chosen food for the aforementioned reasons, but there could be many more contexts that would make our models worth using.

1.4 Structure of the Thesis

The subject we have worked on is complicated, and needs to be addressed before diving into the implementation aspects of our research. For this reason, we have organized the thesis in a way that makes it understandable without jumping from chapter to chapter. The structure of the thesis is the following.

• Background: in the Background chapter, we talk about the theoretical foundations that constitute the basics of neural networks and Deep Learning, which are necessary to fully understand the thesis. In this chapter, we first point out the main historical events that made Deep Learning so important for modern research. Then, we cover the architectural basics of neural networks, including not only their structure, but also how they are trained successfully. The issues of the training process are in fact fundamental in order to understand why different, more advanced and deeper models have been introduced (e.g. convolutional neural networks). We finally explain how these advanced models work, and why they are important for our research.

• Materials and Methods: in this chapter, we define the terminology and objectives of our thesis, which are fundamental before introducing our dataset and pipeline. About the dataset, we explain how it has been generated, providing every detail that has been considered to make it as accurate as possible. About the pipeline, we explain in detail how it is composed, how the modules are structured and the way they contribute to achieving the research goal.

• Experiments: in the Experiments chapter, we list and show the results of all the experiments we have performed in order to verify the pipeline capabilities. The chapter itself is divided into two main sections, one for each of the main modules that constitute the quantification pipeline, which are segmentation and reconstruction.

• Conclusion: we conclude our thesis by assessing the effectiveness of our methods, considering all the issues we have encountered during the experiments. We also list a series of possible future improvements that could be investigated further.


2

Background

Deep learning is a machine learning branch that has recently gained a lot of attention because of the breakthroughs it accomplished in fields such as computer vision, speech recognition, language translation and many more. Similarly to the majority of machine learning branches, its aim is to build models that can, through a series of subsequent training iterations over the input data, learn to make accurate predictions in some predefined context. The way deep learning differs from other machine learning methods lies in the nature of the models it builds in order to achieve learning: deep neural networks.

This background chapter is a summary of the theoretical information the reader needs in order to understand the innovative aspects of our thesis. During our experience as guest students at the Institute for Applied Computational Science at Harvard University, we had the chance to intensively study the subject and to come up with most of the ideas that constitute the core of the thesis itself. The same notions have also been explained in the context of our two deep learning workshops during the ComputeFest 2017 event at Harvard. [4]

Organized by the Institute for Applied Computational Science, ComputeFest is an annual two-week program of knowledge- and skill-building activities in computational science and engineering. Similarly to the objectives of this chapter, our workshops aimed at providing the students with the theoretical and coding foundations on the subject of deep learning.

We first introduce neural networks historically and architecturally. Later on, we analyze the peculiarities of deep neural networks during training, i.e. particular characteristics and issues of the learning process that are specific to the context of neural networks. Finally, more advanced architectures, which have achieved state-of-the-art results in many different tasks, will be introduced. These architectures are at the core of the technology that has been built for the purposes of this thesis, and are thus fundamental to fully understand the next chapters. Deep learning is an extremely empirical and abstract research field. As a consequence, it is very important to focus on the basic concepts that constitute the building blocks of artificial neural networks.


2.1 History of Neural Networks

Let’s dive into a first historical background of neural networks. It is important to understand why certain bricks have been used to build what we know as modern neural networks. We will proceed in chronological order, because it also emphasizes why certain historical conditions or technologies made the difference for the full development of neural networks as a model and deep learning as a subfield of machine learning.

2.1.1 The Early Age of Neural Networks

The first supervised deep feedforward architecture [26] was far from being a modern neural network, which we describe in the following sections, but it is important to point out, because it shares some peculiarities with feedforward neural networks. However, the characteristic that has contributed the most to the full development of practically trainable neural networks is backpropagation, which was first introduced by Werbos in 1975. [52]

Backpropagation gave us huge hints on how to train neural networks faster. It normally couples with algorithms such as Gradient Descent to achieve its objective of minimizing a loss function, as we will see later on in this chapter. During the 80s, a lot of research efforts were devoted to a field which is strongly related to parallel distributed computing: Connectionism. Connectionism proposed, for the first time, the idea of an interconnected model comprised of simple units. [42]

For a long time, neural networks did not really make any advancements. There are several reasons why that was the case. First, neural networks are computationally expensive models to train. In problems like regression or classification, there were simpler machine learning models that were easier to deal with (e.g. Random Forests or Support Vector Machines). Second, neural networks typically require very big input data, especially during training. There were no huge training sets when the first theoretical discoveries emerged, so it was pretty difficult to determine the power of the neural models. Third, even with modern CPUs, the deeper the network, the slower the learning process. In the last decade, there has been a lot of research and development focusing on GPUs and parallel computational devices.



This effort led to very fast improvements in subfields like Computer Vision, where the neural networks involved (i.e. mainly Convolutional Neural Networks) can leverage this fast parallel processing to drastically speed up training and improve their accuracies.

2.1.2 The Breakthrough of Deep Learning

Deep learning is a subject that has brought artificial neural networks to another level. The idea behind the term deep learning is very simple: the networks that model the problem on which we want to make inference and prediction become very deep in terms of the number of layers that constitute them. The model is then supposed to be able to recognize an incredibly large number of independent features and patterns within a specific learning task. As expected, however, big models require an enormous amount of data. That is why deep learning is often associated with another hyped modern field of Computer Science: Big Data.

The first time the term deep learning was introduced can be traced back to the year 2000. [2] The early years of the new millennium gave birth to a high number of papers related to topics like convolutional and recurrent neural networks, which we will discuss in the following sections. However, if we try to understand when Deep Learning really became a famous subject for both researchers and the industry, we discover that the early 2010s were actually its clamorous years.

Figure 2.1: The normalized trend of the "deep learning" query on Google. The curve follows an exponential growth, with a turning point traceable in 2012.

Fig. 2.1 shows the normalized search trend of the deep learning topic on Google. The resemblance with an exponentially growing function is pretty evident. However, the turning point of the curve is 2012, when several events showed the true potential of deep learning not only in theory, but also in practice.

First, in 2012 Google themselves showed how a neural network could be trained to recognize high-level objects from unlabeled images. [39] Furthermore, deep learning started to be used in the context of competitions such as the ISBI Segmentation of neuronal structures in EM challenge and, more importantly, the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). The network used by Alex Krizhevsky et al. at ILSVRC in 2012 [28] was fundamental to show the incredible potential of deep convolutional neural networks to solve problems related to image recognition, such as classification, object detection and so on. Thanks to the paper by Krizhevsky et al., ILSVRC became one of the most famous competitions for deep learning.

In particular, convolutional neural networks became a very interesting research topic, due to their surprising results in many applications. As a matter of fact, two papers coming from ILSVRC participants in 2014 investigated even further the behavior of deeper and deeper networks: VGG [46] and GoogLeNet [49]. These new studies really helped to give insights on how bigger models should be built to be both trainable and performing.

2.2 Architectural Basics of Neural Networks

Neural networks are computational models that, similarly to their biological counterparts, are made of a large number of units (artificial neurons) that are first arranged into layers, and then linked together through a series of interconnections between these layers. There are several ways in which these connections can be wired.

The simplest example of neural interconnection occurs when no cycle is formed by the connections within the network. Neural networks that belong to this special class, built with this simple and straightforward rule, are called feedforward neural networks. In this kind of network, information flows in one unambiguous direction, from input to output, through hidden layers.

The more layers the neurons are organized in, the deeper the network is said to be. Deep learning, from a merely conceptual perspective, can itself be seen as a machine learning branch in which the models under study are deep neural networks. These models are inspired by the way biological neurons are connected by axons in the human brain. However, what is referred to as a neuron can also be called a unit, especially when the biological metaphor fades (this will be more evident when RNNs and CNNs are introduced). Feedforward neural networks are considered very adherent to the biological metaphor, not only because of the nature of their connections, but also because of the way neurons fire.

The conditions which determine whether an artificial neuron fires depend on another design choice: the so-called activation functions. In principle, any function which is able to define what it means for a neuron to fire, given certain preconditions, can be regarded as an activation function. However, due to how the training process works in the context of neural networks, all these functions need to retain an important property: differentiability. That is the reason why any kind of step function, i.e. functions that abruptly change their value, albeit being piecewise constant, cannot be used as activations, due to their non-differentiability. The reason why this property is crucial for neural activations lies in the way weights and biases are updated during the training process that will be shown in the following sections. This process, called backpropagation, leverages a gradient descent based optimization. The gradients that are calculated at training time make the differentiability of the activation functions an essential requirement.

Even if the concepts presented so far are not very abstract and complicated, the way they translate into mathematics, and consequently into code, might be fuzzier. Each connection between the neurons that compose the network can be represented numerically by a value (a weight) and graphically by an arrow connecting two neurons. Additionally, each neuron is characterized by its own overall bias, a numerical value that affects each connection towards that specific neuron. Weights and biases typically need to be initialized before the model is actually trained. This initialization is performed by sampling the values from some probability distribution (e.g. a truncated normal distribution, or even a uniform distribution). Each initialization affects the training process, as will be clearer in the following sections. Therefore, investing time to find a good initialization for a particular learning context might be a clever approach.


As mentioned above, feedforward neural networks, which are regarded as the simplest subset of artificial neural models, are characterized by their simple monodirectional connections. More specifically, these connections between layers of neurons are defined such that they cannot form any cycle. This property is important, because it implies that the firing of each neuron at each layer of the network depends only on the previous layer's activations. Feedforward networks are in fact inherently time-independent, meaning that their outputs depend only on the nature of the input, weights and biases of the network. Feeding a feedforward neural network with the same input at different timesteps will not change the output of the network itself. However, there are still several ways in which neurons can be connected in order to produce this time-independent output. As it turns out, this neural interconnection is fundamental in determining how well (or poorly) the model will be able to learn.


Figure 2.2: In feedforward networks, neurons are organized into layers through connections that do not form cycles.

The simplest way to connect the neurons from one layer to another is by means of so-called fully connected layers. Neurons in fully connected layers have connections to every single activation from the previous layer. The mapping of the input neurons to a hidden (or output) neuron can be seen mathematically as the activation function of that layer applied to the weighted sum of the input data, plus the overall bias of the hidden neuron:

a_i = \sigma\Big(\sum_{j=1}^{m} w_{i,j} x_j + b_i\Big) \qquad (2.1)
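As a minimal illustration, the fully connected mapping above can be sketched in a few lines of NumPy. The function name, shapes and the choice of tanh as activation are our own illustrative assumptions, not prescriptions from the thesis:

```python
import numpy as np

def dense_forward(x, W, b, activation=np.tanh):
    """Fully connected layer: a = sigma(W x + b).

    x: input vector of shape (m,)
    W: weight matrix of shape (n, m), one row per output neuron
    b: bias vector of shape (n,)
    """
    return activation(W @ x + b)

x = np.array([1.0, -2.0, 0.5])
W = np.zeros((2, 3))        # with zero weights and biases,
b = np.zeros(2)             # every pre-activation is 0
a = dense_forward(x, W, b)  # tanh(0) = 0 for every output neuron
print(a)                    # [0. 0.]
```

Each row of W collects the weights w_{i,j} entering one output neuron, so the matrix-vector product computes all the weighted sums of the layer at once.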



2.2.1 Training

Before studying how to train a neural network, it is important to revise how training and evaluating models are approached overall as problems in machine learning. Training and evaluating models require some knowledge of optimization theory, and they are inherently interesting and challenging. As a matter of fact, only after training a model and evaluating its inference capabilities is it possible to come to certain conclusions on its goodness. In order to analyze a topic like this one, it is necessary to make some assumptions on the nature of the dataset.

Figure 2.3: Mapping of the input to the output through the sum of the weighted inputs, overall biases and activation functions.

The dataset considered in this context is assumed to organize and structure information in couples of the form (x_i, ŷ_i), representing the input data and the labels. The input data is fed to the model, which can be thought of as a black box that produces an output given an input. A label ŷ_i represents what the output y_i of the model should look like, given the input x_i. A couple of the form (x_i, ŷ_i) is called a labeled input.

When a complete dataset of labeled data is available, the problem of making inference can be reduced to a supervised learning task. A well-known, standard way to train and subsequently validate a supervised learning model takes advantage of a first split of the dataset into two main parts: the training and test sets. These sets have different purposes: while the training set is used to train the model, the test set is used to assess the accuracy of the model's predictions. This way, the accuracy test will immediately spot issues like overfitting, i.e. situations in which the trained model learns to recognize the training data only, without generalizing to the entire dataset.
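The split just described can be sketched as follows; the helper name, the 80/20 ratio and the fixed seed are our own illustrative choices:

```python
import random

def train_test_split(pairs, test_fraction=0.2, seed=0):
    """Shuffle labeled couples (x_i, y_i) and split them into train/test sets."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)  # shuffle before splitting
    n_test = int(len(pairs) * test_fraction)
    return pairs[n_test:], pairs[:n_test]

data = [(i, i % 2) for i in range(100)]  # toy labeled dataset
train, test = train_test_split(data)
print(len(train), len(test))             # 80 20
```

Shuffling before splitting matters: if the dataset is ordered (e.g. by class), a contiguous split would give a test set that does not represent the overall distribution.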

Equation (2.1) and fig. 2.3 show how each activation, being both the output of one layer and the input to the next one (x_i = a_{i-1}), increasingly depends on the previous layers' activations.


2.2.2 Backpropagation

There are some similarities between the human brain and artificial neural networks, like the concepts of firing and interconnection between different neurons. However, there are also differences, as well as open questions. A big challenge still unsolved is understanding how the human brain learns. According to the “single algorithm hypothesis” [14], there is only one learning algorithm behind the different structures of the brain. Experiments that suggested this hypothesis showed that neurons used in visual tasks can serve in the auditory system, if trained properly. If this hypothesis is true, even to some extent, finding “the” learning algorithm would mean finding the Holy Grail of learning. Like the human brain, artificial neural networks are mainly successful because of their capability to learn from data. Hinton [22] surveys some different strategies that can be used to design a learning mechanism. Among them, Backpropagation is one of the most famous and successful ones designed so far.

Backpropagation is usually implemented in conjunction with Gradient Descent. Gradient Descent is a heuristic to find the minimum of the cost function with respect to the parameters of a model. As described before, the cost function is a measure of how well a model is performing, thus finding its minimum with respect to the model's parameters means finding the optimal configuration of that model to solve the precise problem described by the cost function itself. A model is ultimately a function, so in principle the problem of finding the global minimum of the cost function can be solved deterministically.

However, in the case of models with millions of parameters, like neural networks, an analytical solution is computationally prohibitive. This is why a heuristic like Gradient Descent is so useful. Let us first see how it works in the case of a linear regression problem. Suppose we have a set of points as below



Figure 2.4: A set of data points to be fitted with a line.

that we want to fit with a line equation such as y = mx + b, where m is the line's slope and b is the line's y-intercept. To find the best line for our data, we need to find the best pair of slope m and y-intercept b values.

To be able to apply gradient descent, we first define a cost function, which in this case can be the mean of the error over all the points:

Error(m, b) = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - (m x_i + b) \right)^2

It is a common practice to take the squared error, so as to have only positive values.

The algorithm starts by randomly initializing the parameters. Then, it computes the derivative of the cost function in that configuration, with respect to all the parameters. The negative of this derivative expresses the direction in which the cost function decreases with the highest slope. Then, the product of the derivative and a value called learning rate is subtracted from the parameters. This change causes a decrease of the cost, i.e. better predictions. These updating steps are repeated until the cost function reaches a local minimum; in fact, there are no decreasing directions close to a local minimum.
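The procedure just described can be sketched for the line-fitting example. This is a minimal NumPy sketch under our own choices of learning rate and number of iterations; the function name and the toy data are illustrative:

```python
import numpy as np

def gradient_descent_line(xs, ys, lr=0.01, epochs=2000):
    """Fit y = m*x + b by gradient descent on the mean squared error."""
    m, b = 0.0, 0.0  # initialization (here simply zero)
    n = len(xs)
    for _ in range(epochs):
        pred = m * xs + b
        # Partial derivatives of Error(m, b) = (1/N) sum (y - (m x + b))^2
        dm = (-2.0 / n) * np.sum(xs * (ys - pred))
        db = (-2.0 / n) * np.sum(ys - pred)
        # Move against the gradient, scaled by the learning rate.
        m -= lr * dm
        b -= lr * db
    return m, b

xs = np.array([0.0, 1.0, 2.0, 3.0])
ys = 2.0 * xs + 1.0                  # noiseless line with m=2, b=1
m, b = gradient_descent_line(xs, ys)
print(round(m, 2), round(b, 2))      # ≈ 2.0 1.0
```

The learning rate trades off speed against stability: too large and the iterates overshoot the minimum, too small and convergence takes many more epochs.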

We saw how gradient descent works in the case of a simple linear model. With neural networks, the idea is the same, i.e. finding the derivative of the cost function with respect to the parameters, which in this case are the biases and the weights:

\frac{\partial C_x}{\partial w}, \qquad \frac{\partial C_x}{\partial b}

However, there is an additional complexity: finding these derivatives is not straightforward, since each neuron is connected to many others. Luckily, there is a way to find these derivatives, through 4 equations, called the equations of Backpropagation.

\delta^L = \nabla_a C \odot \sigma'(z^L) \qquad \text{(BP1)}

\delta^l = \left( (w^{l+1})^T \delta^{l+1} \right) \odot \sigma'(z^l) \qquad \text{(BP2)}

\frac{\partial C}{\partial b^l_j} = \delta^l_j \qquad \text{(BP3)}

\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j \qquad \text{(BP4)}

One main idea behind Backpropagation is to express the needed derivatives as a function of a quantity called "error". The error \delta^l_j of neuron j in layer l is:

\delta^l_j \equiv \frac{\partial C}{\partial z^l_j}

This is useful because there is a straightforward way to compute the error with all the information already available in a forward pass, i.e. as a function of the input and the neural network's weights and biases. After the error is computed, the needed derivatives are already expressed as a function of the error itself. We provide at [4] a visual sequence we created to show how the forward and backward steps of Backpropagation work.

The algorithm works this way. The input data is provided to the network, in the form of a single element or a batch of elements from the training set. Then, using the forwarding equation, i.e. z^l = w^l a^{l-1} + b^l, the activations are computed one layer after another. Eventually, the activation at the final layer is computed as a function of the activation of its previous layer. Here, BP1 is used to compute the error in the last layer as a function of the activation in the last layer. This is the starting point for the backward pass: in fact, the error in the penultimate layer can be computed through BP2 as a function of the error in the last layer. In the same way, the error in each layer can be obtained from the value of the subsequent one. Eventually, the error at every layer is available, as well as the needed derivatives, thanks to BP3 and BP4.

At this point, the network's parameters can be updated, via the derivatives and the learning rate.

w^l \to w^l - \frac{\eta}{m} \sum_x \delta^{x,l} (a^{x,l-1})^T

b^l \to b^l - \frac{\eta}{m} \sum_x \delta^{x,l}

Each step of Backpropagation and updating is called an epoch. The number of epochs needed to train a network depends on many factors, such as the complexity of the problem, the number of inputs used at each epoch (it can vary from 1 to all the elements in the training set), the learning rate and so on.
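The forward and backward passes described above can be sketched in NumPy. This is an illustrative implementation of BP1–BP4, under our own choices of sigmoid activations and a quadratic cost C = 0.5 ||a^L - y||^2 (so that \nabla_a C = a^L - y); all names are our assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(x, y, weights, biases):
    """One forward/backward pass (BP1-BP4) for a fully connected
    sigmoid network with quadratic cost; returns dC/dw, dC/db per layer."""
    # Forward pass: store z^l and a^l for every layer.
    a, zs, activations = x, [], [x]
    for w, b in zip(weights, biases):
        z = w @ a + b
        zs.append(z)
        a = sigmoid(z)
        activations.append(a)
    # BP1: output error, using grad_a C = a^L - y and sigma'(z) = s(1-s).
    delta = (activations[-1] - y) * sigmoid(zs[-1]) * (1 - sigmoid(zs[-1]))
    grads_w = [np.outer(delta, activations[-2])]  # BP4
    grads_b = [delta]                             # BP3
    # BP2: propagate the error backwards, layer by layer.
    for l in range(2, len(weights) + 1):
        sp = sigmoid(zs[-l]) * (1 - sigmoid(zs[-l]))
        delta = (weights[-l + 1].T @ delta) * sp
        grads_w.insert(0, np.outer(delta, activations[-l - 1]))  # BP4
        grads_b.insert(0, delta)                                 # BP3
    return grads_w, grads_b

rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 2)), rng.normal(size=(1, 3))]
biases = [np.zeros(3), np.zeros(1)]
grads_w, grads_b = backprop(np.array([0.5, -0.5]), np.array([1.0]),
                            weights, biases)
print([g.shape for g in grads_w])  # [(3, 2), (1, 3)]
```

Note how the backward loop reuses the z^l and a^l stored during the forward pass: this is why a single forward/backward sweep is enough to obtain every derivative.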

2.3 Architectural Issues of Neural Networks

The framework that has been introduced in the previous section provides precise indications on how to build a neural network from scratch. Those instructions are not only valid theoretically, but they also hold from a practical viewpoint.

However, there are several problems that can be spotted, both in theory and in practice. As a matter of fact, we can show that, the deeper the network, the more difficult the learning process becomes. Also, overfitting has been mentioned above, but it has not been addressed concretely.

The solutions that have been historically introduced have also been used to build our models, making them interesting not only for their general value, but also for our particular research problem.

2.3.1 The Vanishing Gradient Problem

As we have already mentioned in the previous sections, adding layers to a deep neural network does not automatically translate into a substantially better model accuracy. Counterintuitively, it can happen that the accuracy drops even if the number of parameters is increased. In fact, our intuition suggests that, by adding complexity to the models, they should be able to identify more abstract and composite features, thus improving their learning tasks.

What happens from a microscopical standpoint is that gradients tend to get smaller as we move backward through the hidden layers of our neural network. This problem, which is one of the most acknowledged problems at least in vanilla neural networks, is called the vanishing gradient problem. It also has a symmetric and equally worrying counterpart, called the exploding gradient problem, which does the opposite, making the gradients larger and larger in the first layers of the network. In both problems, we can say that backpropagation makes gradients unstable.

To address the vanishing gradient, we can first translate its conceptual description into an equation. Consider a very simple network with three hidden layers made of a single neuron each. Then, backpropagating the gradients towards the first hidden layer translates into:

\frac{\partial C}{\partial b_1} = \sigma'(z_1) w_2 \sigma'(z_2) w_3 \sigma'(z_3) w_4 \sigma'(z_4) \frac{\partial C}{\partial a_4} \qquad (2.2)

It is now clear that the gradient at the first layer (Equation 2.2) is extremely influenced by the nature of the derivative of the activation function σ. For instance, the sigmoid derivative has a maximum value of 0.25, making the result of the multiplication above smaller the deeper the network.

One way to address the problem is by means of new activation functions that do not present the problem of the sigmoid function. A well-known example is the Rectified Linear Unit function (ReLU). [17]

ReLU(x) = max(0, x)

For a logistic function, since the gradient can become arbitrarily small, we can get a numerically vanishing gradient by composing several negligible logistics, a problem that gets worse for deeper architectures. For the ReLU, as the gradient is piecewise constant, a vanishing composite gradient can only occur if there is a component that is actually 0, thus introducing sparsity and reducing vanishing gradients overall.
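The contrast between the two derivatives can be checked numerically; this is a small illustrative sketch (function names are our own):

```python
import numpy as np

def sigmoid_prime(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1 - s)

def relu_prime(z):
    return (z > 0).astype(float)

# The sigmoid derivative never exceeds 0.25 (its maximum, at z = 0)...
z = np.linspace(-10, 10, 1001)
print(sigmoid_prime(z).max())          # 0.25
# ...so a product of many such factors vanishes quickly,
print(0.25 ** 10)                      # < 1e-6
# while the ReLU derivative is exactly 1 on the active side.
print(relu_prime(np.array([3.0]))[0])  # 1.0
```

Even in the best case, ten sigmoid layers can shrink a backpropagated gradient by a factor below 10^{-6}, whereas active ReLU units pass the gradient through unchanged.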



2.3.2 Regularization and Dropout

Vanishing gradient is one of the most relevant and notorious problems in deep learning. Unfortunately, there are some other problems that inherently (and negatively) affect the training process. One of the most common problems in machine learning is overfitting. It occurs every time training data are fit by an overly complex model. The model itself is not really looking at the nature of the problem it is trying to learn, but rather at the shape of the input data. The issue is exposed when the model is finally evaluated with a different set of test data, revealing how the model cannot understand the underlying relationship between input points.

There is a very long list of proposed workarounds to deal with overfitting. Most of these methods are common within the machine learning context, and do not concern neural networks only. An example is regularization. When facing overfitted settings, one noticeable thing is the fact that weights and biases tend to be way greater than their non-overfitted counterparts. Weight decay, or L2 regularization, works by adding a regularization term to the cost function, in order to deal with this weight explosion.

C = -\frac{1}{n} \sum_x \sum_j \left[ y_j \ln a^L_j + (1 - y_j) \ln (1 - a^L_j) \right] + \frac{\lambda}{2n} \sum_w w^2 \qquad (2.3)

Equation (2.3) uses λ > 0 as a regularization parameter, which makes the network prefer to learn smaller weights rather than larger ones. Ideally, large weights are considered only when they considerably improve the first part of the cost function. λ is a hyper-parameter which needs to be tuned: when λ is large, the preference for small weights dominates; when it is small, the original cost function does. In the context of neural networks, we need to consider how the gradients are calculated after introducing the regularization.

\frac{\partial C}{\partial w} = \frac{\partial C_0}{\partial w} + \frac{\lambda}{n} w \qquad \frac{\partial C}{\partial b} = \frac{\partial C_0}{\partial b}

b \to b - \eta \frac{\partial C_0}{\partial b}

w \to w - \eta \frac{\partial C_0}{\partial w} - \frac{\eta \lambda}{n} w = \left( 1 - \frac{\eta \lambda}{n} \right) w - \eta \frac{\partial C_0}{\partial w}

(41)

A smoother version of L2 regularization is L1 regularization, where the weights get shrunk much less than with L2.

C = C_0 + \frac{\lambda}{n} \sum_w |w| \qquad \frac{\partial C}{\partial w} = \frac{\partial C_0}{\partial w} + \frac{\lambda}{n} \, \mathrm{sgn}(w)

L1 and L2 regularization are good workarounds to deal with overfitting, but not only are they not proven to be always effective, they also introduce another hyper-parameter to be tuned. The most important characteristic of these regularization methods, however, is the fact that they do not try to inspect the model, to act on the exact way it tends to overfit. Instead, they act on the cost function to somehow adjust the results. A much more powerful method to reduce the effects of overfitting in neural networks is called Dropout. [23]

In an ordinary setting, we would train a network by forward-propagating the input through the entire network, and then backpropagating to determine the contribution to the gradient. With Dropout, we take a slightly different approach: we randomly and temporarily delete some hidden neurons, and all their connections. We forward-propagate the input and finally backpropagate through this shrunk network.

Each time a new input batch is fed to the network, we change the subset of hidden neurons that are removed from the original network. In this kind of scenario, each neuron is trying to learn the relationships within the input data without taking for granted the help of all its fellow neurons. Of course, the weights that are learned using Dropout need to be reduced when the full network is actually used. This compensation is due to the fact that we have learned the weights and biases of the network in a shrunk setting, and we cannot just reintroduce the previously removed neurons without adjusting those weights and biases.
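As a small illustration, here is a sketch of the dropout forward pass. Note that instead of rescaling the weights at test time as described above, this sketch uses the common "inverted dropout" variant, which scales the surviving activations at training time so that no test-time adjustment is needed; the function name and probabilities are our own choices:

```python
import numpy as np

def dropout_forward(a, p_drop, rng, train=True):
    """Inverted dropout: zero each activation with probability p_drop.

    Dividing the surviving activations by (1 - p_drop) keeps the
    expected activation unchanged, so the full network can be used
    at test time without rescaling its weights.
    """
    if not train or p_drop == 0.0:
        return a
    mask = (rng.random(a.shape) >= p_drop) / (1.0 - p_drop)
    return a * mask

rng = np.random.default_rng(0)
a = np.ones(10_000)
out = dropout_forward(a, p_drop=0.5, rng=rng)
print(out.mean())  # ≈ 1.0: the expectation is preserved
```

A fresh random mask is drawn for every batch, so each batch effectively trains a different sub-network sampled from the original one.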

The probability of a neuron being dropped out of the network is, again, a hyper-parameter, to be chosen by the network designer. Ideally, the smaller the drop probability, the fewer iterations are required to reach an adequate accuracy level. However, by dropping neurons, the network relies on fewer assumptions about the nature of the data than in the original setting, and so it should overfit less.



Figure 2.5: Dropout operates by randomly removing each neuron from the original network with a certain pre-defined probability. The new network will then be trained in the same way, while the network evaluation will adjust the learned parameters, depending on the probability that a neuron is dropped out of the network

Heuristically, the Dropout operation is equivalent to a training process that takes place by using a random set of different neural networks, each of which is a subset of the original network. What Dropout does when it picks up the pieces and merges the smaller networks together is to average the effect of the different neural networks.

2.3.3 Batch Normalization

Dropout is an incredibly clever and efficient method to address the problem of overfitting and to speed up training. However, it introduces yet another hyperparameter to tune during validation. Of course, this issue may seem easy to deal with, especially because it would only require a validation step to choose the models that perform the best.

However, there are other problems that dropout does not address, and that will still affect the training performance, and consequently slow down the learning process. These issues are mainly the aforementioned vanishing gradient problem and parameter initialization.

We have already introduced interesting solutions to these problems, like the use of ReLU activations to avoid saturation, and these methods are still valid and widely used, especially together with dropout. Nevertheless, finding a way to deal with all those issues at once would relieve a lot of effort at training and validation time. One of the most famous algorithms that successfully tackles this problem is batch normalization. [25]

Batch normalization reflects on how the distribution of the activations at each layer changes during training, depending on the activations of all preceding layers. This condition, called internal covariate shift, slows down training, making it nearly impossible to learn anything without setting low learning rates and carefully initializing the network parameters.

Furthermore, this change in distributions forces the model to constantly adapt to their nature. To make the situation even worse, the problem, similarly to the vanishing gradient, amplifies as the network gets deeper and deeper.

The way batch normalization practically tackles the problem is by whitening (i.e. normalizing and decorrelating) not only the network input, but also the inputs of every other layer, which coincide with the network activations. Since full whitening at each layer would be too computationally expensive, a few assumptions and approximations are introduced to make batch normalization applicable. First, each feature is normalized independently.

\hat{x}^{(k)} = \frac{x^{(k)} - \mathrm{E}[x^{(k)}]}{\sqrt{\mathrm{Var}[x^{(k)}]}} \quad (2.4)

Yann LeCun already pointed out how this kind of feature normalization effectively speeds up training convergence, even when the features are correlated. [30] The problem with this kind of normalization is that it changes what the layers actually represent. One of the proposed solutions is to transform the normalized inputs by scaling and shifting them back to their original space, as follows.

y^{(k)} = \gamma^{(k)} \hat{x}^{(k)} + \beta^{(k)} \quad (2.5)

The innovative thing about this operation is that scaling and shifting are carried out using trainable parameters, learned by the network through backpropagation. Without this transformation, there would be a very high chance that the representation of the learned features was distorted.

The scaling parameters make it possible to project the normalized features back to their original space. As a matter of fact, the scaling and shifting parameters could learn to represent the exact inverse of the normalization, if that were the optimal thing for the network to learn.

\gamma^{(k)} = \sqrt{\mathrm{Var}[x^{(k)}]}, \quad \beta^{(k)} = \mathrm{E}[x^{(k)}] \quad \Longrightarrow \quad y^{(k)} = x^{(k)}
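The identity above is easy to check numerically. The following toy NumPy sketch (illustrative, not part of the original text) normalizes one feature and then applies the scale and shift with γ set to the standard deviation and β set to the mean, recovering the original values exactly.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=3.0, scale=2.0, size=10_000)   # one feature x^(k)

mean, var = x.mean(), x.var()
x_hat = (x - mean) / np.sqrt(var)                 # normalize, as in (2.4)

# Choose gamma = sqrt(Var[x]) and beta = E[x], as in the identity above.
gamma, beta = np.sqrt(var), mean
y = gamma * x_hat + beta                          # scale and shift, as in (2.5)

# The scale-and-shift exactly undoes the normalization.
print(np.allclose(y, x))
```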

We now have all the components to formally write down how batch normalization works, both in the feedforward and backpropagation steps.

Algorithm 1: The Batch Normalization algorithm (BN) as presented in [25]

Input: values of x over a mini-batch: B = \{x_1, \dots, x_m\}; parameters to be learned: \gamma, \beta
Output: y_i = \mathrm{BN}_{\gamma,\beta}(x_i)

\mu_B \leftarrow \frac{1}{m} \sum_{i=1}^{m} x_i \qquad \text{(mini-batch mean)}

\sigma_B^2 \leftarrow \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2 \qquad \text{(mini-batch variance)}

\hat{x}_i \leftarrow \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} \qquad \text{(normalize)}

y_i \leftarrow \gamma \hat{x}_i + \beta \equiv \mathrm{BN}_{\gamma,\beta}(x_i) \qquad \text{(scale and shift)}
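The forward pass of Algorithm 1 translates almost line by line into NumPy. This is a didactic sketch under the assumption that the input is a mini-batch with one feature per column; ε is the small constant added to the variance for numerical stability.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Batch Normalization forward pass over a mini-batch.

    x:     array of shape (m, n_features)
    gamma: scale parameters, shape (n_features,)
    beta:  shift parameters, shape (n_features,)
    """
    mu = x.mean(axis=0)                    # mini-batch mean
    var = x.var(axis=0)                    # mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize each feature
    return gamma * x_hat + beta            # scale and shift

rng = np.random.default_rng(2)
x = rng.normal(5.0, 3.0, size=(64, 4))
y = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
# With gamma=1 and beta=0 the outputs have (approximately) zero mean
# and unit variance in each feature.
print(y.mean(axis=0), y.var(axis=0))
```

Note that a complete implementation would also keep running estimates of the mean and variance for use at evaluation time, where no mini-batch statistics are available.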

As we have previously hinted, having two new trainable parameters (γ and β) requires understanding how their partial derivatives are calculated during the backpropagation phase.

\frac{\partial \ell}{\partial \hat{x}_i} = \frac{\partial \ell}{\partial y_i} \cdot \gamma

\frac{\partial \ell}{\partial \sigma_B^2} = \sum_{i=1}^{m} \frac{\partial \ell}{\partial \hat{x}_i} \cdot (x_i - \mu_B) \cdot \left(-\frac{1}{2}\right) (\sigma_B^2 + \epsilon)^{-\frac{3}{2}}

\frac{\partial \ell}{\partial \mu_B} = \left( \sum_{i=1}^{m} \frac{\partial \ell}{\partial \hat{x}_i} \cdot \frac{-1}{\sqrt{\sigma_B^2 + \epsilon}} \right) + \frac{\partial \ell}{\partial \sigma_B^2} \cdot \frac{\sum_{i=1}^{m} -2 (x_i - \mu_B)}{m}

\frac{\partial \ell}{\partial x_i} = \frac{\partial \ell}{\partial \hat{x}_i} \cdot \frac{1}{\sqrt{\sigma_B^2 + \epsilon}} + \frac{\partial \ell}{\partial \sigma_B^2} \cdot \frac{2 (x_i - \mu_B)}{m} + \frac{\partial \ell}{\partial \mu_B} \cdot \frac{1}{m}

\frac{\partial \ell}{\partial \gamma} = \sum_{i=1}^{m} \frac{\partial \ell}{\partial y_i} \cdot \hat{x}_i

\frac{\partial \ell}{\partial \beta} = \sum_{i=1}^{m} \frac{\partial \ell}{\partial y_i}

2.4 Convolutional Neural Networks

As we have discussed in the previous sections, feedforward neural networks are powerful, but introduce a variety of problems that need to be addressed in order to achieve valuable and non-trivial degrees of accuracy. The workarounds and solutions we have introduced are certainly brilliant, and come in handy in the majority of settings. However, it is inefficient to try to adjust a model at all costs, especially when the context we are trying to learn from is not really aligned with our learning method. Let us introduce a simple example to understand how powerful fully-connected neural networks are, but also where they can be improved the most.

2.4.1 Image Classification

Suppose we need to build a neural network that learns how to classify images. A common example is handwritten digit classification. The problem is defined as follows: given a dataset of handwritten digits, our model needs to classify its input images correctly.

From what we have learned in the previous sections, we can imagine building a neural network by modeling the input layer as a set of neurons, each one mapping exactly one input pixel. We also need to define how to model the output layer. Since there are ten classes of digits in total, we model the output layer as a set of ten neurons. The final activation, the one that connects the last hidden layer to the output layer, has to somehow define a probability distribution over the classes, depending on the previous activations, weights and biases. A very simple way to achieve this is by means of the softmax function.

\mathrm{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{N} e^{x_j}} \quad (2.6)

The softmax equation (2.6) pushes the maximum value even closer to 1, and the other values even closer to 0. It also transforms the input space into a probability distribution: each softmaxed value ranges between 0 and 1, and the sum of all the softmaxed inputs is always 1.

\sum_{i=1}^{N} \mathrm{softmax}(x_i) = \frac{\sum_{i=1}^{N} e^{x_i}}{\sum_{j=1}^{N} e^{x_j}} = 1
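Equation (2.6) and its sum-to-one property are easy to verify numerically. One detail assumed in this sketch, although not mentioned in the text above, is the standard practice of subtracting the maximum input before exponentiating, which avoids overflow without changing the result.

```python
import numpy as np

def softmax(x):
    # Subtracting max(x) is a numerical-stability trick: the factor
    # e^{-max(x)} cancels between numerator and denominator, so the
    # result is mathematically identical to equation (2.6).
    e = np.exp(x - np.max(x))
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(probs)        # the largest logit gets the largest probability
print(probs.sum())  # always 1
```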

Figure 2.6: A diagram of softmax regression. The input is mapped directly to the output through connections (weights) and biases. The softmax activation transforms the weighted sum into a probability distribution.

A model that aims at classifying its input data by means of a probability distribution created by a softmax function is said to perform softmax regression. Of course, the model we have just described is too simple to achieve very important results. However, given the original problem of recognizing handwritten digits, even a fully-connected input-output network can achieve non-trivial accuracy.
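For 28×28 digit images, such a softmax-regression model amounts to a single weight matrix and bias vector. The sketch below shows only the forward pass; the shapes and names are illustrative assumptions, not taken from any actual thesis code, and the random batch stands in for real digit images.

```python
import numpy as np

rng = np.random.default_rng(3)

n_pixels, n_classes = 28 * 28, 10
W = rng.normal(0, 0.01, size=(n_pixels, n_classes))  # weights
b = np.zeros(n_classes)                              # biases

def softmax_rows(z):
    # Row-wise softmax with the usual max-subtraction for stability.
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def predict_proba(images):
    """images: (batch, 784) flattened pixel intensities."""
    return softmax_rows(images @ W + b)

batch = rng.random((32, n_pixels))    # stand-in for real digit images
probs = predict_proba(batch)          # one distribution per image
predictions = probs.argmax(axis=1)    # most likely class per image
print(probs.shape, predictions.shape)
```

Training would then adjust W and b by gradient descent on a cross-entropy loss, which is precisely where backpropagation, and the regularization techniques above, come into play.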

Naturally, one might think that adding hidden layers to the network automatically leads to better results. For the reasons
