Implementation of a Face Recognition System using Convolutional Neural Networks



Dipartimento di Ingegneria dell’Informazione

Master's Degree Programme in Computer Engineering

Implementation of a Face Recognition System using Convolutional Neural Networks

24 July 2017

Candidate

Vittorio Romolini

Supervisors: Prof. Giuseppe Amato, Prof. Fabrizio Falchi, Prof. Claudio Gennaro


The availability of large training datasets and the introduction of GP-GPUs, along with a number of algorithmic advances, have fostered the recent progress of computer vision, a field of Artificial Intelligence.

The latest state-of-the-art approaches for face recognition have taken advantage of this progress by exploiting deep convolutional neural networks (DCNNs). I have examined these methods, as well as the most widely used face datasets and the techniques for face detection and alignment.

Moreover, I have selected a promising method to be developed, based on the training of DCNNs through a "triplet loss function". Triplets are composed of three images: an anchor sample, a positive sample similar to the anchor, and a negative sample that instead belongs to a different class. The investigated loss function enables the network to learn a mapping that projects the anchor and positive images to embeddings whose Euclidean distance is sufficiently smaller than the distance between the embeddings of the anchor and negative samples. In this manner, after training on triplets of face images, the similarity verification between two face samples can be reduced to comparing the Euclidean distance between their embeddings with a distance threshold.

I have implemented the software tools needed to realize such a technique, and I have trained and tested a number of models on publicly accessible face datasets, assessing different training settings and exploring the hyper-parameter space.

Finally, I have built a face recognition system that enhances the face verification accuracy of the base method, even on strongly unaligned face images, and also surpasses human accuracy.

Contents

1 Introduction 1

1.1 Face Recognition: Problem and Tools . . . 1

1.2 Thesis Goals . . . 4

1.3 Thesis Outline . . . 5

1.4 Background . . . 6

1.4.1 Machine Learning and Neural Networks . . . 6

1.4.2 The Representation Issue . . . 11

1.4.3 Deep Learning . . . 14

1.4.4 Convolutional Neural Networks . . . 25

1.5 Further Network Architectures . . . 34

1.5.1 Distance Metric Learning through the Siamese Network . . . 34

1.5.2 The Triplet Network for Similarity Learning . . . 38

1.5.3 Inception Module and GoogLeNet . . . 42

1.6 Historical Notes. . . 44

2 Datasets 47
2.1 Private Corporates’ Datasets . . . 47

2.2 Publicly Available Datasets . . . 48

2.2.1 Labeled Faces in the Wild (LFW) . . . 48

2.2.2 YouTube Faces Database (YTF) . . . 51

2.2.3 VGG Face Dataset . . . 52

2.2.4 IARPA Janus Benchmark A (IJB-A) . . . 53


2.2.6 MegaFace . . . 54

2.2.7 MS-Celeb-1M . . . 54

3 Related Work 57
3.1 Face Detection and Pose Estimation . . . 57

3.1.1 Cascade Detectors based on Rigid Templates . . . 57

3.1.2 Deformable Part Models based on Standard Image Features . . . . 60

3.1.3 Detectors based on Neural Networks . . . 62

3.2 Face Alignment . . . 63
3.3 Face Recognition . . . 65
3.3.1 DeepFace . . . 65
3.3.2 DeepID 1, 2, 2+ and 3 . . . 67
3.3.3 FaceNet . . . 70
3.3.4 VGG Face . . . 71

3.3.5 Data Augmentation Strategies for Object and Face Recognition. . 75

3.4 Discussion . . . 76

4 Software 79
4.1 NVIDIA . . . 79

4.2 Dlib . . . 81

4.3 MatConvNet . . . 82

4.4 Basic Linear Algebra Subprograms . . . 84

4.5 Caffe . . . 85

4.5.1 Introduction . . . 85

4.5.2 Installation . . . 91

4.5.3 Python Interface . . . 93

4.5.4 Adding the Triplet Loss Layer to Caffe. . . 99

4.6 Design and Implementation of PyLearning . . . 101

4.6.1 Design and Implementation of PyLearning. . . 101

4.6.2 Support to Triplet Training . . . 102

5 Experiments 107

5.1 Experimental Settings . . . 107

5.1.1 Test Configurations, Thresholding and Collected Performances . . 107

5.1.2 Test Datasets: Uncropped Variants . . . 110

5.1.3 Test Datasets: Detection through the VGG Face Detector . . . 110

5.1.4 Test Datasets: Central Crop . . . 111

5.1.5 Test Datasets: Detection through Dlib . . . 112

5.1.6 Test Datasets: IJB-A Ground-Truth . . . 114

5.2 Performances of the Base Model “D” . . . 114

5.3 Fine-Tuning of the Base Model . . . 118

5.3.1 Architecture . . . 118

5.3.2 Training Triplets . . . 122

5.3.3 Distance Margin and Learning Rate . . . 123

5.3.4 Blacklisting of LFW Ground-Truth Errors . . . 129

5.3.5 Architectural Variants . . . 130

5.3.6 Learning over Casia WebFace . . . 131

5.4 Performances of the Proposed Models . . . 134

6 Conclusions and Future Work 137

List of Figures

1.1 Relationship between Deep Learning and AI . . . 3

1.2 Artificial neuron . . . 7

1.3 Feed forward neural networks . . . 8

1.4 Underfitting and overfitting of a function . . . 9

1.5 Learning curves . . . 10

1.6 Bias and variance analogy with dart throwing . . . 11

1.7 Rectifier, logistic and hyperbolic tangent functions. . . 17

1.8 Faster training using ReLUs . . . 18

1.9 Example of 2-D convolution without kernel flipping . . . 27

1.10 Stack of small convolutional layers . . . 31

1.11 Siamese network . . . 36

1.12 Triplet network . . . 40

1.13 Architecture for Euclidean embedding by Wang et al. . . 41

1.14 Inception module . . . 42

1.15 GoogLeNet architecture . . . 43

1.16 LeNet-5 architecture . . . 45

1.17 AlexNet architecture . . . 46

2.1 Distribution of samples in peopleDevTrain . . . 51

2.2 Distribution of samples in peopleDevTest . . . 51

3.1 Yaw, pitch and roll . . . 57

3.2 Haar-like features used by Viola and Jones. . . 58


3.4 FaceNet model structure and triplet loss learning . . . 70

4.1 Example of object classification with MatConvNet . . . 83

4.2 Caffe and its interfaces . . . 85

4.3 Triplet training in debug mode . . . 106

5.1 Central cropping with crop factors 60, 65, 70. . . 111

5.2 FaceCropper in interactive mode . . . 113

5.3 Tested accuracy of model D over LOC65 . . . 115

5.4 Training and deploy architecture of trained models . . . 121

5.5 Violating triplets in models DL001, DL002, DL008, DL009, DL011 . . . . 125

5.6 T2 EER-Acc on LOC65 in models DL001, DL002, DL008, DL009, DL011 . . . 126
5.7 T2 EER-Acc on LDC65 in models DL001, DL002, DL008, DL009, DL011 . . . 127
5.8 Tested accuracy of model DL009 epoch 10 over LDC65 . . . 128

5.9 Tested accuracy of model DL002 epoch 7 over LOC65 . . . 128

5.10 T2 EER-Acc on LOC65 of models learned with blacklisting of the LFW ground-truth errors . . . 129

5.11 Overall scheme of trained models.. . . 133

5.12 ROC curves of selected models tested on T3 over LOC65 . . . 135

List of Tables

2.1 LFW lists for View 1 under the unrestricted protocol. . . 50

3.1 The DeepFace-single architecture . . . 65

3.2 Architecture D used by Parkhi et al. . . 72

3.3 Performance evaluation of the models proposed by Parkhi et al. without triplet training . . . 73

4.1 Experiments’ environments . . . 79

4.2 Face detection performances of OpenCV and Dlib . . . 81

5.1 Test configurations . . . 108

5.2 Model D performances in T1, T2, T3, T4 tests over unaligned datasets. . 114

5.3 Model D performances in T2 tests over aligned test datasets and with descriptors’ normalization. . . 116

5.4 Model D performances according to Parkhi et al. . . 117

5.5 T2 performances on LOC65 in best-performing intermediate models for DL001, DL002, DL008, DL009, DL010+DL011 . . . 126

5.6 T2 performances on LDC65 in best-performing intermediate models for DL001, DL002, DL008, DL009, DL010+DL011 . . . 127

5.7 Training data used for DC014 and DC016, compared with DL002 . . . 132

5.8 T4 performances of selected intermediate models over IJB-A. . . 134

5.9 T3 performances of selected intermediate models . . . 135


1 Introduction

Face recognition is at the forefront of the algorithmic perception revolution

—Taigman et al. [61]

1.1 Face Recognition: Problem and Tools

In computer vision, face recognition is a kind of object recognition specialized in recognizing human faces among the visual contents of input images. It is a general problem that includes two tasks:

• Face verification: a comparison is performed between a probe image and a reference image, in order to determine whether or not they both represent the same person.
• Face identification: a one-to-many comparison is performed between a probe image and a gallery of images of known people, in order to identify the face depicted in the probe. In closed-set identification tasks the probe individual is known to be in the gallery, while this is not guaranteed in open-set identification tasks.

These recognition tasks can be effectively fulfilled by applying Artificial Intelligence methods. More to the point, face models are learned through Machine Learning techniques, where faces are represented through characteristic features that describe the visual content of image patches.


In order to use a face recognition system, one first needs to train it over a set of face images of many individuals. In the training phase, characteristic face features are detected in the training images: by taking advantage of the presence and intensity of such features in the images associated with different people, the system learns face models for the different individuals.

Representation Learning techniques, in turn, make it possible to automatically learn the characteristic face features themselves during the training phase.

Moreover, feature detections can be mapped to embeddings in a lower-dimensional space, and a distance function over the embeddings can be used to efficiently evaluate the similarity between two input images. System training can be performed in order to tune the mapping parameters so that faces of the same individual are mapped to embeddings having small distance, while faces of different people are mapped far apart. This strategy is applied for instance by Triplet Loss Training.

After the learning phase ends, feature detections in any new input image can be easily compared to the detections found in the training images. In this way, it is possible to identify the input face against the predefined set of training individuals, or to verify whether two faces correspond to the same person by comparing the distance between their embeddings with a threshold.
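As an illustration of this verification scheme, the following minimal sketch (NumPy; the embed() function is a hypothetical stand-in for a trained network, here just a fixed random projection, and the threshold value is arbitrary) reduces face verification to a distance comparison between L2-normalized embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for the trained DCNN: a fixed random projection from a
# flattened 32x32x3 face crop to a 128-dimensional descriptor.
PROJECTION = rng.standard_normal((128, 32 * 32 * 3))

def embed(face_image):
    return PROJECTION @ face_image.reshape(-1)

def l2_normalize(v):
    # Project the descriptor onto the unit hypersphere.
    return v / np.linalg.norm(v)

def same_person(face_a, face_b, threshold=1.0):
    # Verification: the two faces are declared to match when the Euclidean
    # distance between their normalized embeddings falls below a threshold
    # tuned on a validation set of labeled pairs.
    ea = l2_normalize(embed(face_a))
    eb = l2_normalize(embed(face_b))
    return np.linalg.norm(ea - eb) < threshold

face_a = rng.random((32, 32, 3))   # placeholder "face crops"
face_b = rng.random((32, 32, 3))
print(same_person(face_a, face_b))
```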

Face recognition systems have to address non-trivial issues such as the following:

• Often just a small collection of training images is available.

• There could be strong intra-class variability, meaning that significant differences may exist between images depicting the same person due to the so-called A-PIE variability (Age, Pose, Illumination and Expression) or other factors (presence of glasses, eyebrows, ...).

• Conversely, there could also be a small inter-class variability, in the sense of high similarity between images of different people.

Nevertheless, in recent years a substantial and growing literature has presented a number of visual recognition systems achieving increasingly higher performance, which has finally matched and surpassed human performance.


Deep Learning is at the heart of these advances in visual classification tasks, providing state-of-the-art approaches for face recognition as well. Indeed, it enables the detection of complex face features composed of simpler ones, which in turn are composed of even simpler ones, hence defining a hierarchy of features.

Figure 1.1: Venn Diagram representing the relationship among Artificial Intelligence, Machine Learning, Representation Learning and Deep Learning [20].

Most notably, deep Convolutional Neural Networks have driven the recent progress of the computer vision field due to their intimate compatibility with visual classification tasks and their performance gains during system training.

As discussed in detail later on, all these sophisticated techniques for object recognition were recently made feasible especially thanks to the increasing size of available face datasets and the availability of powerful hardware and software infrastructure. Indeed, these advancements provided the basis for automatically learning effective face models that generalize over sufficient examples, without requiring hand-crafted features and with reasonable training time.


1.2 Thesis Goals

The objectives of this thesis are the following:

1. To analyze novel approaches for face recognition based on deep learning.
2. To select a usable and promising approach.
3. To implement the software tools needed to fulfill and improve such an approach.
4. To build a face recognition system that enhances the face verification accuracy of the base method.

Given these goals, we have carried out an analysis of deep learning principles and we have reviewed both the datasets and techniques involved in the pipeline from detection to recognition of faces.

The neural networks we have trained have been built by fine-tuning a promising deep learning approach which we have selected among those that have been proposed in the recent face recognition literature.

A noteworthy characteristic of the selected approach is that it takes advantage of the triplet loss technique, aimed at learning a triplet distance embedding, which we have analyzed.

This thesis also fills a gap by performing reproducible tests, over standard public datasets and detailed test configurations, for the base face recognition model we have later fine-tuned.

We have explored and presented in this thesis some software tools useful both for face detection and face recognition through deep learning techniques.

Then, we have developed new software on top of the above-mentioned tools and libraries, making it easy to perform training with a triplet loss function and to generate models that can be tested over sets of face image pairs in order to evaluate their accuracy on the face verification task.

By using the software we have implemented, we have been able to generate effective face verification models, whose performance has been thoroughly tested and reported in the Experiments chapter.


1.3 Thesis Outline

In the remaining part of this Chapter, we are going to introduce the background notions about the machine learning techniques employed in the rest of the thesis. Afterwards, the milestones of research efforts up to the most relevant breakthroughs in deep learning are listed from a historical perspective.

Chapter 2 is dedicated to a review of widely used face datasets.

Chapter 3 presents the related work on face detection, alignment and recognition. The most relevant scientific articles in these fields are examined.

Moreover, in this Chapter we motivate the design choice of the base model we have selected for fine-tuning, as later discussed.

In Chapter 4 we first introduce the software tools employed for face detection and for training and testing neural networks. Then, we present the PyLearning library, which we have designed and implemented with the purpose of easing deep learning through Python scripts and, in particular, of enabling triplet training.

Chapter 5 contains the experiments we have performed on the base model and the fine-tuned models we have trained.

After describing the experimental settings, including the test configurations and the generation of the cropped dataset variants, we illustrate a detailed performance analysis of the base model.

Then, we discuss the learning configuration of the neural networks we have trained along with the associated performance results.

Finally, overall considerations, future enhancements and conclusions are reported in Chapter 6.


1.4 Background

1.4.1 Machine Learning and Neural Networks

The field of Artificial Intelligence (AI) is directed toward the development of intelligent agents. As stated by Russell and Norvig [47], an intelligent agent perceives its environment in order to maximize its chances of success in accomplishing a task. Intelligent software systems can help in the automation of many tasks, especially repetitive ones over large amounts of data.

Decisions taken by an intelligent system may be based, as in the first AI projects, on a hard-coded, formally expressed knowledge base of rules applied to the input facts using inference rules. Nevertheless, the real world's complexity proved too great to be formally represented with a finite list of rules. This is especially true with respect to the subjective and intuitive decisions that any person takes in everyday life.

Machine Learning lets a system build its knowledge from experience, through a training phase over example data. Hence, Domingos [12] summarizes that "machine learning systems automatically learn programs from data".

In computer vision, the input data (the raw image) is represented through relevant features like corners. Tuytelaars and Mikolajczyk define a local feature as "an image pattern which differs from its immediate neighborhood" [63]. Many different feature detection algorithms exist, and many other techniques have been developed for encoding feature descriptors and for aggregating these descriptors into compact image representations. In the case of supervised machine learning, training takes place by supplying the system with a large set of examples, each one containing a multidimensional input (the relevant features) and the corresponding desired output.

For instance, in an intelligent system for image classification, a training example would be composed of the features detected in a photo of a dog along with the numeric identifier of the "dogs" category.

During the training, the intelligent system automatically optimizes its parameters in order to link each training input to the corresponding desired output: so, to continue the previous example, the features found in a photo representing a dog are linked with the dog class, for instance by changing the weights in an artificial Neural Network.


Figure 1.2: Functional diagram of an artificial neuron.

Neural networks are composed of simple artificial neurons, schematized as in Fig. 1.2. In each neuron i, the weighted sum $a_i$ of the inputs $x_{j,i}$ with respect to the weights $w_{j,i}$ is passed to an activation function f so as to produce the output

$$y_i = f\left(\sum_{j=1}^{n} w_{j,i} \cdot x_{j,i}\right)$$

For instance, $f(a_i)$ could return 1 if $a_i > 0$, and 0 otherwise. A further neuron input is the bias, used as a threshold over $a_i$ by the activation function: the bias can be modeled with additional rows and columns in the x and w vectors.

Typical activation functions are the logistic function, the hyperbolic tangent and the Rectified Linear Unit (ReLU).
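As a minimal sketch of the artificial neuron of Fig. 1.2 (NumPy; the input, weight and bias values are arbitrary illustrative numbers), the output is the chosen activation function applied to the weighted sum of the inputs plus the bias:

```python
import numpy as np

def logistic(a):
    return 1.0 / (1.0 + np.exp(-a))

def relu(a):
    return np.maximum(0.0, a)

def neuron(x, w, b, f):
    # y = f(sum_j w_j * x_j + b), as in the equation above.
    return f(np.dot(w, x) + b)

x = np.array([0.5, -1.2, 3.0])   # inputs x_j
w = np.array([0.8,  0.1, 0.4])   # weights w_j
b = -0.5                         # bias, acting as a threshold on the weighted sum

print(neuron(x, w, b, logistic))  # logistic activation
print(neuron(x, w, b, np.tanh))   # hyperbolic tangent
print(neuron(x, w, b, relu))      # ReLU
```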

Weights are typically randomly initialized¹ and they get updated during network training with the purpose of improving $y_i$ with respect to the desired output $t_i$. A commonly used neural network architecture is the feed-forward network, depicted in Fig. 1.3 as a Directed Acyclic Graph (DAG), since the information flows from the input layer to the output layer through the hidden layers. When feedback links are present, the network is called recurrent instead of feed-forward.

As pointed out by Domingos [12], there exists a wide variety of machine learning systems, since they are essentially the combination of three components: system modeling, evaluation function and optimization technique. Each one of them has several alternative implementations, but a complete analysis of all the possible combinations is out of the scope of this thesis.

¹ Several weight initialization algorithms have been proposed in the literature. For instance, with Gaussian initialization the weights are randomly sampled from a Gaussian distribution with zero mean and a given standard deviation. With Xavier initialization, instead, the weights of the $n_{in}$ connections feeding a neuron are sampled from a uniform distribution over the interval $[-a, a]$, where $a = \sqrt{3 / n_{in}}$.
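A sketch of the two initialization schemes mentioned in the footnote might look as follows (NumPy; the layer sizes and the standard deviation are arbitrary example values):

```python
import numpy as np

rng = np.random.default_rng(42)

def gaussian_init(n_in, n_out, std=0.01):
    # Weights sampled from a zero-mean Gaussian with a given standard deviation.
    return rng.normal(0.0, std, size=(n_out, n_in))

def xavier_init(n_in, n_out):
    # Weights sampled uniformly from [-a, a] with a = sqrt(3 / n_in).
    a = np.sqrt(3.0 / n_in)
    return rng.uniform(-a, a, size=(n_out, n_in))

W1 = gaussian_init(n_in=4096, n_out=1024)
W2 = xavier_init(n_in=4096, n_out=1024)
```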


Figure 1.3: A layered feed-forward network, where neurons (represented as circles) are organized in layers. In this example, there are two hidden layers between the input layer (whose passive nodes only replicate the perceived inputs for the next layer) and the output layer (responsible for producing the network output).

Concerning the system modeling, we concentrate on face recognition techniques based on (deep convolutional) neural networks, which offer impressive performance for computer vision tasks, as proven by the recent literature and discussed in §3.3, both on object recognition and more specifically on face recognition.

Given a supervised training example $(\hat{x}, \hat{t})$, the loss function (also known as cost or evaluation function) quantifies the network error between the target result $\hat{t}$ and the actual network output for the input $\hat{x}$ under a specific configuration of weights.

Gradient descent, instead, is one of the optimization algorithms that may be used. It consists in minimizing the loss by varying the system weights. Indeed, the loss function shapes an error surface over the space of weight configurations: the opposite of the gradient at a given weight configuration suggests where to look for points of smaller loss. The weights are then updated in the direction opposite to the gradient, proportionally to a learning rate. Weight update algorithms also typically take previous updates into account through a momentum term, which favors a continued descent along the same direction over the error surface.

The weights are updated by back-propagating the error from the output layer to the previous layers. This process continues over all the training examples: a loop over all examples is called an epoch, and the training lasts many epochs until some stop condition is satisfied. As a result of the gradient descent over the loss function, the network output tends to become increasingly accurate.


The lengthy network training can be remarkably accelerated through several methods, including the already mentioned momentum. A significant improvement can be achieved by using Stochastic Gradient Descent (SGD) instead of the basic gradient-descent algorithm. Indeed, the widely used SGD estimates the loss gradient by taking the average gradient over a mini-batch of examples drawn independently from the training set. This random sampling introduces a source of noise (which does not vanish even at a minimum of the loss), so SGD requires the learning rate to be progressively reduced over the epochs: such a reduction is often implemented through step-wise drops, multiplying the learning rate by a positive constant γ < 1.
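A minimal sketch of an SGD loop with momentum and step-wise learning-rate drops, under the assumption of a hypothetical grad_fn that returns the average mini-batch gradient (the batch size, γ and the drop schedule are illustrative values):

```python
import numpy as np

def sgd_train(w, grad_fn, train_set, epochs=30, batch_size=64,
              lr=0.01, momentum=0.9, gamma=0.1, step=10):
    rng = np.random.default_rng(0)
    velocity = np.zeros_like(w)
    for epoch in range(epochs):
        rng.shuffle(train_set)                       # draw examples independently
        for start in range(0, len(train_set), batch_size):
            batch = train_set[start:start + batch_size]
            g = grad_fn(w, batch)                    # average loss gradient on the mini-batch
            velocity = momentum * velocity - lr * g  # momentum keeps the previous direction
            w = w + velocity                         # move against the gradient
        if (epoch + 1) % step == 0:
            lr *= gamma                              # step-wise learning-rate drop, gamma < 1
    return w
```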

The training must address two opposite problems, both tightly linked with the network capacity, i.e. the network's ability to model a given function:

• Underfitting (not learning): besides occurring at the very beginning of the training, it occurs when the capacity is not sufficient, for instance because the network has too few hidden neurons. Underfitting also takes place when the gradient descent does not lead to a configuration of parameters with small loss, possibly because it gets trapped in a local minimum of the non-convex error surface whose loss is much greater than that of other minima.

• Overfitting: it takes place when a network with sufficient or too much capacity with respect to its task starts to learn meaningless details of the training data, thus becoming unable to fit or classify different unseen inputs.

Figure 1.4: Underfitting and overfitting of a system aimed to perform function fitting upon some training points (the bold dots) [20, Fig. 5.2].


Hence, the ultimate goal of the training is to obtain a configuration of parameters such that the system not only correctly classifies the training data, but also accurately recognizes new input data.

In other words, the optimization of the loss function over the training examples is just a proxy for the true goal: we want the system to generalize beyond the training data, recognizing inputs that it has not seen before.

With the purpose of achieving an acceptable generalization, there exist a number of regularization techniques (discussed in §1.4.3.7): the simplest one is the early stopping of the training, based on splitting the available dataset into three parts:

• The training set, whose examples are supplied to the network in order to train it by updating the weights (for instance through SGD) so as to decrease the loss function.
• The validation set, used to stop the training when the network configuration becomes overfitted with respect to the training data: overfitting is detected by watching whether the network is classifying the validation examples worse and worse, which suggests that it is losing the ability to generalize.

• The test set, used to compare the performances of two different networks without using the training and validation sets over which those networks have been tuned.

Figure 1.5: As shown by this plot [20, Fig. 7.3], at the beginning of training both the errors on the training and validation sets are high. By optimizing the negative log-likelihood loss function (described in §1.4.3.5) over the training set, both errors decrease, but after some epochs the validation-set error stops decreasing because of overfitting.


The loss signal of the network is composed of two parts, called bias and variance (the former not to be confused with the bias parameters of neurons):

• The bias is the “learner’s tendency to consistently learn the same wrong thing” [12], being the expected deviation from the target value.

• The variance represents the "tendency to learn random things irrespective of the real signal" [12], so it measures the "deviation from the expected output value that any particular sampling of the data is likely to cause" [20, §5.4].

Figure 1.6: Domingos [12] proposes an effective analogy for bias and variance, to be interpreted as error components during darts throwing.

1.4.2 The Representation Issue

1.4.2.1 Local vs Distributed Representation

The representation of the data exchanged between the layers is another critical design choice in machine learning, as is usual for data representation in computer science: it directly affects learning performance.

Different solutions exist for this issue, among them the local representation and the distributed representation: the representation locality of the data produced as output by a network layer is a consequence of how the features are matched and generalized in that layer.


Let us assume that the layer's input is broken into small partitions. Given an input partition $\hat{x}$ and a feature $\hat{k}$, both being vectors in the input space, local feature matching consists in having a neuron which gets activated when $\hat{x}$ is near $\hat{k}$. The Gaussian kernel in Eq. (1.1), for some size $\sigma$, is a typical example of matching local templates.

$$G(\hat{x}, \hat{k}) = e^{-\frac{\|\hat{x}-\hat{k}\|^2}{\sigma^2}} \quad (1.1)$$

Such an input-space local matching requires many parameters in order to detect the features in each one of the input partitions. Furthermore, local generalization is based on the smoothness prior, which is the assumption that the target function can be adequately approximated by a function which is smooth over the extent of the input partitions.

Hence, the more highly-varying the target function is in the input space, the larger the number of parameters we need to learn, and consequently the number of training examples necessary to reach a given generalization level grows significantly.

In a distributed representation, instead, "the information is not localized in a particular neuron but distributed across many, [in order to] bring close to each other examples which share abstract characteristics that are relevant factors of variation of the data distribution" [1]. By encoding the underlying regularities of the data, a distributed representation is much more efficient because of its reduced number of parameters (higher compactness), so fewer training examples are needed to achieve a given generalization level.

As an analogy, Bengio [1] indicates two representations for an integer $i \in \{0, 1, \ldots, N-1\}$: a local representation would be the one-hot representation of $i$ (a vector of $N$ bits with a single one at index $j = i$, while all other elements are zeros); a distributed representation would instead be a compact vector of just $\lceil \log_2 N \rceil$ bits.
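Bengio's analogy can be made concrete in a few lines of Python (the value of N and the encoded integer are arbitrary): the local (one-hot) code needs N elements, while the distributed (binary) code needs only ⌈log₂ N⌉ bits.

```python
import math

N = 1000                      # number of distinct integers to represent
i = 693                       # the integer to encode

# Local (one-hot) representation: N elements, a single 1 at index i.
one_hot = [1 if j == i else 0 for j in range(N)]

# Distributed (binary) representation: only ceil(log2(N)) bits.
n_bits = math.ceil(math.log2(N))
binary = [(i >> b) & 1 for b in reversed(range(n_bits))]

print(len(one_hot), sum(one_hot))  # 1000 elements, one of them set
print(n_bits, binary)              # 10 bits encode the same integer
```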

1.4.2.2 Sparsity and its Advantages

Sparsity consists in having only a small subset of neurons activated at a given time. According to Lennie [39], it appears that in the human cortex, too, no more than 1% of neurons are simultaneously active, due to the energy cost of a spike. Continuing the example of the binary representation of an integer number, a sparse representation forces just a few bits to be ones while all the others are zeros.


Therefore, a sparse representation lies somewhere in between a fully local representation (which is maximally sparse) and a dense distributed representation.

Sparsity has several noteworthy advantages, as described by Glorot et al. [19], including the following:

• Sparse representations are not as efficient as dense distributed representations, but they are still far more efficient than purely local ones.

• The combination of sparsity with the robustness to small input changes (yielded by the distributed representation) makes it possible for a small set of non-zero neuron activations to be well preserved even when small changes of the input occur. This clearly allows information disentangling, which is one of the most relevant building blocks of the feature hierarchy of deep architectures.

• By varying the number of active neurons, it is possible to adequately represent concepts which need different amounts of information and precision to be usefully described to the following layers.

1.4.2.3 Human-crafted Features vs Representation Learning

In classical machine learning, the features were carefully designed ad hoc for every application. Unfortunately, this approach may take a community of field specialists even decades, typically being a complex and bias-prone task.

Representation Learning, instead, lets the intelligent system learn the features themselves. In this way, they can be obtained in a much shorter time (the training time), without needing teams of specialized researchers, while also achieving very good testing performance.

The role of the designers is then to build the neural network architecture and to collect a useful dataset, also taking into account all those techniques that foster sparse distributed representation of data exchanged between the network layers. Such techniques, like the use of ReLUs as activation function of hidden neurons, are discussed in the following Sections.


1.4.3 Deep Learning

1.4.3.1 Feature Hierarchies

All the ideas cited so far, however, proved not to be sufficient to solve more complex tasks like object recognition. To recognize a face, for instance, representation learning alone does not suffice: a "face" is a high-level concept composed of finer-grained sub-concepts (mouth, nose, ...) which ultimately rely on local features like corners and edges. How can such high-level abstract features be extracted from a photo?

As introduced by Goodfellow, Bengio and Courville, "Deep learning solves this central problem in representation learning by introducing representations that are expressed in terms of other, simpler representations. Deep learning allows the computer to build complex concepts out of simpler concepts" [20, Ch. 1]. Similarly, Ranzato points out that "Deep Learning method makes predictions by using a sequence of non-linear processing stages: the resulting intermediate representations can be interpreted as feature hierarchies and the whole system is jointly learned from data" [13].

Hence, a hierarchy of features of increasing complexity is needed, and a network with correspondingly many layers should be employed to detect and recognize such structured entities as a face in an input image.

1.4.3.2 Network Depth

The depth of a neural network architecture refers to the number of the non-linear processing stages mentioned by Ranzato. More precisely, it is the maximum number of non-linear stages from any input to any output in the network's DAG. An architecture is said to be shallow if its depth is small. Bengio warns, however, that what really matters is not the absolute number of layers but rather "the number of levels relative to how many are required to represent efficiently the target function" [1].

The deep hierarchical approach, based on multiple levels of abstraction, recalls how the primate visual system works, as first described by Hubel and Wiesel in the '50s and '60s. Moreover, according to the current knowledge of brain anatomy as depicted by Serre et al. [51], it appears that the mammalian brain is organized as a deep architecture, where the visual system alone is structured over 5-10 layers.


Researchers have imagined deep networks for decades. Nevertheless, apart from some successful results obtained by adding a few more hidden layers to basically shallow architectures (e.g. the techniques proposed by LeCun et al. in [37] and [38]), researchers did not extensively investigate the training of neural networks with many hidden non-linear layers, for several reasons including:

• The lack of big labeled datasets for supervised training of many weights in networks with a high number of fully connected neurons.

• The bad performance of deep feed-forward neural networks whose hidden layers are randomly initialized, due to the diffusion of the error gradient across many layers, which makes the derivatives of the overall loss (with respect to the weights) very small, especially in the earlier layers, hindering them from learning anything useful.
• The common (mis)belief that the training of deep networks cannot work because of the presence of many local minima "traps" in the error surface.

1.4.3.3 The Neural Networks Renaissance

The so-called renaissance [20] in artificial neural networks started in 2006, when Hinton et al. described in [24] how to effectively train a many-layered feed-forward neural network by first pre-training the network one layer at a time in an unsupervised² way (instead of randomly initializing the weights), then fine-tuning the network with supervised error back-propagation.

Starting from that breakthrough, other techniques were proposed, again exploiting the same principle of unsupervised pre-training where learning is performed locally at each level, each one mapping its inputs to a higher-level (more abstract) feature space. Since this learning process "remains local to each layer, [it] side-steps the issue of gradient diffusion that might be hurting gradient-based learning of deep neural networks" [1].

² Unsupervised training is aimed at learning the hidden structure of unlabeled training examples, for instance by applying clustering methods.

Deep neural networks, with more and more neurons and a growing number of neuron connections, are now part of many state-of-the-art methods in several fields, winning reference competitions like the ImageNet Large Scale Visual Recognition Competition (ILSVRC). Most notably, their use for computer vision tasks has surpassed human performance. For instance:

• The human top-5 error rate in image classification over the ImageNet dataset is 5.1%, according to Russakovsky et al. [46]. The 152-layer ResNet neural network by Microsoft researchers He et al. [23], instead, achieves a 4.49% top-5 error rate. Moreover, the ensemble³ containing ResNet-152 won ILSVRC 2015 by reaching a 3.57% top-5 error rate.

• The human accuracy in face verification between two tight face crops is 97.53% in accordance with results by Kumar et al. [35], whereas the FaceNet neural network proposed by Google researchers Schroff et al. [49] reached 99.63% on the LFW dataset in 2015.

A further example of the recent success of deep learning is given by Google's AlphaGo, which defeated the human world champion Lee Sedol 4-1 in March 2016. Silver et al. [52] state that "AlphaGo combines Monte-Carlo tree search with deep neural networks that have been trained by supervised learning, from human expert games, and by reinforcement learning from games of self-play".

The significant strides of deep learning in the last years, especially for computer vision applications, have been made possible primarily for two reasons:

• The huge increase of the amount of data in the training datasets which are available in the current Digital Age has made it possible to train the neural networks with far more data, hence achieving stronger generalization.

• Many hardware improvements and software infrastructure have been introduced, supporting deeper and bigger networks.

The most noteworthy of the hardware innovations has been the introduction of general-purpose graphics processing units (GP-GPUs), well suited for efficient computation of the matrix math involved in deep learning. More to the point, NVIDIA created the CUDA architecture, a parallel computing platform and application programming interface that lets software developers take advantage of CUDA-enabled GPUs for general-purpose processing.

GP-GPUs power not only ultra-performant systems like the NVIDIA DGX-1, but they also allow advanced (yet slower) research for a few thousand euros.

³ Ensemble methods reduce generalization error by combining the results produced by several models with different architectures. See §1.4.3.7 for more details.

With respect to the software infrastructure, NVIDIA has also released the cuDNN software library [8] for GPU-accelerated execution of primitives for deep neural networks. Moreover, many software packages have been released in recent years to make it easier to work on deep learning: see Chapter 4 for details.

The support for bigger networks led Choromanska et al. [10] to observe that local minima in the error surface are not a problem in large-size networks; on the contrary, "poor quality local minima have non-zero probability of being recovered" in small-size networks. Furthermore, some algorithmic novelties also helped to improve neural network performance, including: the widespread adoption of the ReLU in place of the classical activation functions for hidden neurons; the replacement of the MSE with the cross-entropy family of loss functions; and the dropout regularization technique.

1.4.3.4 The Rectified Linear Unit (ReLU)

Classical activation functions have a sigmoidal shape, like the logistic function: it is non-linear, completely differentiable and saturates at output values of 0 or 1. Similarly, the tanh function (non-linear and differentiable too) saturates at -1 or 1, so it is also called the "rescaled sigmoid". Fig. 1.7 depicts the plots of both of them along with the piecewise-linear ReLU activation function rectifier(x) = max{0, x}.

Figure 1.7: Rectifier, logistic and hyperbolic tangent functions.


The rectifier max{0, x} can be used for gradient-based learning even if it is not actually differentiable at all input points, more precisely at x = 0: this is not a problem, since one can just use the rectifier's left or right derivative at zero.

Starting from Nair et al. [41], the ReLU has gained wide adoption since it enables larger and deeper networks, as shown by Krizhevsky [34]. Indeed, even big networks using the ReLU as activation function of the hidden neurons can be trained effectively without requiring any unsupervised pre-training; Glorot et al. [19] described why this happens:

• ReLUs foster hard zeros as activation outputs for inactive neurons (instead of the small but non-zero values produced by a logistic activation), thus yielding the advantages of sparse representations of the intermediate features (see §1.4.2.2).
• Thanks to the ReLU's piecewise linearity, the gradients flow unsaturated over the active synapses, and therefore there is no vanishing effect on the error gradient.

• The computations of the activation value and its partial derivatives are clearly much simpler and faster.

Figure 1.8: As shown by Krizhevsky et al. [34], which also reported this plot, “deep convolutional neural networks with ReLUs (solid line) train several times faster than their equivalents with tanh units (dashed line).”

Another valuable remark on ReLUs is about their biological coherence with the cortical neurons that Glorot et al. describe as "rarely in their maximum regime, suggesting that their activation function can be approximated by a rectifier" [19], which does not saturate when the neuron is active.


1.4.3.5 The Output Layer and the Evaluation Function

The activation function used in the neurons of the last layer directly affects the result of the loss function to be minimized.

Theoretically, any function could be used for neuron activation of the output layer, for instance the logistic function. In practice, however, for classification tasks the convenient way to represent the output is as a probability distribution over a discrete variable which can assume C different values (labels).

The most commonly used such function is the softmax, whose outputs are normalized so as to lie in [0, 1] and sum to 1. The i-th output of the softmax, representing the probability that the network input belongs to the i-th class, is given by Eq. (1.2), where $q_j$ is the j-th among the m inputs of the output layer:

$$\mathrm{softmax}(q)_i = \frac{e^{q_i}}{\sum_{j=0}^{m-1} e^{q_j}} \quad (1.2)$$

Let us consider an example network devoted to classifying inputs among three classes. Let us also assume that the penultimate layer produces a score vector $q = [2; 3; 0.5]$ (indicating detection scores), so $e^{q_0} \approx 7.4$, $e^{q_1} \approx 20.1$, $e^{q_2} \approx 1.6$, whose sum is approximately 29.1. Therefore, the resulting probabilities for the three classes are $y_0 \approx 0.254$, $y_1 \approx 0.69$, $y_2 \approx 0.057$.

The softmax saturates to zero or one as the difference between one input and the others increases: for instance, the softmax outputs associated with $q = [1; 4; 12]$ are $y \approx [0.00002; 0.00034; 0.99965]$.
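The worked example above can be reproduced with a few lines of NumPy; subtracting the maximum input before exponentiating is a common numerical-stability trick that does not change the result:

```python
import numpy as np

def softmax(q):
    e = np.exp(q - np.max(q))   # shift by the max for numerical stability
    return e / e.sum()

print(softmax(np.array([2.0, 3.0, 0.5])))   # ~[0.254, 0.690, 0.057]
print(softmax(np.array([1.0, 4.0, 12.0])))  # ~[0.00002, 0.00034, 0.99965]
```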

The neural network learns by minimizing an objective function (the cost or loss function) which evaluates the fitting of the output layer result with respect to the target function. Several possible evaluation functions have been proposed in the literature, the classical one being the Mean Squared Error (MSE).

Let $y_i$, $t_i$ be the output and target vectors for the i-th example among the N training examples: their size is C (the number of classes).

The MSE outputs a vector whose j-th element is the mean of the squared errors in predicting class j over all the N examples:

$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - t_i)^2 \quad (1.3)$$


The MSE has been extensively used for many years as an evaluation function, but it performs very poorly with saturating neuron activations, as happens both with the softmax (as previously discussed) and with sigmoid functions (e.g., tanh saturates at -1 and +1 for extremely negative or extremely positive inputs). When the output neuron activations get saturated, the MSE saturates too, so the back-propagated error signals are too small to let the network learn by updating its weights.

This gradient-vanishing effect can be addressed by the combination of the softmax activation with a different evaluation function based on negative log-likelihood.

As already mentioned, if it does make sense to interpret the network outputs as probabilities (like for a classification task among disjoint classes), then the softmax is an appropriate activation function for the output neurons.

Let $f(y \mid x)$ be the unknown function we would like the network to learn: the conditional probability distribution which represents the correct probabilities for the outputs $y$ (discrete random variable $Y$) given the network inputs $x$ (discrete random variable $X$). Function subscripts, as for $f$, are omitted for simplicity of notation when deducible from the context. Ideally, the probability $p(Y = \bar{t} \mid X = \bar{x}) = 1$ if $\bar{t}$ is the target for input $\bar{x}$. Let also $\hat{f}_{data}(y \mid x)$ be the empirical distribution of the available examples for supervised training, linking the inputs to the corresponding targets. Lastly, let $f_{model}(y \mid x; w)$ be a parametric family of probability distributions, where each configuration $w$ indexes an estimate of $f(y \mid x)$, and let $p_{model}(\bar{y} \mid \bar{x}; w)$ be the associated probability for a specific $\bar{y}$.

By having some observed data (the training examples) and a parametric model (the deep network architecture), it is possible to determine the weights configuration $w_{ML}$ which indexes the most likely probability distribution that associates the training inputs with their target outputs, assuming that:

• The architecture has sufficient capacity to model f (y | x).

• The available training examples are enough to let the weights converge to a network configuration which sufficiently models the task.

In other words, one only needs to determine the weights that maximize the likelihood function $L(w; x, y) = p_{model}(t \mid x; w)$, which represents the likelihood of a weight configuration that tunes the model to produce the target $t$ as output given an input $x$:

$$w_{ML} = \arg\max_{w} \; p_{model}(t \mid x; w) \quad (1.4)$$

If the m training examples are independent and identically distributed and we index them by i, then:

$$w_{ML} = \arg\max_{w} \prod_{i=0}^{m-1} p_{model}(t_i \mid x_i; w) \quad (1.5)$$

Since Eq. (1.5) is subject to numeric underflow, the log-likelihood is introduced: by applying a logarithm, the optimum $w_{ML}$ does not change, but we can replace the product with a sum as in Eq. (1.6). The logarithm also plays a central role when applied to the softmax's outputs, as we will discuss later on.

$$w_{ML} = \arg\max_{w} \sum_{i=0}^{m-1} \log p_{model}(t_i \mid x_i; w) \quad (1.6)$$

By dividing every term of the sum by m, the arg max stays unchanged, but we can express $w_{ML}$ as the configuration that maximizes the expected log-likelihood over the empirical distribution of the training examples:

$$w_{ML} = \arg\max_{w} \; \mathbb{E}_{\hat{f}_{data}} \left[ \log p_{model}(t \mid x; w) \right] \quad (1.7)$$

The maximization of the likelihood function can also be interpreted as a minimization problem, suitable for gradient descent, for the negative log-likelihood loss function:

$$-\mathbb{E}_{\hat{f}_{data}} \left[ \log p_{model}(t \mid x; w) \right] \quad (1.8)$$

Besides addressing numerical underflow while computing the likelihood, the logarithm effectively handles the softmax's exponential terms. Let $q_j$ be the j-th input of the output layer, as in Eq. (1.2); then:

$$\log \mathrm{softmax}(q)_i = q_i - \log \sum_{j} e^{q_j} \quad (1.9)$$

Since the first term does not saturate, it directly influences the function even if it is a relatively small term in the summation, so the back-propagated error does not vanish and learning continues.

Moreover, Goodfellow, Bengio and Courville observe that the negative log-likelihood objective function “strongly penalizes the most active incorrect prediction” [20, §6.2].
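A minimal sketch of this loss, computed through Eq. (1.9) so that the softmax exponentials never overflow (NumPy; the score vector and target classes are arbitrary example values):

```python
import numpy as np

def log_softmax(q):
    # Eq. (1.9): log softmax(q)_i = q_i - log sum_j exp(q_j),
    # evaluated with the usual max-shift for numerical stability.
    q = q - np.max(q)
    return q - np.log(np.sum(np.exp(q)))

def negative_log_likelihood(q, target_class):
    # The loss is the negative log-probability assigned to the correct class.
    return -log_softmax(q)[target_class]

scores = np.array([2.0, 3.0, 0.5])          # outputs of the penultimate layer
print(negative_log_likelihood(scores, 1))   # small: class 1 is the most likely
print(negative_log_likelihood(scores, 2))   # large: the confident wrong class is penalized
```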


1.4.3.6 Checking Similarity between Inputs

The score vector produced by the penultimate layer, besides being supplied to the following output layer, can have another interesting use at inference time.

Indeed, by applying a distance measure between two higher-level feature vectors produced by the penultimate layer of a deep architecture, we can estimate the similarities between the corresponding network inputs.

For instance, one could first normalize the feature vectors so that they lie on an m-dimensional hypersphere with radius one. Let us call them $q^{(1)}$, $q^{(2)}$. Then, their (dis)similarity can be computed through the Minkowski distance:

$$d\left(q^{(1)}, q^{(2)}\right) = \sqrt[p]{\sum_{j=0}^{m-1} \left| q^{(1)}_j - q^{(2)}_j \right|^p} \quad (1.10)$$

The Minkowski distance coincides with the Euclidean distance when p = 2.
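A minimal sketch of Eq. (1.10) applied to two L2-normalized feature vectors (NumPy; the vectors are random illustrative values):

```python
import numpy as np

def minkowski(a, b, p=2):
    # Eq. (1.10); p = 2 gives the Euclidean distance.
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

rng = np.random.default_rng(1)
q1 = rng.standard_normal(128)
q2 = rng.standard_normal(128)

# Normalize the descriptors onto the unit hypersphere before comparing them.
q1 /= np.linalg.norm(q1)
q2 /= np.linalg.norm(q2)

print(minkowski(q1, q2, p=2))                 # Euclidean distance
print(np.allclose(minkowski(q1, q2, p=2),     # same result via NumPy's norm
                  np.linalg.norm(q1 - q2)))
```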

1.4.3.7 Regularization against Overfitting

A regularization technique is "any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error", as defined by Goodfellow, Bengio and Courville [20, §5.2]. The reduction of the generalization error can be achieved by applying one or more of such techniques. In this Section some of them, perhaps the most widely adopted ones, are introduced.

As described in §1.4.1, a simple regularization technique is the early stopping strategy. Instead of stopping the optimization on the training set after a fixed number of epochs, this strategy stops the learning when the loss function over the validation set has not decreased for a certain number of epochs. The resulting weights are the parameters that were in use at the epoch where the smallest validation loss was measured.
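A minimal sketch of this strategy, assuming hypothetical train_one_epoch and validation_loss helpers and an arbitrary patience value:

```python
import copy

def train_with_early_stopping(model, train_set, val_set,
                              train_one_epoch, validation_loss,
                              max_epochs=100, patience=5):
    best_loss = float("inf")
    best_model = None
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch(model, train_set)            # e.g. one pass of SGD
        loss = validation_loss(model, val_set)       # monitored on the validation set
        if loss < best_loss:
            best_loss = loss
            best_model = copy.deepcopy(model)        # remember the best configuration
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                                # validation loss stopped improving
    return best_model                                # parameters from the best epoch
```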

The straightforward way to reduce the generalization error, however, is simply to train the network over many more examples.

Unfortunately, this is not directly possible since the available dataset usually has a finite size. The commonly used approach is to insert fake data: this strategy is called data augmentation. Clearly, the fake examples must be coherent with the rest of the dataset and the system task. For instance, by considering classification tasks for computer vision:


• Horizontal flipping of dataset images is a valid trick for doubling available photos for training a network whose task is object recognition.

• No kind of flipping can be applied to the training images if the neural network's task is the recognition of handwritten digits.

Therefore, different tasks support different data augmentation strategies. We observe, however, that a neural network need not be entirely devoted to just one task. Multi-task learning is an effective regularization technique where the network is trained for more than one task over several datasets (each one possibly augmented coherently with its task). In such a case, the network is composed of two sequences of layers: a generic one and a task-specific one. The former is trained for all the tasks, so it is configured with generic parameters shared across all of them, while the latter is trained only for the corresponding task. The aim of this technique is to reduce the generalization error in the generic part of the network, which in turn leads to stronger learning.

Model ensembles are another example of a regularization strategy, where different learners are combined together according to some method, for instance bagging, boosting and stacking, as synthesized by Domingos [12]: bagging is the simplest of them, since it simply averages the outputs inferred by different models.

Goodfellow, Bengio and Courville summarize the advantage of this technique by writing that "if the [ensemble] members make independent errors [then] the ensemble will perform significantly better than its members" [20, §7.11]. The k ensemble members can have different architectures, and they are trained separately over k different datasets, typically generated through random picking of samples from the original dataset.

Dropout is a powerful regularization technique proposed by Srivastava et al. [55]. It can be seen as a bagging of k models whose architectures are all derived from a single base neural network N: each ensemble member differs from the others because some of the non-output neurons are randomly removed by zeroing the corresponding weights. Ideally, the overall output would be the average of the outputs produced by all the nets derived in such a way. In practice, the removal of non-output neurons is performed by randomly generating a binary mask: a mask value of one⁵ causes a unit from N to be included in the network, whose actual architecture thus changes while it learns.

⁵ The probability p of sampling a mask value of 1 is a fixed hyper-parameter. According to Goodfellow, Bengio and Courville [20, §7.12], typically p = 0.8 for an input unit and p = 0.5 for a hidden unit.
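A minimal sketch of a dropout layer at training time (NumPy); note that this is the "inverted dropout" variant, which rescales the surviving activations by 1/p so that no extra scaling is needed at inference time:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p=0.5, training=True):
    # p is the probability of *keeping* a unit (typically 0.5 for hidden units,
    # 0.8 for input units). A random binary mask decides which units stay
    # active; the surviving activations are rescaled by 1/p so that the
    # expected activation is unchanged and inference needs no scaling.
    if not training:
        return activations
    mask = rng.random(activations.shape) < p
    return activations * mask / p

h = np.array([0.7, 1.3, 0.0, 2.1, 0.4, 1.8])
print(dropout(h, p=0.5))   # roughly half of the units are zeroed out
```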


Dropout has several advantages. First, it is computationally cheap. Second, it works well with the models we use later on, where layer interfaces exchange data in distributed representations. More importantly, the random noise it inserts hinders unit co-adaptation (the dependency of a neuron's activation on particular other neurons), which in turn means that neurons learn more robust features, as pointed out by Srivastava et al. in [55] and Krizhevsky et al. in [34].

Regularization can also be achieved by incorporating some prior knowledge about the target function during network training. Priors are necessary due to the curse of dimensionality: given a finite-size dataset, the higher the number of features (the dimensionality), the exponentially harder generalization becomes, because the number of weights to learn grows. Hence, data alone cannot suffice for learning real-world problems, even after extensive dataset augmentation.

Since the no free lunch theorems by Wolpert and Macready state that "no learner can beat random guessing over all possible functions to be learned" [70], Domingos argues that a learner "must embody some knowledge or assumptions beyond the data it's given in order to generalize beyond it" [12], trading variance for a little more bias. An example of such a prior is given by the smoothness prior, discussed in §1.4.2.1.

Priors can be represented through regularization penalties. In that case, regularization takes place by modifying the loss function to be optimized so that it also includes a regularizer term $\lambda \cdot \Omega$, where $\lambda$ is often a very small hyper-parameter⁶.

A simple regularization term in the objective function is the weight decay, whose most commonly used form is the L2 parameter regularization $\Omega(w) = \frac{1}{2}\|w\|_2^2$, where $\|w\|_p = \left(\sum_{i=1}^{m}\sum_{j=1}^{n} |w_{i,j}|^p\right)^{1/p}$ is the p-norm of the $m \times n$ weight matrix. This method alters the loss by penalizing weights with larger magnitude: large weights would induce a higher variance on the network output, leading the network to overfit.

A penalty term like the weight decay regularizer can also be added to the objective function with the aim of promoting sparse representations. Sparsity, described in §1.4.2.2, is achieved by increasing the cost of a weight configuration as the norm of the elements of the intermediate representations grows.
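A minimal sketch of such penalty terms added to a data loss (NumPy; the loss value, the weight matrix, the intermediate representation and the coefficients are arbitrary illustrative values):

```python
import numpy as np

def l2_weight_decay(W):
    # Omega(w) = 1/2 * ||w||_2^2, penalizing large-magnitude weights.
    return 0.5 * np.sum(W ** 2)

def l1_sparsity_penalty(H):
    # L1 norm of an intermediate representation, promoting sparse activations.
    return np.sum(np.abs(H))

rng = np.random.default_rng(3)
W = rng.standard_normal((256, 128))   # a weight matrix of the network
H = rng.standard_normal(128)          # an intermediate representation
data_loss = 0.42                      # e.g. a negative log-likelihood value

lambda_wd, lambda_sp = 5e-4, 1e-5     # small regularization hyper-parameters
total_loss = (data_loss
              + lambda_wd * l2_weight_decay(W)
              + lambda_sp * l1_sparsity_penalty(H))
print(total_loss)
```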

⁶ A hyper-parameter is a model parameter that is not learned during network training: instead, it is defined a priori. The number of network layers and the learning rate are examples of hyper-parameters.


1.4.4 Convolutional Neural Networks

1.4.4.1 Local Connectivity and the Convolutional Layer

In the classical multi-layer architecture, every layer is typically fully connected, which means that each neuron of each layer is connected to all the neurons of the previous layer: such an architecture tends to require a huge amount of weights to train. Moreover, the number of connections makes the error gradient vanish over too many neurons, especially when propagating through a randomly initialized deep network. Hinton et al. [24] addressed this issue in 2006 through unsupervised pre-training (see §1.4.3.3).

Convolutional neural networks (CNNs) follow a different approach, inspired by the neuroscientific model of the visual system. Their hidden layers are composed of convolutional and subsampling layers, in addition to fully-connected layers: convolutional and subsampling layers differ from the latter because their neurons are locally connected. Indeed, the spatial extent of the connectivity of a neuron (called its receptive field) in such layers is limited to a fixed subset of the outputs produced by the previous layer.

A convolutional layer is composed of a number of planes: all the neurons in a plane share the same weight configuration. A plane corresponds to a single filter (or kernel), which all the neurons in that plane match against the results coming from the previous layer. The neurons in a plane perform feature detection over different, overlapping small neighborhoods of the input data, and the combination of their outputs is called a feature map. Therefore, a convolutional layer produces a set of feature maps, each one describing whether (and where) a higher-level feature has been detected in the subregions of the input lower-level feature maps.

The CNNs’ name comes from how features are detected. A neuron in a convolutional layer computes the detection score of a feature in its receptive field by passing to the activation function the result of a discrete convolution between the input patch and the feature’s kernel (instead of their multiplication).

The basic one-dimensional discrete convolution c(t) = x(t) ∗ k(t) is defined as follows:

$$c(t) = (x * k)(t) = \sum_{a=-\infty}^{\infty} x(a)\,k(t-a) \qquad (1.11)$$


For finite-length inputs and kernels, Eq. (1.11) reduces to a sum over a finite number of valid elements. The invalid terms, corresponding to the elements out of index range, are assumed to be equal to zero.

Let’s now consider the first hidden layer of a neural network for a computer vision task: the input feature map is an image I. Let $c_I$ be the number of input channels, each one having a width of $w_I$ pixels and a height of $h_I$ pixels. Let’s assume that I is in gray-scale, so $c_I = 1$. Let the kernels of the convolutional layer be bi-dimensional too (being matrices with $w_K$ columns and $h_K$ rows): they produce as outputs $c_O$ feature maps having $w_O$ columns and $h_O$ rows. In the following equations of this Section, the elements in all these matrices are referenced as (column, row) instead of (row, column). Given I and a kernel K, a plane in a convolutional layer produces a feature map given by $A = f(b + I * K)$, where b is the bias and f is the activation function (e.g. the ReLU). Each element $M(i, j) = (I * K)(i, j)$ is given by the sum over valid terms in:

$$M(i, j) = \sum_{m} \sum_{n} I(m, n)\,K(i - m, j - n) \qquad (1.12)$$

As m or n increases, the convolution in Eq. (1.12) is applied between elements of increasing index of the input and decreasing index of the kernel, so the latter is traversed as if it were flipped. For instance, let I be a gray-scale square image whose sides are 100 pixels long, and let $h_K = w_K = 2$. When computing the value of an example point (42, 84) in the output map, most of the summed terms are invalid, as for instance I(0, 0)K(42, 84) + I(0, 1)K(42, 83) + . . ., and the actual result for that point is given by I(41, 83)K(1, 1) + I(41, 84)K(1, 0) + I(42, 83)K(0, 1) + I(42, 84)K(0, 0). Owing to the commutative property of convolution, M can be rewritten as in Eq. (1.13), where the input matrix is traversed from the last value (the “bottom-right” corner) to the first value (the “top-left” corner).

$$M(i, j) = \sum_{m} \sum_{n} I(i - m, j - n)\,K(m, n) \qquad (1.13)$$

By applying Eq. (1.13) to the previous example, the result does not change thanks to the commutative property:

$$M(42, 84) = I(42, 84)K(0, 0) + I(42, 83)K(0, 1) + I(41, 84)K(1, 0) + I(41, 83)K(1, 1)$$

Nevertheless, applying the commutative property makes it possible to compute the convolution more efficiently, given that the kernels are typically smaller than the input maps: the index ranges of the two summations are just $m \in [0, w_K - 1]$ and $n \in [0, h_K - 1]$.


In both Eqs. (1.12) and (1.13), I and K are traversed in opposite directions. While the input flipping in Eq. (1.13) is required by the commutative property, it is not necessary for the way the neural network works; thus, many software libraries implement convolution by actually computing the cross-correlation of Eq. (1.14). The result can only be computed for valid rows and columns: the kernel is only convolved with receptive fields entirely contained in the input map.

$$M(i, j) = \sum_{m=0}^{w_K - 1} \sum_{n=0}^{h_K - 1} I(i + m, j + n)\,K(m, n) \qquad \forall \text{ valid } i, j \qquad (1.14)$$
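As an illustration of Eq. (1.14), a minimal NumPy sketch of the valid cross-correlation of a single-channel map follows; the function name cross_correlate_valid is hypothetical, and NumPy’s (row, column) indexing is used instead of the (column, row) convention of the equations, which merely transposes the result.

    import numpy as np

    def cross_correlate_valid(I, K):
        """Valid cross-correlation of a 2-D input map I with a 2-D kernel K (Eq. 1.14).

        The kernel is only matched against receptive fields entirely contained in I,
        so the output has (hI - hK + 1) rows and (wI - wK + 1) columns.
        """
        hI, wI = I.shape
        hK, wK = K.shape
        hO, wO = hI - hK + 1, wI - wK + 1
        M = np.zeros((hO, wO))
        for i in range(hO):            # valid output rows
            for j in range(wO):        # valid output columns
                # element-wise product of the receptive field with the (unflipped) kernel
                M[i, j] = np.sum(I[i:i + hK, j:j + wK] * K)
        return M

    # Example with the sizes of Fig. 1.9: a 3x4 input and a 2x2 kernel give a 2x3 output
    I = np.arange(12, dtype=float).reshape(3, 4)
    K = np.array([[1.0, 0.0], [0.0, -1.0]])
    print(cross_correlate_valid(I, K).shape)   # (2, 3)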

Figure 1.9: Example of 2-D convolution without kernel flipping [20, Fig. 9.1].

We previously simplified the scenario by assuming that the input has just one channel ($c_I = 1$) but in general that’s not true: typically, an input image has three channels. Moreover, if a network layer has $c_O$ planes of neurons sharing weights, then the following layer has to support that number of input channels. So, in general, we redefine the convolution (without kernel flipping) over $c_I$ input channels against a bidimensional kernel as follows, where I(i, j, c) is the input value in column i, row j, channel c:

$$M(i, j) = \sum_{c=0}^{c_I - 1} \sum_{m=0}^{w_K - 1} \sum_{n=0}^{h_K - 1} I(i + m, j + n, c)\,K(m, n) \qquad \forall \text{ valid } i, j \qquad (1.15)$$


Computing the convolution involves many nested loops: three accumulation loops (over c, m, and n as in Eq. (1.15)) and up to four more independent loops over the output channels, the output width and height, and possibly the batch of input samples. Hence, efficient implementations are needed, as described by Chetlur et al. in [8].
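To make that loop nest explicit, here is a deliberately naive Python sketch of Eq. (1.15) for a multi-channel input and $c_O$ kernels; the name conv_layer_naive is hypothetical, and production libraries such as the cuDNN primitives described in [8] use far more sophisticated strategies.

    import numpy as np

    def conv_layer_naive(I, K):
        """Naive multi-channel "convolution" (cross-correlation) as in Eq. (1.15).

        I : input of shape (hI, wI, cI)
        K : kernels of shape (cO, hK, wK, cI), one 3-D kernel per output feature map
        Returns M of shape (hO, wO, cO), with hO = hI - hK + 1 and wO = wI - wK + 1.
        """
        hI, wI, cI = I.shape
        cO, hK, wK, _ = K.shape
        hO, wO = hI - hK + 1, wI - wK + 1
        M = np.zeros((hO, wO, cO))
        for o in range(cO):                      # independent loop over output channels
            for i in range(hO):                  # independent loop over output rows
                for j in range(wO):              # independent loop over output columns
                    acc = 0.0
                    for c in range(cI):          # accumulation over input channels
                        for m in range(hK):      # accumulation over kernel rows
                            for n in range(wK):  # accumulation over kernel columns
                                acc += I[i + m, j + n, c] * K[o, m, n, c]
                    M[i, j, o] = acc
        return M

    # Example: a 5x5 RGB input and 4 kernels of size 3x3x3 give a 3x3x4 output
    I = np.random.default_rng(1).standard_normal((5, 5, 3))
    K = np.random.default_rng(2).standard_normal((4, 3, 3, 3))
    print(conv_layer_naive(I, K).shape)          # (3, 3, 4)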

Many convolutional neural networks employ 1 × 1 convolutional layers. Such layers convolve three-dimensional inputs ($c_I$ maps having size $h_I \times w_I$) with $c_O$ kernels, each one having a single-value receptive field ($w_K = 1$, $h_K = 1$): the “1 × 1 convolution” name implicitly refers to the $1 \times 1 \times c_I$-sized receptive fields supplied to the $c_O$ kernels. Let’s consider, for instance, a two-channel bidimensional input to a 1 × 1 convolutional layer: the result of applying a single kernel having parameters [1, −1] would be a simple point-wise difference of the two input feature maps. Indeed, the primary role of such layers is to vary the number of dimensions in the filter space from $c_I$ to $c_O$.
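The point-wise difference example can be checked with a few lines of NumPy, which also show that a 1 × 1 layer amounts to a per-pixel linear mixing of channels; the array names are illustrative only.

    import numpy as np

    # Two 4x4 input feature maps (cI = 2), stacked along the channel axis
    rng = np.random.default_rng(0)
    A, B = rng.standard_normal((4, 4)), rng.standard_normal((4, 4))
    I = np.stack([A, B], axis=-1)          # shape (hI, wI, cI) = (4, 4, 2)

    # A single 1x1 kernel with parameters [1, -1]: cO = 1
    K = np.array([1.0, -1.0])              # shape (cI,)

    # The 1x1 convolution is just a per-pixel linear combination of the channels
    out = I @ K                            # shape (4, 4): equals A - B
    print(np.allclose(out, A - B))         # True

    # More generally, a 1x1 layer with cO kernels is a (cI -> cO) channel mixing
    W = rng.standard_normal((2, 3))        # cI = 2, cO = 3
    out_multi = I @ W                      # shape (4, 4, 3)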

A common usage of 1 × 1 convolutional layers is to build the $1 \times 1 \times c_O$ score vectors supplied to the output layer of the network (see §1.4.3.5) or compared through a distance measure (see §1.4.3.6).

Apart from the case of 1 × 1 convolutional layers, it’s also important to observe that the receptive fields of the neurons in a convolutional layer can overlap. The distance s between the centers of neighboring receptive fields in the input feature map (called stride) and the size $h_K \times w_K$ of the receptive field are among the hyper-parameters of a convolutional layer. In Fig. 1.9, for instance, $s = 1$, $h_K = 2$, $w_K = 2$.

It is also possible to apply a different stride along each dimension, moving by $h_s$ rows and $w_s$ columns between neighboring neurons.

Since a kernel is only convolved with receptive fields entirely contained in the input map, features can’t be detected on the whole input map, specifically leaving out the rightmost and bottom boundaries in Fig. 1.9. Hence, the output map is smaller than the input. As a concrete example, consider again Fig. 1.9, where $h_I = 3$, $w_I = 4$ and $h_O = 2$, $w_O = 3$: if the stride $s = 1$, then the output map has $h_I - (h_K - 1)$ rows and $w_I - (w_K - 1)$ columns.

After many convolutions in a row, the produced data progressively shrinks, so one should either limit the number of layers or use smaller kernels. It is possible, however, to keep the feature map size constant: to this aim, the input map I is padded with zeros so that each kernel K is also applied to I’s boundaries. The numbers of padding rows $h_P$ and padding columns $w_P$ on each side of I’s channels are further layer hyper-parameters. Having introduced the main hyper-parameters of a convolutional layer, it is possible to identify the size of its output: each one of the $c_O$ kernels produces a feature map having $h_O$ rows and $w_O$ columns, where:

$$h_O = \left\lfloor \frac{h_I + 2 \cdot h_P - h_K}{h_s} \right\rfloor + 1 \qquad (1.16a)$$

$$w_O = \left\lfloor \frac{w_I + 2 \cdot w_P - w_K}{w_s} \right\rfloor + 1 \qquad (1.16b)$$
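A tiny helper evaluating Eqs. (1.16a) and (1.16b) can be handy when sizing a layer stack; the function name conv_output_size is hypothetical, and the first example reproduces the sizes of Fig. 1.9.

    def conv_output_size(h_i, w_i, h_k, w_k, h_p=0, w_p=0, h_s=1, w_s=1):
        """Spatial size of a convolutional layer output, as in Eqs. (1.16a)-(1.16b)."""
        # integer floor division implements the floor in the formulas
        h_o = (h_i + 2 * h_p - h_k) // h_s + 1
        w_o = (w_i + 2 * w_p - w_k) // w_s + 1
        return h_o, w_o

    # Fig. 1.9: 3x4 input, 2x2 kernel, stride 1, no padding -> 2x3 output
    print(conv_output_size(3, 4, 2, 2))                 # (2, 3)

    # A 3x3 kernel with one row/column of zero padding on each side keeps the size: 3x4 -> 3x4
    print(conv_output_size(3, 4, 3, 3, h_p=1, w_p=1))   # (3, 4)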

1.4.4.2 Pooling Layers

For several tasks, the exact position of a detected feature is not important: in LeCun’s words, “only its approximate position relative to other features is relevant” [38]. Besides being useless, the exact position of the features is explicitly undesired when learning, for instance, object recognition: the position must not be considered in order to correctly generalize over varying instances of a given class.

The straightforward way to reduce spatial precision is subsampling, which consists in reducing the input resolution.

A pooling layer performs a subsampling of data, typically coming from a previous convolutional layer used for feature detection.

Pooling layers, too, are composed of planes of neurons where each neuron is connected to a different small neighborhood in the input feature maps. A pooling neuron produces its contribution to the output feature map by subsampling the elements in its receptive field through an aggregation function.

Different aggregation functions can be used in pooling layers, for instance the average or the sum. LeCun et al. in [38], for example, aggregate by a sum, whose result is then multiplied by a trainable coefficient and incremented by a trainable bias, both shared among the neurons in the same plane.

In max pooling layers, instead, the aggregation is implemented by selecting the maximum value in the receptive field. Max pooling is widely used because it performs better than other pooling techniques, especially when applied to sparse intermediate representations, as shown experimentally by Boureau et al. [2].
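A minimal NumPy sketch of non-overlapping max pooling on a single feature map follows; the function name max_pool is hypothetical, and the sketch assumes that the input height and width are multiples of the pooling size.

    import numpy as np

    def max_pool(F, z=2, s=2):
        """Max pooling of a 2-D feature map F with z x z regions and stride s.

        This simple version only covers s == z (non-overlapping regions) and
        assumes that the input height and width are multiples of z.
        """
        assert s == z, "this sketch only covers non-overlapping pooling"
        h, w = F.shape
        # reshape into (h//z, z, w//z, z) blocks and take the maximum inside each block
        return F.reshape(h // z, z, w // z, z).max(axis=(1, 3))

    F = np.arange(16, dtype=float).reshape(4, 4)
    print(max_pool(F))     # [[ 5.  7.]
                           #  [13. 15.]]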


The subsampling performed by a pooling layer can be partially mitigated if the pooled input regions overlap. Let z be the number of rows and columns covered by the pooled regions in the input maps, and let s be the stride between such regions along both dimensions: overlapping pooling means that s < z.

According to Krizhevsky et al. [34], overlapping pooling acts as a mild regularizer for visual object-recognition tasks, making the network slightly harder to overfit.

1.4.4.3 Composition of CNNs’ Building Blocks

A convolutional neural network is typically composed of:

• The passive input layer, responsible for input perception. For instance, input neurons in CNNs for computer vision just read pixel values from the input images.

• Sequences of alternated convolutional and pooling layers.

• Possibly, some fully-connected layers (often implemented through 1 × 1 convolutional layers).

• The output layer, for instance implementing softmax.

The hyper-parameters of the network, especially those of the convolutional and pooling layers, are typically chosen such that, moving through the network, the number of feature maps tends to increase while the spatial resolution decreases.

Many efforts have been spent in improving CNN architectures, for instance by looking for the best performing size of receptive fields in convolutional layers and evaluating the mix of such layers with pooling layers.

For instance, Simonyan and Zisserman [54] observed that for computer vision tasks like object recognition it is convenient to convolve over small receptive fields with a small stride, for instance 3 × 3 and 1, respectively⁷. More to the point, they studied six different architectures of varying depth where stacks of several convolutions are alternated with pooling layers: they noticed that a stack of several convolutional layers, each having both a small receptive field and a small stride, actually performs better than a single convolutional layer with the same overall receptive field.

⁷ As a comparison, the kernels of the first convolutional layer in the AlexNet architecture by Krizhevsky et al. [34] have size 11 × 11 and are applied with stride 4.

For instance, let’s consider a stack of two 3 × 3 convolutional layers (without subsampling in between) having stride 1: it has the same effective receptive field as a single 5 × 5 convolutional layer, as shown in Fig. 1.10, where the former is on the left and the latter on the right. The benefit of an n-stacked convolution is twofold:

• A significant decrease in the number of parameters to be trained (a short numerical check is given after this list).

If we assume that C is the constant number of input and output channels at every 3 × 3 convolutional stack layer, then each stack layer has C planes, each parameterized by $3^2 C$ weights. Thus, the total number of parameters of such an n-stacked convolution is $3^2 C^2 n$. For instance, for n = 2, $18C^2$ parameters must be trained. Conversely, a single 5 × 5 convolutional layer with the same number of input and output channels would need $25C^2$ parameters.

As n increases, the reduction in the number of parameters grows: for n = 3 (equivalent to a single 7 × 7 layer), the stacked approach needs $27C^2$ parameters instead of $49C^2$.

Such formulas can easily be extended to the case of a different number of channels at each layer.

• The non-linear activation function (e.g., the ReLU) is applied after each convolution, hence n times instead of just once, increasing the discriminative power of the network.
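The parameter counts in the first point can be verified with a few lines of Python; the helper names are illustrative only, and biases are ignored, as in the formulas above.

    def stacked_params(n, k=3, C=64):
        """Weights of a stack of n k x k convolutional layers with C input/output channels."""
        return n * (k * k * C * C)

    def single_params(k_eff, C=64):
        """Weights of a single convolutional layer whose kernel covers k_eff x k_eff."""
        return k_eff * k_eff * C * C

    C = 64
    # Two stacked 3x3 layers vs a single 5x5 layer (same effective receptive field)
    print(stacked_params(2, C=C), single_params(5, C=C))   # 73728 vs 102400  (18C^2 vs 25C^2)
    # Three stacked 3x3 layers vs a single 7x7 layer
    print(stacked_params(3, C=C), single_params(7, C=C))   # 110592 vs 200704 (27C^2 vs 49C^2)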

Figure 1.10: On the left, a schematized view of a stack of two convolutional layers where each neuron convolves its input map with a 3 × 3 kernel; on the right, a convolutional neuron whose kernel size is 5 × 5.


1.4.4.4 Advantages and Pitfall of Using CNNs

The no free lunch theorems, already introduced in §1.4.3.7 about regularization, state that data alone is not enough to generalize in high-dimensional spaces: thus, prior beliefs are necessary to model the target function.

Convolutional neural networks address this issue by enforcing the previously discussed constraints on the neurons’ weights, yielding very strong priors: stationarity of statistics and locality of pixel dependencies. The consequences of these priors follow.

Fewer parameters and improved statistical efficiency As seen in §1.4.4.1, fully connected layers require a huge number of free parameters to be trained, since every neuron interacts with all the neurons in the previous layer: such layers perform dense matrix multiplications in order to compute their outputs.

In convolutional layers, instead, neurons have fewer parameters: since the kernels are typically much smaller than the input, the parameters of a neuron are related to just the input values belonging to the neuron’s small receptive field. Moreover, all the neurons in a plane share the same parameters, so a single set of weights is learned and applied to every input region, instead of learning a different set of features for each region.

Besides using less memory for storing weights, convolutional layers are much more efficient than fully connected layers due to their statistical efficiency: having far fewer connections, they need fewer training examples to reach a given generalization level.

Equivariance to translation A kernel in a convolutional layer produces a feature map indicating whether and where a feature has been detected in the input map. Weight sharing implements equivariance to translation, which means that if the detected feature changes position in the input, then its representation moves in the same way inside the output map.
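This property can be checked numerically: shifting the input shifts the valid cross-correlation output by the same amount, up to the borders that the valid convolution discards. The sketch below is illustrative only; xcorr_valid is a hypothetical helper equivalent to the one sketched in §1.4.4.1.

    import numpy as np

    def xcorr_valid(I, K):
        """Valid cross-correlation of a 2-D map I with a 2-D kernel K."""
        hK, wK = K.shape
        hO, wO = I.shape[0] - hK + 1, I.shape[1] - wK + 1
        return np.array([[np.sum(I[i:i + hK, j:j + wK] * K)
                          for j in range(wO)] for i in range(hO)])

    rng = np.random.default_rng(0)
    I = np.zeros((8, 8))
    I[2:4, 2:4] = 1.0                      # a small "feature" placed in the interior
    K = rng.standard_normal((3, 3))

    out = xcorr_valid(I, K)
    out_shifted = xcorr_valid(np.roll(I, shift=(1, 2), axis=(0, 1)), K)

    # Shifting the input by (1, 2) shifts the detection map by the same amount
    # (comparing only the region not cut away by the valid convolution).
    print(np.allclose(out[:-1, :-2], out_shifted[1:, 2:]))   # True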

Invariance to translation Pooling layers aggregate data in small receptive fields. Thanks both to the equivariance to translation and the prior of stationarity of summary statistics in the input, they yield invariance to small translations of features over the input map.
