Low-resolution face verification using convolutional neural networks

(1)

Università di Pisa

Dipartimento di Ingegneria dell’Informazione Corso di Laurea Specialistica in Computer Engineering

Laurea Specialistica in Computer Engineering

Low Resolution Face Verification

Using Convolutional Neural

Networks

Supervisors:

Fabrizio FALCHI

Claudio GENNARO

Giuseppe AMATO

Candidate:

Francesco MOLA

Academic Year 2016-17

(2)

Abstract

The goal of this thesis is to investigate how low resolution images affect the accuracy of deep convolutional neural networks (CNNs) in face verification which is the task of determining if two face images belong to the same person. We ma-nipulated photos from the Labeled Faces in the Wild dataset in order to obtain different levels of images resolution used to test state of the art CNNs. We consid-ered the comparison of couple of images at same resolution dataset and also the case in which one of the two images belongs to a high resolution dataset, while the other one to a low resolution dataset. The results show a decrease in perfor-mances of state-of-the-art CNNs trained on the high resolution images. In order to achieve better accuracy on low-resolution images and on mixed resolution, we tested the siamese learning approach for learning an embedded layer using a con-trastive loss. In particular, we fine-tuned a pre-trained CNN using the siamese approach on high resolution images and then we selected the best neural network for training on a dataset containing both high and low resolution images. The results show that a low resolution training can actually improve performances. In particular we achieved good results when the test set was composed only of low resolution images and the best results when the test set included both high and low resolution images, so the presence of high resolution images is however preferable in order to achieve such outcomes. Correspondingly to this increment we obtained a decrease in performances of test set composed only of high resolu-tion images. Such decrease is however acceptable and reasonable considering the kind of training we have executed.

(3)

List of Figures

1.1 Face Identification . . . 2

1.2 Face Verification . . . 2

3.1 Feed-forward Neural Network . . . 8

3.2 CNN filters view . . . 11

3.3 ReLU graph comparison . . . 12

3.4 Max Pooling . . . 13

3.5 An example of dropout applied to a standard neural network with 2 hidden layers. . . 14

5.1 Different crop sizes . . . 22

5.2 Different Resolution Datasets . . . 23

6.1 Siamese Network with Contrastive Loss . . . 34

6.2 Residual Block . . . 36

6.3 Performances vs layer size . . . 39

(6)

List of Tables

5.1 Step and description of the experiments. . . 22

5.2 Datasets Description . . . 24

5.3 Vggface Test Results, View 2 . . . 28

5.4 Vggface Test Results . . . 29

5.5 Vggface with Triplet Loss Test Results . . . 30

5.6 Dlibnet Test Results . . . 31

6.1 Fine-tuning Results . . . 38

6.2 Low Resolution Training Results with lfw_64, first configuration . 41 6.3 Low Resolution Training Results with lfw_32, first configuration . 42 6.4 Low Resolution Training Results with lfw_64, second configuration 42 6.5 Low Resolution Training Results with lfw_32, second configuration 43 6.6 Contrastive Loss Results . . . 43

(7)

Alla mia famiglia e in particolare ai miei genitori Giovanni e

Antonella e mia sorella Maria Luisa che hanno reso possibile questo

lavoro.

A Gessica, sempre al mio fianco che mi supporta e sopporta ogni

giorno della mia vita e ai miei amici più cari.

Un ringraziamento speciale per i miei relatori e in particolare

Fabrizio Falchi per il contributo dato alla realizzazione di questa tesi.

(8)

Chapter 1 Introduction

Artificial neural networks (ANNs) have found most use in numerous applications especially related to image recognition and detection. Thanks to convolutional neural networks (CNNs) and deep learning the accuracy in image challenges and benchmarks has reached levels that were unimaginable just a few years ago. One important application field in this sector is face recognition. We can subdivide it in face verification and face identification. The goal of face identification is to identify an unknown person based on the image of his face; this face image has to be compared with all the registered persons and so it is a 1-to-n matching system, where n is the total number of biometrics already knew by the system. Face verification, instead, is concerned with validating a claimed identity based on the image of a face, and either accepting or rejecting the identity claim, it is a 1-to-1 matching system because the system tries to match the biometric presented by the individual against a specific biometric. In both cases the neural network tries to extract features from the face in order to obtain useful informations for the recognition. Figure 1.1 explains the process of face identification. The network tries to identify the image passed as input through all the possible identities it knows. Figure 1.2 explains how face verification works. The features extracted from the network are compared generally using an euclidean distance and if the result is under a certain threshold the two images are deemed to be of the same person, otherwise they are deemed to be from different people. Here, differently to face identification, we have two images of people the network does not need to

(9)

Chapter 1. Introduction

Figure 1.1: A scheme of face identification process.

know but has to match or not based on the number of common characteristics. We are going to consider face verification because it is easier to evaluate in terms of performances and its limits and characteristics are well-defined.

Figure 1.2: A scheme of face verification process.

The results obtained by CNNs in face verification have been reached consid-ering datasets for training and testing with good resolution, but they usually do not represent practice cases. One typical example is face recognition from video surveillance, where the cameras are not in high definition and they have to record a wide area. Consequently the part of the image interested in face verification

(10)

could be really small and with low resolution. In this kind of condition the in-formation loss from high resolution to low resolution images makes more unlikely CNNs extract sufficient recognizable features directly from low resolution images. The difficult in matching low resolution images is added to the known issues of face recognition with CNNs: the problem to find a large dataset for training; the existence of significant differences between images of the same person due to different age, illumination, pose and other factors (the presence of glasses, beards...).

In this work we want to investigate the behavior of CNNs when tested with dataset at different level of resolutions and to determine if it is possible to im-prove these results. In particular we focus on the behavior of the network when it has mixed resolution images available and for the matching between couples can rely on at least one high resolution image.

1.1 Thesis Goals

The main objective of this thesis is to define an approach for face verification using Convolutional Neural Networks (CNNs) that gives the best possible results with both high and low resolution images. This main target can be subdivided int the following points:

• To test robustness of state of the art CNNs, using a low resolution dataset obtained from manipulation of a high resolution one.

• to improve robustness investigating different layer solutions and configu-rations with a network fine-tuning phase combined with a contrastive loss learning using high resolution dataset.

• To select the best architecture obtained from the previous point and to perform a training using both high and low resolution dataset.

• To test and compare the results obtained by the considered architecture before and after the low resolution training in order to underline any possible improvements.

(11)

For each point we are going to describe and explain all the techniques useful for deploying our procedures. In particular for the first point we are going to describe the dataset we want to take into account and the modifications we need to carry out and to briefly describe the CNNs we want to consider and how we will perform the tests. For the second and third point we are going to explain how fine-tuning, and more in general the training phase, is performed. The last point is a comparative description.

1.2 Thesis Structure

Chapter 1 was about introduction and thesis goals. In Chapter 2 all the work related to face verification with Convolutional Neural Networks and their results are presented. In Chapter 3 all the backgrounds to understand our work are presented, in particular a deepening about neural networks, convolutional neural networks, fine-tuning and loss function. In Chapter 4 the tools used to develop and execute this work are presented. In Chapter 5 all the experiments we are going to perform are described and the results obtained from state-of-the-art CNNs are presented and discussed. In Chapter 6 new design modifications and training phases are brought to a CNN in order to improve its robustness. In Chapter 7 conclusions and future works are presented.

(12)

Chapter 2 Related Work

Face recognition in images is a problem that has received significant attention in the recent past. There are many methods proposed to approach it, they can be distinguished between methods based on pure computer vision techniques and methods based on deep learning. Generally the latter ones reach better accuracy than the former ones, but they usually require big datasets for training the net-work. The first ones are based on extracting a representation of the face image using local image descriptors such as SIFT, LBP, HOG that are then aggregated into an overall face descriptor by using a pooling mechanism, for example the Fisher Vector ([11]).

The defining characteristic of the second ones is the use of a CNN feature extrac-tor to obtain image descripextrac-tors which are then compared to detect similarities. Taigman et al.[17] introduce DeepFace, a deep CNN trained to classify faces us-ing a dataset of 4 million examples spannus-ing 4000 unique identities. They use also a siamese network architecture where the CNN is formed by two replicated CNNs and at each layer the weights are shared. The purpose is to apply the same CNN to pairs of faces to obtain descriptors that are then compared through the Euclidean distance. The DeepFace work was extended by the DeepId series of papers by Sun et al. A number of new ideas were incorporated over the papers such as the use of multiple CNNs, different CNN architectures which branch a fully connected layer after each convolution layer and very deep networks. In the end they obtained a final model which is quite complicated and involving around

(13)

Chapter 2. Related Work

200 CNNs([16]).

Schroff et al. [15] use dataset of 200 million face identities and 800 million image face pairs to train a CNN, Facenet, with a triplet-based loss, where a pair of two congruous faces are compared to a third incongruous one. The goal is to minimize the distance between the first (anchor) and the second (positive) image and to maximize the distance between the anchor and the third image (negative). DeepFace, DeepID and Facenet have been tested on Labeled Faces in the Wild (LFW [6]) benchmark and Youtube Faces in the Wild (YFW [19]) benchmark and in particular DeepId incremented the performance of DeepFace and Facenet achieved the best performances.

Parkhi et al. [12] investigate different network architectures in order to filter what is important from irrelevant details. As result they obtain a network archi-tecture (Vggface) which almost achieves, with a triplet loss training, the results performed by DeepFace, DeepID and Facenet on LFW and YFW benchmarks. They also applied a method to create a reasonably large face dataset that col-lects face data using knowledge sources available on the web and requires only a limited amount of person-power for annotation. They obtained a dataset with over two million faces and free available for the research community.

One main issue using deep neural networks is the problem of degradation: with the network depth increasing, accuracy gets saturated and then degrades rapidly. He et al. propose a solution based on residual learning [3], in which shortcut connections are used to skip one or more layers and their output are added to the output of the stacked layers. With this technique the networks are easier to optimize and can gain accuracy from considerably increased depth.

Our work is based on [12] but enhanced with reduction layers and trained with contrastive loss in order to recognize faces in low resolution images obtained from crops and resizes of LFW dataset. Our idea is to test first the robustness of the network with low resolution images and then to try to improve these results with a learning phase of the network based on both low and high resolution images dataset.

(14)

Chapter 3 Background

In this chapter we want to introduce and briefly discuss the main approaches that constitute the theoretical part of what we are going to use in the experiments. In particular to understand the experiments a general knowledge about neural networks and convolutional neural networks is needed with more details provided about the learning process, the loss function (contrastive loss) and the main characteristic of neural network fine-tuning.

3.1 Neural Network

A neural network is a computing system inspired by biological neural network of the brains. It is composed by neurons which receive an input, change their internal state(activation) based on that input and produce an output depending on the two previous phases. In particular if y is the output of the neuron we obtain y = f (Pn

i=1wixi), where x are the inputs, w are the weights, f is the activation function and n is the total number of inputs. The weights are generally randomly initialized (a typical example is xavier initialization where the weights are randomly sampled from a Gaussian distribution with zero mean and _N1, where N is the number of input neurons) and they get updated by a learning rule in order for a given input to the network to produce a favored output. More neurons are organized in layer and in particular we have input layer where the neurons are passive and only propagate the input to the next layers, output layer where the

(15)

Chapter 3. Background

network output is produced and hidden layers which connect input layer to the output layer. Figure 3.1 shows an example of neural network, in particular it is called a feed-forward network because it forms a Directed Acyclic Graph (DAG) since the information flows from the input layer to the output layer through the hidden layers, otherwise the network is called recurrent.

Figure 3.1: Scheme of feed-forward neural network with input layer, two hidden layers and output layer.

Learning Process At the output layer a loss function is computed to quan-tify the error between the target result and the current output result. In order to minimize the loss function the error is back-propagated from the output layer to the previous layers and the weight are consequently updated over all the training examples. In particular optimization algorithms are used to find a local minimum of the loss function by varying the weights. More in detail, for dataset D, the optimization objective is the average loss over all |D| data instances throughout the dataset: L(W ) = 1 |D| |D| X i fW(X(i)) (3.1)

(16)

where fW(X(i)) is the loss on data instance X(i). |D| can be very large so in practice it is used a stochastic approximation of this objective, drawing a random mini-batch (a subset) of N«|D| instances:

L(W ) ≈ 1 N N X i fW(X(i)) (3.2)

The model computes fW in the forward pass and the gradient ∇fW in the back-ward pass.

Equation 3.2 is faster than equation 3.1 because the optimization is computed on a subset instead of using the whole dataset. Moreover the use of equation

3.2 works better for error surfaces with a lot of local minima, in this case the random sampling subset introduces a source of noise in the gradient that can jerk the model out of local minima into a region that hopefully is more optimal. We present two optimization algorithms both based on equation3.2.

Stochastic Gradient Descent (SGD) updates the weights W by a linear combina-tion of the negative gradient ∇L(W) and the previous weight update Vt:

Wt+1= Wt+ Vt+1 (3.3a)

Vt+1 = µVt− α∇L(Wt) (3.3b)

where the learning rate α is the weight of the negative gradient and the mo-mentum µ is the weight of the previous update. It is important to correctly initialize these values in order to obtain the best results. Krizhevsky et al. ([10]) uses α = 10−2 and dropping it by a constant factor throughout training (step policy learning rate) and µ = 0.9.

Adam solver, proposed by Kingma et al. ([9], is an adaptive moment estima-tion, that from an initialized learning rate, scales it for each weight based on the module of the gradient:

(Wt+1)i = (Wt)i− α p1 − (β2)ti 1 − (β1)ti (mt)i p(vt)i+ (3.4a)

(17)

(mt)i = β1(mt−1)i+ (1 − β1)(∇L(Wt))i (3.4b)

(vt)i = β2(vt−1)i+ (1 − β2)(∇L(Wt))2i (3.4c)

Proposed values are β1 = 0.9, β2 = 0.999, = 10−8.

3.2 Convolutional Neural Network

They are a special type of feed-forward neural network which offer impressive performances in computer vision tasks. They are inspired by biological processes in which the connectivity pattern between neurons is inspired by the organization of the animal visual cortex. Individual cortical neurons respond to stimuli only in a restricted region of the visual field known as the receptive field. The receptive fields of different neurons partially overlap such that they cover the entire visual field. They use convolution operator and have a particular architecture composed of an input layer (images) followed by a sequence of convolutional layers and pooling layers, and ends with fully-connected layers. In particular the input to a convolutional layer is a m x m x r image where m is the height and width of the image and r is the number of channels, e.g. an RGB image has r=3. All the others layers are described below.

3.2.1 Convolutional Layer

It is the core of CNNs, where the convolution operator is executed. It consists in a set of learnable filters (called kernels), every filter is small spatially (along width and height), but extends through the full depth of the input volume. During the forward pass, each filter is convolved cross the width and height of the input volume and compute dot products between the entries of the filter and the input at any position. We can specify the stride, which is a parameter that represents the number of pixel the filter is slided; when the stride is 1 the filter is moved one pixel at a time, if it is 2 the filter jumps to pixel at a time (producing smaller output volumes). As the filter is slided over the width and height of the input

(18)

volume it will produce a 2-dimensional activation map that gives the responses of that filter at every spatial position. Intuitively, the network will learn filters that activate when they see some type of visual feature. In this way each neuron is connected to only a local region of the input volume. The spatial extent of this connectivity is called the receptive field of the neuron and it is equivalent to the filter size. Figure 3.2 shows the 96 filters learned by the first convolutional layer of Krizhevsky et al.’s network ([10]). The network has learned a variety of frequency and orientation-selective kernels, as well as various colored blobs.

Figure 3.2: 96 filters of size 11x11x3 learned by the first convolutional layer on the 224x224x3 input images.

3.2.2 ReLU

It stands for Rectifier Linear Unit, it is the most used activation function in CNNs and it is defined as f (x) = max(0, x). It is applied to each pixel and replaces all negative pixel values in the feature map with zero. ReLU can be used for gradient-based learning even if it is not actually differentiable at all input points, more precisely in x = 0: this is not a problem since one can just use the left or right derivative in zero. As shown by Krizhevsky et al. ([10]) deep convolutional neural networks with ReLU train several times faster than their equivalents with tanh (hyperbolic tangent) units. They report figure 3.3 that shows the number of iterations required to reach 25% training error.

(19)

Figure 3.3: Solid line represents convolutional neural network with ReLU, which is six times faster than an equivalent network with tanh neurons (dashed line).

3.2.3 Pooling Layer

This kind of layer is generally inserted between successive convolution layers. It is a form of down-sampling to progressively reduce the spatial size of the repre-sentation to reduce the amount of parameters and computation in the network, and hence to also control over-fitting. A pooling function replaces the output of the network at a certain location with a summary statistic of the nearby outputs. The most used form of pooling is max pooling, which reports the maximum out-put within a rectangular neighborhood, because it performs better than other pooling techniques, as experimented by Boureau et al.([1]). The most common form is a pooling layer with filters of size 2x2 discarding 75% of the activations. Every MAX operation would in this case be taking a max over 4 numbers as reported in figure 3.4. Other types of pooling are average pooling or L2-norm pooling. Average pooling was often used historically but has recently fallen out of favor compared to the max pooling operation, which, as said, has been shown to work better in practice.

(20)

Figure 3.4: Max pooling with 2x2 filters.

3.2.4 Fully Connected Layer

It is the final layer after convolution and pooling layers. Neurons in a fully connected layer have full connections to all activations in the previous layer; their activation can be computed with a matrix multiplication.

3.2.5 Dropout

It is a technique introduced by Hinton et al. ([4]) that consists in training phase of setting to zero the output of each hidden neuron with probability 0.5. The neurons which are dropped out in this way do not contribute to the forward pass and do not participate in back-propagation. So every time an input is presented, the neural network samples a different architecture, but all these architectures share weights. The use of dropout prevents complex co-adaptations on the training data, in fact with hidden neurons randomly omitted, a hidden unit cannot rely on other hidden units being present. Krizhevsky et al. ([10]) shows that their network with the use of dropout reduces over-fitting. Figure shows an example of dropout application. The behavior of dropout is deepened later during high resolution training (6.1.

(21)

(a) Before Dropout (b) After Dropout

Figure 3.5: An example of dropout applied to a standard neural network with 2 hidden layers.

3.3 Fine-tuning

Deep convolutional neural networks have a huge numbers of parameters: Training them from scratch can be very expensive and requires the use of big dataset to avoid over-fitting. The most practical solution is to use a network already trained and to fine-tune it in order to use the network for our application case and the dataset we want to use. Obviously it is convenient if the dataset we are going to use a dataset similar to the one the pre-trained model is trained on.

There are three main methods to apply fine-tuning:

• To truncate the last layer and to replace it with layers related to our appli-cation case

• To use a smaller learning rate to train the network, in order not to distort too much the pre-trained weights, considered already quite good. In partic-ular a common choice is to make the initial learning rate 10 times smaller than the one used for scratch training.

• To maintain unchanged the weights of the first layers and to train only the subsequent layers. This is because first layers capture universal features like curves and edges, relevant for every application case.

(22)

3.4 Contrastive Loss

It is introduced by Hadsell et al. ([2]) as a loss function to earn the parameters W of a parameterized function GW, in such a way that neighbors are pulled together and non-neighbors are pushed apart. Let −X→1,

−→

X2 be a pair of input vector and Y a binary label assigned to this pair such that Y = 0 if −X→1 and

−→

X2 are similar and Y = 1 if they are dissimilar. Define the parametrized distance function DW between the input vectors as the euclidean distance between the output of GW:

DW( −→ X1, −→ X2) = kGW( −→ X1) − GW( −→ X2)k2 (3.5)

Then the loss function in its general form is:

L(W ) = P X i=1 L(W, (Y,−X→1, −→ X2)i) (3.6) L(W, (Y,−X→1, −→ X2)i) = (1 − Y )LS(DiW) + Y LD(DiW) (3.7) where (Y,−X→1, −→

X2)i is the i-th labeled sample pairs, LS is the partial loss function for a pair of similar points, LD the partial loss function for dissimilar points and P the number of training pairs. LS and LD must be designed that minimizing L with respect to W would result in low values of DW for dissimilar pairs, because contrastive loss penalizes greater distances for similar pairs and smaller distances for dissimilar pairs. In order to obtain the desired behavior the exact loss function is: L(W, (Y,−X→1, −→ X2) = (1 − Y ) 1 2(DW) 2 + (Y )1 2max(0, m − DW) 2 (3.8)

where m > 0 is a margin. The margin defines a radius around GW( − →

X ). Dissimilar pairs contribute to the loss function only if their distance is within the radius. It gives a granularity of distances, to understand if a certain distance can be considered big or small.

(23)

Chapter 4 Tools

In this chapter we describe the most relevant software tools employed in order to fulfill the thesis’ goals. For training and testing CNNs we use Caffe framework, available for Linux, which provides completely supports for these tasks, fine-tuning and new layers implementation.

For dataset manipulation we use ImageMagick available for multiple operative systems.

The hardware acceleration of CNNs, needed to speed up the deep learning phase, are based on CUDA libraries and executed on GPUs both provided by NVIDIA.

4.1 Caffe

It is a deep learning open-source1 framework made with expression, speed, and modularity in mind. It is widely used for training and testing deep neural network and it was developed and described by Jia et al. ([7]). Caffe is implemented in C++ with CUDA support for GPU computation and supports for Python and Matlab. In the Caffe terminology, the input layer of the neural network is called data layer. The information processed by Caffe at intermediate layers is stored in blobs. The blob in the forward step (representing neurons activations) is often called “data”, while the blob during the backward step is named “diff” since it contains the error gradients.

(24)

Chapter 4. Tools

Each layer of the network is defined in a declarative way, containing the name, all the needed parameters and connections to the other layers. In particular there are two types of layer connections: bottom connections from which layer receives data and top connections from which layer sends data during forward step (the opposite for backward step). In this way all the layers described in the file, called prototxt file, can be connected to form the graph of the network. Caffe directly implements different layers such as convolutional, pooling, ReLU, fully connected layers. It implements also different layers that compute loss function and in particular contrastive loss but with a formula exactly the opposite of the equation 3.8. It completely supports the use of siamese networks. Combined to the network model file there is a solver file (in prototxt format) used to set general parameters to train the network. For example it is possible to set the type of solver (SGD, ADAM,. . . ), the base learning rate, the momentum and the maximum iteration.

Each layer implements just four basic functions: setup and reshape, forward and backward computations. The former two are responsible for layer initialization and allocations, while the latter two produce new data from the bottom data or new diffs from the top diffs, respectively. Every user can implement (for example in python) its own layer redefining the four basic functions.

It is possible to launch training and testing phase through command line in this way:

caffe train -solver solver.prototxt caffe test -solver solver.prototxt

The command is executed inside Caffe directory and the solver gives all the needed information (especially the path of the network to train) for training. Inside the .prototxt file of the network is is possible to specify train or test phase for each layer, in this way Caffe knows when to include or not the given layer for computations. The outputs of the first command are two file: the first is a .caffemodel file and represent the weights of the network obtained at the end of the training phase; the second one is a .solverstate file which is a snapshot

(25)

Chapter 4. Tools

containing all the informations needed in order to eventually resume the training. Resume training is executed in this way:

caffe train -solver solver.prototxt -snapshot snapshot.solverstate

Fine-tuning is performed in this way:

caffe train -solver solver.prototxt -weights weights.caffemodel

Caffe provides some interesting and useful tools such as computing mean of a set of images, converting images to a lmdb database and paring the log output produced by the program. The most important one is probably the one used to extract features from a layer of the network and it is done in this way:

./caffe/build/tools/extract_features weights.caffemodel network.prototxt layer_name1[,layer_name2,...] save_feature_dataset_name1[,name2,...] num_mini_batches db_type [CPU/GPU]

In particular num_mini_batches is the number that needs to be multiplied to the batch size of the network to obtain the total number of features extracted. For example if batch_size=32 and the total number of features to extract is 2000, num_mini_batches will be equal to 80.

4.2 ImageMagick

It is a free software2 _{to create, edit, compose, or convert bitmap images. It can}

read images over 200 formats and the functionality of ImageMagick is typically utilized from the command-line. We are going to use it, in particular, for cropping and resizing images. To crop an image in the center is done with the following command:

convert image -gravity center -crop sizexsize+0+0 cropped_image

To resize an image is done with the following command:

convert image -resize sizexsize! resized_image

(26)

Chapter 4. Tools

4.3 CUDA Libraries

In order to speed up the process of training and testing of deep learning convo-lutional neural networks, we are going to use GP-GPUs provided by NVIDIA. In particular we are going to use two NVIDIA’s GeForce GTX 1080 with 8 GB frame buffer each. NVIDIA assigns a compute capability (CC) number3 _{to each}

GPU model. The two considered GPU have compute capability of 6.1: the Caffe website recommends to perform deep learning with GPUs having CC greater or equal to 3.0.

NVIDIA provides a number of useful software tools for deep learning with GP-GPUs: System Management Interface (SMI) which provides information about the installed GPUs together with their characteristics and state; The NVIDIA CUDA Toolkit which is a software development kit that let developers build pro-grams that take advantage of GPU acceleration; NVIDIA cuDNN4_{which provides}

a CUDA-transparent set of building blocks for working on deep neural networks.

3_{developer.nvidia.com/cuda-gpus} 4_{developer.nvidia.com/cudnn}

(27)

Chapter 5 Experimental Settings and

State-of-the-Art Results

In this chapter we describe all the experiments. In particular here we describe the dataset and the tests which are common to all of them. In this chapter we show also the results obtained from the first experiment: test the robustness of state-of-the-art CNNs. Before it a brief description of the networks considered is inserted.

5.1 Dataset Description

The dataset we use is the Labeled Faces in the Wild (LFW [6]) benchmark which is available on-line ([5]). The dataset contains more than 13,000 images of faces collected from the web. Each face has been labeled with the name of the person pictured. 1680 of the people pictured have two or more distinct photos in the dataset. The only constraint on these faces is that they were detected by the Viola-Jones face detector ([18]). The images are available as 250 by 250 pixel JPEG images. Most images are in color, although a few are gray-scale only. The dataset is organized into two views. View 1 is for algorithm development and general experimentation, view 2 is for performance reporting and should be used only for the final evaluation of a method.

(28)

test-Chapter 5. Experimental Settings and State-of-the-Art Results

ing. The training set consists of more than 4000 people where each one has at least one image. The test set consists of 500 pairs of matched and 500 pairs of mismatched images. The people who appear in the training and testing sets are mutually exclusive, in this way people in the test images have never been seen before in the training phase, so there is no opportunity to build models for such individuals, instead the main focus is based on the generic problem of differenti-ating any two individuals that have never been seen before.

The second view of the data consists of ten subsets of the database. To report accuracy results on View 2, the experimenter should report the aggregate perfor-mance of a classifier on 10 separate experiments where in each experiment nine of the subsets should be combined to form a training set with the tenth one used for testing.

5.2 Experimental Description

In this section we present a general overview of the experiments and all the needed procedures in order to execute the tests. In particular we have three main steps and in all of them we perform the tests described in5.2.3 with different datasets obtained modifying the LFW dataset with methods described in5.2.2.

5.2.1 Experiment steps

The experiments are conducted in order first to test the robustness of various neural networks and then to improve these results. In particular we can consider three steps. The first one consists in testing the robustness of three different neu-ral network: Vggface, Vggface trained with triplet loss and Dlibnet. The second one consists in a training phase of Vggface with a technique called contrastive loss with high resolution images and a testing phase in order to obtain a baseline to make comparisons. The third one consists in repeating the train with both high and low resolution images and a testing phase to compare the latter results with the former ones. In all the three steps we perform the same type of test that is described further on. Table 5.1 summarizes the steps. In this chapter we are

(29)

Chapter 5. Experimental Settings and State-of-the-Art Results

going to focus on step 1.

Step 1 Testing phase of CNNs state of the art

Step 2 High resolution training phase with contrastive loss and testing

Step 3 High and low resolution training phase with contrastive loss, testing and compare

Table 5.1: Step and description of the experiments.

5.2.2 Dataset adjustment

The images of the dataset are not taken entirely but we decide to use two differ-ent crop sizes in order to remove as much as possible the background that can lead to incorrect predictions, increasing or reducing the real performances of the networks. The first crop is called LFWcrop (free available on-line [14]) and it is a crop of 83 by 83 pixels in the center of each image, with the upper-left and lower-right corners being (83,92) and (166,175), respectively. Then the dataset is resized to the dimension of 64 by 64 pixels and 32 by 32 pixels. Although this crop manages to remove almost the entire background from the images and they appear to exhibit real-life conditions, we consider it a too much aggressive crop and we decide to use a second one as comparisons. The second crop decision is a crop of 160 by 160 pixels in the center of each image, which corresponds to a reasonable crop of the 64% of the original image size. Figure5.1 gives an idea on how much we cropped with these two crop sizes respect to the original image.

(a) Original Image (b) 160x160 crop (c) 83x83 crop

(30)

The dataset is then resized to 64 by 64 pixels and 32 by 32 pixels. It is important to notice that all the images are then resized to the dimension of 224 by 224 pixels, in order to be used as input in Vggface ([12]). In the end we obtain five different datasets, Table 5.2 summarizes them. Figure 5.2 gives an example of the resolution from different datasets, it is already possible to notice the worsening in image quality and consequently an information loss that will somehow affect the neural networks decisions.

(a) lfw_160 (b) lfw_64 (c) lfw_32

(d) lfwcrop64 (e) lfwcrop32

Figure 5.2: Examples images at different resolutions from the datasets

5.2.3 Test description

The test phase consists in extracting the features of the images from a fully connected layer and to compute the euclidean distance between the couples; if the distance is under a certain threshold we have a match, otherwise we have a mismatch between the couples. The test phase is performed first using view 2 of LFW on Vggface and then we perform all the remaining tests with view 1, once a coherence between view 1 and view 2 is guaranteed. The tests take into

(31)

Dataset Description

Lfwcrop64 Crop 83x83 and resized 64x64 Lfwcrop 32 Crop 83x83 and resized 32x32 Lfw_160 Crop 160x160

Lfw_64 Crop 160x160 and resized 64x64 Lfw_32 Crop 160x160 and resized 32x32

Table 5.2: All the datasets used in the experiments, all of them are then resized to 224x224.

account the number of false positives (FP) and false negatives (FN). The first ones happen when the network predicts a match between mismatched couples, the second ones happen when the network predicts a mismatch between matched couple. For performances analysis and evaluation metric we consider the EER (equal error rate), which is defined as the error rate where the false positive and false negative rates are equal. We save and show the threshold where we get the EER and we use the threshold obtained with the dataset lfw_160 to compare all the computed performances, taking into account the number of false positives and false negatives and the corresponding true positive rate (TPR) and false positive rate (FPR) computed at that threshold from different resolution datasets. We consider two types of experiments, the first one is the standard one where all the couples are taken from the same dataset. The second one, which we call crossed experiment, consists in taking one image from one dataset and the second one from another one and then exchanging the two datasets, for each considered couple. In this way the couples used in the test are doubled (1000 matched and 1000 mismatched in case of view 1). Instead of trying all the possible datasets we decide to keep one fixed, lfw_160 which has the highest resolution, and to vary all the other ones. This experiment is interesting to understand how the network reacts if it has to predict equality of people in images at different resolutions (one lower and higher). The second experiment represents also a better real-life case than the first one, because it is quite normal, for example in a surveillance area,

(32)

having to predict if a person detected from the camera (low resolution image) is the same of a person in a high resolution image (taken for example from an identity card).

5.3 Networks Description

Before starting to present all the results we are going to briefly describe the three networks considered in the first step of our experiment.

5.3.1 Vggface

Parkhi et al. [12] presented this relatively simple network which has obtained high performances. The architecture comprises 13 convolutional layers subdivided into 5 blocks. Each convolutional layer is followed by a rectifier layer (ReLU) and at the end of each block there is a max pooling layer . Notice that the first two blocks contain two convolutional layer each whilst the last three blocks contain three each. After such blocks there are three fully connected layers (FC). The first two FC layers output are 4096 dimensional and are followed by a ReLU layer each. The last one has N = 2622 dimensions for N-way class prediction. the resulting vector is passed to a soft-max layer to compute the class posterior probabilities. All the FC layers do not include the Local Response Normalization operator. We extract the features from the second fully connected layer(FC7). The network is fed with images of 224 by 224 pixels.

In the training phase stochastic gradient descent is used as optimization with momentum coefficient of 0.9. The model is regularized using dropout and weight decay; the coefficient of the latter was set to 5 × 10-4, whereas dropout was applied after the two FC layers with a rate of 0.5. The learning rate was initially set to 10-2 and then decreased by factor of 10 when the validation set accuracy stopped increasing. Overall, the model was trained using three decreasing learning rates. The weights of the filters in the CNN were initialized by random sampling from a Gaussian distribution with zero mean and 10-2 _{standard deviation. Biases were} initialized to zero.

(33)

5.3.2 Vggface with triplet loss

It consists in a learning phase of the weights parameter by looping over all possible triplets of input examples (X, X+, X−), where X is called anchor, seeking to both reduce the distance between the anchor and the positive similarity example X+ (target neighbor) and to increase the distance between the anchor and the negative example X− (impostor). The loss function is defined as follows:

Loss(X, X+, X−) = max{0, m + Dsim(X, X+) − Ddis(X, X−)} (5.1)

where Dsim and Ddsim are the euclidean distances between respectively (X, X+) and (X, X-); the margin hyper-parameter m fosters a sufficient gap between the distances among similar examples and the distances among dissimilar examples. In order to use triplet loss Vggface needs to be modified for accepting three images as input and adding the triplet loss layer to compute the equation 5.1. In particular we use the network modified in a master thesis of a student from University of Pisa ([13]).

5.3.3 Dlibnet

This network (the complete name is dlib face recognition resnet model) has been developed by dlib ([8]), a general purpose cross-platform software library written in the programming language C++ and largely used for face detection, and it is based on the network of He et al. ([3]). It is a residual network with 29 convolutional layers and ends with a global average pooling layer and a 128-way fully-connected layer with a metric loss. The network training started with randomly initialized weights and the metric loss layer is used to try to project all the identities into non-overlapping balls of radius 0.6. Dlib has been already trained with images of 150 by 150 pixels. Dlibnet is used with dlib library for face detection, but considering the crop we applied, we are going to use directly the network without detection.

(34)

5.4 Results Step 1 (State-of-the-Art)

Table 5.3 and table 5.4 show similar results that guarantees coherence between them. From now we are going to consider and discuss only results obtained from view 1.

The first column of table 5.4a shows that the best performances are obtained, as expected, from lfw_160 dataset, the highest resolution one. Comparing the other results and in particular datasets with same resolution but different crop size, we can observe that the more aggressive crop reach lower results compared to the other datasets, so this kind of crop does not behave well combined with Vggface network and it is preferable to use the other kind of crop.

The first column of table5.4b, which is the crossed experiment accuracy, must be compared to the previous considered column. We can observe that the crossed experiment gives better performances, demonstrating the fact that the presence of at least one resolution image in each couple let the network to verify the images more simply.

Observing the sixth and seventh column of both table5.4aand table5.4b, which represent the true positive and negative rate, we can notice that lfwcrop64 and lfwcrop32 have results that vary largely depending on the respective metric. These results confirm that Vggface cannot be considered robust with this kind of crop, whilst the other crop shows less variability in these results and specially lfw_64 shows great similarities with the highest resolution dataset, due to a minor in-formation loss. Considering and comparing the TPR and TNR metrics between crossed and not crossed experiments we can underline that when crossed is ap-plied TPR results trend towards worse accuracy because the different resolution of the two images in each couple cause the network have more difficulty to rec-ognize the two images belong to the same person. For the same reason when crossed is applied to TNR results the network shows better performances. The last two propositions are valid in general for all the networks we consider. The results from table5.5, Vggface with triplet loss, are similar to the ones of the previous table 5.3. In general we can notice absolute performances are slightly

(35)

lower but the results are less variable, so this network, thank to the presence of the triplet loss layer, shows to be more robust than the previous one.

The results from table 5.6, the one with Dlibnet, shows that this network does not reach the performances of Vggface, in absolute terms, but it works definitely better with the crop we consider aggressive, without the high variability we ob-served before. In general we can notice that also the robustness is better than Vggface, because Dlibnet has been already trained with low resolution images. All the tables we have considered show that there is a decrease in performances when we use low resolutions datasets. Dlibnet shows better robustness results than Vggface (both with and without triplet loss). In particular the normal ar-chitecture of Vggface obtains the worst results in terms of robustness. We are going to try to improve it in the next chapter.

100%-EER threshold FN(FP) FN* FP* TPR*% TNR*% AVG

lfw_160 97.3 0.88 82 82 82 97.3 97.3 97.3

lfwcrop64 91.4 0.85 257 112 593 96.3 80.2 88.3

lfwcrop32 89.5 0.85 314 133 676 95.6 77.5 86.5

lfw_64 96.8 0.90 97 148 48 95.1 98.4 96.7

lfw_32 93.5 0.90 195 329 94 89.0 96.9 93.0

(a) Not Crossed

lfw_160 97.3 0.88 164 164 164 97.3 97.3 97.3 lfwcrop64 91.7 0.92 501 1605 93 73.3 98.5 85.9 lfwcrop32 90.8 0.92 554 1879 86 68.7 98.6 83.6 lfw_64 96.9 0.90 187 307 80 94.9 98.7 96.8 lfw_32 94.7 0.91 317 733 74 87.8 98.8 93.3 (b) Crossed

Table 5.3: Test results with Vggface Network with view 2 of LFW, the fields with the asterisk use threshold of lfw_160.

(36)

lfw_160 97.8 1.25 11 11 11 97.8 97.8 97.8

lfwcrop64 88.2 1.16 59 6 187 98.8 62.6 80.7

lfwcrop32 86.0 1.15 70 12 242 97.6 51.6 74.6

lfw_64 97.0 1.24 15 12 14 97.6 97.2 97.4

lfw_32 93.2 1.23 34 22 45 95.6 91.0 93.3

(a) Not Crossed

Table 5.4: Test results with Vggface Network with view 1 of LFW, the fields with the asterisk use threshold of lfw_160.

(37)

lfw_160 96.8 0.78 16 16 16 96.8 96.8 96.8

lfwcrop64 89.0 0.73 55 31 90 93.8 82.0 87.9

lfwcrop32 86.4 0.71 68 23 156 95.4 68.8 82.1

lfw_64 96.8 0.79 16 21 13 95.8 97.4 96.6

lfw_32 93.4 0.78 33 33 31 93.4 93.8 93.6

(a) Not Crossed

Table 5.5: Test results with Vggface Network with triplet loss layer, the fields with the asterisk use threshold of lfw_160.

(38)

lfw_160 95.8 0.65 21 21 21 95.8 95.8 95.8

lfwcrop64 96.6 0.61 17 6 26 98.8 94.8 96.8

lfwcrop32 96.0 0.61 20 8 38 98.4 92.4 95.4

lfw_64 95.0 0.64 25 21 30 95.8 94.0 94.9

lfw_32 88.4 0.62 58 29 94 94.2 81.2 87.7

(a) Not Crossed

Table 5.6: Test results with Dlibnet Network, the fields with the asterisk use threshold of lfw_160.

(39)

Chapter 6 Design and Test of a CNN for Face

Verification

The results of the previous chapter have shown that information loss in low resolution images determines a decrease of performances in CNNs. Unfortunately low resolution images represent a better case than the high resolution ones, so we need to investigate further in order to determine the reasons of the performances decrease and to find a way to improve these results. In particular, first we are going to train and test different layers and configurations to understand if these kind of solutions can alone give better robustness to the networks, even if the training is performed only with high resolution images. Then we are going to use the latter results, the one obtained from the best network solution, as baseline to compare to a mixed resolution training of the same network. The last experiment we are going to perform is particularly significant because it represent the best and typical real application scenario. In fact in many cases when a face verification is needed the system already has an anchor image, generally in high definition, and it has to test it against an image provided at real time, without any predefined dimension and generally in low resolution.

Our design choices and implementations described below are especially devised to resolve, as best as possible, this type of scenario.

(40)

Chapter 6. Design and Test of a CNN for Face Verification

6.1 High Resolution Training

In this phase we perform a fine-tuning on Vggface. The original application of Vggface was face identification (but in the previous tests we use the features ex-tracted from it to do face verification), we modify it in order to perform face verification. In particular, the architecture is modified adding an embedded fully connected layer after FC7 and removing the last fully connected layer (FC8), which was originally used for classification, moreover a normalization layer is added to normalize the features from FC7 and the dropout layer of FC7 is sub-stituted with a scale layer. The reason of this substitution resides in the different behavior of dropout layer between training and testing phase: as said in the previ-ous section (3.2.5) dropout layer in training phase sets to zero the output of each hidden neuron with probability 0.5; whilst in testing phase dropout multiplies results for 0.5 to have in this way the same statistics the network would have in training phase. The result is that for the same input the dropout in training mode would randomly give different features. To prevent that we substitute dropout with a scale layer, with a scaling factor of 0.5, that behaves in the same way the dropout behaves in testing mode. The approach we want to use is a siamese network training, each layer is duplicated and they share the same weights. At the beginning there is an input layer to manage the images and the label and at the end there is a contrastive loss layer that implements this kind of loss function. Figure 6.1 shows the resulting architecture.

6.1.1 Couples Selection and Input Layer

Each branch of the siamese network takes as input an image and produces an array of features that is passed to the contrastive loss layer. The task of the input layer is to read from a list the couples and the label and passes them respectively to the siamese network and to the contrastive loss layer (the contrastive loss layer has three inputs). We implement the input layer in python redefining the four basic functions that are called by Caffe framework (see also4.1): setup, reshape, forward and backward. In setup function we open the list of couple, we define

(41)

Figure 6.1: Siamese network with special input layer and contrastive loss layer. In our case the two networks are Vggface networks.

all the transformations needed for the network, the image size, the batch size and the path where all the images are located; in reshape function the images are reshaped to the desired input size (not necessary for our dataset because it is already resized to the right dimension); in forward function each line of the list is split in two images and a label and the three inputs are generated; the backward function is empty because we are implementing an input layer and it has no need of backward computations.

The input list is generated by an application program written in python and it is composed of negative and positive pairs obtained in the following way: we consider all the people in the train set of LFW (around 4000), for the negative pair we select a randomly image of each person and we search, among all the

(42)

images not belonging to the same person, the one with the smallest euclidean distance with the previous one; for positive pairs we consider only people with more than one image each and we select two images of the same person randomly. Each line of the list is composed of the two images and the label (1 if the couple match, 0 if the couple does not match). In total we obtain 5222 couples (lines).

6.1.2 Architecture and Configurations

We investigate different solutions in order to improve robustness of the network. In particular, at architecture level, we try three different sizes of the embedded layer (2048, 1024, 512), to add one or two residual layers (each one preceded by a 4096-dimension fully connected layer) and, in case of two residual layers, we try also to add a reduction layer with various dimensions (logarithmically chosen from 4096 to 128).

An example of residual block is shown in figure6.2. After the normalization layer of FC7 we have another block fully connected (we call it FC7_TO_SUM ) of the same dimension of FC7, then we have a layer that sums the features passed through FC7_TO_SUM and the features directly obtained from FC7. The use of residual blocks should guarantee that the original features of FC7 are still used without modification, and so the images recognized by these features are still correctly recognized, but now an extra capability is given from the modifications obtained from the new fully connected layer and so the network should be able to correctly recognize other images. It is noticeable that without the use of skip-connection the features of FC7 could go to degradation, resulting in a decrease of performances.

At configuration level we vary the base learning rate, the learning policy (step or fixed), the type of solver (SGD or ADAM) and the maximum number of iterations. We choose 32 as batch size, which means that at each iteration 32 couples are loaded; considering that, the minimum number of iterations needed to the network to process all the images is 164.

(43)

Figure 6.2: Residual block after FC7, composed of a fully connected layer and a residual layer.

6.1.3 Overview

The step executed at each cycle during fine-tuning are the following: features for training and testing phase are extracted using weights from the previous cycle, at the first cycle the original weights of Vggface are used; with testing features we compute performances of the current cycle, in particular we compute test only with high resolution dataset; with training features we generate couples list in the way previously discussed. For each configuration tried we perform ten cycles in total.

6.2 Results Step 2

Table 6.1 shows all the results we obtained with fine-tuning. We start directly from the test of the second cycle because the test of the first cycle simply consist in the test of the network with original weights, which is without any kind of learning from fine-tuning and would obviously produce the same result. We can

(44)

notice that without normalization and with dropout the network is not able to learn and loses in performances at the increasing of the cycle (first row). The use of ADAM solver gives no improvements to the network as we can see from the fourth row, neither the use of a step policy or a learning rate too much low. We ca notice also that it does not produce advantages to perform long iterations with a high learning rate (eighth row). From these firsts results we can notice that the best configuration should have SGD solver, a fixed policy for learning, a learning rate between 0.01 and 0.25 and a reasonable number of iterations. As we can see the results from ninth row on and with all the mentioned characteristics obtain an accuracy higher than 95%. The maximum result is obtained by the residual network with reduction layer of 4096 and it is 98% at the tenth cycle. However as average performances the best architecture is the one with 1024-size of the embedded layer, a learning rate of 0.01 and iterations three times the number of minimum iteration the network needs to process the entire train-set (last row). Moreover we prefer this architecture for its simplicity because it only needs to add one layer to the original Vggface, while the residual one needs to add 5 layers. The network select will be used for experiments explained in the next paragraph.

(45)

Chapter 6. Design and Test of a CNN for Face Verification base_lr lr_p olicy max_iter norm drop out em b_size solv er test #2 test #3 test #4 test #5 test #6 test #7 test #8 test #9 test #10 a vg(test) 0.00001 fixed 492(3xIter) no y es 1024 SGD 86.0 80.2 77.2 75.0 75.0 75.0 78.1 0.0001 step(492,0,1) 1476(9xIter) no no 1024 SGD 87.0 84.0 82.0 79.6 77.8 75.6 74.8 73.8 73.0 78.6 0.00001 fixed 492(3xIter) no no 1024 SGD 89.4 86.6 84.8 83.8 82.6 81.8 81.0 80.4 79.6 83.3 0.01 fixed 164(Iter) y es no 1024 AD AM 82.2 82.0 87.2 83.8 85.8 82.2 84.0 85.4 85.4 84.2 0.000001 fixed 492(3xIter) no no 1024 SGD 92.8 91.6 91.0 90.4 90.2 89.2 89.0 88.8 88.2 90.1 0.1 fixed 164(Iter) y es no res 2 SGD 88.0 92.2 91.6 91.0 90.8 91.0 92.0 91.8 92.6 91.2 0.000001 fixed 164(Iter) no no 1024 SGD 92.6 92.6 92.2 92.2 91.8 91.6 91.4 91.2 90.8 91.8 0.25 fixed 1640(10xiter) y es no 1024 SGD 97.0 94.4 93.8 93.8 93.4 92.8 93.2 93.4 93.8 94.0 0.1 fixed 164(Iter) y es no res 1 SGD 92.6 93.8 94.6 95.0 95.6 95.8 96.0 96.6 96.4 95.2 0.2 fixed 164(Iter) y es no res 1 SGD 93.0 94.4 95.4 96.0 96.4 96.8 97.0 97.0 97.0 95.9 0.1 fixed 164(Iter) y es no res 3 (64) SGD 92.6 95.8 96.0 96.4 96.4 96.8 97.0 96.8 97.0 96.1 0.1 fixed 164(Iter) y es no res 3 (256 ) SGD 96.2 96.8 96.6 96.8 97.0 97.0 97.2 97.2 97.6 96.9 0.1 fixed 164(Iter) y es no res 3 (128 ) SGD 95.6 96.4 96.6 97.0 97.2 97.2 97.4 97.4 97.8 97.0 0.1 fixed 164(Iter) y es no res 3 (409 6) SGD 96.6 96.4 96.0 96.6 96.6 96.8 97.8 98.0 98.0 97.0 0.1 fixed 164(Iter) y es no res 3 (512 ) SGD 97.0 97.0 97.0 97.2 97.2 97.2 97.4 97.4 97.4 97.2 0.1 fixed 492(3xIter) y es no 1024 SGD 97.8 97.2 97.0 97.2 96.8 97.4 97.2 97.4 97.2 97.2 0.1 fixed 492(3xIter) y es no 512 SGD 97.0 97.4 97.4 97.8 97.2 97.2 97.6 97.0 97.2 97.3 0.1 fixed 164(Iter) y es no 1024 SGD 97.8 97.6 97.6 97.4 97.2 97.2 97.0 97.2 97.0 97.3 0.1 fixed 164(Iter) y es no 512 SGD 97.8 97.6 97.4 97.2 97.4 97.4 97.2 97.0 97.0 97.3 0.000001 fixed 164(Iter) y es no 1024 SGD 97.4 97.4 97.4 97.4 97.4 97.4 97.4 97.4 97.4 97.4 0.1 fixed 164(Iter) y es no res 3 (102 4) SGD 97.6 97.8 97.4 97.6 97.4 97.4 97.4 97.2 97.2 97.4 0.1 fixed 164(iter) y es no res 3 (204 8) SGD 97.4 97.4 97.8 98.0 97.6 97.8 97.2 97.2 97.2 97.5 0.1 fixed 164(Iter) y es no 2048 SGD 97.8 97.8 97.8 97.6 97.8 97.6 97.6 97.6 97.6 97.7 0.01 fixed 492(3xIter) y es no 1024 SGD 97.4 97.8 97.8 97.8 97.8 97.8 97.8 97.6 97.6 97.7 T able 6.1: Results of fine-tuning. Norm is the normalization la y er, res 1 is a blo ck comp osed of a fully connected la y er of 4096 size and a residual la y er, res 2 is tw o b lo cks of res 1, res 3 is res 2 plus a redu ction la y er of v ariable size. Iter is the minim um n um b er of iterations the net w ork n eeds to pro cess the en tire train set.

(46)

Figure 6.3: Variation of performances with different layer sizes.

In details figure 6.3 shows how the performances change varying the size of the reduction layer used after the residual blocks. We can notice that the best performances are achieved with sizes 1024 and 2048, while with extreme layer sizes the results tend to decrease. In particular for the lowest dimension layer (64) this is due to the excessive reduction of features needed for the face verification phase. For the highest dimensional layer (4096) the increment of features bring to an increment of the loss and consequently to a decrease in performances. Maybe it is possible to obtain better results with 4096-size layer training for long periods the network in order to obtain a loss reduction.

6.3 High and Low Resolution Training

The architecture of the experiment we perform now is the same as the previous paragraph: a siamese network training with contrastive loss function. As said the network used is the one with the best average results and we start from the weights we obtained at the end of the last cycle. The different in this training respect to the previous one resides in the use of a combined dataset of high and low resolution images. In particular we consider two low resolution images:

(47)

lfw_64 and lfw_32. We are not going to consider the other type of crop for simplicity reasons and because the other crop was not a fitting one for Vggface (5.4). For each resolution level we consider two kind of configurations to apply to the couples of images:

• 40% both high resolution images; 20% both low resolution images; 20% first high resolution, second low; 20% first low resolution, second high.

• 33% both low resolution images; 33% first high resolution, second low; 33% first low resolution, second high.

In the first configuration we prefer to give more probability of occurrence to the high resolution images, in fact they appear at least with one image in 80% of the cases. In the second configuration the occurrence probability of low resolution images grows, in fact they appear in any case.

6.4 Results Step 3

Table6.2and table6.3represent the results obtained using the first configuration. Table 6.4 and table 6.5 represent the result of the second configuration. They all need to be compared to table6.6 to underline any improvements. From these comparisons we can notice that the first configuration is too much related to high resolution images, in fact the results, both for lfw_64 and lfw_32, show that the performances are not as good as the one of the baseline (table 6.6). In the second configuration, instead, the results show to be increased because the network mainly see low resolution images and the training comes out to be effective for these types of datasets. In particular the best improvements can be noticed for lfw_32, while lfw_64 does not give the best values because this kind of resolution is very similar to the high resolution one with the consequence that also the results are similar. Since we obtain the best results for lfw_32, we are going to focus on comparisons between table6.5and table6.6, expressed in detail in figure 6.4 and in particular on the results of lfw_32, since the network was trained with this kind of resolution. We can observe that in accuracy, TPR, TNR

(48)

the results are equal or incremented both in crossed and not crossed versions. In particular we obtain good increments for TPR, which represents an interesting result for real case application, because the increments of TPR corresponds to a decrease in FNR (false negative rate), which means that it is harder and less likely that the network would say that a matching couple does not match. This represents a good increment for robustness.

Figure 6.4: Comparison between results of Vggface trained with low resolution images (lfw_32) with second configuration and the results of Vggface with con-trastive loss layer. The latter are underlined in light blue color.

lfw_160 97.4 1.06 13 13 13 97.4 97.4 97.4

lfw_64 97 1.07 15 17 14 96.6 97.2 96.9

lfw_32 93 1.09 35 49 23 90.2 95.4 92.8

(a) Not Crossed

lfw_160 97.4 1.06 26 26 26 97.4 97.4 97.4

lfw_64 97.1 1.07 29 33 25 96.7 97.5 97.1

lfw_32 95.1 1.11 49 97 26 90.3 97.4 93.9

(b) Crossed

Table 6.2: Test results after training of Vggface with low resolution dataset lfw_64, first configuration. The fields with the asterisk use threshold of lfw_160.

(49)

lfw_160 97.4 1.06 13 13 13 97.4 97.4 97.4

lfw_64 96.8 1.07 16 19 13 96.2 97.4 96.8

lfw_32 93 1.09 35 50 22 90.0 95.6 92.8

(a) Not Crossed

lfw_160 97.4 1.06 26 26 26 97.4 97.4 97.4

lfw_64 97.1 1.07 29 33 24 96.7 97.6 97.2

lfw_32 95.1 1.11 49 102 26 89.8 97.4 93.6

(b) Crossed

Table 6.3: Test results after training of Vggface with low resolution dataset lfw_32, first configuration. The fields with the asterisk use threshold of lfw_160.

lfw_160 97.4 1.06 13 13 13 97.4 97.4 97.4

lfw_64 97 1.07 15 19 15 96.2 97.0 96.6

lfw_32 93.4 1.09 33 47 24 90.6 95.2 92.9

(a) Not Crossed

lfw_160 97.4 1.06 26 26 26 97.4 97.4 97.4

lfw_64 97.2 1.07 28 31 23 96.9 97.7 97.3

lfw_32 95.3 1.10 47 97 27 90.3 97.3 93.8

(b) Crossed

Table 6.4: Test results after training of Vggface with low resolution dataset lfw_64, second configuration. The fields with the asterisk use threshold of lfw_160.

(50)

lfw_160 97.4 1.06 13 13 13 97.4 97.4 97.4

lfw_64 97.2 1.07 14 19 12 96.2 97.6 96.9

lfw_32 93.8 1.07 31 34 24 93.2 95.2 94.2

(a) Not Crossed

lfw_160 97.4 1.06 26 26 26 97.4 97.4 97.4

lfw_64 97.1 1.07 29 29 21 97.1 97.9 97.5

lfw_32 95.2 1.10 48 89 24 91.1 97.6 94.4

(b) Crossed

Table 6.5: Test results after training of Vggface with low resolution dataset lfw_32, second configuration. The fields with the asterisk use threshold of lfw_160.

lfw_160 97.6 1.07 12 12 12 97.6 97.6 97.6

lfw_64 97 1.08 15 18 11 96.4 97.8 97.1

lfw_32 93.2 1.08 34 43 24 91.4 95.2 93.3

(a) Not Crossed

lfw_160 97.6 1.07 24 24 24 97.6 97.6 97.6

lfw_64 97.5 1.08 25 31 22 96.9 97.8 97.4

lfw_32 95.2 1.10 48 94 25 90.6 97.5 94.1

(b) Crossed

Table 6.6: Test results with Vggface network with contrastive loss layer, the fields with the asterisk use threshold of lfw_160.

Low-resolution face verification using convolutional neural networks

Università di Pisa

Laurea Specialistica in Computer Engineering

Low Resolution Face Verification

Using Convolutional Neural

Networks

Supervisors:

Fabrizio FALCHI

Claudio GENNARO

Giuseppe AMATO

Candidate:

Francesco MOLA

Academic Year 2016-17

Contents

List of Figures

List of Tables

Alla mia famiglia e in particolare ai miei genitori Giovanni e

Antonella e mia sorella Maria Luisa che hanno reso possibile questo

lavoro.

A Gessica, sempre al mio fianco che mi supporta e sopporta ogni

giorno della mia vita e ai miei amici più cari.

Un ringraziamento speciale per i miei relatori e in particolare

Fabrizio Falchi per il contributo dato alla realizzazione di questa tesi.

Chapter 1

Introduction

1.1

Thesis Goals

1.2

Thesis Structure

Chapter 2

Related Work

Chapter 3

Background

3.1

Neural Network

3.2

Convolutional Neural Network

3.2.1

Convolutional Layer

3.2.2

ReLU

3.2.3

Pooling Layer

3.2.4

Fully Connected Layer

3.2.5

Dropout

3.3

Fine-tuning

3.4

Contrastive Loss

Chapter 4

Tools

4.1

Caffe

4.2

ImageMagick

4.3

CUDA Libraries

Chapter 5

Experimental Settings and

State-of-the-Art Results

5.1

Dataset Description

5.2

Experimental Description

5.2.1

Experiment steps

5.2.2

Dataset adjustment

5.2.3

Test description

5.3

Networks Description

5.3.1

Vggface

5.3.2

Vggface with triplet loss

5.3.3

Dlibnet