
UNIVERSITÀ DI PISA

DIPARTIMENTO DI INGEGNERIA DELL’INFORMAZIONE

Corso di Laurea Magistrale in Ingegneria Biomedica

Master Thesis

Deep Learning for Emotion Classification through Facial Expression Images:
Design and Development of Ensemble Solutions

Supervisors:

prof. Alessio Bechini

prof. Francesco Marcelloni

Candidate


Summary

The aim of the present work is to investigate the performance of an ensemble of Deep Convolutional Neural Networks for automated Facial Expression Recognition. Emotion classification of facial images could be integrated into diagnostic systems or exploited in several other fields, ranging from Human-Computer Interaction to Data Analytics. Recent works have proven that convolutional neural networks are suitable for feature extraction and inference, and that ensemble voting guarantees a significant boost in performance. We investigate whether pre-training individual networks on different datasets helps differentiate their training procedures, with the objective of producing a more accurate ensemble of neural networks. The experiments have been carried out using the TensorFlow machine learning library and exploit GPUs in order to speed up computation. They show that, in the proposed experimental scenario, the Pretrain Strategy is not appropriate: it does not significantly improve the ensemble accuracy and it is more expensive in terms of time and additional data than other differentiation strategies.


Contents

Summary

Acknowledgements

Introduction

1 Facial Expression Recognition Challenge
  1.1 Introduction
  1.2 Facial Expression Databases
  1.3 The algorithmic approach for automated FER
      1.3.1 Preprocessing
      1.3.2 Features extraction
      1.3.3 Classification
  1.4 Why Deep Learning?
      1.4.1 Definitions
      1.4.2 Motivation for Deep Architecture

2 Deep Neural Networks
  2.1 Machine Learning Algorithms
      2.1.1 Definitions
      2.1.2 Regularization
      2.1.3 Learning procedure: Gradient-Based Optimization
  2.2 Artificial Neural Networks
      2.2.1 Artificial Neuron: the Action Unit
      2.2.2 Feed Forward Neural Networks
      2.2.3 Gradient Based Learning in Feed Forward Neural Networks
      2.2.4 Major issues in training deep architectures
  2.3 Convolutional Neural Network
      2.3.1 Architecture and properties
      2.3.2 Physiological basis of CNN
  2.4 Advanced Methods in Classification
      2.4.1 Techniques to improve classification accuracy: Ensemble methods
      2.4.2 Dealing with dataset size: Transfer Learning
  2.5 Practical Methodology: the choice of Hyper-Parameters

3 Proposed Model and Experimental Setup
  3.1 Goals of the present work
  3.2 Experiments
      3.2.1 Design of Ensemble Methods
      3.2.2 Dataset Setup and Preprocessing Step
      3.2.3 Convolutional Neural Networks: Two Architectures
      3.2.4 Training procedure
  3.3 Software and Hardware: Computational Considerations
      3.3.1 Software
      3.3.2 Hardware
      3.3.3 Computational Analysis: The usage of GPU

4 Experimental Results
  4.1 Considerations about Single Network
  4.2 Comparison between Seed Strategy and Pretrain Strategy
  4.3 Are results stable when fixing seed?
  4.4 How do networks perform on cross-database task?
      4.4.1 Experiments with modified Learning Rate schedule

5 Discussion
  5.1 Comparison between Ensemble Strategies
  5.2 Considerations about Cross-Database Experiments
  5.3 Comparison between Learning Rate Decay Strategies in Pretrain Strategy

6 Conclusions and Future Works

Appendices

A Networks Design and Training Settings
  A.1 Details of Network Architectures
  A.2 Details of Network Training and Inference
      A.2.1 Training and Inference on FER-2013 dataset
      A.2.2 Pre-training on CK and SFEW dataset


List of Tables

1.1 Facial Expression Databases. FER-2013, CK+ and SFEW have been used in the present work
1.2 FER-2013: Number of examples per label
1.3 State of the art performance on CK+ dataset (6 classes)
3.1 Comparison between CPU and GPU runtime - 10 epochs of training
4.1 Seed Strategy - Networks Accuracy
4.2 Pretrain Strategy - Networks Accuracy
4.3 Seed Strategy - Ensemble Statistics
4.4 Pretrain Strategy - Ensemble Statistics
4.5 Fixed Strategy - Networks Accuracy
4.6 Fixed Strategy - Ensemble Statistics
4.7 Cross-database results - Base Learning Rate Decay Strategy
4.8 Cross-database results - Modified Learning Rate Decay Strategy


List of Figures

1.1 Two examples of FER-2013 images
1.2 Two examples of CK+ images after face detection with openCV
1.3 Example of SFEW Image
1.4 Typical approaches for features extraction in FER
1.5 Example of Deep Neural network. Source [27]
1.6 Visualization of spatial patterns that activate 10 selected filters in the third convolutional layer. Each row corresponds to a filter; each column is one of the top ten images of the dataset that elicited the maximum magnitude response of that filter. Experiment on CK+ dataset. Source [18]
2.1 U-shaped plot: Training and generalization error as a function of model capacity. Source [12]
2.2 Biological and Artificial Neuron. Source [32]
2.3 Examples of activation functions
2.4 Example of fully-connected neural network for handwritten digit recognition. Source [27]
2.5 Effect of Momentum in SGD. Source [12]
2.6 Dropout: some neurons of the network are deleted. In the next step they will be resumed and another random subset will be deleted. Source [27]
2.7 Local Receptive Field. Source [27]
2.8 Example of simple Convolutional Neural Network. Source [27]
2.9 From the input to the primary visual cortex. Source: www.posturepro.net/eye-tracking-and-sports-performance/
2.10 Most common approaches for CNN Training
3.1 Input pipeline: from original images to the network input
3.2 Three versions of the same image after preprocessing
3.3 Scheme of Basic Network: ad-hoc model
3.4 Scheme of Deep Network: VGG-inspired
3.5 Cartoon representation of the design difference between CPU and GPU. Source [35]
4.1 Training Accuracy, Validation Accuracy and Total Loss normalized in range [0,1] - Basic Network, Seed Strategy - Seed: 456, Preprocessing: iNor
4.2 Original Image
4.3 Visualization of First Layer summaries for Deep Net architecture
4.4 Normalized Confusion Matrix on Test Set - Basic Network, Seed Strategy - Seed: 123, Preprocessing: Default
4.5 Three Partial Training procedures on FER-2013 using CPU
4.6 Learning Rate decay strategies
5.1 Ensemble accuracy values. Plot in range [66,76]
5.2 Ensemble Gain. Plot in range [0,5]
5.3 Average values of accuracy. Plot in range [66,76]
5.5 Performance on CK of the networks from Pretrain Strategies. Plot in range [0,50]
5.6 Performance on SFEW of the networks from Pretrain Strategies. Plot in range [0,50]
5.7 Performance on cross-Database task of the pre-trained networks
5.8 Performance on FER-2013 with Base and Modified Learning Rate. Plot in range [50,100]
6.1 Accuracy of different methods: blue bars represent the reported accuracy of several methods analyzed in the State of the Art Review [28], P (red) stands for present work (Global Ensemble, Pretrain Strategy), H (green) stands for human accuracy. Plot in range [50,100]


Introduction

A brief introduction to the Facial Expression Recognition problem is given in Chapter 1. The standard algorithmic approach for automated FER is presented along with a description of the most common databases. Finally, we explain why Deep Learning techniques are suitable for tackling the FER task.

Chapter 2 introduces the basics of Deep Neural Networks, starting from the definition of a Machine Learning Algorithm and examining the main aspects of Artificial Feed Forward Neural Networks; particular attention is given to Convolutional Neural Networks, since they are the class of networks used in the present work. Finally, the key points of Ensemble methods and Transfer Learning are introduced.

The core ideas of the proposed model are reported in Chapter 3: a description of the Pretrain Strategy, the designed Ensemble Method, is provided in comparison with the state-of-the-art Seed Strategy. The details about network architectures and experimental setup can be found in this chapter and in the Appendices.

The experimental results of the ensemble methods are shown in Chapter 4. A further investigation of two aspects is reported: the variability introduced by low-level implementation details when using the GPU, and the performance of the pre-trained networks on the cross-database task.


In Chapter 5 we analyze the experimental results and apply statistical hypothesis tests in order to evaluate the significance of the observed variability.

A brief summary of the results and the conclusions of the present work are reported in Chapter 6.


Chapter 1

Facial Expression Recognition Challenge

1.1 Introduction

Motivations behind Facial Expression Recognition Research

Facial Expression Recognition could play an important role in several fields of application. On one hand it could be exploited in Human-Machine Interaction systems. On the other hand it could represent a crucial component of Behaviomedics [36], defined by Valstar as "the application of automatic analysis and synthesis of affective and social signals to aid objective diagnosis, monitoring, and treatment of medical conditions that alter one’s affective and socially expressive behaviour."

In particular it is possible to identify three groups of behaviomedical applications:

• Diagnosis and analysis of mood and anxiety disorders: depressive disorders, bipolar disorders or substance induced disorders.


• Diagnosis and analysis of neuro-developmental disorders: autistic spectrum disorder, schizophrenia, foetal alcohol spectrum disorder, Down syndrome or attention deficit hyperactivity disorder.

• Pain estimation: pain is a symptom used as an indicator in medical settings.

Emotion and expression

The final goal of these techniques is to identify the actual emotion, affect or mood, rather than the expression used to exhibit it. In [23] the authors compare the emotion to the message, and the expression to the signal used to convey it.

Nevertheless, in 1971, Ekman et al. [8] demonstrated that facial expressions of emotion are universal. The study was carried out on both literate and preliterate cultures: the universality of the human way of expressing an emotion is supposed to be an evolutionary, biological fact, independent of the specific culture.

This outcome allows modern Computer Vision studies to focus on the signal (facial expression) in order to analyze the message (emotion).

In [24], Matsumoto adds the contempt emotion to Ekman's original classification. The basic emotions are therefore: happiness, sadness, anger, surprise, disgust, fear, contempt.

Ekman has also given an important contribution to the development of the Facial Action Coding System [37]: a facial expression can be decomposed into Action Units, defined as the contraction or relaxation of one or more muscles. Every expression representing an emotion is characterized by the combined activation of several AUs. For example, given the following action units:

• 6: Cheek Raiser,
• 12: Lip Corner Puller,

their combined activation (6+12) characterizes the expression of happiness.


1.2 Facial Expression Databases

Machine Learning approaches to the FER problem require the availability of a Facial Expression Database. An ideal database should be representative of real-world conditions, so that an algorithm trained on it is able to correctly classify a previously unseen example.

The main features of Facial Expression databases are the following:

• Database Size: the number of images in the Database. The number of subjects represents another relevant factor.

• Database Construction: many databases are actually video databases: they include several frames for every example. This results in an increased database size, but the additional images are highly correlated.

• Controlled conditions or Naturalistic conditions: images acquired in controlled conditions (or lab conditions) consist of posed expressions with frontal faces and controlled illumination and background. As stated in [28], FER on images in controlled conditions is considered a solved problem. FER under naturalistic conditions is the scenario of interest for the applications described in section 1.1. The factors of variation that make this a harder problem are: head pose, illumination, occlusions, and the subtlety of spontaneous expressions. FER under naturalistic or spontaneous conditions is often referred to as "in the wild".

• Labels and Ground truth: in a labeled database, every image is associated with a label that indicates the depicted emotion. The set of emotions can vary between databases, but is often a subset of the Ekman list of emotions. In two publicly available databases (MMI, CK+) it is possible to find the Action Unit labels too.


• Image properties: resolution and number of channels (grayscale or color images).

Table 1.1 summarises the main features of the most common databases for FER that have been used in the literature. A brief description of those used in this work is given in the following paragraphs along with image examples.

Database           | Labels | Posed/Spontaneous | # Images                   | # Subjects    | Size
FER [13]           | 7      | Spontaneous       | 35877                      | N.A. (~35877) | 48x48
CK+ [22]           | 8      | Posed             | 327 sequences              | 123           | 640x490
SFEW 2.0 [7][6][5] | 7      | from Movies       | 1311                       | 68 movies     | 143x181
JAFFE              | 7      | Posed             | 213                        | 10            | 256x256
DISFA              | AU     | Spontaneous       | 4845                       | 27            | 1024x768
CMU Multi-PIE      | 7      | Posed             | 750k images (348k neutral) | 337           | 3072x2048
MMI                | AU     | P / S             | 1280 videos, 250 images    | 43            | 720x576

Table 1.1: Facial Expression Databases. FER-2013, CK+ and SFEW have been used in the present work

The FER-2013 Database

The FER-2013 database is the reference database for this work and for most state-of-the-art works. It was released in 2013 for a machine learning contest held as part of the ICML workshop "Challenges in Representation Learning". The first step of the database construction consisted in obtaining 600 strings which were used as queries on Google Search for facial images. The first 1000 images returned for each query were kept and faces were detected using OpenCV. Human labelers checked the consistency between images and queries, corrected cropping and filtered out some duplicate images. The number of different subjects is therefore unknown, but the independence of the examples will be taken as an assumption. Images are then resized to 48x48 pixels and converted to grayscale. Finally, the authors prepared a subset of 35877 images and mapped the emotion keywords of the queries into seven labels, the six defined by Ekman plus the neutral expression (see table 1.2). The reported average human accuracy on this dataset is 65%.

Figure 1.1: Two examples of FER-2013 images

Label     | Examples
Anger     | 4953
Disgust   | 547
Fear      | 5121
Happiness | 8989
Sadness   | 6077
Surprise  | 4002
Neutral   | 6198

Table 1.2: FER-2013: Number of examples per label


CK+ Database

The Extended Cohn-Kanade Dataset (CK+) consists of 593 sequences from 123 subjects. Only 327 sequences are annotated, with the same labels of table 1.2 plus contempt. In the present work, four images are selected from each sequence: this results in 1308 images. As shown in figure 1.2, images are obtained in lab conditions.

Figure 1.2: Two examples of CK+ images after face detection with openCV

SFEW Database

Static Facial Expressions in the Wild (SFEW) consists of images extracted from a temporal facial expression database, Acted Facial Expressions in the Wild, extracted in turn from movies. Along with the SFEW dataset, a set of detected and aligned images is provided (figure 1.3). Even when using the aligned dataset, FER is a difficult task because of variations in head pose, age, occlusions, focus and real-world illumination conditions.

(a) image of original dataset  (b) image of dataset Aligned-SFEW
Figure 1.3: Example of SFEW Image

1.3 The algorithmic approach for automated FER

A standard pipeline to address the FER problem consists of three main steps:

1. preprocessing;
2. features extraction;
3. classification.

1.3.1 Preprocessing

The preprocessing step can involve all those transformations that allow easier feature extraction. The most common preprocessing steps are listed below (a minimal code sketch of the face detection and histogram steps follows the list):

• face detection: identifies the portion of the image that represents a face. This is the only mandatory step enabling features extraction;

• facial landmarking: it consists in finding the positions of face keypoints, e.g. the corners of the eyes, mouth or nose. Facial landmarking can be exploited in two ways:

  – for geometric feature extraction, under the hypothesis that the relative position of landmarks is discriminative for emotion classification;

  – for face registration;

• face registration: aligns the face to a reference shape, usually exploiting landmarks;


• modification of the image histogram: acts on the intensity values of the pixels of an image and can be achieved using a variety of techniques, including min-max normalization, histogram equalization and illumination normalization.
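The following minimal sketch is not the thesis pipeline: it only illustrates the face detection and histogram modification steps with OpenCV. The Haar cascade file and the 48x48 target size are assumptions chosen to match the FER-2013 format.

```python
import cv2

# Illustrative preprocessing sketch: detect a face with OpenCV's Haar cascade,
# crop it, resize it to 48x48 and equalize the histogram of the grayscale crop.
def preprocess_face(image_path, size=48):
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    faces = cascade.detectMultiScale(img, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None                               # no face found: skip the example
    x, y, w, h = faces[0]                         # keep the first detected face
    face = cv2.resize(img[y:y + h, x:x + w], (size, size))
    return cv2.equalizeHist(face)                 # histogram equalization step
```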

1.3.2 Features extraction

As underlined in [23], this is a crucial step for facial expression analysis: an optimal feature extractor needs to be robust to nuisance factors, e.g. subject variability, illumination or pose variability, and able to extract information useful for the classification step.

The block diagram in figure 1.4 shows the typical approaches for features extraction.

Figure 1.4: Typical approaches for features extraction in FER

Hand-crafted features

Methods of this category require a mathematical descriptor designed in order to meet a set of desired properties.


Geometric features The information about keypoint locations and their spatial relationships can encode a specific expression. This approach has the advantage of being fast, intuitive and not affected by illumination variation and pose [23]. The drawback is that it requires reliable landmarking.

Appearance features Appearance-based methods use the pixel intensity information in order to identify those variations of the skin texture related to the conveyed expression. The most popular approaches are:

• Gabor filters, for robustness to misalignment;
• Local Binary Patterns (LBP), for robustness to illumination variation;
• Scale Invariant Feature Transform (SIFT), for robustness to scale variation;
• Histogram of Oriented Gradients (HOG), for robustness to affine transformation.

It is worth mentioning that hybrid approaches are possible: in [39] both LBP and HOG are used in order to achieve better classification performance.

Learned Features

The decision of which features to extract, and the tuning of the methods presented above, require expert knowledge of the domain. Recent research has investigated whether it is possible to learn the features useful to the classifier directly from the data, instead of designing them by hand. In computer vision, the most popular models used for this purpose are Deep Convolutional Neural Networks (DCNNs).


1.3.3 Classification

Facial expression recognition can be framed as a multiclass classification problem. Once the features are extracted, a machine learning technique is employed to perform classification: common choices are Support Vector Machines and Softmax Regression.

1.4 Why Deep Learning?

1.4.1 Definitions

Deep Learning Deep learning is a branch of machine learning and consists of a set of algorithms and techniques for learning in neural networks. An introduction to artificial neural networks is presented in section 2.2, while an example of a deep neural network is shown in figure 1.5.

Figure 1.5: Example of Deep Neural network. Source [27]


Deep Architecture A deep architecture is composed of multiple layers of non-linear information processing. Such an architecture allows the model to learn a hierarchical representation of the input data, from the trivial information of the first layers to the more complex and abstract features of the last layers.

AI-Complete problem Facial Expression Recognition in the wild is a computer vision problem that belongs to the category of AI-complete problems and as such should be treated with a deep learning model as suggested in [12].

In the 1991 Jargon File [29] it is possible to find the definition of an AI-Complete problem: "[MIT, Stanford, by analogy with ’NP-complete’] Used to describe problems or sub-problems in AI, to indicate that the solution presupposes a solution to the ’strong AI problem’ (that is, the synthesis of a human-level intelligence). A problem that is AI-complete is, in other words, just too hard. Examples of AI-complete problems are ’The Vision Problem’, building a system that can see as well as a human, and ’The Natural Language Problem’, building a system that can understand and speak a natural language as well as a human."

1.4.2 Motivation for Deep Architecture

In [2] the author lists the theoretical advantages of using deep architectures:

• Cognitive processes in humans seem to have a deep structure, with different levels of representation and abstraction. The physiological basis that inspired deep convolutional neural networks is presented in section 2.3.2.

• A too shallow architecture fails to represent the desired function with a feasible number of parameters: the number of elements might grow exponentially if the depth is reduced.

Regarding the FER problem, Khorrami et al. [18] have given a contribution that can be summarised in two points:


• their Convolutional Neural Network reaches or outperforms state-of-the-art methods that use hand-crafted features, as shown in table 1.3. The description of these techniques is beyond the scope of the present work; the references can be found in [18] and [39].

• the spatial regions of the input image that maximally excite neurons in the convolutional layers correspond to the Facial Action Units described by Ekman. That is to say, the network is able to learn relevant high-level features (figure 1.6).

It is worth mentioning that this kind of solution became popular only in the mid-2000s, when some papers explained how to train such networks fast enough [27]. The continuous advance in CPU and, even more, GPU technology reduces the computational time and thus enables a wider usage of these techniques.

Method | Accuracy
CNN    | 95.7% ± 2.5%
MKL    | 95.5%
CSPL   | 89.89%
LBPSVM | 95.10%
BDBN   | 96.70%

Table 1.3: State of the art performance on CK+ dataset (6 classes)


Figure 1.6: Visualization of spatial patterns that activate 10 selected filters in the third convolutional layer. Each row corresponds to a filter; each column is one of the top ten images of the dataset that elicited the maximum magnitude response of that filter. Experiment on CK+ dataset. Source [18]


Chapter 2

Deep Neural Networks

Unless otherwise specified, the theoretical content of this chapter is taken from the book Deep Learning, by Ian Goodfellow, Yoshua Bengio and Aaron Courville [12].

2.1 Machine Learning Algorithms

2.1.1 Definitions

The aim of a Machine Learning technique is to give a computer the ability to autonomously learn from raw data. In [25] the author provides a definition of computer learning: "A computer program is said to learn from experience E with respect to some class of task T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."

In a Facial Expression Recognition problem the terms T, P, E can be described as follows.

Task The task T is to classify images of human faces into a finite number of categories (N), each one corresponding to an emotion. The algorithm is asked to find a function f:

$$f : \mathbb{R}^2 \rightarrow \{1, 2, \ldots, N\}$$

The domain of the function is $\mathbb{R}^2$ for grayscale images, while it is $\mathbb{R}^3$ for RGB images.

Performance The measure of performance is the accuracy of the model: it is calculated as the number of correctly classified examples divided by their total number.

There are other possible performance metrics, such as recall or coverage, but in the present work accuracy is used.

Experience A learning algorithm increases its knowledge of the desired function by training on a dataset, a collection of examples. The experience E could grow in a supervised or unsupervised fashion: in the present work every input example is associated with a label that specifies the class it belongs to, so a supervised learning algorithm will be implemented.

Dataset: Training set, Validation set and Test set

In the present case study, the final goal of the classification algorithm is to correctly classify a previously unseen example. Therefore it is not sufficient to solve an optimization problem on the examples used for training. To evaluate this generalization capability, the input dataset is split into two separate sets: the training set and the test set. The training set is often further split into a training set and a validation set.

• the training set is used during training in order to increase the experience of the model. An optimization procedure finds the parameters configuration which minimizes the training error.


• the validation set is a collection of examples that are not used to fit the parameters: since it is carved out of the training data (and not out of the test set), we are allowed to modify the hyper-parameters of the model according to the performance of the algorithm on this set.

• the test set is used to measure the actual performance of the model, thus its generalization capability.

The inference capability on previously unseen examples arises from an assumption about the data-generating process (the i.i.d. assumption): examples in the training and test sets are supposed to be independent of each other and identically distributed.

Capacity, Overfitting, Underfitting

The capacity of a model corresponds to its degree of complexity: the higher the capacity (to a first approximation, the number of parameters), the wider the variety of functions it will be able to fit.

If the model is given a low capacity, the error on the training set will be high: this situation is called underfitting. If the model is given a high capacity, it will overfit the training set, losing the capability to generalize well on a test set. This situation is called overfitting. The U-shaped plot in figure 2.1 shows the relationship between error and model capacity.

2.1.2 Regularization

In a broad sense, regularization is any modification to a learning algorithm that aims to reduce overfitting and thus generalization error.

Adding a regularization term to the cost function One way to achieve this goal is to include a regularization term in the cost function. If the L2 regularization term is chosen and w denotes the parameters of the model, the cost function becomes (with λ the weight decay coefficient):

$$J^*(w) = J(w) + \frac{\lambda}{2} \|w\|_2^2$$


Figure 2.1: U-shaped plot: Training and generalization error as function of model capacity. Source [12]

Since the training procedure tries to minimize $J^*(w)$, the preferred solutions are those with a smaller squared L2-norm of the weights w. L2 regularization is also referred to as weight decay or Tikhonov regularization.
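As an illustration only (not the thesis code), in TensorFlow/Keras a weight decay term of this form can be attached to a layer through a kernel regularizer; the coefficient 1e-4 is an arbitrary example value.

```python
import tensorflow as tf

# The squared L2 norm of this layer's weights, scaled by the weight decay
# coefficient, is added to the total loss during training.
layer = tf.keras.layers.Dense(
    units=128,
    activation="relu",
    kernel_regularizer=tf.keras.regularizers.l2(1e-4),
)
```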

Dataset Augmentation Another way to improve generalization is to increase the training set size. The restricted size of the available databases is one of the central issues in many machine learning applications. Gathering and annotating new data is often a difficult task. Nevertheless, when the inputs are images, it is possible to introduce several types of transformations or distortions that constitute an artificial dataset augmentation.
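A hedged example of artificial dataset augmentation for 48x48 grayscale faces, using TensorFlow image operations; the flip, pad and crop choices are illustrative assumptions, not the augmentation actually used in the experiments.

```python
import tensorflow as tf

# Random horizontal flip followed by a random 48x48 crop of a padded image.
def augment(image):                                            # image shape: (48, 48, 1)
    image = tf.image.random_flip_left_right(image)
    image = tf.image.resize_with_crop_or_pad(image, 52, 52)    # pad to 52x52
    image = tf.image.random_crop(image, size=[48, 48, 1])      # crop back to 48x48
    return image
```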

Early stopping Early stopping is an effective form of regularization that modifies the stop criterion of a training algorithm. The model is periodically evaluated on a validation set during the training procedure. The generalization error on the validation set represents an estimate of the error on the test set. The early stopping procedure halts the training process when there are no improvements in the validation error; the model with the lowest validation error is saved.

This technique reduces overfitting by finding the optimal duration of training, without affecting the underlying model dynamics.

2.1.3 Learning procedure: Gradient-Based Optimization

The optimization procedure aims to minimize a cost function and is carried out using Gradient Descent, an iterative method from calculus. Given a cost function f(x) that we want to minimize by altering the parameters x, and given ε the learning rate that defines the size of the step, the method updates the parameters at each new iteration using the following formulas:

$$\Delta x = \varepsilon \nabla_x f(x) \tag{2.1}$$

$$x \leftarrow x - \Delta x \tag{2.2}$$
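A toy sketch of equations (2.1)-(2.2) in plain NumPy; the cost function, learning rate and number of steps are arbitrary illustrative choices.

```python
import numpy as np

# Iteratively apply x <- x - eps * grad_f(x).
def gradient_descent(grad_f, x0, eps=0.1, steps=100):
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x - eps * grad_f(x)
    return x

# Example: minimize f(x) = ||x||^2, whose gradient is 2x; the minimum is at the origin.
x_min = gradient_descent(grad_f=lambda x: 2 * x, x0=[3.0, -2.0])
```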

2.2 Artificial Neural Networks

Artificial Neural Networks are a machine learning approach inspired by the biological brain. The key feature is the weighted interconnection of several simple Action Units in order to compute complex functions.

2.2.1 Artificial Neuron: the Action Unit

The mathematical model (figure 2.2b) is a coarse approximation of the biological neuron (figure 2.2a).

• an action unit receives an input signal x_i from each preceding unit i;

• the output y_k of neuron k is a non-linear function f of its weighted input:

$$y_k = f\left(\sum_i (w_i \cdot x_i) + b\right) \tag{2.3}$$


(a) Cartoon representation of biological neuron

(b) Mathematical model used in neural networks

Figure 2.2: Biological and Artificial Neuron. Source [32]

• the weight w represents the strength of the synaptic connection between two action units. It is a learnable parameter and can have an excitatory effect (positive weight) or an inhibitory effect (negative weight);

• the bias b is equivalent to a threshold;

• f represents the activation function. This non-linearity allows a neural network to compute non-linear functions of its input. The most common functions used are: Sigmoid, Hyperbolic Tangent, ReLU (Rectified Linear Unit);

• the output of a unit represents an input for the following units. A short numeric sketch of equation (2.3) is given below.
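The sketch below implements equation (2.3) for a single action unit with a ReLU activation; the weights, bias and inputs are arbitrary example values.

```python
import numpy as np

# One artificial neuron: weighted sum of the inputs plus bias, passed through f.
def neuron(x, w, b, f=lambda z: np.maximum(z, 0.0)):   # f = ReLU
    return f(np.dot(w, x) + b)

y = neuron(x=np.array([0.5, -1.0, 2.0]),
           w=np.array([0.1, 0.4, -0.3]),
           b=0.05)
```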

Activation functions

The activation function characterizes the Neuron Unit.

Step function The trivial step function (figure 2.3a) is never used in practice: during the learning procedure it is desirable that small changes in the input determine small changes in the output.


(a) Step (b) Sigmoid (c) ReLU

Figure 2.3: Examples of activation functions

Sigmoid The sigmoid function (figure 2.3b) is a common choice because it guarantees the above-mentioned property:

$$f(x) = \frac{1}{1 + e^{-x}}$$

ReLU ReLUs (figure 2.3c) are the most popular choice for convolutional neural networks. The advantages of ReLU over sigmoid units can be summarized as follows:

• they do not require expensive computation, since they only threshold the activations at zero;

• saturation of sigmoid units for both small and large inputs leads to the vanishing gradient problem (see section 2.2.4); ReLU avoids this phenomenon thanks to the shape of its derivative;

• they allow a sparse representation that, in turn, guarantees [11]:

  – information disentangling: small changes of the input do not affect the feature representation, as is the case with a dense representation;

  – efficient variable-size representation: the model itself can reduce its representational power, i.e. the number of active neurons, depending on the input;

  – easier linear separability of the representation.

A drawback of these units is the fact that they do not learn when they receive negative input: when too many units lie in this region, the effective capacity of the model can be dramatically reduced.

2.2.2 Feed Forward Neural Networks

When a neural network is organized in a feed-forward chain, without cycles, it is called a Feed Forward Neural Network: the outputs of a layer represent the inputs of the next layer.

The Universal Approximation Theorem

Feed-forward neural networks represent a universal approximation framework: in fact, the universal approximation theorem states that [4] "every continuous function defined on a compact set of the nth-dimensional vector space over the real numbers can be arbitrarily well approximated by a feed-forward artificial neural network with one hidden layer (with a finite number of artificial neurons)".

In other words, for a big enough neural network, there always exists a parameter configuration that makes the network able to approximate any such function. Obviously, it is not guaranteed that the training procedure will guide the model to that parameter configuration.

In most applications, the limitation on the class of continuous functions does not represent a problem, since a discontinuous function can be approximated by a continuous one.

An example of feed forward Neural Network, presented in [27]:

The network in figure 2.4 is a shallow neural network composed of three layers: input layer, hidden layer, output layer. It is used to classify handwritten digits: input images are 28x28 and the ten output neurons correspond to the digits [0, 1, ..., 9]. Every pixel of the image is a neuron in the first layer (input layer); the second layer (hidden layer) produces the input of the third layer (output layer), where the number of units is equal to the number of classes to discern. The label predicted by the model for the input image is the one corresponding to the firing neuron.

Figure 2.4: Example of fully-connected neural network for handwritten digit recognition. Source [27]
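A hypothetical Keras sketch of the network in figure 2.4: 784 input pixels, one hidden layer and ten output units. The hidden layer size (30 units) and the sigmoid activation are assumptions in the spirit of [27], not values taken from the figure.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),      # input layer: 28x28 pixels
    tf.keras.layers.Dense(30, activation="sigmoid"),    # hidden layer
    tf.keras.layers.Dense(10, activation="softmax"),    # output layer: one unit per digit
])
```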


2.2.3 Gradient Based Learning in Feed Forward Neural Networks

Output Units and Cost Function

The choice of the output units determines the distribution represented in the output layer. For a Multinoulli output distribution (n-way classification problem), softmax units are chosen. Given the input z = wx + b:

$$\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$$

The cost function is chosen in conjunction with the output units: the most common cost function is the cross-entropy between training data and model prediction. Considering for simplicity a binary classification problem, given x the input example, a the related activation, y the true label, and summing over the whole training set of size n, the cross-entropy is:

$$C = -\frac{1}{n} \sum_x \left[ y \ln(a) + (1 - y) \ln(1 - a) \right]$$

When y and a have the same value, the cost function is zero.

When y and a have different values, the cost function of the single example becomes positive.
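The two quantities defined above can be sketched in a few lines of NumPy; the clipping constant is only there to avoid log(0) and is not part of the definitions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))          # shift for numerical stability
    return e / e.sum()

def binary_cross_entropy(y, a, eps=1e-12):
    a = np.clip(a, eps, 1 - eps)       # avoid log(0)
    return -np.mean(y * np.log(a) + (1 - y) * np.log(1 - a))
```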

Forward Propagation

During the forward step, the network accepts an input and computes the related output. In figure 2.4 the information flows from left to right.

This step produces a scalar cost that is usually the sum of two contributions: the task-related cost function (e.g. the cross-entropy) and the regularization term.


Back-propagation algorithm

The gradient descent algorithm requires that the gradient of the scalar cost be calculated with respect to the weights of the network. This step is efficiently accomplished by the back-propagation algorithm. For a detailed description of the algorithm and its theoretical basis it is possible to refer to [27].

The final equations are:

• for bias parameters: $\frac{\partial C}{\partial b} = \delta$

• for weight parameters: $\frac{\partial C}{\partial w} = a_{in} \delta_{out}$

where δ is the error term related to the neuron; it is proportional to the derivative of the activation function and its formulation originates from the chain rule of calculus.

Stochastic Gradient Descent

During a training step, the gradient descent algorithm updates the weights of the whole network according to equation 2.1. Back-propagation makes it possible to compute the gradient of the cost function of a single example, but theoretically the average value over the entire training set is needed. In practice, the gradient in the formula can be evaluated on n examples, and this value identifies three possible scenarios:

• online gradient descent: optimization uses one example at a time (n = 1);

• deterministic gradient descent: all the training examples are used (n = training set size);

• minibatch or stochastic gradient descent: a minibatch (or simply batch) of n examples is used (1 < n < training set size).


Stochastic gradient descent is a common choice in most practical applications: higher values of n guarantee a better estimate of the gradient, while lower values imply a lower computation time. In practice, there are also hardware-related considerations: the computation over a batch of n examples can be performed more efficiently than n computations on single examples thanks to parallel computing. On the other hand, large values of n require a significant amount of memory.

When the network performs the forward and backward pass on a single batch of examples, it is said to have executed a step or an iteration. An epoch of training is composed of the number of steps that allows the network to see all the training examples.

Improving the optimization strategy Stochastic Gradient Descent is the most popular method for optimization in DNN. Nevertheless several variants have been proposed in order to accelerate learning: momentum algorithm, for example, represents a common choice in convolutional neural networks. The update rule is:

$$\Delta x = \alpha \Delta x - \varepsilon \nabla_x f(x) \tag{2.4}$$

$$x \leftarrow x + \Delta x \tag{2.5}$$

The only difference between this formulation and classic SGD (equation 2.1) is the term α∆x: it can be considered a velocity term in the parameter space. The relationship between α and ε determines how much previous updates influence the current update, helping to keep the direction of motion in the parameter space and to avoid oscillations. A visual representation of the momentum effect is shown in figure 2.5b, in comparison with classic gradient descent in figure 2.5a.


(a) Without Momentum (b) With Momentum

Figure 2.5: Effect of Momentum in SGD. Source [12]
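In TensorFlow/Keras, the momentum variant of SGD corresponding to equations (2.4)-(2.5) can be selected as shown below; the learning rate and momentum factor are arbitrary example values, not the ones used in the experiments.

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)
```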

2.2.4 Major issues in training deep architectures

Convex and non-convex optimization: Because of the non-linearity of a neural network, the commonly used loss functions become non-convex. In non-convex optimization with iterative methods, convergence is not guaranteed and there is a high sensitivity to the parameter initialization. It is worth underlining that non-convexity in neural networks often arises from an intrinsic property: the high number of parametrized latent variables determines the non-identifiability of the model. That is to say that, for any model, there exists a finite number of other parameter configurations that produce equivalent models. This kind of non-convexity is due, for example, to the weight-space symmetry and does not actually represent a problem for the optimization procedure.

Vanishing gradient problem: this problem arises when the gradient of the cost function is calculated with respect to the weights in the earlier layers of a deep network. The back-propagation algorithm requires a chain multiplication of n terms, where n is the depth of the layer; with an improper choice of activation function and initialization strategy it can happen that all of these terms have small absolute values. This leads to the vanishing gradient problem. The use of ReLUs instead of sigmoid units helps to avoid this phenomenon.


Overfitting In addition to the regularization methods mentioned in section 2.1.2, another common technique used in neural networks is dropout: when dropout is applied to a layer, at every training step every unit of that layer is kept with probability p, otherwise it is temporarily ignored (figure 2.6). During evaluation (i.e. inference on the validation or test set) dropout is turned off and all the neurons are kept. The dropout effect can be seen from two points of view. On one hand, by randomly sampling the neurons to delete at every step, it reduces the co-adaptation phenomenon between neurons; on the other hand, it results in a procedure that trains a variety of different neural networks (i.e. with different numbers of units per layer). These subnetworks are supposed to overfit the training set in different ways, so the global network should be able to generalize better.

Figure 2.6: Dropout: Some neurons of the network are deleted. In the next step they will be resumed and another random subset will be deleted. Source [27]
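A minimal Keras illustration of a dropout layer; the rate of 0.5 is an arbitrary example (note that in Keras the argument is the probability of dropping a unit, i.e. 1 − p).

```python
import tensorflow as tf

# During training each unit of the previous layer is dropped with probability
# rate=0.5 (kept with probability 0.5); at inference time dropout is disabled.
drop = tf.keras.layers.Dropout(rate=0.5)
```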

Internal Covariate Shift [17] In deep feed-forward networks the input to a layer is affected by the outcomes of all the preceding layers, and the amplification of a change in the distribution of parameters increases with the depth of the network. In stochastic gradient descent, the input distribution changes at every step according to the statistics of the current batch. This phenomenon requires continuous adaptation and implies a learning slowdown: it is referred to as Covariate Shift (Internal when affecting hidden layers). A Batch Normalization layer consists in a transformation of the input of an activation function in order to avoid covariate shift effects. Given an input batch B = {x_1...n}, the transformation applied to each activation independently is:

$$x_i \leftarrow \gamma \, \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} + \beta \tag{2.6}$$

where $\mu_B$, $\sigma_B$ are respectively the mean and standard deviation of the feature over the batch, and $\gamma$, $\beta$ are trainable scale and shift parameters that allow the network to preserve its representational power.

In convolutional networks the formula above is not evaluated on each activation independently, but is the same for different activations of the same feature map. This preserves the spatial properties of a convolutional layer.

It is worth mentioning the main advantages due to the introduction of batch normalization:

• it allows a higher learning rate, and consequently a learning speed-up;

• it reduces overfitting and allows dropout and L2 regularization to be reduced;

• the network becomes more resilient to parameter scale and initialization;

• it allows the bias to be removed, since β itself represents a shift factor.
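A NumPy sketch of the transformation in equation (2.6) applied to a batch of activations; gamma and beta would be trainable parameters, here fixed to example values.

```python
import numpy as np

# x has shape (batch_size, n_features); normalization is per feature.
def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta
```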

2.3 Convolutional Neural Network

A Convolutional Neural Network (CNN) is a type of feed-forward artificial neural network particularly suitable for computer vision applications. There are two advantages in using CNNs. First, thanks to their architecture, they can take into account the spatial structure of the input; this is a desired property when the input neurons are the pixels of an image. Second, they require fewer parameters than fully-connected networks. This means that they are faster to train, less prone to overfitting, and that deeper and more powerful models can be designed.

2.3.1 Architecture and properties

These properties originate from the following ideas behind CNN [27]:

• Local Receptive Fields
• Shared Weights
• Pooling

Local Receptive Fields Only a limited number of neurons in the i-th layer is connected to a given neuron in the (i+1)-th layer (figure 2.7). This sparse interaction does not take place in a fully-connected network, where every neuron of a layer is connected to every neuron in the following layer.

Shared weights The weights used to connect the first receptive field in figure 2.7 with the first neuron in the hidden layer are the same for every connection between receptive fields and corresponding neurons. This is the core of the convolution operation: the (i+1)-th layer can be seen as the result of the convolution between the i-th layer and a kernel. The kernel size corresponds to the size of the receptive field, while its values are the shared parameters. Since this filtering extracts a feature of the i-th layer, the (i+1)-th layer is often referred to as a feature map.

For a two-dimensional input I and kernel K the convolution operation is:

$$S(i, j) = (K * I)(i, j) = \sum_m \sum_n I(i+m, j+n)\, K(m, n)$$


Figure 2.7: Local Receptive Field. Source [27]
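A straightforward NumPy sketch of the two-dimensional operation defined above (single channel, no padding, stride 1); it is meant only to make the formula concrete, not to be an efficient implementation.

```python
import numpy as np

def conv2d(I, K):
    kh, kw = K.shape
    H, W = I.shape
    S = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(S.shape[0]):
        for j in range(S.shape[1]):
            # S(i,j) = sum_m sum_n I(i+m, j+n) K(m,n)
            S[i, j] = np.sum(I[i:i + kh, j:j + kw] * K)
    return S
```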

Pooling A convolutional layer is generally composed of three stages: a linear stage (convolution), a non-linear stage (the activation function, see 2.2.1), and a pooling stage. Pooling reduces the size of a layer: a group of neurons is replaced by one neuron representing a summary statistic of them. Max-pooling with a 2x2 kernel, for example, halves the dimensions of the layer by choosing the maximum values of non-overlapping 2x2 windows of neurons. Reducing the number of neurons (and thus connections) is not the only consequence of the pooling stage: pooling guarantees an increased translation invariance because it maps the information of a group of neurons into only one neuron of the next layer; it is more important to know whether a feature is present or not than its exact location.

An example of CNN architecture

In a convolutional layer it is possible to extract more than one feature at a time. For example, a simple convolutional neural network for handwritten digits recognition is shown in figure 2.8:


• input layer: the 28x28 image;

• convolutional layer:

  – 20 feature maps are obtained by using a kernel of size [20x5x5] without applying zero-padding;

  – the pooling layer guarantees the size reduction and the property of translation invariance;

• fully-connected hidden layer;

• output layer.

Figure 2.8: Example of simple Convolutional Neural Network. Source [27]

According to the complexity of the task, the architecture can be modified. By adding another convolutional-pooling layer before the fully-connected layer, the depth of the model, and thus its capacity, would be increased. A minimal Keras sketch of the architecture in figure 2.8 is given below.
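The hypothetical sketch below mirrors the structure of figure 2.8; the ReLU activations and the size of the fully-connected layer (100 units) are assumptions, not values stated in the figure.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(20, kernel_size=5, activation="relu",
                           input_shape=(28, 28, 1)),      # 20 feature maps, 5x5 kernels, no padding
    tf.keras.layers.MaxPooling2D(pool_size=2),             # 2x2 max-pooling
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(100, activation="relu"),         # fully-connected hidden layer
    tf.keras.layers.Dense(10, activation="softmax"),       # output layer
])
```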

2.3.2 Physiological basis of CNN

Convolutional neural networks are inspired by the mammalian visual system. Research in the last decades has led to the following simplified view of the visual system, presented in [12] (figure 2.9):

Figure 2.9: From the input to the primary visual cortex. Source: www.posturepro.net/eye-tracking-and-sports-performance/

1. light stimulates the retina;

2. neurons in the retina perform simple preprocessing on the signal without altering its representation;

3. the signal passes through the optic nerve and through the lateral geniculate nucleus

4. the signal arrives in V1, the primary visual cortex, in the back of the head, where the first advanced processing of the input happens. A convolutional layer tries to emulate the following properties of V1:

• V1 has a bidimensional structure, consistent with the structure of the input signal (the acquired image)


• V1 is composed of both simple and complex cells:

  – simple cells can be modelled as linear detection units that respond to specific features of small receptive fields;

  – complex cells are similar to the simple ones but their activations are invariant to small shifts of the position of the stimulating features. Pooling layers are inspired by these cells.

The structure of alternating detection and transformation-invariant stages is repeated in the following areas of the visual system;

5. the signal flows from V1 to V2, V4 and the inferotemporal cortex in the first 100 ms of seeing an object. In this lapse of time the inferotemporal cortex behaviour is similar to the last layer of a convolutional network for basic object recognition.

6. moving deeper into the brain, the signal reaches cells that are activated by more and more abstract representations of the input. The so-called grandmother cells, situated in the medial temporal lobe, are more powerful than modern convolutional networks. The name grandmother cells originates from the fact that a person could have a neuron that activates when he or she sees, hears, or discriminates a specific entity, for example his or her grandmother.

Research has tried to better understand the functionality of the biological neurons of V1. One approach, cited in [12], uses reverse correlation: it measures with an electrode the response of the neuron to white-noise images given as input, and fits a model in order to compute the neuron weights. The experimental results have shown that these weights correspond to Gabor filters, which can be expressed as a Gaussian kernel modulated by a sinusoidal plane wave.

The fact that CNNs autonomously learn similar kinds of functions for texture representation is an impressive correspondence between machine learning and neuroscience. This is obviously not enough to claim that CNNs represent a model of the brain's visual system.

Main differences between convolutional neural networks and the visual system

1. In convolutional neural networks, input images are at a fixed and full resolution. In the human visual system images are at a low resolution except for a small part of the retina, the fovea. The brain obtains information about the scene by making quick eye movements called saccades;

2. the human visual system also relies on the presence of other senses;

3. the human visual system pathways are circuits with feedback from higher levels to earlier areas like V1. This has not been found particularly advantageous in CNNs;

4. the distinction between simple and complex cells is not sharp, and maybe the functions they compute are substantially different from those of the neuron units in CNNs.

2.4 Advanced Methods in Classification

2.4.1 Techniques to improve classification accuracy: Ensemble methods

As stated in [14], an ensemble combines a series of learned models (or base classifiers) with the aim of creating an improved composite classification model. The classification output of an ensemble for a new example is based on the votes of each base classifier.

Random forests represent a popular example of ensemble methods, where the classification is performed by considering the votes of several decision trees.


Ensembles of Neural Networks

Ensemble solutions have been widely exploited with neural networks. Giacinto et al. in [9] underline the key point about this topic with regard to image classification: experimental results show that an ensemble can outperform the best single net, provided that the nets make different errors.
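A minimal sketch of ensemble voting for classification, assuming hard (majority) voting over integer class labels; the vote matrix below is purely illustrative.

```python
import numpy as np

# votes has shape (n_classifiers, n_examples): votes[c, e] is the class predicted
# by base classifier c for example e. The ensemble picks the most voted class.
def majority_vote(votes):
    votes = np.asarray(votes)
    n_classes = votes.max() + 1
    counts = np.apply_along_axis(
        lambda col: np.bincount(col, minlength=n_classes), 0, votes)
    return counts.argmax(axis=0)

ensemble_pred = majority_vote([[0, 2, 1],
                               [0, 2, 2],
                               [1, 2, 2]])   # -> array([0, 2, 2])
```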

From this perspective, the task of producing error-independent networks is non-trivial: different training sessions could lead to different parameter configurations, but these are often equivalent because of the weight-space symmetry (see section 2.2.4).

The approaches used to create ensembles of neural networks can be grouped into the following methods, in descending order of effectiveness:

• variation of the network type;

• variation of the training data, in terms of sampling of the dataset or preprocessing of the images;

• variation of the network architecture;

• variation of the initial distribution of random weights.

Another relevant issue is the design strategy of an ensemble. In order to produce an ensemble of n networks, the direct strategy aims to produce error-independent networks and collect them directly into the ensemble; the overproduce and choose strategy, instead, aims to produce more than n networks and then find a way to select the best ensemble of n networks. A possible solution, proposed in [9], uses clustering in order to group networks with correlated errors and then chooses the best network from each cluster.


2.4.2 Dealing with dataset size: Transfer Learning

The most common approaches to tackle an image classification problem are presented in figure 2.10. The most appropriate approach strongly depends on the available dataset of interest [32].

Figure 2.10: Most common approaches for CNN Training

Training from scratch It consists in designing a network ex novo and training it from an initial random configuration of the parameters. This technique requires a dataset of sufficient size.

Transfer Learning According to [14], transfer learning is a technique that aims to extract knowledge from a source task and apply it to a new target task. In a neural network context, this technique consists in learning a parameter configuration by training a network on a general, big dataset and reusing it for the dataset of interest. It can be used in two ways:

• Frozen Network: the parameter configuration learned on the bigger dataset is fixed and used as a feature extractor. A new classifier (fully-connected layer) is added on top and trained from scratch in order to perform the classification of interest. As a rule of thumb, this is a convenient choice when the dataset of interest is small and similar to the first dataset (a minimal sketch of this option is given after the list).

• Initialization Network for fine-tuning: the parameter configuration learned on the bigger dataset is used as initialization and the training procedure also involves the layers of the pre-trained network. This is a convenient choice when the dataset of interest is large and similar to the first dataset. If the target dataset is small and the number of parameters of the network is high, fine-tuning can result in overfitting.
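The sketch below illustrates the frozen-network option with Keras; the VGG16 base pre-trained on ImageNet is only a stand-in example (the thesis pre-trains its own networks on CK+ and SFEW), and the 7-class head and input shape are assumptions.

```python
import tensorflow as tf

# Frozen convolutional base used as a feature extractor, plus a new classifier head.
base = tf.keras.applications.VGG16(include_top=False, weights="imagenet",
                                   input_shape=(48, 48, 3), pooling="avg")
base.trainable = False                                 # freeze the pre-trained parameters

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(7, activation="softmax"),    # 7 emotion classes
])
# For fine-tuning, set base.trainable = True and train with a small learning rate.
```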

The idea of transfer learning comes from the observation of a common phenomenon: the first layers of convolutional neural networks trained on images tend to learn the same features (edges, Gabor filters, color blobs), independently of the task. The features learned in the first layers can be considered general, while those in the last layers are task-specific. This means that in both the previous approaches it is possible to choose the degree of generality of the learned features to transfer.

Performances of Transfer Learning in Neural Networks

The authors in [38] have tried to investigate the performance of transfer learning, both with frozen networks and fine-tuned networks. The results can be summarized as follows:

• transfer learning with fine-tuning for similar datasets leads to improvement in performance;

• transfer learning without fine-tuning can cause a performance drop; the performance drop is due to representation specificity and to the fragile co-adaptation of features of successive layers;


• the co-adaptation issue also arises when the same dataset is used for re-training, and disappears with fine-tuning;

• the performance drop due to representation specificity is bigger when the datasets used are more dissimilar.

2.5 Practical Methodology: the choice of Hyper-Parameters

Weights and biases are the parameters of a network: they define the behaviour of neurons and are modified during the training procedure. That is why they could be referred to as trainable parameters.

However, the behaviour of an algorithm for training a neural network is affected by a set of hyperparameters. Examples of hyperparameters are:

• Number of hidden units: affects the capacity of the model;

• Learning rate: determines the step size in the learning procedure (equation 2.1);

• Learning rate decay strategy;

• Mini-batch size;

• Number of epochs of training and stop criterion;

• Convolutional kernel width, stride, padding: influence the behaviour of the convolution operation and thus the number of parameters of the model;

• Dropout rate: represents the probability of deleting a neuron from a dropout layer;

• Weight decay coefficient: determines the weight of the L2 regularization term in the loss function;

• Momentum factor: used when adding the momentum optimizer to SGD (equations 2.4 and 2.5);

• Weights initialization strategy;

• Preprocessing strategy;

• Data augmentation strategy;

• Seed for random number generation.

Hyperparameters selection The choice of the above-mentioned hyperparameters is a crucial step in the design of a neural network. They influence the training error, the generalization error and the demand for computational resources (memory, computation time). Manual tuning requires a good command of the effect of every hyperparameter and of their interactions. A common approach to this hyper-optimization problem is to perform grid search or random search in the hyperparameter space.

In practice, when the objective task has been studied extensively, it is advisable to start from literature models, algorithms and parameter configurations that are known to perform best.
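As an illustration of random search, the following minimal sketch samples a few hyperparameter configurations at random and keeps the one with the best validation accuracy. The function train_and_evaluate and the sampling ranges are hypothetical placeholders for the actual training routine and search space.

import random

def random_search(train_and_evaluate, n_trials=20):
    # train_and_evaluate(config) is assumed to train a network with the given
    # hyperparameters and return its validation accuracy.
    best_config, best_accuracy = None, 0.0
    for _ in range(n_trials):
        config = {
            # learning rate sampled log-uniformly between 1e-4 and 1e-1
            'learning_rate': 10 ** random.uniform(-4, -1),
            'dropout_rate': random.uniform(0.2, 0.6),
            'batch_size': random.choice([32, 64, 128]),
            'weight_decay': 10 ** random.uniform(-5, -3),
        }
        accuracy = train_and_evaluate(config)
        if accuracy > best_accuracy:
            best_config, best_accuracy = config, accuracy
    return best_config, best_accuracy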


Chapter 3

Proposed Model and Experimental Setup

3.1 Goals of the present work

There is considerable evidence that ensemble voting improves test accuracy by 2-3% [28]. As reported in section 2.4.1, this significant boost in performance is possible if the ensemble is made up of error-independent networks, i.e. networks that produce different FER decisions.

The final goal of the present work is to propose an alternative method for the design of an ensemble of neural networks to tackle the Facial Expression Recognition problem on the FER-2013 dataset. Hereafter, this method will be referred to as the Pretrain Strategy.

The proposed strategy is compared to a method used by Kim et al. in [19]: hereafter, Kim’s method will be referred to as the Seed Strategy.

In order to enable a fair comparison of the results, experiments are carried out using the same parameter settings.


3.2 Experiments

Two different architectures, representing state-of-the-art models, are used in order to evaluate the general validity of the results. These models and their performance are described in section 3.2.3.

Both ensemble methods use nine networks, which differ because of two factors: the first factor consists, for both strategies, in applying three different transformations to the input images. A description of these pre-processing procedures can be found in section 3.2.2. The second factor of variation characterizes each strategy and is described in the following paragraphs.

3.2.1 Design of Ensemble Methods

What distinguishes the ensemble strategies is how weights are initialized; the weight of a connection between two neurons is represented by the term w in equation 2.3.

Seed Strategy

Three different seeds are used for random weight initialization.

The seed initializes the pseudorandom number generator: theoretically, if the seed is fixed, the generator always produces the same sequence. The elements of the algorithm that are influenced by the random number generator are listed below (a minimal sketch of how the seed can be fixed is given after the list):

• random weight initialization: weights are often initialized by drawing from a normal or uniform distribution.

• shuffling of the dataset: the batching operation consists in picking n images from the dataset; randomly shuffling the dataset influences the order in which images are provided to the network during training;


• data augmentation: any modification of the input images that involves a random operation, for example a random crop, a random scale factor or a random angle of rotation.
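The following is a minimal sketch, assuming the TensorFlow/NumPy stack used in this work, of how a single seed value can be propagated to the relevant random number generators; the three seed values in the loop are purely illustrative.

import random
import numpy as np
import tensorflow as tf

def fix_seed(seed):
    # Python-level randomness (e.g. shuffling lists of file names)
    random.seed(seed)
    # NumPy randomness (e.g. random crops or random rotation angles)
    np.random.seed(seed)
    # TensorFlow randomness (weight initializers, dataset shuffling);
    # in TensorFlow 1.x the equivalent call is tf.set_random_seed(seed).
    tf.random.set_seed(seed)

# The Seed Strategy uses three different seed values, for example:
for seed in (0, 1, 2):
    fix_seed(seed)
    # ... build and train one member of the ensemble ...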

Pretrain Strategy

Instead of fixing seeds for random weight initialization, three different initial conditions are obtained by:

• random initial distribution of the weights;

• pre-training the network on CK+ dataset;

• pre-training the network on SFEW dataset.

This idea relies on the assumption that each of the datasets, described in section 1.2, has a unique fingerprint, for example typical illumination and pose conditions. This hypothesis is supported by the fact that the cross-database experiments (i.e. training on one dataset and testing on another) carried out in [26] achieved unsatisfactory results.

Since CK+ consists of posed and lab-controlled images, while SFEW is by definition in-the-wild, they should fulfil the purpose.

Details about how pre-training is done in practice can be found in appendix A.2.2. Since the seed determines the behaviour of several components, it is set coherently with the Seed Strategy in order to enable a fair comparison.
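Purely as an illustration of the weight-transfer step (the actual procedure is the one described in appendix A.2.2), the following sketch uses the tf.keras API: a network is pre-trained on the auxiliary dataset, its weights are saved, and an identical network is initialized with those weights before being trained on FER-2013. The helper build_network, the input size, the file name and the commented-out fit calls are hypothetical placeholders.

import tensorflow as tf

def build_network():
    # Hypothetical stand-in for one of the architectures of section 3.2.3.
    inputs = tf.keras.Input(shape=(48, 48, 1))   # input size is an assumption
    x = tf.keras.layers.Conv2D(32, 3, activation='relu')(inputs)
    x = tf.keras.layers.Flatten()(x)
    outputs = tf.keras.layers.Dense(7, activation='softmax')(x)
    return tf.keras.Model(inputs, outputs)

# 1) pre-train one instance on the auxiliary dataset (CK+ or SFEW)
source_model = build_network()
source_model.compile(optimizer='sgd', loss='sparse_categorical_crossentropy')
# source_model.fit(ck_images, ck_labels, ...)    # pre-training step
source_model.save_weights('pretrained_ck.h5')

# 2) build an identical network, use the saved weights as initialization
#    and train it on FER-2013 (all layers are updated)
target_model = build_network()
target_model.load_weights('pretrained_ck.h5')
target_model.compile(optimizer='sgd', loss='sparse_categorical_crossentropy')
# target_model.fit(fer_images, fer_labels, ...)  # training on the target dataset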

3.2.2 Dataset Setup and Preprocessing Step


Figure 3.1: Input pipeline: from original images to the network input

Dataset split and final face crops

FER-2013 The official split of the FER-2013 dataset is used in the present work. The only modification is the removal of 11 "black" images (all pixels equal to zero). This results in 28699 images for the training set, 3588 images for the validation set and 3589 for the test set. The provided face crops are used without an additional face detection stage.

CK The 1308 images of the dataset are split into training and validation sets. The only constraint is that images of the same subject belong to the same set. In order to accomplish this, a 1/10 fraction of the subjects is sampled to generate the validation set. This results in 100 images for the validation set and 1208 images for the training set.

Faces are detected using the Haar Cascade frontal face detector from OpenCV [3]. Images are then resized to 162x162 and converted to grayscale, as in the sketch below.
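The following sketch illustrates this step with the OpenCV Python bindings; the detector parameters and the choice of keeping the largest detection are assumptions made only for illustration.

import cv2

def detect_and_crop(image_path, size=162):
    # Haar Cascade frontal face detector shipped with OpenCV
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')
    image = cv2.imread(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None  # no face found
    # keep the largest detection
    x, y, w, h = max(faces, key=lambda box: box[2] * box[3])
    crop = gray[y:y + h, x:x + w]
    return cv2.resize(crop, (size, size))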

SFEW The original split of the Aligned-SFEW 2.0 dataset consists of 890 images for the training set and 431 images for the validation set. For the purposes of the present work, a new split is proposed in order to increase the training set size. A 1/10 fraction of the movie list of the entire dataset is sampled to generate the validation set. This results in 1036 training images and 285 validation images. For consistency


with the CK dataset, the images are then converted to grayscale and resized from 143x181 to 162x162. The modification of the aspect ratio introduces a distortion.

Additional Preprocessing steps

The authors of [19] have demonstrated that fusing aligned and non-aligned face information improves the test accuracy on FER by 1%. Nevertheless, we avoid explicit landmark-based registration; this choice is justified in [28] for the following reasons:

• the network becomes more able to generalize, by learning to compensate for pose variation;

• the results are not affected by registration error;

• the model is conceptually simpler than one that requires face registration.

Modification of the histogram

Three versions of each dataset are obtained as a factor of variation to generate error-independent networks, as proposed in [19]. An example of the resulting images, taken from the FER-2013 test set, is provided in figure 3.2.

Figure 3.2: Three versions of the same image after preprocessing: (a) Default, (b) HistEq, (c) iNor

Default images No transformation is performed on the original histogram of the images (figure 3.2a).


Histogram Equalization The Python function used to perform histogram equalization is reported in the following snippet:

import numpy as np

def image_histogram_equalization(image, number_bins=256):
    # image histogram, normalized to a probability density
    image_histogram, bins = np.histogram(image.flatten(), number_bins,
                                         density=True)
    # cumulative distribution function, rescaled to the [0, 255] range
    cdf = image_histogram.cumsum()
    cdf = 255 * cdf / cdf[-1]
    # new pixel values: map the original values through the cdf
    image_equalized = np.interp(image.flatten(), bins[:-1], cdf)
    return image_equalized.reshape(image.shape)

The operation increases the contrast of the original image, as can be seen in figure 3.2b.

Illumination Normalization The MATLAB function provided by [16] is used to perform illumination normalization. This is accomplished by applying isotropic diffusion to smooth the image. The resulting image is illustrated in figure 3.2c.
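The MATLAB routine of [16] is not reproduced here. Purely as a rough sketch of the underlying idea, and under the assumption that the normalization amounts to dividing the image by a smoothed estimate of the illumination (with Gaussian smoothing used below as a simple stand-in for the diffusion step), the operation could look as follows:

import numpy as np
from scipy.ndimage import gaussian_filter

def illumination_normalization(image, sigma=10.0, eps=1e-6):
    # estimate the slowly varying illumination component by smoothing
    illumination = gaussian_filter(image.astype(np.float64), sigma=sigma)
    # divide out the illumination estimate and rescale to [0, 255]
    reflectance = image / (illumination + eps)
    reflectance -= reflectance.min()
    reflectance /= (reflectance.max() + eps)
    return (255 * reflectance).astype(np.uint8)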

Feeding the network: Normalization

A common practice before feeding the image to the network is to zero-center the data [20, 28, 31, 32]. As suggested in these papers, a global mean value µ and a global standard deviation value σ are evaluated over the whole training set of interest,


after the preprocessing procedure. The normalization step consists in the following operation:

\[ I(x, y) \leftarrow \frac{I(x, y) - \mu}{\sigma} \]

It is worth noting that µ and σ are computed on the training set. The transformation is then applied to every image both at training time and at evaluation time.
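A minimal sketch of this step is reported below; the preprocessed training images are assumed to be stored in a single NumPy array.

import numpy as np

def compute_normalization_stats(train_images):
    # global mean and standard deviation over the whole training set
    return train_images.mean(), train_images.std()

def normalize(image, mu, sigma):
    # applied to every image, both at training time and at evaluation time
    return (image - mu) / sigma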

3.2.3 Convolutional Neural Networks: Two Architectures

A detailed description of the architectures and the hyperparameter configuration can be found in Appendix A.1.

Basic Network The first network is inspired by the work of Kim et al. [19]. A comparatively shallow convolutional neural network is designed ad hoc for the FER task. The architecture is illustrated in figure 3.3: the input layer is followed by three convolutional and max pooling layers with 32, 32 and 64 feature maps, respectively. A fully connected layer of 1024 neurons precedes the output layer composed of 7 units, i.e. the seven emotion classes of the FER-2013 database. As proposed in [28], a batch normalization layer is added after every convolutional and fully connected layer.
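A sketch of this architecture in the tf.keras API is reported below; the kernel size and the input resolution are illustrative assumptions, since the actual values are those given in Appendix A.1.

import tensorflow as tf

def build_basic_network(input_shape=(48, 48, 1), num_classes=7):
    # Three convolution / batch-norm / max-pooling stages (32, 32, 64 maps),
    # followed by a 1024-unit fully connected layer and a 7-way softmax.
    inputs = tf.keras.Input(shape=input_shape)
    x = inputs
    for filters in (32, 32, 64):
        x = tf.keras.layers.Conv2D(filters, kernel_size=5,  # assumed kernel size
                                   padding='same', activation='relu')(x)
        x = tf.keras.layers.BatchNormalization()(x)
        x = tf.keras.layers.MaxPooling2D(pool_size=2)(x)
    x = tf.keras.layers.Flatten()(x)
    x = tf.keras.layers.Dense(1024, activation='relu')(x)
    x = tf.keras.layers.BatchNormalization()(x)
    outputs = tf.keras.layers.Dense(num_classes, activation='softmax')(x)
    return tf.keras.Model(inputs, outputs)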

Figure 3.3: Scheme of Basic Network: ad-hoc model

Deep Network The second network is similar to the VGG-B architecture proposed in [31] by the Visual Geometry Group of the University of Oxford. The original network won the localization task and ranked second in the classification task of the ImageNet 2014 Large Scale Visual Recognition Challenge [30], over 1000 classes of objects. For the FER task, the architecture used in the present work is modified according to the specifications proposed in [28]. The architecture is illustrated in figure 3.4: it is composed of four CCP blocks with 32, 64, 128, 128 feature maps and a reduced kernel size of 3x3. The fully connected and output layers are the same as in the Basic Network.

Figure 3.4: Scheme of Deep Network: VGG-inspired

3.2.4 Training procedure

Stochastic gradient descent is used as the optimizer for the cross-entropy loss function. The details of the parameter choices are reported in appendix A.2.
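A minimal sketch of this training setup is reported below, reusing the build_basic_network sketch from section 3.2.3; the learning rate, momentum, batch size and number of epochs shown here are illustrative placeholders, since the actual values are those in appendix A.2.

import tensorflow as tf

# one of the two architectures of section 3.2.3 (sketch above)
model = build_basic_network()

# SGD (with momentum) minimizing the categorical cross-entropy
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)  # illustrative values
model.compile(optimizer=optimizer,
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# model.fit(train_images, train_labels,
#           validation_data=(val_images, val_labels),
#           batch_size=128, epochs=100)   # illustrative values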
