Development of a mobile application for Food Recognition using Convolutional Neural Networks

Università di Pisa

Facoltà di Ingegneria

Corso di Laurea Magistrale in Computer Engineering


Candidate: Michele De Bonis

Supervisors: Giuseppe Amato, Fabrizio Falchi, Claudio Gennaro

5 May 2017

“Your diet is a bank account. Good food choices are good investments.”

Bethenny Frankel

“There is no sincerer love than the love of food.”


A.A. 2016-2017

Abstract

The aim of this thesis is to develop an Android mobile application for food recognition. To this end, Convolutional Neural Networks (CNNs) have been trained on datasets of 101 classes of dishes available in the literature. CNNs usually run on powerful hardware. Since the objective is to run the CNN on a mobile phone, the challenge is to identify the best one in terms of memory occupation, computational speed and accuracy.

The analysis covers the following CNNs: AlexNet, Residual Network, GoogLeNet (Inception), VGG, SqueezeNet, BWN and XNOR-Net. The training of the CNNs follows two main steps and has been carried out on both the ETHZ and UPMC datasets, which come from universities in Zurich and Paris, respectively.

In the first step, the CNNs have been trained from scratch in order to pick the best in terms of accuracy, computational speed and memory occupation. Then, the best CNNs previously identified have been fine-tuned.

The frameworks used for training and testing the networks are Caffe, which is probably the most widely used, and Torch. In order to simplify both training and testing, the NVIDIA Deep Learning GPU Training System (DIGITS) has been used. This system provides an efficient and easy-to-use web interface for formatting images and setting network parameters. Once the best CNN is identified, the application is developed. The application works in two modes: running the CNN locally on the smartphone using an implementation based on RenderScript, or querying the CNN deployed on a server as a Web Application. Afterwards, a statistical study of the results has been carried out in order to improve the user experience. By establishing a threshold on the score assigned by the CNN to a certain dish, it is possible to tell the user whether the prediction is reliable. Finally, the application has been designed and developed with Android Studio.

On the basis of the results of this research, it can be concluded that the best accuracy is obtained by GoogLeNet, which also has a reduced size. The CNN has been ported to the mobile phone using an implementation based on RenderScript. Moreover, the CNN has been deployed in a Web Application.


Acknowledgements

This work would have been impossible without the support of my project advisors. I would like to thank professors Giuseppe Amato and Fabrizio Falchi for the hints and the help in analyzing data.

Last but not least, I would like to thank Mohammad Motamedi for teaching me how to use his code in the implementation of the Convolutional Neural Network.

1.3.1.1 Dataset
1.3.1.2 Training
1.3.1.3 Validation
1.3.1.4 Testing
1.3.2 Frameworks
1.3.2.1 Caffe
1.3.2.2 Torch
1.3.3 Tools
1.3.3.1 DIGITS
1.3.3.2 MATLAB
1.3.3.3 OpenCV
1.3.3.4 Android Studio
1.4 Related Work

2 Overview of the datasets
2.1 Food 101
2.1.1 ETHZ Food101
2.1.2 UPMC Food101
2.1.3 Comparison of the datasets
2.2 Setting the test framework

3 Comparisons of various CNNs architectures
3.1 Overview of the CNNs
3.1.1 AlexNet
3.1.2 VGG-Net
3.1.3 GoogLeNet
3.1.4 SqueezeNet
3.1.5 Residual Networks
3.1.6 Binary Weight Network and XNOR-Net
3.2 Training from scratch
3.2.1 AlexNet
3.2.2 VGG-Net
3.2.3 GoogLeNet
3.2.4 ResNet-50
3.2.5 SqueezeNet
3.2.6 BWN
3.2.7 XNOR-Net
3.2.8 Comparison of results
3.3 Fine tuning of most promising CNNs
3.3.1 AlexNet
3.3.2 SqueezeNet
3.3.3 GoogLeNet
3.3.4 Comparison of results

4 Results and analysis
4.1 Results of literature
4.2 Testing most promising approaches
4.2.1 SqueezeNet test
4.2.2 GoogLeNet test
4.2.3 Comparison and analysis
4.3 Threshold evaluation
4.4 Classes analysis

5 Mobile application development
5.1 Application modalities
5.1.1 Off-line mode
5.1.2 On-line mode
5.2 Application architecture

6 Conclusions
6.1 Future work

Bibliography


List of Figures

1.1 Example of over trained CNN
2.1 Example Images of ETHZ dataset
2.2 Example Images of UPMC dataset
2.3 Squashing procedure of an image
3.1 Convolution Layer
3.2 ReLU Layer activation function
3.3 Pooling Layer
3.4 Fully Connected Layer
3.5 Dropout operation
3.6 AlexNet architecture
3.7 VGG-16 architecture
3.8 Inception Module
3.9 GoogLeNet Architecture
3.10 Fire Module
3.11 SqueezeNet Architecture
3.12 Learning Rate in function of training epochs
3.13 AlexNet Training
3.14 VGG Training
3.15 GoogLeNet Training
3.16 ResNet-50 Training
3.17 SqueezeNet Training
3.18 BWN Training
3.19 XNORNet Training
3.20 AlexNet Fine Tuning
3.21 SqueezeNet Fine Tuning
3.22 GoogLeNet Fine Tuning
4.1 Accuracy TopK of SqueezeNet
4.2 Scc ICDF of SqueezeNet
4.3 Accuracy TopK of GoogLeNet
4.4 Scc ICDF of GoogLeNet
4.5 Comparison of the Accuracies Top-K
4.6 Comparison of the Scc ICDFs
4.7 Threshold usage
4.8 Threshold evaluation of the GoogLeNet
4.9 Total accuracy in function of threshold
4.10 Steak vs Prime Rib comparison
4.11 Breakfast Burrito vs Chicken Quesadilla comparison
5.1 Application architecture in blocks

List of Tables

2.1 Comparison of datasets
3.1 Comparisons of various CNNs architectures trained from scratch
3.2 Comparisons of various CNNs architectures fine tuned
4.1 Results in Bossard et al. [2014]
4.2 Results in Wang et al. [2015]
4.3 SqueezeNet Cross Test
4.4 GoogLeNet Cross Test
4.5 FP and FN
4.6 Results with different thresholds
4.7 Accuracy and Avg Scc for each class

Abbreviations

AdaGrad Adaptive Gradient

BWN Binary Weighted Network

CDF Cumulative Distribution Function

CNN Convolutional Neural Network

FC Fully Connected

FN False Negative

FP False Positive

ICDF Inverse Cumulative Distribution Function

ResNet Residual Network

RF Random Forest

Scc Score of correct class

SGD Stochastic Gradient Descent

SVM Support Vector Machine


The dedication of this graduation is split in six parts:

to my mother, for the domestic support,

to my father, for solving my paperwork,

to my brother, for teaching me electrical circuits,

to my two sisters, for creativity and curiosity,

and to my friends, for drinking with me.

You all know what you have done, and I’m truly grateful for that.


Chapter 1

Introduction

1.1 Objectives of the thesis

Mobile applications are spreading worldwide. Along with the miniaturization of electronic devices, more and more people own a smartphone. Mobile applications are now very popular, are part of everyday life, and can help us in almost every activity.

Food recognition has recently become a very popular topic. However, there is no method which offers a sufficiently high accuracy in the classification of images representing food. This is because of the variety in the colors and shapes of each dish and the multitude of recipes to recognize.

The research in this thesis exploits the use of Convolutional Neural Networks (CNNs) for food recognition. A CNN is a type of artificial neural network which is based on a large collection of neural units (artificial neurons), loosely mimicking the way a biological brain solves problems with large clusters of biological neurons connected by axons. Since CNNs are very complex and need powerful hardware to produce results in a reasonable time, they usually run on devices with high computational power in terms of CPU and GPU. The goal of this work is to implement a food recognition mobile application. A relevant part of this thesis has been the comparison of various CNNs to identify the one which offers a good trade-off between accuracy and complexity, so that it can be used on the mobile phone.


The analyzed and tested CNNs are the following: AlexNet, SqueezeNet, GoogLeNet (or Inception), VGG with 16 Layers (VGG-16), Residual Network with 50 Layers (ResNet-50), Binary Weighted Network (BWN) and XNOR-Net. The best CNN has been ported to the mobile phone and a mobile application has been developed for the Android operating system.

1.2 Thesis structure

In the remaining part of this Chapter, the tools used are presented, the related work is described, and the functioning of the CNN is briefly explained.

In Chapter 2, an overview of the datasets used is presented. Their composition and the process adopted to format them are described.

Chapter 3 is dedicated to the comparison of various CNN architectures. A brief overview of the analyzed CNNs is presented and a description of the training process of every CNN is provided. In addition, the best approaches are selected.

In Chapter 4, the best CNNs are tested and compared with results obtained in the literature. Some statistics are computed in order to establish thresholds and evaluate the distribution of the obtained data.

In Chapter 5, the development of the Android mobile application is described. This includes the porting of the CNN to the mobile phone memory, the development of the graphical interface and the implementation of the application.

In Chapter 6, the experiments done are described and the evaluation of the presented work is reported. In addition, future work and enhancements are discussed.

1.3 Background

A recap of the used technologies is necessary in order to better understand the work in this thesis. It is important to have a global point of view of the adopted procedures.

1.3.1 Convolutional Neural Network (CNN)

Neural networks are a computational approach used in computer science, which is based on a large collection of neural units (artificial neurons), loosely mimicking the way a biological brain solves problems with large clusters of biological neurons connected by axons. These systems are self-learning and trained, rather than explicitly programmed. Neural networks typically consist of multiple layers, and the signal path traverses from front to back. The goal of the neural network is to solve problems in the same way the human brain would.

Neural networks are based on real numbers, with the value of the core and of the axon typically being a representation between 0 and 1. Every axon which connects two neurons is characterized by a weight that determines the strength of the signal which goes from one neuron to the next.

In machine learning, a Convolutional Neural Network (CNN) is a type of feed-forward artificial neural network in which the connectivity pattern between its neurons is inspired by the organization of the animal visual cortex. Individual cortical neurons respond to stimuli in a restricted region of space known as the receptive field. The receptive fields of different neurons partially overlap so that they tile the visual field. The response of an individual neuron to stimuli within its receptive field can be approximated mathematically by a convolution operation.

Convolutional Networks have wide applications in image and video recognition. To this purpose, every neuron processes a portion of the input image and the results are then overlapped in order to acquire a higher-resolution representation of the original image.

1.3.1.1 Dataset

A CNN needs a dataset in order to learn to recognize images. The dataset is a collection of images divided into classes. The classes of images in the dataset are those that the CNN has to recognize (in this case, images representing food).

The dataset is split into three different sets: the training set, the validation set and the testing set.

The training set is the portion of the dataset used to train the CNN. The training stage is the phase in which the CNN is learning (i.e. the weights of the CNN are adjusted based on the error).

The validation set is the portion of the dataset used to validate the CNN. The validation stage is done in order to prevent the over training of the network.

The testing set is the portion of the dataset used to test the CNN. This set is used only for testing the final solution in order to confirm the actual predictive power of the network.


1.3.1.2 Training

In the training stage, every image belonging to the training set is given as input to the CNN that adjusts its weights in order to minimize the error between the actual and the expected result (i.e. the loss).

This stage exploits the backpropagation of the error to establish the amount by which the weights are updated. The backpropagation is performed only during the training stage; in the other stages it is not needed because the purpose is simply to evaluate the results.

The training is divided into epochs; one epoch consists of one full training cycle on the training set. Once every sample in the set has been presented to the CNN, the process starts again, marking the beginning of the second epoch, and so on.

The way in which the weights are updated depends on some parameters (e.g. solver type, learning rate, etc.).

1.3.1.3 Validation

The validation stage is performed periodically to verify whether the CNN is over trained. Over training occurs when the training process goes on for too long, so that the network fits the training set too closely; on images it has not been trained on, this leads to lower accuracy (i.e. the percentage of images correctly classified) and higher loss (i.e. a measure of the mismatch between the prediction and the correct class).

Validation is performed at the end of every epoch. Every image in the validation set is given as input to the network. The weights are not adjusted during this stage; the CNN just computes accuracy and loss over the set.

The accuracy is monitored to verify whether it increases over a set of images on which the network has not been trained. If the loss on the validation set is increasing, the network is over training, so the training stage should stop.
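The stopping rule described above can be sketched in a few lines of Python. This is only an illustration: the function name `should_stop` and the `patience` parameter (how many epochs without improvement are tolerated) are assumptions, not settings taken from this thesis.

```python
def should_stop(validation_losses, patience=2):
    """Return True when the validation loss has not improved for `patience` epochs.

    `validation_losses` holds one loss value per completed epoch.
    `patience` is a hypothetical tolerance, not a value used in the thesis.
    """
    if len(validation_losses) <= patience:
        return False
    best_before = min(validation_losses[:-patience])
    recent = validation_losses[-patience:]
    # Stop only if none of the recent epochs beat the best earlier loss.
    return all(loss >= best_before for loss in recent)


# The loss decreases for three epochs and then starts to rise.
losses_per_epoch = [2.1, 1.6, 1.3, 1.4, 1.5]
stop_now = should_stop(losses_per_epoch)
still_improving = should_stop([2.1, 1.6, 1.3])
```

With the toy values above, the rule fires after the fifth epoch, because the last two validation losses are both higher than the best loss seen earlier.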

1.3.1.4 Testing

In the testing stage, every image belonging to the testing set is given as input to the CNN in order to estimate the classification accuracy.

It is important that the test is done on images that are shown to the CNN for the first time. This is needed to verify whether the CNN is able to generalize what it has learned. If the CNN does not generalize, it will recognize well only the images on which it has been trained.

The network weights are not adjusted in this stage either.

Figure 1.1: Example of over trained CNN

1.3.2 Frameworks

The frameworks on which the CNNs are implemented in this thesis are Caffe and Torch. AlexNet, GoogLeNet, SqueezeNet, VGG and ResNet are implemented in Caffe. BWN and XNOR-Net are implemented in Torch.

1.3.2.1 Caffe

Caffe (Jia et al. [2014]) is a deep learning framework made with expression, speed, and modularity in mind. It is developed by the Berkeley Vision and Learning Center (BVLC) and by community contributors.

The structure of the CNN is described layer by layer in .prototxt files. The deploy.prototxt file describes the CNN architecture used for classification.

The train_val.prototxt file describes the CNN during training, therefore it also includes backpropagation and loss computation.

Another file called solver.prototxt is used to describe the parameters used during the training stage.

The weights of the CNN are stored in a .caffemodel file.

1.3.2.2 Torch

Torch (Collobert [2002]) is a scientific computing framework with wide support for machine learning algorithms that puts GPUs first. It is easy to use and efficient, thanks to an easy and fast scripting language, LuaJIT, and an underlying C/CUDA implementation.

The CNN layers are described in .lua files. The weights of the CNN are stored in a .t7 file.

1.3.3 Tools

The tool used for the creation of the database, the training and the validation of the CNN is DIGITS.

The testing stage has been accomplished through a Python script which queries the CNN with every image in the testing set and computes the accuracy.
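The accuracy computation performed by such a script can be sketched as follows. This is not the script used in the thesis: `top1_accuracy` and the `classify` callable (which stands in for a query to the trained CNN and returns one score per class) are hypothetical names introduced for illustration.

```python
def top1_accuracy(samples, classify):
    """Fraction of samples whose highest-scoring class equals the true label.

    samples:  iterable of (image, true_label) pairs.
    classify: hypothetical callable mapping an image to a list of class scores.
    """
    correct = 0
    total = 0
    for image, true_label in samples:
        scores = classify(image)
        predicted = max(range(len(scores)), key=scores.__getitem__)
        correct += int(predicted == true_label)
        total += 1
    return correct / total if total else 0.0


# Stand-in data: each "image" is already a score vector, so the classifier
# simply returns its input. A real run would query the CNN instead.
fake_test_set = [
    ([0.1, 0.7, 0.2], 1),  # classified correctly
    ([0.6, 0.3, 0.1], 0),  # classified correctly
    ([0.2, 0.5, 0.3], 2),  # classified wrongly (class 1 scores highest)
]
accuracy = top1_accuracy(fake_test_set, classify=lambda image: image)
```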

MATLAB has been used to convert the weights obtained from the training stage into the format needed by the mobile application.

The Web Application relies on OpenCV to import and use the model of the CNN.

1.3.3.1 DIGITS

The NVIDIA Deep Learning GPU Training System (DIGITS) puts the power of deep learning into the hands of engineers and data scientists. DIGITS can be used to rapidly train highly accurate deep neural networks (DNNs) for image classification, segmentation and object detection tasks.

DIGITS simplifies common deep learning tasks such as managing data, designing and training neural networks on multi-GPU systems, monitoring performance in real time with advanced visualizations, and selecting the best performing model from the results browser for deployment. DIGITS is completely interactive so that data scientists can focus on designing and training networks rather than programming and debugging. The graphical interface of DIGITS allows the user to select the parameters for the training process without thinking about the .prototxt files. The only thing to do is to define the CNN structure in a single .prototxt file.

DIGITS provides a set of standard CNN definitions including AlexNet, GoogLeNet and LeNet.


1.3.3.2 MATLAB

MATLAB (matrix laboratory) is a multi-paradigm numerical computing environment and fourth-generation programming language. A proprietary programming language developed by MathWorks, MATLAB allows matrix manipulations, plotting of functions and data, implementation of algorithms, creation of user interfaces, and interfacing with programs written in other languages.

1.3.3.3 OpenCV

OpenCV (Bradski [2000]) is released under a BSD license and hence it’s free for both academic and commercial use. It has C++, C, Python and Java interfaces and supports Windows, Linux, Mac OS, iOS and Android. OpenCV was designed for computational efficiency and with a strong focus on real-time applications. Written in optimized C/C++, the library can take advantage of multi-core processing. Enabled with OpenCL, it can take advantage of the hardware acceleration of the underlying heterogeneous compute platform. In this thesis, the Python interface has been used to test the CNN and the Java interface has been used to deploy the CNN on the Web Application.

1.3.3.4 Android Studio

Android Studio is the official Integrated Development Environment (IDE) for Android app development. Android Studio offers features that enhance the productivity when building Android apps.

Every project in Android Studio contains one or more modules with source code files and resource files. The graphical interfaces of the application are written in XML. Android Studio renders the XML code in order to show a preview of the graphical interface. It also offers many useful tools to design buttons and menus without writing any code.

1.4 Related Work

Bossard et al. [2014] present a method for food recognition based on a dataset of 101 categories of food and exploiting Random Forests. Random Forests are used to mine discriminant components of each class of food.

Discriminant components are all those image regions which help distinguish each type of dish from the others. Those components are then clustered, and finally the most promising groups are selected.

The Random Forest is then entirely discarded and is not used at classification time. Instead, the image is classified relying on Support Vector Machines (SVM).

Wang et al. [2015] present a method for food recognition based on the same classes of food as the research above.

The research presents extensive experiments on recipe recognition using visual information, textual information and their fusion. The authors extract visual features using different methods, one of which is the OverFeat Convolutional Neural Network.

The textual information extracted from the HTML pages associated with a particular image is then combined with the visual information in order to achieve a better classification accuracy.

In both cases, experiments have shown that using a CNN as feature extractor achieves a better accuracy. Nevertheless, no method which involves only visual features has achieved good results.


Chapter 2

Overview of the datasets

2.1 Food 101

This thesis is based on datasets with 101 classes of food (i.e. 101 dishes). The recipes in the datasets are the most popular in the world.

There are two versions of the dataset presented in the literature, which share the same classes but use different images. The way in which the images are collected to populate the datasets is different, and this determines a difference in the number of images per class and in the type of the images.

Two datasets have been chosen in this thesis in order to make cross-testing possible (i.e. training a CNN on one dataset and testing it on the other) and so evaluate the so-called transfer learning (i.e. the ability to recognize images of another dataset).

The datasets contain some noisy images (i.e. images not representative of the class). This is due to the protocol used to collect them. In fact, as will be shown in the next sections, the images are collected by querying two different internet sites and it is extremely likely that the result of a query includes wrong images.

2.1.1 ETHZ Food101

This dataset has been presented in Bossard et al. [2014] from the ETHZ University in Zurich.

The dataset is populated by collecting images from foodspotting.com. The site allows users to take pictures of what they are eating, annotate the place and the type of food, and upload this information online.


The researchers have chosen the top 101 most popular and consistently named dishes and randomly sampled 1000 images per class. All the images have been rescaled to have a maximum side length of 512 pixels and smaller ones have been excluded from the whole process.

This process has led to a dataset of 101000 real-world images in total, including very diverse but also visually and semantically similar food classes such as Apple pie, Waffles, Escargots, Sashimi, Onion rings, Mussels, Edamame, Paella, Risotto, Omelette, Bibimbap, Lobster bisque, Eggs benedict, Macarons, to name a few.

Example images from the ETHZ dataset are shown in Figure 2.1. There is some noise in this dataset (i.e. images which do not properly represent the class). The noise in the Pizza class is represented, for example, by an eating baby (Figure 2.1b).

(a) True Pizza (b) Noised Pizza

Figure 2.1: Example Images of ETHZ dataset

2.1.2 UPMC Food101

This dataset has been presented in Wang et al. [2015] from the UPMC University in Paris.

The dataset is populated by querying Google Image Search with the 101 classes taken from the ETHZ Food101 dataset, each followed by the word “recipe”. Then, the first 1000 images returned by each query have been collected and all those images with a size smaller than 120 pixels have been removed. This has led to a dataset somewhat smaller than the other one. In this research, the raw HTML source pages which embed the images have also been collected, but they are not useful for the objective of this thesis.


this dataset than the ETHZ dataset. For instance, the noise in the Hamburger class is represented by an ingredient of it (i.e. the meat itself).

(a) True Hamburger (b) Noised Hamburger

Figure 2.2: Example Images of UPMC dataset

2.1.3 Comparison of the datasets

The different procedure adopted during the collection of data has led to a substantial difference between the datasets.

As mentioned before, the classes of food are the same, but the number of images for each class varies between 790 and 956 in the UPMC dataset, while it is fixed at 1000 in the ETHZ dataset.

Moreover, the images in the ETHZ dataset have a strong selfie style, as they are uploaded by consumers. Although some background noise (human faces, hands) is introduced in the images, this ensures that images outside the food categories are excluded from the dataset. The UPMC dataset, instead, contains images that are completely irrelevant to the dish. In some cases, there are images of ingredients or of recipe book pages, since the word “recipe” is included in the query.

Dataset   Number of classes   Images per class   Source
UPMC      101                 790-956            various
ETHZ      101                 1000               specific

Table 2.1: Comparison of datasets

2.2 Setting the test framework

The database creation, which is needed to train the CNNs, has been done using DIGITS. The images from the datasets have been formatted in the same way and the three sets (training, validation and test set) have been created.

The testing set of each dataset is exactly the same as the one used in the literature. This allows the results to be compared in terms of accuracy. The UPMC dataset has a test set composed of 22716 images, while the ETHZ dataset has a test set of 25250 images. The remaining images have been divided into the training and the validation set: 25% of them have been used for validation, while the other 75% have been used for training.
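The 75%/25% division of the non-test images can be illustrated with a short Python sketch. In this work the split is produced by DIGITS, so the code below is only an assumption about how such a split could be reproduced by hand: the function `split_train_val`, the random shuffling, the seed and the file names are all hypothetical.

```python
import random


def split_train_val(items, val_fraction=0.25, seed=0):
    """Shuffle a copy of `items` and split it into (training, validation) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_val = int(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]


# Hypothetical file names standing in for the images left after removing the test set.
remaining = [f"img_{i:04d}.png" for i in range(1000)]
train_set, val_set = split_train_val(remaining)
```

With 1000 remaining images, the sketch yields 750 training images and 250 validation images, with no image appearing in both sets.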

Every single image has been formatted following the same specifications. Every image has been squashed to 256x256 pixels and encoded in PNG format in order to have no losses in the compression.

(a) Original Image (b) Squashed Image

Figure 2.3: Squashing procedure of an image

Figure 2.3 shows an example of an image before and after the formatting process. During the database creation, the mean image of the dataset is computed. The mean image is subtracted from every input image of the network in order to highlight its significant regions.
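Mean image subtraction can be sketched on tiny single-channel images stored as nested lists. DIGITS performs this step internally on the real 256x256 images; the helper names `mean_image` and `subtract_mean` and the pixel values below are assumptions made purely for illustration.

```python
def mean_image(images):
    """Pixel-wise mean of equally sized images given as 2D lists."""
    count = len(images)
    height, width = len(images[0]), len(images[0][0])
    return [[sum(img[r][c] for img in images) / count for c in range(width)]
            for r in range(height)]


def subtract_mean(image, mean):
    """Subtract the mean image from one image, pixel by pixel."""
    return [[pixel - m for pixel, m in zip(row, mean_row)]
            for row, mean_row in zip(image, mean)]


# Two toy 2x2 grayscale "images" with values in the range 0-255.
dataset = [
    [[0, 100], [200, 50]],
    [[100, 100], [0, 150]],
]
mean = mean_image(dataset)
centered = subtract_mean(dataset[0], mean)
```

After subtraction, pixels equal to the dataset average become zero, while pixels that deviate from it keep a positive or negative value.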

The database backend is LMDB. This is the format in which the images are archived in order to make them easier for DIGITS to handle. In fact, LMDB is a software library that provides a high-performance embedded transactional database in the form of a key-value store. LMDB stores arbitrary key/data pairs as byte arrays, has a range-based search capability, supports multiple data items for a single key and has a special mode for appending records at the end of the database, which gives a dramatic write performance increase over other similar stores.


Chapter 3

Comparisons of various CNNs architectures

3.1 Overview of the CNNs

As mentioned before, CNNs are a category of Neural Networks that have proven very effective in areas such as image recognition and classification.

There are four main operations in CNNs: Convolution, Non Linearity (ReLU), Pooling (or Sub Sampling) and Classification (Fully Connected Layer). These operations are the basic layers of every CNN, so understanding how they work is an important step towards understanding how the whole CNN works.

Every image can be represented as a matrix of pixel values. Digital images have three channels (RGB), so they can be represented as three matrices of pixel values in the range from 0 to 255.

The Convolution Layer has the job of extracting features from input images. A filter (i.e. kernel) is slid over the original image and a feature map is extracted by multiplying, element by element, the filter and the region of the image it covers and summing the products. The output of the layer is composed of as many feature maps as the number of kernels used. The size of the output is controlled by three parameters:

• Depth: the number of kernels used for the convolution operation

• Stride: the number of pixels by which the kernel is shifted over the input image at each step. The higher the stride, the smaller the resulting feature map

• Pad: the number of zero-valued pixels added on every border of the original image

Figure 3.1: Convolution Layer
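The effect of stride and padding can be made concrete with a small Python sketch. For an input of side W, a kernel of side K, padding P and stride S, the side of the feature map is floor((W + 2P - K) / S) + 1. The code below is a naive, single-channel illustration for square images; the function names and the toy values are assumptions and do not come from the frameworks used in this thesis. As in most CNN frameworks, the kernel is applied without flipping.

```python
def conv_output_size(input_size, kernel_size, stride=1, pad=0):
    """Side length of the feature map along one dimension."""
    return (input_size + 2 * pad - kernel_size) // stride + 1


def convolve2d(image, kernel, stride=1, pad=0):
    """Slide a square `kernel` over a zero-padded square `image`; return one feature map."""
    k = len(kernel)
    size = len(image) + 2 * pad
    padded = [[0] * size for _ in range(size)]
    for r, row in enumerate(image):
        for c, value in enumerate(row):
            padded[r + pad][c + pad] = value
    out = conv_output_size(len(image), k, stride, pad)
    # Each output value is the sum of element-wise products of kernel and image patch.
    return [[sum(kernel[i][j] * padded[r * stride + i][c * stride + j]
                 for i in range(k) for j in range(k))
             for c in range(out)]
            for r in range(out)]


image = [[1, 2, 3, 0],
         [0, 1, 2, 3],
         [3, 0, 1, 2],
         [2, 3, 0, 1]]
kernel = [[1, 0],
          [0, 1]]
feature_map = convolve2d(image, kernel, stride=2)
```

Here a 4x4 image filtered by a 2x2 kernel with stride 2 and no padding gives a 2x2 feature map. Using several kernels (the depth parameter) would simply produce one such map per kernel.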

The Rectified Linear Unit (ReLU) Layer performs a non-linear operation. It is an element-wise operation (applied per pixel) that replaces all negative pixel values in the feature map with zero. This operation introduces non-linearity in the CNN, which is needed because most real-world data are non-linear.

Figure 3.2: ReLU Layer activation function
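A minimal sketch of the ReLU operation on a feature map stored as a nested list is given below; the function name `relu` and the sample values are chosen here only for illustration.

```python
def relu(feature_map):
    """Replace every negative value in a 2D feature map with zero."""
    return [[max(0, value) for value in row] for row in feature_map]


activated = relu([[-3, 2],
                  [0, -1]])
```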

The Pooling Layer (also called subsampling or downsampling) reduces the dimensionality of each feature map but retains the most important information. A window is slid over the input matrix, and the values in it are aggregated. The aggregation operation can be the sum, the maximum, the minimum, the average and so on. This operation is controlled by two parameters:

• Stride: has the same function as the stride parameter of the Convolution Layer

• Kernel size: the size of the sliding window

Figure 3.3: Pooling Layer
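As an illustration of the two parameters, the following sketch implements max pooling, one of the possible aggregation operations, on a single feature map. The function name `max_pool` and the input values are assumptions made for this example.

```python
def max_pool(feature_map, kernel_size=2, stride=2):
    """Keep the maximum of each kernel_size x kernel_size window of a 2D feature map."""
    out_h = (len(feature_map) - kernel_size) // stride + 1
    out_w = (len(feature_map[0]) - kernel_size) // stride + 1
    return [[max(feature_map[r * stride + i][c * stride + j]
                 for i in range(kernel_size) for j in range(kernel_size))
             for c in range(out_w)]
            for r in range(out_h)]


pooled = max_pool([[1, 3, 2, 0],
                   [4, 6, 1, 1],
                   [0, 2, 9, 8],
                   [1, 1, 7, 5]])
```

With a 2x2 window and a stride of 2, the 4x4 input is reduced to a 2x2 map that keeps only the strongest response of each region.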

The Fully Connected (FC) Layer is a traditional Multi Layer Perceptron that uses a Softmax activation function in the output layer. The outputs of the convolutional and pooling layers represent high-level features of the input image. The purpose of the FC Layer is to use these features to classify the input image into various classes based on the training dataset. The result of the network is an array of n elements containing the scores of each class. Using the ETHZ and UPMC datasets, the result of the CNN is an array of 101 elements. The sum of the output probabilities from the FC Layer is 1. This is ensured by applying the Softmax to the output of the FC Layer: the function takes an array of arbitrary real-valued scores and squashes it to a vector of values between 0 and 1 that sum to 1.

The Local Response Normalization (LRN) Layer performs a kind of “lateral inhibition” by normalizing over local input regions. In across channels mode, the local regions extend across nearby channels, but have no spatial extent. In within channels mode, the local regions extend spatially, but are in separate channels.

Since over training is a serious problem in CNNs, it is important to have techniques to prevent it. One specific layer acts only while the training stage is in progress: the Dropout Layer randomly drops units (along with their connections) from the neural network during training. This prevents units from co-adapting too much.

Figure 3.5: Dropout operation
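The dropout operation can be sketched as follows. The example uses the "inverted" variant, in which the surviving activations are rescaled during training so that nothing has to change at test time; this choice, the function name `dropout` and the parameter `drop_probability` are assumptions made for illustration and are not taken from the networks trained in this thesis.

```python
import random


def dropout(activations, drop_probability=0.5, rng=None):
    """Zero each activation with probability `drop_probability`; rescale the survivors."""
    if rng is None:
        rng = random.Random()
    keep = 1.0 - drop_probability
    # Dividing by `keep` preserves the expected value of each activation.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


# A seeded generator makes the (random) result repeatable.
dropped = dropout([1.0, 2.0, 3.0, 4.0], drop_probability=0.5, rng=random.Random(0))
```

Every call drops a different random subset of units, so no unit can rely on the presence of any other specific unit.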

Usually, the last layer is the Softmax Layer. It performs a softmax operation which “squashes” a K-dimensional vector of arbitrary real values to a K-dimensional vector of real values in the range (0, 1) that add up to 1. As a result, when the CNN classifies an image, it assigns each class a score that can be read as a percentage.
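A minimal sketch of the softmax operation is shown below. Each output is exp(score) divided by the sum of exp over all scores; subtracting the maximum score first is a common numerical precaution that does not change the result. The function name and the sample scores are illustrative assumptions.

```python
import math


def softmax(scores):
    """Map arbitrary real scores to values in (0, 1) that add up to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]  # shift avoids overflow
    total = sum(exps)
    return [e / total for e in exps]


probabilities = softmax([2.0, 1.0, 0.1])
```

The class with the largest input score also receives the largest probability, so the ranking of the classes is preserved.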

The training process of the CNN goes through four steps.

In the first step, all kernels and parameters (i.e. the weights) are initialized with random values.

In the second step, the CNN takes a training image as input and finds the output probabilities for each class.

In the third step, the total error at the output is computed (e.g. target probability - output probability).

In the fourth step, the weights are updated based on the total error computed in the previous step.

Steps two, three and four are then repeated with all the images in the training set. The weights resulting from the training process are stored in a file whose format depends on the framework the CNN is implemented on. The Caffe framework stores the weights in a .caffemodel file, whereas the Torch framework stores the weights in a .t7 file.
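The four training steps can be illustrated with a minimal NumPy sketch. It is not the training code used in this thesis (training was done with Caffe, Torch and DIGITS): it assumes a single softmax layer instead of a CNN, synthetic toy data, a cross-entropy error and full-batch updates, and every name in it is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (assumption): 200 samples, 4 features, 3 classes.
X = rng.normal(size=(200, 4))
y = (X @ rng.normal(size=(4, 3))).argmax(axis=1)
targets = np.eye(3)[y]  # one-hot target probabilities

def softmax_rows(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Step 1: initialize the weights with random values.
W = rng.normal(scale=0.01, size=(4, 3))
learning_rate = 0.1

losses = []
for epoch in range(50):
    # Step 2: forward pass, output probabilities for each class.
    probs = softmax_rows(X @ W)
    # Step 3: total error at the output (cross-entropy).
    loss = -np.mean(np.sum(targets * np.log(probs + 1e-12), axis=1))
    losses.append(loss)
    # Step 4: update the weights using the gradient of the error.
    grad = X.T @ (probs - targets) / len(X)
    W -= learning_rate * grad

print(losses[0], losses[-1])
```

The error printed for the last epoch is lower than the one of the first epoch, which is the behavior the loss curves in the following sections are expected to show.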

3.1.1 AlexNet

The AlexNet, presented in Krizhevsky et al. [2012], is composed of eight layers with weights; the first five are Convolutional and the remaining three are FC.

The output of the last FC Layer is fed to an n-way Softmax that produces a distribution over the n class labels.

The kernels of the second, fourth and fifth Convolutional Layers are connected only to those feature maps in the previous layer which reside on the same GPU. The kernels of the third Convolutional Layer are connected to all feature maps in the second layer. The neurons in the Fully Connected Layers are connected to all neurons in the previous layer. LRN Layers follow the first and second Convolutional Layers. Max-Pooling Layers follow both LRN Layers as well as the fifth Convolutional Layer. The ReLU Layer is applied to the output of every Convolutional and FC Layer. Dropout is applied to the FC Layers in order to reduce over-fitting.

The first Convolutional Layer filters the 224x224x3 input image with 96 kernels of size 11x11x3 with a stride of 4 pixels. The second Convolutional Layer takes as input the output of the first Pooling Layer and filters it with 256 kernels of size 5x5x48. The third, fourth, and fifth Convolutional Layers are connected to one another without any intervening layer. The third Convolutional Layer has 384 kernels of size 3x3x256 connected to the second Pooling Layer. The fourth and the fifth Convolutional Layers have 384 and 256 kernels of size 3x3x192, respectively. The FC Layers have 4096 neurons each, except for the last one which has 101 neurons because it performs a 101-way classification.
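As a small worked example, the number of kernel weights of each Convolutional Layer follows directly from the kernel counts and sizes listed above. The sketch below only multiplies those figures; biases are ignored and the layer names are labels chosen for illustration.

```python
# (number of kernels, height, width, depth) as listed in the text
conv_layers = {
    "conv1": (96, 11, 11, 3),
    "conv2": (256, 5, 5, 48),
    "conv3": (384, 3, 3, 256),
    "conv4": (384, 3, 3, 192),
    "conv5": (256, 3, 3, 192),
}

def kernel_weights(n_kernels, height, width, depth):
    """Number of weights of a Convolutional Layer, biases excluded."""
    return n_kernels * height * width * depth

weights = {name: kernel_weights(*shape) for name, shape in conv_layers.items()}
total = sum(weights.values())

for name, count in weights.items():
    print(name, count)
print("total", total)
```

The five Convolutional Layers together hold about 2.3 million weights, so most of the model size comes from the FC Layers.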

Figure 3.6 shows the full architecture of the AlexNet. This CNN is implemented on the Caffe framework.


3.1.2 VGG-Net

The VGG Neural Network exploited in this thesis, presented in Simonyan and Zisserman [2014], is composed of sixteen layers with weights.

Figure 3.7 shows the full architecture of the VGG-16. It is composed of thirteen Convolution Layers, five Pooling Layers and three FC Layers. In particular, a Pooling Layer is placed after the second, the fourth, the seventh, the tenth and the thirteenth Convolution Layer. The last three layers are Fully Connected.

The ReLU Layer is applied after every Convolution Layer and after the first two FC Layers. The last FC Layer is connected to the Softmax.

The input of this CNN is a fixed-size 224x224 RGB image. The image is passed through a stack of Convolutional Layers with 3x3 kernels and a stride of 1 pixel. Spatial pooling is carried out by max Pooling Layers, which follow some of the Convolutional Layers. The stack of Convolutional Layers is followed by three FC Layers: the first two have 4096 neurons each, the third performs 101-way classification and thus contains 101 neurons. This CNN is implemented on the Caffe framework.


3.1.3 GoogLeNet

GoogLeNet (or Inception), presented in Szegedy et al. [2015], uses the Inception modules (Figure 3.8). The Inception Module basically acts as multiple Convolutional Layers that process the same input. It also performs pooling at the same time. All the results are then concatenated. This allows the model to take advantage of multi-level feature extraction from each input. For instance, it extracts general (5x5) and local (1x1) features at the same time.

The network is 22 layers deep when counting only layers with parameters. The overall number of layers used for the construction of the network is about 100. The ReLU is applied to the output of every Convolution Layer. LRN Layers follow the first Pooling Layer and the third Convolution Layer. The output of the last FC Layer is connected to the Softmax Layer.

Since this CNN is very deep, the ability to backpropagate is a concern. Therefore, during training, auxiliary FC Layers have been added at some intermediate layers. This way, the losses of the intermediate layers are also taken into account and backpropagated.

The input image is 224x224 RGB. The Convolutional Layers have 7x7 and 3x3 kernels with stride of 2 and 1 pixel. Spatial Pooling is carried out by four max Pooling Layers and one avg Pooling Layer.

Figure 3.8: Inception Module

In the Inception Modules, Convolutional Layers with 1x1, 3x3 and 5x5 kernels are performed in different branches along with a max Pooling.

Figure 3.9 shows the full architecture of the GoogLeNet. This CNN is implemented on the Caffe framework.


Figure 3.9: GoogLeNet Architecture

3.1.4 SqueezeNet

SqueezeNet, presented in Iandola et al. [2016], relies on the fact that smaller CNN architectures require less communication and less bandwidth to export a new model. Moreover, smaller CNNs are more feasible to deploy on hardware with limited memory.

Figure 3.10: Fire Module

The SqueezeNet relies on the Fire Module. A Fire Module is composed of a "squeeze" Convolution Layer (only 1x1 kernels), feeding into an "expand" Convolution Layer that has a mix of 1x1 and 3x3 kernels.
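To give an idea of why the Fire Module saves memory, the sketch below compares the number of weights of a Fire Module with that of a plain 3x3 Convolution Layer producing the same number of output channels. The channel counts (96 input channels, 16 squeeze kernels, 64 + 64 expand kernels) are illustrative assumptions, not values taken from this thesis, and biases are ignored.

```python
def conv_weights(in_channels, out_channels, kernel_size):
    """Number of weights of a Convolution Layer, biases excluded."""
    return in_channels * out_channels * kernel_size * kernel_size

def fire_module_weights(in_channels, squeeze, expand_1x1, expand_3x3):
    """Weights of a squeeze layer (1x1) followed by a 1x1/3x3 expand layer."""
    squeeze_w = conv_weights(in_channels, squeeze, 1)
    expand_w = conv_weights(squeeze, expand_1x1, 1) + conv_weights(squeeze, expand_3x3, 3)
    return squeeze_w + expand_w

# Hypothetical channel counts, chosen only for illustration.
fire = fire_module_weights(in_channels=96, squeeze=16, expand_1x1=64, expand_3x3=64)
plain = conv_weights(in_channels=96, out_channels=128, kernel_size=3)

print(fire, plain, plain / fire)
```

With these assumed sizes the Fire Module needs roughly nine times fewer weights than the plain layer, because the 3x3 kernels only see the few channels produced by the squeeze layer.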

The SqueezeNet is composed of a standalone Convolution Layer with 96 7x7 kernels and a stride of 2 pixels, followed by eight Fire Modules and a final Convolution Layer with 101 1x1 kernels and a stride of 1 pixel. The number of kernels per Fire Module is gradually increased. A Max-Pooling Layer is placed after the first Convolution Layer, the second and the fourth Fire Module, whereas an avg Pooling is performed after the last Convolutional Layer. The ReLU is applied to the output of every Convolution Layer. The last layer (i.e. the fourth Pooling Layer) is connected to the Softmax Layer.

Figure 3.11 shows the full architecture of the SqueezeNet. This CNN is implemented on the Caffe framework.


3.1.5 Residual Networks

Residual Networks, presented in He et al. [2016], rely on the concept of residual learning. Since this is a very deep neural network (e.g. 50 layers), it has the same problem as the GoogLeNet: backpropagation.

Because of this, a shortcut sometimes sends the result of a layer further down the network by skipping some layers. The ResNet exploited in this thesis has 50 layers with weights, and it is called ResNet-50.

The architecture of the ResNet-50 CNN is mainly inspired by the philosophy of VGG networks. The Convolutional Layers mostly have 3x3 kernels and follow two simple design rules: for the same output feature map size, the layers have the same number of kernels; and if the feature map size is halved, the number of kernels is doubled so as to preserve the time complexity per layer. Downsampling is performed directly by Convolution Layers with a stride of 2 pixels. The network ends with a global average Pooling Layer and a 101-way FC Layer with Softmax.

The shortcut differs depending on the dimensions of input and output. When the input and output dimensions are the same, the identity shortcut is used. When the output dimension is greater than the input dimension, two options are possible: pad the input with extra zero entries or perform 1x1 convolutions. The Batch Normalization, the LRN and the ReLU are performed after every Convolution Layer. The Eltwise operation combines, by element-wise sum, the results obtained by the different "trees". The last Layer (i.e. the Pooling Layer) is connected to the Softmax Layer. This CNN is implemented on the Caffe framework.
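The shortcut idea can be sketched as follows. This is a simplified illustration and not the ResNet-50 definition used in this thesis: dense matrix products stand in for the convolutions, Batch Normalization is omitted, and the function and parameter names are assumptions.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2, w_proj=None):
    """Return relu(F(x) + shortcut(x)) for a toy residual block."""
    # F(x): two weight layers with a ReLU in between (stand-in for convolutions).
    fx = relu(x @ w1) @ w2
    # Identity shortcut when dimensions match, otherwise a linear projection
    # (the dense analogue of the 1x1 convolution option).
    shortcut = x if w_proj is None else x @ w_proj
    return relu(fx + shortcut)

x = np.array([[1.0, 2.0, 3.0]])

# Same input and output dimension: identity shortcut.
zeros = np.zeros((3, 3))
same_dim = residual_block(x, zeros, zeros)

# Larger output dimension: projection shortcut.
rng = np.random.default_rng(0)
wider = residual_block(
    x, rng.normal(size=(3, 4)), rng.normal(size=(4, 5)), rng.normal(size=(3, 5))
)
print(same_dim, wider.shape)
```

With all-zero weights the block simply returns its (non-negative) input, which shows why a residual block can easily learn to behave like the identity.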

3.1.6 Binary Weight Network and XNOR-Net

The research by Rastegari et al. [2016] proposes two efficient approximations of the standard AlexNet CNN: Binary-Weight-Networks (BWN) and XNOR-Networks (XNORNet). In the BWN, the kernels are approximated with binary values, resulting in a 32x memory saving with respect to the AlexNet. In the XNORNet, both the kernels and the input to the Convolutional Layers are binary. XNORNet approximates convolutions using primarily binary operations. This results in 58x faster convolutional operations and 32x memory savings.

Contrary to the other CNNs, XNOR-Net and BWN offer the possibility of running state-of-the-art networks on CPUs (rather than GPUs) in real-time. Moreover, they are implemented on the Torch framework.
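The binary approximation of a kernel can be sketched as below, assuming the scheme of Rastegari et al. [2016] in which a real-valued kernel W is replaced by a scaling factor alpha (the mean absolute value of W) times the sign of W. The function name and the toy kernel are hypothetical; the actual Torch implementation is not reproduced here.

```python
import numpy as np

def binarize_weights(w):
    """Approximate w with alpha * b, where b contains only +1 and -1."""
    alpha = np.abs(w).mean()          # scaling factor
    b = np.where(w >= 0, 1.0, -1.0)   # binary kernel
    return alpha, b

# Toy 2x2 kernel (assumption).
w = np.array([[0.5, -1.5], [2.0, -1.0]])
alpha, b = binarize_weights(w)
approx = alpha * b

# Storage: 32 bits per float weight versus 1 bit per binary weight
# (the single scaling factor is ignored here).
bits_float = w.size * 32
bits_binary = w.size * 1
print(alpha, b, bits_float / bits_binary)
```

The ratio between the two storage sizes is 32, which is where the 32x memory saving quoted above comes from.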


3.2 Training from scratch

The CNNs listed before have been trained over both datasets in order to point out the most promising ones. In this section, the training has been accomplished by starting from scratch. This means that the weights of the layers of the CNN have been randomly initialized.

This is just a preliminary stage. The results obtained in this stage are only used to evaluate all the CNN architectures. The most promising approaches will be analyzed in depth in the next section.

The performance of the CNNs has been measured relying on the accuracy on the validation set (i.e. the percentage of images of the validation set correctly classified). That metric will also be used to understand if the CNN is over training, as explained in Chapter 1. In order to choose the best approach, the model size has also been taken into account (smaller models are preferred because of the limited memory of a mobile phone).

The parameters that have been taken into account for training are the following:

• Base Learning Rate: the weights of the CNN are updated step by step in order to reach the minimum of the error function. The learning rate determines the step size. Note that a high learning rate may lead to a step which jumps over the solution, whereas small learning rates lead to many steps to reach the solution (i.e. the minimum of the error function).

• Solver Type: the solver type determines the method used to find the solution. In particular, the way in which the weights are updated.

• Training Epochs: the number of training epochs determines the length of the training process. A high number of epochs may lead to over training, whereas with a small number of epochs the training may not converge to the optimal solution.

• Batch Size: the batch size is the number of training images processed before each weight update, so it determines how often the weights are updated.

The training process has been done by setting the same parameters for every CNN. The number of Training Epochs is fixed to 60, but in some cases a different number of epochs has been tested. This number is high enough to have a global view of the training process. DIGITS allows the model to be exported at every epoch, so the epoch with the highest accuracy can be easily chosen.


The Learning Rate has a step down policy, which consists in a "step down" of 90% every 33% of the epochs. Figure 3.12 shows an example of the learning rate as a function of the epochs. If the number of epochs is 60, the Learning Rate is fixed to 0.01 in the first 20 epochs, to 0.001 in the second 20 epochs and to 0.0001 in the last 20 epochs.

Figure 3.12: Learning Rate in function of training epochs

This particular choice guarantees that steps are smaller when the CNN is in proximity of the solution. This avoids steps which jump the solution.
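The step down policy can be written as a one-line formula. The sketch below reproduces the values quoted above for 60 epochs; the function and parameter names are assumptions and do not correspond to DIGITS or Caffe settings.

```python
import math

def step_down_lr(epoch, base_lr=0.01, total_epochs=60, gamma=0.1, steps=3):
    """Learning rate at a given epoch (0-based) with a step down policy.

    The rate is multiplied by gamma (a 90% reduction) after every
    total_epochs / steps epochs (33% of the training).
    """
    step_size = total_epochs // steps
    return base_lr * gamma ** (epoch // step_size)

for epoch in (0, 19, 20, 39, 40, 59):
    print(epoch, step_down_lr(epoch))
```

Up to floating point rounding, the printed values are 0.01 for epochs 0 to 19, 0.001 for epochs 20 to 39 and 0.0001 for epochs 40 to 59.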

Two Solver Types have been analyzed in this thesis: Stochastic Gradient Descent (SGD) and Adaptive Gradient (AdaGrad). Both solvers update the weights by using the gradient of the error function, with the difference that AdaGrad is an optimization of the former: AdaGrad performs larger updates for infrequent parameters and smaller updates for frequent ones, whereas SGD applies the same learning rate to all the parameters.
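The difference between the two update rules can be sketched as follows. This is a textbook formulation, not the Caffe solver code: the function names, the epsilon term and the toy gradient are assumptions.

```python
import numpy as np

def sgd_step(w, grad, lr):
    """Plain SGD: the same learning rate for every parameter."""
    return w - lr * grad

def adagrad_step(w, grad, hist, lr, eps=1e-8):
    """AdaGrad: each parameter is scaled by its accumulated squared gradients."""
    hist = hist + grad ** 2
    w = w - lr * grad / (np.sqrt(hist) + eps)
    return w, hist

# Toy gradient: the first parameter gets a large gradient,
# the second one a small gradient.
grad = np.array([1.0, 0.01])
w0 = np.zeros(2)

w_sgd = sgd_step(w0, grad, lr=0.1)
w_ada, hist = adagrad_step(w0, grad, np.zeros(2), lr=0.1)
print(w_sgd, w_ada)
```

After one step SGD has barely moved the second parameter, whereas AdaGrad has moved both parameters by roughly the same amount, since each gradient is divided by its own accumulated magnitude.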

3.2.1 AlexNet

Figure 3.13 shows the accuracy on the validation set and the loss trend as a function of the epochs of the training process.

In the AlexNet trained over the ETHZ dataset (Figure 3.13a), a maximum accuracy of 40.60% on the validation set has been reached, whereas in the AlexNet trained over the UPMC dataset (Figure 3.13b) the maximum accuracy achieved on the validation set is 37.88%. Both results have been obtained using SGD as Solver Type.

Very low results have been obtained in the same scenario using AdaGrad instead. This suggests that this kind of Solver Type does not fit well with the AlexNet trained over the datasets used in this thesis.

The accuracies obtained on both datasets are almost the same despite the difference between them.


The AlexNet is implemented on the Caffe framework, so the weights are stored in .caffemodel format. The model size of this architecture is about 230 MB.

As can be seen in both graphs of Figure 3.13, the CNN is slightly over trained. There is a small increase in the loss on validation set after epoch 35. Although the accuracy is still increasing after that epoch, a larger number of epochs may lead to a lower accuracy.

(a) Train over ETHZ dataset

(b) Train over UPMC dataset

Figure 3.13: AlexNet Training

3.2.2 VGG-Net

Figure 3.14 shows the accuracy and the loss trend as a function of the epochs of the training process of the VGG.

The best accuracy on the validation set obtained when the VGG is trained over the ETHZ dataset (Figure 3.14a) is 44.27%, whereas it is 36.92% when the VGG is trained over the UPMC dataset (Figure 3.14b). The result of the training over the ETHZ dataset outperforms the result of the training over the UPMC dataset by about 7 percentage points.


The loss increases drastically when the training process approaches the 20th epoch, but after that (i.e. when the learning rate changes from 0.01 to 0.001) the loss decreases again. The same behavior can be observed in proximity of the 40th epoch, but after that epoch the loss is constantly increasing. This suggests that the number of epochs is too high. Therefore, a new training has been done by setting the number of epochs to 50. The results obtained in that training are better in terms of accuracy than the former, but no significant increase has been observed.

This architecture is implemented on the Caffe framework and the model size is 590 MB.

Figure 3.14: VGG Training

3.2.3 GoogLeNet

Figure 3.15 shows the accuracy and the loss trend as a function of the epochs of the training process. The best results obtained in this scenario rely on SGD as Solver Type. An accuracy of 56.08% on the validation set has been reached by the GoogLeNet trained over the ETHZ dataset, whereas an accuracy of 43.54% has been reached by the GoogLeNet trained over the UPMC dataset.

This CNN has better accuracy when it is trained over the ETHZ dataset: training over the ETHZ dataset outperforms training over the UPMC dataset by about 13%. The two results are not directly comparable because the training and validation sets are different. The GoogLeNet implementation relies on the Caffe framework. The .caffemodel file containing the weights resulting from the training has a size of 40 MB.

The loss on the validation set decreases all the time in the training over the ETHZ dataset (Figure 3.15a), which suggests that the CNN is still learning after the 60th epoch. However, the accuracy is not consistently increasing. Therefore, a new training with a higher number of epochs (i.e. 90) and a learning rate with the same policy has been done. The results obtained are worse than the results obtained in the former scenario. The same goes for the GoogLeNet trained over the UPMC dataset (Figure 3.15b).

Using AdaGrad as Solver Type has led to worse results in terms of accuracy. So, also in this case, the best approach is to use the SGD method.


Figure 3.16: ResNet-50 Training

3.2.4 ResNet-50

Figure 3.16 shows the accuracy and the loss trend as a function of the epochs of the training process. The best accuracy on the validation set is 16.56% when the CNN is trained over the ETHZ dataset (Figure 3.16a), whereas an accuracy of 19.98% on the validation set has been achieved on the UPMC dataset (Figure 3.16b). Contrary to AlexNet and GoogLeNet, the training of the ResNet-50 over the UPMC dataset gave a higher accuracy than the training over the ETHZ dataset, but there is no significant difference between the accuracies. In both cases, the CNN is strongly over training. The loss on the validation set increases quickly after epoch 20, and the accuracy on the validation set decreases accordingly. It is important to notice that the learning rate after epoch 20 is smaller than the learning rate in the first 20 epochs. If the CNN accuracy is decreasing despite the lower learning rate, it probably means that the CNN has reached its best at epoch 20. A new training has been done by setting the number of epochs to 20, but the accuracy obtained is not better than the former.


The ResNet-50 model is in .caffemodel format since this architecture is implemented on the Caffe framework. The size of the .caffemodel file is 80 MB. The results obtained with AdaGrad as Solver Type are considerably lower in terms of accuracy than the results obtained using the SGD method.

Figure 3.17: SqueezeNet Training

3.2.5 SqueezeNet

Figure 3.17 shows the accuracy and the loss trend as a function of the epochs of the training process.

The best accuracy on the validation set obtained with this architecture trained over the ETHZ dataset (Figure 3.17a) is 40.83%, whereas it is 31.18% when the training is performed over the UPMC dataset (Figure 3.17b). The training over the ETHZ dataset outperforms the training over the UPMC dataset by about 10%.

The loss on the validation set decreases during the whole training process, so the CNN is not over training. However, the accuracy on the validation set remains almost the same after the 40th epoch.

The implementation of the SqueezeNet relies on the Caffe framework, therefore the weights are stored in .caffemodel format. The model size of this architecture is 3 MB.

Figure 3.18: BWN Training

3.2.6 BWN

Figure 3.18 shows the accuracy and the loss trend as a function of the epochs of the training process of the BWN.

When the training is performed over the ETHZ dataset (Figure 3.18a), an accuracy of 38.37% has been reached on the validation set. Meanwhile, an accuracy of 34.58% has been reached when the training is performed over the UPMC dataset (Figure 3.18b). The accuracies on both datasets are almost the same, but the accuracy obtained by training over the ETHZ dataset is slightly better.


result is obtained once the 20th epoch has been reached.

The implementation of the BWN relies on the Torch framework, therefore the model is in .t7 format. The model size of this architecture is 380 MB.

Figure 3.19: XNORNet Training

3.2.7 XNOR-Net

Figure 3.19 shows the accuracy and loss trend as a function of the training epochs of the XNOR-Net.

When the training is performed over the ETHZ dataset (Figure 3.19a), the best accuracy reached on the validation set is 25.31%. Meanwhile, an accuracy of 24.36% has been reached when the training is performed over the UPMC dataset (Figure 3.19b). The accuracies are almost the same in both scenarios: the XNOR-Net has almost the same behavior on both the ETHZ and the UPMC dataset.

Losses are very high with this architecture. Moreover, they remain the same after the 20th epoch (i.e. when the learning rate decreases to 0.001). A new training has been performed over both datasets by increasing the number of epochs to 90. In this way, the learning rate remains at 0.01 in the first 30 epochs. The result of the latter training is almost the same as the former, but in the latter the CNN is a bit over trained.

The XNOR-Net is implemented on the Torch framework, therefore the weights are exported in .t7 format. The model size of this architecture is 410 MB.

3.2.8 Comparison of results

Table 3.1 shows the results obtained by every CNN architecture listed before. The CNNs are sorted by accuracy over the ETHZ dataset.

CNN          Framework   Model size   Accuracy on validation set
                                      ETHZ train   UPMC train
ResNet-50    caffe       ~80 MB       16%          20%
XNOR-Net     torch       ~410 MB      25%          24%
BWN          torch       ~380 MB      38%          34%
AlexNet      caffe       ~230 MB      40%          37%
SqueezeNet   caffe       ~3 MB        41%          31%
VGG-16       caffe       ~590 MB      44%          37%
GoogLeNet    caffe       ~40 MB       56%          43%

Table 3.1: Comparisons of various CNNs architectures trained from scratch

The parameters to rely on for the evaluation of an architecture are the model size and, of course, the accuracy over both datasets. The model size is important since the CNN has to run on a mobile phone.

The best architecture in terms of size is the SqueezeNet. It is one or two orders of magnitude smaller than all the other architectures. Moreover, it has a good accuracy on both datasets.

The best architecture in terms of accuracy is the GoogLeNet. It outperforms all the other architectures by at least 10% on the ETHZ dataset and by about 6% on the UPMC dataset. Moreover, its size is second only to the SqueezeNet and it is one order of magnitude smaller than all the other architectures except the ResNet-50, which however has a very low accuracy.

VGG-Net has a high accuracy on both datasets, but its size is too large to fit in mobile memory.

BWN and AlexNet have almost the same accuracy as the SqueezeNet on the ETHZ dataset. Since Torch is not well documented and optimized for implementation on mobile phones, BWN has been discarded.


is better than the accuracy of the SqueezeNet on the same dataset.

Summarizing, the most promising architectures are AlexNet, SqueezeNet and GoogLeNet. They will be analyzed in depth in the next section.


3.3 Fine tuning of most promising CNNs

In practice, a CNN is not usually trained from scratch with random initialization of parameters. This is because it is relatively rare to have a dataset of sufficient size that is required for the depth of network. Instead, it is common to pre-train a CNN on a very large dataset and then use the trained CNN weights either as an initialization or a fixed feature extractor for the task of interest.

In this section, the fine tuning has been performed on models pre-trained over the ILSVRC dataset, which is described in Russakovsky et al. [2014].

ImageNet is an image database in which each node of the hierarchy is depicted by hundreds or thousands of images. The whole ImageNet dataset is composed of millions of images and thousands of classes, but the CNNs in this section have been pre-trained on 1000 classes.

The CNNs selected before have been fine tuned in order to obtain a better accuracy. In fine tuning, the parameters of the CNN are not randomly initialized, but they are loaded from the .caffemodel resulting from the training over the ILSVRC dataset. This is only valid for the intermediate layers of the CNN. The last layer (i.e. the FC Layer for the chosen architectures) has been randomly initialized, since the number of classes of the FC Layer for Food Recognition in the ETHZ and UPMC datasets is different from the number of classes of the ILSVRC dataset (101 vs 1000).

The parameters taken into account during the fine tuning process are the same as in the training from scratch, with the addition of the following:

• Learning Rate Multiplier (lr_mult): the learning rate multiplier is a parameter of each layer. It is multiplied by the Learning Rate during the update of the weights. It is used to set different Learning Rates for each layer: if a layer has a small lr_mult, its learning process is slower. Meanwhile, if a layer has a high lr_mult, its learning process is faster because the weights are substantially changed.

The lr_mult has been set to 0.1 in the intermediate layers and to 1 in the last layer. As a consequence, the Learning Rate of the intermediate layers is 10 times smaller than that of the last layer. A smaller learning rate has been used for the CNN weights that are being fine-tuned under the assumption that the pre-trained CNN weights are relatively good. In order not to distort them too quickly or too much, the lr_mult has been kept small so as to have a small learning rate in those layers. Instead, the last FC Layer has to learn from scratch as before, so it is good to have a higher learning rate.
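The effect of the multiplier can be checked with a short calculation. The sketch assumes a base Learning Rate of 0.01 (the value of the first 20 epochs in the schedule of the previous section); the dictionary keys are labels chosen for illustration and are not layer names of the actual networks.

```python
import math

base_lr = 0.01  # base Learning Rate, assumed from the first step of the schedule

# Learning rate multipliers as described in the text.
lr_mult = {"intermediate_layers": 0.1, "last_fc_layer": 1.0}

# The learning rate actually applied to a layer is base_lr * lr_mult.
effective_lr = {layer: base_lr * mult for layer, mult in lr_mult.items()}

for layer, lr in effective_lr.items():
    print(layer, lr)
```

Up to floating point rounding, the pre-trained layers are updated with a rate of 0.001, while the randomly initialized last layer uses the full 0.01.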


The Solver Types tested in this section are the same as in the previous section. The same goes for the number of Training Epochs and the Learning Rate. An example of the Learning Rate shape is shown in Figure 3.12.

(a) Fine Tuning over ETHZ dataset

(b) Fine Tuning over UPMC dataset

Figure 3.20: AlexNet Fine Tuning

3.3.1 AlexNet

Figure 3.20 shows the accuracy and loss trend as a function of the epochs of the fine tuning of the AlexNet.

When the fine tuning is performed using the ETHZ dataset (Figure 3.20a), an accuracy of 59.58% has been reached. Meanwhile, an accuracy of 54.99% has been obtained when the fine tuning is performed using the UPMC dataset (Figure 3.20b).

As can be seen from the graphs, the accuracy is high right from the first epoch. This is because the weights of the CNN have already been initialized to extract some features correctly.


Both the accuracy and the loss on the validation set remain the same after the 22nd epoch. The reason may be that the Learning Rate is lower after the 20th epoch (i.e. 0.001 vs 0.01).

A new fine tuning has been performed by fixing the Learning Rate to 0.01. The result obtained shows that the accuracy remains the same as in the former training.

Figure 3.21: SqueezeNet Fine Tuning

3.3.2 SqueezeNet

Figure 3.21 shows the accuracy and loss trend as a function of the epochs of the fine tuning process of the SqueezeNet.

When the fine tuning is performed using the ETHZ dataset, the best result obtained is an accuracy of 63.95%, whereas when the fine tuning is performed using the UPMC dataset the best accuracy reached is 54.71%.

The loss on the validation set increases after the 20th epoch. This means that the CNN is over training. The difference between the two datasets is that in the fine tuning using the UPMC dataset (Figure 3.21b) the accuracy keeps increasing after that epoch, whereas in the fine tuning using the ETHZ dataset (Figure 3.21a) it decreases until the Learning Rate is decreased.

Figure 3.22: GoogLeNet Fine Tuning

3.3.3 GoogLeNet

Figure 3.22 shows the accuracy and loss trend as a function of the epochs of the fine tuning process of the GoogLeNet.

When the fine tuning is performed using the ETHZ dataset (Figure 3.22a), both the accuracy and the loss on the validation set remain almost the same after the 20th epoch. The loss is slightly increasing, therefore the GoogLeNet is over training, but the accuracy is not decreasing. When the fine tuning is performed using the UPMC dataset (Figure 3.22b), the loss on the validation set is quickly increasing after the 20th epoch but even in this case the accuracy is decreasing.


policy of Learning Rate, but no significant changes in accuracy have been observed. The best accuracy obtained is 73.46% when the ETHZ dataset has been used, whereas an accuracy of 63.84% has been obtained with the other dataset.

3.3.4 Comparison of results

In Table 3.2, the different CNNs have been compared.

As said before, the best network in terms of size is the SqueezeNet. However, the best network in terms of accuracy on both datasets is still the GoogLeNet, which outperforms the AlexNet and the SqueezeNet by 8% if the training is performed over the UPMC dataset. Instead, if the training is performed over the ETHZ dataset, the GoogLeNet outperforms the AlexNet by 14% and the SqueezeNet by 9%.

The two most promising CNNs are the SqueezeNet, because of the reduced model size, and the GoogLeNet, because of the higher accuracy.

When training from scratch over the ETHZ dataset, the SqueezeNet outperformed the AlexNet by only 1%, but it has been outperformed by the AlexNet when training over the UPMC dataset. Instead, in fine tuning, the SqueezeNet has the same accuracy as the AlexNet on the UPMC dataset and outperforms the AlexNet by 5% on the ETHZ dataset. AlexNet is the worst network for both size and accuracy, therefore it will not be further analyzed.

CNN          Framework   Model size   Accuracy on validation set
                                      ETHZ train   UPMC train
AlexNet      caffe       ~230 MB      59%          55%
SqueezeNet   caffe       ~3 MB        64%          55%
GoogLeNet    caffe       ~40 MB       73%          63%


Chapter 4

Results and analysis

4.1 Results of literature

Before proceeding with the testing of two most promising CNN models obtained in the previous section (i.e. SqueezeNet and GoogLeNet), we will now have a look at the results that have been obtained in the literature.

Many results involving the ETHZ dataset have been presented in Bossard et al. [2014]. Using global classifiers, the best result obtained is an accuracy of 56.40% on the test set, whereas using local classifiers the best result is an accuracy of 50.76%.

Table 4.1 shows the classification accuracies obtained by the different methods discussed in the paper. The methods used are Bag Of Words (BOW), Improved Fisher Vectors (IFV), Convolutional Neural Network (CNN), Random Forests (RF), Randomized Clustering Forests (RCF), Mid-Level Discriminative Superpixels (MLDS) and Random Forests Discriminative Components (RFDC).

The listed methods have been used to extract features. In every approach, an SVM has been used to classify, except for the CNN method, which performs both the extraction and the classification, the latter in the last layer with the FC Layer. The CNN architecture used in that research is an AlexNet pretrained on ImageNet-1000. The same scenario has been tested in this thesis, and in fact the obtained accuracy is almost the same (i.e. about 57%).

The best result obtained in Wang et al. [2015] exploits the OverFeat CNN pretrained over ImageNet-1000. It has been used as a feature extractor, and the classification has been done exploiting an SVM once again. In that research, a cross test has also been performed (i.e. training on ETHZ and testing on UPMC, and vice versa).


Classifier Type   Method   Accuracy
Global            BOW      28.51%
                  IFV      38.88%
                  CNN      56.40%
Local             RF       32.72%
                  RCF      28.46%
                  MLDS     42.63%
                  RFDC     50.76%

Table 4.1: Results in Bossard et al. [2014]

Table 4.2 shows the obtained accuracies. The best result has an accuracy of 42.54% and it has been obtained by training and testing the CNN over ETHZ dataset.

Train/Test   UPMC     ETHZ
UPMC         40.56%   25.63%
ETHZ         25.28%   42.54%

Table 4.2: Results in Wang et al. [2015]

4.2 Testing most promising approaches

The most promising approaches coming from fine-tuning have been tested. The test of both CNN architectures has been done by following the cross testing also adopted in Wang et al. [2015] in order to compare the results obtained.

Both CNNs have been queried with all the images in the testing set of both datasets. An example of the script used to this aim is reported below.

    import numpy as np
    import matplotlib.pyplot as plt
    from PIL import Image
    import caffe
    import os

    caffe.set_mode_cpu()

    # load the model: network definition and trained weights (placeholder paths)
    net = caffe.Net('prototxt/of/the/CNN', 'caffemodel/of/the/CNN', caffe.TEST)

    # load input and configure preprocessing
    transformer = caffe.io.Transformer({'data': net.blobs['data'].data.shape})
    # per-channel mean computed from the mean image of the dataset
    transformer.set_mean('data', np.load('mean/image/of/the/dataset').mean(1).mean(1))
    transformer.set_transpose('data', (2, 0, 1))     # HxWxC -> CxHxW
    transformer.set_channel_swap('data', (2, 1, 0))  # RGB -> BGR
    transformer.set_raw_scale('data', 255.0)         # [0, 1] -> [0, 255]
    net.blobs['data'].reshape(1, 3, 224, 224)

    # paths of the images in the testing set (placeholder)
    testset = ['path/of/an/image']

    # create log file
    outfile = open("log/file.csv", "w")
    # loop on every image in the testing set
    for image in testset:
        # load the image in the data layer and preprocess it
        im = caffe.io.load_image(image)
        net.blobs['data'].data[...] = transformer.preprocess('data', im)
        # query the CNN
        out = net.forward()
        # write the scores in log file
        for i in range(0, 101):
            outfile.write('%.100f' % float(out['softmax'].flatten()[i]))
            outfile.write(";")
        outfile.write("\n")
    # close the log file
    outfile.close()

First step is to load the .caffemodel and to configure the transformer for preprocessing. It will subtract the mean image to every image of the test set and reshape each one of them.

Then, a for loop queries the CNN with every image in the testing set and logs the scores assigned to each class. The scores in the log file are subsequently sorted to identify the Top predictions. The Top1 prediction is the class with the highest score, the Top2 predictions are the 2 classes with the highest scores, and so on.
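The sorting step described above can be sketched as follows. This is a minimal illustration, assuming each row of the log file holds the semicolon-separated scores of one image; the helper names parse_score_line and top_k_classes are hypothetical and do not come from the thesis code.

```python
import numpy as np

def parse_score_line(line, sep=";"):
    """Turn one log row ('s0;s1;...;') into a 1-D array of class scores."""
    # The writer appends a separator after every score, so drop empty fields.
    fields = [f for f in line.strip().split(sep) if f.strip()]
    return np.array([float(f) for f in fields])

def top_k_classes(scores, k):
    """Return the indices of the k highest scores, best first."""
    # Sorting the negated scores gives descending order; the stable sort
    # resolves ties in favour of the lower class index.
    order = np.argsort(-scores, kind="stable")
    return order[:k]

# Toy row with 3 classes instead of 101.
row = "0.10;0.70;0.20;\n"
top2 = top_k_classes(parse_score_line(row), 2)
```

With the toy row, class 1 has the highest score and class 2 the second highest, so top2 contains the indices 1 and 2 in that order.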

This process has been done for each possible scenario: SqueezeNet and GoogLeNet have been trained and tested on both ETHZ and UPMC datasets. Once the results have been logged, two metrics have been used to evaluate the CNN architectures:

• Accuracy Top-K: the percentage of images in the testing set whose correct class is among the Top-K predictions of the CNN.

• Distribution of Correct Class Scores (Scc ICDF): the probability that the score of the correct class (Scc) of an image in the testing set takes a value greater than or equal to x (i.e. the Inverse Cumulative Distribution Function, ICDF, treating the Scc values as a random variable).

The Accuracy Top-K measures the percentage of images correctly classified by the CNN architecture. The idea is to show the user not only the Top1 prediction, but the Top-K predictions (e.g. with K equal to 5). The Accuracy Top5 of the CNN is higher than the Accuracy Top1 because the correct class is more likely to be among the first 5 predictions than to be the first one. Moreover, the Accuracy Top101 is 1 because the correct class is certainly among the first 101 predictions. Therefore, it is useful to show the accuracy as a function of K in order to better compare the two CNN architectures. Since the number of classes is 101, K varies from 1 to 101.
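A possible way to compute the accuracy for every K at once is sketched below. It assumes the logged scores have been loaded into an array with one row per image and that the correct class index of each image is known; the function names are hypothetical. Ties are resolved optimistically (a class tied with the correct one is not counted as ranked above it), which is an assumption of this sketch.

```python
import numpy as np

def correct_class_ranks(scores, labels):
    """Rank (1 = best) of the correct class for each image.

    scores: array of shape (n_images, n_classes)
    labels: array of shape (n_images,) with the correct class indices
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    correct = scores[np.arange(len(labels)), labels]
    # Rank = 1 + number of classes scored strictly higher than the correct one.
    return 1 + (scores > correct[:, None]).sum(axis=1)

def top_k_accuracy_curve(scores, labels):
    """accuracy[k - 1] = fraction of images whose correct class is in the Top-k."""
    ranks = correct_class_ranks(scores, labels)
    n_classes = np.asarray(scores).shape[1]
    # counts[r - 1] = number of images whose correct class has rank r.
    counts = np.bincount(ranks, minlength=n_classes + 1)[1:]
    return np.cumsum(counts) / len(ranks)

# Toy example with 3 images and 3 classes instead of 101.
toy_scores = [[0.7, 0.2, 0.1],
              [0.2, 0.5, 0.3],
              [0.1, 0.3, 0.6]]
toy_labels = [0, 2, 0]
curve = top_k_accuracy_curve(toy_scores, toy_labels)
```

In the toy example the correct classes are ranked 1st, 2nd and 3rd, so the curve grows from 1/3 to 2/3 to 1; the last value is always 1, as stated above for the Accuracy Top101.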

The Scc ICDF describes the distribution of the correct class scores. These results are used in the threshold analysis to evaluate the classification accuracy once a threshold on the score has been established.
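The empirical Scc ICDF can be estimated from the logged scores as in the following sketch. It assumes the score of the correct class has already been extracted for every test image; the function name scc_icdf and the toy values are hypothetical.

```python
import numpy as np

def scc_icdf(scc, thresholds):
    """Empirical P(Scc >= x) for every x in thresholds.

    scc: correct class score of each test image
    thresholds: the values of x at which the ICDF is evaluated
    """
    scc = np.asarray(scc, dtype=float)
    thresholds = np.asarray(thresholds, dtype=float)
    # Compare every score with every threshold, then average over the images.
    return (scc[:, None] >= thresholds[None, :]).mean(axis=0)

# Toy example: 4 images, ICDF evaluated at 3 thresholds.
toy_scc = [0.9, 0.6, 0.4, 0.1]
toy_icdf = scc_icdf(toy_scc, [0.0, 0.5, 1.0])
```

For the toy scores, all images satisfy Scc >= 0, half of them satisfy Scc >= 0.5, and none reaches 1, so the ICDF decreases from 1 to 0.5 to 0.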

4.2.1 SqueezeNet test

Table 4.3 shows the Accuracy Top1 of the SqueezeNet. The best accuracy (about 60%) is obtained when the CNN is trained and tested on the ETHZ dataset. If the test is done on the dataset that has not been used for training, the accuracy is about 37%. The SqueezeNet trained on the ETHZ dataset outperforms the best result in Bossard et al. [2014] by 3% on the same testing set. Moreover, it is far better than the architecture used in Wang et al. [2015], which it outperforms in almost every scenario by at least 10%.

Train/Test    UPMC      ETHZ
UPMC          50.11%    37.51%
ETHZ          37.06%    59.71%

Table 4.3: SqueezeNet Cross Test

Figure 4.1 shows the Accuracy Top-K as a function of K for the SqueezeNet trained on the ETHZ dataset (Figure 4.1a) and on the UPMC dataset (Figure 4.1b).

If the SqueezeNet is trained on the UPMC dataset, the accuracy trend is almost the same for both tests. If instead it is trained on the ETHZ dataset, the accuracy is much higher when the test is performed on the ETHZ dataset.

For the SqueezeNet trained on the ETHZ dataset, the accuracy on the ETHZ test set is about 77% when K is equal to 5 (i.e. the correct class is among the Top5 predictions in 77% of cases).

The Scc ICDF resulting from the SqueezeNet is shown in Figure 4.2. Since the scores assigned to the 101 classes sum to 1, if the score of the correct class (Scc) is greater than 0.5, the correct class is the Top1 prediction, because no other class can obtain a higher score.
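The argument can be written out explicitly. The notation below is introduced only for this illustration: s_i denotes the softmax score of class i and c the index of the correct class; the scores are non-negative because they are softmax outputs.

```latex
\sum_{i=1}^{101} s_i = 1, \qquad s_i \ge 0 \quad \text{for all } i .

\text{If } s_c > 0.5, \text{ then for every } j \ne c:
\qquad
s_j \;\le\; \sum_{i \ne c} s_i \;=\; 1 - s_c \;<\; 0.5 \;<\; s_c .
```

Hence every other class scores strictly less than the correct class, so the correct class is the Top1 prediction. The converse does not hold: the correct class can also be Top1 with a score below 0.5.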