Deep Learning for Image Classification and Retrieval: Analysis and Solutions to Current Limitations


UNIVERSITÀ DI PISA

DOTTORATO DI RICERCA IN INGEGNERIA DELL'INFORMAZIONE

DEEP LEARNING FOR IMAGE CLASSIFICATION AND RETRIEVAL: ANALYSIS AND SOLUTIONS TO CURRENT LIMITATIONS

DOCTORAL THESIS

Author

Fabio Carrara

Tutor(s)

Dr. Giuseppe Amato, Dr. Claudio Gennaro, Prof. Francesco Marcelloni

Reviewer(s)

Prof. Nicu Sebe, Prof. Michel Crucianu

The Coordinator of the PhD Program

Prof. Marco Luise

Pisa, May 2019 Cycle XXXI


“Science is like sex: sometimes something useful comes out, but that is not the reason we are doing it.” Richard P. Feynman


Acknowledgements

I had the luck and pleasure to be surrounded by wonderful people during my Ph.D. career. Most of them do not even know that, without them, the reader would not be holding this thesis in her or his hands.

First and foremost, I would like to express my immense gratitude to Dr. Giuseppe Amato and Dr. Claudio Gennaro, who offered me this opportunity, supported me, and advised me with patience during the research activities and the writing of this thesis. Without them, I would not have started nor finished this journey.

A special thank-you also goes to Dr. Fabrizio Falchi, with whom I collaborated closely in most of the research activities of my Ph.D. I am sincerely grateful for the invaluable help he provided and for the delightful discussions about science, music, food, books, and much more that is worth living for.

I gratefully acknowledge Prof. Francesco Marcelloni for supervising my Ph.D., and Prof. Michel Crucianu and Prof. Nicu Sebe, who took the time to evaluate my work and provide valuable feedback.

I would like to thank the Networked Multimedia Information Systems lab for giving me the opportunity to join its team, especially Dr. Fausto Rabitti for the funding opportunities for trips and hardware. I am sincerely grateful to all my lab mates for the three years spent together; I really enjoyed my time at the CNR because of you. Special mentions go to Lucia Vadicamo, who helped me out countless times, Alejandro Moreo Fernández, for sharing ideas, music, and deep-learning jokes, Claudio Vairo and Franca Debole, for their collaboration and useful insights, Paolo Bolettieri, for the laughs about programming language master races, and Alessandro Nardi, for shooting some hoops together in our free time.

My deepest appreciation goes to all Gli Altri, with special mentions to Andrea, Giacomo, Giuditta, and Lorenzo, for the endless friendship and wonderful music we have shared for more than 10 years, but also to Riccardo, Gianna, Sara, Lelle, Adele, and Lallu for providing an unlimited amount of good times and joy; you are one of the pillars of my life, and I would not have gone so far without your support.

I am deeply grateful to my family: my parents Marta and Alessandro, who respectively taught me perseverance and critical thinking in their unique way, my sister Rita, for constantly reminding me of the important values in life without saying a word, and my uncle Paolo, for introducing me to science in my youth; you really helped to shape my existence in the best possible way.

Last but not least, immeasurable gratitude is due to my girlfriend Regina, who stood next to me during the entire journey, sharing ups and downs with me, enjoying with me every accepted publication, supporting me in deadline-dense stressful periods, helping me keep my hair from falling out while writing this thesis, and loving me anyhow and anytime. I owe you so much for all the invaluable time spent together, which I will always treasure.


Summary

The large diffusion of cheap cameras and smartphones led to an exponential daily production of digital visual data, such as images and videos. In this context, most of the produced data lack the manually assigned metadata needed for their manageability in large-scale scenarios, thus shifting the attention to the automatic understanding of the visual content. Recent developments in Computer Vision and Artificial Intelligence empowered machines with high-level vision perception, enabling the automatic extraction of high-quality information from raw visual data. Specifically, Convolutional Neural Networks (CNNs) provided a way to automatically learn effective representations of images and other visual data, showing impressive results in vision-based tasks such as image recognition and retrieval.

In this thesis, we investigated and enhanced the usability of CNNs for visual data management. First, we identified three main limitations encountered in the adoption of CNNs and proposed general solutions that we experimentally evaluated in the context of image classification. We proposed miniaturized architectures to decrease the usually high computational cost of CNNs and enable inference at the edge on low-powered embedded devices. We tackled the problem of manually building huge training sets by proposing an automatic pipeline for training classifiers based on cross-media learning and Web-scraped weakly-labeled data. We analyzed the robustness of CNN representations to out-of-distribution data, specifically the vulnerability to adversarial examples, and proposed a detection method to discard spurious classifications provided by the model. Secondly, we focused on the integration of CNN-based Content-based Image Retrieval (CBIR) into the most commonly adopted search paradigm, that is, textual search. We investigated solutions to bridge the gap between image search and highly-developed textual search technologies by reusing both the front-end (text-based queries) and the back-end (distributed and scalable inverted indexes). We proposed a cross-modal image retrieval approach which enables text-based image search on unlabeled collections by learning a mapping from textual to high-level visual representations. Finally, we formalized, improved, and proposed novel surrogate text representations, i.e., text transcriptions of visual representations that can be indexed and retrieved by available textual search engines, enabling CBIR without specialized indexes.


Sommario

The enormous diffusion of cheap cameras and smartphones has led to an exponential daily production of digital visual data, such as images and videos. Most of the produced data is not accompanied by the metadata (descriptions, tags, or other manually assigned data) needed for its automatic management at large scale. Research attention has therefore shifted to the automatic understanding of the visual content of such data. Recent developments in Artificial Intelligence applied to Computer Vision have made it possible to automatically extract high-quality information directly from raw visual data (pixels). In particular, neural models such as convolutional neural networks have provided a way to automatically learn effective numerical representations of images and other visual data, achieving impressive results in visual tasks such as image recognition.

In this thesis, the usability of convolutional neural networks for large-scale visual data management has been studied and improved. In the first part, three main limitations usually encountered in the use of convolutional networks have been identified, and general solutions have been proposed and experimentally evaluated in the context of image classification. Miniaturized architectures have been proposed to reduce the usually high computational cost of this kind of network and thus enable its use on board low-power embedded devices. The problem of building training sets for the models, which would require considerable manual effort, has been tackled by proposing an automatic pipeline for training networks based on cross-media learning and on noisy data coming from the Web. The robustness of the representations extracted by convolutional networks on data outside the training distribution has been analyzed, with particular emphasis on the vulnerability of the networks to adversarial examples, and a detection method has been proposed to discard the spurious classifications provided by the model. Secondly, we focused on integrating image search, based on representations extracted from convolutional networks, with the most commonly adopted search paradigm, that is, textual search. In this context, we studied solutions to bridge the gap between the current state of the art in image search and the more mature textual search technologies. In particular, content-based image search solutions have been integrated both with the front-end (textual queries) and with the back-end (distributed and scalable inverted indexes for textual documents). In the first case, a cross-modal image retrieval approach has been proposed that enables searching unlabeled image collections via textual descriptions by learning a mapping function from textual to visual representations. In the second case, novel surrogate text representations for images have been formalized, improved, and proposed; these consist of a transformation of visual representations into surrogate text that can be indexed and retrieved by currently available textual search engines, enabling image retrieval applications without the need for specialized indexes.


List of publications

International Journals

1. Amato, G., Carrara, F., Falchi, F., Gennaro, C., Meghini, C., and Vairo, C. (2017, April). Deep learning for decentralized parking lot occupancy detection. Expert Systems with Applications. (Vol. 72, pp. 327–334). Pergamon.

2. Carrara, F., Esuli, A., Fagni, T., Falchi, F., and Fernández, A. M. (2018, June). Picture it in your mind: Generating high level visual representations from textual descriptions. Information Retrieval Journal. (Vol. 21, pp. 208–229). Springer.

3. Carrara, F., Falchi, F., Caldelli, R., Amato, G., and Becarelli, R. (2018). Adversarial image detection in deep neural networks. Multimedia Tools and Applications. (pp. 1–21). Springer.

International Conferences/Workshops with Peer Review

1. Amato, G., Carrara, F., Falchi, F., Gennaro, C., and Vairo, C. (2016, June). Car parking occupancy detection using smart camera networks and deep learning. In 2016 IEEE Symposium on Computers and Communication (ISCC). (pp. 1212–1217). IEEE.

2. Carrara, F., Falchi, F., Caldelli, R., Amato, G., Fumarola, R., and Becarelli, R. (2017, June). Detecting adversarial example attacks to deep neural networks. In Proceedings of the 15th International Workshop on Content-Based Multimedia Indexing (CBMI). (p. 38). ACM.

3. Amato, G., Carrara, F., Falchi, F., and Gennaro, C. (2017, June). Efficient Indexing of Regional Maximum Activations of Convolutions using Full-Text Search Engines. In Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval (ICMR). (pp. 420–423). ACM.

4. Vadicamo, L., Carrara, F., Cimino, A., Cresci, S., Dell'Orletta, F., Falchi, F., and Tesconi, M. (2017, October). Cross-Media Learning for Image Sentiment Analysis in the Wild. In 2017 IEEE International Conference on Computer Vision (ICCV) Workshops. (pp. 308–317).


5. Amato, G., Bolettieri, P., Carrara, F., Falchi, F., Gennaro, C. (2018, June). Large-Scale Image Retrieval with Elasticsearch. In Proceedings of the 41st International ACM Conference on Research and Development in Information Retrieval (SIGIR). (pp. 925–928). ACM.

6. Messina, N., Amato, G., Carrara, F., Falchi, F., and Gennaro, C. (2018, September). Learning Relationship-aware Visual Features. In Computer Vision – ECCV 2018 Workshops. (Vol. 4, pp. 486–501). Springer, Cham.

7. Carrara, F., Becarelli, R., Caldelli, R., Falchi, F. and Amato, G. (2018, September). Adversarial examples detection in features distance spaces. In Computer Vision – ECCV 2018 Workshops. (Vol. 2, pp. 313–327). Springer, Cham.


List of Abbreviations

Adam  Adaptive Moment Estimation
AP  Average Precision
AUC  Area Under the Curve
BoW  Bag-of-Words
CBIR  Content-based Image Retrieval
CNN  Convolutional Neural Network
COCO  Common Objects in Context
CReLU  Concatenated Rectified Linear Unit
DCG  Discounted Cumulative Gain
DCNN  Deep Convolutional Neural Network
DL  Deep Learning
DNN  Deep Neural Network
DP  Deep Permutation
EER  Equal Error Rate
FFNN  Feed-Forward Neural Network
FGSM  Fast Gradient Sign Method
FNR  False Negative Rate
FPR  False Positive Rate
GPU  Graphical Processing Unit
ILSVRC  ImageNet Large Scale Visual Recognition Challenge
LBP  Local Binary Patterns
LPQ  Local Phase Quantization
LRN  Local Response Normalization
LSTM  Long Short-term Memory
MAC  Maximum Activations of Convolutions
mAP  mean Average Precision
medR  Median Rank
ML  Machine Learning
MLP  Multilayer Perceptron
MRR  Mean Reciprocal Rank
MSE  Mean Squared Error
MSER  Maximally Stable Extremal Regions
PCA  Principal Component Analysis
PQ  Product Quantization
R-MAC  Regional Maximum Activations of Convolutions
R@K  Recall at K
ReLU  Rectified Linear Unit
ResNet  Residual Network
RNN  Recurrent Neural Network
ROC  Receiver Operating Characteristic
RPN  Region Proposal Network
SENet  Squeeze-and-Excitation Network
SGD  Stochastic Gradient Descent
SIFT  Scale Invariant Feature Transform
SQ  Scalar Quantization
STR  Surrogate Text Representation
SURF  Speeded Up Robust Features
SVM  Support Vector Machine
T4SA  Twitter for Sentiment Analysis
TDCG  Ties-aware Discounted Cumulative Gain
TF  term frequency
TNR  True Negative Rate
TPR  True Positive Rate
TTD  Twitter Testing Dataset
VGG  Visual Geometry Group
VLAD  Vector of Locally Aggregated Descriptors

Contents

List of Abbreviations

1 Introduction
  1.1 Objectives and Contributions

2 Background
  2.1 Deep Learning
    2.1.1 Feed-Forward Neural Networks
    2.1.2 Recurrent Neural Networks
    2.1.3 Loss Functions
    2.1.4 Gradient-Based Optimization
  2.2 Image Classification
    2.2.1 Problem Setting
    2.2.2 Evaluation
    2.2.3 Relevant works
  2.3 Content-based Image Retrieval
    2.3.1 Image Representations
    2.3.2 Evaluation Metrics
  2.4 Datasets
    2.4.1 Classification Datasets
    2.4.2 Retrieval Datasets
    2.4.3 Others

3 Miniaturization of Convolutional Neural Networks for Embedded Devices
  3.1 Visual Occupancy Detection and Related Work
  3.2 CNNs for Occupancy Detection in Embedded Devices
  3.3 Datasets
  3.4 Evaluation
    3.4.1 Comparison with the State of the Art
    3.4.2 Evaluation of the Generalization: Parking Lot
  3.5 Deployment on CNR Pisa Parking Lot
  3.6 Summary

4 Weakly-supervised Cross-media Learning of Convolutional Neural Networks
  4.1 A Case Study: Visual Sentiment Analysis
  4.2 Learning Visual Sentiment Classifiers
    4.2.1 Data Collection
    4.2.2 Providing Supervision from Text
  4.3 Experimental Evaluation
    4.3.1 Dataset Preparation
    4.3.2 Experimental Settings
    4.3.3 Results
  4.4 Summary

5 Adversarial Examples Detection in Deep Convolutional Image Classifiers
  5.1 Adversarial Examples
    5.1.1 Crafting Algorithms
    5.1.2 Attack Models
  5.2 Detecting Adversarial Examples
  5.3 Experimental Settings
    5.3.1 OverFeat-Fast Network on ILSVRC'12 Dataset
    5.3.2 InceptionV3 Network on NIPS 2017 Adversarial Competition Dataset
  5.4 Summary

6 CNN Features Prediction for Cross-media Image Retrieval
  6.1 Cross-media Image Retrieval
  6.2 Generating Visual Representations of Text
    6.2.1 VISREG
    6.2.2 S-TEXT2VIS
    6.2.3 D-TEXT2VIS
    6.2.4 W&D-TEXT2VIS
  6.3 Experiments
    6.3.1 Datasets
    6.3.2 Visual Similarity Search
    6.3.3 Training
    6.3.4 Compared Methods
    6.3.5 Evaluation Measures
    6.3.6 Results
  6.4 Summary

7 Surrogate Text Representation of Visual Features for Fast Image Retrieval
  7.1 Indexing Image Representations: Related Work
  7.2 Surrogate Text Representation
    7.2.1 Permutation-Based Approach
    7.2.2 Scalar Quantization-Based Approach
    7.2.3 Sparsification
  7.3 Experimental Evaluation
  7.4 Summary

8 Conclusions
  8.1 Future Work

CHAPTER 1

Introduction

Vision is one of the primary senses of human beings, if not the most important one. From generic signage to advertisements and artistic photography, imagery is ubiquitous in our lives due to its ability to convey lots of information quickly while overcoming cultural and linguistic barriers. Thanks to the ease of access to camera-equipped smartphones and social networks, we collectively generate, share, and receive an impressive amount of digital photos and videos daily. According to InfoTrends, the number of digital photos taken worldwide in 2017 was estimated at around 1.3 trillion, and the pace of digital media creation is expected to grow as more people get access to the Internet and cheap camera technology. In light of this scenario, there is an increasing interest in creating automatic tools for the management of digital visual data, pursuing the same accessibility revolution that search engines brought to the World Wide Web in the 90s and 00s. Although manual annotation and captioning of images and videos helped to build successful systems (e.g., keyword-based image search engines), methods relying on metadata surrounding visual data (such as keywords, tags, and captions) require the creation of such metadata by human actors. This is a labor-intensive and subjective task that cannot scale to the current pace of digital media creation. To overcome these limitations, the attention shifted to methods that try to model and infer the visual semantics of imagery relying solely upon the visual content, i.e., the information that machines can automatically extract from raw pixels and store in numerical representations (image descriptors). In such systems, the quality of the image descriptors heavily dictates the effectiveness and efficiency of the content-based image database. Early approaches employed Computer Vision to create image descriptors relying on low-level manually defined features of the images, such as the distribution of edges, colors, and simple shapes, to name a few. However, these descriptors are usually unable to capture high-level concepts like the ones assigned by human taggers to images. Thus, much of the effort focused on manually defining the right combination and aggregation of low-level features to represent higher-level concepts and perform well for the specific task under analysis, which requires a considerable amount of hand engineering and domain expertise. Classical Machine Learning methodologies, such as Support Vector Machines, decision trees, and naive Bayes classifiers, provide a considerable boost to the performance of models, helping in feature selection, weighting, and combination. Still, the definition and extraction of fine-tuned problem-specific features plays an essential role in complex perception tasks such as vision.

In the last years, a new wave in the field of Artificial Intelligence, called Deep Learning (DL), enabled researchers to automate the perception and understanding of visual data by extracting information with a high level of abstraction from raw pixels, drastically limiting the human labor in the process. With the term Deep Learning, the research community indicates the family of Machine Learning methods which aim to automatically learn from data a hierarchy of feature extractors that map the input data into a high-level feature space tailored to the specific task to solve. In the context of computer vision and image representation, Deep Learning, and in particular Convolutional Neural Networks, revolutionized feature engineering and visual understanding, outperforming handcrafted models on multiple vision tasks such as object recognition and detection, image description and retrieval, and many more. Convolutional Neural Networks are artificial neural networks specifically tailored to process image data and trained in a supervised end-to-end fashion. Although Convolutional Neural Networks (CNNs) have been around for many years, we pinpoint their turning point in 2012, the year in which a deep convolutional neural network model outperformed approaches based on handcrafted features in the ImageNet Large Scale Visual Recognition Challenge, a fine-grained image classification task including a thousand high-level concepts and semantics. Following this trend, the last six years have experienced an overwhelming adoption of deep neural network models, which set the new state of the art in numerous applications spanning multiple fields, including visual perception and image understanding. This rebirth of CNNs is attributed to multiple factors, the most important ones being the availability of large-scale datasets of labeled images (such as ImageNet) and the computational power offered by modern hardware accelerators such as GPUs. These factors contributed to training models with millions of parameters that can achieve astonishing performance on challenging problems.

Nevertheless, deep-learning-based solutions pose non-trivial engineering challenges in their adoption. In this thesis, we tackle critical limitations encountered in the usage of Convolutional Neural Networks (CNNs) and propose their adoption in novel approaches for large-scale content-based image management.

1.1 Objectives and Contributions

Excluding the first two chapters, which introduce our work and provide the reader with relevant background, the dissertation is divided into two logical parts.

In the first part, we focus our investigation on one of the most widespread and successful problems in which CNNs shine, that is, image classification. In this context, we investigate some of the most common and well-known limitations of DL-based approaches, which are of general interest and not solely relevant to image classification, and we propose solutions to mitigate their effects and evaluate their potential in practical applications. Specifically, we report in the following the major drawbacks we analyzed, together with the contributions we propose in this study to tackle them.

Resource Hunger DL-based solutions, including CNNs, often require a considerable amount of computational resources. In order to learn a functional hierarchy of features, models are defined to be deep, i.e., they need to stack many parametric transformations (called layers). The increased model size requires a considerable computational and memory budget for model training and evaluation, thus drastically limiting the applications of this kind of solution in restricted environments with limited power resources, such as IoT devices and smartphones, which currently delegate complex data analysis to a centralized server. In this context, Chapter 3 presents our study on the reduction of CNN models targeting smart embedded devices with limited resources. We propose a miniaturized CNN architecture for image classification capable of running on low-resource camera-equipped embedded devices, and evaluate its effectiveness in the task of visual parking lot occupancy detection, which benefits greatly from the adoption of decentralized and cost-effective vision-based solutions. We conduct an extensive evaluation of our approach on multiple datasets to compare it against popular fully-featured CNN classification models and state-of-the-art vision-based approaches for parking detection. Results show that our reduced models outperform the state of the art in visual parking lot occupancy detection and maintain a satisfactory level of generalization in comparison to computationally expensive models.

Human Supervision The increased model size also requires a larger amount of human supervision (in terms of the size of manually labeled data) needed to learn the parameters of the model properly. Even if the creation of a training set for a DL model is a one-time process, the amount of manual labor needed for its preparation still represents one of the highest costs of this kind of solution. To overcome this problem, in Chapter 4, we explore alternative methods for the generation of training sets that rely on weakly-labeled data, i.e., data with noisy labels. The motivation for investigating this direction is that weakly-labeled data can be easily obtained in vast amounts by scraping the Web. Thus, we propose a fully automated pipeline for building an image classification training set, which leverages the co-occurrence of text and images in posts of the Twitter social media platform.

Unexpected Behaviors Unlike manually engineered solutions, we usually do not have particular guarantees about, or a full understanding of, the feature detectors that deep models will learn from data. Despite being effective and highly optimized for the training data distribution, the high dimensionality of the parameter space usually leads to unexpected behavior on out-of-distribution data. A severe consequence of this behavior of deep models, and thus of CNNs, is their high vulnerability to evasion attacks, i.e., algorithms that aim to bypass filters by crafting malicious inputs, called adversarial examples, that are usually undetectable to human eyes but lead to a blatant misclassification. In a scenario in which DL-based classifiers are more and more employed in sensitive applications, such as spam/malware filtering or pornographic/violent content moderation, robustness to this kind of attack is essential. In this context, Chapter 5 presents a novel detection scheme for adversarial examples in Deep Neural Networks (DNNs) for increasing the robustness of CNN classifiers to out-of-distribution data.

In the second part of the dissertation, we investigate the adoption of CNNs and CNN-based image representations in the context of content-based image retrieval, and we propose practical solutions to the constraints imposed by large-scale scenarios. To this aim, we focus our effort on bridging the gap between the user-friendliness and efficiency of extensively studied text information retrieval approaches and large-scale image search in order to facilitate its adoption and foster applications. Specifically, we discuss the contributions of this part in the following paragraphs.

Metadata-free textual search of images The natural query paradigm for content-based image search is query-by-example: the user is asked to upload an image with content similar to the one he or she wants to retrieve, and the system extracts from it a descriptor which is compared against the database in the descriptor space in order to perform the search. Despite the simplicity of this paradigm, which does not require additional data, a metadata-based web image search engine featuring textual queries usually offers more flexibility to the user and does not require him or her to first obtain a query image. However, most of the created data does not come with metadata describing its content. In Chapter 6, we introduce a cross-media image retrieval approach that aims to retain the ability to both express textual queries and perform metadata-free content-based image searches. We propose to map textual queries into the space in which image descriptors extracted from pre-trained CNNs live, and to perform searches in this visual space, which can be interpreted as a language-agnostic semantic space. We argue that this choice brings essential advantages in large-scale scenarios, most importantly the flexibility offered by visual spaces. A change in the search paradigm, e.g., supporting a new language, requires only updating the mapping function accordingly, without recomputing the image descriptors of the entire database.

Image search on Full-Text Search Engines While the technology of textual search engines has evolved rapidly in the last years and led to the creation of many open source projects (e.g., Apache Solr, Elasticsearch), most content-based image search engines are either closed-source solutions or research projects of difficult usability. In particular, most of the available solutions for content-based image retrieval do not scale to large-scale scenarios as well as extensively developed decentralized full-text inverted indexes do. Chapter 7 investigates the adoption of surrogate text representations, an approach that enables content-based image search on existing full-text search engine backends based on inverted indexes, e.g., Apache Lucene, Elasticsearch. We analyze, formalize, and improve the transformations that map image descriptors into surrogate texts which can be indexed and searched by full-text engines without requiring additional software, and we compare them with state-of-the-art main-memory indexes on large-scale image retrieval datasets. Experimental results show that we obtain comparable performance while scaling beyond main memory by exploiting the available search engine software.


Finally, Chapter 8 concludes the dissertation by discussing our contributions and defining new research directions to foster visual data accessibility through data-driven learned models and representations.

CHAPTER 2

Background

In this chapter, we provide the reader with the basic concepts about Deep Learning and Convolutional Neural Networks (CNNs). Specifically, we introduce their fields of application investigated in this thesis, namely Image Classification and Content-based Image Retrieval (CBIR). Image classification provides a general and straightforward framework for many applications and thus represents one of the most typical usages of CNNs.

In the first part of the thesis (Chapters 3 to 5), we will investigate solutions to general limitations of CNNs, and, without loss of generality, we will evaluate them in practical applications of image classification. Thus, we will provide the reader with an overview of state-of-the-art approaches in this field.

The second part of the thesis (Chapters 6 and 7) will investigate novel solutions for large-scale CBIR, focusing on the interfaces and indexing patterns users are most familiar with, i.e., those for textual documents. Therefore, in this chapter we will provide some background about CBIR, focusing on CNN-based image representations, and we will also give an overview of standard models for text representation, which are used in Chapters 4 and 6 for harnessing textual data to build better image representations.

The chapter is organized as follows. In Section 2.1, we provide the reader with a quick introduction to Deep Learning, focusing on deep neural networks for image and text processing and gradient-based optimization. In Section 2.2, an introduction to image classification using CNNs is presented together with a review of successful approaches in this field. In Section 2.3, we describe the main aspects of CBIR based on image representations extracted from deep neural networks, and we discuss some state-of-the-art methodologies to build effective descriptions of images and to index them in large-scale scenarios efficiently. Section 2.4 summarizes the public datasets used in the experiments presented in this thesis.

2.1 Deep Learning

Deep Learning (DL) defines a subfield of Artificial Intelligence, specifically of Machine Learning (ML) and Representation Learning, in which a hierarchy of data representations (or features) is learned from data for solving a particular task [30, 84]. Inspired by nature, Deep Learning models are often implemented as Deep Neural Networks (DNNs), computational models comprised of multiple layers of processing units that mimic the structure and the interconnections of neurons in the mammalian brain [188]. Recent years have witnessed a rapid rise in the use of DL approaches to solve complex tasks, reaching even super-human performance in perceptual [94], control [154], and planning [206] activities. Interestingly, the representations learned with DL methodologies resemble the structures of signals in the neocortex built by our brain to implement intelligent behaviors, suggesting a strong parallelism between the two domains [37, 126].

Deep Neural Networks are organized as a sequence (or, more generally, a graph) of non-linear parametric transformations, known as layers, that act as feature extractors; starting from raw data, each layer searches for useful patterns in its input and provides a higher-level representation of the data to the next layer.

Formally, given an input $x$, we can express the output $y$ of a DNN as
$$y = f(x; \Theta)\,, \tag{2.1}$$
where $f(\cdot)$ is an arbitrary composition of parametric transformations (layers), and $\Theta$ indicates the set of all the parameters, also known as the weights, of the DNN.

Given a training set of input-target couples $X = \{(x_i, y^\star_i),\ i = 1, \dots, N\}$, the quality of a particular configuration of parameters is quantitatively defined by a loss function $\mathcal{L}(X; \Theta)$ that measures how much the model predictions $y$ differ from the targets $y^\star$ provided by $X$. The particular formulation of $\mathcal{L}$ is task-dependent and is further discussed in Section 2.1.3. In the end, the learning problem reduces to the optimization problem
$$\Theta^\star = \arg\min_{\Theta} \mathcal{L}(X; \Theta)\,, \tag{2.2}$$
in which we search for the best parameter configuration $\Theta^\star$ that minimizes the loss function $\mathcal{L}(X; \Theta)$.

The specific layers used and their interconnections define the DNN’s architecture (or computation graph). We can roughly categorize DNNs into Feed-Forward Neural Networks (FFNNs), in which information flows from input to output in a non-recursive cascade of computations, and Recurrent Neural Networks (RNNs), which present a feedback loop in their computation graph. In the following sections, we will review some practical and successful formulations of DNNs, and we will provide the reader with the basics of gradient-based optimization of Equation (2.2).

2.1.1 Feed-Forward Neural Networks

Feed-Forward Neural Networks are networks whose computation graph can be expressed as a directed acyclic graph, i.e., there are no feedback loops and information flows from inputs to outputs in a cascade fashion. Thus, when computing the whole chain from inputs to outputs, called the forward pass of the network, each transformation defined by the layers is computed only once.

Figure 2.1: Commonly used activation functions in DNNs: the Rectified Linear Unit $\varphi(x) = \max(x, 0)$ on the left, the sigmoid $\varphi(x) = 1/(1 + e^{-x})$ in the middle, and the hyperbolic tangent $\varphi(x) = \tanh(x)$ on the right.

Figure 2.2: The parallelism between a real neuron and the artificial one used in MLPs.

Multilayer Perceptrons

The Multilayer Perceptron (MLP) is the simplest form of artificial neural network in the Deep Learning field. An MLP is comprised of a cascade of inner-product (or fully-connected) layers, which are the basic building blocks of DNNs. An inner-product layer performs a linear projection of the input followed by a usually non-linear element-wise activation function. Formally, given an input comprised of $n$ features $x \in \mathbb{R}^n$, the output $y \in \mathbb{R}^m$ of the layer is obtained as
$$y = \varphi\left(\mathbf{W}x + b\right)\,, \tag{2.3}$$
where $\mathbf{W} \in \mathbb{R}^{n \times m}$ and $b \in \mathbb{R}^m$ are learnable parameters of a linear projection to an $m$-dimensional space. Commonly used activation functions $\varphi : \mathbb{R} \to \mathbb{R}$ are the Rectified Linear Unit (ReLU), the sigmoid $\sigma(\cdot)$, and $\tanh(\cdot)$ functions (depicted in Figure 2.1). The dimensionalities of the input $n$ and the output $m$ are referred to as the number of input features and output features, respectively.

Historically, this layer implemented a group of $m$ perceptrons. The perceptron [188] is a biologically-inspired building block for artificial neural networks that has been developed mimicking the structure of biological neurons (see Figure 2.2). Like a neuron cell, it is composed of $n$ inputs $x_i$ ($i = 1 \dots n$), usually connected to the outputs of other neurons, each associated with a weight $w_i$ ($i = 1 \dots n$) expressing how much of the signal coming through that connection is promoted or inhibited. Weights in a perceptron are interpreted as the strength of the interconnection between neuron cells. The neuron "fires" when the combination of its weighted inputs gets above a certain threshold; this is modeled in the perceptron by defining its output as the activation function $\varphi(\cdot)$ applied to the inner product between input and weights
$$y_j = \varphi\left(\sum_{i=1}^{n} w_i x_i + b\right)\,, \tag{2.4}$$
where $y_j$ is the output of a particular neuron and $b \in \mathbb{R}$ is an optional weight used as a bias term. A layer composed of $m$ perceptrons (and thus $m$ outputs $y_j$, $j = 1 \dots m$) sharing the same input $x$ can be formalized with Equation (2.3), where the columns of $\mathbf{W}$ correspond to the weights of the $m$ perceptrons. In MLPs, the outputs of a layer are densely connected to the input of each neuron comprising the next layer, hence the name "fully-connected layer". The output of an MLP with $L$ layers is defined as
$$y = \varphi\left(\mathbf{W}_L\left(\dots \varphi\left(\mathbf{W}_2\left(\varphi\left(\mathbf{W}_1 x + b_1\right)\right) + b_2\right)\dots\right) + b_L\right)\,, \tag{2.5}$$
where $(\mathbf{W}_i, b_i)$ are the parameters of the $i$-th fully-connected layer in the network.
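
To make Equations (2.3) and (2.5) concrete, the following is a minimal NumPy sketch of the forward pass of a two-layer MLP. The layer sizes, the ReLU/linear activations, and the (m, n) weight shape (chosen so that W @ x works directly, whereas the text writes the projection with $\mathbf{W} \in \mathbb{R}^{n \times m}$) are illustrative choices, not taken from the thesis.

import numpy as np

def relu(x):
    # element-wise ReLU activation: phi(x) = max(x, 0)
    return np.maximum(x, 0.0)

def fully_connected(x, W, b, activation):
    # Equation (2.3): y = phi(W x + b); here W has shape (m, n)
    return activation(W @ x + b)

rng = np.random.default_rng(0)
n, hidden, m = 8, 16, 4                      # input, hidden, and output features

# randomly initialized parameters of a two-layer MLP (Equation 2.5 with L = 2)
W1, b1 = 0.1 * rng.normal(size=(hidden, n)), np.zeros(hidden)
W2, b2 = 0.1 * rng.normal(size=(m, hidden)), np.zeros(m)

x = rng.normal(size=n)                       # a single input vector
h = fully_connected(x, W1, b1, relu)         # first fully-connected layer
y = fully_connected(h, W2, b2, lambda z: z)  # linear output layer
print(y.shape)                               # (4,)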

Convolutional Neural Networks

A CNN is a feed-forward artificial neural network composed of at least one convolutional layer. This kind of layer computes the cross-correlation between the input and a set of learnable filters. Since there is a substantial similarity between the convolution and the cross-correlation operations, this layer is often attributed the adjective "convolutional" in the DL literature; thus we will adopt the same terminology throughout this thesis.

Cross-correlation The cross-correlation (also called sliding inner product) is typically used in signal processing to search for matches between two signals. Intuitively, a small signal called filter, containing the prototype we want to match, is slid over a bigger input signal, and for each position, the inner product between the intersection of the two signals measures the quality of the match. We will provide the reader with the formulation of the two-dimensional version of the cross-correlation due to its massive adoption in the image-related fields that are of interest for this work.

Given a two-dimensional input matrix $x \in \mathbb{R}^{H \times W}$ and a two-dimensional filter $w \in \mathbb{R}^{K_1 \times K_2}$, the cross-correlation $y \in \mathbb{R}^{H' \times W'}$ between $x$ and $w$ is given by
$$y_{u,v} = \sum_{i=1}^{K_1} \sum_{j=1}^{K_2} w_{i,j}\, x_{i+u-1,\, j+v-1}\,, \tag{2.6}$$
where $u = 1, \dots, H'$ and $v = 1, \dots, W'$. Intuitively, the filter $w$ is superimposed on the input $x$, and for each possible position $(u, v)$, the scalar product between the covered input and the filter is computed. Depending on the presence of a padding $P$ added to each side of the input and the stride $S$ of the filter application, the output dimensionality changes following the relations
$$H' = \left\lfloor \frac{H + 2P - K_1}{S} \right\rfloor + 1\,, \qquad W' = \left\lfloor \frac{W + 2P - K_2}{S} \right\rfloor + 1\,. \tag{2.7}$$
Inputs and outputs of convolutional layers are also called feature maps, since high values in the two-dimensional map are usually interpreted as the presence of a feature a particular filter has learned to identify. A depiction of 2D cross-correlation is reported in Figure 2.3.

Figure 2.3: Example of cross-correlation between 2D signals.
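
As an illustration of Equations (2.6) and (2.7), the following is a naive NumPy sketch of 2D cross-correlation with zero padding P and stride S; the function name and the toy sizes are mine, not from the thesis.

import numpy as np

def cross_correlation_2d(x, w, stride=1, padding=0):
    """Naive 2D cross-correlation (Equation 2.6) with padding/stride (Equation 2.7)."""
    H, W = x.shape
    K1, K2 = w.shape
    xp = np.pad(x, padding)                       # zero-pad P pixels on each side
    H_out = (H + 2 * padding - K1) // stride + 1  # output size from Equation (2.7)
    W_out = (W + 2 * padding - K2) // stride + 1
    y = np.zeros((H_out, W_out))
    for u in range(H_out):
        for v in range(W_out):
            patch = xp[u * stride:u * stride + K1, v * stride:v * stride + K2]
            y[u, v] = np.sum(patch * w)           # inner product filter vs. covered input
    return y

x = np.arange(25, dtype=float).reshape(5, 5)      # toy 5x5 input
w = np.ones((3, 3))                               # toy 3x3 filter
print(cross_correlation_2d(x, w).shape)                       # (3, 3)
print(cross_correlation_2d(x, w, stride=2, padding=1).shape)  # (3, 3)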

2D Cross-correlation on Volumes Equation (2.6) defines the cross-correlation operation for inputs and outputs having a single feature map each. Images instead are represented as 3D tensors having $C$ channels (e.g., $C = 3$ for RGB data, $C = 1$ for grayscale) and two spatial dimensions $H$ and $W$; thus, the definition of the 2D cross-correlation is extended to 3D tensors by letting the filters span the depth of the input tensor. In such a way, filters are still applied over the two spatial dimensions $H$ and $W$, but each output value depends on all the input feature maps in a particular spatial position. Formally, given an input tensor $x \in \mathbb{R}^{H \times W \times C}$ and a filter $w \in \mathbb{R}^{K_1 \times K_2 \times C}$, the cross-correlation $y \in \mathbb{R}^{H' \times W'}$ between $x$ and $w$ is defined as
$$y_{u,v} = \sum_{i=1}^{K_1} \sum_{j=1}^{K_2} \sum_{k=1}^{C} w_{i,j,k}\, x_{i+u-1,\, j+v-1,\, k}\,. \tag{2.8}$$
A convolutional layer is often composed of a bank of $K$ filters. Each filter is applied to the input, obtaining $K$ output feature maps that are stacked along the depth dimension. The obtained output is a new 3D tensor $y \in \mathbb{R}^{H' \times W' \times K}$ that is commonly followed by an element-wise non-linear activation $\varphi(\cdot)$. The entire process is depicted in Figure 2.4.

Figure 2.4: Depiction of 2D cross-correlation on volumes implemented in convolutional layers.

The main difference between fully-connected layers and convolutional layers is the way weights are utilized; while fully-connected layers have a dedicated weight for each couple of input and output features, a convolutional layer shares the weights of its filters across the spatial dimensions, thus learning to detect translation-invariant features by design.
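
The multi-channel case of Equation (2.8) with a bank of K filters can be sketched by a small extension of the previous function; shapes and names are again illustrative (stride 1, no padding).

import numpy as np

def conv_layer(x, filters, activation=lambda z: np.maximum(z, 0.0)):
    """x: (H, W, C) input tensor; filters: (K, K1, K2, C) bank of K filters.
    Returns a (H', W', K) tensor, one feature map per filter (Equation 2.8)."""
    H, W, C = x.shape
    K, K1, K2, _ = filters.shape
    H_out, W_out = H - K1 + 1, W - K2 + 1
    y = np.zeros((H_out, W_out, K))
    for k in range(K):                            # one output feature map per filter
        for u in range(H_out):
            for v in range(W_out):
                patch = x[u:u + K1, v:v + K2, :]  # the patch covers all C channels
                y[u, v, k] = np.sum(patch * filters[k])
    return activation(y)

x = np.random.default_rng(1).normal(size=(8, 8, 3))       # e.g., a tiny RGB image
filters = np.random.default_rng(2).normal(size=(4, 3, 3, 3))
print(conv_layer(x, filters).shape)                        # (6, 6, 4)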

Pooling Layers Pooling layers are often used in CNNs to reduce the amount of data to be processed by the next layers. As the name suggests, this kind of layer pools data into groups and aggregates them using a non-parametric aggregation function, such as the maximum or the average. The groups are defined as fixed-size windows that are slid along one or more dimensions of the data in the same way the cross-correlation operator is applied. In the two-dimensional case, input and output sizes follow Equation (2.7). The application of convolutional layers having small strides usually produces redundant local information in their output. Thus, a max-pooling layer is often used to reduce the resolution of intermediate feature maps.

Features hierarchy in CNNs Convolutional layers are stacked to build deep CNNs, which are one of the core tools of Deep Learning for image perception and analysis [125, 202, 207, 94, 95, 238]. Once trained, the filters in a deep CNN tend to organize into a hierarchy of detectors, where layers near the input detect the presence of simple features in the input data, while the following layers build up from them and detect more complex features. A successful example of the representative power of CNNs is object recognition. The visual aspect of objects in images follows a hierarchical organization: an object can be decomposed into parts, parts into patches, patches into textures, textures into edges or blobs, and finally into pixels. CNNs trained to recognize objects, directly or indirectly, often organize their hierarchy of detectors following this kind of visual decomposition of objects (see Figure 2.5).

Figure 2.5: Visualization of the feature hierarchy learned by convolutional layers in a DL model trained on faces. Low-level features (edges and blobs, on the left) are detected and then combined to recognize parts (eyes, nose, mouth, in the middle) and finally faces (on the right). Image by Lee et al. [133].

2.1.2 Recurrent Neural Networks

A Recurrent Neural Network (RNN) is a stateful artificial neural network with feedback connections, in which the output depends not only on the input but also on the current state of the network [84, 192]. The state of the network acts as a memory which is updated at each input, allowing information to persist between inputs. This architecture is naturally able to process sequences of inputs; the elements of a sequence can be fed to the network one by one, updating the internal state of the network to compactly represent the sequence processed so far, regardless of its length. RNNs set the state of the art on many sequence modeling tasks, such as the ones concerning textual and language representations. In Chapters 4 and 6, we will adopt this kind of model to extract information from the texts accompanying images, and thus we provide the reader with the basic concepts and models involved.

Figure 2.6: Basic architecture of a Recurrent Neural Network (left). On the right, its unrolled version for a sequence of length 5.

Given a sequence of inputs of length $L$, $x_t \in \mathbb{R}^n$, $t = 1 \dots L$, and an initial state $h_0 \in \mathbb{R}^m$, an RNN can be described as a parametric function mapping the input and the state at a particular timestep to the next hidden state
$$h_t = f\left(x_t, h_{t-1}; \mathbf{W}\right)\,, \tag{2.9}$$
where $f(\cdot)$ is the recurrent cell of the RNN parametrized by the weights $\mathbf{W}$, and $h_t \in \mathbb{R}^m$, $t = 1 \dots L$, is the hidden state at timestep $t$, i.e., the state of the network after all inputs up to $x_t$ have been processed. In its simplest form, the recurrent cell $f$ can be implemented using a fully-connected layer that operates on the concatenation of the current state and input
$$h_t = \mathbf{W}\left[x_t \,|\, h_{t-1}\right] + b\,, \tag{2.10}$$
where $\mathbf{W}$ and $b$ are the parameters of a fully-connected layer, and $[\cdot|\cdot]$ is the concatenation operation. During learning, the RNN is unrolled in time to the length of the input sequence to create an acyclic computation graph similar to FFNNs (see Figure 2.6), on which standard training procedures can be applied (these will be discussed in Section 2.1.4). Depending on the task we want to model, we could be interested in the last hidden state $h_L$, which compactly represents the whole sequence, or we could use a combination of all hidden states $h_i$, $i = 1 \dots L$, to get access to the evolution of the information along the sequence.
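
The recurrence of Equations (2.9) and (2.10) can be sketched in NumPy as follows; the sequence length and dimensionalities are arbitrary, and the tanh squashing is a common stabilizing choice added here, since Equation (2.10) itself is purely linear.

import numpy as np

def rnn_forward(xs, h0, W, b):
    """Vanilla RNN (Equation 2.10): h_t = W [x_t | h_{t-1}] + b, squashed by tanh.
    xs: iterable of L input vectors; returns all hidden states h_1..h_L."""
    h = h0
    states = []
    for x_t in xs:                                       # unrolling over the sequence
        h = np.tanh(W @ np.concatenate([x_t, h]) + b)    # tanh is an added choice
        states.append(h)
    return states

rng = np.random.default_rng(0)
n, m, L = 4, 6, 5                          # input size, state size, sequence length
W = 0.1 * rng.normal(size=(m, n + m))
b = np.zeros(m)
xs = [rng.normal(size=n) for _ in range(L)]
states = rnn_forward(xs, h0=np.zeros(m), W=W, b=b)
print(len(states), states[-1].shape)       # 5 (6,); h_L summarizes the whole sequence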

Figure 2.7: The unrolled version of a Bidirectional RNN for a sequence of length 5.

Bidirectional RNNs In many tasks, such as sequence classification and segmentation, the future context of the sequence is also informative. However, Equation (2.9) defines an RNN encoding only past information, i.e., $h_t$ depends only on information at time steps $t' < t$. To overcome this limitation, bidirectional RNNs have been proposed [201]. Bidirectional architectures add a separate recurrent cell which processes the sequence in reverse order (from future to past)
$$h'_t = f'\left(x_t, h'_{t+1}; \mathbf{W}'\right)\,, \tag{2.11}$$
where $h'_t$ encodes the "future" part of the sequence, i.e., depends on information at time steps $t' > t$. The concatenation of the internal states of both the past-to-future and future-to-past RNNs, $[h_t \,|\, h'_t]$, is often used as the representation of the whole sequence.

Long Short-Term Memory

Long Short-term Memory (LSTM) [98] is a type of RNN with a particular formulation of the recurrent cell that has proved its effectiveness in several sequence-modeling tasks [88, 213, 66, 230]. This specific formulation is aimed at solving some significant drawbacks of vanilla RNNs, most importantly the inability to encode long-term dependencies in long sequences due to gradient instability problems, namely the problem of vanishing or exploding gradients [176]. LSTM cells are comprised of four learnable gating functions that better control the evolution of the internal state of the cell when dealing with long sequences, mitigating problems occurring in vanilla RNNs, such as vanishing gradients [28].

Let $x_t \in \mathbb{R}^n$ be an element of a sequence, and $h_{t-1} \in \mathbb{R}^m$ the current hidden state. The forget gate $f_t$ decides which information is to be discarded at the current step, and it is implemented as a fully-connected layer with $p$ output features and parameters $w_f \in \mathbb{R}^{(n+m) \times p}$, $b_f \in \mathbb{R}^p$, followed by a sigmoid activation
$$f_t = \sigma\left(w_f [h_{t-1} \,|\, x_t] + b_f\right)\,.$$
The input gate $i_t$ encodes the information that needs to be inserted in the internal state of the cell $c_t$, and it is implemented following the same rationale as the previous gate
$$i_t = \sigma\left(w_i [h_{t-1} \,|\, x_t] + b_i\right)\,.$$
The update gate $u_t$ modulates the amount of new input to be inserted in the internal state of the cell $c_t$
$$u_t = \tanh\left(w_u [h_{t-1} \,|\, x_t] + b_u\right)\,.$$
The new cell state $c_t \in \mathbb{R}^p$ is computed as the sum of two terms; the first one, given by the element-wise product ($*$) between the forget gate and the previous cell state, represents the part of the old state not to be forgotten; the second one, computed as the element-wise product of the input and update gates, represents the modulated input to be kept in the new state;
$$c_t = f_t * c_{t-1} + i_t * u_t\,.$$
The output gate $o_t$ decides which information in the cell needs to be transferred to the final state $h_t$
$$o_t = \sigma\left(w_o [h_{t-1} \,|\, x_t] + b_o\right)\,,$$
and finally, the output state $h_t$ is computed as
$$h_t = o_t * \tanh\left(c_t\right)\,.$$
With abuse of notation, the complete LSTM formulation can be compactly written as
$$\begin{bmatrix} i_t \\ f_t \\ o_t \\ u_t \end{bmatrix} = \begin{bmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{bmatrix} \mathbf{W} \begin{bmatrix} x_t \\ h_{t-1} \end{bmatrix}\,, \qquad c_t = f_t * c_{t-1} + i_t * u_t\,, \qquad h_t = o_t * \tanh\left(c_t\right)\,, \tag{2.12}$$
where $\mathbf{W} \in \mathbb{R}^{(n+m) \times 4p}$ summarizes the parameters of all the fully-connected layers forming the gates.
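
As a minimal sketch, the gate equations above can be transcribed into NumPy as follows, with one separate weight matrix per gate instead of the stacked $\mathbf{W}$ of Equation (2.12); the (m, n+m) weight shapes and the choice p = m are assumptions made for the example.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step; params holds (w, b) for the forget, input, update, output gates."""
    z = np.concatenate([h_prev, x_t])                  # [h_{t-1} | x_t]
    f_t = sigmoid(params["wf"] @ z + params["bf"])     # forget gate
    i_t = sigmoid(params["wi"] @ z + params["bi"])     # input gate
    u_t = np.tanh(params["wu"] @ z + params["bu"])     # update (candidate) gate
    o_t = sigmoid(params["wo"] @ z + params["bo"])     # output gate
    c_t = f_t * c_prev + i_t * u_t                     # new cell state
    h_t = o_t * np.tanh(c_t)                           # new hidden state
    return h_t, c_t

rng = np.random.default_rng(0)
n, m = 4, 4                                # here p = m so that c_t and h_t share a size
params = {k: 0.1 * rng.normal(size=(m, n + m)) for k in ("wf", "wi", "wu", "wo")}
params.update({k: np.zeros(m) for k in ("bf", "bi", "bu", "bo")})
h, c = np.zeros(m), np.zeros(m)
for x_t in rng.normal(size=(5, n)):        # process a sequence of length 5
    h, c = lstm_step(x_t, h, c, params)
print(h.shape)                             # (4,)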

2.1.3 Loss Functions

Loss functions are a vital component of ML methods. Their role is to quantitatively assign a value of goodness to a model with reference to the particular task we want to solve.

Given a training set $X = \{(x_i, y^\star_i),\ i = 1, \dots, N\}$ comprised of $N$ couples of inputs and desired outputs, the quality of a particular configuration of parameters $\Theta$ is quantitatively defined by a loss function $\mathcal{L}(X; \Theta)$ that measures how much predictions and targets differ. A loss function is usually defined as one or more terms summed together. The main term $\mathcal{L}_d(X; \Theta)$ is defined as the average of the individual loss values computed on each sample of the dataset, $L(y_i, y^\star_i)$; we denote it the data term. An optional secondary term $\mathcal{L}_r(\Theta)$ is often used to regularize the network and depends only on the model parameters $\Theta$; we indicate this as the regularization term. Regularization consists of model constraints added to avoid undesired properties arising during training, such as overfitting, i.e., the tendency of a model to predict well-known data while failing to make correct predictions for unseen data. Regularization has gained importance in the DL field due to the vast amount of parameters models are usually comprised of, which increases the risk of overfitting on the training data. The regularization term is usually multiplied by a hyperparameter $\alpha$ and then added to the loss function. A general formulation of the loss function is
$$\mathcal{L}(X; \Theta) = \underbrace{\mathcal{L}_d(X; \Theta)}_{\text{data term}} + \alpha \underbrace{\mathcal{L}_r(\Theta)}_{\text{regularization term}} = \frac{1}{N} \sum_{i=1}^{N} L(y_i, y^\star_i) + \alpha\, \mathcal{L}_r(\Theta) = \frac{1}{N} \sum_{i=1}^{N} L\left(f(x_i; \Theta), y^\star_i\right) + \alpha\, \mathcal{L}_r(\Theta)\,. \tag{2.13}$$

In the next paragraphs, we will provide the reader with some of the most used formulations of $L(y_i, y^\star_i)$ and $\mathcal{L}_r(\Theta)$ of interest for this work.

Mean Squared Error Loss When dealing with regression problems, we want our predictions to be close to one or more real-valued targets. For example, in an age estimation problem, we want our network to predict the exact age expressed as a real value, e.g., 46.5. The Mean Squared Error (MSE) between predictions $z$ and targets $z^\star$ is a commonly used loss function to measure the goodness of regressions
$$L_{\text{MSE}}(z, z^\star) = \frac{1}{2}\left(z - z^\star\right)^2\,. \tag{2.14}$$
The scale factor $1/2$ is usually introduced to simplify the gradient computation when using this loss function (more details on this in Section 2.1.4).

Cross-entropy Loss The cross-entropy loss is commonly used in ML to measure the distance between two categorical distributions. It is commonly adopted in single-label classification tasks, where a single label has to be assigned to a piece of data choosing from $N$ labels ($N \geq 2$). Let $z$ and $z^\star$ be the probability masses of two $N$-way categorical distributions, i.e., $z_i, z^\star_i \in [0, 1]$, $\sum_{i=1}^{N} z_i = 1$, $\sum_{i=1}^{N} z^\star_i = 1$; the cross-entropy loss between the predicted distribution $z$ and the target one $z^\star$ is defined as
$$L_{\text{CE}}(z, z^\star) = -\sum_{i=1}^{N} z^\star_i \log z_i\,. \tag{2.15}$$
Classification models are often designed to output an $N$-dimensional vector $z$ that is mapped to a categorical distribution via the softmax function
$$\text{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{N} e^{z_j}}\,. \tag{2.16}$$
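
For concreteness, the softmax of Equation (2.16) and the cross-entropy of Equation (2.15) can be sketched as follows; the subtraction of the maximum and the small epsilon are standard numerical-stability tricks, not part of the formulas above.

import numpy as np

def softmax(z):
    # Equation (2.16); subtracting max(z) leaves the result unchanged but avoids overflow
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def cross_entropy(z_pred, z_true):
    # Equation (2.15): -sum_i z*_i log z_i (epsilon guards against log(0))
    return -np.sum(z_true * np.log(z_pred + 1e-12))

logits = np.array([2.0, 0.5, -1.0])        # raw N-dimensional network output
target = np.array([1.0, 0.0, 0.0])         # one-hot target distribution
probs = softmax(logits)
print(probs, cross_entropy(probs, target))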

Contrastive (or Siamese) Loss The contrastive (or siamese) loss [34, 92] is employed for learning representations of data suited to be compared with a metric function $d(\cdot, \cdot)$. Thus, it is particularly useful for learning-to-rank problems [36, 41], in which the objective is to learn representations that need to be matched efficiently for scoring large sets of items, e.g., for retrieval and ranking purposes. The single data sample on which this loss is applied is comprised of a pair $(x_1, x_2)$ of similar (matching, $y^\star = 1$) or dissimilar (mismatching, $y^\star = 0$) items. We denote as $f(x_1)$ and $f(x_2)$ the representations of the data points forming the pair computed by a neural network $f(\cdot)$, and we indicate with $d(\cdot, \cdot)$ a distance function defined over a couple of real vectors, e.g., the squared Euclidean distance function. The contrastive loss is defined as
$$L(x_1, x_2, y^\star) = y^\star \cdot d\left(f(x_1), f(x_2)\right) + (1 - y^\star) \cdot \max\left(0,\ m - d\left(f(x_1), f(x_2)\right)\right)\,, \tag{2.17}$$
where $m$ is a parameter called margin. The first term is only active in the case of matching pairs ($y^\star = 1$) and penalizes representations that are far apart and should instead be nearer, i.e., similar pairs having a high value of $d(f(x_1), f(x_2))$. On the other hand, the second term is only active when a mismatching pair is processed ($y^\star = 0$) and penalizes representations that are too near to each other and should be taken further apart, i.e., dissimilar items having $d(f(x_1), f(x_2)) < m$. The margin parameter $m$ defines the minimum value of the distance at which we want mismatching pairs to be. Back-propagating this loss through the parameters of the network $f(\cdot)$ permits us to steer the network to embed the data into a compact vector space, thus suitable for efficiently computing distances $d(\cdot, \cdot)$, in which similar points lie near each other, and dissimilar points are instead far apart (see Figure 2.8a). When using this loss during training, the computational graph that implements the function $f(\cdot)$ and computes the two representations for the input couple is often organized into two identical branches with shared weights, hence the term "siamese". A depiction of the siamese architecture is reported in Figure 2.8b.

Figure 2.8: Ranking loss and siamese architectures. (a) A ranking loss tends to pull similar elements (black dots) near to an anchor $x_a$ (i.e., the query point) and to push dissimilar ones (gray dots) away from it. The parameter $m$ defines the minimum margin under which two elements are considered similar. (b) Siamese architecture with contrastive loss. (c) Siamese architecture with triplet loss.
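
A sketch of Equation (2.17) on already-computed representations $f(x_1)$ and $f(x_2)$ follows; the margin value and the squared Euclidean distance are illustrative choices.

import numpy as np

def contrastive_loss(f1, f2, y_star, margin=1.0):
    """Equation (2.17) for a single pair; y_star = 1 for matching, 0 for mismatching."""
    d = np.sum((f1 - f2) ** 2)                        # squared Euclidean distance
    return y_star * d + (1 - y_star) * max(0.0, margin - d)

f1, f2 = np.array([0.1, 0.9]), np.array([0.2, 0.8])
print(contrastive_loss(f1, f2, y_star=1))             # small: similar pair already close
print(contrastive_loss(f1, f2, y_star=0))             # large: dissimilar pair is too close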

Triplet Loss Similarly to the contrastive loss, the triplet loss [233, 200] is a ranking loss in which the constraints on distances between items are defined by a triple (x_a, x_p, x_n), where x_a is an anchor data point, x_p is a point similar to x_a (positive), and x_n is a point dissimilar to x_a (negative). The goal of this loss function is to learn a representation for each element of the triplet, f(x_a), f(x_p), and f(x_n), such that the embedded positive is closer to the anchor than the embedded negative by at least a margin m. This constraint can be formalized as

L(x_a, x_p, x_n) = \max\left(0, m - d(f(x_a), f(x_n)) + d(f(x_a), f(x_p))\right) ,   (2.18)

where d(·, ·) is the distance function used to compare learned representations, e.g., the squared Euclidean distance. The common triplet architecture is depicted in Figure 2.8c.
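A corresponding sketch for the triplet loss of Equation (2.18), again assuming precomputed embeddings and the squared Euclidean distance, could look as follows (illustrative only):

```python
import numpy as np

def triplet_loss(fa, fp, fn, margin=1.0):
    # Eq. (2.18): fa, fp, fn are the embeddings of anchor, positive, and negative.
    d_ap = np.sum((fa - fp) ** 2)  # anchor-positive distance
    d_an = np.sum((fa - fn) ** 2)  # anchor-negative distance
    return max(0.0, margin - d_an + d_ap)

fa, fp, fn = np.array([0.0, 1.0]), np.array([0.1, 0.9]), np.array([1.0, 0.0])
print(triplet_loss(fa, fp, fn))  # ~0 when the negative is already far enough
```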

Lp Weight Decay The most used regularization terms in the DL literature are the ones penalizing parameters having a large norm. This is usually implemented by defining the regularization term L_r(Θ), added to the loss to be minimized, as the p-norm of the parameters. Practical definitions have been adopted for p = 2 (L2 weight decay) and for p = 1 (L1 weight decay). The former produces a more uniform utilization of all the available parameters, penalizing under-utilized and over-utilized weights [160], and is defined as

L_r(\Theta) = \frac{1}{2} \|\Theta\|_2^2 = \frac{1}{2} \sum_i \theta_i^2 .   (2.19)

The latter instead tends to produce a sparser weight configuration, i.e., one with many parameters having a null optimal value [160], and is defined as

L_r(\Theta) = \|\Theta\|_1 = \sum_i |\theta_i| .   (2.20)
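The two penalties of Equations (2.19) and (2.20) translate directly into code; the sketch below (illustrative; the scaling coefficient weight_decay is a hypothetical hyperparameter not specified in the text) shows how the regularization term is added to the data loss:

```python
import numpy as np

def l2_penalty(params):
    # Eq. (2.19): half the sum of squared parameters over all parameter tensors.
    return 0.5 * sum(np.sum(theta ** 2) for theta in params)

def l1_penalty(params):
    # Eq. (2.20): sum of absolute parameter values over all parameter tensors.
    return sum(np.sum(np.abs(theta)) for theta in params)

params = [np.random.randn(10, 5), np.random.randn(5)]
weight_decay = 1e-4      # hypothetical regularization strength
data_loss = 0.3          # placeholder for L(X; Theta)
total_loss = data_loss + weight_decay * l2_penalty(params)
```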

2.1.4 Gradient-Based Optimization

As already stated, solutions of Equation (2.2) are the parameter configurations Θ that minimize the loss function L defined over a given training set X. Unfortunately, closed-form solutions of Equation (2.2) are available only in very rare cases for practical formulations of DL models f(·), due to the non-linear, non-convex nature of deep models. In the past years, there was reluctance in the ML field to work with non-convex models due to the lack of theoretical guarantees on their convergence. Nevertheless, recent developments showed that non-convex models, albeit having fewer guarantees, have higher capacity and representational power and thus a better overall performance [30]. The most common approach to find suboptimal (but valuable) solutions to Equation (2.2) is iterative gradient-based optimization.

Gradient Descent In gradient-based optimization, given a training set X and a parameter configuration Θ, we search for a new solution following the direction of the gradient of the loss function ∇L(X; Θ) with respect to the parameters Θ. The direction given by ∇L(X; Θ) is the one of maximum steepness of the loss surface in the parameter space, that is, the one that maximizes the loss change locally. A new parameter configuration Θ⋆ is chosen by descending the loss surface along the direction of maximum steepness with a step size of λ, that is

\Theta^\star = \Theta - \lambda \nabla L(X; \Theta) .   (2.21)

This update rule can be iterated until a (local or global) minimum of the loss function is reached. This procedure is called gradient descent optimization, and in the ML field λ is referred to as the learning rate. A depiction of this algorithm for a 2D toy example is reported in Figure 2.9. Bigger learning rates correspond to bigger steps along the loss surface, which usually improve the convergence rate and help avoid local minima at the cost of a higher risk of oscillating around minimum points. Smaller learning rates, instead, bring a lower convergence rate but can better exploit the local topology of the loss function, reaching a locally better solution.

Figure 2.9: Example of gradient descent optimization on a loss surface L(X) defined over a 2D parameter space Θ = (w1, w2) with two different starting points.
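The iterative update of Equation (2.21) can be summarized in a few lines; the sketch below (an illustrative toy example, with grad_fn standing for any routine returning ∇L(X; Θ)) descends a simple quadratic loss:

```python
import numpy as np

def gradient_descent(grad_fn, theta0, lr=0.1, n_steps=100):
    # Repeatedly apply Eq. (2.21) with learning rate lr (the step size lambda).
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_steps):
        theta = theta - lr * grad_fn(theta)
    return theta

# Toy 2D loss L(w1, w2) = w1^2 + 3 * w2^2, whose gradient is (2*w1, 6*w2).
grad = lambda w: np.array([2 * w[0], 6 * w[1]])
print(gradient_descent(grad, theta0=[3.0, -2.0]))  # approaches the minimum (0, 0)
```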

Back-propagation As formalized in Equation (2.1), DNNs are comprised of a composition of non-linear functions fi, each having its own set of parameters θi. Thus, to perform gradient descent on DL models, we need to compute the gradients ∇L(X; θi) for each θi ∈ Θ. This is efficiently done by back-propagation [192, 131].

In back-propagation, we start computing the gradient of the loss function from the last layer (the one that produces the final predictions), and we propagate the error backward to previous layers using the chain rule for computing derivatives of compound functions. For simplicity, we report the formulation of back-propagation for a Feed-Forward Neural Network (FFNN), but this mechanism is general and applies to all differentiable architectures. Let f(x; Θ) be a FFNN composed of L layers with parameters Θ = {θ_1, ..., θ_L} as defined in Equation (2.1), and let o_i be the output of the i-th layer

o_0 = x
o_i = f_i(o_{i-1}; \theta_i) .   (2.22)

The gradient \frac{\partial L}{\partial \theta_i} of the loss with respect to the parameters of the i-th layer can then be computed via the chain rule as

\frac{\partial L}{\partial \theta_i} = \frac{\partial L}{\partial o_i} \cdot \frac{\partial o_i}{\partial \theta_i} = \frac{\partial L}{\partial o_L} \cdot \frac{\partial o_L}{\partial o_{L-1}} \cdots \frac{\partial o_{i+1}}{\partial o_i} \cdot \frac{\partial o_i}{\partial \theta_i} ,   (2.23)

where the partial derivatives \frac{\partial o_{i+1}}{\partial o_i} and \frac{\partial o_i}{\partial \theta_i} are well defined and depend on the implementation of the i-th layer f_i. The formulations of the loss function L and of the intermediate layers f_i are chosen to be differentiable to ensure that the entire model, being a composition of differentiable functions, is in turn differentiable, and that back-propagation can be applied to compute gradients efficiently.
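To illustrate how Equations (2.22) and (2.23) are applied in practice, the following NumPy sketch performs a manual forward and backward pass for a toy two-layer network with a squared-error loss (the architecture, sizes, and loss are arbitrary illustrative choices, not a model used in the thesis):

```python
import numpy as np

# Toy model: o1 = tanh(W1 x), o2 = W2 o1, loss L = 0.5 * ||o2 - y||^2.
rng = np.random.default_rng(0)
x, y = rng.normal(size=3), rng.normal(size=2)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))

# Forward pass: store the intermediate outputs o_i (Eq. 2.22).
o1 = np.tanh(W1 @ x)
o2 = W2 @ o1
loss = 0.5 * np.sum((o2 - y) ** 2)

# Backward pass: chain rule (Eq. 2.23) from the last layer back to the first.
dL_do2 = o2 - y                                 # dL/do2
dL_dW2 = np.outer(dL_do2, o1)                   # dL/dW2
dL_do1 = W2.T @ dL_do2                          # error propagated to the first layer
dL_dW1 = np.outer(dL_do1 * (1 - o1 ** 2), x)    # tanh'(a) = 1 - tanh(a)^2

# One gradient descent step (Eq. 2.21) on both parameter sets.
lr = 0.1
W1 -= lr * dL_dW1
W2 -= lr * dL_dW2
```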

Stochastic Gradient Descent In our presentation of gradient-based optimization so far, the computation of the loss function L(X; Θ) and of its gradient depends on the whole dataset X. This may be limiting for DL applications, where huge training sets are needed to train models, thus increasing the computational cost required to compute the exact value of L(X; Θ). To overcome this problem, Stochastic Gradient Descent (SGD), and in particular mini-batch SGD [192], has been proposed to compute an estimate of the loss function and of its gradients at a lower computational cost.

In mini-batch SGD, we use a small random sample of the entire training set X̃ = {(x̃_i, ỹ_i), i = 1 ... B} ⊂ X (called batch or mini-batch) to compute an estimate of the loss function

\tilde{L}(X; \Theta) = L(\tilde{X}; \Theta) = \frac{1}{B} \sum_{i=1}^{B} L(f(\tilde{x}_i; \Theta), \tilde{y}_i) ,   (2.24)

which in turn is used in back-propagation to estimate its gradient ∇L(X̃; Θ) and perform the parameter update as specified in Equation (2.21). The mini-batch size B is a hyperparameter that controls the trade-off between the computational cost needed to compute the loss and the quality of the loss estimate.
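A sketch of one training epoch with mini-batch SGD follows (illustrative; grad_fn is a hypothetical helper returning the gradient of the mini-batch loss of Equation (2.24) with respect to the parameters):

```python
import numpy as np

def minibatch_sgd_epoch(X, Y, theta, grad_fn, lr=0.01, batch_size=32):
    # Shuffle the training set and apply Eq. (2.21) on each mini-batch estimate.
    order = np.random.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]   # indices of the current mini-batch
        theta = theta - lr * grad_fn(X[idx], Y[idx], theta)
    return theta
```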

SGD with Momentum A successful proposal to improve the parameter update rule presented in Equation (2.21) is SGD with momentum [182]. The key idea of momentum is to add to the current direction given by the loss gradient a fraction of the direction computed in the previous iteration, such that the direction we follow to descend the loss surface gains inertia and smooths out possible oscillations caused by very steep loss manifolds. Given m^{(k-1)}, the direction of the previous iteration, we define the new direction to follow m^{(k)} as

m^{(0)} = 0
m^{(k)} = \gamma m^{(k-1)} + \nabla L\left(X; \Theta^{(k)}\right)
\Theta^{(k+1)} = \Theta^{(k)} - \lambda m^{(k)} ,   (2.25)

where γ ∈ (0, 1) is the momentum of the previous direction. Commonly used values for γ are around 0.9. A good analogy to comprehend the rationale behind momentum is a ball pushed down a hill: the higher the momentum γ, the more inertia the ball has, and the less it is influenced by steepness variations during the descent. This modified update rule tends to discard the oscillating directions of the gradient while strengthening the stable directions over iterations; this reduces oscillations while descending the loss manifold and thus yields faster convergence.
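Equation (2.25) amounts to keeping an exponentially decayed running direction; a minimal sketch follows (illustrative, with grad_fn again standing for any gradient oracle):

```python
import numpy as np

def sgd_momentum(grad_fn, theta0, lr=0.01, gamma=0.9, n_steps=100):
    # Eq. (2.25): accumulate the direction m and step along it.
    theta = np.asarray(theta0, dtype=float)
    m = np.zeros_like(theta)                # m^(0) = 0
    for _ in range(n_steps):
        m = gamma * m + grad_fn(theta)      # inertia plus the current gradient
        theta = theta - lr * m
    return theta
```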

Adam Another widely used parameter update rule in the DL literature is Adaptive Moment Estimation (Adam) [120]. Similarly to other proposed update rules (such as Adagrad [70] or Adadelta [245]), this method computes an adaptive learning rate for each parameter to be optimized based on the second moment of the gradients v (i.e., squared gradients), but in addition, it also estimates the first moment m (the mean of the gradient itself) as in SGD with momentum:

m^{(k)} = \beta_1 m^{(k-1)} + (1 - \beta_1) \nabla L\left(X; \Theta^{(k)}\right)
v^{(k)} = \beta_2 v^{(k-1)} + (1 - \beta_2) \left( \nabla L\left(X; \Theta^{(k)}\right) \right)^2 .   (2.26)

The hyperparameters β1 and β2 respectively control the aggressiveness of the exponential averages of the two moments m^{(k)} and v^{(k)}. Since the moments are initialized to zero, the authors propose to use their bias-corrected versions, that is

\hat{m}^{(k)} = \frac{m^{(k)}}{1 - \beta_1^k}    \hat{v}^{(k)} = \frac{v^{(k)}}{1 - \beta_2^k} .   (2.27)

Finally, the moment estimates are used to formulate the Adam update rule

\theta_i^{(k+1)} = \theta_i^{(k)} - \frac{\lambda}{\sqrt{\hat{v}_i^{(k)}} + \epsilon} \, \hat{m}_i^{(k)}    \forall \theta_i \in \Theta ,   (2.28)

where \hat{m}_i^{(k)} and \hat{v}_i^{(k)} are the first and second moment estimates of the gradients corresponding to parameter θ_i, and ε is a small value to prevent division by zero. Commonly used values for the hyperparameters are β1 = 0.9, β2 = 0.999, and ε = 10^{-8}, as suggested by the authors.
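The update rules of Equations (2.26)-(2.28) can be implemented directly; the sketch below (illustrative, with grad_fn a generic gradient oracle) uses the hyperparameter values suggested by the authors:

```python
import numpy as np

def adam(grad_fn, theta0, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, n_steps=1000):
    theta = np.asarray(theta0, dtype=float)
    m = np.zeros_like(theta)   # first moment estimate
    v = np.zeros_like(theta)   # second moment estimate
    for k in range(1, n_steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g            # Eq. (2.26)
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** k)               # Eq. (2.27), bias correction
        v_hat = v / (1 - beta2 ** k)
        theta = theta - lr / (np.sqrt(v_hat) + eps) * m_hat   # Eq. (2.28)
    return theta
```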

Dropout Regularization is not limited to regularization loss terms only. Dropout [97] is a common regularization technique for DNNs which consists in deactivating (dropping) some neurons of an internal layer (by setting their activations to zero) during the training phase, thus enabling the back-propagation of the error only through the surviving neurons. The choice of the neurons to deactivate is random: each neuron on which dropout is applied has a given probability p to survive. The employment of this technique usually increases the number of training iterations needed for a model to converge, because only the subset of weights of the surviving neurons is updated in an optimization step. However, training a random subset of neurons at each iteration reduces the co-adaptation of neurons, which tend to learn more diversified and independent configurations of weights. This results in an increased ability of bigger models to exploit their capacity and limit overfitting. At test time, all neurons are used, virtually utilizing and combining the predictions of many independently trained models. The activations of the layers trained with dropout are scaled by p at test time to approximate the combined prediction produced by the virtual ensemble of sub-networks.
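The following sketch (illustrative; real frameworks provide dropout as a built-in layer) mirrors the description above, dropping neurons during training and scaling activations by p at test time:

```python
import numpy as np

def dropout_forward(activations, p_keep=0.5, training=True):
    # p_keep is the probability p that a neuron survives.
    if training:
        mask = (np.random.rand(*activations.shape) < p_keep).astype(activations.dtype)
        return activations * mask    # dropped neurons output zero
    return activations * p_keep      # test time: scale to match the expected output
```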


2.2 Image Classification

In this section, we formally define the problem of image classification, and we analyze and review recent solutions proposed in the literature based on Deep Learning, in particular Deep Convolutional Neural Networks.

Many real-world problems in image understanding can be formalized as classification problems, such as object and face recognition, anomaly detection, automatic tagging, occupancy detection (discussed in Chapter 3), and visual sentiment analysis (discussed in Chapter 4), to name a few. For this reason, classification methods for images have received increasing attention in the last decade.

2.2.1 Problem Setting

Image classification consists of assigning one or more labels to an image x, choosing from a finite set of L labels. We refer to single-label L-way classification if only one label has to be picked among the L available, referring to the particular case in which L = 2 as binary classification. If multiple labels can be assigned to a piece of data, we talk about multi-label classification. A multi-label L-way classification problem is often decomposed into L independent binary classification sub-problems, each of them focusing on predicting the presence or the absence of one of the L available labels.

In the case of image classification, the data on which classifiers must rely is raw pixel data, usually organized in a tensor with shape (H, W, C), respectively representing the height, the width, and the number of channels of the image. Let I be the image space, i.e., the space of all valid raw pixel values; given an image to be classified x ∈ I, a single-label L-way classifier is a function f mapping x into the label space. The commonly used form of classification is probabilistic, in the sense that soft assignments to the available labels are considered instead of hard ones. In this case, an image x is mapped into its posterior probability p_i = p(y = i | x) of belonging to category i, for each category i ∈ {1, ..., L}. Thus, the classifier is a function f : I → p which maps an image to the parameters p = (p_1, ..., p_L), with \sum_{i=1}^{L} p_i = 1, of a categorical distribution defined over the label space. Soft assignments present a broad range of advantages with respect to hard assignments, the most important one being differentiability, which enables and facilitates gradient-based learning of models for f.
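To make the input and output conventions concrete, the sketch below implements a deliberately simple probabilistic classifier, a linear map on flattened pixels followed by a softmax (an illustrative stand-in for the deep models discussed later; all sizes and names are arbitrary):

```python
import numpy as np

def linear_softmax_classifier(image, W, b):
    # Maps an (H, W, C) pixel tensor to the parameters p_1..p_L of a categorical
    # distribution over the L labels (soft assignment).
    z = W @ image.reshape(-1) + b      # L-dimensional score vector
    e = np.exp(z - z.max())
    return e / e.sum()

H, W_, C, L = 32, 32, 3, 10            # a 32x32 RGB image, 10 labels
image = np.random.rand(H, W_, C)
W, b = 0.01 * np.random.randn(L, H * W_ * C), np.zeros(L)
probs = linear_softmax_classifier(image, W, b)
predicted_label = int(np.argmax(probs))   # hard assignment derived from the soft one
```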

2.2.2 Evaluation

In this section, we report the most commonly used evaluation metrics to measure the quality of classification models. Assume Y = (y_1, ..., y_n) is a set of predictions and Ỹ = (ỹ_1, ..., ỹ_n) the set of the ground-truth labels for a single-label L-way classification problem, with y_i, ỹ_i ∈ {1, ..., L}.

Confusion Matrix One of the fundamental tools for evaluating a set of label assignments is the confusion matrix C = {c_ij}, which compactly reports the co-occurrence of predicted and ground-truth labels. The element c_ij of the matrix indicates the number of times the model predicted the class i for a sample belonging to class j. We report in Table 2.1 the confusion matrix for the binary case (L = 2) together with the terminology used to refer to particular values in the matrix. For the binary case, many useful metrics can be derived from the entries of the confusion matrix.
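A confusion matrix can be accumulated with a single pass over the predictions; the sketch below (illustrative; note that labels are 0-indexed here, whereas the text uses 1, ..., L) also extracts the binary-case terminology of Table 2.1:

```python
import numpy as np

def confusion_matrix(y_pred, y_true, L):
    # C[i, j] counts how many samples of ground-truth class j were predicted as class i.
    C = np.zeros((L, L), dtype=int)
    for yp, yt in zip(y_pred, y_true):
        C[yp, yt] += 1
    return C

# Binary example (L = 2), treating class 1 as "positive" and class 0 as "negative".
y_pred = [1, 0, 1, 1, 0, 0]
y_true = [1, 0, 0, 1, 1, 0]
C = confusion_matrix(y_pred, y_true, L=2)
tp, fp = C[1, 1], C[1, 0]   # true positives, false positives
fn, tn = C[0, 1], C[0, 0]   # false negatives, true negatives
```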
