Deep Learning for Image Classification and Retrieval: Analysis and Solutions to Current Limitations


UNIVERSITÀ DI PISA

DIPARTIMENTO DI INGEGNERIA DELL'INFORMAZIONE
Dottorato di Ricerca in Ingegneria dell'Informazione

Activity Report by the Student Fabio CARRARA - Ph.D. Program, cycle XXXI

Tutor(s):

● Dr. Giuseppe Amato - CNR, Institute of Information Science and Technologies (ISTI)
● Dr. Claudio Gennaro - CNR, Institute of Information Science and Technologies (ISTI)
● Prof. Francesco Marcelloni - UniPI, Dipartimento di Ingegneria dell'Informazione

1. Research Activity

The research activity carried out during the three-year PhD Program consists of proposing novel approaches for Image Classification and Content-Based Image Retrieval (CBIR) implemented with Deep Learning techniques, in particular Convolutional Neural Networks. In its simplest form, image classification is the task of assigning to an image one or more labels picked from a set of predefined classes. In my study, I identified three major limitations of neural networks (high computational cost, the need for huge labeled training datasets, and low robustness to out-of-distribution inputs) and proposed approaches to mitigate them.

Initially, the research activity focused on the analysis of state-of-the-art models for image classification, which were large parametric models such as convolutional neural networks. The first limitation I tackled in this type of model is computational efficiency, since effective architectures in the literature have millions of parameters. This leads to substantial memory occupation and a high computational cost, usually amortized through the use of hardware accelerators such as GPUs. These considerations justified a study on the miniaturization of convolutional network architectures with low-resource systems in mind, e.g., embedded systems. During the activity, new solutions were studied and evaluated in the context of a visual monitoring system for parking occupancy based on smart cameras. Experiments showed that the developed approach improves the state of the art in terms of accuracy and is more robust to occlusions and to variations in viewpoint and lighting [C7]. Moreover, the new network architectures have been compared with state-of-the-art architectures for large-scale image classification (e.g., AlexNet), proving comparable in terms of effectiveness but much more efficient [J5]. As a further contribution, a novel dataset comprising approximately 145,000 labeled parking lot images has been collected and publicly released. This new dataset allowed a comparative analysis of the techniques in response to changes in viewpoint, atmospheric conditions, and seasonal lighting, and highlighted the superior generalization performance of our approach compared to the state of the art.
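
To give an idea of the kind of miniaturized architecture studied, the following is a minimal sketch in PyTorch of a compact CNN binary classifier for parking-space patches. Layer sizes and kernel shapes are illustrative assumptions; the actual architecture proposed in [J5] may differ.

```python
import torch
import torch.nn as nn

class MiniParkNet(nn.Module):
    """Compact CNN for binary (busy/free) parking-space classification.

    Filter counts and kernel sizes are illustrative, not the exact
    miniaturized architecture of [J5].
    """
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=11, stride=4), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(16, 20, kernel_size=5), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(20, 30, kernel_size=3), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(30, num_classes)

    def forward(self, x):
        x = self.features(x)                 # (N, 30, 1, 1)
        return self.classifier(x.flatten(1))

# Example: classify a batch of 224x224 RGB patches of parking spaces.
if __name__ == "__main__":
    model = MiniParkNet()
    patches = torch.randn(8, 3, 224, 224)
    logits = model(patches)                  # (8, 2): scores for free / busy
    print(logits.argmax(dim=1))
```

A model of this size has orders of magnitude fewer parameters than AlexNet-like architectures, which is what makes on-camera (embedded) inference feasible.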

Subsequently, I worked on the problem of human supervision (in terms of manually assigned labels) when training neural models. To cope with this problem, a novel cross-media learning approach has been investigated and proposed to exploit weakly-labelled data. The availability of huge amounts of weakly-labelled data (usually provided by the Web) can be exploited to train image classifiers, reducing the human effort in providing labeled examples. In [C4], a fully unsupervised pipeline to build image classifiers using data crawled from social networks has been defined. The proposed technique has been adopted in the context of visual sentiment polarity estimation, in which obtaining labels is particularly expensive due to the intrinsic subjectiveness of the task. In the proposed cross-media learning approach, the labels of data in one modality are assigned exploiting the labels of a different modality. In particular, in collaboration with the IIT and ILC institutes of the CNR, a huge collection of random English tweets containing images and text has been collected and leveraged, using a previously developed dictionary-based textual sentiment classifier, to build a weakly-labelled image dataset for visual sentiment prediction. Experiments showed that deep convolutional models trained on this dataset outperform state-of-the-art approaches trained on manually labelled data for visual sentiment prediction. The collected dataset, named Twitter for Sentiment Analysis (T4SA), and the trained models have been made publicly available.
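
The cross-media labelling step can be summarized by the following sketch, where the dictionary-based text classifier of [C4] is abstracted as a generic `text_sentiment` function; all names and fields are illustrative assumptions.

```python
from typing import Callable, Iterable, List, Tuple

def build_weakly_labelled_dataset(
    tweets: Iterable[dict],
    text_sentiment: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Label each tweet image with the sentiment predicted on the tweet text.

    `tweets` is assumed to be an iterable of dicts with 'text' and
    'image_path' fields; `text_sentiment` returns e.g. 'positive',
    'neutral', or 'negative'. Names and fields are illustrative.
    """
    dataset = []
    for tweet in tweets:
        if not tweet.get("image_path"):
            continue                                  # keep only tweets with images
        label = text_sentiment(tweet["text"])         # weak label from the text modality
        dataset.append((tweet["image_path"], label))  # transferred to the image modality
    return dataset
```

The resulting (image, weak label) pairs can then be used to train a standard image classifier with no manual annotation.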

Then, I tackled the problem of the reduced robustness of deep neural networks to adversarial images, i.e., natural images maliciously manipulated by adding small perturbations in order to fool the classifier into predicting a wrong class with high confidence. Previous studies showed that changing the model weights to increase its robustness to adversarial images only produces models that are still vulnerable to other adversarial images, which can be found by efficient algorithms applied to the new model. Based on these considerations, in [C5, J3] an approach to detect adversarial images and discard possibly erroneous classifications without changing the model has been proposed. The proposed approach relies on the representation in the visual space created by the activations of intermediate layers of a deep neural network, also known as deep features. Using a kNN search in this space, a score is assigned to the prediction given by the network, and by simply thresholding it we are able to decide whether the prediction is correct or spurious. Experiments on the ImageNet dataset with multiple architectures have shown that the proposed approach is able to discern manipulated images from authentic ones with an accuracy of up to ~85%, indicating that the visual space created by deep features is more robust to this kind of attack. Moreover, we observed that deep features of adversarial images tend to occupy empty regions of the visual space. In light of these findings, in [C1] an improved detection scheme for adversarial examples has been introduced; the proposed approach is based on embedding multiple internal deep features of an image, extracted from deep convolutional neural networks, into a distance space, i.e., a space in which each dimension represents the distance between the input deep features and one of a set of pivot points. Preliminary experiments showed that, using these embedded features, we are able to train high-performance detectors of adversarial images crafted by multiple attack algorithms.
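
A minimal sketch of the kNN-based scoring idea, assuming precomputed deep features of a reference set and using scikit-learn; the function name and the exact scoring rule are illustrative, the actual scheme is described in [C5, J3].

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_agreement_score(query_feat, query_pred, ref_feats, ref_labels, k=10):
    """Score a prediction by checking how many of the k nearest neighbours
    (in deep-feature space) of the input share the predicted class.

    query_feat: (D,) deep feature of the input image
    query_pred: class predicted by the network for the input image
    ref_feats:  (N, D) deep features of reference images
    ref_labels: (N,) labels of the reference images
    """
    nn = NearestNeighbors(n_neighbors=k).fit(ref_feats)
    _, idx = nn.kneighbors(query_feat.reshape(1, -1))
    neighbour_labels = ref_labels[idx[0]]
    return np.mean(neighbour_labels == query_pred)  # in [0, 1]

# Usage: flag the prediction as suspicious (possibly adversarial) when the
# score falls below a threshold chosen on a validation set, e.g.:
# is_adversarial = knn_agreement_score(f, y_hat, F, Y) < 0.5
```

Adversarial inputs tend to land in regions of the feature space populated by examples of a different class (or by no examples at all), which is why their agreement score drops.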

Finally, I worked on improving and facilitating the use of deep neural networks in the context of Content-Based Image Retrieval (CBIR). In CBIR, the objective is to search and retrieve images relying solely on their visual content. This enables easy access to and organization of potentially large-scale archives of unlabelled images without the need for text or other metadata associated with them. Modern CBIR systems often rely on high-dimensional image descriptors extracted from neural networks. Despite their effectiveness, these systems present drawbacks that limit their technological transfer, such as a restricted query paradigm (query-by-example) and the need for ad-hoc search engines. In my research activity, I proposed two approaches to circumvent these drawbacks and fill the gap between image-retrieval approaches and the more mature and widely preferred text-based approaches for indexing and expressing queries.

In [C8, J4], I investigated the problem of image retrieval through textual descriptions provided by the user as queries. After an analysis of various paradigms for cross-media retrieval, an effective approach to carry out the search in the visual space of the deep features has been proposed. The proposed approach is based on mapping the textual query into the visual space spanned by image representations. This transformation was defined by training a neural network, dubbed Text2Vis, on MSCOCO, a dataset of about 120,000 <image, text description> pairs. The experiments conducted on MSCOCO showed that our system is able to retrieve images relevant to the text query even though the images of the dataset are not labeled. One of the advantages of this approach is that it is not necessary to re-index the images when the language changes: it is sufficient to learn a new transformation from the new language to the same visual space.
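
The following is a minimal sketch of the underlying idea, i.e., regressing from a caption embedding to the precomputed visual feature of the corresponding image; layer sizes and the loss are illustrative assumptions, not the exact Text2Vis design of [C8, J4].

```python
import torch
import torch.nn as nn

class TextToVisualSpace(nn.Module):
    """Maps a fixed-size text embedding into the visual-feature space, so
    that text queries can be matched against indexed image descriptors."""
    def __init__(self, text_dim=300, visual_dim=2048, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(text_dim, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, visual_dim),
        )

    def forward(self, text_emb):
        return self.net(text_emb)

# Training sketch: pull the mapped caption towards the feature of its image.
model = TextToVisualSpace()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

caption_emb = torch.randn(32, 300)    # stand-in for caption embeddings
image_feat = torch.randn(32, 2048)    # stand-in for precomputed image features

optimizer.zero_grad()
loss = loss_fn(model(caption_emb), image_feat)
loss.backward()
optimizer.step()

# At query time, the text query is embedded, mapped with `model`, and compared
# (e.g., by dot product) against the indexed image features.
```

Since the image index stores only visual features, supporting a new query language amounts to training a new mapping network, with no re-indexing of the image collection.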

In [J1, C3, C6], I worked on indexing high-dimensional visual features extracted from deep neural networks with textual search engines. This is motivated by the fact that modern textual search engines already efficiently solve the problem of high-dimensional indexing using techniques based on inverted indexes. Instead of building ad-hoc indexes for image representations, the procedure can be formulated as transforming the original visual feature into a surrogate text that can be indexed and retrieved natively by textual search engines such as Lucene or Elasticsearch. I investigated the use of Surrogate Text Representations (STRs). An STR is a piece of text that can be used in full-text search engines to represent an image, and it ensures that the ranking of the retrieved images in response to a query is a good approximation of the ranking obtained by applying the original similarity function (dot product) on the original descriptors. A theoretical formulation of surrogate text representations has been introduced, and under this framework multiple surrogate text techniques have been proposed and evaluated. In particular, I analyzed the scenario in which images are represented using Regional Maximum Activations of Convolutions (RMAC) features, a dense 2048-dimensional float global descriptor extracted using deep neural networks expressly designed for image similarity and retrieval. Two RMAC features are compared using the dot product, obtaining a score by which retrieved images can be ranked. To obtain an STR from RMAC features, two approaches have been proposed: a permutation-based approach, in which an STR is obtained by converting into text the permutation that sorts the vector dimensions in descending order, and a scalar quantization-based one, in which an STR is obtained by scaling and quantizing each dimension of the feature vector. I performed experiments using large-scale image retrieval datasets, analyzed the obtained trade-off between effectiveness (in terms of quality of results) and efficiency (in terms of query cost), and compared them with state-of-the-art techniques based on product quantization (PQ). Results showed that the proposed techniques have a performance comparable to or slightly worse than PQ-based main-memory indices while retaining the advantages of modern full-text search engines, i.e., high scalability thanks to efficient secondary-memory inverted-index implementations (e.g., Apache Lucene, Elasticsearch).
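
The following is a minimal sketch of the scalar-quantization idea behind an STR, assuming non-negative feature values: each dimension becomes a synthetic term repeated proportionally to its quantized value, so that the term-frequency-based dot product computed by the text engine approximates the original dot product. The exact scaling and encoding choices in [J1, C3, C6] may differ.

```python
import numpy as np

def scalar_quantization_str(feature: np.ndarray, scale: float = 30.0) -> str:
    """Turn a non-negative feature vector into a surrogate text document.

    Each dimension i with quantized value q > 0 emits the synthetic term
    'f<i>' repeated q times, so term frequencies mirror feature magnitudes
    and the engine's TF-based scoring approximates the original dot product.
    """
    quantized = np.floor(feature * scale).astype(int)
    terms = []
    for i, q in enumerate(quantized):
        if q > 0:
            terms.extend([f"f{i}"] * q)
    return " ".join(terms)

# Example: two similar vectors produce documents sharing many terms, so a
# full-text engine (Lucene/Elasticsearch) ranks them close to each other.
v = np.clip(np.random.randn(8), 0, None)
print(scalar_quantization_str(v))
```

Both the indexed images and the query vector are encoded this way, so the whole retrieval pipeline runs on an unmodified full-text search engine.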

Novel directions for content-based image retrieval have also been explored. While relational reasoning in computer vision has demonstrated impressive results on difficult tasks, such as Visual Question Answering (VQA), its employment in content-based image retrieval has been less explored. Thus, in [C2] a new task, named Relational Content-Based Image Retrieval (R-CBIR), has been defined, in which we are interested in retrieving images not only containing objects visually similar to those of a query, but also exhibiting similar spatial relationships among them (e.g., A left of B, A behind B). To solve this task in large-scale scenarios, the investigation focused on the creation of a compact feature representation able to encode relationship information among the objects in an image. A modification of the Relation Network module, a differentiable model that showed impressive results on Relational VQA (R-VQA), has been proposed to transfer the knowledge the network learned in solving R-VQA by extracting a compact feature vector encoding visual relational information. Using the synthetic dataset CLEVR, a comparison of our methods with state-of-the-art features for instance-level image retrieval has been conducted, showing that the learned representation produces better image rankings for the R-CBIR task. The model, code, and ground truth are publicly available to foster research in this direction.
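
For reference, a minimal sketch of the Relation Network idea underlying the extracted relational feature: pairwise combinations of object representations are processed by a shared MLP and aggregated into a single vector. Dimensions and the aggregation stage are illustrative assumptions; the modified module in [C2] differs in how the aggregated vector is extracted and used.

```python
import torch
import torch.nn as nn

class RelationalFeature(nn.Module):
    """Aggregates pairwise object relations into one compact vector."""
    def __init__(self, obj_dim=256, rel_dim=256, out_dim=512):
        super().__init__()
        self.g = nn.Sequential(                 # shared MLP over object pairs
            nn.Linear(2 * obj_dim, rel_dim), nn.ReLU(inplace=True),
            nn.Linear(rel_dim, rel_dim), nn.ReLU(inplace=True),
        )
        self.f = nn.Linear(rel_dim, out_dim)    # projects the aggregated relation

    def forward(self, objects):
        # objects: (N, obj_dim) representations of the objects in one image
        n = objects.size(0)
        pairs = torch.cat(
            [objects.unsqueeze(1).expand(n, n, -1),
             objects.unsqueeze(0).expand(n, n, -1)], dim=-1)  # (N, N, 2*obj_dim)
        relations = self.g(pairs.reshape(n * n, -1))           # (N*N, rel_dim)
        return self.f(relations.sum(dim=0))                    # (out_dim,)

# The resulting vector can be compared with dot product / cosine similarity
# to rank images by relational similarity (R-CBIR).
```

Because relational information is summarized into a single fixed-size vector, standard similarity-search machinery can be reused for the new task.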


All research activities were performed within the following projects:

● Smart News, Social sensing for breaking news, co-funded by the Tuscany region under the FAR-FAS 2014 program, CUP CIPE D58C15000270008.

● Automatic Data and documents Analysis to enhance human-based processes (ADA), co-funded by the Tuscany region under the POR-FESR 2014-2020 program, CUP CIPE D55F17000290009.

2. Training Activity

During the PhD Program, I have attended the following courses:

● Academic Writing and Presentation Skills, J. Spataro - 40h (5 CFU)
● Linked Open Data: a paradigm for the Semantic Web, A. Lo Duca - 12h (3 CFU)
● DeepLearn 2017 - Summer School - 50h (10 CFU)
● Game Theory and Optimization in Communications and Networking, M. Luise, L. Sanguinetti - 16h (4 CFU)
● Signal Processing and Mining of Big Data: Biological Data as Case Study, G. Coro - 20h (5 CFU)
● Multi-modal Registration of Visual Data, M. Corsini - 15h (4 CFU)
● Academic English, J. Spataro - 40h (4 CFU)
● Machine Learning Techniques and Selected Applications for Big Data, M. Stewin - 20h (5 CFU)

for a total of 40 (25 internal + 15 external) credits.

3. Research Periods at Qualified Research Institutions

In February and March 2018, I visited the Data Intensive Systems and Applications (DISA) laboratory of Masaryk University in Brno, Czech Republic, for about three weeks. The DISA laboratory is directed by Prof. Pavel Zezula, whose research activity has achieved significant results in the fields of Similarity Search, Image Retrieval, and Database Systems.

During the stay, I participated in activities focused on human action recognition and detection in motion capture (mocap) data using Recurrent Neural Networks. We proposed an LSTM-based architecture to segment actions by predicting their beginnings and endings in long sequences of mocap data, and we evaluated its ability to perform early action prediction on live streams of data. Our method outperforms state-of-the-art approaches for action segmentation on the marker-based HDM-05 dataset in both effectiveness and efficiency. The activity produced [J2], which has been submitted to an international journal.
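
As an illustration of the kind of model investigated, a minimal sketch of a per-frame LSTM tagger over mocap pose sequences follows; feature dimensions and class layout are illustrative assumptions, and the architecture in [J2] may differ.

```python
import torch
import torch.nn as nn

class ActionSegmenter(nn.Module):
    """Predicts, for every frame of a mocap stream, a class marking whether
    an action is beginning, ongoing/ending, or absent."""
    def __init__(self, pose_dim=93, hidden=256, num_classes=3):
        super().__init__()
        self.lstm = nn.LSTM(pose_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, poses, state=None):
        # poses: (batch, frames, pose_dim); `state` enables streaming inference
        out, state = self.lstm(poses, state)
        return self.head(out), state          # per-frame logits

# Streaming usage: feed incoming frames chunk by chunk, carrying the state,
# so beginnings/endings can be predicted early on live data.
model = ActionSegmenter()
chunk = torch.randn(1, 30, 93)                # 30 new frames of one stream
logits, state = model(chunk)
print(logits.shape)                           # (1, 30, 3)
```

Carrying the recurrent state across chunks is what allows early detection on a live stream without reprocessing the whole sequence.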

4. Awards

● ISCC 2016 Best Local Paper Award: for the article [C7] in the 2016 IEEE Symposium on Computers and Communication.

● ISTI Grants for Young Mobility 2017: grant to carry out research in cooperation with foreign Universities and Research Institutions of clear international standing.
○ I used the travel grant for visiting the DISA Laboratory (Masaryk University).
● ISTI Young Researcher Award 2018: award for young staff members (less than 35 years old) of the Institute of Information Science and Technologies (ISTI-CNR) with high scientific production.

5. Publications

List of publications that appeared during the PhD period:

International Journals

[J1] G. Amato, F. Carrara, F. Falchi, C. Gennaro, L. Vadicamo: "Large-Scale Instance-Level Image Retrieval", SUBMITTED to an international journal.

[J2] F. Carrara, P. Elias, J. Sedmidubsky, P. Zezula: "Real-Time Action Detection and Prediction in Human Motion Streams", SUBMITTED to an international journal.

[J3] F. Carrara, F. Falchi, R. Caldelli, G. Amato, R. Becarelli: "Adversarial image detection in deep neural networks", Multimedia Tools and Applications, March 2018.

[J4] F. Carrara, A. Esuli, T. Fagni, F. Falchi, A. M. Fernández: "Picture It In Your Mind: Generating High Level Visual Representations From Textual Descriptions", Information Retrieval Journal, 21 (2), pp. 208-229, October 2017.

[J5] G. Amato, F. Carrara, F. Falchi, C. Gennaro, C. Meghini, C. Vairo: "Deep Learning for Decentralized Parking Lot Occupancy Detection", Expert Systems with Applications, 72, pp. 327-334, April 2017.

International Conferences/Workshops with Peer Review

[C1] F. Carrara, R. Becarelli, R. Caldelli, F. Falchi, G. Amato: "Adversarial examples detection in features distance spaces", in Computer Vision – ECCV 2018 Workshops, Vol. 2, pp. 313-327, Springer, Cham, 2018.

[C2] N. Messina, G. Amato, F. Carrara, F. Falchi, C. Gennaro: "Learning Relationship-aware Visual Features", in Computer Vision – ECCV 2018 Workshops, Vol. 4, pp. 486-501, Springer, Cham, 2018.

[C3] G. Amato, P. Bolettieri, F. Carrara, F. Falchi, C. Gennaro: "Large-Scale Image Retrieval with Elasticsearch", in The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pp. 925-928, ACM, June 2018.

[C4] L. Vadicamo, F. Carrara, A. Cimino, S. Cresci, F. Dell'Orletta, F. Falchi, M. Tesconi: "Cross-Media Learning for Image Sentiment Analysis in the Wild", ICCV Workshops, pp. 308-317, October 2017.

[C5] F. Carrara, F. Falchi, R. Caldelli, G. Amato, R. Fumarola, R. Becarelli: "Detecting adversarial example attacks to deep neural networks", in Proceedings of the 15th International Workshop on Content-Based Multimedia Indexing (CBMI), p. 38, June 2017.

[C6] G. Amato, F. Carrara, F. Falchi, C. Gennaro: "Efficient Indexing of Regional Maximum Activations of Convolutions using Full-Text Search Engines", in Proceedings of the 2017 ACM International Conference on Multimedia Retrieval (ICMR), pp. 420-423, ACM, June 2017.

[C7] G. Amato, F. Carrara, F. Falchi, C. Gennaro, C. Vairo: "Car parking occupancy detection using smart camera networks and Deep Learning", 2016 IEEE Symposium on Computers and Communication (ISCC), pp. 1212-1217, IEEE, June 2016. (Best Local Paper Award)

[C8] F. Carrara, A. Esuli, T. Fagni, F. Falchi, A. M. Fernández: "Picture It In Your Mind: Generating High Level Visual Representations From Textual Descriptions", accepted to the non-archival workshop Neu-IR: The SIGIR 2016 Workshop on Neural Information Retrieval, arXiv preprint arXiv:1606.07287, June 2016.

Pisa, 23/04/19

Fabio CARRARA
