
Dipartimento di Ingegneria dell’Informazione

Master's Degree Programme in

Computer Engineering

Design and Testing of a Neural Network

for Relational Content-Based Image

Retrieval

Candidate:

Nicola Messina

Supervisors:

Dr. Fabrizio Falchi

Dr. Claudio Gennaro

Dr. Giuseppe Amato


Abstract

Humans often perceive the physical world as sets of relations between objects, whatever their nature (visual, tactile, auditory).

Spatial relationships, in particular, are especially important, since they correlate objects in the three-dimensional world in which we are immersed.

In this work we will study deep learning architectures that are able to mimic this spatial consciousness. We will investigate visual deep learning models able to understand how objects in an image are arranged.

In the literature this problem is often framed as VQA (Visual Question Answering): a question regarding the arrangement of objects in a particular image is asked, and the network should be able to answer correctly.

This thesis aims to employ one of the latest proposals in the VQA field, the Relational Networks by the DeepMind team, to introduce Relational Content-Based Image Retrieval (R-CBIR).

Current CBIR systems do not take into consideration relations between objects inside images. We will analyze and modify the Relational Network architecture in order to extract visual relational features, so that relational indexing becomes possible. Statistics will be collected and analyzed in order to measure the accuracy gain relational features reach over standard ones.

We will use the state-of-the-art synthetic dataset CLEVR, and we will also consider one of its variants, Sort-of-CLEVR, built to be simpler and easier to train and debug. The code will be written in Python using the PyTorch framework with CUDA acceleration, and it will be optimized to run on a multi-GPU system.


Contents

1 Introduction
    1.1 Relational Reasoning
    1.2 Question Answering
        1.2.1 Visual Question Answering
    1.3 Content-Based Image Retrieval
    1.4 Contribution of this thesis
    1.5 Outline of the thesis
    1.6 Development Tools
        1.6.1 Deep Learning Frameworks
        1.6.2 Hardware accelerators
    1.7 Hardware

2 Datasets
    2.1 Available Datasets for VQA
        2.1.1 Non-synthetic datasets
        2.1.2 Synthetic datasets
    2.2 CLEVR
    2.3 Sort-of-CLEVR

3 Related work
    3.1 Early techniques
    3.2 State of the art techniques
        3.2.1 Specialized architectures approaches
        3.2.2 Compositional networks approaches
        3.2.3 Conditioning network approaches

4 From VQA to R-CBIR
    4.1 Introduction
    4.2 Challenges
    4.3 Architecture modifications

5 Learning R-Features from Sort-of-CLEVR
    5.1 Training
        5.1.1 End to end training
        5.1.2 Transfer learning
    5.2 Features extraction
    5.3 Relational features evaluation
        5.3.1 Generation of the Ground-Truth
        5.3.2 Convolutional features elaboration
        5.3.3 Evaluation

6 Learning R-Features from CLEVR
    6.1 Writing code from paper specifications
    6.2 Training architecture for R-CBIR
    6.3 Relational features evaluation
        6.3.1 Generation of the Ground-Truth
        6.3.2 RMAC features


Chapter 1

Introduction

1.1 Relational Reasoning

Relational reasoning refers to a particular kind of reasoning process that is able to understand and process relations among multiple entities. It is a must for every form of biological intelligence, regardless of whether we are talking about humans or animals. The environment we are immersed in is composed of large sets of different comparable entities. Since the world in which we live is mainly perceived through two simple mental categories, space and time, entities can be related using spatial associations as well as temporal concepts. For example, two different footballs could be related with respect to their relative positions, their textures, and their materials, as well as with respect to the difference in their speeds once kicked in the air.

In this regard, Krawczyk et al. [9] characterize relational reasoning as the human brain's "unique capacity to reason about abstract relationships among items in our environment".

Biological intelligence developed such reasoning capabilities over thousands of years of evolution: comparing objects is indeed a critical task, since it triggers decisions that can influence the safety of the individual and hence the survival of the species.

As stated by Alexander [1], if individuals were not able to relationally process the sensory information that continually floods them, then they would undoubtedly "be relegated to a world that consists solely of noise or fragmentary pieces of sensory data that carry little or no meaning".

Today machine learning is becoming a wide area of study. Even if early studies appeared in the 1980s, only in recent years has a major implementation and engineering effort been taking place. This is due to improvements in hardware and to the increasing availability of high computing power, which makes these kinds of applications more affordable and therefore convenient. Machine learning, in particular deep learning, turned out to be the right way to attack some problems that would otherwise have been very difficult (if not impossible) to solve. A set of problems now tractable with deep learning pertains to the Computer Vision research branch, which deals with the development of techniques for high-level scene understanding: image classification, deep feature extraction for indexing and fast, scalable retrieval, and image recognition.

Modern deep architectures are quite good at tasks such as classifying or recognizing objects; however, recent studies demonstrated the difficulty such architectures have in intrinsically understanding a complex scene, where understanding implicitly means capturing relations among objects and comparing them along the spatial and temporal dimensions, exactly as biological intelligence does. In other words, differently from biological intelligence, deep architectures can currently perceive the world that surrounds us with quite good accuracy, but still cannot understand it very well.

1.2 Question Answering

In order to know whether a particular architecture is really able to understand, significant effort must be invested even just in the development of analysis methods that try to figure out how the architecture behaves internally. This process runs through the creation of ad-hoc datasets whose purpose is twofold:

• checking whether the architecture is able to correctly learn and generalize;
• understanding whether the reasoning process is authentic and not merely based on the exploitation of data biases.


The scientific community has concentrated its efforts on a particular subset of problems that implicitly require relational reasoning capabilities in order to be correctly solved: they are grouped under the name of Question Answering.

An architecture that solves a question-answering problem can be modeled as a system that takes as input

• a question (usually written in natural language);
• an object the question is asked about;

and outputs an answer, by reasoning on the question asked about that particular object.

The DeepMind team, in its two publications Santoro et al. [16] and Watters et al. [19], introduced three different types of objects that questions can be asked about:

• a sentence. In this case the question is asked with respect to a sentence written in natural language. Here is an example from Santoro et al. [16]: "Sandra picked up the football; Sandra went to the office". If the system is asked the question "Where is the football?", it should be able to predict the answer "Office", since it is a reasonable and logical answer to the asked question;

• a dynamical system. Watters et al. [19] present a scenario in which questions are asked about simple evolving dynamical systems, where masses are linked together by some kind of constraint (e.g., springs). Their architecture should be able to infer how the masses are constrained to one another;

• a picture. The question is asked with respect to a picture. In this case the main objective is understanding the spatial arrangement of the objects in the image, in order to correctly answer questions regarding their relative positioning, or to discriminate whether they are present in the scene or not.

Notice how a deep relational understanding is needed in order to deal with each of these three problems: the question must be deeply understood, each concept deriving from each word needs to be contextualized depending on the syntax, and the concepts learned from natural language must be associated with the ones expressed in the object form. For example, considering fig. 1.1, the words table and window must possibly be associated with the fragments of the image containing the table and the window. In the end, in order to answer the question, spatial or temporal awareness is strongly needed.

1.2.1 Visual Question Answering

This work concentrates on Question Answering problems where the object of the question is an image. The resulting subset of problems goes under the name of Visual Question Answering (VQA). The picture could be a photo as well as a computer-generated (synthetic) picture. Figure 1.1 presents an example of a VQA task with a photo as the image.

Answering this kind of question is not simple at all: it involves a set of reasoning capabilities that make this problem challenging even for humans:

• since the question is posed in natural language, it could be misunderstood. In fact, there are multiple formulations for the same concept and, on the other hand, different concepts can be expressed by the same words; the particular meaning varies depending on the context;

• the image captures a shot of a scene from a certain point of view. Therefore, the spatial dimensions are always influenced by the perspective (the apparent size of objects depends on their distance from the camera) and by the position of the photographer: imagine moving the camera in fig. 1.1 a bit to the left while rotating it slightly to the right. The window would probably become occluded, to the point that the answer Window would turn into an improbable (if not totally incorrect) answer;

• in order to answer correctly, humans usually make great use of common sense. Common sense refers to all the knowledge learned in past experiences, namely concepts that are not inferable by solely looking at the image. For example, the fact that behind the table refers with higher probability to objects behind the longest edge of the table is just common sense.

Figure 1.1: Question: What is behind the table? Expected answer: Window

In order for the VQA task to be executed correctly, an authentic relational understanding is essential. Even before catching relations, a high-level scene understanding is required, since we want to transform the perception of a bare pixel matrix into the higher-level concepts of objects and of the attributes characterizing them. Moreover, this task is complex, since the number of existing relations in principle grows quadratically with the number of objects.

Since the order of the relations does not influence the outcome, considering architectures that are invariant to permutations of the relations is a good way to reduce the problem size.

However, the major risk lies in the possibility that architectures could try to answer correctly by simply exploiting some undesirable data biases. For example, if for some reason the majority of long questions are answered in a binary yes/no form, it is much simpler for the network to use this information to increase its probability of success. By doing so, the network could in principle be able to correctly answer the questions present in the learned dataset, but it is obviously not capable of generalizing, since it has basically learned to cheat without really reasoning.

To this end, datasets should be built using criteria that try to minimize all possible data biases.

1.3 Content-Based Image Retrieval

This work will present concepts derived from the Content-Based Image Retrieval (CBIR) research field. Thus, it is worth mentioning some of the main motivations and working principles behind this subject.

In early times, multimedia data was manually labeled in order to be indexed and classified. This meant that humans had to describe a photo, a video or a song using words or sentences able to fully characterize the content, so that the entire file could be indexed.

Nowadays, multimedia content is the main source of traffic on the web. Multimedia files are continuously created and distributed; one need only think of the incredible amount of photos and videos uploaded to social networks every second all over the world. Such content must be automatically and quickly indexed to be easily retrieved. It becomes evident that manual labeling is no longer possible with such an enormous amount of data; instead, systems able to classify multimedia files by working on their content have become essential.

Referring to images as the multimedia element, indexing is a complex task whose first step consists in transforming the content of the image into an n-dimensional vector, namely a feature, that characterizes the image as well as possible.

Features are extracted in some way from the content of the image. Extraction methodologies can include the exploitation of local/global image descriptors and, more recently, there is great interest in neural networks for the accomplishment of this task.

Once features have been extracted, they are treated as elements of a metric space, inside which distances among them can be easily calculated. Given a query, the distances between the query feature and all the other image features are computed.

Overall, features should be built such that similar objects are neighbors in the metric space. If this is true, sorting the feature distances implies ordering the corresponding images by decreasing similarity with respect to the query.

Since in deployed scenarios we want the system to return results in fractions of a second, we cannot rely on a linear scan of the features to build up the result set. Instead, modern approaches use complex index structures that can exploit metric space properties in order to instantaneously prune large zones of the dataset, those pertaining to irrelevant images, saving a lot of computational time and speeding up the retrieval.
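As a minimal illustration of the ranking step described above, the following sketch ranks a dataset of image features by their distance from a query feature; the features are random placeholders and a plain linear scan is used for clarity, whereas a real system would rely on the index structures just mentioned:

import numpy as np

# Hypothetical database of image features and a query feature (random placeholders).
rng = np.random.default_rng(0)
database_features = rng.normal(size=(10000, 512))   # one 512-dimensional feature per image
query_feature = rng.normal(size=512)

# Distance between the query feature and every database feature (Euclidean metric).
distances = np.linalg.norm(database_features - query_feature, axis=1)

# Sorting by increasing distance orders the images by decreasing similarity to the query.
ranking = np.argsort(distances)
print(ranking[:10])   # indexes of the 10 most similar images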

1.4 Contribution of this thesis

Recent works on VQA tasks, presented in chapter 3, demonstrated that standard architectures are not able to understand relational concepts. They developed a new set of approaches and architectures that have proved to be very good at relational tasks, reaching off-the-charts accuracies.

This work does not try to ride the same wave, which would consist in searching for a novel architecture capable of performing better than the ones already presented in the literature. In fact, that direction already shows many broken records: the latest and almost unbeatable accuracy value was reached by Perez et al. [11], with 97.7% on the CLEVR dataset.

This work, instead, aims to exploit the ideas behind these architectures to move attention to a different but correlated field: Content Based Image Retrieval (CBIR), described in section 1.3.

Today many systems already exploit new technologies, in particular machine learning, to extract features from an image, a video, or an audio track. However, it is still difficult to extract information regarding the relations occurring among objects in the multimedia content. These objects could assume different forms and meanings. For example, they could be physical objects in a photo (a cat and a table), as well as different musical themes in two distinct acts of the same opera.


Thus, this work will describe how to use existing architectures, developed and tested on VQA tasks, to open the way to a relational version of the CBIR task.

In this regard, we introduce Relational CBIR (R-CBIR), the chosen name for the relational version of classic CBIR.

The development of a working prototype for the R-CBIR task, handling relational queries over real images, is the ultimate goal; this thesis only lays down the foundations that will hopefully pave the way for further investigations on this complex and still unexplored subject.

1.5 Outline of the thesis

Chapter 2 introduces the main publicly available datasets that can be used to train and test the proposed architectures on VQA tasks.

In chapter 3 the current literature is presented and the obtained accuracy results are compared. The reference task in this chapter is always VQA using the CLEVR dataset.

Chapter 4 presents all the challenges introduced when passing from the VQA task to R-CBIR.

Chapters 5 and 6 explain in detail the experiments performed on the Sort-of-CLEVR and CLEVR datasets respectively, evaluating all the results against an ad-hoc ground-truth.

Finally, chapter 7 sums up the entire work. It also introduces future directions that could expand and improve the results obtained in this thesis.

1.6 Development Tools

All the code, both newly written and found in public repositories, is written in Python. The reference version for this work is Python 3.6.


1.6.1 Deep Learning Frameworks

As of now, there are plenty of frameworks that help develop deep-learning-based applications. As interest in machine learning grows, different solutions come out. They all try to simplify some basic concepts common to all deep learning approaches, masking the underlying system and hardware capabilities. All these solutions can automatically handle backpropagation, gradient descent algorithms, and different optimization strategies, as well as hardware acceleration (e.g., CUDA on GPUs in addition to plain CPU computation).

TensorFlow, from Google, and PyTorch, from Facebook, are two of the most widespread frameworks that can easily handle complex deep learning architectures. They both allow the creation of commonly used models (CNNs, LSTMs, RNNs, etc.) with very few lines of code. As fig. 1.2 depicts, they were born in two different academic contexts and are now maintained by two of the biggest IT companies, Facebook and Google.

Figure 1.2: Tensorflow and PyTorch development history

Keras is a higher-level framework that works on top of TensorFlow or Theano and brings useful abstractions over the basic TensorFlow functionalities. In fact, it provides ready-to-use models for the major architectures.

Even if TensorFlow and PyTorch share the same potential, they differ in the way they are designed: TensorFlow builds a static computational graph that can be run only after the graph has been compiled. This paradigm is better suited for static architectures, since it is difficult to change the graph once it is built. In listing 1.1 a simple TensorFlow example is shown. Until the creation of the tf.Session, the network is only constructed and placeholder variables are defined. Placeholder variables contain no values when they are created; they are effectively filled at execution time, when the sess.run() command is issued.

Listing 1.1: Tensorflow example

import tensorflow as tf
import numpy
rng = numpy.random

# parameters
learning_rate = 0.01
training_epochs = 1000
display_step = 50

# training data
train_X = numpy.asarray([3.3, 4.4, 5.5, 6.71, 6.93, 4.168, 9.779, 6.182, 7.59, 2.167,
                         7.042, 10.791, 5.313, 7.997, 5.654, 9.27, 3.1])
train_Y = numpy.asarray([1.7, 2.76, 2.09, 3.19, 1.694, 1.573, 3.366, 2.596, 2.53, 1.221,
                         2.827, 3.465, 1.65, 2.904, 2.42, 2.94, 1.3])
n_samples = train_X.shape[0]

# tf graph input
X = tf.placeholder("float")
Y = tf.placeholder("float")

# set model weights
W = tf.Variable(rng.randn(), name="weight")
b = tf.Variable(rng.randn(), name="bias")

# construct a linear model
pred = tf.add(tf.multiply(X, W), b)

# mean squared error
cost = tf.reduce_sum(tf.pow(pred - Y, 2)) / (2 * n_samples)
# gradient descent
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)

# initialize the variables (i.e. assign their default value)
init = tf.global_variables_initializer()

# start training
with tf.Session() as sess:

    # run the initializer
    sess.run(init)

    # fit all training data
    for epoch in range(training_epochs):
        for (x, y) in zip(train_X, train_Y):
            sess.run(optimizer, feed_dict={X: x, Y: y})

PyTorch, instead, fully exploits the interactivity of the Python programming language: the graph is built on the fly, and it can be changed after every iteration, allowing for self-adjusting and more dynamic architectures. Thus, there are no placeholder variables in this case.

Listing 1.2 shows a very simple example that makes use of the nn module, which contains a set of pre-built modules such as Linear or ReLU.

Listing 1.2: PyTorch example

import torch
from torch.autograd import Variable

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# create random Tensors to hold inputs and outputs, and wrap them in Variables.
x = Variable(torch.randn(N, D_in))
y = Variable(torch.randn(N, D_out), requires_grad=False)

model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)

loss_fn = torch.nn.MSELoss(size_average=False)

learning_rate = 1e-4
for t in range(500):
    # forward pass
    y_pred = model(x)

    # compute loss
    loss = loss_fn(y_pred, y)

    # zero the gradients before running the backward pass.
    model.zero_grad()

    # backward pass: compute gradient of the loss with respect to all the learnable
    # parameters of the model.
    loss.backward()

    # update the weights using gradient descent.
    for param in model.parameters():
        param.data -= learning_rate * param.grad.data

Most of the preexisting code was written in PyTorch; moreover, the particular application area requires strong flexibility. Hence, the reference framework considered in this thesis will be PyTorch v0.3.0, the latest stable version at the time of writing.

1.6.2 Hardware accelerators

Neural networks are natively parallel architectures. In order to exploit their massive parallelism, PyTorch can be set up to work with hardware accelerator libraries so that the code can run on the GPU, which, with respect to the CPU, presents a natively highly parallelized structure.

The only bottleneck when using GPUs is the amount of VRAM (Video RAM) mounted on the graphics card. The VRAM must be large enough for all the parameters and variables of the neural network model to fit into the graphics memory.

The amount of memory required depends mainly on the network size and on the mini-batch size used.

NVIDIA made available a set of solutions for neural network developers. First of all, one of the available programs is called System Management Interface (SMI). It is able to give an insight into the overall utilization of the GPU cores, the amount of VRAM used, and even the temperature of the GPU cores. This program can be invoked by issuing the command nvidia-smi. It is helpful when training networks, since it tells us whether the utilization of the GPU is high enough and how memory-efficient our code is.

The NVIDIA stack needed to work with deep learning is composed of the CUDA Toolkit and the cuDNN framework.

NVIDIA CUDA Toolkit

The NVIDIA CUDA Toolkit provides a low-level interface to the CUDA cores, the general-purpose units on the GPU.

This toolkit enables programmers to fully exploit the parallel architecture of the GPU, for a wide range of applications whose design allows strong parallelism (deep learning included).

The NVIDIA CUDA Toolkit, in addition to drivers, includes nvcc, the compiler for CUDA code.

cuDNN

cuDNN is a framework built on top of CUDA that makes available a set of functions to work easily on deep learning.

It creates an abstraction over the general purpose CUDA framework and it implements basic deep learning functionalities such as forward and backward convolution, normalization, pooling, loss functions and so on. This way, the developer can focus on deep learning problems rather than on parallel programming.

In this work PyTorch has been installed with the CUDA v8.0 and cuDNN 6 dependencies. They are not the latest versions; on the other hand, they are very stable releases.


1.7 Hardware

Two different machines were used to develop and train the architectures presented in this work.

We sum up some of the most important hardware features of both machines in table 1.1. The two machines will be referenced throughout the document by the labels assigned to them in the table: C1 and C2.

       C1                                               C2
RAM    256 GB                                           16 GB
CPU    Intel Xeon E5-2680 v4 (x2), 28 threads each      Intel Core i7 3370K, 8 threads
GPU    Tesla K80 (x2), 24 GB VRAM each                  Nvidia GTX 1060, 6 GB VRAM

Table 1.1: Hardware features of the available machines

As can be noticed, machine C1 is much more powerful than C2: its major strength lies in its GPU arrangement, which is a critical factor when dealing with deep learning using CUDA cores. Thus, C1 will be mainly used for training, while C2 will be employed for development purposes and for performing simple tests before deploying the solution to the C1 machine.

In order to deploy the code, a Docker environment has been set up. Docker exploits containers, an operating-system-level virtualization method that enables applications inside the container to run in a controlled and fully portable environment.

This way, we are sure that there are no environmental differences when moving the developed applications from C2 to C1, hence eliminating annoying platform incompatibilities.


Chapter 2

Datasets

In order to study the internal behavior of the architectures that try to solve VQA problems, some ad-hoc datasets have been proposed. These kinds of datasets usually contain images, a set of questions associated with each image, and a candidate answer for each question.

2.1 Available Datasets for VQA

There exist datasets composed of real photos, usually built by collecting pictures from the web. VQA by Antol et al. [3] and GuessWhat?! by Vries et al. [18] fall within this group.

Other datasets, instead, such as NLVR by Suhr et al. [17] and CLEVR by Johnson et al. [7], are computer generated on purpose.

2.1.1 Non-synthetic datasets

GuessWhat?! was built in order to teach a machine to play the namesake game GuessWhat, a game in which the first player thinks of an object present in the scene and the second (in this case, the robot) tries to guess that object by asking questions that can be answered only with binary yes/no answers (fig. 2.1). Notice how this game covers almost all the challenges exhibited by VQA problems.


Figure 2.1: Example of images/questions/answers in GuessWhat?! dataset

Even if fully representative of the real world, such datasets are very complex to learn. In particular, since the pictures are real photos, there is not much control over the biases that could show up. Also, it is difficult to trace back the reasoning process exhibited by the target architecture, since the problem can hardly be split into independent sub-problems. In fact, there are usually intermediate steps (parsing the question, perceiving the image, understanding the relations, predicting a probability distribution over the answers) that would need additional intermediate training data in order to be treated separately. In order to overcome these limitations, some synthetic datasets have been introduced.

2.1.2 Synthetic datasets

NLVR is a set of 2D images of objects with different colors and shapes. Questions are about relations among the positions, colors and shapes of multiple objects in the same scene (fig. 2.2). In this dataset, questions are sentences that must be evaluated as true or false. Hence, the outcome is expected in a domain that contains only two values.

CLEVR, introduced by Johnson et al. [7], was used by Santoro et al. [16] for training and testing their work. It is composed of a set of 3D rendered images. The images appear very photorealistic; this peculiarity makes this dataset of great interest, being a middle ground between NLVR, composed of 2D images, and VQA, built with real photographs.


Figure 2.2: NLVR example. Sentence: There is exactly one black triangle not touching any edge; Expected answer: true

In this work we will mainly concentrate on CLEVR. Thus, we will introduce the CLEVR dataset and one of its simpler variations (Sort-of-CLEVR) in more detail in the following sections.

2.2 CLEVR

CLEVR is a synthetic dataset composed of 3D rendered scenes. There are 100,000 rendered images, subdivided among the training (70,000), validation (15,000) and test (15,000) sets. The total number of questions is around 850,000, again split among training, validation and test.

CLEVR stands for Compositional Language and Elementary Visual Reasoning. The peculiarity of its name does not come only from the acronym; rather, the name of this dataset was also forged from its assonance with a bizarre character that lived in the early 1900s, a horse named Clever Hans. It was said that Hans was capable of performing simple arithmetic operations. Careful observation revealed that Hans was correctly answering questions by simply reacting to imperceptible body cues from the people attending his shows. Hans is a perfect example of what an architecture may try to exploit in order to predict the correct answer. In Hans's case, there was no reasoning process running; he was simply very good at reading subtle movements of the crowd.

This anecdote explains quite well the main goals of the CLEVR dataset:

• trying to understand how an architecture internally pursues the reasoning process;
• avoiding data biases so that architectures have no chance to behave like Hans.


In order to reach such objectives, CLEVR has been designed in a very careful way.

The main concept is the scene. A scene contains several simple-shaped objects, with mixtures of colors, materials and sizes. There are cubes, spheres and cylinders, each of which can have a color chosen among eight, can be big or small, and can be made of one of two different materials, metal or rubber. The scene is fully and uniquely described by a scene graph, which formally describes all the links between objects.

The question is formulated in the form of a functional program. This is a declarative formulation, in which elementary functions such as count(), exists(), filter_size() are connected together to form a query. These basic functions can take attributes as input, e.g. filter_size(small).

This way, the query is again formalized in the form of a graph (fig. 2.3).

Figure 2.3: Example of CLEVR functional program to encode the question: ”How many cylinders are in front of the small thing and on the left side of the green object?”

The answer to a question, represented by its functional program, is calculated simply by executing the functional program on the scene graph.
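As a toy illustration of this execution model, the following sketch runs a simplified functional program on a hand-made scene description; the scene format and the helper functions are illustrative assumptions, not CLEVR's actual schema (which also encodes spatial relations as scene graph edges):

# Toy scene: a list of objects with their attributes.
scene = [
    {"shape": "cube", "size": "large", "color": "green", "material": "metal"},
    {"shape": "sphere", "size": "small", "color": "red", "material": "rubber"},
    {"shape": "cylinder", "size": "small", "color": "green", "material": "rubber"},
]

# Elementary functions composing the functional program.
def filter_size(objects, size):
    return [o for o in objects if o["size"] == size]

def filter_color(objects, color):
    return [o for o in objects if o["color"] == color]

def count(objects):
    return len(objects)

# Functional program for "How many small green things are there?"
answer = count(filter_color(filter_size(scene, "small"), "green"))
print(answer)   # 1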

In the end, the natural language questions and the images must be generated. To this end, the scene graphs are rendered into photorealistic 3D scenes using Blender, a free 3D modeling software (fig. 2.4); the functional programs, instead, are converted into natural language expressions by compiling some templates embedded in the dataset and written in English. Compiling a template means filling the "blank spaces" of the question template with the right shape/material/color/size attributes.
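A hypothetical sketch of what compiling a text template could look like is the following; the template string and the attribute values are illustrative, not taken from the actual CLEVR templates:

import random

template = "How many {size} {color} {material} {shape}s are there?"
attributes = {
    "size": ["small", "large"],
    "color": ["gray", "red", "blue", "green", "brown", "purple", "cyan", "yellow"],
    "material": ["rubber", "metal"],
    "shape": ["cube", "sphere", "cylinder"],
}

# Fill the "blank spaces" of the template with randomly chosen attribute values.
question = template.format(**{k: random.choice(v) for k, v in attributes.items()})
print(question)   # e.g. "How many small cyan metal cubes are there?"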

In order to minimize the biases, all the questions were divided into question families, each of which contains a single program template and a set of four different text templates (four distinct ways to express the query in natural language).

The authors of CLEVR also designed a question generation heuristic able to avoid the creation of ill-posed questions: a question is ill-posed if it can be answered on a certain image only with sentences like "None" or "I don't know". For example, considering fig. 2.4, the question "What is on the left side of the big metal cube?" would be ill-posed, since there is nothing to the left of the specified cube.

Figure 2.4: Example of CLEVR image with associated question: ”Are there fewer metallic objects that are on the left side of the large cube than cylinders to the left of the giant shiny block?” Expected answer:”yes”

The CLEVR dataset gives us far more control over the learning phase than the other datasets present in the literature.

The information in each sample of the dataset is complete and self-contained. This means that no common-sense awareness is needed in order to correctly answer the questions. Answers can be given simply by understanding the question and reasoning exclusively on the image, without needing external concepts.

Also, CLEVR simplifies the learning phase, since architectures can, at least in the initial phases, leave out the perception modules and concentrate only on the reasoning ones. This is possible since CLEVR makes available not only unstructured data (images and natural language questions), but also the functional programs and the scene graphs responsible for the generation of the final unstructured data. If these formal descriptions were not available, they would have to be recovered directly from the images and questions by means of some embedding technique (CNNs, LSTMs).

The problem is that dealing with perception and reasoning at the same time increases the problem size, and the learning process gets more difficult and subtle. Instead, since this information is available in the dataset, the learning process can be better supervised. Therefore, the chances for the target architectures to behave like Hans the horse are very low.

2.3 Sort-of-CLEVR

Sort-of-CLEVR was introduced by Santoro et al. [16] and consists of a simplification of the original CLEVR dataset. It was created mainly for testing and debugging architectures designed to work with CLEVR. Thus, this dataset is composed of simpler building blocks with respect to the full CLEVR. Images, in fact, are simpler than the 3D renders provided with the original dataset; they instead carry simple 2D scenes, consisting of a certain number of 2D shapes (in the paper, 6 shapes per image).

Shapes can be circles or squares and they can have different colors (fig. 2.5). Every object, however, is uniquely identified by its color.

Figure 2.5: Example of images in Sort-of-CLEVR dataset

Differently from the base dataset, this one splits the questions into two different subsets:

• relational questions, asking for the color or shape of the object farthest from or nearest to the given one; example: What is the shape of the object that is farthest from the gray object?


• non-relational questions, involving specific attributes that characterize a single object, in particular its shape or its absolute position with respect to the overall scene; example: What is the shape of the gray object?

Questions are directly encoded into 11-dimensional vectors, so there is no need for LSTM modules processing natural language.

Architectures handling this kind of images and questions can be trained faster: the network has a smaller complexity since the inputs are given in a simpler form (question embeddings natively available, very simple 2D images). Even if this dataset seems extremely simple, it can help spot architectural problems that prevent the network from thinking in a relational way. Moreover, this dataset can help measure the ability of architectures to reason on non-relational aspects, giving an overall measure of the robustness of the solution.
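To make the structure of this dataset more concrete, the following is a hypothetical sketch of how a Sort-of-CLEVR-like sample (a scene and the ground-truth answer to one relational question) could be generated; the colors, coordinates, and number of objects are illustrative assumptions:

import math
import random

# Six uniquely colored 2D shapes placed at random positions in the unit square.
COLORS = ["gray", "red", "blue", "green", "orange", "yellow"]
scene = [
    {"color": c,
     "shape": random.choice(["circle", "square"]),
     "pos": (random.uniform(0, 1), random.uniform(0, 1))}
    for c in COLORS
]

def distance(a, b):
    return math.hypot(a["pos"][0] - b["pos"][0], a["pos"][1] - b["pos"][1])

# Relational question: "What is the shape of the object farthest from the gray object?"
anchor = next(o for o in scene if o["color"] == "gray")
farthest = max((o for o in scene if o is not anchor), key=lambda o: distance(anchor, o))
print(farthest["shape"])   # ground-truth answer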


Chapter 3

Related work

Important works on visual question answering began spreading in 2015. Even if impressive results were obtained only in 2017, it is worth citing the most important attempts. We will concentrate on works that involve the CLEVR dataset.

3.1 Early techniques

Johnson et al. [7] carried out an important performance analysis, comparing the most relevant architectures proposed starting from 2015 on the newer CLEVR dataset. The number of solutions introduced in a two-year temporal window is quite large; here we will report only the most representative ones.

• The Q-type mode was a model built by the CLEVR authors in order to demonstrate the low accuracy obtained by architectures that try to exploit data biases instead of performing an authentic reasoning process. In particular, this architecture does not even look at the image; rather, it tries to predict the most frequent training set answer for each question type. It considers only the questions, so it has no other way to operate than exploiting implicit question biases.

• LSTM is a model based on an architecture found in Antol et al. [3]. Like the Q-type mode, it does not look at the image; it works only on the questions. In particular, this architecture uses an LSTM to learn word embeddings and predict a distribution over the possible answers.


• CNN + BoW was explained in Zhou et al. [21]. It considers the question as a bare bag of words. The BoW set is built by collecting the 1,000 most frequent words from all training set questions. The authors discovered that the first three words of the question have a high correlation with the answer. Hence, they augmented the 1,000-word set with three other sets of 10 words each, containing the 10 most frequent words for each of the three buckets. Images are processed by means of a convolutional neural network. The resulting visual and linguistic embeddings are concatenated and sent to a softmax layer that predicts the probability distribution over the answers.

• CNN + LSTM is very similar to CNN + BoW, except that the linguistic processing layer is built with an LSTM instead of a simple BoW model.

• CNN + LSTM + SA: Yang et al. [20] tried to evolve the idea behind CNN + LSTM, introducing two Stacked Attention layers in order to condition the processing of the image by means of the query. Therefore, differently from the CNN + LSTM model, the Stacked Attention layers replace the raw concatenation of embeddings with a simple but quite effective reasoning module. The first attention is obtained by combining embeddings from the linguistic pipeline with some of the features from the visual pipeline. The output is again combined with other image features to form the second attention. The probabilities over the answers are determined by a final softmax layer (fig. 3.1). These two layers generate masks over the image (the attention maps) whose purpose is to highlight the areas of the image mainly involved in the reasoning process, given the particular question (fig. 3.2).

The results of the performance tests on these architectures are reported in table 3.1. The first two methods, inserted in the comparison to demonstrate their ineffectiveness, cannot even reach 50% accuracy. This result demonstrates how difficult it is for architectures to exploit data biases when using the CLEVR dataset.

The model that performs best is CNN + LSTM + SA, thanks to its attention layers. It shows very good performance when dealing with questions about attributes, and overall it has better percentages over all query types with respect to the other architectures.


Figure 3.1: CNN + LSTM + SA architecture

Figure 3.2: CNN + LSTM + SA: on the left, the original image; in the middle, the output from the first attention layer; on the right, the output from the second attention layer

Nevertheless, none of the architectures was able to correctly answer questions involving attribute comparisons.

Model             Overall   Count   Exist   Compare Integer   Query Attribute   Compare Attribute
Q-type baseline   41.8      34.6    50.2    51.0              36.0              51.3
LSTM              46.8      41.7    61.1    69.8              36.8              51.8
CNN+LSTM          52.3      43.7    65.2    67.1              49.3              53.0
CNN+LSTM+SA       68.5      52.2    71.1    73.5              85.3              52.3
Human             92.6      86.7    96.6    86.5              95.0              96.0

Table 3.1: Accuracy over the test set (in percentage) of the various architectures on the CLEVR dataset


All of these architectures are still largely outperformed by humans, as highlighted by the last row of the table.

3.2 State of the art techniques

The latest architectures show an impressive gain in accuracy. Since many solutions have appeared in the literature, a simple taxonomy is built in order to group them according to the approach they use. In particular, three different paths were followed:

• the specialized architectures approach;
• the compositional networks approach;
• the conditioning networks approach.

We will describe the main ideas behind each one of these approaches.

3.2.1 Specialized architectures approaches

This kind of methodology is followed by Santoro et al. [16] and Raposo et al. [14], both from the DeepMind team. The major idea behind this approach consists in the introduction of an ad-hoc architecture intrinsically able to perform relational reasoning.

These publications introduce a particular type of network called Relational Network (RN). This kind of architecture is specialized in comparing pairs of objects. The function that analytically describes this network is the following:

r = f_\phi\big(a\big(g_\psi(o_1, o_2),\ g_\psi(o_1, o_3),\ \ldots,\ g_\psi(o_{m-1}, o_m)\big)\big)    (3.1)

fφ and gψ are parametric functions whose parameters can be learned during the training phase. For example, they can be multi-layer perceptron (MLP) networks.

a is a commutative and associative function that combines the outputs of all the evaluations of gψ over all the different pairs of objects.


Commutativity holds since the order of the permutations among all objects is not important in this context. Making a commutative, the network does not have to learn by itself the invariance with respect to permutations, a task whose size grows quadratically with the number of objects in the scene. Instead, this important notion is hard-coded in the architecture itself.

The choice made in both publications was to use a simple summation operator as the a function. This way, eq. (3.1) simplifies as follows:

r = f_\phi\Big(\sum_{i,j} g_\psi(o_i, o_j)\Big)    (3.2)

The overall architecture proposed by Santoro et al. [16] is shown in fig. 3.3. They used a CNN to extract deep features from the image and an LSTM to produce question embeddings. The feature map output by the CNN is a tensor of order 3, with indexes (i, j, k). The first two dimensions are strictly related to the physical space of the scene: the indexes i, j correspond to the two spatial dimensions of the input image. An element of the tensor with indexes (i, j) is defined by the authors as an object. Hence, every object is described by means of K different features.

Differently from the basic RN definition described by eq. (3.2), the implemented relational network has three inputs: the pair of objects being related and the query embedding. This way, the link between the two compared objects is conditioned by the query, which plays a central role in VQA problems. Hence, the resulting analytical expression for the RN module is the following:

r = f_\phi\Big(\sum_{i,j} g_\psi(o_i, o_j, q)\Big)    (3.3)

where q is the question embedding vector obtained by an LSTM module. The authors wanted to underline the invariance of this architecture with respect to the semantics we give to the objects in input to the RN module. In this case an object is defined as a vector of size K output by a CNN, but the RN architecture is open to different definitions of the input objects. Also, fig. 3.3 shows how the perception and reasoning modules are separated from each other. This enhances the flexibility of this solution and results in a higher modularity of the overall architecture.

Figure 3.3: The overall RN architecture as explained by Santoro et al. [16]
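As a concrete illustration of eq. (3.3), a minimal PyTorch sketch of the conditioned RN head could look as follows; the layer sizes and the way object pairs are enumerated are illustrative assumptions, not the exact configuration used by Santoro et al. [16]:

import torch
import torch.nn as nn

class RelationalModule(nn.Module):
    """Minimal RN head: g (an MLP) is applied to every pair of objects,
    conditioned on the question embedding q; the outputs are summed
    (the commutative aggregation) and passed through f (another MLP)."""
    def __init__(self, obj_dim, q_dim, hidden=256, n_answers=28):
        super().__init__()
        self.g = nn.Sequential(
            nn.Linear(2 * obj_dim + q_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.f = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_answers),
        )

    def forward(self, objects, q):
        # objects: (batch, n_objects, obj_dim); q: (batch, q_dim)
        b, n, d = objects.size()
        o_i = objects.unsqueeze(2).expand(b, n, n, d)            # first object of each pair
        o_j = objects.unsqueeze(1).expand(b, n, n, d)            # second object of each pair
        q_rep = q.unsqueeze(1).unsqueeze(1).expand(b, n, n, q.size(-1))
        pairs = torch.cat([o_i, o_j, q_rep], dim=-1).view(b * n * n, -1)
        relations = self.g(pairs).view(b, n * n, -1).sum(dim=1)  # sum over all pairs
        return self.f(relations)

# usage: e.g. 8 objects described by 256 features each, 128-dimensional question embedding
rn = RelationalModule(obj_dim=256, q_dim=128)
logits = rn(torch.randn(4, 8, 256), torch.randn(4, 128))

In the setting described above, the objects would be the K-dimensional elements of the CNN feature map and q the LSTM question embedding.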

Results are shown in table 3.2. On the CLEVR dataset, this architecture obtains super-human performance in all the different tasks. The count and compare attribute tasks, difficult to handle for all previous solutions, gained about 40% in performance with respect to the previous best model, CNN+LSTM+SA.

Model             Overall   Count   Exist   Compare Integer   Query Attribute   Compare Attribute
Q-type baseline   41.8      34.6    50.2    51.0              36.0              51.3
LSTM              46.8      41.7    61.1    69.8              36.8              51.8
CNN+LSTM          52.3      43.7    65.2    67.1              49.3              53.0
CNN+LSTM+SA       68.5      52.2    71.1    73.5              85.3              52.3
Human             92.6      86.7    96.6    86.5              95.0              96.0
CNN+LSTM+RN       95.5      90.1    97.8    93.6              97.9              97.1

Table 3.2: Accuracy over the test set (in percentage) of the RN architecture with respect to the others, on the CLEVR dataset


3.2.2 Compositional networks approaches

These kinds of architectures try to explicitly model the reasoning process by dynamically building a graph that states which operations must be carried out, and in which order, to obtain the right answer.

Usually these architectures are internally split into two different subcomponents:

• a generator network, having the task of producing an execution graph based on the question embeddings. The generated graph should mimic the functional graph embedded in the CLEVR dataset;

• an execution network, which executes the graph produced by the generator network, taking the image features as input and outputting the answer.

This method is explored in Hu et al. [6] and Johnson et al. [8]. Both methods leverage the ideas introduced in Andreas et al. [2], where the concept of Neural Module Networks (NMN) is explored. These kinds of networks are built by connecting together simple elementary neural network modules.

In this case, these simple modules implement functions that model the basic question operations, such as count(), exists(), compare(). Hence, the generated graph basically constitutes the functional graph describing the input question, that is, the sequence of elementary operations to be carried out on the image in order to get the correct answer.

The work by Hu et al. [6] developed an end-to-end system that is able to learn all the internal parameters of both the generator and execution networks without needing any data other than images, questions and answers. The architecture is shown in fig. 3.4.

The generator network is here called the Layout Policy network. It generates an execution layout, outputting a set of elementary modules with their interconnections and a set of question attentions for each module. Attentions are sets of words from the query that are considered important for that particular module. Hence, the attentions constitute the inputs for the elementary modules once they are inserted into the execution network.


Figure 3.4: The overall architecture as explained by Hu et al. [6]

The authors paid specific attention to the training process: they tried to initialize the weights of the policy network by making it learn from policies defined by human experts. Once initialized, the overall network is fine-tuned on the actual dataset. However, they also tried to train the whole network from scratch without the expert policy, as well as training the network while keeping the weights of the layout network fixed to the human expert policy.

The work by Johnson et al. [8] is based on a similar idea; their generator network is a sequence-to-sequence network built with a two-layer LSTM. It takes as input a natural language question and outputs a sequence of operations describing all the important elementary operations to be carried out on the image in order to answer correctly. The predicted program takes the form of a serialized graph. The architecture is shown in fig. 3.5.

The execution engine contains all the modules chosen by the generator network with their links and, as in Hu et al. [6], the input to the network is given by deep features extracted from the image by means of a CNN.

Also in this case the authors paid attention to the training steps. They realized that the CLEVR dataset has all the data needed to train the generator and the executor separately: the generator can be trained using natural language questions and the associated functional programs, while the executor can be trained using images, functional programs and answers. The problem with this approach is that only the CLEVR dataset offers all these facilities; other datasets do not contain functional programs.


Figure 3.5: The overall architecture as explained by Johnson et al. [8]

To overcome this issue, they tried a middle solution: they trained the modules separately with a subset of 9K or 18K functional programs; then, they fine-tuned the entire model end-to-end with the remaining samples of the dataset, without using functional programs.

The results of these architectures are reported in table 3.3.

Model                       Overall   Count   Exist   Compare Integer   Query Attribute   Compare Attribute
CNN+LSTM+SA                 68.5      52.2    71.1    73.5              85.3              52.3
[6] (fine-tuned approach)   83.7      68.5    85.7    83.7              90.0              88.7
Human                       92.6      86.7    96.6    86.5              95.0              96.0
CNN+LSTM+RN                 95.5      90.1    97.8    93.6              97.9              97.1
[8] (18K prog.)             95.4      90.1    95.3    96.2              97.4              98.0
[8] (700K prog.)            96.9      92.7    97.1    98.6              98.2              98.9

Table 3.3: Accuracy over the test set (in percentage) of the Hu et al. [6] and Johnson et al. [8] architectures on the CLEVR dataset

We report only the fine-tuned result (the best result) for the first approach, Hu et al. [6]. It does not beat the CNN+LSTM+RN approach, while the second, Johnson et al. [8], beats the previous best model, gaining an overall 1.5% over it. Even when trained with only 18K programs, the second approach keeps an enviable performance, confirming the remarkable power of this architecture.

Differently from the specialized architectures solution, this method does not introduce an ad-hoc structure for relational reasoning, demonstrating that it is not necessary to introduce a new design to deal with this problem.

3.2.3 Conditioning network approaches

These methods are primarily based on the visual pipeline. The question, however, is still taken into consideration, since question embeddings are used to condition the network responsible for processing the image. By conditioning the visual pipeline, some parameters embedded in the layers of the image processing pipeline are related in some way to the question embedding. The architectures by Perez et al. [12, 11] introduce this methodology. Being from the same authors, the architectures are quite similar, but they condition the visual network in different ways.

[12] conditions the visual pipeline by acting on the batch normalization layers embedded in the residual blocks that build up the visual network (fig. 3.6).

A standard batch normalization layer is described by the function

BN(F_{i,c,h,w} \mid \gamma_c, \beta_c) = \gamma_c \, \frac{F_{i,c,h,w} - \mathrm{E}[F_{\cdot,c,\cdot,\cdot}]}{\sqrt{\mathrm{Var}[F_{\cdot,c,\cdot,\cdot}] + \epsilon}} + \beta_c    (3.4)

where F_{i,c,h,w} is the activation of the i-th sample of the mini-batch on the c-th feature map at spatial location (h, w). Usually γ and β are parameters learned from the data.

In this case, instead, γ and β play the role of conditioning parameters: they are computed by means of certain functions operating on the question embeddings:

\gamma_{i,c} = f_c(e_i), \qquad \beta_{i,c} = h_c(e_i)    (3.5)


Figure 3.6: The overall architecture as explained by Perez et al. [12]

where f_c and h_c are learnable functions taking as input e_i, the question embedding for the i-th question sample in the dataset. The authors defined the f_c and h_c functions as affine transformations operating with a matrix W and a bias b, both learned from the data in an end-to-end training process:

(\gamma_{i,\cdot}, \beta_{i,\cdot}) = W e_i + b    (3.6)

Every residual block in the architecture contains two BN layers, and there are N different residual blocks, hence there are 2N distinct W and b parameters to be learned.

Differently from the CNN+LSTM+RN model, question embeddings are generated by means of a GRU (Gated Recurrent Unit) instead of an LSTM module. They are functionally almost equivalent, even if their internals are quite different: the GRU uses no internal memory cell like the LSTM does. Also, the GRU has fewer parameters, so it is simpler to train, but it is not as reliable as the LSTM when it comes to handling long and complex data.


The FiLM architecture [11], instead, does not act on a preexisting module; rather, it introduces a new module called FiLM, described by the following equation:

\mathrm{FiLM}(F_{i,c} \mid \gamma_{i,c}, \beta_{i,c}) = \gamma_{i,c} \, F_{i,c} + \beta_{i,c}    (3.7)

With respect to [12], eq. (3.7) substitutes eq. (3.4). The architecture is the same except for the residual block implementation, which is modified as shown in fig. 3.7.

Figure 3.7: Residual block implementation by Perez et al. [11]
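A minimal PyTorch sketch of this conditioning mechanism (eqs. (3.6) and (3.7)) could look as follows; the layer sizes and the single-linear-layer generator are illustrative assumptions:

import torch
import torch.nn as nn

class FiLMGenerator(nn.Module):
    """Maps a question embedding e_i to per-channel (gamma, beta) pairs,
    i.e. the affine transformation (gamma, beta) = W e_i + b of eq. (3.6)."""
    def __init__(self, q_dim, n_channels):
        super().__init__()
        self.proj = nn.Linear(q_dim, 2 * n_channels)

    def forward(self, e):
        gamma, beta = self.proj(e).chunk(2, dim=-1)
        return gamma, beta

def film(feature_map, gamma, beta):
    """FiLM(F | gamma, beta) = gamma * F + beta, applied channel-wise (eq. (3.7))."""
    gamma = gamma.unsqueeze(-1).unsqueeze(-1)   # (batch, C) -> (batch, C, 1, 1)
    beta = beta.unsqueeze(-1).unsqueeze(-1)
    return gamma * feature_map + beta

# usage: condition a (batch, C, H, W) convolutional feature map on a question embedding
question_embedding = torch.randn(4, 128)       # hypothetical GRU question embeddings
features = torch.randn(4, 64, 14, 14)          # hypothetical convolutional feature maps
generator = FiLMGenerator(q_dim=128, n_channels=64)
gamma, beta = generator(question_embedding)
conditioned = film(features, gamma, beta)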

The results in table 3.4 show that no significant performance improvements are brought by FiLM with respect to the Conditional Batch Normalization approach. However, their performances reach the highest values among all previous architectures, with only about 2% error.

Even though the accuracy is very promising, there are still some problems the authors have spotted. In particular, they noticed that the error increases with the query length. This behavior is not shown by the architectures probed by Johnson et al. [7].

Also, they spotted critical logical inconsistencies suggesting that the level of understanding, regardless of the accuracy, is still limited: the architecture is able to correctly count two different kinds of objects having distinct sizes and colors, but it fails when deciding whether there is the same quantity of both of these objects. This is quite a critical problem, since a human, though having on average a worse accuracy than this architecture, would not fail this way; their choice would more likely remain consistent, even in case of a mistaken evaluation.


Model                Overall   Count   Exist   Compare Integer   Query Attribute   Compare Attribute
Human                92.6      86.7    96.6    86.5              95.0              96.0
CNN+LSTM+RN          95.5      90.1    97.8    93.6              97.9              97.1
[8]                  96.9      92.7    97.1    98.9              98.2              98.6
CNN+GRU+CBN [12]     97.6      94.5    99.2    93.8              99.2              99.0
CNN+GRU+FiLM [11]    97.7      94.3    99.1    96.8              99.1              99.1

Table 3.4: Accuracy over the test set (in percentage) of the Johnson et al. [8] and Perez et al. [12, 11] architectures on the CLEVR dataset

The appendix work by the FiLM authors is remarkably interesting. They tested their system on a particular version of the CLEVR dataset, called CoGenT. The CoGenT dataset comprises all the samples already present in the standard CLEVR dataset; differently from it, the samples are subdivided into two macro-categories, namely Condition A and Condition B:

• in Condition A all cubes are gray, blue, brown or yellow, while all cylinders are red, green, purple or cyan;

• in Condition B, cubes and cylinders have swapped colors with respect to Condition A.

The performances obtained by training the model with these different sets of samples revealed some problems: in particular, the accuracy on Condition A is much higher than on Condition B, probably due to some data bias the model exploits to answer (for example, it caught that cubes are never cyan). The authors then approached the problem differently, introducing a zero-shot generalization method: they found out that a linear combination of the FiLM parameters γ and β of two different questions, obtained by training on Condition A, could output the values of γ and β that make the model answer a question only Condition B could handle. In other words, the questions "How many cyan spheres are there?" and "How many brown cubes are there?", trained on Condition A, gave the answer to the question "How many cyan cubes are there?", which is a question pertaining to Condition B. The accuracy was tested on Condition B, and a 3.2% gain in performance was observed with respect to the standard training.

This result demonstrates how generalization can be handled in an efficient and modular way: knowledge coming from only half of the dataset is used to answer questions related to the other half, simply by exploiting already learned parameters and combining them linearly.
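A purely hypothetical sketch of this idea follows (the function name and the combination weights are assumptions for illustration; the original work derives the parameters from the trained question encoder):

def combine_film_params(gamma_a1, beta_a1, gamma_a2, beta_a2, w1=0.5, w2=0.5):
    # Linearly combine the FiLM parameters predicted for two Condition-A questions
    # to obtain parameters usable for an unseen Condition-B question.
    gamma_b = w1 * gamma_a1 + w2 * gamma_a2
    beta_b = w1 * beta_a1 + w2 * beta_a2
    return gamma_b, beta_b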

Finally, the FiLM authors took into consideration an even more challenging version of the CLEVR dataset, called CLEVR-Humans. This dataset is composed of the same images already present in the standard CLEVR dataset; however, the questions were written by humans and designed to be non-trivial to answer even for a smart robot.

Examples of image-question-answer triplets from the CLEVR-Humans dataset are shown in fig. 3.8.

Figure 3.8: CLEVR-Humans dataset examples

Questions are difficult because they involve multiple objects and aggregation concepts, as well as complex visual tasks such as recognizing the shapes of objects reflected by the metallic ones.

The FiLM architecture obtained state-of-the-art results, as shown in table 3.5. Basically, the CLEVR-Humans dataset was used to fine-tune the model already trained on the basic CLEVR dataset. The huge accuracy gain (about 20%) obtained when fine-tuning on CLEVR-Humans can be justified only by supposing that complex high-level notions and reasoning schemes had already been learned and internalized by the architecture during the basic training.


Model            Train on basic CLEVR   Train on CLEVR, fine-tune on CLEVR-Humans
LSTM             27.5                   36.5
CNN+LSTM         37.7                   43.2
CNN+LSTM+SA      50.4                   57.6
CNN+GRU+FiLM     56.6                   75.9

Table 3.5: Accuracy obtained with CLEVR-Humans dataset among different architectures

This result demonstrates the high level of reasoning capability the FiLM architecture is able to reach, and underlines its overall robustness.


Chapter 4

From VQA to R-CBIR

4.1 Introduction

Transforming a VQA problem into an R-CBIR task is challenging. The R-CBIR problem deriving from the VQA task could be formulated as follows: given a query, in the form of a question plus an answer, the output is the set of all the images that potentially make the query assertion true. Figure 4.1 shows an example emulating a real application scenario.

Figure 4.1: Example of R-CBIR deriving from a VQA scenario. Query: "Where is the ruler with respect to the pen? In front of it"

However, even this simple task translation immediately generates a lot of challenges. The most critical one consists in retrieving images by querying the system with text: this would be a task for a complex system, able to manage different kinds of multimedia data.

For these reasons, in the considered scenario, R-CBIR will consist in the task of finding images containing relations similar to the ones carried by a query image given as input. This way, both the inputs and the outputs of the system are images, which therefore remain the only kind of multimedia data we will deal with.

Application scenarios deriving from such a task could increase the understanding capabilities of modern information retrieval systems, since current technologies work with features that exhibit no relational understanding of the content. Future image search engines could retrieve similar images by going a step further with respect to current approaches, which in early times consisted of global/local color features and, more recently, of deep-learning-based features with high semantic understanding. Relational features, in fact, would be the result of a high-level scene understanding that, besides the concepts, could fully describe the relations occurring among the entities involved in the image.

For example, two city skylines could be compared not only by looking at shapes, corners and architectural elements, but also by taking into consideration the three-dimensional arrangement of buildings and skyscrapers, which creates a spatial pattern that only a relational engine could capture.

This work, however, will concentrate on the CLEVR dataset, as it bounds the complexity of the R-CBIR task and provides tools to accurately monitor system performance. These preliminary studies could then be refined and improved in the future, in order to be applied to real application scenarios.

4.2 Challenges

One of the goals of the CBIR task consists in viewing images as n-dimensional features arranged in a metric space. By defining a distance function, we are able to define a similarity criterion among the images in the dataset, as explained in section 1.3.
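As a minimal sketch of such a retrieval step (Euclidean distance is only one possible choice; the variable names and shapes are assumptions):

import torch

def retrieve(query_feat, db_feats, k=10):
    # query_feat: (D,) feature of the query image; db_feats: (N, D) dataset features
    dists = ((db_feats - query_feat) ** 2).sum(dim=1)   # squared Euclidean distance to every image
    return torch.argsort(dists)[:k]                     # indices of the k most similar images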


Moving from CBIR to R-CBIR, the similarity criterion is changed as follows: images are similar if they exhibit almost the same relations among the elements inside the scene. This way, we constrain the system to place the features of images embedding similar relations close to each other in the metric space.

It is clear that building such a feature requires a strong relational behavior. So, in principle, the architectures presented in chapter 3 should already be able to deal with this task.

However, these architectures never try to relationally cluster images. Instead, since the task is VQA, they all try, although through different techniques, to condition the images with the questions in order to produce the answers.

For the R-CBIR task, instead, we have to fully understand the relations occurring in the scene without relying on question conditioning; otherwise, multiple features would be necessary for every image, one for each possible question that can be asked about that image. Although the latter option would allow the plain reuse of existing VQA architectures, the overall solution would be absolutely non-scalable.

The architecture that seems to best fit these needs is the one introduced by DeepMind in Santoro et al. [16] and summarized in section 3.2.1. In the following section we will describe the architecture changes needed to make relational feature extraction possible.

The goal is to train the modified architecture so that its accuracy reaches values near the ones measured with the original architecture. We will probably lose some overall performance but, on the other hand, we obtain an architecture from which we can extract relational features for the images.

4.3 Architecture modifications

In order to modify the RN architecture (section 3.2.1) to extract relational features from images, the key element is to make the RN module work only with the images, without conditioning it with questions.


In other words, the major adjustment consists in changing the relational module behavior from the one expressed by means of eq. (3.3), bringing it back to the original RN module formulation, eq. (3.2) rewritten here:

r = f_{\phi}\Big( \sum_{i,j} g_{\psi}(o_i, o_j) \Big) \qquad (4.1)
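A minimal sketch of eq. (4.1), assuming the visual objects are stacked in a single tensor and that g and f are small MLPs defined elsewhere, could look as follows; for an 8×8 feature map this produces 64² = 4096 ordered pairs per image:

import torch

def relational_sum(objects, g, f):
    # objects: (N, D) visual objects; g, f: callables implementing the two MLPs
    n = objects.size(0)
    x_i = objects.unsqueeze(0).expand(n, n, -1)             # (N, N, D)
    x_j = objects.unsqueeze(1).expand(n, n, -1)             # (N, N, D)
    pairs = torch.cat([x_i, x_j], dim=-1).view(n * n, -1)   # all N^2 ordered pairs
    return f(g(pairs).sum(dim=0))                           # sum over pairs, then f(.)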

However, in order to obtain almost the same accuracies as the original architecture, the question must condition the pipeline at some stage.

We realized that the question cannot simply be concatenated after the RN module, since the output of the RN now lacks important conditioning information. The solution is to insert another module h(·) after the g(o_i, o_j) computation and immediately before the sum aggregation. The question is concatenated to the output of the g(·) function, before being processed by h(·).

Thus, the new network equation becomes the one expressed in eq. (4.2):

r = f_{\phi}\Big( \sum_{i,j} h\big( g_{\psi}(o_i, o_j), q \big) \Big) \qquad (4.2)

and the overall new architecture is shown in fig. 4.2

Figure 4.2: Changes to RN architecture for features extraction

Using this solution, we constrain the network to learn relational concepts without considering the questions, at least during the first stages, before the evaluation of the h(·) function.


Hence, the relational features for the images will be extracted at the output of the g(·) function. Extracting features from this layer basically means employing as features the visual activations elaborated by the CNN and the g(·) function. Listing 4.1 exposes the core of the modified architecture. It is written for the CLEVR dataset; however, apart from some tensor dimensions, the code is identical for the Sort-of-CLEVR task. Modifications are handled by extending an ad-hoc RelationalLayerBase class, which is inherited by both the original and the modified architecture.

Listing 4.1: Relational Network code for R-CBIR

import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationalLayerIR(RelationalLayerBase):
    def __init__(self, in_size, out_size, qst_size):
        super().__init__(in_size, out_size, qst_size)

        # g(.): processes every pair of visual objects
        self.g_fc1 = nn.Linear(in_size, 256)
        self.g_fc2 = nn.Linear(256, 256)
        self.g_fc3 = nn.Linear(256, 256)
        self.g_fc4 = nn.Linear(256, 256)

        # h(.): conditions the pairwise activations with the question
        self.h_fc1 = nn.Linear(256 + qst_size, 256)

        # f(.): final aggregation towards the answer
        self.f_fc1 = nn.Linear(256, 256)
        self.f_fc2 = nn.Linear(256, 256)
        self.f_fc3 = nn.Linear(256, out_size)

        self.dropout = nn.Dropout(p=0.5)

    def forward(self, x, qst):
        # x   = (B x 24 x 8 x 8)
        # qst = (B x 128)

        """g"""
        b, k, d, _ = x.size()

        # add coordinates
        if self.coord_tensor is None:
            self.build_coord_tensor(b, d)                 # (B x 2 x 8 x 8)

        x_coords = torch.cat([x, self.coord_tensor], 1)   # (B x 24+2 x 8 x 8)

        x_flat = x_coords.view(b, k + 2, d * d)           # (B x 24+2 x 64)
        x_flat = x_flat.permute(0, 2, 1)                  # (B x 64 x 24+2)

        # add question everywhere
        qst = torch.unsqueeze(qst, 1)                     # (B x 1 x 128)
        qst = qst.repeat(1, d ** 2, 1)                    # (B x 64 x 128)
        qst = torch.unsqueeze(qst, 2)                     # (B x 64 x 1 x 128)

        # cast all pairs against each other
        x_i = torch.unsqueeze(x_flat, 1)                  # (B x 1 x 64 x 26)
        x_i = x_i.repeat(1, d ** 2, 1, 1)                 # (B x 64 x 64 x 26)
        x_j = torch.unsqueeze(x_flat, 2)                  # (B x 64 x 1 x 26)
        x_j = x_j.repeat(1, 1, d ** 2, 1)                 # (B x 64 x 64 x 26)

        # concatenate all together
        x_full = torch.cat([x_i, x_j], 3)                 # (B x 64 x 64 x 2*26)

        # reshape for passing through the g(.) network
        x = x_full.view(b * d ** 4, 2 * 26)
        x = F.relu(self.g_fc1(x))
        x = F.relu(self.g_fc2(x))
        x = F.relu(self.g_fc3(x))
        x = F.relu(self.g_fc4(x))

        # question inserted only now, after g(.)
        x_img = x.view(b, d * d, d * d, 256)
        qst = qst.repeat(1, 1, d * d, 1)                  # (B x 64 x 64 x 128)
        x_concat = torch.cat([x_img, qst], 3)             # (B x 64 x 64 x 256+128)

        # another layer: h(.)
        x = x_concat.view(b * (d ** 4), 256 + self.qst_size)
        x = F.relu(self.h_fc1(x))

        # reshape again and sum over all the pairs
        x_g = x.view(b, d ** 4, 256)
        x_g = x_g.sum(1).squeeze(1)

        """f"""
        x_f = F.relu(self.f_fc1(x_g))
        x_f = self.f_fc2(x_f)
        x_f = self.dropout(x_f)
        x_f = F.relu(x_f)
        x_f = self.f_fc3(x_f)

        return F.log_softmax(x_f, dim=1)

Extraction details and code will be reported in section 5.2.
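For orientation only, a hypothetical sketch of this extraction step is reported below (the helper names cnn, build_pairs and g are assumptions standing for the corresponding pieces of the trained model; the actual extraction code is the one reported in section 5.2):

import torch

def extract_relational_feature(cnn, build_pairs, g, img):
    # img: (3, H, W) input image; returns a single per-image relational feature
    with torch.no_grad():
        objs = cnn(img.unsqueeze(0))        # (1, K, d, d) grid of visual objects
        pairs = build_pairs(objs)           # (1, d^4, 2*(K+2)) all object pairs with coordinates
        g_out = g(pairs)                    # (1, d^4, 256) pairwise relational activations
        feature = g_out.sum(dim=1)          # (1, 256) aggregated relational feature
    return feature.squeeze(0)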

In the following chapters, the quality of the extracted features will be thoroughly analyzed and evaluated against some ad-hoc prepared ground truth.


Chapter 5

Learning R-Features from Sort-of-CLEVR

The work on Sort-of-CLEVR for relational feature extraction relies on a publicly available GitHub repository, a PyTorch implementation of the DeepMind relational networks [13], which reimplements the architecture described in Santoro et al. [16] for the Sort-of-CLEVR task. After 30 epochs, this model reached 92% accuracy on relational questions and 100% accuracy on non-relational ones.

The code cloned from this repository has been changed in order to reflect the architecture modifications explained in section 4.3.

In the following sections we will include a description of the training workflows employed for the new architecture, as well as an exhaustive analysis of the quality of the extracted features.

5.1 Training

Initially, the training process was expected to be fully end to end: both the CNN and the RN would take part in the training process. However, this solution immediately proved to be quite computationally expensive. For the Sort-of-CLEVR task, this issue is still quite contained, since the dataset is very simple and there is no need for any language preprocessing network (such as
