
Politecnico di Milano

Scuola di Ingegneria Industriale e dell’Informazione

Corso di Laurea Magistrale in Ingegneria Informatica
Dipartimento di Elettronica, Informazione e Bioingegneria

Deep Descriptors for Object Recognition

Advisor: Prof. Matteo Matteucci
Co-advisor: Dr. Marco Ciccone

Master's thesis by:

Stefano Cerri Matr. 849945


Acknowledgments

First of all I want to thank my parents Gianluigi and Elisabetta, and my brother Massimo. My family has provided me with great love and I am forever grateful to them for supporting me throughout my studies.

I want to thank my classmates Francesco, Luca, Marco and Stefano who shared with me these years at the university. I will always remember our countless card games during the break.

I want to thank Matteo and Marco who have been constant sources of new ideas and helped me out during these months.

And finally, last but by no means least, I want to thank Luca and Saverio and everyone at Horus Technology. It was a fantastic experience at Horus and all of you made it great.


Abstract

Comparing patches across images is one of the most important problems in computer vision; it plays a crucial role in a wide variety of vision tasks. These tasks can range from low-level operations such as structure from motion, wide baseline matching, building panoramas, and image super-resolution, up to higher-level tasks such as object recognition, image retrieval, and classification of object categories.

In this work we focus our attention on the task of object recognition. We first compare state-of-the-art handcrafted and deep descriptors on the Horus dataset, evaluating them both in terms of accuracy and computational time. We show that handcrafted descriptors outperform deep descriptors, since the latter are not invariant to rotation and scale. We also show that the state-of-the-art deep descriptors in the literature cannot be used for real-time object recognition on the Horus device, because of their high computational cost.

We propose a new descriptor, HorusDesc, which overcomes the limitations of the other deep descriptors by achieving rotation and scale invariance thanks to a Spatial Transformer Network module.

We evaluate and compare HorusDesc and the other deep and handcrafted descriptors on the state-of-the-art benchmarks in the field: the Brown and HPatches datasets.

We finally show that deep descriptors perform poorly when they are evaluated on patches that are not similar to those they were trained on: unlike handcrafted descriptors, they lack generalization.


Sommario

Finding correspondences of a point across multiple images is one of the most important problems in computer vision, playing a crucial role in a wide variety of vision tasks. These tasks range from low-level operations such as structure from motion, wide baseline matching, building panoramas, and image super-resolution, up to higher-level tasks such as object recognition, image retrieval, and classification of object categories.

In this work we focus on the task of object recognition. We first evaluate state-of-the-art handcrafted and deep descriptors on the Horus dataset, both in terms of accuracy and of computational time, showing that handcrafted descriptors obtain better results than deep ones thanks to their invariance to scale and rotation. We also show that deep descriptors, implemented on the Horus device, cannot be used for real-time object recognition, because of their excessive computational time.

In this work we propose a new descriptor, HorusDesc, which overcomes the limitations of deep descriptors by achieving rotation and scale invariance thanks to a Spatial Transformer Network module.

We evaluate and compare HorusDesc and the other handcrafted and deep descriptors on the Brown and HPatches benchmark datasets, the state of the art in this field.

We also show that deep descriptors perform poorly when they are evaluated on patches that are not similar to those they were trained on, and therefore do not generalize as handcrafted descriptors do.


Contents

1 Introduction 1

2 Handcrafted Descriptors for Object Recognition 7

2.1 Global and local features . . . 7

2.2 Detector . . . 8
2.3 Descriptor . . . 8
2.4 Matching . . . 9
2.4.1 Bruteforce . . . 9
2.4.2 FLANN . . . 10
2.5 2D Homography . . . 11

2.5.1 RANSAC for homography estimation . . . 11

2.6 Handcrafted Detectors and Descriptors . . . 12

2.6.1 SIFT . . . 12

2.6.2 ORB . . . 16

2.6.3 BRISK . . . 16

3 Deep Descriptors 19
3.1 Artificial Neural Networks . . . 19

3.2 Convolutional Neural Networks . . . 21

3.3 Spatial Transformer Network . . . 24

3.4 Datasets . . . 26
3.4.1 Brown . . . 26
3.4.2 HPatches . . . 27
3.5 Deep Descriptors . . . 28
3.5.1 DeepDesc . . . 28
3.5.2 TFeat . . . 30
3.5.3 HardNet . . . 31

3.6 Patch pair classification benchmark . . . 32

4 Descriptors Assessments 35
4.1 Horus Dataset . . . 35


4.3 Assessments Results . . . 38

4.4 Rotation and Scale Invariance . . . 41

5 HorusDesc: architecture and training 45
5.1 Trained Architectures . . . 45

5.2 Training Datasets . . . 46

5.3 Data Augmentation . . . 47

5.4 Torch and Server . . . 48

5.5 Hyper-parameters . . . 48

5.6 Training Results . . . 50

5.7 Time results on Tegra K1 . . . 55

5.8 Results on Horus Dataset . . . 55

6 HorusDesc: evaluation 63
6.1 Image matching on Oxford Dataset . . . 63

6.2 Patch classification on Brown Dataset . . . 65

6.3 Results on HPatches Dataset . . . 67

6.3.1 Patches . . . 67

6.3.2 Patch Verification . . . 68

6.3.3 Image Matching . . . 68

6.3.4 Patch Retrieval . . . 69

6.3.5 Results . . . 69

6.4 The Rotation Invariance Factor . . . 70

6.5 Easy, Hard and Tough noise . . . 71

6.6 Descriptors distribution . . . 75

6.7 Descriptors generalization . . . 78

7 Conclusion and future developments 79
7.1 Future developments . . . 80


List of Figures

1.1 Horus device . . . 2

1.2 Horus website. . . 3

1.3 Matching two images . . . 4

2.1 A typical Object Recognition pipeline. . . 8

2.2 Matching two images . . . 10

2.3 SIFT Gaussian Pyramid. . . 13

2.4 SIFT local extrema. . . 15

2.5 SIFT keypoint descriptor. . . 15

2.6 BRISK sampling pattern . . . 17

3.1 An example of ANN . . . 20
3.2 Activation functions . . . 21
3.3 Example of CNN . . . 22
3.4 Pooling Layer . . . 23
3.5 CNN: hierarchical representation . . . 24
3.6 STN architecture . . . 25
3.7 Brown Dataset . . . 27
3.8 Siamese network . . . 29

3.9 Margin Vs Ratio Loss . . . 31

3.10 HardNet architecture . . . 32

4.1 Horus dataset 1 . . . 36

4.2 Horus dataset 2 . . . 36

4.3 Horus Object Recognition pipeline. . . 38

4.4 Hardnet rotation invariance . . . 43

4.5 DeepDesc scale invariance . . . 43

5.1 Training patches example . . . 47

5.2 Augmented patches. . . 48

5.3 Margin vs Ratio Loss . . . 51


5.5 HorusDesc Rotation Invariance . . . 54
5.6 Scale invariance ITW . . . 61


List of Tables

3.1 Inconsistencies of descriptors performances . . . 28

3.2 DeepDesc architecture . . . 30

3.3 Patch pair classification benchmark . . . 33

4.1 Horus dataset: sets and number of images. . . 36

4.2 Horus dataset 1: FG benchmark. . . 40

4.3 Horus dataset 2: FG benchmark. . . 40

4.4 Horus dataset 1: ITW benchmark. . . 41

4.5 Tegra K1 time results. . . 41

4.6 Horus dataset 1: FG benchmark rotated patches. . . 42

4.7 Horus dataset 2: FG benchmark rotated patches. . . 42

4.8 Horus dataset 1: ITW benchmark rotated patches. . . 42

5.1 Tegra K1 time results for TFeat, HorusDesc1 and HorusDesc2. . . 55
5.2 Benchmark Horus dataset 1 FG. . . 57

5.3 Benchmark Horus Test. . . 58

5.4 Benchmark Horus dataset 2 FG. . . 59

5.5 Benchmark Horus dataset 1 ITW. . . 60

6.1 Brown Dataset Benchmark . . . 66

6.2 HPatches geometric noise . . . 68

6.3 Benchmark HPatches. . . 70

6.4 Patch Verification: BRISK. . . 71

6.5 Patch Verification: SIFT. . . 71

6.6 Patch Verification: TFeat. . . 72

6.7 Patch Verification: DeepDesc. . . 72

6.8 Patch Verification: HardNet. . . 72

6.9 Patch Verification: HorusDesc1. . . 72

6.10 Patch Verification: HorusDesc2. . . 72

6.11 Patch Verification: HorusDesc1HP. . . 73

6.12 Patch Verification: HorusDesc2HP. . . 73


6.14 Patch Retrieval: BRISK. . . 73

6.15 Patch Retrieval: SIFT. . . 74

6.16 Patch Retrieval: TFeat. . . 74

6.17 Patch Retrieval: DeepDesc. . . 74

6.18 Patch Retrieval: HardNet. . . 74

6.19 Patch Retrieval: HorusDesc1. . . 74

6.20 Patch Retrieval: HorusDesc2. . . 75

6.21 Patch Retrieval: HorusDesc1HP. . . 75

6.22 Patch Retrieval: HorusDesc2HP. . . 75

6.23 Distances per class. . . 76


Chapter 1

Introduction

"Gianni is a blind man from birth. He lives with his wife Anna who helps him on everyday routine matters. One day Anna finds an online article that talks about Horus, a wearable device that helps blind and visually impaired people by describing the surrounding environment. She immediately investi-gates to know more on this new device. She finds out that Horus is produced by an italian startup, Horus Technology, whose mission is to develop new prod-ucts that are powered by Artificial Intelligence, Deep Learning and Computer Vision to improve life for present and future generations. On the website1 Anna sees that Horus is composed of a wearable headset with two cameras and a pocket unit (see Figure 1.1). The cameras acquire images and send them to the pocket unit, where a powerful processor extracts information from them and describes them to the user through an audible message. Thanks to bone conduction technology, the headset allows the person to hear Horus without affecting his or her hearing. The user can activate each functionality through a set of buttons located both on the headset and the pocket unit. The buttons are very easy to find thanks to their different shapes. In some circumstances, Horus is also able to understand when something is relevant for the user, triggering an automatic action."

"Anna continues to navigate on the website and she reads that Horus is able to recognize the presence of printed texts, also on non flat surfaces, and helps the user in correctly framing the page thanks to audible cues. Once this is done, Horus will start reading it even when it is not in front of the camera. She thinks how many times she had to read his letters for him. Horus can also learn and recognize the faces of people. After detecting a face, the device can quickly understand if that is a known person and then notify the user. In case of an unknown person, teaching the device takes just a couple of seconds. She thinks now that Gianni can recognize all the faces of his grandchildren."



Figure 1.2: Horus website.

"By rotating an object in front of the cameras, Horus can learn its appear-ance and shape. Thanks to stereo vision, Horus will then recognize it from different angles, helping the user in recognizing objects which might be sim-ilar in shape. In order to help the user to correctly frame the object, Horus generates audible cues. When the cameras detect a foreground object, Horus will automatically try to recognize it. If it is not in its database, pressing the triangle button will allow the user to start the learning process. In this stage, the user should rotate the object in front of the cameras. Then Horus will prompt the user to record the object name before adding it to its database. Anna thinks all the time that she had to pick his favorite cereals from the shelf, now with Horus Gianni can do it himself."

The work done in this thesis has as its starting point an internship at Horus Technology, during which we aimed at improving the existing functionalities of the device. Specifically, we focus our attention on the problem of Object Recognition, i.e., the task of recognizing one or more objects in an image. A raw image is all the human eye needs to extract this information. Consider the task of recognizing a bottle of water in an image: it could be in the foreground or in the background, in a dark room or in an open space, it could be rotated, deformed or partially occluded by another object but, since childhood, we are able to recognize it without any problem. For computers, however, this is generally a hard task, due to variations in viewpoint, illumination, occlusion, scale, deformation, background clutter and intra-class appearance. Many machine learning models and algorithms have been successful at making computers able to classify objects belonging to small


Figure 1.3: Local feature description is concerned with methods to match specific areas of the image on the left with the image on the right, using feature description. Note the change in viewpoint, illumination, reflections and focus.

sets, such as digits [1], letters [2] and traffic signs [3]. These techniques use a set of images, called the training set, to learn the features of an object, so that the system is able to recognize instances of that object it has never seen before.

Our work is focused on representing distinctive local image patches in such a way that their representation is invariant under different viewing conditions. This is a crucial step in multiple applications such as image retrieval [4], structure from motion [5], object recognition [6], simultaneous localization and mapping (SLAM) [7] and tracking [8]. The recent interest in self-driving cars has further increased the importance of this area. In its most basic form, the problem is presented in Figure 1.3: we show the same physical area of the 3D world, captured from two very different angles. The goal of a robust feature descriptor is to represent a specific area of the image on the left in such a way that it can be easily matched with the equivalent area of the image on the right.

The success of deep learning methods across all areas of computer vision in recent years [9, 10] highlights the importance of hierarchical learning in knowledge representations. Nowadays, state-of-the-art computer vision systems use deep Convolutional Neural Networks trained on large datasets to learn discriminative features that are robust to perspective transformations [11, 12]. Deep learning has also strongly impacted the area of local patch description, with many works optimizing such networks as feature descriptors.

In this thesis we investigate the impact of these novel feature descriptors when they are used in a real-world scenario, such as the Horus object recognition pipeline. We evaluate these feature descriptors both in terms of accuracy and of computational time.

The thesis is organized as follows:


• In Chapter 2 we introduce the object recognition problem. Specifically, we describe how the problem is handled by using handcrafted descriptors. We analyze the ORB, BRISK and SIFT descriptors.

• In Chapter 3 we describe how the problem of object recognition is tackled by deep descriptors. We analyze recent works in this field, focusing our attention on the DeepDesc, TFeat and HardNet descriptors.

• In Chapter 4 we evaluate handcrafted and deep descriptors on the Horus dataset benchmark.

• In Chapter 5 we propose our new descriptor, HorusDesc. We describe how its architecture is structured and how we train it.

• In Chapter 6 we evaluate HorusDesc on the state-of-the-art benchmark datasets and we compare it with the previously described handcrafted and deep descriptors.

• The final chapter summarizes our work and discusses possible future developments.


Chapter 2

Handcrafted Descriptors for Object Recognition

In this chapter we describe how a typical computer vision object recognition pipeline works, as depicted in Figure 2.1. We focus our attention on three handcrafted descriptors: SIFT, ORB and BRISK. These descriptors will then be evaluated in the next chapters within the Horus object recognition pipeline.

2.1

Global and local features

There are essentially two ways to encode information from an image, namely global features and local features. In the global representation, the image is described by one multi-dimensional feature vector that captures information about the whole image, such as color, texture and shape. With this representation, however, a small change in a part of the image can change the output of the global feature, and thus the computer vision system may fail to recognize the object. This is one of the reasons why global features are typically not used for object recognition, while they are quite effective for localization [13].

Tuytelaars and Mikolajczyk [14] define a local feature as:

"it is an image pattern which differs from its immediate neighbor-hood "

They then consider local invariant features, a representation that allows local structures to be matched efficiently between images. What we obtain is a sparse set of local measurements that capture the essence of the underlying input images and encode their interesting structures.


Figure 2.1: A typical Object Recognition pipeline.

2.2

Detector

When we think about an object recognition system, a question arises: where can we find these local features? A feature detection algorithm solves this problem. It takes as input an image $I$ and outputs a set of $N$ keypoints, or interest points, $P = \{p_1, p_2, \ldots, p_N\}$. Each keypoint $p_i$ has, as attributes, a location $(x_i, y_i)$, an orientation $\theta_i$, a size $s_i$ and a scale $\sigma_i$. To detect the right local features, a feature detection algorithm should fulfill these properties [14]:

• Robustness: the feature detection algorithm should be able to detect the same feature locations independently of scaling, rotation, shifting, deformations and noise.

• Repeatability: the feature detection algorithm should be able to detect the same features of the same scene or object repeatedly, under a variety of viewing conditions.

• Accuracy: the feature detection algorithm should accurately localize at the same pixel location the same feature in two different images.

• Generality: the feature detection algorithm should be able to detect features that can be used in different applications.

• Efficiency: the feature detection algorithm should be able to detect features in new images quickly, to support real-time applications.

• Quantity: the feature detection algorithm should be able to detect all or most of the features in the image, where the density of detected features should reflect the information content of the image in order to provide a compact image representation.

2.3

Descriptor

Once we have a set of keypoints from an image, a feature descriptor algorithm computes, for each keypoint $p_i \in P$, a descriptor vector $\phi(p_i) = (\phi_1, \ldots, \phi_K)$, where $K$ is the size of the vector that encodes information about the image patch. The patch has size $s_i$, is centered at $(x_i, y_i)$, and is rotated by the angle $\theta_i$. A large $K$ encodes more information about the image patch, while a small $K$ gives better time performance. Typical values of $K$ are 64, 128 and 256.

Descriptor vectors have binary or real values; a binary descriptor gives better time performance at the cost of encoding less information, while a real-valued descriptor can encode more information at the cost of an increased computation time.
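As an illustration of these quantities, the sketch below extracts keypoints and descriptors with OpenCV and prints the descriptor dimensionality K and value type for a real-valued (SIFT) and a binary (ORB) descriptor. The image path is a placeholder, and the snippet assumes a recent opencv-python build in which SIFT is available in the main module.

```python
# Minimal sketch: extract keypoints and descriptors with OpenCV and inspect
# the descriptor dimensionality K and value type (real vs. binary).
# "image.jpg" is a placeholder path.
import cv2

img = cv2.imread("image.jpg", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()          # real-valued descriptor, K = 128
orb = cv2.ORB_create()            # binary descriptor, 32 bytes = 256 bits

kp_sift, desc_sift = sift.detectAndCompute(img, None)
kp_orb, desc_orb = orb.detectAndCompute(img, None)

print("SIFT:", desc_sift.shape, desc_sift.dtype)  # (N, 128) float32
print("ORB: ", desc_orb.shape, desc_orb.dtype)    # (M, 32)  uint8 bit strings

# Each keypoint carries the attributes used by the descriptor:
p = kp_sift[0]
print(p.pt, p.angle, p.size, p.octave)  # location, orientation, size, scale info
```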

2.4

Matching

Once the keypoints and their associated descriptors have been extracted from two or more images, the next step is to establish some preliminary feature matches between them. Generally, the performance of matching methods based on interest points depends both on the properties of the keypoints and on the choice of their associated descriptors. We now examine two matching methods, namely brute force and FLANN, while other techniques, like hashing [15–17], exist in the literature.

2.4.1 Bruteforce

The problem of brute-force feature matching can be formulated as follows. Suppose that we have two images $I_1$ and $I_2$, and for each image we have computed, respectively, the interest points

$$P = \{p_1, p_2, \ldots, p_N\}, \qquad Q = \{q_1, q_2, \ldots, q_M\},$$

and their associated descriptors

$$\Phi(P) = \{\phi_{p_1}, \phi_{p_2}, \ldots, \phi_{p_N}\}, \qquad \Phi(Q) = \{\phi_{q_1}, \phi_{q_2}, \ldots, \phi_{q_M}\}.$$

A distance measure between the two interest point descriptors $\phi_{p_i}$ and $\phi_{q_j}$ can be defined as

$$d(p_i, q_j) = \|\phi_{p_i} - \phi_{q_j}\| \quad \text{for real-valued descriptors},$$
$$d(p_i, q_j) = |\phi_{p_i} - \phi_{q_j}| \quad \text{(Hamming distance) for binary descriptors}.$$

Note that other distance measures are also used in matching descriptors, like the Hellinger [18] and Mahalanobis [19] distances. Based on $d$, the points of $Q$ are sorted in ascending order of distance, independently for each descriptor, creating the sets

$$\psi(p_i, Q) = \{(\psi_1, \psi_2, \ldots, \psi_M) \in Q \mid d(p_i, \psi_s) \le d(p_i, \psi_j),\ \forall s < j\}.$$

A match between the pair of interest points $(p_i, q_j)$ is accepted only if $p_i$ is the best match for $q_j$ with respect to all the other points in the first image $I_1$; in its variant with cross-check, the brute-force algorithm also requires that $q_j$ is the best match for $p_i$ with respect to all the other points in the second image $I_2$.

Figure 2.2: Matching image regions with FLANN [20], based on local feature descriptors computed with the ORB [22] detector and descriptor.
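A minimal sketch of brute-force matching with cross-check, using OpenCV's BFMatcher; desc1 and desc2 stand for the descriptor matrices of the two images and are assumptions of this example.

```python
# Brute-force matching with cross-check, as described above. For binary
# descriptors the Hamming norm replaces the L2 norm.
import cv2

def bruteforce_match(desc1, desc2, binary=False):
    norm = cv2.NORM_HAMMING if binary else cv2.NORM_L2
    # crossCheck=True keeps (p_i, q_j) only if each is the other's best match
    matcher = cv2.BFMatcher(norm, crossCheck=True)
    matches = matcher.match(desc1, desc2)
    # sort by descriptor distance, best matches first
    return sorted(matches, key=lambda m: m.distance)

# matches = bruteforce_match(desc_sift1, desc_sift2)
```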

2.4.2 FLANN

FLANN stands for Fast Library for Approximate Nearest Neighbors [20]. It contains a collection of algorithms optimized for fast nearest neighbor search in large datasets and for high-dimensional features, and it is much faster than a brute-force matcher on large datasets. FLANN uses the Hierarchical K-means Tree for generic feature matching. A Hierarchical K-means Tree is a tree in which every inner node is split into $K$ branches; K-means clustering is used to partition the data subset at each node. An $L$-level tree has approximately $K^L$ leaf nodes, all at level $L$. Nearest neighbors are discovered by choosing to examine the branch-not-taken nodes along the path. FLANN uses a priority queue, Best-Bin-First, to perform approximate nearest neighbor search on the Hierarchical K-means Tree [21]. Figure 2.2 shows an example of matching two images with FLANN using local descriptors.
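The following sketch shows how FLANN-based matching can be set up in OpenCV. The index and search parameters are illustrative rather than tuned; OpenCV exposes several FLANN index types, and the hierarchical k-means index used here corresponds to the tree described above.

```python
# FLANN-based matching with OpenCV. Parameter values are only examples.
import cv2

FLANN_INDEX_KMEANS = 2  # hierarchical k-means tree
index_params = dict(algorithm=FLANN_INDEX_KMEANS, branching=32, iterations=11)
search_params = dict(checks=64)  # leaves to visit: speed/accuracy trade-off

flann = cv2.FlannBasedMatcher(index_params, search_params)

# knnMatch returns, for each descriptor of the first image, its k approximate
# nearest neighbours in the second image.
# knn = flann.knnMatch(desc_sift1, desc_sift2, k=2)
```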


2.5

2D Homography

Once we have a set of matches between two images, we need to determine which of these matches are "good" (inliers) and which are "bad" (outliers). This discrimination can be done by computing a 2D homography. A 2D homography is a mapping between any two projection planes with the same center of projection. Given a point $p_i$ of an image, we can relate it to a point $p'_i$ of another image by the equation

$$p'_i = H p_i,$$

where $H$ is a $3 \times 3$ matrix. $H$ contains 9 entries but is defined only up to scale; therefore, the total number of degrees of freedom of a 2D projective transformation is 8, and to calculate the homography we need at least 4 pairs of corresponding points.

In many practical situations, including object recognition, correspondences are computed automatically and are often mismatched. The mismatched points can be considered as outliers to a Gaussian distribution that explains the error in the measurements. These outliers can severely disturb the estimated homography and should be identified. The goal then is to determine a set of inliers from the presented correspondences so that the homography can be estimated in an optimal manner from these inliers. The RANSAC algorithm [23] solves this problem: it is a robust estimator tolerant to outliers, i.e., to measurements following a different, and possibly unmodelled, error distribution.

2.5.1 RANSAC for homography estimation

The random sample consensus (RANSAC) algorithm is a learning technique that estimates the parameters of a model by random sampling of the observed data. It is a non-deterministic algorithm in the sense that it produces a reasonable result only with a certain probability, which increases as more iterations are allowed. The probability of not having picked a set made only of inliers after $k$ iterations is $(1 - G^P)^k$, where $G$ is the proportion of inliers among the matches found and $P$ is the number of pairs needed to fit the model, 4 in the case of the matrix $H$. Given a dataset of matches containing both inliers and outliers, RANSAC uses a voting scheme to find the optimal fitting result. Data elements in the dataset are used to vote for one or multiple models. The implementation of this voting scheme is based on two assumptions: the noisy features will not vote consistently for any single model (few outliers) and there are enough features to agree on a good model (few missing data). The RANSAC algorithm for homography estimation iterates $k$ times over these steps:


1. randomly select a minimal sample of 4 correspondences;

2. compute the exact homography $H$ from the sample;

3. compute the inliers, i.e., the correspondences for which $\|p'_i - H p_i\| < \epsilon$.

Then, it selects the largest set of inliers found and, with this set, it computes $H$ using a least squares approximation.
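A compact sketch of this procedure using OpenCV's RANSAC-based homography estimation; kp1, kp2 and matches are assumed to come from the detection and matching steps described earlier, and the 3-pixel reprojection threshold is only an example value.

```python
# Estimate a homography with RANSAC from a list of matches.
import cv2
import numpy as np

def estimate_homography(kp1, kp2, matches, ransac_thresh=3.0):
    src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    # RANSAC picks minimal samples of 4 correspondences, scores them by the
    # number of inliers with ||p' - Hp|| < ransac_thresh, and refines H on
    # the largest consensus set.
    H, inlier_mask = cv2.findHomography(src, dst, cv2.RANSAC, ransac_thresh)
    return H, inlier_mask.ravel().astype(bool)
```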

2.6

Handcrafted Detectors and Descriptors

We now describe three handcrafted descriptors: SIFT, ORB and BRISK. These descriptors will be the object of evaluation in the next chapters.

2.6.1 SIFT

SIFT [24] stands for Scale Invariant Feature Transform. The SIFT algorithm involves four main steps:

1] scale-space extrema detection

2] keypoint localization

3] orientation assignment

4] keypoint descriptor

1] Scale-space Extrema Detection

The image is convolved with Gaussian filters at different scales, and then the differences of successive Gaussian-blurred images are taken. Keypoints are then selected as maxima/minima of the Difference of Gaussians (DoG) that occur at multiple scales. Specifically, a DoG image $D(x, y, \sigma)$ is given by

$$D(x, y, \sigma) = L(x, y, k_i\sigma) - L(x, y, k_j\sigma),$$

where $L(x, y, k\sigma)$ is the convolution of the original image $I(x, y)$ with the Gaussian blur $G(x, y, k\sigma)$ at scale $k\sigma$, i.e.,

$$L(x, y, k\sigma) = G(x, y, k\sigma) * I(x, y).$$

Figure 2.3 shows how a DoG is computed. Hence a DoG image between scales $k_i\sigma$ and $k_j\sigma$ is just the difference of the Gaussian-blurred images at scales $k_i\sigma$ and $k_j\sigma$. For scale-space extrema detection in the SIFT algorithm, the image is first convolved with Gaussian blurs at different scales. The convolved images are grouped by octave (an octave corresponds to doubling the value of $\sigma$), and the value of $k_i$ is selected so that we obtain a fixed number of convolved


Figure 2.3: SIFT Gaussian Pyramid.

images per octave. Then the Difference-of-Gaussian images are taken from adjacent Gaussian-blurred images per octave.

Once DoG images have been obtained, keypoints are identified as local minima/maxima of the DoG images across scales. This is done by comparing each pixel in the DoG images to its eight neighbors at the same scale and nine corresponding neighboring pixels in each of the neighboring scales. If the pixel value is the maximum or minimum among all compared pixels, it is selected as a candidate keypoint (see Figure 2.4).
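The following sketch, written as a naive and unoptimized illustration rather than a faithful SIFT implementation, builds one octave of the Gaussian pyramid, takes the DoG images and flags pixels that are extrema over their 26 neighbors; the constants follow the values reported in the SIFT paper.

```python
# Illustrative DoG scale-space extrema detection for a single octave.
import cv2
import numpy as np

def dog_extrema(gray, sigma=1.6, s=3, contrast_thresh=0.03):
    gray = gray.astype(np.float32) / 255.0
    k = 2 ** (1.0 / s)
    blurred = [cv2.GaussianBlur(gray, (0, 0), sigma * k**i) for i in range(s + 3)]
    dog = [blurred[i + 1] - blurred[i] for i in range(len(blurred) - 1)]
    dog = np.stack(dog)                      # shape: (s+2, H, W)

    keypoints = []
    for i in range(1, dog.shape[0] - 1):     # skip first/last DoG layers
        for y in range(1, dog.shape[1] - 1):
            for x in range(1, dog.shape[2] - 1):
                cube = dog[i - 1:i + 2, y - 1:y + 2, x - 1:x + 2]
                v = dog[i, y, x]
                # candidate keypoint: max or min of the 3x3x3 neighbourhood
                if abs(v) > contrast_thresh and (v == cube.max() or v == cube.min()):
                    keypoints.append((x, y, sigma * k**i))  # approximate scale
    return keypoints
```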

2] Keypoint Localization

Once potential keypoint locations have been found, they have to be refined to get more accurate results. A Taylor series expansion of the scale space is computed to obtain a more accurate location of the extrema. Formally, the Taylor expansion of $D(x, y, \sigma)$ is given by

$$D(\mathbf{x}) = D + \frac{\partial D}{\partial \mathbf{x}}^{T}\mathbf{x} + \frac{1}{2}\mathbf{x}^{T}\frac{\partial^{2} D}{\partial \mathbf{x}^{2}}\mathbf{x},$$

where $D$ and its derivatives are evaluated at the candidate keypoint and $\mathbf{x} = (x, y, \sigma)^{T}$ is the offset from this point. If the intensity at this extremum is less than a threshold value (0.03 in the paper), the keypoint is rejected. The DoG has a high response for edges, so edges also need to be removed. For poorly defined peaks in the DoG function, the principal curvature across the edge would


be much larger than the principal curvature along it. Finding these principal curvatures amounts to solving for the eigenvalues of the second-order Hessian matrix

$$\mathbf{H} = \begin{pmatrix} D_{xx} & D_{xy} \\ D_{xy} & D_{yy} \end{pmatrix}.$$

The eigenvalues of $\mathbf{H}$ are proportional to the principal curvatures of $D$. The ratio of the two eigenvalues, say $\alpha$ the larger one and $\beta$ the smaller one, with $r = \alpha/\beta$, is sufficient for SIFT's purposes. The trace of $\mathbf{H}$, i.e., $D_{xx} + D_{yy}$, gives us the sum of the two eigenvalues, while its determinant, i.e., $D_{xx}D_{yy} - D_{xy}^2$, yields their product. The ratio $R = \operatorname{Tr}(\mathbf{H})^2 / \operatorname{Det}(\mathbf{H})$ can be shown to be equal to $(r + 1)^2/r$, which depends only on the ratio of the eigenvalues rather than their individual values. $R$ is minimum when the eigenvalues are equal to each other. Therefore, the higher the absolute difference between the two eigenvalues, which is equivalent to a higher absolute difference between the two principal curvatures of $D$, the higher the value of $R$. It follows that, for some threshold eigenvalue ratio $r_{th}$ (10 in the paper), if $R$ for a candidate keypoint is larger than $(r_{th} + 1)^2/r_{th}$, that keypoint is poorly localized and hence rejected.

3] Orientation Assignment

An orientation is assigned to each keypoint to achieve invariance to image rotation. First, the Gaussian-smoothed image L (x, y, σ) at the keypoint’s scale σ is taken so that all computations are performed in a scale-invariant manner. For an image sample L (x, y) at scale σ, the gradient magnitude, m (x, y), and orientation, θ (x, y), are precomputed using pixel differences:

$$m(x, y) = \sqrt{(L(x+1, y) - L(x-1, y))^2 + (L(x, y+1) - L(x, y-1))^2}$$
$$\theta(x, y) = \operatorname{atan2}\big(L(x, y+1) - L(x, y-1),\ L(x+1, y) - L(x-1, y)\big)$$

The magnitude and direction calculations for the gradient are done for every pixel in a region neighboring the keypoint in the Gaussian-blurred image $L$. An orientation histogram with 36 bins is formed, with each bin covering 10 degrees. Each sample in the neighboring window added to a histogram bin is weighted by its gradient magnitude and by a Gaussian-weighted circular window with a $\sigma$ that is 1.5 times the scale of the keypoint. The peaks in this histogram correspond to dominant orientations.

Once the histogram is filled, the orientation corresponding to the highest peak, together with the local peaks that are within 80% of the highest peak, is assigned to the keypoint. In the case of multiple orientations being assigned, an additional


Figure 2.4: SIFT local extrema.

Figure 2.5: SIFT keypoint descriptor.

keypoint is created having the same location and scale as the original keypoint for each additional orientation.

4] Keypoint Descriptor

A keypoint descriptor is then created. A 16×16 neighbourhood around the keypoint is taken and divided into 16 sub-blocks of 4×4 size. For each sub-block, an 8-bin orientation histogram is created, so a total of 128 bin values is available. These are represented as a vector to form the keypoint descriptor. In addition, several measures are taken to achieve robustness against illumination changes, rotation, etc. See Figure 2.5 for a graphical representation of the descriptor.


2.6.2 ORB

ORB [22] stands for Oriented FAST [25, 26] and Rotated BRIEF [27]. It uses a modified version of FAST as detector and BRIEF as binary descriptor. It extracts keypoints with FAST, then applies the Harris corner measure [28] to find the top $N$ points among them. FAST does not produce multi-scale features, so ORB employs a scale pyramid of the image and produces FAST features, filtered by Harris, at each level of the pyramid. Since FAST does not compute the orientation of the interest point, ORB uses the intensity-weighted centroid of the patch with the detected corner at its center: the direction of the vector from the corner to the centroid gives the orientation.

BRIEF has the important property that each bit feature has large variance and a mean close to 0.5. However, once it is oriented along the keypoint direction, it loses this property and becomes more distributed. High variance makes a feature more discriminative, since it responds differentially to inputs. Another desirable property is to have the tests uncorrelated, since then each test contributes to the result. To achieve all of this, ORB runs a greedy search among all possible binary tests to find the ones that have both high variance and means close to 0.5, as well as being uncorrelated. The result is called Rotated BRIEF.

The key advantage of ORB lies in the fact that it can be used for real-time applications, since it is efficient to compute and is two orders of magnitude faster than SIFT [22].

2.6.3 BRISK

BRISK [29] stands for Binary Robust Invariant Scalable Keypoints. It provides both scale and rotation invariance. In order to compute the feature locations, it uses the AGAST corner detector [30], which improves FAST by increasing speed while maintaining the same detection performance. For scale invariance, BRISK detects keypoints in a scale-space pyramid, performing non-maxima suppression and interpolation across all scales. To describe the features, BRISK turns away from the random or learned patterns of BRIEF and ORB and instead uses a symmetric pattern. Sample points are positioned in concentric circles surrounding the feature, with each sample point representing a Gaussian blurring of its surrounding pixels. The standard deviation of this blurring increases with the distance from the center of the feature.

To determine orientation, several long-distance sample point comparisons (e.g. on opposite sides of the descriptor pattern) are used. For each long-distance comparison, the vector displacement between the sample points is stored and weighted by the relative difference in intensity. Then, these weighted vectors are averaged to determine the dominant gradient direction of the patch.


Figure 2.6: BRISK sampling pattern with N = 60 points: the small blue circles denote the sampling locations; the bigger, red dashed circles are drawn at a radius σ corresponding to the standard deviation of the Gaussian kernel used to smooth the intensity values at the sampling points. The pattern shown applies to a scale of t = 1.

The sampling pattern is then scaled and rotated, and the descriptor is built up of 512 short-distance sample point comparisons (e.g. a sample point and its closest neighbors) representing the local gradients and shape within the patch. See Figure 2.6 for BRISK sampling pattern.

Overall, BRISK requires significantly more computation and slightly more storage space than either BRIEF or ORB.
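The sketch below runs the three handcrafted pipelines on a single image and prints the number of keypoints, the descriptor size and type, and the wall-clock time, giving a rough feeling for the computation and storage trade-offs just discussed. The image path is a placeholder and the timings are only indicative.

```python
# Rough comparison of SIFT, ORB and BRISK on one image.
import time
import cv2

img = cv2.imread("image.jpg", cv2.IMREAD_GRAYSCALE)
extractors = {
    "SIFT": cv2.SIFT_create(),
    "ORB": cv2.ORB_create(nfeatures=1000),
    "BRISK": cv2.BRISK_create(),
}

for name, ext in extractors.items():
    t0 = time.perf_counter()
    kps, descs = ext.detectAndCompute(img, None)
    dt = (time.perf_counter() - t0) * 1000
    print(f"{name:5s}: {len(kps):4d} keypoints, "
          f"descriptor {descs.shape[1]} x {descs.dtype}, {dt:.1f} ms")
```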


Chapter 3

Deep Descriptors

Before talking about complex models for object recognition, such as Convolutional Neural Networks, we first analyze simpler models, namely Artificial Neural Networks. We then describe DeepDesc, TFeat and HardNet, three novel deep descriptors, and the datasets on which they were trained. These descriptors will then be evaluated in the next chapters.

3.1

Artificial Neural Networks

The simplest definition of a neural network, more properly referred to as an artificial neural network (ANN), is provided by the inventor of one of the first neurocomputers, Dr. Robert Hecht-Nielsen. He defines a neural network as [31] :

". . . a computing system made up of a number of simple, highly interconnected processing elements, which process information by their dynamic state response to external inputs."

The idea of ANNs is based on the belief that the human brain can be imitated using silicon and wires as living neurons and dendrites. ANNs are typically organized in layers composed of a number of interconnected nodes, each of which imitates a neuron. The input layer takes as input a vector of features $x \in \mathbb{R}^{n+1}$, where $x$ contains $n$ input features and, as last element, a bias term $b$. An arbitrary number of "hidden" layers then precedes the last layer, called the "output" layer. Two nodes are connected by an edge, and each connection has a weight $w_{ij}$, where $i$ is the starting node of the arc and $j$ is the destination node. An activation function $\sigma$ outputs the node value

$$h_j = \sigma\Big(\sum_{i=1}^{n} x_{ij} w_{ij} + b_j\Big),$$

where $b_j$ is the bias term, or in vector form $h_j = \sigma(x^T w)$. An example of an Artificial Neural Network is shown in Figure 3.1, while Figure 3.2 shows the Sigmoid, Tanh and ReLU activation functions.


Figure 3.1: An example of ANN composed of three layers: an input layer with 3 neurons, a hidden layer with 4 neurons and an output layer with 2 neurons.

Other popular activation functions are Hard-Tanh [32], which is sometimes preferred over Tanh since it is computationally cheaper, and SoftPlus [33], which can be considered an alternative to Tanh since it does not saturate as easily as hard-clipped functions. More activation functions are described by Goodfellow et al. in [34].

ANNs can have different topologies; the simplest one is the so-called Fully Connected network. Neurons in a fully connected layer ($FC$) have full connections to all activations in the previous layer. Their activations can hence be computed with a matrix multiplication followed by a bias offset. This is a general-purpose connection pattern and it makes no assumptions about the features in the data. Each weight can be updated according to a learning rule. One of the most used algorithms to update the weights is gradient descent, which iteratively minimizes the error or loss function by using the update rule

$$W_{k+1} = W_k - \alpha \frac{\partial E}{\partial W_k},$$

where $\alpha$ is the learning rate and $\frac{\partial E}{\partial W_k}$ is the gradient of the error function with respect to the weights.

Since the error function involves only addition, multiplication, and composition of differentiable functions, we can apply the chain rule several times in order to compute the gradient. Here is an example of the chain rule applied twice:

$$\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial o_j} \cdot \frac{\partial o_j}{\partial net_j} \cdot \frac{\partial net_j}{\partial w_{ij}},$$

where $o_j$ is the output of node $j$ in the output layer and $net_j$ is the weighted sum of the inputs coming into node $j$.
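A minimal numerical sketch of the update rule and of the chain rule above: a single sigmoid output neuron trained with gradient descent on a squared error. All sizes and values are illustrative.

```python
# One sigmoid neuron trained with gradient descent on a squared error.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))          # 4 samples, 3 input features
t = rng.integers(0, 2, size=(4, 1))  # targets
w = rng.normal(size=(3, 1))
b = 0.0
alpha = 0.1                          # learning rate

for epoch in range(100):
    net = x @ w + b                  # weighted sum net_j
    o = 1.0 / (1.0 + np.exp(-net))   # sigmoid activation o_j
    E = 0.5 * np.sum((o - t) ** 2)   # error function

    # chain rule: dE/dw = dE/do * do/dnet * dnet/dw
    dE_do = o - t
    do_dnet = o * (1.0 - o)
    delta = dE_do * do_dnet
    grad_w = x.T @ delta
    grad_b = delta.sum()

    w -= alpha * grad_w              # W_{k+1} = W_k - alpha * dE/dW_k
    b -= alpha * grad_b
```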


Figure 3.2: Different activation functions. Left: the Sigmoid non-linearity squashes real numbers to the range [0, 1]. Center: the Tanh non-linearity squashes real numbers to the range [-1, 1]. Right: the Rectified Linear Unit (ReLU) activation function, which is zero when x < 0 and linear with slope 1 when x > 0.

The ANNs that we have described so far are feed-forward ANNs: they allow signals to travel one way only, from input to output. There are no feedback loops, i.e., the output of any layer does not affect that same layer or the previous ones. Other ANNs that have feedback loops exist, e.g., RNNs [35] and LSTMs [36], but they are beyond the scope of this thesis.

3.2

Convolutional Neural Networks

Convolutional Neural Networks (CNNs or ConvNets) are a special type of Artificial Neural Network that makes the explicit assumption that the inputs are images. This allows sparse interactions, parameter sharing and equivariant representations to be encoded into the architecture, which reduces the amount of parameters in the network. Convolutional Neural Networks take advantage of the fact that the input consists of images and they constrain the architecture in a more sensible way. In particular, unlike a regular Fully Connected Neural Network, the layers of a ConvNet have neurons arranged in 3 dimensions: width, height and depth, where width and height represent the size of the image while depth typically represents the color channels, 1 for a grayscale image and 3 for an RGB image. See Figure 3.3.

A CNN is composed of three types of layers, namely convolutional, pooling and fully connected.

Convolutional Layer

Convolutional layers ($CONV$) are the core blocks of a CNN. The convolutional layer's parameters consist of a set of learnable filters. Every filter is small spatially (along width and height), but extends through the full depth of the input volume. For example, a typical filter in the first layer of a CNN


Figure 3.3: Left: An example input volume in red (e.g. a 32x32x3 CIFAR-10 image), and an example volume of neurons in the first convolutional layer. Each neuron in the convolutional layer is connected only to a local region in the input volume spatially, but to the full depth (i.e. all color channels). Note that there are multiple neurons (5 in this example) along the depth, all looking at the same region in the input. Right: The neurons from an Artificial Neural Network remain unchanged: they still compute a dot product of their weights with the input followed by a non-linearity, but their connectivity is now restricted to be local spatially.

might have size 5 × 5 × 3 (i.e. 5 pixels width and height, and 3 color channels). During the forward pass, we convolve each filter across the width and height of the input volume and compute dot products between the entries of the filter and the input at any position. As we slide the filter over the width and height of the input volume we will produce a 2-dimensional activation map that gives the responses of that filter at every spatial position.

Pooling Layer

It is common to periodically insert a pooling layer ($POOL$) between successive convolutional layers in a CNN architecture. Its function is to progressively reduce the spatial size of the representation, in order to reduce the amount of parameters and computation in the network. The pooling layer operates independently on every depth slice of the input and resizes it spatially, using the MAX operation. Other operations are possible, such as average and L2-norm pooling. The most common form is a pooling layer with filters of size 2×2 applied with a stride of 2, which down-samples every depth slice in the input by 2 along both width and height, discarding 75% of the activations. Every MAX operation is in this case taken over 4 numbers (a little 2×2 region in some depth slice). The depth dimension remains unchanged. See Figure 3.4 for an example of a pooling layer.


Figure 3.4: An example of a 2×2, stride 2, MAX Pooling Layer.

Convolutional Neural Network architectures

The most common form of a CNN architecture stacks a few CON V -RELU layers, follows them with P OOL layers, and repeats this pattern until the image has been merged spatially to a small size. If the CNN is used for classification, it is common to transition to fully-connected layers. The last fully connected layer holds the output, such as the class scores. Softmax is usually applied to convert the class scores into probabilities. In other words, the most common ConvNet architecture follows the pattern:

$$IN \rightarrow [[CONV \rightarrow RELU] \times N \rightarrow POOL?] \times M \rightarrow [FC \rightarrow RELU] \times K \rightarrow FC$$

where $IN$ is the input and $POOL?$ indicates an optional pooling layer. Moreover, $N \geq 0$ (and usually $N \leq 3$), $M \geq 0$, $K \geq 0$ (and usually $K < 3$).
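As a concrete instance of this pattern, the following PyTorch sketch builds a small ConvNet with N = 2, M = 2 and K = 1; the 32×32 grayscale input and the 10 output classes are hypothetical choices.

```python
# Small ConvNet following the IN -> [[CONV -> RELU]*N -> POOL]*M -> [FC -> RELU]*K -> FC pattern.
import torch
import torch.nn as nn

class SmallConvNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            # [[CONV -> RELU] * 2 -> POOL] * 2
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            # [FC -> RELU] * 1 -> FC
            nn.Flatten(),
            nn.Linear(64 * 8 * 8, 128), nn.ReLU(),
            nn.Linear(128, num_classes),   # class scores; softmax applied in the loss
        )

    def forward(self, x):
        return self.classifier(self.features(x))

scores = SmallConvNet()(torch.randn(4, 1, 32, 32))   # -> shape (4, 10)
```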

Hierarchical features

A question now arises: what does a CNN learn at each layer [37]? Given the pixels, the first layer identifies edges by comparing the brightness of neighboring pixels. Given the first hidden layer's description of the edges, the second hidden layer searches for corners and extended contours, which are recognizable as collections of edges. Given the second hidden layer's description of the image in terms of corners and contours, the third hidden layer detects entire parts of specific objects, by finding specific collections of contours and corners. Finally, this description of the image in terms of the object parts it contains is used to recognize the objects present in the image. Figure 3.5 shows an example of the hierarchical features computed by a CNN. A ConvNet is capable of composing features of increasing complexity in each of its successive layers and is able to learn a rich representation of the data. Here are some examples of learned feature hierarchies in different tasks:


Figure 3.5: Hierarchical representation of a Convolutional Neural Network; from low-level features to high-level features.

• Image recognition: pixel → edge → texton → motif → part → object

• Text recognition: character → word → word group → clause → sentence

• Speech recognition: sample → spectral band → sound → . . . → phone → phoneme → word

3.3

Spatial Transformer Network

One issue with CNNs is that they are not invariant to relatively large input distortions. This limitation is due to having only a restricted, predefined pooling mechanism for dealing with spatial variation of the data. The Spatial Transformer Network, or STN [38], addresses this issue by providing Convolutional Neural Networks with explicit spatial transformation capabilities. STNs possess three properties:

• modular: STNs can be inserted anywhere into existing architectures with relatively small tweaking. See Figure 3.6 for a general view of a STN module.

• differentiable: STNs can be trained with backpropagation allowing for end-to-end training of the models they are injected in.

• dynamic: STNs perform active spatial transformation on a feature map for each input sample as compared to the pooling layer which acted identically for all input samples.


Figure 3.6: The architecture of a spatial transformer module. The input feature map U is passed to a localization network which regresses the transformation parameters θ. The regular spatial grid G over V is transformed to the sampling grid Tθ(G), which is applied to U , producing the warped output feature map V . The combination of the localization network and sampling mechanism defines a spatial transformer.

The Spatial Transformer module consists of three components: localization network, parametrized sampling grid and differentiable image sampling.

Localization Network

The goal of the localization network is to compute the parameters θ of the affine transformation that will be applied to the input feature map. More formally, the localization net is defined as follows:

• input: feature map U of shape (H, W, C)

• output: transformation matrix θ. The size of θ can vary depending on the transformation type that is parametrised, e.g. for an affine transfor-mation θ is 6-dimensional.

• architecture: fully connected network or CNN, but should include a final regression layer to produce the transformation parameters θ.

Parametrised Sampling Grid

The goal of the grid generator is to output a parametrised sampling grid, which is a set of points where the input map should be sampled to produce the desired transformed output. Concretely, the grid generator first creates a normalized mesh-grid of the same size as the input image U of shape (H, W ), that is, a set of indices (xt, yt) that cover the whole input feature map, where subscript t stands for target coordinates in the output feature map. Then, since we are applying an affine transformation to this grid and we would like to use translations, we proceed by adding a row of ones to our coordinate vector


to obtain its homogeneous equivalent. Finally, we reshape the 6-dimensional parameter vector $\theta$ into a 2×3 matrix and perform the following multiplication, which results in our desired parametrised sampling grid. The pointwise transformation is

$$\begin{pmatrix} x^s_i \\ y^s_i \end{pmatrix} = \begin{pmatrix} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \end{pmatrix} \begin{pmatrix} x^t_i \\ y^t_i \\ 1 \end{pmatrix},$$

where $(x^t_i, y^t_i)$ are the target coordinates of the regular grid in the output feature map and $(x^s_i, y^s_i)$ are the source coordinates in the input feature map that define the sample points.

Differentiable image sampling

Since bilinear interpolation is differentiable, it is perfectly suitable for the task at hand. With the input feature map and the parametrised sampling grid, we can use bilinear sampling and obtain the output feature map $V$ of shape $(H', W', C')$. Note that this implies that we can perform downsampling and upsampling by specifying the shape of our sampling grid. Other sampling kernels can be used, but they must be differentiable to allow the loss gradients to flow all the way back to the localization network.
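A possible PyTorch sketch of a spatial transformer module following the three components above: a small localization CNN regressing the six affine parameters θ, affine_grid as the grid generator, and grid_sample as the differentiable bilinear sampler. The layer sizes assume 32×32 single-channel patches and are illustrative.

```python
# Spatial transformer module: localization net + sampling grid + bilinear sampling.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialTransformer(nn.Module):
    def __init__(self):
        super().__init__()
        self.localization = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=7), nn.MaxPool2d(2), nn.ReLU(),
            nn.Conv2d(8, 10, kernel_size=5), nn.MaxPool2d(2), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(10 * 4 * 4, 32), nn.ReLU(),
            nn.Linear(32, 6),                      # regression layer for theta
        )
        # initialize the regression layer to the identity transformation
        self.localization[-1].weight.data.zero_()
        self.localization[-1].bias.data.copy_(
            torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))

    def forward(self, u):
        theta = self.localization(u).view(-1, 2, 3)      # (B, 2, 3)
        grid = F.affine_grid(theta, u.size(), align_corners=False)
        v = F.grid_sample(u, grid, align_corners=False)  # bilinear sampling
        return v

warped = SpatialTransformer()(torch.randn(4, 1, 32, 32))  # same shape as the input
```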

3.4

Datasets

We now describe the datasets on which most of the deep descriptors were trained. These datasets will play an important role in the next chapters, as we will use them to compare handcrafted and deep descriptors.

3.4.1 Brown

The Brown dataset [39] consists of three subsets, Liberty, Yosemite and Notredame, each containing more than 500k patch pairs extracted around keypoints. Interest points are computed using DoG, with associated position, scale and orientation. The Brown dataset introduces a simple and unambiguous evaluation protocol, referred to as patch verification: given a pair of patches, the task is to predict whether they match or not, which reduces the matching task to a binary classification problem. This formulation is particularly suited to learning-based methods, including CNNs and metric learning in particular, due to the large number of patches available in this dataset. Figure 3.7 shows some examples of patches from the dataset.


Figure 3.7: Brown Dataset: examples of patches from the Liberty subset.

3.4.2 HPatches

HPatches [40] is a recent dataset for evaluating local feature descriptors. It is inspired by the success of the Oxford matching dataset [41], the most widely adopted and still very popular benchmark for the evaluation of local features, despite consisting of only 48 images.

HPatches fulfills these properties:

• Reproducible, patch-based: descriptor evaluation should be done on patches to eliminate detector-related factors. See Table 3.1 for inconsistencies in the literature. This leads to a standardization across different works and makes results directly comparable.

• Diverse: representative of many different scenes and image capturing conditions.

• Real: real data have been found to be more challenging than synthesized data, due to nuisance factors that cannot be modelled by image transformations.

• Large: to allow accurate and stable evaluation, as well as to provide substantial training sets for learning based descriptors.

• Multitask: representative of several use cases, from matching image pairs to image retrieval. This allows cross-task comparison of descriptor performance on the same data.

In HPatches there are three benchmark tasks:

• Patch verification: descriptors are used to classify whether two patches are in correspondence or not.


Table 3.1: Contradicting conclusions reported in the literature when evaluating the same descriptors on the same benchmark (Oxford [41]). Each row reports inconsistent evaluation results due to variations of implicit parameters, e.g. of the feature detectors.

BRISK > SIFT [29, 42]    SIFT > BRISK [27]
ORB > SIFT [22]          SIFT > ORB [42]
ORB > BRIEF [22]         BRIEF > ORB [27]

• Image matching: descriptors are used to match patches from a refer-ence image to a target one.

• Patch retrieval: descriptors are used to find patch correspondences in a large collection of patches, a large portion of which are distractors, extracted from confounder images.

3.5

Deep Descriptors

Deep descriptors are descriptors that use Convolutional Neural Networks instead of handcrafted algorithms. We now give a general description of three of them: DeepDesc, TFeat and HardNet.

3.5.1 DeepDesc

DeepDesc [43] is a deep descriptor trained on the Brown dataset. It uses a Siamese network [44], i.e., two CNNs with shared parameters, where the nonlinear mapping is represented by a CNN that is optimized for pairs of corresponding and non-corresponding patches, as shown in Figure 3.8.

Using the L2 norm as a similarity metric between the descriptors $D(x_1)$ and $D(x_2)$, the loss function used by DeepDesc, in terms of the hinge embedding loss [45], can be stated as

$$l(x_1, x_2) = \begin{cases} \|D(x_1) - D(x_2)\|_2 & \text{if } p_1 = p_2 \\ \max\big(0,\ C - \|D(x_1) - D(x_2)\|_2\big) & \text{if } p_1 \neq p_2 \end{cases}$$

where $p_1, p_2$ are the indices of the 3D points projecting to the image patches $x_1, x_2$ respectively. This loss penalizes corresponding pairs that are placed far apart, and non-corresponding pairs that are less than $C$ units apart. In particular, for a non-corresponding pair with $\|D(x_1) - D(x_2)\|_2 = 0$ we pay the maximal cost $C$ and, as the distance increases, the loss eventually reaches zero.

The architecture of DeepDesc is shown in Table 3.2. Each convolutional layer consists of four sublayers: filter layer, non-linearity layer, pooling layer and normalization layer.


Figure 3.8: Schematic of a Siamese network, where pairs of input patches are processed by two copies of the same CNN.

The goal is to optimize the network parameters from an arbitrarily large set of training patches. Let us consider a dataset with $k$ patches and $m \leq k$ unique 3D patch indices, each with $c_i$ corresponding image patches. Then the number of matching image patches $P$ (positives) and the number of non-matching image patches $N$ (negatives) are

$$P = \sum_{i=1}^{m} \frac{c_i(c_i - 1)}{2}, \qquad N = \sum_{i=1}^{m} c_i(k - c_i).$$

Since both $P$ and $N$ are intractably large, we can resort to Stochastic Gradient Descent, using random subsets of the training set to estimate the gradient of the loss function. For positives we can randomly sample a set of $s_p$ 3D point indices from the set $\{p_1, \ldots, p_m\}$ and, for each chosen 3D index $p_i$, randomly pick two 2D patches with corresponding 3D point indices. For negatives, one simple idea would be to randomly choose $s_n$ random pairs with non-matching indices; but once the network has reached a reasonable level of performance, most non-corresponding points will already have a distance above $C$, contributing nothing to the loss and the gradient. This can result in a very small and noisy estimate of the gradient, effectively stalling the learning process. Instead, we can iterate over non-corresponding patch pairs to search for "hard" negatives, i.e., pairs that are close in descriptor space and incur a high loss. In this manner it becomes feasible to train discriminative models faster while also increasing performance. In particular, at each epoch we generate a set of $s_n$ randomly chosen patch pairs and, after forward-propagating them through the network and computing their loss, we keep only a subset of the $s^H_n$ "hardest" negatives, which are back-propagated through the network in order to update the weights. Additionally, the same procedure can be used for the positive samples, i.e., we can sample $s_p$ corresponding patch pairs and prune them down to the $s^H_p$ "hardest" positives.
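The following PyTorch sketch illustrates the pairwise hinge embedding loss and the hard-sample mining strategy described above; the margin C, the kept fraction and the tensor names are assumptions of the example, not DeepDesc's actual implementation.

```python
# Pairwise hinge embedding loss with simple hardest-sample mining.
import torch

def hinge_pair_loss(d1, d2, same, C=4.0):
    # d1, d2: (B, dim) descriptor batches; same: (B,) bool tensor,
    # True if the pair shares the same 3D point index.
    dist = torch.norm(d1 - d2, p=2, dim=1)
    return torch.where(same, dist, torch.clamp(C - dist, min=0.0))

def mine_hard(losses, keep_ratio=0.25):
    # keep only the pairs with the largest loss ("hardest" samples)
    k = max(1, int(keep_ratio * losses.numel()))
    hard, _ = torch.topk(losses, k)
    return hard.mean()

# per-pair losses for a sampled batch of positives and negatives:
# loss = mine_hard(hinge_pair_loss(desc_a, desc_b, same_flags))
```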


Table 3.2: DeepDesc architecture: three-layer network with a 64 x 64 input in layer 1 and a 128-dimensional output in layer 3.

Layer                         1        2        3
Input size                    64 × 64  29 × 29  8 × 8
Filter size                   7 × 7    6 × 6    5 × 5
Output channels               32       64       128
L2 pooling & normalization    2 × 2    3 × 3    4 × 4
Activation function           Tanh     Tanh     Tanh
Stride                        2        3        4

3.5.2 TFeat

Recent work in [46] shows that learning representations with triplets of examples gives much better results than learning with pairs using the same network. Inspired by this, TFeat [47] focuses on learning feature descriptors based on triplets of patches. Learning with triplets involves training from samples of the form $\{a, p, n\}$, where $a$ is the anchor, $p$ the positive, which is a different sample of the same class as $a$, and $n$ the negative, a sample belonging to a different class. In our case, $a$ and $p$ are different viewpoints of the same physical point, and $n$ comes from a different keypoint. Optimizing the parameters of the network brings $a$ and $p$ close in the feature space, and pushes $a$ and $n$ far apart. Given

$$\delta_+ = \|D(a) - D(p)\|_2, \qquad \delta_- = \|D(a) - D(n)\|_2,$$

we can define the margin loss as

$$\lambda(\delta_+, \delta_-) = \max(0,\ C + \delta_+ - \delta_-)$$

and the ratio loss [46], which tries to force $\frac{\delta_-}{\delta_+} \rightarrow \infty$, as

$$\hat{\lambda}(\delta_+, \delta_-) = \left(\frac{e^{\delta_+}}{e^{\delta_+} + e^{\delta_-}}\right)^2 + \left(1 - \frac{e^{\delta_-}}{e^{\delta_+} + e^{\delta_-}}\right)^2.$$

The goal of this loss function is to force $\frac{e^{\delta_+}}{e^{\delta_+} + e^{\delta_-}}$ to 0 and $\frac{e^{\delta_-}}{e^{\delta_+} + e^{\delta_-}}$ to 1. There is no margin associated with this loss and, by definition, $0 \leq \hat{\lambda} \leq 1$ for all values of $\delta_+, \delta_-$. Note that, unlike the margin loss, where $\lambda = 0$ is possible, every training sample in this case is associated with some non-negative loss value (see Figure 3.9).
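Both losses are straightforward to express in PyTorch, as in the sketch below; a, p and n are batches of anchor, positive and negative descriptors, and C is the margin.

```python
# Triplet margin loss and ratio loss over descriptor batches.
import torch

def margin_loss(a, p, n, C=1.0):
    d_pos = torch.norm(a - p, p=2, dim=1)        # delta_plus
    d_neg = torch.norm(a - n, p=2, dim=1)        # delta_minus
    return torch.clamp(C + d_pos - d_neg, min=0.0).mean()

def ratio_loss(a, p, n):
    d_pos = torch.norm(a - p, p=2, dim=1)
    d_neg = torch.norm(a - n, p=2, dim=1)
    e_pos, e_neg = torch.exp(d_pos), torch.exp(d_neg)
    pos_ratio = e_pos / (e_pos + e_neg)          # driven towards 0
    neg_ratio = e_neg / (e_pos + e_neg)          # driven towards 1
    return (pos_ratio ** 2 + (1.0 - neg_ratio) ** 2).mean()
```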

TFeat's architecture is a three-layer network, composed of two convolutional layers and one fully connected layer at the end, that takes a 32 × 32 grayscale patch as input and outputs a 128-dimensional descriptor vector. Specifically, TFeat's architecture is a sequence of:


Figure 3.9: (a) Margin ranking loss: it seeks to push n outside the circle defined by the margin C, and to pull p inside. (b) Margin ranking loss values as a function of δ−, δ+. (c) Ratio loss: it seeks to force δ+ to be much smaller than δ−. (d) Ratio loss values as a function of δ−, δ+.

1. input: 32 × 32 grayscale patch.

2. convolutional layer: filter size 7 × 7 with 32 maps.

3. Tanh: activation function.

4. pooling layer: max operator, size 2 × 2 and stride 2.

5. convolutional layer: filter size 6 × 6 with 64 maps.

6. Tanh: activation function.

7. fully connected layer.

8. Tanh: activation function.

9. output: 128-dimension descriptor vector.

3.5.3 HardNet

HardNet [48] is a deep descriptor trained on HPatches. Its architecture is the same as L2Net [49] and is shown in Figure 3.10. The difference between HardNet and L2Net lies in the loss function.

First, a batch X = (Ai, Pi), with i = 1, . . . , n, of matching local patches is generated, where A stands for the anchor and P for the positive. The patches Ai and Pi correspond to the same 3D surface, and in X there is exactly one pair originating from a given 3D point. Second, the 2n patches in X are passed through the network.


Figure 3.10: HardNet architecture: each convolutional layer is followed by batch normalization and ReLU, except the last one. Dropout regularization is used before the last convolution layer.

The L2 pairwise distance matrix D = pdist(a, p) of size n × n is then calculated, where d(ai, pj) = √(2 − 2 aiᵀpj), i = 1, . . . , n, j = 1, . . . , n, and ai and pj denote the descriptors of patches Ai and Pj respectively. Next, for each matching pair (ai, pi) the closest non-matching descriptors, i.e., the 2nd nearest neighbours, are computed:

p_jmin, where jmin = argmin_{j=1,...,n, j≠i} d(ai, pj),
a_kmin, where kmin = argmin_{k=1,...,n, k≠i} d(ak, pi).

Then, from each quadruplet of descriptors (ai, pi, pjmin, akmin), a triplet is formed:

(ai, pi, pjmin)   if d(ai, pjmin) ≤ d(akmin, pi),
(ai, pi, akmin)   otherwise.

The goal is to minimize the distance between the matching descriptor and the closest non-matching descriptor. These n triplet distances are fed into the triplet margin loss:

L = (1/n) Σ_{i=1}^{n} max(0, 1 + d(ai, pi) − min(d(ai, pjmin), d(akmin, pi))),

where min(d(ai, pjmin), d(akmin, pi)) is pre-computed during the triplet construction.
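A compact PyTorch sketch of the hardest-in-batch mining and of this loss is shown below; it assumes L2-normalized descriptors (so that d(ai, pj) = √(2 − 2 aiᵀpj)) and masks the diagonal with a large constant, which is a choice of this sketch rather than of the official implementation.

```python
import torch

def hardnet_loss(anchors, positives, margin=1.0, eps=1e-8):
    """Hardest-in-batch triplet margin loss.

    anchors, positives: (n, 128) L2-normalized descriptors; row i of the
    two tensors describes the same 3D point.
    """
    # pairwise distance matrix: d(a_i, p_j) = sqrt(2 - 2 a_i . p_j)
    dist = torch.sqrt(torch.clamp(2.0 - 2.0 * anchors @ positives.t(), min=eps))
    n = dist.size(0)
    # exclude the matching pairs (the diagonal) from the negative search
    masked = dist + 1e4 * torch.eye(n, device=dist.device)
    neg_a = masked.min(dim=1).values            # min_j d(a_i, p_j), j != i
    neg_p = masked.min(dim=0).values            # min_k d(a_k, p_i), k != i
    hardest_neg = torch.min(neg_a, neg_p)       # closest non-matching descriptor
    pos = dist.diag()                           # d(a_i, p_i)
    return torch.clamp(margin + pos - hardest_neg, min=0.0).mean()
```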

3.6 Patch pair classification benchmark

The patch pair classification benchmark measures the ability of a descriptor to discriminate positive patch pairs from negative ones in the Brown dataset. This dataset, as already described in Section 3.4.1, consists of three subsets, Liberty, Yosemite and Notredame, each containing more than 500k patch pairs extracted around keypoints. The evaluation follows the protocol proposed in [39], where the ROC curve is generated by thresholding the distance scores


Table 3.3: Patch pair classification benchmark. Yos: Yosemite, Lib: Liberty, Not: Notredame.

Training      Lib     Yos     Not     Yos     Lib     Not
Testing           Not             Lib             Yos         Mean
BRISK            74.88           79.36           73.21        75.82
SIFT             22.53           29.84           27.29        26.55
TFeat         3.12    3.85    7.22    9.79    7.82    7.08     6.47
DeepDesc          4.54            8.82            16.19        9.85
HardNet       1.00    1.45    3.12    4.30    3.10    2.59     3.00

between patch pairs. The number reported here is the false positive rate at 95% true positive rate (FPR95), as used in many influential works in the field. For the evaluation, the 100K patch pairs defined in the benchmark are used. Note that DeepDesc does not report performance with training based on a single dataset; therefore, for each test set, the training is performed on the other two datasets.
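As a reference, FPR95 can be computed from the pair distances and labels as in the following NumPy sketch; the variable names and the simple quantile-based thresholding are choices of this sketch.

```python
import numpy as np

def fpr95(dist, labels):
    """False positive rate at 95% true positive rate.

    dist:   descriptor distances of the patch pairs (lower = more similar)
    labels: 1 for matching (positive) pairs, 0 for non-matching (negative) ones
    """
    pos_dist = np.sort(dist[labels == 1])
    # distance threshold that accepts 95% of the positive pairs
    threshold = pos_dist[int(0.95 * len(pos_dist))]
    # fraction of negative pairs accepted at that threshold
    return float(np.mean(dist[labels == 0] <= threshold))
```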

The results for each of the combinations of training and testing using the three subsets of the Brown dataset are shown in Table 3.3 including the average across all possible combinations.

As one can notice, all the deep descriptors outperform the SIFT and BRISK descriptors by a large margin. BRISK performs worst, as expected, since it is a binary descriptor. HardNet has the best results, followed by TFeat and DeepDesc.

In the next chapter, starting from these promising results, we describe how we implemented and evaluated the deep descriptors on the object recognition pipeline at Horus in order to improve its performance.


Chapter 4

Descriptors Assessments

In Chapter 2 and Chapter 3 we gave a general overview of how the problem of object recognition is tackled by handcrafted descriptors and by deep descriptors, and we showed that deep descriptors outperform the handcrafted ones on the Brown dataset. Since our goal is to improve the performance of the object recognition pipeline on the Horus device, we implemented the deep descriptors on the device and evaluated them on the Horus dataset.

4.1 Horus Dataset

The Horus dataset contains two sets, called Horus dataset 1 and Horus dataset 2. Horus dataset 1 is composed of two subsets, namely Foreground (FG) and In The Wild (ITW). The FG subset is composed of a training set (46 objects, 149 images) and a validation set (46 objects, 156 images). In the training set we mimic the process of learning a new object on the Horus device: each object has multiple images taken from different points of view. In the validation set we mimic the process of recognizing an object that was already learnt by the device: the images can have different backgrounds and illumination and can include occluded objects. This is done in order to better replicate a real-world scenario. The ITW subset is composed of 36 images, each of which contains different objects. The ITW subset is more challenging than the FG one since it contains tough transformations, especially in scale. Figure 4.1 shows some example images of Horus dataset 1, both from the FG and the ITW subset.

Horus dataset 2 has just the FG subset, composed of a training set (18 objects, 25 images) and a validation set (18 objects, 87 images). Both the training and the validation set have the same properties as the ones in Horus dataset 1, but they contain different objects. Figure 4.2 shows some example images of Horus dataset 2.


Figure 4.1: Horus dataset 1: FG and ITW. Top-left: an example object from the training set. Top-right: an example of the same object in the validation set. Bottom-left and bottom-right: two example images from the ITW subset.

Figure 4.2: Horus dataset 2: FG. Top row: two examples of images of the training set. Bottom row: two examples of images of the validation set.

Table 4.1: Horus dataset: sets and number of images.

Horus dataset 1                  Horus dataset 2
FG training set       149       FG training set       25
FG validation set     156       FG validation set     87
ITW validation set     36


4.2 Horus object recognition pipeline

For our assessments we followed the Horus object recognition pipeline (see Figure 4.3). Specifically, for the FG subset we followed these steps:

1. We detected the interest points in each image of the training set, using both the SIFT and ORB detectors.

2. For each keypoint we computed the associated descriptor; for the handcrafted ones we passed the keypoints to the descriptor as they are, while for the deep descriptors we extracted from the image the patch centered on the keypoint, with a size given by the keypoint size multiplied by the scale factor, and then we forwarded the patch through the network.

3. For each image in the validation set, we performed steps 1 and 2 and then we found the k nearest neighbours using FLANN, with k set to 4 (see the sketch after this list). We used FLANN instead of brute-force matching since it gives better time performance without changing the found matches too much. For the BRISK descriptor, since it is a binary descriptor, we used the L1 norm, while for all the other descriptors we used the L2 norm.

4. We retained the best matches by thresholding their distance. The threshold was tuned empirically to get the best performance from each descriptor.

5. For each image in the training set that has at least 10 matches, we computed the homography matrix H using RANSAC. We discarded the homographies with det(H) < 0 or with fewer than 10 inliers.

6. The training image with the maximum number of inliers is the one that we matched. If the two images represent the same object we have a correct recognition; if the objects are different we have a wrong recognition; otherwise no image passes the homography estimation process and we have a no recognition.

The pipeline for ITW object recognition differs from the one for FG in the matching part: instead of matching two whole images, we matched an image with just a part of the other image (note that an ITW image contains more than one object).
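The matching and verification steps (3 to 6) can be sketched with OpenCV in Python as follows; the FLANN parameters and the distance threshold are placeholders, not the values tuned on the Horus device, and the KD-tree index shown here only applies to floating-point descriptors.

```python
import cv2
import numpy as np

FLANN_INDEX_KDTREE = 1  # KD-tree index, suitable for float (SIFT-like) descriptors
matcher = cv2.FlannBasedMatcher(dict(algorithm=FLANN_INDEX_KDTREE, trees=5),
                                dict(checks=50))

def match_score(kp_q, desc_q, kp_t, desc_t, dist_thresh=250.0):
    """Number of RANSAC inliers between a validation image (q) and a training
    image (t); 0 if the homography check fails."""
    good = []
    for neighbours in matcher.knnMatch(desc_q, desc_t, k=4):          # step 3
        good += [m for m in neighbours if m.distance < dist_thresh]   # step 4
    if len(good) < 10:
        return 0
    src = np.float32([kp_q[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp_t[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    H, mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)           # step 5
    if H is None or np.linalg.det(H) < 0 or int(mask.sum()) < 10:
        return 0
    return int(mask.sum())                                            # step 6
```

The training image with the highest returned score is then taken as the recognized object; if no image returns a positive score, the output is a no recognition.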

The output of the pipeline can be summarized as follows:

• true positive TP: we have a correct recognition.

• false positive FP: we have a wrong recognition or a no recognition. Note that in our case the no recognition output counts as a false positive, since the validation set does not contain objects that are not in the training set.


Figure 4.3: Horus Object Recognition pipeline.

Given TP and FP we computed the true positive rate TPR as

TPR = TP / (TP + FP),

and the false discovery rate FDR as

FDR = FP / (TP + FP).

In order to compare the results of the different sets, we computed the 95% confidence interval. In particular, we cannot use the normal approximation interval because np̂ ≤ 5 or n(1 − p̂) ≤ 5, where n is the number of trials and p̂ is the proportion of successes in n Bernoulli trials, as explained in [50]. We adopted the Wilson score interval [51], which is defined as:

1/(1 + z²/n) · [ p̂ + z²/(2n) ± z √( p̂(1 − p̂)/n + z²/(4n²) ) ],

where z is the 1 − α/2 quantile of a standard normal distribution corresponding to the target error rate α. For example, for a 95% confidence level we have α = 0.05, 1 − α/2 = 0.975 and z = 1.96.
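A direct transcription of this interval into Python, using SciPy for the normal quantile, might look as follows; the function name and the counts in the usage example are illustrative.

```python
import math
from scipy.stats import norm

def wilson_interval(successes, n, alpha=0.05):
    """Wilson score interval for a binomial proportion, e.g. TPR = TP / (TP + FP)."""
    z = norm.ppf(1.0 - alpha / 2.0)      # z = 1.96 for a 95% confidence level
    p_hat = successes / n
    center = p_hat + z * z / (2.0 * n)
    half = z * math.sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4.0 * n * n))
    scale = 1.0 / (1.0 + z * z / n)
    return scale * (center - half), scale * (center + half)

# Hypothetical example: 120 correct recognitions out of 156 validation images
low, high = wilson_interval(120, 156)
```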

4.3 Assessments Results

The experiments follow the previously described object recognition pipeline on the Horus device. All the handcrafted descriptors that we used


are the ones implemented in the OpenCV library. We used a C++ wrapper around the Torch [52] libraries in order to use the deep descriptors in C++.

Table 4.2 shows the FG results on Horus dataset 1, while Table 4.3 shows those on Horus dataset 2. The results contain:

• Detector: the detector that we used to extract the patches. We used the ORB and SIFT detectors for each descriptor. We noticed that the detector parameters play an important role in the performance of the descriptors, especially for the deep ones; the values of these parameters were tuned empirically in order to increase the performance of each descriptor.

• Descriptor: the descriptor that we evaluated. The handcrafted descriptors evaluated are BRISK and SIFT, while the deep descriptors are TFeat, DeepDesc and HardNet.

• TPR@95%: the true positive rate with 95% confidence interval.

• FDR@95%: the false discovery rate with 95% confidence interval.

• Avg similarity: the average number of inliers found after computing the homography.

• Min similarity: the minimum number of inliers found after computing the homography.

• Max similarity: the maximum number of inliers found after computing the homography.

As one can see from the two tables, the SIFT descriptor outperforms all the other descriptors both in TPR@95% and in FDR@95%. This is in contrast with the TFeat, DeepDesc and HardNet papers, where the authors show that deep descriptors perform better than the SIFT descriptor by a large margin on the Brown dataset (see Section 3.6). DeepDesc has the highest average similarity on the recognized objects. One can also notice that the deep descriptors have a high FDR@95% with a low average similarity; since the correct recognitions have a high average similarity, increasing the minimum number of inliers needed to recognize an object (10, as described in Section 4.2) would decrease FDR@95% without decreasing TPR@95% too much.

Table 4.4 shows the ITW results on Horus dataset 1. Also in this case, the SIFT descriptor outperforms all the other descriptors. Note that the ITW subset is harder than the FG one, as confirmed by the lower TPR@95% results. In fact, the ITW subset has images with objects that are scaled 2 or 3 times
