
HONG KONG UNIVERSITY OF SCIENCE AND TECHNOLOGY

UNIVERSITY OF PISA

MASTER DEGREE THESIS

CloudCache: Caching for Mobile Object Instance Recognition Applications

Author: Luca LOVAGNINI

Supervisor: Dr. Nicola TONELLOTTO

SyMLab

Department of Computer Science

Department of Computer Science and Engineering


Declaration of Authorship

I, Luca LOVAGNINI, declare that this thesis titled, “CloudCache: Caching for Mobile Object Instance Recognition Applications” and the work presented in it are my own. I confirm that:

• This work was done wholly or mainly while in candidature for a research degree at this University.

• Where any part of this thesis has previously been submitted for a degree or any other qualification at this University or any other institution, this has been clearly stated.

• Where I have consulted the published work of others, this is always clearly attributed.

• Where I have quoted from the work of others, the source is always given. With the exception of such quotations, this thesis is entirely my own work.

• I have acknowledged all main sources of help.

• Where the thesis is based on work done by myself jointly with others, I have made clear exactly what was done by others and what I have contributed myself.

Signed: Date:


“Sixsmith, I climb the steps of the Scott Monument every morning, and everything becomes clear. I wish I could show you all this brightness. Don't worry, all is well, all is so perfectly, damnably well. I understand now that the boundaries between noise and sound are conventions. All boundaries are conventions, waiting to be transcended; any convention can be transcended, if only one can first conceive of doing so. In moments like these, I can clearly feel your heart beating just as I feel my own, and I know that separation is an illusion. My life extends far beyond the limits of myself.”


University of Pisa

Abstract

Department of Computer Science

Master Degree

CloudCache: Caching for Mobile Object Instance Recognition Applications

by Luca LOVAGNINI

Cloud Computing made the idea of on-demand, scalable and highly available resources a low-cost reality. In the smartphone era, where devices (with limited computation and battery power) produce a massive quantity of data, Mobile Cloud Computing has become a hot topic in the research community. In particular, it has been successfully exploited in the development of many Object Instance Recognition (OIR) applications designed for mobile devices, where local computation is too expensive for the single smartphone. In this context, reducing the time needed to compute tasks benefits both providers and end-users, giving a better user experience and saving battery life. In this thesis we introduce CloudCache, a framework designed to improve the performance of mobile OIR-related applications, where previous (expensive) computations performed by the Back End system are reused in order to speed up similar tasks in the future. The design and implementation of CloudCache are described, where different tools from previous works in Computer Vision have been used to achieve the desired performance. Some of these tools have been parallelized to exploit the multi-core architectures commonly available from most Cloud providers nowadays, obtaining real-time results. CloudCache has been tested on three recognition applications, where two new image datasets are presented to evaluate the system, providing results for similar tasks previously computed in 13 milliseconds in the worst case on a six-core machine . . .


Acknowledgements

It's not easy to cite all the people who have helped me during the last year and a half of this research, but I'll do my best and I'm sorry if I forget someone. First of all, I want to thank Professor Pan Hui, who gave me the unique opportunity to develop my master thesis research at HKUST: I think that there are no words to tell you how grateful I am for letting me work at SymLab and live in Hong Kong for over 6 months. I am grateful in the same way to my supervisor, Nicola Tonellotto, who accepted this crazy idea of mine to write my thesis on the other side of the world, following my progress and giving me precious advice to improve my research skills. I am sorry if sometimes I looked stubborn, but I deeply believe that if one day I become a good researcher it will also be because you spurred me on to improve myself day by day. Then, I want to deeply thank all my teammates: Farshid, who had the idea for this project, Shawn, who helped me so much during the evaluation part, and Tad. I am incredibly grateful to Doctor Fabrizio Falchi, who helped me on many different occasions with his huge experience in CBIR and cache systems. Then I want to say thank you to Relja Arandjelovic, who answered all my countless e-mails about image encoding techniques: this work would not have been possible without your help. Finally, special thanks go to all my closest friends whom I met during those months: Evan, Lisa, Lucas, Babbo, Carlo and all the other wonderful people that I met in that parallel dimension called Hong Kong.

I want to thank the University of Pisa for making me the computer scientist I am today, opening the doors to a world of opportunities that I am only now beginning to explore. Thanks to all my course mates who lightened the burden of studying over these years, in particular Giuseppe, Enrico, Flavio, Francesco, Stefano, Alessandro, Giulia, Claudio and Andrea. Thanks to the friends of Casa del Marce, for the countless evenings in which I grew up with all of you. I thank the Gruppo Draghi for making me understand that "in life it is easy to be serious, the hard part is knowing how to play". Thanks to Marta for making me understand that Music is the only life companion I will never be able to walk away from. I thank Luana for having been not only a friend, but for showing me how every human being can be reborn from within. Thanks to Elisa for the thousand adventures and explorations we shared. When I was introduced to computer science (in the third year of high school) I was terribly disappointed, so much so that I began to hate it and obtained the worst results I have ever achieved in school. It is only thanks to Giuseppe that, that summer, I devoted myself with his help to understanding the subject I love most in the world today, and the same goes for mathematics, physics and all the modern scientific disciplines. In the meantime, Tiziana inspired me with just as much fervour towards art, the beauty of the world around us and the importance of cultivating oneself, like the flowers she has (indeed, we all have) always loved. I believe the greatest fortune for a child is to have two parents who support him always and in any case in pursuing his dreams, even if those dreams take him far from home and from them. For this reason, my last and greatest thanks go to you, in the hope that one day I may repay all the sacrifices and the love you have given me since my first day of life.


Contents

Declaration of Authorship iii

Abstract vii

Acknowledgements ix

Contents xi

List of Figures xv

List of Tables xix

List of Abbreviations xxi

1 Introduction 1

1.1 Our Contribution . . . 2

2 Background 5

2.1 Computer Vision OIR Applications . . . 5

2.2 Similarity is application dependent . . . 6

2.3 Caching. . . 8

2.3.1 Exact Caching . . . 8

2.3.2 Similarity Caching . . . 9

2.4 The Nearest Neighbor Problem . . . 12

2.4.1 Locality Sensitive Hashing . . . 14

2.4.2 Tree based . . . 15

2.4.3 Graph based . . . 15

2.4.4 Product Quantization . . . 17

2.4.5 ANN Cases of Use and Benchmarks . . . 18

2.5 Image Feature Detector and Descriptor . . . 19

2.5.1 SIFT Algorithm . . . 19

2.5.2 SURF and Hessian-Affine Detectors . . . 21

Speeded Up Robust Features (SURF). . . 21

Hessian Affine Detector with SIFT Descriptor . . . 22

2.5.3 Others Keypoints Detectors and Descriptors . . . 22

2.5.4 Local Feature Detector and Descriptor Comparison . . 23

2.6 Image Encoding . . . 24

2.6.1 Bag of Features . . . 24

2.6.2 VLAD . . . 27


Residual Normalization (RN) . . . 29

Local Coordinate System . . . 29

Vocabulary Adaptation . . . 29

2.6.3 Fisher Vectors . . . 29

2.6.4 Similarity Threshold on Image Codes . . . 30

2.6.5 CloudAR . . . 31

2.7 Deep Learning . . . 32

2.7.1 Overview . . . 32

2.7.2 CNN Efficiency and Benchmarks . . . 33

3 Problem Statement and Motivation 35

3.1 Problem Statement . . . 35

Motivation Questions . . . 35

Research Questions . . . 35

Design Questions . . . 35

3.1.1 Goals and non-goals . . . 36

Goals . . . 36

Non-Goals and Non-Priority Aspects . . . 37

3.2 Motivation . . . 37

3.2.1 Why CloudCache? . . . 37

3.2.2 Applications . . . 39

Movie Poster Recognition . . . 39

Landmark Recognition. . . 41

Painting Recognition . . . 41

(Visual) Place Recognition . . . 42

3.2.3 Other possible applications for CloudCache . . . 42

CBIR Systems . . . 42

Improve Google Goggles . . . 42

3.2.4 Common aspects among applications . . . 43

3.2.5 Applications where CloudCache can not be used . . . 43

4 Approach 45

4.1 Assumptions . . . 45

4.2 Requirements . . . 46

4.3 Evolution of our approach . . . 48

5 Implementation 51

5.1 Which Programming Language? . . . 51

5.2 External Tools and Libraries . . . 52

OpenCV . . . 52

VLFeat . . . 53

Hessian-Affine Detector . . . 53

5.3 System Architecture . . . 53

5.4 CloudCache Workflow . . . 55

5.4.1 Training Workflow . . . 55

5.4.2 System Setup . . . 58

5.4.3 Querying CloudCache . . . 61


5.6 PHA: Parallel Hessian Affine detector . . . 64

5.6.1 Serial Algorithm . . . 64

5.6.2 Parallel Algorithm . . . 65

Ideal Algorithm . . . 65

Parallel Gaussian Blurs . . . 66

Splitting Keypoints Detection and Descriptors Compu-tation . . . 66

Keypoint Detection Load Imbalance . . . 67

Descriptor Building Load Imbalance . . . 68

Parallel Hessian-Affine Algorithm . . . 68

Performance Model . . . 69

5.6.3 Vectorization . . . 70

cv::GaussianBlur . . . 70

interpolate . . . 71

samplePatch . . . 71

5.7 Trainer . . . 72

5.7.1 k-Means . . . 72

5.7.2 Gaussian Mixture Model. . . 72

5.7.3 Local Coordinate System (LCS) Implementation . . . . 72

5.8 Image Encoder. . . 73

5.8.1 Fisher Vectors (FV) . . . 73

5.8.2 VLAD . . . 74

5.9 Cache Data Structure and Parallel Data Vectors Similarity . . . 75

5.10 Threshold Trainer . . . 77

5.10.1 Looking for a Threshold in Different Domains . . . 77

5.10.2 The Receiver Operating Characteristic (ROC) Curve. . 78

6 Evaluation 81

6.1 Implemented Applications . . . 82

6.2 Datasets . . . 83

6.2.1 Previous Works . . . 83

Oxford Building Dataset . . . 83

Paris Building Dataset . . . 85

6.2.2 New Datasets . . . 85

Movie Posters Dataset . . . 85

Uffizi Paintings Dataset . . . 87

OxfordForCache . . . 89

Bad Image Levels . . . 89

6.3 System Evaluation . . . 92

6.3.1 System Configuration and Setup . . . 93

Legends . . . 94

6.3.2 PHA Evaluation . . . 94

6.3.3 Descriptors Evaluation . . . 95

6.3.4 Encoders Evaluation . . . 99

6.3.5 Cache Evaluation . . . 102

Testing Multiple Criteria . . . 103


Cache Size Test . . . 107

Scalability Test . . . 109

Threshold Test . . . 112

6.4 Comparison with CloudAR . . . 113

7 Conclusions 115

A Code Structure 117

A.1 Code Structure . . . 117

A.1.1 Core . . . 117

A.1.2 CloudCache . . . 117

A.1.3 Cache . . . 118

A.1.4 Descriptors . . . 118

A.1.5 Descriptor . . . 118

A.1.6 Encoders . . . 119

A.1.7 Encoder . . . 119

A.2 DatasetTests . . . 119

A.2.1 DatasetTest . . . 120

A.2.2 CCDatasetTest . . . 120

A.2.3 OxfordTestManager . . . 121

A.3 Utilities . . . 121

A.3.1 CommandLineParser . . . 121

A.3.2 FileSystem . . . 121

A.3.3 Utility . . . 121

A.4 make . . . 122

B Code Structure 123

B.1 User Manual . . . 123

B.2 System Requirements . . . 123

B.3 Building OpenCV . . . 123

B.4 Building CloudCache . . . 124

Bibliography 127


List of Figures

2.1 An example of Object Recognition applications. [206] . . . 6

2.2 An example where the closest cached query does not have the optimal safe radius. [75] . . . 10

2.3 Binary hash codes produced through locality sensitive hashing functions using random hyperplanes: the top (bottom) vector is relative to the red (yellow) dot. [156] . . . 12

2.4 Illustration of the HNSW idea. Starting from the entry point at the top layer (shown in red), red arrows show the direction of the greedy algorithm to the query point (shown in green) and how the characteristic radius decreases level by level. [135] . . . 16

2.5 Four quantizers of 64 centroids each. [114] . . . 17

2.6 Speed/Precision trade-off of different ANN algorithms. [11] . 18

2.7 A scale space example. [193] . . . 19

2.8 Process to obtain Difference of Gaussian. [193] . . . 20

2.9 An example of SIFT descriptors matching, taken from [214] . . 21

2.10 A graphical representation of SIFT keypoints. [101] . . . 21

2.11 Difference between detected keypoints using the Hessian Affine (bottom) and the Harris Affine (top) detectors. [159] . . . 23

2.12 Keypoints detected using Dense SIFT. [182] . . . 24

2.13 A simplified example of visual words. [69] . . . 25

2.14 Illustration of burstiness. Features assigned to the most “bursty” visual word of each image are displayed, taken from [107]. . . 28

2.15 CloudAR overview showing the data flow and the main components of the mobile client and the server. [228] . . . 31

2.16 CloudAR components of latency for both the local offloading scenario and the Cloud offloading scenario. [228] . . . 31

3.1 An example power-law graph, being used to demonstrate ranking of popularity. [58] . . . 38

3.2 An example power-law graph, being used to demonstrate ranking of popularity. [72] . . . 38

3.3 An example power-law graph, being used to demonstrate ranking of popularity. [72] . . . 39

3.4 An example of a cache hit performed by CloudCache for the Movie Poster Recognition application. . . 41

5.1 CloudCache architecture. We reported the name of the libraries which already implemented some of these elements between brackets, e.g., "SIFT (OpenCV)". . . 53


5.2 CloudCache training process workflow. For each element, we used the same colors of Figure 5.2. . . 56

5.3 Training descriptors, serial pipeline. . . 58

5.4 Training descriptors, parallel farm. All elements are red because this is all part of the Descriptor Module of Figure 5.2. . . 58

5.5 CloudCache setup pipeline. For each element, we used the same colors of Figure 5.2.. . . 59

5.6 CloudCache query submission process. For each element, we used the same colors of Figure 5.2. . . 61

5.7 Flowchart of the Hessian-Affine algorithm . . . 65

5.8 Flowchart of the Hessian-Affine algorithm . . . 66

5.9 Parallel Hessian-Affine flowchart . . . 70

5.10 Example of Cache with size twelve and using four threads to find the closest cached code w.r.t. the query code CodeQ. . . . 77

5.11 An example of a ROC curve: the orange curve is the best one, while the blue one is the worst one. . . 79

6.1 Sample images from the Oxford Building Dataset . . . 84

6.2 Sample images from the Oxford Building Dataset . . . 86

6.3 Discrete probability function of the Movie Poster dataset. The red line is the trend line representing a perfect power-law distribution. . . . 87

6.4 Sample images from the Uffizi Painting dataset by (from top to bottom row) Caravaggio, Leonardo da Vinci and Piero della Francesca. . . 88

6.5 Discrete probability function of the Painting dataset. . . 89

6.6 Discrete probability function of the adapted Oxford dataset. . 90

6.7 Sample bad images from each tested dataset. . . 91

6.8 Legends in CloudCache. . . 94

6.9 Descriptor time performance for HA and PHA by using 12 threads. . . 96

6.10 mAP of PHA and HA when using VLAD codes and k = 64. . 96

6.11 speedup(12) of PHA over HA in relation to image resizing. . . . 96

6.12 Descriptors precision depending on image resizing. . . 97

6.13 Descriptor times for different descriptors. . . 98

6.14 Descriptor times for SURF and PHA only. . . 99

6.15 VLAD and FV precision by using PHA and different numbers of centroids. . . 100

6.16 VLAD and FV encoding time depending on the number of centroids using PHA. . . 100

6.17 Comparison between VLAD codes on different number of cen-ters by using PHA and SURF. . . 101

6.18 Encoding time for VLAD codes when using PHA and SURF descriptors.. . . 102

6.19 CHR in relation to the number of setup images per dataset and descriptor. . . 106

6.20 CCHRT depending on nQueriesFraction, using PHA. . . . 107


6.22 How CHR changes depending on cache size. Dark lines show the optimal replacement obtained by using Bélády's optimal replacement algorithm for the chosen threshold. . . 108

6.23 CCHRT in relation to cache size. . . 109

6.24 Cache Look-Up time based on (small) cache size. . . 110

6.25 Cache Look-Up time on large cache. Horizontal bars are the real-time constraints for the considered descriptor and encoder. 111

6.26 Zoomed version of Figure 6.24. . . 111

6.27 Look-up operation speedup using 100 thousand cached codes and SURF (d = 4096) and PHA (d = 16384) codes. . . 112

6.28 Cache Hit Ratio CHR for different datasets (one per color). Vertical lines are thresholds decided by our algorithm for the corresponding dataset. . . 113

6.29 Correct Cache Hit Ratio (CCHR) for different datasets (one per color). Vertical lines are thresholds decided by our algorithm for the corresponding dataset. . . 114


List of Tables

2.1 Different CNNs performance on high-end components. . . 34

6.1 Cache performance depending on image quality per dataset. . 104

6.2 Cache performance results grouped by different criteria. . . . 105

6.3 Cache performance results grouped by different criteria without considering bad paintings. . . 106

6.4 Descriptor (DT), encoding (ET) and remaining time for look-up (in seconds) by using optimal configurations. . . 109


List of Abbreviations

CC Cloud Computing

MCC Mobile Cloud Computing

OIR Object Instance Recognition

AR Augmented Reality

CBIR Content Based Image Retrieval

IRMA Image Recognition Mobile Applications

NN Nearest Neighbors

ANN Approximate Nearest Neighbors

BoF Bag (of) Features

VLAD Vector (of) Locally Aggregated Descriptors

FV Fisher Vectors

GMM Gaussian Mixture Model

LRU Least Recently Used

SIFT Scale Invariant Feature Transform

SURF Speeded Up Robust Features

HA Hessian Affine

PHA Parallel Hessian Affine

ROC Receiver Operating Characteristic

MAP Mean Average Precision

CHR Cache Hit Ratio


To Ade,

Bacco,

Efesto,

l'Oracolo

and Philophrosyne . . .

You are my Olympus.


Chapter 1

Introduction

In the last decade, Cloud Computing (CC) has become a popular solution [111] for starting new businesses without facing IT infrastructure and personnel costs. Elasticity has permitted dynamic (and convenient) control of resources based on content demand, scalability has allowed big quantities of data to be processed, and failure-tolerant mechanisms have guaranteed high availability (with a high-quality User Experience). Daily-use applications like Gmail, Facebook, Amazon.com or YouTube heavily use Cloud technologies. Big players like Google, Amazon and Microsoft provide powerful, reliable and cost-efficient Cloud platforms ([87], [8] and [143], respectively).

In the smartphone era, users are moving from desktop computers to mobile devices: more than half of YouTube's videos are watched on smartphones [223], Instagram [99] is not available for traditional PCs, there are more mobile than desktop connections to Spotify [203] and 986 million active users on Facebook mainly use mobile devices [71]. However, since smartphone devices still suffer from limited computational power, unreliable networks and short battery life [173] [189], Mobile Cloud Computing (MCC) has become a hot topic for the research and industrial communities [169]. It is clear that in such a scenario saving computation time is crucial for both Cloud providers (using fewer resources) and end-users (saving battery and having a better User Experience). For these reasons, we face the challenging problem: can we take advantage of tasks recently submitted (by mobile applications) to the Cloud (or any kind of back end system) in order to hasten the same jobs with similar inputs? It is clear then that in this work the term performance refers to the time necessary to complete a task submitted by a user.

However, providing a general solution for the many different application domains used in MCC could be too challenging: mobile applications for Speech Recognition, Music Retrieval, Recommendation Systems or Computer Vision (just to name a few) are very common in MCC, but the approaches used to develop them are too heterogeneous to find a general caching solution. For example, Music Recognition is implemented by encoding songs into hash codes, which are then compared (looking for a perfect match) with hash codes stored in the remote database (containing the hash codes of the known songs) [216]. On the other hand, Music Recommendation algorithms such as Annoy [202] use Approximate Nearest Neighbor techniques to recommend songs to users from mobile devices. As we can see, even applications from similar domains (such as Music Recognition and Music Recommendation) use completely different approaches to solve the considered problem. In such a scenario, providing a single solution to the problem that we stated above for a specific kind of application makes more sense than a general approach trying to find a common solution across different domains.

1.1 Our Contribution

We focused our research on a specific kind of computer vision application: Mobile Object Instance Recognition Applications. Since a common requirement in mobile applications is real-time processing, it is clear how task completion time plays a crucial role in this context. MCC is becoming a common solution to develop Computer Vision applications [50]. In particular, CloudAR [228] is a recent library to easily develop mobile Augmented Reality (AR) applications exploiting the power offered by MCC without requiring the application developer to have any particular knowledge about (M)CC. However, each input submitted by the user (e.g., an image or a video frame) is processed by CloudAR without exploiting similar tasks previously processed. In a few words, CloudAR does not remember similar inputs, even the ones which have been recently computed, and a lot of similar (and time-consuming) computations are performed from scratch. Notice that this is true not only for CloudAR, but for any back end system which does not implement any (similarity) caching mechanism. As a result, CloudAR (and most back end systems) cannot perform recognition tasks in real time.

The goal of our research is to develop a framework which tries to speed up OIR tasks thanks to recent results, obtained from similar inputs, submitted to (and processed by) a remote back end system.

We have developed CloudCache, a framework which stores in a cache structure the client inputs for a specific task and the corresponding computation results, in order to quickly answer future, similar queries for the considered application. CloudCache has been specifically designed to provide solutions for OIR tasks with minimal effort: it can be placed between the user and the back end system which processes incoming queries. To our knowledge, this is the first work where cache mechanisms are used to improve the performance of OIR applications. In addition, the CloudCache library has been designed so that it can be used by any application whose queries can be expressed in a vector space, offering as much flexibility as possible and not limiting its usage to OIR-related applications.

During our work we have faced many design and technical choices, and in this thesis we describe why some solutions, strategies or tools have been preferred over others (e.g., why use a local feature descriptor instead of a global one for describing images?). In particular, we made parallel solutions a primary aspect of this thesis, where we greatly improved the performance of popular image description and encoding algorithms by exploiting multi-core architectures (which are a standard in CC environments nowadays). Even the cache look-up process has been parallelized in order to obtain the best performance, especially when many objects have been cached.

Computer Vision is another crucial aspect of this work and many different tools have been compared to find the best solution for our needs. In particular, part of our contribution deals with comparing popular Computer Vision approaches in terms of time (while usually precision is mainly considered in this kind of research). A minor contribution of this work is a detailed study about how a larger image size does not necessarily mean more precise results, while in most of the related works the opposite is assumed. Finally, to implement our cache mechanism we need some kind of similarity threshold: if the similarity measure between two images is lower than a given threshold, then they are considered similar images treating the same subject, otherwise they are considered different objects (with different results). This problem has never been treated before and we adopted a simple but efficient solution to provide an accurate threshold.

We will show the performance (in terms of task completion time and precision) of an OIR-related application tested on three different subjects. In order to do so, we re-adapted a popular image dataset used to test CBIR systems so that it can be used for the given application, while we created two novel image datasets to test the given application on the remaining two subjects. Finally, we will show how CloudCache is used on the same application in order to avoid similar computations, and the performance gains from this strategy. In particular, we will use CloudCache to boost the recognition process performed by CloudAR, providing object information in one tenth of the time needed by this new library.

This document is organized as follows: Chapter 2 provides the necessary background about CC, MCC, memoization techniques and the computer vision techniques used to implement CloudCache. Chapter 3 introduces the Problem Statement and describes, as Motivation, several possible mobile applications where CloudCache could be used. Chapter 4 lists our assumptions and describes possible advantages and disadvantages of CloudCache. Chapter 5 describes CloudCache's design and architecture and describes its implementation in detail, including our choices and difficulties in such a project. Chapter 6 describes our experimental evaluation of CloudCache. Finally, Chapter 7 concludes and treats possible future works.


Chapter 2

Background

In this chapter we will see how the concept of similarity depends on the application that we are considering. This justifies why CloudCache focuses on a particular application domain (i.e. OIR). However, we will show that vector spaces and popular distance functions help us to define object similarity. Next, we will show how previous studies achieved something similar to CloudCache for other kinds of applications or proposed more general solutions (especially for metric spaces). Then, we will briefly review the k-Nearest Neighbor (k-NN) problem, which is useful in case we want to use CloudCache on a large scale. Finally, we will introduce many different Computer Vision tools that have been successfully used on different kinds of applications: these tools will be combined to implement the approach followed by CloudCache (and some of them will be improved in terms of performance to obtain the best results).

2.1 Computer Vision OIR Applications

In this section we are going to review the problem of Object Instance Recognition (OIR), i.e. given a query image containing an object belonging to a specific category, label the object with a sub-category. Alternatively, the goal of instance-level recognition is to match (recognize) a specific object or scene [214].

Some examples include:

• Given a query image containing a painting, return the title of the painting (e.g., Mona Lisa by Leonardo da Vinci).

• Given a query image containing a building, return the name of the building (e.g., Statue of Liberty).

• Given a query image containing a dog, return the breed of the dog (e.g., Beagle).

• Given a query image containing some currency, return the name of the currency.


FIGURE 2.1: An example of Object Recognition applications. [206]

This kind of application did not receive much attention in the literature and, because of that, such applications are often referred to as Object Recognition applications, but the two are definitely distinct: Object Recognition consists of identifying one or more objects in a digital image or video, not their specific instances.

The difference between these two types of applications is clear from Figure 2.1: as the reader can see, the Object Recognition application tries to identify multiple generic object categories in the query image (e.g., a cat), while Instance Recognition is more fine-grained (e.g., Siamese cat). Notice that usually in OIR we are not interested in localizing objects in space, while this is usually the case for Object Recognition applications. Finally, note that in OIR usually one result is returned, while in Object Instance Retrieval a set of ranked results is returned.

The state of the art for Object Instance Recognition systems is usually identified with Google Goggles, which is able to identify object instances from many categories, such as landmarks, books, artworks and logos. In the literature it is easier to find works on specific subject fields, such as Painting Recognition or Landmark Recognition papers [141] [6] [231].

A common pattern can be found in all these works: given a query image, we extract a set of features from the query image which describe it, encode them into some kind of codes (i.e. vectors) and finally measure the similarity between two codes by using some distance function (e.g., the Euclidean distance). The first two stages are described in sections 2.5 and 2.6, respectively. The concept of "object similarity" is instead described in the next section.

2.2 Similarity is application dependent

Given a query, CloudCache tries to find the most similar query already processed by the system and return the associated result. But how can we consider two queries "similar"? Here we report some examples of similarity between objects in different fields:

• Text similarity: this is one of the first fields where there was an attempt to define a concept of similarity between objects (documents). Applications where we are interested in finding similar documents are so many and so different that there are plenty of measures of similarity [86] [22] [142]. For example, two different documents could be compared based on their term frequency, position or semantics. A simple (even if not so efficient) approach in information retrieval is to create TF-IDF vectors and compute the cosine similarity between them. More sophisticated and efficient solutions were proposed in the latest years [227] [102] [180] [19].

• Image similarity: computer vision and image processing have seen many advances in the last 20 years and many measures and techniques for similarity have been defined [147]. For example, we can use color histograms to measure the similarity between two images [97]. Another approach is to describe an image through some notion of feature, which has to be detected and then possibly described. A real innovation in this respect was introduced by Harris and Stephens with their corner detection algorithm [93]: as when we solve a puzzle, corners can be considered important features inside an image. We will see more details about this aspect in the Computer Vision background described later. Deep learning techniques have been recently used for image classification [54] and Content-Based Image Retrieval [215].

• Music similarity: there has been a lot of activity in the Signal Processing community in order to understand if two music sources can be considered similar [131] [21] [15] [138] [190]. A popular approach presented in [131] clusters spectral features by using k-Means, then uses the Earth Mover's Distance to compare different histograms. A completely different approach proposed in [138] uses Support Vector Machines (SVM) to classify songs based on features calculated over their entire lengths.

From the previous examples two points are clear: first, objects from different domains have different concepts of similarity; second (and more important), different solutions and approaches are used in different domains (or even the same one!) to evaluate such similarity. However, we can identify a common pattern between some of these approaches: the input object is encoded into some kind of vector and a distance function is defined in order to measure the similarity with other encoded vectors. Numerous surveys try to describe the most common and popular distance functions in vector spaces and their applications [45] [4] [51] [125]:

1. Jaccard similarity coefficient: a popular index used in information retrieval for scoring and measuring the similarity of sample sets on binary (and non-binary) variables. Application examples of such a distance are query-document matching or plagiarism detection.

2. Cosine similarity: the similarity between two vectors using this measure is quantified as the cosine of the angle between them. It's widely used in information retrieval for document similarity, producing values in the [0, 1] interval.


3. Hamming distance: given two binary vectors of the same length, the Hamming distance between them is the number of bits in which they differ from each other. It's particularly fast to compute on binary vectors by exploiting the XOR operator. For this reason, many binary descriptors have been recently developed in image processing.

4. Euclidean distance: this distance is considered a standard metric for geometrical problems. The Euclidean distance is the ordinary distance between two points and it is widely used in clustering problems (it's the default distance in the k-means algorithm). It is used, for example, to measure the similarity between two SIFT descriptor vectors.

5. χ² distance: according to [226], the chi-square distance is an effective measure for comparing histogram features, for example in the Bag of Words model.

6. Kernel functions: this family of functions became popular in SVM algorithms for enabling operation in high-dimensional spaces through a simple dot product (the so-called kernel trick). However, it has been shown [205] how these functions can be exploited as a good measure of similarity.

To conclude this section: defining similarity between objects is too application dependent to propose a general solution. However, some applications from different domains use the same set of (metric) distances to measure the similarity between two objects encoded into vectors.
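
To make this shared vector-space machinery concrete, below is a small sketch (our own illustration, not code from the thesis) of three of the distances listed above, Euclidean, cosine and Hamming, computed on plain C++ vectors; the example data is illustrative only.

// Minimal sketch of three of the distance/similarity functions listed above.
#include <bitset>
#include <cmath>
#include <cstdint>
#include <iostream>
#include <vector>

// Euclidean distance: the ordinary distance between two points.
double euclidean(const std::vector<double> &a, const std::vector<double> &b) {
    double s = 0.0;
    for (size_t i = 0; i < a.size(); ++i) s += (a[i] - b[i]) * (a[i] - b[i]);
    return std::sqrt(s);
}

// Cosine similarity: the cosine of the angle between the two vectors.
double cosine(const std::vector<double> &a, const std::vector<double> &b) {
    double dot = 0.0, na = 0.0, nb = 0.0;
    for (size_t i = 0; i < a.size(); ++i) {
        dot += a[i] * b[i];
        na += a[i] * a[i];
        nb += b[i] * b[i];
    }
    return dot / (std::sqrt(na) * std::sqrt(nb));
}

// Hamming distance on binary vectors: XOR the words and count the set bits.
int hamming(uint64_t a, uint64_t b) {
    return static_cast<int>(std::bitset<64>(a ^ b).count());
}

int main() {
    std::vector<double> x{1.0, 0.0, 2.0}, y{0.5, 1.0, 2.0}; // toy vectors
    std::cout << "euclidean: " << euclidean(x, y) << "\n"
              << "cosine:    " << cosine(x, y) << "\n"
              << "hamming:   " << hamming(0b101101u, 0b100111u) << "\n";
    return 0;
}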

2.3 Caching

2.3.1 Exact Caching

Caching and buffering mechanisms were introduced in DBMSs and high-performance applications, where frequently requested memory (or disk) pages are stored in fast and small storage devices [68] [199]. In particular, the earliest relevant strategies were query folding [184] and semantic caching [61]. With the advent of the World Wide Web, caching strategies became even more important given the high cost of loading web pages [217]. Nowadays, caching has revealed itself to be a successful and necessary tool for improving web search engine performance [139] [123] [17] [39] [201]: caching the results of a large number of frequently submitted queries has the advantage of reducing the cost on the back end servers and the response times experienced by end-users. This is particularly useful since users share the same query topics according to an inverse power-law distribution [221] [195]. Lempel et al. proposed Probabilistic Driven Caching (PDC), which associates a probability distribution with all the possible queries that a search engine could receive. In particular, for queries regarding the first page of results an SLRU policy is used, while a heap stores answers from the second page onward. The distribution function evaluates to zero all the queries that have not been previously seen. The rest of the distribution is built using statistics computed on the query logs. Highly probable queries are highly ranked and have a low probability of being evicted from the cache. This probability distribution is used only for queries from the second page. In addition, PDC is the first policy which prefetches user requests, exploiting a model based on user behavior and follow-up queries (i.e. queries requesting successive pages of results). PDC was tested on a query log of AltaVista obtaining a 53.5% hit ratio with a cache of 256,000 elements and 10 pages prefetched, which is a very good result. Another scenario where caching is useful (or even necessary) for search engines is when the backend is unavailable for some reason, e.g., power failure, natural disaster, hardware and software failure, network partition, operator error, or denial-of-service attack [158]. Cambazoglu et al. try to face this problem [39] by generating approximate answers exploiting query results available in the result cache of the search engine. In particular, they propose two different solutions: the first one aggregates the results of similar queries that were previously cached in order to create synthetic results for new queries. In the second approach, new queries are answered through an inverted index containing the textual information in the cache.

2.3.2 Similarity Caching

The last paper that we cited is particularly interesting since it uses caching in an unconventional way: the authors used the cache of the search engine to produce similar (approximate) answers in case of unavailability of the back-end service. All the other approaches are based on the concept of exact hit, where we gain performance only if our query is already contained in the cache. However, we can consider using Similarity Caching: an approximate hit happens if the requested item is similar (according to some criterion, i.e. distance) but not necessarily equal to some cached item. In the works that we are going to cite, each result object is stored in the cache together with its features (i.e. a vector) and a user query is an object in the same feature space. So the similarity between a given query object and a cached result object can be approximately computed. As the reader can notice, representing the query as a feature vector is the same approach that we have introduced in section 2.2.

In 2009, Chierichetti et al. [49] discussed a concept of similarity caching, providing the first theoretical analysis of the topic. In particular, they proved how the problem becomes easier as we proceed from general metric spaces to those of bounded doubling dimension, and to Euclidean metrics. In [163] the authors propose similarity caching for content-match applications and contextual advertisement systems, and the way the cache is managed is interesting: if the similarity between the requested item and a cached one (w.r.t. some similarity function) is bigger than a given threshold, then there is an approximate hit and the cached item (or some other object linked to it) is returned as result. Otherwise there is a cache miss: the requested item is inserted into the cache, evicting the cached item with the oldest reference time. This approach is called SIM-LRU and it has an interesting property: no two items in the cache are closer than the given threshold, which means that the space of items is well covered (i.e. the cache doesn't contain redundant items). CLS-LRU avoids the redundancy given by objects around a cached item whose balls significantly overlap with each other. Finally, they introduce randomization in the approximate hit/miss decision, so if there is a high similarity there is a high probability of a cache hit, while there is a small probability of a cache miss. As a result, if a requested item is submitted very often (and so it is important and no approximation should be returned), sooner or later it will be cached, and so when the cached object is requested again (which happens often) there will be an exact cache hit (and not an approximate one).
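
The following is a minimal sketch (our own interpretation of the SIM-LRU idea summarized above, not the code of [163]) of a threshold-based similarity cache: a lookup linearly scans the cached feature vectors, declares an approximate hit when the closest entry is within the threshold, and on a miss the caller computes the result on the back end and inserts it, evicting the least recently used entry when the cache is full. The distance function, the threshold and the capacity are illustrative assumptions.

// Sketch of a SIM-LRU-style similarity cache: threshold-based approximate
// hits with LRU eviction (our interpretation, not the code of [163]).
#include <cmath>
#include <list>
#include <optional>
#include <string>
#include <vector>

struct Entry {
    std::vector<float> features; // feature vector of the cached query
    std::string result;          // result previously computed by the back end
};

class SimilarityCache {
public:
    SimilarityCache(size_t capacity, float threshold)
        : capacity_(capacity), threshold_(threshold) {}

    // Approximate hit: return the result of the closest cached entry if its
    // distance to the query is below the threshold, refreshing its recency.
    std::optional<std::string> lookup(const std::vector<float> &query) {
        auto best = entries_.end();
        float bestDist = threshold_;
        for (auto it = entries_.begin(); it != entries_.end(); ++it) {
            float d = euclidean(it->features, query);
            if (d < bestDist) { bestDist = d; best = it; }
        }
        if (best == entries_.end()) return std::nullopt;   // cache miss
        entries_.splice(entries_.begin(), entries_, best); // move to front (MRU)
        return entries_.front().result;
    }

    // On a miss, the back end computes the result and the pair is inserted;
    // the least recently used entry is evicted when the cache is full.
    void insert(const std::vector<float> &query, const std::string &result) {
        if (entries_.size() >= capacity_) entries_.pop_back();
        entries_.push_front({query, result});
    }

private:
    static float euclidean(const std::vector<float> &a, const std::vector<float> &b) {
        float s = 0.f;
        for (size_t i = 0; i < a.size(); ++i) s += (a[i] - b[i]) * (a[i] - b[i]);
        return std::sqrt(s);
    }

    size_t capacity_;
    float threshold_;
    std::list<Entry> entries_; // front = most recently used
};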

FIGURE 2.2: An example where the closest cached query does not have the optimal safe radius. [75]

Falchi et al. [75] [74] [73] apply Similarity Caching in the Content Based Image Retrieval (CBIR) context. In such applications there are typically two kinds of queries (which are very common in many other applications): range queries where, given a radius r, all the objects within a ball with the query vector as center and radius r are returned, and k-Nearest Neighbor (k-NN) queries where, given k, the k nearest objects w.r.t. the query vector are returned (see section 2.4 for more details about this famous problem in computer science). Considering first the range queries, for each cached query and its range, the associated ball induces the complete knowledge of the metric space up to the relative range, and so of all the objects contained inside the ball. They therefore introduce the concept of safe radius: given a query $q$ and a cached query $q_i$ with radius $r_{q_i}$, they define the safe radius of $q$ w.r.t. the cached query object $q_i$ as:

$$s_q(q_i) = r_{q_i} - d(q, q_i) \quad (2.1)$$

This notion is particularly useful because if $s_q(q_i) > 0$ (and so $q$ falls into the ball of $q_i$) then there exists a $k' \le k$ (where $k' \ge 0$) such that the first $k'$ exact nearest neighbors of $q$ are cached. Figure 2.2 is an example of a safe radius $s_q$ for a query $q$ w.r.t. a cached query $q_i$.

QCache is the proposed algorithm which takes advantage of the safe radius: first of all, the algorithm checks for an exact hit with the query $q$. Otherwise, using a metric index, it considers the $h$ closest cached queries to $q$, retrieves all the associated $h \times k$ results and extracts the $k$ closest objects to $q$, which form an approximated result list. Then, we compute the largest safe radius $\tilde{s}_q$ among the $h$ closest objects, with the corresponding cached query $\tilde{q}$. Notice that $\tilde{s}_q$ is possibly not the largest one overall (for example, in Figure 2.2 $q_h$ is closer to $q$ but its safe radius is smaller than the one using $q_i$), and so we may be able to find only the $k^*$ (where $k^* \le k' \le k$) exact nearest neighbors of $q$. All the objects within the ball with center $q$ and radius $\tilde{s}_q$ are guaranteed to be the $k^*$ closest objects of $q$, while the remaining $k - k^*$ closest objects among the $h \times k$ retrieved are just an approximation. Since $\tilde{s}_q$ is possibly not the largest safe radius for $q$, if its value is negative or its ball doesn't contain any object (i.e. $k^* = 0$), the algorithm proceeds with a heuristic based on the probability distribution of distances from $q$. Such a heuristic evaluates the goodness of the approximate result set and if it is not greater than a given threshold, then the cache gives up and the back-end CBIR system is queried. They provide good results for query logs used in CBIR, but they don't provide any result for other applications.

In [36] a List of Clusters (LC) [48] is used to embed the cache in a metric space index, while in the previous work the cache was decoupled from the index. If the system fails to get an approximate hit, the results computed by the cache are not wasted and are exploited to continue the traversal of the index in order to produce approximate results. The centers forming the LC are the previously seen queries. In the first step, the query $q$ is compared with each center from the LC in order to understand if it will become a new cluster center (following the policy proposed in [165]) and, if so, the index is reconstructed accordingly; otherwise we check if there is a cache hit. If not, an approximate result can be provided if $q$ lies in some cluster ball. If $q$ is not contained in any cluster ball, the cache mechanism is not used and the search continues. Since the distance between $q$ and all the cluster centers has already been computed in the caching phase, all the clusters which do not overlap with the query ball are discarded (hopefully many). Since the distances between each cluster center and its objects are computed and stored offline, they take advantage of the triangle inequality and the distances computed in the caching phase with all the cluster centers to define a lower bound for the distance between $q$ and all the objects contained in a cluster which overlaps with the query ball, so a lot of expensive distance computations are avoided. In those cases where these lower bounds do not allow them to directly discard an object, they explicitly compute the distance between it and $q$. For k-NN search, the list of closest neighbors is initialized with the sequence of the first $k$ objects that $q$ is compared with. Then, the search is like a range query with the radius updated according to the current $k$-th nearest neighbor. Notice that this approach has three problems: (1) depending on the sequence of the first $k$ objects, the radius could be very large (and so a lot of clusters cannot be discarded, since they overlap with the query ball); (2) SSS guarantees that the number of pivots (cache elements) will stabilize at some point, so the number of cache elements could be large and take a long time before it stabilizes; as a consequence, initially the Similarity Cache mechanism is useless and they use it only for exact hits (which are very rare); (3) the authors themselves claim that result quality highly depends on parameter tuning.

FIGURE 2.3: Binary hash codes produced through locality sensitive hashing functions using random hyperplanes: the top (bottom) vector is relative to the red (yellow) dot. [156]

D-Cache [197] proposes something different, since it stores precomputed distances between objects rather than previously seen queries and the associated results.

As we will see in Chapter 3, all these works are somewhat different from what we want to achieve with CloudCache, especially considering that in our system for each query we return one element (i.e. the most similar one w.r.t. the input element) and not the top-k elements. More formally, what we try to solve with CloudCache is 1-NN and not k-NN. In addition, k-NN is usually treated when the number of searched elements is in the order of millions of elements, while (as we will see) CloudCache caches thousands of elements. All this without considering the speedup obtained by common solutions nowadays. In conclusion, the k-NN problem may turn out to be overkill for our problem, especially considering the number of cached elements.

2.4 The Nearest Neighbor Problem

In the previous section we have introduced the k-Nearest Neighbor (k-NN) problem, which is a generalization of the Nearest Neighbor (NN) problem. This is a popular problem in computer science, especially in multimedia search, information retrieval, image processing, classification problems, pattern recognition, data compression, computational statistics, databases, data mining, machine learning, algorithmic theory, computational geometry, recommendation systems, etc. [9] [192] [24] [65] [124] [140]. Approximate solutions are necessary because of the so-called curse of dimensionality [55]: when we deal with high-dimensional spaces, it becomes difficult to efficiently solve the NN problem (formalized below). Just to make an example: k-d trees are popular data structures which solve the k-NN problem in logarithmic time for low-dimensional spaces. However, the complexity becomes exponential when dealing with high-dimensional spaces [79]: this makes the data structure useless for popular high-dimensional vectors, e.g., SIFT descriptors (128 dimensions, introduced in section 2.5.1) or VLAD codes (possibly thousands of dimensions, explained in section 2.6.2).

In particular, the NN problem can be defined in a metric or in a non-metric space. Formally:

Definition 1 (Metric Space). A metric space is a pair $M = (D, d)$, where $D$ is a domain of feature values and $d$ is a total (distance) function $d : D \times D \to \mathbb{R}$ s.t. for any $x, y, z \in D$ the following holds [52]:

1. Symmetry: $d(x, y) = d(y, x)$
2. Non-negativity: $d(x, y) \ge 0$
3. Identity of indiscernibles: $d(x, y) = 0 \leftrightarrow x = y$
4. Triangle inequality: $d(x, y) \le d(x, z) + d(z, y)$

In particular, all the Similarity Caching studies that we have seen in subsection 2.3.2 are designed for metric spaces, since they heavily use the triangle inequality property; in fact, most NN methods have been defined for metric spaces, due to their popularity and the large range of possible applications. Since some of the most important measures are non-metric distances, like the popular cosine similarity (especially in information retrieval), new implementations focus on offering solutions for both metric and non-metric distances.

Definition 2 (Exact NN). Given a set $S$ of points in a d-dimensional space $\mathbb{R}^d$ ($S \subset \mathbb{R}^d$), construct a data structure which, given any query point $q$, finds the point in $S$ with the smallest distance to $q$ [192] [80] [1].

The k-NN problem is a generalization of NN where the $k$ closest objects to the query $q$ are returned as result, and the NN problem is equivalent to the 1-NN problem. For a small dataset and low-dimensional spaces, queries can be answered in logarithmic time (thanks to k-d trees, for example). However, as soon as our dataset is composed of hundreds of thousands of feature vectors with hundreds of dimensions, the cost of these operations becomes prohibitive. For this reason, an approximate approach can be adopted in polynomial time:

Definition 3 (c-Approximate NN). Given a set $S$ of points in a d-dimensional space $\mathbb{R}^d$ ($S \subset \mathbb{R}^d$), construct a data structure which, given any query point $q \in \mathbb{R}^d$, reports any point within distance at most $c$ times the distance from $q$ to $p$, where $p$ is the point in $S$ closest to $q$ [192] [1].

During the last 30 years, many different techniques for solving the k-(A)NN problem have been proposed [24] [1] [65] [222] [18] [188], which can be broadly categorized into structure-less and structure-based techniques [24] [1] [65]. In this chapter we will focus on the second category, since many efficient open source implementations have been published, especially in the last years.


Unfortunately, there are few theoretical results regarding the complexity of these methods and there are no comprehensive empirical evaluations, especially for non-metric spaces [179]. As a result, there isn't a clear winner for solving ANN. In the following, some of the most efficient ANN solutions are listed:

2.4.1 Locality Sensitive Hashing

Locality Sensitive Hashing (LSH) was presented by Gionis et al. in [83] and it can be considered to have been the focus of the research community for solving the ANN problem for many years.

Definition 4 (Locality Sensitive Hashing). A family $\mathcal{F}$ is called $(R, cR, P_1, P_2)$-sensitive if, given a threshold $R$ and an approximation factor $c > 1$, for each function $h \in \mathcal{F}$ and any $p, q \in \mathbb{R}^d$ the following properties hold:

1. $P_{\mathcal{F}}(h(q) = h(p)) \ge P_1$ if $d(p, q) \le R$
2. $P_{\mathcal{F}}(h(q) = h(p)) \le P_2$ if $d(p, q) \ge cR$

In order for an LSH family to be useful, it has to satisfy $P_1 > P_2$ [192]. An alternative definition of LSH is to define a family $\mathcal{F}$ of hash functions operating on a collection of objects, such that for two objects $x, y$ it holds that $P_{h \in \mathcal{F}}[h(x) = h(y)] = sim(x, y)$, where $sim(x, y) \in [0, 1]$ [46].

So the idea behind LSH is that the probability of collision for LSH functions is most likely high for similar objects, while it's low for different ones (which is the opposite of a conventional cryptographic hash function, where even slightly different items have completely different hash values). A common solution for generating the hash code $y$ is to concatenate the hash codes of $M$ randomly picked hash functions from the family $\mathcal{F}$. So formally we have $y = h(x)$ where $y = [y_1 \dots y_M]^T$ and $h(x) = [h_1(x) \dots h_M(x)]^T$ [218].

Figure 2.3 is an example of how two points are mapped to two 6-dimensional binary vectors following the algorithm explained in [46]: for each random hyperplane $r$, assign white if the point is above $r$, black otherwise. Finally, we can approximate the cosine similarity between the two points as $\cos(\frac{h}{l}\pi)$, where $h$ is the Hamming distance between the two vectors and $l$ is their length.
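
As a concrete illustration of this scheme, below is a minimal sketch (our own example, not the implementation of [46] or [83]) that draws l random hyperplanes, hashes two vectors into l-bit codes and estimates their cosine similarity as cos(hπ/l) from the Hamming distance h; the dimensions and the data are illustrative assumptions.

// Random-hyperplane LSH sketch: binary codes and cosine-similarity estimate.
#include <cmath>
#include <cstdint>
#include <iostream>
#include <random>
#include <vector>

using Vec = std::vector<float>;

// Sample l random hyperplanes (their normal vectors) in R^d.
std::vector<Vec> randomHyperplanes(int l, int d, std::mt19937 &rng) {
    std::normal_distribution<float> gauss(0.f, 1.f);
    std::vector<Vec> planes(l, Vec(d));
    for (auto &p : planes)
        for (auto &x : p) x = gauss(rng);
    return planes;
}

// One bit per hyperplane: 1 if the point lies above it, 0 otherwise.
std::vector<uint8_t> hashCode(const Vec &x, const std::vector<Vec> &planes) {
    std::vector<uint8_t> code(planes.size());
    for (size_t i = 0; i < planes.size(); ++i) {
        float dot = 0.f;
        for (size_t j = 0; j < x.size(); ++j) dot += planes[i][j] * x[j];
        code[i] = dot >= 0.f ? 1 : 0;
    }
    return code;
}

int main() {
    std::mt19937 rng(42);
    const int d = 128, l = 64; // e.g., SIFT-sized vectors, 64-bit codes
    auto planes = randomHyperplanes(l, d, rng);

    // Two correlated vectors to compare (illustrative data only).
    std::normal_distribution<float> gauss(0.f, 1.f);
    Vec a(d), b(d);
    for (int j = 0; j < d; ++j) { a[j] = gauss(rng); b[j] = a[j] + 0.3f * gauss(rng); }

    auto ca = hashCode(a, planes), cb = hashCode(b, planes);
    int h = 0; // Hamming distance between the two codes
    for (int i = 0; i < l; ++i) h += (ca[i] != cb[i]);

    const double pi = std::acos(-1.0);
    // cos(h/l * pi) approximates the cosine similarity between a and b.
    std::cout << "estimated cosine similarity: " << std::cos(h * pi / l) << "\n";
    return 0;
}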

However, LSH suffers from poor access order because the hash functions are obtained without exploiting the distribution of the data points, and the points in the same bucket (with the same hash code) are not differentiated [219]. Modern LSH techniques still need a high quantity of main memory, which makes them unfeasible for applications with several millions (or billions) of complex objects.


2.4.2 Tree based

Multiple randomized k-d trees [194] achieved a great speedup in ANN. The main difference compared to classic k-d trees is that the data is not split according to the dimension of highest variance; instead, the split dimension is chosen randomly among the dimensions with the highest variance (where the number of candidate dimensions is parameter-dependent). The intuition behind random hyperplanes is that if the query point is really close to a splitting hyperplane, then there is a high probability that its NN is on the other side of the hyperplane, and so the tree has to be explored on the other side too. With random hyperplanes this scenario is less probable and the probability that the query point and its NN are in the same cell is higher.
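
As a usage illustration (not part of the thesis code base), randomized k-d trees are exposed by the FLANN wrapper bundled with OpenCV, one of the libraries already used in this work; the parameters below (4 trees, 64 checks) and the random data are illustrative assumptions.

// Approximate NN search with randomized k-d trees through OpenCV's FLANN wrapper.
#include <opencv2/core.hpp>
#include <opencv2/flann.hpp>
#include <iostream>

int main() {
    // 10000 database vectors of dimension 128 (SIFT-sized), CV_32F as FLANN expects.
    cv::Mat db(10000, 128, CV_32F);
    cv::randu(db, 0.0f, 1.0f);

    // Build an index made of 4 randomized k-d trees.
    cv::flann::Index index(db, cv::flann::KDTreeIndexParams(4));

    cv::Mat query(1, 128, CV_32F);
    cv::randu(query, 0.0f, 1.0f);

    // Retrieve the 5 approximate nearest neighbors; 'checks' trades speed for accuracy.
    cv::Mat indices, dists;
    index.knnSearch(query, indices, dists, 5, cv::flann::SearchParams(64));

    std::cout << "nearest neighbor id: " << indices.at<int>(0, 0)
              << " (squared L2 distance " << dists.at<float>(0, 0) << ")\n";
    return 0;
}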

k-Means Tree [80], Anchor Hierarchy [148], VP-Tree [222], Cover-Tree [23], Spill-Tree [129] and GNAT [35] are algorithms which decompose the space using different clustering algorithms to build partition trees. In particular, in [149] a modified k-Means Tree was proposed, where at each level the data points are divided into K clusters using k-Means, proceeding by recursion and stopping when a region contains fewer than K points. The tree is visited in a top-down approach from the root to the closest leaf, combined with a priority queue, sorted by increasing distance to the query point, which contains all the unexplored branches along the path. The branching factor K is an important parameter for good search performance.

Most of the data structures previously proposed are based on numerical constraints such as the triangle inequality or distance ranges. This can be a disadvantage since, with this approach, the number of objects actually examined can be highly variable, making it difficult to predict the overall execution time. For these reasons, the Rank Cover Tree (RCT) was introduced in [96], which is competitive with approaches that make explicit use of similarity constraints.

2.4.3 Graph based

In this class of techniques the ANN problem is solved using graphs, where the nodes are the dataset points and the edges connect each point to its nearest neighbors.

Recent works focus on the notion of small world introduced by Milgram [211] for solving the ANN problem based on the navigation effect [137] [178] [136]. Random elements are consecutively inserted by connecting them to a certain number of closest neighbors among the previously inserted elements. In this way, links connecting early inserted nodes will become bridges connecting different parts of the network. The Navigable Small World (NSW) algorithm is based on a greedy search which can outperform many other ANN algorithms [179] [151]. The drawbacks of this method are its low performance for low-dimensional data [151] and its polylogarithmic complexity. This cost is given by the number of distance computations, which is proportional to the product of the average number of greedy algorithm hops (logarithmic) and the average degree of the nodes on the greedy path (logarithmic). In 2016, Hierarchical Navigable Small World (HNSW) [135] was introduced and is nowadays one of the best ANN solutions. Navigating in small world networks can be divided into two phases:

1. zoom-out: in this phase the length (or node degree) of each hop increases, until we reach a hub node which has a high degree.

2. zoom-in: here the opposite happens, and so hop lengths become smaller and smaller.

FIGURE 2.4: Illustration of the HNSW idea. Starting from the entry point at the top layer (shown in red), the red arrows show the direction of the greedy algorithm towards the query point (shown in green) and how the characteristic radius decreases level by level. [135]

This can be compared to flight routes from one remote region to another: at a certain point we will probably cross some airport hub. The searching process in HNSW starts from a hub node, which is a node in the upper level of the graph hierarchy, and then finds a local minimum in a greedy fashion. Then the process is repeated in the next lower level, starting from the previous local minimum. Using a constant average node degree per level, we get a logarithmic complexity scaling. For each inserted element, an integer maximum layer level is randomly selected with an exponentially decaying probability distribution. Figure 2.4 shows an example of how HNSW works. Another contribution of this work is a heuristic which uses the distances between the candidate elements to create connections in diverse directions instead of just using the closest neighbors: a candidate node is inserted in the resulting set of connected nodes if its distance to the inserted element (or query) is smaller than its distance to any other already connected element. This heuristic easily keeps a global connected component, allowing connections between different clusters.
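
As a usage sketch, the open-source hnswlib library (not discussed in the thesis) exposes the main HNSW parameters directly; the values of M, ef_construction and ef below are illustrative and assume hnswlib's public Python API.

import numpy as np
import hnswlib  # assumed available: pip install hnswlib

dim, n = 128, 100_000
data = np.random.rand(n, dim).astype(np.float32)
query = np.random.rand(1, dim).astype(np.float32)

index = hnswlib.Index(space='l2', dim=dim)
# M ~ average node degree per layer, ef_construction controls build quality
index.init_index(max_elements=n, M=16, ef_construction=200)
index.add_items(data)

index.set_ef(50)  # ef >= k; trades speed for recall at query time
labels, distances = index.knn_query(query, k=10)
print(labels, distances)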


2.4.4

Product Quantization

Product Quantization (PQ) was introduced by Jegou et al. in 2011 [108] and is considered one of the most efficient solutions for ANN in Euclidean spaces [149]. Combining PQ and Principal Component Analysis (PCA) with image encoding techniques such as VLAD [109] or Fisher Vectors [110] (we will come back to these encoding techniques later) has been proven to be a successful and efficient (both in memory and time) approach for CBIR systems with datasets of up to hundreds of millions of images [109] [110].

Here, a quantizer is a function mapping a D-dimensional vector to the closest of k centroids obtained by k-means. However, obtaining small codes with such an approach is not straightforward: for a 128-dimensional real vector (e.g., a SIFT descriptor), a quantizer producing 64-bit codes needs k = 2^64 centroids, which has a prohibitive cost for both the k-means computation and memory (128 × 2^64 × 32 bits, i.e. more than 9.4 zettabytes, where the factors are respectively the number of dimensions per vector, the number of centroids and the size of the float representation of each dimension). PQ solves this problem: the input vector is split into m distinct subvectors, where D is a multiple of m.

FIGURE 2.5: Four quantizers of 64 centroids each. [114]

So, we define a set of m subquantizers: the i-th subvector is assigned to the centroid c_i obtained by applying the i-th subquantizer, and the final codebook is defined as c = c_1 × ... × c_m. If k* is the number of centroids per subquantizer, the total number of centroids is given by k = (k*)^m. Since we only store k* × m centroids, the number of bits per code is given by log2(k*) × m. The authors show that using k* = 256 and m = 8 is often a reasonable choice, i.e. 8 bytes per code. The approximate Euclidean distance between the query vector and a database vector is computed through Asymmetric Distance Computation (ADC), where the database vector is encoded while the query is not.
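
The following is a minimal numpy sketch of PQ encoding and ADC, assuming the m codebooks have already been trained with k-means (here they are random placeholders); k* = 256 and m = 8 follow the choice suggested by the authors.

import numpy as np

D, m, k_star = 128, 8, 256
d_sub = D // m  # each subvector has D/m dimensions

# One codebook of k* centroids per subquantizer, assumed trained beforehand
codebooks = np.random.rand(m, k_star, d_sub).astype(np.float32)

def pq_encode(x):
    """Encode a D-dim vector as m bytes (one centroid index per subvector)."""
    code = np.empty(m, dtype=np.uint8)
    for i in range(m):
        sub = x[i * d_sub:(i + 1) * d_sub]
        dists = np.linalg.norm(codebooks[i] - sub, axis=1)
        code[i] = np.argmin(dists)
    return code

def adc_distance(query, code):
    """Asymmetric Distance Computation: the query stays un-encoded, the
    database vector is replaced by its per-subvector centroids."""
    d2 = 0.0
    for i in range(m):
        sub = query[i * d_sub:(i + 1) * d_sub]
        d2 += np.sum((codebooks[i][code[i]] - sub) ** 2)
    return np.sqrt(d2)

x = np.random.rand(D).astype(np.float32)
q = np.random.rand(D).astype(np.float32)
print(adc_distance(q, pq_encode(x)))  # approximates ||q - x||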

The original PQ scheme has been improved over the last years: the Inverted Multi-Index (IMI) [16], Optimized Product Quantization (OPQ) [82] and Locally Optimized Product Quantization (LOPQ) [114] are recent improvements of the original PQ approach. Fig. 2.5 from [114] shows different Product Quantization approaches and their weaknesses: PQ in Fig. 2.5(b) suffers from a large number of useless centroids (red points in the figure) with no database point (grey points) close to them. OPQ in Fig. 2.5(c) exploits its centroids better, but still many centroids remain without data support. Finally, with LOPQ almost every centroid is surrounded by database points, as shown in Fig. 2.5(d).

FIGURE 2.6: Speed/precision trade-off of different ANN algorithms. [11]

2.4.5

ANN Cases of Use and Benchmarks

Many open-source implementations of different ANN algorithms have been released, and Figure 2.6 shows the efficiency/precision trade-off of several of them. The benchmark is defined on a dataset of 1 million 128-d SIFT descriptors and the reported results are the 10-NN of each query vector. Annoy [202] is a popular library released by Spotify, FALCONN [76] [10] is considered the state of the art for LSH under cosine similarity and Euclidean distance, while FLANN and HNSW (which is the most performant solution) have already been discussed. As we can see, using a c4.2xlarge instance on EC2 and the most performant algorithms, 10-ANN can be obtained in around one millisecond with high precision.

As we will see in section 4.1, CloudCache is not designed to cache hundreds of thousands of elements. In addition, parallel brute-force solutions mitigate the cost of exact search in high dimensions. So, as long as we are dealing with a few thousand elements, we do not actually need ANN solutions, which decrease the precision of the returned results. Moreover, all the previous approaches are designed to deal with compact codes, i.e. vectors with a few hundred dimensions: in order to represent images with such small vectors, we have to give up precision by using dimensionality reduction techniques (e.g., Principal Component Analysis) [109]. In CloudCache, instead, we use full-size representations to obtain the best possible precision.
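
As a rough illustration (not code from CloudCache), exact search over a few thousand full-size vectors is a single vectorized pass in numpy; the cache size and vector dimensionality below are placeholders.

import numpy as np

def exact_knn(cache, query, k=10):
    """Exact k-NN over a few thousand full-size vectors: one vectorized
    pass of squared Euclidean distances, no approximation involved."""
    d2 = ((cache - query) ** 2).sum(axis=1)
    idx = np.argpartition(d2, k)[:k]   # the k smallest, unordered
    idx = idx[np.argsort(d2[idx])]     # order them by distance
    return idx, np.sqrt(d2[idx])

cache = np.random.rand(5000, 4096).astype(np.float32)  # e.g. full-size image signatures
query = np.random.rand(4096).astype(np.float32)
print(exact_knn(cache, query, k=5))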


2.5

Image Feature Detector and Descriptor

As we have seen in section 2.2, an important part of Computer Vision consists of feature detection and description for a given input image.

2.5.1

SIFT Algorithm

FIGURE 2.7: A scale space example. [193]

The first and most popular algorithm to accurately describe a query image was the Scale Invariant Feature Transform, SIFT in short. A baseline concept of this work is the scale space. Given the original image, a chain of progressively more blurred images is generated: the original (non-blurred) image is the first scale, the image obtained by applying the blur filter once is the second scale, the image obtained by blurring the second scale is the third scale, and so on. Then the image size is halved and the blurring process is repeated. The set of scales with the same size is called an octave: the first octave is obtained by doubling the original image size, the second octave corresponds to the original image size, the third one is half of the original size, and so on. This approach generates a scale space, whose final result can be seen in Figure 2.7. With this approach we are able to identify the query object in another image even if its size is different or it is blurred, i.e. the algorithm is scale invariant.

The next step of the algorithm is to build a Difference of Gaussians (DoG) to locate edges and corners in the image, which are good candidates for keypoints, as found out by Harris et al. in [93]. This is done by computing the difference between two consecutive scales, as shown in Figure 2.8. Keypoints are then obtained by finding maxima and minima in the resulting DoG; they are then refined by eliminating low-contrast features and edges, leaving only high-contrast corners, which are usually considered interesting points in an image.
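
A minimal sketch of one octave of the scale space and its DoG, using OpenCV; the initial sigma, the scale multiplier and the number of scales are illustrative values, not the exact constants of the SIFT paper.

import cv2
import numpy as np

def octave_and_dog(gray, n_scales=5, sigma0=1.6, k=2 ** 0.5):
    """Blur the input image with increasing sigma, then subtract
    consecutive scales to obtain the Difference of Gaussians."""
    scales = [gray.astype(np.float32)]
    for i in range(1, n_scales):
        sigma = sigma0 * (k ** i)
        scales.append(cv2.GaussianBlur(scales[0], (0, 0), sigma))
    dog = [scales[i + 1] - scales[i] for i in range(n_scales - 1)]
    return scales, dog

img = (np.random.rand(256, 256) * 255).astype(np.uint8)  # stand-in for a real photo
scales, dog = octave_and_dog(img)
# The next octave repeats the process on the image halved in size:
half = cv2.resize(img, None, fx=0.5, fy=0.5, interpolation=cv2.INTER_AREA)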

(44)

The next step is to assign an orientation to each keypoint in order to obtain rotation invariance. The basic idea is to collect gradient directions and magnitudes for all pixels around each keypoint into a histogram. Using the histogram, the most prominent gradient orientations are identified and assigned to the corresponding keypoint.
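
A minimal numpy sketch of the orientation-histogram idea; the 16x16 patch and the 36 bins over 360 degrees are the commonly quoted choices, and the sketch ignores the Gaussian weighting and peak interpolation performed by the full algorithm.

import numpy as np

def orientation_histogram(patch, n_bins=36):
    """Accumulate gradient magnitudes into orientation bins for a patch
    centred on a keypoint; the peak bin gives the dominant orientation."""
    gy, gx = np.gradient(patch.astype(np.float32))
    mag = np.sqrt(gx ** 2 + gy ** 2)
    ang = (np.degrees(np.arctan2(gy, gx)) + 360.0) % 360.0
    hist = np.zeros(n_bins)
    bins = (ang / (360.0 / n_bins)).astype(int) % n_bins
    np.add.at(hist, bins.ravel(), mag.ravel())
    return hist

patch = np.random.rand(16, 16)          # stand-in for the pixels around a keypoint
hist = orientation_histogram(patch)
dominant = hist.argmax() * (360 // 36)  # dominant orientation in degrees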

FIGURE 2.8: Process to obtain the Difference of Gaussians. [193]

The last step consists in generating a descriptor, i.e. a (nearly) unique 128-dimensional vector of floating point numbers. To do so, a 16x16 window around the keypoint is created and split into sixteen 4x4 windows. For each window, gradient magnitudes and orientations are computed and collected into a histogram of 8 bins, each one corresponding to a different range of 45 degrees. Finally, the obtained 128 values are normalized. An example of the resulting SIFT keypoints and descriptors detected on a query image is shown in Figure 2.10.
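
In practice the whole detection and description pipeline is available off the shelf; a minimal sketch with OpenCV follows, assuming a build where SIFT is exposed as cv2.SIFT_create (older builds expose it under cv2.xfeatures2d).

import cv2
import numpy as np

# Stand-in image; in practice this would be the query photo.
img = (np.random.rand(480, 640) * 255).astype(np.uint8)

sift = cv2.SIFT_create()  # in older OpenCV builds: cv2.xfeatures2d.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(img, None)

# Each keypoint carries a position, scale and orientation; each descriptor is a
# 128-dimensional float vector, as described above.
print(len(keypoints), None if descriptors is None else descriptors.shape)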

One of the most interesting points of the SIFT algorithm is that the similarity between two keypoint descriptors is computed using the Euclidean distance, so many well-known techniques can be used to efficiently compute descriptor matching:

• Brute force matching is the simplest solution, where given a descriptor from the first image, we compute the (euclidean) distance between it and each other descriptor of the second image. However, this approach is not efficient when we have to deal with many descriptors (usually detected in images with many features and/or large size), even though GPU solutions have been implemented.

• ANN matching is used to quickly compute similar descriptors, but the obtained solution is possibly not the best one (since we get an approximate solution). Many different approaches have been proposed and we already talked about them in section 2.4.

FIGURE 2.9: An example of SIFT descriptor matching, taken from [214].

Figure 2.9 shows an example of SIFT descriptor matching between two similar images. Thanks to its precision, SIFT is not only scale and rotation invariant, but it is also able to match subjects represented from different points of view (with some limitations). Descriptor matching allows us to find the most similar images in the dataset w.r.t. the query image: we can rank images depending on the number of descriptor matches, and the top-ranked images are going to be the most similar ones. Of course, this approach assumes good keypoints and descriptors: with bad keypoints we miss interesting (and distinctive) points in the processed image (or, even worse, we classify irrelevant points as relevant ones), while with inaccurate keypoint descriptions we could consider the areas around two keypoints as visually similar when actually they are not.
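
A minimal sketch of this ranking strategy with OpenCV's brute-force matcher and Lowe's ratio test; the 0.75 threshold is the commonly used value, not necessarily the one adopted later in CloudCache, and cv2.FlannBasedMatcher could be swapped in for ANN matching.

import cv2

def count_good_matches(desc_query, desc_image, ratio=0.75):
    """Number of query descriptors whose best match in the candidate image
    passes the ratio test (best distance clearly below the second best)."""
    matcher = cv2.BFMatcher(cv2.NORM_L2)  # brute force, Euclidean distance
    knn = matcher.knnMatch(desc_query, desc_image, k=2)
    return sum(1 for pair in knn
               if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance)

def rank_dataset(desc_query, dataset_descriptors):
    """Rank dataset images by descending number of good matches."""
    scores = [(count_good_matches(desc_query, d), i)
              for i, d in enumerate(dataset_descriptors)]
    return sorted(scores, reverse=True)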

2.5.2

SURF and Hessian-Affine Detectors

Since the publication of the SIFT paper, several other keypoint detectors and descriptors have been published (see section 2.5.3 for other examples). However, two other feature detectors and descriptors are worth mentioning, since they are very important in CloudCache:

Speeded Up Robust Features (SURF)

FIGURE 2.10: A graphical representation of SIFT keypoints. [101]

SURF was designed in 2006 by Bay et al. [20] to speed up SIFT. First of all, the Laplacian of Gaussian (LoG) is approximated not with a DoG but with box filters, which can be computed efficiently using the integral image, and the process can be done in parallel for different scales. Keypoint orientation and magnitude are computed using wavelet responses and Gaussian weights, respectively. The dominant orientation is then computed by summing the Gaussian-weighted responses within windows of 60 degrees. If we do not need rotation invariance (as is the case for many applications), the whole process can be performed even more quickly by not computing the keypoint orientation (which guarantees
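
A minimal numpy sketch of the integral image (summed-area table) and of a constant-time box sum, which is what makes the box-filter approximation cheap at any scale; this is the standard construction, not code from the SURF authors or from the thesis.

import numpy as np

def integral_image(img):
    """Summed-area table: S[y, x] = sum of all pixels above and to the left."""
    return img.astype(np.float64).cumsum(axis=0).cumsum(axis=1)

def box_sum(S, y0, x0, y1, x1):
    """Sum of the pixels in the box [y0..y1] x [x0..x1] using four lookups,
    independently of the box size: this is why box filters can be evaluated
    cheaply at any scale."""
    total = S[y1, x1]
    if y0 > 0:
        total -= S[y0 - 1, x1]
    if x0 > 0:
        total -= S[y1, x0 - 1]
    if y0 > 0 and x0 > 0:
        total += S[y0 - 1, x0 - 1]
    return total

img = (np.random.rand(64, 64) * 255).astype(np.uint8)        # stand-in image
S = integral_image(img)
print(box_sum(S, 10, 10, 20, 20), img[10:21, 10:21].sum())   # the two values match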
