
Ph.D. in Electronic and Computer Engineering
Dept. of Electrical and Electronic Engineering

University of Cagliari

Interactive search techniques for content-based retrieval from archives of images

Luca Piras

Advisor: Prof. Giorgio Giacinto

Curriculum: ING-INF/05 SISTEMI DI ELABORAZIONE DELLE INFORMAZIONI

XXIII Cycle March 2011


To Iole and to my family, for their support


Abstract

A quick investigation by file type shows that one of the most popular search engines has about 10 billion images in its indexes. Even though this figure is probably an underestimate of the real number, it immediately gives an idea of how central images are to human communication. Such an exorbitant number also reveals the enormous difficulties encountered when one has to deal with them. Until now, images have always been accompanied by textual data (descriptions, tags, labels, ...) that are used to retrieve them from archives. However, it is clear that their growth in recent years no longer allows this kind of cataloguing. Furthermore, by its own nature, manual cataloguing is subjective, partial, and without doubt error-prone. To overcome this situation, searches based on the intrinsic characteristics of images, such as colors and shapes, have gained ground in recent years. This information is converted into numerical vectors, and by comparing them it is possible to find images with similar characteristics. It is clear, however, that a search at this level of representation is far from the user’s perception of the images.

To allow the interaction between users and retrieval systems, and to improve performance, the user is involved in the search by letting her give feedback on the relevance of the images retrieved so far. In this way the system can learn which kind of images the user is interested in, and an improvement can be obtained at the next iteration. These techniques, although studied for many years, still present open issues. High-dimensional feature spaces, the lack of relevant training images, and feature spaces with low discriminative capability are just some of the problems encountered. In this thesis these problems are faced by proposing innovative solutions, both to improve the performance of methods proposed in the literature and to provide retrieval systems with greater generalization capability. Data fusion techniques, both at the feature space level and at the level of different retrieval techniques, will be presented, showing that the former allow greater discriminative capability while the latter provide more robustness to the system. To overcome the lack of training images, a method to generate synthetic patterns will be proposed, allowing a more balanced learning. Finally, new methods to measure similarity between images and to explore the feature space more efficiently will be proposed. The presented results show that the proposed approaches are indeed helpful in resolving some of the main problems in content-based image retrieval.


Contents

1 Introduction 1

1.1 Research Overview. . . 3

1.2 Organization . . . 5

2 Content Based Image Retrieval 7
2.1 What is the image content? . . . 9

2.2 Image features . . . 10
2.2.1 Color . . . 10
2.2.2 Texture . . . 11
2.2.3 Interest points . . . 13
2.3 Metrics . . . 13
2.4 Evaluation . . . 15
2.5 Datasets . . . 17
2.5.1 Corel SMALL . . . 17
2.5.2 WANG dataset . . . 18
2.5.3 MSRC dataset . . . 18
2.5.4 Caltech-256 dataset . . . 18
2.5.5 MIRflickr-25000 dataset . . . 19

3 Semantic gap and other open issues 21
3.1 Relevance Feedback techniques . . . 23

3.1.1 Query Shifting Techniques . . . 23

3.1.2 Support Vector Machines . . . 24

3.1.3 Nearest Neighbor Approach . . . 25

3.1.4 Some other techniques . . . 26

3.2 Unbalanced learning . . . 26

3.2.1 Artificial patterns generation . . . 26

3.3 High-dimensional feature spaces . . . 27

3.3.1 Local manifold . . . 27

3.3.2 Global dimensionality reduction approaches . . . 28

3.3.3 Feature weighting. . . 29

3.4 Low capability to evaluate similarity between all relevant images . . . 30

3.4.1 Graph approaches . . . 30

3.5 Low informative training images . . . 31

3.5.1 Active learning . . . 31


3.6 Low discriminative capability of the feature set . . . 32

3.6.1 Combining classifiers . . . 32

3.6.2 Representation by multi-feature spaces . . . 33

4 Relevance feedback techniques in individual feature spaces 35
4.1 Directed Pattern Injection . . . 35

4.1.1 Learning from imbalanced data . . . 36

4.1.2 Category Learning with Synthetic Feature Vectors. . . 37

4.1.3 Choice of the Parameters for the DPI technique . . . 39

4.1.4 Experimental Results . . . 40
4.2 On-line learning . . . 47
4.2.1 On-line approach . . . 47
4.2.2 Experimental Results . . . 51
4.3 Exploitation-Exploration . . . 54
4.3.1 MAX-min selection . . . 55
4.3.2 Experimental Results . . . 56

4.4 Dominant set score ranking . . . 59

4.4.1 Node weighting . . . 59

4.4.2 Score evaluation . . . 59

4.4.3 Experimental Results . . . 60

4.5 Locally Linear Embedding Score . . . 61

4.5.1 Reconstruction error evaluation . . . 61

4.5.2 LLE Relevance Score . . . 62

5 Relevance feedback techniques in multiple feature spaces 63
5.1 Weighted Combination . . . 63

5.1.1 Weighted Metrics . . . 64

5.1.2 Estimation of feature relevance. . . 65

5.1.3 Neighborhood-Based feature weighting. . . 66

5.1.4 Experimental Results . . . 68

5.2 Space combination . . . 73

5.2.1 Error minimization . . . 73

5.2.2 Dissimilarity representation from multi-feature spaces . . . 78

6 Combination of multiple relevance feedback techniques 87
6.1 Nearest neighbor and SVM scores fusion . . . 87

6.1.1 Score fusion . . . 88

6.1.2 Diversity measure. . . 88

6.1.3 Experimental Results . . . 89

7 Conclusions and Future Works 97
7.1 Where is the content based image retrieval now? . . . 97

7.2 Where are we going? . . . 98

Bibliography 101


List of Figures

1.1 “La pelosa” . . . 2

1.2 “palm” Google Images search . . . 2

1.3 “Cold” . . . 3

2.1 Image of a horse . . . 7

2.2 Vue de la fenêtre du domaine du Gras . . . 8

2.3 (Dis)similarity between features . . . 9

2.4 (Dis)similarity between concepts . . . 10

2.5 Indistinguishable figures for a color histogram . . . 11

2.6 Examples of different scales textures . . . 12

2.7 Different types of edges . . . 12

2.8 SIFT points . . . 13

2.9 Subsets for semi-global histograms . . . 16

2.10 Images per class in Corel SMALL dataset . . . 17

2.11 Example images from WANG dataset . . . 18

2.12 Example images from MSRC dataset . . . 18

2.13 Example images from Caltech-256 dataset . . . 19

3.1 Relevance Feedback steps . . . 22

4.1 Example of k-NN “synthetic” pattern generation . . . 38

4.2 (DPI) MSRC Dataset - Precision . . . 43

4.3 (DPI) MSRC Dataset - Recall . . . 43

4.4 (DPI) MSRC Dataset - F measure. . . 44

4.5 (DPI) Caltech-256 Dataset - Precision. . . 44

4.6 (DPI) Caltech-256 Dataset - Recall . . . 45

4.7 (DPI) Caltech-256 Dataset - F measure . . . 45

4.8 (DPI) MSRC Dataset - Performance SMOTE vs DPI . . . 46

4.9 (DPI) Caltech-256 Dataset - Performance SMOTE vs DPI . . . 46

4.10 (Ol-learn.) WANG Dataset - Av. Precision & Precision. . . 52

4.11 (Ol-learn.) MSRC Dataset - Av. Precision & Precision . . . 53

4.12 (Ol-learn.) Caltech-256 Dataset CEDD - Av. Precision & Precision . . . 54

4.13 (Ol-learn.) Caltech-256 Dataset Edge Histogram - Av. Precision & Precision . . . . 54

4.14 Exploration-Exploitation algorithm . . . 56

4.15 (Expl-Expt) WANG Dataset - Av. Precision & Precision . . . 57

4.16 (Expl-Expt) MSRC Dataset - Av. Precision & Precision . . . 58


4.17 (Dom-Sets) Corel SMALL Dataset - Precision & Recall . . . 61

5.1 Neighborhood-Based feature weighting scheme . . . 67

5.2 (FW) Corel SMALL Dataset - Precision . . . 70

5.3 (FW) Corel SMALL Dataset - Recall . . . 71

5.4 (FW) Corel SMALL Dataset - F measure. . . 71

5.5 (FW) Corel SMALL Dataset - PFRL Precision under T parameter variation . . . 72

5.6 (FW) Corel SMALL Dataset - PFRL Recall under T parameter variation . . . 72

5.7 Sigmoid function . . . 74

5.8 (Min-Err) WANG Dataset - Av. Precision & Precision . . . 76

5.9 (Min-Err) MSRC Dataset - Av. Precision & Precision . . . 76

5.10 (Min-Err) Caltech-256 Dataset - Av. Precision & Precision . . . 77

5.11 (DiS) Caltech-256 Dataset - Precision . . . 84

5.12 (DiS) Caltech-256 Dataset - Recall . . . 84

5.13 (DiS) Caltech-256 Dataset - Mean execution time. . . 85

6.1 (NN+SVM) MSRC Dataset - Precision for SIFT and CED Descriptor . . . 90

6.2 (NN+SVM) MSRC Dataset - Recall for SIFT and CED Descriptor . . . 91

6.3 (NN+SVM) MSRC Dataset - Relative Precision & Relative Recall (SIFT) . . . 91

6.4 (NN+SVM) MSRC Dataset - Relative Precision & Relative Recall (CEDD) . . . 92

6.5 (NN+SVM) MSRC Dataset - Diversity measure for SIFT and CED Descriptor . . . . 92

6.6 (NN+SVM) Caltech-256 Dataset - Precision for EH and CED Descriptor. . . 93

6.7 (NN+SVM) Caltech-256 Dataset - Recall for EH and CED Descriptor . . . 93

6.8 (NN+SVM) Caltech-256 Dataset - Relative Precision & Relative Recall (EH) . . . 94

6.9 (NN+SVM) Caltech-256 Dataset - Relative Precision & Relative Recall (CEDD) . . . 94

6.10 (NN+SVM) Caltech-256 Dataset - Diversity measure for EH and CED Descriptor . . . 95
6.11 (NN+SVM) WANG Dataset - Precision & Recall . . . 95

6.12 (NN+SVM) WANG Dataset - Relative Precision & Relative Recall . . . 96


Chapter 1

Introduction

The growing amount of digital data such as text, video, audio, pictures, or photos is creating the need for ways to quickly and accurately retrieve information from that data. Whereas the results of traditional text search methods are quite satisfactory, the same cannot be said for visual or multimedia data.

The most common method for image retrieval so far is predicated on adding meta-data to the images, such as keywords, tags, labels, or short descriptions, so that retrieval can occur through such annotations. The manual cataloguing of images, besides requiring expensive work and a large amount of time, is often not very effective. Describing a picture in words is not always easy, and the relevance of the description is strictly subjective. Figure 1.1, for example, could be considered an image depicting a sunset, or the sea, or even a tower, so that it is impossible to give a unique description.

In addition, it is necessary to take into account that the use of a limited number of words does not always allow one to clearly describe what an image represents, and that the same word can have several meanings. For example, performing a Google search using the word “palm”, the result could be very different from the expected one (Figure 1.2). In fact, as you can see, only one picture out of twenty is a palm tree (honestly, the first thing I thought of when I ran this search), and again only one out of twenty is the palm of a hand. To find more palm trees you need to scroll to the third page, and for other palms of hands even to the eighth. It is surely necessary to say that Google searches are probably biased by economic factors, or at least that its users more likely seek images of Palm Inc. products; however, these small examples can help to understand the difficulties in describing an image. The multiplicity of the results of the search is due to the fact that manual indexing of images is independent of their content, because it is not directly related to the visual information, but only “connected” to it by author, subject depicted, place of representation, etc.

To overcome these drawbacks, several methods have been proposed over the years. They are predicated on the idea of content based indexing by means of low-level features such as color, texture, shape, etc., and this is nowadays a very active research field [94,62,24,105]. Systems that make use of these approaches are called content based image retrieval (CBIR) systems.

Of course, a description of the images through this kind of features is far from the common perception that the user has of an image. For a human being, indeed, an image can represent many things: it can be fine or nasty, it can be good or bad, it can evoke emotions and


Figure 1.1: Beach “La pelosa” - Stintino (SS, Italy)

Figure 1.2: “palm” Google Images search

memories, and can also convey sensations that are not actually represented (e.g., see Figure 1.3). For a computer, it is simply a set of pixels with different “colors” and different intensities. In the last years there have been many attempts to bridge the gap between the high level features, those perceived by human beings, which identify the semantic information related to the images, and the low level ones that are used in the searches. This difference in


Figure 1.3: “Cold”

perception is widely known in the CBIR field as the semantic gap. Among the proposed methods, retrieving images according to their similarity with respect to an image given as query (Query by Example), and refining the search using the feedback of the user, who judges the images retrieved by the system iteration after iteration as relevant or non relevant (Relevance Feedback), seem particularly effective.

Having said this, it might seem a solved problem by now; however, there are still many obstacles to face. As already mentioned, pictures, for a computer, are sequences of numbers, so it is natural that low-level features are represented as vectors of numbers in an N-dimensional space. One of the first problems to address is to make sure that relevance or semantic similarity can be measured and represented in this N-dimensional space. The problem is aggravated by the fact that, in general, the images the user is interested in are just a small part of all the images in the dataset, and in addition the dimensionality of the space in which images are represented can be very large. In practice, the user is facing a search for a few items, in a very large space, populated mainly by objects she is not interested in. Fortunately, some techniques from the field of Pattern Recognition and classification give us a helping hand; however, they carry with them some solutions as well as some weaknesses.

1.1 Research Overview

During the three-year period of the Ph.D., my research has focused mainly on addressing some of the open issues concerning relevance feedback and, in general, CBIR. In particular, the problems of unbalanced learning, high-dimensional feature spaces, low informative training sets, and low discriminative capability of the feature set have been investigated. In addition, some innovative solutions have been developed to evaluate a similarity score between images.

unbalanced learning: Usually, during the first few iterations, the number of images that the user considers non relevant is much larger than the number of those considered relevant. This is due mainly to two factors. First of all, even if the user is “collaborative”, the system cannot submit to her an excessive number of images, say no more than a few dozen. Second, in very large databases the amount of images that interest


the user, with respect to the whole collection, is usually very limited. Our main contribution to the unbalanced learning problem is a technique, called K-Nearest Neighbors Directed Pattern Injection (DPI K-NN), that consists of adding to the training set some feature-space representations of “synthetic” images (a rough sketch of the idea is given at the end of this overview).

high dimensional feature space: In order to describe the pictures accurately, feature spaces are becoming larger and larger. If in some ways this increases the descriptive capability of the feature vectors, in others it makes them difficult to handle. In addition, when the dimensionality becomes very high, one often has to deal with sparse vectors, i.e. vectors with many components equal to zero for some queries but not for others. This implies the need to find ways to reduce the feature space while maintaining the pattern distribution or, even better, trying to keep the relevant images close to each other and far from the non relevant ones. On this topic, two techniques will be shown in this thesis. The first is a preliminary work and consists in reconstructing all the pictures from an N-dimensional space in M-dimensional and P-dimensional spaces (with M, P < N) as a linear combination of relevant and non relevant images, respectively. It is possible to evaluate the relevance of each image by the error committed during the reconstruction phase.

The second is based on attributing weights to the various components of the feature vectors. The weight evaluation that will be proposed follows two different approaches: the first is founded on the relative distance of relevant and non relevant images with respect to a certain component of the feature space, the second on the minimization of the classification error.

low informative training set: In the field of image retrieval, the images the user is interested in usually belong to the same class, but those that are not interesting to her may belong to many different classes. This behaviour sometimes makes classifiers based on the nearest neighbor paradigm not the most effective. Based on the concept that similar images are located in adjacent areas of the space, this type of classifier retrieves relevant similar images but also non relevant images similar to each other, losing generalization capability. To address this problem, this thesis proposes a methodology that combines the nearest neighbor paradigm (Exploitation) with a phase of exploration of the immediate neighbourhood of the area occupied by the relevant images (Exploration). In this way it is possible to choose images that allow a greater ability to generalize, instead of choosing only the images that are most similar to the examples.

low discriminative capability of the feature set: It is possible to extract a huge quantity of low-level features from an image: color moments, color and edge histograms, directivity and co-occurrence descriptors, etc., but the use of just one of them is probably not effective enough to achieve excellent performance. For this reason, information from each feature space is very often combined in order to exploit the different characteristics that each is able to emphasize. The main problem is to find a way to increase the accuracy of the search without increasing the processing time. Another way to make the most of the extracted features is to combine not the descriptors but different classifiers on the same set of features. Our contribution on this topic is twofold. One is the proposal of an approach that represents the images in a


“(dis)similarities” space, where each component is associated with a different feature space and represents a similarity measure of the image with respect to the given query. In this way it is possible to obtain a space whose dimension is equal to the number of spaces to be combined.

The second contribution is the idea of combining two very different techniques for image retrieval, in order to obtain good results across different datasets and representations rather than high performance on a specific dataset. The techniques that have been chosen are a nearest neighbor based approach and the SVM.

relevance evaluation: Finding a method that allows the semantic similarity between two images to be transformed into something measurable is a long-standing open issue. Over the years several methods have been proposed, more or less successful depending on the context in which they were used. Under this perspective, two new techniques to assess the similarity between images have also been proposed: one based on graphs and clustering, the other on on-line learning. In the first, the image dataset is represented as an undirected weighted graph, where the weight of an edge reflects the similarity between the pair of vertices it connects. In this graph it should be possible to form a cluster of relevant images whose nodes are similar among themselves and, at the same time, dissimilar from the nodes that do not belong to the cluster. Starting from the relevant images chosen by the user and following the concept of Dominant Set, it is possible to iteratively modify the weights of the graph and find new relevant images. The other approach proposed on this topic is predicated on on-line learning. More specifically, the idea consists in exploiting the on-line nature of relevance feedback in order to evaluate a classification function from a vector of coefficients. These are computed iteratively, paying attention that they do not vary too quickly, staying in some way “linked” to the values they took at the previous iteration (passive), but at the same time changing proportionally to the classification error committed earlier (aggressive).
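As a rough illustration of the synthetic pattern idea anticipated in the unbalanced learning item above, the following sketch interpolates new feature vectors between a relevant image and one of its relevant nearest neighbours. It is only a minimal, SMOTE-like sketch under our own assumptions (the function name, parameters, and interpolation scheme are ours), not the actual DPI K-NN algorithm, which is detailed in Chapter 4.

```python
import numpy as np

def inject_synthetic_patterns(relevant, k=3, n_new=10, seed=None):
    """Illustrative k-NN-based generation of synthetic relevant patterns."""
    rng = np.random.default_rng(seed)
    relevant = np.asarray(relevant, dtype=float)
    synthetic = []
    for _ in range(n_new):
        x = relevant[rng.integers(len(relevant))]
        # k nearest relevant neighbours of x (x itself excluded)
        dist = np.linalg.norm(relevant - x, axis=1)
        neighbours = np.argsort(dist)[1:k + 1]
        y = relevant[rng.choice(neighbours)]
        t = rng.random()  # interpolation coefficient in [0, 1)
        # the synthetic point lies on the segment between x and y,
        # inside the region already populated by relevant images
        synthetic.append(x + t * (y - x))
    return np.array(synthetic)
```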

1.2 Organization

The rest of this thesis is structured as follows.

Chapter 2 gives the required background on Content Based Image Retrieval, and presents the widely used evaluation metrics and the most common databases.

Chapter 3 presents an overview of the state of the art, showing the most popular techniques in the field of relevance feedback and focusing particularly on those that have been proposed to address the open issues tackled in my research.

Chapter 4 discusses the solutions, proposed for retrieval techniques working in a single feature space, to the main problems introduced in the previous chapter. For each issue the proposed solution will be described, followed by the results obtained during the experimental phase.

Chapter 5 focuses on two problems whose solutions are in some way related. Both problems will be faced separately. At the end, some interesting results will be compared.

Chapter 6 introduces the purposes behind the idea of combining two different CBIR techniques, and shows why and how it can be a “winning” idea in content based systems that work with very different databases.


Finally, Chapter 7 briefly summarizes the topics covered in this thesis and some future improvements that can extend our work.


Chapter 2

Content Based Image Retrieval

Humans, ever since they came into the world, have always used images to communicate, to report, or simply as a form of expression. Cave paintings of tens of thousands of years ago have even come down to us (Figure 2.1). As time went by, this way of communicating was maintained, gradually finding more and easier ways to make these representations. The first photograph dates back almost two centuries (Figure 2.2¹), and ever since, the spread has been unstoppable. By now, everyone has a camera in their mobile phone and, with the

Figure 2.1: Cro-Magnon, Image of a horse (Upper Paleolithic, 17,000 BC-15,000 BC, Lascaux caves, Limousin, France)

¹ View from the window at Le Gras, the first successful permanent photograph (Wikipedia)


Figure 2.2: J. N. Niépce, Vue de la fenêtre du domaine du Gras (1826, Saint-Loup-de-Varennes, Bourgogne, France)

advent of the Internet, social networks, and almost “unlimited” storage space, the exchange of photos and digital images has become frenetic, to say the least. Faced with this new scenario, the need to find a way to manage this heap of data has become increasingly urgent. One needs only to reflect on the amount of photos brought home after a short vacation: if you do not hurry to save them on your computer, taking care to divide them at least by place and date, very soon, looking at them again, “difficult” questions will arise: Where did we take this picture? What building is that?

And this without even considering the persons represented in the images: in that case, the cataloguing of the photos will take more time than the holiday itself. It is therefore evident that a manual archiving of all the images one possesses is to be categorically ruled out. Everyone has surely needed, for various reasons, to search for an image on the web, and has struggled with the inefficiencies of keyword searches and the consequent waste of time. For this reason, since the early nineties, the scientific community has been interested in the study of content based image retrieval [94].

The rest of the chapter briefly treats the different levels of meaning of an image, and then describes the features that are extracted from the images to be processed by a retrieval system. In addition, it describes how a rough but effective kind of similarity measure is evaluated. Finally, the chapter concludes with a description of the performance measures used in the rest of the thesis and of the datasets used.


Figure 2.3: (Dis)similarity between features

2.1 What is the image content?

The first papers on CBIR are, by now, twenty years old [60], but we are still facing problems not yet fully solved. The largest of these is that currently the only way to handle the “content” of an image is to extract what are called low-level features, i.e. numerical values describing the colors, shapes, and edges of the images. This means that when two images are compared, the related “features” are compared, while a human being compares the “concepts” expressed by them [101]. In the same light it is possible to distinguish between different “levels” [29,38] of image content retrieval:

Level 1: it is the lowest level, where comparisons between images are made considering the intrinsic features of an image: find images with round objects, find pictures with yellow objects, etc. Of course, being the one needing the least information, it also has the lowest effectiveness. For example, if you tried to evaluate the similarity between an orange and a lemon, you might not always be satisfied by the result [61]. A retrieval system based on this level of content may respond with either a very high value of similarity or a very low one. Looking at Figure 2.3, it is not difficult to see how a shape-based retrieval would consider the two images as similar, while one based on color would not.

Level 2: it is characterized by the properties derived from the objects in the picture. The queries can be the retrieval of images of specific objects (such as the Colosseum) or of a specific type (like a bicycle). Searches at this level need a certain knowledge base, for example, that a particular building with a certain shape is known as the Colosseum.

Level 3: it is the highest level and requires a good deal of abstraction. At this level the search aims at concepts that go beyond what is physically represented. These may be particular events (the inauguration ceremony of the President of the United States) or feelings and emotions (cold, joy, etc.), and they are definitely the hardest to find, so much so that even a human being may be wrong when dealing with the identification of these concepts.

The most significant difference between levels is surely the one between the first and the other two (for which we speak of semantic retrieval). In fact it is not uncommon that two


Figure 2.4: (Dis)similarity between concepts

images, even though similar in terms of colors and shapes, represent completely different objects. An example is Figure 2.4 [7], taken from a dog food advertisement of a few years ago: although the colors, the background, and the shape of the object in the foreground are almost identical, the semantic difference is considerable. This difference of similarity perception is called the semantic gap.

2.2 Image features

Before providing a description of the features, it is necessary to divide them into two categories, namely global and local. Among the former, the most widely used are color, texture, and edges, which are extracted from the image in its entirety. Local features, on the contrary, as the name suggests, are involved in the assessment of the most significant areas of the image [8,25].

2.2.1 Color

Color is certainly one of the most important low-level features of an image, first of all because it characterizes the image strongly and is robust with respect to translation and rotation. Secondly, a picture can be zoomed in or out without its color distribution being significantly altered. Finally, it is immediately perceived by human beings. It is therefore probable that a similarity between images perceived by the user corresponds to a similarity of color features. The human visual system is such that the retina is more sensitive to three particular wavelengths: Red, Green, and Blue. This has led to the emergence of the trichromatic


model (RGB), in which all colors can be represented as a combination of the three primary ones in the right proportions, and to its use in all monitors and televisions. To maintain compatibility between the old black and white monitors and the color ones, it has been preferred, however, to encode the RGB information through the YCbCr signal, where Y = (R + G + B), Cb = (B − Y), and Cr = (R − Y). Another model, closer to human perception of color, is the HSV model, in which the three components represent the Hue, the level of color purity (Saturation), and the brightness value (Value).

A very simple description of the colors of an image can be given using a vector H = (h[1], ..., h[i], ..., h[N]) of a vector space of dimension equal to the number N of colors in the image, where h[i] is the percentage of pixels of the i-th color. The graphical representation of the vector H is called a color histogram [96]. However, the simplicity of representation of color histograms is accompanied by a lack of information. In fact, histograms capture the global color distribution in an image but do not preserve spatial information. For example, a red figure on a blue background has the same histogram as one with the same number of pixels, but randomly arranged (see Figure 2.5). In order to overcome this

Figure 2.5: Indistinguishable figures for a color histogram

drawback, a “layout version” of the histograms is used: before extracting the color values, the image is split into sub-images and a histogram is evaluated for each one. Some color histogram descriptors are included in the MPEG-7 standard, which sets the color spaces that can be used to calculate them. The most common are the Scalable Color Descriptor and the Color Layout Descriptor [1]. The main difference between the two descriptors is that the first is a color histogram in the HSV color space encoded by a Haar Transform, while the second is extracted by splitting the image into 8×8 sub-images, evaluating the average color of each sub-image in the YCbCr color space, and transforming them into coefficients by performing a Discrete Cosine Transform. In the rest of the thesis these features, together with the color histogram and the color histogram layout [96], will be used as color descriptors; a minimal sketch of the two histogram variants is given below.

2.2.2 Texture

Texture identifies the recurrence of patterns with similar geometric properties. The techniques used for its description are typically statistical and spectral. Spectral methods are based on the analysis of the Fourier spectrum of the images, in order to identify the recurrences that are present. These reveal themselves as peaks in the spectrum and carry a high energy content. The statistical methods are used instead to obtain information about the relative positions of the pixels. This is obtained by considering not only the intensity distribution but also the position of pixels with identical or very similar intensity values.


The best known texture feature set is the one proposed by Tamura, and named after him [97]. The set is composed of coarseness, contrast, directionality, line-likeness, regularity, and roughness, designed according to human visual perception. Over the years the first three have been the most widely used: coarseness gives information about the size of the texture elements; the higher the coarseness value, the rougher the texture. Contrast is designed to evaluate changes in picture quality, not picture structure, considering the stretching or shrinking of its grey scale. The grey level distribution also influences the sharpness of the edges in the images: in fact, a higher contrast implies sharper edges. Directionality measures the presence of orientation in the texture; what matters is not the orientation of the patterns in the figures but simply the total degree of directionality, i.e., two patterns which differ only in orientation should have the same degree of directionality.

Figure 2.6: Examples of different scales textures

Another texture feature, the co-occurrence texture matrix [39], works in a different way. The entries of this matrix are evaluated by counting how often an intensity variation between two adjacent pixels along a certain direction occurs in the image. This makes it possible to take into account how much, and along which orientation, the edges change in the picture. A minimal sketch of its computation is given below.
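A minimal sketch of the co-occurrence computation just described, assuming a greyscale image and non-negative displacements; the quantization and normalization choices are ours, not the exact descriptor of [39].

```python
import numpy as np

def cooccurrence_matrix(gray, levels=8, dx=1, dy=0):
    """Grey-level co-occurrence matrix for one displacement (dx, dy >= 0).

    Counts how often a pixel of (quantized) intensity i has, at offset
    (dx, dy), a pixel of intensity j; the matrix is then normalized to
    relative frequencies."""
    g = (gray.astype(np.int64) * levels) // 256   # quantize to `levels` bins
    h, w = g.shape
    a = g[0:h - dy, 0:w - dx].ravel()             # reference pixels
    b = g[dy:h, dx:w].ravel()                     # displaced pixels
    m = np.zeros((levels, levels), dtype=np.float64)
    np.add.at(m, (a, b), 1.0)
    return m / m.sum()
```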

As for color, it is possible to extract a texture histogram: the Edge Histogram Descriptor [1], in fact, represents the spatial distribution of five types of edges: vertical, horizontal, 45°, 135°, and non-directional. The feature vector is obtained by dividing the image into 16 (4 × 4) sub-images and generating a 5-bin histogram for each one.

Figure 2.7: Different types of edges represented by the Edge Histogram Descriptor

In order to increase the discriminative capability of the features, it has been proposed to enclose different types of characteristics in a unique vector. In this light the Color and Edge Directivity Descriptor (CEDD) [13] has been proposed.


Figure 2.8: SIFT points

CEDD is a histogram requiring low computational power for its extraction. The CEDD histogram is constituted by 6 regions from which the texture features are evaluated, and each region is constituted by 24 individual regions from which the color features are derived.

2.2.3 Interest points

In the last few years, interest points have attracted increasing attention in the scientific community. This term indicates locations that, due to sudden changes in brightness or because they lie on the edges of the objects represented in the images, characterize the image in a particular way [40]. The advantage of this feature is that it is invariant to changes of scale, rotation, and translation. In fact, considering only the main points of the represented object, where they are located in the figure is negligible. For this reason, from images that depict similar objects it is possible to extract similar interest points. These interest points are then represented by means of numerical vectors. Many types of descriptors have been proposed in the literature [67,68], but the one that has emerged over the years is the Scale-invariant Feature Transform (SIFT) [65]. In the following, the expression “local features” will refer to SIFT descriptors extracted at Harris interest points (see Figure 2.8) [27,25]; a hedged sketch of this extraction is given below.
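As an illustration of these local features, the sketch below computes SIFT descriptors at the strongest Harris corner locations with OpenCV. It assumes an OpenCV build that ships SIFT (opencv-python ≥ 4.4); the keypoint selection and the fixed keypoint size are simplifications of ours, not the exact detector settings used in [27,25].

```python
import cv2
import numpy as np

def harris_sift_descriptors(path, max_points=200):
    """SIFT descriptors computed at Harris corner locations (a sketch)."""
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    harris = cv2.cornerHarris(np.float32(gray), blockSize=2, ksize=3, k=0.04)
    # take the locations with the strongest Harris response as keypoints
    flat = np.argsort(harris, axis=None)[::-1][:max_points]
    ys, xs = np.unravel_index(flat, harris.shape)
    keypoints = [cv2.KeyPoint(float(x), float(y), 8.0) for x, y in zip(xs, ys)]
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.compute(gray, keypoints)
    return descriptors        # one 128-dimensional vector per point kept
```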

2.3 Metrics

Over the years, a large number of different similarity measures have been proposed by the research community. The choice of similarity measure depends on the chosen image descriptor: some descriptors can be used with “standard” metrics, while others require suitably designed measures. In this section a selection of widely used “general-purpose” measures, and some metrics designed for the features described above, will be discussed [86].

A distance (or metric) on a space F is any function

d : F \times F \to \mathbb{R}, \quad (x, y) \mapsto d(x, y)

such that, \forall x, y, z \in F:

• d(x, y) \geq 0,

• d(x, y) = 0 \iff x = y,

• d(x, y) = d(y, x),

• d(x, y) + d(y, z) \geq d(x, z).

Given any two images and their representations in an N-dimensional feature space

x = [x_1, \ldots, x_f, \ldots, x_N]; \quad y = [y_1, \ldots, y_f, \ldots, y_N];

the most common metrics are:

• Euclidean metric (or L2)

d(x, y) = \sqrt{\sum_{f=1}^{N} (x_f - y_f)^2};  (2.1)

• Weighted Euclidean metric

d(x, y) = \sqrt{\sum_{f=1}^{N} w_f (x_f - y_f)^2}  (2.2)

where w_f weights each component differently according to its relevance. If every w_f is equal to one, the distance reduces to Eq. (2.1);

• Generalized Euclidean metric

d(x, y) = \sqrt{(x - y)^T W (x - y)}  (2.3)

where W is an N \times N correlation matrix. If W is diagonal the distance reduces to Eq. (2.2); if it is the identity matrix, to Eq. (2.1).

These metrics are usually used for the Tamura and Co-occurrence Matrix descriptors.

• Manhattan metric (or L1)

d(x, y) = \sum_{f=1}^{N} |x_f - y_f|;  (2.4)

• Weighted Manhattan metric

d(x, y) = \sum_{f=1}^{N} w_f |x_f - y_f|  (2.5)

where the weight w_f has the same function as in Eq. (2.2).

The Manhattan metric is probably the most widely used similarity measure: the distances for many descriptors are evaluated according to it. In particular, in this thesis it will be used for the Color Histogram, Color Layout Histogram, Scalable Color, and SIFT descriptors. A minimal implementation of the general-purpose metrics above is sketched below.
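A minimal implementation of the general-purpose metrics of Eqs. (2.1)–(2.5), assuming that x, y, and the weight vector w are NumPy arrays of equal length N, and that W is an N × N matrix:

```python
import numpy as np

def euclidean(x, y):                      # Eq. (2.1)
    return np.sqrt(np.sum((x - y) ** 2))

def weighted_euclidean(x, y, w):          # Eq. (2.2)
    return np.sqrt(np.sum(w * (x - y) ** 2))

def generalized_euclidean(x, y, W):       # Eq. (2.3)
    d = x - y
    return np.sqrt(d @ W @ d)

def manhattan(x, y):                      # Eq. (2.4)
    return np.sum(np.abs(x - y))

def weighted_manhattan(x, y, w):          # Eq. (2.5)
    return np.sum(w * np.abs(x - y))
```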


• CEDD metric

For measuring the distance between the CEDD vectors of two images, the Tanimoto coefficient [18] is used [13]:

d(x, y) = \frac{x^T y}{x^T x + y^T y - x^T y}  (2.6)

In the case of absolute congruence of the vectors the measure takes the value 1, while at the maximum deviation it tends to zero.
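A one-line sketch of the Tanimoto-based measure of Eq. (2.6), mirroring the behaviour described above:

```python
import numpy as np

def tanimoto(x, y):
    """Eq. (2.6): 1 for identical vectors, tending to 0 at maximum deviation."""
    return (x @ y) / (x @ x + y @ y - x @ y)
```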

• Color Layout metric

The distance between two Color Layout descriptors is calculated as follows [1,59]:

d(x, y) = \sqrt{\sum_{i \in (Y)} w1_i (x_i - y_i)^2} + \sqrt{\sum_{i \in (Cb)} w2_i (x_i - y_i)^2} + \sqrt{\sum_{i \in (Cr)} w3_i (x_i - y_i)^2};  (2.7)

where Y, Cb, and Cr denote the sets of elements of the Color Layout vector that are associated with the Y, Cb, and Cr components, respectively, and w1_i, w2_i, and w3_i are the weights for the i-th component of the vector.

• Edge Histogram metric

Because the 80 bins may not be sufficient to yield efficient image matching, in order to evaluate a similarity measure an additional global edge histogram and some semi-global edge histograms are generated directly from the original 80 bins. The global edge histogram has 5 bins, and each bin value is obtained by accumulating the bin values of the corresponding 5 edge types. Similarly, for the semi-global edge histograms, it is possible to group some subsets of the original histogram: 13 different subsets of the image are defined, and for each subset the edge distributions for the five different edge types are extracted from the 80 histogram bins (see Figure 2.9). Consequently, there are a total of 150 bins (80 original + 5 global + 65 = 13 × 5 semi-global) for the similarity matching. The distance between two Edge Histogram descriptors is calculated as follows [1,115]:

d(x, y) = \sum_{i=0}^{79} |x_i^{local} - y_i^{local}| + 5 \times \sum_{i=0}^{4} |x_i^{global} - y_i^{global}| + \sum_{i=0}^{64} |x_i^{semi\text{-}global} - y_i^{semi\text{-}global}|  (2.8)

where x_i^{local} represents the i-th component of the original vector of image x, x_i^{global} represents the i-th component of the global edge histogram of the same image, and x_i^{semi-global} is the i-th bin value of the semi-global edge histograms of image x. Because the number of bins of the global histogram is relatively small with respect to the local and semi-global ones, a weighting factor of 5 is applied. A sketch of this distance, assuming the three histograms are already available, is given below.
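A sketch of the distance of Eq. (2.8), assuming the 80-bin local, 5-bin global, and 65-bin semi-global histograms have already been derived for both images:

```python
import numpy as np

def edge_histogram_distance(x_local, y_local, x_global, y_global,
                            x_semi, y_semi):
    """Eq. (2.8): L1 distance over the 80 local bins, the 5 global bins
    (weighted by 5), and the 65 semi-global bins."""
    return (np.abs(x_local - y_local).sum()
            + 5 * np.abs(x_global - y_global).sum()
            + np.abs(x_semi - y_semi).sum())
```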

2.4 Evaluation

To evaluate CBIR systems, several performance evaluation measures have been proposed [69]. The most widely used are surely precision and recall.



Figure 2.9: Subsets for semi-global histograms

The retrieval precision p_q for a given query q is the ratio between the number of relevant images found by the system at a given iteration i and the total number of images retrieved in the same iteration:

p_q = \frac{\text{Number of relevant images retrieved}}{\text{Total number of images retrieved}}  (2.9)

As can be seen from Eq. (2.9), p_q measures the capability of correct retrieval of the system in a single iteration. The mean percentage of precision over the total number Q of queries is:

p_{(\%)} = \frac{1}{Q} \sum_{q=1}^{Q} p_q \cdot 100.  (2.10)

Recall

The recall r_q is instead defined as the ratio between the number of relevant images retrieved by the system up to a certain iteration i for a query q and the number of relevant images for the same query that are in the database:

r_q = \frac{\text{Number of relevant images retrieved}}{\text{Total number of relevant images}}  (2.11)

Through r it is possible to evaluate the system’s ability to retrieve relevant images at each new iteration. A variation of the recall measure can also be obtained by replacing the denominator of Eq. (2.11) with the minimum between the number of relevant images in the dataset and the total number of images evaluated by the user. In this way the maximum number of relevant images that can actually be retrieved is taken into account. The mean percentage of recall over the total number of queries is:

r_{(\%)} = \frac{1}{Q} \sum_{q=1}^{Q} r_q \cdot 100.  (2.12)

F-measure

It is also possible to combine precision and recall in a unique performance measure: the F-measure [51]

F = \frac{2 \cdot p \cdot r}{p + r}.  (2.13)


The average precision in some respects summarizes precision and recall: given a query q, it is the mean of the precision values computed after each relevant image is retrieved,

AP_q = \frac{1}{N_R} \sum_{i=1}^{N_R} p(i)  (2.14)

where N_R is the total number of relevant images for the query q. The mean average precision (MAP) is the mean of the average precision over all queries:

MAP = \frac{1}{Q} \sum_{q=1}^{Q} AP_q.  (2.15)

A minimal sketch of these measures is given below.
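A minimal sketch of the measures of Eqs. (2.9)–(2.15), computed from per-query relevance judgements in rank order; the array-based formulation is ours:

```python
import numpy as np

def precision_recall(retrieved_relevance, n_relevant_total):
    """Eqs. (2.9) and (2.11): `retrieved_relevance` is a boolean array
    with one entry per retrieved image, True if that image is relevant;
    `n_relevant_total` is the number of relevant images in the database."""
    rel = np.asarray(retrieved_relevance, dtype=bool)
    return rel.sum() / len(rel), rel.sum() / n_relevant_total

def f_measure(p, r):                      # Eq. (2.13)
    return 2 * p * r / (p + r)

def average_precision(ranked_relevance, n_relevant_total):
    """Eq. (2.14): mean of the precision values taken after each
    relevant image in the ranking."""
    rel = np.asarray(ranked_relevance, dtype=bool)
    hits = np.cumsum(rel)                           # relevant found so far
    precisions = hits[rel] / (np.flatnonzero(rel) + 1)   # precision at hits
    return precisions.sum() / n_relevant_total

# MAP (Eq. (2.15)) is then the mean of average_precision over all queries.
```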

2.5 Datasets

The techniques that will be shown in the following chapters have been tested using several datasets, which are presented in the next subsections.

2.5.1 Corel SMALL

It is a subset of the Corel dataset obtained from the UCI KDD repository³. It consists of 19511 images that have been manually subdivided into 42 semantic classes with a large variability in size: from 96 to 1544 images per class (see Figure 2.10). Even if the Corel dataset is never used in its entirety, different subsets of the complete stock are very common in the scientific community.

Figure 2.10: Images per class in Corel SMALL dataset


2.5.2 WANG dataset

The WANG dataset⁴ is another subset of Corel: it consists of 1000 images of the Corel stock photo database, manually selected to form 10 classes of 100 images each [112] (see Figure 2.11). The WANG database is usually considered an easy task in the image retrieval context.

Figure 2.11: Example images from WANG dataset

2.5.3 MSRC dataset

The Microsoft Research Cambridge Object Recognition Image database⁵ (in the following referred to as MSRC) contains 3007 images subdivided into 17 “main” classes, each of which is further subdivided into subclasses, for a total of 33 semantic classes [114]. It is mainly used in object recognition: the pictures usually contain one object “captured” from a particular point of view (front, side, rear, etc.), or at most several objects of the same type. Examples of objects from two of the “main” classes (car and flower) are given in Figure 2.12.

Figure 2.12: Example images from MSRC dataset

2.5.4 Caltech-256 dataset

The Caltech-256 dataset, from the California Institute of Technology⁶, consists of 30607 images subdivided into 257 semantic classes [37]. Like MSRC it is widely used in object recognition, and it is considered a difficult task. In Figure 2.13 it is possible to see some examples from difficult classes: in the first row the pictures belong to the cactus class, in the second to the birdbath one. It is possible to see how much the images within the same class differ from each other.

⁴ http://wang.ist.psu.edu/docs/related.shtml
⁵ http://research.microsoft.com/downloads


Figure 2.13: Example images from Caltech-256 dataset

2.5.5 MIRflickr-25000 dataset

The MIRflickr-25000 collection [52] consists of 25000 images downloaded from the social photography site Flickr through its public API. It is a multi-tagged dataset with an average of 8.94 tags per image. All images include the tags that the original photographer assigned to them, and the collection contains a total of 1386 tags which occur in at least 20 images. Moreover, some manual annotations are also available for the images. MIRflickr was proposed only a few years ago but is enjoying great success in the scientific community. Recently an updated dataset with 1,000,000 images has been proposed [53].


Chapter 3

Semantic gap and other open issues

As the years go by, it is ever easier to have access to an ever greater amount of electronically archived images. As a consequence, there is an increasing need for tools enabling the semantic search, classification, and retrieval of images. As mentioned above, the use of meta-data associated with the images solves the problem only partly, as the process of assigning meta-data to images is not trivial, is slow, and is closely related to the persons who performed the task. This is especially true for retrieval tasks in very large databases, where images exhibit high variability in semantics. It turns out that the description of image content tends to be intrinsically subjective and partial, and the search for images based on keywords may fit users’ needs only partially. The main reason for the difficulty in devising effective image retrieval and classification tools is the vast amount of information conveyed by images, and the related subjectivity of the criteria to be used to assign labels to them [94,62,24,105].

This kind of problem is called the semantic gap, and it is precisely due to the different ways in which human beings and machines interpret images. For humans, images arouse emotions, memories, or reflections; for a computer, they are simple sets of pixels from which numerical values are extracted. In order to capture such subjectivity, image retrieval tools may employ so-called relevance feedback [82,121]. In relevance feedback (RF) techniques, the user is involved in the process of refining the search. In a CBIR task in which RF is applied (see Figure 3.1), the user submits to the system a query image, that is, an example of the pictures of interest (Fig. 3.1 (1)); starting from the query, the system gives each image in the database a score related to a similarity measure between the image and the query. A certain number of best-scored images are returned to the user (Fig. 3.1 (2)), who labels them as relevant or not (Fig. 3.1 (3)) and re-submits them as new “query” examples (Fig. 3.1 (4)). With this new information the system can improve the search and give a more accurate result at the next iteration. The skeleton of this loop is sketched below.
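The loop of Figure 3.1 can be summarized by the following skeleton, where score_fn stands for any of the learning methods discussed in this chapter and label_fn stands in for the user’s judgement; both names, like the whole interface, are placeholders of ours:

```python
import numpy as np

def relevance_feedback_search(features, query_idx, score_fn, label_fn,
                              n_shown=20, n_iter=4):
    """Skeleton of the relevance feedback loop of Figure 3.1.

    `score_fn(features, relevant, non_relevant)` assigns a relevance
    score to every image in the database; `label_fn(shown)` returns the
    subset of the shown indices judged relevant by the user."""
    relevant, non_relevant = {query_idx}, set()
    for _ in range(n_iter):
        scores = score_fn(features, relevant, non_relevant)   # step (2)
        shown = np.argsort(scores)[::-1][:n_shown]
        marked = set(label_fn(shown))                         # step (3)
        relevant |= marked                                    # step (4)
        non_relevant |= set(shown) - marked
    return np.argsort(score_fn(features, relevant, non_relevant))[::-1]
```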

The main distinctions that can be made between different relevance feedback approaches concern the degree of relevance that the user can express and the type of learning system: long or short-term, and active or passive. The judgement that the user expresses on an image can be given either by assigning a score, i.e. expressing “how much” a certain image is similar to the one she desires, or by a definite judgement of relevance or non


Figure 3.1: Relevance Feedback steps: (1) Query by example; (2) Returned best similarity scored images; (3) User’s judgement; (4) New query examples

relevance. Even if with the latter type of choice all the nuances that the human is able to capture in the similarity evaluation are inevitably lost, it is the most widely used method. In fact, it is the one that allows a reduced complexity in the retrieval systems and requires less work from the user.

The type of learning, as already mentioned, can be either passive or active. In the first case, the system returns to the user the images that it “judges” the most relevant; in the second case, relying on the cooperation of the user, the system returns those on which it has the greatest uncertainty and from which, therefore, once they are assessed by the user, it gets more information. This second type usually obtains better results, but requires a greater effort from the user, who is almost never willing to repeat the evaluation process more than four or five times. In the following, mainly passive learning will be considered. Active learning techniques will be discussed in Section 3.5.1.

A further distinction in the type of learning is between short and long-term. The first implies that the system, to show the user the most significant images, takes into account only the RF iterations just made by that user. The second takes into account all the queries of the same type carried out in the past by different users. This second type of learning, in spite of its potential for better performance, has the disadvantage that, since each user expresses her judgement in a subjective manner, the opinions given by one user may not always bring a real “help” to the search of another one.

The core of a retrieval system, obviously, is the algorithm that learns which images in the database the user is interested in, by analysing the query image and any feedback. Some methods directly assign a relevance score to each image in the database [31], others find a hyperplane that separates the relevant images from the non relevant ones [118], and others


map the images onto a weighted graph [56] from which a cluster of relevant images is extracted. The methods can also differ in the way they approach the problem of classification (broadly speaking) of images. Besides the most common two-class approach, where a model is built that classifies an image either as positive or as negative, one-class approaches have also been proposed, where the model is built only for the relevant class [99]. These techniques are well known in the pattern recognition field, but they are not always very suitable for content based image retrieval applications. Starting from the consideration that “all positive examples are alike; each negative example is negative in its own way” [120], a (1+x)-class approach has been proposed.

The next section shows a brief review of some state-of-the-art relevance feedback techniques, such as query shifting, support vector machines, (dis)similarity spaces, and the nearest neighbor approach. Then, some of the most common problems currently investigated by researchers and faced in this thesis will be described. In particular, the problems of unbalanced learning, high dimensional feature spaces, the low capability to evaluate similarity between all relevant images, low informative training sets, and the low discriminative capability of the feature set have been investigated. In addition, some solutions proposed by the scientific community to resolve them, and some of our innovative proposals, will be shown.

3.1 Relevance Feedback techniques

3.1.1 Query Shifting Techniques

As mentioned in the Chapter2the similarity measure can be expressed by means of dis-tances in the feature space, a family of techniques to improve the performance have adopted the so called query shifting paradigm. A simple approach to finding more relevant images from the user’s feedback is based on a well known technique in Information Retrieval: the Rocchio’s formula [80]. This method, initially developed for the retrieval of text documents has been then adapted to image retrieval techniques [83]. This technique is based on the concept that the relevant images and the non relevant ones should form, in the feature space, distinct group. Then moving the queries in these areas more densely populated by relevant images should be easier to retrieve them. The approach is based on the minimization of a function that approximated the relevance of all the images of the database. Given a query q, the new query q0is evaluated as

q' = \alpha \, q + \beta \left( \frac{1}{r} \sum_{x_i \in R} x_i \right) - \gamma \left( \frac{1}{n} \sum_{x_j \in N} x_j \right) \qquad (3.1)

where x_i and x_j represent the images in the sets of relevant (R) and non-relevant (N) images, respectively, whose cardinalities are r and n, respectively; α, β, and γ are parameters to be fixed.
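For illustration, the Rocchio update of Eq. (3.1) can be sketched in a few lines of NumPy; the default values of α, β, and γ below are illustrative assumptions, since Eq. (3.1) leaves them as free parameters.

import numpy as np

def rocchio_update(query, relevant, non_relevant,
                   alpha=1.0, beta=0.75, gamma=0.25):
    """Move the query towards the mean of the relevant images and away
    from the mean of the non-relevant ones, as in Eq. (3.1)."""
    new_query = alpha * query
    if len(relevant) > 0:          # R may be empty at the first iteration
        new_query = new_query + beta * relevant.mean(axis=0)
    if len(non_relevant) > 0:
        new_query = new_query - gamma * non_relevant.mean(axis=0)
    return new_query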

In [33] the authors proposed a modification of the original approach: they move the query according to the position of the mean points of the relevant and non-relevant images. Let x and q be two feature vectors representing an image in a d-dimensional feature space and the initial query provided by the user to perform the k-NN search, respectively. Let N_r(q) and N_{nr}(q) be the sets of relevant and non-relevant images, respectively, contained in the neighbourhood N(q) of q. The mean vectors of the relevant and non-relevant images of N(q), m_r and m_n, can be computed as follows:

m_r = \frac{1}{r} \sum_{x \in N_r(q)} x, \qquad m_n = \frac{1}{n} \sum_{x \in N_{nr}(q)} x \qquad (3.2)

where r and n are the sizes of the relevant and non-relevant image sets, respectively (r + n = k). The two normal distributions of relevant and non-relevant images are characterized by the means m_r and m_n, respectively, and by an equal variance σ² computed as follows:

\sigma^2 = s_W \cdot s_B \qquad (3.3)

where

s_W^2 = \frac{1}{k} \left( \frac{r}{r-1} \sum_{x \in N_r(q)} (x - m_r)^T (x - m_r) + \frac{n}{n-1} \sum_{x \in N_{nr}(q)} (x - m_n)^T (x - m_n) \right) \qquad (3.4)

s_B^2 = (m_r - m_n)^T (m_r - m_n) \qquad (3.5)

In other words, σ is computed as the geometric mean of the within-scatter s_W and the between-scatter s_B. According to [33] the new “shifted” query can be evaluated as

q_{BQS} = m_r + \frac{\sigma}{\|m_r - m_n\|} \left( 1 - \frac{r - n}{\max(r, n)} \right) (m_r - m_n) \qquad (3.6)

In the following, among the different versions of the query shifting approach that have been proposed in the literature, BQS will be used, as it was proposed by our research group.
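A minimal NumPy sketch of the whole BQS computation, Eqs. (3.2)–(3.6), may help to fix the ideas; it assumes that at least two relevant and two non-relevant images are available in the neighbourhood, so that the scatter terms are well defined.

import numpy as np

def bqs_query(relevant, non_relevant):
    """Bayes Query Shifting: relevant and non_relevant are (r, d) and
    (n, d) arrays of images in the neighbourhood of the query."""
    r, n = len(relevant), len(non_relevant)
    k = r + n
    m_r = relevant.mean(axis=0)                          # Eq. (3.2)
    m_n = non_relevant.mean(axis=0)
    # within-scatter, Eq. (3.4), and between-scatter, Eq. (3.5)
    s_w2 = (r / (r - 1) * ((relevant - m_r) ** 2).sum()
            + n / (n - 1) * ((non_relevant - m_n) ** 2).sum()) / k
    s_b2 = ((m_r - m_n) ** 2).sum()
    # sigma is the geometric mean of the two scatters, Eq. (3.3)
    sigma = (np.sqrt(s_w2) * np.sqrt(s_b2)) ** 0.5
    # shifted query, Eq. (3.6)
    shift = 1.0 - (r - n) / max(r, n)
    return m_r + sigma / np.sqrt(s_b2) * shift * (m_r - m_n)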

3.1.2 Support Vector Machines

Differently from the previous methods, which are essentially based on density estimation, there is another family of relevance feedback techniques based on discriminative training, i.e., methods that learn from a set of positively and negatively labelled images how to classify the unlabelled ones. In this family, a leading role is surely played by Support Vector Machines (SVM) [111, 23]. The idea behind the SVM is to find a hyperplane in the feature space that divides it into two subspaces, the first populated by positive samples, the second by negative ones. The patterns closest to the hyperplane are called support vectors. The greater the distance (margin) between these vectors and the hyperplane, the greater the reliability of the classifier. Given a set of linearly separable training samples x_i \in R^n, the general form of the linear classification function is

f(x) = w \cdot x + b \qquad (3.7)

which corresponds to a separating hyperplane w \cdot x + b = 0. By solving a constrained quadratic programming problem it is possible to find the vector w that maximizes \frac{1}{\|w\|}, that is, the distance from the closest point to the hyperplane. The solution w has an expansion in terms of a subset of patterns that are the nearest to the hyperplane and are called support vectors. The function in Eq. (3.7) can now be written as

f(x) = \sum_i \alpha_i \, x_i \cdot x + b \qquad (3.8)

The sign of this function permits the classification of the pattern x. In its “plain” form the SVM is a linear classifier, but it can also be used in non-linear problems by means of the so-called kernel trick. The kernel trick transforms the feature space into a higher dimensional space through a function φ(·) [23], in which it is possible to find a hyperplane that separates the patterns [87]. The dot product in Eq. (3.8) can be represented as

\phi(x_i) \cdot \phi(x) = K(x_i, x) \qquad (3.9)

Finally, the classification function can now be written as

f(x) = \sum_i \alpha_i \, K(x_i, x) + b \qquad (3.10)

The SVM has been widely used in image retrieval [118], also in one-class problems [16]. In order to address its low performance in the case of an asymmetric training set, a solution has also been proposed where several classifiers are learnt on bags containing a balanced number of positive and negative examples [98].
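A typical SVM-based feedback round can be sketched as follows; the use of scikit-learn and of an RBF kernel is an assumption made here only for illustration, since any SVM implementation and kernel could be employed.

import numpy as np
from sklearn.svm import SVC

def svm_feedback_round(relevant, non_relevant, unlabelled, n_results=20):
    """Train an SVM on the user's feedback and rank the unlabelled images
    by their signed distance from the separating hyperplane."""
    X = np.vstack([relevant, non_relevant])
    y = np.hstack([np.ones(len(relevant)), -np.ones(len(non_relevant))])
    clf = SVC(kernel='rbf', C=1.0, gamma='scale')   # kernel trick, Eq. (3.10)
    clf.fit(X, y)
    scores = clf.decision_function(unlabelled)      # signed distance, cf. Eq. (3.8)
    return np.argsort(-scores)[:n_results]          # most confidently relevant first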

3.1.3 Nearest Neighbor Approach

Other approaches estimate the relevance of an image according to the relevant and non-relevant images in its neighborhood. The use of the nearest neighbor paradigm is motivated by its use in a number of different pattern recognition fields where it is difficult to produce a high-level generalization of a class of objects, but where neighborhood information is available [3, 28]. In [31] a score is assigned to each image of a database according to its distance from the nearest image belonging to the target class and the distance from the nearest image belonging to a different class. This score is further combined with a score related to the distance of the image from the region of relevant images. The combined score is computed as follows:

rel_{stab}(x) = \frac{n/t}{1 + n/t} \cdot rel_{BQS}(x) + \frac{1}{1 + n/t} \cdot rel_{NN}(x) \qquad (3.11)

where n and t are the number of non-relevant images and the whole number of images retrieved after the last iteration, respectively. The two terms rel_{NN} and rel_{BQS} are computed as follows:

rel_{NN}(x) = \frac{\|x - NN^{nr}(x)\|}{\|x - NN^{r}(x)\| + \|x - NN^{nr}(x)\|} \qquad (3.12)

where NN^{r}(x) and NN^{nr}(x) denote the relevant and the non-relevant nearest neighbor of x, respectively, and \|\cdot\| is the metric defined in the feature space at hand,

rel_{BQS}(x) = \frac{1 - e^{\,1 - d_{BQS}(x)/\max_i d_{BQS}(x_i)}}{1 - e} \qquad (3.13)

where e is Euler's number, i is the index ranging over all images in the database, and d_{BQS} is the distance of image x from a modified query vector computed according to Bayes decision theory (Bayes Query Shifting, BQS) [33] (see Section 3.1.1).
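The combined score of Eqs. (3.11)–(3.13) can be sketched as follows; the BQS-shifted query q_bqs is assumed to have been computed as in Section 3.1.1.

import numpy as np

def nn_bqs_relevance(X, relevant, non_relevant, q_bqs, n, t):
    """Return the stabilised relevance score of Eq. (3.11) for each
    row of X, given the feedback gathered so far."""
    # distances from the nearest relevant / non-relevant neighbour
    d_r = np.linalg.norm(X[:, None] - relevant[None], axis=2).min(axis=1)
    d_nr = np.linalg.norm(X[:, None] - non_relevant[None], axis=2).min(axis=1)
    rel_nn = d_nr / (d_r + d_nr)                                  # Eq. (3.12)
    d_bqs = np.linalg.norm(X - q_bqs, axis=1)
    rel_bqs = (1 - np.exp(1 - d_bqs / d_bqs.max())) / (1 - np.e)  # Eq. (3.13)
    w = (n / t) / (1 + n / t)
    return w * rel_bqs + rel_nn / (1 + n / t)                     # Eq. (3.11)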


3.1.4 Some other techniques

Strictly related to the SVM is the on-line binary classification problem, where a weight vector is used as the classification function [9]. On-line learning is particularly suitable for facing RF problems due to the on-line nature of such problems. Moreover, with the on-line approach it is possible to obtain a very fast and compact RF method, a property that is very important in a real CBIR scenario. Section 4.2 will show a new application of this technique to relevance feedback in the content-based image retrieval field.
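As an illustration of the on-line setting, the following perceptron-like sketch shows the general scheme, in which the classification function reduces to a weight vector updated after each user judgement; it is only a schematic example, not the specific algorithm of [9].

import numpy as np

def online_update(w, x, label, eta=0.1):
    """One on-line round: if the current weight vector w misclassifies
    the image x (label = +1 relevant, -1 non-relevant), update it."""
    if label * np.dot(w, x) <= 0:      # mistake-driven update
        w = w + eta * label * x
    return w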

Despite its wide use, achieving high performance with an SVM can be far from easy; in fact, it depends on the tuning of a number of parameters that are often set after a long series of tests. Rather than splitting the feature space into subspaces populated only by relevant or non-relevant images, or estimating the distributions of the two image “classes”, sometimes what is required of a retrieval system is to provide a list of images in order of relevance (passive learning) or in order of informative capacity (active learning), i.e., according to the importance assigned to them by the user for that specific kind of query [100]. In order to evaluate this importance it is necessary to know what the user is looking for and to exploit her feedback to find other “similar” images.

What does “similar” mean? This is a question that only the user can answer, and for this reason many similarity measures have been proposed over the years. As mentioned in Section 2.3, the easiest way to evaluate similarity is to introduce a metric in the feature space [86]. The pair-wise distances between patterns have also been used in the pattern recognition field to build a (dis)similarity space [74] in which the distances between patterns of the same class are smaller than those between patterns of different classes. A similar approach has been proposed [32] and used [70] to exploit relevance feedback in the content-based image retrieval field. Other techniques use the user's feedback to find a distance metric such that the distance in the low-level features is consistent with the user's relevance judgements [92]. Such a metric should minimize the distance between the feature vectors of similar images while maximizing the distance between those of dissimilar images.
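As a concrete, if simplistic, instance of this idea, the sketch below re-weights each feature by the inverse of its standard deviation over the relevant images, so that the features on which the relevant examples agree dominate the distance; this is only an illustration of the general principle, not the specific algorithm of [92].

import numpy as np

def reweighted_distance(x, y, relevant, eps=1e-6):
    """Weighted Euclidean distance whose weights are learnt from feedback:
    features with low variance among the relevant images get higher weight."""
    w = 1.0 / (relevant.std(axis=0) + eps)
    w = w / w.sum()                            # normalise the weights
    return np.sqrt((w * (x - y) ** 2).sum())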

3.2 Unbalanced learning

3.2.1 Artificial patterns generation

One of the most severe problems in the design of the classifier is the imbalance between the number of samples of the class the user is interested in and all the other images of the database that share some characteristics with that class. In the machine-learning literature the imbalance problem has been widely investigated, and solutions based either on under-sampling the majority class or on over-sampling the minority class have been proposed [41]. However, it is easy to see that these solutions may produce a distortion of the “real” distribution of the classes. In the image retrieval domain, some solutions proposed so far involve the reduction of the set of images that are non-relevant to the user's interest through the creation of bootstrap samples from the set of non-relevant images [98]. An ensemble of balanced training sets is thus created, each made up of all the available relevant images and one of the bootstrap samples. This solution is computationally quite expensive, as an ensemble of classifiers has to be created, and the choice of the most appropriate set of parameters for the various parts of the system is far from being a trivial task. Recently, some papers addressed the imbalance problem by proposing the generation of artificial patterns as in
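A common way to over-sample the minority class is to generate synthetic relevant patterns by interpolating between pairs of relevant images (in the spirit of SMOTE-like over-sampling); the sketch below is a minimal illustration of this idea and assumes that at least two relevant images are available.

import numpy as np

def oversample_relevant(relevant, n_new, rng=None):
    """Generate n_new artificial relevant patterns, each lying on the
    segment between two randomly chosen relevant images."""
    rng = rng or np.random.default_rng()
    synthetic = []
    for _ in range(n_new):
        i, j = rng.choice(len(relevant), size=2, replace=False)
        lam = rng.random()                     # random point on the segment
        synthetic.append(relevant[i] + lam * (relevant[j] - relevant[i]))
    return np.vstack(synthetic)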
