Coding Local and Global Binary Visual Features Extracted from Video Sequences

(1)

Coding local and global binary visual features

extracted from video sequences

Luca Baroffio, Antonio Canclini, Matteo Cesana, Alessandro Redondi, Marco Tagliasacchi, Stefano Tubaro

Abstract—Binary local features represent an effective alterna-tive to real-valued descriptors, leading to comparable results for many visual analysis tasks, while being characterized by significantly lower computational complexity and memory re-quirements. When dealing with large collections, a more compact representation based on global features is often preferred, which can be obtained from local features by means of, e.g., the Bag-of-Visual Word (BoVW) model. Several applications, including for example visual sensor networks and mobile augmented reality, require visual features to be transmitted over a bandwidth-limited network, thus calling for coding techniques that aim at reducing the required bit budget, while attaining a target level of efficiency. In this paper we investigate a coding scheme tailored to both local and global binary features, which aims at exploiting both spatial and temporal redundancy by means of intra- and inter-frame coding. In this respect, the proposed coding scheme can be con-veniently adopted to support the “Analyze-Then-Compress” (ATC) paradigm. That is, visual features are extracted from the acquired content, encoded at remote nodes, and finally transmitted to a central controller that performs visual analysis. This is in contrast with the traditional approach, in which visual content is acquired at a node, compressed and then sent to a central unit for further processing, according to the “Compress-Then-Analyze” (CTA) paradigm. In this paper we experimentally compare ATC and CTA by means of rate-efficiency curves in the context of two different visual analysis tasks: homography estimation and content-based retrieval. Our results show that the novel ATC paradigm based on the proposed coding primitives can be competitive with CTA, especially in bandwidth limited scenarios.

Keywords—Visual features, binary descriptors, BRISK, Bag-of-Words, video coding.

I. INTRODUCTION

Visual analysis is often performed extracting a feature-based representation from the raw pixel domain. Indeed, visual features are being successfully exploited in a broad range of visual analysis tasks, ranging from image/video retrieval and classification, to object tracking and image registration. They provide a succinct, yet effective, representation of the visual content, while being invariant to many transformations.

Several visual analysis applications (e.g., distributed moni-toring and surveillance in visual sensor networks, mobile visual Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Milan, Italy, email: name.surname@polimi.it

The project GreenEyes acknowledges the financial support of the Future and Emerging Technologies (FET) programme within the Seventh Framework Programme for Research of the European Commission, under FET-Open grant number: 296676.

The material in this paper has been partially presented in [1].

Copyright (c) 2015 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to pubs-permissions@ieee.org.

search and augmented reality, etc.) require visual content to be transmitted over a bandwidth-limited network. The traditional approach, denoted hereinafter as “Compress-Then-Analyze” (CTA), consists in the following steps: the visual content is acquired by a sensor node in the form of still images or video sequences; then, it is encoded and efficiently transmitted to a central unit where visual feature extraction and analysis takes place. The central unit relies on a lossy representation of the acquired content, potentially leading to impaired performance. Furthermore, such a paradigm might lead to an inefficient management of bandwidth and storage resources, since a complete pixel-level representation might be unnecessary.

In this respect, “Analyze-Then-Compress” (ATC) represents an alternative approach to visual analysis in a networked scenario. Such a paradigm aims at moving part of the analysis from the central unit directly to sensing nodes. In particular, nodes process visual content in order to extract relevant infor-mation in the form of visual features. Then, such inforinfor-mation is compressed and sent to a central unit, where visual analysis takes place. The key tenet is that the rate necessary to encode visual features in ATC might be less than the rate needed for the original visual content in CTA, when targeting the same level of efficiency in the visual analysis. This is particularly relevant in those applications in which visual analysis requires access to video sequences. Therefore, in order to maximize the rate saving, it is necessary to carefully select suitable visual features and design efficient coding schemes.

Since ATC moves part of the analysis directly on (possibly low-power) sensing nodes, computational complexity of fea-ture extraction, coding and transmission plays a fundamental role. In this context, the last few years have seen a surge of interest in the optimization of such algorithms. A thorough comparative analysis of the ATC and CTA paradigms in terms of task accuracy, computational, transmission and memory requirements shows that there is not a clear winner [2]. In fact, each paradigm resulted to be the best one under particular conditions: ATC performs better when bandwidth is scarce, whereas CTA is the best approach when computational resources are extremely limited and bandwidth is not an issue. Moreover, the dualism between the two paradigms have been showcased resorting to a testbed implementation of a visual analysis system, showing the competitiveness of ATC with respect to CTA [3]. In this paper we consider the problem of encoding both local and global binary features extracted from video sequences. The choice of this class of visual features is well motivated from different standpoints [4]. First, binary features are significantly faster to compute than real-valued features such as SIFT [5] or SURF [6], thus being suitable whenever energy resources are an issue, such as in

(2)

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TIP.2015.2445294, IEEE Transactions on Image Processing

2

the case of low-power devices, where they constitute the only available option. Second, binary features have been recently shown to deliver performance close to state-of-the-art real-valued features. Third, they can be compactly represented and coded with just a few bits [7]. Forth, binary features are faster to match, thus being suitable when dealing with large scale collections.

The processing pipeline for the extraction of local features comprises: i) a keypoint detector, which is responsible for the identification of a set of salient keypoints within an image, and ii) a keypoint descriptor, which assigns a description vector to each identified keypoint, based on the local image content. Within the class of local binary descriptors, BRIEF [8] computes the descriptor elements as the result of pairwise comparisons between (smoothed) pixel intensity values that are randomly sampled from the neighborhood of a keypoint. BRISK [9], FREAK [10] and ORB [11] are inspired by BRIEF, and similarly to their predecessor, are also based on pairwise pixel intensity comparisons. They differ from each other in the way pixel pairs are spatially sampled in the image patch surrounding a given keypoint. In particular, they introduce ad-hoc spatial patterns that define the location of the pixels to be compared. Furthermore, differently from BRIEF, they are designed so that the generated binary descriptors are scale- and rotation- invariant. More recently, in order to bridge the gap be-tween binary and real-valued descriptors, BAMBOO [12][13] adopts a richer dictionary of pixel intensity comparisons, and selects the most discriminative ones by means of a boosting algorithm. This leads to a matching accuracy similar to SIFT, while being 50x faster to compute. A similar idea is also exploited by BinBoost [14], which proposes a boosted binary descriptor based on a set of local gradients. BinBoost is shown to deliver state-of-the-art matching accuracy, at the cost of a computational complexity comparable to that of real-valued descriptors such as SIFT or SURF.

On the other hand, global features represent a suitable alternative to local features when considering scenarios in which very large amounts of data have to be processed, stored and matched. Global features computed by summarizing local features into a fixed-dimensional feature vector have been effectively employed in the context of large scale image and video retrieval [15]. Global features can be computed based on the Bag-of-Visual-Words (BoVW) [16] model, which is inspired by traditional text-based retrieval. VLAD [17] and Fisher Vectors [18] represent more sophisticated approaches that achieve improved compactness and matching performance. More recently, the problem of building global features starting from sets of binary features was addressed in [19] and [20], extending, respectively, the BoVW and VLAD model to the case of local binary features. Solutions based on global image descriptors offer a good compromise between efficiency and accuracy, especially considering large scale image retrieval and classification. Nonetheless, local features still play a fundamental role, being usually employed to refine the re-sults of such tasks [21][16]. For example, in the context of image retrieval, global features can be exploited to efficiently extract potential matching content from very large databases, whereas local features are essential to improve the quality of

retrieval, by re-ranking or filtering query results. Furthermore, the approaches based on global features disregard the spatial configuration of the keypoints, preventing the use of spatial verification mechanisms and thus being unsuitable for track-ing and structure-from-motion applications [22], [23]. Indeed, such tasks usually exploit both keypoint descriptors and their position to find salient patterns, as in the case of detection and tracking, or to enforce geometric consistency, as in the case of retrieval re-ranking with RANSAC.

This paper proposes a number of novel contributions: 1) We consider the problem of coding local binary features

extracted from video sequences, by exploiting both intra-and inter-frame coding. In this respect, we adopt the general architecture of our previous work [24], which targeted real-valued features, and propose coding tools specifically devised for binary features. We build upon our preliminary investigations [1], further extending the coding architecture and producing a stable

implementa-tion1_{. Unlike our previous work, content-based retrieval}

is evaluated by means of a complete image retrieval pipeline, in which a video is used to query an image database. Furthermore, we present more exhaustive tests on the “homography estimation” scenario.

2) For the first time, we consider the problem of coding global binary features extracted from video sequences, obtained by summarizing local features according to the BoVW model, exploiting both intra- and inter-frame coding.

3) We compare the overall performance of ATC vs. CTA for both analysis tasks. In the case of homography estimation, we show that ATC based on local features always outper-forms CTA by a large margin. In the case of content-based retrieval, we show that ATC achieves a significantly lower bitrate than CTA when using global features, while it is on a par with CTA when using local features.

In the context of local visual features, several past works tackled the problem of compressing both real-valued and binary local features extracted from still images. As for real-valued local features, architectures based on closed-loop predictive coding [25], transform coding [26][27] and hash-ing [28] were proposed. In this context, an ad-hoc MPEG group on Compact Descriptors for Visual Search (CDVS) has been working towards the definition of a standard [29] that relies on SIFT features. As for binary local features, predictive coding architectures aimed at exploiting either inter-descriptor correlation [30] or intra-inter-descriptor redundancy [31] were proposed. Furthermore, Monteiro et al. proposed a clustering based coding architecture tailored to the context of binary descriptors [32]. Moreover, some works aimed at modifying traditional extraction algorithms, so that the output data is more compact or more suitable for compression. In this context, CHOG [33] is a gradient-based descriptor that offers performance comparable to that of SIFT at a much lower bitrate. As an alternative approach, Chao et al. [34] studied how to adjust the JPEG quantization matrix in order to preserve 1_{The source code for VideoBRISK can be found at http://lucabaroffio.com/} 2014/12/videobrisk/

(3)

local features extracted from decoded images.

The problem of encoding visual features extracted from video content has been addressed only very recently. Makar et al. [35], [36] propose to encode and transmit temporally coherent image patches in the pixel-domain, for augmented reality and image recognition applications. Thus, the detector is applied at the transmitter side, while the descriptors are extracted from decoded patches at the receiver. The encoding of local features (both keypoint locations and descriptors) extracted from video sequences was addressed for the first time in [37] for the case of real-valued features (SIFT and SURF) and later extended in [24]. To the best of the authors’ knowledge, the encoding of streams of binary features has not been addressed in the previous literature. Furthermore, the interest of the scientific community in this kind of problem is witnessed by the creation of a new MPEG ad-hoc group, namely Compact Descriptors for Video Analysis (CDVA), which has recently started its activities [38]. CDVA targets the standardization of the extraction and coding of visual features in application scenarios such as video retrieval, automotive, surveillance, industrial monitoring, etc., in which video, rather than images, plays a key role.

The rest of this paper is organized as follows. Section II states the problem of coding sets of local binary descriptors, defining the properties of the features to be coded, whereas Section III illustrates the coding architecture. Section IV in-troduces the problem of coding Bag-of-Visual-Words extracted from a video sequence and Section V defines the coding algorithms. Section VI is devoted to defining the experimental setup and reporting the results. Finally, conclusions are drawn in Section VII.

II. CODING LOCAL FEATURES:PROBLEM STATEMENT

Let In denote the n-th frame of a video sequence, which is

processed to extract a set of local features Dn. First, a keypoint

detector is applied to identify a set of interest points. Then, a descriptor is applied on the (rotated) patches surrounding each keypoint. Hence, each element of dn,i2 Dnis a visual feature,

which consists of two components: i) a 4-dimensional vector pn,i = [x, y, , ✓]T, indicating the position (x, y), the scale

of the detected keypoint, and the orientation angle ✓ of the image patch; ii) a P -dimensional binary vector dn,i2 {0, 1}P,

which represents the descriptor associated to the keypoint pn,i.

We propose a coding architecture which aims at efficiently

coding the sequence {Dn}Nn=1 of sets of local features. In

particular, we consider both lossless and lossy coding schemes: in the former, the binary description vectors are preserved throughout the coding process, whereas in the latter only a subset of K < P descriptor elements is lossless coded, thus discarding a part of the original data. Each decoded descriptor can be written as ˜dn,i = {˜pn,i, ˜dn,i}. The number of bits

necessary to encode the Mn visual features extracted from

frame In is equal to Rn= Mn X i=1 (Rpn,i+ Rn,id ). (1)

That is, we consider the rate used to represent both the location

of the keypoint, Rp

n,i, and the descriptor itself, Rdn,i. Note that

for both the lossless and the lossy approach, no distortion is introduced during the source coding process. That is, the values of the decoded descriptor elements are identical to the ones extracted by the sensing nodes. Nonetheless, since in the lossy case some descriptor elements are deliberately discarded, the accuracy of the visual analysis task might be affected. Furthermore, we consider the transmission channel to be ideal and thus error-free. Nonetheless, we plan to investigate the effect of error-prone channels on the effectiveness of the pipeline in our future work, possibly devising ad-hoc channel coding strategies.

As for the component ˜pn,i, we decided to encode the

coordinates of the keypoint, the scale and the local orientation i.e., ˜pn,i= [˜x, ˜y, ˜, ˜✓]T. Although some visual analysis tasks

might not require this information, it could be used to refine the final results. For example, it is necessary when the matching score between image pairs is computed based on the number of matches that pass the spatial verification step using, e.g., RANSAC [21] or weak geometry checking [22]. Most of the detectors produce floating point values as keypoint coordinates, scale and orientation, thanks to interpolation mechanisms. Nonetheless, we decided to round such values with a quantiza-tion step size equal to 1/4 for the coordinates and the scale, and

⇡/16for the orientation, which has been found to be sufficient

for typical applications [37], [24].

III. CODING LOCAL FEATURES:ALGORITHMS

Figure 1 illustrates a block diagram of the proposed coding architecture. The scheme is similar to the one we recently proposed for encoding real-valued visual features [37], [24]. However, we highlighted the functional modules that needed to be revisited due to the binary nature of the source. A. Intra-frame coding

In the case of intra-frame coding, local features are extracted and encoded separately for each frame. In our previous work we proposed an intra-frame coding approach tailored to binary descriptors extracted from still images [7], which is briefly summarized in the following. In binary descriptors, each ele-ment represents the binary outcome of a pairwise comparison. The descriptor elements (dexels) are statistically dependent, and it is possible to model the descriptor as a binary source with memory.

Let ⇡j, j 2 [1, P ] represent the j-th element of a binary

descriptor d 2 {0, 1}P_{. The entropy of such a dexel can be}

computed as

H(⇡j) = pj(0) log2(pj(0)) pj(1) log2(pj(1)), (2)

where pj(0)and pj(1)are the probability of ⇡j= 0and ⇡j=

1, respectively. Similarly, the conditional entropy of dexel ⇡j1

given dexel ⇡j2 can be computed as

H(⇡j1|⇡j2) = X x2{0,1},y2{0,1} pj1,j2(x, y) log2 pj2(y) pj1,j2(x, y) , (3)

(4)

4 Matching) dn,i Dn 1 dn,i pn,i Candidate) Selec/on) C Mode) decision) dn,i Entropy) coding) Transform) Entropy) coding) ˜ cn,i ˜ dn 1,l⇤ pn,1 pn,1 pn 1,l⇤

Let j, j [1, P ]represent the j-th element of a binary

descrip-tor, where P is the dimension of such a descriptor. The entropy of such a dexel can be computed as

H( j) = pj(0) log2(pj(0)) pj(1) log2(pj(1)), (2)

where pj(0)and pj(1)are the probability of j = 0and j = 1,

respectively. Similarly, the conditional entropy of dexel j1 given

dexel j2can be computed as

H( j1| j2) = x {0,1},y {0,1} pj1,j2(x, y) pj2(y) pj1,j2(x, y) , (3)

with j1, j2 [1, P ]. Let ˜j, j = 1, . . . , P , denote a

permuta-tion of the dexels, indicating the sequential order used to encode a descriptor. The average code length needed to encode a descriptor is lower bounded by

R =

P j=1

H(˜j|˜j 1, . . . , ˜1). (4)

In order to maximize the coding efficiency, we aim at finding the permutation of dexels ˜1, . . . , ˜P that minimizes such a lower

bound. For the sake of simplicity, we model the source as a first-order Markov source. That is, we impose H(˜j|˜j 1, . . . ˜1) =

H(˜j|˜j 1). Then, we adopt the following greedy strategy to

reorder the dexels: ˜j =

arg min _jH( j) j = 1

arg min jH( j|˜j 1) j [2, P ]

(5) 3.2. Inter-frame coding

As for inter-frame coding, each set of local features Dn is coded

resorting to a reference set of features. In this work we consider as a reference the set of features extracted from the previous frame, i.e., Dn 1. Considering a descriptor dn,i, i = 1, . . . , Mn, the encoding

process consists in the following steps:

- Descriptor matching: Compute the best matching descriptor in the reference frame, i.e.,

dn 1,l⇤=arg min

l C D(dn,i, dn 1,l), (6)

where D(dn,i, dn 1,l) = dn,i d_{n 1,l 0}is the Hamming

dis-tance between the descriptors dn,iand dn 1,l. We limit the search

for a reference feature within a given set C of candidate features, i.e., the ones whose coordinates and scales are in the

neighbor-hood of dn,i, in a range of (± x, ± y, ± ). The prediction

residual is computed as cn,i= dn,i dn 1,l⇤.

- Coding mode decision: Compare the cost of inter-frame coding with that of intra-frame coding, which can be expressed as

JINTRA(dn,i) = Rc,INTRAn,i + R d,INTRA

n,i , (7)

JINTER(dn,i, ˜dn 1,l⇤) = Rc,INTER_n,i (l ) + R_n,id,INTER(l ), (8)

where Rc

n,i and Rdn,irepresent the bitrate needed to encode the

location component (either the location itself or location displace-ment) and the one needed to encode the descriptor component (ei-ther the descriptor itself or the prediction residual), respectively. If JINTER(dn,i, dn 1,l⇤) < JINTRA(dn,i), then inter-frame coding is

the selected mode. Otherwise, proceed with intra-frame coding.

- Intra-descriptor transform: This step aims at exploiting the spa-tial correlation between the dexels. If intra-frame is the selected coding mode, then the dexels of dn,i are reordered according to

the permutation algorithm presented in Section 3.1. Similarly, a reordering strategy can be applied also in the case of inter-frame coding, in this case considering the prediction residual cn,i.

- Entropy coding: Finally, the sets of local features are entropy coded. In the case of intra-frame coding, it is necessary to encode the reordered descriptor and the quantized location component. Otherwise, for inter-frame coding, it is necessary to encode: i) the identifier of the matching keypoint in the reference frame and the displacement in terms of position, scale and orientation of the key-point with respect to the reference, which require RINTER

n,i (l )bits;

ii) the reordered prediction residual ˜cn,i. The probabilities of the

symbols used for entropy coding are learned from a training set of images. In particular, for each of the P dexels, we estimated the conditional probability of each symbol, given the previous one de-fined by the optimal permutation. Such a procedure is applied also in the case of inter-frame coding, exploiting a training set of pre-diction residuals. The estimated probabilities are then exploited to entropy code the features.

3.3. Descriptor element selection

The lossless coding architecture described in the previous section can be used to encode all the P elements of the original binary de-scriptor. However, in order to operate at lower bitrates, it is possible to decide to code only a subset of K < P descriptor elements. In our previous work we explored different methods that define how to select the dexels to be retained [22, 7, 8]. In this work, we employed the greedy asymmetric pairwise boosting algorithm described in [8] in order to iteratively select the most discriminative descriptor ele-ments. To this end, we used a training set of image patches [23], along with the ground truth information defining whether two image patches refers to the same physical entity. At each step, the asym-metric pairwise boosting algorithm selects the dexel that minimizes a cost function, which captures the error resulting from the wrong classification of matching and non-matching patches. The output of this procedure is a set of dexels, ordered according to their discrim-inability. Hence, given a target descriptor size K < P , it is possible to encode only the first K descriptor elements selected by this algo-rithm.

4. EXPERIMENTS

For the evaluation process, we extracted BRISK [5] features from

a set of six video sequences at CIF resolution (352 288) and

30 fps, namely Foreman, Mobile, Hall, Paris, News and Mother, each with 300 frames [24]. In the training phase, we employed three sequences (Mother, News and Paris), whereas the remaining sequences (Hall, Mobile and Foreman) were employed for testing purposes. The statistics of the symbols to be fed to the entropy coder were learned based on the descriptors extracted from the training se-quences. Moreover, the training video sequences were exploited to obtain the optimal coding order of dexels for both intra- and inter-frame coding, as illustrated in Section 3.1. Starting from the original BRISK descriptor consisting in P = 512 dexels, we considered a set of target descriptor sizes K = {512, 256, 128, 64, 32, 16, 8}. For each of such descriptor sizes, we employed the selection algorithm presented in Section 3.3.

As a first test, we evaluated the number of bits necessary to en-code each visual features using either intra-frame or inter-frame

cod-of such a dexel can be computed as

respectively. Similarly, the conditional entropy of dexel j1given

H(j1|j2) = x{0,1},y {0,1} pj1,j2(x, y) log2 pj2(y) pj1,j2(x, y) , (3) with j1, j2 [1, D]. Let ˜j, j = 1, . . . , D, denote a

R =

P j=1

H(˜j|˜j 1, . . . , ˜1). (4)

In order to maximize the coding efficiency, we aim at finding the permutation of dexels ˜1, . . . , ˜D that minimizes such a lower

reorder the dexels:

˜j= arg minjH( j) j = 1

arg minjH( j|˜j 1) j [2, D]

(5) Note that such optimal ordering is computed offline, thanks to a training phase, and shared between both the encoder and the decoder. 3.2. Inter-frame coding

As for inter-frame coding, each set of local features Dnis coded

dn 1,l⇤=arg min

lCD(dn,i, dn 1,l), (6)

where D(dn,i, dn 1,l) = dn,i dn 1,l 0is the Hamming

dis-tance between the descriptors dn,iand dn 1,l, and l is the index

of the selected reference feature used in the next step. We limit the search for a reference feature within a given set C of candi-date features, i.e., the ones whose coordinates and scales are in the neighborhood of dn,i, in a range of (± x, ± y, ± ). The

prediction residual is computed as cn,i= dn,i dn 1,l⇤, that is,

the bitwise XOR between dn,iand dn 1,l⇤.

JINTRA(dn,i) = Rp,INTRAn,i + Rd,INTRAn,i , (7)

JINTER(dn,i, ˜dn 1,l⇤) = Rp,INTERn,i (l ) + R d,INTER n,i (l ), (8)

where Rp

n,iand Rn,id represent the bitrate needed to encode the

location component (either the location itself or location displace-ment) and the one needed to encode the descriptor component (ei-ther the descriptor itself or the prediction residual), respectively. If JINTER(dn,i, dn 1,l⇤) < JINTRA(dn,i), then inter-frame coding is

- Intra-descriptor transform: This step aims at exploiting the spa-tial correlation between the dexels. If intra-frame is the selected coding mode, then the dexels of dn,iare reordered according to

- Entropy coding: Finally, the sets of local features are entropy coded. In the case of intra-frame coding, it is necessary to encode the reordered descriptor and the quantized location component. Otherwise, for inter-frame coding, it is necessary to encode: i) the identifier of the matching keypoint in the reference frame and the displacement in terms of position, scale and orientation of the keypoint with respect to the reference, which require Rp,INTER

n,i (l )

bits; ii) the reordered prediction residual ˜cn,i.

For both intra-frame and inter-frame coding, the probabilities of the symbols (respectively, descriptor elements or prediction resid-uals) used for entropy coding are learned from a training set of images. In particular, for each of the D dexels, we estimated the conditional probability of each symbol, given the previous one de-fined by the optimal permutation. The estimated probabilities are then exploited to entropy code the features.

4. EXPERIMENTS

For the evaluation process, we extracted BRISK [5] features from a set of six video sequences at CIF resolution (352 288) and 30 fps, namely Foreman, Mobile, Hall, Paris, News and Mother, each with 300 frames [26]. In the training phase, we employed three sequences (Mother, News and Paris), whereas the remaining sequences (Hall, Mobile and Foreman) were employed for testing purposes. The statistics of the symbols to be fed to the entropy coder were learned based on the descriptors extracted from the training se-quences. Moreover, the training video sequences were exploited to obtain the optimal coding order of dexels for both intra- and inter-frame coding, as illustrated in Section 3.1. Starting from the original BRISK descriptor consisting in P = 512 dexels, we considered a set of target descriptor sizes K = {512, 256, 128, 64, 32, 16, 8}. For each of such descriptor sizes, we employed the selection algorithm presented in Section 3.3.

As a first test, we evaluated the number of bits necessary to en-code each visual features using either intra-frame or inter-frame cod-of such a dexel can be computed as

H(j1| j2) = x{0,1},y {0,1} pj1,j2(x, y) log2 pj2(y) pj1,j2(x, y) , (3) with j1, j2 [1, D]. Let ˜j, j = 1, . . . , D, denote a

R =

P j=1

H(˜j|˜j 1, . . . , ˜1). (4)

reorder the dexels:

˜j= arg minjH(j) j = 1

arg minjH(j|˜j 1) j [2, D]

dn 1,l⇤=arg min

JINTRA_(d

n,i) = Rp,INTRAn,i + R d,INTRA

n,i , (7)

JINTER(dn,i, ˜dn 1,l⇤) = Rn,ip,INTER(l ) + Rn,id,INTER(l ), (8)

where Rp

n,iand Rdn,irepresent the bitrate needed to encode the

location component (either the location itself or location displace-ment) and the one needed to encode the descriptor component (ei-ther the descriptor itself or the prediction residual), respectively. If JINTER_(d

n,i, dn 1,l⇤) < JINTRA_(d

n,i), then inter-frame coding is

n,i (l )

4. EXPERIMENTS

cod-of such a dexel can be computed as

where pj(0)and pj(1)are the probability of j= 0and j= 1,

H( j1|j2) = x{0,1},y {0,1} pj1,j2(x, y) log2 pj2(y) pj1,j2(x, y) , (3) with j1, j2 [1, D]. Let ˜j, j = 1, . . . , D, denote a

R =

P j=1

H(˜j|˜j 1, . . . , ˜1). (4)

In order to maximize the coding efficiency, we aim at finding the permutation of dexels ˜1, . . . , ˜Dthat minimizes such a lower

reorder the dexels:

˜j= arg minjH( j) j = 1

dn 1,l⇤=arg min

JINTRA(dn,i) = Rp,INTRAn,i + Rd,INTRAn,i , (7)

JINTER_(d

n,i, ˜dn 1,l⇤) = Rp,INTERn,i (l ) + Rd,INTERn,i (l ), (8)

where Rp

n,i, dn 1,l⇤) < JINTRA_(d

n,i), then inter-frame coding is

n,i (l )

4. EXPERIMENTS

As a first test, we evaluated the number of bits necessary to en-code each visual features using either intra-frame or inter-frame cod-of such a dexel can be computed as

where pj(0)and pj(1)are the probability of j= 0and j= 1,

H( j1| j2) = x{0,1},y {0,1} pj1,j2(x, y) log2 pj2(y) pj1,j2(x, y) , (3) with j1, j2 [1, D]. Let ˜j, j = 1, . . . , D, denote a

R =

P j=1

H(˜j|˜j 1, . . . , ˜1). (4)

reorder the dexels: ˜j=

arg minjH( j) j = 1

dn 1,l⇤=arg min

JINTRA(dn,i) = Rp,INTRAn,i + R d,INTRA

n,i , (7)

JINTER(dn,i, ˜dn 1,l⇤) = Rp,INTERn,i (l ) + R d,INTER n,i (l ), (8)

where Rp

n,i, dn 1,l⇤) < JINTRA(dn,i), then inter-frame coding is

n,i (l )

4. EXPERIMENTS

cod-Fig. 1. Block diagram of the proposed coding architecture. The highlighted functional modules needed to be revisited due to the binary nature of the source. with j1, j2 2 [1, P ]. Let ˜⇡j, j = 1, . . . , P , denote a

permutation of the dexels, indicating the sequential order used to encode a descriptor. The average code length needed to encode a descriptor is lower bounded by

R =

P

X

j=1

H(˜⇡j|˜⇡j 1, . . . , ˜⇡1). (4)

In order to maximize the coding efficiency, we aim at finding the permutation of dexels ˜⇡1, . . . , ˜⇡P that minimizes such

a lower bound. For the sake of simplicity, we model the source as a first-order Markov source. That is, we impose H(˜⇡j|˜⇡j 1, . . . ˜⇡1) = H(˜⇡j|˜⇡j 1). From our experiments we

observed that larger coding contexts bring a slight advantage in terms of coding efficiency, at the cost of much more complex training and coding procedures. Then, we adopt the following greedy strategy to reorder the dexels:

˜ ⇡j= ⇢ arg min⇡jH(⇡j) j = 1 arg min⇡jH(⇡j|˜⇡j 1) j2 [2, P ] (5) The reordering of the dexel is described by means of a permu-tation matrix TINTRA_{, such that ˜c}

n,i = TINTRAdn,i. Note that

such optimal ordering is computed offline, thanks to a training phase, and shared between both the encoder and the decoder. As such, this does not require additional bitrate. Furthermore, the computed permutation matrix depends on the statistical properties of the signal at hand, that is, on the spatial pattern of pairwise comparisons exploited by the feature extraction algorithms. The characteristics of such pattern might also affect the achievable coding gain. Nonetheless, the approach is quite general and can be applied to different kinds of features. B. Inter-frame coding

As for inter-frame coding, each set of local features Dn is

coded resorting to a reference set of features. In this work we consider as a reference the set of features extracted from the previous frame, i.e., Dn 1. Considering a descriptor dn,i,

i = 1, . . . , Mn, the encoding process consists in the following

steps:

- Descriptor matching: Compute the best matching descrip-tor in the reference frame, i.e.,

dn 1,l⇤ =arg min

l2C D(dn,i, dn 1,l) + R p,INTER n,i (l), (6)

where D(dn,i, dn 1,l) = kdn,i dn 1,lk₀ is the

Ham-ming distance between the descriptors dn,i and dn 1,l,

Rp,INTER_n,i (l) is the rate needed to encode the keypoint

difference vector and l⇤ _{is the index of the selected}

reference feature used in the next steps. We limit the search for a reference feature within a given set C of candidate features, i.e., the ones whose coordinates and

scales are in the neighborhood of dn,i, in a range of

(_{± x, ± y, ±} ). The prediction residual is computed

as cn,i = dn,i dn 1,l⇤, that is, the bitwise XOR

between dn,i and dn 1,l⇤.

JINTRA(dn,i) = Rp,INTRAn,i + R d,INTRA

n,i , (7)

JINTER(dn,i, ˜dn 1,l⇤) = Rp,INTER_n,i (l⇤) + Rd,INTER_n,i (l⇤),

(8)

where Rp

n,i and Rdn,i represent the bitrate needed to

encode the location component (either the location it-self or location displacement) and the one needed to encode the descriptor component (either the descrip-tor itself or the prediction residual), respectively. If JINTER(dn,i, dn 1,l⇤) < JINTRA(dn,i), then inter-frame

coding is the selected mode. Otherwise, proceed with intra-frame coding.

- Intra-descriptor transform: This step aims at exploiting the spatial correlation between the dexels. If intra-frame

is the selected coding mode, then the dexels of dn,i

are reordered according to the permutation algorithm presented in Section III-A. Similarly, a reordering strategy can be applied also in the case of inter-frame coding, in this case considering the prediction residual cn,i, that is,

˜

cn,i = TINTERcn,i. The transform TINTER is estimated

at training time and shared between the encoder and the decoder, not requiring extra signaling at coding time. Note that, in general, TINTER_{6= T}INTRA_.

- Entropy coding: Finally, the sets of local features are entropy coded. In the case of intra-frame coding, for each local feature, it is necessary to encode the reordered descriptor and the quantized location component. Other-wise, for inter-frame coding, it is necessary to encode: i) the identifier of the matching keypoint in the reference frame and the displacement in terms of position, scale and orientation of the keypoint with respect to the

refer-ence, which require Rp,INTER

n,i (l⇤) bits; ii) the reordered

(5)

matching keypoint identifiers, we resort to an effective coding architecture based on smart reordering and entropy coding, significantly reducing the output bitrate. The interested reader is referred to [24] for further details. For both intra-frame and inter-frame coding, the probabil-ities of the symbols (respectively, descriptor elements or prediction residuals) used for entropy coding are learned from a training set of frames. In particular, for each of the P dexels, we estimated the conditional probability of each symbol, given the previous one defined by the optimal permutation. The estimated probabilities are then exploited to entropy code the features.

C. Descriptor element selection

The lossless coding architecture described in the previous section can be used to encode all the P elements of the original binary descriptor. However, in order to operate at lower bitrates, it is possible to decide to code only a subset of

K < P descriptor elements. In our previous work we explored

different methods that define how to select the dexels to be retained [7], [12], [13]. In this work, we employed the greedy asymmetric pairwise boosting algorithm described in [13] in order to iteratively select the most discriminative descriptor elements. To this end, we used a training set of image patches [39], along with the ground truth information defining whether two image patches refer to the same physical entity. At each step, the asymmetric pairwise boosting algorithm selects the dexel that minimizes a cost function, which captures the error resulting from the wrong classification of matching and non-matching patches. The output of this procedure is a set of dexels, ordered according to their discriminability. Hence, given a target descriptor size K < P , it is possible to encode only the first K descriptor elements selected by this algorithm.

IV. GLOBAL DESCRIPTORS BASED ON BINARY VISUAL

FEATURES

Let In denote the n-th frame of a video sequence, which

is processed to extract a set of local features Dn. A global

representation for the frame In can be computed, starting from

such set of local image descriptors. The key idea behind the Bag-of-Visual-Words (BoVW) approach is to quantize each local feature into one visual word. To this end, a vocabulary

V = {w1, . . . , wV} composed of V distinct visual words

has to be computed. Traditional approaches to the creation of BoVW models are based on real-valued local descriptors such as SIFT [5] or SURF [6]. In this context, a large set of

descriptors d 2 RK _{is exploited for learning the vocabulary,}

along with a clustering algorithm such as k-means (with

k = V) based on Euclidean distance [16], Gaussian Mixture

Model [40], etc.

More recently, the problem of constructing a BoVW model starting from sets of binary local descriptors was addressed in [41]. Analogously to the case of real-valued descriptors, a dictionary is learned starting from a large set of descriptors

d _{2 {0, 1}}K_{. To this end, a naive approach would consist}

in k-means clustering paired with Euclidean distance [42]. Besides, clustering techniques tailored to the peculiar nature

of the signal at hand have been introduced. In particular, k-medoids and k-medians algorithms, paired with Hamming distance, have been successfully exploited for creating the dictionary [41].

Then, given a vocabulary that consists of V of visual words V = {w1, . . . , wV} learned offline and a set of visual features

Dnextracted from the frame In, a global descriptor is obtained

by mapping such a set of features to a fixed-dimensional

vector vn 2 RV. The simplest strategy is to apply hard

quantization, which assigns each feature d 2 Dnto the nearest

visual word’s centroid wj2 V. The resulting global descriptor

vn = [v1, . . . , vV]T is a histogram, where vj represents the

number of local features in Dn having the dictionary word

wj as nearest neighbor. Soft quantization represents a more

sophisticated alternative to hard quantization, mapping each lo-cal feature to multiple visual words. Finally, several techniques

for normalizing the global descriptor vn have been proposed,

aimed at improving matching performance. A widely accepted approach consists in adopting the tf-idf weighting scheme,

followed by L2 normalization [43]. The former gives more

emphasis to rare visual words and less importance to common ones, whereas the latter avoids short vectors, i.e. BoVWs built starting from few local features, to be penalized during the matching stage. Note that weighting can be applied at the decoder, saving computational and storage resources on low-power sensing nodes.

V. CODING GLOBAL FEATURES

For each frame In of an input video sequence, a set of

binary local features Dn is extracted and mapped to a V

-dimensional global descriptor vn = [v1, . . . , vV]T by applying

the procedure described in Section IV. We propose a coding architecture which aims at effectively encoding the sequence

{vn}Nn=1 of global image descriptors. In particular, such a

lossy coding architecture enables the decoder to reconstruct an approximation {˜vn}Nn=1of the original sequence of global

descriptors.

Differently from the case of local descriptors, the coordi-nates of the keypoints are disregarded during the construction of the BoVW and they are not encoded. Hence, the number

of bits needed to encode a Bag-of-Visual-Words vn extracted

from frame In is equal to Rn=PVj=1Rvn,j, i.e., the sum of

the number of bits needed to encode each component of the vector vn.

Traditionally, Bag-of-Visual-Words are typically represented resorting to inverted index structures to reduce the storage requirements and the matching complexity. Such representa-tion require Rii

n = dlog2(V )e ⇥ Nn bits, where Nn is the

number of local features extracted from In. Nonetheless, more

sophisticated algorithms may be exploited to further reduce the bitrate.

A. Intra-frame coding

The Intra-frame coding approach is based on a frame-by-frame processing scheme, in which the global descriptor

extracted from the frame In is encoded independently from

(6)

6

architecture, uniform scalar quantization with step size j

is applied to each element vn,j, j = 1, . . . , V of the global

descriptor vn, that is ˜ qn,j = vn,j j ⌫ . (9)

Since the vectors are normalized according to a tf-idf

weight-ing scheme, the same quantization step size j = , j =

1, . . . , V, is fixed for each visual word.

The quantization index ˜qn,j is considered as the outcome

of a discrete memoryless source and entropy coded. To this end, the probabilities of the quantization symbols are estimated offline and fed to an arithmetic coder, so that the corresponding rate is equal to Rv

n,j= log2(p(˜qn,j)).

B. Inter-frame coding

In the case of inter-frame coding, local features are extracted on a frame-by-frame basis and quantized into BoVWs in order

to obtain a sequence of global descriptors {vn}Nn=1. Then,

differently from intra-frame coding, temporal redundancy is

exploited in the coding phase: the global descriptor vn

ex-tracted from frame In is encoded using vn 1 as reference.

In particular, each descriptor element vn,j is encoded having

vn 1,j as context.

To this end, considering a quantization step size j, the

quantization symbols ˜qn,j are obtained according to

Equa-tion (9), and then entropy coded using Rn bits. Similarly to

the case of intra-frame coding, the statistics of the quantization symbols are estimated at training time. In particular, given a sufficiently large sequence of training global descriptors, a

training phase aims at estimating the probabilities p(qn,j =

Xp|qn 1,j = Xq), i.e. the probabilities of the j-th dexel at

frame In assuming value Xp, given the j-th dexel at frame

In 1having value Xq. An arithmetic coder is used to entropy

code the quantization symbols, with an expected number of bits that amounts to

Rn = M X j=1 Rvn,j= M X j=1 log2(p(qn,j= ˜qn,j|qn 1,j= ˜qn 1,j)). (10) C. Coding sparse signals

Since the number of visual words composing a dictionary is typically much larger than the number of features extracted from a frame, the generated global representation tends to be sparse. In this context, coding algorithms specifically devised for sparse signals may be employed. For example, run-level coding has been successfully exploited to encode sparse binary images. Considering a global representation vn = [v1, . . . , vV]T , such method aims at efficiently encode

the length of runs of data having the same value. In particular, the algorithm identifies the sequences of zeros, i.e., the most probable value, and computes their length. Then, it builds a representation for the BoVW by interleaving the length of the runs and the values of non-zero entries. Finally, entropy coding can be applied to such compressed signal, further reducing the bitrate.

VI. EXPERIMENTS

We validated the effectiveness of the feature coding architectures and compared the two different paradigms, namely “Analyze-Then-Compress” (ATC) and “Compress-Then-Analyze” (CTA), on two traditional visual analysis tasks: • Homography estimation. Several high- and low-level vi-sual analysis tasks, including camera calibration, 3D re-construction, structure-from-motion, tracking, etc. might require the estimation of the homography defining the geometrical relationship between two frames with homo-geneous visual content. In this scenario, local features can be conveniently used to find correspondences between pixel locations in different frames or views. Conversely, global features based on BoVW do not represent a vi-able option, since they do not include any geometrical information about the visual content.

• Content Based Retrieval (CBR). Content Based Retrieval is a traditional, yet challenging, task within the computer vision community. Given an input query in the form of some kind of visual content, the goal is to retrieve the relevant multimedia documents within a large database. Accuracy and computational efficiency are key tenets to be considered when implementing algorithms for CBR, which typically target large scale scenarios. Our test considers an input query in the form of a video clip, with the goal of retrieving the most relevant database images. In this scenario, both global and local features are considers, in order to explore a trade-off between accuracy and computational efficiency.

A. Data sets

1) Training data sets: The methods discussed for encoding binary local descriptors require the knowledge of the probabil-ities of the symbols to operate the entropy coder, which were estimated from training sequences. To this end, we employed three video sequences, namely Mother, News and Paris [44]. The training video sequences were also exploited to obtain the optimal coding order of dexels for both intra- and inter-frame coding, as illustrated in Section III. Furthermore, a dataset of patches [39] was exploited along with an asymmetric pairwise boosting algorithm [13] [12] in order to identify the K most discriminative dexels according to the method presented in Section III-C.

In the case of BoVW-based global descriptors, the visual word dictionary was estimated exploiting a large database of images, namely VOC2010 [45], whereas the statistics of the coding symbols for both intra- and inter-frame coding archi-tectures were estimated offline, resorting to a sufficiently long video sequence, namely Rome in a nutshell, which consists of 15375 frames.

2) Test data sets: First, the coding architecture was eval-uated on three video sequences, namely Hall, Mobile and Foreman, to investigate the bitrate saving which can be ob-tained by properly encoding the binary features. Then, for the Homography Estimation test, we used a publicly available dataset for visual tracking [46], consisting in a set of video sequences, each containing a planar texture subject to a given

(7)

(a)

(b)

Fig. 2. a) Four frames sampled from one of the query videos employed for the retrieval task. b) A matching database image.

motion path. For each frame of each sequence, the homography that warps such frame to the reference is provided as ground truth. The sequences have a resolution of 640 ⇥ 480 pixels at 15 fps and a length of 500 frames (33.3 seconds). Finally, for the Content Based Retrieval (CBR) test, a set of 10 query video sequences was used, each capturing a different landmark

in the city of Rome2 _{with a camera embedded in a mobile}

device [47]. The frame rate of such sequences is equal to 24fps, whereas the resolution ranges from 480x360 pixels (4:3) to 640x360 pixels (16:9). The first 50 frames of each video were used as query. On average, each query video corresponds to 9 relevant images representing the same physical object under different conditions and with heterogeneous qualities and resolutions. Then, distractor images randomly sampled from the MIRFLICKR-1M dataset, so that the final database contains 10k images. As an example, Figure 2 shows some frames of a query sequence, along with a relevant image to be retrieved. The dataset is publicly available for download at XX. B. Methods

1) ATC-Training: The proposed coding architecture can be applied to any kind of local binary feature. Hence, in our experiments we evaluated the use of two different binary fea-tures, namely BRISK [9] and BINBOOST [14]. The detection threshold was set equal to 70 for both BRISK and BINBOOST. All other parameters were left equal to their default values. 2_{The Rome Landmark Dataset is publicly available at https://sites.google.} com/site/greeneyesprojectpolimi/downloads/datasets/rome-landmark-dataset

In both cases, the parameters x, y and , representing the location and the scale of each keypoint, were rounded to the nearest quarter of unit. Descriptors consisting of P = 512 and

P = 256 dexels for BRISK and BINBOOST, respectively,

were extracted from the training video sequences, using the original implementations of the feature extraction algorithms provided by authors.

As for BRISK, we considered a set of target descriptor sizes K = {512, 256, 128, 64, 32, 16, 8}. For each size, we employed the dexel selection algorithm presented in Sec-tion III-C in order to identify the elements to be retained. Then, in the case of intra-frame coding and for each descriptor length, the optimal coding order and the corresponding cod-ing probabilities were estimated accordcod-ing to the procedure introduced in Section III-A. Instead, in the case of inter-frame coding, for each descriptor a match was found within the features extracted from the previous frame, according to the method presented in Section III-B. Similarly to the case of intra-frame coding, a coding-wise optimal permutation of the elements of the binary prediction residual was computed, and the corresponding coding probabilities were estimated. As to BINBOOST, we considered a set of target descriptor sizes

K ={256, 128, 64, 32, 16, 8}. For each size, the first K dexels

of the BINBOOST descriptor were retained. Then, similarly to the case of BRISK, coding-wise optimal dexel permutations for intra- and inter-frame coding were computed, along with the probabilities of the coding symbols.

In the case of BoVW-based global descriptors, we fixed a set of target sizes for the dictionary of visual words M = {1024, 4096, 16384}. Then, for each possible dictionary size, BRISK or BINBOOST descriptors were extracted from the training set of images and vector quantization was applied in order to identify M visual words. Both k-means and k-medians algorithms have been tested for the dictionary construction stage, yielding similar results in terms of rate-accuracy per-formance. Furthermore, global descriptors based on BRISK and BINBOOST local features achieve very similar results. In the following, we refer to the best performing setup, that is, k-means clustering applied to BRISK descriptors, initialized according to the k-means++ [48] algorithm, and Euclidean distance. The output of this first stage is a dictionary composed of M visual words each represented by a P -dimensional vector, where P = 512 (P = 256) is the size of BRISK descriptors (BINBOOST descriptors).

In the context of image retrieval, dictionaries consisting of up to millions of visual words may be employed. Nonetheless, such large dictionaries require a larger amount of memory to be stored, and would increase the complexity of the BoVW creation process. Therefore, targeting low-power devices, we consider smaller dictionaries.

Then, a training video sequence was adopted to compute the coding probabilities. For each frame, local features were ex-tracted and the global descriptor was computed by hard assign-ing each feature to its nearest neighbor within the dictionary, according to the procedure presented in Section IV. Then, for

each target quantization step size = _{{0.01, 0.05, 0.1, 0.2},}

global descriptors were quantized and the coding probabilities for both intra- and inter-frame were computed according to the