CORSO DI LAUREA SPECIALISTICA in COMPUTER ENGINEERING 2016
DESIGN AND IMPLEMENTATION OF A TOOL FOR
PERSON RE-IDENTIFICATION BASED ON DEEP
LEARNING
Supervisors:
Claudio GENNARO
Giuseppe AMATO
Fabrizio FALCHI
Candidate:
Giacomo GIORGI
DESIGN AND IMPLEMENTATION OF A TOOL FOR
PERSON RE-IDENTIFICATION BASED ON DEEP
LEARNING
Giacomo Giorgi
2016
Abstract
In this thesis work, a person re-identification system is designed and
imple-mented.
The person re-identification problem is defined as the process that
recog-nizes if a person has been observed in different locations over a set of
non-overlapping camera views.
The problem, presents various challenges due to low image quality, different
pose, different illumination that can be affect the recognizing process.
The system presented is based on an existing deep convolutional network
ar-chitecture specifically designed to address the problem of re-identification.
The network is able to learn visual features and a corresponding similarity
metric for person re-identification. Given a pair of images as input, the
net-work computes the similarity score to indicate if the two images depict the
same person.
The thesis describes the entire network implementation process focusing on
the implementation of the cross-input neighborhood differences layer, (core
component of the network) able to capture local relationships between the
two input images based on mid- level features from each input image.
Experiments have been performed on the public person re-identification dataset
CUHK03, in order to compare the results obtained, with the results of
net-works that represents the state of art. Specifically, the results of the network
reproduced have been compared with the results of the original network and
the results of a novel network based on the GoogleNet, which actually
sig-nificantly outperforms the state of art.
We show how the results of the system implemented are comparable or
slightly higher than the original one and significantly lower with respect to
the novel network. A downside of the latter network, however, is in its
inef-ficiency in terms of computation time of the similarity between two images.
This is an aspect that cannot underestimated in real time applications such as
person re-identification.
Contents
List of Figures vii
List of Tables x
1 Introduction 1
1.1 Challenges in person re-identification . . . 2
1.2 Contributions and Outline of This Thesis . . . 3
1.3 Thesis summary . . . 4
2 Background 7 2.1 Classification problem . . . 7
2.2 Basics of Neural Network . . . 10
2.2.1 Biological neuron . . . 10
2.2.2 Artificial neuron . . . 10
2.2.3 Activation function . . . 11
2.2.4 Feedforward Neural Networks . . . 12
2.2.5 Training Artificial Neural Network: Backpropagation and gradi-ent descgradi-ent algorithm . . . 13
2.2.6 Advanced NN architecture . . . 17
2.3 Deep Neural Networks . . . 18
2.3.1 Comparison between DNN and a "shallow NN" . . . 18
2.3.2 Convolutional Neural Networks . . . 20
3 Related works 27 3.1 Hand-crafted systems for person re-identification . . . 27
3.2 Deep learning systems in person re-identification . . . 28
4 System implementations 33 4.1 Introduction . . . 33
4.2.1 Caffe model structure . . . 34
4.3 Cross Neighborhood Difference Network implementation . . . 37
4.3.1 Input data layer . . . 38
4.3.2 Convolution and max pooling layer . . . 38
4.3.3 Cross Neighborhood difference layer . . . 41
4.3.4 Cross Neighborhood difference layer: Custom implementation . . 44
4.3.5 Cross Neighborhood Difference layer: implementation by the stan-dard caffe layers . . . 55
4.3.6 Patch summary features layer . . . 59
4.3.7 Across Patch Features layer . . . 60
4.3.8 Higher order relationships layer . . . 61
5 Dataset 63 5.1 CUHK01 dataset . . . 64
5.2 CUHK03 dataset . . . 65
6 Experiments 67 6.1 Training set composition . . . 67
6.1.1 Augmentation . . . 68
6.1.2 CUHK03 training set . . . 69
6.2 Experiments . . . 71
6.2.1 Cross Neighborhood difference network: Python implementation (python CND) . . . 71
6.2.2 Cross Neighborhood difference network: Caffe implementation (Caffe CND) . . . 73 6.2.3 DCSL: CUHK03 . . . 74 6.3 Metric Evaluation (CMC) . . . 75 6.4 Results . . . 77 6.4.1 Networks comparison . . . 78 7 System application 81 7.1 Surveillance zone . . . 81
7.2 Moving object detection . . . 82
7.3 Person re-identification . . . 84
8 Conclusions 87 8.1 Future works . . . 88
List of Figures
1.1 Camera viewpoints . . . 1
1.2 Real re-identification sample pairs . . . 2
1.3 Person re-identification system . . . 4
2.1 Learning system schema . . . 8
2.2 Machine learning schema . . . 8
2.3 Deep learning schema . . . 9
2.4 Classificator schema . . . 9
2.5 Biological neuron structure . . . 10
2.6 Artificial neuron structure . . . 11
2.7 TanH activation function . . . 11
2.8 Sigmoid activation function . . . 12
2.9 ReLu activation function . . . 12
2.10 Feedforward fully connected network . . . 13
2.11 Generic artificial neuron . . . 13
2.12 Fully connected neural network with weights . . . 15
2.13 Siamese NN architecture . . . 18
2.14 DNN and shallow NN . . . 19
2.15 Convolution operation . . . 21
2.16 Receptive field in a fully connected network . . . 22
2.17 Receptive field in a CNN . . . 22
2.18 Max pooling operation . . . 23
2.19 CNN classifier . . . 23
3.1 Deep metric learning network . . . 28
3.2 CNN on the Deep metric learning network . . . 29
3.3 Siamese CNN in the Deep metric learnin network . . . 29
3.4 FPNN architecture . . . 29
3.5 CND architecture . . . 31
4.1 Forward and backward passes in a deep learning network . . . 34
4.2 Caffe data blob . . . 34
4.3 Caffe layer connections . . . 35
4.4 Caffe network example . . . 36
4.5 Cross-Neighborhood Difference Network architecture . . . 37
4.6 First conv/pool layers in the CND network . . . 38
4.7 Max pooling layer in the CND network . . . 39
4.8 Output of first Conv/pool layer in the CND network . . . 40
4.9 Second conv/pool layers in the CND netowrk . . . 41
4.10 Output of the second conv/pool layers in the CND network . . . 41
4.11 Input feature maps to the CND network . . . 42
4.12 Neighborhood Difference (view A - view B) . . . 43
4.13 Neighborhood Difference (view B - view A) . . . 43
4.14 Neighborhood layer architecture . . . 45
4.15 Feature padding . . . 46
4.16 Image To Column operation . . . 46
4.17 Rashaping and replication operation . . . 47
4.18 Subtraction operation on CND python . . . 48
4.19 Reshape CND architecture in python . . . 49
4.20 The cross neighborhood difference layer architecture applied to both the input feature maps. . . 49
4.21 Backward pass example . . . 50
4.22 Backward loss propagation in CND implemented in python . . . 50
4.23 Gradient computation in the reshape layer . . . 51
4.24 Gradient computation in subtraction layer . . . 52
4.25 Backward pass in the neighborhood laye . . . 52
4.26 Gradient computation in replication layer . . . 53
4.27 Gradient computation in image to column layer . . . 54
4.28 Gradient computation in padding layer . . . 55
4.29 Image to column operation . . . 55
4.30 Slicing, duplication and concatenation operation . . . 56
4.31 Subtraction operation between the column and replication data . . . 57
4.32 CND architecture exploiting Caffe layers . . . 58
4.33 Output of the CND layers . . . 59
4.34 Patch summary on Python and Caffe implementation . . . 60
4.35 Across parch features layer . . . 61
5.1 Camera setup for a person re-identification dataset construction. . . 64
5.2 Identities in CUHK01 view A. . . 64
5.3 An identities in CUHK01 view B. . . 64
5.4 Differences between CUHK03 labeled and detected . . . 65
5.5 Identities in CUHK03 view A. . . 65
5.6 Identities in CUHK03 view B. . . 66
6.1 Positive and negative training pairs . . . 68
6.2 Data augmentation example . . . 68
6.3 Training augmentation adopted . . . 69
6.4 All positive pairs of one identity . . . 70
6.5 CMC composition . . . 77
6.6 Comparison between the CND and the DCSL network on CUHK03 . . . 79
6.7 Detected pairs from the Cross Neighborhood Difference network imple-mented . . . 80
7.1 Cameras network . . . 82
7.2 Cameras database . . . 82
7.3 Re-identification pipeline . . . 86
List of Tables
5.1 List of common used person re-identification dataset. . . 63
6.1 Number of images per view of each identity before and after augmentation. 69 6.2 Number of positive and negative pairs used to train the model. . . 71
6.3 CPU time in CND python implementation . . . 72
6.4 CPU GPU time in CND python implementation . . . 72
6.5 GPU time on CND implemented with caffe layers . . . 73
6.6 GPU time on DCSL iteration . . . 75
8.1 Forward GPU time of the DCSL . . . 87
8.2 Forward GPU time of the Cross Neighborhood Difference network im-plemented . . . 88
List of acronyms
CND Cross Neighborhood Difference CNN Convolution Neural Network
DCSL Deep Correspondence Structure Learning DNN Deep Neural Network
DPM Deformable Part Models detector NN Neural Network
1
Chapter 1
Introduction
An interesting problem of video surveillance applications is the person re-identification. Systems security applications, such as online tracking of individuals over different cam-eras, or offline retrieval of the video sequences containing an individual of interest, are based on the person re-identification.
"Person re-identification is defined as the process that recognizes if a person has been observed in different locations over a set of non overlapping camera views".
Due to image variations like illuminations, colors, poses, occlutions present in different viewpoints, the person re-identification is still a challenging problem.
Figure 1.2: Typical samples of pedestrain images in person re-identification from CUHK03 data set. On the left, pair with illumination problem, on the right, pair with occlusion problem.
The first works addressing person re-identification are dated in 2003, but only in the last five years we have a large increase in computer vision research oriented to that problem. The researchers cannot solve the problem relying on robust conventional biometrics, such in face recognition, due to insufficient image details for extracting them.
The objective of the researchers, has been to build a system able to: • Extract feature which are:
− Discriminative for identity. − Invariant to pose.
− Invariant to light.
• Combine the features extracted in a re-identification learning model.
1.1
Challenges in person re-identification
The process of person re-identification, must face the following main challenges: • Features representation
The process of person re-identification starts from the comparison of two images which belong to different viewpoints. The initial stage of this comparison is the fea-tures extraction with the aim to find a suitable feafea-tures representation for the person re-identification problem, which is invariant to illumination, viewpoint, background cluttering, occlusion and image quality/resolution. There exists no universal robust
invariant feature representation, which can be applied to different camera views and adaptable to all individuals.
• Inter/Intra class variation
Same person from the human standpoint, can appear different when observed under different camera views (Intra variation), or different person can appear very similar under different viewpoints (Inter variation). This human effect is reproduced also in computer vision and represents a big limitation in a learning model.
• Generalization capability
Once a trained model has been created for a dataset belonging to a certain set of cameras, it is difficult apply the model to another set of cameras. It is a challenge to obtain a model with good generalization ability, which can be applied to different dataset or different camera views.
• Small sample size
Generally, for a learning model it is necessary a big quantity of positive samples to train the network. The real availability of data is at maximum five sample per person in each view (i.e., the CUHK03 dataset) which is not sufficient to learn a good re-identification model.
• Long-term re-identification
The longer the time between wiew, the greater the possibility that the same person can appear different (in clothes or carried objects). The challenge is to find a re-identification system robust to that changes.
1.2
Contributions and Outline of This Thesis
The objectives of this thesis work are:
• Design and implement a system of person re-identification. • Compare it with the already existent systems.
In general, a global person re-identification system is composed by the pipeline in Figure 1.3.
Figure 1.3: Person re-identification system.
The basic pipeline of a person re-identification system is composed by the following mod-ules:
• Object detection. • Person identification.
• Persons similarity estimation (or person re-identification).
This thesis is primarily focused on the person re-identification stage, which is the core component of the pipeline.
An in-depth study of the past and actual works published was done examining the method-ologies used and the advantages/disadvantages of the various techniques.
Subsequently was built a system, inspired to a proprietary work, developed in 2015 to the Mitshubishi Electronic Research Laboratories (MERL) [1] as private project. The thesis proposes an implementation of their system with two different designs of the core of the network.
Furthermore, the results of the system implemented has been compared with the results obtained training, in the same condition, the "Deep Correspondence Structure Learning (DCSL) network"[14] , that is a network presented and published in July 2016 to the 25th International Joint Conference on Artificial Intelligence, which outperforms the current state of art.
Finally, a complete real person re-identification system was built, which permits to extract the moving objects from a camera video stream and compare it with an input probe image related to a person to track.
1.3
Thesis summary
The thesis has the following structure:
Chapter 2 describes some useful concepts used to solve a classification problem, starting from the basic on neural networks to the concept of deep learning. The chapter as well as explains the basic concepts and introduces the advantages of using deep learning in the classification problem.
Chapter 3 is a survey on person re-identification problem, analyzing oldest and recent approaches and highlighting the differences (in terms of advantages) between the meth-ods.
Chapter 4 describes in details the two methodologies used to build the system, starting from the system description through the design to the implementation description. The chapter introduces also to the development tool used.
Chapter5 describes the dataset used to train the network, explaining its structure and its possible defects, which can effect the experiments.
Chapter 6 describes the methodology used to train the networks, the metric used to test the models, and the results obtained.
Chapter 7 describes the design of a complete person re-identification system, explaining all the system components.
7
Chapter 2
Background
This chapter describes some useful concepts that can help to understand the work done on the person re-identification system. The chapter starts with an introduction of classifica-tion problemand the difference between a classical machine learning and a deep learning solution. Next, some concepts about the neural network is provided as introduction to the deep learning neural networks. They are explained through its primary concepts focusing on convolutional neural networks that are extensively used in this thesis.
2.1
Classification problem
The person re-identification problem can be associated to a classification problem: predict whether two images belong or not to the same identity.
The classification is composed by two main phases: • Training phase:
the aim of this phase is to train a learning algorithm using a dataset which contains the training data (image, text,...) and their corresponding labels.
• Prediction phase:
the aim of this phase is to predict labels of unseen images using the trained model. The training processis a learning process composed by hundreds or even millions steps, where in each step the model works on a new unfamiliar data and makes prediction based on a feedback about how accurate was the previous step. In other words, the feedback rep-resents the error between the prediction result and the corresponding input label (correct solution). The task of the learning algorithm is to predict and adjust the model seeking to minimize the error. The iterative process continues until the error decrease.
Figure 2.1: Learning system schema. The training phase is composed by two main steps:
• feature extraction:
this phase extracts features from the input data in order to get its main components and uses them to distinguish better the object.
• model training:
in this phase the machine learning schema (showed in Figure 2.1) is applied to the features extracted, with the aim to learn a suitable model used to classify correctly the input data.
In the past training systems (as explained in Figure 2.2) kept separate the two phases executing the first phase with hand-craft feature extractors, such as SIFT, SURF, ORB and the second phase with a traditional machine learning model (see Figure 2.2). In the
Figure 2.2: Machine learning schema.
last few years, with the introduction of deep learning systems there was a sort of inclusion of the feature extraction phase in the training phase. The feature engineering (process of using domain knowledge of the data to create features) is the main difference between the machine learning algorithm and deep learning algorithm. While in the first case the
machine learning needs hand-crafted features, in deep learning, feature engineering is made directly by the learning algorithm (see Figure 2.3). As said Yoshua Bengio (one of the leader in deep learning) in his 2012 paper [3]:
"Deep learning algorithms, seek to exploit the unknown structure in the input distribution in order to discover good representations, often at multiple levels, with
higher-level learned features defined in terms of lower-level features".
Figure 2.3: Deep learning schema learn to extract optimal features and to classify cor-rectly the input data
The prediction phase is composed by a classifier, which predicts the membership of a given input data to a given class exploiting the trained model.
2.2
Basics of Neural Network
Both machine learning and deep learning models, are based on Neural network. In the following sections is described a recap on basic concepts of neural network in order to understand better the thesis work.
2.2.1
Biological neuron
The elementary building block of a biological neural network is the neuron (Figure 2.5). The human brain consists of a collection of neurons interconnected with each other via axons terminals and dendrites. Each neuron receives input signals from its dendrites and produces output signals along its axons. Each link between axon and dendrite is governed by a synapse with a learnable strength. Higher strength between two neurons, higher is the influence of one neuron on the other. The activation of a neuron depends on the sum of the signals coming from their dendrites.
Figure 2.5: Structure of a typical biological neuron.
2.2.2
Artificial neuron
With respect to the concept of biological neuron, artificial neuron (Figure 2.6) can be expressed as an element with a finite number of inputs with weights associated to them, an activation function and an output represented by the result of the activation function applied to the weighted sum of inputs.
Figure 2.6: Structure of a typical artificial neuron. Formally the output of an artificial neuron is given by:
y = f (PK
i=0WiXi+ θ)
where Xi is the ith input of the neuron with weight Wi, Θ is the threshold/bias of the
neuron and K is the number of inputs. f (·) represents the activation function of the neuron governed by Θ. The output of the neuron is sent to all the neurons connected.
2.2.3
Activation function
The aim of the activation function is to transform the weighted sum of input signals in an output signal. In other words the activation function represents the function that decides if a neuron "get fired or not".
Some of common activation function are listed below:
• Hyperbolic Tangent f (x) = tanh(x)
Figure 2.7: TanH activation function.
• Sigmoid
Figure 2.8: Sigmoid activation function.
• ReLu
f (x) = max(0, x))
Figure 2.9: ReLu activation function.
2.2.4
Feedforward Neural Networks
The feedforward neural network (Figure 2.10) is the simplest schema of artificial neural network.
It is composed by at least three layers: • Input layer.
• One or more hidden layers. • Output layer.
The information flows in one direction, forward, from the input nodes, through the hidden nodes to the output nodes. The schema in Figure 2.10 represents a feedforward neural network with two hidden layer where each node is connected to all nodes of the next layer (fully connected). The free parameters of the network are represented by the number of hidden layers and their size (number of neurons in each layer).
Figure 2.10: Fully connected Feedforward Neural Network with 2 hidden layer.
2.2.5
Training Artificial Neural Network: Backpropagation and
gra-dient descent algorithm
The goal of the training algorithm is to modify the weights of the network in order to minimize the prediction error. To reach this objective the backpropagation algorithm with Gradient Descent or one of its derivatives is used. One of the most used algorithm to solve an optimization problem is the Gradient Descent, whose goal is to minimize the objective function represented by the difference between the real output l (label) and the predicted output y = f (wTx) by updating the parameters in the opposite direction of the
gradient of the objective.
Considering the case of a single neuron, as represented in Figure 2.11:
Figure 2.11: Generic artificial neuron. the output of a single neuron is:
y = f (
K
X
i=0
(wixi)) = f (wTx)
considering a sigmoid function as activation function: f (u) = 1
then the output is given by:
y = f (wTx) = 1 1 + ewTx supponing the error calculated using the squared loss function:
E = 1
2(l − y)
2 = 1
2(l − f (w
Tx))2
the objective is to find wT ables to minimize E. The Gradient Descent algorithm updates the weight parameters following the direction of the loss function until to reach the min-imum. In each iteration, the algorithm, depending on the type, randomly selects one or more data point from the dataset and moves in the direction of the gradient with respect to the data selected.
The gradient of the objective function E respect to an arbitrary weight wi is:
∂E ∂wi = ∂E ∂y ∂y ∂u ∂u ∂wi where: ∂E ∂y = (y − l) ∂y
∂u = y(1 − y), considering ∂f (u)
∂u = f (u)(1 − f (u)) the derivative of sigmoid function ∂u ∂wi = x then, ∂E ∂wi = (y − l)y(1 − y)x The Gradient Descent update, is given by:
wnew = wold− η(y − l)y(1 − y)x where η is called learning rate.
The above formulas are related to the update of one single neuron, but a neural network is a composition of neurons and the weight updating is influenced by all the neurons con-nected.
Let us consider the simple fully connected neural network presented in Figure 2.12, It is composed by one input layer with K neurons x = x1, x2, ..., xK, one hidden layer with
N neurons h = h1, h2, ..., hN and one output layer with M neurons y = y1, y2, ..., yM.
Each input neuron is connected with each hidden neuron with wkiweight, where wki
In the same manner, each hidden neuron is connected with each output neuron with wij0 weight, where w0ij represents the weight between the ithhidden neuron and the jthoutput
neuron.
Figure 2.12: Fully connected neural network.
It is useful to think the weights between the input and the hidden layer as a matrix W of size K · N and the weights between the hidden and the output layer as a matrix W0 of size N · M .
Considering the sigmoid activation function f (u) = 1
1 + eu
the output of a generic hidden neuron is:
hi = K
X
k=1
(wkixk)
yi = N
X
i=1
(w0ijhi)
the squared loss function is given by:
E = 1 2 M X j=1 (yj− lj)2
In order to update the two sets of weights (wki and w
0
ij), it is necessary to compute the
gradient of the objective function E respect to wki and w
0
ij. In order to do that is applied
the derivatives chain rule:
∂E ∂wki0 = ∂E ∂yj ∂yj ∂u0j ∂u0j ∂w0 ij where, ∂E ∂yj = yj− lj ∂yj ∂u0j = yj(1 − yj) ∂u0j ∂w0 ij = hi ∂E ∂w0ki = (yj− lj)yj(1 − yj)hi and the Gradient Descent update of the hidden neuron is given by:
w0ijnew = wi0oldj − η(yj− lj)yj(1 − yj)hi
In contrast, the gradient of the objective function E respect to wkiis given by:
∂E ∂wki = M X j=1 (∂E ∂yj ∂yj ∂u0j ∂u0j ∂hi )∂hi ∂uj ∂uj ∂wki
(The sum is due to the fact that each hidden neurons is connected to each output unit). where, ∂E ∂yj ∂yj ∂u0j = (yj− lj)yj(1 − yj) ∂u0j ∂hi = ∂PN i=1(w 0 ijhi) ∂hi = w 0 ij ∂hi ∂ui = hi(1 − hi) ∂ui ∂wki = ∂PK k=1(wk ixk ) ∂wki = xk finally:
∂E ∂wki = M X j=1 [(yj− lj)yj(1 − yj)w 0 ij]hi(1 − hi)xk
in conclusion the updating weights are given by:
wnewk i = wkoldi − η M X j=1 [(yj− lj)yj(1 − yj)w 0 ij]hi(1 − hi)xk
There exist three variants of gradient descent algorithms, which differ in how much data are used to compute the gradient of the objective function.
Below, the three techniques are briefly explained. • Batch gradient descent
Batch gradient descent, computes the updating parameters for the entire training dataset. Therefore it is necessary to compute the gradient for the whole dataset to perform just one update. This method is very slow and impossible to use with models whose dataset does not fit in memory.
• Stochastic gradient descent (SGD)
Stochastic gradient descent computes the updates for each training sample. Com-puting one update each time makes the updating process much faster and it can also be used to learn online.
• Mini batch gradient descent
Mini batch gradient descent is a composition of the two previous methods, it per-forms an update for every mini-batch of n training samples. It reduces the variance of the parameter updates, which can lead to more stable convergence respect to the SGD, and it is much faster respect to the Batch gradient descent.
2.2.6
Advanced NN architecture
One of the Neural Network (NN) architecture mostly used in the recognition field is the Siamese neural network architecture. The Siamese neural network is a class of neural net-work architectures that contains two or more identical subnetnet-works. Identical is referred to the sharing of parameters and weights. Siamese NNs are popular among tasks that in-volve finding similarity or a relationship between two comparable things such in the case of person re-identification. Generally, in such tasks, two identical subnetworks are used to process the two inputs extracting the main information in the same way, and another module will take their outputs and produce the final output. The Figure 2.13 describes
the architecture of a simple siamese NN that compares two input data, and returns their distance.
Figure 2.13: Siamese NN architecture, composed by two input data, that trough the same NN in separated way.
The benefits of this architecture are:
• Few parameters thanks to the sharing of the weights.
• Each subnetwork essentially produces a representation of its input. If the inputs are of the same kind, it makes sense to use similar model to process similar inputs. In this way, the architecture extracts the same semantics for both the input, making them easier to compare.
2.3
Deep Neural Networks
In this section the main concepts of Deep Neural Network (DNN) are given. We starts by talking about the main difference between a DNN and a "shallow" NN and then give the basic concepts on the Convolution Neural Network (CNN) and finally discuss about the importance of the network depth.
2.3.1
Comparison between DNN and a "shallow NN"
The depth of a network is given by the number of layers of the network. Given a feed-forward NN with one input layer, one hidden layer and one output layer, the depth of the network is three. The number of hidden layers can be more than one and the depth of the
network can increase.
It is possible to define a DNN, as:
"an artificial neural network (ANN) with multiple hidden layers of units between the input and output layers." [2].
In contrast a shallow neural network is:
"an artificial neural network (ANN) with zero or one hidden layer.
Figure 2.14: Difference between shallow network (left) and DNN (right).
The principal advantages of a DNN are:
• It can implements functions with higher complexity than shallow one, using the same number of resources.
• It offers invariance to shifting, scaling, and other forms of distortion.
• It approximates better the depth of visual system in human brain. There exist ten-twenty layers of neurons from the retina to the inferotemporal cortex (where object categories are encoded).
• It eliminates the need of hand-crafted feature extractors, that is one of the most time-consuming parts of machine learning practice.
However, DNN presents some disadvantages: • It requires a large amount of data.
• It requires high computational resources in order to elaborate the results.
The last disadvantage has been reduced by the recent availability of powerful Graphical Processing Units (GPUs).
2.3.2
Convolutional Neural Networks
The CNN are the most used neural network in deep learning.
A convolutional neural network, can be summarizes into three main concepts: • Local receptive fields.
• Shared weights.
• Pooling or downsampling.
Local receptive fields and shared weights
In visual recognition, it is often advantageous to consider local part of an image respect to the whole image, due to the fact that pixels close together tend to be more correlated respect far pixels. This is obtained applying the convolution operation. The convolution operation is possible considers as a sliding window applied to a matrix, where the window is the kernel (or filter) and the matrix is the image. The kernel is sliding over the input image. For each position of the kernel is multiplied the overlapping values of the kernel and image are multiplied, and added up. This sum of products will be the value of the output image at the point in the input image where the kernel is centered. In the Figure 2.15, the convolution operation with a 3 · 3 filter is explained. The result is obtained mul-tiplying element-wise the values of the filter with each overlapping value of the matrix and sum them up. To get the full convolution, called feature map, the operation is done over the whole matrix sliding the window with a certain stride, which dictates how many pixel the window is moved. In the example the stride is one. By applying the convolution, the CNN architecture is able to capture the local part of the input image, such that each neuron depends only on a spatially local subset of the variables in the previous layer. To understand better this concept, it is possible to consider the following example:
suppose an input image of size 32 · 32 pixels (32 · 32 = 1, 024 input neurons) and con-sidering an hidden layer fully connected with the input. In this case, each input neuron is connected to all neurons of the hidden layer (Figure 2.16).
Each color in the figure corresponds to different weight. Then, if the size of the input layer is 32 · 32, the number of weights is (32 · 32) · (number of hidden neurons).
Figure 2.15: Convolution with 3x3 Filter and stride 1.
neuron is connected to a small local group of size 8 · 8 of the input neuron (Figure 2.17). The set of nodes in the input layer that affect the activation of a neuron is referred to as neuron’s receptive field.
In the Figure 2.17 each neuron of a certain feature map use the same kernel filter and share the weights, in contrast, neurons of different feature maps have different weights. Then, if the kernel size (size of receptive field) is 8 · 8, the number of weights necessary will be (8 · 8) · (number of feature maps).
In conclusion, in a CNN, individual neurons generally have a local receptive field rather than a global receptive field as in the fully connected case and the number of weights used in CNN is much lower than the weights used in the fully connected architecture.
Figure 2.16: Receptive field in a fully connected network.
Figure 2.17: Receptive field in a CNN.
the filters work on every part of the image. Therefore, if the location of a certain feature in the input is translated, the activation output of the layer also translates proportion-ally.
Pooling
The pooling layer, is typically applied after a convolution layer and performs the task of subsampling their input.
The operation usually consists on applying a subsample operation to a multiple patches of the input in order to cover the whole input matrix. This operation is conducted by varying a sliding window over the entire input matrix, for each variation of the window is computed a local subsampling. The typical subsampling operation adopted is the max-pooling, which computes the maximum between all the element in all single patch. The advantages of using max-pooling are:
• It provides a fixed size output matrix
It is possible want to have a certain size of the features extracted. • invariance to translation and rotation
After a pooling over a region, the output will stay approximately the same even if the image is translated/rotated by a few pixels, because the max operations will pick out the same value regardless.
• keeps the salient information reducing the dimensionality
Applying the pooling, the exact global features position information about locality are lost but it is maintained local information captured by the filters.
Figure 2.18: Max pooling operation. Application of a CNN architecture
After the brief introduction about the basic elements of a CNN, in this section we present the basic architecture of a CNN classifier.
Figure 2.19: Simple CNN classifier.
The input layer of the network in Figure 2.19 is composed by a 28 · 28 image.
Then, a convolution layer with kernel size 5 · 5 which produce five feature maps with size 24 · 24 is applied. These feature maps describe the local features extracted by different filters.
For each feature map, a pooling layer with kernel size 2 · 2 is applied, which downsamples the feature maps of a factor two, then, the largest number from each local patch of the feature maps is recorded. At the end of the network there is a fully connected layer that combines the input information and extracts the feature vector to pass it to a softmax element, with the aim to classify the image (softmax regression).
The softmax regression is a generalization of logistic regression and it is used for multi-class multi-classification (assuming the multi-classes mutually exclusive).
following form is used: P (y = j | zi) = φsof tmax(zi)) = ez(i) Pk j=0ez (i) k where, z = x0w0+ x1w1+ ... + xmwm = m X l=0 xlwl= wTx
w is the weights vector, x is the vector of one training sample, and w0 is the bias unit.
The softmax function computes the probability that the tranining sample x(i) belongs to class j, given the weight and the net input z(i). Hence, it computes the probability
p(y = j|x(i) | w
j) for each class label in j = 1, ..., k. In order to apply a backpropagation
algorithm, it is necessary define a cost function J to minimize it. The cost function is given by the average of the cross entropies over the training samples
J (θ) = 1 n n X i=1 H(Ti, Oi) where, H(Ti, Oi) = − X m Tilog(Oi)
which represents the entropy between the target Ti and the output Oi computed by
soft-max.
The computation of the weights with the backpropagation algorithm is the same one used in any feedforward neural network (see Section 2.2.5).
Depth of CNN
The CNN proposed in the previous example is a shallow CNN never used in real appli-cations. It is possible obtain better results with deep CNN composed by replications of convolution and pooling layers. The experiments in the last years have shown that by increasing the depth of the network, the performance improves. It is possible think to the VGG network [11] (the actual state of art in visual recognition), which shows as the depth of the network is a critical component for good performance, in fact the network contains 16 convolution and fully connected layers. Another deep CNN used also in re-identification, is the googleNet [12], (22 layers deep network) which wons the imageNet
competition in 2014 for the image classification and object localization challenge with one hundred categories.
27
Chapter 3
Related works
In this chapter the main works on person re-identification from the earlier to the more recent period are presented. We also describe the two networks tried in this thesis work.
The existing person re-identification approaches can be classified in two categories: • Hand-crafted systems.
• Deep learning systems.
3.1
Hand-crafted systems for person re-identification
Hand crafted systems are systems which keep separated the feature extraction from the metric learning and they try to optimize both problems separately.
Metric learning
The goal of methods based on metric learning is to learn a distance metric to reduce the distance of the matched images, and enlarge the distance of the mismatched images. In person re-ID, the majority of works fall into the scope of supervised global distance met-ric learning. One of the most used metmet-ric in person re-identification is the Mahalanobis distance function, which generalizes Euclidean distance using linear scalings and rota-tions of the feature space. This metric is used in the KISSME project [9].
Feature extraction for person re-identification
The goal of methods based on feature extraction is to extract feature sufficiently dis-criminative for re-identification, much possible invariant to lighting, pose, and viewpoint changes. The most commonly used feature is color, while texture features are less
fre-quent in this field due to the fact, in most cases, only low-resolution images can be used. Traditional features such as color histogram is most widely used in the first re-identification works [5]. With the consideration of the influence of illumination varia-tions, it is possible to calculate color histograms in different color spaces separately and fuse them to make the final feature more robust to illumination changes. Other approaches based on color histogram are the Weighted Color Histogram, which assigns larger weights to pixels near the symmetrical axis and forms a color histogram for each part or the Max-imally Stable Color Regions (MSCR), which detects stable color regions and extracts features such as color, area, and centroid.
3.2
Deep learning systems in person re-identification
With the introduction of CNN and deep learning, the new re-identification approaches are based on deep networks that simultaneously find an effective set of features and a corre-sponding optimization of the metric similarity function.
Deep Metric learning in person re-identification
In the literature, the first work to apply deep learning in the person re-identification prob-lem is proposed by Yi et al. in [13]. The network developed is based on siamese convo-lutional networks (SCNN) and consists in three independent convoconvo-lutional networks that act on three overlapping parts of the two input images (Figure 3.1). Each SCNN consists
Figure 3.1: Network structure of Yi et al project.
in two convolutional layers with max pooling, followed by a fully connected layer (Figure 3.2).
The fully connected layer combines the information extracted in the previous layers and produces an output vector for each input image, finally, the two output vectors are com-pared using a cosine function.
Figure 3.2: CNN of the siamese subnetwork.
Figure 3.3: Siamese CNN applied to a part of the image. (Figure 3.3).
Deep Filter Pairing Neural Network for Person Re-Identification
The network proposed by Li et al [8] (DFPNN) is a siamese network (Figure 3.4) with a single convolutional layer with max pooling, followed by a patch-matching layer that multiplies convolutional feature responses from the two inputs at a variety of horizontal offsets. This is followed by a max-out grouping layer that keeps the largest response from each horizontal strip of patch-match responses, followed by another convolutional layer with max pooling, and finally a fully connected layer and softmax output.
The benefits of this method is that jointly optimizes feature learning, photometric trans-forms, geometric transtrans-forms, misalignment, occlusions and classification under a unified deep architecture.
Cross-Neighborhood Difference Network for Person Re-Identification
The Cross Neighborhood Difference network (CND) (Figure 3.5), which has inspired this thesis work, [1] is a siamese network with a pair of images as input.
The first two layers are tied convolution with max pooling used to extract the salient features, then is applied a cross-input neighborhood differences, which computes dif-ferences in feature values across the two views around a neighborhood of each feature location. The difference results is given to patch summary features which summarizes these neighborhood difference maps, and passed to the next layer which learns the spatial relationships across neighborhood difference. Finally is computed the higher-order rela-tionships between the two images applying a fully connected layer and a softmax function to yield the final estimate of whether the input images are of the same person or not. This
Figure 3.5: Cross-Neighborhood Difference Network architecture. architecture will be study in depth in the next chapters.
Semantics-Aware Deep Correspondence Structure Learning for Robust Person Re-identification
The proposed DCSL network (Figure 3.6), is a person re-identification network divided in two main parts:
• Feature extraction network. • Learning matching function.
The first part of the network (feature extraction part) is composed by the googleNet net-work [12] (classification netnet-work which won the imagenet competition in 2014), which learns discriminative features for person identification and returns their intrinsic structural information. The second part of the network is built to learn the feature matching func-tion, which outputs the matching correspondence results between the learned semantics-aware image representations extracted in the previous phase. The entire deep network seek to jointly optimize the processes of semantics-aware image representation learning and cross-person correspondence structure learning.
This network beat all the previous methods proposed in literature and represent the actual state of art in person re-identification.
33
Chapter 4
System implementations
4.1
Introduction
The aim of this thesis work, as already described in Chapter 1, is to design and imple-ment a system for person re-identification and compare it with others existing systems. The system implemented is inspired to the Cross Neighborhood Difference Network de-scribed briefly in Section 3.2. The design and the implementation of each network layer are described in details in this chapter focusing primarily on the core layer of the net-work.
4.2
Tools used
The entire thesis work, network implementation, training and testing phases has been de-veloped using Caffe [7] open framework, which provides a complete toolkit for training, testing, finetuning, and deploying neural network models.Caffe offers a set of standard and precompiled layers availables to build a neural network model, but the modularity of the software allows also to extend to new data formats, network layers, and loss functions. Each layer is written with C++/CUDA library, which permits to take advantage of GPU unit. Thanks to the modularity of the code, Caffe offers also the possibility to build a cus-tom layer using different languages (C, C++/CUDA, Python). For rapid prototyping and interfacing with existing research code, Caffe provides Python and MATLAB bindings which permits to build networks and classify inputs.
4.2.1
Caffe model structure
The basic components used to implement a Caffe network model are described in the fol-lowing sections.
Blob data
A Caffe model network is a composition of layers connected each other, where the con-nections describe the data flow process in forward and in backward passes.
Figure 4.1: Forward and backward passes in a deep learning network.
The data are stored and managed in elements called blobs. A blob in caffe is basically an N-dimensional array stored in a C-contiguous fashion.
The conventional blob dimensions for a batch of image data are given by: N · C · H · W
Figure 4.2: Data blob (NxCxHxW).
and W is the dimension of the data (i.e. in image data, H and W represent the height and width of the image).
Each blob stores in memory two data:
• The data part, which contains the data computed in the forward pass. • The diff part, which contains the gradient computed in the backward pass. Layer connections
The basic network computations in Caffe are performed to the layer. The most important operations offered by the layers, are:
• Convolution. • Pool. • Inner product. • Element-wise transformations. • Normalization. • Data load. • Loss computation.
Each layer receives one or more input blob through bottom connections and makes output through top connections (Figure 4.3).
Figure 4.3: Layer connections.
The composition of a layer is defined by three critical computations: • Setup
• Forward
Used to compute the layer operations. It takes the input from the bottom and returns the output to the top.
• Backward
Used to compute the gradient of the layer. It takes the gradient from the top and returns the gradient calculated to the bottom.
A typical example of Caffe network is dipicted in Figure 4.4:
Figure 4.4: Caffe network example.
The network begins with a data layer that loads data from disk, then the data is passed to a fully connected layer (Inner product) and finally the output (i.e. the classification score) and the loss value using Softmax loss layer is computed. The loss value goes back in each layer of the network in order to adjust the weights.
4.3
Cross Neighborhood Difference Network
implemen-tation
The design and the implementation of each layer of the Cross Neighborhood Difference Network (CND) introduced in Section 3.2 is described in detail in this section. In Figure 4.5, the basic network architecture is reported, and as we will see, it suffers changes depending on the implementation done.
4.3.1
Input data layer
The network implemented is a siamese network (see Section 2.2.6), which takes in input a pair of images with a correspondent label indicating whether the person in the two images are the same or not. The input images are RGB images of dimension 160 · 60. Each pairs is composed by two images, which represents same or different identity in two different viewpoints.
4.3.2
Convolution and max pooling layer
The first processing layers present in the network are two convolution and max pooling layers used to extract the salient features of the input images. The layers are tied: they share the same weight parameters in order to use same filters to highlight the same fea-tures.
Figure 4.6: First convolution layer.
The data input is an RGB image pair with dimension 3 · 160 · 60, and the convolution is made with 20 learning filters of size 3 · 5 · 5 and stride 1. Let us consider for simplicity a batch size of dimension 1, the input blob size is:
the output blob size of the convolution layer is given by: n · co· ho· wo = 1 · 20 · 156 · 56 where: ho = hi+ 2 · padh− kernelh strideh + 1 = 160 + 2 · 0 − 5 1 + 1 = 156 wo= wi + 2 · padw− kernelw stridew + 1 = 60 + 2 · 0 − 5 1 + 1 = 56
and co is the number of filters used in convolution (20 filters).
The 20 feature maps obtained are passed to a max pooling layer (Figure 4.7) with the objective to halves the feature dimensions. The max pooling layer using a 2 · 2 kernel and stride 2 computes for each feature map the max in each 2 · 2 subregion obtaining 20 feature maps of half size.
Figure 4.7: First max pooling layer. Let us consider the input size:
n · ci· hi· wi = 1 · 20 · 156 · 56
the output blob size of the max pooling layer is given by:
where: ho = hi+ 2 · padh− kernelh strideh + 1 = 156 + 2 · 0 − 2 2 + 1 = 78 wo= wi + 2 · padw− kernelw stridew + 1 = 56 + 2 · 0 − 2 2 + 1 = 28
and cois equal to the number of filters used (20 filters). The output of the first Conv/pool
layer is reported in Figure 4.8. It is possible to see as the convolution layer highlights the salient image information whereas the pooling layer summarizes this information.
Figure 4.8: Output of first Conv/pool layer: on the left the output of convolution layer, on the right the output of pooling layer.
The output of the first Conv/pool layer is connected to a second Conv/pool layer, (Figure 4.9) which performs first, a tied convolution with 25 filters with kernel size of 20 · 5 · 5 and stride 1, then a pooling, which halves the 25 features maps obtained applying a 2 · 2 kernel size with stride 2.
Figure 4.9: Second convolution and max pooling layer.
Figure 4.10: Output of second Conv/pool layer: on the left the output of convolution layer, on the right the output of pooling layer.
4.3.3
Cross Neighborhood difference layer
The Cross Neighborhood difference layer is the core of the network, which computes differences in feature values across the two views around a neighborhood of each feature location. The first two convolutional layers of the network provide a set of 25 feature maps per input image as showed in Figure 4.11.
Given the fi and the gi feature maps, which represent the ith feature map from the first
Figure 4.11: Input feature maps to the Cross Neighborhood difference layer. fi is the ith
feature map of the first input image, giis the ithfeature map of the second input image.
Ki(x, y) = fi(x, y)I(5, 5) − N [gi(x, y)]
where:
(1 ≤ x ≤ 12) and (1 ≤ y ≤ 37) I(5, 5) is a matrix 5 · 5 of 1s
N [gi(x, y)] is the 5 · 5 neighborhood of gicentered at (x,y)
and for symmetry:
Ki0(x, y) = gi(x, y)I(5, 5) − N [fi(x, y)]
where:
(1 ≤ x ≤ 12) and (1 ≤ y ≤ 37) I(5, 5) is a matrix 5 · 5 of 1s
and N [fi(x, y)] is the 5 · 5 neighborhood of ficentered at (x,y)
The output of the Cross Neighborhood Difference layer is composed by 25 difference feature maps per branch. An example of this computation is explained in Figure 4.12 and 4.13 where K represents the neighborhood difference of feature maps belonging to the first input image with respect to the feature maps belonging to the second input image whereas K0 represents the inverse difference.
This layer has been implemented in two different versions: • Building a python custom layer.
Figure 4.12: Neighborhood difference view A - view B.
In the following sections the two versions are extensively explained.
4.3.4
Cross Neighborhood difference layer: Custom implementation
A possible implementation of the Cross Neighborhood Difference layer is to build a new custom Caffe layer as explained in detail in this section. To this end, it was necessary to implement the forward and the backward pass of the layer. In order to exploit the matrix functions that the language offers the entire layer has been implemented using python. Obviously the python code cannot implement layers which exploit the GPU unit, this disadvantage represents a big inefficiency during the training phase as explained in Chapter 6. In order to make the layer efficient in terms of forward and backward computation it was necessary to rethink the design of the layer.
The new layer design is composed by three main parts: • Neighborhood layer.
• Difference layer. • Reshape layer.
In the following sections, the forward and the backward pass of each layers is explained in details.
Neighborhood layer: forward pass
The Neighborhood layer applies a particular reshapes of the two input feature maps (given from the two previous conv/pool layers) obtaining two blobs data per branch: column and replicationblobs, which are ready for a simple element wise subtraction offered by the difference layer (Figure 4.14).
Figure 4.14: Neighborhood layer architecture. Image to column data: "column blob"
The operation performed in order to build the column data blob are extensively explained below.
Given a 25 · 12 · 37 blob data, the neighborhood layer computes the "column" data blob, which represents the "image to column" transformation, for each input feature map. Given a feature map and a block size 5 · 5, the image to column operation rearranges each feature map blocks into columns. To this end, the first step is to compute the padding, which enlarges the feature map of 2 pixels each side in order to compute the neighborhood block also to the corners (Figure 4.15).
On the padded feature map a reshaping of each map blocks sliding a 5 · 5 window over the whole map with stride 1 is executed (Figure 4.16).
Figure 4.15: Feature padding.
Replication data: "replication blob"
The steps necessary to build the replication data blob are explained in this section. Given a 25·12·37 blob data, the neighborhood layer computes the "replication" data blob, which represents the reshaping and replication of each input feature map. Each input map is reshaped from 12 · 37 to 1 · 444, then the reshaped result is replicated 25 times obtaining a map of dimension 25 · 444 as explained in Figure 4.17
Figure 4.17: Rashaping and replication of the input data.
Difference layer: forward pass
The difference layer applies an element wise subtraction between the arriving input from the Neighborhood layer: column and replication data blobs. Each "column" map contains in each row a block 5 · 5 of the feature map related to the first input of the neighborhood layer, in contrast the "replication" map contains in each row a single pixel of the feature map related to the second input of the neighborhood layer. In particular the ithrow of the
columnmap is the 5x5 neighborhood values of the pixel in position 12i , i mod 2 in the first input map and the ith row of the replication map is the replication of the pixel in position 12i , i mod 2 in the second input map. Then we perform an element-wise subtraction between the two maps that it means subtract each region 5·5 of the first feature map to the center pixel of the same region in the second feature map (see Figure 4.18).
Figure 4.18: Subtraction architecture on the left and its implementation on the right.
Reshape layer: forward pass
In order to get a better vision of the neighborhood differences a reorganization of the sub-traction output in blocks of dimension 25 · 25 is need. Each row of the subsub-traction blob (25 · 444) is reshaped into a blob which contains 12 · 37 blocks of 25 · 25, where each block represents the differences in feature values across the two views around a neighborhood of each feature location (see Figure 4.19).
Figure 4.19: Reshape architecture on the left and its implementation on the right.
Figure 4.20: The cross neighborhood difference layer architecture applied to both the input feature maps.
Backward pass
The backward pass, as said in Section 2.2.5, mathematically is based on the chain rule that is used for computing the derivative of the composition of two or more functions.
Figure 4.21: On the left the forward pass of a generic function f (x, y), on the right the corresponding backward pass exploiting the chain rule.
In the Figure 4.21, the forward pass on the left calculates z as a function f (x, y) using the input x and y, on the right it is showed the backward pass: the layer receives dL/dz, as gradient of the loss function with respect to z from the next layer, the gradients of x and y on the loss function is calculated by applying the chain rule, as shown in figure. The whole backpropagation of the cross neighborhood difference layer is showed in Figure 4.22. For simplicity is represented only one branch of the network, the other one is iden-tical but with the inverted input.
Reshape layer: backward pass
The reshape layer receives a 25 · 60 · 185 "diff blob" containing the gradients from the next layer of the network. The gradient computation in this layer is only a reshaping of the gradient arrived. Each block 5 · 5 of the gradient produced is reshaped in a single row 1 · 25, as explained in Figure 4.23. Eventually a 25 · 25 · 444 gradient blob to pass to the previous layer is obtained (see below).
Figure 4.23: Gradient computation in the reshape layer.
Difference layer: backward pass
The gradient blob produced by the reshape layer is the input of the difference layer in the backward pass. In the forward pass, the difference layer computes the element wise differences between each input feature maps (see Section 4.3.4). Mathematically, in the forward pass the following subtraction for each feature map is computed:
S = I − R

where I is the feature map of the column blob and R is the feature map of the replication blob.
In the backward pass, in order to compute the gradients for the column data and the replication data, it is necessary to compute:

∂(I) = ∂(S)        ∂(R) = −∂(S)

where ∂(I) is the column gradient and ∂(R) is the replication gradient.
Figure 4.24: Gradient computation in subtraction layer.
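In numpy the whole backward pass of the difference layer reduces to a sign flip (one-channel sketch, illustrative names):

import numpy as np

# Gradient arriving from the reshape layer for one channel (25x444).
dS = np.random.randn(25, 444).astype(np.float32)

# Since S = I - R, the chain rule gives dL/dI = dL/dS and dL/dR = -dL/dS.
dI = dS      # gradient for the column blob
dR = -dS     # gradient for the replication blob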
Neighborhood layer: backward pass
The backward pass of the Neighborhood layer returns the gradients of the two inputs given to the layer. In order to compute those gradients it is necessary to apply the inverse procedures of the column and replication operations, as shown in Figure 4.25.
Figure 4.25: Backward pass in the neighborhood layer.
The procedure to build each gradient blob is explained in the following sections.
Replication: backward pass
The replication layer, in the backward pass, receives a 25 · 25 · 444 "diff blob", which represents the negative gradient of the subtraction layer. In the forward pass each column of the replication output map is the replication of a single element of the input data. In the backward pass each gradient column therefore contains the derivatives of one single element, so that the summation of the gradients in one diff blob column is the derivative of one single element in the backward output blob. Mathematically,

∂Y_{1,1} = Σ_{k=1}^{25} ∂R_{k,1},   ∂Y_{1,2} = Σ_{k=1}^{25} ∂R_{k,2},   …,   ∂Y_{12,37} = Σ_{k=1}^{25} ∂R_{k,444}

where ∂R is the gradient arriving from the subtraction layer related to the replication blob (see Section 4.3.4) and

∂Y = { ∂Y_{x,y} },   1 ≤ x ≤ 12, 1 ≤ y ≤ 37,

is the gradient of the second input of the neighborhood layer.
Figure 4.26: Gradient computation in replication layer.
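This column-wise summation can be sketched in numpy as (one channel, names illustrative):

import numpy as np

# Gradient for the replication blob of one channel (25x444).
dR = np.random.randn(25, 444).astype(np.float32)

# Each column holds 25 derivatives of the same input pixel: summing over
# the rows and undoing the 1x444 reshape yields the 12x37 input gradient.
dY = dR.sum(axis=0).reshape(12, 37)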
Column: backward pass
In the forward pass, image to column is the operation applied to compute the column data. The inverse of the image to column operation is column to image, which is applied in the backward pass in order to redistribute the gradients correctly. The layer receives in the backward pass a 25 · 25 · 444 "diff blob", which represents the positive gradient of the subtraction layer (column gradient). The column to image operation rearranges each column of the gradient into a 5 · 5 block and places it in the backward output data.
Considering that in the forward pass each column of the column blob is computed by sliding a 5 · 5 window with stride 1, consecutive windows share 4/5 of their values. In the backward pass each gradient column is reshaped into a 5 · 5 block and placed in the output blob with stride 1, and the superimposed derivatives of the overlapping windows are summed, as shown in Figure 4.27.
Figure 4.27: Gradient computation in image to column layer.
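A minimal numpy sketch of the column to image accumulation (one channel; the padded map size 16 · 41 follows from the 5 · 5 window with a 2-pixel border):

import numpy as np

# Gradient for the column blob of one channel: column j is the flattened
# 5x5 window whose top-left corner is at position j of the padded map.
dI = np.random.randn(25, 444).astype(np.float32)

# col2im: put each 5x5 window back at its stride-1 position on the padded
# 16x41 map; overlapping contributions are accumulated by the "+=".
dpad = np.zeros((16, 41), dtype=np.float32)
for j in range(444):
    r, c = divmod(j, 37)              # top-left corner of window j
    dpad[r:r + 5, c:c + 5] += dI[:, j].reshape(5, 5)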
Finally, in the forward pass, in order to compute the column data a padding operation is performed, which enlarges each side of the map by two pixels, replicating the values on the border. In the backward pass the procedure is inverted: the map is reduced by two pixels per side and the derivatives of each enlarged pixel are summed into the border pixel they were copied from, as shown in Figure 4.28. The result obtained, ∂X, is the gradient of the first input of the neighborhood layer.
Figure 4.28: Gradient computation in padding layer.
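A sketch of the padding backward pass, assuming the replicate padding described above (pad of 2 per side):

import numpy as np

# Gradient on the padded 16x41 map produced by the col2im step.
dpad = np.random.randn(16, 41).astype(np.float32)

# Backward of replicate padding: the gradient of every padded pixel is
# accumulated into the 12x37 border pixel it was copied from.
dX = np.zeros((12, 37), dtype=np.float32)
for i in range(16):
    for j in range(41):
        src_i = min(max(i - 2, 0), 11)   # clamp back inside the map
        src_j = min(max(j - 2, 0), 36)
        dX[src_i, src_j] += dpad[i, j]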
4.3.5 Cross Neighborhood Difference layer: implementation by the standard Caffe layers
In order to avoid the Python custom layer limitations, the same layer has been implemented exploiting the standard Caffe layers, which offer forward and backward implementations in CUDA/C++ and take advantage of the GPU. The Cross Neighborhood Difference layer receives from the first conv/pool layers (see Section 4.3.2) the two input feature maps f, g of dimension 25 · 12 · 37, related respectively to the two images presented to the network. To the first input of the layer (f) an image to column operation is applied, which rearranges each 12 · 37 feature map into a 25 · 12 · 37 feature map using a 5 · 5 sliding window with stride 1. The output is 25 feature maps of size 25 · 12 · 37, where each 25 · 1 · 1 block represents a 5 · 5 block of the original feature map. In other words, each block of the original feature map is rearranged along the depth of the output feature map. Each of these blocks represents the neighborhood of one pixel of the original feature map, as explained in Figure 4.29.
Figure 4.29: Image to column operation: the blue block in N represents the neighborhood of the pixel (1,1) of the first feature map of f, while the green block represents the neighborhood of the pixel in position (12,37) in the first feature map of f.
Formally, given the first input feature maps

f = { f_k(x, y) }_{k=1}^{25},   1 ≤ x ≤ 12, 1 ≤ y ≤ 37,

the image to column operation computes:

N = { { N_k^h(x, y) }_{h=1}^{25} }_{k=1}^{25}

where { N_k^h(x, y) }_{h=1}^{25} is the neighborhood of f_k(x, y).
The second input blob of the layer (g), of size 25 · 12 · 37, is sliced along the depth axis in order to get each feature map separately. Each slice (g_k) is replicated 25 times through the duplicate layer, obtaining a 25 · 12 · 37 feature map for each of the 25 original ones, as shown in Figure 4.30.
Figure 4.30: Second input data after slice, duplication and concatenation. The input feature map value in position (1,1) (blue value of g) is replicated 25 times and placed in the blue block in position 1 of R. In the same way the green value in (12,37) is replicated in the green block of R.
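The slice-and-duplicate step amounts to a channel-wise repeat; a numpy sketch (the blob name g is from the text, the rest is illustrative):

import numpy as np

# Second input blob g: 25 feature maps of 12x37.
g = np.random.randn(25, 12, 37).astype(np.float32)

# Duplicate each channel 25 times along the depth axis, giving 25
# consecutive blocks of 25 identical maps each, i.e. a 625x12x37 blob R.
R = np.repeat(g, 25, axis=0)
assert R.shape == (625, 12, 37)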
Formally, given the second input feature maps

g = { g_k(x, y) }_{k=1}^{25},   1 ≤ x ≤ 12, 1 ≤ y ≤ 37,

the duplication operation computes:

R = { { R_k^h(x, y) }_{h=1}^{25} }_{k=1}^{25}

where { R_k^h(x, y) }_{h=1}^{25} is the replication of g_k(x, y).
The subtraction between the image to column feature maps (N) and the replication feature maps (R) is an element-wise subtraction, which returns 25 feature maps, where each 25 · 1 · 1 block represents the differences between the replication of the kth value in g_i and the neighborhood of the kth value in f_i.
Figure 4.31: Subtraction operation between the image to column of the first feature maps f and the replication of the second feature maps g.

Formally, the subtraction feature maps are given by:

K = { { K_k^h(x, y) }_{h=1}^{25} }_{k=1}^{25}

where

{ K_k^h(x, y) }_{h=1}^{25} = { N_k^h(x, y) }_{h=1}^{25} − { R_k^h(x, y) }_{h=1}^{25}

The result is a layer which computes the cross neighborhood differences: K = f − g.
The layer is also applied with the inputs inverted in order to compute the inverse differences:

K′ = g − f
Figure 4.32: Cross neighborhood difference layer architecture exploiting standard Caffe layers.
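In pycaffe's NetSpec this subtraction can be sketched with a standard Eltwise layer using coefficients (1, −1); the blob names and shapes here are illustrative, not the thesis prototxt:

import caffe
from caffe import layers as L, params as P

n = caffe.NetSpec()
# Two illustrative 625x12x37 input blobs: N (image to column of f) and
# R (duplication of g).
n.N, n.R = L.Input(ntop=2, shape=[dict(dim=[1, 625, 12, 37])] * 2)
# Eltwise SUM with coefficients (1, -1) realizes the element-wise K = N - R;
# applying the same two branches with f and g swapped yields K'.
n.K = L.Eltwise(n.N, n.R, operation=P.Eltwise.SUM, coeff=[1.0, -1.0])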
4.3.6 Patch summary features layer
The cross differences between the feature maps of the two input images, obtained by the Cross Neighborhood Difference layer, are the inputs to the Patch Summary Features layer. By applying a convolution, this layer summarizes each difference patch obtained in the previous layer. The two versions of the Cross Neighborhood Difference layer, the Python custom layer (Section 4.3.4) and the one exploiting standard Caffe layers (Section 4.3.5), return the same result but with a different organization (see Figure 4.33).
Figure 4.33: Output of Cross Neighborhood difference: On the left the implementation using the standard caffe layers, on the right the custom implementation.
For this reason, the two implementations need two different forms of convolution. In the first case (custom implementation), in order to summarize each 5 · 5 neighborhood difference, a convolution with filters of dimension 25 · 5 · 5 and stride 5 is necessary. In the second case (implementation exploiting the standard Caffe layers), to summarize each 1 · 1 neighborhood difference a convolution with filters of dimension 625 · 1 · 1 and stride 1 is necessary. Both implementations return a patch summary map of dimension 25 · 12 · 37. The layer is applied separately to both the difference maps obtained from the Cross Neighborhood Difference layer (K and K′), obtaining the patch summary maps L and L′, respectively related to K and K′. The convolution operation is explained in Figure 4.34.
Figure 4.34: On the left the convolution of the difference layer in the standard caffe layer implementation, on the right the convolution of the difference layer in the custom implementation.
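The two equivalent convolutions can be sketched with pycaffe's NetSpec (illustrative blob names; learning parameters omitted):

import caffe
from caffe import layers as L

n = caffe.NetSpec()
# Custom-layer organization: 25x60x185 blocks summarized by 5x5 filters,
# stride 5 -> 25x12x37 patch summary map.
n.K_custom = L.Input(shape=dict(dim=[1, 25, 60, 185]))
n.L_custom = L.Convolution(n.K_custom, num_output=25, kernel_size=5, stride=5)
# Standard-layer organization: 625x12x37 blob summarized by 1x1 filters,
# stride 1 -> the same 25x12x37 patch summary map.
n.K_std = L.Input(shape=dict(dim=[1, 625, 12, 37]))
n.L_std = L.Convolution(n.K_std, num_output=25, kernel_size=1, stride=1)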
4.3.7 Across Patch Features layer
The Across Patch Features layer is another convolution layer followed by a max pooling layer. Whereas in the previous layer the convolution is used to obtain a local representation of the neighborhood difference map, the convolution in the current layer tries to learn the spatial relationships across the two neighborhood difference maps (K and K′), convolving the patch summary maps with 25 filters of size 25 · 3 · 3 with stride 1. Then, in order to reduce the width and the height by a factor of 2, a max pooling layer with kernel size 2 · 2 is applied. The across patch features maps obtained are shown in Figure 4.35.
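A NetSpec sketch of this layer (padding is not specified in the text, so none is assumed here; with a 12 · 37 input this gives a 10 · 35 convolution output, whose width and height are then roughly halved by the pooling):

import caffe
from caffe import layers as L, params as P

n = caffe.NetSpec()
n.summary = L.Input(shape=dict(dim=[1, 25, 12, 37]))   # patch summary map L
# 25 filters of 25x3x3, stride 1 (no padding assumed).
n.across = L.Convolution(n.summary, num_output=25, kernel_size=3, stride=1)
# 2x2 max pooling reduces the width and the height by a factor of 2.
n.pool = L.Pooling(n.across, pool=P.Pooling.MAX, kernel_size=2, stride=2)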