
Università di Pisa

Dipartimento di Ingegneria dell’Informazione

Computer Engineering

Design and Implementation of a Vehicle Tracking System Based on Deep Learning

Candidate: Davide Ruisi
Supervisors: Claudio Gennaro, Giuseppe Amato, Fabrizio Falchi
Academic Year 2018/2019


Abstract

Advances in deep learning research and the availability of large amounts of training data, thanks to the growth of the internet, have allowed progress in many fields of computer vision, such as object detection, object tracking, and object re-identification.

Tracking vehicles over multiple cameras placed at different positions is not a single task, but the composition of three distinct research problems: detection, single-camera tracking, and re-identification. In this thesis work, we realize a system capable of tracking and re-identifying the same vehicle from different cameras using state-of-the-art approaches for detection, tracking, and re-identification.

A new vehicle re-identification baseline, V-ReID-KTP-Baseline, that exploits vehicle keypoints, tracklets, and license plate information for re-identification, is developed. In particular, a new re-ranking method based on license plate information is designed specifically for this task. We also present a new labeled dataset, V-ReID-AB-Dataset, created and employed to test the use of license plate information for vehicle re-identification. Tests on this new dataset suggest that the availability of license plate information can considerably improve the results of vehicle re-identification.


Contents

1 Introduction
  1.1 The Vehicle Tracking Problem
  1.2 Application Scenario
    1.2.1 Scenario: CNR Park
  1.3 Objectives and Outline of this Thesis

2 Background
  2.1 Convolutional Neural Networks
  2.2 Computer Vision
    2.2.1 Object Detection
    2.2.2 Object Tracking
    2.2.3 Object Re-identification
  2.3 Evaluation Metrics
    2.3.1 Precision
    2.3.2 Recall
    2.3.3 Average Precision
    2.3.4 mAP
    2.3.5 CMC curve
  2.4 Frameworks and Tools

3 Related Work
  3.1 Datasets existing in literature
    3.1.1 VeRi
    3.1.2 CityFlow
  3.2 The 2019 AI City Challenge
  3.3 TrackletNet Tracker
  3.4 Multi-View Vehicle Re-Identification using Temporal Attention Model and Metadata Re-ranking
    3.4.1 Feature Extraction
  3.5 License Plate Detection and Recognition in Unconstrained Scenarios

4 Contributions of this Thesis
  4.1 V-ReID-AB-Dataset
  4.2 Experiments
    4.2.1 Test with modified CityFlow dataset and V-ReID-KT-Baseline
    4.2.2 Test with VeRi dataset and V-ReID-KTP-Baseline
    4.2.3 Test with V-ReID-AB-Dataset
    4.2.4 Application Scenario Example: CNR Park

5 System Implementation
  5.1 Tracklet Generation
  5.2 Vehicle Re-identification

6 Conclusions and Future Works
  6.1 Conclusions
  6.2 Future works
    6.2.1 Weighted Levenshtein Distance in LP-based Re-ranking
    6.2.2 Vehicles Tracking using UAVs


List of Figures

2.1 Sparse interaction, viewed from above. We highlight one output unit, s3, and highlight the input units in x that affect this unit. These units are known as the receptive field of s3. (Top) When s is formed by convolution with a kernel of width 3, only three inputs affect s3. (Bottom) When s is formed by matrix multiplication, connectivity is no longer sparse, so all the inputs affect s3. [1]
2.2 Parameter sharing. Black arrows indicate the connections that use a particular parameter in two different models. (Top) The black arrows indicate uses of the central element of a 3-element kernel in a convolutional model. Because of parameter sharing, this single parameter is used at all input locations. (Bottom) The single black arrow indicates the use of the central element of the weight matrix in a fully connected model. This model has no parameter sharing, so the parameter is used only once. [1]
2.3 Recognition problems related to object detection: (a) image-level object classification, (b) bounding-box-level object detection, (c) pixel-wise semantic segmentation, (d) instance-level semantic segmentation. [2]
2.4 High-level diagram of RCNN. [2]
2.5 High-level diagram of YOLO. [2]
2.6 An illustration of the output of an object tracking algorithm. Each output bounding box has a number that identifies a specific target in the video. [3]
2.7 Usual workflow of an object tracking algorithm: given the raw frames of a video (1), an object detector is run to obtain the bounding boxes of the objects (2). Then, for every detected object, different features are computed, usually visual and motion ones (3). After that, an affinity computation step calculates the probability of two objects belonging to the same target (4), and finally an association step assigns a numerical ID to each object (5). [3]
2.8 An example of vehicle re-identification: the same white BMW SUV captured by different surveillance cameras. The blue circles and arrows denote the locations and directions of cameras. [4]
2.9 Object Re-identification system diagram.
3.1 Vehicle samples selected from the VeRi dataset. [4]
3.2 The urban environment and camera distribution of the CityFlow dataset. The red arrows denote the locations and directions of cameras. Some examples of camera views are shown. [5]
3.3 TNT framework for multi-object tracking. [6]
3.4 Architecture of Multi-scale TrackletNet. [6]
3.5 Example results of the U. Washington IPL's vehicle re-identification framework. [7]
3.6 Overview of the U. Washington IPL's vehicle re-identification method. [7]
3.7 Example of vehicle keypoints detection. [7]
3.8 The structure of the Temporal Attention model. [7]
3.9 Illustration of the WPOD-NET pipeline. [8]
3.10 Fully convolutional detection of planar objects in WPOD-NET. [8]
4.1 Urban scenarios and distribution of cameras in V-ReID-AB-Dataset. Red arrows indicate the directions of the cameras.
4.2 The same vehicle viewed from different cameras in V-ReID-AB-Dataset.
4.3 Example result obtained with ResNeXt101.
4.5 A comparison of results obtained with the VeRi dataset in per-image-basis Re-ID and in per-tracklet-basis Re-ID.
4.6 In this example, query (a) has the same licence plate as gallery (b), and their distance is set to 0. Gallery (c) is put closer to query (a), too, because it is similar to gallery (b).
4.7 A comparison of results obtained with the VeRi dataset in per-tracklet-basis Re-ID with and without the LP-based re-ranking phase. The results with re-ranking are obtained with plate_th = 0.7.
4.8 A comparison of results obtained with V-ReID-AB-Dataset in per-tracklet-basis Re-ID with and without the LP-based re-ranking phase. The results with re-ranking are obtained with K = 105 and plate_th = 0.6.
4.9 A single frame of the video produced during the first phase. The vehicle of interest is identified by the ID number 5 in the video.
4.10 Example result obtained in case of positive match.
5.1 Proposed System, step 1. Tracklet Generation.


List of Tables

3.1 Summary of Track 2 leader board. [9]
4.1 Results for Re-ID on CityFlow with pre-trained models on ImageNet.
4.2 Results for Re-ID on VeRi using V-ReID-KTP-Baseline with different plate_th values.
4.3 Results for Re-ID on V-ReID-AB-Dataset using V-ReID-KTP-Baseline with different plate_th values.
4.4 Results for Re-ID on V-ReID-AB-Dataset using V-ReID-KTP-Baseline


Chapter 1

Introduction

1.1

The Vehicle Tracking Problem

The use of traffic cameras is now widespread in many cities, and the videos obtained from these cameras enable a wide variety of applications:

• traffic activity monitoring and analysis;
• urban surveillance;
• accident or anomaly detection;
• law enforcement;
• smart transportation.

Tracking vehicles over large areas that span multiple cameras at different intersections is not a single task, but the composition of three distinct but closely related research problems:

• detection, i.e., locating the vehicles of interest in each video frame;

• single-camera tracking (SCT), i.e., the procedure to find a correspondence between detected targets across multiple frames. This procedure, also called Multi-Target Single-Camera (MTSC) tracking or Multi Object Tracking (MOT), tries to assign a consistent label to each target in multiple frames;

• re-identification of targets across multiple cameras, i.e., to identify a particular target as the same one observed on a previous occasion.

The scientific community is, therefore, interested in developing solutions that satisfy the increasingly pressing demands of cities. Many solutions have been proposed to solve this problem. Unfortunately, progress has been limited for several reasons. The main challenges are the following:

• small inter-class variability and large intra-class variability, i.e., two different vehicles seen from the same viewpoint can look more similar than the same vehicle seen from two different viewpoints;

• the lack of appropriate datasets prevents learning a good model of each vehicle's intra-class variability;

• vehicles can be partially or completely occluded by other vehicles or objects;
• the necessity to deal with illumination and weather changes.

1.2

Application Scenario

This thesis work lays the basis for the creation of an application targeting the scenario described below.


1.2.1

Scenario: CNR Park

In this scenario, the entrance to a reserved area (such as the parking lot of the CNR, Consiglio Nazionale delle Ricerche, in Pisa) is continuously filmed by a camera. For every vehicle that enters the reserved area, we want to be able to monitor its activity within the area, such as where it parks, through other cameras that photograph the parking area periodically.

1.3

Objectives and Outline of this Thesis

In this thesis work, we show how it is possible to realize a system capable of tracking and re-identifying the same vehicle across different cameras using state-of-the-art approaches for detection, tracking, and re-identification. The goal is to design, deploy, and test a system that can extract information about a vehicle from a video stream and use this information to find the vehicle in other video streams. Vehicle license plate information is exploited for re-identification by integrating a new re-ranking method designed specifically for this task. We also present a new labeled dataset created and employed to test the use of license plate information for vehicle re-identification.

The thesis has the following structure:

• in Chapter 2, we give an overview of some useful concepts to understand our work better;

• in Chapter 3, we review previous literature on datasets for vehicle re-identification and state-of-the-art approaches for vehicle tracking, vehicle re-identification, and license plate recognition.

• in Chapter 4, we present the contributions of this thesis: the V-ReID-AB-Dataset, the V-ReID-KTP-Baseline, and the experiments performed with them;

• in Chapter 5, we describe the design of a prototype system for vehicle tracking and re-identification;

• in Chapter 6, we report our conclusions on this work and discuss possible future works.


Chapter 2

Background

In this chapter, we describe the computer vision tools used to achieve the objectives of this thesis. Computer vision problems are almost always addressed through the use of deep neural networks, and in particular convolutional neural networks. In the first part, we introduce the general principles of convolutional neural networks. Then we move on to describe some tasks in the field of Computer Vision: object detection, object tracking, and object re-identification. Finally, we briefly describe the most popular evaluation metrics for object re-identification tasks and we list all the frameworks and tools used in this thesis work.

2.1

Convolutional Neural Networks

Convolutional Neural Networks, known as CNNs, are a class of Neural Networks that have led to excellent results in the processing of two-dimensional data, such as images or videos; for this reason, they have found many applications in various fields of Computer Vision, such as face detection, image classification, and handwriting recognition.

The CNN architecture is inspired by the organization of the animal visual cortex. Each neuron responds to stimuli only in a specific region of the visual field, called the receptive field.


In CNNs, convolution has replaced matrix multiplication in standard NNs: in this way, there is a decrease in the number of weights, and therefore a reduction in the network complexity.

As described in [1], convolution exploits three key ideas:

• sparse interaction;
• parameter sharing;
• equivariant representation.

Sparse interaction refers to the fact that the kernels are smaller than the inputs and are used over the whole image: this reduces the computational complexity compared to standard NNs, in which each output unit is connected to each input unit to perform the matrix multiplication. Figure 2.1 gives a graphical illustration of sparse interaction.

Parameter sharing refers instead to the idea of sharing the same set of weights at each location: each member of the kernel is used at all input positions. This means that, compared to traditional NNs, fewer weights need to be trained, which also reduces the storage requirements. Figure 2.2 gives a graphical illustration of parameter sharing.

A consequence of parameter sharing is the property of equivariance to translation. Considering that we want to process an image, this implies that if we move the object in the input, its representation will move in the same way in the output.
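To make sparse interaction and parameter sharing concrete, the following minimal PyTorch sketch (not part of the original thesis) compares the number of trainable weights of a 1D convolution with kernel width 3 against a fully connected layer over the same input; the input length of 100 is an arbitrary illustrative choice.

```python
import torch
import torch.nn as nn

n = 100  # arbitrary 1D input length chosen for illustration

# Convolution: one kernel of width 3 is shared across all positions, so only
# 3 weights (+1 bias) are trained, and each output depends on just 3 inputs
# (sparse interaction + parameter sharing).
conv = nn.Conv1d(in_channels=1, out_channels=1, kernel_size=3, padding=1)

# Fully connected layer: every output unit is connected to every input unit,
# so n * n weights (+ n biases) are trained and connectivity is dense.
fc = nn.Linear(n, n)

def num_params(module):
    return sum(p.numel() for p in module.parameters())

print("conv parameters:", num_params(conv))  # 4
print("fc parameters:  ", num_params(fc))    # 10100

x = torch.randn(1, 1, n)            # (batch, channels, length)
print(conv(x).shape)                # torch.Size([1, 1, 100])
print(fc(x.view(1, n)).shape)       # torch.Size([1, 100])
```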


Figure 2.1: Sparse interaction, viewed from above. We highlight one output unit, s3, and the input units in x that affect this unit. These units are known as the receptive field of s3. (Top) When s is formed by convolution with a kernel of width 3, only three inputs affect s3. (Bottom) When s is formed by matrix multiplication, connectivity is no longer sparse, so all the inputs affect s3. [1]

Figure 2.2: Parameter sharing. Black arrows indicate the connections that use a particular parameter in two different models. (Top) The black arrows indicate uses of the central element of a 3-element kernel in a convolutional model. Because of parameter sharing, this single parameter is used at all input locations. (Bottom) The single black arrow indicates the use of the central element of the weight matrix in a fully connected model. This model has no parameter sharing, so the parameter is used only once. [1]


2.2

Computer Vision

Computer Vision is concerned with extracting high-level information from images or videos. It is the science that tries to endow machines with the ability to see the world around them.

Some tasks in this field that we deal with in this thesis are:

• object detection;
• object tracking;
• object re-identification.

Since 2012, when AlexNet, a Deep Convolutional Neural Network (DCNN) which achieved record results in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [10], was presented in [11], research in many fields of Computer Vision has shifted to the use of methods focused on deep learning.

2.2.1

Object Detection

Object detection is a Computer Vision task that deals with identifying objects in an image, i.e., finding their position and size, usually expressed through bounding boxes, and assigning them a category or class. In Figure 2.3 it is possible to see an example of object detection in comparison with other related recognition problems.

An exhaustive survey of the progress on object detection can be found in [2], from early research to the latest solutions based on deep learning.

The first object detection solutions were based on local appearance feature descriptors [12], such as Scale-Invariant Feature Transform (SIFT) [13], Haar-like features [14], and Histogram of Oriented Gradients (HOG) [15]. In recent years, the methods that have achieved the best results are based on deep learning, particularly Deep Convolutional Neural Networks.

Figure 2.3: Recognition problems related to object detection: (a) image-level object classification, (b) bounding-box-level object detection, (c) pixel-wise semantic segmentation, (d) instance-level semantic segmentation. [2]

Let us now analyze which are the most relevant object detection networks developed in recent years. We can divide object detectors into two classes:

• two-stage (region-based) object detectors;
• one-stage (unified) object detectors.

Region-based Object Detectors

In a region-based object detector, we have a pre-processing step for the generation of object proposals. Object proposals, also called region proposals or detection proposals, are a set of regions in an image that could contain an object of any category. After this first step of object proposal generation, a CNN extracts the features from each of these regions. Classifiers are then used to determine the category to which the object proposal belongs. Examples of region-based object detectors are RCNN [16], SPPNet [17], Fast RCNN [18], Faster RCNN [19], RFCN (Region-based Fully Convolutional Network) [20], Mask RCNN [21], Chained Cascade Network [22], Cascade RCNN [23], and Light Head RCNN [24]. In Figure 2.4 it is possible to see a high-level diagram of RCNN [16], which integrates AlexNet [11] with selective search [25] for region proposals.

Figure 2.4: High-level diagram of RCNN. [2]

Unified Object Detectors

Region-based object detectors require high computing power, and it is, therefore, challenging to integrate them into mobile/wearable devices with limited storage and computation capability. For this reason, the first detectors belonging to the class of unified object detectors were developed. These detectors have an architecture with a single feed-forward CNN that directly predicts the class probabilities and the bounding boxes from the entire image. Using a single network, we can easily make some optimizations in order to deploy the network on constrained devices. Examples of unified object detectors are DetectorNet [26], OverFeat [27], YOLO [28], YOLOv2 and YOLO9000 [29], SSD [30], and CornerNet [31]. In Figure 2.5 you can see a high-level diagram of YOLO (You Only Look Once) [28]: a single CNN is applied to the full image; the image is divided into an S × S grid, and each cell predicts B bounding boxes with confidence scores and C class probabilities.
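As a simple numerical illustration of the quantities S, B, and C mentioned above, the sketch below computes the size of YOLO's output grid under the common assumption that each of the B boxes is described by 4 coordinates plus 1 confidence score; the values S = 7, B = 2, C = 20 are only an example configuration, not taken from this thesis.

```python
# Illustrative only: size of the YOLO output grid, assuming each of the B
# predicted boxes carries 4 coordinates and 1 confidence score, and each
# grid cell also predicts C class probabilities.
S, B, C = 7, 2, 20            # example values, not taken from the thesis
values_per_cell = B * 5 + C   # 2 * 5 + 20 = 30
output_size = S * S * values_per_cell
print(output_size)            # 1470
```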

Figure 2.5: High-level diagram of YOLO. [2]

2.2.2

Object Tracking

The object tracking task, also known as Multi Object Tracking (MOT), consists in following the trajectories of different objects in a video. In this task, too, deep learning has revolutionized the proposed solutions. In [3], you can find a comprehensive survey of works that employ deep learning models to solve the task of object tracking in single-camera videos.

Object Tracking is a task in the field of Computer Vision that aims to analyze a video in order to detect and track objects from one or more classes, such as people, vehicles, animals, or generic objects, without knowing in advance the appearance and number of targets. Unlike object detection, whose output is a set of bounding boxes in the image, object tracking also associates an ID with the bounding boxes in order to distinguish between detected objects of the same class. The association between an ID and a set of bounding boxes extracted from different frames forms the so-called "tracklet". An example of how object tracking works is shown in Figure 2.6.

Figure 2.6: An illustration of the output of an object tracking algorithm. Each output bounding box has a number that identifies a specific target in the video. [3]


The detection step, necessary to identify all the targets that enter and exit the camera scene, plays an essential part in the object tracking task. The quality of the detection step has a significant influence on the final quality of the performed tracking.

As described in [3], the standard approach used for tracking is called tracking-by-detection. All the video frames are analyzed in the detection step to extract the bounding boxes of the objects present. These bounding boxes are then used in the tracking process, which tries to associate the same ID to the bounding boxes that contain the same target.

The typical steps, followed by almost all tracking algorithms, are as follows:

• detection stage: an object detector (such as those described in Subsection 2.2.1) is used to find all instances of objects of the class(es) of interest in each video frame;
• feature extraction/motion prediction stage: feature extraction algorithms (usually based on CNNs) are used to extract appearance, motion, or interaction features from each detection obtained in the previous stage; possibly, a motion predictor is also used to predict the position of each target in the next frame of the video;
• affinity stage: the extracted features and the motion prediction are used to calculate a similarity/distance score between each pair of detections or tracklets;
• association stage: the similarity/distance score is used to associate detections and tracklets belonging to the same target, i.e., we assign them the same ID.
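The four stages above can be condensed into a toy tracking-by-detection loop like the one sketched below; detect() and extract_features() are hypothetical stand-ins for an object detector and a CNN feature extractor, and the greedy IoU/appearance matching is a deliberately simplified substitute for the association methods surveyed in [3].

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def track(frames, detect, extract_features, affinity_threshold=0.5):
    """Toy tracking-by-detection loop: detect, compute affinity, associate IDs."""
    tracklets = {}   # track ID -> (last box, last appearance feature)
    next_id = 0
    output = []      # (frame index, track ID, box)
    for t, frame in enumerate(frames):
        for box in detect(frame):                     # detection stage
            feat = extract_features(frame, box)       # feature extraction stage
            best_id, best_aff = None, affinity_threshold
            for tid, (prev_box, prev_feat) in tracklets.items():
                appearance = float(np.dot(feat, prev_feat) /
                                   (np.linalg.norm(feat) * np.linalg.norm(prev_feat) + 1e-9))
                affinity = 0.5 * iou(box, prev_box) + 0.5 * appearance   # affinity stage
                if affinity > best_aff:
                    best_id, best_aff = tid, affinity
            if best_id is None:                       # association stage: new target
                best_id, next_id = next_id, next_id + 1
            tracklets[best_id] = (box, feat)
            output.append((t, best_id, box))
    return output
```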


Figure 2.7: Usual workflow of an object tracking algorithm: given the raw frames of a video (1), an object detector is run to obtain the bounding boxes of the objects (2). Then, for every detected object, different features are computed, usually visual and motion ones (3). After that, an affinity computation step calculates the probability of two objects belonging to the same target (4), and finally an association step assigns a numerical ID to each object (5). [3]

2.2.3

Object Re-identification

Object re-identification is a task in the field of Computer Vision. Re-identification is the process of identifying the same target object in images or videos taken from different cameras with non-overlapping fields of view (FOVs). When a target object exits the FOV of one camera and enters the FOV of another camera, object re-identification is used to establish a match between the two disconnected tracklets and thus obtain the tracking of the object across multiple cameras.

The two main applications in the re-identification task are person re-identification and vehicle re-identification. In Figure 2.8 we show an example of the vehicle re-identification task.


Figure 2.8: An example of vehicle re-identification: the same white BMW SUV captured by different surveillance cameras. The blue circles and arrows denote the locations and directions of cameras. [4]

In Figure 2.9, we can see a schematic representation of a typical re-identification system, as described in [32]. This is composed of two steps:

• extraction of object descriptors from multiple cameras;
• establishment of correspondences.

The input to the system can be an image or a video. If the input is an image, the first step needs a detection module to locate the object in the image. In the case of a video, an additional module is necessary to perform tracking and establish a correspondence between detections from different frames.


Following the detection and tracking phases, the first step for re-identification is to learn an object’s visual descriptor. The extraction of a reliable descriptor depends heavily on the quality of the previous detection and tracking phases. Detection or tracking errors lead to errors in feature extraction and descriptor generation, thus reducing the overall quality of re-identification. In addition, it is necessary to use complex descriptors. Simple descriptors such as color, texture, or shape are hardly unique, and prone to variations when illumination, view angle, or scale changes in a multi-camera scenario.

The last matching step of Re-ID also presents its own difficulties. As we previously mentioned, the main challenge in the re-identification task is given by large intra-class variation and small inter-class variation. Moreover, the larger the number of potential candidates to match with, the larger the probability of a matching error.

A re-identification system consists of a set of known objects, the gallery set, and a set of unknown objects, the query set, on which we want to perform the re-identification. We can represent the gallery set with:

G = (g_1, g_2, \ldots, g_N)    (2.1)

and its respective IDs with:

id(G) = (id(g_1), id(g_2), \ldots, id(g_N))    (2.2)

where the id(.) function specifies the ID assigned to its argument. Similarly, we represent the query set and its respective IDs with:

Q = (q_1, q_2, \ldots, q_M)    (2.3)

id(Q) = (id(q_1), id(q_2), \ldots, id(q_M))    (2.4)

When a query is submitted to the system, it is compared with each element in the gallery set, and a similarity measurement is computed. The gallery is then ranked using this similarity measure in order to decide the query ID. In [32], depending on the scenario, the object re-identification task is categorized into:


• open set Re-ID;
• closed set Re-ID.

The query ID decided after the gallery ranking varies depending on the scenario.

In closed set Re-ID, the query set is a subset of the gallery, i.e., the query ID exists in the gallery, and the goal is to determine the actual query ID. So, given a query q, its ID is id(q) = id(g_{i^*}), such that:

i^* = \arg\max_{i \in \{1, \ldots, N\}} p(g_i \mid q)    (2.5)

where p(g_i \mid q) is the likelihood that id(q) = id(g_i), and it is usually represented by a similarity measure. In a nutshell, the ID assigned to the query q is equal to the ID of the top-ranked gallery element.

In open set Re-ID, on the contrary, the query may not appear in the gallery. This means that, before deciding the query ID, you must first decide whether the query ID is actually contained in the gallery. So, in addition to calculating i^* as done in Equation 2.5, you must also verify the following condition:

p(g_{i^*} \mid q) > \tau    (2.6)

where τ is an acceptance threshold. If the condition is met, then the query ID belongs to the gallery; otherwise, the query ID is unknown.
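A minimal sketch of Equations 2.5 and 2.6 in code, assuming gallery and query features are L2-normalized vectors so that cosine similarity plays the role of p(g_i | q); all names are illustrative.

```python
import numpy as np

def reid_query(query_feat, gallery_feats, gallery_ids, tau=None):
    """Return the decided ID for a query (Eq. 2.5); in open set Re-ID,
    return None when no gallery element is similar enough (Eq. 2.6)."""
    sims = gallery_feats @ query_feat      # cosine similarities as p(g_i | q)
    i_star = int(np.argmax(sims))          # Eq. 2.5
    if tau is not None and sims[i_star] <= tau:
        return None                        # open set: query ID not in the gallery
    return gallery_ids[i_star]             # ID of the top-ranked gallery element
```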

2.3

Evaluation Metrics

The two most popular and widely used evaluation metrics for object re-identification are:

• rank-K mean Average Precision (mAP);
• Cumulative Matching Characteristic (CMC).


In order to understand mAP, we first need to review precision, recall, and Average Precision.

2.3.1

Precision

Precision is the ratio of True Positives TP to the total number of predicted positives:

P = \frac{TP}{TP + FP}    (2.7)

2.3.2

Recall

Recall is the ratio of True Positives TP to the total number of ground truth positives:

R = \frac{TP}{TP + FN}    (2.8)

2.3.3

Average Precision

Considering a single image query, the Average Precision is computed as:

AP = \frac{\sum_{k=1}^{n} P_k \times c_k}{N_c}    (2.9)

where k is the rank in the order of retrieved objects, n is the number of retrieved objects, and N_c is the number of relevant objects. P_k is the precision at cut-off k in the list, and c_k is equal to 1 if the item at rank k is a relevant object, zero otherwise. Note that the average is over all relevant objects, and the relevant objects not retrieved get a precision score of zero.

2.3.4

mAP


The mAP is computed as:

mAP = \frac{\sum_{i=1}^{M} AP_i}{M}    (2.10)

where M is the total number of query images.

Rank-K mAP is the mAP obtained when considering only the top-K results for each query, i.e., replacing n with K in Equation 2.9:

AP = \frac{\sum_{k=1}^{K} P_k \times c_k}{N_c}    (2.11)

We use K = 100 results for each query.
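The following small sketch maps Equations 2.9-2.11 to code, assuming that for each query we have a ranked list of gallery IDs, the query's ground-truth ID, and the number of relevant gallery items; names are illustrative.

```python
def average_precision(ranked_ids, query_id, num_relevant, top_k=None):
    """AP of Eq. 2.9 (Eq. 2.11 when top_k is given): relevant objects that are
    not retrieved within the considered list contribute zero precision."""
    if top_k is not None:
        ranked_ids = ranked_ids[:top_k]
    hits, precision_sum = 0, 0.0
    for k, gid in enumerate(ranked_ids, start=1):
        if gid == query_id:                # c_k = 1 for a relevant object at rank k
            hits += 1
            precision_sum += hits / k      # P_k, precision at cut-off k
    return precision_sum / num_relevant    # divide by N_c

def mean_average_precision(queries, top_k=100):
    """mAP of Eq. 2.10; queries is a list of (ranked_ids, query_id, num_relevant)."""
    aps = [average_precision(r, qid, n_rel, top_k) for r, qid, n_rel in queries]
    return sum(aps) / len(aps)
```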

2.3.5

CMC curve

The CMC curve indicates the probability that a query identity appears in candidate lists of different sizes.

For each query, all the ranked gallery samples are considered, and the CMC top-k accuracy for a query q is:

rank_{k,q} = \begin{cases} 1 & \text{if the top-}k\text{ ranked gallery samples contain the identity of query } q \\ 0 & \text{otherwise} \end{cases}    (2.12)

which is a shifted step function. The final CMC curve is computed by averaging the shifted step functions over all the queries:

rank_k = \frac{\sum_{i=1}^{M} rank_{k,i}}{M}    (2.13)

where M is the total number of query images.
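A corresponding sketch of the CMC top-k accuracy of Equations 2.12-2.13, reusing the same per-query ranked lists as above; names are illustrative.

```python
def cmc_curve(queries, max_rank=20):
    """Average of the per-query shifted step functions (Eqs. 2.12 and 2.13);
    queries is a list of (ranked_ids, query_id, num_relevant) as above."""
    curve = [0.0] * max_rank
    for ranked_ids, query_id, _ in queries:
        # first rank at which the query identity appears among the gallery results
        first_hit = next((k for k, gid in enumerate(ranked_ids[:max_rank], start=1)
                          if gid == query_id), None)
        if first_hit is not None:
            for k in range(first_hit - 1, max_rank):
                curve[k] += 1.0            # rank_{k,q} = 1 for every k >= first hit
    return [c / len(queries) for c in curve]
```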

2.4

Frameworks and Tools

We list here the main frameworks and tools used in this thesis work.

YOLOv3 is officially implemented using the Darknet framework. Darknet1 is an open source neural network framework written in C and CUDA. It is fast, easy to install, and supports CPU and GPU computation. You can find the source code on GitHub2.

We have also used the Ultralytics3 unofficial implementation of YOLOv3 in PyTorch. The source code is on GitHub4 and is freely available for redistribution under the GPL-3.0 license.

PyTorch5 is an open source machine learning library based on the Torch library, used for applications such as computer vision and natural language processing. It is primarily developed by Facebook's artificial intelligence research group. It is free and open-source software released under the Modified BSD license.

Torch6 is a scientific computing framework with wide support for machine learning algorithms that puts GPUs first. It is easy to use and efficient, thanks to an easy and fast scripting language, LuaJIT, and an underlying C/CUDA implementation.

TensorFlow7 is an end-to-end open source platform for machine learning. It has a comprehensive, flexible ecosystem of tools, libraries, and community resources that lets researchers push the state of the art in ML and developers easily build and deploy ML-powered applications.

Keras8 is a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano. It was developed with a focus on enabling fast experimentation.

1 https://pjreddie.com/darknet/
2 https://github.com/pjreddie/darknet
3 https://www.ultralytics.com/
4 https://github.com/ultralytics/yolov3
5 http://pytorch.org
6 http://torch.ch
7 https://www.tensorflow.org
8 https://keras.io


Many other tools and libraries have been used: Google Colab9, Microsoft Excel10, OpenCV11, NumPy12, Matplotlib13, scikit-learn14, SciPy15, PIL16, Docker17, CUDA18, cuDNN19.

9 https://colab.research.google.com
10 http://www.office.microsoft.com/excel
11 https://opencv.org
12 http://www.numpy.org
13 https://matplotlib.org
14 https://scikit-learn.org/
15 https://scipy.org/scipylib/
16 http://www.pythonware.com/products/pil/
17 https://www.docker.com/
18 https://developer.nvidia.com/cuda-zone
19 https://developer.nvidia.com/cudnn


Chapter 3

Related Work

This chapter describes two datasets for vehicle re-identification, VeRi and CityFlow, which we use in this thesis for our experiments. Then, we introduce some state-of-the-art approaches for vehicle tracking, vehicle re-identification, and license plate recognition that we use as starting points for the work of this thesis.

3.1

Datasets existing in literature

In this section, we describe the two primary datasets used in literature for vehicle re-identification that we use in this thesis in our experiments: VeRi [4, 33, 34] and CityFlow [5].

3.1.1

VeRi

VeRi [4] is a large-scale benchmark dataset for vehicle re-identification in real-world urban surveillance scenarios. It was extended into VeRi-776 in [33]. The dataset contains 50,000 images of 776 vehicle identities captured by 20 surveillance cameras with different orientations and tilt angles. Each vehicle is captured from 2-18 viewpoints with different lighting conditions and resolutions. Each vehicle is annotated with bounding boxes, type, color, and cross-camera vehicle correlation. Information on license plates and spatio-temporal relations for the tracks of all vehicles is also annotated. The dataset is the most widely used in the literature for the vehicle re-identification task, given its large number of vehicles, images per vehicle, and annotated information. In Figure 3.1, we can see some examples of vehicle images from the dataset.

(a) Diversity of vehicle colors and types. (b) Variation of the viewpoints, illuminations, res-olutions, and occlusions for the vehicles in different cameras.

Figure 3.1: Vehicle samples selected from the VeRi dataset. [4]

3.1.2

CityFlow

The CityFlow dataset [5] was created specifically for the 2019 AI City Challenge [9]. It consists of two parts: one for the Multi-Camera Vehicle Tracking task and one for the Multi-Camera Vehicle Re-identification task. It contains about 3 hours of video from 40 cameras scattered across 10 intersections in a mid-sized city in the United States. The maximum distance between two cameras is 2.5 km. It is important to note that all faces and license plates have been obscured for privacy reasons. Figure 3.2 shows the urban environment and camera distribution of the CityFlow dataset.


Figure 3.2: The urban environment and camera distribution of the CityFlow dataset. The red arrows denote the locations and directions of cameras. Some examples of camera views are shown. [5]

We have a total of 5 different scenarios, both in highway and residential areas, each served by a different number of cameras. The minimum resolution of each video is 960p, with a frame rate of 10 FPS.

The part of the dataset for the re-identification task contains a total of 56,277 images, of which 36,935 belong to 333 vehicle identities of the training set, 18,290 belong to 333 vehicle identities of the gallery set, the remaining 1,052 form the query set. Each vehicle is represented on average with 84.50 images captured from 4.55 different cameras. The ground truth for gallery set and query set is not publicly available.


3.2

The 2019 AI City Challenge

The AI City Challenge1 was created to accelerate intelligent video analysis so that cities can become smarter and more secure. The third edition of the AI City Challenge [9], held in 2019, saw the participation of 334 research teams from 44 different countries. The Challenge consisted of three tracks:

• city-scale multi-camera vehicle tracking;
• city-scale multi-camera vehicle re-identification;
• traffic anomaly detection.

The dataset used for the first and second tracks is CityFlow [5]; for the third, the Iowa DOT dataset [35] was used.

Let us briefly summarize the main works on the re-identification task. Table 3.1 shows the leader board for this track.

The best results were obtained by Team 59, Baidu ZeroOne [36], using a method that combines visual features extracted using CNNs and semantic features derived from traveling direction and vehicle type classification. Team 21, U. Washington IPL [7], used a similar method as well. The use of semantic features, although very useful, has not been exploited by other teams because it requires additional annotations to the dataset. Some teams have used models to extract the vehicle pose, from which they can obtain orientation information [7, 40]. In some works, the presence of trajectory information in CityFlow has been exploited through temporal attention/pooling [36, 7, 37]. Another technique widely used to improve the results was post-processing by re-ranking [7, 40, 46, 50]. Finally, most teams relied on feature embedding schemes and distance metric learning [41, 43, 44, 48, 49, 50, 51].


Rank  Team ID  Team name (and paper)         rank-K mAP
1     59       Baidu ZeroOne [36]            0.8554
2     21       U. Washington IPL [7]         0.7917
3     97       Australian National U. [37]   0.7589
4     4        U. Tech. Sydney [38]          0.7560
5     12       BUPT Traffic Brain [39]       0.7302
8     5        U. Maryland RC [40]           0.6078
13    27       INRIA STARS [41]              0.5344
18    24       National Taiwan U. [42]       0.4998
19    40       Huawei AI Brandits [43]       0.4631
23    52       CUNY-NPU [44]                 0.4096
25    113      VNU HCMUS [45]                0.4008
36    26       SYSU ISENET [46]              0.3503
45    64       GRAPH@FIT [47]                0.3157
50    79       NCCU-UAlbany [48]             0.2965
51    63       Queen Mary U. London [49]     0.2928
54    46       Siemens Bangalore [50]        0.2766
60    43       U. Autonoma de Madrid [51]    0.2505

Table 3.1: Summary of Track 2 leader board. [9]

3.3

TrackletNet Tracker

In [6], Wang et al. present an innovative tracking method called TrackletNet Tracker (TNT) that, starting from a tracking-by-detection framework, combines multiple cues together into a unified framework based on an undirected graph model:

• appearance feature of each detected object;

• temporal relation for location among frames in a trajectory;
• interaction among different target objects.


Each vertex in the graph model represents a tracklet, and each edge between two vertices measures the connectivity between the two tracklets. Given the unreliable detections and the possibility of occlusions, the trajectory of an object may be divided into several distinct tracklets. For this reason, using the graph model, tracking is treated as a clustering problem that aims to group the tracklets into one big cluster.

The flowchart of the proposed tracking method TNT is shown in Figure 3.3.

Figure 3.3: TNT framework for multi-object tracking. [6]

We can identify three phases:

• tracklet generation;
• connectivity measure;
• clustering.


To generate the tracklets, i.e., the vertices of the graph, they associate the detections of consecutive frames based on intersection-over-union (IOU) and on the similarity of appearance features. In the case of moving cameras, the IOU criterion is unreliable, so epipolar geometry is used to compensate for camera movement.

To calculate the connectivity between two tracklets, the multi-scale TrackletNet is used, as depicted in Figure 3.4, which combines both trajectory and appearance information. A connectivity measure p_e represents the similarity of the two tracklets connected by the edge e ∈ E.

Figure 3.4: Architecture of Multi-scale TrackletNet. [6]

First, they extract embedded features from the two input tracklets, which include 4D location features and 512D appearance features along a time window of 64 frames. The input layer has three channels: one for the tracklet embedded features and the other two for binary masks. Four types of 1D convolution kernels are applied for feature extraction in three convolution layers. For each convolution layer, max pooling is adopted for down-sampling in the time domain. Average pooling is applied on the dimensions of the appearance feature after Conv3. Then, two fully connected layers are applied to get the final output.


The edge cost is finally defined as:

c = \log \frac{1 - p_e}{p_e}    (3.1)

After building the graph, the object trajectories are obtained by clustering the graph into different sub-graphs, so that the tracklets in the same sub-graph represent the same object. They exploit a greedy search-based clustering method proposed in [52], in which five different clustering operations, i.e., assign, merge, split, switch, and break, are used.

TNT has achieved promising results and outperforms state-of-the-art methods in multi-object tracking on both the MOT16 and MOT17 [53] benchmarks.


3.4

Multi-View Vehicle Re-Identification using Temporal Attention Model and Metadata Re-ranking

In this section, we briefly describe the vehicle re-identification framework proposed in [7], ranked second in the 2019 AI City Challenge Track-2 [9] with an achieved mAP of 79.17%. Some example results of this method are shown in Figure 3.5.

In Figure 3.6, you can see an overview of the proposed method. The frame-level features, which include appearance features and vehicle structure features, are initially extracted for each clip. Next, a temporal attention model (TA) is used to obtain the aggregated clip-level features. These are combined with the structure features and are used to train the re-identification model with a weighted combination of cross entropy loss [54] and batch sampling triplet loss [55]. Finally, metadata classification features are used to perform a re-ranking phase and improve the final results.

Below we describe in more detail how frame-level and clip-level features are extracted.

3.4.1

Feature Extraction

Frame-level features are extracted through a ResNet50 [56] network pre-trained on ImageNet. In order to represent the appearance of the vehicle, the 2048-dimensional fully-connected layer before the classification layer is used. In addition, they use the keypoint localization method from [57] to obtain 36 vehicle keypoints, which are used to extract information about the viewpoint of the vehicle. An example of vehicle keypoints is shown in Figure 3.7.

After extracting frame-level features, these are combined into clip-level features using a temporal attention model (TA) [58]. The model structure is shown in Figure 3.8.

The orientation features f_o^i are expanded and added to the CNN features f_CNN^i to form the frame-level features f_C^i, where i indicates the frame index within clip C. In parallel, f_o^i is concatenated with f_S^i, the resized f_CNN_mid^i, and given as input to the temporal convolutional layers to obtain the attention score for each frame. Finally, the clip-level feature f_C is obtained as the weighted average of the frame-level features.
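As a rough sketch of this last aggregation step only (not of the actual model of [7]), the clip-level feature can be computed as a weighted average of the frame-level features, with weights derived from the attention scores; the softmax normalization below is an assumption made for the sketch.

```python
import torch

def aggregate_clip_feature(frame_features, attention_scores):
    """Weighted average of frame-level features f_C^i into the clip-level
    feature f_C, with weights normalized over the frames of the clip."""
    # frame_features: (num_frames, feat_dim); attention_scores: (num_frames,)
    weights = torch.softmax(attention_scores, dim=0)     # assumed normalization
    return (weights.unsqueeze(1) * frame_features).sum(dim=0)

# Example with random tensors standing in for a clip of 8 frames.
frame_feats = torch.randn(8, 2048)
scores = torch.randn(8)
clip_feature = aggregate_clip_feature(frame_feats, scores)   # shape: (2048,)
```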


Figure 3.7: Example of vehicle keypoints detection. [7]


3.5

License Plate Detection and Recognition in Unconstrained Scenarios

We now briefly discuss a complete Automatic License Plate Recognition (ALPR) system, presented in [8].

The main feature that distinguishes this system from classic ALPR systems is the use of a network that can detect the License Plate (LP) from different viewpoints and estimate its distortion, allowing a rectification process before the Optical Character Recognition (OCR) phase.

As shown in Figure 3.9, the system consists of three main steps:

• vehicle detection;
• LP detection;
• OCR.

Figure 3.9: Illustration of the WPOD-NET pipeline. [8]

Given an input image, the first module must detect the vehicles. A simple YOLOv2 network [29] was used for vehicle detection.


For LP detection, they created a new CNN architecture, called Warped Planar Object Detection Network (WPOD-NET). This network, exploiting the fact that LPs are intrinsically rectangular and planar objects, can detect LPs under different distortion conditions, returning the coefficients of an affine transformation capable of re-establishing the rectangular, frontal-view shape of the LP. WPOD-NET combines features from YOLO [28, 29, 59], SSD [30], and Spatial Transformer Networks (STN) [60]. In Figure 3.10, the detection process is illustrated. The network takes as input the image of the vehicle identified in the previous step and returns an 8-channel feature map that encodes object/non-object probabilities and affine transformation parameters.

Figure 3.10: Fully convolutional detection of planar objects in WPOD-NET. [8]

Finally, we have the OCR phase, in which a modified YOLO network, presented in [61], is used but trained with a larger dataset that includes LPs from different regions of the world.


Chapter 4

Contributions of this Thesis

This chapter describes the main contributions of this thesis. The main purpose is to create a robust vehicle tracking and re-identification system to be used in smart-city applications, using state-of-the-art artificial intelligence and computer vision technologies. To this end, we introduce a new dataset, V-ReID-AB-Dataset, specifically created to verify how much the recognition of license plates affects vehicle re-identification. Then we describe the design and development of a baseline for vehicle re-identification, V-ReID-KTP-Baseline, testing it on the two datasets described in Section 3.1 (CityFlow and VeRi) and on our V-ReID-AB-Dataset. Finally, we show a practical application of vehicle re-identification in the scenario described in Section 1.2.

4.1

V-ReID-AB-Dataset

In this section, we describe in detail a new dataset for vehicle re-identification, V-ReID-AB-Dataset, explicitly created during the work of this thesis.

For the creation of the dataset, we shot 26 videos at 4K resolution and a frame rate of 30 FPS. The videos were shot in two city scenarios, using multiple cameras at the same time, positioned so that the same vehicle could be captured from different angles. In both the first and the second scenario, 3 cameras were used, oriented as shown in Figure 4.1.

(a) Scenario A (b) Scenario B

Figure 4.1: Urban scenarios and distribution of cameras in V-ReID-AB-Dataset. Red arrows indicate the directions of the cameras.

Each video was manually inspected and cut to extract only the most significant parts. In this way, we extracted a total of 78 clips. Each clip shows one or more vehicles of interest for the vehicle re-identification problem (i.e., vehicles whose license plates are clearly legible and which are seen from at least two different cameras). We identified a total of 28 vehicles of interest, for which we also annotated the model, color, and license plate number.

From each of these clips, we extracted the frames. We then gave each frame as input to a YOLOv3 detection network [59] in order to obtain the bounding boxes of the vehicles within each frame. For the detection and the extraction of the bounding boxes, we used a freely available PyTorch implementation of YOLOv31, already pre-trained for detecting vehicles, from [62]. We considered only detections with at least a 0.95 confidence level.
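The per-frame filtering step can be summarized as in the short sketch below, where run_yolov3() is a hypothetical wrapper returning (box, confidence, class) tuples and the class names are illustrative; only detections with confidence of at least 0.95 are kept, as described above.

```python
CONF_THRESHOLD = 0.95                        # minimum detection confidence used for the dataset
VEHICLE_CLASSES = {"car", "truck", "bus"}    # illustrative class names

def keep_vehicle_detections(frame, run_yolov3):
    """Return the bounding boxes of confidently detected vehicles in a frame."""
    return [box for box, conf, cls in run_yolov3(frame)
            if cls in VEHICLE_CLASSES and conf >= CONF_THRESHOLD]
```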

The next step is to use a Multi-Target-Single-Camera (MTSC) tracking network to group the detections belonging to the same vehicle into different frames and thus form a tracklet.


Figure 4.2: The same vehicle viewed from different cameras in V-ReID-AB-Dataset.

To do this, we have used the tracking network presented in [6], suitably modified to run on the hardware at our disposal.

The tracklets thus obtained are not perfect: the network, in fact, produces noisy trajectories, and often the trajectory of the same vehicle is split into several tracklets. A manual correction phase was then performed: we selected and merged only the tracklets of vehicles of interest, and we assigned to each of these tracklets the ID of the corresponding vehicle. From this last phase, we finally obtained a total of 105 tracklets.

Finally, using the results obtained as described above, the areas containing the vehicles were cropped from each frame of the video using the coordinates of the identified bounding boxes and labeled accordingly, thus obtaining the images of the vehicles and the ground truth necessary to perform tests on the dataset.

4.2

Experiments

In this section, we describe all the experiments done on the CityFlow [5], VeRi [4, 33, 34], and V-ReID-AB-Dataset datasets. In parallel, we describe the development of a baseline for vehicle re-identification, V-ReID-KTP-Baseline, that exploits keypoints, tracklets, and license plates, and that we use to perform tests on the mentioned datasets.

4.2.1

Test with modified CityFlow dataset and V-ReID-KT-Baseline

Initially, we perform tests on the CityFlow dataset, described in Section 3.1.

The ground truth of the test-set is not publicly available; since we do not need to perform a training phase, we therefore use the train-set as a test-set to obtain results. For each vehicle in this test-set, an image is extracted and added to our query-set; the remaining images are used to form the gallery-set. All vehicles in the dataset have the license plate obscured, so this dataset cannot be used to see how much license plate recognition affects re-identification results.

We initially test the dataset using baseline networks pre-trained on ImageNet [63], using Torchreid2, a library built on PyTorch for deep-learning person re-identification, and Google Colab3.

The chosen pre-trained models are:

2 https://github.com/KaiyangZhou/deep-person-reid
3 https://colab.research.google.com


• ResNet50 [56]
• ResNet101 [56]
• ResNeXt101 [64]
• DenseNet121 [65]

The results are shown in Table 4.1.

Model        mAP    Rank-1  Rank-5  Rank-10  Rank-20
ResNet50     1.4%   6.3%    12.0%   16.9%    21.7%
ResNet101    1.7%   7.2%    13.0%   15.1%    21.1%
ResNeXt101   2.0%   9.9%    15.4%   19.3%    23.5%
DenseNet121  1.9%   9.0%    15.7%   18.4%    23.5%

Table 4.1: Results for Re-ID on CityFlow with pre-trained models on ImageNet.

As you can see, the results are very poor. This is understandable, since we did not perform a training phase to adapt the network weights to the dataset, but simply used the networks pre-trained on ImageNet. From Figure 4.3, we can see how images of different vehicles taken from the same viewpoint are reported as similar. This is a clear example of the main challenge of vehicle re-identification, i.e., small inter-class and large intra-class variability.

Figure 4.3: Example result obtained with ResNeXt101

The results obtained on the actual dataset by performing the training phase are instead available in [5].

We then test re-identification using the networks pre-trained on the CityFlow dataset in [7]. We choose to test only the vehicle keypoint extraction and the re-identification based on visual features. We exclude the re-ranking part that uses metadata because it depends too much on the dataset: the metadata used are the color, model, and type of the vehicle, and these differ depending on the application scenario.

Starting from the work in [7], we create a vehicle re-identification baseline, V-ReID-KT-Baseline (where K and T stand for Keypoints and Tracklets), that exploits the keypoints of the vehicles and that can use tracklets in the query-set and the gallery-set, in addition to individual images. In Figure 4.4, we show the various steps performed to compute the visual distance between two different tracklets.


As a first step, we extract the keypoints from all the images in the query-set and gallery-set [57]. For each tracklet in the query-set and the gallery-set, using the images of the tracklet and the keypoints obtained in the previous phase, we compute the visual features using the pre-trained model already available from [7]. For each query, we compute the Euclidean distance between the vector of visual features of the query and the vector of visual features of each tracklet in the gallery. The first K results for each query are then returned (i.e., the K tracklets with the smallest Euclidean distance from the query), and mAP and CMC curve are computed for the entire dataset, as described in Section 2.3. The results obtained on modified CityFlow using our V-ReID-KT-Baseline on a per-image basis are as follows:

mAP     Rank-1  Rank-5  Rank-10  Rank-20
88.3%   99.4%   99.7%   99.7%    99.7%

These results are obviously exaggeratedly high because the network that extracts the visual features has been previously trained on the same tracklets that we use here in the query-set and gallery-set, as mentioned above.
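The ranking step of V-ReID-KT-Baseline described above can be sketched as follows, assuming each tracklet has already been reduced to a single visual feature vector; the NumPy-based distance computation and the names are illustrative, not the actual implementation.

```python
import numpy as np

def rank_gallery(query_feature, gallery_features, top_k=100):
    """Return the indices and distances of the top_k gallery tracklets with the
    smallest Euclidean distance from the query tracklet feature."""
    dists = np.linalg.norm(gallery_features - query_feature, axis=1)
    order = np.argsort(dists)        # ascending distance: best match first
    return order[:top_k], dists[order[:top_k]]
```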

4.2.2

Test with VeRi dataset and V-ReID-KTP-Baseline

In order to adequately test the effectiveness of the model, we decide to use the VeRi dataset, described in Section 3.1. The dataset is first pre-processed to organize query-sets and gallery-sets so that they can be used with the V-ReID-KT-Baseline described in the previous Subsection.

Initially, we test the dataset using single images, not tracklets. The results obtained on a per-image basis are as follows:

mAP     Rank-1  Rank-5  Rank-10  Rank-20
19.5%   56.3%   56.9%   66.6%    75.9%


Subsequently, we reorganize the query-set and the gallery-set into tracklets. In particular, the original query-set contains individual images, while the complete tracklets are only present in the gallery-set. Rather than using the single images of the query-set, we use the same tracklets present in the gallery-set as queries. The results obtained using tracklets instead of single images are as follows:

mAP     Rank-1  Rank-5  Rank-10  Rank-20
27.0%   63.3%   83.7%   89.8%    93.8%

The use of tracklets instead of single images has led, as you can see, to a significant improvement in mAP and CMC curve. The use of tracklets, in fact, gives the network more information as input, allowing it, for example, to use an image of the vehicle that is closer to the camera and more detailed, rather than an image of the same vehicle that is farther from the camera or occluded by other objects or vehicles. In Figure 4.5, we can see a comparison between the results obtained considering single images and considering tracklets.

Unlike the CityFlow dataset, the VeRi dataset also provides vehicle license plates in some images, although these are clearly visible only in a limited number of them. We have therefore decided to implement a re-ranking phase based on the recognition of license plates. We then extend the baseline introduced in the previous Subsection, implementing an additional module that performs re-ranking based on recognized license plates, obtaining V-ReID-KTP-Baseline (where P stands for Plates). For the recognition of the plates, two steps are performed:

• detection of number plates in the cropped images of vehicles;

• detection and recognition of characters in cropped license plate images.

Figure 4.5: A comparison of results obtained with the VeRi dataset in per-image-basis Re-ID and in per-tracklet-basis Re-ID.

We start from the work done in [8], exploiting two pre-trained networks, the first one for the detection of the license plate in a vehicle image and the second one for the recognition of characters in a license plate image. Together with the license plate bounding boxes and each recognized character, the networks return the corresponding confidence value. We then have a confidence value for the detection of the license plate and a confidence value for each recognized character of the license plate.

We describe below how the license plate characters and the final confidence value are extracted for each tracklet. Each image of the tracklet is processed by the first network to identify the license plate bounding box. If the confidence value of the found bounding box is higher than the p threshold, then the cropped license plate image is processed by the second network for the extraction of the characters and their confidence values. We empirically choose the p threshold to be 0.97. Then, we compute the total confidence of the obtained license plate:

• if the length of the string is different from the actual length len_plate of the plates considered (in the case of the VeRi dataset, len_plate = 6), then the total confidence value of the string is equal to 0;

• otherwise, the total confidence value of the string is set as the minimum confidence value between all the single characters.

Finally, among all the plates recognized by the images of the same tracklet, we choose the one that has obtained the largest total confidence value. The final confidence assigned to the plate of the tracklet will, therefore, be equal to the confidence of the chosen string.
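A minimal sketch of the per-tracklet license plate selection just described; plate detection and character recognition are represented by hypothetical callables, while the detection threshold of 0.97 and len_plate = 6 follow the values given in the text.

```python
P_THRESHOLD = 0.97   # minimum confidence for an accepted plate detection

def tracklet_plate(images, detect_plate, read_chars, len_plate=6):
    """Return (plate_string, confidence) for a tracklet, or (None, 0.0).
    detect_plate(img) -> (plate_crop or None, detection_confidence)
    read_chars(crop)  -> (list_of_characters, list_of_character_confidences)"""
    best_plate, best_conf = None, 0.0
    for img in images:
        crop, det_conf = detect_plate(img)
        if crop is None or det_conf <= P_THRESHOLD:
            continue
        chars, char_confs = read_chars(crop)
        # total confidence: 0 if the length is wrong, else the weakest character
        conf = 0.0 if len(chars) != len_plate else min(char_confs)
        if conf > best_conf:
            best_plate, best_conf = "".join(chars), conf
    return best_plate, best_conf
```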

After calculating the Euclidean distances between the visual features of the query tracklet and those of the tracklets in the gallery, we consider the license plates of the K tracklets in the gallery closest to the query (if present). For each pair (query tracklet, gallery tracklet), if there are recognized plates in both query tracklet and gallery tracklet and the confidence of both is greater than the threshold plate th we make the following reasoning:

• if the characters of the plates of the query tracklet and the gallery tracklet are the same, then the gallery tracklet is placed at the top of the results list by setting its distance from the query tracklet to 0. In addition, to further improve the results, for every other tracklet in the gallery we compare its distance from the query tracklet with its distance from the matched gallery tracklet, and set its distance from the query tracklet to the minimum of the two. In this way, besides moving the query closer to the gallery tracklet whose license plate is read, we also move the query closer to all other gallery tracklets whose license plates cannot be read but whose visual features are similar to those of the gallery tracklet whose license plate is read. An example of this logic is shown in Figure 4.6.

• if the characters of the plates of the query tracklet and the gallery tracklet are different, then the gallery tracklet is placed at the bottom of the results list by setting its distance from the query tracklet to the maximum distance over all (query tracklet, gallery tracklet) pairs.

Figure 4.6: (a) query tracklet whose license plate is read correctly; (b) gallery tracklet whose license plate is read correctly; (c) gallery tracklet whose license plate is not read. In this example, query (a) has the same license plate as gallery (b), so their distance is set to 0. Gallery (c) is also moved closer to query (a), because it is visually similar to gallery (b).

In Table 4.2, we show the results obtained with the maximum possible value for K (i.e., the total number of tracklets in the gallery set) for different values of plate_th.

As can be seen, with plate_th = 1 the results converge to those obtained without license plate integration: none of the license plates in the dataset is recognized with a perfect confidence value, so none of them is used in the re-ranking phase. The best mAP and Rank-1 are obtained for plate_th = 0.7.

plate_th   mAP     Rank-1   Rank-5   Rank-10   Rank-20
0.4        26.3%   61.9%    81.2%    87.6%     92.1%
0.5        27.1%   63.4%    82.3%    88.5%     92.7%
0.6        27.4%   63.7%    83.1%    89.3%     93.3%
0.7        27.4%   63.9%    83.4%    89.5%     93.6%
0.8        27.2%   63.6%    83.7%    89.8%     93.8%
0.9        27.0%   63.4%    83.7%    89.8%     93.8%
1          27.0%   63.4%    83.7%    89.8%     93.8%

Table 4.2: Results for Re-ID on VeRi using V-ReID-KTP-Baseline with different plate_th values.

We can see from Figure 4.7 that, unfortunately, the improvement in the results is not large enough to justify the use of the re-ranking phase. We attribute this to the quality of the images in the dataset: only a few license plates can be recognized correctly with a reasonable confidence value. This is the main reason that pushed us to create the new V-ReID-AB-Dataset.

4.2.3 Test with V-ReID-AB-Dataset

We then test V-ReID-KTP-Baseline on our new V-ReID-AB-Dataset, created specifically to assess the effectiveness of license plate recognition for vehicle re-identification. In this dataset, the plates are clearly visible in most of the tracklets, so we expect a significant improvement in the results from the license-plate-based re-ranking.

For technical reasons, it is first necessary to pre-process the dataset to reduce the number of images in each tracklet. With very long tracklets, such as those obtained by extracting every frame from the 30 FPS videos in the dataset, memory errors can occur because the hardware at our disposal is not able to process so many images at the same time. We therefore decide to reduce the size of the tracklets to a maximum of 10 images per tracklet. The images are selected so as to be equidistant from each other in time, in order to represent the trajectory taken by the vehicle inside the camera view with a smaller number of frames.
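As an illustration of this sub-sampling (a minimal sketch with a hypothetical helper, not the exact pre-processing script), the snippet below keeps at most 10 temporally equidistant frames of a tracklet, assuming the frames are ordered by timestamp.

def subsample_tracklet(frames, max_images=10):
    # Keep at most max_images frames, chosen to be (approximately) equidistant in time.
    n = len(frames)
    if n <= max_images:
        return list(frames)
    step = (n - 1) / (max_images - 1)   # evenly spaced from the first to the last frame
    return [frames[round(i * step)] for i in range(max_images)]

# Example: a 300-frame tracklet (10 seconds at 30 FPS) is reduced to the 10 frames
# at indices 0, 33, 66, 100, 133, 166, 199, 233, 266, 299.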


Figure 4.7: A comparison of results obtained with the VeRi dataset in per-tracklet-basis Re-ID with and without the LP-based re-ranking phase. The results with re-ranking are obtained with plate_th = 0.7.

We then perform the actual test using V-ReID-KTP-Baseline. The results obtained on a per-tracklet basis without re-ranking are as follows:

mAP     Rank-1   Rank-5   Rank-10   Rank-20
44.2%   43.8%    68.6%    80.0%     89.5%


We then perform the test with the re-ranking phase, first looking for the most appropriate value of plate_th, as we did before for the VeRi dataset. The results when plate_th varies and K is set to its maximum value are reported in Table 4.3.

plate_th   mAP      Rank-1   Rank-5   Rank-10   Rank-20
0.4        77.09%   84.76%   86.67%   86.67%    89.52%
0.5        77.35%   84.76%   85.71%   85.71%    88.57%
0.6        79.64%   86.67%   89.52%   89.52%    92.38%
0.7        79.29%   86.67%   90.48%   92.38%    94.29%
0.8        77.49%   82.86%   87.62%   91.43%    95.24%
0.9        44.17%   43.81%   68.57%   80.00%    89.52%
1          44.17%   43.81%   68.57%   80.00%    89.52%

Table 4.3: Results for Re-ID on V-ReID-AB-Dataset using V-ReID-KTP-Baseline with different plate_th values.

By setting plate_th to the value that gave us the best mAP, i.e., 0.6, we obtain the results in Table 4.4 when K changes.

K    mAP      Rank-1   Rank-5   Rank-10   Rank-20
10   63.19%   71.43%   75.24%   83.81%    90.48%
20   70.28%   78.10%   80.95%   86.67%    90.48%
30   76.89%   84.76%   88.57%   89.52%    91.43%
40   78.68%   86.67%   89.52%   90.48%    91.43%
50   78.81%   86.67%   89.52%   89.52%    90.48%
60   79.36%   86.67%   89.52%   89.52%    90.48%
70   79.23%   86.67%   89.52%   89.52%    90.48%

Table 4.4: Results for Re-ID on V-ReID-AB-Dataset using V-ReID-KTP-Baseline with different K values.

In Figure 4.8, we can see a comparison between the results obtained with and without the license-plate-based re-ranking phase (with K = 105 and plate_th = 0.6).


Figure 4.8: A comparison of results obtained with V-ReID-AB-Dataset in per-tracklet-basis Re-ID with and without the LP-based re-ranking phase. The results with re-ranking are obtained with K = 105 and plate_th = 0.6.

As we can see, the implementation of the re-ranking method based on license plate recognition leads in this case to a clear improvement of the results, both in terms of mAP and of the CMC curve. We can conclude that license plate recognition can bring a concrete improvement in the presence of high-definition images from which license plates can be clearly read.


4.2.4 Application Scenario Example: CNR Park

This section describes an application implemented to demonstrate a potential practical use of the work done in this thesis.

In this particular scenario, a video of a vehicle is taken at its entrance to the CNR Park; the vehicle must then be found inside the parking area using photos taken periodically of the area itself. The application must therefore extract information about the vehicle from a query video and then verify the presence of the vehicle in a gallery image. If present, the application must also return the exact image area in which the vehicle has been found.

The first step is the pre-processing of the query video. In this phase, we first extract the frames from the video. For each frame, we perform vehicle detection using the same YOLOv3 network used for the creation of V-ReID-AB-Dataset in Section 4.1. As done for the dataset, we extract the tracklets from the frames with detections using the TNT network [6]. Once the tracklets are extracted, a video is generated in which the found tracklets are shown, each one distinguished by its own ID; an example is shown in Figure 4.9.
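A possible sketch of this first pre-processing step is shown below; it uses OpenCV for frame extraction, while detect_vehicles is a hypothetical wrapper around the YOLOv3 detector (not an OpenCV function) whose per-frame output is what the TNT tracker consumes to build tracklets.

import cv2

def extract_detections(video_path, detect_vehicles):
    # Read the query video frame by frame and run vehicle detection on each frame.
    # detect_vehicles(frame) is assumed to return a list of (x, y, w, h, score) boxes.
    detections_per_frame = []
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        detections_per_frame.append(detect_vehicles(frame))
    cap.release()
    return detections_per_frame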

The user of the application, after watching the video, can then choose which of the tracklets found in the video to use as the query, specifying the tracklet ID. In fact, the same video may contain more than one vehicle, while the user may be interested in searching for only one of them. The user can also specify which image to search against, i.e., the gallery image.

Once the tracklet is chosen, a second pre-processing stage is performed, in this case on the gallery image. The image is passed through the same YOLOv3 detection network to extract all vehicles from the image. This gives us a number of gallery tracklets for re-identification equal to the number of vehicles detected in the image. Each gallery tracklet for the re-identification phase is therefore composed of a single image (the crop of the bounding box obtained from the detection). We assign an ID to each of the gallery tracklets, which is used, once the re-identification is done, to find the corresponding bounding box in the original gallery image.

Figure 4.9: A single frame of the video produced during the first phase. The vehicle of interest is identified by the ID number 5 in the video.

At this point, having obtained the query tracklet and the gallery tracklets, the license plate information is extracted using the license plate detection network and the character recognition network, as explained in Subsection 4.2.2.

The next step is to use the re-identification model introduced in the previous sections to extract the visual feature vectors. The Euclidean distance between the visual feature vector of the query tracklet and each visual feature vector of the gallery tracklets is computed. From these distances, we obtain an ordered list of the gallery tracklets closest to the query tracklet. The re-ranking phase is then performed by exploiting the information on the recognized plates, as described in more detail in the previous sections.
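A minimal sketch of this ranking step is given below, assuming the feature vectors have already been extracted by the re-identification model; the function name and arguments are illustrative, not the application code.

import numpy as np

def rank_gallery(query_feat, gallery_feats, gallery_ids):
    # Order gallery tracklets by Euclidean distance from the query tracklet.
    # gallery_ids are the IDs used later to map results back to bounding boxes.
    dists = np.linalg.norm(gallery_feats - query_feat, axis=1)
    order = np.argsort(dists)
    return [(gallery_ids[i], float(dists[i])) for i in order]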

Among all the obtained results, we consider only the one with the shortest final distance, i.e., the closest to the query: in the same image, in fact, there cannot be more than one instance of the same query vehicle. If the distance between the query tracklet and the nearest gallery tracklet is less than a certain threshold distance_th, then we consider that both tracklets represent the same vehicle: we report that the query vehicle is present in the gallery image and produce an image with the bounding boxes of all detected vehicles highlighted (the one of the nearest gallery tracklet, i.e., the chosen vehicle, in red, and all the others in yellow, so that they can be distinguished). If the distance between the query tracklet and the nearest gallery tracklet is greater than the distance_th threshold, then we return as output that the query vehicle is not present in the gallery image. An example of the result obtained in the case of a positive match is shown in Figure 4.10.

Figure 4.10: Example result obtained in case of positive match.
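The final decision can be sketched as follows (a simplified illustration under the stated assumptions, not the application code); the ranked list is assumed to come from the previous steps, after the LP-based re-ranking, and the drawing uses OpenCV.

import cv2

def report_match(gallery_image, ranked, boxes_by_id, distance_th=90):
    # ranked: list of (gallery_id, distance) sorted from closest to farthest
    # boxes_by_id: gallery_id -> (x, y, w, h) bounding box in the gallery image
    best_id, best_dist = ranked[0]
    if best_dist >= distance_th:
        return None                        # query vehicle not present in the gallery image
    out = gallery_image.copy()
    for gid, _ in ranked:
        x, y, w, h = boxes_by_id[gid]
        color = (0, 0, 255) if gid == best_id else (0, 255, 255)   # BGR: red match, yellow others
        cv2.rectangle(out, (x, y), (x + w, y + h), color, 2)
    return out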

The choice of the distance_th threshold is crucial for the proper functioning of the application and is closely linked to the application scenario. The lower its value, the higher the number of false-negative matches; conversely, the higher its value, the higher the number of false-positive matches. In the case of the tested examples, we empirically set its value to 90.


Chapter 5

System Implementation

In this chapter, we describe the design of a prototype system based on the work done in this thesis. In particular, we take the CNR Park described in the previous chapters as an example use case scenario. The parking area of the CNR is, in fact, equipped with a series of cameras that continuously monitor the parking lots and that are used by applications that verify the occupancy state of the parking lots [66, 67, 68, 62, 69]. However, the system can be generalized to any scenario where similar features are required.

The objective of the system is to locate vehicles inside the parking lot through the cameras spread throughout the area and to identify them immediately, i.e., to obtain information on the license plate even when it is not actually visible.

The proposed system consists of:

• one or more cameras at the entrance;
• a central server;
