
Università di Pisa

Dipartimento di Ingegneria dell’Informazione

Computer Engineering

Design and Implementation of a System

Based on Deep Convolutional Networks

for Intelligent Visual Surveillance

Candidate:

Indrit Kertusha

Supervisors:

Claudio Gennaro

Fabrizio Falchi

Giuseppe Amato

Academic Year 2017/2018


Abstract

The theme of public safety, in light of the latest dramatic events, has recently become a relevant aspect, especially in large urban areas. The possibility to act promptly in case of dangers or alarms can be of fundamental importance in determining the favorable outcome of the interventions. For example, identifying and tracking an individual or a suspicious vehicle that moves in an urban context may require significant deployment of forces, with costs that can sometimes render the interventions ineffective.

The intelligent camera represents a technology that can help manage this problem and that in recent years has seen a rapid development in many directions. The most interesting aspect is certainly its low cost. Furthermore, the advent of the 5G network and its integration with drone technology could provide a synergy for the development of innovative and low-cost public security applications. The visual information detected by the camera (photo and video) can be effectively exploited in the phase of identification on the ground, through the use of artificial intelligence technologies based on deep learning. The development of these technologies has been enormous as well. Thanks to the power of graphics cards equipped with Graphics Processing Units (GPU), recognizing in real time, in a video, a face of a person or a specific object in motion is now within the reach of a home computer.


Contents

1 Introduction 1

1.1 Scenario . . . 1

1.2 Objectives and Outline of this Thesis . . . 2

2 Related Work 5

2.1 Real-Time, Cloud-based Object Detection for Unmanned Aerial Vehicles . . . 5

2.2 Fast Object Detection for Quadcopter Drone using Deep Learning . . . 7

2.3 A Distributed Drone-Oriented Architecture for In-Flight Object Detection . . . 7

3 Background 9

3.1 Computer Vision . . . 9

3.1.1 Object detection . . . 10

3.2 Convolutional Neural Networks . . . 10

3.2.1 Additional layers and parameters . . . 12

3.3 The R-CNN family . . . 13

3.3.1 R-CNN . . . 13


3.3.3 Faster R-CNN . . . 16

3.4 Single Shot Detectors . . . 18

3.4.1 YOLO . . . 19

3.4.2 Single Shot MultiBox Detector . . . 21

3.5 Face Recognition . . . 23

3.5.1 VGGFace2 . . . 24

3.6 Classification . . . 26

3.6.1 Cosine similarity . . . 26

3.6.2 k-nearest neighbors algorithm . . . 26

3.7 Frameworks and Tools . . . 28

4 Contributions of this Thesis 31

4.1 Application Scenario . . . 31

4.2 Design of the Processing center . . . 34

4.2.1 Video analysis server . . . 34

4.2.2 Database of models . . . 39

4.2.3 Event Database . . . 39

4.2.4 Web Server . . . 40

4.2.5 Implementation . . . 42

5 Performance Evaluation 43

5.1 Object Detection . . . 43

5.2 Face Detection and Recognition . . . 44

5.3 Object and Face Detection/Recognition . . . 46


6.1 Conclusions . . . 49

6.2 Future Work . . . 50

A Darknet modifications 51


List of Figures

2.1 System overview. . . 5

2.2 A running time comparison of recent object detectors . . . 6

3.1 Example of edge detection using convolution . . . 11

3.2 Layers in convolutional neural networks. . . 12

3.3 Architecture of a convolutional neural nework . . . 12

3.4 Example of object detection using R-CNN. . . 14

3.5 Example of Selective Search . . . 14

3.6 High level architecture of R-CNN. . . 15

3.7 High level architecture of Fast R-CNN. . . 17

3.8 Faster R-CNN region proposal . . . 18

3.9 YOLO bounding box predictions . . . 19

3.10 Example of YOLO detection. . . 20

3.11 SSD architecture . . . 21

3.12 Multi-scale feature maps . . . 22

3.13 Detecting objects on different scales. . . 23

3.14 Face pose. . . 23

3.15 VGGFace2 pose variation example . . . 25


3.16 VGGFace2 age variation example . . . 25

3.17 Example of K-NN classification . . . 27

4.1 Application scenario. . . 32

4.2 Processing center architecture. . . 34

4.3 YOLO(Darknet) detection results. . . 35

4.4 COCO dataset categories . . . 35

4.5 YOLOv3-tiny example . . . 36

4.6 YOLOv3 example . . . 36

4.7 OpenCV face detector results . . . 36

4.8 Face detection example . . . 37

4.9 Face detection and recognition example . . . 38

4.10 Example of the kNN classifier . . . 38

4.11 Event database schema . . . 40

4.12 Web-based GUI . . . 41

5.1 Face recognition false positive rate . . . 45

5.2 Face recognition false negative rate . . . 45

5.3 Face recognition accuracy . . . 46

5.4 Test video sample . . . 47


List of Tables

2.1 Running times of recent object detectors . . . 8

3.1 VGGFace2 Dataset statistics . . . 25

5.1 Running times of YOLO object detectors (in FPS) . . . 43

5.2 Average number of detected objects per frame . . . 44

5.3 Number of distinct detected classes . . . 44

5.4 Running times of face detection and recognition . . . 44

5.5 Running times of object detection, face detection and recognition 46


Chapter 1

Introduction

1.1 Scenario

Real-time object detection is crucial for many applications of Unmanned Aerial Vehicles (UAVs) such as reconnaissance and surveillance, search-and-rescue, and infrastructure inspection. In the last few years, Convolutional Neural Networks (CNNs) have emerged as a powerful class of models for recognizing image content, and are widely considered in the computer vision community to be the de facto standard approach for most problems. However, object detection based on CNNs is extremely computationally demanding, typically requiring high-end Graphics Processing Units (GPUs) whose power consumption and weight are excessive for a lightweight and low-cost drone.

The rising ubiquity of high-speed mobile networks (4G-5G) allows bypassing the computing problem by transferring the image processing task from the drone itself to dedicated calculation centers. By doing this, the drone only needs the capacity to stream video, making it unnecessary to carry the weight of a more powerful computer and thus saving energy.

A number of problems arise in this context, due to the limitations of drones:

• The images captured by the drone may be blurry due to its sudden movements

• The onboard cameras of drones usually have low resolution, and thus the individual objects may be small


Real-time object detection techniques extensively use Convolutional Neural Networks (CNNs). A CNN is a type of artificial neural network inspired by biological processes, in that the connectivity pattern between neurons resembles the organization of the animal visual cortex. Because such techniques are computationally very complex, they have become feasible only recently, thanks to the availability of powerful hardware and software infrastructure.

1.2 Objectives and Outline of this Thesis

The goal of this thesis work is the design and implementation of a system able to execute object recognition, face detection and recognition in real-time, on a video stream. The interaction with the system will be done through a web interface, where a user passes the desired commands and the results are displayed. The backend of the system is run on a remote server, with sufficient computational power for executing the detection/recognition task. The detection results are stored on a database for logging purposes.

Given this goal, we have conducted an analysis of current approaches for object detection, based on deep learning architectures, and selected the appropriate ones. We then analyzed some current models for face recognition that offer the best tradeoff between speed and accuracy. Thanks to the selection of these models, and to the tools that we used, the proposed system implementation is efficient on current hardware.

This thesis work has the following structure:

• Chapter 2 presents related work on real-time cloud-based object detection. We analyze strategies that have been proposed in the computer vision community in recent years, highlighting their positive and negative aspects.

• Chapter 3 describes some useful concepts and tools employed throughout this work. We first introduce Convolutional Neural Networks, an effective solution to the object detection problem, and then we analyze some existing CNN-based approaches. We present some of the issues regarding face recognition and how a carefully selected dataset helps mitigate those issues. Finally, the frameworks and tools we used are presented.

• Chapter 4 contains the contributions of this work. First we present a closer look at the various modules of the system. Then we show the implementation of the system in more detail, using the frameworks, tools and CNN architectures described in the previous chapter.


• Chapter 5 presents the performance values of the implemented system. We first test the performance of each single CNN architecture, to understand which are the most computationally expensive, and then we test their combination.

• Chapter 6 reports overall considerations, future enhancements and conclusions.


Chapter 2

Related Work

This chapter briefly analyzes some system architectures that have been proposed for real-time object detection on Unmanned Aerial Vehicles (UAV).

2.1 Real-Time, Cloud-based Object Detection for Unmanned Aerial Vehicles

Lee et al. from Indiana University proposed in [9] an object detection system composed of two parts: a Local Machine and a cloud-based object detection server.

Figure 2.1: System overview.

The Local Machine is based on a Parrot AR.Drone 2.0 (ARD), which is a small, lightweight and low-cost hardware platform. The ARD is equipped with two cameras: a front-facing one with a resolution of 1280x720 at 30fps, and a downward-facing one with a resolution of 320x240 at 60fps. The authors propose running a lightweight object detector locally, on the smaller camera, in


order to save bandwidth from data transfers to/from the cloud server. They use the Binarized Normed Gradients (BING) algorithm to measure objectness on input frames. When the LM detects an object, it captures a high resolution image from the other camera and sends it to the cloud server. After receiving an image of a candidate object, the Faster R-CNN algorithm for object detection is applied by the cloud server.

The authors conducted two sets of experiments to validate their approach. The first set of experiments focuses on testing the accuracy of deep network based object detectors on aerial images. The second set of experiments evaluates the speed of the cloud-based object detection approach, comparing it with the running time of the fastest deep learning based object detector on a local laptop (with a GTX 770M GPU).

Experimental results from the paper are shown on figure 2.2.

Figure 2.2: A running time comparison of recent object detectors

Even though YOLO has a slightly lower mAP than the other object detection methods, its running time is lower, so it achieves more frames per second. By executing the heavier Faster R-CNN on the cloud server, the execution times are lowered drastically, but a non-negligible latency is introduced by sending the image and receiving the results.

In any case, the execution times of the Faster R-CNN object detection are unsatisfactory for real time applications.


2.2 Fast Object Detection for Quadcopter Drone using Deep Learning

Budiharto et al. from Bina Nusantara University have proposed in [1] an architecture similar to the one described in the previous section. The system is composed of a Parrot AR drone and a local object detection server.

The authors do not use an object proposal algorithm on the drone to detect objects: images are captured from the drone and then are sent directly to the server. On the local server the heavy Faster R-CNN algorithm is replaced with a fast, efficient object detection method based on the MobileNet architecture and the Single Shot Detector (SSD) framework.

Experimental results show that the system runs at 14 FPS on average. The authors have not disclosed the hardware on which the algorithm runs. In any case, SSDs are better suited for real-time applications.

2.3 A Distributed Drone-Oriented Architecture for In-Flight Object Detection

Vaquero-Melchor et al. propose in [20] a more general architecture than the previously shown systems. Among the requirements that the authors consider, two are the most attractive:

• The architecture should be designed in such a way that at any moment it is possible to change the object detection model. The operation scenario may vary, for example, from an urban field (pedestrians, cars and so on) to a rural one (crops, roads, rivers)

• The architecture will be designed such that the main access to the system is done through a web-based interface

The system is composed of three modules:

• The On-board Module, responsible for serving the video stream from the drone

• The Processing Center, which executes the object detection

• The Client, which is used by the user to interact with the system


There are two possibilities when returning detection results. The first one is to return the input video with the bounding boxes of the detected objects drawn on it. The second one is to return only the bounding box information (position and size). The first alternative has a much higher bandwidth consumption, but the video can be played on any compatible player. The second solution has a small bandwidth consumption, but the client must then combine the video with the detection results.

The authors have tested various algorithms for object detection inside the Processing Module, with a focus on SSD algorithms. They have tested both MobileNet and Inception feature extractors. Moreover, they evaluate the slower and more accurate R-FCN and Faster R-CNN algorithms. The results are shown in table 2.1. As expected, SSD provides the best computing time.

Algorithm      Feature extractor   Mean (ms)
SSD            MobileNet           39.85
SSD            Inception           39.23
Faster R-CNN   ResNet 50           604.29

Table 2.1: Running times of recent object detectors


Chapter 3

Background

In this chapter we present the network architectures used in this thesis work. We first discuss the object detection problem as part of computer vision. We then discuss convolutional neural networks and their importance. Finally, we briefly discuss the various CNN-based approaches that are used in the object detection/recognition task.

3.1 Computer Vision

Computer vision deals with the extraction of meaningful information from digital images or videos. From an engineering perspective, its aim is to automate tasks that the human visual system can do. Computer vision tasks include methods for acquiring, processing, analyzing digital images, and finally extracting high-dimensional data that represents relevant information about the image. Computer vision applications include systems for automatic inspection, identification tasks, detecting events, modeling objects or environments, navigation and organizing information. Typical tasks in these applications are recognition, motion analysis, scene reconstruction and image restoration.

Many computer vision algorithms can be described as a combination of image processing and machine learning, which makes machine learning a necessary component of them. Effective solutions require algorithms that can cope with the vast amount of information contained in images and, for many applications, can carry out the computation in real time.


3.1.1 Object detection

Object detection is one of the classical problems of computer vision. The ability to identify the objects present in an image is one of the most basic requirements for every computer vision application. Solutions to this problem must account for deformation and changes in lighting and viewpoint. Object detection involves both locating and classifying regions of an image.

To detect an object, we first need some idea of where the object might be and how the image is segmented. Low-level visual features of an image may be used as a guide for locating candidate objects. The location and size are defined using a bounding box, which is typically stored in the form of corner coordinates. The sub-image of the candidate object contained in the bounding box is then classified by an algorithm that has been trained using machine learning. After making the initial guess, the bounding box coordinates can be further refined iteratively.

The first popular solutions for object detection utilized feature descriptors, such as scale-invariant feature transform (SIFT) [11] and histogram of oriented gradients (HOG) [3]. In recent years there has been a shift towards utilizing convolutional neural networks.

3.2 Convolutional Neural Networks

Using traditional neural networks for solving computer vision problems is problematic due to the large amount of information contained in an image.

For example, for a 1280x720x3 image (where 3 refers to the color channels), every neuron of the first fully connected layer would need 2,764,800 weights. As the image resolution increases, the number of weights of the model quickly becomes larger: this results in slow performance and overfitting. Moreover, if the image, or some object contained within it, is translated by some amount, the network must be trained to recognize this new information. Thus, fully connected neural networks are not translation invariant.

In machine learning, a Convolutional Neural Network (CNN) is a class of deep artificial neural networks with applications in image and video recognition, natural language processing and other pattern recognition tasks. They were inspired by the connectivity of neurons in the visual cortex, in which individual cortical neurons respond to stimuli only in a restricted region of the visual field called the receptive field [4]. Receptive fields are sensitive to certain types of stimuli and overlap with each other.


The function of the receptive fields can be approximated with the convolution operation [12]. In image processing, convolution is the process of adding each element of the image to its local neighbors, weighted by the kernel (also called filter). It is used for blurring, sharpening, edge detection and more. The result of this operation is called a feature map or activation map. Figure 3.1 shows an example of edge detection with convolution.

Figure 3.1: The feature map is obtained by convolving the input image pixel values with the convolution kernel.
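To make the operation in figure 3.1 concrete, the following minimal NumPy sketch convolves a tiny grayscale image with a 3x3 edge-detection kernel; the kernel values and the toy image are illustrative only, not taken from this thesis.

import numpy as np

def convolve2d(image, kernel):
    # "Valid" 2D convolution: slide the flipped kernel over the image and
    # take the weighted sum of each local neighborhood.
    kh, kw = kernel.shape
    ih, iw = image.shape
    flipped = kernel[::-1, ::-1]              # convolution flips the kernel
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * flipped)
    return out

# A simple Laplacian-like kernel that responds strongly to edges.
edge_kernel = np.array([[-1, -1, -1],
                        [-1,  8, -1],
                        [-1, -1, -1]], dtype=float)

image = np.zeros((8, 8))
image[:, 4:] = 1.0                            # vertical step edge
feature_map = convolve2d(image, edge_kernel)  # non-zero responses along the edge
print(feature_map)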

A set of convolutional filters can be combined to form a convolutional layer of a neural network. The values of the convolution kernel are now the neuron parameters and are trained using machine learning. The multiplication operation of regular neural networks is replaced with the convolution operation. The output of the layer is usually described as a volume, where the height and width depend on the dimensions of the activation map and the depth depends on the number of filters.

Since the same filters are used over all parts of the image, the number of parameters is drastically reduced compared to a fully-connected neural layer. The neurons of the convolutional layer are only connected to a local region of the input and share the same parameters, which ensures translation invariance. Successive convolutional layers, combined with other types of layers (see next section), form a convolutional neural network. Lower layers (those closer to the input) learn to recognize simple features, like edges and corners, while higher layers (those closer to the output) learn to recognize more complex features, as shown on figure 3.2. Figure 3.3 shows a general architecture of a CNN.


Figure 3.2: Layers in convolutional neural networks.

Figure 3.3: Architecture of a convolutional neural network. Image from Wikipedia.

3.2.1 Additional layers and parameters

A convolutional neural network includes additional layers, some of which are briefly discussed here.

A pooling layer replaces the output of the network at a certain location with a summary statistic of the nearby outputs. This layer effectively reduces the dimensions of the feature map. This helps to make the representation approximately invariant to small translations of the input. A typical pooling operation is max pooling, which reports the maximum output within a rectangular neighborhood.

One of the parameters of the convolution operation is the stride. This parameter controls the movement of the filter over the input image. For a value of n, the filters are moved n pixels at a time, thereby reducing the feature map dimensions.

The pooling layer, together with the stride parameter, improves the computational efficiency of the network by reducing the feature map size. These operations are shown as subsampling in figure 3.3.
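As a toy illustration of these operations, the following NumPy sketch applies 2x2 max pooling with stride 2, which halves each spatial dimension of a feature map; sizes and values are placeholders.

import numpy as np

def max_pool(feature_map, size=2, stride=2):
    # Replace each size x size window with its maximum value.
    h, w = feature_map.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = feature_map[i * stride:i * stride + size,
                                 j * stride:j * stride + size]
            out[i, j] = window.max()
    return out

fm = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool(fm))   # [[ 5.  7.] [13. 15.]]: a 4x4 map reduced to 2x2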

The final hidden layers of a CNN are typically fully-connected layers. These layers perform the classification task by mapping the feature maps produced by the previous layers to class predictions; since fully-connected layers contain a large number of parameters, the feature volume that reaches them must be kept small in order to be practical. Pooling and stride are applied at various stages of the network in order to reduce the feature map volume that reaches these layers.

Even when the network is trained for classification, the activations of the layers before the fully connected ones can be used to generate a feature representation of an image. In this way, the convolutional neural network can be used as a large feature extractor.

3.3 The R-CNN family

In object detection algorithms we try to draw a bounding box around each object of interest in the image. We should also consider that the image may contain multiple objects, each with its own bounding box, and we do not know how many beforehand. We cannot solve this problem with a plain CNN whose output layer has a fixed length, because the number of occurrences of objects of interest is not fixed. We could try to solve this problem by taking different regions of interest from the image, and using a CNN to classify the presence of the object within each region.

Previous detection systems use a sliding window approach, where the classifier is run in different locations over the entire image. This is very inefficient due to the large number of windows that have to be analyzed. Newer approaches use region proposal methods, where potential bounding boxes are generated and a classifier is run over each one. Regions with CNN (R-CNN) and its improvements (Fast R-CNN, Faster R-CNN) follow the second approach. These networks have one part dedicated to providing region proposals, followed by a high quality classifier to classify these proposals. We will briefly present them in the following sections.

3.3.1 R-CNN

Regions with CNN features (R-CNN) [6] is an object detection system, probably the first to show that a CNN can lead to dramatically higher object detection performance. R-CNN takes an image as input and produces as output the labels and bounding boxes for each detected object. An example of detection results using R-CNN is shown in figure 3.4.

First, the system proposes a number of boxes and then checks if any of them corresponds to an object. In order to create these bounding boxes, or region proposals, R-CNN uses the Selective Search algorithm [19]. At a high level, the Selective Search algorithm looks at the image through windows of different sizes and for each size tries to group together adjacent pixels by texture, color, or


intensity to identify objects (see figure 3.5 for an example).

Figure 3.4: Example of object detection using R-CNN.

Figure 3.5: Selective Search example: using windows of different sizes allows the detection of both girls inside the green bounding boxes

Once the proposals are created, R-CNN resizes the regions to a standard square size and passes each one of them through a modified version of AlexNet [8] or VGG-16 [18]. At this point a fixed-length feature vector is extracted from each region proposal. On the final layer of the CNN, there is a Support Vector Machine (SVM) that classifies each feature vector by checking if it corresponds to an object, and if so, to which one. Finally, a simple linear regression is run on the region proposals in order to iteratively produce bounding box coordinates that better fit the size of the object. Figure 3.6 shows the high level architecture of R-CNN.


Figure 3.6: High level architecture of R-CNN.

3.3.2 Fast R-CNN

While R-CNN works well, it is quite slow for the following reasons:

• It requires a forward pass of the CNN (AlexNet or VGG-16) for every single region proposal for every single image (that’s around 2000 forward passes per image).

• It has to train three different models separately: the CNN to generate image features, the classifier that predicts the classes, and the regression model to find better bounding boxes. This makes the pipeline extremely hard to train.

Fast R-CNN [5] solves these problems by:

• Using Region of Interest (RoI) pooling: RoI pooling is a neural-net layer that achieves a significant speedup of both training and testing and maintains a high detection accuracy. The layer takes two inputs: a fixed-size feature map obtained from a deep convolutional network with several convolutions and max pooling layers, and an N x 5 matrix representing a list of regions of interest, where N is the number of RoIs (the first column represents the image index and the remaining four are the coordinates of the top left and bottom right corners of the region). For every region of interest from the input list, RoIPool takes the section of the input feature map that corresponds to it and scales it to some pre-defined size (e.g., 7 x 7). The scaling is done by dividing the region proposal into equal-sized sections (the number of which is the same as the dimension of the


output), finding the largest value in each section, and copying these max values to the output buffer. The result is that from a list of rectangles with different sizes we can quickly get a list of corresponding feature maps with a fixed size. If there are multiple object proposals on the frame (and usually there will be a lot of them), we can still use the same input feature map for all of them. Since computing the convolutions at the early stages of processing is very expensive, this approach can save a lot of time (a small sketch of RoI pooling follows this list).

• Combining the three models into one network: while in R-CNN we had different models to extract image features (CNN), classify (SVM), and resize bounding boxes (regressor), Fast R-CNN instead uses a single network to compute all three. Each feature vector is fed into a sequence of fully connected layers that finally branch into two sibling output layers: a softmax layer, which replaces the SVM classifier and produces probability estimates over the K object classes, and another layer that outputs bounding box coordinates for each of the K object classes. A high-level architecture of Fast R-CNN is shown in figure 3.7.
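The following single-channel NumPy sketch illustrates the RoI pooling idea: the region is cut into a fixed grid of roughly equal sections and each section is reduced to its maximum. It is only an approximation; the real layer also handles batches, channels and coordinate rounding.

import numpy as np

def roi_pool(feature_map, roi, output_size=(7, 7)):
    # feature_map: one channel of the conv feature map (2D array).
    # roi: (x1, y1, x2, y2) in feature-map coordinates.
    x1, y1, x2, y2 = roi
    region = feature_map[y1:y2, x1:x2]
    out_h, out_w = output_size
    # Split the region into out_h x out_w roughly equal sections.
    h_edges = np.linspace(0, region.shape[0], out_h + 1).astype(int)
    w_edges = np.linspace(0, region.shape[1], out_w + 1).astype(int)
    out = np.zeros(output_size)
    for i in range(out_h):
        for j in range(out_w):
            section = region[h_edges[i]:h_edges[i + 1],
                             w_edges[j]:w_edges[j + 1]]
            out[i, j] = section.max() if section.size else 0.0
    return out

fm = np.random.rand(32, 32)
pooled = roi_pool(fm, roi=(4, 6, 20, 30))
print(pooled.shape)   # (7, 7): a fixed-size map regardless of the RoI size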

3.3.3 Faster R-CNN

Even with all these improvements, the region proposal step still remains a bottleneck in Fast R-CNN. As we saw, the very first step in detecting the locations of objects is generating a number of potential bounding boxes or regions of interest to test. In Fast R-CNN, these proposals were created using the Selective Search algorithm, which was determined to be a fairly slow process. In [16] a team at Microsoft Research found a way to make the region proposal step almost cost free through an architecture they named Faster R-CNN. They noticed that region proposals are built exploiting features of the image, the same features that are already calculated with the forward pass of the CNN (first step of classification). In Faster R-CNN a single CNN is used to both carry out region proposals and classification. This way, only one CNN needs to be trained and we get region proposals almost for free. So, unlike in R-CNN and Fast R-CNN, in this new detection system the input is the image and not the region proposal. Figure 3.7 shows the high level architecture of Faster R-CNN. Faster R-CNN is composed of two modules. The first module is a deep fully convolutional network that proposes regions, and the second module is the Fast R-CNN detector that uses the proposed regions. The entire system is a single, unified network for object detection. The fully convolutional network added on top of the network is called the Region Proposal Network (RPN). To generate region proposals, a small network slides a window over the convolutional feature map and outputs k potential bounding boxes and an objectness score that estimates the probability of there being an object for each proposal. The k proposals are parameterized relative to k reference boxes, called anchors.

An anchor is centered at the sliding window in question, and is associated with a scale and aspect ratio (see figure 3.8).

An important property of using anchors is translation invariance: if one translates an object in an image, the proposal should translate and the same function should be able to predict the proposal in either location. Another property of anchors is that they can be used for predictions on multiple scales and aspect ratios: this allows higher detection accuracy with a more cost-efficient computation.


Figure 3.8: A 3x3 window is slid over the input conv feature map. For each window the network predicts k proposals (anchors) and maps the conv feature map to a lower-dimensional feature. This feature is fed into two fully-connected layers, a box-regression layer (reg) and a box-classification layer.

3.4 Single Shot Detectors

The problem with the approach followed by the R-CNN family is that the objects of interest might have different spatial locations within the image and different aspect ratios. Hence, we would have to select a huge number of regions, and this does not scale. These methods are very accurate but come at a big computational cost, which is unacceptable for embedded devices and prevents real-time execution even on high-end hardware.

In single shot detectors, a single network sees the entire image during training and testing and directly produces bounding box coordinates and class probabilities. The network architecture is thus simpler, since region proposals are no longer needed.

In the following sections we will discuss two algorithms that follow the single shot detector approach: You Only Look Once (YOLO), together with its improved versions, and SSD.


3.4.1 YOLO

In this section we describe You Only Look Once (YOLO) [15] and its improved versions.

YOLO divides the input image into an S × S grid. Each of these grid cells predicts B bounding boxes and class probabilities for those boxes. Each bounding box consists of 5 predictions: x, y, w, h, and confidence. The (x, y) coordinates represent the center of the box relative to the bounds of the grid cell. The width and height are predicted relative to the whole image. Finally, the confidence prediction represents the IOU between the predicted box and any ground truth box. Figure 3.9 shows an example of these bounding boxes.

Figure 3.9: YOLO bounding box predictions

Each grid cell predicts C conditional class probabilities. These probabilities are conditioned on the grid cell containing an object. YOLO only predicts one set of class probabilities per grid cell, regardless of the number of boxes B. At test time the confidence score for the bounding box and the class prediction are combined into one final score that tells us the probability that this bounding box contains a specific type of object. There are S × S × B bounding boxes in total and most of these boxes will have very low confidence scores. Depending on the level of accuracy that we choose, only those boxes with a final score above a certain threshold are kept, and overlapping duplicates are removed through non-maximum suppression. Figure 3.10 shows an example of final predictions.
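A minimal sketch of this score combination and thresholding step follows; the array names and the threshold value are illustrative and not taken from the YOLO implementation.

import numpy as np

def filter_predictions(box_confidences, class_probs, threshold=0.5):
    # box_confidences: (N,) confidence per predicted box.
    # class_probs: (N, C) conditional class probabilities.
    # Returns indices, class ids and final scores of the surviving boxes.
    scores = box_confidences[:, None] * class_probs   # (N, C) final scores
    best_class = scores.argmax(axis=1)
    best_score = scores.max(axis=1)
    keep = best_score > threshold
    return np.where(keep)[0], best_class[keep], best_score[keep]

conf = np.array([0.9, 0.2, 0.7])
probs = np.array([[0.8, 0.2],
                  [0.5, 0.5],
                  [0.1, 0.9]])
print(filter_predictions(conf, probs))
# boxes 0 and 2 survive, with classes 0 and 1 and scores 0.72 and 0.63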

One limitation of YOLO is that it imposes strong spatial constraints on bounding box predictions since each grid cell only predicts B boxes (in the paper B is set to two) and can only have one class. On one hand this limits the number of nearby objects that the model can predict and thus YOLO struggles with small objects that appear in groups, but on the other hand this helps mitigate multiple detections of the same object.


Figure 3.10: Example of YOLO detection.

The second version of YOLO, called YOLO9000 (or YOLOv2) [13], introduces new features in order to improve speed and accuracy. These are summarized below:

• Batch normalization on convolution layers.

• High-resolution classifier: the classification network is first trained at a smaller resolution (224x224) and then fine-tuned at a larger one (448x448).

• Convolution with anchor boxes: instead of making arbitrary guesses about bounding boxes, we predict offsets with respect to some default anchor boxes. These anchor boxes are similar to those in Faster R-CNN

• Dimension clusters: a clustering algorithm is run on the training set bounding boxes to find good anchor boxes.

• Darknet-19: YOLOv2 proposes a new network for feature extraction that is faster than the previous one.

• By using a hierarchical classification of objects, called WordTree, YOLO9000 is able to recognize more than 9000 classes.

The third version of YOLO, called YOLOv3 [14], improves further the accuracy while trying to maintain a reasonable speed. It introduces the following:


• Prediction across 3 different scales, instead of two from the previous version

• Darknet-53: YOLOv3 proposes a new network for feature extraction. This network is larger than the previous one, and thus slower, but improves the accuracy.

3.4.2 Single Shot MultiBox Detector

Single Shot MultiBox Detector (SSD) [10] follows a similar approach to YOLO. SSD uses a convolutional network that produces a certain number of bounding boxes along with a score for each bounding box that indicates the presence of an object's class instance. The removal of the region proposal network speeds up the detection process but results in a drop of accuracy. To recover it, SSD applies a few improvements, including multi-scale features and default boxes. These improvements allow SSD to maintain a high accuracy using lower resolution images, which further pushes the speed higher. The early layers of the network perform feature extraction, while the later layers implement the multi-scale feature maps and default boxes. Default boxes are similar to anchor boxes in Faster R-CNN and YOLOv2-v3.

Object detection is achieved in 2 parts:

• Feature extraction, which uses VGG16 to extract the feature map. In general, any network can be used for feature extraction.

• Applying convolution filters for object detection.

Figure 3.11: SSD architecture

SSD uses multiple layers (multi-scale feature maps, as part of the extra feature layers) to detect objects independently. These layers decrease in size progressively and allow predictions of detections at multiple scales. The convolutional


model for predicting detections is different for each feature layer. Lower-resolution layers are used to detect larger-scale objects, higher-resolution layers for smaller objects. For example, the 4x4 feature maps are used for larger-scale objects. Multi-scale feature maps improve the detection accuracy significantly.

Figure 3.12: Multi-scale feature maps

SSD also uses default boxes which are a collection of boxes overlaid on the image at different spatial locations, scales and aspect ratios that act as reference points on the ground truth images. For each default box on each cell the network outputs the following:

• A probability vector of length c, where c is the number of classes.

• A vector with 4 elements (x, y, width, height) that represent the offset required to move the default box position to the real object.

As an example, from figure 3.13 we see that the cat has 2 boxes that match on the 8x8 feature map, but no box is present for the dog. Now on the 4x4 feature map there is one box that matches the dog and the cat is missing. It is important to note that the boxes in the 8x8 feature map are smaller than those in the 4x4 feature map: this allows the network to identify objects across a large range of scales.

Given the large number of boxes generated during a forward pass of SSD at inference time, it is essential to prune most of the bounding boxes by applying the non-maximum suppression technique: boxes with a confidence score below some threshold ct are discarded, boxes whose Intersection over Union (IoU) with a higher-scoring box exceeds some threshold lt are suppressed, and only the top N predictions are kept. This ensures only the most likely predictions are retained by the network, while the noisier ones are removed.
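A compact sketch of this pruning step, assuming boxes are given as (x1, y1, x2, y2) corner coordinates with one score per box; the threshold values are placeholders.

import numpy as np

def iou(box, boxes):
    # Intersection over Union between one box and an array of boxes.
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def non_max_suppression(boxes, scores, score_thr=0.5, iou_thr=0.45, top_n=200):
    # Discard low-confidence boxes, then greedily keep the highest-scoring
    # box and drop any remaining box that overlaps it too much.
    keep = scores >= score_thr
    boxes, scores = boxes[keep], scores[keep]
    order = scores.argsort()[::-1]
    kept = []
    while order.size and len(kept) < top_n:
        best = order[0]
        kept.append(best)
        rest = order[1:]
        order = rest[iou(boxes[best], boxes[rest]) <= iou_thr]
    return boxes[kept], scores[kept]

b = np.array([[10, 10, 50, 50], [12, 12, 52, 52], [100, 100, 150, 150]], float)
s = np.array([0.9, 0.8, 0.75])
print(non_max_suppression(b, s))   # the second, overlapping box is suppressed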


Figure 3.13: Detecting objects on different scales.

3.5 Face Recognition

Face recognition in images and videos has received significant attention in recent years. It is a special case of object recognition where the aim is finding human faces in the input image.

Face recognition includes two tasks:

• Face detection: determine the location, size and pose (yaw, pitch and roll) of a face in the image.

• Face recognition: perform a one to many comparison between a detected face from the image (query image) and a set of face images belonging to known people, in order to identify the face depicted in the query.


Many methods have been proposed for face recognition; they can be distinguished by whether they use deep learning. The ones that do not use deep learning start by extracting a representation of the face image using handcrafted local image descriptors such as SIFT [11]; then these local descriptors are aggregated into an overall face descriptor through a pooling mechanism, for example the Fisher Vector [17]. The methods that use deep learning have the defining characteristic of using a CNN for feature extraction. A deep CNN is trained to classify faces into a large number of classes, each class containing a large number of samples. In order to use a deep learning face recognition system, it first needs to be trained on a set of face images belonging to many individuals. In the training phase, characteristic face features are detected in the training images, and by taking advantage of the presence and intensity of such features in the images associated to different people, the system learns models of the faces of different people.

The detected features are mapped to vectors in a lower dimensional space, and a distance function is used between two vectors to efficiently evaluate the similarity between them. After the learning phase, features extracted from any new input image are easily compared to features found in the training images. In this way, it is possible to verify whether a face corresponds to another face from the training set.

Face recognition systems have to address the following issues:

• Large datasets for training are not readily available

• High intra-class variability: significant differences may exist between images depicting the same person due to the so-called A-PIE variability (Age, Pose, Illumination and Expression)

• Small inter-class variability: small differences between images depicting different persons

3.5.1 VGGFace2

A number of face recognition systems which address the previous problems and achieve increasingly higher performance have been proposed in recent years. One such system is VGGFace2 [2].

Concurrent with the rapid development of deep Convolutional Neural Networks (CNNs), there has been much recent effort in collecting large scale datasets in order to produce more accurate models. Previous datasets have explored the importance of intra-class (many images of one subject) and inter-class (many subjects) variation, but they have not specifically explored pose and age variation. VGGFace2 addresses this by designing a dataset generation pipeline to explicitly collect images with a wide range of pose, age, illumination and ethnicity variations of human faces.

The VGGFace2 dataset contains 3.31 million images from 9131 celebrities spanning a wide range of ethnicities and professions. The images were downloaded from Google Image Search and show large variations in age, pose, lighting and background. Pose (yaw, pitch and roll) and apparent age information are estimated by pre-trained pose and age classifiers. VGGFace2 provides annotations to enable evaluation on two scenarios: face matching across different poses, and face matching across different ages.

Figure 3.15: VGGFace2 pose variation example

Figure 3.16: VGGFace2 age variation example

The dataset collection process is summarized in table 3.1.

Stage   Aim                            Type   Classes   Images (million)
1       Name list selection            M      500K      50.00
2       Image downloading              A      9244      12.94
3       Face detection                 A      9244      7.31
4       Filtering by classification    A      9244      6.99
5       Near duplicate removal         A      9244      5.45
6       Final filtering                A/M    9131      3.31

Table 3.1: VGGFace2 Dataset statistics

In order to evaluate the quality of the dataset, the authors use ResNet-50 and SE-ResNet-50 as the architectures for comparing training datasets. These networks, trained on the VGGFace2 dataset, achieve state-of-the-art performance on face recognition benchmarks.


3.6 Classification

In this section we briefly discuss cosine similarity and the k-nearest neighbors algorithm. When used together, these provide good scores for the face classification model.

3.6.1 Cosine similarity

Cosine similarity is a measure of similarity between two given vectors, defined as the cosine of the angle between them. The cosine of 0 is 1, and it is less than 1 for any angle in the interval (0, 180]. As an example, consider two vectors with different magnitudes. The Euclidean distance between these vectors will depend on their magnitudes, even though they may be pointing in similar directions. The cosine similarity does not depend on the magnitudes, but only on the angle between the vectors, and thus results in a better similarity measure.

Given two vectors A and B, the cosine similarity, cos(θ), is calculated as follows:

\[
\mathrm{similarity} = \cos(\theta) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
\]

where $\cdot$ is the dot product and $\lVert \cdot \rVert$ the magnitude.
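A direct NumPy translation of this formula (a minimal sketch):

import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (A . B) / (||A|| * ||B||)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
print(cosine_similarity(a, 2 * a))                         # ~1.0: same direction, different magnitude
print(cosine_similarity(a, np.array([-1.0, -2.0, -3.0])))  # ~-1.0: opposite direction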

The cosine similarity is most commonly used in high-dimensional positive spaces. For example, in information retrieval, each term is assigned a different dimension and a document is characterized by a vector where the value in each dimension corresponds to the number of times the term appears in the document. This characterization of the vector is known as the vector space model. Cosine similarity then gives a useful measure of how similar two documents are likely to be in terms of their subject matter.

A big advantage of cosine similarity is its low complexity, especially for sparse vectors: only the non-zero dimensions need to be considered.

3.6.2 k-nearest neighbors algorithm

In pattern recognition, the k-nearest neighbors algorithm (k-NN) is a method used for classification and regression. It is non-parametric, meaning that it does not make any assumptions on the underlying data distribution, and lazy, meaning that it does not use the training data points to do any generalization.

The input consists of the k closest training examples in the feature space and the output (for the classification case) is a class membership. An object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors. For k=1, the object is simply assigned to the class of its single nearest neighbor.

The training examples are vectors in a multidimensional feature space, each with an assigned class label. The training phase of the algorithm consists of storing the feature vectors and class labels of the training samples. In the classification phase, k is a user-defined constant, and a query vector is classified by assigning the label which is most frequent among the k training samples nearest to that query point. A commonly used distance metric is the Euclidean distance. Choosing the parameter k is not easy. A small value of k means that noise will have a higher influence on the result. A large value makes the class boundaries less distinct and also makes the algorithm computationally expensive. Typically the values of k are odd to avoid ties between classes. Figure 3.17 shows an example of how k affects the predicted class.

Figure 3.17: Example of K-NN classification: For k = 3, the query object is assigned to class B, for k = 6 it’s assigned to class A.

A drawback of basic k-NN classification occurs when the class distribution is skewed. Due to the large number of examples from a frequent class, the neighborhood of a new example will tend to contain more examples of that class, biasing the prediction. One way to overcome this problem is to weight the classification, taking into account the distance from the query point to each of its k nearest neighbors. The class of each of the k nearest points is multiplied by a weight proportional to the inverse of the distance from that point to the query point. Weighting by similarities is often more accurate than simple voting.
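A minimal sketch of the basic (unweighted) k-NN vote with Euclidean distance, using made-up two-dimensional points:

import numpy as np
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    # Classify `query` by majority vote among its k nearest training points.
    distances = np.linalg.norm(train_X - query, axis=1)
    nearest = np.argsort(distances)[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y = ["A", "A", "B", "B"]
print(knn_predict(X, y, np.array([0.2, 0.1])))   # "A": two of the three nearest neighbors are A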


3.7 Frameworks and Tools

In this section, we describe the main tools and frameworks used in this work. YOLO, YOLO9000 and YOLOv3 are officially implemented on the Darknet framework. Darknet is an open source neural network framework written in C and CUDA. It is fast, easy to install, and supports CPU and GPU computation. Sources are available on GitHub (https://github.com/pjreddie/darknet).

OpenCV (Open Source Computer Vision Library, https://opencv.org/) is an open source computer vision and machine learning software library, which includes a comprehensive set of both classic and state-of-the-art algorithms. These algorithms can be used to detect and recognize faces, identify objects, classify human actions in videos, track camera movements, track moving objects, etc. It has C++, Python, Java and MATLAB interfaces and supports Windows, Linux, Android and Mac OS. OpenCV leans mostly towards real-time vision applications and takes advantage of MMX and SSE instructions when available.

Caffe (Convolutional Architecture for Fast Feature Embedding, http://caffe.berkeleyvision.org/) is an open source deep learning framework, written in C++, with a Python interface. It supports many different types of deep learning architectures geared towards image classification and image segmentation, and supports CPU- and GPU-based acceleration computational kernel libraries such as NVIDIA cuDNN.

Django is a free and open-source web framework, written in Python, which follows the model-view-template (MVT) architectural pattern. Django's primary goal is to ease the creation of complex, database-driven websites, with emphasis on reusability and "pluggability" of components, less code, low coupling and rapid development.

MongoDB is a free and open-source cross-platform document-oriented database program. It stores data in flexible, JSON-like documents, and supports ad hoc queries, indexing and real time aggregation. MongoDB offers a Python interface through the PyMongo package.



CUDA (https://developer.nvidia.com/cuda-zone) is a parallel computing platform and programming model developed by Nvidia for general computing on its own GPUs (graphics processing units). CUDA enables developers to speed up compute-intensive applications by harnessing the power of GPUs for the parallelizable part of the computation. cuDNN is a library for deep neural nets built using CUDA. It provides GPU accelerated functionality for common operations in deep neural networks such as forward and backward convolution, pooling, normalization, and activation layers.

MJPG-Streamer (https://github.com/jacksonliam/mjpg-streamer) is a free and open-source command line application that copies JPEG frames from one or more input plugins to multiple output plugins. It can be used to stream JPEG files over an IP-based network from a webcam or a file to various types of viewers capable of receiving MJPG streams.

Various Python packages have been used including numpy, pafy, google-images-download. These can be installed through the pip package management system.



Chapter 4

Contributions of this Thesis

4.1 Application Scenario

This section provides a general description of the system architecture for real-time detection of predefined objects in a video stream provided by a drone. In practice, each detected and recognized object will be identified on the image through a bounding box and will be classified according to its category. The proposed architecture allows the calculations necessary to perform the object detection to be transferred from the on-board device to a high-performance remote server.

For this purpose, we propose a distributed client-server architecture that can also be applied to other use cases.

The requirements to be considered for the final system are:

• Real-time services: the system must provide real-time video flow and detection of predefined objects and faces

• Independence of the object detection algorithm: the architecture is designed in such a way that at any time it is possible to change the objects and their models to be detected and recognized. This is a requirement because the operational scenario can vary, for example, from an urban context (pedestrians, cars and so on) to a rural environment (fields, roads, rivers). Furthermore, it is necessary to give the user of the system the possibility to modify the parameters of the model, in order to be able to decide at execution time the objects and faces to be recognized and the confidence thresholds with which they are recognized


• Web-based user interface: the communication between the client and the server will be through a secure connection (https) supported by all the main browsers, to watch the identified objects in real time

• Independence of the UAV platform: the architecture should be generalizable as much as possible for different types of drones with different loading capacities

• Persistence of data: the video stream as well as the detection data will be stored in a database located in the server for future research and possible re-elaboration with different detection algorithms

Taking the previous requirements into consideration, the architecture design has been divided into three main modules. The first module, the air monitoring station, represents the drone and its payload, which in this scenario consists of a camera and a 5G modem for transmitting the video stream to the ground. The second module, the Processing Center, consists of the control center that receives the video stream through a 5G connection and performs object detection. Finally, the third module, the Control Station, is the client of the system that is used by the user to interact with the system. Figure 4.1 shows the schematic diagram of the architecture. The operation of the system is based on the high performance connectivity provided by the 5G experimental network developed by the project consortium.


Air monitoring station. The payload of the drone consists of a 5G modem that establishes a secure connection through the 5G project network to the processing center, and a high-resolution camera, possibly equipped with optical zoom. The camera shoots video of the area to be monitored and sends it to the ground via the 5G connection. In the simplest scenario, the 5G channel is used unidirectionally to send only the video stream. In a more advanced scenario, it is envisaged to use the 5G channel to send other information or commands, such as camera zoom control.

Processing center. It is the heart of this usage scenario and consists of:

• 5G Gateway. It is a computer equipped with a 5G modem, able to receive the video stream transmitted by the drone and to send it through the local network to the video analysis server. Alternatively, the processing center can be connected to a fixed network and reachable from the 5G network through an appropriate APN

• Video analysis server. It has the task of processing the video received from the 5G Gateway and storing it in a dedicated repository (Video Storage). The server uses a module consisting of one or more video analyzers, based on deep learning techniques, to identify and recognize the objects described in the model database (model DB) within the video frames. Each model is associated with a convolutional neural network that exploits the computing power of a graphics card equipped with GPUs. The models that will be taken into consideration refer to objects present in the city, such as cars, vans, people and faces. With regard to the latter in particular, the model DB contains a specific database of faces to be recognized (belonging, for example, to wanted people) within the video in real time. The spatial and temporal information concerning the recognized objects and faces is stored in the event DB for future visualizations and searches

• Web server. It is a WWW server dedicated to the management of the entire system and to the real-time display of images processed by the video analysis server

Control station. It is the center where the drone and the images processed by the processing center are monitored. The control station is typically located in a different location from that of the Processing Center. Through a web browser, an operator, who is generally different from the one who controls the drone, connects through the network to the web server of the processing center; this operator has the task of both managing the processing center (deciding the objects to be detected) and monitoring the video stream it produces.


4.2 Design of the Processing center

Now, let us take a more detailed look at the processing center. Figure 4.2 shows the architecture of this module. As we said earlier, the processing center consists of the 5G gateway, the video analysis server and the web server. Due to the lack of a drone unit in our work, we have skipped the implementation of the 5G gateway and have instead simulated it using video streams coming from different sources, such as webcams, video files or HTTP links that point to a valid video file.

Figure 4.2: Processing center architecture.

4.2.1 Video analysis server

The video analysis server takes the video stream as input and performs object detection, face detection and face recognition on every frame. The results of these operations are stored in the Event database. A copy of the video, which contains the original unmodified frames, is also stored locally.


Object detection

For the object detection we have used the YOLOv3 and YOLOv3-tiny models provided by the Darknet framework. The weights of both models are publicly available on the Darknet website (https://pjreddie.com/media/files/yolov3.weights, https://pjreddie.com/media/files/yolov3-tiny.weights).

YOLO returns, as the result of the detection function, a series of records having the structure shown in figure 4.3, where confidence (with values between 0 and 1) is the IOU between the predicted class and any ground truth class, x and y are the coordinates of the midpoint of the bounding box, and w and h are the width and the height. The x and y coordinates are transformed to the coordinates of the:

• Upper-left corner by [x1 = x - w/2, y1 = y - h/2]

• Lower-right corner by [x2 = x1 + w, y2 = y1 + h]

Figure 4.3: YOLO(Darknet) detection results.

For every object that YOLO has detected in the frame, we first check if the detection corresponds to one of the detection classes specified by the user: if a match is not found we do nothing; if a match is found we check the confidence of the detection. This value is compared to the value requested by the user, and only confidence values higher than the one the user has specified are considered. The classes that we have used in our tests, and that can be selected by the user, are taken from the COCO dataset (http://cocodataset.org/). This dataset contains 80 object categories, which are shown in figure 4.4.

Figure 4.4: COCO dataset categories
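A minimal sketch of this filtering step (the record layout follows figure 4.3; names such as filter_detections, selected_classes and conf_threshold are illustrative, not the exact ones used in our code):

# Each detection is assumed to be a tuple (class_name, confidence, (x, y, w, h)),
# following the structure of figure 4.3.
def filter_detections(detections, selected_classes, conf_threshold):
    """Keep only the detections whose class was selected by the user
    and whose confidence exceeds the user-defined threshold."""
    kept = []
    for class_name, confidence, box in detections:
        if class_name not in selected_classes:
            continue                    # class not requested by the user: ignore
        if confidence <= conf_threshold:
            continue                    # confidence too low: ignore
        kept.append((class_name, confidence, box))
    return kept

# Example: keep only people and cars detected with confidence above 0.5
# filtered = filter_detections(yolo_results, {"person", "car"}, 0.5)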

1 https://pjreddie.com/media/files/yolov3.weights
2 https://pjreddie.com/media/files/yolov3-tiny.weights
3 http://cocodataset.org/

When YOLO detects an object in the frame, the size of the bounding box is proportional to the size of the object. If YOLO can see only half of a specific object, but can still correctly determine its class, it will try to estimate the box coordinates of the missing half. In these cases, the coordinates returned by YOLO may fall outside the frame, i.e. they can be negative or larger than the frame resolution. Thus, we must make sure to clamp these coordinates so that they fit inside the frame.
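A sketch of this coordinate handling, converting the center-based YOLO box to corner coordinates and clamping them to the frame (names are illustrative):

def yolo_box_to_corners(x, y, w, h, frame_w, frame_h):
    """Convert a YOLO box (center x, center y, width, height) into
    upper-left / lower-right corners, clamped inside the frame."""
    x1 = x - w / 2
    y1 = y - h / 2
    x2 = x1 + w
    y2 = y1 + h
    # Clamp the corners so they never fall outside the frame.
    x1 = max(0, min(int(x1), frame_w - 1))
    y1 = max(0, min(int(y1), frame_h - 1))
    x2 = max(0, min(int(x2), frame_w - 1))
    y2 = max(0, min(int(y2), frame_h - 1))
    return x1, y1, x2, y2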

Figures 4.5, 4.6 show a visual comparison between the two YOLO versions.

Figure 4.5: YOLOv3-tiny example Figure 4.6: YOLOv3 example

Face detection and recognition

For the face detection task, we use the model provided by OpenCV, which is based on the SSD framework with a ResNet-10 architecture. The model, named "res10_300x300_ssd_iter_140000" and implemented as a Caffe model, can be downloaded from the OpenCV extra repository1.

This network returns, as a result of the forward function, a series of records having the structure shown in figure 4.7, where x and y are the coordinates of the upper-left corner, w and h are the width and height of the bounding box of a face, and confidence (with values between 0 and 1) indicates how confident the network is that the bounding box contains a face. For every face that the model detects in the frame, we first compare the confidence value to the one requested by the user. Only detections with a confidence higher than the user-specified threshold are considered. Figure 4.8 shows an example of the OpenCV face detector.

Figure 4.7: OpenCV face detector results
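As an illustration, the detector can be run through OpenCV's dnn module as sketched below; the raw output is a list of detections with normalized corner coordinates, which can then be converted to the (x, y, w, h) form of figure 4.7. The file paths are assumptions and depend on the local installation.

import cv2
import numpy as np

# Paths to the Caffe model files are assumptions; adapt to your installation.
net = cv2.dnn.readNetFromCaffe("deploy.prototxt",
                               "res10_300x300_ssd_iter_140000.caffemodel")

def detect_faces(frame, conf_threshold):
    """Return [(confidence, (x1, y1, x2, y2)), ...] for faces in the frame."""
    h, w = frame.shape[:2]
    blob = cv2.dnn.blobFromImage(cv2.resize(frame, (300, 300)), 1.0,
                                 (300, 300), (104.0, 177.0, 123.0))
    net.setInput(blob)
    detections = net.forward()          # shape: (1, 1, N, 7)
    faces = []
    for i in range(detections.shape[2]):
        confidence = float(detections[0, 0, i, 2])
        if confidence <= conf_threshold:
            continue
        # Coordinates are returned normalized in [0, 1]; rescale to pixels.
        box = detections[0, 0, i, 3:7] * np.array([w, h, w, h])
        x1, y1, x2, y2 = box.astype(int)
        faces.append((confidence, (x1, y1, x2, y2)))
    return faces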

The same considerations as with YOLO regarding the bounding box coordinates apply here: the values returned by the model may fall outside the frame, so we must make sure to clamp these coordinates so that they fit inside the frame.

Figure 4.8: Face detection example

For the face recognition task, we use the model provided by the VGGFace2 authors, which is based on the ResNet-50 architecture and is trained on the VGGFace2 dataset. The network is implemented as a Caffe model and can be downloaded from1. For every face that was detected by the previous model, we generate smaller frames which contain only the face pixel values, using the bounding box coordinates. Then, we take the data layer of the recognition network and reshape it according to the number of detected faces. Each face is set as input to the model. At this point we do a forward pass through the network.

Actually, we do not exploit the classification output of the VGGFace2 model, since it has been trained on a face dataset that is different from the faces we want to recognize in our scenario. Instead, we exploit the features extracted from one of the layers of the network and use them to build a kNN classifier. In particular, we extract the features located at the output of the "pool5/7x7_s1" layer. These features are stored as a vector of 2,048 floats. For each extracted feature vector, we compute the cosine similarity to the feature vectors loaded from the model database. These values are stored in memory as a series of records of the form [class, cosine similarity]. After computing all the similarities, the records are sorted by cosine score in decreasing order. From the sorted records we apply the kNN algorithm by taking the top k scores: scores belonging to the same class are summed together. The class with the highest accumulated score is assigned to the recognized face and is drawn on the frame. Figure 4.9 shows an example of both detection and recognition. Figure 4.10 shows an example of the kNN classifier.
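A minimal sketch of this classification step, assuming model_db is the list of [class, feature vector] entries loaded from the model database and features is the 2,048-dimensional vector extracted from the "pool5/7x7_s1" layer (function names are illustrative; the recognition threshold is applied to the best similarity, as described later in the GUI section):

import numpy as np
from collections import defaultdict

def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def knn_classify(features, model_db, k, threshold):
    """Classify a face feature vector against the model database.

    model_db is a list of (class_name, feature_vector) pairs. Returns the
    winning class, or 'unknown' if the best similarity is below threshold."""
    # Score every stored feature by cosine similarity and sort in decreasing order.
    scores = [(cls, cosine_similarity(features, feat)) for cls, feat in model_db]
    scores.sort(key=lambda item: item[1], reverse=True)
    if not scores or scores[0][1] < threshold:
        return "unknown"
    # Sum the similarities of the top-k entries per class and take the best class.
    votes = defaultdict(float)
    for cls, sim in scores[:k]:
        votes[cls] += sim
    return max(votes, key=votes.get)

Summing the similarities, instead of simply counting votes, lets very similar reference images weigh more than marginal ones among the top k neighbors.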


Figure 4.9: Face detection and recognition example

Figure 4.10: Example of the kNN classifier

Face Recognition using the kNN Classifier

We did not use the SeNet-50 model1 for classification since it uses a non-standard layer which is not included in the master branch of Caffe. In any case, since the performance of the ResNet-50 version is practically the same, we decided to use the latter for the implementation and testing.

We now briefly discuss other classification methods and explain why we chose to use a combination of cosine similarity and the kNN algorithm for face classification in our system.


Neural networks can learn to recognize new sets of faces by training them on representative images of each face class. Two problems arise with this approach. First, we would need a large number of example images (ranging from 50 to 100) for each new class. These images must also contain variations in age, pose and illumination, as we discussed in the VGGFace2 section. Second, for every newly added class we would need to retrain the network, which is an expensive process in terms of processing time. The advantage of this approach is that the network would achieve a better classification accuracy.

Cosine similarity and kNN, on the other hand, are computationally inexpensive operations. They work even with a limited number of example images (e.g. 5-10). The main disadvantage is that this approach is less accurate than the previous one, and it also requires fine tuning of the parameter k.

4.2.2 Database of models

The database of models contains all the face features that were extracted from existing images of known individuals, using the ResNet-50 "pool5/7x7_s1" layer. These features are currently stored as binary files inside directories named after each class. During the loading operation the folder name is used as the class name, and an entry of this data structure has the form [class, features]. This storage can be updated by the user, as described in the GUI section, and is reloaded after each such operation.

In our implementation we stored only faces in this database but, in general, we could also include other types of objects such as cars, license plates, etc. This makes the system more flexible by allowing it to adapt to users' needs. As an example, it could be used for car recognition systems [7].

4.2.3 Event Database

The results from the detection and recognition tasks are stored inside the event DB. We have implemented this centralized database with MongoDB.

In MongoDB, we insert new documents using the JSON format. Each detected face and object is stored in this database using the format shown in figure 4.11. The type field can be either object or face. The name field contains the category of the detected object or the identity of the recognized face. In cases where a face is detected but not recognized, we store the value 'unknown'. The bounding box coordinates are stored as the coordinates of the upper-left (x1, y1) and lower-right (x2, y2) corners.


Figure 4.11: Event database schema

For performance considerations we have decided not to store the respective results if the user has disabled the object or face detection.
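A sketch of how one such event document might be inserted with pymongo (the type, name and corner fields follow the description above; the database and collection names, as well as the frame and timestamp fields, are assumptions):

from datetime import datetime
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
events = client["surveillance"]["events"]       # assumed database/collection names

def store_event(event_type, name, confidence, box, frame_number):
    """Insert one detection/recognition result into the event DB."""
    x1, y1, x2, y2 = box
    events.insert_one({
        "type": event_type,          # "object" or "face"
        "name": name,                # category, person name, or "unknown"
        "confidence": confidence,
        "x1": x1, "y1": y1, "x2": x2, "y2": y2,
        "frame": frame_number,
        "timestamp": datetime.utcnow(),
    })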

4.2.4 Web Server

The web server serves as an interface between the user and the video analysis server. It is composed of a front end and a back end.

The web-based GUI front end allows the user to start the detection/recognition process and to configure the various parameters. It has been written in HTML, CSS and Javascript. We have used Bootstrap to simplify the development and management of the graphical components. The GUI is shown in figure 4.12.

The first buttons allow the user to select the video source:

• For the webcam source, we use the WebRTC API to open a webcam stream right from the user’s browser. A JPEG image is captured with a specific frequency and is sent via websocket to the backend

• The file option allows the user to upload a video file from his local computer

• The link option allows reading a video file from HTTP links or from youtube.com

Through the radio buttons, the user can select which object detection model to use, between YOLOv3 and YOLOv3-tiny. If there is no need for object detection, it can be disabled by holding the Control key and clicking on it. The face recognition checkbox enables/disables the face detection and recognition task.


Figure 4.12: Web-based GUI

The user can also configure the following parameters:

• The detection threshold of the face detection model

• The parameter k of the kNN classification and the threshold of the recognition task: these are used to fine tune the results of the face recognition. We have to use this threshold because, otherwise, the closest face from the model database would always be chosen as recognized in the frame.

• The option to add new models to the model database, from Google or from the video: new models can be added to the database by downloading the top 10 images from the Google search results of the specified term. The option to add models directly from the video is particularly useful for tracking unknown individuals. In this case the user must assign a name to the class beforehand.

Finally, through the class selector, the user can choose which categories will be detected by YOLO. Multiple classes can be chosen by holding down the Control key. Each of the parameters above can be changed at any time, even during the execution of the system.

The web server back end has been written in Python using the Django framework. It has two main functions, which in Django’s terminology are called views. A view is simply a Python function that takes a Web request and returns a Web response. The first function is used to set or update some internal variables of the video analysis server and to launch the detection/recognition task through a Python thread. The second function is used to stop the task.
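A simplified sketch of how these two views might look (the analyzer module and its configure/run/stop functions are illustrative placeholders for the video analysis server interface, not its actual API):

# views.py -- simplified sketch of the two Django views
import threading
from django.http import JsonResponse

import analyzer   # placeholder module wrapping the video analysis server

def start_analysis(request):
    """Update the analyzer configuration and launch the processing thread."""
    analyzer.configure(
        source=request.POST.get("source"),
        model=request.POST.get("model", "yolov3"),
        conf_threshold=float(request.POST.get("threshold", 0.5)),
    )
    threading.Thread(target=analyzer.run, daemon=True).start()
    return JsonResponse({"status": "started"})

def stop_analysis(request):
    """Signal the processing thread to terminate."""
    analyzer.stop()
    return JsonResponse({"status": "stopped"})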


Every frame that has been processed by the video analyzer is saved in memory as a JPEG image and is streamed as an HTTP MJPEG stream. The front end connects to the location of this stream and displays it on the GUI.
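One common way to produce such a stream in Django is a StreamingHttpResponse wrapping a multipart/x-mixed-replace generator; a minimal sketch, assuming a hypothetical analyzer.latest_jpeg() helper that returns the most recent processed frame as JPEG bytes:

import time
from django.http import StreamingHttpResponse

import analyzer   # placeholder: exposes the latest processed frame as JPEG bytes

def frame_generator():
    """Yield processed frames as a multipart MJPEG stream."""
    while True:
        jpeg = analyzer.latest_jpeg()
        yield (b"--frame\r\n"
               b"Content-Type: image/jpeg\r\n\r\n" + jpeg + b"\r\n")
        time.sleep(0.03)    # roughly 30 frames per second

def video_feed(request):
    return StreamingHttpResponse(
        frame_generator(),
        content_type="multipart/x-mixed-replace; boundary=frame")

The front end only needs to point an <img> element at the URL mapped to video_feed, since browsers render multipart JPEG streams natively.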

4.2.5 Implementation

The implementation environment was a server equipped with an Intel Core i7-6800K 3.40 GHz CPU, 32 GB of RAM and two NVIDIA GeForce GTX 1080 GPUs. The operating system was Ubuntu 16.04 LTS.

Deep learning models are computationally very heavy, and thus their execution times are non-negligible when run on the CPU. Graphics Processing Units (GPU) are typically employed to speed up these processing-intensive operations. Implementing GPU acceleration on NVIDIA hardware requires the installation of the CUDA and cuDNN libraries; we used CUDA 8 and cuDNN 7.5. Both Darknet and Caffe support GPU acceleration, and the option must be enabled before compiling them. Unfortunately, the OpenCV dnn module does not yet support GPU acceleration. Running the face detection model with Caffe is quite problematic since it contains non-standard layers which are not recognized by Caffe. However, this task is less expensive in terms of computational overhead with respect to the object detection and face recognition tasks. For this reason, we decided to run the face detection module on the CPU.

Darknet uses its own format for storing images in memory. After calling the detection function from Python, we would have to transform that image into a NumPy array in order to do further processing. A way to avoid this is to change the source code of Darknet so that the detection function directly returns a NumPy array. The changes that have to be made are listed in Appendix A.


Chapter 5

Performance Evaluation

In this chapter, we describe the experiments that we have conducted to evaluate the performance of our system.

5.1 Object Detection

In this case we compared the performance of YOLOv3 and YOLOv3-tiny in terms of running time, average number of objects detected and number of different classes detected. The running time is expressed as the average number of frames per second, computed as the total number of processed frames divided by the total processing time (i.e., the inverse of the average per-frame processing time). The average number of objects is computed as the total number of detected objects divided by the total number of frames. All the classes were selected for detection in all these tests.
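Expressed as formulas, with N the total number of processed frames, t_i the processing time of frame i and D the total number of detected objects:

\[ \mathrm{FPS}_{\mathrm{avg}} = \frac{N}{\sum_{i=1}^{N} t_i}, \qquad \overline{\mathrm{objects}} = \frac{D}{N} \]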

We ran the detectors on three different scenarios. In the first scenario the drone camera is positioned at ground level in a city context. In the second scenario the drone is flying over a city. In the third scenario the drone is flying over the countryside. Tables 5.1, 5.2 and 5.3 show the results of the tests.

Model          Scenario 1   Scenario 2   Scenario 3
YOLOv3-tiny        36           36           36
YOLOv3             22           22           22
