Dipartimento di Ingegneria dell’Informazione
Corso di Laurea Specialistica in Computer Engineering

Design and implementation of a system for real-time moving object detection and classification in video streams

Supervisors: Fabrizio Falchi, Claudio Gennaro, Giuseppe Amato
Candidate: Leonardo Agueci
Academic Year 2015-2016

[...] supported and encouraged me

To the friends and colleagues with whom I shared my university years

To Diletta, who made these last years unforgettable

Acknowledgements

I would like to thank everyone who played a fundamental role in the writing of this thesis; to them goes all my gratitude. In particular, I wish to thank my supervisors Fabrizio Falchi, Claudio Gennaro and Giuseppe Amato, whose advice was invaluable in completing this work in the best possible way. Thanks also go to everyone at the RedLab for making these last months very pleasant, despite the occasional joke too many. Finally, I thank Alex Moser for his patience and availability whenever I asked for his advice.

Abstract

The aim of this study is to design and implement a complete system for moving object detection and classification that works in real-time. The system is designed as a pipeline of processing steps applied to the incoming video stream. It is composed of three sub-modules: Object Detection, Object Tracking and Object Classification. First, for the object detection, a Gaussian Mixture-based background/foreground segmentation algorithm named MOG2 is used. Background subtraction is a widely used approach for detecting moving objects in videos recorded under steady conditions. Second, for the classification task, a Convolutional Neural Network is designed and trained. CNNs are neural network architectures belonging to the Deep Learning field. These types of networks need to be trained with a large amount of data, hence an ad-hoc dataset is created for this purpose by combining different datasets. Since the system is intended for traffic monitoring, the identified object classes are car, motorbike, tram, van, bicycle, person, truck and bus. Different networks have been tested in order to achieve the best trade-off between efficiency and accuracy. The framework used for training and testing the networks is Caffe, one of the most widely used frameworks in computer vision. In order to simplify the training, the NVIDIA Deep Learning GPU Training System (DIGITS) is employed. The system presented in this work is developed exploiting the OpenCV and Caffe libraries and is entirely written in C++. To reduce the number of objects to classify in each frame, a tracking mechanism is also designed and implemented. This mechanism exploits the center of mass computed on each detected object to determine whether the object was already present in previous frames. Several parameters are made available that can be tuned by the user in order to adapt the system to the demands of performance and the context in which it is deployed. A statistical study of these parameters has been carried out in order to show how the system behaves when they change.

Contents

Acknowledgements ii

Abstract iii

Contents iv

List of Figures vi

List of Tables viii

1 Introduction 1

1.1 Background . . . 3

1.1.1 Background Subtraction Methods . . . 3

1.1.2 Convolutional Neural Networks . . . 4

1.2 Technologies Used . . . 9
1.2.1 OpenCV . . . 9
1.2.2 Caffe . . . 9
1.2.3 DIGITS . . . 10
1.3 Related Work . . . 11
2 Architecture 13
2.1 Goals and Assumptions . . . 13

2.2 System Architecture . . . 15
2.2.1 Object Detection . . . 15
2.2.2 Object Tracking . . . 15
2.2.3 Object Classification . . . 16
2.2.4 Output . . . 16
3 Object Detection 17
3.1 Gaussian Mixture Model . . . 18

3.2 Pre and Post Processing Operations . . . 19

4 Object Tracking 23
4.1 The Track Class . . . 24


5 Object Classification 28

5.1 Alex-Net . . . 29

5.2 Squeeze-Net . . . 29

5.3 The Dataset . . . 32

5.4 Training, Validation and Testing . . . 32

5.4.1 Results . . . 34

6 Experimental Results 39
6.1 Object Detection . . . 39

6.1.1 Maximum Number of Objects per Frame . . . 40

6.2 Object Classification . . . 41
6.2.1 Probability Threshold . . . 44
6.3 Object Tracking . . . 45
6.3.1 Thresholds Evaluation . . . 45
6.4 Speed Evaluation . . . 48
7 Conclusions 50
Bibliography 52

List of Figures

1.1 Background Subtraction Scheme . . . 3

1.2 Convolutional Neural Networks Scheme . . . 5

1.3 The Convolutional Operation . . . 6

1.4 The ReLU Activation Function . . . 7

1.5 The Max Pooling Operation . . . 7

1.6 Dropout Neural Net Model . . . 8

1.7 Fusion Method Architecture . . . 11

1.8 YOLO Architecture . . . 12

2.1 Raspberry Pi . . . 14

2.2 System Architecture . . . 15

2.3 System Output . . . 16

3.1 Object Detection Phase . . . 17

3.2 Background Subtraction based on GMM . . . 19

3.3 Shadow Detection . . . 20

3.4 Gaussian Filter . . . 21

3.5 Dilate and Erode Operations . . . 21

3.6 Removing Objects Operation . . . 22

4.1 Object Tracking Phase . . . 23

5.1 Object Classification Phase . . . 28

5.2 Alex-Net Structure . . . 29

5.3 Fire Layer: Convolutional Filter . . . 30

5.4 Squeeze-Net Structure . . . 31

5.5 The Dataset . . . 32

5.6 Overtraining Example . . . 33

5.7 Accuracy and Loss function Alex-Net 227x227x3 . . . 35

5.8 Accuracy and Loss function Squeeze-Net 227x227x3 . . . 35

5.9 Accuracy and Loss function Squeeze-Net 114x114x3 . . . 36

5.10 Accuracy and Loss function Squeeze-Net 114x114x3 low resolution . . . 37

5.11 CNNs Actual Speed Comparison . . . 38

6.1 Number of Frames having a given Number of Objects . . . 40

6.2 Cumulative Percentage of Frames having at most a given Number of Objects 40
6.3 Single Class Evaluation Measures . . . 43

6.4 FP and FN Rate when Probability Threshold varies . . . 44


6.6 Number of Objects classified when Distance Threshold varies . . . 46

6.7 F1 Score Trend when Average Color Threshold varies . . . 47

6.8 Number of Objects classified when Average Color Threshold varies . . . 47

6.9 Tracking Module enabled and disabled Comparison . . . 48

List of Tables

5.1 Accuracy Summary . . . 37

5.2 CNNs Relative Speed Comparison . . . 38

6.1 Performance Table for Instances labeled with a Class Label A . . . 41

6.2 Micro-averaged Measures . . . 44

6.3 Macro-averaged Measures . . . 44

6.4 Comparison with different Parameter Configurations . . . 48

6.5 Average and Minimum FPS on a Laptop . . . 49

1 Introduction

Moving object detection and classification has become a very popular topic in the last few years: it is indeed one of the most important tasks in many applications, such as surveillance, camera-only active safety systems, intelligent autonomous vehicles, and robotic vision. The detection task finds the locations and sizes of moving objects in natural scene images, while the classification task assigns the detected moving objects to their respective categories.

The most challenging part, in this field, is the real-time constraint: the algorithms for object detection and classification often require powerful hardware in order to work in real-time, even more so now that Convolutional Neural Networks are becoming widespread. They have in fact proven to be very effective in areas such as image recognition and classification, but they also pose an additional problem in terms of computational power.

Recent developments in deep learning approaches have greatly advanced the performance of state-of-the-art visual recognition systems, and this work is intended as a further step. In fact, the aim of this study is to design and implement a complete system for moving object detection and classification that works in real-time. By further reducing the computational time, it may be possible to also run the system on embedded devices, such as smart cameras, whose limited hardware resources make the real-time constraint challenging.

In this work, a Moving Object Detection and Classification System is presented: provided with a video stream from a static camera, the system is able to detect moving objects and classify them into one of the possible categories. Moreover, the system is able to track objects over time, namely it can identify whether an object was already present in a previous frame or not, in order to prevent reclassification. Several tunable parameters are made available to the user, so that the system can be adapted to the performance requirements and the context in which it is deployed.

Notice that the system is designed for traffic monitoring, but with the appropriate modifications it can be adapted and used for any application that requires moving object detection and classification.

The system has been tested using a video stream of approximately 4 minutes recorded under realistic conditions, in order to evaluate the efficiency in terms of computational time (FPS) and the accuracy of the classification task. The video has been manually annotated to be used as Ground Truth.

Thesis Structure

In the rest of Chapter 1, the most commonly used approaches for foreground extraction are briefly described and Convolutional Neural Networks (CNNs) are introduced. Moreover, the technologies used are reported and some related works are cited.

In Chapter 2, the application scenario is described and system assumptions and specifications are listed. Furthermore, the chosen system architecture is described and all the submodules are briefly introduced.

In Chapter 3, the first submodule, dedicated to the detection of moving objects in the video stream, is described. The implemented technique is illustrated in detail, together with the pre/post-processing operations employed.

In Chapter 4, the second submodule is described. This phase is dedicated to object tracking. The implemented tracking algorithm is described and several tuning parameters are presented.

In Chapter 5, the third and last submodule, dedicated to object classification, is described. Different networks, with their structures, are presented.

In Chapter 6, the experiments performed and the evaluation of the presented system, both in terms of accuracy and efficiency, are reported.


1.1 Background

In this section, the techniques used for the detection of moving objects and those used for their classification are briefly introduced.

1.1.1 Background Subtraction Methods

Background Subtraction (or BS), also known as foreground detection, is a technique employed in fields like image processing and computer vision, wherein an image’s foreground is extracted for further processing (object recognition, etc.). Generally, an image’s regions of interest are the objects (humans, cars, text, etc.) in its foreground.

BS is a widely used approach for detecting moving objects in videos taken with static cameras. The rationale in this approach is that of detecting the moving objects through the difference between the current frame and a reference frame, often called "background image" or "background model" (see Figure 1.1). BS is mostly applied when the image in question is part of a video stream. BS provides important cues for several applications in computer vision, for example surveillance tracking or human pose estimation.

Figure 1.1

Background Subtraction: Frame differencing scheme, taken from [22]


Basic Models Essentially, the BS process consists of creating a background model. In the simplest case, this can be done by manually setting a static image that represents the background and contains no moving objects; next, for each video frame, the absolute difference between the current frame and the static image is computed. This method is called Static Frame Difference. However, a static image is not always the best choice: if the ambient lighting changes, the foreground segmentation may fail dramatically. Alternatively, it is possible to use the previous frame rather than a static image. This approach, called Frame Difference, works with some background changes, but fails if the moving objects suddenly stop. Other approaches are presented in [5, 7, 16].
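As a minimal illustration of the two basic models just described, the following C++/OpenCV sketch (an illustrative assumption, not code taken from this thesis) computes a foreground mask by absolute frame differencing; passing the previous frame as the reference instead of a fixed image turns Static Frame Difference into Frame Difference. The threshold value is an arbitrary choice for the example.

    #include <opencv2/opencv.hpp>

    // Static Frame Difference: |current - reference| thresholded to a binary
    // foreground mask. Using the previous frame as "reference" gives the
    // Frame Difference variant instead.
    cv::Mat frameDifference(const cv::Mat& current, const cv::Mat& reference)
    {
        cv::Mat diff, gray, mask;
        cv::absdiff(current, reference, diff);             // per-pixel |a - b|
        cv::cvtColor(diff, gray, cv::COLOR_BGR2GRAY);      // collapse channels
        cv::threshold(gray, mask, 30, 255, cv::THRESH_BINARY);
        return mask;                                       // 255 = foreground
    }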

Statistical Models For situations in which some background objects are not perfectly static and/or the global illumination is not constant in time, a more elaborate BS strategy is required. In this perspective, many authors propose modeling each background pixel with a probability density function (PDF) learned over a series of training frames. In this case, the motion detection problem often becomes a PDF-thresholding problem [4].

Fuzzy Models Recently, some authors have introduced fuzzy concepts in the different steps of the background subtraction process [3]. In [25], the authors perform BS through a similarity measure between color and texture features of the input image and the background model, using the Sugeno Integral [21]. Later, Baf et al. obtained better results [1] using the Choquet Integral [6].

Other Models BS has also been achieved by many other methodologies. For example the computation of eigenvalues and eigenvectors [17] or the combination of histograms and Bayesian inference [10].

1.1.2 Convolutional Neural Networks

Machine-learning technology powers many aspects of modern society. Machine-learning systems are used, for example, to identify objects in images, match new items and select relevant search results. Representation Learning is a set of methods that allows a machine to be fed with raw data and to automatically discover the representations needed for detection or classification. Deep-learning methods are representation learning methods with multiple levels of representation, obtained by composing simple but non-linear modules that each transform the representation at one level (starting with the raw input) into a representation at a higher, slightly more abstract level [15].

A Neural Network is a machine-learning technique modeled after the structure of the brain. It comprises a network of learning units called neurons. These neurons learn how to convert input signals (e.g. the picture of a cat) into corresponding output signals (e.g. the label “cat”), forming the basis of automated recognition.

Convolutional Neural Networks [24] are a special type of feed-forward networks. Goodfellow et al. [11] claim:

“Convolutional Networks are simply neural networks that use convolution in place of general matrix multiplication in at least one of their layers”.

Such models are designed to emulate the behavior of the visual cortex. They are a category of Neural Networks that have proven to be very effective in areas such as image recognition and classification. CNNs have been successful in identifying faces, objects and traffic signs, as well as powering vision in robots and self-driving cars.

The simplest architecture of a CNN starts with an input layer (images) followed by a sequence of convolutional layers and pooling layers, and ends with fully-connected layers. The convolutional layers are usually followed by one layer of ReLU activation functions (see Figure 1.2).

Figure 1.2

Convolutional Neural Networks Scheme

Convolutional Layer This layer consists of a set of learnable filters that are slid over the image in a spatial manner, computing dot products between the entries of the filter and the input image. In practice, a CNN learns the values of these filters on its own during the training process. The primary purpose of convolution, in the case of a CNN, is to extract features from the input image. Convolution preserves the spatial relationship between pixels.

In Figure 1.3 the convolution operation is shown: the orange matrix (filter) is slid over the original image (green) by 1 pixel (also called the "stride") and, for every position, the element-wise multiplication (between the two matrices) is computed and the multiplication outputs are added to get the final integer, which forms a single element of the output matrix (pink). Note that the 3x3 matrix "sees" only a part of the input image at each stride.

Figure 1.3

The Convolutional Operation
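To make the sliding-filter computation concrete, the short C++ sketch below (purely illustrative; the sizes and values are assumptions, not taken from the thesis) applies a 3x3 filter with stride 1 to a 5x5 single-channel input, producing a 3x3 feature map exactly as in the multiply-and-sum procedure described above.

    #include <cstdio>

    int main() {
        const int IN = 5, K = 3, OUT = IN - K + 1;   // 5x5 input, 3x3 filter, stride 1
        int input[IN][IN] = {
            {1,1,1,0,0}, {0,1,1,1,0}, {0,0,1,1,1}, {0,0,1,1,0}, {0,1,1,0,0}};
        int filter[K][K] = {{1,0,1}, {0,1,0}, {1,0,1}};   // learnable in a real CNN

        int output[OUT][OUT] = {};
        for (int r = 0; r < OUT; ++r)                // slide the filter over the image
            for (int c = 0; c < OUT; ++c)
                for (int i = 0; i < K; ++i)
                    for (int j = 0; j < K; ++j)
                        output[r][c] += input[r + i][c + j] * filter[i][j];

        for (int r = 0; r < OUT; ++r) {              // print the 3x3 feature map
            for (int c = 0; c < OUT; ++c) printf("%d ", output[r][c]);
            printf("\n");
        }
        return 0;
    }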

ReLU Layer ReLU is the activation function used in this kind of layer. It stands for Rectified Linear Unit and is a non-linear operation. Its output is given by:

Output = max(0, Input)

ReLU is an element-wise operation (applied per pixel) which replaces all negative pixel values in the feature map with zero. The purpose of ReLU is to introduce non-linearity into CNNs, since most real-world data is non-linear.

Notice that other non-linear functions such as tanh or sigmoid can also be used instead of ReLU, but ReLU has been found to perform better in most situations.


Figure 1.4

The ReLU Activation Function

Pooling Layer Pooling is a form of non-linear down-sampling. The goal of the pooling layer is to progressively reduce the spatial size of the representation, by retaining only the most important information, in order to lower the number of parameters and the computation in the network, thus helping to control over-fitting. A pooling function replaces the output of the network at a certain location with a summary statistic of the nearby outputs. Several functions are used to implement pooling; the most common one is Max Pooling, which reports the maximum output within a rectangular neighborhood (see Figure 1.5).

Figure 1.5

Max Pooling with a 2x2 Filter
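For concreteness, the toy C++ sketch below (illustrative values, not data from the thesis) applies the two operations just described to a 4x4 feature map: element-wise ReLU followed by 2x2 max pooling with stride 2, which reduces the map to 2x2.

    #include <algorithm>
    #include <cstdio>

    int main() {
        float fm[4][4] = {{ 1.f, -2.f,  3.f, 0.f},
                          {-1.f,  5.f, -4.f, 2.f},
                          { 0.f,  6.f, -1.f, 7.f},
                          { 4.f, -3.f,  2.f, 1.f}};

        // ReLU: replace every negative activation with zero.
        for (auto& row : fm)
            for (float& v : row) v = std::max(0.f, v);

        // 2x2 max pooling with stride 2: keep the maximum of each block.
        float pooled[2][2];
        for (int r = 0; r < 2; ++r)
            for (int c = 0; c < 2; ++c)
                pooled[r][c] = std::max({fm[2*r][2*c],   fm[2*r][2*c+1],
                                         fm[2*r+1][2*c], fm[2*r+1][2*c+1]});

        for (auto& row : pooled) printf("%.0f %.0f\n", row[0], row[1]);
        return 0;
    }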

Fully Connected Layer Finally, after several convolutional and max pooling layers, the high-level reasoning in the neural network is done via fully connected layers. A fully connected layer takes all the neurons in the previous layer (be it fully connected, pooling, or convolutional) and connects them to every single neuron it has.


Dropout CNNs with a large number of parameters are very powerful machine learning systems. However, over-fitting is a serious problem in such networks. Dropout is a technique for addressing this problem: the key idea is to randomly drop units (along with their connections) from the neural network during training (see Figure 1.6). This prevents units from co-adapting too much.

Figure 1.6

Dropout Neural Net Model: (a) a standard neural net with two hidden layers. (b) a thinned net produced by applying dropout to the network.

Learning process It consists of two steps, the forward and backward passes, which are conducted for all objects in a training set. In the forward pass, each layer transforms the output of the previous layer according to its function. The output of the last layer is compared with the label values and the loss function is computed. In the backward pass, the derivatives of the loss function with respect to the outputs are computed consecutively, from the last layer to the first, together with the derivatives with respect to the weights. After that, the weights are changed in the direction which decreases the value of the loss function. This process is performed simultaneously for a batch of objects in order to decrease the sample bias. The processing of all objects in the dataset is called an epoch. Training usually consists of many epochs, conducted with different batch splits, i.e. at each epoch the objects in the dataset are given to the CNN in a different order.


1.2 Technologies Used

In this section, the used technologies (libraries, frameworks, tools) are briefly described.

1.2.1 OpenCV

OpenCV (Open Source Computer Vision Library, http://opencv.org/) is an open source computer vision and machine learning software library. OpenCV was built to provide a common infrastructure for computer vision applications and to accelerate the use of machine perception in commercial products. The library has more than 2500 optimized algorithms, which include a comprehensive set of both classic and state-of-the-art computer vision and machine learning algorithms. These algorithms can be used to detect and recognize faces, identify objects, classify human actions in videos, track camera movements, track moving objects, extract 3D models of objects, etc. OpenCV has more than 47 thousand users in its community and an estimated number of downloads exceeding 7 million. The library is used extensively by companies, research groups and governmental bodies. It has C++, C, Python, Java and MATLAB interfaces and supports Windows, Linux, Android and Mac OS. There are over 500 algorithms and about 10 times as many functions that compose or support those algorithms. OpenCV is written natively in C++.

1.2.2 Caffe

Caffe is a deep learning framework made with expression, speed, and modularity in mind. This popular computer vision framework is developed by the Berkeley Vision and Learning Center (BVLC), as well as community contributors. Its expressive architecture encourages application and innovation: models and optimization are defined by configuration without hard-coding, and switching between CPU and GPU requires setting a single flag, so a model can be trained on a GPU machine and then deployed to commodity clusters or mobile devices. Caffe powers academic research projects, startup prototypes, and large-scale industrial applications in vision, speech, and multimedia.



1.2.3 DIGITS

DIGITS (the NVIDIA Deep Learning GPU Training System) is a web application for training deep learning models. It puts the power of deep learning into the hands of engineers and data scientists. DIGITS can be used to rapidly train highly accurate CNNs for image classification, segmentation and object detection tasks. DIGITS simplifies common deep learning tasks such as managing data, designing and training neural networks on multi-GPU systems, monitoring performance in real time with advanced visualizations, and selecting the best performing model from the results browser for deployment. DIGITS is completely interactive, so that data scientists can focus on designing and training networks rather than programming and debugging.


1.3 Related Work

A Real-Time Moving Object Detection and Classification Approach for Static Cameras

Vu et al. propose a fusion method consisting of detection and classification modules, i.e. using a BS technique for the detection task and the AdaBoost algorithm for the recognition task. They adopt the BS technique to extract moving object proposals instead of sliding-window detection on whole scene images. For the classification, they use the AdaBoost algorithm to classify the detected moving objects into their categories. The framework of the proposed fusion method is shown in Figure 1.7, where yellow rectangles depict moving object proposals and red rectangles illustrate pedestrian detection results. To evaluate the proposed fusion method, they use a computer with a 3.4 GHz Intel i7 Core and 16 GB of DDR2 memory as the platform: it reaches 30 FPS at 720x480 image resolution [23].

Figure 1.7

The framework of proposed fusion method: (a) original video clip. (b) after BS. (c) detection results. (d) classification results. Taken from [23].


YOLO: You Only Look Once

Redmon et al. present YOLO [18], a new approach to object detection. Prior work on object detection repurposes classifiers to perform detection. In YOLO, they frame object detection as a single regression problem to spatially separated bounding boxes and associated class probabilities. A single neural network predicts bounding boxes and class probabilities directly from full images, in one single evaluation. Their unified architecture (see Figure 1.8) is extremely fast: their base YOLO model processes images in real-time at 45 frames per second on a Titan X GPU. This means they can process streaming video in real-time with less than 25 milliseconds of latency. Furthermore, YOLO achieves more than twice the mean average precision of other real-time systems. A smaller version of the network, Fast YOLO, processes an astounding 155 frames per second while still achieving double the average precision of other real-time detectors. Compared to state-of-the-art detection systems, YOLO makes more localization errors but is less likely to predict false positives on the background.

Figure 1.8

YOLO Architecture

2 Architecture

In this chapter the architecture of the Moving Object Detection and Classification System is described. Moreover, the objectives and the assumptions made are specified.

2.1

Goals and Assumptions

The main goal of the presented work is to design and implement a system being able to:

• visually detect moving objects in the environment,

• track the objects over time,

• classify the detected objects.

Application Scenario In this work, it is assumed that the camera used by the system is fixed and working in outdoor environments. In the application scenario, the camera continuously monitors a road while vehicles and/or people pass by. The scene is therefore composed of a substantially fixed part containing objects like houses, trees and traffic lights, in which vehicles and/or people appear. Each time an object enters the scene, the camera can extract it from the background, recognize it and label it. In particular, the possible object classes to recognize are car, van, bus, truck, tram, bicycle, motorbike and person.


Computing Platform The software is entirely written in C++ and the computer vision algorithms are implemented using the OpenCV library, in particular version 3.1.0. This choice ensures great portability across different smart camera platforms, which usually run an embedded Linux operating system. For the CNN implementation the Caffe framework is used, and the NVIDIA Deep Learning GPU Training System (DIGITS) is exploited in order to simplify the training and validation phases of the CNNs. Moreover, the developed software has been tested on a 2.5 GHz Intel Core i7 CPU and on a Raspberry Pi 3 platform equipped with a Pi Camera Module, in order to get feedback on the performance on a possible smart camera platform.

Raspberry Pi 3 The Raspberry Pi 3, the third generation of Raspberry Pi, is a low-power credit card-sized single-board computer with a 1.2 GHz 64-bit quad-core ARMv8 CPU, 1 GB of RAM and a price of around $35. Camera sensors can be attached to it via USB using classic webcams or, better, via CSI (Camera Serial Interface) using the Raspberry Pi Camera module. Its specifications make the Raspberry Pi a good candidate for a smart camera hardware platform.

Figure 2.1

A Raspberry Pi equipped with the Pi Camera Module


2.2 System Architecture

The system is designed as a pipeline of processing steps, applied to the incoming video stream. The first step is the moving object detection, then the tracking mechanism tries to assign the detected objects to already classified objects belonging to previous frames. Only the unassigned objects need to be classified in the last step.

Figure 2.2

Moving Object detection and classification: System Architecture Overview
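A minimal sketch of how the pipeline in Figure 2.2 might be wired together is shown below; detectObjects, updateTracks, classifyNewTracks and drawTracks are hypothetical stand-ins (here empty stubs so the skeleton compiles) for the modules described in Chapters 3, 4 and 5, and "traffic.mp4" is a placeholder input. It is not the actual code of the system, only an assumption about its overall structure.

    #include <opencv2/opencv.hpp>
    #include <vector>

    // Hypothetical stand-ins for the three processing modules (Chapters 3-5).
    static std::vector<cv::Rect> detectObjects(const cv::Mat&)                 { return {}; }
    static std::vector<int>      updateTracks(const std::vector<cv::Rect>&)    { return {}; }
    static void                  classifyNewTracks(const std::vector<int>&)    {}
    static void                  drawTracks(cv::Mat&)                          {}

    int main() {
        cv::VideoCapture cap("traffic.mp4");            // or a camera index for a live stream
        cv::Mat frame;
        while (cap.read(frame)) {
            auto detections = detectObjects(frame);     // 1) moving object detection
            auto newTracks  = updateTracks(detections); // 2) assign detections to existing tracks
            classifyNewTracks(newTracks);               // 3) classify only unassigned detections
            drawTracks(frame);                          // 4) overlay contours and labels
            cv::imshow("output", frame);
            if (cv::waitKey(1) == 27) break;            // ESC to quit
        }
        return 0;
    }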

2.2.1 Object Detection

In this first phase, the incoming video stream is analyzed by looking for moving objects. A background model is created and continuously updated: each incoming frame of the video is compared to the model and non-matching parts are interpreted as foreground. Moving objects are extracted and provided to the next phase. Notice that in this phase false positives are possible; a false positive is an object that is part of the background but is considered part of the foreground. This is due to background changes like illumination variations and/or windy scenes.

This phase is described in detail in Chapter 3.

2.2.2 Object Tracking

In this phase, the system tries to find a match for each newly detected object. It scans the database of objects found in the previous frames and compares them with the new detections: if a match is found, then there is no need to provide the corresponding object to the next phase. In fact, a match means that the object was present in a previous frame, hence it was already classified. The database is updated with the newly assigned objects and only the unassigned objects are provided to the next phase.

This phase is described in detail in Chapter 4.

2.2.3 Object Classification

In this last phase, the objects that remain unassigned after the tracking phase are sent to a classifier (CNN) and classified into one of the possible classes. To reduce the number of false positives coming from the detection phase, an additional class background is considered: objects belonging to this class (trees, traffic lights, roads, etc.) are not shown. Moreover, in order to reduce the number of wrongly classified objects, only the objects classified with a probability greater than a certain threshold are shown. Once classified, the objects are stored in the database for a certain number of frames, making them available for future matches.

This phase is described in detail in Chapter 5.

2.2.4 Output

The system outputs the video stream with the contours and the labels of the detected moving objects (see Figure 2.3).

Figure 2.3

System Output

3 Object Detection

Figure 3.1

Moving Object Detection Phase

The first phase, as already said, concerns object detection. The video stream is acquired and BS is used to detect moving objects, after a pre-processing phase described later. The algorithm used is based on the Gaussian Mixture Model, which is described in the following section. Once the foreground mask is obtained, after a post-processing phase, the contours of the objects are located and the extracted objects are given to the next phases.


3.1 Gaussian Mixture Model

One of the most popular BS methods is based on a parametric probabilistic background model proposed by Stauffer and Grimson [20]. In this model, the distribution of each pixel color is represented by a sum of weighted Gaussian distributions, defined in a given color space: the Gaussian Mixture Model (or GMM). These distributions are generally updated using an on-line expectation-maximization algorithm, whose behavior is described later. Employing this technique significantly improves the performance compared to the use of a single Gaussian distribution.

In particular, in this approach, the statistics of each pixel are modeled by a mixture of a variable number N of Gaussian distributions. The probability of observing a pixel with a certain RGB value x ∈ R^3 is the following:

    p(x) = \sum_{i=1}^{N} w_i \, n(x; \mu_i, \Sigma_i)

where n(x; µ_i, Σ_i) is a multidimensional Gaussian probability density with mean vector µ_i ∈ R^3 (representing the mean value of the pixel) and covariance matrix Σ_i ∈ R^{3×3}:

    n(x; \mu_i, \Sigma_i) = \frac{1}{\sqrt{(2\pi)^3 \, |\Sigma_i|}} \, e^{-\frac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i)}

and the w_i are the weights associated with each Gaussian distribution, defined such that

    \sum_{i=1}^{N} w_i = 1

This kind of model can represent up to N different “sources” of background in a single pixel. For example, given a repetitive moving background such as a tree blowing in the wind, a pixel value could oscillate between the green of the leaves and another color from the background. A Gaussian component of the mixture could represent the green value and another component could represent the underneath color. If the model is properly trained, both colors can be correctly classified as background.

The model training and update require changing the parameters of the mixture according to the incoming frames; such parameters are the weights w_1 ... w_N and the parameters of each Gaussian (µ_1, Σ_1) ... (µ_N, Σ_N). A good choice of these values is their maximum-likelihood estimates, given a set of samples (i.e. pixel values). This can be done using the EM (Expectation-Maximization) algorithm [2] on a sliding window of pixel values. The EM algorithm iteratively performs two steps: in the first one (E-step) the likelihood of each sample in the window is calculated by evaluating the Gaussian mixture with the current parameters. In the second step (M-step) these likelihoods are used to refine the current parameters. The algorithm stops when the relative change in likelihood is small enough.

Two algorithms based on the GMM are implemented in OpenCV: Mixture of Gaussians (or MOG), which was introduced in [13], and Mixture of Gaussians 2 (or MOG2), which is based on [26, 27]. The latter selects the appropriate number of Gaussian distributions for each pixel, thus providing better adaptability to varying scenes (for example due to illumination changes).

According to [19], algorithms based on the GMM have shown good performance in the analysis of outdoor scenes. Even if this method is able to handle low illumination variations, rapid variations of illumination and shadows are still problematic.
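A rough sketch of how MOG2 can be invoked through the OpenCV 3.1 C++ API is shown below; the history length, variance threshold and input file name are illustrative assumptions, not the configuration used in the experiments of this thesis.

    #include <opencv2/opencv.hpp>

    int main() {
        cv::VideoCapture cap("traffic.mp4");   // hypothetical input stream
        // history = 500 frames, variance threshold = 16, shadow detection
        // disabled (see Section 3.2); these values are illustrative defaults.
        cv::Ptr<cv::BackgroundSubtractorMOG2> mog2 =
            cv::createBackgroundSubtractorMOG2(500, 16.0, false);

        cv::Mat frame, fgMask;
        while (cap.read(frame)) {
            // Updates the per-pixel Gaussian mixtures and returns the foreground mask.
            mog2->apply(frame, fgMask);
            cv::imshow("foreground mask", fgMask);
            if (cv::waitKey(1) == 27) break;
        }
        return 0;
    }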

In Figure 3.2 a result of the application of this method is shown.

(a) original frame (b) foreground mask Figure 3.2

Result of Background Subtraction based on GMM: (a) the current frame on which the foreground needs to be extracted. (b) the foreground mask obtained after the background subtraction.

3.2

Pre and Post Processing Operations

In this section, the processing operations performed before and after applying the BS method on the video stream are described. These operations are performed in order to improve the object detection.

Shadow Detection (Pre) The algorithm implemented in OpenCV can optionally detect shadows. Since shadow detection is not necessary for the purpose of the system, the shadows of objects are not considered. In Figure 3.3 it is possible to observe the different foreground masks obtained with shadow detection enabled or disabled.

Figure 3.3

Shadow Detection: (a) the foreground mask obtained with shadow detection enabled. (b) the foreground mask obtained with shadow detection disabled. Images taken from [22].

Gaussian Filter (Pre) Smoothing, also called blurring, is a simple and frequently used image processing operation. There are many reasons for smoothing, one of which is the reduction of noise. To perform a smoothing operation on an image, a filter must be applied; one of the most commonly used is the Gaussian filter. Gaussian filtering is done by convolving each point in the input array with a Gaussian kernel and then summing them all to produce the output array. In Figure 3.4 it is possible to see the different foreground masks obtained with and without the smoothing operation.

Dilate and Erode (Post) These are two basic operators in mathematical morphology. The basic effect of the erosion operator on a binary image is to erode away the boundaries of foreground pixels (the white pixels). This way, the areas of foreground pixels shrink in size and the "holes" within those areas become larger. On the other hand, the basic effect of dilation on binary images is to enlarge the areas of foreground pixels at their borders. This way, the areas of foreground pixels grow in size, while the background "holes" within them shrink. These morphological filtering operations can be applied to the obtained foreground mask in order to remove unwanted noise, such as isolated foreground pixels or background holes in a foreground object. In Figure 3.5 it is possible to observe the result of applying these operations.


Figure 3.4

Gaussian Filter: (a) the current frame. (b) the current frame after the smoothing operation. (c) the foreground mask obtained on the original frame. (d) the foreground mask obtained on the smoothed frame.

Figure 3.5

Dilate and Erode: (a) the foreground mask obtained on the original frame. (b) the foreground mask obtained on the original frame after dilate and erode operations.


Removing Objects (Post) Not all the detected objects are considered. In particular, the “border” objects (i.e. the objects that touch the borders of the frame) are not considered. Often, in fact, they are only partial objects since they are entering or leaving the scene and this can lead to a wrong classification. Moreover, only objects with an area greater than a given value are considered.

Figure 3.6

Removing Objects Operation: (a) the frame with no objects removed. (b) the frame with objects removed.
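A possible sketch of the full pre/post-processing chain around the background subtractor, assuming OpenCV 3.1 as used in this work; the kernel sizes, the minimum-area value and the border test are illustrative assumptions, corresponding to the tunable parameters discussed in Chapter 6 rather than fixed choices of the system.

    #include <opencv2/opencv.hpp>
    #include <vector>

    // Extracts candidate moving objects from one frame: Gaussian smoothing (pre),
    // MOG2, dilate/erode (post), then contour filtering that discards border
    // objects and objects smaller than minArea.
    std::vector<cv::Rect> extractObjects(const cv::Mat& frame,
                                         cv::Ptr<cv::BackgroundSubtractor> mog2,
                                         double minArea = 400.0)
    {
        cv::Mat smoothed, fgMask;
        cv::GaussianBlur(frame, smoothed, cv::Size(5, 5), 0);   // noise reduction (pre)
        mog2->apply(smoothed, fgMask);                          // foreground mask

        cv::Mat kernel = cv::getStructuringElement(cv::MORPH_RECT, cv::Size(3, 3));
        cv::dilate(fgMask, fgMask, kernel);                     // fill background holes (post)
        cv::erode(fgMask, fgMask, kernel);                      // remove isolated pixels (post)

        std::vector<std::vector<cv::Point>> contours;
        cv::findContours(fgMask, contours, cv::RETR_EXTERNAL, cv::CHAIN_APPROX_SIMPLE);

        std::vector<cv::Rect> objects;
        for (const auto& c : contours) {
            cv::Rect box = cv::boundingRect(c);
            bool touchesBorder = box.x <= 0 || box.y <= 0 ||
                                 box.x + box.width  >= frame.cols ||
                                 box.y + box.height >= frame.rows;
            if (!touchesBorder && cv::contourArea(c) >= minArea)
                objects.push_back(box);                         // keep the detection
        }
        return objects;
    }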


4 Object Tracking

Figure 4.1

Object Tracking Phase

Since classification is the most expensive task, this module is fundamental for the aim of this work. It is used for efficiency reasons: when this mechanism is enabled, the number of objects to classify decreases significantly and, as a consequence, so does the time required to process a frame. In this phase, the objects coming from the Object Detection module are compared with previously classified objects, maintained in structures named tracks. The algorithm tries to find a match between objects belonging to different frames. In any given frame, some detections may be assigned to tracks, while other detections and tracks may remain unassigned. The assigned tracks are updated using the corresponding detections, whereas the unassigned tracks are marked as invisible. An unassigned detection begins a new track and will be classified.


4.1 The Track Class

For the purpose of this phase, a class named Track is created. In a few words, a track is a structure representing a moving object in the video. The purpose of the structure is to maintain the state of a tracked object, where the state consists of information used to assign new detections to tracks, terminate useless tracks, and display each track.

The structure contains the following fields:

• x, y : coordinates of the center of mass of the object

• bndBox: image representing the object

• rec: rectangle around the object

• avgColor : mean color of the image representing the object

• label : class name assigned through classification

• prob: probability of belonging to the chosen class

• assigned : state of the track in the current frame

• framesWithoutUpdate: number of consecutive frames in which the track is unassigned

• lifeTime: number of frames since the track was created
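A minimal C++ sketch of how such a Track class could look, assuming OpenCV types for the image-related fields; the actual class in the system may differ in details.

    #include <opencv2/core.hpp>
    #include <string>

    // Sketch of the Track structure: one instance per moving object being followed.
    class Track {
    public:
        float       x = 0.f, y = 0.f;        // center of mass of the object
        cv::Mat     bndBox;                  // image (crop) representing the object
        cv::Rect    rec;                     // rectangle around the object
        cv::Scalar  avgColor;                // mean color of the object image
        std::string label;                   // class name assigned by the classifier
        float       prob = 0.f;              // probability of the chosen class
        bool        assigned = false;        // matched to a detection in this frame?
        int         framesWithoutUpdate = 0; // consecutive frames left unassigned
        int         lifeTime = 0;            // frames elapsed since creation
    };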

4.2 Tracking Algorithm

The algorithm is composed of four sequential steps that are performed at each frame:

Updating of the Tracks As described in Algorithm 1, initially the algorithm tries to update the existing tracks T. Updating a track means finding an object in the current frame corresponding to the object represented by the track, which is an object already found in a previous frame. To this purpose, for each track, the Euclidean distance between its center of mass and the center of mass of each object detected in the current frame is computed. The closest object is taken and, if the distance is smaller than a given threshold, the Euclidean distance between the average colors is also computed. Once again, if this distance is smaller than another given threshold, the two objects belonging to different frames are considered to be the same and the track is updated with the newly detected object. Some fields of the track are updated while others are not. In particular, the attributes x, y, bndBox and avgColor are updated with the new values of the assigned object, whereas the fields label and prob remain unchanged; the field assigned is set, framesWithoutUpdate is reset and the attribute lifeTime is incremented. Finally, the assigned object is removed from the list of detected objects O. On the contrary, if a match is not found because one of the distances is above the corresponding threshold, the track is not updated. This means that most fields remain unchanged, assigned is reset while framesWithoutUpdate and lifeTime are incremented.

Algorithm 1 Tracking Algorithm: Update Tracks

    for all t_i ∈ T do                              ▷ T is the set of all existing tracks
        cm_t ← (x_i, y_i)
        for all o_j ∈ O do                          ▷ O is the set of the objects detected in the current frame
            cm_o ← centerOfMass(o_j)
            d_j ← euclideanDistance(cm_t, cm_o)
            D ← D + {d_j}                           ▷ D is the set of all distances
        end for
        d_min ← min(D)
        o_min ← {o_j : d_j = d_min}
        if d_min <= distanceTH then
            mc_o ← meanColor(o_min)
            dc ← euclideanDistance(avgColor_i, mc_o)
            if dc <= avgColorTH then
                update t_i
                O ← O − {o_min}
            else
                do not update t_i
            end if
        else
            do not update t_i
        end if
    end for
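The same update step, sketched in C++ under the assumption of the Track class outlined in Section 4.1; the threshold names are taken from the pseudocode, their values here are illustrative, and the real implementation may differ.

    #include <opencv2/core.hpp>
    #include <cmath>
    #include <vector>

    // Thresholds referenced by Algorithm 1 (illustrative default values).
    static const double distanceTH = 50.0;   // pixels
    static const double avgColorTH = 40.0;   // color-space units

    struct Detection { float x, y; cv::Scalar meanColor; };

    // Tries to update one track with the closest detection; returns the index of
    // the assigned detection in 'objects', or -1 if no match is found.
    // Assumes the Track class sketched in Section 4.1.
    int updateTrack(Track& t, const std::vector<Detection>& objects)
    {
        int best = -1;
        double bestDist = 1e9;
        for (size_t j = 0; j < objects.size(); ++j) {        // closest center of mass
            double d = std::hypot(objects[j].x - t.x, objects[j].y - t.y);
            if (d < bestDist) { bestDist = d; best = static_cast<int>(j); }
        }
        if (best >= 0 && bestDist <= distanceTH) {
            const cv::Scalar& c = objects[best].meanColor;    // compare average colors
            double dc = std::sqrt(std::pow(c[0] - t.avgColor[0], 2) +
                                  std::pow(c[1] - t.avgColor[1], 2) +
                                  std::pow(c[2] - t.avgColor[2], 2));
            if (dc <= avgColorTH) {
                t.x = objects[best].x;  t.y = objects[best].y; // update position and color
                t.avgColor = c;
                t.assigned = true;  t.framesWithoutUpdate = 0;  ++t.lifeTime;
                return best;                                   // caller removes it from O
            }
        }
        t.assigned = false;  ++t.framesWithoutUpdate;  ++t.lifeTime;  // no match
        return -1;
    }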

Creation of new Tracks At this point (see Algorithm 2), since every time an object is assigned to a track it is removed from the list of detected objects, the set O contains only the unassigned detections. For each of these, a new track is created and added to the list of tracks. When a new track is created, all fields are initialized except label and prob, which are initialized by the Object Classification module.

Algorithm 2 Tracking Algorithm: Create New Tracks

    for all o_i ∈ O do
        t_i ← createNewTrack(o_i)
        T ← T + {t_i}
    end for

Classification of the Tracks In Algorithm 3, the list of tracks T is scanned looking for unclassified tracks. All of these are then sent to the classifier in batch, that is, they are classified in a single forward propagation. Once classified, the fields label and prob are updated with the corresponding values.

Algorithm 3 Tracking Algorithm: Classify Tracks

    for all t_i ∈ T do
        if t_i not classified yet then
            T_c ← T_c + {t_i}                       ▷ T_c is the set of tracks not classified yet
        end if
    end for
    send T_c to classification module
    for all t_i ∈ T_c do
        update t_i with label and probability
    end for

Removal of useless Tracks In the last step, as reported in Algorithm 4, all the tracks that are no longer necessary are removed from T. This happens either if a track is not updated for a certain number of consecutive frames, which means that the object represented by the track has left the field of view, or if a track is too old (i.e. it was created a certain number of frames ago). If, for instance, the threshold noUpdateTH is equal to 0, the system tries to find a match between objects belonging to consecutive frames only. The higher the threshold noUpdateTH, the higher the number of objects stored and, hence, the greater the probability of finding a match. Furthermore, if the threshold lifeTimeTH is equal to 10, the system removes a track after ten frames, no matter whether it was updated or not. This threshold allows a wrongly classified object to be reclassified after a certain period.


Algorithm 4 Tracking Algorithm: Delete Useless Tracks

    for all t_i ∈ T do
        if framesWithoutUpdate_i > noUpdateTH || lifeTime_i > lifeTimeTH then
            T ← T − {t_i}
        end if
    end for

Notice that all the thresholds can be tuned by the user in order to adapt the system depending on the demands of performance and the context in which it is inserted. A statistical study of these parameters is carried out in Chapter 6 in order to show how the system behaves when they change.


5 Object Classification

Figure 5.1

Object Classification Phase

In this last phase, the objects are given to a previously trained classifier (CNN). In order to improve the efficiency, all the objects belonging to a frame are processed in batch. The output of the classifier is a class label which, together with a bounding box, is printed around the object. The label is shown only if the probability that the object belongs to that particular class is greater than a given threshold. A special case is the class background: if an object is classified as background, it is not shown.
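A rough sketch of how such a batch classification could be carried out through the Caffe C++ API, loosely following the pattern of Caffe's standard classification example; the file names, input size and preprocessing are illustrative assumptions (mean subtraction and error handling are omitted), and callers would then apply the probability threshold and the background-class rule described above.

    #include <caffe/caffe.hpp>
    #include <opencv2/opencv.hpp>
    #include <cstring>
    #include <utility>
    #include <vector>

    // Classifies a batch of object crops in a single forward pass and returns
    // (label index, probability) for each crop.
    std::vector<std::pair<int, float>> classifyBatch(caffe::Net<float>& net,
                                                     const std::vector<cv::Mat>& crops,
                                                     int inputSize = 227)
    {
        caffe::Blob<float>* input = net.input_blobs()[0];
        input->Reshape(static_cast<int>(crops.size()), 3, inputSize, inputSize);
        net.Reshape();                                   // propagate the new batch size

        float* data = input->mutable_cpu_data();
        for (size_t n = 0; n < crops.size(); ++n) {
            cv::Mat resized;
            cv::resize(crops[n], resized, cv::Size(inputSize, inputSize));
            resized.convertTo(resized, CV_32FC3);        // mean subtraction omitted here
            std::vector<cv::Mat> channels(3);
            cv::split(resized, channels);                // Caffe expects planar CHW layout
            for (int c = 0; c < 3; ++c)
                std::memcpy(data + input->offset(static_cast<int>(n), c),
                            channels[c].ptr<float>(),
                            sizeof(float) * inputSize * inputSize);
        }

        net.Forward();                                   // one forward pass for the batch
        const caffe::Blob<float>* out = net.output_blobs()[0];

        std::vector<std::pair<int, float>> results;
        for (size_t n = 0; n < crops.size(); ++n) {
            const float* probs = out->cpu_data() + out->offset(static_cast<int>(n));
            int best = 0;
            for (int c = 1; c < out->channels(); ++c)
                if (probs[c] > probs[best]) best = c;
            results.emplace_back(best, probs[best]);     // class index and its probability
        }
        return results;
    }

    // Usage sketch (hypothetical file names):
    //   caffe::Net<float> net("deploy.prototxt", caffe::TEST);
    //   net.CopyTrainedLayersFrom("squeezenet.caffemodel");
    //   auto labels = classifyBatch(net, crops);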

In the following sections, two of the most used CNNs in image classification are introduced. Afterwards, the dataset used for training, validation and testing is presented, as well as the related results.


5.1 Alex-Net

The original Alex-Net architecture [14] has 60 million parameters and 650,000 neurons; it consists of five convolutional layers, some of which are followed by max-pooling layers, three fully-connected layers and a final 1000-way softmax. To reduce over-fitting in the fully-connected layers, the dropout regularization method is used.

Figure 5.2

An illustration of the Alex-Net structure, taken from [14]

The first convolutional layer filters the 224x224x3 input image with 96 kernels of size 11x11x3 with a stride of 4 pixels. The second convolutional layer takes as input the (response-normalized and pooled) output of the first convolutional layer and filters it with 256 kernels of size 5x5x48. The third, fourth, and fifth convolutional layers are connected to one another without any intervening pooling or normalization layers. The third convolutional layer has 384 kernels of size 3x3x256 connected to the (normalized, pooled) outputs of the second convolutional layer. The fourth convolutional layer has 384 kernels of size 3x3x192, and the fifth convolutional layer has 256 kernels of size 3x3x192. The fully-connected layers have 4096 neurons each.

5.2 Squeeze-Net

Recent research on CNNs has focused primarily on improving accuracy. For a given accuracy level, it is typically possible to identify multiple CNN architectures that achieve that accuracy level. With equivalent accuracy, smaller CNN architectures offer several advantages. One of these is certainly that smaller CNNs are easier to deploy on hardware with limited memory. Moreover, they ensure better performance in terms of computational time, hence they are more feasible for real-time applications.

To provide all of these advantages, Iandola et al. propose a small CNN architecture called Squeeze-Net [12]. Squeeze-Net achieves Alex-Net's level of accuracy on ImageNet with 50x fewer parameters. Additionally, with model compression techniques, it is possible to compress Squeeze-Net to less than 0.5 MB (510x smaller than Alex-Net). These results are obtained using a custom layer called Fire Layer.

Fire Layer A Fire module is comprised of a squeeze convolutional layer (which has only 1x1 filters) feeding into an expand layer that has a mix of 1x1 and 3x3 convolutional filters (see Figure 5.3). The authors expose three tunable dimensions (hyperparameters) in a Fire module: s1x1, e1x1, and e3x3. In a Fire module, s1x1 is the number of filters in the squeeze layer (all 1x1), e1x1 is the number of 1x1 filters in the expand layer, and e3x3 is the number of 3x3 filters in the expand layer.

Figure 5.3

Organization of Convolutional Filters in the Fire Module, taken from [12]

As illustrated in Figure 5.4, Squeeze-Net begins with a standalone convolutional layer (conv1), followed by 8 Fire modules (fire2-9), ending with a final convolutional layer (conv10). The number of filters per fire module from the beginning to the end of the network is gradually increased. Squeeze-Net performs max-pooling with a stride of 2 after layers conv1, fire4, fire8, and conv10.

The CNN tested in this work is Squeeze-Net v1.1, which requires 2.4x less computation than Squeeze-Net v1.0 without diminishing accuracy.

Figure 5.4

Squeeze-Net Structure

5.3 The Dataset

The dataset used to train the CNNs is the one made available for the PASCAL Visual Object Classes Challenge 2012 [9], in particular the one used for the classification task. This dataset is composed of twenty object classes for a total of about 27,000 objects, considering only the "non-difficult" objects. Several classes are removed since they are not useful for the purpose of this work; the remaining classes car, bicycle, motorbike, person and bus are expanded with new images taken from ImageNet. Moreover, new classes tram, van, truck and background are added, still using images from ImageNet. The class background is a special class composed of objects that can commonly be found in a street: asphalt, trees, traffic lights, street lights, etc. In conclusion, the final dataset consists of nine classes and approximately 32,000 colored images.

Figure 5.5

The Dataset, an example for each class

5.4 Training, Validation and Testing

Training During training, a neural network cycles through the data repeatedly, changing the values of its weights to improve performance, that is, reaching the point with the minimum error. To this purpose, a backpropagation algorithm is needed: the one used in this work, called AdaGrad, is described in [8].

To train a CNN there are two possible approaches. The first one is training from scratch, i.e. with a random initialization of the weights. In practice, an entire CNN is rarely trained from scratch, because it is relatively rare to have a dataset of sufficient size. The second approach consists of using a CNN pre-trained on a very large dataset. This approach allows fine-tuning the weights of the pre-trained network by continuing the backpropagation. It is possible either to fine-tune all the layers of the CNN, or to keep some of the earlier layers fixed (due to overfitting concerns) and only fine-tune some higher-level portion of the network. This is motivated by the observation that the earlier layers of a CNN extract more generic features (e.g. edge detectors or color blob detectors) that should be useful to many tasks, whereas later layers become progressively more specific to the details of the classes contained in the original dataset.

The approach exploited in this work is the second one, because it leads to better results in terms of accuracy with relatively small datasets.

Validation A network, however, can learn something different from what its trainer had in mind. It can also memorize the training examples without learning what they have in common. In order to prevent inappropriate memorization of input data, also called overtraining or overlearning, it is possible to interleave training and testing. More precisely, a neural network is overtraining when the training set error continues to decline, but the validation set error begins to increase (see Figure 5.6).

Figure 5.6

Example of Overtraining

Testing Results measured with the training data say very little about the neural network’s reliability in the application. In fact, a neural network is useful only if it returns appropriate results when fed with data that were not used to train it. Measuring this ability, called generalization, requires testing the network with an independent dataset.

To this purpose, the dataset is divided into three parts: 900 images are used for testing (100 images for each class), while the rest is used partially for training and partially for validation (25%). All the images are resized to 256x256 or 128x128 depending on the network type and are cropped to fit the dimension of the input layer of the networks.

5.4.1 Results

In this section, for both Alex-Net and Squeeze-Net, the obtained results in terms of accuracy are presented. In particular, as far as the second network is concerned, different configurations have been tested. Moreover, a speed comparison on the testing set has been performed.

Each CNN is trained for 50 epochs with a base learning rate equal to 0.01. This choice follows from the fact that there is no significant increase in accuracy in further iterations for any of the CNNs. Moreover, a step-down policy is used: this decreases the learning rate by a factor of 10 after every 33% of the epochs. The validation step is performed at every epoch and, as it is possible to see below, none of the CNNs overtrain.

Alex-Net The first network tested is the Alex-Net. In particular, a slightly different architecture is used: the one implemented in Caffe, which differs from the original because of a different input size, namely 227x227x3. In Figure 5.7 the trend of the validation accuracy over the epochs is shown, as well as the error (loss) for both validation and training. At the last epoch, the obtained validation accuracy is about 91%, while the testing accuracy is about 93%. These results are obtained by fine-tuning a model pretrained on ImageNet. In particular, all layers are fixed except the fc6 and fc7 layers, which are tuned with a learning rate ten times lower. Obviously, the last layer is also tuned, since the original model has 1000 classes, while the system presented in this work has only 9 classes.

Squeeze-Net All the different configurations of the Squeeze-Net tested are trained by fine-tuning a model, once again pretrained on ImageNet.

Figure 5.8 shows the trend of the validation accuracy of a Squeeze-Net with an input layer of size 227x227x3. In this case, all the fire layers are tuned with a learning rate ten times lower. The validation accuracy is about 92%, while the testing accuracy is about 93%.

Pretrained models: https://github.com/BVLC/caffe/tree/master/models/bvlc_alexnet (Alex-Net) and https://github.com/DeepScale/SqueezeNet/tree/master/SqueezeNet_v1.1 (Squeeze-Net v1.1).


Figure 5.7

Accuracy and Loss function of Alex-Net 227x227x3 over the epochs

Figure 5.8

Accuracy and Loss function of Squeeze-Net 227x227x3 over the epochs

In order to further reduce the computational time, a different configuration with an input layer reduced to 114x114x3 is tested. In this case, the first convolutional layer also needs to be tuned to achieve the best result. As shown in Figure 5.9, the validation accuracy is about 89%, while the testing accuracy is about 88%.

Figure 5.9

Accuracy and Loss function of Squeeze-Net 114x114x3 over the epochs

Alternatively, the configuration just seen is also trained using a low resolution version of the dataset, in which the height and the width of the images are halved. As shown in Figure 5.10, the validation accuracy is about 89%, while the testing accuracy is about 90%.


Figure 5.10

Accuracy and Loss function of Squeeze-Net 114x114x3, trained on a low resolution dataset, over the epochs

In Table 5.1 all the obtained accuracies are grouped in order to have a complete overview.

CNN Validation Accuracy Testing Accuracy

Alex-Net 227x227x3 91% 93%

Squeeze-Net 227x227x3 92% 93%

Squeeze-Net 114x114x3 89% 88%

Squeeze-Net 114x114x3 (low resolution) 89% 90%

Table 5.1

Summary of the Accuracy obtained with the different CNNs

Speed Comparison In Figure 5.11 it is possible to observe the actual time taken by each CNN to classify all the test images on a 2.5 GHz Intel i7 CPU, while Table 5.2 shows the relative processing time. It can be noticed that the 227x227x3 version of the Squeeze-Net is more than 6 times faster than the Alex-Net, while the 114x114x3 version is about 22 times faster. Comparing the two versions of the Squeeze-Net, the second one is more than 3.5 times faster.


Figure 5.11

Actual computational time taken to classify all the test images by each CNN

CNN Relative processing time

Alex-Net 227x227x3 1

Squeeze-Net 227x227x3 0.162

Squeeze-Net 114x114x3 0.044

Table 5.2

CNNs Relative Speed Comparison

6 Experimental Results

In order to evaluate the entire system, a custom video acquired in Pisa with a static camera is used; it consists of 6250 frames and lasts 4.10 minutes. Initially, 13774 objects are detected, extracted from the video at a resolution of 1280x720. Next, some of them are preventively discarded because they are too small or only partial (border objects). A portion of the remaining objects is manually annotated, resulting in the creation of a Ground Truth of approximately 7000 objects, used for evaluating the system. In particular, the labeled objects are the ones belonging to the first 1800 frames and to frame 4700 onwards.

Notice that, since the Squeeze-Net is much faster than the Alex-Net and efficiency in terms of computational time is an important part of this work, in this chapter only the results obtained with the Squeeze-Net are considered.

6.1 Object Detection

As already said, during the detection phase false positives are possible. However, evaluating the BS method is not the aim of this work, thus in this section only a statistical study of the average number of objects per frame is carried out. In this way, it is possible to set a threshold maxObjects on the number of objects per frame: if the objects found in a frame exceed the given threshold, the system avoids any kind of computation and nothing is displayed. On the one hand, this choice makes it possible to avoid processing frames in which, due to background changes, several false positives occur. On the other hand, it sets an upper limit on the processing time of a frame, thus establishing a minimum frame rate.


6.1.1 Maximum Number of Objects per Frame

In Figure 6.1 the frames are grouped based on the number of objects detected. For instance, there are about 1600 frames in which only 1 object is detected.

Figure 6.1

Number of frames having a given number of objects

Figure 6.2 represents the cumulative percentage of frames in which at most a given number of objects is detected. Taking a threshold equal to 10, for example, allows 99% of the frames to be processed, missing only 1%.

Figure 6.2
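The cumulative coverage shown in Figure 6.2 can be recomputed from the per-frame object counts with a few lines of code; the sketch below (with an illustrative function name) prints, for each candidate value of maxObjects, the percentage of frames that would still be processed.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// Given the number of detected objects in every frame, prints the cumulative percentage
// of frames containing at most k objects, for k = 0..maxK. A threshold maxObjects can
// then be chosen as the smallest k that covers the desired fraction of frames (e.g. 99%).
void PrintCumulativeCoverage(const std::vector<int>& objectsPerFrame, int maxK) {
    for (int k = 0; k <= maxK; ++k) {
        const auto covered = std::count_if(objectsPerFrame.begin(), objectsPerFrame.end(),
                                           [k](int n) { return n <= k; });
        std::printf("<= %2d objects: %5.1f%% of frames\n",
                    k, 100.0 * covered / objectsPerFrame.size());
    }
}
```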


6.2 Object Classification

In this section, the evaluation of the Object Classification Module is reported. During this phase the tracking mechanism is disabled, hence all the detected objects are sent to the classifier. The evaluation measures taken into consideration to validate the module are reported below:

Precision
\[ P = \frac{TP}{TP + FP} \]
which is the fraction of objects classified as A that really belong to the class A.

Recall
\[ R = \frac{TP}{TP + FN} \]
which is the fraction of all objects belonging to the class A correctly classified as A.

F1 Score
\[ F_1 = \frac{2 \cdot P \cdot R}{P + R} \]
which is the harmonic mean of precision and recall.

                     True label A           True label not A
Predicted label A    true positive (TP)     false positive (FP)
Predicted not A      false negative (FN)    true negative (TN)

Table 6.1

Performance table for instances labeled with a class label A

When classification uses multiple class labels, it is possible to produce a table like Table 6.1 for each class label, and Precision, Recall and F-score can be computed for each of them. Averaging the evaluation measures gives a view of the overall results; two kinds of averages are used for this purpose: micro-averaged and macro-averaged results.

Let L = {x_i : i = 1, ..., n} be the set of all labels and B(TP, TN, FP, FN) a binary evaluation measure that is calculated based on the number of true positives, true negatives, false positives and false negatives. Moreover, let TP_x, FP_x, TN_x and FN_x be the number of true positives, false positives, true negatives and false negatives after the binary evaluation for a label x.

Macro-averaged measure
\[ B_{macro} = \frac{1}{n} \sum_{x=1}^{n} B(TP_x, TN_x, FP_x, FN_x), \]
which is the unweighted average of the measure taken separately for each class. Therefore it is an average over classes, since it gives equal weight to each of them.

Micro-averaged measure
\[ B_{micro} = B\left( \sum_{x=1}^{n} TP_x, \sum_{x=1}^{n} TN_x, \sum_{x=1}^{n} FP_x, \sum_{x=1}^{n} FN_x \right). \]
In the micro-averaged method, the individual true positives, false positives, false negatives and true negatives of the system for the different classes are summed up and used to compute the measure. It is an average over instances, therefore classes which have many instances are given more importance.

There is no complete agreement among authors on which measure is better. In fact, since the Precision, Recall and F1 Score measures ignore true negatives and their magnitude is mostly determined by the number of true positives, large classes dominate small classes in micro-averaging. To get a sense of effectiveness on small classes, macro-averaged results should be computed.
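To make the distinction concrete, the following C++ sketch computes both averages of the F1 Score from per-class counts (true negatives are not needed for F1); the struct and function names are illustrative, not taken from the evaluation code of this work.

```cpp
#include <vector>

struct ClassCounts { double tp = 0, fp = 0, fn = 0; };  // per-class counts

static double F1(double tp, double fp, double fn) {
    const double p = (tp + fp > 0) ? tp / (tp + fp) : 0.0;  // precision
    const double r = (tp + fn > 0) ? tp / (tp + fn) : 0.0;  // recall
    return (p + r > 0) ? 2.0 * p * r / (p + r) : 0.0;
}

// Macro-averaged F1: average of the per-class scores (equal weight per class).
double MacroF1(const std::vector<ClassCounts>& classes) {
    double sum = 0.0;
    for (const ClassCounts& c : classes) sum += F1(c.tp, c.fp, c.fn);
    return classes.empty() ? 0.0 : sum / classes.size();
}

// Micro-averaged F1: counts are summed over all classes first, then F1 is computed once,
// so classes with many instances weigh more.
double MicroF1(const std::vector<ClassCounts>& classes) {
    double tp = 0, fp = 0, fn = 0;
    for (const ClassCounts& c : classes) { tp += c.tp; fp += c.fp; fn += c.fn; }
    return F1(tp, fp, fn);
}
```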

In Figure 6.3 the Precision, Recall and F1 Score obtained for each class, considering a binary classification problem, are presented. Notice that, regarding the classes tram and bus, all the measures are equal to 0 since there are no objects belonging to these classes. Thus, they are not taken into consideration in the averaged measures in Table 6.2 and Table 6.3.

Regarding the micro-averaged measures, the best results are obtained by the Squeeze-Net 114x114x3, which reaches an F1 Score of 0.752. Regarding the macro-averaged measures, the best results are again obtained by the Squeeze-Net 114x114x3, but this time by the version trained on the low resolution dataset, which reaches an F1 Score of 0.547.


Figure 6.3
(a) Precision, (b) Recall, (c) F1 Score


Squeeze-Net                  Precision    Recall    F1 Score
227x227x3                    0.73         0.724     0.727
114x114x3                    0.753        0.751     0.752
114x114x3 low resolution     0.721        0.719     0.72

Table 6.2
Micro-averaged measures

Squeeze-Net                  Precision    Recall    F1 Score
227x227x3                    0.483        0.334     0.395
114x114x3                    0.479        0.485     0.482
114x114x3 low resolution     0.507        0.593     0.547

Table 6.3
Macro-averaged measures

6.2.1 Probability Threshold

So far, the probability that an object belongs to a given class has not been considered. It is possible to define a probability threshold in order to decide whether to accept a classification or not, i.e. to establish whether the classification is likely to be correct. To this purpose, Figure 6.4 shows, as the threshold changes, the false positive rate, i.e. the percentage of objects that are wrongly classified but accepted as correct, and the false negative rate, i.e. the percentage of objects that are correctly classified but rejected as incorrect. This analysis has been carried out for the Squeeze-Net trained on the low resolution dataset.

Figure 6.4


The Equal Error Rate, i.e. the point at which the false positive and false negative rates are equal, is achieved with a threshold of 0.872. Choosing this value for the threshold leads to an FP and FN rate of approximately 0.288: only about 29% of the correctly classified objects are discarded and only about 29% of the wrongly classified objects are accepted. Different values of this threshold can be chosen, depending on the kind of application and context.
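For reference, the two rates can be computed for any candidate threshold as in the sketch below, and the Equal Error Rate found by scanning the threshold until the two curves cross. The Prediction struct and the accept/reject convention (>= versus >) are assumptions of this sketch, not taken from the implementation.

```cpp
#include <vector>

struct Prediction {
    bool correct;        // whether the predicted label matches the ground truth
    double probability;  // probability assigned to the predicted (top-1) class
};

// False positive rate: fraction of wrongly classified objects whose probability is above
// the threshold (accepted although wrong). False negative rate: fraction of correctly
// classified objects whose probability falls below the threshold (discarded although right).
void RatesAtThreshold(const std::vector<Prediction>& preds, double threshold,
                      double& fpRate, double& fnRate) {
    double wrong = 0, wrongAccepted = 0, right = 0, rightRejected = 0;
    for (const Prediction& p : preds) {
        if (p.correct) { ++right; if (p.probability <  threshold) ++rightRejected; }
        else           { ++wrong; if (p.probability >= threshold) ++wrongAccepted; }
    }
    fpRate = wrong > 0 ? wrongAccepted / wrong : 0.0;
    fnRate = right > 0 ? rightRejected / right : 0.0;
}
```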

6.3 Object Tracking

In this section the results obtained with the Tracking Module enabled are presented. In particular, only the Squeeze-Net trained on the low resolution dataset has been used, and the evaluation measures considered are the macro-averaged and micro-averaged F1 Scores only. Moreover, in order to show the effectiveness of the tracking mechanism, the number of objects classified is also reported. Since this module comprises several thresholds (distanceTH, avgColorTH, noUpdateTH and lifetimeTH), it is unfeasible to explore all the possible combinations of values: it was therefore decided to vary them individually, i.e. one at a time, while the others remain fixed. Finally, notice that distanceTH and avgColorTH are normalized to assume a value between 0 and 1; to do so, they are divided by their respective maximum values.
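The following C++ sketch illustrates how the two normalized checks could be combined when deciding whether a detection matches an already tracked object. Only the normalization of the center distance by the frame diagonal is stated in the text; the function name, the use of cv::Scalar for mean colors and the color normalization constant (sqrt(3)·255 for BGR) are assumptions of this sketch, not the actual Tracking Module code.

```cpp
#include <cmath>
#include <opencv2/core.hpp>

// Decides whether a newly detected object matches a tracked one, using two normalized checks:
// center-of-mass distance (normalized by the frame diagonal, compared with distanceTH) and
// Euclidean distance between mean BGR colors (normalized by its maximum, compared with avgColorTH).
bool MatchesTrackedObject(const cv::Point2f& newCenter, const cv::Scalar& newMeanColor,
                          const cv::Point2f& trackedCenter, const cv::Scalar& trackedMeanColor,
                          const cv::Size& frameSize, double distanceTH, double avgColorTH) {
    const double diagonal = std::hypot(frameSize.width, frameSize.height);
    const double centerDist = std::hypot(newCenter.x - trackedCenter.x,
                                         newCenter.y - trackedCenter.y) / diagonal;

    double colorDist = 0.0;
    for (int c = 0; c < 3; ++c) {
        const double d = newMeanColor[c] - trackedMeanColor[c];
        colorDist += d * d;
    }
    colorDist = std::sqrt(colorDist) / (std::sqrt(3.0) * 255.0);  // normalize to [0, 1]

    return centerDist <= distanceTH && colorDist <= avgColorTH;
}
```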

6.3.1 Thresholds Evaluation

Distance Threshold. For the evaluation of this threshold, the further check on the Euclidean distance between mean colors is deactivated (avgColorTH is set to 1), noUpdateTH is set to 0 in order to compare objects belonging to consecutive frames only, and lifetimeTH is not considered.

Figure 6.5 shows the trend of the F1 Score when the threshold varies between 0 and 0.5 (i.e. half the length of the frame's diagonal), while Figure 6.6 shows the corresponding number of objects classified. Focusing on the latter, it is easily visible how it significantly decreases as the threshold increases. What is more, the F1 Score also generally decreases.


Figure 6.5

F1 Score trend when distance threshold varies

Figure 6.6

Number of objects classified when distance threshold varies

Average Color Threshold. The distanceTH is set to 0.013: this value guarantees a good trade-off between the F1 Score and the number of objects classified.

Once again, Figure 6.7 shows the trend of the F1 Score when the threshold varies between 0 and 0.2 (for higher values there are no significant changes), while Figure 6.8 shows the corresponding number of objects classified.


Figure 6.7

F1 Score trend when average color threshold varies

Figure 6.8


No Update Threshold and Lifetime Threshold. The avgColorTH is set to 0.03: such a value increases the number of objects to classify, but at the same time guarantees a higher F1 Score.

Finally, by tuning noUpdateTH and lifetimeTH, it is possible to adapt the system to the performance demands, either in terms of accuracy or of efficiency. In Table 6.4, the results obtained with different configurations of these thresholds are reported.

noUpdateTH, lifetimeTH    Macro F1 Score    Micro F1 Score    Objects Classified
0, 12                     0.546             0.716             1286
28, 38                    0.52              0.686             1012
50, 64                    0.482             0.658             955

Table 6.4
F1 Score and number of objects classified with different configurations of the noUpdate and lifetime thresholds.

In particular, the first configuration guarantees an F1 Score almost equal to the one obtained with the tracking mechanism disabled, but with a number of objects classified approximately equal to 18% of the total (see Figure 6.9).

Figure 6.9

Tracking Module enabled and disabled comparison

6.4 Speed Evaluation

The system has been tested, both with the tracking mechanism enabled and disabled, on a 2.5 GHz Intel Core i7 CPU. Enabling the Tracking Module leads to a significant increase in efficiency, as can be seen in Figure 6.10.


Figure 6.10

Computational time taken to elaborate the video stream, with tracking mechanism enabled and disabled

Moreover, in Table 6.5, the average and the minimum values of FPS are reported. The minimum FPS is computed considering the time needed to classify a frame in which the number of detected objects is equal to the threshold maxObjects (i.e. 10).

                Average FPS    Minimum FPS
Tracking Off    19             7
Tracking On     34             10

Table 6.5

Average and minimum FPS on a Laptop, with the Tracking Module enabled and disabled

On the Raspberry Pi 3, the time needed to process the entire video stream increases significantly. Despite the tracking mechanism, the average time taken to process a frame is on the order of seconds. As shown in Table 6.6, the average FPS is 0.71.

                Average FPS    Minimum FPS
Tracking Off    0.26           0.09
Tracking On     0.71           0.14

Table 6.6
Average and minimum FPS on the Raspberry Pi 3, with the Tracking Module enabled and disabled


Conclusions

In this work, a system for moving object detection and classification in outdoor environments has been presented. The video stream that needs to be analyzed can either be acquired in real-time from a camera or provided as a previously recorded video. The system then detects moving objects by using a statistical BS method named MOG2. Subsequently, the detected objects are classified by a CNN and labeled accordingly. In addition, a tracking mechanism able to track objects over time is implemented.

Several metrics have been considered in order to evaluate the system, and the experiments showed a good propensity to properly categorize the objects: considering the Squeeze-Net 114x114x3 low resolution, a macro-averaged F1 Score of 0.55 was obtained, while the micro-averaged F1 Score was 0.72. It is necessary to emphasize, however, that the video taken into consideration for the evaluation was acquired with a low resolution camera; for this reason, such results could easily be improved by using higher resolution videos. What is more, the system can work either with the tracking mechanism enabled or disabled: in both cases the experiments showed that approximately the same results in terms of F1 Score can be obtained, but with a significant increase in performance in the first case. In fact, when enabling the tracking mechanism, only 18% of the objects are classified, increasing the average FPS from 19 to 34 on a 2.5 GHz Intel Core i7 CPU. This means that a video stream can be processed in real-time, with less than 30 milliseconds of latency per frame on average.

The system can also run on smart camera platforms: the experiments performed on a Raspberry Pi 3 show that it can work at an average of approximately 0.71 FPS when the tracking mechanism is enabled. In this case, the average processing time of a frame is significantly higher, but it can be considered acceptable given the limited hardware of the platform. Nowadays there are several smart camera platforms, although more expensive than the Raspberry Pi, with greater computing power, which would allow the system to work much faster.

The source code is available on GitHub.


