Using Virtual Worlds to Train an Object Detector for Personal Protection Equipment



Computer Engineering

Computer Systems and Networks

Master’s thesis

Using Virtual Worlds to Train an Object Detector for

Personal Protection Equipment

Candidate:

Enrico Meloni

Supervisors:

Giuseppe Amato

Fabrizio Falchi

Claudio Gennaro

Marco Di Benedetto

Academic Year 2018-2019


Contents

Abstract
1 Introduction
2 Background
2.1 Computer Vision
2.2 Object Classification
2.3 Object Detection
2.4 Neural Networks
2.5 Convolutional Neural Networks
2.6 Existing Approaches
2.7 Brief History of Object Detection
2.8 Milestone Detectors
2.8.1 RCNN (two-stage)
2.8.2 SPPNet (two-stage)
2.8.3 OverFeat (one-stage)
2.8.4 YOLO (one-stage)
2.8.5 YOLOv2 and YOLO9000 (one-stage)
2.9 Performance Evaluator: Mean Average Precision
2.9.1 Definitions
2.9.2 Mean Average Precision
2.9.3 COCO mAP
3 Related Work
4 YOLO detectors family
4.1 YOLO
4.1.1 Overview
4.1.2 Architecture
4.1.3 Limitations
4.2 YOLOv2
4.2.1 Overview
4.2.2 Architecture
4.2.3 Limitations
4.3 YOLOv3
4.3.1 Overview
4.3.2 Architecture
4.3.3 Limitations
5 Virtual Dataset Generation: V-DAENY
5.1 Dataset Generation on RAGE
5.1.1 Scenario Creator
5.1.2 Dataset Annotator
5.1.3 Bounding Boxes
5.1.4 Obstruction Check
5.1.5 Pedestrian Creation
5.1.6 Pedestrian Classification
5.1.7 Limitations
5.2 Virtual Datasets
5.2.1 Annotations
5.2.2 Images
5.2.3 First Dataset
5.2.4 Second Dataset
6 Experiments and discussion
6.1 Performance Evaluator
6.2 Experimental Settings
6.3 Confidence Intervals
6.4 Validation
6.4.1 Validation on Virtual Dataset
6.4.2 Testing on PPE Dataset
6.5 Abbreviations
6.6 Results
6.6.1 Validation on Virtual images
6.6.2 Testing on real images
6.6.3 Mixing virtual and real data
7 Implementation
7.1 Overview
7.2 Centralized System
7.3 Decentralized System
7.4 Monitoring of safety requirements


Abstract

Neural Networks are known to be an effective technique in Artificial Intelligence and in particular in Computer Vision. Their main advantage is that they can learn from examples, without the need to program into them any previous expertise or knowledge.

Recently, Deep Neural Networks have seen many successful applications, thanks to the huge amount of data that has steadily become available with the growth of the internet. When annotations are not already available, images must be manually annotated, introducing costs that can be very high. Furthermore, in some contexts even gathering valuable images can be impractical for reasons related to privacy, copyright and security.

To overcome these limitations, the research community has started to take interest in creating virtual environments for the generation of automatically annotated training samples. In previous literature, using a graphics engine for augmenting a training dataset has been shown to be a valid solution.

In this work, we applied the virtual environment approach to a task not yet considered: the detection of personal protection equipment. The first contribution is V-DAENY, a plugin for GTA-V, a famous videogame. V-DAENY allows the creation of scenarios in which most aspects can be customized: the number of people, their equipment and behavior, the weather conditions and the time of day. With V-DAENY, we automatically generated over 140,000 annotated images in several locations of the game map and with different weather conditions.

The second contribution is two different datasets composed of real images, which can be used for training and testing. One of them contains copyright-restricted images, while the second contains only copyright-free images. Both datasets contain pictures taken in various contexts, such as airports, building sites and military sites.

The third contribution is the evaluation of the performance achieved by learning with virtual data. We trained a network starting from a pre-trained YOLOv3 detector, applying a phase of Transfer Learning with virtual data and a phase of Domain Adaptation with a small amount of the manually annotated real dataset. Then, we tested the network on the remaining part of the real dataset. The network trained with this approach achieves promising performance. After being trained with only virtual data, the network achieves excellent precision on virtual data and good precision on real data. After applying Domain Adaptation, the network achieves high precision on both real and virtual data. As a comparison, applying only Domain Adaptation to the base YOLOv3 achieves a precision similar to that obtained when training with only virtual data. These results suggest that computer-generated training samples can replace most of the real dataset and still achieve very good results.


Chapter 1

Introduction

Neural Networks are known to be an effective technique in Artificial Intelligence and in particular in Computer Vision. They are inspired by the way the animal brain processes information, and their main feature is that they can learn from examples, without the need for programming into them any previous expertise or knowledge. Recently, Deep Neural Networks have seen many successful applications, thanks to the huge amount of data that has become available with the growth of the internet. The era of big data allows us to train DNNs on many tasks and different contexts, such as Object Detection of common objects found in daily life, e.g. in the COCO challenge [1].

Nowadays, several image datasets are publicly available and successfully used in many academic and commercial applications. Some examples are COCO [1], Open Images [2], Pascal VOC [3], ImageNet [4]. Many of these datasets are composed of images manually annotated by human operators, which made their creation a costly and hardly scalable process.

Objects in an image can be annotated in different ways, the most popular being bounding boxes and semantic pixel-wise segmentation. The annotation of bounding boxes in an image is a difficult and error-prone process for a series of reasons, such as the bounding box size being ambiguous in some cases. Defining standards for the annotations dramatically increases the cost of building new datasets. Semantic segmentation can be even harder, since every single pixel in the image must be annotated with the correct class.

In some contexts, gathering valuable training images can be impractical for many reasons, including privacy, security, copyright, or lack of data. For example, using security camera footage can cause privacy issues for the workers or citizens depicted. Furthermore, the occurrence of people wielding weapons in public spaces is, fortunately, very rare. A motivating example is monitoring public events, to preserve people's safety and prevent violent acts by detecting weapons and dangerous objects. Acting after a weapon has been used is often too late. When the event involves a moving crowd, human operators can have huge difficulties in noticing dangerous objects. Besides having many people to monitor, the event could be located in a large area, introducing other difficulties, such as the need for many operators to cover the whole area. An object detection network trained to recognize weapons and dangerous objects could greatly help the human operator, signaling the smaller area and location in which dangerous activities are detected. Moreover, there are no publicly available datasets that can be used for training. Furthermore, creating a dataset can be hard, because this kind of dangerous activity is rare and real images of people holding weapons during public events are not common.

Another good example is learning to detect Personal Protection Equipment (PPE) in safety-critical scenarios, such as airports, construction sites, and shipyards. Accidents in construction sites and similar scenarios are an ever-present risk, but there are cases in which injuries are caused by misused or missing equipment. As building sites become larger and safety policies become stricter, human operators find it more and more difficult to enforce these rules. Therefore, an object detection network trained to recognize PPE can greatly help the human operator, signaling the position of workers violating safety regulations. This work is mainly focused on using the virtual world approach to solve this task.


To the best of our knowledge, there is no public real dataset available for these tasks, and gathering images can be very difficult for the reasons stated above, first of all the privacy of workers and citizens. Capturing significant real examples where people misuse personal protection equipment is also difficult. These are fortunately infrequent events that would require a very long time to be observed, and that would make the training set unbalanced.

To overcome these limitations and train neural networks to perform well in these scenarios, an interesting opportunity is using photorealistic 3D rendering engines to create virtual worlds and generate training samples. The main advantage of using computer-generated training samples is that they can be automatically annotated with little to no intervention from a human operator. In this work, we explored this idea using Personal Protection Equipment detection as a test case.

We chose RAGE, the graphics engine used by the game GTA-V, since it allows the use of pieces of software called plugins to interact with the game world and to inspect the positions and sizes of the in-game objects. We created a plugin, called V-DAENY (Virtual DAtaset gENerator), which allows the user to set up a scenario with the use of web-like forms, configuring many settings such as time of day, weather conditions, positions of the viewpoints from which to capture training samples, number of people and their behavior. The equipment is randomly generated, so that the resulting dataset is balanced and all classes appear uniformly. We set up 30 scenarios in 10 different locations, with three different weather conditions for each location, an average of 12 people with different types of equipment per picture, and from three to five different viewpoints per scenario. The viewpoints have been chosen so that people appear in the training samples with different sizes and perspectives. We generated more than 140,000 images, all automatically annotated. 27 of these scenarios were used as a training set, while a random sample of 350 images from the other three scenarios was used for validation.


We created two manually annotated datasets about Personal Protection Equipment, to be used as test sets. One is composed of copyright protected images, while the second is composed exclusively of copyright-free images.

We used the virtual dataset to apply Transfer Learning to an existing pre-trained network, namely YOLOv3 trained on MS COCO. We compare the results on validation with those on testing, to quantify the performance loss when passing from virtual to real data.

We also explored the idea of using Domain Adaptation to improve performance on real data: we split the copyright-free dataset into two parts, one for Domain Adaptation and the second for testing. After this second phase of training, we compare the obtained performance to understand whether more appropriate learning techniques can help even if the 3D renderings are not photorealistic enough.

The rest of the thesis is structured as follows. Chapter 2 provides a brief background of Computer Vision, Machine Learning, and Object Detection. In Chapter 3 previous literature on the matter is reviewed and compared to the approach followed in this work. In Chapter 4, we give an overview of the state-of-the-art detectors in the YOLO family, since we used YOLOv3 as the base for our PPE detector. In Chapter 5, we describe the implementation of V-DAENY with which we generated the virtual dataset and we describe how the dataset is composed. In Chapter 6, we describe the experiments performed with the trained network and we discuss the results. In Chapter 7, we describe the design of a simple system which uses the PPE detector in a safety-critical scenario. Finally, in Chapter 8, we summarize the conclusions and discuss possible future works.


Chapter 2

Background

Object Detection is a branch of Computer Vision, which is strictly tied to Artificial Intelligence and Machine Learning. For this thesis, Deep Learning and in particular Deep Convolutional Neural Networks have been used to enable Object Detection. In this chapter, a background of these technologies will be provided.

2.1 Computer Vision

Computer Vision is an interdisciplinary field that is concerned with the extraction of high-level information from digital images or videos. It is the science that studies how to allow computers to visually sense the environment around them, similarly to how humans do.

While Computer Vision was born in universities that were pioneering Artificial Intelligence, it has not always been tied to Machine Learning. In the early days of Computer Vision, researchers studied algorithms for edge extraction, motion estimation, and shape inference. Only recently, thanks to its developments, has the use of Machine Learning become prominent in Computer Vision.


Typical tasks studied in Computer Vision are:

1. Recognition, the classical problem associated with Computer Vision: it is the task of determining whether an image contains some particular object.

2. Motion Analysis, the task of estimating the velocity of points in the scene or of the camera producing the images.

3. Scene Reconstruction, the task of reconstructing a 3D model of a scene from a series of 2D images or videos of that scene.

4. Image Restoration, the task of removing noise and defects from an image.

In this thesis, we will focus mostly on the recognition problem. Typical variations of the recognition problem are:

1. Object Classification

2. Object Detection

3. Identification

There are many more specialized tasks based on these generic tasks, some examples of which are Face Recognition, Pose Estimation, Optical Character Recognition.

2.2 Object Classification

Object Classification, or Object Categorization, is the task of assigning a category, or class, to an input. The algorithm that performs Object Classification is called a classifier. In computer vision, the input is often an image. The classifier uses salient points and important visual information of the image, called features or descriptors, to assign it a category. Historically, descriptors and features were handcrafted to be as informative as possible and the classifier received only those as input. Using a Deep Neural Network as a classifier has the advantage of needing less pre-processing, since features are computed by the upper layers of the network.

Features and descriptors can represent particular colors or positions of interest such as corners and edges. Corners can be used for example to recognize shapes, and colors can help to recognize particular objects. A red round spot would be recognized as an apple, for example.

Face Recognition is a classic example of Object Classification. The classifier is shown a picture of a face and it has to recognize whose face it is. Salient points such as the positions of the nose, mouth, and eyes would be good descriptors. Colors would also be good features, since they could help to distinguish different skin tones.

2.3 Object Detection

Object Detection is the task of finding the position and size of objects in a digital image or video, classifying each of them with a category. The main difference between Object Detection and Object Classification is that the latter classifies the image as a whole, while the former may detect and classify more than one object in the same image. Furthermore, classification does not give any location info about the object; it only states if the image presents an instance of a category.

Object Classification is easier than Object Detection: in fact, some object detectors use a classifier as a part of the detection process: a first phase suggests candidate regions, the second phase presents each region to a classifier to determine what class it belongs to or if it is background.


Object Detection had a great development following the improvements in GPU computing. GPUs have been able to support more complex architectures thanks to their huge scalability and parallel computing.

2.4 Neural Networks

Neural Networks are a class of Machine Learning systems, inspired by the biological neural networks in animals. The network is composed of a number of artificial neurons.

In Figure 2.1 we can see both a Natural Neuron (NN) and Artificial Neuron (AN). A dendrite in the NN propagates a neural signal from other neurons to the neuron body, while an axon propagates the signal from the neuron body to another neuron. A NN has many dendrites but only one axon. The signal propagated from the axon depends on the strength of the input signals. Inputs in the AN act like the dendrites in the NN, bringing outside signals into the AN; weights, transfer function and activation function in the AN act like the body of the NN, where the inputs are processed and an output is computed; the output in the AN acts like the axon in the NN, giving signals to the next neuron. An input weight represents in a sense the importance of an input in the computation of the output.

In an Artificial Neural Network (ANN), ANs are typically organized in layers. A common type of layer is the Fully Connected Layer (FCL), where each neuron of a layer takes as input the output of each neuron in the previous layer. In this way, the signal propagates from the first layer, called the input layer, to the last, called the output layer. Before Convolutional Neural Networks were developed, ANNs were usually composed exclusively of FCLs. A neural network, as of today, usually comprises millions of neurons and many millions of connections.

Neural Networks are truly a powerful tool, since they can be used to perform a task with little knowledge of how to manually solve that problem. In fact, ANNs can learn how to perform the task by analyzing many pairs of correct input-output. These pairs are called the ground truth, and the more pairs there are in the ground truth, the better the ANN will perform on the task. For example, one could train a Neural Network to recognize a handwritten digit without programming anything about digits, by showing it a big set of handwritten digits alongside the correct digit [5].

(a) Natural Neuron
(b) Artificial Neuron

Figure 2.1: Comparison of a Natural Neuron and an Artificial Neuron.

The most desirable property of an ANN is to be able to give correct answers also when exposed to inputs it has never seen before. This property is called generalization. To achieve generalization, the ANN must be trained not only with a good number of samples but also with good quality samples. By good quality, we mean that samples must represent the objects of interest in many different variations, such as lighting, orientation, and colors. For example, if we only show frontal pictures of a face, the network will never be able to recognize a face from the side. Not being able to recognize an object when slight variations occur is a phenomenon called overfitting. It happens when the ANN begins to learn how to distinguish only objects in the training samples, focusing on features that are not relevant in the general case but only to those images on which it trains. This is instead the most undesirable property of an ANN, as it makes the network unable to correctly perform a task on previously unseen images.

A common algorithm for training is Gradient Descent with Error Backpropagation. This algorithm has two phases: the forward phase, where the network is shown a sample and produces an output; the output is then compared to the ground truth to obtain the error. The error is then used in the backward phase, where the weight of each input in each layer is updated so that the output will be closer to the ground truth. The error is allocated to each input proportionally to its weight: in a sense, the input with the higher weight has played a bigger part in causing the error.

The error can be computed on-line, that is, updating the weights for each sample shown, or it can be computed with a batch, that is, updating the weights after all the samples in the training set have been shown. Halfway between these methods there is the Stochastic Gradient Descent, where only a random subset of the training set is shown to the network before updating the weights.
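As an illustration of the three update schemes (a minimal Python sketch of ours, not code used in this thesis; the linear model, squared-error loss, and learning rate are placeholder choices):

import numpy as np

def gradient(w, x, y):
    # Gradient of the mean squared error for a linear model y_hat = x @ w.
    return 2 * x.T @ (x @ w - y) / len(x)

def train(w, X, Y, lr=0.01, epochs=10, batch_size=None):
    # batch_size=1 gives on-line updates, batch_size=None gives full-batch
    # updates, anything in between gives Stochastic Gradient Descent.
    n = len(X)
    size = n if batch_size is None else batch_size
    for _ in range(epochs):
        order = np.random.permutation(n)
        for start in range(0, n, size):
            idx = order[start:start + size]
            w = w - lr * gradient(w, X[idx], Y[idx])
    return w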

Neural Networks have had highs and lows during the years. The idea of Artificial Neural Networks has been around since 1943, but the interest of the research community was not always high. The main problem with ANNs is the high computing workload of training them. As the hardware improved, ANNs gained new interest, until they again hit the obstacle of the hardware not performing well enough. Recently, with the improvements in parallel computing brought by GPUs, Deep Neural Networks and Convolutional Neural Networks obtained an important place among machine learning techniques. They can take excellent advantage of parallel computing: the output of each neuron in a layer can be computed independently from the others, since there is usually no connection between neurons in the same layer. GPUs are inherently parallel computing machines, therefore fitting the ANN computing model perfectly.

2.5 Convolutional Neural Networks

Convolutional Neural Networks, also known as CNN or ConvNets, are a class of Neural Networks commonly applied to the analysis of digital images. They were inspired by the biological process of the visual cortex in animals: each neuron responds to stimuli only in a specific region of the visual field, called receptive field. Each receptive field partially overlaps the receptive fields of other neurons. Together, they fully cover the visual field. In Figure 2.2 we can see a comparison of natural and artificial receptive field.

The same concept is used in CNNs: each artificial neuron is not connected to every input, but only to the inputs inside its receptive field. In a CNN, each receptive field is replicated across the whole layer, so all neurons share the same weights applied to the inputs in their receptive fields. This is a huge optimization with respect to a fully connected network: if the input layer of an FCN takes a 100x100 image, which is very small, each neuron in the first layer would have 10,000 connections. Instead, if the neuron only responds to a receptive field of 5x5 pixels, it will have just 25 connections to train.

(a) Natural Receptive Field
(b) Artificial Receptive Field

Figure 2.2: Comparison of a Natural Receptive Field and an Artificial Receptive Field. Figure 2.2a is a work by Cenveo licensed under CC 3.0

A set of weights in a receptive field is called a filter and represents a particular local feature, e.g. the presence of a corner or a particular color. Each neuron can have multiple filters applied to the same receptive field.
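To make the saving concrete, the following Python sketch (ours, with an assumed layer of 32 filters) counts the trainable weights of a fully connected neuron on a 100x100 input against a 5x5 convolutional filter:

# One fully connected neuron on a 100x100 input: one weight per pixel.
fc_weights_per_neuron = 100 * 100                  # 10,000 connections

# One 5x5 convolutional filter: 25 weights, shared across all positions.
conv_weights_per_filter = 5 * 5                    # 25 connections

# A convolutional layer on a 100x100x3 RGB input with 32 filters of size 5x5:
# each filter also spans the 3 input channels.
filters, kernel, channels = 32, 5, 3
conv_layer_weights = filters * kernel * kernel * channels   # 2,400 weights

print(fc_weights_per_neuron, conv_weights_per_filter, conv_layer_weights)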

Convolutional Neural Networks can be deeper than Fully Connected Networks, stemming from the fact that the number of parameters in each layer can be much lower in the first case than in the second case. Furthermore, the convolutional architecture exploits the strong spatial correlation in natural images. Let’s take as an example an image of a leaf: if venation appears in the region covered by a receptive field, also nearby receptive fields will probably show venation. Therefore, a feature in a region strongly affects nearby regions.

This is done thanks to the following characteristics of a Convolutional Neural Network:

• Neurons 3D arrangement: neurons in a convolutional layer can be arranged in three dimensions: length, width, and depth. An example of depth in a 100x100 RGB image would be its colors: each pixel has a horizontal and a vertical position, and three values that represent the intensity of red, green and blue. Therefore the input layer would be arranged in a 100x100x3 structure.

• Local Connectivity: using the concept of receptive fields, neurons are connected to a small number of nearby neurons in the previous layer, therefore producing stronger responses to local input patterns. When many such layers are stacked, the network first produces representations of a small portion of the image, then puts these parts together to create composed representations of a larger area.

• Shared Weights: filters are replicated throughout the whole visual field. This means that they all share the same parameters, thus features can be detected independently of their position in the visual field. This is an important property, called translation invariance, that earned CNNs the alias of Shift Invariant Artificial Neural Networks.

Like normal Neural Networks, Convolutional Neural Networks can also take advantage of the parallel computing capabilities of a GPU. Every filter in a Convolutional Layer can be applied independently, therefore it can be computed in parallel on the GPU.

2.6 Existing Approaches

During past years, thanks to the many improvements of software and hardware, such as GPU computing, work on Object Detection using Deep Learning has proliferated. A complete survey of the progress made in the past two years can be found in [6], which has been an excellent resource throughout all the thesis work. In this work, we will summarize the most important breakthroughs relative to the focus of the thesis.

2.7 Brief History of Object Detection

Up to 1990, Object Detection was mostly based on geometric representations, such as shape, corners, edges, etc. [7]. Geometric representations are handcrafted based on a 3D geometric model of the object. Geometric representations have some advantages, such as invariance to viewpoint and illumination, and they can take advantage of the fact that man-made objects are usually described by primitive geometric shapes.

In the late 1990s and early 2000s, detection by geometric models was replaced by statistical classifiers (such as Neural Networks, Support Vector Machines, etc.) based on appearance features. Appearance features are not invariant to viewpoint and lighting, but can take advantage of statistical classifiers, that is, they can be learned. Therefore, there is no need to know a precise 3D geometric model of the object of interest, which often occupies only a small region of the image. Most of the work on Object Detection was based on local appearance features [8], such as SIFT [9], Haar-like features [10], and Histogram of Gradients [11]. However, even if appearance features can be learned by a statistical classifier, they still have to be handcrafted by experts.

Up until 2012, computer vision was carried out with tuned pipelines of hand-crafted local features. In 2012, a Convolutional Neural Network broke the record for image classification in the ImageNet classification challenge [12]. The main advantage of using CNNs instead of handcrafted features is that each layer of a CNN produces a more abstract representation of the image, practically learning also how to produce a feature, and not just how to recognize it. A filter can be seen as a local appearance feature computer, similar to those that were handcrafted in the past. The success of CNNs on object classification was then transferred to Object Detection, with the first successful application in Object Detection: RCNN [13]. Since then, Convolutional Neural Networks have dominated the world of Object Detection, as GPU computing improved and more datasets and challenges became available, such as ImageNet [4] and MS COCO [1].

2.8 Milestone Detectors

Here we will analyze the most popular Object Detection Networks since the beginning of the use of CNNs for the task. Most of the detectors proposed in literature are based on one of these networks, with improvements or small changes. Detectors can be categorized into two broad classes:

• two-stage detectors: these detectors have a two-stage pipeline. The first stage is region proposal, where the network produces a set of regions of interest in the image, where an object could be present, without using any information about its category. The second stage is a classifier, to which each region is fed as input, which determines if an object is actually present and, if it is, which category it belongs to.

• one-stage detectors: these detectors do not separate region proposal from classification, therefore the pipeline has only one stage. They are usually faster than two-stage detectors. Training is also faster, because in two-stage detectors each stage must be trained separately and thus it is harder to optimize their integration.

(a) RCNN architecture
(b) YOLO architecture

Figure 2.3: Architectures of RCNN and YOLO. Images extracted from [6]

2.8.1 RCNN (two-stage)

RCNN [13] was the first successful application of a CNN to Object Detection: a region proposal algorithm (selective search) extracts a set of candidate regions, each region is warped to a fixed size and passed through a CNN to extract features, and the features are then used to classify the region. This approach achieves high object detection quality, but it has major drawbacks both in training and testing. In training, each stage must be trained separately, and thus it is hard to optimize the final result. In testing, each proposed region must be passed through the CNN. Since the proposed regions can be many, this dramatically slows down the process. The architecture is exemplified in Figure 2.3a.

2.8.2 SPPNet (two-stage)

SPPNet [15] was born to overcome the obvious disadvantage of RCNN during testing. It introduces Spatial Pyramid Pooling [15], which is a technique to aggregate information into a fixed-length feature vector in the final layers of the CNN. This allows images of arbitrary size as input, since the final layers aggregate the convolutional features into a fixed-length feature vector that can be fed to a Fully Connected layer (which usually comes last in CNNs). SPPNet showed that an SPP layer can be used to greatly speed up testing, up to 100 times. The image is passed through the CNN, obtaining a feature map of the image. Proposed regions are mapped to the feature map. Each of the proposed regions is passed through the SPP layer, obtaining a fixed-length feature vector for that region. The feature vector is then passed to a Fully Connected layer for classification. While SPPNet is faster than RCNN in testing, with comparable or better detection quality, training is not improved, since the two stages must still be trained separately.

2.8.3 OverFeat (one-stage)

OverFeat [16] was one of the first and most successful one-stage detectors proposed in the literature. It performs detection in a multiscale sliding window, using only convolutional layers, except for the final classification and regression layers which compute the bounding boxes and their classes. The convolutional layers share computation between windows of the image, while the network internally rescales the image to up to six different scales. In [17] OverFeat was compared to RCNN, since they were proposed in the same period. They show that OverFeat is much faster than RCNN (up to 9x) but has a mAP lower by about 10%.

2.8.4 YOLO (one-stage)

YOLO [18] is a one-stage detector which tries to solve the object detection problem as a regression problem from the image pixels to bounding boxes and their class probabilities. The region proposal phase is replaced by a set of precomputed candidate regions. The main difference from region-based approaches such as RCNN is that YOLO uses features from the full image to compute bounding boxes and classes. The most important feature of YOLO is that it is fast by design, running at almost 45 fps, but with lower precision. The YOLO architecture is covered in depth in Chapter 4, and it is exemplified in Figure 2.3b.

2.8.5 YOLOv2 and YOLO9000 (one-stage)

YOLOv2 and YOLO9000 [19] are improved versions of the original YOLO. The underlying GoogLeNet in YOLO was replaced by Darknet19 [20], an open source neural network engine written in C, made by the same authors of YOLO. Other improvements are the use of state-of-the-art techniques such as batch normalization, multiscale training and computing the candidate boxes using k-means on the training set bounding boxes. YOLO9000 can detect over 9000 object classes. This feat was accomplished by a joint training method with both ImageNet and COCO.

2.9 Performance Evaluator: Mean Average Precision

In Object Detection, the most popular metric for evaluating the performance of a detector is Average Precision (AP). This metric, which is a number between 0 and 1, measures the probability of the detector giving a correct bounding box for an object. A detector with precision near 1 will give mostly correct answers. Mean Average Precision (mAP) is computed by averaging the AP computed over each object class. In this section, we will describe in depth how to compute mAP, by first defining Precision, Recall and Intersection over Union. The examples in this section have been adapted from [21].

2.9.1 Definitions

To define precision and recall, we first need to define True Positives, False Positives, and False Negatives.

True Positive. A True Positive (TP) is an object which was correctly detected in the image.

False Positive. A False Positive (FP) is a detection that does not correspond to an object in the image.

False Negative. A False Negative (FN) is an object which is present in the image but was not detected.

With these measures, we can define Precision, Recall and Intersection over Union.

Precision. Precision (P) measures the percentage of correct predictions, that is, how accurate the detector is.

P = TP / (TP + FP)   (2.1)

Recall. Recall (R) measures the percentage of objects in the test set that were correctly detected, that is, how good the detector is at finding all objects.

R = TP / (TP + FN)   (2.2)

Intersection over Union. Intersection over Union (IoU) measures how much two bounding boxes overlap, and it is used to determine if a prediction is correct. When a detected bounding box (D) overlaps a bounding box in the ground truth (GT) more than a defined IoU threshold (e.g. 0.5, the most common value) and the classes match, the prediction is considered correct.

IoU = area(D ∩ GT) / area(D ∪ GT)   (2.3)
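The following Python sketch implements Equation 2.3 (ours, assuming boxes given as (x1, y1, x2, y2) corner coordinates, which is not necessarily the format used by the detector):

def iou(box_a, box_b):
    # Boxes are (x1, y1, x2, y2) with x1 < x2 and y1 < y2.
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle (zero if the boxes do not overlap).
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    # Union = sum of the two areas minus the intersection.
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0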

2.9.2 Mean Average Precision

To explain how to compute AP, we propose a simple example that will make the computation clear. Suppose that in the whole test set we have only five objects of class Person. We run the detector over the whole test set and collect the predictions, which will be a list of bounding boxes with associated classes and confidence scores. The detection must be done setting a low enough confidence threshold so that all the objects in the test set are correctly found; Darknet, for example, uses a confidence threshold set to 0.005. We sort the list by decreasing confidence score and we determine whether each prediction is correct, using a fixed IoU threshold.

We can consider each element in the prediction list as a row in a table that has as columns: the rank (i.e. the order in the list), the correctness, the precision so far, and the recall so far. By "so far", we mean considering only elements with a lower rank. We stop considering detections when Recall reaches 1. In Table 2.1 we can see an example where the detector produces 10 detections in total, of which five are True Positives and five are False Positives. As an example, we compute precision and recall for the fourth row.

Precision is the portion of True Positives out of all predictions so far:

P = 2/4 = 0.5   (2.4)

Recall is the portion of True Positives out of all objects in the test set:

R = 2/5 = 0.4   (2.5)

Table 2.1: Recall-Precision table

Rank   Correctness   Precision   Recall
1      True          1.0         0.2
2      True          1.0         0.4
3      False         0.67        0.4
4      False         0.5         0.4
5      False         0.4         0.4
6      True          0.5         0.6
7      True          0.57        0.8
8      False         0.5         0.8
9      False         0.44        0.8
10     True          0.5         1.0

We can plot the data in Table 2.1, obtaining the curve depicted in Figure 2.4. We can see that Recall can only increase as rank increases, while Precision has a sawtooth pattern. AP is computed as the average value of precision over every value of recall. This is the same as computing the Area Under the Precision-Recall Curve (AUC). There are many different approximations used in literature, which only differ in the quality of the final value computed.

After having computed the AP for each object class in the test set, mAP is computed by averaging the AP for each class. It summarizes the capability of the detector to produce correct detections. This mAP computation is used in the PASCAL VOC Challenge, and it is also the one used in this work.
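The whole procedure can be written compactly; the Python sketch below is ours and uses a plain area-under-the-curve approximation (one of the many approximations mentioned above), reproducing the numbers of the example in Table 2.1:

def average_precision(correctness, num_ground_truth):
    # correctness: True/False for each prediction, sorted by decreasing confidence.
    tp, ap, prev_recall = 0, 0.0, 0.0
    for rank, correct in enumerate(correctness, start=1):
        if correct:
            tp += 1
        precision = tp / rank
        recall = tp / num_ground_truth
        # Accumulate the area under the precision-recall curve.
        ap += precision * (recall - prev_recall)
        prev_recall = recall
    return ap

# Example of Table 2.1: five objects of class Person, ten predictions.
ranked = [True, True, False, False, False, True, True, False, False, True]
ap_person = average_precision(ranked, num_ground_truth=5)   # about 0.71
# mAP is then simply the mean of the per-class AP values.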


2.9.3 COCO mAP

In the COCO challenge, the mAP is computed in a different way. COCO mAP tries to award higher scores to detectors which are better at predicting very precise bounding boxes in both position and size, instead of those that only approximately locate objects. The main difference is that AP is computed with different values of the IoU threshold, starting from 0.5 up to 0.95 with steps of 0.05, and all classes are considered at the same time. Therefore, in the context of the COCO challenge, AP and mAP are considered synonyms.

While it is true that the COCO mAP rewards precise detectors, the main criticism is that the quality of detection at higher IoU thresholds may not be very important or noticeable to end users. The original mAP definition for the PASCAL VOC Challenge [22] stated that the IoU threshold was chosen as 0.5 to take into account possible inaccuracies in the bounding boxes of the test set. Thus, COCO mAP could unnecessarily punish a detector when the test set is manually and approximately annotated, as is the case in this work. Furthermore, in [23] it is shown that humans have difficulties in distinguishing bounding boxes with IoU of 0.3 from those with IoU of 0.5.


Chapter 3

Related Work

In this chapter, we present previous works that explored the idea of using virtual images for training a CNN, stating both similarities and differences with respect to our own approach. Using virtual data in computer vision has been successful for a long time, as shown in a work published in 2007 [24], where a video game called Half-Life was used to train and evaluate tracking in a surveillance system. Their work already showed that using virtual data or real data can lead to similar performance, by comparing the results obtained using only virtual data with those obtained using only real data.

The idea of training a deep neural network using GTA V, a well-known videogame from Rockstar Games, was presented for the first time in [25]. Similarly to our work, the annotations are automatically generated, but the method for generating them is completely different: unlike our work, they do not use the game API to create a mod, but instead use a graphics debugging tool called RenderDoc to analyze calls to the system graphics rendering module and infer pixel-level semantics from them. While this allows for more fine-grained annotations, it is less flexible and customizable than our approach. It is, however, a more general approach, as it can be applied also where the game engine APIs are not available.


In [26] they used GTA V for training a self-driving car. They generated around 480,000 images for training and 50,000 images for testing. Their work showed how GTA V can indeed be used to automatically generate a large dataset, but with many limitations, which are better explained in Subsection 5.1.7 alongside those we found ourselves. The use of GTA V to train a self-driving car was explored also in [27], where images from the game were used to train a classifier for recognizing the presence of stop signs in an image and estimating their distance. The annotation is automatically obtained through the use of a mod, similarly to our work. They collected around 1.4 million images, then used them to train a CNN. Their work showed good performance on a virtual validation set, but poor performance on real data. They explain the poor performance by noting that real data and virtual data differ greatly in camera model and data collection setup. In [28] a different game is used for training a self-driving car: TORCS, an open source racing simulator with a graphics engine less focused on realism compared to GTA V. The difference in photo-realism can be clearly seen in Figure 3.1. In their work, they show that the game can be used to train a network to drive a car, measuring its performance with the KITTI dataset for distance estimation. Their results show that the network learns well to drive the in-game car, but the performance on real data is worse. This could be explained by the reduced photo-realism of the game.

Back to GTA V, the game was also used in [29] to train a network for pedestrian tracking and pose estimation. They generated a dataset of 460,800 frames, with dense annotations about human joints and poses; this virtual data was used to train a network for people tracking and pose estimation. These works provided us the idea of using GTA V to generate a virtual training dataset, adapting their ideas to the context of object detection and classification.

An approach for using virtual clones of real-world scenarios is presented in [30]; in this work, they use real images to clone a real-life scenario into a virtual world, which can then be used to generate a virtually limitless amount of data, with varying weather conditions and camera viewpoints. They show the performance gains that can be obtained by pre-training a network on virtual images and then fine-tuning on real images. They also insist on the importance of varying weather conditions and camera viewpoints. These ideas were considered during our experiments, but we decided to create a completely synthetic world, which would not need any real annotated data at all. [30] also shows that performance measured using virtual clones as a test dataset can be very similar to performance measured on real images, suggesting that virtual datasets can indeed replace manually annotated real images. The main drawback of this approach, as also stated in [25], is that, currently, open-source or custom graphics engines lack the amount of realism that is provided by proprietary game engines such as the one used by GTA V.

(a) TORCS
(b) GTA V

Figure 3.1: Comparison of the photo-realism of TORCS and GTA V.

Chapter 4

YOLO detectors family

There are three detectors belonging to the YOLO family: YOLO, YOLOv2, and YOLOv3. In this chapter, we describe their main features and differences and discuss the performance of each of them. Finally, we explain why we chose YOLOv3 as the backing network for our object detection network.

4.1 YOLO

YOLO (You Only Look Once) is a Convolutional Neural Network for Object Detection,

proposed in [18]. It is the first detector proposed in the YOLO family.

4.1.1 Overview

YOLO uses a one-stage detection framework, while the majority of previous object detectors usually adopted a two-stage detection framework. A two-stage detection framework is composed of region proposal, where a set of regions in the image are proposed as candidate objects, and classification, where each of the proposed regions is categorized as belonging to a class or discarded as background. YOLO drops the region proposal stage and unifies it with the classification stage, therefore in a single stage both the regions and their categories are proposed.

To detect objects in a particular region, YOLO uses the features of the entire image at the same time, instead of exclusively the local features of that region. This means that YOLO also contextually learns to distinguish objects from their background and thus produces fewer false positives than previous detectors.

Since region proposal is dropped, YOLO uses Fully Connected Layers on top of the convolutional features to compute bounding boxes. Given an image, YOLO splits it into an SxS grid. For each grid cell, a set of C class probabilities is predicted, together with B bounding box locations and confidence scores. All of these predictions are represented by an SxSx(5B+C) tensor.

While the clear advantage of this architecture is speed, the major drawback is that localization errors are more likely, resulting from the coarse splitting of the image and the limited number of candidate anchor boxes. This is especially true for small objects: they are more likely not to be seen by the network.

The YOLO architecture is faster than most detectors: its base architecture runs at 45 fps on a Titan X GPU; there is also a fast version that runs at 150 fps. It is behind most non-real-time detectors in terms of mAP, but achieves double the mAP of other real-time detectors.

4.1.2 Architecture

As previously stated, YOLO does not use a region proposal module to detect candidate regions of interest. Instead, it processes the image as a whole. In Figure 4.1 we can see the full architecture of the YOLO network.

Figure 4.1: Full YOLO architecture diagram. Image extracted from [18]

The input image is split into an SxS grid. Each cell in the grid is responsible for objects whose center falls inside that cell. Each cell proposes B bounding boxes and a confidence score for each of them. A confidence score is interpreted as Pr(Obj) * IoU(truth, prediction). This means that if the confidence is near zero, the bounding box does not cover an object and contains only background. If the box covers an object, the confidence score estimates how well the bounding box matches the ground truth. This structure is exemplified in Figure 4.2.

Other than the confidence score, YOLO produces four values for each box: x, y, w, h. (x, y) is the center of the box relative to the upper-left corner of the cell. w and h are the width and height of the box relative to the full image. Having x and y relative to the cell boundaries instead of the full image boundaries is very important: if the image boundaries were chosen, the network would lose the invariance property with respect to the position of the object in the image. In fact, it would learn to detect objects only in the same positions they have in the training set.

Finally, each cell also produces C conditional class probabilities, which are used to classify the object in the bounding box, if present. Thus, each value is to be interpreted as Pr(class_i | Obj), which is the probability of the object belonging to class i given that an object is present in the cell.
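As an illustration of the SxSx(5B+C) output tensor (a sketch of ours with the values S = 7, B = 2, C = 20 used in [18]; the exact memory layout is an assumption and the tensor here is random, not a real network output):

import numpy as np

S, B, C = 7, 2, 20
output = np.random.rand(S, S, 5 * B + C)     # stand-in for the network output

cell = output[3, 4]                          # one grid cell
boxes = cell[:5 * B].reshape(B, 5)           # each row: x, y, w, h, confidence
class_probs = cell[5 * B:]                   # Pr(class_i | Obj), one set per cell

# Class-specific score for each box: confidence * conditional class probability.
scores = boxes[:, 4:5] * class_probs         # shape (B, C)
best_box, best_class = np.unravel_index(scores.argmax(), scores.shape)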


4.1.3 Limitations

YOLO has strong limitations in its detection performance. Since each cell can detect only B objects centered in the same cell, YOLO struggles with groups of nearby objects. Furthermore, the cell predicts only one class, therefore the network has difficulties in correctly recognizing two nearby objects of different classes.

YOLO learns to predict bounding boxes from the training set; therefore, it has difficulties when faced with objects with unusual aspect ratios.

The loss function used during training doesn't differentiate errors made on small bounding boxes from those made on large bounding boxes. This can decrease the accuracy of the loss function, since small errors on small boxes are much more influential than small errors on large boxes.

4.2 YOLOv2

YOLOv2 and YOLO9000 have been proposed at a later time [19] as improvements to YOLO.

4.2.1 Overview

In YOLOv2 the underlying network has been changed from GoogLeNet to Darknet19, which includes state-of-the-art strategies like batch normalization and anchor boxes. Anchor boxes replace the fully connected layers and are learned with k-means, and multi-scale training is used.

YOLOv2 can run at different sizes, offering different tradeoffs in terms of speed versus accuracy. For example, the size that allows 67 FPS achieves 76.8% mAP on VOC 2007; YOLOv2 outperforms other state-of-the-art networks such as Faster R-CNN, while still being faster than them.

YOLO9000 addresses the problem of the gap between object detection and object classification datasets. The latter are usually bigger, since producing labels for object classification is much cheaper. YOLO9000 is YOLOv2 trained jointly on a detection and a classification dataset. Detection and classification samples are mixed in the training set. When a detection sample is presented, the backpropagation reaches all the network weights; when a classification sample is presented, the backpropagation stops just after updating the classification part of the model. This allows the network to detect objects for which no detection data is available. In this way, YOLO9000 is able to detect over 9000 object categories, even if not with great performance.

4.2.2 Architecture

One of the main improvements from YOLO is that the FC layers that computed bounding boxes are dropped and replaced by pre-computed anchor boxes. The idea of anchor boxes is inspired by the Region Proposal Networks used in two-stage detectors. Anchor boxes could be hand-picked, but a poor choice could make the training more difficult for the network. Instead, the anchor boxes are computed through k-means clustering on the training set. The number k of clusters was chosen as a tradeoff between accuracy and model complexity, finding in k = 5 the best compromise. Therefore, each grid cell proposes five bounding boxes, using each of the anchor boxes.

Blindly applying anchor boxes to YOLO caused model instability, that is, the network would take many iterations before stabilizing to precise locations. Region Proposal Networks output t_x and t_y, and the bounding box center coordinates are computed as x = (t_x * w_a) - x_a and y = (t_y * h_a) - y_a, where w_a, h_a, x_a, y_a are respectively the width, height, and center coordinates of the anchor box. In this formulation, t_x and t_y are not constrained, so the predicted center can fall anywhere in the image, regardless of the cell which detected the object. With random initialization of weights, stabilization can take a long time.

For each cell, YOLOv2 proposes five bounding boxes, one for each anchor box. For each bounding box, five values are produced: t_x, t_y, t_w, t_h, and t_o. All of these values are constrained between 0 and 1 by a logistic activation function. In Figure 4.3 we can see how the values are used to compute the bounding box center and size.

Figure 4.3: How bounding box locations are computed from the output coordinates. Image extracted from [19]
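A sketch of this decoding, following the formulation given in [19] (the names are ours; (c_x, c_y) is the top-left corner of the predicting cell and (p_w, p_h) the size of the anchor box prior):

import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    # The center is constrained to fall inside the predicting cell.
    bx = sigmoid(tx) + cx
    by = sigmoid(ty) + cy
    # Width and height rescale the anchor box prior.
    bw = pw * math.exp(tw)
    bh = ph * math.exp(th)
    return bx, by, bw, bh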

During the training of the network, the input size is changed randomly every few iterations, pulling the size from the following set of multiples of 32: 320, 352, ..., 608. This forces the network to learn even if the input has different sizes with respect to the images in the training dataset.

4.2.3 Limitations

The use of anchor boxes slightly worsens the achieved mAP, from 69.5% mAP to 69.2% mAP. But this almost insignificant drop is counterbalanced by a big increase in recall, from 81% to 88%, meaning that the model has room to improve further.

YOLO9000 is the first promising step towards closing the gap between the sizes of object detection and object classification datasets, but it still achieves low mAP on the categories for which there is no detection data. On these categories, it achieves only 16.0% mAP. While this is a low mAP, it is still an important feat, as it is achieved without any detection samples for those categories.

4.3 YOLOv3

YOLOv3 [31] is the latest detector in the YOLO family, introducing mostly small improvements and bugfixes.

4.3.1 Overview

YOLOv3's main improvement is the introduction of a new convolutional feature extractor, Darknet-53, which replaces Darknet-19 from YOLOv2. As the name suggests, it contains 53 convolutional layers. Class prediction has been improved using multilabel classification. This helps when the network is used in more complex domains like the Open Images Dataset, where the same box can have many overlapping labels, e.g. dog and animal. YOLOv3 also introduces multiscale detection: the detector predicts boxes at three different scales.


4.3.2 Architecture

Darknet-53, YOLOv3's feature extractor, is composed mainly of 3x3 and 1x1 convolutional layers, with several shortcut layers, making it much larger than Darknet-19. This makes it more powerful than Darknet-19, both in accuracy and in efficiency. In fact, [31] shows that Darknet-53 is 17% more efficient in terms of BFLOP/s, meaning that the new feature extractor can better exploit GPU parallelization. The main drawback is that Darknet-53 runs at less than half the speed of Darknet-19, mainly because the former is deeper and more complex, requiring more than double the number of operations. Even if it is slower than Darknet-19, it is still faster than state-of-the-art classifiers, with comparable accuracy.

Multilabel classification is obtained by predicting classes with independent logistic classifiers instead of using a softmax layer. A softmax layer would make each class mutually exclusive, but this is not the case for more complex datasets such as Open Images, where a box can have more than one overlapping label: for example, the network would have to choose between dog and animal, while the ground truth requires both labels to be predicted. Independent logistic classifiers do not cause this issue, while not badly affecting performance.
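A minimal sketch of the difference (ours, with made-up scores for three labels):

import numpy as np

logits = np.array([2.0, 1.5, -1.0])          # e.g. scores for dog, animal, car

# Softmax: the probabilities sum to one, so "dog" and "animal" compete.
softmax = np.exp(logits) / np.exp(logits).sum()

# Independent logistic classifiers: each label is scored on its own,
# so both "dog" and "animal" can be predicted at the same time.
sigmoid = 1.0 / (1.0 + np.exp(-logits))
predicted_labels = sigmoid > 0.5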

Multiscale detection is achieved with a technique similar to Feature Pyramid Networks [32]. After the feature extractor, a few convolutional layers predict three bounding boxes (four offsets each) alongside objectness (one value) and class predictions (80 values), encoded in an SxSx[3 * (4 + 1 + 80)] 3D tensor. The feature map from two layers previous is then concatenated with a feature map from earlier in the network. The result contains both semantic information from the later layers and finer-grained information from the early layers. A few more convolutional layers are used to predict another 3D tensor, which is concatenated to the previous one. This technique is applied a second time, to ultimately obtain an SxSx[9 * (4 + 1 + 80)] 3D tensor. Therefore, the detector predicts 9 bounding boxes per cell, three for each of the three scales. The anchor boxes are still computed with k-means clustering on the training set, choosing nine clusters that are divided evenly across the three scales.

4.3.3 Limitations

YOLOv3 is considerably slower than YOLOv2, even though it still runs at more than real-time speed. Most of the results described in [31] use the same mAP detection metric we used in this work but, as the authors also note, YOLOv3 is not as strong when measured with the COCO AP metric, which averages the mAP achieved when varying the IoU threshold from 0.5 to 0.95. Besides speed, the greater complexity of the feature extractor also means higher memory requirements.


Chapter 5

Virtual Dataset Generation:

V-DAENY

The first contribution of this thesis is V-DAENY (Virtual DAtaset gENerator), a tool for generating a training set using a virtual world, which can be useful when the creation and annotation of a real-world dataset are too expensive in terms of both money and time spent. While other works have used a virtual world for generating a training set, to the best of our knowledge there is no public tool that helps in generating annotated samples for object detection. A similar tool, described in [29], uses GTA-V to produce a dataset, but for pose estimation and pedestrian tracking. The virtual world chosen for this work is Rockstar Advanced Game Engine (RAGE), which is used in some modern videogames, such as GTA V, and presents an almost photorealistic environment. The dataset can then be used to train an Object Detection Convolutional Neural Network.

5.1 Dataset Generation on RAGE

V-DAENY is developed as a plugin for RAGE; it hooks up to a GTA-V game instance and interacts with the game engine: it can add elements to the scene, inspect their positions, change the behavior of the people depicted, and so on. With V-DAENY, a scenario can be custom made, from the number of people to the position of the camera which will record the scene. V-DAENY also has access to the screen position of an object in the game: this is the most important piece of information, since the CNNs are trained on 2D images, and thus the object position must be reported as a screen position.

In a similar fashion to [29], V-DAENY is composed of two main components: the Scenario Creator and the Dataset Annotator.

5.1.1 Scenario Creator

The Scenario Creator, as the name suggests, manages the creation of the scenario to be recorded and annotated. This is done through a series of forms, similar to those used on websites, that are shown on the game screen. The available forms are:

1. Camera Form: used to set up the viewpoints from which the scenario will be recorded. The user can completely customize a viewpoint, both in position and orientation.

2. Pedestrians Form: used to set up the number of people in the scene and their behavior. Behavior can be chosen from a set offered by the game engine, such as wandering around an area, chatting among themselves, fighting, and so on.

3. Place Form: used to set up the place where the pedestrians will be generated. There is a set of pre-configured places to choose from, but the user can also define a custom place by manually inserting the coordinates or by moving the playing character to that place.

4. Time Form: used to set up the time of day during which the scene will take place.

5. Weather Form: used to set up the weather conditions under which the scene will take place.


Each of the viewpoints set with the Camera Form will be used during the generation: a single picture will be taken from each of them, and the pictures will be almost simultaneous. Exact simultaneity is impossible due to how the game engine works: when the viewpoint from which the scene is rendered changes, a few frames are needed to properly render everything. Therefore, two pictures that should be simultaneous will actually be a few frames apart. This is not a real problem, since the people in the scene usually move slowly, so the difference between two pictures is not appreciable.

All these settings can be saved to a scenario file that can be loaded at a later time. The file is in the JSON format, making it easy to edit a scenario without loading the game and using the Scenario Creator. JSON is also easily human-readable, allowing a user to understand the contents of a scenario without loading it in the game.

An example of such a scenario can be seen in Listing 5.1. It defines two viewpoints, 12 pedestrians with the wander behavior (identified by the value 3 of PedBehavior), and a custom location where the scenario takes place. The scene is set at 16:00, with the weather condition encoded by the Weather field.

Listing 5.1: Sample scenario

1 { 2 "CameraSettings": { 3 "Cameras": [ 4 { 5 "Position": { 6 "X": 913.0537, 7 "Y": -3057.90112, 8 "Z": 5.94400358 9 }, 10 "Rotation": { 11 "Pitch": 7.24213648, 12 "Roll": 0.0, 13 "Yaw": 167.974014 14 }, 15 "Fov": 50.0 16 }, 17 {

(47)

19 "X": 916.6541, 20 "Y": -3064.105, 21 "Z": 6.474021 22 }, 23 "Rotation": { 24 "Pitch": 0.447376281, 25 "Roll": 1.33406193E-08, 26 "Yaw": 61.70442 27 }, 28 "Fov": 50.0 29 } 30 ] 31 }, 32 "PedsSettings": { 33 "PedsNumber": 12, 34 "PedBehavior": 3, 35 "PedShouldGroup": false 36 }, 37 "PlaceSettings": { 38 "Place": { 39 "Name": "CUSTOM", 40 "Position": { 41 "X": 910.899, 42 "Y": -3064.40039, 43 "Z": 5.900763 44 } 45 } 46 }, 47 "TimeSettings": { 48 "Hour": 16, 49 "Minute": 0 50 }, 51 "WeatherSettings": { 52 "Weather": 3 53 } 54 }

While Camera Settings and Place Settings are hard to edit manually, since they involve positions and rotations that are not easily understood without visual feedback, the other settings can be easily changed to quickly create several weather and time variations of the same scenario. In fact, this method was used during the thesis to create most of the scenarios in the dataset: there are 10 base scenarios and, for each of them, two additional variations were created, totaling 30 scenarios.

To automate the generation of images with the minimum amount of user interaction, V-DAENY can load several scenarios at once from a directory and then execute them one at a time. In this way, the user does not need to manually load and start each scenario: they can simply start the generation and come back after a few hours, when every scenario has been generated and annotated.
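The actual plugin runs inside the game engine, so the real calls are engine-specific; purely as an illustration of the batch workflow just described, the sketch below iterates over a directory of scenario files, with load_scenario(), generate_annotations() and the directory name being hypothetical placeholders.

import json
from pathlib import Path

def load_scenario(scenario: dict) -> None:
    # Hypothetical stand-in for the Scenario Creator loading step.
    print("loading scenario:", scenario.get("PlaceSettings", {}).get("Place", {}).get("Name"))

def generate_annotations(scenario: dict) -> None:
    # Hypothetical stand-in for the Dataset Annotator.
    cameras = scenario.get("CameraSettings", {}).get("Cameras", [])
    print("annotating", len(cameras), "viewpoints")

def run_batch(scenario_dir: str) -> None:
    # Load every scenario file found in a directory and generate it in turn.
    for path in sorted(Path(scenario_dir).glob("*.json")):
        with open(path) as f:
            scenario = json.load(f)
        load_scenario(scenario)
        generate_annotations(scenario)

run_batch("scenarios")  # the directory name is an assumption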

5.1.2 Dataset Annotator

The Dataset Annotator is the component that creates the annotated images for the dataset. For each viewpoint set up in the scenario, the Dataset Annotator applies the following steps:

1. Pause

2. Detection

3. Annotation

These steps are examined more in-depth in the following paragraphs, without dwelling on implementation details.

Pause

The game is paused for a small amount of time. This pause is necessary because the content of the screen usually lags behind the inspected positions of objects. If an object moves at high speed, the effect is more noticeable and can harm the training; this could happen, for example, if a person is running. Pausing the game does not stop graphics from being rendered, but it stops the processing of new position information. Thus, after a short time, the graphics on the screen and the position information are aligned.

Detection

Each object in the scenario is processed to extract its position on the 2D image. This is done in three steps:

1) The 3D bounding box is computed

2) The obstruction check is performed

3) The bounding box is projected onto the 2D image

Annotation

The current screen is captured and saved to an image file. Alongside the image file, a text file is created containing the annotations of the detected objects, in YOLO format. Pictures are taken at Full HD resolution (1920x1080) with 24-bit color depth (plain RGB, no alpha channel) and are saved in the lossless PNG format. Each image occupies a little more than 3 MB of storage.
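For reference, the YOLO format stores one object per line as a class index followed by the box center and size, normalized to the image dimensions; the sketch below shows the conversion from a pixel-space box, where the class index and the example coordinates are purely illustrative.

IMG_W, IMG_H = 1920, 1080  # Full HD frames used by V-DAENY

def to_yolo_line(class_id, left, top, right, bottom):
    # Convert a pixel-space box to a YOLO-format annotation line.
    x_center = (left + right) / 2.0 / IMG_W
    y_center = (top + bottom) / 2.0 / IMG_H
    width = (right - left) / IMG_W
    height = (bottom - top) / IMG_H
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"

# Example: a hypothetical class index 1 at pixel box (900, 300)-(980, 390)
print(to_yolo_line(1, 900, 300, 980, 390))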

5.1.3 Bounding Boxes

RAGE provides information about the size of objects, such as weapons, cars, and people, in the form of a box: a 3D vector with one component each for width, height, and depth. However, this information cannot be used directly as a bounding box: in general, the box does not have the same orientation as the in-game object. The orientation of the object can be retrieved from the game engine, so the first step is to re-orient the box using the orientation of the object.

(50)

There are certain cases in which the given box is bigger than the object of interest. This difference appears when the object is not rigid, as is the case for a person. In this case, the box returned by the game engine is as large as the object can be in every direction. Using a person as an example again, the width would be the width of a person in a T-pose, with both arms parallel to the ground, and the height would be the height of a person with both arms up. This would make the bounding box of a person appear much bigger than it should be. Moreover, the sizes of most object classes of interest, such as the head or the chest, with or without equipment, are not immediately available.

These sizes can be computed from the provided size using standard human proportions. We used a table of proportions similar to the one in Figure 5.1 to compute the correct sizes for each of the objects of interest. The bounding boxes computed in this way are approximations of the real sizes, but the method can easily be tweaked to obtain better precision.
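As an illustration of this idea, the sketch below derives head and chest heights from a person's total height using rough proportion constants; the specific fractions (the classic eight-heads canon) are assumptions for the example, not necessarily the exact values used by V-DAENY.

def head_and_chest_heights(person_height):
    # Rough sub-box heights derived from standard human proportions.
    # The head is often approximated as about 1/8 of the total height and
    # the torso as roughly 3/8; these fractions are illustrative only.
    head_h = person_height / 8.0
    chest_h = person_height * 3.0 / 8.0
    return head_h, chest_h

# Example for a 1.80 m tall pedestrian
head_h, chest_h = head_and_chest_heights(1.80)
print(f"head: {head_h:.2f} m, chest: {chest_h:.2f} m")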

Bounding boxes can be projected onto the 2D screen to obtain the 2D bounding boxes that will then be used by YOLOv3. This, however, has a limitation: when a 3D point lying outside the screen is projected, the projection function fails. In this case, there are two choices: discard the box completely, or try to compute the projection nonetheless.

We made it possible to project the box whenever at least its center can be projected. We take all the points that can be projected onto the 2D screen, filtering out those outside the screen. For the X values of the vertical edges of the box: for the leftmost edge, we take the minimum X value of all projected points that are to the left of the center of the bounding box; for the rightmost edge, we take the maximum X value of all projected points that are to the right of the center. If there are no points to the left of the center, we take 0 as the X value of the leftmost edge; if there are no points to the right of the center, we take 1920 as the X value of the rightmost edge. The same reasoning applies to the Y values of the horizontal edges of the box. In other words, when the object continues outside the screen, the bounding box stops at the edges of the screen.
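A minimal sketch of this clamping logic follows; it assumes that the projectable vertices and the projected center have already been obtained through the game engine's projection API, and the helper name and example values are ours.

IMG_W, IMG_H = 1920, 1080

def box_2d_from_projections(projected, center):
    # projected: list of (x, y) screen points for the vertices that could be
    # projected; center: projected (x, y) of the box center, required for the
    # box to be kept at all.
    cx, cy = center
    xs_left = [x for x, _ in projected if x <= cx]
    xs_right = [x for x, _ in projected if x >= cx]
    ys_top = [y for _, y in projected if y <= cy]
    ys_bottom = [y for _, y in projected if y >= cy]

    left = min(xs_left) if xs_left else 0
    right = max(xs_right) if xs_right else IMG_W
    top = min(ys_top) if ys_top else 0
    bottom = max(ys_bottom) if ys_bottom else IMG_H
    # Clamp to the screen in case some projected points fall outside it.
    return max(0, left), max(0, top), min(IMG_W, right), min(IMG_H, bottom)

# Example: a box whose right side falls off-screen
print(box_2d_from_projections([(1800, 400), (1900, 700)], center=(1910, 550)))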


Bounding boxes computed in this way have another problem: when the camera gets very close to the object, the bounding boxes can become much bigger than the object itself. This produces pictures with lower-quality ground truth; in this case, the dataset is said to be noisy. A noisy dataset is not a huge problem, as the network should still be able to learn with a small amount of noise, provided that the dataset contains mostly correct annotations overall.

In Figure 5.2 we can see the two steps followed to compute 2D bounding boxes. In 5.2a the pedestrian position, orientation, and size are used to compute 3D bounding boxes. Since they are obtained through coarse heuristics, they can be larger than the actual size of the object. In 5.2b we see how 2D bounding boxes are obtained by circumscribing the 3D bounding box.

5.1.4 Obstruction Check

The obstruction check is needed to prevent an object that is obstructed from view from being incorrectly annotated: if an object cannot be at least partially seen, it should not appear in the ground truth. The game engine does not offer a method to distinguish obstructed objects from visible ones, so we had to build our own obstruction detection. The game engine does, however, offer an interface for simple ray tracing, a technique that computes the path that a ray of light would travel. In RAGE, we can create a ray that travels from one point to another, and the interface reports what happened to that ray: it either reaches its destination without any interference or hits something along its path.

This simple ray tracing can be used for obstruction detection: a ray is traced from the recording camera position to each of the vertices of the 3D bounding box. If at least one of these rays successfully reaches its destination, the object is at least partially visible. To improve the quality of the detection, we also traced rays to intermediate points of the bounding box, such as the midpoints of its edges.

Figure 5.2: (a) 3D bounding boxes computed with game information; (b) 2D bounding boxes obtained from the 3D bounding boxes.

If a bounding box passes the obstruction check, it is added to the list of detected objects and its annotation is added to the picture metadata.

This method is not perfect, and both false negatives and false positives can happen, depending on the particular viewpoint and on the orientation of an object. This introduces noise into the dataset but, as noted above, a small amount of noise does not cause any major problem.
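As an illustration, the visibility test reduces to checking whether any ray from the camera reaches one of the test points; in the sketch below, ray_is_clear() is a hypothetical stand-in for RAGE's ray-tracing call.

def is_visible(camera_pos, test_points, ray_is_clear):
    # test_points: 3D box vertices plus intermediate points (e.g. edge midpoints);
    # ray_is_clear(a, b) must return True when nothing blocks the segment a -> b.
    return any(ray_is_clear(camera_pos, p) for p in test_points)

# Toy example: pretend everything below z = 1.0 is hidden behind a wall
def toy_ray_is_clear(a, b):
    return b[2] >= 1.0

points = [(0, 5, 0.5), (0, 5, 1.8)]  # two vertices of a person's box
print(is_visible((0, 0, 1.7), points, toy_ray_is_clear))  # True: the head is visible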

5.1.5 Pedestrian Creation

The Scenario Creator allows one to specify the number of people in the scenario, but the task of actually creating the pedestrians is left to the Dataset Annotator. A pedestrian must have a model assigned, which represents the appearance that the pedestrian will have, including clothes and equipment. For this work, the models available for the Dataset Annotator to choose from are:

1. Air Worker, a person who works in an airport;

2. Construction Laborer 1, a person working in a construction site;

3. Construction Laborer 2, a variation of the previous model;

4. Dock Worker 1, a person working in a shipyard or a port;

5. Dock Worker 2, a variation of the previous model.

Our first choice for creating a new pedestrian was to pick a random model and then pick random equipment. While this method is fair with respect to how often every model appears, it is not equally fair with respect to how often every object class appears. In fact, each of these models has a different appearance and equipment. For example, some of them never appear without a High Visibility Vest, such as the Air Worker; welding masks appear only with three of these models, namely Construction Laborer 1 and 2, and Dock Worker 1. Therefore, a welding mask has a much lower probability of appearing than an HVV or a bare head. This was the method used during the creation of the first dataset, as we can see in Subsection 5.2.3.

To avoid this imbalance, V-DAENY does not pick a model at random. Instead, it first creates a pedestrian without any equipment, then one with a helmet, then one with a welding mask, and finally one with hearing protection, choosing the model accordingly. The same is done to alternate between pedestrians with and without an HVV. This was the method used during the creation of the second dataset, as we can see in Subsection 5.2.4.
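A minimal sketch of this round-robin assignment is shown below; the equipment and vest lists are hypothetical labels, since the real plugin maps each combination to a specific RAGE model and prop configuration.

import itertools

# Hypothetical labels: the real plugin maps these to specific RAGE models.
HEAD_EQUIPMENT = ["bare_head", "helmet", "welding_mask", "hearing_protection"]
VEST_OPTIONS = ["no_hvv", "hvv"]

head_cycle = itertools.cycle(HEAD_EQUIPMENT)
vest_cycle = itertools.cycle(VEST_OPTIONS)

def next_pedestrian_spec():
    # Pick the next head equipment and vest option in round-robin order.
    return next(head_cycle), next(vest_cycle)

for _ in range(6):
    print(next_pedestrian_spec())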

Unfortunately, due to how the game engine was created, each of these models only has a male version. Therefore, female workers are not included in the dataset, as nothing could be done about this short of extending the game engine. We will discuss possible extensions in Chapter 8.

5.1.6 Pedestrian Classification

During annotation, each pedestrian must be classified to understand what kind of equipment they wear. This information is not immediately available from the game engine, but V-DAENY can use the model of the pedestrian to infer what they are wearing.

Every pedestrian has a variation vector (VV), which describes the clothes they are wearing, and a props vector (PV), which describes the equipment they are wearing. The interpretation of these vectors differs from model to model: for example, an Air Worker with a High Visibility Vest has [3, 0] as VV, while a Construction Laborer 1 has [8, 0]. Therefore, for each model, V-DAENY uses a Classifier that implements the correct interpretation and returns information about the equipment that the pedestrian is wearing.
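As an illustration, such a classifier can be reduced to a lookup from the pair (model, VV) to a set of equipment labels; in the sketch below only the two VV values quoted above are taken from the text, everything else is a hypothetical placeholder.

# Per-model interpretation of the variation vector (VV).
# Only the two VV values quoted in the text are real; the rest is illustrative.
VV_TO_EQUIPMENT = {
    ("AirWorker", (3, 0)): {"high_visibility_vest"},
    ("ConstructionLaborer1", (8, 0)): {"high_visibility_vest"},
}

def classify(model: str, vv: tuple) -> set:
    # Return the set of equipment implied by a pedestrian's model and VV.
    return VV_TO_EQUIPMENT.get((model, vv), set())

print(classify("AirWorker", (3, 0)))  # {'high_visibility_vest'}
print(classify("AirWorker", (0, 0)))  # set(): no vest for this VV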
