
Detection

1 Camera Detection

Figure 4.1: Deep neural networks are able to recognize more and more complex features at each layer. Simplifying a bit, the network first recognizes elements like lines or circles, then wheels and windows, and finally vehicles [nVidia - based on [29]]

The output layer of the network and the final processing functions are changed in order to obtain different types of output. By default, the outputs of the last layer are raw real values, which are hard to interpret directly.

In the recognition case, one simple application is to assign an image to one category from a list of predefined possibilities, for instance deciding whether a camera image contains a car, a truck, a bike or a person.

In order to obtain a confidence value, indicating the probability that an image belongs to each of these categories, the softmax function is applied to the output layer. This function normalizes the output to an array of real values within the range (0, 1), whose sum is equal to 1.

Thus, these values can be interpreted as the probabilities of the image belonging to each of the categories.

For a neuron output z_j, this probability is then computed as:

P(j) = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}

where K is the number of categories (i.e. of output neurons).

For instance, given the four outputs [2.2, −4.1, 3.92, 4.1] from the neural network, after applying softmax we obtain [0.0753451, 0.000138357, 0.420767, 0.50375], whose sum is equal to 1. If our categories are car, truck, bike and person as stated before, then we can affirm with 0.50375 confidence that the image shows a person.
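This computation can be reproduced with a few lines of Python; the following is a minimal sketch using NumPy, not code taken from the project:

import numpy as np

def softmax(z):
    # Subtract the maximum for numerical stability; the result is unchanged.
    e = np.exp(z - np.max(z))
    return e / e.sum()

outputs = np.array([2.2, -4.1, 3.92, 4.1])
probs = softmax(outputs)
print(probs)        # approximately [0.0753 0.0001 0.4208 0.5037]
print(probs.sum())  # 1.0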

Now, suppose we have an image that is not focused on a single object but shows a more varied scenario, where multiple objects are present. We are interested in detecting all the objects, their categories and their positions in the image. As usual, the position of an object in an image can be defined with a bounding box, that is, a box centered around the object, often encoded with a tuple of four values:

(center_x, center_y, width, height)
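For illustration, a small helper (hypothetical, not part of the thesis code) can convert this center-based encoding into the corner coordinates that are often more convenient for drawing or overlap tests:

def center_to_corners(center_x, center_y, width, height):
    # Convert (center_x, center_y, width, height) into
    # (x_min, y_min, x_max, y_max) corner coordinates.
    x_min = center_x - width / 2.0
    y_min = center_y - height / 2.0
    x_max = center_x + width / 2.0
    y_max = center_y + height / 2.0
    return x_min, y_min, x_max, y_max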

A classification network of this kind becomes useful for detection when applied to multiple patches of the image, some of which may contain an object of interest. This approach is called region proposal. It is possible to scan all the regions one by one, from top-left to bottom-right.
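A rough sketch of this patch-by-patch scan is given below; the window size, stride, threshold and the classify function are assumptions made for illustration, not the method actually used in this work:

def sliding_window_detect(image, classify, window=(128, 128), stride=64, threshold=0.8):
    # Scan the image from top-left to bottom-right, classify each patch,
    # and keep the patches whose best class exceeds the threshold.
    detections = []
    img_h, img_w = image.shape[:2]
    win_w, win_h = window
    for y in range(0, img_h - win_h + 1, stride):
        for x in range(0, img_w - win_w + 1, stride):
            patch = image[y:y + win_h, x:x + win_w]
            probs = classify(patch)  # e.g. the softmax output of the classifier
            best = max(range(len(probs)), key=lambda i: probs[i])
            if probs[best] >= threshold:
                detections.append((x, y, win_w, win_h, best, probs[best]))
    return detections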

Other algorithms are able to directly identify both the objects and their positions within the image. These approaches perform better both in recognition and in timing, since they do not need to repeat inference at different scales and positions within the image.

The YOLO detector is one of these.

1.2 YOLO

The image recognition capability was built on the second iteration of the YOLO neural network, YOLOv2 [40]. As of today, this neural network achieves close to the best performance among existing solutions and yet is very light, so much so that it can run on a smartphone.

The innovation of this network lies in its simple pipeline and learning approach: no region proposal, no complex functions applied to the output, just a single network that directly outputs both the predicted classes and the bounding boxes, so that it can be trained end-to-end1. As reported in the original paper [41], the use of a single network with no additional steps in the pipeline has many advantages: the full system is faster to execute and to train, contextual information is implicitly used since there is no region proposal, and convolutional properties are maintained at all scales.

1 That is, the network maps the input images directly to the final detections, so the whole system can be trained as a single unit, with no separately trained intermediate stages.

The network works by dividing the input image into a grid and associating a class to each cell of the grid. In each cell, a configurable number of bounding boxes (usually 2) is also predicted. Each bounding box is characterized by its (x, y) coordinates, relative to the cell origin, its (w, h) dimensions, relative to the full normalized image size, and its confidence, defined as Pr(Object) · IOU_{pred}^{truth}, that is, the probability that an object exists in the cell times the IOU1 between the prediction and the ground truth. The training jointly optimizes all these output values, eventually keeping only the classes with the highest confidence. The architecture of the network (fig. 4.2) is also simple, composed of convolutional and max-pooling layers.
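As a rough illustration of this output encoding, the cell-relative predictions can be converted back into coordinates on the whole image as follows; this is a simplified sketch, not the actual YOLOv2 decoding, which additionally uses anchor boxes:

def decode_cell_prediction(row, col, grid_size, x, y, w, h, confidence):
    # (x, y): offsets of the box center inside cell (row, col), in [0, 1]
    # (w, h): box size relative to the full normalized image
    # confidence: Pr(Object) * IOU, as predicted by the network
    center_x = (col + x) / grid_size
    center_y = (row + y) / grid_size
    return (center_x, center_y, w, h), confidence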

Figure 4.2: Architecture of the original YOLO network. YOLOv2 adopts a similar network, but empirically adds some steps in order to improve accuracy while maintaining real-time performance on high-end hardware.

The results of YOLO on the standard datasets are impressive, both in accuracy and in performance. On our images (fig. 4.3), the network shows good qualitative performance even with no further tailoring on our side. The recognized objects are published on the image_recognition topic as an array of bounding boxes containing the corresponding class ID and confidence probability.
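A minimal sketch of how such detections could be published with rospy is given below; the node name and the BoundingBox/BoundingBoxArray message definitions are assumptions made for illustration and may differ from the actual messages used in the project:

import rospy
# Hypothetical message types, used here for illustration only.
from detection_msgs.msg import BoundingBox, BoundingBoxArray

rospy.init_node('yolo_detector')
pub = rospy.Publisher('image_recognition', BoundingBoxArray, queue_size=10)

def publish_detections(detections):
    # detections: list of (center_x, center_y, width, height, class_id, confidence)
    msg = BoundingBoxArray()
    for cx, cy, w, h, class_id, confidence in detections:
        msg.boxes.append(BoundingBox(center_x=cx, center_y=cy,
                                     width=w, height=h,
                                     class_id=class_id,
                                     confidence=confidence))
    pub.publish(msg)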

1 The Intersection over Union is a common measure of the quality of a bounding box prediction; for two bounding boxes it is defined as the ratio Area(Overlap) / Area(Union). A good predictor should produce a bounding box close to the ground truth and of similar size, thus with an IOU close to 1.
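For concreteness, a small helper computing the IOU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max) corners; this is an illustrative sketch, not the evaluation code used by YOLO:

def iou(box_a, box_b):
    # Boxes are (x_min, y_min, x_max, y_max).
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap rectangle (empty if the boxes do not intersect).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    overlap = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - overlap
    return overlap / union if union > 0 else 0.0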

Figure 4.3: YOLOv2 applied to our images. Generally, the detector is able to find cars, people and traffic lights in the sequence of images, but fails to detect all objects in a single frame.

2 Lidar Detection and Recognition

There is a recent and growing literature on the problem of detecting and recognizing objects from a point cloud. The problem is quite close to the corresponding 2D image case and, not surprisingly, many algorithms are transpositions from 2D to 3D space. As in image processing, 3D data is processed through the use of histograms, features, Euclidean metrics and, more recently, neural networks. Some basic solutions for detection generate a 2D image from the 3D data. While this works, it does not fully exploit the information provided by a laser scanner, and thus achieves lower performance than direct 3D analysis. Tracking of the recognized objects can still be based on Kalman filtering, here too with the necessary modifications for handling 3D translations and rotations.
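As a simple illustration of the "2D image from 3D data" idea mentioned above, a point cloud can be projected into a bird's-eye-view occupancy image; the grid resolution and ranges below are assumptions chosen only for the sketch:

import numpy as np

def birds_eye_view(points, x_range=(0.0, 40.0), y_range=(-20.0, 20.0), resolution=0.1):
    # points: (N, 3) array of lidar points (x forward, y left, z up).
    # Returns a 2D image in which each pixel counts the points falling
    # into the corresponding ground cell.
    width = int((x_range[1] - x_range[0]) / resolution)
    height = int((y_range[1] - y_range[0]) / resolution)
    image = np.zeros((height, width), dtype=np.float32)
    mask = ((points[:, 0] >= x_range[0]) & (points[:, 0] < x_range[1]) &
            (points[:, 1] >= y_range[0]) & (points[:, 1] < y_range[1]))
    cols = ((points[mask, 0] - x_range[0]) / resolution).astype(int)
    rows = ((points[mask, 1] - y_range[0]) / resolution).astype(int)
    np.add.at(image, (rows, cols), 1.0)
    return image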
