
University of Pisa
and
Sant'Anna School of Advanced Studies

Master Degree in Embedded Computing Systems

Visual SLAM for Driverless racing vehicle

Supervisors:
Prof. Carlo Alberto Avizzano
Prof. Alessandro Filippeschi
Dr. Paolo Tripicchio

Candidate:
Gabriele Baris


In collaboration with E-Team Squadra Corse of University of Pisa and PERCRO laboratory of Scuola Superiore Sant’Anna.


“Per aspera ad astra” Latin motto


Abstract

This work presents a VSLAM algorithm for Formula SAE® and Formula Student Driverless vehicles. Popular VSLAM algorithms are designed for autonomous agents that can move around and explore an unstructured environment. In these competitions, instead, vehicles have to move along a track delimited by coloured cones. These provide semantic features that can be exploited to simplify the VSLAM pipeline.

Instead of using generic features like SIFT, SURF or ORB, the proposed vision module uses an object detector to obtain 3D cone locations by analysing an RGB-D frame. Two different detectors are proposed: a YOLO-based one and a colour-based one.

The SLAM module uses the aforementioned 3D locations as landmarks for an EKF-based VSLAM built on a vehicle kinematic model. The algorithm implements semantic data association and loop closure at the end of each lap. A localization-only mode is also provided for events such as Acceleration and Skid Pad, where the track geometry is known a priori.

The map provided by this VSLAM system is a sparse semantic one and can be tricky to use for planning purposes. For this reason, it is used as an input by the track detection module, which connects cones to define the track boundaries and the region where the vehicle can drive. Cones may not be detected in order, and ordering them based on Euclidean distance alone may lead to a wrong track reconstruction. For this reason, a custom metric is proposed, aiming to minimize Euclidean distance and maximize cone collinearity.

The SLAM module was first implemented in MATLAB to test its correctness. Then the entire system was implemented in ROS and tested using an RC car and small coloured cones.


Acknowledgements

The biggest thanks certainly goes to mum and dad, who raised me, educated me, supported me and never forbade me from pursuing my passions. I owe this achievement above all to you. Thanks also to that pest Lorenzo, my brother, without whom life would certainly have been more boring. Thanks to all my relatives who, even from afar, never let their affection be missed.

Thanks to Vale, who for almost three and a half years has been by my side every day, even though hundreds of kilometres now separate us. Thank you for all the love and support you have given me in these years, especially in the difficult moments (and yes, there have been some).

I thank prof. Avizzano and, through him, the whole PERCRO laboratory for the support during these long months of thesis work.

Thanks to Salvo, for the countless exams and projects done together; thank you for always encouraging each other. Thanks to Gabri A. and Gabri S., trusted travel companions since the first year. Thanks to Davide, for the long engineering conversations at i Praticelli (how I have missed them in the last few months!). Thanks to Fede, for all the projects that one day, perhaps, we will manage to do together. Thanks to Domenico for the very pleasant conversations over a beer. Thanks to Gioia and Kevin, those "just colleagues" whom I almost never see but always feel very close. Thanks to Chiara for her company, a blessing in these last weeks.

Thanks to the E-Team, which has allowed me to grow a lot in these years, both personally and professionally. Special thanks to Michael, Niccolò, Francesco M., Andrea, Marco, Francesco F., Federico and Simone, the Driverless guys: we have started something truly beautiful. Let's roll up our sleeves and keep going down this road.

Thanks to all the professors who have taught me so much along this path, and in particular to prof. Sbano, who has been a mentor to me for ten years now.

Thanks to everyone, including those I forgot to mention explicitly, for allowing me to be here on this day. Ad maiora!


Contents

1 Introduction
1.1 Formula Student and Formula SAE®
1.2 Competitions
1.3 E-Team Squadra Corse
1.4 Motivation
1.5 Thesis objectives
1.6 Structure

2 Background
2.1 Computer Vision
2.1.1 Camera model
2.1.2 Projective Geometry
2.1.3 Lens Distortions
2.1.4 Projections and 3D vision
2.1.5 Depth Sensing
2.1.6 HSV colour space
2.2 Object detection
2.2.1 YOLO
2.3 SLAM
2.3.1 Premise
2.3.2 Sensing
2.3.3 Maps
2.3.4 Data association
2.3.5 Loop closure
2.3.6 VSLAM
2.4 Proposed solution

3 Cone Detection
3.1 YOLO-based cone detector
3.2 Colour-based cone detector
3.3 The algorithm

4 Visual SLAM
4.1 Notation
4.2 Initialization
4.3 Prediction step
4.4 Correction step
4.4.1 Landmark observation
4.4.2 Velocity observation
4.4.3 Data association
4.4.4 Landmark addition
4.4.5 Loop closure
4.5 Recap

5 Track Detection
5.1 Boundaries detection
5.1.1 Boundary initialization
5.1.2 Next cone selection
5.1.3 Boundary closure
5.2 Road Detection

6 Implementation
6.1 MATLAB
6.2 ROS
6.2.1 Structure
6.2.2 Camera calibration
6.2.3 Real case scenario

7 Conclusion
7.1 Summary
7.2 Future work

List of Acronyms

Chapter 1

Introduction

1.1 Formula Student and Formula SAE®

Formula SAE® (FSAE)¹ is an international competition for university (either undergraduate or graduate) students. They are asked to design, develop and compete with a small formula-style vehicle. Engineering design and performance are not the only metrics to win the competition: financial and sales planning are also taken into account for the final scores.

The first competition was held in 1981 by the Society of Automotive Engineers (SAE) in the USA. Each year, around 140 student teams from all over the world take part in it.

The competition was at first limited to a few American universities, but it grew exponentially very fast. In fact, in 1998 the brand new Formula Student (FS)² competition was born. It is organized by SAE and IMechE in England and hosts around 70 teams every year. Nowadays, Formula Student and Formula SAE® competitions are held annually in Europe (Figure 1.1), America, Asia and Australia.

1.2 Competitions

Historically, competitions were focused on the Combustion Vehicle (CV) category, which was the one and only one. Then, after some years, with the technological advances in the field of electric vehicles, the Electric Vehicle (EV) category was also introduced.

¹ https://www.sae.org/attend/student-events/
² https://www.imeche.org/events/formula-student

Figure 1.1: Formula SAE® and Formula Student events in Europe

Since 2017, the Driverless Vehicle (DV) category has been introduced, in which teams work on either a combustion or an electric vehicle to make it fully autonomous.

In the following, everything will refer to the Driverless category, since it is the context in which this thesis takes place. Each competition consists of Static and Dynamic events, each one containing several disciplines:

Static events

– Business Plan: each team presents their business plan for the constructed prototype to a fictitious company represented by judges. Teams first give a ten-minute presentation of their idea and then have a five-minute discussion and question round with the judges.

– Cost and Manufacturing: the team must provide a written cost report about all the costs for the manufacturing process of the vehicle. Then, judges ask team members questions about those costs.

– Engineering Design: a jury of experienced engineers evaluates layout, technical design, construction and implementation of the production of the actual vehicle based on documents provided by the team. Then, team members are questioned by the judges about design choices and technical details.

Figure 1.2: Schema of the Acceleration track structure (Source: FSG 2019 Competition Handbook)

Dynamic events

– Acceleration: the vehicle's acceleration from a standing start is measured over a 75 m straight lane delimited by cones. After that, the vehicle must come to a full stop within 100 m (see Figure 1.2).

– Skid Pad: the vehicle must drive a figure-of-eight circuit lined with track cones, performing two laps of each circle. In each case, only the second lap is measured (see Figure 1.3).

– Autocross: the vehicle has to traverse an unknown 500 m track with straights, curves, slaloms and chicanes.

– Track Drive: the vehicle has to perform 10 laps on the very same Autocross track, proving its reliability and durability under long-term conditions (see Figure 1.4).

– Efficiency: during the Track Drive event, fuel (or energy) consumption is measured and weighted with the vehicle speed (to discourage teams from driving slowly just to be more efficient).

Figure 1.3: Schema of the Skid Pad track structure (Source: FSG 2019 Competition Handbook)

Figure 1.4: Schema of the Track Drive track structure (Source: FSG 2019 Competition Handbook)

1.3 E-Team Squadra Corse

E-Team Squadra Corse (Figure 1.5) is the racing team of the University of Pisa. Every year, team members design and build a vehicle to take part in Formula SAE® and Formula Student competitions.

The team was born in 2007 with just 35 members, under the supervision of professor Emilio Vitale (former President of the School of Engineering) and professor Massimo Guiggiani (President of the Master Degree in Automotive Engineering).

Year by year, interest in the team has grown constantly; it now counts more than 80 members, coming from more than 15 courses of study.

In 2019 the team gave birth to the Driverless Division, aiming to take part in Formula SAE® and Formula Student competitions in Summer 2020.

The division counts about ten members with different backgrounds (Embedded Computing Systems, Robotics and Automation, Mechanics, Telecommunication, Computer Science) who are studying how to transform the traditional combustion vehicle into a fully autonomous one, compliant with the rules.

1.4 Motivation

This thesis takes place within the broader context of the E-Team Driverless Project, depicted in Figure 1.6.

Figure 1.5: E-Team and Kerub-H at FSG 2018

These are the main pieces needed to have a fully autonomous vehicle. The Vision block takes the video stream from the camera and, running a cone detection algorithm, computes the cones' 3D locations relative to the vehicle. This information is then sent to the SLAM block, which builds an abstract representation of the track (the map) and localizes the vehicle within it. Map and vehicle pose are used by the Planning block to compute the desired vehicle trajectory.

At the same time, the State Estimation block fuses data coming from different sensors to obtain a reliable vehicle state estimate.

Vehicle state and planned trajectory are used by the Vehicle Control block which, based on a vehicle dynamics model, provides references to a lower-level Actuation Control block. The latter provides references to the actuators mounted on the vehicle. Notice that Gear and Clutch are needed since the base vehicle has a combustion engine.


1.5 Thesis objectives

As already stated, competition tracks are structured environments, with traffic cones on both sides. The objectives for my thesis are the following:

• starting from the video stream coming from a stereo camera, compute the cones' 3D locations relative to the vehicle (Vision block)

• using only data coming from the previous detector, implement a Visual SLAM algorithm (SLAM block)

• using the map from the previous point, detect the area where the vehicle can drive (part of the Planning block)

1.6 Structure

For the reader's convenience, let's summarize how this thesis is organized.

Chapter 1 contains an introduction to the competitions, the team and the thesis objectives

Chapter 2 provides a brief description of some background concepts useful to understand the proposed solutions

Chapter 3 describes in detail how the problem of cone detection was addressed

Chapter 4 describes in detail the maths behind the proposed VSLAM algorithm

Chapter 5 provides an algorithm for track detection

Chapter 6 describes the implementations using MATLAB and ROS, providing experimental results

Chapter 7 finally summarizes the work and suggests some topics for future work

Figure 1.6: Block diagram of the E-Team Driverless Project (blocks: Camera, Sensors, Steering Wheel, Throttle, Brake, Gear, Clutch, Vision, SLAM, State Estimation, Planning, Vehicle Control, Actuation Control)


Chapter 2

Background

In this chapter, I will give a brief review of the topics needed as the foundations for the work I have done. It is not intended to be a complete explanation of all of them, but just a summary of the key concepts. I will first cover topics related to Computer Vision and object detection, ending the chapter with SLAM-related concepts.

2.1 Computer Vision

2.1.1 Camera model

The simplest way to describe what happens inside a camera is by means of the so-called pinhole camera model (Figure 2.1). In this model, the camera is represented as an infinitesimal point and, from each 3D point in the surrounding space, only a single ray of light enters the hole, projecting onto the image plane. As a result, the obtained image is always in focus and its size is proportional to a single parameter: the focal length. It expresses the distance from the pinhole to the image plane.

Figure 2.1: Pinhole camera model (Source: [11])

Mathematically speaking,

$$-x = f\frac{X}{Z}$$

where f is the focal length, Z is the distance of the object from the pinhole, X is the height of the object and x is the height of the object projected on the image plane. The minus sign takes into account the fact that the projected image is upside-down.

To work with easier maths we can rearrange the previous model by swapping the pinhole and the image plane (Figure 2.2). The projected object now appears right side up. This way, the point at the pinhole is interpreted as the centre of projection. The point at the intersection between the optical axis and the image plane is called the principal point. The relationship between the real and the projected point is exactly the same as before, but without the minus sign.

So, in a perfect world you would just map a point Q = (X, Y, Z) to a pixel q = (x, y) as

$$x = f\frac{X}{Z} \qquad y = f\frac{Y}{Z}$$

But since the world is not perfect, we need to introduce some adjustments. In particular, we introduce the two values $c_x$ and $c_y$ to model a possible displacement of the centre of the image from the optical axis. Moreover, on a low-cost imager, pixels are rectangular rather than square. This implies that the actual focal length (which is proportional to both the lens focal length and the pixel size) is different along the two axes. This leads to the introduction of two new factors $f_x$ and $f_y$ to model this behaviour.

Figure 2.2: Swapped pinhole camera model (Source: [11])

$$x = f_x \frac{X}{Z} + c_x \qquad y = f_y \frac{Y}{Z} + c_y$$

2.1.2 Projective Geometry

The relationship that maps points Qi in the real world to points (pixels) qi on the projection screen is called a projective transformation.

In projective geometry it is very common (and useful) to work in homogeneous coordinates. They consist of mapping an n-dimensional vector (e.g. [x, y, z]) into an (n + 1)-dimensional one (e.g. [x, y, z, w]), with the additional constraint that any two points that are proportional are, in fact, equivalent points.

This said, let's recall our problem. Our projective space is the 2-dimensional image plane. Its points can be represented in homogeneous coordinates as $q = [q_1, q_2, q_3]^T$. Keeping in mind that points having proportional values are equivalent, this allows us to collect all the camera parameters we have previously seen in a single 3 × 3 matrix called the camera intrinsics matrix and easily compute

$$q = KQ$$

where $Q = [X, Y, Z]^T$ is a point in the real world, $q = [x, y, w]^T$ is the projected point and

$$K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}$$

is the camera intrinsics matrix.

Doing all the computations you obtain $w = Z$, so dividing by $w$ (or $Z$) you obtain the earlier definition.
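As a concrete illustration, here is a minimal Python/NumPy sketch of this projection and its inverse; the intrinsic values are made-up example numbers, not calibration results from the thesis camera.

import numpy as np

# Example intrinsics (hypothetical values, OpenCV pixel convention)
fx, fy, cx, cy = 600.0, 600.0, 320.0, 240.0
K = np.array([[fx, 0.0, cx],
              [0.0, fy, cy],
              [0.0, 0.0, 1.0]])

Q = np.array([0.5, -0.2, 4.0])    # 3D point [X, Y, Z] in the camera frame (metres)

q = K @ Q                         # homogeneous pixel [x*Z, y*Z, Z]
u, v = q[0] / q[2], q[1] / q[2]   # divide by w = Z to get pixel coordinates
print(u, v)

# Back-projection: given a pixel and its depth, recover the 3D point
p = np.linalg.inv(K) @ np.array([u, v, 1.0]) * Q[2]
print(p)                          # recovers [X, Y, Z]

The same back-projection K⁻¹ q d is what the cone detector of Chapter 3 uses to turn a centroid and a depth value into a 3D cone position.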

2.1.3 Lens Distortions

The pinhole model we have seen so far is hardly applicable in practice. The very little light that enters the hole would require a long exposure time for the imaging process. To overcome this issue you can insert a lens before the pinhole: this will focus a large amount of light on a single point, speeding up the imaging process. The drawback is that, even if in theory it is possible to have a lens that introduces no distortions, in practice no lens is perfect. The two main sources of lens distortion are radial distortion and tangential distortion.

Radial distortion

Radial distortions (Figure 2.3) arise when pixels near the edge of the image are more deformed than those near the centre. This happens when rays farther from the centre of the lens are bent more than those closer in. This is also known as the fisheye effect.

Thus, the distortion is 0 at the (optical) centre of the imager and increases as the distance r from the centre increases. In practice, this distortion can be modelled by the first few terms of a Taylor series expansion around r = 0. For classical webcams, the first two coefficients $k_1$ and $k_2$ are sufficient, while for highly distorted cameras (such as the ones equipped with fisheye lenses) you also need a $k_3$ term.

Figure 2.3: Effect of radial distortion: arrows show how pixels are moved away from the centre (Source: [11])

$$x_{cor} = x_{raw}\,(1 + k_1 r^2 + k_2 r^4 + k_3 r^6)$$
$$y_{cor} = y_{raw}\,(1 + k_1 r^2 + k_2 r^4 + k_3 r^6)$$

where $[x_{cor}, y_{cor}]$ is the undistorted pixel, $[x_{raw}, y_{raw}]$ is the original one and $r$ is its distance from the centre of the lens.

Tangential Distortion

This second form of distortion arises from manufacturing defects in the assembly of the camera. In particular, it is the result of the lens not being exactly parallel to the imaging plane (Figure 2.4).

It is expressed by two parameters $p_1$ and $p_2$:

$$x_{und} = x + \left[2 p_1 x y + p_2 (r^2 + 2x^2)\right]$$
$$y_{und} = y + \left[p_1 (r^2 + 2y^2) + 2 p_2 x y\right]$$

Figure 2.4: Effect of tangential distortion (Source: [11])

Putting it all together, we have a distortion vector $D = [k_1, k_2, p_1, p_2, k_3]$ which can be used to undistort images.
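In practice these parameters are used together. As a sketch (assuming OpenCV's Python bindings and purely hypothetical calibration values), K and D can be passed directly to cv2.undistort():

import cv2
import numpy as np

# Hypothetical calibration results; in a real setup they come from camera calibration
K = np.array([[600.0, 0.0, 320.0],
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]])
D = np.array([-0.25, 0.08, 0.001, -0.002, 0.0])  # [k1, k2, p1, p2, k3]

img = cv2.imread("frame.png")            # any distorted input image
undistorted = cv2.undistort(img, K, D)   # applies the radial and tangential model above
cv2.imwrite("frame_undistorted.png", undistorted)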

2.1.4 Projections and 3D vision

Transformation hierarchy

At the heart of 3D vision there are the so-called projective transforms. They build a hierarchy of transformations, from the most specialized to the most general one:

Euclidean ⊂ Similarities ⊂ Affine ⊂ Projective

Euclidean transformations are used to model the motion of a rigid body and are by far the most important isometries¹ in practice. A planar Euclidean transformation can be modelled as

$$x' = H_E x = \begin{bmatrix} R & t \\ 0^T & 1 \end{bmatrix} x$$

¹ Let's recall that an isometry is a particular type of geometric transformation which preserves distances.

where R is a 2-by-2 rotation matrix and t is a displacement vector (a Euclidean transformation is in fact a composition of a translation and a rotation). It has 3 DoF, one for the rotation and two for the translation.

A similarity is an isometry composed with an isotropic scaling. In the case of a Euclidean transformation composed with a scaling s, you can write

$$x' = H_S x = \begin{bmatrix} sR & t \\ 0^T & 1 \end{bmatrix} x$$

It is also known as an equi-form transformation since it preserves the shape of the object. It has 4 DoF, adding the scaling to those of the Euclidean transform.

An affine transformation (or affinity) is a non-singular linear transform followed by a translation. In the planar case

$$x' = H_A x = \begin{bmatrix} A & t \\ 0^T & 1 \end{bmatrix} x$$

where A is a non-singular 2-by-2 matrix. An affinity has 6 DoF, 4 for matrix A and 2 for vector t. An affinity preserves

• Parallel lines
• Ratio of lengths of parallel line segments
• Ratio of areas

A projective transformation can be expressed in the planar case as

$$x' = H_P x = \begin{bmatrix} A & t \\ v^T & v \end{bmatrix} x$$

where $v = [v_1, v_2]^T$. The matrix has 9 elements, but only their ratios are important, so the transformation has 8 DoF.

2.1.5 Depth Sensing

Depth information can be obtained with several techniques. Passive stereo techniques simply use a pair of images: by matching (as the name implies) the same set of features in both images, they can triangulate (and thus reconstruct) the 3D position of objects in the scene, knowing the geometrical relation between the cameras. On the other hand, active stereo techniques adopt an active component to measure distance. In this class fall ToF and structured light cameras.

Stereo matching

Stereo vision is naturally performed every day by our brain: it combines the images coming from the left and right eyes and retrieves 3D information about what is around us. The same thing can be done by means of two cameras, as depicted in Figure 2.5.

Figure 2.5: Stereo vision (Source: [11])

Stereo imaging typically involves four steps, which are briefly summarized below:

1. Undistortion: remove radial and tangential distortion
2. Rectification: adjust the images so that they are collinear
3. Correspondence: find the same features in both images and compute their displacement. This step provides the so-called disparity map
4. Reprojection: the disparity map is transformed into a depth map, knowing the geometric arrangement of the cameras and triangulating points

After undistortion and rectification you have a perfect stereo rig, where the image planes are aligned and collinear. The distance $T$ between the left and right principal points is called the baseline and the difference $x_l - x_r$ the disparity. The depth value for a point can be computed as

$$Z = \frac{fT}{x_l - x_r}$$

Thus, depth is inversely proportional to disparity: when disparity approaches zero, small variations in disparity produce huge variations in depth. On the other hand, when disparity is large, small disparity variations do not affect depth that much. In fact, to better resolve the depth of far objects you should increase the baseline.
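A minimal sketch of this relation, assuming an already rectified pair and example values for the focal length and baseline (not the thesis camera's):

import numpy as np

f = 600.0          # focal length in pixels (example value)
T = 0.05           # baseline in metres (example value)

# A toy disparity map (in pixels); real ones come from a stereo matcher
disparity = np.array([[20.0, 40.0],
                      [10.0,  5.0]])

valid = disparity > 0
depth = np.zeros_like(disparity)
depth[valid] = f * T / disparity[valid]   # Z = f*T / (x_l - x_r)
print(depth)                              # larger disparity -> smaller depth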

Time-of-Flight

A Time-of-Flight (ToF) camera (Figure 2.6) is a range imaging device which uses the time of flight of light to compute the distance of objects in the scene. The typical distance resolution is 1 cm. The physical principle is the following: the camera is equipped with IR emitters which emit light pulses at a given frequency. Then, the imager captures the reflected pulses and, knowing the speed of light c ≈ 300 000 000 m/s, it is possible to compute the travelled distance.

Figure 2.6: Working principle of ToF camera (Source: Google Images)

Structured light

Structured light approaches (Figure 2.7), instead, project a light pattern on the scene. The imager then captures the scene and it is possible to compute distance and surface information by computing how the pattern is modified by the objects in the scene. Depending on the application, the pattern can be inside or outside (e.g. infrared) the visible spectrum.

Figure 2.7: Working principle of structured light camera (Source: Google Images)

2.1.6 HSV colour space

The most widely used way of representing colours in computer graphics is the RGB colour model (Figure 2.8a). This was defined in the 1970s to mimic the way in which humans perceive colours. In this model, the three colour components vary in the range 0 % to 100 % and are mixed together to generate any colour.

This approach is simple and very famous, but for some Computer Vision applications other colour spaces are preferred. For example, in the field of colour segmentation, the Hue Saturation Value (HSV) colour model (Figure 2.8b) is maybe the most famous one, since it allows isolating the colour component using only the hue.

The hue component is arranged in a circle, having red primary at 0°, green primary at 120° and blue primary at 240°, thus having hue in the range 0° to 360°. This model represents how colours mix together, with saturation (in range 0 %


Figure 2.8: RGB and HSV colour spaces. (a) RGB colour gamut (Source: Wikimedia Commons). (b) HSV colour solid cylinder (Source: Wikimedia Commons).

to 100 %) representing various shades of brightly coloured paint and value (in range 0 % to 100 %) resembling the mixture of those paints with some amounts of black and white.

2.2 Object detection

2.2.1 YOLO

You Only Look Once (YOLO) is arguably the state-of-the-art object detector. It was first proposed in [18] where, for the first time, object detection was framed as a regression problem to spatially separated bounding boxes and associated class probabilities.

Typical detection systems had a classifier for a given object and tested it at various locations of the image and at different scales. On the other hand, YOLO, as the name implies, analyses the whole image using a convolutional network to simultaneously predict multiple bounding boxes and class probabilities for those boxes.

As you can see in Figure 2.9, YOLO divides the image into an S × S grid and for each grid cell predicts B bounding boxes, their confidence and C class probabilities. All these things are encoded into an S × S × (B · 5 + C) tensor.

Since YOLO sees the entire image, it implicitly encodes contextual information about classes and their appearance.

In the following years, the authors continued working on YOLO, releasing first YOLOv2 [19] and then YOLOv3. The main improvements regarded accuracy.


Figure 2.9: YOLO prediction model (Source: [18])

2.3 SLAM

2.3.1 Premise

Given a robot with an unknown position in an unknown environment, the Simultaneous Localization And Mapping (SLAM) problem (Figure 2.10) asks whether it is possible to build (and update) a consistent map of that environment and, at the same time, localize the robot within it. To do so, the robot must be equipped with sensors and must find some reference points in the environment, called landmarks.

This is a very common problem in different robotics domains, ranging from indoor to outdoor, underwater and airborne systems. From a theoretical perspective, it can be considered a solved problem, since in the last decades several algorithms have been proposed. However, great challenges remain in the algorithm realization.

Researchers started thinking about SLAM during the IEEE Robotics and Automation Conference in 1986. They agreed on the fact that consistent probabilistic mapping was a big issue in robotics and that it involved both conceptual and computational issues. During the following years different researchers published papers addressing those problems. It is worth citing in particular the work of Smith and Cheesman [24] and Durrant-Whyte [8], who defined a statistical basis for describing relationships between landmarks and managing geometric uncertainty.


Figure 2.10: Essential SLAM problem (Source: [7])

In the same years, people started working on visual-based [2] and sonar-based [5, 6] navigation. Those two fields have a lot in common and soon converged in the work of Smith et al. [23]. This paper showed that, as a mobile robot moves through an unknown environment taking relative observations of landmarks, the estimates of these landmarks are all necessarily correlated with each other because of the common error in the estimated vehicle location.

To correctly deal with the SLAM problem, we have to define some quantities:

• $x_k$ is the state vector, describing the vehicle location and orientation
• $u_k$ is the vehicle control input
• $m_i$ is a vector describing the location of the ith landmark, whose location is assumed time invariant
• $z_{ik}$ is the observation of landmark i performed by the vehicle at time k
• $U_{0:k}$ is the history of control inputs
• $m$ is the set of all landmarks
• $Z_{0:k}$ is the set of all landmark observations

The probabilistic formulation of the SLAM problem requires computing, for all times k, the probability distribution

$$P(x_k, m \mid Z_{0:k}, U_{0:k}, x_0) \tag{2.1}$$

Starting from the probability distribution at the previous step, $P(x_{k-1}, m \mid Z_{0:k-1}, U_{0:k-1})$, a control command $u_k$ and an observation $z_k$, it is possible to compute the next state probability using Bayes' theorem. To do so we need the so-called motion model and observation model.

The motion model can be expressed as a Markov process $P(x_k \mid x_{k-1}, u_k)$, in which the next state only depends on the previous state and the control input.

The observation model gives the probability of making an observation $z_k$ given the vehicle and map state: $P(z_k \mid x_k, m)$.

The SLAM algorithm is then implemented in a two-step predict and correct form.

Time update (predict)

$$P(x_k, m \mid Z_{0:k-1}, U_{0:k}, x_0) = \int P(x_k \mid x_{k-1}, u_k)\, P(x_{k-1}, m \mid Z_{0:k-1}, U_{0:k-1}, x_0)\, dx_{k-1} \tag{2.2}$$

Measurement update (correct)

$$P(x_k, m \mid Z_{0:k}, U_{0:k}, x_0) = \frac{P(z_k \mid x_k, m)\, P(x_k, m \mid Z_{0:k-1}, U_{0:k}, x_0)}{P(z_k \mid Z_{0:k-1}, U_{0:k})} \tag{2.3}$$

Popular solutions to the SLAM problem exploit particle filters, the Extended Kalman Filter (EKF) and GraphSLAM.


2.3.2 Sensing

Depending on the type of sensors the agent is equipped with, you can implement different SLAM techniques and achieve different performance. Those sensors include (mono or stereo) cameras, LiDARs, radars, tactile sensors or radio-based sensors (mainly for indoor applications).

Sensor models mainly divide into landmark-based and raw-data approaches. The first class allows defining landmarks, namely features (such as objects) in the real world whose location can be estimated by the sensor. The latter makes no assumption about landmarks but instead models the observation probability as a raw function of the location.

2.3.3 Maps

SLAM algorithms build abstract maps of the environment. The type of map depends on the algorithm and the specific application domain. Typical representations are topological maps and grid maps.

Figure 2.11: SLAM maps. (a) Topological map (Source: Google Images). (b) Grid map (Source: Google Images).

The aim of a topological map (Figure 2.11a) is to capture the connectivity of the environment rather than create an accurate map. For this reason, it typically lacks details, scale and other geometric properties of the environment. On the other hand, grid maps (Figure 2.11b) create a discrete approximation of the environment using a grid (with either square or hexagonal cells). Then, each cell has a probability of being occupied.

2.3.4 Data association

Data association is the problem of associating an observation with a previously seen landmark. Bad associations jeopardize the performance of the algorithm.


For each new observation, if it is matched with a landmark then both map and agent state are updated; otherwise, the observation is added to the map as a new landmark.

2.3.5 Loop closure

Together with data association, loop closure is another very challenging problem for SLAM. It is the problem of recognizing a previously visited location and updating the state accordingly. This is typically done by means of a second algorithm. It allows correcting the drift accumulated by the moving agent and resetting its position, consequently correcting the entire map.

2.3.6 VSLAM

Visual SLAM (VSLAM) is a particular instance of the Simultaneous Localization And Mapping (SLAM) problem, where the only available sensors are mono or stereo cameras.

In the following, two of the most popular VSLAM ROS implementations are described.

RTAB-Map

Real-Time Appearance-Based Mapping (RTAB-Map)³ is an open source SLAM solution. The project started in 2009 and became a library in 2013, then evolving into a complete graph-based SLAM approach. As a result, it is now shipped both as a stand-alone cross-platform C++ library and as the rtabmap_ros ROS package⁴.

The first solution proposed an appearance-based loop closure detection mechanism [12] exploiting an efficient memory management to deal with large environments in real time. The algorithm was later extended to multi-session mapping [13] and to support LiDAR-based SLAM [14] in addition to VSLAM.

In the following, a brief description of how RTAB-Map works is provided. Figure 2.12 depicts the structure of the rtabmap ROS node.

³ http://introlab.github.io/rtabmap/
⁴ http://wiki.ros.org/rtabmap_ros

Figure 2.12: Block diagram of the rtabmap ROS node (Source: [14])

The algorithm uses a graph to store vehicle poses and sensor data, which are continuously updated by means of a graph optimization algorithm. Doing this in a naive way would be very inefficient, since the graph would become bigger and bigger as the robot moves. For this reason, RTAB-Map uses an efficient memory management based on Short-Term Memory (STM) and Long-Term Memory (LTM).

When the algorithm detects a loop closure, the accumulated drift is compensated by the computations made by graph optimization.

A node of the graph can be either in the LTM or in the Working Memory (WM). When it is transferred to the LTM it is no longer available to the other modules inside the WM. Nodes are moved out of the WM either when a given threshold is reached or when the computation time is exceeded. The swap-out policy is based on heuristics, keeping the more important nodes (for example based on frequency or time spent).

The node can provide several outputs such as map data, the robot tf and the occupancy grid.

Regarding the odometry node, it is a flexible one which can implement either visual odometry or LiDAR odometry depending on the robot's sensory equipment.

The case of LiDAR odometry will be neglected since it is not strictly related to the work of this thesis. Instead, the structure of the visual odometry node is depicted in Figure 2.13.

If proprioceptive odometry is not available or not accurate enough, RTAB-Map allows exploiting visual odometry using either RGB-D or stereo cameras. Both standard approaches [21] are implemented, namely Frame-To-Frame (F2F) and Frame-To-Map (F2M). The main difference between the two is that the first one registers a new frame against the last key-frame, while the second one registers it against a local map of features extracted from all the past key-frames. Differences between the two approaches are highlighted with different colours in the picture.


Figure 2.13: Block diagram of the rgbd odometry and stereo odometry ROS nodes (Source: [14])

First, feature points (GFTT) are extracted. Then, for F2M they are tested using the Nearest Neighbour Distance Ratio (NNDR) test [15] with BRIEF descriptors [4], while optical flow directly on GFTT features is used for F2F.

A motion model based on previous transformations is used to predict where features belonging to the key-frame (F2F) or feature map (F2M) should be in the current frame. Those candidate positions are used to match features in the new frame and compute the movement of the robot. Bundle adjustment [10] is used to refine the estimation. The resulting estimation is used to update both the robot pose and the map.

Loop closure detection uses the bag-of-words approach already described in [12]. Bags of words are collections of visual features such as SURF [3], SIFT [15], ORB [20] or BRIEF [4]. When using visual odometry, it is possible to re-use the same features so as to avoid extracting them twice. These are used to compute a probability that the robot is again in a previously visited point: once this probability exceeds a given threshold, a loop closure is detected.


ORB-SLAM2

ORB-SLAM2 is another very popular VSLAM algorithm used in ROS. The authors first proposed a monocular version [16] and then a stereo and RGB-D extension [17].

The algorithm uses a feature-based method to extract salient features from input images, as depicted in Figure 2.14.

Figure 2.14: ORB-SLAM2 input pre-processing (Source: [17])

Regardless of the nature of the robot-mounted camera (either RGB-D or stereo), ORB features are extracted and used to obtain monocular and stereo key-points. Then, the input images are discarded and the rest of the algorithm works only on those features, being independent of the type of camera. ORB features are used since they are robust to rotation and scale and have a good invariance to camera auto-gain, auto-exposure and illumination changes. The very same ORB features are used for tracking, mapping and place recognition.

The system threading model is depicted in Figure 2.15. The system has three main parallel threads:

• the Tracking thread is used to localize the camera, finding in every frame features that match the local map. It tries to minimize the reprojection error
• the Local mapping thread is used to manage the local map, optimizing it using local Bundle Adjustment. The policy behind the algorithm is to insert new key-frames very often and cull redundant ones afterwards
• the Loop closing thread is used to detect loops and correct the accumulated drift


Figure 2.15: ORB-SLAM2 threads and modules (Source: [17])

The last thread launches a Full Bundle Adjustment thread which performs full Bundle Adjustment to compute the optimal structure and motion solution. The system also has a Place Recognition module which allows performing loop detection, re-localization and recovery from failures. It is based on DBoW2 [9].

Finally, ORB-SLAM2 also incorporates a Localization Mode, in which the mapping and loop closing threads are deactivated. This is useful for localization in well-mapped environments, where there are no significant changes. This functionality exploits visual odometry matches (with respect to features in the previous frame) and matches to map points (to compensate drift).

2.4 Proposed solution

The solutions described so far are very general, allowing a robot to map an unstructured environment, completely ignoring the final application domain. We, instead, are going to take part in competitions where the environment is structured, with coloured cones defining the track where the vehicle can move. So, instead of extracting geometric features from an image, it is possible


to exploit cones as features, building a sparse semantic map of the environment. Moreover, while for an autonomous agent it is possible to move around and close loops very often, for a Driverless vehicle during the Track Drive event the only possible loop closing happens at the end of a lap. For this reason the loop closure technique can be implemented by looking at cones defining the start line.

With these ideas in mind, an EKF-based VSLAM algorithm is proposed in this thesis.


Chapter 3

Cone Detection

3.1 YOLO-based cone detector

At the moment of writing, the team had not yet had the chance to buy the cones used in the competitions. So I searched the Internet for videos of previous competitions and downloaded two videos from YouTube.

The main idea was to train YOLO from scratch, using just three classes: cone_blue, cone_yellow, cone_orange. Due to the small number of classes and its low resource demands, I decided to use Tiny YOLOv3.

I extracted 22 frames and labelled them using an annotation tool. Each frame had around 10 cones.

The training was done using 20 images, keeping the other 2 for validation. Despite the really small dataset, after only 1800 epochs the system is able to correctly classify cones not only in the video from which the images were extracted, but also in the other one (as depicted in Figure 3.1).

Both training and inference were done on an NVIDIA GeForce 1050 Ti. Inference runs smoothly at 30 fps.

Even with some issues due to the small size of the dataset, this is a very promising result (Figure 3.1). The confidence of detected cones on the video from which the frames were extracted is very high (Figure 3.1a), but it decreases when working on a different video (Figure 3.1b). Also, some false positives arise. Anyway, this solution should be further explored as soon as real cones are available.


Figure 3.1: YOLO-based cone detection results. (a) Same video as training frames. (b) Different video from training frames.

3.2 Colour-based cone detector

3.2.1 Premise

The results obtained with the YOLO-based detector are very promising, but the idea behind this thesis was to test the algorithm in a real-world scenario. In doing this, two main problems arose: the team still did not have the official cones for the competition and the vehicle was still a work in progress.

To do so I decided to use the testing equipment depicted in Figure 3.2. It consists of

• an RC car
• a set of red, blue, pink, yellow and green cones, 10 per colour
• an Intel® RealSense™ D435

Figure 3.3: Colour-based cone detector pipeline (the RGB frame goes through filtering, cropping, RGB-to-HSV conversion and HSV colour masking; the depth frame through thresholding, filtering and cropping; then, for each cone colour: contours and area, centre of mass, acceptance test and 3D position [x, y, z])

Given the aforementioned problem with cone availability, I decided to use a set of small coloured cones. With that idea in mind, I implemented a colour-based cone detector using legacy computer vision techniques instead of DNN-based ones.

3.3 The algorithm

This section explains all the steps implemented by the cone detector algorithm. The algorithm takes as input an aligned RGB-D frame and provides as output an array of 3D cone positions and types. The complete pipeline is depicted in Figure 3.3. All the computer vision tasks are performed by means of the OpenCV² library.

The algorithm is very general and is not specific to a particular stereo system, but it assumes that the RGB-D pair is aligned. Once the coloured frame (Figure 3.4a) and the depth frame (Figure 3.4b) arrive, they are pre-processed. This involves filtering and cropping.

The filtering stage is used to get rid of noise and distortions. Regarding the RGB frame, the image is first processed to correct lens distortions (see Lens Distortions and Camera calibration). The image is then filtered using the pyrMeanShiftFiltering()³ function. This provides a "posterized" version of the image, where textures are flattened. It reduces the influence of lights and shades in the image, giving more uniform colours. The depth frame is instead filtered using a medianBlur()⁴ function to remove noise.

² https://opencv.org/
³ https://docs.opencv.org/3.4/d4/d86/group__imgproc__filter.html#

Cameras may have a large field of view, capturing a lot of unnecessary details. For example, in Figure 3.4a you can see the sky in the top part of the image and the front of the vehicle in the bottom one. For this reason, during the cropping stage a ROI is cut out of the images, so as to avoid wasting processing power on useless data.

The results of the filtering and cropping stages for both colour and depth frames are depicted in Figure 3.4c and Figure 3.4d respectively.

As previously presented, RGB is only one of the available colour spaces. Colour recognition is better implemented using the HSV colour space, since the H channel contains all the information related to the pure colour. This approach is in fact more robust than RGB-based ones with respect to variations of light. For testing the algorithm I was interested in recognizing 5 types of coloured cones, so I took a photo of each of them and extracted the HSV coordinates of each colour. Since it is impossible to have a perfectly homogeneous coloured cone, I heuristically defined a range of variation for each colour, as shown in Table 3.1. Notice that HSV values are in OpenCV format, with H from 0 to 180, S from 0 to 255 and V from 0 to 255.

Colour   Min value        Max value
Red      [0, 120, 100]    [15, 255, 255]
Blue     [105, 120, 100]  [120, 255, 255]
Yellow   [23, 120, 100]   [38, 255, 255]
Green    [60, 120, 100]   [75, 255, 255]
Pink     [150, 120, 100]  [165, 255, 255]

Table 3.1: HSV values for cone colours

Now, the cropped RGB image is converted into HSV and compared against the colours in the previous table. Let's consider the pink colour. Its min and max values are used, together with the HSV frame, as inputs for the cv::inRange()⁵ function. The result is depicted in Figure 3.4e: for each pixel in the source image, the corresponding pixel in the destination image is white if the value is inside the ranges of the colour, black otherwise.

⁴ https://docs.opencv.org/3.4/d4/d86/group__imgproc__filter.html#ga564869aa33e58769b4469101aac458f9
⁵ https://docs.opencv.org/3.4.3/d2/de8/group__core__array.html#


At this point cv::findContours()⁶ is applied to the binary mask, obtaining a set of candidate cones. They cannot be accepted right away, since the detector relies only on colours, so an acceptance test is needed. At first I tried to test the "triangularity" of the shape, but this does not work since cones have round edges. So I implemented a simple acceptance test based on distance and area. For each candidate cone $c_i$ you compute the area $a_i$ (with cv::contourArea()⁷) and the image moments $M^i$ (with cv::moments()⁸). The latter are particular weighted averages of pixel intensities which allow you to compute some image features. One of the most important ones is the centroid, which can be computed as

$$\bar{x}_{c_i} = \frac{M^i_{10}}{M^i_{00}} \qquad \bar{y}_{c_i} = \frac{M^i_{01}}{M^i_{00}}$$

where $(\bar{x}_{c_i}, \bar{y}_{c_i})$ is the centroid (in pixels) of candidate cone $c_i$. The very same pixel coordinates are used to access the cropped depth frame, obtaining the distance $d_i$.

Having defined the minimum distance $d_{min}$, the maximum distance $d_{max}$, the minimum area $a_{min}$ and the maximum area $a_{max}$, a candidate cone $c_i$ is assumed to be a real cone $\hat{c}_i$ if it passes the test

$$d_i > d_{min} \wedge d_i < d_{max} \wedge a_i > a_{min} \wedge a_i < a_{max}$$

For each $\hat{c}_i$, given the centroid in homogeneous coordinates $q_i = [\bar{x}_{c_i}, \bar{y}_{c_i}, 1]^T$ and the camera intrinsics matrix $K$, you can compute

$$p_i = K^{-1} q_i d_i$$

where $p_i$ is the 3D position of the cone with respect to the camera.

These steps are done for each color in Table 3.1, obtaining a set of left, right and start cones, each one with its own position.
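The following Python/OpenCV sketch summarizes the per-colour part of this pipeline; the thresholds and the helper name detect_cones are illustrative assumptions (the thesis implementation lives in ROS), and the depth frame is assumed to be aligned with the RGB frame and expressed in metres:

import cv2
import numpy as np

def detect_cones(bgr, depth, K, hsv_min, hsv_max,
                 a_min=50, a_max=5000, d_min=0.3, d_max=8.0):
    """Return a list of 3D cone positions for one colour range (illustrative sketch)."""
    hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, hsv_min, hsv_max)          # binary mask for this colour
    # [-2] keeps the contour list for both OpenCV 3.x and 4.x return conventions
    contours = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                cv2.CHAIN_APPROX_SIMPLE)[-2]
    cones = []
    K_inv = np.linalg.inv(K)
    for c in contours:
        a = cv2.contourArea(c)
        M = cv2.moments(c)
        if M["m00"] == 0:
            continue
        u = M["m10"] / M["m00"]                         # centroid (pixels)
        v = M["m01"] / M["m00"]
        d = float(depth[int(v), int(u)])                # depth at the centroid
        if d_min < d < d_max and a_min < a < a_max:     # acceptance test
            p = K_inv @ np.array([u, v, 1.0]) * d       # back-project to 3D
            cones.append(p)
    return cones

# Example usage with the pink range from Table 3.1:
# cones = detect_cones(rgb_frame, depth_frame, K,
#                      np.array([150, 120, 100]), np.array([165, 255, 255]))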

⁶ https://docs.opencv.org/3.4.3/d3/dc0/group__imgproc__shape.html#ga17ed9f5d79ae97bd4c7cf18403e1689a
⁷ https://docs.opencv.org/3.4/d3/dc0/group__imgproc__shape.html#ga2c759ed9f497d4a618048a2f56dc97f1
⁸ https://docs.opencv.org/3.4/d3/dc0/group__imgproc__shape.html#ga556a180f43cab22649c23ada36a8a139


Figure 3.4: Colour-based cone detector stages. (a) Original RGB frame. (b) Original Depth frame. (c) Filtered and cropped RGB frame. (d) Filtered and cropped Depth frame. (e) Mask. (f) Contours and centroids on RGB frame. (g) Contours and centroids on Depth frame.

Chapter 4

Visual SLAM

The base of the work in this chapter is inspired by [25].

4.1 Notation

The system state collects both the vehicle and the map state. We will use two different frames of reference, as depicted in Figure 4.1: a global reference frame (fixed) and a local reference frame (attached to the vehicle).

Figure 4.1: Global reference frame (big one) and local reference frame (small one)

The vehicle state is defined as

$$V = \begin{bmatrix} x \\ y \\ \psi \\ u \\ r \end{bmatrix}$$

where x and y are the vehicle position in the global reference frame, ψ is the yaw angle in the global reference frame, and u and r are the vehicle longitudinal and rotational velocities in the local frame.

The map is represented as

$$M = \begin{bmatrix} \ell_1 \\ \ell_2 \\ \vdots \\ \ell_M \end{bmatrix}$$

where each $\ell_i$ is defined by its 2D position

$$\ell_i = \begin{bmatrix} x_{\ell_i} \\ y_{\ell_i} \end{bmatrix}$$

expressed in the global reference frame.

So, the complete state mean vector $x \in \mathbb{R}^n$ is

$$x = \begin{bmatrix} V \\ M \end{bmatrix}$$

while the complete state covariance matrix $P \in \mathbb{R}^{n \times n}$ is

$$P = \begin{bmatrix} P_{VV} & P_{VM} \\ P_{MV} & P_{MM} \end{bmatrix}$$

where $n = 5 + 2M$, with $M$ the number of landmarks in the map.

Each cone can be of any type $t \in \{start, left, right\}$. This information is stored in the state, but is not shown for the sake of simplicity. Also, let's define the function $\mathrm{typeof}(\cdot)$ which returns the type of either a mapped landmark or an observation.

4.2 Initialization

The state dimensionality grows over time, as the vehicle detects new landmarks. So, at the beginning, the system state contains only the variables related to the vehicle. The state mean vector and covariance matrix are initialized as

$$x = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 0 \end{bmatrix} \qquad P = 0_{5 \times 5}$$

since, at the beginning, the moving frame coincides with the fixed one and the vehicle stands still. Also, there is absolute certainty about the system state (null covariance matrix).

4.3 Prediction step

The vehicle motion model expresses how the vehicle state evolves over time. It is expressed by a non-linear function

$$f(V, u, q) = \begin{cases} x_{k+1} = x_k + u_k \cos\psi_k\,\Delta t + q_x \\ y_{k+1} = y_k + u_k \sin\psi_k\,\Delta t + q_y \\ \psi_{k+1} = \psi_k + r_k\,\Delta t + q_\psi \\ u_{k+1} = u_k + q_u \\ r_{k+1} = r_k + q_r \end{cases}$$

As you can see, the state evolution does not depend on any external input u, but is driven by the velocity of the vehicle at the previous step, modelled as a random walk. The vector $q = [q_x, q_y, q_\psi, q_u, q_r]^T \sim \mathcal{N}(0, Q)$ represents the motion estimation uncertainty.

Since we are working with an EKF-based SLAM, we need to compute the Jacobian matrices of the previous function. The Jacobian with respect to the vehicle state is

$$F_V = \frac{\partial f}{\partial V} = \begin{bmatrix} 1 & 0 & -u \sin\psi\,\Delta t & \cos\psi\,\Delta t & 0 \\ 0 & 1 & u \cos\psi\,\Delta t & \sin\psi\,\Delta t & 0 \\ 0 & 0 & 1 & 0 & \Delta t \\ 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{bmatrix}$$

while the Jacobian with respect to the noise q is

$$F_q = \frac{\partial f}{\partial q} = I_{5 \times 5}$$

In SLAM, the vehicle state is time-variant, while the map is static. So, for the prediction step we have

$$V \leftarrow f(V, u, 0) \qquad M \leftarrow M$$

which can be rewritten, avoiding all the trivial operations, as

$$V \leftarrow f(V, u, 0)$$
$$P_{VV} \leftarrow F_V P_{VV} F_V^T + F_q Q F_q^T$$
$$P_{VM} \leftarrow F_V P_{VM}$$
$$P_{MV} \leftarrow P_{VM}^T$$
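A minimal NumPy sketch of this prediction step, assuming the state layout [x, y, ψ, u, r, ℓ1x, ℓ1y, ...] described above (an illustration of the equations, not the thesis MATLAB/ROS code):

import numpy as np

def ekf_predict(x, P, Q, dt):
    """EKF-SLAM prediction: x = [x, y, psi, u, r, l1x, l1y, ...], P is (5+2M)x(5+2M)."""
    px, py, psi, u, r = x[:5]

    # Motion model (random-walk velocities, landmarks static)
    x_new = x.copy()
    x_new[0] = px + u * np.cos(psi) * dt
    x_new[1] = py + u * np.sin(psi) * dt
    x_new[2] = psi + r * dt

    # Jacobians of the motion model
    F_V = np.eye(5)
    F_V[0, 2] = -u * np.sin(psi) * dt
    F_V[0, 3] = np.cos(psi) * dt
    F_V[1, 2] = u * np.cos(psi) * dt
    F_V[1, 3] = np.sin(psi) * dt
    F_V[2, 4] = dt
    F_q = np.eye(5)

    # Covariance propagation: only the vehicle and cross blocks change
    P_new = P.copy()
    P_new[:5, :5] = F_V @ P[:5, :5] @ F_V.T + F_q @ Q @ F_q.T
    P_new[:5, 5:] = F_V @ P[:5, 5:]
    P_new[5:, :5] = P_new[:5, 5:].T
    return x_new, P_new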

4.4 Correction step

4.4.1 Landmark observation

The second important part of any EKF-based SLAM technique is the measurement model. Given any landmark $\ell_i$ inside the map, it projects the landmark position from the global to the moving reference frame.

Given the vehicle pose $p = [x, y, \psi]^T$, it can be expressed as

$$h(p, \ell_i) = \begin{bmatrix} \cos\psi & \sin\psi \\ -\sin\psi & \cos\psi \end{bmatrix} \begin{bmatrix} x_{\ell_i} - x \\ y_{\ell_i} - y \end{bmatrix} = \begin{bmatrix} (x_{\ell_i} - x)\cos\psi + (y_{\ell_i} - y)\sin\psi \\ -(x_{\ell_i} - x)\sin\psi + (y_{\ell_i} - y)\cos\psi \end{bmatrix}$$

Again, you need to compute the Jacobian matrices since the system is non-linear. The Jacobian with respect to p is

$$H_p = \frac{\partial h}{\partial p} = \begin{bmatrix} -\cos\psi & -\sin\psi & -(x_{\ell_i} - x)\sin\psi + (y_{\ell_i} - y)\cos\psi \\ \sin\psi & -\cos\psi & -(x_{\ell_i} - x)\cos\psi - (y_{\ell_i} - y)\sin\psi \end{bmatrix}$$

while with respect to $\ell_i$

$$H_{\ell_i} = \frac{\partial h}{\partial \ell_i} = \begin{bmatrix} \cos\psi & \sin\psi \\ -\sin\psi & \cos\psi \end{bmatrix}$$

Assume that at a given time instant the vehicle has observed k cones. For all of them the data association step is performed (see section 4.4.3). If an observation does not match any landmark in the map, then it is assumed to be a new cone and is added to the map (see section 4.4.4). On the other hand, all the matching observations are used together for a joint state update.

So, considering only the observations $y_i$ associated to previously mapped landmarks,

$$z = \begin{bmatrix} \vdots \\ y_i - h(p, \ell_i) \\ \vdots \end{bmatrix} \qquad
Z = \begin{bmatrix} \vdots & & \vdots & \\ H_p & \cdots & H_{\ell_i} & \cdots \\ \vdots & & \vdots & \end{bmatrix}
\begin{bmatrix} P_{pp} & \cdots & P_{p\ell_i} & \cdots \\ \vdots & \ddots & \vdots & \\ P_{\ell_i p} & \cdots & P_{\ell_i \ell_i} & \cdots \\ \vdots & & \vdots & \ddots \end{bmatrix}
\begin{bmatrix} \cdots & H_p^T & \cdots \\ & \vdots & \\ \cdots & H_{\ell_i}^T & \cdots \\ & \vdots & \end{bmatrix}
+ \begin{bmatrix} \ddots & 0 & 0 \\ 0 & R & 0 \\ 0 & 0 & \ddots \end{bmatrix}$$

where each sample $y_i$ is affected by Additive White Gaussian Noise $r \sim \mathcal{N}(0, R)$. The block matrices contain the values related to landmark $i$ in position $i$ and zeros for non-observed landmarks. Notice that $H_p$ and $H_{\ell_i}$ are different for each observed landmark.

At this point you can compute the Kalman gain as

$$K = \begin{bmatrix} P_{Vp} & \cdots & P_{V\ell_i} & \cdots \\ P_{Mp} & \cdots & P_{M\ell_i} & \cdots \end{bmatrix}
\begin{bmatrix} \cdots & H_p^T & \cdots \\ & \vdots & \\ \cdots & H_{\ell_i}^T & \cdots \\ & \vdots & \end{bmatrix} Z^{-1}$$

where, also in this case, only the blocks related to observed landmarks are non-zero.

Finally, the update step is

$$x \leftarrow x + Kz$$
$$P \leftarrow P - KZK^T$$
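For illustration, here is a sketch of this correction for a single associated observation, in the same NumPy setting as the prediction sketch (the thesis stacks all matched observations into one joint update, as the equations above describe):

import numpy as np

def ekf_update_landmark(x, P, y, lm_idx, R):
    """Correct the state with one observed cone; y is its [x, y] position in the vehicle frame."""
    px, py, psi = x[0], x[1], x[2]
    j = 5 + 2 * lm_idx                       # index of the landmark in the state vector
    lx, ly = x[j], x[j + 1]

    c, s = np.cos(psi), np.sin(psi)
    h = np.array([(lx - px) * c + (ly - py) * s,
                  -(lx - px) * s + (ly - py) * c])       # predicted observation

    # Sparse Jacobian: non-zero only on the pose and on landmark lm_idx
    H = np.zeros((2, len(x)))
    H[:, 0:3] = np.array([[-c, -s, -(lx - px) * s + (ly - py) * c],
                          [ s, -c, -(lx - px) * c - (ly - py) * s]])
    H[:, j:j + 2] = np.array([[ c,  s],
                              [-s,  c]])

    z = y - h                                # innovation
    Z = H @ P @ H.T + R                      # innovation covariance
    K = P @ H.T @ np.linalg.inv(Z)           # Kalman gain
    x_new = x + K @ z
    P_new = P - K @ Z @ K.T
    return x_new, P_new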


4.4.2 Velocity observation

Exploiting the Visual Odometry (VO) nature of this kind of SLAM, it is possible, given at least 3 non-aligned cones, to measure the pose variation of the vehicle between two subsequent frames. This is done according to [1]. Let's assume we have two sets of 2D landmark coordinates $L_{k-1}$ and $L_k$ containing the landmarks seen, respectively, at time $k-1$ and $k$. Let's also assume that those landmarks are the same (non-matching landmarks have been removed).

So, if you have at least 3 matching cones in two subsequent frames, you can compute the transformation $T_k$ which minimizes the L2 distance between the two sets, thus finding the vehicle rotation and translation from the previous frame:

$$\arg\min_{T_k} \sum_i \left\| L^i_k - T_k L^i_{k-1} \right\|$$

where $i$ denotes the ith cone in the set and positions are expressed in homogeneous coordinates $L^i_{\{\cdot\}} = [x, y, 1]^T$.

First, you have to compute the centroids $\bar{L}_{k-1}$ and $\bar{L}_k$ of the two sets $L_{k-1}$ and $L_k$. Then, given the Singular Value Decomposition

$$USV^T = \mathrm{svd}\left((L_{k-1} - \bar{L}_{k-1})(L_k - \bar{L}_k)^T\right)$$

the rotation $R_k$ can be computed as

$$R_k = VU^T = \begin{bmatrix} r_{1,1} & r_{1,2} & 0 \\ r_{2,1} & r_{2,2} & 0 \\ 0 & 0 & 1 \end{bmatrix}$$

while the translation can be computed as

$$t_k = \bar{L}_k - R_k \bar{L}_{k-1} = \begin{bmatrix} t_x \\ t_y \\ 1 \end{bmatrix}$$

At this point, knowing that the vehicle has moved in a time interval $\Delta t$, you have the measured velocities

$$u_k = -\frac{t_x}{\Delta t} \qquad r_k = -\frac{\mathrm{atan2}(r_{2,1}, r_{1,1})}{\Delta t}$$

where the minus sign comes from the fact that the vehicle has an opposite movement with respect to the cones.
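A sketch of this velocity measurement with NumPy, assuming the matched cone sets are given as 2×N arrays in the vehicle frame; the reflection guard is an extra safeguard not mentioned in the text:

import numpy as np

def measure_velocity(L_prev, L_curr, dt):
    """Estimate (u, r) from two 2xN arrays of matched cone positions (N >= 3)."""
    c_prev = L_prev.mean(axis=1, keepdims=True)   # centroids of the two sets
    c_curr = L_curr.mean(axis=1, keepdims=True)

    # Rotation from the SVD of the cross-covariance (R = V U^T as in the text)
    U, S, Vt = np.linalg.svd((L_prev - c_prev) @ (L_curr - c_curr).T)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                      # guard against a reflection solution
        Vt[-1, :] *= -1
        R = Vt.T @ U.T

    t = c_curr - R @ c_prev                       # translation between centroids

    u = -t[0, 0] / dt                             # vehicle moves opposite to the cones
    r = -np.arctan2(R[1, 0], R[0, 0]) / dt
    return u, r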


The observation model for the velocities $v = [u, r]^T$ is simply

$$h(v) = \begin{bmatrix} u \\ r \end{bmatrix}$$

whose Jacobian is simply a 2-by-2 identity matrix.

The correction step now involves the velocities only and can be written as

$$z = y - h(v)$$
$$Z = P_{vv} + R$$
$$K = P_{vv} Z^{-1}$$
$$x_v \leftarrow x_v + Kz$$
$$P_{vv} \leftarrow P_{vv} - KZK^T$$

where $y$ is the actual observation of $u$ and $r$, affected by Additive White Gaussian Noise $r \sim \mathcal{N}(0, R)$.

4.4.3 Data association

Data association is, together with loop closure, one of the most delicate aspects of any SLAM algorithm. Wrong data associations may lead to catastrophic results.

With this algorithm we are building a semantic map that collects only the position and type of each cone, so the map is quite sparse and memory occupancy is limited. Data association is performed by comparing each observation with all the landmarks of the same type already in the map. This is not very computationally demanding because of the aforementioned properties.

So, a given observation $y_i$ is compared with all $\ell_j$ such that $\mathrm{typeof}(y_i)$ is equal to $\mathrm{typeof}(\ell_j)$. Landmark $j$ is projected into the vehicle frame and compared to the observation, as seen before:

$$h(p, \ell_j) = \begin{bmatrix} \cos\psi & \sin\psi \\ -\sin\psi & \cos\psi \end{bmatrix} \begin{bmatrix} x_{\ell_j} - x \\ y_{\ell_j} - y \end{bmatrix}$$

Again, you need to compute the Jacobian with respect to $p$, $H_p = \frac{\partial h}{\partial p}$, and with respect to $\ell_j$, $H_{\ell_j} = \frac{\partial h}{\partial \ell_j}$.

Then compute the innovation

$$z = y_i - h(p, \ell_j) \qquad Z = \begin{bmatrix} H_p & H_{\ell_j} \end{bmatrix} \begin{bmatrix} P_{pp} & P_{p\ell_j} \\ P_{\ell_j p} & P_{\ell_j \ell_j} \end{bmatrix} \begin{bmatrix} H_p^T \\ H_{\ell_j}^T \end{bmatrix} + R$$

and the Mahalanobis distance

$$d = z^T Z^{-1} z$$

All these distances are filtered with respect to a 3σ threshold and, among the ones that pass the test, the minimum one is taken: that defines the matching landmark.
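A sketch of this gated association in the same NumPy setting; the gate value 9.0 used for the 3σ test is an illustrative choice, not the thesis' tuned threshold:

import numpy as np

def associate(y, y_type, x, P, lm_types, R, gate=9.0):
    """Return the index of the matching landmark for observation y, or None if it looks new."""
    px, py, psi = x[0], x[1], x[2]
    c, s = np.cos(psi), np.sin(psi)
    Rot = np.array([[c, s], [-s, c]])

    best_idx, best_d = None, np.inf
    n_landmarks = (len(x) - 5) // 2
    for j in range(n_landmarks):
        if lm_types[j] != y_type:             # semantic gating: compare same-type cones only
            continue
        k = 5 + 2 * j
        lx, ly = x[k], x[k + 1]
        h = Rot @ np.array([lx - px, ly - py])

        Hp = np.array([[-c, -s, -(lx - px) * s + (ly - py) * c],
                       [ s, -c, -(lx - px) * c - (ly - py) * s]])
        Hl = Rot
        H = np.hstack([Hp, Hl])
        Pj = np.block([[P[0:3, 0:3],     P[0:3, k:k + 2]],
                       [P[k:k + 2, 0:3], P[k:k + 2, k:k + 2]]])

        z = y - h
        Z = H @ Pj @ H.T + R
        d = z @ np.linalg.inv(Z) @ z          # Mahalanobis distance
        if d < gate and d < best_d:           # 3-sigma-like gate, keep the closest match
            best_d, best_idx = d, j
    return best_idx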

4.4.4 Landmark addition

This step is performed each time a new cone is detected. When, as in this case, the sensor provides all the degrees of freedom of the landmark, the initialization function $g(\cdot)$ is simply the inverse of the observation function, $g = h^{-1}(\cdot)$.

So, given the vehicle pose $p = [x, y, \psi]^T$ and an observation $y_{n+1} = [x_{y_{n+1}}, y_{y_{n+1}}]^T$, the new landmark position in the map can be computed as

$$\ell_{n+1} = g(p, y_{n+1}) = \begin{bmatrix} \cos\psi & -\sin\psi \\ \sin\psi & \cos\psi \end{bmatrix} \begin{bmatrix} x_{y_{n+1}} \\ y_{y_{n+1}} \end{bmatrix} + \begin{bmatrix} x \\ y \end{bmatrix}$$

The Jacobian with respect to the vehicle pose is

$$G_p = \frac{\partial g}{\partial p} = \begin{bmatrix} 1 & 0 & -y_{y_{n+1}}\cos\psi - x_{y_{n+1}}\sin\psi \\ 0 & 1 & x_{y_{n+1}}\cos\psi - y_{y_{n+1}}\sin\psi \end{bmatrix}$$

and with respect to the local landmark pose

$$G_{y_{n+1}} = \frac{\partial g}{\partial y_{n+1}} = \begin{bmatrix} \cos\psi & -\sin\psi \\ \sin\psi & \cos\psi \end{bmatrix}$$

Then, the landmark covariance $P_{\ell_{n+1}\ell_{n+1}}$ and the cross-covariance $P_{\ell_{n+1}x}$ can be computed as

$$P_{\ell_{n+1}\ell_{n+1}} = G_p P_{pp} G_p^T + G_{y_{n+1}} R G_{y_{n+1}}^T$$
$$P_{\ell_{n+1}x} = G_p \begin{bmatrix} P_{pp} & P_{pM} \end{bmatrix}$$

Finally, the state mean and covariance are augmented as

$$x \leftarrow \begin{bmatrix} x \\ \ell_{n+1} \end{bmatrix} \qquad P \leftarrow \begin{bmatrix} P & P_{\ell_{n+1}x}^T \\ P_{\ell_{n+1}x} & P_{\ell_{n+1}\ell_{n+1}} \end{bmatrix}$$
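A matching NumPy sketch of the augmentation step (same illustrative state layout as before; here the cross-covariance is computed against the whole state, a simplifying reading of the formula above):

import numpy as np

def add_landmark(x, P, y_new, R):
    """Augment the state with a newly observed cone y_new = [x, y] in the vehicle frame."""
    px, py, psi = x[0], x[1], x[2]
    c, s = np.cos(psi), np.sin(psi)

    # Inverse observation: landmark position in the global frame
    l_new = np.array([c * y_new[0] - s * y_new[1] + px,
                      s * y_new[0] + c * y_new[1] + py])

    # Jacobians of the inverse observation
    Gp = np.array([[1.0, 0.0, -y_new[1] * c - y_new[0] * s],
                   [0.0, 1.0,  y_new[0] * c - y_new[1] * s]])
    Gy = np.array([[c, -s],
                   [s,  c]])

    P_ll = Gp @ P[0:3, 0:3] @ Gp.T + Gy @ R @ Gy.T   # new landmark covariance
    P_lx = Gp @ P[0:3, :]                            # cross-covariance with the whole state

    x_aug = np.concatenate([x, l_new])
    P_aug = np.block([[P,    P_lx.T],
                      [P_lx, P_ll  ]])
    return x_aug, P_aug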

4.4.5 Loop closure

In these competitions, the vehicle cannot move wherever it wants, but has to complete laps on a given track. This simplifies the loop closure problem a bit, since it cannot happen at any given point of the track, but only when a lap is completed. Moreover, the track start line is well identified by particular cones.

The loop closure algorithm simply checks at any point in time whether the vehicle observes two start cones and detects which one is on the left and which one is on the right. Then, it matches each of them with the correct cone in the map (landmark association) and performs the update step as already explained.

4.5 Recap

A summary of the entire SLAM algorithm is depicted in Figure 4.2.

Figure 4.2: SLAM block diagram

Both the prediction and the correction step work on the same variables x and P, which define the system state. The prediction step is performed at a given rate and acts as dead-reckoning. The correction step is done at each new camera frame. First, data association is performed, obtaining old and new landmarks. Observations matching landmarks in the map define old landmarks, which are used to correct both the vehicle state and the map. All other observations define new landmarks, which are added to the map.

The localization-only feature is obtained by disabling the Landmark addition step.


Chapter 5

Track Detection

So far the map provided by the SLAM algorithm is simply an unordered set of cones. Each of them is expressed by a tuple $(x, y, t)$, where $x$ and $y$ represent the cone coordinates in the global frame and $t \in \{start, left, right\}$ is the cone type. This chapter formalizes the method used to move from such an unordered set to a structured description of the track boundaries and of the drivable region.

5.1 Boundaries detection

Left and right track boundaries can be obtained by sequentially connecting cones of the same type. Let's focus on a single type of cone, for the sake of simplicity. An example is depicted in Figure 5.1, where the filled dot is the starting cone.

So the problem can be stated as follows: given an ordered sequence of already connected cones
$$C = \{c_1, c_2, \dots, c_n\}$$
choose the correct next cone $c_{n+1}$ from the set of cones
$$U = \{u_1, u_2, \dots, u_m\}$$
The simplest solution would be
$$c_{n+1} = \arg\min_{u_i} |u_i - c_n| \tag{5.1}$$

which means taking as next cone the one having minimum Euclidean distance. This is certainly a possible solution, but it fails in some scenarios, such as the one used here as an example. The result is depicted in Figure 5.2.


Figure 5.1: Possible cones sequence scenario

To overcome the limitations of eq. (5.1) another approach must be used. The proposed one is explained in the next subsection. As already stated, everything is explained for a single track boundary, but it must be executed for both of them (left and right).

5.1.1 Boundary initialization

At the beginning, the track boundary is empty. The first cone to be added is a start cone, but the vehicle sees two of them, one on the left and one on the right.

Since the vehicle moves along its x axis and the fixed frame is assumed to be at the vehicle's initial position, start cones belong to the left or right boundary simply depending on their y coordinate. In particular, the cone with y < 0 belongs to the right boundary, while the one with y > 0 to the left one. This is based on the assumption that no cone lies exactly along the vehicle x axis (otherwise it would have y = 0).

At this point you have a boundary with a single cone. The second cone is picked as the one having minimum Euclidean distance (see eq. (5.1)). This simplification holds if the geometry of the track is “quite regular” and could be removed in the future, but for now it works fine. Now you have a boundary with two cones and you can apply the iterative method described in what follows.
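The initialization can be sketched in MATLAB as follows, assuming cones are stored as rows of $[x, y]$ coordinates; variable names are illustrative.

```matlab
% Boundary initialization (sketch). start_cones is a 2x2 matrix with one
% [x, y] start cone per row; left_cones is an Nx2 matrix of left cones.
% The fixed frame coincides with the vehicle's initial pose, so the sign
% of y decides the side; the second cone follows eq. (5.1).
startL = start_cones(start_cones(:, 2) > 0, :);    % y > 0 -> left boundary
startR = start_cones(start_cones(:, 2) < 0, :);    % y < 0 -> right boundary

[~, i2] = min(vecnorm(left_cones - startL, 2, 2)); % nearest left cone
L = [startL; left_cones(i2, :)];                   % left boundary with two cones
left_cones(i2, :) = [];                            % remove it from the candidates
% (the right boundary R is initialized in the same way)
```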


Figure 5.2: Wrong connections using Euclidean distance

5.1.2 Next cone selection

The method proposed in this thesis is to choose the next cone $c_{n+1}$ as the $u_i \in U$ having two properties:

1. minimum Euclidean distance;
2. maximum collinearity.

Let's define
$$\delta_i \triangleq \frac{\left|(c_n - c_{n-1}) \times (u_i - c_n)\right|}{|c_n - c_{n-1}|\,|u_i - c_n|} \tag{5.2}$$
which represents the $\sin(\cdot)$ of the angle $\alpha$ between the two segments $\overline{c_{n-1}c_n}$ and $\overline{c_n u_i}$. Thus, $\delta_i \to 0$ as $\alpha \to 0$, meaning that the two segments are collinear.

Still, this does not take distance into account: you may have two aligned cones $u_1, u_2 \in U$ having the same $\alpha$ with respect to the segment $\overline{c_{n-1}c_n}$, but you have to choose the closest one. To do so, the Euclidean distance is included in the cost function for next cone selection, obtaining
$$c_{n+1} = \arg\min_{u_i}\;\cdots \tag{5.3}$$

where $w$ is a weight parameter used to take into account how much you want to penalize cones which deviate from the previous segment. It can be tuned heuristically. The complete result is depicted in Figure 5.3, while a detailed view is in Figure 5.4. Even if $u_5$ and $u_6$ are closer to $c_n$ than $u_1$, the latter is chosen since it is the best one both in terms of distance and collinearity.

Figure 5.3: Right connections using proposed cost function

As you can see from eq. (5.3), when $\delta_i \to 0$ you are basically minimizing the Euclidean distance as in eq. (5.1). Otherwise, both the distance and the displacement from the path so far are taken into account, properly weighted.

This process has to be iterated until all the cones belonging to U are added to the C set.
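A MATLAB sketch of the selection step is reported below. The exact combination of distance and collinearity used in eq. (5.3) is not reproduced here: an additive cost $|u_i - c_n| + w\,\delta_i$ is assumed purely for illustration, and the function name and data layout are illustrative.

```matlab
% Next-cone selection (sketch). C is the ordered Mx2 list of already
% connected cones, U an Nx2 matrix of candidate cones, w the tuning weight.
function [c_next, U] = pick_next_cone(C, U, w)
    seg    = C(end, :) - C(end-1, :);            % previous segment c_{n-1} -> c_n
    best   = inf;
    best_i = 0;
    for i = 1:size(U, 1)
        v     = U(i, :) - C(end, :);             % candidate segment c_n -> u_i
        dist  = norm(v);
        delta = abs(seg(1)*v(2) - seg(2)*v(1)) / (norm(seg) * dist);  % eq. (5.2)
        cost  = dist + w * delta;                % assumed form of eq. (5.3)
        if cost < best, best = cost; best_i = i; end
    end
    c_next = U(best_i, :);                       % chosen cone c_{n+1}
    U(best_i, :) = [];                           % remove it from the candidate set
end
```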

5.1.3 Boundary closure

Similarly to SLAM loop closure, a way to decide when to close the boundary is necessary to correctly compute the track geometry. When all the cones in U have been added to C, the boundary closure test is performed.

Let's define a tunable threshold parameter $d$. If
$$|c_n - c_1| < d$$
then boundary closure is detected and $c_n$ is connected to $c_1$.

Figure 5.4: Detail of cone connection with proposed cost function

New cones may still be added to the map after track boundary closure. In that case, the segment $\overline{c_n c_1}$ is removed, the new cones are added following eq. (5.3) and the new $c_n$ is connected to $c_1$ to close the boundary again.

5.2 Road Detection

At the end of the previous section you have constructed a left boundary $L = \{l_1, l_2, \dots, l_l\}$ and a right boundary $R = \{r_1, r_2, \dots, r_r\}$ (they are not required to have the same number of cones). This starting situation is illustrated in Figure 5.5.

For planning purposes it is important to know where the vehicle can move, namely what the actual road is. As an abstract representation of the road I decided to use a triangular tessellation, as done for 3D meshes in computer graphics.

The algorithm iteratively builds triangles using cones as vertices. The first triangle $t_1$ is made by the start cones and the first left cone (you could equally start with a right one, the reasoning remains the same):
$$t_1 = (l_1, r_1, l_2)$$
then the second triangle $t_2$ is
$$t_2 = (r_1, l_2, r_2)$$

Figure 5.5: Left and right track boundaries

and so on and so forth. The result is depicted in Figure 5.6.

Figure 5.6: Road tessellation using triangles

So, the key idea is to take at each step the last two vertices of the last triangle and then add, as third vertex, a cone from the boundary other than the one already used. This is done only if that boundary still has cones; otherwise a vertex can be picked from the same boundary.

Track loop closure is obtained for free from boundary closure detection. Even in this case, when track closure has already been performed and new cones are added, the closing triangles are deleted and the tessellation is performed again with the new set of cones.
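The tessellation can be sketched in MATLAB as follows, under the assumption that each triangle is stored as a row of vertex coordinates; the alternation bookkeeping and the function name are illustrative.

```matlab
% Road tessellation (sketch). L and R are the ordered left/right boundaries
% (one [x, y] cone per row). Each output row stores one triangle as
% [x1 y1 x2 y2 x3 y3].
function T = tessellate(L, R)
    T    = zeros(0, 6);
    last = [L(1, :), R(1, :), L(2, :)];          % t1 = (l1, r1, l2)
    T(end+1, :) = last;
    il = 2;  ir = 1;                             % cones already used per boundary
    take_right = true;                           % l2 was the last added vertex
    while il < size(L, 1) || ir < size(R, 1)
        if take_right && ir < size(R, 1)
            ir = ir + 1;  v = R(ir, :);          % next vertex from the right boundary
        elseif il < size(L, 1)
            il = il + 1;  v = L(il, :);          % otherwise (or by turn) from the left
        else
            ir = ir + 1;  v = R(ir, :);          % left boundary exhausted
        end
        last = [last(3:6), v];                   % keep the last two vertices, add v
        T(end+1, :) = last;
        take_right = ~take_right;                % alternate boundaries when possible
    end
end
```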


Chapter 6

Implementation

6.1 MATLAB

Before implementing the previously described algorithm in ROS, I wrote a small library and a MATLAB script to test it. The main structure is reported in Folder Organization 6.1. Since the VSLAM algorithm provides a 2D description of the environment and the vehicle, the real world in this MATLAB simulator is implemented in 2D for the sake of simplicity.

driverless simulator matlab/
    +vehicle/
        Camera.m
        Vehicle.m
    +slam/
        VisualSLAM.m
    +track/
        Track.m
        Acceleration.m
        Ring.m
    +transformations/
        from_frame.m
        to_frame.m
    +utils/
        awgn.m
        shuffle.m
    simulation.m
