Omnicamera-based vSLAM for

Martian Rovers

Marco Lapolla

DIBRIS - Department of Computer Science, Bioengineering,

Robotics and System Engineering

Thales Alenia Space Italia

Master Thesis submitted for the degree in

Robotics Engineering (a.y 2019/2020)

October 27, 2020

Thesis Coordinator: Prof. Fabio Solari


Acknowledgements

First of all, I would like to thank the company in which I carried out most of my thesis work, Thales Alenia Space Italia. In particular, I would like to thank Andrea Merlo, my thesis advisor within the laboratory. Moreover, I would like to thank all the other experts in the lab who helped me during my thesis and made me feel already part of the family. In particular I want to mention Ciro Napolitano, Genny Scalise and Marco Carpentiero.

I would also like to thank my thesis coordinator, Prof. Fabio Solari of DIBRIS at the University of Genoa. Even in this particular situation, Prof. Solari was always available whenever I ran into a trouble spot or had a question about my research or writing. He consistently allowed this work to be my own, but steered me in the right direction whenever he thought I needed it.

Finally, I must express my very profound gratitude to my parents and to my friends for providing me with unfailing support and continuous encouragement throughout my years of study and through the process of researching and writing this thesis. This accomplishment would not have been possible without them.

Thank you to all.


Acknowledgements

I would like to dedicate a few words to all the people who allowed me to get this far and to complete this thesis work. I cannot fail to mention my parents, my brother Matteo, my future "sister" Marta, my grandmother, my aunts and uncles, and my whole family, who have always supported me in realizing my projects. I will never stop thanking you for allowing me to get this far. I thank my course mates for sharing this journey with me, full of happy moments but also of difficulties.

I especially thank my friends, both those I am used to seeing every day and those I grew up with and thanks to whom I have become who I am now. Even though there are no longer many occasions to see each other regularly, certain relationships are practically impossible to wear down.

Special thanks to my friends Alessio, Dario, Diego, Giorgia and Giulia for always being there in recent years, and especially during this final phase of my studies. Thank you for listening to my venting, for sharing all the happy moments and also the saddest ones, and for always believing in me. Finally, I thank Federico, for showing me how it is possible to build a very solid and deep friendship in a short time, and for giving me support and advice during the development and writing of this thesis.

Finally, I dedicate this thesis to the people I would have wanted with all my heart to be present at this moment, Grandpa Renzo and Grandma Lucia.

Thank you all so much, I love you. Marco


Abstract

Rovers for planetary exploration should be capable of operating with a very high level of autonomy, especially in situations where they have to travel long paths with minimal human supervision. Control operations must be minimized in order to reduce the traverse time, optimize the resources allocated for telecommunications and maximize the scientific output of the mission.

Control algorithms have to provide the appropriate inputs to the actuators from the knowledge of the goal position and the information about the vehicle dynamics. Path planning algorithms use three-dimensional models of the surrounding terrain in order to safely avoid obstacles. Moreover, rovers for the sample-and-return missions planned for the next years have to demonstrate the capability to return to a previously visited place to sample scientific data, or to return a sample to an ascent vehicle.

The estimation of the motion is a fundamental task in rover control, and the planetary environment presents some specific issues. Wheel Odometry has a wide uncertainty due to the slippage of wheels on sandy surfaces, inertial measurements have well-known drift problems, and GPS-like positioning systems are not available on extraterrestrial planets. Vision systems have proven to be reliable and accurate motion tracking methods. One method that has not yet been investigated widely is Visual Odometry with an omni-directional camera. This method mitigates problems such as occlusions and also offers the opportunity to reconstruct the surroundings of the rover at 360°.

Thanks to vSLAM (visual Simultaneous Localization and Mapping) techniques, a rover is able to reconstruct a consistent map of the environment and to localize itself with reference to this map. The SLAM technique presents two main advantages: the construction of a map of the environment and a more accurate motion tracking, thanks to the solution of a large minimization problem which involves multiple camera poses and measurements of map landmarks.

After rover touchdown, one of the key tasks requested of the operations center is the accurate measurement of the rover position in inertial and fixed coordinate systems, such as the J2000 frame and the Mars Body-Fixed frame. For engineering and science operations, high-precision global localization and detailed Digital Elevation Models of the landing site are crucial. This thesis mainly treats the problem of localizing a rover with respect to a local coordinate frame, leaving aside the absolute localisation task, which is equally important and challenging.

During rover traverse, Visual Odometry methods could be used as an asset to refine the path estimation.

The Visual Odometry algorithm which has been implemented and presented in this thesis first guesses the motion with a linear 3D-to-3D method embedded within a RANdom SAmple Consensus (RANSAC) process to remove outliers. Then, the motion estimate is computed from the inliers by minimizing the Euclidean distance between the triangulated landmarks.

Among all the vSLAM algorithms, ORB-SLAM2, which is based on ORB (Oriented FAST and rotated BRIEF) features, has been chosen and adapted for this application. The overall structure of the framework has been changed in order to work with the particular camera geometry, and a 3D dense reconstruction has been added to build a point cloud, which will be used to construct the traversability map and apply path planning algorithms.

Moreover, a sensor fusion algorithm has been implemented to merge the Visual Odometry with other sources of information, such as Inertial Odometry (based on the gyroscope) and Mechanical Odometry (based on the wheel encoders).

The thesis has been developed in collaboration with Thales Alenia Space Italia, in order to integrate the developed algorithm inside their framework and test it with an Adept MobileRobots Seekur Jr. rover and a PointGrey Ladybug 3 omnicamera inside the test-bed facility in Turin which emulates Martian terrain.


Contents

List of Acronyms
List of Figures
List of Tables

1 Introduction
1.1 Space Robotics Systems
1.2 Planetary Exploration
1.3 Mars Exploration
1.3.1 ExoMars Mission 2020
1.4 Thesis Outline

2 Background
2.1 Camera Model and Calibration
2.2 Feature Detection
2.3 Visual Odometry
2.3.1 Motion Estimation
2.3.2 RANSAC
2.3.3 Local Optimization
2.3.4 Bundle Adjustment
2.4 SLAM Models
2.4.1 Extended Kalman Filter SLAM
2.4.2 Particle Methods SLAM
2.4.3 Graph-Based SLAM
2.5 3D Reconstruction
2.6 Sensor Fusion with Extended Kalman Filter

3 State of the Art
3.1 Visual SLAM
3.2 Localisation on Mars
3.2.1 Relative Localisation
3.2.2 Absolute Localisation

4 Algorithms and Methods
4.1 Method Outline
4.2 SLAM System
4.2.1 Camera Model and Calibration
4.2.2 Feature Detection and Matching
4.2.3 Visual Odometry & Bundle Adjustment
4.2.4 The Ladybug-SLAM System
4.3 3D Reconstruction System

5 Implementation and Experiments
5.1 Software Architecture
5.1.1 WorkFrame
5.1.2 Test Bench for Robotics Applications
5.2 Hardware Resources
5.2.1 Test-bed Rover
5.2.2 PointGrey Ladybug Camera
5.3 System Architecture
5.4 Vocabulary for Mars Terrain
5.5 Calibration Results
5.6 Offline Tests with Kashiwa Dataset
5.7 Online Tests in ROXY Environment

6 Discussion and Conclusions
6.1 Personal Contribution
6.2 Future Works


List of Acronyms

API Application Programming Interface.

ASTRA Advanced Space Technologies in Robotics and Automation.

BA Bundle Adjustment.

BBS Black-Board System.

BoW Bags of Words.

BRIEF Binary Robust Independent Elementary Features.

BRISK Binary Robust Invariant Scalable Keypoints.

C2TAM Cloud framework for Cooperative Tracking And Mapping.

CAN Controller Area Network.

CCD Charge-Coupled Device.

CENSURE CENter SUrround Extremas.

CNES Centre National d'Études Spatiales (French Space Agency).

DEM Digital Elevation Model.

DGPS Differential GPS.

DIMES Descent Image Motion Estimation System.

DLT Direct Linear Transformation.

DoF Degree of Freedom.

DTAM Dense Tracking And Mapping.

ECSS European Cooperation for Space Standardisation.

EKF Extended Kalman Filter.

ELAS Efficient Large-scale Stereo Matching.

ESA European Space Agency.

ESOC European Space Operations Centre.

FAST Features from Accelerated Segment Test.

FoV Field of View.

GNC Guidance, Navigation and Control.

GNSS Global Navigation Satellite System.

GPS Global Positioning System.

GPU Graphics Processing Unit.

GRAAL Genoa Robotics and Automation Laboratory.

HAL Hardware Abstraction Layer.

HiRISE High Resolution Imaging Science Experiment.

IEEE Institute of Electrical and Electronics Engineers.

IMU Inertial Measurement Unit.

IO Inertial Odometry.

IT Information Technology.

JPL Jet Propulsion Laboratory.

KAL Kernel Abstraction Layer.

KF Kalman Filter.

LM Levenberg-Marquardt.

LSD-SLAM Large-Scale Direct SLAM.

LSL Landing Site Local.

MAV Mars Ascent Vehicle.

MBF Mars Body-Fixed.

MER Mars Exploration Rover.

MSL Mars Science Laboratory.

NASA National Aeronautics and Space Administration.

NET NETworking.

ORB Oriented FAST and rotated BRIEF.

OS Operating System.

OVO Oxford Visual Odometry.

P3P Perspective-Three-Point.

PnP Perspective-n-Point.

PTAM Parallel Tracking And Mapping.

PTU Pan Tilt Unit.

RANSAC RANdom SAmple Consensus.

RAS Robotics and Autonomous Systems.

REMODE REgularized MOnocular Depth Estimation.

RMSE Root Mean Square Error.

ROCC Rover Operations Control Centre.

ROS Robot Operating System.

ROXY ROvers eXploration facilitY.

RTAI Real-Time Application Interface.

RTEMS Real-Time Executive for Multiprocessor Systems.

SBC Single Board Computer.

SfM Structure from Motion.

SIFT Scale Invariant Feature Transform.

SLAM Simultaneous Localisation and Mapping.

SURF Speeded Up Robust Feature.

SVO Semi-direct Visual Odometry.

TAS-I Thales Alenia Space Italy.

TBRA Test Bed for Robotics Applications.

ToF Time of Flight.

UDP User Datagram Protocol.

VO Visual Odometry.

vSLAM visual SLAM.

WF WorkFrame.


List of Figures

1.1 Pose of the rover, according to aircraft convention.

2.1 Pinhole camera model, where o is the camera centre and p′ the projection on the image plane of the point p. Figure adapted from Sturm (2014).

2.2 Omni-directional images example.

2.3 The Structure from Motion (SfM) problem.

2.4 Generalized block diagram showing the main components of a VO system. There are three relative pose estimation methods: 2D-to-2D, 3D-to-3D, or 3D-to-2D. Adapted from Durrant-Whyte & Bailey (2006).

2.5 The uncertainty of the pose estimation drifts over time. The final uncertainty is a combination of the step uncertainty $C_{k,k-1}$ and the actual uncertainty $C_{k-1,R}$.

2.6 A graphical model of the SLAM problem. $x_i$ indicate the robot poses, $l_j$ are landmark positions directly observable by the robot, $m_{i,j}$ are the landmark measurements and $u_i$ the control vectors. Through these quantities, the focus is on the estimation of the path of the robot and of the landmark map.

2.7 Binocular disparity.

3.1 Monocular visual SLAM open-source algorithms comparison. Data from Filipenko & Afanasyev (2018).

4.2 (a) The panoramic camera model. We used the rigorous collinearity between C and P′ instead of between S and P, which would cause a slight localization error. (b) The ambiguity in projecting the edge pixels of two adjacent fisheye cameras to the panoramic camera. One point u on the sphere corresponds to two fisheye image points u1 and u2, which would cause a seam on the panoramic image but would not affect the geometric transformation between the fisheye and panoramic cameras. Image taken from Shi et al. (2013).

4.3 An example of a multi-fisheye camera rig (Ladybug 3). The Ladybug 3 camera consists of five fisheye cameras covering a horizontal 360° scene and one camera pointing upwards; the bottom row shows the unwrapped and stitched panoramic image. Image adapted from Ikeda et al. (2003).

4.4 Graph-based Bundle Adjustment. The triangle node represents the pose, the green circular node represents the map point, and the edge represents the error term. (a) Pose optimization; (b) Local Bundle Adjustment; (c) essential graph optimization; and (d) Global Bundle Adjustment for both poses and map points. Image adapted from Mur-Artal et al. (2015).

4.5 Monocular visual ORB-SLAM overall algorithm process. Schema extracted from Mur-Artal et al. (2015).

4.6 The pipeline of the Ladybug SLAM VO system including tracking (first row), local mapping (second row), and loop closing and global optimization (third row).

5.1 Block diagram of the WorkFrame subsystems.

5.2 TBRA modules commands tree.

5.3 An illustration of the Adept MobileRobots Seekur Jr (Test-bed Rover).

5.4 The final configuration of the Seekur Jr. with all the sensors mounted.

5.5 The ROS architecture for the vSLAM.

5.6 Schema of the Extended Kalman Filter. The EKF #1 is in charge of estimating the Roll (φ) and Pitch (θ) of the rover, while the EKF #2 uses these measures as parameters to estimate the overall pose. The latter uses the left and right wheel velocities for prediction, and the estimation computed by the Ladybug SLAM as measurement.

5.7 A sample of acquisition for the calibration phase.

5.8 A distorted image captured by the front camera of the PointGrey Ladybug 3.

5.9 A rectified image captured by the front camera of the PointGrey Ladybug 3.

5.10 Offline test with the Kashiwa dataset.

5.11 ROS rendering of the offline test with the Kashiwa dataset.

5.12 ROS rendering of the offline test with the Kashiwa dataset, with the semi-dense reconstruction activated.

5.13 The ROXY facility.


List of Tables

2.1 Comparison of feature detectors: properties and performance.

3.1 Feature-based visual SLAM and direct SLAM methods characteristics. Data from Filipenko & Afanasyev (2018).

5.1 Comparison of localization and rotation accuracy, correctly tracked number of frames, and average map points per frame from ORB-SLAM2, Cubemap-SLAM and our Ladybug SLAM.


Chapter 1

Introduction

In this chapter we will contextualize and introduce the problem of pose estimation for planetary exploration rovers. We will present the different approaches proposed over the years, and the main advantages and issues associated with the use of cameras to address this problem.

1.1 Space Robotics Systems

Space robotics and autonomous systems (space RAS) are a relatively new subject, but they have gained more and more importance in the last decade. Indeed, space RAS play a critical role in the current and future development of mission-defined machines that must survive in the hostile space environment while performing several tasks, such as exploration, assembly, construction, maintenance, or servicing.

Modern space RAS represent an emerging multi-disciplinary field that builds on space engineering, terrestrial robotics and computer science, as well as materials science, chemistry and IT.

Depending on the application (either orbital or planetary), space robots are often designed to possess locomotion and different capabilities: manipulating, gripping, roving, drilling and/or sampling.

These robots are expected to possess various autonomy capabilities, ranging from teleoperation by a human operator to fully autonomous robots, commanded with high-level plans.

The four levels of autonomy defined in the European Cooperation for Space Standardisation (ECSS) Standards ESA/ESTEC (2005) for a robotic agent in space are:


• E1 (Direct commanding) - The system executes the commands as they are received from Earth;

• E2 (Time-tag commanding) - The commands are set with an associated time-tag, and executed at a designated time;

• E3 (Event-driven) - In this modality, a set of pre-defined events trigger on-board control procedures or direct commands;

• E4 (Goal commanding) - The system is commanded via high-level goals that are decomposed by a planning algorithm into a sequence of lower level commands to be executed. An on-board supervisor controls the execution of these low-level commands.

1.2 Planetary Exploration

Thanks to in situ scientific measurements, robotic explorers can greatly increase the scientific return of planetary and small solar system body exploration missions. The NASA Mars Exploration Rovers (MER), with the synergistic coordination of three orbiters, Mars Odyssey, Mars Express and Mars Reconnaissance Orbiter, have shown that the surface of Mars has been modelled by the interaction with water across its history. The Mars exploration program is now following a scientific goal known as "Seek Signs of Life". The Mars Science Laboratory (MSL), which was designed to search for past and present habitable environments at Gale crater, has the capability to detect complex organic molecules in rocks and soils (Grotzinger et al., 2012). The MSL mission has also demonstrated long-range mobility, long-term surface operations and the sky-crane precision landing. The Pasteur payload of the ExoMars 2018 rover, with its suite of instruments dedicated to exobiology and geochemistry, will be able to analyse samples collected from depths of up to 2 meters (Vago et al., 2015). It will investigate signs of past and present life on Mars and inspect the planet's subsurface and deep interior, in order to trace the evolution and habitability of Mars.

In the next decade, Mars exploration efforts will be focused on the preparation of a Mars sample-return campaign. The possibility to analyse Mars samples on Earth will bring a deeper understanding of the planet, in particular of the process and history of its climate and of the geological evolution of its surface, and will help determine whether life ever arose on Mars. The NASA scheme for this campaign foresees a sequence of three missions: a rover mission to collect and cache the samples; afterwards, the cached samples will be delivered to a Mars Sample Return Orbiter by means of a Mars Ascent Vehicle (MAV) and returned to Earth for an intensive analysis.

Key technologies for achieving this scientific output are surface robotic explorers able to select samples and document their geological context. These explorers should integrate imaging systems and remote spectroscopy to establish the context and identify targets. The ability to select potential samples should not only be related to locations where liquid water has occurred, but also to the detection of organic molecules and to the possibility of reconstructing the geochemical history of the target rock formation, as an indicator of organic matter or of coupled redox reactions characteristic of life (Pratt et al., 2010).

Rover and robotic technologies will also be fundamental to prepare and support a manned mission to Mars. Robotic explorers will be essential to obtain information about hypothetical resources and hazards, test flight systems and build infrastructure for astronauts' exploration activities (Council, 2011). The benefit of in situ measurements for solar system exploration is not limited to Mars and planetary investigations. Planetary exploration at designated or multiple locations of small solar system bodies, such as Near-Earth Objects and the moons of Mars, will help the scientific community to characterize the early solar system history, to study planetary habitats and to highlight the mechanisms of planetary formation processes (Starek et al., 2016). Asteroids, comets and small solar system bodies are characterized by weak gravitational fields, and for this reason new mobility concepts are needed. Many different small-body mobility concepts have been proposed: hopping, wheeled, legged, hybrid and other mobility platforms.

A synergistic approach between an orbiter and a robotic explorer is fundamental to mission success and increases its scientific output. Global positioning is needed to correlate rover scientific measurements with remote orbiter ones, and with other mission measurements of the same celestial body, and to validate planetary global models. Detailed Digital Elevation Models (DEM), produced from orbiter images and stereo-photogrammetry, provide significant support to landing site selection and to the planning of rover operations.

The goal of the global positioning system is to estimate the pose of the robot with respect to a fixed frame. The pose is expressed as a 6-dimensional vector, in which the first three elements represent the spatial displacement of the rover (i.e. the x axis along the longitudinal motion of the robot, the y axis along the transverse direction, and the vertical axis represented as z). The other three values represent the orientation of the vehicle in space.


One of the most commonly used representations derives from the aircraft/nautical convention; the rotation is expressed in the so-called yaw, pitch and roll angles. The yaw (ψ) is the rotation about the vertical axis, the pitch (θ) is the rotation about the transverse axis and the roll (φ) is the angle about the longitudinal axis (see Figure 1.1 for clarification).

Figure 1.1: Pose of the rover, according to aircraft convention.
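As a minimal illustrative sketch of this convention (not part of the thesis software; the angles below are arbitrary), the rotation matrix corresponding to given yaw, pitch and roll angles can be composed in the Z-Y-X order:

```python
import numpy as np

def ypr_to_rotation_matrix(yaw, pitch, roll):
    """Rotation matrix from yaw (z), pitch (y), roll (x) angles in radians (Z-Y-X convention)."""
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)
    Rz = np.array([[cy, -sy, 0], [sy, cy, 0], [0, 0, 1]])   # yaw about the vertical axis
    Ry = np.array([[cp, 0, sp], [0, 1, 0], [-sp, 0, cp]])   # pitch about the transverse axis
    Rx = np.array([[1, 0, 0], [0, cr, -sr], [0, sr, cr]])   # roll about the longitudinal axis
    return Rz @ Ry @ Rx

# Example: rover yawed 30 degrees on flat ground
R = ypr_to_rotation_matrix(np.radians(30.0), 0.0, 0.0)
```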

Due to the low bandwidth and the communication delay between spacecraft and Earth, rovers must employ a high degree of autonomy (E3 or E4 according to the ESA/ESTEC (2005) standards).

Starting from the MER rovers, vision system measurements have proven to be an asset for rover operations during traverse. Relative localization by means of Visual Odometry (VO) has enabled the rovers to drive safely and more effectively on highly sloped and sandy terrains. Indeed, methods that rely on Wheel Odometry and Inertial Measurement Units (IMU) perform poorly in harsh terrain. As a result of using VO and vision system techniques, mission science return has increased, because they have reduced the number of days required to drive to interesting areas (Maimone et al., 2007).

Nowadays rovers navigate the surface at a rate of 40-60 m/sol (a sol is the solar day on Mars, about 1.02 Earth days) using a very computationally demanding pipeline. First of all the terrain traversability is verified thanks to stereo imagery, then the rover plans its motion and finally conducts its manoeuvre. Every few-meters step can take up to several minutes. These limits are caused by the available on-board computation and power of the flight-qualified processors. Of great interest for future applications will be to fuse on-board sensing with higher-resolution orbital imagery for assessing terrain traversability in more effective and automated ways. Moreover, sample-and-return mission rovers have to demonstrate the capability to return a collected sample to the ascent vehicle.

1.3 Mars Exploration

In 1997, twenty years after the NASA Viking missions, the agency revisited the planet Mars with the Pathfinder lander. The lander carried a mobile robot, called Sojourner. This rover, which was based on the Rocky III and IV JPL robotics demonstrators, had six wheels powered by six separate motors (four of these wheels were steerable) and the rocker-bogie suspension system.

The Sojourner rover took pictures and positioned a science instrument against designated soil and rocks.

A detailed description of Sojourner operations is reported in Team (1997). The nominal speed of this rover was 0.4 m/min and its maximal turning rate was 7 deg/s. The navigation was based on a step-by-step approach. The moment to switch off the motors was triggered on an average of motor encoder (drive) or potentiometer (steering) readings. When the rover was stopped, the on-board computer conducted proximity and hazard detection functions. The laser striping and camera system was used to determine the presence of obstacles in its path. In addition, the estimation of the travelled distance and of the heading was computed while the rover was stopped. The number of turns of the wheel motors and the onboard gyroscope were used to provide an estimate of the progress towards the commanded location and of the distance travelled from the lander, both expressed in the local reference frame.

The Sojourner rover operations range was limited by the lander camera field of view: at the end of each sol of rover traverse, the lander provided a stereo image of the vehicle position in the surrounding terrain. Stereo images, portions of the terrain panorama and images from the rover camera were used by an operator at the control station to designate new points as target locations for the next sol's rover traverse. The image size provided by the rover camera was 768 × 484 pixels. With the Rocky VII prototype rover, JPL demonstrated the capability to traverse natural terrain up to a distance of one kilometer. The navigation sensors of Rocky VII were minimal: three accelerometers measured the tilt of the chassis and potentiometers measured the configuration of the rocker-bogie suspension; the relative heading was provided by the integration over time of the signal of a quartz rate sensor, which measured the rate of rotation of the vehicle about its vertical axis; finally, a sun sensor was attached to the rover facing upward. By using the accelerometer signal to determine the sensor tilt, and the local time from an on-board clock, it was possible to compute the absolute vehicle heading.

The following missions on the planet were the Mars Exploration Rovers (MER), Spirit (MER-A) and Opportunity (MER-B), which demonstrated Visual Odometry capabilities for the first time in a planetary environment. The two visual sensors onboard, Pancam (panoramic camera) and Navcam (navigation camera), have not only been used for scientific investigation but also as primary instruments for localization and mapping. A description of the MER rover sensors, localization strategy and topographic measurements is given in Li et al. (2004).

Pancam and Navcam are two stereo cameras with different characteristics. Navcam, used for mapping close objects, has a baseline of 20 cm, a focal length of 14.67 mm, an effective depth of field from 0.5 m to infinity and a field of view of 45 deg. Pancam, used for mapping far objects, has a stereo base of 30 cm, a focal length of 43 mm, a depth of field from 3 m to infinity and a FoV of 16 deg. The image size for both cameras is 1024 × 1024 pixels.

All rover operations require an accurate relative three-dimensional position of the rover itself and of the target object. In addition, the rover localization has to be converted into the Mars inertial and Mars Body-Fixed (MBF) reference frames, in order to place the observations in the regional context captured by orbiter images.

The global localization of the MER rovers was fundamental for planning science and engineering operations. The localization procedure was accomplished during the first eight days after landing (before the rover started its traverse). In order to have an accurate positioning in the regional context, landing site positions were calculated by reconstructing the Entry, Descent and Landing in DIMES descent images and finding common features with orbiter images in the MBF frame and in the Landing Site Local (LSL) reference frame. When the first panoramas composed of surface images from Navcam and Pancam were available, it was possible to triangulate crater and mountain peaks appearing in the orbital images. The navigation goal for the MER rovers was to bound the position estimation error to no more than 10% during a 100 m drive. During each rover movement the rover pose was updated at 8 Hz. The movement estimation was given by the combination of the IMU attitude signal with Wheel Odometry. This navigation technique was efficient for a great part of the terrains, but not on slippery surfaces or slopes. Furthermore, Maimone et al. (2007) highlight the Visual Odometry process and performance used on-board the MER rovers. This technique has mainly been used to correct the erroneous Wheel Odometry-based estimation when wheels lose traction on large rocks and steep slopes. Visual Odometry on the MER rovers computed an update of the 6-DoF rover pose by tracking the position of terrain features between two stereo images, both in the 2D image space and in 3D space. The final rover motion estimation was obtained with a maximum likelihood estimator applied to the computed 3D offsets.

The relative localisation was conceived as an “extra credit” capability and evolved into a critical vehicle safety system.

VO has demonstrated good performance, with high rates of successful convergence (97% on Spirit, 95% on Opportunity), and has measured movements as small as 2 mm. An overview of the MER rover autonomous capabilities is given in Biesiadecki et al. (2007).

During the first two years of operations, Spirit's longest commanded drive was 124 m: for 62 m it used the direct driving mode and for the other 62 m it used Terrain Assessment and Local Path Selection. Opportunity's longest drive travelled 390 m, divided into 106 m of Local Path Selection without any vision processing and 284 m using a combination of Local Path Selection with Terrain Assessment.

1.3.1 ExoMars Mission 2020

The ExoMars mission is a joint collaboration between ESA and Roscosmos (the Russian Space Agency). It consists of the Trace Gas Orbiter plus an Entry, Descent and Landing demonstrator module, Schiaparelli, launched in March 2016, and of a second mission, with launch date in 2020, that will carry a rover. The ExoMars rover has been designed to search for signs of life. It will collect samples with a drill and analyse them with next-generation instruments. ExoMars will be the first mission to study Mars at depths of up to two meters. The ExoMars Trace Gas Orbiter, part of the 2016 ExoMars mission, will support communications. The Rover Operations Control Centre (ROCC), located in ALTEC, Turin, Italy, will monitor and control the ExoMars rover operations. Commands to the rover will be transmitted through the orbiter and the ESA space communications network operated at ESA's European Space Operations Centre (ESOC).

The global localization studies have been developed in collaboration with ALTEC and Thales Alenia Space for the ExoMars 2020 rover.

Within the Guidance, Navigation and Control (GNC) system of these robots, the current level of autonomy for the European Space Agency (ESA) rovers is E3, both for the ExoMars 2018 mission (McManamon et al., 2013) and for the ExoMars 2020 mission (Bora et al., 2016).


To overcome the problems cited above in computing the pose, modern Mars rovers (including ExoMars 2020) implement localisation algorithms based on visual simultaneous localisation and mapping (vSLAM) techniques, using as priors the motion estimates computed by Visual Odometry (VO). This philosophy, by analogy with Wheel Odometry (WO), refers to the process of determining the pose of a camera system by successively analyzing the images acquired over time. These methods typically do not rely on a single camera, but on several ones, arranged in a particular configuration, which form the so-called camera vision systems. Since the first missions on Mars, the most common approaches have been based on a stereo-vision camera rig, therefore exploiting a pair of cameras displaced horizontally from one another to obtain two differing views of a scene, in a manner similar to human binocular vision.

1.4 Thesis Outline

As we have seen in this introduction, global positioning is one of the most important tasks in planetary exploration. Wheel Odometry has a wide uncertainty due to the slippage of wheels on sandy surfaces, inertial measurements have well-known drift problems, and GPS-like positioning systems are not available on extraterrestrial planets.

During the last 50 years of missions, the introduction and development of camera systems have allowed for more and more autonomy of the rovers employed.

Therefore, the focus of this thesis is the development of a novel SLAM system based on an omni-camera. This different vision system can be more robust with respect to occlusions during the traverse, while allowing a dense reconstruction of the entire surroundings of the rover for a safer local path planning. This master thesis is organized as follows: in Chapter 2 we will explain some basics of Computer Vision on which the entire work is built; in Chapter 3 we will review the state of the art of visual SLAM and of all the sub-processes inside this complex problem. Then, in Chapter 4, we will explain in more detail the framework chosen and adapted to carry out the relative localisation task. Chapter 5 will describe the robot used for the experiments, the software architecture in which the vSLAM algorithm is introduced, and the offline and online experiments performed to evaluate the performance.


Chapter 2

Background

In this chapter we are going to briefly explain some Computer Vision basics which are the foundations of the thesis. We are going to explain the concepts of camera model and calibration, as well as feature detection inside an image. Furthermore, we are going to explain the idea behind Visual Odometry. In addition, we will explain the SLAM problem and how to deal with it. Finally, we will introduce the basics of 3D reconstruction.

2.1 Camera Model and Calibration

Accurate calibration of a vision system is necessary for any Computer Vision task requiring the extraction of metric information of the environment from 2D images, as in ego-motion estimation and structure from motion.

The pinhole camera model is fundamental for most geometric vision problems. The model is developed for CCD-like sensors and describes the central projection of 3D points through the centre of projection onto the image plane. Let the pinhole camera centre be the origin of a reference frame o(x, y, z); the z-axis coincides with the camera pointing direction, and the image plane, or focal plane, is the plane z = f (see Figure 2.1 for clarification). The line from the camera centre perpendicular to the image plane is called the principal axis or principal ray of the camera, and the point where the principal axis intersects the image plane is called the principal point. A point p = (X, Y, Z) in 3D space is mapped to the point (u, v) on the image plane where the line joining p to the camera centre intersects the image plane.


Figure 2.1: Pinhole camera model, where o is the camera centre and p′ the projection on the image plane of the point p. Figure adapted from Sturm (2014).

Considering the principal point offset $(c_x, c_y)$, the projection equation is:

$$
\begin{bmatrix} u \\ v \end{bmatrix} =
\begin{bmatrix} f_x \frac{X}{Z} + c_x \\ f_y \frac{Y}{Z} + c_y \end{bmatrix}
\qquad (2.1)
$$

where $f_x$ and $f_y$ are the camera's focal lengths in the x and y directions. See Figure 2.1 for a graphical explanation of this process.

The equation above can be written in terms of matrix multiplication as:

$$
s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} =
\begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} X \\ Y \\ Z \end{bmatrix} =
K \begin{bmatrix} X \\ Y \\ Z \end{bmatrix}
\qquad (2.2)
$$

where $s \in \mathbb{R}^+$ is an arbitrary positive scalar associated with the depth of the point, and K is the matrix of intrinsic parameters. During the projection the scale information is lost: all the points that lie on the line joining the point p to the camera centre project to the same image plane point (u, v).
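As an illustrative sketch of Equation (2.2) (not part of the thesis software; the intrinsic values below are arbitrary placeholders), the projection can be written in a few lines of Python:

```python
import numpy as np

def project_pinhole(K, points_3d):
    """Project Nx3 camera-frame points onto the image plane with intrinsics K (Eq. 2.2)."""
    points_3d = np.asarray(points_3d, dtype=float)
    uvw = (K @ points_3d.T).T          # s * [u, v, 1] for each point
    return uvw[:, :2] / uvw[:, 2:3]    # divide by the depth-related scale s

# Example intrinsics (placeholder values): fx, fy, cx, cy
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
print(project_pinhole(K, [[0.1, -0.05, 2.0]]))   # -> pixel coordinates (u, v)
```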


The focal length and the principal point offset are the intrinsic parameters of the camera; they can be estimated through the camera calibration process. They do not depend on the scene viewed and, if the camera focal length is fixed, they can be reused for the whole image sequence.

The camera motion around the scene is described by the joint rotation-translation matrix [R | t], also called the camera extrinsic parameters (Kannala & Brandt, 2006). The joint rotation-translation matrix can equivalently express the rotation and translation of an object around a fixed camera. Considering the extrinsic camera parameters, the full projection function is obtained:

$$
s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} =
\begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} r_{11} & r_{12} & r_{13} & t_1 \\ r_{21} & r_{22} & r_{23} & t_2 \\ r_{31} & r_{32} & r_{33} & t_3 \end{bmatrix}
\begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}
\qquad (2.3)
$$

Lens Distortion

Real lenses present distortions that have to be considered in the projection model. This phenomenon arises mainly from the lens shape (radial distortion) and from the CCD alignment with respect to the optical axis (tangential distortion). The most common model used for the distortion is the polynomial model (Ronda, 2019). Limited to the 3rd order and including the tangential distortion, it is given by:

$$
\begin{aligned}
x_d &= x_u \left(1 + k_1 r^2 + k_2 r^4 + k_3 r^6\right) + 2 p_1 x_u y_u + p_2 \left(r^2 + 2 x_u^2\right) \\
y_d &= y_u \left(1 + k_1 r^2 + k_2 r^4 + k_3 r^6\right) + 2 p_2 x_u y_u + p_1 \left(r^2 + 2 y_u^2\right)
\end{aligned}
\qquad (2.4)
$$

where $r^2 = x_u^2 + y_u^2$, $x_u = p_x / p_z$ and $y_u = p_y / p_z$. The coefficients $k_1$, $k_2$, $k_3$ are the radial distortion coefficients, while $p_1$ and $p_2$ are the tangential distortion coefficients.
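A small illustrative Python sketch of Equation (2.4) follows; the coefficient values are arbitrary placeholders, not the Ladybug calibration:

```python
import numpy as np

def distort(x_u, y_u, k=(0.12, -0.03, 0.001), p=(0.0005, -0.0002)):
    """Apply the 3rd-order radial + tangential distortion model of Eq. (2.4)."""
    k1, k2, k3 = k
    p1, p2 = p
    r2 = x_u**2 + y_u**2
    radial = 1.0 + k1 * r2 + k2 * r2**2 + k3 * r2**3
    x_d = x_u * radial + 2.0 * p1 * x_u * y_u + p2 * (r2 + 2.0 * x_u**2)
    y_d = y_u * radial + 2.0 * p2 * x_u * y_u + p1 * (r2 + 2.0 * y_u**2)
    return x_d, y_d

# Normalized coordinates x_u = px/pz, y_u = py/pz of a sample point
print(distort(0.05, -0.02))
```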

In the literature it is possible to find other distortion models, like the model presented in Devernay & Faugeras (2001), suited for fish-eye lenses and wide-FoV cameras.

Omnidirectional Model

Omni-directional cameras are very different, and in general a different model must be used. By omnidirectional camera is usually intended a vision system providing a 360° panoramic view of the scene. Such an enhanced field of view can be achieved either by using catadioptric systems, obtained by suitably combining mirrors and conventional cameras, or by employing purely dioptric fish-eye lenses.


Figure 2.2 shows an example of an omni-directional image taken by a KT&C 18X camera. Dealing with this hardware is peculiar and requires a whole new geometry. A very good reference covering all the omnidirectional models is Scaramuzza (2008).

Figure 2.2: Omni-directional images example.

Another type of omnidirectional image can be obtained from a so-called omnidirectional multi-camera system, which is based on a set of cameras placed in well-defined positions. The omnidirectional image can be reconstructed by simply placing all the single images side by side or, in case there is an intersection between the different FoVs, an overall camera model must be implemented to fuse the redundant information and enhance the precision.

In the work presented in Wang et al. (2011), the authors proposed a rectification algorithm based on a spherical camera model for rectifying omnidirectional stereo pairs. This method can be widely used with different kinds of camera systems, as long as they have a single viewpoint and a spherical camera model can be applied. Moreover, it shows promising results also for the 3D reconstruction problem.

2.2 Feature Detection

In Structure from Motion, Visual Odometry and feature-based SLAM, the first step is the identification of salient image regions; this Computer Vision process is called feature detection. A descriptor is associated with each feature. The features are then matched with each other based on some similarity metric between their descriptors.


Table 2.1: Comparison of feature detectors: properties and performance.

| Method     | Corner Detection | Blob Detection | Rotation Invariant | Scale Invariant | Affine Invariant | Repeatability | Localisation Accuracy | Robustness | Efficiency |
|------------|------------------|----------------|--------------------|-----------------|------------------|---------------|-----------------------|------------|------------|
| Harris     | ×                |                | ×                  |                 |                  | +++           | +++                   | ++         | ++         |
| Shi-Tomasi | ×                |                | ×                  |                 |                  | +++           | +++                   | ++         | ++         |
| FAST       | ×                |                | ×                  | ×               |                  | ++            | ++                    | ++         | ++++       |
| SIFT       |                  | ×              | ×                  | ×               | ×                | +++           | ++                    | +++        | +          |
| SURF       |                  | ×              | ×                  | ×               | ×                | +++           | ++                    | ++         | ++         |
| CENSURE    |                  | ×              | ×                  | ×               | ×                | +++           | ++                    | +++        | +++        |

Corner detectors and blob detectors allow a precise measurement of the salient regions in image coordinates. The first type of feature, the corner, corresponds to image regions that have a high intensity gradient in two orthogonal directions, generally the intersection of one or more edges. The second type of feature, the blob, corresponds to image spots that have an intensity different from the neighbouring regions.

The attractive properties for a feature detector are listed in Table 2.1, where the data are taken from Fraundorfer & Scaramuzza (2012). A feature should be re-detected in subsequent images that look at the same scene; this property is called repeatability. Robustness is the property of the detector of not being sensitive to image noise. Furthermore, a feature should still be detected after photometric and geometric image changes, like rotations, scale changes and affine transformations.

In the choice of a feature detector, localization accuracy, robustness and invariance properties have to be weighed against computational efficiency. For instance, if the purpose is to reconstruct a detailed three-dimensional topographic map from sparse images, the feature detector should be accurate and invariant to geometric changes. If instead the constraint is the computational time, the focus must be on efficiency at the expense of the other performances. The latter is the case of many robotics applications, where the lower performance of the detector is compensated with smarter feature searching techniques, like reducing the search window by using camera motion priors.


Among the corner detectors there are Harris (Harris & Stephens, 1988), Shi-Tomasi (Shi & Tomasi, 1994) and Features from Accelerated Segment Test (FAST) (Rosten & Drummond, 2006). Among the blob detectors there are several methods, such as Scale Invariant Feature Transform (SIFT) (Lowe, 2004), Speeded Up Robust Feature (SURF) (Bay et al., 2006), CENter SUrround Extremas (CENSURE) (Agrawal et al., 2008) and many others.

After feature extraction, the other important task in every Computer Vision algorithm is to establish the correspondence between different image features. The information about the feature surroundings is embedded in a floating-point or binary vector. One of the methods used to find the feature correspondences between two images consists in the calculation of the vectorial distance between all the image descriptors: the smallest ℓ2-norm corresponds to a feature match. In the binary case the Hamming distance is the most common metric. Two descriptors that provide the properties of scale and rotation invariance are the SIFT and SURF descriptors.

One of the most computationally efficient descriptors is Binary Robust Independent Elementary Features (BRIEF) (Calonder et al., 2010), which uses a binary string to represent the keypoint surroundings. Binary Robust Invariant Scalable Keypoints (BRISK) (Leutenegger et al., 2011) combines the efficiency of a binary descriptor with rotation-invariance properties.

Another feature detector which has gained much popularity is the Oriented FAST and Rotated BRIEF (ORB) (Rublee et al., 2011) feature description and matching method. This method combines the good qualities of the FAST detector (efficiency) and of the BRIEF binary descriptor. In Rublee et al. (2011) the researchers demonstrated its performance and efficiency relative to other popular features (FAST, SIFT, BRIEF). Moreover, ORB features allow real-time performance without GPUs, while providing good invariance to changes in viewpoint and illumination.
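As a brief illustration of this detect-describe-match pipeline (a generic OpenCV sketch, not the thesis implementation; the image filenames are placeholders), ORB features can be extracted and matched with a Hamming-distance brute-force matcher:

```python
import cv2

# Load two consecutive grayscale frames (placeholder file names)
img1 = cv2.imread("frame_k_minus_1.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("frame_k.png", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(nfeatures=2000)          # ORB = oriented FAST + rotated BRIEF
kp1, des1 = orb.detectAndCompute(img1, None)  # keypoints + binary descriptors
kp2, des2 = orb.detectAndCompute(img2, None)

# Hamming distance is the natural metric for binary descriptors
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)
print(f"{len(matches)} putative correspondences")
```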

2.3 Visual Odometry

Figure 2.3 shows the Structure from Motion (SfM) problem: given a set of images taken at different positions and orientations, it is possible to reconstruct the three-dimensional environment and the relative camera positions. In SfM problems the images may be unordered and the three-dimensional environment is computed simultaneously. The fundamental building block of SfM is the computation of the relative pose between two viewpoints, from which it is possible to retrieve a first information about the three-dimensional scene and then to compute the absolute and relative poses of the other viewpoints. Considering an unordered sequence of images, a challenging part of SfM is to first identify the images that observe the same part of the scene.

Figure 2.3: The Structure from Motion (SfM) problem.

Visual Odometry can be regarded as a subset of the SfM problem, where the relative pose computation between two viewpoints (the SfM building block) is applied sequentially to a temporally ordered series of images: the incremental transformations between ordered camera frames are computed in order to reconstruct the camera trajectory. Visual Odometry has gained interest in robotic applications with the reduction of the processing time. Figure 2.4 shows a general scheme for a VO system.

Within these systems it is possible to identify three main VO families: monocular-based, stereo-based and multi-camera.

Stereo VO: Most of the stereo VO methods have in common that for each stereo pair the point cloud of the triangulated features is calculated, and the motion is estimated with a 3D-to-3D point cloud registration. In this case the feature matching is performed both intra stereo-pair and inter stereo-pair; in other words, the same feature is matched in the four available images.

Figure 2.4: Generalized block diagram showing the main components of a VO system. There are three relative pose estimation methods: 2D-to-2D, 3D-to-3D, or 3D-to-2D. Adapted from Durrant-Whyte & Bailey (2006).

But stereo VO methods are not limited to a 3D-to-3D estimation. Nister et al. (2004) estimated the relative motion using a 3D-to-2D camera-pose estimation problem. Moreover, they incorporated RANdom SAmple Consensus (RANSAC) outlier rejection into the motion estimation step. In Comport et al. (2007) the motion estimation relies on 2D-to-2D image matches; this method avoids the triangulation of the 3D points.

Monocular VO: In monocular systems the absolute scale is unknown for both the three-dimensional scene and the trajectory. Usually the distance between the first two images is set to unity, and as a consequence everything is scaled with respect to this unit distance. As suggested by Mur-Artal et al. (2015), if the scene is planar-dominant, the relative transformation between the two images can be described by a homography $p_c = H_{cr} p_r$, and they suggest to use the normalized DLT algorithm (Hartley & Zisserman, 2004) to retrieve the homography matrix $H_{cr}$. On the other hand, a non-planar scene with enough parallax can be described by the fundamental matrix, $p_c^\top F_{cr} p_r = 0$, which can be retrieved thanks to the 8-point algorithm (Hartley & Zisserman, 2004). As a new image arrives, its pose relative to the first two frames can be computed by exploiting the knowledge of the three-dimensional structure, in which case it is necessary first to triangulate the scene points, or by using the trifocal tensor.


2.3.1 Motion Estimation

As already mentioned above, in VO there are three categories of camera absolute and relative pose estimation: 3D-to-3D, 3D-to-2D and 2D-to-2D. In the monocular case, the set of images taken at discrete times k is denoted as $I_{0:n} = \{I_0, \ldots, I_n\}$. Analysing the sequence of a stereo camera, there are the left and the right images at each instant k:

$$
I_{l,0:n} = \{I_{l,0}, \ldots, I_{l,n}\} \quad \text{and} \quad I_{r,0:n} = \{I_{r,0}, \ldots, I_{r,n}\}
\qquad (2.5)
$$

The rigid body transformation in $\mathbb{R}^3$ between two cameras at two different instants k and k − 1 is given by a 4 × 4 matrix of the following form:

$$
T_{k,k-1} =
\begin{bmatrix}
R_{k,k-1} & t_{k,k-1} \\
\mathbf{0} & 1
\end{bmatrix}
\qquad (2.6)
$$

where $R_{k,k-1} \in SO(3)$ is the rotation matrix and $t_{k,k-1}$ the translation vector, $SO(3)$ being the Lie group of rotation matrices and $T_{k,k-1} \in SE(3)$ an element of the Special Euclidean group. The vector of poses $T_{1:n} = \{T_{1,0}, \ldots, T_{n,n-1}\}$ contains all the step poses. If the objective is to retrieve the camera pose relative to the initial frame, the solution is to concatenate all the poses, for instance $T_{n,0} = \prod_{k=1}^{n} T_{k,k-1}$. The whole trajectory is contained in the vector $C_{1:n} = \{T_{1,0}, T_{2,0}, \ldots, T_{n,0}\}$.

3D-to-3D: In this case, both $f_{k-1}$ and $f_k$ are specified in 3D. To triangulate the 3D points, a stereo camera system is necessary. The motion estimation is obtained by the minimization of the ℓ2 distance between the two feature sets, both specified as 3D positions:

$$
\arg\min_{T_{k,k-1}} \sum_i \left\| \tilde{p}^{\,i}_k - T_{k,k-1}\, \tilde{p}^{\,i}_{k-1} \right\|
\qquad (2.7)
$$

where $\tilde{p}^{\,i}_{k-1}$ are the homogeneous coordinates $\tilde{p} = (X, Y, Z, 1)^\top$ of the 3D point corresponding to the feature i. A solution to this minimization problem is given by Arun's method for aligning point clouds (Arun et al., 1987). The minimal-case solution involves 3 non-collinear correspondences.
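A compact sketch of this 3D-to-3D alignment, in the spirit of Arun's SVD-based method (an illustrative example with synthetic data, not the thesis code):

```python
import numpy as np

def align_point_clouds(P_prev, P_curr):
    """Estimate R, t such that P_curr ~ R @ P_prev + t via SVD (Arun et al., 1987).
    P_prev, P_curr: Nx3 arrays of matched 3D inlier points."""
    c_prev, c_curr = P_prev.mean(axis=0), P_curr.mean(axis=0)
    H = (P_prev - c_prev).T @ (P_curr - c_curr)        # 3x3 correlation matrix
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # avoid reflections
    R = Vt.T @ D @ U.T
    t = c_curr - R @ c_prev
    return R, t

# Synthetic check: rotate/translate a small cloud and recover the motion
rng = np.random.default_rng(0)
P_prev = rng.uniform(-1.0, 1.0, size=(50, 3))
R_true = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
P_curr = P_prev @ R_true.T + np.array([0.5, 0.1, -0.2])
R_est, t_est = align_point_clouds(P_prev, P_curr)
```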

3D-to-2D: The pose is estimated by minimizing the re-projection error of the triangulated 3D points (Nister, 2004):

$$
\arg\min_{T_{k,k-1}} \sum_i \left\| u^i_k - \pi\!\left(p^i_{k-1}, T_{k,k-1}\right) \right\|^2
\qquad (2.8)
$$

where $\pi\!\left(p^i_{k-1}, T_{k,k-1}\right)$ is the re-projection of the 3D point $p^i_{k-1}$ into the image $I_k$ through the transformation $T_{k,k-1}$. The point $p^i_{k-1}$ can be estimated from stereo data or, in the monocular case, from triangulation of the image measurements $u^i_{k-1}$ and $u^i_{k-2}$. The minimal-case solution involves 3-point correspondences. One of the most used solutions is the PnP algorithm proposed by Lowe (2004), a non-iterative solution whose computational complexity grows linearly with $n \geq 4$. Widely used is also the Perspective-Three-Point (P3P) solution of Kneip et al. (2011), which determines the position and orientation of the camera from three point correspondences within a RANSAC scheme for robust motion estimation in presence of outliers: a lower number of points for the minimal case corresponds to a lower number of RANSAC iterations and a more efficient implementation of the VO system.
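A hedged OpenCV sketch of such a 3D-to-2D estimation with RANSAC (illustrative only; `points_3d`, `points_2d` and `K` are assumed to come from the previous triangulation and calibration steps):

```python
import cv2
import numpy as np

def estimate_pose_3d_to_2d(points_3d, points_2d, K):
    """3D-to-2D motion estimation: landmarks at step k-1 vs. detections at step k."""
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        np.asarray(points_3d, dtype=np.float64),
        np.asarray(points_2d, dtype=np.float64),
        K, None,
        reprojectionError=2.0,          # tolerance in pixels
        flags=cv2.SOLVEPNP_ITERATIVE)
    if not ok:
        raise RuntimeError("PnP failed")
    R, _ = cv2.Rodrigues(rvec)          # axis-angle -> rotation matrix
    return R, tvec.ravel(), inliers
```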

2D-to-2D: Both the features $f_k$ and $f_{k-1}$ are expressed in the 2D image space. The rotation and translation estimation between two subsequent frames passes through the estimation of the essential matrix E. The minimal-case solution involves five correspondences; an efficient implementation is proposed by Nister (2004).
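For completeness, a minimal OpenCV sketch of the 2D-to-2D case (assumptions: `pts1`, `pts2` are matched pixel coordinates and `K` the intrinsic matrix; the recovered translation is defined only up to scale):

```python
import cv2
import numpy as np

def estimate_pose_2d_to_2d(pts1, pts2, K):
    """Relative rotation and (unit-scale) translation from matched 2D points."""
    pts1 = np.asarray(pts1, dtype=np.float64)
    pts2 = np.asarray(pts2, dtype=np.float64)
    E, mask = cv2.findEssentialMat(pts1, pts2, K,
                                   method=cv2.RANSAC, prob=0.999, threshold=1.0)
    _, R, t, mask = cv2.recoverPose(E, pts1, pts2, K, mask=mask)
    return R, t.ravel(), mask
```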

2.3.2 RANSAC

As shown in the general scheme of Figure 2.4, the motion estimation process is embedded in a RANdom SAmple Consensus (RANSAC) scheme. In the Computer Vision literature, RANSAC has become a standard for fitting a model to experimental data in the presence of a large fraction of outliers. Indeed, the probability of an erroneous feature correspondence between two images is high. RANSAC was introduced by Fischler & Bolles (2010) in order to determine the absolute pose of a camera given an image depicting a set of landmarks with known locations. The different steps involved in a RANSAC scheme may be summarized as follows:

1. Randomly select the minimum set of s points required to determine the model parameters. For instance, for the PnP algorithm of the 3D-to-2D case the minimal set is four points;

2. Compute a motion hypothesis $T_{k,k-1}$ from the selected points;

3. Project the whole set of n points using the motion hypothesis $T_{k,k-1}$ and compute the error for each point, $e^i_{k,k-1} = u^i_k - \pi\!\left(p^i_{k-1}, T_{k,k-1}\right)$. Verify the motion hypothesis by determining how many points of the whole set fit within a predefined tolerance, $\|e^i_{k,k-1}\| < \varepsilon$. For a 3D-to-3D method the 3D error for each point has to be computed;

4. If the fraction of inliers over the total number of points in the set exceeds a predefined threshold τ, re-estimate the model parameters using all the identified inliers and terminate;

5. Otherwise, repeat all the steps for a maximum of N times.

Fischler & Bolles (2010) also give the minimum number of iterations N that is necessary to find the correct solution:

$$
N = \frac{\log(1 - p)}{\log\left(1 - (1 - \varepsilon)^s\right)}
\qquad (2.9)
$$

where p is the required probability of success, s is the minimum set of points required by the model and ε is an estimate of the outlier percentage in the point set. It is straightforward to see that the greater s is, the greater the number of iterations N required for the same values of p and ε.
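The following Python sketch ties Equation (2.9) to the loop described above; `estimate_model`, `residual` and the parameter values are placeholders for whichever motion model (3D-to-3D or 3D-to-2D) is being fitted:

```python
import math
import random

def ransac_iterations(p=0.99, eps=0.5, s=4):
    """Minimum number of iterations N from Eq. (2.9)."""
    return math.ceil(math.log(1.0 - p) / math.log(1.0 - (1.0 - eps) ** s))

def ransac(data, estimate_model, residual, s=4, tol=2.0, inlier_ratio=0.6, p=0.99, eps=0.5):
    """Generic RANSAC loop: hypothesize from s samples, score by residual < tol."""
    best_model, best_inliers = None, []
    for _ in range(ransac_iterations(p, eps, s)):
        sample = random.sample(data, s)                          # step 1: minimal sample
        model = estimate_model(sample)                           # step 2: motion hypothesis
        inliers = [d for d in data if residual(model, d) < tol]  # step 3: consensus
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = model, inliers
        if len(inliers) >= inlier_ratio * len(data):             # step 4: early termination
            break
    if best_inliers:
        best_model = estimate_model(best_inliers)                # re-estimate on all inliers
    return best_model, best_inliers
```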

2.3.3 Local Optimization

The final step of a VO algorithm consists in optimizing the first estimate of $T_{k,k-1}$, obtained through the RANSAC procedure, using all the n inliers. Many computer vision problems are non-linear; in VO the pose refinement can be performed through the Levenberg-Marquardt algorithm.

The most common approach involves the refinement of the pose through the minimization of the ℓ2 distance between the triangulated 3D points (3D-to-3D implementation). Equation (2.10) shows the non-linear cost function $E_{nl}$. Each component of the error vector $e^i$ of feature i is weighted taking into account the landmark uncertainty. The uncertainties of the 3D points are represented by 3 × 3 covariance matrices, as calculated by the Kline-McClintock formulas:

$$
\begin{cases}
e^i = p^i_k - R_{k,k-1}\, p^i_{k-1} - t_{k,k-1} \\
\Omega^i = \Omega^i_k + R_{k,k-1}\, \Omega^i_{k-1}\, R_{k,k-1}^\top \\
E_{nl} = \sum_{i=1}^{n} {e^i}^\top \left(\Omega^i\right)^{-1} e^i
\end{cases}
\qquad (2.10)
$$

where $p^i_k$ are the three-dimensional coordinates of the landmark i, expressed relative to the camera at step k, and $p^i_{k-1}$ is the same 3D point expressed relative to the stereo camera at the previous location k − 1. The term $t_{k,k-1}$ is the camera system translation expressed in the frame of reference k − 1. Finally, $\Omega^i_k$ and $\Omega^i_{k-1}$ are the 3 × 3 covariance matrices of the same landmark in the two frames of reference.

In the monocular case it is possible to refine the camera pose by minimizing the re-projection error $e^i$:

$$
e^i = u^i_k - \pi\!\left(p^i_{k-1}, T_{k,k-1}\right)
\qquad (2.11)
$$

where $\pi(\cdot)$ is the projection function. The cost function to be minimized is:

$$
E_{nl} = \sum_{i=1}^{n} {e^i}^\top \left(\Omega^i\right)^{-1} e^i
\qquad (2.12)
$$

where $\Omega^i$ is the covariance matrix associated with the uncertainty of the feature in the 2D image.

In Visual Odometry the whole trajectory is given by the concatenation of all the step poses. Each estimate of the relative pose transformation $T_{k,k-1}$ has an uncertainty, and the uncertainty of the pose estimate of the last frame in the initial reference frame is given by the propagation of all the step uncertainties. As shown by Fraundorfer & Scaramuzza (2012), the camera-pose uncertainty always increases when concatenating transformations (see Figure 2.5). It follows that it is important to keep the step uncertainty as small as possible to reduce the drift.

Figure 2.5: The uncertainty of the pose estimation drifts over time. The final uncertainty is a combination of the step uncertainty $C_{k,k-1}$ and the actual uncertainty $C_{k-1,R}$.

2.3.4 Bundle Adjustment

A map point can be observed by multiple frames. Thus it is possible to set up a large optimization problem which attempts to optimize all the camera poses and all the 3D point positions; this task is called Global Bundle Adjustment. If the objective is to optimize only the poses of the frames that observe the 3D points seen in the last frame, then the task is called Local Bundle Adjustment. In the case that it is necessary to optimize m camera poses and n 3D points, the optimization step involves the computation of a 2mn × (6m + 3n) Jacobian matrix, considering the 6 DoF (t_x, t_y, t_z, \phi, \theta, \psi) of each camera pose and the 3 DoF (X, Y, Z) of each 3D point. The computational cost becomes very expensive as the number of frames and points grows. Moreover, due to the non-linear nature of the problem the solution can easily converge to a local minimum, so good initial conditions are needed. Bundle Adjustment is used to refine the initial camera and structure parameters. Assume that the n 3D points are seen in m views.

Figure 2.5: The uncertainty of the pose estimation drifts over time. The final uncertainty is a combination of the step uncertainty C_{k,k-1} and the accumulated uncertainty C_{k-1}.

Indicating with u_{i,j} the projection of the point i on the image j, each camera j is parametrized by a vector a_j and each 3D point i by a vector b_i, while v_{i,j} = 1 if the point i is visible in view j:

\operatorname*{arg\,min}_{a_j,\, b_i} \sum_{i=1}^{n} \sum_{j=1}^{m} v_{i,j} \left\| u_{i,j} - \pi(a_j, b_i) \right\|^{2} \qquad (2.13)

where \pi(a_j, b_i) is the projection of the point i onto the image j.

One possible parametrization for the camera pose a_j is the 6-dimensional vector (\omega, v), where \omega = (\omega_1, \omega_2, \omega_3) \in \mathbb{R}^3 is the axis-angle representation of the rotation and v is the rotated version of the translation, through the SE(3) exponential map:

\exp_{SE(3)}(\omega, v) = \begin{bmatrix} \exp_{SO(3)}(\omega) & V v \\ 0 & 1 \end{bmatrix} = \begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix} \qquad (2.14)

where

\exp_{SO(3)}(\omega) = I + \frac{\sin\theta}{\theta}\, (\omega)_{\times} + \frac{1 - \cos\theta}{\theta^{2}}\, (\omega)_{\times}^{2} \qquad (2.15)

V = I + \frac{1 - \cos\theta}{\theta^{2}}\, (\omega)_{\times} + \frac{\theta - \sin\theta}{\theta^{3}}\, (\omega)_{\times}^{2} \qquad (2.16)

\theta = \|\omega\|_{2} \qquad (2.17)


This large optimization problem can be solved using the Newton-Raphson or Levenberg-Marquardt (LM) optimization methods.

Let h be the measurement function which maps a parameter vector p to an estimated measurement vector \hat{x} = h(p). An initial parameter vector p_0 and a measured vector x are provided, and it is desired to find the vector p^{+} that best satisfies the function h locally, that is, that minimizes the squared distance \varepsilon^{\top}\varepsilon with \varepsilon = x - \hat{x} the residual error. The basis of the LM algorithm is an affine approximation to h in the neighbourhood of p. For a small \|\delta_p\|, h is approximated by:

h(p + \delta_p) \approx h(p) + J \delta_p \qquad (2.18)

where J is the Jacobian of h.

Like all non-linear optimization methods, LM is iterative: initiated at the starting point p_0, the parameter vector p is updated with the rule:

p_{l+1} = p_l + \delta_p \qquad (2.19)

which converges toward a local minimizer p^{+} of h. Hence, at each iteration it is required to find the step \delta_p that minimizes the quantity:

\|x - h(p + \delta_p)\| \approx \|x - h(p) - J \delta_p\| = \|\varepsilon - J \delta_p\| \qquad (2.20)

The sought \delta_p is thus the solution to a linear least-squares problem: the minimum is attained when J\delta_p - \varepsilon is orthogonal to the column space of J. This leads to J^{\top}(J\delta_p - \varepsilon) = 0, which yields \delta_p as the solution of the so-called normal equations:

J^{\top} J \delta_p = J^{\top} \varepsilon \qquad (2.21)

where J^{\top}J is the first-order approximation to the Hessian of \frac{1}{2}\varepsilon^{\top}\varepsilon, \delta_p is the Gauss-Newton step and J = \frac{dh}{dp}. The LM method actually solves a slight variation of Equation (2.21):

N \delta_p = J^{\top} \varepsilon \qquad (2.22)

where N = J^{\top}J + \mu I, J^{\top}J is also called the (approximate) Hessian matrix and \mu > 0 is the damping term.

If the uncertainty of the measurements is considered, which in a 3D-to-2D framework corresponds to the uncertainty of the 2D feature positions, Equation (2.21) becomes:

\left(J^{\top} \Omega_{ij}^{-1} J + \mu I\right) \delta_p = J^{\top} \Omega_{ij}^{-1} \varepsilon \qquad (2.23)

where \Omega_{ij} is the 2 × 2 covariance matrix associated with the uncertainty of the features.


The iterative scheme to solve a minimization problem with the Levenberg-Marquardt algorithm can be summarized this way:

1. Given the initial guess/current state estimate p_l, calculate for each measurement the residual \varepsilon_{ij,l} = u_{ij} - h(p_l) and the Jacobian matrices J_{ij,l} = \frac{\partial \pi(a_j, b_i)}{\partial a_j}\big|_{l} or J_{ij,l} = \frac{\partial \pi(a_j, b_i)}{\partial b_i}\big|_{l} from the Bundle Adjustment model;

2. Build the linear system (2.23) with the right-hand side \sum_{i,j} J_{ij,l}^{\top} \Omega_{ij}^{-1} \varepsilon_{ij,l} and H = \sum_{i,j} J_{ij,l}^{\top} \Omega_{ij}^{-1} J_{ij,l}. Note that the uncertainty of the measurements \Omega_{ij} does not depend on the iteration l;

3. Solve the linear system \left(J^{\top} \Omega_{ij}^{-1} J + \mu I\right) \delta_p = J^{\top} \Omega_{ij}^{-1} \varepsilon for the increment \delta_p, update the state p_{l+1} = p_l + \delta_p and iterate.

In Levenberg-Marquardt iterations the computation of H^{-1} is computationally expensive. Different calculation strategies exist for the inversion of the matrix H. As shown by Lourakis & Argyros (2009), this matrix is sparse due to the lack of interaction among certain subgroups of parameters, and this sparsity can be exploited to achieve considerable computational savings.
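The scheme can be condensed into the following dense sketch, which deliberately ignores the sparsity structure just mentioned and assumes that the measurement covariances \Omega_{ij} have already been folded into (whitened) residuals and Jacobians; `residual` and `jacobian` are generic callables provided by the caller.

```python
import numpy as np

def levenberg_marquardt(residual, jacobian, p0, n_iter=20, mu=1e-3):
    """Minimal dense LM loop: residual(p) returns eps = x - h(p) (whitened),
    jacobian(p) returns J = dh/dp; the sparsity of H is not exploited."""
    p = np.asarray(p0, dtype=float)
    for _ in range(n_iter):
        eps, J = residual(p), jacobian(p)
        H, g = J.T @ J, J.T @ eps                             # approximate Hessian and gradient
        dp = np.linalg.solve(H + mu * np.eye(H.shape[0]), g)  # (H + mu*I) dp = J^T eps, Eq. (2.22)
        if np.sum(residual(p + dp) ** 2) < np.sum(eps ** 2):
            p, mu = p + dp, mu * 0.5                          # accept the step, relax the damping
        else:
            mu *= 10.0                                        # reject the step, increase the damping
    return p
```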

2.4 SLAM Models

The solution of the simultaneous localization and mapping problem is seen by the robotics community as the means to make a robot truly autonomous. When the robot is placed in an unknown environment, thanks to its sensors it is able to reconstruct a consistent map of the environment and to localize itself with respect to this map. The tutorials of Durrant-Whyte & Bailey (2006) provide a brief history of the early development of the SLAM problem, its formulation and the most common solutions, together with issues in computation, convergence and data association.


Figure 2.6: A graphical model of the SLAM problem. x_i indicates the robot pose, l_j are the landmark positions directly observable by the robot, m_{i,j} are the landmark measurements and u_i the control vector. Through these quantities, the focus is on the estimation of the path of the robot and of the landmark map.

Cadena et al. (2016) reviewed the state-of-the-art SLAM approaches, the open challenges and the newest research issues in SLAM.

Figure 2.6 shows a graphical model of the SLAM problem: a robot moving through an environment takes relative measurements of the positions of a series of unknown landmarks.

The purpose is to estimate the robot path and the relative positions of the landmarks. The following quantities are considered in the problem:

• x_i is the robot state (also known as the robot pose); it represents the position and attitude at time instant i;

• X_{0:i} = \{x_0, \ldots, x_i\} is the history of robot states;

• u_i is the input vector, applied at time instant i − 1 to drive the robot from the state x_{i-1} to the state x_i;

• U_{0:i} = \{u_0, \ldots, u_i\} is the history of inputs;

• l_j is the j-th landmark, whose position is assumed to be time invariant;

• l = \{l_1, \ldots, l_N\} is the set of all the landmarks;

• m_{i,j} is the measurement of the landmark l_j taken by the robot at position x_i;

• M_{0:i} is the set of all the landmark measurements up to time i.

It is possible to reformulate the SLAM problem as the estimation of the posterior probability distribution over the robot's trajectory and the landmark set, given all the measurements, the control inputs and the robot initial state:

p\left(x_i, l \mid M_{0:i}, U_{0:i}, x_0\right) \qquad (2.24)

The probability distribution is computed for all times i.

To solve the SLAM problem a motion model and an observation model have to be considered.

The motion model relates the input u_i to the robot states x_{i-1} and x_i. Supposing that the motion model is subject to Gaussian noise, its description in terms of a probability distribution is:

p\left(x_i \mid x_{i-1}, u_i\right) \sim \mathcal{N}\left(g\left(x_{i-1}, u_i\right), R_i\right) \qquad (2.25)

It is thus defined by a normal distribution centred at g(x_{i-1}, u_i), where g(\cdot) is the kinematic model of the robot motion and R_i is a 3 × 3 covariance matrix.

The probability of making an observation m_{i,j} when the vehicle and landmark locations are known is generally described in the form:

p\left(m_{i,j} \mid x_i, l_j\right) \sim \mathcal{N}\left(h\left(x_i, l_j\right), Q\right) \qquad (2.26)

where h(x_i, l_j) is the measurement function and Q the covariance matrix describing the sensor uncertainty.
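As a small illustration of the two models, the sketch below samples from the motion model of Eq. (2.25) and evaluates the Gaussian measurement likelihood of Eq. (2.26); g, h, R and Q are the quantities defined above, passed in as generic callables and NumPy matrices.

```python
import numpy as np

def sample_motion(x_prev, u, g, R, rng=None):
    """Draw x_i ~ N(g(x_{i-1}, u_i), R_i) as in Eq. (2.25); g is the kinematic model."""
    rng = rng or np.random.default_rng()
    return rng.multivariate_normal(g(x_prev, u), R)

def measurement_likelihood(m, x, l, h, Q):
    """Evaluate p(m_{i,j} | x_i, l_j) = N(m; h(x_i, l_j), Q) as in Eq. (2.26)."""
    r = np.asarray(m, dtype=float) - h(x, l)   # innovation between measurement and prediction
    k = r.shape[0]
    return (np.exp(-0.5 * r @ np.linalg.inv(Q) @ r)
            / np.sqrt((2.0 * np.pi) ** k * np.linalg.det(Q)))
```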

Siciliano & Khatib (2016) identify three main mathematical frameworks developed to date: Extended Kalman Filter (EKF) SLAM, Particle Filter SLAM and Graph-Based SLAM.

2.4.1 Extended Kalman Filter SLAM

In Extended Kalman Filter SLAM the robot estimate is represented by a multivariate Gaussian:

p\left(x_i, l \mid M_{0:i}, U_{0:i}, x_0\right) \sim \mathcal{N}\left(\mu_i, \Sigma_i\right) \qquad (2.27)

where \mu_i is a vector containing the robot position and orientation and the environment landmark positions.

Considering a robot moving in a 2D environment, the dimension of \mu_i would be 3 + 2N: two variables are needed to define the robot position, one variable for the robot orientation and 2N variables for the N landmarks in the map. \Sigma_i is the robot state covariance matrix, representing the uncertainty in the \mu_i estimate; it is a (3 + 2N) × (3 + 2N) matrix. If the g(\cdot) and h(\cdot) functions were linear in their arguments standard Kalman filtering would be applicable; since they are not, g(\cdot) and h(\cdot) are linearised using a Taylor expansion.

The major issue related to EKF SLAM techniques is the quadratic growth of the covariance matrix. As the robot moves, new landmarks are added to the map and thus to the state vector, and the covariance matrix grows quadratically. This poses a great limitation even for medium-scale maps, because the processing time and memory consumption are O(N^2) in the size of the map.

2.4.2 Particle Methods SLAM

In particle methods the posterior probability of the robot position is represented by a set of particles. Particle methods were introduced in the SLAM literature by Montemerlo et al. (2002) with Fast-SLAM.

For instance, each time the robot moves Fast-SLAM maintains a set of K particles:

X_i^{[k]},\quad \mu_{i,1}^{[k]}, \ldots, \mu_{i,N}^{[k]},\quad \Sigma_{i,1}^{[k]}, \ldots, \Sigma_{i,N}^{[k]} \qquad (2.28)

where k denotes the index of the sample. Each particle contains the robot path estimate X_i^{[k]} and an estimate of each landmark position \mu_{i,j}^{[k]} with the relative covariance \Sigma_{i,j}^{[k]}, where j is the landmark index (1 \le j \le N).

The initialization is performed by setting each particle at the known robot position; at this stage the map is empty. When a new input is given to the robot, or a new odometry reading is received, new position variables are generated stochastically for each particle.

New particle positions are generated through the motion model:

x_i^{[k]} \sim p\left(x_i \mid x_{i-1}^{[k]}, u_i\right) \qquad (2.29)

For each new measurement m_{i,j} the filter computes the measurement probability:

w_i^{[k]} \sim \mathcal{N}\left(m_{i,j} \mid x_i^{[k]}, \mu_{i,j}^{[k]}, \Sigma_{i,j}^{[k]}\right) \qquad (2.30)

w_i^{[k]} is the importance weight; it measures how plausible the particle is in light of the new sensor measurement. The weights are normalized so that the sum of all particle weights equals 1.

Then the resampling step is performed: a set of new particles is created, and the probability of drawing a particle is based on its normalized importance weight. The landmark estimates \mu_{i,j}^{[k]} and \Sigma_{i,j}^{[k]} are updated for the new set of particles based on the measurement m_{i,j} using the Extended Kalman Filter rules. As the robot moves and time passes good particles survive, while bad estimates of the state are discarded.

One of the advantages of particle methods over EKF methods is that they break down the posterior over maps into low-dimensional Gaussians. Moreover, by using tree structures to represent the landmark estimates it is possible to improve the algorithm efficiency: the update can be performed in time logarithmic in the number of landmarks N and linear in the number of particles M, i.e. O(M log N).
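A minimal sketch of the particle structure and of the importance resampling step is reported below; the data layout and function names are illustrative assumptions, not the Fast-SLAM reference implementation.

```python
import copy
import numpy as np
from dataclasses import dataclass, field

@dataclass
class Particle:
    """One Fast-SLAM particle: a pose hypothesis plus independent per-landmark EKFs."""
    pose: np.ndarray                               # current robot pose x_i^[k]
    landmarks: dict = field(default_factory=dict)  # j -> (mu_ij, Sigma_ij)
    weight: float = 1.0

def resample(particles, rng=None):
    """Draw a new particle set with probability proportional to the importance weights."""
    rng = rng or np.random.default_rng()
    w = np.array([p.weight for p in particles], dtype=float)
    w /= w.sum()                                   # normalise so the weights sum to 1
    idx = rng.choice(len(particles), size=len(particles), p=w)
    return [copy.deepcopy(particles[i]) for i in idx]
```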

2.4.3 Graph-Based SLAM

The SLAM method used in this work, ORB-SLAM, is a Graph-Based method. In these methods the landmarks l_j and the robot states x_i are thought of as nodes in a graph. Two consecutive positions x_{i-1} and x_i are connected by an edge which represents the information obtained from the input kinematics or from the odometry reading u_i. Other edges are set up between the robot positions x_i and the j-th landmark. Figure 2.6 shows the nodes and edges of a SLAM graph. These edges are soft constraints in the graph. The best estimates for the map and the full path are retrieved by relaxing these constraints.

Often the graph is treated as a spring-mass model: finding a solution is equivalent to computing the state of minimal energy of the model.

The graph corresponds to the log-posterior of the full SLAM problem:

\log p\left(x_i, l \mid M_{0:i}, U_{0:i}\right) = \text{const} + \sum_{i} \log p\left(x_i \mid x_{i-1}, u_i\right) + \sum_{i,j} \log p\left(m_{i,j} \mid x_i, l_j\right) \qquad (2.31)

where p(x_i \mid x_{i-1}, u_i) are the motion constraints and p(m_{i,j} \mid x_i, l_j) are the landmark measurement constraints.

Assuming a kinematic model and sensors affected by Gaussian noise, the following quadratic expression can be derived:

\log p\left(x_i, l \mid M_{0:i}, U_{0:i}\right) = \text{const} + \sum_{i} \left(x_i - g\left(x_{i-1}, u_i\right)\right)^{\top} R_i^{-1} \left(x_i - g\left(x_{i-1}, u_i\right)\right) + \sum_{i,j} \left(m_{i,j} - h\left(x_i, l_j\right)\right)^{\top} Q_{i,j}^{-1} \left(m_{i,j} - h\left(x_i, l_j\right)\right) \qquad (2.32)
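The quadratic form of Eq. (2.32) can be evaluated with the following sketch, where the graph is represented simply as lists of motion and measurement edges; the container layout, and the fact that a single information matrix is shared by all edges of the same type, are simplifying assumptions for illustration.

```python
import numpy as np

def graph_slam_cost(states, landmarks, inputs, measurements, g, h, R_inv, Q_inv):
    """Quadratic cost of Eq. (2.32) (constant term dropped), with states and
    landmarks stored as NumPy vectors.

    inputs       -- list of (i, u_i) motion edges between x_{i-1} and x_i (i >= 1)
    measurements -- list of (i, j, m_ij) landmark edges between x_i and l_j
    g, h         -- kinematic and measurement models; R_inv, Q_inv information matrices
    """
    cost = 0.0
    for i, u in inputs:                       # motion constraints
        e = states[i] - g(states[i - 1], u)
        cost += e @ R_inv @ e
    for i, j, m in measurements:              # landmark measurement constraints
        e = m - h(states[i], landmarks[j])
        cost += e @ Q_inv @ e
    return cost
```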
