
Scuola Superiore Sant’Anna

Master of Science in Embedded Computing Systems

Feature-Based SLAM for augmented reality using low-cost camera and real-time streaming

Author:

Andrea Vincentini

Supervisors: Prof. Franco Tecchia, Prof. Marcello Carrozzino


Cameras are powerful sensors as they provide high-resolution environment information (color, shape, texture, etc.) while being lightweight, low-power, and inexpensive. Exploiting such sensor data for exploration tasks typically falls into the field of monocular simultaneous localization and mapping (SLAM), where both the camera's pose and a map of the environment are estimated concurrently from the representation produced by a single hand-held camera. In particular, SLAM using cameras is referred to as visual SLAM (vSLAM) because it is based on visual information only. vSLAM can be used as a fundamental technology for various types of applications and has been discussed in the fields of computer vision, augmented reality, and robotics in the literature.

In monocular vSLAM, there is no way to directly measure the depth of any single point in the map, because the input is only a 2D image (as opposed to, say, a stereo pair or a depth map). One possible solution involves the use of a fiducial marker to recover a known scale, but it requires that the marker can be placed at a specific position in the environment. This is not always possible, so a better solution involves the use of inertial data.

The fusion of inertial and visual data is widely used to improve an object's pose estimation. An Inertial Measurement Unit (IMU) is able to measure the 3D acceleration and rotation of a moving object. Integrating the acceleration measurements from the IMU over time yields a position in meters, whereas a vSLAM algorithm applied to the vision data provides a position with an unknown scale factor.

This thesis project intends to design and develop a low-cost and low-power hand-held real-time video streaming system. The hand-held device (a Raspberry Pi) transmits video data along with IMU data synchronized frame by frame; a host application then receives all data and applies a vSLAM algorithm while rendering the video.


The first thanks goes to my parents Anna Rita and Alessandro and to my brother Marco, who allowed me to go through this journey in complete serenity, without pressure, and who gave me the opportunity to focus fully on my university career.

I would like to thank my supervisor Franco for the guidance he provided during the writing of this thesis and for giving me the chance to gain experience in a field that fascinates me and in which I hope to work in the future.

I thank all my friends, fellow students and not, who have accompanied me through these long years; each of you has taught me something I will always carry with me. Finally, I thank Alice. Thank you for always being close to me, sincere, sweet and patient, always ready to encourage me and give me the strength to improve myself. May this milestone be the first of many more.


Abstract ii

Acknowledgments iii

1 Introduction 1

1.1 Application domain for SLAM . . . 4

1.2 Thesis proposal . . . 6

1.3 Thesis outline . . . 6

2 State of the Art 8

2.1 vSLAM types . . . 8

2.1.1 Direct and Indirect methods . . . 8

2.1.2 Sparse and Dense methods . . . 9

2.1.3 Features types overview . . . 11

2.1.3.1 SIFT: Scale Invariant Feature Transform . . . 11

2.1.3.2 SURF: Speeded up robust features . . . 12

2.1.3.3 FAST: Features from Accelerated Segment Test . . 13

2.1.3.4 ORB: Oriented FAST and Rotated BRIEF . . . 13

2.1.3.5 Comparative analysis from the literature . . . 14

2.2 vSLAM algorithms . . . 15

2.2.1 monoSLAM . . . 15

2.2.2 PTAM - Parallel Tracking and Mapping . . . 16

2.2.3 ORB-SLAM . . . 18

2.2.4 DSO - Direct Sparse Odometry . . . 19

2.3 vSLAM modules . . . 20

2.4 Issues and open problems . . . 21

2.4.1 Pure rotation in monocular SLAM . . . 21

2.4.2 Static world assumption . . . 21

2.4.3 Camera calibrations . . . 22

2.4.4 Rolling Shutter camera . . . 24


2.5 Visual-inertial fusion . . . 24

2.5.1 Loosely coupled fusion in the literature . . . 26

2.5.2 Tightly coupled fusion in the literature . . . 27

3 Physical Architecture 28

3.1 Probe component . . . 28

3.1.1 Raspberry Pi . . . 28

3.1.2 Camera . . . 30

3.1.3 IMU MPU-9150 . . . 30

3.1.4 Hand-held camera-imu linking solution . . . 32

3.2 Base station component . . . 33

4 Software Architecture 34

4.1 Probe software on Raspberry Pi . . . 34

4.1.1 RaspiVid implementation . . . 34

4.1.1.1 UDP communication capabilities . . . 36

4.1.1.2 H264/MPEG-4 AVC . . . 37

4.1.1.3 Multi Media Abstraction Layer . . . 37

4.1.1.4 RaspiVid code . . . 38

4.1.2 IMU data acquisition process . . . 41

4.1.2.1 RTIMULib . . . 42

4.1.2.2 IMU process code . . . 43

4.2 Base station software on Pc application . . . 44

4.2.1 Multithreading implementation . . . 44

4.2.1.1 Base station application’s threads . . . 45

4.2.1.2 Win32 threads management . . . 46

4.2.2 vSLAM application . . . 49

4.2.2.1 FFmpeg video decode library . . . 50

4.2.2.2 OpenCV . . . 53

4.2.2.3 Visual-Inertial SLAM algorithm . . . 57

4.2.3 Configuration files . . . 61

4.2.4 User Interface . . . 62

5 Visual and Inertial data synchronization 64

5.1 Communication inside the probe component . . . 64

5.2 Full system communication . . . 68

6 vSLAM technologies analysis 74

6.1 PTAM accuracy evaluation . . . 77


6.1.1.1 Absolute trajectory error . . . 79

6.1.1.2 Relative pose error . . . 79

6.1.2 Testing PTAM w.r.t tum_floor dataset . . . 79

6.2 ORBSLAM accuracy evaluation . . . 80

6.2.1 Testing ORBSLAM w.r.t the tum_desk dataset . . . 80

6.2.1.1 Absolute trajectory error . . . 81

6.2.1.2 Relative pose error . . . 82

6.2.2 Testing ORBSLAM w.r.t the tum_floor dataset . . . 82

6.2.2.1 Absolute trajectory error . . . 83

6.2.2.2 Relative pose error . . . 83

6.3 DSO accuracy evaluation . . . 85

6.3.1 Testing DSO w.r.t the tum_desk dataset . . . 85

6.3.2 Testing DSO w.r.t the tum_floor dataset . . . 85

6.3.2.1 Absolute trajectory error . . . 86

6.3.2.2 Relative pose error . . . 86

6.4 Video sequences analysis . . . 88

6.5 CPU usage . . . 90

6.6 Results evaluation . . . 90

7 Visual-Inertial SLAM analysis 92

7.1 Machine Hall easy dataset . . . 93

7.2 Machine Hall medium dataset . . . 93

7.3 Machine Hall difficult dataset . . . 95

7.4 ORBSLAM and VI-ORBSLAM comparison . . . 96

7.5 Timing Analysis . . . 97

7.6 Base station CPU utilization . . . 98

8 Conclusions 99

8.1 Future work . . . 101


1.1 A table reconstruction using Edge Based Visual Odometry for a

Monocular Camera . . . 2

1.2 A 3-D terrestrial LIDAR scan reconstruction . . . 3

1.3 Indoor environment reconstruction using a handheld device . . . 5

1.4 Zurich city reconstruction using a direct SLAM algorithm on the images acquired by a camera mounted on a MAV . . . 5

2.1 Differences between indirect (on the left) and direct (on the right) methods workflow. . . 9

2.2 An example of environment reconstruction using a sparse algorithm . . . 10

2.3 An example of environment reconstruction using a dense algorithm . . . 11

2.4 Scaling (a), noise (b) and fisheye (c) performances . . . 14

2.5 Sunlight (a), Incandescent (b) and dim light (c) performances . . . . 15

2.6 PTAM Mapping thread’s flow. . . 17

2.7 ORB-SLAM system overview, showing all the steps performed by the tracking, local mapping and loop closing threads. . . 19

2.8 Camera model projecting a 3D point X onto the image plane . . . . 22

2.9 Rolling shutter camera rows acquisition . . . 24

2.10 Loosely [a] and tightly [b] fusion approaches . . . 25

3.1 Raspberry Pi model B+ . . . 28

3.2 Raspberry Pi model B+ specifications . . . 29

3.3 SainSmart camera with fish-eye lens . . . 30

3.4 InvenSense IMU MPU-9150 . . . 31

3.5 Raspberry Pi model B+ GPIO configuration . . . 32

3.6 From camera to world reference and viceversa . . . 32

3.7 Camera - IMU linking solution . . . 33

4.1 Raspberry Pi video stack . . . 35

4.2 RaspiVid behavior representation exploiting Simulink stateflow . . . 35

4.3 IMU process behavior representation exploiting Simulink stateflow . . . 41


4.4 Workflow representation of the base station threads behavior . . . . 46

4.5 Structure of the visual-inertial system . . . 59

4.6 Evolution of the optimization in the tracking thread . . . 59

4.7 Local BA in the visual inertial algorithm . . . 61

4.8 User interface . . . 62

5.1 Probe component activity diagram . . . 65

5.2 IMU data packet structure . . . 66

5.3 IMU data samples between two consecutive synchronization mechanisms . . . 67

5.4 IMU data packet structure extended for multiple samples . . . 68

5.5 Whole system activity diagram . . . 69

5.6 Packet structure: [a] containing only visual data, [b] containing vi-sual and inertial data . . . 73

6.1 PTAM (a) and ORBSLAM (b) estimated trajectories from an indoor video sequence . . . 74

6.2 Groundtruth representations and associated paths along the three dimensions of the tum_desk dataset . . . 75

6.3 Groundtruth representations and associated paths along the three dimensions of the tum_floor dataset . . . 76

6.4 PTAM 3D representation of the tum_desk before elaboration (a) and after alignment and scale correction (b). . . 78

6.5 PTAM tum_desk estimated trajectories and groundtruth in terms of XYZ coordinates. . . 78

6.6 PTAM 3D representation of the tum_floor before elaboration (a) and after alignment and scale correction (b). . . 80

6.7 ORBSLAM 3D representation of the tum_desk before elaboration (a) and after alignment and scale correction (b). . . 81

6.8 ORBSLAM tum_desk estimated trajectories and groundtruth in terms of XYZ coordinates. . . 81

6.9 ORBSLAM 3D representation of the tum_floor before elaboration (a) and after alignment and scale correction (b). . . 83

6.10 ORBSLAM tum_floor estimated trajectories and groundtruth in terms of XYZ coordinates. . . 83

6.11 DSO 3D representation of the tum_desk before elaboration (a) and after alignment and scale correction (b). . . 85

6.12 DSO 3D representation of the tum_floor before elaboration (a) and after alignment and scale correction (b). . . 86

6.13 DSO tum_floor estimated trajectories and groundtruth in terms of XYZ coordinates. . . 86


6.14 PTAM estimated trajectory and ORBSLAM reference evaluating an indoor video sequence . . . 88

6.15 PTAM estimated trajectory and ORBSLAM reference evaluating an outdoor video sequence . . . 89

6.16 PTAM estimated trajectory and ORBSLAM reference evaluating a second outdoor video sequence . . . 90

7.1 MH_01 dataset: [a] comparison between groundtruth and estimated trajectories, [b] comparison along the three directions . . . 93

7.2 MH_03 dataset: [a] comparison between groundtruth and estimated trajectories, [b] comparison along the three directions . . . 94

7.3 MH_05 dataset: [a] comparison between groundtruth and estimated trajectories, [b] comparison along the three directions . . . 95

7.4 Comparison between ORBSLAM and VI-ORBSLAM . . . 97

7.5 Time needed to track with different number of ORB features . . . 98


2.1 Comparison of main monocular vSLAM algorithms . . . 21

3.1 Camera’s features . . . 30

3.2 IMU properties . . . 31

5.1 Raspberry Pi CPU usage . . . 68

6.1 APE results of running PTAM on the tum_desk dataset . . . 79

6.2 RPE results of running PTAM on the tum_desk dataset . . . 79

6.3 APE results of running ORBSLAM on the tum_desk dataset . . . . 82

6.4 RPE results of running ORBSLAM on the tum_desk dataset . . . . 82

6.5 APE results of running ORBSLAM on the tum_floor dataset . . . . 84

6.6 RPE results of running ORBSLAM on the tum_floor dataset . . . . 84

6.7 APE results of running DSO on the tum_floor dataset . . . 87

6.8 RPE results of running DSO on the tum_floor dataset . . . 87

6.9 ATE and RPE results between PTAM and ORBSLAM in an indoor video sequence . . . 89

6.10 CPU usage . . . 90

7.1 Estimated paths length in mh_01 dataset . . . 93

7.2 APE and RPE results running the mh_01 dataset . . . 94

7.3 Estimated paths length in mh_03 dataset . . . 95

7.4 APE and RPE results running the mh_03 dataset . . . 95

7.5 Estimated paths length in mh_05 dataset . . . 96

7.6 APE and RPE results running the mh_05 dataset . . . 96


4.1 RaspiVid command used in the project. . . 35

4.2 RaspiVid main function . . . 38

4.3 RaspiVid UDP init . . . 39

4.4 RaspiVid send function . . . 40

4.5 IMU process: IMU initialization . . . 43

4.6 IMU process: Unix-socket stuff . . . 43

4.7 CreateThread function prototype . . . 46

4.8 Application’s thread creation procedure . . . 47

4.9 CreateEvent function prototype . . . 48

4.10 SetEvent function prototype . . . 48

4.11 WaitForSingleObject function prototype . . . 48

4.12 InitializeCriticalSection function prototype . . . 49

4.13 Application’s thread creation procedure . . . 49

4.14 Main function: decoder initialization . . . 50

4.15 decoder Thread . . . 51

4.16 Main function: undistort initialization . . . 54

4.17 slam Thread . . . 56

5.1 IMU process: main behavior . . . 65

5.2 RaspiVid process: callback function . . . 68

5.3 Receiver function . . . 70


Introduction

"I’m excited about Augmented Reality because unlike Virtual Reality, which closes the world out, AR allows individuals to be present in the world but hopefully allows an improvement on what’s happening presently."

Tim Cook

The eye, which appeared about 540 million years ago in fossils from the lower Cambrian period, evolved towards various different shapes and functions, ranging nowadays from the "pinhole model" in animals like the Nautilus to a sophisticated "lens-mounted color camera" in humans.

It seems effortless how humans perceive and interact with the environment. The ability and efficiency of the visual cortex in interpreting the large amount of information perceived by the eyes is astonishing. Although humans feel the perception of a scene as an immediate action, it is evident that they need some computing power to process the vast information of this sensor.

Humans can describe their motion in three-dimensional space and characterize the size and structure of the room they are in. If the scene is only illuminated during an eye movement and completely dark before and after it, we still process a blurred image, as it is the only source of information. In both cases we perceive the image, but extracting information is the costly process we subconsciously decide to do or not to do. We start having difficulty distinguishing events separated by less than 60 ms. We have also developed a strategy to decide early on which images we should fully process.

While we possess and use other senses such as touch and hearing, the sheer portion of the human brain devoted to visual processing (about 30%) demonstrates both the importance as well as the complexity of the ability to understand the 3D world around us from 2D projections observed by our eyes.

While this subconscious behavior evolved naturally to excellence in humans and animals, it has to be implemented explicitly in artificial devices. Replicating such an understanding of space and motion in artificial systems represents an enormous challenge for scientists and engineers; however, even the smallest steps in this direction have the potential to unlock a myriad of exciting applications such as autonomous cars, service robots, or assistive devices for the blind.

Image processing has long been confined to off-line tasks on only a very few, low-resolution images: expensive devices, limited computing power, and early algorithms limited the extensive use of image processing, for instance in the robotics domain. Instead, sonars, range finders, and bulky laser devices found their applications in both industrial and mobile robots.

As artificial devices such as autonomous cars, quadrocopters, full-sized robots, or even virtual and augmented reality systems start to interact with, or adapt to, the surrounding 3D world, they need the ability to perceive, reconstruct and ultimately understand it in a similar manner as humans do. For example, a car that drives itself needs to know where it is and needs to recognize and avoid obstacles, both dynamic and static. Likewise, to convincingly display a virtual object standing on a real-world table, as in Figure 1.1, both the pose of the observer and the pose and shape of the table need to be known.

Figure 1.1: A table reconstruction using Edge Based Visual Odometry for a Monocular Camera

Researchers in computer vision, mathematics and robotics have thus spent decades on the task of reconstructing the 3D world, geometry and camera motion, from 2D images. It is commonly called Simultaneous Localization and Mapping (SLAM) or Structure and Motion (SaM).

SLAM will always use several different types of sensors, and the powers and limits of the various sensor types have been a major driver of the algorithms. Different types of sensors give rise to different SLAM algorithms, whose assumptions are those most appropriate to the sensors. At one extreme, laser scans or visual features provide details of a great many points within an area, sometimes rendering SLAM inference unnecessary because shapes in these point clouds can be easily and unambiguously aligned at each step via image registration. At the opposite extreme, tactile sensors are extremely sparse, as they contain only information about points very close to the agent, so they require strong prior models to compensate in purely tactile SLAM. Optical sensors may be one-dimensional or 2D laser rangefinders, 3D High Definition LiDAR, 3D Flash LiDAR, 2D or 3D sonar sensors, and one or more 2D cameras. Figure 1.2 shows an example of a LIDAR application.

Figure 1.2: A 3-D terrestrial LIDAR scan reconstruction

Since 2005, there has been intense research into vSLAM (visual SLAM) using primarily visual (camera) sensors, because of the increasing ubiquity of cameras such as those in mobile devices. Visual and LIDAR sensors are informative enough to allow for landmark extraction in many cases. Other recent forms of SLAM include tactile SLAM [FEPP12], radar SLAM [MMvHvH13], and WiFi-SLAM (sensing by strengths of nearby WiFi access points). Recent approaches apply quasi-optical wireless ranging for multi-lateration (RTLS) or multi-angulation in conjunction with SLAM to cope with erratic wireless measurements. A kind of SLAM for human pedestrians uses a shoe-mounted inertial measurement unit as the main sensor and relies on the fact that pedestrians are able to avoid walls to automatically build floor plans of buildings with an indoor positioning system [RAK09].

The scale ambiguity is a well-known limitation of monocular vision: the reconstruction is only possible up to an unknown scale factor when using a monocular camera. However, the scale information would often be useful, for example when making body size measurements for online shopping. Similarly, the scale information could be utilized in 3D printing: the user could first scan the object with a smart device and then print the object in exact dimensions with a 3D printer.

One way to recover the metric scale is to use a stereo or depth camera. Besides the fact that stereo and depth cameras have a limited operational range, they are also more expensive and rarely included in mobile devices. The global positioning system (GPS) can also be used for obtaining the metric scale of the reconstruction. However, GPS is typically relatively inaccurate and only works outdoors. Some scale estimation methods avoid the need for extra hardware by making assumptions about the scene content.

The use of an inertial measurement unit (IMU) as a complementary sensor to the camera promises increased robustness and accuracy. While a single moving camera is an exteroceptive sensor that allows the user to measure the appearance and geometry of a three-dimensional scene up to an unknown metric scale, an IMU is a proprioceptive sensor that renders the metric scale of monocular vision and gravity observable and provides robust and accurate inter-frame motion estimates. During the last decades, the IMU has become less expensive with the introduction of micro-machined electromechanical systems (MEMS) technology, allowing many consumer products such as cell phones, cameras, and game consoles to include an IMU. The IMU measures acceleration and angular velocity. These measurements can be integrated over time to obtain position and orientation. The noise inherent in the IMU's measurements is also included in the integration and causes the estimates of position and orientation to drift away from the real values. An IMU can be sampled with a very high sampling frequency and senses fast motions very well, but because of the drift it can only be used without an aiding sensor for short periods of time. A camera can estimate the pose (position and orientation) accurately over longer periods of time under slow motion but suffers heavily from motion blur and rolling shutter effects during fast motion. Since the IMU can sense fast motions better than cameras, a combination of the two sensors is more suitable. Recent advances in computer processing have enabled image processing algorithms to run in real time even on consumer products, and together with the less expensive IMU, vision-aided inertial navigation has become a very popular field of research.
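To make the drift problem concrete, the sketch below dead-reckons a single axis from simulated accelerometer samples with a small constant bias; the sample rate, bias and noise values are illustrative assumptions, not properties of the IMU used in this project.

```cpp
#include <cstdio>
#include <random>

// Minimal dead-reckoning sketch: double-integrating biased, noisy accelerometer
// samples turns a tiny constant bias into metres of position drift within seconds.
int main() {
    const double dt   = 1.0 / 200.0;  // assumed 200 Hz IMU rate
    const double bias = 0.02;         // assumed constant bias [m/s^2]
    std::mt19937 rng(42);
    std::normal_distribution<double> noise(0.0, 0.05);  // white noise [m/s^2]

    double v = 0.0, p = 0.0;          // velocity [m/s], position [m]
    for (int i = 0; i < 200 * 10; ++i) {                 // 10 s with the device at rest
        const double a_meas = 0.0 + bias + noise(rng);   // true acceleration is zero
        v += a_meas * dt;             // first integration: velocity
        p += v * dt;                  // second integration: position
    }
    std::printf("position drift after 10 s at rest: %.2f m\n", p);
    return 0;
}
```

In the full 3D case the gyroscope must also be integrated into an orientation estimate before the accelerometer can be rotated into the world frame and gravity removed, which compounds the drift; this is why the camera is used as the aiding sensor.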

1.1

Application domain for SLAM

SLAM has many practical applications, which are becoming increasingly important for current technological developments. The most prominent areas are Virtual and Augmented Reality (VR/AR) and Robotics.

In VR/AR applications the camera pose and the geometry of the scene are required to correctly render virtual objects into the image and allow them to interact with (e.g., collide with or disappear behind) real-world objects. At the same time, wearable devices – such as headsets or smartphones – impose severe restrictions on the cost, size, weight, and power consumption of the sensors used. This leads to passive vision becoming an important sensor modality – both in a monocular and in a stereo set-up, and typically combined with an IMU.

Figure 1.3: Indoor environment reconstruction using a handheld device

Turning to Robotics, some examples include autonomous quadrocopters, driverless cars, and robot vacuum cleaners. SLAM is used to estimate the robot's position with respect to the environment and to navigate without colliding with other objects. While in some cases other sensory modalities (such as laser scanners or RGB-D cameras) can be used, passive vision is a good – and sometimes the only – option for resource-constrained systems, due to restrictions on per-unit production cost or on size/weight/power consumption.

Figure 1.4 shows an example of 3D environment reconstruction exploiting a SLAM algorithm and a MAV.

Figure 1.4: Zurich city reconstruction using a direct SLAM algorithm on the images acquired by a camera mounted on a MAV


1.2

Thesis proposal

This thesis proposes to realize a low-cost system able to track the user and show a virtual world instead of the real one. In this thesis, tracking of a handheld camera is considered.

The purpose of this thesis is to:

• investigate which of the existing SLAM algorithms provides the best results in terms of tracking quality and computational requirements using a real-time video stream captured with a low-cost rolling-shutter camera,

• evaluate whether the fusion of camera and IMU information can be useful to increase the performance.

1.3

Thesis outline

The rest of this document is organized as follows.

Chapter 2 presents an overview of the current state of the art. The content of this chapter is based on what is stated in the main papers in the literature. The objective is to understand where the technology has arrived and how some applications are realized, in order to choose the "best" solution on which to build the system.

Chapter 3 shows the hardware structure of the system. For each component, it summarizes the main characteristics drawing on the datasheets. For the probe component, the Raspberry Pi board, the camera, and the inertial sensor will be presented. For the base station component, only the hardware capabilities will be provided.

Chapter 4 represents the core of this thesis project. Here the designed system is explained in detail in terms of software development. For both components an overview of the main flow and a brief description of the main libraries used will be provided. Furthermore, the most important segments of the code are inserted and commented to make clear how the results are achieved.

Chapter 5 explains how one of the main problems of this project is solved: the synchronization of the visual and inertial data. Starting with the analysis of how these data are acquired by the probe component, this chapter also presents how these data are streamed to the base station component to be processed.

Chapter 6 reports the results achieved by comparing the vSLAM algorithms exploiting two TUM datasets. These results highlight how monocular vSLAM algorithms are affected by different problems, such as the unknown scale or the rolling shutter effect. A further analysis was done using two custom video sequences to better compare some possible solutions.

Chapter 7 shows the performance of the visual-inertial SLAM algorithm proposed in this thesis. Also here the performance is evaluated by exploiting two datasets, in this case two EuRoC datasets.

Chapter 8 outlines the conclusions obtained in this project and tries to provide some directions for future work.

State of the Art

Starting from 2000, several technologies have been introduced in the domain of vSLAM. This chapter first provides an overview of the types of vSLAM technologies, then analyzes the different modules included in vSLAM. In the end, it summarizes the evolution of the main vSLAM algorithms.

2.1

vSLAM types

For a long time, the field of vSLAM was dominated by a feature-based (or indirect) approach, while in recent years new approaches, called direct, were introduced. At the same time, indirect methods were used in combination with sparse methods, which reconstruct a map using only a selected set of independent points (traditionally corners), while modern direct methods generally use dense methods to reconstruct all pixels in the 2D image domain. The distinction between dense and sparse is not synonymous with that between direct and indirect; indeed, there exist solutions belonging to all four combinations.

Mainly, both direct and indirect approaches take raw measurements as input and compute an estimate of the 3D world model and camera motion in an unknown environment. Sparse/dense describes the quantity of regions used in each received image frame, while direct/indirect describes the different ways in which the image data are used.

2.1.1 Direct and Indirect methods

vSLAM systems can be classified according to the way they utilize the information received from an image frame.

The traditional feature-based or indirect method, both filtering-based and keyframe-based, splits the overall problem of estimating geometric information from images into two sequential steps. Firstly, a series of feature observations is recognized in the image. Then, the camera position, posture and scene geometry are computed as a function of these feature observations only. These features can be simple geometric features such as corners or edges, or more sophisticated feature descriptors such as SIFT [Low04], ORB [RRKB11], or FAST [RD06].

While splitting simplifies the overall problem, it comes with an important limitation: only the information that conforms to the feature type can be used. Another issue with feature-based methods is that storing the processed features can quickly become very costly.

Direct methods use pixel photometric information (the intensity difference between a pixel in one image and its warped projection in another image) rather than extracting intermediate features. Direct methods try to recover the environment depth and the camera pose through a joint optimization over the map and camera parameters. Some disadvantages are that they cannot handle outliers very well, as they will always try to process and incorporate them into the final map, and that they are slower than feature-based variants.

Figure 2.1 shows the differences between direct and indirect approach.

Figure 2.1: Differences between indirect (on the left) and direct (on the right) methods workflow.

2.1.2 Sparse and Dense methods

Sparse methods use and reconstruct only a selected set of independent points (traditionally corners), whereas dense methods attempt to use and reconstruct all pixels in the 2D image domain.

The sparse formulation has a long history with monocular cameras and is also known in the computer vision community as Structure-from-Motion (SfM). Given a sequence of images, sparse algorithms estimate the camera pose where each image was taken, together with a sparse point cloud of features which represents the map of the real scene. Typically, salient features (e.g. corners or lines) are detected in the images and associated across frames. Given these associations, the relative pose transform between the images can be inferred, which then gives the 3D locations of the points that correspond to the features. Finding the optimal setting for both the poses and the map points is usually achieved using an approach called bundle adjustment, which frames the problem as a sparse, nonlinear least-squares objective that can be optimized efficiently using the Gauss-Newton or Levenberg-Marquardt algorithms. Salient features are first detected in an image and then associated with the features that comprise the current map, which allows the current pose of the camera to be estimated. The camera poses are then used to refine the map by minimizing a nonlinear least-squares objective function. Sparse monocular SLAM algorithms are preferred when optimizing for the camera trajectory due to their ability to cleanly and efficiently close loops and minimize drift. Nevertheless, the sparse point map representation presents problems if the map is to be used for online motion planning to avoid obstacles, although the number and density of features may vary from algorithm to algorithm.
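As a reference point for the discussion above, the bundle adjustment objective can be written as the following robustified, nonlinear least-squares problem; the notation is introduced here for illustration, with \(u_{ij}\) the observed pixel of map point \(X_j\) in image \(i\), \(\pi\) the perspective projection through the intrinsics \(K\), and \(\rho\) an optional robust cost such as Huber:

\[
\min_{\{R_i, t_i\},\ \{X_j\}} \ \sum_{(i,j)} \rho\!\left( \big\lVert\, u_{ij} - \pi\!\big(K (R_i X_j + t_i)\big) \big\rVert^2 \right)
\]

Gauss-Newton or Levenberg-Marquardt steps exploit the sparsity of this problem, since each residual involves only one pose and one point.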

Figure 2.2 shows an example of the environment reconstruction obtained using a sparse algorithm.

Figure 2.2: An example of environment reconstruction using a sparse algorithm

Encouraged by the propagation of GPUs and general-purpose GPU programming languages, as well as interest from the virtual and augmented reality communities, dense monocular SLAM methods caught on. Dense approaches estimate a depth for every pixel in each incoming frame, producing either dense point cloud or mesh representations that can be used to describe obstacles in extremely fine detail. The heart of these approaches lies in the stereo and multi-view stereo literature, where dense geometry is estimated from a set of images taken from known poses. Rather than detecting a sparse set of salient features and associating them across images, these algorithms typically match patches of pixels in one image with those of another comparison image that lie along the epipolar line (the projection of the ray of all possible depths onto the comparison image). The noisy depth estimates are then fused and spatially regularized.

Figure 2.3 shows an example of the environment reconstruction obtained using a dense method.

Figure 2.3: An example of environment reconstruction using a dense algorithm

2.1.3 Features types overview

An ideal feature detection technique should be robust to image transformations such as rotation, scale, illumination, noise and affine transformations. In addition, ideal features must be highly distinctive, such that a single feature can be correctly matched with high probability.

2.1.3.1 SIFT: Scale Invariant Feature Transform

Scale Invariant Feature Transform (SIFT) is a feature detector that consists of four major stages: (1) scale-space extrema detection; (2) keypoint localization; (3) orientation assignment; (4) keypoint descriptor.

In the first stage, the image is scanned over location and scale in order to determine potential interest points that are invariant to scale and orientation. These are the local scale-space maxima of the Difference of Gaussians (DoG), which is obtained by subtracting different Gaussian scales. Points that have a low contrast with respect to a predefined threshold are rejected, while poorly localized edge points are eliminated using an idea similar to the Harris method: on an edge the local gradient distribution gives one large and one small eigenvalue, whereas a well-defined keypoint gives two large eigenvalues. To obtain descriptors that are invariant to rotation, an orientation histogram is formed from the gradient orientations within a region around each local maximum of the DoG function. The final stage of SIFT constructs the feature vector from histograms of the local gradients, computed in a region around the keypoint and rotated relative to the keypoint orientation.
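For completeness, the Difference of Gaussians used in the first stage can be written as follows, where \(G(x, y, \sigma)\) is a Gaussian kernel, \(I\) the input image, \(*\) denotes convolution and \(k\) the constant scale ratio between adjacent levels:

\[
D(x, y, \sigma) = \big(G(x, y, k\sigma) - G(x, y, \sigma)\big) * I(x, y)
\]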

Although SIFT has proven to be very efficient in object recognition applications, it requires a large computational effort, which is a major drawback, especially for real-time applications.

2.1.3.2 SURF: Speeded up robust features

Speeded Up Robust Features (SURF) is able to generate scale- and rotation-invariant interest points and descriptors. SURF has been used as a feature selector in many studies because its descriptors are invariant to rotation and scaling changes and its computational time is small compared to other feature extraction algorithms for interest point localization and matching. Systematically, SURF uses 2D Haar wavelets and integral images. For keypoint detection, it uses the sum of the 2D Haar wavelet [CH97] responses around the point of interest, together with an integer approximation of the determinant of the Hessian matrix, which extracts blob-like structures at locations where the determinant is maximal. Therefore, the performance of SURF can be attributed to the non-maximal suppression of the determinants of the Hessian matrices. In the description phase, the neighborhood region of each keypoint is first divided into 4x4 sub-square regions. Then, the 2D Haar wavelet response is computed in each sub-region; again, this procedure can be computed with the aid of the integral image. Each sub-region contributes four values to the descriptor, so each keypoint is described with a 64-dimensional (4x4x4) feature vector over all sub-regions. Although the SURF method runs faster than SIFT, in some situations, such as viewpoint and intensity changes, it does not give results as good as those produced by SIFT.

2.1.3.3 FAST: Features from Accelerated Segment Test

The Features from Accelerated Segment Test (FAST) corner detector uses a circle of 16 pixels to classify whether a candidate point p is actually a corner or not. Assuming the processed pixel p with intensity Ip is selected, each pixel in the circle is labeled with an integer number from 1 to 16, clockwise. To make the algorithm fast, the intensities of pixels 1, 5, 9 and 13 of the circle are compared with Ip first: if fewer than three of these four pixels are above Ip + T or below Ip - T, then p is not an interesting point (corner). Otherwise, all 16 pixels are checked, and in this case 12 contiguous pixels must satisfy the criterion for p to be accepted as a corner. The procedure is repeated likewise for all the other pixels in the image. Because of some limitations (for n < 12 the algorithm does not work well, the choice of pixels is not optimal, and multiple features are detected adjacent to one another), a machine learning approach has been employed to deal with these issues. A training set is constructed by storing, for every feature point p, the 16 surrounding pixels as a vector. Each of these 16 pixels can have one of three states (darker, similar or brighter) according to some given rules. Then ID3 (a decision tree classifier) is used to select the pixel that yields the most information about whether the candidate pixel is a corner, with respect to an entropy minimization criterion.
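The segment test described above is compact enough to sketch directly; the snippet below checks a single candidate pixel of a row-major grayscale image against the n = 12 contiguity criterion. The circle offsets follow the usual radius-3 Bresenham circle, and the helper signature is hypothetical, not taken from an existing library.

```cpp
#include <cstdint>
#include <initializer_list>

// Radius-3 Bresenham circle around the candidate pixel: 16 offsets, clockwise,
// starting from the pixel labeled "1" directly above the candidate.
static const int OFF[16][2] = {
    { 0,-3},{ 1,-3},{ 2,-2},{ 3,-1},{ 3, 0},{ 3, 1},{ 2, 2},{ 1, 3},
    { 0, 3},{-1, 3},{-2, 2},{-3, 1},{-3, 0},{-3,-1},{-2,-2},{-1,-3}};

// Segment test for one pixel (x, y); the caller must keep (x, y) at least
// 3 pixels away from the image border. T is the intensity threshold.
bool isFastCorner(const uint8_t* img, int stride, int x, int y, int T) {
    const int Ip = img[y * stride + x];
    int state[16];  // +1 brighter than Ip + T, -1 darker than Ip - T, 0 similar
    for (int i = 0; i < 16; ++i) {
        const int Ii = img[(y + OFF[i][1]) * stride + (x + OFF[i][0])];
        state[i] = (Ii > Ip + T) ? 1 : (Ii < Ip - T) ? -1 : 0;
    }
    // Quick rejection on pixels 1, 5, 9, 13: at least three must agree.
    int brighter = 0, darker = 0;
    for (int i : {0, 4, 8, 12}) { brighter += state[i] == 1; darker += state[i] == -1; }
    if (brighter < 3 && darker < 3) return false;
    // Full test: 12 contiguous circle pixels, all brighter or all darker.
    for (int s : {1, -1}) {
        int run = 0;
        for (int i = 0; i < 32; ++i) {      // two laps to handle wrap-around
            run = (state[i % 16] == s) ? run + 1 : 0;
            if (run >= 12) return true;
        }
    }
    return false;
}
```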

2.1.3.4 ORB: Oriented FAST and Rotated BRIEF

ORB is basically a fusion of the FAST keypoint detector and the BRIEF descriptor with many modifications to enhance performance. First, it uses FAST to find keypoints, then applies the Harris corner measure to find the top N points among them. It also uses a pyramid to produce multi-scale features. One problem is that FAST does not compute the orientation, so some modifications were made: ORB computes the intensity-weighted centroid of the patch with the detected corner at its center, and the direction of the vector from this corner point to the centroid gives the orientation. To improve rotation invariance, the moments are computed with x and y restricted to a circular region of radius r, where r is the size of the patch. For descriptors, ORB uses BRIEF descriptors.

BRIEF (Binary Robust Independent Elementary Features) is an efficient feature point descriptor. It is highly discriminative even when using relatively few bits and is computed using simple intensity difference tests. As with all binary descriptors, BRIEF's distance measure is the number of different bits between two binary strings, which can also be computed as the sum of the XOR operation between the strings.

What ORB does is to "steer" BRIEF according to the orientation of the keypoints. For any feature set of n binary tests at locations (xi, yi), define a 2xn matrix S which contains the coordinates of these pixels. Then, using the orientation of the patch θ, its rotation matrix is found and used to rotate S to obtain the steered (rotated) version Sθ. ORB discretizes the angle in increments of 2π/30 (12 degrees) and constructs a lookup table of precomputed BRIEF patterns. As long as the keypoint orientation θ is consistent across views, the correct set of points Sθ will be used to compute its descriptor.
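Since ORB is the feature type used later in this work through OpenCV, a minimal detection-and-matching sketch is given below; the image file names, the 1000-feature budget and the 0.75 ratio-test threshold are illustrative assumptions.

```cpp
#include <opencv2/core.hpp>
#include <opencv2/features2d.hpp>
#include <opencv2/imgcodecs.hpp>
#include <cstdio>
#include <vector>

int main() {
    // Hypothetical input frames; any pair of overlapping grayscale images works.
    cv::Mat img1 = cv::imread("frame1.png", cv::IMREAD_GRAYSCALE);
    cv::Mat img2 = cv::imread("frame2.png", cv::IMREAD_GRAYSCALE);
    if (img1.empty() || img2.empty()) return 1;

    // Oriented FAST keypoints + rotated BRIEF descriptors.
    cv::Ptr<cv::ORB> orb = cv::ORB::create(1000);
    std::vector<cv::KeyPoint> kp1, kp2;
    cv::Mat des1, des2;
    orb->detectAndCompute(img1, cv::noArray(), kp1, des1);
    orb->detectAndCompute(img2, cv::noArray(), kp2, des2);

    // Binary descriptors are compared with the Hamming distance (XOR + popcount).
    cv::BFMatcher matcher(cv::NORM_HAMMING);
    std::vector<std::vector<cv::DMatch>> knn;
    matcher.knnMatch(des1, des2, knn, 2);

    // Ratio test: keep a match only if it is clearly better than the runner-up.
    std::vector<cv::DMatch> good;
    for (const auto& m : knn)
        if (m.size() == 2 && m[0].distance < 0.75f * m[1].distance)
            good.push_back(m[0]);

    std::printf("keypoints: %zu / %zu, good matches: %zu\n",
                kp1.size(), kp2.size(), good.size());
    return 0;
}
```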

2.1.3.5 Comparative analysis from the literature

In the literature there exist various papers that analyze the performance of features with respect to several aspects; here, some results are summarized. In [KPS17] Karami et al. investigate the sensitivity of SIFT, SURF, and ORB against intensity changes, rotation, scaling, shearing, fisheye distortion, and noise. The authors compared a static image with the same one after applying different kinds of transformations and deformations. According to the results, SIFT performs better than SURF and ORB, but it requires a feature extraction time that is, in general, ten times greater than that of ORB and SURF. With the aim of constructing a vSLAM system with the properties described in Chapter 3, this work provides interesting results in terms of scaling, fisheye distortion and noise. Figure 2.4 shows the results achieved by the authors' analysis for scaling, noise and fisheye distortion.


Figure 2.4: Scaling (a), noise (b) and fisheye (c) performances

In [SIAS13] Saipullah et al. analyzed object detection methods with respect to the efficiency, quality and robustness of the detection for seven methods: SIFT, SURF, Center Surround Extremas (CenSurE), Good Features To Track (GFTT), Maximally Stable Extremal Regions (MSER), ORB, and FAST. The paper presents an analysis of real-time object detection methods for an embedded system, particularly an Android smartphone.

The overall performance shows that the FAST algorithm has the best average performance with respect to speed, number of keypoints and repeatability error. The most robust object detection method is ORB, which achieves the lowest RER (Repeatability Error Rate) even under different illumination (Figure 2.5) and orientation. However, in this particular system, the ORB method consumes too much computation time and does not achieve real-time video performance. Unlike FAST, all other algorithms consume too much time in the computation, resulting in lag that reduces the video quality significantly. Only FAST achieves real-time performance in object detection on an embedded device. However, the object detection performance of FAST is not significantly higher compared to the other object detection methods, and it is also somewhat sensitive to orientation and illumination changes. As with the previous analysis, in this case it is also possible to highlight the most interesting results, in particular the robustness when varying the type of lighting used: fluorescent light (normal), sunlight, and dim light (without the room light).


Figure 2.5: Sunlight (a), Incandescent (b) and dim light (c) performances

2.2

vSLAM algorithms

Exploiting combinations of the above types, many different algorithms have been developed over the years. In the remainder of this chapter, an overview of the main sparse algorithms in the field of monocular camera applications is provided.

2.2.1 monoSLAM

Davison et al. [DRMS07] present the monoSLAM algorithm. This is one of the first representative methods in the filter-based vSLAM domain.

In monoSLAM, the camera motion and the 3D structure of an unknown environment are simultaneously estimated using an EKF (Extended Kalman Filter). In the end, this method creates a probabilistic feature-based map, representing at any instant a snapshot of the current estimates of the state of the camera and all features of interest, and the uncertainty in these estimates. The map is created at system start-up and updated until the operation ends, using the probabilistic state estimation and feature observations provided by the EKF.

The map is not intended as a complete scene description, and the authors therefore aim to capture a sparse set of high-quality landmarks. They assume that the scene is rigid and that each landmark is a stationary world feature. The camera is modeled as a rigid body needing translation and rotation parameters to describe its position, and estimates of its linear and angular velocity are also maintained.

The system is initialized using a small amount of prior information about the scene, in the shape of a known target placed in front of the camera. This provides several features (typically four) with known positions and of known appearance. There are two main reasons for this:

• in single camera SLAM, with no direct way to measure feature depths or any odometry, starting from a target of known size allows assigning a precise scale to the estimated map and motion rather than running with scale as a completely unknown degree of freedom;

• having some features in the map right from the start means that it is possible to immediately enter the normal predict-measure-update tracking sequence without any special first step.

After start-up, the state vector is updated in two alternating ways: the prediction step, when the camera moves in the "blind" interval between image captures, and the update step, after measurements of features have been obtained.

The problem of this method is a computational cost that increases in proportion to the size of the environment. In large environments, the size of the state vector becomes large because the number of feature points is large. In this case, it is difficult to achieve real-time computation.
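For context, the filter alternates the standard EKF prediction and update steps over the joint state \(x\) (camera pose, velocities and feature positions) with covariance \(P\); \(f\) and \(h\) are the motion and measurement models with Jacobians \(F\) and \(H\), and \(Q\), \(R\) the process and measurement noise covariances (generic EKF equations, not monoSLAM-specific notation):

\[
\hat{x} = f(x), \qquad \hat{P} = F P F^\top + Q
\]
\[
K = \hat{P} H^\top \big(H \hat{P} H^\top + R\big)^{-1}, \qquad
x \leftarrow \hat{x} + K\big(z - h(\hat{x})\big), \qquad
P \leftarrow (I - K H)\,\hat{P}
\]

The quadratic growth of \(P\) with the number of map features is exactly the computational-cost issue mentioned above.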

2.2.2 PTAM - Parallel Tracking and Mapping

In [KM07] the authors present a method for estimating camera pose in an unknown scene, using a system specifically designed to track a hand-held camera in a small AR workspace. They propose to divide tracking and mapping into two separate tasks, processed in parallel threads on a dual-core computer. One thread deals with tracking hand-held motion, while the other produces a 3D map of point features from previously observed frames.

Using two separate threads, tracking is no longer probabilistically slaved to the map creation procedure, and any robust tracking method desired can be used. As a consequence, freed from the computational responsibility of updating a map at every frame, the tracking thread can perform more accurate image processing, further increasing performance.

On the mapping side, many incoming frames can be skipped because they contain redundant information, particularly when the camera is not moving. This algorithm focuses on processing a smaller number of more useful keyframes rather than analyzing redundant information among consecutive frames. These keyframes then need not be processed within strict real-time limits, and this allows operation with a larger numerical map size. Finally, it replaces incremental mapping with a computationally expensive but highly accurate batch method, the bundle adjustment algorithm [TMHF99].

Mapping occurs in two different stages. First, an initial map is built using a stereo technique. Secondly, the map is improved and expanded by the mapping thread by adding new keyframes. Figure 2.6 shows the mapping phases.

Figure 2.6: PTAM Mapping thread’s flow.

When the system is initialized, the algorithm uses the five-point stereo algorithm [Nis04] to initialize the map. The algorithm requires user cooperation: the user first places the camera above the workspace to be tracked and presses a key. At this point, the system's first keyframe is stored in the map. The user then smoothly translates the camera to a slightly offset position and makes a further key-press. The 2D patches are tracked through the smooth motion, and the second key-press thus provides a second keyframe and feature correspondences from which the five-point algorithm and RANSAC [FB87] can estimate an essential matrix and triangulate the base map. The resulting map is refined through bundle adjustment.
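PTAM ships its own five-point implementation; as a hedged analogue of that bootstrap step, the sketch below uses OpenCV's cv::findEssentialMat (five-point algorithm inside RANSAC) and cv::recoverPose on two sets of already matched pixel coordinates, with the intrinsics K assumed known from calibration (CV_64F). Function and variable names here are illustrative.

```cpp
#include <opencv2/calib3d.hpp>
#include <opencv2/core.hpp>
#include <vector>

// Two-view bootstrap sketch: relative pose from matched points of two keyframes,
// followed by triangulation of an initial map. The overall scale stays arbitrary.
void initializeMap(const std::vector<cv::Point2f>& pts1,
                   const std::vector<cv::Point2f>& pts2,
                   const cv::Mat& K)   // 3x3 CV_64F intrinsic matrix
{
    // Five-point algorithm wrapped in RANSAC, analogous to the PTAM bootstrap.
    cv::Mat inliers;
    cv::Mat E = cv::findEssentialMat(pts1, pts2, K, cv::RANSAC, 0.999, 1.0, inliers);

    // Decompose E into the relative rotation R and unit-norm translation t.
    cv::Mat R, t;
    cv::recoverPose(E, pts1, pts2, K, R, t, inliers);

    // Triangulate inlier correspondences to obtain the initial point cloud.
    cv::Mat Rt;
    cv::hconcat(R, t, Rt);
    cv::Mat P1 = K * cv::Mat::eye(3, 4, CV_64F);   // first camera at the origin
    cv::Mat P2 = K * Rt;                           // second camera at [R|t]
    cv::Mat pts4D;                                 // homogeneous 4xN result
    cv::triangulatePoints(P1, P2, pts1, pts2, pts4D);
    // A bundle adjustment pass would normally refine R, t and the points here.
}
```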

Once a map has been created, it is updated using the information obtained by the tracking thread. The tracking system acquires images from the hand-held camera and keeps a real-time estimate of the camera pose relative to the built map. Using this estimate, augmented graphics can then be drawn on top of the video frame.

New keyframes are added whenever the following conditions are met:

• tracking quality must be good;

• time since the last keyframe was added must exceed twenty frames;

• and the camera must be a minimum distance away from the nearest keypoint already in the map.

In [CKM08] and [KM09] the authors present extensions of the PTAM algorithm running with multiple maps and on a mobile phone, respectively.

2.2.3 ORB-SLAM

Mur-Artal et al. [MAMT15] present ORB-SLAM, a feature-based monocular SLAM system that operates in real time, in small and large, indoor and outdoor environments.

The system is built using ORB features, which allow real-time performance without GPUs and provide good invariance to changes in viewpoint and illumination. They are extremely fast to compute and match, while having good invariance to viewpoint. This allows matching them from wide baselines, boosting the accuracy of BA.

The ORB-SLAM algorithm is based on three main threads that run in parallel: tracking, local mapping and loop closing. The tracking is in charge of localizing the camera with every frame and deciding when to insert a new keyframe. The local mapping processes new keyframes and performs local BA to achieve an optimal reconstruction in the surroundings of the camera pose. New correspondences for unmatched features in the new keyframe are searched in connected keyframes in the covisibility graph to triangulate new points. The loop closing searches for loops with every new keyframe. If a loop is detected, it computes a similarity transformation that informs about the drift accumulated in the loop. Then both sides of the loop are aligned and duplicated points are fused. Finally, a pose graph optimization over similarity constraints is performed to achieve global consistency. Figure 2.7 summarizes the behavior of the algorithm.

If the tracking is lost, the place recognition module is used to perform a global relocalization. The system embeds a bag-of-words place recognition module, based on DBoW2 [GLT12], to perform loop detection and relocalization. The visual words are a discretization of the descriptor space, known as the visual vocabulary. The vocabulary is created offline with ORB descriptors extracted from a large set of general images, so that the same vocabulary can be used for different environments with good performance.


Figure 2.7: ORB-SLAM system overview, showing all the steps performed by the tracking, local mapping and loop closing threads.

In [MAT17a] an extension of ORB-SLAM is presented that works with stereo and RGB-D cameras.

2.2.4 DSO - Direct Sparse Odometry

In [EKC18] the authors propose a sparse and direct approach to monocular visual odometry that runs in real time on a laptop computer. As a direct method, DSO does not depend on keypoint detectors or descriptors, so it can naturally sample pixels from all image regions that have an intensity gradient, including edges or smooth intensity variations on mostly white walls.

This method takes full advantage of photometric camera calibration, including lens attenuation, gamma correction, and known exposure times. This integrated photometric calibration further increases accuracy and robustness. Differently from indirect methods, where feature extractors and descriptors are invariant to photometric variations, this approach uses, in addition to a geometric camera model, a photometric camera model, which includes the function that maps the real-world energy received by a pixel on the sensor (irradiance) to the respective intensity value. In order to suppress accumulated error, DSO removes error factors as much as possible from both geometric and photometric perspectives.

In DSO, the input image is divided into several blocks, and then, high intensity points are selected as reconstruction candidates. By using this strategy, points are spread within the whole image.

It should be noted that DSO considers local geometric consistency only. Therefore, it is more proper to consider DSO a VO algorithm rather than a vSLAM one.


2.3

vSLAM modules

The vSLAM process consists of a sequence of steps. The goal of the process is to use the surrounding environment to update the position of the camera. Camera pose updating is done mainly in three phases:

1. Initialization
2. Tracking
3. Mapping

To initialize the system, it is necessary to define a global coordinate system for camera pose estimation and 3D reconstruction in an unknown environment. In general, to do that, the relative pose between two frames is computed in order to triangulate an initial set of map points. This method should be independent of the scene (planar or general). Therefore, in the initialization, the global coordinate system should first be defined, and a part of the environment is reconstructed as an initial map in that global coordinate system.

After the initialization, tracking and mapping are performed to continuously estimate new camera poses. In the tracking, the reconstructed map is tracked in the image to estimate the camera pose with respect to the map. In the mapping, the map is expanded by computing the 3D structure of the environment when the camera observes unknown regions where mapping has not been performed before. There are two other, extra modules that are not included in all vSLAM systems:

1. Relocalization

2. Loop closing detection and global map optimization

Sometimes the tracking fails due to fast camera motion or some external disturbance (such as a light variation). In this case, if the relocalization module is not included in the vSLAM system, the system stops working. Otherwise, if it is included, the camera pose can be estimated again with respect to the already reconstructed map; to do that, the camera must be moved back to an already tracked region. This process is called relocalization.

The loop closing process follows the same behavior as the relocalization module. In particular, a closed loop is first searched for by pairing the current image with previously acquired images. If a loop is detected, it means that the camera is observing one of the previously observed scenes. This module is useful because of the accumulated estimation error introduced by the camera movement. When a previous region of the map is revisited after some camera movement, reference information that represents the accumulated error from the past to the present can be computed. Then, a loop constraint derived from this reference information is used as a constraint to suppress the error in the global optimization.

Fundamentally, relocalization is used to recover a camera pose while loop detection is used to obtain a geometrically consistent map.
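The global optimization triggered by a detected loop is commonly phrased as a pose graph problem over the keyframe poses \(T_i\) (elements of SE(3), or Sim(3) in the monocular case so that scale drift can also be corrected), with relative constraints \(T_{ij}\) coming from tracking and from the loop closure; this generic formulation is given for illustration:

\[
\min_{\{T_i\}} \ \sum_{(i,j)} \left\lVert \log\!\big(T_{ij}^{-1}\, T_i^{-1} T_j\big) \right\rVert^2_{\Sigma_{ij}}
\]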

Table 2.1 summarizes the main vSLAM algorithms in terms of tracking method, map density and extra functionality provided.

Table 2.1: Comparison of main monocular vSLAM algorithms.

Algorithm   Method    Map density   Global optimization   Loop closure
monoSLAM    Feature   Sparse        No                    No
PTAM        Feature   Sparse        Yes                   No
ORB-SLAM    Feature   Sparse        Yes                   Yes
DSO         Direct    Sparse        No                    No

2.4

Issues and open problems

In a survey by Yang et al. [YWGC17], the authors identify at least three main problems for a vSLAM system. In particular, they identify that the performance can be influenced by some low-level aspects such as photometric calibration, motion bias, and the rolling shutter effect. However, there are other aspects that can decrease the performance of a vSLAM system, such as a poor geometric camera calibration, the initialization procedure, adverse conditions, and map scaling.

2.4.1 Pure rotation in monocular SLAM

One of the most serious problems for monocular SLAM systems is the pure rotation situation, which may lead to large errors in the camera pose. This can be explained by considering a camera looking at a point: the trajectory of this point projected onto the image plane when the camera rotates about its centre will appear very similar to that caused by a large translational movement. Therefore, when there is a paucity of information from other points, the SLAM system will be confused about what the true 3D motion is, which will result in the computation of an incorrect camera pose due to the misguided conclusion that a large translational movement happened. Despite various works [PSR13] addressing the problem of pure rotation, it remains one of the most serious problems in monocular SLAM systems.

2.4.2 Static world assumption

Most existing SLAM algorithms rely on the "static world assumption" to localize the sensor and construct a map of the environment, which simply means that nothing in the scene is moving with respect to the mapped environment. If the environment changes, the SLAM system can simply become lost, as its location relative to the static map it has built becomes meaningless. The reason why current SLAM systems can still work in such situations is that there are often some filtering mechanisms (e.g. RANSAC [FB81]) used within the SLAM system that can treat small movements as noise. Nevertheless, SLAM systems will always fail if the environment changes dramatically.

The static world assumption also poses problems in AR applications. Imagine a camera facing a table that has a cup on it. A modern SLAM system will not fail if the cup is moved because it should filter the movement as noise; but this means that augmented reality content (for example, a small advertisement appearing to stick out from the surface) cannot be added to the cup using the SLAM system if it moves, because it is no longer being tracked.

2.4.3 Camera calibrations

To correctly deduce the position of a camera, and to display the graphics properly, a system requires to know some properties of the camera and lens configuration be-ing used. When these values are known, the system is said to be usbe-ing a calibrated camera.

Geometric camera calibration [Ope18] is the process by which we determine the parameters of the camera. The relevant parameters (focal length and principal point) must be known, as they are involved in projecting 3D points into a 2D image. Figure 2.8 shows how a 3D object projects onto the image plane.
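As a hedged example of how such a calibration can be obtained in practice, the sketch below follows the standard OpenCV checkerboard procedure referenced in [Ope18]; the image path, board size and square size are assumptions made only for illustration:

```python
# Geometric calibration from a set of checkerboard images: cv2.calibrateCamera
# returns the intrinsic matrix K of Equation 2.1 plus lens distortion terms.
import glob
import cv2
import numpy as np

pattern_size = (9, 6)            # inner corners of the checkerboard (assumed)
square_size = 0.025              # checkerboard square size in meters (assumed)

# Reference 3D corner positions on the board plane (Z = 0).
object_corners = np.zeros((pattern_size[0] * pattern_size[1], 3), np.float32)
object_corners[:, :2] = np.mgrid[0:pattern_size[0],
                                 0:pattern_size[1]].T.reshape(-1, 2)
object_corners *= square_size

object_points, image_points = [], []
image_size = None
for path in glob.glob("calibration/*.png"):       # placeholder image folder
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, pattern_size)
    if found:
        object_points.append(object_corners)
        image_points.append(corners)
        image_size = gray.shape[::-1]

rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    object_points, image_points, image_size, None, None)
print("reprojection RMS:", rms)
print("intrinsic matrix K:\n", K)
```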

Figure 2.8: Camera model projecting a 3D point X onto the image plane

The relationship, together with a translation of the origin, skew and aspect ratio, can also be described in matrix notation using homogeneous coordinates:

\[
\begin{bmatrix} x \\ y \\ w \end{bmatrix}
=
\begin{bmatrix} f_x & s & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} X \\ Y \\ Z \end{bmatrix}
\qquad (2.1)
\]

\[
x = K X \qquad (2.2)
\]

The matrix $K$ contains the intrinsic or internal camera parameters and describes how the camera transforms the inhomogeneous point $X$ onto the image. $c_x$ and $c_y$ describe the translation of the principal point required to move the origin into image coordinates, and $f_x$ and $f_y$ are the camera focal lengths. The pixels may also be skewed, but in most cases $s = 0$. The presence of $w$ is due to the use of homogeneous coordinates (with $w = Z$).

The extrinsic or external camera parameters describe how the camera relates to a world coordinate system. This relation, or transformation, can be described as a translation d and a rotation R and expressed as a matrix multiplication:

\[
x = K \, [R \mid d] \, \tilde{X} \qquad (2.3)
\]

where $\tilde{X}$ is a homogeneous point, $\tilde{X} = [X^T \; 1]^T$.
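As a concrete numerical illustration of Equations 2.1 to 2.3 (all numeric values below are invented for the example), the following sketch applies the extrinsic transform, projects through the intrinsic matrix, and performs the perspective division:

```python
# A world point is transformed by the extrinsics [R|d], projected through K,
# and divided by the homogeneous coordinate w to obtain pixel coordinates.
import numpy as np

K = np.array([[525.0,   0.0, 319.5],     # fx, s, cx (illustrative values)
              [  0.0, 525.0, 239.5],     # fy, cy
              [  0.0,   0.0,   1.0]])

R = np.eye(3)                            # camera aligned with the world axes
d = np.array([0.0, 0.0, 0.5])            # translation of 0.5 m along Z

def project(X_world):
    """Project a 3D world point to pixel coordinates via x = K [R|d] X~."""
    X_cam = R @ X_world + d              # extrinsic transform
    x = K @ X_cam                        # homogeneous image point (x, y, w)
    return x[:2] / x[2]                  # perspective division

print(project(np.array([0.1, -0.05, 2.0])))
```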

The model described in Equation 2.1 is also known as the global shutter camera model. A poor calibration means that 3D points will not project to the correct 2D location, which can cause points to fail to track as they drift too far from their expected location; the map geometry can also suffer if the triangulation of points is systematically wrong.

As mentioned above, photometric calibration concerns direct methods. In particular, pixels corresponding to the same 3D point may have different intensities across images due to optical vignetting, auto gain and exposure controls.

In [EUC16] it has been shown that photometric calibration can significantly improve the performance of direct methods. The authors suggest calibrating the camera response function $G$ as well as the pixel-wise attenuation factors $V : \Omega \to [0, 1]$ (vignetting). Without known irradiance, both $G$ and $V$ are only observable up to a scalar factor. The combined image formation model is then given by

\[
I(x) = G\big(t \, V(x) \, B(x)\big) \qquad (2.4)
\]

where $t$ is the exposure time, $B$ the irradiance image, and $I$ the observed pixel value.
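Assuming the inverse response function and the vignetting map are available from an offline photometric calibration (both are assumptions here, as is the synthetic data used in the example), Equation 2.4 can be inverted to recover the irradiance image up to the scale ambiguity mentioned above:

```python
# Undoing the photometric model of Equation 2.4: B = G^{-1}(I) / (t * V).
import numpy as np

def photometric_correction(I, G_inv, V, t):
    """I: uint8 image, G_inv: 256-entry lookup table inverting the camera
    response G, V: per-pixel vignetting attenuation in [0, 1], t: exposure
    time in seconds. Returns the estimated irradiance image B."""
    energy = G_inv[I]                                  # invert the response G
    return energy / (t * np.clip(V, 1e-6, 1.0))        # undo vignetting and exposure

# Example with an identity response and a mild radial vignetting fall-off.
h, w = 480, 640
G_inv = np.linspace(0.0, 1.0, 256)
yy, xx = np.mgrid[0:h, 0:w]
r = np.hypot(yy - h / 2, xx - w / 2) / np.hypot(h / 2, w / 2)
V = 1.0 - 0.3 * r**2
I = (np.random.rand(h, w) * 255).astype(np.uint8)
B = photometric_correction(I, G_inv, V, t=0.02)
```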


2.4.4 Rolling Shutter camera

An extra complication is the use of a rolling shutter, which is especially common in lower-end webcams or mobile phone cameras.

Instead of acquiring the entire image on the camera sensor at the same instant of time, the image is built up line-by-line. This takes some time, so the top part of the image is captured slightly earlier than the bottom. Figure 2.9 shows how a rolling shutter camera acquires consecutive rows of an image.

Figure 2.9: Rolling shutter camera rows acquisition

This effect is less noticeable when the camera is stationary, but for faster motions, it manifests as a shearing of the image. Left unmodelled, this can cause huge problems for tracking.

In the case of a rolling shutter camera, the model described in Equation 2.3 must be revised to introduce the time dependence of the extrinsic parameters, while the intrinsic parameters remain the same. Assuming that scanning begins at the top row and proceeds down to the bottom row:

\[
x = K \, [R(t) \mid d(t)] \, \tilde{X} \qquad (2.5)
\]

where $t = 0$ corresponds to the first row of the frame.
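A minimal sketch of Equation 2.5 is given below under the simplifying assumption of a constant translational velocity during readout (rotation interpolation, e.g. SLERP, would be handled analogously); since the row a point lands in depends on the pose, and the pose depends on the row, the projection is solved by a short fixed-point iteration:

```python
# Rolling shutter projection sketch: each row is exposed at its own time,
# so the extrinsics used for projection depend on the row the point hits.
import numpy as np

K = np.array([[525.0, 0.0, 319.5],
              [0.0, 525.0, 239.5],
              [0.0, 0.0, 1.0]])
image_rows = 480
readout_time = 0.03            # seconds to scan from top row to bottom row (assumed)

def pose_at_row(row, R0, d0, velocity):
    """Extrinsics [R(t)|d(t)] at the instant row 'row' is exposed,
    with t = 0 at the top row (constant-velocity translation only)."""
    t = (row / (image_rows - 1)) * readout_time
    return R0, d0 + velocity * t

def project_rolling_shutter(X_world, R0, d0, velocity, iterations=3):
    """Guess a row, look up the pose at that row, reproject, repeat."""
    row = image_rows / 2
    for _ in range(iterations):
        R_t, d_t = pose_at_row(row, R0, d0, velocity)
        x = K @ (R_t @ X_world + d_t)
        u, v = x[:2] / x[2]
        row = np.clip(v, 0, image_rows - 1)
    return u, v

print(project_rolling_shutter(np.array([0.2, 0.1, 2.0]),
                              np.eye(3), np.zeros(3),
                              velocity=np.array([0.5, 0.0, 0.0])))
```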

In [CJK08] and [HSB10], the authors propose post-processing schemes based on motion vector detection to suppress rolling shutter distortion.

2.5 Visual-inertial fusion

A major issue of vision-based navigation using a single camera as the only exteroceptive sensor is the recovery of the metric error between the desired camera pose and the actual pose; only on this basis can robust controllers be designed. Besides relying on a known visual pattern, one can think of a variety of sensors to retrieve the absolute scale. Laser rangefinders are too heavy, infrared sensors are too sensitive to sunlight, and ultrasound sensors lack the range over which they can reliably measure distance. This reduces the choice to a second camera, a pressure sensor or an inertial sensor (i.e. an IMU). However, a second camera loses its range measurement ability for scenes far away with respect to the stereo baseline, and a pressure sensor is unreliable indoors (sudden pressure changes when closing doors, activating air conditioning, etc.). This leaves the IMU as the logical choice.

The usage of IMUs for navigation has mostly been limited to the aviation and marine industries due to cost and size [Hol11]; however, during the last years the field of application has broadened. The challenge of accurately fusing information from several sensors and a motion model, often called sensor fusion, is to figure out how much to trust the different sensors and the motion model.

The approaches for fusing measurements from different sensors can be divided into two categories: loosely and tightly coupled. In loosely coupled approaches, the measurements from one or more sensors are pre-processed before being fed to a filter that estimates the states, while in tightly coupled approaches the measurements are used directly in the filter without any pre-processing. Figure 2.10 shows the difference between loosely and tightly coupled fusion approaches.

Figure 2.10: Loosely (a) and tightly (b) coupled fusion approaches

Consider, for example, a filter fusing camera and IMU measurements. The measurements can either be pre-processed by an image processing algorithm that estimates the pose of the camera, which is then used as a measurement in the filter, or be fed directly to the filter.
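A minimal sketch of the loosely coupled variant is shown below (illustrative only, not the fusion scheme used in this thesis): a constant-velocity prediction driven by IMU acceleration is corrected by the position that a "black-box" vSLAM front-end outputs, treated as a direct measurement in the EKF; all noise values are hand-picked placeholders.

```python
# Loosely coupled EKF sketch: predict with IMU acceleration (gravity assumed
# removed, in world frame), update with the vSLAM position estimate.
import numpy as np

class LooselyCoupledEKF:
    def __init__(self, dt):
        self.dt = dt
        self.x = np.zeros(6)                     # state: [position(3), velocity(3)]
        self.P = np.eye(6)                       # state covariance
        self.F = np.eye(6)
        self.F[:3, 3:] = dt * np.eye(3)          # p' = p + v * dt
        self.Q = 1e-3 * np.eye(6)                # process noise (placeholder)
        self.H = np.hstack([np.eye(3), np.zeros((3, 3))])  # measure position only
        self.R = 1e-2 * np.eye(3)                # vSLAM measurement noise (placeholder)

    def predict(self, accel_world):
        """Propagate the state with one IMU acceleration sample."""
        self.x = self.F @ self.x
        self.x[3:] += accel_world * self.dt
        self.P = self.F @ self.P @ self.F.T + self.Q

    def update(self, slam_position):
        """Correct the state with the position estimated by the vSLAM module."""
        y = slam_position - self.H @ self.x
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x += K @ y
        self.P = (np.eye(6) - K @ self.H) @ self.P
```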

For fusing the IMU data with the visual input, a least squares or an Extended Kalman Filter (EKF) approach are possible solutions. One can imagine comparing the integrated inertial values with the differentiated visual positions in a least squares setup [KMW+11]. Despite the potential speed of this approach, it lacks robustness because of its sensitivity to outliers. Furthermore, the least squares approach only estimates the visual scale and does not provide an optimal estimate of velocity and position. In [Mar12] the authors proposed an extension of the least squares approach that includes the velocity, the attitude, and the biases in the calculation; the sensitivity to outliers and the lack of self-calibration remain. As an alternative to filtering-based fusion, in [CLKC16] and [IWKD13] the authors propose a keyframe-based non-linear optimization.
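The core of the least squares scale recovery can be written in a few lines. The toy sketch below (an illustration under the assumption of perfectly synchronized samples, not a robust implementation) compares metric displacements obtained from IMU integration with the corresponding vSLAM displacements and computes the scale factor that minimizes the squared residual in closed form:

```python
# Closed-form least squares estimate of the visual scale factor lambda that
# minimizes || lambda * d_vis - d_imu ||^2 over a window of displacements.
import numpy as np

def estimate_scale(vis_positions, imu_positions):
    """Both inputs are Nx3 arrays of positions sampled at the same instants.
    Returns the metric scale factor to apply to the visual trajectory."""
    d_vis = np.diff(vis_positions, axis=0).ravel()
    d_imu = np.diff(imu_positions, axis=0).ravel()
    return float(np.dot(d_vis, d_imu) / np.dot(d_vis, d_vis))

# Example: a visual trajectory that is 0.4x the metric one is rescaled.
metric = np.cumsum(np.random.randn(50, 3) * 0.01, axis=0)
visual = 0.4 * metric + np.random.randn(50, 3) * 1e-4
print(estimate_scale(visual, metric))   # approximately 2.5, i.e. 1 / 0.4
```

A robust variant would down-weight or reject outlier displacements before solving, which is exactly the weakness of the plain least squares approach discussed above.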

Related work on visual-inertial fusion can be categorized along three main directions, considering the number of camera poses involved in the estimation, the representation of uncertainty for the measurements and the Gaussian prior, and the number of times the measurement model is linearized. While full smoothers estimate the entire history of poses, fixed-lag smoothers consider a window of the latest poses, and filtering approaches only estimate the latest state. Both fixed-lag smoothers and filters marginalize older states and absorb the corresponding information in a Gaussian prior. The Extended Kalman Filter represents uncertainty using a covariance matrix, whereas information filters and smoothers resort to the information matrix or its square root. Finally, while a standard EKF processes a measurement only once, a smoothing approach allows linearizing it multiple times.

Filtering algorithms enable efficient estimation by restricting the inference process to the latest state of the system. Since the complexity of the EKF grows quadratically with the number of estimated landmarks, a small number of landmarks is typically tracked to allow real-time operation [JS11].

Fixed-lag smoothers estimate the states that fall within a given time window, while marginalizing out older states [LLB+15]. Fixed-lag smoothing leads to an optimization problem over a set of recent states. For nonlinear problems, smoothing approaches are generally more accurate than filtering, since they relinearize past measurements. Moreover, these approaches are more resilient to outliers, which can be discarded a posteriori (i.e., after the optimization) or mitigated by using robust cost functions.

Full smoothing methods estimate the entire history of the states (camera trajectory and 3D landmarks) by solving a large nonlinear optimization problem [JT01]. This guarantees the highest accuracy; however, real-time operation quickly becomes infeasible as the trajectory and the map grow over time. It has therefore been proposed to discard all frames except selected keyframes, or to run the optimization in a parallel thread using a dual tracking-and-mapping architecture.

2.5.1 Loosely coupled fusion in the literature

Loose coupling implies an EKF at the top level receiving only inputs from SLAM. This has the main disadvantage of ignoring the correlation between the internal states of the devices.


Loosely-coupled VIF has received significant consideration in the recent past with the advent of "black-box" vSLAM algorithms such as PTAM. In [WAL+12], the images acquired by a camera are processed using a modified version of PTAM and then fused with accelerometer and gyroscope measurements in an EKF to estimate the pose of a quadrotor MAV. In [NWSS11] a comparable setup is presented, but in this case the authors also consider an additional air pressure sensor. In [ESC12] the authors use PTAM to estimate the camera pose and exploit the quadrotor motion equations to derive metric-scale velocity estimates from inertial measurements, from which they recover the scale of the PTAM velocity. The PTAM position estimates, in metric scale, are then fused again with inertial measurements within an EKF. While the authors claim accuracies in the order of a few centimeters, it is important to note that their experiments only consider a hovering quadrotor MAV, which is the ideal operating scenario for PTAM.

2.5.2 Tightly coupled fusion in the literature

In a tightly coupled system, information from the EKF is used by the SLAM algorithm. For example, in one of the first systems based on both vision and inertial sensing [BKCD97], the authors disambiguate the pose estimation based on the prediction of the filter. In [ORK+03] and [CP04] the sensor states as well as the visual landmarks are jointly estimated, which allows the correlation between measurements to be taken into account. This approach is computationally expensive, so real-time performance is achieved at the expense of SLAM accuracy by reducing the number of visual landmarks. Another potential issue with tightly coupled solutions is the difficulty of handling large position jumps (say, after GPS corrections), since they may destabilize the vision system.

In summary, the tightly coupled solution is often more computationally complex but more accurate, since it uses all the information contained in the measurements.


Physical Architecture

This chapter presents the system from the point of view of its physical components. For each physical component, a brief overview of the main features is provided.

The system consists of a probe component, composed of a hand-held camera connected to a Raspberry Pi board, and a base station running the vSLAM software described in Chapter 4. With this configuration, a user can move the camera freely in all directions. Images are shared between the Raspberry Pi and the base station over an Ethernet link.
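Conceptually, the probe-side loop can be sketched as follows; this is only an illustration of the frame-plus-IMU synchronization idea, not the protocol described in Chapter 4, and capture_jpeg() and read_imu() are hypothetical placeholders standing in for the actual camera and IMU drivers.

```python
# Each JPEG-encoded frame is sent over a TCP socket preceded by a fixed-size
# header carrying the payload length and the IMU sample associated with that
# frame, so the receiver can keep the two streams synchronized frame by frame.
import socket
import struct

HEADER = struct.Struct("!I6f")   # payload size + ax, ay, az, gx, gy, gz

def stream(host, port, capture_jpeg, read_imu):
    with socket.create_connection((host, port)) as sock:
        while True:
            jpeg = capture_jpeg()                  # bytes of one encoded frame
            ax, ay, az, gx, gy, gz = read_imu()    # IMU sample taken with the frame
            sock.sendall(HEADER.pack(len(jpeg), ax, ay, az, gx, gy, gz))
            sock.sendall(jpeg)
```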

3.1 Probe component

3.1.1 Raspberry Pi

The processing unit used to stream the images acquired from the camera to the base station, with a set of specific settings, is a Raspberry Pi 1 Model B+ V1.2 (Figure 3.1).

Figure 3.1: Raspberry Pi model B+

The Raspberry Pi is a single-board computer suited to embedded applications. The board is only 85 mm long and 56 mm wide, roughly the size of a credit card, yet it is a capable little PC. It can be used for many of the things that a desktop PC does, such as high-definition video, spreadsheets, word processing, games and more. The Raspberry Pi also has a wide application range, from music machines, parent detectors and weather stations to tweeting birdhouses with infra-red cameras, lightweight web servers and home automation servers. It enables people of all ages to explore computing, learn to program and understand how computers work.

All models feature a Broadcom system on a chip (SoC), which includes an ARM-compatible central processing unit and an on-chip graphics processing unit (VideoCore IV). The CPU speed ranges from 700 MHz to 1000 MHz (overclocked). Secure Digital (SD) cards, in either SDHC or MicroSDHC format, are used to store the operating system and program memory. In this application the operating system used is Raspbian, a Debian-based operating system. Most boards have between one and four USB ports, HDMI and composite video output, and a 3.5 mm phono jack for audio. Lower-level I/O is provided by a number of GPIO pins which support common protocols such as I2C. The B+ models have an Ethernet port and four USB ports.

Figure 3.2 summarizes the specifications of the Raspberry Pi Model B+ used in this thesis.
