Dense visual odometry and inertial sensor fusion for UAV navigation


UNIVERSITÀ DI PISA

Master’s Thesis in Aerospace Engineering

University of Pisa

Year 2013/2014

Dense visual odometry and sensor fusion for UAV navigation

Nicolò Valigi

University of Pisa tutor:

Prof. Roberto Galatolo


In the beginner’s mind there are many possibilities, in the expert’s mind there are few.

To my family, to Angela, and to everyone who has been close to me during these years.

ABSTRACT

We tackle the SLAM (Simultaneous Localization and Mapping) problem for Unmanned Aerial Vehicles with a combination of two promising technologies: incremental smoothing and dense visual odometry using stereo cameras. In contrast to filtering-based approaches, incremental smoothing refers to the possibility of performing efficient optimization of all the past system poses to improve accuracy and reliability. In dense visual odometry, camera images taken as the robot moves are aligned to compute the relative pose between the two viewpoints.

For an optimal fusion of information from the IMU and the cameras, we go further than the usual loosely coupled approach and directly integrate pixel-level information into the SLAM optimization problem.

ABBREVIATIONS

IMU  Inertial Measurement Unit
VO   Visual Odometry
DVO  Dense Visual Odometry
GTSAM  Georgia Tech Smoothing and Mapping
SE3  Special Euclidean 3
SO3  Special Orthogonal 3
PDF  Probability Density Function
ROS  Robot Operating System

CONTENTS

1 Introduction
2 Incremental smoothing of sensor data
  2.1 The SLAM Problem
  2.2 Incremental smoothing
  2.3 Optimization on non-Euclidean spaces
  2.4 Inertial navigation
3 Dense vision odometry
  3.1 The pinhole camera model
  3.2 The algorithm
  3.3 Using a stereo camera
  3.4 Hardware and other technologies used
4 The loose loop approach
  4.1 Inertial navigation with ground data
  4.2 Smoothing for Visual Odometry
  4.3 Execution time
  4.4 The effect of keyframe selection
5 The tight loop approach
  5.1 Leveraging the inertial platform
  5.2 Building a unified optimization problem
  5.3 Evaluation
  5.4 Selective marginalization
6 Conclusions
A The loosely coupled factor

LIST OF FIGURES

Figure 1: A simple Bayes network for a SLAM problem
Figure 2: Bayes networks in the case of the full problem, filtering, and smoothing approaches.
Figure 3: A simple factor graph with unary and binary factors.
Figure 4: Factor graph solution process.
Figure 5: A Bayes network and its equivalent Bayes tree.
Figure 6: The pinhole camera model
Figure 7: Residual with respect to translation of the image in the X direction
Figure 8: Once the camera baseline is known, disparity can be used to compute the pixel depth.
Figure 9: The VI-sensor platform, with twin cameras and IMU.
Figure 10: Rviz interactive 3D pose visualization tool.
Figure 11: Coordinate frames as described in section 4.1.
Figure 12: Factor graph with absolute visual odometry measurements.
Figure 13: Online and offline estimations of the accelerometer bias.
Figure 14: Coordinate frames as described in section 4.1.2.
Figure 15: Online and offline estimations of the Body-Camera frames relative pose.
Figure 16: Online and offline estimates of the CB pose with inaccurate prior knowledge.
Figure 17: Coordinate frames as described in section 4.2.
Figure 18: Coordinate frames as described in section 4.2, in the case of no additional keyframes
Figure 19: Factor graphs for the visual ...
Figure 20: Error resulting from translations in different directions.
Figure 21: Factor graph optimization time as function of the number of included frames.
Figure 22: Execution times for DVO and GTSAM.
Figure 23: Effect of the keyframe interval on the residual error at convergence.
Figure 24: Distribution of keyframe intervals with the rotation-based heuristic.
Figure 25: PDF of the residual error at convergence for the two keyframe selection approaches.
Figure 26: PDF of the number of iterations for both the "naive" and IMU-based pose prediction
Figure 27: PDF of the rotation prediction error for the "naive" and IMU-based pose prediction
Figure 28: Operation with tightly-coupled visual odometry.
Figure 29: Operation with loosely-coupled visual odometry.
Figure 30: Execution time (in number of CPU clocks) for the iSAM2 algorithm and the batch solver.
Figure 31: Number of relinearized factors for different thresholds
Figure 32: Optimized trajectory as obtained by the iSAM smoother with different re-linearization thresholds.
Figure 33: Translation component as optimized by the incremental smoother and the fixed lag smoother with two different ...

1 INTRODUCTION

Surveillance, mapping, and aerial photography are some of the many areas that have recently started to take advantage of Unmanned Aerial Vehicles (UAVs). Many more possibilities are bound to appear in the coming years, such as mail delivery or industrial inspection. In particular, quadrocopters (i.e. helicopters with four independent propellers) are capable of hovering and flying at low speeds, making them suitable for centimeter-accurate indoor missions.

Due to their aerodynamics, however, autonomous quadrocopters are inherently unstable and thus require specially crafted control systems. Part of the problem lies in accurately localizing the robot with respect to its environment. While outdoor applications usually take advantage of the satellites of the Global Positioning System (GPS), indoor-flying robots have no such possibility.

The robotics community has been tackling this localization problem for many decades now, using several different types of sensors, from cameras to laser beacons. Due to their widespread use and low cost, this thesis will focus on video cameras and Inertial Measurement Units (IMUs), which measure the acceleration and rotational velocity of the robot. A familiar application of IMUs is found on smartphones, where they are used to sense the screen orientation and to control some games by tilting the screen.

However, these two different sources of information cannot be used together without a powerful framework for sensor fusion. Our bodies have incredibly advanced systems for this purpose: the vestibular system in the ear and the eyes are tightly connected by the vestibulo–ocular reflex to optimize our perception of balance and space. In the first part of the thesis, we will discuss how the robotics community has been trying to emulate such a great achievement.

It is no wonder that the visual cortex takes up the biggest part of the human brain: cameras produce a wealth of visual information that is very difficult to make sense of, especially with a computer. For this reason, the second part of this thesis will discuss a state-of-the-art procedure for the vehicle to understand its own motion by looking at images of its surroundings produced by the on-board cameras.

Finally, we present our solutions to the sensor fusion problem, first with a "loose" solution and then with tighter integration between image and motion sensing for optimal performance.

TECHNICAL INTRODUCTION

In the present work, we build on the literature about SLAM and visual odometry to implement a smoothing-based approach for UAV navigation using dense image information from a stereo camera and an inertial measurement unit.

As opposed to feature-based procedures, dense visual odometry requires no costly feature extraction step and operates instead by aligning images from two consecutive viewpoints. The alignment step is carried out by minimizing the least-squares error of the intensity of corresponding pixels in images taken during the motion. Compared to feature-based methods, dense visual odometry requires no fine tuning of parameters and is thus more reliable.

Regarding the sensor fusion part, we implement both a "loosely" and a "tightly" coupled solution. The former is the mainstay solution for its computational speed and versatility: visual odometry motion estimates are simply included as relative robot pose measurements (as could be obtained from wheel encoders).

Since both sensor fusion and visual odometry are formulated as least-squares optimizations, it is natural to combine everything into a single problem. This motivation leads to the "tightly coupled" approach, in which vision and IMU data are fed into the same SLAM problem.

For both the "loosely" and "tightly" coupled systems, we eschew the traditional filter-based approach to localization and instead perform full optimization over all past robot poses (smoothing). We take advantage of recent developments in the literature to perform the optimization incrementally and allow for real-time execution on limited hardware.

OUTLINE OF THE WORK

In the first chapter, we describe some of the approaches to the SLAM problem (Simultaneous Localization and Mapping). In particular, we devote some space to a promising line of action, namely incremental smoothing, that we will build on during the rest of the work.

In the second chapter, attention turns to the dense visual odometry algorithm for pose estimation, including some notions about its implementation on stereo RGB cameras (i.e. passive 3D).

In the third chapter, we implement and benchmark the loosely coupled approach using the GTSAM library for batch SLAM optimization and incremental smoothing. We specifically investigate the execution speed and the effect of keyframe selection.

In the fourth chapter, we conclude with a description of the tightly coupled localization solution and compare its performance with the loosely coupled approach.

2 INCREMENTAL SMOOTHING OF SENSOR DATA

2.1 The SLAM problem

In the robotics community, SLAM stands for Simultaneous Localization and Mapping, and concerns the situation of an autonomous vehicle that finds itself in an unknown position in an unknown environment. Solving the SLAM problem means incrementally building a map of this environment while simultaneously using it to determine the absolute position of the robot. Good SLAM performance is considered one of the most fundamental features of a truly autonomous robot, as it obviates the need for external infrastructure or prior mapping of the environment. Unfortunately, SLAM is also a clear chicken-and-egg situation, since on one hand a map is needed to localize the robot, while on the other hand an accurate position estimate is necessary for building such a map. This means that every SLAM solution must proceed as a series of steps:

• First, the robot starts in a known absolute position;

• Once the robot starts to move, a motion model is used to predict the new position and its uncertainty, which is bound to grow with time;

• The robot observes landmarks in the environment and estimates their position, thus building the map. The inaccuracy of the map is a combination of the inaccuracy in the robot position and the measurement errors;

• By remeasuring the same landmarks, the robot can improve the estimates of both its own position and the landmarks'.

As can be seen in the above description, a SLAM solution has to deal with a situation rife with uncertainty and noisy information that naturally lends itself to a probabilistic interpretation [7]. More specifically, our goal can be restated as estimating the posterior probability distribution:

$$Bel(x_t) = p(x_t \mid z_1, z_2, \ldots, z_t)$$

where $x_t$ is the state of the system at time $t$ and $z_1, \ldots, z_t$ is the set of past measurements from the sensors. In this model, the state $x_t$ contains both the pose of the robot as well as an encoding of the map that the robot itself has been building. Such an encoding can take different forms depending on the type of landmarks observed and of the sensors used.

To simplify the computation of the posterior function, it is usually assumed that the system obeys the Markov property, meaning that each state of the system only depends on the previous one and that measurements at time $t$ only depend on the state $x_t$ at the same time. These simplifications allow the problem to be represented with a Bayes network, which is a directed acyclic graph having a node for each random variable in the system. Directed edges connect two nodes when there is a relation of conditional dependence between them. An example of a typical Bayes network is shown in figure 1 and will be discussed in the next section.

2.1.1 A Bayesian interpretation

Figure 1 is a graphical representation of information in the SLAM problem, and can be clearly explained with reference to a Bayesian interpretation of the estimation problem, as outlined generally in [25] and more specifically in [6]. The Bayesian approach provides powerful tools to draw inferences in the presence of uncertainty in the data. More specifically, every time a new measurement is available, we first compute our prior belief:

$$Bel^-(x_t) = \int p(x_t \mid x_{t-1})\, Bel(x_{t-1})\, dx_{t-1}$$

based on the previous state and a model of the system's evolution over time, as expressed in the propagation model $p(x_t \mid x_{t-1})$.

Figure 1.: A simple Bayes network for a SLAM problem.

The update step then enables us to incorporate the new measurement $z_t$ into the prior belief $Bel^-$:

$$Bel(x_t) = \alpha\, p(z_t \mid x_t)\, Bel^-(x_t) \qquad (1)$$

where $\alpha$ is just a normalizing constant. Equation 1 gives the posterior distribution and contains our best estimate of the system's state vector. We have used the measurement model, or perceptual model, $p(z_t \mid x_t)$, to define the probability of making an observation $z_t$ given the current state $x_t$. The measurement model obviously depends on the type of sensor used, and captures its uncertainty characteristics.

We can now link this mathematical framework to the entities in figure 1. The edges connecting nodes $x_{i-1}$ to $x_i$ represent the propagation (or state transition) model. We also see how the measurements $z_i$ are influenced both by the environment map $m$ and by the states $x_i$. In this model, the arrows pointing from the states $x_i$ to the measurements $z_i$ represent the measurement models $p(z_t \mid x_t)$. The Bayesian network thus describes how the system evolves in time and how new measurements and states interact with each other and the environment. It is also clear how performing inference on a large graph can become computationally intractable. This problem will be the subject of the next section.
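To make the predict/update cycle above concrete, here is a minimal one-dimensional sketch with Gaussian beliefs (essentially a scalar Kalman filter). All names and numbers are illustrative and not taken from the thesis implementation; it is only meant to show where the propagation and measurement models enter the two steps.

```python
import numpy as np

def predict(mean, var, u, motion_var):
    """Propagation step: compute Bel^-(x_t) from Bel(x_{t-1}) and control input u."""
    return mean + u, var + motion_var

def update(mean, var, z, meas_var):
    """Measurement step: fold the observation z into the prior belief."""
    k = var / (var + meas_var)          # gain = relative confidence in the measurement
    return mean + k * (z - mean), (1.0 - k) * var

# One predict/update cycle with made-up numbers
mean, var = 0.0, 1.0                     # initial belief Bel(x_0)
mean, var = predict(mean, var, u=0.5, motion_var=0.2)
mean, var = update(mean, var, z=0.7, meas_var=0.1)
print(mean, var)
```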

Figure 2.: Bayes networks in the case of the full problem, filtering, and smoothing approaches. (a) The full problem; (b) Filtering; (c) Keyframes.


2.1.2 Solving the Bayes network: filtering and smoothing

This section contains a conceptual introduction to the problem of performing inference on a Bayes net to obtain the desired MAP distribution. A more exhaustive comparison of available solving methods is presented in [23], with a strong emphasis on their corresponding computational load.

We will refer to figure 2a, a specialization of figure 1 to a SLAM problem with states xi and landmark observations Tj (odometry measurements are not pictured).

BUNDLE ADJUSTMENT.  The optimal solution is called Bundle Adjustment (BA) and involves recomputing the MAP estimate of the full graph at each time step (i.e. every time a new measurement is added, represented by a new node in the network). While applicable offline, this approach becomes progressively harder to compute onboard as the robot spends a longer time interacting with the environment. BA is also often referred to as smoothing in the literature. As outlined in [23], BA has to be abandoned in order to keep computational costs at a reasonable level. To this end, there are several alternatives, depicted in figures 2b and 2c and described below.

FILTERING.  The most commonly employed solution is referred to as filtering and consists in marginalising out all the states but the current one, as shown in figure 2b. Since new features are added only when exploring new areas, the graph stays compact. On the other hand, the missing connections between the marginalised states and measurements are replaced by interactions between the states themselves, so that the graph becomes densely connected.

The most common implementation by far, the Extended Kalman Filter (EKF), represents all these interactions by a single covariance matrix, thus limiting the computational load. Even though their performance is satisfactory in computationally-constrained environments, EKFs do not deal well with non-linearities because past states are not kept in the filter (and re-linearization becomes less effective).

This problem can be somewhat alleviated by keeping a window of past states within the filter. Such a modified filter is usually referred to as a fixed-lag smoother: basically a conventional Kalman filter working with an augmented state vector that includes delayed values of the system states. It is clear how fixed-lag smoothers are a stop-gap solution that increases the computational load without solving the underlying problem.

KEYFRAME APPROACH.  The smoothing-based keyframe approach is an alternative to filtering. Full BA is performed at each step on the simplified graph of figure 2c, which only contains a strategically chosen subset of past states. The other poses and measurements are simply discarded, thus keeping the graph sparsely connected and easy to compute inference on.

INCREMENTAL SMOOTHING.  An innovative approach has been developed recently and has been named incremental smoothing by Kaess et al. in [14]. Instead of discarding or marginalising out past states, incremental smoothing takes advantage of the sparse nature of the Bayes network to solve the BA problem incrementally. Since only a small part of the graph changes at each time step, the algorithm only optimizes a small subset of the nodes, with correspondingly low computational load.

In light of the previous discussion, we have dedicated the present work to the exploration of incremental smoothing for on-board sensor fusion and localization purposes. The following section introduces the abstractions and mathematical background, with strong emphasis on the work by Dellaert et al.

2.2 Incremental smoothing

As we mentioned above, incremental smoothing is a relatively recent advancement that significantly improves the performance of the Bundle Adjustment procedure by taking advantage of the intrinsic characteristics of the SLAM problem. Incremental smoothing as implemented by Kaess et al. relies on a well-known graphical model, the factor graph, which we present in the next section.

Figure 3.: A simple factor graph with unary and binary factors.

2.2.1 Factor graphs

What follows is a concise introduction to factor graphs and their use in the SLAM problem, based on references [11], [12], and [17].

For ease of notation, we will include the calibration parameters in the set $x_i$ of navigation states at time $t_i$. $Z_i$ represents the set of measurements carried out while the system was in state $x_i$. Furthermore, at the current time $t_k$, we denote the sets of past states and measurements as $V_k$ and $Z_k$ respectively. The general joint probability distribution (1) can always be factorized in terms of a priori knowledge and individual process and measurement models:

$$p(V_k \mid Z_k) = p(V_0) \prod_i \Big[ p(x_i \mid x_{i-1}) \prod_{z_j \in Z_i} p(z_j \mid V_i^j) \Big]$$

where $V_i^j$ is the set of variables appearing in the measurement model $p(z_j \mid V_i^j)$ for measurement $z_j$ at time $t_i$. The last equation naturally leads to a representation of the form:

$$p(V_k) \propto \prod_i f_i(V_k^i)$$

in which each factor $f_i$ involves a different subset $V_k^i$ of the system states. The last equation can be graphically represented in a so-called factor graph, with a factor node for each $f_i$ and variable nodes that represent system states. If state variable $x_j$ appears in factor $f_i$, then $x_j \in V_k^i$ and an edge connects the variable node for $x_j$ with the factor node for $f_i$.

Figure 3 is an example of a simple factor graph that encodes different types of probabilistic information. Variable nodes are shown in white, whereas factor nodes have a black background. In figure 3, some factors are connected to two states and thus represent the system evolution over time. Examples of this type of binary factor could be wheel rotation or IMU measurements. Factors that represent prior knowledge or external measurements (such as GPS) are only connected to a single variable node.
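As a toy illustration of how unary and binary factors combine into a single objective, the following Python sketch evaluates the sum of squared, covariance-weighted errors of a tiny 1D pose chain with one prior (unary) factor and two odometry-like (binary) factors; names and numbers are made up for the example and are not taken from the thesis implementation.

```python
import numpy as np

# Toy 1D "poses" x0, x1, x2
x = np.array([0.0, 0.9, 2.1])

def unary_error(xi, z, sigma):
    """Prior factor: squared error of the state against a direct measurement z."""
    return (xi - z) ** 2 / sigma ** 2

def binary_error(xi, xj, delta, sigma):
    """Odometry-like factor: penalize deviation from the measured increment delta."""
    return ((xj - xi) - delta) ** 2 / sigma ** 2

# Negative log-likelihood = sum of squared, covariance-weighted error terms
objective = (
    unary_error(x[0], z=0.0, sigma=0.1)               # prior on x0
    + binary_error(x[0], x[1], delta=1.0, sigma=0.2)  # odometry x0 -> x1
    + binary_error(x[1], x[2], delta=1.0, sigma=0.2)  # odometry x1 -> x2
)
print(objective)
```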

At this point, we reformulate our problem with respect to equation 1 by saying that we are actually interested in the single most probable configuration of the system states. This particular value is usually referred to as the maximum a posteriori (MAP) estimate:

$$MAP = \arg\max_{X} Bel(x_t) \qquad (2)$$

where $X$ denotes the set of all possible states $x$. In the factor graph formulation, the MAP estimate in equation 2 immediately takes the form:

$$MAP = \arg\max_{X} \prod_i f_i(V_k^i) \qquad (3)$$

Furthermore, the robotics community usually assumes that all noise distributions are Gaussian, so that we can write factors in the general form of a multivariate Gaussian distribution with the squared Mahalanobis distance $\|a\|^2_{\Sigma} = a^T \Sigma^{-1} a$:

$$f_i(V_k^i) = \exp\left( -\tfrac{1}{2} \left\| \mathrm{err}_i(V_k^i, z_i) \right\|^2_{\Sigma_i} \right) \qquad (4)$$

where $\mathrm{err}_i(x_i, z_i)$ is a factor-specific error function and $\Sigma_i$ is the covariance matrix. Under this assumption, we can reformulate equation 3 as an optimization problem by taking its negative logarithm:

$$MAP = \arg\max_{X} \prod_i f_i(V_k^i) = \arg\min_{X} \sum_i \left\| \mathrm{err}_i(V_k^i, z_i) \right\|^2_{\Sigma_i} \qquad (5)$$

Each factor has now become an error term in a least squares problem, and must be minimized to obtain the MAP estimate.

Before describing how factors for different models are implemented, we turn to the algorithm for solving equation 5, with particular reference to the software we will be using throughout this thesis.

2.2.2 Solving the factor graph

The conventional procedure

As it stands, equation 5 can be conveniently solved with a traditional Gauss-Newton non-linear least squares solver that proceeds by repeated linearisation. Reference [7] outlines this procedure in detail. We begin by defining:

$$e_i = \mathrm{err}_i(V_k^i, z_i) = \mathrm{err}_i(x, z_i)$$

which is a function of the full set of state variables at time $k$, which we denote by $x$ for future convenience. By using the definition of a multivariate Gaussian distribution (or, equivalently, of the Mahalanobis distance), we can rewrite equation 5 at time $t_k$ as follows:

$$MAP = \arg\min_x F(x) = \arg\min_x \sum_i \underbrace{e_i^T \Omega_i e_i}_{F_i(x)}$$

where $\Omega_i = \Sigma_i^{-1}$ is the information matrix for factor $i$. We then take a first-order approximation of the error function around the guessed value $\hat{x}$:

$$e_i(\hat{x} + \Delta x) \approx e_i + J_i \Delta x$$

where $J_i$ is the Jacobian matrix of $e_i$ computed at $\hat{x}$. If we use this last expression in each $F_i(x)$, we obtain:

$$\begin{aligned} F_i(\hat{x} + \Delta x) &= (e_i + J_i \Delta x)^T \Omega_i (e_i + J_i \Delta x) \\ &= e_i^T \Omega_i e_i + 2\, e_i^T \Omega_i J_i \Delta x + \Delta x^T J_i^T \Omega_i J_i \Delta x \\ &= c_i + 2\, b_i^T \Delta x + \Delta x^T H_i \Delta x \end{aligned}$$

By summing all the $F_i$ together again, we obtain a linearised representation of the entire graph:

$$F(\hat{x} + \Delta x) = \sum_i F_i(\hat{x} + \Delta x) = c + 2\, b^T \Delta x + \Delta x^T H \Delta x$$

that can be minimized with respect to $\Delta x$ by solving the linear system:

$$H \Delta x^* = -b \qquad (6)$$

The solution for this step of the iteration is recovered by adding the increment to the initial guess:

$$x^* = \hat{x} + \Delta x^*$$

The Gauss-Newton algorithm proceeds by iterating the linearization, solution, and update steps until convergence is accomplished. Being a linearized version of the factor graph, the matrix $H = \sum_i J_i^T \Omega_i J_i$ reflects its structure and is thus sparse. This fact greatly reduces the computational load, and equation 6 can be easily solved by Cholesky or QR matrix factorization.
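The linearize/solve/update loop just described can be condensed into a few lines; the sketch below applies plain Gauss-Newton to a made-up two-residual problem, with the Jacobian taken numerically for brevity and the information matrix set to the identity.

```python
import numpy as np

def residuals(x):
    """Toy nonlinear residual vector e(x), standing in for the stacked factor errors."""
    return np.array([x[0] + np.sin(x[1]) - 1.0,
                     x[0] ** 2 - x[1] - 0.5])

def numerical_jacobian(f, x, eps=1e-6):
    e0 = f(x)
    J = np.zeros((e0.size, x.size))
    for j in range(x.size):
        dx = np.zeros_like(x); dx[j] = eps
        J[:, j] = (f(x + dx) - e0) / eps
    return J

x = np.array([0.0, 0.0])                     # initial guess x_hat
for _ in range(10):                          # linearize / solve / update
    e = residuals(x)
    J = numerical_jacobian(residuals, x)
    H = J.T @ J                              # normal matrix (Omega = I here)
    b = J.T @ e
    dx = np.linalg.solve(H, -b)              # H dx = -b, solvable by Cholesky in practice
    x = x + dx
    if np.linalg.norm(dx) < 1e-10:
        break
print(x)
```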

Incremental smoothing

In this section, we compare the standard procedure outlined above with the incremental smoothing approach that was touched upon in section 2.1.2. In particular, we will be taking advantage of the iSAM2 algorithm as presented in [14], which is heavily based upon graphical representations instead of matrix algebra. The main steps of the algorithm are outlined below.

First, the factor graph is converted to a Bayes network such as the one in figure 1. The connection between a factor graph and its corresponding Bayes net is very direct: [14] proves how in the Gaussian case, elimination is equivalent to QR factorization of the measurement Jacobian H.

Evidence of this fact is shown in figure 4. The simple factor graph in figure 4a produces the Jacobian shown on the right. The i-th factor appears as the i-th set of rows in the Jacobian, with the Bij block being non-zero when the i-th factor involves the j-th system state.

After choosing an (arbitrary) elimination order, the probability information embedded in the factor graph can be encoded directly in a Bayes net, as shown in figure 4b. The Jacobian matrix is now square (and upper-triangular), with the $B_{ij}$ block being non-zero when $p(x_i)$ has a conditional dependence on state $x_j$.

It so happens that the Bayes nets constructed from factor graphs in this way are chordal, meaning that every cycle of four or more vertices has a chord (an edge that is not part of the cycle but connects two of its vertices). This fact enables a new graphical representation, the Bayes tree, which is the original contribution in [13].

Figure 4.: Factor graph solution process. (a) Factor graph and its relative Jacobian matrix; (b) The equivalent Bayes network and its corresponding (upper triangular) Jacobian matrix.

Figure 5.: A Bayes network and its equivalent Bayes tree. (a) The original Bayes network; (b) The Bayes tree version.

The Bayes tree is a directed graph model in which each node represents a set of variables (called a clique). The Bayes tree must be variable-connected, meaning that if a variable appears in two cliques, then it must appear in all cliques on the path between those two. A fictional example of a Bayes network with its corresponding Bayes tree is shown in figures 5a and 5b. A chordal Bayes net can be transformed into its corresponding Bayes tree by discovering its cliques through one of several algorithms, such as Maximum Cardinality Search.

The key insight that led to the development of Bayes trees lies in the way they encode the information matrix of the factor graph. Since information can only travel "upwards" towards the root of the tree, only a small part of it is affected when a new measurement is added. In particular, if a new factor $f(x_i, x_j)$ is added, only the paths between the root and the cliques containing $x_i$ and $x_j$ are affected. Neither the sub-trees below these cliques nor any other sub-trees not containing $x_i$ or $x_j$ are affected.

Incremental additions to the Bayes tree can thus be conveniently carried out by detaching the affected part of the tree and transforming it back to a factor graph. The new measurement is added to the factor graph, which is again converted to a Bayes tree and re-attached to the unaffected part. Reference [14] has a more formal description of this process.

It is intuitively clear how the possibility to “re-use” most of the past calculations reduces the execution time and allows for a greater number of factors to be added within the envelope of real-time operation.

The GTSAM Library

This thesis makes extensive use of the tools provided by the software library GTSAM (Georgia Tech Smoothing and Mapping), developed by Kaess et al. at Georgia Tech. In the authors' own words:

GTSAM is a library of C++ classes that implement smoothing and mapping (SAM) in robotics and vision, using factor graphs and Bayes networks as the underlying computing paradigm rather than sparse matrices.

GTSAM provides an object-oriented framework that facilitates the creation and manipulation of complex factor graphs such as those encountered in a SLAM problem. Taking advantage of the iSAM2 algorithm is also a nearly plug-and-play operation that replaces standard solvers such as Gauss-Newton. Furthermore, the library has been designed to be easily inter-operable with other pieces of software, and can be integrated with the other components of the sensing solution.

GTSAM comes equipped with a series of factors that model common sensor types such as cameras, wheel odometry, and IMUs. A (non-exhaustive) list is reported below:

PRIOR FACTOR is the simplest factor, encoding a measurement model in which the value of the state is distributed according to the specified noise model. We will use this type of built-in factor to constrain the initial poses.

BETWEEN FACTOR provides information about the relative pose between two states, which is assumed to be distributed according to the specified noise model. We will be using a customized version of this factor for our visual odometry measurements.

RANGE FACTOR gives information about the scalar distance between two poses, or between a pose and a point. This factor could be useful for laser or infrared range finders.

IMU FACTOR constrains two relative poses (expressed in an inertial frame) given a series of IMU measurements (acceleration and angular velocity). We will be describing this factor more extensively in the following sections.

PROJECTION FACTOR measures the projected (pixel) coordinates of a landmark in a camera's image plane given its pose. This factor is useful when using feature-based visual odometry.

Most importantly, a mechanism of class inheritance enables the user to write her own custom factors that respond to specific needs and integrate seamlessly with the rest of the library. In the present work, we have taken advantage of this possibility by writing custom factors while leveraging GTSAM optimizers and solution algorithms.
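To give a flavour of how these pieces fit together, the sketch below builds a small pose chain with the GTSAM Python bindings and updates it incrementally with iSAM2. It is only indicative: the thesis work used the C++ API, and the binding names shown (gtsam.PriorFactorPose3, gtsam.BetweenFactorPose3, gtsam.ISAM2, ...) refer to recent GTSAM releases and may differ from the version used at the time.

```python
import numpy as np
import gtsam

graph = gtsam.NonlinearFactorGraph()
initial = gtsam.Values()

prior_noise = gtsam.noiseModel.Diagonal.Sigmas(np.array([0.01] * 6))
odom_noise = gtsam.noiseModel.Diagonal.Sigmas(np.array([0.05] * 6))

# Prior factor anchors the first pose (a unary factor)
graph.add(gtsam.PriorFactorPose3(0, gtsam.Pose3(), prior_noise))
initial.insert(0, gtsam.Pose3())

isam = gtsam.ISAM2()

# Simulated odometry increments: 1 m steps along x (binary factors)
step = gtsam.Pose3(gtsam.Rot3(), gtsam.Point3(1.0, 0.0, 0.0))
for k in range(1, 4):
    graph.add(gtsam.BetweenFactorPose3(k - 1, k, step, odom_noise))
    initial.insert(k, gtsam.Pose3(gtsam.Rot3(), gtsam.Point3(float(k), 0.0, 0.0)))
    isam.update(graph, initial)           # incremental smoothing step
    graph = gtsam.NonlinearFactorGraph()  # only new factors are passed next time
    initial = gtsam.Values()

print(isam.calculateEstimate().atPose3(3))
```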

2.3 Optimization on non-Euclidean spaces

In our case, the optimization states referred to in the last section are mainly 3D pose transformations, for which some care must be put into specifying an appropriate parametrization. The next pages explain how we carried out this process in the context of least-squares optimization. In particular, the group of 3D rotations is non-Euclidean, so that the concept of a manifold needs to be introduced.

2.3.1 The representation of rotations

Following the steps outlined in [21], we first define a rotation matrix from the inertial frame A to the body frame B as:

$$R_{ab} = [x_{ab} \; y_{ab} \; z_{ab}]$$

where $x_{ab}$, $y_{ab}$, $z_{ab}$ are the coordinates of the principal axes of B relative to A. We note that rotation matrices have two interesting properties: $R R^T = I$ and $\det R = 1$. The set of all 3x3 matrices that obey these properties is called SO(3) (for special orthogonal):

$$SO(3) = \{ R \in \mathbb{R}^{3 \times 3} : R R^T = I, \; \det R = +1 \}$$

It can be shown that SO(3) forms a group under matrix multiplication, with the identity matrix as the identity element. A rotation matrix can also be interpreted as a mapping $\mathbb{R}^3 \to \mathbb{R}^3$ that rotates the coordinates of a point from frame B to frame A:

$$q_a = [x_{ab} \; y_{ab} \; z_{ab}]\, q_b = R_{ab}\, q_b$$

The composition rule for rotation matrices is matrix multiplication:

$$R_{ac} = R_{ab} R_{bc}$$

where $R_{ac}$ is again a map $\mathbb{R}^3 \to \mathbb{R}^3$ that rotates the coordinates of a point from frame C to frame A.

Rotations are described by an axis $\omega \in \mathbb{R}^3$ and an angle $\theta$: for this reason, we would like to write $R \in SO(3)$ as a function of $\omega$ and $\theta$. This operation is carried out by the exponential map. It can be proved that for every $R \in SO(3)$ there exist $\omega \in \mathbb{R}^3$, $\|\omega\| = 1$, and $\theta \in \mathbb{R}$ such that $R = \exp(\hat{\omega}\theta)$, where $\hat{\omega}$ is the skew-symmetric matrix related to $\omega$, and:

$$\exp(\hat{\omega}\theta) = e^{\hat{\omega}\theta} = I + \theta \hat{\omega} + \frac{\theta^2}{2!} \hat{\omega}^2 + \ldots$$

Geometrically, $\theta$ gives the angle of rotation and $\hat{\omega}$ gives the rotation axis, in the spirit of the equivalent axis representation. Other commonly used representations for rotations are Euler angles and quaternions, which will be discussed later when dealing with overparametrization of the SO(3) space.
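As a quick numerical check of the exponential map, the snippet below builds the skew-symmetric matrix and evaluates the series in closed form through the Rodrigues formula; it is a standalone illustration rather than code from the thesis.

```python
import numpy as np

def hat(w):
    """Skew-symmetric matrix w_hat such that w_hat @ v == np.cross(w, v)."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def exp_so3(w, theta):
    """Rodrigues formula: closed form of exp(theta * hat(w)) for a unit axis w."""
    W = hat(w)
    return np.eye(3) + np.sin(theta) * W + (1.0 - np.cos(theta)) * (W @ W)

R = exp_so3(np.array([0.0, 0.0, 1.0]), np.pi / 2)   # 90 degree yaw
print(np.round(R, 3))                                # rotates the x axis onto the y axis
print(np.allclose(R @ R.T, np.eye(3)), np.isclose(np.linalg.det(R), 1.0))
```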

In the case of general rigid body motion, we consider the position vector $p_{ab} \in \mathbb{R}^3$ from the origin of frame A to the origin of frame B, so that the configuration of the system can be described by the pair $(p_{ab}, R_{ab})$ in the space SE(3) (special Euclidean):

$$SE(3) = \{ (p, R) : p \in \mathbb{R}^3, \; R \in SO(3) \} = \mathbb{R}^3 \times SO(3)$$

By appending a 1 to the coordinates of a point, we obtain a vector in $\mathbb{R}^4$ which is its homogeneous representation:

$$\bar{q} = [q_1 \; q_2 \; q_3 \; 1]^T$$

Vectors, being the difference of points, have the form:

$$\bar{v} = [v_1 \; v_2 \; v_3 \; 0]^T$$

Using this notation, a transformation $g_{ab} \in SE(3)$ can be represented in linear form:

$$\bar{q}_a = \begin{bmatrix} q_a \\ 1 \end{bmatrix} = \begin{bmatrix} R_{ab} & p_{ab} \\ 0 & 1 \end{bmatrix} \begin{bmatrix} q_b \\ 1 \end{bmatrix}$$

Again, composition can be carried out by straightforward matrix multiplication. We now generalize the skew-symmetric matrix $\hat{\omega}$ we introduced for rotations by defining the corresponding Lie algebra se(3):

$$se(3) = \{ (v, \hat{\omega}) : v \in \mathbb{R}^3, \; \hat{\omega} \in so(3) \} \qquad (7)$$

In homogeneous coordinates, we write an element $\hat{\xi} \in se(3)$ as:

$$\hat{\xi} = \begin{bmatrix} \hat{\omega} & v \\ 0 & 0 \end{bmatrix} \in \mathbb{R}^{4 \times 4}$$

Analogously to the exponential map for rotations, it can be shown how, given $\hat{\xi} \in se(3)$ and $\theta \in \mathbb{R}$, we have:

$$e^{\hat{\xi}\theta} \in SE(3)$$

and that every rigid transformation can be written as the exponential of some twist $\hat{\xi}\theta \in se(3)$. The exponential map for a twist gives the relative motion of a rigid body expressed in the same reference frame.

Least squares on a manifold

There are two main approaches to carrying out estimation problems in a non-Euclidean space such as SO(3), which is the case for our UAV. The first uses a minimal representation such as Euler angles. Unfortunately, it can be proved that there exists no parametrization of SO(3) in $\mathbb{R}^3$ that is free of singularities. Singularities cause "gimbal lock", and thus need to be dealt with accordingly.

The second approach makes use of overparametrized representations that have no singularities, such as quaternions in $\mathbb{R}^4$. However, this creates a new set of problems. First, the algorithm does not take into account the constraints on the representation (such as $\|q\| = 1$ for a quaternion $q$) and optimizes fictional degrees of freedom. Secondly, the parametrization may behave in a non-Euclidean way, in which parameters affect the rotation with varying magnitude.

The most flexible solution takes advantage of the concept of manifolds: even though SO(3) is not globally a Euclidean space, it can be considered as such locally. Correspondingly, a global overparametrized representation is used, whereas incremental changes are expressed in minimal coordinates that ideally behave as a Euclidean space.

To formalise this, we note that matrix multiplication is a continuous and differentiable function, as is its inverse. This means that SE(3) is a differentiable manifold, and thus a Lie group. Even though a complete study of Lie groups is out of context here, we note that the tangent space to a Lie group at the identity is a Lie algebra, which in the case of SE(3) is actually se(3), as given in equation 7.

This means that we can use the exponential map (and its inverse, the logarithm map) as mappings from increments in the Euclidean space to the global overparametrized space, as described in [10]. In practice, this is carried out by defining an operator $\boxplus$ that maps a local variation $\Delta x$ in the Euclidean space to a variation on the manifold: $\Delta x \mapsto x \boxplus \Delta x$. The error function can now be defined as:

$$\breve{e}_k(\Delta \tilde{x}_k) = e_k(\breve{x}_k \boxplus \Delta \tilde{x}_k)$$

where $\breve{x}$ is expressed in the overparametrized representation, and $\tilde{x}$ in a minimal one. A common choice in this respect is the vector part of the unit quaternion.

The operator $\boxplus$ is a convenient shorthand for a procedure that starts by converting the rotation part of the increment to a full quaternion, which is then applied to $\breve{x}$ together with the translation. In particular, during the least-squares optimization, increments $\Delta \tilde{x}^*$ are computed in the local Euclidean space, and are then accumulated in the global non-Euclidean one:

$$x^* = \breve{x} \boxplus \Delta \tilde{x}^*$$

Thanks to this theoretical framework, optimization on SE(3) can be carried out simply by exchanging the usual Euclidean sum with the $\boxplus$ operator described above.
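A minimal sketch of such a ⊞ update for a pose, assuming the rotation is stored as a matrix and the local increment is a 6-vector (translation followed by an axis-angle rotation); whether the translation increment is applied in the body or in the world frame is a convention choice, and the one below is only one possibility.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def boxplus(R, p, delta):
    """Apply a local increment delta = (dp, dtheta) in minimal coordinates to the pose (R, p)."""
    dp, dtheta = delta[:3], delta[3:]
    dR = Rotation.from_rotvec(dtheta).as_matrix()  # exponential map of the rotation part
    return R @ dR, p + R @ dp                      # accumulate on the manifold (body-frame convention)

# Accumulate an increment computed by the optimizer in local Euclidean coordinates
R, p = np.eye(3), np.zeros(3)
R, p = boxplus(R, p, np.array([0.1, 0.0, 0.0, 0.0, 0.0, 0.05]))
print(R, p)
```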

2.4 Inertial navigation

Reference [11] describes two different approaches for incorporating IMU measurements in the factor graph. We will see how the second formulation enables the usage of IMU measurements at faster rates by reducing the number of needed states.

CONVENTIONAL APPROACH.  In the conventional approach, each IMU measurement corresponds to a factor that relates the navigation states at two different times, $x_k$ and $x_{k+1}$. Such a relation embodies the non-linear time-propagation of the states in the form:

$$x_{k+1} = h(x_k, c_k, z_k^{IMU})$$

where $z_k^{IMU}$ is the set of IMU measurements (acceleration and angular velocity) and $c_k$ is the set of parameters for the IMU error model. In practice, $c$ is estimated along with the navigation states and contains the biases of the gyroscope and accelerometer. Its time propagation model is usually assumed to be a random walk with zero mean. The information required for this assumption is taken from the manufacturers' specification of the IMU. It should also be noted that the navigation pose $x_k$ must be given in an inertial frame of reference to correctly subtract the gravity vector.

Formally, the IMU measurement factor is the error function:

$$f_{IMU}(x_{k+1}, x_k, c_k) = d(x_{k+1} - h(x_k, c_k, z_k))$$

In this approach, a new navigation state $x_k$ must be added every time a new IMU measurement is available. Since the IMU is sampled at very high rates (up to 100 Hz), such a procedure quickly becomes computationally intractable due to the high number of states and factors involved. The next paragraph presents an alternative solution that alleviates this problem.

EQUIVALENT IMU FACTOR.  Lupton et al. [19] developed an innovative inertial navigation algorithm that is especially suited for factor-graph-based smoothing. The conventional approach to inertial navigation involves carrying out the integration of IMU measurements in the inertial (navigation) frame.

However, this solution is inconvenient because the integration must be repeated every time the factor graph is re-linearized during the solving process. On the other hand, [19] suggests computing a pre-integrated delta $\Delta x_{i \to j}$ in the body frame instead. Since such a precomputed increment depends linearly on the initial navigation pose, its calculation must only be carried out once.

This second approach is extremely advantageous in the current case, since visual odometry will be available at most at camera rates. It would then be wasteful to add new navigation states between one VO estimate and the next just to keep up with the IMU. By using the precomputed increment, the IMU measurements are "accumulated" and integrated until a new VO estimate is available. Once it is ready, the equivalent IMU factor is added to the graph as follows:

$$F_{Equiv}(x_j, x_i, c_i) = d(x_j - h_{Equiv}(x_i, c_i, \Delta x_{i \to j}))$$

Since the IMU factor is used within the optimization framework, there is no need to use sophisticated integration algorithms: a simple Euler method is enough to constrain the relative pose between two consecutive system states. GTSAM ships with an implementation of [19] as a CombinedIMUFactor that will be used throughout the rest of the current work.

More specifically, we decided to add the bias factors ck at camera rate, just like the navigation poses xk, even though the dynamics of the bias evolution are much slower than the robot’s motion.
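To make the idea of accumulating IMU samples between two camera frames concrete, here is a simplified Euler preintegration sketch in the spirit of [19]; it ignores biases, noise propagation, and gravity handling (which enters when the delta is applied to an inertial-frame state), and all symbols are illustrative rather than taken from the GTSAM implementation.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def preintegrate(accels, gyros, dt):
    """Accumulate IMU samples into a body-frame delta (dR, dv, dp) between two keyframes."""
    dR = np.eye(3)
    dv = np.zeros(3)
    dp = np.zeros(3)
    for a, w in zip(accels, gyros):
        dp = dp + dv * dt + 0.5 * (dR @ a) * dt ** 2
        dv = dv + (dR @ a) * dt
        dR = dR @ Rotation.from_rotvec(w * dt).as_matrix()   # integrate the rotation last
    return dR, dv, dp

# Example: 20 samples at 200 Hz with constant acceleration along x and a slow yaw rate
n, dt = 20, 1.0 / 200.0
accels = [np.array([0.5, 0.0, 0.0])] * n
gyros = [np.array([0.0, 0.0, 0.1])] * n
dR, dv, dp = preintegrate(accels, gyros, dt)
print(dv, dp)
```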

3 DENSE VISION ODOMETRY

Visual Odometry is a technique for estimating the 3D motion of the robot using a sequence of images captured by on board cameras. More specifically, the motion of the robot is computed incrementally, as a difference between two consecutive images. Historically, Visual Odometry has successfully replaced wheel measurements as far away as Mars in the 2004 JPL mission.

There are two main approaches to computing the relative motion of the robot between consecutive image frames. Feature-based methods work by tracking salient features across camera frames, and have successfully been employed in the past due to their relatively low computational requirements. Implementations of visual-inertial sensor fusion in the literature are mostly based on this technology. We refer the reader to Weiss [27] and Engel et al. [5], who employ the well-known PTAM algorithm on Unmanned Aerial Vehicles. The workflow for feature-based VO is as follows. First, a so-called detector, such as Harris or FAST, extracts a set of salient features. Secondly, features in the current image are matched to their counterparts in the previous image. As a last step, the relative pose between the two images is found by minimizing the reprojection error within each pair of features.

The main drawbacks of feature-based methods lie in their low accuracy and many intermediate steps that require fine tuning. Thanks to the recent advancements in computing power, a second class of algorithms has emerged. Dense visual odometry works with all pixels in the original image, and computes the relative pose transform by minimizing the photometric error between each pair of corresponding pixels. Dense methods have reached a sufficient level of maturity and performance when implemented on GPUs (see [22]). Luckily, less computationally heavy implementations have started to appear that are more readily deployed within the limited resources of an Unmanned Aerial Vehicle.

Figure 6.: The pinhole camera model

In this respect, the present thesis is based on Kerl’s work (see [15]). His implementation runs in real time on consumer CPUs, while retaining the accuracy and robustness that make dense VO a compelling choice. We will be taking advantage both of his code and theoretical framework. A more complete theoretical analysis of dense vision odometry concepts can be found in [2].

3.1 The pinhole camera model

In the context of computer vision, a camera is a mapping $\pi$ from 3D to 2D coordinates:

$$\pi : \mathbb{R}^3 \to \mathbb{R}^2$$

The most commonly used model is very simple: the whole camera system is represented by a single hole through which all light rays pass before forming the image on the sensor. The projection of a generic 3D point $X$ is found by extending the line from $X$ to the optical center $C$ until it intersects the image plane in point $p$, as shown in figure 6. The distance from the optical center to the image plane is referred to as the focal length $f$.

Given a point $X = (X, Y, Z)$ in the camera reference frame, we denote the coordinates of its projection by $x = (x, y)$. With simple geometrical optics relations, we obtain:

$$x = \frac{X f_x}{Z} + o_x \qquad y = \frac{Y f_y}{Z} + o_y \qquad (8)$$

where $f_x$ and $f_y$ are the equivalent focal lengths in the x and y directions due to a possibly non-square sensor. In this simple pinhole model, the focal length is exactly the distance between the focal plane and the pinhole. The other constants $o_x$ and $o_y$ take into account the fact that the origin of the image plane is not coincident with the optical center. The set $\{f_x, f_y, o_x, o_y\}$ of camera intrinsic parameters must be estimated during the calibration process.

Projection of physical points to the camera plane can be expressed easily in projective space by augmenting the pixel coordinate $x$ with an additional component to make it an $\mathbb{R}^3$ vector. In this projective space, points are understood to be coincident if their corresponding vectors are proportional. With this notation, the projection function in 8 can be rewritten in matrix form:

$$q = M Q \qquad M = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix} \qquad Q = \begin{bmatrix} X \\ Y \\ Z \end{bmatrix}$$

where $q$ is the homogeneous representation of the pixel coordinates.

It should also be noted that the pinhole model is linear: further effects such as lens distortion are disregarded. However, there exist several algorithms for preprocessing sensor data so that the pinhole model can still be applied successfully. In particular, optical distortions of the lens can be accounted for with quadratic or cubic calibration functions.

Usually, camera calibration parameters are estimated with the help of special markers ("chessboards"). In our case, however, the VI sensor is already factory-calibrated.
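The projection of equation 8 and its inverse translate directly into a few lines of code; the intrinsics below are made up for illustration, since the real values come from the factory calibration.

```python
import numpy as np

# Illustrative intrinsics {fx, fy, ox, oy}; real values come from calibration
fx, fy, ox, oy = 450.0, 450.0, 376.0, 240.0

def project(P):
    """Pinhole projection (eq. 8): 3D camera-frame point -> pixel coordinates."""
    X, Y, Z = P
    return np.array([fx * X / Z + ox, fy * Y / Z + oy])

def back_project(p, Z):
    """Inverse projection: pixel plus known depth Z -> 3D camera-frame point."""
    x, y = p
    return np.array([(x - ox) * Z / fx, (y - oy) * Z / fy, Z])

P = np.array([0.2, -0.1, 2.0])
p = project(P)
print(p, back_project(p, P[2]))   # recovers the original point
```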

3.2 The algorithm

Dense methods for visual odometry rely on the Lambertian assumption that the radiance of a surface does not change with the angle of observation. In practice, one assumes that the pixel to which a given 3D point is projected has the same intensity regardless of the camera position. Mathematically, one defines a warping function $\tau(\xi, x)$ that maps a pixel $x \in \mathbb{R}^2$ from the first image to the second image given the camera motion twist $\xi \in \mathbb{R}^6$. With this formalism, the photo-consistency assumption for each pixel $x$ can be rewritten as:

$$I_1(x) = I_2(\tau(\xi, x)) \qquad (9)$$

To express the warping function, we first find the 3D point $P = (X, Y, Z)$ corresponding to pixel $p = (x, y)$. To do this, we invert the projection function in 8 and obtain:

$$P = \pi^{-1}(p, Z) = Z \left( \frac{x - o_x}{f_x}, \; \frac{y - o_y}{f_y}, \; 1 \right)^T$$

Let $T(\xi, P)$ denote the coordinates of $P$ after a camera motion with twist $\xi$. The full warping function is then given by:

$$\tau(\xi, p) = \pi(T(\xi, P)) = \pi(T(\xi, \pi^{-1}(p, Z)))$$

Looking at equation 9, we can define the residual of the i-th pixel:

$$r_i(\xi) = I_2(\tau(\xi, x_i)) - I_1(x_i) \qquad (10)$$

We now formulate the image tracking process as a least-squares problem, by assuming that the residuals of each pixel are independent and identically distributed according to the sensor model $p(r_i \mid \xi)$. Under these assumptions, the probability distributions of each pixel can be multiplied together to obtain the probability of observing the whole residual image $r$ given the pose $\xi$:

$$p(r \mid \xi) = \prod_i p(r_i \mid \xi)$$

Following a procedure similar to that outlined for the solution of factor graphs, we use Bayes' theorem:

$$p(\xi \mid r) = \frac{p(r \mid \xi)\, p(\xi)}{p(r)}$$

Our goal is finding the MAP estimate:

$$\xi^* = \arg\max_{\xi} p(\xi \mid r) = \arg\max_{\xi} \prod_i p(r_i \mid \xi)\, p(\xi)$$

by setting to zero the derivative of the log likelihood:

$$\sum_i \frac{\partial r_i}{\partial \xi}\, w(r_i)\, r_i = 0 \qquad w(r_i) = \frac{\partial \log p(r_i)}{\partial r_i}\, \frac{1}{r_i} \qquad (11)$$

The last equation corresponds to the following least-squares problem:

$$\xi_{MAP} = \arg\min_{\xi} \sum_i (r_i(\xi))^2 \qquad (12)$$
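The following sketch evaluates the photometric residual of equation 10 for a single pixel, given two grayscale images, the keyframe depth, the intrinsics, and a candidate relative pose. Interpolation is reduced to a nearest-neighbour lookup and all values are illustrative; it is a sketch of the idea, not the actual DVO implementation.

```python
import numpy as np

fx, fy, ox, oy = 450.0, 450.0, 376.0, 240.0   # illustrative intrinsics

def warp_pixel(x, y, Z, R, t):
    """pi(T(xi, pi^-1(p, Z))): back-project, move by (R, t), re-project."""
    P = Z * np.array([(x - ox) / fx, (y - oy) / fy, 1.0])
    Pw = R @ P + t
    return fx * Pw[0] / Pw[2] + ox, fy * Pw[1] / Pw[2] + oy

def residual(I1, I2, depth1, x, y, R, t):
    """Photometric residual r_i = I2(tau(xi, x_i)) - I1(x_i), nearest-neighbour lookup."""
    u, v = warp_pixel(x, y, depth1[y, x], R, t)
    ui, vi = int(round(u)), int(round(v))
    if not (0 <= vi < I2.shape[0] and 0 <= ui < I2.shape[1]):
        return None                       # warped outside the second image
    return float(I2[vi, ui]) - float(I1[y, x])

# Tiny synthetic example: flat scene at 2 m, identical random images, identity motion
rng = np.random.default_rng(0)
I1 = rng.random((480, 752)); I2 = I1.copy()
depth1 = np.full_like(I1, 2.0)
print(residual(I1, I2, depth1, 100, 200, np.eye(3), np.zeros(3)))   # ~0.0
```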

3.2.1 Linearization of the problem

Since the residuals are non-linear in the pose coordinates, equation 12 must be linearized. The process is analogous to that outlined in section 2.2.2. We start with a first-order approximation of the i-th residual:

$$r_{lin}(\xi, x_i) = r(0, x_i) + J_i \Delta\xi \qquad (13)$$

where $J_i$ is the 6-dimensional vector Jacobian of the i-th residual with respect to the minimal representation of the relative pose between the two images.

With the linearized residuals, equation 11 can be rewritten in matrix form with the (diagonal) weighting matrix $W$:

$$J^T W J \Delta\xi = -J^T W r(0) \qquad (14)$$

Here, $J$ is the $n \times 6$ Jacobian matrix with the $n$ stacked $J_i$ Jacobians of each pixel, and $r$ is the vector of residuals.

Inspection of the equations shows that both sides can be computed pixel by pixel. In fact, this is how Kerl's implementation proceeds, without storing the complete $J$ matrix in memory. A simple 6-dimensional linear system is built incrementally:

$$A \Delta\xi = b \qquad (15)$$

and solved by Cholesky decomposition.
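A sketch of that incremental accumulation: each pixel contributes its own term to A and b, so the full n×6 Jacobian never has to be stored. For brevity the per-pixel Jacobian is obtained here by numerical differentiation of a toy residual, whereas the real implementation uses the analytical chain rule described next.

```python
import numpy as np

def accumulate_normal_equations(residual_fn, pixels, xi0, eps=1e-6):
    """Build A = J^T W J and b = -J^T W r one pixel at a time (W = I here)."""
    A = np.zeros((6, 6))
    b = np.zeros(6)
    for px in pixels:
        r0 = residual_fn(xi0, px)
        J = np.zeros(6)                      # numerical 1x6 Jacobian of this pixel's residual
        for k in range(6):
            d = np.zeros(6); d[k] = eps
            J[k] = (residual_fn(xi0 + d, px) - r0) / eps
        A += np.outer(J, J)                  # J_i^T J_i contribution
        b += -J * r0                         # -J_i^T r_i contribution
    return A, b

# Toy residual standing in for the photometric error of one pixel
def toy_residual(xi, px):
    return np.sin(xi[0] + 0.01 * px) + xi[3] - 0.1

A, b = accumulate_normal_equations(toy_residual, pixels=range(100), xi0=np.zeros(6))
delta_xi = np.linalg.solve(A + 1e-9 * np.eye(6), b)   # tiny damping since the toy A is rank-deficient
print(delta_xi)
```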

As we have seen in the previous chapter, analytical computation of the pixel-by-pixel error Jacobian $J_i(\xi_k)$ is required for the optimization to perform well. We consider the i-th pixel at $x_i$ in the first image, and its corresponding 3D point $p_i = (x, y, z) = \pi^{-1}(x_i, Z_1(x_i))$, where $Z_1(x_i)$ is the depth of the pixel in the first image. Through the rigid transformation $g$ described by $\xi$, $p_i$ is transformed to $p' = (x', y', z') = g(\xi, p_i)$. Looking at equation 10, we see that we can use the chain rule to compose the pixel-by-pixel image derivative with the Jacobian of the warping function:

$$J_i = J_I J_w$$

where $J_I$ is the $1 \times 2$ Jacobian matrix of the image derivatives in the x and y directions (of the second image):

$$J_I = \begin{bmatrix} \dfrac{\partial I_2}{\partial x} & \dfrac{\partial I_2}{\partial y} \end{bmatrix}$$

The warp function Jacobian $J_w$ can be decomposed into two parts, a $2 \times 3$ matrix $J_\pi$ for the projection function $\pi$ and a $3 \times 6$ Jacobian $J_g$ of the transformed point with respect to the minimal representation of the relative pose of the two cameras:

$$J_w = J_\pi J_g \qquad \text{where:} \qquad J_\pi = \frac{\partial \pi(p)}{\partial p}\bigg|_{p = g(\xi, p_i)} = \begin{bmatrix} \dfrac{f_x}{z'} & 0 & -\dfrac{f_x x'}{z'^2} \\[4pt] 0 & \dfrac{f_y}{z'} & -\dfrac{f_y y'}{z'^2} \end{bmatrix}$$

Note that the derivative of the projection function $\pi$ is evaluated at the transformed point $p'$. The other derivative,

$$J_g = \frac{\partial g(\xi, p_i)}{\partial \xi}$$

can be found in the literature about Lie algebras.
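A quick numerical sanity check of the 2×3 projection Jacobian above, comparing the analytic matrix against central finite differences of the pinhole projection at an arbitrary point (illustrative intrinsics):

```python
import numpy as np

fx, fy, ox, oy = 450.0, 450.0, 376.0, 240.0

def project(P):
    X, Y, Z = P
    return np.array([fx * X / Z + ox, fy * Y / Z + oy])

def J_pi(P):
    """Analytic 2x3 Jacobian of the pinhole projection at point P = (x', y', z')."""
    x, y, z = P
    return np.array([[fx / z, 0.0, -fx * x / z ** 2],
                     [0.0, fy / z, -fy * y / z ** 2]])

P = np.array([0.3, -0.2, 2.5])
num = np.column_stack([(project(P + e) - project(P - e)) / (2e-6)
                       for e in 1e-6 * np.eye(3)])
print(np.allclose(J_pi(P), num, atol=1e-4))   # True: analytic and numeric Jacobians agree
```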

As reported in [2], there exist different approaches to the linearization of the problem, which differ in computational efficiency but are otherwise mathematically equivalent. We follow the classification in [1] and highlight the difference between the additive and compositional approaches. In the additive approach, we solve iteratively for increments $\Delta\xi$ to the parameters $\xi$:

$$r_i(\xi_{k+1}) = I_2(w(\xi_k + \Delta\xi, x_i)) - I_1(x_i)$$

The last equation is in fact the straightforward product of our past discussion, but has the drawback that both the image gradient and the Jacobians depend on the current estimate $\xi_k$ of the parameters.

Instead of solving for an increment to the parameters, we can solve for an incremental warp instead. This is the core concept of the compositional approach:

$$r_i(\xi_{k+1}) = I_2(w(\xi_k, w(\Delta\xi, x_i))) - I_1(x_i)$$

Since the Taylor expansion of the warping function with respect to $\Delta\xi$ is carried out at the identity, the Jacobian can be computed only once, with obvious computational savings.

As the Jacobian depends on $\xi$, both approaches require its expensive re-computation for each image. Reference [1] builds on the seminal work in [8] and introduces the inverse compositional algorithm, which is much more efficient. The key insight in [8] is switching the roles of the reference and current image, so that the Jacobian can be pre-computed and does not need to be updated at each iteration.

3.2.2 Image pyramids

Figure 7 plots the sum of the residual errors over all the pixels as a function of translation in the x direction. In practice, we took an image, warped it to account for the translation, and then computed the residuals. Figure 7 shows three cases in which the original camera frame is downsampled once, twice, or three times, respectively corresponding to 320×240, 160×120, and 80×60 final images.

Figure 7.: Residual with respect to translation of the image in the X direction

While the smallest images have a less distinct minimum (leading to lower accuracy), the convex area close to the minimum is larger. This suggests the possibility of building a pyramid of downsampled images to speed up processing. A rough estimate of the minimum point could be computed on the smaller frames, improving speed and the likelihood of convergence (thanks to the gentler shape of the function). Afterwards, optimization on the bigger images would obtain a more precise estimate thanks to the steeper gradient, at the expense of computation time.
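Such a coarse-to-fine pyramid can be built in a few lines; the sketch below simply halves the resolution at each level by 2×2 averaging (a real implementation would typically blur before subsampling, e.g. with OpenCV's pyrDown). Tracking then starts at the coarsest level and uses each level's estimate to initialize the next finer one.

```python
import numpy as np

def build_pyramid(image, levels=3):
    """Return [full, half, quarter, ...] resolution copies by 2x2 block averaging."""
    pyramid = [image]
    for _ in range(levels):
        im = pyramid[-1]
        h, w = (im.shape[0] // 2) * 2, (im.shape[1] // 2) * 2
        im = im[:h, :w]
        coarser = 0.25 * (im[0::2, 0::2] + im[1::2, 0::2] + im[0::2, 1::2] + im[1::2, 1::2])
        pyramid.append(coarser)
    return pyramid

levels = build_pyramid(np.random.default_rng(0).random((480, 640)), levels=3)
print([im.shape for im in levels])   # [(480, 640), (240, 320), (120, 160), (60, 80)]
```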

3.3 Using a stereo camera

Kerl's original work (see [15]) is intended for use with an RGBD camera such as the Microsoft Kinect, and is thus not directly applicable to our flight setup with an RGB stereo camera (see section 3.4.1 for more details about the hardware we are using). As we have seen earlier, the dense approach requires knowledge of the depth of every pixel of the image. While an RGBD camera naturally provides this information at every frame, the image pair shot by our stereo setup requires further processing.

Since such processing is quite computationally intensive, our current implementation does not extract the depth information of every image, but routinely selects keyframes to which the stereo pipeline is applied. Each image is then tracked against the currently active keyframe, rather than against the immediately preceding image as in Kerl's original implementation. Obviously, a keyframe selection algorithm had to be developed. We will explore this challenge in more detail in section 4.4.

The stereo processing pipeline is based on the widely-used OpenCV library, and consists of the following steps:

RECTIFICATION OF THE IMAGES.  This process makes use of the known calibration between the cameras to match the corresponding rows of pixels in the two images. The result corresponds to a particular arrangement of the cameras, known as frontal parallel, meaning that the optical planes of the two cameras are coplanar and the optical axes are parallel. Aligning the image rows improves the speed and reliability of the following steps.

STEREO CORRESPONDENCE.  This step consists in matching a single 3D point to both images. We used OpenCV's block matching function, which exploits the presence of texture in the images. This means that the number of pixels for which depth can be computed strongly depends on the type of scene. Strongly-textured images will have many more pixels matched, and thus convey more depth information. On the other hand, poorly-textured images will have "holes" in their depth image due to matching failures.

Rectification proves useful at this stage because matching points are located in the same row in both images. The output of this step is a disparity image that contains the $x_l - x_r$ difference between the pixel positions in the two images.

TRIANGULATION.  Since the baseline between the two cameras is known, the disparity data can be triangulated to obtain depth information for each pixel (as shown in figure 8).


Figure 8.: Once the camera baseline is known, disparity can be used to compute the pixel depth.

The final output of this process is a depth map that contains the depth of each pixel of the image. Depending on the camera baseline and on the presence of texture in the images, some pixels cannot be associated with a depth value; these are marked as invalid and are not used in the optimization process.
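As an illustration, the rectification, block-matching and triangulation pipeline described above could be sketched with OpenCV roughly as follows. This is a minimal sketch, assuming an OpenCV 3.x-style API, 8-bit grayscale inputs, and illustrative names (computeDepth, K1, D1, K2, D2, R, T for the factory calibration); in the actual system the rectification maps would be computed only once, not per frame.

#include <opencv2/calib3d.hpp>
#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>
#include <vector>

// Sketch of the stereo depth pipeline: rectify, block match, reproject.
cv::Mat computeDepth(const cv::Mat& left, const cv::Mat& right,
                     const cv::Mat& K1, const cv::Mat& D1,
                     const cv::Mat& K2, const cv::Mat& D2,
                     const cv::Mat& R, const cv::Mat& T) {
  // 1. Rectification: compute the frontal-parallel mapping from the calibration.
  cv::Mat R1, R2, P1, P2, Q;
  cv::stereoRectify(K1, D1, K2, D2, left.size(), R, T, R1, R2, P1, P2, Q);

  cv::Mat map1x, map1y, map2x, map2y;
  cv::initUndistortRectifyMap(K1, D1, R1, P1, left.size(), CV_32FC1, map1x, map1y);
  cv::initUndistortRectifyMap(K2, D2, R2, P2, right.size(), CV_32FC1, map2x, map2y);

  cv::Mat leftRect, rightRect;
  cv::remap(left, leftRect, map1x, map1y, cv::INTER_LINEAR);
  cv::remap(right, rightRect, map2x, map2y, cv::INTER_LINEAR);

  // 2. Stereo correspondence by block matching (only textured pixels match).
  cv::Ptr<cv::StereoBM> bm = cv::StereoBM::create(/*numDisparities=*/64,
                                                  /*blockSize=*/15);
  cv::Mat disparity16;
  bm->compute(leftRect, rightRect, disparity16);  // fixed point, 16 * disparity

  // 3. Triangulation: reproject the disparities to 3D; the Z channel is the depth map.
  cv::Mat disparity, xyz;
  disparity16.convertTo(disparity, CV_32F, 1.0 / 16.0);
  cv::reprojectImageTo3D(disparity, xyz, Q);  // unmatched pixels map to large Z

  std::vector<cv::Mat> channels;
  cv::split(xyz, channels);
  return channels[2];  // per-pixel depth
}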

As noted earlier, due to the way the problem is linearised, depth information is only required for the keyframe, not for every individual image. Besides reducing the computational cost, this also opens up the possibility of using a monocular camera together with an initial 3D map of the environment. For more information on the depth extraction procedure, one can refer to [3].

Figure 9.: The VI-sensor platform, with twin cameras and IMU.

3.4 hardware and other technologies used

3.4.1 The Visual-Inertial sensor

Experimental data has been acquired with a visual-inertial platform by Skybotix AG, the VI-sensor. The VI-sensor is a self-contained embedded computer with an ARM CPU and an Artix FPGA, twin 752 x 480 cameras with a 110 mm stereo baseline, and a tactical-grade Analog Devices IMU.

With the aid of ROS drivers and a Gigabit Ethernet connection to the host PC, hardware-timestamped camera images and IMU readings are available for further processing. The cameras operate at 30 Hz, while the IMU is sampled at 200 Hz. Other salient features include factory calibration of the cameras (intrinsics and extrinsics) and of the IMU (sensitivity, axis misalignment, and bias).

3.4.2 The development platform

By leveraging widely-used software packages, the resulting software implementation is entirely platform-independent. Nevertheless, we aimed at a particular target platform especially designed for UAV use, the Asctec Mastermind.

The Mastermind is a lightweight development board featuring an Intel Core 2 Duo processor, 4 GB of RAM, and a Gigabit Ethernet interface for connection with the VI-sensor. While weighing less than 300 g, the Mastermind offers a desktop-class architecture and thus the ability to use a familiar development environment.

In particular, the x86 instruction set requires no cross-compilation of binaries, so that development can be carried out on a standard PC with Ubuntu Linux. Furthermore, the Core 2 Duo CPU supports modern instruction sets such as SSE and MMX that speed up vector operations. Kerl's implementation of the VO algorithm already includes such optimizations on compute-heavy operations such as image warping.

3.4.3 ROS

ROS (Robot Operating System) is a widely used middleware platform for robot software development. It provides libraries and well-tested code for hardware abstraction (device drivers), message-passing and other commonly used functionality.

More specifically, ROS provides a series of building blocks for complex robotics applications. Computations, such as our state estimation routines, can be implemented within a ROS “node” that receives messages from device drivers (such as cameras and IMU). The results can then easily be recorded or loaded in a plotting application.

One useful ROS feature that has been used extensively in the present work is the ability to record a flight dataset containing all sensor data and camera frames, to be played back later with accurate timing. Thanks to this convenience, algorithm development has been carried out iteratively with the same known input data. Recorded inputs can also be played back at lower or higher rates to test the execution time of the algorithm. This has been useful when implementing slower versions of the state estimator, which could still be run at slower-than-realtime speeds to benchmark their accuracy.

In practice, most of the software described throughout this thesis is developed as a ROS node that receives the camera images and the IMU measurements and outputs the estimated pose of the UAV. Evaluation of the software is also made easier by the availability of visualization tools such as rviz that show a 3D representation of the coordinate transforms available in the system.
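As an illustration, the overall structure of such a node could be sketched as follows. This is a minimal skeleton with hypothetical topic names and empty callbacks, not the actual implementation.

#include <ros/ros.h>
#include <sensor_msgs/Image.h>
#include <sensor_msgs/Imu.h>
#include <geometry_msgs/PoseStamped.h>

ros::Publisher pose_pub;

void imageCallback(const sensor_msgs::ImageConstPtr& msg) {
  // Feed the new camera frame to the visual odometry front-end here.
}

void imuCallback(const sensor_msgs::ImuConstPtr& msg) {
  // Accumulate the IMU sample for later use in the smoother here.
}

int main(int argc, char** argv) {
  ros::init(argc, argv, "state_estimator");
  ros::NodeHandle nh;

  // Subscriptions to the hardware-timestamped sensor streams (topic names are
  // placeholders and depend on the driver configuration).
  ros::Subscriber cam_sub = nh.subscribe("/cam0/image_raw", 10, imageCallback);
  ros::Subscriber imu_sub = nh.subscribe("/imu0", 200, imuCallback);

  // Estimated pose, visualized in rviz or recorded for later analysis.
  pose_pub = nh.advertise<geometry_msgs::PoseStamped>("estimated_pose", 10);

  ros::spin();
  return 0;
}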


Figure 10.: Rviz interactive 3D pose visualization tool.

An example of a typical rviz screen is reproduced in figure 10, with two 3D poses shown as sets of Cartesian axes with respect to the world frame.


4 THE LOOSE LOOP APPROACH

This chapter is a first exploration of the feasibility of using the GTSAM library to implement a smoother for Visual Odometry and IMU data. Section 4.1 starts out by using the smoother to combine ground truth data from an external source with the IMU. After validation of the smoother performance, we start using VO estimates in place of the ground truth data.

In this chapter, the results of Kerl's VO framework are used after the optimization is completed, that is, as pose estimates similar to what could be obtained with a GPS or a range sensor. In the next chapter, we will develop the framework for a tighter integration between the two systems.

4.1 inertial navigation with ground data

As a first step, we investigated using GTSAM to smooth IMU measurements together with a ground-truth source, namely a VICON motion capture system. VICON is a state-of-the-art infrared tracking system that can measure 3D displacements with millimeter resolution. The procedure involves fitting infrared reflectors to the quadrocopter body, so that IR cameras can track its indoor flight within a room.

Let us attach a frame of reference B to the Vicon markers on the quadrocopter. IMU measurements are produced in the C frame (the body frame of the robot).

The GTSAM library comes with several built-in factors, including one that models relative pose measurements between two optimization states, which comes close to being useful in the current situation. However, in our case the C and B frames are distinct, and thus a custom factor had to be developed.


The algorithms for inertial navigation involve removing the effect of Earth gravity from the accelerometer measurements and thus require knowledge of the position of the body frame in an inertial, gravity-aligned frame of reference. We denote such a frame with the letter I and observe that the Vicon reference frame satisfies the aforementioned requirements.

With these considerations in mind, the natural choice for the set of system states was:

${}^{i}_{c}T$ :  pose of the body frame (C) with respect to the inertial frame (I)

${}^{c}_{b}T$ :  pose of the Vicon markers frame (B) with respect to the body frame (C)

Each VICON data point is thus a measurement of the pose ${}^{i}_{b}T$, used by the smoother to update the system states.

Figure 11.: Coordinates frames as described in section 4.1.

The first step in creating a custom factor is defining a measurement model, as described in section 2.1.1. For our factor, this is a simple composition of poses that predicts ${}^{i}_{b}T$ given the current system states:

\[ {}^{i}_{b}T = {}^{i}_{c}T \, {}^{c}_{b}T \qquad (16) \]

This measurement model is implemented thanks to GTSAM's built-in handling of composition on the SE3 Lie group.

To complete the definition of a custom factor that can be used for optimization, the Jacobians of the error with respect to the states must also be defined (in our case, one Jacobian with respect to ${}^{i}_{c}T$ and one with respect to ${}^{c}_{b}T$). As always, these must be expressed in terms of the minimal representation of the SE3 Lie algebra, as linear combinations of the generators of the Lie algebra.

Figure 12.: Factor graph with absolute visual odometry measurements.
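A minimal sketch of such a factor is reported below. It assumes GTSAM's C++ API of the 3.x/early 4.x era and an illustrative class name (ViconPoseFactor, not from the thesis code); the error is the tangent-space difference between the predicted and measured poses, with the Jacobians obtained directly from the pose composition, mirroring the pattern used by GTSAM's built-in relative-pose factors.

#include <gtsam/geometry/Pose3.h>
#include <gtsam/linear/NoiseModel.h>
#include <gtsam/nonlinear/NonlinearFactor.h>

// Constrains the two states {}^i_cT and {}^c_bT given a VICON measurement of
// {}^i_bT, implementing the measurement model of equation (16).
class ViconPoseFactor : public gtsam::NoiseModelFactor2<gtsam::Pose3, gtsam::Pose3> {
  gtsam::Pose3 measured_;  // {}^i_bT reported by the motion capture system

 public:
  ViconPoseFactor(gtsam::Key icKey, gtsam::Key cbKey,
                  const gtsam::Pose3& measured,
                  const gtsam::SharedNoiseModel& model)
      : gtsam::NoiseModelFactor2<gtsam::Pose3, gtsam::Pose3>(model, icKey, cbKey),
        measured_(measured) {}

  // Error expressed in the 6-dimensional tangent space of SE3. The Jacobians
  // of the composition are provided by GTSAM itself; the extra Jacobian of
  // localCoordinates is neglected, as in GTSAM's own relative-pose factors.
  virtual gtsam::Vector evaluateError(
      const gtsam::Pose3& icT, const gtsam::Pose3& cbT,
      boost::optional<gtsam::Matrix&> H1 = boost::none,
      boost::optional<gtsam::Matrix&> H2 = boost::none) const {
    gtsam::Pose3 predicted = icT.compose(cbT, H1, H2);  // equation (16)
    return measured_.localCoordinates(predicted);
  }
};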

Now that we have shown how the custom factor is implemented, we turn to the description of the execution flow. Each time a new Vicon measurement is ready, we define a set of new system states, corresponding to the poses at that time instant. Afterwards, a new factor is created linking these new states with the reference, as shown in the structure in figure 12.

At the same time, IMU measurements relative to the time interval between the current and the previous Vicon measurement are accumulated and added to GTSAM's built-in CombinedImuFactor.

Both factors are then added to the factor graph. An initial guess for the new states is also needed, which is found by interpolation of the old states. Optimization is carried out incrementally on the existing graph, to which new states and factors are added at each iteration.
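A sketch of one such iteration is shown below, assuming a recent GTSAM API (exact signatures, in particular of CombinedImuFactor and the preintegration class, vary slightly between GTSAM versions) and illustrative names: pim is the IMU preintegration accumulated since the previous measurement, viconFactor an instance of the custom factor above, and guess the interpolated initial values for the new states.

#include <gtsam/inference/Symbol.h>
#include <gtsam/navigation/CombinedImuFactor.h>
#include <gtsam/nonlinear/ISAM2.h>
#include <gtsam/nonlinear/NonlinearFactorGraph.h>
#include <gtsam/nonlinear/Values.h>

using gtsam::symbol_shorthand::B;  // IMU bias states
using gtsam::symbol_shorthand::V;  // velocity states
using gtsam::symbol_shorthand::X;  // pose states

// One smoother iteration, triggered by the k-th VICON measurement.
void addMeasurement(gtsam::ISAM2& isam, size_t k,
                    const gtsam::PreintegratedCombinedMeasurements& pim,
                    const gtsam::NonlinearFactor::shared_ptr& viconFactor,
                    const gtsam::Values& guess) {
  gtsam::NonlinearFactorGraph newFactors;
  newFactors.push_back(viconFactor);
  newFactors.add(gtsam::CombinedImuFactor(X(k - 1), V(k - 1), X(k), V(k),
                                          B(k - 1), B(k), pim));

  // Incremental update: only the part of the Bayes tree affected by the new
  // factors is re-linearized and re-solved.
  isam.update(newFactors, guess);
  gtsam::Values estimate = isam.calculateEstimate();
  // estimate.at<gtsam::Pose3>(X(k)) is the latest pose of the body frame.
}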

Since sensor fusion is carried out using a smoother, all past states are retained. This allows a more accurate estimate of the itinerary to be retrieved at the end, once all the measurements are known. For example, figure 13 shows both the online and offline results for the X component of the accelerometer bias. It is clear how the knowledge of all measurements makes the offline results more accurate and stable compared to the online estimate. In fact, the offline estimate of the bias shows very small variations, validating the good performance of the IMU hardware. With this evidence in mind, one could further optimize the performance of the system by reducing the number of bias states included in the factor graph. In the current
