Department of Computer Science

Master Degree in Computer Science

Master Thesis

Parallel Obstacle Avoidance

with FastFlow

Candidate:

Nicola De Socio

Supervisor:

Prof. Marco Danelutto

Referee:

Prof. Giuseppe Prencipe


Contents

1 Introduction
2 Background
  2.1 Parallelism
    2.1.1 FastFlow
    2.1.2 OpenMP
  2.2 Image Processing algorithms
    2.2.1 Stereo Correspondence
    2.2.2 Optical Flow
    2.2.3 Scene Flow
3 Resolution Methods
  3.1 Single camera approach
    3.1.1 Motion Parallax
    3.1.2 Relative Size Cues
  3.2 Stereo camera approaches
    3.2.1 Elevation Map
    3.2.2 Scene Flow Segmentation
    3.2.3 Stereo Matching Segmentation
4 Our Approach
  4.1 Description
  4.2 Parallel Architecture
    4.2.1 Task parallelism
    4.2.2 Data Parallelism
  4.3 Algorithms
    4.3.1 Stereo Matching
    4.3.2 Stereo Matching and Optical Flow
    4.3.3 Multiple Signals
5 Implementation
  5.1 Application Steps
    5.1.1 Image Conversion
    5.1.2 Stereo Matching
    5.1.3 Optical Flow
    5.1.4 Pixel Labelling
  5.2 Parallel Components
    5.2.1 Emitter
    5.2.2 Workers
    5.2.3 Collector
    5.2.4 Parameters
6 Results
  6.1 Data Parallel algorithms
    6.1.1 OpenCV
    6.1.2 Split
  6.2 Stream Parallel Application
    6.2.1 Stereo
    6.2.2 Stereo Flow
  6.3 Obstacle Detection Results
7 Conclusions
Appendix A Source Code
  A.1 OpenCV test
    A.1.1 main.cpp
  A.2 Task
    A.2.1 task.h
  A.3 frameReader
    A.3.1 frameReader.h
    A.3.2 frameReader.cpp
  A.4 SceneFlow
    A.4.1 sceneFlow.h
    A.4.2 sceneFlow.cpp
  A.5 Pixel labelling
    A.5.1 pixelLabelling.h
    A.5.2 pixelLabelling.cpp
  A.6 Skeleton Code


Chapter 1

Introduction

At the current state of the art, many obstacle avoidance algorithms have been studied. The applications of those technologies range from robotic perception and motion to medical aids for visually impaired people. Lately, the automotive industry has also been investing in these algorithms for autonomous driving applications.

There are many possible approaches to obstacle avoidance problems, and many of them are classified by the required hardware. Usually the first element to be considered is the camera. Depending on the type of camera we have, we can easily separate three kinds of approaches [5]:

• single camera: It is the simplest configuration, and it usually applies to portable lightweight devices, for example small drones. Those approaches lack precision and they can see an obstacle only if it is near and approaching the device frontally [24] [36].

• stereo camera: It is the cheapest setting, in terms of hardware and weight, that can measure depth. It works for indoor and outdoor scenes and the measurements are fairly accurate, even if they lose precision proportionally with the distance. The drawback is that those methods are computationally intensive, they demand a high quantity of resources and power, and they are algorithmically very challenging [34] [11] [16]. There are some exceptions, which perform simplified obstacle avoidance with limited hardware, based on stereo matching [33].

• RGB-D camera: this camera is capable of capturing an image and a depth map of the scene using light beams. Those solutions are limited to indoor applications, due to the camera limitations, but they proved to be very precise in three-dimensional reconstruction and speed measurement of small and close objects [8].

Figure 1.1: (a) Smart glasses for visually impaired people, from eyesynth; (b) the iCub robot performing a grasping task; (c) Toyota self-driving car test; (d) the DelFly Explorer MAV.

In figure 1.1 we show some possible applications which use stereo vision to analyse the surrounding environment. This includes not only obstacle avoidance and obstacle detection, but also object recognition and shape reconstruction, as in the first two examples.

In 1.1a the smart glasses use a stereo camera setting to process information from the surroundings, and communicate with the visually impaired user through a vocal output [15].

Dangerous objects coming towards the observer may be detected considering their velocity, direction and position. In the same way, this tool may be used to help the user move without hitting any obstacle.

[16] presents a wearable guide system for visually impaired people, which includes a pair of cameras worn on the head and a haptic warning system on the ankle.

This research is focused on detecting free space where the user can walk. As feedback, the device on the ankle vibrates when close to an obstacle.

In 1.1b the iCub robot performs a grasping task. This task can be performed in various ways, and it is possible to use only the visual sensors, as shown in [32].

The first phase of this task, which is object recognition, is the same as the one used to detect an obstacle.

In 1.1c a self-driving car is shown, with a stereo camera mounted above the hood.

On the roof of the car there are also radar, lidar and gyroscope sensors. In such critical applications the obstacle avoidance and obstacle detection task is usually performed by fusing information from all the sensors on the car.

In [11] the author proposes a system for detecting obstacles while driving a car, using the stereo camera pair. In this kind of application it is also particularly important to satisfy the real-time constraints. Better accuracy may be achieved by applying a clever and careful sensor fusion process.

In 1.1d we can see a very small Micro Aerial Vehicle flying autonomously. It is very important for those drones to carry a very lightweight payload, as they have very little lifting force. A good choice is the stereo camera, which in this case weighs only 4 grams and has a very low power consumption compared to other active sensors.

The research in [33] describes in detail how this drone solves the flying task. This small drone uses stereo vision with a processing module onboard, and it implements a very efficient obstacle avoidance module.

As computer scientists, we are particularly interested in the stereo camera setting, which combines algorithmic and efficiency challenges. In particular our research will focus on optimizing both the accuracy of the solution and the running time, as we want to run the obstacle detection and avoidance algorithm in real time.

We will also discuss the single camera approach, giving a small overview of the current state of the art.

It is worth pointing out that the algorithms for the stereo setting partially apply also to the RGB-D camera approaches, with some notable differences. For example, we can skip the depth measurement computation, as the hardware already provides one.

This solution will of course be cheaper to implement and faster, but the other side of the coin is that we lose generality, as RGB-D cameras work well only for close objects, and we cannot apply those techniques outdoors.

The core topic of this thesis will then be the algorithms for obstacle detection and avoidance with a stereo camera setting, with a particular focus on efficient and real-time solutions. After a small introduction on the state of the art, we will present a solution we developed, whose parallel implementation on a many-core architecture runs in real time under some constraints.


We do not claim our solution to be very accurate, as our major effort was on efficiency and parallelism. Nevertheless our approach works reasonably well for medium-sized and large obstacles, and it can easily be improved without violating the time requirements.

The second chapter will present some useful background and notation we use in the thesis. Specifically, we first focus on parallel frameworks for the efficient development of applications, then we move to the algorithmic description of some general computer vision problems.

The third chapter will present a small survey on the state-of-the-art algorithms for both single and stereo camera settings.

In the fourth and fifth chapters we will describe our solution to the obstacle avoidance problem, which is well suited for small hardware and a stereo setting. While the fourth chapter provides an overview of the algorithmic design of our application, in the fifth chapter we will go in depth into the implementation details.

In the sixth chapter we will present some results on the time efficiency of the algorithms.

At first we measure the time spent for two of the general algorithms we present in the background chapter. We exploit the OpenCV [10] implementation, looking for the fastest possible one.

The OpenCV library is in fact the standard image processing library, which includes computer vision and machine learning algorithms. It will be very useful to our implementation, as we will use its functionalities to efficiently read and process images without memory leaks.

The second part is dedicated to our implementation, with some considerations on latency, service time and real-time constraints.

Finally, in the seventh chapter we draw our conclusions and we present some ideas for further development.


Chapter 2

Background

In this chapter we will describe the algorithms used in the papers and in our solution, along with the tools used for the implementation.

As far as the tools are concerned, we focus firstly on the programming abstraction we used, the algorithmic skeletons. Then we describe briefly two frameworks we use for implementation, FastFlow [30] and OpenMP [26].

In the second section we introduce various image processing algorithms, specifically those used to solve the Correlation Problem for depth reconstruction and/or apparent two-dimensional motion.

The implementation of those algorithms is provided by the OpenCV library [10].

2.1 Parallelism

There are many ways to design a parallel application. Generally speaking, the core problem to solve when designing a new parallel application is the code organization. We want to clearly separate the functional code, that is the algorithm we want to compute, from the non-functional code, which is needed for orchestrating the parallel execution of the functional code [9].

Many abstractions and frameworks were designed to solve this kind of problem. For high-level parallel programming we introduce the concept of algorithmic skeleton [14], that is the abstraction we use as a guideline for the basic structure of our parallel application.

Definition 1. A skeleton is a programming abstraction implementing a common and efficient parallel programming pattern. Such an abstraction is parametric, reusable between different applications and portable between different architectures.


Skeletons were defined in the first place by Murray Cole in 1988 [13] as higher-order functions for parallel programming; the definition then changed and evolved, becoming more and more abstract. On his webpage Cole describes skeletons simply as abstract patterns for parallel programming that leave out the low-level details.

The most commonly used algorithmic skeletons include stream and data parallel ones. We describe them following the definitions given in [2].

Stream Parallel Skeletons

Stream Parallel skeletons refer to parallel computations over independent data. If we have a long stream of data, then we can start computing an input in parallel before the previous one is completed. To be efficient, the stream length must be much bigger than the parallelism degree of the skeleton. This approach is able to reduce the time spent computing a group of inputs, but it cannot speed up the computation of a single element.

We report two examples. Suppose we have a stream $i_1, \dots, i_n$, and we need to compute a function $f$ on every element. We have an architecture with a parallelism degree of $m$ (which means $m$ logical processors, $c_1, \dots, c_m$).

For every example we also report the latency and the service time of that skeleton, given the completion time spent to compute the function, $T_c(f)$.

• Pipeline Skeleton: If we are able to write our function $f$ as a composition of $m$ functions $f = f_1 \circ f_2 \circ \dots \circ f_m$, then we can execute every part of the function on a different logical processor: $f_1$ on $c_1$, $f_2$ on $c_2$ and so on.

If $n \gg m$, then after a short setup time every decomposed function is working on different data. Specifically, after the completion of $f_1(i_k)$, we can execute in parallel $f_2(i_k)$ and $f_1(i_{k+1})$, then $f_3(i_k)$, $f_2(i_{k+1})$ and $f_1(i_{k+2})$, and so on.

The service time for each item is

$$T_{s_{pipe}} = \max\left(T_c(f_1), \dots, T_c(f_m)\right)$$

The latency is the time needed to traverse all the stages:

$$L_{pipe} = \sum_{j=1}^{m} T_c(f_j)$$

• Farm Skeleton: Sometimes it is not possible, or not convenient, to divide the original function $f$. In those cases, we can simply instantiate a set of workers, each computing the same function $f$, and then distribute the items among the workers.

A possible implementation of this skeleton is composed of an emitter $E$, which takes an item and sends it to an idle worker, the workers, which execute $f$ on a logical processor, and a collector $C$, which gathers the results.

Other possibilities are to consider the emitter as a worker itself or to use more than one emitter, if it becomes a bottleneck. Moreover there are some implementations which do not use a collector, but send the results directly to the next stage [27].

The service time for every item is

$$T_{s_{farm}} = \max\left(T_c(E), \frac{T_c(f)}{n_w}, T_c(C)\right)$$

Usually the most computationally expensive term is $\frac{T_c(f)}{n_w}$, but if we have enough physical cores, we can arbitrarily increase the number of workers and reduce the service time to the execution time of the emitter or the collector.

The latency is computed as

$$l_{farm} = T_c(E) + T_c(f) + T_c(C)$$
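To make the use of these formulas concrete, the following small C++ sketch (ours, not part of the thesis code; all names are illustrative) computes the farm service time for a given number of workers and the minimum number of workers needed to reach a target service time.

#include <algorithm>
#include <cmath>

// Sketch: farm service time for nw workers and the smallest number of
// workers needed to reach a target service time, following the formulas
// above. tE, tF and tC are the measured times of emitter, worker
// function and collector.
double farmServiceTime(double tE, double tF, double tC, int nw) {
    return std::max({tE, tF / nw, tC});
}

int workersForTarget(double tF, double targetServiceTime) {
    return static_cast<int>(std::ceil(tF / targetServiceTime));
}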

Data Parallel Skeleton

Data Parallel Skeletons are used to parallelize the computation of single items. Given a function $f$ and an input $i_k$, we split the input into $q$ smaller pieces $i_{k_1}, i_{k_2}, \dots, i_{k_q}$; then we compute the solution out of $f(i_{k_1}), \dots, f(i_{k_q})$ (computed in parallel), which are supposedly less computationally expensive. It is worth applying those patterns if the computation on smaller input data is faster, and if it does not create too many dependencies between the sub-problems.

We briefly present two data parallel skeletons:

• Map Skeleton: The map skeleton parallelizes the execution of our function on an array $x$. Provided that there are no data dependencies between the elements of the array, we can execute the function $f(x_i)$ for each index $i$ on a different logical processor.

The completion time of such a computation is given by the execution of the function on a single item, $T_c(f(x_i))$.

• reduce: This skeleton is applied to computations which combine the elements of an array $x$ of length $q$ with a commutative and associative operator $\oplus$. Instead of computing the solution $R$ iteratively as $R = R \oplus x_i$, we can organize the computation as a binary tree starting from the leaves: at the first level we compute in parallel

$$l_{1,1} = x_0 \oplus x_1,\quad l_{1,2} = x_2 \oplus x_3,\quad \dots,\quad l_{1,\frac{q}{2}} = x_{q-2} \oplus x_{q-1}$$

Then at the second level we take the results $l_{1,1}, l_{1,2}, \dots, l_{1,\frac{q}{2}}$ and apply the same computation structure, obtaining

$$l_{2,1} = l_{1,1} \oplus l_{1,2},\quad l_{2,2} = l_{1,3} \oplus l_{1,4},\quad \dots,\quad l_{2,\frac{q}{4}} = l_{1,\frac{q}{2}-1} \oplus l_{1,\frac{q}{2}}$$

We continue until we arrive at the root of the tree. For example, if the array has four elements, at first we compute $l_{1,1} = x_0 \oplus x_1$ and $l_{1,2} = x_2 \oplus x_3$ in parallel, then at the second level we compute $l_{2,1} = l_{1,1} \oplus l_{1,2}$, which is the root of the tree.

The time spent for the computation of the tree is proportional to its height; specifically it is $T_c(\oplus) \cdot \log_2 q$.
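The tree structure can be made concrete with a minimal sketch (ours, not taken from the thesis), using integer addition as the associative operator: each while iteration corresponds to one level of the tree, and the pairwise combinations inside a level are the independent operations a parallel implementation would distribute across processors.

#include <vector>
#include <cstddef>
#include <utility>

// Sequential illustration of the reduction tree: at every level adjacent
// pairs are combined, halving the array, so log2(q) levels suffice.
int tree_reduce(std::vector<int> x) {
    while (x.size() > 1) {
        std::vector<int> next((x.size() + 1) / 2);
        for (std::size_t i = 0; i + 1 < x.size(); i += 2)
            next[i / 2] = x[i] + x[i + 1];        // l_{k,i} combines two results of the previous level
        if (x.size() % 2 != 0)
            next.back() = x.back();               // odd leftover carried to the next level
        x = std::move(next);
    }
    return x.empty() ? 0 : x[0];
}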

It is worth pointing out that Stream and Data Parallel skeletons can be used together: they are independent of each other, and one kind of parallelism does not exclude the other.

In the following subsections we will describe two different frameworks for parallel implementation, FastFlow [30] and OpenMP [26]. The second one was not developed in the skeleton community, but it nevertheless provides constructs for skeletal programming.

We describe only a small subset of the frameworks, specifically the constructs that we will use in the code. For a deeper understanding of the full potential of those tools, we refer the reader to the corresponding websites and online tutorials.

2.1.1 FastFlow

FastFlow is a template-based framework written in C++. It provides building blocks for stream and data parallel skeletons, implemented on top of low-level parallel code, such as C++ threads, communication and synchronization mechanisms [3].

In this context skeletons are provided as parametric abstract classes. To instantiate those classes we first need to define the parametric types, and then implement all the abstract functions of the class.

If we want to build a stream parallel skeleton in FastFlow, the basic building block is the abstract class ff_node. It is a container for sequential code, and it can run in parallel with other ff_node instances. The input and the output data types are defined through templates.

To instantiate the class we only need to implement the svc method, the function that the node will compute. For example, this code instantiates a node:

struct task { /* ... fields of the stream item ... */ };

struct node : ff_node_t<task> {
    task* svc(task* t) {
        // do something with t
        return t;
    }
};

We then define how those nodes run in parallel, by instantiating the actual stream parallel skeleton.

Among the many stream parallel skeletons implemented in FastFlow, we describe the pipeline skeleton and the farm skeleton:

• ff_Pipe: It is composed of many stages, each of which is an ff_node. The output of a stage is the input of the next stage, until the end of the pipeline. The stages run in parallel: once the first stage has finished with the input data $i_k$, $i_k$ goes to the next stage and is computed in parallel with the first stage working on the input data $i_{k+1}$.

If we have already defined two ff_node structures stage_1 and stage_2, then we instantiate and run the pipeline as follows:

ff_Pipe<> pipe(make_unique<stage_1>(), make_unique<stage_2>());
if (pipe.run_and_wait_end() < 0) error("running pipe");

• ff_Farm: It has an emitter, a collector and many workers. The workers compute the same function on more than one input data item. If we have n workers, we can compute the same function in parallel on n inputs.

Every farm has an emitter and a collector. The programmer can optionally implement those nodes. If no emitter and collector are specified, then FastFlow adds default ones. It is possible to disable the collector, but not the emitter.

This simple example shows how to instantiate and run a farm with n workers computing stage_1:

std::vector<std::unique_ptr<ff_node>> nodes;
for (int i = 0; i < n; i++) {
    nodes.push_back(make_unique<stage_1>());
}
ff_Farm<task> farm(std::move(nodes));
if (farm.run_and_wait_end() < 0) error("running farm");

The pipe and the farm skeletons can be nested, creating for example a pipe inside a farm or a farm inside a pipeline.

As far as the data parallel skeletons are concerned, they are usually implemented inside a single node, to speed up a sequential computation.

They usually operate on a part of the input data. The two skeletons that we chose to analyse are the ParallelFor and the ParallelForReduce:

• ParallelFor: This skeleton is instantiated with a function, an iteration index and some shared variables. The function is executed in parallel for all the indexes. A common usage of the skeleton is to choose a pointer to the input data as a shared variable, and then make the function operate on a part of the data. It is suitable for computations where the input is an array and the function is applied independently to all the elements. For example, if we want to compute a function int compute(task* arg) on the array task input[N], storing the result in the array int output[N], then we instantiate the skeleton as follows:

ParallelFor pf;
pf.parallel_for(0, N,
    [&input, &output](const long i) {
        output[i] = compute(&input[i]);
    }
);

• ParallelForReduce: Similar to the skeleton previously described, it adds another function to combine all the partial results. The reduce function is very useful if we previously split the input data and we want to reconstruct the solution from the sub-problems.

If, after the function int compute(task* arg), we want to compute the sum of all the results, then we can use this second skeleton instance:

ParallelForReduce<int> pfr;
int sum = 0;
pfr.parallel_for(0, N,
    [&input, &output](const long i) {
        output[i] = compute(&input[i]);
    }
);
pfr.parallel_reduce(sum, 0, 0, N,
    [&output](const long i, int& sum) {
        sum += output[i];
    },
    [](int& sum, const int elem) {
        sum += elem;
    }
);

It is worth noticing that FastFlow by default overrides the scheduling policy of the operating system, by pinning each node to a core. This behaviour can be modified by compiling the executable with the -DNO_DEFAULT_MAPPING flag.

2.1.2 OpenMP

OpenMP is probably the most common shared memory parallel programming framework [26]. It is very successful because it is very simple to use: it provides only annotations for the compiler, thus requiring only minimal code modifications by the programmer. Like FastFlow, it manages all the low-level parallelism in an efficient way, if properly tuned.

We present a sample annotation:

#pragma omp ...
{
    // block affected by the pragma
}

Every annotation begins with #pragma omp. After every annotation there is a block of code surrounded by braces. The compiler modifies the code inside the braces accordingly. We present a few of the possible annotations:

• parallel [num_threads]: This annotation spawns some threads running the code contained in the following block. The number of threads can be specified, but this is not compulsory. If it is not specified, the compiler spawns as many threads as there are cores on the machine.

For example, if we want to spawn n threads, we write the following annotation:

#pragma omp parallel num_threads(n)

{

// thread code

}

• sections and section: This construct contains a set of structured blocks, each preceded by a section pragma. Every block is executed on a different thread.

In this example we create and execute two threads, each running different code:

#pragma omp sections

{

#pragma omp section

{

// code for thread 1

}

#pragma omp section

{

// code for thread 2

} }

• parallel for [schedule]: Provides a very easy way to implement a data parallel approach: it spawns a thread for each iteration of the following for loop. Potentially we can execute the whole loop in the same time spent for a single iteration.


Moreover we can define the scheduling policy of the parallel for: by default the scheduling is static, which means that at the start of the loop we assign to each thread a set of iterations.

If instead we set a dynamic scheduling policy, OpenMP assigns only one iteration to each thread. When a thread finishes, it will be assigned the next iteration not yet executed.

If the loop iterations all have the same execution time, then we apply the static scheduling policy. If instead we know that the execution time differs between iterations, then a dynamic scheduling policy distributes the work better among the threads.

A parallel loop in OpenMP with dynamic scheduling policy is implemented as follows:

#pragma omp parallel for schedule(dynamic)

for(int i = 0; i < N; i++) {

... }

If instead we want the scheduling to be static, we just remove the option schedule(dynamic), as the default behaviour is static scheduling.

2.2 Image Processing algorithms

The algorithms in this section are important background for understanding all the approaches to obstacle detection. We describe Stereo Matching, Optical Flow and Scene Flow. The following formal definitions are a necessary step for making further explanations easier.

Definition 2. An Image is a function $I : \mathbb{R}^2 \to \mathbb{R}^n$ where the output is the intensity value in the colour space $\mathbb{R}^n$, and the input is a vector $(x, y) \in \mathbb{R}^2$ which represents the position of the requested value in the image. The image has as boundaries a width $w$ and a height $h$, beyond which the output value is undefined.

For simplicity we consider only black and white images, meaning that the colour space always has $n = 1$. This choice extends seamlessly to every algorithm, as for each colour space there exists a conversion function $C : \mathbb{R}^n \to \mathbb{R}$ that we can implicitly apply when we process a new image. The function is obviously not invertible, but this limitation will not create any kind of problem.

Figure 2.1: A stereo image pair, before and after rectification. Some scan lines are highlighted in red.

Definition 3. A pixel $p$ is a point in $\mathbb{R}^2$ representing its position inside an image $I$. It is important for a pixel to be inside the image boundaries. Formally, given the image width $w_I$ and height $h_I$,

$$p = (x, y) \in \mathbb{R}^2 \ \text{s.t.}\ 0 \le x \le w_I,\ 0 \le y \le h_I$$

Definition 4. The function $N(p, r)$ represents a neighbourhood of width and height $r$ of a pixel $p = (p_1, p_2)$. It returns as result the following set of pixels:

$$\{(x, y) \in \mathbb{R}^2 \ \text{s.t.}\ |x - p_1| \le r,\ |y - p_2| \le r\}$$

Definition 5. A video is an ordered sequence $I_1, \dots, I_n$ of images, each taken at a time distance $t$ from the previous one.

Definition 6. A Stereo Pair is a pair of images $I^l$, $I^r$, taken from two cameras at the same time, from different positions. We denote by $R$, $T$ the Rotation and the Translation from the centre of the left image to the centre of the right image.

Given the stereo pair images $I^l$ and $I^r$, an important operation is the so-called Image Rectification: this process applies a shearing to both image planes, in order to put the corresponding points on the same horizontal lines. As we can clearly see from figure 2.1, the scan lines are aligned after the image transformation.

Definition 7. A Stereo Video is a sequence of Stereo Pairs, each taken at a time distance $t$ from the previous one. A Rectified Stereo Video is a Stereo Video where the image pairs are already rectified.

2.2.1 Stereo Correspondence

We refer to Stereo Correspondence [29] as the problem of finding the corresponding points in a stereo image pair. The algorithms that solve this problem provide as solution a mapping, called disparity map, between the pixels in the left image $I^l$ and the pixels in the right image $I^r$. In particular, the solution is the function $D$ where, if the pixel at position $p_l = (x_l, y_l)$ in the left image $I^l$ is matched to the pixel at position $p_r = (x_r, y_r)$ in the right image $I^r$, then

$$D(x) = |p_l - p_r|$$

If the stereo images are rectified, then the correspondences are always on the same line: $y_l = y_r$ and $D(x) = |x_l - x_r|$.

There are two main approaches to the solution of the stereo matching problem: we refer to them as Local and Global approaches [29]. In both techniques the algorithm defines a matching cost function M(p, d), with p and d ∈ R2, that computes the error between two blocks of the image.

Specifically it matches the patch around the pixel at position p in the left image with the patch around the pixel in position p + d in the right image.

The above computation depends on the neighbourhoods of the pixels $p$ and $p + d$. A common and efficient choice of the matching cost function is the sum of absolute differences, given by

$$M(p, d) = \sum_{q \in N(p, r)} |I^l(q) - I^r(q + d)|$$

If the images are rectified beforehand, then we have to search for the correspondences only on a horizontal line. With this property we can redefine $N(p, r)$ as the set of points between $(p_x - r_x, p_y)$ and $(p_x + r_x, p_y)$, greatly reducing the search range and the matching cost function complexity.

Global Methods

The global methods aim at finding the disparity map $D$ as the minimum of the function

$$\min_{D} \left( \sum_{p \in I^l} M(p, D(p)) + \lambda \sum_{p \in I^l} \rho(D(p)) \right)$$

where the first term is called the data term, while the second is the smoothness term. The first term measures the similarity of all the matched patches, and the smoothness term penalizes solutions with a high variation between neighbouring disparities. A good choice for the $\rho$ function is the gradient function, which measures exactly how fast the function changes locally. Also $\lambda$ is an important parameter, as it measures how much the smoothness term penalizes the matching cost function.

Local Methods

Local methods take a different approach to finding the best disparity. If we want to find the disparity value $D(x) = d$, we take a patch around $I^l(x)$ and we search for the best correspondence in $I^r$. As the name suggests, the range of the search is local, and the best correspondence in the search range is supposed to be also the best correspondence possible in the whole image. The value of the disparity map for a given pixel point $x$ can then be computed as

$$\min_{d} \left( \sum_{p \in N(x, r)} M(p, d) \right)$$

Both the block size and the maximum search range are parameters of the algorithms. Those parameters are very important, as they heavily affect the quality of the results and the time spent for an iteration.

Although global methods are often more accurate than local methods, the local approaches are preferred in real-time constrained applications: those methods have lower complexity, and furthermore they have no data dependencies in the computation of disparity values at different pixels, so they can be easily parallelized [29].
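As an illustration of the local approach, the following sketch (ours, not the thesis implementation) performs naive block matching with the sum of absolute differences on a rectified grayscale pair; the window radius r and the maximum disparity maxDisp are the parameters mentioned above.

#include <opencv2/core.hpp>
#include <cstdlib>
#include <limits>

// Naive local block matching on a rectified pair (CV_8UC1 images).
// For every pixel we scan disparities 0..maxDisp along the same scan
// line and keep the one minimizing the SAD over a (2r+1)x(2r+1) window.
cv::Mat blockMatchSAD(const cv::Mat& left, const cv::Mat& right,
                      int maxDisp, int r) {
    cv::Mat disp(left.size(), CV_16U, cv::Scalar(0));
    for (int y = r; y < left.rows - r; ++y) {
        for (int x = r + maxDisp; x < left.cols - r; ++x) {
            int bestCost = std::numeric_limits<int>::max();
            int bestD = 0;
            for (int d = 0; d <= maxDisp; ++d) {
                int cost = 0;
                for (int dy = -r; dy <= r; ++dy)
                    for (int dx = -r; dx <= r; ++dx)
                        cost += std::abs(left.at<uchar>(y + dy, x + dx) -
                                         right.at<uchar>(y + dy, x + dx - d));
                if (cost < bestCost) { bestCost = cost; bestD = d; }
            }
            disp.at<ushort>(y, x) = static_cast<ushort>(bestD);
        }
    }
    return disp;
}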

2.2.2 Optical Flow

The Optical Flow algorithms aim at finding the apparent motion, between two consecutive frames of a video, of the objects in the image with respect to the observer. They solve the so-called Image Alignment problem, which consists of moving, and possibly deforming, a portion of a frame to match the other [4].

Specifically, in its simplest formulation, it consists in finding a displacement function $F(x)$ that, for each pixel $x$, minimizes some measure of the difference between $I_1(x)$ and $I_2(x + F(x))$ [23].

The formal definition, which may seem very similar to the one given for the Stereo Correspondence problem, presents two important differences. The first one is about the displacement function $F(x)$: while the disparity function $D(x)$ measures a scalar distance, the displacement function $F(x)$ is a vector in $\mathbb{R}^2$, measuring a shift direction.

Moreover, the correspondence problem benefits from the techniques of Image Rectification, which reduce the search range to a horizontal line. In the Optical Flow problem, instead, the search range for the vector $h$ is always two-dimensional.

In fact the Image Alignment problem is a generalization of both Stereo Matching and Optical Flow, and with some changes an Optical Flow algorithm can also solve the stereo correspondence problem [23] [22].

The problem can be solved differently from the block matching approach by exploiting an important property: since we assume that the images $I_1$ and $I_2$ belong to a video stream, and are thus taken with a small time delay, the displacement vector $h$ is always small.

Following this intuition we can approximate the pixel value in the new image with a first-order Taylor approximation: given a pixel $x$ in the image $I_1$, we can predict its value in the next image as

$$I_2(x + h) \approx I_2(x) + h\,\frac{\partial I_2}{\partial x}$$

The methods exploiting this property are known in the literature as differential methods. The error we make by choosing $h$ as displacement vector is

$$E(p, h) = \sum_{x \in N(p, r)} \left(I_1(x) - I_2(x + h)\right)^2$$

and, using the Taylor approximation,

$$E(p, h) = \sum_{x \in N(p, r)} \left(I_1(x) - \left(I_2(x) + h\,\frac{\partial I_2}{\partial x}\right)\right)^2$$

There are various techniques to obtain the displacement function $F$. As before, we can separate the global methods and the local methods. The former were introduced by Horn and Schunck in [20], while the latter were introduced by Lucas and Kanade in [23].

In addition to those, we present another local method, called Dense Inverse Search [22]. It is a very fast method at the current state of the art, and it is suitable for real-time applications.

Global methods

The global techniques try to find the whole function $F$ in a single optimization problem. We only state the problem, and we do not describe the iterative solution.

The function to be minimized includes a data term, which is exactly $E(p, h)$ approximated with Taylor, and a smoothness term, which penalizes abrupt changes of $F$ using the derivative. The function to be minimized in the Horn and Schunck framework is

$$\min_{F} \left( \sum_{p \in I_1} E(p, F(p))^2 + \lambda^2 \sum_{i=1}^{2} \|\nabla F(p)_i\|^2 \right)$$

Local methods

If we want to solve the optimization problem locally, we have to find the best displacement vector $F(p)$ such that

$$F(p) = \arg\min_{h} \left( \sum_{x \in N(p, r)} E(p, h) \right)$$

Since $h \in \mathbb{R}^2$, the optimization problem is two-dimensional and to solve it we have to apply iterative techniques.

The minimization problem is solved by Lucas and Kanade as a classical Gauss-Newton gradient descent algorithm [4]. The minimization function is derived from the error function with the Taylor expansion.

The algorithm in [22], called Dense Inverse Search, is a modification of the original Lucas-Kanade algorithm. In particular, it modifies the Gauss-Newton minimization function in order to use a constant approximation of the Hessian matrix. This improvement makes single iterations much faster, and the algorithm reaches real-time performance.
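For reference, recent OpenCV releases (4.x) ship an implementation of this algorithm as cv::DISOpticalFlow; the following minimal sketch, with illustrative variable names, shows how a dense flow field could be obtained from two consecutive grayscale frames.

#include <opencv2/video.hpp>

// Sketch: dense optical flow with OpenCV's Dense Inverse Search
// implementation. prev and next are consecutive grayscale frames;
// flow is filled with a CV_32FC2 displacement field F(x).
void computeDIS(const cv::Mat& prev, const cv::Mat& next, cv::Mat& flow) {
    auto dis = cv::DISOpticalFlow::create(cv::DISOpticalFlow::PRESET_FAST);
    dis->calc(prev, next, flow);
}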

2.2.3 Scene Flow

The last problem formulation we tackle is a combination of Optical Flow and Stereo Matching. It finds the three-dimensional displacement vector between an object and the observer. Given two rectified stereo pairs from a video, $I_1^l$, $I_1^r$, $I_2^l$ and $I_2^r$, Scene Flow algorithms return as result a three-dimensional field in $\mathbb{R}^3$.

Figure 2.2: Motion and disparity constraints for a single pixel $x$.

The term Scene Flow was first introduced in [31] as the three-dimensional motion field whose two-dimensional projection is the Optical Flow.

We focus, as in the previous sections, on the block matching approaches, which proved to be the fastest methods.

Differently from the Image registration problem, we have four images, and a pixel can be matched in all the images. In figure 2.2 we show how the pixel position changes in the images.

We assume $I^l$ and $I^r$ to be rectified. If $d_i = (d_{i1}, 0)$ with $i = 1, 2$ are the stereo correspondences between $I_i^l$ and $I_i^r$, and $q = (q_1, q_2)$ is the Optical Flow solution, then we have the following relations for the pixel $x$:

• stereo correspondences, for $i = 1, 2$:

$$I_i^l(x) = I_i^r(x + d_i)$$

• optical flow correspondences:

$$I_1^l(x) = I_2^l(x + q) \qquad I_1^r(x + d_1) = I_2^r(x + d_1 + q)$$

• cross terms:

$$I_1^l(x) = I_2^r(x + d_2 + q)$$

$$I_1^r(x + d_1) = I_2^r(x + d_2 + q)$$

$$I_1^l(x) = I_2^l(x + d_1 + q - d_2)$$

With a metric $T$ (sum of absolute differences, sum of squared differences or others) and one of the relations depicted above, we can exploit those relations in a block matching fashion. For example, to find the solution of the first cross-term relation on a pixel $p$ with the unknown $u = d_2 + q = (d_{21} + q_1, q_2)$, we solve

$$\arg\min_{u} \left( \sum_{x \in N(p, r)} T(x, x + u) \right)$$

Beyond the classical Global-Local distinction, which holds also for scene flow algorithms, it is important to point out that in this case the problem is over-constrained, as there are many different combinations of relations we can exploit.

Another distinction to highlight is that with this block matching approach the variables to be optimized are four: the two disparity values $d_{11}$ and $d_{21}$ and the optical flow $q$. We can either optimize the variables together, choosing a coupled approach, or we can decouple the problem into two Stereo Correlation problems and an Image Registration problem [34].

The coupled approach is more robust and mostly provides consistent results. Conversely, the decoupled approach can lead to bad and inconsistent results, but it is computationally much faster.

Due to the huge optimization space, no coupled approach reaches real-time performance at the current state of the art.

Another interesting approach, which leads to real-time performance on small-resolution images, as explained in [34] and [28], is to use a Stereo Matching algorithm to solve part of the problem, and then integrate the depth values to resolve the Scene Flow in a robust way for the Optical Flow unknowns.

In this way we reduce the unknown space and we greatly improve the running time of the method, while preserving accuracy.


Chapter 3

Resolution Methods

In this chapter we present some specific solutions to the obstacle avoidance problem. In our small survey we considered algorithms applied to:

• small Micro Aerial Vehicles (MAV)

• autonomous cars

• ETA (Electronic Travel Aids) for visually impaired people

We group all the approaches according to the hardware used by the solutions. If we have a single camera, we can approximately detect motion and objects coming toward us, but we do not have a direct measure of the depth of the obstacles.

On the other hand, stereo cameras have more information available, and potentially the approaches with this hardware setting can be much more accurate than single camera ones.

The drawback is obviously related to efficiency: if we have very limited computational power, for example in the pocket drone we described in the introduction, we cannot exploit all the information we receive, thus we are forced to drop a lot of data without even having the possibility to read it.

We do not present all the possible methods with single and multiple cameras, but we try to give some examples of how we can best exploit the data we receive.

Not every approach is suitable for all the applications we have considered above. Some problems require a better quality of the result, or simply the algorithms are too computationally expensive to be solved in a reasonable time with the available hardware.

In table 3.1 we indicate which methods are optimal for each of the problems stated above.


                          MAV        Autonomous driving   ETA
Motion Parallax           yes [36]   no                   no
Relative Size Cues        yes [24]   only robots [12]     no
Elevation Map             no         yes [25]             yes [16]
Scene Flow Segmentation   no         yes [34]             yes [1]
Stereo Matching           yes [33]   no                   no

Table 3.1: Classification of current state-of-the-art solutions for obstacle detection and avoidance on Micro Aerial Vehicles, on autonomous driving robots and cars, and on Electronic Travel Aids for visually impaired people. Some of those methods are developed using a single camera, while others need a stereo camera setting.

The following sections are dedicated to explaining the methods in the table, divided by single and multiple cameras.

In particular, for the single camera setting we will address a class of algorithms which exploits the Optical Flow and another class of algorithms which uses feature matching.

For the stereo camera setting we present a solution based on a height map of the obstacles; another one which segments the scene flow result to detect moving obstacles, and the last one which exploits only the stereo matching algorithm.

3.1 Single camera approach

In a single camera setting we have limited information, but it is sufficient to determine the relative frontal direction and velocity between the objects and the camera.

We will discuss two possibilities to implement collision avoidance on a monocular system without making further assumptions on the environment: using optical flow and motion parallax, or using the relative size cues of matched features [24].

3.1.1 Motion Parallax

Motion parallax is a type of depth perception cue in which closer objects apparently move faster than further objects.

The approach we study exploits optical flow. Due to the motion parallax property, the optical flow magnitude is inversely proportional to the distance from the camera.


We can determine that an obstacle is approaching if the motion pattern diverges from the centre. Specifically, the optical flow is larger in the direction from which the obstacle is coming.

A good measure of this specific pattern, as suggested in [36], is given by the difference between the optical flow values on the left and on the right. Specifically, given the width $w$ and the height $h$ of the images, we first compute the optical flow $F = (F_x, F_y)$, then we divide the field into two sets:

$$R = \left\{(x, y) \in \mathbb{R}^2 \ \text{s.t.}\ \tfrac{w}{2} \le x \le w,\ 0 \le y \le h\right\} \qquad L = \left\{(x, y) \in \mathbb{R}^2 \ \text{s.t.}\ 0 \le x \le \tfrac{w}{2},\ 0 \le y \le h\right\}$$

The value

$$\sum_{x \in R} F_x(x) - \sum_{x \in L} F_x(x)$$

measures how large the optical flow divergence from the centre is. If it is bigger than a predefined threshold $\tau$, then the algorithm triggers a signal.
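A minimal sketch of this test, assuming a dense flow field stored as an OpenCV CV_32FC2 matrix (our illustration, with made-up names):

#include <opencv2/core.hpp>

// Left/right flow-divergence test: sum the horizontal flow component on
// the right and on the left half of the image and compare the difference
// against the threshold tau.
bool divergenceSignal(const cv::Mat& flow, float tau) {
    double sumRight = 0.0, sumLeft = 0.0;
    for (int y = 0; y < flow.rows; ++y)
        for (int x = 0; x < flow.cols; ++x) {
            float fx = flow.at<cv::Vec2f>(y, x)[0];   // F_x at this pixel
            (x >= flow.cols / 2 ? sumRight : sumLeft) += fx;
        }
    return (sumRight - sumLeft) > tau;
}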

When we detect a similar behaviour, it means that the obstacle is already very close to the camera.

This technique is good for micro aerial vehicles that can quickly change their direction, or for robots moving at low speed, but it is not feasible for autonomous driving, since in this case there is not enough time to brake after the obstacle detection.

In any case, the obstacle detection range varies depending on the threshold $\tau$. In [36] the authors claim to detect walls at a distance of two meters, and corners at a distance of four meters.

Another drawback of this method concerns frontal objects of small or average size that do not move away from the centre of the image. They do not show the pattern we described, and are not detected by the algorithm.

3.1.2 Relative Size Cues

The relative size methods rely on feature matching and measure the rate of expansion of the matched features.

If we can detect and match at least one feature of an object in two consecutive frames, then we can measure the size difference between those features. As the object gets closer to the camera, the size of its matched features becomes bigger.

Moreover, the rate of expansion of the features gives us information about the relative velocity between the object and the camera.


We follow the approach described in [24]: summarizing the algorithm, it first finds some suitable features to track between different images, then it measures the expansion rate of the matched features. If in the image there is a group of close features with the same expansion rate, then there is an obstacle approaching.

We now describe in depth the two steps of this obstacle avoidance algorithm.

Features Matching and Tracking

Generally speaking, a feature matching algorithm first tries to build a descriptor from some selected pixels, then it defines a matching function to search for that descriptor in the second image.

The pixels chosen for matching usually have a high intensity variation in at least two directions with respect to the neighbouring pixels.

Given a descriptor $D(p_1)$ of an area around a pixel $p_1$ in the image $I_1$, the matching function is defined as

$$\min_{p_2} \left( M(D(p_1), D(p_2)) \right)$$

where $M$ computes a matching error between the patches around the pixels $p_1$ and $p_2$ of $I_1$ and $I_2$ respectively.

The patches D(p) usually have an orientation, which corresponds to the biggest intensity variation direction, a size and a scale value.

In [6] a very fast approach is described to detect features to track and to build descriptors from the image. The feature detector is known as SURF, which stands for Speeded-Up Robust Features.

It uses the Hessian matrix (second-order derivative of the image at a pixel point $x$) to find those points which have a high intensity variation in at least two directions.

SURF descriptors have the important property of being invariant to rotations, translations and scale changes, and they provide a fast and robust solution for the matching task.

The standard descriptors in SURF are built around a 9 × 9 patch of the image around a pixel $p$ with a scaling of 1.2.

In the matching process the scale factor can be modified, by taking bigger patches and then subsampling the resulting image in a classical image pyramid fashion. The descriptor is then computed from the subsampled image, which has the same size as the original 9 × 9 patch.
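A minimal sketch of detection and matching with SURF, using OpenCV's implementation (which lives in the opencv_contrib xfeatures2d module); variable names and the Hessian threshold value are illustrative, not taken from the thesis code:

#include <opencv2/features2d.hpp>
#include <opencv2/xfeatures2d.hpp>
#include <vector>

// Detect SURF keypoints in two consecutive frames and match their
// descriptors with a brute-force L2 matcher.
void surfMatch(const cv::Mat& frame1, const cv::Mat& frame2,
               std::vector<cv::KeyPoint>& kp1,
               std::vector<cv::KeyPoint>& kp2,
               std::vector<cv::DMatch>& matches) {
    auto surf = cv::xfeatures2d::SURF::create(400.0);  // Hessian threshold
    cv::Mat desc1, desc2;
    surf->detectAndCompute(frame1, cv::noArray(), kp1, desc1);
    surf->detectAndCompute(frame2, cv::noArray(), kp2, desc2);
    cv::BFMatcher matcher(cv::NORM_L2);
    matcher.match(desc1, desc2, matches);
}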

Our obstacle avoidance algorithm uses the SURF descriptors as a first step, to match corresponding pixels $p_i$ and $p_j$ on the images $I_1$ and $I_2$ respectively.

The scale factor is important in the matching process: if it is bigger in the second frame, it means that the object is coming closer to the camera.

Expansion Rate Computation

The scales computed from the original SURF matching are not accurate enough to compute the correct velocity of the object and, moreover, they can lead to false positives.

To avoid this we take all the SURF matched keypoints with a bigger scale in the second image, and we perform an additional scale computation using a bigger patch of 20 × 20 pixels. We call these new templates $D_{20}(p)$.

To this end we redefine the template size of the patch around the pixels $p_1$ and $p_2$ at the unknown scale $s$ as

$$D_{20}(p_1).size = D(p_1).size \cdot \frac{1.2}{9} \cdot 20 \qquad D_{20}(p_2).size = D(p_1).size \cdot \frac{1.2}{9} \cdot 20 \cdot s$$

and then the resulting scale can be found by solving the following problem:

$$\min_{s} \left( \frac{M(D_{20}(p_1), D_{20}(p_2))}{s^2} \right)$$

If the best scale s is big enough, then we can conclude that the point feature is expanding, and the object is coming towards the camera.

The algorithm described works well for images with a lot of texture, where we can find many features to track, but it may fail if, on the contrary, the objects lack sufficient texture and no features can be detected by the SURF algorithm.

Moreover, it suffers from the same problem described in the previous subsection: the detection range is small, thus requiring a high reactivity of the robot (or the car or the MAV) to effectively avoid the obstacle.

3.2 Stereo camera approaches

We now move to the stereo camera approach. The monocular approach is suitable only for applications with limited memory and few processing units, such as small MAVs. It does not provide sufficiently robust and complete information for the other applications we consider.


The stereo camera is instead a feasible approach for all three applications, as it can provide very accurate results if the application needs them, but there are also algorithms designed to run with limited computational power. In [7] the authors provide a survey of the current stereo methods for obstacle detection in autonomous driving.

We present two of the approaches identified by the article. Those methods are fairly general, and can be used also for ETAs, with some minor modifications.

In the last subsection we describe a very quick stereo correlation algorithm, called LongSeq, which enables a pocket drone with a stereo camera and a small processing unit to perform obstacle avoidance on-board in real time.

3.2.1 Elevation Map

Elevation maps are built from the result of a stereo matching algorithm, preferably in a point cloud fashion. We have a set of points (X, Y, Z), with the origin at the centre of the camera.

Basically, the procedure separates the points belonging to the pavement plane from the ones belonging to obstacles. It is called Elevation Map because it marks the points which are higher than the pavement plane, identifying them as obstacles.

We follow the approach described in [25] and describe the main steps. First we need to estimate a planar surface from the point cloud. We assume that the points belonging to the road are much more frequent than the points belonging to the obstacles.

After that we cluster the points that do not fit the planar assumption. Those points are marked as obstacles.

Quadratic Plane Model

For the pavement plane estimation we employ a quadratic model, as we also want to include the longitudinal and lateral curvatures in our model. The surface model has five unknowns, $a$, $a_0$, $b$, $b_0$ and $c$, and it is defined as

$$M_Y(X, Z) = a_0 X^2 + a X + b_0 Z^2 + b Z + c$$

After we have found the plane equation, the points not satisfying this relation are identified as possible obstacle points.

To fit the surface we use some points and then we run RANSAC [17] to find the plane equation, using an error function $E(X, Y, Z, M_Y)$ that measures the deviation of each point from the surface.

The initial set of points is defined as a cubic region of the point cloud in front of the camera, with small Z and small Y.

Those initial points are very important for the whole plane fitting, and they can be further filtered:

• if we have an estimate of the plane $M_Y^0$ from the previous frame, then we reject all the points $(X, Y, Z)$ for which the error $E(X, Y, Z, M_Y^0)$ is too high;

• We perform edge detection on the rectangle of the image corresponding to the point cloud region. If we can fit a line, then all the points which are on the opposite side of that line, with respect to the camera, are rejected.

For obstacle detection we fuse the results of two classifiers based on different principles: the first one computes the density of the cloud points, while the second one computes the height of the points with respect to the plane model.

We consider a square region of size L × L in the XZ plane. The projection of the region in the left image has a trapezoidal shape with average width Cw and height Ch.

The projection of a plane onto the image is a classical geometric problem: if we have the equation of the plane as stated above, the projection matrix $\Pi$ and the point $P_1 = (P_{1X}, P_{1Y}, P_{1Z})$, then the points in the image are defined as follows:

$$x_1 = \Pi(P_1)$$

$$x_2 = \Pi(M_Y(P_{1X} + L, P_{1Z}))$$

$$y_2 = \Pi(M_Y(P_{1X} + L, P_{1Z} + L))$$

$$y_1 = \Pi(M_Y(P_{1X}, P_{1Z} + L))$$

We can compute the average width and the height as

$$C_w = \frac{|x_{2x} - x_{1x}| + |y_{2x} - y_{1x}|}{2} \qquad C_h = |y_{1y} - x_{1y}|$$


Point Density Classifier

As far as the point cloud density principle is concerned, the core idea to detect obstacles is to consider the expected density of road points on a region with respect to the actual number of points on that region.

A region with an obstacle will not have a flat surface, thus its point cloud density will be higher than the density of a plane.

The expected road density of a region is estimated as the area of the trapezoidal projection:

$$E_{rd} = C_h \cdot C_w$$

The size of the squares defines the grain of the final elevation map: big regions may lead to an inaccurate and too coarse estimation of the obstacles, but they also improve the speed of computation.

The actual road density is instead computed by averaging the counters of close regions. The averaging range grows as we consider regions further from the observer, since the low number of points in far regions may lead to inaccurate results.

If $Cnt(x, y)$ measures the number of points of the cell $(x, y)$, then the measured density $md(x, y)$ can be computed as

$$md(x, y) = \frac{1}{\frac{1}{C_h} + 1} \sum_{k = -\frac{1}{4 C_h}}^{\frac{1}{4 C_h}} Cnt(x + k, y)$$

The measured density is compared to an expected density $T_h$, which tolerates a minimal slope of the plane. $T_h$ is computed as

$$T_h = \frac{E_{rd}(slope = 0\%)}{E_{rd}(slope = 40\%)}$$

The projection of a region with a slope is computed as above, modifying the plane equation $M_Y$ accordingly.

To ensure consistency, we derive two thresholds from the expected density $T_h$. A region is identified as an obstacle if one of the two following conditions holds true:

• $md(x, y) \ge T_h \cdot E_{rd}$

• $md(x, y) \ge T_h$


Height Classifier

The height classifier is simpler than the density-based one: it compares the average height of the cell with the expected height of the planar surface. In particular, for each cell we find the average height $h_{avg}(x, y)$ and fix a maximum variation threshold $\Delta h$. Given the height of the plane $h_{plane}$, we classify the cells as follows (a small sketch of this rule is given after the list):

• If $h_{avg}(x, y) \le h_{plane}$, then the region is classified as road.

• If $h_{plane} \le h_{avg}(x, y) \le h_{plane} + \Delta h$ and the cell was not classified as obstacle by the density-based classifier, then it is considered a small obstacle (traffic isle).

• If $h_{avg}(x, y) > h_{plane} + \Delta h$, then the cell is classified as obstacle.
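The rule above can be written as a small decision function; this sketch is ours and the names (classifyCell, CellClass) are illustrative.

enum class CellClass { Road, TrafficIsle, Obstacle };

// h_avg: average point height of the cell; h_plane: expected road height
// at that cell; delta_h: tolerated variation; densityObstacle: whether
// the density classifier already flagged the cell as obstacle.
CellClass classifyCell(double h_avg, double h_plane, double delta_h,
                       bool densityObstacle) {
    if (h_avg <= h_plane) return CellClass::Road;
    if (h_avg <= h_plane + delta_h)
        return densityObstacle ? CellClass::Obstacle : CellClass::TrafficIsle;
    return CellClass::Obstacle;
}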

As far as the execution time is concerned, the paper presents a parallel implementation which operates on 498 × 468 images and reaches, with its fastest configuration, an average processing time per frame of 10 ms on a dual-core Pentium.

It is also worth pointing out that Elevation Map applications, even if primarily applied to autonomous driving, can also be employed in Electronic Travel Aid systems.

Those algorithms can successfully recognize free space and can guide a visually impaired user through a crowded environment. An example of implementation can be found in [16].

It is instead not possible to apply this solution to Micro Aerial Vehicles, as for flying objects there is no pavement plane to be extracted, and we cannot discriminate obstacles using their height.

3.2.2 Scene Flow Segmentation

We already presented the Scene Flow problem in the previous chapter. Those methods provide a motion flow of the objects in three-dimensional space. With this information we can easily detect an incoming collision and the velocity of the object.

In this subsection we suppose we already have the scene flow field. We report a method for scene flow segmentation as proposed in [34], together with a measure of the likelihood that a pixel is moving.

After the segmentation is computed, we can estimate the direction of objects moving in the scene, and we can determine if any of those objects will hit the camera.


We recall that the Scene Flow algorithm minimizes the error function for the unknowns $d_1$, $d_2$ for the stereo matching, and $u$ and $v$ for the optical flow.

In [34] the authors substitute $d_2$ with the disparity change unknown $p = d_2 - d_1$. We also adopt this notation.

Suppose we have two image pairs for which we have already computed the flow field, and we have transformed all the points from image coordinates to world coordinates, inverting the projection

$$\begin{pmatrix} x \\ y \\ d \end{pmatrix} = \frac{1}{Z} \begin{pmatrix} X f_x + x_0 \\ Y f_y + y_0 \\ b f_x \end{pmatrix}$$

where $(x_0, y_0)$ are the coordinates of the centre of the image, $(f_x, f_y)$ is the focal length and $b$ is the baseline distance between the two cameras. We define the transformed flow field with the notation

$$M(x, y) = F_{transf}(x, y)$$

omitting the arguments of the function when they are clear from the context.

Motion Likelihood

We additionally assume that for every unknown variable $d$, $u$, $v$ and $p$ in the Scene Flow algorithm we have somehow computed the uncertainty measures $u_d$, $u_u$, $u_v$ and $u_p$, which express the probability of those unknowns having the correct value.

Then the covariance matrix estimates the joint uncertainty of all the variables: given a flow field point $M = F_{transf}(x, y)$, the matrix is given by

$$\Sigma_M = J\, \mathrm{diag}(u_d^2, u_u^2, u_v^2, u_p^2)\, J^T$$

where $J$ is the Jacobian matrix of the partial derivatives

$$J = \begin{pmatrix} \frac{\partial M_x}{\partial d} & \frac{\partial M_x}{\partial u} & \frac{\partial M_x}{\partial v} & \frac{\partial M_x}{\partial p} \\ \frac{\partial M_y}{\partial d} & \frac{\partial M_y}{\partial u} & \frac{\partial M_y}{\partial v} & \frac{\partial M_y}{\partial p} \\ \frac{\partial M_z}{\partial d} & \frac{\partial M_z}{\partial u} & \frac{\partial M_z}{\partial v} & \frac{\partial M_z}{\partial p} \end{pmatrix}$$

If we assume the world to be stationary, then we expect a standard normal distribution $N(0, \Sigma_M)$. We can test this stationarity assumption by computing the deviation of the vector $M$ from this random variable. The variable $\xi_M$, defined as the Mahalanobis distance

$$\xi_M(p) = \sqrt{M^T \Sigma_M^{-1} M}$$

is $\chi^2$-distributed, and for each pixel $p$ we can find the outliers (that is, pixels belonging to moving objects) with a threshold on the result.

In particular, we compare the result of the function $\xi_M(p)$ with the quantiles of a $\chi^2$ distribution.

For example, if $\xi_M(p) > 7.81$, which is the 95% quantile of a $\chi^2$ distribution with three degrees of freedom, it means that the pixel $p$ is moving with probability greater than 95%.
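As an illustration (ours, with made-up names), the per-pixel test could look like the following sketch, where M is the transformed flow vector of the pixel and SigmaM its covariance estimate:

#include <opencv2/core.hpp>
#include <cmath>

// Stationarity test for one pixel: compute the Mahalanobis distance
// xi_M = sqrt(M^T Sigma^-1 M) and compare it against the chosen
// chi-square quantile (7.81 for the 95% quantile with 3 d.o.f.).
bool isMovingPixel(const cv::Vec3d& M, const cv::Matx33d& SigmaM,
                   double quantile = 7.81) {
    cv::Vec3d t = SigmaM.inv() * M;      // Sigma^-1 * M
    double xi = std::sqrt(M.dot(t));     // Mahalanobis distance
    return xi > quantile;
}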

Segmentation Algorithm

We apply segmentation to find a group of pixels which is supposed to be moving with high probability.

For each pixel $p$ we need its likelihood of moving, measured by $\xi_M(p)$, and a fixed value for the likelihood of a pixel being static, measured by $\xi_{static}(p)$.

The algorithm is better explained by the same authors in [35]; the goal is to find a binary labelling function

$$L(p) = \begin{cases} 1 & \text{if the pixel } p \text{ is part of a moving object} \\ 0 & \text{otherwise} \end{cases}$$

The labelling function is found by minimizing another function that penalizes labelling a pixel as moving if $\xi_M(p) < \xi_{static}(p)$. The function to be minimized is

$$\sum_{p \in I} \left( -L(p)\,\xi_M(p) - (1 - L(p))\,\xi_{static} + \lambda \sum_{p' \in N(p,1)} \frac{|L(p) - L(p')|}{|I(p) - I(p')| + a} \right)$$

where the last term is the regularization term which ensures consistency between nearby pixels. The weight parameter $\lambda$ measures how influential the consistency term is in the formula.

The problem is equivalent to a graph cut problem, where we have two nodes, corresponding to the motion node $m$ and to the static node $s$, that are connected to all the pixels.

Moreover, every pixel is connected to its four neighbouring pixels. The weights of the edges for a pixel $p$ are defined as follows:


\[ W(p, m) = -\xi_M(p) \]
\[ W(p, s) = -\xi_{static}(p) \]
\[ W(p, p') = \frac{\lambda}{|I(p) - I(p')| + a} \quad \text{for } p' \in N(p, 1) \]

At the end of the graph cut algorithm every pixel node is connected either to the motion node or to the static node. The minimum cut which yields this separation also defines the labelling that minimizes the formula above.
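As an illustration of this construction, the sketch below assembles the terminal and neighbourhood weights into per-pixel maps; running a min-cut/max-flow solver on this graph (not shown) would then produce the labelling $L(p)$. The structure and the function names are our own placeholders, not the code of [34] or [35].

```cpp
// Sketch of the graph construction for the motion segmentation: one weight
// towards the motion node, one towards the static node, and one weight for
// each of the right/down neighbour edges (the left/up edges are symmetric).
#include <opencv2/core.hpp>
#include <cstdlib>

struct SegmentationGraph {
    cv::Mat wMotion;   // W(p, m) = -xi_M(p)
    cv::Mat wStatic;   // W(p, s) = -xi_static
    cv::Mat wRight;    // W(p, p') towards the right neighbour
    cv::Mat wDown;     // W(p, p') towards the lower neighbour
};

SegmentationGraph buildGraph(const cv::Mat& xiM,    // xi_M(p), type CV_32F
                             const cv::Mat& image,  // grayscale image, CV_8U
                             float xiStatic, float lambda, float a)
{
    SegmentationGraph g;
    g.wMotion = -xiM;
    g.wStatic = cv::Mat(xiM.size(), CV_32F, cv::Scalar(-xiStatic));
    g.wRight  = cv::Mat::zeros(xiM.size(), CV_32F);
    g.wDown   = cv::Mat::zeros(xiM.size(), CV_32F);
    for (int y = 0; y < image.rows; ++y)
        for (int x = 0; x < image.cols; ++x) {
            int ip = image.at<uchar>(y, x);
            if (x + 1 < image.cols)   // edge towards the right neighbour
                g.wRight.at<float>(y, x) =
                    lambda / (std::abs(ip - image.at<uchar>(y, x + 1)) + a);
            if (y + 1 < image.rows)   // edge towards the lower neighbour
                g.wDown.at<float>(y, x) =
                    lambda / (std::abs(ip - image.at<uchar>(y + 1, x)) + a);
        }
    // The binary labelling L(p) is obtained by running a min-cut / max-flow
    // algorithm on this graph; that step is not shown here.
    return g;
}
```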

We conclude with some considerations on the algorithm.

It is worth noticing that the algorithm is computationally quite expensive, but it can run in quasi-real time on small resolution images.

The paper claims 20 Hz on the GPU and 5 Hz on the CPU, with small QVGA images.

This method was tested for applications in driving environments, but it can easily be applied to Electronic Travel Aid devices.

It is surely not feasible for on-board processing on small MAVs, due to its high computational power needs.

3.2.3 Stereo Matching Segmentation

A simple approach for obstacle detection and avoidance is to take the output of a stereo vision algorithm and then threshold each depth value to see if there is an obstacle too close to the camera.

The algorithm we present is called LongSeq and it is developed in [33] to run on-board on a Pocket Drone. It runs on a processor with very limited memory and speed.

It consists simply of a modified stereo correlation approach which computes a depth map; if a disparity value is too high (i.e. an object is too close), it detects a collision and tries to avoid the obstacle.

Due to the limited computation power, the matching is performed only on one line. The first step is to find the minimal matching cost

\[ \min_d C(x, d) = \min_d \left| I^l(x) - I^r(x + d) \right| \]

with $d$ in a predefined range from $d_{min}$ to $d_{max}$. Then a binary array is computed for every pixel as

\[ B(x, d) = \begin{cases} 1 & \text{if } C(x, d) > \tau_{cost} \text{ and } \min_d C(x, d) < \tau_{min} \\ 0 & \text{otherwise} \end{cases} \]

where $\tau_{cost}$ and $\tau_{min}$ are two predefined thresholds. The ones represent the bad matches.


The last step consists in finding the longest sequence of zeros in the array, which is supposed to contain the best possible match. To this end we first replace each zero value with the length of the run of zeros it belongs to (up to the next bad match), and then we find the match as the maximum value of the sequence

\[ \max_d B(x, d) \]

For example, if we have the binary array [110001001], then we first replace all zeros with the length of their run, obtaining [113331221]; then we find the best disparity index, for which $\max_d B(x, d) = 3$.
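The following C++ sketch illustrates our reading of this matching step for a single pixel of one image row. The parameter names, the handling of the case in which no reliable match exists, and the choice of returning the starting index of the longest run of zeros are assumptions for illustration, not the exact code of [33].

```cpp
// Sketch of the LongSeq matching for one pixel x of a row: build the costs
// C(x, d), mark bad matches (B = 1) and good ones (B = 0), then pick the
// longest run of zeros and use its position as the disparity estimate.
#include <climits>
#include <cstdint>
#include <cstdlib>
#include <vector>

int longSeqDisparity(const uint8_t* left, const uint8_t* right, int width,
                     int x, int dMin, int dMax, int tauCost, int tauMin)
{
    // C(x, d) = |Il(x) - Ir(x + d)| for d in [dMin, dMax]
    std::vector<int> cost;
    int minCost = INT_MAX;
    for (int d = dMin; d <= dMax && x + d >= 0 && x + d < width; ++d) {
        int c = std::abs(int(left[x]) - int(right[x + d]));
        cost.push_back(c);
        if (c < minCost) minCost = c;
    }
    // If even the best cost is not below tauMin, B(x, d) is zero everywhere
    // and no reliable match exists: we report failure (our reading).
    if (cost.empty() || minCost >= tauMin) return -1;

    // Longest run of zeros of B(x, d), i.e. of costs not above tauCost.
    int bestLen = 0, bestStart = -1, runStart = 0, runLen = 0;
    for (int i = 0; i < (int)cost.size(); ++i) {
        if (cost[i] <= tauCost) {                 // B(x, d) == 0
            if (runLen == 0) runStart = i;
            if (++runLen > bestLen) { bestLen = runLen; bestStart = runStart; }
        } else {                                  // B(x, d) == 1
            runLen = 0;
        }
    }
    return bestStart < 0 ? -1 : dMin + bestStart;
}
```

In the sketch we return the position of the longest run rather than its length, since it is the position along the search range that identifies the disparity; the length could instead be used as a confidence measure.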

As an optimization step, we can repeat the same operation by matching the pixels in $I^r$ with the ones in $I^l$, and then take the minimum of the two indexes as the best result.

If the disparity is bigger than a certain threshold at some pixels, then we fire an obstacle detection signal, which is then processed and activates the avoidance manoeuvres.

The algorithm runs on a very small processing unit, with a clock speed of 168 MHz and 192 KB of available memory. It reaches a runtime of 90 ms (about 11 Hz) with images of resolution 128 × 92.


Chapter 4

Our Approach

This chapter presents our solution to the obstacle avoidance problem. Our algorithms are strictly constrained by the real time computational requirements. This reflects both on the algorithm complexity and on the quality of the results.

We developed two core algorithms, one based on stereo matching, the other on scene flow.

Then we designed the parallel application using algorithmic skeletons. Our design efforts were focused on providing a limited service time to ensure real time, while keeping the latency as low as possible.

In the following sections we will describe the overall architecture of our solution: In the first section we describe the parallelism we exploited, while in the second one we describe the algorithms to label the pixels as dangerous.

We use the same notation as in the previous chapters: $I_i^l$ and $I_i^r$ are respectively the left and the right image at time $i$.

4.1 Description

Given a stereo video, the purpose of our application is to analyse two consecutive frames, and find the motion of the various objects in the scene.

We exploit Stereo Correlation to find the depth of the objects in two consecutive frames. This information alone is already capable of identifying which parts of the image contain dangerous pixels that are coming closer to the camera.

Optionally, we can integrate the information of Optical Flow, filtering out objects that are not converging to the centre of the image.

The last step processes the stereo and, if available, the optical flow information, and labels the dangerous pixels accordingly. If there are enough dangerous pixels, a signal is fired.

Figure 4.1: The overall execution flow of the application with three inputs, respectively with the colors red, green and blue

Our algorithm runs in two different ways:

• Stereo: It employs only the stereo matching algorithm. It is faster and less computationally demanding, but it flags some pixels as false positives.

• Stereo Flow: It uses both stereo matching and optical flow, and it reconstructs a three dimensional motion field. With this information we can label as safe all the pixels whose vectors are not directed towards the centre of the image.

The application structure is described in figure 4.1. As stated above, the Optical Flow stage may optionally be computed.

We can identify four steps, one of which only runs in the Stereo Flow setting:

• Image Conversion: this step reads a Stereo Image Pair $I_i^l$ and $I_i^r$ and converts the two images to grayscale.

• Stereo Matching computation: Given the images $I_i^l$ and $I_i^r$, we compute the depth map $M_i(p)$ for every pixel $p$.

• Optical Flow (optional): Given the image $I_i^l$ and the image at the previous step $I_{i-1}^l$, we compute the Optical Flow $F_i(p)$ for every pixel $p$.

• Pixel Labelling: With the depth map at the current step $M_i(p)$, the one at the previous step $M_{i-1}(p)$ and (optionally) the current optical flow $F_i(p)$, this step labels every pixel $p$ as dangerous or safe.

In the last step the depth map $M_{i-1}$ comes for free, as we already computed it in the previous iteration.

After the first stereo image pair, in the following iterations we can compute the result at the cost of a single Stereo Matching and optionally of an Optical Flow computation.
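The per-frame data flow can be summarized by the following sequential sketch; the four functions are placeholders standing in for the steps of figure 4.1 (they are not the actual interfaces of our implementation), and only the dependencies and the reuse of the previous depth map are of interest here.

```cpp
// Sequential sketch of the per-frame processing: the previous left image and
// the previous depth map are kept across iterations, so each new frame only
// costs one Stereo Matching and, in the Stereo Flow mode, one Optical Flow.
#include <opencv2/core.hpp>

// Placeholder stubs standing in for the real computations.
cv::Mat computeDepthMap(const cv::Mat&, const cv::Mat&)    { return cv::Mat(); }
cv::Mat computeOpticalFlow(const cv::Mat&, const cv::Mat&) { return cv::Mat(); }
void    labelPixels(const cv::Mat&, const cv::Mat&, const cv::Mat&) {}
bool    readStereoPair(cv::Mat&, cv::Mat&)                 { return false; }

void processStream(bool stereoFlowMode)
{
    cv::Mat prevLeft, prevDepth;
    cv::Mat left, right;
    while (readStereoPair(left, right)) {              // I_i^l, I_i^r (grayscale)
        cv::Mat depth = computeDepthMap(left, right);  // M_i
        cv::Mat flow;
        if (stereoFlowMode && !prevLeft.empty())
            flow = computeOpticalFlow(prevLeft, left); // F_i
        if (!prevDepth.empty())
            labelPixels(prevDepth, depth, flow);       // uses M_{i-1}, M_i, F_i
        prevLeft  = left.clone();  // kept for the optical flow of frame i+1
        prevDepth = depth;         // M_i becomes M_{i-1} at the next frame
    }
}
```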

To get an acceptable latency we had to discard many optimization steps which could have improved the quality of the results. For Stereo Matching and Optical Flow we use some of the fastest methods at the current state of the art, and we consider the accuracy of the results a minor issue. In particular, for Stereo Matching we employ the method described in [21], which is a classical stereo matching algorithm with a parametric search range. For Optical Flow we instead use the algorithm in [22], which is one of the fastest methods at the current state of the art, without any variational optimization.

The intuition behind those choices is the following: to detect an incoming obstacle we do not need an accurate reconstruction of the three dimensional motion. We only need a few vectors correctly labelled as dangerous, which can be computed even with a noisy scene flow.

In fact our Stereo Flow approach, which combines Stereo Maps and Optical Flow without any refinement, computes very noisy and incomplete Scene Flow results, as pointed out and demonstrated in [28].

Concerning the last step, the pixel labelling function is very simple, and its runtime is negligible with respect to the previous steps. We will describe those algorithms in the last section of this chapter.

4.2 Parallel Architecture

The parallel architecture follows a schema which is both stream and data parallel. Those schemes are independent of each other, but they compete for the same resources.


Moreover the OpenCV implementation itself exploits some form of data parallelism, further enhancing the parallelism degree.

Each parallelism degree is independent of the others, and in addition the OpenCV parallelism is invisible to FastFlow and vice versa.

Our purpose is to process the entire video stream, with a fixed delay $d$ between images (i.e. a framerate of $f = 1/d$ frames per second).

We need the service time of every step to be at most $d$, in order not to create bottlenecks. To this end, it is very important that the resources are assigned correctly. In our application we let the system scheduler assign the cores, but we reserve the possibility of modifying the default behaviour for some time-critical processes.

Referring to figure 4.1 from the previous section, we can identify the dependencies and, consequently, which parts of the application can run in parallel.

The dependencies are given in the picture by the arrows:

• Depth Map: to compute the depth map $M_i$ of the $i$-th frame, we need the images $I_i^l$ and $I_i^r$.

• Optical Flow: to compute the optical flow $F_i$ of the $i$-th frame, we need the images $I_{i-1}^l$ and $I_i^l$.

• Pixel Labelling: to compute the labels of the pixels in the image $I_i^l$, we need (if we consider the Stereo Flow setting) the optical flow $F_i$ and the depth maps $M_{i-1}$ and $M_i$.

The fastest parallel configuration depends on the hardware we have. With enough cores, we exploit both stream and data parallelism, reaching real time performance with acceptable latency.

We try to keep the latency under 100 ms, but the lower the latency, the better the final result, provided that we can still achieve the service time needed for real time processing.

We employ a classical pipeline skeleton, in which the second stage can be a farm. The mapping between the pipeline stages and the application steps is the following:

• The Emitter maps the Image Conversion step, and it is in charge of reading the images and converting them into grayscale.

Then it creates the tasks for the workers: in the Stereo mode we require only the computation of Stereo Matching, while in the Stereo Flow mode we require the computation of both Stereo Matching and Optical Flow, by sending two different tasks.

It is very important for our tests to keep a fixed framerate, and to be neither too slow nor too fast.

• The Worker receives tasks from the emitter. It maps either the Stereo Matching or the Optical Flow task, based on the emitter's request. In the first application setting, which is Stereo, we only need to compute Stereo Matching, which has a latency $l_{SM}$. In the Stereo Flow application setting, instead, we need to compute both Stereo Matching, with latency $l_{SM}$, and Optical Flow, with latency $l_{OF}$.

This step is the most problematic one, since the completion time without any parallel implementation is always bigger than the frame emission delay $d$. To lower the latencies we suggest three approaches, of which the third is a combination of the first two:

– Task replication: we model the second stage as a farm, and we add as many workers as needed to get the desired service time.

– Data Split: every input is divided into many tiles. Each tile is then computed independently and in parallel with the others. The total latency is then reduced to the computation time of a single tile (a minimal sketch of this approach is given after this list).

– Combined approach: we employ a farm of workers, each of which splits the input image into many tiles. This solution makes use both of task parallelism to reduce the service time, and of data parallelism to reduce the latency.

• Finally the Collector maps the pixel labelling step. Since in our application the execution time of this step is negligible with respect both to the minimum framerate and to the previous step, we do not need to apply further parallelization techniques.
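The minimal sketch of the Data Split approach mentioned in the list above is the following: a worker divides its input into horizontal stripes that are processed concurrently. We use cv::parallel_for_ purely for illustration; computeStripe is a placeholder for the per-tile Stereo Matching or Optical Flow computation, and a real split would also need some overlap between adjacent stripes wherever window-based matching reads pixels beyond the stripe border.

```cpp
// Sketch of the Data Split approach: the image is cut into nTiles horizontal
// stripes, each processed independently; the latency becomes roughly the
// latency of a single stripe plus the parallel-for overhead.
#include <algorithm>
#include <opencv2/core.hpp>

// Placeholder: run Stereo Matching (or Optical Flow) on a single stripe,
// reading left(roi)/right(roi) and writing the result into out(roi).
static void computeStripe(const cv::Mat& left, const cv::Mat& right,
                          cv::Mat& out, const cv::Rect& roi)
{
    (void)left; (void)right; (void)out; (void)roi;
}

void splitAndCompute(const cv::Mat& left, const cv::Mat& right,
                     cv::Mat& out, int nTiles)
{
    const int stripeH = (left.rows + nTiles - 1) / nTiles;
    cv::parallel_for_(cv::Range(0, nTiles), [&](const cv::Range& r) {
        for (int t = r.start; t < r.end; ++t) {
            const int y0 = t * stripeH;
            const int h  = std::min(stripeH, left.rows - y0);
            if (h > 0)
                computeStripe(left, right, out, cv::Rect(0, y0, left.cols, h));
        }
    });
}
```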

The general parallel structure is depicted in figure 4.2. We now describe the parallelism schemes following a top-down approach: we start from task parallelism, which is the first approach suggested above and aims at lowering the service time of the second parallel stage; then we analyse the data parallel approach, which instead tries to reduce the latency of single tasks.

4.2.1 Task parallelism

Following the task parallel approach, we instantiate a set of workers which all compute the same function, according to a farm skeleton pattern. The purpose of this approach is to lower the service time $T_s$ until we reach the value $d$.

Figure 4.2: Farm skeleton used by our application
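To fix the ideas, the following is a minimal FastFlow sketch of such a farm. The Task structure and the bodies of the nodes are placeholders, not the code of this work, and the exact constructor used to build the farm may differ slightly between FastFlow versions.

```cpp
// Minimal sketch of the farm of figure 4.2 with FastFlow: the emitter reads
// and converts the frames and emits one task per requested computation, the
// workers compute Stereo Matching or Optical Flow, the collector labels the
// pixels.
#include <ff/farm.hpp>
#include <memory>
#include <vector>
using namespace ff;

struct Task { /* frame index, images, kind of computation, result ... */ };

struct Emitter : ff_node_t<Task> {
    Task* svc(Task*) {
        // read the next stereo pair, convert to grayscale, then
        // ff_send_out(new Task{...}) once (Stereo) or twice (Stereo Flow)
        return EOS;                    // end of stream when the video is over
    }
};

struct Worker : ff_node_t<Task> {
    Task* svc(Task* t) {
        // compute either Stereo Matching or Optical Flow, as requested in t
        return t;                      // forward the result to the collector
    }
};

struct Collector : ff_node_t<Task> {
    Task* svc(Task* t) {
        // pixel labelling on the depth maps (and, optionally, optical flow)
        delete t;
        return GO_ON;
    }
};

int main() {
    const int nWorkers = 4;            // chosen according to l_SM, l_OF and d
    std::vector<std::unique_ptr<ff_node>> workers;
    for (int i = 0; i < nWorkers; ++i)
        workers.push_back(std::make_unique<Worker>());

    ff_Farm<Task> farm(std::move(workers),
                       std::make_unique<Emitter>(),
                       std::make_unique<Collector>());
    return farm.run_and_wait_end();
}
```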

The workers are all the same: they specialize each time they receive a new task, computing either Stereo Matching or Optical Flow, based on the request.

The structure of the farm is the same in both the Stereo and the Stereo Flow setting, but the minimum number of workers to get the right $T_s$ is different, since in the second setting we also have to compute Optical Flow, while achieving the same service time.

Given the Stereo Matching latency $l_{SM}$, the Optical Flow latency $l_{OF}$, and the target service time $d$, the optimal number of workers can be computed in the following way:

• For the Stereo setting, we care only about $l_{SM}$: the optimal number of workers is

\[ w = \left\lceil \frac{l_{SM}}{d} \right\rceil \]

(for example, with $l_{SM} = 100$ ms and $d = 40$ ms, three workers are enough to sustain the required service time).
• For the Stereo Flow setting, we receive two tasks, Stereo Matching and Optical Flow, to be computed at each frame. With two free workers at
