Master's Degree Course in Electronic Engineering
(Corso di Laurea Magistrale in Ingegneria Elettronica)

Academic Year 2013/2014

Implementation of Real-time Dense Stereo Vision for Augmented Reality in ARM® GPUs

Gian Marco Iodice

23rd July 2014

Academic supervisor: Roberto Saletti, Full Professor, Department of Information Engineering, University of Pisa (Italy)
Industrial supervisor: Anthony Barbier, GPU Compute Technical Lead, ARM (Media Processing Group, Cambridge - UK)

This thesis is sponsored by ARM® - Media Processing Group, Cambridge - UK

Supervisors:

Roberto Saletti - Academic supervisor: ...
Anthony Barbier - Industrial supervisor: ...

Nomenclature

SAD: Sum of Absolute Differences
SSD: Sum of Squared Differences
CT: Census transform
MCT: Modified census transform
MR: Multi-resolution
CTF: Coarse-to-fine
MF: Median filter
LRC: Left-Right consistency check


Terrazza Mascagni - Livorno, Italy

Acknowledgment

First of all, I would like to express my deep gratitude to the Department of Information Engineering of the University of Pisa and ARM Holding for having offered me the chance to conduct my MSc thesis research project in the field of stereo vision.

Special thanks go to my academic supervisor Roberto Saletti for fruitful discussions and encouragement and for his guidance and support throughout this study.

I would like to express my thankfulness to Anthony Barbier, my industrial supervisor, for his useful feedback and suggestions on low-level OpenCL™ optimizations and improvements.

I would like to thank Peter Horsman for having supported this thesis work.

I would like to express my very sincere gratitude to Anthony Barbier, Roberto Mijat and Giancarlo Todone for their important review of this thesis work.

Last but not least, once again, I would like to thank everyone in the ARM Media Processing Group (Cambridge - UK) for providing such a friendly and comfortable working environment during my summer internship.


Abstract

Stereo vision is a powerful visual sensing technique aimed at inferring depth from two cameras, giving us the possibility to measure distances between the cameras and the objects in the scene. Nowadays it is not only an attractive subject of study and research in machine vision, robotics and computer vision, but it also finds application in many real-world cases, such as natural user interfaces, industrial automation, autonomous vehicles, and many more. However, stereo vision algorithms are extremely computationally expensive and result in a very high CPU load, particularly if accuracy is needed by the application.

In order to get fast reactions from a system, it is really important to ensure data is delivered at a high frame rate. Moreover, the reliability and robustness of the 3D data are important aspects to consider as well, especially in a wide variety of scenes and illumination conditions.

Many algorithms are proposed every year to improve accuracy/reliability or computation speed: due to the high computational complexity, it is still a huge challenge to achieve high accuracy/reliability at a high frame rate.

A possible solution to overcome this is to implement a stereo vision algorithm that exploits the highly parallel execution capability of a Graphics Processing Unit (GPU).

The aim of this thesis work is to present a real-time dense stereo vision algorithm based on a multi-resolution strategy and the modified census transform that is tailored to a mobile graphics processor such as the ARM Mali™-T600 Series GPU.

The thesis describes the adapted algorithm and the design decisions taken in order to implement optimized OpenCL code running on the ARM Mali-T604 GPU. The algorithm robustness with respect to changing illumination conditions is also shown. Computing stereo vision on the GPU realizes a significant speedup and precision improvement, as proved by the performance analysis (in terms of computation speed and accuracy) on widely accepted reference scenes.

Experimental results demonstrated that the implementation is robust under non-ideal illumination and reaches a high speed of computation at various image resolutions, with accuracy comparable to other state-of-the-art real-time stereo vision algorithms. In particular, regarding the speed of computation, we reached approximately 120 fps at 320x240 image resolution with 60 disparity levels and 40 fps at 640x480 with the same disparity range on the above-mentioned ARM Mali-T604 GPU.

At present, almost all real-time stereo vision implementations are developed on desktop or laptop GPUs: with this thesis work it will be demonstrated that this kind of computationally expensive algorithm is also suitable for mobile GPUs, reaching significant performance benefits and competitive results using ARM Mali-T600 Series GPUs.

This project has been developed in a joint cooperation between the Dept. of Information Engineering of the University of Pisa (Italy) and ARM Holding (Media Processing Group, Cambridge - UK).


Key words: stereo vision, image processing, computer vision, census transform, modified census transform, distances, ARM Mali-T600 Series GPU, GPU compute, OpenCL, 3D reconstruction, depth estimation, augmented reality, disparity map, Hamming distance, dense disparity map, template matching


Contents

1 Introduction
1.1 Introduction
1.2 Objectives and workbench
1.3 Thesis structure

2 General Purpose computing on Graphics Processing Units
2.1 OpenCL
2.2 Heterogeneous system
2.3 OpenCL components
2.3.1 Kernel and threads
2.4 GPU compute on ARM Mali-T600 Series GPUs
2.5 Optimizing the OpenCL code on ARM Mali-T600 Series GPUs

3 Non-parametric local transform
3.1 Introduction
3.2 The census transform
3.2.1 Overview of census transform
3.2.2 The modified census transform (MCT)
3.2.2.1 Mean filter 3x3 and Gaussian filter 3x3: Performance analysis
3.3 OpenCL census transform case study
3.3.1 Census transform 3x3
3.3.1.1 Census transform 3x3: Performance analysis
3.3.2 Modified census transform 3x3
3.3.3 Census transform 5x5
3.3.3.1 Performance analysis and comparison with CPU implementation
3.3.4 Modified Census Transform 5x5
3.3.5 Census transform 9x9
3.3.5.1 Performance analysis and comparison with CPU implementation
3.3.6 Modified census transform 9x9

4 Computer stereo vision
4.1 Introduction
4.2 Fundamentals of stereo vision
4.2.1 Stereo vision depth mapping
4.2.2 Horizontal epipolar line constraint
4.3 Depth from disparity via triangulation
4.3.1 The disparity map
4.3.2 Considerations about the depth resolution
4.4 The correspondence problem
4.4.1 Template matching for the correlation-based method
4.4.2 The Sum of Hamming Distance (SHD) as measure of similarity
4.4.2.1 OpenCL implementation
4.5 Challenges in stereo correspondence problem
4.5.1 Color inconsistencies
4.5.2 Un-textured regions
4.5.3 Occlusions
4.6 The stereo pipeline
4.7 Active stereo vision

5 Multi-resolution dense stereo vision with the modified census transform
5.1 Overview of stereo vision architecture pipeline
5.1.1 The multi-resolution strategy
5.2 Pre-processing stage
5.3 Stereo matching stage
5.3.1 Disparity map computation
5.3.1.1 Performance analysis and comparison with CPU implementation
5.3.2 Sub-pixel disparity
5.4 Disparity refinement stage
5.4.1 Remove outliers
5.4.1.1 Performance analysis and comparison with CPU implementation
5.4.2 Left-Right consistency check
5.4.2.1 Performance analysis and comparison with CPU implementation
5.4.3 Fill occluded and miss-matched pixels
5.4.4 Occlusion filling - Nearest disparity
5.4.4.1 Performance analysis and comparison with CPU implementation
5.4.5 Occlusion filling - Nearest lower disparity
5.4.5.1 Performance analysis and comparison with CPU implementation
5.5 Up-scaling and pixelation stage

6 Quality and performance analysis
6.1 Quality and speed of computation metrics
6.1.1 Quality evaluation
6.1.2 Datasets
6.2 Related works
6.3 Quality analysis
6.3.1 MCT-full
6.3.1.1 Evaluation 1
6.3.1.2 Evaluation 2
6.3.2 MR-MCT-full
6.3.2.1 Evaluation 1
6.3.2.2 Evaluation 2
6.3.2.3 Comparison MR-MCT-full / MCT-full
6.3.3 MR-MCT-5x5
6.3.3.1 Evaluation 1
6.3.3.2 Evaluation 2
6.3.4 Change in exposure
6.3.5 Consideration about the quality
6.4 Computation speed analysis
6.4.1 Consideration about the speed computation

7 Conclusions and future work
7.1 Future work


Chapter 1

Introduction

“It is not knowledge, but the act of learning, not possession but the act of getting there, which grants the greatest enjoyment”1

Karl Friedrich Gauss (1777-1855), Letter to Farkas Wolfgang Bolyai (2 Sep 1808).

1.1 Introduction

Stereo vision is a powerful visual sensing technique aimed at inferring depth from two cameras, giving us the possibility to measure distances between the cameras and the objects in the scene.

Nowadays it is not only an attractive subject of study and research in machine vision, robotics and computer vision, but it also finds application in many real-world cases, such as natural user interfaces, industrial automation, autonomous vehicles, and many more. However, stereo vision algorithms are extremely computationally expensive and result in a very high CPU load, particularly if accuracy is needed by the application.

In order to get fast reactions from a system, it is really important to ensure data is delivered at a high frame rate. A minimum of at least 10 fps should be achieved in any case, as suggested in [34] for robotic applications. Moreover, the reliability and robustness of the 3D data are important aspects to consider as well, in a wide variety of scenes and illumination conditions.

Many algorithms are proposed every year to improve accuracy/reliability or computation speed: due to the high computational complexity, it is still a huge challenge to achieve high accuracy/reliability at a high frame rate.

1 "Non è la conoscenza ma l'atto di imparare; non il possesso ma l'atto di arrivarci che dà la gioia maggiore" (the Italian rendering of the quote above).


Since a large number of stereo vision algorithms have been proposed, it can be inferred that no single "best" overall solution exists, because of trade-offs in speed, flexibility and accuracy [16].

The inherent complexity of this kind of algorithm led us to the decision of moving off the CPU and making use of an additional processing unit such as the GPU. Today GPUs are very attractive development platforms at far less cost because, exploiting their powerful parallel processing capabilities, it is possible to achieve very high performance in a short time.

However, GPU compute requires a whole new mindset in algorithm design: as we'll see in the following chapters, programming strategies that result in a fast CPU implementation do not necessarily produce fast GPU code [52].

The aim of this thesis work is to present a real-time dense two-frame stereo vision algorithm based on a multi-resolution strategy and the modified census transform that is tailored to a graphics processor such as the ARM Mali-T600 Series GPU.

The thesis describes the adapted algorithm and the design decisions taken in order to implement optimized OpenCL code running on the ARM Mali-T604 GPU. The algorithm robustness with respect to changing illumination conditions is also shown.

Computing stereo vision on the mobile GPU realizes a significant speedup and precision improvement, as proved by the performance analysis (in terms of computation speed and accuracy) on widely accepted reference scenes.

Experimental results, in fact, demonstrated that the implementation is robust under non-ideal illumination and reaches a high speed of computation at various image resolutions, with accuracy comparable to other state-of-the-art real-time stereo vision algorithms. In particular, regarding the speed of computation, we reached approximately 120 fps at 320x240 image resolution with 60 disparity levels and 40 fps at 640x480 with the same disparity range on the above-mentioned ARM Mali-T604 GPU.

At present, almost all real-time stereo vision implementations are developed on desktop or laptop GPUs: with this thesis work it will be demonstrated that this kind of computationally expensive algorithm is also suitable for mobile GPUs, reaching significant performance benefits and competitive results using ARM Mali-T600 Series GPUs.

1.2 Objectives and workbench

This project has been developed in a joint cooperation between the Dept. of Information Engineering of the University of Pisa (Italy) and ARM Holding (Media Processing Group, Cambridge - UK).

The goal of this thesis work was to develop a real-time dense stereo vision algorithm designed to run on ARM Mali-T600 Series GPUs.

The algorithm was conceived for embedded real-time systems in the field of augmented reality and robotic applications such as obstacle detection and robot/autonomous navigation.


For this kind of application, not only do we have to ensure data is delivered at a high frame rate (a minimum of at least 10 fps should be achieved in any case, as suggested in [34]), but we also have to ensure good accuracy, reliability and robustness in a wide variety of scenes and illumination conditions.

The development platform is the Samsung™ ARM Chromebook XE303C12 (figure 1.2.1) with the SoC Exynos 5 Dual powered by an ARM Cortex™-A15 1.7/2 GHz dual-core (CPU) and an ARM Mali-T604 (GPU), while the operating system used is Linux.

Figure 1.2.1: Samsung ARM Chromebook XE303C12

1.3 Thesis structure

In this thesis work, Chapter 2 gives a brief overview of GPU compute. Chapter 3 introduces the non-parametric local image transform. This kind of image transform will be necessary to ensure algorithm robustness against changes in illumination. Chapter 4 presents the main aspects of a stereo vision system. Chapter 5 shows the proposed algorithm based on a multi-resolution strategy and the modified census transform. Chapter 6 evaluates the proposed algorithm in terms of accuracy and speed of computation and compares the results with other state-of-the-art real-time implementations. Finally, Chapter 7 presents the summary and conclusions of our thesis work, providing suggestions and directions for future implementations.


Chapter 2

General Purpose computing on Graphics Processing Units

The last decade of the 20th century and the first decade of the 21st century can be called the "Era of Parallel Computing" [8]. During this time, not only were the fastest parallel supercomputers designed and built, but the main now-standard parallel programming systems were also developed.

Notable standards for parallel programming are MPI (Message Passing Interface) and OpenMP (for shared-memory multiprocessor systems), which are the predecessors of the newer and portable OpenCL.

Generally speaking, the main difference between CPU and GPU programming is that the former is optimized for general-purpose computation, while the latter is designed to perform a set of simple operations on a huge amount of data, achieving speed through massive parallelism.

2.1 OpenCL

OpenCL (Open Computing Language) was started by Apple, Inc. and is now an open standard for application development, enabling the use of heterogeneous multi-processor systems. Currently, it is managed by the Khronos Group, which defines the OpenCL API specifications with the help of representatives from several major computer vendors1.

When programming with OpenCL there is no requirement to know the GPU instruction set.

1 For the list of members and promoters the reader is referred to http://www.khronos.org/members/promoters


2.2 Heterogeneous system

The parallel computation is based on a heterogeneous system, which has two or more types of processor units: the classic CPU, called the host, and the highly parallel processors, called the devices. A typical example of a heterogeneous system consists of a GPU (e.g. the ARM Mali-T604 GPU) controlled by a CPU (e.g. the ARM Cortex-A15 1.7/2 GHz dual-core), such as the SoC Exynos 5 Dual. Only the device is responsible for the highly parallel processing that accelerates computation.

2.3 OpenCL components

An OpenCL program is composed of two main components: a host program and at least one kernel (OpenCL program).

The former is programmed using C/C++ and the OpenCL Runtime APIs and is executed on the host side, while the kernel is written in OpenCL C and executed on an OpenCL device (figure 2.3.2).

Figure 2.3.1: OpenCL application

Figure 2.3.2: Host and device programming languages. Image from [8]

Basically the host program provides the following functionalities [7]:

1. Calls the OpenCL APIs.

2. Compiles the CL kernels.


4. Sets up command queues.

5. Sets up dependencies between the tasks.

6. Sets up the NDRanges that the kernels execute over.
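As a concrete illustration of these responsibilities, the following is a minimal host-side sketch (not the host program of this project: the inline kernel source, buffer sizes and the omission of error handling are simplifying assumptions) that discovers a GPU device, builds the Add kernel of Algorithm 2.2 and launches it over a one-dimensional NDRange:

#include <CL/cl.h>

int main(void)
{
    cl_int err;
    cl_platform_id platform;
    cl_device_id device;

    /* 1. Call the OpenCL APIs: discover a platform and a GPU device */
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);

    /* 4. Set up a command queue */
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, &err);

    /* 2. Compile the CL kernel (source string kept inline for brevity) */
    const char *src =
        "__kernel void Add(__global const float* a, __global const float* b,"
        "                  __global float* c)"
        "{ int id = get_global_id(0); c[id] = a[id] + b[id]; }";
    cl_program program = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
    clBuildProgram(program, 1, &device, NULL, NULL, NULL);
    cl_kernel kernel = clCreateKernel(program, "Add", &err);

    /* Allocate buffers in global memory (visible to both CPU and GPU on Mali) */
    const size_t n = 1024;
    cl_mem a = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  n * sizeof(float), NULL, &err);
    cl_mem b = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  n * sizeof(float), NULL, &err);
    cl_mem c = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, n * sizeof(float), NULL, &err);

    clSetKernelArg(kernel, 0, sizeof(cl_mem), &a);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &b);
    clSetKernelArg(kernel, 2, sizeof(cl_mem), &c);

    /* 6. Set up the NDRange the kernel executes over: one work-item per element */
    size_t global_size = n;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL, 0, NULL, NULL);

    /* 5. Dependencies: here a simple blocking finish orders the work */
    clFinish(queue);

    clReleaseMemObject(a); clReleaseMemObject(b); clReleaseMemObject(c);
    clReleaseKernel(kernel); clReleaseProgram(program);
    clReleaseCommandQueue(queue); clReleaseContext(ctx);
    return 0;
}

Each of the numbered responsibilities above maps to one or two of these API calls; a real application would also check every returned error code.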

Regarding OpenCL C, this language is a subset of ISO C99 but with some exclusions. These exclusions concern standard C99 headers, function pointers, recursion, variable-length arrays and bit fields [8], although there are some additions such as vector types, atomic functions and the built-in OpenCL functions.

Figure 2.3.3: This block diagram summarizes the components of OpenCL and the actions that occur on the host during an OpenCL application execution [9].

2.3.1 Kernel and threads

GPU compute with OpenCL is based on the concept of the "kernel". As already seen, the kernel is programmed in OpenCL C and executed in parallel on a CL device.

The host program submits the kernel for execution on a CL device and defines the number of instances (called threads or work-items) to execute. Once the kernel is submitted, the OpenCL engine creates an MxNxK integer index space called the NDRange, with dimension equal to the number of work-items defined. Each thread has a different index in the NDRange, called the global ID. When the parallel execution starts, each work-item executes the same sequence of instructions defined by a single kernel. While the sequence of instructions is the same, the behavior of each work-item can vary because of branch statements within the code or data selected through the global ID [9]. For a better understanding of this aspect let's consider the examples 2.1 and 2.2 taken from [8]. As can be seen, the concept of the C loop in 2.1 has been replaced by the index space ID in 2.2.


Algorithm 2.1 Conventional C code: add two float vectors

void Add(const int n, const float *a, const float *b, float *c)
{
    int i;
    for (i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

Algorithm 2.2 OpenCL kernel program: add two float vectors

__kernel void Add(__global const float *a, __global const float *b, __global float *c)
{
    /* execute over n work-items */
    int id = get_global_id(0);
    c[id] = a[id] + b[id];
}

2.4 GPU compute on ARM Mali-T600 Series GPUs

In this section we give an overview of the ARM Mali-T600 Series GPUs, which support OpenCL 1.1, Full Profile. For more details the reader is referred to [7].

An ARM Mali-T600 Series GPU contains one to eight identical shader cores. Each shader core supports up to 256 concurrently executing threads [7].

Figure 2.4.1: ARM Mali-T604 GPU with 4 shader cores

Each shader core is based on:

1. Two or four arithmetic pipelines.

2. One load-store pipeline.

3. One texture pipeline.

OpenCL typically uses only the Arithmetic or Load-Store execution pipelines. The texture pipeline is only used for reading image data types.

Each thread uses only one of the Arithmetic or Load-Store execution pipes at any point in time. Two instructions from the same thread execute in sequence. The next instruction from the program executes after the completion of the previous instruction.


ARM Mali-T600 Series GPUs also use SIMD (Single Instruction Multiple Data), so that most arithmetic instructions operate on multiple data elements simultaneously [7].

In some respects, programming an ARM Mali-T600 Series GPU is easier than programming a desktop GPU [7], because the former has a different memory model to desktop workstations. In traditional desktop workstations, in fact, the global, local and private memories are physically separated. Typically a graphics card has its own local memory: data must be copied to the local memory and back again.

On ARM Mali GPUs the memory system is unified, so local and private memory are physically global memory, and the traditional copying of data is not required. This aspect is particularly important because it means that CL memory buffers are allocated in global memory that is physically accessible by both CPU and GPU cores. However, only memory that is allocated by clCreateBuffer is mapped into both the CPU and GPU virtual memory spaces. Memory allocated using malloc(), etc., is only mapped onto the CPU [7].
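A hedged sketch of how this property can be exploited (an illustrative helper only, not taken from the project code): the host allocates the image with clCreateBuffer and maps it to obtain a CPU-visible pointer, instead of allocating it with malloc() and copying it.

#include <CL/cl.h>

/* Sketch: zero-copy style buffer usage on a unified-memory GPU such as ARM Mali.
   ctx and queue are assumed to be created as in section 2.3; error checks omitted. */
static cl_mem create_image_buffer(cl_context ctx, cl_command_queue queue,
                                  size_t width, size_t height)
{
    cl_int err;
    const size_t size = width * height;   /* one byte per gray-scale pixel */

    /* The buffer is allocated in global memory, visible to CPU and GPU cores */
    cl_mem image = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                                  size, NULL, &err);

    /* Map it to fill it from the CPU: no copy to a separate device memory */
    unsigned char *ptr = (unsigned char *)clEnqueueMapBuffer(
        queue, image, CL_TRUE, CL_MAP_WRITE, 0, size, 0, NULL, NULL, &err);

    /* ... write the input pixels through ptr here ... */

    clEnqueueUnmapMemObject(queue, image, ptr, 0, NULL, NULL);
    return image;   /* kernels can now access the same allocation directly */
}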

2.5 Optimizing the OpenCL code on ARM Mali-T600 Series GPUs

Probably a single section of a chapter is not enough to show all the tricks and strategies needed in order to optimize OpenCL code. However, generally, to achieve very good performance by means of GPU computing, a set of general rules needs to be followed, such as [7]:

1. Compute more pixels per kernel.

2. Find ways to parallelize sequential code.

3. Reduce the number of load/store operations.

4. Avoid conditional branches and loops (e.g. use the global ID of the work-item instead of a loop counter).

5. Use built-in OpenCL functions (BIFLs): where possible use the built-in functions, as the commonly occurring ones compile to fast hardware instructions; use the "half" or "native" versions of built-in functions.

6. Use vector operations.

As we'll see, sometimes a new way of thinking about algorithm design is required in order to follow this set of rules and obtain good speed-ups.


Chapter 3

Non-parametric local transform

3.1 Introduction

In chapter 1 we talked about the reliability and robustness of a stereo vision system under non-ideal illumination conditions. For many application fields the reliability of the result is one of the most important requirements, since the system should also be able to operate in highly noisy environments, e.g. under adverse weather conditions such as snow, rain, fog, or any combination thereof [22].

This need led to the study and research of a particular class of image transforms called "non-parametric local transforms".

Let us start by introducing the definition of a local image transform. A local transform is an image transform where the value of each new pixel in the destination image depends on a set of pixel values from the source image.

The source image can be either a color image or a gray-scale image.

Usually this operation is done using a sliding square window called a "kernel" (in the mathematical sense, not to be confused with OpenCL kernels), "convolution matrix", "window" or "mask". Each new pixel in the destination image $I_{dst}(x_0, y_0)$ is computed as a function of the pixels in a corresponding region of the source image [10]. In this image transform class we have, for instance, filters such as Sobel, Gaussian, Laplacian and so on. Equation 3.1.1 below represents the discrete 2D convolution for such filters.

$$I_{dst}(x, y) = \sum_{m=-N}^{N} \sum_{n=-N}^{N} I_{src}(x + m, y + n) \cdot K(m, n) \quad (3.1.1)$$

$$K = \begin{pmatrix} 4 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & -4 \end{pmatrix}$$

Figure 3.1.1: Example of parametric kernel and the convolution operation

A non-parametric local transform is a sub-class of local transform that does not use any parametric kernel or threshold.

In a similar manner to the local transform, it can operate on both color and gray-scale images.

Generally these kinds of image transforms have the important property that their result is insensitive to camera gain and illumination changes.

However we have to consider important drawbacks such as:

1. It is not immediate to compute an inverse transform.

2. The allocated memory and computation time generally increase with the window size.

In the literature several non-parametric local transforms have been proposed, such as the rank transform and census transform by Ramin Zabih and John Woodfill [13] and the differential transform [19] by Masoud Samadi, Mohd Fauzi Othman and Shamsudin H. M. Amin.

For this project a variant of the census transform has been used, and the input image will always be assumed to be gray scale. This variant of the census transform is called the "modified census transform" and gives better robustness with respect to illumination changes and image noise.

3.2 The census transform

The census transform is a non-parametric local image transform introduced by Zabih and Woodfill [13] in 1994. Since the result of this transformation does not depend on camera gain and lighting conditions, it is used in stereo vision implementations to improve robustness and reliability under non-ideal illumination conditions.

Even though the census transform has this important property, it is necessary to consider important drawbacks such as:

1. long computation time

2. memory footprint

Both drawbacks depend on the window size; in particular, the allocated memory increases as $\sim d^2$, with $d$ being the window size.

For instance, a 3x3 window will generate an 8-bit string per pixel, a 5x5 window will generate a 24-bit string per pixel, and so on. The string length grows quadratically with the window size.

In this thesis work, we have limited the window size to 5x5 and 9x9.

3.2.1 Overview of census transform

This sub-section provides a short overview of the census transform (CT). For more details, the reader is referred to [13].

Figure 3.2.1: a) Sample image - Cones, b) Census transform 3x3

The fundamental idea behind the census transform is given in equations 3.2.1 and 3.2.2. Basically this image transform encodes the local spatial structure in an N-bit string.

Let $P = (x, y)$ be a pixel coordinate and $I(P)$ its intensity (e.g. an 8-bit integer). Let us define $N(P)$ as the set of pixels in a square neighborhood of diameter $d$ surrounding $P$.


Figure 3.2.2: Census window

The census transform uses the intensity relation between the center pixel and the neighboring pixels in the defined window.

$$\xi(I(P), I(P')) = \begin{cases} 1 & I(P') < I(P) \\ 0 & I(P') \geq I(P) \end{cases} \quad (3.2.1)$$

$$I_{CT}(x, y) = \bigotimes_{i} \bigotimes_{j} \xi\big(I(x, y), I(x + i, y + j)\big) \quad (3.2.2)$$

$$N_{bit} = d^2 - 1 \quad (3.2.3)$$

where the operator $\bigotimes$ denotes bit-wise concatenation, $(x, y)$ are the pixel coordinates and $(i, j)$ ranges over the $n \times m$ census window (usually $n = m = d$).

Each pixel of the image is replaced by the N-bit string $I_{CT}(x, y)$, whose size (in bits) is given by 3.2.3. The bits of $I_{CT}(x, y)$ are obtained by applying equation 3.2.1, comparing the intensity of all pixels inside the set $N(P)$ (except the center pixel) with the intensity of the center pixel.


Figure 3.2.3: Example with d=3

To understand in detail how the $I_{CT}(x, y)$ string is computed, let us consider the examples in figures 3.2.4 and 3.2.5.

     67   21  137          0 1 0
    167  (58) 255   ==>    0 X 0   ==>   01000001
    176  233   20          0 0 1

Figure 3.2.4: Census transform 3x3 example

     41  154  115  211   27          1 0 0 0 1
    203   67   21  137  246          0 0 1 0 0
     79  167  (58) 255    1   ==>    0 0 X 0 1   ==>   100010010000010001010100
    135  176  233   20  198          0 0 0 1 0
     42  191   39  113  209          1 0 1 0 0

Figure 3.2.5: Census transform 5x5 example
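To make the procedure of figures 3.2.4 and 3.2.5 concrete, the following plain C reference sketch (illustrative names, not the OpenCL kernels of section 3.3) computes the census string of a single pixel according to equations 3.2.1 and 3.2.2:

#include <stdint.h>

/* Census transform of the pixel (x, y) over a d x d window (d odd, d*d-1 <= 64).
   img is a gray-scale image of width 'width'; the caller must ensure the window
   stays inside the image. Bits are concatenated row by row, skipping the center
   pixel, as in figure 3.2.4. */
static uint64_t census_pixel(const uint8_t *img, int width, int x, int y, int d)
{
    const int r = d / 2;
    const uint8_t center = img[x + y * width];
    uint64_t str = 0;

    for (int j = -r; j <= r; j++) {
        for (int i = -r; i <= r; i++) {
            if (i == 0 && j == 0)
                continue;                      /* skip the center pixel */
            const uint8_t p = img[(x + i) + (y + j) * width];
            str = (str << 1) | (p < center);   /* eq. 3.2.1: 1 if neighbor < center */
        }
    }
    return str;                                /* (d*d - 1)-bit census string */
}

Applied to the 3x3 window of figure 3.2.4, this routine produces exactly the bit string 01000001 shown there.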

3.2.2 The modified census transform (MCT)

The modified census transform (MCT), introduced for the first time by Bernhard Froba and Andreas Ernst [15] ten years later (2004) for automatic face and gesture recognition, is a non-parametric local transform that modifies the census transform developed by Zabih and Woodfill. The census transform, even if robust against variations of brightness, shows a sensitivity to high-frequency noise [14]. The modified census transform was introduced to overcome this aspect.

Once again this new transform is based on a set of comparisons of pixel intensities in a local neighborhood; however, instead of comparing the neighboring pixels with the center pixel, it compares their values with the mean intensity of the local window.

With this we can now reformulate equations 3.2.1 and 3.2.2 as 3.2.4 and 3.2.5:

$$\xi(\bar{I}(P), I(P')) = \begin{cases} 1 & I(P') < \bar{I}(P) \\ 0 & I(P') \geq \bar{I}(P) \end{cases} \quad (3.2.4)$$

$$I_{MCT}(x, y) = \bigotimes_{i} \bigotimes_{j} \xi\big(\bar{I}(x, y), I(x + i, y + j)\big) \quad (3.2.5)$$

where $\bar{I}(P)$ is the mean pixel intensity of the local window.

In the work of Bernhard Froba and Andreas Ernst [15] a 3x3 census window is always assumed, and the mean pixel intensity is computed inside that window.

In our implementation the mean pixel intensity is always computed in a 3x3 window centered on P, independently of the census window size.

We followed this strategy for computing the mean pixel intensity for two reasons:

1. Computation time.

2. Increasing the window size used to compute the mean pixel intensity reduces the level of detail. This aspect is very important in a stereo vision system when good accuracy in the 3D scene reconstruction is desired.

Once we have calculated the mean pixel intensity, all pixels in the neighborhood square window are compared with this value.

For example, let us consider the following 5x5 census window:

     41  154  115  211   27
    203   67   21  137  246
     79  167   58  255    1
    135  176  233   20  198
     42  191   39  113  209

First, the mean intensity is computed in a 3x3 window centered on 58.

The computation of the average intensity can be done with different low-pass kernels (e.g. mean filter, Gaussian filter).

In our work we have used both the mean filter and the Gaussian filter and we have not found significant improvements using either of the two kernels in particular.

$$K(m, n) = \frac{1}{9} \cdot \begin{pmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{pmatrix} \qquad I_{dst}(x, y) = \frac{1}{9} \cdot \sum_{m=-1}^{1} \sum_{n=-1}^{1} I_{src}(x + m, y + n)$$

Figure 3.2.6: Mean kernel

Regarding our example, we obtain $\bar{I} = 126$.

After the calculation of the mean pixel intensity, we are ready to compute the 24-bit string:

     41  154  115  211   27          1 0 1 0 1
    203   67   21  137  246          0 1 1 0 0
     79  167   58  255    1   ==>    1 0 X 0 1   ==>   101010110010010001010110
    135  176  233   20  198          0 0 0 1 0
     42  191   39  113  209          1 0 1 1 0

(comparisons against $\bar{I} = 126$)

Figure 3.2.7: Modified census transform 5x5 example

The modified census transform allows us to be robust against noise, particularly salt-and-pepper noise, and, at the same time, improves the accuracy of fine details in the stereo vision algorithm.
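Under the same assumptions as the census sketch shown earlier, the modified census transform only changes the reference value used in the comparisons; a minimal sketch:

#include <stdint.h>

/* Mean intensity of the 3x3 window centered on (x, y), truncated to an integer;
   this value replaces the center pixel in the comparisons of eq. 3.2.4. */
static uint8_t mean3x3(const uint8_t *img, int width, int x, int y)
{
    int sum = 0;
    for (int j = -1; j <= 1; j++)
        for (int i = -1; i <= 1; i++)
            sum += img[(x + i) + (y + j) * width];
    return (uint8_t)(sum / 9);
}

/* Modified census transform of pixel (x, y): identical to census_pixel() except
   that every neighbor is compared against the 3x3 mean instead of the center. */
static uint64_t mct_pixel(const uint8_t *img, int width, int x, int y, int d)
{
    const int r = d / 2;
    const uint8_t mean = mean3x3(img, width, x, y);
    uint64_t str = 0;

    for (int j = -r; j <= r; j++)
        for (int i = -r; i <= r; i++) {
            if (i == 0 && j == 0)
                continue;
            str = (str << 1) | (img[(x + i) + (y + j) * width] < mean);
        }
    return str;
}

On the 5x5 window of figure 3.2.7 this produces the 24-bit string 101010110010010001010110.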

3.2.2.1 Mean filter 3x3 and Gaussian filter 3x3: Performance analysis

kernels/filter.cl

Kernel name: MeanFilter3x3_16px or GaussianFilter3x3

The mean and Gaussian filters are two classic examples of parametric local transforms, because they use a parametric kernel (figure 3.2.8). In this section we give the performance analysis for the MeanFilter3x3 and GaussianFilter3x3 kernels.


$$K_{mean}(m, n) = \frac{1}{9} \cdot \begin{pmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{pmatrix} \qquad K_{gaussian}(m, n) = \frac{1}{16} \cdot \begin{pmatrix} 1 & 2 & 1 \\ 2 & 4 & 2 \\ 1 & 2 & 1 \end{pmatrix}$$

Figure 3.2.8: Parametric kernels for the "Mean filter" and "Gaussian filter"

Resolution    GPU exec. [ms]
320x240       0.046
640x480       0.163
1024x768      0.415
2048x1536     1.568
(ARM Mali-T604)

Table 3.1: Mean filter 3x3: Performance analysis

Resolution    GPU exec. [ms]
320x240       0.049
640x480       0.183
1024x768      0.452
2048x1536     1.765
(ARM Mali-T604)

Table 3.2: Gaussian filter 3x3: Performance analysis

3.3 OpenCL census transform case study

The computation of the census transform is inherently a data parallel task because each pixel depends only on the intensity values of a finite number of neighboring pixels [24].

In this section the main design decisions taken in order to implement optimally fast OpenCL code will be shown.

In particular, with the census transform 5x5 and 9x9 it will be demonstrated that the byte arrangement should not be underestimated either when seeking a good speed-up through GPU compute. As will be explained later, a new byte arrangement will not only reduce the number of store operations in the kernel, improving its performance, but will also allow vectorized implementations of the following stages in the stereo vision pipeline.


3.3.1 Census transform 3x3

kernels/census_transform3x3.cl

Kernel name: CensusTransform3x3_16px

Figure 3.3.1: Census transform 3x3 OpenCL block

The implementation is based on:

1. a single kernel

2. each thread computes 16 pixels at once

3. 8 uchar vload16 operations for loading the neighborhood pixels of each window

4. 1 uchar vload16 operation for loading the center pixel of each window

5. 1 uchar vstore16 operation for storing the results of the 16 pixels

Each vload16 loads 16 pixels at a time from an 8-bit gray-scale image.

Figure 3.3.2: Census transform 3x3 window - single window

To compute the census 3x3 string we have to compare px0, px1, ..., px7 with c0 (figure 3.3.2) and apply equation 3.2.1.

In order to compute 16 pixels at once per thread, we have to take into account figure 3.3.3, where we can see 16 overlapped windows.


Figure 3.3.3: Census transform 3x3 window - single pixel

Let us suppose we want to compute bit 7 of the census string (figure 3.3.4).

Figure 3.3.4: bit7 in the census transform 3x3 string

As we can see from the CL pseudo-code 3.1, first we load the 16 center pixels and px0 for all the 16 overlapped windows into two uchar16 vectors:

1. uchar16 center: contains the center value

2. uchar16 row: contains px0 for all 16 windows

After that we execute "string = select((uchar16)0x00, (uchar16)0x80, row < center);" to compute bit 7 for all the 16 census windows.


Figure 3.3.5: Computing bit7 of the 8-bit string

Algorithm 3.1 CL pseudo-code for computing bit 7 of the census 3x3 string

// Load the center pixel for each of the 16 windows
const uchar16 center = vload16(0, src_image + x + y * width);
uchar16 string;
...
// Load px0 for each window
uchar16 row = vload16(0, src_image + (x - 1) + y * width);
...
// Compute bit7 for each window
string = select((uchar16)0x00, (uchar16)0x80, row < center);
...
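Extending the pseudo-code to all eight bits, the full kernel could be organized as in the following sketch (a reconstruction for illustration only, not the project's CensusTransform3x3_16px kernel: the argument names, the neighbor/bit ordering and the absence of border handling are assumptions; a production kernel would unroll the fixed-count loop, in line with rule 4 of section 2.5):

/* Each work-item computes the 8-bit census string of 16 consecutive pixels. */
__kernel void CensusTransform3x3_sketch(__global const uchar *src,
                                        __global uchar       *dst,
                                        const int             width)
{
    const int x = get_global_id(0) * 16;
    const int y = get_global_id(1);

    const uchar16 center = vload16(0, src + x + y * width);

    /* Offsets of the 8 neighbors, row by row, skipping the center; the exact bit
       ordering is an assumption - any fixed order works, as long as the left and
       right images are transformed with the same one. */
    const int offs[8] = { -width - 1, -width, -width + 1,
                          -1,                  +1,
                          +width - 1, +width, +width + 1 };

    uchar16 string = (uchar16)(0);
    uchar bit = 0x80;                       /* bit 7 corresponds to the first neighbor */

    for (int k = 0; k < 8; k++) {
        const uchar16 row = vload16(0, src + x + y * width + offs[k]);
        string |= select((uchar16)0x00, (uchar16)bit, row < center);
        bit >>= 1;
    }

    vstore16(string, 0, dst + x + y * width);
}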

3.3.1.1 Census transform 3x3: Performance analysis

Resolution    GPU exec. [ms]
320x240       0.051
640x480       0.191
1024x768      0.478
2048x1536     1.798
(ARM Mali-T604)

Table 3.3: Census transform 3x3: Performance analysis


3.3.2 Modified census transform 3x3

kernels/census_transform3x3.cl

Kernel name: ModifiedCensusTransform3x3_16px

Figure 3.3.6: Modified census transform - Block scheme

The modified census transform 3x3 implementation is equivalent to the census transform 3x3. The only difference is the additional input parameter "mean image", which stores the mean value for each pixel.

Before running the ModifiedCensusTransform3x3 kernel, it is necessary to compute the mean filter 3x3 on the source image with the kernel MeanFilter3x3 or with the kernel GaussianFilter3x3.

3.3.3 Census transform 5x5

kernels/census_transform5x5.cl

Kernel name: CensusTransform5x5_16px_0, CensusTransform5x5_16px_12,

Kernel name: CensusTransform5x5_Opt_16px_0, CensusTransform5x5_Opt_16px_12

Figure 3.3.7: Census transform 5x5 OpenCL block

Census transform with a large window requires many comparison operations between pixel values [3]. This sub-section will describe a possible way to cope with this problem.

First of all it is necessary to point out that we implemented two versions of census transform 5x5.


Both produce a 24-bit string for each pixel using two kernels (figure 3.3.8), but they nevertheless differ in terms of byte arrangement.

The first version, called the "standard version", is composed of the following kernels:

1. CensusTransform5x5_16px_0 - Computes 1 byte of the full string; each thread processes 16 pixels

2. CensusTransform5x5_16px_12 - Computes 2 bytes of the full string; each thread processes 16 pixels

while the second one, called the "optimized version", is composed of:

1. CensusTransform5x5_Opt_16px_0 - Computes 1 byte of the full string; each thread processes 16 pixels - Byte re-arrangement

2. CensusTransform5x5_Opt_16px_12 - Computes 2 bytes of the full string; each thread processes 16 pixels - Byte re-arrangement

These kernels were implemented in a similar manner to the kernel CensusTransform3x3.

Figure 3.3.8: Kernels for the census transform 5x5

Let us start by introducing what the two implementations have in common.

In order to make the implementation as GPU friendly as possible, we have defined two sub-masks in the census window, and each kernel computes one of them (figure 3.3.10).

The kernel with suffix _16px_0 computes byte 0, and the kernel with suffix _16px_12 computes bytes 1 and 2 of the census string.

Using this strategy, the bit ordering is different from the standard census transform 5x5 (figure 3.3.9), but this does not change the final result.


Figure 3.3.9: Census transform 5x5, 24-bit string

Figure 3.3.10: Sub-masks for the census transform 5x5

The difference between the "standard" and the "optimized" version concerns the byte arrangement.

Basically, the byte arrangement for the census transform 5x5 described by Zabih and Woodfill is shown in figure 3.3.11, where each pixel is defined by 3 bytes arranged in a consecutive manner, such as in the RGB888 image format.

Each byte of the pixel $P = (x_0, y_0)$ is addressable with relation 3.3.1:

$$addr_{byte\text{-}K}(x_0, y_0) = x_0 + y_0 \cdot width + K \quad (3.3.1)$$


Figure 3.3.11: Byte order for the standard census transform 5x5

This scheme is followed by the standard version and has two main drawbacks:

1. Many store operations in the census transform 5x5 kernels

2. If the bytes are arranged in this consecutive manner, it is really difficult to vectorize the subsequent stages of the stereo vision pipeline.

For this reason we have devised a new byte arrangement, as described in figure 3.3.12.

Once again each pixel is defined by 3 bytes, but each byte of the pixel $P = (x_0, y_0)$ is addressable with relation 3.3.2:

$$addr_{byte\text{-}K}(x_0, y_0) = x_0 + y_0 \cdot width + K \cdot width \cdot height \quad (3.3.2)$$

where K is the byte index.


Figure 3.3.12: Census transform 5x5 - Byte re-arrangement

As we can observe, in this case we have only re-arranged the bytes of the transformed image; however, this strategy not only allows us to significantly reduce the number of memory accesses per thread, improving the kernel performance, but also allows a vector implementation of the following stages of the pipeline, as we'll see in chapter 5.
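The two addressing schemes of relations 3.3.1 and 3.3.2 can be written down explicitly; a small sketch with illustrative helper functions (not taken from the project code):

#include <stddef.h>

/* Interleaved arrangement, as given by relation 3.3.1 ("standard" version):
   the bytes of a pixel are stored next to each other, as in RGB888. */
static inline size_t addr_interleaved(size_t x, size_t y, size_t k, size_t width)
{
    return x + y * width + k;
}

/* Planar arrangement, relation 3.3.2 ("optimized" version): byte k of every pixel
   lives in its own width x height plane, so a kernel producing byte k can write
   16 consecutive pixels with a single vstore16. */
static inline size_t addr_planar(size_t x, size_t y, size_t k,
                                 size_t width, size_t height)
{
    return x + y * width + k * width * height;
}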


3.3.3.1 Performance analysis and comparison with CPU implementation

The following is the performance analysis for both the "standard" and the "optimized" census transform 5x5 implementations.

Resolution    GPU exec. [ms] "standard"    GPU exec. [ms] "optimized"    Speed-up
320x240       0.240                        0.168                         1.4x
640x480       0.926                        0.654                         1.4x
1024x768      2.358                        1.265                         1.9x
2048x1536     9.335                        6.268                         1.5x
(ARM Mali-T604)

Table 3.4: Census transform 5x5: "standard" and "optimized" comparison

3.3.4 Modified Census Transform 5x5

kernels/census_transform5x5.cl

Kernel name: ModifiedCensusTransform5x5_16px_0, ModifiedCensusTransform5x5_16px_12,

Kernel name: ModifiedCensusTransform5x5_Opt_16px_0, ModifiedCensusTransform5x5_Opt_16px_12

Figure 3.3.14: Modified census transform - Block scheme

Basically, the modified census transform 5x5 implementation is equivalent to the census transform 5x5. The only difference is the additional input parameter "mean image", which stores the mean value for each pixel.

Before running the ModifiedCensusTransform5x5 kernels, it is necessary to compute the mean filter 3x3 on the source image with the kernel MeanFilter3x3 or with the kernel GaussianFilter3x3.


3.3.5 Census transform 9x9

kernels/census_transform9x9.cl

Kernel name: CensusTransform9x9_16px_01, CensusTransform9x9_16px_23, CensusTransform9x9_16px_45, CensusTransform9x9_16px_67, CensusTransform9x9_16px_89

Kernel name: CensusTransform9x9_Opt_16px_01, CensusTransform9x9_Opt_16px_23, CensusTransform9x9_Opt_16px_45, CensusTransform9x9_Opt_16px_67, CensusTransform9x9_Opt_16px_89

Figure 3.3.15: Census transform 9x9 OpenCL block

The design decisions followed to implement the census transform 9x9 are similar to those followed for the census transform 5x5.

Once again we implemented two versions, as for the census transform 5x5, which produce an 80-bit string for each pixel using five kernels (figure 3.3.8). The two versions differ in terms of byte arrangement.

The first version, called the "standard version", is composed of the following kernels:

1. CensusTransform9x9_16px_01 - Computes bytes 0 and 1 of the full string; each thread processes 16 pixels

2. CensusTransform9x9_16px_23 - Computes bytes 2 and 3 of the full string; each thread processes 16 pixels

3. CensusTransform9x9_16px_45 - Computes bytes 4 and 5 of the full string; each thread processes 16 pixels

4. CensusTransform9x9_16px_67 - Computes bytes 6 and 7 of the full string; each thread processes 16 pixels

5. CensusTransform9x9_16px_89 - Computes bytes 8 and 9 of the full string; each thread processes 16 pixels

while the second one, called the "optimized version", is composed of:

1. CensusTransform9x9_Opt_16px_01 - Computes bytes 0 and 1 of the full string; each thread processes 16 pixels - Byte re-arrangement

2. CensusTransform9x9_Opt_16px_23 - Computes bytes 2 and 3 of the full string; each thread processes 16 pixels - Byte re-arrangement

3. CensusTransform9x9_Opt_16px_45 - Computes bytes 4 and 5 of the full string; each thread processes 16 pixels - Byte re-arrangement

4. CensusTransform9x9_Opt_16px_67 - Computes bytes 6 and 7 of the full string; each thread processes 16 pixels - Byte re-arrangement

5. CensusTransform9x9_Opt_16px_89 - Computes bytes 8 and 9 of the full string; each thread processes 16 pixels - Byte re-arrangement

In order to make the implementation as GPU friendly as possible, we have defined five sub-masks in the census window, and each kernel computes one of them (figure 3.3.10).


3.3.5.1 Performance analysis and comparison with CPU implementation

Resolution    GPU exec. [ms] "standard"    GPU exec. [ms] "optimized"    Speed-up
320x240       0.874                        0.633                         1.4x
640x480       3.263                        2.446                         1.3x
1024x768      8.331                        6.095                         1.4x
2048x1536     33.54                        24.03                         1.4x
(ARM Mali-T604)

Table 3.5: Census transform 9x9: "standard" and "optimized" comparison

3.3.6 Modified census transform 9x9

kernels/census_transform9x9.cl

Kernel name: ModifiedCensusTransform9x9_16px_01, ModifiedCensusTransform9x9_16px_23, ModifiedCensusTransform9x9_16px_45, ModifiedCensusTransform9x9_16px_67, ModifiedCensusTransform9x9_16px_89

Kernel name: ModifiedCensusTransform9x9_Opt_16px_01, ModifiedCensusTransform9x9_Opt_16px_23, ModifiedCensusTransform9x9_Opt_16px_45, ModifiedCensusTransform9x9_Opt_16px_67, ModifiedCensusTransform9x9_Opt_16px_89

Figure 3.3.17: Modified census transform - Block scheme

The modified census transform 9x9 implementation is equivalent to the census transform 9x9. The only difference is the additional input parameter "mean image", which stores the mean value for each pixel.

Before running the ModifiedCensusTransform9x9 kernels, it is necessary to compute the mean filter 3x3 on the source image with the kernel MeanFilter3x3 or with the kernel GaussianFilter3x3.


Chapter 4

Computer stereo vision

4.1 Introduction

"Be glad you're not a cyclops! Robots that explore other planets must be able to see where they are going. Just like people, robots make good use of two eyes" - spaceplace.nasa.gov

Let us start this chapter by quoting the "urbie cartoon" from the spaceplace.nasa.gov website to show that computer stereo vision, or simply stereo vision, is not only an attractive subject of study and research in machine vision, robotics and image analysis, but also finds application in many real-world cases such as space missions, gesture-based user interfaces, 3D scene reconstruction, smart cameras, augmented reality, industrial automation, autonomous vehicles and many more.

In the automotive field, for example, the stereo camera enables a multitude of functions that make driving safer and more comfortable, such as automatic emergency braking, traffic jam assist, construction zone assist, adaptive cruise control (ACC), intelligent headlight control, lane departure warning and lane keeping support, as well as road sign recognition. The complete three-dimensional recording of the vehicle's surroundings also provides the basis for the automated driving functions of the future1.

1 http://www.bosch-automotivetechnology.com/en/de/component/CO_CV_DA_Adaptive-Cruise-Control_CO_CV_Driver-Assistance_2196.html


Figure 4.1.1: Subaru Eyesight Stereo camera

Figure 4.1.2: Mercedes safety system with Stereo camera on board

Microsoft introduced the stereo camera Kinect in 2009 for its own game console, in order to allow gamers to interact and play using natural gestures.


Stereo vision has also landed on smartphones. For example, it is possible to find a stereo vision system in the LG P925, used to capture stereoscopic photos and videos (figures 4.1.4 and 4.1.6).

An interesting augmented reality project that involves a stereo vision system is "Stereoscopic Vision for the Blind" (figure 4.1.5). Binocular vision makes this possible, and will thus further enhance the applicability and versatility of The vOICe as an ETA (Electronic Travel Aid) for the blind, in addition to its general vision substitution and synthetic vision features that realize augmented reality for the blind2.

Figure 4.1.4: LG P925

Figure 4.1.5: Stereoscopic Vision for blind

Figure 4.1.6: Anaglyph stereoscopy - Terrazza Mascagni (Livorno, Italy)

2 http://www.seeingwithsound.com/binocular.htm

4.2 Fundamentals of stereo vision

In a traditional stereo vision system, two cameras allow us to obtain depth information by comparing the two different views of the scene, in a similar manner to human binocular vision.

Figure 4.2.1: Stereo camera and human binocular vision

When we talk about “depth data” we mean the possibility to know real-world distances (in meters) between the cameras and the objects in the scene.

In order to measure real-world distances, the values of the sensor pixel size, the focal length and the distance between the two cameras (baseline) have to be known.


Figure 4.2.2: Sensor (image from http://www.dxomark.com/Reviews/More-pixels-offset-noise/Modeling-small-pixels), focal length and baseline

If this information were not available, we could only infer "relative distance". In other words, without this camera geometry information we cannot measure real-world distances, but we can only estimate which objects are close to or far from the cameras. This last aspect is really similar to human vision (figure 4.2.1): human vision, in fact, cannot "measure distance" in the true sense of the word, but can only estimate "how far away an object is".

4.2.1 Stereo vision depth mapping

Stereo vision depth mapping is a method to recover spatial depth information using two images captured from a stereoscopic camera (figure 4.2.3).

The result of this method is a depth map (figure 4.2.3), which is an image that contains information related to the distances of the objects in the scene. The concept is completely analogous to the depth buffer and Z-buffer.

The depth map is visualized as a gray-scale image where each pixel's brightness corresponds to a depth level. Lower depth values result in lighter pixels (figure 4.2.3).

Figure 4.2.3: a) and b) Interior of the Roman Colosseum - Italy. Images taken from http://urixblog.com/. c) The dense depth map computed by our proposed stereo vision algorithm

There are two different kinds of depth map:

1. Sparse depth map: it contains information related to the distance of the surfaces of scene objects only for a reduced number of points (figure 4.2.4). This kind of depth map can potentially be generated by fast stereo vision algorithms due to the reduced number of analyzed points.

2. Dense depth map: it contains information related to the distance of the surfaces of scene objects for all points (figure 4.2.4).

Figure 4.2.4: a) Sparse depth map and b) dense depth map

4.2.2 Horizontal epipolar line constraint

Two cameras are required to implement a stereo vision system. For simplicity, we assume that:

1. Cameras optically identical: same image sensor and same focal length for both cameras

2. Cameras horizontally aligned: the displacement between the two cameras is b and is called the baseline (figure 4.2.7)

3. Images rectified: the rectifying procedures are necessary to reduce the lens distortion

4. Images taken at the same instant: this requirement is application dependent. It is not too important for static scenes. A typical example of stereo images taken at different instants is given by aerial stereo images, such as in figure 4.2.6.

Figure 4.2.5: Unrectified and rectified images

Figure 4.2.6: Aerial stereoscopic images. Only one camera is used, but at different time instants

With the above simplifications, the left and right image planes lie in a common plane, as in figure 4.2.7.

This means that a generic point at coordinate $(x_0, y_0)$ in one image will be located in the other one on the same scanline $y = y_0$ but at a different x-coordinate (figure 4.2.7).

The horizontal scanline $y = y_0$ is called the horizontal epipolar line, and the difference in x-coordinates ($d = x_l - x_r$) of the corresponding point in the left and right image is called the disparity.

This geometric stereo vision restriction is known as the horizontal epipolar line constraint.


Figure 4.2.7: Left and right images in a common plane. The difference in x-coordinates ($d = x_l - x_r$) of the corresponding point P in the left and right image is called the disparity.

4.3 Depth from disparity via triangulation

Figure 4.3.1: Stereo vision system: horizontal epipolar constraint

The technique to extract depth information given two offset images is called "triangulation".

Let's define Z as the point-line distance between the already defined baseline and a generic point P (figure 4.3.2).

Figure 4.3.2: Distance Z

Since the two cameras are separated by a distance b, both view the same real-world point P in a different location. In particular, the point P will lie on the same scanline in both images if the horizontal epipolar line constraint is satisfied.

In the above figure 4.3.2, $x_l$ and $x_r$ are the x-coordinates (in pixels) of the point P in the left and right images respectively. The distance Z (in meters) is what we are looking for.

NOTE: From the above geometric representation 4.3.2 we can see an interesting thing: if the x-coordinate of the point P in the left image is $x_l$, this point in the right image will be located at an x-coordinate $x_r \leq x_l$. Conversely, if the x-coordinate of the point P in the right image is $x_r$, this point in the left image will be located at an x-coordinate $x_l \geq x_r$.

The main relation to compute the distance Z is given by the formula 4.3.1 that derives from the geometric relations between the similar triangles in figure 4.3.3.

Figure 4.3.3: T1 and T2 similar triangles; T3 and T4 similar triangles

$$Z = \frac{b \cdot f}{d \cdot px_{size}} \quad (4.3.1)$$

where:

1. Z = distance between the baseline and a generic point P [in meters]

2. b = baseline [in meters]

3. d = $x_l - x_r$, the disparity value [in pixels]

4. f = focal length [in meters]

5. $px_{size}$ = pixel size [in meters]


since

$$\frac{b \cdot f}{px_{size}} = \text{constant} = K \quad (4.3.2)$$

we can write equation 4.3.1 as:

$$Z = \frac{K}{d} \quad (4.3.3)$$

This formula computes the real-world distance (in meters) of a point P if its disparity value is known. If we were interested only in its relative distance rather than the exact distance, we could apply formula 4.3.4.

$$Z_r = \frac{1}{d} \quad (4.3.4)$$

where:

1. $Z_r$ = relative distance

2. d = $x_l - x_r$, the disparity value [in pixels]
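A small sketch of relations 4.3.1-4.3.4 in code (parameter names are illustrative; a zero disparity is treated as "depth unavailable"):

/* Metric depth (in meters) from a disparity value (in pixels), eq. 4.3.1/4.3.3.
   baseline and focal_length in meters, pixel_size in meters per pixel.
   Returns a negative value when the disparity is zero (depth undefined). */
static float depth_from_disparity(float disparity, float baseline,
                                  float focal_length, float pixel_size)
{
    if (disparity <= 0.0f)
        return -1.0f;
    const float K = (baseline * focal_length) / pixel_size;   /* eq. 4.3.2 */
    return K / disparity;                                      /* eq. 4.3.3 */
}

/* Relative (unitless) distance when the camera geometry is unknown, eq. 4.3.4 */
static float relative_depth(float disparity)
{
    return (disparity > 0.0f) ? 1.0f / disparity : -1.0f;
}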

4.3.1 The disparity map

Previously we talked about the depth map. However, in a stereo vision implementation the concept of the disparity map is prominent. As a matter of fact, the two concepts are not too different, as depth and disparity are related by 4.3.1. The disparity map is visualized as a gray-scale image where each pixel's brightness corresponds to a disparity level. High disparity values result in lighter pixels (figure 4.3.4).

Once we have the disparity map, we can back-project the pixels into 3D space (figure 4.3.5), using relation 4.3.1, producing a 3D point cloud such as in figure 4.3.4. Usually the point cloud is saved in a file, e.g. a Wavefront .obj, containing the list of vertices that represent the reconstructed 3D scene.

Figure 4.3.4: "Teddy" - left image, right image, disparity map and 3D point cloud


Figure 4.3.5: 3D back-projection (point cloud)

4.3.2 Considerations about the depth resolution

Let us suppose we have an image sensor and stereo setup with:

1. $px_{size}$ = 17 µm

2. f = 18.9 mm

3. b = 75 mm

4. disparity range [0-40], so K = 83.4 m


Figure 4.3.6: Depth vs Disparity

As can be seen from the above stereo vision characteristic, the maximum range for object detection depends on the disparity range used. For instance, if we use a disparity range (0-40) we have an object detection range of about 2-83 m.

However, it is necessary to consider an important aspect: for lower disparities, for instance in an interval of [1-3], small steps of disparity mean big variations of distance (e.g. d = 1 ⇒ Z = 83 m, d = 2 ⇒ Z = 41 m). In contrast, for higher disparities, for instance in an interval of [20-40], small steps of disparity mean small variations of distance (e.g. d = 30 ⇒ Z = 2.71 m, d = 31 ⇒ Z = 2.68 m).

This aspect can be explained better by evaluating the distance relative error. Let us consider the distance relative error:

$$\varepsilon_r = \frac{\Delta Z}{Z_t} = \frac{Z - Z_t}{Z_t} \quad (4.3.5)$$

where Z and $Z_t$ are respectively the estimated distance and the actual distance in meters.

$$\varepsilon_r = \frac{Z - Z_t}{Z_t} = \frac{\frac{K}{d} - \frac{K}{d_t}}{\frac{K}{d_t}} \quad (4.3.6)$$

where d and $d_t$ are respectively the estimated disparity and the actual disparity in pixels.

$$\varepsilon_r = \frac{\frac{1}{d} - \frac{1}{d_t}}{\frac{1}{d_t}} = \frac{d_t - d}{d} \quad (4.3.7)$$

$$\varepsilon_r = \frac{\Delta Z}{Z_t} = \frac{\Delta d}{d} \quad (4.3.8)$$


From the above relation we can observe that, operating at fixed ∆d, the relative error is greater for lower disparities.

To improve the depth accuracy at long distance, we could increase the constant K by changing f, b or $px_{size}$; since f and $px_{size}$ are fixed by the image sensor, we have only one degree of freedom (DOF), which is increasing the camera separation b [11].

However, the separation can not be increased without introducing a further problem: as the separation is increased, fewer points will be common to both images (figure 4.3.7).

Figure 4.3.7: Increasing the baseline, fewer points will be common to both images

Finally, formula 4.3.9 gives us the smallest achievable depth range resolution ∆Z for each distance Z.

It is useful to keep this formula in mind in order to know what kind of depth resolution we can expect from our stereo vision system [17].

$$\Delta Z = \frac{Z^2 \cdot px_{size}}{f \cdot b} \cdot \Delta d \quad (4.3.9)$$
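As a worked example, plugging the sensor of section 4.3.2 ($px_{size} = 17\,\mu m$, $f = 18.9\,mm$, $b = 75\,mm$) and a disparity step $\Delta d = 1$ into formula 4.3.9 gives (values rounded):

$$\Delta Z \approx \frac{(2.7\,\mathrm{m})^2 \cdot 17 \cdot 10^{-6}\,\mathrm{m}}{0.0189\,\mathrm{m} \cdot 0.075\,\mathrm{m}} \approx 0.09\,\mathrm{m} \qquad \Delta Z \approx \frac{(40\,\mathrm{m})^2 \cdot 17 \cdot 10^{-6}\,\mathrm{m}}{0.0189\,\mathrm{m} \cdot 0.075\,\mathrm{m}} \approx 19\,\mathrm{m}$$

That is, near the camera a one-pixel disparity step corresponds to a few centimeters of depth, while towards the far end of the working range it corresponds to tens of meters.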

4.4 The correspondence problem

To pursue the task of depth mapping we have to know the disparity value of each pixel: in order to know these values, it is necessary to deal with the correspondence problem.

Given two or more images of the same 3D scene, taken from different points of view, the correspondence problem refers to the technique of finding a set of points in one image which can be identified as the same points in the other one (figure 4.4.1).


Figure 4.4.1: The correspondence problem on “teddy” image

There are two possible strategies to pursue this task:

1. Feature-based method

2. Correlation-based method

The feature-based stereo matching method works by extracting strong image features (such as edges, corners, lines...) in both images and looking for matching features between the two. Typically with this approach it is possible to compute only sparse depth maps, although with few erroneous values.

In contrast, the correlation-based method is based on the technique of "template matching" and leads to dense depth maps, generally including more erroneous values [21].

(58)

4.4.1

Template matching for the correlation based method

Template matching is the process of comparing and evaluating regions of an image against another one called "template" [11]. In order to evaluate how similar a region is to the template, a measure of similarity is computed.

In other words, given a pattern (template) in the reference image, template matching finds the most similar area in the target image using a proper measure of similarity between the template and the region.

Template matching is used in correlation-based methods to find corresponding pixels in the left and right images. Under the horizontal epipolar line constraint, the corresponding pixels lie along the same horizontal scanline in both images (figures 4.4.3 and 4.4.4). The horizontal epipolar line is also known as search space.

Figure 4.4.3: Template matching with left reference image and right target image (LR). The horizontal epipolar line is the search space.


Figure 4.4.4: Template matching with right reference image and left target image (RL). The horizontal epipolar line is the search space.

For instance, let us consider the left image as reference and the right image as target: taking a block of NxN pixels (template) in the reference image centered in (x0, y0), template matching computes the measure of similarity between the template and all NxN blocks in the search space of the target image.

Figure 4.4.5: Reference and target image

In our example the search space lies along the horizontal epipolar line and is given by the interval (x0 − dmax, x0), x0 being the x-coordinate of the template in the reference image.

NOTE: If we had taken the right image as reference and the left image as target, the search space would have been (x0, x0 + dmax).
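A minimal sketch (plain C; the helper name, the dir flag and the clamping policy are illustrative assumptions, not part of the thesis code) of how the two search intervals could be computed and clamped to the image width:

/* Minimal sketch: horizontal search interval for a template centred at x0.
 * dir = 0: left image is the reference (interval (x0 - dmax, x0))
 * dir = 1: right image is the reference (interval (x0, x0 + dmax))
 * The interval is clamped to [0, width - 1]. */
typedef struct { int x_min; int x_max; } search_range_t;

static search_range_t search_space(int x0, int dmax, int width, int dir)
{
    search_range_t r;

    if (dir == 0) {
        r.x_min = (x0 - dmax < 0) ? 0 : x0 - dmax;
        r.x_max = x0;
    } else {
        r.x_min = x0;
        r.x_max = (x0 + dmax >= width) ? width - 1 : x0 + dmax;
    }
    return r;
}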

Table 4.1 gives a brief overview of the most common measures of similarity used for template matching.


Measure of similarity                             Formula
SAD  (Sum of Absolute Difference)                 \sum_{(i,j) \in W} |I_1(i,j) - I_2(x+i, y+j)|
SSD  (Sum of Squared Difference)                  \sum_{(i,j) \in W} (I_1(i,j) - I_2(x+i, y+j))^2
ZSSD (Zero-mean Sum of Squared Difference)        \sum_{(i,j) \in W} ((I_1(i,j) - \bar{I_1}) - (I_2(x+i, y+j) - \bar{I_2}))^2
NCC  (Normalized Cross Correlation)               \frac{\sum_{(i,j) \in W} I_1(i,j) \cdot I_2(x+i, y+j)}{\sqrt{\sum_{(i,j) \in W} I_1^2(i,j) \cdot \sum_{(i,j) \in W} I_2^2(x+i, y+j)}}
SHD  (Sum of Hamming Distance)                    \sum_{(i,j) \in W} popcount(I_1(i,j) \ \text{bitwise XOR} \ I_2(x+i, y+j))

Table 4.1: Measures of similarity (\bar{I} denotes the mean intensity over the window W)
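As a concrete reference, a minimal sketch (plain C, assuming 8-bit single-channel row-major images; function and parameter names are illustrative, not the thesis code) of the SAD cost over an NxN window:

/* Minimal sketch: SAD cost over an N x N window (N odd).
 * Assumption: 8-bit, single-channel, row-major images with a common stride.
 * x_l / x_r are the window centres in the left and right image. */
#include <stdint.h>
#include <stdlib.h>

static uint32_t sad_cost(const uint8_t *left, const uint8_t *right,
                         int stride, int x_l, int x_r, int y, int n)
{
    uint32_t cost = 0;
    int half = n / 2;

    for (int j = -half; j <= half; j++) {
        for (int i = -half; i <= half; i++) {
            int a = left [(y + j) * stride + (x_l + i)];
            int b = right[(y + j) * stride + (x_r + i)];
            cost += (uint32_t)abs(a - b);
        }
    }
    return cost;
}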

Once all the NxN blocks in the search space have been compared with the template, the most similar block is the one with the minimum measure-of-similarity value (figure 4.4.6). The measure-of-similarity value is also known as "cost".

At this point the disparity value can be estimated as the difference between the x-coordinates of the template and of the matched block in the target image.


Figure 4.4.6: Function matching cost

In a formal way, the disparity value d_p of a pixel p in the left image is computed as [27]:

d_p = \operatorname*{argmin}_{0 \le d \le d_{max}} \sum_{q \in W_p} c(p, q - d) \quad (4.4.1)

(a plain-C sketch of this winner-takes-all search follows the list below) where:

1. argmin returns the value at which the function takes its minimum

2. d_max is a parameter defining the maximum disparity (search range)

3. W_p is the set of all pixels inside the window centered on p

4. c(p, q − d) is the aggregated matching score based on the measure of similarity used.
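The sketch below (plain C, not the thesis' OpenCL kernel) applies equation 4.4.1 using the sad_cost() helper sketched after table 4.1; all names are illustrative assumptions:

/* Minimal sketch: winner-takes-all disparity for a pixel (x, y) of the left
 * (reference) image. Relies on the sad_cost() helper sketched earlier;
 * any measure of similarity from table 4.1 could be used instead. */
#include <stdint.h>

static int best_disparity(const uint8_t *left, const uint8_t *right,
                          int stride, int x, int y, int n, int d_max)
{
    uint32_t best_cost = UINT32_MAX;
    int      best_d    = 0;

    /* Scan the search space (x - d_max, x) along the horizontal epipolar line */
    for (int d = 0; d <= d_max && (x - d) >= n / 2; d++) {
        uint32_t cost = sad_cost(left, right, stride, x, x - d, y, n);
        if (cost < best_cost) {
            best_cost = cost;
            best_d    = d;
        }
    }
    return best_d;
}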

4.4.2 The Sum of Hamming Distance (SHD) as measure of similarity


The Sum of Hamming Distances is normally employed to match census-transformed images, as suggested by Ramin Zabih and John Woodfill in their work [13], but it can be used on images that have not been census-transformed as well.

In information theory, the Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols are different (see http://en.wikipedia.org/wiki/Hamming_distance).

In other words, it measures the number of bits that differ between two bit strings (figure 4.4.7).

Figure 4.4.7: Sum of Hamming Distance (SHD)

Given two N-bit strings, the pipeline for SHD is based on two stages (figure 4.4.8):

1. XOR stage: computes the xor operation between two strings

2. popcount: counts the number of '1' bits

Figure 4.4.8: Pipeline for the Sum of Hamming Distance
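Before looking at the OpenCL version, a minimal plain-C sketch of this two-stage pipeline (the 8-bit popcount mirrors the SWAR scheme used later in algorithm 4.1; all names are illustrative assumptions):

/* Minimal sketch: Sum of Hamming Distances between two n-byte bit strings.
 * Stage 1: XOR the corresponding bytes.
 * Stage 2: popcount the result and accumulate. */
#include <stdint.h>

static uint8_t popcount8(uint8_t v)
{
    v = (v & 0x55) + ((v >> 1) & 0x55);  /* 2-bit partial sums */
    v = (v & 0x33) + ((v >> 2) & 0x33);  /* 4-bit partial sums */
    v = (v & 0x0f) + ((v >> 4) & 0x0f);  /* final 8-bit count  */
    return v;
}

static uint32_t hamming_distance(const uint8_t *a, const uint8_t *b, int n)
{
    uint32_t dist = 0;

    for (int i = 0; i < n; i++)
        dist += popcount8(a[i] ^ b[i]);  /* XOR stage + popcount stage */

    return dist;
}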


4.4.2.1 OpenCL implementation

Once again, in order to improve the performance of our algorithm on the GPU, we have to vectorize these stages as much as possible. There is no particular problem regarding the first stage, while it is probably necessary to focus on some implementation aspects of the popcount stage.

First of all, we would like to stress that programming strategies that result in a fast CPU implementation do not necessarily produce fast GPU code, and this is a typical example.

The bit-counting operation, usually called popcount, is often implemented as a single machine instruction on CPUs. For instance, the ARM architecture introduced the VCNT instruction as part of the Advanced SIMD (NEON) extensions. Regarding OpenCL, the popcount built-in function is supported only from OpenCL 1.2 onwards and hence is not available with OpenCL 1.1. This forces us to find an efficient way to implement it.

A first naive implementation could make use of a look-up table which stores the number of ’1’ bits for each number in the interval of 0-255 (figure 4.4.9).

Figure 4.4.9: Look-up table for popcount
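For reference, a minimal sketch (plain C; table and function names are illustrative assumptions) of such a look-up-table popcount, where a 256-entry table holds the number of '1' bits of every byte value:

/* Minimal sketch: naive look-up-table popcount for a single byte.
 * init_popcount_lut() must be called once before popcount_lut8() is used. */
#include <stdint.h>

static uint8_t popcount_lut[256];

static void init_popcount_lut(void)
{
    for (int v = 0; v < 256; v++) {
        uint8_t bits = 0;
        for (int b = 0; b < 8; b++)
            bits += (uint8_t)((v >> b) & 1);
        popcount_lut[v] = bits;
    }
}

static uint8_t popcount_lut8(uint8_t v)
{
    return popcount_lut[v];  /* one (potentially scattered) load per byte */
}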

However, with GPU compute, if the look-up table approach were used we would have two main problems:

1. it would increase the number of load operations; load operations are not "free", while most arithmetic operations are.

2. the memory accesses to the look-up table would be sparse, increasing the number of load-store operations.

In order to overcome these issues, it is advisable to perform fewer load/store operations and to compute even complex expressions, as most of them are "free".

A possible general-purpose algorithm is the "variable-precision SWAR algorithm" (also called "parallel algorithm").


The algorithm is described in a CL-like language in algorithm 4.1.

Algorithm 4.1 Scalar and vector version

// Scalar version: SWAR popcount of a single byte
uchar Popcount(uchar val)
{
    val = (val & 0x55) + ((val >> 1) & 0x55);  // sum adjacent bits (2-bit counts)
    val = (val & 0x33) + ((val >> 2) & 0x33);  // sum adjacent pairs (4-bit counts)
    val = (val & 0x0f) + ((val >> 4) & 0x0f);  // sum the two nibbles
    return val;
}

// Vector version: the same SWAR steps applied to 16 bytes at once
uchar16 Popcount16(uchar16 val)
{
    val = (val & (uchar16)0x55) + ((val >> 1) & (uchar16)0x55);
    val = (val & (uchar16)0x33) + ((val >> 2) & (uchar16)0x33);
    val = (val & (uchar16)0x0f) + ((val >> 4) & (uchar16)0x0f);
    return val;
}

Table 4.2 gives the execution times achieved by the OpenCL popcount kernel using the look-up table and the "parallel" approaches. As can be seen, the speed-up achieved by the parallel algorithm is over 2x.

String length [bytes]   Parallel algorithm [ms]   Look-up table [ms]   Speed-up
76800                   0.195                     0.539                2.8x
307200                  0.785                     2.154                2.7x
786432                  2.059                     5.640                2.7x
3145728                 8.124                     22.26                2.7x

Table 4.2: popcount: look-up table vs parallel algorithm performance analysis

4.5 Challenges in stereo correspondence problem

4.5.1 Color inconsistencies

When we deal with the stereo matching problem, typically we assume that corresponding pixels have the same intensity/color [27].

This is not necessarily true due to:

1. Image noise

2. Different illumination conditions in the left and right images
