“Neuromorphic” motion-in-depth perception applied to real-time interaction with dynamic objects of the iCub robot.


Control of the Humanoid Robot iCub

Luna Gava

DIBRIS - Department of Computer Science, Bioengineering,

Robotics and System Engineering

University of Genoa

Supervisors and Co-supervisors:

Prof. Fabio Solari

University of Genoa

Prof. Manuela Chessa

University of Genoa

Dr. Chiara Bartolozzi

Italian Institute of Technology

In partial fulfilment of the requirements for the degree of

Master of Science in Robotics Engineering


I would like to express my sincere gratitude to all the people who have supported me, and without whom this Thesis would not have been possible. Firstly, I would like to thank my supervisors Prof. Fabio Solari and Prof. Manuela Chessa from the University of Genoa and Dr. Chiara Bartolozzi from the Italian Institute of Technology (IIT), who gave me the opportunity to join the Event-Driven Perception for Robotics (EDPR) team as an intern and who gave me access to the laboratory and research facilities.

Besides my supervisors, I wish to express my deepest gratitude to all the people from the EDPR group, especially Giulia D'Angelo for her availability, support and motivation, from the beginning of the COVID-19 lockdown through online meetings to the end of the thesis writing. I am particularly grateful for the inputs, suggestions and technical assistance given by Arren Glover, Marco Monforte, Suman Ghosh and Massimiliano Iacono.

I would also like to thank my friends of the Robotics Engineering course for making these two stressful years of study more pleasant. Particularly, I would like to thank Marta Lagomarsino and Lucrezia Grassi for being there during all the University exams, projects and study sessions. A very special thanks goes to my bestie Fabio D'Amico for believing in me and for making me laugh and feel good in these years. Last but not least, I wish to acknowledge the support and great love of my family, especially of my parents, who kept me going by being close to me in the most difficult moments.


Motion-in-depth, the movement along the third dimension, has important implications in robotic scenarios where robots navigate in dynamic environments, avoiding approaching objects or interacting with moving entities in the scene. Visual motion perception is challenging for standard frame-based cameras because of the redundant information they capture and the consequent large amounts of data to be processed. In contrast to traditional cameras, bio-inspired event-based cameras respond only to contrast changes in the scene, dramatically reducing the amount of information to be processed. Moreover, their properties, such as low latency and high speed, can be exploited to develop a fast bio-inspired pipeline for visual servoing.

This Thesis proposes an online motion-in-depth estimation pipeline on iCub, the humanoid robot, using event-driven cameras. The system allows us to track two objects in three-dimensional space while computing their velocity along the depth axis. The pipeline integrates a centre-of-mass tracker and a disparity extractor which, by solving the correspondence problem arising when more than one object is present in the scene, is well suited to a multi-object scenario. Two different pipelines have been developed, which differ in the order of integration of the two modules, the tracker and the disparity extractor. Although the pipeline in which the disparity is computed before the tracking is more computationally demanding (processing time in the order of ms), it was chosen to run online on the robot because of its precision in tracking the centre of mass of the object. Experiments in which a person moves their hands in front of the robot were used to evaluate the performance of the implemented pipeline. Moreover, a first approach to the development of the air hockey demonstrator is presented, including preliminary tests. In conclusion, the evaluation and the demonstrator of the robotic application showed good performance of the pipeline in tracking and computing the motion in depth of objects in front of the robot.

Keywords: Event-based cameras, motion-in-depth, changing disparity over time, interocular velocity difference, tracking, binocular


Contents

1 Introduction

2 Background
  2.1 Human Motion-in-Depth Perception
  2.2 Bio-inspired Motion-in-Depth Systems
    2.2.1 Motion-In-Depth Algorithms
    2.2.2 Cooperative Networks
    2.2.3 Neuromorphic Event-Driven Sensors
  2.3 Computational Stereo Motion-in-Depth
    2.3.1 Binocular Motion-In-Depth Cues
  2.4 Computational Stereo Vision

3 Motion-in-depth Robot Application
  3.1 The iCub
    3.1.1 Hardware Specifications
    3.1.2 Software Architecture: YARP
    3.1.3 The iCub Simulator
  3.2 Air Hockey Robotic Task
  3.3 Tracking in Motion-In-Depth
    3.3.1 Asynchronous Cooperative Stereo Matching Network
    3.3.2 Object Tracking
      3.3.2.1 Multiple Objects Tracking
  3.4 Implementation of the Computational Pipeline
    3.4.1 System Overview
    3.4.2 Architecture Validation
      3.4.2.1 Experiments and Results
    3.4.3 Robot Control
      3.4.3.1 Experiments and Results
    3.4.4 Towards the Air Hockey Demonstrator
      3.4.4.1 Experiments and Results

4 Conclusions
  4.1 Limitations and Future Work


List of Figures

2.1 Everyday activities requiring MID perception.

2.2 Various motion trajectories and corresponding retinal motions in each eye. Adapted from Czuba et al. (2014).

2.3 Cortical areas involved in visual dorsal motion processing.

2.4 Three potential visual cues for MID of a moving object: changing disparity (CD), inter-ocular velocity difference (IOVD) and changing size (CS). Adapted from Wu et al. (2020).

2.5 Visualization of the output from a neuromorphic vision sensor and a standard frame-based camera when facing a rotating disk with a black dot. Compared to a conventional frame-based camera, which transmits complete images at a fixed rate, the neuromorphic vision sensor Lichtsteiner et al. (2008) emits events individually and asynchronously at the time they occur. The figure is adapted from Chen et al. (2018).

2.6 Visualization of the generation of events in space (pixel address x, y) and in time (timestamp t) acquired from the left camera for a person moving and shaking hands. The frontal x-y plane depicts frames of accumulated events over an interval of 100 ms. Three consecutive frames (top to bottom) sampled at 20 Hz are shown. Green: pixel with only negative polarity events; Purple: pixel with only positive polarity events; Yellow: pixel with events of both polarities.

2.7 Abstracted scheme of the pixel circuits. It is composed of a fast logarithmic photoreceptor circuit, a differencing circuit that amplifies changes with high precision, and cheap two-transistor comparators. Adapted from Lichtsteiner et al. (2008).

2.8 Principle of operation of the Dynamic Vision Sensor. Top: Input voltage from the photoreceptor; Bottom: Output voltage from the event triggering circuit. Adapted from Lichtsteiner et al. (2008).

2.9 Principle of operation of the Asynchronous Time-based Image Sensor. Left: a significant relative change of the luminance (programmable threshold n) generates an ON (respectively OFF) event when there is an increase (resp. decrease) in luminosity. Right: a snapshot of events over 100 ms for a car passing in front of the camera. White dots represent ON events, black ones OFF events. Adapted from Haessig et al. (2018).

2.10 Schematic illustration of motion-in-depth (right) and the associated real movement (left).

2.11 Computational schemes for the CD (top) and the IOVD (bottom) cues. '—' indicates differencing and 'd/dt' differentiation. Adapted from Giesel et al. (2018).

2.12 Stereo vision pipeline.

2.13 Top: a typical stereo vision system. Bottom: a rectified stereo vision system obtained by virtually rotating the two stereo cameras.

2.14 Example of stereo image pairs (left: left camera image, middle: right camera image) and the corresponding disparity map (bright areas are closer).

2.15 The stereo correspondence problem. The scene comprises four identical objects, recorded from two viewpoints. R is the right and L the left camera. There are 16 possible matches, but only four of them are correct matches, visualized by red dots. The rectangles L1–L4 represent the imaging of L and R1–R4 of R. Adapted from Steffen et al. (2019).

3.1 The humanoid robot iCub.

3.2 Image acquisition system.

3.3 Event types defined in the event-driven library.

3.4 The iCub simulator interface in Ubuntu.

3.5 Left: professional air hockey table (Air Hockey 7 Ft Deluxe - Uragano), whose playfield measures 179x87 cm and whose goal is about 27 cm; Right: iCub's initial playing posture with the torso tilted and rotated by 30 degrees and the head tilted down by 7 degrees. The left hand is grasping the air hockey pusher extended with a cardboard cylinder tube.

3.6 First pipeline with disparity extractor first and object tracker second.

3.7 Second pipeline with object tracker first and disparity extractor second.

3.8 Comparison of processing times, sampled at 1 Hz, for computing motion-in-depth as a function of the number of incoming events in the last module of both pipelines, for the two paddles moving forwards and backwards.


3.9 Oscillations of the CoM inside the object's ROI over stereo frames captured at different instants of time, tracking the CoM (in orange) of two paddles moving forwards and backwards using the second computational pipeline.

3.10 CoM abscissa trend computed using the two different pipelines.

3.11 CoM abscissa trend for the left and right hands computed using the second pipeline.

3.12 Visualization of representative frames of accumulated events collected over 33 ms, showing the center of mass (red) and the ellipse (blue), which circumscribes a person that was walking forwards and backwards along the depth axis of the iCub robot.

3.13 Drawing of the geometry vectors necessary to determine to which side of the image plane an event (x, y) belongs.

3.14 Visualization of representative frames of accumulated events, each collected over 33 ms, showing the center of mass (red), the global ellipse (blue), and the ellipses (magenta) which circumscribe the hands moved forwards and backwards along the depth axis of the iCub robot.

3.15 Number of events with a specific depth value inside the Region of Interest (ROI).

3.16 Depth variation over time of the mean and mode estimated depths within the ROI for the left hand moving backwards and forwards.

3.17 Variation of depth over time before filtering and after applying a Kalman filter for the left hand moving back and forth.

3.18 First experiment execution: a person is sitting in front of the humanoid robot iCub and is moving the left hand repeatedly back and forth.

3.19 Color-coded motion maps of the left hand moving forwards and backwards along the depth axis of the robot. Six representative frames, three for each direction (towards or away), collected over 100 ms are shown. Only pixels corresponding to events acquired during the 100 ms frames are color-coded. The hand's rectangular ROI is depicted in green. The center of mass is visualized as a dot of size proportional to the depth of the hand from the robot. The dot is red when the hand is approaching and green when the hand is receding.


3.20 Top: Second experiment execution: a person sitting in front of the robot is moving both hands in opposite directions back and forth. Bottom: two representative frames of color-coded motion maps, collected over 100 ms, are shown. Only pixels corresponding to events acquired during the 100 ms frames are color-coded. The green ROI tracks the right hand, while the blue ROI tracks the left hand. The centers of mass are visualized as dots of size proportional to the depth of the hands from the robot. The dot is red when the hand is approaching and green when the hand is receding.

3.21 Variation of the Kalman filtered depths over time for the left and right hands during the execution of the second experiment.

3.22 Color-coded motion maps of the two paddles moving forwards and backwards along the depth axis of the robot. Six representative frames, collected over 100 ms, are shown. Only pixels corresponding to events acquired during the 100 ms frames are color-coded. The green ROI tracks the paddle in the right hand, while the blue ROI tracks the paddle in the left hand. The center of mass is visualized as a dot of size proportional to the depth of the paddles from the robot. The dot is red when the paddle is approaching and green when the paddle is receding.

3.23 Robot reference frames according to the color convention RGB-xyz.

3.24 Visualization of events acquired from the right camera during iCub's arm movement in three different situations: the head is indirectly moving with the torso (top-left), the gaze is fixating a specific point in 3D space (top-right), the head is moving in the opposite direction with respect to the torso (bottom).

3.25 Experiment with a person moving the left hand along the y-direction of the robot's frame (see Figure 3.23c).

3.26 Experiment with a person shaking the hands in front of the robot at various depths.

3.27 The neuromorphic iCub assumes the playing posture and is looking at the air hockey table, where a puck is pushed by a human player.

3.28 The visual scene observed through the event-based cameras embedded on the neuromorphic iCub. Left: rectified events acquired from the stereo event-based cameras. Right: motion map tracking the hand of the robot.


3.29 Color-coded motion maps of the puck moving on the air hockey table; only pixels corresponding to events acquired during the 100 ms frames are color-coded. The green ROI tracks the puck. The center of mass is visualized as a dot of size proportional to the depth of the puck from the robot. The dot is red when the puck is moving towards and green when the puck is moving away from the robot.

3.30 Comparison of the computed and Kalman filtered depth (Q = 0.001, R = 1) with the ground truth for the air hockey puck moving on the air hockey table. The ground truth is determined using the YARP Gaze Interface method get3DPointOnPlane().

3.31 Comparison of the computed and Kalman filtered depth (Q = 0.001, R = 1) with the ground truth for the air hockey puck pushed 2 m away from the robot reference frame. The ground truth is determined using the YARP Gaze Interface method get3DPointOnPlane().

3.32 Comparison between the desired y position received by the MID pipeline and the actual y position of the end effector of the right arm in Cartesian space. Top: torso yaw joint disabled. Bottom: torso yaw joint enabled.

3.33 Comparison between the desired z position and the actual z position of the end effector of the right arm in Cartesian space. Top: torso yaw joint disabled. Bottom: torso yaw joint enabled.

3.34 Comparison between the desired orientation (green) and the actual orientation (red) of the end effector of the right arm in Cartesian space with the torso yaw joint disabled, using the angle-axis representation x_a, y_a, z_a, theta [m], [rad].

3.35 Comparison between the desired orientation (green) and the actual orientation (red) of the end effector of the right arm in Cartesian space with the torso yaw joint enabled, using the angle-axis representation x_a, y_a, z_a, theta [m], [rad].


List of Tables

3.1 Comparison of mean processing times for computing MID from the two pipelines. Four different data sets are considered: two paddles, two hands, one hand moving, air hockey.


List of Acronyms

AE Address Event.
AER Address Event Representation.
ATIS Asynchronous Time-based Image Sensor.
CD Changing Disparity over time.
CoM Center of Mass.
CS Changing Size of retinal images.
DAVIS Dynamic and Active-pixel Vision Sensor.
DoF Degrees of Freedom.
DVS Dynamic Vision Sensor.
ED Event Driven.
fMRI functional Magnetic Resonance Imaging.
FPGA Field Programmable Gate Array.
HRI Human Robot Interaction.
IOVD InterOcular Velocity Difference.
KF Kalman Filter.
LAE Labelled Address Event.
LGN Lateral Geniculate Nucleus.
MID Motion-In-Depth.
MT Middle Temporal area.
ODE Open Dynamics Engine.
ROI Region Of Interest.
SNN Spiking Neural Network.
V1 Primary Visual Cortex.
V2 Secondary Visual Cortex.
V3 Third Visual Cortex.
VLSI Very Large Scale Integration.
YARP Yet Another Robot Platform.
ZCB ZynQ Carrier Board.


1 Introduction

Among the five senses, the human visual system occupies one of the largest areas in the cortex, and it processes around 70% of the total amount of information from the surrounding environment Mandal (2003). The perception of the surroundings, combined with proprioception, allows human beings to move, reach targets or avoid obstacles. Robots can similarly take advantage of a visual system when interacting in three-dimensional space. Robotic vision, the ability of robots to perceive their surroundings, plays a key role in executing tasks such as navigation Urmson et al. (2003) or obstacle avoidance Borenstein & Koren (1989), Milde et al. (2017). Digital cameras have long been the standard instrument for recording robotic visual input Horn et al. (1986). Computer vision is, however, limited by the physical capabilities of cameras, such as a small field of view, high power consumption, motion blur, over- and under-exposure, and a limited frame rate. Classic frame-based cameras periodically capture still frames (or static snapshots of the scene). Since the camera is blind between frames, information about the motion of entities in the scene is lost. Moreover, within each image, the same static background scene is repeatedly recorded, producing excessive redundant data. Real-time processing of large amounts of data, including redundant information, is not easy and is still an open challenge in robotics Decotignie (2005). In the past decades, progress in computer vision has been made to overcome some of these limitations, developing biologically-inspired devices such as event-driven cameras Lichtsteiner et al. (2008).

Event-based visual sensors mimic the human retina, producing an event only where and when the contrast changes in the visual field. These cameras drastically reduce the amount of outgoing data, providing a novel encoding of the visual signal with micro-second precision. Removing redundant streams of data implies less computational processing and higher speed, thanks to the low latency of the bio-inspired cameras. This is especially suited to the robotics application domain, where real-time performance and low latency are crucial. Bio-inspired vision sensors offer considerable advantages over standard frame-based cameras. However, since their output is a stream of asynchronous events, rather than intensity images, traditional vision algorithms already designed for frame-based cameras are not directly applicable. Novel algorithms are required in order to process the event camera output and unlock their potential.

Event cameras provide more complete knowledge of motion, since they capture movement as a continuous stream of information and do not suffer from the information loss between frames that can occur when frame-based cameras are used to track fast-moving targets. Visual motion estimation is a natural application for event cameras, as moving entities are naturally segmented by the vision sensors. However, in the literature, timely estimation of motion is usually derived from two-dimensional (2D) optical flow Benosman et al. (2013), even though real-world object trajectories mostly occur in three dimensions (3D). In particular, the perception of Motion-In-Depth (MID), defined as the capability of discriminating between forwards and backwards movements, has an important impact on navigation in dynamic environments, not only for humans but also for robots Wixted & Serences (2018). The detection of MID could improve the ability of robots to avoid approaching objects, but also to interact with the surroundings, for example during manipulation and grasping tasks, as well as during Human-Robot Interaction (HRI) in collaborative tasks.

This Thesis focuses on the estimation of the motion-in-depth of multiple objects approaching a robot. Visual estimation of objects' motion through the three-dimensional environment is a fundamental requirement for humanoid robots like iCub, whose sensory capabilities are designed for interaction with a dynamic environment Metta et al. (2006). In particular, the neuromorphic iCub comes equipped with a stereo pair of Asynchronous Time-based Image Sensor (ATIS) event cameras Bartolozzi et al. (2011), Posch et al. (2010). To test the MID estimation, this Thesis proposes an air hockey demonstrator in which iCub performs reaching tasks based on visual perception.


2 Background

The idea behind this thesis is to move robots towards biologically-inspired hardware and software. Since event-based cameras already emulate the human retina, bio-inspired algorithms have to mimic the human visual system in order to take advantage of the outstanding properties of the neuromorphic hardware.

There are two approaches to working with event-driven cameras: the first tries to adapt traditional computer vision algorithms to event-driven data Vasco et al. (2016), Piatkowska et al. (2017); the second uses bio-inspired models that mimic the working principles of the human visual cortex based on a population of spiking neurons Osswald et al. (2017), Dikov et al. (2017). For this reason, it is first important to study how the human brain encodes depth and motion-in-depth information. Then, it makes sense to analyze how computer vision addresses the problem of MID estimation computationally, in order to adapt it to event-based cameras.

2.1 Human Motion-in-Depth Perception

Motion-In-Depth (MID), as the name suggests, refers to movement through three-dimensional space. In the literature, motion-in-depth is defined as the capability of detecting approaching or receding objects, while discriminating either their direction (i.e., towards or away) or their speed Sabatini et al. (2002). Mammals have the natural ability to detect motion-in-depth, which allows them to recognise moving entities and react accordingly. Moreover, for some animal species, precise motion estimation is crucial to successfully capture prey or evade a predator's attacks. The perception of MID is a fundamental requirement for human survival. Human beings rely on MID detection in many daily tasks: walking among a crowd of people, driving a car, or practicing sports that require catching or intercepting a ball (e.g., football, basketball, tennis, hockey).


Figure 2.1: Everyday activities requiring MID perception.

Even though MID is a very common form of motion in everyday life, it is still unclear how and where motion-in-depth is processed in the human brain.

The neural processing of frontoparallel motion (i.e., 2D motion along a steady depth plane, see Figure 2.2) and its corresponding pathways have been extensively investigated Maunsell & Newsome (1987), Born & Bradley (2005). Motion processing occurs largely along the dorsal stream of the visual cortex, also known as the "where pathway". The dorsal motion pathway begins in the Primary Visual Cortex (V1), from which many direction-selective neurons project towards the Middle Temporal area (MT, or V5) Movshon & Newsome (1996). The middle temporal area also receives inputs from other cortical areas, such as the Secondary Visual Cortex (V2) and the Third Visual Cortex (V3), and from subcortical regions (see Figure 2.3).

Figure 2.2: Various motion trajectories and corresponding retinal motions in each eye. Adapted from Czuba et al.(2014).


Figure 2.3: Cortical areas involved in visual dorsal motion processing.

Area MT plays a key role in motion processing: more than 90% of MT neurons are direction-selective, i.e. they respond selectively to motion in one direction Zeki (1974), Maunsell & Van Essen (1983a), Albright et al. (1984), Born & Bradley (2005). Some early studies Zeki (1974), Albright et al. (1984) collected evidence suggesting 3D motion encoding by MT neurons. However, Maunsell et al. found that MT neurons do not directly encode motion-in-depth, but rather some of its components, such as frontoparallel motion and binocular disparity, which are used by higher-level brain areas to build a representation of MID Maunsell & Van Essen (1983b). Contrary to the evidence in Maunsell & Van Essen (1983b), recent human functional Magnetic Resonance Imaging (fMRI) and physiological studies found areas in MT and nearby regions activated by MID cues Likova & Tyler (2007), Rokers et al. (2009). In particular, studies by Likova et al. found that an area anterior to MT selectively responds to MID cues, suggesting the involvement of multiple brain areas Likova & Tyler (2007). Two more recent complementary papers Czuba et al. (2014), Sanada & DeAngelis (2014) clarified the existence of 3D motion information in the responses of MT neurons of monkeys, based on binocular cues.

For computing the velocity along the depth axis, the visual system can rely on three possible visual cues Wu et al. (2020), Beverley & Regan (1983), Regan (1993), Cumming & Parker (1994), Harris et al. (2008), as shown in Figure 2.4:

• Monocular cues:
  – Change in Size of the retinal images (CS).

• Binocular cues:
  – Change in retinal Disparity over time (CD).
  – InterOcular Velocity Difference (IOVD).


It has been demonstrated that each of the above-mentioned visual cues can achieve MID detection alone, but their individual contributions to the perception of MID are still under discussion Brenner et al. (1996), Regan & Beverley (1979). As an example, Sanada & DeAngelis (2014) found that in macaque MT about 58% of the recorded neural population is selective for 3D motion. Specifically, 56% responded to the IOVD-based stimulus and 10% responded to the CD-based stimulus. However, how MT encodes 3D motion is still an open question. Researchers have managed to experimentally isolate the information carried by each cue by showing random-dot stereogram stimuli to human observers Giesel et al. (2018). In this way, they found that the two binocular cues have different sensitivity profiles that depend on speed and eccentricity Czuba et al. (2010). For instance, sensitivity to the CD cue is best near the foveal region and for slower speeds, while the IOVD cue becomes more important at high speeds and in the peripheral visual field. Furthermore, it is still unknown whether CD and IOVD circuits remain independent within MT. Certainly, motion processing starts with separate computations from the two retinas (left/right), but at some point binocular integration is required. This first happens at a smaller scale in the Lateral Geniculate Nucleus (LGN) and then later in the visual cortex Joo et al. (2016). In the future, more advanced tools, including stronger magnets for imaging in combination with new machine learning approaches, will help neuroscientists to understand not only where in the brain 3D vision occurs, but also how the brain handles motion computationally.

Figure 2.4: Three potential visual cues for MID of a moving object: changing disparity (CD), inter-ocular velocity difference (IOVD) and changing size (CS). Adapted from Wu et al. (2020).


2.2 Bio-inspired Motion-in-Depth Systems

This section discusses bio-inspired models that try to mimic, through cortical architectures, the working principles of the human visual system for encoding MID information. Such algorithms can get the most out of neuromorphic vision sensors, moving towards a fully bio-inspired system in both hardware and software.

2.2.1 Motion-In-Depth Algorithms

Despite the importance of motion-in-depth information for navigating in dynamic environments, there are fewer studies on motion-in-depth than on static depth perception. Over the years, several models have proposed cortical architectures to predict sensitivity to MID Sabatini & Solari (2004), Shioiri et al. (2009), Peng & Shi (2014), Baker & Bair (2016); however, few of these have been applied to the robotics domain.

Sabatini & Solari (2004) simulated the MID sensitivity of MT neurons by combining responses from complex disparity energy units. They proposed to compute MID simply from the convolutions of the two stereo images with quadrature pairs of Gabor kernels (i.e., complex spatio-temporal bandpass filters), by exploiting the chain rule in the evaluation of the temporal derivative of the phases. Image analysis with Gabor filters is thought to be similar to the extraction of frequency content in specified directions carried out by simple cells in the visual cortex of mammalian brains Marčelja (1980), Daugman (1985). Moreover, Sabatini et al. analytically proved that the information held in the interocular velocity difference is the same as that derived by evaluating the total derivative of the binocular disparity, if a phase-based disparity encoding scheme is assumed.

They suggested implementing their algorithm directly in Very Large Scale Integration (VLSI) hardware, as demonstrated by their prototype Raffo et al. (1998).
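To make the phase-based scheme above more concrete, the following sketch estimates the disparity of a 1D stereo signal from the phase difference of quadrature Gabor responses. It is only a minimal illustration of the principle, not the implementation of Sabatini & Solari (2004); the filter parameters and the synthetic test signal are arbitrary assumptions.

```python
import numpy as np

def gabor_quadrature(sigma=8.0, omega=0.25, half_width=24):
    """Quadrature pair of 1D Gabor kernels (even: cosine, odd: sine)."""
    x = np.arange(-half_width, half_width + 1)
    envelope = np.exp(-x**2 / (2.0 * sigma**2))
    return envelope * np.cos(omega * x), envelope * np.sin(omega * x)

def local_phase(signal, even, odd):
    """Local phase of the signal, from the responses of the quadrature pair."""
    re = np.convolve(signal, even, mode="same")
    im = np.convolve(signal, odd, mode="same")
    return np.arctan2(im, re)

# Synthetic left/right views: the right one is the left one shifted by a known
# disparity of 4 pixels (values chosen only for illustration).
omega, true_disparity = 0.25, 4
x = np.arange(256)
left = np.sin(omega * x) + 0.1 * np.random.default_rng(0).normal(size=x.size)
right = np.roll(left, -true_disparity)          # right(x) = left(x + d)

even, odd = gabor_quadrature(omega=omega)
phase_l = local_phase(left, even, odd)
phase_r = local_phase(right, even, odd)

# Phase-based disparity: d ≈ wrapped(phase_R - phase_L) / omega.
dphi = np.angle(np.exp(1j * (phase_r - phase_l)))
print("median estimated disparity:", np.median(dphi[30:-30] / omega))  # close to 4
```

The interior of the signal is used for the estimate to avoid the wrap-around and border effects of the convolution; the phase difference divided by the filter's peak frequency recovers the imposed shift.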

Shioiri et al. (2009) and Peng & Shi (2014) developed models based on the CD and IOVD cues to estimate the direction and speed of MID. Baker & Bair (2016) devised a model based on IOVD to simulate the activity of MT neurons, which also respond to motion along the depth axis. Since all these models attempted to reproduce neural mechanisms for encoding MID, they naturally fit biologically-inspired hardware, like event-based cameras.

2.2.2 Cooperative Networks

At present, there are no references in the literature for bio-inspired motion-in-depth algorithms using event-driven cameras as input. However, cooperative networks are an example of bio-inspired computing that has been applied to event-driven cameras. Cooperative computing refers to a class of algorithms that deal with locally distributed computations, which work together to provide a global solution. Cooperative networks were born to meet the need for a solution to the stereo-matching problem in a bio-inspired way, given that depth information extraction in the brain is considered a cooperative process Julesz (1971). Marr & Poggio (1976) proposed the first cooperative network, which could perform feature matching between two images with the intent of modelling the disparity detection mechanisms of the brain. This network can be considered a three-dimensional array, each of whose cells (neurons) encodes a belief in matching a corresponding pair of pixels from the left and the right images. The authors introduced two basic constraints to limit the number of false matches (i.e., matches formed between features that are not generated by the same target in space) between the left and right cameras:

• Cross-disparity uniqueness: each pixel can have at most one disparity value. This assumption derives from the fact that a point on a surface has only one spatial location at a given time.

• Within-disparity continuity: the disparity of neighboring regions should be the same. This constraint relies on the observation that cohesive objects occupy a localized region of depth.

Over the years, extensions of Marr and Poggio’s cooperative stereo algorithm have been proposed Firouzi & Conradt (2016), Osswald et al. (2017).

Firouzi et al. proposed a dynamic cooperative and competitive neural network which implements a stereo-matching algorithm designed to be used with bio-inspired event-based vision sensors Firouzi & Conradt (2016). It can asynchronously extract the disparity for each new input event via a Winner-Take-All mechanism. Their model simplifies the operation of Spiking Neural Networks (SNN) by representing neurons as elements of a 3D array and keeping track of their activation times. It has the advantage of not requiring expensive dedicated neuromorphic hardware, like SpiNNaker, for its implementation Furber et al. (2014). The network tries to reproduce the competitive process used by disparity-sensitive neurons in the mammalian brain to encode horizontal disparity Zhu & Qian (1996). To comply with the uniqueness and continuity assumptions respectively, cells lying along the same row inhibit each other, while neighbouring cells lying on a common disparity map potentiate each other, creating a local pattern of excitation. Osswald et al., instead, offered a bio-inspired model that solves the stereo-correspondence problem with a Spiking Neural Network that can be directly implemented using neuromorphic engineering devices Osswald et al. (2017). However, in this thesis we preferred the model of Firouzi et al. because it enables us to build an application that can run on a regular PC and easily interface with the robot for dynamic disparity estimation.
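The sketch below illustrates the excitation/inhibition scheme just described, in the spirit of Firouzi & Conradt (2016): a 3D array of matching cells indexed by (y, x, disparity) receives excitation from temporally coincident events and from neighbours tuned to the same disparity, and a winner-take-all step selects one disparity per pixel. It is a simplified, assumed formulation (sensor size, time constants, weights and acceptance threshold are illustrative), not the network actually used in this thesis.

```python
import numpy as np

H, W, D_MAX = 240, 304, 40               # sensor size (ATIS-like) and disparity range
TAU = 0.01                               # temporal coincidence window [s]
last_right = np.full((H, W), -np.inf)    # timestamp of the last right-camera event per pixel
cells = np.zeros((H, W, D_MAX))          # activation of the cooperative matching cells

def right_event(x, y, t):
    """Register an event from the right camera."""
    last_right[y, x] = t

def left_event(x, y, t):
    """Process a left-camera event and return its winning disparity (or None)."""
    cells[y, x] *= 0.5                   # leaky decay acting as a soft inhibitory reset
    y0, y1 = max(0, y - 2), min(H, y + 3)
    x0, x1 = max(0, x - 2), min(W, x + 3)
    for d in range(min(D_MAX, x + 1)):
        # Excitation 1: temporal coincidence with a recent right event at column x - d.
        if t - last_right[y, x - d] < TAU:
            cells[y, x, d] += 1.0
        # Excitation 2: support from neighbouring cells tuned to the same disparity
        # (within-disparity continuity constraint).
        cells[y, x, d] += 0.1 * cells[y0:y1, x0:x1, d].sum()
    # Winner-take-all across disparities on the same line (cross-disparity uniqueness).
    d_win = int(np.argmax(cells[y, x]))
    return d_win if cells[y, x, d_win] > 1.0 else None

# Toy usage: a right event at x = 96 followed by a left event at x = 100 on the
# same row, within the coincidence window, yields a disparity of 4 pixels.
right_event(96, 50, t=0.000)
print(left_event(100, 50, t=0.002))      # -> 4
```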


2.2.3 Neuromorphic Event-Driven Sensors

In recent years, a newly emerging bio-inspired technology known as event-driven (ED) cameras has attracted a lot of attention from both academia and industry. Their increasing popularity is due to the outstanding properties that these devices offer compared to standard frame-based image sensors.

ED cameras respond to brightness changes in the scene asynchronously and for every pixel. In contrast, frame-based cameras acquire sequences of static images at fixed temporal intervals.

Thus, instead of acquiring redundant static intensity data at a constant frame rate, event cameras acquire temporal contrast data only from the relevant "moving" parts of the scene. Figure 2.5 shows the difference between the outputs of a frame-based and an event-based camera observing a black circle printed on a rotating grey disk. The continuous stream of events generated by the moving point overcomes the typical loss of information between frames.

Figure 2.5: Visualization of the output from a neuromorphic vision sensor and a standard frame-based camera when facing a rotating disk with a black dot. Compared to a conventional frame-based camera, which transmits complete images at a fixed rate, the neuromorphic vision sensor Lichtsteiner et al. (2008) emits events individually and asynchronously at the time they occur. The figure is adapted from Chen et al. (2018).

An event camera approximately mimics the biological retina, which has been observed to respond more strongly to changes in illumination than to steady illumination Lankheet et al. (1991). The first bio-inspired vision sensor was developed by Mead & Mahowald (1988) during the period 1986-1992, while the first event camera became commercially available in 2008 Lichtsteiner et al. (2008). This neuromorphic device outputs spike events using the Address Event Representation (AER) protocol, a communication protocol designed for event-based communication among neuromorphic chips. It is based on biological neural processing, where information transfer is modeled as asynchronous timed spikes. The event camera output corresponds to an asynchronous stream of events (or spikes) in space and time. Whenever a single pixel detects a change of intensity, it triggers an event, which contains the address (i.e., the coordinates x and y of the pixel triggering the event), the timestamp, and the sign of the change (positive or negative).

The stream of events from the silicon retina can be mathematically defined as follows:

e := (p, t, δ)    (2.1)

where:

• p = (x, y) is the position of the event pixel.
• t is the timestamp at which the event occurred.
• δ ∈ {−1, +1} is the polarity of the event, indicating an increase (+1) or decrease (−1) of the intensity of the incident light at the corresponding pixel location.

Figure 2.6 visualizes the stream of events in the 3D space spanned by x, y and time, generated by a person moving in front of the event cameras.
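As a minimal, self-contained illustration of this event representation and of the accumulated frames of Figure 2.6, the sketch below sums event polarities over a fixed time window; the sensor size matches the ATIS resolution, while the events and the 100 ms window are made up for the example.

```python
import numpy as np

WIDTH, HEIGHT = 304, 240   # ATIS resolution

def accumulate(events, t_start, window=0.1):
    """Sum event polarities per pixel over a time window (an 'event frame').

    Each event is a tuple (x, y, t, delta) with delta in {-1, +1}, following Eq. 2.1.
    """
    frame = np.zeros((HEIGHT, WIDTH), dtype=np.int32)
    for x, y, t, delta in events:
        if t_start <= t < t_start + window:
            frame[y, x] += delta
    return frame

# Three synthetic events; only the first two fall inside the 100 ms window.
stream = [(10, 20, 0.001, +1), (11, 20, 0.004, -1), (10, 21, 0.150, +1)]
print(accumulate(stream, t_start=0.0).sum())   # -> 0 (one ON and one OFF event)
```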


Figure 2.6: Visualization of the generation of events in space (pixel address x,y) and in time (timestamp t) acquired from the left camera for a person moving and shaking hands. The frontal x-y plane depicts frames of accumulated events over an interval of 100ms. Three consecutive frames (top to bottom) sampled at 20Hz are shown. Green: pixel with only negative polarity events; Purple: pixel with only positive polarity events; Yellow: pixel with events of both polarities.


Event cameras present numerous advantages over standard cameras (see Gallego et al. (2019) for a complete review), which can be summarized as follows:

• High dynamic range (about 120 dB), which can handle extreme light conditions.

• High temporal resolution (about 1 µs), which helps with applications involving strict temporal measurement requirements and high-speed moving stimuli.

• Low latency (about 1 µs), which means that as soon as a change of intensity is detected, it is transmitted.

• Low power consumption (about 10 mW).

• No redundant information from static parts of the scene: if nothing moves in front of the camera, no events are generated.

• Low bandwidth requirements, since only intensity changes are transmitted. The drastic reduction in bandwidth addresses the problem of processing visual data in real time, which is still an open challenge for conventional cameras due to the huge amount of images to process.

• Real-time operation with asynchronous data.

• No artefacts such as motion blur and over-saturation. Motion blur is caused by fast motion (e.g., a falling object) during the exposure time of the image, while over-saturation occurs when too much light hits the camera sensor, which makes it difficult to quantify the light information accurately.

On the other hand, event cameras have some disadvantages with respect to conventional cameras:

• They demand novel, more complicated vision algorithms. Traditional computer vision algorithms cannot be applied, since the output of event cameras (measuring per-pixel brightness changes asynchronously) is fundamentally different from that of traditional ones (measuring "absolute" brightness at a constant rate).

• Lower spatial resolution.

• Not useful for static scenarios.


Figure 2.7: Abstracted scheme of the pixel circuits. It is composed of a fast logarithmic photoreceptor circuit, a differencing circuit that amplifies changes with high precision, and cheap two-transistor comparators. Adapted from Lichtsteiner et al. (2008).

The main neuromorphic vision sensors proposed over the years are the following:

• Dynamic Vision Sensor (DVS) Lichtsteiner et al. (2008): a silicon retina with 128x128 pixels. Each pixel operates in parallel and independently, and consists of a logarithmic photoreceptor, a differentiator circuit, and comparators. Figure 2.7 shows how these three components are connected. The pixel transmits output signals (in the form of digital events) only when there is a sufficiently large change in the intensity of the incident light on the photoreceptor at that pixel location. When the change in intensity reaches the threshold, an ON (increase in intensity) or OFF (decrease in intensity) event is generated by the pixel. Figure 2.8 shows the working principle of the DVS.

• Asynchronous Time-based Image Sensor (ATIS) Posch et al. (2010): it contains an array of fully autonomous pixels that combine event-based intensity-change detection with absolute intensity measurement. It has a resolution of 304x240 pixels. Each pixel starts an intensity measurement after detecting any local intensity change; the grayscale value is then asynchronously measured. Figure 2.9 shows the working principle of the ATIS.

• Dynamic and Active-pixel Vision Sensor (DAVIS) Berner et al. (2013): it combines both an asynchronous event-based sensor and a standard frame-based camera in the same pixel array. It also encodes the visual information between two successive frames as an asynchronous sequence of events indicating the pixel-level intensity changes.
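The DVS principle described in the first bullet can be sketched as follows: an event is emitted whenever the log-intensity at a pixel has changed by more than a contrast threshold since the last emitted event. The threshold and the input signal below are illustrative assumptions, not parameters of the actual sensor.

```python
import numpy as np

THRESHOLD = 0.15   # assumed contrast threshold on the log intensity

def dvs_events(intensity, timestamps):
    """Generate ON (+1) / OFF (-1) events from a single pixel's intensity trace."""
    events = []
    log_i = np.log(np.asarray(intensity, dtype=float))
    reference = log_i[0]                       # value memorised at the last event
    for t, value in zip(timestamps[1:], log_i[1:]):
        while value - reference > THRESHOLD:   # brightness increased enough: ON event
            reference += THRESHOLD
            events.append((t, +1))
        while reference - value > THRESHOLD:   # brightness decreased enough: OFF event
            reference -= THRESHOLD
            events.append((t, -1))
    return events

# Example: a pixel seeing a smooth brightening followed by a dimming.
t = np.linspace(0.0, 1.0, 200)
intensity = 1.0 + 0.8 * np.sin(np.pi * t)
print(dvs_events(intensity, t)[:5])            # first few ON events, then OFF events
```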


Figure 2.8: Principle of operation of the Dynamic Vision Sensor. Top: Input voltage from the photoreceptor; Bottom: Output voltage from the event triggering circuit. Adapted from Lichtsteiner et al. (2008).

Figure 2.9: Principle of operation of the Asynchronous Time-based Image Sensor. Left: a significant relative change of the luminance (programmable threshold n) generates an ON (respectively OFF) event when there is an increase (resp. decrease) in luminosity. Right: a snapshot of events over 100 ms for a car passing in front of the camera. White dots represent ON events, black ones OFF events. Adapted from Haessig et al. (2018).


2.3 Computational Stereo Motion-in-Depth

Once it is known how the brain encodes motion-in-depth, it is important to reproduce it computationally, to allow robots to gather information about motion in 3D. A moving object in 3D space projects velocities of opposite direction (v_L, v_R) onto the left and right retinas, as depicted in Figure 2.10. In particular, the motion of an object directly towards the cyclopean eye (V_z > 0) produces opposite directions of retinal motion in the two eyes. Conversely, the receding movement of an object (V_z < 0) reverses the direction of image motion in each eye.

Figure 2.10: Schematic illustration of motion-in-depth (right) and the associated real movement (left).

2.3.1 Binocular Motion-In-Depth Cues

Since the scene is viewed by the iCub robot with both eyes, this thesis focuses only on binocular cues.

As already explained, the third component of the object's motion (i.e., its motion-in-depth) can be approximated in two ways, as shown in Figure 2.11:

1. by the rate of change of disparity (CD):

\[ V_{disp} = \frac{d(x_L - x_R)}{dt} = \frac{d(\mathrm{disp})}{dt} \tag{2.2} \]

2. by the difference between the retinal velocities (IOVD):

\[ V_{disp} = \frac{d(x_L)}{dt} - \frac{d(x_R)}{dt} \tag{2.3} \]

The two cues are equivalent from a mathematical point of view. The only difference is in the order of computation of the quantities involved. The change in disparity over time can be computed from the temporal derivative of interocular difference in retinal position, namely the horizontal disparity. In this case, the disparity is computed first, and the temporal derivative is computed second. As regards the interocular velocity difference, the retinal velocities are computed first, and then the difference between these velocities.
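A small numerical check of this equivalence (with purely synthetic retinal trajectories, chosen only for illustration) can be written as follows: differentiating the disparity over time (CD) and differencing the two retinal velocities (IOVD) yield the same motion-in-depth signal.

```python
import numpy as np

t = np.linspace(0.0, 1.0, 101)        # time [s]
x_left = 10.0 + 6.0 * t**2            # retinal position in the left eye [px]
x_right = 4.0 - 2.0 * t               # retinal position in the right eye [px]

# CD cue: disparity first, temporal derivative second (Eq. 2.2).
v_cd = np.gradient(x_left - x_right, t)

# IOVD cue: retinal velocities first, interocular difference second (Eq. 2.3).
v_iovd = np.gradient(x_left, t) - np.gradient(x_right, t)

print(np.allclose(v_cd, v_iovd))      # True: the two cues coincide
```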

Figure 2.11: Computational schemes for the CD (top) and the IOVD (bottom) cues. '—' indicates differencing and 'd/dt' differentiation. Adapted from Giesel et al. (2018).

Both these binocular visual cues involve the computation of disparity, a fundamental concept in computational stereo vision.

2.4 Computational Stereo Vision

Computational stereo vision is a key topic in computer vision and image analysis Lazaros et al. (2008). Stereo vision is defined as the problem of inferring three-dimensional information from a pair of horizontally displaced images taken from different viewpoints. The need for two viewpoints does not necessarily imply the physical presence of two cameras: a monocular system that moves around and gathers multiple shots of a common scene can also be used. Human beings have two fronto-parallel eyes that see the world from two slightly different positions. Each eye captures only a two-dimensional image, which is impressed on the retinal surface. The brain then processes the pair of images, solving the ill-posed problem of going from 2D inputs to a 3D representation of the scene. Another favoured approach, however, is to consider an acquisition system with two cameras shifted horizontally from one another by a quantity known as the baseline.

Figure 2.12: Stereo vision pipeline.

The three-dimensional position of any point in the image can be obtained through several steps, as illustrated in Figure 2.12:

• Calibration is the procedure of finding the intrinsic and extrinsic parameters of the two cameras. The intrinsic parameters (e.g. focal length, optical center, lens distortion, etc.) characterize the mapping between camera coordinates and pixel coordinates in the image plane through the intrinsic camera matrix K (see Eq. 2.5). The extrinsic parameters describe the position and orientation of the two cameras with respect to the world coordinate system through the rotation and translation matrices R, t (see Eq. 2.5). Intrinsic and extrinsic parameters together form the camera matrix P (see Eq. 2.4), which maps world coordinates to image points expressed in homogeneous coordinates:

\[ \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = P \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} \tag{2.4} \]

where the camera matrix P is computed as follows:

\[ P = K \, [R \,|\, t] \tag{2.5} \]

• Rectification is the process of applying a pair of 2D projective transforms, or homographies, to a pair of images whose epipolar geometry is known. The image planes then become co-planar and conjugate epipolar lines become collinear Loop & Zhang (1999), simply by virtually rotating the two cameras, as shown in Figure 2.13. Thanks to the rectification constraint, the search for corresponding points is simplified by limiting the search space to one dimension, i.e., along the same row in the left and right images.


Figure 2.13: Top: a typical stereo vision system. Bottom: a rectified stereo vision system obtained by virtually rotating the two stereo cameras.

• Stereo Correspondence aims at finding correspondences between image pairs in order to compute the relative displacement of matching left and right image points which, in computer vision, is called disparity. Stereo correspondence can be carried out sparsely or densely, depending on the number of image pixels for which one wants to find correspondences and, consequently, compute disparity values. In this thesis, the sparse approach was adopted to obtain a disparity map (see Figure 2.14), with the disparity computed only for the incoming events.


Figure 2.14: Example of stereo image pairs (left: left camera image, middle: right camera image) and the corresponding disparity map (bright areas are closer).

Generally, when the cameras are directed towards a finite 3D point in space, it is possible to have horizontal as well as vertical disparities. However, as already mentioned, the rectification process simplifies the computation by forcing the vertical disparity to tend to zero. From now on, the disparity will refer to the difference in x coordinates (u_l − u_r) of corresponding points in the left and right images.

Since a single pixel of one image could match several pixels in the corresponding image, it is really important to identify the correct match Hartley & Zisserman (2003). Stereo vision research has considered various approaches to solve this challenging "correspondence problem" (see Figure 2.15), based on the feature descriptor and similarity measure adopted Szeliski (2010). Most classical stereo-matching algorithms use feature-based or area-based matching for computing dense stereo correspondences Scharstein & Szeliski (2002). Feature-based matching algorithms focus on the correlation amongst features rather than on the intensity of pixels. Area-based matching compares the intensity of a window around a single pixel with areas around potential corresponding pixels in the other image.

Classical stereo-matching algorithms are computationally demanding and unsuitable for real-time robotics applications in which the processing time is crucial Domínguez-Morales et al. (2012). Newly emerging biologically-inspired vision systems try to meet the demand for asynchronous and fast processing Steffen et al. (2019). However, in the event-based framework, finding stereo matches is more complicated, because individual pixels often do not carry enough information for accurate matching. Nevertheless, in the event-based case one can exploit the fact that corresponding points in the left and right images are probably generated close in time, thanks to the high temporal resolution that these novel sensors offer. It is not possible to rely solely on time-based coincidence in the presence of noise, intrinsic sensor variations, spatial aliasing, or extended objects. However, with a robust feature-based tracker, noise is removed and a more robust event-based stereo correspondence can be attained.

Figure 2.15: The stereo correspondence problem. The scene comprises four identical objects, recorded from two viewpoints. R is the right and L the left camera. There are 16 possible matches, but only four of them are correct matches, visualized by red dots. The rectangles L1–L4 represent the imaging of L and R1–R4 of R. Adapted from Steffen et al. (2019).

• Triangulation is the process that reconstructs the original 3D position of a correspondence given the disparity map, the baseline, and the focal length Furukawa & Ponce (2009). It consists of extending the line connecting each corresponding point and the camera center, as shown in Figure 2.13. The intersection point of these projected lines should provide the location of the real point in 3D space. Nevertheless, since correspondences are often not precise, the projected lines may not actually intersect. Therefore, the problem turns into finding a 3D point that optimally fits the measured image points. In the literature, there are works that suggest how to define an optimal 3D point when, for example, noise is involved Hartley & Zisserman (2003).

The reprojection of the image into 3D space is given by the following relation in homogeneous coordinates, involving the disparity-to-depth mapping matrix Q (see Eq. 2.7):

\[ \begin{bmatrix} X & Y & Z & W \end{bmatrix}^T = Q \begin{bmatrix} u & v & \mathrm{disparity}(u,v) & 1 \end{bmatrix}^T \tag{2.6} \]

where Q is a 4x4 perspective transformation matrix defined as:

\[ Q = \begin{bmatrix} 1 & 0 & 0 & -c_x \\ 0 & 1 & 0 & -c_y \\ 0 & 0 & 0 & f \\ 0 & 0 & -\frac{1}{T_x} & \frac{c_x - c'_x}{T_x} \end{bmatrix} \tag{2.7} \]

with left principal point coordinates (c_x, c_y), right principal abscissa c'_x, baseline T_x and single focal length f. As everything is in homogeneous coordinates, the final 3D point is retrieved using the following relation:

\[ \begin{bmatrix} X' & Y' & Z' & 1 \end{bmatrix}^T = \begin{bmatrix} \frac{X}{W} & \frac{Y}{W} & \frac{Z}{W} & 1 \end{bmatrix}^T \tag{2.8} \]

Depth turns out to be inversely proportional to disparity: objects with high disparity and low depth are close to the camera. The final output of the stereo vision system should be the 3D point cloud of the scene.
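The reprojection of Eqs. 2.6-2.8 can be sketched as follows; the calibration values (focal length, principal points, baseline) are made-up numbers, not the parameters of the iCub cameras, and the sign of the resulting Z depends on the disparity/baseline convention adopted.

```python
import numpy as np

f = 300.0                 # focal length [px] (assumed)
cx, cy = 152.0, 120.0     # left principal point [px] (assumed)
cx_r = 152.0              # right principal abscissa [px] (assumed)
Tx = 0.068                # baseline [m] (assumed)

# Disparity-to-depth mapping matrix of Eq. 2.7.
Q = np.array([
    [1, 0, 0, -cx],
    [0, 1, 0, -cy],
    [0, 0, 0,   f],
    [0, 0, -1.0 / Tx, (cx - cx_r) / Tx],
])

def reproject(u, v, disparity):
    """Map a pixel and its disparity to a 3D point (Eqs. 2.6 and 2.8)."""
    X, Y, Z, W = Q @ np.array([u, v, disparity, 1.0])
    return np.array([X, Y, Z]) / W

point = reproject(u=180.0, v=110.0, disparity=20.0)
# |Z| = f * Tx / disparity ≈ 1.02 m for these made-up values; the sign of Z
# depends on whether the baseline/disparity are taken with positive or negative sign.
print(abs(point[2]))
```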


3 Motion-in-depth Robot Application

Perceiving motion-in-depth is important not only for humans but also for humanoid robots, especially to properly explore and react to real-world dynamic environments, e.g. with a defensive response to a sudden movement. Although many event-based vision algorithms have been evaluated from a robotic perspective, fewer have been used to control a robotic platform online. For instance, a DVS has been used to solve the classic inverted pendulum problem by providing fast visual feedback for controlling an actuated table to balance an ordinary pencil Conradt et al. (2009). A simple tracker has been used to control a one-degree-of-freedom robotic "goalie" that was able to block a rolling ball despite the low contrast of the object Delbruck & Lang (2013). The capability of catching the ball strongly depends on the latency of the perception loop, in which the robot has to detect, track and predict the trajectory of the ball in order to plan where and when it should be caught. Further, Glover et al. proposed a method to detect and track a circle under gaze fixation, in the presence of event clutter caused by the motion of the head Glover & Bartolozzi (2016). All these robotic applications have highlighted the new level of performance enabled by event cameras: low latency and computational efficiency.

This chapter discusses the steps required to perform neuromorphic visual servoing using the humanoid robot iCub Bartolozzi et al. (2011). As a demonstration of the devised event-based pipeline, this Thesis proposes to make iCub play air hockey.


3.1 The iCub

In this Thesis, the research platform used for developing and testing the air hockey application is the iCub Bartolozzi et al. (2011), shown in Figure 3.1, equipped with two event-based cameras (ATIS) embedded in its eyes. The iCub Sandini et al. (2007) (see also http://www.icub.org/) is an open-source humanoid robot developed by the Italian Institute of Technology (IIT) for research purposes in the field of human cognition and artificial intelligence. Measuring 94 cm, the iCub robot is approximately the same size as a three-year-old child. It is equipped with 53 actuated degrees of freedom, organized in the following manner:

• 7 for each arm: 3 in the shoulder, 1 in the elbow and 3 in the wrist.

• 9 for each hand: 3 in the thumb, 2 in the index, 2 in the middle finger, 1 in the coupled ring and little finger, and 1 for adduction/abduction.

• 6 for the head: 3 in the neck, 3 in the eyes. • 3 for the torso.

• 6 for each leg: 3 in the hip, 1 in the knee, 2 in the ankle.

This Thesis only exploited the upper-body part of the iCub, i.e. hand, arm, torso, and head, which are essential for the reaching task. The lower-body part is placed on a mobile platform that enables us to transport the robot on flat terrain and to bring it near the air hockey table.

3.1.1 Hardware Specifications

To localise objects in the environment, iCub relies, similarly to human perception, solely on a visual system based on stereo cameras. Within each eye is an Asynchronous Time-based Image Sensor (ATIS) event-driven camera Posch et al. (2010), which acquires event-driven visual data from the surroundings. The sensor is designed to communicate asynchronous data with a spike-based "address-event representation" (AER) communication protocol.

Figure 3.1: The humanoid robot iCub.

Figure 3.2 shows the image acquisition system, composed of the vision sensors and the embedded hardware used for processing the asynchronous data. The two ATIS event-based cameras are connected to the specialised iCub-zynq processor (ZynQ-7030 board) through a Field Programmable Gate Array (FPGA). The iCub-zynq chip, hosted in the ZynQ Carrier Board (ZCB), contains both an FPGA and an ARM CPU. The FPGA on the board timestamps the events and loads them into the memory of the ARM core (running Linux), which is connected to the local TCP/IP network. The YARP module zynqGrabber, running on the iCub-zynq, reads the Address Events (AE) from the Zynq board and exposes the data on a YARP port for further processing on a separate PC with 16 GB of memory and an 8-core Intel i7 processor running at 2.5 GHz.

Figure 3.2: Image acquisition system.

3.1.2 Software Architecture: YARP

The robot iCub uses Yet Another Robot Platform (YARP) as middleware Metta et al. (2006) to provide modular processing and connectivity to sensors and actuators. YARP is an open-source middleware platform written in C++ for building a robot control system as a collection of programs that communicate with each other via different protocols such as TCP/IP, UDP, etc. The air hockey application is developed using the event-driven framework Glover et al. (2018), which provides basic functionality for handling events in a YARP environment. Figure 3.3 shows the event types developed within the event-driven framework. For instance, the events produced by the event cameras are called AddressEvent (AE) and encode information about the pixel address (x, y), the pixel polarity (δ: darker/lighter), and the camera channel (c: left/right) (see Equation 2.1 for a review of the variables mentioned here). Labelled AE (LAE) add a label ID on top of the fields inherited from address events.
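To illustrate the event types just described, the following is a plain Python mirror of the AddressEvent/LabelledAE fields (an assumed, simplified structure for explanation only, not the actual C++ classes of the event-driven library):

```python
from dataclasses import dataclass

@dataclass
class AddressEvent:
    stamp: int      # hardware timestamp
    x: int          # pixel column
    y: int          # pixel row
    polarity: int   # +1 lighter, -1 darker (assumed convention)
    channel: int    # 0 = left camera, 1 = right camera

@dataclass
class LabelledAE(AddressEvent):
    label: int      # ID of the tracked object the event is assigned to

ev = LabelledAE(stamp=12345, x=160, y=120, polarity=+1, channel=0, label=2)
print(ev.channel, ev.label)
```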

Figure 3.3: Event types defined in the event-driven library.

3.1.3 The iCub Simulator

The control software developed in this thesis was preliminarily tested using the iCub simulator, and then on the real iCub. The iCub simulator (Figure 3.4) was developed to reproduce the physics and dynamics of the robot and its environment using only open-source libraries, such as the Open Dynamics Engine (ODE) and OpenGL Tikhanoff et al. (2008). Since the simulator has been implemented using data collected directly from the robot, the results obtained with it can be trusted.


Figure 3.4: The iCub simulator interface in Ubuntu.

3.2 Air Hockey Robotic Task

The air hockey task appeared to be a good demonstrator for testing the event-based MID pipeline developed in this thesis, because the hockey puck can reach high speeds and can move in both directions along the depth axis (i.e., towards or away from the robot). Moreover, it represents a good opportunity for applying event-based visual servoing to a humanoid robot. The idea behind the air hockey task is simple: iCub has to move one of its arms to intercept, with the pusher, the air hockey puck moved by a human opponent. This thesis represents the first approach to the task, so in the following sections I will show the preliminary experiments towards iCub's defensive movements during air hockey play. Figure 3.5 shows the air hockey table that we have at our disposal in the laboratory. It is a professional air hockey table (Air Hockey 7 Ft Deluxe - Uragano) with a playfield measuring 179 x 87 cm. It is 80 cm tall and weighs 96 kg. The air hockey goal length is about 27 cm. The table comes equipped with 4 pushers and 4 pucks, which give the possibility of having multiple targets moving within the field of view of the iCub robot.


Figure 3.5: Left: professional air hockey table (Air Hockey 7 Ft Deluxe - Uragano), whose playfield measures 179 x 87 cm and whose goal is about 27 cm long; Right: iCub's initial playing posture, with the torso tilted and rotated by 30 degrees and the head tilted down by 7 degrees. The left hand grasps the air hockey pusher, extended with a cardboard cylinder tube.

3.3 Tracking in Motion-In-Depth

To compute motion-in-depth, it is necessary to know additional information about the moving object, namely:

• The position of the object in the 2D space (x, y).
• The depth of the object with respect to iCub (z).

The 2D position of the object can be obtained using a tracker, while the depth of the object can be determined via binocular disparity. Thus, the system of the air hockey application had to integrate two already existing event-based modules:

• Bio-inspired disparity extractor, implemented by Suman Ghosh for his Master Thesis Ghosh (2019). His module took inspiration from the work of Firouzi & Conradt (2016) described in Chapter 2. In contrast with other stereo-matching algorithms, it does not require dedicated neuromorphic hardware, because it simplifies the cooperative Spiking Neural Network (SNN) by representing it as a 3D matrix. Moreover, it solves the stereo correspondence problem by exploiting the temporal coherence of the stereo pair.

• Event-based Center of Mass (CoM) tracker, developed by Marco Monforte for his recent work Monforte et al. (2020). The CoM is simply computed as the mean of the pixel addresses where events have occurred in the image plane, considering a single moving target. The tracker accumulates events within a Region Of Interest (ROI) and then updates the position of the ROI based on the average position of the events within it. As the events move along the trajectory, so do their mean position and the centre of the ROI (see the sketch after this list).
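As a rough illustration of the tracking principle just described, the following sketch (with an illustrative event structure and ROI parameters, not the actual code of Monforte et al. (2020)) accumulates the events falling inside a square ROI and re-centres the ROI on their mean position.

// Minimal sketch: a centre-of-mass tracker that accumulates events inside a
// Region Of Interest and re-centres the ROI on their mean position.
// The Event layout and the parameter values are illustrative assumptions.
#include <vector>

struct Event { int x; int y; };                  // pixel address of an event

class CoMTracker {
public:
    CoMTracker(int cx, int cy, int halfSize)
        : cx_(cx), cy_(cy), half_(halfSize) {}

    // Update the ROI with a batch of events; returns true if the CoM was updated.
    bool update(const std::vector<Event>& events)
    {
        double sumX = 0.0, sumY = 0.0;
        int n = 0;
        for (const auto& e : events) {
            // keep only the events falling inside the current ROI
            if (e.x >= cx_ - half_ && e.x <= cx_ + half_ &&
                e.y >= cy_ - half_ && e.y <= cy_ + half_) {
                sumX += e.x; sumY += e.y; ++n;
            }
        }
        if (n == 0) return false;                // no activity: keep the previous position
        cx_ = static_cast<int>(sumX / n);        // centre of mass = mean event address
        cy_ = static_cast<int>(sumY / n);
        return true;                             // the ROI now follows the moving target
    }

    int x() const { return cx_; }
    int y() const { return cy_; }

private:
    int cx_, cy_;   // ROI centre (tracked centre of mass)
    int half_;      // half-size of the square ROI
};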

One important aspect of the Thesis was to understand how to integrate the two modules and how to make them communicate, in order to asynchronously extract motion-in-depth and derive a dynamic MID map for the objects moving in the scene in real time. There are two possible ways of integrating the two event-based modules, which differ in the order in which the modules are applied (see Fig. 3.6 and Fig. 3.7).

Figure 3.6: First pipeline with disparity extractor first and object tracker second.

Figure 3.7: Second pipeline with object tracker first and disparity extractor second.

Both approaches have pros and cons. In the first system (Figure 3.6), the disparity extractor computeDisp acquires the raw events from the stereo event-based cameras and asynchronously extracts the disparity information. The disparity events are then fed as input to the CoMtracker, which computes the Center of Mass (CoM) of the disparity events. The second system (Figure 3.7) first computes the center of mass and then the disparity of only a few events, namely those corresponding to the centers of mass of the objects present in the scene. The center of mass is calculated for both cameras, considering separately the events acquired from the left and the right sensor. The first system turned out to be more computationally demanding than the second, because it computes the disparity for all the events generated by the cameras, even though we are interested only in the MID of one particular point representing the object, namely the center of mass. The second system, instead, computes the stereo correspondence and the disparity only on the centers of mass of the tracked objects, thus processing far fewer events.


Figure 3.8: Comparison of processing times, sampled at 1 Hz, for computing motion-in-depth as a function of the number of events received by the last module of each pipeline, for the two paddles moving forwards and backwards.

Figure 3.8 shows the computational time for both implementations, confirming that the second pipeline is faster than the first. The graph shows the processing time of both pipelines as a function of the number of events acquired by the last module for computing MID, for the two paddles moving forwards and backwards. For reasons of visualization, the processing time is sampled at 1 Hz, namely the computed processing time is shown every 1 s. The peaks of processing time at regular intervals correspond to instances when the paddle came very close to the camera, generating a large density of events.


The mean processing time reported in Table 3.1 is calculated on the processing times acquired every 100 ms. The first system takes at least ten times longer to compute MID than the second pipeline, since it processes more events (i.e., the disparity events over the whole image) than the second system, which receives only the centers of mass of the two objects. The times are in the order of milliseconds.

Pipeline    Two paddles    Two hands    One hand    Air hockey
1           4.0 ms         3.1 ms       11.7 ms     8.8 ms
2           0.14 ms        0.12 ms      0.47 ms     0.5 ms

Table 3.1: Comparison of the mean processing times for computing MID with the two pipelines on four different data sets: two paddles, two hands, one hand moving, air hockey.
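For reference, mean values such as those in Table 3.1 can be gathered with a simple measurement loop along the following lines; the batch size, the 100 ms window and the processBatch placeholder are illustrative assumptions, not the thesis code.

// Illustrative sketch (not the thesis code): measuring the per-batch processing
// time of a module and averaging it over 100 ms windows.
// processBatch() stands in for the module under test (tracker or disparity extractor).
#include <chrono>
#include <iostream>
#include <vector>

struct Event { int x; int y; };

void processBatch(const std::vector<Event>& batch)
{
    // placeholder for the real event processing
    volatile std::size_t sink = batch.size();
    (void)sink;
}

int main()
{
    using clock = std::chrono::steady_clock;

    std::vector<double> samples;            // processing times within the current window
    auto windowStart = clock::now();

    for (int i = 0; i < 100000; ++i) {      // stands in for the stream of event batches
        std::vector<Event> batch(100);      // illustrative batch size

        auto t0 = clock::now();
        processBatch(batch);
        auto t1 = clock::now();
        samples.push_back(std::chrono::duration<double>(t1 - t0).count());

        // every 100 ms, report the mean processing time and start a new window
        if (clock::now() - windowStart >= std::chrono::milliseconds(100)) {
            double sum = 0.0;
            for (double s : samples) sum += s;
            std::cout << "mean processing time: "
                      << (sum / samples.size()) * 1e3 << " ms" << std::endl;
            samples.clear();
            windowStart = clock::now();
        }
    }
    return 0;
}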

Despite its computational efficiency, the second system relies on the detection of the center of mass of the object over time in the left and right images. The CoM of the object is obtained by computing the center of mass of the events produced by the edges of the object. These events depend on motion contrast, illumination, reflections, noise and mismatch effects, which often result in an unstable detection of the CoM, whose location shifts between different points inside the object region. This effect is not easily controlled, so the centre can drift inside the object. A feature-based detector or a more sophisticated tracker, instead of a center of mass tracker, could improve the consistency in determining the exact object location, moving towards a feature- and time-based computation of disparity.

Figure 3.9 shows two pairs of stereo frames where the center of mass is not associated with the same point of a paddle stimulus over time.

Figure 3.10 shows the computed x coordinate of the CoM over time for the two pipelines, considering the same number of input events. The CoM computed using the second pipeline presents more oscillations than the one computed using the first. These oscillations of the centers of mass could be interpreted by the disparity extractor network as a real change in depth, even if the paddle has not moved significantly. This results in a noisy disparity output with many oscillations, even though the object is moving only forwards.


(a) left (b) right

(c) left (d) right

Figure 3.9: Oscillations of the CoM inside the object's ROI over stereo frames captured at different instants of time, tracking the CoM (in orange) of two paddles moving forwards and backwards using the second computational pipeline.

Figure 3.10: xCoM [pixels] as a function of the number of events for Pipeline 1 and Pipeline 2.


Moreover, Figure 3.11 shows that the abscissa oscillations cause crossings between the x coordinates of the CoM computed separately in the left and right camera, particularly evident in the graph between 0.5 s and 1 s. This results in a negative disparity when the disparity is computed as the difference between the x coordinates in the left and right camera ($x_{CoM_l} - x_{CoM_r}$). The disparity extractor computeDisp could not work properly if the input events do not come from the same object point, due to these oscillations. Since the object is constantly approaching the observer, the curve associated with the left CoM has to stay above the curve associated with the right CoM at all times. This finding led to the choice of the first pipeline to be run online on the robot and to provide information about what is happening in the visual field. However, as suggested above, a more stable and reliable tracker could improve the second, faster pipeline.
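A minimal sketch of the sanity check implied by this observation is given below; the function name is hypothetical, and the convention (disparity as $x_{CoM_l} - x_{CoM_r}$, with negative values rejected) follows the discussion above.

// Minimal, illustrative sketch (not the thesis code): computing the disparity of the
// tracked centre of mass as the difference between the left and right x coordinates,
// and rejecting samples where the two curves cross (negative disparity), which cannot
// correspond to a valid match for an object in front of the cameras.
#include <optional>

// Returns the disparity in pixels, or no value when the left/right CoMs cross.
std::optional<int> comDisparity(int xCoMLeft, int xCoMRight)
{
    int disparity = xCoMLeft - xCoMRight;   // x_CoM_l - x_CoM_r
    if (disparity < 0)
        return std::nullopt;                // crossing due to CoM oscillations: discard
    return disparity;
}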


Figure 3.11: CoM abscissa trend computed separately in the left and right camera using the second pipeline.

3.3.1 Asynchronous Cooperative Stereo Matching Network

Suman Ghosh, during his Master Thesis project, developed a robust C++ application for dynamic disparity estimation using the humanoid robot iCub equipped with event-based vision sensors Ghosh (2019). Ghosh's work was inspired by the event-based, biologically-inspired model already explained in Section 2.2.2 Firouzi & Conradt (2016). It deals with a matrix-based cooperative and competitive stereo-matching network that could be implemented using a Spiking Neural Network (SNN), as Osswald et al. did Osswald et al. (2017). For each rectified input event, the network updates a 3D matrix whose first two dimensions equal the sensor resolution and whose third dimension equals the maximum detectable disparity value. Then, it computes the best disparity value via a Winner-Takes-All mechanism, taking into account the cross-disparity uniqueness and within-disparity continuity constraints first introduced by Marr and Poggio Marr & Poggio (1976). The output of the network is a dynamic disparity map containing all the rectified input events, each carrying the additional, color-coded information of its disparity. The disparity information is useful to estimate the depth of each event. The disparity could be computed as a simple difference between the x coordinates of corresponding points, but it is not always easy to determine the correct matches between points in the left and right images. Ghosh's application fits perfectly in the context of this Thesis because it solves the "correspondence problem" that arises especially in a scenario with multiple objects. It is also biologically-inspired, since the model tries to reproduce the cooperative and competitive process exerted by cortical neurons to extract depth information Julesz (1971). Coupled with neuromorphic vision sensors, the computeDisp module can asynchronously process events as and when they appear, compute the disparity and make it immediately available on the robot in real time, without requiring neuromorphic hardware to implement the SNN, in contrast to the work of Osswald et al. Osswald et al. (2017).
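To make the data structure more concrete, the following is a minimal sketch of the matrix-based idea: a 3D activity volume of size width x height x maximum disparity accumulates support for candidate matches, and a Winner-Takes-All over the disparity axis selects the best value. The update rule, the decay factor and the class interface are illustrative assumptions, and the sketch omits the cooperative excitation and competitive inhibition that implement the uniqueness and continuity constraints in the actual network.

// Minimal sketch of a matrix-based disparity volume with Winner-Takes-All readout.
// Not the implementation of Ghosh (2019) or Firouzi & Conradt (2016); parameters
// and the update rule are illustrative.
#include <vector>
#include <algorithm>
#include <cstddef>

class DisparityVolume {
public:
    DisparityVolume(int width, int height, int maxDisp)
        : w_(width), h_(height), dmax_(maxDisp),
          activity_(static_cast<std::size_t>(width) * height * maxDisp, 0.0f) {}

    // For a rectified left event at (xL, y), add support to every disparity cell
    // consistent with a recent right event at (xL - d, y) on the same row.
    void addLeftEvent(int xL, int y, const std::vector<int>& rightEventColumns)
    {
        for (int xR : rightEventColumns) {
            int d = xL - xR;                          // candidate disparity
            if (d >= 0 && d < dmax_)
                activity_[index(xL, y, d)] += 1.0f;   // accumulate matching support
        }
    }

    // Winner-Takes-All along the disparity axis for pixel (x, y).
    int bestDisparity(int x, int y) const
    {
        auto begin = activity_.begin() + static_cast<std::ptrdiff_t>(index(x, y, 0));
        return static_cast<int>(std::max_element(begin, begin + dmax_) - begin);
    }

    // Exponential decay of the whole volume, so that stale matches lose the competition
    // (a crude stand-in for the temporal coherence exploited by the real network).
    void decay(float factor = 0.9f)
    {
        for (float& a : activity_) a *= factor;
    }

private:
    std::size_t index(int x, int y, int d) const
    {
        return (static_cast<std::size_t>(y) * w_ + x) * dmax_ + d;
    }

    int w_, h_, dmax_;
    std::vector<float> activity_;
};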

3.3.2 Object Tracking

In classic Computer Vision, tracking moving objects in the scene can be considered equivalent to finding the trajectory of a selected feature. As a first approach, the simplest low-level feature was considered, namely the centroid location of the tracked objects. In the case of fixed stereo event cameras, events are only caused by moving objects in the scene. Therefore, the area where the motion is happening can be detected using object region descriptors such as statistical moments, which are commonly used in traditional computer vision for image analysis Rocha et al. (2002). These descriptors allow us to find interesting object properties such as the center of mass, the orientation, or the shape of the moving area as the best-fitting ellipse. In classical computer vision, image moments are mathematically defined as weighted averages of the image pixels' intensities. In a grey-scale image with pixel intensities I(x, y), the raw (p, q)-moment $M_{p,q}$ is given by:

\[
M_{p,q} = \sum_{x}^{W} \sum_{y}^{H} x^{p} y^{q}\, I(x, y) \qquad (3.1)
\]

Equation 3.1 can be rearranged to fit the event-based framework, by summing over the pixel addresses (x, y) at which events have occurred:

\[
M_{p,q} = \sum_{x,y} x^{p} y^{q} \qquad (3.2)
\]

As already said in Section 2.2.3, events do not have an intensity value, in contrast to frame-based pixels. Events simply spike when the intensity change at their pixel exceeds a certain threshold. According to the possible combinations of the indices (p, q) (see Eq. 3.2), image moments of various orders can be defined, each describing a different object property. For example, the zero-order moment $M_{0,0}$ gives the total number of events in the image, i.e., the object's area. The first-order moments $M_{1,0}$ and $M_{0,1}$, when normalized by $M_{0,0}$, give the coordinates of the center of mass in the horizontal and vertical directions, respectively, computed as follows:

\[
x_{CoM} = \frac{M_{1,0}}{M_{0,0}}, \qquad y_{CoM} = \frac{M_{0,1}}{M_{0,0}} \qquad (3.3)
\]

Extracting the parameters of the equivalent ellipse circumscribing the object is slightly more involved, because they are not directly obtainable from the second-order raw moments $M_{2,0}$, $M_{1,1}$, $M_{0,2}$ (applying Eq. 3.2). It is also necessary to compute the second-order central moments, defined as follows:

\[
\mu'_{2,0} = \frac{M_{2,0}}{M_{0,0}} - x_{CoM}^{2}, \qquad
\mu'_{1,1} = \frac{M_{1,1}}{M_{0,0}} - x_{CoM}\, y_{CoM}, \qquad
\mu'_{0,2} = \frac{M_{0,2}}{M_{0,0}} - y_{CoM}^{2} \qquad (3.4)
\]

Finally, the major axis orientation θ and the major and minor axis lengths l and w are computed as follows:

\[
\theta = \frac{1}{2}\,\tan^{-1}\!\left(\frac{2\,\mu'_{1,1}}{\mu'_{2,0} - \mu'_{0,2}}\right), \qquad
l = \sqrt{6\left[\left(\mu'_{2,0} + \mu'_{0,2}\right) + \sqrt{4\,\mu'^{2}_{1,1} + \left(\mu'_{2,0} - \mu'_{0,2}\right)^{2}}\right]}, \qquad
w = \sqrt{6\left[\left(\mu'_{2,0} + \mu'_{0,2}\right) - \sqrt{4\,\mu'^{2}_{1,1} + \left(\mu'_{2,0} - \mu'_{0,2}\right)^{2}}\right]} \qquad (3.5)
\]
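The descriptors above translate directly into a few lines of code. The following is a minimal sketch (not the thesis implementation; the Event structure and function names are illustrative) that accumulates the raw moments of an event batch and derives the centre of mass, orientation and equivalent axis lengths of Eqs. 3.2-3.5; atan2 is used in place of the arctangent of the ratio so that the orientation remains well defined when $\mu'_{2,0} = \mu'_{0,2}$.

// Illustrative sketch: event-based image moments up to second order and the
// derived blob descriptors (centre of mass, orientation, equivalent axes),
// following Eqs. 3.2-3.5. Structure and names are assumptions.
#include <cmath>
#include <vector>

struct Event { int x; int y; };

struct BlobDescriptor {
    double xCoM, yCoM;   // centre of mass (Eq. 3.3)
    double theta;        // major-axis orientation (Eq. 3.5)
    double l, w;         // major/minor axis lengths (Eq. 3.5)
};

BlobDescriptor momentsFromEvents(const std::vector<Event>& events)
{
    // raw moments (Eq. 3.2): every event contributes x^p y^q
    double M00 = 0, M10 = 0, M01 = 0, M20 = 0, M11 = 0, M02 = 0;
    for (const auto& e : events) {
        M00 += 1.0;
        M10 += e.x;         M01 += e.y;
        M20 += e.x * e.x;   M11 += e.x * e.y;   M02 += e.y * e.y;
    }

    BlobDescriptor b{};
    if (M00 == 0) return b;                      // no events, nothing to describe

    // centre of mass (Eq. 3.3)
    b.xCoM = M10 / M00;
    b.yCoM = M01 / M00;

    // second-order central moments (Eq. 3.4)
    double mu20 = M20 / M00 - b.xCoM * b.xCoM;
    double mu11 = M11 / M00 - b.xCoM * b.yCoM;
    double mu02 = M02 / M00 - b.yCoM * b.yCoM;

    // orientation and axis lengths of the equivalent ellipse (Eq. 3.5)
    double root = std::sqrt(4.0 * mu11 * mu11 + (mu20 - mu02) * (mu20 - mu02));
    b.theta = 0.5 * std::atan2(2.0 * mu11, mu20 - mu02);
    b.l = std::sqrt(6.0 * ((mu20 + mu02) + root));
    b.w = std::sqrt(6.0 * ((mu20 + mu02) - root));
    return b;
}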

