Università di Pisa

Dipartimento di Ingegneria dell’Informazione

Master's Thesis in

Robotics and Automation Engineering

Detection and Pose Estimation of Texture-less Objects in a Multi-Camera System.

Application to Visual Servoing and Manipulation of a Semi-Humanoid, ROS-guided Robot in an Industrial Context.

Candidate:

Erika Di Stefano

Supervisors:

Prof. Carlo Alberto Avizzano

Prof. Emanuele Ruffaldi

Prof. Lorenzo Pollini


Abstract

In this thesis we investigate and develop an algorithm for the detection and pose estimation of texture-less objects in a multi-camera system, using RGB images and range sensors.

We propose a method for fast training and detection of untextured objects and focus our attention on sensor-fusion, combining the multiple sensors’ data in a recursive filtering process that aims to obtain an estimate of the object’s 3D pose.

The system is applied to the visual servoing of a semi-humanoid robot with the purpose of manipulating objects of uniform colour and material that are typically employed in industrial environments.

The software is developed on the ROS operating system, suitably interfaced with MATLAB and Simulink, and the algorithm was validated first on synthetic data and then on the real system.

Contents

1 Introduction
  1.1 Problem Statement
  1.2 Related Work
  1.3 Thesis Objectives
  1.4 Thesis Structure
2 Theoretical Prerequisites
  2.1 Camera Geometry Model
  2.2 Image Features
  2.3 Visual Servoing
    2.3.1 Image Based Visual Servoing
    2.3.2 Position Based Visual Servoing
  2.4 Rigid Transformations
    2.4.1 Quaternions
  2.5 State Estimation
    2.5.1 Kalman Filter
    2.5.2 Unscented Kalman Filter
3 Hardware
  3.1 Baxter Robot
    3.1.1 Arms
    3.1.2 RGB Cameras
    3.1.3 IR sensors
  3.2 RGBD Camera
4 Software
  4.1 ROS - Robot Operating System
  4.2 OpenGL and Synthetic Rendering
5 Proposed Approach
  5.1 Introduction
  5.2 Reference Frames
  5.3 Object Detection Algorithm
    5.3.1 Object Training Stage
    5.3.2 Object Detection Stage
  5.4 Multicamera Pose Estimation
    5.4.1 Kalman Filter for Position Estimation
    5.4.2 Unscented Kalman Filter for Orientation Estimation
    5.4.3 Estimation in the Moving Camera Frame
    5.4.4 Estimation in the Fixed Base Frame
    5.4.5 Sensor Fusion
  5.5 Robot Visual Servoing
    5.5.1 End Effector Trajectory Planning
6 Validation

Chapter 1

Introduction

Human-robot cooperation is one of the most active fields of robotic research today. Both home and industrial environments are increasingly populated by robots that assist humans in tasks such as object manipulation, assembly and precision operations, and that replace them when an action cannot be performed. The most recent results of this research are collaborative humanoid robots such as Baxter from Rethink Robotics, Roberta from Gomtec, Speedy-10 from Mabi, Nextage from Kawada Industries and many others.

In particular, when the robot has to be aware of the surrounding world in order to interact with it, it is necessary to reproduce the human senses that provide perceptual capabilities. Among them, the one that finds the widest robotic application is certainly vision.

In the last few years there has been a great development of RGB and RGB-D range cameras, whose fields of application are varied. In robotics and automation, these vision systems are widely used for object detection and visual servoing, and much research is dedicated to these topics with the aim of creating systems that interact with as much autonomy as possible.


1.1 Problem Statement

The problem addressed in this thesis work is the recognition and 3D pose estimation of texture-less objects, in order to fulfil a visual servoing task aimed at performing a grasping operation. For this purpose, we rely on a Baxter research robot from Rethink Robotics and a multi-camera system.

Baxter is a semi-humanoid robot equipped with two 7-DOF arms, three VGA cameras (one mounted on the head and one on each end-effector) and two range IR sensors, likewise mounted on the end-effectors. Moreover, an external Asus XTION RGB-D camera is available.

The first issue concerns the type of objects to consider. Indeed we have to handle objects devoid of texture and of reduced size, employed typically in industrial environments and for pick and place and assembling operations. This characteristic prevents us from using classical object detection algorithms that rely on the presence of interesting points on the object’s surface, deriving from colour information.

Another issue is how to extract the 3D pose of the object of interest from the 2D data provided by the sensors.

Furthermore, we have to combine all the measurements provided by the multi-camera system (2D and 3D data) in order to obtain an optimal overall estimate. At the same time, the robot has to fulfil the visual servoing task once the 3D pose of the object is recovered.

The topics that we face in this thesis work are thus the following:

• Detection of texture-less objects from multiple cameras.

• Sensor fusion for 3D pose estimation.


1.2 Related Work

In this section we review a few existing approaches in the literature from which this work drew inspiration.

A branch of the existing algorithms aims to detect unseen objects belonging to certain object classes, learned through computationally expensive statistical-learning techniques, such as random forests [Breiman, 2001] or HOG features [Dalal and Triggs, 2005]. While they perform well for category generalization, they are usually much slower during learning and runtime which makes them unsuitable for online applications.

Other methods rely on the knowledge of the object in terms of keypoint features: points or regions of interest that can be univocally detected in each new image, derived from colour and shape information. This branch has progressed significantly thanks to research that has produced very efficient descriptors, such as SIFT [Lowe, 1999] and SURF [Bay et al., 2008]. When applying these two descriptors, objects are learned during a training phase, where interesting points of a set of reference images of the object are extracted and stored in a database. The object is later recognized in an image by individually comparing each feature from the image to this database. Keypoint-based approaches are well suited for textured objects, but may not be effective for texture-less objects, made of uniform colour and material. In this case edge detectors can be employed [Canny, 1986], which allow discontinuities in depth and surface orientation, as well as changes in material properties and colour, to be discriminated by employing gradient magnitudes and surface normals. On the other hand, geometrical figure detectors [Ballard, 1981] help to find particular shapes in the image.

For the issue of texture-less object detection, template-matching algorithms seem to be the most suitable. An early approach to template matching [Olson and Huttenlocher, 1997] and its extension include the use of the Chamfer distance between the template and the input image contours as a dissimilarity measure. For instance, in [Gavrila and Philomin, 1999] a coarse-to-fine approach in shape and parameter space is introduced, using Chamfer Matching [Borgefors, 1988] on the Distance Transform of a binary
edge image. The Chamfer Matching minimizes a generalized distance between two sets of edge points. Although being fast when using the Distance Transform (DT), the disadvantage of the Chamfer Transform is its sensitivity to outliers which often result from occlusions.

Another common measure on binary edge images is the Hausdorff distance [Rucklidge, 1997]. It measures the maximum of all distances from each edge point in the image to its nearest neighbour in the template. However, it is sensitive to occlusions and clutter. In [Olson and Huttenlocher, 1997] the authors try to avoid that shortcoming by introducing a generalized Hausdorff distance which only computes the maximum of the k−th largest distances between the image and the model edges and the l-th largest distances between the model and the image edges. This makes the method robust against a certain percentage of occlusions and clutter. Unfortunately, a prior estimate of the background clutter in the image is required but not always available. Additionally, computing the Hausdorff distance is computationally expensive and prevents its real-time application when many templates are used. Both Chamfer Matching and the Hausdorff distance can easily be modified to take the orientation of edge points into account. This drastically reduces the number of false positives, but unfortunately also increases the computational load. [Holzer et al., 2009] is also based on the Distance Transform, however, it is invariant to scale changes and robust enough against planar perspective distortions to do real-time matching.

Nevertheless, it is restricted to objects with closed contours, which are not always available. All these methods use binary edge images obtained with a contour extraction algorithm, using the Canny detector for example, and they are very sensitive to illumination changes, noise and blur. For instance, if the image contrast is lowered, the number of extracted edge pixels progressively decreases which has the same effect as increasing the amount of occlusion. The method proposed in [Steger, 2002] tries to overcome these limitations by considering the image gradients in contrast to the image contours. It relies on the dot product as a similarity measure between the template gradients and those in the image. Unfortunately, this measure rapidly declines with the distance to the object location, or when the object appearance is even
slightly distorted. As a result, the similarity measure must be evaluated densely and with many templates to handle appearance variations, making the method computationally costly. Using image pyramids provides some speed improvements, however, fine but important structures tend to be lost if one does not carefully sample the scale space.

There are also approaches that are specifically trained on different viewpoints. For example in [Ozuysal et al., 2009], one or several classifiers are trained to detect faces or cars under various views.

More recent approaches for 3D object detection are related to object class recognition. In [Stark et al., 2010] the authors rely on 3D CAD models and generate a training set by rendering them from different viewpoints. In [Liebelt and Schmid, 2010], geometric shape and pose priors are combined with natural images.

Another recent solution for handling texture-less objects is the BOLD feature descriptor [Tombari et al., 2013]. This descriptor exploits groups of neighbouring line segments to build a part-based representation of the object. In particular, the descriptor considers relative orientations and displacements between pairs of segments, and the object is detected by comparing these primitives with the stored ones. It is invariant to rotation and translation, but not very robust to outliers.

Among the related works that also take depth data into account, most approaches are related to pedestrian detection. They use three kinds of cues: image intensity, depth and motion (optical flow). The most recent approach, [Enzweiler et al., 2010], builds part-based models of pedestrians in order to handle occlusions caused by other objects, and not only the self-occlusions modelled in other approaches [Wojek et al., 2009]. Besides pedestrian detection, an approach to object classification, pose estimation and reconstruction was introduced in [Sun et al., 2010]: the training data set is composed of depth and image intensities, while the object classes are detected using a modified Hough transform.

With regards to the pose-estimation problem, specifically applied to visual servoing and manipulation, there is a branch that approaches the problem by combination of image data provided by multiple cameras in order to fulfil an
image-based visual servoing task [Kermorgant and Chaumette, 2011]. Other approaches resort to Kalman filtering for target pose estimation, computed from visual data [Lippiello et al., 2003].

1.3 Thesis Objectives

We developed an algorithm for the detection and pose estimation of texture-less objects in a multi-camera system, applied to the visual servoing of the Baxter robot and the execution of a grasping operation.

The detection system employs the state-of-the-art LINEMOD method [Hinterstoisser et al., 2012], an efficient online template-matching algorithm that allows real-time detection of the object and, at the same time, has a low computational load, thus outperforming many other existing texture-less object detection algorithms.

Since this algorithm only detects the object in an image, a key point of this work is how to recover the 3D pose from the visual information. To this end, we propose a method that recognizes the object in the image and, at the same time, returns a coarse value of the object's pose with respect to the camera. This is achieved by capturing several images of the object in different poses with respect to the camera and recording templates that contain the image features and the pose. Doing this operation manually would require too much time and highly accurate pose recordings of the object in front of the camera. For this reason we implement a training system that relies on the a priori knowledge of the object in terms of its CAD model and on the generation of synthetic images rendered in different poses with respect to a virtual camera having the same intrinsic parameters as the real one. In this way we generate a wide set of colour and depth images of the object in different poses with respect to the camera and employ them in a training stage where templates and the associated poses are recorded in a very fast and efficient way.

All the information deriving from the various cameras is integrated in a sensor-fusion framework developed within a Kalman filtering process that provides a pose estimate of the object in the 3D space, thus computing the
reference end-effector pose to perform a grasping action.

For the latter purpose, a position-based visual servoing of the robot's arm is executed by dynamically planning the end-effector trajectory, which provides the desired end-effector pose at each step in order to approach the object of interest and, at the same time, ensures that the object remains in the field of view of the moving hand-mounted camera. The position-based approach was chosen because it allows the desired end-effector pose to be specified directly and because the specific image features required by the alternative image-based approach are not available.

The chosen approach provides safer motion of the robot's arm and more precise positioning of the end-effector, especially when objects of significantly reduced size have to be manipulated. In this way good performance can be achieved despite the robot's technical limitation represented by the arm's elastic backlash, which follows from the elastic joint actuation.

The software is developed mostly in C++ on the ROS operating system running on Linux Ubuntu, which provides an interface between the robot and the user, and partly in MATLAB, suitably communicating with the ROS platform.

1.4 Thesis Structure

In the following, the content of the thesis chapters is briefly summarized:

• Chapter 1: Analysis of the state of the art and illustration of the thesis goals.

• Chapter 2: Explanation of the employed theoretical tools.

• Chapter 3: Description of the hardware.

• Chapter 4: Description of the software.

• Chapter 5: Description of the proposed approach.

• Chapter 6: Illustration of the validation results.

• Chapter 7: Concluding discussion.


Chapter 2

Theoretical Prerequisites

In this chapter we will present a few basic theoretical elements of computer vision, visual servoing, rigid transformations and state estimation.

2.1 Camera Geometry Model

An image represents a projection of 3D points onto an image plane. There exist different projection models that describe the relationship between a point in 3D space and a 2D point on the image plane. The simplest among them, which does not take into account any image distortion, is called the pinhole camera model and explains the process of perspective projection. According to this model, the optical projection center is a point through which all the rays of light pass (see fig. 2.2).

Consider a reference system $\{C\} = \{X_C, Y_C, Z_C\}$ with its origin at the optical projection center and with the optical axis $Z_C$ pointing in the viewing direction of the camera. The image plane is the plane orthogonal to the optical axis at a distance $f$, called the focal length, from the camera reference system origin $O_C$.

Let $\{I\} = \{x, y\}$ be the two-dimensional reference system associated with the image plane, with its origin at the intersection between the camera optical axis and the image plane, called the principal point.

A 3D point $P = (X, Y, Z)$, expressed in the camera frame, is projected onto a point $p = (x, y)$ on the image plane as follows:

$$\begin{bmatrix} x \\ y \end{bmatrix} = \frac{f}{Z} \begin{bmatrix} X \\ Y \end{bmatrix} \qquad (2.1)$$

Fig. 2.1: Pinhole camera model.

Fig. 2.2: Perspective Projection.

If, in general, the 3D point $P$ is expressed in a world coordinate system $\{W\}$ different from the camera frame $\{C\}$, an intermediate transformation ${}^C T_W$, which accounts for the relative pose between $\{C\}$ and $\{W\}$, is required. This transformation matrix is the matrix of the extrinsic parameters, computed during a calibration stage and made of a rotational and a translational component. In homogeneous coordinates:

$$\begin{bmatrix} {}^C X \\ {}^C Y \\ {}^C Z \\ 1 \end{bmatrix} = {}^C T_W \begin{bmatrix} {}^W X \\ {}^W Y \\ {}^W Z \\ 1 \end{bmatrix} \qquad (2.2)$$

and therefore

$$\begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = \frac{f}{{}^C Z} \begin{bmatrix} {}^C X \\ {}^C Y \\ {}^C Z \end{bmatrix} \qquad (2.3)$$

Next, the 2D image-plane coordinates $(x, y)$ are transformed into pixel coordinates $(x_p, y_p)$ through the camera intrinsic parameters:

$$\begin{bmatrix} x_p \\ y_p \\ 1 \end{bmatrix} = \begin{bmatrix} \frac{1}{s_x} & 0 & c_x \\ 0 & \frac{1}{s_y} & c_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} \qquad (2.4)$$

where $s_x$ and $s_y$ are the pixel sizes, as pixels might not be square, and $c_x$ and $c_y$ are the principal point coordinates.

At this point we can define the equation that fully describes the mapping from a 3D real-world point, transformed to the camera frame, to a 2D pixel point, assuming a pinhole camera model:

$$\begin{bmatrix} x_p \\ y_p \\ 1 \end{bmatrix} = \begin{bmatrix} \frac{1}{s_x} & 0 & c_x \\ 0 & \frac{1}{s_y} & c_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} f & 0 & 0 \\ 0 & f & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} \frac{1}{{}^W Z} & 0 & 0 \\ 0 & \frac{1}{{}^W Z} & 0 \\ 0 & 0 & \frac{1}{{}^W Z} \end{bmatrix} \begin{bmatrix} {}^C X \\ {}^C Y \\ {}^C Z \end{bmatrix} \qquad (2.5)$$
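As a concrete illustration of the mapping (2.1)–(2.5), the following minimal C++ sketch projects a 3D point expressed in the camera frame onto pixel coordinates, ignoring distortion; the intrinsic values are placeholder numbers, not the calibrated parameters of the Baxter or XTION cameras.

    #include <cstdio>

    // Simple pinhole intrinsics: focal lengths in pixel units (f/s_x, f/s_y)
    // and principal point (c_x, c_y). Values used below are illustrative only.
    struct PinholeCamera {
        double fx, fy;   // focal lengths in pixels
        double cx, cy;   // principal point in pixels
    };

    // Project a 3D point (X, Y, Z), expressed in the camera frame, to pixel
    // coordinates according to the pinhole model (no distortion).
    bool projectPoint(const PinholeCamera& cam, double X, double Y, double Z,
                      double& xp, double& yp) {
        if (Z <= 0.0) return false;          // point behind or on the camera plane
        xp = cam.fx * (X / Z) + cam.cx;      // x_p = (f/s_x) * X/Z + c_x
        yp = cam.fy * (Y / Z) + cam.cy;      // y_p = (f/s_y) * Y/Z + c_y
        return true;
    }

    int main() {
        PinholeCamera cam{600.0, 600.0, 320.0, 200.0};   // placeholder intrinsics
        double xp, yp;
        if (projectPoint(cam, 0.05, -0.02, 0.80, xp, yp))
            std::printf("pixel = (%.1f, %.1f)\n", xp, yp);
        return 0;
    }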

However, in a real camera lenses are used and, depending on the design and quality of the lenses, the resulting image will be more or less distorted. It is, therefore, necessary to compensate for those distortions by applying a distortion model. This is done by solving non-linear equations, which take into account radial and tangential distortions of the image and are not so
trivial and fast to solve.

2.2 Image Features

Image features are primitives in the image plane that can be unambiguously recovered for different points of view of the camera.

If we consider a generic object, features can be points, lines, borders, edges, corners, etc. In particular, such an element can be classified as feature if it has the following characteristics :

• Distinctiveness : it is unique and distinguishable from other features. • Repeatable : it is always detectable.

• Invariant to geometric transformations : it is detectable regardless of a translation, a rotation or a scale transformation.

Figure 2.3 shows an example of corner and blob detection :

Fig. 2.3: Corner and Blob detectors.

Each feature is described by a feature descriptor, i.e. a representation that can be of higher or lower dimensionality than the feature itself.

A simple representation can be, for example, the 2D coordinates for a point feature, whereas a complex one can be a histogram of gradients for a border. If the same feature is present in two different images, a feature-matching is performed through a correspondence between the descriptors. In figure 2.4 we can see an example of edge detection through the image gradients and
their local descriptor and an illustration of a keypoint matching between two images by means of the descriptors.

Fig. 2.4: Feature descriptor and feature matching.

Feature descriptors can be divided into two classes:

• global descriptor : it furnishes a global description of the entire image.

• local descriptor : it describes a patch of certain dimensions within the image.

2.3 Visual Servoing

Visual servoing or visual servo control refers to the use of computer vision data to control the motion of a robot. Typical applications are tracking, positioning with respect to an object and grasping.

Generally, images can be acquired from one or more cameras that can be placed in two different configurations :

• Eye-to-hand configuration, where the cameras are fixed in the workspace.

• Eye-in-hand configuration, where the cameras are attached to the robot's end-effector.

Two main branches of visual servoing have been developed:

• Image-based visual servoing

• Position-based visual servoing

Fig. 2.5: Camera configurations.

2.3.1 Image Based Visual Servoing

In image-based visual servoing the robot's motion is determined by minimizing the error between a set of desired, previously learned image feature parameters of the object and those available in the current image data. Given an appropriate distance measure $d(\cdot, \cdot)$ between the features under consideration, the error vector can be defined as

$$e(t) = d\big(s(I(t), p_i, p_e),\ s^*\big) \qquad (2.6)$$

where $s$ and $s^*$ are the current and desired feature parameters, $I(t)$ is the image, and $p_i$ and $p_e$ are the sets of camera intrinsic and extrinsic parameters, respectively.

In image-based visual-servoing point features are generally employed and the aim is to align the reference points with the ones that the camera is capturing.

Assuming that a motion of the camera causes the measured features to change, the relationship between the rate of change of the image features, which we call $\dot s$, and the velocity screw $V_c$ of the camera that observes the object is

$$\dot s = \Psi\, V_c \qquad (2.7)$$

Fig. 2.6: a) Desired camera pose b) Initial camera pose c) Desired target image (red) and initial one (blue).

where $\Psi$ is the so-called interaction matrix, whose mathematical form depends on the choice of the feature parameters. From this relationship one can compute the control law that produces the velocity screw given as input to the camera and leads to an exponential decrease of the image-space error $e(t)$:

$$V_c = -\lambda\, \Psi^{+} e, \qquad \Psi \in \mathbb{R}^{k \times 6}, \quad \lambda \in \mathbb{R} \qquad (2.8)$$
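A minimal sketch of the control law (2.8) using the Eigen library is shown below; the interaction matrix and feature error are random placeholders, since in practice $\Psi$ depends on the chosen features and on the camera parameters.

    #include <Eigen/Dense>
    #include <iostream>

    // Compute the camera velocity screw V_c = -lambda * pinv(Psi) * e  (eq. 2.8).
    // Psi is the (approximate) k x 6 interaction matrix, e the k x 1 feature error.
    Eigen::Matrix<double, 6, 1> ibvsControl(const Eigen::MatrixXd& Psi,
                                            const Eigen::VectorXd& e,
                                            double lambda) {
        // Moore-Penrose pseudo-inverse, robust to a rank-deficient Psi.
        Eigen::MatrixXd pinv =
            Psi.completeOrthogonalDecomposition().pseudoInverse();
        return -lambda * pinv * e;
    }

    int main() {
        // Placeholder example: two point features (k = 4 rows) and a random Psi.
        Eigen::MatrixXd Psi = Eigen::MatrixXd::Random(4, 6);
        Eigen::VectorXd e   = Eigen::VectorXd::Random(4);
        Eigen::Matrix<double, 6, 1> Vc = ibvsControl(Psi, e, 0.5);
        std::cout << "V_c = " << Vc.transpose() << std::endl;
        return 0;
    }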

Both in eye-in-hand and eye-to-hand configurations, the computed velocity screw has to be mapped from the camera reference system to the end-effector one, considering, if needed, the kinematic chain between them. If the camera is attached to the end-effector of a robotic manipulator, as in figure 2.7, the feature-space control loop has to be combined with a joint-space control loop:

Fig. 2.7: Image-based visual servoing control loop.

The image-based approach suffers, therefore, from various drawbacks.

First of all, the computed camera screw does not take into account any physical limitations of the robot/manipulator on which the camera is mounted, thus often producing sub-optimal motion in Cartesian space due to the coupling between translational and rotational motion.

The computation of $\Psi^{+}$ involves the extrinsic and intrinsic parameters, whose knowledge is inevitably affected by uncertainty. For this reason, only an approximation $\hat\Psi$ of the interaction matrix can be employed, and a coarse estimate of it can lead to instability problems.

Task singularities can also occur, caused by a wrong choice of the features, which reduce $\mathrm{rank}(\hat\Psi)$ and lead to unsatisfactory motions of the camera.

Another issue is the presence of local minima that give rise to unfeasible image motions.

2.3.2 Position Based Visual Servoing

Differently from the previously described approach, in position-based visual servoing the image features are exploited together with the knowledge of the 3D parameters of the target object to build an estimate of the object's pose with respect to the camera, and a position control of the robot/manipulator is defined to execute tracking and/or positioning tasks. In this case only a desired target pose is specified, and the control law and the positioning of the robot are executed according to its specific inverse kinematics and the desired control tuning parameters.

2.4 Rigid Transformations

Relative poses between objects can be represented by relative transformations between reference frames attached to the objects.

Consider the relative pose between a frame $\{S_0\}$ and a frame $\{S_1\}$. First, let us assume that no translation between the frames occurs, i.e. the frames' origins coincide.


Fig. 2.8: Position-based visual servoing control loop.

The relative orientation can then be expressed through the rotation matrix ${}^0R_1$ that takes a vector expressed in $\{S_1\}$ and transforms it into a vector expressed in $\{S_0\}$. If ${}^1\bar p = [{}^1x, {}^1y, {}^1z]^T$ is the vector of the 3D coordinates of a point with respect to $\{S_1\}$, then the same vector expressed in $\{S_0\}$ is defined as

$${}^0\bar p = {}^0R_1\, {}^1\bar p \qquad (2.9)$$

Fig. 2.9: Rotation between two frames.

The rotation matrix ${}^0R_1$ is a $3 \times 3$ matrix whose columns $r_1, r_2, r_3$ are mutually orthonormal vectors, i.e.

$$r_i^T r_j = \begin{cases} 0 & \text{if } i \neq j \\ 1 & \text{if } i = j \end{cases} \qquad (2.10)$$

(22)

Consequently, given a general rotation matrix $R$, the following property holds:

$$R^T R = R R^T = I_{3\times 3} \qquad (2.11)$$

The set of rotation matrices belongs to the algebraic group $SO(3)$, which has the following formal definition:

$$SO(3) = \{ R \in \mathbb{R}^{3\times 3} : R^T R = I,\ \det(R) = 1 \} \qquad (2.12)$$

Multiple rotations can be represented by the cascaded multiplication of the relative rotation matrices. If the successive rotations are assumed to occur about the initial fixed axes $X_0, Y_0, Z_0$, the overall rotation is obtained by pre-multiplying the relative rotation matrices:

$${}^0R_3 = {}^2R_3\, {}^1R_2\, {}^0R_1 \qquad (2.13)$$

whereas, if the sequence of rotations occurs about the mobile axes of the intermediate configurations between the initial and the final one, the overall rotation is obtained by post-multiplying the relative rotation matrices:

$${}^0R_3 = {}^0R_1\, {}^1R_2\, {}^2R_3 \qquad (2.14)$$

Even if R is made of 9 elements, thanks to the six constraint equations deriving from the group’s properties (2.12), only three parameters forming a minimal representation are required to represent the rotation.

One of the various existing minimal representations describes the rotation from one frame to the other as a sequence of three successive rotations about the mobile axes $X$, $Y$, $Z$. The possible sequences are $XYZ$, $YZX$, $ZXY$, $XZY$, $ZYX$, $YXZ$, whose respective rotation angles are called Cardan angles. Among them, the $XYZ$ (Roll, Pitch, Yaw) and the $ZYZ$ (Euler angles) sequences are widely employed.

Another widespread minimal representation makes use of the axis-angle notation, where the transformation is seen as a rotation by an angle $\theta$ about a single axis $\hat u$.

Fig. 2.10: Example of successive rotations.

Fig. 2.11: Axis-angle representation.

Given the 3 × 3 rotation matrix between two reference frames, it is therefore possible to compute the three chosen minimal parameters.

Nevertheless, there exist some conditions under which the representation degenerates. In that case a singularity occurs and the solution for the three parameters is not unique.

One way to bypass this issue is to employ unit quaternions.

If the relative transformation between two frames also involves a translation between the respective origins, then expression (2.9) becomes

$${}^0\bar p = {}^0R_1\, {}^1\bar p + {}^0\bar t_1 \qquad (2.15)$$

where ${}^0R_1$ represents the rotation and ${}^0\bar t_1$ the translation that describe the change of configuration from $\{S_0\}$ to $\{S_1\}$.

The same expression can be written in homogeneous form as

$$\begin{bmatrix} {}^0\bar p \\ 1 \end{bmatrix} = \begin{bmatrix} {}^0R_1 & {}^0\bar t_1 \\ 0_{1\times 3} & 1 \end{bmatrix} \begin{bmatrix} {}^1\bar p \\ 1 \end{bmatrix} = {}^0M_1 \begin{bmatrix} {}^1\bar p \\ 1 \end{bmatrix} \qquad (2.16)$$

and successive transformations can be represented by the cascaded multiplication of the relative homogeneous matrices ${}^{i-1}M_i$, analogously to the pure rotational case.

Fig. 2.12: Homogeneous transformation between two frames.
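The composition of homogeneous transformations of eq. (2.16) can be written compactly with Eigen's Isometry3d type, as in the following sketch; the numeric values of the transformations are arbitrary.

    #include <Eigen/Geometry>
    #include <cmath>
    #include <iostream>

    int main() {
        // 0M1: rotation of 90 deg about Z, then a translation (eq. 2.16).
        Eigen::Isometry3d M01 = Eigen::Isometry3d::Identity();
        M01.rotate(Eigen::AngleAxisd(M_PI / 2.0, Eigen::Vector3d::UnitZ()));
        M01.pretranslate(Eigen::Vector3d(0.1, 0.0, 0.2));   // 0t1

        // 1M2: a second relative transformation (arbitrary values).
        Eigen::Isometry3d M12 = Eigen::Isometry3d::Identity();
        M12.pretranslate(Eigen::Vector3d(0.0, 0.05, 0.0));

        // Cascaded composition: 0M2 = 0M1 * 1M2.
        Eigen::Isometry3d M02 = M01 * M12;

        // Map a point expressed in {S2} into {S0}: 0p = 0M2 * 2p.
        Eigen::Vector3d p2(0.01, 0.02, 0.03);
        Eigen::Vector3d p0 = M02 * p2;
        std::cout << "p in {S0}: " << p0.transpose() << std::endl;
        return 0;
    }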

2.4.1 Quaternions

A quaternion can be defined as a generalization of a complex number or, likewise, as an extension of complex numbers to $\mathbb{R}^4$, defined as follows:

$$q = q_0 + q_1 \hat i + q_2 \hat j + q_3 \hat k = [q_0, \bar q] \qquad (2.17)$$

where $q_0$ is the scalar part and $\bar q$ the vector part. Each $q_i$ of the quaternion is a real number, whereas $\hat i, \hat j, \hat k$ are orthogonal imaginary unit vectors such that

$$\hat i\hat i = \hat j\hat j = \hat k\hat k = \hat i\hat j\hat k = -1, \qquad \hat i\hat j = -\hat j\hat i = \hat k, \qquad \hat j\hat k = -\hat k\hat j = \hat i, \qquad \hat k\hat i = -\hat i\hat k = \hat j \qquad (2.18)$$

The conjugate of a quaternion is $q^* = [q_0, -\bar q]$ and the square of its magnitude is

$$\|q\|^2 = q q^* = q_0^2 + q_1^2 + q_2^2 + q_3^2$$

It is straightforward to show that the inverse of a quaternion can be expressed as

$$q^{-1} = \frac{q^*}{\|q\|^2}$$

The sum of two quaternions $q$ and $p$ is also a quaternion and is obtained by summing the respective elements, while the product is defined as

$$q \otimes p = \begin{bmatrix} p_0 & -p_1 & -p_2 & -p_3 \\ p_1 & p_0 & p_3 & -p_2 \\ p_2 & -p_3 & p_0 & p_1 \\ p_3 & p_2 & -p_1 & p_0 \end{bmatrix} \begin{bmatrix} q_0 \\ q_1 \\ q_2 \\ q_3 \end{bmatrix} = r = r_0 + r_1 \hat i + r_2 \hat j + r_3 \hat k$$

and $[1, \bar 0]$ turns out to be the neutral element of the product.

A unit quaternion is a quaternion with unit magnitude, i.e.

$$\|q\|^2 = 1 \qquad (2.19)$$

The importance of unit quaternions in rotation representations derives from the existence of a relationship between them and the rotation matrices of SO(3) with no singularities, but at the price of employing four parameters instead of the three of the above mentioned minimal representations.

The unit quaternion can be written in the form

$$q = \left[\cos\!\left(\tfrac{\theta}{2}\right),\ \hat u \sin\!\left(\tfrac{\theta}{2}\right)\right] \qquad (2.20)$$

where $\cos(\theta/2)$ is the scalar part and $\hat u \sin(\theta/2)$ the vector part; $\theta$ is the angle and $\hat u$ the axis of rotation of the corresponding axis-angle representation. So, given the axis-angle vector

$$\theta \hat u = \bar v = v_0 \hat i + v_1 \hat j + v_2 \hat k$$

$\theta$ and $\hat u$ can be computed as follows:

$$\theta = \|\bar v\|, \qquad \hat u = \frac{\bar v}{\theta} \qquad (2.21)$$

and the quaternion can be built according to (2.20).

Conversely, if we want to obtain the axis-angle representation from the quaternion $q = [q_0, \bar q]$, we compute $\theta$ and $\hat u$ as

$$\theta = 2\,\mathrm{atan2}(\|\bar q\|, q_0), \qquad \hat u = \begin{cases} \dfrac{\bar q}{\sin(\theta/2)} & \theta \neq 0 \\[1mm] \bar 0 & \theta = 0 \end{cases} \qquad (2.22)$$
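The conversions (2.20)–(2.22) are summarized in the small C++ sketch below, which stores a quaternion as $[q_0, q_1, q_2, q_3]$ with $q_0$ the scalar part, matching the convention used above.

    #include <cmath>
    #include <cstdio>

    struct Quaternion { double q0, q1, q2, q3; };   // q0 = scalar part

    // Build a unit quaternion from an axis-angle pair (eq. 2.20); u must be a unit vector.
    Quaternion fromAxisAngle(double theta, const double u[3]) {
        double s = std::sin(theta / 2.0);
        return {std::cos(theta / 2.0), u[0] * s, u[1] * s, u[2] * s};
    }

    // Recover axis-angle from a unit quaternion (eq. 2.22). For a unit quaternion
    // the norm of the vector part equals sin(theta/2), so dividing by it is
    // equivalent to the division in eq. (2.22).
    void toAxisAngle(const Quaternion& q, double& theta, double u[3]) {
        double vn = std::sqrt(q.q1 * q.q1 + q.q2 * q.q2 + q.q3 * q.q3);
        theta = 2.0 * std::atan2(vn, q.q0);
        if (vn > 1e-12) {
            u[0] = q.q1 / vn; u[1] = q.q2 / vn; u[2] = q.q3 / vn;
        } else {                       // theta == 0: axis undefined, return zero vector
            u[0] = u[1] = u[2] = 0.0;
        }
    }

    int main() {
        double axis[3] = {0.0, 0.0, 1.0};
        Quaternion q = fromAxisAngle(M_PI / 3.0, axis);
        double theta, u[3];
        toAxisAngle(q, theta, u);
        std::printf("theta = %.4f rad, axis = (%.1f, %.1f, %.1f)\n", theta, u[0], u[1], u[2]);
        return 0;
    }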

2.5 State Estimation

When the robot needs to be aware of the surrounding environment, it has to collect and combine information to determine the state that characterizes the environment. If the dynamic model of the state is known, it could be retrieved from the integration of the dynamic equations. In the same way, information about the state could be obtained from measurements furnished by available sensors. Nevertheless, neither the system’s dynamics nor the measurements can exactly determine the state concerned. The reason for this is, firstly, that the dynamics of a system is never known exactly. Indeed, there are always unmodeled phenomena regarding the state dynamics, that are translated into system uncertainty. Furthermore, the integration of the system’s dynamic equations requires the knowledge of the initial state and, therefore, a little error in the initial state can become very relevant during the successive integrations. On the other hand, measurements are very noisy
and, moreover, they do not always provide direct information about the state, but are indirectly connected to it.

Both sources of information can be optimally combined to obtain an estimate of the state through a recursive filtering process.

Let $x_k$ be the state vector at time step $k$, $u_k$ the input vector to the system, and $y_k$ the output vector of the system with associated measurements $z_k$. Based on the assumption that state and measurement variables are generally determined stochastically, i.e. they are governed by probabilistic laws, recursive filters aim to compute the state in a recursive way, given a dynamic model and a series of measurements and inputs to the system. Formally, the aim is to find

$$p(x_k \mid x_{0:k-1}, u_{1:k}, z_{1:k-1}) \qquad (2.23)$$

which, if the state is a Markov process, i.e. it is independent of its past history so that $x_{k-1}$ and $u_k$ are a sufficient statistical description at time $k$, can be reformulated as

$$p(x_k \mid x_{k-1}, u_k) \qquad (2.24)$$

called state transition probability.

2.5.1 Kalman Filter

Consider a general linear, discrete-time dynamic model of our system, where the state transition probability $p(x_k \mid x_{k-1}, u_k)$ and the measurement probability $p(y_k \mid x_k)$ can be described by the following equations:

$$\begin{cases} x_k = A_k x_{k-1} + B_k u_k + \mu_k \\ y_k = H_k x_k + \nu_k \end{cases} \qquad (2.25)$$

where $A_k$, $B_k$, $H_k$ are the system dynamic matrices and $\mu_k$ and $\nu_k$ are random variables representing, respectively, the process uncertainty due to unmodelled and/or uncertain dynamics and the measurement noise due to imprecise sensors. The Kalman filter turns out to be the optimal state estimator when $\mu$ and $\nu$ meet the following conditions:

• Zero-mean white Gaussian distribution:

$$E[\mu] = 0, \quad E[\nu] = 0, \quad E[\mu_i \mu_j^T] = \mathrm{Cov}(\mu_i, \mu_j) = Q\,\delta_{i,j}, \quad E[\nu_i \nu_j^T] = \mathrm{Cov}(\nu_i, \nu_j) = R\,\delta_{i,j} \qquad (2.26)$$

• Uncorrelation with each other and with the state:

$$E[\mu_i \nu_j^T] = E[\nu_i \mu_j^T] = \mathrm{Cov}(\mu_i, \nu_j) = 0, \quad E[\nu_i x_j^T] = E[\mu_i x_j^T] = \mathrm{Cov}(\nu_i, x_j) = 0 \qquad (2.27)$$

Under these conditions, if we denote by $\hat x$ the state estimate and by $e = x - \hat x$ the estimation error, the Kalman filter is the Minimum Squared Error estimator that computes

$$\operatorname*{argmin}_{\hat x}\ E\big[\|x - \hat x\|^2\big] \qquad (2.28)$$

and thus minimizes the estimation error covariance

$$\operatorname*{argmin}_{\hat x}\ E\big[(x - \hat x)(x - \hat x)^T\big] = \operatorname*{argmin}_{\hat x}\ \{P\} \qquad (2.29)$$

When the variable of interest has a Gaussian distribution, the estimate is also optimal in the Maximum Likelihood sense, i.e.

$$\operatorname*{argmax}_{\hat x}\ \{p(x \mid y)\} \qquad (2.30)$$

For simplicity, we indicate with x, y, z, µ, ν, A, B, C, Q, R, P the same variables previously indicated in bold font. At each iteration step the filtering process consists of two successive operations :


(1) State and Covariance Prediction:

The state is predicted according to the dynamic model, and the prediction of the estimation error covariance matrix follows from the state prediction expression:

$$\hat x_k^- = A_k \hat x_{k-1} + B_k u_k$$
$$\hat P_k^- = E\big[(x_k - \hat x_k^-)(x_k - \hat x_k^-)^T\big] = A_k E\big[(x_{k-1} - \hat x_{k-1})(x_{k-1} - \hat x_{k-1})^T\big] A_k^T + E\big[\mu_k \mu_k^T\big] = A_k \hat P_{k-1} A_k^T + Q_k \qquad (2.31)$$

(2) Measurement Correction:

The state prediction is corrected with the available measurements, by optimally weighting the respective contributions:

$$\hat x_k = G_k \hat x_k^- + K_k z_k \qquad (2.32)$$

where $G_k$ and $K_k$ are the respective weights. The relationship between these weights can be recovered by imposing that both $\hat x_k^-$ and $\hat x_k$ are unbiased estimates of the state and exploiting the hypotheses on the noise:

$$\begin{aligned} E[\hat x_k - x_k] &= E[G_k \hat x_k^- + K_k z_k - x_k] \\ &= E[G_k \hat x_k^- + K_k(H_k x_k + \nu_k) - x_k] \\ &= E[G_k \hat x_k^- + K_k(H_k x_k + \nu_k) - x_k + G_k x_k - G_k x_k] \\ &= E[G_k(\hat x_k^- - x_k) + (K_k H_k + G_k - I)x_k + K_k \nu_k] \\ &= G_k E[\hat x_k^- - x_k] + (K_k H_k + G_k - I)E[x_k] + K_k E[\nu_k] \\ &= (K_k H_k + G_k - I)E[x_k] = 0 \end{aligned} \qquad (2.33)$$

$$\Longrightarrow\quad G_k = I - K_k H_k$$


The expression for the prediction correction becomes:

$$\hat x_k = (I - K_k H_k)\hat x_k^- + K_k z_k = \hat x_k^- + K_k(z_k - H_k \hat x_k^-) = \hat x_k^- + K_k r_k \qquad (2.34)$$

where $r_k$ is the residual or innovation term.

After the state correction, the covariance matrix $\hat P_k^-$ can also be updated:

$$\begin{aligned} \hat P_k &= E[e_k e_k^T] = E\big[(x_k - \hat x_k)(x_k - \hat x_k)^T\big] = E\big[(e_k^- - K_k r_k)(e_k^- - K_k r_k)^T\big] \\ &= E\big[e_k^-(e_k^-)^T\big] + E\big[K_k r_k (K_k r_k)^T\big] - E\big[e_k^-(K_k r_k)^T\big] - E\big[K_k r_k (e_k^-)^T\big] \\ &= \hat P_k^- + K_k E\big[(H_k e_k^- + \nu_k)(H_k e_k^- + \nu_k)^T\big] K_k^T - E\big[e_k^-(H_k e_k^- + \nu_k)^T\big] K_k^T - K_k E\big[(H_k e_k^- + \nu_k)(e_k^-)^T\big] \end{aligned} \qquad (2.35)$$

It can be shown that the previous expression is minimized by the following optimal value of the Kalman gain $K_k$:

$$K_k = \hat P_k^- H_k^T \big(R_k + H_k \hat P_k^- H_k^T\big)^{-1} \qquad (2.36)$$

and the covariance update becomes:

$$\hat P_k = (I - K_k H_k)\, \hat P_k^- \qquad (2.37)$$

We can observe that the minimum squared error estimate provided by the Kalman filter also minimizes every scalar function of $\hat P$, such as $\mathrm{trace}(\hat P)$. Since $\mathrm{trace}(\hat P)$ is the sum of the matrix eigenvalues, this means that the Kalman filter minimizes the individual variances associated with the state components.


Fig. 2.13: Kalman Filter Iterations.
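The predict/correct iteration of equations (2.31), (2.34), (2.36) and (2.37) translates almost directly into code; the Eigen-based sketch below is a generic illustration with a toy one-dimensional model, not the estimator actually used in chapter 5.

    #include <Eigen/Dense>
    #include <iostream>

    using Eigen::MatrixXd;
    using Eigen::VectorXd;

    struct KalmanFilter {
        MatrixXd A, B, H, Q, R;   // model matrices of eq. (2.25)
        VectorXd x;               // state estimate
        MatrixXd P;               // estimation error covariance

        // Prediction step, eq. (2.31).
        void predict(const VectorXd& u) {
            x = A * x + B * u;
            P = A * P * A.transpose() + Q;
        }

        // Correction step, eqs. (2.34), (2.36), (2.37).
        void correct(const VectorXd& z) {
            MatrixXd S = H * P * H.transpose() + R;        // innovation covariance
            MatrixXd K = P * H.transpose() * S.inverse();  // Kalman gain (2.36)
            x = x + K * (z - H * x);                       // state update (2.34)
            int n = x.size();
            P = (MatrixXd::Identity(n, n) - K * H) * P;    // covariance update (2.37)
        }
    };

    int main() {
        // Toy 1D constant-position model with a noisy position measurement.
        KalmanFilter kf;
        kf.A = MatrixXd::Identity(1, 1);
        kf.B = MatrixXd::Zero(1, 1);
        kf.H = MatrixXd::Identity(1, 1);
        kf.Q = 0.01 * MatrixXd::Identity(1, 1);
        kf.R = 0.10 * MatrixXd::Identity(1, 1);
        kf.x = VectorXd::Zero(1);
        kf.P = MatrixXd::Identity(1, 1);

        VectorXd u = VectorXd::Zero(1), z(1);
        z << 0.8;
        kf.predict(u);
        kf.correct(z);
        std::cout << "x = " << kf.x.transpose() << std::endl;
        return 0;
    }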

2.5.2 Unscented Kalman Filter

The Kalman equations illustrated above are valid when the system under examination is described by linear equations. When the system dynamics is not linear, the previously described procedure no longer works.

Consider a system where the state transition and observation models are described by the following non-linear equations :

$$\begin{cases} x_k = f(x_{k-1}, u_k) + \mu_k \\ y_k = h(x_k) + \nu_k \end{cases} \qquad (2.38)$$

There exist several variants of the Kalman filter that approach the problem of state estimation with non-linear dynamics. The most widespread one is the Extended Kalman Filter which, briefly, is based on a first-order Taylor expansion of the dynamic system equations around the current state and covariance estimate. In particular, the state is propagated in the prediction step through the non-linear equations and then, in the update step, a linearization is carried out in order to obtain the state and observation matrices by computing the Jacobians of the non-linear functions:

$$A_k = \left.\frac{\partial f}{\partial x}\right|_{x = \hat x^-,\, u_k}, \qquad C_k = \left.\frac{\partial h}{\partial x}\right|_{x = \hat x^-} \qquad (2.39)$$

However, first order approximation can introduce large errors in the true posterior mean and covariance of the transformed variable, which may lead to sub-optimal performance and sometimes divergence of the filter, especially when strong non-linearities characterize the system and first order accuracy is insufficient. For this reason, another variant was studied, which addresses this problem by using a deterministic sampling approach and is known as the Unscented Kalman Filter.

In the UKF the state distribution is represented by a minimal set of carefully extracted sample points. These sample or sigma points are chosen in a way that they completely capture the true mean and covariance up to the third order Taylor expansion for any non-linearity.

This time we can consider a generic non-linear system without assuming additivity of the noise sources :

$$\begin{cases} x_k = f(x_{k-1}, u_k, \mu_k) \\ y_k = h(x_k, \nu_k) \end{cases} \qquad (2.40)$$

As the name says, this type of filter is based on the unscented transformation, a method for calculating the statistics of a general random variable (thus not necessarily Gaussian) that undergoes a non-linear transformation. For this purpose, three vectors $\chi$ of $2N+1$ sigma points are built, one for each of the variables $x$, $\mu$ and $\nu$, with $N$ the dimension of the state $x$, by suitably sampling the distribution around the mean value. Given

$$E[x] = \hat x, \qquad E[\mu] = 0, \qquad E[\nu] = 0$$

and the associated covariance matrices $P$, $Q$ and $R$, the sigma vectors are built in the following way:

$$\chi^x_k = \Big[\,\hat x_k,\ \ \hat x_k \pm \big(\sqrt{(N+\lambda)\hat P_k}\big)_i\,\Big], \qquad i = 1 \ldots N$$
$$\chi^\mu_k = \Big[\,0,\ \ \pm\big(\sqrt{(N+\lambda)Q_k}\big)_i\,\Big], \qquad i = 1 \ldots N$$
$$\chi^\nu_k = \Big[\,0,\ \ \pm\big(\sqrt{(N+\lambda)R_k}\big)_i\,\Big], \qquad i = 1 \ldots N$$

where $S_i = \big(\sqrt{(N+\lambda)M}\big)_i$ is the $i$-th row of the Cholesky factorization of the matrix $M$, such that $S^T S = M$, and $\lambda = \alpha^2(N+\gamma) - N$ is a scaling parameter. $\alpha$ determines the spread of the sigma points around the mean value and $\gamma$ is a secondary scaling parameter, usually set to a small value.

Next, the prediction step is carried out as follows:

$$\begin{aligned} \chi^{x-}_k &= F\big(\chi^x_{k-1}, \chi^\mu_k\big) \\ \hat x_k^- &= \sum_{i=0}^{2N} W_i^{m}\, \chi^{x-}_{i,k} \\ \hat P_k^- &= \sum_{i=0}^{2N} W_i^{c}\, \big(\chi^{x-}_{i,k} - \hat x_k^-\big)\big(\chi^{x-}_{i,k} - \hat x_k^-\big)^T \\ \mathcal{Y}^-_k &= H\big(\chi^{x-}_k, \chi^\nu_k\big) \\ \hat y_k^- &= \sum_{i=0}^{2N} W_i^{m}\, \mathcal{Y}^-_{i,k} \end{aligned} \qquad (2.41)$$

The previous equations show how the state and measurement predictions originate from the non-linear propagation functions applied to the entire sigma vectors, including the noise contribution, followed by the weighted mean of the resulting transformed vector components. The associated weights are chosen as illustrated below:

$$W_0^{(m)} = \frac{\lambda}{N+\lambda}, \qquad W_0^{(c)} = \frac{\lambda}{N+\lambda} + (1 - \alpha^2 + \beta), \qquad W_i^{(m)} = W_i^{(c)} = \frac{1}{2(N+\lambda)}, \quad i = 1 \ldots 2N$$

where $\beta$ is chosen according to the distribution of $x$ and is set to $\beta = 2$ for Gaussian distributions.

In the update step, the corrected state and covariance are computed as:

$$\begin{aligned} P_{y_k, y_k} &= \sum_{i=0}^{2N} W_i^{c}\, \big(\mathcal{Y}^-_{i,k} - \hat y_k^-\big)\big(\mathcal{Y}^-_{i,k} - \hat y_k^-\big)^T \\ P_{x_k, y_k} &= \sum_{i=0}^{2N} W_i^{c}\, \big(\chi^{x-}_{i,k} - \hat x_k^-\big)\big(\mathcal{Y}^-_{i,k} - \hat y_k^-\big)^T \\ K_k &= P_{x_k, y_k} P_{y_k, y_k}^{-1} \\ \hat x_k &= \hat x_k^- + K_k\big(y_k - \hat y_k^-\big) \\ \hat P_k &= \hat P_k^- - K_k P_{y_k, y_k} K_k^T \end{aligned}$$
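The core of the UKF is the unscented transform itself. The sketch below implements it in a simplified additive-noise form (sigma points are drawn for the state only, rather than for the augmented vector of eq. (2.40)); the non-linear test function in main() is arbitrary.

    #include <Eigen/Dense>
    #include <cmath>
    #include <functional>
    #include <iostream>
    #include <vector>

    using Eigen::MatrixXd;
    using Eigen::VectorXd;

    // Unscented transform of a Gaussian (mean, P) through a non-linear function f.
    void unscentedTransform(const VectorXd& mean, const MatrixXd& P,
                            const std::function<VectorXd(const VectorXd&)>& f,
                            VectorXd& yMean, MatrixXd& yCov,
                            double alpha = 1e-3, double beta = 2.0, double gamma = 0.0) {
        const int N = mean.size();
        const double lambda = alpha * alpha * (N + gamma) - N;

        // Square root of (N + lambda) * P via Cholesky factorization.
        MatrixXd S = ((N + lambda) * P).llt().matrixL();

        std::vector<VectorXd> chi(2 * N + 1);
        chi[0] = mean;
        for (int i = 0; i < N; ++i) {
            chi[1 + i]     = mean + S.col(i);
            chi[1 + N + i] = mean - S.col(i);
        }

        // Weights as defined in the text.
        std::vector<double> Wm(2 * N + 1), Wc(2 * N + 1);
        Wm[0] = lambda / (N + lambda);
        Wc[0] = Wm[0] + (1.0 - alpha * alpha + beta);
        for (int i = 1; i <= 2 * N; ++i) Wm[i] = Wc[i] = 0.5 / (N + lambda);

        // Propagate the sigma points and recombine mean and covariance (eq. 2.41).
        std::vector<VectorXd> y(2 * N + 1);
        for (int i = 0; i <= 2 * N; ++i) y[i] = f(chi[i]);
        yMean = VectorXd::Zero(y[0].size());
        for (int i = 0; i <= 2 * N; ++i) yMean += Wm[i] * y[i];
        yCov = MatrixXd::Zero(y[0].size(), y[0].size());
        for (int i = 0; i <= 2 * N; ++i)
            yCov += Wc[i] * (y[i] - yMean) * (y[i] - yMean).transpose();
    }

    int main() {
        VectorXd x(2);  x << 1.0, 0.5;
        MatrixXd P = 0.1 * MatrixXd::Identity(2, 2);
        auto f = [](const VectorXd& v) {        // arbitrary non-linear test function
            VectorXd out(2);
            out << std::sin(v(0)), v(0) * v(1);
            return out;
        };
        VectorXd ym; MatrixXd yc;
        unscentedTransform(x, P, f, ym, yc);
        std::cout << "mean:\n" << ym << "\ncov:\n" << yc << std::endl;
        return 0;
    }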


Chapter 3

Hardware

The system is implemented on the Baxter research robot from Rethink Robotics, equipped with two 7-DOF arms, three VGA cameras and two range IR sensors. Furthermore, an external Asus XTION RGB-D camera is used to provide depth information.

3.1 Baxter Robot

The Baxter research robot from Rethink Robotics is a semi-humanoid robot conceived for human-robot interaction and collaboration in industrial and manufacturing scenarios. Thanks to the possibility of direct programming access to the system via a ROS interface, the robot can execute a wide variety of tasks.

Fig. 3.1: Baxter Robot.

Baxter's arms employ elastic actuators at each joint, incorporating full position and force sensing. As previously mentioned, the robot is equipped with three integrated cameras, one mounted on the head and one on each end-effector, two range-finding sensors, also mounted on the end-effectors, along with a head sonar and accelerometers. On each end-effector a gripper is mounted, which can reach different levels of opening and closing; the gripper can thus perform mainly picking, placing and assembling operations.

While the elastic actuation makes the system safer, given that the robot is conceived to coexist with humans, tracking and positioning tasks may turn out to be imprecise because of the elastic backlash.

Fig. 3.2: Baxter Robot in manufacturing operations.

3.1.1 Arms

Baxter's arms are made of joints with elastic actuators.

From the base, along the arm to the end-effector, Baxter's joints are:

• S0 First shoulder joint: roll movement.

• S1 Second shoulder joint: pitch movement.

• E0 First elbow joint: roll movement.

• E1 Second elbow joint: pitch movement.

• W0 First wrist joint: roll movement.

• W1 Second wrist joint: pitch movement.

• W2 Third wrist joint (gripper base): roll movement.

Fig. 3.3: Baxter Arm and Joints.

The joints can be commanded through position, velocity and torque control, or by specifying only the desired Cartesian endpoint position.

In the following figures and tables we can see some technical specifications of the robot :


Joint    Min (deg)    Max (deg)    Range (deg)
S0       -141         +51          192
S1       -123         +60          183
E0       -173.5       +173.5       347
E1       -3           +150         153
W0       -175.25      +175.25      350.5
W1       -90          +120         210
W2       -175.25      +175.25      350.5

Fig. 3.4: Baxter Joint Ranges.

3.1.2 RGB Cameras

Baxter has three RGB (Red-Green-Blue) cameras integrated, one mounted on the head and two on the end-effectors.

Fig. 3.5: Baxter Head and Hand Camera.

Each camera offers a maximum resolution of 1280 x 800 and a minimum resolution of 320 x 200 pixels, with intermediate ones, and a frame-rate of 30 frames per second. Due to the bandwidth limitations, only two cameras at a time can be used, therefore the two hand-mounted cameras were chosen for the work.

3.1.3 IR sensors

Baxter's hand-mounted IR range sensors are able to detect the presence of nearby objects without any physical contact. Each sensor is made of a light-emitting diode that emits a beam of infra-red rays and a light sensor, sensitive to changes in the light returning from the object it has hit. Their nominal range goes from 0.004 m to 0.4 m.

Fig. 3.6: Baxter IR Range Sensor.

Baxter's IR sensors are used together with the other vision systems to measure the distance of the object from the cameras, which is then used for the pose estimation.

3.2 RGBD Camera

The Baxter robot comes only with RGB cameras, therefore a decision had to be made about where to install an RGB-D camera. For this purpose an Asus XTION PRO was employed. A tempting solution was to place the sensor on the head, but the head is too far from the table where the objects have to be picked and resolution worsens with distance, so we tried to place it on the end-effector. Nevertheless, the placement on the end-effector turned out to be awkward because of the hanging wire and unstable because of the arm's motion; moreover, since an extrinsic calibration of the camera is required to insert it into the kinematic chain formed by the robot's arm and RGB camera frames, even small shifts of the sensor force the calibration procedure to be repeated. The camera was finally placed on the robot's torso, a more stable position that, furthermore, provides a different field of view of the scene.

An RGB-D camera is an attractive alternative to expensive laser scanners. It simultaneously provides two 2D images: a traditional RGB colour image and a depth image that encodes the distance of each point of the RGB image from the camera. The camera is made of an infra-red light emitter and an IR camera that receives the returning signal.


Fig. 3.7: Asus XTION PRO range camera.

Fig. 3.8: Colour and Associated Depth Image.

Today's range cameras are based on two different operating technologies:

• Structured light: the light source emits a particular light pattern, which undergoes a deformation when hitting the surrounding obstacles. Based on how this pattern is deformed, a depth is assigned to each pixel of the image.

Range cameras based on structured light are very sensitive to the lighting condition and do not work in outdoor environments, due to the presence of infra-red components in the sunlight. Moreover, depth-maps often present black holes in the image, meaning that at that point the depth has not been reconstructed.

• Time of flight : The light source emits a ray of light that is reflected by the surrounding objects and returns back to the camera. The depth of each pixel is, in this case, assigned according to the time the ray requires for coming back to the source. These kinds of cameras are more accurate than the previous ones and depth-maps are more uniform.


Fig. 3.9: Microsoft Kinect1 based on Structured Light and Kinect2 based on Time of Flight.

If, furthermore, the depth-image is combined with the RGB one, a collection of 3D coloured points can be obtained, which is known as point cloud.

Fig. 3.10: Point Cloud.

The Asus XTION PRO camera is based on the first technology. In particular, the light speckle pattern captured by the IR camera is compared with a set of previously captured images at various depths, stored in the sensor's memory. When a speckle hits an object whose distance from the sensor differs from that of the reference plane, the position of the speckle in the IR camera image is shifted in the direction of the baseline between the emitter and the perspective center of the IR camera. All these shifts give rise to a disparity image.

To each pixel a distance from the sensor can then be associated, according to the disparity map.


Fig. 3.11: Infra-red image of the pattern of speckles projected on a sample scene and the resulting depth image.

Mathematical Model

The figure below illustrates the relationship between the distance of an object point k to the sensor relative to a reference plane and the measured disparity d on the image plane :

Fig. 3.12: Relationship between relative depth and measured disparity.

Consider a reference system attached to the perspective center of the IR camera, with the X-axis along the baseline $b$ between the IR camera center and the light emitter, and the Z-axis orthogonal to the image plane, pointing in the object's direction.

With respect to an object lying on the reference plane at a distance $Z_o$ from the sensor, a shift of the object along the Z direction causes the speckle to be displaced on the image plane in the X direction. This displacement is defined as the disparity $d$. Resorting to the similarity of triangles, we can state:

$$\begin{cases} \dfrac{D}{b} = \dfrac{Z_o - Z_k}{Z_o} \\[2mm] \dfrac{d}{f} = \dfrac{D}{Z_k} \end{cases} \qquad (3.1)$$

where $Z_k$ denotes the distance of the point $k$ from the camera, $b$ the base length, $f$ the focal length of the IR camera, $D$ the displacement of $k$ along X, and $d$ the observed disparity in image space. Substituting $D$ from the second equation into the first one, we obtain the mathematical model for the derivation of the depth $Z_k$ from the observed disparity:

$$Z_k = \frac{Z_o}{1 + \dfrac{Z_o}{f b}\, d} \qquad (3.2)$$
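A small numeric sketch of eq. (3.2) is given below; the reference-plane distance, baseline and IR focal length are illustrative placeholders rather than the calibrated values of the XTION sensor.

    #include <cstdio>

    // Depth from measured disparity according to eq. (3.2):
    //   Z_k = Z_o / (1 + (Z_o / (f * b)) * d)
    // Z_o: reference-plane distance, b: baseline, f: IR focal length, d: disparity.
    double depthFromDisparity(double Zo, double f, double b, double d) {
        return Zo / (1.0 + (Zo / (f * b)) * d);
    }

    int main() {
        const double Zo = 2.0;      // reference-plane distance [m] (placeholder)
        const double f  = 580.0;    // IR focal length [pixels]     (placeholder)
        const double b  = 0.075;    // emitter-camera baseline [m]  (placeholder)

        for (double d = -20.0; d <= 20.0; d += 10.0)   // disparity in pixels
            std::printf("d = %+5.1f px  ->  Z = %.3f m\n",
                        d, depthFromDisparity(Zo, f, b, d));
        return 0;
    }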

The $X_k$ and $Y_k$ object coordinates of each point can then be computed in accordance with the perspective projection model:

$$\begin{cases} X_k = -\dfrac{Z_k}{f}\,(x_k - c_x + \delta_x) \\[2mm] Y_k = -\dfrac{Z_k}{f}\,(y_k - c_y + \delta_y) \end{cases} \qquad (3.3)$$

where $x_k$ and $y_k$ are the image coordinates of the point, $c_x$ and $c_y$ are the coordinates of the optical center, and $\delta_x$ and $\delta_y$ compensate for lens distortion according to a particular distortion model.

Depth Resolution

The depth resolution refers to the minimum depth difference that can be measured and is determined by the number of bits per pixel used to measure the disparity.

The resolution of the IR camera, or more precisely the pixel size of the disparity image, determines the point spacing of the depth data on the image plane. Since each depth image is made of a constant number of pixels, the point density decreases with increasing distance of the object from the camera: while the number of points remains constant, the covered area is proportional to the squared distance from the sensor.

From the previous equations we can see that depth is inversely proportional to disparity, so the depth resolution is also inversely related to the levels of disparity. In particular, the following relationship holds:

$$\Delta Z = \left(\frac{m}{f b}\right) Z^2 \qquad (3.4)$$

where $m$ is a normalization parameter.

Figure (3.13) shows how the resolution changes with the distance from the sensor.


Chapter 4

Software

The Baxter robot offers an interface with the user via the ROS robot operating system, running on Linux Ubuntu 14.04. For this reason, most of the software was written in C++ on this platform and, partly, also in MATLAB.

4.1 ROS - Robot Operating System

ROS is an open-source meta-operating system that allows devices to be interfaced, guarantees hardware abstraction and low-level control, and offers many modules providing useful tools for robotics. Networked applications can be implemented, in which each process communicates with the others through specific communication channels.

The fundamental elements that explain how a ROS network works are illustrated below:

• Node :

Given a network of threads, a node is the single executable. Each node executes a specific task and, in case, communicates a message.

• Master :

The master is the node that coordinates all the others. It manages the exchange of messages across the network and stores all the common parameters shared between the nodes.

• Message :

A ROS message is a packet of communication data sent over the network. There exist several predefined messages available to use, but there is also the possibility to define a personal one as a data structure. Messages can cross the network through asynchronous and synchronous communication channels.

• Topic :

A topic is the asynchronous communication channel. Each node can send a message on a topic, namely publish on that topic, and/or receive messages from a topic, namely subscribe to it. The process of publishing and subscribing is completely asynchronous, which means that each node can continuously publish and/or receive messages. When a message on a topic is received, a callback is run, inside which some operations can be executed; otherwise the program can continue executing or block until a message is received.

• Service :

A service is the synchronous communication modality in ROS.

As the name says, through this form of communication a node provides a service as a response to another node that sends a specific request at a given moment. The service is, therefore, called only when required.

In our setup the roscore runs inside Baxter, from which messages are received containing data from the cameras and the relative transformations between the various reference frames attached to the robot's arms.

After processing these data and executing the object detection and pose estimation, the joint position commands, obtained from the inverse kinematics resolution, are published on a topic to which the robot subscribes in order to move the arm.
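The publish/subscribe pattern just described corresponds, in roscpp, to a node of the following form; the topic names and the message type used for the joint command are illustrative placeholders, since the actual Baxter interface uses its own message definitions.

    #include <ros/ros.h>
    #include <sensor_msgs/Image.h>
    #include <std_msgs/Float64MultiArray.h>

    ros::Publisher g_cmd_pub;

    // Callback run every time an image is published on the subscribed topic.
    void imageCallback(const sensor_msgs::ImageConstPtr& msg) {
        ROS_INFO("Received %ux%u image", msg->width, msg->height);

        // Placeholder: here the detection / pose estimation would run and the
        // inverse kinematics would produce the joint positions to command.
        std_msgs::Float64MultiArray cmd;
        cmd.data.assign(7, 0.0);        // 7 joint positions (dummy values)
        g_cmd_pub.publish(cmd);
    }

    int main(int argc, char** argv) {
        ros::init(argc, argv, "detection_node");
        ros::NodeHandle nh;

        // Topic names are illustrative placeholders.
        g_cmd_pub = nh.advertise<std_msgs::Float64MultiArray>("/joint_position_command", 1);
        ros::Subscriber sub = nh.subscribe("/camera/rgb/image_raw", 1, imageCallback);

        ros::spin();                    // process callbacks until shutdown
        return 0;
    }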


Fig. 4.1: Software organization

4.2 OpenGL and Synthetic Rendering

The object detection and pose estimation algorithm implemented in this thesis requires a preliminary training stage where the object of interest is captured by the camera from as many points of view as possible, in order to register as many of its features as possible together with the associated pose with respect to the camera. These are later stored in a database to be queried during detection.

To avoid physically moving the camera around each object or vice versa, risking not capturing all points of view and recording inaccurate associated poses, the training stage was carried out virtually, employing the graphics tools provided by the open-source OpenGL library.

After synthetically rendering the object’s CAD model and turning the virtual object around a virtual camera, multi-view colour and depth images of the object were realized and afterwards employed in the training stage.

This procedure has the advantage of being significantly faster and more efficient than doing the same manually, and allows the precise pose of the object with respect to the camera to be recorded.

Given the specific intrinsic parameters, i.e. focal length, resolution, pixel size and optical center, the procedure was applied for each camera and each object of interest, and the rendering process was carried out by covering a wide, but discrete, number of possible poses in space.

Let’s now describe the virtual environment of OpenGL.

In the rendering process, a virtual camera, with a reference system attached to it, is defined. The reference system has, by convention, the x-axis pointing to the right, the y-axis pointing up and the z-axis pointing away from the object. In addition, a view frustum in the form of a truncated pyramid, which defines the viewing volume, is determined.

Fig. 4.2: OpenGL Virtual Camera and View Frustum

As we can see in the figure, the view frustum is characterized by a near and a far clipping plane, respectively the planes of minimum and maximum distance from the camera-frame origin at which objects are seen and projected onto the viewing plane, and by a vertical and a horizontal field-of-view angle. Objects that lie outside this volume cannot be seen by the camera and are thus not projected onto the view plane.

Given the CAD model to be rendered, a pipeline is executed, where the object is first transformed from object coordinates to window coordinates and then rasterized in the 2D space.

Object coordinates are the local coordinates of the objects, namely its initial position and orientation in object-space before any transform is applied.


Fig. 4.3: OpenGL vertex transformation process.

Eye coordinates are, instead, the object coordinates with respect to eye space. First of all, the vertices of the object's mesh are transformed from local to world coordinates through the Model matrix. Next, the object is converted from world space to eye space through the View matrix. In OpenGL there is no separation between these two matrices: they are combined into a single one:

$$\begin{bmatrix} x_e \\ y_e \\ z_e \\ w_e \end{bmatrix} = M_{MV} \begin{bmatrix} x_o \\ y_o \\ z_o \\ w_o \end{bmatrix} \qquad (4.1)$$

After that, the eye coordinates are transformed into clip coordinates, thus projecting the vertex data onto the screen, through the Projection matrix:

$$\begin{bmatrix} x_c \\ y_c \\ z_c \\ w_c \end{bmatrix} = M_{P} \begin{bmatrix} x_e \\ y_e \\ z_e \\ w_e \end{bmatrix} \qquad (4.2)$$

The next step consists of a perspective division that gives the normalized device coordinates, obtained by dividing each coordinate by $w_c$:

$$\begin{bmatrix} x_{ndc} \\ y_{ndc} \\ z_{ndc} \end{bmatrix} = \frac{1}{w_c} \begin{bmatrix} x_c \\ y_c \\ z_c \end{bmatrix} \qquad (4.3)$$

At this point the normalized device coordinates have to be translated and scaled in order to fit into the rendering screen. This is achieved by applying the viewport transformation:

$$\begin{bmatrix} x_w \\ y_w \\ z_w \end{bmatrix} = M_{V} \begin{bmatrix} x_{ndc} \\ y_{ndc} \\ z_{ndc} \end{bmatrix} \qquad (4.4)$$

The window coordinates are finally passed to the rasterization process of the OpenGL pipeline to become screen pixels.

The rasterization process consists of two parts. The first is to determine which squares of an integer grid in window coordinates are occupied by the vertex primitive. The second is assigning a color, a depth value and a texture to each such square. A grid square along with its parameters of assigned color, depth, and texture coordinates is called a fragment. The fragment is eventually further processed in order to produce the synthetic pixels of the rendered image.

In a similar way, depth images of the objects of interest are also rendered and stored for the training stage.

4.3

OpenCV and PCL

Two further open-source libraries were employed for the processing of the colour images and of the point clouds, namely OpenCV and PCL. OpenCV provides many tools and functions for image acquisition and processing, whereas PCL allows point clouds acquired from range cameras to be handled.


Chapter 5

Proposed Approach

5.1

Introduction

We faced the problem stated in 1.1 by subdividing it into three successive phases.

First, the object of interest had to be learned, through a synthetic 3D rendering of its CAD model and a training stage in which the various feature templates and the associated poses were stored in a database.

After this preliminary step, we addressed the problem of the object’s detection in the images provided by the different cameras and the corresponding pose matching.

The pose matches were then employed as measurements in a filtering process in which the object's pose was estimated more accurately.

Finally, we performed trajectory planning for the robot's end-effector so as to approach the object while ensuring that the object remains in the moving camera's field of view.


Fig. 5.1: Overall Framework.

5.2

Reference Frames

Before explaining the algorithms implemented to solve the problem in question, let us define the work-space context and the reference systems that have to be taken into account.

The presence of joint encoders that record the joint angles of the robot's arms allows the forward kinematics of the arms to be reconstructed at each time-step, thus providing the pose of each end-effector frame, i.e. the frame attached to the gripper base, with respect to the torso/base frame.

We also know the pose of each hand-mounted camera and IR sensor with respect to the gripper by construction, and the preliminary extrinsic calibration provides the pose of the RGB-D camera frame within the chain. The object to detect is positioned on the table in front of the robot.

The reference frames are therefore :

• {S0} : Base reference frame

• {S1} : Frame attached to the left-hand moving camera.

• {S2} : Frame attached to the right hand fixed camera.

• {S3} : Frame attached to the external RGB-D camera.

• {S4} : Frame attached to the left-hand IR sensor.

• {Sobj} : Frame attached to the object.

We now have a complete definition of the kinematic chain that comes into play, as shown in figure 5.2.


Fig. 5.2: Reference frames.
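Since all of these frames are published on the ROS tf tree, their relative poses can be queried at run-time. The snippet below is a minimal sketch of such a query with the standard tf2 Python API; the frame names 'base' and 'left_hand_camera' are assumptions that depend on the robot description actually loaded.

```python
import rospy
import tf2_ros

rospy.init_node('frame_listener')
tf_buffer = tf2_ros.Buffer()
tf_listener = tf2_ros.TransformListener(tf_buffer)   # fills the buffer from /tf
rospy.sleep(1.0)                                      # give the tree time to populate

# Pose of the left-hand camera frame {S1} with respect to the base frame {S0}.
# The frame names are assumptions and depend on the robot's URDF.
t = tf_buffer.lookup_transform('base', 'left_hand_camera',
                               rospy.Time(0), rospy.Duration(1.0))
print(t.transform.translation, t.transform.rotation)
```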

5.3

Object Detection Algorithm

For the purpose of detecting the object of interest, the LINEMOD method [Hinterstoisser et al., 2012] was employed.

This method relies on a similarity measure between an a priori known template of the object and the input image where the object has to be detected. In the following we explain the principle of operation of the algorithm.

Colour Gradients of the RGB Image

Given an RGB reference image of the object of interest, the learning process consists in the computation of the image gradients. To make the algorithm robust to illumination and contrast changes, the sign (direction) of the gradients is ignored and only their orientation is considered.

In particular, the gradient orientations are computed separately on each colour channel {R, G, B} and then only the one with the largest magnitude is taken into account, so that we can define:

$$ I_G(\mathbf{x}) = \mathrm{ori}(\hat{C}(\mathbf{x})) \tag{5.1} $$

where

$$ \hat{C}(\mathbf{x}) = \arg\max_{C \in \{R,G,B\}} \left\| \frac{\partial C}{\partial \mathbf{x}} \right\| \tag{5.2} $$

The orientation space is divided into a discrete number n0 of equally spaced orientations, and each computed orientation is therefore quantized into one of the n0.

The following figure illustrates how orientations are quantized. It also shows an RGB image of a toy duck and the difference between the contour representation on a grey-value image and the gradients computed with the LINEMOD method:

Fig. 5.3: Gradient orientation quantizing - Contour representation with classical method (lower left) and with LINEMOD method (lower right).

Keeping only the gradients whose norm exceeds a certain threshold, each location x of the image is assigned the gradient whose quantized orientation occurs most often in a 3 × 3 neighbourhood.
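A simplified sketch of this gradient extraction and quantization step is given below (OpenCV/NumPy). It computes the per-channel Sobel gradients, keeps the channel with the largest magnitude and quantizes the orientation, ignoring its sign, into n0 bins; the dominant-orientation filtering over the 3 × 3 neighbourhood is left out and the magnitude threshold is an assumed value.

```python
import cv2
import numpy as np

def quantized_orientations(bgr, n0=8, mag_thresh=30.0):
    """LINEMOD-style quantized gradient orientations of a colour image.
    Returns an orientation-bin map in {0, ..., n0-1} and a validity mask."""
    bgr = bgr.astype(np.float32)
    best_mag = np.zeros(bgr.shape[:2], np.float32)
    best_ori = np.zeros(bgr.shape[:2], np.float32)
    for c in range(3):                                    # gradient on each colour channel
        gx = cv2.Sobel(bgr[:, :, c], cv2.CV_32F, 1, 0, ksize=3)
        gy = cv2.Sobel(bgr[:, :, c], cv2.CV_32F, 0, 1, ksize=3)
        mag = gx * gx + gy * gy
        ori = np.arctan2(gy, gx)
        stronger = mag > best_mag                         # keep the largest-magnitude channel
        best_mag[stronger] = mag[stronger]
        best_ori[stronger] = ori[stronger]
    ori = np.mod(best_ori, np.pi)                         # ignore the gradient sign
    bins = np.floor(ori / np.pi * n0).astype(int) % n0    # quantize into n0 orientations
    valid = np.sqrt(best_mag) > mag_thresh
    return bins, valid
```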

A template T is thus defined as a pair

$$ \mathcal{T} = (\mathcal{O}, \mathcal{P}) \tag{5.3} $$

where O is the reference image and P is the list of locations x_i at which the gradient orientations are computed.

Surface Normals of the Depth Image

If also a depth image of the object is available, a further feature can be computed and added to the template representation, namely the surface normals.

Fig. 5.4: Colour gradients and surface normals representation.

These are estimated as follows:

Given the range image of the object, the first order Taylor expansion of the depth function D(x) around each pixel location x is considered :

$$ D(\mathbf{x} + d\mathbf{x}) - D(\mathbf{x}) = d\mathbf{x}^T \nabla D + \text{h.o.t.} \tag{5.4} $$

If we define a region around x, each pixel displacement dx gives rise to an equation that constrains the value of the gradient ∇D and allows an optimal estimate ∇̂D to be computed, corresponding to a plane that passes through the three points X1, X2 and X3:

$$ \begin{cases} X_1 = \vec{v}(\mathbf{x})\, D(\mathbf{x}) \\ X_2 = \vec{v}(\mathbf{x} + [1,0]^T)\,\big(D(\mathbf{x}) + [1,0]^T \hat{\nabla} D\big) \\ X_3 = \vec{v}(\mathbf{x} + [0,1]^T)\,\big(D(\mathbf{x}) + [0,1]^T \hat{\nabla} D\big) \end{cases} \tag{5.5} $$


with ~v(x) the vector along the line of sight going through pixel x. The surface normal at the 3D point projected onto x can be estimated as the normalized cross-product of (X2 − X1) and (X3 − X1).

Analogously to the colour gradient orientations, the normal space is divided into a discrete number m0 of surface normals. The m0 quantized surface normals lie on a circular cone whose apex points towards the camera, and each computed surface normal is quantized into one of the m0 according to its angular distance.

Fig. 5.5: Surface normals quantizing - Depth image (lower left) and surface normals representation with LINEMOD.

Each pixel location x is eventually assigned the quantized normal that appears most often in a 3 × 3 neighbourhood.
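The sketch below illustrates the idea with a simplified finite-difference variant: neighbouring pixels are back-projected along their lines of sight and the normal is obtained as a normalized cross-product, as described above. The least-squares estimation of ∇̂D over a region and the quantization into the m0 cones are omitted; the intrinsic parameters are assumed inputs.

```python
import numpy as np

def surface_normals(depth, fx, fy, cx, cy):
    """Per-pixel surface normals from a depth image (in metres), via back-projection
    of neighbouring pixels and a normalized cross product."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    X = (u - cx) / fx * depth                 # back-projection along each line of sight
    Y = (v - cy) / fy * depth
    P = np.dstack([X, Y, depth])
    dx = np.zeros_like(P)
    dy = np.zeros_like(P)
    dx[:, :-1] = P[:, 1:] - P[:, :-1]         # finite difference along image columns
    dy[:-1, :] = P[1:, :] - P[:-1, :]         # finite difference along image rows
    n = np.cross(dx, dy)
    norm = np.linalg.norm(n, axis=2, keepdims=True)
    norm[norm == 0] = 1.0
    n = n / norm
    n[n[..., 2] > 0] *= -1.0                  # orient the normals towards the camera
    return n
```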

Binarized Image

The next step is the generation of a binary representation I of the image. As previously explained, gradient orientations are extracted and quantized into the discrete number n0. Afterwards the gradient orientations are "spread" around their location, i.e. each location r is marked with the orientations present in a neighbourhood of size T centred in r:

$$ R(\mathbf{r}) = \left[ \mathbf{r} - \frac{T}{2},\ \mathbf{r} + \frac{T}{2} \right] $$

This makes the algorithm robust to small translations and deformations.

The spread orientations are then represented by a binary string made of n0 bits, encoding the presence or absence of the n0 discrete orientations. If a bit is set to 1, the corresponding orientation occurs in the neighbourhood of the considered location; otherwise the orientation is not present.

Fig. 5.6: Left: Gradient orientations and their binary code a) Quantized gradient orientations in the input image b) Gradient orientation spreading c) Resulting binary image.

In practice the "spreading" process is carried out by computing a map for each quantized orientation, assigning 1 to the locations where the specific orientation is present and 0 otherwise. These maps are then shifted over the range $\left[-\frac{T}{2}, +\frac{T}{2}\right] \times \left[-\frac{T}{2}, +\frac{T}{2}\right]$ and merged with an OR operation.

In this way a binary representation I of the input image is obtained and the binary strings identifying each image location are used to access the precomputed response maps of the similarity measure.
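A possible sketch of this spreading step is shown below: starting from the quantized-orientation map, each orientation is encoded as one bit, and the resulting bit masks are shifted over the T × T range and OR-merged. It is an illustrative NumPy implementation, not the optimized one of the original method.

```python
import numpy as np

def spread_orientations(bins, valid, n0=8, T=8):
    """Binary representation of the quantized orientations (one bit per orientation),
    spread over a T x T neighbourhood by OR-merging shifted maps."""
    h, w = bins.shape
    base = np.where(valid, 1 << bins, 0).astype(np.uint16)   # n0 <= 16 bits per location
    spread = np.zeros((h, w), np.uint16)
    for dy in range(-T // 2, T // 2 + 1):
        for dx in range(-T // 2, T // 2 + 1):
            shifted = np.zeros_like(base)
            shifted[max(0, dy):h + min(0, dy), max(0, dx):w + min(0, dx)] = \
                base[max(0, -dy):h + min(0, -dy), max(0, -dx):w + min(0, -dx)]
            spread |= shifted                                  # OR-merge of the shifted maps
    return spread
```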

Similarity Measure

For the purpose of finding a matching template in the input image, a similarity measure E(I, T, c) is applied. After processing the input image as explained in the previous sections, for each gradient orientation in the reference image the similarity measure searches, in a neighbourhood of the corresponding location in the input image, for the most similar orientation:

$$ \mathcal{E}(\mathcal{I}, \mathcal{T}, \mathbf{c}) = \sum_{\mathbf{r} \in \mathcal{P}} \left( \max_{\mathbf{t} \in R(\mathbf{c}+\mathbf{r})} \big| \cos\big(\mathrm{ori}(\mathcal{O},\mathbf{r}) - \mathrm{ori}(\mathcal{I},\mathbf{t})\big) \big| \right) \tag{5.6} $$

where the various terms are:

• I the input image,
• T the template,
• O the reference image,
• P the set of gradient locations r in the reference image,
• R(c + r) the neighbourhood of size T centred in t = c + r in the input image, i.e.

$$ R(\mathbf{c}+\mathbf{r}) = \left[ \mathbf{c}+\mathbf{r} - \frac{T}{2},\ \mathbf{c}+\mathbf{r} + \frac{T}{2} \right] \times \left[ \mathbf{c}+\mathbf{r} - \frac{T}{2},\ \mathbf{c}+\mathbf{r} + \frac{T}{2} \right] $$

• ori(O, r) the gradient orientation at location r in the reference image,
• ori(I, t) the gradient orientation at location r shifted by c in the input image.

The cosine is taken in absolute value so that a dark and a bright background have the same effect.

The similarity measure for the surface-normals is, instead, defined by the dot product of the normalized surface normals :

$$ \mathcal{E}_N(\mathcal{I}, \mathcal{T}, \mathbf{c}) = \sum_{\mathbf{r} \in \mathcal{P}} \left( \max_{\mathbf{t} \in R(\mathbf{c}+\mathbf{r})} \left[ \vec{N}(\mathcal{O},\mathbf{r}) \cdot \vec{N}(\mathcal{I},\mathbf{t}) \right] \right) \tag{5.7} $$

If both primitives are considered, the resulting similarity measure is the sum of (5.6) and (5.7).
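For clarity only, the following brute-force sketch evaluates the colour-gradient similarity (5.6) directly for a given template offset c; in practice this is never done online, as explained next, and precomputed response maps are used instead. The data layout, a list of (location, orientation-bin) template features together with a quantized-orientation map of the input image, is an assumption of this sketch.

```python
import numpy as np

def similarity(input_ori, template_feats, c, n0=8, T=8):
    """Brute-force evaluation of Eq. (5.6): input_ori is the quantized-orientation map
    of the input image, template_feats a list of ((row, col), orientation_bin) pairs,
    c the (row, col) offset of the template in the input image."""
    bin_width = np.pi / n0
    h, w = input_ori.shape
    score = 0.0
    for (r, o_ref) in template_feats:
        t0 = (c[0] + r[0], c[1] + r[1])
        best = 0.0
        for dy in range(-T // 2, T // 2 + 1):            # search neighbourhood R(c + r)
            for dx in range(-T // 2, T // 2 + 1):
                y, x = t0[0] + dy, t0[1] + dx
                if 0 <= y < h and 0 <= x < w:
                    diff = (input_ori[y, x] - o_ref) * bin_width
                    best = max(best, abs(np.cos(diff)))  # |cos| of the orientation difference
        score += best
    return score
```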


In practice, the similarity measure is not computed online during the detection phase, as it would require too much time; instead, given the binary representation of the input image, the pre-stored response maps are queried.

Precomputation of Response Maps

For each gradient orientation in the reference image and its corresponding location, the value of the similarity measure (5.6) is precomputed off-line. A lookup table τ_i is built for each of the n0 quantized orientations as follows:

$$ \tau_i[\mathcal{L}] = \max_{l \in \mathcal{L}} \left| \cos(i - l) \right| \tag{5.8} $$

where i is one of the n0 quantized gradient orientations and L is the set of orientations l present in the local neighbourhood of the considered location.

If we consider the reference image with its spread gradient orientations and indicate with o1, o2, ..., o8 the set of quantized orientations, let us assume that we want to precompute, for example, the response map for orientation o5. For each location of the reference template image, the corresponding orientation and those present in its local neighbourhood are used in operation (5.8), and the maximum value associated with the binary string that identifies the specific location is computed. For example, at a location with binary string 10010100 this value will be max{|cos(o5 − o1)|, |cos(o5 − o4)|, |cos(o5 − o6)|}. The lookup table of each of the n0 quantized orientations will then contain a list of maximum values of the similarity measure, one for each binary string encoding a specific gradient location on the reference image.

Given the input image I, the response map at each location can thus be evaluated by indexing the lookup tables with the binary string computed at that location.
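A minimal sketch of this precomputation and of the query by binary string is given below (NumPy); the table layout, one row per quantized orientation and one column per possible n0-bit string, is an assumption of the sketch rather than the exact data structure used.

```python
import numpy as np

def build_lookup_tables(n0=8):
    """Precompute tau_i[L] = max_{l in L} |cos(i - l)| (Eq. 5.8) for every quantized
    orientation i and every possible n0-bit string L of spread orientations."""
    bin_width = np.pi / n0
    tables = np.zeros((n0, 1 << n0), np.float32)
    for i in range(n0):
        for L in range(1, 1 << n0):                       # every non-empty binary string
            present = [l for l in range(n0) if L & (1 << l)]
            tables[i, L] = max(abs(np.cos((i - l) * bin_width)) for l in present)
    return tables

# Querying: given the spread binary image of the input (see the spreading sketch above),
# the response map for orientation i is obtained simply as tables[i][spread_image].
tables = build_lookup_tables()
print(tables.shape)   # (8, 256) for n0 = 8
```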
