Testo completo

Visual-Based System for Object Recognition and Model Calibration to Perform Autonomous Grasping

Supervisor: Prof.ssa Lucia Pallottino
Co-Supervisors: Ing. Danilo Caporale, Ing. Alessandro Settimi, Ing. Gianluca Lentini

Candidate: Alessandro Palleschi

Abstract

The design of systems that, using vision, allow autonomous robotic grasping and object manipulation is one of the major topics in Robotics and Automation. In the presence of unstructured environments and without any prior information about the external environment and about the robot model, the question is how visual sensors can be used to perform autonomous estimation of the necessary information about the environment and the object to manipulate, and how an autonomous calibration of the robot model can be performed. The goal of this thesis is to develop a visual-based system, implemented to perform autonomous object recognition and grasping of known objects. The proposed approach resolves the problem by implementing two different pipelines on top of a Pose-Based Visual Servoing framework. At first, a Perception Pipeline is implemented to perform object recognition and segmentation, exploiting state-of-the-art machine learning algorithms. The desired end-effector pose to actually grasp the object is supposed to be known in advance, based on the knowledge of the target object characteristics. Then, to drive the hand to the desired pose, a Calibration Pipeline has been designed. The proposed approach performs autonomous Camera-Base transformation estimation and kinematic chain calibration by solving a non linear optimization problem. This procedure exploits visual estimation of the end-effector pose to set up the optimization problem. The proposed pipeline is supposed to run before the actual grasping routine. To account for run-time changes of the previously estimated model, an on-line internal model calibration procedure has been designed. In the proposed approach, an Extended Kalman Filter is used to calibrate the estimated model at the beginning of the grasping routine. The framework is then implemented and tested on a reproduction of the AlterEgo humanoid robot arm, using the ROS middleware.

List of Figures

1.1 Left: A regular 3-layer Neural Network. Right: A ConvNet arranges its neurons in three dimensions (width, height, depth), as visualized in one of the layers. Every layer of a ConvNet transforms the 3D input volume to a 3D output volume of neuron activations. In this example, the red input layer holds the image, so its width and height would be the dimensions of the image, and the depth would be 3 (Red, Green, Blue channels). [1]
1.2 A simple ConvNet [2]
1.3 The filter slides over the input and transfers its output to the new layer. [1]
1.4 The ReLU operation.
1.5 Max Pooling. [1]
1.6 R-CNN applied to an image [3].
1.7 After creating a set of region proposals, R-CNN passes the image through a modified version of AlexNet to determine whether or not it is a valid region. [3]
1.8 Right: Classical R-CNN. Left: In Fast R-CNN a full forward pass of the image is created and the conv features for each region of interest are extracted from the resulting forward pass
1.9 Fast R-CNN combined the CNN, classifier, and bounding box regressor into one, single network [4].
1.10 Detection using Faster R-CNN [5]
1.11 In Faster R-CNN, a single CNN is used for region proposals, and classifications. [5]
1.12 The Region Proposal Network slides a window over the features of the CNN. The network outputs a score and a bounding box per anchor (hence 4k box coordinates where k is the number of anchors) [5].
1.13 YOLO models detection as a regression problem [6].
1.14 YOLO Network Architecture [6].
1.15 Detection Using YOLO.
1.16 Mask R-CNN results on the COCO test set [7].
1.17 A Fully Convolutional Network (FCN) is added on top of the CNN features of Faster R-CNN to generate a mask (segmentation output) [7].
1.18 The Mask R-CNN framework for instance segmentation. [7]
2.1 Pinhole Camera Model
2.2 Example of 2D-3D pose estimation problem [8].
4.1 Left: Camera-Base Learning Problem for calibrated kinematics; Right: Camera-Base Learning Problem in presence of calibration errors
4.2 Model Learning in presence of constant calibration errors
5.1 System Architecture - Two different pipelines are implemented to perform autonomous grasping
6.1 Perception Pipeline Architecture
6.2 Comparison between Detectron (left) and YOLOv3 (right)
6.3 Detectron - ROS Integration
6.4 Detectron - Application Results
6.5 ORK - Model Capture Procedure
6.6 ORK - Test on a Robotic Platform
7.1 EGO - A Humanoid Robotic Platform
7.2 ArUco Marker on the Hand Palm
7.3 Examples of ArUco Marker
7.4 Control Action
8.3 Test Phases: (1) The Robot uses the EKF to recalibrate the model; (2) A Marker is used to compute the desired pose; (3) The hand is controlled to the desired pose; (4) The hand grasps the object
8.4 The Robot is commanded to 5 different poses
8.5 Position and Orientation Error Norm With and Without Filter - Comparison
8.6 Filter Action to Correct the model
8.7 The Computed Goal Pose (left) and the final Hand Pose (right)

Contents

I State of the Art
1 Object Recognition
  1.1 Feature-Based Classification
  1.2 Convolutional Neural Networks
    1.2.1 Regional CNN
    1.2.2 Fast R-CNN
    1.2.3 Faster R-CNN
    1.2.4 YOLO
    1.2.5 Mask R-CNN
2 Object Pose Estimation
  2.1 2D-3D Feature-Based Pose Matching
    2.1.1 Analytical Solutions
    2.1.2 Algorithmic Solution
  2.2 RGB-D Pose Estimation
    2.2.1 LINE-MOD

II Visual Control
3 Problem Formulation and Related Works
4 Model Learning
  4.1 Camera-Base Transformation Model
  4.2 Kinematic Calibration Errors
  4.3 Parameter Estimation
5 Online Model Adaptation
    5.0.1 Extended Kalman Filter
    5.0.2 State Dynamic Model
    5.0.3 Measurement Model

III Implementation and Testing
6 Perception Pipeline
  6.1 Object Detection
    6.1.1 Detectron - ROS Integration
  6.2 Pose Estimation
7 Calibration Pipeline
  7.1 Testing Platform
  7.2 Hand Pose Estimation
    7.2.1 Arm Control Strategy
8 Autonomous Grasping Test
  8.1 Results
    8.1.1 Calibration Procedure
    8.1.2 Non Linear Filter Performances
    8.1.3 Grasping Results

Introduction

Humans principally use sight, alongside the other senses, to perceive the surrounding environment.

Using our eyes and our experience we are able to learn how the external world is made, tell the differences between different objects and manipulate them.

Every day we use vision to manipulate different objects and adapt our actions to changes in the world. This activity, which is simple for us and is performed without considerable effort, is challenging and difficult to achieve for a robot. If the robot works in a heavily structured environment and is programmed with all the necessary information about the object positions in the world, it may not be necessary to provide it with the tools to perceive the external environment. However, in normal operation the environment can change or may not be known at all, and prior information about the objects might not be available. While humans show the capability to adapt to new scenarios, the same is not straightforward for a robot. The challenge is to use visual feedback to make the robot understand the environment, program it to adapt to its changes, and correctly manipulate different objects in various conditions. This is a critical topic for robotics, because it is the step needed to make robots useful and effective in real situations; once solved, it would make it possible to use robots as operators in hostile environments in place of humans.

A popular approach, especially in humanoid robotics, is to bypass the problem by exploiting the vision system to remotely control the robot. The robot becomes an "avatar" of a human operator, who remotely controls it and uses the images from the robot's eyes to understand the best way to perform the task. In this scenario the robot is "blind", and the interpretation of the camera images is delegated to the external operator, who uses their expertise to drive the robot. The mind that processes the information is the human, and the robot is just a tool to execute the mission. This approach does not fully exploit the capabilities of the robot as an autonomous system, since the decision and image processing phases are delegated to the human operator. Thus, performance is strictly tied to the abilities of the human operator, and the approach has to deal with problems regarding the direct communication with the robot.

Making the robot capable of operating autonomously in heavily unstructured and sometimes hostile environments means shifting the image processing phase from the human mind to the robot itself, making it able to "perceive".

Focusing on robotic manipulation, the perception system processes the images from the camera to run object recognition and pose estimation algorithms.

This is nowadays a standard problem in robotics, and many approaches are presented in the literature. Part I of this work is dedicated to presenting these approaches.

Chapter 1 presents how object recognition is performed nowadays, discussing state-of-the-art algorithms that use Deep Learning to detect objects in the images streamed by the camera. Chapter 2 is dedicated to the problem of estimating the pose of a rigid object from the camera feed and how it is treated in the literature. Procedures that use depth sensors are also presented.

The perception system has then been designed, combining and adapting existing solutions, so that the robot can give autonomous answers to two fundamental questions:

• Is the object I want to interact with in the current scene?
• Where is the object located with respect to me?

Once the perception system has been designed and implemented, the next problems become:

• Given the object pose, how to find the best way to manipulate it.

• Given the grasp pose, how to drive the robot to the desired pose using vision.

The former is a complex problem, and the design of a grasp planner is out of the scope of this thesis. It will be assumed during the rest of the work that, if we are working with known objects, the desired grasp pose has been computed off-line and the robot is programmed with a look-up table where, given the object type, it can extract the suitable pose.

The latter problem is how to use vision to drive the robot to the correct pose. Even if visual servoing techniques have been studied for the last 30 years [9] and a number of advanced solutions have been proposed during this period [10], visual feedback of the hand is rarely used to perform real-time reaching and grasping tasks [11].

In fact, visual estimation of the end-effector pose is difficult to achieve and computationally expensive; therefore, it is not suitable for real-time applications, where information at high frame rates is required. However, purely open-loop reaching and grasping can hardly be successful, because the robot models are typically not accurate enough.

The problem of visual-based estimation and/or adaptation of the robot model is complex and not easy to solve. Inspired by the method presented in [12], the proposed approach aims to imitate the way humans use vision to correct movements.

Vision is initially used to measure the hand pose and to learn a camera-hand kinematic model, later to be used to perform reaching.

Similarly to the approach proposed in [13], a non linear optimization problem is set up using a visual measurement of the end-effector pose and the pose computed using the forward kinematics. Solving this problem, a camera-hand model is learned and, at the same time, calibration errors on the arm kinematic chain are estimated.

Alongside this procedure, which will be investigated and explained in section 4, an Extended Kalman Filter has been implemented to perform an online model adaptation at the beginning of the grasping routine.

The last part of this work is dedicated to describing how the complete system has been implemented and tested on a real robot, with the realization of autonomous grasping of a test object.


State of the Art


Object Recognition

Object detection is a basic skill required for a robot to perform tasks in human environments. Unlike image classification, detection requires localizing objects within an image.

This task is still a challenge for computer vision systems. Many approaches to the problem have been implemented over multiple decades.

In this chapter feature-based methods will be discussed, with a focus on state-of-the-art algorithms that exploit machine learning techniques.

1.1 Feature-Based Classification

Feature-Based algorithms recognise objects based on specific features. Features are supposed to be characteristic of each object, and one object is often described by multiple features. Possible choices for features are colours, contour lines, geometric forms or edges. Feature-based object recognition strategies follow these steps:

• Every input image is searched for a specific type of feature

• This feature is then compared to a database containing models of the objects.

Features and their descriptors can be either found considering the whole image (global features) or after observing just small parts of it (local features).


Detection pipelines generally start by extracting a set of robust features from input images (SIFT [14], HOG [15]). Then, classifiers or localizers are used to identify objects in the feature space. These classifiers or localizers are run either in a sliding window fashion over the whole image or on some subset of regions in the image.

Current approaches to object recognition make essential use of machine learning methods that are able to outperform classical computer vision algorithms. Ever since Alex Krizhevsky, Geoff Hinton, and Ilya Sutskever won ImageNet in 2012, Convolutional Neural Networks (CNNs) have become the gold standard for image classification [16].

1.2 Convolutional Neural Networks

Convolutional Networks [17], also known as Convolutional Neural Networks or CNNs, are a specialized kind of neural network for processing data that have a known, grid-like topology.

Figure 1.1: Left: A regular 3-layer Neural Network. Right: A ConvNet arranges its neurons in three dimensions (width, height, depth), as visualized in one of the layers. Every layer of a ConvNet transforms the 3D input volume to a 3D output volume of neuron activations. In this example, the red input layer holds the image, so its width and height would be the dimensions of the image, and the depth would be 3 (Red, Green, Blue channels). [1]

CNNs have a different architecture with respect to regular Neural Networks, as shown in image 1.1. Regular Neural Networks transform an input by putting it through a series of hidden layers. Every layer is made up of a set of neurons, where each layer is fully connected to all neurons in the previous layer. Finally, there is a last fully-connected layer, the output layer, that represents the predictions. Convolutional Neural Networks have a different structure:


• The neurons in one layer do not connect to all the neurons in the next layer but only to a small region of it.

• The final output is reduced to a single vector of probability scores, organized along the depth dimension.

A simple ConvNet, as illustrated in figure 1.2, is a sequence of layers, and every layer of a ConvNet transforms one volume of activations to another through a differentiable function. Three main types of layers are used to build ConvNet architectures:

• Convolutional Layer + ReLU
• Pooling Layer

• Fully-Connected Layer

Figure 1.2: A simple ConvNet [2]

A CNN model can be thought of as a combination of two components: a feature extraction part and a classification part. The convolution and pooling layers perform the feature extraction.

Convolutional Layer + ReLU:

CNNs derive their name from the “convolution” operator. The primary purpose of convolution is to extract features from the input image; this task is performed on the input data with the use of a filter or kernel to produce a feature map.

A convolution is performed by sliding the filter over the input. As the filter slides over the width and height of the input volume, it produces a 2-dimensional activation map, the feature map, that gives the responses of that filter at every spatial position. The following hyperparameters characterize the convolution and control the size of the output volume (the standard output-size relation is recalled after the list):


Figure 1.3: The filter slides over the input and transfers its output to the new layer. [1]

• Kernel Size

• Stride: the size of the step the convolution filter moves each time. The stride size is usually 1, meaning the filter slides pixel by pixel.

• Depth of the output: corresponds to the number of filters used, each of them looking for different features in the input.

• Zero-Padding: because the size of the feature map is always smaller than the input, a border of zero-value pixels is added around the input, so that the feature map will not shrink.
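As a reference (this relation is not stated explicitly in the text above, but it is the standard one for convolutional layers), the spatial size of the output volume follows from these hyperparameters, with W the input width, F the kernel size, P the zero-padding and S the stride:

$$W_{out} = \frac{W - F + 2P}{S} + 1$$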

An additional operation called ReLU is performed after each convolution layer, to introduce non-linearity in the ConvNet.

Figure 1.4: The ReLU operation.

ReLU is an element-wise operation, applied per pixel, that replaces all negative pixel values in the feature map by zero, obtaining a rectified feature map.

Pooling Layer:

After a convolution layer, it is common to add a pooling layer. Pooling (also called subsampling or downsampling) reduces the dimensionality of each feature map but retains the most important information. Spatial Pooling can be of different types: Max, Average, Sum etc.

Figure 1.5: Max Pooling. [1]

In case of Max Pooling, a spatial neighbourhood, i.e. a n × n window, is defined and the output is the largest element from the rectified feature map within that window. Instead of taking the largest element, other options are to take the average (Average Pooling) or summing all the elements in that window. Figure 1.5 shows an example of Max Pooling operation on a Rectified Feature map (obtained after convolution + ReLU operation) by using a 2×2 window. The stride is one of the parameters that is used to characterize the layer, together with the window size and the type of pooling.
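To make the building blocks concrete, the following minimal NumPy sketch implements a single-channel valid convolution, the ReLU non-linearity and 2x2 max pooling; it is an illustrative toy example, not the implementation used in the thesis.

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Valid 2D convolution (cross-correlation, as in CNNs) of a single-channel image."""
    kh, kw = kernel.shape
    oh = (image.shape[0] - kh) // stride + 1
    ow = (image.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

def relu(x):
    """Element-wise rectification: negative activations are set to zero."""
    return np.maximum(x, 0.0)

def max_pool(x, size=2, stride=2):
    """Max pooling with a size x size window."""
    oh = (x.shape[0] - size) // stride + 1
    ow = (x.shape[1] - size) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.max(x[i * stride:i * stride + size,
                                 j * stride:j * stride + size])
    return out

# Toy usage: a random 8x8 "image" and a 3x3 vertical-edge filter.
img = np.random.rand(8, 8)
kernel = np.array([[1, 0, -1], [1, 0, -1], [1, 0, -1]], dtype=float)
feature_map = relu(conv2d(img, kernel))   # convolution + ReLU
pooled = max_pool(feature_map)            # 2x2 max pooling
print(feature_map.shape, pooled.shape)    # (6, 6) (3, 3)
```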

Convolution and pooling layers are the basic building blocks of any CNN. Together these layers extract the useful features from the images, introduce non-linearity in the network and reduce feature dimension, while aiming to make the features somewhat equivariant to scale and translation. At the end, the output of the last Pooling Layer acts as an input to the Fully Connected Layer. Since the output from the convolutional and pooling layers represents high-level features of the input image, the purpose of the Fully Connected layer is to use these features for classifying the input image into various classes, based on the training dataset. Like most object recognition algorithms, CNNs need training to adapt the weights of the neurons.

The overall training process of the Convolutional Network may be summarized as below:

Step 1: All filters and parameters / weights are initialized with random values.

Step 2: The network takes a training image as input, goes through the forward propagation step (convolution, ReLU and pooling operations along with forward propagation in the Fully Connected layer) and finds the output probabilities for each class.

Step 3: The total error at the output layer is computed:

$$\text{Total Error} = \sum \frac{1}{2}\left(\text{target probability} - \text{output probability}\right)^2$$

Step 4: Backpropagation is used to calculate the gradients of the error with respect to all weights in the network, and gradient descent is used to update all filter values / weights and parameter values to minimize the output error.

Step 5: Steps 2-4 are repeated for each image of the training datasets.
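A compact sketch of this training loop, written with PyTorch on a toy LeNet-style network and random data, illustrates Steps 1-5; the framework choice and the network sizes are assumptions for illustration only, not the setup used in the thesis.

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Step 1: a small LeNet-style ConvNet; weights are randomly initialized.
model = nn.Sequential(
    nn.Conv2d(3, 6, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),   # conv + ReLU + pooling
    nn.Conv2d(6, 16, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 10),                                     # fully connected output layer
)

criterion = nn.CrossEntropyLoss()             # used in Step 3: error at the output layer
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Random stand-in for a labelled training batch (32 RGB 32x32 images).
images = torch.randn(32, 3, 32, 32)
labels = torch.randint(0, 10, (32,))

for epoch in range(5):                        # Step 5: repeat over the training data
    optimizer.zero_grad()
    outputs = model(images)                   # Step 2: forward propagation
    loss = criterion(outputs, labels)         # Step 3: total error
    loss.backward()                           # Step 4: backpropagation of gradients
    optimizer.step()                          # Step 4: gradient descent update
```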

The discussed architecture is based on LeNet, one of the very first Convolutional Neural Networks.

Some other influential architectures are AlexNet [16], GoogLeNet [18] and ResNets [19]. The original application of CNNs to the object detection problem was the Regional CNN (R-CNN) [3], alongside its improvements, Fast R-CNN [4] and Faster R-CNN [5]. These architectures are able to produce as output bounding boxes and labels for each recognized object.

Other popular architectures are YOLO [6] and SSD [20], which proved to be fast and look promising for real-time applications.

An upgrade of Faster R-CNN was later developed, using CNNs to perform pixel-wise segmentation of the detected objects [7]. An outline of the working principles of these architectures will be presented in the next sections.


1.2.1 Regional CNN

Figure 1.6: R-CNN applied to an image [3].

The goal of R-CNN is to take an image as input and produce as output bounding boxes and labels for each recognized object. R-CNN proposes a set of boxes in the image and checks whether any of them actually corresponds to an object. These bounding boxes, or region proposals, are created using a process called Selective Search. At a high level, Selective Search looks at the image through windows of different sizes, and for each size tries to group together adjacent pixels by texture, color, or intensity to identify objects.

Figure 1.7: After creating a set of region proposals, R-CNN passes the image through a modified version of AlexNet to determine whether or not it is a valid region. [3]

Once the region proposals are created, R-CNN passes each of them through a modified version of AlexNet (1.7).

AlexNet has an architecture very similar to LeNet, but it is deeper, bigger, and features Convolutional Layers stacked on top of each other. On the final layer of the CNN, R-CNN adds a Support Vector Machine (SVM) that simply classifies whether this is an object and, if so, to which class it belongs.

Then, R-CNN runs a simple linear regression on the region proposal to generate tighter bounding box coordinates.

1.2.2 Fast R-CNN

R-CNN has some drawbacks:

• Forward pass of the CNN (AlexNet) is needed for every single region proposal for every single image.

• It has to train separately the CNN to generate image features, the classifier that predicts the class, and the final regression model.

In 2015, Ross Girshick presented in [4] a new method to overcome these issues, Fast R-CNN. Fast R-CNN avoids the repeated forward passes of the CNN using a technique known as RoIPool (Region of Interest Pooling). At its core, RoIPool shares the forward pass of a CNN for an image across its subregions (1.8).

Figure 1.8: Right: Classical R-CNN. Left: In Fast R-CNN a full forward pass of the image is created and the conv features for each region of interest are extracted from the resulting forward pass

The CNN features for each region are obtained by selecting a corresponding region from the CNN's feature map. Then, the features in each region are pooled (usually using max pooling).

The second insight of Fast R-CNN is to jointly train the CNN, classifier, and bounding box regressor in a single model. Fast R-CNN replaces the SVM classifier with a softmax layer on top of the CNN to output a classification and also adds a linear regression layer parallel to the softmax layer to output bounding box coordinates.

Figure 1.9: Fast R-CNN combined the CNN, classifier, and bounding box regressor into one, single network [4].


1.2.3 Faster R-CNN

Figure 1.10: Detection using Faster R-CNN [5]

In the middle of 2015, a team at Microsoft Research composed of Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun developed an architecture capable of speeding up the region proposal step, named Faster R-CNN [5]. A single CNN is used to both carry out region proposals and classification.

Figure 1.11: In Faster R-CNN, a single CNN is used for region proposals, and classifications. [5]

Faster R-CNN adds a Fully Convolutional Network on top of the features of the CNN, creating a Region Proposal Network (1.11).

The Region Proposal Network works by sliding a window over the CNN feature map and outputting k potential bounding boxes and scores at each window position, indicating how good each of those boxes is expected to be.

Figure 1.12: The Region Proposal Network slides a window over the features of the CNN. The network outputs a score and a bounding box per anchor (hence 4k box coordinates where k is the number of anchors) [5].

Since objects in an image should fit certain common aspect ratios and sizes, the k proposals are parametrized with respect to k reference boxes, called anchor boxes. For each such anchor box, the network outputs one bounding box and score per position in the image. The resulting bounding boxes are then passed to Fast R-CNN.

1.2.4 YOLO

You Only Look Once (YOLO) is one of the most effective state-of-the-art real-time object detection systems [6].

Contrarily to other architectures, it reframes object detection as a single regression problem, going directly from image pixels to bounding box coordinates and class probabilities. A single convolutional network simultaneously predicts multiple bounding boxes and class probabilities for those boxes. YOLO trains on full images and directly optimizes detection performance. This makes YOLO faster than other solutions, allowing its use in applications where real-time performance is critical.

Unlike region proposal-based techniques, YOLO sees the entire image during training and test time, so it implicitly encodes contextual information about classes as well as their appearance.

Figure 1.13: YOLO models detection as a regression problem [6].

In fact, the network uses features from the entire image to predict each bounding box. It also predicts all bounding boxes across all classes for an image simultaneously. The detection system scheme of functioning is shown in figure 1.13. The input image is divided into an S ×S grid. Each grid cell predicts B bounding boxes and confidence scores for those boxes. Each bounding box consists of 5 predictions: x, y, w, h, and confidence. The (x, y) coordinates represent the center of the box relative to the bounds of the grid cell.

Each grid cell also predicts C conditional class probabilities. These probabilities are conditioned on the grid cell containing an object. The class confidence score for each prediction box is computed as:

class confidence score = box confidence score × conditional class probability

These scores encode both the probability of that class appearing in the box and how well the predicted box fits the object.
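A minimal sketch of how these scores could be combined for one grid cell, assuming B boxes with their confidences and C conditional class probabilities (illustrative shapes and random values, not YOLO's real tensors):

```python
import numpy as np

B, C = 2, 20                                   # boxes per cell, number of classes
box_confidence = np.random.rand(B)             # one confidence per predicted box
class_probability = np.random.rand(C)          # conditional class probabilities of the cell
class_probability /= class_probability.sum()

# class confidence score = box confidence score x conditional class probability
class_scores = box_confidence[:, None] * class_probability[None, :]   # shape (B, C)
best_class = class_scores.argmax(axis=1)       # most likely class for each box
print(best_class)
```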

As said, the model is implemented as a convolutional neural network, where the initial convolutional layers of the network extract features from the image while the fully connected layers predict the output probabilities and coordinates.

Inspired by the GoogLeNet model [18], it has 24 convolutional layers followed by 2 fully connected layers.

Figure 1.14: YOLO Network Architecture [6].

YOLO imposes strong spatial constraints on bounding box predictions since each grid cell only predicts two boxes and can only have one class. This spatial constraint limits the number of nearby objects that the model can predict.

This is just the first version of the algorithm; improvements have been obtained with newer versions such as YOLOv2 [21] and YOLOv3 [22], which use a slightly different network architecture and training strategy.


Figure 1.15: Detection Using YOLO.

An example of the results obtained using YOLO is shown in figure 1.15.

1.2.5 Mask R-CNN

Figure 1.16: Mask R-CNN results on the COCO test set [7].

The outputs of R-CNN, Fast R-CNN, Faster R-CNN and YOLO are only the class of the detected object and a bounding box around it.


For some applications, localizing an object with a simple bounding box is not enough. For instance, you might want to segment an object region once it is detected. This class of problems is called instance segmentation.

Kaiming He and a team of researchers at Facebook AI explored the possibility of solving this problem using CNNs, implementing an architecture known as Mask R-CNN [7]. This architecture extends Faster R-CNN to also carry out pixel-level segmentation. This is done by adding a Fully Convolutional Network on top of a CNN-based feature map (1.17).

Figure 1.17: A Fully Convolutional Network (FCN) is added on top of the CNN features of Faster R-CNN to generate a mask (segmentation output) [7].

In this case, the network takes as input the feature map obtained using the CNN and gives as output a binary mask, a matrix with 1 on all locations where the pixel belongs to the object and 0 elsewhere.

Furthermore, a correction on the RoIPool architecture is made, to obtain accurate segmentation, using a method called RoIAlign.


Figure 1.18: The Mask R-CNN framework for instance segmentation. [7]

This technique is used to accurately map a region of interest from the original image onto the feature map, using bilinear interpolation.


Object Pose Estimation

The result of the object detection step is the localization of the target on the image plane. The next step to perform grasping operations is the estimation of its pose in the camera reference frame.

The estimation problem can be stated as follows.

Pose Estimation Problem:

Given:

• a camera and a reference frame $O_c$–$x_c y_c z_c$ attached to it;

• a reference frame $O_o$–$x_o y_o z_o$ attached to the detected object, supposed rigid.

Estimating the pose of the object relative to the camera means estimating the elements of $T^C_B$, the homogeneous transformation matrix corresponding to the relative pose of the object with respect to the camera.

Although methods exist that try to estimate the pose of unknown objects, reconstructing a superquadric representing the object by using a 3D partial point cloud acquired from stereo vision [23] or decomposing the object into Minimal Volume Bounding Boxes [24], many pose estimation algorithms rely on a prior knowledge of the object to manipulate.

For this thesis the problem of the pose estimation of objects whose CAD model is available will be considered.

Traditionally, the problem of 6D object pose estimation is tackled by matching feature points between 3D models and images. These methods, which will be briefly treated in the following, work using 2D-3D pose matching and are based on models that relate the 2D points on the image plane to 3D world coordinates.

Other solutions try to use region-based 2D segmentation and 2D-to-3D pose tracking, using a known 3D model of the object [25], or using Temporally Consistent Local Color Histograms [26].

The problem with feature-based methods is the handling of textureless objects, where it is not easy to extract robust and meaningful feature points, while the critical issue of segmentation-based methods is the capability to achieve good background segmentation, especially in cluttered scenes. With the emergence of depth cameras, several methods have been proposed for recognizing textureless objects using RGB-D data, with recent attempts to use CNNs even for the pose estimation [27].

In this work the focus will be on template matching methods, and some of them will be briefly presented in section 2.2.

2.1 2D-3D Feature-Based Pose Matching

Figure 2.1: Pinhole Camera Model

The pinhole camera model describes the mathematical relationship between the coordinates of a point in three-dimensional space and its projection onto the image plane. That relationship can be expressed via a so-called calibration matrix, which depends on up to five intrinsic parameters [28]:

$$K = \begin{bmatrix} f\alpha_x & 0 & x_0 \\ 0 & f\alpha_y & y_0 \\ 0 & 0 & 1 \end{bmatrix} \qquad (2.1)$$

where f is the focal length of the camera, $(\alpha_x, \alpha_y)$ are two scale factors that relate pixel units to metric units, and $(x_0, y_0)$ are the coordinates in pixels of the principal point.

Using this model, a given 3D point $\vec{p}^{\,c} = (p^c_x, p^c_y, p^c_z)^T$, expressed in the camera reference frame, can be related to the pixel coordinates $(X_I, Y_I)$ using:

$$X_I = \alpha_x f \frac{p^c_x}{p^c_z} + x_0, \qquad Y_I = \alpha_y f \frac{p^c_y}{p^c_z} + y_0 \qquad (2.2)$$

Equation (2.2) can be expressed in linear form using the homogeneous representation of a point $(x_I, y_I, \lambda)$:

$$X_I = \frac{x_I}{\lambda}, \qquad Y_I = \frac{y_I}{\lambda} \qquad (2.3)$$

Thus, it can be written as:

$$\begin{bmatrix} x_I \\ y_I \\ \lambda \end{bmatrix} = \lambda \begin{bmatrix} X_I \\ Y_I \\ 1 \end{bmatrix} = K\,\Pi \begin{bmatrix} p^c_x \\ p^c_y \\ p^c_z \\ 1 \end{bmatrix} \qquad (2.4)$$

where K is the matrix (2.1) and $\Pi$ is the matrix:

$$\Pi = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}$$

If the intrinsic parameters are known, it can be useful to refer to the normalized coordinates $(X, Y)$, defined by the normalized perspective transformation:

$$\lambda \begin{bmatrix} X \\ Y \\ 1 \end{bmatrix} = \Pi \begin{bmatrix} p^c_x \\ p^c_y \\ p^c_z \\ 1 \end{bmatrix} \qquad (2.5)$$

Using (2.2) and (2.5), the following transformation can be obtained:

$$\lambda \begin{bmatrix} X_I \\ Y_I \\ 1 \end{bmatrix} = K \begin{bmatrix} X \\ Y \\ 1 \end{bmatrix} \qquad (2.6)$$

Equation (2.6) is widely used when the problem of estimating the matrix $T^C_B$ is treated in an analytic way, and the elements of the matrix are obtained solving a system of equations.
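The projection equations (2.1)-(2.6) translate directly into a few lines of NumPy; the sketch below only exercises the formulas, and the intrinsic values are made-up numbers used for illustration.

```python
import numpy as np

# Intrinsic calibration matrix K (eq. 2.1), with made-up parameters.
f, ax, ay, x0, y0 = 1.0, 800.0, 800.0, 320.0, 240.0
K = np.array([[f * ax, 0.0,    x0],
              [0.0,    f * ay, y0],
              [0.0,    0.0,    1.0]])
Pi = np.hstack([np.eye(3), np.zeros((3, 1))])   # projection matrix (3x4)

p_c = np.array([0.1, -0.05, 1.5, 1.0])          # homogeneous 3D point in the camera frame

# Eq. (2.4): homogeneous image coordinates, then eq. (2.3): divide by lambda.
x_h = K @ Pi @ p_c
X_I, Y_I = x_h[0] / x_h[2], x_h[1] / x_h[2]
print(X_I, Y_I)
```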

2.1.1 Analytical Solutions

Figure 2.2: Example of 2D-3D pose estimation problem [8].

The elements of the matrix $T^C_B$ are calculated using measurements of characteristic parameters of the object on the image plane. In fact, considering n points of the object and assuming to know the position vectors of these points expressed in the object frame, the system of equations to solve is built using the projections of these points on the image plane.

For each point, the coordinates on the image plane are:

$$s_i = \begin{bmatrix} X_i \\ Y_i \end{bmatrix}$$

and the vector of the characteristic parameters is defined as:

$$s = \begin{bmatrix} s_1 \\ \vdots \\ s_n \end{bmatrix}$$

The homogeneous coordinates of the object points can be expressed in the camera reference frame as:

$$\tilde{r}^{\,c}_{o,i} = T^C_B\, \tilde{r}^{\,o}_{o,i}$$

Using (2.6):

$$\lambda_i \tilde{s}_i = \Pi\, T^C_B\, \tilde{r}^{\,o}_{o,i} \qquad (2.7)$$

With n correspondences it is possible to build a system of n equations in the form of (2.7). This problem is known as the PnP (Perspective-n-Point) problem [28].

For each solution to PnP, the chosen point correspondences cannot be coplanar. In addi-tion, PnP can have multiple solutions, and choosing a particular solution would require post-processing of the solution set. Random sample consensus (RANSAC) is also commonly used with a PnP method to make the solution robust to outliers in the set of point correspondences. RANSAC [29] is an iterative method to estimate parameters of a mathematical model from a set of observed data that contains outliers, when outliers are to be accorded no influence on the values of the estimates.
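In practice the PnP + RANSAC combination described above is available off the shelf, for example in OpenCV; the sketch below assumes synthetic 2D-3D correspondences and a known calibration matrix, and is only meant to illustrate the call.

```python
import numpy as np
import cv2

# n >= 4 known 3D object points (object frame) and their detected image projections.
object_points = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0],
                          [0.1, 0.1, 0.0], [0.0, 0.1, 0.0],
                          [0.05, 0.05, 0.02]], dtype=np.float64)
image_points = np.array([[320.0, 240.0], [400.0, 242.0], [398.0, 320.0],
                         [322.0, 318.0], [360.0, 280.0]], dtype=np.float64)
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
dist_coeffs = np.zeros(5)                      # assume no lens distortion

# RANSAC-robust PnP: returns the object pose in the camera frame.
ok, rvec, tvec, inliers = cv2.solvePnPRansac(object_points, image_points,
                                             K, dist_coeffs)
R, _ = cv2.Rodrigues(rvec)                     # rotation vector -> rotation matrix
print(ok, R, tvec)
```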

Another famous approach to analytically solve the problem is the Direct Linear Transformation (DLT) method [28].

This method allows computing the elements of $T^C_B$ by solving a system of linear equations.

The equation is obtained from the equality:

$$S(\tilde{s}_i)\,\big[R^C_B \;\; o^c_{c,o}\big]\, \tilde{r}^{\,o}_{o,i} = 0,$$

where $S(\tilde{s}_i)$ is an antisymmetric matrix.

In theory, using six correspondences a linear system with 12 equations and 12 variables with rank 11 is obtained. In reality, due to the presence of noise, the system would have rank 12 and then a number of n > 6 points would be needed. In this case the solutions should be computed using a least square technique.

The former methods make use of a system of equations to solve for the elements of the matrix $T^C_B$. Another solution is to formulate the pose estimation problem in a way that resembles the algorithm used to compute the inverse kinematics of a manipulator [28].

The basic principles behind this approach will be briefly treated in the next section.

2.1.2 Algorithmic Solution

An algorithmic solution for the pose estimation problem exploits the relationship that holds between the velocity vector in the image plane ($\dot{s}$) and the relative velocity between the camera and the object:

$$\dot{s} = J_s(s, T^C_B)\, v^c_{c,o} \qquad (2.8)$$

where the matrix $J_s$ is called the image Jacobian.

Equation (2.8) can be rearranged by specifying the relative velocity in terms of the absolute velocities of both the object and the camera:

$$\dot{s} = J_s(s, T^C_B)\left(v^c_o + \Gamma(o^c_{c,o})\, v^c_c\right) = J_s(s, T^C_B)\, v^c_o + L_s v^c_c \qquad (2.9)$$

where $\Gamma(\cdot)$ is the operator:

$$\Gamma(\cdot) = \begin{bmatrix} -I & S(\cdot) \\ 0 & -I \end{bmatrix}$$

and the matrix $L_s = J_s(s, T^C_B)\,\Gamma(o^c_{c,o})$ is called the Interaction Matrix.

Usually, $L_s \in \mathbb{R}^{k \times m}$ with $m = 6$ in case of unconstrained motion. If $k > m$, equation (2.8) can be solved for $v^c_{c,o}$ using the left pseudo-inverse of the image Jacobian:

$$v^c_{c,o} = \left(J_s^T J_s\right)^{-1} J_s^T\, \dot{s} \qquad (2.10)$$

The pose estimation problem is set up in a way formally similar to the algorithm used to compute the inverse kinematics of a manipulator. In fact, defining a parametrization of the object pose using the vector:

$$x_{c,o} = \begin{bmatrix} o^c_{c,o} \\ \Phi_{c,o} \end{bmatrix} \qquad (2.11)$$

equation (2.8) can be written as:

$$\dot{s} = L_s\,\Gamma(-o^c_{c,o})\,T_A(\Phi_{c,o})\,\dot{x}_{c,o} = J_{A_s}(s, x_{c,o})\,\dot{x}_{c,o} \qquad (2.12)$$

where $T_A(\Phi_{c,o})$ is the matrix used to link the velocity vector to the time derivative of the chosen parametrization:

$$v^c_{c,o} = T_A(\Phi_{c,o})\,\dot{x}_{c,o}$$

Equation (2.12) is used to build an algorithm that estimates the pose $\hat{x}_{c,o}$ by minimizing the error:

$$e_s = s - \hat{s} = s - s(\hat{x}_{c,o})$$

These Jacobian-based methods are efficient if the initial estimate is accurate, but they are not asymptotically stable; hence the system might converge to a solution where the error is limited but not zero.

2.2 RGB-D Pose Estimation

The recent spread of powerful and cheap depth sensors, like the Microsoft Kinect or the Asus Xtion, has led to the development of methods that try to use depth information to solve the pose estimation problem. In general, these stereo cameras work in a way that resembles human binocular sight:

• Two separate sensors capture the scene.

• Depth and motion are estimated by comparing the displacement of pixels between the left and right images.

Many approaches have been tried for real-time object detection using RGB-D data.

In some of these approaches, the object model point cloud is fitted to a registered point cloud.

A popular library to manipulate point clouds, PCL [30], offers implementations of some of these algorithms, like object recognition using correspondence grouping.

A class of approaches widely used in recent times is template matching [31].

2.2.1 LINE-MOD

A state-of-the-art template matching algorithm, LINE-MOD, has proven to be an efficient method that exploits both depth and color images to capture the appearance and 3D shape of the object in a set of templates [31].

The LINE-MOD templates sample the possible appearances of the objects to detect, using image gradients and depth-map normals. Then, when a test image is encountered, the algorithm searches through the test image with a sliding window. This window contains a matching object if the similarity score between the gradient feature matrix of this window and that of a template is above a certain threshold. LINE-MOD proposes a similarity measure that, for each gradient orientation on the object, searches in a neighbourhood of the associated gradient location for the most similar orientation in the test image.

The pose estimated by LINE-MOD is only approximately correct, since a template covers a range of views around its viewpoint. Thus, a pose refinement process has to be applied [31]. This pose refinement is obtained applying Iterative Closest Point (ICP) algorithms.

Originally introduced in [32], the ICP algorithm aims to find the transformation between a point cloud and some reference surface (or another point cloud), by minimizing the square errors between the corresponding entities.

As with any gradient descent method, ICP is applicable when a relatively good starting point is available; hence it is not suitable as a stand-alone pose estimation method, but it works well for refining the initial pose estimate given by LINE-MOD.
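As an illustration, a LINE-MOD-style coarse pose could be refined with the point-to-point ICP implementation of the Open3D library; this is an assumption for illustration only (the thesis pipeline relies on ORK/PCL, not Open3D), and `model_cloud`, `scene_cloud` and `T_coarse` are placeholders.

```python
import open3d as o3d

def refine_pose(model_cloud, scene_cloud, T_coarse, max_dist=0.01):
    """Refine a coarse object pose with point-to-point ICP.

    model_cloud, scene_cloud: open3d.geometry.PointCloud objects (placeholders).
    T_coarse: 4x4 initial transformation, e.g. the template-matching estimate.
    """
    result = o3d.pipelines.registration.registration_icp(
        model_cloud, scene_cloud, max_dist, T_coarse,
        o3d.pipelines.registration.TransformationEstimationPointToPoint())
    return result.transformation            # refined 4x4 homogeneous matrix
```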

In this work, LINE-MOD has been used to build the implemented object pose estimation pipeline.


Visual Control


Problem Formulation and Related Works

Visual servoing can be defined as a feedback closed-loop control strategy based on vision [33]. In humanoid robots, the cameras are placed in the head and visual servoing can be done with an eye-to-hand approach [34].

Visual Servoing control techniques are broadly classified into two types [34]:

• Image-Based Visual Servoing

• Pose-Based Visual Servoing

In the first case (IBVS), the control law is based on the error between current and desired features on the image plane, and does not involve any estimate of the pose of the target. For the latter (PBVS), an estimate of the object pose is performed instead, and the control schemes are based on the minimization of an error defined in the operational space. In this work, a Pose-Based approach has been used. In fact, this approach allows the use of classical controllers to drive the robot to a desired pose and does not involve complex control laws to deal with errors defined in the image plane, as would be needed in an Image-Based approach. On the other hand, an Image-Based approach is more robust with respect to camera calibration errors. The eye-hand calibration, i.e. the estimation of the end-effector pose with respect to the head, is critical when a Pose-Based approach is implemented.


This issue could be overcome by providing real-time visual feedback of the end-effector to control the arm movements, but this approach, used for example in [35], can be computationally expensive, since the end-effector pose should be provided at a suitable rate, and it is not robust with respect to visual tracking failures. Furthermore, the end-effector might not always be in the camera field of view, and such approaches should implement controllers to keep the hand always visible. The approach proposed in this thesis aims to reproduce the human behaviour, where vision is not used directly during the reaching phase; therefore, different solutions have been investigated to perform autonomous hand-eye calibration using vision. In the literature, many solutions are proposed. A learning-based approach to hand-eye coordination for robotic grasping using monocular images, where a convolutional neural network is trained to predict the probability that task-space motion of the end-effector will result in successful grasps, has been proposed in [36]. This approach, which is independent of camera calibration or the current robot pose, needs a long training phase for the network (two months in the cited case). Hence, this method has been discarded in favour of simpler and faster solutions. In [13] eye-hand calibration is realized by performing several predefined arm movements with a predefined hand posture, without prior information about the pose. Optimization techniques are employed to learn the transformation used to calibrate the kinematics. A slightly different method has been proposed in [37]. A robust pose estimation of the robot hand is achieved via a 3D model-based stereo-vision algorithm, using an edge-based distance transform metric and synthetically generated images of a robot arm-hand internal computer graphics model (kinematics and appearance). Then, a particle-based optimisation method is used to adapt online the robot internal model to match the real and the synthetically generated images, compensating for the calibration errors. In this case, a model of the camera-hand transformation is supposed to be known, and the visual system is used to adapt this model to calibration errors. The approach proposed in this thesis is a hybrid learning-adapting strategy that tries to unify in a single framework the ideas proposed in [37] and [13].

In this approach, the vision system is used to learn from scratch the hand-camera transformation, and then a non linear filter is implemented to adapt the estimated model and compensate for the calibration errors. This hybrid strategy, called Learn&Adapt, is implemented using two procedures to:


• Learn the robot model

• Adapt online the learned model


Model Learning

As expressed in [13], the general formulation for the eye-hand calibration starts from the equation:

AX = XB, (4.1)

where $A \in \mathbb{R}^{4\times4}$ is the homogeneous matrix that represents the camera motion, $B \in \mathbb{R}^{4\times4}$ represents the end-effector motion, and $X \in \mathbb{R}^{4\times4}$ is the unknown transformation that relates the two reference systems.

Equation 4.1 can also be written as:

AX = ZB, (4.2)

where the end effector and the camera are related by two different matrices.

The solution proposed in [13] exploits the robot forward kinematics and a visual estimation of the hand pose using the camera depth map to set up an optimization problem.

In fact, if a set of positions $(X^C_H, X^B_H)$ of the end effector is available, it is possible to estimate the homogeneous matrix that maps the two frames by solving a constrained minimization problem [13]. The approach proposed here is similar, since it uses a set of measured end-effector poses in the camera frame and the robot forward kinematics to create the cost functional to be minimized, but it is enhanced with measurements of the end-effector orientation and takes into account calibration errors on the robot kinematic chain. The structure of the problem is shown in figure 4.1, where the figure on the left shows the case of a perfectly calibrated kinematic model. In this case, the problem is to estimate the elements of the matrix $T^C_B$, having collected a set of N poses of the end-effector in the camera frame ($T^C_H$) and in the robot base frame using the forward kinematics $\Lambda(q)$.

Figure 4.1: Left: Camera-Base Learning Problem for calibrated kinematics; Right: Camera-Base Learning Problem in presence of calibration errors

The problem is slightly different if, as in the case under examination, calibration errors are present. In this case (right image in 4.1) the variables to be estimated include the vector of calibration errors $\beta$ that determines the real base-hand transformation. The way the transformation matrix $T^C_B$ and the calibration errors $\beta$ are modelled will be described in detail in the following sections, and it will be discussed how optimization techniques can be used to estimate them.

4.1 Camera-Base Transformation Model

The camera-base transformation is modelled using the homogeneous transformation matrix $T^C_B$. This matrix, supposed to be constant, can be decomposed into a rotational part $R^C_B \in \mathbb{R}^{3\times3}$ and a translation vector $t^C_B \in \mathbb{R}^3$:

$$T^C_B = \begin{bmatrix} R^C_B & t^C_B \\ 0_{1\times3} & 1 \end{bmatrix}$$

Knowing $R^C_B$ and $t^C_B$, the transformation is completely defined. Since it is not convenient to estimate directly the elements of the rotation matrix, a different parametrization has been used: the rotation matrix has been parametrized using the unit quaternion $Q_C = \{\eta_C, \epsilon_C\}$.


Unit quaternions provide a convenient mathematical notation to represent orientations, are simpler to compose than Euler angles and avoid the problem of gimbal lock. Compared to rotation matrices they are more compact, more numerically stable, and more efficient.

Thus the transformation is completely specified by the vector of 7 parameters:

$$x_C = \begin{bmatrix} Q_C \\ t^C_B \end{bmatrix}$$

with the constraint:

$$\|Q_C\| = 1$$
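A small sketch of this 7-parameter representation, using SciPy's rotation utilities to go from $(Q_C, t^C_B)$ to the homogeneous matrix $T^C_B$; the numeric values are illustrative only, and note that SciPy uses the $(x, y, z, w)$ quaternion ordering whereas the thesis writes the quaternion as $\{\eta_C, \epsilon_C\}$.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def camera_base_transform(x_C):
    """Build T_B^C from the 7-parameter vector [qx, qy, qz, qw, tx, ty, tz]."""
    q, t = x_C[:4], x_C[4:]
    q = q / np.linalg.norm(q)                        # enforce the unit-norm constraint
    T = np.eye(4)
    T[:3, :3] = Rotation.from_quat(q).as_matrix()    # quaternion -> rotation matrix
    T[:3, 3] = t
    return T

x_C = np.array([0.0, 0.0, 0.0, 1.0, 0.1, -0.2, 0.9])  # identity rotation, sample translation
print(camera_base_transform(x_C))
```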

4.2 Kinematic Calibration Errors

Using the robot kinematics function from the base frame to the end effector (Λ(·)) and the vector of joint encoder readings q, an estimate of the pose can be obtained by:

$$\hat{T}^B_H = \Lambda(q)$$

However, calibration errors might affect this estimate. The source of errors can be encoded in many ways, but here it has been decided to follow the approach used in [37], where the errors are encoded in the joint space. Thus, given the vector of measured joint angles, the real angles can be computed as:

$$q_r = q + \beta \qquad (4.3)$$

where β are the constant joint offsets representing the calibration errors. If an estimate of these offsets is available, a better estimate of the hand pose can be computed as:

$$\hat{T}^B_H = \Lambda(q + \hat{\beta})$$

4.3 Parameter Estimation

The proposed approach simultaneously estimates:

• The camera-base transformation $T^C_B$

• The vector of joint offsets β

This is obtained using optimization techniques, solving a constrained minimization problem. Referring to figure 4.2, the cost function to be minimized is based on visual measurements of the end effector pose and the parametric model of this pose.

Figure 4.2: Model Learning in presence of constant calibration errors

In fact, given a measure of the joint angles, the pose of the end-effector can be expressed as:

$$T^C_H\!\left(Q_C, t^C_B, \beta, q\right) = T^C_B\!\left(Q_C, t^C_B\right)\Lambda(q + \beta) \qquad (4.4)$$

The computed transformation matrix can be parametrized using the unit quaternion $Q\!\left(Q_C, t^C_B, \beta, q\right)$ and the translation vector $p^C_H\!\left(Q_C, t^C_B, \beta, q\right)$. Then, measuring for a given joint angle configuration the corresponding pose in the camera frame, it is possible to build an error vector:

$$e\!\left(\bar{p}_M, \bar{Q}_M, Q_C, t^C_B, \beta, q\right) = \begin{bmatrix} e_P \\ e_O \end{bmatrix} = \begin{bmatrix} \bar{p}_M - p^C_H\!\left(Q_C, t^C_B, \beta, q\right) \\ \Delta\epsilon\!\left(\bar{Q}_M, Q_C, t^C_B, \beta, q\right) \end{bmatrix} \qquad (4.5)$$

where the orientation error has been defined using unit quaternions [28]. $\Delta\epsilon\!\left(\bar{Q}_M, Q_C, t^C_B, \beta, q\right)$ is the vector part of the unit quaternion:

$$\Delta Q\!\left(\bar{Q}_M, Q_C, t^C_B, \beta, q\right) = \bar{Q}_M * Q\!\left(Q_C, t^C_B, \beta, q\right)^{-1}$$

Assuming to collect a set of N visual measurements of the hand pose and the corresponding joint angle configurations $q_i$, and defining the total parameter vector as:

$$x = \begin{bmatrix} Q_C \\ t^C_B \\ \beta \end{bmatrix}$$

the cost function used in the optimization problem can be defined as:

$$J(x) = \frac{1}{N}\sum_{i=1}^{N} \left\| e\!\left(\bar{p}_{M_i}, \bar{Q}_{M_i}, x, q_i\right) \right\|^2 \qquad (4.6)$$

The desired parameters are the solution of the constrained minimization problem:

$$\hat{x} = \arg\min_{x}\; J(x) \quad \text{s.t.} \quad \|Q_C\|^2 = 1 \qquad (4.7)$$
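A minimal sketch of how problem (4.7) could be solved numerically with SciPy's SLSQP solver; `forward_kinematics` and the measurement arrays are placeholders for quantities the thesis obtains from the robot model and from vision, and for brevity the cost uses only the position part of the error (4.5).

```python
import numpy as np
from scipy.optimize import minimize
from scipy.spatial.transform import Rotation

def forward_kinematics(q):
    """Placeholder for Lambda(q): base-to-hand homogeneous transform."""
    raise NotImplementedError

def predicted_hand_position(x, q):
    """p_H^C as a function of the parameters x = [Q_C (xyzw), t_B^C, beta]."""
    T_CB = np.eye(4)
    T_CB[:3, :3] = Rotation.from_quat(x[:4]).as_matrix()
    T_CB[:3, 3] = x[4:7]
    beta = x[7:]
    T_CH = T_CB @ forward_kinematics(q + beta)       # eq. (4.4)
    return T_CH[:3, 3]

def cost(x, q_samples, p_meas):
    """Eq. (4.6), position part only: mean squared measurement residual."""
    residuals = [p_meas[i] - predicted_hand_position(x, q_samples[i])
                 for i in range(len(q_samples))]
    return np.mean([r @ r for r in residuals])

def calibrate(q_samples, p_meas, n_joints):
    x0 = np.concatenate([[0.0, 0.0, 0.0, 1.0], np.zeros(3), np.zeros(n_joints)])
    constraint = {"type": "eq",
                  "fun": lambda x: np.dot(x[:4], x[:4]) - 1.0}   # ||Q_C||^2 = 1
    res = minimize(cost, x0, args=(q_samples, p_meas),
                   method="SLSQP", constraints=[constraint])
    return res.x
```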


Online Model Adaptation

The previously proposed procedure can be used to learn the correct camera-hand model. This procedure is fully autonomous and is supposed to be executed at a certain time before the robot has to actually perform the grasping. Ideally, the robot has to learn its model just once, and then should use it to reach the target pose computed by the grasp planner. If some changes occur, such an open-loop strategy would definitely fail. To make the system more robust, the robot has to close the loop with respect to the visual feedback, using vision to adapt its estimated model. The methods proposed in [38] and [37] use markerless pose estimation and tracking of a robot manipulator. In [38] the concept of virtual visual servoing is exploited to estimate the correct pose, where a rendered model of the hand is aligned with the current image of the hand and edges are extracted as features. Hence, a real-time tracking of the robot is possible. On the other hand, the approach proposed here is based on the implementation of a visual-based filter that estimates errors on the learned model and uses this estimate to correct it, instead of performing a visual tracking of the hand.

The concept is similar to the method proposed in [37], where a particle filter is used to estimate the current hand pose and to calibrate the model. The internal model is used to generate hypotheses about the hand appearance in the eye cameras, which are compared to the actual hand appearance in the images coming from the cameras. In that formulation, the only errors are on the joint angle measurements. Since the calibration procedure proposed in this thesis is used to learn the matrix $\hat{T}^C_B$ and the calibration errors $\hat{\beta}$, it has been decided to model the presence of errors on the estimated model as errors on the estimated camera-base transformation and on the estimated joint offsets:

$$\tilde{T}^C_H = T_{\epsilon_C}\,\hat{T}^C_B\,\Lambda(q + \hat{\beta} + \epsilon_\beta) \qquad (5.1)$$

where the matrix $T_{\epsilon_C}$ is used to model the presence of small errors on the camera-base transformation:

$$T_{\epsilon_C} = \begin{bmatrix} I_3 - (\delta\vartheta_c \times) & \delta t_c \\ 0_{1\times3} & 1 \end{bmatrix} \qquad (5.2)$$

Then, the position and orientation of the end-effector are functions of the parametrized errors:

$$\tilde{p}^C_H(q) = g_P(q, \epsilon_C, \epsilon_\beta), \qquad \tilde{Q}_C(q) = g_O(q, \epsilon_C, \epsilon_\beta) \qquad (5.3)$$

The proposed solution uses a set of visual-measured poses of the end-effector to estimate the previously modelled errors.

The estimation of a set of variables, from measurements related to them according to a certain relationship, is a classic problem in automation and robotics.

The state-of-the-art estimation procedures are based on the Kalman Filter framework, widely used for stochastic estimation from noisy sensor measurements. In its original formulation [39], the Kalman filter addresses the general problem of trying to estimate the state $x \in \mathbb{R}^n$ of a discrete-time controlled process that is governed by the linear stochastic difference equation:

$$x_k = A x_{k-1} + B u_k + w_{k-1} \qquad (5.4)$$

with a measurement vector $z \in \mathbb{R}^m$:

$$z_k = H x_k + \nu_k \qquad (5.5)$$

where $w_k$, $\nu_k$, the process and measurement noise (respectively), are random variables, assumed to be independent, white and with normal distribution:

$$w_k \sim \mathcal{N}(0, Q_k) \qquad (5.6)$$

$$\nu_k \sim \mathcal{N}(0, R_k) \qquad (5.7)$$

with $Q_k$, $R_k$ the process and measurement covariance matrices.

The Kalman filter estimates a process by using a form of feedback control: the filter estimates the process state at some time and then obtains feedback in the form of (noisy) measurements. The procedure can be extended to the case where the process to be estimated and/or the measurement relationship to the process are non-linear.

This procedure is called Extended Kalman Filter and its working principles are presented in the next section.

5.0.1 Extended Kalman Filter

Let us assume that the process is governed by the non-linear stochastic difference equation:

$$x_k = f(x_{k-1}, u_k) + w_k \qquad (5.8)$$

with measurement:

$$z_k = h(x_k) + v_k \qquad (5.9)$$

For the sake of simplicity, let us also assume that the process and measurement noises are additive, although general approaches that do not use this assumption can be developed.

Again, the procedure can be seen as divided into two stages:

• Prediction

• Correction

In the EKF framework the non linear function f is used to compute the predicted state from the previous estimate, and the function h can be similarly used to compute the predicted measurement from the predicted state [40].

The propagation and correction of the covariance can be done using the non linear functions, but at each step a Jacobian matrix has to be computed and evaluated at the current predicted state. The equations used to implement the EKF are the following:

• Predict:

$$\hat{x}_{k|k-1} = f(\hat{x}_{k-1|k-1}, u_k) \qquad (5.10)$$

$$P_{k|k-1} = F_k P_{k-1|k-1} F_k^T + Q_k \qquad (5.11)$$

• Update:

$$\hat{x}_{k|k} = \hat{x}_{k|k-1} + K_k\left(z_k - h(\hat{x}_{k|k-1})\right) \qquad (5.12)$$

$$P_{k|k} = (I - K_k H_k)\, P_{k|k-1} \qquad (5.13)$$

The Kalman gain is computed as:

$$K_k = P_{k|k-1} H_k^T \left(H_k P_{k|k-1} H_k^T + R_k\right)^{-1} \qquad (5.14)$$

In the propagation of the covariance matrix and in the computation of the Kalman gain, the state transition and observation matrices are defined using the following Jacobians:

$$F_k = \left.\frac{\partial f}{\partial x}\right|_{\hat{x}_{k-1|k-1},\,u_k} \qquad H_k = \left.\frac{\partial h}{\partial x}\right|_{\hat{x}_{k|k-1}} \qquad (5.15)$$

Unlike its linear counterpart, the extended Kalman filter in general is not an optimal estimator and, if the initial estimate of the state is wrong, or if the process is modelled incorrectly, the filter may even diverge, because of its linearisation.

Nonetheless, the EKF has proven to be quite effective and is widely used in robotics and automation estimation problems, becoming the standard, for instance, in navigation for GPS-IMU integration. This, together with its ease of implementation and low computational load, makes the Extended Kalman Filter a promising way to estimate the model errors.
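Equations (5.10)-(5.15) map directly onto a small NumPy implementation; the sketch below is a generic EKF with numerical Jacobians, not the thesis code. The functions `f` and `h` stand for the state and measurement models defined in the next sections, and the control input $u_k$ is omitted since the state model of Section 5.0.2 does not use one.

```python
import numpy as np

def numerical_jacobian(func, x, eps=1e-6):
    """Finite-difference Jacobian of func evaluated at x."""
    fx = func(x)
    J = np.zeros((fx.size, x.size))
    for i in range(x.size):
        dx = np.zeros_like(x)
        dx[i] = eps
        J[:, i] = (func(x + dx) - fx) / eps
    return J

def ekf_step(x_est, P, z, f, h, Q, R):
    """One predict/update cycle of the Extended Kalman Filter (eqs. 5.10-5.15)."""
    # Predict
    F = numerical_jacobian(f, x_est)
    x_pred = f(x_est)                                        # eq. (5.10)
    P_pred = F @ P @ F.T + Q                                 # eq. (5.11)
    # Update
    H = numerical_jacobian(h, x_pred)
    K = P_pred @ H.T @ np.linalg.inv(H @ P_pred @ H.T + R)   # eq. (5.14)
    x_new = x_pred + K @ (z - h(x_pred))                     # eq. (5.12)
    P_new = (np.eye(x_est.size) - K @ H) @ P_pred            # eq. (5.13)
    return x_new, P_new
```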

The following sections explain how the estimation problem can be written to follow the EKF framework.

5.0.2 State Dynamic Model

The model errors are the variables to be estimated:

$$\epsilon = \begin{bmatrix} \delta t_c \\ \delta\vartheta_c \\ \epsilon_\beta \end{bmatrix}$$

These errors are supposed to be constant over time; therefore, according to the EKF framework, the process is simply governed by the difference equation:

$$\epsilon_k = \epsilon_{k-1} + w_k \qquad (5.16)$$

5.0.3 Measurement Model

To build a suitable measurement vector, it has been assumed that a visual estimation of the hand pose is available.

This pose can be assumed to be an accurate estimate of the real pose and, since a measurement of the joint angles can be obtained from the encoders, it can be combined with the pose calculated using the previously estimated model to build an error vector.

Position Error The position error is simply obtained subtracting the estimated position from the measured one.

Using the formalism introduced in (5.3), this error can be written as:

$$e_P = P_{meas} - \bar{P}_{est}(q) = g_P(q, \epsilon_c, \epsilon_\beta) - g_P(q, 0, 0) + \nu = h_p(q, \epsilon) + \nu \qquad (5.17)$$

where ν is used to model the visual measurement noise.

Orientation Error  The orientation error is built using the angle-axis parametrization. In fact, denoting the measured rotation matrix as $R_{vis}$ and the modelled rotation matrix as $\bar{R}_{est}$, the orientation error between the two frames can be computed as:

$$e_O(q) = \frac{\left(\tilde{R}_S - \tilde{R}_S^T\right)^{\vee}}{2} \qquad (5.18)$$

where the matrix $\tilde{R}_S = R_{vis} \cdot \bar{R}_{est}^T(q)$ has been defined.

The orientation error can then be written as:

\tilde{R}_S(q, \epsilon) = R_{vis}(q, \epsilon) \cdot \bar{R}_{est}^T(q, 0)

e_O(q, \epsilon) = h_O(q, \epsilon) + \omega    (5.19)

where \omega is used to model the measurement noise on the rotation matrix. Thus, the measurement vector is built by stacking these two errors:

z_k = \begin{bmatrix} h_P(\epsilon) \\ h_O(\epsilon) \end{bmatrix} + \begin{bmatrix} \nu_k \\ \omega_k \end{bmatrix} = h(\epsilon) + \eta_k    (5.20)
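As an illustration of how the measurement vector (5.17)-(5.20) can be assembled from a measured pose and the pose predicted by the current model, the following numpy sketch may be used; the vee operator extracts the vector part of a skew-symmetric matrix, and all names are illustrative.

```python
import numpy as np

def vee(S):
    """Inverse of the skew-symmetric (hat) operator."""
    return np.array([S[2, 1], S[0, 2], S[1, 0]])

def build_measurement(P_meas, R_vis, P_est, R_est):
    """Stack the position and orientation errors as in (5.17)-(5.20)."""
    e_P = P_meas - P_est                 # position error (5.17)
    R_S = R_vis @ R_est.T                # relative rotation (5.18)
    e_O = vee(R_S - R_S.T) / 2.0         # orientation error (5.18)
    return np.concatenate([e_P, e_O])    # measurement z_k (5.20)
```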


Implementation and Testing


For this thesis, a visual-based system to perform autonomous object recognition and robot model calibration has been implemented and tested.

Figure 5.1: System Architecture - Two different pipelines are implemented to perform autonomous grasping

The realized framework implements two systems:

• A Perception Pipeline
• A Calibration Pipeline

A simple diagram illustrating how these pipelines might be integrated in a robotic platform is shown in figure 5.1. In the following, a more detailed description of these systems will be presented, together with a characterization of the robotic platform used to test the calibration system.


Perception Pipeline

Figure 6.1: Perception Pipeline Architecture

The perception pipeline, as shown in figure 6.1, should be designed to perform:

• Object Recognition
• Pose Estimation of the graspable objects

These two objectives are reached using two different and independent systems.


6.1 Object Detection

Object recognition is nowadays commonly performed using Convolutional Neural Networks.

These algorithms are able to recognize objects in an image, and have to be adapted to run on the video feed from the vision system. This can be achieved exploiting ROS [41], by programming a node that subscribes to the topic where the camera images are published. These images are passed to a callback function, where the appropriate algorithm is used to perform object recognition. Two different publishers are defined: one sends the information about the recognized objects (classes, confidences, pixel coordinates of the bounding boxes), while the other publishes the camera image with the algorithm results overlaid (a minimal sketch of this node structure is given after the list below). The choice of the algorithm has been made evaluating different criteria:

• Computational Effort
• Speed
• Capability to segment the recognized object
• Possibility to easily train the network with custom datasets
• ROS Integration
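The sketch below shows the node structure described above in rospy; the topic names, the message type chosen for the detections and the run_detector() call are placeholders that depend on the selected algorithm.

```python
#!/usr/bin/env python
import rospy
from sensor_msgs.msg import Image
from std_msgs.msg import String
from cv_bridge import CvBridge

bridge = CvBridge()

def run_detector(frame):
    """Placeholder for the chosen recognition algorithm; returns (detections, annotated image)."""
    raise NotImplementedError

def image_callback(msg):
    frame = bridge.imgmsg_to_cv2(msg, desired_encoding='bgr8')
    detections, annotated = run_detector(frame)
    # classes, confidences and bounding box pixel coordinates (serialized as a string here)
    det_pub.publish(String(data=str(detections)))
    # camera image with the algorithm results overlaid
    img_pub.publish(bridge.cv2_to_imgmsg(annotated, encoding='bgr8'))

if __name__ == '__main__':
    rospy.init_node('object_detection_node')
    det_pub = rospy.Publisher('/detections', String, queue_size=1)
    img_pub = rospy.Publisher('/detection_image', Image, queue_size=1)
    rospy.Subscriber('/camera/color/image_raw', Image, image_callback, queue_size=1)
    rospy.spin()
```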

In terms of real-time performance, computational effort and possibility to integrate the system with ROS, YOLO seemed to be the ideal solution. In fact, a ROS package to use YOLO on GPU and CPU was developed by the ETH Zurich Robotic System Lab. The pre-trained model of the convolutional neural network is able to detect the pre-trained classes of the VOC and COCO data sets. Using the provided ROS package, it is possible to obtain near real-time performance on a platform equipped with an Intel Core i7-7700HQ processor and an NVIDIA GeForce GTX 1060 Max-Q GPU.

However, since real-time performance is not critical and given the known issues of the YOLO algorithm in cluttered scenes, it has been decided to use a different algorithm. In fact, the possibility to perform object segmentation and a better handling of cluttered scenes drove the choice towards a different solution, although at the expense of processing speed and with an increase of the computational load.


Figure 6.2: Comparison between Detectron (left) and YOLOv3 (right)

Since the objective was to obtain object segmentation, an interesting possibility was Detectron [42], a Facebook AI Research software system that implements state-of-the-art object detection algorithms, including Mask R-CNN. Detectron is written in Python and powered by the Caffe2 deep learning framework. A comparison between the end-to-end trained Mask R-CNN model with a ResNet-101-FPN backbone from the Detectron model zoo and YOLOv3 is shown in figure 6.2, where it can be clearly seen that Detectron is capable of detecting objects even in cluttered scenes.

6.1.1 Detectron - ROS Integration

Facebook AI Research (FAIR) open sourced Detectron in January 2018. The provided scripts are capable of running object recognition algorithms over a set of images and saving the segmented images to file. To use it on a robotic platform, it has to be adapted to run in real time over the camera feed.

Figure 6.3: Detectron - ROS Integration


A ROS node, programmed in Python, subscribes to the camera image topic, and inference is performed on the received image. On a platform equipped with an Intel Core i7-7700HQ processor and an NVIDIA GeForce GTX 1060 Max-Q GPU, around 0.3 seconds are needed to process an image. Since in rospy the subscriber callbacks are executed in separate threads, a resource locking mechanism is used to improve the real-time behaviour of the system. When the callback is executed, the program checks whether the resource is already in use, i.e. whether a previous frame is currently being processed by the recognition algorithm. If so, the received image is discarded; otherwise, the callback locks the resource and the received image is processed.

When an image is processed, one publisher sends the information about the recognized objects (classes, confidences, pixel coordinates of the bounding boxes) on a dedicated topic, while a second one publishes the camera image with the segmented objects.
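A sketch of the frame-dropping mechanism described above is given here, assuming a threading.Lock and a placeholder run_detectron() inference call; it is only illustrative of the non-blocking acquire-and-discard pattern.

```python
import threading
from cv_bridge import CvBridge

bridge = CvBridge()
busy = threading.Lock()

def run_detectron(frame):
    """Placeholder for Detectron inference (around 0.3 s per image on the test platform)."""
    raise NotImplementedError

def image_callback(msg):
    # Non-blocking acquire: if a previous frame is still being processed,
    # discard the incoming image instead of queueing it.
    if not busy.acquire(False):
        return
    try:
        frame = bridge.imgmsg_to_cv2(msg, desired_encoding='bgr8')
        detections, segmented = run_detectron(frame)
        # publish detections and segmented image here (publishers omitted for brevity)
    finally:
        busy.release()
```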

Figure 6.4: Detectron - Application Results

At the moment, the system has been implemented using the trained network provided by Detectron, but it is possible to train the network with custom datasets.


6.2 Pose Estimation

The second sub-system that forms the perception pipeline is the Pose Estimation system. Robust pose estimation of the object to manipulate is critical for a successful grasp operation. There exist many ways to implement the pose estimation pipeline; in this work it has been decided to exploit the algorithms provided by the Object Recognition Kitchen. Object Recognition Kitchen (ORK) is a project started at Willow Garage for object recognition [43]. It was designed to easily develop and run simultaneously several object recognition techniques and provides an integration with the ROS ecosystem.

This software recognizes and estimates the pose of objects whose model is known. The model can be acquired using a dedicated procedure, where the object is placed in the center of a fiducial board; then, by slowly turning the board, different views are captured. Alternatively, a CAD model of the object can be used. The model is then uploaded to the CouchDB database [44].

Figure 6.5: ORK - Model Capture Procedure

Several object recognition pipelines have been implemented for the ORK framework, but in this work it has been decided to focus on two of them:

• LINE-MOD
• Tabletop


Pipeline  | Type of Objects                                    | Limitations
LINE-MOD  | Rigid                                              | Does not work with partial occlusions
Tabletop  | Rotationally symmetric objects on a planar surface | The object is assumed to be on a table with no 3D rotation

Table 6.1: ORK Pipelines Characteristics

LINE-MOD is the ideal solution for generic rigid object recognition and it proceeds using very fast template matching. The version used in the package is the same as in the original paper, but it uses an automatic view generator to create all the templates: thousands of images with depth and mask are generated and fed to the OpenCV trainer. Then, a post-processing step to refine the pose using ICP is implemented.

Tabletop is a good solution in presence of symmetric objects lying on a table. This object detection method has two parts: a table finder and an object recognizer. It performs:

• Segmentation: the table is detected by finding the dominant plane in the depth map based on analysis of 3D normal vectors; points above the table are considered to belong to graspable objects (a schematic sketch of this step is shown below).
• Recognition: for each cluster, a simple iterative fitting technique is used. If a good fit is found, the database id of the model is returned along with the cluster.

Object recognition can only handle translation along the X and Y axes, while the Z axis is assumed to be pointing up relative to the table.
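The table-finding step can be illustrated with a simple RANSAC plane fit over the point cloud: the dominant plane is estimated, the normal is oriented towards the sensor, and the points sufficiently above the plane are kept as candidate object points. This is only a schematic sketch of the idea, not the actual Tabletop implementation.

```python
import numpy as np

def fit_dominant_plane(points, iters=200, dist_thresh=0.01):
    """RANSAC estimate of the dominant plane n.p + d = 0 in an (N, 3) point cloud."""
    rng = np.random.default_rng(0)
    best_inliers, best_plane = 0, None
    for _ in range(iters):
        p1, p2, p3 = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(p2 - p1, p3 - p1)
        if np.linalg.norm(n) < 1e-9:
            continue                      # degenerate (collinear) sample
        n = n / np.linalg.norm(n)
        d = -n.dot(p1)
        inliers = np.count_nonzero(np.abs(points @ n + d) < dist_thresh)
        if inliers > best_inliers:
            best_inliers, best_plane = inliers, (n, d)
    return best_plane

def points_above_table(points, n, d, min_height=0.01):
    """Keep points on the sensor side of the table plane, above a minimum height."""
    if d < 0:                             # orient the normal so the sensor origin is on the positive side
        n, d = -n, -d
    heights = points @ n + d
    return points[heights > min_height]
```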


Calibration Pipeline

The calibration pipeline is dedicated to the estimation of the camera-base transformation and of the joint angle biases, and to the online adaptation of the estimated model during the grasping operation. The methods used to learn and adapt the model are those presented previously, and are all based on a measurement of the end-effector pose in the camera reference frame. The next sections examine the robotic platform used to test the calibration procedure and how the measurement of the hand pose using vision has been obtained in this work.

7.1 Testing Platform

To test the proposed pipeline, it has been decided to use a reproduction of the right arm of the humanoid robot EGO.

EGO is a soft-robotics platform for physical human-robot interaction: a dual-arm humanoid robot provided with two anthropomorphic hands, soft actuation, a vision system and a tele-operation interface.


Figure 7.1: EGO - A Humanoid Robotic Platform

Currently, EGO is tele-operated, hence a human operator is needed to perform manipulation tasks. The 5-DOF arm is composed of 5 variable stiffness actuators, provided by qbrobotics. The end-effector is the Pisa/IIT SoftHand, a simple, robust and effective hand designed for grasping and soft manipulation. The arm can be modelled as a serial chain composed of 5 revolute joints.

Link | a_i | α_i  | d_i     | θ_i
1    | 0   | π/2  | 0       | θ1
2    | 0   | −π/2 | 0       | θ2 + π/2
3    | 0   | −π/2 | 0.168 m | θ3 − π/2
4    | 0   | π/2  | 0       | θ4
5    | 0   | −π/2 | 0.185 m | θ5

Table 7.1: Denavit-Hartenberg Table

The forward kinematics map can be obtained using the Denavit-Hartenberg parameters shown in Table 7.1.
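As a sketch, the forward kinematics map can be computed by chaining the standard Denavit-Hartenberg transforms built from Table 7.1; the classic DH convention is assumed here.

```python
import numpy as np

# DH parameters from Table 7.1: (a_i, alpha_i, d_i, theta_offset_i)
DH = [
    (0.0,  np.pi / 2, 0.0,    0.0),
    (0.0, -np.pi / 2, 0.0,    np.pi / 2),
    (0.0, -np.pi / 2, 0.168, -np.pi / 2),
    (0.0,  np.pi / 2, 0.0,    0.0),
    (0.0, -np.pi / 2, 0.185,  0.0),
]

def dh_transform(a, alpha, d, theta):
    """Homogeneous transform between consecutive frames (classic DH convention)."""
    ct, st = np.cos(theta), np.sin(theta)
    ca, sa = np.cos(alpha), np.sin(alpha)
    return np.array([[ct, -st * ca,  st * sa, a * ct],
                     [st,  ct * ca, -ct * sa, a * st],
                     [0.0,      sa,       ca,      d],
                     [0.0,     0.0,      0.0,    1.0]])

def forward_kinematics(q):
    """End-effector pose in the base frame for the 5 joint angles q."""
    T = np.eye(4)
    for (a, alpha, d, offset), qi in zip(DH, q):
        T = T @ dh_transform(a, alpha, d, qi + offset)
    return T
```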


7.2 Hand Pose Estimation

The estimation of the end-effector pose using vision is not trivial. Many solutions are available, with differences in terms of complexity, computational load and accuracy. In this work, it has been decided to use a marker-based approach, since it is simple to implement and requires limited computational resources. An ArUco marker, applied on the palm of the robot hand, has been used for this purpose.

Figure 7.2: ArUco Marker on the Hand Palm

ArUco markers [45] [46] are synthetic square markers composed of a wide black border and an inner binary matrix which determines their identifier (id). ROS packages are available that, given the marker ID and dimension, return the marker pose in the camera reference frame.

Figure 7.3: Examples of ArUco Marker
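As an example of how the marker pose can be obtained, the sketch below uses the OpenCV ArUco module (pre-4.7 interface); the dictionary, the marker side length and the camera intrinsics are placeholders that must match the actual marker and camera calibration.

```python
import cv2

def estimate_marker_pose(image, camera_matrix, dist_coeffs, marker_length=0.05):
    """Return (rvec, tvec) of the first detected marker in the camera frame, or None."""
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    aruco_dict = cv2.aruco.Dictionary_get(cv2.aruco.DICT_4X4_50)   # placeholder dictionary
    params = cv2.aruco.DetectorParameters_create()
    corners, ids, _ = cv2.aruco.detectMarkers(gray, aruco_dict, parameters=params)
    if ids is None or len(ids) == 0:
        return None
    rvecs, tvecs, _ = cv2.aruco.estimatePoseSingleMarkers(
        corners, marker_length, camera_matrix, dist_coeffs)
    return rvecs[0], tvecs[0]
```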

The choice of a marker-based approach requires a slight modification of the calibration procedures presented in this work. In fact, what is actually measured is the pose of the marker, not the hand pose. The hand pose must be obtained from the knowledge of the transformation between the marker frame and the end-effector frame. This transformation, supposed to be constant, can be expressed using the homogeneous transformation matrix T^M_H.
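For clarity, the hand pose in the camera frame can then be recovered by composing the measured marker pose with this constant transform; a minimal sketch with illustrative names:

```python
import numpy as np

def hand_pose_in_camera(T_cam_marker, T_marker_hand):
    """Compose the measured marker pose with the constant marker-to-hand transform T^M_H.

    Both arguments are 4x4 homogeneous transformation matrices.
    """
    return T_cam_marker @ T_marker_hand
```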

A good knowledge of this transformation is critical to correctly compute the hand pose and to perform a precise calibration. In this case, since the marker is put on the hand palm, the
