University of Pisa
Master Degree in Robotics and
Automation Engineering
Machine learning for grasp
synthesis
Author: Daniela Resasco
Student Id: 511293
Supervisor: Professor Antonio Bicchi
External Supervisor: Professor Kris Hauser
Cosupervisor: Professor Lucia Pallottino
To my little nephew Andrea. May your dreams carry you to see distant horizons, where others draw borders.
Abstract
The goal of this thesis is to have a robot learn how to grasp an unknown object with an underactuated end-effector. We chose to implement regression via a deep convolutional neural network in a supervised learning scenario. We present 2D and 3D deep convolutional neural networks, which take images and voxels from various points of view as input, respectively.
We create a database that matches each image with one feasible pose. The pose is generated using the Minimum Volume Bounding Box method, which minimizes the volume of the boxes that fit partial point clouds. In this way we can focus on the outermost boxes and choose a desired pose that grasps the object successfully.
The simulation is carried out with the Klamp’t simulator, and the neural network is implemented with the Theano library.
We compare the results of our neural network using different loss functions and two network structures: one with regularization and one without.
Contents
1 Introduction
1.1 Overview
1.2 Underactuated hand. Robot vs Human
1.3 Machine Learning
1.3.1 Regression
1.3.2 Neural Networks
1.3.3 Convolutional Neural Networks
1.3.4 Learning rates
1.3.5 State of the art
1.4 Organization
2 Grasp planning with soft hands
3 Teaching robots to grasp with underactuated hands
3.1 Simulation and dataset creation
3.1.1 Grasp pose and dataset generation
3.1.2 Network architecture
3.2 Results
3.2.1 Bounding box method
3.2.2 Learning method
4 Conclusions and future work
List of Figures
1.1 Human hand.
1.2 Pisa/IIT Soft-Hand.
1.3 Reflex
1.4 Human hand vs Soft-Hand.
1.5 Two different methods to control a robotic hand.
1.6 Loss function
1.7 Explanation of gradient descent.
1.8 Biological neural network.
1.9 Neural network.
1.10 Rectifier logical units activation function.
1.11 Sigmoidal activation function.
1.12 Explanation of deep convolutional neural network
2.1 Graphical explanation of the projection procedure.
2.2 Example of first split
2.3 Volume varying for the kettle.
2.4 Area varying for the kettle.
2.5 Graphical explanation of the procedure performed to align the hand with each bounding box.
2.6 In order to generate more poses to grasp each box, the hand is rotated and translated along x axis of the box.
3.1 Reflex
3.2 Proximal finger.
3.3 Distal finger.
3.4 Preshape finger.
3.5 Model of object before and after the point downsampling.
3.6 X-wing representation
3.7 Image taken from various points of view
3.8 Difference of frame between RightHand and Soft-Hand.
3.9 Graphical explanation of the procedure performed to align the hand with each bounding box.
3.10 Input-output mapping representation
3.12 Structure of our algorithm
3.13 Difference between fully connected network and network with regularization
3.14 2D convolutional neural network
3.15 Training the 2D convolutional neural network using dropout
3.16 3D convolutional neural network
3.17 Kindness trend during the unsuccessful grasp
3.18 Kindness trend during the successful grasp
3.19 Behavior of error using quaternion eq 3.5
3.20 Behavior of error using quaternion eq 3.4
3.21 Behavior of error using quaternion eq 3.3
3.22 Behavior of error using Euler angle
3.23 Behavior of error using quaternion eq 3.5 and dropout regularization
3.24 Behavior of error using quaternion eq 3.4 and dropout regularization
3.25 Behavior of error using quaternion eq 3.3 and dropout regularization
3.26 Behavior of error using Euler angle and dropout regularization
List of Tables
3.1 Partial results of Bounding Box algorithm
3.2 Table of metric results for 2DCNN
3.3 Table of metric results for 2DCNN with dropout
Chapter 1
Introduction
1.1 Overview
The goal of this thesis is to have a robot learn how to grasp an unknown object with an underactuated end-effector. We chose to implement regression via a deep convolutional neural network in a supervised learning scenario. This work is a collaboration between the University of Pisa and Duke University in Durham, North Carolina. Our research involves the use of ReFlex grippers, developed by the company Righthand Robotics, in combination with the simulator Kris’ Locomotion and Manipulation Planning Toolbox (Klamp’t), developed at Duke University. We present 2D and 3D deep convolutional neural networks, which take images and voxels from various points of view as input, respectively.
Let us define the learning problem:
Task = grasp an unknown object
Performance measure = percentage of successful grasps
Training experience = grasp examples
We can split the objective into three parts:
• Define a method to grasp an unknown object and evaluate the number of successful grasps.
• Define the network for the machine learning.
• Validate and test the network.
For the first part, the idea is to decompose the object into Minimum Volume Bounding Boxes, minimizing the volume of the boxes which fit partial point clouds [1]. In this way we can focus on the outermost boxes and choose an optimal pose to grasp the object successfully. We use this method since it is able to generate a good number of successful grasps.
For the neural network we use supervised learning. This method needs pairs of input-output examples during the training phase. The output is called the label. Our goal is to grasp an object, so the input is an object representation (e.g., image, occupancy grid, voxels) and the output is a minimal representation of the desired pose, in Euler angles or as a quaternion.
1.2 Underactuated hand. Robot vs Human
In recent years, there has been a great deal of research aiming to establish a seamless connection between human and robotic hand movement. The former is composed of bones, muscles, innervation, veins and tendons, with 25 degrees of freedom (DOF)¹ that allow it to move. Figure 1.1 shows an overview of the human hand. The latter is composed of links, joints and motors. Human muscles contract in reaction to commands coming from the brain and spinal cord, known as the Central Nervous System (CNS). When a neuron is activated, the command is transferred through the tendons to the articulation and the muscle fiber contracts.
¹ Degrees of freedom (DOF) is a term used to describe freedom of motion in three-dimensional space in a mechanical system; in particular, it refers to the ability to move forward and backward, up and down, and to the left and to the right. Each degree of freedom corresponds to a joint.
(a) Human hand (b) DOF
Figure 1.1: Human hand.
One of the objectives of robot hand research is to reproduce this behavior. Research in the field of neuroscience has demonstrated the existence of motor synergies, i.e., groups of muscles, joints and tendons that move together as a single element in response to a nervous impulse. To reproduce the functionality we do not need to reproduce the individual components of the hand, but the movements as a whole. Various robotics laboratories work to develop a robotic hand that follows this idea; e.g., the University of Pisa, in collaboration with the Italian Institute of Technology in Genoa, developed the Pisa/IIT Soft-Hand. It is an underactuated hand with 18 anthropomorphic joints and one synergy [2] [3]. This means that it requires only one motor, as illustrated in figure 1.2.
Figure 1.2: Pisa/IIT Soft-Hand.
The company Righthand Robotics developed the ReFlex gripper. It consists of three modular fingers and a central chassis that groups together all the actuators, the preshape transmission, the palm electronics, and the interface electronics (see figure 1.3).
Figure 1.3: Reflex
Each finger is controlled by a single actuator that drives a tendon spanning both the proximal and distal joint.
Both examples use a tendon to move the fingers. This tendon acts as the muscle, while the motor can be understood as the neuron that sends the command to the articulation. In summary, the robot hand has links as bones, joints as articulations and tendons as muscles (see figure 1.4).
(a) Human hand (b) Soft-hand
Figure 1.4: Human hand vs Soft-Hand.
Although this scheme can simulate the behavior of the hand, a controller for its movements is needed. The control can differ depending on the specific task. If the hand is used in the biomedical field, the control can be achieved using the brain, as in mind-controlled prosthetics [4] [5] [6]. If the task is manipulation, we can use grasp planning control [7] [8] [1] [9] or learning methods [10] [11].
(a) Johns Hopkins research (b) Soft-hand with Kuka arm
Figure 1.5: Two different methods to control a robotic hand.
1.3 Machine Learning
An agent is learning if it improves its performance on future tasks after making observations about the world. Why would we want an agent to learn? If the design of an agent can be improved, why would the designers not just program that improvement to begin with? There are three main reasons:
• The designers cannot anticipate all possible situations that the agent might find itself in.
• The designers cannot anticipate all possible changes over time.
• Sometimes designers are clueless about how to implement an optimal solution.
Face recognition, for example, is a very difficult task to perform without the aid of a machine learning algorithm. The same happens if we want to design a robot that plays checkers or speaks with people. It is difficult, or impossible, to predict the future.
There are three main types of learning. In unsupervised learning the agent learns patterns in the input even though no explicit feedback is supplied. Unsupervised learning allows us to approach problems with little or no idea of what our results should look like. We can derive structure from data where we do not necessarily know the effect of the variables, by clustering the data based on relationships among the variables. With unsupervised learning there is no feedback based on the prediction results, i.e., there is no teacher to correct you.
In reinforcement learning the agent learns from a series of reinforcements (rewards or punishments). Reinforcement learning allows the machine or software agent to learn its behavior based on feedback from the environment. This behavior can be learned once and for all, or keep adapting as time goes by. If the problem is modeled with care, some reinforcement learning algorithms can converge to the global optimum; this is the ideal behavior that maximizes the reward.
In supervised learning the agent observes some example input-output pairs and learns a function that maps from input to output. We are given a data set and already know what the correct output should look like, on the assumption that there is a relationship between the input and the output. Supervised learning problems are categorized into regression and classification problems. In a regression problem, we try to predict results within a continuous output, meaning that we map input variables to some continuous function. In a classification problem, we instead try to predict results in a discrete output; in other words, we map input variables into discrete categories. If the output is a number (e.g., a temperature or the price of a house), it is a regression problem; if the output is one of a finite set of values (e.g., the day of the week or a weather forecast), it is a classification problem. Classification is called Boolean or binary if there are only two possible outputs. In a regression problem we look for a conditional expectation or average value of the output, because the probability of finding exactly the right real-valued number for the output is zero.
The task of supervised learning is this:
Given a training set of $N$ example input-output pairs
$(x_1, y_1), (x_2, y_2), \ldots, (x_j, y_j), \ldots, (x_N, y_N)$,
where each $y_j$ was generated by an unknown function $y_j = f(x_j)$,
discover a function $h$ that approximates the true function $f$.
Here $x$ and $y$ can be any value; they need not be numbers. The function $h$ is a hypothesis. Learning is a search through the space of possible hypotheses, called the hypothesis space $H$, for one that will perform well, even on new examples beyond the training set. To measure the accuracy of a hypothesis we give it a test set distinct from the training set. We say that $h$ generalizes well if it correctly predicts the value of $y$ for novel examples. If $f$ is stochastic, we have to learn a conditional probability distribution $P(Y \mid x)$. How do we choose from among multiple consistent hypotheses? One answer is to prefer the simplest hypothesis consistent with the data. This principle is called Ockham’s razor. If the same data can be fitted adequately by both a first-degree and a sixth-degree polynomial, the former is chosen according to this principle. There is a trade-off between complex hypotheses that fit the training data well and simple hypotheses that generalize better.
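The trade-off between simple and complex hypotheses can be illustrated with a small numerical experiment. The sketch below is only an illustration under stated assumptions: the data are synthetic (a hypothetical linear function plus Gaussian noise), the helper `fit_and_score` is introduced here for the example, and NumPy's `polyfit` stands in for the learning algorithm.

```python
import numpy as np

def fit_and_score(x_train, y_train, x_test, y_test, degree):
    """Fit a polynomial of the given degree by least squares and
    return (training MSE, test MSE)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = float(np.mean((np.polyval(coeffs, x_train) - y_train) ** 2))
    test_mse = float(np.mean((np.polyval(coeffs, x_test) - y_test) ** 2))
    return train_mse, test_mse

rng = np.random.default_rng(0)
# Hypothetical data: true function f(x) = 2x + 1 plus small Gaussian noise.
x_train = np.linspace(0.0, 1.0, 10)
y_train = 2.0 * x_train + 1.0 + rng.normal(0.0, 0.1, x_train.shape)
x_test = np.linspace(0.05, 0.95, 50)
y_test = 2.0 * x_test + 1.0 + rng.normal(0.0, 0.1, x_test.shape)

simple = fit_and_score(x_train, y_train, x_test, y_test, degree=1)
complex_ = fit_and_score(x_train, y_train, x_test, y_test, degree=6)
print("degree 1 (train MSE, test MSE):", simple)
print("degree 6 (train MSE, test MSE):", complex_)
```

The sixth-degree fit can never have a larger training error than the first-degree fit, because the first-degree polynomials are a subset of its hypothesis space; whether it also generalizes better is what the test error measures.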
1.3.1 Regression
A linear function with input $x$ and output $y$ has the form $y = w_1 x + w_0$, where $w_0$ and $w_1$ are real-valued coefficients to be learned. These values are called weights. We have to measure how close the truth, $y$, is to the prediction $f(x)$. We could try $y - f(x)$ or $f(x) - y$, but that is not a good idea, because each looks at errors in only one direction and we want to penalize errors in every direction. One solution is to use the squared error summed over all the training examples, $\sum_{j=1}^{N} (y_j - f(x_j))^2$, known as the $L_2$ loss ². We want the sum of squared errors to be small, so we choose $w_0$ and $w_1$ to minimize the total error on the training set; this is the procedure of simple least-squares regression. We do not need to find $w_1$ and $w_0$ by hand, because the machine learning algorithm will do it for us. The sum of squared errors is minimized when its partial derivatives with respect to $w_0$ and $w_1$ are zero:
$$\frac{\partial}{\partial w_0} \sum_{j=1}^{N} (y_j - (w_1 x_j + w_0))^2 = 2\Big(w_1 \sum x_j + N w_0 - \sum y_j\Big) = 0 \tag{1.1}$$
$$\frac{\partial}{\partial w_1} \sum_{j=1}^{N} (y_j - (w_1 x_j + w_0))^2 = 2\Big(w_1 \sum x_j^2 + w_0 \sum x_j - \sum x_j y_j\Big) = 0 \tag{1.2}$$
These equations have a unique solution:
$$w_1 = \frac{N \sum x_j y_j - (\sum x_j)(\sum y_j)}{N \sum x_j^2 - (\sum x_j)^2} \tag{1.3}$$
$$w_0 = \frac{\sum y_j - w_1 \sum x_j}{N} \tag{1.4}$$
² $L_2$ is appropriate when there is normally-distributed noise that is independent of
Figure 1.6: Loss function
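The closed-form solution of equations 1.3 and 1.4 can be checked numerically. The following is a minimal sketch, assuming NumPy arrays as input; the function name `least_squares_line` and the noise-free sample data are hypothetical and chosen only for illustration.

```python
import numpy as np

def least_squares_line(x, y):
    """Closed-form simple linear regression (equations 1.3 and 1.4)."""
    n = len(x)
    sum_x, sum_y = np.sum(x), np.sum(y)
    sum_xy, sum_xx = np.sum(x * y), np.sum(x * x)
    w1 = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    w0 = (sum_y - w1 * sum_x) / n
    return w0, w1

# Hypothetical noise-free data generated by y = 3x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 3.0 * x + 1.0
w0, w1 = least_squares_line(x, y)
print(w0, w1)
```

Because the sample data lie exactly on a line, the recovered weights coincide with the generating intercept and slope, which makes the formulas easy to verify by hand.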
As shown in figure 1.6, the loss function is convex. This is true for every linear regression problem with an $L_2$ loss function, and implies that there are no local minima. What happens when the minimum of the loss function has no closed-form solution? In this case we use gradient descent. We choose the weight update to be a small step in the direction of the negative gradient. After each update, the gradient is re-evaluated for the new weight vector and the process is repeated until we converge on the minimum possible loss; see algorithm 1 and figure 1.7.
Algorithm 1 Gradient descent
procedure GRADIENT-DESCENT(point in weight space)
    w ← any point in the parameter space
    loop until convergence do
        for each w_i in w do
            w_i ← w_i − α ∂/∂w_i Loss(w)
        end for
    end loop
end procedure
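The parameter $\alpha$ is called the step size or learning rate; it can be a fixed constant $\alpha > 0$ or a small value that decays over time as the learning proceeds. The parameter $w$ is the vector $[w_0, w_1]$. We can find the partial derivatives by remembering that:
$$Loss(h_w) = \sum_{j=1}^{N} (y_j - h_w(x_j))^2 = \sum_{j=1}^{N} (y_j - (w_1 x_j + w_0))^2 \tag{1.5}$$
So, for a single training example $(x, y)$:
$$\frac{\partial}{\partial w_i} Loss(w) = \frac{\partial}{\partial w_i} (y - h_w(x))^2 \tag{1.6}$$
$$= 2(y - h_w(x)) \times \frac{\partial}{\partial w_i} (y - h_w(x)) \tag{1.7}$$
$$= 2(y - h_w(x)) \times \frac{\partial}{\partial w_i} (y - (w_1 x + w_0)) \tag{1.8}$$
Applying this to both $w_0$ and $w_1$ we get:
$$\frac{\partial}{\partial w_0} Loss(w) = -2(y - h_w(x)) \tag{1.9}$$
$$\frac{\partial}{\partial w_1} Loss(w) = -2(y - h_w(x)) \times x \tag{1.10}$$
Then, folding the constant factor 2 into the learning rate $\alpha$ and summing over the $N$ training examples, we get the following learning rule for the weights:
$$w_0 \leftarrow w_0 + \alpha \sum_j (y_j - h_w(x_j)) \tag{1.11}$$
$$w_1 \leftarrow w_1 + \alpha \sum_j (y_j - h_w(x_j)) \times x_j \tag{1.12}$$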
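The learning rule for $w_0$ and $w_1$ can be tried on a toy problem. This is a sketch under assumptions: the update is summed over all training examples (batch form), the function name `batch_gradient_descent`, the step size, the number of steps and the sample data are hypothetical choices, and a fixed iteration count replaces a real convergence test.

```python
import numpy as np

def batch_gradient_descent(x, y, alpha=0.01, steps=5000):
    """Batch gradient descent for y = w1*x + w0 with the L2 loss.
    The constant factor 2 of the gradient is folded into alpha."""
    w0, w1 = 0.0, 0.0
    for _ in range(steps):
        residual = y - (w1 * x + w0)          # y_j - h_w(x_j) for every example
        w0 += alpha * np.sum(residual)        # update for w0
        w1 += alpha * np.sum(residual * x)    # update for w1
    return w0, w1

# Hypothetical noise-free data generated by y = 3x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 3.0 * x + 1.0
w0, w1 = batch_gradient_descent(x, y)
print(w0, w1)
```

With this small step size the iteration approaches the same weights as the closed-form least-squares solution; a larger `alpha` can make the updates diverge, which is why the step size has to be chosen small enough.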
Convergence to the unique global minimum is guaranteed, as long as we pick $\alpha$ small enough, but may be very slow. Techniques that use the whole data set at once are called batch methods. At each step the weight vector is moved in the direction of the greatest rate of decrease of the error function, and so this approach is known as gradient descent or steepest descent. Although such an approach might intuitively seem reasonable, in fact it turns out to be a poor algorithm, for reasons discussed in [13]. For batch optimization, there are more efficient methods, such as conjugate gradients and quasi-Newton methods, which are much more robust and much faster than simple gradient descent [14] [15] [16]. Unlike gradient descent, these algorithms have the property that the error function always decreases at each iteration unless the weight vector has arrived at a local or global minimum. In order to find a sufficiently good minimum, it may be necessary to run a gradient-based algorithm multiple times, each time using a different randomly chosen starting point, and comparing the resulting performance on an independent validation set.
(a)
(b)
Figure 1.7: Explanation of gradient descent.
Now we must find an efficient technique for evaluating the gradient of an error function $E(W)$ for a feed-forward neural network. We will use backpropagation. In order to clarify the terminology, it is useful to consider the nature of the training process more carefully. Most training algorithms involve an iterative procedure for minimization of an error function, with adjustments to the weights being made in a sequence of steps. At each such step, we can distinguish between two stages. In the first stage, the derivatives of the error function with respect to the weights are evaluated. As we will see, the advantage of the backpropagation technique is that it provides a computationally efficient method for evaluating such derivatives. Because it is at this stage that errors are propagated backwards through the network, we use the term backpropagation specifically to describe the evaluation of derivatives. In the second stage, the derivatives are used to compute the adjustments to be made to the weights. The simplest such technique, and the one originally considered by [17], involves gradient descent. The first stage, namely the propagation of errors backwards through the network in order to evaluate derivatives, can be applied to many other kinds of network and not just the multilayer perceptron ⁴. See algorithm 2.
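The two stages described above can be sketched for a small network with one hidden layer. This is only an illustration under assumptions: sigmoid activations, the loss $\frac{1}{2}\lVert y - a \rVert^2$, no bias terms, and hypothetical function names, layer sizes and sample values; it is not the network used in this thesis.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, W2):
    """Propagate the input forward; returns hidden and output activations."""
    a1 = sigmoid(W1 @ x)     # hidden layer
    a2 = sigmoid(W2 @ a1)    # output layer
    return a1, a2

def loss(x, y, W1, W2):
    return 0.5 * float(np.sum((y - forward(x, W1, W2)[1]) ** 2))

def backprop_gradients(x, y, W1, W2):
    """Stage 1: derivatives of the loss with respect to W1 and W2."""
    a1, a2 = forward(x, W1, W2)
    delta2 = a2 * (1.0 - a2) * (y - a2)           # g'(in_j) * (y_j - a_j)
    delta1 = a1 * (1.0 - a1) * (W2.T @ delta2)    # g'(in_i) * sum_j w_ij * delta_j
    # The gradient of the loss is minus the outer product of deltas and inputs.
    return -np.outer(delta1, x), -np.outer(delta2, a1)

def train_step(x, y, W1, W2, alpha=0.5):
    """Stage 2: use the derivatives for a gradient descent update."""
    g1, g2 = backprop_gradients(x, y, W1, W2)
    return W1 - alpha * g1, W2 - alpha * g2

rng = np.random.default_rng(0)
W1 = rng.uniform(-0.5, 0.5, (3, 2))   # 2 inputs -> 3 hidden units
W2 = rng.uniform(-0.5, 0.5, (1, 3))   # 3 hidden units -> 1 output
x, y = np.array([0.5, -1.0]), np.array([1.0])

before = loss(x, y, W1, W2)
for _ in range(100):
    W1, W2 = train_step(x, y, W1, W2)
after = loss(x, y, W1, W2)
print(before, after)
```

Keeping `backprop_gradients` separate from `train_step` mirrors the distinction made in the text: the first function only evaluates derivatives, so it could be combined with an optimizer other than plain gradient descent.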
Regression and classification can be carried out with different types of machine learning methods. The best known are Decision Trees, Artificial Neural Networks and Convolutional Neural Networks (CNN). In this project we use the last one.
1.3.2 Neural Networks
Neural networks are composed of a large number of nodes or units connected by directed links. A link from unit $i$ to unit $j$ serves to propagate the activation $a_i$ from $i$ to $j$. Each link also has a numeric weight $w_{ij}$ associated with it, which determines the strength and sign of the connection. The weights are initialized with random numbers within defined bounds, which depend on the specific activation function. From [18] we know that:
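$$W_{bound} = \left[-\sqrt{\frac{6}{f_{in} + f_o}},\ \sqrt{\frac{6}{f_{in} + f_o}}\right] \quad \text{for tanh} \tag{1.13}$$
$$W_{bound} = \left[-4\sqrt{\frac{6}{f_{in} + f_o}},\ 4\sqrt{\frac{6}{f_{in} + f_o}}\right] \quad \text{for sigmoid} \tag{1.14}$$
where $f_{in}$ is the dimensionality of the inputs and $f_o$ is the number of hidden units. Each unit $j$ first computes a weighted sum of its inputs:
$$in_j = \sum_{i=0}^{n} w_{ij} a_i \tag{1.15}$$
Then it applies an activation function $g$ to this sum to derive the output:
$$a_j = g(in_j) = g\Big(\sum_{i=0}^{n} w_{ij} a_i\Big) \tag{1.16}$$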
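The initialization bounds and the unit computation of equations 1.13 to 1.16 can be sketched as follows. The function names `init_weights` and `layer_output` and the layer sizes are hypothetical, and the weight matrix is stored so that entry `W[i, j]` holds $w_{ij}$; this is an illustration, not the Theano code used later in the thesis.

```python
import numpy as np

def init_weights(fan_in, fan_out, activation="tanh", rng=None):
    """Draw weights uniformly within the bound of eq. 1.13 (tanh)
    or eq. 1.14 (sigmoid, four times larger)."""
    rng = np.random.default_rng() if rng is None else rng
    bound = np.sqrt(6.0 / (fan_in + fan_out))
    if activation == "sigmoid":
        bound *= 4.0
    return rng.uniform(-bound, bound, size=(fan_in, fan_out))

def layer_output(a, W, g=np.tanh):
    """For every unit j: in_j = sum_i w_ij * a_i, then a_j = g(in_j)."""
    return g(a @ W)

rng = np.random.default_rng(0)
W = init_weights(4, 3, "tanh", rng)      # 4 inputs feeding 3 units
a_in = np.array([1.0, 0.2, -0.5, 0.7])   # hypothetical input activations
a_out = layer_output(a_in, W)
print(a_out)
```

If a bias is needed, one common convention is to fix one input activation to 1 so that the corresponding weight acts as the bias of each unit; the example sets the first input to 1 in that spirit.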
Let us consider a few facts from neurobiology. The human brain can be described as a biological neural network with an interconnected web of neurons transmitting elaborate patterns of electrical signals. Dendrites receive input signals and, based on those inputs, fire an output signal via an axon, as shown in figure 1.8. The human brain is estimated to contain a densely interconnected network of approximately $10^{11}$ neurons, each connected, on average, to $10^4$ others. Neuron activity is typically excited or inhibited through connections to other neurons. The switching speed of an individual neuron, together with the time needed for a task (e.g., recognizing a known face takes around 0.1 s), places a limit on the number of neuronal firings that can take place in sequence. This observation has led many to speculate that the information-processing abilities of biological neural systems must follow from highly parallel processes operating on representations that are distributed over many neurons. One motivation for neural network systems is to capture this kind of highly parallel computation based on distributed representations. The objective of a neural network is to model the behavior of the biological system shown in figure 1.8 with the scheme presented in figure 1.9.
The activation function $g$ is typically either a hard threshold, as shown in figure 1.10, or a logistic function, as shown in figure 1.11. A feed-forward network has connections only in one direction, and it forms a directed acyclic graph. Every node receives input from “upstream” nodes and delivers output to “downstream” nodes; there are no loops. A feed-forward network represents a function of its current input; thus, it has no internal state other than the weights themselves. A recurrent network, on the other hand, feeds its outputs back into its own inputs. Moreover, the response of the network to a given input depends on its initial state, which may depend on previous inputs. Hence, recurrent networks can support short-term memory. This makes them more interesting as models of the brain, but also more difficult to understand. Feed-forward networks are usually arranged in layers, such that each unit receives input only from units in the immediately preceding layer. A network with all the inputs connected directly to the output is called a single-layer neural network, or a perceptron network.
⁴ A perceptron network is a network with all the inputs connected directly to the output.
Algorithm 2 Backpropagation
procedure BACKPROPAGATION(examples: input-output pairs (x, y); network)
    for each weight w_ij in network do
        w_ij ← a small random number
    end for
    repeat
        for each example (x, y) in examples do
            /* Propagate the inputs forward to compute the outputs */
            for each node i in the input layer do
                a_i ← x_i
            end for
            for l = 2 to L do
                for each node j in layer l do
                    in_j ← Σ_i w_ij a_i
                    a_j ← g(in_j)
                end for
            end for
            /* Propagate deltas backward from output layer to input layer */
            for each node j in the output layer do
                Δ[j] ← g′(in_j) × (y_j − a_j)
            end for
            for l = L − 1 to 1 do
                for each node i in layer l do
                    Δ[i] ← g′(in_i) Σ_j w_ij Δ[j]
                end for
            end for
            /* Update every weight in network using deltas */
            for each weight w_ij in network do
                w_ij ← w_ij + α × a_i × Δ[j]
            end for
        end for
    until some stopping criterion is satisfied
    return network
end procedure
1.3.3 Convolutional Neural Networks
The name convolutional neural network indicates that the network employs a mathematical operation called convolution. Convolution is a specialized kind of linear operation. Convolutional networks are simply neural networks that use convolution in place of general matrix multiplication in at least one of their layers.
Figure 1.8: Biological neural network.
Figure 1.9: Neural network.
Figure 1.11: Sigmoidal activation function.
For functions of a real-valued argument, the convolution is:
$$s(t) = (x * \omega)(t) = \int x(a)\,\omega(t - a)\,da, \qquad x, \omega \in \mathbb{R} \tag{1.17}$$
and, when $t$ takes only integer values, $s(t) = \sum_{a=-\infty}^{\infty} x(a)\,\omega(t - a)$.
Where:
• $s(t)$ is the feature map.
• $x$ is the input.
• $\omega$ is a weighting function. It needs to be a valid probability density function, or the output is not a weighted average. We will refer to it as the kernel.
• $a$ is the age of a measurement.
In machine learning applications, the input is usually a multidimensional array of data and the kernel is usually a multidimensional array of parameters that are adapted by the learning algorithm. We will refer to these multidimensional arrays as tensors. Because each element of the input and kernel must be explicitly stored separately, we usually assume that these functions are zero everywhere but the finite set of points for which we store the values. This means that in practice we can implement the infinite summation as a summation over a finite number of array elements. In two dimensions the convolution becomes:
$$S(i, j) = (I * K)(i, j) = \sum_m \sum_n I(i - m, j - n)\,K(m, n) \tag{1.18}$$
where $I$ is our two-dimensional image input and $K$ a two-dimensional kernel. Convolution is commutative, meaning we can equivalently write:
$$S(i, j) = (K * I)(i, j) = \sum_m \sum_n I(m, n)\,K(i - m, j - n) \tag{1.19}$$
Instead, many neural network libraries implement a related function called the cross-correlation, which is the same as convolution but without flipping the kernel:
$$S(i, j) = (I * K)(i, j) = \sum_m \sum_n I(i + m, j + n)\,K(m, n) \tag{1.20}$$
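The difference between convolution and cross-correlation can be made concrete with a small example. The sketch below assumes "valid" outputs (the kernel never leaves the image) and uses explicit loops for readability; the function names and the sample arrays are hypothetical, and a practical implementation would rely on a library routine instead.

```python
import numpy as np

def cross_correlate2d(image, kernel):
    """Valid cross-correlation: S(i, j) = sum_m sum_n I(i+m, j+n) K(m, n)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def convolve2d(image, kernel):
    """Valid convolution: cross-correlation with the kernel flipped on both axes."""
    return cross_correlate2d(image, kernel[::-1, ::-1])

image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.array([[1.0, 0.0],
                   [0.0, -1.0]])
corr = cross_correlate2d(image, kernel)   # I(i, j) - I(i+1, j+1) at every position
conv = convolve2d(image, kernel)          # sign flips because the kernel is flipped
print(corr)
print(conv)
```

For a kernel that is symmetric under the flip the two operations give the same result, which is why the distinction often does not matter when the kernel is learned.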
In many applications of pattern recognition, it is known that predictions should be unchanged, or invariant, under one or more transformations of the input variables. If sufficiently large numbers of training patterns are available, then an adaptive model such as a neural network can learn the invariance, at least approximately. This involves including within the training set a sufficiently large number of examples of the effects of the various transformations. Thus, for translation invariance in an image, the training set should include examples of objects at many different positions. This approach may be impractical, however, if the number of training examples is limited, or if there are several invariants (because the number of combinations of transformations grows exponentially with the number of such transformations). We therefore seek alternative approaches for encouraging an adaptive model to exhibit the required invariances, such as:
• The training set is augmented using replicas of the training patterns, transformed according to the desired invariances.
• A regularization term is added to the error function that penalizes changes in the model output when the input is transformed.
• Invariance is built into the pre-processing by extracting features that are invariant under the required transformations.
• The invariance properties are built into the structure of a neural network (shared weights, receptive fields).
All of these methods ignore a key property of images, which is that nearby pixels are more strongly correlated than more distant pixels. Many of the modern approaches to computer vision exploit this property by extracting local features that depend only on small subregions of the image. These notions are incorporated into convolutional neural networks [19], [20] through three mechanisms:
• Local receptive fields
• Weight sharing
• Subsampling
In the convolutional layer the units are organized into planes, each of which is called a feature map. Each unit in a feature map takes inputs only from a small subregion of the image, and all of the units in a feature map are constrained to share the same weight values. It is common to periodically insert a pooling layer between successive convolutional layers. Its function is to progressively reduce the spatial size of the representation, which reduces the number of parameters and the amount of computation in the network, and hence also helps to control overfitting. Formally, after obtaining the convolved features as described earlier, we decide the size of the pooling region. We then divide the convolved features into disjoint regions of that size and take the mean (or maximum) feature activation over each region to obtain the pooled convolved features.
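The pooling step over disjoint regions can be sketched in a few lines. This is an illustration under assumptions: square regions of side `size`, rows and columns that do not fill a complete region are dropped, and the function name `pool2d` and the sample feature map are hypothetical.

```python
import numpy as np

def pool2d(feature_map, size, mode="max"):
    """Pool a 2D feature map over disjoint size x size regions."""
    h, w = feature_map.shape
    h, w = h - h % size, w - w % size            # drop incomplete regions
    blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))           # maximum activation per region
    return blocks.mean(axis=(1, 3))              # mean activation per region

feature_map = np.arange(16, dtype=float).reshape(4, 4)
pooled_max = pool2d(feature_map, 2, mode="max")
pooled_mean = pool2d(feature_map, 2, mode="mean")
print(pooled_max)
print(pooled_mean)
```

A 4 × 4 map pooled with regions of side 2 becomes a 2 × 2 map, so every later layer sees a quarter as many activations.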
The whole structure of a CNN can be depicted as in figure 1.12.
1.3.4 Learning rates
There are several methods to set the learning rate. Delta-bar-delta is based on the idea that if the partial derivative of the loss with respect to a given model parameter keeps the same sign, the learning rate should increase, and otherwise it should decrease. AdaGrad adapts the learning rates of all model parameters by scaling them inversely proportional to the square root of the sum of all the historical squared values of the gradient. A third option is the Adaptive Moment Estimation (Adam) method [21]. Adam is a stochastic gradient descent algorithm based on the estimation of 1st- and 2nd-order moments. The algorithm estimates the 1st-order moment (the gradient mean) and the 2nd-order moment (the element-wise squared gradient) of the gradient using exponential moving averages, and corrects their bias. The final weight update is proportional to the learning rate times the 1st-order moment divided by the square root of the 2nd-order moment. The main difference between Adam and other momentum-based learning updates is that Adam has a bias correction step. From the Adam paper [21]: “Let $f(\theta)$ be a noisy objective function: a stochastic scalar function that is differentiable w.r.t. parameters $\theta$. We are interested in minimizing the expected value of this function, $E[f(\theta)]$ w.r.t. its parameters $\theta$. With $f_1(\theta), \ldots, f_T(\theta)$ we denote the realizations of the stochastic function at subsequent time steps $1, \ldots, T$. The stochasticity might come from the evaluation at random subsamples (minibatches) of datapoints, or arise from inherent function noise. With $g_t = \nabla_\theta f_t(\theta)$ we denote the gradient, i.e., the vector of partial derivatives of $f_t$, w.r.t. $\theta$ evaluated at time step $t$. The algorithm updates exponential moving averages of the gradient ($m_t$) and the squared gradient ($v_t$) where the hyper-parameters $\beta_1, \beta_2 \in [0, 1)$ control the exponential decay rates of these moving averages. The moving averages themselves are estimates of the 1st moment (the mean) and the 2nd raw moment (the uncentered variance) of the gradient. However, these moving averages are initialized as (vectors of) 0’s, leading to moment estimates that are biased towards zero, especially during the initial time steps, and especially when the decay rates are small (i.e. the βs are close to 1). The good news is that this initialization bias can be easily counteracted, resulting in bias-corrected estimates $\tilde{m}_t$ and $\tilde{v}_t$”. See algorithm 3.
Algorithm 3 Adam optimizer
procedure ADAM-UPDATES(β1, β2, f(θ), α, ε)
    m_0 ← 0, v_0 ← 0, t ← 0
    while θ_t not converged do
        t ← t + 1
        Compute gradient: g_t ← ∇_θ f_t(θ_{t−1})
        Update first moment estimate: m_t ← β1 m_{t−1} + (1 − β1) g_t
        Update second moment estimate: v_t ← β2 v_{t−1} + (1 − β2) g_t²
        Correct the bias in the first moment: m̃_t ← m_t / (1 − β1^t)
        Correct the bias in the second moment: ṽ_t ← v_t / (1 − β2^t)
        Compute update: θ_t ← θ_{t−1} − α m̃_t / (√ṽ_t + ε)
    end while
    return θ_t
end procedure
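The Adam update loop can be sketched on a toy objective. The following is a minimal illustration under assumptions: the gradient is supplied as a function `grad`, a fixed number of steps replaces the convergence test, the hyper-parameter defaults and the quadratic objective are hypothetical choices for the example, and `eps` is the small constant added to the denominator for numerical stability.

```python
import numpy as np

def adam_minimize(grad, theta0, alpha=0.001, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=1000):
    """Minimize an objective with Adam, given a function returning its gradient."""
    theta = np.asarray(theta0, dtype=float).copy()
    m = np.zeros_like(theta)   # moving average of the gradient
    v = np.zeros_like(theta)   # moving average of the squared gradient
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1 ** t)   # bias-corrected first moment
        v_hat = v / (1.0 - beta2 ** t)   # bias-corrected second moment
        theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Hypothetical objective f(theta) = ||theta - target||^2 with gradient 2 (theta - target).
target = np.array([1.0, -2.0])
theta = adam_minimize(lambda th: 2.0 * (th - target), np.zeros(2),
                      alpha=0.05, steps=2000)
print(theta)
```

Without the two bias-correction lines the first updates would be scaled by moment estimates that start at zero, which is the initialization bias discussed in the quoted passage.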
Other algorithms for setting the learning rate are momentum and its variation, Nesterov momentum [22] [23]. The method of momentum [24], which we refer to as classical momentum (CM), is a technique for accelerating gradient descent that accumulates a velocity vector in directions of persistent reduction in the objective across iterations. Given an objective function $f(\theta)$ to be minimized, classical momentum is given by:
$$v_{t+1} = \mu v_t - \varepsilon \nabla f(\theta_t) \tag{1.21}$$
$$\theta_{t+1} = \theta_t + v_{t+1} \tag{1.22}$$
where $\varepsilon > 0$ is the learning rate, $\mu \in [0, 1]$ is the momentum coefficient, and $\nabla f(\theta_t)$ is the gradient at $\theta_t$. The variable $v$ plays the role of velocity: it is the direction and speed at which the parameters move through parameter space. The velocity is set to an exponentially decaying average of the negative gradient. The name momentum derives from a physical analogy, in which the negative gradient is a force that moves a particle through the parameter space, according to Newton’s laws of motion. The hyperparameter $\mu$ determines how quickly the contribution of previous gradients decays exponentially. The difference between Nesterov and standard momentum is where the gradient is evaluated: with Nesterov momentum, the gradient is evaluated after the current velocity is applied.
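The only difference between the two momentum variants, the point at which the gradient is evaluated, can be shown in a short sketch. The assumptions here are a one-dimensional quadratic objective $f(\theta) = \frac{1}{2}\theta^2$, hypothetical values for the learning rate `eps` and the momentum coefficient `mu`, and a fixed number of steps; the function name `momentum_step` is introduced only for this example.

```python
def momentum_step(theta, v, grad, eps=0.1, mu=0.9, nesterov=False):
    """One update of classical momentum, or of Nesterov momentum when
    nesterov=True (gradient evaluated after applying the current velocity)."""
    lookahead = theta + mu * v if nesterov else theta
    v_next = mu * v - eps * grad(lookahead)
    return theta + v_next, v_next

def grad(theta):
    # Gradient of the hypothetical objective f(theta) = 0.5 * theta**2.
    return theta

theta_cm, v_cm = 5.0, 0.0
theta_nag, v_nag = 5.0, 0.0
for _ in range(300):
    theta_cm, v_cm = momentum_step(theta_cm, v_cm, grad)
    theta_nag, v_nag = momentum_step(theta_nag, v_nag, grad, nesterov=True)
print(theta_cm, theta_nag)
```

Both variants share the velocity and parameter updates; passing `nesterov=True` only moves the gradient evaluation to the look-ahead point `theta + mu * v`.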
1.3.5 State of the art
Nowadays neural networks are used in various fields of science. They can be used for object recognition [25] [26] [27], camera or object re-localization [28] [29] [30], human pose estimation [31] [32], grasp pose estimation [33] and detecting grasp pose [11] [34] [10].
For the sake of simplicity, in this discussion and in the rest of the text we refer to convolutional layers with the notation C(f,d,s), where f is the number of feature maps, d is the filter spatial dimension and s is the spatial stride. Additionally, we label pooling as P(m), with m the downsample factor, and a fully connected layer as FC(n), with n the number of output neurons.
In the field of object recognition, [25] presents an architecture that learns to extract features and classify objects from raw volumetric data (voxels). The voxels are extracted from different sources of 3D data such as Lidar (631 urban objects in 26 categories), RGBD, or CAD models (151,128 3D models classified into 40 object categories). The composition of their model is C(32, 5, 2) − C(32, 3, 1) − P(2) − FC(128) − FC(k), where k is the number of classes. They use a different voxel grid size for each source, 30 × 30 × 30 for CAD and 24 × 24 × 24 for RGBD, to avoid overfitting or underfitting. With this model they obtain an accuracy of around 80%.
In the field of camera relocalization, [28] presents an algorithm consisting of a convolutional neural network trained end-to-end to regress the camera's 6-DOF pose relative to a scene. The input of the network is a 2D image, and the output is a vector given by the 3D camera position and the orientation, represented by a quaternion. The network is composed of C(23) − FC(2048), with accuracy above 95%.
In the area of human pose estimation, the work in [31] proposes a method that solves a regression problem towards body joints. They use a cascade of deep neural networks, where the first stage localizes the joint and the second predicts the displacement of the joint location. The network is a 7-layer convolutional network C(96, 11, 1) + LRN − P(2) + C(256, 5, 1) + LRN − P(2) + C(384, 3, 1) + C(384, 3, 1) + C(256, 3, 1) + P(2) + FC(4096) + FC(4096), where LRN is the normalization layer. They obtain a precision of around 61%.
In the area of grasp pose detection, and in a similar way to [34], the authors in [31] use a two-step cascaded system with two deep networks, where the top decisions from the first are re-evaluated by the second using a single RGBD view. They use deep learning to learn a set of RGBD features from each candidate grasp, which are used afterwards to assign a score to that particular grasp option. The approach includes a structured multimodal regularization method that improves the quality of the features learned from RGBD data. A small deep network is used to score potential grasps in the image, and a small candidate set of the top-ranked grasps is provided to a larger deep network, which yields a single best-ranked grasp. They obtain 84% of successful grasps for 30 objects out of 100 trials. In contrast with the method in [34], which predicts a single grasp pose from several possible grasp poses, [11] predicts a score for every grasp pose, called the grasp function.
1.4 Organization
This thesis is organized as follows. In chapter 2 we present a short explanation of the Bounding Box algorithm using the Soft-Hand. In chapter 3 we explain the modifications introduced to the bounding box code to enable its use with the ReFlex hand, give an overview of the neural network and present the results obtained. Concluding remarks and future lines of research can be found in chapter 4.
Chapter 2
Grasp planning with soft hands
To generate the candidate poses we use the bounding box decomposition algorithm [1]. This method is based on the Minimum Volume Bounding Box (MVBB) algorithm for object decomposition originally proposed in [8]. The method consists in iteratively building the MVBBs of the point sets that result from splitting the point cloud of the object. The split is performed in such a way that the sum of the two areas resulting from the convex region of each set of points is minimized. An algorithm for grasp planning for fully actuated hands using the MVBB decomposition of an object was proposed in [9]. The idea is to decompose the object into MVBBs by minimizing the volume of the boxes fitting partial point clouds. The algorithm takes a point cloud of an object (points3D) and approximates it with MVBBs. This is done by first projecting the points onto three planes, which are the non-opposite faces of the box. Then the points are split into two sets p1 and p2 using Algorithm 5. The split is evaluated for each projection (faces) and for each of the two projection axes. After that, each set is approximated with a box, and the areas a1 and a2 are computed. At the end, the algorithm returns the point and the split direction that minimize the summed area. The split is then applied to the whole set of 3D points, and it results in two boxes p and q, each with its own set of 3D points. The volume reduction rate of the two new boxes with respect to the original is then compared with a user-given parameter t to assess whether the split was useful. If it is useful, the split is kept and each box is considered as a new point cloud on which the procedure is repeated. Otherwise the algorithm stops (see Algorithm 4).
Figures 2.1 and 2.2 show an example of the first step of this technique.
Algorithm 4 Approximate the object in MVBB
1: procedure BoxApproximate(points3D)
2: box ← FindBoundingBox(points3D)
3: faces ← nonOppositeFaces(box)
4: (p, q) ← split(FindBestSplit(faces, points3D))
5: if percentualVolume(p + q, box) < t then ▷ t is the stop criterion
6: BoxApproximate(p)
7: BoxApproximate(q)
8: end if
9: end procedure
Algorithm 5 Split the point cloud
1: procedure FindBestSplit(faces, points3D)
2: minArea ← ∞
3: for i ← 1 to 3 do
4: p2D ← project(points3D, faces[i])
5: for x ← 1 to width(faces[i]) do
6: (p1, p2) ← verticalSplit(p2D, x)
7: a1 ← boundArea2D(p1)
8: a2 ← boundArea2D(p2)
9: if (a1 + a2 < minArea) then
10: minArea ← (a1 + a2)
11: bestSplit ← (i, x)
12: end if
13: end for
14: for y ← 1 to height(faces[i]) do
15: (p1, p2) ← horizontalSplit(p2D, y)
16: a1 ← boundArea2D(p1)
17: a2 ← boundArea2D(p2)
18: if (a1 + a2 < minArea) then
19: minArea ← (a1 + a2)
20: bestSplit ← (i, y)
21: end if
22: end for
23: end for
24: return bestSplit
25: end procedure
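The search of Algorithm 5 can be sketched in NumPy as follows. This is a simplified illustration, not the C++ code used in this work: it assumes that the points are already expressed in the frame of the bounding box (so the three non-opposite faces are the coordinate planes), it uses axis-aligned rectangles for the 2D bounds, and the names `bound_area_2d`, `find_best_split` and `n_cuts` are hypothetical.

```python
import numpy as np

def bound_area_2d(p):
    """Area of the axis-aligned rectangle enclosing a set of 2D points (0 if empty)."""
    if len(p) == 0:
        return 0.0
    extent = p.max(axis=0) - p.min(axis=0)
    return float(extent[0] * extent[1])

def find_best_split(points3d, n_cuts=20):
    """Return (face, axis, cut, area) minimizing a1 + a2 over all candidate cuts.

    Assumption: points3d (N x 3) is expressed in the box frame, so each face
    is spanned by a pair of coordinate axes.
    """
    faces = [[0, 1], [0, 2], [1, 2]]
    best = (None, None, None, np.inf)
    for i, face in enumerate(faces):
        p2d = points3d[:, face]                     # projection onto face i
        for axis in (0, 1):                         # the two split directions of the face
            lo, hi = p2d[:, axis].min(), p2d[:, axis].max()
            for cut in np.linspace(lo, hi, n_cuts + 2)[1:-1]:
                mask = p2d[:, axis] < cut           # points on one side of the cut
                area = bound_area_2d(p2d[mask]) + bound_area_2d(p2d[~mask])
                if area < best[3]:
                    best = (i, axis, float(cut), area)
    return best
```

Since the two rectangles never cover more than the rectangle of the unsplit projection, the returned area is at most the smallest face area of the original box.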
Figure 2.1: Graphical explanation of the projection procedure.
Figure 2.2: Example of first split
The parameter t can be varied depending on the desired density of boxes. More boxes mean that the object can be recognized with higher probability. Figures 2.3 and 2.4 show how the box decomposition changes when the area and volume parameters are varied.
(a) volume 0.125 (b) volume 0.0125
(c) volume 0.00125 (d) volume 0.000125
(e) volume 0.0000125 (f) volume 0.00000125
Figure 2.3: Volume varying for the kettle.
(a) area 0.8 (b) area 0.9
(c) area 0.95 (d) area 0.98
Let’s now consider how the pose is chosen.
We start from the orientation of each MVBB and the orientation of the object itself. The orientation of the boxes comes from the principal axes defined by the Principal Component Analysis (PCA) performed within the FindBoundingBox function. The inclusion of the PCA is one of the improvements with respect to the original algorithm in [9], and makes the algorithm invariant to the reference frame of the point cloud. The object's reference frame is defined with the z axis in the direction normal to its base. The x axis is oriented towards some feature of interest (a handle, for example) and parallel to the plane of the object's base. This helps in figuring out the orientation of the object during experiments. The origin is placed at the intersection of the middle axis of the object with the base plane. Once the object is decomposed into MVBBs, the next step is to select a box to grasp. There are many criteria to do this, and the most adequate choice depends on the task that the robot has to perform once the object is grasped. In our method, we start by generating hand poses from the outermost box. This choice is driven by our first priority of just grasping the object successfully, most probably in a power grasp configuration since the hand is simply closed to a certain extent (e.g., cleaning a table). Once an MVBB is selected, the procedure followed to find the transformation $T^H_O$ describing the pose of the hand with respect to the object is the following:
1. Align the x axis of the hand parallel to the longest side of the MVBB.
2. Align the z axis of the hand with the axis of the box which has the smallest angle with respect to the z axis of the hand.
3. Compute the orientation of the y axis to form a right-handed frame.
From this procedure, we can generate the rotation matrix $R^H_O$ defining the orientation of the hand frame H with respect to the object frame O. The frame H is placed 5 mm outside the MVBB, in the negative direction of the z axis defined previously. The procedure is explained graphically in figure 2.5.
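The three alignment steps and the 5 mm offset can be sketched as follows. This is an illustrative NumPy example and not the code of the bounding box library: the function `hand_pose_from_box`, its arguments (box center, box axes as matrix columns, half extents, and a reference z direction of the hand in the object frame) and the convention of returning a 4 × 4 homogeneous matrix are assumptions.

```python
import numpy as np

def hand_pose_from_box(center, axes, half_extents, z_ref, offset=0.005):
    """Build the hand pose with respect to the object from one box (hypothetical helper).

    axes: 3 x 3 matrix whose columns are the unit axes of the box in the object frame.
    z_ref: current z axis of the hand, expressed in the object frame.
    """
    center = np.asarray(center, dtype=float)
    axes = np.asarray(axes, dtype=float)
    half_extents = np.asarray(half_extents, dtype=float)
    z_ref = np.asarray(z_ref, dtype=float)

    i_x = int(np.argmax(half_extents))        # step 1: x along the longest side
    x_h = axes[:, i_x]

    best_dot, z_h, i_z = -np.inf, None, None  # step 2: box axis closest to z_ref
    for i in range(3):
        if i == i_x:
            continue
        for sign in (1.0, -1.0):
            d = sign * (axes[:, i] @ z_ref)   # a larger dot product is a smaller angle
            if d > best_dot:
                best_dot, z_h, i_z = d, sign * axes[:, i], i

    y_h = np.cross(z_h, x_h)                  # step 3: right-handed frame
    T = np.eye(4)
    T[:3, :3] = np.column_stack([x_h, y_h, z_h])
    # Place the frame `offset` meters outside the box, along the negative z axis.
    T[:3, 3] = center - (half_extents[i_z] + offset) * z_h
    return T

T_example = hand_pose_from_box(center=[0.0, 0.0, 0.0], axes=np.eye(3),
                               half_extents=[0.1, 0.02, 0.05], z_ref=[0.0, 0.0, 1.0])
```

For a box aligned with the object frame, as in the usage example, the rotation part is the identity and the hand frame sits 55 mm below the box center along z.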
Figure 2.5: Graphical explanation of the procedure performed to align the hand with each bounding box.
The previous procedure generates just a single hand configuration on one side of a box. However, once an MVBB is generated, there is a large number of possibilities to grasp it. In this work we increase the number of hand configurations by considering each side of the box. In this way we consider grasps on each part of the object and not just one of them.
Figure 2.6: In order to generate more poses to grasp each box, the hand is rotated and translated along x axis of the box.
In order to generate more variations for a box, we first set the range of motion in which we can move the hand, translating it by a distance $x_t$ and rotating it by an angle $\alpha_t$, both along the longest axis of the box, while still not colliding with the object or the table. The collision check with the table is added in this work. Figure ?? shows the random variations created for the cup. The variables $x_t$ and $\alpha_t$ span a 2D space, with high probability of being collision free, from which we pick a random point with uniform distribution and then check for collision. If this configuration is collision free, then it is a candidate pose to grasp the object.
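The sampling of variations described above can be sketched as follows. This is a minimal example under stated assumptions: `sample_grasp_variations` is a hypothetical helper, and `in_collision` stands for whatever collision test is available (here replaced by a toy predicate, not by the simulator's collision checker).

```python
import numpy as np

def sample_grasp_variations(x_range, alpha_range, in_collision,
                            n_samples=100, seed=0):
    """Draw (x_t, alpha_t) uniformly and keep the collision-free samples."""
    rng = np.random.default_rng(seed)
    candidates = []
    for _ in range(n_samples):
        x_t = rng.uniform(*x_range)          # translation along the longest box axis
        alpha_t = rng.uniform(*alpha_range)  # rotation about the same axis
        if not in_collision(x_t, alpha_t):
            candidates.append((x_t, alpha_t))
    return candidates

# Toy collision predicate (assumption): large rotations hit the table.
toy_collision = lambda x_t, alpha_t: abs(alpha_t) > 0.5
poses = sample_grasp_variations((-0.05, 0.05), (-1.0, 1.0), toy_collision, n_samples=200)
```

Each retained pair is a candidate pose; the rejected ones are simply discarded.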
This part, in combination with the code developed for the grasp simulation, was used for the IROS Grasp and Manipulation Competition in Korea in October 2016.
Chapter 3
Teaching robots to grasp with underactuated hands
The goal of this thesis is to have a robot that learns how to grasp an unknown object with an underactuated end-effector. We choose to implement regression via a deep convolutional neural network in a supervised learning scenario. In this chapter we explain how we developed the 2D and 3D deep convolutional neural networks and the database. As previously said, to produce the pose we use the Minimum Volume Bounding Box. In chapter 2 we presented this method using the Soft-Hand as end-effector, but in this work we use a RightHand Robotics ReFlex gripper. Therefore, we adapt the MVBB algorithm so that it can be used with our gripper. The ReFlex hand is an underactuated hand with tactile sensors and joint feedback. It consists of three modular fingers and a central chassis that groups together all the actuators, the preshape transmission, the palm electronics, and the interface electronics (see figure 1.3).
(a) fully open (b) fully closed
Figure 3.2: Proximal finger.
Each finger is controlled by a single actuator that drives a tendon spanning both the proximal and distal joints. This allows the fingers to passively shape themselves and adapt to the object. A fourth actuator controls a coupled preshape degree of freedom, for a total of seven joints including the fingers. The proximal revolute joint connects the proximal link to the knuckles, and rotates around the knuckle axle. Its angle ranges from 0 radians (finger flat and fully open, as shown in figure 3.2(a)) to nearly π radians when fully closed and resting against the palm (as illustrated in figure 3.2(b)). The distal flexure joint connects the distal link to the proximal link, and flexes around the cast urethane joint. Its range goes from 0 radians (finger flat and fully open, as shown in figure 3.3(a)) to nearly 7π/8 radians when fully closed and resting against the proximal link (figure 3.3(b)). The distal joint cannot be commanded directly because it is coupled to the proximal joint. The angle of the distal joint is calculated from the difference between the tendon spool encoder and the proximal joint encoder, and as a result it is less accurate than the proximal joint measurement. The preshape joint changes the angle of fingers 1 and 2 with respect to the palm as a single unit. These two fingers can be closed (0 radians, aligned with finger 3, in a power grasp orientation) or opened (π/2 radians, perpendicular to finger 3, in a pinch grasp orientation) according to the type of object to be gripped, as shown in figure 3.4. All this information and the figures are taken from [35].
To simulate the grasp we use the Klamp't simulator. Klamp't is an open-source, cross-platform software package for modeling, simulation, planning, and optimization for complex robots, particularly for manipulation and locomotion tasks.
In comparison with Klamp't, competing software packages do not offer the same level of flexibility, support for legged robots and code portability.
(a) fully open (b) fully closed
Figure 3.3: Distal finger.
3.1 Simulation and dataset creation
3.1.1 Grasp pose and dataset generation
To train the neural network we need a high volume of data. In [33] the authors collected 800,000 data points over the course of two months, using between 6 and 14 robotic manipulators at any given time. The work in [36] used 1.2 million high-resolution images, with a network of 650,000 neurons and 5 convolutional layers. In [37] the authors were able to collect over 700 hours of real-world grasps using a Baxter robot. A similar initiative in [33] explored data collection through robotic collaboration, gathering shared grasping experience across a number of real-world robots over a period of two months. Alternative environments for large-scale data collection also exist. Simulators alleviate a significant amount of real-world issues, and are invaluable tools that have been accelerating research in the machine learning community. Recent works leveraging simulated data collection for robotic grasping include [10], which collected over 300,000 grasps across 700 meshed object models, and [38], which collected a dataset with over 2.5 million parallel-plate grasps across 10,000 unique 3D object models.
To generate our database we use three object datasets: ycb, apc2015 and Princeton. The first two datasets were used in the IROS 2016 Grasp and Manipulation Challenge. They include object meshes in different formats (e.g., .ply, .stl). We consider a mesh to be bad when it is dirty or very noisy. To generate accurate end-effector poses we need to clean such a mesh with a point downsampling method. As we can see in figure 3.5(a), the whole object is not defined correctly, and its many points imply a considerable bounding box computation time. To clean the mesh and decrease the number of points, we use the Quadric Edge Collapse based simplification method [39]. We refer to this procedure as mesh simplification or mesh decimation. This is a class of algorithms that transform a given polygonal mesh into another with fewer faces, edges, and vertices. The simplification process is usually controlled by a set of user-defined quality criteria aimed at preserving specific properties of the original mesh as much as possible (e.g., distance, visual appearance, geometry). Figure 3.5(b) shows the effect of this simplification procedure on the initially complex mesh. This method decreases the computational time of the bounding box decomposition.
(a) good mesh (b) dirty mesh
Figure 3.5: Model of object before and after the point downsampling.
The output of our network is the pose of the end-effector with respect to the object frame; we should now choose the best input for the network. For image matching we can use either intensity-based methods or feature-based methods. These classes include all multi-view stereo techniques that compute correspondences across images and then recover 3D structure by triangulation and surface fitting. An alternative approach to scene reconstruction is based on computations in three-dimensional scene space in order to construct the volumes or surfaces in the world that are consistent with the input images. We call this approach volumetric scene modeling. Volumetric scene modeling avoids the disadvantages of intensity and feature-based methods. These disadvantages are: i) views must often be close together (i.e., small baseline) so that correspondence techniques are effective; ii) correspondences must be maintained over many views spanning large changes in viewpoint; iii) many partial models must often be computed with respect to a set of base viewpoints, and these surface patches must then be fused into a single, consistent model; iv) if sparse features are used, a parameterized surface model must be fit to the 3D points to obtain the final dense surface reconstruction; and v) there is no explicit handling of occlusion differences between views.
Volumetric methods assume a bounded volume in which the objects of interest lie. This volume is frequently assumed to be a cube surrounding the scene. The most common approach to representing this volume is a regular tessellation of cubes, called voxels. The scene reconstruction using a voxel-based representation is defined by an occupancy classification of each volume element into one of a discrete set of labels. This is usually a binary decision (transparent or opaque) or a ternary decision (transparent, opaque or unseen).
For the 3D neural network we use a voxel representation produced by the binvox library [40]. Figure 3.6 shows an example of this procedure: from an object representation 3.6(a) we obtain a voxel representation 3.6(b) by calling ./binvox -down -ri -d 128 -e xwing.ply, where: i) -d specifies the voxel grid size (default 256, max 1024), for which we choose 128; ii) -down downsamples voxels by a factor of 2 in each dimension (can be used multiple times); iii) -ri removes internal voxels; iv) -e uses exact carving (gives best results).
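The binvox call above can be issued from Python as sketched below. The flags are the ones listed in the text; the helper names `binvox_command` and `voxelize`, and the assumption that the binvox executable sits in the working directory, are introduced only for this example.

```python
import subprocess

def binvox_command(mesh_path, grid_size=128, binvox="./binvox"):
    """Build the argument list for the binvox call described in the text."""
    return [binvox, "-down", "-ri", "-d", str(grid_size), "-e", mesh_path]

def voxelize(mesh_path, **kwargs):
    """Run binvox on a mesh file; raises CalledProcessError if binvox fails.

    Assumption: the binvox executable is available at the given path.
    """
    subprocess.run(binvox_command(mesh_path, **kwargs), check=True)

cmd = binvox_command("xwing.ply")
```

Passing the arguments as a list avoids shell quoting problems with object names that contain spaces.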
(a) mesh
(b) voxel
Figure 3.6: X-wing representation
To feed the 2D CNN we use the images obtained with a virtual camera sensor inside the simulation software. To increase the size of the dataset, we take images from various points of view, as depicted in figure 3.7, following Algorithm 6.
Algorithm 6 Image representation
1: procedure ImageRepresentation(Object)
2: world ← sensor_camera
3: while stop criterion not met do
4: Simulate
5: sensor_camera.move()
6: data ← sensor_camera.getMeasurements()
7: rgb ← FromDataToRgb(data)
8: image ← rgb
9: end while
10: end procedure
(a) view 1 (b) view 2 (c) view 3
Figure 3.7: Image taken from various points of view
With this method we increase the number of inputs from 1,230 images taken from a single point of view to 49,880.
In chapter 2 we introduced the Bounding Box technique that we use to generate the candidate pose for the robotic hand. Since that algorithm assumes a Soft-Hand as end-effector and here we use the ReFlex gripper, whose reference frame is different, we have to reorient our coordinate axes. We illustrate this concept in figure 3.8.
(a) Reflex frame (b) Soft-Hand frame
Figure 3.8: Difference of frame between RightHand and Soft-Hand.
The new orientation frame can be obtained following the procedure below (see figure 3.9):
1. Align the y axis of the hand parallel to the longest side of the MVBB.
2. Align the z axis of the hand with the axis of the box which has the smallest angle with respect to the z axis of the hand.
3. Compute the orientation of the x axis to form a right-handed frame.
Figure 3.9: Graphical explanation of the procedure performed to align the hand with each bounding box.
In this project we also added a new method to increase the number of poses by checking each box’s face.
We can use three kinds of input-output mappings, depicted in figure 3.10.
(a) one-to-many (b) one-to-one
(c) many-to-one
Figure 3.10: Input-output mapping representation
A one-to-many mapping introduces ambiguity into the grasp space: the gripper orientation is not directly linked to the camera orientation, which means that a single image may correspond to many different grasps. A one-to-one mapping introduces a more direct relationship between images and grasps; similar orientations of the object captured in the image reflect similar orientations within the grasp. We choose the many-to-one mapping to increase the size of the dataset.
The bounding box algorithm was developed in C++, while the simulator and the neural network were coded in Python. To avoid rewriting all the bounding box code, we use the Simplified Wrapper and Interface Generator (SWIG) [41]. SWIG is an interface generator that allows communication between the two languages by automatically producing a working Python extension module (see figure 3.11).
Figure 3.11: Graphical explanation of the swig method.
Once we have the interface we can write the code to simulate the grasp. We developed code that automatically chooses an object from the dataset, calls the bounding box method that computes the poses, and verifies the collision-free condition between the end-effector and the object and between the end-effector and the table. If the pose is collision free, then it is feasible and will be attempted. Figure 3.12 shows a diagram of our algorithm.
3.1.2 Network architecture
A linear function with input x and output y has the form $y = w_1 x + w_0$, where $w_0$ and $w_1$ are real-valued coefficients to be learned. These values are called weights. To evaluate how close our prediction is to the truth, we can choose as penalty function the squared error summed over all the training examples, $\sum_{j=1}^{N} (y_j - f(x_j))^2$ (see footnote 1). We want the sum of squared errors to be small. We choose $w_0$ and $w_1$ to minimize the total error on the training set, a procedure that is called simple least-squares regression. We do not need to find $w_1$ and $w_0$ by hand, because the machine learning algorithm does it for us during the training phase using algorithms such as: i) delta-bar-delta, based on the idea that if the partial derivative of the loss with respect to a given model parameter keeps the same sign, then the learning rate should increase, and otherwise decrease; ii) AdaGrad, which adapts the learning rates of all model parameters by scaling them inversely proportional to the square root of the sum of all the historical squared values of the gradient; iii) Adaptive Moment Estimation (Adam), based on estimates of the 1st and 2nd-order moments: the algorithm estimates the 1st-order moment (the gradient mean) and the 2nd-order moment (the element-wise squared gradient) using exponential moving averages [21]; iv) Momentum or its variation Nesterov Momentum [22] [23] [24], a technique for accelerating gradient descent that accumulates a velocity vector in directions of persistent reduction in the objective across iterations.
The sum of squared errors is minimized when its partial derivatives with respect to $w_0$ and $w_1$ are zero. When the loss function has no closed-form solution we use gradient descent. We choose the weight update to comprise a small step in the direction of the negative gradient. After each update, the gradient is re-evaluated for the new weight vector, and the process is repeated until we converge on the minimum possible loss.
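As a small worked example of the procedure just described, the sketch below fits $y = w_1 x + w_0$ by gradient descent. It is only an illustration: the function `fit_line_gd`, the learning rate, the number of steps and the synthetic data are assumptions, and the error is averaged rather than summed over the examples, which only rescales the step size.

```python
import numpy as np

def fit_line_gd(x, y, lr=0.1, n_steps=5000):
    """Fit y = w1 * x + w0 by gradient descent on the mean squared error."""
    w0, w1 = 0.0, 0.0
    for _ in range(n_steps):
        err = (w1 * x + w0) - y              # prediction error on every example
        grad_w0 = 2.0 * err.mean()           # derivative of the loss w.r.t. w0
        grad_w1 = 2.0 * (err * x).mean()     # derivative of the loss w.r.t. w1
        w0 -= lr * grad_w0                   # small step along the negative gradient
        w1 -= lr * grad_w1
    return w0, w1

# Synthetic, noise-free data (assumption): y = 2 x + 1.
x_data = np.linspace(0.0, 1.0, 50)
y_data = 2.0 * x_data + 1.0
w0_fit, w1_fit = fit_line_gd(x_data, y_data)
```

For this linear model the same weights could be obtained in closed form; gradient descent is shown because it is the scheme that carries over to the neural network.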
The loss function depends on the specific task. The authors in [28] try to regress the 6-DOF camera pose from a single RGB image in an end-to-end manner. To regress the pose, they train the network on a Euclidean loss using stochastic gradient descent with the following loss function:
$Loss = \|\bar{x} - x\|_2 + \beta \left\| \bar{q} - \frac{q}{\|q\|} \right\|_2$ (3.1)
where $x$ is the 3D camera position and the orientation is represented by the quaternion $q$. The issue with using unit quaternions is that $q$ and $-q$ denote the same rotation, so the ambiguity in the quaternion representation has to be taken into account, as pointed out in [42]:
$\Psi_1 = \min\{\|q_1 - q_2\|_2, \|q_1 + q_2\|_2\}$ (3.2)
A similar function that involves unit quaternions is given by:
$\Psi_2 = \min\{\arccos(q_1 \cdot q_2), \pi - \arccos(q_1 \cdot q_2)\}$ (3.3)
where $\cdot$ denotes the inner (or dot) product of vectors. As in (3.2), the ambiguity in the sign of unit quaternions must be taken into consideration. So $\Psi_2$ can be replaced by the following computationally more efficient function:
$\Psi_3 = \arccos(|q_1 \cdot q_2|)$ (3.4)
Since $\Psi_3$ must be a non-negative function, we restrict the angles returned by arccos to the first quadrant (i.e., the range of values mapped by $\Psi_3$ is $[0, \pi/2]$ rad). Alternatively, the inverse cosine function above can be eliminated by defining:
$\Psi_4 = 1 - |q_1 \cdot q_2|$ (3.5)
This function is used as the distance measure between two Euclidean transformations. Function $\Psi_4$ gives values in the range [0, 1].
To avoid consistency constraints (i.e., a quaternion with norm 1, a rotation matrix with determinant 1), it is possible to use an Euler angle representation [42]. Let $(\alpha_1, \beta_1, \gamma_1)$ and $(\alpha_2, \beta_2, \gamma_2)$ be two sets of Euler angles. Then
$\Psi((\alpha_1, \beta_1, \gamma_1), (\alpha_2, \beta_2, \gamma_2)) = \sqrt{d(\alpha_1, \alpha_2)^2 + d(\beta_1, \beta_2)^2 + d(\gamma_1, \gamma_2)^2}$
where $d(a, b) = \min\{|a - b|, 2\pi - |a - b|\}$, is a distance function on SO(3) if $\alpha, \gamma \in [-\pi, \pi)$ and $\beta \in [-\pi/2, \pi/2)$.
Footnote 1: L2 is appropriate when there is normally-distributed noise that is independent of
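The orientation metrics above can be written compactly in NumPy. The sketch below is an illustration of equations 3.2 to 3.5 and of the Euler angle distance, not the loss code used for training; the function names are hypothetical, quaternions are assumed to be unit vectors stored as arrays of four components, and Euler angles are assumed to be in radians.

```python
import numpy as np

def psi1(q1, q2):
    """Equation 3.2: norm of the difference, accounting for the q / -q ambiguity."""
    return min(np.linalg.norm(q1 - q2), np.linalg.norm(q1 + q2))

def psi2(q1, q2):
    """Equation 3.3: angle between the quaternions, folded so that q and -q agree."""
    angle = np.arccos(np.clip(q1 @ q2, -1.0, 1.0))
    return min(angle, np.pi - angle)

def psi3(q1, q2):
    """Equation 3.4: cheaper equivalent of psi2, with values in [0, pi/2]."""
    return np.arccos(np.clip(abs(q1 @ q2), 0.0, 1.0))

def psi4(q1, q2):
    """Equation 3.5: removes the inverse cosine, with values in [0, 1]."""
    return 1.0 - abs(q1 @ q2)

def euler_distance(e1, e2):
    """Euler angle distance: each angular difference is wrapped to [0, pi]."""
    d = np.abs(np.asarray(e1, dtype=float) - np.asarray(e2, dtype=float))
    d = np.minimum(d, 2.0 * np.pi - d)
    return float(np.sqrt(np.sum(d ** 2)))

q_id = np.array([1.0, 0.0, 0.0, 0.0])    # identity rotation
q_half = np.array([0.0, 1.0, 0.0, 0.0])  # rotation of pi about the x axis
```

The clipping guards `arccos` against dot products that leave [-1, 1] by rounding error. All four quaternion metrics return zero when evaluated on $q$ and $-q$, which is the property that motivates them.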
Our 2D network is composed of six convolutional layers and three fully connected layers that map the convolutional output to our desired output (see figure 3.14). To find the loss function that best fits our task, we evaluate the result for each distance metric written above. We also modify the network using dropout. This method is a regularization technique used to prevent overfitting during the training phase (see figure 3.15). At each iteration we randomly omit some neurons along with their connections, so at every learning step the neural network has a different structure, as shown in figure 3.13. Consequently, only the connected neurons and their weights are updated in a particular learning step.
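The effect of dropout on a layer can be sketched in a few lines of NumPy. This is an illustrative example and not the Theano code of our network: the function `dropout` is hypothetical, and the rescaling of the surviving activations by 1 / (1 − rate), often called inverted dropout, is an assumption of this sketch.

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    """Randomly zero a fraction `rate` of the activations during training."""
    if not training or rate == 0.0:
        return activations                        # no neuron is dropped at test time
    keep = rng.random(activations.shape) >= rate  # True for the neurons that stay
    # Assumption: survivors are rescaled so that the expected activation is unchanged.
    return activations * keep / (1.0 - rate)

rng = np.random.default_rng(0)
layer_output = np.ones(10000)
dropped = dropout(layer_output, rate=0.5, rng=rng)
```

A new mask is drawn at every call, so each learning step sees a different subset of active neurons.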
(a) Fully connected network (b) Network with dropout
Figure 3.13: Difference between fully connected network and network with regularization
The 3D deep convolutional neural network structure is shown in figure 3.16
Figure 3.14: 2D convolutional neural network
Figure 3.16: 3D convolutional neural network
We use 1,230 inputs (one point of view for each image), of which 85% are used to train the network, 10% to validate it and 5% to test its performance. The respective data are chosen randomly from a vector of successful grasps. Notice that for all methods using quaternions the predicted pose is normalized using $\Delta = [10^{-12}, 0, 0, 0]$ (see footnote 2 and Algorithm 7).
Algorithm 7 Quaternion Normalization
1: procedure QuaternionNormalization($q_{predicted}$)
2: $q_n \leftarrow \frac{q_{predicted} + \Delta}{\|q_{predicted} + \Delta\|_2}$
3: end procedure
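The random split and the normalization of Algorithm 7 can be sketched as follows. The helper names `split_indices` and `normalize_quaternion`, the random seed and the use of integer arithmetic for the split sizes are assumptions of this example, not a description of our training script.

```python
import numpy as np

DELTA = np.array([1e-12, 0.0, 0.0, 0.0])  # small offset on the scalar part

def normalize_quaternion(q_predicted, delta=DELTA):
    """Algorithm 7: add delta and divide by the Euclidean norm."""
    q = np.asarray(q_predicted, dtype=float) + delta
    return q / np.linalg.norm(q)

def split_indices(n, seed=0):
    """Randomly split n samples into 85% training, 10% validation and 5% test."""
    idx = np.random.default_rng(seed).permutation(n)
    n_train = n * 85 // 100
    n_val = n * 10 // 100
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train_idx, val_idx, test_idx = split_indices(1230)
q_n = normalize_quaternion([0.0, 2.0, 0.0, 0.0])
```

With the offset, even an all-zero prediction is mapped to a valid unit quaternion instead of causing a division by zero.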
3.2 Results
3.2.1 Bounding box method
As mentioned above, we use the Minimum Volume Bounding Box to decompose the object and look for candidate poses. We decompose each object into several boxes and evaluate the poses along each side of the outermost boxes. During the simulation we decide whether an object is successfully grasped by looking at two parameters: the difference along the z axis, and the sum of the derivative of the distance vector between the end-effector frame and the object frame. The second parameter is a measure of the kindness of the grasp. As figure 3.17 shows, the kindness varies during the simulation: the four spikes mean that the object is slipping. In figure 3.18, instead, the kindness is close to zero, which means that we have a successful grasp. Table 3.1 shows partial results of our pose generator code.
Footnote 2: The non-zero term is added to the scalar part to prevent the norm from being equal to zero.
Object name | Number of total poses | Number of successful grasps | Percentage (%)
soft scrub 2lb 4oz | 80 | 20 | 25
yellow plastic chain | 33 | 16 | 48.48
starkist chunk light tuna | 11 | 5 | 45.45
plastic wine cup | 0 | 0 | 0
play go rainbow stakin cups 9 red | 11 | 10 | 90.90
black and decker lithium drill driver unboxed | 68 | 45 | 66.17
wescott orange grey scissors | 89 | 39 | 43.82
purple wood block 1inx1in | 4 | 3 | 75.0
stainless steel fork red handle | 84 | 32 | 38.09
melissa doug farm fresh fruit apple | 11 | 10 | 90.90
play go rainbow stakin cups 7 yellow | 11 | 5 | 45.45
plastic nut grey | 0 | 0 | 0
(a) play go rainbow stakin cups 6 purple
(b) kindness
(a) play go rainbow stakin cups 6 purple
(b) kindness
Figure 3.18: Kindness trend during the successful grasp
3.2.2 Learning method
To code the neural network we use Theano [43], a Python library commonly used to implement neural networks. However, Theano is not strictly a neural network library, but rather a Python package that makes it possible to implement a wide variety of mathematical abstractions. In this section we explain the results obtained for our network.
Results of 2DCNN
For simplicity we refer to a convolutional layer as C(f,d,s), where f is the number of feature maps, d is the filter spatial dimension and s the spatial stride; to pooling as P(m), with m the downsample factor; to a fully connected layer as FC(n), with n the number of output neurons; and to dropout as D(rates), with rates an array denoting the proportion of neurons to drop. Our 2D network is composed of C(32, 5x5, 1) + C(32, 5x5, 1) + C(32, 5x5, 1) + P(2x2) + C(32, 3x3, 1) + P(2x2) + C(32, 3x3, 1) + P(2x2) + C(32, 4x4, 1) + P(2x2); we call this part 2DCNN. Using this notation, the neural network with dropout can be expressed as 2DCNN + D(0.2) + D(0.5) + D(0.5) + FC(1500), while the one without is defined by 2DCNN + FC(5500) + FC(2500) + FC(1500). The parameters are:
Input = vector of image
output = end-effector pose
batch_size = 10
n_kernel = 32
delta = [1e-12, 0, 0, 0]
learning_rate = 1e-9
momentum = 0.5
dataset = 1230
Where:
1. batch_size is the number of objects considered at each step.
2. n_kernel is the number of filters at each layer.
We summarize the results in tables 3.2 and 3.3.
Metric | Number of successful grasps | Training error | Test error | Validation error
quaternion eq 3.5 | 6 | 0.12 | 0.11 | 0.16
quaternion eq 3.4 | 5 | 0.25 | 0.32 | 0.25
quaternion eq 3.3 | 4 | 0.25 | 0.25 | 0.25
euler | 1 | 0.5 | 0.67 | 0.49
Table 3.2: Table of metric results for 2DCNN
Metric | Number of successful grasps | Training error | Test error | Validation error
quaternion eq 3.5 | 0 | 0.2 | 0.19 | 0.18
quaternion eq 3.4 | 0 | 0.3 | 0.37 | 0.30
quaternion eq 3.3 | 0 | 0.3 | 0.35 | 0.30
euler | 0 | 0.5 | 0.67 | 0.52
Figures 3.19, 3.20, 3.21 and 3.22 depict the behavior of the errors for the 2D network without dropout, while figures 3.23, 3.24, 3.25 and 3.26 show the 2D network with dropout. To compare the orientation error across the different methods we need to normalize the loss function with respect to its maximum error (see figure 3.27). Notice that equation 3.5 is not normalized because its range is already [0, 1]. In this figure we can see that our network does not converge; this is because we do not have enough data to train the network. The size of the database is one of the most recurrent problems in machine learning because, as in our case, too little data does not allow the network to converge, while too much data can lead to overfitting.
(a) Training error
(b) Translation error
(c) Orientation error
(a) Training error
(b) Translation error
(c) Orientation error
(a) Training error
(b) Translation error
(c) Orientation error
(a) Training error
(b) Translation error
(c) Orientation error
Figure 3.23: Behavior of the error using quaternion eq 3.5 and dropout regularization: (a) training error, (b) translation error, (c) orientation error
Figure 3.24: Behavior of the error using quaternion eq 3.4 and dropout regularization: (a) training error, (b) translation error, (c) orientation error
Figure 3.25: Behavior of the error using quaternion eq 3.3 and dropout regularization: (a) training error, (b) translation error, (c) orientation error
Figure 3.26: Behavior of the error using Euler angles and dropout regularization: (a) training error, (b) translation error, (c) orientation error
Figure 3.27: Comparison of orientation error: (a) 2DCNN, (b) 2DCNN with dropout
Results of 3DCNN
Our 3D network is composed of C(5, 3x3x3, 1) + C(5, 3x3x3, 1) + P(2x2x2) + C(5, 3x3x3, 1) + C(5, 3x3x3, 1) + P(2x2x2) + C(5, 2x2x2, 1) + C(5, 2x2x2, 1) + C(5, 2x2x2, 1) + P(2x2).
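As a rough check of how such a stack reduces its input, the sketch below traces the feature-map shape through a network written in the C(f, d, s) / P(m) notation. It rests on several assumptions that are not stated in the text: convolutions are unpadded ("valid"), pooling windows do not overlap and incomplete windows are dropped, the final pooling acts on every spatial axis, and the input sizes (a 32x32x32 voxel grid and a 128x128 image) are hypothetical. The helper `trace_shape` is illustrative and not part of the thesis code.

```python
def trace_shape(layers, in_channels, spatial):
    """Return (channels, *spatial_dims) after applying all layers.

    layers: list of ("C", n_filters, filter_size, stride) or ("P", factor).
    Assumes unpadded convolutions and non-overlapping pooling.
    """
    channels = in_channels
    dims = list(spatial)
    for layer in layers:
        if layer[0] == "C":
            _, n_filters, size, stride = layer
            dims = [(d - size) // stride + 1 for d in dims]
            channels = n_filters
        elif layer[0] == "P":
            _, factor = layer
            dims = [d // factor for d in dims]
        else:
            raise ValueError("unknown layer type: %r" % (layer[0],))
        if min(dims) < 1:
            raise ValueError("input too small for this stack")
    return (channels, *dims)


# The 3D stack listed above, one entry per layer.
NET_3D = [("C", 5, 3, 1), ("C", 5, 3, 1), ("P", 2),
          ("C", 5, 3, 1), ("C", 5, 3, 1), ("P", 2),
          ("C", 5, 2, 1), ("C", 5, 2, 1), ("C", 5, 2, 1), ("P", 2)]

# The convolutional part of the 2D network (2DCNN).
NET_2D = [("C", 32, 5, 1), ("C", 32, 5, 1), ("C", 32, 5, 1), ("P", 2),
          ("C", 32, 3, 1), ("P", 2), ("C", 32, 3, 1), ("P", 2),
          ("C", 32, 4, 1), ("P", 2)]

print(trace_shape(NET_3D, 1, (32, 32, 32)))  # hypothetical voxel grid
print(trace_shape(NET_2D, 1, (128, 128)))    # hypothetical grayscale image
```

Under these assumptions the hypothetical voxel grid collapses to a single spatial cell with 5 channels, which shows how quickly unpadded convolutions and pooling consume a small input.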
The parameters used in the 3D network are:
Input = vector of voxel
Output = end-effector pose
batch size = 6
n kernel = 5
delta = [1e-12, 0, 0, 0]
learning rate = 1e-9
momentum = 0.5
dataset = 189
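The learning rate and momentum listed above enter the parameter update. A minimal sketch of one common form, classical momentum applied to gradient descent, is given below. That the training code uses exactly this form is an assumption, and the function name `momentum_step` is hypothetical. The toy example uses a larger learning rate than the one in the list so that the effect of a step is visible.

```python
import numpy as np


def momentum_step(params, grads, velocity, learning_rate=1e-9, momentum=0.5):
    """One update of gradient descent with classical momentum.

    velocity <- momentum * velocity - learning_rate * grads
    params   <- params + velocity
    The defaults mirror the listed parameters; the update form is assumed.
    """
    new_velocity = momentum * velocity - learning_rate * grads
    return params + new_velocity, new_velocity


# Toy problem: minimize 0.5 * ||w||^2, whose gradient is w itself.
w = np.array([1.0, -2.0])
v = np.zeros_like(w)
for _ in range(2):
    w, v = momentum_step(w, w, v, learning_rate=0.1, momentum=0.5)
print(w)
```

With a learning rate of 1e-9, the first update changes each parameter by one billionth of its gradient, so many iterations are needed before the weights move appreciably.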
Chapter 4
Conclusions and future work
The most important challenge in machine learning is to generate a database big enough to allow the network to learn, while keeping the number of inputs moderate to prevent overfitting. In this work we were not able to obtain more than 10% of good results. These results depend strongly on the amount of data and on the mapping between input and output. The main obstacle we found to generating more successful results is the lack of computational resources. In the future, access to a larger number of graphics processing units (GPUs) should increase the performance of the methods we present in this work. Our plan is to continue this work, trying to improve the database and the input-output mapping:
1. Change the grasp metric to be independent of simulation time.
2. Add a collision check inside the grasp generator algorithm.
Appendix A
Theano Tutorial
To build our neural network we use Theano [43], a Python library. Theano is not strictly a neural network library; rather, it is a Python library that makes it possible to implement a wide variety of mathematical abstractions. In this section we explain the code developed for our network.
To install Theano follow the tutorial at... and try to run the simple tutorial at ... We recommend using a PC with a GPU, so that the code runs quickly.
First we define the input and the output. In our case the input is a T.tensor3, where 3 denotes that the input has three dimensions: the point coordinates x and y, and a parameter that defines the number of channels of the input, 1 for grayscale or 3 for RGB. The output is a T.fmatrix. After that we define the number of filters for each layer and the batch size. The batch size is the number of objects considered at each step. In our case we chose batch size = 10 and n kernel = 32 for each layer. At this point we can build the convolutional layers. For simplicity we denote a convolutional layer as C(f, d, s), where f is the number of feature maps, d is the spatial dimension of the filter and s is the spatial stride; a pooling layer as P(m), where m is the downsampling factor; a fully connected layer as FC(n), where n is the number of output neurons; and dropout as D(rates), where rates is an array giving the proportion of neurons to drop. Our 2D network is composed of C(32, 5x5, 1) + C(32, 5x5, 1) + C(32, 5x5, 1) + P(2x2) + C(32, 3x3, 1) + P(2x2) + C(32, 3x3, 1) + P(2x2) + C(32, 4x4, 1) + P(2x2); we call this part 2DCNN. We now distinguish the network with dropout from the one without: the first is composed of 2DCNN + D(0.2) + D(0.5) + D(0.5) + FC(1500), the other of 2DCNN + FC(5500) + FC(2500) + FC(1500). The training,