Academic year: 2021

Master’s degree in Computer Science

Vision-based Deep Learning Model for Guiding Multi-fingered Robotic Grasping

Author:

Lapo Frati

Supervisors:

Dott. Davide Bacciu

Dott. Matteo Bianchi

Prof. Antonio Bicchi


Abstract

Leveraging recent advances in deep learning, we propose a vision-based model to generate human-inspired sequences of grasping primitives suitable for transfer to multi-fingered robotic hands. The proposed model, inspired by Neural Image Captioning, consists of a convolutional and a recurrent part. The convolutional part employs a pre-trained model from ILSVRC-2014, adapted to combine features from multiple points of view of a single object by using a view pooling layer. The extracted features are then used to seed Long Short Term Memory recurrent units and generate sequences of primitives that can guide a sophisticated multi-fingered robotic hand during the approach leading to a grasp.


Contents

1 Introduction
2 Background
  2.1 A brief history of grasping
  2.2 Grasping Related Work
3 Deep Learning
  3.1 From learning to machine learning to deep learning
  3.2 How deep is deep learning?
  3.3 Flavours of Deep learning
    3.3.1 Convolutional Neural Networks
    3.3.2 Recurrent Neural Networks
    3.3.3 Embeddings
    3.3.4 Dangers of depth
    3.3.5 Long Short Term Memory
  3.4 Applying Deep learning
    3.4.1 Transfer Learning
    3.4.2 Ensembling
    3.4.3 Dropout
    3.4.4 Knowledge Distillation
  3.5 Deep learning Frameworks
4 Methodology
  4.1 Dataset
    4.1.1 Berlino
    4.1.2 ImageNet
  4.2 PyTorch, an informed choice
    4.2.1 No Tape based Autograd
    4.2.2 Dataloaders
  4.3 Models
    4.3.1 Balancing tradeoffs: VGG
    4.3.2 MultiView Feature Extractor
    4.3.3 Grasping Sequence Generation
5 Experimental analysis
  5.1 Challenges
    5.1.1 Grayscale inputs in an RGB era
    5.1.2 Batch Normalization
    5.1.3 Data Augmentation
    5.1.4 Synthetic Labels
  5.2 Results
    5.2.1 Pseudo-coloring
    5.2.2 Multi View Convolutions
    5.2.3 Tags Generation
6 Conclusion
Bibliography
Appendices


Introduction

Seventy years ago the problem of grasping, an activity so natural for humans, was considered so complex that it was not even possible to describe the components of a grasp. Around 1960 the first attempts were made to create a formal description of the basic types of static grasps. Since then interest in the area has been increasing steadily; researchers and physiologists have been exploring the problem of identifying the essential components of grasping. Over the past three decades research in the field of grasping has made enormous progress. Thanks to advances in medicine, neurology, machine learning and robotics, our understanding of the mechanisms involved has improved dramatically. We have never had such a good understanding of the anatomy of the limbs used for grasping, sophisticated robotic arms are able to mimic them to an amazing degree, and the machine learning models controlling them are able to solve more and more complex tasks. Yet humans still vastly outperform robotic arms when it comes to recognizing, approaching and grasping a novel object. Humans are able to look at an object never seen before and immediately plan a course of action: an approach that would bring our hand to a location suitable for a successful grasp, and a hand position that firmly secures the object. Being able to replicate such feats would open up endless opportunities for robots to assist humans by delegating the grasping of arbitrary objects to a robotic arm. Going from a resting position to a completed grasp can be divided into two parts: the "approach" phase, during which the robotic arm gets close to the target object, and the "grip" phase, during which contact with the object is secured.

We have focused our efforts on the "approach" phase, relying on a flexible, multi-fingered robotic hand capable of performing human-like movements to complete the grasp. As such we have tackled the task of learning how to best approach objects from videos of human subjects performing grasps. Recent advances in the field of deep learning have led to the creation of architectures able to process images and generate sequences with a performance approaching, and sometimes surpassing, human level. In this work we propose a deep learning model that uses Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). CNNs are used to extract relevant features that are later fed to the recurrent part of the model; they exploit the presence of multiple points of view to improve the quality of the extracted features through a view pooling layer that has shown interesting performance when dealing with 3D models. Exploiting transfer learning, a pre-trained CNN is used as a feature extractor to avoid the very long and difficult training of deep convolutional networks. To cope with the grayscale nature of the dataset used we experiment with a pseudo-coloring layer inspired by learned colorspace transformations.


To generate the grasping sequence we employ a Long Short Term Memory, a building unit for RNNs that has achieved outstanding results when used for automated image captioning, the process of generating a short natural language description of an input image. We aim at transferring the same capability to the generation of grasping sequences from images of target objects. However, in the course of the experimental assessment, the annotations made available for the dataset used to validate the model have proved to be insufficient for a complete assessment of the performance of our proposed model. Therefore, we employ a synthetic labelling inspired by the original dataset and compare our model with a baseline Markov model in order to determine whether the research question is worth pursuing further once a suitable dataset has been collected.

Our work is structured as follows.

• Chapter 2 provides an introduction to the theory of grasping and then presents the relevant work that has been done in the area of machine learning applied to grasping. We highlight the evolution of approaches, from simple models using sensor data to more advanced models based on deep learning.

• Chapter 3 gives an overview of learning, from machine learning to deep learning. First, the strengths and weaknesses of deep learning are presented along with two of its most common flavours: Convolutional Neural Networks and Recurrent Neural Networks. Then, common techniques used to get the most out of training are covered, together with the most popular deep learning frameworks developed over the years.

• Chapter 4 presents the datasets, motivates the choice of the programming framework and discusses the deep learning models used throughout our work.

• Chapter 5 discusses the challenges faced while developing our models and the solutions proposed, such as pseudo-coloring and synthetic labelling. We also discuss techniques for dealing with datasets in general, such as Batch Normalization and Data Augmentation. Finally, the results of our experiments are presented.

• Chapter 6 discusses the results of our experiments and presents possible directions for further research.


Background

2.1 A brief history of grasping

Early days. The study of the hand has always fascinated researchers. With its dual function as sensory and motor organ it has always been the gateway to our interaction with the world around us. Since the evolution to bipedal posture freed our hands, our brains and nervous systems have constantly adapted to allow their fine control. For a long time studies on the inner workings of grasping focused exclusively on its neurophysiological aspects, and only recently has attention shifted to the movements themselves. With the hand's 25 degrees of freedom, the number of possible movements had been considered too vast to collect into a functional and comprehensive framework. It was only with the work of Napier [1956] that a first attempt at describing and classifying the most relevant hand movements was made. His focus was exclusively on static hand positions and the principal types of grasping (e.g. power grip vs precision grip). However, it later became general consensus that the action of reaching is essential to understand grasping, as Nowak and Hermsdörfer [2009] summarize: "one cannot grasp an object without first reaching for it". Over time the findings of numerous studies have been collected in taxonomies such as the one introduced by Cutkosky [1989] or the "GRASP Taxonomy" from Feix et al. [2016].

Components of grasping. The work of Iberall and Fagg [1996] introduces the term “opposition” to describe basic forms of force application patterns required for grasping. In particular it identifies:

• Pad opposition which occurs when an object is held between a set of fingers and the thumb.

• Palm opposition which occurs when an object is held with the fingers opposing the palm.

• Side opposition which occurs when the thumb's volar surface opposes the radial sides of the fingers.

The choice of fingers to use depends on many factors including object properties, the manipulation required after the grip, environmental constraints and the anatomy of the forearm. Hoff and Arbib [1993] have postulated the existence of a higher-level schema that coordinates the controllers ("schemas") for reach and grasp. The overall control of the movement is achieved by a modular decomposition of transport, preshape and enclose controllers.


Figure 2.1: Grasping taxonomy (from Heinemann et al. [2015])

Another aspect of human grasping that has often been neglected is the exploitation of environmental constraints. Despite the fact that humans vastly outperform robots when it comes to grasping, their movements are often far from "ideal". Through contact with the environment humans can guide their movements to achieve robust grasping even under substantial uncertainty, as in the case of impaired vision, while their performance degrades if they are forced to abstain from using environmental aids (Puhlmann et al. [2016]).

Taxonomies. Having identified the key components of human grasping, the next step was to incorporate those findings into accurate taxonomies of grasping primitives. This happened gradually: over time many such taxonomies have been created and merged to form more detailed ones, such as Feix et al. [2009], which combines over 30 of them. However, such taxonomies still did not fully capture the pre-grasp interactions that precede the final static hand posture, nor the exploitation of environmental constraints. A taxonomy that focuses on those two aspects has been presented in Heinemann et al. [2015] and contains 6 fundamental grasping primitives: reach, close, slide, edge-grasp, flip and fail. These grasping primitives are combined with modifiers to form a grasp strategy (see Figure 2.1). The taxonomy that we will use in our work is an extended version of the aforementioned one.

2.2 Grasping Related Work

Seminal works from the 90s. We now present some results in the field of grasping to show how it has evolved over time, with a focus on neural network based approaches like ours; we refer the reader to Shimoga [1996], Bicchi and Kumar [2000] and Siciliano and Khatib [2008] for a more general survey of past work in robotic manipulation. We start with three seminal works from the '90s and then progress to more recent results, while pointing out the main obstacles and research directions taken. One of the earliest results in this area to apply machine learning, in the form of neural networks, to the study of grasping is the previously mentioned work of Iberall and Fagg [1996]. Their approach used a feedforward neural network to determine which opposition and which fingers to use for a given set of task requirements (see Figure 2.2a). Their work required the explicit extraction of object properties and of force and precision requirements, used respectively as inputs and targets of their net. Despite its limitations it represents one of the first attempts at using machine learning concepts in a field previously approached from a medical/descriptive point of view.

(a) Shallow Neural Networks for grasping from Iberall and Fagg [1996]

(b) Autoencoder structure from Uno et al. [1995]

Figure 2.2: Seminal applications of machine learning models to grasping

A more advanced attempt at using neural networks was made by Uno et al. [1995]. In their work they implemented a generative model using what we would now call an autoencoder (see Figure 2.2b). Compared to the work of Iberall and Fagg [1996], they used as inputs a heavily downscaled image of the object to grasp (an 8x8 version of the original image) and the data from 12 sensors on a DataGlove input device worn by the test subject. They then trained their network using a combination of backpropagation and constrained optimization, and studied the internal representation to see if it had learned interesting features about the task. The work represented an interesting step in the direction of a more scalable approach since it utilized the image of the object directly, but, in contrast with more recent advances in computer vision, limited computational resources forced them to use downscaling and to exploit symmetries in the input images to limit the number of pixels. The third approach we present in this short review of seminal works is the one from Kamon et al. [1996]. Similarly to Uno et al. [1995] they used images to create a visually guided grasping system. They created a dedicated vision subsystem in charge of extracting features such as the center of mass and the main axis and of segmenting the input images. The extracted features were then fed to the learning subsystem, which had to choose grasp points on the object's boundary.

Steps forward from the early 2000s. Moving closer to the present we can see how advances in computer vision, machine learning and technology have progressed since those first attempts. We can classify modern works in roughly two categories: those that rely on spatial information in the form of 3D models or RGB-D (2.5D) data, and those that use only visual information in the form of 2D images. On one hand, the advantage of using 2.5/3D data is rooted in the geometric nature of planning robotic grasping, but it has the downside of requiring specialized depth-enabled cameras or pre-existing 3D models. On the other hand, the 2D approach can leverage a vast wealth of existing images, which is an especially desirable property when training data-hungry deep models. Among those that went for a 3D model based approach notable examples are the following:

• Miller et al. [2003] used heuristic rules to generate and evaluate grasps for three-fingered hands by assuming that the objects are made of basic shapes such as spheres, boxes, cones and cylinders each with pre-computed grasp primitives.


• Pelossof et al. [2004] used Support Vector Machines (SVM) to estimate the quality of a grasp given a number of features describing the grasp and the object.

• Hsiao et al. [2007] and Hsiao and Lozano-Perez [2006] used partially observable Markov decision processes (POMDPs) to choose optimal control policies for two-fingered hands.

However these methods were not tested through real-world experiments, but were instead modelled and evaluated in a simulator. Those attempts that instead followed a vision based approach with real world experiments were mostly focused on 2D planar objects. Among such works we highlight:

• Piater [2002] and Coelho et al. [2001] estimated 2D hand orientation using K-means clustering for simple objects (specifically, square, triangle and round "blocks").

• Morales et al. [2002a] and Morales et al. [2002b] calculated 2D positions of three-fingered grasps from 2D object contours based on feasibility and force closure criteria.

• Bowers and Lumia [2003] considered the grasping of planar objects and chose the location of the three fingers of a hand by first classifying the object as circle, triangle, square or rectangle from a few visual features, and then using pre-scripted rules based on fuzzy logic.

• For grasping known objects, Hueser et al. [2006] used Learning-by-Demonstration, in which a human operator demonstrates how to grasp an object, and the robot learns to grasp that object by observing the human hand through vision.

Recent results: towards deep learning. We start with a work attempting to use simple 2D images. In practical scenarios it is often very difficult to obtain a full and accurate 3D reconstruction of an object seen for the first time through vision. Even if more specialized sensors such as laser scanners (or active stereo) are used to estimate the object's shape, we would still only have a 3D reconstruction of the front face of the object.

Saxena et al. [2008] developed a probabilistic model, whose parameters were learned through maximum likelihood for logistic regression, that takes two or more pictures of the object and then tries to identify a point within each 2D image that corresponds to a good point at which to grasp the object. Their work takes inspiration from Castiello [2005], which showed that cognitive cues and previously learned knowledge both play major roles in visually guided grasping in humans and in monkeys. This indicates that learning from previous knowledge is an important component of grasping novel objects. Given two (or more) images of an object taken from different camera positions and the 2D points identified in each image, they use triangulation to predict the 3D position at which to actually attempt the grasp. To train the model they generate synthetic images along with correct grasps using a computer graphics ray tracer. The choice of using a ray tracer (instead of faster but cruder OpenGL-style graphics) was motivated by the relation between the quality and realism of the synthetically generated images and the accuracy of the results. This observation hints at the presence of subtle visual cues useful for grasping beyond just the volumetric information. It is also interesting to note that the accuracy in predicting 3D grasping points was higher than the one achieved when classifying 2D regions as grasping points, because the probabilistic model for inferring a 3D grasping point automatically aggregates data from multiple images and therefore "fixes" some of the errors of the individual classifiers. A later work by Lenz et al. [2015] instead attempted to use deep learning, with depth as an additional feature instead of just raw images as input.


Figure 2.3: From left to right: RGB-D image is acquired, and searched over a large space of possible grasps. For each of these, a set of raw features corresponding to the color and depth images and surface normals is extracted, then used as inputs to a deep network which scores each rectangle. (from Lenz et al. [2015])

Figure 2.4: Grasping Quality CNN (from Mahler et al. [2017])

Their approach consisted in using a small deep network to extract features from numerous possible grasping positions. The best candidates were then fed to a bigger deep network tasked with selecting the best candidate (see Figure 2.3). They also analyzed how to combine multimodal data coming from different sources and implemented regularization techniques to further improve their results. Another interesting approach is the one used by Mahler et al. [2017], which builds upon the idea from Saxena et al. [2008] of using synthetic data by building a vast dataset of 3D models. Since training on real images may require significant data collection time, an alternative approach is to learn on simulated images and to adapt the representation to real data. Their goal was to learn the robustness of a grasp, given an observation, as the probability of success under uncertainty in sensing and control, using what they call a "Grasp Quality Convolutional Network" (GQ-CNN, see Figure 2.4). The estimated robustness function can be used to select the best grasping policy among a set of candidates subject to constraints such as collisions or kinematic feasibility. Solving for the grasp robustness function is challenging and requires a lot of data. To address this they generated Dex-Net 2.0, a training dataset of 6.7 million synthetic point clouds, parallel-jaw grasps, and robust analytic grasp metrics across 1,500 3D models. So far we have seen some key examples of works exploring the capabilities of probabilistic models on 2D images (Saxena et al. [2008]), deep models trained on 2.5D images and synthetic data (Lenz et al. [2015]), and convolutional networks pretrained on synthetic 3D models (Mahler et al. [2017]).


All these works, however, employed simple two-point grippers. For such a simple gripper little work is required to adapt the orientation to the actual grasp, which is not the case when using more complex multi-fingered grippers. As analyzed in Chen et al. [2015], multi-fingered dexterous robotic hands differ from two-jaw grippers and underactuated hands in the variety of grasp types they can achieve and in their unique capability for in-hand manipulation. Due to the multi-finger contact with the object, the in-hand pose of the object can also be estimated more effectively, facilitating the execution of follow-up tasks such as object manipulation. However, the grasp quality and manipulation performance of multi-fingered robotic hands rely on grasp planning algorithms based on known object models or information. In scenarios where objects are not in the expected location, or the end-effector is not in the expected configuration as the robot is commanded, unexpected contacts or collisions caused by the uncertainties during the grasping task execution can result in grasp failure or poor grasp quality. This in turn hinders the performance of object manipulation tasks further down the task chain. The work from Yu et al. [2013], despite still being focused on two-point grippers, attempts to estimate the object pose in order to adapt the gripper orientation during the grasp, but it is still a long way from the sort of mid-flight orientation adaptation seen in human grasping.

Summarizing. We have seen how the first computational attempts at grasping used rudimentary shallow neural nets, trained on hand-crafted features, sensor inputs or downscaled images. Over time many different kinds of approaches were tested, from SVMs to POMDPs to clustering, with a clear shift from hand-crafted features toward models able to learn features from 2/2.5D images or 3D models, using deeper and convolution-based architectures. We have also seen how the focus has been mostly on two-point grippers, and the challenges associated with using multi-fingered robotic grippers. Our work builds on these foundations by employing deep convolutional models trained on simple 2D images as feature extractors for recurrent models aimed at guiding the grasping of a multi-fingered robotic hand towards a human-like grasp.


Deep Learning

3.1 From learning to machine learning to deep learning

Over the course of history humans have always dreamed of imbuing machines with intelligence. After having spent millennia refining the art of reasoning, soon after the creation of the first computers humans tried to transfer that intelligence into them. Two elements are essential in order to achieve that goal:

• a way of “inserting” knowledge into machines in a format they can work with.

• a way of “manipulating” knowledge in order to refine it and extract new information from it that can be used to solve a target problem.

A classic example would be the game of chess. In this scenario the programmer would have inserted knowledge into the machine in the form of a numerical representation of board positions and then enabled the manipulation of such information through ad-hoc heuristics (e.g. strategic placement of pieces, difference in pieces captured, etc.) that leveraged the knowledge slowly accumulated by chess experts over the years, plus brute-force simulation of multiple games at the same time. This proved to be a successful approach that allowed machines to defeat human chess grandmasters. However, this was possible because, despite the apparently intractable vastness of the state space of chess games, the rules of chess can be formalized easily, the state of the board is fully known, and the effects of moves are deterministic. When researchers tried to tackle more ambiguous problems, or problems for which even defining heuristics was extremely difficult, the limits of this approach became apparent. What if the outcome of an action is not deterministic? What are the rules to distinguish an animal from an object? Hard-coded heuristics proved to be very time consuming to come up with, if possible at all, and required the knowledge of experts. The difficulties faced by these systems suggested that AI systems needed the ability to acquire their own knowledge, by extracting patterns from raw data. This capability is known as machine learning. Leveraging this new perspective, simple methods such as logistic regression or naive Bayes gained more and more popularity. But soon another obstacle appeared: the performance of these simple machine learning methods depended heavily on the quality of the representations they were given. Extracting relevant features from the raw data dramatically improved their performance. These features could be selected, for example, by removing unnecessary information (e.g. truncated PCA) or by cleverly refining the data in order to capture abstract relationships that might not have been explicit in the original dataset.


Even if the use of machine learning had relieved programmers from the task of hard-coding explicit rules for the program to follow by switching to data driven models, the selection of an appropriate representation was still a difficult task. An obvious solution to this problem has been the introduction of representation learning, in which the machine learning model is required not only to find a good mapping between inputs and outputs but also to turn the raw inputs into a more suitable representation. The process of disentangling the noisy dimensions from the actually useful ones and refining them in order to capture high-level abstract features has proven to be as challenging as the learning process itself. Deep learning solves this central problem of representation learning by leveraging the hierarchical nature of high level features, introducing representations that are expressed in terms of other, simpler representations. A perfect example of a deep learning model is the multilayer perceptron (MLP). A multilayer perceptron represents a mathematical function mapping some set of input values to output values. The model is structured as a sequence of layers, each layer taking as input the output of the previous one and applying some linear and non-linear transformations to it. This sequential application of functions maps the input data into increasingly abstract spaces that we can think of as higher level representations of the low level inputs. Other than providing a data driven way of discovering good representations, the nature of deep learning allows for multi-step computation. Each layer of the representation can be thought of as the state of the computer's memory after executing another set of instructions in parallel; networks with greater depth can execute more instructions in sequence (Goodfellow et al. [2016]). One might wonder whether depth is really the reason for the better results obtained with deep models or whether, since deep models tend to have more parameters, the improvements are simply due to the larger size of the model; experiments from Goodfellow et al. [2013] show that this is not the case. In their work they attempt to recognize arbitrary multi-digit numbers from Street View imagery through the use of a deep convolutional neural network capable of performing the localization, segmentation and recognition steps operating directly on the image pixels. Their results show that depth is necessary for good performance on this multistep task and that large but shallow models cannot achieve the same level of performance (see Figure 3.1).
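To make the layered-representation idea concrete, the following minimal sketch (illustrative layer sizes, not the architecture used in this work) shows a multilayer perceptron in PyTorch, the framework adopted later in this thesis: each layer maps the previous representation into a new one through a linear transformation followed by a non-linearity.

    import torch
    import torch.nn as nn

    # A minimal multilayer perceptron: each layer applies a linear map followed
    # by a non-linearity, producing progressively more abstract representations.
    mlp = nn.Sequential(
        nn.Linear(784, 256),  # raw input -> first hidden representation
        nn.ReLU(),
        nn.Linear(256, 64),   # first -> second, more abstract, representation
        nn.ReLU(),
        nn.Linear(64, 10),    # final representation -> output scores
    )

    x = torch.randn(32, 784)  # a batch of 32 flattened inputs
    y = mlp(x)                # shape: (32, 10)
    print(y.shape)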

3.2 How deep is deep learning?

Before going further with our presentation of deep learning we need to address neural networks' very own sorites paradox: if adding one more layer does not make a net deep, when does deep learning begin? First of all we need to define a way to measure the "depth of learning". Learning, or credit assignment, is about finding weights that make the MLP exhibit the desired behavior. Depending on the problem and on how the neurons are connected, such behavior may require long causal chains of computational stages. We are going to use the notation introduced in Schmidhuber [2015] for Credit Assignment Paths (CAPs), using event-oriented notation for activations spreading in MLPs. Let us consider a single finite episode or epoch of information processing, without learning through weight changes. During an episode, there is a partially causal sequence x_t (t = 1, ..., T) of real values called events. Each x_t is either an input set by the environment, or the activation of a unit that may directly depend on other x_k (k < t) through a current NN topology-dependent set in_t of indices k representing incoming causal connections or links.

(15)

Figure 3.1: Comparison of accuracy and size of various models. Notice that even the smallest deep model performs better than the biggest shallow one (from Goodfellow et al. [2013]).

Definition 3.2.1. Potential Direct Causal Connection (PDCC). Consider two events x_p and x_q (1 ≤ p < q ≤ T); a PDCC is a Boolean predicate pdcc(p, q), which is true if and only if p ∈ in_q. Then the 2-element list (p, q) is defined to be a CAP (a minimal one) from p to q.

Definition 3.2.2. Potential Causal Connection (PCC). PCC are expressed by the recursively defined Boolean predicate

pcc(p, q) ≡ pdcc(p, q) ∨ ∃k : pcc(p, k) ∧ pdcc(k, q).

Suppose a CAP has the form (..., k, t, ..., q), where k and t (possibly t = q) are the first successive elements with modifiable weights. Then the length of the suffix list (t, ..., q) is called the CAP's depth (which is 0 if there are no modifiable links at all). This depth limits how far backwards credit assignment can move down the causal chain to find a modifiable weight. Suppose an episode and its event sequence x_1, ..., x_T satisfy a computable criterion used to decide whether a given problem has been solved (e.g., total error E below some threshold). Then the depth of the deepest CAP within the sequence is called the solution depth. Given some fixed NN topology, the smallest depth of any solution is called the problem depth. Notice that in the case of feedforward neural networks the solution depth is limited by the model depth, but in the case of recurrent neural networks it is potentially unlimited. In this framework one might ask at what depth a shallow network becomes deep, but unfortunately there is no clear consensus among the experts in the field. However, it is generally agreed that problems with depth greater than 10 require very deep learning.
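The recursive predicates above translate directly into code. The following sketch, over a small hypothetical event topology (the dictionary of incoming links is purely illustrative), shows how pdcc and pcc can be evaluated; it is an illustration of Definitions 3.2.1 and 3.2.2, not an algorithm taken from Schmidhuber [2015].

    # Toy illustration of Definitions 3.2.1/3.2.2: events are indexed 1..T and
    # incoming[q] is the set in_q of indices with a direct causal link into q.
    incoming = {1: set(), 2: {1}, 3: {1, 2}, 4: {3}}  # hypothetical topology

    def pdcc(p, q):
        """Potential direct causal connection: true iff p is in in_q."""
        return p in incoming.get(q, set())

    def pcc(p, q):
        """Potential causal connection: direct, or via some intermediate event k."""
        if pdcc(p, q):
            return True
        return any(pcc(p, k) and pdcc(k, q) for k in incoming.get(q, set()))

    print(pcc(1, 4))  # True: the chain 1 -> 2 -> 3 -> 4 is a CAP from 1 to 4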

3.3 Flavours of Deep learning

When training deep neural networks the two types of nets most frequently used are convolutional and recurrent. Fully connected layers require a very large number of weights, since every input is connected to every output of the layer. This means that, unless a drastic bottleneck is introduced in the architecture, the number of parameters can easily reach hundreds of millions in a deep network whose inputs are very large, as is the case when dealing with images. In those cases it is advantageous to use deep stacks of "convolutional" layers that exploit weight sharing to keep the number of weights manageable while providing other features especially suited to images, such as a certain degree of translation invariance. When the inputs to a net are instead arbitrary sequences it is appropriate to use recurrent networks. In this case the network is "unrolled" along the input sequence, and the outputs at each step are fed back into the net in a process similar to recursion. As opposed to the "vertical" depth of convolutional networks, recurrent networks display "horizontal" depth, being able to capture arbitrarily distant relationships between parts of the input sequence. We are now going to look at both of these flavours of neural networks in more detail, presenting their strengths, their challenges and their evolution over time.

3.3.1 Convolutional Neural Networks

We are now going to focus our attention on convolutional neural networks (CNNs). From a mathematical point of view a convolution is an integral that expresses the amount of overlap of one function g as it is shifted over another function f; it therefore "blends" one function with another. The convolution of a function f with a function g, written f ∗ g, is defined as:

(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau.

In convolutional network terminology, the function f is often referred to as the input and the function g as the kernel. The result of this operation is sometimes referred to as the feature map. If g is a valid probability density function we can view the convolution as performing a weighted average. Since each point of a function is replaced with an average of neighbouring points the convolution operation is usually used to smooth a noisy function (see Figure 3.2).

Given the discrete nature of data in a computer, the aforementioned formula is usually replaced with a discretized version where the integral becomes a sum:

(f * g)(t) = \sum_{\tau=-\infty}^{\infty} f(\tau)\, g(t - \tau).

We are also going to assume that the kernel function is zero everywhere except for the finite number of values that we specify. This allows us to use a compact representation of our kernel function as an array while still being able to implement the infinite summation. Since the convolution operation computes the amount of overlap between the input and the kernel we can use it to “search” for specific patterns in the input. Suppose for example that we are interested in finding peaks (i.e. points whose value is higher than their neighbours) in a simple 1D function. A suitable kernel for this operation would be g = [−1, 2, −1]. We can see the result of its convolution with a pyramidal structure in Figure 3.4. The same discrete convolution we have previously described can be extended to multiple dimensions simply by adding a separate summation for each axis:

(F * G)(x, y) = \sum_{m=-\infty}^{\infty} \sum_{n=-\infty}^{\infty} F(m, n)\, G(x - m, y - n).
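As an illustration of the discrete formulas above, the short sketch below applies the peak-detecting kernel g = [−1, 2, −1] to a small pyramidal signal using PyTorch's conv1d. Strictly speaking, deep learning frameworks compute a cross-correlation rather than a convolution, but for a symmetric kernel such as this one the two coincide.

    import torch
    import torch.nn.functional as F

    # A small triangular ("pyramidal") signal and the peak-detecting kernel g = [-1, 2, -1].
    f = torch.tensor([0., 1., 2., 3., 2., 1., 0.]).view(1, 1, -1)  # (batch, channels, length)
    g = torch.tensor([-1., 2., -1.]).view(1, 1, -1)

    # conv1d computes a cross-correlation; with this symmetric kernel the result
    # equals the discrete convolution defined above.
    out = F.conv1d(f, g, padding=1)
    print(out.view(-1))  # [-1, 0, 0, 2, 0, 0, -1]: a positive spike at the peak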



Figure 3.2: Application of a Gaussian kernel to remove noise from a sine function. Top: the original sine function and the version with added noise. Bottom: the result of the convolution with the kernel

Let's now consider a grayscale image of size width × height. We can think of it as a function that takes two coordinates and returns the grayscale value of the corresponding pixel (in the case of RGB images it would return a triplet of numbers instead). This means that we can apply convolutions to this "image" function after extending it by setting the value of each point outside of its bounds to zero. Depending on the choice of the kernel different visual effects can be achieved, such as blurring, sharpening, edge detection, etc. (see Figure 3.3). While in most visual editing applications these filters are pre-determined and chosen to achieve particular visual effects, the approach of convolutional networks is instead to learn these kernels as part of the training process.

Figure 3.3: Edge sharpening kernel

Advantages of Convolution

The main advantages of using convolution are: sparse connectivity, weight sharing and equivariance.

Sparse connectivity. One of the major problems of fully connected (FC) layers that we have mentioned before is the need to have connections from each input neuron to every output. This means that a FC layer having i inputs and o outputs will require O(i × o) parameters.



Figure 3.4: Application of a simple kernel to detect edges. Small: the kernel g representing the pattern we are looking for. Big: the function f and the result of the convolution of g on it. Notice the positive spikes corresponding to the location in the function where the pattern (i.e. peak) is found and the negative one where the pattern is inverted (i.e. valley)

When using convolutional layers, the number of multiplications needed to compute a single output equals the number of elements in the kernel, which is usually much smaller than the size of the input. This small number of connections between the inputs and a specific output makes the computation much faster, requiring only O(k × o) operations for a kernel of size k.

Weight sharing. In a conventional FC layer each weight is "unique" in the sense that it connects a unique pair of neurons. Instead, given the nature of the convolution operation, when computing the feature map each value of the kernel is applied at nearly every position of the input. This "replication" of the weights makes the storage requirements much smaller, as a single kernel has to be stored in order to compute the full output, and the size of a kernel is extremely small. As we have seen, a single kernel can be used to select specific low level features such as vertical or horizontal edges or changes in luminosity. In order to represent all the relevant features of a given image, multiple of these simple kernels are needed; therefore the processing done at a specific convolutional layer involves the convolution of many different kernels. Let's consider for example an input image of size h × w × c (in the case of an RGB image c would be 3). A convolutional layer will typically consist of n kernels of sizes {h_i × w_i × c_i | i ∈ [1, ..., n]}. The output of the convolutional layer will have n channels (the number n is usually called the number of channels by overloading the term used to describe the RGB channels of input images), and the number of weights required, on the order of n × h_i × w_i × c_i, is much smaller than that of a FC layer, which could require a number of weights of the order of (h × w × c)².
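The savings granted by sparse connectivity and weight sharing are easy to verify. The sketch below uses arbitrary illustrative sizes (not those of the models in this thesis) and compares the parameter count of a convolutional layer with that of a fully connected layer producing an output of the same size; the latter count is only computed arithmetically, since actually allocating such a layer would be wasteful.

    import torch.nn as nn

    h, w, c = 32, 32, 3   # illustrative input size (a small RGB image)
    n, k = 16, 3          # number of kernels and kernel size

    # Convolutional layer: n kernels of size k x k x c, plus n biases.
    conv = nn.Conv2d(in_channels=c, out_channels=n, kernel_size=k, padding=1)
    conv_params = sum(p.numel() for p in conv.parameters())

    # A fully connected layer producing an output of size h x w x n would need
    # one weight per (input, output) pair; we only compute the count here.
    fc_params = (h * w * c) * (h * w * n) + (h * w * n)

    print(f"convolutional:   {conv_params:,} parameters")  # 16*3*3*3 + 16 = 448
    print(f"fully connected: {fc_params:,} parameters")    # roughly 50 million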

Equivariance. The representation obtained through convolutions is equivariant with respect to translation. Formally, a function f(x) is equivariant to a function g if f(g(x)) = g(f(x)). If for example we have a transformation that shifts the input image by one pixel, applying a convolution to the shifted input is the same as shifting the output of the convolution performed on the original image. This property is useful when dealing with low level features such as edges, since the same edge may appear multiple times at different locations and would therefore appear multiple times in the feature map.


Unfortunately the convolution operator is not equivariant with respect to operations such as rotations or scaling. If for example a kernel detects vertical edges, a 90° rotation would turn vertical edges into horizontal ones, and so the feature map would change radically rather than simply be a rotation of the previous one.

When applying deep CNNs to image classification the number of output classes is far smaller than the input size. The convolution operation we have described so far cannot change the spatial size of the output, only the number of channels. To bridge this gap and manage the width of a CNN three mechanisms are used: pooling, padding and strides.

Pooling. Usually a convolutional layer is composed of convolution, activation function and pooling. When using pooling the input is divided into pools and from each pool a summary statistic is extracted. For example the max pooling operation (Zhou and Chellappa [1988]) reports the maximum output within a rectangular neighborhood. Applying this pooling operation with a pool size of p effectively reduces the size by a factor of p. Furthermore the representation becomes invariant to small translations, because any translation smaller than the pool size has no effect on the output.

Padding. When we presented the convolution operator we implicitly applied some padding. If that were not the case, when a kernel is convolved over the input the size would invariably decrease by at least one. Consider for example the simplest case of a 2 × 2 input image and a kernel size of 2 × 2. The result of this convolution would be a single 1 × 1 value. To prevent this shrinkage it is possible to add zero-padding around the image. Adding one layer of zero-padding to our previous example would turn the input image into a 3 × 3 image, and the output of the convolution would be a 2 × 2 feature map of the same size as the unpadded input. When padding is added such that input and output sizes match the padding is called "same"; when no padding is added it is called "valid".

Strides. One last mechanism to control the output size is to employ strides greater than one. A stride is the amount of displacement of the kernel at each step of the convolution. In the formula we have presented the stride size was implicitly 1, meaning that at each iteration the kernel was moved by just one location along the input axes. We can also think of a stride of size s as downsampling the output of the full convolution by keeping only the result every s positions. For more information about the convolution arithmetic of the various combinations of padding and striding see Dumoulin and Visin [2016].
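The effect of padding, strides and pooling on the output size can be checked directly; the sketch below uses arbitrary illustrative sizes and matches the output-size formula given next.

    import torch
    import torch.nn as nn

    x = torch.randn(1, 3, 32, 32)  # one RGB image of size 32 x 32

    same = nn.Conv2d(3, 8, kernel_size=3, padding=1)           # "same" padding
    valid = nn.Conv2d(3, 8, kernel_size=3, padding=0)          # "valid" (no) padding
    strided = nn.Conv2d(3, 8, kernel_size=3, padding=1, stride=2)
    pool = nn.MaxPool2d(kernel_size=2)                         # pool size p = 2

    print(same(x).shape)        # (1, 8, 32, 32): spatial size preserved
    print(valid(x).shape)       # (1, 8, 30, 30): shrinks by k - 1
    print(strided(x).shape)     # (1, 8, 16, 16): downsampled by the stride
    print(pool(same(x)).shape)  # (1, 8, 16, 16): reduced by the pool size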

We can now compute the output size as a function of the input size (W), the kernel size of the convolution (F), the stride with which it is applied (S), and the amount of zero padding used on the border (P):

(W − F + 2P)/S + 1.

Evolution of architectures

Having presented the elementary building blocks of convolutional neural networks, we are now going to present a brief overview of the main architectures that have been assembled using those building blocks and we will see the clear trend towards increasingly deeper architectures.

AlexNet. While one of the first successful applications of convolutional neural networks was the architecture known as "LeNet" (LeCun et al. [1998a]), it was not until 2012, with the creation of "AlexNet" (Krizhevsky et al. [2012]), that deep nets started blooming in the computer vision field. It competed and won in the 2012 ILSVRC (ImageNet Large-Scale Visual Recognition Challenge), achieving an error rate 10% lower than the runner-up,


which was an outstanding result in a competition where an improvement of a few percent was regarded as a crowning achievement. The network was made up of 5 convolutional layers, max-pooling layers, dropout layers, and 3 fully connected layers designed for classification with 1000 possible categories. Its training required two GTX 580 GPUs for five to six days. Despite the amazing result of AlexNet, the inner workings of CNNs were still not well understood and researchers often resorted more to trial and error than to first principles. An improvement over AlexNet was achieved the next year by ZFNet, the winner of ILSVRC 2013. The team of ZFNet (Zeiler and Fergus [2014]) also provided a careful analysis of their design choices and a useful tool called Deconvolutional Network (deconvnet), which shed some light on the kind of learning going on in the kernels of the net by propagating back to the image from the activations of a single kernel in order to determine which images activated it the most.

VGG. In 2014 a team from the Visual Geometry Group of Oxford took another significant step in the direction of deep networks. They proposed an architecture known as VGGNet (Simonyan and Zisserman [2014]), combining smaller kernels with more depth. Despite the small 3 × 3 kernels, stacking 3 convolutional layers back to back increases the effective receptive field, achieving the same coverage as a 7 × 7 kernel but with more non-linearities and fewer parameters (3 × (3²C²) vs 7²C², for C channels per layer). It contained almost 20 layers, with the number of filters doubling after each pooling operation, following the idea of shrinking spatial dimensions while growing depth. Despite achieving only second place at the ILSVRC 2014 competition, the architecture was very successful as it showcased the power of combining simplicity and depth. Training required two to three weeks on 4 Nvidia Titan Black GPUs.
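A quick check of the receptive-field argument made for VGG: the sketch below (with an arbitrary illustrative number of channels C) compares the parameter count of three stacked 3 × 3 convolutions, whose effective receptive field is 7 × 7, with that of a single 7 × 7 convolution.

    import torch.nn as nn

    C = 64  # illustrative number of channels per layer

    # Three stacked 3x3 convolutions: effective receptive field of 7x7,
    # with three non-linearities along the way.
    stacked = nn.Sequential(
        nn.Conv2d(C, C, 3, padding=1), nn.ReLU(),
        nn.Conv2d(C, C, 3, padding=1), nn.ReLU(),
        nn.Conv2d(C, C, 3, padding=1), nn.ReLU(),
    )
    # A single 7x7 convolution covering the same receptive field at once.
    single = nn.Conv2d(C, C, 7, padding=3)

    def n_params(m):
        return sum(p.numel() for p in m.parameters())

    print(n_params(stacked))  # 3 * (3^2 * C^2) plus biases, ~110k for C = 64
    print(n_params(single))   # 7^2 * C^2 plus biases, ~200k for C = 64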

Figure 3.5: Inception module (from Szegedy et al. [2014])

GoogLeNet. From the same year as VGG, GoogLeNet (Szegedy et al. [2014]) was the winner of ILSVRC 2014 with a top-5 error rate of 6.7%. As opposed to the simple stacking of convolutional layers used by VGG, GoogLeNet stacked what they called "inception modules" (see Figure 3.5). The key concepts implemented in this module are channel depth manipulation and kernel flexibility. Channel depth manipulation is used to improve computational efficiency by utilizing 1 × 1 kernels (introduced by Lin et al. [2013]). Since these modules feed outputs into each other, repeatedly applying C′ kernels of size 1 × 1 × C to an input of size w × h × C yields an output volume of size w × h × C′, which prevents an unlimited growth of channel depth while preserving the spatial input sizes. Furthermore, by reducing the channel depth the subsequent convolutions using bigger kernels are much cheaper computationally, allowing the module to explore an increased number of kernel sizes. Instead of restricting the kernel size to only one value, kernels of size 1, 3 and 5 are used in parallel and their outputs are combined together by filter concatenation. One last detail worth mentioning is the injection of additional gradients at intermediate layers to help train the lower layers and act as a regularizer. By stacking 9 inception modules GoogLeNet reached a depth of 22 layers while still having 12x fewer parameters than AlexNet. It was trained for approximately a week on a few high end GPUs.

Figure 3.6: Residual Block (from He et al. [2016])

ResNet. This general trend toward increasingly deep architectures really took off in 2015, when a team from Microsoft won the ILSVRC using 152 layers (He et al. [2016]). The team observed that after a certain depth the performance degrades significantly, both in testing and in training. Their hypothesis was that with so many layers the model becomes extremely difficult to optimize. They therefore devised a model composed of a very deep stack of what they called "residual blocks" (see Figure 3.6), which help cope with this complexity. In a standard convolutional layer, given an input x, the layer would learn an underlying mapping F(x). In a residual block the input is added to the output, so that the layers only have to learn the residual F(x) of the overall mapping H(x) = F(x) + x. After observing the performance degradation introduced by increasing depth, the authors believed that it would be easier to optimize the residual mapping than to optimize the original unreferenced mapping, because if the added layers can be constructed as identity mappings, a deeper model should have a training error no greater than its shallower counterpart. Their experiments show that the learned residual functions in general have small responses, suggesting that identity mappings provide reasonable preconditioning. ResNet required two to three weeks of training on an 8 GPU machine.
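A minimal sketch of the residual idea just described, assuming input and output of equal size so that the identity shortcut can be added directly; the blocks actually used in He et al. [2016] also include batch normalization and, where sizes change, a projection on the shortcut.

    import torch
    import torch.nn as nn

    class ResidualBlock(nn.Module):
        """H(x) = F(x) + x: the stacked layers only have to learn the residual F."""
        def __init__(self, channels):
            super().__init__()
            self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
            self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
            self.relu = nn.ReLU()

        def forward(self, x):
            residual = self.conv2(self.relu(self.conv1(x)))  # F(x)
            return self.relu(residual + x)                   # F(x) + x

    block = ResidualBlock(64)
    x = torch.randn(1, 64, 56, 56)
    print(block(x).shape)  # (1, 64, 56, 56): shapes match, the shortcut is the identity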

(a) Single-crop top-1 validation accuracy for top scoring single-model architectures

(b) Top-1 one-crop accuracy versus amount of operations required for a single forward pass

Figure 3.7: Comparison of CNNs (from Canziani et al. [2016])

3.3.2 Recurrent Neural Networks

We have seen that CNNs are the perfect fit when dealing with images, but when it comes to sequences, despite some successful convolutional applications (van den Oord et al. [2016]), Recurrent Neural Networks (RNNs) are better suited for the task. RNNs are able to deal with sequences of arbitrary length x^{(1)}, ..., x^{(n)} and can show incredible depth when considering the length of the horizontal paths information can travel along. The key to going from feedforward networks to recurrent ones is "parameter sharing". Having one parameter per input would require knowing in advance the maximum length of an input sequence, would be wasteful if lengths vary significantly, and would not be able to "reuse" knowledge, since different arrangements of the same input values would be treated as completely different. One could think of applying 1D convolutions to allow parameter sharing over time, but even when using dilated convolutions (i.e. convolutions whose kernel elements are spaced apart so that stacked layers cover a wider context) the output would be a function of only a fixed number of neighbouring points. RNNs instead achieve parameter sharing in a very different way: each output is a function of the previous one, obtained by applying the same update rule in a recursive fashion. Their recurrent nature allows RNNs to have a dynamical memory of the input history. Since the outputs at one step are fed back into the next, information can persist and be carried along the sequence if needed. This memory allows RNNs to essentially act as if running a program with some inputs and internal variables, and indeed RNNs have been shown to be Turing complete (Siegelmann [1995]). In their most basic form we can think of recurrent nets as dynamical systems driven by a signal x^{(t)}, described by the formula

h^{(t)} = f(h^{(t-1)}, x^{(t)}; \theta)

where h^{(t)} is the hidden state of the net at time t (used to compute the output), f is an activation function and θ are the parameters of the net. If we unfold this recurrence over time we can see how it forms long chains of internal states (see Figure 3.8).
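The recurrence above can be written down almost verbatim. The sketch below is a minimal illustration with arbitrary sizes (it is not the LSTM-based model used later in this work) and makes the parameter sharing explicit: the same weights are applied at every time step.

    import torch
    import torch.nn as nn

    input_size, hidden_size, seq_len = 8, 16, 5  # illustrative sizes

    W_in = nn.Linear(input_size, hidden_size, bias=True)
    W_rec = nn.Linear(hidden_size, hidden_size, bias=False)

    x = torch.randn(seq_len, input_size)  # a sequence x(1), ..., x(T)
    h = torch.zeros(hidden_size)          # initial hidden state

    # The same parameters are applied at every step: parameter sharing over time.
    for t in range(seq_len):
        h = torch.tanh(W_rec(h) + W_in(x[t]))  # h(t) = f(h(t-1), x(t); theta)

    print(h.shape)  # the final hidden state summarizes the whole sequence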

(a) Circuit diagram (b) Unfolded computational graph

Figure 3.8: Recurrent Neural Networks

In our example the recurrent net produces one output for every input but that is not necessarily the case, possible alternatives are:

• sequence input and single output, as in the case of sentiment analysis where a sentence is classified as expressing positive or negative sentiments.

• sequence input and, only after, sequence output, as in the case of machine translation where one sentence in one language is processed and then one sentence in another language is produced.

• single input and sequence output, as in the case of image captioning where features from one image are used as a seed to generate a description of the image contents.

Once the computational graph has been unrolled our RNN can be trained by gradient descent applying Back Propagation Through Time (BPTT, Werbos [1990]), a slightly modified version of standard backpropagation that takes into account the replicated nature of the network (i.e. the fact that each apparently distinct node in the unrolled graph is just an instance of the same node at different time steps). Since the information has to propagate forward through time and then be propagated backward along paths of arbitrary length, the execution of BPTT can be quite expensive.
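With an autograd-based framework such as PyTorch, BPTT amounts to computing a loss on the unrolled graph and calling backward on it; the minimal sketch below (arbitrary sizes and targets, using the built-in nn.RNN module) illustrates one such training step.

    import torch
    import torch.nn as nn

    rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
    head = nn.Linear(16, 1)
    optimizer = torch.optim.SGD(list(rnn.parameters()) + list(head.parameters()), lr=0.01)

    x = torch.randn(4, 10, 8)   # batch of 4 sequences, 10 steps, 8 features each
    target = torch.randn(4, 1)  # an arbitrary regression target per sequence

    out, h_n = rnn(x)           # the forward pass unrolls the recurrence over time
    loss = nn.functional.mse_loss(head(h_n[-1]), target)
    loss.backward()             # backpropagation through time along the unrolled graph
    optimizer.step()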

3.3.3 Embeddings

When we introduced CNNs to deal with images we were lucky, since the input was already in a format ready to be ingested by convolutional neural networks. Since the representation of an image in computer memory is a multidimensional collection of numbers ([0, 255] or [0, 1] depending on the specific format), their connection with functions and convolutions was straightforward. For tasks such as object recognition all the information needed to successfully perform the task is encoded in the data, but this might not be the case when performing Natural Language Processing (NLP). In NLP words are treated as discrete atomic symbols and their actual representation in computer memory is completely arbitrary. Consider for example a picture of a cat and the word "cat". A model that has learned to recognize cats with good generalization would be able to recognize a new picture of a cat never seen before. Consider instead a model that has learned to recognize the word "cat" when processing text; such a model would not be able to recognize similar words such as "kitty" or "feline". When presented with a word denoting a similar animal it would not be able to reuse any of the learned knowledge, while a CNN would be able to exploit learned features to recognize similar categories.

Figure 3.9: Male to female vector transformation

To avoid these problems Vector Space Models (VSMs) are used, which embed words in a continuous vector space where semantically similar words are mapped to nearby points. This allows geometric manipulations of these vectors to reflect semantic changes, as in the classic example: king - man + woman = queen (see Figure 3.9). All methods of this type rely on the Distributional Hypothesis, which states that words that appear in the same contexts have similar meanings. The main approaches that rely on this principle can be subdivided into count-based methods, such as Latent Semantic Analysis, and predictive methods, such as neural probabilistic models. As discussed in Baroni et al. [2014], count-based methods compute the statistics of how often some word co-occurs with its neighbor words in a large text corpus, and then map these count statistics down to a small, dense vector for each word. Predictive models directly try to predict a word from its neighbors in terms of learned small, dense embedding vectors (considered parameters of the model). The creation of these embeddings may require extremely long times when the corpora used are very large, but luckily many of them are made publicly available. Famous examples of the two approaches used for NLP are "Word2vec" (Mikolov et al. [2013]) for the predictive approach and "GloVe" (Pennington et al. [2014]) for the count-based one. In our case we have had to train our own embeddings, but since the number of possible "words" (i.e. distinct labels) in our dataset is rather limited the training could be carried out from scratch as part of the model. This functionality is offered by most neural network frameworks, which take care both of learning the embedding and of the necessary lookups (from index to one-hot encoding to the corresponding embedding vector) so that its usage is completely transparent.
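In PyTorch this functionality is provided by nn.Embedding, which is also how label embeddings of this kind can be trained from scratch as part of a model; the vocabulary size and embedding dimension below are arbitrary illustration values.

    import torch
    import torch.nn as nn

    vocab_size, embedding_dim = 20, 8  # e.g. 20 distinct labels
    embedding = nn.Embedding(vocab_size, embedding_dim)

    # Each label is an integer index; the lookup returns its learned dense vector.
    labels = torch.tensor([3, 7, 7, 1])  # a short sequence of label indices
    vectors = embedding(labels)          # shape: (4, 8)
    print(vectors.shape)

    # The embedding weights are ordinary parameters, trained jointly with the model.
    print(embedding.weight.requires_grad)  # True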

3.3.4 Dangers of depth

Having previously introduced a rule of thumb to estimate whether a model can be regarded as shallow or deep, a natural question to ask is how deep a model should be. Unfortunately, to this day, despite the efforts to find a strong theory of deep learning, this choice is still often made using a simple trial and error approach. Given that deeper models are able to learn more sophisticated representations, one might naively wonder why not just keep adding layers over and over. As the depth of a neural network increases new challenges appear, most notably the problem of vanishing or exploding gradients. To see how these problems arise let us consider the case of a simple recurrent linear model (similar considerations apply in more complex scenarios), as discussed in Pascanu et al. [2013]. We are going to specify the generic recurrent neural network formula which we have previously presented, with input x^{(t)} and state h^{(t)} for time step t, as:

h^{(t)} = f(h^{(t-1)}, x^{(t)}; \theta) (3.1)
        = \sigma(W_{rec} h^{(t-1)} + W_{in} x^{(t)} + b) (3.2)
        = \sigma(W_{rec} h^{(t-1)}) + W_{in} x^{(t)} + b (3.3)

where the parameters θ of eq. 3.1 are given by the recurrent weight matrix W_{rec}, the input weight matrix W_{in} and the bias b. Eq. 3.2 and eq. 3.3 are equivalent because it is always possible to turn one into the other by a re-parametrization of the model. Let us now introduce a cost E = \sum_t E_t with E_t = L(x^{(t)}), so that we can analyze the gradient:

\frac{\partial E}{\partial \theta} = \sum_{t=1}^{T} \frac{\partial E_t}{\partial \theta} (3.4)

\frac{\partial E_t}{\partial \theta} = \sum_{k=1}^{t} \left( \frac{\partial E_t}{\partial x_t} \frac{\partial x_t}{\partial x_k} \frac{\partial x_k}{\partial \theta} \right) (3.5)

\frac{\partial x_t}{\partial x_k} = \prod_{t \geq i > k} \frac{\partial x_i}{\partial x_{i-1}} = \prod_{t \geq i > k} W_{rec}^{T} \, \mathrm{diag}(\sigma'(x_{i-1})) (3.6)

In eq. 3.5 each term of the summation is a temporal contribution measuring how θ at time k affects the cost at time t > k. In eq. 3.6 we can see how the error is transported back in time by $\partial x_t / \partial x_k$ from step t to step k. We can now distinguish between long term and short term contributions depending on the value of k, long term being the case when $k \ll t$ and short term the case when $k \approx t$. We can now define the exploding gradient problem (Bengio et al. [1994]) as the situation where the long term components grow exponentially, blowing up the gradient. The vanishing gradient problem instead refers to the situation where the long term components quickly become zero, making the model unable to learn anything but short term relationships. This happens because of the product of t − k Jacobian matrices inside eq. 3.6. Using the power iteration method it can be shown that if the spectral radius ρ of the recurrent weight matrix $W_{rec}$ is < 1 the long term components will vanish in the long run, while they will explode if ρ > 1 (see Pascanu et al. [2013] for more details).
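As an illustrative aside (not part of the original derivation): a minimal sketch, assuming a recent PyTorch, of how the repeated multiplication by $W_{rec}^T$ in eq. 3.6 makes the back-propagated gradient vanish or explode depending on the spectral radius. The diag(σ′) factors are ignored here, and the dimension and number of steps are arbitrary choices for the example.

```python
import torch

torch.manual_seed(0)
dim, steps = 8, 50

def backprop_norm(target_rho):
    """Norm of a gradient vector after being sent back `steps` time steps
    through a linear recurrence whose matrix has spectral radius target_rho."""
    W = torch.randn(dim, dim)
    rho = torch.linalg.eigvals(W).abs().max()   # current spectral radius
    W = W * (target_rho / rho)                  # rescale to the desired radius
    g = torch.ones(dim)                         # gradient arriving at the last step
    for _ in range(steps):
        g = W.t() @ g                           # one factor of the product in eq. 3.6
    return g.norm().item()

print(backprop_norm(0.9))   # shrinks towards zero  -> vanishing gradient
print(backprop_norm(1.1))   # grows without bound   -> exploding gradient
```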

Figure 3.10: Linear neural network using sigmoids: a chain of single neurons with weights w1, w2, w3, w4, biases b1, b2, b3, b4 and a cost C.

This problem is not exclusive to RNNs. More generally, it turns out that the gradient in deep neural networks is unstable, tending to either explode or vanish in earlier layers. This instability is a fundamental problem for gradient-based learning in deep neural networks. To see how a vanishing gradient might arise in a non-recurrent neural network, let us now consider the simplest deep topology possible, a long chain of single neurons like the one shown in Figure 3.10, where the $w_n$ are weights, the $b_n$ are biases and C is some cost function.

Assuming we are using sigmoids as activation functions, we have the usual formula for the activation of layer j (a single neuron in this case), $a_j = \sigma(z_j)$ with $z_j = w_j a_{j-1} + b_j$. If we consider the gradient of C with respect to $b_1$ we get:

$$\frac{\partial C}{\partial b_1} = \sigma'(z_1) \times w_2 \times \sigma'(z_2) \times w_3 \times \sigma'(z_3) \times w_4 \times \sigma'(z_4) \times \frac{\partial C}{\partial a_4}$$ (3.7)
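As a quick numerical illustration of eq. 3.7 (a minimal sketch assuming PyTorch; the depth and the use of the final activation as a stand-in for the cost C are arbitrary choices made for the example), we can build such a chain of sigmoid neurons and compare the gradient reaching the first bias with the one at the last bias:

```python
import torch

torch.manual_seed(0)
depth = 30

# Weights and biases drawn from a Gaussian with mean 0 and std 1,
# as in the discussion below.
w = torch.randn(depth, requires_grad=True)
b = torch.randn(depth, requires_grad=True)

a = torch.tensor(0.5)                 # input to the chain of single neurons
for j in range(depth):
    a = torch.sigmoid(w[j] * a + b[j])

a.backward()                          # use the final activation as the "cost"
print(b.grad[0].abs().item())         # gradient at the first bias: tiny
print(b.grad[-1].abs().item())        # gradient at the last bias: much larger
```

With the weights drawn this way, each factor $|w_j\,\sigma'(z_j)|$ is typically well below 1, so after a few tens of layers the gradient at the first bias is many orders of magnitude smaller than the one at the last bias.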

Assuming the weights have been initialized using a Gaussian distribution with mean 0 and standard deviation 1, they will usually satisfy $|w_j| < 1$. Since $\sigma'(x) \leq 1/4$, we will have $|w_j\,\sigma'(z_j)| < 1/4$, and when we multiply many such terms together, as in eq. 3.7, the values quickly shrink to zero and we have a vanishing gradient problem. It is possible that during training the weights grow so that their norm becomes greater than one but, while that would solve the vanishing gradient problem, we would end up with an exploding gradient problem instead. Problems such as vanishing/exploding gradients have two sides: one is the need to control the flow of gradients, the other is to make sure enough useful "information" reaches all the layers so that the network can be trained effectively. For a long time it was generally believed that very deep networks simply could not be trained because the lower layers would never receive any useful gradient, and a lot of effort was put into finding a solution to this important problem. One of the major breakthroughs was the introduction of pre-training. Hinton and Salakhutdinov [2006] introduced the idea of first performing a greedy layer-wise pre-training of the layers of a deep net using Restricted Boltzmann Machines (RBM) and then using standard supervised gradient descent for fine-tuning. Despite the effectiveness of this approach, little was known about the underlying reasons for such improvements; for an analysis of the advantages of unsupervised pre-training see Erhan et al. [2010]. Other solutions to the vanishing gradient problem that have proved successful in various applications are:

• use of Rectified Linear Units (ReLU) as activation functions. ReLUs are defined as f(x) = max{0, x}. Despite the hard non-linearity and the non-differentiability at zero, ReLUs outperform sigmoids. Not only does their unitary derivative prevent the saturation problem we have seen with sigmoids, they are also extremely computationally efficient and lend themselves to the formation of very useful sparse representations (Glorot et al. [2011]). Unfortunately ReLUs have downsides too, such as the known problem of "dead ReLUs", which happens when the neurons get initialized poorly or big jumps during updates push them into the negative region where their gradient is zero. In such circumstances the model can end up in a situation where most of the neurons never activate (therefore they are called dead). To counteract this, improved versions of ReLUs have been developed, such as the "Leaky ReLU", which adds a small gradient where its counterpart would be 0.

• use of Residual networks. One of the most recent ways to solve the vanishing gradient problem is to use residual neural networks. As mentioned in the section about ResNet, it was noted that extremely deep networks tended to have worse performance than shallower ones. It was hypothesized that this was due to the signal getting lost after traversing so many layers. Microsoft Research therefore introduced the idea of the residual block, a network unit containing extra connections that allow the input to "skip" the block and flow to deeper parts of the network unobstructed (a minimal sketch of such a block is given after this list). This approach does not require extra parameters or changes to the learning procedure but rather treats the net as an ensemble of smaller shallow networks (Veit et al. [2016]).

• use of Long Short Term Memory (LSTM) instead of vanilla RNNs. LSTMs use a special architecture to explicitly regulate the flow of information along time, thus avoiding exploding/vanishing gradients and being able to focus better on useful relationships.
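As announced in the second item above, here is a minimal sketch of a residual block (assuming PyTorch; batch normalization is omitted and the channel and input sizes are illustrative):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """A simplified residual block: the input skips the two convolutions
    and is added back to their output, giving gradients a direct path."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.conv1(x))
        out = self.conv2(out)
        return self.relu(out + x)      # skip connection: x flows through unobstructed

block = ResidualBlock(16)
x = torch.randn(1, 16, 32, 32)         # a single 16-channel 32x32 feature map
print(block(x).shape)                  # torch.Size([1, 16, 32, 32])
```

The addition `out + x` is the skip connection: even if the two convolutions learn nothing useful, the input (and its gradient) can still flow through the block unchanged.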

3.3.5 Long Short Term Memory

As we have seen, RNNs have the potential to learn long distance (i.e. distant time steps) relationships between their inputs. However, that promise is obstructed by the difficulty of making information travel over long distances without the gradient of the model converging to zero, exploding to infinity, or the signal being swamped by noise. Long Short Term Memory (LSTM) networks are gated RNNs based on the idea of creating paths through time whose derivatives neither vanish nor explode, using connection weights that may change at each timestep. By doing so the net is able both to remember information to be carried over and to forget that information when it is not useful anymore. Some architectures achieved this by manually resetting the signals flowing across time; LSTMs instead put a neural network in charge of deciding what information to keep and what to discard based on the context. LSTMs were first introduced by Hochreiter and Schmidhuber [1997] and, like the basic unit of RNNs shown in Figure 3.8, rely on a core component called the Memory Cell.


The memory cell (Figure 3.11) clearly shows two main axes along which information flows: one from $C_{t-1}$ to $C_t$ (cell state) and one from $h_{t-1}$ to $h_t$ (output). The cell state and outputs are regulated using pointwise sums and multiplications. Pointwise multiplications are used to scale down components that need to be forgotten, while pointwise sums add new information to remember to the cell state. The values used in these operations are selected using 3 "gates" implemented as neural networks with sigmoid activations: a forget gate $f_t$, an input gate $i_t$ and an output gate $o_t$.

First the forget gate (eq. 3.8), as the name implies, is used to select what information to remove from the cell state. Since the sigmoidal activations range from zero to one, when those activations are pointwise multiplied by the previous cell state a value of zero means “complete removal” while a value of one means “complete recall”.

Then the input gate (eq. 3.9) selects, using the current input, what information to add to the cell state. To do so it works in a manner similar to the forget gate, but this time the sigmoidal activations are used to select the bits of information that have passed through another subnetwork with tanh activations. The information that has been selected to be preserved is then added to the cell state using a pointwise sum. Together with the result of the forget gate, its value is used to update the current state as shown in eq. 3.11. Last, the output gate (eq. 3.10) is used to select the output at the current timestep. Its sigmoidal activations are combined with the information updated by the other gates to generate the final output (eq. 3.12). The general working of the memory cell can be described with the following equations:

$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$ (3.8)

$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$ (3.9)

$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$$ (3.10)

$$C_t = f_t * C_{t-1} + i_t * \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$$ (3.11)

$$h_t = o_t * \tanh(C_t)$$ (3.12)
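Eqs. 3.8–3.12 can be transcribed almost literally into code; the following is a didactic sketch assuming PyTorch (in practice one would use the optimized torch.nn.LSTM or torch.nn.LSTMCell), with illustrative sizes:

```python
import torch
import torch.nn as nn

class NaiveLSTMCell(nn.Module):
    """Direct transcription of eqs. 3.8-3.12, for illustration only."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        # One linear map per gate, acting on the concatenation [h_{t-1}, x_t].
        self.Wf = nn.Linear(input_size + hidden_size, hidden_size)  # forget gate
        self.Wi = nn.Linear(input_size + hidden_size, hidden_size)  # input gate
        self.Wo = nn.Linear(input_size + hidden_size, hidden_size)  # output gate
        self.Wc = nn.Linear(input_size + hidden_size, hidden_size)  # candidate state

    def forward(self, x_t, h_prev, c_prev):
        hx = torch.cat([h_prev, x_t], dim=1)
        f_t = torch.sigmoid(self.Wf(hx))                     # eq. 3.8
        i_t = torch.sigmoid(self.Wi(hx))                     # eq. 3.9
        o_t = torch.sigmoid(self.Wo(hx))                     # eq. 3.10
        c_t = f_t * c_prev + i_t * torch.tanh(self.Wc(hx))   # eq. 3.11
        h_t = o_t * torch.tanh(c_t)                          # eq. 3.12
        return h_t, c_t

cell = NaiveLSTMCell(input_size=10, hidden_size=20)
x = torch.randn(4, 10)                 # a batch of 4 inputs
h = c = torch.zeros(4, 20)
h, c = cell(x, h, c)
print(h.shape, c.shape)                # torch.Size([4, 20]) torch.Size([4, 20])
```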

What we have presented is just a general version of the LSTM structure. Many alternative architectures exist which leverage the same underlying ideas, such as LSTMs with "peephole connections" that let the gates look into the cell state (Gers and Schmidhuber [2000]) or the Gated Recurrent Unit (Cho et al. [2014]) that combines the forget and input gates into a single update gate. Karpathy et al. [2015] have done remarkable work in trying to understand and visualize what and how the cells inside an LSTM learn. They trained LSTMs on large texts such as War and Peace or the Linux kernel source and inspected the activations of internal cells in search of recognizable patterns (see Figure 3.12). They also compared their results with those obtained using finite-horizon n-grams and confirmed the power of LSTMs at learning long-range interactions, which enables them to correctly deal with difficult inputs that are troublesome for n-grams, such as matching distant brackets or closing quotes.

Figure 3.12: Interpretable cells from LSTMs trained on War and Peace and the Linux Kernel (from Karpathy et al. [2015]): (a) a cell sensitive to end of line; (b) a cell that activates inside if statements; (c) a cell that activates inside quotes. Colors ranging from red to blue represent activations between −1 and +1.

3.4 Applying Deep learning

After having introduced the ideas behind deep learning, along with some of its strengths and weaknesses, we now shift our attention to some results that allow us to effectively apply deep learning to many different tasks. We are going to see how to reuse, manipulate and improve existing models in order to make the best of the computational effort invested in creating them.

3.4.1 Transfer Learning

In Section 3.3.1 we presented a brief overview of the evolution of the most popular convolutional architectures, mentioning, among other information, the training times of each of them. As we can see, even the architectures that strived for a streamlined and computationally efficient design require weeks of training time. We also have to consider that those results were achieved by teams from top universities or large companies, which had access to multiple high-end graphics processing units (GPUs) able to dramatically speed up the matrix multiplications required to compute the numerous layers of deep networks. Despite the technological improvements that make faster and cheaper machines available to the general public, it would be very costly to retrain such architectures from scratch every time they need to be applied to a new task. Furthermore, the acquisition of a suitable dataset can be an obstacle to training a very deep model: even if acquiring some of the more popular datasets is usually free, downloading and processing large amounts of data can be difficult. Since humans are able to transfer skills they have learned in one domain to a new but similar one, we would like to be able to transfer the knowledge of one trained model to a new task with just minor adjustments. This is what is known


as "Transfer Learning" (TL). To better describe the different kinds of transfer learning that have been researched we first present the notation used in Pan and Yang [2010]. Let us define a domain $D = \{\mathcal{X}, P(X)\}$ where $\mathcal{X}$ represents a feature space (e.g. the space of all possible images), $X = \{x_1, ..., x_n\} \in \mathcal{X}$ is a set of learning samples (e.g. a specific image dataset) and $P(X)$ is a marginal probability distribution. Let us also define a task $T = \{\mathcal{Y}, f(\cdot)\}$ where $\mathcal{Y}$ is a label space (e.g. True or False for a binary classification problem) and $f(\cdot)$ a predictive function not observed directly but learned from the training data $\{(x_i, y_i) \mid x_i \in X, y_i \in \mathcal{Y}\}$. Given two pairs of domains and tasks, $(D_S, T_S)$ and $(D_T, T_T)$, as the source and target of the transfer, we can now define the possible types of transfer learning as shown in Table 3.1.

Table 3.1: Different types of Transfer Learning

Learning Setting     Source & Target Domains     Source & Target Tasks
Traditional ML       same                        same
Inductive TL         same                        related
Transductive TL      related                     same
Unsupervised TL      related                     related

Let us now consider a common scenario when trying to transfer knowledge from a pre-trained deep convolutional neural network to a new task. It is usually the case that the source domain and target domain are the same (e.g. images of objects or animals) while the source and target tasks are different but related (e.g. the targets have a different number of classes). This situation is called inductive transfer learning and the specific application is the transfer of feature knowledge. The basic idea is to learn a low-dimensional representation of the inputs that is applicable to different related tasks. This is similar to the common feature learning of multitask learning, with the difference that features are transferred from a source to a target task rather than learned simultaneously. This transfer is possible by exploiting the hierarchical nature of the representations learned by deep neural networks. If we examine the kind of features learned at different stages in a deep neural network trained on natural images, we see that lower layers (i.e. closer to the input) tend to always learn kernels similar to Gabor filters or color blobs, regardless of the specific task. This behaviour is so common that failing to observe it is an indication that something may be wrong with the model. Higher layers (i.e. closer to the output) will instead learn features that are more task specific. This means that the lower layers of a deep net can be easily transferred among similar tasks, but finding the right "depth" is a delicate procedure that can lead to "negative transfer", the situation where the transfer actually decreases performance. Extensive tests performed in Yosinski et al. [2014] show that this can be due to multiple factors, such as excessive specificity of the features or fragile co-adaptation between layers. In a situation where different layers have ended up co-adapting to one another in a complex way, if only some of them are transferred it is possible that the co-adaptation will not be recovered even after additional training. An interesting result they present is that in certain cases transferring features and then fine-tuning them results in networks that generalize better than those trained directly on the target dataset.
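As a minimal sketch of this kind of feature transfer in PyTorch/torchvision (the 6-class target task and the choice to freeze all convolutional layers are illustrative assumptions, not part of the cited works):

```python
import torch
import torch.nn as nn
from torchvision import models

# Load a network pre-trained on ImageNet.
model = models.vgg16(pretrained=True)

# Freeze the convolutional (lower) layers: their general, Gabor-like
# features transfer well across related image tasks.
for param in model.features.parameters():
    param.requires_grad = False

# Replace the task-specific head with one sized for the new label space
# (here a hypothetical 6-class target task).
model.classifier[6] = nn.Linear(model.classifier[6].in_features, 6)

# Only the parameters that still require gradients are given to the optimizer.
optimizer = torch.optim.SGD(
    filter(lambda p: p.requires_grad, model.parameters()), lr=1e-3)
```

Freezing the convolutional part transfers the lower-layer features unchanged while the new head is trained from scratch on the target labels; unfreezing some of the higher layers and fine-tuning them with a small learning rate is the usual compromise between feature specificity and negative transfer.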

3.4.2 Ensembling

We are now going to present two techniques that are needed to introduce the topic that follows. When talking about transfer learning we have discussed how to transfer the knowledge of
