
Augmenting Recurrent Neural Networks Resiliency by Dropout

Author: Francesco Crecchi

Supervisor: Dr. Davide Bacciu

University of Pisa

Department of Computer Science

A thesis submitted in fulfillment of the requirements for the degree of Master of Science (Computer Science).

This thesis presents a novel, principled approach to training recurrent neural networks that are robust to missing input features at prediction time. By building on the ensembling properties of Dropout regularization, we propose a methodology, named DropIn, which efficiently trains a neural network model as a committee machine of subnetworks, each capable of predicting with a subset of the original input features.

We discuss the application of the DropIn methodology to the most representative recurrent neural models, ranging from simple recurrent networks to Reservoir Computing models, targeting applications characterized by input sources that might be unreliable or prone to collecting discontinued measurements, leading to missingness in the input data (e.g., as in pervasive wireless sensor networks and IoT contexts). We provide an experimental assessment using real-world data from ambient assisted living and healthcare application domains, showing how the DropIn methodology allows maintaining predictive performances comparable to those of a model without missing features, even when 20%-50% of the inputs are not available.

I would first like to thank my thesis advisor Dr. Davide Bacciu of the Computer Science Department at the University of Pisa for being the first who believed in me and my still unexpressed potential. He has been so persuasive that I started believing in myself too, and I will thank him forever for this.

I would also like to thank all the professors and colleagues I met during this Master's Degree because, to a greater or lesser extent, directly or indirectly, they brought me to this moment.

At such an important moment I would like to thank my family, for the sacrifices they made and for always making me feel their support. Knowing that there are people who believe in me has been the push that kept me going through difficult moments, and the hope of sharing this moment with all of you the best incentive I could have had.

In particular I would like to thank Erika, a special member of the family and my life partner for twelve wonderful years, for always being everything I needed: girlfriend, lover, friend, colleague, researcher, secretary. Without you my life would be infinitely sadder and I would certainly not be the person I am today.

I would also like to extend the family acknowledgments to all the families present and to my friends: I would like you to know that I consider you all part of a single whole, and having you here at such a special moment is the best gift I could receive.

Finally, I would like to thank the precise moment in which I decided to stop getting in my own way and started believing in myself: from that moment on, everything became much simpler.

Abstract
Acknowledgments
Contents
List of figures
List of tables
1 Introduction
2 Background
2.1 Notation
2.1.1 Non-Temporal vs Temporal Tasks
2.2 Recurrent Neural Networks
2.2.1 Fundamental Recurrent Neural Networks
2.2.2 Training Recurrent Neural Networks
2.2.2.1 Vanishing of the gradient
2.2.3 Long Short-Term Memory Networks
2.2.4 Gated Recurrent Unit
2.2.5 Reservoir Computing
2.2.5.1 Echo State Networks
2.2.5.2 Leaky Integrator ESN
2.2.5.3 ESN Reservoir construction
2.2.5.4 ESN Supervised Training
2.3 Dealing with Missing Data
3 Approach
3.1 Motivations and Goal
3.2 The DropIn Approach
3.2.1 Dropout regularization
3.2.2 DropConnect
3.2.3 The rationale behind DropIn approach
3.2.4 Dropout for missing data
3.2.5 DropIn Recurrent Neural Networks
3.2.5.1 DropIn-ESN
4 Implementation
5 Experiments
5.1 Experimental Setup
5.2 Movement Dataset
5.3 The Kitchen Cleaning Task
5.4 Predicting In-Hospital Mortality of ICU Patients
6 Conclusions

2.1 Sequence-to-sequence vs sequence-to-element transductions.
2.2 A recurrent neural network proposed by Jordan (1986).
2.3 Simple Recurrent Neural Network, Elman (1986).
2.4 A simple recurrent net with one input unit, one output unit, and one recurrent hidden unit.
2.5 Vanishing of the gradient.
2.6 LSTM memory cell with forget gate.
2.7 Gated Recurrent Unit (GRU).
2.8 Echo State Network (ESN).
3.1 Dropout regularization technique.
3.2 Dropout regularization technique: training and test phases.
3.3 DropIn approach applied on a generic Recurrent Neural Network.
3.4 DropIn ESN.
4.1 TensorFlow computation graph example.
4.2 TensorBoard, visualization tool.
4.3 Mackey-Glass time series prediction.
4.4 Fixed-NARMA of order 10 time series prediction.
5.1 Movement dataset experimental setup.
5.2 Test accuracy of SRN vs DropInSRN as a function of the number of missing inputs for the Movement prediction task.
5.3 Test accuracy of LSTM vs DropInLSTM as a function of the number of missing inputs for the Movement prediction task.
5.4 Test accuracy of LI-ESN vs DropInESN as a function of the number of missing inputs for the Movement prediction task.
5.6 Test MAE of SRN vs DropInSRN as a function of the number of missing inputs for the Kitchen Cleaning task.
5.7 Test MAE of LSTM vs DropInLSTM as a function of the number of missing inputs for the Kitchen Cleaning task.
5.8 Test MAE of LI-ESN vs DropInESN as a function of the number of missing inputs for the Kitchen Cleaning task.
5.9 Test AUC of SRN vs DropInSRN as a function of the number of missing inputs for the In-Hospital Mortality Challenge.
5.10 Test AUC of GRU vs DropInGRU as a function of the number of missing inputs for the In-Hospital Mortality Challenge.
5.11 Test AUC of SRN vs DropInESN as a function of the number of missing inputs for the In-Hospital Mortality Challenge.

5.1 RNN model parameters for model selection procedures.
5.2 Baseline results of model-selection and testing for the Movement prediction task.
5.3 Simple Recurrent Network model-selection and testing for the Movement prediction task.
5.4 Long Short-Term Memory Network model-selection and testing for the Movement prediction task.
5.5 Echo State Network model-selection and testing for the Movement prediction task.
5.6 Baseline results of model-selection and testing for the Kitchen Cleaning task.
5.7 Simple Recurrent Network model-selection and testing for the Kitchen Cleaning task.
5.8 Long Short-Term Memory Network model-selection and testing for the Kitchen Cleaning task.
5.9 Echo State Network model-selection and testing for the Kitchen Cleaning task.
5.10 Time-series variables for the PhysioNet/CinC Challenge 2012 and percentage of patients for whom at least one measurement was available during the first 48 ICU hours (total of 12,000 ICU stays).
5.11 Baseline results of model-selection and testing for the In-Hospital Mortality Challenge.
5.12 Simple Recurrent Network model-selection and testing for the In-Hospital Mortality Challenge.
5.13 Gated Recurrent Unit model-selection and testing for the In-Hospital Mortality Challenge.
5.14 Echo State Network model-selection and testing for the In-Hospital Mortality Challenge.

1 Introduction

The increasing diffusion of networks of pervasively distributed environmental and personal sensor devices requires computational models capable of dealing with continuous streams of sensor data under the form of time series of measurements. Machine learning models, in this context, serve to make sense of such multivariate sequences of heterogeneous sensor information by providing predictions supporting, for example, context-awareness and ambient intelligence functions. Numerous applications have been developed by modeling them as regression or classification tasks on sensor streams, including event recognition, fault and anomaly detection, human activity recognition and, in general, supporting robotic and Internet-of-Things (IoT) applications by providing adaptivity and context awareness mechanisms.

Recurrent Neural Networks (RNN) are a popular and effective means to deal with sequential information thanks to their ability to encode the history of the past inputs within the network state. However, these networks, as well as their feed-forward counterparts, require a fixed set of input features to be available both at training and at test time. Their predictive performance tends to abruptly decline when one or more of the input features on which they have been trained is missing when querying the model for predictions [28]. Many practical ubiquitous computing and IoT scenarios comprise information generated by networks of loosely connected devices, which are often battery-operated and communicate through wireless channels with little quality of service guarantees. In such scenarios, it is not unlikely to have to deal with missing information, which might result in a RNN performing a recognition or prediction task missing a part of its inputs.

The goal of this thesis is to propose, and to experimentally assess, a principled approach to making neural networks robust to such missing inputs at query time (i.e., at prediction), focusing in particular on RNN models. The large majority of the works in the literature addresses the problem of missing inputs solely with respect to information that is missing at training time. In this context, dealing with missing information amounts to finding the best strategies to impute the values of features which are missing in some of the training samples: [14] provides a recent survey on this problem. In this thesis, instead, we consider a scenario in which we train a RNN using complete data (i.e., with no missing inputs) but where part of the inputs may become unavailable, even for a long time, during system operation (e.g. due to a sensor failure).

We propose a novel solution to the missing input problem that exploits Dropout [45], a regularization technique that has become widely popular in the context of deep learning in recent years. Dropout prevents overfitting by randomly selecting, during training, a subset of neurons or synapses which are dropped (i.e., their activation is zeroed) for the current training sample. The key intuition guiding this work is that one of the effects of Dropout is to train the full neural model as if it were an ensemble of thinned models obtained by the random dropout disconnections. As such, we believe that it can be exploited to build a model robust to missing inputs as an ensemble of thinned models obtained by applying Dropout to the input neurons and connections alone.

In Chapter 2, we introduce the reader to the use of recurrent neural models to deal with temporal tasks on sequence structured domains; in particular, we introduce the RNN models that have been used in the experiments, highlighting the differences among them and providing a consistent notation across the models. In the latter part of the chapter, the reader is introduced to the problem of dealing with missing data using such neural models and the countermeasures that have been taken in the literature over the years.

In Chapter 3, we introduce our approach, named DropIn, aimed at making RNNs robust to missing inputs. First, we introduce the main component of DropIn: the Dropout regularization method. Then, we show how to use Dropout during RNN training in order to augment the model's resiliency against predictions with missing data. At the end of the chapter, we describe the application of the DropIn approach to Reservoir Computing (RC) models: these partially trained RNNs represent the most extreme case for DropIn to work, due to their untrained input and recurrent weight configuration.

In Chapter 4, we walk the reader through the design choices and the code which supports the experimental phase and, in particular, the pros and cons of the existing libraries taken into consideration.

Chapter 5 provides an experimental assessment of the proposed model on two real-world datasets from ambient intelligence and pervasive computing applications, plus a biomedical dataset for Intensive Care Unit (ICU) in-hospital mortality prediction.

Finally, Chapter 6 concludes the document, summarizing the work done, pointing out the goals reached and setting the ground for future developments.

2 Background

In this chapter, we aim to provide a readable, intuitive, and consistently notated review of recurrent neural networks for sequence learning. Let us start by fixing the notation we will use.

2.1 Notation

We denote scalars using italic notation, vectors in lowercase (e.g., v = [v_1, . . . , v_N] for a vector of length N) and matrices using uppercase (e.g., W). If v is a vector, then v_i denotes its i-th entry; if X is a matrix, then X_{i,j} denotes the element at the i-th row and j-th column of X. Sequences of vectors are represented in lowercase bold and sets in uppercase bold. We represent a set of K time series as S = {s(1), . . . , s(K)}, where the generic entry s(i), for i = 1, . . . , K, is a, possibly multivariate, time series of length T: s(i) = [u(1), . . . , u(T)]. Finally, (R^N)* and (R^N)^n denote an N-dimensional sequence domain of arbitrary length and of length n, respectively.

2.1.1 Non-Temporal vs Temporal Tasks

Let N_u and N_y be the input and output dimensions, respectively. Let D = {(u, y) | u ∈ R^{N_u}, y ∈ R^{N_y}} be a dataset of T examples, where u is the input vector and y is the desired output vector. Suppose D is sampled from an unknown function f_target : R^{N_u} → R^{N_y}. The sets R^{N_u} and R^{N_y} are unstructured domains. A non-temporal task consists in inferring the unknown function f_target from D. Each example u is fed to the model, which produces ŷ as output. Let Ŷ denote the set of produced outputs and Y the set of desired outputs; the model is adjusted such that the error E(Y, Ŷ) is minimized. This function approximation task is clearly insensitive to the order of the inputs and thus it assumes they are independent.

The case of sequence structured domains is different. Let us denote with s ∈ (R^{N_u})* the input sequences, with ŷ ∈ (R^{N_y})* the produced output sequences and with y ∈ (R^{N_y})* the desired output sequences. In a temporal task, the unknown f_target : (R^{N_u})* → (R^{N_y})* depends on the input presentation order, and the sequence vectors x and y are indexed by time steps t = 1, . . . , T. In a temporal task, the unknown function is named a structural transduction on the sequence domain.

There are two main kinds of sequence transductions, as shown in Fig. 2.1:

• sequence-to-sequence: an isomorphic mapping of an input sequence s ∈ (R^{N_u})* to an output sequence ŷ ∈ (R^{N_y})*;

• sequence-to-element: a non-isomorphic mapping of an input sequence s ∈ (R^{N_u})* to an output vector ŷ ∈ R^{N_y}.


Figure 2.1: Sequence-to-sequence vs sequence-to-element transductions.

The output of a structural transduction at time t can be written as

ŷ(t) = f_y(x(t)) = f_y(f_x(. . . , u(t − 1), u(t)))    (2.1)

where f_x is called the transition function, applied at each time-step t, and f_y is called the output function. The input summarization function f_x allows the model to encode a sort of memory of the inputs seen at previous time-steps and can, in general, be represented in two ways, depending on the time representation chosen:

• explicit time representation: at time-step t_0, a time-frame F of arbitrary but finite size is chosen, having as input the vectors u(t) with t = t_0 − F + 1, . . . , t_0. The vector x(t) is computed as x(t) = f_x(u(t − F + 1), . . . , u(t)).

• implicit time representation: at each time-step t, the input of the model is the input vector u(t) coupled with a state vector x(t) which represents all previously seen inputs u(t′) with t′ < t. The state x(t) is computed recursively by the so-called state-space formulation:

x(t) = f_x(u(t), x(t − 1)) if t > 0, and x(t) = 0 if t = 0    (2.2)

An explicit time representation is, however, limited with respect to an implicit one. For example, a model trained using a context window of finite length 5 could never be trained to answer the simple question: "what was the data point seen 6 time steps ago?". For this reason the interest of the research community has focused on implicit time representation models, whose transition function is given by Equation 2.2.

We can now formulate our unknown function, commonly referred to as τ, as the function composition τ = τ_enc ◦ τ_out, where:

• τ_enc : (R^{N_u})* → (R^{N_x})* is a sequence-to-sequence transduction which maps s into a sequence of states x; τ_enc is computed by applying the function f_x of Equation 2.2;

• the τ_out transduction can be:

  – sequence-to-sequence: τ_out : (R^{N_x})* → (R^{N_y})* maps the sequence of states x into the output sequence of vectors y; τ_out is computed by applying f_y of Equation 2.1 to each state of x;

  – sequence-to-element: a state mapping function χ : (R^{N_x})* → R^{N_x} is applied, which maps the sequence of states x to a unique state denoted as χ(x); then τ_out : R^{N_x} → R^{N_y} maps χ(x) to a single vector by applying the function f_y of Equation 2.1 to the state χ(x).

Depending on the nature of the task at hand, the structure of the dataset can be:

• sequence-to-sequence: D = {(s_k, y_k) | ∀k = 1, . . . , K : s_k ∈ (R^{N_u})^{T_k}, y_k ∈ (R^{N_y})^{T_k}};

• sequence-to-element: D = {(s_k, y(k)) | ∀k = 1, . . . , K : s_k ∈ (R^{N_u})^{T_k}, y(k) ∈ R^{N_y}}.

2.2 Recurrent Neural Networks

A widely used neural network which makes use of an implicit time representation of the transition function f_x is the recurrent neural network (RNN). A recurrent neural network is a class of artificial neural networks where connections between units form a directed cycle. This feedback creates an internal state, or memory, which allows the network to exhibit dynamic temporal behaviour.

We start by presenting the simplest RNN: the recurrent unit. Such a model maintains a state x(t) ∈ R whose activation is computed as stated in Equation 2.2. Let us denote the current input vector with u(t) ∈ R^{N_u} and the previous state with x(t − 1); then:

x(t) = f_x(u(t), x(t − 1)) = f(w_in u(t) + w x(t − 1) + b)    (2.3)

where w_in ∈ R^{1×N_u} and w ∈ R are the parameters we need to learn: the first is associated with the input-to-state connection, the second with the internal state update, plus the bias b.
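To make the recursion concrete, the following is a minimal NumPy sketch of this update rule (Eq. 2.3); function and variable names are illustrative assumptions, not the implementation used in this thesis.

```python
import numpy as np

def recurrent_unit_forward(U, w_in, w, b, f=np.tanh):
    """Run a single recurrent unit over a sequence U of shape (T, N_u).

    Implements x(t) = f(w_in u(t) + w x(t-1) + b), Eq. 2.3, with x(0) = 0.
    """
    T = U.shape[0]
    x = 0.0                      # scalar state, x(0) = 0
    states = np.empty(T)
    for t in range(T):
        x = f(w_in @ U[t] + w * x + b)   # Eq. 2.3
        states[t] = x
    return states

# Example usage with random inputs and parameters.
rng = np.random.default_rng(0)
U = rng.normal(size=(10, 3))     # sequence of length 10, N_u = 3
w_in = rng.normal(size=3)        # input-to-state weights
w, b = 0.5, 0.0                  # state-update weight and bias
print(recurrent_unit_forward(U, w_in, w, b))
```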

2.2.1 Fundamental Recurrent Neural Networks

We start by presenting two fundamental and historically relevant RNNs, which differ mostly in their internal state representation.

Jordan networks (Fig. 2.2), introduced by Michael Jordan in 1986, present an early architecture for supervised learning on sequences [24]. A Jordan network resembles a feedforward network with a single hidden layer, but is extended with context units. Outputs are fed into the context units, which then feed into the hidden nodes at the following step. Additionally, context units have self-connected edges. Intuitively, these self-connected edges give the Jordan network a means to send information across multiple time steps without perturbing the output at each intermediary time-step.

Figure 2.2: A recurrent neural network proposed by Jordan (1986).

Elman networks, also named Simple Recurrent Neural Networks (SRN) (Fig. 2.3), introduced by J. L. Elman [13], simplify the structure of the Jordan network. Associated with each unit in the hidden layer is a single context unit. Each context unit j′ takes as input the state of the corresponding hidden node j at the previous time-step, along an edge of unit weight w_{j′j} = 1. Effectively, this entire setup is equivalent to a simple RNN in which each hidden node has a single self-connected recurrent edge. Some of Elman's ideas, including fixed-weight edges and the importance of self-connected recurrent edges in hidden nodes, survive in Hochreiter and Schmidhuber's subsequent work on Long Short-Term Memory [16]. In the paper, Elman trains the network using backpropagation and demonstrates that the network can learn long-range time dependencies. The paper features two sets of experiments. The first extends the logic operation exclusive OR (XOR) to the time domain by concatenating sequences of three tokens. For each three-token segment, e.g. "011", the first two tokens ("01") are chosen randomly and the third ("1") is chosen by performing XOR on them. A completely random guess should achieve an accuracy of 50%. A perfect system should perform the same as random for the first two tokens, but guess the third token perfectly, achieving an accuracy of 66.7%. Elman's simple network does get close to this maximum achievable score.

Figure 2.3: Simple Recurrent Neural Network, Elman (1986).

2.2.2 Training Recurrent Neural Networks

Learning with recurrent neural networks has long been considered to be difficult. As with all neural networks, the optimization is NP-Hard. However, learning on recurrent networks can be especially hard due to the difficulty of learning long-range dependencies, as described in [7]. Let us denote with W the RNN free parameters. The classical approach, based on gradient descent, consists in tuning the weights according to the gradients ∂E/∂W in order to minimize the error E. The standard way to achieve this is via backpropagation on the unfolded recurrent net, which essentially becomes a feed-forward neural network with a number of hidden layers equal to the length of the sequence the RNN is trained on, with all layers sharing the same weights. This algorithm is named Back-Propagation Through Time (BPTT) [55] and it is the standard way to train dynamical RNNs. A less used alternative, due to its computational cost, is Real-Time Recurrent Learning (RTRL) [57], in which the gradient computation is carried out in forward mode in time.

Figure 2.4: A simple recurrent net with one input unit, one output unit, and one recurrent hidden unit.

2.2.2.1 Vanishing of the gradient

Training a RNN using BPTT can be computationally expensive, especially for long sequences. This, coupled with a high number of weight updates, leads to difficulties when using large models. Worse than this are the well-known problems of vanishing or exploding gradients [16], occurring when propagating errors across many time steps. As a trivial example, consider the network with a single input node, a single output node, and a single recurrent hidden node (Fig. 2.4). Now consider an input passed to the network at time τ and an error calculated at time t, assuming input of 0 in the intervening time steps. Owing to the weight tying across time steps (the recurrent edge at hidden node j always has the same weight), the impact of the input at time τ on the output at time t will either explode exponentially or rapidly approach zero as t − τ grows large, depending on whether |w_jj| > 1 or |w_jj| < 1, and also upon the activation function in the hidden node (Fig. 2.5).

Figure 2.5: A visualization of the vanishing gradient problem, using the architecture depicted in Fig. 2.4. If the weight along the red edge is less than one, the effect of the input at the first time step on the output at the final time step will rapidly diminish as a function of the size of the interval in between.

One possible solution to this problem is to truncate the backpropagation after some maximum number of time steps. This algorithm, named Truncated Back-Propagation Through Time [57], can alleviate the exploding gradient problem when a small cutoff is used, but it requires sacrificing the ability to learn long-range dependencies. To be able to learn such long-term dependencies, ad-hoc models have been proposed.

2.2.3 Long Short-Term Memory Networks

In 1997, to overcome the problem of vanishing gradients, Hochreiter and Schmidhuber introduced the LSTM model [16]. This model resembles a standard neural network with a recurrent hidden layer, except that each ordinary node in the hidden layer is replaced with a memory cell (Fig. 2.6).


Figure 2.6: LSTM memory cell with forget gate.

Simpler recurrent neural networks have long-term memory in the form of weights. The weights change very slowly over time, encoding general knowledge about the data. They also have short-term memory in the form of ephemeral activations, which pass from each node's output to successive nodes. The LSTM model introduces an intermediary sort of memory via the memory cell. All elements of the LSTM cell are enumerated and described below.

• Input Node: This node, labeled g_c, behaves as an ordinary neuron, taking activation from the rest of the hidden layer at the previous time step as well as from the example u(t). Typically, the linear input is run through a tanh activation function.

• Input Gate: Multiplicative gates are distinctive features of the LSTM model. A gate is a sigmoidal unit that, like the input node, takes activation from the hidden layer at the previous time-step and also from the example u(t). The gates are notable because their outputs are multiplied by the output of another node. The input gate i_c is so called because it gates the flow from the input node: if it outputs 0, flow through the gate is cut off; if it outputs 1, all activation is passed through.

• Internal State: At the heart of each memory cell is a node s with linear activation, which is referred to in the original paper as the "internal state" of the cell. We index cells with c and thus the internal state of a cell c is s_c. The internal state s_c has a self-connected recurrent edge with weight 1, called the constant error carousel (CEC). This edge spans adjacent time steps with constant weight, assuring that error can flow across time-steps without vanishing. Thus s(t) = g(t) ⊙ i(t) + s(t − 1).

• Forget Gate: Forget gates f_c, introduced in 2000 by Gers et al. [15], provide a method by which the network can learn to flush the contents of the internal state. This is especially useful in continuously running networks. With forget gates, the equation to calculate the internal state on the forward pass is: s(t) = g(t) ⊙ i(t) + s(t − 1) ⊙ f(t).

• Output Gate: Finally, an output gate, which we denote o_c, is multiplied by the value of the internal state s_c to produce the value v_c output by the memory cell. It is customary that the internal state be run through a tanh activation function prior to this multiplication, but we know of no argument why this should be so.

A note on notation follows. We use i, f and o to refer to the input, forget and output gates, respectively. When we use vector notation we are referring to the values of the nodes in an entire layer of cells. For example, g is a vector containing the value of g at each memory cell in a layer. When the subscript c is used, it is to refer to an individual memory cell.

Also in 2000, Gers and Schmidhuber proposed peephole connections [15], which pass from the internal state directly to the input and output gates of the same cell without first having to pass through the output gate.


They report that these connections improve performance on timing tasks, where the network must learn to measure precise intervals between events. The intuition behind peephole connections can be captured by an example. Consider a network which must learn to count objects and emit some desired output when n objects have been seen. The network might learn to let some fixed amount of activation into the internal state after each object is seen. This activation is trapped in the CEC and is incremented iteratively each time another object is seen. When the n-th object is seen, the network needs to know to let out content from the internal state so that it can affect the output. To accomplish this, the output gate o_c must know the content of the internal state s_c.

Formally, at each time-step a modern LSTM with forget gates computes:

g(t) = φ(W_gu u(t) + W_gh h(t − 1) + b_g)
i(t) = σ(W_iu u(t) + W_ih h(t − 1) + b_i)
f(t) = σ(W_fu u(t) + W_fh h(t − 1) + b_f)
o(t) = σ(W_ou u(t) + W_oh h(t − 1) + b_o)
s(t) = g(t) ⊙ i(t) + s(t − 1) ⊙ f(t)
h(t) = s(t) ⊙ o(t)

where ⊙ stands for the Hadamard product (element-wise multiplication). We use the tanh function φ for the input node g. Again, h(t − 1) is a vector containing the values v_c output by each memory cell c in the hidden layer at the previous time step.
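For illustration, a minimal NumPy sketch of a single forward step implementing the equations above follows; parameter names and shapes are assumptions made for the example, not the thesis implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(u, h_prev, s_prev, P):
    """One LSTM forward step following the equations above.

    P is a dict of parameters: W_*u of shape (N_h, N_u), W_*h of shape
    (N_h, N_h), b_* of shape (N_h,). Returns h(t) and s(t).
    """
    g = np.tanh(P["W_gu"] @ u + P["W_gh"] @ h_prev + P["b_g"])   # input node
    i = sigmoid(P["W_iu"] @ u + P["W_ih"] @ h_prev + P["b_i"])   # input gate
    f = sigmoid(P["W_fu"] @ u + P["W_fh"] @ h_prev + P["b_f"])   # forget gate
    o = sigmoid(P["W_ou"] @ u + P["W_oh"] @ h_prev + P["b_o"])   # output gate
    s = g * i + s_prev * f        # internal state with CEC
    h = s * o                     # memory cell output
    return h, s
```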

Intuitively, in terms of the forward pass, the LSTM can learn when to let activation into the internal state. So long as the input gate takes value 0, no activation can get in. Similarly, the output gate learns when to let the value out. When both gates are closed, the activation is trapped in the LSTM, neither growing nor shrinking, nor affecting the output at the intermediary time steps. In terms of the backward pass, the CEC enables the gradient to propagate back across many time steps, neither exploding nor vanishing.

Figure 2.7: Gated Recurrent Unit (GRU).

2.2.4 Gated Recurrent Unit

Gated Recurrent Units (GRUs), introduced by Cho et al. (2014) and depicted in Fig. 2.7, are a variation on the LSTM model which has become widely popular because it simplifies the model without significant performance loss in many tasks [10]. The GRU combines the forget and input gates into a single "update gate". It also merges the cell state and hidden state, and makes some other changes. The resulting model is simpler than standard LSTM models and has been growing increasingly popular.

Let us describe how the activation of the j-th hidden unit is computed. First, the reset gate r_j is computed by

r_j(t) = σ([W_r u(t)]_j + [U_r h(t − 1)]_j)

where σ is the logistic sigmoid function and [.]_j denotes the j-th element of a vector. W_r and U_r are weight matrices which are learned.

Similarly, the update gate z_j is computed by

z_j(t) = σ([W_z u(t)]_j + [U_z h(t − 1)]_j)

The actual activation of the unit, h_j, is then computed by

h_j(t) = z_j(t) h_j(t − 1) + (1 − z_j(t)) h̃_j(t)

where

h̃_j(t) = φ([W u(t)]_j + [U (r(t) ⊙ h(t − 1))]_j)

In this formulation, when the reset gate is close to 0, the hidden state is forced to ignore the previous hidden state and reset with the current input only. This effectively allows the hidden state to drop any information that is found to be irrelevant later in the future, thus allowing a more compact representation.

On the other hand, the update gate controls how much information from the previous hidden state will carry over to the current hidden state. This acts similarly to the memory cell in the LSTM network and helps the RNN to remember long-term information.
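The following minimal NumPy sketch implements one GRU step following the equations above; the parameter names are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(u, h_prev, W_r, U_r, W_z, U_z, W, U):
    """One GRU forward step following the equations of Section 2.2.4."""
    r = sigmoid(W_r @ u + U_r @ h_prev)            # reset gate
    z = sigmoid(W_z @ u + U_z @ h_prev)            # update gate
    h_tilde = np.tanh(W @ u + U @ (r * h_prev))    # candidate activation
    return z * h_prev + (1.0 - z) * h_tilde        # new hidden state
```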

2.2.5 Reservoir Computing

We now present Reservoir Computing (RC), a completely different approach to RNN modeling. Reservoir Computing [35] is based on the complete separation of the state transition function f_x and the output function f_y in the form of two distinct levels:

• a dynamic reservoir, an RNN acting as a nonlinear temporal expansion function, and

• a readout, typically a linear layer, computing the output from the expansion.

Using the notation of Equation 2.1, the reservoir implements the state transition function f_x while the readout implements the output function f_y applied to the non-linear expansion of the inputs.

The main characteristic of the RC approach is the observation that, if a random RNN possesses certain algebraic properties, training only a linear readout from it is often sufficient to achieve excellent performance in practical applications. Thus, once the reservoir is correctly initialized, there is no need to train that part of the RNN, just the readout parameters. Intuitively, since these two levels have different purposes, it is plausible to treat them differently.

2.2.5.1 Echo State Networks

The most important RC model is certainly the Echo State Network (ESN) [19]. An ESN comprises an input layer with N_U units, a reservoir layer with N_R units and a readout with N_Y units. The reservoir is a large, sparsely-connected layer of recurrent non-linear units (typically tanh) which is used to perform a contractive encoding [19] of the history of driving input signals into a state space. The readout comprises feed-forward linear neurons computing ESN predictions as a weighted linear combination of the reservoir activations.

Figure 2.8: Echo State Network (ESN).

In terms of state transition function and output function, the reservoir implements the nonlinear encoding of the input sequence f_x : R^{N_U} × R^{N_R} → R^{N_R}, defined as:

x(t) = f(W_in u(t) + Ŵ x(t − 1)) if t > 0, and x(t) = 0 if t = 0    (2.4)

with W_in ∈ R^{N_R×N_U} the input matrix, Ŵ ∈ R^{N_R×N_R} the reservoir matrix and f the component-wise reservoir activation function. The readout, a trained linear layer, implements the mapping f_y : R^{N_R} → R^{N_Y} defined as:

y(t) = W_out x(t)    (2.5)

with W_out ∈ R^{N_Y×N_R} the output (readout) matrix. Training this kind of network is extremely efficient since just the linear readout matrix W_out has to be learned; the rest remains untrained.
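As an illustration, the following NumPy sketch runs an already-initialized reservoir (Eq. 2.4) and a linear readout (Eq. 2.5) over an input sequence; matrix names and shapes are assumptions made for the example.

```python
import numpy as np

def esn_forward(U, W_in, W_hat, W_out, f=np.tanh):
    """Run an untrained reservoir (Eq. 2.4) and a linear readout (Eq. 2.5).

    U: input sequence of shape (T, N_U); W_in: (N_R, N_U);
    W_hat: (N_R, N_R); W_out: (N_Y, N_R).
    """
    T = U.shape[0]
    N_R = W_hat.shape[0]
    x = np.zeros(N_R)                         # x(0) = 0
    states, outputs = [], []
    for t in range(T):
        x = f(W_in @ U[t] + W_hat @ x)        # Eq. 2.4
        states.append(x)
        outputs.append(W_out @ x)             # Eq. 2.5
    return np.array(states), np.array(outputs)
```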

To ensure a "rich" set of dynamics, the reservoir has to be big, sparse and randomly connected. Moreover, the so-called Echo State Property (ESP) [19] has to hold. This condition essentially states that the effect of a previous state x(n) and a previous input u(n) on a future state x(n + k) should vanish gradually as time passes (i.e. as k → ∞) and not persist or even get amplified. In [19], a necessary and a sufficient condition for the ESP are provided. The sufficient condition states that the largest singular value of the reservoir weight matrix must be less than 1:

σ(Ŵ) < 1    (2.6)

where σ(Ŵ) is the largest singular value of Ŵ. The necessary condition [19], on the other hand, says that if the spectral radius ρ(Ŵ), i.e. the largest absolute eigenvalue of the matrix Ŵ, is larger than 1 and the network admits the null sequence as input, the reservoir is locally asymptotically unstable near the null state 0 ∈ R^{N_R}. The sufficient condition 2.6 is considered to be too restrictive [19] for practical purposes. Instead, the Ŵ matrix is often initialized to satisfy the necessary condition, i.e.

ρ(Ŵ) < 1    (2.7)

with values of the spectral radius that are, typically, close to the stability threshold 1. It is worth noticing that the ESP regards the untrained part of the model, thus it has to be ensured before training.

Input weights (W_in) are randomly chosen from a uniform distribution over [−s_in, s_in] (where s_in is an input scaling parameter), while Ŵ is typically generated as a sparse random matrix and then rescaled to the desired spectral radius, as described in Section 2.2.5.3.

2.2.5.2 Leaky Integrator ESN

One of the cons of using standard sigmoidal ESNs is that their dynamics cannot be "slowed down", for example to generate slowly-changing sinusoidal waves. For this reason, Leaky Integrator Echo State Network (LI-ESN) models have been proposed ([19], [21], [23]). A LI-ESN is a variant of the standard ESN model which applies an exponential moving average to the reservoir state values. This allows a better handling of input sequences that change slowly with respect to the sampling frequency [35]. At each time-step t, the state transition function of the LI-ESN is computed as

x(t) = (1 − a) x(t − 1) + a f(W_in u(t) + Ŵ x(t − 1))

The term a ∈ [0, 1] is a leaking rate which controls the speed of the LI-ESN state dynamics, with larger values denoting faster dynamics. As a special case, a standard ESN is obtained from a LI-ESN by setting a = 1.
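A minimal sketch of this leaky state update, which reduces to the standard ESN step for a = 1 (names are illustrative):

```python
import numpy as np

def li_esn_step(u, x_prev, W_in, W_hat, a, f=np.tanh):
    """One LI-ESN state update: leaky integration of the ESN transition."""
    return (1.0 - a) * x_prev + a * f(W_in @ u + W_hat @ x_prev)
```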

2.2.5.3 ESN Reservoir construction

The following procedure is widely used to construct the ESN reservoir.

Algorithm 1: ESN reservoir construction.
1. Randomly initialize a sparse matrix W_0 ∈ R^{N_R×N_R}.
2. Compute ρ(W_0).
3. Build the matrix W_1 = (1/ρ(W_0)) W_0 (as a result, ρ(W_1) = 1).
4. Scale the resulting matrix by ρ_d ∈ (0, 1], obtaining Ŵ = ρ_d W_1 (as a result, ρ(Ŵ) = ρ_d).

The desired spectral radius ρ_d is thus the key quantity both for ensuring the ESP and for ESN training.
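A minimal NumPy sketch of Algorithm 1 follows; the dense uniform initialization that is then sparsified is an assumption of the example, not a prescription of the thesis.

```python
import numpy as np

def build_reservoir(N_R, density=0.1, rho_d=0.9, seed=0):
    """Construct an ESN reservoir matrix following Algorithm 1 (a sketch)."""
    rng = np.random.default_rng(seed)
    W0 = rng.uniform(-1.0, 1.0, size=(N_R, N_R))
    W0 *= rng.random((N_R, N_R)) < density          # sparsify: keep ~density of entries
    rho_W0 = np.max(np.abs(np.linalg.eigvals(W0)))  # spectral radius of W0
    W1 = W0 / rho_W0                                # now rho(W1) = 1
    return rho_d * W1                               # rho(W_hat) = rho_d

W_hat = build_reservoir(N_R=100, density=0.1, rho_d=0.95)
```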

2.2.5.4 ESN Supervised Training

As a general guide, ESN training can be summarized as follows.

Algorithm 2: ESN training guidelines.
Data: input series S = [u(1), . . . , u(T)]; target series Y = [y(1), . . . , y(T)].
1. Discard an initial transient (washout).
2. Collect the reservoir states and target values for each time step t: X = [x(1) . . . x(N)].
3. Train the linear readout: min_{W_out} ||W_out X − Y||²₂.

Depending on whether the task is sequence-to-sequence or sequence-to-element, there are different ways to compute the reservoir activation state matrix X. The first is obtained by considering all the states x(t) generated after an initial transient, while the latter considers only the last state produced, x(T). This is, however, the untrained part of the ESN: the actual training of the model is in the readout layer and amounts to learning the values of the W_out matrix, which implies the solution of a linear least squares minimization problem.

There are two main kinds of learning algorithms: offline and online learning. Offline learning consists in finding the solution of the least squares minimization problem using efficient batch linear methods such as:

• Moore-Penrose pseudo-inversion: W_out = Y X⁺;

• Ridge Regression (Tikhonov regularization).

Computing W_out via the Moore-Penrose pseudo-inverse is always possible and it is useful when high-precision ESN training is needed but, for this reason, it is prone to overfitting. Ridge Regression (or Tikhonov regularization), instead, leads to better generalization capabilities. This is an L2 regularization method, which implies the exact solution of the slightly modified least squares problem

||Y − W_out X||²₂ + λ||W_out||²₂

where λ is the Tikhonov regularization coefficient. The solution is given by

W_out = Y X^T (X X^T + λI)^{−1}

with I the identity matrix whose order equals the number of rows of X. Notice that the Moore-Penrose method can be obtained from the Ridge Regression method simply by setting λ = 0.
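A minimal NumPy sketch of this closed-form ridge readout solution (variable names are illustrative):

```python
import numpy as np

def train_readout_ridge(X, Y, lam=1e-6):
    """Closed-form ridge readout: W_out = Y X^T (X X^T + lam I)^(-1).

    X: reservoir states, shape (N_R, N); Y: targets, shape (N_Y, N).
    In practice np.linalg.solve is preferable to an explicit inverse.
    """
    N_R = X.shape[0]
    return Y @ X.T @ np.linalg.inv(X @ X.T + lam * np.eye(N_R))
```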

Some applications, as in [22], require an online adaptation of the W_out weights, meaning that W_out is updated by computing the error one example at a time. The simplest approach would tackle this problem as a gradient descent in the direction minimizing the instantaneous squared error

e(t) = ||y(t) − ŷ(t)||²₂

with y(t) the desired prediction at time t. Alternatively, the Recursive Least Squares (RLS) algorithm has been proposed by Jaeger in [22] as a fast online learning approach. At each time-step, RLS minimizes the discounted error

E(y, ŷ) = (1 / N_Y) Σ_{t=1}^{T} λ^{T−t} ||y(t) − ŷ(t)||²₂

where 0 ≤ λ ≤ 1 is an error forgetting factor.

The RLS algorithm pseudo-code, given a regularization term δ and the forgetting factor λ, is the following:

Algorithm 3: RLS algorithm.
Data: input series S = [u(1), . . . , u(T)]; target series Y = [y(1), . . . , y(T)].
1. Initialize P⁻¹(0) = δ⁻¹ I, with I the identity matrix with order equal to the number of rows of X.
2. Initialize W_out(−1) as the zero matrix.
3. for t = 0 to T − 1 do
4.   Given the input u(t), compute the state x(t) via Eq. 2.4.
5.   Compute ŷ(t) using Eq. 2.5.
6.   e(t) = y(t) − ŷ(t)
7.   Φ(t) = P⁻¹(t − 1) · x(t)
8.   k(t) = Φ(t) · (λ + Φ(t) · x(t))⁻¹
9.   P⁻¹(t) = λ⁻¹ (P⁻¹(t − 1) − (k(t) · Φ(t)))
10.  W_out(t) = W_out(t − 1) + (k(t) · e(t))^T
11. end
12. return W_out

Let us denote with Φ(t) = Σ_{i=0}^{t} λ^{t−i} x(i) x(i)^T the squared correlation matrix of order N_R of the reservoir state x; it gets recursively updated as Φ(t) = λΦ(t − 1) + x(t) x(t)^T. The matrix P is obtained as P = Φ⁻¹(t) and gets initialized as stated in line 1 of Algorithm 3, with a really big scalar δ, such that the invertibility of Φ(0) is guaranteed. The RLS algorithm, at each time-step, updates W_out with a rule involving the error residual e(t) and a vector k(t), named gain, which is defined in terms of the correlation matrix and of the activation state x(t).
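For illustration, a compact NumPy sketch of Algorithm 3 combined with the reservoir update of Eq. 2.4 could look as follows; names and default values are assumptions of the example.

```python
import numpy as np

def rls_train(U, Y, W_in, W_hat, N_Y, lam=0.999, delta=1e4, f=np.tanh):
    """Online readout training with RLS, following Algorithm 3 (a sketch)."""
    N_R = W_hat.shape[0]
    P_inv = np.eye(N_R) / delta          # line 1: P^-1(0) = delta^-1 I
    W_out = np.zeros((N_Y, N_R))         # line 2: zero readout matrix
    x = np.zeros(N_R)
    for t in range(U.shape[0]):
        x = f(W_in @ U[t] + W_hat @ x)           # state via Eq. 2.4
        y_hat = W_out @ x                        # prediction via Eq. 2.5
        e = Y[t] - y_hat                         # error residual
        phi = P_inv @ x
        k = phi / (lam + phi @ x)                # gain vector
        P_inv = (P_inv - np.outer(k, phi)) / lam
        W_out += np.outer(e, k)                  # line 10 (rank-1 update)
    return W_out
```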

2.3 Dealing with Missing Data

Multivariate time series data are ubiquitous in many practical applications, ranging from health care, geoscience and astronomy to biology and others. They often inevitably carry missing observations due to various reasons, such as medical events, saving costs, anomalies, inconveniences and so on. In particular, many practical ubiquitous computing and Internet-of-Things (IoT) scenarios comprise information generated by networks of loosely connected devices, which are often battery-operated and communicating through wireless channels with little quality-of-service guarantees. In such scenarios, it is not unlikely to have to deal with missing information.

In the past decades, various approaches have been developed to address missing values in time series [43]. The large majority of the works in the literature addresses the problem of missing inputs solely with respect to information that is missing at training time. In this context, dealing with missing information amounts to finding the best strategies to impute the values of features which are missing in some of the training samples: [14] provides a recent survey on this problem. A variety of methods have been developed to fill in the missing values, such as smoothing or interpolation [29], spectral analysis [37], kernel methods [40], multiple imputation [56], and the EM algorithm [14]. Excellent reviews of related solutions are provided in [43] and references therein. In particular, when dealing with sensor data, [33] uses a k-Nearest Neighbor (kNN) approach to replace missing sensor values with spatially and temporally correlated sensor values. This approach has a major drawback in the fact that it assumes that sensor devices are of a homogeneous type and that it is possible to find a spatially/temporally related sensor, of the same type as the failing one, to replace its measurements. An alternative approach to the problem is based on fitting a distribution of the missing features given the observable inputs, e.g. by using a Parzen window estimator [49], and on using it to sample replacement measurements for the missing information [47], [48]. In [8], a discriminative solution is proposed that uses a RNN to model static (i.e. non-sequential) problems, while recurrent layers are used to estimate the values of missing features. A rather different approach to information imputation uses committee machines or model ensembles [28]: these models accommodate data with missing features by training an ensemble of predictors with random subsets of the total number of available features. However, these solutions often result in a two-step process where imputation is separate from the prediction model and missing patterns are not effectively explored, thus leading to sub-optimal analyses and predictions [54].

The main problem in filling the gaps is that there is not just a single category of missing data (as explained in detail in [43]). For any data set, one can define indicator variables R that identify what is known and what is missing. We refer to R as the missingness. In modern missing-data procedures, missingness is regarded as a probabilistic phenomenon [41]. We treat R as a set of random variables having a joint probability distribution, the so-called distribution of missingness. Because missingness may be related to the data, we classify distributions for R according to the nature of the relationship, following Rubin's typology (1976) [41].

Adopting a generic notation, let us denote the complete data as Y_com and partition it as Y_com = (Y_obs, Y_mis), where Y_obs and Y_mis are the observed and missing parts, respectively. Rubin defined missing data to be missing at random (MAR) if the distribution of missingness does not depend on Y_mis,

P(R | Y_com) = P(R | Y_obs).    (2.8)

In other words, MAR allows the probabilities of missingness to depend on the observed data but not on the missing data. An important case of MAR, called missing completely at random (MCAR), occurs when the distribution does not depend on Y_obs either,

P(R | Y_com) = P(R).

When Equation 2.8 is violated and the distribution depends on Y_mis, the missing data are said to be missing not at random (MNAR).

With just this overview of missingness it is evident that the problem has changed from trying to fill the gaps to trying to understand the latent process that generates the data in terms of statistical relationships between samples of the dataset. It is thus clear that filling the gaps is a failing strategy: to deal with missing data we need a paradigm shift.

As happened with parallel programming with the rise of multicore processors and distributed systems, one cannot simply write sequential code and hope it will scale well with some hand-tweaking: parallel code has to be parallel by design. The same is happening here: in order for a model to be resilient to missing data, missingness has to be considered during model design.

Recent missing-data strategies exploit this idea: [34] and [11] tried to handle missingness in RNNs by concatenating missing entries or timestamps with the input or performing simple imputations. Exploiting the power of RNNs along with the informativeness of missing patterns is a new, promising avenue to efficiently model multivariate time series. Che et al. (2016) [9] proposed a novel deep learning model based on the GRU, named GRU-D, that effectively exploits two representations of informative missingness patterns, i.e. masking and time interval. Masking informs the model which inputs are observed (or missing), while the time interval encapsulates the input observation patterns. The model captures the observations and their dependencies by applying masking and time interval (using a decay term) to the inputs and network states of the GRU, and jointly trains all model components using back-propagation. Thus, the model not only captures the long-term temporal dependencies of time series observations but also utilizes the missing patterns to improve the prediction results. Che and colleagues demonstrated in [9] that GRU-D is able to outperform imputation-based recurrent neural network models in dealing with missing data.

3 Approach

3.1 Motivations and Goal

As we stated in the previous chapter, RNNs are a great tool to deal with sequential information. However, these networks, as well as their feed-forward counterparts, require a fixed set of input features to be available both at training and at test time. The model's predictive performance tends to abruptly decline when one or more of the input features on which it has been trained is missing when querying the model for predictions [28]. Many practical ubiquitous computing and IoT scenarios comprise information generated by networks of loosely connected devices, which are often battery-operated and communicate through wireless channels with little quality of service guarantees. In such scenarios, it is not unlikely to have to deal with missing information, which might result in a RNN performing a recognition or prediction task missing a part of its inputs.

The goal of this thesis is to propose, and to experimentally assess, a principled approach to making neural networks robust to such missing inputs at query time (i.e. at prediction), focusing in particular on representative RNN models which exhibit very different characteristics, and showing that the proposed approach is a source of benefits in terms of model resiliency.


The large majority of the works in the literature (see Section 2.3) addresses the problem of missing inputs solely with respect to information that is missing at training time. In this context, dealing with missing information amounts to finding the best strategies to impute the values of features which are missing in some of the training examples: [14] provides a recent survey on this problem. In this work, instead, we consider a scenario in which we are able to train a RNN using complete data (i.e. with no missing inputs) but where part of the inputs may become unavailable, even for a long time, during system operation (e.g. due to a sensor failure). The typical approaches to this problem, introduced in Section 2.3, exploit imputation, i.e. substituting the missing feature by inferring it from other information. However, none of the approaches proposed, for example, in [32], [33], [49] attempts to make the original model robust to missing information.

We propose a novel solution to the missing input problem that exploits Dropout [45], a regularization technique that has become widely popular in the context of deep learning in recent years. Dropout prevents overfitting by randomly selecting, during training, a subset of neurons or synapses which are dropped (i.e. their activation is zeroed) for the current training sample. The key intuition guiding this work is that one of the effects of Dropout is to train the full neural model as if it were an ensemble of thinned models obtained by the random dropout disconnections. As such, we believe that it can be exploited to build a model robust to missing inputs as an ensemble of thinned models obtained by applying dropout to the input neurons and connections alone.

In the following, we discuss the application of such a technique, referred to as DropIn from here onwards, to the most representative RNN models.

Figure 3.1: Dropout Neural Network Model. Left: a standard neural network with 2 hidden layers. Right: an example of a thinned net produced by applying dropout to the network on the left. Opaque units have been dropped.

3.2 The DropIn Approach

In this section, we provide a brief overview of the Dropout regularization technique before actually introducing the proposed approach.

3.2.1 Dropout regularization

Dropout is a regularization technique for fully connected neural network layers with adaptive synaptic weights [45]. The key intuition of Dropout is to prevent units from co-adapting by randomly disconnecting a subset of them from the network during training. Dropout is also thought to act as a model ensembling technique [45], in that it can efficiently combine an exponential number of neural network architectures corresponding to the thinned networks obtained by randomly dropping subsets of neurons.


Figure 3.2: Left: A unit at training time that is present with probability p and is connected to units in the next layer with weights w. Right: At test time, the unit is always present and the weights are multiplied by p.

Dropout training works by randomly determining which units are kept at a given training instant; each unit is retained with a fixed probability p, independently from other units, or is dropped with probability 1 − p. The retention probability p is typically determined through a model selection procedure. A dropped-out unit is temporarily removed from the network, along with all its incoming and outgoing connections: in practice, this amounts to zeroing the dropped unit's output (Fig. 3.1).

In this respect, a neural network with N units is an ensemble of 2^N thinned networks whose total number of parameters is still bounded by O(N^2), since they all share the weights. At test time, it is not feasible to reproduce the network thinning process to compute the ensemble prediction from all the 2^N models. However, [45] shows how this can be efficiently approximated by using the original unthinned network with weights scaled by the retention probability p (Fig. 3.2).
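A minimal NumPy sketch of the two phases described above (training-time masking with retention probability p and test-time scaling by p, i.e. standard, non-inverted Dropout); function names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_train(h, p):
    """Training phase: keep each unit with retention probability p, zero it otherwise."""
    mask = rng.random(h.shape) < p
    return h * mask

def dropout_test(h, p):
    """Test phase: keep all units and scale activations by p, approximating the ensemble."""
    return p * h
```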

3.2.2 DropConnect

The Dropout approach has later been generalized by the DropConnect model [53]. Instead of dropping out single units, DropConnect sets random subsets of the network weights to zero, such that each unit receives input from a random portion of the neurons in the previous layer. To do so, it exploits, again, a retention probability p, which is independently applied to the single elements of the neural network connectivity matrix. Like Dropout, the technique is suitable for fully-connected layers only.

Consider a fully-connected layer of a neural network with input u = [u_1, u_2, . . . , u_n]^T and a weight matrix W (of size d × n). The output of this layer, y = [y_1, y_2, . . . , y_d]^T, is computed as

y = φ(W u)    (3.1)

When Dropout is applied to the outputs of a fully connected layer, we can write Eq. 3.1 as:

y = m ⊙ φ(W u)    (3.2)

where ⊙ denotes the element-wise product and m is a binary mask vector of size d, with each element drawn independently from a Bernoulli distribution with probability p. Since many commonly used activation functions, such as tanh, sigmoid and relu, have the property that φ(0) = 0, Eq. 3.2 can be rewritten as:

y = φ(m ⊙ (W u))    (3.3)

DropConnect is similar to Dropout as it introduces dynamic sparsity within the model, but differs in that the sparsity is on the weights W, rather than on the output vectors of a layer. In other words, the fully-connected layer with DropConnect becomes a sparsely-connected layer in which the connections are chosen at random during the training stage. Note that this is not equivalent to setting W to be a fixed sparse matrix during training.

For a DropConnect layer, the output is given as:

y = φ((M ⊙ W) u)    (3.4)

where M is a binary matrix encoding the connection information and M_{ij} ∼ Bernoulli(p). From Eq. 3.3 and Eq. 3.4, it is evident that DropConnect is the generalization of Dropout to the full connection structure of a layer. This method is shown to enhance the ensembling effect with respect to Dropout, as the number of thinned networks in DropConnect is O(2^{|M|}), where M is the matrix of randomly selected zeros and ones used for weight masking [53].
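For illustration, a minimal NumPy sketch of a training-time DropConnect layer following Eq. 3.4 (names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropconnect_layer(u, W, p, phi=np.tanh):
    """Training-time DropConnect: mask individual weights, y = phi((M ⊙ W) u), Eq. 3.4."""
    M = rng.random(W.shape) < p        # M_ij ~ Bernoulli(p)
    return phi((M * W) @ u)
```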

3.2.3 The rationale behind DropIn approach

The inspiration for our approach comes from [28]: in the paper, the authors introduce a learning algorithm, named LEARN++MF, that can accommodate data with missing features. The algorithm uses an ensemble-of-classifiers approach. The classifiers in the ensemble are trained with random subsets of the total number of available features. When an instance with missing feature(s) needs to be classified, only those classifiers trained with the features that are presently available in the given instance are used to determine the correct classification. These classifiers are referred to as usable classifiers.

In the paper, the authors show empirically that if U classifiers yield a certain classification performance on data with no missing features, then a similar performance can be achieved simply by generating additional weak classifiers on data that have missing features. Weak classifiers are classifiers trained on subsets of the input features. They are typically used in ensembles. Using an ensemble of weak classifiers has benefits with respect to a single classifier using the complete set of features. First, the training time is often lower for generating multiple weak classifiers compared to training one strong classifier. This is because strong classifiers spend a majority of their training time fine-tuning the desired decision boundary, whereas weak classifiers completely skip the fine-tuning stage, as they only generate a rough approximation of the decision boundary. Intimately related to fast training, weak classifiers are also less prone to overfitting problems, since they avoid learning outliers or, quite possibly, a noisy decision boundary.

Since an ensemble of (weak) classifiers is combined through a majority voting process, each classifier can be trained using a different set of features. By creating an ensemble with various subsets of the features used to train each classifier, a subset of the classifiers can still be used when a particular instance is missing certain features. This is because an instance missing certain features can be classified by those classifiers that did not use the missing features for training.

This approach makes two basic assumptions. First, the feature set is redundant, and hence includes an unknown number of features that are actually not required, or even possibly irrelevant. Second, the features are assumed to be unrelated and/or independent (the value of any feature is independent of all others). While these assumptions are clearly not satisfied in all classification problems, they actually hold true in many practical applications.

Despite the undoubtedly innovative approach, we believed we could do better. In particular, referring to the LEARN++MF algorithm [28], to cover all possible missing input features an exponential number of weak models has to be generated, trained, and their results collected for the original classifier prediction. This is simply computationally unfeasible. We need a way to accomplish the same result without the need to generate such an exponential number of classifiers.

3.2.4 Dropout for missing data

We describe a novel use of the Dropout technique as a principled approach to make neural networks robust to missing inputs at prediction time. We name this approach DropIn, as it is based on the use of unit-wise Dropout [45] at the level of the input neurons of the network. The rationale inspiring our work is the observation that applying Dropout to N_U input units essentially trains a committee machine of 2^{N_U} thinned networks with shared weights. In this sense, our approach recalls the committee machine approach of [28]. Differently from that work, our method does not require defining a specific committee machine architecture, nor does it increase the parametrization of the original model. The choice of unit-wise Dropout [45] in place of the connection-wise DropConnect [53] is motivated by the fact that the latter approach would not make the network robust to losing one or more of its inputs; rather, it would make the network somewhat less focused on a specific input which, in principle, can be very discriminant for the task.

Experiments performed with the DropConnect approach (not reported in this document) confirmed our intuition: the resulting model performs significantly worse than Dropout on our tasks and, in any case, it differs significantly from the missing-input model which is the focus of our work.

In this thesis, we show that the DropIn approach is general. By proving that applying Dropout on the input layer has beneficial effects on all kinds of RNNs considered, ranging from SRNs to ESNs, we de facto prove that Dropout can be extended to RC. This represents a novelty in the ML literature, which led us to present the conference paper [6] at the International Joint Conference on Neural Networks (IJCNN 2017).

In this document, in particular, we focus on recurrent neural networks with the goal of proving that the proposed approach makes them more robust in adverse prediction settings involving missing data.


Figure 3.3: The DropIn approach applied to a generic Recurrent Neural Network.

3.2.5 DropIn Recurrent Neural Networks

We start by introducing the DropIn approach: we claim that adding a Dropout input layer on top of an RNN model leads to a lower generalization error in predictions with missing data (Fig. 3.3). The rationale behind this idea is that, if we simulate the missingness of inputs during training, we will likely obtain a model which is able to cope with missing data at prediction time.

To reproduce the missing-data situation we use a Dropout layer at the input level of the model which, as described in Section 3.2.1, retains each input unit during training with probability p. For DropIn to work, we need an online training algorithm (e.g., Stochastic Gradient Descent), since every sequence u in the training data D = {(u, y) | u input sequence, y target sequence} has to be seen by the model multiple times, each time with a fixed probability 1 − p of independently missing each single input, where p is a retention probability as in Dropout. A typical learning algorithm exploiting DropIn is listed in Algorithm 4.


Algorithm 4: DropIn approach for RNN training.

Require: A dataset of L pairs of input-output sequences (u(1), y(1)), . . . , (u(L), y(L)); a properly initialized RNN model; an input unit retention probability p.

1  while learning not converged do
2      Shuffle sequence order
3      for i = 1 to L do
4          Compute masking vector m with retention probability p (e.g., m ∼ Bernoulli(p))
5          for t = 0 to T − 1 do
6              Mask the current input u(t) using m, obtaining m(t) = m ⊙ u(t)
7              Given the masked input m(t), compute the RNN state x(t)
8              Compute the predicted output ŷ(t) from the state x(t)
9              Compute the instantaneous error e(t) = ||y(t) − ŷ(t)||₂²
10             Update the model parameters so that e(t) is minimized
11         end
12     end
13 end

Note that the masking vector m is computed before processing the input sequence u(i), with i = 1 . . . L and L the dataset size, so that the same masking is applied congruently to each time step of the sequence. This is how we simulate missing data caused, e.g., by a sensor failure. It is interesting to notice that, if we move line 4 of Algorithm 4 inside the inner loop over time steps, a new mask is drawn at every time step and we can emulate a missing at random (MAR) situation.
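For concreteness, the following Python sketch mirrors the structure of Algorithm 4 for an abstract RNN; the methods `rnn.step`, `rnn.readout` and `rnn.update` are hypothetical placeholders for the state transition, the output computation and the gradient-based parameter update of the concrete model (e.g., an SGD step on the instantaneous error).

```python
import numpy as np

def dropin_train(rnn, dataset, p, n_epochs, seed=0):
    """dataset: list of (U, Y) pairs, with U of shape (T, n_inputs) and Y of shape (T, n_outputs)."""
    rng = np.random.default_rng(seed)
    for _ in range(n_epochs):                      # until learning converges
        rng.shuffle(dataset)                       # shuffle sequence order
        for U, Y in dataset:
            # one mask per sequence: the same inputs stay "missing" at every time step
            m = rng.binomial(1, p, size=U.shape[1])
            for t in range(U.shape[0]):
                masked = m * U[t]                  # simulate missing inputs
                x = rnn.step(masked)               # compute the state x(t)
                y_hat = rnn.readout(x)             # predicted output
                e_t = np.sum((Y[t] - y_hat) ** 2)  # instantaneous squared error
                rnn.update(e_t)                    # update parameters to minimize e(t)
```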

We assess the performance of the DropIn approach using the most representative RNN models, to demonstrate its generality. A generic RNN model belongs to one of these two main classes:

• Fully-Trained Recurrent Neural Networks (FTRNNs): using the notation of Section 2.1.1, there is no complete separation between the input encoding function τenc and the output function τout. All layers in the model are trained in order to minimize the generalization error.

• Partially-Trained Recurrent Neural Networks (PTRNNs): only a small part of the model is trained, typically the one which implements the output transduction function τout. The rest remains untrained once correctly initialized.

As representatives of the FTRNNs we selected a very simple recurrent model (an Elman Network, SRN) and two complex modern RNNs: an LSTM network and a GRU model. For the PTRNN class, instead, we decided to investigate the effects of DropIn in the context of RC, and in particular on Echo State Networks. Implementation details of the chosen models can be found in Section 2.2 of this thesis.

The choice of such fully-trained models allows us to assess the effects of DropIn on models of increasing complexity, ranging from simpler ones to more complex ones able to achieve state-of-the-art results in many tasks (e.g., image captioning [25] and machine translation [46]). Whereas using DropIn on fully-trained models is rather straightforward, partially-trained models (and in particular ESNs) represent the most challenging setting in which to apply the approach, and they require some non-trivial choices that we describe hereafter.

3.2.5.1 DropIn-ESN

Here, we focus on the use of DropIn in the context of RC and, in particular, with the ESN model. The motivation for such a choice is twofold. First, RC models are a popular approach in ambient intelligence, pervasive computing and IoT applications [5], [12], [38], which are areas where one has to deal with input information collected by low-fidelity devices (e.g., battery-operated) and transmitted over lossy communication channels, thus involving a high likelihood of faults and missing data. The second motivation is more related to the exploration of the effect of DropIn in a learning paradigm which is, perhaps, the least straightforward in terms of Dropout application. The characteristics of RC, in fact, make it benefit poorly from Dropout in terms of regularization, as only a minor part of the network weights is trained. Indeed, to the best of our knowledge, this is the first work in which Dropout techniques are used in RC. Here, we seek to determine whether the application of Dropout to inputs with untrained weights has an effect propagating through the untrained reservoir all the way to the readout, which is the only trained part of the network and hence the only part where the effect of DropIn can be recorded.

The application of DropIn to an ESN (Fig. 3.4) requires it to be trained through an iterative learning algorithm. In fact, in order to make the network robust to missing inputs, we need the ESN to process each training sequence multiple times, each time with a fixed probability 1 − p of independently missing the single inputs, where p is a retention probability as in Dropout. As discussed in Section 2.2.5.4, the ESN online training relies on the Recursive Least Squares (RLS) algorithm.


Figure 3.4: DropIn ESN.

Our DropIn-ESN model embeds input-unit Dropout within the steps of incremental RLS learning. A schematic view of the procedure is provided by the pseudo-code in Algorithm 5.

Assuming we are provided with a properly initialized ESN (see Section 2.2.5.1), the procedure runs for several epochs through the dataset (each time reshuffling the order of the sequences) until an error stability criterion is met (e.g., early stopping on the validation error) or the maximum number of epochs is reached. The DropIn mask m in Fig. 3.4 is computed before processing a whole sequence, hence the mask is the same for all the elements of the sequence (see the first operation in the outer for loop of Algorithm 5). This ensures that the readout weights are modified to take into account reservoir activations corresponding to certain inputs being missing. Clearly, the same training sequence can be expected to be processed again in the next epochs, but this time with different input features missing. As in standard Dropout, the decision to drop an input unit is taken independently for each input feature with probability 1 − p. Input masking can then be easily obtained by zeroing the elements of the current input u(t) using the masking vector m. The equations in the innermost loop of Algorithm 5 implement the standard RLS algorithm for ESNs, whose details can be found in [22].


Algorithm 5: DropIn Training of an ESN by RLS.

Require: A dataset of L pairs of input-output sequences (u(1), y(1)), . . . , (u(L), y(L)); a properly initialized ESN model; a forgetting factor λ; a regularization term δ; an input unit retention probability p.

1  Initialize P⁻¹(0) = δ⁻¹ I, with I the identity matrix of order equal to the number of rows of X
2  Initialize Wout(−1) as the zero matrix
3  n = 0
4  while learning not converged do
5      Shuffle sequence order
6      for i = 1 to L do
7          Compute masking vector m with retention probability p (e.g., m ∼ Bernoulli(p))
8          for t = 0 to T − 1 do
9              Mask the current input u(t) using m, obtaining m(t)
10             Given the masked input m(t), compute the state x(t) via Eq. 2.4
11             Compute ŷ(t) using Eq. 2.5
12             e(t) = y(t) − ŷ(t)
13             Φ(t) = P⁻¹(t − 1) · x(t)
14             k(t) = Φ(t) · (λ + Φ(t) · x(t))⁻¹
15             P⁻¹(t) = λ⁻¹ (P⁻¹(t − 1) − k(t) · Φ(t))
16             Wout(t) = Wout(t − 1) + (k(t) · e(t))ᵀ
17         end
18         n = n + 1
19     end
20 end

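For reference, a compact NumPy rendering of the innermost RLS update of Algorithm 5 could look as follows. This is a sketch under the assumption that the state x(t) is a column vector and that Wout has one row per output unit; the product Φ(t) · x(t) is computed as the scalar Φ(t)ᵀ x(t). The exact formulation used in [22] may differ in minor details (e.g., how the extended state is built).

```python
import numpy as np

def rls_step(x, y, y_hat, P_inv, W_out, lam):
    """One RLS readout update after processing a single (masked) input.
    x: state column vector (n_x, 1); y, y_hat: target and prediction, shape (n_y, 1);
    P_inv: inverse correlation matrix (n_x, n_x); W_out: readout weights (n_y, n_x)."""
    e = y - y_hat                               # instantaneous error e(t)
    phi = P_inv @ x                             # Phi(t) = P^-1(t-1) x(t)
    k = phi / (lam + float(phi.T @ x))          # gain vector k(t)
    P_inv = (P_inv - k @ phi.T) / lam           # P^-1(t)
    W_out = W_out + e @ k.T                     # Wout(t) = Wout(t-1) + (k(t) e(t)^T)^T
    return P_inv, W_out
```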

Once an ESN model has been trained using DropIn, it can be used for prediction without any weight re-scaling. Note that this is different from how standard Dropout works. In a DropIn-ESN, in fact, masking the input units does not bear any effect on the input-to-reservoir weights, whose values are frozen to their random initialization. The readout weights are the only ones affected by DropIn, and the effect on them is indirect, mediated by the reservoir activations. Preliminary experiments on the effect of weight re-scaling (not reported in this document) show that it negatively affects predictive performance. The reasons underlying such effects deserve further study: possibly, the fact that DropIn networks do not require re-scaling is related to the fact that at test time we are simulating missing inputs, hence the network operates under conditions similar to those of the Dropout training phase.


4 Implementation

As advocates of reproducible research, we wanted the code supporting our experiments to be cross-platform and open-source, so, among other programming languages, we opted for Python. Python is a natural choice when working in ML: its popularity for data science is largely due to the strength of its core libraries (NumPy, http://www.numpy.org/; SciPy, https://www.scipy.org/; Matplotlib, http://matplotlib.org/; IPython, http://ipython.org/), its high productivity for prototyping and building small and reusable systems, and its strength as a general-purpose programming language. Moreover, Python also excels as a glue language for applications written in C, C++, and Fortran, especially through the excellent Cython project (http://cython.org/).

For the data manipulation part we used the Pandas library (http://pandas.pydata.org/) [36]; among other benefits, this library provides (a short usage sketch follows the list below):

• A fast and efficient DataFrame object for data manipulation with integrated indexing;

• Tools for reading and writing data between in-memory data structures and different file formats.
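A minimal usage example of the kind of manipulation Pandas makes convenient (the file name and column names below are purely illustrative):

```python
import pandas as pd

# load a CSV of timestamped sensor readings into an indexed DataFrame
df = pd.read_csv("sensor_log.csv", parse_dates=["timestamp"], index_col="timestamp")

# label-based manipulation: resample to 1-minute averages and drop incomplete rows
df_clean = df.resample("1min").mean().dropna()
print(df_clean.head())
```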
