Gated Reservoir Networks for Sequences and Trees

Author

Daniele Di Sarli

Supervisors

Dr. Claudio Gallicchio

Prof. Alessio Micheli


Recurrent Neural Networks are an important tool in the field of Machine Learning, since they represent a learning model that allows a neural network to learn inferences over temporal or sequential data. Such neural networks, in their naïve implementation, are notoriously costly and difficult to train. For this reason, different approaches have emerged in the literature to exploit Recurrent Neural Networks in practical settings. The first approach that we consider, Echo State Networks, harnesses the intrinsic characteristics of Recurrent Neural Networks in order to make the costly training of the recurrent part of the network unnecessary. The second approach that we consider, Gated Recurrent Units, exploits architectural structures called “gates” in order to make training easier. We propose a new efficient neural network model (GResNet) that extends Echo State Networks by introducing gates, and for Gated Recurrent Units we explore hybrid training approaches in which only the gates are trained. We then extend our proposed models to deal with tree-structured data. In our experiments, we validated our neural network models on two Natural Language Processing tasks, both for trees and for sequences. The results show an increase in predictive performance for GResNet with respect to Echo State Networks, and highlight how fully training a Gated Recurrent Unit can even be unnecessary.


Contents

1 Introduction
  1.1 Thesis outline
2 Background
  2.1 Notation
  2.2 Word Embeddings
  2.3 Recurrent Neural Networks
    2.3.1 Echo State Networks
    2.3.2 Gated Recurrent Units
  2.4 Recursive Neural Networks
    2.4.1 Tree Echo State Networks
    2.4.2 Tree Gated Recurrent Units
3 Proposed Models for Sequences
  3.1 GResNet: Gated Reservoir Network
    3.1.1 Initialization and training
    3.1.2 Contractivity
    3.1.3 Computational complexity
  3.2 Hybrid GResNets
    3.2.1 Gradient computation
    3.2.2 Computational complexity
4 Proposed Models for Trees
  4.1 Tree-Structured Gated Reservoir Network
    4.1.1 State mapping functions
    4.1.2 Initialization and training
    4.1.3 Contractivity
  4.2 Hybrid TreeGResNet
    4.2.1 Gradient computation
    4.2.2 Computational complexity
5 Experiments and Results
  5.1 Datasets
    5.1.1 Question Classification
    5.1.2 ITAmoji
  5.2 Models for sequences
    5.2.1 Question Classification Task
    5.2.2 ITAmoji Task
  5.3 ITAmoji competition
    5.3.1 Preprocessing
    5.3.2 Description of the system
    5.3.3 Training
    5.3.4 Results
  5.4 Models for trees
    5.4.1 Question Classification Task
    5.4.2 ITAmoji Task
  5.5 Discussion
6 Conclusion
A Hyperparameters Search Space
B Additional Results
List of Figures
List of Tables


1 Introduction

One of the biggest challenges in developing software systems which are as powerful as the brain is surely that of computational efficiency. The human brain is estimated to contain about 85 billion neurons [60], each of which is able to exhibit complex behaviors and interactions with the others, in an apparently chaotic mixture of electrochemical signals which however, at large scales, gives rise to the exceptionally sophisticated characteristics of many living organisms. Biologically realistic simulations of even small populations of neurons have, until recently, been computationally unfeasible. Luckily, the field of Machine Learning has shown that it is sufficient to use extremely simple mathematical models (even if far from biological plausibility) to represent simplified artificial neural networks that are still able to learn and adapt, thus trading biological plausibility for efficiency. As of today, complex applications like Machine Translation [11, 55], Image Captioning [58] and Speech Recognition [23], just to name a few, can only be realized by resorting to artificial neural network models, since classical approaches (when they exist) have proved to be insufficient.

Despite the intrinsic efficiency of these artificial models when compared to biologically plausible simulations, the progress towards larger and more complex neural networks has created the need for disproportionate computational resources to train these models successfully, requiring the strengths of the millions of connections between the neurons to be fine-tuned one by one in search of an optimal configuration.

Is such costly fine-tuning really necessary? The Izhikevich model [31] is a mathematical framework that can accurately simulate the dynamics shown by biological cortical neurons by means of a few differential equations. Amazingly, just randomly connecting these simulated neurons leads to a population that exhibits the same collective rhythmic patterns observed in the mammalian cortex during wake or sleep states. How can such an order emerge from the random, non-plastic synapses?

Being able to extract information from the behavior of a randomly connected artificial neural network would make it possible to get a very efficient model that mostly does not require training. This is exactly what the field of Reservoir Computing aims to study. We can model the complex interactions between an input signal and a medium (the reservoir), which can be composed of biologically plausible neuron models, of simplified artificial neuron models, or even of a physical liquid material, and then we can learn to make predictions just by observing the state of such a reservoir at any given time. A particular instance of this class of models in the context of Machine Learning is represented by Echo State Networks.

Echo State Networks, introduced by Herbert Jaeger [32], implement the idea of Reservoir Computing with a model that is able to make predictions from sequential input signals by exploiting the ability of a large untrained reservoir to discriminate between different input histories. While Echo State Networks have biologically plausible properties as a whole [43], the reservoir is implemented as an artificial neural network composed of simplified mathematical abstractions that only emulate mean firing rates of biological neurons. The strengths of the connections between the neurons in the reservoir are fixed and initialized at random, as training only happens within a readout layer that observes the state of the reservoir and learns to emit the correct result for the task at hand. The fact that random connections are present in the network implies that the model could easily be sub-optimal, even if in many cases the achieved predictive performance is satisfying and comparable to that of a fully trained network. Many techniques can be used to adapt the random reservoir in an unsupervised way with the aim of further improving the predictive performance at the expense of computational cost, one notable example being Intrinsic Plasticity [49], a rule inspired by biological neurons that acts on the intrinsic parameters of each unit.

In this work, we will follow a different approach for the optimization of the Echo State Network reservoir. Instead of acting on the synapses or on the individual neurons, we will explore some architectural modifications to the reservoir that will allow the evolution of the state to follow different, more interesting dynamics. This is motivated by the fact that one of the main issues with Echo State Networks is the difficulty that they experience when trying to model long term dependencies within a long input sequence. If we were able to carry forward in time these dependencies instead of letting them gradually vanish, then we could obtain richer reservoir states from which we could extract more information.

The issue of long term dependencies has a great impact in the field of Machine Learning, since many tasks can be modeled as sequences of varying length. Aside from the obvious case of temporal data, images can be modeled as a sequence of pixels, and even text expressed in natural language can be seen as a sequence of words or characters. As a matter of fact, to validate our proposed models we will deal with Natural Language Processing tasks, where long term dependencies are a fundamental characteristic and their impact cannot be ignored.

Echo State Networks are not the only type of networks to suffer from tasks presenting long term dependencies in the data. In fact, Echo State Networks are just a particular instance of the more general framework of Recurrent Neural Networks, which are models characterized by a state that allows the network to remember information between different time steps. In their most common form, unlike Echo State Networks, an iterative algorithm is used to train all parameters of Recurrent Neural Networks, usually with high training costs.

The literature already provides models that are capable of alleviating the issues with long term dependencies in fully trained Recurrent Neural Networks. For example, Gated Recurrent Units employ gates, special architectural structures in the recurrent part of the model that allow the network to selectively remember and forget information across time steps. Gated Recurrent Units are trained by means of the same techniques used for Recurrent Neural Networks, and for this reason training a network with a large number of neurons can still be computationally expensive. Still, their capacity to learn to choose what to store in their limited short-term memory has gained them high popularity and state-of-the-art results.

In fact, it is by using gates that we take on the challenge of improving the predictive capabilities of Echo State Networks, while taking care to maintain intact the main strength of Reservoir Computing: computational efficiency. Nonetheless, we will also investigate ways of trading predictive performance for computational efficiency in Gated Recurrent Units, by using hybrid training patterns in which part of the parameters (e.g. those related to the gates) are trained by an iterative algorithm and the remaining part is fixed and random, as in the Reservoir Computing paradigm.

We will delve into more details about Echo State Networks and Gated Recurrent Units in Chapter 2. For now, however, it is interesting to notice that sequences are not the only interesting type of data that we can encounter. Natural Language Processing, Chemistry and Social Networking are only some of the fields that routinely work with information that can be naturally expressed in structured form, like graphs or trees. It is only natural, then, to ask whether we can generalize our algorithms to work with tree-structured data in a way that preserves the relationship information between the nodes. From the literature, Recursive Neural Networks [54] describe a generalization of Recurrent Neural Networks for structured data. We will then follow this path to extend all our models to their tree-structured variants, and experiments to assess their performance will be discussed just like for the sequential case.

In particular, our experiments will validate our models over two Natural Language Processing tasks. The choice of this particular field gives us the opportunity to test our networks over both sequential and tree-structured data, since natural language sentences can have meaningful representations both as a sequence of words and as a grammatical parse tree. The two tasks that we will deal with are Question Classification [42], which requires identifying the semantic scope of a question, and ITAmoji [48], which asks to predict the emoji associated with a Twitter message.

As a matter of fact, the ITAmoji dataset derives from the competition of the same name, organized in 2018 by EVALITA, a periodic evaluation campaign of Natural Language Processing and speech tools for the Italian language. We took the opportunity to participate in ITAmoji in order to assess the performance of the already existing Reservoir Computing models in a competitive setting.

To wrap up, with this thesis we have set ourselves the goal of developing new Reservoir Computing models that are able to bring their predictive performance one step closer to the current state of the art, while still adopting extremely efficient training techniques. Particular focus has been put on structured models, since they provide a generalization of Recurrent Neural Networks that is fundamental for working in many real-world application domains.


1.1 Thesis outline

We have given a short introduction about the problems that we are going to tackle, the motivations and the approaches that we will follow.

In Chapter 2, we will review the fundamental models on which our contribution builds. We will provide a brief overview of Recurrent Neural Networks and then we will present the relevant models from the literature (Echo State Networks and Gated Recurrent Units). For working with tree-structured data we will then illustrate Recursive Neural Networks and the corresponding generalizations of the previously introduced models (Tree Echo State Networks and Tree-Structured Gated Recurrent Units). We will also take the opportunity to fix some of the notation which is used throughout the rest of the thesis.

In Chapter 3, we will discuss our proposal for new models which extend the Echo State Network framework with (a) architectural changes and (b) different, hybrid training methodologies.

In Chapter 4, we will extend the models proposed in Chapter 3 by offering a generalization that is designed to deal with tree-structured input data.

We will then present and discuss, in Chapter 5, the experiments comparing our new models to those already in the literature. We will also include the results from our participation in the ITAmoji competition, a Natural Language Processing task for emoji prediction.

In Chapter 6, we will finally draw the conclusions and talk about possible future related work.


2 Background

In this chapter we will review the existing literature about the neural network models that we will use as the foundations for our work. In the first part of the chapter we deal with recurrent networks [39], a neural network model for sequential or temporal data domains, while in the second part we introduce recursive networks [54], which generalize the idea of recurrent networks for structured data domains.

First, in Section 2.1 we introduce some of the basic notation which will be adopted throughout the rest of the text. We took particular care in homogenizing the notation across the literature so that, besides making the text more readable, the differences and similarities between the equations that we will introduce are self-evident.

In Section 2.2 we will briefly explain the notion of “word embeddings”, an important tool for an effective processing of natural language sentences with neural networks.

In Section 2.3 we will deal with Recurrent Neural Networks, a particular class of Neural Networks designed to work on sequential or temporal inputs, whose main characteristic is the ability to retain memory of past computations. We will see the many possible applications of Recurrent Neural Networks but also their problems, which put a practical limitation on the theoretical capabilities of these models. We will then introduce two different paradigms for approaching the above mentioned problems: Reservoir Computing [43, 57] and gated Recurrent Neural Network models [11, 28]. Regarding the first, in Section 2.3.1 we show the Echo State Network model, one of the most important representatives of the Reservoir Computing paradigm. In Section 2.3.2, instead, we will see how the gates in the Gated Recurrent Unit (one of the most used models in many state-of-the-art results) can offer an alternative solution to the problems of recurrent networks.

In Section 2.4 we will then deal with Recursive Neural Networks, a generalization of recurrent networks for dealing with structured data like trees. This is motivated by the fact that many problems can be more naturally expressed by representing the information involved in the computation in the form of more complex data structures, like trees. Since Recursive Neural Networks suffer from the same issues affecting recurrent networks, we will extend the two approaches introduced before (Echo State Networks and Gated Recurrent Units) for dealing with tree-structured data: in Section 2.4.1 we will introduce Tree Echo State Networks, while in Section 2.4.2 we present the Tree Gated Recurrent Unit.

2.1 Notation

Input, state and output

The kind of neural network models that we will discuss have three fundamental components: the input, the state and the output, all of which can be in the form of either sequences or trees, depending on the model. In any case, individual elements of the input, of the state and of the output are represented as real-valued vectors of size $N_U$, $N_R$ and $N_Y$, respectively. We will always assume to deal with column vectors unless stated otherwise.

In general, we will refer to the $n$-th element of the input (which could be the $n$-th element of the sequence, or the $n$-th node of a tree according to some order) with $u(n) \in \mathbb{R}^{N_U}$. Similarly, we use $x(n) \in \mathbb{R}^{N_R}$ for the state and $y(n) \in \mathbb{R}^{N_Y}$ for the output.

Sequences

When dealing with sequences, it is often useful to refer to a sequence as a whole, instead of to its individual elements. We will then denote a sequence of length $n$ with $s_n(\cdot)$. For example, in our case:

$$s_n(u) = [u(1), \ldots, u(n)]$$
$$s_n(x) = [x(1), \ldots, x(n)]$$
$$s_n(y) = [y(1), \ldots, y(n)].$$


Tree structures

Since we will often deal with models working with tree-structured data, we will now fix some useful notation about trees.

A tree $t$ is a recursive data structure whose set of nodes is denoted by $\mathcal{N}(t)$, and whose root is denoted by $root(t)$. We will denote the space of all the trees with labels in the domain $\mathcal{L}$ by $\mathcal{L}^\#$, while the space of all the trees with labels in the domain $\mathcal{L}$ where each node has at most $k$ children will be denoted by $\mathcal{L}^{\#_k}$. The $k$ children of a node $n$ will be denoted by $ch_1(n), \ldots, ch_k(n)$, and the number of children of node $n$ by $d(n)$.

Activation functions

Activation functions are particular nonlinear functions which are used to introduce nonlinearities in the computations of neural networks in order to increase their expressive power.

We will often make use of the standard logistic activation function, sometimes also referred to as “sigmoid” (Figure 2.1a). It is defined, for a generic real input $x$, as:

$$\sigma : \mathbb{R} \to \mathbb{R}, \qquad \sigma(x) = \frac{1}{1 + e^{-x}}. \tag{2.2}$$

There are many other functions that can serve this purpose besides the sigmoid. In fact, another commonly used activation function is the standard hyperbolic tangent (tanh, Figure 2.1b):

$$\tanh : \mathbb{R} \to \mathbb{R}, \qquad \tanh(x) = \frac{e^{2x} - 1}{e^{2x} + 1}, \tag{2.3}$$

or again, an activation function especially popular in Deep Learning is the rectifier (Figure 2.1c), often called rectified linear unit (ReLU):

$$\mathrm{ReLU} : \mathbb{R} \to \mathbb{R}, \qquad \mathrm{ReLU}(x) = \max(0, x). \tag{2.4}$$

Figure 2.1: Some of the commonly used activation functions: (a) sigmoid, (b) tanh, (c) ReLU.

When applied to vectors, the activation functions are intended as applied to each of the entries individually. Let $v \in \mathbb{R}^n$ be a generic vector. Then, if $f$ is one of the activation functions defined above, we generalize it as follows:

$$f(v) = f\left(\begin{bmatrix} v_1 \\ \vdots \\ v_n \end{bmatrix}\right) = \begin{bmatrix} f(v_1) \\ \vdots \\ f(v_n) \end{bmatrix} \tag{2.5}$$

The only exception to this is the softmax activation function, which is directly defined over vectors, since each entry in the output depends on the whole input vector:

$$\mathrm{softmax} : \mathbb{R}^n \to \mathbb{R}^n, \qquad \mathrm{softmax}(v)_i = \frac{e^{v_i}}{\sum_{j=1}^{n} e^{v_j}} \quad \text{for } i = 1, \ldots, n, \tag{2.6}$$

where $\mathrm{softmax}(v)_i$ is the $i$-th entry of the $n$-dimensional output of the softmax. The main property of the softmax is that it transforms each entry into a probability, i.e.

$$\sum_{i=1}^{n} \mathrm{softmax}(v)_i = 1, \qquad \mathrm{softmax}(v)_i > 0. \tag{2.7}$$

Hadamard product

The circled dot binary operator ($\odot$) is used to denote the Hadamard product between two matrices or, for our purposes, vectors. It is defined as the entrywise product between the operands, or more formally:

$$\odot : \mathbb{R}^{m \times n} \times \mathbb{R}^{m \times n} \to \mathbb{R}^{m \times n}, \qquad (A \odot B)_{i,j} = A_{i,j} B_{i,j}. \tag{2.8}$$
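For reference, here is a minimal NumPy sketch (ours, not part of the thesis; the example vectors are arbitrary) of the entrywise activation functions of Equations 2.2–2.5, the softmax of Equations 2.6–2.7 and the Hadamard product of Equation 2.8:

```python
import numpy as np

def sigmoid(x):
    # Equation 2.2, applied entrywise as in Equation 2.5
    return 1.0 / (1.0 + np.exp(-x))

def softmax(v):
    # Equation 2.6; subtracting the max improves numerical stability
    e = np.exp(v - np.max(v))
    return e / e.sum()

v = np.array([0.5, -1.0, 2.0])
w = np.array([1.0, 3.0, -0.5])

print(sigmoid(v))        # entrywise sigmoid
print(np.tanh(v))        # entrywise tanh (Equation 2.3)
print(np.maximum(0, v))  # ReLU (Equation 2.4)
print(softmax(v).sum())  # softmax entries sum to 1 (Equation 2.7)
print(v * w)             # Hadamard product (Equation 2.8)
```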


Matrix calculus

In some of the coming chapters we will deal with derivatives of vector functions with respect to vectors. The result of such derivations is a Jacobian matrix, for which we will use the numerator-layout notation. For example, the derivative of a generic vector function $y$ with respect to a generic vector $x$ is defined as:

$$\frac{\partial y}{\partial x} = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} & \cdots & \frac{\partial y_1}{\partial x_n} \\ \frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} & \cdots & \frac{\partial y_2}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial y_m}{\partial x_1} & \frac{\partial y_m}{\partial x_2} & \cdots & \frac{\partial y_m}{\partial x_n} \end{bmatrix}. \tag{2.9}$$

Sometimes we will need to compute partial derivatives of a function while taking some subfunction as a constant. For example, assume we have a generic function $y(x)$ that depends on another generic function $f(x)$. Then with

$$\left( \frac{\partial y(x)}{\partial x} \right)_{f(x)} \tag{2.10}$$

we denote the partial derivative of $y$ with respect to $x$ while considering $f(x)$ as a constant.

Input bias

The output of the most simple type of neuron consists of a nonlinearity (the activation function) applied to a linear combination of the entries of its input vector, plus a constant term (“bias”). If we denote with $\tilde{w} \in \mathbb{R}^n$ the vector containing the coefficients of the linear combination (“weights”), with $\tilde{u} \in \mathbb{R}^n$ the vector of the input, and with $b \in \mathbb{R}$ the constant term, then the equation describing the output $y \in \mathbb{R}$ of the neuron is:

$$y = f(\tilde{w}^T \tilde{u} + b). \tag{2.11}$$

For simplicity of notation, we will implicitly make the assumption that the bias is always included in the weight vector. So instead of writing the equations as in (2.11), we will consider the extended vectors $w$ and $u$ defined as

$$w = \begin{bmatrix} \tilde{w} \\ b \end{bmatrix} \qquad u = \begin{bmatrix} \tilde{u} \\ 1 \end{bmatrix} \tag{2.12}$$

and just write

$$y = f(w^T u). \tag{2.13}$$

In the literature it is common practice to refer to the output of a whole layer of neurons with a single equation instead of describing them individually. For a layer of $m$ neurons, since they all have the same input $u$, this is done by extending $\tilde{w}$ and $b$ in Equation 2.11 to the matrix $\tilde{W} \in \mathbb{R}^{m \times n}$ and the vector $b \in \mathbb{R}^m$, respectively, where each row refers to a different neuron:

$$y = f(\tilde{W} \tilde{u} + b). \tag{2.14}$$

Just like in Equation 2.12, we will implicitly consider the extended matrix $W$ defined as

$$W = \begin{bmatrix} \tilde{W} & b \end{bmatrix} \tag{2.15}$$

and just write

$$y = f(W u). \tag{2.16}$$
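As a small illustration (ours), the following NumPy sketch verifies that absorbing the bias as in Equations 2.15–2.16 produces the same output as Equation 2.14; the layer sizes and random values are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 4                      # layer of m neurons, n-dimensional input
W_tilde = rng.normal(size=(m, n))
b = rng.normal(size=m)
u_tilde = rng.normal(size=n)

# Equation 2.14: explicit bias
y1 = np.tanh(W_tilde @ u_tilde + b)

# Equations 2.15-2.16: bias absorbed into an extended matrix and input
W = np.hstack([W_tilde, b[:, None]])   # W = [W_tilde  b]
u = np.concatenate([u_tilde, [1.0]])   # u = [u_tilde; 1]
y2 = np.tanh(W @ u)

assert np.allclose(y1, y2)
```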

2.2 Word Embeddings

Artificial Neural Networks are designed to work with input tokens that are expressed as real-valued vectors. When dealing with sentences expressed in natural language it is thus necessary to create a mapping from a sequence of characters to a sequence of numerical vectors.

The simplest approach is to take the input alphabet $A$ and define a “one-hot” encoding to vectors in $\mathbb{R}^{|A|}$. For example, for $A = \{a, \ldots, z\}$:

$$\text{a}: \begin{bmatrix} 1 & 0 & \cdots & 0 \end{bmatrix}^T \quad \text{b}: \begin{bmatrix} 0 & 1 & \cdots & 0 \end{bmatrix}^T \quad \cdots \quad \text{z}: \begin{bmatrix} 0 & 0 & \cdots & 1 \end{bmatrix}^T. \tag{2.17}$$

This approach works, but puts on the neural network the burden of learning the most basic structure of the sentences and the words from scratch, which can require a lot of training effort and large amounts of data.

A more common technique is to use word embeddings [44, 47, 8], which are real-valued vectors associated to whole words instead of to the characters. An input sentence can be tokenized by splitting it at the occurrence of spaces or punctuation, and each word is then associated to its corresponding word embedding.


Since the alphabet of all possible words is too large to make the use of one-hot vectors feasible, word embeddings are themselves generated by learning algorithms and have a more compact representation (typical word embeddings have 300 dimensions). Moreover, word embeddings have an additional, fundamental advantage: the semantics of the words is encoded in their representation. In fact, the distance between two embedding vectors associated to words with similar semantics will be smaller than the distance between two unrelated embeddings. For example, let $e_1, e_2, e_3 \in \mathbb{R}^{N_E}$ be three word embeddings, with $N_E$ being the number of dimensions of the embedding space. Then, if $e_1$ is more semantically related to $e_2$ than it is to $e_3$, we will have that

$$\| e_1 - e_2 \| < \| e_1 - e_3 \|. \tag{2.18}$$

The use of word embeddings allows the network to focus on the task at hand, instead of requiring it to learn the basic sentence structures from scratch.

Moreover, pretrained word embeddings generated from large text corpora are often available for download, making their use straightforward. In this thesis, we will make use of pretrained FastText word embeddings¹.
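As an illustration (ours, not the author's code), a sketch of how embeddings stored in the plain-text .vec format distributed by FastText could be loaded and used to check the property of Equation 2.18; the file path and the example words are hypothetical:

```python
import numpy as np

def load_vec(path, limit=50000):
    """Load embeddings from a FastText-style .vec text file
    (first line: vocabulary size and dimension; then one word per line)."""
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        next(f)  # skip the header line
        for i, line in enumerate(f):
            if i >= limit:
                break
            word, *values = line.rstrip().split(" ")
            embeddings[word] = np.array(values, dtype=np.float32)
    return embeddings

emb = load_vec("cc.en.300.vec")  # hypothetical path to a pretrained .vec file

# Property of Equation 2.18: related words typically lie closer in the space.
e1, e2, e3 = emb["cat"], emb["dog"], emb["bridge"]
print(np.linalg.norm(e1 - e2), "<", np.linalg.norm(e1 - e3))
```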

2.3 Recurrent Neural Networks

Recurrent Neural Networks (RNNs) [39] are a class of neural network models with an internal state, or memory, which allows them to effectively process sequential or temporal inputs without a predetermined maximum length.

As opposed to fully connected feed-forward networks, one of the biggest strengths of RNNs is the fact that they are able to retain memory of past computations, and are thus naturally able to successfully model complex tasks such as, among many, image captioning [58], image generation [25], question answering [59], machine translation [11, 55], sentence parsing [52] and text classification [41]. In fact, it has been proven that RNNs are universal approximators [51], i.e. they can in principle simulate all Turing machines, and thus solve any computable task.²

¹ https://fasttext.cc

² This also gives rise to interesting implications, such as the fact that it is impossible to determine for a generic RNN whether a particular unit will eventually assume a particular value, since this would allow us to solve the halting problem.


Figure 2.2: A Recurrent Neural Network. On the left, we can see how the state is updated for a single time step. On the right, the same RNN is unfolded over three time steps.

RNNs are characterized at an abstract level by two functions: the state transition function $\tau$ and the output function $g$:

$$\tau : \mathbb{R}^{N_U} \times \mathbb{R}^{N_R} \to \mathbb{R}^{N_R}, \qquad g : \mathbb{R}^{N_R} \to \mathbb{R}^{N_Y}. \tag{2.19}$$

The state transition function takes the state at the previous time step and the input at the current time step to produce a new internal state for the network. This new state is then used as input for the output function $g$³. Mathematically, the behavior of a typical RNN at a generic time step $n$ can be described by means of the following pair of equations:

$$\begin{aligned}
x(n) &= \tau(u(n), x(n-1)) \\
y(n) &= g(x(n)).
\end{aligned} \tag{2.20}$$

In other words, the information flows from the input to the recurrent part of the model where it is combined with the state at the previous time step, and then exits the network through the function g (Figure 2.2, left). When the RNN is applied to the whole input sequence, it is convenient to think about the evolution of the model by unfolding it over time (as in Figure 2.2, right). This allows to clearly see the flow of information between the time steps.

From Equations 2.20 and Figure 2.2 we can notice how each time step n corresponds to both an input u(n) and an output y(n). While this is always true in general, in practice we may decide to ignore some of the outputs, or to stop feeding some of the inputs by providing a zero vector instead. This allows

³ We could also make the output function depend on the current input in addition to the state.


Figure 2.3: The different ways in which a Recurrent Neural Network can be used (unrolled over time): (a) element to sequence, (b) sequence to element, (c) sequence to sequence, (d) sequence to sequence. Incoming arrows represent the inputs, while outgoing arrows represent the outputs. Missing input arrows can be considered in practice as feeding zero vectors, and missing output arrows are just those outputs that we decide to ignore.

the network to be used in diverse configurations, each of which is useful for representing different kind of problems. In Figure 2.3 we illustrate four different modes of application, namely “element to sequence”, “sequence to element” and two variants of “sequence to sequence”, one asynchronous and the other synchronous.

To give some examples of practical applications of the different settings illustrated in Figure 2.3, we can start by considering the element-to-sequence case: this is the ideal structure for tasks like image captioning, where a fixed size image vector is fed into the network, and the resulting output is a sequence of words or characters. An example of an intrinsically sequence-to-element task is instead sentiment analysis: an input sentence is fed into the network, and the output is a single value that represents a score for the degree of positive or negative emotions in the whole sentence. A typical case for an asynchronous sequence-to-sequence task is then machine translation, where we translate sentences from one language to another and thus the length of the input and output sequences can be different, and finally, for a synchronous sequence-to-sequence task, we can think of applying a sentiment score to each of the words in an input sentence.


Figure 2.4: A simple RNN with two input units, three recurrent units and two output units, i.e. $N_U = 2$, $N_R = 3$ and $N_Y = 2$. Notice how each recurrent unit is interconnected with the others. In this way, at time step $n+1$ each recurrent unit has in input all values from the preceding state $x(n)$.

In our work, we will focus on the sequence-to-element type of tasks: we will feed to the network input sequences of varying length to produce a final, fixed size output vector of probabilities.

Implementation of simple RNNs

In practice, functions $\tau$ and $g$ from Equations 2.19 are implemented as neural network layers. As such, they typically include a nonlinear activation function and a linear combination of the inputs (as mentioned in Section 2.1). In the simplest type of RNN, we implement Equations 2.20 as:

$$\begin{aligned}
x(n) &= f_1(W_{in}\, u(n) + W x(n-1)) \\
y(n) &= f_2(W_{out}\, x(n)).
\end{aligned} \tag{2.21}$$

Here, $f_1 : \mathbb{R}^{N_R} \to \mathbb{R}^{N_R}$ and $f_2 : \mathbb{R}^{N_Y} \to \mathbb{R}^{N_Y}$ are activation functions, while $W_{in} \in \mathbb{R}^{N_R \times N_U}$, $W \in \mathbb{R}^{N_R \times N_R}$ and $W_{out} \in \mathbb{R}^{N_Y \times N_R}$ are the matrices containing the parameters of the neural network. In particular, $W_{in}$ contains the parameters for the transformation from the input layer to the recurrent layer, $W$ is the matrix for the recurrent weights, and $W_{out}$ contains the parameters for the transformation from the recurrent layer to the output layer.

We can represent the neural network as layers of individual units like in Figure 2.4, from which the role of $W_{in}$, $W$ and $W_{out}$ becomes evident. Since for any realistic network it is not feasible to choose good values for these parameters by hand, we resort to algorithms and techniques to find approximations of these values in an automated way.
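For concreteness, here is a minimal NumPy sketch (ours) of the forward pass of Equations 2.21 in a sequence-to-element setting, with arbitrary sizes, random untrained weights and a softmax readout:

```python
import numpy as np

rng = np.random.default_rng(0)
N_U, N_R, N_Y = 4, 16, 3            # input, state and output sizes
W_in = rng.normal(size=(N_R, N_U))
W = rng.normal(size=(N_R, N_R))
W_out = rng.normal(size=(N_Y, N_R))

def rnn_forward(seq):
    """Run Equations 2.21 over a whole sequence (sequence-to-element)."""
    x = np.zeros(N_R)                  # x(0) is the zero vector
    for u in seq:
        x = np.tanh(W_in @ u + W @ x)  # state transition, f1 = tanh
    logits = W_out @ x
    return np.exp(logits) / np.exp(logits).sum()  # f2 = softmax readout

sequence = rng.normal(size=(10, N_U))  # a length-10 random input sequence
print(rnn_forward(sequence))
```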

Gradient descent is the most common approach to find good values for the parameters given a training set of examples in the form of pairs of inputs and expected outputs, to which we will refer as samples. It is a first-order iterative optimization algorithm that is used to find the minimum of a differentiable function, in our case an error function, or loss, that describes how far the actual output of the network is from the expected output. There are many possible error functions, but one of the simplest ones is the Mean Squared Error (MSE):

$$\mathrm{MSE} = \frac{1}{N_{train}} \sum_{i=1}^{N_{train}} (o_i - \bar{y}_i)^2, \tag{2.22}$$

where $N_{train}$ is the number of samples that we use for training, $o_i$ is the output of the network for sample $i$ and $\bar{y}_i$ is the expected output (the true label) for sample $i$.

To find the minimum of the error function, each parameter $w$ is altered proportionally to the partial derivative of the error $E$ with respect to that parameter, and the process is repeated until the error is sufficiently low:

$$w = w - \alpha \frac{\partial E}{\partial w}. \tag{2.23}$$

In order to effectively compute the gradient of the error function, algorithms such as Backpropagation Through Time (BPTT) [45] or Real-Time Recurrent Learning (RTRL) [61] are used. In practice, however, long term dependencies within the input sequence are notoriously difficult to learn because, for the kind of algorithms that rely on the computation of the complete gradient, this may either “blow up” or “vanish” towards zero [27, 6]. In the first case, the value of the parameters will become unstable. In the second case, the parameters will practically get stuck, since the variations in their value will be extremely small.

In the following sections, we will describe two approaches that get around this problem: Echo State Networks and Gated Recurrent Units. The first approach exploits the intrinsic architectural bias of RNNs, which can produce an ordered state space even without training the recurrent neurons [26]. The second approach uses gates, a general mechanism that allows the model to selectively remember and forget information, providing a direct solution to the issue of long term dependencies.


2.3.1 Echo State Networks

Echo State Networks (ESNs) [32, 33], part of the Reservoir Computing (RC) paradigm [43, 57] and introduced in 2001 by Herbert Jaeger, are a subclass of RNNs in which the recurrent part of the model (the reservoir) is carefully initialized and then left untrained. The input sequence flowing within the reservoir causes nonlinear activations of the recurrent units, and the global activation state of the reservoir at a given time step is then used as input for the output layer (the readout, in RC terminology), which can then be trained to produce the desired output sequence. The readout (function $g$) is typically implemented as a linear layer and trained with efficient direct methods like Moore-Penrose pseudoinversion or Ridge Regression [29], while the reservoir state transition function $\tau$, as we have hinted, is never trained.

In its simplest form, an ESN is described by the following equations:

$$\begin{aligned}
x(n) &= f(W_{in}\, u(n) + W x(n-1)) \\
y(n) &= W_{out}\, x(n),
\end{aligned} \tag{2.24}$$

where the function $f$ is a nonlinear activation function, $W_{in} \in \mathbb{R}^{N_R \times N_U}$ is the randomly initialized input-to-reservoir matrix, $W \in \mathbb{R}^{N_R \times N_R}$ is the randomly initialized reservoir-to-reservoir matrix, and $W_{out} \in \mathbb{R}^{N_Y \times N_R}$ is the readout matrix. We can see that the state at time step $n$ only depends on the input and on the state at the previous time step, for which a zero vector is used when $n = 0$.

Intuitively, when the state transition function exhibits particular Markovian properties (that we will introduce in the coming paragraphs), the input signal going through the random reservoir gets scrambled in such a way that similar signals end up close together in the final state space, while different signals end up far apart from each other as shown in Figure 2.5 [26]. Given the usually high dimensionality of the reservoir, it is likely that the state representation of the signals will in the end become linearly approximable. Here, similarity has to be intended in a Markovian sense: more weight is given to the most recent part of the signal, while the past progressively loses influence [17].

To guarantee the conditions that allow the state dynamics to behave as we have just described, we require that, for long input sequences, the state in which the network ends up should only depend on the input itself. In other words, we do not want the initial conditions to influence the long-term dynamics of the network. A reservoir exhibiting this particular behaviour is said to satisfy the Echo State Property.

Figure 2.5: Markovian organization of the reservoir state space in an ESN: input sequences with similar suffixes drive the state towards points which are close together in the $N_R$-dimensional space. Note how the influence on the current state of the inputs at previous time steps gradually vanishes with time (represented here by fading out the oldest elements in the sequences).

Echo State Property

The Echo State Property (ESP) is a characteristic of the state transition function that guarantees stability of the network dynamics. Formally, the ESP can be expressed as:

$$\forall s_N(u) = [u(1), \ldots, u(N)] \in (\mathbb{R}^{N_U})^N, \ \forall x, x' \in \mathbb{R}^{N_R}: \quad \left\| \hat{\tau}(s_N(u), x) - \hat{\tau}(s_N(u), x') \right\| \to 0 \ \text{as} \ N \to \infty. \tag{2.25}$$

Here, $\hat{\tau}$ is the function representing the repeated application of the state transition function to the whole input sequence:

$$\hat{\tau}(s(u), x) = \begin{cases} x & \text{if } s(u) = [\,] \\ \tau(u(n), \hat{\tau}([u(1), \ldots, u(n-1)], x)) & \text{if } s(u) = [u(1), \ldots, u(n)]. \end{cases} \tag{2.26}$$

The ESP is important because, if satisfied, it guarantees that the initial conditions progressively lose their influence when the network is driven by a sufficiently long input sequence.
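The following NumPy sketch (ours) illustrates the ESP empirically, assuming a tanh reservoir rescaled to a small spectral radius: two different initial states driven by the same input sequence end up in nearly identical states, as required by Equation 2.25:

```python
import numpy as np

rng = np.random.default_rng(0)
N_U, N_R = 3, 100
W_in = rng.uniform(-0.1, 0.1, size=(N_R, N_U))
W = rng.uniform(-1, 1, size=(N_R, N_R))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))   # rescale spectral radius to 0.9

def run(x, inputs):
    for u in inputs:
        x = np.tanh(W_in @ u + W @ x)              # state transition of Eq. 2.24
    return x

inputs = rng.uniform(-1, 1, size=(200, N_U))       # one shared input sequence
x_a = run(rng.normal(size=N_R), inputs)            # two different initial states...
x_b = run(rng.normal(size=N_R), inputs)
print(np.linalg.norm(x_a - x_b))                   # ...end up very close together
```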


In practice, we have both a sufficient and a necessary condition for the ESP, which have been proven in the context of an ESN with tanh as the activation function [32].

The sufficient condition states that if the largest singular value of the reservoir matrix $W$ is less than one, then the network satisfies the ESP:

$$\sigma(W) < 1. \tag{2.27}$$

Equivalently, since the largest singular value of $W$ is equal to its Euclidean norm, the sufficient condition can also be expressed as:

$$\|W\|_2 < 1. \tag{2.28}$$

The other condition states that it is necessary for the spectral radius of $W$ (i.e. its largest eigenvalue in absolute value) to be less than one in order to satisfy the ESP (under the assumption of only admitting the null sequence as input for the network):

$$\rho(W) < 1. \tag{2.29}$$

If this condition is not met, the dynamical system can be unstable.

For clarity, we summarize here in brief the necessary and sufficient conditions for the ESP:

$$\|W\|_2 < 1 \implies \text{ESP} \implies \rho(W) < 1 \tag{2.30}$$

Initialization

The input-to-reservoir matrix $W_{in}$ can be initialized by generating a random matrix from a uniform distribution between $-scale_{in}$ and $scale_{in}$, where $scale_{in} \in \mathbb{R}^+$ is a hyperparameter.

In order to obtain a reservoir matrix $W$ such that the ESP is satisfied, it is sufficient to randomly generate a matrix and then rescale its norm such that the sufficient condition (Equation 2.28) is satisfied. Since this condition is rather strong, an alternative commonly used in RC practice is to rescale the matrix so that only the necessary condition (Equation 2.29) is met. This does not ensure stable dynamics, however it is empirically observed that a rescaling based on the spectral radius is enough to provide good reservoirs.

Moreover, the reservoir matrix $W$ is typically sparse: it is not uncommon to have less than 20% connectivity.
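A minimal sketch (ours) of the initialization just described, assuming a sparse reservoir rescaled by its spectral radius as is common RC practice; all hyperparameter values are arbitrary examples:

```python
import numpy as np

def init_reservoir(N_U, N_R, scale_in=0.1, rho=0.9, connectivity=0.1, seed=0):
    """Initialize W_in uniformly in [-scale_in, scale_in] and a sparse W
    rescaled to spectral radius `rho` (necessary condition, Equation 2.29)."""
    rng = np.random.default_rng(seed)
    W_in = rng.uniform(-scale_in, scale_in, size=(N_R, N_U))
    W = rng.uniform(-1, 1, size=(N_R, N_R))
    W[rng.random((N_R, N_R)) > connectivity] = 0.0   # keep ~10% of the connections
    W *= rho / np.max(np.abs(np.linalg.eigvals(W)))  # rescale the spectral radius
    return W_in, W

W_in, W = init_reservoir(N_U=3, N_R=200)
```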


Training

The network can be trained in an offline fashion by first feeding the input sequence to the reservoir and collecting the resulting states at each time step. An initial number of states is usually ignored in the so-called washout phase, since they heavily depend on the initial conditions. The number of states to discard can vary, and will generally depend on the specific task or on the scarcity of training data. We will denote the number of discarded states with $N_{transient}$.

Let $N_{train}$ be the number of examples in the training set, let $X$ be the matrix whose columns are the reservoir states $x(N_{transient}+1), \ldots, x(N_{train})$, and let $\bar{Y}$ be the matrix whose columns are the target vectors $\bar{y}(N_{transient}+1), \ldots, \bar{y}(N_{train})$:

$$X = \begin{bmatrix} x(N_{transient}+1) & \cdots & x(N_{train}) \end{bmatrix} \in \mathbb{R}^{N_R \times (N_{train} - N_{transient})}$$
$$\bar{Y} = \begin{bmatrix} \bar{y}(N_{transient}+1) & \cdots & \bar{y}(N_{train}) \end{bmatrix} \in \mathbb{R}^{N_Y \times (N_{train} - N_{transient})}. \tag{2.31}$$

We can then train the readout by solving the following minimization problem:

$$\min_{W_{out}} \left\| W_{out} X - \bar{Y} \right\|_2^2, \tag{2.32}$$

meaning that we want to find values of $W_{out}$ that minimize the mean squared error between the actual output of the network and the expected output $\bar{Y}$.

We can find an optimal value of $W_{out}$ by closed form solutions such as Moore-Penrose pseudo-inversion:

$$W_{out} = \bar{Y} X^+. \tag{2.33}$$

Here, $X^+ \in \mathbb{R}^{(N_{train} - N_{transient}) \times N_R}$ denotes the pseudoinverse of matrix $X$, which is a generalization of the matrix inverse.

In general, the solution to the linear system in Equation 2.32 is not unique, since we are trying to learn a continuous function from a finite number of samples. In order to avoid overfitting, we want to give preference to the simplest solutions, which are associated with small norms of the weight matrix. For this reason, we may want to solve the following minimization problem instead (Ridge Regression, also known as Tikhonov regularization):

$$\min_{W_{out}} \left\| W_{out} X - \bar{Y} \right\|_2^2 + \lambda \left\| W_{out} \right\|_2^2, \tag{2.34}$$

where $\lambda \in \mathbb{R}^+$ is the regularization parameter. A higher $\lambda$ corresponds to a more strongly regularized solution.

The regularized solution can then be found by a direct method like in the previous case:

$$W_{out} = \bar{Y} X^T (X X^T + \lambda I)^{-1}, \tag{2.35}$$

where $I \in \{0,1\}^{N_R \times N_R}$ is the identity matrix.
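A sketch (ours) of the closed-form Ridge Regression readout of Equation 2.35, assuming the state matrix X and the target matrix Ȳ have already been collected column-wise after the washout (here they are random placeholders):

```python
import numpy as np

def train_readout(X, Y_bar, lam=1e-3):
    """Ridge Regression readout (Equation 2.35).
    X: (N_R, T) matrix of reservoir states, Y_bar: (N_Y, T) matrix of targets."""
    N_R = X.shape[0]
    return Y_bar @ X.T @ np.linalg.inv(X @ X.T + lam * np.eye(N_R))

# Hypothetical usage with random placeholders for X and Y_bar:
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1000))
Y_bar = rng.normal(size=(5, 1000))
W_out = train_readout(X, Y_bar)
print(W_out.shape)   # (N_Y, N_R) = (5, 200)
```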

Even if the typical practice is to use these efficient closed form methods, nothing prevents the use of more complex readouts and iterative algorithms for the solution, such as a multilayer perceptron trained with backpropagation.

We can see how ESNs represent an efficient alternative to classical recurrent networks, both in relation to the computational cost for training the recurrent part of the model (the reservoir of an ESN is not trained), and as regards the issues related to the vanishing or exploding gradient.

Computational complexity

Let $R$ be the maximum number of non-zero entries in the rows of the recurrent matrix $W$ (this represents the maximum number of connections for a reservoir unit, which is smaller for sparser matrices). A value $x(n)$ for a single state transition can then be computed in $O(R N_R)$ time. For a sequence of length $N$, we get a total cost of

$$O(N R N_R). \tag{2.36}$$

The readout can be trained with many different techniques, from direct methods to iterative optimization algorithms. Typically, simple linear methods such as those illustrated in the previous section are sufficient thanks to the high dimensionality of the reservoir, and this is an important advantage since these kinds of methods can have very efficient implementations when compared to the first-order iterative algorithms that are used with typical RNNs. In fact, RNNs trained by backpropagation need to add to the cost of the state computation the cost of computing the partial derivatives for all the parameters with respect to an error measure, and then iterate the whole procedure for a number of epochs.

2.3.2 Gated Recurrent Units

Gated Recurrent Units (GRUs) were introduced in 2014 by Cho et al. as a model for Statistical Machine Translation, where they were used as part of an encoder-decoder architecture [11], and they currently represent the state of the art for Natural Language Processing tasks.


Figure 2.6: Electrical representation of the behaviour of the reset and update gates in a Gated Recurrent Unit. With $r(n) \in \mathbb{R}^{N_R}$ and $z(n) \in \mathbb{R}^{N_R}$ we denote the values of the reset gate and update gate at step $n$. With $h(n) \in \mathbb{R}^{N_R}$ we denote the intermediate result of the computation of the new state at step $n$, which merges information from the input $u(n)$ with a partial representation of the previous state $x(n-1)$.

The GRU is an extension of an RNN that inherits the same output function $g$, which is implemented as a neural network layer. In addition, the GRU includes gating mechanisms within the state transition function $\tau$, which are responsible for regulating the flow of information from the previous time step to the current one, and from the current time step to the next. This makes it easier for the network to deal with long term dependencies, since relevant information from a time step can be stored indefinitely within the internal state of the GRU until it is no longer necessary and can be discarded.

In particular, the architecture of a GRU adopts two gates, called the reset gate and the update gate. The behaviour of the two gates is illustrated in Figure 2.6 by means of an equivalent electrical circuit. The reset gate controls how much of the previous state information is dropped. The update gate, instead, controls how much of the previous state information is included in the current state (basically implementing an adaptive leaky-integration unit [5]). Both gates working together allow the network to dynamically choose which information to remember and which to forget, thus improving the performance in the presence of long-term dependencies.

GRUs alleviate the problem of the vanishing of the gradient, since the gates effectively introduce “shortcuts” for traveling long chains of partial derivatives during gradient computation. GRUs are however still affected by the problem of the explosion of the gradient, even if workarounds have been studied [35].

With a radical change of notation with respect to the original literature, motivated by uniformity with the other models presented in this thesis, in the following we describe the equations for the state computation at a generic time step $n$:

$$\begin{aligned}
r(n) &= \sigma(W^r_{in}\, u(n) + W^r x(n-1)) \\
z(n) &= \sigma(W^z_{in}\, u(n) + W^z x(n-1)) \\
h(n) &= \tanh(W_{in}\, u(n) + W (r(n) \odot x(n-1))) \\
x(n) &= z(n) \odot x(n-1) + (1 - z(n)) \odot h(n)
\end{aligned} \tag{2.37}$$

Here, $r(n) \in \mathbb{R}^{N_R}$ and $z(n) \in \mathbb{R}^{N_R}$ respectively represent the values of the reset gate and update gate at step $n$. Each individual entry in both gates can open and close by assuming continuous values between 1 and 0. The vector $h(n) \in \mathbb{R}^{N_R}$ is sometimes referred to as memory, and its entries vary between −1 and 1.

The number of units of a GRU is reflected in the shape of its matrices. In fact, we have $W^r_{in}, W^z_{in}, W_{in} \in \mathbb{R}^{N_R \times N_U}$ and $W^r, W^z, W \in \mathbb{R}^{N_R \times N_R}$.

GRUs are a flexible model that can be used in many different ways, just as with plain recurrent networks. The state computed at each time step is typically fed into subsequent arbitrary layers of varying complexity, which can even be GRUs themselves.
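To make Equations 2.37 concrete, the following NumPy sketch (ours, with arbitrary sizes and random weights) runs the GRU state transition over a short random sequence:

```python
import numpy as np

rng = np.random.default_rng(0)
N_U, N_R = 4, 8
W_r_in, W_z_in, W_in = [rng.normal(size=(N_R, N_U)) for _ in range(3)]
W_r, W_z, W = [rng.normal(size=(N_R, N_R)) for _ in range(3)]

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(u, x_prev):
    """One state transition of Equations 2.37."""
    r = sigmoid(W_r_in @ u + W_r @ x_prev)      # reset gate r(n)
    z = sigmoid(W_z_in @ u + W_z @ x_prev)      # update gate z(n)
    h = np.tanh(W_in @ u + W @ (r * x_prev))    # memory h(n)
    return z * x_prev + (1 - z) * h             # new state x(n)

x = np.zeros(N_R)                               # x(0) is the zero vector
for u in rng.normal(size=(5, N_U)):             # a short random input sequence
    x = gru_step(u, x)
print(x)
```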

Initialization

The weight matrices are initialized from a random distribution, such as uniform between −1 and 1 or Gaussian with mean 0 and unit variance. A common approach that works well in practice is to use the following uniform distribution, which is scaled relative to the number of units:

$$U\!\left( -\frac{1}{\sqrt{N_R}}, \frac{1}{\sqrt{N_R}} \right). \tag{2.38}$$

Training

Training of the weights of a GRU can be performed by means of the same optimization techniques used for generic recurrent networks as already seen in Section 2.3, for example Stochastic Gradient Descent with Backpropagation Through Time.


Computational complexity

In Equations 2.37 we described the evolution of the GRU state by means of four separate computations. In order to find the cost of the full state computation, we proceed by first computing the costs of the individual components.

Let $R^h$ be the maximum number of non-zero entries in the rows of the recurrent matrix $W$ (this represents the maximum number of connections for a recurrent unit, which is smaller for sparser matrices). Likewise, with $R^r$ and $R^z$ we denote the same concept for, respectively, $W^r$ and $W^z$. We can then trivially see how we can compute a value for the reset and update gates in, respectively, $O(R^r N_R)$ and $O(R^z N_R)$ time.

Similarly, a value for $h(n)$ can be computed in $O(R^h N_R)$ time (the cost of the Hadamard multiplication is dominated by the other terms) if we already have a value for $r(n)$. Finally, given $z(n)$ we can compute $x(n)$ with just $O(N_R)$ operations.

Putting these costs together, we get

$$O(R^r N_R) + O(R^z N_R) + O(R^h N_R) + O(N_R) = O(\max(R^r, R^z, R^h)\, N_R) = O(R\, N_R) \tag{2.39}$$

for a single state transition, where $R = \max(R^r, R^z, R^h)$. For a sequence of length $N$, we then get a total cost of

$$O(N R N_R). \tag{2.40}$$

If we compare this asymptotic cost with the corresponding one for an ESN, we find that they are equivalent (the GRU is slower by a constant factor of about 3).

When training a GRU, however, we have to compute the gradients for all six matrices and update their values, for an additional cost that depends on the training algorithm. Here, we will consider the case of the BPTT algorithm. For $N_{iters}$ iterations, a dataset of $N_{train}$ examples and sequences of length $N$, we have a total training time complexity (including all state computations) of:

$$O(N_{iters}\, N_{train}\, N\, N_R^2). \tag{2.41}$$

In practice, this is usually much more costly than the direct methods that can be used for the RC models.


Figure 2.7: An example of a generic tree.

2.4 Recursive Neural Networks

RNNs, as we have described so far, are designed to work on sequential information, like temporal signals or a stream of words. This is quite limiting when dealing with structured or hierarchical information, which is something common in the field of Computer Science but also, among many, Biology, Chemistry, and Linguistics.

To better illustrate the problem, let us assume we want to build a system to do some kind of inference on a tree, which is a form of hierarchical data structure. We can, without loss of information, always represent a tree in the form of a sequence. In our case, the tree in Figure 2.7 could be represented as:

a(b(d,e),c(f(h),g(i))).

However, note how the information about the dependencies is now rather unintuitive:

• the path “a–c–g–i” is now scattered across the whole sequence instead of having its nodes close together;

• the children of node “a” (nodes “b” and “c”) are now far apart instead of being next to each other;

• distant relatives such as nodes “c” and “e” are now close together in the sequence.

Even if all information is preserved this is clearly sub-optimal, and a model which is able to deal with structured information would not need to lose these properties of locality.
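As an illustration (ours, not from the thesis), a minimal Python tree structure together with the parenthesized serialization of the tree of Figure 2.7 used above:

```python
class Node:
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)

def serialize(node):
    """Flatten a tree into the parenthesized string used in the text."""
    if not node.children:
        return node.label
    return node.label + "(" + ",".join(serialize(c) for c in node.children) + ")"

# The tree of Figure 2.7
tree = Node("a", [
    Node("b", [Node("d"), Node("e")]),
    Node("c", [Node("f", [Node("h")]), Node("g", [Node("i")])]),
])
print(serialize(tree))   # a(b(d,e),c(f(h),g(i)))
```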


Luckily, we can derive an elegant generalization of RNNs for structured data (trees and graphs), as first shown by Sperduti and Starita in 1997 [54, 18, 19]. We call this superclass of models Recursive Neural Networks (RecNNs).

At an abstract level, RecNNs are characterized by two functions: the state transition function and the output function. We overload the notation that we have used for RNNs in Section 2.3, and we take $\tau$ as the state transition function and $g$ as the output function (the correct definition will be clear from context). In the case of a RecNN for trees, if $k$ is the maximum degree of the trees under consideration, we have:

$$\tau : \mathbb{R}^{N_U} \times \underbrace{\mathbb{R}^{N_R} \times \cdots \times \mathbb{R}^{N_R}}_{k} \to \mathbb{R}^{N_R}, \qquad g : \mathbb{R}^{N_R} \to \mathbb{R}^{N_Y}. \tag{2.42}$$

To compute the state for node $n$, the state transition function takes the input of the current node and the states associated to the children of $n$ to produce a new internal state for the network. This process produces a number of different states (one for each input node) that can either be processed independently by function $g$ to produce a tree-structured output with the same shape as the input (tree-to-tree transduction), or alternatively function $g$ can be applied just once to an $N_R$-dimensional combination of all the states to produce a single output vector⁴ (tree-to-element transduction), similarly to what happens with the sequence-to-sequence and sequence-to-element settings in RNNs. Both types of transductions are illustrated in Figure 2.8.

Mathematically, let

$$\chi : (\mathbb{R}^{N_R})^\# \to \mathbb{R}^{N_R} \tag{2.43}$$

be a state mapping function that takes as input the states associated to all the nodes of a tree and outputs a combination of such states, and let $y(t) \in \mathbb{R}^{N_Y}$ be the vector representing the output for a tree $t$ in a tree-to-element transduction. Then the behavior of a tree-to-element RecNN at a generic node $n$ can be described by means of the following pair of equations:

$$\begin{aligned}
x(n) &= \tau(u(n), x(ch_1(n)), \ldots, x(ch_k(n))) \\
y(t) &= g(\chi(x(t))),
\end{aligned} \tag{2.44}$$

⁴ Since we will deal with classification tasks, in our work we will only focus on networks emitting a single output vector.


Figure 2.8: Possible input-output shapes for the recursive models under consideration: (a) tree to tree, (b) tree to element. In case (a), the output tree is isomorphic to the input. In case (b), the information in the input tree is squashed into a fixed size output vector.

where zero vectors can be used for the appropriate child states in case the number of children of node n is less than k.

In practice, functions $\tau$ and $g$ from Equations 2.42 are implemented as neural network layers. As such, they typically include a nonlinear activation function and a linear combination of the inputs (as mentioned in Section 2.1). In the simplest type of RecNN, we implement Equations 2.44 as:

$$\begin{aligned}
x(n) &= f_1\!\left( W_{in}\, u(n) + \sum_{i=1}^{d(n)} W_i\, x(ch_i(n)) \right) \\
y(t) &= f_2\!\left( W_{out}\, \chi(x(t)) \right).
\end{aligned} \tag{2.45}$$

Here, $f_1 : \mathbb{R}^{N_R} \to \mathbb{R}^{N_R}$ and $f_2 : \mathbb{R}^{N_Y} \to \mathbb{R}^{N_Y}$ are activation functions, while $W_{in} \in \mathbb{R}^{N_R \times N_U}$, $W_i \in \mathbb{R}^{N_R \times N_R}$ and $W_{out} \in \mathbb{R}^{N_Y \times N_R}$ are the matrices containing the parameters of the neural network. In particular, $W_{in}$ contains the parameters for the transformation from the input layer to the recursive layer, $W_i$ is the matrix for the recursive weights (there is one matrix for each different value of $i$), and $W_{out}$ contains the parameters for the transformation from the recursive layer to the output layer.

Figure 2.9: On the left, a sequence (which is processed left-to-right by a RNN). On the right, the equivalent tree (which is processed bottom-to-top by a RecNN).

Equation 2.45 determines a bottom-up recursive visit of the input tree: in order to compute the state for node $n$, we must first compute the states for the children of $n$, which are then combined into a fixed size, $N_R$-dimensional vector.
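A sketch (ours) of this bottom-up computation for Equation 2.45, with arbitrary sizes, random weights and a toy tree; for brevity a single recursive matrix is shared across child positions, i.e. we assume $W_i = W$ for every $i$:

```python
import numpy as np

rng = np.random.default_rng(0)
N_U, N_R = 4, 16
W_in = rng.normal(size=(N_R, N_U)) * 0.1
W = rng.normal(size=(N_R, N_R)) * 0.1
u = {label: rng.normal(size=N_U) for label in "abcde"}   # random input labels u(n)

def state(label, children):
    """Bottom-up visit of Equation 2.45: the state of a node is computed
    from its input label and from the states of its children."""
    child_sum = sum((W @ state(*c) for c in children), np.zeros(N_R))
    return np.tanh(W_in @ u[label] + child_sum)

# The subtree a(b(d,e),c) as nested (label, children) pairs
tree = ("a", [("b", [("d", []), ("e", [])]), ("c", [])])
x_root = state(*tree)          # N_R-dimensional state of the root
print(x_root.shape)
```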

We can easily see that RNNs for sequences are a special case of RecNNs by showing that for each sequence “a- · · · -b” we can build a tree whose nodes only have either zero or one child, the root is “b” and the only leaf is “a”. An example is shown in Figure 2.9 (the case of a sequence with a single time step is trivial). For such special trees, Equation 2.45 reduces to the one of RNNs.

RecNNs can in general be trained with techniques similar to those already seen for RNNs in Section 2.3 (we can use a simple extension of Backpropagation Through Time for structures), and can thus suffer from the same issues related to the vanishing or explosion of the gradient. Instead of dealing with a vanilla recursive network, we will approach the task by either following RC methods or using a gated model. For these reasons, in the following sections we show the generalization of ESNs (Section 2.3.1) and GRUs (Section 2.3.2) to their corresponding recursive models.

2.4.1 Tree Echo State Networks

Tree Echo State Networks (TreeESNs) [20] are a recursive generalization of ESNs for tree data structures.

Model description

In ESNs, the typical input is represented by a single sequence and both the reservoir state and the output are computed, in order, for each time step of the sequence. In RecNNs in general and TreeESNs in particular, instead, the input dataset is composed of a set of trees. Each of the trees is processed independently from the others, but within a tree the nodes take a role which is parallel to the one taken by the time steps in the sequential case.


Figure 2.10: Markovian organization of the state space in Tree Echo State Networks. The roots of input trees with similar suffixes are encoded in close regions of the reservoir state space.

State dynamics

The reservoir is driven by the labels $u(n)$ associated to the nodes of the input tree $t \in (\mathbb{R}^{N_U})^\#$, starting from the leaves and climbing towards the root. The state dynamics is described by the following equation:

$$x(n) = f\!\left( W_{in}\, u(n) + \sum_{i=1}^{k} W x(ch_i(n)) \right). \tag{2.46}$$

Here, similarly to the ESN case, $W_{in} \in \mathbb{R}^{N_R \times N_U}$ is the input-to-reservoir matrix and $W \in \mathbb{R}^{N_R \times N_R}$ is the reservoir-to-reservoir matrix (both matrices are initialized and then left untrained). We can see that the state of a particular node $n$ only depends on the input and on the states associated to its children, for which a zero vector is used in correspondence of any missing child. We can also notice that the states themselves form a tree structure isomorphic to the input $t$, which we will denote with $x(t) \in (\mathbb{R}^{N_R})^\#$. Interestingly, for $k = 1$ we get the special case of a sequence, for which the model behaves exactly like an Echo State Network with respect to the computation of the states.

When the state dynamics is stable, the roots of trees with similar suffixes are encoded close together in the state space (see Figure 2.10), similarly to what happens with sequences in ESNs.
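As a purely illustrative sketch of how the state dynamics of Equation 2.46 can be computed, the following function (reusing the hypothetical `Node` structure introduced earlier, with `W_in` and `W` as randomly initialized NumPy matrices, and assuming f = tanh) performs the bottom-up visit and stores a reservoir state for every node.

```python
import numpy as np

def tree_esn_states(root, W_in, W):
    """Computes the reservoir state x(n) of every node (Equation 2.46)
    through a post-order (bottom-up) visit of the tree."""
    N_R = W.shape[0]
    states = {}  # maps id(node) -> N_R-dimensional state vector

    def visit(node):
        # Missing children contribute a zero vector, so only the actual
        # children need to be accumulated.
        children_sum = np.zeros(N_R)
        for child in node.children:
            visit(child)
            children_sum += W @ states[id(child)]
        states[id(node)] = np.tanh(W_in @ node.label + children_sum)

    visit(root)
    return states
```

Since the reservoir matrix is shared across child positions, the contribution of each child can simply be accumulated into a running sum.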


State mapping functions

When dealing with tree-to-element tasks, since we are considering input trees with variable topologies and different numbers of nodes, after having computed the state x(t) it is necessary to build a fixed-size representation of it, so that it can be used to train the readout (which has a fixed number of parameters). For this purpose, a state mapping function $\chi : (\mathbb{R}^{N_R})^\# \to \mathbb{R}^{N_R}$ is used.

Different types of state mapping functions can be used. In this work, we will consider the root state mapping and the mean state mapping.

The root state mapping is defined as:

$$\chi(x(t)) = x(\operatorname{root}(t)). \qquad (2.47)$$

When the root state mapping is used, it is assumed that the root state contains all the information of interest, i.e. we introduce the bias that the root node is representative of the whole tree. Note that with this state mapping, if k = 1 (i.e. the input comes in the form of sequences) the model behaves exactly like an ESN.

The mean state mapping, instead, takes into consideration all the states of the tree by computing their mean:

$$\chi(x(t)) = \frac{1}{|N(t)|} \sum_{n \in N(t)} x(n). \qquad (2.48)$$

In the case of the mean state mapping, we instead introduce the bias that all nodes have the same influence on the final representation of the data.
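As a small usage sketch (again based on the hypothetical helpers above, with `states` being the dictionary returned by `tree_esn_states`), the two state mapping functions of Equations 2.47 and 2.48 could be written as follows.

```python
import numpy as np

def root_state_mapping(root, states):
    """Equation 2.47: the root state represents the whole tree."""
    return states[id(root)]

def mean_state_mapping(root, states):
    """Equation 2.48: the mean of all node states represents the tree."""
    return np.mean([states[id(node)] for node in bottom_up(root)], axis=0)
```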

Initialization

Initialization of a TreeESN follows for the most part the same practices used for standard ESNs. The conditions for the ESP are however slightly different, because in this case we have to consider the maximum degree k of the input trees. This yields the following sufficient condition for the ESP:

$$k\, \|W\|_2 < 1. \qquad (2.49)$$

When initializing W, it is thus sufficient to generate a random matrix $W_{rand}$ and then take

$$W = \frac{\varsigma\, W_{rand}}{k\, \|W_{rand}\|_2}. \qquad (2.50)$$


The hyperparameter $\varsigma = k\, \|W\|_2 < 1$ is the contraction coefficient of the model.

A different rescaling strategy can alternatively be used. We can rescale the recurrent matrix W by its spectral radius, so that

$$\rho(W) \le 1. \qquad (2.51)$$

In this case, the state transition function must be updated to take into account the degree of node n, denoted by d(n):

$$x(n) = f\left( W_{in}\, u(n) + \frac{1}{d(n)} \sum_{i=1}^{k} W\, x(ch_i(n)) \right). \qquad (2.52)$$
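The two rescaling strategies can be sketched as follows (illustrative code with hypothetical helper names; the first function enforces the sufficient condition of Equation 2.49 with contraction coefficient ς, while the second rescales by the spectral radius as in Equation 2.51, to be paired with the normalized transition of Equation 2.52).

```python
import numpy as np

def init_reservoir_contractive(N_R, k, varsigma=0.9, seed=0):
    """Rescales a random matrix so that k * ||W||_2 = varsigma < 1."""
    rng = np.random.default_rng(seed)
    W_rand = rng.uniform(-1.0, 1.0, size=(N_R, N_R))
    return varsigma * W_rand / (k * np.linalg.norm(W_rand, 2))

def init_reservoir_spectral(N_R, rho=0.9, seed=0):
    """Rescales a random matrix so that its spectral radius equals rho <= 1."""
    rng = np.random.default_rng(seed)
    W_rand = rng.uniform(-1.0, 1.0, size=(N_R, N_R))
    return rho * W_rand / np.max(np.abs(np.linalg.eigvals(W_rand)))
```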

Training

Since applying the state mapping function to the tree-structured state yields a fixed-size vector $\chi(x(t)) \in \mathbb{R}^{N_R}$, we can proceed to train the readout for tree-to-element tasks in exactly the same way we do in ESNs (e.g. Moore-Penrose pseudoinversion or Ridge Regression). The only difference is that, unlike in ESNs, there is no need to discard an initial transient, since each tree in the dataset is processed independently from the others.
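A minimal sketch of the readout training, assuming the mapped states of all training trees have been stacked into a matrix `X` (one row per tree) and that `Y` holds the corresponding targets (one row per tree); closed-form Ridge Regression is used here, as in the ESN case.

```python
import numpy as np

def train_readout(X, Y, reg=1e-6):
    """Closed-form Ridge Regression: W_out = Y^T X (X^T X + reg * I)^(-1)."""
    return Y.T @ X @ np.linalg.inv(X.T @ X + reg * np.eye(X.shape[1]))

def predict(W_out, x_mapped):
    """Applies the trained linear readout to the mapped state of a tree."""
    return W_out @ x_mapped
```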

Computational complexity

Let R be the maximum number of non-zero entries in the rows of the recurrent matrix W (this represents the maximum number of connections of a reservoir unit, which is smaller for sparser matrices). A value x(n) for a single state transition can then be computed in $O(k R N_R)$ time. For a tree t, we get a total cost of

$$O(|N(t)|\, k\, R\, N_R). \qquad (2.53)$$

The readout can be trained with the same techniques used for ESNs (see Section 2.3.1). This allows TreeESNs to inherit the characteristic efficiency of ESNs when compared to the costly iterative methods that are commonly used for training RecNNs.


2.4.2 Tree Gated Recurrent Units

Tree Gated Recurrent Units (TreeGRUs), a generalization of GRUs for tree-structured data, have been proposed by Kokkinos and Potamianos in 2017 for the task of sentiment analysis [38]. TreeGRUs were introduced along with two additional characteristics that can be independently combined, i.e. a bidirectional approach [50] for scanning the tree in both directions simultaneously and an attention mechanism [1] used to implement an adaptive state mapping.

Even though both mechanisms were shown to improve performance on the Stanford Sentiment Treebank dataset [53], in this work we only consider a simpler variant of TreeGRU, operating in a single direction and without attention, so that we can more easily evaluate the intrinsic differences between the models under consideration. Furthermore, all the models that we will introduce can in principle be easily adapted to include these additional mechanisms.

For these reasons, we now proceed to introduce only the simplest form of TreeGRUs.

Model description

TreeGRU is an extension of a RecNN that inherits the same output function g, which is implemented as a neural network layer. In addition, TreeGRU includes gating mechanisms within the state transition function τ, which are responsible for regulating the flow of information from the child nodes to the current one, and from the current node to its parent. This makes it easier for the network to deal with long-term dependencies, since during the bottom-up recursive visit of an input tree any relevant information (e.g. from the leaves) can be stored indefinitely within the internal state of the TreeGRU until it is no longer needed and can be discarded.

In particular, the architecture of a TreeGRU adopts two gates, called the reset gate and the update gate, just like the GRU that we have introduced in Section 2.3.2. The reset gate controls how much of the previous state information is dropped. The update gate, instead, controls how much of the previous state information is included in the current state. Both gates working together allow the network to dynamically choose which information to remember and which to forget, thus improving the performance in the presence of long-term dependencies.

For TreeGRU, the input of the model is a tree $t \in (\mathbb{R}^{N_U})^\#$. The state transition function is applied during a bottom-up recursive visit of the tree, so that a state is computed for each of the nodes:

$$
\begin{aligned}
r(n) &= \sigma\!\left( W^r_{in}\, u(n) + \sum_{i=1}^{k} W^r_i\, x(ch_i(n)) \right) \\
z(n) &= \sigma\!\left( W^z_{in}\, u(n) + \sum_{i=1}^{k} W^z_i\, x(ch_i(n)) \right) \\
h(n) &= \tanh\!\left( W_{in}\, u(n) + \sum_{i=1}^{k} W_i\, \big( r(n) \odot x(ch_i(n)) \big) \right) \\
x(n) &= z(n) \odot \sum_{i=1}^{k} x(ch_i(n)) + (1 - z(n)) \odot h(n).
\end{aligned} \qquad (2.54)
$$

As was the case with the sequential GRU described in Section 2.3.2, $r(n) \in \mathbb{R}^{N_R}$ and $z(n) \in \mathbb{R}^{N_R}$ respectively represent the values of the reset gate and update gate at step n. Each individual entry in both gates can open and close by assuming continuous values between 1 and 0. The vector $h(n) \in \mathbb{R}^{N_R}$ is sometimes referred to as memory, and its entries vary between −1 and 1. The number of units of a TreeGRU is reflected in the shape of its matrices. In fact, we have $W^r_{in}, W^z_{in}, W_{in} \in \mathbb{R}^{N_R \times N_U}$ and $W^r_i, W^z_i, W_i \in \mathbb{R}^{N_R \times N_R}$ for $1 \le i \le k$, with k being the maximum degree of the input trees under consideration. Matrices $W^r_{in}$, $W^z_{in}$ and $W_{in}$ are used to take the input vector and represent it in an $N_R$-dimensional space. Similarly, the recurrent matrices $W^r_i$, $W^z_i$ and $W_i$ are used to take the state associated with the i-th child of the current node and apply a transformation in the same $N_R$-dimensional space. As we can see from Equation 2.54, a different recurrent matrix is used for each child position, allowing the model to differentiate between trees with differing orders for the children of any node.
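To illustrate how the gates of Equation 2.54 interact during a single state transition, here is a minimal sketch (hypothetical function names; the per-position recurrent matrices are passed as Python lists, and zero vectors stand in for missing children); it is not meant to reproduce the implementation used later in the experiments.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def tree_gru_step(u, child_states, W_r_in, W_z_in, W_in, W_r, W_z, W):
    """One TreeGRU state transition (Equation 2.54).

    u            -- input label of the current node, shape (N_U,)
    child_states -- list of k children states, each of shape (N_R,)
    W_r, W_z, W  -- lists of k recurrent matrices, one per child position
    """
    def rec(Ws, xs):
        # Sum of the per-position recurrent transformations of the children.
        return sum(Wi @ xi for Wi, xi in zip(Ws, xs))

    r = sigmoid(W_r_in @ u + rec(W_r, child_states))               # reset gate
    z = sigmoid(W_z_in @ u + rec(W_z, child_states))               # update gate
    h = np.tanh(W_in @ u + rec(W, [r * x for x in child_states]))  # memory
    return z * sum(child_states) + (1.0 - z) * h
```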

After a state has been computed for all the nodes, we get a global tree-structured state $x(t) \in (\mathbb{R}^{N_R})^\#$ for the whole input tree t. In [38], the authors use an adaptive mechanism (inspired by the “attention” in [1]) for extracting a fixed-size vector from this representation. Nonetheless, a simple alternative is to just use the root state x(root(t)), in a way similar to what happens with the GRU over sequences, where the final representation is the one associated with the last time step.

In the literature, one variant of TreeGRUs is represented by the model proposed by Kuta, Morawiec, and Kitowski [40], which however we do not consider since it is designed for binary trees and it assumes that the input is only
