
DEPARTMENT OF COMPUTER SCIENCE

MASTER DEGREE IN COMPUTER SCIENCE

Learning Tree Transductions

by Deep Neural Networks

Supervisor:

Dott. Davide Bacciu

Author:

Antonio Bruno

Academic Year

2016-2017


Contents

1 Introduction 1

1.1 Motivations and goals . . . 3

1.2 Thesis outline . . . 5

2 Background 6

2.1 Machine Learning for structured domain . . . 6

2.2 Machine Learning evolution: Deep Learning . . . 7

2.2.1 Convolutional Neural Network . . . 7

2.2.2 Autoencoders . . . 9

2.2.3 Deep Belief Network and Deep Boltzmann Machine . . . 10

2.2.4 Long Short-Term Memory (LSTM) unit . . . 11

2.3 Learning on structured domain . . . 13

2.3.1 Neural network approach . . . 13

2.3.2 Generative approach . . . 16

2.3.3 Kernel methods . . . 19

2.4 Tree data structure . . . 19

2.4.1 Definitions and properties . . . 19

2.4.2 Tree data applications . . . 20

2.4.3 Tree isomorphism . . . 21

2.4.4 Tree transductions . . . 22

2.5 Learning in tree structured domain . . . 24

2.6 State-of-the-art on tree transduction . . . 25

2.6.1 A generic framework for learning on structured domain . . . 25

2.6.2 IO-BHTMM . . . 25

2.6.3 TreeESN . . . 26

2.6.4 Kernel-based models . . . 26

2.6.5 Doubly Recurrent Tree Decoder . . . 27


3 Deep Learning for tree transduction learning 30

3.1 Recursive neural networks for bottom up tree encoding . . . 31

3.1.1 Child-Sum TreeLSTM . . . 31

3.1.2 N-ary TreeLSTM . . . 32

3.2 Models for output generation . . . 34

3.2.1 Classification output layer . . . 34

3.2.2 Regression output layer . . . 34

3.3 Tree2Tree in detail . . . 34

3.3.1 Implementation details . . . 37

3.3.1.1 Torch framework . . . 37

3.3.1.2 Tree . . . 37

3.3.1.3 TreeLSTM . . . 38

3.3.1.4 Encoders . . . 38

3.3.1.5 Tree2TreeModel . . . 38

3.3.1.6 Output layers . . . 39

3.3.1.7 Validation . . . 39

3.3.1.8 Label representation . . . 40

3.3.1.9 Other aspects . . . 40

4 Tests and results 41

4.1 INEX Challenge . . . 41

4.1.1 Task description . . . 41

4.1.2 Datasets description . . . 42

4.1.3 Validation and final training . . . 42

4.1.4 Test results . . . 43

4.2 Dataset Elimination . . . 45

4.2.1 Task description . . . 45

4.2.2 Dataset description . . . 45

4.2.3 Validation, training and test . . . 46

5 Conclusions and future work 47

Appendix 49


Table of acronyms

ML Machine Learning

AI Artificial Intelligence

PCA Principal Component Analysis

RNN Recurrent Neural Network

RecNN Recursive Neural Network

ESN Echo State Network

HMM Hidden Markov Model

HRM Hidden Recursive Model

TreeHMM Tree Hidden Markov Model

DRNN Doubly Recurrent Neural Network

BPTT Back-Propagation Through Time

BPTS Back-Propagation Through Structure

RTRL Real Time Recurrent Learning

CNNs Convolutional Neural Networks

CAE Convolutional Autoencoder

VAE Variational Autoencoder

DBN Deep Belief Network

DBM Deep Boltzmann Machine


DAGs Directed Acyclic Graphs

DOAGs Directed Ordered Acyclic Graphs


Chapter 1

Introduction

Machine Learning (ML) is a field of Artificial Intelligence (AI) whose aim is to develop models and algorithms that can learn to perform computational tasks from data, without (or with limited) prior knowledge. The idea of a learning machine was first formulated in 1950 by A.M. Turing [1], and in the following decades several other models have been presented. However, ML became popular only in the 1990s, essentially for two reasons: the growing amount of available data and the increasing computing power.

Nowadays, ML has become one of the most popular, attractive and fastest-growing research areas, counting many different approaches based on a strong theoretical background (e.g. mathematical optimization, statistics).

In the most simplistic view, ML approaches are divided into [2]:

supervised learning: a set of input-output data is provided and the learning algorithm seeks a function from inputs to the respective targets. If the target space is discrete, it is called a classification problem; if the target space is continuous, it is called a regression problem;

unsupervised learning: only the set of inputs is available and the learning model is used to identify similarities or unknown relationships between the samples (e.g. to perform data aggregation and partitioning);

reinforcement learning: the learning system (i.e. the agent) is immersed in an environment and a teacher evaluates its behavior. The teacher can reward or punish the agent depending on its actions (reinforcement), so that the agent learns to perform the correct actions in the future (like pet training).

Nowadays, supervised machine learning is by far the most common use case, because the desired output of the algorithm is already known (just like when a student is learning from an instructor). Unsupervised machine learning is a more complex process which has been put to use in a far smaller number of applications so far, but this is where much of the excitement over the future of AI stems from, when people talk about computers learning to "teach themselves" rather than having to be taught.

A challenging domain for supervised learning is that of structured learning, in which input samples are structured data (e.g. sequences, trees, graphs). Structured data are used to represent information about entities and the relations among them. Entities can be represented as nodes to which numerical labels (of any type, e.g. continuous, multidimensional) are associated, expressing some kind of information about the entity. Edges between nodes, directed or not, express some kind of relation between labels, or a specific part of them (e.g. in sequences relations express a total order, in trees relations express a hierarchy). Edges can be labeled to better specify some aspects of the relation; indeed, every type of structured data can be seen as a graph labeled on both nodes and edges. For these reasons structured data are much more expressive than the classical flat domain (i.e. vectors, matrices). As a trivial example: vectors and matrices have a fixed size, whilst structured data can have different sizes and shapes; we could use vectors (or matrices) to represent structured data, but if the vector is too small not all information about the data can be expressed, while if the vector is too big many empty entries are wasted (a more detailed discussion is in Section 2.1). The introduction of structured data makes tasks more complex, since the learning model must be able to extract useful knowledge from both the numerical and the structural information. Moreover, elements in a structured domain do not have a fixed size or structure, and models must be able to adapt to these structural variations (e.g. models that process only binary trees are much simpler than models that process generic trees; dealing with acyclic graphs is much simpler than managing graphs with cycles).

Structural learning is still an open problem on which much ML research is focused, because it could solve many real-world tasks; on the other hand, consistently reliable data are available only for particular tasks. For sequential structural learning, both because sequences are the simplest structural domain and because a lot of data are available (e.g. sentences, audio/video signals, DNA), very good results have been achieved (e.g. [3][4][5]). To give a few examples: in machine translation, there is a mapping between two sequences of words (or, at a lower level, sequences of characters) belonging to two different language vocabularies; in speech recognition there is a mapping between a sequence of sampled sound and a sequence of words (i.e. the written representation of the speech).

The tree structural domain is also suitable for representing many kinds of real-world information (e.g. documents, images, sentences), but there are fewer works with respect to the sequential domain because the problem is much harder, especially for its computational cost. The main tasks addressed with trees concern natural language processing (e.g. [6][7]). An example is text simplification: given (the parse tree of) a sentence, the model returns (the parse tree of) the simplified version of the input, i.e. a sentence with the same meaning but without irrelevant information.

Finally, for the graph domain there are fewer relevant results (e.g. [8][9][10]), since there are many more aspects to take into account (e.g. directed/undirected edges, cycles) for which stable models have not been found yet, and too few data are available. However, if a good model is found for graphs it can automatically be used for all kinds of domains, since graphs are the most generic structures.

1.1

Motivations and goals

The thesis focuses on structural learning because real-life data are naturally presented as structures; some examples are: natural language (sentences are sequences of words), documents (hierarchies of chapters, paragraphs, etc.), chemistry (molecules are graphs of atoms). However, the same data can be expressed with different structures, and some representations can be more expressive than others. For example a sentence (which is a sequence) can also be represented as a syntactic/parse tree (Figure 1.1(a)), which is a richer representation because it gives additional information that helps to disambiguate the meaning of a word with respect to the context in which it appears. Another example of multiple representations of the same data is shown in Figure 1.1(b): a chemical molecule can have several structured representations, such as a sequence, a directed graph and an undirected graph.

Another motivation is that the flat domain (i.e. vectors and matrices) is well understood and many problems have been solved there with good results, while more complex structured learning is still an open problem. However, for sequences (the simplest structural domain) interesting results have already been obtained [3][4][5].

The purpose of this thesis is to take a step forward in structural learning, focusing on the tree domain. There are some aspects of trees to consider: first, with respect to sequences they can express one-to-many relations between entities. Second, sequences can vary only in two aspects, length and type of data, whereas trees may also have different relational information (i.e. top-down vs bottom-up), every node can have an arbitrary number of nodes with which it is in relation (i.e. its outdegree), and moreover different orders can be defined over the relations (i.e. ordered siblings).

If some of these aspects are known a priori, then some approaches designed for sequences may be adopted for trees too (e.g. for a full binary tree the sequential representation given by a visit can be used, because it is unique); more generically, a combination of all paths in the tree can be used as a sequential representation, but redundant information is introduced and this approach may lead to computational intractability for big trees (e.g. high outdegree, high depth, high density).

Motivated by the above, this thesis presents a deep neural network approach for learning generic tree transductions (previously, good results had been obtained only for NLP, and those approaches are rule-based or grammar-based [6][7], whereas the proposed model has no a priori knowledge of task and data). The core of the model is the LSTM unit (Section 2.2.4), which represents the state-of-the-art in the major tasks dealing with sequences, so the main idea is to generalize it to process trees (Section 3.1). In particular, LSTMs are used to encode the entire tree into a fixed size vector; since this is a form of automatic feature extraction (Section 2.1), this aspect conditions the entire learning process. Two encoding variants used by the proposed approach are presented.


Figure 1.1: Example of structured representations: (a) parse tree of a sentence, (b) sequential and graph representations of a molecule [11].


1.2

Thesis outline

The structure of this thesis is outlined in the following.

Chapter 2 describes relevant aspects of structured domains, including the different approaches, focusing attention on the tree structure. The chapter also provides a brief explanation of Deep Learning, the approach followed by the proposed model, presenting some canonical models; it ends with the state-of-the-art on tree transductions. In Chapter 3 the proposed model is described, showing details about the encoders, the implementation and how it deals with different types of tasks, inputs and transductions. In Chapter 4, experimental results are reported. The results concern tasks on two real-world datasets and one on synthetic data (the latter to test a particular type of transduction).


Chapter 2

Background

2.1

Machine Learning for structured domain

Summarizing what was said in Chapter 1, the term structured data is used to define information describing entities and relations among them.

Classic ML methods process data in vector form (flat representation), in which every element of the vector represents a significant feature of the data. However, in real-world problems data are often structured; if a domain expert is available (which is rare), he or she can be asked to produce flat hand-designed structural features. Consulting a domain expert is an expensive phase, and the following issues, in addition to those already mentioned in the introduction, have to be taken into account to produce a good flat representation:

dataset dependence: if the dataset is modified, the hand-designed features may no longer be the best ones (assuming they were), because different information is introduced, and the domain expert is needed again;

error impact: flat domain techniques are highly dependent on the vectorial representation; even a small representation error can result in strong limitations on the performance of the learning model;

dimensionality: capturing a higher number of relevant aspects usually means higher quality flat representations. Having more features entails vectors of larger size. If the number of data observations stays the same (i.e. the dataset size), bigger flat representations make sparse data more likely, and this alone can prevent obtaining a good model. This phenomenon is known as the Curse of Dimensionality [12]. Dimensionality reduction can be performed by Principal Component Analysis (PCA), but only linearly dependent features can be removed in this way;


preprocessing: data often need further elaborations to be properly processed (e.g. normalization).

However, even a very good flat representation can lead to bad performance, due to the fact that relational (structural) information cannot be captured.

2.2

Machine Learning evolution: Deep Learning

Since its foundation, ML has aimed to produce systems that learn mapping functions between input and output data, especially in supervised learning.

The next generation of ML models focused on Representational Learning [13], i.e. ML models that are capable of performing feature extraction by themselves, reducing human intervention to a minimum.

The most recent approach to Representational Learning is Deep Learning.

Deep Learning [14] is a class of ML algorithms that, using a stack of non-linear layers, provides unsupervised or semi-supervised feature extraction and transformation. Every layer in the stack gets the output of the previous layer as input, and deeper layers correspond to higher abstraction levels. Moreover, redundant information is discarded and, more importantly, unlabeled and sparse data can be exploited. More complex functions can be captured simply by adding a few layers or units. Typical deep network training consists of the following steps:

1. (unsupervised) pretraining of the encoding deep module;

2. (supervised) training of the readout layer;

3. fine-tuning of the entire deep model.

Deep Learning achieves its best results when data show some kind of regularity (e.g. spatial, temporal), and this is the main reason why it dominates the state-of-the-art in the image, signal and NLP domains. The evolution of ML described above is summarized in Figure 2.1.

A brief taxonomy of deep learning models follows.

2.2.1

Convolutional Neural Network

Convolutional Neural Networks (CNNs) are feedforward networks dominating image processing. They are an MLP variant with units designed to require minimal preprocessing [15]; the main job of these units is invariant feature extraction [16].

Figure 2.1: Machine Learning evolution (top to bottom); filled rectangles denote adaptive components (learned from data).

CNNs are biologically inspired by the visual cortex, in which every neuron is responsible for processing the information of its own visual portion (receptive field) and portions are overlapped so as to process the whole visual field [17]. Typically, CNNs are made (as shown by the example in Figure 2.2) of a stack of:

convolutional layers: convolution emulates the behaviour of cortical neurons. Mathematically, convolution is defined as:

(f * g)(t) = \int_{-\infty}^{+\infty} f(\tau) g(t - \tau) d\tau = \int_{-\infty}^{+\infty} f(t - \tau) g(\tau) d\tau    (2.1)

where f is the function and g is called the filter.

Every unit processes a piece of the input data, the part belonging to its receptive field (i.e. the integration bounds), and the receptive fields of the units (the smaller, darker squares in Figure 2.2) are overlapped in such a way that perturbations (e.g. translation, rotation, distortion) do not affect the output quality. Usually convolutional layers implement different filters or kernels (e.g. Gaussian), and units belonging to the same convolutional layer share weights, reducing the number of parameters and making training faster (a small code sketch of convolution and pooling follows this list);

pooling layers: the outputs of the convolutional layer are often fed to an aggregation function with the twofold objective of achieving dimensionality reduction as well as invariance with respect to domain-specific transformations (e.g. photometric variations in images, translations, rotations). Usually max pooling is used, but several pooling functions have been proposed (e.g. average pooling, norm pooling);

fully connected layers: after the alternation of convolutional and pooling layers there are fully connected layers, like an MLP, which merge the information extracted by the preceding layers.
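To make the two operations above concrete, here is a minimal NumPy sketch (illustrative only, not code from the thesis) that applies a single 2-D filter in "valid" mode and then performs non-overlapping 2x2 max pooling; the function names are hypothetical.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Discrete 2-D convolution ('valid' mode): slide the flipped kernel over the image (cf. Eq. 2.1)."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    flipped = kernel[::-1, ::-1]                      # convolution flips the filter
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * flipped)
    return out

def max_pool2x2(feature_map):
    """Non-overlapping 2x2 max pooling for dimensionality reduction."""
    h, w = feature_map.shape
    h, w = h - h % 2, w - w % 2                       # drop odd borders for simplicity
    blocks = feature_map[:h, :w].reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

if __name__ == "__main__":
    img = np.random.rand(8, 8)
    edge_filter = np.array([[1.0, 0.0, -1.0]] * 3)    # a simple vertical-edge filter
    fmap = conv2d_valid(img, edge_filter)             # shape (6, 6)
    pooled = max_pool2x2(fmap)                        # shape (3, 3)
    print(fmap.shape, pooled.shape)
```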

Figure 2.2: An example of a generic CNN with two convolutional+pooling layers and two fully connected layers.

Particular mentions go to LeNet [18] for being the first CNN, to GoogLeNet [19] for changing the classic CNN approach (i.e. the inception module) and to Microsoft ResNet [20] for achieving a record 3.6% error rate in ILSVRC 2015 (the human error rate is 5%-10%).

Although CNNs were invented for image processing, they also achieve good performance on sequential data [15][21][22].

2.2.2

Autoencoders

Autoencoders are networks trained to reproduce as output a reconstruction r of the vector x received as input. This could sound useless, because it seems that autoencoders implement the identity function, but their utility will be shown shortly. The simplest autoencoder has the architecture of an MLP, comprising an input layer, a hidden layer which stores the encoding h of the input and an output layer, so it can be trained like any feedforward net. Autoencoders can be seen as composed of:

encoder: performs the encoding h = enc(x);

decoder: performs the reconstruction r = dec(h).

The real utility of autoencoders is represented by undercomplete autoencoders (having a hidden layer of lower dimensionality than the input one, as shown in Figure 2.3(a)), used to perform dimensionality reduction, compression and feature selection. It is established that an autoencoder with linear units and Mean Square Error loss performs PCA of the input, while non-linear units allow learning more general, non-linear compressed representations of the data. However, if the encoder and decoder comprise a large number of neurons and are left unconstrained, the result is learning a simple memory of the input without capturing relevant information during the encoding phase. Autoencoders with a hidden layer of size equal to or larger than the input are useful only under sparsity constraints. The best results are achieved by deep autoencoders [23], in which encoder and decoder comprise a stack of hidden layers (Figure 2.3(b)).

Mixing autoencoders with the generative approach we get the Variational Autoencoder (VAE) [24]: the encoder maps the input to a probability distribution used for the output reconstruction. In order to exploit the usual learning algorithms, the encoding distribution is made differentiable by representing it through its parameters (e.g. mean, variance).

Autoencoders can be combined with CNNs, leading to the Convolutional Autoencoder (CAE), whose applications include image reconstruction/restoration, denoising, coloring [25] and super-resolution [26].

Figure 2.3: Examples of autoencoders: (a) autoencoder, (b) deep autoencoder.

2.2.3

Deep Belief Network and Deep Boltzmann Machine

Deep Belief Network (DBN) and Deep Boltzmann Machine (DBM) represent the generative approach in Deep Learning (see Section 2.3.2 for what a generative approach is). The main structural aspect they share is that both are made by stacking several layers of Restricted Boltzmann Machines (RBMs).

An RBM is a graphical (generative) model with a complete bipartite graph structure, derived from the Boltzmann Machine to make inference computationally tractable (Figure 2.4(a) shows an example of a Boltzmann Machine and the corresponding RBM; it is easy to see how the latter is simpler in terms of the energy function E(x, h) to optimize). In the simplest case an RBM treats binary values (there are generalizations to deal with multinomial and real data too [27]) and it is very simple and fast to train thanks to its structure.


The main difference between DBN and DBM is that the former has directed connections between layers (Figure 2.4(b)), which makes the DBM more expressive and flexible [28]. Following the main philosophy of Deep Learning, these deep networks are pretrained layer-wise, and the best way to do it (both for quality and efficiency) is by Contrastive Divergence [29][30].

Figure 2.4: (a) Boltzmann Machine vs RBM, (b) DBN vs DBM.

2.2.4

Long Short-Term Memory (LSTM) unit

The LSTM network is a variation of the RNN (Section 2.3.1) that solves the problem of gradient vanishing/explosion arising when an RNN is fed with very long inputs [31]. LSTM units store contextual information in their memory cell, capturing long-term dependencies; this is done by adding three multiplicative units, called gates, to the traditional recurrent unit:

input gate: protects the memory cell from irrelevant/noisy input;

output gate: protects the other units from the irrelevant/noisy content of the considered unit;

forget gate: performs memory cell reset when needed.

This new unit is called the Memory Unit, and an intuitive explanation of the influence of the gates is provided by Figure 2.5. Memory units can share the same gates to form a Memory Block.


Formally, given a generic memory unit, its behavior at timestep t is ruled by the following equations:

u_t = tanh(W^{(u)} x_t + U^{(u)} h_{t-1} + b^{(u)})    (classic recurrency)
i_t = \sigma(W^{(i)} x_t + U^{(i)} h_{t-1} + b^{(i)})    (input gate)
o_t = \sigma(W^{(o)} x_t + U^{(o)} h_{t-1} + b^{(o)})    (output gate)
f_t = \sigma(W^{(f)} x_t + U^{(f)} h_{t-1} + b^{(f)})    (forget gate)
c_t = i_t ⊙ u_t + f_t ⊙ c_{t-1}    (cell state)
h_t = o_t ⊙ tanh(c_t)    (output)

with x_t the input at the current timestep, σ the sigmoid activation function and ⊙ the elementwise multiplication.

What is described above is just one possible LSTM unit: more generally, any non-linear function can be used for the input gate, and the cell state c_t can be used as a further input at the next timestep.

Some architectural variants and applications of LSTM networks are presented in [5].
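For concreteness, the following NumPy sketch implements one timestep of the memory-unit equations above with randomly initialized parameters; it is a minimal illustration, not the implementation used in the thesis, and the parameter layout (dictionaries keyed by gate name) is an arbitrary choice.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM timestep following the equations above (u, i, o, f gates)."""
    W, U, b = params["W"], params["U"], params["b"]          # dicts keyed by 'u', 'i', 'o', 'f'
    u_t = np.tanh(W["u"] @ x_t + U["u"] @ h_prev + b["u"])   # candidate update
    i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])   # input gate
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])   # output gate
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])   # forget gate
    c_t = i_t * u_t + f_t * c_prev                           # cell state
    h_t = o_t * np.tanh(c_t)                                 # output
    return h_t, c_t

if __name__ == "__main__":
    I, M = 4, 8                                              # input and memory sizes
    rng = np.random.default_rng(0)
    params = {
        "W": {g: rng.normal(size=(M, I)) for g in "uiof"},
        "U": {g: rng.normal(size=(M, M)) for g in "uiof"},
        "b": {g: np.zeros(M) for g in "uiof"},
    }
    h = c = np.zeros(M)
    for x in rng.normal(size=(5, I)):                        # process a length-5 toy sequence
        h, c = lstm_step(x, h, c, params)
    print(h.shape)
```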

Figure 2.5: Graphical representation of an LSTM unit. If the input gate value is 0 then the input does not contribute to the cell state value; if the forget gate is 0 then the new cell state value depends only on the input; and if the output gate is 0 the output of the unit is 0 too. Dashed lines show a delay of one timestep.


2.3

Learning on structured domain

This section shows the main approaches for learning on structured domains, discussing their relative advantages and drawbacks.

2.3.1

Neural network approach

According to the type of the structured domain we have:

Recurrent Neural Network (RNN): this type of network is used to process sequential data. Like its flat counterpart, the MLP, the RNN has been proven to be a universal approximator [32]. On the other hand, training is computationally very heavy due to the unfolding, and with very long inputs RNNs suffer from gradient vanishing/explosion.

The state transition of a recurrent unit at timestep t is the following:

h_t = f(W x_t + U h_{t-1} + b)    (2.2)

where f is the activation function (e.g. tanh, sigmoid), x_t is the input at the current timestep, h_{t-1} is the state at the previous timestep, and the remaining terms are learnable parameters; in particular U represents the recurrent weights.

Figure 2.6: A simple recurrent unit and its unfolding over a sequence. The hidden state of a timestep is used as further input for the elaboration of the next timestep.

Usually, training is performed by unfolding the network over the example sequence; in particular, the network structure is replicated over every sequence element, as shown in Figure 2.6. In this way the unfolded network is an MLP and can be trained with classical gradient descent methods; this way of training is called Back-Propagation Through Time (BPTT). The major drawbacks of BPTT are that it cannot be used for online training and that, unlike BP on an MLP, it may converge to local minima. For online training, Real Time Recurrent Learning (RTRL) can be used instead, but it is far less efficient.

A very deep study covering the last 30 years of RNN research is [33].

Recursive Neural Network (RecNN): this type of network is a generalization of the RNN and is used to process trees. The state transition of a recursive unit for a node u with k children is:

h(u) = f(W x_u + \sum_{i=1}^{k} U h_{ch_i(u)} + b)    (2.3)

where f is the activation function (e.g. tanh, sigmoid), x_u is the input of the current node, h_{ch_i(u)} is the state of the i-th child of u, and the remaining terms are learnable parameters; in particular U represents the recursive weights. The previous equation presents the non-positional case; if the order of the children is relevant, a weight matrix is used for each of them:

h(u) = f(W x_u + \sum_{i=1}^{k} U^{(i)} h_{ch_i(u)} + b)    (2.4)

The formulation of the equations shows that the computation is performed in a bottom-up way, because the state of a node is computed using the states of its children. As for the RNN, training is performed by Back-Propagation Through Structure (BPTS) [34], unfolding the network over the tree structure (as shown in Figure 2.7, in which the hidden state of each node is propagated to its parent, obtaining an MLP).

The formulation of the equations shows that the computation is performed in a bottom-up way because the state of a node is computed using the states of its children. Like RNN training is performed by Back-Propagation Through Structure (BPTS) [34] unfolding the network over the tree structure (as shown in Figure 2.7, in which the hidden state of each node is propagated to its parent, obtaining a MLP). h x y Unfold hR hL 1 3 4 2 6 5 1 3 4 2 6 5 h x y h4 h5 h2 h3 h6 hh1 y1 x1 y2 x2 y3 x3 y6 x6 y5 x5 y4 x4 h2 h3 h4 h5 h6

Figure 2.7: Simple recursive unit for binary tree and its unfolding over an example tree.
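A minimal sketch of the bottom-up recursive state transition of Eq. (2.3), assuming a simple node object carrying a label vector and a list of children; the `Node` class and the parameter shapes are illustrative assumptions, not the thesis code.

```python
import numpy as np

class Node:
    def __init__(self, label, children=()):
        self.label = np.asarray(label, dtype=float)
        self.children = list(children)

def recnn_encode(node, W, U, b):
    """Non-positional recursive unit: h(u) = tanh(W x_u + sum_i U h_{ch_i(u)} + b)."""
    child_sum = sum((recnn_encode(c, W, U, b) for c in node.children),
                    start=np.zeros(b.shape))
    return np.tanh(W @ node.label + U @ child_sum + b)

if __name__ == "__main__":
    I, M = 3, 5
    rng = np.random.default_rng(1)
    W, U, b = rng.normal(size=(M, I)), rng.normal(size=(M, M)), np.zeros(M)
    tree = Node([1, 0, 0], [Node([0, 1, 0]), Node([0, 0, 1], [Node([1, 1, 0])])])
    print(recnn_encode(tree, W, U, b))    # root state = bottom-up encoding of the whole tree
```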

Echo State Network (ESN): in recent years ESNs [35] have received particular attention. They are a class of models belonging to Reservoir Computing [36], an efficient paradigm for modeling RNNs that implement dynamical systems. As shown in Figure 2.8, this class of models is characterized by:


input layer: through which data are fed to the network;

reservoir: a non-linear hidden layer with randomized and sparse connectivity, which performs the input encoding. This layer is typically not subject to training;

readout: the output layer, usually a simple linear model.

Figure 2.8: Example of a simple ESN. The input layer (W_in) and the readout (W_out) are feedforward layers, while the reservoir (Ŵ) is recurrent, sparse and randomly connected.

The basic equation describing the encoding performed by an ESN is similar to the encoding equation of an RNN:

h_t = f(W_in x_t + Ŵ h_{t-1})    (2.5)

where f is the activation function (usually tanh), W_in is the input-to-reservoir weight matrix (possibly including a bias term) and Ŵ is the recurrent reservoir weight matrix. The output is usually computed through the linear combination y_t = W_out h_t, and usually W_out is the only parameter of the network to be trained (in this way the same reservoir can be used for different tasks by just training different readouts).

Not every choice of Ŵ leads to a valid ESN. A valid ESN model must satisfy the Echo State Property (ESP): asymptotically, the network must depend only on the inputs (i.e. it must not depend on its initial state). Several conditions have been provided to obtain the ESP; a sufficient condition commonly used is that the largest singular value of Ŵ is less than unity:

\sigma_{max}(Ŵ) < 1    (2.6)

Since the Euclidean (spectral) norm of Ŵ equals its largest singular value, the sufficient condition can be restated as ||Ŵ||_2 < 1.

Training the readout amounts to solving a linear problem. The most stable techniques to do so are Ridge Regression [37] (offline training) and Recursive Least Squares [38] (online training); consequently, training an ESN is much faster than training an RNN. The hyperparameters on which validation is performed are essentially those of the reservoir (e.g. reservoir size, density, spectral radius) and of the readout. One main drawback of ESNs is that they have only short-term memory capacity, which can be estimated only with linear reservoir units.

ESNs have been extended to deal with trees (TreeESN [39]) and graphs (GraphESN [8]).
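A minimal ESN sketch under the assumptions above: a random sparse reservoir rescaled so that ||Ŵ||_2 < 1 (the sufficient ESP condition), sequence encoding via Eq. (2.5), and a readout trained by ridge regression. All names and the toy task are illustrative, not taken from the cited works.

```python
import numpy as np

def make_reservoir(n_in, n_res, density=0.1, norm=0.9, seed=0):
    rng = np.random.default_rng(seed)
    W_in = rng.uniform(-0.5, 0.5, size=(n_res, n_in))
    W_hat = rng.normal(size=(n_res, n_res)) * (rng.random((n_res, n_res)) < density)
    W_hat *= norm / np.linalg.norm(W_hat, 2)       # enforce ||W_hat||_2 < 1 (sufficient ESP condition)
    return W_in, W_hat

def encode(xs, W_in, W_hat):
    """Eq. (2.5): h_t = tanh(W_in x_t + W_hat h_{t-1}), collected for every timestep."""
    h = np.zeros(W_hat.shape[0])
    states = []
    for x in xs:
        h = np.tanh(W_in @ x + W_hat @ h)
        states.append(h)
    return np.array(states)

def train_readout(states, targets, ridge=1e-3):
    """Ridge regression: W_out = Y^T H (H^T H + ridge I)^{-1}."""
    H = states
    return targets.T @ H @ np.linalg.inv(H.T @ H + ridge * np.eye(H.shape[1]))

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    xs = rng.normal(size=(200, 1))
    ys = np.roll(xs, 1, axis=0)                    # toy task: reproduce the previous input
    W_in, W_hat = make_reservoir(1, 50)
    H = encode(xs, W_in, W_hat)
    W_out = train_readout(H, ys)
    print(np.mean((H @ W_out.T - ys) ** 2))        # training MSE of the linear readout
```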

2.3.2

Generative approach

These models learn a probability distribution defined over the input structural space. They are also referred to as graphical models, because they are usually represented as a graph in which nodes are the states (usually drawn as empty nodes) and the visible values (usually drawn as filled nodes); edges between nodes specify the conditional relation (i.e. the target node state/value is conditioned on the source node state). Since they deal with conditional probabilities, computations are made affordable by exploiting Bayes' theorem. A generative model makes hypotheses over the input data in order to adapt its parameters. Learning can be seen as the maximization of the likelihood of the observed data under the model. Some reference models belonging to this category are:

Hidden Markov Model (HMM) [40]: this class of models is used for processing sequences and is based on the assumption that data has a so-called Markovian nature: the value of every element in the sequence can be generated by considering only the values of its l predecessors (or successors) in the sequence (Figure 2.9(a) shows a first-order HMM, i.e. l = 1). This property can be seen as a kind of short-term-only dependency.

Formally, HMMs model dynamics through three distributions (for simplicity only the case of dependency on the immediate predecessor is shown):

state transition: the latent process (which gives the "hidden" in the name) is characterized by the hidden state variables Q_t, whose values typically belong to a discrete set H. The probability distribution is defined for each pair of hidden state values (∀ i, j ∈ H):

A_{ij} = P(Q_t = i | Q_{t-1} = j)    (2.7)

and gives the probability that the latent variable assumes value i at the next timestep, having value j at the current one;

prior: π_i = P(Q_t = i) is the probability that the latent variable has value i at the current timestep;

emission: the observable output values Y_t (i.e. the emissions) are ruled by the distribution:

b_i(y_t) = P(Y_t = y_t | Q_t = i)    (2.8)

So the learnable parameters of an HMM are θ = (B, A, π); another important hyperparameter to consider is the number of hidden states C = |H| (i.e. the number of values assumed by Q_t).

Learning in an HMM is defined as follows: given a dataset D of observed sequences and the number of hidden states C, find the parameters θ = (B, A, π) that maximize the probability that the model generated the sequences in D (i.e. the Complete Likelihood); the learning algorithm typically used is Expectation-Maximization (EM) [41].
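As a concrete illustration of the distributions θ = (B, A, π) above, the following sketch evaluates the likelihood of an observed sequence with the standard forward algorithm for a toy discrete-emission HMM (illustrative code; the tiny parameter values are made up).

```python
import numpy as np

def forward_likelihood(obs, pi, A, B):
    """P(Y = obs | theta) for a discrete HMM via the forward recursion.

    pi[i]   = P(Q_t = i)                prior
    A[i, j] = P(Q_t = i | Q_{t-1} = j)  state transition (cf. Eq. 2.7)
    B[i, y] = P(Y_t = y | Q_t = i)      emission (cf. Eq. 2.8)
    """
    alpha = pi * B[:, obs[0]]                 # alpha_1(i) = P(Q_1 = i) * b_i(y_1)
    for y in obs[1:]:
        alpha = B[:, y] * (A @ alpha)         # alpha_t(i) = b_i(y_t) * sum_j A_ij alpha_{t-1}(j)
    return alpha.sum()

if __name__ == "__main__":
    pi = np.array([0.6, 0.4])                 # 2 hidden states, 2 observable symbols
    A = np.array([[0.7, 0.4],
                  [0.3, 0.6]])                # columns sum to 1: A[i, j] = P(i | j)
    B = np.array([[0.9, 0.2],
                  [0.1, 0.8]])
    print(forward_likelihood([0, 0, 1], pi, A, B))
```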

There are several variants of the HMM, such as the IO-HMM [42], in which the state transition is also conditioned on the inputs (i.e. input-driven, Figure 2.9(b)), the Bidirectional HMM [43], in which the dependency is defined in both directions, and the Coupled HMM [44], which models interactions between interlaced generative processes describing different dynamics (Figure 2.9(c)).

Figure 2.9: Examples of first-order (l = 1) HMM architectures: (a) HMM, (b) IO-HMM, (c) Coupled HMM. As said before, visible values are represented by the filled nodes. In (a) every state is conditioned only on the previous state; in (b) every state is also conditioned on the input; in (c) the state of one dynamic is also conditioned on the previous state of the other dynamic.

Tree Hidden Markov Model (TreeHMM): a generalization of the HMM to deal with trees, obtained by modifying the state transition distribution. Depending on the type of Markovian assumption, two main types of TreeHMM can be described:

Top-Down TreeHMM: models trees in which the state of a node depends on its parent's state [45]. Formally the state transition distribution is defined as:

P(Q_{ch(u)} | Q_u)    (2.9)

or equivalently:

P(Q_u | Q_{p(u)})    (2.10)

An example of a binary Top-Down TreeHMM is in Figure 2.10(a); as indicated by the directed edges, the state of a node is conditioned on the state of its parent.

Bottom-Up TreeHMM: models trees (with outdegree N) in which the state of a node depends on the joint state of its children. In this case the state transition is more complex:

P(Q_u | Q_{ch_1(u)}, Q_{ch_2(u)}, ..., Q_{ch_N(u)})    (2.11)

An example of a binary Bottom-Up TreeHMM is in Figure 2.10(b); as indicated by the directed edges, the state of a node is conditioned on the joint state of its children.

Both TreeHMMs shown above are valid models, but the bottom-up one performs better, because the children-to-parent transition allows effectively modeling the co-occurrence of substructures among the subtrees of internal nodes. Bottom-up models are computationally more complex, but some assumptions and approximations can make the computation easier [46].

Figure 2.10: (a) Top-Down TreeHMM, (b) Bottom-Up TreeHMM.

2.3.3

Kernel methods

Kernels [47] allow restating a non-linear problem in the form of a linear one. The basic idea is to map data onto a higher-dimensional space, because the bigger the space, the higher the probability that the data become linearly separable. In practice, kernels are a similarity measure between pairs of objects of the same type, computed as the inner product of data mappings; the advantage is that the mapping function need not be made explicit (the kernel trick), except for structured data. The main advantages of kernels are compositionality, computational efficiency (in some cases) and the fact that they can cover all structured domains. The main kernel categories are:

convolutional kernels: decompose the object into smaller parts whose similarity can be evaluated; all the similarities are then combined to get the value for the whole object;

syntactic kernels: count, in a weighted manner, the number of equal substructures that objects have;

adaptive kernels: learn from data the weights used to calculate similarity;

generative kernels: are obtained using generative models to define new measures or to generate the substructures used by convolutional/syntactic kernels.

Section 2.6.4 presents several examples of kernels used in the state-of-the-art.

2.4

Tree data structure

2.4.1

Definitions and properties

In this thesis the term "tree" is used to refer to the rooted tree concept of graph theory. Only the aspects relevant to this thesis are presented.

Def. (Tree): Pair T = (V, E ) with V node set and E ⊆ V × V edge set.

A node is represented by u, and the ordered pair (u, v) means there is an edge between u and v; it expresses the following relations: "v is a child of u", "u is the parent of v".

There are particular nodes, for example:

Def. (Root): node with no parent.

Def. (Leaf): node with no children.


Def. (Sibling nodes): nodes that have the same parent.

Trees can be of different types, according to their properties:

Def. (Node labeled tree): a triple T = (V, E, X), where X is the node label set. This thesis refers to labeled trees; an example of a labeled tree showing root, leaves, outdegree and other aspects is in Figure 2.11.

Def. (N-ary tree): tree in which every node has at most n children. The maximum number of children is also called the outdegree.

Def. (Ordered tree): tree in which an order is defined over the children.

Def. (Tree size): cardinality of the set V.

Given a generic node u, it can store several pieces of information (a minimal data-structure sketch follows the list):

label: the node value, denoted by x_u;

parent: if any, denoted by p(u);

children: a collection denoted by ch(u) (the i-th child is denoted by ch_i(u)), which follows the order defined between children.
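The definitions above translate directly into a minimal data structure; the sketch below (a hypothetical representation, not the one described in Section 3.3.1.2) stores the label x_u, the parent p(u) and the ordered children ch(u).

```python
class TreeNode:
    """A node of a labeled, ordered tree: label x_u, parent p(u) and ordered children ch(u)."""

    def __init__(self, label, children=None):
        self.label = label
        self.parent = None
        self.children = children or []
        for child in self.children:
            child.parent = self

    def is_leaf(self):
        return not self.children

    def size(self):
        """|V| of the subtree rooted here."""
        return 1 + sum(c.size() for c in self.children)

    def outdegree(self):
        """Maximum number of children over the subtree rooted here."""
        return max([len(self.children)] + [c.outdegree() for c in self.children])

# Example: root 'r' with children 'a' (a leaf) and 'b' (with one leaf child 'c')
root = TreeNode("r", [TreeNode("a"), TreeNode("b", [TreeNode("c")])])
assert root.size() == 4 and root.outdegree() == 2 and not root.is_leaf()
```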

Figure 2.11: Labeled tree example. Bold symbols refer to nodes and italic ones to labels. The tree in the figure is an n-ary tree with n = 4 and size 10, rooted at r and having leaves b, d, e, f, g, h, i.

2.4.2

Tree data applications

Tree structured data are present in many real-world problems, such as NLP with parse trees (an example is in Figure 2.12(a), in which the words are the leaves and the internal nodes carry the contextual information), semi-structured data such as HTML and XML (an example is in Figure 2.12(b): on the left the XML document and on the right its tree representation, in which the leaves represent the values and the internal nodes are the tags), and the search space of reasoning algorithms. Other examples are chemical molecules, in which atoms are represented by nodes and bonds are the edges. Even images can be represented as trees (Figure 2.12(c): the entire image is the root and nodes at deeper levels represent smaller zones of the image, down to the leaves representing single objects in the image).

Figure 2.12: Examples of tree structured data: (a) parse tree, (b) XML document, (c) image.

2.4.3

Tree isomorphism

Def. (Tree isomorphism): Let T = (V, E, X) and T' = (V', E', X'); they are isomorphic if there exists a bijection f : V → V' such that (u, v) ∈ E ⟺ (f(u), f(v)) ∈ E'.

An equivalent definition can be done using the concept of skeleton.

Def. (Skeleton tree): Let T = (V, E, X); its skeleton is skel(T) = (V, E). In other words, the skeleton just focuses on the structure.

In this way, two trees are isomorphic if they have the same skeleton.

Figure 2.13 shows three isomorphic trees: as stated in the definition of isomorphism, labels are not considered. The example shows how reflecting subtrees (i.e. inverting the sibling order under the node that roots the subtree) does not affect the skeleton, because a bijection between nodes is still defined. In particular, pairs of nodes mapped by the bijection have the same color.

Figure 2.13: Example of three isomorphic trees; the bijection between nodes is shown with matching colors.
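For ordered trees, checking whether two trees share the same skeleton reduces to a recursive comparison that ignores labels; handling reflections as in Figure 2.13 would additionally require putting siblings into some canonical order first. A minimal sketch under the assumption that trees are encoded as nested lists of children (a hypothetical encoding, labels omitted):

```python
def same_skeleton(t1, t2):
    """skel(t1) == skel(t2) for ordered trees given as nested lists of children
    (labels are ignored; only the number and the order of the children matter)."""
    if len(t1) != len(t2):
        return False
    return all(same_skeleton(a, b) for a, b in zip(t1, t2))

# The first two trees differ only in their (omitted) labels, so they share the same skeleton;
# the third is a reflected version, which is a different *ordered* skeleton.
t_a = [[], [[]]]          # root with a leaf child and a child that has one leaf child
t_b = [[], [[]]]
t_c = [[[]], []]
print(same_skeleton(t_a, t_b), same_skeleton(t_a, t_c))   # True False
```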

2.4.4

Tree transductions

A transduction is simply a mapping from elements of an input domain into elements of an output domain. Attention is focused on structural transductions, in which both domains are structured. Let U# and Y# be respectively the input and output structural domains; then a structural transduction is a function F : U# → Y#.

There are two main types of tree, and more generally structural, transduction:

structure-to-structure: in the literature also referred to as input-output isomorphic transduction [48]; it describes the case in which input and output domains are the same (e.g. tree-to-tree, sequence-to-sequence) and, moreover, input and output are isomorphic (i.e. U# = Y# and skel(x) = skel(y) with x, y ∈ U#). In this case F can be decomposed as F = F_out ∘ F_enc with:

F_enc : U# → X#,    F_out : X# → Y#    (2.12)

which are the encoding and output transductions, respectively. Figure 2.14 shows a tree-to-tree transduction: both the encoding and the output transduction produce a tree isomorphic to the input one.

structure-to-element: also referred to as supersource transduction [48]; it is the case in which the input is mapped into the flat (vectorial) domain, which is still a (trivially) structured domain. In order to realize this type of transduction, a state mapping function which maps a structured state into a flat one is needed:

S : X# → X    (2.13)

in this case F = F_out ∘ S ∘ F_enc.

Different state mapping functions can be used; some examples are:

mean state mapping: used in the case of real-valued encodings, the vector is obtained by performing the element-wise mean of the states of every node of the encoded structure;

mode state mapping: used in the case of discrete-valued encodings, the vector is obtained by performing the element-wise mode of the states of every node of the encoded structure; also referred to as voting;

root state mapping: used only in the case of tree structured encodings, it takes the state of the root.

Figure 2.15 shows a tree-to-element transduction: the encoding produces a tree isomorphic to the input one and the state mapping function produces a single element.
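A minimal sketch of the three state mapping functions above, assuming the encoded structure is given as an array with one state vector per node and that the root is the first row (both are illustrative assumptions):

```python
import numpy as np

def mean_state_mapping(states):
    """Element-wise mean of all node states (real-valued encodings)."""
    return np.mean(states, axis=0)

def mode_state_mapping(states):
    """Element-wise mode of all node states (discrete encodings), i.e. 'voting'."""
    states = np.asarray(states).astype(int)
    return np.array([np.bincount(col).argmax() for col in states.T])

def root_state_mapping(states, root_index=0):
    """State of the root node (tree-structured encodings only)."""
    return np.asarray(states)[root_index]

if __name__ == "__main__":
    enc = np.array([[1, 2, 0],       # one row per node of enc(x), root first
                    [1, 0, 0],
                    [3, 2, 0]])
    print(mean_state_mapping(enc), mode_state_mapping(enc), root_state_mapping(enc))
```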

Figure 2.14: A tree-to-tree transduction. Encoding and output are isomorphic to the input.

Figure 2.15: A tree-to-element transduction. The encoding is isomorphic to the input, while the output is a single element due to the application of the state mapping function.


2.5

Learning in tree structured domain

This section provides a brief summary of the approaches that can be used for learning on tree structured domains, already presented in Section 2.3. Table 2.1 shows why the core of the model proposed in this thesis is the TreeLSTM: like the RecNN, it is the most powerful class of models for the tree domain, with no assumptions on the data, and in addition the gradient vanishing/explosion problem is solved by the LSTM unit; moreover, it can potentially learn every kind of transduction (although non-isomorphic tree transduction is still an open problem). Furthermore, as can be seen in Section 2.6, LSTM-based models had not been proposed yet for trees, even though the LSTM obtained relevant results in the sequence domain.

Name     | Type                | Transduction type | Pros                                                             | Cons
RecNN    | Deep Neural Network | Any               | Powerful, expressive and generic model (universal approximator)  | Slow training; gradient vanishing/explosion
TreeESN  | Neural Network      | Isomorphic (1)    | Efficient computation                                            | Assumptions on data; only short-term dependencies; RecNN approximation
TreeLSTM | Deep Neural Network | Any               | Gradient problems solved                                         | Slow training
Kernel   | Kernel              | None (2)          | Compositionality; allows to "linearize" the problem              | Computational costs; similarity function must be defined by hand
TreeHMM  | Generative          | Any               | Unsupervised learning allowed                                    | Computational costs; Markovian assumption

(1) i.e. structure-to-structure, structure-to-element
(2) used with a model (e.g. SVM), kernels can be used for structure-to-element transduction

Table 2.1: Comparison of the approaches for learning on tree structured domains.

2.6

State-of-the-art on tree transduction

This section provides an overview of state-of-the-art techniques for tree transductions. Works that introduced innovations on this specific research topic will also be mentioned.

2.6.1

A generic framework for learning on structured domain

The first mention goes to the first generic framework for structured learning [49], used as a basis by recent works. First of all, it can deal with Directed Ordered Acyclic Graphs (DOAGs), so it can also process several simpler structures (i.e. flat data, sequences, trees). Moreover, it gives the possibility to choose the approach by which learning will work, in particular the RecNN for the neural network approach and the TreeHMM (more specifically the Hidden Recursive Model (HRM)) for the generative approach. Its limitation is that it only allows learning isomorphic transductions, and it is no longer competitive because it suffers from some problems: one of them is that it uses classical RecNNs, which are known to suffer from gradient vanishing/explosion, a problem wiped out by the LSTM. The main problem of the HRM is the high computational cost due to the bottom-up approach, which allows dealing only with trees having low outdegree (i.e. binary); however, [46] recently proposed an approach to reduce the computational complexity of such a bottom-up approach.

2.6.2

IO-BHTMM

In [50] the first input-driven generative model for tree-structured data is proposed. More precisely, for any pair (x, y) of input and output trees, this kind of model is used to learn a conditional generative model P(y|x) for the structured output y, given the input x. As in a standard HMM, learning is achieved by introducing for each node u a hidden state variable Q_u that allows simplifying the conditional distribution P(y|x), ruled by the independence assumptions defined by the underlying hidden Markov model. In this conditional scenario, the state variable emits the output label y_u given the input x_u, resulting in the conditional emission P(y_u | Q_u = i, x_u). For this model, the underlying hidden Markov model is the Bottom-Up Hidden Tree Markov Model (Section 2.3.2) in which, as said before, the hidden state Q_u of a node is determined by a state transition from the joint configuration of its children. In this conditional scenario, the input-conditional contribution of the l-th child to the state transition of the parent u is P(Q_u = i | Q_{ch_l(u)} = j, x_u).

The exact form of the emission and transition distributions depends on the input type (see the reference for more details) and, as for any HMM, the computation of the likelihood is made possible by factorization and marginalization; learning is performed by EM on the likelihood and prediction by the Viterbi algorithm.

2.6.3

TreeESN

In [39] the efficient approach of the ESN is extended to the tree structured domain. In this case, the reservoir units perform a recursive state transition function; for a generic node u with input x_u and k children it is:

h(u) = f(W_in x(u) + \sum_{i=1}^{k} Ŵ h_{ch_i(u)})    (2.14)

which resembles the RecNN unit. Given an input tree t, the encoding phase is performed as follows: the generalized reservoir architecture is unfolded over t and the recursive state transition is performed on every node in a bottom-up way, following the structure of t (as in a RecNN). In this way the encoding of every tree is isomorphic to the input. The initialization of the reservoir follows the same rules adopted in the ESN for sequential data, and the reservoir is then left untrained.

2.6.4

Kernel-based models

Kernel-based models are the most widely used approach for learning structure-to-element tree transductions. Several kernels have been designed for this purpose; some of them are listed below (with examples in Figure 2.16):

SubTree Kernel (ST) [51]: to compute similarity it considers only complete subtrees (i.e. all the descendants of the subtree root, down to the leaves);

SubSet Tree Kernel (SST) [52]: to compute similarity it considers every type of subtree (i.e. descendants of the subtree root, considering all of its children);

Partial Tree Kernel (PT) [52]: similar to SST, but even a subset of the children can be considered;

PAK-ST, PAK-SST, PAK-PT [53]: positional versions of the above kernels (i.e. a different subtree order leads to different results);

Subpath Kernel [54]: to compute similarity it considers every path from the subtree root to every node (it is a particular case of the PT Kernel, considering only one child at a time). Only this kernel considers single nodes as trees.

Figure 2.16: Kernel comparison: a tree to elaborate (a) and the subtrees considered by the kernels listed above: (b) ST kernel, (c) SST kernel, (d) PT kernel, (e) Subpath kernel.

2.6.5

Doubly Recurrent Tree Decoder

While this thesis was being worked on, an interesting paper was published [55]. This work presents a Doubly Recurrent Neural Network (DRNN) (graphical representation in Figure 2.17(a); a detailed description follows) used to generate a tree starting from its encoding. It allows modeling both vertical (parent-children) and horizontal (sibling-sibling) information flows, so it can determine: 1) whether a node is a leaf, hence whether it has to generate its first child; 2) whether it is the last sibling, hence whether it has to generate its successor. This is done using two RNNs.

Formally, let T = (V, E, X) be a tree and i ∈ V a generic node with parent p(i) and previous sibling s(i); the ancestral and fraternal hidden states are updated via:

h^a_i = g^a(h^a_{p(i)}, x_{p(i)})    h^f_i = g^f(h^f_{s(i)}, x_{s(i)})    (2.15)

where h^a_{p(i)}, x_{p(i)}, h^f_{s(i)} and x_{s(i)} are the vectors representing the parent's and previous sibling's hidden states and values, respectively. The functions g^a and g^f apply one step of the two separate RNNs.

These updated states are used to compute a predictive hidden state:

h^{(pred)}_i = tanh(U^f h^f_i + U^a h^a_i)    (2.16)

which is used to obtain the depth and width growing probabilities and the node output:

p^a_i = σ(u^a · h^{(pred)}_i)    p^f_i = σ(u^f · h^{(pred)}_i)    (2.17)

o_i = softmax(W h^{(pred)}_i + α_i v^a + ϕ_i v^f)    (2.18)

where p^a_i, p^f_i ∈ [0, 1] can be interpreted as the probability that the node has a child and a next sibling, respectively.

The values α_i and ϕ_i are topological indicators (in particular, during the test phase they are obtained from p^a_i and p^f_i, while during training they are set to the gold values; training details are given in Section 2.6.5.1).

In the end, the node output o_i is used to generate the node label x_i.
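A sketch of the per-node computation of Eqs. (2.16)-(2.18), given ancestral and fraternal states already produced by the two RNNs; names, shapes and the way the gold topological indicators are passed in are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def drnn_node(h_a, h_f, p, alpha=None, phi=None):
    """Predictive state, topological probabilities and label output for one node."""
    h_pred = np.tanh(p["U_f"] @ h_f + p["U_a"] @ h_a)                  # Eq. (2.16)
    p_a = sigmoid(p["u_a"] @ h_pred)                                   # P(node has a child), Eq. (2.17)
    p_f = sigmoid(p["u_f"] @ h_pred)                                   # P(node has a next sibling)
    # at test time the topological indicators come from p_a, p_f; at training time gold values are used
    alpha = p_a if alpha is None else alpha
    phi = p_f if phi is None else phi
    o = softmax(p["W"] @ h_pred + alpha * p["v_a"] + phi * p["v_f"])   # Eq. (2.18)
    return o, p_a, p_f

if __name__ == "__main__":
    M, V = 6, 10                                  # hidden size and label vocabulary size
    rng = np.random.default_rng(6)
    p = {"U_a": rng.normal(size=(M, M)), "U_f": rng.normal(size=(M, M)),
         "u_a": rng.normal(size=M), "u_f": rng.normal(size=M),
         "W": rng.normal(size=(V, M)), "v_a": rng.normal(size=V), "v_f": rng.normal(size=V)}
    o, p_a, p_f = drnn_node(rng.normal(size=M), rng.normal(size=M), p)
    print(o.argmax(), float(p_a), float(p_f))
```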

There are two main innovations that this model shows:

• topological expansion choices are made explicitly, using probabilities, instead of the classical implicit method using special tokens (padding with tokens is useful in sequences, but for trees it is computationally expensive, because every leaf must be padded);

• topological expansion choices are made before label generation (other proposed methods do both at the same time).

Figure 2.17: (a) DRNN unit. (b) Example of a sampled output tree starting from its encoding. Barred arrows indicate the end of the tree growth (i.e. low topological probability).


2.6.5.1 DRNN training

Training is performed by BPTS [34] and is composed of the following steps:

forward: node outputs are computed in a top-down fashion on the unfolded version of the network, following the order defined by the tree (usually a breadth-first pre-order). Then both types of losses are computed: the cross-entropy loss between the label x_i and the gold label, and the binary cross-entropy loss between the topological values p^a_i, p^f_i and the gold values α_i and ϕ_i;

teacher forcing: after a node label and the relative loss are computed, the node label is replaced with its gold value, so that the children and next siblings of the node are fed with the correct value. The same holds for the topological information;

backward: in a bottom-up way, every node receives gradients from its children and next sibling, computes the internal gradients with respect to the label and topological predictions and backpropagates them to its previous sibling and parent.

The loss function to minimize is:

L(x̂) = \sum_{i ∈ V} [ L^{label}(x_i, x̂_i) + L^{topo}(p_i, p̂_i) ]    (2.19)

The decoupled nature of this loss would allow weighting the two objectives differently, to emphasize either topology or label prediction accuracy, but the authors of the work left this for future work.

Figure 2.17(b) shows an example of output tree sampling starting from its encoding. Straight arrows indicate the propagation of ancestral states (i.e. parent to child) and dashed arrows indicate the propagation of fraternal states (i.e. node to next sibling). Barred arrows indicate the end of the topological expansion (i.e. low topological probability).


Chapter 3

Deep Learning for tree

transduction learning

This chapter provides a detailed description of the proposed model. It is a deep neural network, based on the TreeLSTM, for learning tree transductions (i.e. structure-to-structure, structure-to-element, structure-to-substructure) and it can process different types of trees thanks to its generality (e.g. it can process trees with any kind of label, of any size, with any outdegree and with ordered siblings). The elaboration can be divided into two main phases:

Bottom-up encoding by TreeLSTM: the input tree is encoded into a fixed size vector by one of the two possible TreeLSTM variants. This is the core of the model;

Output generation from the encoding: usually performed by a classic ML model, as in every deep model.

Another relevant feature of the model is that it can be trained end-to-end, i.e. for training it just needs the input tree and the associated target, without any intermediate representation. Last, but not least, the proposed model does not need any kind of a priori knowledge about the data.

The encoding process is described in Section 3.1, and Section 3.2 describes the possible layers for the output generation. Section 3.3 gives a detailed description of how the different types of transductions are performed; this section also includes implementation details about the tree representation, the encoders, the whole model and the utilities used to perform model selection.


3.1

Recursive neural networks for bottom up tree

encoding

We present the two categories of TreeLSTM used by the proposed model for encoding a tree into a fixed size vector [56]. Which one to apply depends on the structural information of the data (e.g. whether the outdegree is known, whether positionality is relevant). The choice of a bottom-up approach is motivated by the fact that, both with the TreeHMM [46] and in sequence learning with LSTM [3] (i.e. reversing the sequence), better results are achieved than with the top-down approach (which is the straightforward one for sequences). Another observation is that the bottom-up approach on trees allows performing the encoding transduction and the root state mapping function at the same time.

3.1.1

Child-Sum TreeLSTM

This type of TreeLSTM is used to encode trees with no particular relation among siblings. Let M be the memory size, I the input size and ch(j) the children collection (of size K) of the generic node j; its state transition equations follow:

h̃_j = \sum_{k ∈ ch(j)} h_k    (3.1)
u_j = tanh(W^{(u)} x_j + U^{(u)} h̃_j + b^{(u)})    (3.2)
i_j = \sigma(W^{(i)} x_j + U^{(i)} h̃_j + b^{(i)})    (3.3)
o_j = \sigma(W^{(o)} x_j + U^{(o)} h̃_j + b^{(o)})    (3.4)
f_{jk} = \sigma(W^{(f)} x_j + U^{(f)} h_k + b^{(f)}),    ∀k ∈ ch(j)    (3.5)
c_j = i_j ⊙ u_j + \sum_{k ∈ ch(j)} f_{jk} ⊙ c_k    (3.6)
h_j = o_j ⊙ tanh(c_j)    (3.7)

where the terms are explained in Table 3.1, σ is the sigmoid function and ⊙ the elementwise product. As every LSTM it has the three gates; in particular there is a forget gate for every child, but all of them share the same parameters. Child-Sum TreeLSTMs are suitable for processing unordered trees or trees with a high or unknown outdegree.
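The following NumPy sketch applies the node update of Eqs. (3.1)-(3.7) recursively in a bottom-up fashion. Parameter names mirror Table 3.1, but the code is an illustrative re-implementation under simplifying assumptions (a plain node container with a label vector and a children list), not the Torch code described in Section 3.3.1.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def child_sum_node(x_j, child_states, p):
    """Eqs. (3.1)-(3.7): child_states is a list of (h_k, c_k) pairs, p holds W*, U*, b*."""
    M = p["b_u"].shape[0]
    h_tilde = sum((h for h, _ in child_states), start=np.zeros(M))        # Eq. (3.1)
    u = np.tanh(p["W_u"] @ x_j + p["U_u"] @ h_tilde + p["b_u"])           # Eq. (3.2)
    i = sigmoid(p["W_i"] @ x_j + p["U_i"] @ h_tilde + p["b_i"])           # Eq. (3.3)
    o = sigmoid(p["W_o"] @ x_j + p["U_o"] @ h_tilde + p["b_o"])           # Eq. (3.4)
    c = i * u
    for h_k, c_k in child_states:
        f_jk = sigmoid(p["W_f"] @ x_j + p["U_f"] @ h_k + p["b_f"])        # Eq. (3.5), shared parameters
        c = c + f_jk * c_k                                                # Eq. (3.6)
    h = o * np.tanh(c)                                                    # Eq. (3.7)
    return h, c

def encode(node, p):
    """Bottom-up encoding: the (h, c) of the root summarizes the whole tree."""
    return child_sum_node(node.label, [encode(ch, p) for ch in node.children], p)

if __name__ == "__main__":
    I, M = 3, 4
    rng = np.random.default_rng(3)
    p = {f"{w}_{g}": rng.normal(size=(M, I if w == "W" else M)) for w in "WU" for g in "uiof"}
    p.update({f"b_{g}": np.zeros(M) for g in "uiof"})

    class N:                                       # tiny node container for the example
        def __init__(self, label, children=()):
            self.label, self.children = np.asarray(label, float), list(children)

    tree = N([1, 0, 0], [N([0, 1, 0]), N([0, 0, 1])])
    h_root, c_root = encode(tree, p)
    print(h_root.shape, c_root.shape)              # (4,) (4,)
```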


3.1.2

N-ary TreeLSTM

It allows to discriminate children by their position but needs a prior knowledge about the max outdegree N . Let M the memory size, I the input size and j the generic node, its behavior is ruled by the following equations:

uj = tanh  W(u)xj + N X `=1 U`(u)hj`+ b(u)  (3.8) ij = σ  W(i)xj+ N X `=1 U`(i)hj`+ b(i)  (3.9) oj = σ  W(o)xj + N X `=1 U`(o)hj`+ b(o)  (3.10) fjk = σ  W(f )xj+ N X `=1 Uk`(f )hj`+ b(f )  , ∀k = 1, 2, . . . , N (3.11) cj = ij uj + N X `=1 fj` cj` (3.12) hj = oj tanh(cj) (3.13)

where the terms are explaind in Table 3.2, σ sigmoid function and elmentwise prod-uct.

The introduction of separate parameter matrices for each child allows the model to learn more fine-grained conditioning on the states of a unit’s children. Equation (3.11) shows a parametrization of the k-th child’s forget gate fjk that contains off-diagonal

parameter matrices Uk`(f ), k 6= `. This parameterization allows more flexible control of information propagation from child to parent. Moreover it can be used to rule the in-fluence between siblings (in particular the hidden state of a node can affect the forget gate of its sibling).

It must be noted that the authors tested the performance of this encoder only on Constituency Trees (i.e. binary trees, so N = 2, receiving input only at the leaves).
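To make the role of the off-diagonal matrices explicit, the sketch below (illustrative names, not the thesis code) computes the forget-gate contribution of Equations (3.11)-(3.12) for an N-ary node. Here Uf is an N × N table of M × M matrices, so Uf[k][l] connects the l-th child's hidden state to the k-th forget gate, and missing children are assumed to be represented by zero hidden and cell states:

```lua
-- Forget-gate term of Equations (3.11)-(3.12) for an N-ary TreeLSTM node.
local function nary_forget_contribution(x, child_h, child_c, Wf, Uf, bf, N)
  local contrib = torch.zeros(bf:size(1))
  for k = 1, N do
    local pre = Wf * x + bf                  -- W^(f) x_j + b^(f)
    for l = 1, N do
      pre = pre + Uf[k][l] * child_h[l]      -- sum_l U^(f)_{kl} h_{jl}
    end
    local f_k = torch.sigmoid(pre)           -- forget gate of the k-th child
    contrib:add(torch.cmul(f_k, child_c[k]))
  end
  return contrib                             -- summation term of Equation (3.12)
end
```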


Term      Description                              Type     Dimensions   Adaptive parameter
x_j       Node's input label                       Vector   I            No
h̃_j       Sum of the children's hidden states      Vector   M            No
u_j       Node's update state                      Vector   M            No
i_j       Node's input gate state                  Vector   M            No
o_j       Node's output gate state                 Vector   M            No
f_j       Node's forget gate states                Matrix   K × M        No
c_j       Node's memory cell state                 Vector   M            No
h_j       Node's hidden state                      Vector   M            No
W^(*)     Input-to-** weights                      Matrix   M × I        Yes
U^(*)     Children-to-** weights                   Matrix   M × M        Yes
b^(*)     ** bias                                  Vector   M            Yes

* can be u (update), i (input gate), o (output gate), f (forget gate)
** stands for node update, input gate, output gate, forget gate for * = u, i, o, f respectively

Table 3.1: Explanation of terms in the Child-Sum TreeLSTM equations.

Term        Description                                          Type     Dimensions   Adaptive parameter
x_j         Node's input label                                   Vector   I            No
h_jℓ        Node's ℓ-th child hidden state                       Vector   M            No
u_j         Node's update state                                  Vector   M            No
i_j         Node's input gate state                              Vector   M            No
o_j         Node's output gate state                             Vector   M            No
f_j         Node's forget gate states                            Matrix   N × M        No
c_j         Node's memory cell state                             Vector   M            No
c_jℓ        Node's ℓ-th child memory cell                        Vector   M            No
h_j         Node's hidden state                                  Vector   M            No
W^(*)       Input-to-** weights                                  Matrix   M × I        Yes
U_ℓ^(*)     ℓ-th child-to-** weights                             Matrix   M × M        Yes
U_kℓ^(f)    Weights from the ℓ-th child to the k-th forget gate  Matrix   M × M        Yes
b^(*)       ** bias                                              Vector   M            Yes

* can be u (update), i (input gate), o (output gate), f (forget gate)
** stands for node update, input gate, output gate, forget gate for * = u, i, o, f respectively

Table 3.2: Explanation of terms in the N-ary TreeLSTM equations.


3.2 Models for output generation

For output generation, any kind of model dealing with flat (vectorial) domains can be used. For the experiments, two kinds of simple output layers have been used.

3.2.1 Classification output layer

The output layer used for classification tasks is the following:

$$o_i = \mathrm{LogSoftmax}\!\left(W_h h + W_c c + b\right) \tag{3.14}$$

where O is the output size, M the memory size, and the remaining terms are explained in Table 3.3.

3.2.2 Regression output layer

The output layer used for regression tasks is the following:

$$o_i = f\!\left(W_h h + W_c c + b\right) \tag{3.15}$$

where f is the activation function (e.g. tanh, sigmoid) and the terms are explained in Table 3.3.

Term   Description                       Type     Dimensions   Adaptive parameter
o_i    Layer's output                    Vector   O            No
h      Encoder's hidden state            Vector   M            No
c      Encoder's memory cell             Vector   M            No
W_h    Hidden state-to-output weights    Matrix   O × M        Yes
W_c    Memory cell-to-output weights     Matrix   O × M        Yes
b      Bias                              Vector   O            Yes

Table 3.3: Explanation of terms in the output layer equations.
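Both layers can be built from standard nn modules. The following sketch shows a possible wiring (the exact construction in factory.lua may differ, and the helper name make_output_layer is illustrative): the pair {h, c} coming from the encoder is projected by two linear maps, summed, and passed through LogSoftMax for classification (Equation 3.14) or through the chosen activation for regression (Equation 3.15).

```lua
-- Possible wiring of the output layers of Section 3.2 with standard nn modules.
-- Note that the two nn.Linear modules each carry a bias, which is equivalent to
-- the single b of Equations (3.14)-(3.15) once the two biases are merged.
require 'nn'

local function make_output_layer(mem_size, out_size, task, activation)
  local layer = nn.Sequential()
  local proj = nn.ParallelTable()
  proj:add(nn.Linear(mem_size, out_size))   -- W_h * h (+ bias)
  proj:add(nn.Linear(mem_size, out_size))   -- W_c * c (+ bias)
  layer:add(proj)
  layer:add(nn.CAddTable())                 -- W_h h + W_c c + b
  if task == 'classification' then
    layer:add(nn.LogSoftMax())              -- Equation (3.14)
  else
    layer:add(activation or nn.Tanh())      -- Equation (3.15), f = tanh by default
  end
  return layer
end

-- usage: local out = make_output_layer(mem_size, out_size, 'classification'):forward({h, c})
```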

3.3 Tree2Tree in detail

The proposed model is a generalized model for tree transductions providing a variety of structural alternatives. First of all, it can be used for both classification and regression tasks, since the output layer can be customized. The second feature is the possibility to choose the type of encoder (Child-Sum or N-ary). Another parametrization is the use of the memory cell of the LSTM unit in the output layer (usually only the hidden state is used, and the effect of also using the memory cell has never been investigated before in the literature). Finally, it allows learning both structure-to-structure and structure-to-element transductions. In more detail:

structure-to-structure: as shown in Figure 3.1, in this case the model is composed only of the TreeLSTM unit augmented with the output layer, so an output label can be generated for every node. During processing (training or prediction) both the TreeLSTM unit and the output layer are unfolded over the tree;

structure-to-element: as discussed in Section 2.4.4, the classical pipeline is: encoding transduction → state mapping → output transduction. In addition to this, the model allows applying the aggregation function (in this case we refer to the labeling function) to the output tree in the following way: instead of performing the state mapping after the encoding, a structure-to-structure transduction is performed (for the training, a target tree is created from the target value, using the latter as gold label of every node), then the aggregation function is applied to the output produced at each node (Figure 3.2), obtaining a vector. In particular, a weighted mean is applied for regression and voting for classification (a kind of Bagging [57] or Boosting [58] that uses the outputs of different nodes instead of the outputs of different models); a sketch of this aggregation step is given after this list. Figure 3.3 shows how the model behaves for a structure-to-element transduction using the root mapping function: in this case only the encoding module is unfolded over the structure, then the output layer is attached to the root node and the output is produced;

structure-to-substructure: this term refers to the case in which the output tree O is a subtree of the input tree I (i.e. skel(O) ≤ skel(I)). Conceptually it can be considered a non-isomorphic transduction. The model can learn this type of transduction with a trick: a structure-to-structure transduction is performed, assigning a special label value to the nodes that have to be considered nonexistent.
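The aggregation step used for structure-to-element transductions can be sketched as follows (an illustrative helper, not taken from the thesis code): outputs is the list of per-node output vectors produced by the structure-to-structure pass; voting is used for classification, while for regression a uniformly weighted mean is shown (the thesis uses a weighted mean, the weights are omitted here for brevity).

```lua
-- Aggregation of per-node outputs into a single output element.
local function aggregate(outputs, task)
  if task == 'classification' then
    local votes = {}
    for _, o in ipairs(outputs) do
      local _, idx = o:max(1)                 -- each node votes for its arg-max class
      local cls = idx[1]
      votes[cls] = (votes[cls] or 0) + 1
    end
    local best_cls, best_count = nil, -1
    for cls, count in pairs(votes) do
      if count > best_count then best_cls, best_count = cls, count end
    end
    return best_cls
  else
    local mean = outputs[1]:clone()           -- uniformly weighted mean
    for k = 2, #outputs do mean:add(outputs[k]) end
    return mean:div(#outputs)
  end
end
```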



Figure 3.1: Structure-to-structure processing example. The elaboration unit unfolded over the tree (blue rectangle) is the encoder (light blue rectangle) augmented with the output layer (dark blue circle). During the processing, the elaboration unit is unfolded over the input tree, producing the isomorphic output tree whose labels are the outputs obtained at each node during the unfolding.

Figure 3.2: Structure-to-element with vote labeling example. First a structure-to-structure transduction is performed, then the voting function is applied to get the output element.

Figure 3.3: Structure-to-element with root labeling processing example. The elaboration unit is composed only of the encoder (light blue rectangle). During the processing, the elaboration unit is unfolded over the input tree producing the encoding tree, then the output layer (dark blue circle) is attached to the root to obtain the output element.


3.3.1 Implementation details

In this subsection the implementation details are presented. The entire work is enclosed in a Lua module (analogous to a Java package) called tree2tree; the components of the module are described in the respective subsections, providing commented code snippets only for the relevant parts.

3.3.1.1 Torch framework

There are several ML frameworks suitable for Deep Learning (e.g. Theano [59], TensorFlow [60], Caffe [61]), but the choice fell on Torch, for the reasons explained below.

Torch [62] is a scientific computing framework built on Lua that runs on the LuaJIT compiler. It has strong CUDA and CPU (coded in C) backends and contains well-developed, mature machine learning and optimization packages. It uses a unified, efficient data representation (i.e. Tensor), and the neural network libraries (i.e. nn, nngraph) can be used to build arbitrary acyclic computation graphs with automatic differentiation functionalities (i.e. a module has a :forward() function that computes the output for a given input, flowing the input through the network, and a :backward() function that differentiates each parameter in the network w.r.t. the gradients). The latest version, Torch7, has easy-to-use multi-GPU support and parallelization packages that make it very powerful for training deep architectures. Torch has a large community of developers and is actively used within large organizations like Facebook, Google and Twitter, which make a lot of their code open source.
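A tiny example of this computation flow, using only standard nn modules:

```lua
-- :forward() computes the output of the network; :backward() accumulates the
-- gradients w.r.t. the input and the parameters, given the gradient of the
-- loss w.r.t. the output.
require 'nn'

local net = nn.Sequential()
net:add(nn.Linear(10, 5))
net:add(nn.Tanh())

local x = torch.randn(10)                       -- a random input vector
local y = net:forward(x)                        -- forward pass
local gradOutput = torch.ones(5)                -- gradient coming from the loss
local gradInput = net:backward(x, gradOutput)   -- backward pass
```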

It is easy to deduce that Torch has been chosen for its efficiency ([63] shows that it is the most efficient deep ML framework for the most common deep architecture computations), ease of use and support. In addition to the above, the implementation of one of the encoders used as reference for the experimental assessment, i.e. the Child-Sum TreeLSTM, was already available in Torch.

3.3.1.2 Tree

The class Tree.lua implements the tree data structure used as input and output of the model processing. The implementation follows the recursive nature of trees, by which a tree is represented by its root and the set of its children. For each node in the tree, the following information is stored:

id: unique identification code within the tree;


gold label: gold label of the node used as target value during the training;

output: output label of the node, obtained during the forward step of the model computation;

children: list of the node’s children;

num children: size of the above list.

The class moreover provides methods to add and remove nodes, to get depth and size, and visit and print methods. The most used method is the one that instantiates a tree from its DFS string representation (an example is shown in Figure 3.4).

(1 (6 (7)(4)) (3 (2)))

Figure 3.4: Example of a tree instantiated from its DFS string representation. Every opening and closing bracket means going one level down and up, respectively.
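A possible recursive-descent parser for this representation is sketched below (illustrative code, the actual Tree.lua may differ); each node is represented, as described above, by an id, the list of its children and their number.

```lua
-- Parse a DFS string such as '(1 (6 (7)(4)) (3 (2)))' into a node table.
local function parse_tree(s, pos)
  pos = pos or 1
  assert(s:sub(pos, pos) == '(', 'expected "("')
  pos = pos + 1
  -- read the node id (digits, possibly preceded by spaces)
  local id = s:match('^%s*(%d+)', pos)
  pos = pos + #s:match('^%s*%d+', pos)
  local node = { id = tonumber(id), children = {}, num_children = 0 }
  -- parse the children until the matching ')'
  while true do
    local next_char = s:match('^%s*(.)', pos)
    if next_char == '(' then
      pos = pos + #s:match('^%s*', pos)
      local child
      child, pos = parse_tree(s, pos)
      node.num_children = node.num_children + 1
      node.children[node.num_children] = child
    else  -- ')'
      pos = pos + #s:match('^%s*%)', pos)
      return node, pos
    end
  end
end

-- parse_tree('(1 (6 (7)(4)) (3 (2)))') rebuilds the tree of Figure 3.4
```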

3.3.1.3 TreeLSTM

TreeLSTM.lua is the abstract class, extending nn.Module of Torch, used to describe the common structure and behavior of the two encoders. It also manages the allocation and deallocation of the modules necessary for the unfolding. The implementation of the class is shown in code snippet 1 in the Appendix.
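The Torch class idiom used for this kind of hierarchy is the following (a minimal skeleton with illustrative fields, not the actual TreeLSTM.lua):

```lua
-- tree2tree.TreeLSTM extends nn.Module; the concrete encoders override
-- :forward()/:backward().
require 'nn'
tree2tree = tree2tree or {}

local TreeLSTM, parent = torch.class('tree2tree.TreeLSTM', 'nn.Module')

function TreeLSTM:__init(config)
  parent.__init(self)
  self.in_dim  = config.in_dim    -- input label size I
  self.mem_dim = config.mem_dim   -- memory size M
end

-- to be overridden by the Child-Sum and N-ary TreeLSTM classes
function TreeLSTM:forward(tree, inputs)
  error('forward not implemented in the abstract class')
end

function TreeLSTM:backward(tree, inputs, grad)
  error('backward not implemented in the abstract class')
end
```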

3.3.1.4 Encoders

The classes implementing the encoders (i.e. Child-Sum TreeLSTM and N-ary TreeLSTM) inherit from TreeLSTM.lua. In particular, they override the computation flow (i.e. :forward(), :backward()). Since the Child-Sum TreeLSTM was already available, we implemented the N-ary one; its snippets are 2, 3 and 4 in the Appendix.

3.3.1.5 Tree2TreeModel

The model is a class which is mainly composed of:

encoder: which can be either an instance of N-ary TreeLSTM or of Child-Sum TreeLSTM;


output layer: which can be one of the layers described in Section 3.2 or user defined.

In particular, depending on the type of transduction to learn, the elaboration unit unfolded during processing can change (as shown in Section 3.3). The code snippets referring to the Tree2Tree model are 5, 6 and 7 in the Appendix.
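A hypothetical, simplified constructor illustrating how these parts are composed (class names, fields and the make_output_layer helper from the sketch in Section 3.2 are illustrative, not the actual Tree2TreeModel code):

```lua
-- Illustrative composition of encoder and output layer into a Tree2Tree model.
local function build_tree2tree(config)
  local encoder
  if config.encoder == 'nary' then
    encoder = tree2tree.NaryTreeLSTM{ in_dim = config.in_dim,       -- class names are assumptions
                                      mem_dim = config.mem_dim,
                                      max_outdegree = config.N }
  else
    encoder = tree2tree.ChildSumTreeLSTM{ in_dim = config.in_dim,
                                          mem_dim = config.mem_dim }
  end
  local output_layer = make_output_layer(config.mem_dim, config.out_dim, config.task)
  -- structure-to-structure: encoder + output layer are unfolded on every node;
  -- structure-to-element with root labeling: only the encoder is unfolded and
  -- the output layer is applied to the root state (see Section 3.3).
  return { encoder = encoder,
           output_layer = output_layer,
           transduction = config.transduction }
end
```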

3.3.1.6 Output layers

Output layers are contained in the Lua module factory.lua. As an example, only the code snippet implementing the output layer used for classification is shown (snippet 8 in the Appendix).

3.3.1.7 Validation

For the validation, a module called Utility validation.lua has been created. The module encloses the utility functions needed to perform the (parallelized) validation, such as:

validation thread: given the options of the validation - e.g. the model to validate, the hyper-parameters and the values on which to perform model selection, the type of validation (i.e. normal or k-fold) and its parameters (e.g. number of folds, stopping criteria) - it generates all the configurations on which to perform model selection (a sketch of this enumeration is given after this list) and returns a report about the whole validation together with the best model configuration;

parent validation: the code of the master thread performing the validation; it instantiates the workers and performs the static scheduling (i.e. given all the tasks and the number of workers, it distributes the tasks to the workers). Moreover, it splits the dataset into the folds;

worker validation: it implements the behaviour of each worker. In particular, each worker receives a list of tasks (i.e. a subset of the parameter combinations) and performs validation for each of them. Finally, it sends the results to the parent;

k fold validation: it receives model, configuration and data splits. For each split it performs a normal validation phase. It also collects information about the folds (e.g. average and best validation errors);

validation: it receives model, training set, validation set and stopping criteria and performs a single validation run, i.e. while the stopping criterion is not met, it performs a training step on the training data and evaluates the error on the validation set, in a loop.
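The enumeration of the hyper-parameter grid used by validation thread can be sketched as follows (an illustrative helper in plain Lua, not the thesis code); space maps each hyper-parameter name to the list of values to try:

```lua
-- Build the cartesian product of the hyper-parameter values: each entry of the
-- returned list is one model configuration to be validated.
local function all_configurations(space)
  local names = {}
  for name in pairs(space) do names[#names + 1] = name end

  local configs = { {} }
  for _, name in ipairs(names) do
    local extended = {}
    for _, cfg in ipairs(configs) do
      for _, value in ipairs(space[name]) do
        local new_cfg = {}
        for k, v in pairs(cfg) do new_cfg[k] = v end
        new_cfg[name] = value
        extended[#extended + 1] = new_cfg
      end
    end
    configs = extended
  end
  return configs
end

-- e.g. all_configurations{ mem_dim = {50, 100}, learning_rate = {0.01, 0.001} }
-- yields 4 configurations to be distributed among the worker threads.
```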
