A generative approach for learning contexts in graphs

(1)

UNIVERSIT `

A DI PISA

Dipartimento di Informatica

Corso di Laurea Magistrale in Informatica

TESI DI LAUREA

A generative approach for learning contexts in

graphs.

Candidato:

Federico ERRICA

Relatori:

Dr. Davide BACCIU Prof. Alessio MICHELI

(2)

(3)

Abstract

The Structured Domain Learning research field deals with the development of new models that can cope with complex data such as sequences, trees or graphs. The fact that data has structural relationships, in contrast to flat representations that may lead to loss of information, is the key to gain expressiveness. Therefore we present a novel method for unsupervised learning with graphs of variable structure. It relies on a constructive approach to spread contextual information between vertexes, which works by stacking layers of probabilistic models. We exploit it in a supervised setting using a Support Vector Machine. After an experimental screening of our model’s representational capabilities, we empirically show that it performs well for tasks involving classification of trees and graphs.

(4)

(5)

Mathematical notation

Throughout the thesis we will make use of the following notation: Notation for Trees

Y is a tree dataset.

yn _{is the n-th observed tree belonging to Y. When clear from the context we}

will write y to ease the notation. Uyn is the size of a tree.

L is the maximum out-degree of a tree A node will be referred to as u or v.

yu stands for the emission label of a node u.

xu will be used to denote an input value, for non-homogeneous probabilistic

models.

C is the size of the hidden states’ alphabet.

Qrepresents a multinomial latent variable, with values ranging in {1, .., C}. The state of node u is denoted with qu.

Q = {Q1, ..., QUyn} is the set of all the latent variables associated to an observed

input, and q is a particular instantiation. Qu ∈ Q will be used to denote a

specific variable associated to node u. r is the root node of a tree.

(10)

pa(u) is the parent node of u.

ch(u) is the set of node u’s children. The j-th child is chj(u).

pos(u) is the position of a node among its siblings. leaves(u) denotes the set of leaves of node u.

y \ yr defines the set of a tree’s observed labels without its root value.

Notation for Graphs G is a graph dataset. g is a graph instance.

A vertex will be referred to as u or v.

V ert(g) is the set of vertexes belonging to g.

An edge au,v = (u, v) denotes an oriented connection between vertexes u and v.

Edg(g) is the set of edges of g, also called arcs.

lu denotes vertex u’s label (possibly multidimensional). The j-th component will

be denoted as lj u

Given a graph g, N (u) = {v|(v, u) ∈ Edg(g)} indicates the neighbourhood of a vertex u, which is the set of all vertexes with outgoing arcs towards u. We will consider an open neighbourhood i.e. one that does not include u itself, but it could also be closed in order to include it.

Notation for transductions

Tenc, or also Tenc, defines the encoding process.

Tout, or also Tout, defines the output process which follows Tenc.

τ is the state transition function, which produces a state value at time t. o will denote the output function.

(11)

11 NU _{and N}R _{will be, respectively, the dimension of the input and of the reservoir}

state, when dealing with Echo State Networks.

Win and ˆW, will be, respectively, the (both) untrained input weight matrix and

recursive reservoir weight matrix, when dealing with Echo State Networks. Input weights ¯w will be distinguished by reservoir weights ˆw.

(12)

(13)

Introduction

The goal of Artificial Intelligence (AI) has always been to study complex systems so as to reproduce their inner workings through mathematical and logical models. What we now call the classical AI comprises exact methods and heuristics to find local optima in a discrete/continuous search space, formal rules that describe how an agent should act in different situations, etc. However, the last three decades have seen a significant shift of interest towards the data-driven approach, giving rise to the field of Machine Learning (ML). We can distinguish three main branches of this discipline which differ by how data is exploited. Supervised learning deals with examples for which a known target value is known. On the other hand, Unupervised learning tries to capture underlying patterns in the data when supervision cannot be provided. Finally, Reinforcement learning uses a system of rewards and punishments in order to find a function that maps input data to suitable actions (or policy in the ML jargon).

In the early days of machine learning research models required data in flat, vec-torial form: this is still good enough for many applicative domains. A feedforward neural network (FNN) is the most striking example of a powerful approach capable of approximating any computable function (see the Universal Approximation The-orem [31]) when learning from vectorial data. Also, in the probabilistic framework the Naive Bayes Classifier gave surprising results for Natural Language Processing (NLP) tasks. These models, however, do not explicitly work with structured data; for instance, a sequence of vectors can be given as an input to a FNN only by concatenating its elements into a single longer vector. In other terms, these mod-els do not exploit the fact that data has relationships, which can provide helpful structural information.

Learning in structured domains (SDs, or SDL) aims at developing novel models that take into account the actual structure of the data. The goal is to learn a map,

(14)

called transduction T , between a structured information domain and a discrete or continuous space. An example of transduction can be to tell whether a sentence is written in correct English or not; another can be the translation to another language. A transduction often relies upon three properties:

• Causality • Stationarity • Adaptivity

A mapping is adaptive when it is learnt from observed data, and stationary if it is independent from the specific element of the structure considered. Instead, causality reflects the order that we give to structure i.e the transduction computed on an element v depends only on itself and its descendants. As we will see in detail, learning of a transduction T is often the result of other functions’ composition. Specifically, an input g is first transformed by an encoding function Tenc into a

“hidden” representation x(g). What happens next is guided by the task: if we are working on structure-to-structure isomorphic trasductions we need just an output function Tout which maps each element of x(g) into a real or discrete value. If we

are dealing with structure-to-element regression/classification an intermediate step is necessary, which is the application of a state mapping function X which takes an encoded representation x(g) and outputs a single (possibly multidimensional) value summarizing that encoding.

Another key aspect in SDL is the concept of contextual processing i.e. the response for each vertex is influenced by the whole information in input, according to its topology. This means that the above mentioned causality relation heavily influences context spreading by partially restricting the amount of information received by an element. It is worth stressing that an adaptive Tenc can be designed

to automatically determine how much information is needed to solve a specific task. This is particularly true in the cases of supervised learning.

In an unsupervised setting it is more difficult to tell whether we are capturing interesting relations in the data or not. This question is at the very basis of the Representation Learning research field [9], which studies how unsupervised models can be designed to extract robust features from the training examples. It is not uncommon to see machine learning architectures resulting from the combination

(15)

15 of unsupervised and supervised models. For instance, a possibility is to convert the dataset into encodings produced by an unsupervised model and then train a supervised one using them as a form of “augmented input”. This methodology usually goes under the name of Semi-supervised learning.

In order to provide some examples we briefly present ML models specifically designed to learn from structured data. Sequences are the simplest form of struc-ture, yet they can fit a large number of application domains ranging from NLP to temporal series. Indeed, a string can be thought as a sequence of characters, a sentence is made of words and measurements over time are naturally sequen-tial. On the other hand, choosing this kind of representation is entirely up to the user; for example, there could be more suitable forms in which we can represent a sentence, e.g. through its syntactic tree. There is no obvious better way, but some are certainly more informative than others. Models such Recurrent Neu-ral Networks (RNN) and Hidden Markov Models (HMM) have been developed to explicitly handle sequences. The former uses feedback connections that provide contextual information to subsequent elements of an input, whereas the latter ex-press uncertainty through a sequence of random variables. In addition, the Input Delay Neural Network (IDNN) exploits a window of fixed size to build ”shifted inputs” from a single vector, which may represent a sequence of measurements. In doing so, no feedback connections are required and training is achieved through simple backpropagation of a FNN.

When dealing with trees things get more complex, since we need to deal with a varying number of children. Moreover, the choice of the causal direction (bottom-up or top-down) can drastically impact performances. Many things can be de-composed into trees, such as a document or an image. Recursive Neural Networks (RecNNs) [44] adopt a recursive state transition function τ to address this prob-lem. These models are proven to be universal approximators, ensuring a principled approach to learning. On the probabilistic side, the HMM has been extended to work with trees: the resulting architecture is the Hidden Tree Markov Model [20], whose versions will be discussed in the following chapter.

Finally we introduce kernels. Kernels for trees are similarity functions which com-pare different fingerprints obtained through shortest paths, breadth first searches etc. These kernels will be adopted for graph processing, thus they are relevant to our work along with connectionist and probabilistic approaches.

(16)

The difficulty of learning graphs stems from the presence of cycles, the role of the causality assumption and the fact that graphs have different sizes and shapes. Many researchers have tried to deal with it by decomposing a graph in multisets of simpler substructures or by introducing a cyclic state definition along with conver-gence criteria. Others have looked at the immediate neighbourhood of each vertex without performing any kind of information diffusion mechanism. The most crit-ical applications involve social networks, cheminformatics, computer vision and network data in general. We can predict properties of a chemical compound given its structure, or identify characteristics of a person when its friends’ neighbour-hood is known. At the same time we can work seamlessly with sequences and/or trees, since they are restricted forms of graphs.

Because of the above considerations, the aim of this work is to design, develop and test a novel model for graphs which will extend unsupervised learning and perform contextual information sharing between vertexes. We will focus on adap-tivity and on the role of deep architectures. We are moved by the idea that un-supervised learning can capture significant patterns in the data regardless of the considered task. This can be a blessing and a curse, because a supervised method learns internal representations specific for a problem, which often results in better performances. However, the absence of labelled examples can be very limiting sometimes. Hence, an unsupervised method would exploit the huge amount of unlabelled data to construct a meaningful fingerprint of a sample, thus helping supervised methods to achieve a better performance. If supervised examples are scarce with respect to unlabelled ones, an unsupervised encoding can greatly help generalisation. Finally, the concept of depth as a mean to share context is second main idea. We will try to assess the effect of stacking layers, in order to evaluate if a non-recursive but deep approach can really bring advantages in graphs compared to other works.

This thesis is organised as follows: Chapter 1 introduces state-of-the-art mod-els that were our sources of inspiration, Chapter 2 formally introduces the new model, Chapter 3 details the results of our experiments and Chapter 4 present pros, cons and future works. Appendix A gives detailed information about formal derivations of training, inference and extensions to multidimensional vertex labels, whilst Appendix B discusses implementation details, time-space complexity and

(17)

17 Table 1: An overview of structured data learning.

Model → Data type

Symbolic Connectionist Probabilistic

STATIC flat data Rule induction, Decision trees NN, SVM Mixture models, Naive Bayes SEQUENTIAL serially ordered entities Learning finite state automata Recurrent NNs, Kernels for strings, Echo State Networks

Hidden Markov Models STRUCTURAL variables with relationships Inductive logic programming Recursive NNs, Kernels for SD, NN4G, GNN, GraphESN, TreeESN, PATCHY-SAN Recursive Markov Models, CGMM

presents the pseudo-code of the Expectation-Maximisation [33] algorithm.

To summarise this introduction we show an overview of SDL by comparing the data type with the kind of learning model.

(18)

(19)

Chapter 1 Background

This chapter presents a number of models for SD learning, and from those we have taken inspiration to develop our extension to graphs. We do not want this chapter to be a comprehensive literature review; instead, we will highlight the key characteristics from which we have borrowed ideas for our novel method.

1.1 Models for trees

When we deal with tree-structured data, two approaches are typically taken into consideration: to unfold the tree in a top-down fashion or in a bottom-up one. These differ by how a model can capture information in that it reverses the causal-ity direction. The No Free Lunch Theorem states that there exist tasks where one model is bound to outperform another, but it is clear that a bottom-up approach can make better use of hierarchical information; models like RecNNs of all sorts try to learn short and long-term dependencies present in the data, whereas in the probabilistic world frameworks such as the Hidden Recursive Model have been pro-posed, with practical applications limited to small trees with binary out-degree. Thus, the implicit goal of any model is to handle contextual information and exploit it to produce a meaningful internal representation of the input. When consider-ing a top-down model, a node accepts input from the father, which in turn holds a context made of its ancestors. On the other hand, in a bottom-up setting, the same node is influenced by its children and consequently by its children’s sub-trees: clearly this can potentially carry information about the whole tree to the root, but

(20)

not without drawbacks in terms of learning difficulties and computational burdens, as we will see in the following. Since we are interested in extending probabilistic models to graphs, we start with a recap on important generative solutions which tackle time and space infeasibility through clever approximations.

1.1.1 Hidden Tree Markov Model: the top-down approach

We start our digression with the Hidden Tree Markov Model (HTMM) [20], the first extension of a Hidden Markov Model (HMM) suitable for trees. Indeed, the use of recursive definitions allows to preserve relationships among tree constituents, in contrast to flat encodings. As we will see, it takes advantage of its Bayesian in-terpretation to apply a number of techniques that reduce mathematical difficulties as well as computational requirements. Hence, to predict P (yn_{), latent variables}

Q are introduced and each node yu becomes the child of an hidden node Qu in the

graphical model. The causality direction combined with the Markov assumption implies that a joint distribution of hidden states can be decomposed efficiently. The problem is related to the maximisation of this term

P(yn) =X

Q

P(yn,Q)

which is obtained by marginalisation. Moreover, it is required that each node be conditionally independent from any non-descendant given the parent, and that each observation be independent from the rest given its associated hidden node value, which we will call state. Hence we can further decompose the previous equation P(yn) = X Q P(yn,Q) = P (Qr)P (yr|Qr) Y yu∈yn\yr P(yu|Qu)P (Qu|pa(Qu)))

A graphical representation of the belief network is shown in Figure 1.1. The param-eters of this model are the prior P (Qr), the emission P (yu|Qu) and the transition

P(Qu|pa(Qu))). Stationarity assumptions, which define a form of weight

shar-ing, and learning will be investigated in the following sections (under a different flavour); however, we point out that full stationarity will amount to prior, emission and transition distributions which are independent from the specific node. Alter-natively, a positional stationarity would take into account the position of a child

(21)

1.1. MODELS FOR TREES 21

Figure 1.1: An HTMM’s graphical model instantiation. Blue variables are ob-served.

among its siblings, especially for the transition distribution.

For now, we point out that the inference procedure consists in predicting max

Q P(Q, y

n_). _(1.1)

and can be obtained by a variation of the Viterbi algorithm; what is more, it is still very efficient because for each hidden variable, whose state has to be com-puted, we only need to inspect the parent node (in the Bayesian sense of the term). Armed with such introductory knowledge, we now discuss the issues of dealing with reversed causal directions.

1.1.2 Bottom-Up Hidden Tree Markov Model: the

bottom-up approach

The Bottom-Up Hidden Tree Markov Model (BHTMM) [4] is one of most inspiring methodologies that we studied in order to build a model for graphs. This generative model is able to overcome computational problems, due to the joint probability of children states, by adopting the switching parent (SP) model of dynamic Bayesian networks [10]. BHTMM, being a probabilistic directed acyclic graphical model, exploits:

• The Markov property:

(22)

• The Bayesian network full joint distribution: P(Q1 = q1, ..., Qn= qn) = n Y i=1 P(Qi = qi|pa(Qi)) (1.3)

• The fact that, in a Bayesian network, each variable is conditionally dent from all its non-descendants, given the parents. Conditional indepen-dence of variables R and B, given a variable Y, is defined as

P(R ∩ B|Y ) = P (R|Y )P (B|Y ) ≡ P (R|B ∩ Y ) = P (R|Y ) (1.4) and the right equivalence of this formula is what has been used most in the derivation of the learning algorithm.

Figure 1.2: BHTMM’s recursive graphical model with a Switching Parent node Su. Blue variables are observed.

Model & Learning

The model can be summarized by Figure 1.2. The hidden variable Su is related

to the Switching Parent approximation (details in the following). Denoting our parameters with θ, we are interested in the joint distribution

P(yn_{, Q}

1, ..., QUyn)

and in the likelihood of the model P(Y|θ) = L(θ|Y) = |Y| Y n=1 X q1,,q_Uyn P(yn_{, Q} 1 = q1, ..., QUyn = qUyn)

(23)

1.1. MODELS FOR TREES 23 At this point, the Expectation Maximisation (EM) [33] learning algorithm comes into play. The basic idea of this iterative procedure is to pretend that we know all the parameters of the model for observed and unobserved variables (starting from random assignments), and use them to compute the update equations. Formally:

θ(i+1) = arg max

θ

X

q

P(Q = q|Y, θ(i)_)L

oglike(Q = q, Y|θ) (1.5)

where Loglikeis the expectation of the log likelihood of the “completed” data [Q, O].

Equation (1.5) contains the main steps of the EM algorithm i.e the E-step that computes the summation and the M-step that performs maximisation.

Another ingredient we need is the SP probability P (Su = l) to approximate the

joint state transition distribution with a weighted mixture of L (max-outdegree) simpler distributions P(Qu = qu|Qch1(u) = q1, ..., QchL(u) = qL) (1.6) = L X l=1 P(S = l)P (Qu = qu|Qchl(u) = ql) (1.7)

which helps to reduce the number of parameters needed from O(CL_{) to O(LC}2_+L)

and greatly enhances computational tractability. The exact form of this model, which we denote with BU , cannot be used in practice when L is bigger that 2, while SP -BU is capable of handling much larger arieties. The size of the model’s parameters also depends on the stationarity assumptions that are taken:

• Full stationarity: parameters are independent from a specific node u. This further reduces parameters’ number down to O(C2 _{+ L).}

• Positional stationarity: If we are interested in capturing the correlation be-tween a node and its l-th child subtree, we assume it. Then, equation 1.6 rewrites L X l=1 P(S = l)Pl_(Q u = qu|Qchl(u) = ql)

At this point it is useful to introduce indicator variables zn

ui that tell us which

state i ∈ [1, C] is responsible for the generation of node u, tn

(24)

value the SP variable gets (again, in the range [1, L]) and ¯zn

uijl which is 1 when the

l-th child of node u is in state j while u is in state i. By pretending to know the assignment of all latent variables, we can rewrite the complete log-likelihood log Lc

in this way (for more technical details, please refer to [2]) log Lc(θ|Y, Z) = N X n=1 C X i=1 X u∈leaves(¯yn₎ z_un(i)logPpos(u)(Qu = i) + N X n=1 C X i=1 Uyn¯ X u=1 z_un(i)logP (yu|Qu = i) + N X n=1 X u∈yn_\leaves(¯_yn₎ L X l=1 tn_ullogP(Su = l) + N X n=1 X u∈yn_\leaves(¯_yn₎ C X i=1 C X j=1 L X l=1 ¯

z_uijln logP(Qu = i|Qchl(u) = j)

(1.8) The E-step requires the computation of these 3 formulas

E[zuin|y n , θ(k)] = P (Qu = i|yn) E[tn ul|y n_{, θ}(k)_{] = P (S} u = l|yn)

E[¯zuijl|yn, θ(k)] = P (Qu = i, Qchl(u) = j, Su = l|y

n

)

Factorising those probabilities into a product of known parameters requires long derivations that give rise to the Reverse Upward-Downward algorithm [4]. For the sake of curiosity, we show an interesting formula that arises from marginalisation and our assumptions:

P(Qv = xj|yn) = P (Qchl(u) = j|y n_{) =} C X i=1 P(Qu = i, Qchl(u) = j, Su = l|y n₎

where we notice that the right hand-side of the equation does not require a marginalisation sum over Su. To conclude the iterative algorithm, the M-step

requires the update of the following parameters (assuming positional stationar-ity):

(25)

1.1. MODELS FOR TREES 25 • A positional prior probability matrix π such that

Pl_{(Q = i) ≡ π}l

i ≥ 0 and

PC

i=1πil= 1, for 1 ≤ l ≤ L

• A SP probability matrix ϕ such that

P(S = l) ≡ ϕl ≥ 0 and

PL

l=1ϕl= 1

• Positional state transition probability matrices A = {A1, ..., AL} such that

Pl(Q = i|Qchl = j) ≡ a l ij ≥ 0 and PC i=1a l ij = 1, for 1 ≤ l ≤ L

• Two kind of emission models:

- A normal emission P (y|Q = i) ≡ bi(y) = N (y; µi,Σi) for continuous labels

y ∈ Rd

- A multinomial emission over a discrete alphabet of M symbols, parametrised by a matrix B such that:

P(y|Q = i) ≡ bi(m) ≥ 0 and

PM

m=1bi(m) = 1

which satisfy the sum-to-1 requirements through the introduction of external pe-nalisation methods (Lagrange multipliers).

Inference

When dealing with Hidden Markov Models, the Viterbi algorithm is one of the most used to perform inference. In our setting, we seek the most likely hidden state assignments Q given an input tree yn_{, which can be written as}

max

q P(Q = q, y

n_). _(1.9)

Here a “reversed Viterbi algorithm” is devised from eq. (1.9), and requires an upward recursion starting from the leaves to determine the most likely hidden state of the root. Later a downward recursion computes the most likely hidden states of the remaining nodes. Please notice that this algorithm spreads contextual information in both directions, but not symmetrically from each node. Perhaps the most significant part in this clever derivation concerns the approximation of an exponentially expensive maximisation, i.e.

δu(i) = max q1,...,qL L X l=1 ϕlAli,ql L Y v=1 bqv(yv) Y v0_∈ch(u) δv0(q_v0) (1.10)

(26)

which can be approximated using q_l∗ = arg max

ql

ϕlAli,qlbql(ychl(u))δchl(u)(ql) (1.11)

which allows us to rewrite equation 1.10 as δu(i) = L X l=1 ϕlAli,q∗ l L Y v=1 bq∗ v(yv) Y v0_∈ch(u) δv0(q_v∗0). (1.12)

Without this workaround we would not be able to efficiently solve the inference problem.

Brief discussion on information sharing

Looking at BHTMM’s graphical representation, it can be noticed that when a node Quassignment is given, the random variables Qch1(u), ..., QchL(u)cannot be assumed

to be independent one from another by definition of a Bayesian network. In turn, this observation is the main reason why the learning process induces information sharing among siblings, although those are not directly connected between each other.

1.1.3 Input-Output BHTMM: an extension to input/output

transductions

BHTMM can be further extended by adding an input label to the model: this breaks the time-homogeneity property, which is the independence of the state transition and emission distributions from a specific time instant or position in a sequence. The resulting non-homogeneous model can be used to generate IO-isomorphic transductions over trees, starting from a given input. Thus, the IOB-HTMM1 _{is a practical generative model for these kind of tasks, since it manages to}

keep time and space requirements as low as possible without oversimplifying too much. Technical details can be found on [6], though in the end they are quite sim-ilar to the previous model. For this reason, we will present variations of formulas introduced in the previous section.

1_{Our official Python implementation can be found at http://pages.di.unipi.it/bacciu/} software/iobhtmm/

(27)

1.1. MODELS FOR TREES 27

Figure 1.3: IOBHTMM’s recursive graphical model. Model & Learning

The most important change can be appreciated by looking at the graphical model in Figure 1.3, where a new observable node xu has been introduced. The switching

parent node is obviously still present and its distribution weights the state transi-tion of node u according to the selected child. The assumptransi-tions do not vary with respect to BHTMM, and the complete log-likelihood gets rewritten as

log Lc(θ|Y, Z) = N X n=1 C X i=1 X u∈leaves(¯yn₎ z_un(i)logPpos(u)_(Q u = i|xu) + N X n=1 C X i=1 Uyn¯ X u=1 z_un(i)logP (yu|Qu = i, xu) + N X n=1 X u∈yn_\leaves(¯_yn₎ C X i,j=1 L X l=1 ¯ zn_uijl logP(Su = l)+

+ logP (Qu = i|Qchl(u) = j, xu)

.

(1.13)

The parameters of the model, which are also present in equation (1.13) inside the body of the logarithms, are:

• A positional prior probability matrix π, conditioned on the input, such that πl

i,x≥ 0 and

PC

(28)

• A SP probability matrix ϕl such that

ϕl≥ 0 and

PL

l=1ϕl = 1

• Positional state transition probability matrices A = {A1, ..., AL}, one for

each given input, such that al

ijx ≥ 0 and

PC

i=1alijx = 1, for 1 ≤ l ≤ L

• Two kind of emission models:

- A normal emission bi(y) = N (y; µi,Σi) for continuous labels y ∈ Rd for

each x belonging to the input alphabet.

- A conditioned multinomial emission over a discrete alphabet of M symbols, parametrised by a matrix B such that:

bi(m, x) ≥ 0 and

PM

m=1bi(m, x) = 1

and they must satisfy the same constraints of the BHTMM. The E-step and M-step formulas are not presented because quite similar to the ones of BHTMM.

Inference

This time, performing inference requires to solve y∗,q∗ = arg max

y,q P(y, Q = q|x). (1.14)

Complexity now becomes even much worse, since we need to take into account both the hidden states and the desired output. Again, we resort to an approximation which computes these values separately for each child (the term below is part of the δu(i) analogue to that of BHTMM , which we omit to ease the digression.)

q_l∗, y_ch∗

l(u)= arg max_q,y chl(u)

ϕlAli,ql,xubql(ychl(u), xchl(u))δchl(u)(ql). (1.15)

Classification tasks

Supervised classification tasks deal with an input tree and a single output value. We can fit such a sample to our model by simply assigning the same target value to all nodes of the unfolded tree. After training, the inference procedure is run and the resulting class can be produced in two ways: by returning the emission label of the root or selecting the most frequent one (voting). Supervised IO-isomorph

(29)

1.1. MODELS FOR TREES 29 transductions simply return the inferred label of each node. Also, both BHTMM and IOBHTMM model the absence of children by introducing null children as nodes which will always be in a special state ⊥. The only parameters that have to be modified accordingly are the state transition and prior probability distribu-tions.

The comparison between homogeneous and non-homogeneous modelling capabil-ities is still an open field of research, but this model is clearly more flexible and powerful of its base version described in section 1.1.2. However, it is known that increasing the complexity of a model requires an exponentially growing number of examples in order to maintain generalisation capabilities, a problem which is known in literature as the curse of dimensionality.

1.1.4 BDIO: Modelling bi-directional tree contexts

Up to now, we have described models that exploit uni-directional causality rela-tionships. There is no hope of capturing a more complete context without taking into consideration both directions. Efforts have been made in this sense by [7], where a top-down model (which we indicate with TD) and the IOBHTMM are combined so as to learn structured transductions. The resulting model is a com-bination of the two, called Bidirectional-IOBHTMM (BDIO in the following), and it is shown in Figure 1.4 where the model is trained to reproduce its input. The process is divided in two phases:

• Encoding phase: the input is fed into TD, which produces a tree of hidden states through the Viterbi algorithm. This encoding constitutes the trans-duction input for the IOBHTMM.

• Decoding phase: The IOBHTMM, given the input from the previous step, builds the transduction output in the way we have already seen previously. The hidden states produced at decoding time can be exploited by kernels [5] or other supervised methods.

The learning process consists in training the TD first and then using the produced latent state trees as input for the learning phase of IOBHTMM. Hence, the bottom-up model will receive a ”top-down” view of the tree. The goal is to thoroughly exploit all this knowledge to enhance modelling capabilities. Please notice that,

(30)

(31)

1.1. MODELS FOR TREES 31 differently from techniques that exploit kernels, here we learn directly both the encoding and the decoding steps. Nevertheless, separating the learning processes rises questions concerning how good it is to capture a top-down context before the bottom-up counterpart. Also, we may argue that, although BDIO shows state-of-the-art performances, capturing the context of another context could result in some kind of information loss. Hence, we would like to completely relax the causality assumption so as to model undirected trees with a single model.

1.1.5 Tree Echo State Network

An Echo State Network (ESN) is a model consisting of two main parts, an un-trained non-linear reservoir of hidden states and a un-trained linear readout. What characterises the model is the contractivity of the state transition function realised by the reservoir, which has to be initialised in such a way that the Echo State Prop-erty is guaranteed. Specifically, the Echo State PropProp-erty (ESP), here defined for sequences, is

∀s = [l1, ...,ln] ∈ (RNU)n input sequence of length n

∀i0, i00 ∈ R

NR_{initial states: ||τ (s, i}

0) − τ (s, i00)|| → 0 as n → ∞

This definition asserts that dependencies on initial conditions are progressively lost and the embeddings of different input sequences share a Markovian nature i.e. the longer the common suffix in the sequence the closer they are in the hidden state space.

The extension of an ESN to trees, which is called TreeESN [23] exploits a recursive state transition definition that is shared across all nodes of the tree. The unfolding is done in a bottom-up fashion, from the leaves to the root, so that the root will be able so see contextual information of the whole tree. The state transition function for a node u is defined as follows

τ(u) = fWinlu+ L X i=1 ˆ Wτ (chi(u))

where f is a non-linear activation function of each hidden state. Notice that this definition reduces to the one for sequences when each node has only one children. The encodings computed on the children are recursively combined to provide a

(32)

Figure 1.5: An Example of the encoding process computed by the reservoir of a TreeESN. Symbolic node labels are encoded using a 1-of-m binary representation.

bigger picture to their father. Notice also that the reservoir matrix is shared across all children thanks to the stationarity assumption; TreeESN can model positional trees just by using one reservoir matrix for each position, where the max ariety of a tree is known in advance and children have a well-defined positional relationship. Figure 1.5 shows an example of the unfolding of a tree so as to provide graphical explanation of the above concepts. The Echo State Property has been theoretically proven to hold also for this kind of transition function.

The biggest advantages of TreeESNs are to

• Perform well in tasks where the Markovianity assumption is valid

• Be very efficient both at training, thanks to the fact that only the linear readout is trained, and at test time

The state mapping function, which provides an input to the readout, can be re-alised in two ways. The former just takes the state computed at the root, while the latter averages of all the encodings generated for the tree. Finally, TreeESN has been applied to benchmark for trees such as INEX, which we will analyse in depth in Chapter 3.

(33)

1.2. MODELS FOR GRAPHS 33

1.2 Models for graphs

Moving from a directed acyclic graph to more general forms introduces severe limi-tations. First of all, undirected graphs can be modelled as if they had bidirectional dependencies between each pair of adjacent vertexes, which correspond to introduc-ing cycles in the causality “path”. The above mentioned models cannot deal with such structures, since recursion would generate an indefinite learning/inference process. It follows that a different methodology is needed, and we already have mentioned the relaxation of the causality assumption being a possible solution. In the following, we introduce neural-network based models for graphs, some of which have heavily influenced our new generative model for graphs. Then we extend our literature review with kernel and convolutional methods.

1.2.1 NN4G: The neural network for graphs

Neural Network for Graphs [32] (NN4G) constitutes the other milestone, along with BHTMM, on which our approach is based. This model performs classifica-tion and regression of graph structures which can be posiclassifica-tional, non-posiclassifica-tional, directed and undirected. Among the key aspects we find adaptivity, which is ap-plied even to the construction of the network itself. This, as we will see, is in contrast with kernel-based approaches that adopt a fixed similarity measure. To deal with undirected graphs, the causal dependencies which arise from the use of RNNs and RecNNs are relaxed. Note that when the causal relation is significant it still makes sense to use previous models as well as the NN4G version for directed graphs, but doing so inevitably restricts the classes of data that can be handled. In literature several stratagems have been tried to rewrite a graph into some sort of simpler structure; in a way, this is similar to what has been done in BDIO of section 1.1.4, but it does not solve the problem at its root and it is certain to lose knowledge intrinsic to the data in the process.

Surprisingly, NN4G solves all these limitations without introducing complex struc-tures, since it is based on a simple feedforward architecture with no recurrent connections, which implies the possibility of a fast training procedure. On the contrary, learning algorithms for sequences such as Backpropagation through time (BPTT) and Real-time Recurrent Learning (RTRL) have higher time and space requirements, with an ongoing area of research dedicated to find more efficient

(34)

solutions.

NN4G is based on a constructive approach, which builds the network according to a variation of the Cascade Correlation [21] methodology, as we will see in more detail. This allows to progressively learn increasingly larger contexts until the task at hand is solved. Please notice that the adoption of a Cascade Correlation pro-cedure is just one possible implementation of the method, and it is not necessary to the modelling of graphs. On the other hand, such technique is in line with the-oretical considerations about context spreading. In the end graph transductions, together with a specific architecture, are automatically learned, removing major architectural biases.

Model & Learning

Before we begin with the discussion on learning, we highlight that one can consider a closed form of N (v) which contains v itself, as well as an open one which does not. The degree(v) of a node is defined as |N (v)|.

The most general definition of a hidden unit τi(v) is the following:

τ1(v) = f _XL j=0 ¯ wijljv

when i=1, the state variable (or a hidden neuron) is the first to be added and has no previous layers, so it can use just the available input values to produce an output. When i > 1 τi(v) = f _XL j=0 ¯ wijlkv+ i−1 X k=0 X u∈N (v) ˆ w(u,v)_ik τk(u) (1.16)

meaning that a state variable can exploit both input and previous contextual in-formation to produce a state output. In addition, a state variable τi is shared

across each vertex of the input graph, and weights are adjusted to solve a specific optimisation problem using previous hidden units (τj for j < i) that have been

already trained and frozen. The f can be a linear or a non-linear activation func-tion.

(35)

1.2. MODELS FOR GRAPHS 35 shapes. This is addressed by the local unit configuration property i.e., the shape is composed according to local vertex connections, and by the stationarity as-sumption. As mentioned before, the abstract hidden unit system is applied to every vertex and it is independent from v thanks to an appropriate weight sharing technique among parameters, used for different vertices or within other subsets of graph components (edge connections in 1.16). The kind of stationarity, as in the BHTMM, controls which parameters are effectively shared and allow us to model different shapes in the same way.

To conclude this dissertation over NN4G’s hidden units, we point out several key properties of this model:

• The computation is not recursive, since each hidden unit only depends on other units in previous layers.

• There are no connections about unit replicas in the same layer.

• Each computation is local, thus is completely separated from any kind of topological order on the vertices.

The output layer emits a value for each vertex, which is useful for IO-isomorphic transduction, or a single output obtained by values aggregation such as an average operator Ti(g) = 1 |V ert(g)| X v∈V ert(g) τi(v) (1.17)

or similar. Hence, the output yo can be formulated as follows

o(p) =    fPN i=0woiTi(p) , if p ∈ G fPN i=0woiτi(p) , if p ∈ V ert(g), g ∈ G. (1.18) The entire transduction process from an input graph to an output can be then represented by this sequence of steps:

g → τ(g) → T (g) → o(g) (1.19)

and it can graphically appreciated in Figure 1.6 where it is clear that the higher is the layer the larger is the embedded context in a state variable. Compositionality is one, if not the most, important concept of NN4G: it applies either to contextual

(36)

Figure 1.6: NN4G encoding of an input graph for (T1(g), T2(g), T3(g)) and the

output mapping

information and to the learning process, where a new hidden unit exploits the solution to ”sub-problems” already provided by previously frozen units (details of the constructive methodology can be found in [21]).

The learning phase incorporates the automatic construction of the network sim-ilar to the standard Cascade Correlation technique: until a desired performance is not reached new hidden units are trained, one at a time, interleaving the mini-mization of the total error function at the output layer with the maximisation of the correlation between the new hidden unit i and the residual error of the output

(37)

1.2. MODELS FOR GRAPHS 37 unit Eo(g) Si = M X o=1 X g∈G (Eo(g) − ¯Eo)(Ti(g) − ¯Ti)

where M is the number of output units, ¯Eo is the mean error of output unit o and

¯

Ti is the mean value of Ti(g). This formula can be easily turned into one for single

vertices Si = M X o=1 X g∈G X v∈V ert(g) (Eo(v) − ¯Eo)(τi(v) − ¯τi)

The weight update formulas for the new hidden units is computed in this way ∆ ˆwij = η δSi δwˆij = η M X o=1 σo X g∈G (Eo(g) − ¯Eo) X v∈V ert(g) f0 k X t∈N (v) τj(t) !

where k = |V ert(g)| if Ti(g) is defined as (1.17), while the update of the output

units’ weights can be computed by means of a standard approach to backpropa-gation, taking into account the frozen units. We can say that each hidden unit trained in this way learn to solve a specific ”subtask”, whose solution is then made available for the new units. This allows the construction of the network to under-stand how much context is needed to solve the task at hand, with a significant improvement in training speed compared to RNN learning algorithms.

Theoretical properties

Intuition is reflected into theoretical properties whose proofs exploit induction and the concept of a graph’s diameter. First of all, we introduce the concept of context window.

Definition 1. (Context window): The contex window of a state variable xi(v),

denoted as C(τi(v)), is the set of all state variables that (directly or indirectly)

contribute to the determination of τi(v).

We trivially define N (v) also for a set of vertices V as N (V ) =S

v∈V N (v), and

(38)

such that

Ni(V ) = N (Ni−1(V )) ∪ Ni−1(V )

with N1_{(v) ≡ N (v) as base case. In other words, N}d_{(V ) is the extension of the}

neighbourhood of v up to distance d from v. After defining τj.V = {xj(v)|v ∈ V },

we are ready to discuss the first theorem.

Theorem 1. Given a graph g, ∀v ∈ V ert(g), and i ≥ 2, the NN4G context is given by C(τi(v)) = i−1 [ j=1 τj.Ni−j(v) = i−1 [ j=1 τi−j.Nj(v)

which simply states that the set of variables that contribute to the determina-tion of τi(v) are all the previous hidden variables, each one applied to a particular

extension of the neighbourhood of v; this perfectly describes the intuition given in Figure 1.6.

The last theorem proves that exists a finite number of hidden units that ensure the entire context can be captured.

Theorem 2. Given a finite size graph g, there exists a finite number h of state variables such that ∀v ∈ V ert(g), the context C(τh(v)) involves all vertices of g.

In particular, any index h > d satisfies the proposition if d is the greatest distance between any two vertices of g i.e, its diameter.

Summing up, NN4G represents a brilliant solution to efficiently capture the context of any kind of graph, which opens many possibilities and has been suc-cessfully applied to the chemistry world. We are particularly interested in this model because it completely relaxes causality retaining great efficiency.

1.2.2 GNN: The graph neural network model

The Graph Neural Network [41] is yet another model for supervised tasks that solves graph issues in an original way. In this model, the neural network has the same shape of the graph (thanks to the same weight sharing mechanisms) and at each time step information is exchanged with adjacent nodes through a diffusion

(39)

1.2. MODELS FOR GRAPHS 39 mechanism. In contrast to NN4G, the model has to reach a stable equilibrium because of recursive state dynamics; this is ensured by the use of functions with specific properties such as to be contraction maps, which however limits the range of values that can be assigned to the weights. Indeed, according to the Banach’s fixed point theorem, in order to ensure the uniqueness and the existence of a solution for the following system of equation

s =τ (s, a) o =o(s, lN)

(1.20)

where s, o, a, and lN are the vectors constructed by stacking all the states, all

the outputs, all the labels, and all the node labels respectively, and o is the global output function, τ (the global transition function) has to be a contractive map. In addition, an iterative scheme is provided to compute the state for each node n

sv(t + 1) =τ (lv,av,sN (v)(t), lN (v))

ov(t) =o(sv(t), lv), v ∈ N

The resulting architecture is a recurrent neural network, for which the tailored learning algorithm has to reach convergence at each epoch using the above scheme. In contrast, NN4G still enjoys the simplicity and the efficiency of a being a feedfor-ward neural network. A very technical presentation is made in [41]; nevertheless, to give the reader an idea of the unfolding process, Figure 1.7 is presented: each node/arc label of the graph is used in the encoding network, and depending on how many time steps are required to satisfy a stopping condition at each epoch, an unfolded network like the one depicted is generated. The main drawback of this model is the cost of the learning phase, which is much more expensive than the test one. This is due to equilibrium conditions that must be achieved and to the gradient computation. However, GNN has been successfully applied in a num-ber of object and text classification tasks. As NN4G, it can deal with any kind of graphs, hence it introduces another general framework with its own original design.

(40)

Figure 1.7: Graph (on the top), the corresponding encoding network (in the mid-dle), and the network obtained by unfolding the encoding network (at the bottom). The nodes (the circles) of the graph are replaced, in the encoding network, by units computing f and g (the squares). When τ and o are implemented by feedforward neural networks, the encoding network is a recurrent neural network. In the un-folding network, each layer corresponds to a time instant and contains a copy of all the units of the encoding network. Connections between layers depend on encoding network connectivity.

(41)

1.2. MODELS FOR GRAPHS 41

1.2.3 Graph Echo State Network

GraphESN [22] is the extension of an ESN to the graph domain. It still involves a fully stationary approach but adopts a cyclic vertex state definition that requires an iterative procedure to converge, as in GNN. Specifically, the local state transition function of the reservoir is computed at each time step t for all vertexes, according to the following equation:

τt(v) = τ (lv, τt−1(N (v))) =

f(Winlv+ ˆWτt−1(N (v))

GraphESN requires a maximum degree k of the neighbourhood in order to define ˆ

W as made of k submatrices. According to the topological properties of the graph these submatrices can be arranged differently to achieve appropriate modelling of the dataset. The entire architecture can be appreciated in Figure 1.8, where it is shown how each node exploits information coming from the previous iterative step in order to capture larger and larger contextual information.

Figure 1.8: An input graph, one pass of the encoding process computed by the reservoir in the state representation, the state mapping function and the output function for a GraphESN.

(42)

1.2.4 Kernel Methods

A great deal of effort has been also put on kernels. Kernel functions are symmetric and usually compute some kind of similarity value for each pair of input examples, let them be strings, trees or graphs. This is still a very active area of research. Kernel methods can be used to perform regression of classification, e.g. they can be embedded into a Support Vector Machine (SVM) [15]. It is not uncommon that these methods are applied or find application to bioinformatics and medicine. However, they have a number of limitations:

1. First of all, they are very often expensive to compute (hours or days of computation), so they generally do not scale well on large datasets. Indeed, they can be based on random walks, shortest paths etc. [42] on each vertex of the dataset.

2. The kernel matrix K, whose entries (i, j) contain k(gi,gj) must be positive

definite, a constraint that requires proof each time a new kernel is designed. 3. Many kernels are not adaptive i.e. they are not learned together with the entire architecture. In our opinion this is a big disadvantage in terms of flexibility, and requires an hand-crafted solution for each task (of course, such kernels may be reused for different tasks, but there are no guarantees that they will surely work well. The process of generating ad-hoc kernels can be time-consuming.

4. The most common approaches look at the number of matching substructures between two elements, which is quite reasonable. Nevertheless, in order to keep the computational burden relatively low it is required to limit the depth of each visit, and it may be the case that two slightly different structures are mapped to the same values. Thus, special workarounds need to be taken into account.

We now briefly present the idea behind kernel solutions found in literature that successfully handle graphs, discarding details about time and space complexity which are not particularly useful to this work.

(43)

1.2. MODELS FOR GRAPHS 43 Random-walk kernels

This family of kernels uses probability distributions of visible and hidden variables. In particular, in those cases where the posterior P (Q|y) is known and the likelihood P(y|Q) is not, it is possible to perform marginalisation:

K(g, g0_{) =}X Q P(g|Q)P (g0_{|Q)P (Q)} =X Q X Q0 Kz(z, z0)P (Q|g)P (Q0|g0)

where z = [g, Q] and Kz(z, z0) is the joint kernel. The posterior probabilities

as-sume that a hidden variable Q = (q1, ..., qt) is a sequence of numbers, ranging from

1 to the graph’s dimension, corresponding to a random walk. Thus we can decom-pose it using a prior probability to decide where to start, a transition probability to move from a vertex to another and a probability to stop the path. Initialisation can be done by assigning uniform distributions when no information is known. As for the joint kernel, in [27] it is further decomposed using simple kernels for pairs of vertices/edges. If we define a random walk h = vq1eq1q2vq2eq2q3vq3... with length

l then Kz(z, z0) = (0 if l 6= l0 K(vq1, v 0 q0₁) Ql

i=2K(eqi−1qi, e

0

q_i−10 q0_i)K(vqi, v

0

q0_i) otherwise

The crucial part now is the efficient computation of such a joint kernel when marginalising over possibly infinite random walks: it can be shown that this prob-lem can be formalised in terms of linear systems theory, by solving linear simul-taneous equations until convergence of the kernel is found for each pair of input graphs. A convergence condition is given, ensuring that the computation converges for any pair of starting states under special conditions. Experimental results show that 20-30 iterations were required to reach convergence for a couple of benchmark datasets.

A tree-based kernel

According to [16], common strategies to build kernels may consist in either re-stricting the input domain to a special subset of graphs or selecting a priori a set

(44)

of features based on substructures (e.g random walks). Both of these methodolo-gies have their limitations in terms of flexibility and kernel’s reuse, and discard some amount of information about the graph. The solution proposed here is to transform the graph into a multiset of DAGs. After that, an extension of largely available kernels for trees can be developed in order to be applied to DAGs. The resulting graph kernel will be the composition of the application of this new DAG kernel to pairs of DAGs multisets. The mathematical framework is peovided in the following, but first we define the “preprocessing” procedure that decomposes a graph into multisets:

1. For each vertex, perform a breadth-first visit

2. Assign an ordering to each edge encountered (this works for directed and undirected structures).

3. Edges connecting nodes visited at level l to nodes visited at level g < l are removed, in order not to introduce cycles.

Such a process must guarantee that isomorphic graphs are represented by exactly the same multiset of DAGs, and proof is given in [16]. Moreover, the output is a number of DAGs equal to the number of vertices present in the structure. Without going into too much detail, the resulting kernel for graphs is given by

KKDAG(g1, g2) =

X

D1∈M ultiset(g1)

D2∈M ultiset(g2)

KDAG(D1, D2)

Positive definitness of the above kernel is guaranteed. We now need to extend a tree kernel to DAGs; the noteworthy observation is that there are kernels for trees that can be applied to DAGs without no modification at all, provided that the children of all nodes are ordered. Thus, a strict partial order relation between DAG vertices is introduced and it is not presented here for the sake of brevity. We conclude by detailing the formula for a DAG kernel

KDAG(D1, D2) =

X

v1∈D1

v2∈D2

C(root(Visit(v1)), root(Visit(v2))),

where Visit(v) is a tree-visit starting from vertex v and C() can be, for example, a

(45)

1.2. MODELS FOR GRAPHS 45 A number of mathematical properties of the model are exploited to fasten the com-putation, but in practice it is also required to limit the depth of each visit to deal with especially large graphs. This approach has reached state-of-the-art results on a number of datasets involving binary classifications of chemical compounds. To deal with special cases that may cause transformations to be identical between two different inputs, the authors suggest to move this information away by adding an attribute somewhere else.

Shortest-path kernels

Another approach is taken by [11]. In order to overcome the usual computational burden, the authors have defined a kernel based on the shortest paths from each node. In theory, a kernel should be a good measure of the similarity between two graphs, and it is argued that this method retains expressiveness and efficiency in comparison to kernel based on walks that can become time prohibitive. Also, performing vertex’s label enrichment greatly helps to improve accuracy: indeed, substructures that used to share an high degree of similarity before can no longer been similar at all, allowing the subsequent classifier to discriminate better. We introduce here the notion of edge walk, which consists in a sequence of edges where adjacent edges have a vertex in common. We then need to perform the so called Floyd transformation, whose procedure is:

• Whenever two nodes vi and vj are connected by a path, add an edge between

them and label it with their shortest distance.

The output is a new labelled graph g = (V ert, Edg). At this point, we shall define the shortest-path kernel

kshortestpaths(g1, g2) = X e1∈Edg1 X e2∈Edg2 k_walk(1) (e1, e2)

where k(1)_walk(e1, e2) is a positive definite kernel on edge walks of length 1 (remember

that each pair of nodes connected by a path in the original graph is at distance 1 in g). The validity of this kernel is proved in [11]. Interestingly, both edges and labels of g can be enriched with additional attributes. Experiments have been done on protein datasets and the approach significantly outperforms random walks kernels.

(46)

Fast subtree kernels

Yet another kernel [42] attempts to exploit similarity in substructures, this time by means of the Weisfeiler-Lehman test for graphs isomorphism: this algorithm proceeds per iterations (i in the following), assigning at each round a multiset-label per node, performing some sorting of these multisets and finally compressing them into single labels. Finally it outputs a relabelled graph. The kernel is defined as follows:

k(h)_{W L}(g, g0) = |{si(v), si(v0))||f (si(v)) = f (si(v0)), i ∈ {1, ..., h}, v ∈ V ert, v0 ∈ V ert0}|

where si(v) is the string associated to the multiset-label of v, f is injective and the

sets {f(si(v))|v ∈ V ert ∪ V ert0} and {f (sj(v))|v ∈ V ert ∪ V ert0} are disjoint for

all i 6= j.

This kernel uses information coming from the previous iterations, which in turn are based on the neighbourhood of each node. The test terminates if sets of newly created labels are not identical in g and g0_{. So, the test accumulates information}

on each vertex in an incremental fashion, which makes it particularly interesting; this information is then used by the kernel to count common multiset strings in the two graphs.

Again, kernel-related proofs are given in [42]. Performances on experimental datasets are comparable to kernels based on random walks and shortest paths. Generative multiset kernels

The family of generative kernels compare two different examples on a likelihood basis i.e. how well they are generated by a probabilistic model. The feature space is usually obtained using the parameters of such a model: to give an example, in [5] it is briefly described the Fisher kernel, whose mapping from the input to the feature space is obtained using the Fisher score vector Ux = ∆θlog P (yn|θ), which

clearly requires to decompose the likelihood according to the various distributions defined for a specific model. Finally, the resulting vectors are compared according to the distance between each other. In [5] a new approach that tries to augment such kernel space with structural information is proposed; indeed, in the Fisher kernel there is no explicit syntactical information about common substructures, and one may argue that this is not enough when dealing with possibly complex

(47)

1.2. MODELS FOR GRAPHS 47 inputs. As a result, the General Jaccard Kernel for Hidden States Multisets [5] has two major advantages: (i) it is computationally cheaper than the Fisher kernel and (ii) it exploits adaptivity of the underlying generative model to define a metric that otherwise would be chosen by the human. In order to do so the IOBHTMM described in section 1.1.3 is adopted, because it learns to discriminate between examples belonging to different classes for the task at hand.

Once inference has been performed, the feature space can be defined by a unigram (the simplest form of multiset)

Φi(g) =

X

u∈g

δ(Qu, i), i ∈1, .., C

where C is the number of hidden states, δ(Qu, i), is the Kronecker function and Φi

returns a C-dimensional vector. Alternatively, we can use a bigram that introduces some form of syntactical knowledge

Φij(g) = X u∈g X v∈N (u) δ(Qu, i)δ(Qv, j) i ∈1, .., C, j ∈ 1, .., C

and returns a C2_{-dimensional vector. The Jaccard kernel for trees is then defined}

as k(g1, g2) = PD k=1min(Φk(g1), Φk(g2)) PD k=1max(Φk(g1), Φk(g2))

where D is C or C2_{, respectively, for the unigram or bigram feature vector.}

Experi-mental results on tree benchmark datasets (INEX2005 and INEX2006, see chapter 3) show that this technique competes with state-of-the-art results.

As we have seen, kernels rely on the entire graph to compute similarity measures that will help to learn to discriminate. We would like to investigate an algorithm that exploits locality and information sharing to obtain similar results. This would allow the computation to be totally data-parallel with the exception of a barrier when newly computed results have to be propagated along the network.

(48)

1.2.5 Univariate Collective Inference for node

classifica-tion

Another common scenario that occurs when dealing with graphs is the classification of vertexes. In the simplest case, we are presented with a graph g with known vertex labels V ertK _{and unknown ones V}U _{for which we have to do prediction.}

This is referred as the univariate case in [29]. Interestingly, the univariate case places target labels directly on each node, so the only information we rely upon is structural. Further, the following Markov assumption is made

P(yu|g) = P (yu|Nu)

where yu is the class value of node u. It follows from the above considerations that

a wise choice of the edge type is crucial to the success of the classification task. The contribution of the paper stems from thorough experiments on a number of datasets and the study of the impact of different architectures. Specifically, the so-called “node-centric network learning system” can be divided into three main parts , which are (i) a Non-relational model, which outputs a class label based on attributes present on a single node (it considers a node as a training example on its own), (ii) a Relational model, which looks also at neighbouring values as well as the kind of relation that binds an entity to another, and (iii) a Collective inference module that determines how unknown values are estimated together, possibly influencing each other, and how information is diffused. The definition of a relational model is often recursive, in the sense that the learnt P (yu = c|N (u))

is defined in terms of P (yv = c|N (v))∀v ∈ N (u). Nonetheless its value is often

approximated by the “current” class of node v during the iterative procedure of collective inference. Three approaches have been adopted in [29] to predict the unknown labels. We start with Gibbs sampling:

1. First of all, sample a class for each vertex v ∈ V ertU _{from its distribution,}

computed by means of the local classifier model M. Such distribution is to be considered as an approximation of the probability that the vertex takes a certain class given its neighbourhood.

(49)

1.2. MODELS FOR GRAPHS 49 3. Update one vertex at a time, sampling from the distribution obtained by the

application of the relational classifier.

4. Repeat prior step for a number of times without keeping statistics: this is known as the burnin period.

5. Repeat again for a fixed number of iterations, counting how many times each node gets assigned a particular class labels. A normalisation of this vector of counts returns the final class probability estimate.

Gibbs sampling has the advantage of assigning a class at each node since the very first iteration, so the relational classifier inference procedure is straightforward. The next procedure we consider is called Relaxation Labeling. Differently from Gibbs sampling the update is synchronous, meaning that every vertex in V ertU _is

modified at each iteration

1. Initialize the prior distribution for each vertex as done previously

2. For each element in V ertU _{apply the relational model, which outputs another}

probability distribution for that element.

3. Repeat for a fixed number of iterations, and output the final class probability distribution for each node.

This method allows the inference procedure to retain uncertainty between itera-tions, and has proven to be particularly effective when the percentage of known labels is small. Note that a probability distribution is outputted instead of a class label. Finally, we present the Iterative classification characteristics in light of what we have already seen. Indeed, just as Gibbs sampling does, IC assigns a vertex the most likely label according to the distribution of the relational model. Unlike Gibbs sampling, however, IC performs synchronous updates (in this sense, it resembles the Relaxation Labeling procedure).

Experimental analysis shows that Relaxation labeling outperforms the other two methods on different datasets. When coupled with simulated annealing (which decreases the influence of new estimates over time) we can ensure convergence. We have not said yet how to perform training of the local and relational models. In practice, the graph used for training is a subset of the original one. A local and

(50)

relational models are trained on it and then the original graph is used for testing, with the aim of predict the previously removed vertices’ classes.

We now turn our attention to the relational classifiers used in [29]. Indeed, a strong assumption is made, which is that a node class is more likely to be similar to the most frequent class among its neighbours. This is called the Ho-mophily assumption, and has worked well for the tasks the authors considered. For example, the Weighted-Vote Relational Neighbour Classifier (wwRN) estimates P(yu = c|N (u)) = _Z1

P

v∈N (u)wu,vP(yv = c|N (v)), where Z is a normalising term

and wu,v is the weight of the arc from u to v.

However, when a model is trained to capture more general structural differences, it will not necessarily learn to emit labels according to the homophily ”rule”. Thus it is less obvious how such a model could achieve the same performances. We will investigate this point later.

Kernel Neural Networks

Recently the work of Lei et al. [28] has proposed a mathematically formal and interesting connection between the computation of neural modules and kernels. The so-called Kernel Neural Network exploits a recurrent architecture which can be extended either in width or in depth. Extending the width can be seen as the equivalent of unrolling the network for a sequence of n elements, while increasing the depth may realise a different kind of kernel computation.

The main idea of this approach is that we can derive a single layer computational module which embeds graphs similarity. For example, if we are dealing with sen-tences, we can construct a deep and wide architecture such that the n-th layer performs the n-gram kernel computation i.e. counts how many n-grams are shared between two sequences of words. The width in this case represents the size of the sentence to be taken into consideration. Similarly, the authors show how a neural module can be derived from a Random Walk kernel for graphs: the computation of the internal state of a recurrent neuron corresponds to the kernel function applied to the entire graph and a particular substructure, whose identity depends on the specific neuron. There is one recurrent neuron per vertex in the graph, and the internal state transition involves the use of neighbouring states located at the same depth. This unifying view of kernels and recurrent neural networks may justify

A generative approach for learning contexts in graphs

UNIVERSIT `

A DI PISA

A generative approach for learning contexts in

graphs.

Abstract

Contents

Mathematical notation

Introduction

Chapter 1

Background

1.1

Models for trees

1.1.1

Hidden Tree Markov Model: the top-down approach

1.1.2

Bottom-Up Hidden Tree Markov Model: the

bottom-up approach

1.1.3

Input-Output BHTMM: an extension to input/output

transductions

1.1.4

BDIO: Modelling bi-directional tree contexts

1.1.5

Tree Echo State Network

1.2

Models for graphs

1.2.1

NN4G: The neural network for graphs

1.2.2

GNN: The graph neural network model

1.2.3

Graph Echo State Network

1.2.4

Kernel Methods

1.2.5

Univariate Collective Inference for node

classifica-tion