
Dipartimento di Informatica

Master's Degree Programme in Computer Science

(Curriculum “Artificial Intelligence”)

Master's Thesis

Title

Deep Cascade Architectures of Neural Networks for Graphs

Candidate

Domenico Tortorella

Supervisor

Alessio Micheli


Abstract

The application of the cascade correlation algorithm to automatically construct deep neural networks implementing classification or regression models for graph input data has already been proposed by A. Micheli with the "Neural Network for Graph" (NN4G) model. The aim of this thesis is to propose novel architectures as extensions of the NN4G model, by evaluating new cascade expansion strategies, unit types, graph pooling methods, and a validation-based stop condition, and by assessing their performance within a double cross-validation framework on nine popular graph classification tasks. Our models have achieved performance superior or comparable to state-of-the-art models already evaluated on the same datasets with the same validation framework, are much more succinct (i.e. they require a considerably lower number of units overall), and can be trained on the order of minutes. We have also discovered the crucial role played by layer width in the generalization ability of cascade models, thus attaining performance hitherto unattainable by the classic NN4G model.



Contents

Chapter 1. Introduction

Chapter 2. Graph Neural Networks
2.1. Graphs
2.2. Model structure
2.3. Steady-state models
2.4. Spectral models
2.5. Message-passing models
2.6. Pooling
2.7. Kernel-based models

Chapter 3. Automatic Construction of Neural Networks
3.1. Neural networks
3.2. Cascade correlation
3.3. Cascade correlation for structures and graphs

Chapter 4. Novel architectures
4.1. Cascade construction
4.2. Computation in layers
4.3. Graph pooling
4.4. Readout
4.5. Path-like models

Chapter 5. Experiments
5.1. Datasets
5.2. Experimental set-up
5.3. Comparing models
5.4. Contextual layers
5.5. Advanced layers
5.6. Spectral layers
5.7. Path-like models
5.8. Isomorphism and other data issues

Chapter 6. Analysis and observations
6.1. Features
6.2. Generalization ability
6.3. Wide layers
6.4. Observations
6.5. Final results

Chapter 7. Conclusions

Bibliography


CHAPTER 1

Introduction

Machine learning. Examples of machine learning technology can be found in many aspects of our modern society: from recommendations on e-commerce websites or on content providers, to image classification, object identification, automatic translation, speech recognition, and many more.

As a sub-field of Artificial Intelligence, machine learning mostly aims to build statistical models (as opposed to symbolic, rule-based, or expert-designed models) from data, together with algorithms for their training, in order to solve a specific task at hand, e.g. assigning a category to an image. Learning (as the process of inferring a model from data is called) can be either supervised, when the target labels or ground truth values of our samples are available; semi-supervised, when our data contains both labelled and unlabelled samples; or unsupervised, when our task is to learn a structure or a representation for our data. In this thesis we are concerned with supervised learning, and in particular with classification.

Conventional machine learning techniques were limited in their ability to directly process raw data, requiring hand-engineering of vector features by experts (e.g. histograms of oriented gradients for object-recognition models on images).

Representation learning is a set of methods that allows a machine to be fed with raw data and to automatically discover the vector (or matrix) representations needed for detection or classification. Deep learning models [Lecun et al., 2015] are representation-learning methods with multiple levels of representation, obtained by composing simple but non-linear modules (e.g. 'neurons' in neural networks) that each transform the representation at one level (starting with the raw input) into a representation at a higher, slightly more abstract level, producing a "hierarchy" of features without requiring human engineering.

With the composition of enough such transformations, very complex functions can be learned. In most common cases model parameters are learnt by optimizing a suitable loss function: cross-entropy for classification, reconstruction error for auto-encoders, and so forth.

This has allowed deep learning models to reach state-of-the-art performances on tasks across multiple domains, from computer vision to natural language, beating traditional AI approaches.


Still, “traditional” deep learning models can deal only with tensor-shaped data.

Graph data and models. Deep models have the ability to extract a hierarchy of representations from raw data, but not all data comes in forms that can be easily fit into a vector feature representation: for example, data expressing relationships between entities, such as chemical bonds between atoms in a molecule or between amino acids in a protein, interactions between users of a social network, and so forth.

Two main approaches have been developed to treat such graphs directly: kernel models, and neural models.

Kernel-based approaches [David Haussler, 1999; Kondor and Lafferty, 2002; Gärtner et al., 2003; Kashima et al., 2003; Ralaivola et al., 2005; Shervashidze et al., 2011; Yanardag and Vishwanathan, 2015] implement similarity measures between graphs which satisfy certain properties, in order to be applied to classic kernel-based models such as support vector machines. Typical graph kernels are computationally expensive, and require expert knowledge to choose the right one for the task at hand.

On the other side, neural graph models extend the input space of neural networks to handle structured data. This approach was pioneered by Scarselli et al. [2009] and Micheli [2009], who extended previous neural models for trees and directed acyclic graphs to general graphs. Nowadays this approach has resurfaced in graph convolutional networks (GCN) and other models [Gilmer et al., 2017; Battaglia et al., 2018].

Challenges. Training deep models in general, and graph models in particular, poses a range of different problems.

First of all, model topology (the number of layers, hidden features, type of connections, etc.) is part of the hyper-parameter space that needs to be explored in a standard model selection procedure. As models grow bigger, the cost of a proper, fine-grained search gets worse and worse.

The other issue is inherent to the difficulty of training large models end-to-end. Gradient propagation in deep graph neural models has been found particularly difficult, and their ability to learn significant features in deeper layers has been questioned [Luzhnica et al., 2019; Li et al., 2018]. This is a critical problem for graph neural models, since the model depth determines how large a context (i.e. the neighbouring portion of the graph) is taken into account to produce the feature representation for graph vertices [Micheli, 2009].

Objectives. Our purpose in this thesis is to continue and extend the research line initiated in Micheli [2009] with the Neural Network for Graph model (NN4G) aimed at addressing these challenges:


(1) removing the burden of topology selection, by developing constructive procedures that allow selecting, to the best extent possible, the model dimension as a by-product of training;

(2) overcoming the depth limitation of current end-to-end graph neural models;

(3) possibly requiring lower training time with respect to end-to-end models.

This will be achieved by developing a constructive framework based on cascade correlation, which automatically builds the model structure during training. Constructive methods for unstructured, vector data are not new in themselves, and have been known in literature since the early 90s [Fahlman and Lebiere, 1990; Kwok and Yeung, 1997], but with rare exceptions have been neglected in the last decade [Marquez et al., 2018].

The original approach by Micheli [2009] was actually based on cascade correlation, with depth-only (i.e. one unit per layer) cascade expansion. Building upon this, we will explore a wide range of architectures, differing by layer type, graph pooling, and network expansion strategy, testing them on the graph datasets most commonly used in current literature.

We will answer several questions:

• What is the role played by layer width with respect to model accuracy, i.e. does the model structure also affect performance, besides depth and size?

• What is the role played by unit complexity, i.e. how expressive should cascade units be?

• Are our cascades effectively able to produce non-trivial features in deeper layers, i.e. do they suffer from the same smoothing problem as seen in end-to-end models?

We will also follow the evaluation framework set by Errica et al. [2020] to assess the performances of our constructive models, and to compare them fairly with the main end-to-end graph neural models available in the literature. We will thus follow best practices in Machine Learning research regarding model validation, by employing a double cross-validation scheme to estimate model accuracy. This offers a standard, fair setting for model assessment, since the current literature on graph models has frequently been found lacking proper performance evaluation [Shchur et al., 2018].

Outline. We now present a summary of the next chapters:

• in Chapter 2 we review briefly the common approaches to building graph models, focusing in particular on neural graph models;


• Chapter 3 presents the main literature on the automatic construction of neural networks, focusing on the cascade correlation framework and its application to structured data;

• we propose our models in Chapter 4;

• Chapter 5 contains the bulk of our experiments, in which different variants of our constructive models (by network expansion, layer type, etc.) are compared against each other with respect to achieved test accuracy and dimension of the constructed model;

• in Chapter 6 we perform a detailed analysis of the features learnt by our models, and of their generalization capabilities;

• finally, in Chapter 7 we draw conclusions on our work, summarizing the main accomplishments, and outlining further directions of future research.


CHAPTER 2

Graph Neural Networks

In this chapter we will present a brief outline of the main neural¹ (as opposed to Bayesian or generative) approaches to graph classification. Rather than being exhaustive and complete, the purpose of this introduction is to present the main routes taken by the research in this field related to the proposed models, as well as the main concepts that will be needed later.

2.1. Graphs

2.1.1. Basics. A graph G(V, E) consists of a set of vertices (or nodes) V and a set of edges E ⊆ V × V; an edge (u, v) ∈ E is said to be directed from u to v (that is, "u → v"). A graph is said to be undirected if (u, v) ∈ E implies (v, u) ∈ E; otherwise, the graph is said to be directed.

The neighbourhood of a vertex v ∈ V is the set N(v) of all vertices sharing an edge with v. For directed graphs, we can distinguish between incoming and outgoing edges, thus defining N⁺(v) = {u ∈ V : (u, v) ∈ E} and N⁻(v) = {u ∈ V : (v, u) ∈ E}, respectively; therefore, N(v) = N⁺(v) ∪ N⁻(v). The degree of a vertex v is the cardinality of its neighbourhood, deg(v) = |N(v)|.

¹ We follow the subdivision of Bacciu et al. [2020], to which we also refer the reader interested in the other approaches.

Figure 2.1. Basic graph concepts.


Figure 2.2. Examples of (a) time signals and (b) graph signals. The colour denotes the signal value at each graph vertex.

Furthermore, to each vertex v can be associated a label l(v) ∈ R^D, and in the same way edges (u, v) could have attributes a(u, v) ∈ R^{D'}. In this case, the graph is said to be (vertex- or edge-) labelled.

In this work, we will focus on undirected, possibly vertex-labelled graphs, with the task of predicting the category y ∈ {1, ..., C} to which each graph belongs.

2.1.2. Graph signals. Other than as combinatorial objects per se, graphs can be seen as the support for signals [Shuman et al., 2013; Ortega et al., 2018]. In the same way continuous- or discrete-time signals are defined as functions x(t) : R → R or x(k) : Z → R [Oppenheim et al., 1996], a signal x(v) supported on the graph G(V, E) is defined as the function x : V → R. This allows us to gain the perspective (as well as the main tools) of the signal processing field, which will later be useful both for the definition of spectral models and for their analysis.

The core concept is that of the spectrum of a signal, which is obtained from the eigenspace of the Laplacian operator: the Fourier version x̂ of a signal x is in fact a change of coordinates into the Laplacian eigenbasis. In the case of continuous-time signals, for example, the Laplacian operator is ∂²/∂t², and a continuous basis of eigenfunctions is given by e^{−2πift}, with frequencies f ∈ R.

Analogously, a graph G(V, E) has its own Laplacian operator L. Given an arbitrary ordering of the nodes V, we can represent a graph signal x(v) as the column vector x (we will refer to both with the same symbol x, as an abuse of notation). The Laplacian can thus be expressed as the matrix L = D − A, with D_{ii} = deg(v_i) and A_{ij} = ⟦(v_i, v_j) ∈ E⟧ as the degree and adjacency matrices, respectively². Consequently, the spectrum is given by the eigenvectors u_0, ..., u_{|V|−1} of L, with the eigenvalues λ_0 ≤ λ_1 ≤ ... ≤ λ_{|V|−1} ∈ R_+ playing the role of frequencies. In Figure 2.3 we can see that, similarly to time signals, lower frequencies are associated with slowly-varying eigenfunctions, with λ_0 = 0 always corresponding to the constant signal.

Figure 2.3. Some eigenfunctions of a graph spectrum. Bars on graph vertices denote signal magnitude (upward blue for positive values, downward black for negative).

In the literature the (symmetric) normalized Laplacian L_norm = D^{−1/2} L D^{−1/2} is also defined, whose eigenvalues have the nice property of always lying in the range [0, 2]. Sometimes the spectrum of the adjacency matrix A can also be used to analyse graph signals.
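As a concrete illustration of these definitions, the following NumPy sketch (our own minimal example, not code from the thesis; the 4-vertex path graph is an arbitrary choice) builds the Laplacian and its normalized variant for a small graph and computes their spectra, showing that λ_0 = 0 corresponds to the constant eigenvector and that the eigenvalues of L_norm fall in [0, 2].

```python
# Minimal NumPy sketch (our illustration): Laplacian and normalized Laplacian
# spectra of a 4-vertex path graph 0-1-2-3.
import numpy as np

A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)     # adjacency matrix

D = np.diag(A.sum(axis=1))                    # degree matrix
L = D - A                                     # combinatorial Laplacian
D_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(D)))
L_norm = D_inv_sqrt @ L @ D_inv_sqrt          # symmetric normalized Laplacian

lam, U = np.linalg.eigh(L)                    # eigenvalues play the role of frequencies
print(lam)                                    # first eigenvalue is 0 ...
print(U[:, 0])                                # ... with a constant eigenvector
print(np.linalg.eigvalsh(L_norm))             # eigenvalues of L_norm lie in [0, 2]
```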

2.2. Model structure

The basic neural graph models can be partitioned into three components, as depicted in Figure 2.4:

(1) an isomorphic transduction, which takes a (labelled) graph G(V, E) as input, producing as output a graph with the same topology as G along with node features x(v) ∈ R^N;

(2) a pooling layer, which transforms the multi-set of node features {x(v) : v ∈ V} into a single vector representation X ∈ R^N for the whole graph³;

(3) a readout layer, which finally produces the target prediction y ∈ R^C from X.

² ⟦P⟧ denotes the Iverson bracket, which assumes value 1 when P is true and 0 otherwise [Knuth, 1992].

³ Not necessarily true, e.g. in the case of multi-head attention pooling, but we …


Figure 2.4. A high-level structure of neural graph models. We can refer to the first two components as the "encoding" part of the model, since it encodes a graph G into its vector representation X. The encoder and the readout can either be trained separately, or in an end-to-end fashion.

The readout can either be as simple as a linear function X ↦ W X + b, trained e.g. by pseudo-inverse, or a multi-layer neural network (see Chapter 3), or a support vector machine (SVM).

For a more complex framework which takes into account interactions between edge, node, and graph features, we refer to Battaglia et al. [2018].

2.3. Steady-state models

Recursive neural networks (RecNN) paved the way to the adaptive processing of structured data [Sperduti and Starita, 1997; Frasconi et al., 1998]. Two core ideas generalise the processing of time sequences by recurrent neural networks (RNN) to trees and directed acyclic graphs (DAGs):

(1) a partial ordering of the nodes present in the structures, given by the topological ordering inferred from directed edges;

(2) the recursive decomposition into sub-structures (a parent node and its children, or a vertex and its incoming edges), which in turn are processed recursively by the same encoding function.

The models described in the following are able to handle undirected graphs by ditching the total ordering assumption and by carrying out the local processing of vertex–neighbourhood sub-structures.

2.3.1. The GNN. The first approach attested in the literature to handle graphs in such a way was proposed by Scarselli et al. [2009], extending the adaptive framework of recursive neural networks to process structures with cycles.

Differently from trees and DAGs, the states x(v), x(v') of the two nodes of an undirected edge (v, v') are mutually dependent. The solution adopted consists of learning the node features as the steady states of a corresponding dynamical system.


A local, iterative map F(x, l) : R^D × R^N → R^N is defined as

(1)    x_{t+1}(v) = \tanh\Big( W_{in}\, l(v) + \sum_{v' \in N(v)} \hat{W}\, x_t(v') + b \Big),

where x_t(v) denotes the state of vertex v at time-step t. Equation (1) is applied until convergence to a global steady state is reached (e.g. by fixing a threshold on ||x_{t+1} − x_t||). The existence of such a state is ensured when F(x, l) is contractive, i.e. it satisfies

(2)    ||F(x, l) − F(x', l)|| ≤ μ\, ||x − x'||    ∀ x, x' ∈ R^N, l ∈ R^D,

for 0 < μ < 1. A GNN model is trained end-to-end, with a corrective term added to the classification loss in order to ensure that property (2) is satisfied.

Since such training is onerous, due to the alternation between stabilization and recurrent back-propagation, an alternative has been developed.
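A minimal sketch of the steady-state computation of equation (1), under our assumption that the recurrent weights are kept small enough for the map to be contractive (NumPy only, illustrative graph and shapes):

```python
# Sketch (our assumption) of the fixed-point iteration (1): node states are
# updated until convergence; contractivity is assumed to be enforced by keeping
# the recurrent weights W_hat small.
import numpy as np

def gnn_steady_state(adj, labels, W_in, W_hat, b, tol=1e-6, max_iter=500):
    """Iterate x_{t+1}(v) = tanh(W_in l(v) + sum_{v' in N(v)} W_hat x_t(v') + b)."""
    x = np.zeros((adj.shape[0], W_hat.shape[0]))
    for _ in range(max_iter):
        x_new = np.tanh(labels @ W_in.T + adj @ x @ W_hat.T + b)
        if np.linalg.norm(x_new - x) < tol:      # convergence check on the global state
            return x_new
        x = x_new
    return x

rng = np.random.default_rng(0)
adj = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)   # a small triangle graph
states = gnn_steady_state(adj, rng.normal(size=(3, 2)),          # random 2-dimensional labels
                          W_in=rng.normal(size=(4, 2)),
                          W_hat=0.1 * rng.normal(size=(4, 4)),   # small norm -> contractive map
                          b=np.zeros(4))
```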

2.3.2. Graph ESN. Echo state networks were originally designed to handle temporal sequences [Jaeger, 2001]. Instead of training the state transition function, echo state networks (ESN) rely on the random initialization of the input and state transition matrices W_in and Ŵ in equation (1). The latter matrix is then rescaled in order for it to satisfy the echo state property (ESP): for every input l, (1) must converge to the same steady state independently of the initial state x_0 at t = 0 (property (2) is a sufficient condition for the ESP) [Yildiz et al., 2012; Gallicchio and Micheli, 2017].

Gallicchio and Micheli [2010, 2020] applied the efficient reservoir approach of ESNs to graphs (in the encoding part of the models, with the readout usually trained by pseudo-inverse), with interesting results.

2.4. Spectral models

Owing to the effectiveness of convolutional neural networks (CNN) on computer vision tasks, efforts have been made to extend them to the non-regular grids of graphs.

One way of doing so is to mimic the convolution operator in the spectral domain. Let us recall from signal processing [Oppenheim et al., 1996] that the convolution between a discrete-time input signal x(t) and a filter (also called a 'kernel') g(t) is defined as

(3)    g(t) * x(t) = \sum_{\tau \in \mathbb{Z}} g(\tau)\, x(t − \tau).

Convolution on images is simply the extension of (3) to the bi-dimensional, limited-support case. Given g̃(f) = F(g(t)) and x̃(f) = F(x(t)) as the respective signals expressed in the Fourier domain, we have

(4)    g(t) * x(t) = F^{−1}\big( g̃(f)\, x̃(f) \big),

i.e. convolution corresponds to an element-wise product in the Fourier domain.


Figure 2.5. An example of a low-pass graph filter.

Spectral models thus exploit equation (4) to define convolution on graphs.

2.4.1. Graph filtering. As introduced earlier in this chapter, the spectrum of a graph signal x (in its vector representation) is defined in terms of the eigendecomposition of the graph Laplacian matrix, L = U Λ U^T, with the columns u_i of U representing eigenfunctions, and the associated eigenvalues λ_i playing the role of frequencies.

The graph Fourier transform (GFT) is thus defined as x̃ = F(x) = U^T x, along with its inverse x = F^{−1}(x̃) = U x̃. Therefore, filtering can easily be done by an element-wise product ⊙ in the spectral domain:

(5)    g * x = F^{−1}( g̃ ⊙ x̃ ) = U \big( (U^T g) ⊙ (U^T x) \big).

In Figure 2.5 we can see an example of a low-pass filter.
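The following NumPy sketch (our own illustration; the graph, signal, and threshold filter are arbitrary choices) applies equation (5) directly: it moves a signal into the Laplacian eigenbasis, multiplies it element-wise by a filter defined on the eigenvalues, and transforms it back.

```python
# Illustrative sketch (our assumption, NumPy only): filtering a graph signal in
# the spectral domain as in equation (5), using the Laplacian eigenbasis U.
import numpy as np

def graph_filter(L, x, g_hat):
    """Apply a spectral filter g_hat (a function of the eigenvalues) to signal x."""
    lam, U = np.linalg.eigh(L)        # L = U diag(lam) U^T
    x_hat = U.T @ x                   # graph Fourier transform
    y_hat = g_hat(lam) * x_hat        # element-wise product in the spectral domain
    return U @ y_hat                  # inverse transform

# Example: a low-pass filter that keeps only frequencies below a threshold.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(1)) - A
x = np.array([1.0, -2.0, 0.5, 3.0])
y = graph_filter(L, x, lambda lam: (lam < 1.5).astype(float))
```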

2.4.2. Filtering models. A class of filters that can easily be learned and computed are polynomial filters, which can be expressed as g̃(λ) = \sum_k θ_k λ^k in the frequency domain, and then applied as

(6)    g * x = U \Big( \sum_k θ_k Λ^k \Big) U^T x.


In particular, a polynomial filter can also be written as its expansion in Chebyshev polynomials, which are formally defined as

(7)    T_0(ω) = 1,    T_1(ω) = ω,    T_k(ω) = 2ω\, T_{k−1}(ω) − T_{k−2}(ω).

With some algebra, we can see that in this case a filter takes the form

(8)    g * x = U \Big( \sum_k θ_k T_k(Λ) \Big) U^T x = \Big( \sum_k θ_k T_k(L) \Big) x.

ChebNet [Defferrard et al., 2016] employs the recurrent definition (7) for its layers: x_ℓ = 2L x_{ℓ−1} − x_{ℓ−2}.
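A sketch of how the right-hand side of (8) can be evaluated without any eigendecomposition, using only repeated multiplications by L (our illustration; note that ChebNet additionally rescales the Laplacian so that its spectrum lies in [−1, 1], which is omitted here):

```python
# Sketch (our assumption): evaluating the Chebyshev filter of equation (8)
# via the three-term recurrence (7) applied to L x, without eigendecomposition.
import numpy as np

def chebyshev_filter(L, x, theta):
    """Return sum_k theta[k] * T_k(L) x."""
    t_prev, t_curr = x, L @ x                 # T_0(L) x and T_1(L) x
    out = theta[0] * t_prev
    if len(theta) > 1:
        out = out + theta[1] * t_curr
    for k in range(2, len(theta)):
        t_next = 2 * (L @ t_curr) - t_prev    # T_k = 2 L T_{k-1} - T_{k-2}
        out = out + theta[k] * t_next
        t_prev, t_curr = t_curr, t_next
    return out

L = np.array([[1., -1., 0.], [-1., 2., -1.], [0., -1., 1.]])   # Laplacian of a 3-vertex path
y = chebyshev_filter(L, np.array([1.0, 0.0, -1.0]), theta=[0.5, 0.3, 0.2])
```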

2.5. Message-passing models

Another way to extend the CNN model for images to graphs is to directly mimic the convolution operation on vertices, without resorting to the spectral formulation. As CNNs do for each pixel of a regular grid, this class of models produces a feature locally for each vertex and its non-regular neighbourhood.

The first model to apply this approach was the 'Neural Network for Graphs' (NN4G) model by Micheli [2009], albeit within the framework of constructive models for recurrent structures [Sperduti and Starita, 1997]. Differently from steady-state models, the encoding features are produced by layering neighbourhood 'convolutions', instead of by the fixed point of a recurrent map. We will present the NN4G model in detail in Chapter 3.

These models are also called 'message-passing models' [Gilmer et al., 2017], since each vertex can be seen as receiving a 'message' from each of its neighbours, which are then used to produce the output feature of the vertex. As in CNNs, multiple message-passing layers can be stacked (i.e. a layer receives as input the node features produced as output by the previous layer) to produce the final encoding.

The operation of a message-passing layer (producing new node features x'(v) from x(v)) can be described by three functions:

(1) message(·), which computes the 'messages' h_{v'→v} from the neighbours v' ∈ N(v) of node v, for example

h_{v'→v} = message(x(v), x(v'), a(v, v')),

taking as input the features of the two nodes and the features of the edge connecting them;


Figure 2.6. Message-passing layer.

(2) aggregate(·), which encodes the multi-set {h_{v'→v} : v' ∈ N(v)} into h'_v in a permutation-invariant way;

(3) combine(x(v), h'_v), which produces the new vertex features x'(v) by combining the vertex features with the aggregated neighbourhood messages.

The three steps are depicted in Figure 2.6.

Message-passing models do not need to keep such a strict subdivision of the three functions. For example, the Graph Isomorphism Network (GIN) [Xu et al., 2019] implements all of them with a single multi-layer perceptron applied to the sum of neighbour node features,

x'(v) = mlp\Big( \sum_{v' \in N(v) \cup \{v\}} x(v') \Big).
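The sketch below (our own minimal rendering, not the thesis implementation) makes the three steps concrete for the common case where message is the identity, aggregation is a sum (as in GIN), and combine is an affine map followed by a saturating activation:

```python
# Minimal message-passing layer sketch (our construction): identity messages,
# sum aggregation over neighbours, and an affine combine followed by tanh.
import numpy as np

def message_passing_layer(adj, X, W_self, W_neigh, b):
    """adj: |V| x |V| adjacency; X: |V| x F node features. Returns |V| x F' features."""
    aggregated = adj @ X                                     # sum of neighbour features per vertex
    return np.tanh(X @ W_self + aggregated @ W_neigh + b)    # combine own and aggregated features

# Example usage on a 3-vertex triangle with 2-dimensional features.
rng = np.random.default_rng(0)
adj = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
X = rng.normal(size=(3, 2))
H = message_passing_layer(adj, X, rng.normal(size=(2, 4)), rng.normal(size=(2, 4)), np.zeros(4))
```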

Finally, consider that spectral models defined as in (8) can be implemented by an appropriate number of message-passing layers, since the Chebyshev polynomials T_k(L) are defined recurrently, and the Laplacian acts locally on a vertex and its neighbours.

2.6. Pooling

We will now focus on the pooling component of the graph neural model. Its purpose is to transform the node features {x(v) : v ∈ V} into a single vector representation X ∈ R^N for the whole graph, in a permutation-invariant way (thus the same concepts of this section can be applied to the aggregate(·) function for a node neighbourhood).

2.6.1. Simple. The simplest (and only) permutation-invariant linear function is the sum [Maron et al., 2018]. Mean and maximum can also be used (and in fact were already applied by Micheli [2009]), but these have a reduced representative power: e.g. the mean cannot distinguish between sets of features with identical distribution but different cardinality [Xu et al., 2019].

Figure 2.7. Decreasing representativeness of some pooling methods (from Xu et al. [2019]).

(9)     sum-pool(x) = \sum_{v \in V} x(v)

(10)    mean-pool(x) = \frac{1}{|V|} \sum_{v \in V} x(v)

(11)    max-pool(x) = \max_{v \in V} x(v)

2.6.2. Attention. While the simplest forms of pooling do not require trainable parameters, a more complex strategy such as attention pooling does. Originally employed for sequences in the natural language processing field, it allows the model to focus on the 'most relevant' nodes thanks to a distribution α_v, which can be computed for example as α_v = softmax(f(x(v))):

(12)    attention-pool(x) = \sum_{v \in V} α_v\, x(v),    with  \sum_{v \in V} α_v = 1.

The same mechanism can be adopted for neighbourhood aggregation, for example using α_{v,v'} = softmax(f(x(v), x(v'), a(v, v'))).
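A minimal sketch of equation (12), under our assumption that the score function f is a linear map of the node features (NumPy only, illustrative data):

```python
# Sketch of attention pooling as in equation (12) (our assumption): scores are a
# linear function of node features, normalized over the vertices by a softmax.
import numpy as np

def attention_pool(X, w, b=0.0):
    """X: |V| x F node features; w: F-dim scoring vector. Returns an F-dim graph vector."""
    scores = X @ w + b
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()            # softmax over vertices: sum_v alpha_v = 1
    return alpha @ X                       # weighted mean of the node features

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
X_graph = attention_pool(X, rng.normal(size=3))
```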

2.6.3. Others. A completely different approach is that taken by DiffPool [Ying et al., 2018], which learns to group the nodes of a graph in clusters to produce an output graph with a new, reduced topology.

2.7. Kernel-based models

The classes of models presented so far in this chapter are not the only ones devised to handle graph-structured data. Departing from the neural adaptive framework, kernel methods have held state-of-the-art performance on numerous tasks for many years.

Graph kernels [Vishwanathan et al., 2010; Shervashidze et al., 2011; Yanardag and Vishwanathan, 2015] use classical graph algorithms to produce a similarity score κ(G, G') between two graphs G, G'. When κ satisfies some properties [Shalev-Shwartz and Ben-David, 2014], it induces a reproducing kernel Hilbert space, thus allowing the use of a support vector machine (SVM) or other kernel-based models to carry out graph classification.

Figure 2.8. A graph kernel SVM.

graph kernel      complexity
random walk       O(|V|^6) (can be reduced to O(|V|^3))
sub-tree          O(|V|^2 4^d h) (d = node degree, h = sub-tree height)
cyclic pattern    NP-hard
shortest path     O(|V|^4)
graphlet          O(|V|^k) (k = graphlet vertices, usually 3–5)

Table 2.1. Computational complexity of some graph kernels.

The most common of these are based on the R-convolution framework, where similarity between graphs is given by the count of common sub-structures (e.g. matching sub-graphs or sub-trees, random walks, etc.). There also exist kernels based on the Weisfeiler–Lehman test, and on other paradigms [Kriege et al., 2020; Nikolentzos et al., 2019].

Albeit able to achieve high accuracies, this approach has two severe issues: computing graph similarities is typically highly costly, and kernels need to be tailored for each task, requiring expert knowledge. For example, we can contrast the computational complexities reported in Table 2.1 for some common graph kernels with the typical complexity of message-passing models, which require O(|E|) feature computations and O(|V|) feature aggregations of d terms for each layer.


CHAPTER 3

Automatic Construction of Neural Networks

In Chapter 2 we introduced the basic concepts of neural models for graphs. We will now take a step back and focus on how deep neural networks are constructed. In particular, we will see how the difficulty of training deep models can be overcome with approaches such as cascade correlation, and how this has then been applied to neural models for structured data, and graphs in particular, by presenting the NN4G model.

3.1. Neural networks

In this section we will introduce the basic concepts about neural networks, the advantages of deep architectures and the challenges posed by their training.

3.1.1. Basic concepts. In the classic setting, a neural network implements a function f_W : R^D → R^C with a set of parameters W to be trained.

Neurons. A "neuron" is the basic unit of neural networks. It is defined by a simple affine transformation of its input, x ↦ w^T x + b, followed by a non-linear activation function φ(·):

x_{ℓ,i} = φ( w_{ℓ,i}^T x_{ℓ−1} + b_{ℓ,i} ),

where the subscripts ℓ, i identify the unit's position inside the network (i.e. the i-th unit of the ℓ-th layer), while w_{ℓ,i} and b_{ℓ,i} are its trainable parameters.

We can distinguish two classes of activation functions: saturating activations, which have a compact function range, and non-saturating activations.

The main ones are reported in Table 3.1, while we can see two examples in Figure 3.1.

Networks. In order to represent complex functions, neurons are arranged in networks. For example, the units of multi-layer perceptrons (MLPs, Figure 3.2) are arranged into layers, with each unit i of layer ℓ receiving as input the output x_{ℓ−1} of the previous layer, and producing the scalar x_{ℓ,i} as output. Networks can also have different, less regular connection structures, and a differing number of units per layer.


name                        definition                          range
tanh                        (e^x − e^{−x}) / (e^x + e^{−x})     [−1, 1]
sigmoid                     1 / (1 + e^{−x})                    [0, 1]
softmax                     e^{x_i} / \sum_j e^{x_j}            [0, 1]
linear                      x                                   (−∞, +∞)
relu                        x ⟦x ≥ 0⟧                           [0, +∞)
relu with negative slope    x ⟦x ≥ 0⟧ + 0.1x ⟦x < 0⟧            (−∞, +∞)

Table 3.1. Saturating and non-saturating activations.

Figure 3.1. Two examples of activation functions.

Figure 3.2. A neural network.

Training. The parameters W = {w_{ℓ,i}, b_{ℓ,i} : ∀ ℓ, i} of a neural network f_W are trained by minimizing a loss function L({(y^{(s)}, ŷ^{(s)})}_{s∈D}) of targets y^{(s)} ∈ R^C and predictions ŷ^{(s)} = f_W(x^{(s)}) over a training set D.


Figure 3.3. An example of a deep learning model (LeNet).

This optimization problem is usually solved by methods such as stochastic gradient descent (SGD), with regularization terms added to the loss L, e.g. the 2-norm of the weights.

The most common loss functions are the mean squared-error (MSE) loss (used for both regression and classification), and the cross-entropy loss (used only for classification problems):

(13)    L_{MSE} = \frac{1}{|D|} \sum_{s \in D} \big( y^{(s)} − ŷ^{(s)} \big)^2,

(14)    L_{crossentropy} = −\frac{1}{|D|} \sum_{s \in D} \sum_{1 ≤ c ≤ C} y_c^{(s)} \log(ŷ_c^{(s)}).

Representative power. The universal approximation theorem ensures that any continuous function on a compact set can be approximated arbitrarily well by a two-layer neural network with a suitable number of hidden units [Cybenko, 1989]. The core issue is: how many units should a neural network have?

3.1.2. Deep learning. While the universal approximation theorem requires only a single hidden layer, it has been proven that multi-layer models are much more concise, having a total number of units that can be logarithmic with respect to corresponding shallow models [Telgarsky, 2016; Eldan and Shamir, 2016; Cohen et al., 2016].

The idea of stacking a large number of layers (be it neural units, convolutional filters, or restricted Boltzmann machines) is at the heart of deep learning. Deep models are able to achieve state-of-the-art performances in many fields (e.g. computer vision, machine translation, etc.) by building a hierarchy of features, starting from the raw input up to more and more abstract representations [Lecun et al., 2015; Pouyanfar et al., 2018].

The ever-increasing number of layers and units of such models exposes two challenges:


Figure 3.4. The original cascade correlation architecture by Fahlman and Lebiere [1990].

(1) overcoming issues with gradient propagation and other training difficulties, thanks to many techniques such as data augmentation, batch normalization, layer-wise pre-training, skip connections, and so on;

(2) selecting the appropriate model configuration, which is becoming more and more onerous as the size of models increases.

3.2. Cascade correlation

The cascade correlation algorithm for the automatic construction of neural networks for vector inputs was introduced by Fahlman and Lebiere [1990]. Its core idea is to iteratively refine a linear classifier by adding features correlated with the residual error (Figure 3.4).

Starting from a dataset {(l^{(s)}, y^{(s)})}_{s∈D}, an initial linear classifier ŷ = w^T l + b is trained, for example by the Moore–Penrose pseudo-inverse. To correct the residual error y − ŷ, new hidden units

(15)    x_i = σ\big( w_i^T [\, l ;\ x_{(i−1)} \,] + b_i \big),    x_{(i−1)} = [x_1\ ...\ x_{i−1}]^T,

are trained to maximize their correlation (actually, their covariance) with the residual error,

(16)    S = \sum_k |cov(x_i, E_k)|,    E_k = y_k − ŷ_k.

After a new unit has been trained, its weights are 'frozen', and a new linear classifier ŷ = w^T [\, x_{(i)} ;\ l \,] + b is retrained. The model thus expands exclusively depth-wise.

Subsequent models include different criteria for expanding the network, or other types of training for the units [Kwok and Yeung, 1997].
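To make the construction loop concrete, here is a compact NumPy sketch of cascade correlation for vector inputs (our simplified reconstruction under stated assumptions: a toy regression dataset, gradient ascent on the covariance objective, and a pseudo-inverse readout); it is not the authors' code.

```python
# Compact sketch of cascade correlation for vector inputs (our reconstruction):
# hidden units are trained one at a time to maximize |cov| with the residual error
# (16), then frozen; the linear readout is refit by pseudo-inverse after each addition.
import numpy as np

rng = np.random.default_rng(0)

def fit_readout(Z, Y):
    """Least-squares readout [W; b] via the Moore-Penrose pseudo-inverse."""
    Z1 = np.hstack([Z, np.ones((Z.shape[0], 1))])
    return np.linalg.pinv(Z1) @ Y

def predict(Z, Wb):
    return np.hstack([Z, np.ones((Z.shape[0], 1))]) @ Wb

def train_unit(inputs, residual, steps=200, lr=0.1):
    """Gradient ascent on |cov(tanh(inputs @ w + b), residual)| for a single new unit."""
    w, b = rng.normal(scale=0.1, size=inputs.shape[1]), 0.0
    for _ in range(steps):
        x = np.tanh(inputs @ w + b)
        cov = np.mean((x - x.mean()) * (residual - residual.mean()))
        grad_pre = np.sign(cov) * (residual - residual.mean()) * (1 - x ** 2) / len(x)
        w += lr * inputs.T @ grad_pre      # d|cov|/dw through the tanh saturation
        b += lr * grad_pre.sum()
    return w, b

# Toy regression task (hypothetical data).
L_in = rng.normal(size=(200, 3))
Y = (np.sin(L_in[:, 0]) + L_in[:, 1] * L_in[:, 2]).reshape(-1, 1)

features = L_in.copy()
Wb = fit_readout(features, Y)
for _ in range(5):                                   # add five cascade units
    residual = (predict(features, Wb) - Y)[:, 0]
    w, b = train_unit(features, residual)            # new unit sees labels + frozen features
    features = np.hstack([features, np.tanh(features @ w + b)[:, None]])
    Wb = fit_readout(features, Y)                    # retrain readout after freezing the unit
```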


Figure 3.5. The ‘Neural Network for Graphs’ model by Micheli [2009].

3.3. Cascade correlation for structures and graphs

The cascade correlation approach was immediately extended to recurrent neural networks (RNN) to handle sequences [Fahlman, 1991], and subsequently to recursive neural networks (RecNN) to handle structured data [Sperduti et al., 1996; Sperduti and Starita, 1997; Micheli et al., 2004].

As said in the previous chapter, the issue of mutually dependent node states on undirected edges has been addressed in two different ways: by Scarselli et al. [2009] with a steady-state model, and by Micheli [2009] with a layered model.

The 'Neural Network for Graphs' (NN4G) construction (Figure 3.5) follows the original one by Fahlman and Lebiere [1990] for vectors: the output of a linear readout is iteratively refined by new features correlated with the residual error, produced by new units trained to maximize the covariance S and added one at a time in a depth-only expansion.

In place of the original unit we have a linear combination of the vertex labels and of the previously computed neighbours' features,

(17)    x_ℓ(v) = σ\Big( w_{label}^T l(v) + \sum_{v' \in N(v)} W_{neigh}\, x_.(v') + b \Big),

where the weight W_{neigh} could also be made dependent on the vertices v, v' instead of stationary, followed by a graph pooling operation, such as mean or sum, to produce the new graph feature

(18)    X_ℓ = \frac{1}{κ} \sum_{v \in V} x_ℓ(v),    κ = 1 or |V|.


A new readout ŷ = w^T X + b, with X = [X_1, ..., X_N]^T, is trained after the new unit is added (at the beginning, ŷ = 0 is assumed).
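As a concrete rendering of the unit computation (17) and the pooling (18), the following sketch (our own NumPy illustration, with arbitrary shapes and a mean/sum pooling switch) computes the per-vertex features and the resulting graph feature of one NN4G unit:

```python
# Sketch (our NumPy rendering) of a single NN4G unit as in (17)-(18): vertex labels
# plus the sum of the neighbours' previously computed features feed a tanh unit,
# followed by sum or mean pooling over the graph's vertices.
import numpy as np

def nn4g_unit(adj, labels, prev_feats, w_label, W_neigh, b, pool="mean"):
    """adj: |V|x|V|; labels: |V|xD; prev_feats: |V|xK features from earlier units."""
    neigh_sum = adj @ prev_feats                                  # sum over v' in N(v)
    x = np.tanh(labels @ w_label + neigh_sum @ W_neigh + b)       # per-vertex feature x_l(v)
    return x, (x.mean() if pool == "mean" else x.sum())           # graph feature X_l as in (18)

adj = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)    # toy triangle graph
x, X_l = nn4g_unit(adj, labels=np.eye(3), prev_feats=np.ones((3, 2)),
                   w_label=np.ones(3), W_neigh=np.ones(2), b=0.0)
```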

Most notably, the network expansion carried out by the NN4G model is exclusively in depth. The context, i.e. the set of vertices whose features contribute directly or indirectly to producing the output feature x_ℓ(v) for vertex v at layer ℓ, consists of all vertices within ℓ − 1 hops from v (Figure 3.5); thus, a hierarchy of features based on increasingly larger contexts is produced by the cascade network.

While the depth expansion allows the units to specialize on graph regions of different radius, the lack of any breadth (i.e. same-layer) expansion can in principle imperil the exploitation of such context. Thus, expansion strategies other than depth-only will be proposed and evaluated in the following chapters, along with diverse types of units, graph pooling methods, and a construction stopping criterion exploiting a validation set metric.


CHAPTER 4

Novel architectures

Having already presented the main characteristics of graph neural networks in Chapter 2 and of cascade correlation for the construction of neural networks in Chapter 3, we will now introduce our novel architectures.

Two of the main motivations for the variants of the original Neural Network for Graphs model that we will propose are:

(1) considering other cascade expansion strategies besides depth-only, in order to automatically assign the number of units required at each layer to properly exploit the current layer context (§4.1.3);

(2) new pooling methods and new unit types, which will allow us both to learn more expressive features and to produce different classes of models, including complex graph filters (§4.2.3) and labelled substructure counting models (§4.5).

4.1. Cascade construction

4.1.1. Outline. At the highest level, our graph models are composed of two parts:

(1) an encoder, which takes as input a graph G(V, E) with node labels l(v) ∈ R^D for all nodes v ∈ V and produces a vector encoding X ∈ R^N for the graph G;

(2) a readout, which produces from X a prediction ŷ ∈ R^C of the target y_G, be it a class one-hot encoding for a classification task, or the output vector for a regression task.

4.1.1.1. Units. We can further subdivide the encoder into the composition of two functions: an isomorphic transduction, which maps node labels l(v) ∈ R^D to node features x(v) ∈ R^N, preserving the graph structure; and a pooling function, which maps the set of node features {x(v) : v ∈ V} to a single vector X ∈ R^N for the whole input graph.

For constructive models, where the dimension N of the graph encoding is expanded during training, it is useful to introduce the concept of unit.

A unit includes both the isomorphic transduction and the pooling function needed to produce a single scalar component of the node features and of the graph vector encoding, receiving as input the node labels l(v) and the node features x_.(v) ∈ R^{N_1 + ... + N_{ℓ−1}} computed so far by all the previous layers 1, ..., ℓ − 1.

Units, as in classic neural networks, can be arranged in layers, with all units i belonging to the same layer ℓ receiving the same input features. We will denote the output node feature vector and the graph encoding produced by a layer with x_ℓ(v) ∈ R^{N_ℓ} and X_ℓ ∈ R^{N_ℓ}, respectively; x_{ℓ,i}(v) ∈ R and X_{ℓ,i} ∈ R will refer to the output scalars of unit i in layer ℓ.

4.1.1.2. Construction. A cascade is therefore constructed by two steps, repeated until a termination criterion is met:

(1) a new unit is trained and inserted in the encoder, thus expanding the graph vector encoding X from R^N to R^{N+1};

(2) the readout is re-trained, either from scratch or by expanding the previous one, to minimize a loss function \sum_{s \in D} L(y^{(s)}, ŷ^{(s)}) over all samples s in the dataset D (in our case, it will be the mean squared-error loss).

This construction algorithm is common to all cascade correlation models, and to the Neural Network for Graphs (NN4G) model in particular [Micheli, 2009]. Our main innovations in the application of cascade correlation to graph networks are to be found in the encoder expansion strategies, in the new types of graph layers and pooling functions on which the encoder is built, and in the criterion for selecting the cascade dimension.

4.1.2. Unit training. In order to describe how a new unit is trained, let us suppose we already have an N-unit cascade model, producing target estimates ŷ^{(s)} for each of the samples s in the dataset D. In case we are training the first unit of the cascade, we will assume ŷ^{(s)} = 0.

The idea behind cascade correlation is to train the new unit to produce a linear approximation of the current loss L(y^{(s)}, ŷ^{(s)}), which represents the 'error' of our current prediction ŷ^{(s)}. A new readout will then use this new feature to 'correct' the target estimation.

Therefore, a new unit should be trained to maximize the correlation between its output X^{(s)} and the error L(y^{(s)}, ŷ^{(s)}) for each sample s,

(19)    corr(X, L) = \frac{ \frac{1}{|D|} \sum_{s \in D} (X^{(s)} − ⟨X⟩)\, (L(y^{(s)}, ŷ^{(s)}) − ⟨L⟩) }{ \sqrt{var(X)\, var(L)} },

where ⟨X⟩ = \frac{1}{|D|} \sum_{s \in D} X^{(s)} and var(X) = ⟨(X − ⟨X⟩)^2⟩ denote the sample mean and variance over the dataset.

Actually, what is maximized in the cascade correlation setting is the covariance¹,

(20)    cov(X, L) = \frac{1}{|D|} \sum_{s \in D} (X^{(s)} − ⟨X⟩)\, (L(y^{(s)}, ŷ^{(s)}) − ⟨L⟩) = corr(X, L)\, \sqrt{var(X)\, var(L)}.

Since in our setting the loss will be the squared error L(y^{(s)}, ŷ^{(s)}) = (y^{(s)} − ŷ^{(s)})^2, we will directly correlate with the error E = ŷ^{(s)} − y^{(s)}, thus maximizing

(21)    S = \sum_{1 ≤ c ≤ C} |cov(X, E_c)|.

A regularization term weighted by λ_unit ≥ 0 will be added to (21) to constrain the norm of the unit parameters.

¹ Rather, its absolute value, since a negative correlation will be equally useful.

4.1.3. Network expansion. Having already a network of N units structured in ℓ layers, we need a strategy to determine the position of the next unit.

To this end, we will train a pool P_{ℓ'} of k units for each candidate layer position ℓ'. The best unit, i.e. the unit that has achieved the highest covariance S^{(j)} during training, from the pool with the highest average covariance \frac{1}{k} \sum_{j \in P_{ℓ'}} S^{(j)}, will be added at layer ℓ' in the cascade (see Algorithm 4.1). Training a pool of candidates instead of a single candidate unit is useful both for providing a selection less biased by a particular weight initialization, and for avoiding the insertion of saturated units, or of units stuck in local maxima, into the cascade.

Algorithm 4.1 Expand cascade with new unit
Require: a set of candidate layers K
1: for all candidate layers ℓ' ∈ K do
2:     train a pool P_{ℓ'} of k units
3:     evaluate the pool covariance S̄_{ℓ'} = \frac{1}{k} \sum_{j \in P_{ℓ'}} S^{(j)}
4: end for
5: select the best expansion position ℓ'_* = arg max_{ℓ'} S̄_{ℓ'} (shallower on tie)
6: append the unit with the highest covariance in pool P_{ℓ'_*} at layer ℓ'_*

We will consider three strategies for expanding a network of ` layers, as outlined in Figure 4.1.

• depth (K = {ℓ+1}), where new candidates will be trained only for the next layer, thus generating models with one unit per layer (this is the only expansion strategy used by Micheli [2009]);

Figure 4.1. The three strategies for network expansion. Filled circles represent trained units, while dashed circles represent candidate positions.

• tail (K = {ℓ, ℓ+1}), where new pools of candidates will be trained both for the current layer ℓ and for the next layer ℓ+1, thus performing a greedy expansion both in depth and in breadth;

• global (K = {1, ..., ℓ, ℓ+1}), the most expensive of the three, where new pools of units will be trained for all layers 1, ..., ℓ and for the next layer ℓ+1, in order to avoid the depthward bias of the previous strategies.

The last two strategies (tail and global) produce layers with a variable number of units, with the aim of properly exploiting the context provided in each layer, while depth is constrained to a single unit per layer. In addition, the thick expansion strategy, which produces fixed-width layers (leaving the width as a hyper-parameter to select), will be proposed in Chapter 6, motivated by the experimental results of the three other strategies.

Considering other expansion directions allows us to avoid unnecessarily deep models. In fact, a shallower but thicker model is most of the time less computationally expensive than a deep, thinner one, since the most expensive operation is the message passing, which needs to be performed for each of the layers.
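A sketch of the selection logic of Algorithm 4.1 (our rendering; train_candidate is a placeholder for the covariance-maximizing unit training of Section 4.1.2 and is assumed to return a trained unit together with its covariance S):

```python
# Sketch (our rendering) of the expansion step of Algorithm 4.1: a pool of k
# candidate units is trained for every candidate layer, the pool with the highest
# average covariance wins (shallower layer on ties), and its best unit is appended.
import numpy as np

def expand_cascade(candidate_layers, train_candidate, k=8):
    """candidate_layers: e.g. {l, l+1} for 'tail'. Returns (chosen_layer, best_unit)."""
    pools = {layer: [train_candidate(layer) for _ in range(k)]      # list of (unit, covariance)
             for layer in candidate_layers}
    best_layer = max(sorted(candidate_layers),                      # sorted -> shallower wins ties
                     key=lambda l: np.mean([s for _, s in pools[l]]))
    best_unit, _ = max(pools[best_layer], key=lambda unit_s: unit_s[1])
    return best_layer, best_unit

# Hypothetical usage with a dummy candidate trainer returning random 'units' and scores.
rng = np.random.default_rng(0)
layer, unit = expand_cascade({2, 3}, lambda l: (rng.normal(size=4), rng.uniform()))
```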

4.1.4. Termination. We now need a criterion for stopping the cascade expansion. Following best practices for model selection, we will use a validation set, extracted from the training set, on which to monitor the error. We will select the cascade that minimizes the validation error; the bias of a particular training/validation split can be overcome by performing cross-validation.

Afterwards, a new model can be retrained on the whole training set, using either the training error achieved by the selected model or its number of units as the stopping criterion.


4.2. Computation in layers

We now turn to the different types of units that our cascade can have. In this section we will present the node feature generation part, leaving graph pooling to the next section.

All layers share the common structure

(22)    x_{ℓ,i} = φ( f(l, x_.) ),

where f(·) is a simple (mostly affine) function of the input node labels l and of the input node features x_., and φ(·) is an activation function.

The choice of the function φ(·) must be limited to saturating activations. Since for any constant α ∈ R we have

(23)    cov(αX, E) = α\, cov(X, E),

the covariance which the units are trained to maximize could be improved by simply rescaling the pooled feature X by a factor |α| > 1, without improving the correlation between X and E.

Weight regularization alone cannot counteract this effect. Thus, in order to avoid this situation, we must prevent a rescaling from being carried out by the trainable weights of f(·) on the new encoding feature X_{ℓ,i}. A saturating activation φ(·) serves precisely this purpose: in this case the rescaling would lead to the saturation of its output, which in turn would impact negatively on the covariance.

4.2.1. Contextual. The first type of layers that we consider are variations of the contextual approach taken by Micheli [2009]. These differ from traditional GCN layers by incorporating the original vertex labels as input in all subsequent layers.

The simplest is

(24)    x_{ℓ,i}(v) = \tanh\Big( w_{label}^T l(v) + \sum_{v' \in N(v)} W_{neigh}\, x_.(v') + b \Big),

where x_. can either be the concatenation [x_1\ ...\ x_{ℓ−1}] of all the previous layers' node features, which was the case for the original NN4G model, or just the previous layer's x_{ℓ−1}. We will conduct experiments in order to assess the role played by full connections, as pointed out by Luzhnica et al. [2019].

A first variant of this layer is obtained by adding an additional weight for the incoming features of node v itself, which was excluded from the neighbourhood aggregation in the classic nn4g,

(25)    x_{ℓ,i}(v) = \tanh\Big( w_{label}^T l(v) + w_{node}^T x_.(v) + \sum_{v' \in N(v)} W_{neigh}\, x_.(v') + b \Big),

thus allowing it to implement e.g. a comparison between node v and its neighbourhood. Note also that this variant will still be able to distinguish whether a neighbourhood contains the node v itself due to self-connections, since the two x_.(v) terms will be weighted by different parameters.

We will call nn4g and nn4g+ the layers defined by (24) and (25), respectively.

4.2.2. Advanced. Much more expressive layers can readily be built by enhancing nn4g. In fact, the possibility of specializing weights according to edges had already been proposed by Micheli [2009], although not experimentally investigated.

The first option is to specialize the weights for each of the neighbourhood node label classes, with W_{neigh} ∈ R^{D×|x_.|} acting as a bilinear function, as done in the bi-neigh layer

(26)    x_{ℓ,i}(v) = \tanh\Big( w_{label}^T l(v) + \sum_{v' \in N(v)} l(v')^T W_{neigh}\, x_.(v') + b \Big).

On the other hand, we can use l(v) instead of l(v'), thus in practice making the unit weights specific to each of the node v categories:

(27)    x_{ℓ,i}(v) = \tanh\Big( w_{label}^T l(v) + \sum_{v' \in N(v)} l(v)^T W_{neigh}\, x_.(v') + b \Big).

We will refer to this layer as bi-node.

By framing these as bilinear functions instead of vertex category-specific weights, these layers are also able to handle continuous vertex labels.

Using the same mechanism, even more expressive layers can be built, e.g. the tri-linear layer

(28)    x_{ℓ,i}(v) = \tanh\Big( w_{label}^T l(v) + \sum_{v' \in N(v)} W_{αβγ}\, l_α(v)\, l_β(v')\, x_.^γ(v') + b \Big),

where we employ the Einstein summation convention to express the multilinear product with the order-3 tensor W_{αβγ}. Here the pair of vertex labels l_α(v), l_β(v') is in fact acting as an edge label for (v, v'), which, if present, could have been directly employed in the bilinear layers.
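A sketch of the bi-neigh computation (26) (our NumPy rendering; shapes are illustrative):

```python
# Sketch (our assumption) of the bi-neigh layer (26): neighbour contributions are
# weighted by a bilinear form between the neighbour's label and its features.
import numpy as np

def bi_neigh_unit(adj, labels, prev_feats, w_label, W_bilinear, b):
    """labels: |V| x D; prev_feats: |V| x K; W_bilinear: D x K."""
    per_vertex = np.einsum('vd,dk,vk->v', labels, W_bilinear, prev_feats)  # l(v')^T W x_.(v')
    neigh_sum = adj @ per_vertex                       # sum over neighbours v' of each vertex
    return np.tanh(labels @ w_label + neigh_sum + b)

rng = np.random.default_rng(0)
adj = np.array([[0, 1], [1, 0]], dtype=float)
x = bi_neigh_unit(adj, np.eye(2), rng.normal(size=(2, 3)),
                  rng.normal(size=2), rng.normal(size=(2, 3)), 0.0)
```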

4.2.3. Spectral. A layer can also directly implement a graph signal filter, e.g. a saturated affine transformation of the Laplacian operator such as

(29)    x_{ℓ,i} = \tanh( w^T L(x_.) + b ),

which can be implemented as the message-passing layer

(30)    x_{ℓ,i}(v) = \tanh\Big( w^T \Big( x_.(v)\, degree(v) − \sum_{v' \in N(v)} x_.(v') \Big) + b \Big).

This kind of layer could be approximated by nn4g+ with appropriate weights (e.g. with W_{neigh} = −w and w_{node} = w \frac{1}{|V|} \sum_{v \in V} degree(v)).

Since the graph datasets considered in our experiments do not have an input signal (either having categorical vertex labels, or lacking them completely), we will take a different approach. In fact, we will introduce a gradient-like layer grad, defined as

(31)    x_{ℓ,i}(v) = \tanh\Big( w^T \Big( x_.(v) − \sum_{v' \in N(v)} x_.(v') \Big) + b \Big),

which can be seen as a weight-constrained version of nn4g+ (W_{neigh} = −w_{node} and w_{label} = 0).

This layer implements a saturated affine transformation of the operator F = I − A,

(32)    x_{ℓ,i} = \tanh( w^T F(x_.) + b ),

where I is the identity and A is the adjacency matrix.

In Figure 4.2 we can see how its iterated applications F^{(ℓ)} act on a constant graph signal. On the left, it is clearly evident that the constant signal is being shifted toward the high-frequency side of the spectrum (λ_max). On the right we observe the same effect in the adjacency spectrum, where the signal is shifting from α_max to α_min (note that F and A are simultaneously diagonalizable).

The combined effect of the scaling and of the saturating activation function, on the other hand, acts as a low-pass filter (Figure 4.3), with the bias term b providing an additional contribution to the constant component of the spectrum.

Overall, the action of F and the saturated affine transformation contribute to realizing a sequence of layers which produce spectrally rich features without the need for an input signal. These can then be fed to the linear readout to implement a more complex filter.

Therefore, we will evaluate whether our cascade construction is able to produce effective complex filters by employing grad units along with norm-based pooling (see subsequent section).
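The following toy NumPy sketch (our own illustration on a small path graph; the thesis figures use different graphs) shows the qualitative effect described above: repeated, renormalized applications of F = I − A move an initially constant signal toward the high-frequency components of the Laplacian spectrum.

```python
# Sketch (our assumption) of the frequency shift produced by F = I - A: after each
# application the signal is renormalized, and its magnitude in the Laplacian
# eigenbasis concentrates on the higher frequencies.
import numpy as np

A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)     # a 4-vertex path graph (irregular degrees)
L = np.diag(A.sum(1)) - A
lam, U = np.linalg.eigh(L)                    # Laplacian eigenbasis, frequencies lam

F = np.eye(4) - A
x = np.ones(4)                                # constant input signal
for step in range(4):
    x = F @ x
    x = x / np.linalg.norm(x)
    print(step + 1, np.round(np.abs(U.T @ x), 3))   # spectral magnitudes per eigenfunction
```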

4.3. Graph pooling

We will apply different graph pooling methods among those introduced in Chapter 2.

Figure 4.2. Frequency shift produced by iterated applications of grad on a constant signal (Laplacian and adjacency spectra).

Figure 4.3. Saturating activation functions such as tanh(·) act as low-pass filters, according to input scale.


4.3.1. Simple. The simplest ones are the three operators sum(·), mean(·), and max(·), whose properties have already been discussed.

In addition, we will introduce a class of norm and norm-like pooling functions to be used with spectral layers. They are reported in Table 4.1: '1-norm' and 'relu-sum' are respectively the definition and an approximation of ||·||_1, while '2-norm' implements ||·||_2^2. Since a cascade of spectral layers acts as a sequence of filters, the classifier will use the norm of the pass-band signal to make its prediction.

name        pooling function
1-norm      \sum_{v \in V} |x_{ℓ,i}(v)|
2-norm      \sum_{v \in V} x_{ℓ,i}(v)^2
relu-sum    \sum_{v \in V} \max\{x_{ℓ,i}(v), 0\}

Table 4.1. Norm pooling functions.

Note that none of these six pooling functions requires trainable parameters.

4.3.2. Attention. A more powerful approach is the attention mechanism, which allows a weighted mean pooling of the graph nodes according to

(33)    X_{ℓ,i} = \sum_{v \in V} α_{ℓ,i}(v)\, x_{ℓ,i}(v).

The attention weight distribution α_{ℓ,i} is computed as

(34)    α_{ℓ,i}(v) = softmax\Big( w^T [\, l(v);\ x_.(v);\ x_{ℓ,i}(v) \,] + b \Big),

that is, by the softmax of a linear combination of each node's label, the features computed in the previous layers, and the current unit's output node feature. This allows the model to learn which nodes are more relevant to the final prediction.

We will also introduce an intermediate approach, by rescaling the output features of the nn4g layer with weights computed from the node labels. The focus layer is hence defined as

(35)    x_{ℓ,i}(v) = x^{nn4g}_{ℓ,i}(v)\; sigmoid( w^T l(v) ),

where x^{nn4g}_{ℓ,i}(v) is computed by equation (24), and sigmoid(w^T l(v)) is the rescaling factor. Note that, according to equation (35), subsequent layers will receive the rescaled node features as input, thus allowing information on the 'focus' to be passed on; in our experiments we will use sum pooling in conjunction with this layer.


4.4. Readout

We will now focus on the readout part of our graph model, which is re-trained each time a new unit is added to the cascade.

As for the original NN4G model [Micheli, 2009], it will consist of a simple affine function

(36)    ŷ = W X + b,

with W ∈ R^{C×N} and b ∈ R^C as the parameters to be trained.

In case the loss to be minimized is the (mean) squared-error (MSE) loss \sum_{s \in D} (y^{(s)} − ŷ^{(s)})^2, a direct solution can be obtained by the pseudo-inverse method.

Let the matrices Z ∈ R^{|D|×(N+1)} and Y ∈ R^{|D|×C} be

(37)    Z = \begin{bmatrix} X^{(1)T} & 1 \\ \vdots & \vdots \\ X^{(|D|)T} & 1 \end{bmatrix},    Y = \begin{bmatrix} y^{(1)T} \\ \vdots \\ y^{(|D|)T} \end{bmatrix}.

The parameters in equation (36) can thus be obtained by

(38)    \begin{bmatrix} W^T \\ b^T \end{bmatrix} = Z^+ Y,

where Z^+ denotes the Moore–Penrose pseudo-inverse of Z.

In case a Tikhonov regularization term is added to the loss, that is, we want to minimize \sum_{s \in D} (y^{(s)} − ŷ^{(s)})^2 + λ_{readout} (||W||^2 + ||b||^2), the solution is given by

(39)    \begin{bmatrix} W^T \\ b^T \end{bmatrix} = (Z^T Z + λ_{readout}^2 I)^{−1} Z^T Y,

where I is the identity matrix. In practice, equation (39) is not used directly: given the singular value decomposition Z = U S V^T,

(40)    \begin{bmatrix} W^T \\ b^T \end{bmatrix} = V (S^2 + λ_{readout}^2 I)^{−1} S\, U^T Y.
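A minimal NumPy sketch of the SVD-based solution (40) (our illustration; the synthetic data and the value of λ_readout are arbitrary):

```python
# Sketch (our assumption, NumPy only) of the regularized readout solution (40):
# the SVD of the design matrix Z gives the Tikhonov-regularized least-squares
# weights without forming Z^T Z explicitly.
import numpy as np

def fit_readout(X_graphs, Y, lam=0.1):
    """X_graphs: |D| x N graph encodings; Y: |D| x C targets. Returns (W_T, b)."""
    Z = np.hstack([X_graphs, np.ones((X_graphs.shape[0], 1))])    # append the bias column
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    Wb = Vt.T @ np.diag(S / (S ** 2 + lam ** 2)) @ U.T @ Y        # stacks W^T over b^T as in (38)
    return Wb[:-1], Wb[-1]                                        # W^T is N x C, b is C-dimensional

rng = np.random.default_rng(0)
W_T, b = fit_readout(rng.normal(size=(50, 8)), rng.normal(size=(50, 3)), lam=0.1)
```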

For classification tasks, it is nowadays most common to train a classifier via cross-entropy loss minimization. While this has shown some advantages in the end-to-end training of deep models, the lack of a closed-form solution leaves only gradient-descent methods to be applied.

4.5. Path-like models

The constructive framework of graph neural models we have developed so far can readily be used to implement more 'esoteric' models.

Since the ability to count labelled sub-structures in graphs has been shown to be essential in the chemical domain [Chen et al., 2020], we propose a model tailored for this. Our model consists of a cascade of single units, each specialized on one category of vertices or neighbour nodes.


Figure 4.4. Construction of a cascade of label-specific units.

Figure 4.5. Substructures of the input graph represented in subsequent layers by a path-node cascade.

The path-neigh unit restricts the vertex neighbourhood to the subset of nodes having label category l̂_ℓ,

(41)    x_ℓ(v) = \tanh\Big( w_{label}^T l(v) + \sum_{v' \in N(v)} W_{neigh}\, x_.(v')\, ⟦ l(v') = l̂_ℓ ⟧ + b \Big).

The path-node unit, on the other hand, computes features only for vertices with label category ˆl`, leaving others with null features:

(42) x`(v) = tanh  wlabel> l(v) + X v0∈N (v) Wneighx.(v0) + b   r l(v) = ˆl` z . At each layer ` of the cascade construction, a pool of units for each category is trained, with the best unit from the best performing

pool (having label ˆl`) being selected (Figure 4.4). Graph pooling is

performed with the sum operator. Thus, each layer counts a set of substructures of the input graph, produced according to the sequence of selected labels ˆl1, ..., ˆl` (Figure 4.5).
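A sketch of the path-neigh masking of (41) (our NumPy rendering; one-hot labels and illustrative shapes):

```python
# Sketch (our rendering) of the path-neigh unit (41): the neighbourhood sum is
# restricted to neighbours whose categorical label matches the unit's chosen
# category l_hat, via an element-wise mask.
import numpy as np

def path_neigh_unit(adj, labels_onehot, prev_feats, l_hat, w_label, W_neigh, b):
    """labels_onehot: |V| x D one-hot labels; l_hat: index of the selected category."""
    mask = labels_onehot[:, l_hat]                       # 1 for vertices with label l_hat
    masked_neigh = adj @ (prev_feats * mask[:, None])    # sum only over matching neighbours
    return np.tanh(labels_onehot @ w_label + masked_neigh @ W_neigh + b)

adj = np.array([[0, 1], [1, 0]], dtype=float)
labels = np.eye(2)                                       # vertex 0 has category 0, vertex 1 has 1
x = path_neigh_unit(adj, labels, np.ones((2, 3)), l_hat=0,
                    w_label=np.ones(2), W_neigh=np.ones(3), b=0.0)
```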

While the effectiveness of this class of models will be verified experimentally, we highlight the flexibility of our constructive framework in allowing such departures from more classical graph convolutional models.


CHAPTER 5

Experiments

We will now turn to the bulk of our experiments: assessing the performances of our class of constructive models on the most common graph datasets, in order to compare them against the state of the art.

5.1. Datasets

The datasets for our experiments include all those employed in the graph model comparison by Errica et al. [2020]¹.

All of them have a graph class as target, categorical node labels encoded as one-hot representations (except for ENZYMES, whose node features are non-categorical, and the social datasets, which are unlabelled), and no edge features. All datasets used in Errica et al. [2020] are balanced.

dataset            graphs   avg. nodes   avg. edges   node labels   classes
DD                 1178     284.32       715.66       89            2
ENZYMES            600      32.63        62.14        3             6
NCI1               4110     29.87        32.30        37            2
PROTEINS           1113     39.06        72.82        3             2
COLLAB             5000     74.49        2457.78      —             3
IMDB-BINARY        1000     19.77        96.53        —             2
IMDB-MULTI         1500     13.00        65.94        —             3
REDDIT-BINARY      2000     429.63       497.75       —             2
REDDIT-MULTI-5K    4999     508.52       594.87       —             5

Table 5.1. Graph datasets used in the experiments.

Chemical compounds. DD graphs [Shervashidze et al., 2011] represent protein structures: nodes correspond to amino acids, with edges between them if they are within a distance of 6 ångström. The task is to classify the proteins into enzymes and non-enzymes.

The PROTEINS dataset has been derived from the same data as DD [Borgwardt et al., 2005], with the same target classes. Node attributes identify helices, sheets, and turns in the protein structure, with edges connecting the 3 closest nodes.

¹ All datasets were retrieved from www.graphlearning.io


NC1 contains chemical compounds screened for activity against non-small cell lung cancer by the National Institute of Health [Sher-vashidze et al., 2011].

Finally, ENZYMES represents 100 proteins from six of the Enzyme Commission (EC) top-level classes [Borgwardt et al., 2005]. Node fea-tures are non-categorical.

Social graphs. The graphs of these datasets are unlabelled and represent interactions (edges) between people (nodes). The datasets have originally been provided by Yanardag and Vishwanathan [2015].

The COLLAB graphs represent collaboration between authors in three scientific fields: high-energy physics, condensed matter physics, and astrophysics. The task is to identify to which of the three an author belongs.

In the IMDB-BINARY and IMDB-MULTI graphs, nodes represent actors, with edges between them if they have played a part in the same film. Different graphs have been generated for different film genres, which are the target classes to predict (respectively action/romance and action/comedy/sci-fi).

REDDIT-BINARY and REDDIT-MULTI-5K have been generated from discussion threads on the reddit.com website, with one graph per thread, nodes representing users, and edges present if users responded to each other’s comments. The task for the two datasets is to predict whether a thread belongs to a Q&A or discussion community, or to which of five different subreddits it belongs, respectively.

Following Errica et al. [2020], we use these unlabelled datasets in experiments both with a dummy constant label and with node degrees as node attributes.
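For instance, the two settings for unlabelled graphs can be realized as in this minimal NumPy sketch (the function name is ours):

```python
import numpy as np

def social_node_features(A, use_degree):
    """Node attributes for an unlabelled graph, given its adjacency matrix A:
    either the node degree or a dummy constant label (a single column)."""
    n = A.shape[0]
    if use_degree:
        return A.sum(axis=1, keepdims=True).astype(float)   # (n, 1) degrees
    return np.ones((n, 1))                                   # (n, 1) constant label
```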

5.2. Experimental set-up

We follow the evaluation framework of Errica et al. [2020], thus allowing a direct comparison between their assessment of some of the most common graph models and ours.

Briefly, it consists of an outer k-fold training/test split to assess model performance, with an inner k-fold training/validation split to select the model hyper-parameters. This procedure is called double cross-validation (depicted in Figure 5.1).

5.2.1. Model selection. In order to choose a model’s hyper-parameters, one cannot simply rely on the training set performance, due to the risk of over-fitting, i.e. obtaining models with low training error but poor generalization capability. Therefore, a subset of the data is set aside to validate the performance on unseen samples. The bias of any particular split can be avoided by averaging performances over multiple training/validation splits.


Figure 5.1. Model evaluation.

For our model selection we choose a 3-fold cross-validation split: for each configuration of hyper-parameters the model is trained on 2/3 of the data and evaluated on the remaining 1/3, thus using disjoint validation sets.

5.2.2. Model assessment. Reporting only validation performances would provide an over-estimate of model accuracy, since the hyper-parameters have been selected to minimize the error on the validation set. Nevertheless, as lamented by Errica et al. [2020], this malpractice has been found even in peer-reviewed literature on graph neural models, motivating their re-assessment of the state of the art.

Following their approach, we make a 5-fold split into selection and test sets. For each of the 5 splits, we select the model hyper-parameters on the current selection set following the cross-validation scheme described in the previous section. Then the model is re-trained on the whole selection set and evaluated on the test set; the accuracy averaged over the 5 test sets provides the final assessment of our model.
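The whole procedure can be summarized by the following sketch (illustrative Python relying on scikit-learn’s StratifiedKFold for the splits; train_cascade and accuracy are placeholders for our model’s training and evaluation routines, not library functions):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def double_cross_validation(graphs, y, configs, train_cascade, accuracy):
    """5-fold outer split for assessment, 3-fold inner split for model selection."""
    y = np.asarray(y)
    outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    test_acc = []
    for sel_idx, test_idx in outer.split(np.zeros((len(y), 1)), y):
        # inner 3-fold: pick the configuration with the best average validation accuracy
        inner = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
        best_cfg, best_val = None, -np.inf
        for cfg in configs:
            scores = []
            for tr, va in inner.split(np.zeros((len(sel_idx), 1)), y[sel_idx]):
                model = train_cascade([graphs[i] for i in sel_idx[tr]], y[sel_idx[tr]], cfg)
                scores.append(accuracy(model, [graphs[i] for i in sel_idx[va]], y[sel_idx[va]]))
            if np.mean(scores) > best_val:
                best_cfg, best_val = cfg, np.mean(scores)
        # re-train on the whole selection set with the selected configuration
        model = train_cascade([graphs[i] for i in sel_idx], y[sel_idx], best_cfg)
        test_acc.append(accuracy(model, [graphs[i] for i in test_idx], y[test_idx]))
    return np.mean(test_acc), np.std(test_acc)
```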


dataset                                        learning rate   λ_unit             λ_readout
DD, ENZYMES, NCI1, PROTEINS, IMDB-BINARY,
IMDB-MULTI, REDDIT-BINARY, REDDIT-MULTI-5K     10⁻³, 10⁻⁴      10⁻², 10⁻³, 10⁻⁴   10⁻¹
COLLAB                                         10⁻⁴, 10⁻⁵      10⁻², 10⁻³, 10⁻⁴   10⁻¹

Table 5.2. Hyper-parameter values to select for each dataset.
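As an illustration, the grids of Table 5.2 can be written down and enumerated as follows (a minimal sketch; the dictionary keys are our own names, and scikit-learn’s ParameterGrid is used only for convenience):

```python
from sklearn.model_selection import ParameterGrid

# Hyper-parameter grids of Table 5.2 (COLLAB uses smaller learning rates)
common_grid = {
    "learning_rate":  [1e-3, 1e-4],
    "lambda_unit":    [1e-2, 1e-3, 1e-4],
    "lambda_readout": [1e-1],
}
collab_grid = dict(common_grid, learning_rate=[1e-4, 1e-5])

configs = list(ParameterGrid(common_grid))   # 2 x 3 x 1 = 6 configurations
```

Lists such as configs are what the inner selection loop of the previous sketch iterates over.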

5.2.3. Hyper-parameters. We can subdivide our model parameters into three categories:

(1) type of layer and pooling function, which characterize each model variant and are left out of selection in order to analyse their effect;

(2) regularization coefficients (λ_unit, λ_readout) and learning rate for unit and readout training, which are selected among the values listed in Table 5.2;

(3) stop condition, i.e. either the training error or the number of units to be reached in order to halt cascade expansion: the average values reached during model selection are used as thresholds when the model is retrained on the whole selection set.

Therefore, our model selection will differ from that of Errica et al. [2020] in the following ways:

(a) pooling function and type of layer will be considered model variants to compare against each other, instead of parameters to select;

(b) given an expansion strategy, model topology will be automatically determined for our models, instead of being selected.

5.3. Comparing models

In order to compare more than two models against each other across different datasets, we rely on the statistical tests described by Demšar [2006] and García and Herrera [2008]. The Nemenyi post-hoc test works directly on the rankings of the models’ performances, computing average ranks and a critical difference according to a significance level α (usually α = 0.05), which is used to arrange models into cliques of statistically equal rank.
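For reference, the average ranks and the critical difference can be computed as in the following sketch (Python with NumPy/SciPy; the q_0.05 constants are the two-tailed Nemenyi critical values tabulated by Demšar [2006], reported here for up to eight models):

```python
import numpy as np
from scipy.stats import rankdata

# Two-tailed Nemenyi critical values q_0.05 (Demsar, 2006),
# indexed by the number k of compared models.
Q_005 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728, 6: 2.850, 7: 2.949, 8: 3.031}

def nemenyi_ranks_and_cd(accuracies):
    """accuracies: (N datasets, k models) array of mean accuracies.
    Returns the average rank of each model (rank 1 = best) and the
    critical difference CD: two models fall in the same clique when
    their average ranks differ by less than CD."""
    N, k = accuracies.shape
    ranks = np.vstack([rankdata(-row) for row in accuracies])  # ties get average rank
    avg_ranks = ranks.mean(axis=0)
    cd = Q_005[k] * np.sqrt(k * (k + 1) / (6.0 * N))
    return avg_ranks, cd
```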

The models evaluated by Errica et al. [2020] are briefly recalled in Table 5.3. Figure 5.2 shows the critical difference diagram


model        reference                          description
DGCNN        Zhang et al. [2018]                Graph convolution with node sorting and alignment
DiffPool     Ying et al. [2018]                 Trainable graph coarsening
ECC          Simonovsky and Komodakis [2017]    Neighbour aggregation weighted according to specific edge parameters
GIN          Xu et al. [2019]                   Vertex features computed by a multi-layer perceptron
GraphSAGE    Hamilton et al. [2017]             Graph convolution with neighbourhood sampling

Table 5.3. Distinctive characteristics of models evaluated by Errica et al. [2020].

dataset              MLP baseline   DGCNN      DiffPool   GIN        GraphSAGE
Chem.
DD                   78.4±4.5       76.6±4.3   75.0±3.5   75.3±2.9   72.9±2.0
NCI1                 69.8±2.2       76.4±1.7   76.9±1.9   80.0±1.4   76.0±1.8
PROTEINS             75.8±3.7       72.9±3.5   73.7±3.5   73.3±4.0   73.0±4.5
ENZYMES              65.2±6.4       38.9±5.7   59.5±5.6   59.6±4.5   58.2±6.0
No Feat.
IMDB-BINARY          50.7±2.4       53.3±5.0   68.3±6.1   66.8±3.9   69.9±4.6
IMDB-MULTI           36.1±3.0       38.6±2.2   45.1±3.2   42.2±4.6   47.2±3.6
REDDIT-BINARY        72.1±7.8       77.1±2.9   76.6±2.4   87.0±4.4   86.1±2.0
REDDIT-MULTI-5K      35.1±1.4       35.7±1.8   34.6±2.0   53.8±5.9   49.9±1.7
COLLAB               55.0±1.9       57.4±1.9   67.7±1.9   75.9±1.9   71.6±1.5
With Deg.
IMDB-BINARY          70.8±5.0       69.2±3.0   68.4±3.3   71.2±3.9   68.8±4.5
IMDB-MULTI           49.1±3.5       45.6±3.4   45.6±3.4   48.5±3.3   47.6±3.5
REDDIT-BINARY        82.2±3.0       87.8±2.5   89.1±1.6   89.9±1.9   84.3±1.9
REDDIT-MULTI-5K      52.2±1.5       49.2±1.2   53.8±1.4   56.1±1.7   50.0±1.3
COLLAB               70.2±1.5       71.2±1.9   68.9±2.0   75.6±2.3   73.9±1.7

Table 5.4. Accuracy evaluation of models taken from Errica et al. [2020] (best in bold).

for the models in Table 5.4. Models are arranged according to average rank from first to last, and cliques within the critical difference are joined by red lines. GIN comes out as the best model overall, while the multi-layer perceptron (MLP) baseline performs better on the chemical datasets, with the exception of NCI1.


Figure 5.2. Critical difference diagram (α = 0.05) ranking models from Errica et al. [2020].

In subsequent sections we will compare variants of our cascade models against each other, deferring the comparison with end-to-end models to Chapter 6, where we will propose a ‘final’ model variant.

5.4. Contextual layers

We begin by evaluating how different expansion strategies, unit connectivity, and graph pooling affect accuracy in cascades with simple contextual unit types (nn4g, nn4g+). The complete results of the double cross-validation experiments are reported in Appendix A.

5.4.1. Role of connections. No statistically significant conclusion can be drawn on whether the presence of full connections provides more accurate models overall, for any of the three expansion strategies considered (Figure 5.3). More conclusive results will be presented in Chapter 6.

5.4.2. Pooling layers. Sum and attention pooling seem to perform generally better, even though no pooling strategy is statistically significantly superior (Figure 5.4). On the chemical datasets, mean pooling performs significantly worse than the others (Table 5.5); this can be explained by considering that the MLP baseline of Errica et al. [2020], which takes a histogram of vertex labels as input, outperforms the other models there: node counts are the key feature in this case.
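For reference, the four pooling operators compared here can be sketched as follows (NumPy code over the matrix X of vertex features of a single graph; the attention variant is shown in one common formulation, not necessarily the exact one defined in Chapter 4, with a learned weight vector a):

```python
import numpy as np

def sum_pool(X):                  # X: (n, d) vertex features of one graph
    return X.sum(axis=0)

def mean_pool(X):
    return X.mean(axis=0)

def max_pool(X):
    return X.max(axis=0)

def attention_pool(X, a):
    """Vertices are weighted by a softmax over the scores X @ a,
    where a is a learned weight vector, before summing."""
    scores = X @ a
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()
    return alpha @ X              # (n,) @ (n, d) -> (d,)
```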

5.4.3. Expansion strategies. Considering expansion methods other than depth-only produces shallower models, but of significantly lower width (at most 2 units per layer, on average, see Figure 5.6). Still, no strategy outperforms the others in a significant way (Figure 5.5, Table 5.6). It would then be reasonable to discard global expansion, since the


(a) depth expansion

(b) tail expansion

(c) global expansion

Figure 5.3. Critical difference diagrams comparing the effect of connectivity in different expansion strategies (nn4g units, sum pooling).

                          depth expansion                                  tail expansion
dataset                   sum        mean       max        attention       sum        mean       max        attention
Chem.
DD                        75.4±1.3   63.3±3.4   71.9±2.6   72.4±5.5        74.6±2.0   65.2±2.1   70.2±3.2   73.9±4.0
NCI1                      73.2±0.7   68.5±1.9   71.6±1.8   74.1±1.7        72.6±0.9   67.8±1.1   69.3±1.2   72.7±1.2
PROTEINS                  73.3±4.5   68.4±2.7   75.4±2.7   74.0±1.9        74.0±3.9   68.9±4.1   74.2±3.5   75.0±3.2
ENZYMES                   26.6±3.0   28.1±2.0   26.5±2.9   34.1±1.8        26.0±5.8   29.1±2.5   27.3±2.6   33.0±3.0
No Feat.
IMDB-BINARY               71.7±3.0   71.5±2.6   68.7±2.5   72.1±2.5        71.4±3.5   69.6±2.9   71.0±3.6   70.9±3.6
IMDB-MULTI                48.3±2.3   48.8±3.8   48.1±1.6   49.0±3.1        49.0±3.5   48.9±2.3   47.6±2.1   48.0±3.4
REDDIT-BINARY             83.2±1.7   81.0±1.6   85.7±2.0   82.6±1.8        82.5±0.7   80.7±1.5   82.9±2.3   83.0±1.9
REDDIT-MULTI-5K           52.5±1.1   50.0±1.4   51.3±1.2   51.7±0.9        52.8±0.9   50.2±1.1   50.2±1.0   51.8±1.0
COLLAB                    70.2±4.0   73.2±0.7   73.6±1.3   72.5±1.4        73.0±0.6   71.5±0.8   71.5±1.3   71.9±1.3
With Deg.
IMDB-BINARY               71.3±3.3   69.7±3.1   71.3±2.8   70.4±3.1        71.6±3.1   73.1±1.9   70.6±1.8   70.9±1.7
IMDB-MULTI                49.1±2.1   48.2±2.5   46.8±2.8   47.0±2.7        49.8±3.8   48.4±3.9   47.6±3.6   46.5±4.7
REDDIT-BINARY             84.0±0.8   82.2±2.4   84.2±1.0   85.8±2.1        82.4±1.0   81.3±2.3   82.4±2.5   85.6±0.3
REDDIT-MULTI-5K           52.8±1.3   50.4±0.5   50.6±2.3   52.2±1.6        51.6±0.6   50.0±0.6   50.0±1.5   52.9±1.9
COLLAB                    71.9±0.6   72.0±0.6   72.6±0.2   71.8±0.7        69.4±3.5   71.8±1.7   71.1±0.8   69.4±1.0

Table 5.5. Accuracy evaluation of different graph pooling methods for two expansion strategies of cascades with nn4g units and full connections (best in bold).


(a) depth expansion only

(b) tail expansion only

(c) both expansion strategies

Figure 5.4. Critical difference diagrams ranking graph pooling methods for depth and tail expansion strategies.

significant cost of training candidate pools at each cascade position is not justified by improvements in either accuracy or model size.

The factors affecting this will be investigated in detail in the next chapter, where the thick expansion strategy will also be introduced.
