Corso di Laurea Magistrale in Informatica
Master's Degree Thesis
A Non-Negative Factorization approach to
node pooling in Graph Convolutional Neural
Networks
Supervisor
Prof. Davide Bacciu
Reviewer
Prof. Luca Oneto
Candidate
Luigi Di Sotto
beyond recall, smiled at him sweetly. He smiled back with fondness, and without pain.
— Arthur C. Clarke, 2001: A Space Odyssey, Chapter 45: Recapitulation
Contents

1 Introduction
2 Background
  2.1 Introduction
  2.2 Basic notation
  2.3 Calculus on Manifolds
  2.4 Calculus on Graphs
  2.5 Harmonic Analysis on Graphs
  2.6 Convolutional Neural Networks
  2.7 Graph Convolutional Neural Networks
  2.8 Expressive power of GCNs
3 Graph Pooling by Non-Negative Matrix Factorization
  3.1 Related work
  3.2 The problem
  3.3 Non-Negative Matrix Factorization
  3.4 Auto-Encoders
  3.5 Introduction to solutions
  3.6 AVBPool
  3.7 NMFPool
4 Experiments
  4.1 Datasets
  4.2 Experiments
5 Conclusions
  5.1 Future work
Bibliography
Notation

This section provides a concise reference describing the notation used throughout this document.
Numbers and Arrays
$a$ : A scalar (integer or real)
$\boldsymbol{a}$ : A vector
$\boldsymbol{A}$ : A matrix
$\mathsf{A}$ : A tensor
$\boldsymbol{I}_n$ : Identity matrix with $n$ rows and $n$ columns
$\boldsymbol{I}$ : Identity matrix with dimensionality implied by context
$\boldsymbol{e}^{(i)}$ : Standard basis vector $[0, \dots, 0, 1, 0, \dots, 0]$ with a 1 at position $i$
$\mathrm{diag}(\boldsymbol{a})$ : A square, diagonal matrix with diagonal entries given by $\boldsymbol{a}$
$\mathrm{a}$ : A scalar random variable
$\mathbf{a}$ : A vector-valued random variable
$\mathbf{A}$ : A matrix-valued random variable
Sets and Graphs
$\mathbb{A}$ : A set
$\mathbb{R}$ : The set of real numbers
$\{0, 1\}$ : The set containing 0 and 1
$\{0, 1, \dots, n\}$ : The set of all integers between 0 and $n$
$[a, b]$ : The real interval including $a$ and $b$
$(a, b]$ : The real interval excluding $a$ but including $b$
$\mathbb{A} \setminus \mathbb{B}$ : Set subtraction, i.e., the set containing the elements of $\mathbb{A}$ that are not in $\mathbb{B}$
$\mathcal{G}$ : A graph
$Pa_{\mathcal{G}}(\mathrm{x}_i)$ : The parents of $\mathrm{x}_i$ in $\mathcal{G}$
Indexing
$a_i$ : Element $i$ of vector $\boldsymbol{a}$, with indexing starting at 1
$\boldsymbol{a}_{-i}$ : All elements of vector $\boldsymbol{a}$ except for element $i$
$A_{i,j}$ : Element $i, j$ of matrix $\boldsymbol{A}$
$\boldsymbol{A}_{i,:}$ : Row $i$ of matrix $\boldsymbol{A}$
$\boldsymbol{A}_{:,i}$ : Column $i$ of matrix $\boldsymbol{A}$
$\mathsf{A}_{i,j,k}$ : Element $(i, j, k)$ of a 3-D tensor $\mathsf{A}$
$\mathsf{A}_{:,:,i}$ : 2-D slice of a 3-D tensor
Linear Algebra Operations
$\boldsymbol{A}^\top$ : Transpose of matrix $\boldsymbol{A}$
$\boldsymbol{A}^+$ : Moore-Penrose pseudoinverse of $\boldsymbol{A}$
$\boldsymbol{A} \odot \boldsymbol{B}$ : Element-wise (Hadamard) product of $\boldsymbol{A}$ and $\boldsymbol{B}$
$\det(\boldsymbol{A})$ : Determinant of $\boldsymbol{A}$
Calculus
$\frac{dy}{dx}$ : Derivative of $y$ with respect to $x$
$\frac{\partial y}{\partial x}$ : Partial derivative of $y$ with respect to $x$
$\nabla_{\boldsymbol{x}} y$ : Gradient of $y$ with respect to $\boldsymbol{x}$
$\nabla_{\boldsymbol{X}} y$ : Matrix derivatives of $y$ with respect to $\boldsymbol{X}$
$\nabla_{\mathsf{X}} y$ : Tensor containing derivatives of $y$ with respect to $\mathsf{X}$
$\frac{\partial f}{\partial \boldsymbol{x}}$ : Jacobian matrix $\boldsymbol{J} \in \mathbb{R}^{m \times n}$ of $f : \mathbb{R}^n \to \mathbb{R}^m$
$\nabla^2_{\boldsymbol{x}} f(\boldsymbol{x})$ or $\boldsymbol{H}(f)(\boldsymbol{x})$ : The Hessian matrix of $f$ at input point $\boldsymbol{x}$
$\int f(\boldsymbol{x})\, d\boldsymbol{x}$ : Definite integral over the entire domain of $\boldsymbol{x}$
$\int_{\mathbb{S}} f(\boldsymbol{x})\, d\boldsymbol{x}$ : Definite integral with respect to $\boldsymbol{x}$ over the set $\mathbb{S}$
Probability and Information Theory
$\mathrm{a} \perp \mathrm{b}$ : The random variables $\mathrm{a}$ and $\mathrm{b}$ are independent
$\mathrm{a} \perp \mathrm{b} \mid \mathrm{c}$ : They are conditionally independent given $\mathrm{c}$
$P(\mathrm{a})$ : A probability distribution over a discrete variable
$p(\mathrm{a})$ : A probability distribution over a continuous variable, or over a variable whose type has not been specified
$\mathrm{a} \sim P$ : Random variable $\mathrm{a}$ has distribution $P$
$\mathbb{E}_{\mathrm{x} \sim P}[f(x)]$ or $\mathbb{E}f(x)$ : Expectation of $f(x)$ with respect to $P(\mathrm{x})$
$\mathrm{Var}(f(x))$ : Variance of $f(x)$ under $P(\mathrm{x})$
$\mathrm{Cov}(f(x), g(x))$ : Covariance of $f(x)$ and $g(x)$ under $P(\mathrm{x})$
$H(\mathrm{x})$ : Shannon entropy of the random variable $\mathrm{x}$
$D_{\mathrm{KL}}(P \,\|\, Q)$ : Kullback-Leibler divergence of $P$ and $Q$
$\mathcal{N}(\boldsymbol{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma})$ : Gaussian distribution over $\boldsymbol{x}$ with mean $\boldsymbol{\mu}$ and covariance $\boldsymbol{\Sigma}$
Functions
$f : \mathbb{A} \to \mathbb{B}$ : The function $f$ with domain $\mathbb{A}$ and range $\mathbb{B}$
$f \circ g$ : Composition of the functions $f$ and $g$
$f(\boldsymbol{x}; \boldsymbol{\theta})$ : A function of $\boldsymbol{x}$ parametrized by $\boldsymbol{\theta}$ (sometimes we write $f(\boldsymbol{x})$ and omit the argument $\boldsymbol{\theta}$ to lighten notation)
$\log x$ : Natural logarithm of $x$
$\sigma(x)$ : Logistic sigmoid, $\frac{1}{1 + \exp(-x)}$
$\zeta(x)$ : Softplus, $\log(1 + \exp(x))$
$\|\boldsymbol{x}\|_p$ : $L^p$ norm of $\boldsymbol{x}$
$\|\boldsymbol{x}\|$ : $L^2$ norm of $\boldsymbol{x}$
$x^+$ : Positive part of $x$, i.e., $\max(0, x)$
$\boldsymbol{1}_{\text{condition}}$ : 1 if the condition is true, 0 otherwise
Sometimes we use a function $f$ whose argument is a scalar but apply it to a vector, matrix, or tensor: $f(\boldsymbol{x})$, $f(\boldsymbol{X})$, or $f(\mathsf{X})$. This denotes the application of $f$ to the array element-wise. For example, if $\mathsf{C} = \sigma(\mathsf{X})$, then $\mathsf{C}_{i,j,k} = \sigma(\mathsf{X}_{i,j,k})$ for all valid values of $i$, $j$ and $k$.
Datasets and Distributions
$p_{\text{data}}$ : The data generating distribution
$\hat{p}_{\text{data}}$ : The empirical distribution defined by the training set
$\mathbb{X}$ : A set of training examples
$\boldsymbol{x}^{(i)}$ : The $i$-th example (input) from a dataset
$y^{(i)}$ or $\boldsymbol{y}^{(i)}$ : The target associated with $\boldsymbol{x}^{(i)}$ for supervised learning
$\boldsymbol{X}$ : The $m \times n$ matrix with input example $\boldsymbol{x}^{(i)}$ in row $\boldsymbol{X}_{i,:}$
Introduction
Grounded in solid theoretical and empirical foundations, Convolutional Neural Networks (CNNs) have led, over the last decade, to qualitative breakthroughs on a variety of problems defined over planar spaces. Unfortunately, the key ingredients that make Euclidean CNNs so successful do not carry over to relational data: on the manifolds underlying graph data, the properties of a common system of coordinates, a vector space structure and shift-invariance only make sense locally, i.e. within Euclidean neighbourhoods around data points. The underlying topology usually has no regular support, since data points may relate in exponentially many ways. We will show how the absence of a compact, shift-invariant support makes it unfeasible to apply Euclidean CNNs to such domains. In this Thesis we consider the current revival of neural networks over graphs, known as Graph Convolutional Neural Networks (GCNs), which try to tackle the limitations Euclidean CNNs face over non-structured data.
Before delving into the machinery of GCN models, it is necessary to introduce the basic mathematical building blocks of learning models over graphs. In Chapter 2 we introduce Graph Spectral Theory to frame the problem of representation learning over graphs. We will see that the core mathematical interpretation of learning over graphs has a deep connection with explicitly learning the graph spectrum through the eigen-decomposition of the combinatorial graph Laplacian.
The graph Laplacian is a non-Toeplitz, diffusion-like operator whose eigenbasis plays a role similar to that of the Fourier basis on regular structures. Such an operator is central to learning to represent input graphs in lower-dimensional embedding spaces, i.e. to learning to smooth out its operands. In Chapter 2 we thus introduce GCNs, which build on top of the graph Laplacian as polynomial filters that implicitly and efficiently learn the graph spectrum. GCNs, as we will see, can be read as learning models acting as implicit power methods, learning graph embeddings that are also graph Laplacian eigenbases. GCNs alone, however, are not sufficient to fully capture the properties of input graphs, especially in graph classification tasks. In Chapter 3 we formally introduce the problem of graph pooling, and then a simple semantics describing the functioning of an abstract learnable pooling layer that works jointly with GCNs to unfold the compositional structure of input graphs and increase the model's predictive performance. Common practice over graph-structured data is to employ the symmetric functions sum, average and max to naively induce subsampling, similarly to what is done over regular structures; in Chapter 2 we argue that such functions simply squeeze graph features into a lossy signature vector. In Chapter 3 we also review current state-of-the-art pooling mechanisms tackling these limitations, taking DiffPool as the most representative among the novel fully-differentiable approaches. In Chapter 3 we then address the pooling problem by introducing two pooling mechanisms, AVBPool and NMFPool. Both are built on the Non-Negative Matrix Factorization (NMF) of a matrix representing node adjacency and node similarity, as adaptively obtained through the vertex embeddings learned by the model. These mechanisms are applied to obtain incrementally coarser graphs, where nodes are adaptively pooled into communities based on the outcome of the non-negative factorization. NMF is a powerful yet easy-to-interpret method for decomposing positive matrices into positive product factors that expose the communities underlying graph data. In Chapter 4, an empirical analysis on graph classification benchmarks shows how the NMFPool operator yields significant improvements in the predictive performance of the model with respect to its non-pooled counterpart. In particular, we show how such a simple and general-purpose approach almost matches DiffPool in terms of predictive performance, and how it has the potential to give even better results. Finally, in Chapter 5, we outline ideas to further enhance NMFPool: symmetric variants of NMF, and probabilistic generative approaches that learn the underlying probability distribution of data with latent hierarchical structure, are the proposed future directions for extending our model.
Background
Nowadays the Deep Learning community is actively studying and evaluating novel techniques aiming at generalizing Deep Convolutional Neural Network (DCNN) architectures to non-structured domains [7]. In particular, Geometric Deep Learning denotes the sub-field of Deep Learning that aims at porting the successful Euclidean convolutional model from structured to non-structured data, namely high-dimensional manifolds with added structural information; graphs are discrete examples of such manifolds. We refer to agnostic neural models over such domains as Graph Convolutional Networks (GCNs) [20]. Just as, on flat spaces, feature engineering has largely been replaced by the general empirical assumptions made by Euclidean Convolutional Neural Networks, we would like to reach the same goal on high-dimensional relational data such as graphs. This is not straightforward, since the hypotheses that hold on Euclidean spaces no longer hold on non-Euclidean ones. Graphs offer the natural framework to generalize regular structures. In this chapter we discuss the construction of Deep Graph Convolutional Neural Networks following two different approaches: one based on the spectral decomposition of the Laplacian, and a second one, currently the most widely used, relying on a spatial representation of the graph, where the graph embedding is learned by aggregating k-order node embeddings by means of fixed-point operators. To draw parallels with these novel models, it is worth mentioning the works of Micheli [28] and Scarselli et al. [30] as the first prominent attempts at extending neural networks to process data represented in graph domains. Their contribution has been essential to shed light on the inner workings of Graph Neural Networks (GNNs). The formal contextual approach of Micheli [28], and the proof given by Scarselli et al. [30] that the learned graph filter is a contraction map, i.e. a fixed-point operator, provide strong theoretical foundations upon which current GCN models are built.
2.1 Introduction
Currently, many real-world phenomena are modeled as interacting objects, possibly living in high-dimensional manifolds with added topological structure. Examples can be found in genomics with protein-protein interaction networks, fake news detection in social networks, functional networks in neuroscience, and so on. Graphs are the natural mathematical model for such data with an underlying non-Euclidean nature. The deluge of graph-structured data is forcing the Deep Learning community to push deep neural architectures towards generalized agnostic models, with no a priori knowledge of the input data, for representation learning over non-Euclidean domains. Over such spaces, the statistical properties of stationarity, locality and compositionality, which we take for granted on flat spaces, may not be easily satisfied. These statistical properties are outstandingly well exploited by Euclidean Convolutional Neural Networks, and trying to replicate their success over non-Euclidean domains requires a substantial redesign of the mathematics for adaptively learning graph embeddings. Deep Learning models have proven successful over signals with an underlying Euclidean structure; on such domains they extract meaningful information by leveraging the simple statistical properties characterizing Euclidean spaces. For example, viewing images as signals over a discrete sampling of the Euclidean plane (i.e. a grid-like structure), we can easily apply convolutional operators based on the fixed local connectivity of the grid (i.e. each pixel has a fixed number of neighbouring pixels over which a learnable spatial filter is applied), and by repeatedly applying symmetric functions for sub-sampling purposes (i.e. pooling) we can ensure the shift-invariance property, so that at inference time we do not care where a particular feature is in space but only whether that feature is present. Furthermore, the compositional nature of the Euclidean plane is fully exploited, giving us a hierarchical and abstract representation of the image under consideration. In addition to such priors over data, Deep Convolutional Neural Networks are particularly efficient in terms of space, without sacrificing the retention capacity of the whole model, thanks to shared weights, so that the most interesting features are easily detected across the whole plane.
Meanwhile, non-Euclidean data is constantly arising in many fields, and at present we still lack powerful methods for learning from relational data. Non-Euclidean spaces do not have the aforementioned statistical properties characterizing Euclidean spaces, at least not globally. As an example, consider the convolutional operator: the spectrum of a graph lives in a (Fourier-like) spanning basis that depends entirely on the topological structure of the non-Euclidean space considered. Relational data may also exhibit very high variability in topology throughout the space, so that even small variations in the topology of a graph may result in a totally different spectrum. Applying the basic convolutional operator designed for fixed topologies (e.g. grid-like structures) to non-fixed topologies is therefore not straightforward, and only weak generalization is achieved due to the basis-dependence of the learned filters. It is also inconceivable to make assumptions on the size of the local filters to be applied: if we intend to capture the many local variations of a manifold, we should expect the number of filters to grow exponentially. This forces us to study the problem of learning features over graphs from a totally new mathematical perspective.
2.2 Basic notation
In the following section we give a brief introduction to the notation and to the basic concepts from manifold calculus, deliberately adopting the notation of [7].
Graph. A graph $\mathcal{G}$ is a tuple $\mathcal{G} = (\mathcal{V}, \mathcal{E}, A)$, where $\mathcal{V}$ is the set of vertices of the graph, $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$ is the set of edges connecting vertices, and $A \in \mathbb{R}^{n \times n}$, with $n = |\mathcal{V}|$, is the adjacency matrix such that
$$
A_{i,j} =
\begin{cases}
a_{i,j} > 0 & \text{if } (i, j) \in \mathcal{E} \\
0 & \text{otherwise.}
\end{cases}
$$
Note that here we consider undirected graphs, i.e. such that $(i, j) \in \mathcal{E}$ iff $(j, i) \in \mathcal{E}$, so the associated $A$ is symmetric, namely $A = A^\top$. It is not uncommon, however, to encounter non-symmetric, i.e. rectangular, matrices, modeling for example bipartite graphs. In the present work, without loss of generality, we restrict ourselves to undirected graphs, and thus to non-negative symmetric matrices (some problems may require negative entries, but that is not our case). For labeled graphs we also assume $X \in \mathbb{R}^{n \times d}$ to be the matrix of the $n$ signals $x_i \in \mathbb{R}^d$ associated to each node.
Manifold. Informally, a manifold is a high-dimensional space that only locally resembles a Euclidean space. Think for example of a sphere-like geometry such as planet Earth: standing at a single point and looking around, we could easily be fooled into thinking we are on a planar space, but zooming out we lose that perspective, as the space no longer resembles a flat one but a (globally) spherical one. Formally, a (differentiable) $n$-dimensional manifold $\mathcal{X}$ is a topological space in which each point $x$ has a neighbourhood homeomorphic to an $n$-dimensional Euclidean space. The local neighbourhood of $x$ is called the tangent space and is indicated with $T_x\mathcal{X}$. Tangent spaces are equipped with an inner product $\langle \cdot, \cdot \rangle_{T_x\mathcal{X}} : T_x\mathcal{X} \times T_x\mathcal{X} \to \mathbb{R}$, allowing us to measure distances, angles and volumes.
2.3 Calculus on Manifolds
We begin by defining inner products of functions over the manifold as follows
$$
\langle f, g \rangle_{L^2(\mathcal{X})} = \int_{\mathcal{X}} f(x)\, g(x)\, dx, \qquad f, g : \mathcal{X} \to \mathbb{R}, \tag{2.1}
$$
$$
\langle F, G \rangle_{L^2(T\mathcal{X})} = \int_{\mathcal{X}} \langle F(x), G(x) \rangle_{T_x\mathcal{X}}\, dx, \qquad F, G : \mathcal{X} \to T\mathcal{X}, \tag{2.2}
$$
where $L^2(\mathcal{X})$ and $L^2(T\mathcal{X})$ are Hilbert spaces. We also extend the concept of derivative to manifolds, introducing the differential of a function
$$
df : T\mathcal{X} \to \mathbb{R}, \tag{2.3}
$$
which, in its linear form, reads
$$
df(x) = \langle \nabla f(x), \cdot \rangle_{T_x\mathcal{X}}. \tag{2.4}
$$
The operator $\nabla : L^2(\mathcal{X}) \to L^2(T\mathcal{X})$, called the intrinsic gradient, has the usual significance of a vector defining the direction of steepest change of the function at a point, but such a vector is now a tangent vector.
Then, given a function $f(x)$ at a given point $x$, and a function $F(x) \in T\mathcal{X}$ defining a displacement in a small neighbourhood of the given point on the manifold, we obtain the rate of change of $f$ as follows
$$
df(x)F(x) = \langle \nabla f(x), F(x) \rangle_{T_x\mathcal{X}}. \tag{2.5}
$$
We have thus generalized the notion of directional derivative. Finally, without going into much detail, we introduce the Laplace operator $\Delta : L^2(\mathcal{X}) \to L^2(\mathcal{X})$, defined as
$$
\Delta f = -\mathrm{div}(\nabla f). \tag{2.6}
$$
The divergence operator acts upon tangent vectors producing a scalar field, i.e. it gives a quantity associated to a vector field. If we think of a tangent vector field as a flow of material on a manifold, we can interpret $\Delta f$ "as the difference between the average of a function on an infinitesimal sphere around a point and the value of the function at the point itself" [7].
2.4 Calculus on Graphs
On discrete manifolds we introduce functions $f : \mathcal{V} \to \mathbb{R}$ and $F : \mathcal{E} \to \mathbb{R}$, and Hilbert spaces $L^2(\mathcal{V})$ and $L^2(\mathcal{E})$ of such functions, by defining the inner products
$$
\langle f, g \rangle_{L^2(\mathcal{V})} = \sum_{i \in \mathcal{V}} w_i f_i g_i, \tag{2.7}
$$
$$
\langle F, G \rangle_{L^2(\mathcal{E})} = \sum_{(i,j) \in \mathcal{E}} a_{ij} F_{ij} G_{ij}. \tag{2.8}
$$
The graph gradient is defined as
$$
(\nabla f)_{ij} = f_i - f_j, \qquad \nabla : L^2(\mathcal{V}) \to L^2(\mathcal{E}), \tag{2.9}
$$
satisfying $(\nabla f)_{ij} = -(\nabla f)_{ji}$. The graph divergence is instead an operator doing the converse
$$
(\mathrm{div}\, F)_i = \frac{1}{w_i} \sum_{j : (i,j) \in \mathcal{E}} w_{ij} F_{ij}, \qquad \mathrm{div} : L^2(\mathcal{E}) \to L^2(\mathcal{V}). \tag{2.10}
$$
The graph Laplacian, being $\Delta = -\mathrm{div}\,\nabla$, is defined as follows
$$
(\Delta f)_i = \frac{1}{w_i} \sum_{(i,j) \in \mathcal{E}} a_{ij}(f_i - f_j), \qquad \Delta : L^2(\mathcal{V}) \to L^2(\mathcal{V}), \tag{2.11}
$$
or, in compact form,
$$
\Delta f = W^{-1}(D - A)f, \tag{2.12}
$$
where $W$ is the diagonal matrix of weights associated to the vertices and $D$ is the degree matrix defined as $D_{ii} = \sum_j a_{ij}$. Different choices of $W$ lead to different forms of the Laplacian, e.g. $W = I$ gives the unnormalized Laplacian and $W = D$ the random walk Laplacian.
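As a concrete illustration, the short NumPy sketch below (function name and the toy graph are assumptions of this example, not part of the thesis implementation) builds the two Laplacian variants just mentioned from a symmetric adjacency matrix.

```python
import numpy as np

def graph_laplacians(A):
    """Return the unnormalized (W = I) and random-walk (W = D) Laplacians
    of an undirected graph given its symmetric adjacency matrix A."""
    deg = A.sum(axis=1)                 # node degrees, D_ii = sum_j a_ij
    D = np.diag(deg)
    L_unnorm = D - A                    # W = I  ->  Delta = D - A
    L_rw = np.diag(1.0 / deg) @ (D - A) # W = D  ->  Delta = D^{-1}(D - A)
    return L_unnorm, L_rw

# Toy graph: a triangle with one pendant node attached to vertex 2.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L, L_rw = graph_laplacians(A)
```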
2.5 Harmonic Analysis on Graphs
Now that we have introduced the basic concepts of calculus on graphs, we can look more closely at the combinatorial Laplacian $\Delta$. We refer to the theory of harmonic analysis on graphs compactly summarized in [3], [8]. From 2.11, we compute the smoothness of a one-dimensional graph signal as
$$
f^\top \Delta f = \frac{1}{w_i} \sum_{(i,j) \in \mathcal{E}} a_{ij}(f_i - f_j)^2, \qquad \Delta : L^2(\mathcal{V}) \to L^2(\mathcal{V}), \tag{2.13}
$$
with $f$ being the $n$-dimensional vector of the points on the real line associated with each node in the graph. Equation 2.13 can be generalized to $d$-dimensional signals as follows
$$
\mathrm{Trace}(F^\top \Delta F) = \frac{1}{w_i} \sum_{(i,j) \in \mathcal{E}} a_{ij}\|f_i - f_j\|_2^2, \qquad \Delta : L^2(\mathcal{V}) \to L^2(\mathcal{V}), \tag{2.14}
$$
where $F \in \mathbb{R}^{n \times d}$. Thus 2.14 can be minimized to learn a graph embedding, i.e. an ideally injective function mapping similar nodes to similar embeddings. The graph Laplacian defined as $\Delta = D - A$ is known to be such that
$$
\Delta \vec{1} = \vec{0}, \tag{2.15}
$$
with $\vec{1}$ and $\vec{0}$ the vectors of ones and zeros respectively. Thus, the smoothest signal is constant at each node of the graph. In the one-dimensional case, to avoid collapsing each node onto the real number 1, we further impose orthogonality constraints on the learned embeddings. This generalizes to $d$-dimensional spaces by imposing $F^\top F = I$. Problem 2.14 then becomes the non-convex minimization problem known as the Rayleigh quotient: the positive semidefinite Laplacian matrix projects the input graph data onto non-trivial subspaces of $\mathbb{R}^d$, with the smoothness of $F$ read off from $\Delta$'s eigenvalues, in a way similar to what Fourier coefficients do over grid domains. Essentially, we learn to modulate the smoothness of the Laplacian's operands. Unfortunately, graph-structured data may not be homogeneous enough in space to apply a template, as we do on regular spaces, and record the correlation of the template across the whole data. Constructing relational learning models using the graph Laplacian may also be limiting, since most graphs have meaningful eigenvectors only for the very top of the spectrum, and the model may be too biased towards the learned spectrum.
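The connection between Laplacian eigenvectors and smooth embeddings can be checked numerically; the sketch below (NumPy, toy graph and helper names are assumptions of this example) evaluates the quadratic form $f^\top \Delta f$ and extracts an orthonormal embedding from the eigenvectors with smallest non-trivial eigenvalues.

```python
import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A               # unnormalized Laplacian

def smoothness(L, f):
    """Quadratic form f^T L f: small when f varies little across edges."""
    return float(f @ L @ f)

def spectral_embedding(L, d):
    """d-dimensional embedding from the eigenvectors of L with the smallest
    non-zero eigenvalues (the constant eigenvector of eigenvalue 0 is skipped)."""
    eigvals, eigvecs = np.linalg.eigh(L)     # L is symmetric
    return eigvecs[:, 1:d + 1]

print(smoothness(L, np.ones(4)))                     # 0.0: constant signal is smoothest
print(smoothness(L, np.array([1., -1., 1., -1.])))   # large: rapidly varying signal
F = spectral_embedding(L, d=2)                       # F^T F = I by construction
```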
2.6 Convolutional Neural Networks
Convolutional Neural Networks (CNNs) [22] have been extremely successful in problems where the coordinates of the input data have a grid structure, and the underlying manifold is equivariant/invariant with respect to translations on this grid. Think for example of a pixel in an image being in a spatial dependence, i.e. a joint probability, with another pixel. If at inference time such pixels happen to have moved away from each other, thanks to the invariance prior we have enough generalization strength to correctly predict the part of the image with translated pixels, i.e. the joint probability of the two pixels being in a long-range relationship. We also referred to the compositionality property, by which the discrete flat space under consideration, as in images, can be broken down into lower-resolution planes through the application of pooling and sub-sampling layers, realizing a multi-scale, abstract representation of the samples. In the present section we give a more formal definition of the geometric priors mentioned so far that have made Convolutional Neural Networks so successful.
2.6.1 A brief review
Let $f \in L^2(\Omega)$ be a square-integrable function defined on a compact Euclidean space $\Omega = [0,1]^d \subset \mathbb{R}^d$. The generic supervised setting we consider can be formalized as the task of learning an unknown function $y : L^2(\Omega) \to \mathcal{Y}$ from observations $\{f_i \in L^2(\Omega),\ y_i = y(f_i)\}_{i \in \mathcal{I}}$, with $\mathcal{Y}$ the target space, which may be discrete, continuous or a $K$-dimensional simplex (i.e. posterior probabilities) depending on the task. A Convolutional Neural Network (CNN) can be formalized as the composition of several differentiable layers. Let $U_\Theta(f)$ be a CNN parametrized by differentiable parameters $\Theta$, defined as follows
$$
U_\Theta(f) = \left(C_{\theta^{(k)}} \circ \cdots \circ P \circ \cdots \circ C_{\theta^{(1)}}\right)(f), \tag{2.16}
$$
where $f(x) = (f_1(x), \cdots, f_p(x))$ is a $p$-dimensional vector of signals in the $d$-dimensional space, and $C_{\theta^{(j)}}$ is a convolutional operator applying a set of learnable filters $\{\theta_{\ell,i}\}$ as follows
$$
g_\ell(x) = \xi\left(\sum_{i=1}^{p} (f_i \star \theta_{\ell,i})(x)\right), \qquad \ell = 1, \cdots, q,\ i = 1, \cdots, p, \tag{2.17}
$$
where $\xi$ is a point-wise non-linearity and $\star$ indicates the standard convolution. We thus produce a $q$-dimensional output feature map given by $g(x) = (g_1(x), \cdots, g_q(x))$.
Example. Consider the case of $f$ being a $p$-channel image, with each channel $f_i$ having $d_1 \times d_2$ pixels, and suppose the image and filters are stored in 2-dimensional arrays indexed with the usual 2D array notation, i.e. $f_i(m, n)$. Then, the $\ell$-th convolution over the $i$-th channel is computed as follows
$$
(f_i \star \theta_{\ell,i})(m, n) = \sum_{t_1} \sum_{t_2} f_i(m - t_1, n - t_2)\, \theta_{\ell,i}(t_1, t_2). \tag{2.18}
$$
The above discrete convolution can also be seen as a matrix-vector multiplication, with the input image in vector form and the filter expanded into a circulant Toeplitz matrix, i.e. the filter appearing on each row shifted by one element. Note that in 2.17 we applied a summation over the convolutions, but in practice we could apply whatever function of interest, e.g. averaging or feature concatenation. Without loss of generality we keep a simple symmetric function such as the summation.
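The sketch below is a direct, unoptimized NumPy transcription of Equation (2.18) for a single channel, with zero padding outside the image; the function name and the toy data are assumptions of this example.

```python
import numpy as np

def conv2d(f, theta):
    """Discrete 2-D convolution of Eq. (2.18): out(m, n) =
    sum_{t1, t2} f(m - t1, n - t2) * theta(t1, t2), with zero padding."""
    H, W = f.shape
    kh, kw = theta.shape
    out = np.zeros((H, W))
    for m in range(H):
        for n in range(W):
            acc = 0.0
            for t1 in range(kh):
                for t2 in range(kw):
                    i, j = m - t1, n - t2          # index f(m - t1, n - t2)
                    if 0 <= i < H and 0 <= j < W:  # zero outside the image
                        acc += f[i, j] * theta[t1, t2]
            out[m, n] = acc
    return out

f = np.arange(16, dtype=float).reshape(4, 4)       # toy single-channel image
theta = np.array([[0., 1.], [1., 0.]])             # toy 2x2 filter
g = conv2d(f, theta)
```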
2.6.2 Pooling
In 2.16, $P$ indicates the pooling layer used for down-sampling, namely for computing summary statistics in a localized fashion. We can define the operator $P$ as follows
$$
p_\ell(x) = P\left(\{f_\ell(x') : x' \in \mathcal{N}(x)\}\right), \qquad \ell = 1, \cdots, q, \tag{2.19}
$$
where $\mathcal{N}(x) \subset \Omega$ is a neighbourhood of point $x$ and $P$ is any permutation-invariant (symmetric) non-differentiable function, e.g. sum, max or average. As stated previously, the pooling function helps the model ensure invariance with respect to translations, i.e. moving a feature by a small amount in space does not affect the model's prediction capability. Sub-sampling can also be achieved using strided and dilated convolutions. For a more in-depth discussion of convolution arithmetic see [13].
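For grid data, Equation (2.19) reduces to the familiar sliding-window pooling; a minimal NumPy sketch of non-overlapping 2x2 max pooling (the helper name is an assumption of this example):

```python
import numpy as np

def max_pool_2x2(f):
    """Non-overlapping 2x2 max pooling of a single-channel feature map:
    each output entry summarizes a 2x2 neighbourhood N(x) with max."""
    H, W = f.shape
    f = f[:H - H % 2, :W - W % 2]                      # drop odd borders
    return f.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

g = np.arange(16, dtype=float).reshape(4, 4)
p = max_pool_2x2(g)                                    # shape (2, 2) summary of g
```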
2.6.3 Geometric priors
The reasons behind the success of CNNs are to be found in the empirically observed properties of Euclidean data: the stationarity and the invariance-to-deformations priors. Stationarity states that, given a translation operator $T_v f(x) = f(x - v)$, $x, v \in \Omega$, we can ensure invariance to translations, namely
$$
y(T_v f) = y(f), \tag{2.20}
$$
and we can also ensure equivariance to translations, namely
$$
y(T_v f) = T_v y(f). \tag{2.21}
$$
The latter prior depends on whether the domain $\mathcal{Y}$ is a space where translations are well defined, as in the cases of motion estimation or semantic segmentation.
The invariance-to-deformations prior, instead, places stronger hypotheses on what a CNN should be able to capture. We restate the invariance and equivariance priors redefining the translation operator as a deformation operator $L_\tau$, where $\tau : \Omega \to \Omega$, such that $L_\tau f(x) = f(x - \tau(x))$. Then priors 2.20 and 2.21 become
$$
|y(L_\tau f) - y(f)| \approx \|\nabla \tau\|, \tag{2.22}
$$
$$
|y(L_\tau f) - L_\tau y(f)| \approx \|\nabla \tau\|. \tag{2.23}
$$
An important consequence is that we can break down long-range dependencies, obtaining as a result a multi-scale abstract representation of the data under consideration.
Weight sharing. Another key feature that has made CNNs particularly successful is their ability to share weights across the $d$-dimensional plane, so that each layer has a number of learnable parameters that does not depend on the size of the input signal, making CNNs particularly efficient in terms of space usage without losing information retention capability.
2.7 Graph Convolutional Neural Networks
In the present work we are mainly interested in graphs as models for representing interactions, similarities, or any other relation that may be established between objects living in high-dimensional spaces, and in how to learn such interactions using agnostic, end-to-end differentiable models over non-Euclidean spaces. Typical applications include learning on networks of proteins, in which the bonding relations arising from biochemical events triggered by the proteins themselves (protein-protein interactions, PPI) realize a geometry that is no longer Euclidean. Other applications include the automatic detection of fake news spreading on social media [29], a task of utmost importance in the present times. Representation learning over graph domains has a long history, starting from classical Spectral Graph Theory [7], [8]. Vanilla GCN layers, as we will see, are built upon the combinatorial graph Laplacian, a linear operator for smoothing out graph signals and for learning trajectories of dynamical systems. We will also observe that applying the graph Laplacian over a neighbourhood centered at a node, with radius defined as the number of hops from that node, may be computationally infeasible; furthermore, it suffers from weak combinatorial generalization over unseen domains. State-of-the-art Graph Convolutional Networks (GCNs) try to overcome these difficulties with convolutions based on k-order Chebyshev polynomials, introducing the interesting duality of implicitly learning the graph spectrum while acting only on the spatial representation. GCNs efficiently avoid the computational burden of spectrally decomposing the whole graph (or a subset of its frequencies) and of projecting onto the frequency domain, and the learned filters are independent of the number of nodes in the graph. Note that the Graph Convolutional Network (GCN) model is not substantially different from the contextual approach to graph processing put forward by [28] a decade before GCNs, and recently extended to a probabilistic formulation by [2]. The theory we present here mainly refers to Spectral Graph Theory as elegantly summarized in [7] and [8].
2.7.1 Spectral construction
A first Deep Learning approach on graphs is to explicitly learn the graph spectrum. Consider a square-integrable function on $\mathcal{X}$; it can be decomposed into a Fourier series
$$
f(x) = \sum_{i \geq 0} \langle f, \phi_i \rangle_{L^2(\mathcal{X})}\, \phi_i(x), \tag{2.24}
$$
where the $\phi_i$'s are the eigenfunctions arising from the eigen-decomposition of the Laplacian. Such a basis is orthonormal, and Spectral Graph Theory tells us that it generalizes the Fourier basis to the context of manifolds. The definition of eigenfunction carries the same meaning as that of eigenvector (though an eigenfunction takes its input over a finite/infinite space). The operation $\langle f, \phi_i \rangle$ can be described as the (forward) Fourier transform, and the further multiplication by $\phi_i(x)$ can be referred to as the (backward) Fourier transform, projecting the Fourier coefficients back onto the original domain.
Thus the classical convolution operator can be restated as follows
$$
(f \star g)(x) = \sum_{i \geq 0} \langle f, \phi_i \rangle_{L^2(\mathcal{X})}\, \langle g, \phi_i \rangle_{L^2(\mathcal{X})}\, \phi_i(x). \tag{2.25}
$$
The continuous inner product can be substituted by its discrete counterpart in the case of graphs, with eigenfunctions represented as column vectors. Thus, applying a convolution over a graph amounts to applying a linear operator $C$ such that
$$
Cf = \Phi \Lambda \Phi^\top f, \tag{2.26}
$$
where $C \in \mathbb{R}^{n \times n}$ is the matrix representation of the convolution operator, $\Lambda$ is the spectral representation of the filter (to be learned), a diagonal matrix, $\Phi \in \mathbb{R}^{n \times n}$ is the matrix of column-stacked eigenfunctions from the Laplacian eigen-decomposition, and $f \in \mathbb{R}^n$ is the input signal.
Thus the global structure of the graph can be exploited through the spectrum of its Laplacian. As already mentioned, in the context of learning on graphs we lack shift-invariance, since the filter matrix depends on the learned basis; this translates into the absence of a circulant structure in matrix $C$, i.e. $C$ is not a Toeplitz matrix. Besides the weaknesses of this model for the reasons already discussed, we also incur a massive computation, given by the eigen-decomposition and by the matrix-matrix multiplications at each filter application, though much of this computational cost can be saved by using only the first $k \ll n$ eigenfunctions. Representative of this approach is the Spectral CNN model [8], where each layer $k = 1, \cdots, K$ transforms an input vector $f_k$ of size $n \times p_{k-1}$, with $n$ the number of nodes and $p_k$ the number of learned filters at layer $k$, into an output
$$
f_{k+1,j} = \xi\left(\Phi \sum_{i=1}^{p_{k-1}} \Lambda_{k,i,j} \Phi^\top f_{k,i}\right), \qquad j = 1, \cdots, p_k, \tag{2.27}
$$
where $\Lambda_{k,i,j}$ is a diagonal matrix and $\xi$ is an element-wise non-linearity. Consider again the possibility of cutting off frequencies by taking only the first $d$ eigenvectors. We still face the limit of depending on the size of the input: in fact, $p_{k-1} \cdot p_k \cdot d = O(n)$ parameters are required to train model 2.27. In [8], Equation 2.27 is reframed considering also the application of a pooling layer implemented as an agglomerative clustering algorithm.
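A numerical sketch of the filtering step of Eq. (2.26), assuming the unnormalized Laplacian and using a random vector as a stand-in for the learned diagonal spectral filter (all names here are illustrative, not the Spectral CNN implementation):

```python
import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A            # unnormalized graph Laplacian

eigvals, Phi = np.linalg.eigh(L)          # Phi: graph Fourier basis (eigenvectors)

f = np.random.randn(4)                    # one-dimensional node signal
lam = np.random.randn(4)                  # diagonal of Lambda (stand-in for learned filter)

f_hat = Phi.T @ f                         # forward graph Fourier transform
f_filtered = Phi @ (lam * f_hat)          # Eq. (2.26): C f = Phi Lambda Phi^T f
```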
2.7.2 Spatial construction
A step further to ameliorate the previous approach is to find a way to implicitly represent convolutions over the spectral domain. Filter applications then take place at the spatial level, leveraging the topological structure while implicitly learning the spectral structure of the graph under consideration. This can be accomplished by giving the filter a polynomial expansion, namely
$$
g_\alpha(\Lambda) = \mathrm{diag}(g_\alpha(\lambda_1), \cdots, g_\alpha(\lambda_n)), \tag{2.28}
$$
where
$$
g_\alpha(\lambda) = \sum_{j=0}^{r-1} \alpha_j \lambda^j. \tag{2.29}
$$
We can then express as many filters as we want, with each polynomial being implicitly a function of the Laplacian's eigenvalues; the only task required is learning the $\alpha_j$'s, i.e. the polynomial coefficients. It is easy to see that the eigenvalue powers control the range of application of the filter matrix, and since the polynomial is a sum of eigenvalue powers, the convolution is applied at most to $r$-hop neighbourhoods.
Graph CNN (GCNN), a.k.a. ChebNet [11], exploited the above observation by employing the Chebyshev polynomials, recursively defined as follows
$$
T_j(\lambda) = 2\lambda T_{j-1}(\lambda) - T_{j-2}(\lambda), \qquad T_0(\lambda) = 1,\ T_1(\lambda) = \lambda. \tag{2.30}
$$
Thus, if we consider the filter as a Chebyshev polynomial of the rescaled spectrum, denoted by
$$
g_\alpha(\hat{\Delta}) = \sum_{j=0}^{r-1} \alpha_j\, \Phi\, T_j(\hat{\Lambda})\, \Phi^\top, \tag{2.31}
$$
we can conveniently rewrite it as
$$
g_\alpha(\hat{\Delta}) = \sum_{j=0}^{r-1} \alpha_j\, T_j(\hat{\Delta}), \tag{2.32}
$$
where $\hat{\Delta} = 2\lambda_n^{-1}\Delta - I$ and $\hat{\Lambda} = 2\lambda_n^{-1}\Lambda - I$ are rescalings mapping the eigenvalues into $[-1, 1]$. Since Chebyshev polynomials form an orthogonal basis on that interval, we can perform convolutions without explicitly learning the orthonormal basis of the graph, by applying the recursive definition 2.30 to 2.32 to compute $T_j$ applied to a signal $f$, namely $f_j = T_j(\hat{\Delta})f$, ending up with
$$
f_j = 2\hat{\Delta} f_{j-1} - f_{j-2}, \tag{2.33}
$$
with $f_0 = f$ and $f_1 = \hat{\Delta}f$. The complexity now accounts only for matrix-vector multiplications, with no explicit eigen-decomposition of the convolutional operator.
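A sketch of the recursion (2.33), assuming the rescaled Laplacian has already been formed; only matrix-vector products are needed, and the function name and coefficients below are illustrative, not the ChebNet implementation.

```python
import numpy as np

def cheb_filter(L_hat, f, alphas):
    """Apply the Chebyshev filter sum_j alpha_j T_j(L_hat) f using the
    recursion f_j = 2 L_hat f_{j-1} - f_{j-2} of Eq. (2.33)."""
    f_prev2, f_prev1 = f, L_hat @ f               # T_0 f = f, T_1 f = L_hat f
    out = alphas[0] * f_prev2
    if len(alphas) > 1:
        out = out + alphas[1] * f_prev1
    for alpha in alphas[2:]:
        f_j = 2 * (L_hat @ f_prev1) - f_prev2
        out = out + alpha * f_j
        f_prev2, f_prev1 = f_prev1, f_j
    return out

# L_hat = (2 / lambda_max) L - I maps the spectrum into [-1, 1].
A = np.array([[0., 1., 1.], [1., 0., 1.], [1., 1., 0.]])
L = np.diag(A.sum(1)) - A
L_hat = 2.0 / np.linalg.eigvalsh(L).max() * L - np.eye(3)
y = cheb_filter(L_hat, np.random.randn(3), alphas=[0.5, -0.2, 0.1])
```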
A widely used convolutional layer over graphs, built as a linear approximation of the form 2.33, is the Graph Convolutional Network (GCN) model [20]. A GCN layer is a truncation with $r = 2$, i.e. a first-order polynomial, namely
$$
g_\alpha(f) = \alpha \left(I + D^{-1/2} A D^{-1/2}\right) f, \tag{2.34}
$$
with the polynomial coefficients constrained to be $\alpha = \alpha_0 = -\alpha_1$. Note that 2.34 employs a symmetrically normalized adjacency operator with added self-loops. A non-linearity could be applied (e.g. ReLU), but recent work [37] suggests that the non-linearity does not improve much over the linear approximation 2.34. Taking into account the multi-dimensionality of the data, we can reframe 2.34 as follows
$$
g_\Theta(F) = \xi\left(\left(I + D^{-1/2} A D^{-1/2}\right) F \Theta\right), \tag{2.35}
$$
with $F$ the $n \times d$ matrix of vector signals, $\Theta$ the $d \times q$ matrix of trainable weights, $q$ the number of neurons per layer, and $\xi$ an element-wise non-linearity. The authors of [20] reframed layer 2.35 to avoid numerical instabilities as follows
$$
g_\Theta(X) = \xi\left(D_1^{-1/2} A_1 D_1^{-1/2} X \Theta\right), \tag{2.36}
$$
with $A_1 = A + I$ the adjacency matrix with added self-loops, and $D_1$ the diagonal matrix of degrees of $A_1$. A model can be built by stacking such layers, interleaved with pooling layers. Studying pooling over graphs is the main motivation behind the present work; in the following sections we expose the problem in more detail along with a proposed solution, taking the layer given in 2.36 as the first building block of a new pooling model on graphs.
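A minimal NumPy sketch of the renormalized propagation rule (2.36); the random weight matrix and toy graph are stand-ins for the trainable parameters and the data, and the function name is an assumption of this example.

```python
import numpy as np

def gcn_layer(A, X, Theta, xi=np.tanh):
    """One GCN layer, Eq. (2.36): xi(D1^{-1/2} A1 D1^{-1/2} X Theta),
    with A1 = A + I (self-loops) and D1 its degree matrix."""
    A1 = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A1.sum(axis=1))
    A_norm = A1 * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]   # D1^{-1/2} A1 D1^{-1/2}
    return xi(A_norm @ X @ Theta)

n, d, q = 5, 3, 4                                  # nodes, input feats, output feats
A = np.triu(np.random.rand(n, n) < 0.4, 1).astype(float)
A = A + A.T                                        # random symmetric adjacency
X = np.random.randn(n, d)                          # node signals
Theta = np.random.randn(d, q)                      # stand-in for trainable weights
H = gcn_layer(A, X, Theta)                         # new node embeddings, shape (n, q)
```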
2.8 Expressive power of GCNs
As seen in the previous sections, the most popular methods for deep learning on graphs are neural approaches where node features are propagated throughout the graph according to the underlying support. Different from fully agnostic differentiable graph models are those requiring graph features to be manually extracted beforehand and then fed to downstream classifiers (e.g. fully-connected layers or SVMs). While this approach may help models better capture the underlying problem, it introduces the drawback of imposing strong priors on the structure of the graph. Great efforts are currently directed at characterizing the expressive power of GCNs; recent attempts are [38], [25], which analyze their quasi-equivalence to the k-order Weisfeiler-Lehman (WL) graph isomorphism test and show how their discriminative power is limited over particular graph structures. Similar to GCNs, the WL test iteratively updates node features, i.e. embeddings, by aggregating the feature vectors of the k-order neighbours of each node. WL is powerful enough to learn injective functions mapping different neighbourhood structures to different embeddings: that is the main challenge for GCNs, namely how to effectively represent the multisets of neighbourhoods in order to map graphs into lower-dimensional embedding graphs preserving the original distances, i.e. an isometry.
2.8.1 k-order GCNs are as powerful as k-WL
k-WL test. The WL test is an effective algorithm for distinguishing a broad class of graphs, namely it asks whether two graphs are isomorphic. More formally, let $G = (\mathcal{V}, \mathcal{E}, d)$ be a colored graph, i.e. such that $d : \mathcal{V} \to \Sigma$ maps vertices to node colorings from an alphabet $\Sigma$. At each step of what is called color refinement, the color of each vertex is refined by a new color representing its current color and the multiset of its neighbours' colors, until there are no further changes in the graph color encoding. Two graphs $G$ and $G'$ are then isomorphic if there exists an edge- and color-preserving bijection $\phi : \mathcal{V} \to \mathcal{V}'$. This accounts for the 1-WL test. In general, k-order WL builds on $d : \mathcal{V}^k \to \Sigma$, namely it constructs colorings of k-tuples of vertices.
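A compact Python sketch of 1-WL color refinement, assuming graphs are given as adjacency lists and initial colors as integer labels (names and data are illustrative); two graphs whose refined color histograms differ at any iteration are certainly non-isomorphic.

```python
def wl_refine(adj, colors, iterations=3):
    """1-WL color refinement: each node's color is replaced by an index of
    (its color, the multiset of its neighbours' colors)."""
    for _ in range(iterations):
        signatures = [
            (colors[u], tuple(sorted(colors[v] for v in adj[u])))
            for u in range(len(adj))
        ]
        # Re-index the signatures as fresh integer colors.
        palette = {sig: i for i, sig in enumerate(sorted(set(signatures)))}
        colors = [palette[sig] for sig in signatures]
    return colors

# Triangle vs. path on three nodes: refinement tells them apart immediately.
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
path = {0: [1], 1: [0, 2], 2: [1]}
print(sorted(wl_refine(triangle, [0, 0, 0])))   # all nodes keep the same color
print(sorted(wl_refine(path, [0, 0, 0])))       # the middle node gets its own color
```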
k-order GCNs. We start by defining permutation-invariant GCNs as concatenations of maximally expressive linear equivariant layers [24]. This property is particularly desirable, recalling the geometric priors of classical CNNs. More formally, k-order invariant graph networks are a composition of layers of the form 2.36, but with inputs defined on higher-order tensors $X \in \mathbb{R}^{n^{k_i} \times a_i}$, with $k_i \in \{1, \cdots, k\}$, $a_i$ the feature (color) dimension, and $n$ the number of nodes in the graph. Let $L_i$ indicate such a layer prior to applying $\xi$. We say $L_i$ satisfies the equivariance property if
$$
L_i(gX) = gL_i(X) \tag{2.37}
$$
holds with respect to a permutation function $g$ acting on higher-order tensors. Analogously, $L_i$ satisfies the invariance property with respect to $g$ if
$$
L_i(gX) = L_i(X). \tag{2.38}
$$
It is easy to verify that a deep network stacking layers satisfying properties 2.37 and 2.38 also satisfies invariance and equivariance with respect to a permutation function. In [24] it has been proved that such models are universal given $k = n^2$, and that $k = 2$ suffices to approximate any message-passing neural network.
Given the above constructions, in [25] it is shown that, for every $2 \leq k \leq n$, k-order graph networks are at least as powerful as the k-WL graph isomorphism test. For further details on the proof see [25].
2.8.2 Structures confusing Pooling
Learned graph representations, i.e. node embeddings, are usually fed into fully-connected layers for node classification or graph classification tasks. For the latter, it is important to compress the most salient features of a graph into a single signature vector. To that end, the $n$ embedding vectors are summarized using graph-level aggregators; typical aggregator functions are the permutation-invariant and non-differentiable sum, average and max. Unfortunately, these are non-injective functions, and it is easy to show that in some cases they fail to project different graph structures onto different embeddings. To illustrate this, we provide a simple example here; refer to [38] for many others. For simplicity, consider one-dimensional node embeddings, and let $x^{(i)} \in \mathbb{R}$ be the feature of node $i$. Some structural properties, though trivial, can leave the model with no generalization power if we use the aforementioned pooling functions. Take for example the case in which a node $u$ has a neighbourhood made of nodes with features $x^{(a)}, x^{(b)}$, and a node $v$ has neighbourhood $x^{(c)}, x^{(d)}, x^{(e)}$. If the features in both neighbourhoods have the same repeated value, say $c$, then using the average or max functions the model confuses the two different topologies as the same: in fact $2c/2 = 3c/3$ and $\max(c, c) = \max(c, c, c)$. The sum function, on the contrary, discriminates well between the two structures. There are cases in which the average aggregator performs as well as the sum aggregator: for example, when node degrees are used as features, the average function recovers the sum. Thus we can argue that the mean and max operators may fail in cases with repeating patterns. Further examples are given in [38].
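The failure mode described above is easy to reproduce numerically; in the toy check below (constant feature $c = 1.0$, an assumption of this example), mean and max return identical read-outs for the two neighbourhoods, while sum separates them.

```python
import numpy as np

c = 1.0
neigh_u = np.array([c, c])          # node u: two neighbours with feature c
neigh_v = np.array([c, c, c])       # node v: three neighbours with feature c

print(neigh_u.mean(), neigh_v.mean())   # 1.0 1.0  -> mean confuses u and v
print(neigh_u.max(),  neigh_v.max())    # 1.0 1.0  -> max confuses u and v
print(neigh_u.sum(),  neigh_v.sum())    # 2.0 3.0  -> sum tells them apart
```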
Graph Pooling by Non-Negative Matrix Factorization
When considering graph classification tasks, we lack a principled multi-resolution operator that gives a deep GCN model a more abstract representation of the input data as we go deeper in the network. Standard approaches to graph pooling employ non-differentiable symmetric functions such as max, summation or average along the feature axes of the graph embeddings, though in [38] these are shown to have little discriminative power over particular graph structures. In the present chapter, we introduce two pooling mechanisms, AVBPool and NMFPool. Both share the same idea of being built on top of the Non-Negative Matrix Factorization (NMF), leveraging the community structure underlying graph-structured data to induce subsampling of the input graph, in order to capture long-range interactions as we go deeper in GCNs. This is of practical interest especially in graph classification tasks, where the whole graph is fed into downstream learning systems as a single signature vector (see Figure 3.1). Applied to incrementally obtain coarser graphs, where nodes are pooled into communities based on the soft assignments output by the NMF of the graph adjacency matrix and of the Gram matrix of the learned graph embeddings, such mechanisms help GCNs increase information propagation through the whole graph.
3.1 Related work
A first attempt to formalize graph pooling can be found in [8], in which a simple framework for multiresolution clustering of a graph is given, based on a naive agglomerative method. Several recent works propose pooling mechanisms for graph coarsening in deep GCNs. In [9], a subset of the nodes is dropped based on a learnable projection vector, where at each layer only the top-k most interesting nodes are retained. In [17], a rough node sampling is employed together with a differentiable approach, based on an LSTM model, for learning aggregated node embeddings, though this may make it difficult to satisfy invariance with respect to node ordering. Interestingly, in [4] a simple and well-known method from graph theory is applied for node decimation, based on the largest eigenvector $u_{\max}$ of the graph Laplacian matrix: at each step the graph is halved into clusters $\mathcal{V}^+$ and $\mathcal{V}^-$ based on the sign of the components of the largest eigenvector; if $u_{\max}(i) > 0$ then the $i$-th node is put into cluster $\mathcal{V}^+$, otherwise into cluster $\mathcal{V}^-$, and pooling of the input graph signal is implemented by keeping the node embeddings in $\mathcal{V}^+$. They further employ a more sophisticated procedure, sparsified Kron reduction, to reduce the Laplacian matrix onto the set $\mathcal{V}^+$. Another relevant differentiable approach is that put forward by DiffPool [40], where the model learns soft assignments to pool similarly activating patterns into the same cluster, though the idea of learning a hierarchical soft-clustering of graphs via adjacency matrix decomposition, using a symmetric variant of NMF, dates back to [41]. In DiffPool, the learned soft assignment matrix is applied as a linear reduction operator on the adjacency matrix and on the input signal matrix, and the coarsened graph is then further convolved with GCNs.
3.2 The problem
The general problem of graph pooling, inducing a multi-resolution view of a graph, can be stated formally as follows.

Definition 3.2.1 (Graph pooling). Let $G = (\mathcal{V}_G, \mathcal{E}_G, A_G)$ be a graph with associated node labels $X_G \in \mathbb{R}^{|\mathcal{V}_G| \times d_0}$ and edge labels $E_G \in \mathbb{R}^{|\mathcal{E}_G| \times d_1}$, with $d_0$ and $d_1$ the number of features. Then $G$ is coarsened into a labeled graph $H = (\mathcal{V}_H, \mathcal{E}_H, A_H)$ by means of a function $C$, if it exists, with $X_H \in \mathbb{R}^{|\mathcal{V}_H| \times d_0}$ and $E_H \in \mathbb{R}^{|\mathcal{E}_H| \times d_1}$, such that $|\mathcal{V}_H| < |\mathcal{V}_G|$ and $|\mathcal{E}_H| < |\mathcal{E}_G|$.
Thus, the problem of pooling is related to that of embedding a graph $G$ into another graph $H$ only if the nodes and arcs of $H$ are representative of "things connecting the same" in $G$. To give a clue about the underlying mechanism capable of doing so, we introduce the following definition.

Definition 3.2.2 (Graph homomorphism). Let $G = (\mathcal{V}_G, \mathcal{E}_G, A_G)$ and $H = (\mathcal{V}_H, \mathcal{E}_H, A_H)$ be two labeled graphs. We say $G$ and $H$ are homomorphic [5] to each other if there exists a homomorphism $C : \mathcal{V}_G \to \mathcal{V}_H$ such that $(u, v) \in \mathcal{E}_G$ if and only if $(C(u), C(v)) \in \mathcal{E}_H$, for all $u, v \in \mathcal{V}_G$. For labeled graphs, such a homomorphism holds only if it maps vertices and edges with the same label.
Thus, we need to learn $C$ along with its symmetric counterpart that back-projects $H$ onto $G$, i.e. we need to learn reconstruction capability. As we will see, it is desirable to have the function $C$ associate each node in the graph with multiple overlapping communities, which is possible through a learned soft-cluster assignment matrix. As will be introduced, soft-cluster assignments can be obtained by decomposing the adjacency matrix and the Gram matrix of the learned embeddings using Non-Negative Matrix Factorization (NMF). Further, we will see a fully-differentiable pooling operator, which we refer to as AVBPool, based on NMF, and how it can be seen as a learnable graph encoder function. That solution, as we will see, is too limiting, because it has a strict dependence on the number of nodes in the graph. An additional solution, which we refer to as NMFPool, will be presented to overcome that constraint.
3.3 Non-Negative Matrix Factorization
In the following section we introduce the basics of Non-Negative Matrix Factorization (NMF) as the building block of our proposed solutions. The idea of employing NMF stems from [41], in which the latent community structure of graph data is made explicit via adjacency matrix decomposition using Symmetric Non-Negative Matrix Factorization (SNMF). Starting from that idea, we first introduce the general Non-Negative Matrix Factorization (NMF), a popular technique for extracting salient features in data through a latent space representation of the original information. Throughout our work we refer to the original formulation of NMF [23], though it has been extensively studied in numerical linear algebra in recent years by many authors and for a variety of applications.
Formally, the NMF problem can be stated as follows:

Definition 3.3.1. Given a non-negative matrix $A \in \mathbb{R}^{n \times m}_+$, find non-negative matrix factors $W \in \mathbb{R}^{n \times k}_+$ and $H \in \mathbb{R}^{k \times m}_+$, with $k < \min(m, n)$, such that
$$
A \approx WH. \tag{3.1}
$$
If we see the matrix $A$ as having $m$ multivariate objects column-stacked, the straightforward interpretation of (3.1) is as follows
$$
A_{:,j} \approx W H_{:,j}, \tag{3.2}
$$
with $A_{:,j}$ and $H_{:,j}$ the $j$-th columns of $A$ and $H$. The approximation (3.2) entails that each multivariate object is a linear combination of the columns of $W$ weighted by the coefficients in $H_{:,j}$. Thus $W$ is referred to
as the basis matrix, or equivalently the cluster-centroid matrix if we interpret NMF as a clustering method. The matrix $H$ can be seen, instead, as a low-dimensional representation of the input data, which makes NMF also useful for dimensionality reduction. In the clustering perspective, the latent representation may indicate whether a sample object belongs to a cluster. For example, we could constrain each data point to belong to a single cluster at a time, namely assign each data point to the closest cluster, $A_{:,j} \approx W_{:,j}$. We generally look for non-trivial encodings to explain community evolution in graphs; thus the problem can be relaxed into a soft-clustering problem, in which each data point can belong to $k > 0$, possibly overlapping, clusters [35]. Problem 3.1 can be shown to be similar to the problem of learning an encoding function that projects the input data onto a lower-dimensional manifold, $H$, and a decoding function, $U$, if necessary, that back-projects the encoded data onto its original space. In essence, we can solve the problem of learning the positive product factors decomposing the symmetric positive adjacency matrix $A$ within an auto-encoding framework with proper constraints.
3.4 Auto-Encoders
In what follows, we show how auto-encoders represent a straightforward neural approach to learning NMF solutions: as we will see, their general linear formulation can easily be interpreted as that of factorizing neural networks, and simply adding the proper constraints makes them capable of learning positive NMF solutions. Auto-encoder neural networks may be composed of several layers, each trained to learn a hidden lower-dimensional representation of the original input. The typical structure of a vanilla auto-encoder comprises a function $f$ taking as input high-dimensional data and projecting it onto a lower-dimensional manifold, such that $h = f(x)$. The encoding function $f$ may be accompanied by another function $g$, the decoding function, able to back-project the hidden representation $h$ into its original input space, i.e. $x = g(f(x))$. Modern realizations [18] of auto-encoders go a step further than deterministic functions by introducing stochastic encoder-decoder functions, representing posterior probabilities given the visible and hidden variables respectively. The idea of auto-encoding must be adapted when translated to graphs, because, differently from datasets without topological content, we also need to take the topology into account. For an introduction to the latest techniques on graph encoding consider reading [33]. In the following we refer to the general theory of auto-encoders as presented in [16], [35].
3.4.0.1 Complete Auto-Encoders
Suppose we have a set of data points $\{x_1, x_2, \cdots, x_n\}$ living in $d$ dimensions, i.e. $x_i \in \mathbb{R}^d$, and suppose the data set is mean-centered, namely the data distributions are aligned. Suppose now we have a set of linearly independent vectors $u_j \in \mathbb{R}^d$ stacked into the columns of a matrix $U \in \mathbb{R}^{d \times d}$,
$$
U = \begin{bmatrix} | & | & \cdots & | \\ u_1 & u_2 & \cdots & u_d \\ | & | & \cdots & | \end{bmatrix}, \tag{3.3}
$$
called the matrix of basis vectors, and such that $U^\top U = U U^\top = I_{d \times d}$, i.e. the basis vectors are orthonormal, formally
$$
u_i^\top u_j = 0, \quad i \neq j, \tag{3.4}
$$
$$
\|u_i\|_2^2 = 1. \tag{3.5}
$$
Linear independence and completeness of the basis $U$ ensure that we are able to span $\mathbb{R}^d$ entirely: given a data point $x_j$ and a proper weight vector $h_j$, we can express it as a linear combination of the form
$$
x_j = U h_j = \sum_{i=1}^{d} u_i h_{ij}. \tag{3.6}
$$
Given $X$ and $H$ as the matrices with data points and weight vectors column-stacked, respectively,
$$
X = \begin{bmatrix} | & | & \cdots & | \\ x_1 & x_2 & \cdots & x_n \\ | & | & \cdots & | \end{bmatrix}, \qquad H = \begin{bmatrix} | & | & \cdots & | \\ h_1 & h_2 & \cdots & h_n \\ | & | & \cdots & | \end{bmatrix}, \tag{3.7}
$$
we have
$$
X = UH. \tag{3.8}
$$
For example, in the trivial case of $U$ being the standard basis, i.e. $u_j = e_j$, we have
$$
X = UX. \tag{3.9}
$$
But we do not want trivial auto-encoders: we are looking for non-trivial basis vectors spanning $\mathbb{R}^d$. That can be accomplished by solving 3.8 as an unconstrained non-convex optimization problem as follows
$$
J(U, H) = \min_{U, H} \|UH - X\|_2^2. \tag{3.10}
$$
Optimization progresses by alternating the minimization of $J$ w.r.t. $U$ and $H$. Let us look at the gradient of $J$ w.r.t. the parameters $H$ when $U$ is kept fixed,
$$
\frac{\partial J}{\partial H} = U^\top U H - U^\top X = 0. \tag{3.11}
$$
Then 3.11 is a symmetric system of equations. Recalling that $U^\top U = I_{d \times d}$, the solution to 3.11 is given by
$$
H^* = U^\top X. \tag{3.12}
$$
Substituting 3.12 into 3.8 we get
$$
X = U U^\top X. \tag{3.13}
$$
Thus 3.10 can be reformulated as the following optimization problem
$$
J(U) = \min_U \|U U^\top X - X\|_2^2. \tag{3.14}
$$
Now we optimize only w.r.t. $U$, since the latent representation $H$ is contained implicitly; to get it we just apply equation 3.12. It can be proved that the global minima of 3.14 are always orthonormal matrices, so adding a constraint enforcing orthonormality of the columns of $U$ to 3.14 is redundant. In 3.14 we can thus identify an encoding function $f : \mathbb{R}^d \to \mathbb{R}^d$, parametrized by weights $U$ and such that $h = f(x) = U^\top x$, and a decoding function $g : \mathbb{R}^d \to \mathbb{R}^d$, its symmetric counterpart, such that $x = g(h) = U h$, parametrized with the same weights as the encoder. This simplistic model is referred to as the Linear Auto-Encoder with tied weights, since it has no non-linear activations and saves space with shared weights. Model 3.14 is capable, at best, of learning to copy the input $x$ to the output, namely $x = g(f(x))$ holds everywhere. The same would happen if the model were given more capacity, namely if it were able to project data into higher dimensions. In practice, for the reasons discussed in the introduction, we aim instead at projecting input data into lower-dimensional spaces, so as to compress the input information into its most representative features. Thus, we are only interested in learning under-complete and possibly non-linear bases.
3.4.0.2 Undercomplete Auto-Encoders
In the previous section we used a complete basis for representing the given data, i.e. we had $d$ vectors in the $\mathbb{R}^d$ space; but what if we employ only $k$ of those vectors? Then we are only able to span a $k$-dimensional sub-space, and as a consequence 3.8 becomes an approximate problem, namely
$$
X \approx UH. \tag{3.15}
$$
The approximation in 3.15 is called a projection, where the basis matrix $U$ is now of size $d \times k$, and $H$ is a $k \times n$ matrix whose columns are the latent representations of the samples in $X$; the latent representation is obtained simply by projecting the data points onto the lower-dimensional manifold spanned by the $k$ vectors embedded in the original $d$-dimensional space. Again the gradient vanishes for
$$
H^* = U^\top X, \tag{3.16}
$$
due to the (left-)orthonormality of the spanning basis, i.e. $U^\top U = I_{k \times k}$. Observe that exact (right-)orthogonality, $U U^\top = I_{d \times d}$, cannot be achieved, because $U$ is now a $d \times k$ matrix with $d > k$; we instead strive to find the closest approximation to the identity matrix $I_{d \times d}$. We refer to this particular case as the under-complete Linear Auto-Encoder (AE). The Linear AE framework can be explained by means of a deterministic function
$$
E_{\theta_0} : \mathbb{R}^d \to \mathbb{R}^k, \tag{3.17}
$$
called the encoder, parametrized by a set of trainable parameters $\Theta_0 \in \mathbb{R}^{k \times d}$ and such that
$$
H = E_{\theta_0}(X) = \Theta_0 X. \tag{3.18}
$$
The AE also comprises another deterministic function, called the decoder, defined as
$$
D_{\theta_1} : \mathbb{R}^k \to \mathbb{R}^d, \tag{3.19}
$$
parametrized by parameters $\Theta_1 \in \mathbb{R}^{d \times k}$ and such that
$$
X \approx D_{\theta_1} \circ E_{\theta_0}(X) = \Theta_1 \Theta_0 X. \tag{3.20}
$$
The sets of trainable parameters of the two functions are our orthonormal spanning basis, i.e. $\Theta_0 = U^\top$ and $\Theta_1 = U$.
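A minimal PyTorch sketch of the under-complete linear auto-encoder with tied weights ($\Theta_0 = U^\top$, $\Theta_1 = U$), trained by gradient descent on the reconstruction error; all names, sizes and the use of synthetic data are assumptions of this illustration.

```python
import torch

d, k, n = 10, 3, 200
X = torch.randn(n, d)                       # rows are data points x_i in R^d
X = X - X.mean(dim=0)                       # mean-centre the dataset

U = torch.randn(d, k, requires_grad=True)   # tied weights: encoder U^T, decoder U
opt = torch.optim.Adam([U], lr=1e-2)

for step in range(2000):
    H = X @ U                               # encoder:  h = U^T x   (row convention)
    X_rec = H @ U.T                         # decoder:  x ~ U h
    loss = ((X_rec - X) ** 2).mean()        # reconstruction error ||X U U^T - X||^2
    opt.zero_grad()
    loss.backward()
    opt.step()
```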
3.4.0.3 Graph Auto-Encoders
In light of the auto-encoding theory above, we briefly introduce the graph auto-encoder model presented in [21] as an extension of the auto-encoder model seen previously. In [21], graph auto-encoders are built based on the propagation rule 2.36 of graph convolutional networks (GCNs) to learn the encoder function $E$. The decoder function $D$, instead, is a simple inner product, though it may be realized using a variety of neural approaches. The inner-product decoder suffices for our purposes, as will become clearer in the next sections. Most importantly, this model is able to incorporate both topology and content information, which is what really matters over graph data. The encoding function is thus defined as follows
$$
Z = E_\Theta(A, X) = \xi\left(D^{-1/2} A D^{-1/2} X \Theta\right), \tag{3.21}
$$
where $A \in \mathbb{R}^{n \times n}$ is the adjacency matrix, $D$ is the diagonal matrix of node degrees, $X \in \mathbb{R}^{n \times d}$ is the matrix of node embeddings, $\Theta$ are the learnable parameters and $\xi$ is a squashing non-linear function. The decoding function, instead, is given as follows
$$
D(E_\Theta(A, X)) = Z Z^\top. \tag{3.22}
$$
Reconstruction capability is thus framed as the task of learning a latent representation $Z$ such that
$$
\|A - Z Z^\top\|_F \tag{3.23}
$$
is minimized; the KL-divergence could also be used. In [21] the model is further extended using a variational inference approach, under the assumption of the posterior being a Gaussian distribution with mean and scale parametrized by further GCNs. Over graphs, however, making such an assumption on the underlying distribution of the latent graph representation may lead the model to lose perspective of the hierarchical structure of the graph data. To extend that model, we refer to [27] for an adversarial approach to variational inference, letting the model learn more expressive posterior distributions. To that end, we briefly introduce here the main concepts of variational auto-encoders, to pave the way for the next sections, in which we extend our model using Adversarial Variational Bayes (AVB) auto-encoders [27].
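A NumPy sketch of the graph auto-encoder of [21] at inference time: a single GCN propagation as encoder (Eq. 3.21) and an inner-product decoder (Eq. 3.22). The random weights and toy graph are stand-ins; no training loop is shown.

```python
import numpy as np

def gae_encode(A, X, Theta):
    """Encoder of Eq. (3.21): Z = xi(D^{-1/2} A D^{-1/2} X Theta), xi = tanh here."""
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1).clip(min=1e-12))
    A_norm = A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.tanh(A_norm @ X @ Theta)

def gae_decode(Z):
    """Inner-product decoder of Eq. (3.22): reconstructed adjacency Z Z^T."""
    return Z @ Z.T

n, d, k = 6, 4, 2
A = np.triu(np.random.rand(n, n) < 0.4, 1).astype(float)
A = A + A.T                                   # random symmetric adjacency
X = np.random.randn(n, d)                     # node features / embeddings
Theta = np.random.randn(d, k)                 # stand-in for trainable weights

Z = gae_encode(A, X, Theta)
A_rec = gae_decode(Z)
recon_error = np.linalg.norm(A - A_rec)       # Frobenius loss of Eq. (3.23)
```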
3.4.0.4 Variational Auto-Encoders
We briefly introduce the necessary background on Variational Bayes Auto-encoders (VBAs) [19]. VBAs are specified by a parametrized model $p_\theta(x|z)$ of the visible variables given the latent variables, a prior $p(z)$ and an approximate inference model $q_\phi(z|x)$. It can be shown that
$$
\log p_\theta(x) \geq -D_{\mathrm{KL}}(q_\phi(z|x) \,\|\, p(z)) + \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right]. \tag{3.24}
$$
Maximum-likelihood training amounts to maximizing the marginal log-likelihood $\mathbb{E}_{p_{\mathcal{D}}(x)}[\log p_\theta(x)]$, where $p_{\mathcal{D}}(x)$ is the data distribution, but marginalizing out the latent variables makes the task intractable. Variational Bayes then translates the problem into an optimization one, maximizing with respect to $\phi$ and $\theta$ the right-hand side of inequality 3.24, also known as the Evidence Lower Bound (ELBO). The quality of the ELBO depends on the expressiveness of the inference model $q_\phi(z|x)$, usually taken to be Gaussian.
3.5 Introduction to solutions
Before entering into the details of the proposed pooling layers, we first give a necessary introductory discussion to show that both layers have the desirable properties given in Definitions 3.2.1 and 3.2.2, namely that we are able to build pooling layers realizing a function $C$ that, given a labeled graph as input, outputs a coarsened version of it. We now introduce the functioning of the $C$ backing both pooling mechanisms, and then show that it makes the pooling problem feasible to solve. As we have seen, the Non-Negative Matrix Factorization of the positive and symmetric adjacency matrix $A$ uncovers the communities entailed in a graph: the product factor $H$, i.e. the encoding matrix, is the soft assignment of nodes to communities. In light of that, we can think of such a matrix as a linear operator able to pool nodes into communities, realizing a subsampling of the input graph. This helps build coarser representations of the input graphs, so that at increasing scale $k > 0$ (namely the number of overlapping communities to pool nodes into) we are able to see how nodes, modeling concepts in a graph, eventually relate. The resulting abstract representation of the input graph may help GCNs grasp even long-range relations. This is possible because coarser views of the input graph enable GCNs to propagate contextual information faster, i.e. nodes involved in long-range relations are pulled together along with their learned contexts.
Therefore, we give an abstract representation of a pooling operator based on NMF as a set of functions C_j defined as follows
C_j(i) = \begin{cases} j & \text{if } H_{ji} > 0 \\ ⊥ & \text{otherwise} \end{cases}    (3.25)
for each node i ∈ V_G of the input graph and each j ∈ V_H of the pooled graph, with ⊥ indicating that node i is not pooled into community j, and H_{ji} ∈ R an element of the encoding matrix obtained by solving the NMF (3.1) of the adjacency matrix, or of the Gram matrix of the learned embeddings. In the following, we show that C describes the semantics of a pool operator as in Definition 3.2.1 and satisfies the property of Definition 3.2.2. First, we can observe that
(u, v) ∈ E_G ⟹ (C_c(u), C_c(v)) ∈ E_H^{int},    (3.26)
for some c ∈ V_H in the coarsened graph such that C_c(u) ≠ ⊥ and C_c(v) ≠ ⊥, where E_H^{int} is the set of intra-community edges. Namely, connected nodes fall into the same cluster, which corresponds to the main goal of community detection, where we look for communities such that nodes in the same community are strongly connected. In fact, (u, v) ∈ E_G means that the adjacency matrix of the input graph has element a_{uv} > 0, and, considering for simplicity the Symmetric NMF of the adjacency matrix, it holds that
a_{uv} ≈ H_{:,u}^⊤ H_{:,v} > 0,  ∀ u, v ∈ V_G.    (3.27)
It is also possible that two disconnected nodes u and v are pooled together, thus increasing the likelihood of a new link between u and v. Furthermore, the following also holds
(u, v) ∈ E_G ⟹ (C_{c_i}(u), C_{c_j}(v)) ∈ E_H,    (3.28)
for some c_i, c_j ∈ V_H, c_i ≠ c_j, with C_{c_i}(u) ≠ ⊥ and C_{c_j}(v) ≠ ⊥. Again, by (3.25), Equation (3.28) holds, namely
\hat{a}_{c_i c_j} = H_{c_i,:}^⊤ H_{c_j,:} > 0,  ∀ c_i, c_j ∈ V_H,    (3.29)
where \hat{a}_{c_i c_j} ∈ R is an element of the new adjacency matrix of the pooled graph, obtained as a simplified version of the algebraic operation yielding the adjacency matrix of the newly coarsened graph [41]. As a consequence, the embedding reduction operation [41], which, given the embedding z_i of a node i ∈ V_G, splits it across communities according to the soft-assignments of node i as follows
\hat{Z}_{:,d_j} = H Z_{:,d_j},  H ∈ R^{k×n},  Z ∈ R^{n×d},    (3.30)

for each feature dimension d_j, makes it even clearer that a pooling operator as defined in (3.25) lets us embed similar patterns into the same communities. Thus, building a pooling operator with the semantics given by (3.25), on top of Equations (3.27), (3.29) and (3.30), we easily satisfy Definitions 3.2.1 and 3.2.2.
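As a concrete illustration of the pooling semantics above, the following sketch coarsens a graph through a non-negative encoding matrix H ∈ R^{k×n}, using the standard products H A H^⊤ and H Z in the spirit of Equations (3.29) and (3.30); the helper name and the toy matrices are hypothetical.

```python
import torch

def coarsen_graph(adj, z, h):
    """Pool a graph through a non-negative encoding matrix.

    adj: (n, n) adjacency, z: (n, d) node embeddings,
    h:   (k, n) soft-assignment of the n nodes to k communities.
    Returns the pooled adjacency (k, k) and pooled embeddings (k, d),
    in the spirit of Equations (3.29)-(3.30).
    """
    adj_pooled = h @ adj @ h.t()   # communities inherit (weighted) connectivity
    z_pooled = h @ z               # community embeddings aggregate member nodes
    return adj_pooled, z_pooled

# Toy usage: a 4-node path graph pooled into 2 overlapping communities
adj = torch.tensor([[0., 1., 0., 0.],
                    [1., 0., 1., 0.],
                    [0., 1., 0., 1.],
                    [0., 0., 1., 0.]])
z = torch.rand(4, 3)
h = torch.tensor([[0.9, 0.7, 0.1, 0.0],
                  [0.0, 0.2, 0.8, 1.0]])
adj_c, z_c = coarsen_graph(adj, z, h)
```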
3.6 AVBPool
In this section, we present an end-to-end learnable pooling layer with the semantics of (3.25). The aim is to build a pooling layer on top of an auto-encoder solving the NMF problem, in particular Symmetric NMF (SNMF), owing to the symmetry of the adjacency and Gram matrices.
In the following we only present the case of adjacency matrix factorization; Gram matrix decomposition will be discussed in the next section. Thus, to implement a learnable pooling function that decomposes the adjacency matrix, i.e. finds a latent representation that is also a product factor of the SNMF, we refer to the encoder defined in (3.21), minimizing the reconstruction loss (3.23).
3.6.1 Layer overview
Because in the first part of our work we dealt only with node classification tasks, on graphs with non-varying topology, i.e. fixed n, we relaxed the definition in (3.21) to
H = E_Θ(A) = ξ(D^{-1/2} A D^{-1/2} Θ),    (3.31)
with ξ the ReLU activation function, which ensures positivity of the solutions, and Θ the learnable parameters. Note that we imposed X = I, i.e. we simply embed the adjacency matrix into a lower-dimensional space, with content information reduced to a one-hot encoding of the nodes. This is possible because n is fixed, and our intent was to fully employ trainable neurons to learn to encode the adjacency matrix (and eventually the Gram matrix), given that the original formulation of the NMF makes no explicit reference to node embeddings.
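A minimal sketch of the relaxed encoder (3.31), assuming PyTorch and a fixed number of nodes n; the module name and the random non-negative initialization are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SNMFEncoder(nn.Module):
    """Sketch of the relaxed encoder (3.31): H = ReLU(D^{-1/2} A D^{-1/2} Theta),
    with X = I, so the one-hot node identities are absorbed into Theta."""
    def __init__(self, n_nodes, k_communities):
        super().__init__()
        self.theta = nn.Parameter(torch.rand(n_nodes, k_communities))  # non-negative init

    def forward(self, adj):
        deg = adj.sum(dim=1)
        d_inv_sqrt = torch.diag(deg.clamp(min=1e-12).pow(-0.5))
        a_norm = d_inv_sqrt @ adj @ d_inv_sqrt
        # ReLU keeps the encoding H non-negative, as required by (S)NMF
        return torch.relu(a_norm @ self.theta)
```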
3.6.2 Training AVBPool
Formulation (3.1) requires defining a metric to measure the quality of the approximation; the Kullback-Leibler (KL) divergence and the Frobenius norm (F-norm) are common choices. Many techniques from numerical linear algebra can be used to minimize problem (3.1), whatever the cost function, although its inherently non-convex nature gives no guarantee of reaching a global minimum [14]. Multiplicative and additive update rules ensuring monotone decrease under the KL- or F-norm objective were first proposed in [23]. In this context we prefer the Frobenius norm. Thus, we solve the Symmetric NMF by minimizing the Frobenius loss
L_1 = ‖A − H H^⊤‖_F,  A ∈ R^{n×n},  H ∈ R^{n×k}    (3.32)
Observe that such a model may be limited on graphs with varying topology because of its dependence on n, and that graph content plays no role in extracting the hierarchical structure of the graphs. Note that (3.31) could also be given a linear formulation, i.e. without applying the ReLU function; to further impose positivity constraints we referred to the constrained positive auto-encoders of [1]. Following the suggestion in [1], we add penalties to (3.32) to constrain the model to learn positive Θ and H. We thus introduce the constraints J_1 and J_2 as follows
L_2 = L_1 + J_1 + \sum_{i}^{n} \sum_{j}^{k} J_2(Θ_{ij}),    (3.33)
In Equation (3.33), J_1 forces H to be sparse, as follows
J_1 = β \sum_{i=1}^{n} D_{KL}\left( p \,\middle\|\, \frac{1}{d} \sum_{j=1}^{k} H_{ji} \right).    (3.34)
with β a hyper-parameter and p ∈ R a constant. J_2, instead, controls the positivity of the parameters Θ as follows
J_2(Θ_{ij}) = \begin{cases} α_1 Γ(Θ_{ij}, k) + \frac{α_2}{2} ‖Θ_{ij}‖_2^2 & \text{if } Θ_{ij} < 0 \\ 0 & \text{otherwise} \end{cases}    (3.35)
where ‖·‖_2 is the L2-norm, and α_1 and α_2 are hyper-parameters. The function Γ(Θ_{ij}, k) is defined as
Γ(Θ_{ij}, k) = \begin{cases} ‖Θ_{ij}‖ & \text{if } ‖Θ_{ij}‖ > k \\ ‖Θ_{ij}‖^2/(2k) + k/2 & \text{otherwise} \end{cases}    (3.36)
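The penalties (3.34)-(3.36) could be implemented along the following lines. This is a sketch under the assumptions that H is stored with nodes on the rows and communities on the columns, that (3.34) is read as the usual sparse-autoencoder KL penalty between a target activation p and the average activation, and that the coefficient of the squared term in (3.35) is α_2/2; the hyper-parameter defaults are placeholders.

```python
import torch

def sparsity_penalty(h, p=0.05, beta=1.0, d=None):
    """Sketch of J_1 (3.34): KL divergence between a target activation p and the
    scaled activation of each row of H, read as a Bernoulli-style sparsity penalty.
    Assumes h has shape (n, k); d defaults to k."""
    d = d if d is not None else h.shape[1]
    rho_hat = (h.sum(dim=1) / d).clamp(1e-6, 1 - 1e-6)   # (1/d) * sum_j H_ji per node i
    kl = p * torch.log(p / rho_hat) + (1 - p) * torch.log((1 - p) / (1 - rho_hat))
    return beta * kl.sum()

def gamma(theta, k=1.0):
    """Gamma in (3.36): linear beyond k, quadratic (plus offset) inside."""
    a = theta.abs()
    return torch.where(a > k, a, a.pow(2) / (2 * k) + k / 2)

def positivity_penalty(theta, alpha1=1.0, alpha2=1.0, k=1.0):
    """Sketch of J_2 (3.35): penalize only the negative entries of Theta."""
    neg = theta < 0
    pen = alpha1 * gamma(theta, k) + 0.5 * alpha2 * theta.pow(2)
    return pen[neg].sum()
```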
3.6.3 Adversarial AVBPool
The simple deterministic auto-encoder seen so far could be extended using the generative model of variational auto-encoders, although in [21] the posterior is assumed to be Gaussian, with mean and variance parametrized by further GCNs. We push that stochastic model forward, avoiding explicit assumptions on the nature of the posterior by translating the problem into that of Adversarial Variational Bayes (AVB) [27] extended to graphs. We also include content information.
With Adversarial Variational Bayes (AVB) [27] auto-encoders, instead, the model q_φ(z|x) is taken to be a black box, and adversarial training is used to obtain θ*, φ* and q_{φ*}(z|x) ≈ p_{θ*}(z|x). The ELBO in (3.24) can be rewritten as
\max_{θ,φ} E_{p_D(x)} E_{q_φ(z|x)}\left( \log p(z) − \log q_φ(z|x) + \log p_θ(x|z) \right)    (3.37)
where the quantity
−T(x, z) = −\left( \log p(z) − \log q_φ(z|x) \right)    (3.38)
is proved to be the optimal value of a discriminative network trying to distinguish between true samples and generated samples, with objective defined as
\max_{T} E_{p_D(x)} E_{q_φ(z|x)} \log σ(T(x, z)) + E_{p_D(x)} E_{p(z)} \log(1 − σ(T(x, z)))    (3.39)
For further details on the theoretical analysis of AVB we refer to Mescheder et al. [27]. We are then able to extend the vanilla NMF model into an adversarially regularized deep auto-encoder. The optimization problem, with A ∈ R^{n×n} and H ∈ R^{n×k}, is now that of a variational auto-encoder with posterior distribution q_φ(H|A, X), describing the latent representation of the input samples given the observables. As seen before, we force q_φ to learn the actual posterior distribution of the encoding matrix by solving the optimization problems
L_2 = \max_{θ,φ} E_{p_D(A,X)} E_{q_φ(H|A,X)}\left( −T^*(A, X, H) + \log p_θ(A, X|H) \right)    (3.41)

L_3 = \max_{T} E_{p_D(A,X)} E_{q_φ(H|A,X)} \log σ(T(A, X, H)) + E_{p_D(A,X)} E_{p(H)} \log(1 − σ(T(A, X, H)))    (3.42)
where −T(A, X, H) = −( \log p(H) − \log q_φ(H|A, X) ), and p(H) could be sampled from distributions generalizing to hyperbolic geometries [26]. This model has not yet received further theoretical investigation to assess its mathematical soundness; hence, in the present work we lack an experimental investigation demonstrating its feasibility on real-world applications, especially graph classification tasks. In the next section we present our latest solution, followed by the experimental results.
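Purely as an illustrative sketch of how the alternating optimization of (3.41)-(3.42) could be organized (the module interfaces, the non-negative prior sample and the Frobenius reconstruction term used as a surrogate log-likelihood are assumptions, not the model trained in this work):

```python
import torch
import torch.nn.functional as F

def avb_step(encoder, decoder, discriminator, opt_model, opt_disc, adj, x):
    """One alternating update of the AVB objectives (3.41)-(3.42); hypothetical
    interfaces: encoder(adj, x) -> H, decoder(H) -> reconstructed adjacency,
    discriminator(adj, x, H) -> scalar logit T(A, X, H)."""
    # Discriminator step (3.42): tell posterior samples from prior samples
    h_post = encoder(adj, x).detach()
    h_prior = torch.randn_like(h_post).abs()        # assumed non-negative prior sample
    t_post = discriminator(adj, x, h_post)
    t_prior = discriminator(adj, x, h_prior)
    loss_disc = -(F.logsigmoid(t_post) + torch.log1p(-torch.sigmoid(t_prior))).mean()
    opt_disc.zero_grad(); loss_disc.backward(); opt_disc.step()

    # Encoder/decoder step (3.41): maximize -T* + log p_theta(A, X | H)
    h = encoder(adj, x)
    t = discriminator(adj, x, h)
    recon = decoder(h)
    log_lik = -torch.norm(adj - recon, p="fro")     # reconstruction as surrogate log-likelihood
    loss_model = t.mean() - log_lik                 # minimize T minus log-likelihood
    opt_model.zero_grad(); loss_model.backward(); opt_model.step()
    return loss_disc.item(), loss_model.item()
```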
3.7 NMFPool
In this section we introduce our latest model, NMFPool, a principled pooling operator enabling deep graph CNNs to develop multi-resolution representations of input graphs. NMFPool leverages the community structure underlying graphs to pool similar nodes and progressively gain coarser views of a graph; it is thus grounded in the basic semantics of the function C seen in the previous sections. Differently from the previous model, we employ a general non-symmetric NMF of the adjacency matrix, computed by an external NMF procedure. To make it consistent with the semantics of C, we show how the encoding matrix output by the NMF is used as a linear reduction operator to subsample the nodes of the graph. We further show a variant in which, at each layer, we also decompose the Gram matrix of the convolved embeddings. Finally, we show how a deep GCN interleaved with NMFPool layers is trained, along with regularization techniques to improve the quality of the learned solutions. NMFPool also serves as a preliminary study to assess the feasibility of NMF as a building block for pooling mechanisms.
3.7.1 Architecture overview
Consider ℓ NMFPool layers interleaved with at least ℓ + 1 stacked Graph Convolutions (GCs), as illustrated in Figure 3.1, where the graph convolutions are computed according to (2.36). Then, let Z^{(i)} ∈ R^{n_i×d} be the output of the i-th GC, namely the convolved node embeddings at layer i, defined as

Z^{(i)} = ReLU( A^{(i)} Z^{(i−1)} Θ^{(i)} )    (3.43)

with adjacency matrix A^{(i)} ∈ R^{n_i×n_i}, where n_i is the number of nodes at the previous layer, and Θ^{(i)} ∈ R^{d×d} the matrix of weights. Note that Z^{(0)} = X ∈ R^{n×d}, and that the initial adjacency matrix is set to A^{(0)} = D_L^{−1/2} A_L D_L^{−1/2}, i.e. the normalized adjacency matrix with A_L = A + I (the adjacency matrix with added self-loops) and D_L the diagonal matrix of node degrees [20].
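A minimal sketch of the convolution step (3.43) together with the initial normalization A^{(0)} = D_L^{−1/2}(A + I)D_L^{−1/2}, assuming PyTorch; the function names are illustrative.

```python
import torch

def normalize_adjacency(adj):
    """A^(0) = D_L^{-1/2} (A + I) D_L^{-1/2}, with self-loops added."""
    a_loop = adj + torch.eye(adj.shape[0])
    deg = a_loop.sum(dim=1)
    d_inv_sqrt = torch.diag(deg.pow(-0.5))
    return d_inv_sqrt @ a_loop @ d_inv_sqrt

def graph_convolution(adj_i, z_prev, theta_i):
    """One GC layer as in (3.43): Z^(i) = ReLU(A^(i) Z^(i-1) Theta^(i))."""
    return torch.relu(adj_i @ z_prev @ theta_i)
```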
The i-th NMFPool layer solves the problem in (3.1), i.e. the decomposition of the symmetric and positive A^{(i)}, by minimizing the following loss

‖A^{(i)} − W^{(i)} H^{(i)}‖_F    (3.44)
with W^{(i)} ∈ R_+^{n_i×k_i} and H^{(i)} ∈ R_+^{k_i×n_i}, where k_i is the number of overlapping communities to pool the n_i nodes into, and ‖·‖_F is the Frobenius norm. Observe that loss (3.44) is not added to the loss of the whole model, since it is minimized by an external NMF procedure. Also observe that the k_i are hyper-parameters controlling the graph coarsening scale. The algorithm used to minimize (3.44) strictly depends on the chosen NMF implementation. NMFPool then applies the encoding H^{(i)} to coarsen the topology and content associated with a graph as follows