Information Sciences, journal homepage: www.elsevier.com/locate/ins

CAGE: Constrained deep Attributed Graph Embedding

Debora Nozza, Elisabetta Fersini, Enza Messina

DISCo, University of Milano-Bicocca, Viale Sarca, 336 - 20126 Milan, Italy

Article info

Article history: Received 3 June 2019; Revised 15 November 2019; Accepted 30 December 2019; Available online 31 December 2019.
Keywords: Deep learning; Representation learning; Graph embedding; Attributed graph.

Abstract

In this paper we deal with complex attributed graphs which can exhibit rich connectivity patterns and whose nodes are often associated with attributes, such as text or images. In order to analyze these graphs, the primary challenge is to find an effective way to represent them by preserving both structural properties and node attribute information. To create low-dimensional and meaningful embedded representations of these complex graphs, we propose a fully unsupervised model based on Deep Learning architectures, called Constrained Attributed Graph Embedding model (CAGE). The main contribution of the proposed model is the definition of a novel two-phase optimization problem that explicitly models node attributes to obtain a higher representation expressiveness while preserving the local and the global structural properties of the graph. We validated our approach on two different benchmark datasets for node classification. Experimental results demonstrate that this novel representation provides significant improvements compared to state of the art approaches, also showing higher robustness with respect to the size of the training data.

© 2020 Elsevier Inc. All rights reserved.

1. Introduction

Real-world data are often characterized by an underlying relational structure, usually represented by graphs. Social and communication networks, citation networks, transport and utility networks are only some of the most common examples where we can observe complex relational interactions among a potentially large number of entities.

Efficient and scalable approaches for handling these large, complex and sparse graphs concern the learning of graph representations, or Graph Embeddings [2,12], aimed at creating low-dimensional and meaningful representations of nodes by observing different graph properties. This makes it possible to effectively apply off-the-shelf machine learning algorithms designed for handling vector representations to rich relational data, for solving a wide variety of data analytics problems [6,38].

At the state of the art, most of the graph Representation Learning approaches derive graph embeddings by preserving the relational structure [26,31,33]. However, they disregard the fact that in real-world domains the nodes in a graph are often associated with a rich set of features or attributes (e.g. text, image, audio), and therefore they would be modeled by the so-called attributed graphs [37].

Capturing also the attribute information could be of paramount importance, especially when nodes are not structurally related but are similar in terms of their attributes. Starting from the primary source of information given by the relational structure, the creation of graph embeddings can take advantage of attributes to enrich the knowledge about nodes, in particular when the graph is sparse and has noisy connections.

Corresponding author.

E-mail addresses: debora.nozza@unimib.it (D. Nozza), elisabetta.fersini@unimib.it (E. Fersini), enza.messina@unimib.it (E. Messina). https://doi.org/10.1016/j.ins.2019.12.082


In order to address this challenge, we propose a Constrained Attributed Graph Embedding (CAGE) model able to generate a graph representation that encodes both the node attributes and the relational structure.

In particular, the main contribution of the proposed model is the definition of a novel two-phase optimization problem that explicitly models node attributes to obtain a higher representation expressiveness while preserving the local and the global structural properties of the graph. Moreover, CAGE is fully unsupervised, implying that the obtained generalized latent representation can be used as input for several machine learning related tasks.

The rest of the paper is organized as follows. In Section 2, the state of the art related to Graph Representation Learning is presented. In Section 3, the proposed Constrained Attributed Graph Embedding model is described and the associated optimization problem is derived. In Section 4, the experimental settings used for validating the proposed approach are detailed, while in Section 5 the experimental results are discussed by comparing CAGE with state of the art models. Finally, in Section 6 concluding remarks and future research directions are highlighted.

2. Related works

Over the past decade, graph representation learning has attracted a surge of research attention, particularly focused on developing new embedding algorithms. While several works investigated the case of heterogeneous networks [6,9,23,28,34,36], in this paper we focus on studies using homogeneous networks as input data. These research studies can be roughly distinguished into Factorization Methods and Deep Learning Approaches [12], aimed at preserving both the local and global graph structural properties in the embedded vector space. The local structure can be represented by the links existing between pairs of nodes in a graph and captured by a first-order proximity measure. The global structure is usually approximated by using a second-order proximity measure, which can be assumed to be denoted by the shared neighbors of the nodes.

Factorization Methods are mainly aimed at finding a decomposition of an adjacency matrix denoting a graph for obtaining a dense and meaningful embedding representation. Although Laplacian Eigenmaps [2] and Locally Linear Embedding [27] are fundamental approaches in this area, they are not able to scale to large real-world graphs and, for this reason, Ahmed et al. [1] proposed a distributed framework for matrix factorization. However, all these methods do not preserve the global graph structure. To overcome this problem, factorization methods have been extended to deal with both first- and second-order proximities in [24,31].

More recently, Deep Learning approaches have been proposed in order to exploit random walk methods [7,9,13,14,26,30]. Among them, DeepWalk [26] and node2vec [14] aim to preserve the global graph structure by adapting recent advanced deep learning models originally defined for language modeling on fixed-length random walks. Recently, Chen et al. [7] proposed a general meta-strategy to improve these algorithms, avoiding the issue of local minima that can occur in a non-convex optimization problem.

Beyond random walk methods, most of the investigations are focused on improving deep learning architectures [5,6,33]. One of the most effective approaches is Structural Deep Network Embedding (SDNE) [33], which exploits the ability of a deep neural network to generate embeddings that can capture non-linearity in graphs and simultaneously preserve the first- and second-order graph proximities. The architecture is composed of two main components: one component consists of a deep auto-encoder aimed at deriving the embeddings of nodes by reconstructing their neighborhood, while the other one is based on Laplacian Eigenmaps [2], which applies a penalty when "similar" nodes are mapped far from each other in the embedding space.

Recently, attention has been directed to considering the set of attributes associated with each node in addition to the graph structure. However, these approaches do not explicitly consider the attribute contents but only a similarity matrix, leading to a less informative form of representation [13,15,16,19,20,35]. Some other research approaches [20,35] investigate the incorporation of textual features of nodes in the graph embedding space under the matrix factorization framework. Hong et al. [15] adopted a model based on random walks on top of a constructed enhanced matrix representation of the attributed graph to capture the interaction between structural and attribute information from various degrees of proximity. In [13], the authors followed the intuition that edge attributes can provide additional information about the network, proposing a model that jointly optimizes higher-order node neighborhood, social roles and edge attributes. A more general model was proposed by Huang et al. [16], where graph embeddings are computed based on the eigen-decomposition of the attribute similarity matrix, minimizing the difference between the representations of connected nodes. In [19,22], graph embedding approaches based on eigen-decomposition are presented for dynamic attributed graphs.

Although remarkable results have been obtained by these approaches by considering node attribute similarity, the node attribute content is not explicitly taken into account, resulting in a lower expressive ability with respect to traditional content representation models.


In order to derive a richer and more comprehensive graph representation, we propose a Constrained Attributed Graph Embedding (CAGE) model that is able to efficiently and effectively produce dense vectors encoding both the graph structure and the node attribute contents. In particular, this model addresses the problem of learning representations from attributed graphs in unsupervised settings and is based on promising deep learning architectures enclosed in a constrained optimization problem.

3. Constrained Attributed Graph Embedding

3.1. Basic definitions

The main goal of the proposed model is to map the nodes of a graph into a low-dimensional embedded space that preserves not only the local and global relational structure but also the attribute information. To this purpose, we first introduce the definition of attributed graph and subsequently present first-order, second-order and attribute proximity. Table 1 defines the terms and notations that will be used throughout the paper.

Definition 1. An attributed graph is defined as G = (V, E, Π, S), where V = {v_1, ..., v_n} is the set of nodes, E = {(v_i, v_j) | v_i, v_j ∈ V} denotes the set of edges between the nodes, Π = {π_{v_1}, ..., π_{v_n}} represents the node attributes related to each node v_i, and S = {s_ij}_{i,j=1}^n is the matrix denoting the weights s_ij associated to each edge e_ij.

Table 1
Meanings of the notations.

Symbol                             Definition
G                                  graph
V                                  set of nodes
E                                  set of edges
Π                                  node attributes matrix
S                                  adjacency matrix
v_i                                node i
e_ij                               edge between node i and node j
π_{v_i}                            attributes of node i
s_ij                               weight of the edge e_ij
N_i                                set of neighbours of node v_i
C_G                                graph embedding function
m                                  dimension of graph embedding
n                                  number of inputs
L                                  number of encoder (and decoder) neural network layers
X                                  set of input data
x_i                                i-th input
L_AE                               auto-encoder reconstruction loss
X̂                                  reconstructed input data
x̂_i                                reconstructed i-th input of the auto-encoder architecture
W                                  weight matrix of the encoder neural network
b                                  bias vector of the encoder neural network
Ŵ                                  weight matrix of the decoder neural network
b̂                                  bias vector of the decoder neural network
θ = (W, b, Ŵ, b̂)                   auto-encoder neural network parameters
h_i^(l)                            l-th hidden layer representation of node i
h_i (or h_i^(L))                   latent representation of node i
H = {h_i}_{i=1}^n                  latent representations
x_i^S (or s_i)                     i-th row of the adjacency matrix
x̂_i^S                              reconstructed i-th input of the auto-encoder of the structural optimization problem
ξ_i                                penalty constant
h_i^S                              latent representation of node i in the structural optimization problem
x_i^A (or π_{v_i})                 i-th row of the node attributes matrix
x̂_i^A                              reconstructed i-th input of the auto-encoder of the attribute optimization problem
h_i^A                              latent representation of node i in the attribute optimization problem
L^S                                loss function of the structural optimization problem
L^A                                loss function of the attribute optimization problem
θ^S = (W^S, b^S, Ŵ^S, b̂^S)         auto-encoder neural network parameters of the structural optimization problem
θ^A = (W^A, b^A, Ŵ^A, b̂^A)         auto-encoder neural network parameters of the attribute optimization problem
α                                  coefficient that leverages the contribution of the structural component
L_x^S                              auto-encoder reconstruction loss of the structural optimization problem
L_x^A                              auto-encoder reconstruction loss of the attribute optimization problem
L_h^S                              loss involving the representation layers of the structural optimization problem
L_h^A                              loss involving the representation layers of the attribute optimization problem
L_reg^S                            regularization loss of the structural optimization problem
L_reg^A                            regularization loss of the attribute optimization problem


Fig. 1. Graphical example of an attributed graph. Similar attribute values are depicted by using similar colors; narrow edges denote “weak” relationships, wide edges represent “strong” relationships.

Real-world relational data usually contain rich attribute information, which can be heterogeneous and highly diverse. For example, publication networks are characterized by the text related to each publication, where the connection between them corresponds to the presence of a citation; airline networks include numerical information regarding airports connected considering the flight distance between them; social networks can comprise texts, photos and videos posted by users whose relation can be weighted with respect to the number of common friends. In order to provide a general approach for obtaining graph embeddings of attributed graphs, we represent all attributes as a feature vector, and in particular as a real-valued vector. This means that numerical information does not need any conversion, images can be expressed as an assembly of pixels, texts can be represented using the common bag-of-words or TF-IDF representations, and so on. A graphical representation of an attributed graph is depicted in Fig. 1.
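As a concrete illustration of this encoding step, the sketch below turns the textual attribute of each node (e.g. a paper title) into a TF-IDF vector; the use of scikit-learn and the toy titles are our own choices, not prescribed by the paper.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# One text per node, e.g. the title of each paper in a citation network.
node_texts = [
    "Toward information extraction from biological papers",
    "Retrieving collocations from text",
    "Learning graph representations with deep autoencoders",
]

# Each row of Pi is the real-valued attribute vector pi_{v_i} of node v_i.
vectorizer = TfidfVectorizer()
Pi = vectorizer.fit_transform(node_texts).toarray()
print(Pi.shape)  # (number of nodes, vocabulary size)
```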

Now, the local structure of an attributed graph can be defined as the first-order proximity introduced in [31].

Definition 2. The first-order proximity in a graph is the local pairwise proximity between two nodes. For each pair of nodes linked by an edge (v_i, v_j), the weight s_ij indicates the first-order proximity between v_i and v_j. If no edge is observed between v_i and v_j, s_ij = 0.

Intuitively, preserving the first-order proximity implies that the representation of two connected nodes should be affected by their proximity, and therefore the greater the value of their first-order proximity, the more similar their representations should be. Considering this intuition, since nodes v_5 and v_6 in Fig. 2(a) have a stronger relationship (depicted with a wider line), they will be more similar in the embedded space than nodes v_4 and v_5, which have a weaker connection (depicted with a narrower line).

The following definition, introduced in [31], presents the second-order proximity able to characterize the global graph structure.

Definition 3. The second-order proximity between a pair of nodes (v_i, v_j) in a graph G represents the similarity between their neighbourhood graph structures. Formally, given a node v_i ∈ V, let N_i = {v_j | e_ij ≠ 0} be the set of neighbours of node v_i. Then, the second-order proximity between v_k and v_l is determined by the similarity between N_k and N_l. If N_k ∩ N_l = ∅, the second-order proximity between v_l and v_k is 0.

The assumption behind the second-order proximity is that node representations should be similar if the nodes have common neighbours. Fig. 2(b) depicts a typical case where the second-order proximity can help to handle edge sparsity. Although nodes v_4 and v_5 are not connected, they have high second-order proximity since they share many common neighbours, and therefore their embedded representations should be similar.

Although the structural information of the graph is of great importance, node attribute information should also be taken into account to capture similarities between nodes. Hence, we introduce the concept of attribute proximity, which can be defined as follows:

Definition 4. The attribute proximity between the nodes v_i and v_j is the similarity between their attribute representations π_{v_i} and π_{v_j}.


Fig. 2. Graphical examples of first-order, second-order and attribute proximity.

Fig. 2(c) depicts the case of two nodes with similar attribute values between which no structural relationship is present. Intuitively, these two nodes should have a similar embedded representation because of their high attribute proximity.

3.2. The proposed model

The aim of the proposed model is to learn, in unsupervised settings, the graph embeddings from attributed graphs, by leveraging both the structure of the graph and the attributes associated with each node. In particular, given a graph denoted as G = (V, E, Π, S), the proposed model aims at learning a function C_G: V → R^m, where m ≪ |V|, to preserve first-order, second-order and attribute proximity.

In order to achieve this goal, the proposed model first obtains the structural node embeddings, i.e. the embedded representation of the node structure, and secondly it adds constraints to "force" the attribute node embeddings, i.e. the embedded representation of the node attributes, to be coherent also with the structural node embeddings. According to this constrained model, the similarity of the embedded representations of two nodes that are structurally different in terms of first- and second-order proximity, but similar from the attribute point of view, will be fostered. Analogously, the similarity of the embedded representations of two nodes with similar graph structure but with distant attributes will be discouraged. This model will be accomplished by solving two different optimization problems during the training phase. A high-level overview of the proposed framework is reported in Fig. 3.

In this paper, two different Deep Learning models have been exploited for modeling the architecture of the proposed model, i.e. SDNE [33] and KATE [8]. These two models have been preferred among the others because (i) they are the most promising models for inferring node structure and node attribute embeddings respectively and (ii) they both rely on the same Deep Learning architecture, that is an auto-encoder.


Fig. 3. Graphical representation of the Constrained Attributed Graph Embedding model.

Fig. 4. Auto-encoder architecture.

3.2.1. Auto-encoder

Auto-encoders have attracted a lot of attention in recent years as a building block of Deep Learning [18]. This unsupervised model is particularly able to efficiently extract latent factors of variations of the input data.

An auto-encoder initially consists of a feature-extracting function in a specific parameterized closed form, called encoder, denoted as f. For each instance x_i from a set of data X = {x_1, ..., x_n}, the encoder function is defined as:

h_i = f(x_i)    (1)

where h_i is the hidden representation of the input x_i.

The other crucial component is the decoder function, which is a closed-form parameterized function g that maps the feature space back into the input space, producing the reconstruction X̂ = {x̂_1, ..., x̂_n} such that:

x̂_i = g(h_i) = g(f(x_i))    (2)

The general structure of auto-encoders is illustrated in Fig. 4. The auto-encoder training process consists in finding the parameter set θ that minimizes the reconstruction loss:

L_AE(θ) = (1/n) Σ_{i=1}^{n} ‖g(f(x_i)) − x_i‖_2^2 = (1/n) Σ_{i=1}^{n} ‖x̂_i − x_i‖_2^2    (3)


The encoder function can be expressed as a multi-layer neural network:

h_i^(1) = σ(W^(1) x_i + b^(1)),  h_i^(l) = σ(W^(l) h_i^(l−1) + b^(l)),  l = 2, ..., L    (4)

Consequently, it is possible to define the decoder function as a multi-layer neural network which takes as input the last layer of the encoder neural network:

h_i^(k) = σ(Ŵ^(k) h_i^(k−1) + b̂^(k)),  k = L+1, ..., 2L    (5)

where x_i denotes the i-th input data, h_i^(l) is the l-th hidden layer representation, σ(x) = 1/(1 + exp(−x)) is the sigmoid activation function, and L is the number of encoder layers (and analogously of decoder layers). The set θ = (W, b, Ŵ, b̂) corresponds to the auto-encoder neural network parameters, s.t. W is the encoder weight matrix, b is the encoder bias vector, and analogously Ŵ and b̂ are the decoder weight matrix and decoder bias vector. For the sake of simplicity, h_i^(L), which corresponds to the latent representation of the input x_i, is denoted as h_i.

3.2.2. Structuralembeddingmodel

The core model for obtaining the structural node embeddings is the Structural Deep Network Embedding model (SDNE) proposed in [33]. This model is an auto-encoder architecture able to incorporate the first- and second-order proximities into the node embedded representation.

Wang et al. [33] claim that the second-order proximity can be naturally incorporated into the reconstruction process of the auto-encoder. Given the set of weights S as the input to the auto-encoder, i.e. x_i^S = s_i, the reconstruction process will guarantee that nodes with similar neighbourhood structures will have similar latent representations. The loss function concerned with the second-order proximity is the following:

L_x^S = Σ_{i=1}^{|V|} ‖(x̂_i^S − x_i^S) ⊙ ξ_i‖_2^2    (6)

where x̂_i^S is the reconstructed input with respect to the global structure, ⊙ denotes the Hadamard product, i.e. entrywise multiplication, and ξ_i = {ξ_{i,j}}_{j=1}^{|V|}. If s_{i,j} = 0, ξ_{i,j} = 1, else ξ_{i,j} = α_ξ > 1.

Recalling the architecture of an auto-encoder, x_i^S is the input to the encoder function and x̂_i^S is its reconstruction given by the decoder function. The objective is to minimize the reconstruction error, i.e. the difference between x_i^S and x̂_i^S.

The first-order proximity can be fulfilled by learning a model that makes the similarity between the latent representations of pairs of nodes representative of their connection strength. The higher the weight s_{i,j} between a pair of nodes (v_i, v_j), the higher will be the similarity between their embedded representations h_i^S and h_j^S. The loss function related to the first-order proximity is then defined as follows:

L_h^S = Σ_{i,j=1}^{|V|} s_{i,j} ‖h_i^{S(L)} − h_j^{S(L)}‖_2^2 = Σ_{i,j=1}^{|V|} s_{i,j} ‖h_i^S − h_j^S‖_2^2    (7)

where

h_i^{S(1)} = σ(W^{S(1)} x_i^S + b^{S(1)}),  h_i^{S(l)} = σ(W^{S(l)} h_i^{S(l−1)} + b^{S(l)}),  l = 2, ..., L    (8)

The objective function reported in Eq. (7) borrows the idea of Laplacian Eigenmaps [2], which introduces a penalization when similar nodes are mapped far away in the embedding space.
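To make the two structural terms concrete, the following numpy sketch evaluates Eq. (6) and Eq. (7) for given embeddings and reconstructions; the penalty value alpha_xi and the dense-matrix formulation are assumptions made only for illustration.

```python
import numpy as np

def structural_losses(S, H_S, X_hat_S, alpha_xi=10.0):
    """Second-order loss of Eq. (6) and first-order loss of Eq. (7).

    S:        |V| x |V| weighted adjacency matrix (each row is x_i^S)
    H_S:      |V| x m structural embeddings h_i^S
    X_hat_S:  |V| x |V| reconstructions produced by the decoder
    alpha_xi: penalty (> 1) applied to non-zero entries; the value is illustrative
    """
    # Eq. (6): reconstruction error weighted entrywise by the penalty vector xi_i
    Xi = np.where(S == 0, 1.0, alpha_xi)
    L_x = np.sum(((X_hat_S - S) * Xi) ** 2)

    # Eq. (7): Laplacian-Eigenmaps-style penalty on connected pairs of nodes
    diff = H_S[:, None, :] - H_S[None, :, :]        # pairwise h_i^S - h_j^S
    L_h = np.sum(S * np.sum(diff ** 2, axis=-1))
    return L_x, L_h
```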

3.2.3. Attribute embedding model

Given the attribute input representation x_i^A = π_{v_i}, the objective of the model is the same as that of a general auto-encoder, that is, minimizing the reconstruction error:

L_x^A = Σ_{i=1}^{|V|} ‖x_i^A − x̂_i^A‖_2^2    (9)

where x̂_i^A is the reconstructed input with respect to the attribute information.


3.2.4. The optimization problem

As previously outlined, CAGE firstly solves the structural node embedding optimization problem:

min_{θ^S} L^S(θ^S) = L_x^S + L_h^S + L_reg^S = Σ_{i=1}^{|V|} ‖(x_i^S − x̂_i^S) ⊙ ξ_i‖_2^2 + Σ_{i,j=1}^{|V|} s_{i,j} ‖h_i^S − h_j^S‖_2^2 + L_reg^S    (10)

where θ^S = (W^S, b^S, Ŵ^S, b̂^S) are the neural network parameters related to the network structure.

Once the structural node embeddings have been computed, this representation is used to constrain the attribute embedding optimization problem as follows:

min_{θ^A} L^A(θ^A) = L_x^A + α L_h^A + L_reg^A = Σ_{i=1}^{|V|} ‖x_i^A − x̂_i^A‖_2^2 + α Σ_{i,j=1}^{|V|} (‖h_i^A − h_j^A‖_2^2 − ‖h_i^S − h_j^S‖_2^2) + L_reg^A    (11)

where α ∈ [0, 1] is the coefficient that leverages the contribution of the structural component and θ^A = (W^A, b^A, Ŵ^A, b̂^A) are the neural network parameters related to the attribute information.
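A compact sketch of the objective of Eq. (11), omitting the regularization term and assuming the embeddings and reconstructions are already available as arrays, could look as follows; the constraint term mirrors Eq. (11) as written above.

```python
import numpy as np

def attribute_objective(X_A, X_hat_A, H_A, H_S, alpha=0.5):
    """Loss of Eq. (11) without the regularization term (illustrative sketch)."""
    # Reconstruction of the node attributes, Eq. (9)
    L_x_A = np.sum((X_A - X_hat_A) ** 2)

    # Constraint: pairwise distances of the attribute embeddings are compared
    # with the distances already fixed by the structural embeddings h^S.
    d_A = np.sum((H_A[:, None, :] - H_A[None, :, :]) ** 2, axis=-1)
    d_S = np.sum((H_S[:, None, :] - H_S[None, :, :]) ** 2, axis=-1)
    L_h_A = np.sum(d_A - d_S)
    return L_x_A + alpha * L_h_A
```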

For the sake of simplicity, the auto-encoder reconstruction losses have been named L_x^S and L_x^A, while the losses involving the representation layers have been defined as L_h^S and L_h^A.

The optimization process related to the proposed model can be performed by using stochastic gradient descent, or other possible variations, relying on the computation of the partial derivatives of the parameters.

The proposed model firstly determines the structural node embedding by optimizing L^S(θ^S); then it constrains the attribute embedding by finding the optimal solution related to L^A(θ^A). In detail, the key steps concern the estimation of the partial derivatives of Eqs. (10) and (11), computed with respect to the parameter sets θ^S and θ^A respectively. The two loss functions differ only for the term ‖h_i^S − h_j^S‖_2^2, which is zero if derived with respect to θ^A.

The first required step is the computation of the partial derivative of L^S(θ^S) with respect to Ŵ^{S(l)} and W^{S(l)}, i.e. the decoder and encoder weight matrices respectively. More formally:

∂L^S/∂Ŵ^{S(l)} = ∂L_x^S/∂Ŵ^{S(l)} + ∂L_reg^S/∂Ŵ^{S(l)}    (12)

∂L^S/∂W^{S(l)} = ∂L_x^S/∂W^{S(l)} + ∂L_h^S/∂W^{S(l)} + ∂L_reg^S/∂W^{S(l)},  l = 1, ..., L.    (13)

In particular, Eq. (12) is composed of two addends that can be expanded as follows:

∂L_x^S/∂Ŵ^{S(l)} = ∂L_x^S/∂X̂^S · ∂X̂^S/∂Ŵ^{S(l)}    (14)

∂L_reg^S/∂Ŵ^{S(l)} = ∂[ (1/2) Σ_{l=1}^{L} (‖W^{S(l)}‖_F^2 + ‖Ŵ^{S(l)}‖_F^2) ] / ∂Ŵ^{S(l)} = Ŵ^{S(l)}    (15)

In Eq. (14), ∂L_x^S/∂Ŵ^{S(l)} is multiplied and divided by X̂^S. This allows computing the two terms, because ∂L_x^S/∂X̂^S is equal to 2(X̂^S − X^S) ⊙ ξ_i, and the second term ∂X̂^S/∂Ŵ^{S(l)} can be computed considering that X̂^S = σ(H^{S(2L−1)} · Ŵ^{S(2L)} + b̂^{S(2L)}). The derivative ∂L_x^S/∂Ŵ^{S(l)} then becomes accessible. Based on back-propagation, it is possible to iteratively obtain ∂L_x^S/∂Ŵ^{S(k)}, k = L+1, ..., 2L, and ∂L_x^S/∂W^{S(l)}, l = 1, ..., L.

Analogously, Eq. (13) can be studied by separately analysing the three terms which compose ∂L^S/∂W^{S(l)}. The computation of the terms ∂L_x^S/∂W^{S(l)} and ∂L_reg^S/∂W^{S(l)} can be carried out similarly as in Eqs. (14) and (15) respectively. The second term ∂L_h^S/∂W^{S(l)} can be rewritten as:

∂L_h^S/∂W^{S(l)} = ∂L_h^S/∂H^S · ∂H^S/∂W^{S(l)}    (16)

where H^S = σ(H^{S(L−1)} W^{S(L)} + b^{S(L)}), s.t. ∂H^S/∂W^{S(l)} = σ(H^{S(L−1)} W^{S(L)} + b^{S(L)}) · (1 − σ(H^{S(L−1)} W^{S(L)} + b^{S(L)})) · ∂H^{S(L−1)}/∂W^{S(l)}.

Based on back-propagation, it is then possible to iteratively obtain ∂H^{S(l)}/∂W^{S(l)} with l = 1, ..., L−1.

Regarding the first term ∂L_h^S/∂H^S, for the sake of simplicity, L_h^S can be rewritten as:


where:

• s_{i,j} is equal to 1 or 0 for every i and j;
• 𝓛 = D − S, with D defined as an n × n diagonal matrix with D_{i,i} = Σ_j s_{i,j}.

Finally, ∂L_h^S/∂H^S is equal to 2(𝓛 + 𝓛^⊤) · H^S.

Once all the partial derivatives are computed, it is possible to apply stochastic gradient descent.
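The derivations above correspond to what automatic differentiation computes internally. The following hedged sketch of the two-phase training loop (first Eq. (10), then Eq. (11) with the structural embeddings held fixed) uses PyTorch autograd in place of the hand-derived gradients and reuses the AutoEncoder sketched after Section 3.2.1; the optimizer, learning rate, penalty value and the omission of the regularization terms are our own illustrative choices.

```python
import torch

def train_cage(S, X_A, struct_ae, attr_ae, alpha=0.5, alpha_xi=10.0,
               epochs=200, lr=1e-3):
    """Two-phase optimization sketch: Eq. (10) first, then Eq. (11).

    S: |V| x |V| adjacency matrix; X_A: |V| x d node attribute matrix.
    struct_ae / attr_ae: AutoEncoder instances (see the earlier sketch).
    The regularization terms L_reg are omitted for brevity.
    """
    S = torch.as_tensor(S, dtype=torch.float32)
    X_A = torch.as_tensor(X_A, dtype=torch.float32)
    Xi = torch.where(S == 0, torch.tensor(1.0), torch.tensor(alpha_xi))

    # Phase 1: structural node embeddings, minimizing L^S(theta^S).
    opt = torch.optim.Adam(struct_ae.parameters(), lr=lr)
    for _ in range(epochs):
        X_hat_S, H_S = struct_ae(S)                     # each input row is x_i^S = s_i
        L_x = (((X_hat_S - S) * Xi) ** 2).sum()         # Eq. (6)
        L_h = (S * (torch.cdist(H_S, H_S) ** 2)).sum()  # Eq. (7)
        loss = L_x + L_h
        opt.zero_grad(); loss.backward(); opt.step()
    H_S = struct_ae(S)[1].detach()                      # fixed constraint for phase 2
    d_S = torch.cdist(H_S, H_S) ** 2

    # Phase 2: attribute embeddings constrained by H_S, minimizing L^A(theta^A).
    opt = torch.optim.Adam(attr_ae.parameters(), lr=lr)
    for _ in range(epochs):
        X_hat_A, H_A = attr_ae(X_A)
        L_x = ((X_hat_A - X_A) ** 2).sum()              # Eq. (9)
        d_A = torch.cdist(H_A, H_A) ** 2
        loss = L_x + alpha * (d_A - d_S).sum()          # constraint term of Eq. (11)
        opt.zero_grad(); loss.backward(); opt.step()
    return attr_ae(X_A)[1].detach()                     # final node embeddings
```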

4. Experimental settings

This section details the studied datasets and the state of the art models considered for comparison. Finally, a brief reminder of the well-known performance measures is given.

4.1. Dataset

For the experimental evaluation, two state of the art citation network datasets have been used. A citation network is a graph where the nodes are composed of a set of research articles, the node attribute is the title and the edges are the citation relationships between articles [10].

The DBLP dataset 1 is a collection of papers extracted from DBLP [32]. The dataset has been constructed using the same procedure reported in [25], where a list of venues from 4 research areas has been selected: database (SIGMOD, ICDE, VLDB, EDBT, PODS, ICDT, DASFAA, SSDBM, CIKM), data mining (KDD, ICDM, SDM, PKDD, PAKDD), artificial intelligence (IJCAI, AAAI, NIPS, ICML, ECML, ACML, IJCNN, UAI, ECAI, COLT, ACL, KR), computer vision (CVPR, ICCV, ECCV, ACCV, MM, ICPR, ICIP, ICME).

The second dataset, called CiteSeer-M10 [25], is a subset of CiteSeerX data 2 [21]. This citation network comprises 10 distinct research areas: agriculture, archaeology, biology, computer science, financial economics, industrial engineering, material science, petroleum chemistry, physics, and social science. Since the resultant graphs are very sparse, with a huge number of isolated nodes or isolated connected couples of nodes, the first connected component has been considered as evaluation data for each graph. In particular, the extracted DBLP graph consists of 16,191 nodes and 51,913 edges, while the CiteSeer-M10 graph comprises 2035 nodes and 3356 edges.

4.2. Compared models

In order to compare CAGE to state of the art approaches, the evaluation framework (mainly based on the GEM package [11]) considered several algorithms that leverage only the graph structure (S), only the node textual attributes (T), or both of them (S+T). The considered algorithms are reported in the following:

• LAP (S) [2] is a nonlinear dimensionality reduction algorithm that has locality preserving properties.
• SDNE (S) [33] is a deep auto-encoder that jointly preserves the first- and second-order proximities.
• node2vec (S) [14] preserves a higher-order proximity between nodes by maximizing the probability of occurrence of subsequent nodes in fixed-length random walks.
• DeepWalk (S) [26] jointly adopts random walk and skip-gram model to obtain graph embeddings.
• KATE (T) [8] is based on an auto-encoder architecture specialized in dealing with natural language text.
• Doc2Vec (T) [17] is the Paragraph Vectors algorithm, which embeds any piece of text in a distributed vector using a neural network language model.
• DW+D2V (S+T) [25] is a concatenation of the vectors learned by DeepWalk and Doc2Vec, for obtaining a combined representation of node structure and textual attributes.
• TriDNR∗ (S+T) is the unsupervised variant of TriDNR [25], which is a supervised neural network based approach that jointly learns graph structure, textual attributes and labels. For a fair comparison with CAGE, TriDNR has been evaluated without considering the label information.
• GAT2VEC (S+T) [30] is a framework that uses structural information to generate structural contexts, attributes to generate attribute contexts, and employs a shallow neural network model to learn a joint representation from them.

4.3. Evaluation framework

The experimental evaluation has been conducted on the task of node classification, aimed at associating with each research article (node) the corresponding research area (label). The textual attributes and the graph structure have been used to derive the feature representation with the proposed model. Once the embedding representation is obtained, a linear Support Vector Machine is firstly induced using the training data and then used to predict the labels of the test data. Analogously to what has been done in [25], a linear SVM 3 has been chosen in order to reduce the impact of complex learning approaches on the classification performance, like nonlinear models or sophisticated relational classifiers [29]. The datasets have been divided into training and testing sets by randomly selecting a percentage of labeled nodes. The parameters of the model have been estimated by using a sequential model-based optimization [3,4]. The adopted performance measures are Precision, Recall, F-Measure and Accuracy. Since these measures can be strongly influenced by the imbalanced distribution of the data classes, the micro-averaged form has been considered. Each experiment has been performed 10 times and the corresponding results have been reported by showing the average results and the corresponding standard deviations. For the sake of completeness, macro-averaged measures are reported in Appendix A.
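A minimal sketch of this evaluation protocol, assuming the embedding matrix H and the labels y are already available as arrays and taking the scikit-learn implementations as an illustrative choice, is shown below.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score

def evaluate_embeddings(H, y, labeled_fraction=0.5, runs=10, seed=0):
    """Node classification with a linear SVM and micro-averaged scores."""
    scores = []
    for r in range(runs):
        H_tr, H_te, y_tr, y_te = train_test_split(
            H, y, train_size=labeled_fraction, random_state=seed + r)
        clf = LinearSVC().fit(H_tr, y_tr)
        y_pred = clf.predict(H_te)
        scores.append((accuracy_score(y_te, y_pred),
                       f1_score(y_te, y_pred, average="micro")))
    scores = np.array(scores)
    return scores.mean(axis=0), scores.std(axis=0)  # mean and std over the runs
```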


Table 2
Comparison of CAGE vs S models using 50% of labeled data in terms of micro-averaged performance measures.

Dataset  Measure          CAGE            LAP             SDNE            node2vec        DeepWalk
DBLP     Accuracy         0.7479 ± 0.004  0.4792 ± 0.004  0.4684 ± 0.003  0.6933 ± 0.004  0.4952 ± 0.000
DBLP     Precision micro  0.7399 ± 0.004  0.3445 ± 0.145  0.3642 ± 0.037  0.7160 ± 0.003  0.4549 ± 0.000
DBLP     Recall micro     0.7479 ± 0.004  0.4792 ± 0.004  0.4684 ± 0.003  0.6933 ± 0.004  0.4952 ± 0.000
DBLP     F-measure micro  0.7393 ± 0.004  0.3106 ± 0.005  0.3524 ± 0.005  0.6814 ± 0.004  0.4238 ± 0.000
M10      Accuracy         0.8487 ± 0.006  0.5175 ± 0.040  0.3548 ± 0.011  0.7949 ± 0.012  0.6375 ± 0.003
M10      Precision micro  0.8496 ± 0.007  0.2089 ± 0.015  0.2530 ± 0.032  0.7886 ± 0.010  0.6251 ± 0.003
M10      Recall micro     0.8487 ± 0.006  0.3576 ± 0.040  0.3548 ± 0.011  0.7949 ± 0.012  0.6375 ± 0.003
M10      F-measure micro  0.8456 ± 0.006  0.2124 ± 0.039  0.2113 ± 0.012  0.7833 ± 0.012  0.6281 ± 0.003

Table 3
Comparison of CAGE vs T and S+T models using 50% of labeled data in terms of micro-averaged performance measures.

Dataset  Measure          CAGE            KATE            Doc2Vec         DW + D2V        TriDNR∗         GAT2VEC
DBLP     Accuracy         0.7479 ± 0.004  0.7450 ± 0.004  0.7095 ± 0.000  0.7106 ± 0.000  0.7019 ± 0.003  0.4833 ± 0.004
DBLP     Precision micro  0.7399 ± 0.004  0.7377 ± 0.003  0.6924 ± 0.000  0.6980 ± 0.000  0.6886 ± 0.003  0.3618 ± 0.002
DBLP     Recall micro     0.7479 ± 0.004  0.7450 ± 0.004  0.5001 ± 0.000  0.7106 ± 0.000  0.7019 ± 0.003  0.4833 ± 0.003
DBLP     F-measure micro  0.7393 ± 0.004  0.7325 ± 0.004  0.4005 ± 0.000  0.6992 ± 0.000  0.6885 ± 0.003  0.3646 ± 0.002
M10      Accuracy         0.8487 ± 0.006  0.7482 ± 0.002  0.7750 ± 0.000  0.8092 ± 0.002  0.8128 ± 0.004  0.3340 ± 0.004
M10      Precision micro  0.8496 ± 0.007  0.8382 ± 0.003  0.7785 ± 0.000  0.8019 ± 0.002  0.8041 ± 0.003  0.3010 ± 0.004
M10      Recall micro     0.8487 ± 0.006  0.8360 ± 0.002  0.7750 ± 0.000  0.8092 ± 0.002  0.8128 ± 0.004  0.3340 ± 0.002
M10      F-measure micro  0.8456 ± 0.006  0.8282 ± 0.010  0.7642 ± 0.000  0.8031 ± 0.002  0.8039 ± 0.004  0.3108 ± 0.003


5. Experimental results

The experimental results are presented by comparing CAGE against three different classes of Representation Learning models: those that exploit either the graph structure (S) or the textual node attributes (T) and those that combine both representations (S+T).

Table 2 shows the results obtained by comparing CAGE with models which exploit only the graph structure (S), by considering 50% of the training set. We can see that, on both datasets, CAGE achieves the best performance, confirming the thesis that considering both the textual attributes and the interactions of the nodes provides significant improvements compared to the results obtained by modeling only the relational information.

Similarly, the proposed model is able to achieve better results compared to models based on node attributes (T), i.e. KATE and Doc2Vec (see Table 3). When compared to KATE, CAGE obtains satisfactory improvements, additionally demonstrating that the textual feature representation can be enhanced by considering the relational information. The results of Doc2Vec, DeepWalk and DW+D2V give further evidence of this statement, as the concatenated representation space of DeepWalk and Doc2Vec is able to achieve significant improvements with respect to the single methods: DeepWalk for the graph structure and Doc2Vec for the text. Among the models that jointly consider structural and attribute information (S+T), GAT2VEC obtains the lowest results. This is probably due to the limited number of nodes belonging to the two components of the bipartite graph exploited by the model. The unsupervised version of TriDNR∗, which includes textual and relational information obtained by combining DeepWalk and Paragraph Vectors, achieves good performance that is, however, overreached by CAGE (with respect to all the performance measures). The outperforming results of CAGE are confirmed by the performance shown in Tables 4 and 5, where the F-measure micro obtained by considering 10%, 30%, 50% and 70% of the training set is reported. For the sake of completeness, results obtained by varying the percentage of the training set from


Table 4
Comparison of CAGE vs S models in terms of F-measure micro using different % of labeled data.

Dataset  %    CAGE            LAP             SDNE            node2vec        DeepWalk
DBLP     10%  0.7118 ± 0.005  0.3111 ± 0.001  0.3697 ± 0.004  0.6641 ± 0.003  0.4373 ± 0.000
DBLP     30%  0.7324 ± 0.001  0.3130 ± 0.002  0.3616 ± 0.005  0.6765 ± 0.003  0.4235 ± 0.000
DBLP     50%  0.7402 ± 0.005  0.3106 ± 0.005  0.3524 ± 0.005  0.6814 ± 0.004  0.4238 ± 0.000
DBLP     70%  0.7439 ± 0.005  0.3124 ± 0.007  0.3491 ± 0.007  0.6822 ± 0.004  0.4268 ± 0.000
M10      10%  0.7611 ± 0.012  0.1789 ± 0.062  0.2238 ± 0.051  0.6882 ± 0.013  0.5163 ± 0.002
M10      30%  0.8152 ± 0.013  0.1885 ± 0.135  0.2369 ± 0.014  0.7529 ± 0.007  0.6032 ± 0.007
M10      50%  0.8260 ± 0.008  0.2124 ± 0.039  0.2113 ± 0.012  0.7817 ± 0.009  0.6281 ± 0.003
M10      70%  0.8282 ± 0.016  0.2058 ± 0.018  0.2146 ± 0.018  0.7878 ± 0.011  0.6470 ± 0.007

Table 5
Comparison of CAGE vs T and S+T models in terms of F-measure micro using different % of labeled data.

Dataset  %    CAGE            KATE            Doc2Vec         DW + D2V        TriDNR∗         GAT2VEC
DBLP     10%  0.7118 ± 0.005  0.6927 ± 0.007  0.3668 ± 0.000  0.6541 ± 0.000  0.6620 ± 0.003  0.3756 ± 0.001
DBLP     30%  0.7324 ± 0.001  0.7275 ± 0.001  0.3943 ± 0.000  0.6971 ± 0.000  0.6868 ± 0.003  0.3665 ± 0.002
DBLP     50%  0.7402 ± 0.005  0.7325 ± 0.004  0.4005 ± 0.000  0.6992 ± 0.000  0.6885 ± 0.003  0.3646 ± 0.002
DBLP     70%  0.7439 ± 0.005  0.7368 ± 0.003  0.4097 ± 0.000  0.7072 ± 0.000  0.6916 ± 0.004  0.3556 ± 0.001
M10      10%  0.7611 ± 0.012  0.7183 ± 0.009  0.6917 ± 0.000  0.7104 ± 0.003  0.7324 ± 0.003  0.2795 ± 0.001
M10      30%  0.8152 ± 0.013  0.7667 ± 0.006  0.7369 ± 0.000  0.7799 ± 0.002  0.7784 ± 0.004  0.3072 ± 0.002
M10      50%  0.8260 ± 0.008  0.7825 ± 0.013  0.7642 ± 0.000  0.8031 ± 0.002  0.8039 ± 0.004  0.3108 ± 0.002
M10      70%  0.8282 ± 0.016  0.8007 ± 0.014  0.7661 ± 0.000  0.8187 ± 0.005  0.8114 ± 0.005  0.2742 ± 0.002

Table 6

Textual and label information of a selection of nodes of the network in Fig. 5.

Node  Textual Information                                                                           Groundtruth Class    Predicted Class
201   Toward Information Extraction: Identifying protein names from biological papers              Biology              Biology
44    Retrieving Collocations from Text: Xtract                                                     Financial Economics  Financial Economics
1986  Translating collocations for bilingual lexicons: A statistical approach                      Financial Economics  Financial Economics
1393  Knowledge Management, Data Mining, and Text Mining in Medical Informatics                    Machine Learning     Financial Economics
1284  Chapter 9 Creating Metabolic Network Models using Text Mining and Expert Knowledge           Biology              Financial Economics
653   Creating Knowledge Repositories From Biomedical Reports: The MEDSYNDIKATE Text Mining System Biology              Biology
513   High-throughput Functional Annotation of Novel Gene Products Using Document Clustering       Biology              Biology
179   Using Rules to Analyse Bio-medical Data: A Comparison between C4.5 and PCL                   Biology              Biology

10% to 90% with a step of 10% have been reported in Appendix A. Another important aspect to be noticed is CAGE's robustness with respect to the size of the training data, which is due to the joint contribution of textual and structural information conveyed by the attributed graph. Moreover, by comparing the results obtained by CAGE on the two benchmark datasets, it is possible to notice that the performance on the M10 dataset is significantly higher than that on the DBLP dataset. This is due to the contribution of the informative content of textual information, which appears to be on average longer (and consequently more informative) in M10 rather than in DBLP, thus further confirming the importance of the contribution of the attribute information.


Fig. 5. Three-step ego network of document 201 of the M10 dataset. The nodes are labeled by the original identifiers of the dataset. The node fill color represents the actual class of a document, while the node color border denotes the predicted class (black border color identifies nodes in the training set).

From the structural point of view, considering the first- and second-order proximities, we would therefore expect all test nodes to be classified as Biology.

Interestingly, our model is able to correctly classify nodes 44 and 1986 as Financial Economics (orange) even if their neighbourhoods (i.e. the structural information) clearly suggest that the predicted class should be green. These two papers are related to the statistical analysis of financial data with NLP techniques. Their correct classification is due to the contribution given by their textual information. Indeed, by looking at the most similar nodes in the embedding space, their representation is similar to that of the paper titled "Forecasting Intraday Stock Price Trends with Text Mining Techniques" (classified as Financial Economics and belonging to the training nodes).

6. Conclusion and future work

In this paper we introduced CAGE, an unsupervised model for attributed graph embeddings. The proposed model aims to obtain graph embeddings that explicitly represent node attributes while preserving the local and global graph structure. The model has been evaluated on a node classification task where nodes are characterized by textual attributes. The experimental evaluation on two benchmark datasets has shown significant improvements in terms of different performance measures with respect to different state of the art models which consider only the graph structure, only the node contents, or a combination of them. Moreover, the proposed model has demonstrated good robustness with respect to the size of the training data, showing good performance also in the case of limited ground-truth data. The experimental results validated the hypothesis that modelling the attribute information together with the relational information is beneficial, especially when nodes are not structurally related but have similar attributes, or when the relational structure is too sparse to provide information.

Future work will be devoted to the definition of a joint model able to learn simultaneously both attribute and graph embeddings. This model will lead to a unique optimization problem, where the contribution of attributes and graph structure will be determined according to Lagrangian methods. Even if this model is expected to obtain meaningful representations, it will be characterized by a large number of parameters to be optimized. Finally, further future work will be focused on dealing with attributed graphs with probabilistic relationships. Creating meaningful embeddings of attributed graphs where noisy and uncertain relations are considered represents a major challenge for tackling real-world problems.

Declaration of Competing Interest


Appendix A

In the following, Tables A1-A6 show additional results on the two benchmark datasets.

Table A1
Comparison of CAGE vs S models using 50% of labeled data in terms of macro-averaged performance measures.

Dataset  Measure          CAGE            LAP             SDNE            node2vec        DeepWalk
DBLP     Accuracy         0.7479 ± 0.004  0.4792 ± 0.004  0.4684 ± 0.003  0.6933 ± 0.004  0.4952 ± 0.000
DBLP     Precision macro  0.7180 ± 0.003  0.2198 ± 0.123  0.2830 ± 0.091  0.6792 ± 0.008  0.4180 ± 0.001
DBLP     Recall macro     0.6640 ± 0.005  0.2501 ± 0.000  0.2560 ± 0.002  0.5655 ± 0.005  0.3046 ± 0.000
DBLP     F-measure macro  0.6831 ± 0.005  0.1621 ± 0.001  0.2010 ± 0.002  0.5824 ± 0.005  0.2841 ± 0.000
M10      Accuracy         0.8487 ± 0.006  0.5175 ± 0.040  0.3548 ± 0.011  0.7949 ± 0.012  0.6375 ± 0.003
M10      Precision macro  0.7928 ± 0.076  0.4452 ± 0.086  0.1162 ± 0.027  0.6203 ± 0.069  0.4523 ± 0.010
M10      Recall macro     0.6817 ± 0.048  0.1426 ± 0.001  0.1450 ± 0.001  0.5043 ± 0.031  0.4036 ± 0.008
M10      F-measure macro  0.7168 ± 0.055  0.0858 ± 0.010  0.0875 ± 0.005  0.5252 ± 0.034  0.4204 ± 0.009

Table A2
Comparison of CAGE vs T and S+T models using 50% of labeled data in terms of macro-averaged performance measures.

Dataset  Measure          CAGE            KATE            Doc2Vec         DW + D2V        TriDNR∗         GAT2VEC
DBLP     Accuracy         0.7479 ± 0.004  0.7450 ± 0.004  0.7095 ± 0.000  0.7106 ± 0.000  0.7019 ± 0.003  0.4833 ± 0.004
DBLP     Precision macro  0.7180 ± 0.003  0.7278 ± 0.004  0.6543 ± 0.000  0.6654 ± 0.000  0.6558 ± 0.003  0.2324 ± 0.003
DBLP     Recall macro     0.6640 ± 0.005  0.6433 ± 0.005  0.6080 ± 0.000  0.6133 ± 0.000  0.5952 ± 0.002  0.2648 ± 0.004
DBLP     F-measure macro  0.6831 ± 0.005  0.6702 ± 0.004  0.6198 ± 0.000  0.6309 ± 0.000  0.6144 ± 0.002  0.2099 ± 0.001
M10      Accuracy         0.8487 ± 0.006  0.7482 ± 0.002  0.7750 ± 0.000  0.8092 ± 0.002  0.8128 ± 0.004  0.3340 ± 0.004
M10      Precision macro  0.7928 ± 0.076  0.7244 ± 0.022  0.6314 ± 0.000  0.6435 ± 0.011  0.6515 ± 0.013  0.1592 ± 0.003
M10      Recall macro     0.6817 ± 0.048  0.6612 ± 0.003  0.4708 ± 0.000  0.5682 ± 0.003  0.5452 ± 0.003  0.1598 ± 0.005
M10      F-measure macro  0.7168 ± 0.055  0.5871 ± 0.033  0.4775 ± 0.000  0.5912 ± 0.005  0.5711 ± 0.003  0.1548 ± 0.004

Table A3

Comparison of CAGE vs S models in terms of F-measure micro on DBLP using different % of labeled data from 10% to 90%.

%    CAGE            LAP             SDNE            node2vec        DeepWalk
10%  0.7118 ± 0.005  0.3111 ± 0.001  0.3697 ± 0.004  0.6641 ± 0.003  0.4373 ± 0.000
20%  0.7310 ± 0.002  0.3109 ± 0.001  0.3633 ± 0.005  0.6690 ± 0.005  0.4201 ± 0.000
30%  0.7324 ± 0.001  0.3130 ± 0.002  0.3616 ± 0.005  0.6765 ± 0.003  0.4235 ± 0.000
40%  0.7372 ± 0.004  0.3095 ± 0.004  0.3564 ± 0.005  0.6794 ± 0.002  0.4208 ± 0.000
50%  0.7402 ± 0.005  0.3106 ± 0.005  0.3524 ± 0.005  0.6814 ± 0.004  0.4238 ± 0.000
60%  0.7381 ± 0.006  0.3143 ± 0.005  0.3481 ± 0.005  0.6805 ± 0.003  0.4234 ± 0.000
70%  0.7439 ± 0.005  0.3124 ± 0.007  0.3491 ± 0.007  0.6822 ± 0.004  0.4268 ± 0.000
80%  0.7355 ± 0.002  0.3137 ± 0.006  0.3419 ± 0.008  0.6806 ± 0.010  0.4294 ± 0.000
90%  0.7421 ± 0.010  0.3246 ± 0.016  0.3512 ± 0.014  0.6944 ± 0.012  0.4365 ± 0.000

Table A4

Comparison of CAGE vs S models in terms of F-measure micro on CiteSeer-M10 using different % of labeled data from 10% to 90%.

%    CAGE  LAP  SDNE  node2vec  DeepWalk


Table A5
Comparison of CAGE vs T and S+T models in terms of F-measure micro on DBLP using different % of labeled data from 10% to 90%.

%    CAGE            KATE            Doc2Vec         DW + D2V        TriDNR∗         GAT2VEC
10%  0.7118 ± 0.005  0.6927 ± 0.007  0.3668 ± 0.000  0.6541 ± 0.000  0.6620 ± 0.003  0.3756 ± 0.001
20%  0.7310 ± 0.002  0.7122 ± 0.003  0.3898 ± 0.000  0.6799 ± 0.000  0.6743 ± 0.003  0.3689 ± 0.001
30%  0.7324 ± 0.001  0.7275 ± 0.001  0.3943 ± 0.000  0.6971 ± 0.000  0.6868 ± 0.003  0.3665 ± 0.002
40%  0.7372 ± 0.004  0.7319 ± 0.002  0.3968 ± 0.000  0.6975 ± 0.000  0.6872 ± 0.003  0.3552 ± 0.003
50%  0.7402 ± 0.005  0.7325 ± 0.004  0.4005 ± 0.000  0.6992 ± 0.000  0.6885 ± 0.003  0.3646 ± 0.002
60%  0.7381 ± 0.006  0.7374 ± 0.007  0.4091 ± 0.000  0.7040 ± 0.000  0.6903 ± 0.002  0.3632 ± 0.002
70%  0.7439 ± 0.005  0.7368 ± 0.003  0.4097 ± 0.000  0.7072 ± 0.000  0.6916 ± 0.004  0.3556 ± 0.001
80%  0.7355 ± 0.002  0.7362 ± 0.003  0.4187 ± 0.000  0.7088 ± 0.000  0.6889 ± 0.003  0.3753 ± 0.003
90%  0.7421 ± 0.010  0.7378 ± 0.005  0.4116 ± 0.000  0.7261 ± 0.000  0.7059 ± 0.004  0.3728 ± 0.001

Table A6

Comparison of CAGE vs T and S+T models in terms of F-measure micro on CiteSeer-M10 using different % of labeled data from 10% to 90%.

%    CAGE            KATE            Doc2Vec         DW + D2V        TriDNR∗         GAT2VEC
10%  0.7611 ± 0.012  0.7183 ± 0.009  0.6917 ± 0.000  0.7104 ± 0.003  0.7324 ± 0.003  0.2795 ± 0.001
20%  0.7964 ± 0.009  0.7482 ± 0.005  0.7423 ± 0.000  0.7759 ± 0.001  0.7739 ± 0.006  0.2843 ± 0.001
30%  0.8152 ± 0.013  0.7667 ± 0.006  0.7369 ± 0.000  0.7799 ± 0.002  0.7784 ± 0.004  0.3072 ± 0.002
40%  0.8177 ± 0.014  0.7806 ± 0.010  0.7541 ± 0.000  0.7876 ± 0.003  0.7889 ± 0.005  0.2996 ± 0.001
50%  0.8260 ± 0.008  0.7825 ± 0.013  0.7642 ± 0.000  0.8031 ± 0.002  0.8039 ± 0.004  0.3108 ± 0.002
60%  0.8306 ± 0.009  0.8069 ± 0.009  0.7698 ± 0.000  0.8144 ± 0.000  0.8062 ± 0.006  0.2937 ± 0.003
70%  0.8282 ± 0.016  0.8007 ± 0.014  0.7661 ± 0.000  0.8187 ± 0.005  0.8114 ± 0.005  0.2742 ± 0.002
80%  0.8388 ± 0.020  0.8055 ± 0.016  0.7744 ± 0.000  0.8178 ± 0.002  0.8211 ± 0.005  0.2732 ± 0.001
90%  0.8577 ± 0.044  0.8102 ± 0.023  0.7751 ± 0.000  0.8183 ± 0.004  0.8033 ± 0.004  0.2728 ± 0.003

Supplementary material

Supplementary material associated with this article can be found, in the online version, at doi: 10.1016/j.ins.2019.12.082.

CRediT authorship contribution statement

Debora Nozza: Conceptualization, Methodology, Data curation, Investigation, Writing - original draft. Elisabetta Fersini: Conceptualization, Writing - review & editing, Supervision. Enza Messina: Conceptualization, Writing - review & editing, Supervision.

References

[1] A. Ahmed , N. Shervashidze , S. Narayanamurthy , V. Josifovski , A.J. Smola ,Distributed large-scale natural graph factorization, in: Proceedings of the 22nd International Conference on World Wide Web, 2013, pp. 37–48 .

[2] M. Belkin , P. Niyogi , Laplacian eigenmaps for dimensionality reduction and data representation, Neural Comput. 15 (6) (2003) 1373–1396 .

[3] J. Bergstra , R. Bardenet , Y. Bengio , B. Kégl , Algorithms for hyper-parameter optimization, in: Advances in Neural Information Processing Systems 24: 25th Annual Conference on Neural Information Processing Systems 2011, 2011, pp. 2546–2554 .

[4] J. Bergstra , D. Yamins , D.D. Cox , Making a science of model search: hyperparameter optimization in hundreds of dimensions for vision architectures, in: Proceedings of the 30th International Conference on Machine Learning, 28, 2013, pp. 115–123 .

[5] S. Cao , W. Lu , Q. Xu , Deep neural networks for learning graph representations, in: Proceedings of the Thirtieth AAAI Conference on Artificial Intelli- gence, 2016, pp. 1145–1152 .

[6] S. Chang , W. Han , J. Tang , G.-J. Qi , C.C. Aggarwal , T.S. Huang , Heterogeneous network embedding via deep architectures, in: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2015, pp. 119–128 .

[7] H. Chen , B. Perozzi , Y. Hu , S. Skiena , HARP: hierarchical representation learning for networks, in: Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, 2018, pp. 2127–2134 .

[8] Y. Chen , M.J. Zaki , KATE: K-competitive autoencoder for text, in: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discov- ery and Data Mining, 2017, pp. 85–94 .

[9] Y. Dong, N.V. Chawla, A. Swami, Metapath2Vec: scalable representation learning for heterogeneous networks, in: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, in: KDD ’17, ACM, 2017, pp. 135–144, doi: 10.1145/3097983.3098036 .

[10] S.R. Goldberg , H. Anthony , T.S. Evans , Modelling citation networks, Scientometrics 105 (3) (2015) 1577–1604 .

[11] P. Goyal, E. Ferrara, GEM: a Python package for graph embedding methods, J. Open Source Softw. 3 (29) (2018) 876, doi: 10.21105/joss.00876.

[12] P. Goyal, E. Ferrara, Graph embedding techniques, applications, and performance: a survey, Knowl.-Based Syst. 151 (2018) 78–94, doi: 10.1016/j.knosys.2018.03.022.

[13] P. Goyal, H. Hosseinmardi, E. Ferrara, A. Galstyan, Capturing edge attributes via network embedding, IEEE Trans. Comput. Social Syst. 5 (4) (2018) 907–917, doi: 10.1109/TCSS.2018.2877083 .

[14] A. Grover , J. Leskovec , Node2Vec: scalable feature learning for networks, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowl- edge Discovery and Data Mining, 2016, pp. 855–864 .

[15] R. Hong , Y. He , L. Wu , Y. Ge , X. Wu , Deep attributed network embedding by preserving structure and attribute information, IEEE Trans. Syst. Man Cybern (2019) .

[16] X. Huang , J. Li , X. Hu , Accelerated attributed network embedding, in: Proceedings of the 2017 SIAM International Conference on Data Mining, 2017, pp. 633–641 .


[18] Y. LeCun, Y. Bengio, G.E. Hinton, Deep learning, Nature 521 (7553) (2015) 436–444, doi: 10.1038/nature14539.

[19] J. Li , H. Dani , X. Hu , J. Tang , Y. Chang , H. Liu , Attributed network embedding for learning in a dynamic environment, in: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, ACM, 2017, pp. 387–396 .

[20] Y. Li , C. Sha , X. Huang , Y. Zhang , Community detection in attributed graphs: an embedding approach, in: Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, AAAI Press, 2018, pp. 338–345 .

[21] K.W. Lim, W.L. Buntine, Bibliographic analysis with the citation network topic model, in: Asian Conference on Machine Learning, 2015, pp. 142–158.

[22] J. Ma, P. Cui, W. Zhu, DepthLGP: learning embeddings of out-of-sample nodes in dynamic networks, in: Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, 2018, pp. 370–377.

[23] D. Nozza , D. Maccagnola , V. Guigue , E. Messina , P. Gallinari , A latent representation model for sentiment analysis in heterogeneous social networks, in: Software Engineering and Formal Methods - SEFM 2014 Collocated Workshops: HOFM, SAFOME, OpenCert, MoKMaSD, WS-FMDS, 8938, 2014, pp. 201–213 .

[24] M. Ou , P. Cui , J. Pei , Z. Zhang , W. Zhu , Asymmetric transitivity preserving graph embedding, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 1105–1114 .

[25] S. Pan , J. Wu , X. Zhu , C. Zhang , Y. Wang , Tri-party deep network representation, in: Proceedings of the 25th International Joint Conference on Artificial Intelligence, 2016, pp. 1895–1901 .

[26] B. Perozzi , R. Al-Rfou , S. Skiena , DeepWalk: online learning of social representations, in: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2014, pp. 701–710 .

[27] S.T. Roweis , L.K. Saul , Nonlinear dimensionality reduction by locally linear embedding, Science 290 (5500) (2000) 2323–2326 .

[28] L.D. Santos, B. Piwowarski, L. Denoyer, P. Gallinari, Representation learning for classification in heterogeneous graphs with application to social net- works, ACM Trans. Knowl. Discovery Data 12 (5) (2018) 62:1–62:33, doi: 10.1145/3201603 .

[29] P. Sen , G. Namata , M. Bilgic , L. Getoor , B. Gallagher , T. Eliassi-Rad , Collective classification in network data, AI Mag. 29 (3) (2008) 93–106 .

[30] N. Sheikh, Z.T. Kefato, A. Montresor, GAT2VEC: representation learning for attributed graphs, Computing 101 (3) (2019) 187–209, doi: 10.1007/s00607-018-0622-9.

[31] J. Tang , M. Qu , M. Wang , M. Zhang , J. Yan , Q. Mei , LINE: large-scale information network embedding, in: Proceedings of the 24th International Confer- ence on World Wide Web, 2015, pp. 1067–1077 .

[32] J. Tang , J. Zhang , L. Yao , J. Li , L. Zhang , Z. Su , ArnetMiner: extraction and mining of an academic social network, in: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2008, pp. 990–998 .

[33] D. Wang , P. Cui , W. Zhu , Structural deep network embedding, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 1225–1234 .

[34] L. Xu , X. Wei , J. Cao , P.S. Yu , Embedding of embedding (EOE): joint embedding for coupled heterogeneous networks, in: Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, ACM, 2017, pp. 741–749 .

[35] C. Yang , Z. Liu , D. Zhao , M. Sun , E.Y. Chang , Network representation learning with rich text information, in: Proceedings of the Twenty-Fourth Inter- national Joint Conference on Artificial Intelligence, 2015, pp. 2111–2117 .

[36] H. Zhang, X. Shang, H. Luan, M. Wang, T. Chua, Learning from collective intelligence: feature learning using social images and tags, ACM Trans. Multimedia Comput.Commun. Appl. 13 (1) (2016) 1:1–1:23, doi: 10.1145/2978656 .

[37] Y. Zhou , H. Cheng , J.X. Yu , Clustering large attributed graphs: an efficient incremental approach, in: Proceedings of the 10th IEEE International Confer- ence on Data Mining, IEEE, 2010, pp. 689–698 .

Chapter 5: A model for ATFM with separation constraints In this core chapter the object of the thesis is presented: a ILP 0 ´ 1 model for the air traffic flow management problem