

Dipartimento di Matematica
Master's Degree Course in Mathematics

Master's Thesis

Missing link prediction in bipartite networks: mathematical analysis and application to e-commerce data

Candidate: Sara Gennari
Supervisor: Prof. Marco Romito
Co-examiner: Prof. Stefano Galatolo


Contents

Introduction

1 Preliminary notions on graph theory
  1.1 Descriptive analysis of network graph characteristics
    1.1.1 Vertex degree
    1.1.2 Connectivity
    1.1.3 Local Density
  1.2 Graph models

2 Theoretical analysis
  2.1 Learning and cross-validation techniques
  2.2 Link prediction
  2.3 Informal scoring similarity-based methods
    2.3.1 Local methods
    2.3.2 Global methods
    2.3.3 Quasi-local methods
  2.4 Theoretical justification for common neighbours measures

3 Application to an e-commerce network
  3.1 Introduction to the dataset and the graphs
  3.2 Tests on the main projection graphs
    3.2.1 Cross validation and evaluation metrics
    3.2.2 Tests on C2 and P1
    3.2.3 Tests on the reduced dataset
  3.3 Single-customer prediction
    3.3.1 Tests on C2
  3.4 Artificial dataset
    3.4.1 Discrete power law as a model for degree distribution
    3.4.2 Bipartite graph model
    3.4.3 Comparison with the real graphs

Conclusions and future works

A Basic notions on Markov chains
  A.1 Invariant distribution


Introduction

This thesis is the result of an internship at a company operating in the area of data science, big data and statistics. During the internship we dealt with the analysis of a real-world dataset generated from e-commerce stores, with the purpose of identifying similarities between the purchasing behaviours of customers. This led us to examine, through the statistical software R, networks representing relations between products and customers, and the resulting problem of missing link prediction. The work was supported by the study of the mathematical formulation of the methods and strategies employed.

Missing link prediction is a widely studied problem of network analysis that concerns the prediction of new links in a partially observed graph.

Depending on the specific task to be solved, many approaches to the link prediction problem can be found in the literature. In particular, we focus on similarity-based algorithms founded on measures of proximity between nodes in the graph. These methods assign a score to each unobserved pair of nodes, inducing a ranking on them; the highest ranked edges are then chosen for prediction.

This thesis explores the link prediction problem from two sides: the mathematical analysis of some classical informal scoring similarity-based methods, and their application to a real-world e-commerce network.

Specifically, in chapter 1 we recap some notions on graph theory that will be useful in the rest of the work.

In chapter 2 we present proximity measures grouped by the type of topological information they require: local measures use node characteristics like degrees and number of common neighbours, while global ones exploit the entire structure of the graph, like paths and random walks between nodes; quasi-local methods lie in the middle.

In chapter 3 we analyze how the missing link prediction algorithms previously presented perform on a network generated from e-commerce data, where customers and products are nodes and an edge between a customer-product pair represents a purchase. Such networks have a natural bipartite structure, which led us to study the problem on its projections.

In this context the link prediction problem is strictly related to recommendation tasks. We reproduce different types of suggestion in a double manner: a global one, predicting the most likely links in the network, and a personal one, predicting specific links for each customer.

Finally, in order to generalize the obtained results from a statistical perspective, we reproduce the original bipartite network via a model graph based on degree distributions.


Chapter 1

Preliminary notions on graph theory

This first chapter is intended to introduce a reader not familiar with graph theory to the terminology, structures and concepts that will be important in the rest of the thesis. For this reason, the notions we provide could appear fragmentary or incomplete from a global point of view, but they are organized to be thorough and self-consistent in the perspective of chapters 2 and 3.

The chapter is structured as follows: after some basic definitions, in the first section we present some commonly studied structures and characteristics of a graph, while in the second we introduce the notion of a graph model.

Formally, a graph is an ordered pair $G = (V, E)$ where $V$ is a nonempty finite set, called the set of vertices (or nodes), and $E$ is a set of unordered pairs of distinct vertices, called the set of edges (or links). We denote by $N = |V|$ the number of vertices and by $M = |E|$ the number of edges of a graph $G$. A graph $H = (V_H, E_H)$ is a subgraph of a graph $G = (V, E)$ if $V_H \subset V$ and $E_H \subset E$.

To distinguish a graph as defined above from a directed graph, where the edges are ordered pairs of vertices, we also refer to a graph as an undirected graph. In this thesis we will handle only undirected graphs; therefore, with abuse of notation, we denote by $(u, v)$ instead of $\{u, v\}$ an edge between a pair of nodes $u, v \in V$.

An edge $e \in E$ is incident on a vertex $v \in V$ if the latter is one of its endpoints, i.e. $v \in e$. Moreover, two vertices $u, v \in V$ are adjacent if $(u, v) \in E$, and two edges $e_1, e_2 \in E$ are adjacent if they are incident on a common vertex. For all $v \in V$ we denote by $N(v)$ the set of neighbours of $v$, i.e. the vertices adjacent to $v$ in $G$.

A graph $G = (V, E)$ is often represented through its adjacency matrix, that is, an $N \times N$ matrix $A$ with entries

$$a_{ij} = \begin{cases} 1 & \text{if } (i,j) \in E \\ 0 & \text{otherwise,} \end{cases}$$

where we suppose the vertices labeled from 1 to $N$. Note that the adjacency matrix of a simple undirected graph is symmetric and has all zeros on the diagonal. Another strictly related matrix that will be useful for its spectral properties is the Laplacian matrix $L$, but we will come back to it in chapter 2.

If a graph is equipped with a real-valued function on its edges it is a weighted graph. A weighted graph is often denoted by $G = (V, E, W)$, where $w_e$ is the weight of an edge $e \in E$.

A generalization of a graph that admits the presence of edges connecting a vertex with itself (called loops or self-loops) and multiple edges between the same pair of vertices (multi-edges) is called a multigraph. To highlight that a graph is not a multigraph we say that it is simple.

Finally let us define two classical families of graphs.

A complete graph is a graph $G = (V, E)$ in which every vertex is joined to every other vertex by an edge, i.e. $E = \{(u, v) \mid u, v \in V,\ u \neq v\}$. In particular, in a complete graph $M = \binom{N}{2}$.

Definition 1.1 A bipartite graph is a graph $G = (V, E)$ such that the vertex set $V$ may be partitioned into two disjoint sets, say $V_1$ and $V_2$, and each edge in $E$ has one endpoint in $V_1$ and the other in $V_2$.

We will often refer to $V_1$ and $V_2$ as the top-vertex set $V_T$ and the bottom-vertex set $V_B$.

A bipartite graph $G = (V_B \cup V_T, E)$ is frequently studied through its projections (Figure 1.1) on the two vertex sets, $G_T = (V_T, E_T)$ and $G_B = (V_B, E_B)$, where $(i, j) \in E_T$ if and only if there is a $k \in V_B$ such that $(i, k) \in E$ and $(k, j) \in E$.

1.1 Descriptive analysis of network graph characteristics

When studying complex networks, there are some classical characteristics that can be explored to gain important information about the topological structure of the graph. For the entire section, let $G = (V, E)$ be an undirected simple graph.



Figure 1.1: Projection on the bottom side of a toy bipartite graph.

1.1.1 Vertex degree

The degree $d_v$ of a vertex $v \in V$ is defined as the number of edges of $E$ incident on $v$, i.e.

$$d_v = |\{w \in V : (v, w) \in E\}| \quad \forall v \in V.$$

The set of degrees of all the vertices $\{d_1, \ldots, d_N\}$ in a graph is called the degree sequence.

The degree of a vertex provides a basic quantification of the extent to which it is connected to other vertices within the graph.

Define $f_d$ to be the fraction of vertices $v \in V$ such that $d_v = d$. The collection $\{f_d\}_{d > 0}$ is called the degree distribution of $G$.

Besides the descriptive analysis of the behaviour of the vertices, the degree distribution of a graph is widely used to create theoretical models of networks. In particular, $f_k$ can be interpreted as the probability that a randomly chosen vertex in the network has degree $k$. We denote that probability by $p_k$.

In the last decades, network models based on expected degree sequences have been extensively studied in the literature. From the work of Barabási et al. [1] and [2] it has been found that the degree sequences of a wide variety of real-world networks follow so-called heavy-tailed distributions¹. Roughly speaking, most nodes have a relatively small degree, while a few nodes (called hubs) are extremely connected. One of the commonest laws followed by the degree distributions of real networks is the discrete power law, that is

$$p_k \propto \frac{1}{k^{\alpha}}, \quad \alpha > 0,$$

where generally $\alpha$ is between 2 and 3. More detail on this distribution will be given in section 3.4.1.
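To make the shape of this law concrete, the following sketch draws degrees from a truncated discrete power law in R (the language used throughout this thesis); the values of alpha and of the truncation point are illustrative choices, not values taken from the data.

```r
# Sampling from a truncated discrete power law p_k ~ k^(-alpha).
# alpha and the truncation point k_max are illustrative.
set.seed(1)
alpha <- 2.5
k     <- 1:1000
p_k   <- k^(-alpha) / sum(k^(-alpha))   # normalized probabilities

degrees <- sample(k, size = 10000, replace = TRUE, prob = p_k)
mean(degrees)   # most nodes have small degree...
max(degrees)    # ...but a few hubs appear in almost every sample
```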

1.1.2 Connectivity

A sequence of edges $\{e_1, \ldots, e_n\}$ is a walk if $e_k$ and $e_{k+1}$ are adjacent for every $k < n$. A path is a walk with no repeated edges. We say that a vertex is reachable from another if there is a path between them. An interesting quantity that represents the level of connectivity between two nodes $u, v \in V$ is the length of the shortest path connecting them. It is also called the geodesic distance and denoted by $\mathrm{dist}(u, v)$. The diameter of a graph summarizes this information:

$$\mathrm{diam}(G) = \max_{u \in V} \{\mathrm{dist}(u, v) \mid v \text{ is reachable from } u\}.$$

A subgraph $H = (V_H, E_H)$ is said to be connected if for all $u, v \in V_H$ there exists a path from $u$ to $v$; otherwise it is disconnected. A connected subgraph $H$ that is maximal for inclusion, i.e. for which the subgraph induced by adding any external vertex $v \in V \setminus V_H$ is disconnected, is called a connected component (cc). In chapter 3 we will denote by $N_{cc}$ the number of vertices of each connected component of the studied graph.

In real-world networks there is often a component that dominates the others, the so-called giant component. As reported by Watts and Strogatz in [3], geodesic distances, and consequently the diameter, of the giant component are relatively small: typically they increase only as the logarithm of the number of vertices. In the literature we refer to this phenomenon as the small-world effect.

¹The distribution of a random variable $X$ with distribution function $F$ is said to have a heavy tail if the moment generating function of $X$, $M_X(t)$, is infinite for all $t > 0$. That means:

$$\int_{-\infty}^{\infty} e^{tx}\, dF(x) = \infty \quad \text{for all } t > 0.$$


1.1.3 Local Density

The study of the local density of a network can be approached from two sides: we can specify locally dense structures and explore whether they are present in the graph, or we can define a measure of local density and then characterize the extent to which subsets of vertices are dense according to this measure. Following this classification, we will first introduce two dense structures, cliques and plexes, then we will propose some density measures like density and transitivity.

A clique is a complete subgraph $H$ of the graph $G$. In particular, a subgraph $H = (V_H, E_H)$ is a $k$-clique if and only if $|V_H| = k$ and for all $v, w \in V_H$ the edge $(v, w) \in E_H$. In this sense, cliques are subsets of vertices that are fully cohesive, because all vertices within the subset are connected by edges. In practice, large cliques are relatively rare, so it can be useful to give a less restrictive definition of this notion: the plex. A subgraph $H = (V_H, E_H)$ of $m$ vertices is a $k$-plex if for all $v \in V_H$, $|\{w \in V_H : (v, w) \in E_H\}| \geq m - k$. Unfortunately, the detection of both cliques and plexes of a graph can be computationally expensive, so it is often unfeasible to explore them on large networks.

Given a graph $G$, the density of a subgraph $H = (V_H, E_H)$ is defined as the ratio of the number of edges to the total number of possible edges on that vertex subset:

$$\mathrm{den}(H) = \frac{|E_H|}{\binom{|V_H|}{2}} = \frac{|E_H|}{|V_H|(|V_H| - 1)/2}.$$

The value of $\mathrm{den}(H)$ lies between zero and one and provides a measure of how close $H$ is to being a clique. Note that $\mathrm{den}(H)$ is just a rescaling of the average degree $\bar{d}(H)$ of $H$, since $\mathrm{den}(H) = \bar{d}(H)/(|V_H| - 1)$. Obviously, choosing $H = G$ we obtain the density of the overall graph $G$:

$$\mathrm{den}(G) = \frac{|E|}{|V|(|V| - 1)/2}.$$

On the other side, as proposed by Watts and Strogatz in [3], $H$ can be taken to be the subgraph $H_v$ induced by the neighbours of a vertex $v$, for all $v \in V$, and the average over $\mathrm{den}(H_v)$ can be used as a global clustering coefficient.

Another way to express the Watts-Strogatz coefficient is in terms of the density of triangles among connected triples, where a triangle is defined as a 3-clique and a connected triple as a subgraph of three vertices connected by two edges. In particular, let $\tau_\Delta(v)$ denote the number of triangles in $G$ that have $v$ as a vertex, $\tau_3(v)$ the number of connected triples in $G$ for which the two edges are both incident to $v$, and define

$$cl(v) = \frac{\tau_\Delta(v)}{\tau_3(v)}. \quad (1.1)$$


It is easy to show that $\mathrm{den}(H_v) = cl(v)$ for those $v \in V$ such that $\tau_3(v) > 0$. The average of $cl(v)$ over all the vertices $v$ with $d_v \geq 2$, namely $V_0$, is called the Watts-Strogatz global clustering coefficient:

$$cl(G) = \frac{1}{|V_0|} \sum_{v \in V_0} cl(v) = \frac{1}{|V_0|} \sum_{v \in V_0} \frac{\tau_\Delta(v)}{\tau_3(v)}.$$

Another clustering coefficient that is relevant in the literature is the transitivity of a graph. It was proposed by Bollobás and Riordan in [4] and measures how many connected triples of a graph also form a triangle with their three vertices:

$$cl_T(G) = \frac{\sum_{v \in V} \tau_\Delta(v)}{\sum_{v \in V} \tau_3(v)}.$$

For the same reason, the Watts-Strogatz coefficient 1.1 is also known as the local transitivity of a vertex.
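As a hedged illustration of the characteristics introduced in this section, the following R sketch computes them with the igraph package on a toy graph; the graph itself is illustrative and not the thesis data.

```r
library(igraph)

g <- sample_gnp(n = 200, p = 0.05)   # toy graph for illustration only

degree(g)                            # vertex degrees d_v
table(degree(g)) / vcount(g)         # degree distribution {f_d}
diameter(g)                          # diam(G)
components(g)$csize                  # sizes of the connected components
edge_density(g)                      # den(G) = M / (N(N-1)/2)
transitivity(g, type = "global")     # transitivity cl_T(G)
transitivity(g, type = "local")      # local coefficients cl(v), eq. (1.1)
```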

1.2 Graph models

By a model for a network graph we mean a collection

$$\{P_\theta(G),\ G \in \mathcal{G} : \theta \in \Theta\}, \quad (1.2)$$

where $\mathcal{G}$ is a collection of possible graphs, $P_\theta$ is a probability distribution on $\mathcal{G}$, and $\theta$ is a vector of parameters ranging over possible values in $\Theta$.

The existing models in the literature largely differ in the way the probability $P$ is specified: some approaches, for example, simply let $P$ be uniform on $\mathcal{G}$ but restrict $\mathcal{G}$ to contain only those graphs $G$ satisfying certain properties of interest; other approaches induce $P$ implicitly through the recurrent application of generative algorithms.

As we explained in section 1.1, it has been shown recently that most real-world complex networks have some specific properties in common. This is why a strong effort has been put in the realistic modeling of complex networks in the last few years, and much progress has been accomplished in this field. Some models achieve the aim of producing graphs which capture some, but not all of the main properties of real-world complex networks. Some models obtain all the wanted properties but rely on artificial methods which give unrealistic graphs (trees, graphs with uniform degrees, etc.). Others rely on construction processes which may induce some hidden properties, or are too difficult to analyze.

In this section we will present some classical models of social networks, focusing on which and how many behaviours of real-world networks they can reproduce.


The Erdős-Rényi model

The pioneering models for complex networks were proposed by Erdős and Rényi [5] and independently by Gilbert [6] in 1959.

In the ER model a graph is chosen uniformly at random from the collection $\mathcal{G}_{N,M}$ of all graphs which have $N$ nodes and $M$ edges. In the Gilbert formulation, a collection $\mathcal{G}_{N,p}$ is defined to consist of all graphs $G$ of order $N$ that may be obtained by assigning an edge independently to each pair of distinct vertices with probability $p \in (0, 1)$. Under the assumption that the expected number of edges $\binom{N}{2}p$ tends to infinity as $N$ increases, which is commonly true in real-world networks, these two classes of models are essentially equivalent for a large number of nodes (see chapter 2 of [7] for more details).

In practice, due to the ease of statistical analysis allowed by the independence of the edges, the second model is preferred to the first. Nevertheless, because of the key role of Erdős and Rényi in the foundation of a rigorous graph theory based on probabilistic results, in the literature we usually refer to $G \in \mathcal{G}_{N,p}$ as an ER graph (or random graph).

As explained in [8] and [7], in the ER model $\mathcal{G}_{N,p}$ the vertex degree distribution (defined in 1.1.1) follows the binomial law:

$$p_k = \binom{N-1}{k} p^k (1-p)^{N-1-k}.$$

By the law of rare events², as proved by Bollobás in Theorem 3.1 of [7], as $N$ grows and for constant values of $Np$, the degree distribution converges to a Poisson distribution with parameter $\lambda = Np$:

$$p_k \sim \frac{\lambda^k e^{-\lambda}}{k!}.$$

Despite its historical importance in the theory of graph models, the ER graph can rarely reproduce real-world network behaviours, first of all because real degree distributions generally do not follow the Poisson law (as we anticipated in section 1.1.1). Hence the need for developing more sophisticated models.

The configuration model

The configuration model is a generalization of the $\mathcal{G}_{N,M}$ model that generates random graphs with a given number of nodes $N$ and a given degree

²The law of rare events, or Poisson limit theorem, asserts that, given a sequence $p_n$ of real numbers in $[0, 1]$ such that $np_n \to \lambda$, a finite constant, then

$$\lim_{n \to \infty} \binom{n}{k} p_n^k (1-p_n)^{n-k} = e^{-\lambda} \frac{\lambda^k}{k!}.$$


sequence $\{d_1, \ldots, d_N\}$. In particular, this also fixes the number of edges, because $2M = \sum_{i=1}^{N} d_i$. To create such a random graph, imagine creating for each node $i$ exactly $d_i$ “stubs”, then uniformly choosing two stubs, connecting them by an edge, and repeating the operation. The process above generates each possible matching of stubs with equal probability. Obviously we assumed that the sum of the degree sequence is even, and we allowed the formation of multi-edges and self-loops. These issues can be easily overcome [9], but this goes beyond our interests.

A central property of the configuration model is that we can compute the probability that two nodes share an edge. Let $i$ and $j$ be two distinct nodes with degrees $d_i, d_j > 0$ (if one of the degrees is zero the probability $p_{ij}$ is null). The probability that one given stub of node $i$ is connected to a stub of node $j$ is $\frac{d_j}{2M - 1}$. Thus, summing over all the $d_i$ stubs of $i$, we obtain $p_{ij} = \frac{d_i d_j}{2M - 1}$. In particular, for large $M$ we can write:

$$p_{ij} = \frac{d_i \cdot d_j}{2M}.$$


Chapter 2

Theoretical analysis

2.1 Learning and cross-validation techniques

Dealing with learning tasks, our aim is to create models that reflect reality as closely as possible. In a supervised context, we build a model on a set of available data and through this model we want to make predictions for new data. In other words, we are looking for a model that fits our data but has the best possible capability of generalization.

Formally (see [10] for more details), let $X \in \mathbb{R}^p$ denote a real-valued random input vector and $Y \in \mathbb{R}$ a real-valued random output variable. We seek a function $f(X)$ for predicting $Y$ given values of the input $X$. For this purpose we introduce a loss function $L(Y, f(X))$ that penalizes the error in prediction, and we look for an $f$ that minimizes the Expected Prediction Error (EPE):

$$\mathrm{EPE}(f) = E[L(Y, f(X))]$$

under the joint probability $\Pr(X, Y)$.

In practice, we are given a finite set of observed data $\mathcal{T} = \{(x_1, y_1), \ldots, (x_N, y_N)\}$, from which we estimate a model $\hat{f}(x)$ as an approximation of $f(x)$. Our goal will be to estimate the test error (or generalization error)

$$\mathrm{Err}_{\mathcal{T}} = E[L(Y, \hat{f}(X)) \mid \mathcal{T}]$$

or at least its expectation, called the expected test error:

$$\mathrm{Err} = E[\mathrm{Err}_{\mathcal{T}}] = E[L(Y, \hat{f}(X))].$$

For this purpose, we usually split the data we have into two sets:

(TR) training set: used to fit the model;
(VS) validation set: used to estimate the prediction error.


Figure 2.1: A graphical sampling of the data for 5-fold cross-validation.

However, in order to estimate prediction accuracy, it is essential that our analysis does not depend on the data we chose to train the model. One of the simplest methods that meets this need is cross-validation.

The main idea of K-fold cross validation is to split the data into K equal-sized parts, called folds (see figure 2.1). We then select each fold repeatedly to use it as validation set and we keep the others as a training set. Finally we estimate the prediction error as the average over the K prediction errors obtained.

In detail, let $\kappa: \{1, \ldots, N\} \to \{1, \ldots, K\}$ be a function that associates an observation with a fold. Denote by $\hat{f}^{-k}(x)$ the fitted function, computed with the $k$-th part of the data removed. Then the cross-validation estimate of the prediction error is

$$CV(\hat{f}) = \frac{1}{N} \sum_{i=1}^{N} L(y_i, \hat{f}^{-\kappa(i)}(x_i)).$$

Several works (see as an example [10], paragraph 7.12) have shown that CV is in general a good estimate of the expected test error Err.

The most common choices for the number of splits are K = 5 and K = 10. The extreme case K = N is called leave-one-out cross-validation. As we will see in chapter 3, we chose 5-fold CV and leave-one-out for our tests.
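A minimal sketch of the K-fold machinery in R follows; the map kappa and the loop mirror the definition above, while the model-fitting step is left as a commented placeholder since it depends on the learner at hand.

```r
set.seed(1)
N <- 100; K <- 5
kappa <- sample(rep(1:K, length.out = N))  # kappa: {1,...,N} -> {1,...,K}

errors <- numeric(K)
for (k in 1:K) {
  train_idx <- which(kappa != k)           # data used to fit f^{-k}
  valid_idx <- which(kappa == k)           # held-out fold
  # fit the model on train_idx, then evaluate the loss on valid_idx:
  # errors[k] <- mean(L(y[valid_idx], f_hat(x[valid_idx])))
}
cv_estimate <- mean(errors)                # CV(f_hat)
```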

2.2 Link prediction

Missing link prediction is a widely studied problem in network science. It can be approached from a static or a dynamic point of view. The first case deals with incomplete data and aims to infer from a partially known graph the presence or absence of an edge between unobserved pairs of nodes.

The second one involves evolving networks, with the purpose of predicting links that will appear in the future. Link prediction plays a key role in different areas of network analysis such as recommendation systems in information retrieval and e-commerce, where it can help people to find new friends, potential collaborators or partners, or interesting items in online shopping, and in the study of network growth in biological and social systems.

Formally, the link prediction problem can be approached as a binary classification task (see [11] for more details). In particular, let $G = (V, E)$ be an undirected random network graph. We suppose that all vertices $v \in V$ are known and we fix $N := |V|$. Let $U$ be the set of all the possible unordered pairs of vertices. We can consider $U$ as the disjoint union of two subsets: $U^{obs}$, containing the observed pairs of vertices of $U$, and $U^{miss} = U \setminus U^{obs}$. Let $Y$ be the random $N \times N$ binary adjacency matrix for the full network graph $G$, and denote by $Y^{obs}$ and $Y^{miss}$ the entries of $Y$ corresponding to the vertex pairs in $U^{obs}$ and $U^{miss}$ respectively. Our aim is to predict the entries in $Y^{miss}$, given the values $Y^{obs} = y^{obs}$ and possibly various vertex attribute variables $X = x \in \mathbb{R}^N$, i.e. to infer

$$P(Y^{miss} \mid Y^{obs} = y^{obs}, X = x).$$

Besides this general formulation, the missing link prediction problem presents many different variants, depending on the specific task we need to solve. For example, given an observed graph, we can allow edges to be deleted or just added, or both. We can also develop specific algorithms that exploit particular structures of the network, such as the use of temporal graphs to study dynamical networks.

Although in the literature several approaches to the problem have been explored ([12]), most of them can be classified as model-based methods, which make use of classical machine learning tools (for example logistic regression, Bayesian models, etc.), or as scoring methods.

In this thesis we will focus on informal scoring methods to predict the formation of new links in networks not subject to temporal evolution.

Nevertheless model-based methods are worthy of interest as they can naturally exploit external features of the nodes such as textual information or structural similarities.

2.3 Informal scoring similarity-based methods

Due to their simplicity and effectiveness, informal scoring methods are widely used to solve link prediction tasks. As their name suggests, these methods are based on a score function that assigns a value $s_{ij}$ to each pair of nodes $(i, j) \in U^{miss}$. The score is then used to rank the unobserved links of the graph, and the higher ranked edges are the predicted ones.

In the methods we are going to present, the score function is intended as an index of similarity between two nodes. The idea that underlies these algorithms is borrowed from the homophily theory, which asserts that two nodes are more likely to share an edge if they have similar characteristics. That is why they are commonly referred to as similarity-based algorithms.

In this context the similarity of a pair of nodes has to be intended as an index of proximity between them, and it is based only on the topological structure of the observed graph $G^{obs} = (V, E^{obs})$, in contrast to other techniques that exploit external characteristics and information about the network.

Newman in [9] distinguishes two fundamental approaches for constructing measures of topological similarity: structural equivalence and regular equivalence. Two nodes are considered structurally equivalent if they share many of the same network neighbors. There are, however, many cases in which vertices occupy similar structural positions in networks without having common neighbors. Observations like the latter led to the notion of regular equivalence: two nodes are similar if they are connected to other vertices that are themselves similar. We will explore these concepts more deeply in sections 2.3.1 and 2.3.2.

The measures we present are classified into three categories depending on the type of information they require:

• local methods,
• global methods,
• quasi-local methods.

From now on, the graphs we consider will always be connected. Indeed, as we will see, the measures we define either assign a null score to unconnected pairs of nodes or are not well defined without this assumption.

2.3.1 Local methods

Local similarity measures¹ compute the similarity between two nodes using local neighborhood features like node degrees, node neighbors, and the common neighbors of the two nodes.

¹In this context the term measure is commonly used in the literature, even if it is not a measure in the strict mathematical sense.


Index — $s_{x,y}$

Common Neighbours (CN) — $|N(x) \cap N(y)|$
Salton Index, or cosine similarity (CS) — $\frac{|N(x) \cap N(y)|}{\sqrt{d_x \cdot d_y}}$
Jaccard Index (JAC) — $\frac{|N(x) \cap N(y)|}{|N(x) \cup N(y)|}$
Sørensen Index (SOR) — $\frac{2|N(x) \cap N(y)|}{d_x + d_y}$
Hub Promoted Index (HPI) — $\frac{|N(x) \cap N(y)|}{\min\{d_x, d_y\}}$
Hub Depressed Index (HDI) — $\frac{|N(x) \cap N(y)|}{\max\{d_x, d_y\}}$
Leicht-Holme-Newman Index (LHN1) — $\frac{|N(x) \cap N(y)|}{d_x \cdot d_y}$
Adamic-Adar Index (AA) — $\sum_{z \in N(x) \cap N(y)} \frac{1}{\log d_z}$
Resource Allocation Index (RA) — $\sum_{z \in N(x) \cap N(y)} \frac{1}{d_z}$

Table 2.1: Similarity functions for local methods.

Recall that we denote by N (x) the set of neighbours of a node x and by dx its degree.

To avoid weighing down the dissertation, in the next sections we will present in detail only a selection of the methods applied in chapter 3. In any case, table 2.1 summarizes all the tested local scores.

Common Neighbours index and some variants

Most local methods are based on the idea that two vertices are similar, in the sense of being “more likely to share a link”, if they have many common neighbours.

The simplest measure of this neighbourhood overlap is the direct count: the resulting index is called the Common Neighbours (CN) index, and the similarity score of two nodes $x$ and $y$ is:

$$s^{CN}_{x,y} = |N(x) \cap N(y)|.$$

Note that this is also the number of different paths of length two connecting $x$ and $y$, which in terms of the adjacency matrix $A$ of the graph can be written as $s^{CN}_{x,y} = (A^2)_{x,y}$.

Despite its simplicity, the CN index has proved to be very effective in real-world networks. For example, in [13] Newman used this quantity in the study of collaboration networks, showing a positive correlation between the number of common neighbors and the probability that two scientists will collaborate in the future.

Many other local methods are based on the ratio between the number of common neighbours of a pair of nodes and a function of their degrees. The latter is chosen appropriately to emphasize different behaviours of the network we are modeling. For example, with the Hub Promoted Index (HPI) higher scores are likely to be assigned to links adjacent to high-degree nodes (called hubs), while the Cosine Similarity, the Sørensen Index (SOR) and the Leicht-Holme-Newman Index (LHN1) in some way normalize the CN score with functions of the expected number of common neighbours of a pair of nodes in the configuration model (see section 1.2).

Adamic-Adar and Resource Allocation indices

Both these measures are refinements of the Common Neighbours index, where each common neighbour is weighted based on its degree.

The Adamic-Adar (AA) similarity score of a pair $(x, y)$ of nodes is defined in [14] as:

$$s^{AA}_{x,y} = \sum_{z \in N(x) \cap N(y)} \frac{1}{\log d_z}.$$

It assigns more weight to less-connected neighbours.

The Resource Allocation (RA) index was proposed in [15], motivated by resource allocation dynamics on complex networks. Consider the problem of a node $x$ that has to send a unit of resource to a node $y$ with which $x$ is not directly connected. Their common neighbors can play the role of transmitters. If the amount of resource is equally distributed from a node to all its neighbours, the similarity score between $x$ and $y$ is defined as the quantity of resource received by $y$ from $x$:

$$s^{RA}_{x,y} = \sum_{z \in N(x) \cap N(y)} \frac{1}{d_z}.$$

Although the AA and RA indices were formulated in different contexts, they appear very similar. In particular, when the degree $d_z$ of a shared node is small the difference between the two scores is insignificant, while it is considerable when $d_z$ is large. In other words, RA penalizes high-degree common neighbours more heavily than AA.
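The local scores of table 2.1 can be computed directly from the adjacency matrix; the sketch below does so for CN, AA and RA on a toy graph (note that a common neighbour is connected to both endpoints, so its degree is at least 2 and $\log d_z > 0$).

```r
library(igraph)

g <- sample_gnp(50, 0.1)                   # toy graph
A <- as.matrix(as_adjacency_matrix(g))
d <- rowSums(A)                            # degrees d_v

CN <- A %*% A                              # s^CN_{x,y} = (A^2)_{x,y}

aa_ra <- function(x, y) {
  z <- which(A[x, ] == 1 & A[y, ] == 1)    # common neighbours of x and y
  c(AA = sum(1 / log(d[z])),               # weights 1 / log d_z
    RA = sum(1 / d[z]))                    # weights 1 / d_z
}
aa_ra(1, 2)
```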

2.3.2 Global methods

Global similarity measures compute the similarity between two nodes using more comprehensive global features, such as the number of paths or the information flow between two nodes.

For this reason we will make use of the adjacency matrix notation of a graph, denoted by $A$, which encloses all the information we need; remember, for example, that the element $(i, j)$ of $A^k$ gives the number of walks of length $k$ that connect vertex $i$ to vertex $j$. In what follows we present the indices divided into three categories: the first are based on the number of paths connecting two nodes, the second use probabilistic elements like random walks, and the third consists of only one index, based on matrix forests.

Indices based on paths

In 1953, in the paper [16], Katz proposed an index of similarity for social networks based on the number of paths between two nodes. The paths are exponentially weighted by their length, so that short paths are more relevant than long ones. Formally, the Katz score between two nodes $x$ and $y$ is

$$s^{Katz}_{x,y} = \sum_{l=1}^{\infty} \beta^l \cdot (A^l)_{x,y} \quad (2.1)$$

where $\beta$ is a free parameter called the damping factor. To ensure convergence of equation 2.1, $\beta$ must be lower than the reciprocal of the largest eigenvalue of $A$. Note that for very small values of $\beta$ the score is similar to the Common Neighbours index.

We can provide a matrix form for the similarity function:

$$S^{Katz} = (I - \beta A)^{-1} - I.$$
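A sketch of the matrix form in R: the damping factor is chosen safely below the reciprocal of the spectral radius of $A$, as required for convergence; the factor 0.5 is an illustrative choice.

```r
library(igraph)

g <- sample_gnp(50, 0.1)
A <- as.matrix(as_adjacency_matrix(g))
n <- nrow(A)

lambda_max <- max(eigen(A, symmetric = TRUE, only.values = TRUE)$values)
beta <- 0.5 / lambda_max                  # illustrative, below 1/lambda_max

S_katz <- solve(diag(n) - beta * A) - diag(n)   # (I - beta*A)^{-1} - I
```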

A variant of the Katz index is the global Leicht-Holme-Newman index (LHN2) [17]. It is based on the concept that two nodes are similar if their neighbours are similar; as such, it is an example of regular equivalence. However, we do not dwell on this index because we will not use it in the tests.

Indices based on random walks

Let us imagine a random walker moving on a graph: starting from a vertex $v$, it repeatedly chooses uniformly at random one of the $d_v$ edges incident on $v$ and traverses it.

A random walk on a graph $G = (V, E)$ can be modelled via a discrete time-homogeneous Markov chain on the set of vertices $V$ (see section A in the appendix for a brief recall of the main notions on Markov chains).

The transition matrix of a random walk on a graph is:

$$P = D^{-1} A, \quad (2.2)$$

where $D = \mathrm{diag}(d_1, \ldots, d_N)$.

Random Walk with Restart (RWR) is a direct application of the PageRank algorithm [18]. PageRank simulates a random walker on the Web that with a certain probability $\alpha \in (0, 1)$ “teleports” to another uniformly chosen vertex (web page). Following this idea, random walk with restart considers, for every vertex $x$ of the graph, a random walker that with probability $\alpha$ returns to $x$. The resulting process, for every fixed $x$, is a Markov chain with transition matrix

$$\hat{P}_x = (1 - \alpha)P + \alpha\, e \cdot e_x^T,$$

where $P$ is the transition matrix of the ordinary random walk defined in 2.2, $e = (1, \ldots, 1)^T$ and $e_x$ is the unit vector with 1 in the position corresponding to node $x$.

Note now that if $G$ is connected then $P$ is irreducible, and so is $\hat{P}_x$ for every $x$. Furthermore, by theorem A.9, if a Markov chain with finitely many states is irreducible then it is positive recurrent. Thus theorem A.11 guarantees the existence and uniqueness of the stationary probability distribution $q_x$, which satisfies the following equation:

$$q_x = (1 - \alpha) P^T q_x + \alpha e_x = \alpha \left(I - (1 - \alpha) P^T\right)^{-1} e_x.$$

Finally, in order to achieve symmetry, the RWR index is defined as

$$s^{RWR}_{x,y} = q_{xy} + q_{yx},$$

where $q_{xy}$ denotes the entry of $q_x$ corresponding to node $y$.

Let us consider again a regular random walk on a graph. An interesting quantity that can be computed is the average first-passage time between two nodes $i$ and $j$, $m(j, i)$. Its formal definition can be found in A.1; it represents the average number of steps needed by a random walker to reach state $j$ for the first time when starting from state $i$. Its symmetrized version is the average commute time (A.2):

$$n(i, j) = m(i, j) + m(j, i),$$

which can be thought of as the average number of steps needed for reaching state $j$, starting from state $i$, and returning to $i$. The average commute time has been proved to be a distance measure on the graph [19]. It also has the nice property of decreasing when the number of paths connecting two nodes increases and when the length of the paths decreases. Thus, assuming that two nodes $i$ and $j$ are more similar if they are closer in the distance induced by $n(i, j)$, roughly if there are many short paths connecting them, we define the Average Commute Time (ACT) [19] similarity index as:

$$s_{i,j} = \frac{1}{n(i, j)}.$$

Although we can calculate $m(i, j)$ with probabilistic considerations, it is computationally convenient to reformulate the average first-passage time in terms of the Moore-Penrose pseudoinverse of the Laplacian matrix $L = D - A$ of the graph. In appendix B there is a detailed proof of this fact, starting from a recursive definition A.3 of the average first-passage time. In particular, the following equation holds:

$$m(k, i) = \sum_{j=1}^{n} \left(l^+_{ij} - l^+_{ik} - l^+_{kj} + l^+_{kk}\right) d_j \quad \forall k, i \in V,$$

where $L^+ = (l^+_{ij})_{i,j \in V}$ is the pseudoinverse of $L$.

By definition we obtain, for all $i$ and $j$:

$$n(i, j) = m(i, j) + m(j, i) = \sum_{k=1}^{n} \left(l^+_{jk} - l^+_{ji} - l^+_{ik} + l^+_{ii}\right) d_k + \sum_{k=1}^{n} \left(l^+_{ik} - l^+_{ij} - l^+_{jk} + l^+_{jj}\right) d_k = \sum_{k=1}^{n} \left(l^+_{ii} + l^+_{jj} - 2 l^+_{ij}\right) d_k = \left(l^+_{ii} + l^+_{jj} - 2 l^+_{ij}\right) \sum_{k=1}^{n} d_k. \quad (2.3)$$

Note that the quantity $\sum_{k=1}^{n} d_k$ does not depend on $i$ and $j$, thus we can define the similarity index for a pair of nodes $x$ and $y$ as the reciprocal of $n(x, y)$ up to this constant value:

$$s^{ACT}_{x,y} = \frac{1}{l^+_{xx} + l^+_{yy} - 2 l^+_{xy}}.$$

A measure that is strictly related to ACT is the Normalized Average Commute Time (ACTN). If we denote by $\pi$ the stationary distribution of a random walk on the graph, the ACTN similarity coefficient between two nodes $x$ and $y$ is defined as:

$$s^{ACTN}_{x,y} = \frac{1}{m(x, y)\pi_x + m(y, x)\pi_y}.$$

One can prove that $\pi_x = \frac{d_x}{\sum_{y \in V} d_y}$, therefore the ACTN index is a version of the ACT index weighted with the degrees of the nodes.
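The ACT score can be computed directly through the pseudoinverse of the Laplacian; a sketch with MASS::ginv follows, assuming the graph is connected.

```r
library(igraph)
library(MASS)

g  <- sample_gnp(50, 0.15)                # toy graph, assumed connected
L  <- as.matrix(laplacian_matrix(g))
Lp <- ginv(L)                             # Moore-Penrose pseudoinverse L^+

s_act <- function(x, y)
  1 / (Lp[x, x] + Lp[y, y] - 2 * Lp[x, y])

s_act(1, 2)
```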

The pseudoinverse of the Laplacian matrix $L^+$, exploited in the computation of the previous indices, has several important properties, discovered in the area of information retrieval. For example, it is positive semidefinite and represents the inner product matrix of a transformed Euclidean space of the nodes that preserves the average commute time [19]. Thus it can be used itself as a similarity matrix. We denote this index by $L^+$ direct ($L^+$):

$$S^{L^+} = L^+.$$

Another derived index of similarity is the Cosine based on $L^+$ (COS+), which measures the cosine of the angle between node vectors in the space spanned by the columns of $L^+$. The coefficient between two nodes $x$ and $y$ is defined as

$$s^{COS+}_{x,y} = \frac{l^+_{xy}}{\sqrt{l^+_{xx}\, l^+_{yy}}}.$$

Matrix-Forest Index

The Matrix-Forest Index (MFI) was proposed by Chebotarev and Shamis in [20] as a complex sociometric index of proximity. It is based on a generalized version of the Matrix-Tree theorem (see [21]) applied to weighted multigraphs (graphs with multiple edges and self-loops). First we state some preliminary definitions:

- a subgraph $H = (V_H, E_H)$ of $G = (V, E)$ is a spanning subgraph if $V_H = V$;
- a graph $G$ is a forest if it is cycleless;
- a tree is a connected forest;
- a rooted tree is a pair $(T, r)$, where $T$ is a tree and $r$ a selected vertex of $T$ called the root;
- a rooted forest is a collection of disjoint rooted trees.

Kirchhoff's Theorem (or Matrix-Tree Theorem) asserts that any cofactor² of the Laplacian matrix $L$ of a graph $G$ is equal to the number of spanning trees in $G$.

W. T. Tutte generalized this theorem to weighted multigraphs. Let $G = (V, E)$ be a multigraph with $V = \{1, \ldots, n\}$ and let $\varepsilon^p_{ij}$ denote the weight of the $p$-th edge between $i$ and $j$ in $G$.

²The $(i, j)$ cofactor of a matrix $M$ is, up to the sign $(-1)^{i+j}$, the determinant of the submatrix formed by deleting the $i$-th row and $j$-th column of $M$ (the $(i,j)$ minor). It is often denoted by $M_{i,j}$.


Denote by $A = (a_{ij})$ the generalization to multigraphs of the adjacency matrix, where $a_{ij}$ represents the number of edges connecting $i$ and $j$, and by $L = (l_{ij})$ the Kirchhoff matrix, or generalized Laplacian matrix, where

$$l_{ij} = \begin{cases} -\sum_{p=1}^{a_{ij}} \varepsilon^p_{ij} & j \neq i \\ -\sum_{k \neq i} l_{ik} & j = i. \end{cases} \quad (2.7)$$

Finally, for each subgraph $H$ of $G$, let $\varepsilon(H) = \prod_{(i,j) \in E_H} \prod_{p=1}^{a_{ij}} \varepsilon^p_{ij}$, and for every non-empty set of subgraphs $\mathcal{G}$ define $\varepsilon(\mathcal{G}) = \sum_{H \in \mathcal{G}} \varepsilon(H)$.

The theorem states what follows:

Theorem 2.1 (Matrix-Tree Theorem for weighted multigraphs) Let $G$ be a weighted multigraph and let $\mathcal{T}$ denote the set of all spanning trees of $G$. Then, for every pair of vertices $i, j \in V$, $L_{i,j} = \varepsilon(\mathcal{T})$, where $L_{i,j}$ denotes the $(i, j)$ cofactor of $L$.

Consider now the matrix $W = I + L$. Chebotarev and Shamis proved an analogue of theorem 2.1 for spanning rooted forests, using the matrix $W$:

Theorem 2.2 Let $G$ be a weighted multigraph. If $\mathcal{F}$ denotes the set of all spanning rooted forests of $G$, then

$$\det(W) = \varepsilon(\mathcal{F}). \quad (2.8)$$

Moreover, for any $i, j \in V$, let $\mathcal{F}_{ij}$ be the set of those spanning rooted forests of $G$ such that $i$ and $j$ belong to the same tree rooted in $i$. Then it is true that

$$W_{i,j} = \varepsilon(\mathcal{F}_{ij}), \quad (2.9)$$

where $W_{i,j}$ denotes the $(i, j)$ cofactor of $W$.

Note that, by the symmetry of $W$, equation 2.9 remains true if we replace $\varepsilon(\mathcal{F}_{ij})$ with $\varepsilon(\mathcal{F}_{ji})$.

If the weights are non-negative (and not all null), equation 2.8 of theorem 2.2 guarantees the non-singularity of the matrix $W$. This implies that the hypotheses of the following theorem are often fulfilled:

Theorem 2.3 Let $G$ be a weighted multigraph. If the matrix $Q = W^{-1}$ exists, then its elements are characterized as follows:

$$q_{ij} = \varepsilon(\mathcal{F}_{ij}) / \varepsilon(\mathcal{F}). \quad (2.10)$$

By theorem 2.3, the $(i, j)$-entry of the matrix $Q = W^{-1}$ can be considered as a measure of the relative “forest-accessibility” of the vertex $i$ from $j$ (or $j$ from $i$).

Observe that in this thesis we deal only with unweighted simple graphs. Nevertheless, an unweighted graph can easily be thought of as a weighted one with all weights equal to 1. In this context the Kirchhoff matrix defined in 2.7 coincides with the Laplacian matrix $L = D - A$, and we can define the score matrix of MFI as

$$S^{MFI} = (I + L)^{-1}.$$

2.3.3 Quasi-local methods

Although the global indices in general provide much more accurate predictions than the local ones, they suffer two disadvantages: the calculation of a global index is very time-consuming, and sometimes the global topological information is not available. For this reason quasi-local indices are often a good tradeoff.

Local Path Index

The Local Path Index (LPI) was introduced in [15] as an improvement of the CN measure. As Zhou et al. state in that article, for many real-world networks the probability that two node pairs are assigned the same CN score is high, and most node pairs are assigned zero score. In order to solve this problem they proposed an index that makes the scores more distinguishable without losing the low computational complexity of Common Neighbours. The LPI score matrix is defined as follows:

$$S^{LPI} = A^2 + \epsilon A^3,$$

where $\epsilon$ is a free parameter that controls the relevance of paths of length 3. Note that for $\epsilon = 0$ we obtain exactly the CN index.
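A sketch of the LPI score matrix; the value of epsilon is an illustrative choice.

```r
library(igraph)

g   <- sample_gnp(50, 0.1)
A   <- as.matrix(as_adjacency_matrix(g))
A2  <- A %*% A

eps <- 0.01                               # illustrative free parameter
S_lpi <- A2 + eps * (A2 %*% A)            # S^LPI = A^2 + eps * A^3
```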

2.4 Theoretical justification for common neighbours measures

In [22] Sarkar, Chakrabarti and Moore formulate the link prediction problem as a problem of estimating distances between pairs of nodes, where the nodes lie at unknown positions in some latent space. In particular, they consider a well-known class of graph models that associate every node with a position in a $D$-dimensional metric space, where connections are more likely between closer nodes. In this context, the authors show that the number of common neighbours between a pair of nodes gives bounds on the distance between them, and the upper bound on the distance decreases quickly as the count of common neighbours increases.


Let us present the formal framework of the article, starting from the latent space. The model assumes the nodes uniformly distributed in a $D$-dimensional Euclidean space. Hence, denoting by $d_{ij}$ the distance between two nodes $i$ and $j$,

$$P(d_{ij} \leq x) = V(x) = V(1)\, x^D,$$

where $V(r)$ indicates the volume of the $D$-dimensional hypersphere of radius $r$. The probability of a link between a pair of nodes (denoted by $\sim$) is defined as a logistic function of their distance, with parameters $\alpha \geq 0$ and $r > 0$:

$$P(i \sim j \mid d_{ij}) = \frac{1}{1 + e^{\alpha(d_{ij} - r)}}.$$

Note that $\alpha$ controls the sharpness of the function, and the limit case $\alpha = +\infty$ leads to the deterministic model where two nodes $i$ and $j$ are connected if and only if $d_{ij} < r$. The main results of the article are shown for the deterministic model; a weak version is then stated for large but still finite $\alpha$.

Let $N$ be the total number of nodes and fix two of them, say $i$ and $j$. For every other node $k$, let $Y_k$ be a random variable that is 1 if $k \in N(i) \cap N(j)$ and 0 otherwise. Conditioned on the distance $d_{ij}$, the variables $Y_k$ are independent, and

$$E[Y_k \mid d_{ij}] = P(i \sim k \sim j \mid d_{ij}) = \int P(i \sim k \mid d_{ik})\, P(k \sim j \mid d_{kj})\, P(d_{ik}, d_{kj} \mid d_{ij})\, d(d_{ik})\, d(d_{kj}). \quad (2.11)$$

In the deterministic model the expectation in equation 2.11 is equal to the volume of the intersection of two balls of radius $r$ centred at $i$ and $j$. Denoting this volume by $A(r, r, d_{ij})$, it can be computed analytically as twice the volume of the spherical cap with height $r - d_{ij}/2$:

$$A(r, r, d_{ij}) = 2\, \frac{\pi^{\frac{D-1}{2}}}{\Gamma\!\left(\frac{D+1}{2}\right)}\, r^D \int_0^{\arccos\left(\frac{d_{ij}}{2r}\right)} \sin^D(t)\, dt.$$

Let $\eta$ be the observed value of the number of common neighbours between $i$ and $j$, i.e. of $\sum_k Y_k$, and let $\widehat{\mathrm{Var}}(\mathbf{Y})$ be the sample variance of the vector $\mathbf{Y} = (Y_k)_k$, namely $\widehat{\mathrm{Var}}(\mathbf{Y}) = \frac{1}{n-1} \sum_{i=1}^{n} (Y_i - \bar{Y})^2$ with $\bar{Y}$ the sample mean, whose observed value is $\frac{\eta - \eta^2/N}{N - 1}$.

Using the empirical Bernstein bound³ [23], we get the following confidence bound for the area $A(r, r, d_{ij})$, depending on the number of common neighbours of the pair of nodes:


$$P\left( \left| \frac{\sum_k Y_k}{N} - E[Y_k] \right| \geq \sqrt{\frac{2\, \widehat{\mathrm{Var}}(\mathbf{Y}) \log(2/\delta)}{N}} + \frac{7 \log(2/\delta)}{3(N-1)} \right) \leq 2\delta. \quad (2.12)$$

Thus, setting $\epsilon = \epsilon(\delta) = \sqrt{\frac{2\, \widehat{\mathrm{Var}}(\mathbf{Y}) \log(2/\delta)}{N}} + \frac{7 \log(2/\delta)}{3(N-1)}$ and recalling that in the deterministic case $E[Y_k \mid d_{ij}] = A(r, r, d_{ij})$, we obtain:

$$P\left( \frac{\eta}{N} - \epsilon \leq A(r, r, d_{ij}) \leq \frac{\eta}{N} + \epsilon \right) \geq 1 - 2\delta. \quad (2.13)$$

Finally, applying the analytical bounds

$$\left(1 - \frac{d_{ij}}{2r}\right)^D \leq \frac{A(r, r, d_{ij})}{V(r)} \leq \left(1 - \left(\frac{d_{ij}}{2r}\right)^2\right)^{\frac{D}{2}}$$

to equation 2.13, we are able to bound the distance $d_{ij}$, with probability at least $1 - 2\delta$:

$$2r\left(1 - \left(\frac{\eta/N + \epsilon}{V(r)}\right)^{\frac{1}{D}}\right) \leq d_{ij} \leq 2r \sqrt{1 - \left(\frac{\eta/N - \epsilon}{V(r)}\right)^{\frac{1}{D}}}. \quad (2.14)$$

With the bounds we have proved, we are able to state the results we were looking for. Let us recall that, using the model given at the beginning of this section, the link prediction problem for a node $i$ consists of selecting the non-neighbour of $i$ with the minimum distance. Denote by $OPT$ such a node and by $d_{OPT}$ its distance from $i$. Unfortunately the positions in the latent space are unknown, so we predict a link to the node that shares the most common neighbours with $i$. Let us call this node $MAX$. For better clarity, we adopt the following convention: we add the subscript “OPT” or “MAX” to all the quantities defined above when referring respectively to the pair $(i, OPT)$ and $(i, MAX)$.

Theorem 2.5 below shows that $d_{MAX}$ is bounded by the distance between $i$ and the node with minimum distance from it, up to an additive factor that goes to zero as $N$ increases.

³Theorem 2.4 (Empirical Bernstein bound) Let $Z = (Z_1, \ldots, Z_n)$ be a vector of i.i.d. random variables with values in $[0, 1]$ and let $\delta > 0$. Then

$$P\left( E[Z_i] - \frac{1}{n} \sum_{i=1}^{n} Z_i \leq \sqrt{\frac{2\, \widehat{\mathrm{Var}}(Z) \ln(2/\delta)}{n}} + \frac{7 \ln(2/\delta)}{3(n-1)} \right) > 1 - \delta,$$

where $\widehat{\mathrm{Var}}(Z)$ is the sample variance, defined as $\widehat{\mathrm{Var}}(Z) = \frac{1}{n(n-1)} \sum_{1 \leq i < j \leq n} (Z_i - Z_j)^2 = \frac{1}{n-1} \sum_{i=1}^{n} (Z_i - \bar{Z})^2$.


Theorem 2.5 Define $\epsilon_f = \epsilon_{OPT} + \epsilon_{MAX}$. Then, with high probability:

$$d_{OPT} \leq d_{MAX} \leq d_{OPT} + 2r \left( \frac{\epsilon_f}{V(r)} \right)^{\frac{1}{D}} \leq d_{OPT} + 2 \left( \frac{\epsilon_f}{V(1)} \right)^{\frac{1}{D}}. \quad (2.15)$$

As we mentioned at the beginning of this section, if we remove the hypothesis that we can infer the whole graph knowing the positions of the nodes, the results that can be proved become weaker. That happens primarily because the probability 2.11 that two nodes share a neighbour is hard to compute in the non-deterministic case. Therefore the exact formulation $P(i \sim k \sim j \mid d_{ij}) = A(r, r, d_{ij})$, which holds in the deterministic case, is substituted with the bounds provided by the following theorem:

Theorem 2.6 With the previous notation, it holds that:

$$P(i \sim k \sim j \mid d_{ij}) > \frac{1}{4} \left( A(r, r, d_{ij}) + 2 e^{-\alpha d_{ij}} \left( V(r) - A(r, r, d_{ij}) \right) \right)$$

and

$$P(i \sim k \sim j \mid d_{ij}) < \begin{cases} A(r, r, d_{ij}) + 2 V(r) \left(1 - \frac{D}{\alpha r}\right)^{-1} & \text{if } \alpha r > D \\ A(r, r, d_{ij}) + 2 D\, V(r) & \text{if } \alpha r = D \\ A(r, r, d_{ij}) + 2 V\!\left(\frac{D}{\alpha}\right) \left(\frac{D}{\alpha r} - 1\right)^{-1} & \text{if } \alpha r < D. \end{cases}$$


Chapter 3

Application to an e-commerce network

An important application of missing link prediction methods is recommendation systems. For example, the similarity between two users of a social network can be used to suggest new friends. Nowadays, because of the theoretical but above all the business relevance of the task, recommendation systems are extremely complex and exploit much side information about users. A commercial network, in its simplest formulation, consists of two sets of vertices: products and users, or customers. A link connects a user-product pair if the former bought the latter at least once. Such a network has a natural bipartite structure. On bipartite graphs the assumptions generally made by similarity-based methods, such as high transitivity and clustering, are not satisfied. For this reason most of the link prediction measures described in chapter 2 are useless there; for example, common neighbours measures assign a null score to every user-product pair. In order to bypass this problem some researchers modify existing measures to make them suit the bipartite structure, while others focus on similarities in the unipartite projection graphs and exploit them to make predictions in the original graph. We chose the latter way, analyzing the link prediction methods proposed in chapter 2 on the projections of the main graph. We do not develop a real recommendation system suggesting new items to customers. However, we explore similarity for prediction from two sides: a global one, predicting the most likely links in the network, and a personal one, predicting specific links for each person. Finally, in order to generalize the analysis from a statistical perspective, we modelled the real graph with a bipartite network model based on the two-sided degree distributions. Then we tested the average behaviour of the methods on several instances generated by this model.

Everything that is described in this chapter was implemented using the statistical programming language R, version 3.6.0. In order to make the analysis easily reproducible and clear for external users, we implemented the code with R Markdown.

3.1 Introduction to the dataset and the graphs

For our analysis we chose a real-world public dataset available at https://www.kaggle.com/olistbr/brazilian-ecommerce. The dataset collects about 100k orders, from 2016 to 2018, made through the Brazilian e-commerce platform Olist. For each order a wide variety of information is provided, ranging from order status, price, payment and freight performance to customer location, product attributes and reviews written by customers. We decided to take into account only basic purchase information, obtaining a dataset that collects a product-customer pair each time a customer bought a product.

In the following table and in figure 3.1 we summarize the main characteristics of the Olist dataset, in order to give an idea of what we are dealing with and, subsequently, to create an artificial dataset, as we will explain in section 3.4.

Number of orders                        112650
Number of customers                     95420
Number of products                      32951
Average no. of products per customer    1.51
Average no. of customers per product    36.41

As we can observe from the distribution of the number of products per capita (figure 3.1), the majority of the users are “occasional” customers, because they bought only one item. For this reason, we decided to consider both the entire dataset and a reduced one, which results from the exclusion of all the occasional customers.

The nature of the described dataset led us to represent it as a bipartite graph (definition 1.1). Naturally our vertex sets were the customer set and the product set, while a customer-product pair is connected by an edge if and only if that product has been bought by that customer. Note that the graph created is unweighted: we do not take into account how many times a customer bought a product, but only whether it happened.

In our context, the projections of the product-customer graph are:


Figure 3.1: Frequencies of the number of products bought per customer (above) and of the number of customers per product (below). The dashed lines represent the average values.


C1: in which each node represents a customer, and a pair of nodes share an edge if they bought at least one common product;

P1: in which each node represents a product, and there is an edge between two nodes if at least one customer bought both of them.

Similarly, we denote by C1red and P1red the projections of the reduced bipartite graph.
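A hedged sketch of how such a bipartite graph and its two projections can be built with igraph; the toy order table and its column names are illustrative, not the Olist schema.

```r
library(igraph)

orders <- data.frame(customer = c("c1", "c1", "c2", "c3"),
                     product  = c("p1", "p2", "p1", "p2"))

g <- graph_from_data_frame(orders, directed = FALSE)
V(g)$type <- V(g)$name %in% orders$product   # FALSE = customer, TRUE = product

proj <- bipartite_projection(g, multiplicity = TRUE)
C1 <- proj$proj1                    # customer-customer projection
P1 <- proj$proj2                    # product-product projection
E(C1)$weight                        # number of common products per pair
```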

Table 3.1 shows the values of the main characteristics of the projection graphs.

C1 and P1 decompose into many connected components. In particular, there is a high number of small components, few large components, and one extremely larger than the others, containing almost 1/5 of the nodes. That explains why the density of the networks is of the order of $10^{-4}$ or less, although many components are almost complete. However, C1 and P1 show completely different behaviours, due primarily to the fact that, on average, a product is bought by many customers while a customer buys one or two products. Consequently, C1 includes many cliques, consisting of the sets of clients that bought the same product, and its global and local transitivity coefficients are very high (0.96 and 0.98 on average, respectively). On the other side, P1 has an average vertex degree lower than one. Its global transitivity coefficient is 0.39, and the local one has an average value lower than 0.6 but a high variance. As we will see in section 3.2.2, such differences in structure are reflected in different performances of the link prediction methods.

3.2 Tests on the main projection graphs

In this section we present the results of 5-fold cross-validation for the missing link prediction problem on the projections of the original bipartite graph (sec. 3.2.2) and of the reduced one (sec. 3.2.3).

In particular, we tested the prediction accuracy of the main methods available in R for link prediction, theoretically analyzed in section 2.3. For this purpose we use the CRAN package linkprediction.

In 3.2.1 we describe the methodology used for the tests and the evaluation metrics selected for the problem.
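As far as we understand its interface, the linkprediction package exposes a single entry point, proxfun, which computes a chosen proximity score on an igraph object; the sketch below is a hedged illustration, and the method abbreviations and the value argument follow our reading of the package documentation.

```r
library(igraph)
library(linkprediction)

g <- sample_gnp(30, 0.15)                  # toy graph

s_cn <- proxfun(g, method = "cn", value = "matrix")    # Common Neighbours
s_aa <- proxfun(g, method = "aa", value = "edgelist")  # Adamic-Adar
head(s_aa)                                 # candidate pairs with their scores
```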

3.2.1 Cross validation and evaluation metrics

In section 2.3 we briefly described how informal scoring methods operate. Here we will point out all the details, focusing on the problems we observed and the choices we made.


             C1         P1        C1red      P1red
N            95420      32951     11869      10682
M            1563117    7724      63153      7724
mean(d)      32.76      0.47      10.64      1.45
sd(d)        72.22      1.32      25.78      2.00
diam         25         24        25         24
no. of cc    27190      27190     4921       4921
mean(Ncc)    3.51       1.21      2.41       2.17
sd(Ncc)      148.63     13.40     56.24      31.49
den          3.43e-04   1.4e-05   0.90e-04   1.35e-04
clT          0.96       0.39      0.85       0.39
mean(cl)     0.98       0.58      0.86       0.58
sd(cl)       0.10       0.44      0.25       0.45

Table 3.1: Main characteristics of the graphs C1, P1 and their reduced versions C1red and P1red. For a complete description of the characteristics refer to section 1.1.


We split the edge set $E$ into $k$ disjoint sets, or folds, namely $E_1, \ldots, E_k$. Then for each fold $i \in \{1, \ldots, k\}$ we select $E^{tr} = \sqcup_{j \neq i} E_j$ as a training edge set and $E^{vs} = E_i$ as a validation set. Similarly, we denote by $G^{tr} = (V, E^{tr})$ the training graph and by $G^{vs} = (V, E^{vs})$ the validation graph at each iteration.

Moreover, let us denote by $U$ the set of all the possible unordered pairs of nodes of a graph $G$.

At each iteration we run the link prediction methods on $G^{tr}$, obtaining a list of scores for each pair of nodes in $U^{vs} := U \setminus E^{tr}$, i.e. all the possible unobserved edges. From this list we select for prediction a fraction $L$ of the edges with the best scores, where $L \in [0, 1]$ is a threshold parameter. Let us denote the predicted edges by $E^{pr}$.

Interpreting link prediction as a binary classification task, we recognise as positive the edges of the validation graph $E^{vs}$, and as negative the remaining pairs of nodes that are not edges of the training graph, i.e. $U^{vs} \setminus E^{vs}$. In the same way, the pairs in $E^{pr}$ are the positively classified instances, while the ones in $U^{vs} \setminus E^{pr}$ are the negatively classified.


                      Actual positive   Actual negative
Predicted positive    TP                FP
Predicted negative    FN                TN

Table 3.2: Confusion matrix for binary classification.

From the previous sets we can easily obtain the values of the quantities in the confusion matrix of table 3.2. To evaluate prediction accuracy we use common measures of binary classification:

Recall measures the ratio of the edges correctly predicted to the number of edges to be predicted, i.e. the number of edges of the validation graph:

$$\mathrm{recall} = \frac{TP}{TP + FN}.$$

Precision measures the ratio of the edges correctly predicted to the number of edges predicted:

$$\mathrm{precision} = \frac{TP}{TP + FP}.$$

Moreover in order to summarize their information, we exploit the F1-measure, defined as the harmonic mean of precision and recall:

F 1 = 2 · precision · recall precision + recall.

3.2.2

Tests on C2 and P1

Initially we tried to test the link prediction algorithms on the projections C1 and P1 of the main bipartite graph. Unfortunately, the computation on C1 proved unfeasible due to its size. Thus we decided to remove the “less significant” edges from the projected graph. Let us clarify this idea.

Let $G = (V_T \cup V_B, E)$ be a bipartite graph: its projections are two unipartite weighted graphs, where the weight represents the number of common neighbors between two nodes on the same side of the bipartite graph. Consider for example the projection on the top side, $G_T = (V_T, E_T, W_T)$: for all $i, j \in V_T$ the weight $w_{ij}$ counts the common neighbours of $i$ and $j$ in $V_B$, and


Figure 3.2: Representation of graph C2

$E_T = \{(i, j) \in V_T \times V_T : w_{ij} > 0\}$. So far we considered C1 and P1 as unweighted graphs, because the classical version of the similarity-based methods for link prediction cares only about the existence of an edge between two nodes. Obviously there are in the literature several generalizations that exploit also the weights of the edges (see as an example the theory developed for the Matrix-Forest Index in section 2.3.2). However, due to the structure of the considered database, where the edges generally have small weights, that kind of analysis would probably have led to strictly better results but would have increased the complexity of the problem.

Our solution was to treat the weights as thresholds: for each n, we removed from C1 and P1 the edges with weight strictly less than n, together with the resulting unconnected nodes, and we denoted the obtained graphs by Cn and Pn respectively.
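In igraph terms this filtering is short; a sketch, assuming the projection is a weighted igraph object with an edge attribute weight (the function name is illustrative):

```r
# Keep only edges with weight >= n, then drop the vertices left isolated:
# applied to C1 and P1 this yields the graphs Cn and Pn.
threshold_projection <- function(g_proj, n) {
  g <- delete_edges(g_proj, E(g_proj)[E(g_proj)$weight < n])
  delete_vertices(g, V(g)[degree(g) == 0])
}

C2 <- threshold_projection(C1, 2)
P2 <- threshold_projection(P1, 2)
```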

As we mentioned before, the majority of the edges have a small weight: to get an idea, P2 has less than 250 edges and C3 has only 17. A reasonable number of clients have indeed two products in common, therefore we believe that C2 (fig. 3.2) is a good replacement for C1. Table 3.3 shows the main characteristics of C2 compared with the ones of C1.

Observe that although the number of nodes and edges decreases considerably, the density increases. Moreover, while the average number of nodes for each connected component is unchanged, its variance is much lower: small components disappear and the big ones decompose into smaller components. In particular, the largest components of C1 split and form the largest components of C2.

name   N       M         diam   no. cc   mean(Ncc)   sd(Ncc)
C1     95420   1563117   25     27190    3.51        148.63
C2     744     3094      4      202      3.68        5.83

name   den        mean(d)   sd(d)   clT    mean(cl)   sd(cl)
C1     3.43e-04   32.76     72.22   0.96   0.98       0.10
C2     1.12e-02   0.06      1.33    0.98   0.94       0.21

Table 3.3: C2 characteristics compared to C1.

Figure 3.3 shows the F1-measure for C2 and P1 with the threshold value L ranging in [0.1, 0.2, ..., 0.9, 1]. The behaviours are opposite for the two graphs: most of the methods tested on C2 reach high levels of accuracy, while on P1 the values are all lower than 0.3.

Let us focus firstly on C2. We observe from figure 3.3 that the best F1-measures for C2 are obtained for L = 0.4, i.e. when the best ranked 40% of the edges is selected for prediction. In particular, 11 of the 17 methods reach their maximum when L = 0.4, Katz, ACTN and LPI when L = 0.3, and the others for extreme values of L. Therefore in figure 3.4 we show boxplots of Recall and Precision for that threshold value. Recall takes average values greater than 0.8 for all the methods except for HDI and L+, which lie between 0.6 and 0.7, and finally COS+ and LHN1, which are worse than a random classifier. We note moreover that COS+ has a high variance. From the precision plot we observe an interesting phenomenon: local methods assume values greater than 0.9 (except for HDI) while global and quasi-local methods stay lower than 0.8. From the Precision-Recall curve (figure 3.5) we can observe that local methods and LPI have a greater area under the curve (PR-AUC) than the majority of the global methods. In particular, AA, RA and CN have excellent performances overall.

Let us consider now graph P1. Both precision and recall values are daunting. In order to understand why this happens, consider the classifier that classifies as positive all the possible instances, i.e. the case where all the node pairs with positive score are selected for prediction, obtained by setting L = 1. As we can observe from the boxplots in figure 3.6, the recall values are lower than 0.5 for all the methods. That can be explained by the fact that most of the edges removed for the validation disconnect subsets of the vertices from their connected component, so that they are assigned a null score and are consequently excluded from prediction, increasing the number of False Negatives (FN).

Figure 3.3: F1-measures for C2 and P1, one panel per group of methods (Global, Local, Quasi-local). The F1-score is the average of the values obtained with the 5-fold cross validation. Threshold values range in [0.1, 0.2, ..., 0.9, 1].

Figure 3.4: Boxplots of Recall and of Precision of the methods for C2. Each boxplot shows the values for every iteration of the 5-fold cross validation. The threshold is fixed at L = 0.4.

Figure 3.5: Precision-Recall curve for C2 obtained from the average values of the 5-fold cross-validation.

3.2.3 Tests on the reduced dataset

Figure 3.8 shows the precision-recall curve on the reduced customer graph. The recall value is constantly high for almost all the methods, except for LHN1. On the contrary, the behaviour of the precision depends strongly on the type of information required by the method. In particular, global methods have almost null precision, the precision values of LPI (quasi-local) range over [0, 0.25], while for local methods they even exceed 0.5 for low thresholds.

3.3 Single-customer prediction

In this section we focus on the prediction of missing links for single customers. We aim to know, for a fixed customer, which of the other customers is more likely to buy the same products. In particular, as we did in the previous section with “global” prediction, we test the accuracy of the similarity-based methods described in section 2.3.

Figure 3.6: Boxplots of Recall of the methods for P1. Each boxplot shows the values for every iteration of the 5-fold cross validation. The threshold is fixed at L = 1.

Figure 3.7: Plot of the second largest connected component of P1, during the first iteration of cross-validation with L = 1. The blue edges are the ones of the training graph, while the edges of the validation graph are divided in two sets: the predicted ones (TP, red) and the not-predicted ones (FN, light blue). The remaining ones are the FP edges (yellow).

Figure 3.8: Precision-Recall curve for C1red obtained from the average values of the 5-fold cross-validation.

In order to estimate the prediction error we implemented a variant of the leave-one-out cross validation: iterating over the edges of the graph, at each iteration we remove an edge ē and choose one of its endpoints as the selected customer. We compute the scores with the training graph being the entire graph (actually, the connected component of the selected customer) without the removed edge. Then we consider only the scores of the edges incident on the selected customer and we predict the L-fraction of the best scores, as we did in section 3.2.1. Finally we compute recall and precision of the methods. For greater clarity, let us denote the set of predicted edges by Pred_ē.

In this context the validation set consists exclusively of ē, thus the recall value is 0 or 1 depending on its presence or absence in the list of the predicted edges. Precision is a sort of weighted recall and is equal to:

\[ \mathrm{precision} = \begin{cases} \dfrac{1}{1+FP} = \dfrac{1}{|\mathrm{Pred}_{\bar e}|}, & \text{if } \bar e \in \mathrm{Pred}_{\bar e}, \\[4pt] 0, & \text{otherwise.} \end{cases} \]


We decided to select for prediction a set of edges and not just one edge because prediction methods, especially local ones, often assign the same score to groups of edges, so that the choice of a single pair of nodes would be guided by randomness. However, low values of L reproduce this situation, and the precision coefficient gives us an idea of the number of edges selected.
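A sketch of one iteration of this procedure, assuming a generic score(g, v) function that returns a named vector of scores for the unobserved pairs incident on a node v (the interface, like the rest of the snippet, is an illustrative assumption):

```r
# One leave-one-out step: remove the edge e_bar, score the pairs incident
# on the selected customer v, predict the best L-fraction and check
# whether e_bar is recovered.
loo_step <- function(g, e_bar, v, L, score) {
  u    <- setdiff(as.character(ends(g, e_bar)[1, ]), v)  # other endpoint
  g_tr <- delete_edges(g, e_bar)
  s    <- score(g_tr, v)
  pred <- names(sort(s, decreasing = TRUE))[seq_len(ceiling(L * length(s)))]
  hit  <- u %in% pred                    # recall is 0 or 1 for one edge
  c(recall = as.numeric(hit),
    precision = if (hit) 1 / length(pred) else 0)
}
```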

3.3.1 Tests on C2

Figure 3.9 shows the average values of precision and recall over the edges of graph C2 when the threshold L ranges from 0.1 to 1. All the methods share a high average recall: more than 95% of the edges are actually predicted. Precision values are also high even if, as we expected, they depend more on the threshold.

Finally we inspected whether there is some sort of correlation between the recall value and some features of the selected customer, like its degree or the size of its connected component. In particular, this test was meant to verify whether edges in small components were more likely to be correctly predicted. However, the results do not show relations of this kind.

3.4 Artificial data-set

To generalize the analysis made in the previous sections we decided to reproduce “artificially” an e-commerce dataset.

Specifically, we create a bipartite graph model based on a given degree distribution, as in [24]. Graph models based on given degree sequences are quite common in random graph theory (see for example the configuration model in chapter 1); however, they often overfit the real graph. Thus, following several studies made on real world networks, we modelled the top and bottom degree distributions of the original bipartite graph via random variables with a discrete power law.

The structure of this section reproduces the steps made for the construction of the complete model (a sketch of steps 2 and 3 is given right after the list):

1. fit the real degree distributions with different discrete power laws;

2. simulate independent trials from the obtained random variables in order to get feasible top and bottom degree sequences;

3. create a simulated bipartite graph with the obtained degree sequences. This step reproduces the creation mechanism of the configuration model.
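Steps 2 and 3 can be sketched as follows, assuming a sampler rdpl from the fitted discrete power law (a sketch of rdpl is given in the next subsection) and the stub-matching mechanism of the configuration model; the function and variable names are illustrative, not the code actually used:

```r
# Step 2: draw top and bottom degree sequences from the fitted laws;
# a feasible pair must have equal sums (in practice one can resample
# or adjust a single degree until the sums match).
deg_top <- rdpl(n_top, a = a_top)
deg_bot <- rdpl(n_bot, a = a_bot)

# Step 3: bipartite configuration model, where each node gets as many
# "stubs" as its degree and top stubs are matched to bottom stubs
# uniformly at random.
bipartite_config <- function(deg_top, deg_bot) {
  stopifnot(sum(deg_top) == sum(deg_bot))
  top_stubs <- rep(seq_along(deg_top), deg_top)
  bot_stubs <- sample(rep(seq_along(deg_bot), deg_bot))
  g <- graph_from_edgelist(cbind(paste0("T", top_stubs),
                                 paste0("B", bot_stubs)),
                           directed = FALSE)
  simplify(g)  # remove possible multi-edges
}
```

As for the unipartite configuration model, removing multi-edges with simplify slightly perturbs the prescribed degree sequences.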


Figure 3.9: Recall and precision values for single-customer prediction on C2, one panel per group of methods (Global, Local, Quasi-local), with the threshold values ranging in [0.1, 0.2, ..., 0.9, 1].


3.4.1 Discrete power law as a model for degree distribution

Let X be a continuous random variable following a continuous power law with parameter a. The probability density function (PDF) of X will be:

\[ f_a(x) = \frac{a-1}{x^a}\,\chi_{[1,+\infty)}(x), \qquad a > 1. \tag{3.1} \]

Let us define Y as a discrete approximation of X: Y := ⌊X⌋. We can derive the law of Y from (3.1):

\[ P(Y = n) = P(n \le X < n+1) = \int_n^{n+1} \frac{a-1}{x^a}\,dx = \frac{1}{n^{a-1}} - \frac{1}{(n+1)^{a-1}}. \]
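Note that X is a Pareto variable, so it can be sampled by inverse transform: if U is uniform on (0, 1), then X = U^{-1/(a-1)} has density (3.1), and Y = ⌊X⌋ follows the discrete law above. A minimal sketch (the function name is ours):

```r
# Sample n values from the discrete power law Y = floor(X), where X is
# a continuous power law (Pareto on [1, +Inf)) with parameter a > 1.
rdpl <- function(n, a) {
  floor(runif(n)^(-1 / (a - 1)))
}

deg_top <- rdpl(1000, a = 5.18)  # e.g. with the top parameter fitted below
```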

We can estimate the parameter a using Maximum Likelihood Estimation (MLE). Given observations x1, ..., xk of X:

\[ \ln(L_X(x_1, \dots, x_k \mid a)) = \ln\Big(\prod_{i=1}^k f_a(x_i)\Big) = k \ln(a-1) - a \sum_{i=1}^k \ln x_i. \]

Maximizing the log-likelihood function of X in a,

\[ \frac{d}{da} \ln(L_X(x_1, \dots, x_k \mid a)) = 0 \iff \frac{k}{a-1} = \sum_{i=1}^k \ln x_i, \]

we obtain the maximum likelihood continuous estimator:

\[ \hat a_c = 1 + \frac{k}{\sum_{i=1}^k \ln x_i}. \]

Similarly, the ML estimator for the discrete law should maximize the likelihood

\[ L_Y(n_1, \dots, n_k \mid a) = \prod_{i=1}^k \left( \frac{1}{n_i^{a-1}} - \frac{1}{(n_i+1)^{a-1}} \right). \]

Unfortunately, finding a closed-form maximizer of the previous expression is not an easy task. Thus we decided to optimize it numerically with the R function optim, which implements a quasi-Newton method with constraints.
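A sketch of this numerical step, minimizing the negative log-likelihood of Y with optim and the L-BFGS-B method (a quasi-Newton method with box constraints, here enforcing a > 1; the variable names are illustrative):

```r
# Negative log-likelihood of the discrete power law for observed degrees ns;
# note that 1/n^(a-1) = n^(1-a).
negloglik <- function(a, ns) {
  -sum(log(ns^(1 - a) - (ns + 1)^(1 - a)))
}

fit <- optim(par = 2, fn = negloglik, ns = observed_degrees,
             method = "L-BFGS-B", lower = 1 + 1e-6)
a_hat <- fit$par
```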

With this approximation we obtained the values âT = 5.18 and âB = 2.31 for the top and the bottom MLE respectively.
