
Community detection in dynamic networks: phase transitions and online inference

Author:

Nicola PEDRESCHI

Supervisors: Prof. Fabrizio LILLO

Prof. Riccardo MANNELLA

PhD. Daniele TANTARI

Dipartimento di Fisica


"The central task of theoretical physics in our time is no longer to write down the ultimate equations but rather to catalog and understand emergent behavior in its many guises, including potentially life itself."


UNIVERSITÀ DI PISA

Abstract

Dipartimento di Fisica

Community detection in dynamic networks: phase transitions and online inference

by Nicola PEDRESCHI

In this thesis I propose a model, and associated algorithms, for online community detection in dynamic networks. I test the performance of the algorithms on synthetic networks generated by the Dynamic Stochastic Block Model, focusing on whether we can, at every moment of the network's dynamics, infer its organization into communities and learn the unknown parameters that were used to generate it, based only on the knowledge of the connections between pairs of nodes. We extend the analysis, studied in the literature, of the phase diagrams of Belief Propagation algorithms for community detection, using the cavity method of statistical physics to evaluate the phase diagram of our model and of existing ones.


Acknowledgements

I would like to express my gratitude to my supervisors. I am thankful to prof. Fabrizio Lillo for suggesting such an interesting and challenging research problem and for supporting me throughout my whole work with crucial insights. I would like to acknowledge prof. Riccardo Mannella, whose advice and suggestions were extremely helpful. A special "thank you" goes to Daniele Tantari who, with outstanding competence and commitment (and a fair amount of patience!), never let me down even in the most difficult moments.

Finally, I would also like to thank Roberto, Vittorio, Luca, Letizia, Paolo, Francesca, Riccardo, Michela, Giulio, Viola and Daniele for hosting me in their lab: your kind company and suggestions meant a lot to me in these past few months.


Contents

Abstract
Acknowledgements
1 Introduction
2 Network Science and Statistical Physics
  2.1 Statistical Physics and Inference
    2.1.1 Statistical Physics: phase transitions
    2.1.2 Statistical Inference
    2.1.3 Statistical Physics and Inference: the Teacher-Student scenario and the planted ensemble
  2.2 Network Theory
    2.2.1 Metrics and definitions
    2.2.2 Random graphs
    2.2.3 Community Detection
3 Community detection with Stochastic Block Model
  3.1 The Stochastic Block Model
  3.2 Community detection on static networks
    3.2.1 Group assignment
    3.2.2 Parameter learning
  3.3 Belief propagation
    3.3.1 Cavity method and SBM: equations
    3.3.2 BP for assignments inference and parameters learning
    3.4.1 Phase transition in community detection: experimental results and analysis
    3.4.2 Phase transition in parameter learning
4 Dynamic Stochastic Block Model
  4.1 Group assignments inference on dynamic networks
    4.1.1 Dynamic SBM: the community persistence η
    4.1.2 Detectability threshold
  4.2 Dynamic BP: group assignments inference
    4.2.1 Dynamic BP: experimental results
5 On-line Inference
  5.1 Inference of the group assignments
    5.1.1 On-line BP: experiments and phase transitions
    5.1.2 Learning the community persistence η
  5.2 K-online algorithm
    5.2.1 The K-online inference of group assignments: simulations and phase transitions
  5.3 Concluding remarks
6 Conclusions
Bibliography


List of Figures

1.1 Dynamic network
2.1 Second order phase transition
2.2 Weighted network
2.3 Large scale complex network
2.4 Adjacency matrices
2.5 Emergence of communities
3.1 Modular network
3.2 Bipartite network
3.3 Core-periphery network
3.4 Hierarchical network
3.5 Detectability/undetectability phase transition
4.1 Dynamic inference
4.2 Dynamic BP: dependence of the detectability/undetectability threshold on the network's size N
4.3 Dynamic BP: dependence of the detectability/undetectability threshold on the network's size N
4.4 Dynamic BP: dependence of the detectability/undetectability threshold on the time series' length T
4.5 Dynamic BP: detectability/undetectability threshold in the (e, η) phase space
5.1 Schematic representation of the four different approaches to community detection in dynamic networks in this thesis. a) is a schematic representation of the dynamic inference of the nodes' group assignments performed by running our own version (in Appendix) of the static algorithm presented in [8]: at each time step the inference is performed by running the static BP presented in Chapter 3 with no use of temporal information. b) corresponds to the optimal dynamic algorithm of [12], Chapter 4, which uses the information of the whole time series and outputs the network's communities at every time step only at the end of the process. c) and d) are the schematic representations of the two algorithms that we present in Chapter 5: the first uses the nodes' group labels inferred at t−1 to correct the inference at t; the second runs c) back and forth in a time window of length K
5.2 Schematic representation of how spatial messages and the mean external field carrying the temporal information are used to infer the marginals $\mu_i^{g_i}(t)$ at each time step in the on-line inference
5.3 On-line BP: comparison with the other algorithms
5.4 On-line BP: dependence of the detectability/undetectability threshold on the community persistence η
5.5 On-line BP: detectability/undetectability threshold in the (e, η) phase space
5.6 On-line algorithm: overlaps vs. T
5.7 On-line BP: dependence of the detectability/undetectability threshold on η; N = 1000, q = 2, c = 3
5.8 On-line BP: dependence of the detectability/undetectability threshold on η; N = 1000, q = 2, c = 16
5.9 Dynamic learning of the community persistence
5.10 K-online BP: comparison with on-line and static algorithms; η = 0.75
5.11 K-online BP: comparison with on-line and static algorithms; η = 0.95
5.12 K-online BP: comparison with on-line and static algorithms; η = 0.75

Chapter 1

Introduction

Many systems of interest consist of a large number of nodes, e.g. atoms, neurons, market firms, Boolean variables, and in many cases the only information that can be observed about the system is the set of connections, or edges, between pairs of nodes. The resulting structure is a network. Examples of networks are plentiful and include a variety of different systems in nature: most of these networks are dynamic, and the most effective way of representing their evolution is to chain consecutive snapshots of the network into a sequence of graphs (fig. 1.1). A common step in analyzing the structure of these complex systems is the detection of communities (CD for brevity), in which we seek the groups of nodes that play a similar structural role in the network. This is an ambiguous task, as the definition of community is itself ambiguous. This thesis work consists in the application of advanced statistical physics theories and techniques to the Bayesian inference and network theory problem of retrieving the modular structure of a dynamic network only from its connections.

A variety of radically different approaches to community detection, proposed by scientists from disparate fields (computer science, statistics, physics, mathematics), have been developed since Watts and Strogatz presented their theory of "small world networks" in their 1998 work [37]: that article is arguably the milestone of the study of random networks with clustered structure. One possible way to attack such an ambiguously posed question is to first design a generative model, i.e. a theoretical, probabilistic framework capable of producing random modular networks with characteristics as close as possible to real-world networks


with underlying communities. One can then design methods to retrieve the community structure as theorized by the model, test those methods on the synthetically generated surrogate data and, finally, use them to analyze real networks that are expected to hide a clustered topology.

In this work we adopt, as a generative model for synthetic networks, the Stochastic Block Model (SBM): a generative model which has proven effective in describing a variety of static real-world networks with clustered structure. On this basis, we introduce an original method for community detection in dynamic networks.

FIGURE 1.1: Graphic representation of a time series corresponding to the evolution in time of a network. The network represented is a graph of brain connections [5]. We see how, from one time-frame to the next, some links vanish and others are formed: the shape and size of the different communities change and the network topology evolves.

The process of designing a method to check whether and how a network exhibits a community structure compatible with the generative model rests on key concepts and principles of Bayesian inference. In this thesis we study the connection between Bayesian inference and statistical physics, since the problem of inferring the network's topology can be described and understood in depth thanks to well-known theories and techniques that physicists use to describe complex, disordered systems. The aim is to design a model and associated algorithm for on-line community detection in dynamic networks. With "on-line" we refer to the algorithm's characteristic of inferring the network's topology in real time, as the network evolves. In order to do so, I first studied the most relevant literature on the Stochastic Block Model and re-implemented methods that had been analyzed in depth in terms of:


• The different "phases" of the algorithms' accuracy in inferring the community structure of the network. In the next chapters we will see how the space of the parameters that define the generative model is divided into an area where the inference of the communities is possible (corresponding to the ordered phase of a first-order phase transition in statistical physics), which we refer to as the detectable area of parameter space, and an area where community detection is impossible (the disordered phase), which we refer to as the undetectable area of phase space.

• The "phase transitions" from one phase to the other, depending on the value of a parameter indicating the degree of order of the system. We will use the cavity method developed in the statistical physics of spin glasses [32] to evaluate the phase diagram.

Such an analysis has been performed for SBM-based community detection on static networks in [8], and then generalized to algorithms based on the Dynamic Stochastic Block Model (DSBM) [12]. What motivated our research is that existing methods, like that of [12], even if optimal in terms of the range of values of the generative parameters for which CD can be performed, were not meant to infer the network's communities during its evolution, but only a posteriori. This approach prevents the authors of [12], for example, from devising methods aimed at learning time-dependent parameters of the generative model, and the computational complexity of the associated algorithm is at least linear in the length of the time series. However, the algorithm presented in [12], and described in Chapter 4, achieves optimal accuracy in the inference of the network's communities in terms of the area of parameter space where community detection is possible.

The algorithm that we designed belongs to the class of sum-product message-passing (Belief Propagation) algorithms, and it aims at the computation of the marginal probability distribution of each unobservable node conditioned on the observed nodes (such a quantity is referred to as a "message"), and of the marginal probability distribution for each node to belong to a given community. The recursive definition of


these objects leads to a set of equations that the algorithm is designed to solve iteratively: one can metaphorically think of this process as a discussion forum in which people exchange their personal opinions, based on personal beliefs and background information, until a common consensus is found. The tight bond with statistical physics lies in the fact that these objects can be interpreted as Potts spins, and the whole inference problem can be mapped into the search for the lowest-energy state of the system. We tested our algorithm on synthetic networks generated by the SBM, whose nodes' community assignments were therefore known, and we use the overlap between the actual assignments and those retrieved by our method as a measure of the algorithm's accuracy. Studying the behavior of the overlap for varying values of the parameters plugged into the generative model, we find that it represents an order parameter of the system: its value indicates whether the algorithm manages to infer accurately for a specific set of values of the generative parameters (hence, a point of parameter space which belongs to the detectable area of the phase diagram) or whether its performance on that set of values is no better than chance (undetectable area). The iterative solution of the Belief Propagation equations is made computationally affordable by the definition of a mean field: for the computation of the messages incoming to and outgoing from each node, only the links with its first neighbors are taken into account, while all the information of the non-neighbor nodes is encoded in a mean external field, coherently with well-known approaches in statistical physics. This method was first introduced in [8] for the inference of nodes' community assignments in static networks. In this work we present our original idea of defining a temporal external mean field carrying the information from one temporal snapshot (a frame of the temporal series representing the network's evolution) to the next.
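To make the message-passing scheme just described concrete, here is a minimal, illustrative sketch in the spirit of the static BP of [8] (not the code of Appendix A): each directed edge carries a message, non-neighbors contribute only through a mean external field, and marginals are read off the incoming messages. The function name, the neighbor-list input format, and all default parameter values are assumptions of this sketch.

```python
import numpy as np

def bp_sbm(adj, n, c, max_iter=200, tol=1e-6, seed=0):
    """Minimal BP sketch for the sparse SBM (static case).

    adj : list of neighbor lists (undirected: j in adj[i] iff i in adj[j]);
    n   : expected group fractions n_a; c : symmetric matrix c_ab = N * p_ab.
    Returns mu[i, a], the estimated marginal of node i in group a.
    """
    rng = np.random.default_rng(seed)
    n, c = np.asarray(n, float), np.asarray(c, float)
    N, q = len(adj), len(n)
    # one "message" per directed edge i -> j, initialized at random
    psi = {}
    for i in range(N):
        for j in adj[i]:
            m = rng.random(q) + 0.1
            psi[(i, j)] = m / m.sum()
    mu = np.full((N, q), 1.0 / q)
    for _ in range(max_iter):
        # mean external field: the O(1/N) effect of all non-neighbors
        h = c @ mu.sum(axis=0) / N
        diff = 0.0
        for i in range(N):
            incoming = {j: c @ psi[(j, i)] for j in adj[i]}
            logs = np.log(n) - h + sum(np.log(v) for v in incoming.values())
            new_mu = np.exp(logs - logs.max())   # marginal of node i
            mu[i] = new_mu / new_mu.sum()
            # cavity messages: same product with one neighbor removed
            for j in adj[i]:
                lm = np.exp(logs - np.log(incoming[j]))
                lm /= lm.sum()
                diff += np.abs(lm - psi[(i, j)]).sum()
                psi[(i, j)] = lm
        if diff < tol * max(1, len(psi)):
            break
    return mu
```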

It is this mean-field approach that enables us to perform on-line inference of the group assignments, i.e., real-time community detection on dynamic networks. In chapters 5-6 we show and analyze in depth the results of the simulations and experiments run to test our algorithm, demonstrating that our proposed model and procedure represent an effective trade-off between the optimal performance of the algorithm presented by Ghasemian et al. in [12] and the one described by Decelle,


Krzakala and Zdeborova in [8]: our algorithm has a lower degree of computational complexity when compared to the optimal one (in terms of width of the area of parameter space where community detection is possible) and, even though the detectability/undetectability transition of our model prevents us from obtaining the same results as the procedure outlined in [12], we demonstrate its higher efficiency when compared to the static algorithm of [8]. We stress that our approach is more efficient than the static one especially for networks characterized by a low average degree, i.e., a low number of connections per node. This is due to the fact that, in our procedure, the information carried by the mean external temporal field becomes a fundamental correction to the Belief Propagation equations when the amount of spatial connections, and therefore of information, at each time step is low. Such a result highlights the usefulness of our method for real-world data, as most real networks, besides being evidently dynamic, are also characterized by a low average degree: a scenario where the topology of a temporal graph representing a real system needs to be retrieved in a fast and simple way is a task that suits the characteristics of our method.

The last part of the thesis work consists in the design of a further developed version of our algorithm for on-line inference, aimed at exploring whether higher performance, in terms of width of the detectable area of the phase diagram, is achievable. We end our scientific investigation by showing how this modified version of the algorithm achieves slightly improved performance compared to the plain on-line algorithm, confirming the validity of our choice.

This thesis is organized as follows:

• In Chapter 2 we introduce concepts of Network Science and Bayesian inference that are fundamental to our analysis; we also recall what a phase transition is and how it is described by a simple statistical physics model. Finally, we stress how our investigation crosses the boundaries of these scientific domains.

• In Chapter 3 we introduce the Stochastic Block Model and outline the thorough analysis of [8]. We describe the results obtained by re-implementing the algorithm for community detection on static networks.


• In Chapter 4 we give a description of the model and algorithm proposed in [12] for community detection in dynamic networks. We introduce the Dynamic Stochastic Block Model and present several empirical results of tests run to further analyze and understand the performance of the algorithm.

• In Chapter 5 we introduce our original theoretical contribution: the design of a model for on-line community detection on dynamic networks. We analyze and interpret the empirical results and performance of the algorithm for on-line inference. We propose a further developed version of the algorithm, the K-online algorithm, aimed at out-performing the plain on-line one, and describe the results of the simulations.

• In Chapter 6 we summarize the original contributions and results of this thesis, ranging from the re-implementation and analysis of the existing algorithms presented in [8] and [12] to the thorough description of the performance of our On-line and K-online algorithms, concluding that we have designed an effective trade-off between the cases studied in the literature.

• In Appendix A we present the Python code for our re-implementation of the static Belief Propagation algorithm for community detection of [8] and for our On-line and K-online algorithms. The code of the algorithm presented in [12], which we used for tests and comparisons, was kindly made available to my supervisors by the authors.


Chapter 2

Network Science and Statistical Physics

This thesis work consists in the application of advanced statistical physics theories and techniques to a Bayesian inference and network theory problem: retrieving the modular structure of a network only from its adjacency matrix. In this introductory chapter I first give a brief description of how the domains of statistical physics and statistical inference became so intertwined. I then outline how these two tightly bonded fields came into contact with the domain of network science. In the following sections I will introduce some vocabulary and basic notions of this field, with the aim of providing the reader with enough information to closely follow the thorough analysis performed in the next chapters.

2.1 Statistical Physics and Inference

The connection between statistical physics and statistical inference has a long history. The first to use a vocabulary typical of statistical mechanics to describe information theory problems was C. E. Shannon, who introduced, in 1951, the concept of "information entropy" in his work "Prediction and Entropy of Printed English" [10]. In this paper the author embraces the challenge of "estimating the entropy and redundancy of a language" by trying to predict which letter comes next when the preceding text is known. Shannon interprets entropy as an amount of information, being evidently inspired by mathematical expressions proper of thermodynamics. Building on Shannon's pioneering work, E. T. Jaynes further and more thoroughly described


the link between the two disciplines in his 1957 "Information Theory and Statistical Mechanics" [18]. It was after the development of theories aimed at the description of out-of-equilibrium and disordered systems, such as the theory of spin glasses, that the connection with information theory became even stronger ("Statistical Physics of Spin Glasses and Information Processing: An Introduction", H. Nishimori [32]).

In this section I first summarize the basic concepts of statistical physics that will be fundamental to understanding the analysis carried out in chapters 3 and 4. Then we introduce the reader to the definition of Bayesian inference and the benchmark notions of this domain. Finally, we explore the encounter of these two disciplines, describing the numerous bridges that link these two fields.

2.1.1 Statistical Physics: phase transitions

We introduce the very basics of the theory of phase transitions and, in order to do so, we start by describing a simple and exactly solvable model. In statistical physics we refer to the transformation of a thermodynamic system from one phase, or state, to another with the term phase transition. During a phase transition some of the system's properties change, in several cases discontinuously, as a consequence of the values of external conditions such as, for instance, temperature, pressure or magnetic field. The transition of a liquid to a gas when the former reaches its boiling temperature is the most intuitive everyday example of a phase transition.

A classic example of a phase transition that physicists have studied in depth over the years is the paramagnetic/ferromagnetic phase transition. A simplified theory of ferromagnetism was proposed by Pierre Curie and Pierre Weiss, and it is their model that we describe, as their techniques and reasoning will be of extreme importance in our work.

Let us now properly introduce the Curie-Weiss model of a ferromagnet. It is characterized by the Hamiltonian:

$$H = -\frac{J}{N}\sum_{(i,j),\, i\neq j} \sigma_i \sigma_j - \mu B \sum_{i=1}^{N} \sigma_i \tag{2.1}$$


where the N sites, labeled by i, are organized in a lattice. We assign an Ising spin $\sigma_i$ to each site, whose value is +1 if the microscopic magnetic spin is pointing up and −1 if it is pointing down. The first term of the Hamiltonian (2.1) represents the interaction between the spin pairs: J is a constant, while the normalization 1/N implies that H is an extensive quantity. The second term represents the energy gained by each spin, due to its magnetic moment µ, when exposed to a magnetic induction field B.

At zero temperature (0 K) the magnetic system is in its lowest-energy (and least disordered) state, with all the spins parallel or antiparallel to the external magnetic field. The magnetization M of the system is non-zero: a ferromagnetic behavior. As the temperature increases from zero, the directions of the spins are randomized by thermal agitation and the system becomes more and more disordered. The disorder increases with increasing temperature, while M decreases, until the temperature reaches a critical value $T_c$: above $T_c$ the system's magnetization vanishes and the system behaves as a paramagnet. We now define the partition function $Z_N(\beta, h)$:

$$Z_N(\beta, h) = \sum_{\{\sigma\}} e^{-\beta H(\{\sigma\}, h)} = \sum_{\{\sigma\}} \exp\left[\frac{K}{2N}\Big(\sum_{i=1}^{N}\sigma_i\Big)^2 - \frac{K}{2} + h\sum_{i=1}^{N}\sigma_i\right] = e^{-K/2}\sum_{\{\sigma\}} \exp\left[\frac{K}{2N}\Big(\sum_{i=1}^{N}\sigma_i\Big)^2 + h\sum_{i=1}^{N}\sigma_i\right]$$

where $\beta = 1/(k_B T)$ ($k_B$ is the Boltzmann constant), $K = \beta J$ and $h = \beta\mu B$. The definition of the partition function leads us to the Boltzmann probability distribution over the states, in terms of realizations of the spin orientations $\{\sigma_i\}$:

$$P_\beta(\{\sigma_i\}) = \frac{e^{-\beta H(\{\sigma_i\})}}{Z_N} \tag{2.2}$$

and of the Helmholtz free energy

$$F_N(\beta, h) = -\frac{1}{\beta}\ln Z_N(\beta, h)$$

We now use in the expression of $Z_N(\beta, h)$ the Hubbard-Stratonovich identity

$$e^{a^2} = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{+\infty} e^{-\frac{\xi^2}{2} + \sqrt{2}\,a\,\xi}\, d\xi$$

where in our case $a = \sqrt{K/2N}\,\sum_{i=1}^{N}\sigma_i$, so that $Z_N(\beta, h)$ can be written as:

$$Z_N = \frac{e^{-K/2}}{\sqrt{2\pi}} \sum_{\sigma_1=\pm 1}\sum_{\sigma_2=\pm 1}\cdots\sum_{\sigma_N=\pm 1} \int_{-\infty}^{+\infty} e^{-\frac{z^2}{2}}\, \exp\left[\sqrt{\frac{K}{N}}\,z \sum_i \sigma_i + h\sum_i \sigma_i\right] dz$$

The sums over the spin values can be brought into the integral and transformed into a product of hyperbolic cosines:

$$Z_N = \frac{e^{-K/2}}{\sqrt{2\pi}} \int_{-\infty}^{+\infty} e^{-\frac{z^2}{2}} \prod_{i=1}^{N} \left[2\cosh\left(\sqrt{\frac{K}{N}}\,z + h\right)\right] dz$$

With the change of variable $y = z/\sqrt{KN}$ we get:

$$Z_N = 2^N \left(\frac{KN}{2\pi}\right)^{1/2} e^{-K/2} \int_{-\infty}^{+\infty} e^{-\frac{KN}{2}y^2} \left[\cosh(Ky+h)\right]^N dy$$

The free energy per particle can thus be written as

$$f(\beta, h) = -\frac{1}{\beta}\,\frac{\ln Z_N}{N}$$

We are interested in analyzing systems of large size, therefore we compute these quantities in the thermodynamic limit:

$$-\beta f(\beta, h) = \lim_{N\to\infty}\frac{1}{N}\ln\left(\frac{KN}{2\pi}\right)^{1/2} + \ln 2 + \lim_{N\to\infty}\frac{1}{N}\ln\int_{-\infty}^{+\infty} e^{-\frac{KN}{2}y^2}\left[\cosh(Ky+h)\right]^N dy$$

Let us define $\varphi_{h,K}(y) = \cosh(Ky+h)$. We can rewrite the integrand in the last expression as $\exp\left[-\frac{KN}{2}y^2 + N\ln\varphi_{h,K}(y)\right]$. Exploiting Laplace's method we have:

$$-\beta f(\beta, h) = \ln 2 + \sup_{-\infty<y<+\infty}\left[-\frac{Ky^2}{2} + \ln\varphi_{h,K}(y)\right]$$

We need to find the points where the y-derivative of the function $f_{h,K}(y) = -\frac{Ky^2}{2} + \ln\varphi_{h,K}(y)$ vanishes:

$$\frac{\partial f_{h,K}}{\partial y} = -Ky + K\,\frac{\sinh(Ky+h)}{\cosh(Ky+h)} = -Ky + K\tanh(Ky+h)$$

Calling m the solution of the transcendental equation $y = \tanh(Ky+h)$, the free energy is:

$$-\beta f(\beta, h) = \ln 2 - \frac{Km^2}{2} + \ln\cosh(Km+h)$$

We now introduce the magnetization of the system, $M = \frac{1}{N}\sum_{i=1}^{N}\sigma_i$, whose average $\overline{M}_\beta = \sum_{\{\sigma_i\}} M(\{\sigma_i\})\, P_\beta(\{\sigma_i\})$ (where $P_\beta$ is the Boltzmann distribution (2.2)) can be computed by differentiating the free energy per particle:

$$\overline{M}_\beta = -\left(\frac{\partial f(\beta, h)}{\partial h}\right)_\beta = \tanh(Km+h)$$

We note how $m(\beta, h)$ coincides with the average magnetization $\overline{M}_\beta$:

$$m(\beta, h) \equiv -\left(\frac{\partial f(\beta, h)}{\partial h}\right)_\beta = \tanh\big(Km(\beta, h) + h\big)$$

The behavior of the average magnetization for varying values of the temperature can be seen in fig. 2.1.
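As a compact numerical illustration of the self-consistency equation $m = \tanh(Km + h)$ and of the transition at K = 1, here is a short fixed-point iteration; the parameter values below are illustrative choices, not taken from the text.

```python
import numpy as np

# Fixed-point iteration for the Curie-Weiss self-consistency equation
# m = tanh(K m + h); starting point and K values are illustrative.
def magnetization(K, h, m0=0.5, tol=1e-10, max_iter=100000):
    m = m0
    for _ in range(max_iter):
        m_new = np.tanh(K * m + h)
        if abs(m_new - m) < tol:
            break
        m = m_new
    return m

for K in (0.5, 0.9, 1.1, 1.5, 2.0):
    print(f"K = {K:.1f} -> m = {magnetization(K, 0.0):+.4f}")
# K < 1 (high temperature): m = 0, paramagnet;
# K > 1 (low temperature): |m| > 0, ferromagnet.
```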

We will see that the techniques used to describe disordered physical systems will help us in describing how an algorithm aimed at the discovery of communities in complex networks works: we will, in fact, analyze phase transitions in the models describing the algorithms. We will have to define a proper order parameter, in analogy with $\overline{M}_\beta$ in the Curie-Weiss model, whose behavior will indicate the transition from one phase to another. In order to do so we will use some techniques proper of the field of spin glasses [32]. Let us describe a basic spin glass problem by


FIGURE 2.1: Here y ≡ m and θ ≡ T/k_B, where k_B is the Boltzmann constant. In the diagram on the left, where B = 0, we can see that the magnetization falls to zero sharply at T = T_c; in the graph on the right we see that, even for an arbitrarily small value of h > 0, the magnetization ceases to be zero for temperature values above the critical threshold and the paramagnetic phase vanishes. This figure is taken from the review of the Curie-Weiss model in [21].

introducing a system described by the Ising Hamiltonian

$$H(\sigma) = \sum_{(i,j)\in E} J_{ij}\,\sigma_i\sigma_j + \sum_{i=1}^{N} h_i\,\sigma_i \tag{2.3}$$

In eq. (2.3), $\sigma = \{\sigma_i\}_{i=1}^N$, the spins are $\sigma_i = \pm 1$, $J_{ij}$ is the interaction between pairs of spins belonging to the ensemble E of all the pairs, and $h_i$ is the magnetic field. The main difference with the system in eq. (2.1) is that now the interactions $J_{ij}$ are independent random quantities drawn from a given distribution. We refer to the random coupling among the degrees of freedom as quenched disorder.

The order parameter used to characterize the phase transitions of these systems is the overlap between two configurations (or states):

$$q = \frac{1}{N}\sum_i \sigma_i^{\alpha}\,\sigma_i^{\beta}$$

where the index α or β indicates the configuration the spin belongs to. This parameter measures the similarity of different configurations: it changes from zero for a disordered state (the configurations are not correlated) to a nonzero value for more ordered states.

For the simple ferromagnetic system that we introduced in this section, q coincides with the square of the magnetization:

$$q = M^2 = \left(\frac{1}{N}\sum_{i=1}^{N}\sigma_i\right)^2$$

It is thanks to the definition of a quantity analogous to q that we will be able to study the phase transitions in community detection.
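A few lines suffice to illustrate the overlap as a similarity measure between configurations; the sampled states below are invented for the example.

```python
import numpy as np

# Overlap q between two spin configurations sigma^alpha, sigma^beta
# (arrays of +/-1): q = (1/N) * sum_i sigma^alpha_i * sigma^beta_i.
def spin_overlap(sa, sb):
    return np.mean(sa * sb)

rng = np.random.default_rng(1)
a = rng.choice([-1, 1], size=10000)
b = rng.choice([-1, 1], size=10000)
print(spin_overlap(a, a))   # identical configurations: q = 1
print(spin_overlap(a, b))   # uncorrelated configurations: q ~ 0
```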

2.1.2 Statistical Inference

"Inference is the act or process of deriving logical conclusions from premises known or assumed to be true. Statistical inference is the process of deducing properties of an underlying distribution by analysis of data." This is the definition of statistical inference given by Krzakala and Zdeborova in [40]. We might as well think of an in-ference problem as, in general, a method to extract knowledge from a large amount of data: problems of this sort are common to many different disciplines nowadays, ranging from machine learning to biology, from economics to neuroscience.

A basic way of introducing an inference problem is the following: let x be a set of variables $\{x_i\}_{i=1,..,N}$ and $y = \{y_j\}_{j=1,..,M}$ a set of measurements of the aforementioned variables, whose values are random variables characterized by a certain distribution; an example of an inference problem would then be trying to deduce (infer) the values of the variables x based on the measurements y. The main focus of scientific investigation in inference problems is whether, and under what conditions, the information contained in the measurements is enough to retrieve the values of the variables x, and whether there is any computationally efficient algorithm to perform such a task.

A fundamental concept in Bayesian inference is the conditional probability of event A given event B:

$$P(A|B) \equiv \frac{P(A,B)}{P(B)} \tag{2.4}$$

where P(A, B) is the joint probability of events A and B and $P(A) \equiv \sum_B P(A,B)$ is the marginal probability of A. From this definition follows Bayes' rule (Bayes' theorem), which is the milestone upon which this entire field is based:

$$P(x|y) = \frac{P(y|x)\,P(x)}{P(y)} \tag{2.5}$$

where:

• P(x|y) is the posterior probability distribution of x given all the information on the measurements y;

• P(y|x) is the likelihood;

• P(y) is the normalization, often written as Z(y) (suggesting the link with the partition function's role in statistical physics);

• P(x) is the prior distribution, encoding all the a priori information on the variables x: it represents the belief of what x should be before any data is collected.

The $\{x_i\}_{i=1,..,N}$ represent mutually exclusive hypotheses, of which only one is true, and the probability distribution P(x|y) represents the beliefs about these hypotheses. We stress that one of the trickiest challenges of Bayesian inference is the choice of the prior distribution P(x): this is the topic over which many schools of thought have argued over the years. In the next section we will present the teacher-student scenario, in which the knowledge of the underlying model and of the prior distribution is not in doubt: the premises of our work place us exactly in this scenario, making the next section even more important for the understanding of the following chapters.
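As a toy illustration of eq. (2.5) on a discrete hypothesis space, the sketch below updates a uniform prior with observed data; the coin-bias setting and all numbers are invented for the example.

```python
from scipy.stats import binom

# Toy application of Bayes' rule (2.5): x is an unknown coin bias in
# {0.3, 0.5, 0.7}, y the observed number of heads in 10 tosses.
prior = {0.3: 1 / 3, 0.5: 1 / 3, 0.7: 1 / 3}             # P(x)
y, trials = 7, 10                                          # observed data

likelihood = {x: binom.pmf(y, trials, x) for x in prior}   # P(y|x)
Z = sum(likelihood[x] * prior[x] for x in prior)           # P(y), normalization
posterior = {x: likelihood[x] * prior[x] / Z for x in prior}
print(posterior)   # posterior mass shifts toward x = 0.7
```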

2.1.3 Statistical Physics and Inference: the Teacher-Student scenario and the planted ensemble

In this section we describe the intertwining between Bayesian inference and statistical physics as presented in [40]. We start by introducing the so-called teacher-student scenario:

• In a first step, the teacher produces a realization of the variables (a set of parameters) x∗ (to be distinguished from x, as the latter represents the variable when not associated with a particular value) that we will refer to as the ground truth. x∗ is generated from the prior distribution $P_{tp}(x)$, where the subscript tp stands for "teacher's prior". In a second phase, the teacher exploits x∗ to generate a set of empirical data y from a likelihood distribution of y conditioned on x∗: $P_{tm}(y|x)$ (tm → "teacher's model"). Finally, the teacher provides the student with the data y and some information on the distributions $P_{tp}(x)$ and $P_{tm}(y|x)$.

• The student's objective is to infer, with the highest possible degree of accuracy, the original values x∗ of the variables.

The inference problems that we are mostly interested in, and that we will study in depth in the next chapters, are those where the teacher's prior can be written as a product:

$$P_{tp}(x) = \prod_{i=1}^{N} P_{tp}(x_i)$$

In the cases that we will focus on, the observed values $\{y_j\}$ are characterized by distributions conditioned on a subset of the values (parameters) $\{x_i\}_{i=1,..,N}$:

$$P_{tm}(y|x) = \prod_{j\in S} P_{tm}(y_j|\{x_i\}_{i=1,..,N})$$

Now, in order to recover the original values x∗, the student needs to compute the posterior distribution of the variables x, exploiting Bayes' theorem (2.5):

$$P(x|y) = \frac{1}{Z(y)}\prod_{i=1}^{N} P_{tp}(x_i) \prod_{j\in S} P_{tm}(y_j|\{x_i\}_{i=1,..,N}) \tag{2.6}$$

We introduce the marginal probability of a variable, which is deduced directly from the posterior distribution:

$$\mu_i(x_i) = \int P(x|y) \prod_{j\neq i} dx_j \tag{2.7}$$

The relationship between Bayesian inference and statistical physics becomes evident as we note that the posterior distribution (2.6) can be rewritten as

$$P(x|y) = \frac{1}{Z(y)}\exp\left[\beta\sum_{i=1}^{N}\log P_{tp}(x_i) + \beta\sum_{j=1}^{M}\log P_{tm}(y_j|\{x_i\}_{i\in S})\right] = \frac{1}{Z(y)}\,e^{-\beta \mathcal{H}(x,y)} \tag{2.8}$$

In this notation (where the auxiliary parameter β clearly equals unity) we can identify the posterior with a Boltzmann probability distribution (2.2) of the variables (x, y) that characterize the Hamiltonian $\mathcal{H}(x,y)$, where Z(y) does indeed represent the partition function and where $\log P_{tp}(x_i)$ and $\log P_{tm}(y_j|\{x_i\}_{i\in S})$ represent, respectively, the interaction of the spins with the local magnetic field and the interaction between the spins themselves (the two terms in expression (2.1)). This Boltzmann distribution can represent a rather general model in which the experimental data y play the role of a quenched disorder (2.3): in the teacher-student scenario this disorder is generated by the teacher from the Boltzmann distribution itself. The case where the teacher grants the student enough information on the original values x∗ of the variables represents the optimal inference problem, and scenarios of this kind are known in the literature as planted models. We will study in depth and exploit the properties of such models in the next chapter, when we introduce the cavity method, or Belief Propagation, that was used in [8] and [12] to design an algorithm capable of inferring the group assignments of the nodes in a network with modular structure.
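A minimal numerical sketch of the teacher-student scenario may help fix ideas: the teacher plants a ground truth from a known prior, hands the student noisy observations, and the student computes posterior marginals and the maximum posterior marginal estimate. The noise model and all parameter values here are assumptions of this toy example, not the thesis's setting.

```python
import numpy as np

# Teacher-student toy: the teacher plants spins x* in {-1,+1} with a uniform
# prior and gives the student noisy data y, where y_i = x*_i with probability
# 1 - eps. The posterior factorizes over sites, so the marginals are exact.
rng = np.random.default_rng(0)
N, eps = 10000, 0.2
x_star = rng.choice([-1, 1], size=N)           # ground truth from the prior
flip = rng.random(N) < eps
y = np.where(flip, -x_star, x_star)            # teacher's data

p_plus = np.where(y == 1, 1 - eps, eps)        # P(y_i | x_i = +1)
p_minus = 1 - p_plus                           # P(y_i | x_i = -1)
mu_plus = p_plus / (p_plus + p_minus)          # posterior marginals mu_i(+1)
x_hat = np.where(mu_plus > 0.5, 1, -1)         # maximum posterior marginal
print("accuracy:", np.mean(x_hat == x_star))   # ~ 1 - eps
```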

2.2 Network Theory

Any system characterized by some agents connected to each other by some sort of correlation can, in principle, be described as a network: a graph where the nodes represent the agents and the edges between them (links) represent the correlations. These web-like structures of varying complexity (depending on the number of nodes or the peculiarity of the interactions) can therefore be used to describe and analyze systems typical of the most diverse fields, ranging from representations of social or economical environments to biological and physical phenomena. Examples of real-world simple (fig. 2.2) and more complex (fig. 2.3) networks include ecological systems such as food webs, the network of science collaborations and citations, financial networks and, of course, the World Wide Web.

FIGURE 2.2: This weighted network represents the web of passes within a team during a football game. Each node represents a player and each edge the volume of passes exchanged between two players: the thicker the line connecting two nodes, the more passes between the two players. This picture is taken from an application of network science techniques to a sports analytics problem [6].

Albert-Laszlo Barabasi claims in his work Statistical Mechanics of Complex Networks that "physics has developed an arsenal of successful tools for predicting the behavior of a system as a whole from the properties of its constituents". As we have seen in the previous section, such a claim clearly refers to the analytical framework and techniques proper of statistical physics. While in the first half of the last century networks were investigated within the framework of graph theory, in the last 60 years the focus of scientific interest has shifted to large-scale systems with no easily retrievable topology. Complex networks started being described through the theory of random graphs: from the name we can already guess the relation between this field and that of statistics and statistical inference; the link with statistical physics is a natural consequence.


FIGURE 2.3: This visualization of a network of Facebook users is a clear example of what large-scale complex networks look like [7].

2.2.1 Metrics and definitions

The mathematical representation of a graph is a pair of sets $G(\{x_i\}_{i=1,..,N}, E)$, where the $\{x_i\}_{i=1,..,N}$ are the N nodes and E represents the set of edges between the nodes. As we see in fig. 2.2, graphs can be visualized as a set of dots, corresponding to the nodes; two dots are then connected by a line if the two nodes share an edge. A powerful mathematical tool to synthesize the information carried by a graph is its adjacency matrix $A_{ij}$, whose indexes i, j run over the set of N nodes and whose entries are $A_{ij} = 1$ if the two nodes are connected by a link and $A_{ij} = 0$ otherwise. A graph is directed if its edges, representing the correlations, have a preferential direction from one node to the other, and undirected otherwise: the corresponding adjacency matrix is symmetric in the undirected case and in general asymmetric in the directed one (fig. 2.4). The edges of a network can also have different weights: weighted graphs are fundamental in the description of systems characterized by correlations whose strength or intensity may vary from one pair of nodes to another (fig. 2.2).

Another key property of a graph is its average degree: the degree $k_i$ of node i is the number of links it shares with other nodes.

FIGURE 2.4: Several simple examples of networks and their adjacency matrices. Image taken from Laszlo Barabasi's Network Science textbook [4].

The average degree of the network is the average of the degrees of its nodes:

$$\langle k\rangle = \frac{1}{N}\sum_{i=1}^{N} k_i$$

With $L = \frac{1}{2}\sum_{i=1}^{N} k_i$ the total number of links, the average degree of an undirected network can also be written as $\langle k\rangle = \frac{2L}{N}$.

Another property that allows us to characterize a network is its degree distribution $p_k$: the probability that a randomly chosen node in the graph has degree k. The degree distribution is normalized,

$$\sum_{k} p_k = 1$$

and, for a network of N nodes, it is estimated by

$$p_k = \frac{N_k}{N}$$

where $N_k$ is the number of nodes with degree k. The degree distribution is a powerful tool since, once we know $p_k$, we can compute

$$\langle k\rangle = \sum_{k} k\, p_k$$

The degree of a node can also be read directly from the adjacency matrix of the network it belongs to. For undirected networks it is a sum over either the rows or the columns of the adjacency matrix:

$$k_i = \sum_{j=1}^{N} A_{ij} = \sum_{j=1}^{N} A_{ji}$$
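The correspondence between adjacency matrix and degrees can be checked directly in a few lines; the small matrix below is an arbitrary 4-node example.

```python
import numpy as np

# Degrees read off the adjacency matrix: for an undirected graph A is
# symmetric and k_i is the sum of row (or column) i.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])
k = A.sum(axis=1)                        # k_i = sum_j A_ij
print(k, k.mean())                       # degrees and <k> = 2L / N
assert np.array_equal(k, A.sum(axis=0))  # symmetry: rows = columns
```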

For directed networks the sums over the rows and over the columns of $A_{ij}$ give, respectively, the incoming and outgoing degrees:

$$k_i^{in} = \sum_{j=1}^{N} A_{ij}\,,\qquad k_i^{out} = \sum_{j=1}^{N} A_{ji}\,,\qquad k_i = k_i^{in} + k_i^{out}$$

2.2.2 Random graphs

The ultimate task of network science is to represent real-world systems. Most of the networks that we encounter cannot be described as a simple spiderweb or as a regular lattice. For this reason the theory of random graphs was developed, in which edges are distributed randomly among the pairs of nodes.

There are two equivalent definitions of a random network:

• G(N, L): Paul Erdős and Alfréd Rényi defined a random network as N nodes connected by L edges chosen randomly from the N(N−1)/2 possible edges. There are therefore $\binom{N(N-1)/2}{L}$ possible and equiprobable realizations of the graph.

• G(N, p): in a model introduced by Gilbert, each pair of the N nodes is connected by a link with probability p.

The probability that a random network of N nodes with edge probability p has exactly L links is

$$p_L = \binom{\frac{N(N-1)}{2}}{L}\, p^L\, (1-p)^{\frac{N(N-1)}{2} - L}$$

which is a binomial distribution. Hence, the average number of links is

$$\langle L\rangle = \sum_{L=0}^{\frac{N(N-1)}{2}} L\, p_L = p\,\frac{N(N-1)}{2}$$


Thus, the average degree of the network is

$$\langle k\rangle = \frac{2\langle L\rangle}{N} = p\,(N-1)$$

We now wish to find the degree distribution of a random network. The probability that node i has exactly k links is the product of the probability $p^k$ that k of its links are present, the probability $(1-p)^{N-1-k}$ that the remaining $N-1-k$ links are missing, and the number of ways of choosing k links out of the $N-1$ possible ones:

$$p_k = \binom{N-1}{k}\, p^k\, (1-p)^{N-1-k}$$

This is a binomial distribution. A peculiarity of real-world networks is that they are very sparse, meaning that the number of links L is much smaller than its maximum $L_{max} = \frac{N(N-1)}{2}$ and that $\langle k\rangle \ll N$. In this regime the binomial coefficient can be approximated as

$$\binom{N-1}{k} \approx \frac{(N-1)^k}{k!}$$

while $(1-p)^{N-1-k}$ can be simplified by taking the logarithm,

$$\ln\left[(1-p)^{N-1-k}\right] = (N-1-k)\,\ln\left(1 - \frac{\langle k\rangle}{N-1}\right)$$

and using the series expansion of the logarithm:

$$\ln\left[(1-p)^{N-1-k}\right] \approx -(N-1-k)\,\frac{\langle k\rangle}{N-1} = -\langle k\rangle\left(1 - \frac{k}{N-1}\right) \approx -\langle k\rangle$$

This is the small-degree approximation, under which the last term in the degree distribution becomes

$$(1-p)^{N-1-k} \approx e^{-\langle k\rangle}$$

and the degree distribution is therefore a Poisson:

$$p_k = \binom{N-1}{k}\, p^k\, e^{-\langle k\rangle} = \frac{(N-1)^k}{k!}\left(\frac{\langle k\rangle}{N-1}\right)^k e^{-\langle k\rangle}$$

and, finally,

$$p_k = e^{-\langle k\rangle}\,\frac{\langle k\rangle^k}{k!}$$
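The Poisson limit is easy to verify numerically: sample a G(N, p) graph and compare the empirical degree histogram with $e^{-\langle k\rangle}\langle k\rangle^k/k!$. The network size and target degree below are arbitrary choices for the demonstration.

```python
import numpy as np
from scipy.stats import poisson

# Sample a G(N, p) random graph and compare its empirical degree
# distribution with the Poisson limit derived above.
rng = np.random.default_rng(0)
N, mean_deg = 2000, 5.0
p = mean_deg / (N - 1)

trials = rng.random((N, N)) < p        # independent edge trials
A = np.triu(trials, 1)                 # keep each pair i < j once
A = (A | A.T).astype(int)              # symmetrize: undirected, no loops

k = A.sum(axis=1)
print("empirical <k> =", k.mean())
for deg in range(10):
    emp = np.mean(k == deg)            # N_k / N
    print(deg, round(emp, 4), round(poisson.pmf(deg, mean_deg), 4))
```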

FIGURE 2.5: A visualization of a network organized in communities. In particular, this picture represents the network of Facebook friendships of a user [28].

2.2.3 Community Detection

Rather than retrieving the degree distribution together with the other properties that characterize a graph, our work focuses on a different network theory problem, one that deals with a peculiar topology. We focus on graphs that have modular structure; in other words, the object of our study will be networks in which one can find communities, or groups of highly connected nodes (fig. 3.1): such a property of networks, which does indeed increase their complexity, is called modularity, and the task of retrieving the clustered structure of a network is known as community detection or discovery (CD).

The clustered structure of real graphs was originally investigated by defining the so-called clustering coefficient $C_i$ of the nodes of a network. The clustering coefficient of node i is an estimate of the density of links between i's $k_i$ neighbors. The number of possible links among them is $\frac{k_i(k_i-1)}{2}$, and the expected number of links $\langle L_i\rangle$ between i's neighbors, p being the probability that two nodes share a link in a random network, is

$$\langle L_i\rangle = p\,\frac{k_i(k_i-1)}{2}$$

The local clustering coefficient $C_i$ of a node is therefore defined as

$$C_i = \frac{2\langle L_i\rangle}{k_i(k_i-1)}$$

Two fundamental properties of node correlations in real networks, their high average clustering coefficients and the logarithmic dependence of their average path length on the number of nodes N, led Duncan Watts and Steven Strogatz to introduce, in 1998, a new model able to describe them [37]. The study of modularity in random networks has seen an exponential development ever since. Let us note here some of the basic concepts that we will need to carry out our analysis.

We first need to give a clear definition of community and, in order to do so, we will highlight two fundamental assumptions:

• A network’s community structure is uniquely encoded in its diagram: this means that the existence of a community or cluster (when present) is encoded in the graph’s adjacency matrix;

• A community is a locally dense connected subgraph in a network: the number of edges that the members of a community share is higher than the number of edges that connect them with nodes that do not belong to the community.

We shall then continue by distinguishing strong communities from weak ones. A subgraph C is a strong community if

$$k_i^{in}(C) > k_i^{out}(C) \qquad \forall\, i \in C$$

where $k_i^{in}(C)$ and $k_i^{out}(C)$ represent the number of links that node i shares with nodes inside and outside the subgraph C, respectively. A weak community is instead defined by the relation

$$\sum_{i\in C} k_i^{in}(C) > \sum_{i\in C} k_i^{out}(C)$$

Let us now consider a graph with N nodes, L links and a partition into $n_c$ communities. With $N_c$ we denote the number of nodes belonging to community c and with $L_c$ the number of edges connecting them ($c = 1, ..., n_c$). As a measure of the "connectedness" of each community (denoted as subgraph $C_c$) we use the difference between the network's actual number of links (retrieved directly from its adjacency matrix) and the expected number of links, $p_{ij}$ for each pair of nodes i and j, if the network were generated randomly:

$$M_c = \frac{1}{2L}\sum_{i,j\in C_c} \left(A_{ij} - p_{ij}\right)$$

Using then the fact that $p_{ij} = \frac{k_i k_j}{2L}$ and the assumption that all the nodes in a community have the same degree $k_c$, we derive the following definition of modularity:

$$M \equiv \sum_{c=1}^{n_c} M_c = \sum_{c=1}^{n_c}\left[\frac{L_c}{L} - \left(\frac{k_c}{2L}\right)^2\right]$$

c=1  Lc L − kc 2L 2

A characteristic of modularity is that the higher its value, the better and clearer the graph's partition. When M reaches zero, the whole network can be interpreted as a single macro-community, whereas negative values of modularity imply a network in which each node represents a different community.
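A short sketch of the modularity formula just derived; the toy graph (two triangles joined by one edge) and the function name are invented for the example.

```python
import numpy as np

# Modularity of a partition: M = sum_c [ L_c / L - (k_c / 2L)^2 ],
# with L_c the number of edges inside community c and k_c the total
# degree of its nodes.
def modularity(A, labels):
    k = A.sum(axis=1)
    L = k.sum() / 2.0
    M = 0.0
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        L_c = A[np.ix_(idx, idx)].sum() / 2.0   # internal edges
        k_c = k[idx].sum()                      # total community degree
        M += L_c / L - (k_c / (2.0 * L)) ** 2
    return M

# two triangles joined by one edge: a clearly modular toy graph
A = np.zeros((6, 6), int)
for i, j in [(0,1), (0,2), (1,2), (3,4), (3,5), (4,5), (2,3)]:
    A[i, j] = A[j, i] = 1
print(modularity(A, np.array([0, 0, 0, 1, 1, 1])))  # ~0.357, a good split
```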

Detecting communities is a typical problem in complex network analysis, and one of the reasons behind its complexity is the ambiguous definition of community. There are numerous different approaches to the problem of community discovery in complex, static and dynamic networks, as can be seen in the various surveys and reviews on this topic [34], [28]. In our work, following the analysis carried out in [8] and [12], the strategy used to infer the network's structure is:

• We first introduce the main features of the Stochastic Block Model, a generative model able to reproduce graphs characterized by the emergence of communities, in accordance with some real-world examples.

• We map the community detection problem into a Bayesian inference problem, whose solution will be sought in a teacher-student scenario where all the information available to the student is encoded in the generative model.

• Finally, we will perform a last translation from the Bayesian inference vocabulary to a statistical physics framework: we will exploit methods that will enable us to write a set of equations whose solution will ultimately represent the best possible estimate of the network's clustered structure.


Chapter 3

Community detection with Stochastic Block Model

In this chapter I introduce the generative model which lies at the very heart of all the methods for community detection described in this thesis. I then describe the theoretical analysis, presented in [8], of the phase transitions emerging in the inference of the community structure of static networks. I introduce the logic of the algorithm proposed by the authors of [8] and present the results obtained by re-implementing my own version of their Belief Propagation approach to community detection in networks generated with the Stochastic Block Model.

3.1 The Stochastic Block Model

The Stochastic Block Model (SBM) is one of the possible generative models used both to describe real-world modular networks and to produce synthetic graphs with community topology. The parameters that characterize the SBM are the number of groups q; the expected fraction of nodes in each group a, $\{n_a\}$, for $1 \le a \le q$; and the $q\times q$ affinity matrix $p_{ab}$ (the probability of an edge between groups a and b). The graph G is generated by creating an adjacency matrix A. Each node i has a label $t_i \in \{1, ..., q\}$ indicating the group it belongs to. For each node i the probability of belonging to a given group $t_i = a$ is $n_a$. The role of the affinity matrix becomes fundamental once all the group labels are assigned to the nodes, since we place an edge between nodes i and j ($A_{ij} = 1$) with probability $p_{t_i,t_j}$, and set $A_{ij} = 0$ otherwise. Self-loops are forbidden, so $A_{ii} = 0$. Let $N_a$ be the number of nodes in group a: $N_a$ is a binomially distributed quantity and, in the thermodynamic limit $N\to\infty$, it concentrates around its mean value $n_a N$. With this definition the parameters $n_a$ play the role of chemical potentials, giving us a Hamiltonian which is easier to analyze than one with the hard constraints that might be pursued, for example, by fixing the number of nodes in each group.

We are interested in sparse graphs, where $p_{ab} = O(1/N)$, therefore we will work with the rescaled affinity matrix $c_{ab} = N p_{ab}$. In the large-N limit the average degree of a directed network is

$$c = \sum_{a,b} c_{ab}\, n_a n_b \tag{3.1}$$

while for an undirected network, for which $A_{ij}$, $p_{ab}$ and $c_{ab}$ are symmetric, the average degree is

$$c = \sum_{a<b} c_{ab}\, n_a n_b + \sum_{a} c_{aa}\, n_a^2 \tag{3.2}$$

The power of the Stochastic Block Model lies in the fact that it allows us to reproduce many of the different community organizations that can characterize the topology of real-world networks. Depending on the values of its parameters, the SBM can generate the following structures:

FIGURE 3.1: Visualization of social interactions among people of different ethnic groups in two different high schools [13]: it is evident how the network is characterized by the emergence of different communities, whose nodes exchange far more edges among each other than with nodes from other communities.


FIGURE 3.2: Bipartite network of HBO actors and series [1].

• Modular structure (fig. 3.1): organized in "assortative" communities, identified as highly connected subgraphs whose nodes share more links within the community than with nodes belonging to other communities ($p_{in} > p_{out}$); examples of modular graphs are representations of social, economical and molecular networks.

• Bipartite structure (fig. 3.2): characterized by the presence of "disassortative" communities, whose nodes share more links with nodes belonging to the other groups ($p_{in} < p_{out}$); a basic example is the buyers-sellers network: buyers only buy from sellers and sellers only sell to buyers, therefore the two groups are identified by a dense volume of edges between them but a small number of links connecting members of the same community.

• Core/periphery structure (fig. 3.3): in the case of a two-group network, a core/periphery structure is present when the graph has the following characteristics: the nodes in the core are densely connected to each other and to nodes belonging to the periphery; the nodes in the periphery share more links with nodes in the core than with nodes within their own community ($p_{core} > p_{core/periphery} > p_{periphery}$); the most intuitive example is the network of human mobility.

FIGURE 3.3: A clear example of a core/periphery network [35].

• Hierarchical structure (fig. 3.4): the hierarchical organization of communities within a network can be reproduced with the SBM by designing its affinity matrix, after having defined the probabilities s, h and q of a backward link, of a forward link to the nearest upper class and of a forward link to a distant class, respectively, in the following way:

$$p = \begin{pmatrix} s & h & q & \cdots & q \\ s & s & h & \ddots & \vdots \\ \vdots & & \ddots & \ddots & q \\ s & \cdots & & s & h \\ s & s & \cdots & s & s \end{pmatrix}$$

In this model the expected number of forward edges must exceed that of backward edges, therefore h > q > s; in finance, hierarchical networks are used to describe payment flows, whereas in biology they can be used to represent food chains and metabolic webs.

FIGURE 3.4: A payment-flow network, an example of a hierarchical graph.


We now wish to introduce several cases of the stochastic block model studied in the literature, which are also mentioned in [8].

1. The Newman and Girvan "four groups" test is a common benchmark in community detection. The test network is divided into four equally sized groups: $n_a = 1/4$, the average degree is c = 16, and

$$c_{ab} = c_{in}\,,\quad a = b \tag{3.3}$$
$$c_{ab} = c_{out}\,,\quad a \neq b \tag{3.4}$$

with $c_{in} > c_{out}$. By varying the values of $c_{in}$ and $c_{out}$ it is possible to generate more or less complex modular structures of the network (a generation sketch for this benchmark follows this list). In the literature it is expected that, as $N \to \infty$, it is possible to achieve a configuration correlated with the initial assignment of nodes to the four groups as soon as the average in-degree (edges between nodes of the same group) is larger than c/q, and that failure to do so is due to imperfections of the algorithm. However, we will see that a phase transition in the inference process, depending on the values of the generative parameters, is present.

2. Planted graph partitioning is a generalization of the Newman-Girvan model and is well known in the mathematics and computer science literature. In this model $n_a = 1/q$, $p_{ab} = p_{in}$ if a = b and $p_{ab} = p_{out}$ if a ≠ b. Once again the assumption $p_{in} > p_{out}$ is made, but the sparse-network condition is no longer assumed. A classical result [9] shows how, in this case, for $p_{in} - p_{out} > O(\log N/N)$ the planted partition is, with high probability, equal to the best possible partition in terms of the minimum number of edges between groups. In our work, as well as in [8] and [12], we are interested in the conditions under which a polynomial-time algorithm can find a partition correlated with the planted partition. The results of [8] show that this is possible when $p_{in} - p_{out} > \sqrt{\left(q\,p_{in} + q(q-1)\,p_{out}\right)/N}$ and, also, how this task can be impossible, or exponentially hard, depending on the values of these parameters.

3. Planted coloring is another model, in which $n_a = 1/q$ and $p_{ab} = p_{out}$ for a ≠ b, while $p_{aa} = 0$: adjacent nodes never belong to the same group. The average degree of this sort of graph is then $c = (q-1)\,p_{out}\,N/q$. The analysis in [22] shows that the condition to be satisfied in order to find, in polynomial time, a proper coloring correlated strongly with the planted one is $c > O(q^2)$. The cavity method was applied to this model as well in [23], whose results showed that planted-coloring-correlated partitions can be found if and only if $c > c_{critical}$: for large q the threshold is $c_{critical} = O(2q\log q)$. Such colorings can, however, be found in polynomial time only for $c > (q-1)^2$, and there is reason to believe that $c = (q-1)^2$ represents "a threshold for a large class of algorithms" [8].
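As announced in the first item, here is a minimal sketch of how synthetic SBM graphs such as the Newman-Girvan "four groups" benchmark can be generated; the function name and the particular values of $c_{in}$ and $c_{out}$ are illustrative assumptions, and the printed average degree can be checked against eqs. (3.1)-(3.2).

```python
import numpy as np

def sbm(N, n, c, seed=0):
    """Sample an undirected SBM graph: group fractions n_a, rescaled
    affinities c_ab = N * p_ab. Returns labels t and adjacency A."""
    rng = np.random.default_rng(seed)
    q = len(n)
    t = rng.choice(q, size=N, p=n)          # group labels from {n_a}
    P = np.asarray(c, float) / N            # edge probabilities p_ab
    U = rng.random((N, N))
    A = np.triu(U < P[t][:, t], 1)          # one trial per pair i < j
    A = (A | A.T).astype(int)               # symmetric, no self-loops
    return t, A

# Newman-Girvan "four groups" benchmark: q = 4, n_a = 1/4, <k> = 16
q, cin, cout = 4, 40.0, 8.0                 # illustrative, cin > cout
c = np.full((q, q), cout) + np.eye(q) * (cin - cout)
n = np.full(q, 1.0 / q)
t, A = sbm(2000, n, c, seed=1)
print("average degree:", A.sum(axis=1).mean())  # ~ 16 here
```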

Here, as well as in [8] and [12], we are interested in the large-N behavior of the generative model, and we focus our analysis on two different aspects of the algorithms. From now on we will refer to the parameter learning process as the process in which we aim at finding the most likely values of the generative parameters q, $n_a$ and $p_{ab}$ when the only given information is the adjacency matrix of the graph G; on the other hand, we call assignment inference the process of finding the best possible group assignment for each node of the graph G knowing the parameters and, also, of determining for which values of the parameters the most likely assignment is better than a random guess.

3.2 Community detection on static networks

In this section I describe the work done in [8] to model and analyze the performance of BP algorithms for community detection on static modular networks. The presence of a detectability/undetectability phase transition is evident in the phase diagram of the overlap between the inferred and original group assignments. The overlap is a measure of the agreement of the inferred labels $\{q_i\}$ with the original ones $\{t_i\}$, defined, in its normalized form, as:

$$Q(\{t_i\},\{q_i\}) = \max_{\pi}\ \frac{\frac{1}{N}\sum_i \delta_{t_i,\,\pi(q_i)} - \max_a n_a}{1 - \max_a n_a} \tag{3.5}$$

where π ranges over the permutations of the q possible group labels.

FIGURE 3.5: Here we can see the detectability/undetectability phase transition: the overlap continuously varies from positive values (detectability) to zero (undetectability), a sign that the BP equations have reached the factorized fixed point (3.42). The two graphs show the overlap's behavior for q = 2, c = 3 (top) and c = 16 (bottom) as the affinity matrix, characterized by the parameter $e = c_{out}/c_{in}$, ranges from that of a modular network with highly disconnected communities (e = 0) to a densely connected graph where distinguishing different groups is impossible (e = 1). The continuous red line marks the value of e at which the threshold (3.49) is located for the corresponding values of the generative parameters.

The overlap Q is thus defined so that Q = 1 if $t_i = q_i$ for all nodes (which means we found the exact original labeling). If Q = 0, our guess was no better than chance: in that case the only pertinent information is the group sizes, and we assign each node to the largest group to maximize the probability of a correct assignment. In fig. 3.5 we see how, depending on the value of the parameter $e = c_{out}/c_{in}$, the overlap undergoes a true phase transition; it therefore represents the order parameter of the problem, whose value changes from positive (detectable phase, in which the inferred assignments are correlated with the original ones) to zero (undetectable phase, in which the algorithm's performance is no better than chance).
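A small sketch of the normalized overlap of eq. (3.5); the brute-force maximization over permutations is fine for small q, and the test labels below are invented (here $n_a$ is estimated by the empirical group fractions of t).

```python
import numpy as np
from itertools import permutations

# Normalized overlap (eq. 3.5): agreement between true labels t and
# inferred labels g, maximized over permutations of the q group names.
def overlap(t, g, q):
    n_max = max(np.mean(t == a) for a in range(q))
    best = max(np.mean(np.asarray(pi)[g] == t)
               for pi in permutations(range(q)))
    return (best - n_max) / (1.0 - n_max)

t = np.array([0, 0, 0, 1, 1, 1])
print(overlap(t, np.array([1, 1, 1, 0, 0, 0]), 2))  # relabeled exactly: Q = 1
print(overlap(t, np.array([0, 1, 0, 0, 1, 0]), 2))  # uninformative: Q = 0
```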


3.2.1 Group assignment

As described in [8], the probability that the stochastic block model generates a graph $G$, with corresponding adjacency matrix $A$ and given group assignments $q_i$, conditioned on the parameters $\theta = (q, \{n_a\}, \{p_{ab}\})$, is

$$P(G,\{q_i\}\,|\,\theta) = \prod_{i \neq j} p_{q_i,q_j}^{A_{ij}}\,(1 - p_{q_i,q_j})^{1 - A_{ij}} \prod_i n_{q_i} \qquad (3.6)$$

where, in case the graph is undirected, the product is over pairs $i < j$. The above probability is normalized, meaning $\sum_{G,\{q_i\}} P(G,\{q_i\}\,|\,\theta) = 1$. The community assignments inference process consists in knowing only the graph $G$ and the parameters $\theta$ and extrapolating information on which node belongs to which community; in order to do so we exploit Bayes' theorem:

$$P(\{q_i\}\,|\,G,\theta) = \frac{P(G,\{q_i\}\,|\,\theta)}{\sum_{\{t_i\}} P(G,\{t_i\}\,|\,\theta)} \qquad (3.7)$$

In statistical physics this expression corresponds to the Boltzmann distribution of a Potts model with Hamiltonian:

$$H(\{q_i\}\,|\,G,\theta) = -\sum_i \log n_{q_i} - \sum_{i \neq j}\left[A_{ij}\log c_{q_i,q_j} + (1 - A_{ij})\log\left(1 - \frac{c_{q_i,q_j}}{N}\right)\right] \qquad (3.8)$$

where the sum is over the pairs $i < j$ in the undirected case. The group labels $q_i$ correspond to Potts spins taking one of the $q$ possible values, whereas the logarithms of the group sizes $n_{q_i}$ represent local magnetic fields. In the sparse case, the one we are interested in, we have $c_{ab} = O(1)$ and therefore there are strong interactions ($O(1)$) between connected nodes ($A_{ij} = 1$) and weak interactions ($O(1/N)$) between non-neighbor nodes ($A_{ij} = 0$). We define $H(\{q_i\}\,|\,G,\theta) = -\log P(G,\{q_i\}\,|\,\theta) - M\log N$ in order to ensure the extensive (proportional to $N$) nature of the energy in the thermodynamic limit. We then note that the unit temperature Boltzmann distribution is:

$$P(\{q_i\}\,|\,G,\theta) = \frac{e^{-H(\{q_i\}|G,\theta)}}{\sum_{\{t_i\}} e^{-H(\{t_i\}|G,\theta)}} \qquad (3.9)$$


where the denominator represents the partition function:

$$Z(G,\theta) = \sum_{\{t_i\}} e^{-H(\{t_i\}|G,\theta)} \qquad (3.10)$$
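As a concrete reading of (3.6)-(3.10), the following sketch (illustrative only, reusing the hypothetical `Theta` container introduced above) computes the energy of a given assignment by direct transcription of (3.8); the Boltzmann weight of the assignment is then proportional to $e^{-H}$, while the intractable part is the normalization (3.10), which sums over all $q^N$ assignments:

```python
import numpy as np

def hamiltonian(A, labels, theta):
    """Energy H({q_i} | G, theta) of eq. (3.8), undirected case.
    Dense O(N^2) version for clarity; assumes all c_ab > 0."""
    N = A.shape[0]
    H = -np.sum(np.log(theta.n[labels]))        # local fields: -sum_i log n_{q_i}
    c = theta.c[np.ix_(labels, labels)]         # c_{q_i, q_j} for every pair
    iu = np.triu_indices(N, k=1)                # pairs with i < j
    H -= np.sum(A[iu] * np.log(c[iu])
                + (1 - A[iu]) * np.log(1.0 - c[iu] / N))
    return H
```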

In the case where the parameters $q$, $\{n_a\}$, $c_{ab}$ are known, the Boltzmann distribution (3.9) asymptotically achieves uniformity over all configurations with the right number of nodes per community and the right number of edges between pairs of communities: the original, correct assignment is just one of these configurations, and it represents an equilibrium configuration for the distribution. It is now crucial to define the so-called "marginals" of the Boltzmann distribution, the marginal probabilities $\mu_i(q_i)$ of node $i$ to belong to group $q_i$:

$$\mu_i(q_i) = \sum_{\{q_j\}_{j \neq i}} P(\{q_j\}_{j \neq i}, q_i\,|\,G,\theta) \qquad (3.11)$$

The estimate of the assignment $q_i^*$ associates each node to its most likely group, $q_i^* = \mathrm{argmax}_{q_i}\,\mu_i(q_i)$. The authors of [8] refer to this method as marginalization whereas, in Bayesian inference, $q_i^*$ is called the maximum posterior marginal: it is known (see [16]) that this is the best possible estimator of the original assignment if we aim at maximizing the number of nodes for which $t_i = q_i^*$.
The Boltzmann distribution is symmetric with respect to permutations of the community labels, therefore the marginals over the whole distribution are uniform. Nevertheless, if the permutation symmetry of the community labels is broken as $N \to \infty$, meaning that each permutation corresponds to a different Gibbs state, the authors of [8] claim that marginalization within one of these Gibbs states achieves the optimal value of the overlap (2.1), where the maximization runs over all the permutations. Krzakala et al. [8] also point out that one advantage of knowing the marginals of the distribution is that, in the thermodynamic limit, the overlap $Q(\{t_i\},\{q_i^*\})$ can be evaluated even without knowing the exact original assignments, i.e.:

$$Q_{\text{margin}} = \lim_{N\to\infty} \frac{\frac{1}{N}\sum_i \mu_i(q_i^*) - \max_a n_a}{1 - \max_a n_a} \qquad (3.12)$$


$Q_{\text{margin}}$ is a measure of the information about the original assignment that can be retrieved from the network's topology when the parameters $\theta$ are known.
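Given (estimates of) the marginals, both the maximum posterior marginal estimator and $Q_{\text{margin}}$ of (3.12) are direct to compute. In the sketch below (illustrative; the names are mine), `mu` is an $N \times q$ array whose $i$-th row is the marginal $\mu_i$ and `n` holds the group sizes $n_a$:

```python
import numpy as np

def marginalization_estimate(mu):
    """Maximum posterior marginal estimator: q*_i = argmax_a mu_i(a)."""
    return np.argmax(mu, axis=1)

def q_margin(mu, n):
    """Overlap estimate (3.12) from the marginals alone, no original labels needed."""
    q_star = marginalization_estimate(mu)
    agree = np.mean(mu[np.arange(len(mu)), q_star])   # (1/N) sum_i mu_i(q*_i)
    return (agree - n.max()) / (1.0 - n.max())
```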

3.2.2 Parameter learning

The parameter learning process consists in finding the most likely values of the generative parameters of the network, the graph itself (its adjacency matrix) being the only accessible information. The intent is to deduce the parameters $\theta = (q, \{n_a\}, \{c_{ab}\})$ from the adjacency matrix $A_{ij}$ of the graph $G$. Applying Bayes' theorem once more, we find that the probability $P(\theta\,|\,G)$ that the generative parameters take certain values, conditioned on the graph $G$, is proportional to the probability $P(G\,|\,\theta)$ that a generative model with parameters $\theta$ would generate the graph $G$:

$$P(\theta\,|\,G) = \frac{P(\theta)}{P(G)}\,P(G\,|\,\theta) = \frac{P(\theta)}{P(G)}\sum_{\{q_i\}} P(G,\{q_i\}\,|\,\theta) \qquad (3.13)$$

We recall that in the language of Bayesian inference $P(\theta\,|\,G)$ is called the posterior distribution, whereas $P(\theta)$ is the prior distribution and represents all the information available on the parameters $\theta$ prior to the inference process. In order to remain absolutely agnostic about the parameters, as we wish not to bias the inference process towards assortative structures, we assume a uniform prior $P(\theta) = 1$ up to normalization. Maximizing the posterior over $\theta$ is therefore equivalent to maximizing the partition function (3.10) or, equivalently, to minimizing the free energy of the Potts model, defined as:

$$\frac{F_N(G;\theta)}{N} = -\frac{\log Z(G,\theta)}{N} \xrightarrow[N\to\infty]{} f(G,\theta) \qquad (3.14)$$

Analogously to the saddle point method, if the free energy $f$ has a non-degenerate minimum then, in the thermodynamic limit, this minimum is achieved with high probability by the values of the parameters that generated the network. More rigorously, let $\theta^*$ be the original parameters and $\theta$ the ones that minimize $f(\theta)$: for all $\epsilon > 0$, $\lim_{N\to\infty} P(|\theta^* - \theta| < \epsilon) = 1$. It is crucial to note that, thanks to the self-averaging nature of the model, as $N \to \infty$ the free energy depends only on the parameters and not on the particular realization of the graph. Let us now see how we can learn the parameters in the case where the free energy is a function of $\theta$ and does indeed have a non-degenerate minimum. The explicit stationarity condition for $f(\theta)$ in terms of the group sizes $n_a$, for $1 \leq a \leq q$, with the constraint $\sum_a n_a = 1$, is obtained by setting each derivative of $f$ with respect to $n_a$ to zero:

$$\frac{1}{N}\sum_i \langle \delta_{q_i,a}\rangle = \frac{\langle N_a\rangle}{N} = n_a\,, \quad \forall a = 1,\dots,q \qquad (3.15)$$

where $\langle f(q_i)\rangle = \sum_{\{q_i\}} f(\{q_i\})\,\mu(\{q_i\}\,|\,G,\theta)$ denotes the thermodynamic average and $N_a$ is the number of nodes in community $a$. The predictable result, as stated in [8], is that for each group $a$ the most likely value of $n_a$ is the average relative group size $\langle N_a\rangle/N$.

Deriving now $f(\theta)$ with respect to the group affinities $c_{ab}$ we have:

$$\frac{1}{Nn_an_b}\sum_{(i,j)\in E} \langle \delta_{q_i,a}\,\delta_{q_j,b}\rangle = \frac{\langle M_{ab}\rangle}{Nn_an_b} = c_{ab}\,, \quad \forall a, b \qquad (3.16)$$

where $E$ is the set of all pairs of nodes connected by an edge. This means that the most likely value of the affinity is proportional to the most likely number of edges $\langle M_{ab}\rangle$ between groups $a$ and $b$, and the most likely value of the non-rescaled affinity matrix $p_{ab} = c_{ab}/N$ is the average fraction of the $N_aN_b$ potential edges from $a$ to $b$ that actually exist. In the undirected case the result becomes:

$$\frac{1}{Nn_a^2}\sum_{(i,j)\in E} \langle \delta_{q_i,a}\,\delta_{q_j,a}\rangle = \frac{\langle M_{aa}\rangle}{Nn_a^2} = c_{aa}\,, \quad \forall a \qquad (3.17)$$

These equations suggest an iterative way to look for the values of the parameters $\theta$ that minimize the free energy. The procedure starts by assigning arbitrary values to the parameters, then proceeds by measuring the mean values $\langle N_a\rangle$ and $\langle M_{ab}\rangle$ in the Boltzmann distribution with parameters $\theta$, and then updates $\theta$ with the new estimates computed by solving equations (3.15)-(3.17). The new parameters are used to define a new Boltzmann distribution and, therefore, to measure new mean values of $\langle N_a\rangle$ and $\langle M_{ab}\rangle$: the procedure stops when a fixed point is reached.
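In pseudo-Python, the fixed-point iteration just described reads as follows. This is an illustrative sketch: `estimate_marginals` stands for whichever procedure (MCMC, or the BP of sec. 3.3) returns node marginals `mu` and, for every edge, a $q \times q$ matrix of edge marginals `nu`, and it reuses the hypothetical `Theta` container from above:

```python
import numpy as np

def learn_parameters_em(A, theta0, estimate_marginals, max_iter=100, tol=1e-6):
    """Iterate the stationarity conditions (3.15)-(3.17) to a fixed point."""
    N = A.shape[0]
    theta = theta0
    for _ in range(max_iter):
        # "expectation" step: marginals under the current Boltzmann distribution;
        # mu has shape (N, q); nu maps each edge (i, j) to a q x q matrix
        # approximating P(q_i = a, q_j = b)
        mu, nu = estimate_marginals(A, theta)
        n_new = mu.mean(axis=0)                       # (3.15): n_a = <N_a> / N
        M = sum(nu[e] + nu[e].T for e in nu) / 2.0    # symmetrized <M_ab>
        c_new = M / (N * np.outer(n_new, n_new))      # (3.16)-(3.17)
        if (np.abs(n_new - theta.n).max() < tol
                and np.abs(c_new - theta.c).max() < tol):
            break                                     # fixed point reached
        theta = Theta(q=theta.q, n=n_new, c=c_new)    # "maximization" step
    return theta
```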

In statistical physics, the stationarity conditions (3.15)-(3.17) can be seen as the analogue of the equality between quenched and annealed magnetizations and correlations. In spin glass models (as in [16]) they are sometimes referred to as the Nishimori conditions.


The process of iteratively searching for a minimum of the free energy is the equivalent of expectation-maximization in statistics [40]. If the conditions (3.15)-(3.17) are satisfied, the average of the Hamiltonian in the Boltzmann distribution in the undirected (most general) case is:

$$\frac{1}{N}\langle H\rangle \equiv e = -\sum_a n_a\log n_a - \sum_{a<b} n_an_bc_{ab}\log c_{ab} - \sum_a n_a^2c_{aa}\log c_{aa} + \frac{c}{2} \qquad (3.18)$$

where the term $c/2$ comes from the weak $O(1/N)$ interactions between unconnected pairs of nodes. We note that, as usual in statistical physics, the free energy is the average energy minus the entropy (here at unit temperature). In this model, the entropy is the logarithm of the number of assignments that have the corresponding energy. Within the Bayesian inference framework, the entropy is interpreted as taking into account the assignments that are as good (same group sizes, same number of edges between each pair of groups) as the original one. This definition of entropy also provides a useful measure of the significance of communities in the network (a numerical study is described in [14]).

3.3 Belief propagation

The Bayesian approach described in the previous sections is well known in the machine learning and statistics literature. However, one of the hardest computational tasks of this approach is the computation of the partition function and of the marginal probabilities of the model defined by the Hamiltonian (3.8), or of the averages on the left-hand side of equations (3.15)-(3.17). The model proposed in [8] is no exception, as there is no exact polynomial-time algorithm for any of these tasks.

A common method in statistical inference is to approximate thermodynamic averages through Markov chain Monte Carlo (MCMC), an approach also known as Gibbs sampling. The fundamental idea is that every Markov chain respecting detailed balance will, after a sufficient number of steps, produce sample configurations distributed according to the Boltzmann distribution (3.9). These samples are then used to compute the averages $\langle N_a\rangle$ and $\langle M_{ab}\rangle$, or the free energy. A central question is then how many steps the chain needs before it equilibrates.
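A single sweep of such a Gibbs sampler is straightforward to write down: each node is resampled from its conditional distribution given all the other labels which, for the Hamiltonian (3.8), couples strongly only to the node's neighbors. The sketch below is illustrative (hypothetical names, dense conditional per node, reusing the `Theta` container from above):

```python
import numpy as np

def gibbs_sweep(A, labels, theta, rng):
    """One heat-bath sweep over all nodes: resample each q_i from its
    conditional P(q_i = a | rest), which for the Hamiltonian (3.8) is
    proportional to n_a * prod_{j != i} c_{a,q_j}^{A_ij} (1 - c_{a,q_j}/N)^{1-A_ij}.
    Assumes all c_ab > 0."""
    N = A.shape[0]
    for i in rng.permutation(N):
        others = np.delete(np.arange(N), i)
        logp = np.log(theta.n)                   # local field: log n_a
        for a in range(theta.q):
            caj = theta.c[a, labels[others]]     # c_{a, q_j} for all j != i
            logp[a] += np.sum(A[i, others] * np.log(caj)
                              + (1 - A[i, others]) * np.log(1.0 - caj / N))
        p = np.exp(logp - logp.max())            # stabilized softmax
        labels[i] = rng.choice(theta.q, p=p / p.sum())
    return labels
```

Since the heat-bath update satisfies detailed balance with respect to (3.9), repeated sweeps, after equilibration, yield configurations from which $\langle N_a\rangle$, $\langle M_{ab}\rangle$ and the marginals can be estimated.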
