Academic year: 2021

Department of Computer Science Ph.D. in Computer Science

Understanding spreading and evolution

in complex networks

Ph.D. Thesis

Ph.D. Candidate Letizia Milli

Supervisors Prof. Dino Pedreschi Dott.ssa Fosca Giannotti


Nowadays, the increasing availability of Big Data describing our desires, opinions, sentiments, purchases, relationships and social connections provides access to a huge source of information on an unprecedented scale. In the last decades Social Network Analysis (SNA) has received increasing attention from several different fields of research. Such popularity is certainly due to the flexibility offered by the graph representation: a powerful tool that allows reducing many phenomena to a common analytical framework whose basic bricks are nodes and their relations. In particular, the analysis of diffusive phenomena that unfold on top of complex networks is a task that attracts growing interest from multiple fields of research. Understanding the mechanism behind the global spread of an epidemic or of a piece of information is fundamental for applications in diverse areas such as epidemiology or viral marketing.

This thesis aims to understand spreading and evolution phenomena in complex networks. We developed two software libraries: DyNetX, a package designed to model evolving graph topologies, and NDlib, a simulation framework aimed at modeling, simulating and studying diffusion phenomena that unfold over complex networks. This framework can be fruitfully used by different user segments, from developers to students as well as non-technicians. The purpose of this simulation framework is to empirically compare the effects of diffusion processes according to different diffusion models over several network topologies within different contexts. Covered models include classic and network epidemic models, threshold models and opinion dynamics models; the repertoire of models is extensible. NDlib is the first library that, leveraging DyNetX, allows the implementation of diffusion models explicitly designed to work on top of evolving network topologies. NDlib is about to be released on SoBigData.eu.

We also investigated connected problems, including the early discovery of successful innovations and the diminishing return effect in diffusion processes, leveraging large, real datasets from diverse domains, such as retail and music.


I would like to extend thanks to the many people who supported me during my Ph.D. thesis. Firstly, I would like to express my sincere gratitude to my supervisors, Dino Pedreschi and Fosca Giannotti, who believed in me and encouraged me to start the Ph.D. adventure. I would like to thank them for encouraging my research and for allowing me to grow as a research scientist. My sincere thanks also go to Giulio; his advice on both research and my career has been invaluable. I would also like to thank my committee members Rossano Venturini and Gianna Del Corso, who followed my progress every year, the two coordinators of the Ph.D. course at the University of Pisa, Pierpaolo Degano and Paolo Ferragina, and my external reviewers Aristides Gionis and Yong-Yeol Ahn for their valuable suggestions.

A special thank-you goes to Francesca, a friend more than a colleague; with her, the work and some difficult periods were more enjoyable. Special mention goes to Anna, my master's degree supervisor; her enthusiasm for research was contagious and motivational for me, and without this start I would not have achieved this result; and to Diego, for his essential advice in the first part of this growth period. A special thank-you goes to my office mates Michela and Daniele, who supported me in these last years, even on the days when I freaked out, and I want to thank all my KDDLab colleagues for their wonderful collaboration. They were always willing to help me, and the group has been a source of friendships as well as of good advice and collaboration.

Finally, I would like to thank my parents, my mum and my dad, for providing me with unfailing support and continuous encouragement throughout my years of study. Last but not least, I want to express my very profound gratitude to Antonino for all his advice, for listening to me and encouraging me during these years, and especially for his patience in my hysterical moments. This accomplishment would not have been possible without him.

And at the end, I want to thank my little son, Giuseppe, for all the cheerful moments that he has given me in these two years, for teaching me to scale down problems, and for making me smile every time I was in a bad mood.


Contents

I Setting the Stage
1 Network Science
  1.1 Network representation
  1.2 Network properties
  1.3 Network Models
    1.3.1 Random Graph
    1.3.2 Small World
    1.3.3 Scale Free
    1.3.4 Forest Fire
2 Diffusion on Complex Networks
  2.1 Common components of contagion models
  2.2 Taxonomy
    2.2.1 Independent interaction
    2.2.2 Dependent interaction
    2.2.3 External effects
    2.2.4 Complex entities
    2.2.5 Adaptive networks
  2.3 Libraries to simulate diffusion phenomena
3 Quantification
  3.1 What is the quantification task?
  3.2 Problem Definition
  3.3 Evaluation measures for quantification
  3.4 Quantification methods via classification
  3.5 Quantification methods on Networks
4 Innovators & Science of Success
  4.1 Innovators
    4.1.1 Influence Maximization
    4.1.2 Super-forecasters


II Diffusion on Networks
5 DyNetX and NDlib: Studying Dynamics Of and On Networks
  5.1 Dynamic Network Analysis
    5.1.1 Dynamic Network Models
    5.1.2 DyNetX: Dynamic Network Library
  5.2 NDlib Ecosystem
    5.2.1 NDlib: Network Diffusion Library
    5.2.2 NDlib-REST: simulation web service
    5.2.3 NDlib-Viz: Visualization Framework
  5.3 Comparisons
    5.3.1 Qualitative
    5.3.2 Quantitative
  5.4 Discussion
  5.5 Appendix: Diffusion Methods implemented in NDlib
    5.5.1 Static Epidemic Models
    5.5.2 Dynamic Epidemic Models
    5.5.3 Opinion Dynamic Models
6 Diffusive Phenomena in Dynamic Networks: a data-driven study
  6.1 Problem definition
  6.2 Evaluation
    6.2.1 Data-driven Study
    6.2.2 Datasets
    6.2.3 Diffusion models
    6.2.4 Diffusion Analysis
  6.3 Analysis of the results
  6.4 Discussion
7 Information Diffusion: the Active/Passive Conundrum
  7.1 Problem definition
  7.2 Information Diffusion Scenarios
  7.3 Passive/Active Diffusion: a Data-driven analysis
  7.4 Discussion

III Diffusion-related problem(s)
8 Quantification in Social Networks
  8.1 Problem formulation
    8.1.1 Quantification via classification
  8.2 Community Discovery for Quantification
  8.3 Ego-networks for Quantification
  8.4 Experiments
    8.4.1 Community Discovery Methods
    8.4.2 Empirical Evaluation


    9.1.1 Experimental Data
    9.1.2 Defining Success
    9.1.3 Identify Hit-Savvy
    9.1.4 Hit-Savvy: Who are they?
  9.2 Predicting Success
    9.2.1 Hits&Flops: Learning Hitters and Floppers
    9.2.2 Hits&Flops: Forecast model
  9.3 Evaluation
    9.3.1 Experimental Results
    9.3.2 Null Model Analysis
    9.3.3 Case Study: Last.fm
  9.4 Discussion

IV Conclusion and Future Works
10 Conclusion and Future Works
  10.1 Diminishing returns and threshold statistics
    10.1.1 Dataset
    10.1.2 Identifying Cascades
    10.1.3 Success of recommendations
    10.1.4 Null model Analysis
  10.2 Conclusion


List of Figures

1.1 Network Science: Graph Topologies
1.2 Network Science: Small World Model
1.3 Network Science: Growth of the network as in the Barabási-Albert model
2.1 Diffusion models: SIR model
2.2 Diffusion models: Adoption S-curve
2.3 Diffusion models: Independent Cascade Model
2.4 Diffusion models: Simulation of a contagion process in Facebook
2.5 Diffusion models: Diminishing Returns phenomenon
2.6 Diffusion models: Phase-switching of the general contagion model
2.7 Diffusion models: Induced agent interaction network resulting in wealth-creating and wealth-degrading relationships
2.8 Diffusion models: Layered social influence
3.1 Quantification: Example training set and test set
3.2 Quantification: Selection policy of the thresholds
4.1 Diffusion of Innovation: Rogers' adopter categories
5.1 NDlib: DiffusionTrend and DiffusionPrevalence SIR plots
5.2 NDlib: DiffusionTrendComparison and DiffusionPrevalenceComparison for the "Infected" class of two distinct models
5.3 NDlib: Multiple executions for SIR model
5.4 NDlib: NDlib-REST documentation webpage
5.5 NDlib: Visualization Framework appearance during a simulation
6.1 Dynamic Diffusion: Network toy
6.2 Dynamic Diffusion: Difference between S2 and S3
6.3 Dynamic Diffusion: Simulation of SI models on WEIBO and FB07 dataset
6.4 Dynamic Diffusion: Simulation of SIR models on WEIBO and FB07 dataset
6.5 Dynamic Diffusion: Statistics from WEIBO dataset
6.6 Dynamic Diffusion: Daily trend of nodes and edges in the WEIBO and FB07 dataset
7.1 Active-Passive Conundrum: Diffusion trends for Passive, Active and Mixed models
7.2 Active-Passive Conundrum: Diffusion trends for Active and Mixed models with blocked nodes
8.1 Quantification: Quantification vs Classification
8.2 Quantification: Community-based approach
8.3 Quantification: 1-hop and 2-hop ego-networks
8.4 Quantification: Ego-network based approach
8.5 Quantification: Label frequencies distribution on CoRA and Google+
8.6 Quantification: Label frequency distribution on CoRA
9.1 Hit-Savvy: Last.fm trends analysis
9.2 Hit-Savvy: Comparison of volumes
9.3 Hit-Savvy: HF-propensity and Active Periods
9.4 Hit-Savvy: WMSC toy example
9.5 Hit-Savvy: Probability distributions Π and ∆
9.6 Hit-Savvy: Predictive accuracy stability
10.1 Future work: Distribution of cascades on Last.fm
10.2 Future work: Diminishing Returns on Last.fm


List of Tables

2.1 Diffusion models: Example agent strategy, stigmergy game
3.1 Quantification: Confusion Matrix
3.2 Quantification: Confusion Matrix example
3.3 Quantification: Confusion matrix of a quantifier
5.1 NDlib: Diffusion Libraries and Tools
5.2 NDlib: Runtimes Comparison
6.1 Dynamic Diffusion: Base statistics of the analyzed interaction graphs
6.2 Dynamic Diffusion: Simulation parameter settings for the SI and SIR models
8.1 Quantification: CoRA results
8.2 Quantification: IMDb results
8.3 Quantification: Google+ results
9.1 Hit-Savvy: Datasets' details
9.2 Hit-Savvy: Statistics
9.3 Hit-Savvy: PPV, NPV, Recall and Specificity results
9.4 Hit-Savvy: z-test
9.5 Hit-Savvy: Predictive performances when dealing with different temporal splits


List of Listings

1 DyNetX: Example of build of a Dynamic Network
2 DyNetX: Example of methods for network statistics
3 DyNetX: Example of methods for temporal operations
4 DyNetX: Methods for I/O facilities
5 DyNetX: Methods for load/save json files
6 NDlib: Example of experiment definition, configuration and execution
7 NDlib: Example of DiffusionTrend
8 NDlib: Example of DiffusionPrevalence
9 NDlib: Example of methods for trend comparison
10 NDlib: Example of MultiPlot with ndlib.viz.bokeh
11 NDlib: Example of parallel Model Executions with multi_runs
12 NDlib: Example of extension of the NDlib library
13 NDlib: Example of iteration() method to extend NDlib library


We live in a world where more and more data are produced on the web and by sensors in electronic devices. We leave our desires, opinions, and sentiments on our web pages and blogs, in social media, in the query logs of search engines. With transaction records we leave traces of our shopping patterns and lifestyles; our relationships and social connections leave their traces in the networks of our phone and email contacts and in online social networks; we leave our movements in the records of our mobile phone calls and in GPS tracks. These Big Data can be described as: "high volume, velocity and variety of data that demand cost-effective, innovative forms of processing for enhanced insight and decision making" [130]. Analyzing Big Data, we can observe and measure how our society works; we can scrutinize individual and collective behavior with an unprecedented level of detail and in real time. Big Data, together with social mining, the ability to discover patterns and models of human behavior from data coming from various social dimensions, can help us to understand many complex and hidden socio-economic phenomena.

Sending emails and participating in an online social network are daily activities that can be modeled with networks. Networks are used to describe any collection of objects in which links connect some pairs of entities; this definition is so flexible that it is easy to find networks in many domains.

Within the area of network theory, it is of particular interest to understand the dynamic behavior of spreading processes on complex networks. When we talk about spreading processes, we think about contagious diseases caused by biological pathogens, like influenza, measles, chickenpox and sexually transmitted infections, which spread from person to person. However, other phenomena can be linked to the concept of the epidemic: think about the spread of computer viruses [258], the spread of mobile phone viruses [119, 276], or the diffusion of knowledge, innovations, and products in an online social network [33].

Information, opinions, beliefs, and behaviors spread through a social system. Individuals transmit information through explicit face-to-face communication and through a host of communication technologies. But the transmission of information in societies through the interaction of people is richer and more complex. There are many tacit forms of communication, for example signaling through facial expressions or our garb. Furthermore, the transmission of information can, and frequently does, influence the observable behavior of individuals or, more subtly, the beliefs and opinions that they hold.

Good models for spreading processes can help us to understand and predict important dynamics in social systems, and perhaps guide interventions at the individual or micro level that can lead to desirable system-level outcomes. For example: "How does news or rumor spread?" "What will be the outcome of political campaigns?" "When do individuals' opinions on controversial topics converge, and when will polarization be deepened?" "When can behavior change stop the spread of an epidemic?" "When will scares based on misinformation take hold in a population?" These questions are all the more pressing given the pervasive role of information technologies, which allow for the magnification and acceleration of some of these processes, but also make it possible to intervene in the processes, for better or worse, at a large scale.

Modeling interactions between individuals through the formalism of networks, and transmission as a function that describes the change of a person's state as a probabilistic function of the states of the people in their local contact network, is a very successful approach. Most of these models fall under the umbrella of spreading processes. The notion of spreading, or contagion, underpins many widespread phenomena in biological, social, and technological systems. The scientific study of these processes arguably had its roots and initial success in epidemiology, where the idea of disease transmissibility began to gain a foothold in the mid 18th century [23]. However, it took until the early 20th century for these ideas to be formalized mathematically, in the subfield of Mathematical Epidemiology [144]. Mathematical models of transmission are nonetheless broadly applicable across many scientific disciplines, and indeed there is a long tradition of spreading models in the sociology and economics literature applied to innovations [19, 228] and rumors [224, 101]. Initially, the models used to understand spreading phenomena were simple, but as the limitations of such models became evident, more complex models were developed.
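As a concrete illustration of this modeling style, the following sketch simulates a classic SIR process on a toy contact network in plain Python. It is illustrative only: the graph, parameter values and names are assumptions, not code from the libraries discussed later.

```python
import random

def sir_step(graph, status, beta, gamma, rng):
    """One synchronous SIR update: S->I via infected neighbours, I->R."""
    new_status = dict(status)
    for node, state in status.items():
        if state == "S":
            # each infected neighbour independently transmits with probability beta
            for nb in graph[node]:
                if status[nb] == "I" and rng.random() < beta:
                    new_status[node] = "I"
                    break
        elif state == "I" and rng.random() < gamma:
            new_status[node] = "R"  # recovery with probability gamma
    return new_status

def simulate_sir(graph, seeds, beta=0.3, gamma=0.1, steps=50, seed=42):
    rng = random.Random(seed)
    status = {n: ("I" if n in seeds else "S") for n in graph}
    history = [status]
    for _ in range(steps):
        status = sir_step(graph, status, beta, gamma, rng)
        history.append(status)
    return history

# hypothetical toy contact network as an adjacency dict
g = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 4], 3: [1, 4], 4: [2, 3]}
hist = simulate_sir(g, seeds={0})
final = hist[-1]
print(sum(1 for s in final.values() if s != "S"), "nodes ever reached")
```

The per-node state here plays the role of the "person's state" above, and the neighbour loop implements the probabilistic dependence on the local contact network.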

Therefore, in this thesis we propose to investigate spreading phenomena on complex networks and related problems; we want to develop useful tools for understanding, monitoring and signaling diffusion phenomena.

We focus mainly on the diffusion of innovations, ideas, information, opinions, beliefs, and behaviors that spread through a social system. With the increasing use of online social networks (OSNs) as well as the proliferation of user-oriented online services, we have access to a very vast source of information. Social networks play a fundamental role in the diffusion of information; they have proved powerful in many situations, such as Facebook during the 2010 Arab Spring [131] or Twitter during the 2008 U.S. presidential elections [143]. Understanding the dynamics of these networks can help in better following events (e.g., examining revolutionary waves), solving problems (e.g., preventing terrorist attacks, anticipating natural accidents), optimizing business performance (e.g., optimizing social marketing campaigns), etc. Therefore, in recent years, researchers have developed a variety of techniques and models to capture information diffusion in social networks, analyze it, extract knowledge from it and predict it. Information diffusion is a vast research domain and has attracted research interest from many fields, such as physics, biology, etc.

Some open problems arise, such as:

• finding a statistical method to validate in advance a forecast obtained by the model;

• finding a partition of users that maximizes the spreading; this is of great significance for applications across various domains. For instance, knowledge of influential spreaders is crucial for the design of effective strategies to control the outbreak of epidemics, while targeting the innovators in information diffusion is helpful for conducting successful campaigns in commercial product promotions;

• finding the distribution of the adoption threshold in a population;

• finding the number of recommendations needed to saturate the probability of becoming an adopter, i.e., the diminishing return phenomenon;

• simulating diffusion phenomena.
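The diminishing-return question can be made concrete with a simple null hypothesis: if each of k recommendations independently converts with probability p, the adoption probability is 1 − (1 − p)^k, whose marginal gain shrinks with k. The sketch below (the value of p is an illustrative assumption, not a figure from this thesis) shows this baseline curve:

```python
def adoption_probability(k, p=0.1):
    """Probability of adopting after k independent recommendations,
    each succeeding with probability p: 1 - (1 - p)**k."""
    return 1.0 - (1.0 - p) ** k

# marginal gain of the (k+1)-th recommendation over the k-th
marginal = [adoption_probability(k + 1) - adoption_probability(k) for k in range(10)]

# gains shrink geometrically with k: a diminishing-return baseline
assert all(marginal[i] > marginal[i + 1] for i in range(len(marginal) - 1))
```

Deviations of empirical adoption curves from such an independence baseline are precisely what a data-driven study of the diminishing return effect would look for.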

The thesis is organized into four parts. In the first one, Setting the Stage, we introduce the preliminary notions needed to become familiar with the fields of research covered by this thesis.

In Chapter 1, we provide a brief overview of the essential concepts of complex network theory, introducing how to represent networks with graphs and their principal properties; finally, we present four different network models.

In Chapter 2 we present the fundamental subject of the thesis: diffusion models. In the last decade, a wealth of methods, from mean-field theories to rigorous results, has provided new knowledge on the dynamics of contagion processes in complex networks. In some cases, remarkable progress has been derived independently by using different methods and assumptions. This fragmented panorama does not advance the field and causes the duplication of research effort. Therefore, we wrote a review describing the methods and recent results on spreading phenomena in complex networks [287], and the Chapter is a summary of this survey of representative diffusion methods. We propose a taxonomy of the various diffusion models that includes both spreading and opinion dynamics. We have identified five categories: the first two are based on the rules by which pairs of agents in the population interact (independent or dependent), further divided according to the underlying social network. The last three categories are transversal to the first two, i.e., they can include models with both dependent and independent interaction, but these models add further mechanisms: external influence, co-evolution of the social network, or the spreading of multiple entities or topics.

At the end of the Chapter (Section 2.3) we propose a review of the most used libraries for the simulation of diffusion phenomena.

After the presentation of the main theme of the thesis, we present the literature needed to address two other topics related to diffusion phenomena. In Chapter 3 we introduce the quantification task, and in Chapter 4 we present the identification and analysis of innovators and an emerging area, the Science of Success.

In the second part of this thesis, we get into the core of our work, describing two software libraries: DyNetX, a package designed to model evolving graph topologies, and NDlib, a novel simulation framework able to model, simulate and study diffusive phenomena that unfold on complex networks (Chapter 5). The former extends the NetworkX project by providing the user with facilities to describe and analyze networks whose nodes and edges change as time goes by, and represents, to the best of our knowledge, the first library specifically intended to provide support to easily describe complex time-dependent networks. The latter, NDlib, represents a multi-level solution: it is designed to offer a programmatic interface to developers, an experimental server to those centers that need to provide simulations as a service and, finally, a visual interface for those, students as well as non-technicians, who want to run simulations and experiments but do not have the time to learn a new library or programming language. The framework was presented in [234, 233, 232]; it is integrated into a visual simulation platform developed within the CIMPLEX H2020 EU project and is being released on SoBigData.eu. It is currently hosted on GitHub¹, on pypi² and has its online documentation on ReadTheDocs³.

Then, with NDlib leveraging DyNetX, in Chapter 6 we compare the behaviors of classical spreading models (SI and SIR) when applied on top of the same social network whose topology dynamics are described at different temporal granularities ([194]). We show that analyzing diffusive phenomena without considering topology dynamics indeed causes a relevant overestimate of the actual velocity and amplitude of the spreading. Moreover, taking advantage of the topology evolution enables the analyst to identify special context-dependent events (activity peaks) as well as activity patterns (weekend/weekdays). Since the adoption of a static topology affects spreading simulations, we implement in NDlib dynamic-network variants of two classical epidemic models.
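The intuition behind the overestimate can be reproduced in a minimal experiment: on the time-aggregated (static) graph every edge is always available for transmission, while on the temporal network an edge can transmit only while it exists. The following plain-Python sketch (the toy snapshots and parameters are illustrative assumptions, not NDlib code) compares an SI process on the two views:

```python
import random

def si_on_snapshots(snapshots, seeds, beta, rng):
    """SI process where, at step t, only the edges in snapshots[t] can transmit."""
    infected = set(seeds)
    sizes = [len(infected)]
    for edges in snapshots:
        new = set()
        for u, v in edges:
            # an undirected edge may carry the contagion in either direction
            for a, b in ((u, v), (v, u)):
                if a in infected and b not in infected and rng.random() < beta:
                    new.add(b)
        infected |= new
        sizes.append(len(infected))
    return sizes

# hypothetical temporal network: three snapshots over five nodes
snapshots = [
    [(0, 1), (2, 3)],
    [(1, 2), (3, 4)],
    [(0, 4), (1, 3)],
]
# static view: the union of all edges, available at every step
static = [sorted({e for snap in snapshots for e in snap})] * len(snapshots)

dynamic_sizes = si_on_snapshots(snapshots, {0}, beta=1.0, rng=random.Random(7))
static_sizes = si_on_snapshots(static, {0}, beta=1.0, rng=random.Random(7))
# the static aggregate spreads at least as fast at every step
assert all(s >= d for s, d in zip(static_sizes, dynamic_sizes))
```

Even with certain transmission (beta = 1), the temporal version is slower because the contagion must wait for the right edges to appear, which is exactly the effect measured at scale in Chapter 6.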

Continuing with the study of spreading phenomena, and in particular with the diffusion of innovations, in Chapter 7 we concentrate on the activeness of users ([195]). Indeed, although often treated as similar processes, diffusion of information and epidemic spreading can be easily distinguished by a single feature: the degree of activeness of the subjects they affect. The spreading process of a virus does not require the active participation of the people that catch it; conversely, we can argue that the diffusion of an idea, innovation, or trend depends strictly not only on social pressure but also on individual choice. We compare with NDlib alternative modeling choices that simulate the diffusion of information, news, ideas and trends with passive and/or active processes. We introduce two novel approaches whose aim is to provide active and mixed schemas applicable in the context of innovation/idea diffusion simulation.
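A minimal way to see the distinction is to compare one update step of the two schemas: in a passive (epidemic-like) process infected nodes push the contagion to their neighbours, while in an active process a susceptible node first chooses to expose itself, with some activeness probability, before it can adopt. The sketch below is an illustrative plain-Python rendering of this idea, not the actual NDlib implementation; parameter names are assumptions:

```python
import random

def passive_step(graph, infected, beta, rng):
    """Passive contagion: infected nodes push to neighbours with probability beta."""
    new = set(infected)
    for u in infected:
        for v in graph[u]:
            if v not in infected and rng.random() < beta:
                new.add(v)
    return new

def active_step(graph, infected, gamma, beta, rng):
    """Active diffusion: a susceptible node consults its neighbours only with
    probability gamma (its activeness), then may adopt from each infected one."""
    new = set(infected)
    for u in graph:
        if u in infected or rng.random() >= gamma:
            continue  # this node does not actively look for the innovation
        if any(v in infected and rng.random() < beta for v in graph[u]):
            new.add(u)
    return new

# hypothetical toy network
g = {0: [1, 2], 1: [0], 2: [0]}
print(passive_step(g, {0}, beta=0.5, rng=random.Random(5)))
```

Setting gamma = 1 makes the active schema behave like the passive one, while gamma = 0 freezes the diffusion entirely; mixed schemas sit between these extremes.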

In the third part, we focus on two issues related to diffusion. The first one is the quantification task (Chapter 8). Indeed, in the literature there is no method to validate the results obtained with a good diffusion model in real time. For example, pandemic forecasts have been confirmed only afterwards, using actual data from surveillance collected during the pandemic [263]. The use of quantification methods could be of great help in this context; they could enable the validation of the results in real time. We could use labeled initial data to learn a reliable quantifier of an epidemic and then apply the quantifier to big data to monitor the epidemic, e.g., to estimate the number of infected and then validate our predictive model. So we propose techniques for quantification on networks that exploit the homophily effect observed in many social networks ([193]). The quantification task can also be useful as a tool to directly estimate the percentage of infected in a diffusion phenomenon; indeed this is a classic case of distribution drift between training set and test set. The question that we would like answered is: if we have an incomplete diffusion process, can we estimate the distribution of infected nodes using the quantification task?
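For concreteness, the simplest quantifiers in this family are Classify & Count and its adjusted variant, which corrects the raw positive rate using the classifier's true and false positive rates estimated on labelled data. A minimal sketch (the numbers are illustrative, not from the experiments of Chapter 8):

```python
def classify_and_count(predictions):
    """Naive quantifier: fraction of items the classifier labels positive."""
    return sum(predictions) / len(predictions)

def adjusted_count(predictions, tpr, fpr):
    """Adjusted Classify & Count: correct the raw rate with the classifier's
    true/false positive rates, clamping the estimate to [0, 1]."""
    cc = classify_and_count(predictions)
    return max(0.0, min(1.0, (cc - fpr) / (tpr - fpr)))

# hypothetical example: a classifier with tpr=0.8, fpr=0.2 labels 44% positive
est = adjusted_count([1] * 44 + [0] * 56, tpr=0.8, fpr=0.2)
# (0.44 - 0.2) / (0.8 - 0.2) = 0.4: the estimated true prevalence
```

Applied to a partially observed diffusion process, `predictions` would be the per-node infected/not-infected labels produced by a classifier, and the adjusted count would estimate the overall fraction of infected nodes despite the train/test distribution drift.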

The second topic related to diffusion is the search for a special niche among early adopters (Chapter 9). Innovations are continuously launched on markets, such as new products on the retail market or new artists on the music scene. Some innovations become a success, others do not. Forecasting which innovations will succeed at the beginning of their lifecycle is hard. We provide a data-driven, large-scale account of the existence of a special niche among early adopters, individuals that consistently tend to adopt successful

¹ NDlib GitHub: https://goo.gl/1tstvG
² NDlib pypi: https://pypi.python.org/pypi/ndlib
³

innovations before they reach success: we call them Hit-Savvy ([231]). Hit-Savvy can be discovered in very different markets and retain over time their ability to anticipate the success of innovations. This problem is strictly related to the task of influence maximization and to the research on "word-of-mouth" and "viral marketing" [19, 28, 68, 103, 178, 227], i.e., the problem of finding a small subset of nodes that could maximize the spread of influence. For example, suppose a small company has developed a new product and wants to market it through a social network. It has a limited budget, so it can select only a small number of initial users. The company wishes that, through the "word-of-mouth" effect, a large population will buy the product. Therefore the problem is whom to select as initial users in order to obtain the greatest cascade of further adoptions. Our experiments suggest that this subset exists, and we present a procedure to find it. As our second contribution, we devise a predictive analytical process, exploiting Hit-Savvy as signals, which achieves high accuracy in the early-stage prediction of successful innovations, far beyond the reach of state-of-the-art time series forecasting models. Indeed, our findings and predictive model can be fruitfully used to support marketing strategies and product placement. This analysis was carried out in the mean-field context; we tested our method on two datasets without a network.

Finally, the fourth part, Conclusion and Future Works, ends the thesis, summarizing the main findings, presenting work in progress (the study of the threshold distribution in the Last.fm dataset and of the presence of the diminishing return effect) and possible future works that can reinforce the results described in this dissertation.
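The influence maximization problem recalled above is commonly attacked with a greedy strategy that repeatedly adds the node with the largest estimated marginal gain in expected spread, where the expectation is approximated by Monte Carlo runs of the Independent Cascade model. The following plain-Python sketch (toy graph and parameters are illustrative assumptions) shows the idea:

```python
import random

def ic_spread(graph, seeds, p, rng):
    """One Monte Carlo run of the Independent Cascade model: each newly
    activated node gets one chance to activate each neighbour with prob. p."""
    active, frontier = set(seeds), list(seeds)
    while frontier:
        nxt = []
        for u in frontier:
            for v in graph[u]:
                if v not in active and rng.random() < p:
                    active.add(v)
                    nxt.append(v)
        frontier = nxt
    return len(active)

def greedy_seeds(graph, k, p=0.1, runs=200, seed=0):
    """Greedy seed selection: add, k times, the node with the largest
    estimated marginal gain in expected spread."""
    rng = random.Random(seed)
    seeds = set()
    for _ in range(k):
        gains = {}
        for cand in graph:
            if cand in seeds:
                continue
            trial = seeds | {cand}
            gains[cand] = sum(ic_spread(graph, trial, p, rng) for _ in range(runs)) / runs
        seeds.add(max(gains, key=gains.get))
    return seeds

# hypothetical toy graph: a hub (node 0) plus a separate pair (4, 5)
g = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0], 4: [5], 5: [4]}
seeds = greedy_seeds(g, k=2, p=0.5, runs=100, seed=1)
print(seeds)
```

Note the contrast with the Hit-Savvy approach of Chapter 9: the greedy scheme needs the network and a diffusion model, whereas Hit-Savvy are identified directly from adoption logs, even in the mean-field setting without a network.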


Network Science

In this Chapter, we introduce the essential notions of network science. This science is the study of the theoretical foundations of network structure and dynamic behavior, and of the application of networks to many subfields. Currently known subfields include social network analysis (SNA), collaboration networks (bibliographic citations, online social networks), emergent systems (power grid, WWW), physical science systems (phase transitions, percolation theory) and life science systems (epidemics, genetics). Because network science often models complex systems, it is closely associated with the older field of complex adaptive systems. In fact, network science incorporates graph theory from mathematics, statistical mechanics from physics, data mining and information visualization from computer science, inferential modeling from statistics, and social structure from sociology. Network science offers a language through which different disciplines can interact with each other.

In the first Section we discuss how to represent networks with graphs (Section 1.1) and their principal properties (Section 1.2); after that, we introduce four different network models: Random Graph, Small World, Scale Free and Forest Fire (Section 1.3). None of the proposed models can capture the exact evolution pattern of real networks, but they are good approximations that can be used to classify real networks.

1.1 Network representation

The graph is the mathematical structure used to model pairwise relations between entities. A graph, often denoted as G = (V, E), consists of a set of objects called nodes or vertices V, where certain pairs of these objects are connected by links called edges E. According to this simple definition, any system with coupled elements can be represented as a network, so our world is full of networks.
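In code, this definition maps directly onto an adjacency-list structure; a minimal undirected sketch in Python (class and method names are illustrative):

```python
class Graph:
    """A minimal undirected graph G = (V, E) stored as an adjacency list."""

    def __init__(self):
        self.adj = {}  # node -> set of neighbours

    def add_node(self, u):
        self.adj.setdefault(u, set())

    def add_edge(self, u, v):
        self.add_node(u)
        self.add_node(v)
        self.adj[u].add(v)
        self.adj[v].add(u)  # (u, v) and (v, u) are the same undirected edge

g = Graph()
for u, v in [("a", "b"), ("b", "c"), ("a", "c")]:
    g.add_edge(u, v)
print(sorted(g.adj["a"]))  # neighbours of "a": ['b', 'c']
```

A directed graph would simply drop the symmetric insertion in `add_edge`, which is exactly the distinction discussed below.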

Graph theory first appeared in the 18th century, when the mathematician Euler solved the famous problem of the Seven Bridges of Königsberg. In his first short paper [74], Euler showed how simple relations between vertices and edges could explain everyday problems. In the second half of the 20th century, the studies and results on graph theory were widely developed, thanks to advances in combinatorics and automatic computation and to the availability of a huge amount of data. The computer has, on the one hand, enabled us to develop experimental investigations on graphs (used in particular to prove the four-color theorem), and on the other hand, led graph theory to investigate algorithms and models for high-impact applications. Within fifty years graph theory has become a highly developed chapter of mathematics, rich in deep results, with strong influences from applications. In the 1990s a "new science" emerged, called "Network Science", that applies concepts of mathematical graph theory to problems in a variety of fields. An important goal of network science is to build models that accurately recreate the properties of networks observed in real systems. This increasing attention has pointed to the need for analyzing and categorizing models.

There are several types of both semantic and syntactic extensions of graph structures; here we will recall some of them:

– The edges in a graph can be undirected or directed. For an undirected graph the edge (u, v) is identical to the edge (v, u). A graph is called directed, or a digraph, if the edges have an orientation. As an example we can consider a graph representing telephone calls or email messages between individuals: in these cases, the graph will be composed of directed edges, since each call/message goes only in one direction, as shown in Figure 1.1(a)-(b);

– Graphs can be semantically enriched by introducing (possibly multiple) labels on both nodes and edges. As an example, consider a social graph, i.e., a network in which nodes identify people and edges their relationships. In such a context we might be interested in enriching the graph model with information on the type of connection holding between pairs of nodes, or on some particular attribute of each vertex (e.g., gender, age and weight), as shown in Figure 1.1(c), where node size represents the different weights and each edge carries its timestamp;

– Multiple types of nodes can be related in a graph structure: in this case we are studying a k-partite graph. An example of a bipartite graph is one relating customers to purchased goods, as shown in Figure 1.1(d).

1.2 Network properties

In this section we will discuss some properties and indicators used to characterize complex networks.

Degree

A key property of a node is its degree. In graph theory, the degree of a vertex of a graph is the number of edges incident to it, with self-loops counted twice. In a digraph we can distinguish between the in-degree (the number of edges entering a node) and the out-degree (the number of edges departing from it). In the rest of the thesis we will use undirected networks, so we will use Γ(u) to indicate the set of neighbors of u ∈ V and |Γ(u)| to denote the degree of u.
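Using the networkx library (the graph library NDlib itself builds on), the neighbor set Γ(u) and the degree |Γ(u)| can be read off directly. The toy graph below is purely illustrative:

```python
import networkx as nx

# A small undirected graph; node names and edges are illustrative.
G = nx.Graph()
G.add_edges_from([("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")])

gamma_a = set(G.neighbors("a"))  # the neighbor set Γ(a)
deg_a = G.degree("a")            # the degree |Γ(a)|

print(gamma_a, deg_a)  # {'b', 'c', 'd'} 3
```

For directed graphs, `G.in_degree(v)` and `G.out_degree(v)` give the corresponding in- and out-degrees.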

Degree distribution

Once all the degrees are calculated, we might be interested in computing the degree distribution for the nodes in the graph. We define pk as the fraction of nodes in the network with degree equal to k, or in other words, the probability that a randomly selected node in the network has degree k. To represent the degree distribution, we can plot the histogram


(a) Undirected graph (b) Directed graph

(c) Labeled graph (d) Bipartite graph

Figure 1.1: Several graph topologies. In the first row are shown, starting from the left, an undirected and a directed graph; in the second row a labeled graph and a bipartite graph.

of pk values for all values of k in the network. If we build a graph by introducing edges between node pairs selected uniformly at random, we obtain a Poisson degree distribution; however, most networks in the real world have degree distributions very different from this. In [3] it has been highlighted that the node degree often follows a highly right-skewed distribution, meaning that a large preponderance of nodes have a low degree, but a small number, known as hubs, have a high degree. Hubs exist in most real networks. Examples of hubs are well-connected users on Facebook, the most important airports in the World Airport Network (WAN), and the ATP molecule in metabolic networks.

In some real networks the degree distribution is well approximated by pk ≈ k^−γ; this is called a power law distribution. These networks are called scale-free because they lack a characteristic scale. Although this distribution is not universal in networks [5], it is widespread in several natural and artificial networks, such as the WWW [4], the Internet backbone [75], metabolic reaction networks [137] and phone call graphs. In co-authorship networks of scientists [18, 204], the degree distribution is better fitted by a power law with an exponential cutoff; in the power grid of the western United States the degree distribution is exponential [5], while in a social network of Mormons in Utah pk is Gaussian [5]. The more scientists studied networks, the more they uncovered scale-free structures. These findings raise an important question: how can different systems have the same architecture and obey the same laws? Scale-free networks also share another property: the value of the exponent γ in the power law distribution tends to fall between 2 and 3 [206].
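The empirical degree distribution pk of a synthetic scale-free graph can be computed directly by counting; the sketch below uses a Barabási-Albert generator with illustrative parameters.

```python
import collections
import networkx as nx

# Empirical degree distribution p_k: the fraction of nodes with degree k.
G = nx.barabasi_albert_graph(n=10000, m=3, seed=42)

counts = collections.Counter(d for _, d in G.degree())
n = G.number_of_nodes()
p = {k: c / n for k, c in sorted(counts.items())}

mean_degree = sum(k * pk for k, pk in p.items())
max_degree = max(counts)
print(mean_degree, max_degree)  # a few hubs reach degrees far above the mean
```

Plotting p against k on log-log axes would reveal the approximately straight line characteristic of a power law.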


Paths

Given two nodes u, v ∈ V, a path is defined as a sequence of edges connecting u and v. We can define the shortest path, or geodesic path, between two nodes as a path with the minimum number of edges. The shortest path is not necessarily unique, but the geodesic distance (the length of the shortest path) is the same for all of them. In most real-world networks the geodesic distance is short, in particular when compared with the number of nodes of the network. This phenomenon is called the small-world effect [279]. The first experimental study of the small-world phenomenon, also known as six degrees of separation, was carried out in 1967 by Stanley Milgram [264]. In his experiments, Milgram chose a broker in Boston and a student in Massachusetts as targets. He randomly selected Midwestern volunteers, who received a letter containing a summary of the study and some personal information about the target person. He asked the volunteers to forward the letter to a friend who was more likely to know the target person. The finding was that, on average, the number of intermediaries was 5.5, a relatively small number.

The highly clustered nature of networks was first conjectured in a social context by Granovetter [112], who proposed a sociological interpretation: the network behind our society consists of small, fully connected circles of friends connected by strong ties, and circles of acquaintances connected by weak ties. Milgram's study was confined to the United States, but through the playwright John Guare the six degrees phenomenon was generalized to the whole planet [115]. In 1998 a new interest in small worlds emerged following the study of Duncan Watts and Steve Strogatz [279], who analyzed three real systems: the actor network of Hollywood, the neural network of the worm C. elegans and the North American power grid. They found that these networks show, on average, a small path length and a high clustering coefficient. The latter result is in contrast with random graph theory, where the value of the clustering coefficient is low.

Strictly connected to the path definition is the diameter, which is the largest geodesic distance in the graph. The diameter in real-world networks is usually found to be very small; its value can be approximated by log n where n = |V|. The diameter is less informative for real networks than the average distance, and its value is sensitive to small changes in the set of nodes.
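The small-world effect can be observed numerically: in a sparse random graph the average geodesic distance and diameter stay tiny compared with the number of nodes. The parameters below are illustrative.

```python
import math
import networkx as nx

# Geodesic distances in a sparse random graph (mean degree ≈ 10):
# the average shortest path grows roughly like log n.
G = nx.erdos_renyi_graph(n=2000, p=0.005, seed=1)
giant = G.subgraph(max(nx.connected_components(G), key=len)).copy()

avg_dist = nx.average_shortest_path_length(giant)
diam = nx.diameter(giant)
print(avg_dist, diam, math.log(giant.number_of_nodes()))
```

Despite having 2000 nodes, the typical distance between two nodes is only a handful of hops.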

Connected components

In a graph, a connected component is a maximal set of nodes such that any two vertices are connected to each other by a path. Edges whose removal causes the immediate partitioning of a component into two disconnected ones are called bridges. In real-world networks we find that there is a large component, the giant component, that fills most of the network, usually containing from 70% to 100% of the nodes. The rest of the network is divided into a large number of small components disconnected from the rest.

Centrality Measures

When studying a complex phenomenon, identifying the most important vertices, such as the most influential persons in a social network, super-spreaders in a disease phenomenon, or key infrastructure nodes in the Internet, is a frequent issue. In graph theory, the concept of centrality denotes a set of measures aimed at answering questions such as: which are the most important or central vertices in a network? The word importance has a wide number of meanings, leading to many different definitions of centrality. Among all the indicators proposed in the last decades, three are the most used: betweenness, closeness and eigenvector centrality.

• Betweenness: this measure quantifies the number of times a node acts as a bridge along the shortest paths between pairs of other nodes. The betweenness can be represented as:

$$C_B(v) = \sum_{s \neq v \neq t \in V} \frac{\sigma_{s,t}(v)}{\sigma_{s,t}}$$

where σ_{s,t} is the total number of shortest paths from node s to node t and σ_{s,t}(v) is the number of those shortest paths that pass through v. Some studies have shown that betweenness and degree are usually strongly correlated [102]. Nodes with high betweenness may have considerable influence in a network by virtue of their control over information passing between others. They are also the ones whose removal from the network will most disrupt communications.

• Closeness: this measure quantifies how close a vertex is to the other vertices and can be represented as:

$$C(v) = \sum_{j \neq v} \frac{1}{d_{v,j}}$$

where d_{v,j} is the geodesic distance between nodes v and j. This measure takes large values for nodes that are separated from the others by only a short geodesic distance on average. Such nodes may have better access to information at other nodes, or more direct influence on other nodes. For example, in a social network a person with a high closeness value might find that his opinions reach the community more quickly than those of someone with a low value.

• Eigenvector: this measure captures the influence of a node in a network by assigning relative scores to all nodes, based on the idea that connections to high-scoring nodes contribute more to the score of the node in question than equal connections to low-scoring nodes. Two variants of this centrality measure are Google's PageRank [210] and the Katz centrality [141].
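The different notions of importance captured by the three measures can be contrasted on a small "barbell" graph, two 5-cliques joined through a single middle node. The middle node lies on every shortest path between the cliques, so it maximizes betweenness, while eigenvector centrality concentrates inside the dense cliques.

```python
import networkx as nx

# Two 5-cliques (nodes 0-4 and 6-10) joined through bridge node 5.
G = nx.barbell_graph(5, 1)

bet = nx.betweenness_centrality(G)
clo = nx.closeness_centrality(G)
eig = nx.eigenvector_centrality(G, max_iter=1000)

print(max(bet, key=bet.get))  # 5: the bridge controls inter-clique paths
```

Removing node 5 disconnects the graph, which is exactly the "most disruptive removal" property high-betweenness nodes exhibit.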

Transitivity

Another very important property, especially in social networks, is transitivity, also known as the clustering coefficient. This indicator captures the so-called triadic closure effect: it calculates the ratio of closed triangles over all triplets of nodes. In a social network, nodes tend to create tightly knit groups characterized by a relatively high density of ties. Two versions of this measure exist: the global version, designed to give an overall indication of the clustering in the network, and the local version, which indicates the embeddedness of single nodes. A graph is considered small-world if its mean local clustering coefficient C̄ is significantly greater than that of a random graph built on the same vertex set, and if the graph has about the same mean shortest path length as its analogous random graph.

Community structure

One important characteristic of social networks is the presence of community structure, i.e., sets of nodes each of which is densely connected internally, so that there are many edges inside groups and few edges between groups. What exactly we mean by "many" and "few" is debatable, so a plethora of different definitions has been proposed.


1.3 Network Models

In this part we report the fundamental network models: Random Graph (Section 1.3.1), Small World (Section 1.3.2), Scale Free (Section 1.3.3) and Forest Fire (Section 1.3.4). The general aim of network modeling is to capture some essential properties lying behind real-world phenomena and replicate them while generating synthetic data, imposing only a few simple constraints (e.g., the rich-get-richer explanation for power law degree distributions). As we will see, each model focuses only on a few properties of real-world complex networks, and none of the proposed models can capture the exact evolution pattern of real networks; they are nonetheless good approximations that can be used to classify real networks.

1.3.1 Random Graph

One of the most famous models of the 20th century is the Random Graph model. The model was introduced by Rapoport and Solomonoff [253] in 1951 and independently by Erdős and Rényi [71] in 1959. In a sequence of eight papers, they studied the simplest random graph: a graph with n elements in which each pair of elements is independently connected with a fixed probability. Thus, a random graph Gn,m is defined as a set of n labeled nodes connected by m edges chosen randomly, with probability p, among the |V|(|V|−1)/2 possible edges. The random network model was independently introduced by Gilbert [97] in the same year. Since the impact of Erdős and Rényi's work is still overwhelming, they are considered the fathers of random networks.

Random networks generated with the same values of the parameters |V| and p differ slightly from one another. Not only does the detailed wiring diagram vary: since the number of edges m is only determined in expectation by |V| and p, it is not fixed either. If we increase p from p = 0 to p = 1, the random network becomes denser: the average number of links grows from m = 0 to m = |V|(|V|−1)/2 and the mean degree of a node from ⟨k⟩ = 0 to ⟨k⟩ = |V| − 1. The most important results for this model, established by the studies of several mathematicians, are: the degree distribution follows a Poisson distribution; random graphs tend to have small diameters, with the diameter close to the average path length; and when the mean degree ⟨k⟩ = p(|V|−1) < 1 the random graph is typically composed of isolated trees, while for ⟨k⟩ > 1 a phase transition occurs and a giant component emerges.
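The giant-component phase transition can be observed numerically by generating two G(n, p) graphs, one below and one above the ⟨k⟩ = 1 threshold (sizes and seeds are illustrative):

```python
import networkx as nx

# Giant-component phase transition in G(n, p): mean degree <k> = p(n-1).
n = 10000
sub = nx.erdos_renyi_graph(n, 0.5 / (n - 1), seed=7)   # <k> = 0.5
sup = nx.erdos_renyi_graph(n, 2.0 / (n - 1), seed=7)   # <k> = 2.0

def giant_fraction(G):
    # fraction of nodes in the largest connected component
    return max(len(c) for c in nx.connected_components(G)) / G.number_of_nodes()

print(giant_fraction(sub), giant_fraction(sup))
```

Below the threshold the largest component is a vanishing fraction of the network; above it, a component spanning most of the nodes appears.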

Random graph theory has dominated scientific research on networks since 1959; Erdős and Rényi described networks as completely random. However, at the end of the last century it was shown that real networks differ from the random graph model in several features. The most important difference is that in a random network most nodes have comparable degrees and hubs are absent, while in real networks we observe a significant number of highly connected nodes and large differences in node degrees. As a consequence, real networks do not have a Poisson degree distribution.

1.3.2 Small World

In 1998 Watts and Strogatz proposed a model, called Small-World, that merges structure and randomness. The starting point of their model is a circular lattice network in which each node is connected to k neighbors; then each edge is rewired with probability p,


Figure 1.2: Network topology at different p rates in Small-world model [279].

avoiding duplicate edges and self-loops. Varying the value of p we can obtain a different network structure:

• for p = 0 we have a regular lattice with high clustering coefficient;

• for p = 1 we have a random network with low clustering coefficient that exhibits small-world properties;

• for 0 < p < 1 we get networks with high clustering coefficient and a small distance between the nodes.

The authors also analyzed the clustering coefficient C(p) and the characteristic path length L(p) as functions of p. Varying the value of p, they discovered that there is a wide range of values for which L(p) is almost as small as Lrandom (the characteristic path length of the Random Graph model) while at the same time C(p) ≫ Crandom (the clustering coefficient of the Random Graph model). This result is obtained through the introduction of random shortcuts that connect distant nodes (Figure 1.2). Both the small-world effect and the clustered nature have since been discovered in many natural and artificial networks, such as the WWW [4, 1], the Internet [294, 213], cellular networks [137], food webs [286], scientific collaboration networks [17, 204], neural networks [87, 279], email contact networks [67] and the Messenger contact network [164].
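The contrast Watts and Strogatz observed can be reproduced directly with networkx's generators; the parameters below (1000 nodes, k = 10 nearest neighbors) are illustrative.

```python
import networkx as nx

# Watts-Strogatz: a small rewiring probability collapses path lengths
# while clustering stays high.
lattice = nx.watts_strogatz_graph(n=1000, k=10, p=0.0, seed=3)
rewired = nx.connected_watts_strogatz_graph(n=1000, k=10, p=0.1, seed=3)

C0, L0 = nx.average_clustering(lattice), nx.average_shortest_path_length(lattice)
C1, L1 = nx.average_clustering(rewired), nx.average_shortest_path_length(rewired)
print(C0, L0)   # regular lattice: high clustering, long paths
print(C1, L1)   # small world: clustering still high, paths far shorter
```

Already at p = 0.1, the average path length drops by an order of magnitude while the clustering coefficient remains a large fraction of the lattice value.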

1.3.3 Scale Free

The previous models do not consider the existence of some special nodes, called hubs, that are more highly connected than others.

In 1999 Barabási and Albert [16] showed that this heavy-tailed degree distribution can result from the continuous expansion of networks through the addition of new nodes. As new nodes appear, they tend to connect to the most connected sites. This "rich get richer" process favors the earliest nodes, which are more likely to eventually become hubs. They therefore proposed the Scale-free model (also known as the Barabási-Albert model), based on two principal ingredients: network growth and preferential attachment. We start at time t0 with a small random network (e.g., m0 nodes), and then at every step we add a new node x with m edges connecting x to the nodes already in the graph. The probability of connecting x to an existing node w follows preferential attachment, in other words it


Figure 1.3: Growth of the network in the Barabási-Albert model. Starting from a dyad (top-left), at each step a new node (the white one) is added to the network, and two new edges are added following preferential attachment. A few highly connected hubs emerge in the network.

is proportional to its degree:

$$P(w) = \frac{\deg(w)}{\sum_{x \in V} \deg(x)}.$$

After t time steps the model converges to a random network with t + m0 nodes and mt edges (as shown in Figure 1.3).
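A minimal sketch of this growth process is shown below (a simplification, not the exact Barabási-Albert construction): keeping a list that repeats every node once per incident edge makes uniform sampling from it realize P(w) = deg(w)/Σ deg(x).

```python
import random

# Sketch of preferential-attachment growth: new nodes attach to m
# existing nodes with probability proportional to their degree.
def preferential_attachment(n, m, seed=0):
    rng = random.Random(seed)
    edges = []
    targets = list(range(m))          # the initial seed nodes
    for x in range(m, n):             # growth: one new node per step
        chosen = set()
        while len(chosen) < m:        # m distinct degree-biased targets
            chosen.add(rng.choice(targets))
        for w in chosen:
            edges.append((x, w))
            targets += [x, w]         # both endpoints gain one degree
    return edges

edges = preferential_attachment(n=5000, m=3)
degree = {}
for u, v in edges:
    degree[u] = degree.get(u, 0) + 1
    degree[v] = degree.get(v, 0) + 1
print(max(degree.values()))  # a few hubs far exceed the mean degree ≈ 2m
```

The oldest nodes accumulate most of the connections, producing the heavy-tailed degree distribution described above.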

During the last decade more sophisticated models have been proposed: models that include the effects of adding or rewiring links, models allowing nodes to age so that they can no longer accept new links, and models varying the form of preferential attachment.

1.3.4 Forest Fire

After the Scale Free and Small World models, several other models were introduced to describe and build networks with desirable characteristics. One of the most famous is the Forest Fire model, proposed by Leskovec et al. [165]. This model tries to introduce community structures into the generative process.

In particular, the authors defined the model as follows: nodes arrive one at a time and form out-links to some subset of the existing nodes; to form out-links, a new node x attaches to an existing node v and then begins "burning" links outwards from v. This process has an intuitive counterpart in social networks: a new student arrives at a university, meets some older students, who introduce him to their friends, and the introductions may continue recursively. Since each node joins only entities that are close to its own center of gravity, community structure emerges very quickly.
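The burning process can be sketched as follows. This is a deliberately simplified, undirected variant with a single burning probability p (the original model distinguishes forward and backward burning on directed links), so it illustrates the mechanism rather than reproducing the model exactly.

```python
import random

# Simplified undirected Forest Fire sketch: each newcomer picks an
# "ambassador", links to it, and the fire spreads recursively to each
# neighbor independently with probability p; the newcomer links to
# every burned node.
def forest_fire(n, p=0.35, seed=0):
    rng = random.Random(seed)
    adj = {0: set()}
    for x in range(1, n):
        adj[x] = set()
        ambassador = rng.randrange(x)          # the node x meets first
        queue, visited = [ambassador], {ambassador}
        while queue:
            v = queue.pop(0)
            adj[x].add(v)
            adj[v].add(x)                      # link to each burned node
            spread = [w for w in adj[v]
                      if w != x and w not in visited and rng.random() < p]
            visited.update(spread)
            queue.extend(spread)
    return adj

adj = forest_fire(300)
print(sum(len(nbrs) for nbrs in adj.values()) // 2)  # number of edges
```

Because all of a newcomer's links fall inside one recursively burned neighborhood, edges cluster into communities as the network grows.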


Diffusion on Complex Networks¹

In this Chapter, we present the fundamental subject of the thesis: diffusion models. The work we present summarizes a wider contribution, a survey of representative diffusion methods [287]. We delve into the literature to move past simple contagion models into the more general constellation of ideas loosely described as complex contagion. We review those developments which are pertinent to models of spreading in social systems, focusing on those which capture non-biological spread.

Our objective is to lay out the fundamental models that have driven work on social contagion, and to organize the most prominent and promising recent developments in different fields around an appropriate taxonomy of complexity in the scope of spreading phenomena. The taxonomy we choose is, of course, imperfect, so we take care to discuss how these different forms of complexity are interconnected. We delve into biological spreading models only where we believe they are relevant to information and behavior spreading. In this way, we can reveal under-explored but promising applicability to social spreading. We intend to provide a unifying framework for models introduced by different communities, pointing to similarities in models which may seem superficially different and emphasizing where the different approaches lead to fundamentally different behaviors. When we talk about epidemic disease, we think about contagious diseases caused by biological pathogens, like influenza, measles, chickenpox and sexually transmitted infections, which spread from person to person. However, other phenomena can be linked to the concept of epidemic, such as the spread of computer viruses [258], the spread of mobile phone viruses [119, 276], or the diffusion of knowledge, innovations and products in an online social network [33]. Several elements determine the pattern by which epidemics spread through groups of people: the properties of the pathogen (its contagiousness, the length of its infectious period and its severity), the structure of the network, and the mobility patterns of people.

Fully understanding these processes may well require a knowledge of the individual level information processing that happens through rational and more intuitive and emotional mechanisms. However, here we take the sociological and complex systems perspective that good progress can be made with models that only phenomenologically capture some of this rich internal life, but preserve the relevant behavioral regularities. The fundamental

¹ O. Woolley, A. Sîrbu, L. Milli, G. Rossetti, L. Sanders, D. Knipl, F. Giannotti, D. Pedreschi, "Complex contagion: a review", Technical Report, 2017


interest and strength of this approach are to uncover the dynamics that result from the aggregate behavior of multiple individuals. Of course, good scientific work will strike a balance between complexity at the individual level and the tractability of models, which allows insight into the question at hand. Good models of these spreading processes can help us understand and predict important dynamics in social systems, and perhaps guide interventions at the individual or micro level that lead to desirable system-level outcomes.

Modeling interactions between individuals through the formalism of networks, and transmission as a function that describes the change of a person's state as a probabilistic function of the states of people in its local contact network, is a very successful approach. Most of these models fall under the umbrella of spreading processes. The notion of spreading, or contagion, underpins many widespread phenomena in biological, social, and technological systems. The first study in this field was carried out in the mid-18th century by Bernoulli [23]. In the early 20th century, these ideas were formalized mathematically in the subfield of Mathematical Epidemiology by Kermack and McKendrick [144]. Mathematical models of transmission are applicable to a broad range of different phenomena. The spread of information, innovations [19, 228] and rumors [224, 101] can be modeled as a contagion process. Initially, the models used to understand spreading phenomena were simple, but as their limitations became evident, more complex models were developed. As stated above, herein we track diffusion models from simple contagions to complex contagions.

In addition to the biologically motivated models of dynamical processes on networks, we considered the typically data-driven computer science (machine learning) models that track the spread of information online, and the innovation literature, which is sociological and economic. Furthermore, we also discussed opinion formation models from statistical physics. These models differ from spreading models in the sense that transmission is symmetric, i.e., individuals can influence each other's states, while from the infection perspective the infected individual is not influenced, as pointed out in one of the most comprehensive surveys of statistical physics applied to social systems [36]. However, if we think of contagion and influence models more generally as transitions in the state of an individual as a function of the states of its neighbors and of the system, then discrete-state opinion dynamics models have a straightforward connection to spreading models. Thus, here we surveyed these models and drew the appropriate connections. For a more comprehensive review of opinion models, including continuous opinions, see [252]. As a whole, this review can perhaps best be described as a view into complex contagion from the perspective of computational social science [160].

The structure of this Chapter is as follows. We introduce the common components needed to describe all types of contagion (Section 2.1) and we then present the models and their empirical validation. We start with models where each contact between a susceptible individual and an individual who has been “infected” with information or a behavior leads to a transmission with a probability that is independent of past contacts and the state of the rest of the system (Section 2.2.1). We then continue with models where the probability of transmission depends on the current state of the system beyond the pair of individuals under consideration, and/or its history (Section 2.2.2). As we present these basic models, we make a distinction between mechanistic models of contagion and those that invoke rational strategic choice, following [295]. We also discuss models that


consider the greater complexity of the individuals, beyond that captured by their position in a social network, affecting their susceptibility or the degree of influence they have on others. Similarly, we discuss models that consider how the characteristics or content of a spreading entity can affect the infectivity or other transmission dynamics. In the next Section, we briefly explore the role of transmission mechanisms operating at a scale larger than the pair interaction, which is typically modeled as global exogenous effects (Section 2.2.3). We then discuss models that consider the simultaneous spread of multiple entities (Section 2.2.4). Most simply these are co-spreading phenomena, with either competing or enhancing effects, frequently motivated by the limited attention of individuals. We also consider the interaction between different spreading processes, for instance how the spread of information and behavior as separate entities, or of disease, information and behavior. These models are of special interest to us since they allow for the study of behavioral response to disease outbreaks. We dedicate a Section to the role of time inhomogeneities and temporal networks [127] (Section 2.2.5). At the end of the Chapter, we propose a review of the most used libraries for the simulation of diffusion phenomena (Section 2.3).

2.1 Common components of contagion models

Models of contagion employ a set of distinct mechanisms to simulate behavior seen in real systems. These mechanisms are common to all types of contagion, be it epidemics or opinion dynamics, with only some details changing from one model to another. In the following, we list the main components of the models.

• Agent population. In general, models work with a population of agents representing individuals in society. Agents hold a state described by one or more variables. These variables can be discrete or continuous, scalar or vectorial. Typically, spreading models work with discrete state variables (e.g., susceptible and infected), while opinion models can use either discrete (yes/no) or continuous opinions.

• Social network. A second, very important, common component is the so-called social network. This network takes the above-mentioned agents as nodes, while edges represent relations between agents (typically a social relation, or physical proximity). Nodes connected by an edge can interact with one another, so the network is one of the drivers of contagion. The literature has considered many types of underlying networks, from complete graphs (mean-field approach) and regular lattices to more realistic random graphs with embedded scale-free or community structures.

• Interaction rule. The third element describing contagion models is an interaction rule. The models can be seen as agent-based models, where agents have an initial state which evolves in time according to this rule. The interaction rule defines the next state of an agent, taking into account its previous state(s) and those of its neighbors. Interaction rules range from a straightforward copying action to complex mathematical equations, and time can be discrete or continuous.

• External effects. Many contagion models assume that the agent state depends only on interaction with peers and on the previous state. However, in reality, an important driver of the state of an individual in a population is external information, such as mass media, advertising, the Internet, and other sources. This effect has typically been introduced in models by considering a special individual connected to all others in the population, whose state is static but who can influence the others in the same way peers influence their neighbors.
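The four components above can be wired together in a minimal agent-based skeleton. All names and parameter values in the sketch are illustrative (this is not NDlib's API): a population with discrete states, a network given as adjacency lists, a probabilistic interaction rule, and an optional external field.

```python
import random

# Minimal agent-based contagion skeleton: discrete S/I states, a social
# network as adjacency lists, a probabilistic interaction rule, and an
# optional external field acting like a mass-media source.
def simulate(adjacency, seeds, beta=0.2, external=0.0, steps=25, seed=0):
    rng = random.Random(seed)
    state = {v: "I" if v in seeds else "S" for v in adjacency}
    for _ in range(steps):
        updated = dict(state)                 # synchronous update
        for v, neighbors in adjacency.items():
            if state[v] != "S":
                continue
            # each infected neighbor transmits with probability beta;
            # the external field infects with probability `external`
            infected_nbrs = sum(1 for w in neighbors if state[w] == "I")
            p_stay = (1 - beta) ** infected_nbrs * (1 - external)
            if rng.random() > p_stay:
                updated[v] = "I"
        state = updated
    return state

ring = {i: [(i - 1) % 30, (i + 1) % 30] for i in range(30)}
final = simulate(ring, seeds={0})
print(sum(s == "I" for s in final.values()))
```

Swapping the interaction rule or the state set is enough to move between the model families discussed in the rest of the Chapter.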

2.2 Taxonomy

Based on the various components presented in the previous Section, we propose a taxonomy of the different models, which includes both spreading and opinion dynamics. The first two categories are based on the rules by which pairs of agents in the population interact (independent or dependent, Sections 2.2.1 and 2.2.2 respectively), further divided with regard to the underlying social network. The last three categories are transversal to the first two, i.e., they can include models with both dependent and independent interaction; however, these models must include additional mechanisms: external influence, co-evolution of the social network, or the spreading of multiple entities or topics.

2.2.1 Independent interaction

Here we discuss models that employ independent transmission rules, which is to say that there is a constant probability of transmission with each contact between a susceptible and infectious agent, independent of the history of contacts and state of other neighboring individuals.

The behavior of the well-mixed, deterministic approximation of these systems is well understood. However, the dynamics of simple independent transmission rules in spatially extended systems, and their generalization to less regular network substrates, can produce complex behavior. For example, little is understood about the effect of temporal networks or of higher-order structural properties of the network, such as community structure. Furthermore, the role of noise and stochasticity in finite-size systems, and its interaction with a non-trivial transmission structure, also generates complex dynamics that are hard to characterize fully.

Biological models

A drastic simplification used to understand spreading is the compartmental model ([212, 205, 124, 122, 7]). The basic idea of these models is to divide the population into disjoint groups (compartments), according to a few key characteristics relevant to the process under consideration. The time evolution of an epidemic is then modeled by keeping track of the number of individuals within each compartment. This approach relies on the assumption that populations are fully mixed, meaning that people interact with each other at random and each member of a compartment is treated indistinguishably from the others in that same compartment. These interactions, and in general the transition processes between compartments, are captured by writing nonlinear differential equations for the changes in the number of individuals in the various compartments.

A primary compartmental model that applies to many common infections is the SIR model, where we divide people into those who are susceptible (S), those who are infected (I) and those who have recovered and are immune (R) (see Figure 2.1). In the simplest setting, the total population is assumed constant, but extensions with demographic


Figure 2.1: In the SIR model an individual can be in one of three states: susceptible, infected or removed. The arrow indicates that an individual becomes infected or removed [15].

processes are standard. The corresponding system of ordinary differential equations reads

$$\frac{dS}{dt} = -\frac{\beta S(t) I(t)}{N(t)}, \qquad \frac{dI}{dt} = \frac{\beta S(t) I(t)}{N(t)} - \gamma I(t), \qquad \frac{dR}{dt} = \gamma I(t), \qquad (2.1)$$

where S(t), I(t), and R(t) are the numbers of individuals in each of the three states at time t (N(t) = S(t) + I(t) + R(t)) [212, 205]. Under the assumption of a fully mixed population, β is the average rate at which infective individuals transmit to other individuals per unit time, and γ is the recovery rate. In the particular case γ = 0 we obtain the SI model, which assumes that individuals never recover from the infection. On the other hand, to model diseases where recovery does not confer immunity, one easily turns the above system into a SIS model by moving the term γI(t) to the right-hand side of the first equation and dropping the third equation. Numerous variants of the SIR model have been devised in the literature, for example by specifying further subdivisions such as those who have been vaccinated, those who are receiving treatment, age groups, risk groups, etc. Other models arise if we consider a different term for the transmission process, or assume a different distribution for the time individuals spend in the infected compartment, leading to a non-constant recovery rate. Spatial effects can be incorporated by adding diffusion terms to the equations, or by considering patch models and the underlying network of individuals' mobility. More elaborate extensions involve separating the state I into a sequence of several states (e.g., early, middle, late) and allowing the contagion probabilities to vary across these states [142]. Another variation considers a disease-causing pathogen able to mutate during the outbreak [98].
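The system (2.1) can be integrated numerically; the sketch below uses a simple forward-Euler scheme with the population normalized to N = 1, and illustrative parameter values (β/γ = 2, so the basic reproduction number is 2).

```python
# Forward-Euler integration of the SIR equations (2.1), normalized
# so that N = 1; parameter values are illustrative.
def sir(beta, gamma, s0, i0, steps=10000, dt=0.01):
    s, i, r = s0, i0, 1.0 - s0 - i0
    for _ in range(steps):
        new_inf = beta * s * i * dt      # S -> I flow
        new_rec = gamma * i * dt         # I -> R flow
        s, i, r = s - new_inf, i + new_inf - new_rec, r + new_rec
    return s, i, r

s, i, r = sir(beta=0.5, gamma=0.25, s0=0.99, i0=0.01)
print(s, i, r)  # the population is conserved: s + i + r = 1
```

The run shows the characteristic SIR outcome: the outbreak dies out (i returns to nearly zero) while a finite fraction of susceptibles is never infected.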

Compartmental models from mathematical epidemiology, such as the Susceptible-Infected (SI) model, have been widely applied to the spreading of disease, but also outside of this original domain to non-biological phenomena. The most common choice is the Susceptible-Infected (SI) model, but the SIR model has also been used


Figure 2.2: The adoption of an innovation follows an S curve when plotted over a length of time.

as a way to model the spread of rumors [58, 181]. A comparable SIR model was also applied to study service adoption in Skype [140]. Here the authors showed that the rate of spontaneous service adoption is constant, that the likelihood of adoption via social diffusion is linearly proportional to the fraction of adopting neighbors, and that the rate of service termination is constant in time and independent of the behavior of peers. In [59], another SIR-type model was applied to describe participation in the London riots. In 2013 Wang et al. [276] used the SI model to study epidemics in the context of computer viruses, in which a mobile phone can pass through only two states, infected (I) and susceptible (S), with no possibility of recovering from the infection.

Diffusion of Innovations and Information

The idea that innovations diffuse through a population driven by social influence gained traction in sociology due to the, mostly qualitative, work of Rogers [228]. The Bass Model [19] offers a mathematical representation of this insight. Essentially, it models the rate of adoption of a new product as proportional to the existing number of adopters, in addition to a constant term due to the intrinsic properties of the product. This model assumes a well-mixed population. The diffusion results in a characteristic S-shaped adoption curve, as shown in Figure 2.2.
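The S curve of Figure 2.2 can be reproduced by iterating the Bass dynamics, where the adoption rate combines an innovation term p (intrinsic appeal) and an imitation term q·F(t) driven by the current fraction of adopters F(t). The sketch below is only illustrative; the values of p and q are arbitrary, not estimates from data.

```python
# Discrete-time sketch of the Bass model:
#   dF/dt = (p + q*F) * (1 - F)
# p and q are arbitrary illustrative coefficients.

def bass_curve(p=0.03, q=0.38, dt=0.1, t_max=30.0):
    f, t, curve = 0.0, 0.0, []
    while t < t_max:
        f += (p + q * f) * (1.0 - f) * dt  # innovation + imitation
        curve.append(f)
        t += dt
    return curve

curve = bass_curve()
# Slow start, fast middle, saturation near 1: the classic S shape.
print(round(curve[-1], 3))
```

The adoption increments first grow (imitation kicks in) and then shrink as the market saturates, which is exactly what produces the S shape.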

In the information-spreading literature, one of the most commonly used models is the Independent Cascade Model (ICM) [103, 143]. The ICM is similar to an SI process. Let G be a social network, represented by a directed graph; we will speak of each node as being either inactive or active (an adopter of the innovation). In the ICM we start with an initial set of active nodes A0, and the process unfolds in discrete steps according to the

following randomized rule. When a node v first becomes active in step t, it is given a single chance to activate each currently inactive neighbor w; it succeeds with probability $p_{v,w}$ independently of the history thus far. If w has many newly activated neighbors, their

attempts are sequenced in an arbitrary order. If v succeeds, then w will become active in step t + 1; but if v does not succeed, it cannot make any further attempts to activate w in subsequent rounds. Again, the process runs until no more activations are possible. In the following example we clarify how the method works.


Figure 2.3: Example of diffusion with Independent Cascade Model. Source: goo.gl/6sZ5f8

The network in Figure 2.3 is undirected; therefore we assume $p_{v_i,v_j} = p_{v_j,v_i}$. The numbers on the edges represent, on the right, the probability $p_{v_i,v_j}$ and, on the left, the random number generated during the simulation.

Consider the network at the first step; the ICM starts with a set of activated nodes, in our case only the node $v_1$. The activated node generates a random number for each neighbor; if the random number is less than the corresponding $p_{v_i,v_j}$, the neighbor becomes activated. Following this procedure, in step two the node $v_1$ activates the node $v_2$, but not the node $v_3$. The node $v_1$ cannot make any further activation attempts; the node $v_3$ can now be activated only by its other neighbor, i.e., the node $v_4$. The activation process continues and terminates at step five, when the network becomes stable.
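The rule described above can be sketched as a short simulation. The code below is a generic illustration of the ICM, not the implementation used in NDlib; the toy graph, the uniform edge probability and the function name are arbitrary assumptions and do not reproduce Figure 2.3 exactly.

```python
import random

# Independent Cascade Model sketch: each newly activated node gets a
# single chance to activate each inactive neighbour; failed attempts
# are never retried. Toy graph and probabilities are illustrative.
def independent_cascade(edges, p, seeds, rng):
    active = set(seeds)
    frontier = list(seeds)           # nodes activated in the last step
    while frontier:
        new_frontier = []
        for v in frontier:
            for w in edges.get(v, []):
                if w not in active and rng.random() < p[(v, w)]:
                    active.add(w)    # v succeeds; w activates at t + 1
                    new_frontier.append(w)
        frontier = new_frontier      # v makes no further attempts
    return active

# Undirected toy graph stored as adjacency lists; p is symmetric.
edges = {1: [2, 3], 2: [1, 4], 3: [1, 4], 4: [2, 3]}
p = {(v, w): 0.5 for v, ws in edges.items() for w in ws}
final = independent_cascade(edges, p, seeds={1}, rng=random.Random(42))
print(sorted(final))
```

With all probabilities set to 1 the cascade reaches every node connected to the seeds, and with all probabilities set to 0 it never leaves the seed set, matching the two extreme behaviours of the model.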

The spread of information through different types of online networks has been broadly studied as a simple contagion process using the ICM in many papers, such as [249], [163], and [145], where the ICM is applied to the biggest Russian social network Vkontakte (VK), and [26], where the authors propose a new model based on the ICM for learning diffusion probabilities and perform experiments on a large collection of datasets (Digg, LastFm, Twitter, ...).

Opinion dynamics

Concerning the spreading of opinions, one of the most popular opinion dynamics models is the voter model, originally introduced to simulate the competition of species [47]. In this model, individuals hold an opinion $s_i \in \{\pm 1\}$, which evolves through pairwise interactions with neighbors. Each interaction is independent of previous interactions and of the state of other neighbors. Specifically, at each discrete time step, a random agent chooses a random neighbor and copies his opinion. In this model opinions change locally through a diffusion mechanism [200]; thus it shares many similarities with the SIS (susceptible-infected-susceptible) and independent cascade models. In particular, we can consider the −1 state as susceptible and the +1 state as infected, in a population connected by a social network with adjacency matrix $a_{ij}$. Then the probability that an infected individual i changes the opinion of a susceptible individual j is $a_{ij}/\sum_k a_{kj}$, i.e., the inverse of the degree of node j. The difference is that the transition back to the susceptible state follows the same dynamics, rather than being based on a time interval as in the SIS model.

A significant amount of literature is dedicated to analyzing the voter model on a complete network. Starting from any initial condition, the voter model converges to population consensus on one of the two possible opinions [155]. The question is how the initial condition affects the final result: what is the exit probability E(f), i.e., the probability that the population reaches consensus on state +1 given f, the fraction of individuals in the initial population holding opinion +1? This probability is simply E(f) = f [155]. Another quantity of interest is the consensus time, i.e., the number of time steps required to achieve consensus as a function of the population size N; this scales as N on a complete graph [155].
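The result E(f) = f can be checked empirically with a minimal simulation on a complete graph, where every agent is a neighbor of every other. The sketch below is an illustrative toy, not code from NDlib; the population size and the number of runs are arbitrary small values chosen to keep it fast.

```python
import random

# Voter model on a complete graph: at each micro-step a random agent
# copies the opinion of another randomly chosen agent. Repeating many
# runs from the same initial fraction f of +1 opinions estimates E(f).
def voter_consensus(n, f_plus, rng):
    ops = [1] * int(n * f_plus) + [-1] * (n - int(n * f_plus))
    while len(set(ops)) > 1:          # loop until consensus
        i = rng.randrange(n)
        j = rng.randrange(n)
        while j == i:
            j = rng.randrange(n)
        ops[i] = ops[j]               # agent i copies agent j's opinion
    return ops[0]                     # consensus opinion, +1 or -1

rng = random.Random(1)
runs = 2000
wins = sum(voter_consensus(20, 0.5, rng) == 1 for _ in range(runs))
print(wins / runs)  # should be close to E(0.5) = 0.5
```

The estimate fluctuates around f because the fraction of +1 opinions is conserved on average at every step, which is the standard martingale argument behind E(f) = f.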

Several variations of the voter model have been introduced over the years. Extensions include the confident voter model [274], where a binary confidence level is added to the state of an agent. A confident individual will not change opinion at the first interaction but switches to an unsure state, from which it can then flip state upon a second interaction with an opposite opinion. This model is shown to converge to consensus in a time that grows like ln N, so the introduction of confidence speeds up the dynamics. Apart from the mean-field analyses, the voter model has also been studied on lattices and on more realistic complex networks, with consensus always reached on finite connected networks. On finite d-dimensional lattices, consensus time scales as $N^2$ for d = 1, $N \ln N$ for d = 2, and N for d > 2 [155]. On complex networks, on the other hand, consensus time depends on the first ($\mu_1$) and second ($\mu_2$) moments of the degree distribution of the nodes, scaling as $N\mu_1^2/\mu_2$ [254]. Hence networks with a broad degree distribution reach consensus quickly.

Over the years, several other voter model variations have been introduced, such as the noisy voter model, which adds a probability that an individual flips opinion independently of the neighbors [35]; the voter model with a popularity bias [82], in which voters select the neighbor to interact with based on the neighbor's degree, taken as a proxy for popularity; or the voter model with zealots [292].

2.2.2 Dependent interaction

A commonly made distinction between simple and complex contagion lies in the introduction of transmission rules that go beyond the assumption that the transmission probability from a single infected individual to a susceptible individual is independent of other contacts.

Typically, the motivations for dependent transmission models are:

(i) social pressure to conform;

(ii) interaction between the states of different individuals (for example, through the payoff of adoption).
