A P2P architecture for Distributed Dunbar-based Social Networks

(1)

Dipartimento di Informatica

Dottorato di Ricerca in Informatica

Ph.D. Thesis

A distributed Dunbar-based Framework

for Online Social Network

Barbara Guidi

Supervisor Laura Ricci Supervisor Marco Conti

June 2015

(2)

(3)

(4)

(5)

Online Social Networks (OSNs) are becoming more and more popular on the Web. Distributed Online Social Networks (DOSNs) are OSNs which do not exploit a central server for storing users’ data and enable users to have more control on their profile content, ensuring a higher level of privacy. In this thesis we propose DiDuSoNet, a novel P2P Distributed Dunbar-based Online Social Network where users can exercise full access control on their data. Our system exploits trust relationships over the novel Dunbar-based Social Overlay for providing a set of important social services like information diffusion and data availability. In par-ticular, our system manages the problem of data availability by proposing two P2P dynamic trusted storage approaches. By following the Dunbar concept, our system stores the data of a user only on friend nodes, which have regular contacts with it. Differently from other approaches, nodes chosen to keep data replicas are not stat-ically defined but dynamstat-ically change according to users’ churn. Furthermore, the system provides a new epidemic protocol able to spread social updates in DOSN overlays, where the links between nodes are defined by considering the social in-teractions between users. Our approach is based on the notion of Weighted Ego Betweenness Centrality (WEBC), which is an ego-centric social measure approxi-mating the Betweenness Centrality. The weights considered in the computation of the WEBC correspond to the tie strength between friends so that nodes having a higher number of interactions are characterized by an higher value of the WEBC. The lack of real dataset containing structural and temporal information of users is the main limitation to the research on this field. To fill the gap, we have de-veloped a Facebook application and we have crawled data about more than 300 Facebook users. We have studied the dataset in deep to obtain useful information to characterize OSNs users not only under the structural point of view, but also to understand the user behaviours in term of session length. A set of experimental results, conducted by using our Facebook dataset and an old Facebook Regional Network dataset, proving the effectiveness of our system are presented.

(6)

(7)

Acknowledgement . . . iii

1 Introduction 1 1.1 Outline of the Thesis . . . 8

I

Basic Concepts

11

2 Background and Related Works 13 2.1 P2P Systems . . . 13

2.1.1 The P2P look-up service: DHT . . . 15

2.2 Distributed Online Social Networks . . . 20

2.3 Data management in DOSNs . . . 21

2.3.1 Data Availability and Persistence . . . 22

2.3.2 Information diffusion in DOSNs . . . 26

2.4 Security and Privacy . . . 30

2.5 Privacy breaches . . . 31

2.5.1 Privacy breaches from centralized service providers . . . 31

2.5.2 Privacy breaches from other users . . . 31

2.6 Existing Approaches . . . 33

2.7 Analysis of existing DOSNs . . . 40

3 Complex Networks Analysis for DOSNs 43 3.1 Complex Networks . . . 43

3.2 Small World . . . 44

3.3 Clustering . . . 45

3.4 Degree distribution, scale-free network model and network resilience 46 3.5 Centrality Indexes . . . 48

3.6 Social Networks: the Dunbar’s property . . . 49

3.6.1 Ego Networks . . . 49

3.6.2 The Dunbar circles . . . 50

(8)

II

A P2P Dunbar-based Distributed Online Social

Net-work

53

4 System Model 55

4.1 Profile-Based Communication . . . 56

4.2 The Social Graph . . . 56

4.3 DiDuSoNet: the general architecture . . . 58

4.3.1 Mapping the social graph onto the social overlay . . . 59

4.3.2 The Dunbar-based Social Overlay . . . 60

4.4 The DHT lookup service: Pastry . . . 63

4.5 Analysis of the Dunbar Connections . . . 66

5 Friendly Routing over Dunbar-based Overlays 69 5.1 Friendly Routing . . . 70

5.2 Social Pastry: the DHT structure . . . 71

5.3 Friendly Routing by using goLLuM . . . 73

5.3.1 The Routing Algorithm . . . 74

5.3.2 Privacy . . . 75

5.3.3 Independence . . . 75

5.4 Experimental Results . . . 77

6 Data Availability: a trusted storage approach 79 6.1 Trusted Social Storage . . . 80

6.2 Basic Trust: k-trusted-replicas approach . . . 81

6.2.1 Social Score and Selection Strategies . . . 83

6.2.2 PoS Election Algorithms . . . 86

6.2.3 Dynamism and Points of Storage (PoS) . . . 88

6.2.4 Data Consistence . . . 92

6.3 Full Trust: Network Coverage . . . 93

6.3.1 Social Score Definition . . . 97

6.3.2 Computing Social Coverage . . . 98

7 Information Diffusion on Dunbar-based Overlays 105 7.1 Ego Betweenness Centrality . . . 108

7.2 Ego Betweenness Centrality Computation . . . 110

7.2.1 Ego Betweenness Centrality Computation in directed graphs 112 7.3 Weighted Ego Betweenness Centrality . . . 112

7.4 WEBC-based information diffusion . . . 116

7.4.1 The Information Diffusion algorithm . . . 117

(9)

7.5.1 EBC Broadcast Protocol . . . 119

7.5.2 EBC Gossip Protocol . . . 120

III

System Evaluation

125

8 Experimental Results: The Datasets 127 8.1 The Zhao’s Dataset: general characteristics . . . 129

8.2 SocialCircles! dataset . . . 134

8.2.1 Topology . . . 135

8.2.2 Analysis of the users interactions and ego network properties 150 8.2.3 Analysis of temporal characteristics related to data availability153 9 System Evaluation 163 9.1 PeerFactSim.KOM . . . 163

9.2 Social DHT: Experimental Results . . . 164

9.2.1 Experimental Setup . . . 165

9.2.2 Experimental Results . . . 168

9.3 Data Availability: Experimental Results . . . 173

9.4 Information Diffusion: Experimental Results . . . 183

IV

Conclusion

191

10 Conclusion and Future Work 193 10.1 Thesis Contributions . . . 193

10.2 Future Work . . . 195

A The Facebook application: SocialCircles! 197 A.1 Introduction . . . 197

A.2 What data can we obtain? . . . 198

A.3 The SocialCircles! functionalities . . . 199

A.3.1 The Graph page . . . 200

A.3.2 The Statistics page . . . 201

A.3.3 The Interactions page . . . 202

A.3.4 Friends Map page . . . 203

(10)

A.4 The application structure . . . 204

A.5 The development technologies and tools . . . 205

A.5.1 Front-End technologies . . . 205

A.5.2 Server technologies . . . 207

A.5.3 Database technologies . . . 208

A.6 The server side . . . 208

A.6.1 The filter level . . . 209

A.6.2 The servlet level . . . 209

A.6.3 The background demons level . . . 213

A.6.4 Other demons and maintenance script . . . 214

A.7 The database . . . 215

A.7.1 Service tables . . . 216

A.7.2 Topology and profiles tables . . . 216

A.7.3 Interaction Information tables . . . 219

A.7.4 Online status table . . . 222

A.8 The hosting and the deployment . . . 222

A.9 The application publishing . . . 223

Bibliography 223

(11)

2.1 An Abstract P2P Overlay Network Architecture . . . 14

2.2 Comparison of central server, flooding search, and distributed in-dexing. . . 17

2.3 Interface of a DHT with a simple put/get − interface. . . 17

2.4 State of a hypothetical Pastry node with NodeID 10233102, b = 2. The top row is row zero. The NodeID is coloured with three different colours to indicate the common prefix with 10233102 (blue colour), next digit (green colour), and rest of the NodeID (red colour) 18 2.5 General DOSN’s Architecture . . . 21

2.6 Distributed Online Social Network architecture . . . 22

2.7 Data Management Services . . . 22

2.8 Availability in DOSNs . . . 23

2.9 Information diffusion in structured and unstructured DOSNs. . . . 27

2.10 Privacy violations through contents’ metadata. . . 34

2.11 SafeBook Architecture . . . 35

2.12 Control access in LifeSocial.KOM. . . 37

2.13 Example Cachet objects. . . 39

2.14 Gemstone: system architecture . . . 40

2.15 Current proposed P2P-based DOSNs . . . 41

3.1 Graphical representation of a undirected (a), a directed (b), and a weighted undirected (c) graph with N = 7 nodes and K = 14 links. 44 3.2 Ego network of the red node. . . 50

3.3 Dunbar’s circles. . . 51

4.1 The Dunbar-based ego network of nodes x and y. . . 57

4.2 DiDuSoNet Architecture. . . 60

4.3 The SocialTable of node N (the Additional Table is identical). . . . 62

4.4 The Ego Network of node N and its representation in a node . . . . 62

4.5 Searching friend inside the DHT . . . 65

(12)

5.2 Basic idea of the routing algorithm is to route around blocked links. 74 6.1 (a) X’s ego network EN (X), (b) nodes’ CGX when no P oS(X) is

still elected, (c) nodes’ CGX after D’s election as PoS(X) . . . 86

6.2 Dunbar Ego Network of Node n. . . 89

6.3 PoS re-election for an offline node . . . 89

6.4 Case 1: Ego network of the node n after the disconnection of A. . . 90

6.5 Case 2 . . . 90

6.6 Case 3 . . . 91

6.7 Ego Network of Node n after the reconnection of n. D is not a PoS and the updated data are stored on A and n. . . 92

6.8 PoS dynamism and Data Consistence. . . 92

6.9 Network Coverage through PoSs: an example . . . 95

6.10 Coverage of an edge (A,B) . . . 96

6.11 Local degree computation . . . 98

6.12 Local Degree vs. Global degree . . . 98

6.13 Network Coverage . . . 100

6.14 A SS node goes offline . . . 102

6.15 When an offline node joins the network . . . 104

7.1 Messaging Patterns in an OSN . . . 106

7.2 The social update dissemination problem over the ego network of node u . . . 106

7.3 EBC example. . . 109

7.4 An example of network graph . . . 110

7.5 Ego Network of the node E . . . 111

7.6 Example of the Dijkstra’s algorithm . . . 113

7.7 Weighted Ego Betweenness on Undirected Graphs . . . 114

7.8 A weighted and directed graph . . . 116

7.9 Community Discovery . . . 117

8.1 Interaction Graph: four time windows . . . 130

8.2 Distribution of the contact frequences . . . 131

8.3 Distribution of the contact frequences in the interval [0, 0.1] . . . . 131

8.4 Indegree distribution . . . 132

8.5 Outdegree distribution . . . 133

8.6 Joint Degree Distribution . . . 134

8.7 Distribution of Facebook friends among ego networks. . . 136

8.8 CDF of the number of Facebook friendships. . . 137

8.9 Distribution of average clustering coefficient. . . 138

(13)

8.11 Distribution of Normalized Ego Betweenness Centrality. . . 140

8.12 Distribution of Modularity value. . . 141

8.13 Distribution of discovered communities . . . 142

8.14 Distribution of total cliques among ego networks. . . 145

8.15 Distribution of the median clique size for each network . . . 146

8.16 Distribution of k-cliques in the typical network . . . 147

8.17 Distribution of overlapping cliques (any size) per node for each net-work . . . 148

8.18 Distribution of full and approximate covering set size for each network.149 8.19 Analysis of the users interactions and ego network structure . . . . 152

8.20 Tie Strength and clusters analysis . . . 152

8.21 Temporal availability of all users. . . 155

8.22 Analysis of the temporal properties . . . 157

8.23 Analysis of the Dunbar circles temporal features . . . 158

8.24 Analysis of the temporal matching on Dunbar circles . . . 159

8.25 Percentage of matching . . . 160

8.26 Conditional probability on active network . . . 160

9.1 PeerFactSim.KOM: Architecture . . . 164

9.2 Distribution of incoming and outgoing links for the Zhao’s dataset model . . . 167

9.3 Distribution of incoming links for the Barab´asi-Albert model . . . . 168

9.4 Zhao’s Dataset: Finished Lookups . . . 169

9.5 Barab´asi-Albert Model: Finished Lookups . . . 170

9.6 Small World Graph: Finished Lookups . . . 171

9.7 Experiment A: Quantity of online nodes and points of storages. . . 176

9.8 Experiment A: Ratio of available profiles to total number of profiles. 176 9.9 Experiment A: Quantity of ping-pong messages for different values of alpha. . . 177

9.10 Experiment B: Simulation results, all nodes leave the network with-out selecting another point of storage (failure). . . 178

9.11 Experiment C: Simulation results, half of the leaving nodes select another PoS wheres the other half fails. The simulation time is set to 48 hours. . . 179

9.12 Pure Availability and Friend Availability . . . 181

9.13 Average PoS tie strength and Dunbar ego network . . . 181

9.14 Simulation results . . . 181

9.15 Frequency of the total amount of replicas needed to have a network coverage . . . 183

9.16 Normalised values for BC and EBC for a random network extracted from the data set and composed by 1595 nodes . . . 185

(14)

9.17 Normalised values for BC and EBC for a random network extracted

from the data set and composed by 5000 nodes . . . 185

9.18 Evaluation of how often the EBC computation is required in a net-work of 1000 nodes and 150 time instants. . . 186

9.19 Variation between EBC and WEBC for random networks obtained by the dataset . . . 187

9.20 Replica Distribution in FoF . . . 188

9.21 Number of replica according to the community dimension in a net-work of 4000 nodes. . . 189

9.22 Number of replica according to the community dimension in a net-work of 10000 nodes . . . 190

A.1 The index page. . . 200

A.2 The graph page. . . 201

A.3 The statistics page. . . 202

A.4 The interactions page. . . 203

A.5 The map page. . . 204

A.6 High level application design: a three-tier structure. . . 205

A.7 The server side design. . . 210

A.8 Database schema - Topology and profiles . . . 217

A.9 Database schema - Interaction information . . . 220

(15)

6.1 Table of Symbols used for the Basic Trust Approach . . . 82

6.2 Table of Symbols used in this section . . . 94

7.1 Centrality Metrics for the graph in figures 7.4: betweenness central-ity (BC), closeness centralcentral-ity (CC) and ego-betweenness centralcentral-ity (EBC). . . 111

8.1 Social Graph characteristics. . . 129

8.2 Refined dataset: characteristics. . . 130

8.3 Pearson Correlation . . . 133

8.4 Some properties of different group of networks according to total clique size. . . 143

8.5 Dunbar circles analysis for networks with k-opt = 4. 95% confidence intervals are reported in square brackets. . . 153

9.1 Simulator Setup . . . 165

9.2 Link distribution for the Barab´asi-Albert model and the Small World graph . . . 172

9.3 Average hop counts for different routing methods, friendship models and levels of trust . . . 173

9.4 Simulator Setup. . . 175

9.5 Network properties . . . 183

9.6 Number of Messages of the both protocols by varying the number of nodes . . . 186

(16)

Published Papers

1. DiDuSoNet: A P2P architecture for distributed Dunbar-based so-cial networks.

Guidi, B., Amft, T., De Salve, A., Graffi, K., and Ricci, L. Peer-to-Peer Networking and Applications, 1-18 (2015).

2. Trusted Dynamic Storage for Dunbar-Based P2P Online Social Networks.

Conti, M., De Salve, A., Guidi, B., Pitto, F., and Ricci, L. In On the Move to Meaningful Internet Systems: OTM 2014 Conferences (pp. 400-417). Springer Berlin Heidelberg.

3. Epidemic Diffusion of Social Updates in Dunbar-Based DOSN. Conti, M., De Salve, A., Guidi, B., and Ricci, L. In Euro-Par 2014: Parallel Processing Workshops (pp. 311-322). Springer International Publishing. 4. Distributed protocols for ego betweenness centrality computation

in DOSNs.

Guidi, B., Conti, M., Passarella, A., and Ricci, L. In Pervasive Computing and Communications Workshops (PERCOM Workshops), 2014 IEEE Inter-national Conference on (pp. 539-544). IEEE.

5. HyVVE-A Voronoi based Hybrid Architecture for Massively Mul-tiplayer On-line Games

Ricci, L., Genovali, L., and Guidi, B. In DCNET 2013, 4-th International Conference on Data Communication Networking, Reykjavik, Island. (pp. 15-23) - BEST PAPER AWARD.

6. Managing Virtual Entities in MMOGs: A Voronoi-Based Approach. Ricci, L., Genovali, L., and Guidi, B. In E-Business and Telecommunications (pp. 58-73). Springer Berlin Heidelberg.

7. P2P architectures for distributed online social networks.

Guidi, B. In High Performance Computing and Simulation (HPCS), 2013 International Conference on (pp. 678-681). IEEE.

8. GoDel: delaunay overlays in P2P networks via Gossip.

Baraglia, R., Dazzi, P., Guidi, B., and Ricci, L. In Peer-to-Peer Computing (P2P), 2012 IEEE 12th International Conference on (pp. 1-12). IEEE.

(17)

9. The impact of user’s availability on On-line Ego Networks: a Face-book analysis.

Andrea De Salve, A., Dondio, M., Guidi, B., and Ricci, L. Computer Com-munications, Special Issue on Online Social Networks. Elsevier (2015). 10. FRoDO: Friendly Routing over Dunbar-based Overlays.

Amft, T., Guidi, B., Graffi, K., and Ricci, L. 40th IEEE Conference on Local Computer Networks (LCN), 2015, Clearwater Beach, Florida, USA.

(18)

It is hard to find the words to thank all the people that supported me during my PhD. PhD is a beatiful experience, a sort of journey in which you can enrich yourself.

First of all, I would like to say thanks to my supervisors: Prof. Laura Ricci, which has been really important for me and without her guidance and persistent help this dissertation would not have been possible; and Marco Conti. Both of them have given me the opportunity of realizing this dream.

I would like to thank Andrea De Salve who worked with me on some aspects of this thesis.

Furthermore, I would like to thank my parents, my boyfriend, friends and col-leagues who encouraged and advised me during my PhD. Without their continuous support the success of my PhD would not have been possible.

Finally, I would like to say thanks to me, because I never thought of being able to finish my PhD after the ordeal I had to face during this experience. Instead, we know, the impossible always seems impossible until it’s done!

(19)

Introduction

An Online Social Network (OSN) is defined in [1] as an online platform that pro-vides services for a user to build a public profile and to explicitly declare the connection between his/her profile and those of the other users; an OSN enables a user to share information and content with selected users or to define the in-formation as public; furthermore, it supports the development and usage of social applications enabling the interaction of the user with both his/her friends and strangers.

The currently popular OSNs are centralized which means they are based on centralized servers storing all the information of the users. This centralized struc-ture has several drawbacks including scalability, privacy, and dependence on a provider [2]. In particular, in recent years, the rise and quick development of social networks have led to important phenomena: the rapid spread of informa-tion and the user privacy disclosure. Social networks have become the epicentre and main channel of individual privacy disclosure. These problems have moved researchers to investigate alternative solutions with respect to the centralized one. Facebook is one of the most well-known OSN and it is the better scenario to understand the privacy problem arising in OSNs. During the last years, there have been some concerns expressed regarding Facebook and the usage of private data of its users. Two students from the Massachusetts Institute of Technology (MIT) used a script to download information of over 70,000 Facebook profiles from four schools (MIT, NYU, the University of Oklahoma, and Harvard University) as part of a research project on Facebook privacy published in 2005 [3].

In August 2007, a small fraction of the code used to generate Facebook web pages was made public1_{. As explained in [4], a configuration problem on a}

(20)

book server caused the PHP code to be displayed instead of the web page the code should have created. This problem raised concerns about how secure private data on the site was. A visitor copied, published and later removed the code from his web forum, claiming he had been served and threatened with legal notice by Facebook.

In February 2008, a New York Times article2 advises that Facebook does not provide any mechanism to close user accounts, and raised the concern that, as a consequence, private user data would remain stored on Facebook’s servers. Re-cently, Facebook gives users the possibility to deactivate or delete their accounts. Deactivating an account does not remove user data, because the account can be restored later, while deleting it will remove the account ”permanently”. However, some data submitted by that account (posts or messages) will remain. Facebook’s Privacy Policy now states: ”When you delete an account, it is permanently deleted from Facebook.” [4].

In May 2015, a report3_{commissioned by the Belgian Data Protection Authority}

revised policies and terms-of-use and concluded that Facebook gives users a false sense of control over their data privacy. Facebook has been tracking on a long-term basis, users who visit any page belonging to the Facebook.com domain. Facebook has the power to link internet users’ browsing habits to their real identity, social network interactions and sensitive data including private real information such as medical information, religious, sexual and political preferences. It is in a unique position compared to most of the other cases of so-called “third-party tracking”. In particular, the report affirms that Facebook has the ability to track users’ activity outside Facebook, mainly through the spread of social-plugins, “Like” buttons, and through new forms of mobile tracking. It affirms that Facebook now gathers information through these plugins regardless of whether the buttons are used. It said Facebook’s acquisition of Instagram and WhatsApp has allowed Facebook to collect more kind of user data, which enables more detailed profiling. The report also said that it is impossible to add information on Facebook that may not later be used for targeting advertisements and that Facebook’s privacy settings were less clear in relation to the collection and use of data by Facebook itself or by third-parties like application developers.

We can also illustrate other scenarios which are not strictly related to users data privacy protection, as all the scenarios presented above, but on Internet free-dom as a human right. Facebook has been banned in some countries, such as in China. In Tunisia, the government hacked into and stole passwords from citizens’

2_{http://www.nytimes.com/2008/02/11/technology/11facebook.html?_r=0} 3_{https://www.huntonprivacyblog.com/tag/belgium/}

(21)

Facebook accounts. In May 2011, Syrian activists noticed that the telecommunica-tions ministry was tapping into Facebook activity. We can think that the problem is related to the Facebook site, instead others Social Networks, such as Twitter has been blocked as well. Twitter was inaccessible in Egypt on 25 January 2011 dur-ing the Egyptian protests. In 2009, durdur-ing 2009 Iranian presidential election, the Iranian government blocked Twitter due to fear of protests being organised. On 20 March 2014, access to Twitter was blocked when a court ordered that ”protection measures” be applied to the service.

All presented scenarios represent the main motivation which have lead re-searchers to evaluate a distributed platform for implementing an Online Social Network.

A Distributed Online Social Network (DOSN) [2] is an online social network implemented on a distributed information management platform, such as a network of trusted servers, P2P systems or an opportunistic network.

During the last years DOSNs have been argument of several works from both academic researchers and open source communities. By decentralizing OSNs, the concept of a service provider is changed, as there is no single provider but a set of peers that take on and share the tasks needed to run the system. This has several consequences: in terms of privacy and operation, no central entity that decides or changes the terms of service exists. Moving from a centralized web service to a decentralized system also means that different system models become possible: using one’s own storage or cloud storage, exploiting delay-tolerant social networks, and P2P networks, to name some of them. Since the social data is stored on the peers, their availability depends on the online behaviour of peers.

The first big project in this area has been Diaspora [5]. Diaspora was founded in 2010 by four students, who started to work on the project after being motivated by a speech given by Columbia University law professor Eben Moglen. In his speech, Moglen described centralized social networks as ”spying for free”, and in a New York Times interview, Salzberg, the co-funder of Diaspora, said ”When you give up that data, you’re giving it up forever ... The value they give us is negligible in the scale of what they are doing, and what we are giving up is all of our privacy”. The group decided to address this problem by creating a distributed social network. Diaspora is the first decentralized approach based on a decentralized federation of servers. In the last years fully decentralized solutions based on the P2P model [6] have been proposed for Online Social Network.

Decentralizing the existing functionalities of online social networks requires finding ways for distributing storage of data, propagating updates, defining an overlay topology and a protocol enabling searching and addressing, robustness against churn, etc. We discuss these problems in more detail below.

(22)

infrastructure dynamism. The social dynamism concerns social relationships which can change due to the variation of relations between users and of the number of users of the DOSN. This kind of dynamism is presented also in centralized OSN, but in DOSN it has an impact on the structure of the underlying overlay. The infrastructure dynamism is related to the underlying overlay network. Nodes may join/leave the underlying network, so that the corresponding user has a state that indicates its availability (online/offline). In different snapshots of the overlay, the number of available connections may change in the term of number of active links.

• Data availability/persistence. Decentralizing the social network, data should be stored at some nodes to be always available. The main questions are which mechamisn should be used, for example replication and where data should be stored, for instance exclusively at nodes run by friends, or at random nodes. The requirement for redundancy to provide availability of data depends to a large extent on the duration and distribution of nodes availability online. • Scalability. Mapping a social graph onto a distributed network can be very

expensive due to the number of social links for each node, so the cost of mirroring the social network links into distributed network links can be high. And it can be very inefficient, because as discussed previously, most of the social links are inactive (two friends who rarely interact online).

• Topology. Nodes should be connected according to their social connections in order to cluster friends in the overlay network. This should facilitate operations as information diffusion or data storage. As a downside, this would limit the availability and robustness of data access.

• Updates. This issue is related to how to deal with updates, i.e. status updates of friends. In DOSNs, with distributed storage and replication and a potential need for scalability, the mechanisms for the diffusion of data are fundamental and each update from one user must be diffused to all its social connections.

• Privacy. Data can be stored not only on the node of profile owner, but also on others. The classical requirements for security should be refined to consider the context of DOSN.

Most of proposed P2P Distributed Online Social Networks focus on the privacy problem. Some approaches are peer-to-peer based, for example Safebook [7], Life-Social [8, 9], PeerSoN [10], Cachet [11] and My3 [12]. Safebook is an architectural approach that aims to solve privacy issues focusing on communication anonymisa-tion. LifeSocial provides a plugin/based architecture in which user information is

(23)

stored in a DHT and is accessible from various plugin-based applications; PeerSoN exploits a two-tier system architecture for content organization and for communi-cations, where users store their information on their local devices; Cachet relies on structured architectures where users’ data could be stored and replicated on different peers. Finally, My3 uses an unstructured approach where users’ data are hosted on a set of self-chosen trusted peers among their friends.

When we consider a distributed structure, such as a peer-to-peer solution, it is necessary to consider other problems, for example nodes have a high fluctuation which can lead to data becoming unavailable or even lost. Data availability, as well as Information Diffusion, are important problems that have to be addressed in a DOSN.

Most of proposed systems use classic encryption techniques, such as symmetric or asymmetric encryption, to guarantee privacy and anonymity. The usage of encryption is led mainly by the DHT, which stores user data on untrusted nodes. These techniques are useful but, in a such dynamic and large environment, they can be potentially expensive. For example, consider a system in which data are encrypted using a conjunction of symmetric and asymmetric cryptography, a user encrypts its symmetric data with its private key and then, the key has to be encrypted with the public key of each friend. This means that each user has to generate a huge amount of keys and has to manage all of them. The attribute-based technique [13] is more efficient than the others, but it is not completely suitable in DOSNs. Another important point is regarding to the nature of a data. In a DOSN, a profile can be split into two different categories: private and public data. Public data are those that can be accessible by everyone (e.g. name and surname), instead private data has to be accessible only by friends. For these reasons, public data cannot be encrypted, but the same cannot be applied to private data. As just said before, the problem of these systems is the usage of DHTs, where private data are stored everywhere in the network and they have to be encrypted to guarantee privacy.

The goal of this thesis is to present a new Distributed framework for Online Social Network which is able to prevent user data violation by considering trust instead of encryption. By using a Friend-to-Friend network for storing private data, the above encryption problem is partially resolved. Trust is widely accepted as a major component of human social relationships. In general, trust is a measure of confidence that an entity will behave in an expected manner, despite the lack of ability to monitor or control the environment in which it operates [14]. In OSNs, trust is a critical determinant of sharing information and developing new relationships ( [15], [16]). Trust is also important for successful online interactions. In online systems such as eBay and Amazon, trust is based on the feedback on past interactions between members [17, 18]. In this sense, trust is relational. As two

(24)

members interact with each other frequently, their relationship strengthens, and trust evolves through a positive or negative increase based on their experience.

We propose DiDuSoNet (Distributed Dunbar-based Social Network), a system based on an important social concept, which permit us to guarantee trust: the Dunbar circles. Dunbar [19] explains that human brain has cognitive limit to the number of people with whom one can maintain stable social relationships. This concept is used to build a novel Dunbar-based Social Overlay, where each node maintains connections only to trusted nodes according to the strength of the relation. Furthermore, each node maintains only a limited number of connections according to the Dunbar’s number (which is about 150).

Our system is logically organized on two tiers: the Dunbar-based Social Over-lay and a P2P lookup service, implemented through a DHT. Each user directly connects to its Dunbar friends, which are the friends with whom it has a stronger relation. This enables to limit the number of outgoing connections of a user to at most 150 connections. Furthermore, since most Dunbar relationship are recipro-cal, this limits also the number of incoming connections. The DHT is exploited to allow the bootstrap of the nodes in the system, the search of new friends, and to store information useful for higher level services. The usage of a DHT causes a reduction to the level of trust provided from our system. For this reason, we propose a novel DHT called Social Pastry, which is based on Pastry and guarantee a friendly-routing functionality. Due to the constraints introduced in the routing table of Pastry to obtain Social Pastry, the efficiency of the routing is reduced. This reason is the main motivation that has led us to the introduction of a general friendly-routing algorithm, which is independent from the network overlay and can be used above structured and unstructured overlays.

DiDuSoNet provides a set of social services needed for the data management in a DOSN. These services are needed to manage data availability and information diffusion. We provide two replication based approaches for the data availabil-ity problem, which use the trust of nodes to elect storage points among network friends. The trust of a node is evaluated by considering their structural and tem-poral properties. In particular, we evaluate the user behaviour and the degree of a node to understand which nodes are suitable to be elected as storage nodes. The first approach guarantees a high availability of data by using 2-replicas of each users’ profile on its Dunbar friends and it provides a functionality to manage the data consistence of replicas.

The second approach is based on the concept of network coverage we propose and it it guarantees that a node can retrieve the profile of a friend through trusted connections, i.e. connections with common friends.

Furthermore, we provide an information diffusion service by using an epidemic diffusion algorithm. We propose a new metric, called Weighted Ego Betweenness

(25)

Centrality and we use the metric to guide the diffusion over the Dunbar-based Social Overlay. Our main goal is to guarantee the availability of data in our system. Both our approaches exploits structural and temporal properties of each node.

The lack of real datasets containing temporal information about users be-haviour represents the main limitation to the research in this field. To fill this important gap, we decided to implement a Facebook application to collect use-ful information, such as the social graph of each user, the interactions between users which are needed to build the Dunbar-based Social Overlay, and information about the online sessions of users. We have analysed in deep our dataset to obtain information about the structure of the social graph and the users’ behaviour. Fur-thermore, we have obtained information about the evolution of particular charac-teristics of Facebook users by comparing our dataset analysis with others proposed in literature. We have used the real datasets to evaluate our system, in particular the data availability service.

The system evaluation has been conducted by testing the three major novelties proposed in this thesis. We have evaluated our friendly routing algorithm and we have demonstrated that our approach provides a more efficient solution than a friendly routing through a Social DHT.

Furthermore, we have evaluated our data availability solution with a real dataset and we have demonstrated that we are able to guarantee a high availability with only 2 replicas of each profile. Finally, we have evaluated our information diffusion service and the Weighted Ego Betweenness Centrality metric used in the algorithm. We are able to demonstrate the efficiency of our epidemic algorithm.

To recap, the main research contributions of this thesis are:

• the definition of a novel Social Overlay based on an important social con-cept, the Dunbar’s circles. The Dunbar-based Social Overlay represents the baseline of all the social services of our system.

• the proposal of a friendly routing by exploiting the Dunbar-based Social Overlay. We have proposed a general algorithm which is independent from the underlying overlay and which guarantees a trust-preserved routing. • the proposal of two data availability approaches based on structural and

temporal properties of the users. The first approach guarantees a high avail-ability with a fixed number of replica and provides a mechanism to guarantee the consistence of replicas. The second approach provides an original repli-cation scheme based on the network coverage.

• the definition of a new centrality metric, the Weighted Ego Betweenness Centrality (WEBC), which is used to guide the diffusion over the

(26)

Dunbar-based Social Overlay. We propose two distributed protocols for the WEBC computation and an epidemic protocol for the information diffusion.

• the evaluation of the system with an up to date real dataset. We have implemented a Facebook application to retrieve real data and we have studied the dataset to validate our assumptions.

1.1 Outline of the Thesis

The thesis is conceptually organized in four separate parts:

Part I: Basic Concepts in which we introduce the reader to the world of Distributed Online Social Networks and Complex Networks Analysis, putting the basis for understanding the following parts. In particular:

• Chapter 2 reviews the current state of Distributed Online Social Networks. We describe the current proposals and the most important challenges, such as data availability and information diffusion.

• Chapter 3 describes the properties of Complex Networks and important metrics which are used in our work.

Part II: A P2P Dunbar-based Distributed Online Social Network in which we introduce our Dunbar-based P2P Distributed Online Social Network and the Social Services which are provided by the system. In particular:

• Chapter 4 introduces DiDuSoNet by showing the system model and in particular, our novel Social Overlay based on the Dunbar’s concept.

• Chapter 5 describes an improvement proposed for the Pastry DHT, which we use in our system. We propose a Social Pastry in which the routing table is built in according to social links. We introduce a general algorithm which can be used over each kind of overlay (structured and unstructured) and which provides a friendly routing.

• Chapter 6 describes the data availability management service provided by our system. We propose two approaches based on the concept of trust. • Chapter 7 describes the information diffusion management service provided

by our system. We propose a diffusion algorithm based on the Ego Between-ness Centrality.

(27)

Part III: System Evaluation in which we show our experimental results and a detailed analysis of the two used real Facebook datasets. In particular:

• Chapter 8 provides a detailed analysis of a dataset related to a Facebook Regional Network. Furthermore, we show and analyse the structural and temporal properties of our Facebook dataset obtained by SocialCircles!, our Facebook Application.

• Chapter 9 describes the evaluation of all our novel approaches, in details the evaluation of Social Pastry and our goLLuM algorithm; the evaluation of our data management services: data availability and information diffusion. Part IV: Conclusions which concludes the thesis. In particular, we have:

• Chapter 10 presents the conclusions of the thesis and the possible improve-ments planned as future works.

• Appendix A presents an overview of SocialCircles!, the Facebook applica-tion.

(28)

(29)

(30)

(31)

Background and Related Works

In this chapter we review the main background notions related to the research field of Distributed Online Social Networks. We introduce an overview of P2P systems, of current DOSNs, and of the main challenges for the development of a DOSN, such as data availability and information diffusion.

2.1 P2P Systems

A P2P network is a distributed network composed of a large number of distributed, heterogeneous, autonomous, and highly dynamic peers in which participants share a part of their own resources, such as processing power, storage capacity, software, and files contents [20]. The participants of the P2P network can act as a server and a client at the same time. They are accessible by other nodes directly via the logical overlay links, without passing through intermediary entities (client/server) and, this represents one of the most important advantage of using a P2P system at application layer. P2P systems build and maintain overlay networks at the application-layer, assuming the presence of an underlying network layer, which assures connectivity among any pair of nodes that can communicate through a self-organizing and fault tolerant network topology. A P2P system has particular properties [21], such as:

• High Degree of Decentralization. Based on the degree of decentralization, the P2P systems are classified into two categories: hybrid systems and purely de-centralized system. In a purely dede-centralized P2P system, there is no central entity and peers are able to implement both client and server functionality, and most of the system’s states and tasks are dynamically allocated among the peers.

(32)

• Self-organization. Self-organization is required for reliability and to manage continuous peers joining and leaving. Node joining and leaving is handled in a distributed manner. When a node joins the network, it builds connections with other nodes to become an active participant. When a node leaves, it is removed from the network and the overlay is updated to maintain con-nected all the others nodes. Furthermore, when a node fails without warning, connected node needs to update their routing and neighbouring connections. • Scalability. This concept refers to the ability of a system to continuously evolve in order to support a growing amount of peers. This means that P2P systems are able to maintain the system’s performance attributes indepen-dent from the number of nodes or documents in the network.

• Reliability. This concept denotes the ability of the network to deliver its services even when one or several nodes fail.

• Churn. Churn represents the situation where a number of nodes join, leave, and fail in a continuous and rapid way.

Figure 2.1: An Abstract P2P Overlay Network Architecture

Figure 2.1 shows an abstract P2P architecture as presented in [22]. The Network Communications layer describes the network characteristics of devices (sensor-based or computers) which are connected over the Internet (or wireless). The Overlay Nodes Management layer is responsible about the management of peers into the P2P network. It includes the discovery of peers and routing al-gorithms. The Features Management layer covers the security, reliability, fault resiliency and aggregated resource availability aspects of maintaining the robust-ness of P2P systems. The Services Specific layer supports the underlying P2P infrastructure and the application components through scheduling of parallel and

(33)

computation-intensive tasks, content and file management. The Application-level layer is concerned with tools, applications and services that are implemented on top of the underlying P2P overlay infrastructure.

As said before, P2P systems can be classified as centralized and purely decen-tralized. In centralized systems, such as Napster [23], the index, which supports the retrieval of the information, is centralized, but data are exchanged directly between peers. A basic problem of this approach is that, in case of a failure of the node that stores all the information about the network, all the network fails as well. Also, the system has a poor performance when the number of nodes is increased. Not fully centralized P2P systems have a dedicated controller node which is responsible to maintain the set of participating nodes and controls the system. Instead, in a purely decentralized P2P system, there are no nodes with particular responsibilities and that represent the critical point for the operation of the system. They have no inherent bottleneck and can potentially be resilient to failures, attacks, and legal challenges. In some purely decentralized P2P systems, nodes with plenty of resources and high availability, act as super-nodes. These super-nodes have additional responsibilities, such as acting as a rendezvous point for nodes behind firewalls, storing state or keeping an index of available content. This kind of P2P systems are generally referred as hybrid P2P systems.

P2P systems maintain an overlay network, which is a layer on top of the phys-ical network connections, which creates transparent services to handle the virtual topology of the network. Overlay networks offer services for the communication and connection handling between the peers in the network, including peers that are not directly connected.

Furthermore, there is an additional distinction of P2P systems between structured and unstructured ones. In a unstructured P2P system [22], there are no con-straints on the links between different nodes, and therefore the overlay graph does not have any particular structure, so nodes can occupy any position within the overlay network.

In a typical unstructured system, a new node joins the network and establishes its initial connections in a random way starting at the bootstrap node. As ex-plained in [24], in a structured overlay, each node has a unique identifier in a large numeric key space. Identifiers are chosen in a way that makes them uniformly distributed in that space. The overlay graph has a specific structure and the iden-tifier determines the position of a peer within the structure and constrains its set of overlay links.

2.1.1 The P2P look-up service: DHT

A Distributed Hash Table (DHT) provides a global view of data distributed among nodes in a network, independent of the actual location. As referred in [25], a

(34)

Dis-tributed Hash Table manages data by distributing it across a number of nodes and implementing a routing scheme which allows efficient look up the node on which a specific data item is located. In contrast to flooding based searches in unstruc-tured systems, each node in a DHT becomes responsible for a particular range of data items. Also, each node stores a partial view of the whole distributed system which effectively distributes the routing information. Based on this information, the routing procedure typically traverses several nodes, getting closer to the des-tination with each hop, until the desdes-tination node is reached. As reported in [25], DHTs have the following characteristics:

• each node has a partial view of the whole system by managing only a small set of links to other nodes. Generally, these are O(log N) references, where N depicts the number of nodes in the system.

• by mapping nodes and data items into a common address space, routing to a node leads to the data items for which a certain node is responsible. • queries are routed via a small number of nodes to the target node. Because

of the small set of references each node manages, a data item can be located by routing via O(log N) hops.

• by distributing the identifiers of nodes and data items nearly equally through-out the system, the load for retrieving items should be balanced equally among all nodes.

• because no node plays a distinct role within the system, the formation of hot spots or bottlenecks can be avoided.

• the departure or dedicated elimination of a node should have no considerable effects on the functionality of a DHT. Therefore, DHTs are considered to be very robust against random failures and attacks.

• if a data item is stored in the system, the DHT guarantees that the data is found.

The following table (figure 2.2), proposed in [26], compares again the main characteristics of three well-known approaches in terms of complexity, vulnerabil-ity and query abilvulnerabil-ity. According to their complexvulnerabil-ity in terms of communication overhead, per node state maintenance, and their resilience, DHT shows the best performance unless complex queries are not vital.

DHTs follow a proactive strategy for data retrieval. In comparison, routing in unstructured systems is not related to the location of specific data items but only reflects connections between nodes.

(35)

Figure 2.2: Comparison of central server, flooding search, and distributed indexing.

In a DHT system, each data item has an identifier (ID), a unique value from the address space. This value can be chosen freely by the application, but it is often derived from the data itself via a collision-resistant hash function, such as SHA-1. Each document index is expressed as a (K, V ) pair, K is called keyword, can be the hash of a file name (or description of document) of the hash value, V is the actual file storage node IP address (or description of the other nodes). The generic operations are: the put function, which accepts an identifier and arbitrary data and it is used to store the data (on the node responsible for the ID); and, symmetrically, the get function retrieves the data associated with a specified identifier (figure 2.3).

Figure 2.3: Interface of a DHT with a simple put/get − interface.

During the years, several DHT variants have been proposed. In the following, we introduce Pastry, which is used in our system.

Pastry

Pastry [24] is a structured P2P overlay network which uses a prefix routing to build a self-organizing decentralized overlay network. Each node in the Pastry network has a unique numeric identifier: (NodeID ). The NodeID is used to give a position in a circular NodeID space, which ranges from 0 to 2128−1 and, the set of NodeIDs is assumed to be uniformly distributed in the 128−bit space. The NodeID is assigned at random when a peer joins the system. For a network of

(36)

N peers, Pastry routes to the numerically closest peer to a given key in less than log2bN steps under normal operation (where b is a configuration parameter with

typical value is 4). The NodeIDs and keys are considered a sequence of digits with base 2b_{. In each routing step, a node normally forwards the message to a peer}

whose NodeID shares with the key a prefix that is at least one digit longer the key shares with the current peer NodeID. If such a node is not known, the message is forwarded to a node whose NodeID shares a prefix with the key as long as the current node, but is numerically closer to the key than the current node.

Each Pastry peer maintains a routing table, a Neighbourhood Set, and a Leaf Set. The routing table is organized in log₂bN rows with 2b− 1 entries each. The 2b− 1

entries at row n of the routing table refer to a peer whose NodeID shares the current peer’s NodeID in the first n digits, but whose (n + 1)th digit has one of the 2b _{− 1 possible values other than the (n + 1)th digit in the current peer’s}

NodeID. The choice of N and b determines the size of the routing table. Larger b increases the routing table size but reduces the number of hops. Each entry in the routing table contains the IP address of peers whose NodeID have the appropriate prefix, and it is chosen according to the proximity metric. If no node is known with a suitable NodeID, then the routing table entry is left empty. The choice of b involves a trade-off between the size of the populated portion of the routing table (approximately (log2bN ) ∗ (2b− 1) entries) and maximum number of hops required

to route between any pair of peers (logBN ).

In figure 2.4 is shown the routing table of a hypothetical Pastry node, whose ID is 10233102. The entries in row j refer to a node whose ID shares the NodeID 10233102 only in the first j digits.

Figure 2.4: State of a hypothetical Pastry node with NodeID 10233102, b = 2. The top row is row zero. The NodeID is coloured with three different colours to indicate the common prefix with 10233102 (blue colour), next digit (green colour), and rest of the NodeID (red colour)

(37)

The Neighbourhood Set maintains information about nodes that are close to-gether in terms of network locality. It is not normally used in routing messages, but it is useful to maintain locality properties.

Finally, the Leaf Set L is the set of nodes with the |L|/2 numerically closest larger NodeIDs, and the |L/2| numerically closest smaller NodeIDs. It is used during the message routing, as explained below.

The primary goal of the routing algorithm is to quickly locate the node responsible for a particular key. Pastry routing works as follow:

1. Given a message with a particular key K, the node first checks if K falls within the range of NodeIDs covered by its Leaf Set. If so, the node is forwarded directly to the destination node, namely the node in the leaf set whose NodeID is closest to the key.

2. If the key is not covered by the leaf set, then the node checks the routing table and the message is forwarded to a node that shares a common prefix with the key by at least one more digit.

3. If the appropriated entry in the routing table is empty or the associated node is unreachable, then the message is forwarded to a node that shares a prefix with the key at least as long as the current node, and it is numerically closer than the current node’s id.

As explained in [24], this routing procedure always converges, because each step takes the message to a node that either (1) shares a longer prefix with the key than the current node, or (2) shares as long as a prefix with, but is numerically closer to the key than the local node.

Nodes Leave/Failure Nodes in Pastry may fail or depart without warning. Routing table is handled by periodically exchanging keep alive messages among neighbouring nodes.

Node failure is detected when its neighbours in the NodeID space can no longer communicate with it. When it occurs, it is necessary to repair all leaf sets of its neighbours. A neighbour contacts the live node with the largest index on the side of the failed node, and asks that node for its leaf table.

Another important aspect of Pastry’s routing is locality. Pastry’s notion of network proximity is based on scalar proximity metric, such as the number of IP routing hops or geographic distance. It is assumed that each node can determine the distance of a node with a given IP address to itself. A node with lower distance value is assumed to be more desirable.

(38)

2.2 Distributed Online Social Networks

A Distributed Online Social Network (DOSN) [2] is an Online Social Network implemented on a distributed information management platform. Authors in [2] propose the reference architecture of a DOSN, which could consist of six layers (Figure 2.5) and provides an architectural abstraction of variety of current related approaches to decentralized social networking in the research literature.

The lower layer of this architecture is the physical communication network, which can be the Internet or another physical communication network. The distributed P2P overlay management provides core functionalities to manage re-sources in the infrastructure of the system, which can be a distributed network of trusted servers or a P2P overlay. Specifically, this layer provides services for looking up resources, routing, and retrieving information reliably and effectively among nodes in the overlay. On top of this overlay is the decentralized data man-agement layer, which implements distributed functionalities such as query, insert, and update various persistent objects to the systems. The social networking layer implements all basic functionalities and features that are provided by centralized social networking services. Among these functionalities the most important ones are given in Fig. 2.5, namely the capability to search the system (Distributed search) for relevant information, the management of users and shared space (User account and share space management), the management of security and access control issues (Trust management, Access control and security), the coordination and management of social applications developed by third parties (Application management). The top layer of the architecture includes the user interface to the system and various applications built on top of the development platform provided by the DOSN.

The decentralization of a OSN may be done with different granularity. Pro-posed approaches can be divided into the following categories:

• Federation of servers, which require that social network providers agree upon standards of operations in a collective fashion. Federated social networks are not real P2P system, but they enable users to share their social contents with friends from other OSNs. An example of these kind of DOSNs is Diaspora [5]. • OSNs over unstructured P2P overlays, which have users’ personal data

dis-tributed among multiple peers [27], [28], [29], and [30].

• OSNs over structured P2P overlays, which utilize a DHT approach or social overlay [7, 10, 12, 31, 32].

Decentralizing the existing functionalities of Online Social Networks requires finding ways for distributing storage of data, propagating updates, defining a

(39)

topol-Figure 2.5: General DOSN’s Architecture

ogy and a protocol that enabling search and addressing, robustness against churn, etc.

In the following, we explain in detail, the challenges that we have faced into thish thesis: the data availability problem, the information diffusion and privacy problems.

2.3 Data management in DOSNs

The data management can be seen as a layer over the P2P infrastructure, which includes part of the Social Network Support Layer, in particular services as the Information and Updates diffusion and privacy and security mechanisms; and the Distributed or P2P storage system, which can be referred as the data availability service. The three most important services provided by the data management layer are shown in Figure 2.7. In this section we show the current approaches proposed to manage these issues. In order to explain the data management issue, we can classify the current proposals according to two solutions, which emerge as dominant for DOSNs in term of topology, as shown in figure 2.6.

In the first solution, we consider all systems in which the DOSN topology is integrated into the P2P overlay. The second solution, shown in figure 4.2, is a two tiers structure in which data are stored in a separated layer. System has two levels, in which the first level is the P2P infrastructure, which is used for routing and search functions. The second level is the Social Topology, or Social Overlay, as called in [33].

(40)

(a) Single Tier (b) Two Tiers

Figure 2.6: Distributed Online Social Network architecture

Figure 2.7: Data Management Services

A Social Overlay is a logical overlay in which peers are connected to known peers, as explained in [33]. An edge between a pair of nodes indicates that a tie exists between two adjacent nodes.

2.3.1 Data Availability and Persistence

One of the main challenges in decentralization comes from guaranteeing availability of the data when the owner of the data is not online.

When we talk about data availability, we talk about an important challenge in distributed systems. Without a central point of storage, data are distributed among all nodes in the network. When a node leaves the network, its data should remain available in the network. As a matter of fact, the data of a user become no more available when the user disconnects from the social network, even if its host is not shut down.

It is important to notice that, in the following, we will use the term user availability and host availability in an interchangeable way.

In the context of DOSNs, we consider availability of user’s data, called content, that can be seen as the digital representation of a user, which is stored on a com-puting device and can be transmitted from one device to another. In the context of DOSNs, two types of data availability are generally considered: Pure availabil-ity and Friend Availabilavailabil-ity [34] (figure 2.8). The Pure Availabilavailabil-ity measures the fraction of time a piece of users’ data is available for other users. However, in a DOSN, people interested in the data of a given user are primarily its friends. So the Friend Availability, which measures the fraction of time a user’s data is

(41)

available when its friends are online, is a good metric for this scenario.

Figure 2.8: Availability in DOSNs

The main technique used to resolve the problem of the content availability in DOSNs is the Replication [35]. Replication is a well-known technique in a distributed system and it is a technique based on the storage of the same data on different storage devices.

Several proposals use DHT for storing data, as explained in section 2.6. By using the DHT, data are always available, but a generic user’s profile is stored on unknown nodes. Instead, by using a social overlay, data can be stored on specific peers which can be chosen by the user itself. With a social overlay, users have more control over their data. When a social overlay is exploited, storage nodes are chosen by exploits different characteristics like temporal information. In particu-lar, one of the well-known solution is to store data on friend nodes. [36] finds the limitation of storing data only on friends on the data availability. Authors show that the problem of obtaining maximal availability while minimizing redundancy is NP-complete and proposed greedy data placement heuristics to improve the data availability

In [37] the Erasure Codes (ECs) is proposed to manage data availability, which has been proven to be more efficient in terms of redundancy than replication.

Replica selection policies

One most important goal in a replication-based data management system is to decide where data should be stored. To manage this challenge, most of the current data availability management proposals provide replica selection policies. In [38] three different replication policies are studied. The first one increases the ability by choosing, as replica locations, the user friends which maximize the avail-ability of the user profile. The second approach prioritizes the most active friends

(42)

for placing the replicas, where active means that a user is mostly online and, in the third approach the storage nodes are randomly chosen.

In [34] are proposed, evaluated and compared further three policies: • No replication, only the user provides its own data.

• Direct replication only, the data is made available by the user and its friends, which receive a copy of data directly from the user itself.

• Indirect replication, friends of a user can collaborate with other friends. The indirect replication increases the availability but reduces the trustworthiness.

The work presented in [39] introduces My3, which is mainly focuses on the distributed storage layer. In this work, several replication strategies are introduced: • Minimize the number of replicas, this approach aims to minimize the storage

and replica management overhead.

• Minimizing update propagation delay, the algorithm minimizes the update propagation delay which is the delay in time between the time instance an update occurs on a user profile at one of the replicas, and the instance the update reaches all the other replicas;

• Minimizing the access cost, the approach minimizes the access cost incurred in accessing a user’s profile. It assigns the nearest trusted node connected to a user in the online time graph;

• Maximizing the replication gain, the approach quantifies the replication gain of a subset of trusted nodes and explores the entire solution space to pick the right set with the minimum effective cost.

In [40] an efficient replication selection policy is proposed to select the set of storage nodes. It considers three aspects:

• online time which represents the average online probability;

• social relation which is currently an absolute measure of either being a friend or not;

• user experience, which is computed in regarding to the suitable of the storage nodes to be again storage nodes.

(43)

User Behaviour Analysis and Availability Prediction

Studies concern availability prediction of users of a P2P system based on their past behaviour are much less than those on their generic availability. Several dis-tributed storage systems are proposed from classical file sharing application, but OSNs are different in term of user’s behaviour. In [41], the time spent online by users of Bebo, MySpace, Netlog and Tagged has been studied through statistical analysis and authors discovered that a Weibull distribution accurately models the behaviour of 80% of users. Session lengths and the number of sessions follow power law distributions. In [42] the Orkut, MySpace, LinkedIn and Hi5, OSNs have been studied. Authors concluded that session lengths follow a heavy-tailed distribution, while inter-arrival times a Log-normal distribution.

In [43] the users availability in MySpace has been studied. Authors show that users availability is not only dependent of the daytime, but also correlated to the presence of their friends on the platform.

This last work is an interesting study for the data persistence problem, but the lack of a dataset with both temporal and structural information is the actual lim-itation in this research field. As a matter of fact, the online patterns of OSN users are more discontinuously than in traditional decentralized applications [41]. Most of DOSN proposals don’t consider the problem of data availability by con-sidering the system as stable or by concon-sidering external storages. Instead, in the last years, the data availability problem in a DOSN has become more popular and the replication approach is widely used.

In [43], the purpose is to find peers who behave according to a given availabil-ity pattern. In particular nodes seek partners for two different types of problems known as disconnection matching (a peer looking for a partner who will disconnect about the same time) and presence matching (a peer looking for a partner who will be online with its in the future). For both of these purposes it’s necessary to predict the future users behaviour based on their past behaviour. The authors propose to use a simple binary predictor that predicts the online presence of a peer in a time window of 10 minutes: the prediction is based on the number of days in the past week in which the peer was online in the same time window. If the peer was online in the same time slot in at least 5 days out of 7 of the last week, then it is estimated that the peer will be online again. The authors use a real dataset collected from eDonkey to verify the effectiveness of predictors. The dataset contains about 12 million peers, whose predictability is not very high. However authors can extract a reduced set of nodes with predictable behaviour filtered in according to a set of parameters. This set contains only 19600 peers, but the knowledge of the online traces of a limited number of peers can allow to improve the performance of different applications. In [38] an empirical study of the various system properties of DOSNs and the parameters that influence them

(44)

have been studied. Authors explained that one of the most important parameters to provide availability is the online time of a user. By considering the replication of the user profile of an user u on a set of trusted friends Ru, let OTu the online

time period of an user u, the profile of u is accessible by an arbitrary user v only if ∃j ∈ Ru such that OTv∩ OTj 6= 0. They model the online times based on user

activities in three different ways:

• Sporadic, which assumes that a user is online several time a day sporadically (20 minutes session length)

• Continuous-Fixed Length, all the users in the network are assumed to be online, each day of the week, during a continuous time window of a fixed length

• Continuous-Random Length, each user randomly chooses its own length of the online time window from the range [2, 8] hours.

The Sporadic model is considered the most realistic.

Authors explain that another important parameter is the replication degree: higher is the replication degree, more is the level of potential exposure of personal infor-mation to others.

2.3.2 Information diffusion in DOSNs

Social information systems have seen a rapid growth in recent years reaching a huge amount of contents and interactions. The task of disseminating information plays a key role in these systems because they are seen as an infrastructure for sharing information. Given the huge amount of content produced, such systems must enable the dissemination of information between users and improve it by personalized content dissemination on specific or popular topics. In a profile-based system, the user’s profile requests are directly done by the other nodes in the network. Instead, when we consider the profile update propagation (i.e. the news feed of Facebook), it’s important to send updates by considering the topology and eventually privacy policies.

Data dissemination techniques have been developed in many networking fields as sensor networks, mobile networks, and distributed networks. The existing meth-ods include gossip protocol, epidemic routing, probabilistic routing based on pre-diction, distributed caching, and Content Distribution Networks.

OSN’s users produce huge amount of information and the propagation of this information to the destination has to be well coordinated so as to reduce informa-tion overload, duplicainforma-tion, latency and to ensure quality. Disseminainforma-tion of updates can be seen as a publish/subscribe (pub/sub) problem wherein users are publishers and their friends subscribers.

(45)

Usually information dissemination in structured distributed systems takes place through pub/sub methods which rely on rendezvous node to bring information from publisher to his subscribers, but rendezvous nodes may be hotspots of the system because they could act as bottlenecks to application performance and raise scalability issues. Pub/Sub methods on structured distributed systems are more sensitive to the dynamism of the network because structure increases both com-plexity and the overhead needed for joining the system.

Gossip-based methods are used for spreading information in unstructured and structured distributed systems. These methods are well suited to unstructured systems because they are less susceptible to user’s churn and more robust to fail-ures. Gossip-based methods for information diffusion ensure with a high probabil-ity that a node interested in a content, eventually gets that content. The reliabilprobabil-ity of this methods depend on the parameters of the epidemic algorithms. Usually it is possible to adjust this parameters to achieve different level of reliability despite failure and dynamic network topology. Furthermore unstructured distributed sys-tems mitigate the overhead needed for joining the system because the absence of structure reduces both complexity and susceptibility to dynamism.

Unstructured System Structured System Hybrid System Information Diffusion Pub/Sub Gossip-based Robustness Hotspots Robust Scalability Replication Scalable Churn

More susceptible Less susceptible

Congestion Reliability

High reliability Different levels of reliability

Less susceptible More susceptible

Figure 2.9: Information diffusion in structured and unstructured DOSNs. Authors in [44] define a overlay-independent pub/sub system for social net-works. Nodes use attenuated Bloom filters to create an area of attraction (black-hole) for messages moving around them. These black-holes allows both to attract