Academic year: 2021

Discovering and managing dynamic communities in Distributed Online Social Networks

Michienzi Andrea

Supervisor: Prof. Ricci Laura

Doc. Guidi Barbara

Co-Supervisor: Prof. Monreale Anna

Department of Computer Science

Università di Pisa

This dissertation is submitted for the

Master Degree in Computer Science

Contents

List of figures vii

List of tables ix

1 Introduction 1

2 Related Work 9

2.1 P2P Systems . . . 9

2.1.1 The P2P look-up service: DHT . . . 12

2.2 Decentralized Online Social Networks . . . 16

2.2.1 Open Issues . . . 17

2.3 DOSNs: current proposals . . . 18

2.3.1 DHT-based proposals . . . 18

2.3.2 SO-based proposals . . . 20

2.3.3 External Resource-based Solutions . . . 22

2.4 Data availability . . . 24

2.4.1 Data Availability in DHT-based DOSN . . . 25

2.4.2 Data Availability in SO-based DOSNs . . . 25

2.4.3 Availability in External Resource-based Solutions . . . 26

2.4.4 Replica selection strategies . . . 27

2.5 Community detection on static networks . . . 30

2.5.1 Community definitions . . . 30

2.5.2 Static community discovery algorithms classification . . . 32

2.6 Community detection on dynamic networks . . . 35

2.6.1 Dynamic Networks representations . . . 35

2.6.2 Dynamic Community Discovery . . . 37

2.6.3 DCD approaches classification . . . 38


2.6.5 Temporal Trade-off Communities Discovery . . . 41

2.6.6 Cross-Time Communities Discovery . . . 43

2.6.7 Discussion . . . 45

3 Towards the usage of dynamic communities in DOSNs 49

3.1 An overview on the data availability problem in DOSNs . . . 49

3.1.1 Scenario . . . 50

3.1.2 Current SO-based data availability solutions . . . 53

3.2 The importance of community discovery in Social Networks . . . 55

3.3 Dynamic communities for data availability in DOSNs . . . 56

4 The algorithm SONIC-MAN 61

4.1 TILES: an online algorithm for dynamic communities . . . 61

4.2 A deep walkthrough of SONIC-MAN . . . 63

5 Experimental results 69

5.1 Instruments used . . . 69

5.1.1 Peersim: a P2P simulator . . . 69

5.2 A Facebook case study . . . 72

5.2.1 The Facebook application SocialCircles! . . . 72

5.2.2 An analysis of temporal information . . . 74

5.2.3 An analysis of topological information . . . 77

5.3 An Instant-optimal dynamic community detection study . . . 80

5.3.1 The discovery phase . . . 80

5.3.2 The matching phase . . . 81

5.3.3 Results . . . 82

5.4 A Temporal trade-off dynamic community detection study . . . 86

5.4.1 Setting up the simulations . . . 86

5.4.2 Redefining the events . . . 86

5.4.3 Results . . . 87

5.4.4 Number and size . . . 87

6 Conclusion and future work 91

6.1 Conclusions . . . 91

6.2 Future works . . . 92

List of figures

2.1 An Abstract P2P Overlay Network Architecture . . . 10

2.2 Comparison of central server, flooding search, and distributed indexing. . . 13

2.3 Interface of a DHT with a simple put/get interface . . . 13

2.4 State of a hypothetical Pastry node with NodeID 10233102, b = 2. The top row is row zero. The NodeIDs are coloured with three different colours to indicate the common prefix with 10233102 (blue), the next digit (green), and the rest of the NodeID (red). . . . 15

2.5 Distributed Online Social Network architecture . . . 17

2.6 SOUP: system architecture. . . 19

2.7 An example of an ego network. The red node is the ego. . . 21

2.8 Prometheus: system architecture. . . 21

2.9 Data Availability in DOSNs . . . 29

2.10 Community transformation . . . 46

2.11 Community Lifecycle . . . 47

2.12 Algorithm Taxonomy . . . 47

2.13 Instant Optimal . . . 48

2.14 Temporal Trade-off . . . 48

2.15 Cross Time . . . 48

3.1 Distributed Online Social Network architecture . . . 51

4.1 Sequence diagram of a user joining the network . . . 65

4.2 Sequence diagram of an ego joining the network . . . 66

4.3 Sequence diagram of a user leaving the network . . . 67

4.4 Sequence diagram of a primary moderator leaving the network . . . 67

4.5 Sequence diagram of a secondary moderator leaving the network . . . 67

5.1 SocialCircles! index page . . . 73


5.3 CDF of the number of sessions for each user in the observed period of time 76

5.4 CDF of the length of all the sessions . . . 76

5.5 A zoom of figure 5.4 showing only the leftmost part . . . 77

5.6 CDF of the session inter-arrival time . . . 77

5.7 A zoom of figure 5.6 showing only the leftmost part . . . 78

5.8 Community events for each time slot of all dynamic communities . . . 85

5.9 Community events for each time slot of selected dynamic communities . . . 85

5.10 Community events of all dynamic communities detected by our algorithm . 89

List of tables

5.1 Statistical measures on number and size of static communities . . . 79

5.2 Statistical measures on number and size of dynamic communities . . . 83

5.3 Statistical measures on community events of all dynamic communities . . . 85

5.4 Statistical measures on community events of selected dynamic communities 85

5.5 Statistical measures on number and size of dynamic communities discovered by our algorithm . . . 87

5.6 Statistical measures on community events of selected dynamic communities 88


1 Introduction

In recent years we have witnessed a significant evolution of the Internet. Thanks to the increasing diffusion of broadband connections and to new technologies, the Web has turned from a mere collection of mostly textual pages into a richer interactive environment. From a more sociological point of view, the Internet has become a system where users can not only search and consume content, but also create and share it with the whole world in a very short time. Social media like YouTube and WordPress are becoming more and more popular but, within this revolution, the killer applications are by far Online Social Networks. An Online Social Network (OSN) is defined in [18] as an online platform that provides services for a user to build a public profile and to explicitly declare the connections between his/her profile and those of other users. An OSN enables a user to share information and content with selected users or to make it public; furthermore, it supports the development and usage of social applications enabling the interaction of the user with both friends and strangers.

The currently popular OSNs are implemented on centralized architectures: they rely on centralized servers storing all the information of their users. This centralized structure has several drawbacks, including scalability, dependence on a provider, and privacy [40]. In particular, in recent years the rise and quick development of social networks has led to two important phenomena: the disclosure of user privacy and the rapid spread of information. Social networks have become the epicentre and main channel through which individual privacy is violated. These problems have moved researchers to investigate alternative OSN architectures with respect to the centralized one.

Facebook is one of the best-known OSNs and the best scenario in which to understand the privacy problems arising in OSNs. During the last years, several concerns have been expressed regarding Facebook and the usage of the private data of its users. Two students from the Massachusetts Institute of Technology (MIT) used a script to download information from over 70,000 Facebook profiles at four schools (MIT, NYU, the University of Oklahoma, and Harvard University) as part of a research project on Facebook privacy published in 2005 [91].

In August 2007, a small fraction of the code used to generate Facebook web pages was made public¹. As explained in [115], a configuration problem on a Facebook server caused the PHP code to be displayed instead of the web page the code should have created. This problem raised concerns about the safety of providing personal data to the service. A visitor copied the code, published it on his web forum, and later removed it, claiming he had been served a legal notice and threatened by Facebook.

In February 2008, a New York Times article² reported that Facebook did not provide any mechanism to permanently close user accounts, and raised the concern that, as a consequence, private user data would remain stored on Facebook's servers. More recently, Facebook has given users the possibility to deactivate or delete their accounts. Deactivating an account does not remove user data, because the account can be restored later, while deleting it removes the account "permanently". However, some data submitted by that account (posts or messages) are not necessarily deleted. Facebook's Privacy Policy now states: "When you delete an account, it is permanently deleted from Facebook." [115].

In May 2015, a report³ commissioned by the Belgian Data Protection Authority reviewed the policies and terms of use of Facebook and concluded that the OSN gives users a false sense of control over their data privacy. Facebook has been tracking, on a long-term basis, users who visit any page belonging to the Facebook.com domain. Facebook has the power to link Internet users' browsing habits to their real identity, social network interactions and sensitive data, including private information such as medical conditions and religious, sexual and political preferences. It is in a unique position compared to most other cases of so-called "third-party tracking". In particular, the report affirms that Facebook has the ability to track users' activity outside Facebook, mainly through the spread of social plugins, for instance "Like" buttons and connected accounts from other social media or services, but also through new forms of mobile tracking. It affirms that Facebook now gathers information through these plugins regardless of whether the buttons are used. The acquisition of Instagram and WhatsApp has allowed Facebook to collect even more kinds of user data, which enables more detailed profiling of users. The report also stated that it is impossible to add information on Facebook that may not later be used for targeting advertisements, and that Facebook's privacy settings are unclear about the collection and use of data by Facebook itself or by third parties such as application developers.

¹ www.foxnews.com/story/2008/06/25/facebook-source-code-leaked-onto-internet
² http://www.nytimes.com/2008/02/11/technology/11facebook.html?_r=0


We can also illustrate other scenarios which concern not the protection of users' data privacy, as all the scenarios presented above do, but Internet freedom as a human right. In some countries, such as China, Facebook has been banned. In Tunisia, the government hacked into and stole passwords from citizens' Facebook accounts. In May 2011, Syrian activists noticed that the telecommunications ministry was tapping into Facebook activity.

One could think that these problems concern Facebook only; instead, other social networks, such as Twitter, have been blocked as well. Twitter was inaccessible in Egypt on 25 January 2011 during the Egyptian protests. In 2009, during the Iranian presidential election, the Iranian government blocked Twitter for fear that protests might be organised. On 20 March 2014, access to Twitter was blocked in Turkey when a court ordered that "protection measures" be applied to the service.

All presented scenarios motivated researchers to evaluate a distributed platform for implementing an Online Social Network.

A Distributed Online Social Network (DOSN) [40] is an online social network implemented on a distributed information management platform, such as a network of trusted servers, a P2P system, or an opportunistic network.

During the last years, DOSNs have been the subject of several works and projects from both academic researchers and open source communities. By decentralizing OSNs, the concept of a service provider changes: there is no single provider but a set of peers that take on and share the tasks needed to run the system. This has several remarkable consequences: in terms of privacy and operation, no central entity that decides or changes the terms of service exists. Moving from a centralized web service to a decentralized system also means that different system models become possible: using one's own storage or cloud storage, exploiting delay-tolerant social networks, and P2P networks, to name a few. The availability of users' social data is now highly affected by the availability of the users on the service, i.e. their online behaviour, because data are stored on users' devices.

The first big project in this area has been Diaspora [Dia]. Diaspora was founded in 2010 by four students, who started to work on the project after being motivated by a speech given by Columbia University law professor Eben Moglen. In his speech, Moglen described centralized social networks as "spying for free", and in a New York Times interview Salzberg, a co-founder of Diaspora, said: "When you give up that data, you're giving it up forever ... The value they give us is negligible in the scale of what they are doing, and what we are giving up is all of our privacy". The group decided to address this problem by creating a distributed online social network. Diaspora is the first decentralized approach based on a federation of servers. In the last years, fully decentralized solutions based on the P2P model [25] have been proposed for Online Social Networks.


Decentralizing the existing functionalities of online social networks requires finding ways to distribute the storage of data, propagate updates, define an overlay topology and a protocol enabling searching and addressing, provide robustness against churn, etc. We discuss these problems in more detail below.

• Dynamism. In a DOSN there are two types of dynamism: social and infrastructure dynamism. Social dynamism concerns social relationships, which can change due to the variation of relations between users and of the number of users of the DOSN. Even though this kind of dynamism is present also in centralized OSNs, in DOSNs it also has an impact on the structure of the underlying overlay. Infrastructure dynamism is related to the underlying overlay network: nodes may join or leave the underlying network as users log in to or out of the service. Observing the overlay network in snapshots taken at different moments in time, the number of active nodes and active links may change.

• Data availability/persistence. When the social network is decentralized, data should be stored at some nodes so as to be always available. The main question is which mechanisms should be used to this aim. For example, we can think of using a replication technique, which raises the problem of where data should be stored: for instance, exclusively at nodes run by friends, or at random nodes. The amount of redundancy required to guarantee data availability depends to a large extent on the duration and distribution of the availability of nodes.

• Scalability. Mapping a social graph onto a distributed network can be very expensive due to the number of social links of each node, so the cost of mirroring the social network links into distributed network links can be high. It can also be very inefficient and impractical because, as discussed previously, most of the social links may be inactive (two friends who rarely interact online).

• Topology. Nodes should be connected according to their social connections in order to cluster friends in the overlay network. This choice should facilitate operations such as information diffusion and data storage. As a downside, it may limit the availability and robustness of data access.

• Updates. This issue concerns how to deal with updates, i.e. status updates of friends. In DOSNs, with distributed storage and replication and a potential need for scalability, the mechanisms for the diffusion of data are fundamental, and each update from one user must be diffused quickly and efficiently to all its social connections.


• Privacy. Data can be stored not only on the node of the profile owner, but also on others. The classical security requirements should be refined to consider the context of DOSNs.
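The replication strategy mentioned under data availability/persistence can be sketched in a few lines. This is our own toy illustration, not the policy of any specific DOSN: each friend is assumed to have an availability score (fraction of time observed online), and the profile is replicated on the k most available friends.

```python
# Toy replica selection: pick the k friends with the highest observed
# online availability as replica holders for a user's profile.
# The friend names and availability scores are illustrative only.

def select_replicas(friend_availability, k=2):
    """Return the ids of the k friends with the highest availability."""
    ranked = sorted(friend_availability.items(),
                    key=lambda kv: kv[1], reverse=True)
    return [friend for friend, _ in ranked[:k]]

friends = {"alice": 0.91, "bob": 0.35, "carol": 0.78, "dave": 0.60}
print(select_replicas(friends, k=2))  # ['alice', 'carol']
```

Restricting the candidates to friends (rather than random nodes) trades some availability for trust, which is exactly the tension discussed above.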

Most of the proposed P2P Distributed Online Social Networks focus on the privacy problem. Some approaches are peer-to-peer based, for example Safebook [39], LifeSocial [73, 74], PeerSoN [Buchegger et al.], Cachet [129] and My3 [120]. Safebook is an architectural approach that aims to solve privacy issues focusing on communication anonymisation; LifeSocial provides a plugin-based architecture in which user information is stored in a DHT and is accessible from various plugin-based applications. PeerSoN exploits a two-tier system architecture for content organization and for communications, where users store their information on their local devices; Cachet relies on structured architectures where users' data can be stored and replicated on different peers. Finally, My3 uses an unstructured approach where users' data are hosted on a set of self-chosen trusted peers among their friends.

When we consider a distributed structure, such as a peer-to-peer solution, it is necessary to consider other problems: for example, a high fluctuation of nodes can lead to data becoming unavailable or even lost. Data availability, as well as information diffusion, are important problems that have to be properly addressed in a DOSN.

Most of the proposed systems use classic encryption techniques, such as symmetric or asymmetric encryption, to guarantee privacy and anonymity. The usage of encryption is driven mainly by the DHT, which stores user data on untrusted nodes. These techniques are useful but, in such a dynamic and large environment, they can be potentially expensive. For example, consider a system in which data are protected by combining symmetric and asymmetric cryptography: a user encrypts its data with a symmetric key, and then that key has to be encrypted with the public key of each friend. This means that each user has to generate a huge number of keys and has to manage all of them. The attribute-based technique [71] is more efficient than the others, but it is not completely suitable for DOSNs. Another important point concerns the nature of the data. In a DOSN, a profile can be split into two different categories: private and public data. Public data are those that can be accessed by everyone (e.g. name and surname), while private data must be accessible only by friends. For these reasons, public data cannot be encrypted, but the same cannot be said for private data. As noted above, the problem of these systems is the usage of DHTs, where private data are stored anywhere in the network and have to be encrypted to guarantee privacy.
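The hybrid scheme described above can be sketched as follows. This is a structural illustration only: XOR with a per-friend shared pad stands in for real symmetric and asymmetric primitives (it is not secure), and all names are our own. The point is the key-management cost: one fresh data key per object, plus one wrapped copy of that key per friend.

```python
import os

def xor_bytes(data, key):
    # Toy "cipher": XOR with a repeating key. NOT secure; placeholder for
    # real symmetric encryption / public-key wrapping.
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

def publish(profile, friend_pads):
    data_key = os.urandom(16)                   # one symmetric key per object
    ciphertext = xor_bytes(profile, data_key)   # encrypt the data once
    wrapped = {f: xor_bytes(data_key, pad)      # one wrapped key per friend
               for f, pad in friend_pads.items()}
    return ciphertext, wrapped

def read(ciphertext, wrapped_key, my_pad):
    data_key = xor_bytes(wrapped_key, my_pad)   # unwrap with own secret
    return xor_bytes(ciphertext, data_key)

pads = {"alice": os.urandom(16), "bob": os.urandom(16)}
ct, wrapped = publish(b"private status update", pads)
assert read(ct, wrapped["alice"], pads["alice"]) == b"private status update"
```

The `wrapped` dictionary grows linearly with the number of friends, and every key rotation or unfriending forces it to be rebuilt, which is the cost the paragraph above refers to.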

More recent approaches put much effort into studying efficient solutions for data persistence and availability: these issues are among the most interesting, yet the most complex to address effectively. Interesting works in this direction are those like DiDuSoNet [76], where the authors propose to exploit existing social connections among users to define storage policies and identify trusted points of memorization for users' profiles and information. The authors study the social graph giving particular importance to a single user's perspective: the Ego Network [55]. This model is well suited for this purpose, representing a local view of the social graph relative to a user (the ego), where only his friends (the alters) and their interconnections are considered.
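The ego network model is easy to state operationally: keep the ego's direct neighbours and only the edges among them. A minimal sketch over a plain adjacency-set dictionary (the graph data is illustrative):

```python
# Extract the ego network of a user from a global friendship graph:
# the alters are the ego's direct friends, and we keep only the edges
# that connect two alters (the ego's own edges are implicit).

def ego_network(graph, ego):
    alters = graph[ego]
    edges = {(u, v) for u in alters for v in graph[u]
             if v in alters and u < v}          # u < v avoids duplicates
    return alters, edges

graph = {
    "ego": {"a", "b", "c"},
    "a":   {"ego", "b"},
    "b":   {"ego", "a", "c"},
    "c":   {"ego", "b", "d"},
    "d":   {"c"},
}
alters, edges = ego_network(graph, "ego")
print(sorted(alters))  # ['a', 'b', 'c']
print(sorted(edges))   # [('a', 'b'), ('b', 'c')]
```

Note that node "d" disappears entirely: it is not a friend of the ego, so it is outside the local view this model deliberately restricts itself to.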

This work is part of the wide research effort involved in the study of the data persistence and availability problem in DOSNs. We did not aim to create and study a new DOSN; instead, we tried to understand which data are needed to design effective and robust systems. By borrowing concepts and ideas from the field of complex network analysis, we aim to design a versatile support for distributed peer-to-peer systems such as DOSNs. Our main goal is to address the problem of data availability and persistence, but the same results and techniques can also be applied to other distributed systems, for instance pervasive computing, IoT, Mobile Ad-hoc NETworks and opportunistic networks. The novelty brought by this work is the usage of the community structure, typical of complex networks, in addressing the problem of data availability.

There were two major steps in this work. The first step is a preliminary, yet large and extensive, study of the community structure of OSNs. This was carried out by deeply analysing the topological and temporal information of the dataset, first separately and then together. This first study uncovered much information, for instance the presence of a daily pattern in users' accesses, and also the fact that the dynamic vision of the OSN (that is, considering only online users) is quite far from the static one (which does not consider any temporal information).

This preliminary analysis showed us that the intuition was good and that further research was needed. To this purpose, we designed a data availability system based on the replication of information and on the concept of dynamic community. We tried to design the system to be completely decentralized and as light as possible, to let it run also on low-end devices. To actually exploit the community structure, this system also needed an online dynamic community discovery algorithm: an algorithm capable of finding communities while the system embedding it runs. To the best of our knowledge, even though there are plenty of algorithms in the literature, based on many different approaches, none of them is distributed. The reason lies in the fact that concepts such as community discovery are mostly used in a data mining context, and are not designed to be used while the system runs. The second step of our research consisted in developing a distributed peer-to-peer algorithm for dynamic community discovery. After the development, we also analysed the communities that the algorithm is able to find, pointing out its strengths and weaknesses. Finally, we included in our research a comparison between the preliminary study and our algorithm's results, with particular emphasis on the events that describe a dynamic community lifecycle.

This dissertation is structured as follows: in chapter 2 we present the state of the art concerning DOSNs and community discovery. We start by presenting DOSNs and some of their implementations, then we go into detail on the availability problem and on how current DOSNs address it. We continue chapter 2 by presenting a possible classification of algorithms for static and dynamic community discovery, pointing out the strengths and weaknesses of each approach. In chapter 3 we tackle the problem of data availability and show in detail how a class of DOSNs copes with it. Then we present the scenario in which we operate and our idea to solve the problem of data availability. In chapter 4 we give a deep walkthrough of the implementation of our distributed algorithm for dynamic community discovery. In chapter 5 we show all the experimental results, including both the preliminary study and the results obtained with our algorithm. Finally, in chapter 6, we draw conclusions and point out some possible future works to improve this idea.


2 Related Work

In this chapter we present the state of the art closely related to our work. In the first section, we introduce the reader to Peer-to-Peer systems, giving basic definitions and stating some properties. Then we present Distributed Hash Tables as Peer-to-Peer look-up services and show one working example: Pastry. In the second section we give a light introduction to DOSNs, also showing some working proposals. Then we go through one of the most important problems in DOSNs: data availability and persistence. Finally, we present the problem of community discovery for static and dynamic networks, showing the various algorithms present in the literature and pointing out their strengths and weaknesses.

2.1 P2P Systems

A P2P network is a distributed network composed of a large number of distributed, heterogeneous, autonomous, and highly dynamic peers, in which participants share a part of their own resources, such as processing power, storage capacity, software, and file contents [134]. The participants of a P2P network can act as a server and a client at the same time. They are accessible by other nodes directly via the logical overlay links, without passing through intermediary entities (as in client/server architectures), and this represents one of the most important advantages of using a P2P system at the application layer. P2P systems build and maintain overlay networks at the application layer, assuming the presence of an underlying network layer which assures connectivity among any pair of nodes, and communicate through a self-organizing and fault-tolerant network topology. A P2P system has particular properties [141], such as:

• High Degree of Decentralization. Based on the degree of decentralization, P2P systems are classified into two categories: hybrid systems and purely decentralized systems. In a purely decentralized P2P system there is no central entity: peers are able to implement both client and server functionality, and most of the system's states and tasks are dynamically allocated among the peers.

• Self-organization. Self-organization is required for reliability and to manage the continuous joining and leaving of peers, which is handled in a distributed manner. When a node joins the network, it builds connections with other nodes to become an active participant. When a node leaves, it is removed from the network and the overlay is updated to keep all the other nodes connected. Furthermore, when a node fails without warning, the connected nodes need to update their routing and neighbouring connections.

• Scalability. This concept refers to the ability of a system to continuously evolve in order to support a growing number of peers. This means that P2P systems are able to maintain the system's performance attributes independently of the number of nodes or documents in the network.

• Reliability. This concept denotes the ability of the network to deliver its services even when one or several nodes fail.

• Churn. Churn represents the situation where a number of nodes join, leave, and fail in a continuous and rapid way.

Fig. 2.1 An Abstract P2P Overlay Network Architecture

Figure 2.1 shows an abstract P2P architecture as presented in [108]. The Network Communications layer describes the network characteristics of devices (sensor-based or computers) which are connected over the Internet (or wirelessly). The Overlay Nodes Management layer is responsible for the management of peers in the P2P network; it includes the discovery of peers and routing algorithms. The Features Management layer covers the security, reliability, fault resiliency and aggregated resource availability aspects of maintaining the robustness of P2P systems. The Services Specific layer supports the underlying P2P infrastructure and the application components through scheduling of parallel and computation-intensive tasks, and content and file management. The Application-level layer is concerned with tools, applications and services that are implemented on top of the underlying P2P overlay infrastructure.

As said before, P2P systems can be classified as centralized or purely decentralized. In centralized systems, such as Napster [107], the index that supports the retrieval of information is centralized, but data are exchanged directly between peers. A basic problem of this approach is that, in case of a failure of the node that stores all the information about the network, the whole network fails as well. Moreover, the system performs poorly when the number of nodes increases. Not fully centralized P2P systems have a dedicated controller node which is responsible for maintaining the set of participating nodes and controls the system. Instead, in a purely decentralized P2P system, there are no nodes with particular responsibilities that represent a critical point for the operation of the system. Such systems have no inherent bottleneck and can potentially be resilient to failures, attacks, and legal challenges. In some purely decentralized P2P systems, nodes with plenty of resources and high availability act as super-nodes. These super-nodes have additional responsibilities, such as acting as a rendezvous point for nodes behind firewalls, storing state, or keeping an index of available content. This kind of P2P system is generally referred to as a hybrid P2P system.

P2P systems maintain an overlay network, a layer on top of the physical network connections, which provides transparent services to handle the virtual topology of the network. Overlay networks offer services for communication and connection handling between the peers in the network, including peers that are not directly connected.

Furthermore, P2P systems are distinguished into structured and unstructured ones. In an unstructured P2P system [108], there are no constraints on the links between different nodes; therefore the overlay graph does not have any particular structure, and nodes can occupy any position within the overlay network.

In a typical unstructured system, a new node joins the network and establishes its initial connections in a random way, starting at the bootstrap node. As explained in [147], in a structured overlay each node has a unique identifier in a large numeric key space. Identifiers are chosen in a way that makes them uniformly distributed in that space. The overlay graph has a specific structure, and the identifier determines the position of a peer within the structure and constrains its set of overlay links.


2.1.1 The P2P look-up service: DHT

A Distributed Hash Table (DHT) provides a global view of data distributed among the nodes of a network, independent of the actual location. As described in [175], a Distributed Hash Table manages data by distributing it across a number of nodes and implementing a routing scheme which allows one to efficiently look up the node on which a specific data item is located. In contrast to flooding-based searches in unstructured systems, each node in a DHT becomes responsible for a particular range of data items. Also, each node stores a partial view of the whole distributed system, which effectively distributes the routing information. Based on this information, the routing procedure typically traverses several nodes, getting closer to the destination with each hop, until the destination node is reached. As reported in [175], DHTs have the following characteristics:

• each node has a partial view of the whole system by managing only a small set of links to other nodes. Generally, these are O(log N) references, where N depicts the number of nodes in the system.

• by mapping nodes and data items into a common address space, routing to a node leads to the data items for which a certain node is responsible.

• queries are routed via a small number of nodes to the target node. Because of the small set of references each node manages, a data item can be located by routing via O(log N) hops.

• by distributing the identifiers of nodes and data items nearly equally throughout the system, the load for retrieving items should be balanced equally among all nodes.

• because no node plays a distinct role within the system, the formation of hot spots or bottlenecks can be avoided.

• the departure or dedicated elimination of a node should have no considerable effects on the functionality of a DHT. Therefore, DHTs are considered to be very robust against random failures and attacks.

• if a data item is stored in the system, the DHT guarantees that the data is found.

The table in figure 2.2, proposed in [161], compares the main characteristics of three well-known approaches in terms of complexity, vulnerability and query ability. In terms of communication overhead, per-node state maintenance, and resilience, the DHT shows the best performance, provided that support for complex queries is not vital.


Fig. 2.2 Comparison of central server, flooding search, and distributed indexing.

DHTs follow a proactive strategy for data retrieval. In comparison, routing in unstructured systems is not related to the location of specific data items but only reflects connections between nodes.

In a DHT system, each data item has an identifier (ID), a unique value from the address space. This value can be chosen freely by the application, but it is often derived from the data itself via a collision-resistant hash function, such as SHA-1. Each document index is expressed as a (K,V) pair, where K, called the keyword, can be the hash value of the file name (or of a description of the document), and V is, for instance, the IP address of the node storing the actual file. The generic operations are the put function, which accepts an identifier and arbitrary data and is used to store the data (on the node responsible for the ID), and, symmetrically, the get function, which retrieves the data associated with a specified identifier (figure 2.3).

Fig. 2.3 Interface of a DHT with a simple put/get interface.
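The put/get behaviour described above can be illustrated with a minimal, single-process sketch (not a real networked DHT): node names and keys are hashed into a common ring, and each key is stored on the first node whose identifier follows the key's position on the ring. The peer names and the 32-bit identifier space are illustrative choices, not taken from any specific system.

```python
import hashlib
from bisect import bisect_right

class ToyDHT:
    """Minimal single-process sketch of a DHT: nodes and keys are mapped
    into one ring, and each node is responsible for the keys that fall
    between its predecessor's position and its own."""

    SPACE = 2 ** 32  # toy identifier space (real DHTs use e.g. 2**128)

    def __init__(self, node_names):
        # Map each node to a ring position via a hash (cf. SHA-1 above).
        self.ring = sorted((self._hash(n), n) for n in node_names)
        self.store = {name: {} for name in node_names}

    @staticmethod
    def _hash(value):
        return int(hashlib.sha1(value.encode()).hexdigest(), 16) % ToyDHT.SPACE

    def _responsible(self, key_id):
        # First node clockwise from the key's position (wrap around the ring).
        ids = [node_id for node_id, _ in self.ring]
        idx = bisect_right(ids, key_id) % len(self.ring)
        return self.ring[idx][1]

    def put(self, key, value):
        node = self._responsible(self._hash(key))
        self.store[node][key] = value
        return node  # in a real DHT this would be a routed message, not a return

    def get(self, key):
        node = self._responsible(self._hash(key))
        return self.store[node].get(key)

dht = ToyDHT(["peerA", "peerB", "peerC"])
owner = dht.put("alice/profile", {"name": "Alice"})
print(owner, dht.get("alice/profile"))
```

Because both put and get hash the key the same way, they always reach the same responsible node; this is the property the routing procedure of a real DHT provides over the network.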

Over the years, several DHT variants have been proposed. In the following, we introduce Pastry, which is used in our system.

Pastry

Pastry [147] is a structured P2P overlay network which uses prefix routing to build a self-organizing decentralized overlay network. Each node in the Pastry network has a unique numeric identifier (NodeID). The NodeID determines a position in a circular NodeID space, which ranges from 0 to 2^128 − 1, and the set of NodeIDs is assumed to be uniformly distributed in the 128-bit space. The NodeID is assigned at random when a peer joins the system. For a network of N peers, Pastry routes to the peer numerically closest to a given key in fewer than log_{2^b} N steps under normal operation (where b is a configuration parameter with typical value 4).

The NodeIDs and keys are considered as sequences of digits in base 2^b. In each routing step, a node normally forwards the message to a peer whose NodeID shares with the key a prefix that is at least one digit longer than the prefix the key shares with the current peer's NodeID. If such a node is not known, the message is forwarded to a node whose NodeID shares with the key a prefix as long as the current node's, but is numerically closer to the key than the current node.

Each Pastry peer maintains a routing table, a Neighbourhood Set, and a Leaf Set. The routing table is organized in log_{2^b} N rows with 2^b − 1 entries each. The 2^b − 1 entries at row n of the routing table refer to peers whose NodeIDs share the current peer's NodeID in the first n digits, but whose (n+1)-th digit has one of the 2^b − 1 possible values other than the (n+1)-th digit of the current peer's NodeID. The choice of N and b determines the size of the routing table: a larger b increases the routing table size but reduces the number of hops. Each entry in the routing table contains the IP address of a peer whose NodeID has the appropriate prefix, chosen according to the proximity metric. If no node is known with a suitable NodeID, the routing table entry is left empty. The choice of b thus involves a trade-off between the size of the populated portion of the routing table (approximately log_{2^b} N · (2^b − 1) entries) and the maximum number of hops required to route between any pair of peers (log_{2^b} N).
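This trade-off can be made concrete with a small back-of-the-envelope computation. The formulas are the approximations given above; the network size of 10^6 nodes is just an example value.

```python
import math

def pastry_table_stats(n_nodes, b):
    """Approximate Pastry routing-table size and hop bound for a network of
    n_nodes peers and configuration parameter b (digits in base 2**b)."""
    rows = math.ceil(math.log(n_nodes, 2 ** b))   # ~ log_{2^b} N rows = max hops
    entries = rows * (2 ** b - 1)                 # populated entries
    return rows, entries

for b in (1, 2, 4):
    rows, entries = pastry_table_stats(10 ** 6, b)
    print(f"b={b}: at most ~{rows} hops, ~{entries} routing-table entries")
```

For 10^6 nodes, moving from b = 1 to b = 4 roughly quarters the hop bound (from about 20 to 5) at the cost of almost quadrupling the table size, which is exactly the trade-off described above.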

Figure 2.4 shows the routing table of a hypothetical Pastry node, whose ID is 10233102. The entries in row j refer to nodes whose IDs share with 10233102 only the first j digits.

The Neighbourhood Set maintains information about nodes that are close together in terms of network locality. It is not normally used in routing messages, but it is useful to maintain locality properties.

Finally, the Leaf Set L is the set of nodes with the |L|/2 numerically closest larger NodeIDs, and the |L|/2 numerically closest smaller NodeIDs. It is used during message routing, as explained below.

The primary goal of the routing algorithm is to quickly locate the node responsible for a particular key. Pastry routing works as follows:

1. Given a message with a particular key K, the node first checks if K falls within the range of NodeIDs covered by its Leaf Set. If so, the message is forwarded directly to the destination node, namely the node in the leaf set whose NodeID is closest to the key.


Fig. 2.4 State of a hypothetical Pastry node with NodeID 10233102, b = 2. The top row is row zero. The NodeID is coloured with three different colours to indicate the common prefix with 10233102 (blue colour), next digit (green colour), and rest of the NodeID (red colour)

2. If the key is not covered by the leaf set, then the node checks the routing table and the message is forwarded to a node that shares a common prefix with the key by at least one more digit.

3. If the appropriate entry in the routing table is empty or the associated node is unreachable, then the message is forwarded to a node that shares a prefix with the key at least as long as the current node's, and is numerically closer to the key than the current node's ID.

As explained in [147], this routing procedure always converges, because each step takes the message to a node that either (1) shares a longer prefix with the key than the current node, or (2) shares a prefix as long as the current node's, but is numerically closer to the key.

Nodes Leave/Failure

Nodes in Pastry may fail or depart without warning. The routing table is maintained by periodically exchanging keep-alive messages among neighbouring nodes.

Node failure is detected when the node's neighbours in the NodeID space can no longer communicate with it. When this occurs, it is necessary to repair the leaf sets of all its neighbours: a neighbour contacts the live node with the largest index on the side of the failed node, and asks that node for its leaf table.

Another important aspect of Pastry's routing is locality. Pastry's notion of network proximity is based on a scalar proximity metric, such as the number of IP routing hops or the geographic distance. It is assumed that each node can determine the distance between itself and any node with a given IP address. A node with a lower distance value is considered more desirable.
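The three-step routing procedure described above can be sketched as a single routing step. This is a simplified, illustrative model, not Pastry's actual implementation: IDs are equal-length base-4 digit strings (b = 2, as in figure 2.4), the routing table is a plain dict mapping row → next digit → NodeID, and lexicographic comparison of equal-length digit strings stands in for numeric comparison.

```python
def shared_prefix_len(a, b):
    """Number of leading digits two equal-length digit strings share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def next_hop(current, key, leaf_set, routing_table):
    """One Pastry-style routing step (IDs as base-4 digit strings, b = 2)."""
    dist = lambda node: abs(int(node, 4) - int(key, 4))  # numeric distance to key
    # Step 1: key within leaf-set range -> deliver to the numerically closest node.
    # (Equal-length base-4 strings compare lexicographically == numerically.)
    candidates = leaf_set + [current]
    if min(candidates) <= key <= max(candidates):
        return min(candidates, key=dist)
    # Step 2: routing table -> a node sharing one more digit with the key.
    row = shared_prefix_len(current, key)
    entry = routing_table.get(row, {}).get(key[row])
    if entry is not None:
        return entry
    # Step 3: fallback -> any known node with a prefix at least as long,
    # but numerically closer to the key than the current node.
    known = candidates + [e for r in routing_table.values() for e in r.values() if e]
    closer = [n for n in known
              if shared_prefix_len(n, key) >= row and dist(n) < dist(current)]
    return min(closer, key=dist) if closer else current

current = "10233102"                      # the node of figure 2.4
leaf = ["10233033", "10233120"]           # toy leaf set
rt = {5: {"3": "10233302"}}               # toy routing table, row 5 only
print(next_hop(current, "10233322", leaf, rt))   # resolved via the routing table
print(next_hop(current, "10233110", leaf, rt))   # resolved via the leaf set
```

Here the leaf set, routing-table contents, and key values are invented for the example; the point is only that each step either extends the shared prefix or reduces the numeric distance, which is why the procedure converges.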


2.2 Decentralized Online Social Networks

A general model for the decentralization of the services of an OSN is defined in [39], where a DOSN is defined as a distributed system including a Social Network (SN), a Social Networking Service (SNS) and a Communication and Transport (CT) level. The SN level provides the common social network functionalities, while the SNS is usually implemented by a P2P network, i.e. a distributed network composed of a large number of distributed, heterogeneous, autonomous, and highly dynamic peers sharing their own resources (processing power, storage capacity, file contents, etc.). The participants of the P2P network can act as server and client at the same time, and their functionalities are accessible by other peers directly, without passing through intermediary entities. Finally, the CT level consists of Internet, mobile or opportunistic infrastructures which are used by the above levels.

A first coarse-grained classification of current DOSNs may be done by considering the solution they adopt for managing social data. The basic distinction regards whether they exploit only the storage of the nodes of the users participating in the social network, or also external storage services. The former class of solutions includes both DOSNs exploiting a Distributed Hash Table for storing social content (DHT-based proposals) and those based on the definition of a social overlay (SO-based proposals). A social overlay is an overlay network that effectively mirrors an underlying social network by constraining communication to pairs of peers whose owners are friends.

In a DHT-based system, data are stored, usually in an encrypted form, in a DHT. Since data placement is guided by a hash function, users cannot control where their data are stored. Instead, in an SO-based system, users can choose "trusted" nodes which store their data and also provide other functionalities, like the diffusion of social updates. Moreover, cryptography can be optional due to the usage of trusted nodes.

Many current DOSNs are actually based on a hybrid architecture: a social overlay which also exploits a DHT for overlay management, i.e. for bootstrapping new users, for look-up services, and to guarantee the complete connectivity of the network.

Due to the advent of cloud services, several DOSNs integrate the P2P layer with external resources, such as cloud storage services. In the following, we will refer to this second class of solutions as External Resource-based proposals. Figure 2.5 presents the main architectural solutions for DOSNs.


Fig. 2.5 Distributed Online Social Network architectures: (a) DHT-based, (b) SO-based, (c) External Resource-based.

2.2.1 Open Issues

Decentralizing the existing functionalities of OSNs requires finding solutions to the new issues raised by the decentralization itself.

• Data availability/persistence. Without a centralized storage, data should be stored at some nodes so as to be always available. The main problem concerns the mechanism that should be used to guarantee data availability in presence of a high level of dynamism. As a matter of fact, dynamism is a well-known characteristic of P2P systems, where peers can enter and leave the system at any time. In a DOSN, it is possible to detect two types of dynamism, at different levels of the system. Since relationships can change due to the variation of the relations between users and of the number of users of the DOSN, the structure of the social graph changes over time. This kind of dynamism is present also in centralized OSNs, but in a DOSN it may have an impact on the structure of the underlying overlay, especially for SO-based proposals. The infrastructure dynamism is related to the churn of users who may join/leave the overlay network. Each user has a state that indicates its availability on the overlay (online/offline) and, in different snapshots of the overlay, the number of available connections may change due to the high dynamism of the overlay network.

• Information diffusion. In a DOSN, mechanisms for the diffusion of data are fundamental. These are needed, for instance, to manage users' updates, i.e. status updates, published posts, comments and so on. Each update from one user must be propagated to all its social connections in a scalable and efficient way.

• Scalability. Mapping a social graph onto a distributed network can be very expensive due to the number of social links of each node: the cost of mirroring the social network links into distributed network links can be high, and it can be very inefficient due to the high number of inactive links.


• Topology. In common P2P applications, such as file sharing, nodes are connected with unknown nodes in the network and exchange content with any other node. Instead, in a DOSN, the information about social relationships is important, and nodes should be connected according to their social connections in order to cluster friends in the overlay network. This facilitates operations such as information diffusion and data storage; as a downside, it may limit the availability and robustness of data access.

• Privacy. Data can be stored not only on the node of the profile owner, but also on other nodes, which can be both known and unknown. The classical security requirements should be refined to consider the context of DOSNs.

• Physical Locality. If we consider an underlying system including also mobile P2P connections, it is possible to take advantage of geographic proximity and its correlation with the interests of the users.

2.3 DOSNs: current proposals

In the following we introduce current DOSN proposals. There are many ways to present DOSNs; here we chose to group the various proposals by the main technique used to store data. This choice is driven by the fact that the way data is stored is of primary importance for our study. The three identified categories are: DHT-based, SO-based, and External Resource-based.

2.3.1 DHT-based proposals

PeerSon [23] is a two-tier architecture in which one tier is implemented by a DHT and serves as a look-up service. The second tier consists of peers and contains the user data, such as user profiles. The DHT stores only the meta-data required to search for users and their data. Social data are stored by users on their own devices, and users can directly exchange contents with each other. PeerSon is the only DOSN in which the direct connection between devices is used: communications between users are direct P2P exchanges when both are online, while an asynchronous messaging support is available when users are not online at the same time.

LifeSocial.KOM [74] is a plugin-based architecture which provides the common functionalities of OSNs. It uses a DHT, specifically FreePastry [Fre] and PAST [48], as data storage. Data are encrypted by users, and only authorized users can access them.


GemStone [169] is a P2P social network system that acts as a middleware to support different OSN applications. GemStone assists these applications by providing a shared social graph, serving profile information and handling message delivery to peers. GemStone is completely decentralized and protects the user's privacy by encrypting all data using an Attribute Based Encryption (ABE) technique [148]. Furthermore, the system provides a data storage solution based on data replication: private data are stored among a set of other nodes called Data Holding Agents (DHAs). Confidentiality is a key element in GemStone. The system allows each user to grant fine-grained access to his confidential data, which cannot be accessed by any entity that does not hold the corresponding decryption key. Data management is strongly dependent on the choice of the topology.

Fig. 2.6 SOUP: system architecture.

DECENT [87] is an architecture based on a DHT which guarantees cryptographic protections for confidentiality and integrity. The DHT is used to store and retrieve data objects. Each object is encrypted using an ABE technique to provide confidentiality, and has three access policies associated with it: attribute-based, identity-based, or a combination of both types. To ensure a higher level of availability, several replicas of each object are stored in the DHT, typically on the neighbour set of the node.

Cachet [129] is an architecture that provides security and privacy: it guarantees the confidentiality, integrity and availability of the user content. In Cachet the DHT is augmented with social links between users. Cachet provides a distributed pool of untrusted nodes to store user personal data. The system uses the distributed hash table as a base storage layer, together with a gossip-based social caching algorithm. In Cachet, user data are protected by DECENT. Cachet uses social caching to manage data availability and information diffusion. The basic idea is that a user performs a few DHT lookups to get presence objects of some social contacts. Then, she identifies those who are online, contacts them to inform them that she is online, and pulls both cached objects and cached recent updates of their mutual social contacts.

SOUP [96] is based on the Pastry DHT and uses replication on mirror nodes to increase data availability. The identifiers of the mirror nodes are stored in the DHT. Each node needs to select the most eligible mirror nodes, and SOUP takes friendship relationships into account. SOUP also provides support for mobile nodes: a mobile node does not join the DHT directly, but uses it through a gateway node that is on the DHT. Finally, in SOUP users encrypt data with an Attribute Based Encryption (ABE) scheme.

2.3.2 SO-based proposals

In the following, we list the SO-based DOSNs. For the sake of clarity, we first explain in detail the concept of social overlay.

A Social Overlay is a logical overlay in which peers are connected to known peers, as explained in [116]. An edge between a pair of nodes indicates that a tie exists between the two adjacent nodes. An Online Social Network (OSN) may be formally described by a social graph G = (V, E), where V represents the set of users and E the social ties between them. When considering the simplest formalization of an OSN, G models relationships self-defined by each user (for instance friendship relations defined by the users in Facebook) and contains an edge between any pair of friends. The ego network [111] is the social network model usually used to model a Social Overlay. The ego network of a user is a structure built around that user (the ego) which contains her direct friends, known as alters, and may also include information about the direct connections between the alters. Formally, each vertex u ∈ V can be seen as an ego, and EN(u) = (V_u, E_u) is the ego network of u, where V_u = {u} ∪ {v ∈ V | (u, v) ∈ E} and E_u = {(a, b) ∈ E | a = u ∨ b = u ∨ {a, b} ⊆ V_u}. N(u) = V_u − {u} is the set of adjacent nodes of u. Figure 2.7 shows the ego network of the red node (the ego), with the blue nodes (its alters) and the relations among them.
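The formal definition of EN(u) translates almost directly into code. The following sketch (purely illustrative, taking an undirected edge list as input) builds V_u, E_u and N(u) for a given ego; the sample graph is invented for the example.

```python
def ego_network(ego, edges):
    """Build EN(ego) = (V_u, E_u) from an undirected edge list.
    Mirrors the definition above: V_u is the ego plus its neighbours,
    E_u keeps every edge incident to the ego or between two alters."""
    neighbours = ({b for a, b in edges if a == ego} |
                  {a for a, b in edges if b == ego})   # N(u) = V_u - {u}
    vertices = {ego} | neighbours                      # V_u
    ego_edges = {frozenset((a, b)) for a, b in edges
                 if a == ego or b == ego or {a, b} <= vertices}  # E_u
    return vertices, ego_edges

# Toy graph: u knows a and b; a and b know each other; c, d are outside EN(u).
edges = [("u", "a"), ("u", "b"), ("a", "b"), ("b", "c"), ("c", "d")]
V, E = ego_network("u", edges)
print(V)   # vertices of EN(u)
print(E)   # edges of EN(u), including the alter-alter edge (a, b)
```

Note that the edge (b, c) is dropped even though b is an alter, because c is not in V_u; this is exactly what the condition {a, b} ⊆ V_u enforces.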

Safebook [39] proposes a three-tier architecture for DOSNs with the main focus on privacy, integrity and availability. Each user in Safebook has a logical concentric structure called a Matryoshka. Matryoshkas are concentric rings of nodes built around each peer which provide trusted data storage and communication obfuscation through indirection. The innermost ring is composed of specific social contacts, such as friends of the core, each of which stores encrypted data. The outermost ring acts as a gateway for all the requests addressed to the core.

Prometheus [97] gathers information about users through social sensors, which are applications running on behalf of the user. Social sensors are able to retrieve information


Fig. 2.7 An example of an ego network. The red node is the ego.

Fig. 2.8 Prometheus: system architecture.

from other sources, such as Facebook; in particular, they retrieve interactions with other users via email, phone, instant messaging and comments on blogs, which are used to create a weighted, directed, labeled multi-edge graph, where vertices correspond to users and edges correspond to interactions between users. Figure 2.8 shows the architecture of the system. Both the social information from sensors and the social subgraphs are stored and maintained in a P2P network. Information from sensors can be decrypted only by “trusted” peers, which are selected by users.

My3 [120] is a privacy-friendly DOSN which exploits well-known properties of online social networks, for instance the locality of users. The system allows users to have a fine-grained access control on the content, and exploits trust relationships among users to improve the availability of data in the network. A user's profile is hosted only on a set of self-chosen trusted nodes, called the Trusted Proxy Set (TPS). The TPS of a user is populated according to availability and performance goals, such as the geographical location and the online time period of the candidate nodes. The system uses a DHT for storing the privacy-preserving index of the profile content and other meta information. A user u and his set of trusted nodes (TPS) are stored in the DHT in the form of a (key, value) pair, with the key being UID_u and the value being the members of the TPS. This mapping is useful for contacting the nodes where the profile of a particular user is stored. The user trusts these nodes both for storing his profile content and for enforcing access control on the access requests. The selection of the trusted nodes (TPS) and their churn dynamics are handled according to the geographical location of nodes.

DiDuSoNet [76] is a two-tier system, where the lower level is implemented by Pastry [147] and is used for the bootstrapping phase, for the look-up service, for searching other users, and to retrieve the replica node list. The upper level is implemented by a Dunbar-based social overlay, in which nodes are connected to the nodes with whom the tie strength, computed on the interactions between them, is highest. Users are able to communicate with each other directly, point to point, for sending private messages or private data. The novelty of this system with respect to the others described before is that it is completely based on trust between users. Social data are stored only on trusted nodes, chosen according to Dunbar's number [52]. Each node can choose two replicas to obtain a high level of availability. As soon as a replica becomes unavailable, the remaining replica is able to elect another one among the online nodes. The system manages both voluntary and involuntary disconnections of peers and gives a simple solution to the consistency of replicas.

2.3.3 External Resource-based Solutions

Vis-à-Vis [153] is a privacy-preserving framework based on hierarchical overlay networks of Virtual Individual Servers (VISs). A VIS runs in a paid cloud-computing utility. The main concept of this framework is the group: a group consists of an owner, a set of users defining the group's membership, and a mapping from group members to geographic regions. The framework uses a distributed tree structure to provide efficient, fault-tolerant, and scalable range queries. The initial architecture of Vis-à-Vis was based on a two-tier DHT [154], where the upper tier was used to advertise and search for public OSN groups, while the lower tier was used for indexing the OSN groups in the system. Each user stores his own data on a VIS and can access the location-based groups through clients such as mobile applications and web browsers. Groups are created and administered by a user, and a group owner's public key is used to identify that group. Users are defined by a self-signed key pair, and the private key is stored securely by their VIS, allowing the VIS to act on their behalf. The public key of the group and the IP address of the corresponding VIS are broadcast throughout the network. Each node maintains a subgraph of the location tree, giving it partial knowledge of the group membership and of the members' locations.


Vegas [53] is a hybrid solution in which the proposed DOSN takes advantage of external resources to increase data availability. Vegas has a social overlay represented by a relaxed ego network, in which each ego node knows about the direct connections with its friends, but does not know anything about the relationships between its friends or between its friends and unknown users. The visibility of the social graph is restricted to the relaxed ego network. A new friendship can be established in only two ways: with a face-to-face exchange of the specific public key and the exchanger addresses via an encrypted NFC or Bluetooth connection, or with a simple protocol used only to create a friendship between two nodes with a common friend (closing a triangle). The system allows users to connect with mobile or stationary client devices, and introduces datastores for the management of data availability and for device synchronization. Datastores can be implemented as a web resource like FTP, WebDAV, or a cloud storage. In Vegas, each user holds a unique public key pair for each of his friends, and messages are encrypted. When a user wants to send a message to a friend, he exchanges the specific public key with this friend and chooses one or more exchangers. An exchanger represents the abstract concept of a message queue used to transmit messages; the system supports email, SMS, instant messaging and micro-blogging (Twitter) as exchangers. This system suffers from several limitations: the cumbersome creation of new friendships, the lack of a search operation to find other users, and the communication channels used.

SuperNova [156] is a super-peer based DOSN architecture. Super-peers help bootstrap new peers who do not yet have any friends, maintain a directory of users to find friends by name or interests, and help peers discover other peers to store their content in case they do not have adequate friends or their friends are already overloaded. Super-peers may run either on end-user computers or on cloud services. Each node maintains a logical clock for itself and shares it with all its friends and storekeepers; each group owner also maintains a logical clock for the community. The logical clock is used to keep a node synchronized with its friends and storekeepers, and to keep all the community members synchronized.

In Confidant [106] encrypted data can be stored anywhere, including P2P systems, cloud services, or existing OSNs such as Facebook. Users use PCs or enterprise workstations as storage servers, and create replicas of data on trusted machines controlled by their friends. Confidant relies on two low-cost infrastructures: storage servers and cloud-based name servers. A storage server hosts users' data and authenticates read and write requests; users can select a small number of friends' servers to host replicas of their data. Name servers are assumed to be highly available, and are responsible for maintaining a logical clock and a list of available replicas. Users only store private data on servers controlled by trusted friends. Confidant uses an Attribute Based Access Control (ABAC) [83] technique to encrypt data.

POSN [54] is a Personal Online Social Network based on the combination of phone-to-phone applications with a cloud infrastructure. Each user stores data in a private cloud, and his friends are able to retrieve data from that cloud. In POSN, a new relationship can be created only by using the users' existing contacts, such as phone numbers or email addresses: the platform can exchange emails or SMS messages between users to establish tokens that convey the cloud location and the public keys of friends. In order to provide security and privacy, data is encrypted before being uploaded to the cloud. Every user has a public key, which is exchanged during the friendship establishment and shared through the cloud. When online, users can directly exchange information about common friends and common friends' latest posts to optimize the data distribution.

In PrPl [152] every user runs a person-centric service called Personal Cloud Butler, a personal service that keeps the user's information, organizes his data, and shares them with his friends based on the access-control preferences expressed by the user. Larger data such as videos, photos and music are stored in federated storages called Data Stewards. PrPl assumes that the Butler service and the user's personal storage are deployed on devices with high availability and, for this reason, the system does not provide any replication or caching mechanism. PrPl relies on the OpenID [137] service for identity management: it utilizes federated, decentralized identity management that enables secure logins, single sign-on, and communication among applications in an environment where Butlers belong to different administration domains.

2.4 Data availability

One of the main challenges of distributed systems comes from guaranteeing data availability when the data owner is offline. Even if several techniques have been proposed to solve this problem in a general context, for instance Distributed Caching [183, 90], Dynamic Data Replication [3] and erasure coding [142], traditional approaches cannot be easily adapted to DOSNs because of the different usage patterns [169]. Indeed, users' online sessions follow temporal patterns different from those of other applications, for instance file sharing or content distribution networks [78]. Moreover, users' social data are generally accessed only by specific sets of users, for instance their friends or friends-of-friends. These peculiar characteristics may be exploited for managing social data storage and retrieval. For instance, a user could choose his/her friends for storing his/her private contents, because these nodes are considered trusted, and could exploit the knowledge of their temporal behaviour to choose the best replica.

In the following sections we describe the techniques currently exploited to guarantee data availability in the different kinds of DOSNs introduced in Section 2.3.

2.4.1 Data Availability in DHT-based DOSN

DOSNs which use the DHT for storing social data rely on the underlying DHT replication mechanisms for data availability. The DHT guarantees a high level of availability by distributing and replicating data among the online peers. In this way, data are always available and usually also evenly distributed among the peers, depending on the load-balancing mechanisms of the underlying DHT.

However, several concerns regard the usage of a DHT for storing social data in DOSNs. Since in a DHT the mapping of data to nodes is governed by the hash function, users cannot control it, and their data are stored on untrusted peers. This raises privacy problems similar to those of a centralized solution.

The second concern regards the huge amount of social data generated by current OSNs, which produces a lot of traffic. For example, Facebook users generate 250 million posts per hour and like over 4 million posts every minute 1. With such a high production of social contents, the performance of a DHT is reduced due to the continuous put/get of contents. There are also several latent interactions [89], i.e. passive actions, such as profile browsing, that cannot be observed by traditional measurement techniques; latent interactions affect the traffic of social contents in the network. When a content is stored on a DHT, each access to social data implies a sequence of routing steps to retrieve it. Even if most DHTs define efficient routing algorithms, the traffic on the network increases by a factor proportional to the number of routing hops required to access a content.

Finally, the high dynamicity of DOSNs, due also to daily user sessions which are short and frequent, implies a high maintenance cost for the DHT, in particular a high overhead for maintaining the data structures required by the routing service [180].

2.4.2 Data Availability in SO-based DOSNs

In SO-based DOSNs, data is generally stored on peers selected by exploiting the connections of the social overlay. Since these DOSNs exploit, as storage support, the devices of the participating users, the analysis of the users' behaviour is essential to guarantee a high level of availability. As explained in [14], availability is not well-modeled by a single-parameter distribution, but is instead a combination of two time-varying distributions: short-term daily joins and leaves of individual hosts, and long-term host arrivals and departures. This representation is a consequence of the particular behaviour of users in a DOSN. Indeed, as explained in [41], user sessions are shorter than 20 min (median value of 24 min); only a few user sessions have a long duration, exceeding 2 h. In general, in a DOSN, data availability is managed as a function of the online behaviour of users [42, 35, 157], and replication is both the most suitable and the most used technique to ensure availability. Common P2P replication techniques are not suitable in SO-based DOSNs due to the characteristics of the online behaviour of users, which differ from those of other P2P applications, such as file sharing: the online patterns of DOSN users are more discontinuous than in traditional decentralized applications [78, 41].
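As a toy illustration of availability-driven replica selection (not taken from any of the cited systems), one can estimate each user's hourly online probability from session traces and rank candidate replica nodes by how well they cover the hours in which the data owner is usually offline. The trace format, the user names and the scoring heuristic are all assumptions made for the example.

```python
from collections import defaultdict

def hourly_availability(sessions, days):
    """Estimate, for each user, the fraction of observed days in which the
    user was online at each hour. Sessions are (user, day, start_h, end_h)
    tuples, covering the half-open hour range [start_h, end_h)."""
    online = defaultdict(lambda: [0] * 24)
    for user, _day, start, end in sessions:
        for hour in range(start, end):
            online[user][hour] += 1
    return {user: [c / days for c in counts] for user, counts in online.items()}

def rank_replicas(owner, friends, avail):
    """Rank friend nodes by expected coverage of the owner's offline hours
    (a simple heuristic: sum over hours of P(owner off) * P(friend on))."""
    offline = [1 - p for p in avail[owner]]
    coverage = lambda f: sum(o * p for o, p in zip(offline, avail[f]))
    return sorted(friends, key=coverage, reverse=True)

sessions = [
    ("alice", 0, 9, 17), ("alice", 1, 9, 17),    # owner: online 9-17 every day
    ("bob",   0, 18, 23), ("bob",   1, 18, 23),  # evening user, complementary
    ("carol", 0, 10, 12), ("carol", 1, 10, 12),  # online only when alice is
]
avail = hourly_availability(sessions, days=2)
print(rank_replicas("alice", ["bob", "carol"], avail))  # bob covers alice's offline hours
```

The heuristic prefers bob, whose sessions are complementary to alice's, over carol, who is online only while alice is online herself; this captures the intuition that good replicas are friends with complementary temporal behaviour.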

2.4.3 Availability in External Resource-based Solutions

Data availability in hybrid DOSNs is guaranteed by exploiting external resources, generally cloud resources. Cloud services are a well-known solution because they provide an always-available storage service which can scale up and down to overcome storage capacity issues. External Resource-based DOSNs offer a compromise between the usage of a centralized server and a completely decentralized system.

Several proposals are hybrid solutions which combine distributed storage with cloud services. One of the main motivations for a hybrid solution is to increase the Quality of Service (QoS). In [72], the authors show that a pure distributed storage, like the ones used in SO-based DOSNs, presents a poor QoS. For this reason, [72] proposes F2BOX, which allows the QoS level to be adjusted through two parameters: the target level of data availability and the fraction of data stored in the Cloud.

In [63], a Cloud-assisted data replication scheme is proposed, which uses replication to increase data availability and erasure codes to store data in the cloud. All data replicas are stored in the friend circle, while a number of data segments created by applying erasure coding are stored in the cloud.
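As a toy illustration of the erasure-coding idea exploited by [63]: real systems use stronger codes such as Reed-Solomon, while this minimal sketch is a single-erasure (k+1, k) XOR parity code, with all names being illustrative.

```python
from functools import reduce

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data: bytes, k: int = 3):
    """Split data into k blocks plus one XOR parity block.
    Any k of the resulting k+1 fragments suffice to rebuild the data."""
    block_len = -(-len(data) // k) or 1                    # ceiling division, at least 1
    padded = data.ljust(block_len * k, b"\0")
    blocks = [padded[i * block_len:(i + 1) * block_len] for i in range(k)]
    parity = reduce(xor, blocks)
    return blocks + [parity], len(data)                    # keep original length to strip padding

def decode(fragments: dict, k: int, orig_len: int) -> bytes:
    """fragments maps the fragment index (0..k) to its bytes; one may be missing."""
    missing = [i for i in range(k + 1) if i not in fragments]
    if missing:                                            # rebuild the single lost fragment
        lost = missing[0]
        fragments[lost] = reduce(xor, (fragments[i] for i in range(k + 1) if i != lost))
    return b"".join(fragments[i] for i in range(k))[:orig_len]
```

Losing any one of the k+1 fragments, e.g. a replica stored at an offline friend, still allows full recovery from the remaining k.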

In [154], three schemes for storing data are proposed: a cloud-based scheme, a desktop-based scheme and a hybrid scheme. The hybrid scheme combines the desktop-based scheme, obtained by placing VISs on desktop machines belonging to OSN users, with a cloud-based service. This approach has the potential of combining low cost and high availability.


2.4.4 Replica selection strategies

Replication is the most used technique to face the data availability problem. A replication technique is characterized by the selection strategy used to choose the replica nodes. In this section we discuss replica selection strategies for SO-based DOSNs, because the replication strategies of DHT-based DOSNs are guided by the underlying DHT and are not controlled by the DOSN. In recent years, several replication strategies have been proposed, which may exploit different policies for replica placement [32]:

• strategies based on the trust relationships of a user with other users;

• strategies based on specific parameters, the most relevant being the session patterns of the friends;

• strategies where the replica nodes are chosen directly by the users.

The two major users' features used to define a replication strategy are the following [35]:

• Temporal behaviour. The analysis of the behaviour of users in terms of their online presence in the system. In a social network, both centralized and distributed, this concerns the user's usage of the system, which may be active, by producing social contents, or only passive, when the user only accesses the social contents of other users.

• Social Graph. The social graph may be exploited to guide the replication strategies, by exploiting social information such as friendship type, common interests, closeness, trustworthiness, tie strength, etc. The two main features generally exploited to define a replication strategy are the friendship relationships [120, 149, 130, 102] and the trustworthiness, usually measured in terms of social interactions [35, 44, 43].

Before investigating the different replication strategies more deeply, we introduce the main metrics exploited to evaluate the effectiveness of a replication strategy: Pure Availability and Friend Availability [149]. Pure Availability measures the fraction of time data is available with respect to a given time period. In the context of DOSNs, however, the people interested in a user's data are primarily its friends. For this reason, Friend Availability has been introduced to measure the fraction of time a user's data is available when its friends are online. Other metrics concern the replication degree and the peer load [120]. The replication degree is the number of replicas hosting a user's profile, while the peer load measures the storage load of a peer due to the number of social contents assigned to it.
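Both metrics can be computed directly from session traces. A minimal sketch, assuming time is discretized into slots and availability is given as the set of slots in which the user's data is reachable (all names are illustrative):

```python
def pure_availability(data_slots: set, horizon: int) -> float:
    """Fraction of the whole observation period in which the data is reachable."""
    return len(data_slots) / horizon

def friend_availability(data_slots: set, friend_slots: set) -> float:
    """Fraction of the friends' online time in which the data is reachable."""
    if not friend_slots:
        return 1.0  # no friend ever online: vacuously available
    return len(data_slots & friend_slots) / len(friend_slots)
```

For example, data reachable 12 hours out of a 24-hour horizon gives a Pure Availability of 0.5; if the friends are online only during slots 8-15, the Friend Availability is 4/8 = 0.5.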


In Figure 2.9, we propose a summary of the data availability techniques in DOSNs, which we will investigate more deeply in the next sections.

Replica selection guided by temporal behaviour

A common problem in designing replication strategies based on temporal behaviour is the difficulty of retrieving the online behaviour of users from the most popular OSNs, due to the lack of public data. However, in the last few years, some studies have been conducted.

The main information used to define a selection strategy concerns the session length, as presented in [35].

In [122], an empirical study of the various system properties of DOSNs and of the parameters that influence them is presented. The authors explain that one of the most important parameters for providing availability is the online time of a user, while the properties of the social graph are usually used to minimize the number of replicas or to manage the load balancing.

Several studies try to generate synthetic models of the temporal behaviour of users and of the corresponding session length, because of the lack of real datasets. These approaches usually take into account the behaviour of users in other distributed applications.

For instance, in [122] three different user session time models are presented. The authors assume that a user could be:

• Sporadic: a user is online several times a day, sporadically, with a session length of 20 minutes;

• Continuous-Fixed Length: all users are assumed to be online, each day of the week, during a continuous time window of a fixed length;

• Continuous-Random Length: each user randomly chooses its own online session time window, with a length in the range [2, 8] hours.

The Sporadic model is considered by the authors as the most realistic one.
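The three models can be sketched as simple session generators over a day measured in minutes; the specific parameter defaults below, such as 5 sporadic sessions per day, are illustrative assumptions and are not taken from [122].

```python
import random

def sporadic(days=7, sessions_per_day=5, session_len=20):
    """Sporadic model: a few short (20-minute) sessions at random times each day."""
    sessions = []
    for d in range(days):
        for _ in range(sessions_per_day):
            start = d * 1440 + random.randrange(1440 - session_len)
            sessions.append((start, start + session_len))
    return sessions

def continuous_fixed(days=7, start_min=9 * 60, length=4 * 60):
    """Continuous-Fixed Length model: one fixed daily window for every user."""
    return [(d * 1440 + start_min, d * 1440 + start_min + length) for d in range(days)]

def continuous_random(days=7):
    """Continuous-Random Length model: one daily window, length drawn in [2, 8] hours."""
    sessions = []
    for d in range(days):
        length = random.randrange(2 * 60, 8 * 60 + 1)
        start = d * 1440 + random.randrange(1440 - length)
        sessions.append((start, start + length))
    return sessions
```

Generators of this kind let a replication strategy be evaluated in simulation when no real availability trace is available.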

In [149], session durations are modeled independently from the session starting time, using a Weibull distribution.

An evaluation based on a real dataset is presented in [41], where a Facebook dataset containing the online behaviour of users is analysed. Real data show a daily temporal pattern of users, with on average less than 100 daily sessions, while the average number of sessions per user is less than 4 per day. Almost half of the sessions are shorter than 20 min, and a significant percentage of them (34%) last less than 10 min. Almost 50% of users present an inter-arrival time shorter than 1 hour. These figures are confirmed when considering a larger period of 1 month [42].


Another current trend is the use of predictors to foresee the availability of users. A predictor takes into account the past online behaviour of a user to predict whether the user will be online in a specific time interval. In [42], a linear predictor is used to choose the replica nodes. In [45], the authors define five different features, which differ in the set of users considered (individual or global) and in the periodicity (daily, weekly, and flat), and combine them with logistic regression to predict the probability that each individual user will be online at a future time instant.
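A linear predictor of the kind cited from [42] can be sketched as a recency-weighted average of the user's past presence; the exact weighting used in [42] may differ, so the linearly increasing weights below are an assumption.

```python
def predict_online(history, weights=None):
    """history: one 0/1 vector per past day (1 = online in that time slot).
    Returns, per slot, a weighted average of past presence; more recent
    days (later entries of `history`) get linearly larger weights."""
    n = len(history)
    weights = weights or list(range(1, n + 1))
    total = sum(weights)
    n_slots = len(history[0])
    return [sum(w * day[s] for w, day in zip(weights, history)) / total
            for s in range(n_slots)]
```

For example, a user online in a given slot on each of the last two observed days is predicted online there with score 1.0, while presence only on the most recent day yields a score of 2/3.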

Replica selection strategies guided by the social relationships

Another common trend for replica selection is the definition of a set of trusted nodes, chosen by exploiting friendship relationships (friend nodes or friend-of-friend nodes), which are used as replica nodes. Replicating data at all friends could make the cost of storing data in multiple copies prohibitive, in particular for users with a large number of friends [157]. In [122], a first replica selection approach tries to maximize availability by choosing the user's friends which maximize the availability of the user profile; a second approach prioritizes the most active friends for placing the replicas; and a final approach chooses storage nodes at random. [120] proposes four selection strategies, based respectively on minimizing the number of replicas, minimizing the update propagation delay, minimizing the access cost incurred in accessing a user's profile, and maximizing the replication gain. In [35], trustworthiness and the number of replicas guide the selection strategies.
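Availability-maximizing strategies of the first kind described for [122] are naturally expressed as a greedy max-coverage selection over the friends' online time; the sketch below is an illustration of this idea, not the exact algorithm of [122], and all names are assumptions.

```python
def select_replicas(uncovered, friends, k):
    """Greedily pick up to k friends, each time choosing the one whose
    online slots cover the most still-uncovered time slots of the user."""
    uncovered = set(uncovered)
    remaining = dict(friends)          # friend -> set of online time slots
    chosen = []
    while remaining and len(chosen) < k:
        best, slots = max(remaining.items(), key=lambda kv: len(kv[1] & uncovered))
        if not slots & uncovered:      # no remaining friend adds coverage
            break
        chosen.append(best)
        uncovered -= slots
        del remaining[best]
    return chosen
```

The greedy choice gives the usual (1 - 1/e) approximation guarantee of max-coverage, which is why availability-driven strategies can get close to the optimum with few replicas.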
