• Non ci sono risultati.

Group Membership for Peer-to-Peer Communication

N/A
N/A
Protected

Academic year: 2021

Condividi "Group Membership for Peer-to-Peer Communication"

Copied!
24
0
0

Testo completo

(1)

Group Membership for Peer-to-Peer Communication

Roberto Baldoni and Sara Tucci Piergiovanni Dipartimento di Informatica e Sistemistica

Universit`a di Roma “La Sapienza”

Via Salaria 113, 00198 Roma, Italia.

{baldoni,tucci}@dis.uniroma1.it

Abstract

This paper studies the problem of maintaining eventual consistent group membership without any synchrony assumption inside an unbounded group G of processes that varies over the time (processes may join and leave the group). Eventual consistency means that if at any time all group membership changes cease, processes will converge in a finite time to a single consistent view. Due to the lack of any synchrony assumption, this specification is well suited to large scale peer to peer environments. The paper also presents two impossibility results and two eventual group membership implementations. Last but not least, pointing out a circularity problem, the paper also shows the impossibility of implementing eventual group membership without the existence of a special peer inside the group.

Keywords: Group Membership, Peer-to-Peer Systems, Dynamic Asynchronous Dis- tributed Systems.

Contact Author: Roberto Baldoni, Dipartimento di Informatica e Sistemistica, Universit´a di Roma La Sapienza, Via Salaria 113, Roma, Italia, baldoni@dis.uniroma1.it, tel. +39 0649918481

(2)

1 Introduction

Group membership problems have been extensively addressed in the literature in the context of group communications. Traditionally, such a group membership aims to support process group communication [2] and has the following objectives [12]: (i) determining the set of processes that are currently up and (ii) ensuring that processes agree on the successive values of this set. From an implementation point of view, a group membership so specified is not solvable in purely asyn- chronous distributed systems with failures [5]. Therefore implementations of such group member- ship services require periods of synchrony of the underlying system model long enough to ensure previous properties.

A peer-to-peer environment is formed by a dynamic and scalable group of processes sharing a common interest. In this setting processes communicate through application-level multicast protocols over an overlay network formed by the peers themselves [9, 3]. Due to joins and leaves, the overlay continuously changes. This implies that the multicast protocol, to be effective, has to rely on a group membership service1 to individuate time by time the set of the intended peer receivers for each multicast message. Due to its dynamicity and to the large number of processes, no ”long enough” period of synchrony can actually be assumed within this environment. Therefore group membership services defined for group communication are not adequate in this context both for implementation (due to [5]) and specification reasons. Ideally, to better support application level multicast protocols in a peer-to-peer environment, a group membership service should indeed provide views close as much as possible to the actual composition of the group, at any point of

1The multicast can also directly embed the membership management, but in the following we maintain separated these two concerns: multicast communication of application information and group membership management.

(3)

time. The service should determine who is currently inside the group (not necessarily currently up processes) by managing joins and leaves and provide new views to peers upon a membership change (without the need of an agreement on the order in which views are provided).

The aim of the paper is therefore to formalize an implementable specification of a group member- ship service in a peer-to-peer environment that has some useful deterministic guarantee to support at its best application-level multicast protocols. To illustrate the advantage of having a group mem- bership service in a peer-to-peer environment consider the following scenario: a process p joins the group. Let us suppose that p becomes a peer at time t0. From this time on, the group membership service will provide peers with a new view including p. Independently of the service implementation (centralized, decentralized, etc.) new views will be available at each peer with a certain latency.

The expected behavior of the group membership service should allow that: the more p remains inside the group, the more its chances are to appear in peer views. Clearly, the set of peer views in which p will appear is not predictable, since many peers belonging to the group at time t0 will leave the group in a successive moment, even before being notified with the new view. However, if p and some peer q remain forever in the group we expect that p eventually enters the q’s view (and vice versa). In other terms, the group membership specification should deterministically avoid the isolation of p from other peers, if any. It is well known that process isolations or process partitions seriously degrade performance of any application-level multicast protocol [6].

The paper presents a new group membership specification that formalizes that behavior, namely eventual group membership. This specification, partially inspired by the one stated in [8], guarantees that, after the overlay network stabilizes (no more join and leave occur), peers can converge to a single consistent view in a finite time. The specification also states that join and leave operations must take finite time to guarantee the progress of the system.

The paper also presents some impossibility result. These impossibility results show that the

(4)

specification is not implementable if (i) the group is empty at some point of time and (ii) the group does not always contain a correct process (i.e., a process that never crashes) in the group (from the beginning to the end). Moreover, the last impossibility result points out a circularity problem when eventual group membership is implemented in a fully decentralized way by the peers themselves. This problem reflects the impossibility of a newcomer to know someone (i.e., a contact) inside the group due to the dynamism of the group itself (e.g. the join operation of a newcomer could not terminate in a finite time if its contact already left the group). This circularity problem actually implies that there must be a special peer to be known in advance (i.e., at design time) by a newcomer in order to terminate the join operation. This knowledge makes impossible any fully distributed implementation of eventual group membership where no process plays a special role. Finally, two implementations of eventual group membership are presented in the context of client/server and peer-to-peer environment and their failure resilience is discussed.

The paper is organized as follows Section 2 presents the system model including the group membership management. Section 3 introduces the specification of eventual group membership, the impossibility results and the circularity problem. Section 4 shows two eventual group membership implementations. Section 5 discusses the related work and Section 6 concludes the paper.

2 System Model

The system consists of an unbounded set of processes Π (Π may be infinite). Any process may fail by crashing. A process that never fails is correct. The system is asynchronous: there is no global clock and there is no timing assumption on process scheduling and message transfer delays. Each pair of processes pi, pj may communicate along point-to-point unidirectional fair lossy links[1].

Roughly speaking, a fair lossy link may intermittently drop a finite number of messages, and may

(5)

do so infinitely often, but if pi repeatedly sends some message to pj and pj is correct, then pj eventually receives the message.

To simplify the description without losing generality, we assume the existence of a fictional global clock, whose output is the set of positive integers denoted by T .

Group Membership. Each process pi ∈ Π may become member of a group G. Once member, it may decide to leave the group. To this aim pi may invoke the following operations: join(G) to enter G and leave(G) to come out of the group.

The set of processes constituting the group G at time t ∈ T is denoted as G(t). At any time t, G(t) is a subset of Π with size unbounded and finite. The rules defining the membership of G are the following:

• a process p ∈ Π becomes a member of G immediately after the completion of join(G).

• a process p ceases to be member of G immediately after the completion of leave(G).

• a process p may become member of G at most once. 2.

A group member p ∈ G is stationary if it is correct and never invokes leave(G).

A group member p ∈ G is transient if it is correct and eventually executes leave(G).

Note that, since G(t) is unbounded and finite at any point of time, the number of stationary processes is unbounded and finite, as well.

Group Membership Management. To abstract in a general manner the group membership management we consider that each process can locally access to a distributed oracle. Each process pi invokes the join(G) and leave(G) operations through the group membership management local

2This is not a restriction because the process may join with another identifier.

(6)

module GMi that is in charge of the actual execution of these operations. Each GMi module provides pi with a local view of the current membership of the group. We assume that GMi crashes only when pi crashes.

Upon the invocation of join(G) by pi, it generates an event denoted as invi(join(G)), then GMi gains the grant to access the group for pi. After that, GMi returns to pi with an upcall by generating the event resi(join(G)).

Upon the invocation of leave(G) by pi, denoted as invi(leave(G)), GMi is in charge to gain the grant to leave the group for pi. After that, GMi returns to pi with an upcall resi(leave(G)).

Note that GMi may provide to pi a local view even when a pi is not a group member. In this case we denote the view as access viewi. As long as the process pi belongs to the group, the local view is denoted as group viewi. Only between the events resi(join(G)) and resi(leave(G)), pi ∈ group viewi.

3 Eventual Group Membership

The view information for one group can be represented as a knows-about directed graph K = (Π, E) [8]. For each pair of processes pi, pj, there will be an edge (pi, pj) in the graph if pj ∈ group viewi, and an edge (pj, pi) if pi ∈ group viewj. There exists an edge (pi, pi) for every process pi such that pi ∈ group viewi, i.e. for every current member3. This graph actually represents the overlay network to be used as underlying communication network by an application-level multicast protocol in a peer-to-peer environment.

3Let us remark that there exists an edge (pi, pi) even for each faulty member pi that crashes before generating resi(leave(G)).

(7)

3.1 Specification

Every group membership algorithm implementing eventual consistency has to dynamically manage a knows-about graph in order to:

• connect a new joined member,

• disconnect a leaving member without creating a partition,

• tolerate process failures without creating a partition,

while keeping the knows-about graph connected.

Safety. Since view information is propagated along edges of the knows-about graph (each edge is a fair-lossy link), once joins and leaves cease, every stationary member pi belonging to the graph should have for each stationary member at least one path formed by stationary members.

This is necessary because even though leaves and joins no longer occur, crashes are still possible.

Such crashes could lead to partition the set of stationary members. Therefore, if this condition is satisfied, each stationary member eventually includes in its view all stationary ones. Formally,

Property 1 (Safety). Let K = (Π, E) the knows-about graph at time t s.t. no edge (pi, pi) will be added or removed for each t0> t (i.e., joins and leave cease at time t). Let us consider the subgraph Ks= (S, Es) such that

(i) pi ∈ Π and pi is stationary ⇔ pi ∈ S (ii) ∀ pi, pj ∈ S, (pi, pj) ∈ E ⇔ (pi, pj) ∈ Es.

Then, ∀pi, pj ∈ S there exists an edge (pi, pj) in the transitive closure of Es for each t0 > t.

Liveness. A trivial group membership implementation may maintain safety blocking the com- pletion of each join(G)/leave(G). If upon group initialization the group is empty, preventing any

(8)

process from completing the join means that no process will ever become a member. If upon the group initialization the group is not empty, preventing the completion of join/leave means that Safety is assured by choosing a proper structure for the knows-about graph that tolerates some number of failures [11].

Then, to avoid static implementations the following property holds:

Property 2 (Liveness). The execution of the join(G) and leave(G) operations requires finite time.

3.2 Impossibility results

As said in Section 2 a process pi is provided through GMi with a list of members of G. Ideally, there is no constraint on the composition of access viewi, as the group membership specification poses constraints only on the composition of group viewis. Thus, impossibility results start from the general assumption that access viewi is a random set of processes belonging to Π without any relation with the current membership. In other terms, access viewi is neither accurate (it could comprise some process not yet a member) nor complete (it can miss same current member).

Unfortunately, as we see later, to guarantee the group membership specification even access viewi has to satisfy some property stated by Corollary 1. That property is very lightweight but introduces a not avoidable circularity discussed in Section 3.3.

Impossibility Result 1. If there exists a time t ∈ T s.t. G(t) ≡ ∅, eventual group membership cannot be solved.

Proof. (sketch) Let us suppose by contradiction that at some point of time t, |G(t)| ≡ ∅.

Let us assume a process pi executing the join(G) operation produces the invi(join(G)) event while |G(t)| ≡ ∅. pi does not know if the group is empty or not as access view is not complete

(9)

neither accurate. pi can send a JOIN message to its access view but cannot get any acknowledge- ment (like any concurrent joining process) since G(t) is empty. To respect Liveness it has become member after a finite amount of time exploiting a time-out strategy 4. Then, it concludes to be alone a time t + T and includes in group viewi only itself. Due to the asynchrony of the underlying system, another process pj with pj 6∈ access viewi and pi6∈ access viewj can decide to join. As pj does not ”see” pi it uses the same strategy and generates resi(join(G)) including in group viewj only itself at time t + T . If both pi, pj are stationary no edge connect them at time t0 ≥ t + T . If no other join and leave occur there is no way to add that edge in a successive moment. No edge will connect them for each t0≥ t + T violating Safety.

Lemma 1. Let us suppose that |G| is never empty. Then, any process picannot generate resi(join(G)) until there exist at least one edge (pi, pj) ∈ E and one edge (pj, pi) ∈ E.

Proof. (Sketch)

Let us suppose that resi(join(G)) is generated at time t and that G(t) contains a stationary member p. By the way of contradiction, let us suppose that does not exist any edge (pi, pj) in E at time t. However, after resi(join(G)) pi has an edge (pi, pi) ∈ E at time t. If pi is also stationary then G(t) contains two stationary processes and no edge in the transitive closure of E. If no other join and leave occur there is no way to add that edge in a successive moment. No edge will connect them for each t0≥ t + T , violating Safety.

Impossibility Result 2. If there exists a time t ∈ T s.t. G(t) contains no stationary member, eventual group membership cannot be solved.

Proof. Let us suppose by contradiction that there exists a point of time t ∈ T s.t. G(t) does not

4The strategy can encompass mechanisms such as setting a timeout T or retransmitting the JOIN message k times.

(10)

contain stationary members. From Lemma 1 every joining process has to establish two edges with a process pj, before generating resi(join(G)). From Liveness it has to establish those edges in a finite time. Without loss of generality suppose that at time t, G(t) comprises k faulty members and c transient processes. Taken any subset S(t) ⊆ G(t), of one process pj, that process is faulty or transient. Let us assume that:

1. pj is transient and belongs to G between time moments tJj and tLj.

2. pi generates invi(join(G)) and sends at time t a JOIN message to each member of G(t).

As the system is asynchronous, the delay experienced by JOIN on the fair lossy link connecting pi to pj, could be greater than tLj − tJj and then the message of pi does not meet pj. Moreover, pj does not yet know pi, so pj cannot communicate with pi before the JOIN arrives to pj.

Then, pi cannot establish any edge with pj. pi cannot establish in a finite time any edge unless some other stationary member will join the group. However, no stationary member can surely join in a finite time for the same reason that blocks pi.

Then, pi waits for an infinite time violating Liveness.

Note that the presence of a stationary process it is necessary also to guarantee Leave Liveness.

Upon a leave, the leaving process should communicate its view to its neighbors before leaving. In this manner partition may be avoided. However, if no stationary process is in the group, the leaving process cannot communicate in a reliable manner with its view in a finite time.

3.3 The circularity problem

From Impossibility Result 2, the following Corollary holds:

Corollary 1. If access viewi does not eventually contain at least one stationary member, eventual group membership cannot be solved.

(11)

In particular, if access viewi does not contain a stationary member, Join Liveness may be violated. This constraint on access view poses a circularity problem when the group membership is implemented in a fully decentralized manner by members themselves and no process plays a special role from the beginning.

To point out this problem let us consider that ideally the group is defined on the fly only by its peers that dynamically change. This implies that the mechanism to connect the group cannot be provided at the design time including some particular peer in each access view. The newcomer has to perform a run time discovery to fill the access view in order to be compliant to Corollary 1. This discovery cannot be pushed-based (from the current members of the group to the newcomer): none can indeed provide the newcomer with a view as no member of the group knows the newcomer 5. Thus, to discover a current peer, the newcomer has to contact someone (e.g. a special process) that knows some peer. Following a pure peer-to-peer approach (i.e., there is special no process), only a peer may have this knowledge. Then, to know a peer, the newcomer must already know a peer. The circularity is evident since the hen-and-egg problem arises. Circularity, in these systems, may be solved by assuming either that eventually the newcomer will know someone inside the group or the existence of a set of special processes constantly known by all other processes from the beginning of the group, losing a pure peer-to-peer approach.

4 Eventual Group Membership Implementations

In this section we provide two algorithms solving eventual group membership. The first algorithm is based on a set of servers and is thus well suited to small scale systems in terms of number of participants in the group. The second algorithm follows a fully decentralized peer-to-peer approach

5The number of potential newcomers is infinite. As a consequence the identifiers of potential newcomers cannot be available at design time.

(12)

to scale towards a huge number of participants.

4.1 A server-based implementation

A group of servers {s1, s2, ...sr} ⊆ Π totally interconnected and defined in the initialization phase, manages joins and leaves and provides members with views. The number of tolerated server failures is f = (r − 1)/2.

Every process pi has all servers in access viewi6. When pi invokes join(G), GMi sends a JOIN message to its access viewi and wait for a majority of acknowledgments from the servers, then a resi(join(G)) is generated. When pi invokes leave(G), GMi sends a LEAVE message to its access view and wait for a majority of acknowledgments from the servers, then a resi(leave(G)) is generated.

Upon the reception of a JOIN/LEAVE message, each server sj incrementally builds a list current membersj of current members and maintains a list of members that have left the group old membersj. The list current membersj is sent along with acknowledgements to join.

When a group member pi receives the list current membersj from sj it updates its own group viewi, accordingly. Subsequently, the group membership changes may be sent by a server to each of its current members, periodically7. Both lists, current membersj and old membersj, are periodically exchanged among servers. When a server sj receives the list current membersk from sk, it then (i) deletes from list current membersj all processes in old membersk (if any) and from list current membersk all processes old membersj (if any) and (ii) performs the union of the two lists so obtained.

Note that, once all group membership changes cease, the lists maintained by each server sj

6Impossibility results are circumvented by the presence of servers.

7Different strategies may be pursued: periodic pull by a member, push upon a change by a server, periodic push by servers, etc.

(13)

does not change after a time t. From this time t on, in a finite time, all servers reach a consistent view, agreeing on a certain list 8. This list contains all stationary members. The list is eventually delivered to each stationary member. All stationary members agree on the same view and are comprised in the view, so they form a clique in the knows-about graph.

4.2 A peer-to-peer implementation

The above implementation is well suited when the system scale in terms of number of participants is small. To scale towards a huge number of participants the number of servers should be very high to try to balance the load. Thus, the natural choice seems to be a group membership service completely decentralized and implemented by the peers themselves where each peer of the group only manages a reasonably small subset of peers, having only a partial view of the group[7, 6].

The proposed solution consists of an algorithm that generates, in a decentralized manner, even- tually connected knows-about graphs. The resulting graphs show a particular structure in which each member has around itself a clique of at least 2f + 1 members, where f is the number of tol- erated failures. The other important feature of the algorithm consists in imposing a partial order on nodes to manage concurrent leaves that potentially may partition the graph.

Data Structures. The variable group viewi is the union of two different variables: sponsorsi and sponsoredi. A variable ranki gives an indication of the position of pi in the graph, inducing a partial order on nodes. A boolean variable leaving is initialized to ⊥.

Initialization of the group. A set of processes {p1, ...p3f +1} ⊆ Π totally interconnected and defined

8The list may also comprise some crashed member, as no failure detection mechanism is exploited.

(14)

in the initialization phase instantiates the group9. All these processes have rank ranki = 0. They are special processes, they never leave the group.

Join Management. Rules of the algorithm:

• GMi sends 10 a JOIN message to access viewi 11

• When GMi receives a JOIN message from GMj and pi∈ group viewi: (1) GMi inserts pj in sponsoredi; (2) it sends an acknowledgement to pj along with its own rank ranki.

• When GMi receives 2f + 1 acknowledgments: (1) GMi includes in sponsorsi all the senders and pi; (2) it sets ranki = max(rankk, ∀senderpk)+1 and (3) returns to pi generating resi(join(G)).

At the end of this phase, a newly joined member has a clique of 2f + 1 members around itself.

In Figure 1 is shown a possible graph evolution at different instants of time generated by the join algorithm. The group starts with a initialization group composed by four stationary processes, resilient to f = 1 failures. The rank of each node is shown.

Note that a member of rank = 1 has f + 1 stationary sponsors thanks to the initialization phase. Every member of rank r > 1 has f + 1 correct sponsors, both transient and stationary. All transient sponsors will be replaced upon their leave by other members.

Leave Management. Rules of the algorithm:

9Impossibility results are circumvented because of the presence of these processes.

10Each message is sent through a fair lossy link, the send primitive embeds a retransmission mechanism that ends to retransmit until an acknowledgement is received. The send primitive is supposed non-blocking.

11The mechanism to fulfill access view, addressing Corollary 1, will be discussed in the reminder of this Section.

(15)

0

0 0

0 0

0 0

0 0

0 0 0

1

0

0 0 0

1

1 0

0 0 0

1

1

0

0 0 0

1

1

2 0

0 0 0

1

1

2 0

0 0 0

1

1

2 3

t0 t1> t0 t2> t1

t3> t2 t4> t3

Figure 1: Evolution of the knows-about graph generated by the joins representing a group G

• GMi (i) sets leavingi = > and (ii) sends a LEAVE message to sponsorsi, so composed hLEAV E, sponsoredi, rankii;

• When GMi receives a LEAVE message hLEAV E, sponsoredr, rankri from GMj and rankj >

ranki and leavingi= ⊥: (1) GMi inserts sponsoredr in sponsoredi; (2) it sends an acknowl- edgement to pj and (3) sends a message hN EW SP ON SOR, oldsponsor = pji to sponsoredr.

• When GMi receives a majority of acknowledgments from its sponsors: (1) discards pi from sponsorsi and (2) returns to pi generating resi(LEAV E(G)).

• When GMireceives hN EW SP ON SOR, oldsponsorri from GMjand oldsponsorr ∈ sponsorsi: GMi includes pj in sponsorsi and discards oldsponsorr from sponsorsi.

Note that a process piterminates the leave(G) operation when d(2t+1)/2e = t+1 own sponsors become the new sponsors of those processes previously sponsored by pi. Those new sponsors have a rank greater than the rank of the newly sponsored processes.

(16)

The algorithm works since, among 2f + 1 sponsors, at least f + 1 are correct. In this way, any subset of f + 1 processes picked up by a leaving process contains at least one correct process (transient or stationary). This assures that upon a leave of a transient process pi, around each process sponsored by pi the sponsors number does not decrease, i.e. it remains always at least equal to 2f + 1, among which only f nodes are faulty.

Note that that algorithm manages concurrent leaves. Thanks to ranks, it is possible to induce a partial order on the nodes. In practice, when two nodes pi, pj with rank ranki ≤ rankj want to concurrently leave, a partition may occur if they actually leave at the same time. The algorithm sequences the leaves, by allowing a leave of a process of rank rankj only if none of its sponsor pi with rank ranki < rankj is concurrently leaving. Note that pj remains blocked as long as also new sponsors of pj concurrently leave. Eventually, if all processes with rank lesser than rankj leave, then pj will have as sponsors processes with rank 0. By construction these processes never leave (they are stationary), then also pj eventually may leave (Leave Liveness).

Figure 2 shows a particular graph configuration generated by the join algorithm in which all processes with rank 1 and rank 2 concurrently invokes a leave(G) operation. In case (a) the graph remains partitioned as processes with rank 0 becomes sponsors of processes no longer member of the group. It may happen if nodes with rank 1, nevertheless are leaving, send ACK LEAV E to processes with rank 2. Then, rank 2 processes leave and upon the leave of rank 1 processes, no process of rank 0 is aware of the leave of rank 2. Note that, it may be wondered a case in which, if a process of rank 1 is leaving and changes its view, it reissues the leave message to its sponsors (rank 0). The problem is that such an algorithm may never terminate if all processes with rank greater than 2, continuously leave12. In such a case the leave(G) operation takes an infinite time,

12The number of processes with rank greater than 2 may be infinite as Π may be infinite.

(17)

violating Liveness.

0

0 0 0

1

1

1

2 2 2

subgraph

0

0 0

0 subgraph 0

0 0 0

2 2 2

subgraph

(a) Concurrent leaves with partition (b) Concurrent leaves sequenced, no partition

0

0 0 0

1

1

1

2 2 2

subgraph

0

0 0

0 subgraph 0

0 0 0

2 2 2

subgraph

(a) Concurrent leaves with partition (b) Concurrent leaves sequenced, no partition

Figure 2: Two graph configurations after the concurrent leaves of all nodes of rank 1 and 2.

Knows-about Graph and Failures. If the group has to tolerate f failures, it has to start with 3f + 1 processes. Nominally, the number of tolerated failures does not change in face of a possible growing number of processes. This should imply that all processes that add in the graph are supposed to be correct. However, a graph may be more or less robust to different failure patterns depending on its topology. In practice, we can individuate a number of tolerated failures per rank. For instance, in a star topology, all processes can be divided in two subsets: processes of rank 0 (3f + 1) and processes of rank 1 (N − (3f + 1)). In the subset of processes of rank 0, f failures are tolerated;

in the subset of rank 1 all processes may be faulty. In more complex topologies, for each subset of processes with a given rank, a certain number of failures can be tolerated. The more the number of tolerated failures for each rank, the more the robustness of the topology. Relations between robust topologies and implementation mechanisms to approximate them are out of the scope of this paper.

Techniques to keep scalable partial views. The overhead per-node due to data communication is

(18)

proportional to the node degree. Moreover, the average degree should be:

1. very close to the degree of each node (in this case the topology is very well-balanced)

2. reasonably small

To address the first point an indirection mechanism [7] may be pursued to allow that a joining node chooses its contact uniformly at random.

To address the second point the degree of each node may be predefined or mechanisms to adapt the degree to the total number of nodes may be figured out. First of all, if the number of accepted connections (number of sponsored) is defined and never changes, it means that when a node pi leaves, some of its sponsored nodes may be not accepted by a pi’s sponsor. In this case the non-accepted nodes should re-join the group.

An adaptive mechanism for resizing partial views can be the same used by SCAMP. In practice, each node pi accepts with a probability equal to 1/(|sponsored|+1) both (i) a newly join request by a node pj (in this case pi is the contact of pj) and (ii) an already group member pk whose sponsor has left. If pi does not accept the join request of a newly node pj, pj is in charge to retransmit the join request (to a node uniformly chosen at random). If pi does not accept an already member pk, pi sends a message to pk to warn it that its node-degree is being decreased. When pk knows that its degree has been decreased, it can re-join (maintaining its current partial view) in order to come back to its previous node degree by finding another sponsor.

Solving the circularity problem. To solve the circularity problem it is necessary to have inside the group a subset of stationary special peers known by all processes. In our implementation this subset can be represented by the set of processes with rank zero. In other words each access view contains the identifiers of these special peers.

The indirection mechanism [7] may be used to properly establish edges in the knows-about

(19)

graph by starting to contact special peers (without necessarily establishing edges in the knows- about graph with them). In particular, before joining, a newcomer pings the special peers. The special peers choose a set of members currently in their views and send it to the newcomer. The newcomer fills its access view with these sets. Now, the newcomer may decide either to repeat the indirection gaining a new access view starting from the current one or to send a JOIN message to the current access view to build the group view.

Join/Leave Latency. It is worth to note that above mentioned sophisticated mechanisms to maintain scalable partial views have a high cost in terms of join/leave latency when the network is very dynamic. In fact, a joining node should wait that 2f + 1 members accept its connection and a leaving node should have 2f + 1 sponsors before leaving. In a fairly static system the latency experienced would be reasonable, but it is not the case when the group dynamics is high (the latency should be unpredictable).

In other words, keeping a scalable and reliable topology would mean that in very dynamic periods of the overlay the Liveness property is not assured.

When the network is very dynamic the size of access viewishould enlarge in order to make the join latency lower (even an already member that has to re-join can use an access viewi very large comprising all members in its current group viewi). With an |access viewi| large, the probability to get 2f + 1 connections increases. Once 2f + 1 connections are established, there are |access viewi|- 2f + 1 false connections to remove. To remove false connections without compromise reliability, the joining node has to use a sequence number for each message sent. Then, it has to contact each process in its access view one by one (using a timeout). When the timeout elapses it send firstly (i) a CANCEL message to the previous contacted node (or another node if in the meanwhile the previous contacted node has left the group) in order to remove a possible false connection with it

(20)

and (ii) it sends a join request to the successive contact in its partial view.

When a node receives a join request it stores the identifier of the joining node in sponsored along with the sequence number of the join request.

When a node receives a CANCEL message it removes the identifier received from sponsored only if the sequence number is equal to the stored one+1.

5 Related Work

Traditional group membership does its best to provide each process with a precise membership information agreed among all processes in an asynchronous distributed systems with a finite number of processes and where processes can crash [10]. However it is well known that, in this system model, if the network does not eventually stabilize (i.e., the network does not show a synchrony period long enough to terminate a membership algorithm), no meaningful deterministic guarantee can be provided on the views delivered to processes [4, 10]. More specifically, [4] proves that such an agreement can be reached only if failure detectors in the system behave like eventually perfect ones and (ii) there is eventually a stable component (i.e., a set of processes that are alive and connected to each other and there is no connection between processes outside the stable component and the internal ones). It is easy to see that in previous model a solution to traditional group membership is also a solution to eventual group membership.

Let us now consider the system model of Section 2 where eventual group membership can be solved (as shown by the algorithms in Section 4). Traditional group membership cannot be solved in that setting. Such a model does not provide indeed any guarantee on the formation of a stable component as it would imply to have a time t after that synchrony assumptions hold.

(21)

Our framework also has similarities with Probabilistic Group Membership (PGM ) [6, 7, 8].

Our specification is partially inspired to the group membership specification in [8]. The main difference regards the Liveness property which has not been specified in [8]. Respecting Liveness has a valuable impact on the problem solvability leading to the two impossibility results above and the circularity problem (discussed in Section 4).

In [6, 7] there is no formal group membership specification, in both cases the authors present PGM algorithms doing their best to ensure a connected knows-about graph by using redundant messages. These PGM algorithms work in principle in the system model presented in Section 2 and their only aim is to support a probabilistic broadcast algorithm (e.g., a gossip algorithm) without degrading the performance of the latter [7]13. As stated by the same authors, these algorithms exhibit the risk of process isolation or of partitions which would lead to a severe degradation of the performance of the probabilistic broadcast algorithm. In our model these algorithms actually guarantee eventual group consistency only with a certain probability. Recovery from partitions of a group membership requires, for example, to define a set of priority processes which are constantly known by each process [6]. Both algorithms implicitly solves the circularity problem with the assumption that a joining process knows a contact inside the group.

6 Conclusion

The paper pointed out the necessity to formalize a new group membership specification in an envi- ronment where no ”long-enough” period of synchrony can be assumed to occur (i.e., asynchronous system model). This system model is typical of a peer-to-peer environment where the need to scale to a very large number of processes makes the occurrence of such synchrony periods not a

13Let us remark that probabilistic broadcast algorithms do not give any deterministic guarantee on the actual delivery of a message to any process.

(22)

reasonable assumption.

The paper provided a specification, namely eventual group membership, that does not require either to track precise group membership or agreement on group views among group members (as in classical group membership implementations). Nevertheless, the paper showed that eventual group membership is implementable in asynchronous distributed systems (two implementations have been presented in the paper) and that the specification gives deterministic guarantee against process isolation and partitions. Implementations of eventual group membership are actually in charge of dynamically adjusting the underlying overlay network (i.e., the knows-about graph topology) to ensure peer connectivity despite join, leaves and failures.

As remarked in the example shown in the Introduction, the specification of a group membership in asynchronous distributed systems can only be given on stationary processes (a transient process could never be included in any group view of another process). Therefore the issue related to track such transient processes inside the group views is a problem of performance of each eventual group membership algorithm. The more precise is the tracking of transient processes done by the algorithm, the more the group membership algorithm is effective.

The paper also pointed out a circularity problem that to be solved it implies the existence of some special process known by a newcomer at design time. This actually makes impossible a pure peer to peer implementation (no special process exists) of the eventual group membership specifi- cation. This limitation applies also to probabilistic implementation of group membership presented in [6, 7] as they are actually probabilistic implementations of eventual group membership.

Acknowledgments. We would like to thank Jean-Michel H´elary with whom we started to inves- tigate basic properties of dynamic distributed systems.

(23)

References

[1] Anindya Basu, Bernardette Charron-Bost, Sam Toeug: Simulating Reliable Links with Unreliable Links in the Presence of Process Crashes. Proceedings of the 10th International Workshop on Distributed Algorithms:

105 - 122 (1996)

[2] Kenneth Birman and Robert van Renesse: Reliable Distributed Computing with the Isis Toolkit. IEEE Computer Society Press (1994).

[3] Miguel Castro, Peter Druschel, Anne-Marie Kermarrec, Antony Rowstron: Scribe: A large-scale and de- centralized application-level multicast infrastructure. IEEE Journal on Selected Areas in communications (2002)

[4] Gregory Chockler, Idit Keidar, Roman Vitenberg: Group Communication Specifications: a Comprehensive Study. ACM Comput. Surv. 33(4): 427-469 (2001)

[5] Tushar Deepak Chandra, Vassos Hadzilacos, Sam Toueg, and Bernardette Charron-Bost. On the impossibility of group membership. In 15th Annual ACM Symposium on Principles of Distributed Computing (PODC), pages 322330 (1996).

[6] Patrick Th. Eugster, Rachid Guerraoui, Sidath B. Handurukande, Petr Kouznetsov, Anne-Marie Kermarrec:

Lightweight Probabilistic Broadcast. ACM Trans. Comput. Syst. 21(4): 341-374 (2003)

[7] Ayalvadi J. Ganesh, Anne-Marie Kermarrec, Laurent Massouli´e: Peer-to-Peer Membership Management for Gossip-Based Protocols. IEEE Trans. Computers 52(2): 139-149 (2003)

[8] Richard A. Golding and Kim Taylor: Group membership in the epidemic style, /Technical Report UCSC- CRL-92-13, University of California, Santa Cruz (1992).

[9] John Jannotti, David K. Gifford, Kirk L. Johnson, M. Frans Kaashoek, James W. O’Toole: Overcast:

Reliable Multicasting with an Overlay Network. In OSDI, San Diego (2000)

[10] Idit Keidar, Jeremy B. Sussman, Keith Marzullo, Danny Dolev: A Client-Server Oriented Algorithm for Virtually Synchronous Group Membership in WANs. In 20th International Conference on Distributed Com- puting Systems: 356-365 (2000)

(24)

[11] Meng-Jang Lin, Keith Marzullo and Stefano Massini: Gossip versus Deterministically Constrained Flooding on Small Networks. In 14th International Conference on Distributed Computing: 253-267 (2000)

[12] Andr´e Schiper and Sam Toueg: From Set Membership to Group Membership: A Separation of Concerns, Technical Report, EPFL, Lausanne, (2003)

Riferimenti

Documenti correlati

A causa dell’imprecisione statistica delle stime ottenute e della possibilità che il livello dei metalli nel liquor possa non essere correlato ad una precedente prolungata

All’interno di Internet sembra del resto prevalere oggi il modello di funzionamento proposto da Google, un modello che dichiara in apparen- za di perseguire la libertà e

This differ- ence in treatment between invasive fish and vertebrate homoeothermics seems even more puzzling and para- doxical when we consider that, at present, control plans are

• Based on central index server (farm).. • User register and give list of files

The comparison points out an interesting trade-off between reliability, in terms of guaranteeing overlay connectivity at any churn rate, and scalability in terms of creating

In particular, we discovered that: (i) two enJSRV loci that entered the host genome before speciation within the genus Ovis (; 3 million y ago) acquired, after their integration,

Service Private Limited Liability Company, Limited Liability Company of the Hungarian Idők Publisher, New Wave Media Group Communication and Service Limited Liability Company

Each packet carried a sequence number (that allowed to match packets received from different peers) and a timestamp with the time the packet was sent over the network.. With