Group Membership for Peer-to-Peer Communication

(1)

Group Membership for Peer-to-Peer Communication

Roberto Baldoni and Sara Tucci Piergiovanni Dipartimento di Informatica e Sistemistica

Universit`a di Roma “La Sapienza”

Via Salaria 113, 00198 Roma, Italia.

{baldoni,tucci}@dis.uniroma1.it

Abstract

This paper studies the problem of maintaining eventual consistent group membership without any synchrony assumption inside an unbounded group G of processes that varies over the time (processes may join and leave the group). Eventual consistency means that if at any time all group membership changes cease, processes will converge in a finite time to a single consistent view. Due to the lack of any synchrony assumption, this specification is well suited to large scale peer to peer environments. The paper also presents two impossibility results and two eventual group membership implementations. Last but not least, pointing out a circularity problem, the paper also shows the impossibility of implementing eventual group membership without the existence of a special peer inside the group.

Keywords: Group Membership, Peer-to-Peer Systems, Dynamic Asynchronous Dis- tributed Systems.

Contact Author: Roberto Baldoni, Dipartimento di Informatica e Sistemistica, Universit´a di Roma La Sapienza, Via Salaria 113, Roma, Italia, baldoni@dis.uniroma1.it, tel. +39 0649918481

(2)

1 Introduction

Group membership problems have been extensively addressed in the literature in the context of group communications. Traditionally, such a group membership aims to support process group communication [2] and has the following objectives [12]: (i) determining the set of processes that are currently up and (ii) ensuring that processes agree on the successive values of this set. From an implementation point of view, a group membership so specified is not solvable in purely asynchronous distributed systems with failures [5]. Therefore implementations of such group membership services require periods of synchrony of the underlying system model long enough to ensure previous properties.

A peer-to-peer environment is formed by a dynamic and scalable group of processes sharing a common interest. In this setting processes communicate through application-level multicast protocols over an overlay network formed by the peers themselves [9, 3]. Due to joins and leaves, the overlay continuously changes. This implies that the multicast protocol, to be effective, has to rely on a group membership service¹ to individuate time by time the set of the intended peer receivers for each multicast message. Due to its dynamicity and to the large number of processes, no ”long enough” period of synchrony can actually be assumed within this environment. Therefore group membership services defined for group communication are not adequate in this context both for implementation (due to [5]) and specification reasons. Ideally, to better support application level multicast protocols in a peer-to-peer environment, a group membership service should indeed provide views close as much as possible to the actual composition of the group, at any point of

1The multicast can also directly embed the membership management, but in the following we maintain separated these two concerns: multicast communication of application information and group membership management.

(3)

time. The service should determine who is currently inside the group (not necessarily currently up processes) by managing joins and leaves and provide new views to peers upon a membership change (without the need of an agreement on the order in which views are provided).

The aim of the paper is therefore to formalize an implementable specification of a group member- ship service in a peer-to-peer environment that has some useful deterministic guarantee to support at its best application-level multicast protocols. To illustrate the advantage of having a group mem- bership service in a peer-to-peer environment consider the following scenario: a process p joins the group. Let us suppose that p becomes a peer at time t₀. From this time on, the group membership service will provide peers with a new view including p. Independently of the service implementation (centralized, decentralized, etc.) new views will be available at each peer with a certain latency.

The expected behavior of the group membership service should allow that: the more p remains inside the group, the more its chances are to appear in peer views. Clearly, the set of peer views in which p will appear is not predictable, since many peers belonging to the group at time t₀ will leave the group in a successive moment, even before being notified with the new view. However, if p and some peer q remain forever in the group we expect that p eventually enters the q’s view (and vice versa). In other terms, the group membership specification should deterministically avoid the isolation of p from other peers, if any. It is well known that process isolations or process partitions seriously degrade performance of any application-level multicast protocol [6].

The paper presents a new group membership specification that formalizes that behavior, namely eventual group membership. This specification, partially inspired by the one stated in [8], guarantees that, after the overlay network stabilizes (no more join and leave occur), peers can converge to a single consistent view in a finite time. The specification also states that join and leave operations must take finite time to guarantee the progress of the system.

The paper also presents some impossibility result. These impossibility results show that the

(4)

specification is not implementable if (i) the group is empty at some point of time and (ii) the group does not always contain a correct process (i.e., a process that never crashes) in the group (from the beginning to the end). Moreover, the last impossibility result points out a circularity problem when eventual group membership is implemented in a fully decentralized way by the peers themselves. This problem reflects the impossibility of a newcomer to know someone (i.e., a contact) inside the group due to the dynamism of the group itself (e.g. the join operation of a newcomer could not terminate in a finite time if its contact already left the group). This circularity problem actually implies that there must be a special peer to be known in advance (i.e., at design time) by a newcomer in order to terminate the join operation. This knowledge makes impossible any fully distributed implementation of eventual group membership where no process plays a special role. Finally, two implementations of eventual group membership are presented in the context of client/server and peer-to-peer environment and their failure resilience is discussed.

The paper is organized as follows Section 2 presents the system model including the group membership management. Section 3 introduces the specification of eventual group membership, the impossibility results and the circularity problem. Section 4 shows two eventual group membership implementations. Section 5 discusses the related work and Section 6 concludes the paper.

2 System Model

The system consists of an unbounded set of processes Π (Π may be infinite). Any process may fail by crashing. A process that never fails is correct. The system is asynchronous: there is no global clock and there is no timing assumption on process scheduling and message transfer delays. Each pair of processes p_i, p_j may communicate along point-to-point unidirectional fair lossy links[1].

Roughly speaking, a fair lossy link may intermittently drop a finite number of messages, and may

(5)

do so infinitely often, but if p_i repeatedly sends some message to p_j and p_j is correct, then p_j eventually receives the message.

To simplify the description without losing generality, we assume the existence of a fictional global clock, whose output is the set of positive integers denoted by T .

Group Membership. Each process p_i ∈ Π may become member of a group G. Once member, it may decide to leave the group. To this aim p_i may invoke the following operations: join(G) to enter G and leave(G) to come out of the group.

The set of processes constituting the group G at time t ∈ T is denoted as G(t). At any time t, G(t) is a subset of Π with size unbounded and finite. The rules defining the membership of G are the following:

• a process p ∈ Π becomes a member of G immediately after the completion of join(G).

• a process p ceases to be member of G immediately after the completion of leave(G).

• a process p may become member of G at most once. ².

A group member p ∈ G is stationary if it is correct and never invokes leave(G).

A group member p ∈ G is transient if it is correct and eventually executes leave(G).

Note that, since G(t) is unbounded and finite at any point of time, the number of stationary processes is unbounded and finite, as well.

Group Membership Management. To abstract in a general manner the group membership management we consider that each process can locally access to a distributed oracle. Each process p_i invokes the join(G) and leave(G) operations through the group membership management local

2This is not a restriction because the process may join with another identifier.

(6)

module GM_i that is in charge of the actual execution of these operations. Each GM_i module provides p_i with a local view of the current membership of the group. We assume that GM_i crashes only when p_i crashes.

Upon the invocation of join(G) by p_i, it generates an event denoted as inv_i(join(G)), then GM_i gains the grant to access the group for p_i. After that, GM_i returns to p_i with an upcall by generating the event res_i(join(G)).

Upon the invocation of leave(G) by p_i, denoted as inv_i(leave(G)), GM_i is in charge to gain the grant to leave the group for p_i. After that, GM_i returns to p_i with an upcall res_i(leave(G)).

Note that GM_i may provide to p_i a local view even when a p_i is not a group member. In this case we denote the view as access view_i. As long as the process p_i belongs to the group, the local view is denoted as group view_i. Only between the events res_i(join(G)) and res_i(leave(G)), p_i ∈ group view_i.

3 Eventual Group Membership

The view information for one group can be represented as a knows-about directed graph K = (Π, E) [8]. For each pair of processes p_i, p_j, there will be an edge (p_i, p_j) in the graph if p_j ∈ group view_i, and an edge (p_j, p_i) if p_i ∈ group view_j. There exists an edge (p_i, p_i) for every process p_i such that p_i ∈ group view_i, i.e. for every current member³. This graph actually represents the overlay network to be used as underlying communication network by an application-level multicast protocol in a peer-to-peer environment.

3Let us remark that there exists an edge (pi, pi) even for each faulty member pi that crashes before generating resi(leave(G)).

(7)

3.1 Specification

Every group membership algorithm implementing eventual consistency has to dynamically manage a knows-about graph in order to:

• connect a new joined member,

• disconnect a leaving member without creating a partition,

• tolerate process failures without creating a partition,

while keeping the knows-about graph connected.

Safety. Since view information is propagated along edges of the knows-about graph (each edge is a fair-lossy link), once joins and leaves cease, every stationary member p_i belonging to the graph should have for each stationary member at least one path formed by stationary members.

This is necessary because even though leaves and joins no longer occur, crashes are still possible.

Such crashes could lead to partition the set of stationary members. Therefore, if this condition is satisfied, each stationary member eventually includes in its view all stationary ones. Formally,

Property 1 (Safety). Let K = (Π, E) the knows-about graph at time t s.t. no edge (p_i, p_i) will be added or removed for each t⁰> t (i.e., joins and leave cease at time t). Let us consider the subgraph K_s= (S, E_s) such that

(i) p_i ∈ Π and p_i is stationary ⇔ p_i ∈ S (ii) ∀ p_i, p_j ∈ S, (p_i, p_j) ∈ E ⇔ (p_i, p_j) ∈ E_s.

Then, ∀p_i, p_j ∈ S there exists an edge (p_i, p_j) in the transitive closure of E_s for each t⁰ > t.

Liveness. A trivial group membership implementation may maintain safety blocking the com- pletion of each join(G)/leave(G). If upon group initialization the group is empty, preventing any

(8)

process from completing the join means that no process will ever become a member. If upon the group initialization the group is not empty, preventing the completion of join/leave means that Safety is assured by choosing a proper structure for the knows-about graph that tolerates some number of failures [11].

Then, to avoid static implementations the following property holds:

Property 2 (Liveness). The execution of the join(G) and leave(G) operations requires finite time.

3.2 Impossibility results

As said in Section 2 a process p_i is provided through GM_i with a list of members of G. Ideally, there is no constraint on the composition of access view_i, as the group membership specification poses constraints only on the composition of group view_is. Thus, impossibility results start from the general assumption that access view_i is a random set of processes belonging to Π without any relation with the current membership. In other terms, access view_i is neither accurate (it could comprise some process not yet a member) nor complete (it can miss same current member).

Unfortunately, as we see later, to guarantee the group membership specification even access view_i has to satisfy some property stated by Corollary 1. That property is very lightweight but introduces a not avoidable circularity discussed in Section 3.3.

Impossibility Result 1. If there exists a time t ∈ T s.t. G(t) ≡ ∅, eventual group membership cannot be solved.

Proof. (sketch) Let us suppose by contradiction that at some point of time t, |G(t)| ≡ ∅.

Let us assume a process p_i executing the join(G) operation produces the inv_i(join(G)) event while |G(t)| ≡ ∅. p_i does not know if the group is empty or not as access view is not complete

(9)

neither accurate. p_i can send a JOIN message to its access view but cannot get any acknowledge- ment (like any concurrent joining process) since G(t) is empty. To respect Liveness it has become member after a finite amount of time exploiting a time-out strategy ⁴. Then, it concludes to be alone a time t + T and includes in group view_i only itself. Due to the asynchrony of the underlying system, another process p_j with p_j 6∈ access view_i and p_i6∈ access view_j can decide to join. As p_j does not ”see” p_i it uses the same strategy and generates res_i(join(G)) including in group view_j only itself at time t + T . If both p_i, p_j are stationary no edge connect them at time t⁰ ≥ t + T . If no other join and leave occur there is no way to add that edge in a successive moment. No edge will connect them for each t⁰≥ t + T violating Safety.

Lemma 1. Let us suppose that |G| is never empty. Then, any process p_icannot generate res_i(join(G)) until there exist at least one edge (p_i, p_j) ∈ E and one edge (p_j, p_i) ∈ E.

Proof. (Sketch)

Let us suppose that res_i(join(G)) is generated at time t and that G(t) contains a stationary member p. By the way of contradiction, let us suppose that does not exist any edge (p_i, p_j) in E at time t. However, after res_i(join(G)) p_i has an edge (p_i, p_i) ∈ E at time t. If p_i is also stationary then G(t) contains two stationary processes and no edge in the transitive closure of E. If no other join and leave occur there is no way to add that edge in a successive moment. No edge will connect them for each t⁰≥ t + T , violating Safety.

Impossibility Result 2. If there exists a time t ∈ T s.t. G(t) contains no stationary member, eventual group membership cannot be solved.

Proof. Let us suppose by contradiction that there exists a point of time t ∈ T s.t. G(t) does not

4The strategy can encompass mechanisms such as setting a timeout T or retransmitting the JOIN message k times.

(10)

contain stationary members. From Lemma 1 every joining process has to establish two edges with a process p_j, before generating res_i(join(G)). From Liveness it has to establish those edges in a finite time. Without loss of generality suppose that at time t, G(t) comprises k faulty members and c transient processes. Taken any subset S(t) ⊆ G(t), of one process p_j, that process is faulty or transient. Let us assume that:

1. p_j is transient and belongs to G between time moments t^J_j and t^L_j.

2. p_i generates inv_i(join(G)) and sends at time t a JOIN message to each member of G(t).

As the system is asynchronous, the delay experienced by JOIN on the fair lossy link connecting p_i to p_j, could be greater than t^L_j − t^J_j and then the message of p_i does not meet p_j. Moreover, p_j does not yet know p_i, so p_j cannot communicate with p_i before the JOIN arrives to p_j.

Then, p_i cannot establish any edge with p_j. p_i cannot establish in a finite time any edge unless some other stationary member will join the group. However, no stationary member can surely join in a finite time for the same reason that blocks p_i.

Then, p_i waits for an infinite time violating Liveness.

Note that the presence of a stationary process it is necessary also to guarantee Leave Liveness.

Upon a leave, the leaving process should communicate its view to its neighbors before leaving. In this manner partition may be avoided. However, if no stationary process is in the group, the leaving process cannot communicate in a reliable manner with its view in a finite time.

3.3 The circularity problem

From Impossibility Result 2, the following Corollary holds:

Corollary 1. If access view_i does not eventually contain at least one stationary member, eventual group membership cannot be solved.

(11)

In particular, if access view_i does not contain a stationary member, Join Liveness may be violated. This constraint on access view poses a circularity problem when the group membership is implemented in a fully decentralized manner by members themselves and no process plays a special role from the beginning.

To point out this problem let us consider that ideally the group is defined on the fly only by its peers that dynamically change. This implies that the mechanism to connect the group cannot be provided at the design time including some particular peer in each access view. The newcomer has to perform a run time discovery to fill the access view in order to be compliant to Corollary 1. This discovery cannot be pushed-based (from the current members of the group to the newcomer): none can indeed provide the newcomer with a view as no member of the group knows the newcomer ⁵. Thus, to discover a current peer, the newcomer has to contact someone (e.g. a special process) that knows some peer. Following a pure peer-to-peer approach (i.e., there is special no process), only a peer may have this knowledge. Then, to know a peer, the newcomer must already know a peer. The circularity is evident since the hen-and-egg problem arises. Circularity, in these systems, may be solved by assuming either that eventually the newcomer will know someone inside the group or the existence of a set of special processes constantly known by all other processes from the beginning of the group, losing a pure peer-to-peer approach.

4 Eventual Group Membership Implementations

In this section we provide two algorithms solving eventual group membership. The first algorithm is based on a set of servers and is thus well suited to small scale systems in terms of number of participants in the group. The second algorithm follows a fully decentralized peer-to-peer approach

5The number of potential newcomers is infinite. As a consequence the identifiers of potential newcomers cannot be available at design time.

(12)

to scale towards a huge number of participants.

4.1 A server-based implementation

A group of servers {s₁, s₂, ...s_r} ⊆ Π totally interconnected and defined in the initialization phase, manages joins and leaves and provides members with views. The number of tolerated server failures is f = (r − 1)/2.

Every process p_i has all servers in access view_i⁶. When p_i invokes join(G), GM_i sends a JOIN message to its access view_i and wait for a majority of acknowledgments from the servers, then a res_i(join(G)) is generated. When p_i invokes leave(G), GM_i sends a LEAVE message to its access view and wait for a majority of acknowledgments from the servers, then a res_i(leave(G)) is generated.

Upon the reception of a JOIN/LEAVE message, each server s_j incrementally builds a list current members_j of current members and maintains a list of members that have left the group old members_j. The list current members_j is sent along with acknowledgements to join.

When a group member p_i receives the list current members_j from s_j it updates its own group view_i, accordingly. Subsequently, the group membership changes may be sent by a server to each of its current members, periodically⁷. Both lists, current members_j and old members_j, are periodically exchanged among servers. When a server s_j receives the list current members_k from s_k, it then (i) deletes from list current members_j all processes in old members_k (if any) and from list current members_k all processes old members_j (if any) and (ii) performs the union of the two lists so obtained.

Note that, once all group membership changes cease, the lists maintained by each server s_j

6Impossibility results are circumvented by the presence of servers.

7Different strategies may be pursued: periodic pull by a member, push upon a change by a server, periodic push by servers, etc.

(13)

does not change after a time t. From this time t on, in a finite time, all servers reach a consistent view, agreeing on a certain list ⁸. This list contains all stationary members. The list is eventually delivered to each stationary member. All stationary members agree on the same view and are comprised in the view, so they form a clique in the knows-about graph.

4.2 A peer-to-peer implementation

The above implementation is well suited when the system scale in terms of number of participants is small. To scale towards a huge number of participants the number of servers should be very high to try to balance the load. Thus, the natural choice seems to be a group membership service completely decentralized and implemented by the peers themselves where each peer of the group only manages a reasonably small subset of peers, having only a partial view of the group[7, 6].

The proposed solution consists of an algorithm that generates, in a decentralized manner, eventually connected knows-about graphs. The resulting graphs show a particular structure in which each member has around itself a clique of at least 2f + 1 members, where f is the number of tol- erated failures. The other important feature of the algorithm consists in imposing a partial order on nodes to manage concurrent leaves that potentially may partition the graph.

Data Structures. The variable group view_i is the union of two different variables: sponsors_i and sponsored_i. A variable rank_i gives an indication of the position of p_i in the graph, inducing a partial order on nodes. A boolean variable leaving is initialized to ⊥.

Initialization of the group. A set of processes {p₁, ...p_{3f +1}} ⊆ Π totally interconnected and defined

8The list may also comprise some crashed member, as no failure detection mechanism is exploited.

(14)

in the initialization phase instantiates the group⁹. All these processes have rank rank_i = 0. They are special processes, they never leave the group.

Join Management. Rules of the algorithm:

• GM_i sends ¹⁰ a JOIN message to access view_i ¹¹

• When GM_i receives a JOIN message from GM_j and p_i∈ group view_i: (1) GM_i inserts p_j in sponsored_i; (2) it sends an acknowledgement to p_j along with its own rank rank_i.

• When GM_i receives 2f + 1 acknowledgments: (1) GM_i includes in sponsors_i all the senders and p_i; (2) it sets rank_i = max(rank_k, ∀senderp_k)+1 and (3) returns to p_i generating res_i(join(G)).

At the end of this phase, a newly joined member has a clique of 2f + 1 members around itself.

In Figure 1 is shown a possible graph evolution at different instants of time generated by the join algorithm. The group starts with a initialization group composed by four stationary processes, resilient to f = 1 failures. The rank of each node is shown.

Note that a member of rank = 1 has f + 1 stationary sponsors thanks to the initialization phase. Every member of rank r > 1 has f + 1 correct sponsors, both transient and stationary. All transient sponsors will be replaced upon their leave by other members.

Leave Management. Rules of the algorithm:

9Impossibility results are circumvented because of the presence of these processes.

10Each message is sent through a fair lossy link, the send primitive embeds a retransmission mechanism that ends to retransmit until an acknowledgement is received. The send primitive is supposed non-blocking.

11The mechanism to fulfill access view, addressing Corollary 1, will be discussed in the reminder of this Section.

(15)

0

0 0

0 0 0

1

0

0 0 0

1

1 0

0 0 0

1

0

0 0 0

1

2 0

0 0 0

1

2 0

0 0 0

1

2 3

t₀ t₁> t₀ t₂> t₁

t₃> t₂ t₄> t₃

Figure 1: Evolution of the knows-about graph generated by the joins representing a group G

• GM_i (i) sets leaving_i = > and (ii) sends a LEAVE message to sponsors_i, so composed hLEAV E, sponsored_i, rank_ii;

• When GM_i receives a LEAVE message hLEAV E, sponsored_r, rank_ri from GM_j and rank_j >

rank_i and leaving_i= ⊥: (1) GM_i inserts sponsored_r in sponsored_i; (2) it sends an acknowl- edgement to p_j and (3) sends a message hN EW SP ON SOR, oldsponsor = p_ji to sponsored_r.

• When GM_i receives a majority of acknowledgments from its sponsors: (1) discards p_i from sponsors_i and (2) returns to p_i generating res_i(LEAV E(G)).

• When GM_ireceives hN EW SP ON SOR, oldsponsor_ri from GM_jand oldsponsor_r ∈ sponsors_i: GM_i includes p_j in sponsors_i and discards oldsponsor_r from sponsors_i.

Note that a process p_iterminates the leave(G) operation when d(2t+1)/2e = t+1 own sponsors become the new sponsors of those processes previously sponsored by p_i. Those new sponsors have a rank greater than the rank of the newly sponsored processes.

(16)

The algorithm works since, among 2f + 1 sponsors, at least f + 1 are correct. In this way, any subset of f + 1 processes picked up by a leaving process contains at least one correct process (transient or stationary). This assures that upon a leave of a transient process p_i, around each process sponsored by p_i the sponsors number does not decrease, i.e. it remains always at least equal to 2f + 1, among which only f nodes are faulty.

Note that that algorithm manages concurrent leaves. Thanks to ranks, it is possible to induce a partial order on the nodes. In practice, when two nodes p_i, p_j with rank rank_i ≤ rank_j want to concurrently leave, a partition may occur if they actually leave at the same time. The algorithm sequences the leaves, by allowing a leave of a process of rank rank_j only if none of its sponsor p_i with rank rank_i < rank_j is concurrently leaving. Note that p_j remains blocked as long as also new sponsors of p_j concurrently leave. Eventually, if all processes with rank lesser than rank_j leave, then p_j will have as sponsors processes with rank 0. By construction these processes never leave (they are stationary), then also p_j eventually may leave (Leave Liveness).

Figure 2 shows a particular graph configuration generated by the join algorithm in which all processes with rank 1 and rank 2 concurrently invokes a leave(G) operation. In case (a) the graph remains partitioned as processes with rank 0 becomes sponsors of processes no longer member of the group. It may happen if nodes with rank 1, nevertheless are leaving, send ACK LEAV E to processes with rank 2. Then, rank 2 processes leave and upon the leave of rank 1 processes, no process of rank 0 is aware of the leave of rank 2. Note that, it may be wondered a case in which, if a process of rank 1 is leaving and changes its view, it reissues the leave message to its sponsors (rank 0). The problem is that such an algorithm may never terminate if all processes with rank greater than 2, continuously leave¹². In such a case the leave(G) operation takes an infinite time,

12The number of processes with rank greater than 2 may be infinite as Π may be infinite.

(17)

violating Liveness.

0

0 0 0

1

2 2 2

subgraph

0

0 0

0 ^subgraph 0

0 0 0

2 2 2

subgraph

(a) Concurrent leaves with partition (b) Concurrent leaves sequenced, no partition

0

0 0 0

1

2 2 2

subgraph

0

0 0

0 ^subgraph 0

0 0 0

2 2 2

subgraph

(a) Concurrent leaves with partition (b) Concurrent leaves sequenced, no partition

Figure 2: Two graph configurations after the concurrent leaves of all nodes of rank 1 and 2.

Knows-about Graph and Failures. If the group has to tolerate f failures, it has to start with 3f + 1 processes. Nominally, the number of tolerated failures does not change in face of a possible growing number of processes. This should imply that all processes that add in the graph are supposed to be correct. However, a graph may be more or less robust to different failure patterns depending on its topology. In practice, we can individuate a number of tolerated failures per rank. For instance, in a star topology, all processes can be divided in two subsets: processes of rank 0 (3f + 1) and processes of rank 1 (N − (3f + 1)). In the subset of processes of rank 0, f failures are tolerated;

in the subset of rank 1 all processes may be faulty. In more complex topologies, for each subset of processes with a given rank, a certain number of failures can be tolerated. The more the number of tolerated failures for each rank, the more the robustness of the topology. Relations between robust topologies and implementation mechanisms to approximate them are out of the scope of this paper.

Techniques to keep scalable partial views. The overhead per-node due to data communication is

(18)

proportional to the node degree. Moreover, the average degree should be:

1. very close to the degree of each node (in this case the topology is very well-balanced)

2. reasonably small

To address the first point an indirection mechanism [7] may be pursued to allow that a joining node chooses its contact uniformly at random.

To address the second point the degree of each node may be predefined or mechanisms to adapt the degree to the total number of nodes may be figured out. First of all, if the number of accepted connections (number of sponsored) is defined and never changes, it means that when a node p_i leaves, some of its sponsored nodes may be not accepted by a p_i’s sponsor. In this case the non-accepted nodes should re-join the group.

An adaptive mechanism for resizing partial views can be the same used by SCAMP. In practice, each node p_i accepts with a probability equal to 1/(|sponsored|+1) both (i) a newly join request by a node p_j (in this case p_i is the contact of p_j) and (ii) an already group member p_k whose sponsor has left. If p_i does not accept the join request of a newly node p_j, p_j is in charge to retransmit the join request (to a node uniformly chosen at random). If p_i does not accept an already member p_k, p_i sends a message to p_k to warn it that its node-degree is being decreased. When p_k knows that its degree has been decreased, it can re-join (maintaining its current partial view) in order to come back to its previous node degree by finding another sponsor.

Solving the circularity problem. To solve the circularity problem it is necessary to have inside the group a subset of stationary special peers known by all processes. In our implementation this subset can be represented by the set of processes with rank zero. In other words each access view contains the identifiers of these special peers.

The indirection mechanism [7] may be used to properly establish edges in the knows-about

(19)

graph by starting to contact special peers (without necessarily establishing edges in the knows- about graph with them). In particular, before joining, a newcomer pings the special peers. The special peers choose a set of members currently in their views and send it to the newcomer. The newcomer fills its access view with these sets. Now, the newcomer may decide either to repeat the indirection gaining a new access view starting from the current one or to send a JOIN message to the current access view to build the group view.

Join/Leave Latency. It is worth to note that above mentioned sophisticated mechanisms to maintain scalable partial views have a high cost in terms of join/leave latency when the network is very dynamic. In fact, a joining node should wait that 2f + 1 members accept its connection and a leaving node should have 2f + 1 sponsors before leaving. In a fairly static system the latency experienced would be reasonable, but it is not the case when the group dynamics is high (the latency should be unpredictable).

In other words, keeping a scalable and reliable topology would mean that in very dynamic periods of the overlay the Liveness property is not assured.

When the network is very dynamic the size of access view_ishould enlarge in order to make the join latency lower (even an already member that has to re-join can use an access view_i very large comprising all members in its current group view_i). With an |access view_i| large, the probability to get 2f + 1 connections increases. Once 2f + 1 connections are established, there are |access view_i|- 2f + 1 false connections to remove. To remove false connections without compromise reliability, the joining node has to use a sequence number for each message sent. Then, it has to contact each process in its access view one by one (using a timeout). When the timeout elapses it send firstly (i) a CANCEL message to the previous contacted node (or another node if in the meanwhile the previous contacted node has left the group) in order to remove a possible false connection with it

(20)

and (ii) it sends a join request to the successive contact in its partial view.

When a node receives a join request it stores the identifier of the joining node in sponsored along with the sequence number of the join request.

When a node receives a CANCEL message it removes the identifier received from sponsored only if the sequence number is equal to the stored one+1.

5 Related Work

Traditional group membership does its best to provide each process with a precise membership information agreed among all processes in an asynchronous distributed systems with a finite number of processes and where processes can crash [10]. However it is well known that, in this system model, if the network does not eventually stabilize (i.e., the network does not show a synchrony period long enough to terminate a membership algorithm), no meaningful deterministic guarantee can be provided on the views delivered to processes [4, 10]. More specifically, [4] proves that such an agreement can be reached only if failure detectors in the system behave like eventually perfect ones and (ii) there is eventually a stable component (i.e., a set of processes that are alive and connected to each other and there is no connection between processes outside the stable component and the internal ones). It is easy to see that in previous model a solution to traditional group membership is also a solution to eventual group membership.

Let us now consider the system model of Section 2 where eventual group membership can be solved (as shown by the algorithms in Section 4). Traditional group membership cannot be solved in that setting. Such a model does not provide indeed any guarantee on the formation of a stable component as it would imply to have a time t after that synchrony assumptions hold.

(21)

Our framework also has similarities with Probabilistic Group Membership (PGM ) [6, 7, 8].

Our specification is partially inspired to the group membership specification in [8]. The main difference regards the Liveness property which has not been specified in [8]. Respecting Liveness has a valuable impact on the problem solvability leading to the two impossibility results above and the circularity problem (discussed in Section 4).

In [6, 7] there is no formal group membership specification, in both cases the authors present PGM algorithms doing their best to ensure a connected knows-about graph by using redundant messages. These PGM algorithms work in principle in the system model presented in Section 2 and their only aim is to support a probabilistic broadcast algorithm (e.g., a gossip algorithm) without degrading the performance of the latter [7]¹³. As stated by the same authors, these algorithms exhibit the risk of process isolation or of partitions which would lead to a severe degradation of the performance of the probabilistic broadcast algorithm. In our model these algorithms actually guarantee eventual group consistency only with a certain probability. Recovery from partitions of a group membership requires, for example, to define a set of priority processes which are constantly known by each process [6]. Both algorithms implicitly solves the circularity problem with the assumption that a joining process knows a contact inside the group.

6 Conclusion

The paper pointed out the necessity to formalize a new group membership specification in an environment where no ”long-enough” period of synchrony can be assumed to occur (i.e., asynchronous system model). This system model is typical of a peer-to-peer environment where the need to scale to a very large number of processes makes the occurrence of such synchrony periods not a

13Let us remark that probabilistic broadcast algorithms do not give any deterministic guarantee on the actual delivery of a message to any process.

(22)

reasonable assumption.

The paper provided a specification, namely eventual group membership, that does not require either to track precise group membership or agreement on group views among group members (as in classical group membership implementations). Nevertheless, the paper showed that eventual group membership is implementable in asynchronous distributed systems (two implementations have been presented in the paper) and that the specification gives deterministic guarantee against process isolation and partitions. Implementations of eventual group membership are actually in charge of dynamically adjusting the underlying overlay network (i.e., the knows-about graph topology) to ensure peer connectivity despite join, leaves and failures.

As remarked in the example shown in the Introduction, the specification of a group membership in asynchronous distributed systems can only be given on stationary processes (a transient process could never be included in any group view of another process). Therefore the issue related to track such transient processes inside the group views is a problem of performance of each eventual group membership algorithm. The more precise is the tracking of transient processes done by the algorithm, the more the group membership algorithm is effective.

The paper also pointed out a circularity problem that to be solved it implies the existence of some special process known by a newcomer at design time. This actually makes impossible a pure peer to peer implementation (no special process exists) of the eventual group membership specification. This limitation applies also to probabilistic implementation of group membership presented in [6, 7] as they are actually probabilistic implementations of eventual group membership.

Acknowledgments. We would like to thank Jean-Michel H´elary with whom we started to inves- tigate basic properties of dynamic distributed systems.

(23)

References

[1] Anindya Basu, Bernardette Charron-Bost, Sam Toeug: Simulating Reliable Links with Unreliable Links in the Presence of Process Crashes. Proceedings of the 10th International Workshop on Distributed Algorithms:

105 - 122 (1996)

[2] Kenneth Birman and Robert van Renesse: Reliable Distributed Computing with the Isis Toolkit. IEEE Computer Society Press (1994).

[3] Miguel Castro, Peter Druschel, Anne-Marie Kermarrec, Antony Rowstron: Scribe: A large-scale and decentralized application-level multicast infrastructure. IEEE Journal on Selected Areas in communications (2002)

[4] Gregory Chockler, Idit Keidar, Roman Vitenberg: Group Communication Specifications: a Comprehensive Study. ACM Comput. Surv. 33(4): 427-469 (2001)

[5] Tushar Deepak Chandra, Vassos Hadzilacos, Sam Toueg, and Bernardette Charron-Bost. On the impossibility of group membership. In 15th Annual ACM Symposium on Principles of Distributed Computing (PODC), pages 322330 (1996).

[6] Patrick Th. Eugster, Rachid Guerraoui, Sidath B. Handurukande, Petr Kouznetsov, Anne-Marie Kermarrec:

Lightweight Probabilistic Broadcast. ACM Trans. Comput. Syst. 21(4): 341-374 (2003)

[7] Ayalvadi J. Ganesh, Anne-Marie Kermarrec, Laurent Massouli´e: Peer-to-Peer Membership Management for Gossip-Based Protocols. IEEE Trans. Computers 52(2): 139-149 (2003)

[8] Richard A. Golding and Kim Taylor: Group membership in the epidemic style, /Technical Report UCSC- CRL-92-13, University of California, Santa Cruz (1992).

[9] John Jannotti, David K. Gifford, Kirk L. Johnson, M. Frans Kaashoek, James W. O’Toole: Overcast:

Reliable Multicasting with an Overlay Network. In OSDI, San Diego (2000)

[10] Idit Keidar, Jeremy B. Sussman, Keith Marzullo, Danny Dolev: A Client-Server Oriented Algorithm for Virtually Synchronous Group Membership in WANs. In 20th International Conference on Distributed Com- puting Systems: 356-365 (2000)

(24)

[11] Meng-Jang Lin, Keith Marzullo and Stefano Massini: Gossip versus Deterministically Constrained Flooding on Small Networks. In 14th International Conference on Distributed Computing: 253-267 (2000)

[12] Andr´e Schiper and Sam Toueg: From Set Membership to Group Membership: A Separation of Concerns, Technical Report, EPFL, Lausanne, (2003)