Eventual Leader Election in Infinite Arrival Message-passing System Model with
Bounded Concurrency
Sara Tucci-Piergiovanni
Dipartimento di Informatica e Sistemistica, Sapienza Università di Roma, Rome, Italy
CEA LIST, Saclay Centre, DRT/LIST/DILS/LISE, 91191 Gif-sur-Yvette, France
Roberto Baldoni
Dipartimento di Informatica e Sistemistica, Sapienza Università di Roma, Rome, Italy
Abstract—We study the failure detection problem in a message-passing system that may dynamically change over time, so that the number of processes making progress during a computation may grow to infinity as time tends to infinity, while the number of concurrently up processes does not exceed a known bound. We first propose the specification of a new oracle, called HB∗, able to give hints on which processes are making progress in the system. A possible HB∗ implementation is given. Then, we show how to use HB∗ to implement the oracle Ω, which eventually identifies a unique leader in the system. To the best of our knowledge, this is the first implementation of Ω running in a message-passing system with infinitely many processes.
I. INTRODUCTION
Context. The advent of a new class of applications and technologies, such as VANETs, airborne networks, smart environments, P2P, broad-area supercomputing, and cloud computing federations, poses a new challenge for the design of reliable applications. The challenge lies in the long-lived and unmanaged nature of these applications. Long-lived and unmanaged applications are characterized by a continual arrival of processes running the application, yet they do not rely on a superior manager controlling these processes. A manager is indeed assumed in managed applications, where it does its best to keep the assumptions of the underlying distributed system valid over time, for example by maintaining the required degree of synchrony or by manually restarting a process to keep a certain number of processes alive. The absence of a manager in unmanaged systems means that the system does not start with a known, pre-defined setting and that convenient system model assumptions either cannot be guaranteed or do not last for long, while a potentially infinite number of new arrivals must be accommodated.
Problem and Motivation. Reliable distributed computing is concerned with the study of fundamental problems under system models that successfully abstract real distributed systems. Within a system model, the notion of correct process is defined, and reliability guarantees have to be specified and delivered to all correct processes. Correct processes, however, are not known in advance, and proper abstractions, called failure detectors [9], have been designed to master this uncertainty. One of the most studied is the failure detector Ω, initially proposed in [8]. Ω eventually provides each correct process with a unique leader chosen among the set of correct processes. Failure detectors have been studied under system models that successfully abstract managed distributed systems. One of the most studied models is the crash failure model, where a system starts with a known set of processes, some of which may prematurely stop working by crashing while others will never crash (correct processes) [13]. Another well-studied model is the crash-recovery model [3], in which processes may recover after a crash. In this case, correct processes are those that recover a finite number of times, i.e., they are always running or eventually always running, while non-correct processes are those that do not recover after crashing or that crash an infinite number of times.
This model is usually augmented with stable storage and, as in the crash failure model, the set of processes that will be part of the system is known in advance.
For long-lived and unmanaged applications, infinite arrival models [11], [16] fit application needs better than both the crash failure model and the crash-recovery model, as (1) no initial knowledge about the set of processes that will be part of the system is assumed, and (2) processes may crash and later restart with another identifier an infinite number of times without relying on stable storage, i.e., the node where the process runs can physically fail and later restart without affecting the correctness of the application.
These features make the study of possible implementations of failure detectors, such as Ω, in infinite arrival models of great importance, but at the same time they make the problem of realizing such failure detectors far from trivial. Our study starts by assuming the strongest infinite arrival model in the hierarchy of models defined by Aguilera in [1], i.e., the infinite arrival model with bounded concurrency [11], [16]. In this model, processes may start and/or fail an infinite number of times as time tends to infinity, but the total number of concurrently running processes does not exceed a known bound.
Implementation Issues. Implementing a failure detector raises the issue of discovering correct processes in the system, as correct processes are not known in advance.
In infinite arrival models prone to crash failures, not only is the set of correct processes not known in advance, but no finite set containing it is known in advance either, since there is no initial knowledge of the set of processes that will be part of the system, and this set is potentially infinite as time elapses. This uncertainty leads to two different issues: (1) discovering the finite set of processes currently running, and (2) dealing with a possibly infinite set of non-correct processes that may wake up at any time, covering with their up times the whole computation.
Actually, solutions to the first issue are strongly anchored to the type of network processes are supposed to run in (wireless networks, WANs, LANs). In this paper we consider a model that abstracts processes deployed on WANs where an IP multicast primitive can be assumed1. IP multicast enables any process to independently join/leave a group and to send a multicast message to the group in an unreliable way.
By assuming only temporary partitions, each process can witness its presence in the system by periodically sending a heartbeat together with its identifier, letting other up processes know it and consider it part of the system. At first glance, it may seem that discovering the finite set of processes currently running is the hard part of the problem, herein solved by assumption, and that among this set it is possible to eventually select a unique leader following well-known solutions employed in other models. For instance, in the crash-failure model this issue is solved by eventually suspecting as crashed those processes whose heartbeats/messages stop arriving; if more than one process is considered alive, processes can be totally ordered according to their identifiers, and a local deterministic rule (e.g., taking the one with the lowest identifier) allows the unique leader to be eventually selected. In the crash-recovery model, each process can access a variable, namely epoch, stored in stable storage. epoch is the number of times a given process has already recovered. By sorting processes in increasing order of their epochs (a total order, breaking ties on epochs by identifier) and choosing the process sorted first, the unique leader is eventually and permanently selected.
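For concreteness, the two classical selection rules just described can be sketched as follows (an illustration only; identifiers and epoch values are made up):

```python
# Sketch of the classical leader-selection rules described above
# (illustrative only; identifiers and epoch values are made up).

def leader_crash_model(alive_ids):
    """Crash model: processes are totally ordered by identifier;
    pick the one with the lowest identifier."""
    return min(alive_ids)

def leader_crash_recovery(epochs):
    """Crash-recovery model: epochs maps id -> number of recoveries
    (read from stable storage); sort by (epoch, id) and pick the first."""
    return min(epochs, key=lambda p: (epochs[p], p))

print(leader_crash_model({3, 1, 7}))              # 1
print(leader_crash_recovery({3: 2, 1: 5, 7: 2}))  # 3 (lowest epoch; tie broken by id)
```

As the rest of the section argues, neither rule carries over to the infinite arrival model: identifiers keep arriving forever, and no stable storage is available to hold an epoch.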
1This assumption is nowadays validated by an increasingly growing number of large-scale systems, such as the pan-European Air Traffic Control (ATC) federation (Flight Object Interoperability Proposed Standard) and metropolitan and regional monitoring and control systems, which are deployed over IP-based WANs with full support for IP multicast.
In contrast to the crash-failure model with a finite set of processes, the infinite arrival model implies that heartbeats may arrive from correct and non-correct processes over the entire computation. A list of alive processes built by sorting process identifiers2 will continually include both correct processes and up but non-correct processes, since for any identifier assigned to a correct process, an infinite set of non-correct processes with a lower (or higher) identifier (e.g., processes running on nodes with a lower, respectively higher, IP address) may arrive over the entire computation. Since any choice on a flat set of correct and non-correct processes may lead to electing a non-correct leader infinitely often, the specification of Ω would be violated. The continual arrival of heartbeats is an issue also afflicting the crash-recovery model, where it is solved by sorting processes according to the values of the epoch variable, which eventually stabilizes (reaching an eventually constant value) for correct processes and is unbounded for non-correct ones. In our model, however, even using a finite set of identifiers, there is no possibility of using a variable counting the number of times a process with the same identifier starts in the system, as no stable storage is assumed.
Contribution. The paper proposes an implementation of the failure detector Ω in a message-passing system where infinitely many processes may arrive and depart over time and where the number of processes that may be simultaneously up is finite and cannot exceed a known bound C [11], [16].
In this system, the notion of correct process concerns any process that will be eventually and permanently up. In the rest of the paper such a process is called good. A process that is eventually up but remains up only a finite time is called bad.
The implementation of Ω in this system is composed of two different algorithms. The first algorithm implements a new lower-level oracle called HB∗, which provides a list called alive of length C containing processes deemed to be up in the system. The alive lists have to eventually and permanently include good processes in the first positions. To this end, the algorithm implementing HB∗ sorts processes in alive according to their age, where the age is a sequence number on heartbeats. Since (1) the age of a good process never stops increasing (the process continually sends heartbeats, incrementing the age) while the age of any bad process eventually stops, and (2) we assume unknown but finite bounds on message losses and message delays, there exists a point in time after which no bad process can be perceived as older than any good process, and bad processes will thus be listed after good ones. Even though this simple mechanism eventually places good processes in the first positions of the alive lists, possible losses and delays of heartbeats could lead a good process to never occupy a fixed
2It is here assumed that process identifiers, as locally assigned, have the standard structure (IP address, process identifier).
position (in the presence of other good processes). This may happen because any process may perceive the progress of good processes in an intermittent way (due to message losses and delays), so that the process that seems the oldest at a certain point in time, even if always up and sending heartbeats, could be perceived as mute for a certain interval of time, as its perceived age does not increase during the interval.
During that interval, another good process could surpass with its age the perceived age of the other, becoming the oldest. This scenario may occur infinitely often among good processes. As the moment at which a good process will never again be surpassed by a bad one is unknown, each time a process loses its position for a lower one, it is as if it were being suspected. This means that good processes could be suspected all the time if they could change their position infinitely often, which may lead to changing a (good) leader infinitely often, whatever the policy for choosing a process among the good ones. For this reason the HB∗ specification embeds a property called Stability, stating that eventually the good processes will take a fixed position inside the alive list.
To meet this property, the HB∗ algorithm embeds a novel mechanism that is able to detect a possible position instability among a set of processes and to fix the problem by granting one of the processes a so-called life bonus, i.e., a greater age than the one perceived. A life bonus granted to a certain process mimics a scenario in which a certain number of its future heartbeats will not be lost/delayed, thus trying to keep its position even in the case a certain number of losses/delays occur. The HB∗ oracle is thus able to eventually order good processes in the first positions, giving each of them a fixed position. The oracle, however, does not guarantee an eventual total order on good processes.
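The effect of a life bonus can be illustrated with made-up numbers (a sketch of the idea, not the actual protocol arithmetic):

```python
# Hypothetical numbers illustrating the life-bonus idea: an estimated age
# larger than the perceived age lets a process keep its rank across a
# bounded period of muteness.

def estimated_age(perceived_age, life_bonus):
    # the oracle ranks processes by this estimated age, not the raw one
    return perceived_age + life_bonus

p_perceived, rival_age = 10, 12   # p is mute while a rival advances to 12
assert estimated_age(p_perceived, 0) < rival_age   # no bonus: p loses its rank
assert estimated_age(p_perceived, 4) > rival_age   # a bonus of 4 masks the muteness
```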
Moreover, in any run the actual number of good processes g may be lower than C, and since bad processes may continually arrive, the alive lists could always include C − g bad processes as well.
The second algorithm that we propose eventually outputs a unique good leader in the system, using the alive lists provided by HB∗. It uses these lists together with a mechanism to identify a totally ordered subset of good processes. In this algorithm a known lower bound b on the number of good processes is used, letting the algorithm safely choose the set of leader candidates among the first b positions of the alive lists. The real number of good processes g can actually be greater than b, and different processes may have different processes in their first b positions. The algorithm exchanges alive lists and manipulates them by selecting the subset of processes eventually and permanently among the first b positions in all alive lists. A majority assumption on the number of good processes with respect to the maximum number C of concurrently running processes makes this selection possible.
Related Work. Many algorithms have been proposed for the implementation of Ω (e.g. [4], [5], [10], [12], [15]), but none of them in models with infinitely many processes. The implementation of the lower-level oracle HB∗ is inspired by the oracles of [3], [2], used in the crash-recovery model and for quiescent communication in the crash model. In [3], the idea of giving a list of processes deemed to be up (instead of a list of dead ones) is shown to be well-suited to the crash-recovery model; it is also well-suited to infinite arrival models. The implementation in infinite arrival models, however, cannot leverage information about process identity, as it is as if each recovery were done under a new process identifier. In [2] an oracle is proposed that does not use timeouts but counts heartbeats. This is exactly the approach used in the HB∗ implementation.
Models for dynamic systems have been proposed in [6], which, along with the concept of infinitely many processes, introduces the concept of an arbitrarily small and possibly changing part of the system a process may interact with (neighborhood). This neighborhood is defined by a dynamic interconnection graph that keeps all processes in the system connected. The presence of the dynamic connected interconnection graph is, however, given by assumption, and the implementation of this assumption is still an open problem [18]. In this paper, the assumption of a multicast primitive amounts to assuming a fully connected dynamic graph.
Interestingly, this fully connected graph is only used to build alive lists; alive lists can then be used to build a new neighborhood tailored to an overlay application. A characterization of churn in a dynamic distributed system is given in [7], where the solvability of a basic distributed computing abstraction, namely a regular register, is constrained by the churn rate.
Other, more practical works tackle the problem of leader election in dynamic systems [17], [14]. [17] proposes an architectural framework composed of a membership module and a leader election module. Processes may join and leave a group using the membership module. When a leader leaves a group, the leader election module provides a new leader. Different algorithms are evaluated under different failure conditions in order to study leader stability (the property of not demoting a leader while it is operational).
Interestingly, our algorithm preserves leader stability, as it eventually elects a leader using its age as a parameter rather than a fixed parameter such as its identifier. In [14] the leader election problem is studied in peer-to-peer networks. In this work, algorithms and requirements are specific to this type of network, considering, for instance, particular algorithms exploiting the graph of structured networks or the election of leaders that have some specific distance from other peers.
The paper is organized as follows: Section II presents the model, Section III presents the HB∗ formal specification and a possible implementation, Section IV presents the Ω implementation based on HB∗, and Section V concludes the paper. Due to lack of space, the correctness proofs are not included here; the interested reader can refer to [19].
II. MODEL
Process Behaviour. Processes can be up or down. Initially each process is down; it may then become up at some arbitrary time. A process can make computational steps only in the up state. Computational steps are the atomic events a process can execute during the computation. While up, a process behaves according to its specification. Once up, a process may later become down. The number of up processes in any interval of time is bounded above by a known constant C (bounded concurrency [16]). An infinite number of processes may become up during the computation; in this case an infinite number of processes will also return to being down. We can distinguish two different types of processes:
• A process is good if and only if (iff) it is eventually and permanently up. Note that a good process becomes up at an arbitrary time (so it may stay down for an arbitrarily long time before being up).
• A process is bad iff it is eventually up and, each time it is up, it turns down within a finite time.
Note that an unbounded number of processes may remain permanently down during a computation; these processes are neither bad nor good. Since the number of simultaneously up processes is bounded by a constant C, the number of good processes in any computation is also bounded above by C3. On the other hand, the number of bad processes has no upper bound, since at any time a new process could become up while another one turns down.
Process Identifiers. Each process is identified through an identifier pi that is a pair (IP address, Pid), where the prefix IP address is a standard IP address and Pid is the local process identifier, represented by a natural number.
The set of all identifiers is finite but a priori unknown. Note that the same identifier can be used more than once in non-overlapping intervals of time4.
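Such pairs admit a natural total order, useful for deterministic tie-breaking; a minimal sketch comparing pairs lexicographically (the dotted-string comparison is illustrative only, since a real system would compare the numeric address):

```python
# Sketch: identifiers as (IP address, Pid) pairs compared lexicographically.
# String comparison of dotted IPs is only illustrative; a real system would
# compare the numeric address.
p1 = ("10.0.0.1", 42)
p2 = ("10.0.0.2", 7)
p3 = ("10.0.0.1", 7)
print(sorted([p1, p2, p3]))  # [('10.0.0.1', 7), ('10.0.0.1', 42), ('10.0.0.2', 7)]
```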
Communication. Any pair of processes is able to communicate with two different primitives: a multicast primitive and a unicast primitive. The multicast primitive multicast(msg) sends the message msg to all other processes that have joined the multicast group5. The multicast group is identified by an IP address; to send/receive messages to/from this IP address, processes perform a join. It is assumed that the join operation eventually succeeds for good processes (an assumption realized by the use of IP multicast). We assume that each up process joins the multicast group at the moment it becomes up and
3If the number of good processes exceeded C, then eventually more than C processes would be simultaneously up for an infinite interval of time, which contradicts the hypothesis on the upper bound C.
4This may happen when, on the same node, a new process takes a local process identifier Pid previously used by some other process no longer running, for instance when the machine is restarted.
5For the sake of simplicity we assume the presence of only one multicast group.
never leaves the multicast group unless it turns down. The unicast primitive send(msg, pi) sends the message msg to the process identified by pi. All messages are received through an upcall receive(msg, source), where source is the IP multicast address for messages sent through a multicast and the process identifier for messages sent through a unicast. All messages are sent over links that can lose and delay messages; however, the following assumptions hold:
• there exists a finite and unknown bound Lmax such that the number of consecutive message losses on a link does not exceed Lmax;
• there exists a finite and unknown bound δmax such that the message transfer delay does not exceed δmax.
Assumptions on Clocks and Processing Time. It is assumed that each process is equipped with a correct clock whose drift rate with respect to real time is bounded. Clocks, however, are not synchronized with each other. To simplify the presentation, and without loss of generality, we assume in the following that local processing takes no time; only message transfer delay takes time.
III. THE HB∗ ORACLE
A. HB∗ specification
The HB∗ distributed oracle consists of a local module running at each up process. This oracle provides a list of processes deemed to be up, called alive. The alive list's size is bounded by C. Considering that in a run the actual number of good processes is g ≤ C, the ultimate goal of the oracle is to output an alive list that eventually contains the g good processes, appearing before the C − g bad processes possibly included [2]. During an arbitrary finite interval of time, the alive list may contain bad processes in the first g positions, and the number of inclusions of a bad process in the first g positions is finite and unknown in each run. However, there exists a time after which the alive list will permanently contain all and only good processes in the first g positions. It is also required that the order in the first g positions does not continually change over time, i.e., there exists an arbitrary time after which each good process occupies the same position permanently.
Formally, the HB∗ specification is defined through the following three properties:
Property 1. (Completeness). At each good process pi, the alive list will eventually include all good processes permanently.
Property 2. (Accuracy). Eventually and permanently, at each good process pi, for each good process pj in the alive list, each process pk ordered before pj is a good process.
Property 3. (Stability). Eventually and permanently, at each good process pi, for each good process pj in the alive list, pj will occupy a fixed position in the list.
Note that Stability does not imply that a process pj occupies the same fixed position at each process pi; in fact, no total order property is part of the HB∗ specification.
In Section IV we will describe how the proposed Ω implementation exploits these properties to eventually select a leader. Intuitively, Completeness and Accuracy are used to eventually elect a good process, while Stability avoids changing a good leader infinitely often.
B. HB∗ Protocol
The oracle is implemented through an always-running heartbeat protocol encapsulated in each module. Each heartbeat is sent every ∆i, along with a local sequence number called age, to other processes by calling the multicast primitive. At the receiver side, processes are listed in the alive list by sorting them in decreasing order of their ages. We will assume in the following that (i) for each pair pi, pj, ∆i = ∆j (all processes age at the same frequency, as the age is incremented every ∆i), and (ii) the (FIFO) order is never violated6.
With this simple mechanism and (unknown) bounds on message losses and delays, Completeness and Accuracy are easily assured, but Stability is not. Completeness comes from the fact that a good process will send an infinite number of heartbeats; thus, each other good process will eventually receive a heartbeat from it infinitely often. Bounds on message losses and delays are also sufficient to guarantee Accuracy even in the presence of an infinite arrival of bad processes. To explain why, let us describe a simple scenario (Figure 1) with a good process and a sequence of bad processes covering with their uptimes an infinite interval of time. In the scenario, an observer process receives heartbeats such that heartbeats sent by bad processes are never lost, while heartbeats from the good process are lost with a growing frequency for each pair of heartbeats successfully received. The figure shows the age perceived at the observer for the good process G and the currently up bad process B. As the scenario shows, bad processes woken up later than the good one can be perceived at the observer as older (in bold in the figure) than the good one at some point in time before turning down.
Thanks to the bounds on message losses and delays, however, the period during which the observer does not hear from the good process is bounded. Intuitively, these conditions let the good process be eventually and permanently perceived as older than any process that will later become up7, and thus let good processes eventually be sorted before all bad processes.
6This is not a restriction, as we assume that heartbeats carry timestamps and that an out-of-order heartbeat is discarded. A discarded heartbeat can be considered lost. Losses due to out-of-order arrivals do not lead to unbounded consecutive losses on a link, thanks to the finite bounds on consecutive losses and message delays.
7This comes from the fact that we implicitly assume that there exists a time t0 in which the system starts, thus only a finite number of bad processes can wake up earlier than the good process.
Figure 1. Race among a good process and an infinite sequence of bad processes (plot of the perceived ages G|B at the observer; graphical content not reproducible in text)
Bounds on message losses and delays do not suffice to assure the Stability property. What may happen by just counting heartbeats is a continual change of order in the alive lists, as the scenario depicted in Figure 2 shows. At the starting point, the observer sees a tie between the two good processes (in case of a tie, the order in the alive list is defined by a local rule; see below for further details). From this time on, a race starts in which their relative age at the observer continually inverts, so that the younger becomes the elder infinitely often: in the figure, when two heartbeats are lost from G1, two heartbeats are gathered from G2, and vice versa.
Figure 2. Race among two good processes (plot of the perceived ages G1|G2 at the observer; graphical content not reproducible in text)
More generally, a dangerous race with inverting ages (as the one depicted in Figure 2) arises if a good process pi, once it reaches an age such that it is perceived as older than another set of processes, infinitely often (i) is perceived as mute for a time long enough for the latter set to grow older than pi, and (ii) is again perceived as older than that set as the set becomes mute. The algorithm thus embeds a mechanism to detect this possible race among processes and, once this behavior is detected, artificially increments the number of heartbeats attributed to racing processes. We call this number the life bonus, and the protocol attributes an extra number of heartbeats greater than the maximum life bonus computed so far. That way, the protocol tries to cover the possible next period of muteness, granting the good process some progress not yet occurred. Thanks to the unknown bounds on message losses and delays, the monotonically increasing life bonus will eventually cover the maximum length of muteness a process can show, as it will reach the bounds. When this happens, the life bonus stops growing. This makes the mechanism harmless even when applied to bad processes, as a bad process will eventually be down and the prediction artificially extends its life only by a bounded quantity.
Let us note that a continual change of position in alive lists may occur even when two processes pi, pj race this way: once pj is perceived older than pi, pi and pj reach a tie, but the local rule sorts pi before pj; then pj becomes again older than pi. This scenario differs from the previous one in that the ages have never been inverted. To fix this problem, it is sufficient to use a local rule that sorts processes with the same age by preserving the relative order they had before the tie. In this case pj will always appear before pi. Proofs of the algorithm are given in [19].
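This tie-breaking rule can be sketched with a stable sort (a sketch, assuming the previous alive order is available; Python's sorted() is stable, so equal ages keep their prior relative order):

```python
# Sketch: re-sorting the alive list by decreasing estimated age with a
# stable sort preserves, among tied processes, the order they had before.

def reorder_alive(prev_alive, e_age):
    """prev_alive: previous alive list (oldest first); e_age: id -> age."""
    return sorted(prev_alive, key=lambda p: -e_age[p])

prev = ["pj", "pi"]                              # pj was ordered before pi
print(reorder_alive(prev, {"pj": 5, "pi": 5}))   # ['pj', 'pi'] (tie: order kept)
print(reorder_alive(prev, {"pj": 5, "pi": 6}))   # ['pi', 'pj'] (pi now older)
```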
C. The HB∗ Pseudo-code
In Figure 3 the HB∗ protocol pseudo-code is shown.
Data Structures. The protocol manages a data structure called HB[], which is a list indexed by process identifiers. At a process pi, HB[j] contains a record made of three fields: (i) age: the age of the last heartbeat received at pi from pj; (ii) elders: the set of processes pk that result older than pj since the last heartbeat from pj was received; (iii) e_age: the number of heartbeats estimated by pi for pj, including the possible life bonus pi decides to attribute to pj. Each process also manages the variable life_bonus, a monotonically increasing variable constituting the possible prediction on the number of lost and delayed heartbeats. The started variable keeps track of the set of processes pi has heard from.
Protocol Description. Each process sends a heartbeat every ∆ time units through a multicast. The heartbeat is sent along with a sequence number that tracks the total number of heartbeats already sent (lines 5-7). At the receiver side, the HB list (initially empty) is updated by adding the element related to pj only when the first heartbeat is received from pj (lines 9-11)8; then pi stores the number of heartbeats gathered from pj in the record field HB[j].age (line 13). This amount of heartbeats includes possibly lost heartbeats.
At this point it is detected whether a race characterized by alternating muteness periods has happened (lines 14-17). In particular, each time a process pk is found older than a set of already started processes, it is added to the elders variable of each process in this set (lines 14 and 18). When one of these processes, say pj, is subsequently found older than pk at line 15, the muteness variable is set to true, as a possible alternating muteness has been found. At lines 19-21, life_bonus is thus monotonically incremented by an arbitrary positive constant c and then assigned to pj. At this point the estimated age of pj is updated to include the life bonus (line 21). When the next heartbeat arrives
8The HB list is then unbounded, but it contains a finite number of elements at each finite point of time.
from pj, the life bonus will be granted again (line 12) to pj. In the output thread, the C − 1 processes with the maximum number of heartbeats in e_age are selected and included in the alive variable. Possible ties are ordered through a local deterministic rule that maintains the order processes had before the tie (initially all processes have the same age and can be ordered arbitrarily). pi's identifier is always in the first position of alive (lines 23-24).
INITIALIZATION
 1  started ← ∅;
 2  list HB[] of elements (age, e_age, elders) ← ∅;
 3  alive ← ∅;
 4  age ← 0; life_bonus ← 0; muteness ← false

SENDING
 5  every (∆)
 6      age ← age + 1;
 7      multicast (Heartbeat(age, pi))

RECEIVING
 8  when (receive (Heartbeat(age_r, pj))) do
 9      if (pj ∉ started)
10          then add the element (0, 0, ∅) indexed by j to the HB list;
11               started ← started ∪ {pj}
12      HB[j].e_age ← HB[j].e_age + (age_r − HB[j].age);
13      HB[j].age ← age_r;
14      for each pk ∈ started : HB[k].e_age < HB[j].e_age
15          if (pk ∈ HB[j].elders)
16              then HB[j].elders ← HB[j].elders \ {pk};
17                   muteness ← true;
18          HB[k].elders ← HB[k].elders ∪ {pj}
19      if (muteness = true)
20          then life_bonus ← life_bonus + c;
21               HB[j].e_age ← HB[j].e_age + life_bonus;
22               muteness ← false

OUTPUT
23  while true do
24      alive ← {pi} ◦ list of the C − 1 oldest pj sorted by e_age

Figure 3. HB∗ protocol pseudo-code at a process pi
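As an illustration of the bookkeeping in Figure 3, the following is a minimal Python sketch of the receiving and output logic at a single process. It is not the full message-passing protocol: the network layer is abstracted away (on_heartbeat plays the role of lines 8-22), the class name HBStar, the bound C = 3 and the bonus constant c = 1 are illustrative assumptions, and ties in the output are broken by process identifier rather than by the original pre-tie order.

```python
# Sketch of the HB* bookkeeping at a process p_i (Figure 3).
# on_heartbeat() is invoked as if a Heartbeat(age, p_j) had been delivered.

C = 3      # assumed bound on the number of concurrently up processes
c = 1      # assumed arbitrary positive constant for the life bonus

class HBStar:
    def __init__(self, my_id):
        self.my_id = my_id
        self.started = set()
        self.hb = {}              # j -> {"age", "e_age", "elders"} (HB list)
        self.life_bonus = 0

    def on_heartbeat(self, age_r, j):               # lines 8-22
        if j not in self.started:                   # lines 9-11: first heartbeat
            self.hb[j] = {"age": 0, "e_age": 0, "elders": set()}
            self.started.add(j)
        e = self.hb[j]
        e["e_age"] += age_r - e["age"]              # line 12: counts lost heartbeats too
        e["age"] = age_r                            # line 13
        muteness = False
        for k in self.started:                      # lines 14-18
            if self.hb[k]["e_age"] < e["e_age"]:
                if k in e["elders"]:                # pj overtook a former elder:
                    e["elders"].remove(k)           #   possible alternating muteness
                    muteness = True
                self.hb[k]["elders"].add(j)         # pj becomes an elder of pk
        if muteness:                                # lines 19-22
            self.life_bonus += c
            e["e_age"] += self.life_bonus

    def alive(self):                                # lines 23-24
        # p_i first, then the C-1 others with maximum e_age (id breaks ties).
        others = sorted((k for k in self.started if k != self.my_id),
                        key=lambda k: (-self.hb[k]["e_age"], k))
        return [self.my_id] + others[:C - 1]
```

For instance, if p2 is first found older than p3 and p3 later overtakes p2, the sketch detects a possible alternating muteness and grants the incremented bonus to p3's estimated age.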
IV. AN Ω IMPLEMENTATION BASED ON HB∗
In this section we present an implementation of the Ω oracle. The goal of the algorithm is to eventually elect a unique leader in the system. More formally, the Ω abstraction is defined by the following property:
Property 1. (Eventual Leader). There exists a time t after which every good process elects the same good process.
The proposed Ω oracle implementation is based on the alive lists output by the HB∗ oracle. These lists guarantee the Accuracy, Completeness and Stability properties, but at the same time they also have the following characteristics: (1) they may always contain bad processes and (2) different lists may never reach the same order on good processes. To face these issues the Ω algorithm works under the following assumption:
Assumption 1. The number of good processes is lower bounded by a known constant b such that b ≥ ⌊(C/2) + 1⌋.
Thanks to the lower bound b on the number of good processes, a good process pi can eventually choose a good process as leader, as it has the guarantee that eventually the first b positions of alive will be occupied by good processes. Let us call this set the trusted set at process pi. Let us note that each process could see a different trusted set, as in general the actual number of good processes in a run might be greater than b. Alive lists thus have to be exchanged to let every process know about the trusted sets of the others. Lists are exchanged as follows: each process sends its trusted set to all processes in its alive list, and it computes the union of its own trusted set only with those coming from trusted processes. This way it safely extends its trust to a greater number of good processes (if any). The majority assumption implies that at least one good process will eventually be included in all trusted sets.
However, these sets may differ since some trusted sets may include some processes not included in other trusted sets.
The protocol then computes the so-called candidates set, that is, the intersection of the trusted sets received from trusted processes in the alive lists. The leader is the process with the lowest identifier in the candidates set. Let us note that, at the beginning, any process only knows its own alive list. Alive lists are then continuously exchanged, and the trusted and candidates sets are recomputed each time trusted processes change their trusted sets or the order inside the trusted set (a change of order may happen because the leader goes down). The Stability, Accuracy and Completeness properties assure that eventually the trusted set at each process never changes. In the following the pseudo-code is presented. Proofs of the algorithm are given in []
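As a concrete illustration of this construction (with hypothetical process identifiers), take C = 5 and b = 3, and suppose pi's own trusted set and the two trusted sets received from its trusted processes are as below; the union extends trust, the intersection yields the candidates, and the lowest identifier wins:

```python
# Hypothetical trusted sets at a process p_i, with C = 5 and b = 3
# (so b >= floor(C/2) + 1 holds).
T0 = {"p1", "p2", "p3"}   # p_i's own trusted set (first b entries of alive)
T1 = {"p1", "p3", "p4"}   # trusted set received from the 1st trusted process
T2 = {"p2", "p3", "p5"}   # trusted set received from the 2nd trusted process

trusted = T0 | T1 | T2         # union: safely extended trust
candidates = T0 & T1 & T2      # intersection: commonly trusted processes
leader = min(candidates)       # lowest identifier in candidates
```

Here the three sets disagree on p2, p4 and p5, but p3 belongs to every set, so candidates = {"p3"} and p3 is elected.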
A. Ω Pseudo-code
In Figure 4 the protocol pseudo-code is shown.
Data Structures. Each process maintains the following data structures:
• a set for each trusted neighbor, T1, ..., Tk, ..., Tb−1, containing the trusted processes of the k-th neighbor;
• an ordered list T0 that contains the first b processes in alive;
• a trusted set and a candidates set;
• a leader variable containing the identifier of the current leader.
Protocol Description. At the beginning T1, ..., Tb−1 are set to empty sets and trusted_changed is set to true, so that immediately trusted = T0 at line 6 (strictly, T0 is a list while trusted and candidates are sets; the type conversion is implicit). The current value of trusted is sent to the current alive set (line 10). Each time trusted_r is received from a process pj which belongs to T0 and occupies its k-th position, Tk is updated to trusted_r (line 13). When the alive list changes in one (or more) of its first b positions, T0 is updated (line 16). The trusted and candidates sets are updated each time some Tk changes (lines 6-7). Each time candidates changes, the leader variable is updated (line 19). The sending thread (line 10) communicates the latest trusted set to the others. Eventually the same set is sent infinitely often, so the same candidates set stabilizes at each process (line 7) and the same leader is elected (line 19).
INITIALIZATION
 1  T1 ← ∅; ... Tb−1 ← ∅;
 2  trusted ← ∅; candidates ← ∅;
 3  T0 ← first b positions of alive;
 4  trusted_changed ← true

UPDATING
 5  when (trusted_changed) do
 6      trusted ← ⋃k=0..b−1 Tk;
 7      candidates ← ⋂k=0..b−1 Tk;
 8      trusted_changed ← false

SENDING
 9  while true do
10      send [trusted] to alive

RECEIVING
11  when (receive [trusted_r] from pj : j ≠ i) do
12      if (pj = k-th element of T0 ∧ Tk ≠ trusted_r)
13          then Tk ← trusted_r; trusted_changed ← true

INPUT
14  while true do
15      if (T0 ≠ first b positions of alive)
16          then T0 ← first b positions of alive;
17               trusted_changed ← true

OUTPUT
18  while true do
19      leader ← min(candidates)

Figure 4. Ω protocol pseudo-code at a process pi
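The event-driven structure of Figure 4 can be sketched in Python as follows. This is only a sketch under stated assumptions: the five threads are collapsed into handler methods, the HB∗ layer and the network are abstracted away (the caller feeds in alive lists and received trusted sets), the class name Omega and the value b = 3 are illustrative, and, unlike line 19 of the figure, the sketch leaves leader unset while candidates is still empty.

```python
# Sketch of the Omega layer of Figure 4 at a process p_i.

b = 3  # assumed known lower bound on the number of good processes

class Omega:
    def __init__(self):
        self.T = [set() for _ in range(b)]   # T0 .. T_{b-1} (lines 1-2)
        self.T0_list = []                    # ordered first b entries of alive
        self.trusted = set()
        self.candidates = set()
        self.leader = None

    def _update(self):                       # UPDATING thread, lines 5-8
        self.T[0] = set(self.T0_list)
        self.trusted = set().union(*self.T)           # line 6: union of the T_k
        self.candidates = set.intersection(*self.T)   # line 7: intersection
        if self.candidates:                  # OUTPUT thread, line 19
            self.leader = min(self.candidates)

    def on_alive(self, alive):               # INPUT thread, lines 14-17
        first_b = alive[:b]
        if first_b != self.T0_list:
            self.T0_list = first_b
            self._update()

    def on_trusted(self, j, trusted_r):      # RECEIVING thread, lines 11-13
        # Caller guarantees j != i; p_i itself sits at position 0 of T0.
        if j in self.T0_list:
            k = self.T0_list.index(j)
            if self.T[k] != trusted_r:
                self.T[k] = set(trusted_r)
                self._update()

    def message(self):                       # SENDING thread, lines 9-10
        return set(self.trusted)             # payload sent to all of alive
```

Once each trusted process reports a stable trusted set, candidates stops changing and every invocation of _update elects the same minimum identifier.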
V. CONCLUDING REMARKS
In this paper we gave an implementation of an eventual leader abstraction in an infinite arrival model with bounded concurrency. The implementation is based on tracking a property of processes, namely their age, which makes it possible to eventually distinguish between good and bad processes. The lower-level HB∗ algorithm gives weak hints on the set of good processes by providing a list of processes deemed to be up; this list, however, (1) may always contain bad processes and (2) may never reach the same order on good processes at different processes. The specification of the oracle, in any case, does not imply a total order. It is the upper-level algorithm that fixes these problems by eventually making a particular, totally ordered subset of good processes known to all good processes. This algorithm relies on a lower bound on the number of good processes that is at least a majority of the maximum number of processes that can be simultaneously up. However, the minimal set of assumptions needed to implement Ω in this model has not been determined, and it will be the subject of our future work.
REFERENCES
[1] M. Aguilera. A Pleasant Stroll through the Land of Infinitely Many Creatures. ACM SIGACT news, DC column, 35(2), pages 36-59, 2004.
[2] M. Aguilera, W. Chen, and S. Toueg. Heartbeat: a Timeout-Free Failure Detector for Quiescent Reliable Communication. Technical Report 97-1631, Ithaca, NY, USA, 1997.
[3] M. Aguilera, W. Chen, and S. Toueg. Failure Detection and Consensus in the Crash-recovery Model. Distributed Computing, 13(2), pages 99-125, 2000.
[4] M. Aguilera, C. Delporte-Gallet, H. Fauconnier, and S. Toueg. On Implementing Omega with Weak Reliability and Synchrony Assumptions. In Proceedings of the 22nd Annual ACM Symposium on Principles of Distributed Computing (PODC03), pages 306-314, 2003.
[5] M. Aguilera, C. Delporte-Gallet, H. Fauconnier, and S. Toueg. Communication-efficient Leader Election and Consensus with Limited Link Synchrony. In Proceedings of the 23rd Annual ACM Symposium on Principles of Distributed Computing (PODC04), pages 328-337, 2004.
[6] R. Baldoni, M. Bertier, M. Raynal, and S. Tucci-Piergiovanni. Looking for a Definition of Dynamic Distributed Systems. In Proceedings of the 9th International Conference on Parallel Computing Technologies (PaCT07), Springer Verlag LNCS, pages 1-14, 2007.
[7] R. Baldoni, S. Bonomi, and M. Raynal. Regular Register: an Implementation in a Churn-prone Environment. In Proceedings of the 16th International Colloquium on Structural Information and Communication Complexity (SIROCCO 2009).
[8] T. Chandra, V. Hadzilacos, and S. Toueg. The Weakest Failure Detector for Solving Consensus. Journal of the ACM, 43(4), pages 685-722, 1996.
[9] T. Chandra and S. Toueg. Unreliable Failure Detectors for Reliable Distributed Systems. Journal of the ACM, 43(2), pages 225-267, 1996.
[10] A. Fernandez, E. Jimenez, and M. Raynal. Eventual Leader Election with Weak Assumptions on Initial Knowledge, Communication Reliability, and Synchrony. In Proceedings of the International Conference on Dependable Systems and Networks (DSN06), pages 166-178, 2006.
[11] E. Gafni, M. Merritt, and G. Taubenfeld. The Concurrency Hierarchy, and Algorithms for Unbounded Concurrency. In Proceedings of the 20th Annual ACM Symposium on Principles of Distributed Computing (PODC01), pages 161-169, 2001.
[12] E. Jimenez, S. Arevalo, and A. Fernandez. Implementing Unreliable Failure Detectors with Unknown Membership. Information Processing Letters, 100, pages 60-63, 2006.
[13] L. Lamport and M. Fischer. Byzantine Generals and Transaction Commit Protocols. SRI International, Technical Report 62, 1982.
[14] V. Lo, D. Zhou, Y. Liu, C. Dickey, and J. Li. Scalable Supernode Selection in Peer-to-Peer Overlay Networks. In Proceedings of the 2nd International Workshop on Hot Topics in Peer-to-Peer Systems (HOT-P2P05), pages 18-25, 2005.
[15] D. Malkhi, F. Oprea, and L. Zhou. Omega Meets Paxos: Leader Election and Stability without Eventual Timely Links. In Proceedings of the 19th International Symposium on Distributed Computing (DISC05), pages 199-213, 2005.
[16] M. Merritt and G. Taubenfeld. Computing with Infinitely Many Processes. In Proceedings of the 14th International Conference on Distributed Computing (DISC00), pages 164-178, 2000.
[17] N. Schiper and S. Toueg. A Robust and Lightweight Stable Leader Election Service for Dynamic Systems. In Proceedings of the 38th International Conference on Dependable Systems and Networks (DSN08), pages 207-216, 2008.
[18] S. Tucci-Piergiovanni and R. Baldoni. Connectivity in Eventually Quiescent Dynamic Systems. In Proceedings of the 3rd Latin-American Symposium on Dependable Computing (LADC07), Springer Verlag LNCS, pages 38-56, 2007.
[19] S. Tucci-Piergiovanni and R. Baldoni. Eventual Leader Election with Infinite Arrivals. Technical Report, http://www.dis.uniroma1.it/~midlab/publications