Eventual Leader Election in the Infinite Arrival Message-passing
System Model
Sara Tucci-Piergiovanni and Roberto Baldoni
Dipartimento di Informatica e Sistemistica ”Antonio Ruberti”
”Sapienza” University, Rome
Abstract
We study the the failure detection problem in a message-passing system that may dynamically change over time, so that the number of processes which make progress during a computation may grow to infinity as time tends to infinity. We first propose an oracle well-suited to such systems, called HB∗, by detailing its specification and a possible implementation. HB∗gives hints on what processes are alive in the system. Then, we show how to use HB∗to implement the oracleΩ that eventually identifies an unique leader in the system. To the best of our knowledge this is the first implementation ofΩ running in a message passing system with infinitely many processes.
Contact Author: Sara Tucci Piergiovanni email: [email protected]
Postal Address: Dipartimento di Informatica e Sistemistica, Universit´a di Roma La Sapienza, Via Ariosto 25, I-00198 Roma, Italy.
Telephone number: +39-06-77274056
Please consider our paper for REGULAR SUBMISSION.
1 Introduction
Context. Distributed systems are rapidly evolving, the advent of a new class of applications and technologies (VANET, Airborn Networks, DoD Global Information Grid), radically changed the way we were used to think about them.
Emerging systems (also called dynamic systems) have a structure that is self-defined at any time by entities that autonomously decide to locally run the same distributed application. For instance, in many peer-to-peer applications (e.g. kazaa) a user has just to download a software, then she can decide how to configure it (e.g. how many resources to share, depending also on the local availability), when and where to run it. To some extent the system can be viewed at any time as just the sum of all running entities, it might also cease its existence when no entity is currently active while at some later moment new entities arrive and connect to each other to form again the system. The set of entities that over time might form the system is potentially infinite. This ”infinite arrival model” is the key distinguishing factor between dynamic systems and traditional distributed systems where, on the contrary, the set of system entities is fixed since the deployment of application components is controlled and managed.
Problem and Motivation. For many years theoretical research was concerned with the study of fundamental prob- lems, e.g. reliable and ordered communications, consensus, etc., under models that successfully abstract traditional distributed systems. One of the most studied model is the crash failure model, where system starts with a known set of processes and where some of them may prematurely stop working by crashing while other will never crash (correct processes). Problems are usually specified in terms of what guarantees have to be assured to correct processes. How- ever, correct processes are not known in advance and proper abstractions, called failure detectors, have been designed to master this uncertainty. One of the most studied is the failure detectorΩ. Ω eventually provides each correct process with a unique leader in the system chosen among the set of correct processes.
In infinite arrival models prone to crash failures, not only the set of correct processes is not known in advance, but no finite set containing the set of correct processes is known in advance. This actually is a higher level of uncertainty to be mastered in dynamic systems. This makes the study of possible implementations of failure detectors, asΩ, of paramount importance and at the same time makes the problem of realizing such failure detector far from being trivial.
Implementation Issues. The uncertainty posed by infinite arrival models brings to two different issues (1) discovering the finite set of processes currently running and (2) dealing with a possible infinite set of non-correct processes that may wake up at any time, covering with their up times the whole computation. Actually, solutions to the first issue are strongly tight to the type of network processes are supposed to run in (wireless networks, WANs, LANs). In this paper we consider a model that abstracts processes deployed on WANs where an IP multicast primitive can be assumed1. The IP multicast let any process independently join/leave a group and send a multicast message to the group in an unreliable way. By assuming only temporary partitions, each process can witness its presence in the system by periodically sending a heartbeat and its identifier, to let other up processes know it and consider it as part of the system.
At first glance, it may seem that discovering the finite set of processes currently running is the hard part of the problem, herein solved by assumption, and that among this set it is possible to eventually select a unique leader following well-known solutions employed in the crash-failure model. However, in the crash-failure model the objec- tive is to assure that any correct process is eventually able to univocally select one correct process among an initial set containing a finite number of non-correct processes. By eventually suspecting as crashed those processes whose heartbeats/messages stops to arrive, all the complexity lies in avoiding to falsely suspect at least one correct process infinitely often. Solving this issue means that eventually and permanently at least one correct process will be con- sidered as alive. If more than one correct process is considered as alive, processes can be totally and locally ordered at each process sorting their identifiers. By establishing this total order it is possible to apply at each process a local
1This assumption is nowadays validated by an increasingly growing number of large scale systems, such as pan-european Air Traffic Control (ATC) federation (Flight Object Interoperability Proposed Standard), metropolitan and regional Metropolitan monitoring and control systems, which are deployed over IP-based WAN with full support for IP multicast.
deterministic rule that independently chooses the same process (e.g. the one with the lowest identifier) as leader.
In contrast to this crash-failure model with a finite set of processes, the infinite arrival model implies that heartbeats may arrive from correct and non-correct processes over the entire computation. A list of alive processes built by sorting process identifiers2will continually include correct processes and up but non-correct processes since for any identifier assigned to a correct process, an infinite set of non-correct processes with a lower identifier or higher identifier (e.g.
processes running on nodes with a lower, respectively higher, IP address) may arrive over the entire computation.
Without a selection of the only set of correct processes, any choice on the flat mixed set may lead to elect a non- correct leader infinitely often, violating the specification ofΩ.
Contribution. The paper proposes an implementation of the failure detectorΩ in a message passing system where infinitely many processes may arrive and depart over time and where the number of processes which may be simul- taneously up is finite and cannot exceed a known boundC [7, 9]. The implementation is composed by two different algorithms. The first algorithm implements a new lower-level oracle calledHB∗ which provides a list calledalive of lengthC containing processes deemed to be up in the system. The alive lists have to eventually and permanently include correct processes in the first positions. The algorithm implementingHB∗ sorts processes inalive by their age. The age is a sequence number on heartbeats. A correct process can just getting older and older, i.e. its age never stops increasing. A non-correct process will reach a finite age and will turn down. By assuming unknown bounds on message losses and message delay, there will exist a point of time after which no non-correct process can be perceived as older than any correct process. Sorting by age is then a way to eventually have correct processes in the first positions of the alive lists. The oracle, however, does not guarantee an eventual total order on correct processes. Moreover, in any run the set of correct processes may be lower thanC, and since the non-correct processes may continually arrive, the alive lists could always include otherC − b non-correct processes.
The second algorithm eventually outputs a unique correct leader in the system and uses the alive lists provided by HB∗. Actually it uses these lists and employs a mechanism to identify a subset of correct processes totally ordered.
In this algorithm a known lower boundb on the number of correct processes is used, to let the algorithm safely choose, among the firstb positions of alive lists, the set of leader candidates. Actually, the real number of correct processes can be greater thanb, and different processes may have different processes in the first b positions. The algorithm will exchange alive lists and manipulate them by selecting the subset of processes eventually and permanently among the firstb positions in all alive lists. A majority assumption on the number of correct processes w.r.t. the total number of processes concurrently running makes this selection possible.
Related Work. Many algorithms have been proposed for the implementation ofΩ (e.g. [3, 4, 6, 8]) but none of them in models with infinitely many processes. The implementation of the lower-level oracleHB∗is inspired to the oracles [2, 1] used in the crash-recovery model and for quiescent communication in the crash model. In [2] the idea of giving a list of processes deemed to be up (instead of a list of dead ones) is shown to be well-suited to the crash-recovery model, and it also well-suited to infinite arrival models. The implementation in infinite arrival models, however, cannot leverage the information about the process identity as it is like each recovery is done by changing process identifier.
In [1] it is proposed an oracle which does not use timeouts but it counts heartbeats. This exactly the same approach used in theHB∗implementation.
Models for dynamic systems have been proposed in [5] that, along with the concept of infinitely many processes, introduces the concept of an arbitrary small and possibly changing part of the system a process may interact with (neighborhood). This neighborhood is defined by a dynamic interconnection graph which keeps connected all pro- cesses in the system. The presence of the dynamic connected interconnection graph is, however, given by assumption and the implementation of this assumption is still an open problem [10]. In this paper the assumption on a mul- ticast primitive means assuming a fully connected dynamic graph. What is interesting is that this fully connected
2It is here assumed that process identifiers , as locally assigned, have the standard structure (IP address, process identifier).
graph is only used to build alive lists, alive lists can be then used to build a new neighborhood tailored for an overlay application.
The paper is organized as follows: Section 2 presents the model, Section 3 presents theHB∗formal specification and a possible implementation. Section 4 presents theΩ implementation based on HB∗and Section 5 concludes the paper. Due to lack of space formal proofs of algorithms are given in the Appendix.
2 Model
Process Behaviour. Processes can beup or down. Initially each process is down, then it may become up at some arbitrary time. A process can make computational steps only in theup state. Computational steps are the atomic events a process can execute during the computation. Whenup a process behaves accordingly to its specification. Once up, a process may later becomedown. The number of up processes in any interval of time is upperly bounded by a known costantC (bounded concurrency [9]). An infinite number of processes may become up during the computation. In this case an infinite number of processes will return to bedown.
We can distinguish two different types of processes:
• A process is good if and only if (iff) it is eventually and permanently up. Note that a good process becomes up at an arbitrary time (so it may staydown for an aribtrary time before being up).
• A process is bad iff, it is eventually up and each time it is up, it turns to be down in a finite time.
Note that an unbounded number of processes may remain permanently down during a computation, these processes are neither bad nor good. Since the number of simultaneousup processes is bounded by a costant C, then the number ofgood processes in any computation is also upperly bounded by C3. On the other hand, the number ofbad processes has no upper bound since at any time a new process could become up while another one turns down.
Process Identifiers. Each process is identified through an identifierp that is a pair (IP address, P id), where the prefixIP address is a standard IP address and P id is a natural number. The set of all identifiers is finite but a priori unknown. Note that the same identifier can be used more than once in not overlapping interval of times4.
Communication. Any pair of processes inΠ is able to communicate with two different primitives: a multicast prim- itive and a unicast primitive. The multicast primitivemulticast(msg) sends the message msg to all other processes joined to the multicast group5. We assume that each up process joins to the multicast group at the moment it becomes up and never leaves the multicast group unless it turns down. The unicast primitivesend(msg, pi)) sends the message msg to pi. All messages are received through an upcallreceive(msg, source) where source is only the IP multicast address for messages sent through a multicast and it is the process identifier for messages sent through a unicast.
All messages are sent over links that can lose and delay messages, however the following assumptions hold:
• there exists a finite and unknown bound Lmaxsuch that the number of consecutive message losses on a link does not exceedLmax.
• there exists a finite and unknown bound δmaxsuch that the message transfer delay does not exceedδmax. Assumptions on Clocks and Processing Time. It is assumed that each process is equipped with a correct clock. The drift rate with respect to the real time is limited. Clocks, however, are not synchronized each other. To simplify the
3If the number of good processes exceededC, then eventually more than C processes would be simultanously up for an infinite interval of time, which contradicts the hypotesis on the upper boundC.
4That may happen when on the same node a new process takes a process identifierP id previously used by some other process no more running, for instance when the machine is restarted.
5For sake of simplicity we assume the presence of one multicast group.
presentation, and without loss of generality, we assume in the following that local processing takes no time. Only message transfer delay takes time.
3 The HB
∗Oracle
TheHB∗distributed oracle is constituted of a local module running at eachup process. This oracle provides a list of processes deemed to be up, calledalive. The alive list is bounded by C. Considering that in a run the actual number ofgood processes is g ≤ C, the ultimate goal of the oracle is to give as output an alive list that eventually contains theg good processes and they appear before the C − g bad processes possibly included[1]. During an arbitrary finite interval of time thealive list may contain in the first g positions bad processes. The number of inclusions of a bad process in the firstg positions, is finite and unknown in each run. However, there exists a time after which the alive list will permanently contain all and only good processes in the firstg positions. It is also required that the order in the firstg positions does not continually change over the time, i.e. there exists an arbitrary time after which each good process occupies the same position permanently. Formally, theHB∗’s specification is defined through the three following properties:
Property 1 (Completeness). At eachgood process pi, thealive list will eventually include all good processes perma- nently.
Property 2 (Accuracy). Eventually and permanently, at eachgood process pi, for eachgood process pjin thealive list, each processpkordered beforepjis agood process.
Property 3 (Stability). Eventually and permanently, at eachgood process pi, for eachgood process pj in thealive list,pjwill occupy a fixed position in the list.
In Section 4 we will describe how the proposedΩ implementation exploits these properties to eventually select a leader. Intuitively, Completeness and Accuracy are used to eventually elect a good process, while Stability avoids to change a good leader infinitely often.
3.1 HB
∗Protocol
The oracle is implemented through an always running heartbeat protocol encapsulated at each module. Each heartbeat is sent every∆ialong with a local sequence number, calledage to other processes calling the multicast primitive. At receiver side, processes are listed in thealive list by sorting them following the decreasing order of their ages.
We will assume in the following that (i) for each pairpi, pj⇒ ∆i= ∆j(all processes grow with same frequency), (ii) the (fifo) order is never violated6.
With this simple mechanism and (unknown) bounds on message losses and delays, completeness and accuracy are easily assured, but stability is not. Completeness comes from the fact that a good process will send an infinite number of heartbeats, thus, each other good process will eventually receive a heartbeat from it infinitely often. Bounds on message losses and delays are also sufficient to guarantee Accuracy even with an infinite arrival of bad processes.
To explain why let us describe a simple scenario (Figure 1) with a good process and a sequence of bad processes covering with their uptimes an infinite interval of time. In the scenario anobserver process receives heartbeats such that heartbeats sent by bad processes are never lost while heartbeats from the good process are lost with a growing frequency for each pair of heartbeats successfully received. The figure shows the age perceived at the observer for the good processG and the currently up bad process B. As the scenario shows, bad processes waked up later than the
6This is not a restriction as we assume that heartbeats carry timestamps and a late heartbeat is discarded. This discarded heartbeat can be considered as lost.
good one can be perceived at the observer as older (in bold in the figure) than the good at some point of time before turning down. Thanks to bounds on message losses and delays, however, the period the observer does not hear from the good process is bounded. Intuitively these conditions let the good one be eventually and permanently perceived as older than any process that will later become up7and thus to let good processes be eventually sorted before all bad processes.
Pgood
Pobs
Bad ones G|B 0|0
G|B 1|1
G|B 1|2
G|B 1|3
G|B 4|3
G|B 5|1
G|B 5|2
G|B 5|3
G|B 5|4
G|B 5|5
G|B 5|6
G|B 11|6
1 2 3 4 5 6 7 8 9 10 11
1 2 3 1 2 3 4 5 6
Figure 1: Race among a good process and an infinite sequence of bad processes
To assure the stability property bounds on message losses and delays does not suffice. What may happen by just counting heartbeats is a continual change of order in the alive lists as the the scenario depicted in Figure 2 shows. At the starting point, the observer judges there is a tie for the two good processes (in case of tie, the order in the alive list may be defined by using a local rule, not necessarily the same at each process). From this time on, it starts a race in which their relative age at the observer continually inverts, so that the younger becomes the elder infinitely often: in the figure when two heartbeats are lost fromG1, two heartbeats are gathered from G2 and viceversa.
Pgood1
Pobs
Pgood2 G1|G2
6|6
7 8 9 10 11 12 13 14
7 8 9 10 11 12 13 14
G1|G2 6|7
G1|G2 6|8
G1|G2 9|8
G1|G2 10|8
G1|G2 10|11
G1|G2 10|12
G1|G2 13|12
G1|G2 14|12
Figure 2: Race among two good processes
More generally, this race arises if a good processpi, once it gets a certain age such that it is perceived as older than an another set of processes, will infinitely often (i) be perceived as mute for a time long enough such that the latter set grows older thanpiand (ii) will again be perceived as older than the previous set as this set becomes mute. Let us note that if all heartbeats from the processpi arrive, this problem does not arise. The algorithm, thus, embeds first a mechanism to detect this possible race among processes and when this behavior is detected it increments artificially the number of heartbeats to racing processes. We call this number aslif e bonus and the protocol will attribute an extra number of heartbeats greater than the maximumlif e bonus computed so far. That way, the protocol tries to cover the possible next period of muteness, granting to the good process some progress not yet occured. Thanks to the unknown bounds on message losses and delays, the monotonic increasinglif e bonus will eventually cover the maximum length of muteness a process can show, as it will reach the bounds. When this happens, thelif e bonus stops to grow. This makes this mechanism not dangerous even when applied to bad processes, as a bad process will eventually be down and the prediction artificially extends its life only of a bounded quantity. Proofs of the algorithm are given in the Appendix.
7This comes from the fact that we implicitly assumed that there exists a timet0in which the system starts, thus only a finite number of bad processes can wake up earlier than the good process.
3.2 The HB
∗Pseudo-code
Data Structures. The protocol manages a data structure calledHB[] which is a list indexed by process identifiers.
At a processpi, HB[j] contains a record made of three fields: (i) age: number of heartbeats actually received at pifrompj; (ii)elders: the set of processes pk which result older thanpj since the last heartbeat frompj has been received; (iii)e age: the number of heartbeats estimated by piforpi, including the possible life bonuspidecides to attribute topj. Each process manages also the variablelif e bonus, a monotonic increasing variabile which constitues the possible prediction on the number of lost and delayed heartbeats. Thestarted variable keeps track of the set of processespiheard from.
Protocol Description. Each process sends a heartbeat every∆ time units through a multicast. The heartbeat is sent along with a sequence number which tracks the total number of heartbeats already sent (lines 5-7). At the receiver side, the HB list (initially empty) is updated by adding the element related topj only when the first heartbeat is received frompj (lines 9-11)8, thenpi stores the number of heartbeats gathered frompj in the record fieldHB[j].age (line 13). The amount of heartbeats includes possible lost heartbeats. At this point it is detected if a race characterized by alternating muteness periods is happened (lines 14-17). In particular, each time a processpkis found older than a set of processes already started, it is added in theelders variable of this set (lines 14 and 18). When one of this set, let’s saypj, is successively found older thanpkat line 15, then themuteness variable is set to true, as a possible alternating muteness has been found. At lines 19-21, thelif e bonus is thus monotinically incremented by an arbitrary positive constantc and then assigned to pj. At this point the estimated age ofpjis updated including thelif e bonus (line 21).
When the next heartbeat will arrive frompjthelif e bonus will be granted again (line 12) to pj. In the output thread theC − 1 processes with the maximum number of heartbeats in e age are selected and included in the alive variable.
Possible ties are ordered with any local deterministic rule, here not specified. Thepi’s identifier is always in the first position ofalive (lines 23-24).
4 A Ω Implementation based on HB
∗In this section we will present an implementation of theΩ oracle. The goal of the algorithm is to eventually elect a unique leader in the system.
More formally, TheΩ abstraction is defined by the following property:
Property 1 (Eventual Leader). There exists a timet after which every good process elects the same good process.
The proposedΩ oracle implementation is based on the output provided by the HB∗ oracle. These lists will guarantee the accuracy, completeness and stability properties, but at the same time they also have the following char- acteristics: (1) they always contain bad processes and (ii) different lists may never reach the same order on good processes. The latter characteristic imposes to exchange alive lists among processes to eventually reach a common order. However, this task has to be carefully handled due to the first characteristic. To face these issues theΩ algorithm works under the following assumption:
Assumption 1. The number of good processes is lower bounded by a known constantb such that b ≥ ⌊(C/2) + 1⌋.
The possible permanent presence of bad processes in the alive lists (even if in the last positions) and their possible different order justify the assumption on a lower boundb on the number of good processes. This way a good process pican discard all bad processes as it has the guarantee that eventually the firstb positions will be occupied by good processes in alive. Let us call this set thetrusted set at a process pi. Let us note that each process could see a different
8The HB list is then unbounded, but it contains a finite number of elements at each finite point of time.
INITIALIZATION 1 started ← ∅;
2 list HB[] of elements(age, e age, elders) ← ∅;
3 alive ← ∅;
4 age ← 0; lif e bonus ← 0; muteness ← f alse;
SENDING 5 every(∆) 6 age ← age + 1;
7 multicast(Heartbeat(age, pi))
RECEIVING
8 when(receive (Heartbeat(ager, pj)) do 9 if(pj6∈ started)
10 thenadd the element(0, 0, ∅) indexed by j to the HB list;
11 started ← started ∪ {pj}
12 HB[j].e age ← HB[j].e age + (ager− HB[j].age);
13 HB[j].age ← ager;
14 for each pk∈ started : (HB[k].e age < HB[j].e age) 15 if(pk∈ HB[j].elders)
16 thenHB[j].elders ← HB[j].elders/{pk};
17 muteness ← true;
18 HB[k].elders ← HB[k].elders ∪ {pj};
19 if(muteness = true)
20 thenlif e bonus ← lif e bonus + c;
21 HB[j].e age ← HB[j].e age + lif e bonus;
22 muteness ← f alse;
OUTPUT
23 whiletrue do
24 alive ← {pi} ◦ list of the C − 1 oldest pjsorted by e age
Figure 3:HB∗protocol pseudo-code at a processpi
trusted set as in general the actual number of good processes in any run might be greater thanb. Alive lists thus have to be exchanged to let any process know about others trusted sets. List are exchanged such that each process sends its trusted set to all processes in the alive list, a process computes the union of its trusted set only with those coming from trusted processes. This way it safely extends its trust to a greater number of good processes (if any). The majority assumption implies that at least one good process will be eventually included in all trusted sets. However, these sets may differ since some trusted sets may include some processes not included in other trusted sets. The protocol then will compute the so-called candidates set that is the intersection of trusted sets received by trusted processes in the alive lists. The leader is the process with the lowest identifier in the candidates set. Let us note that, at the beginning, any process only knows its alive list. Alive lists are then continuously exchanged and the trusted and candidate sets computed each time trusted processes change their trusted sets or the order in the trusted set (a change of order may happen because the leader becomes down). The stability, accuracy and completeness assure that eventually the trusted set at each process never change. In the following the pseudo-code is presented. Proof of the algorithm are given in the Appendix.
4.1 Ω Pseudo-code
In Figure 4 the protocol pseudo-code is shown.
Data Structures. Each process endows the following data structures:
• a set for each trusted neighbor, T1, ..., Tk, ...Tb−1containing the trusted processes of thek−th neighbor;
• a ordered list T0that contains the firstb processes in alive;
• a trusted set and a candidates set;
• a leader variable containing the identifier of the current leader.
Protocol Description. At the beginningT1, ...Tb−1are set to empty sets andtrusted changed is set to true, so that immediatelytrusted = candidates = T0(lines 6,7) and theleader is set to the process with the lowest identifier inT0(line 20). The current value oftrusted (line 10) is sent to the current alive set. Each time trustedris received from a processpjwhich belongs toT0and it is in thek-th position, then Tkis updated totrustedr(line 13). When the alive list changes its firstb positions then T0is updated (line 17). The trusted and candidates set are updated each time someTk changes (lines 6,7). Each timecandidates changes, the leader variable is updated (line 20). The sending thread (line 10) allows to communicate the last trusted set to others. Eventually a same set is sent infinitely often.
INITIALIZATION
1 T1← ∅; ...Tb−1← ∅;
2 trusted ← ∅; candidates ← ∅;
3 T0← f irst b positions of alive;
4 trusted changed ← true
UPDATING
5 when(trusted changed) do 6 trusted ←S
k=0..b−1Tk; 7 candidates ←T
k=0..b−1Tk; 8 trusted changed ← f alse;
SENDING 9 whiletrue do
10 send[trusted] to alive;
RECEIVING
11 when(receive [trustedr] from pj: j 6= i) do 12 if(pj= kthelement of T0∧ Tk6= trustedr) 13 thenTk← trustedr;
14 trusted changed ← true;
INPUT
15 whiletrue do
16 if(T06= f irst b positions in alive) 17 thenT0← f irst b positions of alive;
18 trusted changed ← true;
OUTPUT
19 whiletrue do
20 leader ← min(candidates);
Figure 4:Ω pseudo-code at a process pi
5 Concluding remarks
In this paper we gave an implementation of an eventual leader abstraction in an infinite arrival model. The implemen- tation is based on tracking a property of processes, their age, which makes possible to eventually distinguish from good and bad processes. Actually, the lower-levelHB∗algorithm gives some weak hints on the set of good processes as (1) the lists of processes deemed to be up will never be permanently purged from bad processes and (2) good processes are not ordered in the same way at each process. The first point is inherent to the model, however the second point rises from the particular algorithm used. The specification of the oracle does not imply a total order, anyway. It is the upper level algorithm that fixes these problems by eventually letting know a particular subset of correct processes
totally ordered to all good process. This algorithm uses a lower bound on correct processes that is at least the majority of the maximum number of processes that can be simultaneously up. However, what set of minimal assumptions is needed to implementΩ in this model has not been demonstrated and it will be the subject of our future work.
References
[1] M. Aguilera, W. Chen, and S. Toueg. Heartbeat: a Timeout-Free Failure Detector for Quiescent Reliable Communication.
Technical Report 97-1631, Ithaca, NY, USA, 1997.
[2] M. Aguilera, W. Chen, and S. Toueg. Failure Detection and Consensus in the Crash-recovery Model. Distributed Computing, 13(2):99–125, 2000.
[3] M. Aguilera, C. Delporte-Gallet, H. Fauconnier, and S. Toueg. On Implementing Omega with Weak Reliability and Synchrony Assumptions. In Proceedings of the 22nd annual symposium on Principles of Distributed Computing (PODC03), pages 306–
314, 2003.
[4] M. Aguilera, C. Delporte-Gallet, H. Fauconnier, and S. Toueg. Communication-efficient Leader Election and Consensus with Limited Link Synchrony. In Proceedings of the 23rd annual ACM symposium on Principles of Distributed Computing (PODC04), pages 328–337, 2004.
[5] R. Baldoni, M. Bertier, M. Raynal, and S. Tucci-Piergiovanni. Looking for a Definition of Dynamic Distributed Systems. In Proceedings of 9th Int. Conference on Parallel Computing Technologies (PaCT07), Springer Verlag LNCS, pages 1–14, 2007.
[6] A. Fernandez, E. Jimenez, and M. Raynal. Eventual Leader Election with Weak Assumptions on Initial Knowledge, Com- munication Reliability, and Synchrony. In in Proceedings of International Conference on Dependable Systems and Networks (DSN06), pages 166–178, 2006.
[7] E. Gafni, M. Merritt, and G. Taubenfeld. The Concurrency Hierarchy, and Algorithms for Unbounded Concurrency. In Proceedings of the 20th annual ACM symposium on Principles of Distributed Computing (PODC01), pages 161–169, 2001.
[8] D. Malkhi, F. Oprea, and L. Zhou. Omega meets Paxos: Leader Election and Stability without Eventual Timely Links. In Proceedings of the 19th International Symposium on Distributed Computing (DISC05), pages 199–213, 2005.
[9] M. Merritt and G. Taubenfeld. Computing with Infinitely Many Processes. In Proceedings of the 14th International Conference on Distributed Computing (DISC00), pages 164–178, 2000.
[10] S. Tucci-Piergiovanni and R.Baldoni. Connectivity in Eventually Quiescent Dynamic Systems. In Proceedings of 3rd Latin- American Symposium on Dependable Computing (LADC07), Springer Verlag LNCS, pages 38–56, 2007.
Appendix
HB
∗Protocol Correctness Proof
In this Section correctness proofs of theHB∗ protocol are provided. To simplify the description without losing generality, we assume the existence of a fictional global clock, whose output is the set of positive integers denoted by T. We then have a timet0= 0 ∈ T which is the time origin of our system.
For sake of simplicity and without loss of generality, it is introduced in these proofs the presence of a special process calledobserver not part of the system but running the HB∗protocol. This process may only receive heartbeats from other processes and we focus on what the protocol assures investigating what happens at the observer. The observer wakes up at some arbitrary timet ∈ T . We will denote as x.age the number of heartbeats the observer gathers from the processx ∈ Π during the computation. The life bonus computed at the observer and assigned to a processx is denoted as x.lif e bonus, while x.e age is the sum of x.age and x.lif e bonus. We will refer to the alive list at the observer simply as theAlive ordered set.
Let us now consider that in each run we have two distinct sets with empty intersection, namelyBadSet and GoodSet. Each process x ∈ GoodSet if and only if it is a good process and each process y ∈ BadSet if and only if it is a bad process.
Some basic observations:
Observation 1. From Assumptions on Clocks and Processing time and from lines 5-6, any processx ∈ GoodSet ∪ BadSet increases its age every ∆ time units.
Observation 2. In a interval of time[t1, t2] > ∆, all processes up during this interval increase their age of a number between |t2−t∆1|and|t2−t∆1|+ 1.
Observation 3. Among two processesx and y, during any run, the maximum difference between their ages (max) and minimum difference among their ages (min), is such that max-min=1.
Observation 3 comes from Observation 2 and Observation 1.
Lemma 1. ∀x ∈ Good, x.age goes to infinity as time passes by.
Proof. (sketch) From lines 5-7 a heartbeat is sent infinitely often. From link properties we have that if a heartbeat is sent infinitely often, then it is received infinitely often (line 8). This implies thatx.age, which counts the number of global number of heartbeats sent and it is piggybacked in each heartbeat (line 13), grows to infinity.
Lemma 2. There exists a point of time after which the observer receives a heartbeat from a processy ∈ Good such thaty.age will be constantly greater than other age carried by a heartbeat sent from a x ∈ Bad.
Proof. (sketch) Let us suppose by contradiction that the setS constituted by all processes x ∈ Bad such that x.age >
y.age at some point of time is infinite. Let us suppose that the process y woke up at time ty. The system starts a time t0, and by bounded concurrency it follows that the set of processesE which wake up earlier than py is finite. This implies that all processesx ∈ Bad ∩ E is finite, thus S = E ∪ X with X an infinite set of bad processes which wake up at a later time thanty, and by bounded concurrency after any timet > ty there always exists a timetx > t in which a bad process wakes up as the set which already woke up is finite at any time. Let us now notice that the number of consecutive losses on links does not exceedLmaxand the delay each message suffers from on this channel does not exceedδmax. Moreover, by Lemma 1 an infinite number of heartbeats fromy will be received by the observer and that after theithheartbeathbyi sent to the observer, a maximum ofLmaxheartbeats can be lost. Thus, after the hbyi heartbeat is received by the observer, the next heartbeat will be received after a time T (i) that cannot exceed
∆ ∗ Lmax+ δmax.
The contradiction can be expressed as follows: there always exists aT (i) such that a process pxwaking up a time txequal to the timeithheartbeathbyi is received at the observer plusδmax, is able to send a heartbeat carrying an age greater than the age carried byhbyi. The age carried by theh − th heartbeat hbhis equal toh. If the last heartbeat hbh
fromx arrived at tx, by Observation 3, the age ofx can grow at maximum inside an interval T (i) − δmaxofLmax. Thus, once thehbyi heartbeat: i > Lmaxis received by the observer at timet, any process pxwhich wakes up later thant cannot grow older than y, which contradicts the hypothesis.
Lemma 3. In any run in whichLmaxandδmaxeventually hold, the life bonus assigned to any process does not exceed an unknown bound.
Proof. (sketch) Let us assume by contradiction that the life bonus goes to infinity as time passes by for a process x ∈ GoodSet. This implies that for the process x (contextually with the reception of heartbeats from x at line 8) line 20 is executed an infinite number of times, i.e. muteness went to true infinitely often. This implies that each time line 15 is executed, there exists a processzk which results younger thanx (line 14), but that has been perceived as older thanx at some earlier point of time, since zk is part ofx.elders (line 15). Let us denote as {s0, s1, ...sk...} the infinite sequence of processes which have been perceived as older thanx during a computation, the sequence is a total order which respects the real-time order in which these processes have been found older thanx, i.e. when they have been included inx.elders at line 18. Let us note that a same process could be included more than once in x.elders.
Without loss of generality let us suppose, however, that thek − th element is the process zk and that the life bonus already assigned tozkwhen line 15 goes to true, upon the reception of a heartbeat fromx, is equal to k. Let us also suppose that each time a processzk goes into theelders set of x, thus immediately after a heartbeat from x arrives which triggers line 15. This implies that the life bonush for x when line 20 is executed will be greater than k, let’s says(k) = k + c with c > 0. Moreover by line 7 the number of lost heartbeats is included each time a heartbeat is received, then theage variable includes all heartbeats sent even though some losses could have happened (line 8).
This implies that thee age variable will be equal to the number of heartbeats sent each ∆ since the process had been waking up plus an accumulated number of heartbeats artificially added equal tos(k). By link assumptions the last heartbeat received fromx and the next one from x it could be pass at most T (i) = ∆ ∗ Lmax+ δmax. Whenx executes line 15 finding as younger process inelders the process zk′such that the life bounsh′got at line 20 is such thath′> T (i), it gathers an additional number of heartbeats such that it will look as each sent heartbeat is not lost and the transfer delay is zero since at line 8 all lost heartbeats are counted and the advantage is then renewed until the next heartbeat will be received (line 12). From this point on forx it will be counted the maximum number of heartbeats that can be gathered from a process which woke up simultaneously tox (at least one each ∆). Since by hypothesis the process that will result older thanx is zk′+1with a life bonusk′+ 1, and the contribution of the life bonus to x.e age cannot exceeds(k′ + 1), this implies that zk′+1is an older process thanx, i.e. a process which woke up before x.
The set of older processes thanx is finite. Thus, x will reach a life bonus as it has been up since the beginning at the maximum speed, let us say at the life bonusl. In this case no process z can be perceived older than x, then the sequence{s1, s2, ....sk...} is finite, which contradicts the initial hypothesis.
Theorem 1 (Accuracy). For each good processx in GoodSet permanently included in Alive, there exists a time t : ∀t′ > t ∧ ∀y ∈ BadSet ⇒ x.e age > y.e age.
Proof. (sketch) The theorem follows from Lemma 2 and Lemma 3, sincee age is the sum between age and lif e bonus.
Theorem 2 (Completeness). Each good process inGoodSet will be eventually and permanently included in Alive.
Proof. (sketch) The theorem follows from Lemma 2 and Lemma 3.
Theorem 3 (Stability). Eventually and permanently, each good process inGoodSet will occupy a fixed position in Alive.
Proof. (sketch) Let us suppose by contradiction that each timet : x, y ∈ GoodSets.t.x.e age > y.e age there exists a timet′ > t : y.e age > x.e age. This means that the number of times in which a process x is perceived older than at least one other process, previously perceived as older thanx, is infinite (line 15). By Lemma 3, however, the life bonus assigned at each process is finite. By the pseudo-code if the line 15 is executed an infinite number of times for the same processx, also the life bonus assigned to x goes to infinite (line 20), which contradicts Lemmma 3.
Ω Protocol Correctness Proof
Let us denote asℓi= {l0, l1, ...ln, ...} the sequence of leaders elected by a process pi.
Lemma 4. There exists a point of timet after which trusted changed at each good process stops to change.
Proof. (sketch) By the Completeness, Accuracy and Stability property of theHB∗oracle and the lower boundb on the number of good processes, the firstb positions of each alive list do not change infinitely often. This means that there exists a point of time after which thetrusted changed variable at pican be set to true only by the reception of a new trusted set by some neighborpj(line 11). This set is always included in the local trusted set (lines 12-13 and 6). By bounded concurrency, the number of good processes is at mostC and at least b, and the trusted set at each good process (lines 3 and 6) will include only good processes which areC at maximum. Then, after a finite number of inclusions at line 6, no new trusted set can be computed. Then, there exists a point of time after which no new trusted set can be received,trust changed remains set to f alse (line 8) and line 6 is never executed again.
Corollary 1. For each good processpi,ℓiis finite.
Proof. (sketch) From Lemma 4 it follows that line 7 is executed a finite number of times. Thus, the candidate set remains eventually the same forever and the leader will never change from this point of time on (line 20).
Let us suppose to have in the systemn ≤ C good processes and let us denote as S1, S2, ...SntheT0at each process after the time it does not change anymore. Let us denote this timetstability.
Let us define the relationtrusts as follows:
pitrusts pjif and only if:
(1)pj ∈ Sior (2) there exists a processpk: pitrusts pkandpktrusts pj.
Lemma 5. Aftertstability, there exists a timet > tstability such that eachpi.trusted will include all and only those processespj: pi7→trustspj.
Proof. (sketch) By the pseudo-code whentstability is reached, each good processpj will send infinitely many times atrusted set which includes at least Sj (line 10). This means that eachpi that will receive the message frompj
: pj ∈ Si, will eventually include those pk thatpj directly trusts as part ofSj. At this point a new trusted set is computed which includesSjand sent again to all processes in alive. The processphfor whichpi∈ Shwill eventually include all processes trusted bypi, then even processes trusted bypjand so on.
Lemma 6. For each pair of good processespi, pj, there existsk such that for each h > k lh∈ ℓi= lh∈ ℓjandlhis a good process.
Proof. Let us suppose by contradiction that given two good processes, namelypi andpj, it does not exist ah such that they will eventually elect the same process. By Corollary 1 eventually each process will stop to change leader, then we can suppose by contradiction that the last leaders elected, namelyl1andl2are such thatl16= l2. This implies thatl1∈ pi.candidates ∧ l1 6∈ pj.candidates. By the pseudo-code when l1is inserted inpk.candidates, it is also inserted inpi.trusted and sent to current alive processes (lines 6-7 and 10). Since l1is the last leader elected, it will be part ofpi.candidates forever and then it will part of pi.trusted forever. Thus, let us suppose, without loss of generality thattstability has been reached when the elections ofl1andl2happen. By the majority assumpion we have that a processpitrusts direclty inpi.T0a majority of processes.
Case 1. Maximal number of setsSiequal each other. By the majority assumption the maximal number of setsSithat can be equal each other is at most a majority. In this case a majority of processes, let us say{p1, p2, ...p(n+1)/2} recip- rocally trust each other and theirtrusted variable will eventually include only M = {p1, p2, ...p(n+1)/2} excluding the minority of processesm = {[p(n+1)/2]+1,...pn} by Lemma 5. Moreover, the sets Sjof the minority have to include at least one process that belongs toM .
case(1a): pi, pj ∈ m. In this case no processes in Sican trustl1, the non-inclusion ofl1coming from some process inSiwill eventually arrive atpithat will execute line 7 excludingl1from candidates. Thenpicannot electl1which leads to the absurdl16= l1.
case(2a):pi∈ M, pj∈ m. In this case pitrusts a processl1together with all processes inM . pihas at least one trust inM as every process in m. This means that eventually every process in m has in the trusted variable the process l1
by Lemma 5. This implies that eventuallyl1∈ ph.trusted ∧ l1∈ pi, ∀pi∈ m ∪ M . This implies that l1is in included inpj.candidates which leads to the absurd l1= l26= l2.
Case 2. Minimal number of sets equal each other. In this case we have that
∀ Si, Sj⇒ Si6= Sj (1) The majority assumption implies that
∀ pi, pj ∈ Good ⇒ Si∩ Sj 6= 0 (2)
(1), (2) and Lemma 5 imply that each good process is eventually part of each trusted set for alln processes. This implies thatl1is in included inpj.candidates which leads to the absurd l1= l26= l2.
Case 3. For all other configurations of{S1, ....Sn} we can reduce to the previous two cases.
Theorem 4. Eventually and permanently all good processes will elect the same good process.
Proof. The theorem immediately follows from Corollary 1 and Lemma 6.