Total Order Communications over Asynchronous
Distributed Systems: Specifications and
Implementations
∗Roberto Baldoni, Stefano Cimmino, Carlo Marchetti Dipartimento di Informatica e Sistemistica
Universit`a di Roma “La Sapienza”
Via Salaria 113, 00198, Roma, Italy
email: {baldoni,cimmino,marchet}@dis.uniroma1.it
Abstract
During the last two decades the design and development of total order (TO) communications has been one of the main research topics in dependable distributed computing. The huge amount of research work has produced several TO specifications and a wide variety of TO implementations with different guarantees whose differences are often left hidden or unclear. This paper presents a systematic classification of six distinct TO specifications based on a well-defined formal frame- work. The classification allows us (i) to define in a formal way the differences among the behaviors of faulty and correct processes admit- ted by each specification, and (ii) to derive a methodology that enables the classification of TO implementations with respect to their enforced specification. The paper also discusses the impact of TO specifications on the design of application logics. The methodology is then used to formally study the properties of eight variations of TO implementa- tions based on a fixed sequencer given in a well-known context, namely primary component group communication systems.
∗This work is partially supported by a grant from EU IST Project “EU-PUBLI.COM” (#IST- 2001-35217), and by a grant from MIUR in the context of project “MAIS”.
Contents
1 Introduction 6
2 System model 8
3 Total order properties 9
3.1 Validity and Integrity . . . 10
3.2 The Agreement property . . . 10
3.3 The Order property . . . 11
4 A hierarchy of total order specifications 14 4.1 On the behavior of faulty processes . . . 17
4.2 Associating implementations to specifications . . . 18
4.3 Impact of TO specifications on the application logic . . . 19
5 Primary component group communication systems 21 5.1 Reference architecture and the virtual synchrony program- ming model . . . 21
5.2 Static vs. dynamic group communications. . . 23
6 TO implementations in group communication systems 24 6.1 Fixed sequencer protocols . . . 24
6.2 A generic fixed sequencer protocol . . . 25
6.2.1 BB fixed sequencer protocol . . . 27
6.2.2 SB fixed sequencer protocol . . . 28
6.2.3 AB fixed sequencer protocol . . . 29
6.3 Enforced TO specification . . . 30
6.3.1 Preliminary lemmas . . . 30
6.3.2 Fixed sequencer protocols enforce at least T O(N U A,- W N U T O) . . . 33
6.3.3 Fixed sequencer protocols using only U Rcast enforce T O(U A, SU T O) . . . 35
6.3.4 Specifications enforced by BB fixed sequencer protocols 36 6.3.5 Specifications enforced by SB fixed sequencer protocols 38 6.3.6 Specifications enforced by AB fixed sequencer protocols 38 7 Performance analysis 39 7.1 Group communication toolkits . . . 39
7.2 Experimental settings. . . 41
7.3 Experimental results . . . 42
8 Conclusion and related work 44
Appendix A: Formal analysis of the behavior of faulty pro-
cesses 48
Appendix B: Privilege-based protocols 50
Appendix C: Group toolkit configuration 53
1 Introduction
Since Lamport’s seminal paper [24], the problem of totally ordered (TO) communications has been extensively studied in the literature. Intuitively, a TO primitive ensures that processes of a message-passing distributed sys- tem deliver the same sequence of messages. Researchers pointed out several application scenarios in which the use of TO communications is extremely useful, e.g maintaining consistency among the internal states of a set of de- terministic replicas (active software replication [31]), as well as when facing the problem of delivering stock quote information in the same order to a set of stock operators while preserving fairness of exchanges [12].
During the last two decades more than sixty TO implementations have been designed [17]. Typically, they ensure that, in the absence of failures, processes deliver the same set of messages in the same order. In contrast, distinct TO implementations may behave in different manners in the pres- ence of failures. As an example, some TO implementations admit faulty processes to deliver messages in an order different from correct ones. This aspect (and especially its impact on applications) still deserves careful anal- ysis. Let us consider an application logic based on active replication with strong replica consistency requirements [20]. To preserve safety (i.e., strong replica consistency), it is necessary that all replicas (faulty or not) deliver the same sequence of messages. Therefore the occurrence of wrong message ordering at a faulty replica would violate the application’s safety if not ap- propriately handled by the application logic.1 Unfortunately, the analysis of the TO implementation’s behavior under failure scenarios is often under- estimated, especially by practitioners. Furthermore, even if one wishes to undertake such analysis, he/she has to face the problem of understanding a complex (or even missing) specification, e.g. [9, 23]. This problem is also reinforced by the fact that specifications are sometimes given using different ad-hoc formalisms, which are difficult to compare among them, e.g. [19, 23].
Despite several valuable works have tried to gather the existing literature into a uniform comprehensive framework (e.g. [15, 17]), the problem of clarifying in a simple and easy to understand way both the meaning of TO specifications and their impact on applications has not been addressed in a satisfactory way.
This paper aims to fill this need. Using a formal framework, based on first order logic as specification language, we present and describe in a sys- tematic way eight existing TO specifications. Some of this specifications appeared in the literature, while others have been derived from our for- mal analysis. Besides the formalization of TO specifications, the novelty
1In this case it is likely to face the paradox that a practitioner exploits a TO imple- mentation to preserve the correctness of its application and to increase availability despite failures, whereas this implementation causes a violation of the application’s safety exactly upon the occurrence of a replica failure.
of the analysis lies in the explicit description of failure scenarios admitted by each specification, in terms of the possible behavior of faulty processes.
This allows a deep understanding of the impact of each specification on applications and enables to identify the most suitable TO specification to support the implementation of a given application. As an example, we dis- cuss the impact of TO specifications on applications aiming to provide one copy serializability on a replicated database [7]. Furthermore, the paper presents a methodology that allows us to fit existing unspecified (or speci- fied with an ad-hoc formalism) TO implementations into the classification framework. The methodology is then applied to analyze the implementation of TO primitives in a particular context, namely primary component group communications [15, 29], which have proven to be an effective paradigm for implementing fault-tolerant distributed applications. In particular, we first introduce a reference architecture based on the virtual synchrony program- ming model that abstracts group toolkit’s implementation details. Then we analyze some of the most widely used classes of TO implementations, namely fixed sequencer protocols and privilege-based protocols, by formally proving their enforced specifications. Let us note that, despite the popu- larity of these protocols, they have been rarely formalized and/or proved to enforce a precise TO specification. Finally, the paper presents a perfor- mance comparison of several TO implementations given in the context of group communications. The evaluation aims to shed the light on the differ- ences in terms of performance between distinct TO implementations, which stem from the implementation of distinct TO specifications. To this end, we briefly describe the group toolkits chosen for the comparison, placing their TO implementations within the specification hierarchy, and then we present the experimental results. The performance comparison confirms the intuitive claim that TO implementations enforcing weaker TO specifications performs better than those enforcing stronger ones.
The remainder of this paper is organized as follows. Section 2 presents the system model. Section 3 reports a detailed study of the properties defining the TO problem. Section 4 (i) introduces a hierarchy of TO specifi- cations, (ii) examines differences among the behaviors of correct and faulty processes admitted by each specification (Appendix A presents the corre- sponding formal analysis), (iii) presents a methodology to classify TO im- plementations, and finally (iv) discusses their impact on application logics in the context of database replication. Section 5 introduces group commu- nication systems, and Section 6 analyzes TO implementations based on a fixed sequencer, formally identifying their enforced specifications (Appendix B discusses privilege-based protocols). Finally, Section 7 reports the per- formance analysis (Appendix C describes the configuration of the group toolkits used within the experiments), and Section 8 concludes the paper also discussing related work.
2 System model
The specifications and properties that we introduce in the following sections are based on the asynchronous distributed system model described below.
Expressing properties assuming such a system model allows us to apply them to a wide class of synchronous, asynchronous, and partially synchronous real distributed systems in which a TO implementation is developed. We do not consider systems admitting malicious behaviors of processes and/or channels.
Asynchronous distributed system. We consider a system composed by a finite set of processes Π = {p1. . . pn} communicating by message pass- ing. Each process behaves according to its specification until it possibly crashes. A process that never crashes is correct, while a process that even- tually crashes is faulty. The system is asynchronous, i.e. there is no bound known or unknown on message transfer delays and on processes’ relative speed. In order to broadcast a message m, a process invokes the TOcast(m) primitive. Upon receiving a message m, the underlying layer of a process invokes the TOdeliver(m) primitive, which is an upcall used to deliver m to the process.
Histories and runs. Each process p ∈ Π can experience the occurrence of three kinds of events, namely T Ocast(m), T Odeliver(m) and crash. An history hp is the sequence of events occurred at p during its lifetime. We denote as ei ∈ hp the i-th event in the history of p. Note that crash may only occur as the last event in the history of a faulty process. A system run is a set of histories hpi, one for each process pi∈ Π. We denote as R the set of all possible runs in the system. To characterize runs in R, we introduce the predicates defined in Table 1.
correct(p) , crash /∈ hp
f aulty(p) , crash ∈ hp
tocast(p, m) , T Ocast(m) ∈ hp
todel(p, m) , T Odeliver(m) ∈ hp
todel(p, m) < todel(p, m0) , ∃i, j ei= T Odeliver(m) ∈ hp∧ ej= T Odeliver(m0) ∈ hp∧ i < j
Table 1: Shorthand predicates
Properties and specifications. A property P on R is a first order logic predicate based on the shorthand predicates given in Table 1, and thus defining a set of runs RP ⊆ R associated to P . More precisely, a run r ∈ R is admitted by P , i.e. r ∈ RP, iff r satisfies P . Let P and P0 be two properties on R, then P → P0 iff RP ⊂ RP0. If P → P0, we say that P is stronger than P0, and that P0 is weaker than P . Note that P → P0 iff P ⇒ P0∧ ¬(P0 ⇒ P ), where ⇒ is the logical implication.
A specification S(P1. . . Pm) (with m ≥ 1) on R is a predicate on R
composed by the conjunction of m properties, i.e. S =V
i=1...mPi, defining a set of runs RS = T
i=1,...mRPi ⊆ R. RS is composed by all system runs satisfying S. More precisely, a run r ∈ R is admitted by a specification S(P1. . . Pm), i.e. r ∈ RS = T
i=1,...mRPi, iff r satisfies S. Given two specifications S(P1. . . Pm) and S0(P10. . . P`0), S is stronger than S0, denoted S → S0, iff RS ⊂ RS0. In this case we also say that S0 is weaker than S. Note that the → relation is transitive, i.e. if S1 → S2 ∧ S2 → S3 then S1 → S3. Finally, two specifications S and S0 are equivalent, denoted S ≡ S0, iff RS ≡ RS0.
3 Total order properties
To the best of our knowledge, the first coherent description of TO specifica- tions is due to Hadzilacos and Toueg, dating back to 1993 [21]. Since then, total order broadcast has been usually specified by means of four proper- ties, namely Validity, Integrity, Agreement, and Order. Informally speaking, a Validity property guarantees that messages sent by correct processes are eventually delivered at least by correct processes; an Integrity property guar- antees that no spurious or duplicate messages are delivered; an Agreement property ensures that (at least correct) processes deliver the same set of messages; an Order property constrains (at least correct) processes deliver- ing the same messages to deliver them in the same order. Each property can be formally defined in distinct ways, thus generating distinct specifications.
A typical example of differing formulations of a property of a TO specifica- tion is given by its uniform and non-uniform versions. A uniform property imposes some restrictions on the histories of (at least) correct processes on the basis of some events occurred in the histories of some processes (correct or not). In contrast, a non-uniform property imposes some restrictions on the histories of correct processes on the basis of some events occurred in the histories of some correct processes. As a consequence, given a property P , its uniform formulation U P turns out to be stronger than its non-uniform formulation N U P . Therefore, U P → N U P , and RU P ⊂ RN U P. It is worth noting that uniform properties are meaningful only in certain environments.
For instance, uniform properties are not enforceable assuming malicious fault models.
In the following sections we introduce and discuss several formulations of the properties, giving their definitions and highlighting differences in their admitted system runs. For each property we give both the formal definition (see Table 2) and a description in natural language. Concerning the latter, we say that a process p ∈ Π tocasts a message m iff tocast(p, m), i.e. iff T Ocast(m) ∈ hp. Analogously, we say that a process p ∈ Π todelivers a message m iff todel(p, m), i.e. iff T Odeliver(m) ∈ hp (see Table 1).
Validity and Integrity properties N U V , ∀p∀m tocast(p, m) ∧ correct(p) ⇒ todel(p, m) U I , ∀m∀p ei= T Odeliver(m) ∈ hp⇒
(∃q tocast(q, m) ∧ ∀ej∈ hpei= ej⇔ i = j) Agreement properties
U A , ∀p∀m todel(p, m) ⇒ (∀q correct(q) ⇒ todel(q, m))
N U A , ∀p∀m correct(p) ∧ todel(p, m) ⇒ (∀q correct(q) ⇒ todel(q, m)) Order properties
SU T O , ∀p∀m, m0 todel(p, m) < todel(p, m0) ⇒ (∀q todel(q, m0) ⇒ todel(q, m) < todel(q, m0))
SN U T O , ∀p∀m, m0 correct(p) ∧ todel(p, m) < todel(p, m0) ⇒ (∀q correct(q) ∧ todel(q, m0) ⇒ todel(q, m) < todel(q, m0))
W U T O , ∀p, q∀m, m0 todel(p, m) ∧ todel(p, m0) ∧ todel(q, m) ∧ todel(q, m0) ⇒ (todel(p, m) < todel(p, m0) ⇔ todel(q, m) < todel(q, m0))
W N U T O , ∀p, q∀m, m0 correct(p) ∧ correct(q) ∧ todel(p, m) ∧ todel(p, m0)
∧todel(q, m) ∧ todel(q, m0) ⇒ (todel(p, m) < todel(p, m0) ⇔ todel(q, m) < todel(q, m0))
Table 2: Formal definition of the properties defining TO specifications
3.1 Validity and Integrity
Avoiding unlikely assumptions, it is not possible to guarantee that a faulty process eventually delivers messages it sent. Therefore, we only consider the non-uniform version of the Validity property, defined as follows:
Non-uniform Validity (N U V ). If a correct process tocasts a message m, then it eventually todelivers m.
Concerning the Integrity property, assuming a crash fault model enables the enforcement of its uniform version without additional overhead. Hence, we only deal with Uniform Integrity, defined as follows:
Uniform Integrity (U I). For any message m, every process p todelivers m at most once, and only if m was previously tocast by some process.
3.2 The Agreement property
An Agreement property imposes some constraints on the sets of messages delivered by (at least correct) processes. This property is usually specified according to one of the following formulations:
Uniform Agreement (U A). If a process todelivers a message m, then all correct processes eventually todeliver m;
Non-uniform Agreement (N U A). If a correct process todelivers a mes- sage m, then all correct processes eventually todeliver m.
m4 p1
p2
p3
m2 m4
m1
m1
m3 m3
m3 m4 m1
m2
(a) A run in RU A
m4 p1
p2
p3
m2 m4
m1
m1
m3 m3
m3 m4 m1
m2
m5
(b) A run in RN U A− RU A
Figure 1: Differences between U A and N U A
Both formulations constrain all correct processes to deliver the same set of messages. The difference between the two properties lies in the restrictions imposed on the set of messages delivered by faulty processes. Let us first consider U A. This property imposes that each message delivered by some process (correct or not) is also delivered by each correct process. In contrast, faulty processes are allowed to skip2the delivery of some messages delivered by some other process. This implies that in any run satisfying U A each faulty process delivers a subset of the messages delivered by correct processes.
Figure 1(a) depicts a run satisfying U A.3
Differently from U A, N U A admits faulty processes to deliver messages that are not delivered by any other process, e.g. message m5 delivered by p3 in Figure 1(b). Therefore, in any run satisfying N U A, the set of messages delivered by each faulty process only intersects the set of messages delivered by correct processes, being the intersection possibly empty.
Let us note that a communication primitive satisfying an Agreement property along with N U V and U I is usually called reliable broadcast [21].
3.3 The Order property
All formulations of the Order property force correct processes to deliver messages in the same order, but differ for (i) the restrictions imposed on deliveries made by faulty processes, and (ii) the way in which the order of message deliveries is defined. In particular, the latter feature gives rise to strong and weak Order properties.
2In this paper we use the term skip to denote the fact that a process p misses to deliver a message m which is instead delivered by some other process. This can occur, for example, because the sender of m crashes after m has reached only a subset of its destinations, not including p.
3Considering only distinct formulations of Agreement and Order properties allows us to focus on message deliveries and process crashes as the events characterizing system runs. As a consequence, the figures of this section only contain these events.
Strong vs. weak Order properties. A weak Order property defines a total order by requiring the same order of delivery for each pair of mes- sages delivered by two distinct processes. This restriction does not prevent a process p to skip the delivery of some messages. Therefore, it allows the occurrence of gaps in the sequence of messages delivered by p with respect to those delivered by other processes. In contrast, a strong Order property avoids gaps in the sequence of delivered messages as it requires that two processes delivering a message m have delivered exactly the same ordered sequence of messages up to m.
Combining uniform and non-uniform with strong and weak formula- tions, we obtain four Order properties, namely Strong Uniform Total Order (SU T O), Strong Non-uniform Total Order (SN U T O), Weak Uniform To- tal Order (W U T O) and Weak Non-uniform Total Order (W N U T O). The remainder of this section analyzes differences and implications among these four formulations.
Strong Uniform Total Order (SU T O). If some process todelivers some message m before message m0, then a process todelivers m0 only after it has todelivered m.
p1
p2
p3
m2 m2
m2 m1
m1
m1 m4
m3 m3
m6 m6
m5
(a) A run in RSU T O
p1
p2
p3
m2 m2 m1
m1
m1 m4
m3 m3
m7 m6
m5
(b) A run in RW U T O − RSU T O
Figure 2: Differences between SU T O and W U T O
SU T O imposes that if some processes deliver the same messages, they all deliver these messages in the same order. However, if any two processes deliver distinct messages as the i − th delivered message, they are forced to successively deliver only distinct messages. This is the case of processes p1, p2, and p3of Figure 2(a), in which before the dashed line the sets of delivered messages coincide, and thus their ordering. Then p1 and p2 deliver m3 that is skipped by p3. As a consequence, the following messages delivered by p3 must be disjoint from those delivered by p1 and p2.
Weak Uniform Total Order (W U T O). If processes p and q both deliver messages m and m0, then p delivers m before m0if and only if q delivers m before m0.
The following Lemma shows that W U T O is weaker than SU T O, i.e.
SU T O → W U T O.
Lemma 1. SU T O → W U T O
Proof. Assume by contradiction that RSU T O− RW U T O 6= ∅ and let r be a run such that r ∈ RSU T O∧ r /∈ RW U T O. In order to violate W U T O, there must exist two processes p, q and two messages m, m0in r such that p delivers m before m0 while q delivers m0 before m. However, this violates SU T O, which is a contradiction. As a consequence RSU T O ⊆ RW U T O. Furthermore, Figure 2(b) depicts a run in RW U T O−RSU T O. Therefore RSU T O ⊂ RW U T O, i.e. SU T O → W U T O.
Being a weak Order property, W U T O allows the occurrence of gaps in the sequence of messages delivered by distinct processes. As an example, Figure 2(b) depicts a run in RW U T O− RSU T O in which p2 delivers m3 after having skipped m2, while p1 delivers both m2 and m3 in this order.
p1
p2
p3
m1 m2
m2 m1
m1
m2 m4
m3 m3
m7 m6
m6
(a) A run in RSN U T O− RSU T O
p1
p2
p3
m1 m2 m1
m1
m2 m4
m3 m3
m7 m6
m5
(b) A run in RW N U T O − RW U T O
Figure 3: Differences between SN U T O and SU T O and between W N U T O and W U T O
Strong Non-uniform Total Order (SN U T O). If some correct process todelivers some message m before message m0, then a correct process todelivers m0 only after it has todelivered m.
SN U T O is the non-uniform counterpart of SU T O. Hence, due to the relation between uniform and non-uniform properties, SU T O → SN U T O.
The difference between these two properties lies in the behavior of faulty processes. In fact, SN U T O allows faulty processes to deliver an arbitrary set of messages and to deliver them in an arbitrary order (e.g. process p3 in Figure 3(a)). In contrast, correct processes are forced to agree on a prefix of the ordered set of delivered messages. Furthermore, any two correct processes delivering distinct messages as the i − th delivered message has to successively deliver only disjoint set of messages (see processes p1 and p2 in Figure 3(a)).
Weak Non-uniform Total Order (W N U T O). If correct processes p and q both todeliver messages m and m0, then p todelivers m before m0 if and only if q todelivers m before m0.
W N U T O is the non-uniform counterpart of W U T O. Hence W U T O → W N U T O. Furthermore, in a way similar to Lemma 1, it is possible to show that SN U T O → W N U T O. As SN U T O, W N U T O allows faulty processes to deliver messages in arbitrary order (see Figure 3(b)). Moreover, processes (correct or not) are free to deliver distinct sets of messages, i.e. W N U T O allows the occurrence of gaps in the sequence of message deliveries.
SUTO
WUTO SNUTO
WNUTO
Figure 4: Relations among Order properties
Figure 4 summarizes the previous discussion by pointing out relations holding among Order properties.
4 A hierarchy of total order specifications
Assuming Non-Uniform Validity and Uniform Integrity, it is possible to combine Agreement and Order properties to obtain eight TO specifications.
We denote T O(A, O) the TO specification S(N U V, U I, A, O), where A ∈ {U A, N U A} and O ∈ {SU T O, SN U T O, W U T O, W N U T O}.
Lemma 2 and Corollary 1 below prove that T O(N U A, SN U T O) ≡ T O(N U A, W N U T O) and that T O(U A, SN U T O) ≡ T O(U A, W N U T O).
Lemma 2. T O(N U A, SN U T O) ≡ T O(N U A, W N U T O) Proof.
1. (⊆) By contradiction. Let r be a run such that r ∈ RSN U T O∩ RN U A∩ RN U V∩RU I∧r /∈ RW N U T O∩RN U A∩RN U V∩RU I. From r ∈ RSN U T O∩ RN U A∩ RN U V∩ RU I, it follows that r must satisfy N U A ∧ N U V ∧ U I.
As a consequence r violates W N U T O, i.e. there exists two correct processes p, q and two messages m, m0 such that p delivers m before m0 in r while q delivers m0 before m in r. However, this violates SN U T O. Contradiction.
2. (⊇) By contradiction. Let r be a run such that r ∈ RW N U T O∩RN U A∩ RN U V∩RU I∧r /∈ RSN U T O∩RN U A∩RN U V∩RU I. From r ∈ RW N U T O∩
RN U A∩ RN U V∩ RUI, it follows that r must satisfy N U A ∧ N U V ∧ U I.
As a consequence, it violates SN U T O. Suppose that p, q are two correct processes and m, m0 are two messages in r, and that p delivers m before m0. Since r violates SN U T O, we have the following two cases:
(a) q delivers m0 without delivering m. This contradicts the hypoth- esis that r satisfies N U A.
(b) q delivers m after delivering m0. This contradicts the hypothesis that r satisfies W N U T O.
Corollary 1. T O(U A, SN U T O) ≡ T O(U A, W N U T O) Proof.
1. (⊆) Let r be a run such that r ∈ RSN U T O∩ RU A∩ RN U V∩ RU I. Since U A → N U A, r ∈ RU A ⇒ r ∈ RN U A. Therefore, r ∈ RSN U T O ∩ RN U A∩ RN U V ∩ RU I. From Lemma 2, it follows that r ∈ RW N U T O∩ RN U A ∩ RN U V ∩ RU I. As a consequence r ∈ RW N U T O, and thus r ∈ RW N U T O∩ RU A∩ RN U V ∩ RU I.
2. (⊇) Let r be a run such that r ∈ RW N U T O∩ RU A ∩ RN U V ∩ RU I. Since U A → N U A, r ∈ RU A ⇒ r ∈ RN U A. Therefore, r ∈ RW N U T O∩ RN U A∩ RN U V ∩ RU I. From Lemma 2, it follows that r ∈ RSN U T O∩ RN U A ∩ RN U V ∩ RU I. As a consequence r ∈ RSN U T O, and thus r ∈ RSN U T O∩ RU A∩ RN U V ∩ RU I.
These results can be intuitively explained noting that W N U T O and SN U T O differ for the restrictions they impose on the set of messages deliv- ered by correct processes, not on the order of deliveries. As a consequence, they give rise to equivalent specifications when combined with an Agreement property, which forces correct processes to deliver the same set of messages.
Therefore we can restrict our attention to six (out of the expected eight) significant TO specifications, considering O ∈ {SU T O, W U T O, W N U T O}.
It is possible to exploit the precedence relations derived for Order and Agreement properties to identify relations among TO specifications. To this aim, it is first worth noting that given two specifications differing only for the definition of the Agreement property, i.e. T O(A, O) and T O(A0, O), then T O(A, O) → T O(A0, O) iff A → A0 and RT O(A0,O)− RT O(A,O)6= ∅. In this case any run r ∈ RT O(A0,O)− RT O(A,O) satisfies ¬A, i.e. ¬A characterizes all runs in RT O(A0,O)− RT O(A,O). The same reasoning applies to specifications differing only for their Order property. As an example, observing that U A → N U A and that the run depicted in Figure 6(b) belongs to RT O(N U A,SU T O)− RT O(U A,SU T O), it follows that T O(U A, SU T O) → T O(N U A, SU T O). Fur- ther relations among specifications are due to the transitivity of the → relation. For example, T O(U A, SU T O) → T O(N U A, W N U T O).
TO(UA,SUTO)
TO(NUA,SUTO) TO(UA,WUTO)
TO(NUA,WUTO)
TO(NUA,WNUTO) TO(UA,WNUTO)
TO(UA,SUTO)
TO(NUA,SUTO) TO(UA,WUTO)
TO(NUA,WUTO)
TO(NUA,WNUTO) TO(UA,WNUTO)
¬UA
¬SUTO
¬WUTO
¬WUTO
¬UA
¬UA
¬SUTO
Figure 5: A hierarchy of TO specifications
Figure 5 depicts the transitive reduction of the → relation among TO specifications. This reduction represents a hierarchy of specifications. Note that any two specifications S and S0 connected by an edge in the hierarchy differ only for distinct formulations of a single property, say P and P0, such that P → P0. The edge is then labelled with the predicate ¬P (see Table 3 for its definition) that can be used to discriminate runs belonging to RS0− RS.
¬U A , ∃p, q∃m todel(p, m) ∧ ¬todel(q, m) ∧ correct(q)
¬SU T O , ∃p, q∃m, m0 todel(p, m) < todel(p, m0) ∧ del(q, m0) ∧ (¬del(q, m)∨
todel(q, m0) < todel(q, m))
¬W U T O , ∃p, q∃m, m0 todel(p, m) < todel(p, m0) ∧ todel(q, m0) < todel(q, m)
Table 3: Predicates associated with direct relations in the hierarchy Let us note that the root of the hierarchy, i.e. T O(U A, SU T O), is the specification closest to the intuitive notion of total order broadcast, as it imposes that the set of messages delivered by each process is a prefix of the ordered set of messages that is delivered by all correct processes. This is due to the interaction between SU T O and U A. To explain this, let us consider a faulty process skipping the delivery of a message delivered by correct processes, e.g. p3 omitting m3 in Figure 6(a). Due to SU T O, the messages delivered by p3 after skipping m3 have not to be delivered by processes that deliver m3. Now suppose that p3 delivers a new message m after having skipped m3 and before crashing. To let the run satisfy U A, m has to be delivered by correct processes. However, in that case the run would violate SU T O. As a consequence p3 may not deliver any message after skipping the delivery of a message that is delivered by some other process. This is sufficient to ensure that all processes deliver all messages in the same order before crashing. In contrast, weaker specifications admit runs in which faulty
processes are allowed to exhibit a wider set of behaviors, as discussed in the following section. It is worth noting that weaker specifications are those implemented by several real systems, e.g. Ensemble [22], JavaGroups [5].
4.1 On the behavior of faulty processes
Each TO specification constrains all correct processes to deliver exactly the same ordered set of messages. On the contrary, each specification poses its own restrictions on the ordered set of messages that can be delivered by each faulty process. Differences among the sequences of delivered messages can be characterized using the following execution patterns.
EP1: a faulty process p delivers a prefix of the ordered set of messages delivered by correct processes;
EP2: a faulty process p delivers some messages which are not delivered by correct processes;
EP3: a faulty process p skips the delivery of some messages de- livered by correct processes;
EP4: a faulty process p delivers some messages in an order dif- ferent from correct processes.
Each specification allows the occurrence of one or more of the above execution patterns. Moreover, from the definition of the → relation, it follows that for each pair of specifications S, S0 : S → S0, S0 allows at least all execution patterns admitted by S. For example, T O(U A, SU T O) allows EP1 while T O(U A, W U T O) allows EP1 and EP3. Table 4 shows for each specification the admitted execution patterns. Let us note that these execution patterns are formally derived from specifications (see Appendix A).
p1
p2
p3
m2 m2
m2 m1
m1
m1
m3 m3
m4 m4
(a) A run in
RT O(U A,SU T O)
p1
p2
p3
m2 m2
m2 m1
m1
m1 m4
m3 m3
m6 m6
m5
(b) A run in
RT O(N U A,SU T O) − RT O(U A,SU T O)
Figure 6: Differences between T O(U A, SU T O) and T O(N U A, SU T O)
m3 p1
p2
p3
m2
m2 m1
m1
m1
m3 m3
m4 m4 m4
(a) A run in
RT O(U A,W U T O) −
RT O(U A,SU T O)
m3 p1
p2
p3
m2 m1
m1
m1
m3 m3
m5 m4
m4
m2
m4 m6
m6
m6
(b) A run in
RT O(N U A,W U T O) − RT O(U A,W U T O)
Figure 7: Differences between T O(U A, W U T O) and T O(N U A, W U T O)
m3 p1
p2
p3
m2
m2 m1
m1 m1
m4 m3
m4
m3 m4
(a) A run in
RT O(U A,W N U T O) − RT O(U A,W U T O)
m4 p1
p2
p3
m2 m1
m1
m1
m3
m3 m4
m2 m3 m4
m5 m6
m6
m6
(b) A run in
RT O(N U A,W N U T O) − RT O(U A,W N U T O)
Figure 8: Differences between T O(U A, W N U T O) and T O(N U A, W N U T O)
4.2 Associating implementations to specifications
When dealing with a TO implementation with a complex or loose speci- fication (e.g. [9, 23]) it could be difficult to identify its guarantees. This problem can be solved by classifying the TO implementation with respect to the hierarchy of Figure 5 and then inspecting Table 4 to identify the ad- mitted behavior for faulty processes. To this end, let us first introduce the following definition.
Definition. Let I be a TO implementation and let RI be the set of runs that I can generate. I enforces a TO specification S iff :
1. RI ⊆ RS, and
2. ∀S0 S0 → S ⇒ RI * RS0.
Therefore the problem of finding the TO specification enforced by a TO implementation I boils down to find a TO specification S defining the small- est superset of RI. It is worth noting that, given two specifications S and S0
TO specification Admitted differences in the histories of Example processes p (faulty) and q (correct)
T O(U A, SU T O) EP1 Figure 6(a)
T O(U A, W U T O) EP1 or EP3 Figure 7(a)
T O(U A, W N U T O) EP1 or EP3 or EP4 Figure 8(a)
T O(N U A, SU T O) EP1 or EP2 Figure 6(b)
T O(N U A, W U T O) EP1 or EP2 or EP3 Figure 7(b) T O(N U A, W N U T O) EP1 or EP2 or EP3 or EP4 Figure 8(b)
Table 4: Possible differences between the behavior of faulty and correct processes
connected by an edge labelled with ¬P in the hierarchy of Figure 5, a suffi- cient condition to state that I does not implement S but possibly implements S0 is that there exists a run that can be generated by I which satisfies ¬P . As an example, consider T O(U A, SU T O) → T O(N U A, SU T O). Verifying that a run r ∈ RIsatisfies ¬U A, which is the label of the edge between these two specifications in Figure 5, allows us to state that RI * RT O(U A,SU T O), but possibly RI ⊆ RT O(N U A,SU T O).
A methodology to find the implemented specification S is to navigate the hierarchy according to the following rules:
1. verify that I is a TO implementation, i.e. RI ⊆ RT O(NU A,W N U T O). In the affirmative, go to step 2, otherwise I is not a TO implementation;
2. set S to the hierarchy top node, i.e. T O(U A, SU T O);
3. while there exists an unchecked specification S0 such that S → S0 belongs to the transitive reduction of → depicted in Figure 5, check if there exists a run r ∈ RI satisfying the associated predicate ¬P . In the affirmative, repeat this step setting current node to S0. Otherwise go to step 4;
4. the current node S represents the specification implemented by I.
4.3 Impact of TO specifications on the application logic When selecting a TO primitive to build an application, one has to carefully take into account the specification it enforces. Each specification can be useful, provided that the application logic is able to preserve correctness despite the possible behavior of faulty processes. From this point of view, the hierarchy of Figure 5 can induce a tradeoff between the guarantees of a TO primitive and the complexity of the application logic: the stronger is the specification enforced by the TO primitive, the simpler the application logic may result.
To illustrate this tradeoff, let us consider the following example: guar- anteeing one copy serializability [7] on a replicated database following the framework proposed by Wiesmann et al. in [32]. In the context of deter- ministic transactions4, to enforce one copy serializability it suffices that (i) all replicas serialize transactions in the same order and (ii) they execute the same set of transactions. To this end, each time a database replica re- ceives a transaction t from a client, it broadcasts t to all replicas using a TO primitive; t is then executed upon delivery by all replicas. Assuming that replicas serialize transactions according to the order of transaction delivery, one copy serializability can be preserved using a TO primitive compliant with T O(U A, SU T O) and letting each replica independently decide on the outcome of each transaction (either commit or abort), as replicas will decide the same outcome.
Weaker TO primitives may be used, provided that the application logic removes possible inconsistencies due faulty replicas. This can be achieved by introducing a voting phase, coordinated for simplicity by a fault tolerant software component C, which is a part of the application logic. In the following description we assume for simplicity that a majority of replicas are correct and that the communication between replicas and C happens through FIFO point-to-point reliable channels. For each specification we show how C can be designed in order to ensure one-copy serializability.
T O(N U A, SU T O). Due to the possible occurrence of execution pattern EP2 (see Table 4), a faulty replica could try to commit a transac- tion that will never be delivered by correct replicas. This is avoided by letting C decide on the outcome of each transaction. In partic- ular, when a replica is ready to commit a transaction t, it sends a prepare to commit(t) message to C. Upon receiving a prepare to com- mit(t) message from a majority of database replicas, C reliably broad- casts a message commit(t) to all replicas, which in turn commit the transaction.
T O(U A, W U T O) or T O(N U A, W U T O). Table 4 shows that in these cases faulty replicas could also skip the delivery of some transactions due to the occurrence of execution pattern EP3. If the previous C im- plementation is used, this pattern generates a violation of one copy serializability due to the fact that a faulty replica executes a subset of the transactions executed by correct replicas. Therefore, the voting phase has to be enhanced by letting C broadcast a commit(t) message for a transaction t only once it has received a prepare to commit(t) message from all non-crashed database replicas. This requires C to on-the-fly monitor the set of non-crashed replicas.
4As pointed out in [32], assuming nondeterministic transactions requires to run more complex protocols that are out of the scope of this example.
T O(U A, W N U T O) or T O(N U A, W N U T O). In these cases, faulty repli- cas could also deliver, and thus serialize, transactions in different or- ders due to the execution pattern EP4 (see Table 4). Therefore C has to also handle the possible arrival of out of order prepare to commit messages which reflect the different local serialization order of each replica.
5 Primary component group communication sys-
tems
In this section we briefly introduce the main features of primary component group communication systems, which are by far one of the most successful class of systems implementing TO primitives.
5.1 Reference architecture and the virtual synchrony pro- gramming model
Group communication systems adopt several distinct architectures [26]. For the sake of clarity, in the remainder of this paper we use a simplified archi- tecture depicted in Figure 9(a), in which a Total Order layer implements a TO specification by relying on another layer, namely VSC, which provides virtually synchronous communications [8].5 According to the virtual syn- chrony programming model, processes are organized into groups. Groups are dynamic, i.e. processes are allowed to join and voluntarily leave a group using appropriate primitives. Furthermore, faulty processes are excluded by groups after crashing. A group membership service provides each process of a group with a consistent view vi composed by the identifiers of all non- crashed processes currently belonging to the group. Upon a membership change, processes agree on a new view through a view change protocol. At the end of this protocol, group members are provided with a view vi+1 that (i) is delivered to all the members of vi+1through a view change event, and (ii) contains the identifier of all the members that deliver vi+1 (see Figure 9(b)). We assume a primary component membership service, e.g. [11], guar- anteeing that all members of the same group observe the same sequence of views as long as they stay in the group. In this context, the VSC layer guarantees (i) that membership changes of a group occur in the same order in all the members that stay within the group, and (ii) that membership changes are totally ordered with respect to all messages sent by members.
It is worth noting that the primary component membership service is not implementable in a non-blocking manner in asynchronous systems [13].6
5Let us remark that other approaches incorporating the implementation of Order and Agreement properties into a single protocol are possible, e.g. [14].
VSC Total Order
Unreliable network Application
TOcast TOdeliver
[U]Rcast [U]Rdeliver
Rsend Rreceive viewchange
(a) Architecture
p1
p2
p4
(faulty) End of view change. The
new view is delivered vi= {p1,p2,p3,p4} vi +1= {p1,p2,p3}
p3
Failure detected.
Start view change view change
protocol m1
m2 m1
m1
m1
m2 m1
m2
m2
m3 m3
m3
m3 m2
(b) The Virtual Synchrony property
Figure 9: Group communication systems
To formalize these concepts, we let the VSC layer export primitives of the membership service, namely view change, as well as of a set of com- munication primitives, namely Rsend(m) and Rreceive(m) (for point-to- point communications), Rcast(m) and Rdeliver(m) (for non-uniform reli- able broadcast communications), and U Rcast(m) and U Rdeliver(m) (for uniform reliable broadcast communications). In order to parameterize TO protocols with respect to the employed reliable broadcast primitive, we use [U ]Rcast(m) and [U ]Rdeliver(m) to denote the invocation of any reliable broadcast primitive, uniform or not.
These primitives satisfy the following properties:
VSC1. If a correct process p Rsends a message m to a correct process q, then q eventually Rreceives m.
VSC2. For each message m, a process p Rreceives m at most once, and only if m was Rsent to p by some process q.
These properties basically state that point-to-point communications am- ong processes occur through quasi-reliable channels [6].
Concerning [U ]Rcast(m) and [U ]Rdeliver(m), these primitives satisfy the following properties [15, 33]:
VSC3. If a correct process p [U]Rcasts a message m, then it eventually [U]Rdelivers m.
VSC4. For each message m, each process [U]Rdelivers m at most once, and only if m was [U]Rcast by some process.
6In partitionable systems, groups may partition into subgroups (or components), e.g.
due to network failures, and members of distinct subgroups can deliver distinct sequences of views. In this setting, specifying a total order primitive can drive to complex specifica- tions, e.g. [15], whose usefulness has still to be verified [30]. However, non-blocking imple- mentations of partitionable group communications systems are feasible in asynchronous systems.
VSC5. If a process (resp. a correct process) p URdelivers (resp. Rdelivers) m in view v, then all processes which are either correct or deliver a view change event in v URdeliver (resp. Rdeliver) m.
VSC6. If a process p [U]Rdelivers a message m in view v and a process q [U]Rcasts m in view v0 then v = v0.
These properties guarantee that processes agree on the set of messages delivered before installing a new view, thus enforcing virtual synchrony. Fur- thermore, property VSC6 (named also Sending View Delivery [15]), guar- antees that messages are delivered in the view they were sent.
5.2 Static vs. dynamic group communications.
Following the model proposed by Hadzilacos and Toueg in [21], the prop- erties introduced in Section 3 are based on a system model that does not take process joins into account. In contrast, group communication systems typically support dynamic groups, which drive to more complex specifica- tions, e.g. [15, 30]. We now show how the TO specifications introduced in Section 4 can be used to classify also dynamic TO implementations. To this aim, we introduce the notion of static sub-run, i.e. a portion of the overall computation of a system supporting dynamic groups in which join events may only appear at the beginning of the sub-run. Considering the compu- tation depicted in Figure 10, this run can be decomposed in three static sub-runs, namely sr1, sr2, sr3. It is important to note that a sub-run can be described with events and process histories as those introduced in Section 2, i.e. T Ocast(m), T Odeliver(m), and crash. As an example the sub-run sr2 depicted in Figure 10 is composed by the histories of processes p1, p2 and p4 containing the message delivery events of m4 and m5. Moreover, p1 is correct in sr1 and sr2 while it is faulty in sr3.
In this dynamic context, the answer to the question “which TO speci- fication is enforced by an implementation I?”, can be given by considering the set of all static sub-runs (instead of the set of runs) that I can produce as the set RI used in the methodology described in Section 4.2.
As an example, in the run depicted in Figure 10, sub-runs sr1 and sr2 satisfy T O(U A, SU T O), while sr3 only satisfies T O(N U A, SU T O) (due to p5 delivering message m9). Therefore, an implementation I that may gen- erate this run enforces at most T O(N U A, SU T O) (i.e. I does not enforce T O(U A, SU T O)).