Total Order Communications over Asynchronous Distributed Systems: Speciﬁcations and Implementations

(1)

Total Order Communications over Asynchronous

Distributed Systems: Specifications and

Implementations

^∗

Roberto Baldoni, Stefano Cimmino, Carlo Marchetti Dipartimento di Informatica e Sistemistica

Universit`a di Roma “La Sapienza”

Via Salaria 113, 00198, Roma, Italy

email: {baldoni,cimmino,marchet}@dis.uniroma1.it

Abstract

During the last two decades the design and development of total order (TO) communications has been one of the main research topics in dependable distributed computing. The huge amount of research work has produced several TO specifications and a wide variety of TO implementations with different guarantees whose differences are often left hidden or unclear. This paper presents a systematic classification of six distinct TO specifications based on a well-defined formal framework. The classification allows us (i) to define in a formal way the differences among the behaviors of faulty and correct processes admitted by each specification, and (ii) to derive a methodology that enables the classification of TO implementations with respect to their enforced specification. The paper also discusses the impact of TO specifications on the design of application logics. The methodology is then used to formally study the properties of eight variations of TO implementations based on a fixed sequencer given in a well-known context, namely primary component group communication systems.

∗This work is partially supported by a grant from EU IST Project “EU-PUBLI.COM” (#IST- 2001-35217), and by a grant from MIUR in the context of project “MAIS”.

(2)

1 Introduction

Since Lamport’s seminal paper [24], the problem of totally ordered (TO) communications has been extensively studied in the literature. Intuitively, a TO primitive ensures that processes of a message-passing distributed system deliver the same sequence of messages. Researchers pointed out several application scenarios in which the use of TO communications is extremely useful, e.g maintaining consistency among the internal states of a set of de- terministic replicas (active software replication [31]), as well as when facing the problem of delivering stock quote information in the same order to a set of stock operators while preserving fairness of exchanges [12].

During the last two decades more than sixty TO implementations have been designed [17]. Typically, they ensure that, in the absence of failures, processes deliver the same set of messages in the same order. In contrast, distinct TO implementations may behave in different manners in the pres- ence of failures. As an example, some TO implementations admit faulty processes to deliver messages in an order different from correct ones. This aspect (and especially its impact on applications) still deserves careful analysis. Let us consider an application logic based on active replication with strong replica consistency requirements [20]. To preserve safety (i.e., strong replica consistency), it is necessary that all replicas (faulty or not) deliver the same sequence of messages. Therefore the occurrence of wrong message ordering at a faulty replica would violate the application’s safety if not ap- propriately handled by the application logic.¹ Unfortunately, the analysis of the TO implementation’s behavior under failure scenarios is often under- estimated, especially by practitioners. Furthermore, even if one wishes to undertake such analysis, he/she has to face the problem of understanding a complex (or even missing) specification, e.g. [9, 23]. This problem is also reinforced by the fact that specifications are sometimes given using different ad-hoc formalisms, which are difficult to compare among them, e.g. [19, 23].

Despite several valuable works have tried to gather the existing literature into a uniform comprehensive framework (e.g. [15, 17]), the problem of clarifying in a simple and easy to understand way both the meaning of TO specifications and their impact on applications has not been addressed in a satisfactory way.

This paper aims to fill this need. Using a formal framework, based on first order logic as specification language, we present and describe in a systematic way eight existing TO specifications. Some of this specifications appeared in the literature, while others have been derived from our formal analysis. Besides the formalization of TO specifications, the novelty

1In this case it is likely to face the paradox that a practitioner exploits a TO implementation to preserve the correctness of its application and to increase availability despite failures, whereas this implementation causes a violation of the application’s safety exactly upon the occurrence of a replica failure.

(5)

of the analysis lies in the explicit description of failure scenarios admitted by each specification, in terms of the possible behavior of faulty processes.

This allows a deep understanding of the impact of each specification on applications and enables to identify the most suitable TO specification to support the implementation of a given application. As an example, we discuss the impact of TO specifications on applications aiming to provide one copy serializability on a replicated database [7]. Furthermore, the paper presents a methodology that allows us to fit existing unspecified (or specified with an ad-hoc formalism) TO implementations into the classification framework. The methodology is then applied to analyze the implementation of TO primitives in a particular context, namely primary component group communications [15, 29], which have proven to be an effective paradigm for implementing fault-tolerant distributed applications. In particular, we first introduce a reference architecture based on the virtual synchrony programming model that abstracts group toolkit’s implementation details. Then we analyze some of the most widely used classes of TO implementations, namely fixed sequencer protocols and privilege-based protocols, by formally proving their enforced specifications. Let us note that, despite the popu- larity of these protocols, they have been rarely formalized and/or proved to enforce a precise TO specification. Finally, the paper presents a performance comparison of several TO implementations given in the context of group communications. The evaluation aims to shed the light on the differences in terms of performance between distinct TO implementations, which stem from the implementation of distinct TO specifications. To this end, we briefly describe the group toolkits chosen for the comparison, placing their TO implementations within the specification hierarchy, and then we present the experimental results. The performance comparison confirms the intuitive claim that TO implementations enforcing weaker TO specifications performs better than those enforcing stronger ones.

The remainder of this paper is organized as follows. Section 2 presents the system model. Section 3 reports a detailed study of the properties defining the TO problem. Section 4 (i) introduces a hierarchy of TO specifications, (ii) examines differences among the behaviors of correct and faulty processes admitted by each specification (Appendix A presents the corre- sponding formal analysis), (iii) presents a methodology to classify TO implementations, and finally (iv) discusses their impact on application logics in the context of database replication. Section 5 introduces group communication systems, and Section 6 analyzes TO implementations based on a fixed sequencer, formally identifying their enforced specifications (Appendix B discusses privilege-based protocols). Finally, Section 7 reports the performance analysis (Appendix C describes the configuration of the group toolkits used within the experiments), and Section 8 concludes the paper also discussing related work.

(6)

2 System model

The specifications and properties that we introduce in the following sections are based on the asynchronous distributed system model described below.

Expressing properties assuming such a system model allows us to apply them to a wide class of synchronous, asynchronous, and partially synchronous real distributed systems in which a TO implementation is developed. We do not consider systems admitting malicious behaviors of processes and/or channels.

Asynchronous distributed system. We consider a system composed by a finite set of processes Π = {p₁. . . p_n} communicating by message passing. Each process behaves according to its specification until it possibly crashes. A process that never crashes is correct, while a process that even- tually crashes is faulty. The system is asynchronous, i.e. there is no bound known or unknown on message transfer delays and on processes’ relative speed. In order to broadcast a message m, a process invokes the TOcast(m) primitive. Upon receiving a message m, the underlying layer of a process invokes the TOdeliver(m) primitive, which is an upcall used to deliver m to the process.

Histories and runs. Each process p ∈ Π can experience the occurrence of three kinds of events, namely T Ocast(m), T Odeliver(m) and crash. An history h_p is the sequence of events occurred at p during its lifetime. We denote as e_i ∈ h_p the i-th event in the history of p. Note that crash may only occur as the last event in the history of a faulty process. A system run is a set of histories h_p_i, one for each process p_i∈ Π. We denote as R the set of all possible runs in the system. To characterize runs in R, we introduce the predicates defined in Table 1.

correct(p) , crash /∈ hp

f aulty(p) , crash ∈ hp

tocast(p, m) , T Ocast(m) ∈ hp

todel(p, m) , T Odeliver(m) ∈ hp

todel(p, m) < todel(p, m⁰) , ∃i, j ei= T Odeliver(m) ∈ hp∧ ej= T Odeliver(m⁰) ∈ hp∧ i < j

Table 1: Shorthand predicates

Properties and specifications. A property P on R is a first order logic predicate based on the shorthand predicates given in Table 1, and thus defining a set of runs R_P ⊆ R associated to P . More precisely, a run r ∈ R is admitted by P , i.e. r ∈ R_P, iff r satisfies P . Let P and P⁰ be two properties on R, then P → P⁰ iff R_P ⊂ R_P⁰. If P → P⁰, we say that P is stronger than P⁰, and that P⁰ is weaker than P . Note that P → P⁰ iff P ⇒ P⁰∧ ¬(P⁰ ⇒ P ), where ⇒ is the logical implication.

A specification S(P₁. . . P_m) (with m ≥ 1) on R is a predicate on R

(7)

composed by the conjunction of m properties, i.e. S =V

i=1...mP_i, defining a set of runs R_S = T

i=1,...mR_P_i ⊆ R. R_S is composed by all system runs satisfying S. More precisely, a run r ∈ R is admitted by a specification S(P₁. . . P_m), i.e. r ∈ R_S = T

i=1,...mR_P_i, iff r satisfies S. Given two specifications S(P₁. . . P_m) and S⁰(P₁⁰. . . P_`⁰), S is stronger than S⁰, denoted S → S⁰, iff R_S ⊂ R_S⁰. In this case we also say that S⁰ is weaker than S. Note that the → relation is transitive, i.e. if S₁ → S₂ ∧ S₂ → S₃ then S₁ → S₃. Finally, two specifications S and S⁰ are equivalent, denoted S ≡ S⁰, iff R_S ≡ R_S⁰.

3 Total order properties

To the best of our knowledge, the first coherent description of TO specifications is due to Hadzilacos and Toueg, dating back to 1993 [21]. Since then, total order broadcast has been usually specified by means of four proper- ties, namely Validity, Integrity, Agreement, and Order. Informally speaking, a Validity property guarantees that messages sent by correct processes are eventually delivered at least by correct processes; an Integrity property guar- antees that no spurious or duplicate messages are delivered; an Agreement property ensures that (at least correct) processes deliver the same set of messages; an Order property constrains (at least correct) processes deliver- ing the same messages to deliver them in the same order. Each property can be formally defined in distinct ways, thus generating distinct specifications.

A typical example of differing formulations of a property of a TO specifica- tion is given by its uniform and non-uniform versions. A uniform property imposes some restrictions on the histories of (at least) correct processes on the basis of some events occurred in the histories of some processes (correct or not). In contrast, a non-uniform property imposes some restrictions on the histories of correct processes on the basis of some events occurred in the histories of some correct processes. As a consequence, given a property P , its uniform formulation U P turns out to be stronger than its non-uniform formulation N U P . Therefore, U P → N U P , and R_{U P} ⊂ R_{N U P}. It is worth noting that uniform properties are meaningful only in certain environments.

For instance, uniform properties are not enforceable assuming malicious fault models.

In the following sections we introduce and discuss several formulations of the properties, giving their definitions and highlighting differences in their admitted system runs. For each property we give both the formal definition (see Table 2) and a description in natural language. Concerning the latter, we say that a process p ∈ Π tocasts a message m iff tocast(p, m), i.e. iff T Ocast(m) ∈ h_p. Analogously, we say that a process p ∈ Π todelivers a message m iff todel(p, m), i.e. iff T Odeliver(m) ∈ h_p (see Table 1).

(8)

Validity and Integrity properties N U V , ∀p∀m tocast(p, m) ∧ correct(p) ⇒ todel(p, m) U I , ∀m∀p ei= T Odeliver(m) ∈ hp⇒

(∃q tocast(q, m) ∧ ∀ej∈ hpei= ej⇔ i = j) Agreement properties

U A , ∀p∀m todel(p, m) ⇒ (∀q correct(q) ⇒ todel(q, m))

N U A , ∀p∀m correct(p) ∧ todel(p, m) ⇒ (∀q correct(q) ⇒ todel(q, m)) Order properties

SU T O , ∀p∀m, m⁰ todel(p, m) < todel(p, m⁰) ⇒ (∀q todel(q, m⁰) ⇒ todel(q, m) < todel(q, m⁰))

SN U T O , ∀p∀m, m⁰ correct(p) ∧ todel(p, m) < todel(p, m⁰) ⇒ (∀q correct(q) ∧ todel(q, m⁰) ⇒ todel(q, m) < todel(q, m⁰))

W U T O , ∀p, q∀m, m⁰ todel(p, m) ∧ todel(p, m⁰) ∧ todel(q, m) ∧ todel(q, m⁰) ⇒ (todel(p, m) < todel(p, m⁰) ⇔ todel(q, m) < todel(q, m⁰))

W N U T O , ∀p, q∀m, m⁰ correct(p) ∧ correct(q) ∧ todel(p, m) ∧ todel(p, m⁰)

∧todel(q, m) ∧ todel(q, m⁰) ⇒ (todel(p, m) < todel(p, m⁰) ⇔ todel(q, m) < todel(q, m⁰))

Table 2: Formal definition of the properties defining TO specifications

3.1 Validity and Integrity

Avoiding unlikely assumptions, it is not possible to guarantee that a faulty process eventually delivers messages it sent. Therefore, we only consider the non-uniform version of the Validity property, defined as follows:

Non-uniform Validity (N U V ). If a correct process tocasts a message m, then it eventually todelivers m.

Concerning the Integrity property, assuming a crash fault model enables the enforcement of its uniform version without additional overhead. Hence, we only deal with Uniform Integrity, defined as follows:

Uniform Integrity (U I). For any message m, every process p todelivers m at most once, and only if m was previously tocast by some process.

3.2 The Agreement property

An Agreement property imposes some constraints on the sets of messages delivered by (at least correct) processes. This property is usually specified according to one of the following formulations:

Uniform Agreement (U A). If a process todelivers a message m, then all correct processes eventually todeliver m;

Non-uniform Agreement (N U A). If a correct process todelivers a mes- sage m, then all correct processes eventually todeliver m.

(9)

m4 p1

p2

p3

m2 m4

m1

m3 m3

m3 m4 m1

m2

(a) A run in RU A

m4 p₁

p2

p₃

m2 m4

m1

m3 m3

m3 m4 m1

m2

m5

(b) A run in RN U A− RU A

Figure 1: Differences between U A and N U A

Both formulations constrain all correct processes to deliver the same set of messages. The difference between the two properties lies in the restrictions imposed on the set of messages delivered by faulty processes. Let us first consider U A. This property imposes that each message delivered by some process (correct or not) is also delivered by each correct process. In contrast, faulty processes are allowed to skip²the delivery of some messages delivered by some other process. This implies that in any run satisfying U A each faulty process delivers a subset of the messages delivered by correct processes.

Figure 1(a) depicts a run satisfying U A.³

Differently from U A, N U A admits faulty processes to deliver messages that are not delivered by any other process, e.g. message m₅ delivered by p₃ in Figure 1(b). Therefore, in any run satisfying N U A, the set of messages delivered by each faulty process only intersects the set of messages delivered by correct processes, being the intersection possibly empty.

Let us note that a communication primitive satisfying an Agreement property along with N U V and U I is usually called reliable broadcast [21].

3.3 The Order property

All formulations of the Order property force correct processes to deliver messages in the same order, but differ for (i) the restrictions imposed on deliveries made by faulty processes, and (ii) the way in which the order of message deliveries is defined. In particular, the latter feature gives rise to strong and weak Order properties.

2In this paper we use the term skip to denote the fact that a process p misses to deliver a message m which is instead delivered by some other process. This can occur, for example, because the sender of m crashes after m has reached only a subset of its destinations, not including p.

3Considering only distinct formulations of Agreement and Order properties allows us to focus on message deliveries and process crashes as the events characterizing system runs. As a consequence, the figures of this section only contain these events.

(10)

Strong vs. weak Order properties. A weak Order property defines a total order by requiring the same order of delivery for each pair of mes- sages delivered by two distinct processes. This restriction does not prevent a process p to skip the delivery of some messages. Therefore, it allows the occurrence of gaps in the sequence of messages delivered by p with respect to those delivered by other processes. In contrast, a strong Order property avoids gaps in the sequence of delivered messages as it requires that two processes delivering a message m have delivered exactly the same ordered sequence of messages up to m.

Combining uniform and non-uniform with strong and weak formula- tions, we obtain four Order properties, namely Strong Uniform Total Order (SU T O), Strong Non-uniform Total Order (SN U T O), Weak Uniform To- tal Order (W U T O) and Weak Non-uniform Total Order (W N U T O). The remainder of this section analyzes differences and implications among these four formulations.

Strong Uniform Total Order (SU T O). If some process todelivers some message m before message m⁰, then a process todelivers m⁰ only after it has todelivered m.

p1

p2

p3

m2 m2

m₂ m1

m1

m1 m4

m₃ m3

m₆ m6

m5

(a) A run in RSU T O

p1

p₂

p3

m2 m2 m1

m₁

m1 m4

m3 m3

m7 m6

m5

(b) A run in RW U T O − RSU T O

Figure 2: Differences between SU T O and W U T O

SU T O imposes that if some processes deliver the same messages, they all deliver these messages in the same order. However, if any two processes deliver distinct messages as the i − th delivered message, they are forced to successively deliver only distinct messages. This is the case of processes p₁, p₂, and p₃of Figure 2(a), in which before the dashed line the sets of delivered messages coincide, and thus their ordering. Then p₁ and p₂ deliver m₃ that is skipped by p₃. As a consequence, the following messages delivered by p₃ must be disjoint from those delivered by p₁ and p₂.

Weak Uniform Total Order (W U T O). If processes p and q both deliver messages m and m⁰, then p delivers m before m⁰if and only if q delivers m before m⁰.

(11)

The following Lemma shows that W U T O is weaker than SU T O, i.e.

SU T O → W U T O.

Lemma 1. SU T O → W U T O

Proof. Assume by contradiction that R_{SU T O}− R_{W U T O} 6= ∅ and let r be a run such that r ∈ R_{SU T O}∧ r /∈ R_{W U T O}. In order to violate W U T O, there must exist two processes p, q and two messages m, m⁰in r such that p delivers m before m⁰ while q delivers m⁰ before m. However, this violates SU T O, which is a contradiction. As a consequence R_{SU T O} ⊆ R_{W U T O}. Furthermore, Figure 2(b) depicts a run in R_{W U T O}−R_{SU T O}. Therefore R_{SU T O} ⊂ R_{W U T O}, i.e. SU T O → W U T O.

Being a weak Order property, W U T O allows the occurrence of gaps in the sequence of messages delivered by distinct processes. As an example, Figure 2(b) depicts a run in R_{W U T O}− R_{SU T O} in which p₂ delivers m₃ after having skipped m₂, while p₁ delivers both m₂ and m₃ in this order.

p1

p2

p3

m1 m2

m2 m1

m₁

m2 m4

m3 m3

m7 m6

m6

(a) A run in RSN U T O− RSU T O

p1

p₂

p₃

m1 m2 m1

m1

m2 m4

m3 m3

m7 m6

m5

(b) A run in RW N U T O − RW U T O

Figure 3: Differences between SN U T O and SU T O and between W N U T O and W U T O

Strong Non-uniform Total Order (SN U T O). If some correct process todelivers some message m before message m⁰, then a correct process todelivers m⁰ only after it has todelivered m.

SN U T O is the non-uniform counterpart of SU T O. Hence, due to the relation between uniform and non-uniform properties, SU T O → SN U T O.

The difference between these two properties lies in the behavior of faulty processes. In fact, SN U T O allows faulty processes to deliver an arbitrary set of messages and to deliver them in an arbitrary order (e.g. process p₃ in Figure 3(a)). In contrast, correct processes are forced to agree on a prefix of the ordered set of delivered messages. Furthermore, any two correct processes delivering distinct messages as the i − th delivered message has to successively deliver only disjoint set of messages (see processes p₁ and p₂ in Figure 3(a)).

(12)

Weak Non-uniform Total Order (W N U T O). If correct processes p and q both todeliver messages m and m⁰, then p todelivers m before m⁰ if and only if q todelivers m before m⁰.

W N U T O is the non-uniform counterpart of W U T O. Hence W U T O → W N U T O. Furthermore, in a way similar to Lemma 1, it is possible to show that SN U T O → W N U T O. As SN U T O, W N U T O allows faulty processes to deliver messages in arbitrary order (see Figure 3(b)). Moreover, processes (correct or not) are free to deliver distinct sets of messages, i.e. W N U T O allows the occurrence of gaps in the sequence of message deliveries.

SUTO

WUTO SNUTO

WNUTO

Figure 4: Relations among Order properties

Figure 4 summarizes the previous discussion by pointing out relations holding among Order properties.

4 A hierarchy of total order specifications

Assuming Non-Uniform Validity and Uniform Integrity, it is possible to combine Agreement and Order properties to obtain eight TO specifications.

We denote T O(A, O) the TO specification S(N U V, U I, A, O), where A ∈ {U A, N U A} and O ∈ {SU T O, SN U T O, W U T O, W N U T O}.

Lemma 2 and Corollary 1 below prove that T O(N U A, SN U T O) ≡ T O(N U A, W N U T O) and that T O(U A, SN U T O) ≡ T O(U A, W N U T O).

Lemma 2. T O(N U A, SN U T O) ≡ T O(N U A, W N U T O) Proof.

1. (⊆) By contradiction. Let r be a run such that r ∈ R_{SN U T O}∩ R_{N U A}∩ R_{N U V}∩R_{U I}∧r /∈ R_{W N U T O}∩R_{N U A}∩R_{N U V}∩R_{U I}. From r ∈ R_{SN U T O}∩ R_{N U A}∩ R_{N U V}∩ R_{U I}, it follows that r must satisfy N U A ∧ N U V ∧ U I.

As a consequence r violates W N U T O, i.e. there exists two correct processes p, q and two messages m, m⁰ such that p delivers m before m⁰ in r while q delivers m⁰ before m in r. However, this violates SN U T O. Contradiction.

2. (⊇) By contradiction. Let r be a run such that r ∈ R_{W N U T O}∩R_{N U A}∩ R_{N U V}∩R_{U I}∧r /∈ R_{SN U T O}∩R_{N U A}∩R_{N U V}∩R_{U I}. From r ∈ R_{W N U T O}∩

(13)

R_{N U A}∩ R_{N U V}∩ R_UI, it follows that r must satisfy N U A ∧ N U V ∧ U I.

As a consequence, it violates SN U T O. Suppose that p, q are two correct processes and m, m⁰ are two messages in r, and that p delivers m before m⁰. Since r violates SN U T O, we have the following two cases:

(a) q delivers m⁰ without delivering m. This contradicts the hypoth- esis that r satisfies N U A.

(b) q delivers m after delivering m⁰. This contradicts the hypothesis that r satisfies W N U T O.

Corollary 1. T O(U A, SN U T O) ≡ T O(U A, W N U T O) Proof.

1. (⊆) Let r be a run such that r ∈ R_{SN U T O}∩ R_{U A}∩ R_{N U V}∩ R_{U I}. Since U A → N U A, r ∈ R_{U A} ⇒ r ∈ R_{N U A}. Therefore, r ∈ R_{SN U T O} ∩ R_{N U A}∩ R_{N U V} ∩ R_{U I}. From Lemma 2, it follows that r ∈ R_{W N U T O}∩ R_{N U A} ∩ R_{N U V} ∩ R_{U I}. As a consequence r ∈ R_{W N U T O}, and thus r ∈ R_{W N U T O}∩ R_{U A}∩ R_{N U V} ∩ R_{U I}.

2. (⊇) Let r be a run such that r ∈ R_{W N U T O}∩ R_{U A} ∩ R_{N U V} ∩ R_{U I}. Since U A → N U A, r ∈ R_{U A} ⇒ r ∈ R_{N U A}. Therefore, r ∈ R_{W N U T O}∩ R_{N U A}∩ R_{N U V} ∩ R_{U I}. From Lemma 2, it follows that r ∈ R_{SN U T O}∩ R_{N U A} ∩ R_{N U V} ∩ R_{U I}. As a consequence r ∈ R_{SN U T O}, and thus r ∈ R_{SN U T O}∩ R_{U A}∩ R_{N U V} ∩ R_{U I}.

These results can be intuitively explained noting that W N U T O and SN U T O differ for the restrictions they impose on the set of messages deliv- ered by correct processes, not on the order of deliveries. As a consequence, they give rise to equivalent specifications when combined with an Agreement property, which forces correct processes to deliver the same set of messages.

Therefore we can restrict our attention to six (out of the expected eight) significant TO specifications, considering O ∈ {SU T O, W U T O, W N U T O}.

It is possible to exploit the precedence relations derived for Order and Agreement properties to identify relations among TO specifications. To this aim, it is first worth noting that given two specifications differing only for the definition of the Agreement property, i.e. T O(A, O) and T O(A⁰, O), then T O(A, O) → T O(A⁰, O) iff A → A⁰ and R_{T O(A}0,O)− R_{T O(A,O)}6= ∅. In this case any run r ∈ R_{T O(A}⁰_,O)− R_{T O(A,O)} satisfies ¬A, i.e. ¬A characterizes all runs in R_{T O(A}⁰_,O)− R_{T O(A,O)}. The same reasoning applies to specifications differing only for their Order property. As an example, observing that U A → N U A and that the run depicted in Figure 6(b) belongs to RT O(N U A,SU T O)− RT O(U A,SU T O), it follows that T O(U A, SU T O) → T O(N U A, SU T O). Fur- ther relations among specifications are due to the transitivity of the → relation. For example, T O(U A, SU T O) → T O(N U A, W N U T O).

(14)

TO(UA,SUTO)

TO(NUA,SUTO) TO(UA,WUTO)

TO(NUA,WUTO)

TO(NUA,WNUTO) TO(UA,WNUTO)

TO(UA,SUTO)

TO(NUA,SUTO) TO(UA,WUTO)

TO(NUA,WUTO)

TO(NUA,WNUTO) TO(UA,WNUTO)

¬UA

¬SUTO

¬WUTO

¬UA

¬SUTO

Figure 5: A hierarchy of TO specifications

Figure 5 depicts the transitive reduction of the → relation among TO specifications. This reduction represents a hierarchy of specifications. Note that any two specifications S and S⁰ connected by an edge in the hierarchy differ only for distinct formulations of a single property, say P and P⁰, such that P → P⁰. The edge is then labelled with the predicate ¬P (see Table 3 for its definition) that can be used to discriminate runs belonging to R_S⁰− R_S.

¬U A , ∃p, q∃m todel(p, m) ∧ ¬todel(q, m) ∧ correct(q)

¬SU T O , ∃p, q∃m, m⁰ todel(p, m) < todel(p, m⁰) ∧ del(q, m⁰) ∧ (¬del(q, m)∨

todel(q, m⁰) < todel(q, m))

¬W U T O , ∃p, q∃m, m⁰ todel(p, m) < todel(p, m⁰) ∧ todel(q, m⁰) < todel(q, m)

Table 3: Predicates associated with direct relations in the hierarchy Let us note that the root of the hierarchy, i.e. T O(U A, SU T O), is the specification closest to the intuitive notion of total order broadcast, as it imposes that the set of messages delivered by each process is a prefix of the ordered set of messages that is delivered by all correct processes. This is due to the interaction between SU T O and U A. To explain this, let us consider a faulty process skipping the delivery of a message delivered by correct processes, e.g. p₃ omitting m₃ in Figure 6(a). Due to SU T O, the messages delivered by p₃ after skipping m₃ have not to be delivered by processes that deliver m₃. Now suppose that p₃ delivers a new message m after having skipped m₃ and before crashing. To let the run satisfy U A, m has to be delivered by correct processes. However, in that case the run would violate SU T O. As a consequence p₃ may not deliver any message after skipping the delivery of a message that is delivered by some other process. This is sufficient to ensure that all processes deliver all messages in the same order before crashing. In contrast, weaker specifications admit runs in which faulty

(15)

processes are allowed to exhibit a wider set of behaviors, as discussed in the following section. It is worth noting that weaker specifications are those implemented by several real systems, e.g. Ensemble [22], JavaGroups [5].

4.1 On the behavior of faulty processes

Each TO specification constrains all correct processes to deliver exactly the same ordered set of messages. On the contrary, each specification poses its own restrictions on the ordered set of messages that can be delivered by each faulty process. Differences among the sequences of delivered messages can be characterized using the following execution patterns.

EP1: a faulty process p delivers a prefix of the ordered set of messages delivered by correct processes;

EP2: a faulty process p delivers some messages which are not delivered by correct processes;

EP3: a faulty process p skips the delivery of some messages de- livered by correct processes;

EP4: a faulty process p delivers some messages in an order dif- ferent from correct processes.

Each specification allows the occurrence of one or more of the above execution patterns. Moreover, from the definition of the → relation, it follows that for each pair of specifications S, S⁰ : S → S⁰, S⁰ allows at least all execution patterns admitted by S. For example, T O(U A, SU T O) allows EP1 while T O(U A, W U T O) allows EP1 and EP3. Table 4 shows for each specification the admitted execution patterns. Let us note that these execution patterns are formally derived from specifications (see Appendix A).

p1

p₂

p₃

m₂ m₂

m₂ m1

m1

m₃ m3

m₄ m4

(a) A run in

RT O(U A,SU T O)

p1

p2

p3

m2 m2

m₂ m1

m1

m1 m4

m₃ m3

m₆ m6

m5

(b) A run in

RT O(N U A,SU T O) − RT O(U A,SU T O)

Figure 6: Differences between T O(U A, SU T O) and T O(N U A, SU T O)

(16)

m3 p₁

p2

p₃

m2

m2 m1

m1

m3 m3

m4 m4 m4

(a) A run in

RT O(U A,W U T O) −

RT O(U A,SU T O)

m₃ p1

p2

p3

m2 m1

m₁

m1

m3 m3

m₅ m4

m4

m₂

m4 m6

m6

(b) A run in

RT O(N U A,W U T O) − RT O(U A,W U T O)

Figure 7: Differences between T O(U A, W U T O) and T O(N U A, W U T O)

m₃ p1

p2

p₃

m2

m2 m1

m1 m1

m4 m3

m₄

m3 m4

(a) A run in

RT O(U A,W N U T O) − RT O(U A,W U T O)

m₄ p1

p2

p3

m2 m₁

m1

m3

m₃ m4

m2 m3 m4

m₅ m6

m₆

m6

(b) A run in

RT O(N U A,W N U T O) − RT O(U A,W N U T O)

Figure 8: Differences between T O(U A, W N U T O) and T O(N U A, W N U T O)

4.2 Associating implementations to specifications

When dealing with a TO implementation with a complex or loose specification (e.g. [9, 23]) it could be difficult to identify its guarantees. This problem can be solved by classifying the TO implementation with respect to the hierarchy of Figure 5 and then inspecting Table 4 to identify the admitted behavior for faulty processes. To this end, let us first introduce the following definition.

Definition. Let I be a TO implementation and let R_I be the set of runs that I can generate. I enforces a TO specification S iff :

1. R_I ⊆ R_S, and

2. ∀S⁰ S⁰ → S ⇒ R_I * R_S⁰.

Therefore the problem of finding the TO specification enforced by a TO implementation I boils down to find a TO specification S defining the small- est superset of R_I. It is worth noting that, given two specifications S and S⁰

(17)

TO specification Admitted differences in the histories of Example processes p (faulty) and q (correct)

T O(U A, SU T O) EP1 Figure 6(a)

T O(U A, W U T O) EP1 or EP3 Figure 7(a)

T O(U A, W N U T O) EP1 or EP3 or EP4 Figure 8(a)

T O(N U A, SU T O) EP1 or EP2 Figure 6(b)

T O(N U A, W U T O) EP1 or EP2 or EP3 Figure 7(b) T O(N U A, W N U T O) EP1 or EP2 or EP3 or EP4 Figure 8(b)

Table 4: Possible differences between the behavior of faulty and correct processes

connected by an edge labelled with ¬P in the hierarchy of Figure 5, a suffi- cient condition to state that I does not implement S but possibly implements S⁰ is that there exists a run that can be generated by I which satisfies ¬P . As an example, consider T O(U A, SU T O) → T O(N U A, SU T O). Verifying that a run r ∈ R_Isatisfies ¬U A, which is the label of the edge between these two specifications in Figure 5, allows us to state that R_I * RT O(U A,SU T O), but possibly R_I ⊆ RT O(N U A,SU T O).

A methodology to find the implemented specification S is to navigate the hierarchy according to the following rules:

1. verify that I is a TO implementation, i.e. R_I ⊆ RT O(NU A,W N U T O). In the affirmative, go to step 2, otherwise I is not a TO implementation;

2. set S to the hierarchy top node, i.e. T O(U A, SU T O);

3. while there exists an unchecked specification S⁰ such that S → S⁰ belongs to the transitive reduction of → depicted in Figure 5, check if there exists a run r ∈ R_I satisfying the associated predicate ¬P . In the affirmative, repeat this step setting current node to S⁰. Otherwise go to step 4;

4. the current node S represents the specification implemented by I.

4.3 Impact of TO specifications on the application logic When selecting a TO primitive to build an application, one has to carefully take into account the specification it enforces. Each specification can be useful, provided that the application logic is able to preserve correctness despite the possible behavior of faulty processes. From this point of view, the hierarchy of Figure 5 can induce a tradeoff between the guarantees of a TO primitive and the complexity of the application logic: the stronger is the specification enforced by the TO primitive, the simpler the application logic may result.

(18)

To illustrate this tradeoff, let us consider the following example: guar- anteeing one copy serializability [7] on a replicated database following the framework proposed by Wiesmann et al. in [32]. In the context of deter- ministic transactions⁴, to enforce one copy serializability it suffices that (i) all replicas serialize transactions in the same order and (ii) they execute the same set of transactions. To this end, each time a database replica re- ceives a transaction t from a client, it broadcasts t to all replicas using a TO primitive; t is then executed upon delivery by all replicas. Assuming that replicas serialize transactions according to the order of transaction delivery, one copy serializability can be preserved using a TO primitive compliant with T O(U A, SU T O) and letting each replica independently decide on the outcome of each transaction (either commit or abort), as replicas will decide the same outcome.

Weaker TO primitives may be used, provided that the application logic removes possible inconsistencies due faulty replicas. This can be achieved by introducing a voting phase, coordinated for simplicity by a fault tolerant software component C, which is a part of the application logic. In the following description we assume for simplicity that a majority of replicas are correct and that the communication between replicas and C happens through FIFO point-to-point reliable channels. For each specification we show how C can be designed in order to ensure one-copy serializability.

T O(N U A, SU T O). Due to the possible occurrence of execution pattern EP2 (see Table 4), a faulty replica could try to commit a transaction that will never be delivered by correct replicas. This is avoided by letting C decide on the outcome of each transaction. In partic- ular, when a replica is ready to commit a transaction t, it sends a prepare to commit(t) message to C. Upon receiving a prepare to com- mit(t) message from a majority of database replicas, C reliably broad- casts a message commit(t) to all replicas, which in turn commit the transaction.

T O(U A, W U T O) or T O(N U A, W U T O). Table 4 shows that in these cases faulty replicas could also skip the delivery of some transactions due to the occurrence of execution pattern EP3. If the previous C im- plementation is used, this pattern generates a violation of one copy serializability due to the fact that a faulty replica executes a subset of the transactions executed by correct replicas. Therefore, the voting phase has to be enhanced by letting C broadcast a commit(t) message for a transaction t only once it has received a prepare to commit(t) message from all non-crashed database replicas. This requires C to on-the-fly monitor the set of non-crashed replicas.

4As pointed out in [32], assuming nondeterministic transactions requires to run more complex protocols that are out of the scope of this example.

(19)

T O(U A, W N U T O) or T O(N U A, W N U T O). In these cases, faulty repli- cas could also deliver, and thus serialize, transactions in different or- ders due to the execution pattern EP4 (see Table 4). Therefore C has to also handle the possible arrival of out of order prepare to commit messages which reflect the different local serialization order of each replica.

5 Primary component group communication sys-

tems

In this section we briefly introduce the main features of primary component group communication systems, which are by far one of the most successful class of systems implementing TO primitives.

5.1 Reference architecture and the virtual synchrony programming model

Group communication systems adopt several distinct architectures [26]. For the sake of clarity, in the remainder of this paper we use a simplified architecture depicted in Figure 9(a), in which a Total Order layer implements a TO specification by relying on another layer, namely VSC, which provides virtually synchronous communications [8].⁵ According to the virtual synchrony programming model, processes are organized into groups. Groups are dynamic, i.e. processes are allowed to join and voluntarily leave a group using appropriate primitives. Furthermore, faulty processes are excluded by groups after crashing. A group membership service provides each process of a group with a consistent view v_i composed by the identifiers of all non- crashed processes currently belonging to the group. Upon a membership change, processes agree on a new view through a view change protocol. At the end of this protocol, group members are provided with a view v_i+1 that (i) is delivered to all the members of v_i+1through a view change event, and (ii) contains the identifier of all the members that deliver v_i+1 (see Figure 9(b)). We assume a primary component membership service, e.g. [11], guar- anteeing that all members of the same group observe the same sequence of views as long as they stay in the group. In this context, the VSC layer guarantees (i) that membership changes of a group occur in the same order in all the members that stay within the group, and (ii) that membership changes are totally ordered with respect to all messages sent by members.

It is worth noting that the primary component membership service is not implementable in a non-blocking manner in asynchronous systems [13].⁶

5Let us remark that other approaches incorporating the implementation of Order and Agreement properties into a single protocol are possible, e.g. [14].

(20)

VSC Total Order

Unreliable network Application

TOcast TOdeliver

[U]Rcast [U]Rdeliver

Rsend Rreceive viewchange

(a) Architecture

p1

p₂

p₄

(faulty) End of view change. The

new view is delivered vi= {p1,p2,p3,p4} vi +1= {p1,p2,p3}

p3

Failure detected.

Start view change view change

protocol m1

m2 m1

m1

m2 m1

m2

m3 m3

m3

m3 m2

(b) The Virtual Synchrony property

Figure 9: Group communication systems

To formalize these concepts, we let the VSC layer export primitives of the membership service, namely view change, as well as of a set of com- munication primitives, namely Rsend(m) and Rreceive(m) (for point-to- point communications), Rcast(m) and Rdeliver(m) (for non-uniform reli- able broadcast communications), and U Rcast(m) and U Rdeliver(m) (for uniform reliable broadcast communications). In order to parameterize TO protocols with respect to the employed reliable broadcast primitive, we use [U ]Rcast(m) and [U ]Rdeliver(m) to denote the invocation of any reliable broadcast primitive, uniform or not.

These primitives satisfy the following properties:

VSC1. If a correct process p Rsends a message m to a correct process q, then q eventually Rreceives m.

VSC2. For each message m, a process p Rreceives m at most once, and only if m was Rsent to p by some process q.

These properties basically state that point-to-point communications among processes occur through quasi-reliable channels [6].

Concerning [U ]Rcast(m) and [U ]Rdeliver(m), these primitives satisfy the following properties [15, 33]:

VSC3. If a correct process p [U]Rcasts a message m, then it eventually [U]Rdelivers m.

VSC4. For each message m, each process [U]Rdelivers m at most once, and only if m was [U]Rcast by some process.

6In partitionable systems, groups may partition into subgroups (or components), e.g.

due to network failures, and members of distinct subgroups can deliver distinct sequences of views. In this setting, specifying a total order primitive can drive to complex specifications, e.g. [15], whose usefulness has still to be verified [30]. However, non-blocking implementations of partitionable group communications systems are feasible in asynchronous systems.

(21)

VSC5. If a process (resp. a correct process) p URdelivers (resp. Rdelivers) m in view v, then all processes which are either correct or deliver a view change event in v URdeliver (resp. Rdeliver) m.

VSC6. If a process p [U]Rdelivers a message m in view v and a process q [U]Rcasts m in view v⁰ then v = v⁰.

These properties guarantee that processes agree on the set of messages delivered before installing a new view, thus enforcing virtual synchrony. Fur- thermore, property VSC6 (named also Sending View Delivery [15]), guar- antees that messages are delivered in the view they were sent.

5.2 Static vs. dynamic group communications.

Following the model proposed by Hadzilacos and Toueg in [21], the properties introduced in Section 3 are based on a system model that does not take process joins into account. In contrast, group communication systems typically support dynamic groups, which drive to more complex specifications, e.g. [15, 30]. We now show how the TO specifications introduced in Section 4 can be used to classify also dynamic TO implementations. To this aim, we introduce the notion of static sub-run, i.e. a portion of the overall computation of a system supporting dynamic groups in which join events may only appear at the beginning of the sub-run. Considering the computation depicted in Figure 10, this run can be decomposed in three static sub-runs, namely sr₁, sr₂, sr₃. It is important to note that a sub-run can be described with events and process histories as those introduced in Section 2, i.e. T Ocast(m), T Odeliver(m), and crash. As an example the sub-run sr₂ depicted in Figure 10 is composed by the histories of processes p₁, p₂ and p₄ containing the message delivery events of m₄ and m₅. Moreover, p₁ is correct in sr₁ and sr₂ while it is faulty in sr₃.

In this dynamic context, the answer to the question “which TO speci- fication is enforced by an implementation I?”, can be given by considering the set of all static sub-runs (instead of the set of runs) that I can produce as the set R_I used in the methodology described in Section 4.2.

As an example, in the run depicted in Figure 10, sub-runs sr₁ and sr₂ satisfy T O(U A, SU T O), while sr₃ only satisfies T O(N U A, SU T O) (due to p₅ delivering message m₉). Therefore, an implementation I that may gen- erate this run enforces at most T O(N U A, SU T O) (i.e. I does not enforce T O(U A, SU T O)).

Total Order Communications over Asynchronous Distributed Systems: Speciﬁcations and Implementations