
Regular Registers in Dynamic Distributed Systems with

Byzantine Processes: Bounds and Performance Analysis

Roberto Baldoni, Silvia Bonomi, Amir Soltani Nezhad Sapienza Università di Roma, Via Ariosto 25, 00185 Roma, Italy

{baldoni,bonomi}@dis.uniroma1.it amir.soltaninezhad@gmail.com

Abstract

This paper addresses the problem of building a byzantine fault tolerant storage service implemented by a set of servers that can dynamically join and leave the computation (i.e., server churn). In this challenging environment, we prove a bound for ensuring both validity of read operations (i.e., read- validity property) and the persistence of a value written by a write operation (i.e., write-persistence property). This bound correlates the churn rate, the number of faulty processes and the time taken by register operations (i.e., join, read and write operations).

Considering a specific synchronous system, we first prove the minimal number of communication steps necessary to implement each register operation; we then instantiate the previous bound to compute the minimum churn that makes it impossible to implement a regular register protocol, as a function of the communication steps and the number of faulty processes. Finally, given a specific algorithm for a regular register using the minimum number of communication steps, we provide an experimental evaluation of its behavior when churn is beyond its theoretical solvability bound. The study analyzes the number of read operations returning a valid value by varying both the churn and the number of faulty processes.

Keywords: Regular Register, Dynamic Distributed Systems, Byzantine failure, Churn, Synchronous System.


1 Introduction

Building a storage able to tolerate arbitrary failures is a central component in any mission critical system that has to ensure both correct and highly-available operation executions despite software and accidental errors as well as malicious operations. Protocols that can tolerate such failures are said to be Byzantine fault-tolerant (BFT) [15]. Many BFT storage protocols have appeared in the literature (e.g., [5, 10, 18]). Most of them are based on the replication paradigm, where a set of servers cooperates (i.e., executes a computation) in order to give clients the illusion of a unique highly available storage (e.g., [5, 18]).

In recent years, BFT storage has become a first-class abstraction also in large scale distributed systems for implementing strategic services such as directory services [9], sequencer services [11], namespaces, key management etc. These services are usually implemented by dozens of servers geographically deployed over large scale settings (such as peer-to-peer systems, interconnected datacenters etc.) that notoriously have to withstand various types, patterns, degrees and rates of churn. As an example, a Directory System in the context of cloud computing has to be deployed on top of interconnected datacenters including hundreds of thousands of servers that are constantly under churn due to changes in demand, dynamic re-provisioning of servers, switch/link failures and recoveries etc. [9]. In such environments, churn creates unpredictable patterns of servers that leave and join a computation, and no assumption can be made on the duration of a churn period or on the distance in time between two successive churn periods (i.e., churn is non-quiescent).

Churn adds a new complexity dimension to the implementation of BFT storage. As an example, during churn periods, servers can leave the computation and others can join, in order to keep the storage service formed by the same number of servers over time. While leave operations are instantaneous, a join operation takes time because a new server has to receive the correct value of the storage, say v, from a certain number of server replicas (usually f + 1 identical values, where f is the maximum number of faulty servers) before it becomes part of the server set implementing the storage. If the churn rate is too high, a new server might not be able to reach the threshold of identical values received from other replicas, compromising the correctness of the BFT storage. In other words, high churn may compromise the persistency of a write operation in the BFT storage, leading to non-valid read operations.

In this paper we address the problem of building a BFT storage based on replication of the value, specifically a regular register [14], on top of a distributed system prone to non-quiescent churn. In such a challenging environment, we provide the following results:

• Assuming register operations are live, we prove a bound for ensuring validity of read operations done on the regular register. This bound correlates the churn rate c, the number of faulty processes f and the execution time taken by register operations (i.e., join, read and write operations). This bound shows that if the execution time of operations grows, the churn tolerated by the BFT storage tends to zero.

• In the context of a synchronous system, we first prove the number of communication steps necessary and sufficient to implement each operation (i.e., join, read and write) of a generic regular register protocol; we then compute the actual validity bound for such a synchronous system. This bound matches the one found in [19] in case of no churn and the one found in [3] in case of no faulty processes.

Given a specific regular register protocol using the minimal number of communication steps (shown in the Appendix), we carry out an experimental evaluation of the protocol that analyzes the number of valid read operations while varying both the churn and the number of faulty processes. The study shows that if the workload of the BFT storage is dominated by very frequent write operations, the regular register is able to resist a churn much higher than the one computed through the theoretical bound. This is due to the fact that frequent write operations “refresh” the content of the storage, thus counteracting the loss of persistency due to the churn. The evaluation also compares the theoretical and the empirical bound for ensuring validity of read operations by varying the number of faulty processes. The smaller the number of faulty processes, the larger the gap between the theoretical and the empirical bound.

To the best of our knowledge, this is the first paper analyzing the implications of building a BFT storage under churn. Implementations of registers resilient to crash failures have indeed been provided on top of distributed systems prone to quiescent [2, 17] and continuous [3] churn respectively.

The rest of the paper is structured as follows: in Section 2 and Section 3, we introduce the system model and the regular register specification. In Section 4 we establish a validity bound for read operations.

Section 5 instantiates the validity bound in a specific synchronous distributed system. Section 6 shows the performance analysis of a BFT protocol resilient to non-quiescent churn. Finally, Section 7 and Section 8 present the related work and the concluding remarks respectively.

2 System Model

The distributed system is composed of a universe of clients Uc (i.e. the client system) and of a disjoint universe of servers Us (i.e. the server system). The client system is composed of a finite arbitrary number of processes (i.e. Uc = {c1, c2, . . . cm}) while the server system is dynamic, i.e. processes may join and leave the system at will. A server enters the server system by executing the join System procedure.

Such an operation aims at connecting the new process to both clients and servers that already belong to the system. A server leaves the system by means of the leave System operation. In order to model processes continuously arriving to and departing from the system, we assume the infinite arrival model (as defined in [21]). The set of processes that can participate in the server system, i.e. the server system population, is composed of a potentially infinite set of processes Us = {. . . si, sj, sk . . . }, each one having a unique identifier (i.e. its index). However, the server system is composed, at each time, of a finite subset of the population. Clients and servers can communicate only by exchanging messages through reliable and authenticated channels. In the following, we assume the existence of a protocol managing the arrival and the departure of servers from the distributed system; such a protocol is also responsible for maintaining connectivity among the processes belonging to the distributed system. Examples of such topologies are [12], [16], [24]. As in [18, 19, 20], we assume that clients are correct and servers can suffer arbitrary failures.

Distributed Computation. At time t0, when the server system is set up, n servers belong to the server computation. A server si, belonging to the server system, that wants to join the distributed computation has to execute the join Server() operation. This operation, invoked at some time t, is not instantaneous; how long it takes depends on the specific implementation provided for the join Server() operation. However, from time t, the server si can receive and process messages sent by any other process participating in the computation.

When a server sj participating in the distributed computation wishes to leave the computation, it executes the leave Server() operation. Without loss of generality, we assume that if a server leaves the computation and later wishes to re-join, it executes the join Server() operation again with a new identity.

Figure 1 shows the distributed system and the distributed computation layers.

It is important to notice that (i) there may exist processes belonging to the server system that never join the distributed computation (i.e. they execute the join System() procedure, but they never invoke the join Server() operation) and (ii) there may exist processes that, after leaving the distributed computation, still remain inside the distributed system (i.e. they are correct, but they stop processing messages related to the computation).

[Figure 1 omitted: it depicts clients c1 . . . cm and servers s1 . . . sn; at the system level servers invoke join_System() and leave_System(), while at the computation level servers invoke join_Server() and leave_Server() and clients invoke read() and write(v).]

Figure 1: Distributed System vs. Distributed Computation

To this aim, it is important to identify the subset of processes that are actively participating in the distributed computation.

Definition (Active Server Set) A server is active in the distributed computation from the time it returns from the join Server() operation until the time it starts executing the leave Server() operation. A(t), with |A(t)| ≤ n, denotes the set of servers that are active at time t, while A([t, t′]) denotes the set of servers that are active during the whole interval [t, t′] (i.e. si ∈ A([t, t′]) iff si ∈ A(τ) for each τ ∈ [t, t′]).

Servers that obey their specification are said to be correct. On the contrary, a faulty server can deviate arbitrarily from its specification. We assume at most f ≤ n/3 servers can fail during the whole life of the computation. It is important to note that servers know the value f, but they are not able to know the subset of Us representing the faulty processes.

Non-Quiescent Churn. The computation alternates periods of churn and periods of stability. More specifically, there exist some periods Tchurn in which servers join and leave the computation, followed by periods Tstability in which the computation becomes stable and no join or leave operations occur. However, no assumption is made about how long Tchurn and Tstability are.

During the churn periods, we assume that the churn “refreshes” a fraction of the computation participants at each time unit. More precisely, we define as churn rate, denoted c, the percentage of processes that are “refreshed” at every time unit (c ∈ [0, 1]). This means that while the number of servers remains constant (equal to n), in every time unit in Tchurn, c × n processes leave the computation and the same number of processes join the computation.

Let us finally remark that in this churn model there is no guarantee that a server remains permanently in the computation; additionally, the model is general enough to encompass both a distributed computation prone to continuous churn (i.e. there exists a time t after which churn holds forever) and one prone to quiescent churn (i.e. there exists a time t after which stability holds forever).
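As an illustration of this churn model, the worst-case size of the set of servers that stay active across an interval of churn can be sketched in a few lines of Python (the function name is ours, not from the paper):

```python
def min_active(n: int, c: float, d: int) -> float:
    """Worst-case lower bound on the number of servers that remain active
    over d time units of churn: c*n servers are refreshed per time unit,
    so at least n*(1 - d*c) of the original servers are still present."""
    return n * (1 - d * c)

# e.g. with n = 100 servers and churn rate c = 0.05, after 4 time units
# at least 100 * (1 - 0.2) = 80 of the original servers are still active.
```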

3 Regular Register

A register is a shared variable accessed by a set of processes, i.e. clients, through two operations, namely read() and write(). Informally, the write() operation updates the value stored in the shared variable while the read() operation obtains the value contained in the variable (i.e. the last written value). Every operation issued on a register is, generally, not instantaneous and it can be characterized by two events occurring at its boundary: an invocation event and a reply event. These events occur at two time instants (invocation time and reply time) according to a fictional global time.

An operation op is complete if both the invocation event and the reply event occur (i.e. the process executing the operation does not crash between the invocation and the reply).

Given two operations op and op′, with invocation times tB(op) and tB(op′) and reply times tE(op) and tE(op′), we say that op precedes op′ (op ≺ op′) iff tE(op) < tB(op′). If op does not precede op′ and op′ does not precede op, then op and op′ are concurrent (op||op′). Given a write(v) operation, the value v is said to be written when the operation is complete.
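The precedence and concurrency relations above can be captured by a small Python sketch (our illustrative encoding: an operation is just a pair of invocation and reply times on the fictional global clock):

```python
def precedes(op, op_prime):
    """op precedes op' iff op's reply event occurs before op''s invocation event."""
    t_b, t_e = op
    t_b_prime, t_e_prime = op_prime
    return t_e < t_b_prime

def concurrent(op, op_prime):
    """op || op' iff neither operation precedes the other."""
    return not precedes(op, op_prime) and not precedes(op_prime, op)
```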

Regular Register Specification. In case of concurrency while accessing the shared variable, the meaning of last written value becomes ambiguous. Depending on the semantics of the operations, three types of register have been defined by Lamport [14]: safe, regular and atomic. In this paper, we will consider a regular register which is specified as follows:

• Termination: If a correct process participating in the computation invokes an operation and does not leave the system, it eventually returns from that operation.

• Validity: A read operation returns the last value written before its invocation, or a value written by a write operation concurrent with it.

As a specialization of the generic model of the computation presented in the previous Section, we consider the existence of the following operations: the join register server operation and the leave register server operation. Concerning the departures from the computation, we model the leave register server() operation as an implicit operation; when a process pi leaves the computation, it just stops sending and processing messages related to the register computation. Moreover, to simplify the notation, whenever not strictly necessary, we refer to the join register server() operation as the join() operation.

Protocol Model. A protocol Preg implementing a regular register is a collection of distributed algorithms, one for each operation (i.e. Preg = {AJS, ALS, AR, AW} where AJS, ALS, AR, AW are respectively the distributed algorithms implementing the join Server(), the leave Server(), the read() and the write() operations). Each algorithm is composed of a sequence of computation steps and communication steps. A computation step is represented by the computation executed locally at each process, while a communication step is represented by the sending and delivering events of a message.

A register is maintained by a set of servers (possibly a subset of the ones belonging to the computation). No agreement abstraction is assumed to be available at a server. Clients do not maintain any register information; they can just trigger operations and interact with servers through message exchanges. Moreover, we assume that each server has the same role in the distributed computation (i.e. there is no special server acting as a coordinator). Let us remark that quorum-based protocols (e.g., [18]) are captured by this model, because Preg does not require that all the processes execute each operation, while Preg is not a State Machine Replication protocol (e.g., [23]) due to the lack of an agreement abstraction. For simplicity, in the following we assume a single register replicated at each server belonging to the computation.


4 Validity Bound

Consider a generic protocol Preg implementing a regular register such that (i) every operation terminates and (ii) there exists a period of churn larger than the longest operation issued on the register1. In this section, we show a bound that relates the number of faulty processes to the churn, by defining the minimum churn rate that makes it impossible for Preg to satisfy the validity property. Let us now introduce two properties that Preg has to satisfy to ensure validity:

Byzantine Resiliency. There are at least f +1 servers at any time that maintain the same value;

Write Persistency. Servers maintain the last value written by a write operation despite server departures (i.e., any value v written by a write() operation must persist in the system until a new operation overwrites v);

Concerning the first property, it has been proved in [20] that having 3f + 1 replicas of the register value is a lower bound to implement a read operation satisfying validity in an authenticated, asynchronous, byzantine-prone system with no churn. This bound also applies in a distributed system prone to churn.

Thus, given a read() operation starting at some time t and lasting ∆t time units, we need that at least 3f + 1 servers remain active during the whole period [t, t + ∆t] (i.e. |A([t, t + ∆t])| > 3f ).

In a distributed system with no churn, write-persistency is always satisfied because (i) there are no server departures and (ii) there always exist “enough” servers in the system that run the register protocol “permanently” (e.g., [2] assumes a majority of correct servers, while [18] assumes that, given f failures, the number of servers n in the system is always at least 3f + 1). In a distributed system prone to churn, write-persistency is an issue, and it has to be properly addressed by the protocols implementing the register. As an example, consider the following scenario: at time t0, all the servers belonging to the server computation store a “valid” value in their local copy of the register (i.e. the default value of the register, written by a fictional write() operation). As soon as the churn starts, processes having the value leave and new processes, with no value, arrive. Due to the absence of stable processes, over time all the servers holding a value can leave and, in the worst case, the value of the memory can be lost if it is not transferred to the new servers, causing the system to violate write-persistency. To this aim, we need to specify a property for the join() operation, suited to our system model, that represents a necessary condition for write-persistency (a formal proof of this statement can be found in Lemma 7 shown in Appendix B).

Join − Validity: A join operation terminates by obtaining the last value written before its invocation, or a value written by a write operation concurrent with it.

In the following, we compute the bounds on the churn rate that make it impossible to satisfy respectively the join-validity (Lemma 1) and the write-persistency (Lemma 2). Then we show the bound following from the need of having at least 3f + 1 servers during a read operation (Lemma 3), and finally we deduce the global bound on the churn rate by composing the previous ones (Theorem 1).

Lemma 1 Let AJS be an algorithm that implements the join register server() operation, and let ∆tj be the maximum time interval needed by AJS to terminate a join register server() operation. If c ≥ (n − 3f)/(n × ∆tj), then AJS cannot satisfy the join-validity.

Proof Without loss of generality, let us consider the first time unit of the first churn period, i.e. t0. At time t0, all the servers are active and maintain a valid value; it follows that |A(t0)| = n. Due to the definition of c, at time t0 + 1, nc servers invoke the join operation and nc servers leave; thus, |A([t0, t0 + 1])| = n − nc. Also during the second time unit, nc new processes enter the computation and replace the nc processes that left the system during that time unit. In the worst case, the nc processes that left the system are processes that were present at time t0 (i.e., they are not processes that entered the system between t0 and t0 + 1). So, we have |A([t0, t0 + 2])| ≥ n − 2nc. If we consider a period of ∆tj time units, i.e. the longest period needed to terminate a join operation, we obtain |A([t0, t0 + ∆tj])| ≥ n − ∆tj nc = n(1 − ∆tj c). Imposing that |A([t0, t0 + ∆tj])| > 3f, it follows that the inequality is satisfied only if c < (n − 3f)/(n × ∆tj), and therefore the claim follows. □Lemma 1

1 This actually corresponds to the worst case scenario for computing the validity bound.
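The threshold of Lemma 1 is easy to evaluate numerically; the following Python helper (an illustrative sketch, not part of the paper's protocol) returns the churn rate at and above which join-validity becomes impossible:

```python
def join_validity_threshold(n: int, f: int, dt_j: float) -> float:
    """Lemma 1: join-validity cannot be satisfied once
    c >= (n - 3f) / (n * dt_j)."""
    return (n - 3 * f) / (n * dt_j)

# e.g. n = 100 servers, f = 10 faulty servers, join duration dt_j = 2 time
# units: join-validity is impossible for any churn rate c >= 70/200 = 0.35.
```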

Having an algorithm that implements the join() operation by obtaining a valid value for the register is a necessary condition to ensure write-persistency, but it is not sufficient. Consider the following scenario, in which there is a write operation concurrent with a join. The writer client issues a write(v) operation at time tB and terminates it at time tE. Each server joining concurrently with the write can return either the value v or the value vold previously written in the register. If the join-validity holds, in the worst case, each of these servers terminates the join with the valid value vold. However, if the churn rate is too high, it is possible that all the servers maintaining v, i.e. the new value, leave before new servers are able to obtain such a new value, and the write-persistency of the register can be compromised. Thus, join-validity is satisfied while write-persistency is not.

Lemma 2 Let AJS be the algorithm implementing the join register server() operation satisfying join-validity, and let AW be the algorithm implementing the write() operation. Let ∆tj and ∆tw be respectively the maximum time interval needed by AJS to terminate the join operation and the maximum time needed by AW to terminate the write operation. If c ≥ (n − 3f)/(n × (∆tw + ∆tj)), then it is not possible to ensure the write-persistency.

Proof We split the proof in two cases: (i) there is no write concurrent with a join operation and (ii) there is a write concurrent with a join operation.

Case (i): we are exactly in the situation of Lemma 1 and the bound holds.

Case (ii): let us suppose by contradiction that write-persistency is satisfied and c ≥ (n − 3f)/(n × (∆tw + ∆tj)). If write-persistency is satisfied, it means that a value v written by a write(v) operation is transferred from the writer client to active servers, and then from active servers to the processes that join the computation.

Without loss of generality, let us assume that a write(v) operation is issued by a client cw at time t. Let us consider the case where t = t1 = t0 + 1 and t0 is the instant of time in which the churn starts. At time t0, all the servers are active and all the correct servers store in their local copy of the register the valid value of the register. At time t1, when the write(v) operation starts, servers start to join and leave the computation. Note that all the join() operations started in the period [t1, t1 + ∆tw] are concurrent with the write(v) operation. Therefore, due to the join-validity, in the worst case, none of them terminates returning the value v, but all return the last value written before the write(v). On the contrary, all the join() operations starting after t1 + ∆tw must return the value v. As a consequence, there must exist at least 3f + 1 servers that are active during the join() operation (otherwise the validity cannot be achieved [20]) and that have been updated by the write operation (i.e. they have been in the system from time t1 to t1 + ∆tw + ∆tj). Thus, we have that n − nc∆tw − nc∆tj > 3f; considering c ≥ (n − 3f)/(n × (∆tw + ∆tj)), we have a contradiction.

Considering both the bound found in Lemma 1 and the bound found in Case (ii), it is not possible to ensure the write-persistency if c ≥ min{(n − 3f)/(n × ∆tj), (n − 3f)/(n × (∆tj + ∆tw))} = (n − 3f)/(n × (∆tw + ∆tj)), and the claim follows. □Lemma 2

Lemma 3 Let AR be an algorithm implementing the read() operation. Let ∆tr be the maximum time interval needed by AR to terminate a read() operation. If c ≥ (n − 3f)/(n × ∆tr), then it is not possible to ensure the read-validity.


Proof Considering the bound proved in [20] and the fact that a join operation actually corresponds to a read operation to get a valid value, we can repeat the reasoning of the proof of Lemma 1 by considering the specific time taken by read operations (i.e., ∆tr), and then the claim follows. □Lemma 3

From Lemma 2 and Lemma 3, we are able to compute the following bound:

Theorem 1 Let AJS, AR and AW be the algorithms implementing respectively the join(), read() and write() operations. Let ∆tj, ∆tr and ∆tw be respectively the maximum time intervals needed by AJS to terminate the join operation, by AR to terminate the read operation and by AW to terminate the write operation. If c ≥ min{(n − 3f)/(n × ∆tr), (n − 3f)/(n × (∆tj + ∆tw))}, then it is not possible to ensure both the write-persistency and the read-validity.

Interestingly, the bound is affected by the time taken by the operations to be executed, which depends on the algorithm implementing the operations and on the synchrony assumptions of the underlying distributed system. In the case of a synchronous distributed system, the values of ∆tj, ∆tr and ∆tw are finite and thus the value of c can be calculated exactly (as we will do in the next section). Moving towards an asynchronous system, the values of ∆tj, ∆tr and ∆tw can only be estimated; thus, there is no longer a deterministic guarantee that the system works, but it is still possible to provide an estimation of the churn rate that can be tolerated by the system. It is also interesting to note how the formula confirms the impossibility result found in [3]: no algorithm can implement a regular register in the presence of non-quiescent churn on top of a fully asynchronous distributed system; as the system becomes asynchronous, ∆tj, ∆tr and ∆tw go to infinity, and then the tolerable churn goes to zero, meaning that a deterministic implementation is possible if and only if the system is prone to at most quiescent churn (as the protocol proposed in [1]).
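The combined threshold of Theorem 1 can likewise be sketched in Python (an illustrative helper; the names are ours):

```python
def validity_threshold(n: int, f: int, dt_j: float, dt_r: float, dt_w: float) -> float:
    """Theorem 1: read-validity and write-persistency cannot both hold once
    c >= min{(n - 3f)/(n*dt_r), (n - 3f)/(n*(dt_j + dt_w))}."""
    read_bound = (n - 3 * f) / (n * dt_r)
    persistency_bound = (n - 3 * f) / (n * (dt_j + dt_w))
    return min(read_bound, persistency_bound)
```

For instance, with n = 100, f = 10, ∆tj = ∆tr = 2 and ∆tw = 1, the threshold is min{0.35, 70/300} ≈ 0.233, dominated by the write-persistency term.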

5 Computing Validity Bound for a Synchronous System

Let us consider a specific synchronous distributed system where (i) there is a known upper bound on processing delays (i.e. the duration of every computational step can be bounded) and (ii) processes are equipped with a broadcast and a point-to-point communication primitive that have the following specifications:

• There exists a known and finite bound δ such that every message broadcast at some time t is delivered by time t + δ (TimelyBroadcastDelivery).

• There exists a known and finite bound δ′ < δ such that every message sent at some time t is delivered by time t + δ′ (TimelyChannelDelivery).

Considering these communication primitives, in this section we first compute, for each register operation, the lower bound on the number of communication steps (a communication step is bounded by a period of δ time units) required by a protocol Preg to execute the operation. Second, we derive the time needed by each operation, and finally we instantiate the bound on the churn rate to satisfy the validity property.

Before showing the formal proofs, let us remark that in a synchronous system the termination property can be easily obtained by exploiting the synchrony assumptions. Considering the longest chain of messages related by the happened-before relation [13] that could occur during the execution of an operation, it is possible to set timeouts to trigger the termination of the operation. In the following, we assume that the time needed to execute a computational step is negligible with respect to the time used to perform a communication step.


Lemma 4 Let R be a regular register and let cw be the writer client of R. If cw can access authenticated communication primitives satisfying TimelyBroadcastDelivery and TimelyChannelDelivery, then one communication step is necessary and sufficient to implement an algorithm AW (belonging to Preg) for a write() operation.

Proof

Necessary condition. The proof trivially follows by considering that the set of clients (i.e., the readers and the writer) and the set of servers (i.e. the processes maintaining the state of the register) are disjoint sets. Given a write(v) operation issued by a writer client ci, the written value v needs to be transferred from ci to the servers. Otherwise, it is easy to find an execution where both join-validity and read-validity may not be satisfied.

Sufficient condition. To prove that one communication step is enough to implement a write() operation, we show an algorithm using only one step. The algorithm AW works as follows: when the writer client wants to write a value v on the register, it increments a local sequence number for the write, sends a broadcast message containing the value and the related sequence number, and then waits for δ time units before returning from the operation. When a server receives the message from the client, it updates the local copy of the register and the corresponding sequence number. Following this idea, we report in Appendix A the pseudo-code of the algorithm for both clients and servers. □Lemma 4
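Ignoring timing and faults, the one-step write algorithm of the sufficient condition can be rendered as a minimal Python sketch (class and method names are ours; the δ wait and the timely broadcast are only hinted at in comments):

```python
class Server:
    """Keeps the local register copy as a (sequence number, value) pair."""
    def __init__(self):
        self.sn, self.value = 0, None

    def deliver_write(self, sn, v):
        # Keep only the freshest value seen so far.
        if sn > self.sn:
            self.sn, self.value = sn, v


class WriterClient:
    def __init__(self):
        self.sn = 0

    def write(self, v, servers):
        self.sn += 1
        for s in servers:            # stands in for the timely broadcast
            s.deliver_write(self.sn, v)
        # ...the real algorithm now waits delta time units before returning,
        # so that every server has delivered the message.
```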

From Lemma 4 we have:

Corollary 1 Let R be a regular register and let cw be the writer client of R. If cw can access authenticated communication primitives satisfying TimelyBroadcastDelivery and TimelyChannelDelivery, then ∆tw = δ is a lower bound on the time needed to implement a write() operation.

Lemma 5 Let R be a regular register and let cr be a reader client of R. If cr can access authenticated communication primitives satisfying TimelyBroadcastDelivery and TimelyChannelDelivery and |A(t)| > 3f, then two communication steps are necessary and sufficient to implement an algorithm AR (belonging to Preg) for a read() operation.

Proof

Necessary condition. The proof trivially follows by considering that (i) the set of clients (i.e. readers and writer) and the set of servers (i.e. the processes that maintain the state of the register) are disjoint sets and (ii) the state of the register is maintained by the servers. Let us consider a read() operation issued by some reader ci in the set of clients; ci has no copy of the register and it needs to query the set of servers to retrieve the value (i.e. one communication step is needed). Each server si knows neither when a client will issue a read() nor the identity of the readers, thus it cannot use a proactive approach; it follows that the value of the register can be sent from servers to readers only when a request is received. As a consequence, at least two communication steps are needed to implement a read() operation.

Sufficient condition. The algorithm AR works as follows: when a reader client wants to read, it broadcasts a request to all the servers participating in the computation and then waits for 2δ time units (i.e. the maximum round-trip delay). After this period, it selects among the values received the one for which it has at least f + 1 identical copies. When a server receives the request, it just sends back the local copy of the register. Following this idea, we report in Appendix A the pseudo-code of the algorithm for both clients and servers. □Lemma 5
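The selection rule applied by the reader after the 2δ window can be sketched as follows (an illustrative Python helper; the paper describes the rule only in prose):

```python
from collections import Counter

def select_read_value(replies, f):
    """Return the value with at least f + 1 identical copies among the
    replies collected during the 2*delta window, or None if no value
    qualifies. With more than 3f correct active servers answering, at
    most f replies are forged, so the written value always qualifies."""
    for value, copies in Counter(replies).most_common():
        if copies >= f + 1:
            return value
    return None
```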

From Lemma 5 we have:

Corollary 2 Let R be a regular register and let cr be a reader client of R. If cr can access authenticated communication primitives satisfying TimelyBroadcastDelivery and TimelyChannelDelivery and |A(t)| > 3f, then ∆tr = 2δ is a lower bound on the time needed to implement a read() operation.


Lemma 6 Let R be a regular register and let si be a server joining R. If si can access authenticated communication primitives satisfying TimelyBroadcastDelivery and TimelyChannelDelivery and |A(t)| > 3f, then two communication steps are necessary and sufficient to implement an algorithm AJS (belonging to Preg) for a join() operation.

Proof The proof trivially follows by considering that the same arguments and the same algorithm shown in the proof of Lemma 5 also apply to the join operation. □Lemma 6

From Lemma 6 we have:

Corollary 3 Let R be a regular register and let si be a server joining R. If si can access authenticated communication primitives satisfying TimelyBroadcastDelivery and TimelyChannelDelivery and |A(t)| > 3f, then ∆tj = 2δ is a lower bound on the time needed to implement a join() operation.

Considering Theorem 1, Corollary 1, Corollary 2 and Corollary 3, we are now in a position to compute the bound for a synchronous system:

Corollary 4 Let R be the regular register, let C be the set of clients that access R and let S be the set of servers that maintain R. If clients and servers can access authenticated communication primitives satisfying TimelyBroadcastDelivery and TimelyChannelDelivery and c ≥ (n − 3f)/(3δn), then it is not possible to ensure validity.

It is important to note that the bound of Corollary 4 matches both the result in [3] (c < 1/(3δ)) when considering a distributed system prone to continuous churn and no failures, and the result of [20] when considering a distributed system with no churn and arbitrary failures.
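The bound can be instantiated numerically. The sketch below is ours (the function name is an assumption made for illustration); the closing comment gives an informal counting reading of the bound, not the paper's formal proof:

```python
def max_tolerable_churn(n, f, delta):
    """Churn threshold (n - 3f) / (3 * delta * n) of Corollary 4:
    for c >= this value, validity cannot be ensured."""
    return (n - 3 * f) / (3 * delta * n)

# With no failures (f = 0) the bound collapses to 1/(3*delta), matching [3].
assert max_tolerable_churn(100, 0, 1) == 1 / 3

# With n = 100, f = 10 and delta = 1 sec, churn must stay below 7/30 ~ 23.3%.
bound = max_tolerable_churn(100, 10, 1)
assert abs(bound - 7 / 30) < 1e-12

# Informal counting view: during a 3*delta window, up to c*n*3*delta servers
# may be replaced; validity roughly needs n - f - c*n*3*delta > 2f, i.e.
# c < (n - 3f) / (3 * delta * n).
```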

6 Experimental Evaluation

In the previous sections, we have shown theoretical bounds that relate the churn rate and the number of faulty processes and make it possible to implement a regular register in a distributed system prone to non-quiescent churn. It is important to note that such bounds follow from the worst-case scenario; however, real executions rarely match the worst case, and the practical bound on the churn rate is usually higher than the theoretical one. In this section we show the results of a set of experiments where we simulate a generic protocol implementing a regular register that matches the theoretical bound shown in Section 5. The protocol works as follows: the write() operation is implemented through the dissemination of the value together with its sequence number (by using the broadcast primitive); when a server receives the value, it checks if it is newer with respect to the one maintained locally and, if so, it updates the local value2. After δ time units from the broadcast (i.e., the maximum latency of the communication), a timeout expires and the client terminates the operation. The read() operation is implemented with a request-reply pattern; the client asks for the value and the servers respond by sending back their local copy of the value. After 2δ time units, the client selects among all the received replies the value for which there exist at least f + 1 identical copies and returns such a value. The join() operation is a re-engineering of the read() operation. In particular, the join algorithm directly addresses write-persistence by waiting δ time units (i.e., the maximum time to execute a write) before inquiring to obtain the value. Due to lack of space, the complete description of the protocol is reported in Appendix A. The simulation study was carried out by implementing such a protocol in OMNeT++ [22].

2We recall that all the communications are authenticated, thus each server is able to distinguish real write messages from those possibly generated by faulty processes.


Parameters. In addition to the size of the computation n, the number of faulty processes f and the churn c, to run our experiments we have defined the following parameters:

• communication delay δ, expressed in seconds;

• read frequency freqr: the frequency at which read() operations occur, expressed as #read() / time interval (in sec) (e.g., a read frequency freqr = 1/10 means that 1 read() operation is issued every 10 seconds);

• write frequency freqw: the frequency at which write() operations occur, expressed as #write() / time interval (in sec) (e.g., a write frequency freqw = 1/10 means that 1 write() operation is issued every 10 seconds).

Metrics. The aim of our experiments is to observe the behavior of the regular register protocol when it runs outside the theoretical bounds. To evaluate the performance of the algorithm we have defined the following metric:

• % of valid read() operations: #read() returning a valid value / total number of read().

General Settings. We have simulated a server computation of size n = 100, and a client computation with one writer client and one reader client. The computation lasts 5 minutes and churn is continuous (i.e., Tchurn is 5 minutes). Operations are issued in this way: (i) the writer client issues a write() operation with frequency freqw; (ii) the reader client issues a read() operation with frequency freqr; and (iii) every second c·n processes are refreshed. Concerning departures from the computation, servers are chosen uniformly at random among the set of correct servers. In this way, we have simulated a scenario where a faulty server is aware of its fault and voluntarily decides to remain in the computation, trying to make the register incorrect. We have assumed a maximum communication delay δ = 1 sec, and we have used a constant read frequency freqr = 1 read/2 sec in all the simulations. For each experiment, we ran 10 simulations; the variance of the results was below 1%, therefore it is not reported in the plots.
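The per-second refresh step of this setting can be sketched as follows. This is our own illustration of the scenario described above (the function and data structures are assumptions, not the simulator's code): c·n correct servers leave each second and are replaced by newcomers, while the f faulty servers never leave:

```python
import random

def refresh(servers, faulty, c, rng=random):
    """One churn tick: replace c*|servers| correct servers with newcomers.

    servers: set of server ids currently in the computation
    faulty:  set of ids of byzantine servers (never chosen to depart)
    """
    n = len(servers)
    k = int(c * n)
    correct = list(servers - faulty)
    leaving = set(rng.sample(correct, k))       # departures, uniform at random
    next_id = max(servers) + 1
    joining = set(range(next_id, next_id + k))  # newcomers start their join()
    return (servers - leaving) | joining

servers = set(range(100))
faulty = set(range(10))        # f = 10 byzantine servers
after = refresh(servers, faulty, c=0.2, rng=random.Random(42))
assert len(after) == 100       # computation size stays constant
assert faulty <= after         # faulty servers deliberately remain
```

Note that the newcomers are not immediately useful to readers: they become active only after completing the join() operation, which is what drives the degradation measured below.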

Impact of the write frequency and churn. In this test, we have fixed the number of faulty processes to f = 10 and we have evaluated how the write frequency affects the validity of read operations for increasing churn (Figure 2(a)).

[Figure 2: Impact of the write frequency on the register validity. (a) Percentage of valid read() operations for different write frequencies freqw (x-axis: churn C (%); curves: 0W/1 sec, 1W/1 sec, 1W/2 sec, 1W/20 sec). (b) Percentage of join() operations ending in 1 sec for different write frequencies freqw (x-axis: churn C (%); curves: 1W/1 sec, 1W/2 sec, 1W/5 sec, 1W/10 sec, 1W/20 sec).]

In Figure 2(a) it is possible to see how, moving from a system characterized by a low write frequency (i.e., no write operations, or freqw = 1/20 - one write operation every 20 seconds) to a write-intensive system (i.e., freqw = 1 - one write operation each second), the maximum churn rate that can be tolerated increases by about 50%. This large performance increment is due to the particular implementation of the join


[Figure 3: Percentage of valid read() operations for different values of f. (a) Percentage of valid read() operations for different values of f (x-axis: churn C (%); curves: f = 0, 5, 10, 15, 20, 25, 30). (b) Theoretical bound vs. empirical bound (x-axis: f; y-axis: churn C (%); curves: Theoretical, Empirical).]

operation we provided, which focuses on write-persistence maintenance. In fact, in the protocol, a joining server tries to obtain values as fresh as possible by waiting for the maximum duration of a write operation (to capture possibly concurrent writes) before inquiring other servers. Reducing the frequency of write operations reduces the probability of having write operations concurrent with join ones, thus reducing the number of join operations that end in one communication step (as shown in Figure 2(b)) and thereby weakening the positive effect against churn.

Combined effect of faulty processes and churn. In this test, we set the write frequency freqw = 1 write/5 sec and we have evaluated the effect of faulty processes on the percentage of valid read operations for increasing churn (Figure 3(a)). Moreover, we have compared the theoretical bound computed in Section 5 with the experimental results, and we have drawn the empirical bound, computed as the maximum churn rate that allows obtaining 100% of valid read operations (Figure 3(b)).

In Figure 3(a) it is possible to see how, moving from a failure-free computation (i.e., f = 0) to one having 5% of arbitrary failures (i.e., f = 5), the maximum churn rate tolerated while still ensuring validity decreases by 20%. The result is even more evident in Figure 3(b), comparing the theoretical bound and the empirical bound.

This behavior is a direct consequence of the composition of the server computation. At any time, the servers participating in the computation can be split into faulty servers, non-active servers (i.e., servers still executing the join operation) and active servers. A valid read needs f + 1 identical values obtained from servers in the latter set. Increasing the churn rate, the number of faulty servers, or both, reduces the set of active processes and thus the probability of getting a valid result.
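The three-way split just described can be turned into a rough counting sketch. This is our own back-of-the-envelope illustration, not the paper's analysis; it assumes a join lasting up to 3δ (the δ + 2δ waits of the protocol) and ignores randomness:

```python
def active_lower_bound(n, f, c, delta):
    """Servers that are neither faulty nor still executing join():
    with joins lasting up to 3*delta, roughly c*n*3*delta servers
    are inside their join window at any time."""
    joining = c * n * 3 * delta
    return n - f - joining

def read_can_be_valid(n, f, c, delta):
    # a valid read needs f+1 identical replies from correct active servers
    return active_lower_bound(n, f, c, delta) >= f + 1

# n = 100, delta = 1 sec, churn c = 20%/sec:
assert read_can_be_valid(100, 10, 0.20, 1)       # 30 active >= 11 needed
assert not read_can_be_valid(100, 25, 0.20, 1)   # 15 active < 26 needed
```

Qualitatively this matches the plots: for a fixed churn, raising f both shrinks the set of active servers and raises the f + 1 threshold, so validity degrades on two fronts at once.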

7 Related Work

To the best of our knowledge, this is the first work that considers byzantine failures and churn together.

Figure 4 summarizes requirements of byzantine quorum systems with respect to churn and failure models.

Byzantine fault tolerant systems without churn. Traditional solutions to build byzantine storage can be divided into two categories: replicated state machines [23] and byzantine quorum systems [18, 19, 20, 4]. Both approaches are based on the idea that the state of the storage is replicated among processes; the main difference lies in the number of replicas simultaneously involved in the state maintenance protocol. Replicated state machines require that every non-faulty replica receives every request and processes the requests in the same order before returning to the client [23]. Given the number of failures f, the replicated state machine approach requires only 2f + 1 replicas in order to provide a correct register implementation, due to the need of agreement to order requests. On the contrary, byzantine quorum systems need just subsets of the replicas (i.e., quorums) to be involved simultaneously. The basic idea behind byzantine quorum systems is that each operation issued on the memory storage is executed by a quorum, and any two quorums must intersect. Our contribution addresses byzantine quorum-like systems, as we do not assume any form of agreement.
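The intersection requirement can be checked concretely. The following small sketch is ours (not from the cited works): with n = 3f + 1 servers and quorums of size 2f + 1, a standard parameter choice in byzantine quorum systems, any two quorums overlap in at least f + 1 servers, so every pair of quorums shares at least one correct server:

```python
def min_quorum_intersection(n, quorum_size):
    """Smallest possible overlap of two quorums of the given size drawn
    from n servers: |Q1| + |Q2| - n, by inclusion-exclusion."""
    return 2 * quorum_size - n

for f in range(1, 6):
    n = 3 * f + 1
    q = 2 * f + 1
    # any two quorums intersect in at least f+1 servers; since at most f of
    # them are byzantine, the intersection always contains a correct server
    assert min_quorum_intersection(n, q) == f + 1
```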

In [18] the authors consider protocols managing replicated data with the characteristic that the writer process knows when the write operation has terminated (i.e., write-confirmable protocols) and provide an analysis of quorum systems to ensure data availability and consistency. Given the number of faulty processes f, they define the total number of processes needed and the load imposed by the quorums.

In [19, 20] the authors provide two algorithms that improve the bound found in [18] by using quorums of different sizes for read and write operations, and extend the analysis to protocols where the writer does not know exactly when the operation has terminated but only knows that it eventually terminates. In [4], byzantine quorum systems have been investigated in a synchronous environment with reliable channels, and it has been proved that, exploiting the synchrony of the system, it is possible to implement a safe register while reducing the number of servers.

Registers under continuous churn. In [3], the problem of building a regular register in a distributed system prone to continuous churn has been considered. In particular, [3] shows that it is not possible to implement a regular register in a fully asynchronous system if the churn is continuous. Moreover, it provides two algorithms that solve the problem both in a synchronous and in a partially synchronous system, provided that the churn rate satisfies specific constraints.

Registers under quiescent churn. Several recent works (e.g., [1], [6], [8], [17]) address the implementation of registers in systems prone to quiescent churn (see Section 2).

In [17], a Reconfigurable Atomic Memory for Basic Objects (RAMBO) is presented. RAMBO works on top of a distributed system where processes can join and fail by crashing. To guarantee reliability of data in spite of network changes, RAMBO replicates data at several network locations and defines configurations to manage small and transient changes. For large changes in the set of participating processes, RAMBO defines a reconfiguration procedure whose aim is to move the system from an existing configuration to a new one by changing the membership of the read and write quorums. Such reconfiguration is implemented by a distributed consensus algorithm.

It is important to note that in RAMBO the notion of churn is abstracted by a sequence of configurations. Moreover, to ensure liveness of the system, RAMBO assumes that there exist stability periods long enough to allow the algorithm to converge (i.e., the assumption of quiescent churn). [8] and [6] present improvements to the original RAMBO protocol to allow fast reconfigurations.

In [1], Aguilera et al. show that a crash-resilient atomic register can be realized without consensus, and thus on a fully asynchronous distributed system, provided that the number of reconfigurations is finite and hence the churn is quiescent. Configurations are managed by taking into account all the changes (i.e., joins and failures of processes) suggested by the participants, and the quorums are represented by any majority of processes. To ensure liveness of read and write operations, the authors assume that the number of reconfigurations is finite and that there is a majority of correct processes in each reconfiguration.


Figure 4 (rows: churn setting and synchrony assumption; columns: Crash Failure, Byzantine Failure, Self-verifying Data) can be summarized as follows:

• No churn, synchronous: a perfect failure detector suffices for both the regular and the atomic specification under crash failures [7]; under byzantine failures, f+1 servers for safe semantics without write confirmation [4], 2f+1 servers for the regular specification without write confirmation (e.g., [20], [21]), and 3f+1 servers for the regular specification with write confirmation (e.g., [19], [20]).

• Non-quiescent churn, synchronous: c < 1/(3δ) to implement the regular specification [3]; c < (n − 3f)/(3δn) to implement a regular register under byzantine failures (new result).

• Quiescent churn, asynchronous: a majority of correct processes for both the regular and the atomic specification [2].

• Non-quiescent churn, asynchronous: not possible, due to the result of [3].

Figure 4: Summary of protocol requirements for register implementation in different failure and churn settings (new results are shown in bold)

8 Conclusion

This paper has established a bound for ensuring validity of read operations issued on a regular register able to resist arbitrary failures and non-quiescent churn. The bound has then been instantiated in the context of a specific synchronous distributed system, considering protocols using the minimum number of communication steps necessary to implement register operations. Finally, a performance analysis of one of these protocols has been carried out which, among other things, studies the distance between the theoretical and the empirical bound by varying the churn and the number of failures to be tolerated by the object.

Let us remark that the system model defined in this paper is able to capture complex scenarios composed of arbitrary failures, different churn patterns, or both. Therefore, proving the correctness of any basic service in this challenging environment ensures the high availability of such a service under the very severe and unpredictable conditions that are typical of large-scale systems such as interconnected datacenters.

Future work may study whether register storage is weaker than consensus not only in distributed systems prone to both crash failures and quiescent churn (as shown in [1]) but also in settings including byzantine failures.

References

[1] Aguilera M. K., Keidar I., Malkhi D., Shraer A., Dynamic atomic storage without consensus, in Proceedings of the 28th Annual ACM Symposium on Principles of Distributed Computing (PODC 2009), 17-25.

[2] Attiya H., Bar-Noy A. and Dolev D., Sharing Memory Robustly in Message-Passing Systems, Journal of the ACM, 42(1):129-142, 1995.

[3] Baldoni R., Bonomi S., Kermarrec A.M., Raynal M., Implementing a Register in a Dynamic Distributed System, in Proceedings of the 29th IEEE International Conference on Distributed Computing Systems (ICDCS’09), IEEE Computer Society Press, Montreal (Canada), June 2009.

[4] Bazzi R. A., Synchronous Byzantine Quorum Systems, Distributed Computing 13(1): 45-52, 2000.

[5] Castro M., Liskov B., Practical byzantine fault tolerance and proactive recovery, ACM Transactions on Computer Systems (TOCS), 20(4), 398-461, 2002.

[6] Chockler G., Gilbert S., Gramoli V., Musial P. M. and Shvartsman A., Reconfigurable distributed storage for dynamic networks, Journal of Parallel and Distributed Computing, 69(1): 100-116, 2009.


[7] Delporte-Gallet C., Fauconnier H., Guerraoui R., Hadzilacos V., Kouznetsov P., Toueg S., The weakest failure detectors to solve certain fundamental problems in distributed computing, in Proceedings of the 23rd Annual ACM Symposium on Principles of Distributed Computing (PODC 2004), 338-346, 2004.

[8] Gilbert S., Lynch N., and Shvartsman A., RAMBO II: Rapidly Reconfigurable Atomic Memory for Dynamic Networks, in Proceedings of International Conference on Dependable Systems and Networks (DSN 2003).

[9] Greenberg A. G., Hamilton J. R., Jain N., Kandula S., Kim C., Lahiri P., Maltz D. A., Patel P., Sengupta S., VL2: a scalable and flexible data center network, in Proceedings of the ACM SIGCOMM 2009 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications (SIGCOMM 2009), 51-62, 2009.

[10] Hendricks J., Ganger G. R., Reiter M. K., Low-overhead byzantine fault-tolerant storage, in Proceedings of the 22nd ACM Symposium on Operating Systems Principles (SOSP 2007), 73-86, 2007.

[11] Kubiatowicz J., Bindel D., Chen Y., Czerwinski S. E., Eaton P. R., Geels D., Gummadi R., Rhea S. C., Weatherspoon H., Weimer W., Wells C., Zhao B. Y., OceanStore: An Architecture for Global-Scale Persistent Storage, in Proceedings of the 9th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2000), 190-201, 2000.

[12] Kuhn F., Schmid S., Wattenhofer R., A Self-repairing Peer-to-Peer System Resilient to Dynamic Adversarial Churn, in Proceedings of the 4th International Workshop on Peer-to-Peer Systems (IPTPS 2005), 3-23.

[13] Lamport L., Time, Clocks, and the Ordering of Events in a Distributed System, Communications of the ACM, 21(7): 558-565, 1978.

[14] Lamport L., On Interprocess Communication, Part 1: Models, Part 2: Algorithms, Distributed Computing, 1(2): 77-101, 1986.

[15] Lamport L., Shostak R. E., Pease M. C., The Byzantine Generals Problem, ACM Transactions on Programming Languages and Systems (TOPLAS), 4(3): 382-401, 1982.

[16] Liben-Nowell D., Karger D. R., Kaashoek M. F., Dabek F., Balakrishnan H., Stoica I. and Morris R., Chord: A Scalable Peer-to-peer Lookup Protocol for Internet Applications, IEEE/ACM Transactions on Networking, 11(1): 17-32, 2003.

[17] Lynch N. and Shvartsman A., RAMBO: A Reconfigurable Atomic Memory Service for Dynamic Networks, in Proceedings of the 16th International Symposium on Distributed Computing (DISC'02), Springer-Verlag LNCS #2508, pp. 173-190, 2002.

[18] Malkhi D., Reiter M. K., Byzantine Quorum Systems, Distributed Computing, 11(4): 203-213, 1998.

[19] Martin J., Alvisi L., Dahlin M., Small Byzantine Quorum Systems, in Proceedings of the International Conference on Dependable Systems and Networks (DSN 2002), pp. 374-388, 2002.

[20] Martin J., Alvisi L., Dahlin M., Minimal Byzantine Storage, in Proceedings of the 16th International Symposium on Distributed Computing (DISC 2002), pp. 311-325, 2002.

[21] Merritt M. and Taubenfeld G., Computing with Infinitely Many Processes, in Proceedings of the 14th Int’l Symposium on Distributed Computing (DISC’00), Springer-Verlag LNCS #1914, pp. 164-178, 2000.

[22] OMNeT++ Simulator, http://www.omnetpp.org/

[23] Schneider F. B., Implementing Fault-Tolerant Services Using the State Machine Approach, ACM Computing Surveys, 22(4): 299-319, 1990.

[24] Voulgaris S., Gavidia D., and van Steen M., CYCLON: Inexpensive Membership Management for Unstructured P2P Overlays, Journal of Network and Systems Management, 13(2), 2005.


Appendix A - A protocol Preg for the synchronous system of Section 5

For the sake of completeness, we report in this appendix the pseudo-code of a general algorithm implementing a regular register in a dynamic distributed system. This protocol is similar to the one presented in [3], extended to make it work with byzantine failures.

Local variables at a client ci Each client ci maintains just one local variable, denoted read sni, representing the sequence number to associate with each read() operation. Moreover, the writer client cw also maintains snw, representing the sequence number to associate with each write() operation.

Local variables at a server si Each server si has the following local variables.

• Two variables denoted registeri and sni; registeri contains the local copy of the regular register, while sni is the associated sequence number.

• A boolean activei, initialized to false, that is switched to true just after si has joined the system.

• Two set variables, denoted repliesi and reply toi, that are used during the period in which si joins the system. The local variable repliesi contains the 3-uples < id, value, sn > that si has received from other processes during its join period, while reply toi contains the processes that are joining the system concurrently with si (as far as si knows).

The local variables of each server sk are initialized such that registerk contains the initial value of the regular register3, snk = 0, activek = false, and repliesk = reply tok = ∅.

The write() operation. The algorithm for the write operation on the regular register (both client and server side) is described in Figure 5. The write is initiated by the writer client cw and consists of the dissemination, to all the servers currently in the computation, of the pair < value, sn > (lines 01-02). In order to guarantee the correct delivery of the value, the writer is required to wait for δ time units before terminating the write operation (line 03).

operation write(v):
(01) snw ← snw + 1;
(02) broadcast WRITE(w, < v, snw >);
(03) wait(δ);
(04) return(ok).

(a) Client Protocol

when WRITE(j, < val, sn >) is delivered:
(01) if ((j = w) ∧ (sn > sni))
(02) then registeri ← val;
(03) sni ← sn;
(04) end if;

(b) Server Protocol

Figure 5: The write() protocol for a synchronous system

When a message WRITE(j, < val, sn >) is delivered to a server si, the server checks whether the message arrives from the writer client (line 01) and, if so, it takes into account the pair (val, sn) if it is more up to date than its current pair (lines 02-04).

3Without loss of generality, we assume that at the beginning every server sk has in its variable registerk the value 0.


The read() operation. The algorithm for the read operation on the regular register (both client and server side) is described in Figure 6.

After having initialized its local variables, the client ci broadcasts a READ(i) message to ask the servers for the current value of the regular register (line 03) and waits for 2δ time units, i.e., the maximum round-trip delay (line 04). When this period terminates, ci selects the value for which it has received at least f + 1 identical replies (line 05). Finally, ci returns the value (line 06).

operation read(i):
(01) read sni ← read sni + 1;
(02) repliesi ← ∅;
(03) broadcast READ(i, read sni);
(04) wait(2δ);
(05) let < id, val, sn > ∈ repliesi such that
     (∃ f + 1 < −, v, sn′ > ∈ repliesi : (v = val ∧ sn = sn′));
(06) return(val).

————————————————————————————–

when REPLY(< j, val, sn >, r sn) is delivered:
(07) if (read sni = r sn)
(08) repliesi ← repliesi ∪ {< j, val, sn >};
(09) endif

(a) Client Protocol

when READ(j, r sn) is delivered:
(01) if (activei)
(02) then send REPLY(< i, registeri, sni >, r sn) to pj;
(03) else reply toi ← reply toi ∪ {< j, r sn >};
(04) end if.

(b) Server Protocol

Figure 6: The read() protocol for a synchronous system

When a server delivers a READ(j, r sn) message, it answers by sending back a REPLY message containing its local copy of the register together with the sequence number if it is active (line 02); otherwise it postpones its answer until it becomes active (Figure 6, line 03 and Figure 7, line 15).

Finally, when a client ci delivers a REPLY message, it puts the received value and sequence number in its repliesi set (line 08).

The join() operation. The algorithm implementing the join operation is described in Figure 7. The server si first initializes its local variables (line 01) and waits for a period of δ time units (line 02); this waiting period is necessary to prevent the server from losing messages related to possibly concurrent write operations (cf. [3] for details).

If registeri has not been updated during this waiting period (line 03), si broadcasts (with the broadcast() operation) an INQUIRY(i) message to the servers that are in the computation (line 05) and waits for 2δ time units, i.e., the maximum round-trip delay (line 06)4.

When this period terminates, si updates its local variables registeri and sni to the values for which it has received at least f + 1 identical replies (lines 07-11). Then si becomes active (line 13), which means that it can answer the inquiries it has received from other processes, and does so if reply toi ≠ ∅ (line 14). Finally, si returns ok to indicate the end of the join() operation (line 17).

When a server si receives a message INQUIRY(j), it answers sj by sending back a REPLY(< i, registeri, sni >) message containing its local variables if it is active (line 19). Otherwise, si postpones

4The statement wait(2δ) can be replaced by wait(δ + δ′), which provides a more efficient join operation; δ is the upper bound for the dissemination of the message sent by the reliable broadcast, a one-to-many communication primitive, while δ′ is the upper bound for a response sent to a process whose id is known, using a one-to-one communication primitive. So, wait(δ) is related to the broadcast, while wait(δ′) is related to point-to-point communication. We use the wait(2δ) statement to simplify the presentation.


operation join(i):
(01) registeri ← ⊥; sni ← −1; activei ← false; repliesi ← ∅; reply toi ← ∅;
(02) wait(δ);
(03) if (registeri = ⊥) then
(04) repliesi ← ∅;
(05) broadcast INQUIRY(i);
(06) wait(2δ);
(07) let < id, val, sn > ∈ repliesi such that
     (∃ f + 1 < −, v, sn′ > ∈ repliesi : (v = val ∧ sn = sn′));
(08) if (sn > sni)
(09) then sni ← sn;
(10) registeri ← val;
(11) end if
(12) end if;
(13) activei ← true;
(14) for each < j, r sn > ∈ reply toi do
(15) send REPLY(< i, registeri, sni >, r sn) to pj;
(16) endfor
(17) return(ok).

————————————————————————————————————–

(18) when INQUIRY(j) is delivered:
(19) if (activei) then send REPLY(< i, registeri, sni >, 0) to pj
(20) else reply toi ← reply toi ∪ {< j, 0 >}
(21) end if.

(22) when REPLY(< j, value, sn >, r sn) is received:
(23) if (r sn = 0)
(24) repliesi ← repliesi ∪ {< j, value, sn >}
(25) endif

Figure 7: The join() protocol for a synchronous system (server code)

its answer until it becomes active (line 20 and lines 13-14). Finally, when si receives a message REPLY(< j, value, sn >) from a server sj, it adds the corresponding 3-uple to its repliesi set (line 24).
