
Università degli Studi di Roma

“La Sapienza”

Facoltà di Ingegneria dell’Informazione,

Informatica e Statistica

Master’s Thesis in Computer Engineering

December 2011

Byzantine Fault-Tolerant Storage in

Dynamic Distributed Systems

Amir Soltani Nezhad

Advisor: Prof. Roberto Baldoni
Co-advisor: Dr. Silvia Bonomi


Acknowledgments

First, I would like to especially thank my advisor, Prof. Roberto Baldoni, from whom I have learned a lot. I will never forget that on the first day of my arrival in Rome, after the first lecture of the Distributed Systems course, I came to your office and told you that I was interested in joining your research group. Although you knew that I did not yet have enough knowledge to start doing research with your team, you kept me eager and motivated, and told me that I could join your research group in the second semester, once I had finished my core courses. Briefly, I would like to say again: thank you for everything you taught me about doing research.

Second, I would like to thank Dr. Silvia Bonomi, who is a kind person and a great researcher. I learned the basics of doing research from you by asking so many questions, and you were always nice and helpful. I will never forget your help. Thank you again!

Finally, I would like to dedicate this work to my parents. Even though over the last two years I have not visited you many times, I have always felt your presence beside me. From the bottom of my heart, many thanks to both of you, for everything and for your incredible support.



Contents

1 Introduction 1

2 Related Work 5
2.1 Churn Model 5
2.2 Registers 11
2.3 Register Implementations 12
2.3.1 Crash-Fault-Tolerant Register Implementations under no churn 12
2.3.2 Byzantine-Fault-Tolerant Register Implementations under no churn 13
2.3.3 Crash-Fault-Tolerant Register Implementations under quiescent churn 15
2.3.4 Crash-Fault-Tolerant Register Implementations under non-quiescent (continuous) churn 16
2.3.5 Byzantine-Fault-Tolerant Register Implementations under non-quiescent (continuous) churn 17
2.4 Recovery method for BFT Storage 17

3 Validity Bound for a BFT Regular Register in a Dynamic Distributed System 19
3.1 System Model 19
3.2 Regular Register Specification 22
3.3 Validity Bound 23

4 Implementing a BFT Regular Register in a Synchronous Dynamic Distributed System 29
4.1 Computing Validity Bound for a Synchronous System 30
4.2 A Multi-Writer Multi-Reader Regular Register 32
4.2.1 A Protocol for Implementing a Multi-Writer Multi-Reader Regular Register 33
4.2.2 Correctness Proofs 38
4.3 Experimental Evaluation 40

5 Implementing a Safe BFT Register in an Eventually Synchronous Dynamic Distributed System 47
5.1 System Model 48
5.2 Safe Register Specification 50
5.3 Safe Register Implementation 50
5.3.1 A Protocol for an Eventually Synchronous System 51
5.3.2 Correctness Proofs 57

6 Conclusion 63


Chapter 1

Introduction

In recent years, ever cheaper and more powerful hardware, together with the ever-increasing availability of bandwidth, has led the software-as-a-service computing paradigm to transform many distributed applications into services offered by cloud providers. Among all possible cloud services, distributed storage such as Amazon Simple Storage Service (S3) [13] is one of the most popular, due to its capability to provide simple read and write interfaces while ensuring a certain degree of consistency and a well-defined level of service availability. Note that such services are usually regulated by specific contracts (i.e., Service Level Agreements), and thus service providers must guarantee this level of quality of service despite any type of failure, including malicious ones.

Another example is Google’s Chubby lock service, which is intended to provide coarse-grained locking as well as reliable (though low-volume) storage for loosely coupled distributed systems. Chubby provides an interface much like a distributed file system with advisory locks, but the design emphasis is on availability and reliability, as opposed to high performance. Many instances of the service have been used, with several of them each handling a few tens of thousands of clients concurrently. The smallest unit of this large-scale lock service is called a cell, which is composed of a small number of replicas [88]. Since a huge number of cells serve Google’s lock service, deployed for several applications such as the file system, primary election and, recently, the DNS service, the phenomenon of dynamicity arises.

This dynamicity appears in different shapes, such as the failure of replicas in cells and their replacement by new ones, or the possibly intentional rejuvenation of some replicas and their temporary unavailability. Clearly, the new replicas must be aligned with the other replicas of a cell in order to ensure high availability. On the other hand, byzantine behavior may arise in such a large-scale distributed system: a replica starts behaving in opposition to its specification due to different causes, such as attacks run by adversaries, or capacity problems like buffer overflows. Therefore, byzantine tolerance is vital in this context.

Similarly, Facebook runs the largest social networking platform, serving hundreds of millions of users at peak times using tens of thousands of servers located in many data centers around the world. Performance, reliability, efficiency and scalability are four essential factors for Facebook. Dealing with failures in a platform consisting of thousands of components is Facebook’s standard mode of operation; a small but significant number of servers and network components should always be expected to be failing at any given time. Thus, failure tolerance is the norm rather than the exception. To meet the reliability and scalability requirements described above, Facebook has developed Cassandra [94] to encompass both failure detection and the dynamicity that results from failures and replacements at the same time.

In the literature, a common approach to ensuring storage availability is to keep aligned a fixed number of replicas, each hosted at a separate server, and many protocols have been proposed to build byzantine-fault-tolerant (BFT) storage services on top of a message-passing system. However, these protocols do not consider the possibility of changes in the set of servers hosting replicas. Servers can leave due to ordinary or unexpected maintenance procedures, and new replicas need to be set up (i.e., join) in order to maintain the minimum number of active replicas needed to provide the service. These changes caused by joins and departures of servers (the churn phenomenon), if not properly mastered, can either block protocols or violate the safety of the storage.

In this thesis, I study the design and implementation of a storage service in environments where byzantine servers coexist with changes in the number and composition of servers (dynamicity), under different synchrony assumptions. I also support this work with validity bounds and experimental evaluations verifying the correctness of the bounds.


Thesis Organization. Chapter 2 presents several churn models used to express dynamicity, one of which has been chosen as the churn model used in Chapter 3, and finally introduces other works in the literature related to this thesis topic.

Chapter 3 presents a bound ensuring the validity of read operations performed on a regular register in a dynamic distributed system. This bound is independent of the protocol implementing the operations, and follows directly from the system model introduced in that chapter.

Chapter 4 presents the implementation of a BFT regular register in a fully synchronous environment, based on the replication of the register value on top of a distributed system prone to non-quiescent churn. It first proves the number of communication steps necessary and sufficient to implement each operation (i.e., join, read and write) of a generic regular register protocol, then discusses the actual validity bound for such a synchronous system. Afterwards, it shows an experimental evaluation of the protocol that analyzes the number of valid read operations while varying both the churn rate and the number of faulty processes.

Chapter 5 presents the problem of implementing another type of shared storage, named a safe register, in an eventually synchronous system in the presence of non-quiescent (continuous) churn. In this chapter, I present a new bound for such a storage, based on the bound introduced in [67] for a safe register in static distributed systems.

Chapter 6 concludes this thesis with overall remarks on its results and possible future work.

Finally, let me remark that some of the results obtained in this work have been published in the papers [89, 90, 91], and that the manuscripts [92, 93], also derived from this work, have been submitted for publication.


Chapter 2

Related Work

In this chapter, I provide the concepts and definitions this thesis is based on, together with other works related to its topic. First, I review several churn models, in particular the one explained in [1]. Then, I give the definition and the different types of registers, as well as works in the literature on fault-tolerant storage in both static and dynamic networks. Finally, I conclude the chapter with a different fault-tolerance method, named recovery. Although this method is not used in this thesis, it is conceptually close to the thesis topic and worthwhile to discuss.

2.1 Churn Model

In the literature, several methods exist to model distributed systems with churn. The dynamicity due to the continuous join and leave of nodes in the computation is the phenomenon identified under the name of churn.

[2] and [68] are two of the works where the authors relax the assumption of a finite and known number of processes. They try to solve some famous classic concurrency problems (i.e., election, mutual exclusion, and consensus) by considering that the number of processes which may participate in the computation is infinite. To solve these problems, they consider two concepts: concurrency level and required participation. Since in each infinite run the number of processes taking part in the computation is unknown, the concurrency level defines the maximum number of processes that may take part in the algorithm at the same time. In particular, they present three types of concurrency level:


- finite: there is a finite and known bound on the maximum number of processes that participate simultaneously in the algorithm, over all runs;

- bounded (also called the infinite arrival model): in each run, there is a finite bound on the number of processes that participate simultaneously in the algorithm (note that this bound can change between two different runs);

- unbounded: in each run, the number of processes that participate simultaneously in the algorithm can grow without bound.

Note that the concept of process failure/leave is implicit. Furthermore, the authors observe that, depending on the problem specification, all the processes or only a subset of them are required to participate in the computation. The participation is defined by providing a pair [l, u], where l is the minimum number of processes required by the algorithm and u is the maximum.

Using such a model, the authors show the weakest combinations of concurrency level and required participation that make it possible to implement (i) election, (ii) consensus, (iii) deadlock-free mutual exclusion, (iv) starvation-free mutual exclusion and (v) test & set, by using atomic registers in a fault-free model.

In [2], two variants of the bounded concurrency model are pointed out:

- infinite arrival model with b-bounded concurrency: every run has a maximum concurrency bounded by a constant b known by the algorithms;

- infinite arrival model with bounded concurrency: each run has a maxi- mum finite concurrency.

Note that in these two variants the departure of nodes is involved; in fact, having a maximum degree of concurrency implies that, when such maximum concurrency is reached, the arrival and departure rates become strictly related.

Moreover, [27] gives one of the first definitions of a dynamic system, as follows:

A dynamic system is a continually running system in which an arbitrarily large number of processes are part of the system during each interval of time and, at any time, any process can directly interact with only an arbitrarily small part of the system.

This definition is general and applies also to large-scale systems, e.g., peer-to-peer systems. In this paper, the authors extend the works of [2, 68] by considering a second dimension of dynamicity, namely the diameter of the network. They model the network as a graph in which each process plays the role of a vertex, and there is an edge between two vertices if they are neighbors. This graph is subject to change through the removal or addition of edges and vertices (processes). During the changes, the diameter of the graph represents the dynamicity:

- bounded and known diameter: every run has a maximum diameter bounded by a constant b known by the algorithms;

- bounded and unknown diameter: in each run the diameter is finite, but the union over all runs may be unbounded;

- unbounded diameter: the diameter is unbounded and can grow indefinitely in the run.

[56] presents a framework to model churn. Such a framework is appropriate to describe systems where the number of participating processes is constant over time, and contains two different models: (i) the train model and (ii) the crowd model.

1. In the train model, processes join and leave the computation at periodic times and in groups of the same size c, which is the same for the whole computation (i.e., at each time t = t0 + ∆i, c processes join and c processes leave, where t0 is the time at which the computation starts, ∆ is the period between two join/leave bursts and i ∈ N).

2. The crowd model is a generalization of the train model: it allows processes to join and leave the computation at arbitrary time instants, in groups of size ci that may vary over the computation.
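As a toy sketch (my own, not from [56]), the periodic burst schedule of the train model can be written down directly; `train_burst_times` is a hypothetical helper name:

```python
# A small sketch of the train model: at each time t = t0 + i*delta,
# exactly c processes join and c processes leave, so the computation
# size stays constant between bursts. Here we only enumerate the
# burst times up to a finite horizon.

def train_burst_times(t0, delta, horizon):
    """Times at which a join/leave burst occurs, for t0 + i*delta <= horizon."""
    times = []
    t = t0
    while t <= horizon:
        times.append(t)
        t += delta
    return times

# Bursts every 5 time units starting at t0 = 0, observed up to t = 20.
print(train_burst_times(0, 5, 20))  # → [0, 5, 10, 15, 20]
```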

Similarly to [56], [1] proposes an infinite churn model that is mainly based on the definition of two functions: (i) the join function λ(t), which defines the number of processes that invoke the join operation at time t, and (ii) the leave function µ(t), which defines the number of processes that have left the system at time t.


[Figure: three before/after snapshots of an 11-process system; the diagrams are omitted here.]
(a) Time t0: λ(t0) = 0, µ(t0) = 0
(b) Time t1: λ(t1) = 2, µ(t1) = 2
(c) Time t2: λ(t2) = 0, µ(t2) = 3

Figure 2.1: Examples of Join and Leave functions

Figure 2.1 shows an example of how the churn of the computation can be characterized by using the join function and the leave function.

Let t0 be the starting time of the computation. At time t0, if no process joins or leaves the computation, then λ(t0) = 0 and µ(t0) = 0 (Figure 2.1(a)); therefore, at t0 the computation is composed of a set Π0 of processes and the size of the computation is n0 (i.e., |Π0| = n0). Moreover, for any time t < t0, I say that λ(t) = µ(t) = 0.

In Figure 2.1(b), λ(t1) = 2, meaning that two processes, namely p12 and p13, enter the computation, and µ(t1) = 2, meaning that two processes, namely p4 and p7, depart from it. Note that in this case the number of processes inside the system is still n0. In Figure 2.1(c), λ(t2) = 0, meaning that no new process joins the computation, and µ(t2) = 3, meaning that three processes, namely p1, p10 and p11, depart from it; in this case, the number of processes inside the system decreases. This example shows how the variation of the computation members can easily be described quantitatively by the two functions.

Note that this way of modeling can also represent a “static system”: a system without failures is characterized by a join function and a leave function that are always equal to 0 for any time t (i.e., λ(t) = µ(t) = 0 ∀t), while a system with crash failures is characterized by a null join function and a leave function constrained so that the total number of failures is smaller than n0.

As soon as the dynamicity introduced by joins and leaves appears in the computation, both the size and the composition of the computation can change.

The number of participants in the computation, N(t), can be calculated from the values of λ(t) and µ(t) as follows: N(t) = N(t − 1) + λ(t) − µ(t), with N(t0) = n0.

[Figure: plot of the computation size N(t) over time, oscillating between n0 − k1 and n0 + k2.]

Figure 2.2: Computation Size in an interval ∆N = [n0 − k1, n0 + k2] with N(t0) = n0.

For instance, in the scenario shown in Figure 2.1, the corresponding node function is:

- N(t0) = n0 = 11
- N(t1) = N(t0) + λ(t1) − µ(t1) = n0 + 2 − 2 = n0 = 11
- N(t2) = N(t1) + λ(t2) − µ(t2) = n0 + 0 − 3 = n0 − 3 = 8.

Churn constraint on the upper and lower bounds on the network size

Based on the previous definitions, this section derives the constraints that a join function and a leave function have to satisfy so that the network size remains in a given interval. Note that such behavior is typical of real applications like peer-to-peer systems, VoIP-based applications, etc. [47, 50].

Let n0 be the number of processes inside the computation at the start time t0, and let k1, k2 be two positive integers such that 0 ≤ k1 < n0 and k2 ≥ 0. The aim is to model a computation affected by dynamicity whose effect is the variation of the system size within an interval ∆N defined by the upper bound n0 + k2 and the lower bound n0 − k1 (i.e., ∆N = [n0 − k1, n0 + k2]). An example of such a computation is shown in Figure 2.2.
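As a small self-contained sketch (my own, following the definitions above), one can check whether a given churn trace keeps the computation size within ∆N:

```python
# Check that a churn trace keeps the computation size within
# Delta_N = [n0 - k1, n0 + k2], using N(t) = N(t-1) + lambda(t) - mu(t).

def within_bounds(n0, k1, k2, joins, leaves):
    """joins[t] = lambda(t), leaves[t] = mu(t), for t = 1, 2, ..."""
    n = n0
    for lam, mu in zip(joins, leaves):
        n = n + lam - mu
        if not (n0 - k1 <= n <= n0 + k2):
            return False
    return True

# n0 = 10, k1 = 2, k2 = 1: the size may range over [8, 11].
print(within_bounds(10, 2, 1, [2, 0, 0], [1, 0, 3]))  # → True
```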


The churn model used in Chapter 3 is based on the model introduced in [1], with the assumption that, at each time unit, λ(ti) = µ(ti) = c for all i, with c ≤ n ∈ N. Thus, the network size is always constant.

The model presented in [1] subsumes the models presented in [2, 68]; in other words, [1] can express the following three concurrency levels:

- the finite concurrency level
- the bounded concurrency level
- the unbounded concurrency level

Note that the churn model presented in [1] can model both the behavior described by the train model and that described by the crowd model of [56]. Generally speaking, it is a generalization of the crowd model: in fact, it makes it possible to characterize systems whose size changes within a given range, in addition to systems with a constant size.

Furthermore, only [1] and [56] present a quantitative approach to model the churn and dynamicity. The other models mentioned above give only a qualitative measure of the dynamicity.

From the perspective of churn duration, churn can be divided into two types:

- quiescent churn: a type of churn for which there is expected to be a time t after which the system experiences a long-enough stability period. In other words, there is a time after which there is no dynamicity in the system, and the system behaves, for an unknown but long-enough time, like a static distributed system.

- non-quiescent churn (also called continuous churn): a type of churn for which no such time t is expected. In other words, there is no time after which the dynamicity in the system ends, and therefore the system may never behave like a static distributed system.
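As a toy illustration of the distinction (my own, and only a finite-trace proxy for the definition, which quantifies over infinite runs):

```python
# On a finite trace, treat churn as quiescent if the trace ends with
# `window` consecutive steps in which nobody joins or leaves, as a
# rough stand-in for "a long-enough stability period after some time t".

def has_quiescent_tail(joins, leaves, window):
    tail = list(zip(joins, leaves))[-window:]
    return len(tail) == window and all(l == 0 and m == 0 for l, m in tail)

print(has_quiescent_tail([2, 1, 0, 0, 0], [1, 2, 0, 0, 0], 3))  # → True
print(has_quiescent_tail([2, 1, 0, 1, 0], [1, 2, 0, 1, 0], 3))  # → False
```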


2.2 Registers

A register is an object shared by a set of processes. Such an object provides processes with two operations, namely read() and write(), that allow them to read the value contained in the object or to modify such a value.

According to the value domain of the register, the set of processes allowed to read it, the set allowed to write it, and the specification of which value is returned by a read operation, a family of register types can be defined. As far as the last point (the value returned by a read operation) is concerned, Lamport defined three types of registers [60].

• A safe register can be written by one writer only. Moreover, a read operation on such a register returns its current value if no write operation is concurrent with that read. In case of concurrency, the read can return any value of the value domain of the register (which means that a read concurrent with a write can return a value that has never been written). This type of register is very weak.

• A regular register can have any number of writers and any number of readers. The writes appear as if they were executed sequentially, complying with their real time order. If no write operation is concurrent with a read operation, that read operation returns the current value kept in the register. Otherwise, the read operation returns any value written by a concurrent write operation or the last value of the register before these concurrent writes. A regular register is stronger than a safe register, as the value returned in presence of concurrent write operations is no longer arbitrary.

• An atomic register is such that all its read and write operations appear as if they were executed sequentially, with this sequential total order respecting the real-time order of the operations.
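The regular-register semantics above can be illustrated with a small sketch (my own, not from the thesis) that enumerates the values a read is allowed to return, given sequential writes and a read interval:

```python
# Lamport's regularity: a read may return the last value written
# before it started, or any value written by a write concurrent with it.

def admissible_reads(writes, read):
    """writes: list of (start, end, value), non-overlapping, time-ordered.
    read: (start, end). Returns the set of values regularity allows."""
    r_start, r_end = read
    last_before = None
    allowed = set()
    for w_start, w_end, value in writes:
        if w_end < r_start:           # write finished before the read began
            last_before = value
        elif w_start <= r_end:        # write concurrent with the read
            allowed.add(value)
    if last_before is not None:
        allowed.add(last_before)
    return allowed

# write(1) completes, then write(2) overlaps the read:
print(admissible_reads([(0, 1, 1), (3, 6, 2)], (4, 5)))  # → {1, 2}
```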

Interestingly enough, safe, regular and atomic registers have the same computational power. This means that it is possible to implement a multi- writer/multi-reader atomic register from single-writer/single-reader safe reg- isters. There is a huge number of papers in the literature discussing such transformations (e.g., [36, 52, 72, 74, 75] to cite a few).

Multi-Writer Multi-Reader Regular Register. Another type of register is the multi-writer register which, as the name implies, has more than one writer rather than a single one. From the perspective of linearizability, multi-writer registers can be categorized into several groups: [29] presents a categorization with formal definitions for multi-writer regular registers, one of which I use in Chapter 4.

                             Crash Failures    Byzantine Failures / Self-verifying Data
  No Churn
    Eventually Synchronous   [79]              [90]
    Asynchronous             [4]               [67, 21, 22, 9]
  Quiescent Churn
    Eventually Synchronous   [10, 12, 66]
    Asynchronous             [25]
  Non-Quiescent Churn
    Synchronous              [1]               cfr. Chapter 4
    Eventually Synchronous   [1]               cfr. Chapter 5
    Asynchronous             impossible due to the result of [1]

Figure 2.3: Register Implementations in Distributed Systems

2.3 Register Implementations

Figure 2.3 presents a summary of the main works on the implementation of register objects in static and dynamic systems, with respect to the assumptions made on churn, and highlights the contribution of this thesis with respect to the state of the art.

2.3.1 Crash-Fault-Tolerant Register Implementations under no churn

In [79], the authors present a perfect failure detector for both the regular and the atomic specification in an eventually synchronous environment. Moreover, [4] presents an implementation of a wait-free, atomic, single-writer multi-reader register in unreliable, asynchronous networks. These results make it possible to view the shared-memory model as a higher-level language for designing algorithms in fully asynchronous distributed systems. The authors assume that at least a majority of the processors are not faulty and remain connected.

2.3.2 Byzantine-Fault-Tolerant Register Implementations under no churn

Traditional solutions to build byzantine storage in static distributed systems can be divided into two categories: replicated state machines [24] and byzantine quorum systems [67, 21, 22, 9]. Both approaches are based on the idea that the state of the storage is replicated among processes; the main difference is the number of replicas involved simultaneously in the state-maintenance protocol.

The replicated state machine approach requires that every non-faulty replica receives every request and processes the requests in the same order before returning to the client [24] (i.e., it assumes that processes are able to totally order the requests and execute them according to such order). Given the number of failures f, the replicated state machine approach requires only 2f + 1 replicas in order to provide a correct register implementation.

On the contrary, byzantine quorum systems need only subsets of the replicas (i.e., quorums) to be involved simultaneously. The basic idea behind byzantine quorum systems is that each operation issued on the memory storage is executed by a quorum, and any two quorums must intersect (i.e., members of the quorum intersection act as witnesses for the correct execution of both operations).
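As a toy illustration (my own sketch, not from the thesis or from [67]), the intersection property of masking byzantine quorum systems — any two quorums sharing at least 2f + 1 servers, so that correct servers in the intersection outnumber the faulty ones — can be checked mechanically:

```python
# Check the masking-quorum intersection property: for fault bound f,
# every pair of quorums must share at least 2f + 1 servers.

from itertools import combinations

def is_masking_quorum_system(quorums, f):
    """quorums: list of sets of server ids."""
    return all(len(q1 & q2) >= 2 * f + 1
               for q1, q2 in combinations(quorums, 2))

# n = 5, f = 1: quorums of size 4 pairwise intersect in >= 3 servers.
servers = set(range(5))
quorums = [servers - {i} for i in servers]
print(is_masking_quorum_system(quorums, 1))  # → True
```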

In [67], the authors consider protocols managing replicated data with the characteristic that the writer process knows when the write operation is terminated (i.e., write-confirmable protocols), and provide an analysis of quorum systems to ensure data availability and consistency. Given the number of faulty processes f, they define the total number of processes needed and the load imposed by the quorums.

In [21, 22], the authors provide two algorithms that improve the bound found in [67] by using quorums of different sizes for read and write operations, and extend the analysis to protocols where the writer does not know exactly when the operation is terminated, but only knows that it eventually terminates. In [9], byzantine quorum systems have been investigated in a synchronous environment with reliable channels, and it has been proved that, by exploiting the synchrony of the system, it is possible to implement a safe register with a reduced number of servers.

In [38], the authors give a wait-free BFT atomic register requiring 3f + 1 replicas. They present a bounded, wait-free implementation of a distributed atomic register without relying on unproven cryptographic primitives or on communication among servers. Unlike previous solutions, the sizes of messages sent to writers depend only on the actual number of active readers, and not on the total number of readers in the system. This work also considers the possibility of byzantine readers (clients). Both the read and write algorithms have two phases to guarantee the wait-free property. This is one of the first works to use just 3f + 1 servers, in a non-communicating server model, without unbounded storage, messages of unbounded size, or an unbounded number of messages per read operation, to implement a wait-free atomic register. In this paper, the system consists of a set of n replicas (servers), a set of m writers and a set of readers; readers and writers are collectively referred to as clients. Clients have unique identifiers that are totally ordered. When considering the boundedness of message sizes, they assume that a read operation can be uniquely identified with a finite bit string (otherwise, any message sent by a reader could be unbounded in size). The identifier consists of a reader identifier and a read-operation tag. Similarly, write operations are identified by the writer identifier and the timestamp of the value being written; since timestamps are non-skipping [84], writes can also be represented by finite strings. The sizes of messages sent to servers are bounded using three rounds of communication between writers and servers. These rounds occur in parallel with the first two rounds of the write protocol, and no server receives more than two messages in total across the three rounds.

In the first round, the writer estimates the number of concurrent readers; in the second and third rounds, it determines their identities. Writers are benign and can only fail by crashing. The authors first assume that the readers are benign; then they relax this assumption and consider byzantine readers, making the additional assumption that the channels between the servers and the writers are private, i.e., messages sent over these channels cannot be eavesdropped by the adversary. Briefly, this paper uses an efficient number of replicas (3f + 1) with respect to previous works; however, to guarantee the wait-free property, it uses two-phase read and write operations.

In [85], the authors extend the work of [38] to adapt the algorithm to a multi-writer multi-reader regular register. They use one of the three definitions of a multi-writer regular register from [73]; I review this definition and implement a multi-writer regular register based upon it in Chapter 4. Since the core of this work is [38], it also has two-phase read and write operations and is wait-free, but it needs 4f + 1 replicas.

In [86], the authors present wait-free regular storage from byzantine components (shared memories), tolerating f faulty processes out of at least 4f + 1 servers. The idea behind this paper is to use a novel building block called a 1-regular register, and to implement a single-reader single-writer regular register on top of it. Then, m copies of this emulated register form an MRSW register: the writer accesses all m copies in parallel, but each reader is allowed to read only its own copy. Their read and write protocols are performed in a single phase.

In [82], to tolerate byzantine faults and solve two variants of the consensus problem, weak consensus and strong consensus, the authors introduce a shared memory object called PEATS, a combination of an augmented tuple space and policies that protect the object against malicious users and unauthorized accesses.

2.3.3 Crash-Fault-Tolerant Register Implementations under quiescent churn

In [66], a Reconfigurable Atomic Memory for Basic Objects (RAMBO) is presented. RAMBO works in a distributed system where processes can join and fail during the execution of the algorithm. To guarantee the reliability of data in spite of network changes, RAMBO replicates data at several network locations and defines configurations to manage small and transient changes. Each configuration is composed of a set of members, a set of read-quorums and a set of write-quorums. In order to manage large changes to the set of participating processes, RAMBO defines a reconfiguration procedure whose aim is to move from an existing configuration to a new one where the set of members, read-quorums or write-quorums is modified. In order to ensure atomicity, the reconfiguration procedure is implemented by a distributed consensus algorithm that makes all the processes agree on the same sequence of configurations; therefore, RAMBO cannot be implemented in a fully asynchronous system. It is important to note that in RAMBO the notion of churn is abstracted by defining a sequence of configurations. Note that RAMBO poses some constraints on the removal of old configurations; in particular, a configuration C cannot be removed until every operation executed by processes belonging to C has ended. As a consequence, many old configurations may take a long time to be removed.

[46] and [10] present improvements to the original RAMBO protocol, in particular to the reconfiguration mechanism. In [46], the reconfiguration protocol has been changed by parallelizing the installation of new configurations and the removal of an arbitrary number of old ones. In [10], the authors present a mechanism that combines the features of RAMBO and the underlying consensus algorithm to speed up the reconfiguration and reduce the time during which old configurations are accessible.

In [25], Aguilera et al. show that an atomic R/W object can be realized without consensus, and thus on a fully asynchronous distributed system, provided that the number of reconfiguration operations is finite and the churn is therefore quiescent (i.e. there exists a time t after which no join and no failure occur).

Configurations are managed by taking into account all the changes (i.e. joins and failures of processes) suggested by the participants, and the quorums are represented by any majority of processes. However, to ensure liveness of read and write operations, the authors assume that the number of reconfigurations is finite and that there is a majority of correct processes in each reconfiguration.

2.3.4 Crash-Fault-Tolerant Register Implementations under non-quiescent (continuous) churn

In [1], the problem of building a regular register in a distributed system prone to continuous churn is considered. In particular, [1] shows that it is not possible to implement a regular register in a fully asynchronous system if the churn is continuous. Moreover, it provides two algorithms solving the problem in a synchronous and in a partially synchronous system respectively, provided that the churn rate satisfies specific constraints.

The model presented in [1] considers, as in [25], the possibility of adding and removing processes during the computation, but differs from it because no assumption is made about the number of join and leave operations (i.e. reconfigurations) invoked, which can possibly be infinite. The main consequence is that, in this environment, it is not possible to implement a register in a fully asynchronous dynamic system.


2.3.5 Byzantine-Fault-Tolerant Register Implementations under non-quiescent (continuous) churn

To the best of my knowledge, this thesis is the first work studying Byzantine-fault-tolerant register implementations under non-quiescent (continuous) churn.

2.4 Recovery method for BFT Storage

The works introduced in Section 2.3, as well as this thesis, specify an upper bound on the number of faulty servers and assume that the number of faulty processes never exceeds this number. However, when the system must run for a long time, as in many practical scenarios, assuming such an upper bound is not feasible. To deal with this problem, a recovery method should be used to clean faulty servers, so as to prevent their number from going beyond the tolerated threshold (e.g. n/3 or n/4). From this perspective, recovery methods are interesting. In the context of recovery, a widespread notion is proactive recovery, described in [78] as follows:

Replicas are rejuvenated from time to time to remove the effects of malicious attacks/faults. Rejuvenation procedures may change the cryptographic keys and/or load a clean version of the operating system. If the rejuvenation is performed sufficiently often, then an attacker is unable to corrupt enough replicas to break the system. Therefore, using proactive recovery, one can increase the resilience of any intrusion-tolerant replicated system: the system tolerates up to f faults/intrusions per rejuvenation period, while an unbounded number of intrusions may occur during its lifetime, as long as no more than f occur between rejuvenations.

Several works with periodic rejuvenation (proactive recovery) have been proposed [79, 80, 81], presenting intrusion-tolerant replicated systems that are resilient to an infinite number of faults.

An inherent limitation of proactive recovery is that a malicious replica can run arbitrary actions to disrupt the system's normal operation (e.g., flood the network with arbitrary packets), and there is little or nothing that a correct replica detecting this abnormal behavior can do to stop or recover the faulty replica [78]. To cope with this shortcoming, [78] proposes a method that combines proactive recovery with another type of recovery, named reactive recovery.

This type of recovery is triggered by correct replicas when they detect or suspect malicious replicas. This technique may improve the overall performance of a system under attack by reducing the amount of time a malicious replica has to disturb the normal operation of the system, without sacrificing periodic rejuvenation.

Figure 2.4: wormhole and payload architecture

In this hybrid model, every replica has two parts: a wormhole and a payload.

The wormhole is a fully trusted component of a replica, and all the wormholes are connected through a secure and fully synchronous channel. However, at most f wormholes may crash, and the crash of a wormhole results in the crash of the whole replica. On the other hand, the payloads are the subsystems of the replicas with no synchrony guarantees; they are connected through secure channels of arbitrary synchrony. Any payload has two primitives, Detect and Suspect, by which it can inform its wormhole that a particular replica is faulty. The wormhole then informs, through the synchronous channel, the wormhole of the detected/suspected replica; the authors depict this architecture in Figure 2.4, where PRRW stands for Proactive Reactive Recovery Wormhole. The interesting point in this work is that the authors care about the availability of the service over the whole system lifetime. So, if a replica receives at least f + 1 detect messages, it immediately starts the recovery procedure. This immediate and unscheduled recovery is named reactive recovery, a new technique proposed by [78] to complement the scheduled proactive recovery. Interestingly enough, if a wormhole receives at least f + 1 suspect messages, it starts reactively recovering itself only if, at the time, not more than f replicas are simultaneously recovering; in other words, in the case of a suspicion, the wormhole starts recovering only if doing so does not endanger the availability of the whole system. As said earlier, alongside this reactive recovery there is a scheduled (proactive) recovery that, thanks to the synchronous channel among the wormholes, is performed in such a way that not more than f replicas are being recovered at the same time, so as to preserve availability at all times.
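The trigger logic described above can be sketched as follows. This is a minimal illustration, not the implementation of [78]; the class and method names are mine, and the shared counter of recovering replicas is an assumption of the sketch.

```python
# Sketch of the detect/suspect trigger logic of proactive-reactive recovery.
# Names are illustrative, not taken from [78].

class WormholeSketch:
    def __init__(self, f, currently_recovering):
        self.f = f                                        # max simultaneous faults tolerated
        self.detects = set()                              # ids of replicas that sent Detect
        self.suspects = set()                             # ids of replicas that sent Suspect
        self.currently_recovering = currently_recovering  # assumed shared counter

    def on_detect(self, sender_id):
        self.detects.add(sender_id)
        # f + 1 detections cannot all come from faulty replicas: recover at once.
        if len(self.detects) >= self.f + 1:
            return "recover_now"
        return "wait"

    def on_suspect(self, sender_id):
        self.suspects.add(sender_id)
        # Suspicions may be wrong: recover only if availability is preserved,
        # i.e. fewer than f replicas are already recovering.
        if len(self.suspects) >= self.f + 1 and self.currently_recovering < self.f:
            return "recover_now"
        return "wait"

w = WormholeSketch(f=1, currently_recovering=0)
print(w.on_detect("s1"))   # wait: only one accusation so far
print(w.on_detect("s2"))   # recover_now: f + 1 = 2 detections received
```

The asymmetry between the two primitives is the point of the design: a detection is trustworthy enough to act on immediately, while a suspicion is acted on only when it cannot reduce the number of running replicas below the availability threshold.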


Chapter 3

Validity Bound for a BFT Regular Register in a Dynamic Distributed System

Assuming register operations are live (i.e. they eventually terminate), I prove a bound for ensuring validity of read operations performed on the regular register. This bound is independent of the protocol implementing the operations, and follows directly from the system model considered in this chapter. It correlates the churn rate c, the number of faulty processes f and the execution time taken by register operations (i.e., join, read and write operations). The bound shows that, as the execution time of operations grows, the churn tolerated by the BFT storage tends towards zero.

3.1 System Model

The distributed system is composed of a universe of clients Uc (i.e. the client system) and of a disjoint universe of servers Us (i.e. the server system). The client system is composed of a finite, arbitrary number of processes (i.e. Uc = {c1, c2, . . . , cm}), while the server system is dynamic, i.e. processes may join and leave the system at will. A server enters the server system by executing the connect() procedure. Such an operation aims at connecting the new server to both clients and other servers already belonging to the system. A server leaves the system by means of the disconnect() operation. In the following, I will assume that the disconnect operation is passive, i.e., servers do not take any specific action; they just stop executing the algorithms. In order to model the continuous arrival and departure of servers, I assume the infinite arrival model (as defined in [68]). In particular, I assume that the set of servers that can be part of the server system (i.e. the server-system population) is composed of a potentially infinite number of processes Us = {. . . , si, sj, sk, . . . }, each one having a unique identifier (i.e. its index).

However, at each time t, the server system is composed of a finite subset of its population (Us(t)), including all the servers that are connected, i.e. all the servers that (i) have executed the connect() operation and (ii) have not invoked the disconnect() operation.

Clients and servers can communicate only by exchanging messages through reliable and authenticated channels. In the following, I assume the existence of a protocol managing the arrival and the departure of servers from the distributed system; such a protocol is also responsible for maintaining connectivity among the processes belonging to the distributed system. Examples of such topologies are [57], [61], [76]. As in [67, 21, 22], I assume that clients are correct and servers can suffer arbitrary failures.

Distributed Computation. On top of the distributed system, a distributed computation is running (e.g. a register computation). Let me denote as Cs(t) the subset of the server system that is participating in the distributed computation at time t (i.e. Cs(t) ⊆ Us(t)).

At time t0, when the server computation is set up, n servers belong to the server computation (i.e. |Cs(t0)| = n). A server si belonging to the server system that wants to join the distributed computation has to execute the join() operation. Without loss of generality, let me assume that a server si can issue a join operation at some time t if and only if it is already connected to the distributed system (i.e. iff at time t, si ∈ Us(t)). A join() operation, invoked at some time t, is not instantaneous and takes time to be executed; how long, depends on the specific implementation provided for the join() operation. However, from time t, the server si can receive and process messages sent by any other server sj that already participates in the computation.

When a server sj participating in the distributed computation wishes to leave the computation, it stops executing the server protocol (i.e. the leave operation is passive). Without loss of generality, I assume that if a server leaves the computation and later wishes to re-join, it executes the join() operation again with a new identity. Figure 3.1 shows the distributed system and the distributed computation layers.
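The membership layers described above can be sketched in code. This is a minimal illustration with names of my own choosing, tracking the three sets Us(t), Cs(t) and A(t) as connect/join/leave events occur; A(t) is formally defined just below.

```python
# Sketch of the membership layers: Us(t) (connected servers), Cs(t) (servers
# participating in the computation) and A(t) (active servers, i.e. servers
# whose join() has returned). Names are illustrative.

class Membership:
    def __init__(self):
        self.connected = set()      # Us(t)
        self.computing = set()      # Cs(t)
        self.active = set()         # A(t), a subset of Cs(t)

    def connect(self, s):
        self.connected.add(s)

    def join_start(self, s):
        assert s in self.connected      # a server joins only if already connected
        self.computing.add(s)           # from now on it can process messages

    def join_return(self, s):
        assert s in self.computing
        self.active.add(s)              # active only once join() returns

    def leave(self, s):
        self.computing.discard(s)       # leave is passive: just stop participating
        self.active.discard(s)          # the server may remain connected

m = Membership()
m.connect("s1")
m.join_start("s1")
print("s1" in m.active)    # False: the join has not returned yet
m.join_return("s1")
print("s1" in m.active)    # True
```

Note how the sketch captures the two cases remarked on below: a connected server that never calls join_start stays outside the computation, and a server that leaves stays in the connected set.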

It is important to notice that (i) there may exist processes belonging to the server system that never join the distributed computation (i.e. they execute the connect() procedure, but they never invoke the join() operation) and (ii) there may exist processes which, after leaving the distributed computation, still remain inside the server system (i.e. they are correct, but they stop processing messages related to the computation). To this aim, it is important to identify the subset of processes that are actively participating in the distributed computation.

Figure 3.1: Distributed System vs. Distributed Computation

Definition (Active Server Set). A server is active in the distributed computation from the time it returns from the join() operation until the time it leaves. A(t) denotes the set of servers that are active at time t, while A([t, t′]) denotes the set of servers that are active during the whole interval [t, t′] (i.e. si ∈ A([t, t′]) iff si ∈ A(τ) for each τ ∈ [t, t′]).

Servers that obey their specification are said to be correct. On the contrary, a faulty server can deviate arbitrarily from its specification. I assume that at most f servers can be Byzantine faulty at each time instant of the computation.

It is important to note that servers know the value f, but they are not able to know, at any time t, the subset of Cs(t) representing the faulty processes.

Non-Quiescent Churn Model. The computation alternates periods of churn and periods of stability. More specifically, there exist some periods Tchurn in which servers join and leave the computation, and there exist periods Tstability in which the computation becomes stable and no join or leave operations occur. However, no assumption is made about how long Tchurn and Tstability are.

During the churn periods, I assume that the churn “refreshes” a fraction of the computation participants at each time unit. More precisely, I define as churn rate, denoted c, the percentage of processes that are “refreshed” at every time unit (c ∈ [0, 1]). This means that, since the number of servers belonging to the server computation remains constant (i.e. for each time t, |Cs(t)| = n), in every time unit of Tchurn, c × n processes leave the computation and the same number of processes invoke the join() operation. It is shown in [56] that this assumption is fairly realistic for several classes of applications built on top of dynamic systems.

Note that, while the number of servers participating in the computation remains constant, the number of active servers, i.e. the number of processes that effectively participate in the distributed computation, changes over time (i.e. for every time t, |Cs(t)| = n, while |A(t)| ≤ n).
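The worst-case evolution of the active set under this churn model can be made concrete with a short sketch: at every time unit c × n servers leave and, in the worst case, every departure removes a server that was active at the start of the interval. This is the counting pattern used in the proofs of this chapter.

```python
# Worst-case evolution of the active set under churn rate c: at every time
# unit, c*n servers leave (all taken, in the worst case, from the servers
# active at the start of the interval) and c*n new servers start joining.

def worst_case_active(n, c, time_units):
    """Lower bound on |A([t0, t0 + time_units])|: servers of the initial
    computation that are still active after `time_units` units of churn."""
    return max(0, n - int(c * n) * time_units)

# Example: n = 100 servers, churn rate c = 0.05 (5 servers refreshed per unit).
print(worst_case_active(100, 0.05, 4))   # 80 = 100 - 5*4
```

While |Cs(t)| stays at n throughout, this lower bound on |A([t0, t0 + Δt])| decreases linearly with the length of the interval, which is exactly why long-running operations constrain the tolerable churn.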

Let me finally remark that in this churn model there is no guarantee that a server remains permanently in the computation; additionally, the model is general enough to encompass both (i) a distributed computation prone to non-quiescent churn, i.e., churn holds forever, and (ii) a distributed system prone to quiescent churn, i.e., there exists a time t after which stability holds forever.

3.2 Regular Register Specification

In this chapter, I consider a register computation (i.e. a distributed computation where servers collaborate to implement a register). A register is a shared variable accessed by a set of processes, i.e. clients, through two operations, namely read() and write(). Informally, the write() operation updates the value stored in the shared variable, while the read() obtains the current value contained in the variable. Generally, every operation issued on a register is not instantaneous, and can be characterized by two events occurring at its boundary: an invocation event and a reply event. These events occur at two time instants (invocation time and reply time) according to a fictional global time.

Given two operations op and op′, their invocation times (tB(op) and tB(op′)) and return times (tE(op) and tE(op′)), I say that op precedes op′ (op ≺ op′) iff tE(op) < tB(op′). If op does not precede op′ and op′ does not precede op, then op and op′ are concurrent (op || op′). Given a write(v) operation, the value v is said to be written when both the invocation and the reply events have occurred. A regular register is specified by the following two properties:

• Termination: If a correct process participating in the computation in- vokes an operation and does not leave the system, it eventually returns from that operation.

• Validity: A read() operation returns the last value written before its in- vocation, or a value written by a write operation concurrent with it.
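The precedence relation defined above can be stated operationally. In this minimal sketch, an operation is represented only by its pair of invocation and reply times:

```python
# Classify two operations, each given as (t_B, t_E) with t_B < t_E,
# according to the precedence relation: op precedes op' iff t_E(op) < t_B(op');
# if neither precedes the other, they are concurrent.

def relation(op, op2):
    tB, tE = op
    tB2, tE2 = op2
    if tE < tB2:
        return "op precedes op'"
    if tE2 < tB:
        return "op' precedes op"
    return "concurrent"

print(relation((0, 3), (5, 8)))   # op precedes op'
print(relation((0, 6), (5, 8)))   # concurrent: the two intervals overlap
```

Under this relation, the validity property admits any value written by an operation concurrent with the read, since no order between the two is defined.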

Protocol Model. A protocol Preg implementing a regular register is a collection of distributed algorithms, one for each operation (i.e. Preg = {AJS, AR, AW}, where AJS, AR and AW are respectively the distributed algorithms implementing the join(), the read() and the write() operations). Each algorithm is composed of a sequence of computation steps and communication steps. A computation step is represented by the computation executed locally at each process, while a communication step is represented by the sending and the delivering events of a message.

A register is maintained by a set of servers (possibly a subset of the ones belonging to the computation). No agreement abstraction (e.g. consensus abstraction, total-order primitives) is assumed to be available at a server. Clients do not maintain any register information; they can just trigger operations and interact with servers through message exchanges. Moreover, I assume that each server has the same role in the distributed computation (i.e. there is no special server acting as a coordinator). Let me remark that quorum-based protocols (e.g., [67]) are captured by this model, because Preg does not require that all the processes execute each operation; on the other hand, Preg is not a State Machine Replication protocol (e.g., [24]), due to the lack of an agreement abstraction. For simplicity, in the following I assume a single register replicated at each server belonging to the computation.

3.3 Validity Bound

Let me consider a generic protocol Preg implementing a regular register such that (i) every operation terminates and (ii) there exists a period of churn longer than the longest operation issued on the register¹. In this section, I show a bound that relates the number of faulty servers to the churn, by defining the minimum churn rate that makes it impossible for Preg to satisfy the validity property. Let me now introduce two properties that Preg has to satisfy to ensure validity:

• Byzantine Resiliency: there are, at any time, at least f + 1 servers that maintain the same value;

• Write Persistency: servers maintain the last value written by a write operation despite server departures (i.e., any value v written by a write() operation must persist in the system until a new operation overwrites v).

Concerning the first property, in [22] it has been proved that having 3f + 1 replicas of the register value is a lower bound to implement a read operation that satisfies validity in an authenticated, asynchronous and Byzantine-prone system with no churn. This bound also applies in a distributed system prone to churn. Thus, given a read() operation starting at some time t and lasting ∆t time units, I need at least 3f + 1 servers to remain active during the whole period [t, t + ∆t] (i.e. |A([t, t + ∆t])| > 3f).

¹ This actually corresponds to the worst-case scenario for computing the validity bound.

In a distributed system with no churn, write-persistency is always satisfied because (i) there are no server departures and (ii) there always exist “enough” servers that permanently run the register protocol (e.g., [4] assumes a majority of correct servers, while [67] assumes that, given f failures, the number of servers n in the system is always greater than 3f).

On the contrary, when the set of servers participating in the computation is affected by churn, write-persistency is an issue, and it has to be properly addressed by the protocols implementing the register. As an example, consider the following scenario: at time t0, all the servers belonging to the server computation store a valid value in their local copies of the register (i.e. the default value of the register, written by a fictional write() operation). As soon as the churn starts, servers having the value leave, and new servers with no value arrive. Due to the absence of stable processes, over time all the servers holding the value can leave and, in the worst case, the content of the memory can be lost if it is not transferred to the new servers, leading the system to violate write-persistency. To this aim, I need to specify a property for the join() operation, suited to our system model, that represents a necessary condition for write-persistency.

Join-Validity: a join() operation terminates by obtaining the last value written before its invocation, or a value written by a write operation concurrent with it.

In the following, I compute the bounds on the churn rate that make it impossible to satisfy respectively the join-validity (Lemma 1) and the write-persistency (Lemma 2). Then, I show the bound following from the need to have at least 3f + 1 active servers during a read operation (Lemma 3), and finally I deduce the global bound on the churn rate by composing the previous ones (Theorem 1).

Lemma 1. Let AJS be an algorithm that implements the join() operation, and let ∆tj be the maximum time interval needed by AJS to terminate a join() operation. If c ≥ (n − 3f)/(n × ∆tj), then AJS cannot satisfy the join-validity.

Proof. Without loss of generality, let me consider the first time unit of the first churn period, i.e. t0. At time t0, all the servers are active and maintain a valid value; it follows that |A(t0)| = n. By the definition of c, at time t0 + 1, nc servers invoke the join operation and nc servers leave; thus, |A[t0, t0 + 1]| = n − nc.

Also during the second time unit, nc new processes enter the computation and replace the nc processes that left the system during that time unit. In the worst case, the nc processes that left the system are processes that were present at time t0 (i.e., they are not processes that entered the system between t0 and t0 + 1). So, I have |A[t0, t0 + 2]| ≥ n − 2nc. If I consider a period of ∆tj time units, i.e. the longest period needed to terminate a join operation, I obtain |A[t0, t0 + ∆tj]| ≥ n − ∆tj nc = n(1 − ∆tj c). Imposing |A[t0, t0 + ∆tj]| ≥ 3f, the inequality is satisfied only if c < (n − 3f)/(n × ∆tj), and therefore the claim follows. □ (Lemma 1)
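The counting argument of this proof can be checked numerically. The sketch below only evaluates the closed-form bound and the survivor count it is derived from; the parameter values are examples of mine, not values from the thesis.

```python
# Numerical check of the join-validity bound: keeping more than 3f servers
# active throughout a join of duration dt_j requires c < (n - 3f)/(n * dt_j).

def churn_bound(n, f, dt_j):
    """Churn rates at or above this value make join-validity impossible."""
    return (n - 3 * f) / (n * dt_j)

# Example: n = 100, f = 10, joins lasting dt_j = 4 time units.
print(churn_bound(100, 10, 4))          # 0.175

# With c just below the bound, more than 3f servers survive the whole join:
c = 0.17
print(100 * (1 - 4 * c) > 3 * 10)       # True: 32 survivors > 30
```

Raising either f or ∆tj shrinks the tolerable churn, matching the intuition that slower joins leave more time for the value holders to depart.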

Having an algorithm that implements the join() operation by obtaining a valid value for the register is a necessary condition to ensure write-persistency, but it is not sufficient. Consider the following scenario, in which there is a write operation concurrent with a join. The writer client issues a write(v) operation at time tB and terminates this operation at time tE. Each server joining concurrently with the write can return either the value v or the value vold previously written in the register. If the join-validity holds, in the worst case each of these servers terminates the join with the valid value vold. However, if the churn rate is too high, it is possible that all the servers maintaining v, i.e. the new value, leave before new servers become able to obtain such a new value, and the write-persistency of the register can be compromised. Thus, join-validity is satisfied while write-persistency is not.

Lemma 2. Let AJS be the algorithm implementing the join() operation satisfying join-validity, and let AW be the algorithm implementing the write() operation. Let ∆tj and ∆tw be respectively the maximum time interval needed by AJS to terminate the join operation and the maximum time needed by AW to terminate the write operation. If c ≥ (n − 3f)/(n × (∆tw + ∆tj)), then it is not possible to ensure the write-persistency.

Proof. I split the proof into two cases: (i) there is no write concurrent with a join operation and (ii) there is a write concurrent with a join operation.

Case (i): I am exactly in the situation of Lemma 1 and the bound holds.

Case (ii): let me suppose by contradiction that write-persistency is satisfied and c ≥ (n − 3f)/(n × (∆tw + ∆tj)). If write-persistency is satisfied, it means that a value v written by a write(v) operation is transferred from the writer client to active servers, and then from active servers to the processes that join the computation. Without loss of generality, let me assume that a write(v) operation is issued by a client cw at time t. Let me consider the case where t = t1 = t0 + 1 and t0 is the instant of time at which the churn starts. At time t0, all the servers are active and all the correct servers store in their local copy of the register the valid value of the register. At time t1, when the write(v) operation starts, servers start to join and leave the computation. Note that all the join() operations started in the period [t1, t1 + ∆tw] are concurrent with the write(v) operation. Therefore, due to the join-validity, in the worst case none of them terminates returning the value v; all of them return the last value written before the write(v). On the contrary, all the join() operations starting after t1 + ∆tw must return the value v. As a consequence, there must exist at least 3f + 1 servers that are active during such a join() operation (otherwise validity cannot be achieved [22]) and that have been updated by the write operation (i.e. they have been in the system from time t1 to t1 + ∆tw + ∆tj). Thus, I have that n − nc∆tw − nc∆tj > 3f; considering c ≥ (n − 3f)/(n × (∆tw + ∆tj)), I have a contradiction.

Considering both the bound found in Lemma 1 and the bound found in Case (ii), it is not possible to ensure the write-persistency if c ≥ min{(n − 3f)/(n × ∆tj), (n − 3f)/(n × (∆tj + ∆tw))} = (n − 3f)/(n × (∆tw + ∆tj)), and the claim follows. □ (Lemma 2)

Lemma 3. Let AR be an algorithm implementing the read() operation. Let ∆tr be the maximum time interval needed by AR to terminate a read() operation. If c ≥ (n − 3f)/(n × ∆tr), then it is not possible to ensure the read-validity.

Proof. Considering the bound proved in [22] and the fact that a join operation actually corresponds to a read operation used to obtain a valid value, I can repeat the reasoning of the proof of Lemma 1, considering the specific time taken by read operations (i.e., ∆tr), and the claim follows. □ (Lemma 3)

From Lemma 2 and Lemma 3, I am able to compute the following bound:

Theorem 1. Let AJS, AR and AW be the algorithms implementing respectively the join(), read() and write() operations. Let ∆tj, ∆tr and ∆tw be respectively the maximum time intervals needed by AJS to terminate the join operation, by AR to terminate the read operation and by AW to terminate the write operation. If c ≥ min{(n − 3f)/(n × ∆tr), (n − 3f)/(n × (∆tj + ∆tw))}, then it is not possible to ensure both the write-persistency and the read-validity.

Interestingly enough, such a bound is affected by the operation-execution times, which depend on the algorithm implementing the operations and on the synchrony assumptions of the underlying distributed system. Thus, in a distributed system where the upper bounds on the values ∆tj, ∆tr and ∆tw are known, the value of c can be calculated exactly (as I will do in the next section). It is also interesting to note how the formula confirms the impossibility result found in [7]: no algorithm can implement a regular register in the presence of non-quiescent churn on top of a fully asynchronous distributed system; as the system becomes asynchronous, ∆tj, ∆tr and ∆tw cannot be bounded, and then the tolerated churn goes towards zero.
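The bound of Theorem 1 can be wrapped in a small helper that only evaluates the closed-form expression; the numbers below are example parameters of mine, chosen to show both the exact computation and the asymptotic behavior just discussed.

```python
# Maximum churn rate below which write-persistency and read-validity can
# still be ensured, according to the bound of Theorem 1.

def validity_churn_bound(n, f, dt_j, dt_r, dt_w):
    read_bound = (n - 3 * f) / (n * dt_r)                # read-validity (Lemma 3)
    persistency_bound = (n - 3 * f) / (n * (dt_j + dt_w))  # write-persistency (Lemma 2)
    return min(read_bound, persistency_bound)

# Example: n = 100, f = 10. The bound shrinks as operations get slower,
# tending to 0 as durations grow unboundedly (the asynchronous case).
print(validity_churn_bound(100, 10, dt_j=4, dt_r=2, dt_w=3))        # 0.1
print(validity_churn_bound(100, 10, dt_j=400, dt_r=200, dt_w=300))  # 0.001
```

With the first set of parameters, the write-persistency term is the binding one: min(0.35, 0.1) = 0.1. Multiplying all durations by 100 divides the tolerable churn by 100, illustrating why unbounded operation times force the churn towards zero.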


Chapter 4

Implementing a BFT Regular Register in a Synchronous Dynamic Distributed System

In this chapter, I address the implementation of a BFT regular register, based on the replication of the register value, in a fully synchronous environment on top of a distributed system prone to non-quiescent churn. In such a challenging environment, I provide the following results:

• In the context of a synchronous system, I first prove the number of communication steps necessary and sufficient to implement each operation (i.e., join, read and write) of a generic regular register protocol; then I compute the actual validity bound for such a synchronous system. This bound matches the one found in [21] in the case of no churn, and the one found in [7] in the case of no faulty processes.

• Given a specific regular register protocol using the minimal number of communication steps, I carry out an experimental evaluation of the protocol that analyzes the number of valid read operations while varying both the churn rate and the number of faulty processes. The study shows that if the workload of the BFT storage is dominated by very frequent write operations, the regular register is able to resist a churn rate much higher than the one computed through the theoretical bound. This is due to the fact that the frequent write operations “refresh” the content of the storage, thus counteracting the loss of persistency due to churn. The evaluation also compares the theoretical and the empirical bounds for ensuring validity of read operations while varying the number of faulty processes: the smaller the number of faulty processes, the larger the distance between the theoretical and the empirical bound.

4.1 Computing the Validity Bound for a Synchronous System

In this section, I show how the validity bound defined in Section 3.3 can be instantiated as soon as the system is synchronous. Let me consider a specific synchronous distributed system where (i) there is a known upper bound on processing delays (i.e. the duration of every computational step can be bounded) and (ii) processes are equipped with broadcast and point-to-point communication primitives that have the following specifications:

• Timely Broadcast Delivery (TBDel): there exists a known and finite bound δ such that every message broadcast at some time t is delivered by time t + δ.

• Timely Channel Delivery (TCDel): there exists a known and finite bound δ′ < δ such that every message sent at some time t is delivered by time t + δ′.

Considering these communication primitives, in this section I first compute, for each register operation, the lower bound on the number of communication steps (a communication step is bounded by a period of δ time units) required by a protocol Preg to execute the operation. Secondly, I derive the time needed by each operation in the worst-case execution, and finally I instantiate the bound on the churn rate needed to satisfy the validity property.
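This procedure can be illustrated directly: if an operation takes k communication steps, each bounded by δ, its worst-case duration is k × δ, and that duration plugs into the bound of Theorem 1. The single-step write matches Lemma 4 below; the join and read step counts here are illustrative assumptions of mine, not the values proved in this chapter.

```python
# Illustrative instantiation of the validity bound in a synchronous system:
# each operation taking k communication steps lasts at most k * delta.
# The join/read step counts below are example assumptions, not results
# proved in this chapter; the 1-step write follows Lemma 4.

def bound_from_steps(n, f, delta, steps_join, steps_read, steps_write):
    dt_j = steps_join * delta
    dt_r = steps_read * delta
    dt_w = steps_write * delta
    return min((n - 3 * f) / (n * dt_r),
               (n - 3 * f) / (n * (dt_j + dt_w)))

# Example: n = 100, f = 10, delta = 1 time unit, with an assumed 2-step join,
# 2-step read and 1-step write. The result is min(0.35, 70/300), about 0.233.
print(bound_from_steps(100, 10, 1.0, 2, 2, 1))
```

Because every duration is a multiple of δ, halving δ doubles the tolerable churn, making the message-delay bound the main lever on how much dynamism the storage survives.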

Let me remark that in a synchronous system, the termination property can easily be obtained by exploiting the synchrony assumptions. Considering the longest chain of messages related by the happened-before relation [16] occurring during an operation execution, it is possible to set timeouts that trigger the termination of the operation. In the following, I assume that the time needed to execute a computational step is negligible with respect to the time needed to perform a communication step.

Lemma 4. Let R be a regular register and let cw be the writer client of R. If cw can access authenticated communication primitives satisfying TBDel and TCDel, then one communication step is necessary and sufficient to implement an algorithm AW (belonging to Preg) for a write() operation.
