
“Sapienza” Università di Roma

Faculty of Engineering

Degree Thesis in Computer Engineering

Winter session - May 2008

Internal Clock Synchronization in Peer to Peer Environment in Presence of Arbitrary Faults

Marco Platania

Advisor: Prof. Roberto Baldoni

Reviewer: Prof. Roberto Beraldi

Co-advisor: Ing. Sirio Scipioni

To my family...

Acknowledgments

With just a few days to go before the thesis defense, it is time to fill in all those sections of the Documentation that for several months remained empty, or filled only with occasional phrases, such as Grazie Liverpool! in the Acknowledgments or Flavio Briatore as Co-examiner. I thought a lot about how to handle this part, considering that it will probably be the only one read by everybody, since it is also the only one written in Italian, and, of course, about whom to thank. First of all I must thank my family, to whom the entire thesis is dedicated: if I am here today doing this work, the credit goes above all to those who allowed me to come this far, giving me the opportunity to study and supporting me morally during the most difficult moments. I hope I have repaid, and will keep repaying in the future, my parents' sacrifices, even though during my studies they never quite managed to understand what I was actually studying, since they are not experts in this field and my explanations were rather vague. Of course, my thanks go to Prof. Roberto Baldoni, who gave me the opportunity to carry out this work; to Ing. Sirio Scipioni, who followed me closely with patience and availability; and to Ing. Leonardo Querzoni and Ing. Sara Tucci-Piergiovanni, who put all their expertise at my disposal in the periodic group meetings, helping me understand in depth all the issues of this project. Thanks also to Silvia, who provided the pictures you should have found in the first chapters: too bad Sirio made me remove them! Important thanks also go to everybody I met and with whom I had the pleasure of sharing every little break during this period.

First of all, the coffee machines, since their call has the same effect on me that the Sirens' call had on Ulysses; right after them, my roommates Luca and Carlo, who together with Ugo gave life to the feud between the Midlab and their WSN Group, complete with threatening notes left on my desk. Then I cannot fail to thank the thesis students with whom I shared the experience of working at DIS: Lorenzo, in particular for introducing me to Nonciclopedia and for his brilliant culinary ideas during lunch break. To him goes the honor of having combined foods whose interaction is still unknown to many, and above all of having discovered the ultimate sandwich, better known as The final panino, whose ingredients are better left unspecified. I thank Gianluca for his engaging ideas on how to make money in life: too bad 90% of them are illegal! Thanks also to Fabio, whom I wish one day to be able to prove that NP = P; to Marta and Emma, because during the hardest periods of this work I thought of the difficulties they had to face shortly before me, making me realize that things can always get worse! Of course I also thank all those who started this long journey with me, back in the Bachelor's degree, and whom I wish to reach this finish line soon: Andrea and Gianluca, who graduate with me; Roberto, Salvatore, Massimo, with whom I carried out several projects over these years; il Piccio and his unfailing (unfortunately...) hilarious jokes; Francesco, JPanel, Ralph Malph and all those with whom we have dinner every now and then. It will be hard to forget when, during the recent Distributed Systems exam sessions, some of them were at the desks working and I was supervising them from the lectern, as well as the moment of the photo before entering the exam room! Special thanks also go to all my friends outside the University (even if some of them were also fellow students), with whom for several years I have been sharing weekends, five-a-side football matches and summer holidays: Daniele B., the first among my former university mates to find a job, making me understand what working really means: tennis matches during lunch break, sunbathing at the EUR pond and Msn all day long; Davide and Gianpiero, whom we might find at La Gazzetta dello Sport in a few years; Daniele P. and Gigi, who since time immemorial have been playing out the eternal Chiesuola derby; Riccardo, who is so busy every day that I could not even say what he does; Zio Sossaldo, who sooner or (much...) later will also reach the graduation finish line; Alessia and Manuela, or perhaps I should say Wife and Husband, future Doctors who are always available whenever I feel like bothering somebody! Thanks must also go to Claudio Perin and Claudio Salvador, because without them the Cup matches are no longer the same! A final thank-you goes to all those who in the future will let me write in LaTeX only in English, because after writing this section in Italian I swear that accented words are no small annoyance. I think I have gone on long enough, but I wanted to thank as many people as I could; if I have forgotten anybody, who knows, there may be more important occasions to thank them in the future.


Contents

Introduction

1 System model
  1.1 Features
  1.2 Fundamental models
    1.2.1 Interaction model
      Synchronous systems
      Asynchronous systems
      Partially synchronous systems
    1.2.2 Failure model
      Omission failures
      Arbitrary failures
      Timing failures
      Masking failures
  1.3 System architectures
    1.3.1 Client-server model
    1.3.2 Proxy servers and caches
    1.3.3 Peer to peer (P2P) model

2 Time in distributed systems
  2.1 Computation model
    2.1.1 Clock Synchronization
  2.2 Physical clock
    2.2.1 Clock synchronization over LAN
      Ordering of Events
      Integrating Internal and External Clock Synchronization
      Probabilistic Clock Synchronization
    2.2.2 Clock synchronization over WAN
      Network Time Protocol (NTP)
      CesiumSpray
    2.2.3 Physical time in P2P systems
      Firefly-inspired Heartbeat Synchronization
      Gossip-Based External Clock Synchronization
  2.3 Logical clock

3 Gossip-based internal clock synchronization
  3.1 Clock Coupling Model
    3.1.1 Time Continuous Clock Coupling
    3.1.2 Time Discrete Coupling with Imperfect Estimates
    3.1.3 The Clock Coupling Algorithm
  3.2 Theoretical evaluation
  3.3 Empirical Evaluation
    3.3.1 Simulation settings
    3.3.2 Evaluation Metrics
    3.3.3 Simulation Scenarios
    3.3.4 Simulation Parameters
    3.3.5 Static Scenario with Symmetric Channels Results
    3.3.6 Static Scenario with Asymmetric Channels Results
    3.3.7 Dynamic Scenario with Symmetric Channels Results

4 P2P internal clock synchronization with malicious nodes
  4.1 System Model
  4.2 Internal Clock Synchronization in Byzantine Environment
    4.2.1 The Filter-Based Algorithm
  4.3 Evaluation
    4.3.1 Evaluation Metrics
    4.3.2 Simulation settings
    4.3.3 Simulation Scenarios
    4.3.4 Simulation Parameters
  4.4 Malicious-prone Scenarios
    4.4.1 Gossip-Based Internal Clock Synchronization
      Impact of malicious node on the system
    4.4.2 Filter-based algorithm
  4.5 Malicious-free Scenarios

5 Concluding remarks

List of Figures

1.1 Real time ordering of events
1.2 Failure classes
1.3 Communication channel
1.4 Client-server interaction
1.5 Services provided by multiple servers
1.6 Proxy server
1.7 Peer to peer approach
2.1 Space-time diagram
2.2 Correct hardware clock
2.3 Send-receive pattern in symmetric communication
2.4 Simulation results of the protocol
2.5 Timing information exchange
2.6 Principal operation of GTP
2.7 Gossip frequency variation
2.8 Happened before relation
3.1 NTP offset estimation
3.2 Convergence dependency on N and K
3.3 Synchronization error varying channel asymmetry
3.4 Synchronization error for slow and fast channels with a Gaussian asymmetry distribution
3.5 System behavior under dynamics
4.1 Unreached stable synchronization point
4.2 Destruction of a reached synchronization point
4.3 Core Standard Deviation in malicious environment
4.4 Core Mean in malicious environment
4.5 Core Standard Deviation with different error
4.6 Core Mean with different error
4.7 Core Standard Deviation with byzantines leaving
4.8 Core Mean with byzantines leaving
4.9 Error Persistence
4.10 Standard deviation of byzantine-induced and correct nodes
4.11 Infection Index
4.12 Infection Index
4.13 Convergence dependency on View Size and αm
4.14 SE dependency on View Size and αm


Introduction

A distributed system is a set of computers connected by a network that communicate and coordinate their actions by exchanging messages. This definition leads us to consider the following important characteristics of a distributed system: concurrency of components, lack of a global clock and independent failures of components. Examples of distributed systems are:

• the Internet;

• an intranet, that is, a portion of the Internet managed by an organization;

• mobile and ubiquitous computing.

The sharing of resources is a main motivation for constructing distributed systems. Resources can be managed by servers and accessed by clients, or they may be encapsulated as objects and accessed by other client objects. The Web is an example of resource sharing. The challenges arising from the construction of distributed systems are the heterogeneity of their components, openness (which allows components to be added or replaced), security, scalability (i.e. the ability to work well when the number of users increases), failure handling, concurrency of components and transparency.

Despite the lack of a global clock, it is frequently necessary for processors to obtain some common notion of time. The technique used to coordinate the notion of time is known as clock synchronization. However, designing clock synchronization algorithms presents a number of difficulties. First, due to variations in transmission delays, a process cannot have an instantaneous global view of every remote clock value. Second, even if all clocks could be started at the same real time, they would not remain synchronized, because of their drift rates. In addition, drift rates can change due to temperature variations or aging. The difference between two hardware clocks can thus change as time passes. Finally, the most important difficulty is to support faulty elements, which are a common fact in distributed systems.

In this work, we cope with the clock synchronization problem in a Peer-to-Peer system in presence of byzantine failures. The term Peer-to-Peer refers to a class of systems and applications that employ distributed resources to perform a function in a decentralized manner. Some of the benefits of the Peer-to-Peer approach include: improving scalability by avoiding dependency on centralized points; eliminating the need for costly infrastructure by enabling direct communication among clients; and enabling resource aggregation.

We present a filter-based algorithm for internal clock synchronization in presence of byzantine faults. The basis of our work is the algorithm presented in [31], which combines the gossip-based paradigm with a nature-inspired approach coming from the coupled oscillators phenomenon, and which is able to synchronize clocks in large-scale dynamic systems. The difficulty in devising such an algorithm is to match the system requirements; in fact, in large-scale dynamic systems applications are required:

• to operate without any assumption on the underlying infrastructure

• to tolerate network dynamism

• to scale to tens of thousands of nodes

We coped with these issues by making no assumption on network delays, by adopting an adaptive coupling factor to adjust local clocks and maintain the stability of the system, a property strongly needed in the face of network dynamism, and by using a fully decentralized approach in which peers implement all the required functionalities by means of a gossip-based algorithm. A node is provided with a neighborhood with which it can directly interact, computing local results. This solution is highly scalable, making it very suitable for large-scale systems.
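To make the gossip-based coupling concrete, here is a minimal Python sketch of the idea; the names (gossip_round, coupling_factor, fanout) and the plain averaging rule are our illustrative assumptions, not the exact algorithm of [31].

import random

def gossip_round(clocks, neighbors, coupling_factor=0.5, fanout=1):
    """One gossip round: every node reads the clock of `fanout` random
    neighbors and moves its own clock toward their value.
    `clocks` maps node -> clock value; `neighbors` maps node -> peer list."""
    corrections = {}
    for node, value in clocks.items():
        peers = random.sample(neighbors[node], min(fanout, len(neighbors[node])))
        # Offset estimate: average difference between peer clocks and our own.
        offset = sum(clocks[p] - value for p in peers) / len(peers)
        # The coupling factor damps the correction, which helps stability
        # when the membership changes between rounds (churn).
        corrections[node] = coupling_factor * offset
    for node, delta in corrections.items():
        clocks[node] += delta

# Example: 4 fully connected nodes converge toward a common clock value.
clocks = {"a": 0.0, "b": 5.0, "c": 10.0, "d": 20.0}
neighbors = {n: [m for m in clocks if m != n] for n in clocks}
for _ in range(50):
    gossip_round(clocks, neighbors)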

Another difficulty is the lack of similar work in the literature: in fact, many authors (Lamport in [13], Dolev in [8]) consider the network as a clique; others (Mills in [25, 26], Van Steen in [17]) devise algorithms for external clock synchronization, by assuming at least a robust time server. In this work, instead, we consider the problem of internal clock synchronization in a random graph-based network without any central time server: each node has only a partial view of the system, which changes with time. In addition, we make no assumption on message delays (i.e. we consider an asynchronous system), differently from [13, 8, 25, 26, 18], where the authors assume a known distribution or bound on message delays.

A further difficulty arises from the fact that we want to evaluate our work in a byzantine environment. We assume neither a central certification authority nor digital signatures for message authentication, despite the presence of byzantine nodes. In fact, we can provide probabilistic byzantine fault tolerance by locally discarding supposedly faulty values. Some deterministic solutions, such as those proposed in [7, 8, 14, 19], prove that when up to F reference time servers can suffer arbitrary failures, at least 2F + 1 reference time servers are necessary for achieving clock synchronization. These solutions can tolerate byzantine faults, but they require that a node reads the clock of each external time reference before achieving synchronization. Differently from these efforts, several works of Dolev et al. [7, 8, 9, 10] propose and analyze internal clock synchronization protocols resilient to at most F byzantine faults, in a system composed of at least 3F + 1 nodes. These are applicable to WANs, but they require a clique-based interconnection topology (i.e. each node has to read the remote clock of every other node in the system), which is hardly scalable to a large number of nodes. Some works on byzantine-resilient protocols use the gossip approach but differ from our effort by relying on cryptographic primitives [35, 36], and they are not strictly related to information diffusion: in [35] the authors design a byzantine fault-tolerant membership management protocol, and in [36] a live streaming protocol able to tolerate byzantine and rational processes, building on the byzantine fault-tolerance and game theory literature. We note that our approach applies gossiping only as the communication model: each process periodically exchanges information with random peers.

In this thesis, besides the description of the algorithm, we define the byzantine environment in which it runs through the definition of a malicious node and two different classes of properties: one Strong Byzantine Resilience property and two Weak Byzantine Resilience properties. Moreover, we introduce the metrics used to evaluate the proposed solution through different simulation settings: convergence time, synchronization error, error persistence and infection index. In addition to the experimental evaluation, we also provide a theoretical evaluation of the algorithm, showing that its behaviour can be described by well-known statistical properties: for example, we use the Central Limit Theorem to compute the variance of the system, from which we obtain the synchronization error (defined as the minimum standard deviation the algorithm can achieve).
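As a generic illustration of the kind of Central Limit Theorem step involved (our sketch, not the thesis's exact derivation): if a node averages K independent offset estimates, each with mean μ and variance σ², then

% Illustrative CLT step, assuming K independent offset estimates
% \Delta_1, ..., \Delta_K with mean \mu and variance \sigma^2:
\bar{\Delta} = \frac{1}{K}\sum_{k=1}^{K} \Delta_k,
\qquad
\mathrm{Var}(\bar{\Delta}) = \frac{\sigma^2}{K},
\qquad
\bar{\Delta} \xrightarrow{d} \mathcal{N}\!\left(\mu, \frac{\sigma^2}{K}\right),

so the achievable standard deviation shrinks as the number of independent estimates grows.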

The remainder is organized as follows. Chapter 1 is an overview of distributed systems: we introduce concepts and features of distributed systems, concentrating on the interaction and failure models and introducing some example architectures. Chapter 2 deals with the notion of time in distributed systems: we cope with the clock synchronization problem by introducing physical and logical clocks and showing the solutions adopted. Chapter 3 covers the internal clock synchronization algorithm which is the fundamental building block of our work. Chapter 4 explains our solution in detail, showing the system behaviour in different scenarios through simulation results. Chapter 5 concludes the thesis.


Chapter 1

System model

A distributed system is a set of computers (nodes), connected by a network with limited bandwidth and possibly high latency, which communicate by exchanging messages. Examples of distributed systems are client-server systems, where a server can accept client requests, process them, and return the requested information to the client, or peer to peer (p2p) systems, where each node acts simultaneously as a client and a server to the other nodes in the network.

The main goal of a distributed system is to connect users and resources in a transparent, open, and scalable way. Ideally this arrangement is more fault tolerant and more powerful than many combinations of stand-alone computer systems. In a distributed system, applications must be implemented taking into account:

• concurrency in time and space

• no global clock

• independent failures

1.1 Features

Main features of a distributed system are:

• heterogeneity

• openness

• security

• scalability

• fault tolerance

• concurrence


• transparency

heterogeneity A distributed system must accommodate heterogeneous hardware, operating systems, networks, programming languages and implementations from different developers.

• hardware: data types can be represented in different ways on different kinds of hardware. These differences must be dealt with if messages are exchanged between programs running on different hardware.

• operating systems: although different operating systems provide the same protocols, for example Internet protocols, they do not necessarily provide the same application programming interface.

• networks: the Internet is composed of several kinds of networks; their differences are masked by the fact that all the attached computers use the same Internet protocols to communicate. For example, a computer attached to an Ethernet network has an implementation of the Internet protocols over that Ethernet, whereas a computer on a different sort of network will need an implementation of the protocols for that network.

• programming languages: different programming languages represent data structures such as arrays and records in different ways. The problem must be addressed if communication between programs written in different languages is needed.

• implementation: programs written by different programmers cannot communicate if they don't agree on a common standard, for example for network communication and for the representation of primitive data and data structures in messages.

A possible solution is Middleware, a software layer intended to mask this heterogeneity and to provide a convenient programming model to application programmers. Middleware services provide a more functional set of Application Programming Interfaces (APIs) to allow an application to:

• be located transparently across the network, thus allowing interaction with another service or application

• be independent from network services

• be always reliable and available

when compared to the operating system and network services.

Middleware platforms are often implemented over the Internet protocols, which mask the differences of the underlying networks. In addition to solving the problem of heterogeneity, middleware provides a uniform computational model for use by the programmers of servers and distributed applications. Models include remote object invocation, remote event notification, remote SQL access and distributed transaction processing.

CORBA is a middleware that provides remote object invocation: an object in a program running on one machine can invoke a method of an object running on a different machine, even if the programs involved are written in different programming languages. Another middleware is Java RMI, although it supports only a single programming language, Java.

openness Openness is the property of a distributed system of being extensible or re-implementable in various ways. Web Services protocols are standards which enable distributed systems to be extended and scaled. In general, an open system that scales has an advantage over a perfectly closed and self-contained system. Consequently, open distributed systems are required to meet the following challenges:

• Monotonicity Once something is published in an open system, it cannot be taken back.

• Pluralism Different subsystems of an open distributed system include heterogeneous, overlapping and possibly conflicting information. There is no central arbiter of truth in open distributed systems.

• Unbounded nondeterminism Asynchronously, different subsystems can come up and go down, and communication links can come in and go out between subsystems of an open distributed system. Therefore the time that it will take to complete an operation cannot be bounded in advance.

A system can be extended at the hardware level, by adding computers to the network, or at the software level, by adding new functionalities or re-implementing old ones, enabling application programs to share resources.

security Security is one of the most important issues in a distributed system. It has three components:

• confidentiality protection against disclosure to unauthorized access

• integrity protection against alteration or corruption

• availability protection against interference with the means to access the resources.


Although the Internet provides transparent location, security risks are associated with free access to resources over an intranet or over the Internet if they are not protected by a firewall.

In addition, the following two challenges have not yet been fully met:

• Denial of Service attacks: a malicious user can bombard a service with such a large number of pointless requests that serious users are unable to use it.

• Security of mobile code: a malicious user can send an executable program, for example as an electronic mail attachment, that has an unpredictable effect. It may look like a picture, but it may conceal a program able to access local resources or take part in a denial of service attack.

scalability Scalability is the property of a distributed system intended to denote its ability to remain effective as the number of resources and users increases. The Internet is the main example of a scalable distributed system.

The design of a scalable distributed system presents the following challenges:

• Controlling the cost of physical resources It should be possible to extend the system at a reasonable cost if the demand for a resource grows. For example, with a growing number of computers connected to an intranet and accessing a resource, it must be possible to add a server to avoid the performance bottleneck that would arise if a single server had to handle all resource requests. In general, for a system with n users to be scalable, the quantity of physical resources required to support them should be at most O(n), that is, proportional to n.

• Controlling the performance loss For the management of a set of data whose size is proportional to the number of users or resources in the system (for example, DNS), hierarchic algorithms scale better than those that use linear structures. But even with hierarchic algorithms, an increase in size brings a loss in performance: the time taken to access hierarchically structured data is O(log n), where n is the size of the set of data. For a system to be scalable, the maximum performance loss should be no worse than this.

• Preventing software resources running out The most evident example of lack of scalability is the fact that in the late 1970s it was decided to use 32-bit numbers for Internet addresses, and nowadays the available addresses are running out. It is difficult to predict the demand that will be put on a system years ahead. Moreover, over-compensating for future growth may be worse than adapting to a change when we are forced to.

• Avoiding performance bottlenecks In general, algorithms should be decentralized to avoid performance bottlenecks. An example is the predecessor of the Domain Name System, in which the name table was kept in a single master file, available to be downloaded by any computer that needed it. This worked well while there were a few hundred computers on the network; with the growth of the Internet, it became a performance and administrative bottleneck. The Domain Name System solved the problem by partitioning the name table between servers located throughout the Internet and administered locally.

In addition, resources may be accessed very frequently, causing a decline in performance. Caching and replication are techniques used to improve performance when resources are heavily used.

Ideally, the system and application software should not need to change when the scale of the system increases, but this is difficult to achieve.

Scalability is a dominant theme in development of distributed systems.

fault tolerance A distributed system should be able to handle failures. Different kinds of failures can occur: they may affect hardware or software components and may produce incorrect results or stop the computation before it has completed the intended task. The following techniques can be used for dealing with failures:

• Detecting failures Some failures can be detected, for example corrupted data in a message or in a file, using checksums. Other failures, instead, are difficult or even impossible to detect, such as a remote server crashed somewhere on the Internet.

• Masking failures Some detected failures can be hidden or made less severe. Examples of hiding failures are:

1. messages can be retransmitted when they fail to arrive;

2. file data can be replicated to several disks in case one of them is corrupted.

• Tolerating failures A large network such as the Internet can exhibit server failures that it is not practical to attempt to detect and hide; so clients can be designed to be fault tolerant. For example, a web browser unable to contact a web server informs the user about the problem, leaving him free to retry later.

• Recovering from failures Recovery involves the design of software so that the state of permanent data can be recovered or rolled back after a server has crashed. In general, the computation made by some programs will be incomplete when a fault occurs, and the permanent data they update may be inconsistent.

• Redundancy Redundant components are one of the keys to making a service fault tolerant. All important services exploit replication: routes in the Internet, the name table in the Domain Name System, databases replicated to ensure that data remain available after a crash of a single server, and so on.

A distributed system must provide a high degree of availability in the face of hardware faults. Availability measures the proportion of time a distributed system is available for use. When a component fails, only the work on that component is affected. A user may move to another computer if the one in use fails; a server process can be started on another computer.

concurrency Services and applications provide resources that can be shared in a distributed system. The problem is to guarantee data consistency when an object can be accessed by multiple clients; this could be achieved by handling one client request at a time, but that limits throughput. Therefore, services and applications allow multiple client requests to be processed concurrently. But a shared resource must be responsible for ensuring that it operates correctly in a concurrent environment, to avoid several concurrent requests conflicting and producing inconsistent results. Programmers who take an implementation of an object that was not intended for use in a distributed system must do whatever is necessary to make it safe in a concurrent environment. For an object to be safe in such an environment, its operations must be synchronized in such a way that its data remains consistent. This can be achieved, for example, by standard techniques such as semaphores, used in most operating systems.

transparency Transparency is the concealment from the user and the application programmer of the separation of components, so that the system is perceived as a whole rather than as a collection of independent components. The ANSA Reference Manual and the International Organization for Standardization's Reference Model of Open Distributed Processing identify the following eight forms of transparency:

• Access transparency enables local and remote resources to be accessed using identical operations.

• Location transparency enables resources to be accessed without knowledge of their location.


• Concurrency transparency enables several processes to operate concurrently using shared resources without interference between them.

• Replication transparency enables multiple instances of resources to be used to increase reliability and performance, without knowledge of the replicas by users or application programmers.

• Failure transparency enables the concealment of faults, allowing users and application programs to complete their tasks despite the failure of hardware or software components.

• Mobility transparency allows the movement of resources and clients within a system without affecting the operation of users or pro- grams.

• Performance transparency allows the system to be reconfigured to improve performance as loads vary.

• Scaling transparency allows the system and applications to expand in scale without change to the system structure or the application algorithms.

Access and location transparency are the two most important transparencies: their presence or absence strongly affects the distributed system. Together they are referred to as network transparency.

Let us illustrate some examples. Access transparency can be achieved by means of an API: it enables a user to access both local and remote resources in the same way. On the contrary, an example of lack of access transparency is a distributed system that does not allow access to remote files unless the user makes use of the ftp program to do so.

An example of location transparency is the web resource name (URL), because it identifies a web server domain rather than an Internet address. However, URLs are not mobility-transparent, because a web page cannot move to another domain without changing its URL. In general, identifiers such as URLs prevent replication transparency too; in fact the DNS allows a domain name to refer to several computers, but it picks just one of them when it looks up a name. Since a replication scheme generally needs to be able to access all participating computers, it would need to access each of the DNS entries by name.

Failure transparency can be illustrated by referring to e-mail: a message will eventually be delivered despite server or link failures. Faults are masked by retransmitting the message until it is correctly delivered to the recipient, even if that may take a long time.

To illustrate mobility transparency, let us consider two mobile phones: both caller and callee can communicate with each other while moving from one environment (cell) to another. The two phone users are unaware of the mobility of the phones between cells.

Transparency hides from users and application programmers the resources that are not directly relevant to the task at hand. In general, it is desirable for similar hardware resources to be allocated interchangeably to perform a task. But this may not always be possible: when a user needs to connect to a local area network different from his own, he needs to make use of different services, for example a different send mail service.

1.2 Fundamental models

A system model is a description of the reality in which problems have to be considered. An object model is a collection of attributes and a set of rules that specifies how objects interact. In general, a model contains only the essential ingredients that we need to consider in order to understand and reason about some aspects of a system's behaviour. A model is said to be accurate if it is very close to the modelled reality, and tractable if we can proceed with its analysis. A good model should be accurate and tractable at the same time: the selected attributes must be only those affecting the analyzed question. However, there is no absolutely good model; a model can only be adequate to the problem and environment considered. The purpose of a model is:

• to make explicit all relevant assumptions about the system we are modelling;

• to make generalizations concerning what is possible or impossible, given those assumptions. The generalizations may take the form of general-purpose algorithms or desirable properties that are guaranteed.

We concentrate on two aspects of distributed systems:

Interaction System nodes communicate by sending messages over a network. The interaction model must take into account the delay that occurs in a communication; it can be of considerable duration, so the accuracy with which independent processes can be coordinated is limited by this delay and by the difficulty of maintaining the same notion of time across all computers in the system.

Failure Process and link failures are threats to the correct operation of a distributed system. A description of these faults is the basis for the analysis of their potential effects and for the design of systems able to tolerate faults of each type while continuing to run correctly.


1.2.1 Interaction model

The behaviour of a distributed system is described by a distributed algorithm: this is a definition of the steps to be taken by each of the processes of which the system is composed, including the transmission of messages among them. Message passing is used to coordinate node activities and to transfer information. However, the rate at which processes proceed and the timing of the transmission of messages among them are often unpredictable. It is also difficult to describe the state of a system, because it must deal with failures that can occur on one or more of the nodes involved (process or link failures). The state of a process consists of the set of data that it can access and update; it is private and cannot be accessed by other processes.

Two aspects affect the processes interaction in a distributed system:

communication channels Communication performance is often a limiting factor. Communication over a computer network has performance characteristics concerning latency, bandwidth and jitter (see the sketch after this list):

• The delay between the start of a message's transmission by a process and the reception of that message by another process is referred to as latency. Latency includes:

– the time taken for the first of a string of bits transmitted through a network to reach its destination;

– the time taken to access the network, which increases significantly when the network is heavily loaded;

– the time taken by the system communication services at both the sending and receiving processes, which varies according to the current load on the operating systems.

• The bandwidth of a computer network is the total amount of information that can pass over it in a given time. When a large number of communication channels use the same network, they have to share the available bandwidth.

• Jitter is the variation in the time taken to deliver a series of messages. It is particularly important for multimedia data. For example, if a series of consecutive samples of audio data is played with differing time intervals, then the sound will be badly distorted.
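As a small illustration (ours, not from the text), the following Python snippet computes latency statistics and one common operational notion of jitter, the mean absolute difference between consecutive delays, from a series of one-way message delays:

from statistics import mean, pstdev

def latency_stats(delays_ms):
    """Summarize per-message one-way delays, given in milliseconds."""
    # Jitter as the mean absolute difference between consecutive delays.
    jitter = mean(abs(b - a) for a, b in zip(delays_ms, delays_ms[1:]))
    return {"mean_latency": mean(delays_ms),
            "stddev": pstdev(delays_ms),
            "jitter": jitter}

print(latency_stats([20.1, 22.4, 19.8, 35.0, 21.2]))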

computer clocks Each computer in a distributed environment has its own internal clock, which local processes use to obtain the value of the current time. Processes use this time to timestamp events. However, the internal clocks of any two nodes of the system can differ when read at the same time. This is because computer clocks drift from perfect time and, in particular, their drift rates differ from one another. The drift rate, as explained later, is the amount of time by which a computer clock drifts from a perfect reference clock per unit of time. Even if the clocks of a distributed system were initially synchronized, they would eventually vary quite significantly unless corrections are applied.
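A minimal linear clock model makes this concrete; the sketch below and the ±50 ppm figure (a typical quartz oscillator tolerance) are illustrative assumptions of ours:

def hardware_clock(t, offset, drift_rate):
    """Linear model of a hardware clock: C(t) = offset + (1 + rho) * t,
    where rho is the drift rate relative to a perfect reference clock."""
    return offset + (1.0 + drift_rate) * t

# Two clocks started perfectly in sync, with drift rates of +/-50 ppm,
# are already ~0.1 ms apart after one second of real time:
print(hardware_clock(1.0, 0.0, +50e-6) - hardware_clock(1.0, 0.0, -50e-6))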

In the remainder of the section, we will show three variants of the interaction model. These variants depend on the synchrony of the system:

• Synchronous systems

• Asynchronous systems

• Partially synchronous systems

Synchronous systems

In a synchronous system we can make the following physical timing assumptions on processes and links:

1. Synchronous computation: There is a known upper bound on processing delays. That is, the time taken by any process to execute a step is always less than this bound. A step comprises the delivery of a message sent by another process, a local computation and the sending of a message to some other process.

2. Synchronous communication: There is a known upper bound on message transmission delays. That is, the time period between the instant at which a message is sent and the instant at which the message is delivered by the destination process is smaller than this bound.

3. Synchronous physical clocks: Processes are equipped with a local physical clock. There is a known upper bound on the rate at which the local physical clock deviates from a global real time clock.

In a synchronous distributed system, several useful services can be provided:

• Timed failure detection Every crash of a process can be detected within a known bounded time by all processes that did not crash. This can be achieved, for example, by using a heartbeat mechanism, where processes periodically send a message (heartbeat) so that crashed processes can be detected within a limited time period (see the sketch after this list).

• Measure of transit delay It is possible to get a good approximation of the delay spent by messages in the communication link and, from there, infer which nodes are more distant or connected through slow or overloaded channels.


Figure 1.1: Real time ordering of events

• Coordination based on time One can implement a lease abstraction that grants the right to execute some action (e.g. manipulating a specific file) for a fixed amount of time.

• Worst case performance By assuming an upper bound on the number of faults and on the load of the system, it is possible to infer the worst case response time of a distributed algorithm. This allows a process to know when a previously sent message has reached its destination (supposing it to be correct). This can be achieved even if we assume that processes commit omission failures without crashing, as long as we bound the number of these omission failures.

• Synchronized clocks A synchronous system makes it possible to synchronize the clocks of different processes in such a way that they are never apart by more than some known constant δ, known as the clock synchronization precision. Thus, processes can use their physical clocks to timestamp events so as to guarantee a global ordering in the system. Figure 1.1 shows a real time ordering of events.
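A minimal Python sketch of such a timed (heartbeat-based) failure detector follows; the function name and parameters are ours, and the detection is safe only because the synchrony assumptions bound both the heartbeat period and the message delay:

def check_heartbeats(last_heartbeat, now, period, delay_bound):
    """Return the set of processes that can safely be declared crashed:
    in a synchronous system, a correct process sends a heartbeat every
    `period` and it arrives within `delay_bound`, so a heartbeat older
    than period + delay_bound proves a crash."""
    deadline = period + delay_bound
    return {p for p, t in last_heartbeat.items() if now - t > deadline}

# With a 1 s heartbeat period and a 0.2 s delay bound, at time 10.0
# process "q" (last heard from at 8.0) is detected as crashed.
print(check_heartbeats({"p": 9.5, "q": 8.0}, 10.0, 1.0, 0.2))  # {'q'}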

The major limitation of assuming a synchronous system is the difficulty of building a system in which all timing assumptions hold with high probability. A careful analysis of the network and processing load and the use of suitable processor and network algorithms is needed. However, this is possible in a local area network (LAN), where the network size is reasonably small. It becomes unfeasible on a wide area network (WAN) like the Internet, where messages can take a very long time to arrive at their destination due to the large scale of the system.


Asynchronous systems

In an asynchronous system, no physical timing assumption on processes and links is possible. However, it is still possible to measure the passage of time based on the transmission and delivery of messages. Time measured in this way is called logical time; we will explain later the resulting notion of logical clock.

The main example of an asynchronous distributed system is the Internet, in which no bound on server or network load holds (i.e. an e-mail can take days to arrive). Nevertheless, some design problems can be solved anyway. For example, although a web server cannot always provide a response within a reasonable time limit, web browsers are designed to allow users to do other things while they are waiting.

It is important to notice that any solution valid in an asynchronous system is valid in a synchronous system too.

Actual distributed systems are very often asynchronous, because of the need for processes to share the processors and for communication channels to share the network. So the resulting performance of an application cannot be guaranteed. But there are many design problems that cannot be solved in an asynchronous system, for example data streaming, which requires strict time deadlines.

Partially synchronous systems

Generally, systems appear to be synchronous: we can notice that physical bounds are respected most of the time. However, there can be periods during which a system is asynchronous, for instance because the network is overloaded or some process has a shortage of memory that slows it down. A message buffer that a process uses to store incoming and outgoing messages can overflow, so messages might be lost, violating the time bound on delivery. Retransmission of messages might help to ensure the reliability of channels, but it introduces an unpredictable delay. For this reason, practical systems are said to be partially synchronous; that is, instead of assuming a synchronous system, we consider an eventually synchronous system. A partially synchronous system is a system where an unknown time t exists, after which the system will be synchronous for a period long enough for an algorithm to terminate its execution. But this does not mean that:

1. the system will remain synchronous forever;

2. the system starts asynchronous.

The assumption captures the fact that the system might not always be synchronous and that there is no bound on the period during which it is asynchronous.

1.2.2 Failure model

In a distributed system both processes and channels may fail, that is, they may deviate from the correct behaviour. In Figure 1.2 many classes of faults are represented.

!"#$%"&"'

!"#$%"&"'()$%*(+,--&.,

&/%*,0%$1&%$20 3$+$0.(4205'($0

-'01*"202/-(+26,57

8,0,"&5(9+$--$20

:,1,$;,(9+$--$20

<,06(9+$--$20

="&-*

Benign

Figure 1.2: Failure classes

Omission failures

An omission failure occurs if channels or processes fail to perform actions that they are supposed to perform.

The main kind of omission failure is the crash. If a crash failure occurs, the faulty process stops sending messages. Other processes can detect it if they notice that the process repeatedly fails to respond to invocation messages. However, this technique implies the use of timeouts to detect a failure; that is, a process which fails to respond to invocation messages within a time interval is supposed to be faulty. This can be achieved in a synchronous system, while in an asynchronous system a timeout can only indicate that a process is not responding: it may have crashed, it may be slow, or the messages may not have arrived.

In a fail-stop model, a faulty process can be certainly detected by other processes. This kind of behaviour can be produced in a synchronous distributed system if processes use timeouts to detect when a process fails to respond and messages are guaranteed to be delivered. For example, let us consider two processes p and q, programmed so that q responds to messages from p. If p does not receive a reply from q within a time interval measured by p's clock, p detects q to be faulty.

Figure 1.3: Communication channel

A communication channel produces an omission failure if it drops messages. Consider the scenario in Figure 1.3, where a communication channel transports messages from the outgoing buffer of a process p to the incoming buffer of a process q. The outgoing and incoming buffers are usually provided by the operating system.

The communication channel produces an omission failure if it does not transport a message from p's outgoing buffer to q's incoming buffer; this is generally caused by lack of buffer space at the receiver or at an intervening gateway, or by a network transmission error detected by a checksum carried with the message data.

Arbitrary failures

An arbitrary or Byzantine failure is the worst failure semantics, in which any type of error may occur.

An arbitrary failure of a process is one in which it arbitrarily omits intended processing steps or takes unintended processing steps. Thus, a process cannot detect an arbitrary fault just by checking whether another process responds to invocations, because it might omit to reply.

Communication channels can suffer from arbitrary failures too; for example, message contents may be corrupted, nonexistent messages may be delivered, or real messages may be delivered more than once. Nevertheless, arbitrary failures of channels are rare, because the communication software is able to recognize them and reject the faulty messages. For example, a checksum can be used to detect corrupted messages, and a sequence number to identify each message uniquely, rejecting nonexistent and duplicated messages.

Timing failures

Timing failures are applicable in synchronous distributed systems, where it is possible to establish bounds on process computation, communication delays and clock drift rates. A timing failure occurs when a process does not receive a response within a timeout.

In asynchronous systems, timing failures are not applicable, because we cannot distinguish a failure from an overloaded server.

Timing failures are particularly relevant to real time applications, such as multimedia computers with audio and video channels, which require strict time constraints. However, their design is more complex and may require redundant hardware.

Masking failures

In a distributed system, components are constructed from a collection of other components. One of the main goals is to provide a reliable service even though one or more of these components may fail. For example, consider a group of servers storing a replicated data collection: the service must remain available even if one of the servers crashes. Knowledge of the failure characteristics of a component can help in designing a service able to mask failures of the components on which it depends. A service masks a failure either by hiding it or by converting it into a more acceptable failure. For example, checksums can be used to mask corrupted messages, converting an arbitrary failure into an omission failure.

Even process crashes may be masked by replacing the process and restoring its memory from information stored on disk by its predecessor.

A basic communication channel, which can exhibit omission failures as described above, can be used as a building block for a reliable communication service. Reliable communication is based on the validity and integrity properties:

• validity: any message in the outgoing message buffer is eventually delivered to the incoming message buffer;

• integrity: the message received is identical to the one sent and no messages are delivered twice.

But threats to integrity come from two independent sources:

• protocols that retransmit messages may deliver a message twice; sequence numbers can be used to detect and reject duplicates, as sketched below;

• malicious users may inject spurious messages, replay old messages or tamper with messages. Security measures can be taken to maintain the integrity property in the face of such attacks.
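A minimal Python sketch of the receiving end of such a reliable channel follows; class and method names are ours. Retransmission by the sender provides validity, while per-message sequence numbers provide integrity by discarding duplicates (a real protocol would also bound the set of remembered numbers, e.g. with cumulative acknowledgments as in TCP):

class ReliableReceiver:
    def __init__(self):
        self.delivered = set()  # sequence numbers already delivered

    def on_message(self, seq, payload, send_ack):
        # Ack even duplicates, so the sender eventually stops retransmitting.
        send_ack(seq)
        if seq in self.delivered:
            return None          # duplicate caused by retransmission: drop it
        self.delivered.add(seq)
        return payload           # delivered exactly once to the application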

1.3 System architectures

In this section we will focus on distributed system architectures, ranging from the well-defined client-server architecture to spontaneous networking.


1.3.1 Client-server model

Client-server is the most widely employed architecture; as depicted in Figure 1.4, client processes interact with server processes running on separate host computers in order to access the shared resources that the servers manage. A server, in turn, may be a client of other servers. For example, a web server is often a client of a local file server, which manages the files in which the web pages are stored. Web servers and most other Internet services are clients of the DNS service, which translates Internet domain names to network addresses.

Figure 1.4: Client-server interaction

In addition, server processes can be implemented on different computers, interacting if necessary to provide a service to client processes (Figure 1.5). This approach can be used either for data replication or for data partitioning. The former provides multiple consistent copies of data in processes running on different hosts; this technique increases performance and availability, improving fault tolerance. The latter, instead, consists in partitioning the set of objects on which the service is implemented and distributing them among different hosts.

Figure 1.5: Services provided by multiple servers

The Web is the main example of this approach: each web server manages its own set of resources, and any user can access them via a browser.

1.3.2 Proxy servers and caches

A cache is a store of recently used objects, nearer to the clients than the objects themselves. When a new object is received at a computer, it is added to the cache, replacing other objects if necessary. When a client makes a request, the caching service first checks whether an up-to-date copy of that object is stored in the cache; if so, it is provided to the client. If not, an up-to-date copy is fetched.

Caches can be located either on clients or on servers, so that they can be shared by different clients. Web browsers maintain a cache of recently visited web pages and other web resources in the client's file system, using a special HTTP request to check with the origin server that a cached copy is up-to-date before displaying it.
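The lookup flow just described can be sketched in a few lines of Python (ours, purely illustrative; `is_fresh` stands for whatever validation mechanism is used, e.g. an HTTP conditional request):

def cached_fetch(cache, key, is_fresh, fetch):
    """Serve `key` from the cache when a validated copy exists;
    otherwise fetch an up-to-date copy from the origin and store it."""
    if key in cache and is_fresh(key, cache[key]):
        return cache[key]
    cache[key] = fetch(key)  # replace the stale copy (or add a new entry)
    return cache[key]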

Web proxy servers provide a shared cache of resources to the clients at a site or across the network (Figure 1.6), increasing the availability and performance of the service by reducing the load on the wide area network and on web servers. In addition, proxy servers can play other roles; for example, they may be used to access remote web servers through a firewall.

Figure 1.6: Proxy server

1.3.3 Peer to peer (P2P) model

In this model each process plays a similar role, acting as client and server at the same time (Figure 1.7). Code in the peer processes maintains consistency of application-level resources and synchronizes application-level actions when necessary. The elimination of server processes reduces the inter-process communication delays for access to local objects. Such networks are useful for many purposes. Sharing content files containing audio, video, data or anything in digital format is very common, and real-time data, such as telephony traffic, is also passed using P2P technology.

Main advantages are:

• Scalability These systems can manage millions of users


Figure 1.7: Peer to peer approach

• Fault tolerance They provide service even though failures may occur

• Dynamic Automatic management of leave/join of nodes

• Performance

• Economy Services exploit unused resources by users

• Anonymity Hides users' operations and identity

However, p2p systems have many drawbacks:

• Management There is no central manager; in addition, each user can manage only his own node. Automatic system management algorithms are required.

• Coordination No p2p system provides primitives for coordination (mutual exclusion, consensus, ...); coordination algorithms are needed.

• Heterogeneous nodes Each node has different features (bandwidth, computation, ...); management of heterogeneous capabilities is required to avoid their affecting the global performance of the system. Sometimes it is possible to exploit this fact for the system's benefit.

• Churn rate Leave/join of nodes can affect negatively system perfor- mance.

Peer to peer systems can be divided into two classes, according to their structure: unstructured and structured.

unstructured Each resource is located on the node providing it; due to the lack of organization, locating a resource in the network is very difficult. However, adding or removing nodes is simple and cheap.


structured Resource location is easy and established by a common function (known by all nodes) and a node identifier. On the other hand, adding or removing a node is difficult. This kind of p2p system can be divided into two further classes, according to the topology:

• Pure: centralized, hierarchic, ring, graph

• Hybrid: centralized and ring, centralized and graph

The topology highly affects system behaviour:

• how complex management is

• the cost for adding nodes to the network

• how scalability affects performance

• the capability of the network to provide service despite one or more faults

• the capability of the network to resist external threats (i.e. DoS attacks)

• load balancing between nodes

A second generation of p2p systems are overlay networks; these are distributed systems built on top of a network (e.g. the Internet), in which nodes are linked by logical links corresponding to paths, through many physical links, in the underlying network. Overlay networks can be used to route messages without using IP addresses; each node has a logical identifier and can be found by other nodes by means of a Distributed Hash Table (DHT); a minimal sketch follows the list of protocols below. A DHT contains a pair ⟨name, value⟩ for any node in the overlay; a node can efficiently retrieve the value associated with a given name. Responsibility for maintaining the mapping from names to values is distributed among the nodes, in such a way that a change in the set of participants causes a minimal amount of disruption. This allows DHTs to scale to extremely large numbers of nodes and to handle continual node arrivals, departures, and failures.

Examples of DHT protocols are:

• CAN (Content Addressable Network)

• Chord

• Kademlia

• Pastry

• P-Grid

• Tapestry
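As an illustration of the common placement function such DHTs rely on, here is a heavily simplified, Chord-style sketch in Python (no finger tables, replication or churn handling; all names are ours):

import hashlib
from bisect import bisect_right

def ring_hash(key, space=2**32):
    """The common function known by all nodes: it maps both names and
    node identifiers onto the same circular identifier space."""
    return int(hashlib.sha1(key.encode()).hexdigest(), 16) % space

def responsible_node(node_points, name):
    """Chord-style rule: the pair <name, value> is stored on the first
    node whose identifier follows hash(name) clockwise on the ring."""
    points = sorted(node_points)
    i = bisect_right(points, ring_hash(name))
    return points[i % len(points)]  # wrap around the ring

nodes = {ring_hash(n): n for n in ["peer-a", "peer-b", "peer-c"]}
print(nodes[responsible_node(nodes.keys(), "some-file.mp3")])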


Chapter 2

Time in distributed systems

Time is an important issue in a distributed system; for example, computers around the world have to timestamp electronic commerce transactions consistently. In addition, time is required to analyze how distributed executions unfold. Each computer has a physical (hardware) clock; it can be viewed as a counter register which is incremented by the ticks of an oscillator. Clocks typically deviate and we cannot synchronize them perfectly. But time is problematic in a distributed system because there is no global time reference; this makes it difficult to determine the state of processes during a distributed computation.

2.1 Computation model

Let we consider a system of N processes and channels. Every process gen- erates a sequence of events; let eki be the k-th event generated by process pi: eki can be a message of send or receive operation. Figure 2.1 shows a computation in a space-time diagram.

Let us denote by → the order relation between events in a process: we say that $e \rightarrow e'$ in a process $p_i$ if and only if $e$ happens before $e'$.

Definition 1. Local history The history of a process $p_i$ is the sequence of events occurring within it, ordered by the relation →:

$history(p_i) = h_i = \langle e_i^0, e_i^1, e_i^2, \ldots \rangle$

Definition 2. Partial local history The partial local history is the prefix of the local history:

$h_i^m = \langle e_i^0, e_i^1, \ldots, e_i^m \rangle$

Definition 3. Global history The global history is the union of the local histories:

$H = \bigcup_{i=1}^{n} h_i$


Figure 2.1: Space-time diagram

To order events, each computer uses timestamps; in this way, it is possible to build a global history of the system. A trivial solution consists in timestamping events with the physical clock of each process: the order within a process is then well defined, while ordering events across processes is harder, due to the lack of a physical clock shared among the processes. To cope with this problem, we can try to synchronize clocks with appropriate algorithms.
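
As a toy illustration of the model (a sketch with hypothetical names, not code from the thesis), each process can record its local history as the ordered list of the events it generates, and the global history is assembled as the union of the local ones:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of the computation: each process p_i appends the events it
// generates, in order (its local history h_i); the global history H is
// the union of all local histories.
public class Histories {
    private final Map<Integer, List<String>> local = new HashMap<>();

    // Record the next event of process i (its k-th event e_i^k).
    public void event(int processId, String description) {
        local.computeIfAbsent(processId, id -> new ArrayList<>()).add(description);
    }

    // Local history h_i of process i, ordered by the relation ->.
    public List<String> localHistory(int processId) {
        return local.getOrDefault(processId, List.of());
    }

    // Global history H = union of all local histories (no order across processes).
    public List<String> globalHistory() {
        List<String> h = new ArrayList<>();
        local.values().forEach(h::addAll);
        return h;
    }
}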

2.1.1 Clock Synchronization

Clock synchronization deals with the fact that the internal clocks of several computers may differ. Even when initially set accurately, real clocks differ after some amount of time due to clock drift, caused by clocks counting time at slightly different rates.

Clock synchronization can be achieved either by hardware or by software.

Hardware clock synchronization uses special synchronization hardware at each processor and a separate network for clock signals to achieve a tight synchronization. Software clock synchronization, instead, is achieved by means of algorithms that use the standard communication infrastructure to send messages and do not require any special hardware.

Algorithms can provide external, internal or heartbeat synchronization.

External synchronization aims to keep computer clocks as close as possible to the real time provided by a time source.

Definition 4. External clock synchronization: let δ > 0 be a synchronization bound and t a point in the real time interval; the clock $C_i$ of a process i (for i = 1, 2, ..., N) is externally synchronized with a time source S iff:

$|S(t) - C_i(t)| < \delta \qquad (2.1)$


Internal synchronization aims to minimize the maximum difference between any two clocks. An internal synchronization algorithm enables a process to measure the duration of distributed activities that start on one processor and terminate on another. It establishes an order between distributed events in a manner that closely approximates their real time precedence.

Definition 5. Internal clock synchronization: let δ > 0 be a synchronization bound and t a point in the real time interval; the clocks $C_i$ and $C_j$ of any two processes i and j (for i, j = 1, 2, ..., N) are internally synchronized iff:

$|C_i(t) - C_j(t)| < \delta \qquad (2.2)$

External synchronization implies internal synchronization; the converse does not hold: internally synchronized clocks can all drift away from an external source while remaining within a bound δ of each other. If a set of processes is externally synchronized within a bound δ, then it is internally synchronized within a bound 2δ, since by the triangle inequality $|C_i(t) - C_j(t)| \le |C_i(t) - S(t)| + |S(t) - C_j(t)| < 2\delta$.

Heartbeat synchronization aims to have nodes generate periodic, local heartbeat events approximately at the same time. Heartbeat synchronization algorithms aim to guarantee that all nodes start and end their cycles at the same time, with an error that is at least one order of magnitude smaller than the chosen cycle length.

Clock synchronization algorithms can be divided into three groups: deterministic, probabilistic and statistical.

Deterministic algorithms assume that an upper bound on transmission delays exists and guarantee an upper bound on the difference between any two clocks (precision).

Probabilistic algorithms make no assumption on maximum transmission delays but guarantee a constant maximum deviation between synchronized clocks. A clock knows at any time whether it is synchronized with the others, but there is a non-zero probability that it goes out of synchronization if too many unmasked communication failures occur.

Statistical algorithms do not assume a maximum transmission delay either, but assume that the expectation and the standard deviation of the delay distribution are known. Moreover, clocks do not know how far apart they are from each other, but the algorithms guarantee that any two clocks are within some constant maximum deviation with a certain probability at any time. Hence, in probabilistic and statistical algorithms, the precision of any two correct clocks can only be guaranteed with some probability strictly less than one.

Now we define the clock synchronization problem more formally through the following properties:


Property 1. Agreement Let γ be the precision of the algorithm and i and j two correct processes; then

$|C_i(t) - C_j(t)| \le \gamma$

This property states that any two correct clocks are approximately equal.

Deterministic algorithms always satisfy this property. Probabilistic and statistical algorithms, instead, satisfy it with a probability strictly less than one.

Property 2. Bounded correction There is a small bound ε on the amount by which a correct clock is adjusted at any resynchronization.

Property 3. Accuracy For any correct processor $p_i \in P$ and any real time t there exists a constant ν, called accuracy, such that for any execution of the clock synchronization algorithm

$\frac{1}{1+\nu}\, t + a \le C_i(t) \le (1 + \nu)\, t + b$

with a and b constants depending on the initial conditions of the algorithm.

The best achievable drift rate of software clocks is equal to the drift ρ of the underlying hardware clock (optimal accuracy).

Property 4. Bounded external deviation For any correct processor $p_i \in P$ and any real time t,

$|C_i(t) - t| \le \varphi$

This property is satisfied by deterministic external clock synchronization algorithms; they have to bound the deviation between a correct clock and real time by an a priori given constant ϕ.

2.2 Physical clock

At time t, the operating system of a process $p_i$ reads the hardware clock H(t) of the processor and generates the software clock

$C_i(t) = \alpha H(t) + \beta$

where we are free to choose the values α and β; $C_i(t)$ approximately measures the physical time t at process $p_i$. For example, $C_i(t)$ can be a 64-bit value representing the nanoseconds elapsed at time t since a reference time (e.g. the boot of the machine).
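
As a minimal illustration (a sketch, not the implementation used in this thesis), the following Java class derives a software clock from System.nanoTime(), which plays here the role of the hardware counter H(t); alpha and beta are the adjustable rate and offset:

// Minimal software-clock sketch: C(t) = alpha * H(t) + beta, where the
// role of the hardware clock H(t) is played by System.nanoTime().
public class SoftwareClock {
    private double alpha; // rate adjustment (1.0 = follow the hardware rate)
    private long beta;    // offset adjustment, in nanoseconds

    public SoftwareClock(double alpha, long beta) {
        this.alpha = alpha;
        this.beta = beta;
    }

    // Current software time in nanoseconds (double rounding is acceptable
    // for a sketch; a real clock would use fixed-point arithmetic).
    public long read() {
        return (long) (alpha * System.nanoTime()) + beta;
    }

    // A synchronization algorithm adjusts the clock by changing alpha/beta.
    public void adjust(double newAlpha, long newBeta) {
        this.alpha = newAlpha;
        this.beta = newBeta;
    }
}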

If $C_i(t)$ behaves well enough, it can be used to timestamp the events of $p_i$.


Figure 2.2: Correct hardware clock

It is important to notice that two successive events obtain different timestamps only if the clock resolution (i.e. the period between updates of the clock value) is smaller than the time interval between the two events. Three aspects can affect a physical clock:

• Skew: the difference in time between two clocks

• Drift: clocks measure time with different rates

• Drift rate: the gradual misalignment of once synchronized clocks caused by the slight inaccuracies of the time-keeping mechanism.

A hardware clock H(t) is said to be correct (Figure 2.2) if its drift rate stays within a bound ρ > 0. If H(t) is correct, the error made in measuring a real time interval $[t, t']$ (with $t' > t$) is bounded: $(1 - \rho)(t' - t) \le H(t') - H(t) \le (1 + \rho)(t' - t)$.

To guarantee that $C_i$ behaves well enough to be used for timestamping and ordering the events within process $p_i$, periodic resynchronization is needed.

2.2.1 Clock synchronization over LAN

In this section we sketch some algorithms providing clock synchronization in a LAN environment.

Ordering of Events

Lamport in [13] proposes an algorithm ensuring condition 2.2, in which a message m is sent at physical time t and received at time $t'$. Lamport denotes by $\nu_m = t' - t$ the total delay of message m. This delay is unknown to the process receiving m. However, Lamport assumes that the receiving process knows a minimum value of the delay, $\mu_m \ge 0$, while $\xi_m = \nu_m - \mu_m$ denotes the unpredictable delay of the message.

The algorithm works as follows: if $P_i$ sends a message m at physical time t,


then m is timestamped with $T_m = C_i(t)$. Upon receiving m at time $t'$, process $P_j$ sets its own clock $C_j(t')$ to $\max(C_j(t' - 0),\ T_m + \mu_m)$.

To show that condition 2.2 holds, the system is modeled as a fully connected graph, where links are used to send messages directly from one process to another. A process $P_i$ sends a message within a physical time interval $[t, t + \tau]$. The diameter of the directed graph is the smallest number d such that, for any pair of distinct processes $P_i$ and $P_j$, there is a path from $P_i$ to $P_j$ with at most d arcs.
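
The clock update rule can be sketched as follows (a minimal Java illustration of the rule just described, not the thesis implementation; the physical clock is simplified to a plain counter that only the rule itself advances):

// Sketch of Lamport's update rule: on receiving a message timestamped
// T_m, the receiver advances its clock to at least T_m + mu_m, where
// mu_m is the known minimum transmission delay of the message.
public class LamportProcess {
    private long clock; // simplified physical clock value C_j

    public LamportProcess(long initialClock) { this.clock = initialClock; }

    // Sender side: timestamp the outgoing message, T_m = C_i(t).
    public long send() { return clock; }

    // Receiver side: C_j(t') := max(C_j(t' - 0), T_m + mu_m).
    public void receive(long tm, long muM) {
        clock = Math.max(clock, tm + muM);
    }

    public long read() { return clock; }
}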

Integrating Internal and External Clock Synchronization

Cristian and Fetzer in [19] propose an algorithm integrating internal and external clock synchronization, in which nodes communicate by message passing. Each node is either a reference time server or a non reference time server.

Reference time servers have local access to the reference clock they maintain; a reference clock is correct at real time t if its difference from t is bounded by an a priori given constant ∆. The drift rate between two correct reference clocks is zero.

Non reference time servers, instead, have no local access to a reference clock; thus, they use a remote method to read the reference clocks of the reference time servers, by which they adjust, at every synchronization round, a virtual clock, defined as the sum of their hardware clock and an adjustment.

Non reference time servers maintain external and internal synchronization by reading, respectively, the reference clocks of reference time servers and the virtual clocks of other non reference time servers. The remote reading method has an error bounded by an a priori given constant Λ (maximum clock reading error).

The goals of internal clock synchronization are to bound the deviation between any two virtual clocks by a constant δ (called maximum internal deviation) and to bound the drift of the virtual clocks by a constant $\rho \ll 1$. An internal clock synchronization algorithm is correct if it satisfies both requirements (internal synchronization requirements).

The goal of external clock synchronization, instead, is to bound the deviation between a virtual clock and real time by an a priori defined constant ε (external synchronization requirement).

External/Internal clock synchronization provides three modes:

• standard all virtual clocks are externally and internally synchronized. A majority of reference time servers must be correct during an appropriate time interval, spanning from the invocation of the remote reading method to its response; it is not sufficient to assume that a majority of reference time servers is correct at any single instant, because the remote reading method takes time to return a response.

• degraded no majority of correct reference time servers exists over the required time interval; thus, virtual clocks are only internally synchronized.

• integration a majority of correct reference time servers exists in the given time interval, but not all virtual clocks are externally synchronized yet. External synchronization will eventually be achieved, unless half or more of the reference time servers become faulty.

In addition to the internal and external synchronization requirements, a correct External/Internal synchronization algorithm must satisfy:

• bounded external/internal deviation for any time t at which the service is in standard mode, the difference between the virtual clock of a non reference time server and t is bounded by the constant ε.

• integration requirement during integration mode, external synchronization will eventually be achieved unless the service falls back into degraded mode.

The proposed algorithm achieves External/Internal clock synchronization as long as a majority of reference time servers is correct, and degrades to internal synchronization if half or more of the reference time servers are faulty.

The algorithm masks arbitrary faults of non reference time servers: virtual clocks are guaranteed to be internally synchronized unless more than one third of them are faulty.

Non reference time servers use a midpoint function to compute the adjustment that synchronizes the clocks internally. Since the maximum drift rate of these internally synchronized clocks is greater than zero, their external deviation can grow between two successive adjustments; however, the amount of that increase can be bounded by a small constant D. The adjustment computed for internal synchronization is combined with the adjustment computed for external synchronization: the former ensures internal synchronization, the latter is used to change the internal adjustment by up to D to ensure that the drift rate of the clocks is zero.
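
The midpoint function is not detailed here; a common fault-tolerant variant (sketched below under the assumption that at most f of the collected readings come from faulty servers, in the style of fault-tolerant averaging, and not necessarily the exact function of [19]) discards the f smallest and f largest clock readings and returns the midpoint of the surviving extremes:

import java.util.Arrays;

// Hedged sketch of a fault-tolerant midpoint: assuming at most f of the
// collected clock readings come from faulty servers, drop the f smallest
// and the f largest, then return the midpoint of the surviving extremes.
public final class Midpoint {
    public static long midpoint(long[] readings, int f) {
        if (readings.length <= 2 * f) {
            throw new IllegalArgumentException("need more than 2f readings");
        }
        long[] sorted = readings.clone();
        Arrays.sort(sorted);
        long lo = sorted[f];                     // smallest surviving reading
        long hi = sorted[sorted.length - 1 - f]; // largest surviving reading
        return lo + (hi - lo) / 2;               // midpoint, overflow-safe
    }
}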

For example, suppose that the internal clock synchronization algorithm aims to set the virtual clock of a node to $T_i$ and the external algorithm to $T_e$. The combination of the two algorithms takes $T_i$ and moves it by up to D towards $T_e$, to ensure that the maximum external deviation does not increase between two adjustments, as the sketch after the following two cases illustrates:

• when $T_i$ and $T_e$ are at most D apart, the combination of $T_i$ and $T_e$ is defined to be $T_e$

• otherwise, the combination is $T_i$ moved by exactly D towards $T_e$.
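
This combination rule maps directly to code; the sketch below (an illustration of the two cases above with hypothetical names, not code from [19]) moves $T_i$ at most D towards $T_e$:

// Sketch of the external/internal combination rule: start from the
// internally computed target Ti and move it at most D towards the
// externally computed target Te.
public final class Combine {
    public static long combine(long ti, long te, long d) {
        long diff = te - ti;
        if (Math.abs(diff) <= d) {
            return te;                   // close enough: adopt Te directly
        }
        return ti + (diff > 0 ? d : -d); // otherwise move exactly D towards Te
    }
}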
