Efficient Partial Replication to Improve Locality in Large-Scale Systems

Jean-Michel Hélary, Alessia Milani

IRISA, Campus de Beaulieu, 35042 Rennes-cedex, France.
DIS, Università di Roma La Sapienza, Via Salaria 113, Roma, Italia.

helary@irisa.fr, Alessia.Milani@dis.uniroma1.it

Abstract

This work addresses data replication and consistency maintenance in large-scale systems. Based on the observation that partial replication of data reduces network traffic and preserves locality, particularly in large-scale systems, it focuses on the constraints necessary to maintain consistency when partial replication is used. More precisely, it points out that the transfer of information necessary to manage consistency requirements might cancel the decrease in network load and the locality properties offered by partial replication. It is shown that the causal consistency criterion suffers from this drawback but that, on the contrary, the PRAM consistency criterion allows efficient partial replication. The usefulness of PRAM is illustrated with the Bellman-Ford shortest path algorithm, which takes advantage of graph locality to partially replicate its variables and is solvable in PRAM distributed shared memories.

1 Introduction

Large-scale systems spanning geographically distant sites are environments naturally suited to distributed applications supporting collaboration [4]. As information and users proliferate through the Internet and intranets, large-scale systems must cope with the challenge of scale. To be scalable, a system should accommodate a large number of processes and should allow applications to manage a great deal of data. CPU processing, local storage and network bandwidth are the traditional metrics for scalability. In order to reduce network load, locality has to be exploited. In other words, locality awareness enables a system to limit the impact of local computation on wide-area performance, both during regular operation and under fault conditions [8]. It has been proven that caching and replication are two effective techniques to improve locality. Replicating data closer to their users reduces network traffic and the latency of accesses to such data: a user experiences low latency in accessing data if the data are close to it.

Moreover, locality-awareness is crucial for the viability of large internets, since costs are proportional to the actual distance of the interacting parties [1]. In this sense, to improve performance and allow system scalability, it is desirable that each process collaborating in a distributed application performs its task by accessing only a subset of the system's information, the one closest to it. For these reasons, partial replication seems to be the most appropriate replication technique in large-scale systems.

On the other hand, replication requires consistency management, that is, copies of data have to be kept consistent, and this can entail the following drawback: the transfer of information necessary to manage consistency requirements cancels the decrease in network load and the locality properties offered by partial replication. Consequently, in designing distributed applications, it is essential to analyze the entire infrastructure, that is, the algorithm, the replication technique, the consistency criterion necessary to run the given application, etc. [8].

This work addresses data replication and consistency maintenance in large-scale systems, in order to determine whether a consistency criterion suffers from the above drawback. Section 2 presents the main result obtained so far, concerning the possibility of efficiently implementing partial replication.

In particular, it is shown that causally consistent memories [2] suffer from the above drawback, while PRAM consistent memories [6] do not. This result is based on a graph model of data replication. Finally, Section 3 shows how a distributed application solving the widespread problem of routing can be efficiently implemented with partial replication on top of a distributed shared memory, thus preserving locality properties.

2 The problem of efficient partial replication implementation in large-scale systems

The distributed shared memory abstraction is traditionally realized through a distributed memory consistency system (MCS) on top of a message-passing system providing a communication primitive with a certain quality of service in terms of ordering and reliability [3]. Such a system consists of a collection of nodes. On each node there is an application process and an MCS process.

An application process invokes an operation through its local MCS process, which is in charge of the actual execution of the operation. Assuming a partially replicated environment means that, corresponding to the fact that each application process ap_i accesses only a subset of the shared variables X, denoted X_i, each MCS process p_i manages a replica of a variable x iff x ∈ X_i. Our aim is to determine which MCS processes are concerned by information on the occurrence of operations performed on a variable x in the system. More precisely, given a variable x, we say that an MCS process p_i is x-relevant if, in at least one computation, it has to transmit some information on the occurrence of operations performed on variable x in that computation in order to enforce the consistency criterion. Of course, each process managing a replica of x is x-relevant, but the problem is that other processes could also be x-relevant. The main result of this section is a characterization of x-relevant processes when considering causally consistent shared memories. From this result, it follows that in causally consistent shared memories all MCS processes are x-relevant, for every variable x, while in PRAM consistent shared memories, only the MCS processes managing a replica of x are x-relevant.

This result is based on the notion of a share graph. Given a distribution of variables {X_1, . . . , X_N}, the share graph SG is an undirected (symmetric) graph whose vertices are the processes, and where an edge (i, j) exists between p_i and p_j iff there exists a variable x replicated on both p_i and p_j (i.e. x ∈ X_i ∩ X_j).

Each variable x defines a sub-graph C(x) of SG, spanned by the processes on which x is replicated. This subgraph C(x) is a clique, i.e. there is an edge between every pair of its vertices. The share graph is the union of all cliques C(x).
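As an illustration, the construction above can be sketched in a few lines of Python; the variable distribution used in the example is hypothetical, not taken from the paper:

```python
from itertools import combinations

def share_graph(distribution):
    """Build the share graph SG from a variable distribution.

    distribution: dict mapping process id -> set of variable names X_i.
    Returns the set of undirected edges (i, j), with i < j.
    """
    edges = set()
    for i, j in combinations(sorted(distribution), 2):
        if distribution[i] & distribution[j]:  # p_i and p_j share a variable
            edges.add((i, j))
    return edges

def clique(distribution, x):
    """C(x): the processes on which variable x is replicated."""
    return {i for i, vars_i in distribution.items() if x in vars_i}

# Hypothetical distribution over four processes.
dist = {1: {"x", "y"}, 2: {"x", "z"}, 3: {"y", "z"}, 4: {"z"}}
print(share_graph(dist))   # edges induced by shared variables
print(clique(dist, "x"))   # {1, 2}
```

Note that SG is exactly the union of the cliques C(x): every edge produced by `share_graph` belongs to C(x) for at least one shared variable x.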

Given a variable x, we call an x-hoop any path of SG between two distinct processes in C(x) whose intermediate vertices do not belong to C(x) (Figure 1). Formally:

Definition 1 (hoop). Given a variable x and two processes p_a and p_b in C(x), we say that there is an x-hoop between p_a and p_b if there exists a path [p_a = p_0, p_1, . . . , p_k = p_b] in SG such that:

i) p_h ∉ C(x) (1 ≤ h ≤ k − 1), and

ii) each consecutive pair (p_{h−1}, p_h) shares a variable x_h such that x_h ≠ x (1 ≤ h ≤ k).

Let us remark that the notion of hoop depends only on the distribution of the variables on the processes, i.e. on the topology of the corresponding share graph. In particular, it is independent of any particular computation.

[Figure omitted in this text version: a clique C(x) whose processes are connected by an x-hoop passing through processes outside C(x).]

Figure 1: An x-hoop

Theorem 1. Given a variable x, a process pi is x-relevant if, and only if, it belongs to C(x) or it belongs to an x-hoop.

The idea underlying this result is the following: in a causally consistent execution, an indirect causal dependency can be created between a write operation performed by a process p_a on variable x and another (read or write) operation performed by a process p_b on the same variable. This indirect dependency is created by operations performed on other variables, shared by the intermediary processes. As a consequence, unless a particular variable distribution avoiding all hoops is assumed, each MCS process must transmit control information regarding all the shared data in order to guarantee causal consistency in a partially replicated environment.
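To make Theorem 1 concrete, the following self-contained sketch computes the x-relevant set as C(x) plus every intermediate vertex of an x-hoop, by brute-force enumeration of simple paths (suitable only for small examples; the distribution shown is hypothetical):

```python
def x_relevant(distribution, x):
    """Processes that are x-relevant according to Theorem 1:
    C(x) plus every process lying on some x-hoop.

    distribution: dict mapping process id -> set of variable names X_i.
    """
    procs = sorted(distribution)
    # Adjacency of the share graph SG.
    adj = {i: {j for j in procs
               if j != i and distribution[i] & distribution[j]}
           for i in procs}
    cx = {i for i in procs if x in distribution[i]}
    relevant = set(cx)

    def dfs(node, target, path):
        if node == target:
            relevant.update(path[1:-1])  # intermediate vertices of the hoop
            return
        for nxt in adj[node]:
            if nxt in path:
                continue
            if nxt in cx and nxt != target:
                continue                 # intermediates must avoid C(x)
            if not (distribution[node] & distribution[nxt]) - {x}:
                continue                 # each edge must share some x_h != x
            dfs(nxt, target, path + [nxt])

    for a in cx:
        for b in cx:
            if a < b:
                dfs(a, b, [a])
    return relevant

# Hypothetical distribution: p1 and p4 replicate x; p2 and p3 form a hoop.
dist = {1: {"x", "y"}, 2: {"y", "z"}, 3: {"z", "w"}, 4: {"x", "w"}}
print(sorted(x_relevant(dist, "x")))  # [1, 2, 3, 4]: all processes x-relevant
```

In this example the hoop p_1–p_2–p_3–p_4 makes even the processes that never replicate x become x-relevant, which is exactly the drawback the theorem captures.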

On the contrary, PRAM memories [6] can be efficiently implemented in a partially replicated environment, because the PRAM consistency criterion is only concerned with "direct relations" between pairs of "adjacent" processes, and not with transitivity due to intermediary processes [7]. Thus, even if hoops exist, they cannot create such indirect dependencies. In other words, it is not necessary to propagate global information to maintain PRAM consistency. In the next section we propose, as a case study, a distributed implementation of the Bellman-Ford shortest path algorithm on top of a PRAM distributed shared memory, efficiently exploiting partial replication of variables.

3 Implementing the Bellman-Ford algorithm exploiting partial replication in large-scale systems

Routing is a common problem in large-scale systems. Accordingly, in the following we propose a distributed implementation of the Bellman-Ford algorithm [5] to compute the minimum path from a source node to every other node in the system. In this implementation, each node hosts an application process and its associated MCS process.

In such a computation, locality-awareness is essential since it is desirable that each process only manipulates information shared with its neighbors.

Moreover, since we want to reduce network load in order to ensure scalability, we exploit locality in accessing shared data. In other words, we assume that each process has a local copy of the shared data it is interested in, to compute its task. This means considering partial replication to efficiently distribute the computation.

The system (network) is composed of N processes ap_1, . . . , ap_N and is modelled as a graph G = (V, Γ), where V is the set of vertices, one for each process in the system, and Γ is the set of edges (i, j, w_ij) such that ap_i and ap_j belong to V and there exists a link between i and j with cost w_ij. Let Γ^{-1}(i) = {j ∈ V | (i, j, w_ij) ∈ Γ} be the set of predecessors of process ap_i. The algorithm proceeds in stages. For each successive k ≥ 0:

i) x^k_s = 0, for all k,

ii) ∀ i ≠ s, x^{k+1}_i = min_{j ∈ Γ^{-1}(i)} [x^k_j + w_ji],

where s is the source node and initially x^0_i = ∞ for all i ≠ s. It is well known that, if there are no negative-cost cycles, the algorithm converges in at most N steps.
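The stage recurrence above can be sketched sequentially in Python before distributing it; the edge list below is a hypothetical example, and node ids are 0-based:

```python
import math

def bellman_ford_stages(n, edges, s):
    """Synchronous Bellman-Ford: apply the stage recurrence
    x_i^{k+1} = min over predecessors j of (x_j^k + w_ji)
    for N stages. edges: list of directed links (j, i, w_ji)."""
    x = [math.inf] * n
    x[s] = 0
    for _ in range(n):  # N stages suffice without negative-cost cycles
        new_x = list(x)
        for j, i, w in edges:
            if x[j] + w < new_x[i]:
                new_x[i] = x[j] + w
        new_x[s] = 0    # the source stays at 0 in every stage
        x = new_x
    return x

# Hypothetical 4-node network: 0 -> 1 -> 2 -> 3, with a shortcut 0 -> 2.
edges = [(0, 1, 1), (1, 2, 2), (0, 2, 5), (2, 3, 1)]
print(bellman_ford_stages(4, edges, 0))  # [0, 1, 3, 4]
```

Each stage reads only the previous stage's values (`x`) and writes fresh ones (`new_x`), mirroring the synchronization points introduced below for the distributed version.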

Without loss of generality, we assume that process ap_1 is the source node. We denote by x_i the current minimum distance from node 1 to node i. Then, to compute all the minimum paths from process ap_1 to every other process in the system, the processes cooperate by reading and writing the following set of shared variables: X = {x_1, x_2, . . . , x_N}. Moreover, since the algorithm is iterative, in order to ensure liveness we need to introduce synchronization points in the computation. In particular, we want to ensure that at the beginning of each step, each process ap_i reads the values written by its predecessors Γ^{-1}(i) during the previous step. Thus each process knows that after at most N iterations it has computed the shortest path. To this aim, we introduce the following set of shared variables: S = {k_1, k_2, . . . , k_N}.

Each application process ap_i only accesses a subset of the shared variables. More precisely, ap_i accesses x_h ∈ X and k_h ∈ S such that h = i or ap_h is a predecessor of ap_i.

Since each variable x_i and k_i is written only by one process, namely ap_i, it is easy to see that the algorithm in Figure 2 runs correctly on a PRAM shared memory, and thus also on a causally consistent shared memory. But, as seen in the previous section, PRAM allows efficient partial replication, and thus scalability, while causal consistency does not. The former memory will obviously be chosen.


Minimum Path

    k_i := 0;
    if (i == 1) then x_i := 0 else x_i := ∞;
    while (k_i < N) do
        while (⋁_{h ∈ Γ^{-1}(i)} (k_h < k_i)) do skip;   (busy-wait until every predecessor has completed step k_i)
        x_i := min{ x_j + w_ji : j ∈ Γ^{-1}(i) };
        k_i := k_i + 1

Figure 2: pseudocode executed by process ap_i
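The per-process loop of Figure 2 can be simulated with one thread per application process, sharing the replicated variables x and k; this is a sketch under assumptions of this note (0-based ids, plain Python lists standing in for the MCS replicas), not the authors' implementation:

```python
import math
import threading
import time

def run(n, pred, w, source=0):
    """Simulate Figure 2. pred[i]: predecessors of i;
    w[(j, i)]: cost of the link j -> i."""
    x = [math.inf] * n   # shared distance estimates x_1..x_N
    k = [0] * n          # shared step counters k_1..k_N
    x[source] = 0

    def process(i):
        for _ in range(n):                        # while k_i < N
            while any(k[h] < k[i] for h in pred[i]):
                time.sleep(0.001)                 # busy-wait on predecessors
            if i != source and pred[i]:
                x[i] = min(x[j] + w[(j, i)] for j in pred[i])
            k[i] += 1
        # After N steps, x[i] holds the shortest distance from the source.

    threads = [threading.Thread(target=process, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return x

# Hypothetical 4-node network: 0 -> 1 -> 2 -> 3, with a shortcut 0 -> 2.
pred = {0: [], 1: [0], 2: [0, 1], 3: [2]}
w = {(0, 1): 1, (0, 2): 5, (1, 2): 2, (2, 3): 1}
print(run(4, pred, w))  # [0, 1, 3, 4]
```

Because the distance estimates only decrease and each process waits until its predecessors have completed its current step, reading a "too fresh" value from a faster predecessor is harmless: every x[j] read is at most the synchronous stage value, so N local steps still suffice.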

References

[1] I. Abraham and D. Malkhi. Principles of Locality-Aware Networks for Locating Nearest Copies of Data. Technical Report Leibnitz Center TR 2003-84, School of Computer Science and Engineering, The Hebrew University, 2003.

[2] M. Ahamad, G. Neiger, J.E. Burns, P. Kohli and P.W. Hutto. Causal Memory: Definitions, Implementation and Programming. Distributed Computing, 9(1): 37-49, 1995.

[3] H. Attiya and J. Welch. Distributed Computing (second edition). Wiley, 2004.

[4] O. Babaoglu, A. Bartoli and G. Dini. Replicated File Management in Large-Scale Distributed Systems. Technical Report UBLCS-94-16, June 1994 (revised January 1995).

[5] R. E. Bellman. On a routing problem. Quarterly of Applied Mathematics, XVI(1): 87-90, 1958.

[6] R. Lipton and J. Sandberg. PRAM: a Scalable Shared Memory. Technical Report CS-TR-180-88, Princeton University, 1988.

[7] M. Raynal and A. Schiper. A suite of formal definitions for consistency criteria in distributed shared memories. Technical Report PI 968, IRISA, 1995. http://www.irisa.fr/bibli/publi/pi/1995/968/968.html

[8] B. Y. Zhao, A. D. Joseph and J. Kubiatowicz. Locality Aware Mechanisms for Large-scale Networks. Proc. of the Workshop on Future Directions in Distributed Computing, 2002.
