Low Cost Coherence Protocol for DSM Systems with Processor Consistency

Jerzy Brzeziński, Michał Szychowiak
Institute of Computing Science, Poznań University of Technology
Piotrowo 3a, 60-965 Poznań, POLAND
phone: +48 6 665 28 9, fax: +48 6 877 5 25
Jerzy.Brzezinski@cs.put.poznan.pl, Michal.Szychowiak@cs.put.poznan.pl

Abstract. Modern Distributed Shared Memory (DSM) systems offer high speed application processing by allowing the use of relaxed consistency models, such as processor consistency. Unfortunately, most of the existing coherence protocols implementing relaxed consistency in multiprocessors or loosely coupled clusters use the write-update strategy, which incurs a large communication overhead and is therefore impractical for most distributed applications. This paper presents a new home-based coherence protocol for DSM systems with processor consistency. The protocol uses a local invalidation paradigm that introduces little overhead. No additional invalidation messages are required; all coherence information is piggybacked on update messages.

This work has been partially supported by the State Committee for Scientific Research grant no. 7 T11C 036 21. © Springer-Verlag, Berlin Heidelberg, 2003. Proceedings of the 18th International Symposium on Computer and Information Sciences (ISCIS 2003, http://www.iscis3.metu.edu.tr/), Antalya, Turkey; in LNCS 2869, November 2003, pp. 916-925. The original publication is available at www.springerlink.com (http://www.springerlink.com/content/nrxma9uycmqt296j/?pi=74).

1. Introduction

Distributed Shared Memory (DSM) is a virtual memory space available to distributed processes that interact by sharing common objects maintained by the DSM system. One of the most important issues in designing DSM systems is efficiency. In practice, a replication mechanism is used to increase the efficiency of DSM object access by allowing several processes to concurrently access local replicas of a shared object. However, the concurrent existence of several replicas of the same shared object requires consistency management. The coherence protocol synchronizes each access to the replicas according to the DSM consistency criterion. Several DSM consistency models with different properties have been proposed in the literature: atomic [8], sequential [14], causal [1], PRAM [11], processor [2] and release consistency [3], among others. Some consistency models provide stronger guarantees about the synchronization of replica values than others. On the other hand, the more relaxed the consistency, the more concurrency is allowed for shared object access, resulting in better efficiency of the DSM system.

A simple strategy for implementing coherence protocols is the write-update schema, which propagates every modification of a given object to all of its replicas ([1], [4], [5], [10], [12], [15]). The exact protocols differ in the scope of the replication or in the communication paradigms used (such as write-broadcast [11]). However, the write-update strategy turns out to be very message intensive and is therefore impractical for many applications, especially in object-based DSM systems, where the read-to-write ratio is typically low. Moreover, as the goal of relaxed consistency models is to increase the performance of DSM systems, the implementation of the coherence protocol should not itself incur a large overhead. This motivates the investigation of new coherence protocols that implement relaxed consistency models for DSM with reduced overhead.

An alternative to the write-update strategy is write-invalidate. In this approach, a write operation on a replica of a given object eventually marks the other available replicas of this object as outdated (invalid). The actual update is performed only on demand, upon a read access to an invalidated replica (read miss). After the invalidation, several subsequent writes can be performed without any communication. As invalidation-based coherence protocols incur a lower write-operation overhead, they are a better fit for implementing relaxed consistency models.

This paper presents a new coherence protocol for the processor consistency model, PCG ([2],[6]). PCG is probably the most relaxed consistency model that allows full transparency of the coherence operations from the point of view of the application process while remaining useful for solving several classes of computational problems (see [9] for an example). To the best of our knowledge this is the first invalidation-based coherence protocol for the processor consistency model, and it therefore fully exploits the efficiency of the model (an invalidation-based protocol has also been implemented in Dash [6], but the consistency model used there is incomparable to PCG). Moreover, the communication overhead of the proposed protocol is further reduced, since no explicit invalidation messages are transmitted. Only local invalidation operations are performed to maintain consistency, relying on the information received with update messages. The concept of local invalidation was formerly proposed in [10] to track the causal precedence of memory accesses; however, the invalidation condition proposed there does not apply to the processor consistency model. Furthermore, the protocol of [10] invalidates more objects than strictly necessary, in contrast to the one proposed here.

This paper is organized as follows. In Section 2 we define the system model. Section 3 details the new coherence protocol. The protocol is proven correct in Section 4. Some concluding remarks are given in Section 5.

2. Basic definitions

2.1 System model

The DSM system is a distributed system composed of a finite set P of sequential processes P_1, P_2, ..., P_n that can access a finite set O of shared objects. Each shared object consists of its current state (object value) and object methods which read and modify the object state. A shared object x is identified by its system-wide unique identifier, denoted x.id. The current value of x is denoted x.value. In this paper we consider only read-write objects, i.e. we distinguish two operations on shared objects: read access and write access. We denote by r_i(x)v that the read operation returns value v of x, and by w_i(x)v that the write operation stores value v to x. Each write access results in a new object value of x.

The replication mechanism is typically used to increase the efficiency of DSM object access by allowing each process to locally access a replica of the object. However, concurrent access to different replicas of the same shared object requires consistency management. The coherence protocol synchronizes each access to the replicas according to the DSM consistency criterion. The protocol performs all communication necessary for interprocess synchronization via message passing. In this paper we assume the communication to be reliable, i.e. no message is ever lost or corrupted.
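To make this notation concrete, the following is a minimal Python sketch (ours, not from the paper; all names are illustrative) of access operations and of the history sets H_i, HW and H_x used in the definitions of Section 2.2:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Op:
        """One access operation: r_i(x)v (kind='r') or w_i(x)v (kind='w')."""
        proc: int    # issuing process P_i
        kind: str    # 'r' or 'w'
        obj: str     # shared object identifier x.id
        value: int   # value read or written
        seq: int     # position in P_i's program order

    def H_i(H, i):
        """Local history H_i: all operations issued by P_i, in local order ->_i."""
        return sorted((o for o in H if o.proc == i), key=lambda o: o.seq)

    def HW(H):
        """Write history HW: the set of all write operations in H."""
        return [o for o in H if o.kind == 'w']

    def H_x(H, x):
        """H_x: the set of all write operations performed on object x."""
        return [o for o in H if o.kind == 'w' and o.obj == x]

    # Example history: P_2 issues w_2(x)1 then w_2(y)1; P_1 issues r_1(y)1.
    H = [Op(2, 'w', 'x', 1, 0), Op(2, 'w', 'y', 1, 1), Op(1, 'r', 'y', 1, 0)]
    print(len(H_i(H, 2)), len(HW(H)), len(H_x(H, 'y')))   # 2 2 1

Each serialization â_i in the definitions below orders the set H_i ∪ HW, i.e. the process's own reads and writes plus all writes issued in the system.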
2.2 Processor consistency

In this work we investigate processor consistency (PCG), first proposed by Goodman in [7] and further formalized by Ahamad et al. in [2]. This consistency model is sufficient for several classes of distributed algorithms but requires weaker synchronization than atomic, sequential or causal consistency, thus allowing for more concurrency and efficiency. It guarantees that all processes accessing a set of shared objects perceive the same order of modifications of each single object, additionally respecting the local order of the write operations performed by each process. More formally, a PCG execution of access operations respects both PRAM consistency ([11]) and cache consistency ([6]), as defined below.

Let local history H_i denote the set of all access operations to shared objects issued by P_i, history H denote the set of all operations issued by the system (H = ⋃_{i=1..n} H_i), write history HW denote the set of all write operations (HW ⊆ H), and finally H_x denote the set of all write operations performed on x (H_x ⊆ H). By →_i we denote the local order relation of operations issued by process P_i.

Definition 1 PRAM consistency (Pipelined RAM, [11])
An execution of access operations is PRAM consistent if for each P_i there exists a serialization â_i of the set H_i ∪ HW such that:

  ∀ j = 1..n: ∀ o_1, o_2 ∈ H_j ∩ (H_i ∪ HW): o_1 →_j o_2 ⟹ o_1 â_i o_2

Following from that definition, PRAM consistency preserves the local order of the operations of each process. The order of operations issued by different processes may be arbitrary.

Definition 2 Cache consistency (coherence, [6])
An execution of access operations is cache consistent if for each P_i there exists a serialization â_i of the set H_i ∪ HW satisfying:

  ∀ x ∈ O: ∀ w_1, w_2 ∈ HW ∩ H_x: ∀ i, j = 1..n: w_1 â_i w_2 ⟹ w_1 â_j w_2

The above definition requires cache consistency to ensure for each process the same order of the write operations on every single shared object.

Definition 3 Processor consistency (PCG, [2])
An execution of access operations is processor consistent if for each P_i there exists a serialization â_i of the set H_i ∪ HW that preserves both PRAM consistency and cache consistency. We shall call this serialization the processor order.

3. The coherence protocol

Here we propose a brand new coherence protocol, named PCGp. The PCGp protocol uses the write-invalidate strategy and ensures that all local reads reflect the processor order of object modifications by invalidating all potentially outdated replicas. If, at any time, process P_i updates object x, it determines all locally stored replicas of objects that could have possibly been modified before x, and denies any further access to them (invalidates them), preventing P_i from reading inconsistent values. Any access request issued to an invalidated replica of x results in fetching its up-to-date value from a master replica of x. The process holding a master replica of x is called x's owner and shall be denoted for simplicity x.owner. Following the home-based approach, the protocol statically distributes the ownership of shared objects among all processes (static ownership), i.e. x.owner is assigned a priori and never changes (alternatively, it is called the home-node of x). From now on, we shall assume that there is a home-node for each object x, and the identity of x.owner is known.

The PCGp protocol introduces 3 different states of an object replica: writable (denoted WR), read-only (RO), and invalid (INV). The current state of x shall be denoted x.state. Only the master replica can have the WR state, and this state enables the owner to instantaneously perform any write or read access to that replica. Meanwhile, every process is allowed to instantaneously read the value of any local RO replica. However, a write access to any RO replica requires additional coherence operations. Finally, the INV state indicates that the object is not available locally for any access. Thus, an access to an INV replica requires the coherence protocol to fetch the value of the master replica of the accessed object and update the local replica.

Each process P_i manages a vector clock VT_i. A vector clock is a well-known mechanism used to track the dependency of events in distributed systems [13]. Here it is intended to track the local precedence of write operations performed by distinct processes. The i-th component of VT_i increases with each write access performed by P_i. Moreover, each replica of x is assigned a scalar timestamp denoted x.ts. This timestamp reflects the modification time of the local replica of x, and is updated on each write operation performed on that replica. In order to keep the master replica up to date, the home-node (x.owner) collects modifications performed by other processes; however, such writes do not increase x.ts.

There are four basic procedures invoked by the protocol at P_i (see also the Python sketch after Figure 1):

- inc_i(VT_i) increments VT_i[i];
- local_invalidate_i(X) invalidates all replicas with identifiers belonging to the set X, i.e. for all x :: x.id ∈ X it performs x.state := INV and x.ts := VT_i[i];
- owned_by_i(j) := { x.id :: x.owner = j } returns the set of identifiers of object replicas locally available at P_i and owned by a given process P_j;
- modified_since_i(t) := { x.id :: x.ts > t } returns the set of identifiers of object replicas at P_i having modification timestamps newer than a given t. During communication with some process P_j this procedure is used to detect objects possibly overwritten or invalidated recently by P_i. In fact, P_j can only be interested in modifications of objects not owned by itself, since the master replicas are always kept up to date; therefore P_i will actually use the derived set: outdated_i(t, j) := modified_since_i(t) \ owned_by_i(j).

In fact, each process P_i manages two vector clocks: VT_i, whose j-th component reflects the number of modifications performed by P_j and known to P_i (P_i will use this value to ask P_j for more recent modifications), and VTsent_i, whose j-th component reflects the number of modifications performed by P_i that have already been communicated to P_j (P_i will use this value to remember the latest modification P_j can be aware of).

Actions of the PCGp protocol are formally presented in Figure 1 and can be described by the following rules:

Rule 1) If process P_i wants to read object x available locally (i.e. in any state other than INV), the read operation is performed instantaneously.

Rule 2) If P_i wants to write to a WR replica of x, the write operation is performed instantaneously.

Rule 3) If P_i wishes to gain read access to an object x unavailable locally (x.state = INV), i.e. a read fault occurs, the protocol sends a request message REQ(i, x.id, VT_i[x.owner]) to x.owner, say P_k. The owner sends back an update message R_UPD(x, VT_k[k], INVset), where INVset contains the identifiers of objects modified by P_k since VT_i[k] (which was received in REQ) and possibly not yet updated by P_i (excluding objects owned by P_i and the requested x, as x is going to be updated on receipt of the R_UPD message), i.e.: INVset := outdated_k(VT_i[k], i) \ {x.id}. On receipt of R_UPD, first local_invalidate_i(INVset) is processed. A local replica of x is created with the value received in R_UPD and its state set to RO. Finally, the requested read operation is performed.

Rule 4) If P_i wishes to perform a write access to a replica of x in either the INV or the RO state (i.e. a write fault occurs), the protocol issues a write request W_UPD(i, x, VT_i[i], INVset) to x.owner = P_k. The INVset contains the identifiers of objects modified by P_i since the previous update sent to P_k, i.e. INVset := outdated_i(VTsent_i[k], k). On this request, P_k performs local_invalidate_k(INVset) and sends back to P_i an acknowledgment message ACK(t), where t is the current timestamp of the master replica of x. Then, w_i(x)v is performed and P_i's clock is advanced with inc(VT_i). The local replica of x at P_i is timestamped with t.

    on r_i(x)v:
        if x.state = INV
            send REQ(i, x.id, VT_i[x.owner]) to x.owner
            wait for R_UPD(x, t, INVset)
            local_invalidate_i(INVset)
            VT_i[x.owner] := t
            x.state := RO
        return x.value

    on w_i(x)v:
        inc(VT_i)
        x.value := v
        if x.state ≠ WR
            INVset := outdated_i(VTsent_i[x.owner], x.owner)
            send W_UPD(i, x, VT_i[i], INVset) to x.owner
            wait for ACK(t)
            x.ts := t
            VTsent_i[x.owner] := VT_i[i]
        else
            x.ts := VT_i[i]

    on REQ(j, x, t):
        INVset := outdated_i(t, j) \ {x.id}
        send R_UPD(x, VT_i[i], INVset) to j
        VTsent_i[j] := VT_i[i]

    on W_UPD(j, x, t, INVset):
        inc(VT_i)
        local_invalidate_i(INVset)
        VT_i[j] := t
        find y such that y.id = x.id
        y.value := x.value
        send ACK(y.ts) to j

Figure 1. PCGp protocol operations
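For readers who prefer executable code, below is a compact Python sketch of the Figure 1 handlers. This is our illustrative reconstruction, not the authors' implementation: message passing is simulated by direct method calls, acknowledgments are modeled as return values, and every process is assumed to already hold an INV placeholder replica for each object it may access.

    WR, RO, INV = "WR", "RO", "INV"

    class Replica:
        """A local replica of a shared object: x.id, x.owner, x.state, x.value, x.ts."""
        def __init__(self, oid, owner, state, value=None, ts=0):
            self.id, self.owner, self.state = oid, owner, state
            self.value, self.ts = value, ts

    class Process:
        def __init__(self, pid, n, system):
            self.pid, self.system = pid, system   # 'system' maps pid -> Process
            self.VT = [0] * (n + 1)               # VT_i[j]: writes by P_j known to P_i
            self.VTsent = [0] * (n + 1)           # VTsent_i[j]: own writes reported to P_j
            self.replicas = {}                    # local replicas keyed by object id

        # -- basic procedures --
        def inc(self):
            self.VT[self.pid] += 1

        def local_invalidate(self, invset):
            for oid in invset:
                if oid in self.replicas:
                    r = self.replicas[oid]
                    r.state, r.ts = INV, self.VT[self.pid]

        def outdated(self, t, j):
            # modified_since_i(t) \ owned_by_i(j)
            return {oid for oid, r in self.replicas.items() if r.ts > t and r.owner != j}

        # -- Rules 1 and 3: read access --
        def read(self, oid):
            x = self.replicas[oid]
            if x.state == INV:                    # read fault: fetch from the home-node
                value, t, invset = self.system[x.owner].on_req(
                    self.pid, oid, self.VT[x.owner])
                self.local_invalidate(invset)
                self.VT[x.owner] = t
                x.value, x.state = value, RO
            return x.value

        # -- Rules 2 and 4: write access --
        def write(self, oid, v):
            x = self.replicas[oid]
            self.inc()
            x.value = v
            if x.state != WR:                     # write fault: update the master replica
                invset = self.outdated(self.VTsent[x.owner], x.owner)
                x.ts = self.system[x.owner].on_w_upd(
                    self.pid, oid, v, self.VT[self.pid], invset)
                self.VTsent[x.owner] = self.VT[self.pid]
            else:                                 # write on the master replica itself
                x.ts = self.VT[self.pid]

        # -- home-node handlers --
        def on_req(self, j, oid, t):              # serves REQ, returns R_UPD fields
            invset = self.outdated(t, j) - {oid}
            self.VTsent[j] = self.VT[self.pid]
            return self.replicas[oid].value, self.VT[self.pid], invset

        def on_w_upd(self, j, oid, v, t, invset): # serves W_UPD, returns ACK(y.ts)
            self.inc()
            self.local_invalidate(invset)
            self.VT[j] = t
            y = self.replicas[oid]
            y.value = v                           # note: y.ts deliberately unchanged
            return y.ts

    # Minimal wiring: P_2 owns x; P_1 holds an INV placeholder replica of it.
    system = {}
    p1, p2 = Process(1, 2, system), Process(2, 2, system)
    system.update({1: p1, 2: p2})
    p2.replicas["x"] = Replica("x", 2, WR, value=0)
    p1.replicas["x"] = Replica("x", 2, INV)
    p2.write("x", 1)                              # Rule 2: instantaneous master write
    print(p1.read("x"))                           # Rule 3: read fault, fetches 1

Note how the INVset piggybacked on R_UPD and W_UPD messages is the only coherence information ever exchanged; no standalone invalidation message appears anywhere in the sketch.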

during the update of x. This is due to the fact that y.ts remains equal to 2 at y.owner, indicating that the local value of y at P is still processor consistent. It can be easily seen in this example that the processor consistent operation r (y) is not causally consistent (also not sequentially consistent). P r (y) x.state=inv 2 r (x) 2 4 r (y) P 2 owner(x,y) P 3 VT 2 = REQ(y.id,VT [2]=) w 2 (x) w 2 (y) 2 R_UPD(y,2,{x.id}) W_UPD(y,, ) w 3 (y) 3 REQ(x.id,VT [2]=2) w 2 (x)2 4 R_UPD(x,4, ) x.value= x.owner=2 x.ts=vt 2 [2]= modified_since 2 ()={x.id,y.id} y.value= y.owner=2 y.ts=vt 2 [2]=2 y.value= y.owner=2 y.ts=2 (!) modified_since 2 (2)={x.id} x.value=2 x.owner=2 x.ts=4 Figure 2. A sample execution of protocol operations 4. Correctness of the protocol We argue that the coherence protocol presented in Section 3 always maintains the shared memory in a state which is processor consistent. To prove this claim we shall show that in any execution of access operations the PCGp protocol serializes the operations in the processor order. First, we remark that in a given execution of the PCGp protocol on P i there is exactly one total ordering i of H i (as processes are sequential). There can be, however, several serializations a i of H i HW possible. Definition 4 Legal serialization Serialization a i of H i HW is legal if it preserves local ordering i ( i a i ), i.e.: o i o 2 o a i o 2 2, i o o H In the rest of this section, for each H i HW we choose arbitrarily a serialization that is legal and can be constructed by extending i with results of reception of R_UPD and W_UPD messages processed in a given execution of PCGp. We denote it by i. We show that such serializations preserve both PRAM and cache consistency, i.e. fulfill the processor consistency condition of Definition 3. We start by proving that each i preserves cache consistency. By o i t (x) we denote an operation on x performed by P i at time t.

We start by proving that each ↪_i preserves cache consistency. By o_i^{t}(x) we denote an operation on x performed by P_i at time t.

Lemma 1
If P_k is the home-node for object x (x.owner = k), then:

  ∀ w_1, w_2 ∈ HW ∩ H_x: ∀ i = 1..n: w_1 ↪_i w_2 ⟹ w_1 ↪_k w_2

Proof.
1) First, let us consider two writes w_1 = w_i(x)a and w_2 = w_i(x)b issued to x by the same process. The case i = k is trivial, therefore we assume i ≠ k. In PCGp, every write w_i(x) issued to a non-master replica must update the master replica at the home-node, and every update must be acknowledged before the next w_i(x) is issued. Therefore, the order of any two such write operations in ↪_k agrees with the local order of the issuer, i.e.: w_i(x)a →_i w_i(x)b ⟹ w_i(x)a ↪_k w_i(x)b, and from this w_i(x)a ↪_i w_i(x)b ⟹ w_i(x)a ↪_k w_i(x)b.
2) Now, let us consider w_1 = w_j(x)a, w_2 = w_i(x)b and w_j(x)a ↪_i w_i(x)b. The result of any w_j^{t'}(x)v, for j ≠ i ≠ k, is made visible to P_i in PCGp only by the reception of R_UPD during some r_i^{t''}(x)v, where t'' > t', which is possible only if x was invalidated at P_i before t''. Operation r_i^{t''}(x)v will therefore return a value brought from the master replica of x, enforcing an ordering which conforms to ↪_k. Yet again, we have w_j(x)a ↪_i w_i(x)b ⟹ w_j(x)a ↪_k w_i(x)b. ∎

Lemma 2
Serialization ↪_i preserves cache consistency, according to Definition 2.

Proof. Let us consider any x ∈ O and any two w_1, w_2 ∈ HW ∩ H_x, and assume w_1 ↪_k w_2, where k = x.owner. From Lemma 1, ↪_i agrees with ↪_k, i.e. w_1 ↪_i w_2 for any i = 1..n, and this ensures cache consistency. ∎

Lemma 3
Serialization ↪_i preserves PRAM consistency, according to Definition 1.

Proof. Let us consider o_1, o_2 ∈ H_i ∪ HW, where o_1 = o_j(x)b and o_2 = o_j(y)c, such that o_1 →_j o_2. To prove the claim we shall show that o_1 ↪_i o_2. We consider the following cases:
1) i = j: as ↪_i is legal, the claim follows from Definition 4;
2) x = y: from Lemma 2, ↪_i preserves cache consistency, thus for all i = 1..n: o_1 ↪_i o_2;
3) i ≠ j and x ≠ y: we have o_1 = w_j^{t'}(x)b and o_2 = w_j^{t''}(y)c, where t' < t'', as a consequence of w_j^{t'}(x)b →_j w_j^{t''}(y)c. Let a be the previous value of x read by P_i. We prove the claim by contradiction: we assume that the order of o_1 and o_2 is reversed on P_i, i.e. w_j^{t'}(x)b →_j w_j^{t''}(y)c ∧ w_j^{t''}(y)c ↪_i w_j^{t'}(x)b. Such an inverse order is possible only if there exists r_i^{t**}(x)a such that w_j^{t''}(y)c ↪_i r_i^{t*}(y)c ↪_i r_i^{t**}(x)a ↪_i w_j^{t'}(x)b. It follows from w_j^{t''}(y)c ↪_i r_i^{t*}(y)c that P_i reads the value c of y updated by an R_UPD message received prior to time t*. The INVset carried in that message must contain x, as w_j^{t'}(x)b →_j w_j^{t''}(y)c (from our assumption). From Rule 3) of the PCGp protocol description, x is invalidated then. Therefore, the first subsequent r_i(x) will cause the PCGp protocol to update x from the master replica of x. Because w_j^{t'}(x)b was performed at time t', with t' < t'' < t*, the master replica value at time t* reflects w_j^{t'}(x)b (the master replica value can at that point be b or more recent, but not a any more). So, r_i^{t*}(y)c ↪_i r_i^{t**}(x)a is impossible in PCGp for any t** > t*, and this contradicts the assumption. ∎

Theorem 1
PCGp implements processor consistency, according to Definition 3.

Proof. Let â_i = ↪_i for each i = 1..n. It follows directly from Lemma 2 and Lemma 3 that â_i preserves processor consistency. ∎

The protocol does not provide stronger consistency, neither causal nor sequential: the example in Figure 2 shows operation r_1(y)1, performed by the protocol, which is not causally consistent and not sequentially consistent.

5. Conclusions

The coherence protocol PCGp proposed in this paper is the first invalidation-based protocol for processor consistency PCG. It uses the local invalidation paradigm to reduce the overhead of coherence communication, preserving processor consistency at low cost. No invalidation messages are sent; all coherence information is piggybacked on the update messages sent on read misses.

The presented approach admits several further interesting extensions, concerning reliability issues for instance. Currently, we are designing an extension of the PCGp protocol aimed at providing fault tolerance of the DSM system in spite of multiple node and link failures. Another open issue is the relaxation of the wait-for-acknowledgment condition, which may additionally increase the efficiency of the protocol.

References

[1] M. Ahamad, P. W. Hutto and R. John, "Implementing and Programming Causal Distributed Shared Memory", Proc. 11th Int'l Conf. on Distributed Computing Systems, May 1991, pp. 274-281.
[2] M. Ahamad, R. A. Bazzi, R. John, P. Kohli and G. Neiger, "The Power of Processor Consistency", Technical Report GIT-CC-92/34, Georgia Institute of Technology, Atlanta, December 1992.
[3] C. Amza, A. L. Cox, S. Dwarkadas, P. Keleher, H. Lu, R. Rajamony, W. Yu and W. Zwaenepoel, "TreadMarks: Shared Memory Computing on Networks of Workstations", IEEE Computer, 29(2), February 1996, pp. 18-28.
[4] C. Amza, A. L. Cox and W. Zwaenepoel, "Data Replication Strategies for Fault Tolerance and Availability on Commodity Clusters", Proc. Int'l Conf. on Dependable Systems and Networks (DSN 2000), June 2000.
[5] R. Christodoulopoulou, R. Azimi and A. Bilas, "Dynamic Data Replication: An Approach to Providing Fault-Tolerant Shared Memory Clusters", Proc. 9th IEEE Symposium on High-Performance Computer Architecture (HPCA-9), February 2003.
[6] K. Gharachorloo, D. Lenoski, J. Laudon, P. Gibbons, A. Gupta and J. Hennessy, "Memory Consistency and Event Ordering in Scalable Shared-Memory Multiprocessors", Proc. 17th Int'l Symposium on Computer Architecture, May 1990, pp. 15-26.
[7] J. R. Goodman, "Cache Consistency and Sequential Consistency", Technical Report 61, IEEE Scalable Coherent Interface Working Group, March 1989.
[8] M. Herlihy and J. Wing, "Linearizability: A Correctness Condition for Concurrent Objects", ACM Transactions on Programming Languages and Systems, 12(3), July 1990, pp. 463-492.
[9] L. Higham and J. Kawash, "Bounds for Mutual Exclusion with only Processor Consistency", Proc. 14th Int'l Symposium on Distributed Computing (DISC 2000), October 2000, LNCS 1914, Springer, pp. 44-58.
[10] R. John and M. Ahamad, "Causal Memory: Implementation, Programming Support and Experiences", Technical Report GIT-CC-93/, Georgia Institute of Technology, 1993.
[11] R. J. Lipton and J. S. Sandberg, "PRAM: a Scalable Shared Memory", Technical Report CS-TR-180-88, Princeton University, September 1988.
[12] M. C. Little and S. K. Shrivastava, "Integrating Group Communication with Transactions for Implementing Persistent Replicated Objects", Lecture Notes in Computer Science vol. 1752, 2000, pp. 238-249.
[13] F. Mattern, "Virtual Time and Global States of Distributed Systems", Proc. Int'l Workshop on Parallel and Distributed Algorithms, 1988.
[14] M. Raynal, "Sequential Consistency as Lazy Linearizability", Technical Report PI-1437, IRISA Rennes, January 2002.
[15] J. S. Sandberg, "Design of PRAM Network", Technical Report CS-TR-254-90, Princeton University, April 1990.