Adaptive Migratory Scheme for Distributed Shared Memory
Jai-Hoon Kim and Nitin H. Vaidya
Department of Computer Science, Texas A&M University


Adaptive Migratory Scheme for Distributed Shared Memory 1

Jai-Hoon Kim and Nitin H. Vaidya
Department of Computer Science
Texas A&M University
College Station, TX
{jhkim,vaidya}@cs.tamu.edu

Technical Report, November 1996

Abstract

This paper presents an adaptive migratory scheme for software Distributed Shared Memory (DSM). Under migratory sharing, the message that sends a copy of a page to a remote node (on which a page fault has occurred) is shortly followed by an invalidation request from that remote node. Our adaptive migratory scheme eliminates this invalidation overhead by self-invalidating the local copy when a copy of the page is sent. Each node can independently detect a migratory memory access pattern and self-invalidate its local copy of a page using local information only. Experimental results show that performance is improved by dynamically selecting the migratory protocol.

Keywords: distributed shared memory, release memory consistency, adaptive protocol, migratory protocol, competitive update protocol, performance evaluation, cost analysis model.

1 This work is supported in part by the National Science Foundation under grant MIP.

1 Introduction

This report presents an adaptive migratory DSM algorithm that can dynamically self-invalidate the local copy of a page under a migratory memory access pattern. The adaptive migratory algorithm is implemented on top of our previous work [9], which adapts to memory access patterns to reduce coherency overhead by adjusting an "update limit". The adaptive DSM of [9] outperforms a competitive update protocol, as well as an invalidate protocol and an update protocol, on many types of applications. However, [9] cannot adapt to a migratory memory access pattern. Under migratory sharing, the message that sends a copy of a page to a remote node (on which a page fault has occurred) is shortly followed by an invalidation request from that remote node. Our adaptive migratory algorithm eliminates this invalidation overhead by self-invalidation [11] when a copy of the page is sent to the remote node. The proposed scheme allows each node to independently choose (at run-time) a different protocol (migratory, invalidate, or competitive update) for each page, using local information only. Experimental results show that performance is improved by dynamically selecting the migratory protocol.

Adaptive protocols for migratory sharing [4, 12, 14] and self-invalidation [11] have been proposed previously for hardware cache-coherence schemes. [4, 12, 14] dynamically identify migratory shared data and switch to a migratory protocol in order to reduce overhead; [11] predicts which blocks will be invalidated and performs self-invalidation. Dynamic page placement schemes [3, 10], including page migration, have been implemented in OS-level NUMA memory management; these schemes dynamically select either migration or replication for page placement according to the application and architecture.
Our adaptive migratory scheme is implemented for a software DSM and differs from the others as follows:

Design domain: [4] is designed for bus-based and directory-based cache-coherent multiprocessors; [12, 14, 11] are based on directory-based cache-coherent multiprocessors. In a bus-based multiprocessor, requests by other nodes (read/write miss, invalidate) can be observed on the bus, and in a directory-based cache-coherent multiprocessor, the home node maintains directory entries. In these architectures, global

state (the number of cached copies, the last invalidator of a block) can be known by some or all nodes. However, these schemes cannot be used directly in a DSM where no actual owner exists, such as Munin [2] or Quarks [8]: no node has global information about the copies of a page. Our scheme is incorporated into a software DSM in which memory coherency is maintained in a totally distributed manner. Our adaptive migratory scheme does not need global information; each node tries to detect migratory memory access patterns using local information only. [3, 10] are proposed for dynamic page placement in NUMA architectures. Their dynamic page placement policies cannot be applied to DSM due to an architectural difference: on a page fault in a NUMA architecture, a node can access remote memory without allocating the page in local memory, whereas in a DSM remote memory access is not allowed (a copy of the page must be allocated in local memory).

Protocol switch: Since our scheme is based on a cost-comparison approach, the migratory protocol is chosen only when it is deemed optimal. If another protocol is optimal under migratory sharing in special cases (e.g., two nodes updating shared memory alternately), that protocol (the competitive update protocol, in this case) is chosen. [4, 14, 12] select a migratory protocol whenever the memory access pattern is migratory sharing, and [10] needs its policy parameters tuned per application and architecture.

Basic protocol: [4, 14, 11] are based on an invalidate protocol, and [12] is based on a competitive update protocol. [3, 10] are based on page placement schemes for NUMA multiprocessors. Our scheme is based on an adaptive DSM in which each node can independently choose (at run-time) a different protocol for each copy of a page.

Hybrid protocol: In [4, 12, 14], all copies of a block enter or exit migratory mode together. In our scheme, each node decides independently whether to use the migratory protocol.
Therefore, some nodes can use the migratory protocol while other nodes use another protocol (invalidate or competitive update). This hybrid feature can be implemented by slightly modifying the proposed scheme. Table 1 summarizes each scheme.

Scheme     Design domain   Protocols (Schemes)        Features
[4]        Dir or Bus      Inv + Mig
[14]       Dir             Inv + Mig
[12]       Dir             Comp + Mig
[11]       Dir             Inv + Self-Inv
[3, 10]    MM-NUMA         Remote + Replicate + Mig   CC
[9]        SDSM            Inv + Comp                 CC
Proposed   SDSM            Inv + Comp + Mig           CC + TD

Bus = bus-based cache-coherent multiprocessor
Dir = directory-based cache-coherent multiprocessor
MM-NUMA = memory management system for NUMA multiprocessors
SDSM = software Distributed Shared Memory
Inv = invalidate protocol
Mig = migratory protocol (scheme)
Remote = remote memory access
Replicate = page replication
Comp = competitive update protocol
CC = cost comparison
TD = totally distributed

Table 1. Adaptive Protocols.

2 Basic Adaptive Protocol

The adaptive migratory scheme is based on [9], in which each node can independently choose an invalidate protocol or a competitive update protocol for each copy of a page. The basic adaptive protocol [9] is summarized in this section:

1. Collect statistics over a "sampling period". (Accesses to each memory page are divided into sampling periods.)

2. Using the statistics, determine the protocol that minimizes the "cost" for each page P.

3. Use the minimum-cost protocol to maintain the consistency of page P over the next sampling period.

4. Repeat the above steps.

Essentially, the proposed implementation uses statistics collected during the current execution to predict the optimal consistency protocol for the near future. This prediction should be accurate provided that the memory access patterns change relatively infrequently. In [9], we present an adaptive scheme that chooses between the invalidate protocol and the competitive update protocol [7, 5, 6]. The competitive update protocol is defined by a "threshold" parameter; we rename the threshold the "limit". The proposed adaptive scheme collects run-time data on the number and size of messages; this data is used periodically to determine a new value of the limit for each copy of a page. The protocol is completely distributed in that each node independently determines the limit to be used for each page in its local memory. (Thus, different nodes may choose different limits for the same page.) The objective of the protocol [9] is to minimize the "cost" metric of interest. Two cost metrics are considered: (i) the number of messages, and (ii) the amount of data transferred. Let us focus on the accesses to a particular page P as observed at a node A. These accesses can be partitioned into "segments". A new segment begins with the first access by node A following an update to the page by another node. (Segments are defined from the point of view of each node; therefore, for the same page, different nodes may observe different segments.) For a segment in which the number of updates by other nodes is U, we evaluate the cost of the invalidate protocol (the competitive update protocol with limit L = 0) and of the update protocol (the competitive update protocol with limit L = ∞).
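The segment construction just described can be made concrete with a short sketch; the trace encoding and function name below are illustrative, not part of the protocol in [9]:

```python
def segments(trace):
    """Partition node A's view of accesses to one page into segments.

    trace: a sequence of events, "A" for an access by node A and "U"
    for an update of the page by some other node.  A new segment begins
    with the first access by node A that follows a remote update, so
    each segment's U value is the number of remote updates it contains.
    Returns the list of U values, one per segment.
    """
    seg_updates = []      # U value of each completed segment
    updates = 0           # remote updates seen in the current segment
    after_update = False  # has a remote update occurred in this segment?
    started = False       # has node A accessed the page yet?
    for event in trace:
        if event == "U":
            updates += 1
            after_update = True
        else:  # an access by node A
            if not started:
                started = True
                updates = 0        # updates before A's first access
                after_update = False
            elif after_update:
                seg_updates.append(updates)  # close the current segment
                updates = 0
                after_update = False
    if started:
        seg_updates.append(updates)          # the final segment
    return seg_updates

# Three segments as seen by node A: U = 3, then U = 1, then U = 0.
print(segments("AAUUUAAUA"))  # -> [3, 1, 0]
```

From the per-segment U values, each node can estimate the average U that drives the protocol choice described next.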
The critical number of updates by other nodes, U_critical, at which limit L = 0 and limit L = ∞ incur the same cost in a segment, is computed. If U > U_critical, the invalidate protocol has a lower cost; if U < U_critical, the update protocol performs better. We choose the competitive update protocol with limit L = 3 instead of the update protocol when U < U_critical, because the competitive update protocol performs better

than the update protocol in general. With these modifications, the basic adaptive scheme that attempts to minimize the "cost" can be summarized as follows: each node collects data over a "sampling period" for each local page and estimates the average value of U. At the end of the sampling period, if U ≥ U_critical, the invalidate protocol (L = 0) is chosen for the next sampling period for that page; otherwise, the competitive update protocol is chosen.

3 Adaptive Migratory Scheme

3.1 Migratory Sharing

In migratory sharing, a page is accessed by a single node at any given time. A page is modified within a critical section to maintain mutual exclusion, so every access to the page is ordered by a sequence of acquire, shared memory access, and release. Migratory sharing is formally defined in [14] by the regular expression

    (R_i)(R_i)* (W_i)(R_i + W_i)*  (R_j)(R_j)* (W_j)(R_j + W_j)*  ...

where R_i and W_i denote a read and a write, respectively, by node i. To detect migratory sharing, the following conditions can be checked [4, 14, 12]:

1. i ≠ j, i.e., the writing node is not the same as the previous writer.

2. The number of cached copies is two (for [4, 14]), or nodes i and j are the only ones that have read the block since node i wrote it (for [12]).

3.2 Motivation

Previous schemes [4, 12, 14] for migratory sharing are designed for bus-based and/or directory-based cache-coherent multiprocessors. In these architectures, global state

(the number of cached copies, the last invalidator of a block) can be known by some or all nodes. However, these schemes cannot be used directly in a DSM where no actual owner exists, such as Munin [2] or Quarks [8]; no node has global information about the copies of a page. A new adaptive scheme for migratory sharing needs to be designed for software DSM (or similar architectures) in which memory coherency is maintained in a totally distributed manner; each node needs to detect migratory memory access patterns with local information only. Our scheme is implemented based on [9], in which each node can independently choose an invalidate protocol or a competitive update protocol for each copy of a page. However, [9] cannot adapt to a migratory memory access pattern. The proposed adaptive protocol includes the migratory protocol as one of the choices, in addition to the invalidate and competitive update protocols. Under a migratory memory access pattern, our previous adaptive scheme [9] will usually choose the invalidate protocol if the number of nodes (N) is not very small (U ≈ N - 1 ≥ U_critical). Consider the following scenario in a migratory memory access pattern:

1. Only node A has a copy of page P.

2. Node B requests page P on a read miss and receives it from node A.

3. Node B writes the page and sends an update message to node A.

4. Node A invalidates its copy of page P on receiving the update message from node B (node A has selected the invalidate protocol for page P).

Protocol overhead can be reduced if we can anticipate that node A will receive an update message from node B before node A accesses page P again.

Cost Analysis (Number of Messages)

We now consider the number of messages as the cost metric. Two messages can be saved (sending an update message and a responding message) by self-invalidation after sending a copy of page P to node B.
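For the scenario above, the two-message saving can be tallied explicitly; a minimal sketch, where F is the number of times the page request is forwarded before reaching the owner, and the message names are descriptive labels only:

```python
# Per-segment message tally for node A's copy of page P, assuming the
# page request is forwarded F times before reaching the current owner
# (F = 4 and the message names are illustrative, not protocol terms).
F = 4

invalidate_msgs = (
    ["forward page request"] * F   # locate the current owner of the page
    + ["send page to B",           # owner ships its copy to node B
       "acknowledge page",         # node B confirms receipt
       "update from B",            # node B's write is propagated to A
       "negative ack from A"]      # node A invalidates and refuses updates
)

migratory_msgs = (
    ["forward page request"] * F
    + ["send page to B",           # node A self-invalidates on sending,
       "acknowledge page"]         # so no update/negative-ack pair follows
)

assert len(invalidate_msgs) == F + 4                    # M_invalidate
assert len(migratory_msgs) == F + 2                     # M_migratory
assert len(invalidate_msgs) - len(migratory_msgs) == 2  # two messages saved
```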
If the memory access pattern is migratory sharing, the number of messages required by the migratory protocol in one segment (M_migratory) is computed as follows:

M_migratory = F + 2,

where F is the average number of times the request for the page is forwarded (due to the dynamic distributed ownership algorithm) before reaching the owner of the page. At the beginning of each segment, a page fault occurs, which requires F messages to forward the page request and two additional messages to receive the page and send an acknowledgement. The number of messages required by the invalidate protocol (M_invalidate) is F + 4 [9]. (The invalidate protocol requires two more messages than the migratory protocol, for receiving an invalidation message and sending a negative acknowledgement.) Figure 1 shows an analytical comparison of the number of messages required for one segment under migratory memory access patterns (we assume F = 4, and L = 3 for the competitive update protocol). Note that only the cost of memory accesses (reads and writes) is considered; the cost of synchronization (acquire) is not. The migratory protocol requires two fewer messages than the invalidate protocol, as described above. This figure suggests that the migratory protocol is the best choice when U ≥ 4 in a migratory memory access pattern, and hence that the migratory protocol should be included as one of the choices of our adaptive protocol. However, even in a migratory memory access pattern, the update and competitive update protocols are the best choice if U ≤ 2. This special case occurs when 2 or 3 nodes access shared memory in a migratory pattern. When U = 3, three protocols (migratory, update, and competitive update) require the same number of messages.

Cost Analysis (Amount of Data Transferred)

In the above analysis, we considered the number of messages as the cost. Now, we consider the amount of data transferred as the cost metric.
If the memory access pattern is migratory, the amount of data transferred by the migratory protocol in one segment (D_migratory) is computed as follows:

D_migratory = (F + 1) * p_control + p_page,

where p_control is the size of a control message (page request, acknowledgement of an update, etc.), and p_page is the size of the message required to send a page from one node to

another. The amount of data transferred by the invalidate protocol (D_invalidate) is p_update + (F + 2) * p_control + p_page, where p_update is the average size of an update message [9]. (At the beginning of each segment, a page fault occurs, requiring a data transfer of size F * p_control to forward the page request, p_page to receive the page, and p_control to send the acknowledgement. After that, a data transfer of size p_update + p_control is required to receive the update message and send the negative acknowledgement.)

[Figure 1: Number of Messages per Segment (in Migratory Memory Access Pattern). Curves for the invalidate, update, competitive, and migratory protocols versus the number of updates (U) in a segment.]

3.3 Implementation

From the above motivation, we add two features to our adaptive migratory scheme:

1. Detect migratory memory access pattern: Node A collects statistics over a sampling period to detect whether its local copy of page P is expected to be invalidated before node A accesses page P. Node A starts counting the number of updates by other nodes after sending page P to some other node, until node A accesses page P again. If the number of updates (between node A's sending page P and node A's first access to page P) for

each segment in one sampling period is greater than or equal to U_critical at node A, then it may cost less to self-invalidate the local copy of page P than to maintain it, because the local copy is expected to be invalidated before node A accesses page P (migratory sharing).

2. Self-invalidation: Node A self-invalidates its local copy of page P when sending the page to another node, if node A detected the migratory memory access pattern for page P in step 1.

Our adaptive scheme can enter and exit migratory mode by executing the following algorithm at each node for each page:

    page_reply:                       // a page request is received from another node
        send page;
        pagesent = 1;
        if (mig < 0 && segment == 0)  // new sampling period
            mig = 0;
        reset page access permission; // to detect a local access before an update
                                      // message is received from another node
        if (migratory protocol)       // migratory protocol has been used
            self-invalidate page;

    page_fault:
        if (page does not exist)
            request_page;
        if (updatecnt > 0) {
            segment++;
            updateinsampling += updatecnt;
            if (updatecnt >= U_critical && mig >= 0 && pagesent)
                mig++;                // still seems to be a migratory pattern
            else
                mig = -1;             // does not seem to be a migratory pattern
            if (segment == Ns) {      // end of sampling period
                if (updateinsampling / segment >= U_critical) {
                    if (mig == Ns)    // deemed migratory in all segments
                        choose migratory protocol;
                    else
                        choose invalidate protocol;
                } else
                    choose competitive update protocol;
                segment = 0;
                updateinsampling = 0;
                mig = -1;
            }
        } else                        // a local access occurred between the page
            mig = -1;                 // sending and receiving an update message
        pagesent = 0;

Refer to [9] for the detailed implementation of the counters updatecnt, segment, etc. On a page reply, the counter mig is reset at the start of a new sampling period (segment = 0). The counter mig checks whether the number of updates by other nodes between a page sending (page reply) and the first subsequent local access is greater than or equal to U_critical for all segments of a sampling period at node A. If the local copy of the page is already in migratory mode, it is self-invalidated. The flag pagesent is used to detect the first local access after a page sending. On a page fault, (1) a page request message is sent if no copy of the page exists, and (2) the number of updates is checked. If the number of updates is not less than the threshold (U_critical) and pagesent is set (the first local access after a page sending), mig is incremented (a migratory access pattern is expected and the migratory protocol appears optimal); otherwise, mig is set to -1. When

one sampling period ends (segment = N_s), the appropriate protocol is chosen: (1) if the average number of updates is less than the threshold value (U_critical), the competitive update protocol is chosen; (2) if the average number of updates is not less than the threshold value (U_critical), the invalidate or migratory protocol is chosen according to the value of mig (if mig = N_s, the migratory protocol is chosen; else, the invalidate protocol is chosen).

The above condition for selecting the migratory protocol is very strict:

1. A page sending is required in every segment of the sampling period, and the number of updates between the page sending and the first local access must always be at least the critical value (U_critical); i.e., mig = N_s.

2. The average number of updates per segment must be at least the critical value (U_critical).

If node A does not send page P in any one of the segments of a sampling period, the migratory protocol is not chosen (condition 1). Condition 2 may not be necessary, because condition 1 is tighter than condition 2 (if condition 1 is satisfied, condition 2 is also satisfied). The adaptive migratory scheme can be summarized as follows:

1. Choose the competitive update protocol if U < U_critical; otherwise, choose the invalidate or the migratory protocol. If the competitive update protocol is chosen, protocol selection is finished; otherwise, go to step 2.

2. Choose the migratory protocol if the first condition above (mig = N_s) is satisfied; otherwise, choose the invalidate protocol.

Figure 2 shows examples of how the appropriate protocol is chosen according to the memory access patterns for page P at node A. We assume the sampling period (N_s) is 2 and the threshold number of updates (U_critical) is 4. In the first scenario (Figure 2 (a)), the competitive update protocol is chosen because the average number of updates over the sampling period (U_avg = 1.5) is less than the threshold value (U_critical = 4).
In the third scenario (Figure 2 (c)), the migratory protocol is chosen because (1) the average number of updates over the sampling period (U_avg = 4) is not less than the critical value (U_critical = 4), and (2) the number of updates between the page sending and the first local

access is not less than the critical value (U_critical = 4) in all segments of the sampling period. In the second scenario (Figure 2 (b)), however, the invalidate protocol is chosen because the number of updates between the page sending and the first local access is 0 (less than the critical value, U_critical = 4) in the first segment, and node A does not send page P in the second segment (either condition alone is sufficient to choose the invalidate protocol).

4 Performance Evaluation

Experiments were performed to evaluate the performance of the adaptive DSM by running applications on an implementation of the adaptive protocol. We implemented the adaptive protocol by modifying an existing DSM, Quarks (Beta release 0.8) [1, 8]. This section presents the experimental results. We evaluated the adaptive scheme using synthetic applications (qtest, ProdCons, and Reader/Writer) as well as other applications (Floyd-Warshall, SOR, QSORT, IS, Matmult, and Gauss-Jacobi). qtest is a simple shared-memory application based on a program available with the Quarks release [8]: all nodes access the shared data concurrently, and a process acquires mutual exclusion before each access and releases it afterwards. We measured the cost (i.e., number of messages and size of data transferred) by executing different instances of the synthetic application, as described below. Floyd-Warshall, QSORT, IS, and Gauss-Jacobi were developed at Texas A&M University; SOR and Matmult are available with the Quarks release [8]; ProdCons and Reader/Writer are based on qtest. The sampling period (N_s) is chosen to be 2 for all applications.

Results for qtest Application

The body of the first instance of the qtest program (named qtest1) is as follows:

    qtest1:
        repeat NLOOP times {
            acquire(lock_id);
            for (n = 1 to NSIZE)
                shmem[n]++;       // shared memory access
            release(lock_id);

[Figure 2: Protocol Selection (N_s = 2, U_critical = 4). Timelines of remote accesses by other nodes and local accesses to page P over two segments: (a) the competitive update protocol is selected, (b) the invalidate protocol is selected, (c) the migratory protocol is selected.]

        }

Each node performs the above task. All the shared data accessed in this application is confined to a single page. Each node executes the repeat loop 300 times (NLOOP = 300); 300 iterations were sufficient for the results to converge. The size of the shared data (NSIZE) is 2048 bytes, all in one page, the page size being 4096 bytes. (The next experiment considers a small NSIZE.) The adaptive protocol initializes L to 3 for each page at each node. At the end of each sampling period (N_s = 2), each node estimates U and p_update (the average size of an update message) for the page and selects the appropriate L; this L is used during the next sampling period. For this application, Figures 3 and 4 show the measured cost as the number of nodes (N) increases. The costs are plotted on a per-"transaction" basis. A transaction denotes a sequence of operations (acquire, shared memory access, and release) in one loop of the qtest1 main routine. The curve for the adaptive scheme in Figure 3 is plotted using the heuristic that minimizes the number of messages; the curve in Figure 4 is plotted using the heuristic that minimizes the amount of data transferred. In Figure 3, a curve named after a protocol denotes the number of messages required by that protocol, and "#updates" denotes the average number of updates per segment (U) calculated over the entire application. "adaptive" denotes the scheme in [9]; "adaptive+" denotes the adaptive migratory protocol. As the number of nodes N increases, the average number of updates per segment (U) increases proportionally. The adaptive migratory protocol performs best because qtest1 exhibits the migratory memory access pattern. The adaptive migratory protocol requires approximately two fewer messages per transaction than the adaptive protocol, which chooses the invalidate protocol.
However, the adaptive migratory protocol requires the same number of messages as the adaptive protocol when N ≤ 4 (U < U_critical), because both protocols choose the competitive update protocol (L = 3). The cost curve for the adaptive migratory protocol (and for the invalidate protocol) is not flat, while it is flat in the cost analysis shown in Figure 1. The reason is that the cost of synchronization (acquire) increases with the number of nodes (N); however, the synchronization cost, which is not shown in Figure 1, is approximately the same for all protocols at a given number of nodes (N).
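The flat analytical curves can be reproduced from the per-segment counts of Section 3.2 (M_migratory = F + 2, M_invalidate = F + 4). The update-protocol count of 2U messages (one update plus one acknowledgement per remote update) is our assumption, chosen to be consistent with the reported tie of the migratory and update protocols at U = 3:

```python
# Per-segment message counts from the cost analysis, with F = 4 as in
# Figure 1.  m_update is an assumed model (2 messages per remote update),
# not a formula stated in the cost analysis itself.
F = 4

def m_migratory(U):  return F + 2   # fault, page, ack; then self-invalidate
def m_invalidate(U): return F + 4   # ... plus an update and a negative ack
def m_update(U):     return 2 * U   # assumed: update + ack per remote update

# Migratory is cheapest for U >= 4, update for U <= 2, and the
# migratory and update protocols tie at U = 3:
assert m_update(3) == m_migratory(3) == 6
assert all(m_migratory(U) < min(m_invalidate(U), m_update(U))
           for U in range(4, 16))
assert all(m_update(U) < m_migratory(U) for U in (1, 2))
```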

Figure 4 compares the amount of data transferred per transaction. Since the qtest1 application modifies a large amount of data (NSIZE = 2048 bytes), the update protocol requires a larger amount of data transfer as the number of nodes (N) increases. However, the invalidate protocol requires a nearly constant amount of data transfer (per transaction) for all N. The adaptive migratory protocol chooses the appropriate protocol, including the migratory protocol, for all values of N, thereby minimizing the amount of data transferred. The second experiment was performed with the main loop (qtest2) shown below:

    qtest2:
        repeat NLOOP times {
            acquire(lock_id);
            if (random() < read_ratio)    // 0 <= random() <= 1
                for (n = 1 to NSIZE)
                    r_value = shmem[n];   // read
            else
                for (n = 1 to NSIZE)
                    shmem[n] = w_value;   // write
            release(lock_id);
        }

All the shared data accessed in qtest2 is confined to a single page. For this experiment, we use a small amount of shared-data access per iteration of the repeat loop (NSIZE = 4). Additionally, each iteration of the repeat loop either reads or writes the shared data depending on whether a random number (random()) is smaller than the read ratio or not; this allows us to control the frequency of write accesses to the shared data. 8 nodes access the shared data 100 times each (NLOOP = 100). (We observed that the results converge quite quickly.) Figure 5 presents the number of messages per transaction (i.e., acquire, shared memory access, and release). The adaptive migratory protocol requires fewer messages than the adaptive protocol when the read ratio is less than 20% because qtest2 tends to exhibit a migratory memory access pattern at low read ratios. Figure 6 compares the amount of data transferred per transaction. Since the qtest2 application modifies a small amount of data (NSIZE = 4 bytes), both the adaptive protocol

[Figure 3: qtest1: Average Number of Updates (U) and Messages per Transaction versus the Number of Nodes (N), for the invalidate, update, competitive, adaptive, and adaptive+ protocols.]

[Figure 4: qtest1: Amount of Data (Bytes) Transferred per Transaction versus the Number of Nodes (N).]

and the adaptive migratory protocol choose a competitive update protocol with a large update limit (L). Therefore, the two adaptive protocols require a small amount of data transfer.

Results for Other Applications

We now evaluate our adaptive scheme by executing eight additional applications (Floyd-Warshall, SOR, ProdCons, QSORT, IS, Reader/Writer, Matmult, and Gauss-Jacobi) on 8 nodes. Floyd-Warshall is an all-pairs-shortest-paths algorithm. (We use 128 vertices as input.) SOR is the Successive Over-Relaxation method, which executes a simple iterative relaxation algorithm. (We use a grid as input.) ProdCons is an implementation of a simple producer/consumer model: producers make data that is used by consumers. (We execute a total of 1,600 "transactions" for ProdCons; a transaction denotes a sequence of acquire, shared memory access, and release, similar to the definition for qtest.) QSORT is the quicksort algorithm. (We sort 65,536 elements.) IS is an integer sorting algorithm. (We use 3,200 keys with a range of 100.) Reader/Writer is implemented by modifying qtest to evaluate performance under time-varying memory access patterns: execution time is divided into 4 stages, and the memory access pattern differs in each stage; a node can be either a reader or a writer for each page depending on the execution stage, and the size of the data written differs per stage. (A total of 1,920 transactions are executed.) Matmult is a matrix multiplication program that computes A^n. (We compute A^10, where A is a matrix.) Gauss-Jacobi is a linear system solver using an iteration method. (We solve a linear system of size 128.) Figures 7 and 8 show performance comparisons under each cost metric (the number of messages or the amount of data transferred). Costs are normalized with respect to the protocol with the maximum cost for each application. Floyd-Warshall, SOR, Matmult, and Gauss-Jacobi use barriers for synchronization.
In these types of applications, the adaptive migratory protocol (+) performs similarly to the adaptive protocol and shows no improvement over it, because Floyd-Warshall, SOR, Matmult, and Gauss-Jacobi do not exhibit the migratory memory access pattern. ProdCons and QSORT use lock/unlock for a task queue, IS uses lock/unlock for ranking, and Reader/Writer uses lock/unlock for exclusive object access. These applications exhibit

[Figure 5: qtest2: Average Number of Updates (U) and Messages per Transaction versus the Read Ratio, for the invalidate, update, competitive, adaptive, and adaptive+ protocols.]

[Figure 6: qtest2: Amount of Data (Bytes) Transferred per Transaction versus the Read Ratio.]

migratory memory access patterns. (In Reader/Writer, some pages show migratory memory access patterns.) In three applications (ProdCons, IS, and Reader/Writer), the adaptive migratory protocol (+) requires the fewest messages. However, in QSORT, the adaptive migratory protocol shows no performance gain over the invalidate protocol, because the list of elements to be sorted may not exhibit a migratory memory access pattern at page granularity. We evaluated the performance of our adaptive protocol on a synthetic Reader/Writer application in which the memory access patterns (read-to-write ratio, access period, amount of data written per transaction, etc.) are time-varying. The results show that the adaptive protocol performs well by adapting to time-varying memory access patterns; the adaptive migratory protocol performs better than the adaptive protocol because some pages exhibit a migratory memory access pattern. Overall, the experimental results show that our adaptive migratory scheme performs well under migratory sharing, and suggest that it can predict migratory sharing when memory access patterns do not change frequently.

5 Conclusion and Future Work

Our objective is to design an adaptive migratory scheme for DSM that can adapt to time-varying patterns of accesses to the shared memory (including migratory sharing). Our approach continually gathers statistics at run-time and periodically determines the appropriate protocol for each copy of each page. The choice of protocol is based on the "cost" metric to be minimized; the cost metrics considered in this paper are the number and size of the messages required to execute an application using the DSM implementation. Our adaptive approach determines, at run-time, the cost of each candidate consistency protocol and uses the protocol that appears to have the smaller cost.
The proposed adaptive approach is illustrated here by means of an adaptive migratory scheme that chooses either the migratory, invalidate, or competitive update protocol for each copy of a page; the choice changes with time as the access patterns change. The paper presents an experimental evaluation of the adaptive migratory scheme using an implementation based on Quarks DSM [8]. Experimental results from the implementation suggest that the proposed adaptive approach can indeed reduce the cost.

Figure 7: Cost Comparisons (Number of Messages)
Figure 8: Cost Comparisons (Amount of Data Transferred)

Further work is needed to fully examine the effectiveness of the proposed approach. The cost metrics considered in this paper are the number and size of messages; other cost metrics need to be considered. In particular, the impact of our heuristics on application execution time needs to be evaluated. The adaptive approach (based on cost comparison) presented here could also be combined with ideas developed by other researchers (e.g., [13]) to obtain further improvements in DSM performance; as yet, we have not explored this possibility.

Acknowledgements

We thank John Carter and D. Khandekar at the University of Utah for making the Quarks source code available in the public domain, and Akhilesh Kumar for the Floyd-Warshall source code.

References

[1] J. Carter, D. Khandekar, and L. Kamb, "Distributed shared memory: Where we are and where we should be headed," in Proc. of the Fifth Workshop on Hot Topics in Operating Systems, pp. 119-122, May.

[2] J. B. Carter, Efficient Distributed Shared Memory Based On Multi-Protocol Release Consistency. PhD thesis, Rice University, Sept.

[3] A. Cox and R. Fowler, "The implementation of a coherent memory abstraction on a NUMA multiprocessor: Experience with PLATINUM," in Proc. of the 12th ACM Symposium on Operating Systems Principles, pp. 32-44.

[4] A. Cox and R. Fowler, "Adaptive cache coherency for detecting migratory shared data," in Proceedings of the 20th Annual International Symposium on Computer Architecture, pp. 98-108, May.

[5] F. Dahlgren and P. Stenstrom, "Using write caches to improve performance of cache coherence protocols in shared-memory multiprocessors," Journal of Parallel and Distributed Computing, vol. 26, pp. 193-210, Apr.

[6] H. Grahn, P. Stenstrom, and M. Dubois, "Implementation and evaluation of update-based cache protocols under relaxed memory consistency models," Future Generation Computer Systems, vol. 11, pp. 247-271, June.

[7] A. Karlin, M. Manasse, L. Rudolph, and D. Sleator, "Competitive snoopy caching," in Proc. of the 27th Annual Symposium on Foundations of Computer Science, pp. 244-254.

[8] D. Khandekar, "Quarks: Portable DSM on Unix," tech. rep., University of Utah.

[9] J.-H. Kim and N. H. Vaidya, "A cost-comparison approach for adaptive distributed shared memory," in ACM International Conference on Supercomputing (ICS), pp. 44-51, May.

[10] R. LaRowe, C. Ellis, and L. Kaplan, "The robustness of NUMA memory management," in Proc. of the 13th ACM Symposium on Operating Systems Principles, pp. 137-151.

[11] A. Lebeck and D. Wood, "Dynamic self-invalidation: Reducing coherence overhead in shared-memory multiprocessors," in Proceedings of the 22nd Annual International Symposium on Computer Architecture. To appear.

[12] H. Nilsson and P. Stenstrom, "An adaptive update-based cache coherence protocol for reduction of miss rate and traffic," tech. rep., Lund University. To appear in Parallel Architectures and Languages Europe, July.

[13] U. Ramachandran, G. Shah, A. Sivasubramaniam, A. Singla, and I. Yanasak, "Architectural mechanisms for explicit communication in shared memory multiprocessors," in Supercomputing '95, Dec.

[14] P. Stenstrom, M. Brorsson, and L. Sandberg, "An adaptive cache coherence protocol optimized for migratory sharing," in Proceedings of the 20th Annual International Symposium on Computer Architecture, pp. 109-118, May.


More information

Kevin Skadron. 18 April Abstract. higher rate of failure requires eective fault-tolerance. Asynchronous consistent checkpointing oers a

Kevin Skadron. 18 April Abstract. higher rate of failure requires eective fault-tolerance. Asynchronous consistent checkpointing oers a Asynchronous Checkpointing for PVM Requires Message-Logging Kevin Skadron 18 April 1994 Abstract Distributed computing using networked workstations oers cost-ecient parallel computing, but the higher rate

More information

London SW7 2BZ. in the number of processors due to unfortunate allocation of the. home and ownership of cache lines. We present a modied coherency

London SW7 2BZ. in the number of processors due to unfortunate allocation of the. home and ownership of cache lines. We present a modied coherency Using Proxies to Reduce Controller Contention in Large Shared-Memory Multiprocessors Andrew J. Bennett, Paul H. J. Kelly, Jacob G. Refstrup, Sarah A. M. Talbot Department of Computing Imperial College

More information

CSE Traditional Operating Systems deal with typical system software designed to be:

CSE Traditional Operating Systems deal with typical system software designed to be: CSE 6431 Traditional Operating Systems deal with typical system software designed to be: general purpose running on single processor machines Advanced Operating Systems are designed for either a special

More information

Jukka Julku Multicore programming: Low-level libraries. Outline. Processes and threads TBB MPI UPC. Examples

Jukka Julku Multicore programming: Low-level libraries. Outline. Processes and threads TBB MPI UPC. Examples Multicore Jukka Julku 19.2.2009 1 2 3 4 5 6 Disclaimer There are several low-level, languages and directive based approaches But no silver bullets This presentation only covers some examples of them is

More information

between Single Writer and Multiple Writer 1 Introduction This paper focuses on protocols for implementing

between Single Writer and Multiple Writer 1 Introduction This paper focuses on protocols for implementing Software DSM Protocols that Adapt between Single Writer and Multiple Writer Cristiana Amza y, Alan L. Cox y,sandhya Dwarkadas z, and Willy Zwaenepoel y y Department of Computer Science Rice University

More information

S = 32 2 d kb (1) L = 32 2 D B (2) A = 2 2 m mod 4 (3) W = 16 2 y mod 4 b (4)

S = 32 2 d kb (1) L = 32 2 D B (2) A = 2 2 m mod 4 (3) W = 16 2 y mod 4 b (4) 1 Cache Design You have already written your civic registration number (personnummer) on the cover page in the format YyMmDd-XXXX. Use the following formulas to calculate the parameters of your caches:

More information

Chapter 5. Thread-Level Parallelism

Chapter 5. Thread-Level Parallelism Chapter 5 Thread-Level Parallelism Instructor: Josep Torrellas CS433 Copyright Josep Torrellas 1999, 2001, 2002, 2013 1 Progress Towards Multiprocessors + Rate of speed growth in uniprocessors saturated

More information

Data-Centric Consistency Models. The general organization of a logical data store, physically distributed and replicated across multiple processes.

Data-Centric Consistency Models. The general organization of a logical data store, physically distributed and replicated across multiple processes. Data-Centric Consistency Models The general organization of a logical data store, physically distributed and replicated across multiple processes. Consistency models The scenario we will be studying: Some

More information

Joe Wingbermuehle, (A paper written under the guidance of Prof. Raj Jain)

Joe Wingbermuehle, (A paper written under the guidance of Prof. Raj Jain) 1 of 11 5/4/2011 4:49 PM Joe Wingbermuehle, wingbej@wustl.edu (A paper written under the guidance of Prof. Raj Jain) Download The Auto-Pipe system allows one to evaluate various resource mappings and topologies

More information

Chapter 5 (Part II) Large and Fast: Exploiting Memory Hierarchy. Baback Izadi Division of Engineering Programs

Chapter 5 (Part II) Large and Fast: Exploiting Memory Hierarchy. Baback Izadi Division of Engineering Programs Chapter 5 (Part II) Baback Izadi Division of Engineering Programs bai@engr.newpaltz.edu Virtual Machines Host computer emulates guest operating system and machine resources Improved isolation of multiple

More information

Optimization of thread affinity and memory affinity for remote core locking synchronization in multithreaded programs for multicore computer systems

Optimization of thread affinity and memory affinity for remote core locking synchronization in multithreaded programs for multicore computer systems Optimization of thread affinity and memory affinity for remote core locking synchronization in multithreaded programs for multicore computer systems Alexey Paznikov Saint Petersburg Electrotechnical University

More information

Page 1. Cache Coherence

Page 1. Cache Coherence Page 1 Cache Coherence 1 Page 2 Memory Consistency in SMPs CPU-1 CPU-2 A 100 cache-1 A 100 cache-2 CPU-Memory bus A 100 memory Suppose CPU-1 updates A to 200. write-back: memory and cache-2 have stale

More information

Cache Injection on Bus Based Multiprocessors

Cache Injection on Bus Based Multiprocessors Cache Injection on Bus Based Multiprocessors Aleksandar Milenkovic, Veljko Milutinovic School of Electrical Engineering, University of Belgrade E-mail: {emilenka,vm@etf.bg.ac.yu, Http: {galeb.etf.bg.ac.yu/~vm,

More information

CSE 120 Principles of Operating Systems

CSE 120 Principles of Operating Systems CSE 120 Principles of Operating Systems Spring 2018 Lecture 15: Multicore Geoffrey M. Voelker Multicore Operating Systems We have generally discussed operating systems concepts independent of the number

More information

Foundations of Computer Systems

Foundations of Computer Systems 18-600 Foundations of Computer Systems Lecture 21: Multicore Cache Coherence John P. Shen & Zhiyi Yu November 14, 2016 Prevalence of multicore processors: 2006: 75% for desktops, 85% for servers 2007:

More information

Distributed Shared Memory (DSM) Introduction Shared Memory Systems Distributed Shared Memory Systems Advantage of DSM Systems

Distributed Shared Memory (DSM) Introduction Shared Memory Systems Distributed Shared Memory Systems Advantage of DSM Systems Distributed Shared Memory (DSM) Introduction Shared Memory Systems Distributed Shared Memory Systems Advantage of DSM Systems Distributed Shared Memory Systems Logical location of M 1 Distributed Shared

More information

Scalable Cache Coherence

Scalable Cache Coherence Scalable Cache Coherence [ 8.1] All of the cache-coherent systems we have talked about until now have had a bus. Not only does the bus guarantee serialization of transactions; it also serves as convenient

More information

CONSISTENCY MODELS IN DISTRIBUTED SHARED MEMORY SYSTEMS

CONSISTENCY MODELS IN DISTRIBUTED SHARED MEMORY SYSTEMS Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 9, September 2014,

More information

source3 Backbone s1 s2 R2 R3

source3 Backbone s1 s2 R2 R3 Fast and Optimal Multicast-Server Selection Based on Receivers' Preference Akihito Hiromori 1, Hirozumi Yamaguchi 1,Keiichi Yasumoto 2, Teruo Higashino 1, and Kenichi Taniguchi 1 1 Graduate School of Engineering

More information

The Use of Instruction-Based Prediction in Hardware Shared- Memory

The Use of Instruction-Based Prediction in Hardware Shared- Memory The Use of Instruction-Based Prediction in Hardware Shared- Memory Stefanos Kaxiras University of Wisconsin-Madison kaxiras@cs.wisc.edu Abstract In this paper we propose Instruction-based Prediction as

More information

Contents. Preface xvii Acknowledgments. CHAPTER 1 Introduction to Parallel Computing 1. CHAPTER 2 Parallel Programming Platforms 11

Contents. Preface xvii Acknowledgments. CHAPTER 1 Introduction to Parallel Computing 1. CHAPTER 2 Parallel Programming Platforms 11 Preface xvii Acknowledgments xix CHAPTER 1 Introduction to Parallel Computing 1 1.1 Motivating Parallelism 2 1.1.1 The Computational Power Argument from Transistors to FLOPS 2 1.1.2 The Memory/Disk Speed

More information

A Case for Two-Level Distributed Recovery Schemes. Nitin H. Vaidya. reduce the average performance overhead.

A Case for Two-Level Distributed Recovery Schemes. Nitin H. Vaidya.   reduce the average performance overhead. A Case for Two-Level Distributed Recovery Schemes Nitin H. Vaidya Department of Computer Science Texas A&M University College Station, TX 77843-31, U.S.A. E-mail: vaidya@cs.tamu.edu Abstract Most distributed

More information

EC 513 Computer Architecture

EC 513 Computer Architecture EC 513 Computer Architecture Cache Coherence - Directory Cache Coherence Prof. Michel A. Kinsy Shared Memory Multiprocessor Processor Cores Local Memories Memory Bus P 1 Snoopy Cache Physical Memory P

More information

CSCI 4717 Computer Architecture

CSCI 4717 Computer Architecture CSCI 4717/5717 Computer Architecture Topic: Symmetric Multiprocessors & Clusters Reading: Stallings, Sections 18.1 through 18.4 Classifications of Parallel Processing M. Flynn classified types of parallel

More information

This paper describes and evaluates the Dual Reinforcement Q-Routing algorithm (DRQ-Routing)

This paper describes and evaluates the Dual Reinforcement Q-Routing algorithm (DRQ-Routing) DUAL REINFORCEMENT Q-ROUTING: AN ON-LINE ADAPTIVE ROUTING ALGORITHM 1 Shailesh Kumar Risto Miikkulainen The University oftexas at Austin The University oftexas at Austin Dept. of Elec. and Comp. Engg.

More information